Wednesday, August 3, 2011

Challenges of Big Data

After my last post, I received a few comments regarding the actual nature of the problem. Prima facie, it looks as if the expansion of data is the challenge. But that is not the entire story. Big Data is not only about the growth of data; it is also about discovering hidden [undiscovered] relations among the data, and finding those relations in real time. So the challenge is not only storage; it is also the analysis of the data and the throughput of the analytics engine. That, however, takes us more into the domain of conventional data warehousing products.
Ovum says, "Combining traditional structured transactional information with unstructured interaction data generated by humans and the Internet (customer records, social media) and, increasingly, machines (sensor data, call detail records) is clearly the sweet spot. These types of interaction data have traditionally been difficult to access or process using conventional BI systems. The appeal of adding these new data types is to allow enterprises to achieve a more complete view of customers, with new insights into relationships and behaviors from social media data." Ovum is referring to unstructured data as the new challenge for traditional data warehousing software. But before we dive into the realm of unstructured data, let's take a quick look at the industry's response to the challenge of big structured data.
A new challenge, as always, translates into new business opportunity. All the existing data warehouse software vendors rose to the challenge with new solutions, either in the form of an enhanced product or an acquired one. Oracle launched Exadata, IBM acquired Netezza, and EMC acquired Greenplum. In its 2011 Magic Quadrant report [link and diagram available thanks to Teradata], Gartner placed Teradata, IBM, EMC and SAP/Sybase in the leaders quadrant, with Aster Data and ParAccel in the visionaries quadrant [Note: Magic Quadrant is a Gartner concept]. Incidentally, Aster Data was acquired by Teradata in March this year.
'Big Data' Appliances
To address the need for high throughput and low latency, the Netezza and Teradata appliances both use a proprietary hardware design [an MPP-based architecture] for speed, throughput and scale. To understand what we are talking about, let me quote Wikipedia:
Teradata provides:
  • Complex ad hoc queries with up to 256 joins.
  • Parallel efficiency, such that the effort of creating 100 records is the same as that of creating 100,000 records.
  • Scalability, so that increasing the number of processors of an existing system linearly increases its performance.
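The parallel efficiency and linear scalability in that list come from hash-distributing rows across independent processing units. Here is a minimal Python sketch of the idea; the worker count, MD5 hashing and key format are illustrative assumptions, not Teradata's actual implementation:

import hashlib

NUM_WORKERS = 8  # hypothetical count of parallel units ("AMPs" in Teradata terms)

def worker_for_row(primary_key: str, num_workers: int = NUM_WORKERS) -> int:
    """Hash the row's primary key to pick a worker.

    Each row maps to a worker independently, so loading 100,000 rows is
    simply 1,000 times the work of loading 100 -- the per-row cost stays
    flat, which is the "parallel efficiency" claim in the list above.
    """
    digest = hashlib.md5(primary_key.encode("utf-8")).hexdigest()
    return int(digest, 16) % num_workers

# Distribute a batch of rows; each bucket can be loaded and scanned by
# its worker independently, so adding workers scales nearly linearly.
rows = [f"customer-{i}" for i in range(100_000)]
buckets = {w: 0 for w in range(NUM_WORKERS)}
for key in rows:
    buckets[worker_for_row(key)] += 1

print("rows per worker:", sorted(buckets.values()))  # roughly even split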
While Teradata uses EMC disk arrays for storage, Netezza uses native disks. The EMC storage controller [for Teradata] and Netezza's own FPGA units provide the intelligence needed to manage the formidable gap between disk I/O speed and CPU processing speed. Both Netezza and Teradata have separate product lines focusing on higher performance as opposed to higher capacity. For example, the Netezza High Capacity Appliance provides as much as 10 petabytes of storage and four times the data density of the Netezza TwinFin, while the TwinFin offers 35% higher processing throughput than its sibling.
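To make the idea of putting intelligence between disk and CPU concrete, here is a small Python sketch of filtering rows as they stream off the disk; the file name and record layout are invented for illustration, and the real appliances do this in dedicated hardware rather than application code:

def scan_with_pushdown(path, predicate):
    """Filter rows while they stream off the disk, so rows that fail the
    predicate are discarded before they ever occupy CPU-side memory.
    This mimics, in software, what Netezza's FPGAs do in hardware."""
    with open(path, encoding="utf-8") as f:
        for line in f:
            record = line.rstrip("\n").split(",")
            if predicate(record):
                yield record

# Hypothetical call-detail-record file laid out as: caller,callee,duration_secs
long_calls = scan_with_pushdown("cdr.csv", lambda rec: int(rec[2]) > 600)
print(sum(1 for _ in long_calls), "calls longer than 10 minutes")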
EMC Greenplum is at its core a software product, but it is also sold as an appliance whose hardware likewise uses an MPP-based, scalable architecture for faster data processing. With the new-found enterprise-readiness of solid-state devices, many vendors have introduced solid-state storage into their appliances to boost random I/O performance.
Essentially, all these products are targeted at processing structured data such as call detail records, financial trading data, log files and other forms of machine-generated information. They use a traditional RDBMS for storing and manipulating the logical records. These products are fast and accurate, but they are also expensive [Netezza's lowest cost per terabyte hovers around $2,500].
However, industry veterans tell us that most of the data that is expanding so enormously today is unstructured. It is not machine generated; it is mostly consumer data [internet-based]: images, video files, text. Analyzing such data requires different algorithms and a different architecture; a single correlation may require sifting through anywhere from a small to a truly large amount of data. It is this data that presents the real challenge, and the real opportunity, for Big Data analytics.
Quoting Jim Baum, the CEO of Netezza: "I get excited about what we see happening in 'Big Data'. Most of the applications I have seen so far are focused on consumer facing businesses and traditional enterprises leveraging structured data assets. Data warehouses supporting reporting, ETL jobs running in Hadoop, “machine to machine” analytics making decisions in real time. We have just scratched the surface and the opportunities to go further are immense." And he is bang on target.
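To give a flavor of the map/reduce style of processing that Baum mentions, here is a toy word count in Python; the two-phase structure follows the Hadoop programming model, but the corpus and function names are invented for illustration:

from collections import Counter
from itertools import chain

def map_phase(document):
    """Map step: emit a (token, 1) pair for every word in a document."""
    for token in document.lower().split():
        yield token, 1

def reduce_phase(pairs):
    """Reduce step: sum the counts emitted for each token."""
    counts = Counter()
    for token, n in pairs:
        counts[token] += n
    return counts

# A toy corpus standing in for tweets, reviews or other consumer text;
# Hadoop runs these same two phases spread across many machines.
corpus = ["big data is not only big", "big data is also about speed"]
word_counts = reduce_phase(chain.from_iterable(map_phase(d) for d in corpus))
print(word_counts.most_common(3))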
In the next post, we will review some of the existing solutions for unstructured Big Data.
