Thursday, July 28, 2011

Big Digital Data

Of late, the term Big Data has been appearing often in news posts and articles. The knowledge industry loves new jargon. Perhaps it adds a touch of mystery and solidity to the knowledge it refers to, which might otherwise be lost in the zillion bytes that stream through our awareness every day. In a sense, jargon creates an identity that people love to be identified with.
So, what is Big Data exactly?
In simple language, any data repository fed by a large data stream, such as GPS readings, atmospheric pressure sensor data or Twitter posts, can qualify as Big Data. Such repositories are large and grow incrementally, and applications require access to the entire database.
IBM characterizes it as follows:
Big data spans three dimensions: Variety, Velocity and Volume.
Variety – Big data extends beyond structured data, including unstructured data of all varieties: text, audio, video, click streams, log files and more.
Velocity – Often time-sensitive, big data must be used as it streams into the enterprise in order to maximize its value to the business.
Volume – Big data comes in one size: large. Enterprises are awash with data, easily amassing terabytes and even petabytes of information.
In somewhat crude but simpler terms, the data must be online [that is, offline backup data does not qualify] and the daily incremental addition should be on the order of hundreds of gigabytes or more for it to qualify as Big Data.
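As a rough illustration of that scale, here is a back-of-envelope sketch; the event rate and record size below are assumed figures for the example, not measurements from any particular system.

```python
# Back-of-envelope estimate of daily growth for a streaming data source.
# The event rate and record size below are illustrative assumptions only.

EVENTS_PER_SECOND = 50_000   # assumed ingest rate (sensor readings, posts, ...)
BYTES_PER_EVENT = 250        # assumed average record size in bytes
SECONDS_PER_DAY = 24 * 60 * 60

daily_bytes = EVENTS_PER_SECOND * BYTES_PER_EVENT * SECONDS_PER_DAY
print(f"Approximate daily growth: {daily_bytes / 1e9:,.0f} GB/day")
# With these assumptions the store grows by roughly a terabyte a day,
# i.e. well past the "hundreds of gigabytes" threshold mentioned above.
```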
So, why is this particularly important?
The challenge of storing such large data and providing online access to it, much of which may be random, makes this a non-trivial problem. A non-trivial problem requires a non-trivial solution, and so it becomes a new opportunity for data storage solution providers. Undoubtedly the problem itself is not new: technology for structuring and storing large databases has existed for more than a decade, and in fact all the existing RDBMS providers have technology for partitioning large data sets and an engine to retrieve the information efficiently. One issue with an RDB [Relational DB] is that when tables become huge [on the order of petabytes or more], a join becomes a very costly operation, and if data retrieval requires multiple joins, performance can degrade quite heavily. That is where the new-generation noSQL databases come into the picture. But before we go there, perhaps some idea of the applications that need Big Data would help.
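First, though, a quick sketch of the partitioning idea mentioned above. This is a minimal illustration rather than the scheme any particular database uses; the shard count, hash function and record keys are assumptions made up for the example.

```python
# Minimal sketch of hash partitioning: each row is assigned to a shard by
# hashing its key, so no single node has to hold the whole table.
# The shard count and record keys are illustrative assumptions.

import hashlib

NUM_SHARDS = 8

def shard_for(key: str) -> int:
    """Deterministically map a record key to one of NUM_SHARDS shards."""
    digest = hashlib.md5(key.encode("utf-8")).hexdigest()
    return int(digest, 16) % NUM_SHARDS

shards = {i: [] for i in range(NUM_SHARDS)}
for key in ("device-001", "device-002", "user-42", "user-43"):
    shards[shard_for(key)].append(key)

print({i: keys for i, keys in shards.items() if keys})
# A lookup by key touches exactly one shard; a join across two tables
# partitioned this way may have to move rows between shards, which is
# why multi-join queries degrade so badly at petabyte scale.
```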
Most of the applications for large data are data analytics. Analytics is not a new branch; so far, though, most analytics applications have assumed a structured data set. For example, consumer credit card transactions can be considered big structured data, and many commercially useful analyses can be done by profiling either the customers or the products. In other words, data analytics is a large business, and it is a no-brainer that the larger the input data, the better the accuracy of the analysis. Data analytics therefore needs large data, and the cheaper the cost of maintaining that data, the better it is for the business. It is also important to note that the rate at which data gets generated has multiplied in the last few years [McKinsey puts data growth at around 40% per year], again adding to the Big Data problem.
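To make the profiling idea concrete, here is a minimal sketch that totals spend per customer from transaction records; the field names and sample rows are invented for illustration, not taken from any real data set.

```python
# Minimal customer-profiling sketch: total spend per customer computed
# from a stream of transaction records. Field names and values are made up.

from collections import defaultdict

transactions = [
    {"customer_id": "C1", "merchant": "grocery", "amount": 42.50},
    {"customer_id": "C2", "merchant": "fuel",    "amount": 30.00},
    {"customer_id": "C1", "merchant": "travel",  "amount": 310.00},
]

spend_by_customer = defaultdict(float)
for tx in transactions:
    spend_by_customer[tx["customer_id"]] += tx["amount"]

print(dict(spend_by_customer))   # {'C1': 352.5, 'C2': 30.0}
# At Big Data scale the same aggregation runs over billions of such rows,
# which is what pushes it onto distributed storage and compute.
```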
So the question really boils down to this: what technology can we use to create relatively cheap but scalable computing and storage infrastructure that enables reasonably fast access to the data for the analytics engine?
While the existing relational DB-based model does provide a proven solution for large structured data, its cost/performance for large distributed unstructured data does not scale that well. And here, new-age Web 2.0 technologies are offering a viable alternative to the traditional [read: relational] database models.
The other aspect of the story is that most consumers of large-data analysis are trying to use cloud storage as the storage infrastructure for their analytics data, since it brings down their CapEx. However, using a noSQL DB is not without issues. Relational DBs are popular for many reasons; two main ones are:
1. relational data modelling helps reduce redundancy in the data [normalization ensures 3NF], and
2. SQL provides an easy, consistent and standardized data access interface.
With noSQL, both of these benefits are forfeited. The data structure is driven by what data the users access and how they access it, and the data access interface is designed around that need [in other words, no SQL support]; the sketch below illustrates this access-pattern-driven design.
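Here is a minimal sketch of that trade-off, contrasting a normalized layout with a pre-joined, key-value style document; the keys, field names and values are hypothetical and only for illustration.

```python
# Minimal sketch of access-pattern-driven (denormalized) design.
# Instead of normalized tables joined at query time, the record is stored
# pre-joined under the key the application reads by. Keys and fields are
# hypothetical.

# Relational view: two normalized tables, joined on customer_id by SQL.
customers = {"C1": {"name": "Asha"}}
orders = [{"order_id": "O9", "customer_id": "C1", "total": 120.0}]

# noSQL-style view: one document per access pattern, so the question
# "show this customer with their orders" is answered by a single lookup.
customer_with_orders = {
    "customer:C1": {
        "name": "Asha",
        "orders": [{"order_id": "O9", "total": 120.0}],
    }
}

print(customer_with_orders["customer:C1"])
# The redundancy that normalization removed can come back (order data may
# be copied into several documents), and there is no SQL to query it; both
# trade-offs are accepted in exchange for cheap, join-free reads at scale.
```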
In the next post we will review some of the existing large-data infrastructure solutions.
Further Reading
McKinsey Report on opportunities and challenges of Big Data [May 2011]