Thursday, August 11, 2011

Creating storage for big unstructured data: the path to NoSQL

I mentioned in my last post that the next post would be on infrastructure for large unstructured data. But rather than jumping straight into the available infrastructure, I felt a better approach would be to start with a clean slate and then pick what we need.
To keep it simple, let us focus on the primary dimensions as we know them: storage capacity and the processing capacity of the data analytics engine.
In the last post we saw how commercial products from the big houses tried to provide a unified solution for structured data. From the vendors' point of view the solution helps the customer, since IT managers are traditionally used to thinking of capacity building in terms of adding more devices to the pool. Just as they would add a new NAS box to increase storage capacity in an existing NAS setup, they can partition the analytics data and keep adding Big Data appliances to meet the added capacity need in each partition. This works because most structured data, even when it becomes big, follows a pattern that is not very different from when it was small. But with unstructured data it is difficult to forecast the analytics pattern, and it could well be that while the data source and repository are one, multiple analytics engines are needed to address different analytics needs.
So if we are to build a scalable solution, it makes sense to look at what we need to build a large data store that can be scaled at linear cost, and then address how we can adapt our analytics engine to that store.

The 'unstructured' property of the data can be an advantage!

Unstructured means that no implicit relation, of the kind one could map to tables and so on, can be expected of the data. If we force a relational construct on the data, we create artificial constraints that become a hindrance later: imposing a particular relational structure implicitly removes the other possible relations from the data design, and if the analytics engine later has to find those relations, it first has to deconstruct the imposed structure, which is a kind of double taxation.

Base model

One option is to preserve the input data structure in the data store design, i.e. if the input data comes as a stream of text, keep it that way and postpone all structural correlation to the next stage. Let this be our base design: one single table [no relations], with one row corresponding to one frame of the input data stream. The content of a row can range from a few bytes to many megabytes or gigabytes depending on the content type and the density of the information [a profile record, for example, can be just a name and an id, or it can contain a full range of personal details]. A single compute/store unit may hold only a few gigabytes or multiple terabytes depending on its capacity. To accommodate data growth, let us assume that we have at our disposal multiple compute units connected over a shared network, and some mechanism through which each compute unit gets to know which unit stores which chunk of data. Now all we need is one layer of software that translates an input request into requests for the different data sets residing on different compute units, routes the request to those specific units, collates the results and converts them into a single response. Assuming we have enough network bandwidth, the model can scale to a large number of units. One obvious problem with this design is that retrieval and search would be tedious and could take unpredictable time if we insist on accuracy.
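As a minimal sketch of this base design, assuming nothing more than a hash-based placement rule and a scatter/gather query path (all class and function names here are made up for illustration):

```python
import hashlib
from typing import Callable, Dict, List


class ComputeUnit:
    """One compute/store unit: holds raw rows keyed by a row id."""

    def __init__(self, name: str):
        self.name = name
        self.rows: Dict[str, bytes] = {}

    def put(self, row_id: str, payload: bytes) -> None:
        self.rows[row_id] = payload

    def scan(self, predicate: Callable[[bytes], bool]) -> List[bytes]:
        # brute-force scan of everything this unit holds
        return [p for p in self.rows.values() if predicate(p)]


class RoutingLayer:
    """Knows which unit stores which chunk, fans a query out to all units
    and collates the answers into a single response."""

    def __init__(self, units: List[ComputeUnit]):
        self.units = units

    def _unit_for(self, row_id: str) -> ComputeUnit:
        h = int(hashlib.md5(row_id.encode()).hexdigest(), 16)
        return self.units[h % len(self.units)]

    def put(self, row_id: str, payload: bytes) -> None:
        self._unit_for(row_id).put(row_id, payload)

    def query(self, predicate: Callable[[bytes], bool]) -> List[bytes]:
        results: List[bytes] = []
        for unit in self.units:          # scatter the request to every unit
            results.extend(unit.scan(predicate))
        return results                   # gather and collate


store = RoutingLayer([ComputeUnit(f"unit-{i}") for i in range(4)])
store.put("profile:42", b"name=alice;city=pune")
hits = store.query(lambda payload: b"alice" in payload)
```

The brute-force `scan` is exactly where the unpredictable retrieval time shows up: every query has to touch every unit.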
A second issue is that a compute server can be down at times. If we are to ensure availability of the data across such unit failures, we have to keep some redundant copies of the data in the network. That does add a challenge with respect to keeping the entire data set consistent at all times, but we will come back to that aspect later.
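Redundant copies could be sketched as a small extension of the same idea: place each row on, say, two consecutive units of the ring, so a read can still be served when one of them is down (reusing the hypothetical `ComputeUnit` from the sketch above):

```python
import hashlib
from typing import List


def replica_units(row_id: str, units: List, copies: int = 2) -> List:
    """Pick `copies` consecutive units that should each hold a copy of this row."""
    start = int(hashlib.md5(row_id.encode()).hexdigest(), 16) % len(units)
    return [units[(start + i) % len(units)] for i in range(copies)]


def put_replicated(row_id: str, payload: bytes, units: List, copies: int = 2) -> None:
    for unit in replica_units(row_id, units, copies):
        unit.put(row_id, payload)   # each copy lands on a different unit
```

Keeping those copies identical at all times is precisely the consistency problem we return to below.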
Improving on the basic design
One way of addressing the unpredictability of the retrieval function would be to find out the primary search/query patterns and create a metadata store that makes the search predictable. That way we are not imposing a structure on the data store, but adding a little extra information, derived by processing the data itself, so that the search output can be consistent. To illustrate, consider a data store for all web content inside an organization. If we find that the most frequent queries are based on, say, a hundred words, we can build an index that maps each of these hundred words to all matching web content. When the next search comes on any of those words, retrieval will be a lot faster.
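A minimal sketch of such a metadata store, assuming the hot query words are known up front (the word list and names are invented for illustration):

```python
from collections import defaultdict
from typing import Dict, List, Set

# the ~hundred most-queried words, discovered by analysing the query log
TRACKED_WORDS: Set[str] = {"invoice", "policy", "outage"}


class MetadataIndex:
    """Inverted index from each tracked word to the ids of matching documents."""

    def __init__(self) -> None:
        self.postings: Dict[str, List[str]] = defaultdict(list)

    def add_document(self, doc_id: str, text: str) -> None:
        for word in set(text.lower().split()) & TRACKED_WORDS:
            self.postings[word].append(doc_id)

    def lookup(self, word: str) -> List[str]:
        # a dictionary hit instead of scanning every row on every unit
        return self.postings.get(word.lower(), [])


index = MetadataIndex()
index.add_document("wiki/17", "network outage report for the pune office")
print(index.lookup("outage"))   # ['wiki/17']
```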
Addressing Modification/update [ACID and CAP]
This looks fine as long as we assume that data always gets added in a consistent and predictable manner, so that all updates and modifications are separately recognizable and traceable. However, with a multiplicity of data sources, the asynchronous nature of updates and the unstructured-ness of the data, that assumption does not hold much water. And there comes the next problem.
One advantage that an RDBMS provides is inherent support for the ACID requirement. ACID refers to the ability to preserve Atomicity, Consistency, Isolation and Durability of all database transactions. How do we support this requirement in our design? We can trivially support it if we serialize all transactions [see figure], which means transaction B does not get started until transaction A is completely committed to the database. Now what happens if the connection to the compute unit that holds a particular piece of data fails in between? All requests wait indefinitely for the transaction to complete, which basically means the system becomes unavailable. That brings us to another interesting aspect of distributed data stores. Brewer's CAP theorem tells us that no distributed system can guarantee Consistency of the data, Availability of the system and tolerance to Partition of the store [across the network] all together; a system can only guarantee two of the three at a time. This page provides a more elaborate explanation.
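A toy illustration of the serialized approach, and of why a single unreachable unit stalls everything behind it (the `reachable` flag and the names are hypothetical):

```python
import threading
import time


class SerializedStore:
    """Runs every transaction behind one global lock, one after another."""

    def __init__(self, units: dict):
        self.units = units            # unit_id -> object with .reachable and .put()
        self._lock = threading.Lock()

    def transact(self, unit_id: str, row_id: str, payload: bytes) -> None:
        with self._lock:                    # transaction B waits here until A commits
            unit = self.units[unit_id]
            while not unit.reachable:       # if the unit's link is down, A never commits...
                time.sleep(1)               # ...and every queued transaction waits with it
            unit.put(row_id, payload)       # consistency preserved, availability lost
```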
So that we do not get confused: CAP describes the property of a distributed system as a whole, while the ACID requirement applies specifically to database transactions. To draw a broad correlation, Consistency in CAP roughly corresponds to the Atomicity and Isolation of transactions in ACID, while the A and P properties have no direct counterpart in ACID.
Daniel Abadi, an assistant professor at Yale, has explained the issue with a lot more rigour here. He brings in Latency as another dimension alongside Consistency and argues that if the data store is partitioned [i.e. maintained in two different data centres], the system chooses either Consistency or Availability, and if it prioritises Availability over Consistency, it also chooses Latency over Consistency. Examples he cites of this type of system are Cassandra and Amazon's Dynamo.
The other type of system is the fully ACID-compliant [traditional RDBMS] one, and he shows that this kind of data store makes consistency paramount and in turn compromises on the availability and latency factors. This is intuitive: if the data store is divided into partitions and each partition keeps one replica, then when the network between the two data centres breaks down, a database that chooses consistency [as in banking transactions] will make the system unavailable until both sides are up again, otherwise the two replicas would soon become incongruent with each other, rendering the database inconsistent overall.
But if availability (and therefore latency) is prioritized, the system allows updates to continue even if one partition fails, thereby making the database relatively inconsistent for that time. In this case the responsibility for maintaining consistency of the data gets transferred to the application accessing it [pushing some work downstream]. One advantage is that it makes the data store design simpler.
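The two choices can be caricatured in a few lines (again a hypothetical sketch, not any particular product's API):

```python
from typing import List


def write(replicas: List, row_id: str, payload: bytes, prefer_availability: bool = True) -> int:
    """Toy write path over replicas, each an object with .reachable and .put().

    Consistency-first refuses the write unless every replica can acknowledge it;
    availability-first accepts it on whatever is reachable and leaves the stragglers
    to be reconciled later by the application or a background process."""
    reachable = [r for r in replicas if r.reachable]
    if not prefer_availability and len(reachable) < len(replicas):
        raise RuntimeError("write rejected: not all replicas reachable, staying consistent")
    for r in reachable:
        r.put(row_id, payload)    # accepted now, temporarily inconsistent across replicas
    return len(reachable)
```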
Eventually Consistent Database
The concept of the eventually consistent database is also attributed to Eric Brewer. He described consistency as a range. Strict consistency is what an RDBMS provides. Weak consistency means the system allows a window of inconsistency during which the most recent update will not be seen by all clients; some conditions must be met before the data reaches a fully consistent state.
Eventual consistency is a special case of weak consistency. Quoting Werner Vogels, "the storage system guarantees that if no new updates are made to the object, eventually all accesses will return the last updated value. If no failures occur, the maximum size of the inconsistency window can be determined based on factors such as communication delays, the load on the system, and the number of replicas involved in the replication scheme."
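A minimal sketch of what 'eventually all accesses will return the last updated value' can look like in practice, using last-write-wins timestamps and a background anti-entropy pass (all names hypothetical):

```python
import time
from typing import Dict, List, Optional, Tuple


class Replica:
    """Each replica stores (timestamp, value); newer timestamps win on reconciliation."""

    def __init__(self) -> None:
        self.data: Dict[str, Tuple[float, str]] = {}

    def put(self, key: str, value: str, ts: Optional[float] = None) -> None:
        ts = ts if ts is not None else time.time()
        current = self.data.get(key)
        if current is None or ts > current[0]:
            self.data[key] = (ts, value)


def anti_entropy(replicas: List[Replica]) -> None:
    """Background sync: once it runs and no new updates arrive, every replica
    returns the last updated value, i.e. the store has become consistent."""
    for key in {k for r in replicas for k in r.data}:
        ts, value = max(r.data[key] for r in replicas if key in r.data)
        for r in replicas:
            r.put(key, value, ts)


# one replica misses an update; the sync pass closes the inconsistency window
a, b = Replica(), Replica()
a.put("profile:42", "city=pune")
anti_entropy([a, b])
assert b.data["profile:42"][1] == "city=pune"
```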
One successful system that uses this consistency model is DNS. Even though a DNS name update may not reach all DNS nodes as soon as it occurs, the protocol ensures that the update reaches every node eventually, once sufficient time has elapsed. Vogels has elaborated on different aspects and variations of eventual consistency in his paper [see reference], which are crucial factors when one designs a data store.
Why Eventual Consistency
The reason eventual consistency is important is that a lot of data usage does not need strict consistency. Adopting the eventual consistency model for such data opens up an opportunity to build cheap, distributed, greatly scalable, still reliable and much faster databases. Here we enter the realm of the so-called NoSQL databases.
NoSQL Movement
The NoSQL movement started with web 2.0 based startups, when they decided to build their own data stores because the type of data they were interested in did not fit the relational model [page ranking of web pages or Facebook content]. Yahoo brought up Hadoop, Google brought up BigTable and Amazon brought up Dynamo; Facebook later developed Cassandra. Hadoop and Cassandra became open-source Apache projects after Yahoo and Facebook forked their software. Now of course there are many other NoSQL alternatives such as MongoDB and HBase [an open-source version of BigTable]. Hadoop incidentally has many adopters, even among established storage players, as shown in the table below.
Hadoop Ecosystem [borrowed table]

Reference:

Wednesday, August 3, 2011

Challenges of Big Data

After my last post, I received a few comments regarding the actual nature of the problem. Prima facie it looks as if the expansion of the data is the challenge, but that is not the entire story. Big Data is not only about the expansion of data; it is also about finding the hidden [undiscovered] relations among the data, and finding those relations in real time. So the challenge is not only about storage; it is also about analysis of the data and the throughput of the analytics engine. That, however, takes it more into the domain of conventional data warehousing products.
Ovum says, "Combining traditional structured transactional information with unstructured interaction data generated by humans and the Internet (customer records, social media) and, increasingly, machines (sensor data, call detail records) is clearly the sweet spot. These types of interaction data have traditionally been difficult to access or process using conventional BI systems. The appeal of adding these new data types is to allow enterprises to achieve a more complete view of customers, with new insights into relationships and behaviors from social media data." Ovum is referring to unstructured data as the new challenge for traditional data warehousing software. But before we dive into the realm of unstructured data, let's take a quick look at the industry's response to the challenge of big structured data.
A new challenge, as always, translates into a new business opportunity. All the existing data warehouse software vendors rose to the challenge with new solutions, either in the form of an enhanced product or an acquired one: Oracle launched Exadata, IBM acquired Netezza and EMC acquired Greenplum. In its 2011 Magic Quadrant report [link and diagram available thanks to Teradata], Gartner placed Teradata, IBM, EMC and SAP/Sybase in the leaders quadrant, with Aster Data and ParAccel in the visionaries quadrant [note: Magic Quadrant is a Gartner concept]. Incidentally, Aster Data was acquired by Teradata in March this year.
'Big Data' Appliances
To address the need for high throughput and low latency, the Netezza and Teradata appliances both use proprietary hardware designs [MPP-based architectures] for speed, throughput and scale. To understand what we are talking about, let me quote the Wiki:
Teradata provides:
  • Complex ad hoc queries with up to 256 joins.
  • Parallel efficiency, such that the effort for creating 100 records is the same as that for creating 100,000 records.
  • Scalability, so that increasing the number of processors of an existing system linearly increases the performance.
While Teradata uses EMC disk arrays for storage, Netezza uses native disks. The EMC storage controller [for Teradata] and Netezza's own FPGA units provide the necessary intelligence to manage the formidable lag between disk I/O speed and CPU processing speed. Both Netezza and Teradata have separate product lines focusing on higher performance as opposed to higher capacity. For example, the Netezza High Capacity Appliance provides as much as 10 petabytes of storage capacity and 4 times the data density of the Netezza TwinFin, while the TwinFin offers 35% higher processing throughput than its sibling.
EMC Greenplum at its core is a software product, but it is also sold as an appliance whose hardware likewise has an MPP-based scalable architecture for faster data processing. With the new-found enterprise readiness of solid-state devices, many vendors have introduced solid-state storage in their appliances to boost random I/O performance.
Essentially all these products are targeted at processing structured data such as call data records, financial trading data, log files and other forms of machine-generated information. They use a traditional RDBMS for storing and manipulating the logical records. These products are fast and accurate, but also expensive [Netezza's lowest cost per terabyte hovers around $2,500].
However, industry veterans tell us that most of the data that is expanding enormously today is unstructured. It is not machine-generated; it is mostly consumer data [internet-based]: images, video files, text. Analyzing this data requires different algorithms and a different architecture, and a single correlation may require sifting through anywhere from a small to a really large amount of data. It is this data that presents the real challenge, and the real opportunity, for Big Data analytics.
Quoting Jim Baum, the CEO of Netezza: "I get excited about what we see happening in 'Big Data'. Most of the applications I have seen so far are focused on consumer facing businesses and traditional enterprises leveraging structured data assets. Data warehouses supporting reporting, ETL jobs running in Hadoop, “machine to machine” analytics making decisions in real time. We have just scratched the surface and the opportunities to go further are immense." And he is bang on target.
In the next post we will review some of the existing solutions for unstructured Big Data.