This post is the first in the promised series of NewSQL DB reviews, and we will start with Clustrix. Clustrix is interesting for two reasons: 1. its architect and some of its principal designers came from the NAS space (as opposed to database technology) and 2. it appears to be doing the Isilon equivalent in the database space.
Scale Out
Scale-out as a concept became popular with the massive growth of data centres. Essentially, scaling out means scaling horizontally, i.e. adding more compute resources to the infrastructure pool to take care of growing demand: if one server serves n users, then to serve m*n users you simply add m similar servers. In contrast, scale-up means scaling vertically, i.e. replacing your server with a more powerful one when the number of users increases.
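As a back-of-the-envelope illustration of the difference (the numbers below are made up), scale-out capacity planning is just counting identical servers, whereas scale-up would mean replacing the box with a proportionally bigger one:

```python
import math

def servers_needed(total_users, users_per_server):
    """Scale-out: cover the load by adding identical servers."""
    return math.ceil(total_users / users_per_server)

# One server comfortably serves n = 10,000 users. To serve m*n = 50,000 users
# we simply add servers of the same size (scale-out) instead of swapping in a
# single machine that is 5x more powerful (scale-up).
print(servers_needed(50_000, 10_000))  # -> 5
```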
Scale Out NAS
So scaling out is simpler, assuming one has enough rack space in the data centre. During the last decade, scale-out NAS became a strong market driver and many innovation engines took off that promised faster, simpler and cheaper scale-out. The huge growth of unstructured data in enterprise data centres pushed the scale-up model to its limit, with each hardware replacement getting more and more expensive. Additionally, traditional NAS comes with certain scaling limitations, such as:
1. performance is limited by a single head [multicore CPUs alleviated the issue only a little]
2. limited namespace [typically a file-system limitation imposed so that performance does not degrade unacceptably with a large number of files]
3. migration of data between two NAS namespaces is costly.
3PAR, Spinnaker and Isilon were a few of the innovation leaders in the scale-out NAS space. After NetApp bought Spinnaker, Isilon occupied this niche almost entirely until EMC bought it.
What did Isilon bring technologically?
Isilon's OneFS clustered file system is the core of its technology. The OneFS software is designed with file-striping across every node in a cluster, a fully distributed lock manager, caching, fully distributed metadata, and a remote block manager to maintain global coherency and synchronization across the cluster. That enables Isilon to support a massive 15PB of file storage in a single file system, which is why EMC positions Isilon NAS as its Big Data solution for file storage.
Scale-out Database Appliance
Experience building that distributed lock manager has some relevance to databases, and it gave the Isilon engineers confidence when they started designing Clustrix. Concurrency management is one of the crucial challenges in designing a clustered database. Clustrix claims to use MVCC [Multi-Version Concurrency Control] with a shared-nothing architecture, but the most important part of its innovation is a distributed query execution model called the 'Sierra Distributed Engine', true to the philosophy of "bring the query to the data, not the data to the query".
Sierra’s most basic primitive is a compiled program called a query fragment. They can read, insert, update a container, execute functions, modify control flow, format rows, perform synchronization, and send rows to query fragments on other nodes. These query fragments are run on the nodes that contain the data. The communication between the nodes consists of just the intermediate and result rows needed for the queries. Many, many query fragments can operate simultaneously across the cluster. Those query fragments may be different components of the same query or parts of different queries.
Queries enter the system through the front-end network and are translated to an internal representation used by Sierra. Sierra then executes the queries in an efficient parallel fashion. Sierra uses SSDs to store the data, NVRAM to journal changes, and Infiniband to communicate with other nodes in the cluster. [Clustrix whitepaper]
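To make the 'bring the query to the data' idea concrete, here is a minimal, hypothetical sketch (my own illustration, not Sierra's actual code or API): each node runs a query fragment locally against the rows it owns, and only the small partial results cross the network back to the coordinator.

```python
# Hypothetical sketch of distributed query execution: the work is shipped to
# the nodes that hold the data; only intermediate results travel the network.

NODES = {
    "node1": [{"id": 1, "amount": 120}, {"id": 2, "amount": 45}],
    "node2": [{"id": 3, "amount": 300}, {"id": 4, "amount": 80}],
}

def query_fragment(rows, min_amount):
    """Runs on the node that owns `rows`: filter and pre-aggregate locally."""
    matching = [r for r in rows if r["amount"] >= min_amount]
    return {"count": len(matching), "total": sum(r["amount"] for r in matching)}

def coordinator(min_amount):
    """Combines the partial results sent back by each node."""
    partials = [query_fragment(rows, min_amount) for rows in NODES.values()]
    return {
        "count": sum(p["count"] for p in partials),
        "total": sum(p["total"] for p in partials),
    }

print(coordinator(100))  # {'count': 2, 'total': 420}
```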
Performance
Clustrix claims that transaction throughput per million records in the database scales linearly as nodes are added.
They support MySQL replication, both key-value and SQL queries, and on-the-fly schema changes. The user does not need to worry about sharding or partitioning, since the basic performance bottleneck seen with a traditional RDBMS is removed here. Overall, Clustrix seems to have brought a strong NewSQL product to market.
For more information, take a look at the Clustrix resource page.
Tuesday, February 21, 2012
Friday, February 17, 2012
NewSQL way, neither traditional RDBMS, nor NoSQL
This December, NuoDB [erstwhile NimbusDB] released Beta 5 of its database software. The database, it claims, "is a NewSQL database. It looks and behaves like a traditional SQL database from the outside but under the covers it's a revolutionary database solution. It is a new class of database for a new class of datacenter."
NewSQL database?
It appears that although the NoSQL variant gained popularity with the likes of Google, Yahoo and Facebook, it did not make much of a dent in the RDBMS clientele base. The main reason, the NewSQL advocates say, is that people like SQL: irrespective of scalability and other issues, they decided to stay with it. The fact that SQL has been in the game for the last 20-25 years makes it so entrenched in the business application space that it is almost impossible to take it away. Even if one ignores the large SQL investment, one cannot ignore the value SQL brought to the business community. SQL essentially separated data from the code that protects the data, i.e. the database engine, enabling DBMS packages to be commoditized.
SQL also provided a well-understood and simple data manipulation interface which business applications could use without needing to assimilate the full complexities of computer programming. Business applications would have become far more complex if they had to deal with the ACID requirements of their data in addition to implementing their business logic.
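As a small illustration of how much ACID bookkeeping the database engine absorbs on the application's behalf, here is a minimal sketch using Python's built-in sqlite3 module (the accounts table and the transfer are invented for the example):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE accounts (name TEXT PRIMARY KEY, balance INTEGER)")
conn.execute("INSERT INTO accounts VALUES ('alice', 100), ('bob', 50)")
conn.commit()

try:
    with conn:  # the engine wraps both statements in one atomic transaction
        conn.execute("UPDATE accounts SET balance = balance - 30 WHERE name = 'alice'")
        conn.execute("UPDATE accounts SET balance = balance + 30 WHERE name = 'bob'")
except sqlite3.Error:
    pass  # on failure the engine rolls back; the application writes no recovery logic

print(conn.execute("SELECT name, balance FROM accounts ORDER BY name").fetchall())
# [('alice', 70), ('bob', 80)]
```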
NoSQL stores, however scalable and flexible they are, come with a huge cost: one has to design one's own data manipulation engine, and the larger the scale of the data and the wider the distribution of computing resources, the more intensive that effort becomes. The uniqueness of each application demands a specific design for the engine that manipulates its data, which ties that engine more closely to the business logic of the enterprise. While this approach has its strengths, it is obvious why most businesses whose core operation is not about the data itself have stayed away from the NoSQL movement, however large their data may be.
Promises of NewSQL: SQL for Big Data
NewSQL tries to bridge this gap by keeping the SQL interface intact while reengineering the basic database engine. Evidently this is far more daunting than coming up with a NoSQL alternative: it requires changing a design that has held its ground for almost 30 years. Only those who understand the intricacies of the original design can venture to rearchitect the database engine while keeping its original promises intact. What does that mean? It means the engine must fully support SQL and must guarantee ACID [my previous posts elaborated on how NoSQL addressed this requirement]. Additionally, the engine must 1. support a loosely connected set of computing resources, such as computers connected over the Internet, and 2. scale its performance with the number of computers. The last requirement comes from NoSQL land, where huge numbers of computing resources are connected over the Internet and designed to sift through massive distributed data to answer a single query. Essentially, this engine must be capable of building a distributed database spread over a huge number of affordable [i.e. cheap] computers connected over the Internet and providing a SQL interface to the entire data set. That makes NewSQL truly the database engine for Big Data.
Contenders
Most of the challengers in this space were started by someone who participated in database software development in the early 70's. Take the example of VoltDB, started by Michael Stonebraker, a luminary in database research who architected Ingres [one of the first relational DBs], Postgres and many more.
Similarly, NuoDB boasts Jim Starkey, the person behind DEC's relational DB suites during 1975-85.
The other prominent NewSQL venture, ScaleDB, was started by another database legend, Vern Watts, the architect of IBM's famed IMS.
Then there is JustOneDB, which proclaims itself the relational DB of the 21st century and boasts Duncan Pauly as its CTO.
The list is a long one, but I must mention Clustrix, which boasts CTO Aaron Passey of Isilon fame [Isilon brought a new definition to mainstream storage clustering]. Clustrix appears to have brought the appliance model to the NewSQL database world.
I am sure there are many more to come in this space, and I am sure I have missed a few in this list, but given the emergence of this new technology space we will have to revisit this topic.
I will try to provide a more detailed review of these products in upcoming posts, starting with Clustrix [see the post here].
Summary
Here is a quick comparison between three different DB technologies:
| SQL DB | NoSQL DB | NewSQL DB |
| --- | --- | --- |
| Basic architecture from the 70's relational database model | New [2000s] architecture from the likes of Google and Yahoo; designed for a single large distributed database | Newer [post-2000] architecture that promises to scale for both standalone and large installations |
| Centralized transaction processing | Distributed processing | Distributed processing |
| Fully ACID compliant | Breaks ACID; brings the eventual consistency model | ACID compliant |
| Integrated SQL engine | No support for SQL | Full support for SQL |
| Limited scalability | High scalability | High scalability; tries to break dependency on any single engine |
| Mature technology; at the core of all popular OLTP suites | Relatively mature; better suited to the SaaS model | Still evolving; has the potential to scale for both use cases |
Thursday, September 15, 2011
Big Data Fuelling Storage growth?
A recent IDC report tells us that enterprises are spending on storage again, and it appears that preparing for 'big data' is a major growth driver this time. The boost in storage has come alongside investments in cloud computing and data-centre virtualisation, IDC analyst Liz Conner said. Companies are updating their storage systems for the era of "big data," to deal with huge and growing volumes of information, she said.
While money spent on external storage increased by 12.2% year over year for the second quarter of this year, total capacity grew by more than 47%.
Sales increased across all major product categories, including NAS (network-attached storage) and all types of SANs (storage-area networks). The total market for non-mainframe networked storage systems, including NAS and iSCSI (Internet SCSI) SANs, grew 15.0% from a year earlier to $4.8 billion (£2.96 billion) in revenue, IDC reported. EMC led that market with 31.9% of total revenue, followed by NetApp with a 15.0% share. NAS revenue alone increased 16.9% from a year earlier, and EMC dominated this market with 47.2% of revenue. NetApp came in second at 30.7%.
EMC led non-mainframe SAN market too with a hold of 25.7% of that market, followed by IBM with 16.7% and HP with 13.4%, according to IDC.
[IDC is a division of International Data Group, the parent company of IDG News Service.]
Unfortunately, the report does not elaborate on how big data influences the storage growth.
Is it that enterprises anticipate that their internal data will grow faster and are therefore investing in expanding storage? Or is the growth happening primarily because enterprises are building new storage infrastructure dedicated to 'big data'?
The first scenario is not much different from the decade-old enterprise storage expansion pattern. In the second scenario, enterprises need to think differently: they would essentially be building their own cloud infrastructure. They would need to decide on the distribution of objects/storage elements, which distributed file system to use, how applications will access the data, and so on, and those decisions would drive which storage systems they buy. But given that both NetApp and EMC are leading the growth and are selling their already established SAN and NAS products, the actual scenario most likely remains closer to the first case. In that case it is the expansion of existing NAS and SAN infrastructure that is propelling the storage growth. Should we then talk about 'Big NAS' and 'Big SAN' instead of Big Data?
Friday, September 9, 2011
Some interesting posts on Big Data and noSQL
BusinessWeek reports that Hadoop is becoming the dominant choice for organizations dabbling with Big Data. It cites Walmart, Nokia, GE and BoA, all moving their big data onto Hadoop. Here is the article: http://www.businessweek.com/technology/getting-a-handle-on-big-data-with-hadoop-09072011.html
A couple of posts from Nati Shalom are also interesting. The first is on a big data platform for real-time analytics; he takes the Facebook model and tries to refine it. The post: http://natishalom.typepad.com/nati_shaloms_blog/2011/07/real-time-analytics-for-big-data-an-alternative-approach.html
And the latest, a slightly lengthy one: http://natishalom.typepad.com/nati_shaloms_blog/2011/09/big-data-application-platform.html
And Alex Popescu's blog should not be missed if you are a NoSQL enthusiast. Here is one dig at Digg's Cassandra implementation: http://nosql.mypopescu.com/post/334198583/presentation-cassandra-in-production-digg-arin
And I found a couple of interesting infographics: the first, created by Mozy, shows an interesting comparison among the largest data centres, and the second is a graphical illustration of the growth of big data.
Thursday, August 11, 2011
Creating storage for big unstructured data: path to NoSQL
I mentioned in my last post that the next post would be on infrastructure for large unstructured data. But rather than jumping into the available infrastructure, I felt a better approach would be to start with a clean slate and then pick what we need.
To keep it simple, let us focus on the primary dimensions as we know them: storage, and the processing capacity of the data analytics engine.
In the last post we saw how the commercial products from the big houses tried to provide a unified solution for structured data. From the vendors' point of view the solution helps the customer, since IT managers are traditionally used to thinking of capacity building in terms of adding more devices to the pool. Just as they would add a new NAS box to increase storage capacity in their existing NAS setup, they can partition the analytics data and keep adding Big Data appliances to address the added capacity needed in each partition. This works because most structured data, even when it becomes big, follows a pattern not very different from when it was small. With unstructured data, however, it is difficult to forecast the analytics pattern, and it could be that while the data source and repository are one, there are multiple analytics engines addressing different analytics needs.
So if we are to build a scalable solution, it makes sense to look first at what we need to build a large data store that scales with linear cost, and then address how we can adapt our analytics engine to that store.
'unstructured' property of the data can be an advantage!
Unstructured means no implicit relation can be expected of the data that one could use to map it to tables and so on. If we force a relational construct onto the data, we create artificial constraints that become a hindrance later: imposing a particular relational structure implicitly removes the other possible relations from the data design, and if the analytics engine later has to find those relations, it must first deconstruct the imposed structure, which is a kind of double taxation.
Base model
The second issue is that some compute servers will be down at times. If we are to ensure availability of data across these unit failures, we have to keep redundant copies of the data in the network. That adds the challenge of keeping the entire data set consistent at all times, but we will come back to that aspect later.
Improving on the basic design
One way of addressing the unpredictability of the retrieval function is to find the primary search/query patterns and create a metadata store that makes search predictable. That way we are not imposing structure on the data store but adding a little extra information, derived by processing the data itself, so that search output can be consistent. To illustrate, consider a data store for all web content inside an organization. If we find that the most common queries are based on, say, a hundred words, we can build an index that maps each of those hundred words to all matching web content. If the next search comes on any of those words, retrieval will be a lot faster.
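A minimal sketch of such a metadata store is a simple inverted index: process the data once, map each frequently queried word to the documents containing it, and later searches on those words become direct lookups (the document names and text below are invented):

```python
from collections import defaultdict

documents = {
    "intranet/policy.html": "travel policy and expense claims",
    "wiki/onboarding.html": "onboarding checklist and travel booking",
}

# Build the index once by processing the data itself; no structure is imposed
# on the underlying store.
index = defaultdict(set)
for doc_id, text in documents.items():
    for word in text.lower().split():
        index[word].add(doc_id)

# A later search on an indexed word is a single dictionary lookup.
print(sorted(index["travel"]))
# ['intranet/policy.html', 'wiki/onboarding.html']
```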
Addressing Modification/update [ACID and CAP]
This looks fine as long as we assume that data always gets added in a consistent and predictable manner, so that all updates/modifications are applied in an orderly, conflict-free way.
One advantage that an RDBMS provides is inherent support for the ACID requirement. ACID refers to the ability to preserve Atomicity, Consistency, Isolation and Durability for all database transactions. How do we support this requirement in our design? We can support it trivially if we serialize all transactions [see figure], which means transaction B does not start until transaction A is completely committed to the database. Now what happens if the connection to the compute unit holding a particular piece of data fails in between? All requests wait indefinitely for the transaction to complete, which basically means the system becomes unavailable. That brings us to another interesting aspect of distributed data stores. Brewer's CAP theorem tells us that no distributed system can guarantee Consistency of data, Availability of the system and tolerance to Partition of the store [across the network] all together; a system can only guarantee two of these at a time. This page provides a more elaborate explanation.
To avoid confusion: CAP describes a property of the distributed system as a whole, while the ACID requirement applies specifically to database transactions. As a broad correlation, Consistency in CAP roughly corresponds to Atomicity and Isolation in ACID; the A and P properties have no ACID counterpart.
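Here is a toy sketch of the availability problem described above (entirely illustrative, not any real database's locking code): all transactions are serialized behind one lock, and when the node holding the data becomes unreachable mid-stream, every request queued behind it stalls.

```python
import threading

lock = threading.Lock()              # serializes all transactions
node_available = threading.Event()   # set while the node holding the data is reachable
store = {}

def transaction(key, value, wait=1.0):
    """Transaction B cannot start until transaction A has fully committed."""
    with lock:
        # If the node holding this data is down, the commit cannot finish and
        # everything queued behind this lock waits: the system is unavailable.
        if not node_available.wait(timeout=wait):
            raise TimeoutError("node unreachable; this transaction and the queue behind it stall")
        store[key] = value

node_available.set()
transaction("row1", "v1")            # commits normally

node_available.clear()               # simulate losing the node mid-stream
try:
    transaction("row1", "v2")
except TimeoutError as e:
    print(e)
```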
Daniel Abadi, an assistant professor at Yale, has explained the issue with a lot more rigour here. He brings in latency as another dimension alongside consistency and argues that if the data store is partitioned [i.e. maintained in two different data centres], the system chooses either consistency or availability, and if it prioritises availability over consistency, it also chooses latency over consistency. The examples he cites of this type of system are Cassandra and Amazon's Dynamo.
The other type of system is the fully ACID-compliant [traditional relational] one, and he shows that this type of data store makes consistency paramount and in turn compromises on availability and latency. This is intuitive: if the data store is divided into partitions and each partition keeps a replica, then when the network between the two data centres breaks down, a database that chooses consistency [as in banking transactions] will make the system unavailable until both sides are back up; otherwise the two replicas will soon diverge, rendering the database inconsistent overall.
But if availability (and therefore latency) is prioritized, the system allows updates to continue even if one partition fails, making the database relatively inconsistent for that time. In this case the responsibility for maintaining consistency of the data is transferred to the applications accessing it [pushing some work downstream]. One advantage is that it makes the data store design simpler.
Eventually Consistent Database
The concept of the eventually consistent database is also attributed to Eric Brewer. He described consistency as a range. Strict consistency is what an RDBMS provides. Weak consistency means the system allows a window of inconsistency during which the most recent update will not be seen by all clients; some conditions must be met before the data reaches a fully consistent state.
Eventual Consistency is a special version of weak consistency. Quoting Werner Vogels, "the storage system guarantees that if no new updates are made to the object, eventually all accesses will return the last updated value. If no failures occur, the maximum size of the inconsistency window can be determined based on factors such as communication delays, the load on the system, and the number of replicas involved in the replication scheme."
Why Eventual Consistency
One successful system that uses this consistency model is DNS. Even though a DNS name update may not reach all DNS nodes as soon as it occurs, the protocol ensures that the update eventually reaches all nodes once sufficient time has elapsed. Vogels has elaborated on different aspects and variations of eventual consistency in his paper [see reference], which are crucial factors when one designs a data store.
The reason eventual consistency is important is that a lot of data usage does not need strict consistency. Adopting the eventual consistency model for such data opens up an opportunity to build cheap, distributed, greatly scalable, yet reliable and much faster databases. Here we enter the realm of the so-called NoSQL databases.
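Here is a toy simulation of the eventual consistency model (the replica names and the fixed propagation delay are invented; real systems such as Dynamo or Cassandra use far more sophisticated replication and anti-entropy mechanisms): a write is acknowledged by one replica immediately and reaches the others only after a delay, the inconsistency window.

```python
import time

class EventuallyConsistentStore:
    """Writes land on one replica and propagate to the others asynchronously."""

    def __init__(self, replica_names, propagation_delay=0.5):
        self.replicas = {name: {} for name in replica_names}
        self.delay = propagation_delay
        self.pending = []  # (apply_at, key, value, target_replica)

    def write(self, replica, key, value):
        self.replicas[replica][key] = value  # acknowledged immediately
        for other in self.replicas:
            if other != replica:
                self.pending.append((time.time() + self.delay, key, value, other))

    def read(self, replica, key):
        self._apply_due_updates()
        return self.replicas[replica].get(key)

    def _apply_due_updates(self):
        now = time.time()
        remaining = []
        for due, key, value, target in self.pending:
            if due <= now:
                self.replicas[target][key] = value
            else:
                remaining.append((due, key, value, target))
        self.pending = remaining

store = EventuallyConsistentStore(["r1", "r2"])
store.write("r1", "x", 42)
print(store.read("r2", "x"))  # None -> inside the inconsistency window
time.sleep(0.6)
print(store.read("r2", "x"))  # 42 -> replicas eventually converge
```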
NoSQL Movement
The NoSQL movement started with web 2.0 startups, when they decided to build their own data stores because the data they were interested in did not fit the relational model [page ranking of web pages, or Facebook content]. Yahoo brought up Hadoop, Google brought up BigTable and Amazon brought up Dynamo; Facebook later developed Cassandra. Hadoop and Cassandra became open-source Apache projects after Yahoo and Facebook forked their software. Now, of course, there are many other NoSQL alternatives such as MongoDB and HBase [an open-source version of BigTable]. Hadoop, incidentally, has many adopters even among established storage players, as shown in the table below.
Hadoop Ecosystem
[Image: Hadoop Ecosystem — borrowed table of Hadoop adopters]
Reference:
- A compilation paper on NoSQL databases
Labels: ACID, Big Data, CAP, Eventual Consistency, noSQL
Wednesday, August 3, 2011
Challenges of Big Data
After my last post, I received a few comments regarding the actual nature of the problem. Prima facie it looks as if the expansion of data is the challenge, but that is not the whole story. Big Data is not only about the expansion of data; it is also about finding the hidden [undiscovered] relations among the data, and finding them in real time. So the challenge is not only storage; it is also the analysis of the data and the throughput of the analytics engine. That, however, takes it more into the domain of conventional data warehousing products.
Ovum says, "Combining traditional structured transactional information with unstructured interaction data generated by humans and the Internet (customer records, social media) and, increasingly, machines (sensor data, call detail records) is clearly the sweet spot. These types of interaction data have traditionally been difficult to access or process using conventional BI systems. The appeal of adding these new data types is to allow enterprises to achieve a more complete view of customers, with new insights into relationships and behaviors from social media data." Ovum is referring to unstructured data as the new challenge for traditional data warehousing software. But before we dive into the realm of unstructured data, let's take a quick look at the industry's response to the challenge of big structured data.
A new challenge, as always, translates into new business opportunity. All the existing data warehouse software vendors rose to the challenge with new solutions, either as an enhanced product or an acquired one: Oracle launched Exadata, IBM acquired Netezza and EMC acquired Greenplum. In its 2011 Magic Quadrant report [link and diagram available thanks to Teradata], Gartner placed Teradata, IBM, EMC and SAP/Sybase in the leaders quadrant and Aster Data and ParAccel in the visionaries quadrant [Note: the Magic Quadrant is a Gartner concept]. Incidentally, Aster Data was acquired by Teradata in March this year.
'Big Data' Appliances
To address the need for high throughput and low latency, the Netezza and Teradata appliances both use proprietary hardware designs [MPP-based architectures] for speed, throughput and scale. To understand what we are talking about, let me quote the wiki:
Teradata provides:
- Complex ad hoc queries with up to 256 joins.
- Parallel efficiency, such that the effort for creating 100 records is the same as that for creating 100,000 records.
- Scalability, so that increasing the number of processors of an existing system linearly increases performance.
EMC Greenplum is at its core a software product, but it is also sold as an appliance whose hardware likewise has an MPP-based scalable architecture for faster data processing. With the new-found enterprise readiness of solid-state devices, many vendors have introduced solid-state storage in their appliances to boost random I/O performance.
Essentially, all these products are targeted at processing structured data such as call data records, financial trading data, log files and other forms of machine-generated information. They use a traditional RDBMS for storing and manipulating the logical records. These products are fast and accurate, but also expensive [Netezza's lowest cost per terabyte hovers around $2,500].
However, industry veterans tell us that most of the data expanding enormously today is unstructured. It is not machine generated; it is mostly consumer data [Internet-based]: images, video files, text. Analyzing this data requires different algorithms and different architectures, and a single correlation may require sifting through anywhere from a small to a really large amount of data. It is this data that presents the real challenge, and opportunity, for Big Data analytics.
Quoting Jim Baum, the CEO of Netezza: "I get excited about what we see happening in 'Big Data'. Most of the applications I have seen so far are focused on consumer facing businesses and traditional enterprises leveraging structured data assets. Data warehouses supporting reporting, ETL jobs running in Hadoop, “machine to machine” analytics making decisions in real time. We have just scratched the surface and the opportunities to go further are immense." He is bang on target.
In the next post we will review some of the existing solutions for unstructured Big Data.
Thursday, July 28, 2011
Big Digital Data
Of late, the term Big Data is often seen in news posts and articles. The knowledge industry loves new jargon. Probably it adds a touch of mystery and solidity to the knowledge the jargon refers to, which might otherwise be lost in the zillion bytes streaming through our awareness every day. In a sense, jargon creates an identity that people love to be identified with.
So, what is Big Data exactly?
In simple language, any data repository for a large data stream, such as GPS or atmospheric-pressure sensor data or Twitter posts, can qualify as Big Data. The data is large and incremental, and applications require access to the entire database.
IBM qualifies it as:
Big data spans three dimensions: Variety, Velocity and Volume.
Variety – Big data extends beyond structured data, including unstructured data of all varieties: text, audio, video, click streams, log files and more.
Velocity – Often time-sensitive, big data must be used as it is streaming in to the enterprise in order to maximize its value to the business.
Volume – Big data comes in one size: large. Enterprises are awash with data, easily amassing terabytes and even petabytes of information.
In somewhat crude but simpler terms, the data must be online [so offline backup data does not qualify] and the daily incremental addition should be on the order of hundreds of gigabytes or more for it to qualify as Big Data.
So, why is this particularly important?
The challenge of storing such large data and providing online access to it, which could be mostly random, makes this a non-trivial problem. A non-trivial problem requires a non-trivial solution, and thus it becomes a new opportunity for data storage solution providers. Undoubtedly the problem is not new: technology for structuring and storing large databases has existed for more than a decade, and in fact all the existing RDBMS providers have technology for partitioning large data and an engine to retrieve the information efficiently. One issue with a relational DB is that when a table becomes huge [on the order of petabytes or more], a join becomes a very costly operation, and if data retrieval requires multiple joins, performance can degrade quite heavily. That is where the new generation of NoSQL DBs comes into the picture. But before we go there, some idea of the applications that need Big Data would help.
Most of the applications for large data are data analytics. Analytics is not a new branch, but so far most analytics applications have assumed a structured data set. For example, consumer credit card transactions can be considered big structured data, and many commercially useful analyses can be done by profiling either customers or products. In other words, data analysis is a large business, and it is a no-brainer that the larger the input data, the better the accuracy of the analysis. Data analytics needs large data, and the cheaper the cost of maintaining it, the better for the business. It is also important to note that the rate at which data gets generated has multiplied in the last few years [McKinsey puts the data expansion rate at around 40% per year], again adding to the problem of Big Data.
So the question really boils down to what technology we can use to create a relatively cheap but scalable computing and storage infrastructure that gives the analytics engine reasonably fast access to the data.
While existing relational DB-based models do provide a proven solution for large structured data, their cost/performance for large distributed unstructured data does not scale that well. Here, new-age web 2.0 technologies are bringing a viable alternative to traditional [read: relational] database models.
The other aspect of the story is that most consumers of large data analysis are trying to use cloud storage as the storage infrastructure for their analytics data, as it brings down their CapEx. However, using a NoSQL DB is not without issues. Relational DBs are popular for many reasons; two main ones are:
1. they help reduce redundancy in the data [data modelling ensures 3NF], and
2. SQL provides an easy, consistent and standardized data access interface.
With NoSQL, both of these benefits are forfeited. The data structure is driven by what the users access and how they access it, and the data access interface gets designed around that need [in other words, no SQL support].
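As a rough, hypothetical illustration of what forfeiting SQL means for the application (the key-value layout below is invented and mimics a generic store rather than any particular product): the access path has to be designed up front, and any question that path does not anticipate falls back on application code.

```python
# With SQL, an ad hoc question is expressed declaratively and the engine
# figures out how to answer it, e.g.:
#   SELECT customer_id, SUM(amount) FROM orders
#   WHERE region = 'EU' GROUP BY customer_id;

# With a bare key-value store, orders were keyed by region because that is how
# we decided users would read them; other questions mean scans or re-modelling.
kv_store = {
    "orders:EU": [{"customer_id": 1, "amount": 40}, {"customer_id": 1, "amount": 10}],
    "orders:US": [{"customer_id": 2, "amount": 99}],
}

totals = {}
for order in kv_store["orders:EU"]:  # application code stands in for the query engine
    totals[order["customer_id"]] = totals.get(order["customer_id"], 0) + order["amount"]
print(totals)  # {1: 50}
```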
In the next post we will review some of the existing large-data infrastructure solutions.
Further Reading
McKinsey Report on opportunities and challenges of Big Data [May 2011]