Wednesday, December 12, 2012

Storage Deduplication: a technology quick-digest

Six years ago, when we started the journey with storage deduplication technology, I was excited at the opportunity to be associated with something that was posited to be a game-changer for the storage industry in the near future. It was not that something conceptually new was being created. The newness came from the fact that enterprise customers, technology analysts and leading storage vendors all saw, from their own perspectives, the importance and immediate need of deduplication in the storage jigsaw puzzle. While Data Domain was the biggest newsmaker that sailed on deduplication, many others came up with their own deduplication solutions. And as it turned out, storage deduplication did become the game changer for the storage industry for the next couple of years. If you need a reference, Google for Data Domain or NetApp's guarantee of 50% savings with VMware VMs.
This post is mostly an attempt to look back and distill the technology to its bare essentials for present-day readers.

Why and What

Essentially, the value of deduplication rests on the common observation that digital content tends to be copied multiple times; in an organization's IT setup, one often sees 7-14 copies of the same data lying across a large but finite storage pool. Think about a document shared among multiple users and also saved inside multiple backup images (I am talking about disk backup alone here) taken daily or weekly. Intuitively, one cannot but marvel at the possibility of a magic tool that would wipe the duplicate copies clean, transparently and without any specific user's knowledge. If that were possible, imagine the cost saving it would provide to the IT manager, who is always juggling a precious storage budget.

How

Given that modern digital storage is organized in layers, decoupling the physical storage layout from the logical view presented to users, it turns out that building such a tool is not only feasible but can be done without any major change to the storage architecture. In the most simplistic version, the tool would scan the storage pool and, for each logically identifiable piece of content, i.e. a file, find the duplicates, remove the duplicate content and convert the duplicate files into pointers to one original copy. In software parlance this is quite similar to multiple directory entries pointing to a single inode.
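To make this concrete, here is a minimal sketch in Python of such a file-level tool: it hashes every file in a pool and turns byte-identical copies into hard links, much like several directory entries sharing one inode. The language, the SHA-256 fingerprint and the pool path are my own illustrative choices, not anything from a specific product.

```python
# A minimal sketch of file-level deduplication: hash every file under a
# directory tree and replace byte-identical copies with hard links to the
# first copy seen. Assumes all files live on one filesystem (hard links
# cannot cross filesystems).
import hashlib
import os


def file_digest(path, chunk_size=1 << 20):
    """Return the SHA-256 digest of a file, read in 1 MiB chunks."""
    h = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            h.update(chunk)
    return h.hexdigest()


def dedupe_tree(root):
    """Replace duplicate files under `root` with hard links to one original."""
    seen = {}  # digest -> path of the first (kept) copy
    for dirpath, _, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            digest = file_digest(path)
            original = seen.get(digest)
            if original is None:
                seen[digest] = path          # first copy: keep as-is
            else:
                os.remove(path)              # duplicate: drop the content...
                os.link(original, path)      # ...and point the name at the original


if __name__ == "__main__":
    dedupe_tree("/tmp/storage-pool")         # hypothetical pool location
```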
But there are a couple of major challenges with this approach:
1. it misses duplicate data in files that differ from the original only by a block (of bytes) being appended, removed or modified, and
2. it does not quite work in a SAN setup, where all data is seen as a stream of bytes.
To solve these, one needs to look at the physical organization of data. Most modern disk or solid-state storage systems organize data as a series of contiguous, fixed-size runs of bytes called blocks. That gives one the opportunity to think about a block-level deduplication engine. Given that blocks are relatively small [less than 512KB] and the block layout on disk is relatively immune to the logical view of the storage [files in NAS or streams in SAN], it is a lot easier to manipulate block pointers and leave the top-level applications undisturbed, assuming that the block read path and block write path of applications remain unchanged by all the deduplication-induced manipulation of blocks.
As we shall see later, that assumption is not entirely necessary, especially when one can afford to be specific, e.g. when building a dedicated backup appliance that intends to use deduplication in a major way.
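To show the pointer manipulation in isolation, the following toy sketch (my own illustration, not any vendor's metadata format) keeps a map from logical block numbers to physical blocks plus a reference count per physical block; deduplication only redirects logical pointers, and the read path keeps following them exactly as before.

```python
# A toy block map for block-level deduplication: logical block numbers map to
# physical block ids, and each physical block carries a reference count.
# Real systems keep this metadata on disk; a dict is used purely for clarity.
class BlockMap:
    def __init__(self):
        self.logical_to_physical = {}   # logical block number -> physical id
        self.refcount = {}              # physical id -> number of logical refs

    def point(self, logical_block, physical_block):
        """Make a logical block refer to an existing physical block."""
        if logical_block in self.logical_to_physical:
            self.release(logical_block)
        self.logical_to_physical[logical_block] = physical_block
        self.refcount[physical_block] = self.refcount.get(physical_block, 0) + 1

    def release(self, logical_block):
        """Drop a logical reference; the physical block is free at zero refs."""
        physical = self.logical_to_physical.pop(logical_block)
        self.refcount[physical] -= 1
        if self.refcount[physical] == 0:
            del self.refcount[physical]   # physical block can be reclaimed

    def resolve(self, logical_block):
        """The read path is untouched: it simply follows the pointer."""
        return self.logical_to_physical[logical_block]
```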
The puzzle of deduplication, therefore, is essentially to find an efficient way to 1. design a mechanism to store duplicate information [which logical block(s) map to which physical block] for a storage container and 2. use that information to sort the new data as it is being written to the storage subsystem. To appreciate and understand the problem better, I will pick a slightly different but algorithmically similar puzzle.

Under the hood

Imagine that we have billions (or trillions) of data blocks with millions of duplicate entries, and we have to find the duplicates and remove them.
The easy approach would be to 1. sort the blocks so that duplicate blocks sit next to each other in the order and 2. scan the sorted list and, for each unique block, check whether the next block is identical to it; if it is, remove it and manipulate the reference to the duplicate block so that it points to the first unique block. The cost of this approach is one sort plus one scan per deduplication cycle. Can we improve it further?
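As a toy rendition of that sort-and-scan cycle, the sketch below (on small in-memory byte blocks, purely for illustration) sorts the blocks so duplicates become adjacent and then collapses each run onto its first occurrence.

```python
# Sort-and-scan deduplication on in-memory blocks: sorting makes identical
# blocks adjacent, and one scan redirects every duplicate to the kept copy.
def dedupe_by_sorting(blocks):
    """Return (unique_blocks, mapping from original index to kept index)."""
    # Sort indices by raw block content so identical blocks become adjacent.
    order = sorted(range(len(blocks)), key=lambda i: blocks[i])
    unique_blocks = []
    redirect = {}                          # original index -> index of kept copy
    previous = None
    for i in order:
        if blocks[i] != previous:          # full-block comparison, as in step 2
            unique_blocks.append(blocks[i])
            previous = blocks[i]
        redirect[i] = len(unique_blocks) - 1
    return unique_blocks, redirect


if __name__ == "__main__":
    data = [b"alpha", b"beta", b"alpha", b"gamma", b"beta"]
    kept, redirect = dedupe_by_sorting(data)
    print(len(kept), redirect)             # 3 unique blocks, duplicates redirected
```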
Evidently the only way we can improve it is if we can somehow avoid the sort in each cycle. There is another scope for improvement: given that each data-block comparison takes a finite amount of processing depending on the block size, we can make the cycle faster if we avoid full-block comparison. One way of doing that is to hash the blocks to m bits [m should be sufficiently large and the hash function reasonably sophisticated, e.g. SHA-1, so that the probability of two different blocks producing the same hash value is small enough to ignore; at the same time m should be small compared to the size of the data block, so that there is appreciable saving] and create a Bloom filter of k bits that can accommodate [within a reasonable margin of error] all possibilities of the m-bit hash. One advantage of a Bloom filter is that the cost of checking whether a hash has been seen or not is constant. Once the hashes and the Bloom filter are created, one can determine whether a block is a duplicate and find its copy in the list at constant cost, without needing to sort the list. That is a big improvement, but only if one can ignore the additional cost of maintaining the Bloom filter and the hashes.
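A minimal sketch of that scheme, with SHA-1 fingerprints, a small Bloom filter and a fingerprint table, is shown below; the filter size, the number of hash functions and the double-hashing trick are my own illustrative choices rather than anything from a specific product.

```python
# Hash-based deduplication index: SHA-1 fingerprints, a Bloom filter that
# answers "possibly seen before?" at constant cost, and a table keyed by
# fingerprint that locates the existing copy.
import hashlib


class BloomFilter:
    def __init__(self, size_bits=1 << 20, num_hashes=4):
        self.size = size_bits
        self.num_hashes = num_hashes
        self.bits = bytearray(size_bits // 8)

    def _positions(self, fingerprint):
        # Derive k bit positions from the 20-byte SHA-1 digest via double hashing.
        h1 = int.from_bytes(fingerprint[:8], "big")
        h2 = int.from_bytes(fingerprint[8:16], "big") | 1
        return [(h1 + i * h2) % self.size for i in range(self.num_hashes)]

    def add(self, fingerprint):
        for p in self._positions(fingerprint):
            self.bits[p // 8] |= 1 << (p % 8)

    def might_contain(self, fingerprint):
        return all(self.bits[p // 8] & (1 << (p % 8))
                   for p in self._positions(fingerprint))


class DedupeIndex:
    def __init__(self):
        self.filter = BloomFilter()
        self.by_fingerprint = {}      # fingerprint -> id of the stored block

    def store(self, block_id, block):
        """Return the id the block should point at (existing copy or itself)."""
        fp = hashlib.sha1(block).digest()
        if self.filter.might_contain(fp) and fp in self.by_fingerprint:
            return self.by_fingerprint[fp]     # duplicate: reuse existing copy
        self.filter.add(fp)
        self.by_fingerprint[fp] = block_id     # new unique block
        return block_id
```

In this in-memory toy the fingerprint table makes the Bloom filter look redundant; the point in a real system is that the full fingerprint index is too large to keep in memory, and the cheap in-memory filter lets most genuinely new blocks skip the expensive on-disk lookup.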
In a dynamic system there will be more data coming in as time progresses, and the list will continue to grow; that is where the above scheme becomes even more beneficial compared to the first one. Most commercial storage deduplication solutions use variants of this scheme. However, some implement it in inline mode, i.e. data is deduplicated as it arrives, before being written to storage (e.g. disks), while a few others implement it in offline mode, i.e. the normal data read/write paths are not disturbed and the deduplication engine runs as a post-data-write process, somewhat similar to a file-system maintenance cycle. Obviously, inline mode requires large deduplication processing capacity, whereas offline mode imposes less processing tax, especially during peak load.

The case of the Backup Appliance

Alright, so far so good. How can we improve this if the data is primarily for backup? One characteristic of backup data is that it is written once, never modified and read very rarely. Two things are important here:
1. how much one can squeeze the data, and
2. how reliably one can read the data back.
Experience with digital tape tells us that reading back taped data often fails. Disk backup solves that problem. With disk backup one can also employ deduplication quite efficiently, since roughly 90% of the data between two consecutive backup images remains unchanged, which means roughly 90% of the second image is outright duplicate! A backup appliance takes advantage of that and organizes the data in such a way that it extracts the most from deduplication, sometimes even at a small penalty to the read path. Additionally, one can compress the data after deduplication, which provides further storage savings. Most modern backup appliances apply compression after deduplication, which means that during a read the relevant portion of the backup image has to be decompressed before the request for a specific data block can be served. As long as the logical-data-to-physical-block mapping is preserved in the image, deduplicated blocks continue to remain deduplicated.
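The sketch below illustrates that deduplicate-then-compress ordering on two consecutive backup images; the fixed 4KB block size, SHA-1 fingerprints and in-memory block store are simplifying assumptions, not a description of any particular appliance.

```python
# Deduplicate, then compress: a backup stream is cut into fixed-size blocks,
# duplicates are dropped via fingerprints, and only unique blocks are
# compressed. The per-image "recipe" of fingerprints allows reassembly on read.
import hashlib
import zlib

BLOCK_SIZE = 4096


def ingest_image(image_bytes, store):
    """Deduplicate then compress one backup image; return its block recipe."""
    recipe = []
    for offset in range(0, len(image_bytes), BLOCK_SIZE):
        block = image_bytes[offset:offset + BLOCK_SIZE]
        fp = hashlib.sha1(block).digest()
        if fp not in store:                  # unique block: compress and keep
            store[fp] = zlib.compress(block)
        recipe.append(fp)                    # duplicates only add a reference
    return recipe


def read_image(recipe, store):
    """Rebuild an image by decompressing each referenced block in order."""
    return b"".join(zlib.decompress(store[fp]) for fp in recipe)


if __name__ == "__main__":
    store = {}
    monday = b"A" * 8192 + b"B" * 4096
    tuesday = b"A" * 8192 + b"C" * 4096      # mostly unchanged the next day
    r1 = ingest_image(monday, store)
    r2 = ingest_image(tuesday, store)
    assert read_image(r2, store) == tuesday
    print(len(store), "unique blocks stored for", len(r1) + len(r2), "references")
```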

Disadvantages, if any

There is hardly any engineering solution without a flip-side. Deduplication, whether done inline or offline, inflicts one problem: it affects the contiguity of data, thereby reducing the effectiveness of the storage system's read-ahead. To address this issue, commercial vendors typically employ implementation-specific heuristics, which vary widely across vendors and are therefore outside the scope of the present deliberation.
There are, however, a couple of open-source deduplication solutions. Unfortunately I have no data on how popular they have been. With the growing popularity of storage clouds, the need for and use of deduplication has undoubtedly waned a bit. I hope to visit the case of deduplication in a cloud setup sometime on this blog.

Tuesday, February 21, 2012

From Scale out NAS to Scale out Database for Big Data

This post is the first review of NewSQL DBs, as promised in my last post, and we will start with Clustrix. Clustrix is interesting for two reasons: 1. the architect and some of the principal designers came from the NAS space (as opposed to database technology) and 2. they appear to be doing the Isilon equivalent in the database space.
Scale Out
Scale out as a concept became popular with the massive growth of data centres. Essentially, scale out means to scale horizontally, i.e. adding more compute resources to the infrastructure pool to take care of growing demand: if you have one server for n users, to serve m*n users you simply need m similar servers. In contrast, scale-up means scaling vertically, i.e. replacing your server with a more powerful one when the number of users increases.
Scale Out NAS
So scaling out is simpler, assuming one has enough rack space in the data centre. During the last decade, scale-out NAS became a strong market driver and many innovation engines took off, promising faster, simpler and cheaper scale out. Huge growth of unstructured data in enterprise data centres pushed the scale-up model to the edge, with each hardware replacement getting more and more expensive. Additionally, traditional NAS also comes with certain scaling limitations, such as:
1. performance is limited by a single head [multicore CPUs alleviated the issue just a little]
2. limited namespace [typically this is a file-system limitation brought in to ensure performance does not degrade unacceptably with a large number of files]
3. migration of data between two NAS namespaces is costly.
3PAR, Spinnaker and Isilon were a few of the innovation leaders in the space of scale-out NAS. After NetApp bought Spinnaker, Isilon occupied this niche space almost entirely until EMC bought it.
What did Isilon bring technologically?
Isilon's OneFS clustered file system is the core of its technology. The OneFS software is designed with file striping across each node in a cluster, a fully distributed lock manager, caching, fully distributed metadata, and a remote block manager to maintain global coherency and synchronization across the cluster. That enables Isilon to support a massive 15PB of file storage in a single file system, which is why EMC positions Isilon NAS as its Big Data solution for file storage.
Scale-out Database Appliance
The experience of building that distributed lock manager has some relevance to databases, and that gave confidence to the Isilon engineers when they started designing Clustrix. Concurrency management is one of the crucial challenges in designing a clustered database. Clustrix claims to use MVCC [multi-version concurrency control] with a shared-nothing architecture, but the most important part of their innovation is a distributed query execution model which they call the 'Sierra Distributed Engine'. True to their philosophy of "Bring the Query to the Data and not the Data to the Query",
Sierra’s most basic primitive is a compiled program called a query fragment. They can read, insert, update a container, execute functions, modify control flow, format rows, perform synchronization, and send rows to query fragments on other nodes. These query fragments are run on the nodes that contain the data. The communication between the nodes consists of just the intermediate and result rows needed for the queries. Many, many query fragments can operate simultaneously across the cluster. Those query fragments may be different components of the same query or parts of different queries.
Queries enter the system through the front-end network and are translated to an internal representation used by Sierra. Sierra then executes the queries in an efficient parallel fashion. Sierra uses SSDs to store the data, NVRAM to journal changes, and Infiniband to communicate with other nodes in the cluster. [Clustrix whitepaper]
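To illustrate the "bring the query to the data" idea in miniature (this is my own conceptual sketch, not Sierra's actual primitives or API), the toy below runs a partial GROUP-BY aggregation as a "fragment" on each node's local slice and ships only the small intermediate rows back to a coordinator.

```python
# "Bring the query to the data": each node evaluates a query fragment against
# its local slice and returns only intermediate rows; the coordinator merges.
from concurrent.futures import ThreadPoolExecutor

# Hypothetical cluster: each "node" holds a slice of an orders table.
NODE_SLICES = [
    [{"customer": "acme", "amount": 120}, {"customer": "initech", "amount": 40}],
    [{"customer": "acme", "amount": 75}],
    [{"customer": "initech", "amount": 300}, {"customer": "acme", "amount": 10}],
]


def fragment_sum_by_customer(rows):
    """Query fragment executed where the rows live: partial SUM ... GROUP BY."""
    partial = {}
    for row in rows:
        partial[row["customer"]] = partial.get(row["customer"], 0) + row["amount"]
    return partial            # only small intermediate rows cross the "network"


def coordinator_sum_by_customer(slices):
    """Run the fragment on every node in parallel and merge the partial sums."""
    totals = {}
    with ThreadPoolExecutor(max_workers=len(slices)) as pool:
        for partial in pool.map(fragment_sum_by_customer, slices):
            for customer, amount in partial.items():
                totals[customer] = totals.get(customer, 0) + amount
    return totals


if __name__ == "__main__":
    print(coordinator_sum_by_customer(NODE_SLICES))
    # {'acme': 205, 'initech': 340}
```

The saving comes from the fact that only the per-customer partial sums cross the network, not the raw rows sitting on each node.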

Performance
Clustrix claims that transaction throughput per million records in the database scales linearly with the addition of nodes.
They support replication with MySQL, both key-value and SQL queries, as well as on-the-fly schema changes. The user does not need to worry about sharding or partitioning, since the basic performance bottleneck observed with an RDBMS is removed here. Overall, Clustrix seems to have brought out a strong NewSQL product.
For more information take a look at the Clustrix resource page.

Friday, February 17, 2012

NewSQL way, neither traditional RDBMS, nor NoSQL

This December, NuoDB [erstwhile NimbusDB] released the Beta 5 of its database software. This database, it claims, "is a NewSQL database. It looks and behaves like a traditional SQL database from the outside but under the covers it's a revolutionary database solution. It is a new class of database for a new class of datacenter."
NewSQL database?
It appears that although the NoSQL variant gained popularity with the likes of Google, Yahoo and Facebook, it did not make much of a dent in the RDBMS clientele base. The main reason, the NewSQL advocates cite, is that people like SQL: irrespective of scalability and other issues, people decided to stay with SQL. The fact that SQL has been in the game for the last 20-25 years has made it so entrenched in the business application space that it is almost impossible to take it away. Even if one ignores the large SQL investment, one cannot ignore the value SQL brought to the business community. SQL essentially separated data from the code that protects the data, i.e. the database engine, enabling DBMS packages to be commoditized.
SQL also provided a well-understood and simple data manipulation interface which business applications could use without needing to assimilate the full complexities of computer programming. Business applications would have become far more complex if they had to handle the ACID requirements of their data in addition to implementing their business logic.
NoSQL systems, however scalable and flexible they are, come with a huge cost: one has to design one's own data manipulation engine, and the larger the scale of the data and the wider the distribution of computing resources, the more intensive the effort. The uniqueness of each application demands a specific design of the engine that manipulates the application's data, which ties it more closely to the business logic of the enterprise. While this has its strengths, it is evident that most businesses whose core operation is not about the data itself have stayed away from the NoSQL movement, however large their data may be.

Promises of NewSQL: SQL for Big Data
NewSQL tries to bridge this gap by keeping the SQL interface intact while re-engineering the basic database engine. Evidently this is far more daunting than coming up with a NoSQL alternative. It requires changing a design that has held its ground for almost 30 years. Only those who understand the intricacies of the original design can venture to take on this task of re-architecting the database engine while keeping its original promises intact. What does that mean? It means the engine must fully support SQL and must guarantee ACID [my previous posts elaborated on how NoSQL addressed this requirement]. Additionally, the engine must 1. provide support for loosely connected sets of computing resources, such as computers connected over the Internet, and 2. scale performance with the number of computers. The last requirement came from NoSQL land, where a huge number of computing resources connected over the Internet are designed to sift through massive distributed data to answer a single query. Essentially, this engine must be capable of building a distributed database spread over a huge number of affordable [i.e. cheap] computers connected over the Internet and provide a SQL interface to the entire data set. That makes NewSQL truly the database engine for Big Data.
Contenders
Most of the challengers in this space were started by someone who participated in database software development in the early 70's. Let's take the example of VoltDB, started by Michael Stonebraker, a luminary in database research who architected Ingres [one of the first relational DBs], Postgres and many more.
Similarly, NuoDB boasts of Jim Starkey, the person behind DEC's relational DB suites during 1975-85.
The other prominent NewSQL venture, ScaleDB, was started by another database legend, Vern Watts, the architect of IBM's famed IMS.
Then there is JustoneDB, which proclaims itself the relational DB of the 21st century and boasts of its CTO, Duncan Pauly.
The list is a long one, but I must mention Clustrix, which boasts of its CTO, Aaron Passey, of Isilon fame [Isilon brought a new definition to mainstream storage clustering]. Clustrix appears to have brought the appliance model into the NewSQL database world.
I am sure there are many more to come in this space, and I am sure I have missed a few in listing them here, but given the emergence of this new technology space, we will have to revisit this topic.
I will try to provide a more detailed review of these products in upcoming posts, starting with Clustrix [see the post here].
Summary
Here is a quick comparison between three different DB technologies:

SQL-DB | No-SQL DB | NewSQL DB
------ | --------- | ---------
Basic architecture from the 70's relational database model | New [circa 2000] architecture from the likes of Google and Yahoo; designed for a single large distributed database | Newer [post-2000] architecture; promises to scale for both standalone and large installations
Centralized transaction processing | Distributed processing | Distributed processing
Fully ACID compliant | Breaks ACID; brings an eventual-consistency model | ACID compliant
Integrates a SQL engine | No support for SQL | Full support for SQL
Limited scalability | High scalability | High scalability; tries to break dependency on any single engine
Mature technology; at the core of all popular OLTP suites | Relatively mature; suits the SaaS model better | Still evolving; has the potential to scale for both use-cases