
Wednesday, December 12, 2012

Storage Deduplication: a technology quick-digest

Six years ago, when we started the journey with Storage Deduplication technology, I was excited at the opportunity to be associated with something posited to be the game-changer for the storage industry in the near future. It was not that something conceptually new was being created. The newness came from the fact that enterprise customers, technology analysts and leading storage vendors all saw the importance and immediate need of deduplication in the storage jigsaw puzzle, each from their own perspective. While Data Domain was the biggest newsmaker that sailed on deduplication, many others came up with their own deduplication solutions. And as it turned out, Storage Deduplication did become the game changer for the storage industry for the next couple of years. If you need a reference, Google Data Domain or NetApp's guarantee of 50% savings with VMware VMs.
This post is mostly an attempt to look back and distill the bare technological essence for present-day readers.

Why and What

Essentially, the value of deduplication rests on the common observation that digital content tends to be copied multiple times; in an organization's IT setup, one often sees 7-14 copies of the same data lying across a large but finite storage pool. Think of a document shared among multiple users and also saved inside multiple backup images (I am talking about disk backup alone here) taken daily or weekly. Intuitively, one cannot but marvel at the possibility of a magic tool that would wipe the duplicate copies clean, transparently, without any specific user's knowledge. If that were possible, one can imagine the cost savings it would provide to the IT manager, forever juggling a precious storage budget.

How

Given that modern digital storage is organized in layers, decoupling the physical storage layout from the logical view presented to users, it turns out that building such a tool is not only feasible but can be done without any major change to the storage architecture. In the simplest version, the tool would scan the storage pool and, for each logically identifiable piece of content, i.e. a file, find the duplicates, remove the duplicate content and convert the duplicate files into pointers to one original copy. In software parlance this is quite similar to multiple directory entries pointing to a single inode.
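As a rough illustration of that file-level approach, here is a minimal Python sketch that fingerprints file contents and replaces duplicates with hard links to the first copy; the directory walk, SHA-256 hashing and use of hard links are assumptions for the example, not any product's implementation.

    import hashlib
    import os

    def dedupe_files(root):
        """Replace duplicate files under `root` with hard links to one copy."""
        seen = {}  # content fingerprint -> path of the first copy seen
        for dirpath, _, filenames in os.walk(root):
            for name in filenames:
                path = os.path.join(dirpath, name)
                with open(path, "rb") as f:
                    digest = hashlib.sha256(f.read()).hexdigest()
                if digest in seen:
                    os.remove(path)              # drop the duplicate content
                    os.link(seen[digest], path)  # point the name at the original copy
                else:
                    seen[digest] = path

Note that whole-file fingerprinting is exactly what makes this approach blind to files that differ by a single block, which brings us to the challenges below.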
But there are a couple of major challenges with this approach:
1. it will miss duplicate data in files that have one (or more) new blocks of bytes appended to [or removed from, or modified within] the original file, and
2. it does not quite work in a SAN setup, where all data is seen as a stream of bytes.
To solve these, one needs to look at the physical organization of data. Most modern disk or solid-state storage systems organize data as a series of contiguous chunks of bytes called blocks. That gives one the opportunity to think about a block-level deduplication engine. Given that blocks are relatively small [less than 512 KB] and the block layout on disk is largely independent of the logical view of the storage [files in NAS or streams in SAN], it is a lot easier to manipulate block pointers while leaving the top-level applications undisturbed, that is, assuming that the block read path and block write path of applications remain unchanged by all the deduplication-induced manipulation of blocks.
As we shall see later, that assumption is not entirely necessary, especially when one can afford to be specific, e.g. when building a dedicated backup appliance that intends to use deduplication in a major way.
The puzzle of deduplication, therefore, is essentially to find an efficient way to 1. design a mechanism to store the duplicate information [which logical block(s) map to which physical block] for a storage container and 2. use that information to sort the new data as it is being written to the storage subsystem. To appreciate and understand the problem better, I will pick a slightly different but algorithmically similar puzzle.

Under the hoods

Imagine that we have billions (or trillions) of data blocks with millions of duplicate entries, and we have to find the duplicates and remove them.
The easy approach would be to 1. sort the blocks so that duplicate blocks sit next to each other and 2. scan the sorted list and, for each unique block, check whether the next block is identical to it; if it is, remove it and redirect the reference of the duplicate block so that it points to the first unique block. The cost of this approach is one sort plus one scan per deduplication cycle. Can we improve it further?
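A minimal sketch of that sort-and-scan pass, assuming purely for illustration that the blocks fit in memory as byte strings:

    def dedupe_sort_and_scan(blocks):
        """Return (unique_blocks, mapping): mapping[i] is the index, within
        unique_blocks, of the block that original block i now points to."""
        # Sort block indices by content so identical blocks become neighbours.
        order = sorted(range(len(blocks)), key=lambda i: blocks[i])
        unique, mapping = [], [None] * len(blocks)
        for i in order:
            if unique and blocks[unique[-1]] == blocks[i]:
                mapping[i] = mapping[unique[-1]]   # duplicate: reuse the earlier copy
            else:
                unique.append(i)
                mapping[i] = len(unique) - 1
        return [blocks[i] for i in unique], mapping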
Evidently, the only way we can improve it is if we can somehow avoid the sort in each cycle. There is another scope for improvement: given that each data-block comparison takes a finite amount of processing depending on the block size, we can make the cycle faster if we avoid full-block comparison. One way of doing this would be to hash the blocks to m bits [m should be sufficiently large and the hash function reasonably sophisticated, e.g. SHA-1, so that the probability of two different blocks producing the same hash value is small enough to ignore; at the same time m should be small compared to the size of a data block, so that there is appreciable saving] and create a Bloom filter of k bits which can accommodate [within a reasonable margin of error] all possibilities of the m-bit hash. With a Bloom filter, one advantage is that the cost of checking whether a hash is present is constant. Once the hashes and the Bloom filter are created, one can determine whether a block is a duplicate and find its copy in the list at constant cost, without needing to sort the list: the Bloom filter quickly rules out blocks that are definitely new, and a hash index is consulted only on a possible hit to locate the original copy. That is a big improvement, provided one can ignore the additional cost of maintaining the Bloom filter and hashes.
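Here is a minimal sketch of that fingerprint-plus-Bloom-filter scheme, assuming fixed-size blocks arriving as byte strings and an in-memory index; the class and function names are illustrative, not any vendor's API.

    import hashlib

    class BloomFilter:
        """A tiny k-hash Bloom filter over a bit array of `size` bits."""
        def __init__(self, size=1 << 20, num_hashes=4):
            self.size, self.num_hashes = size, num_hashes
            self.bits = bytearray(size // 8)

        def _positions(self, fingerprint):
            for i in range(self.num_hashes):
                h = hashlib.sha1(fingerprint + bytes([i])).digest()
                yield int.from_bytes(h[:8], "big") % self.size

        def add(self, fingerprint):
            for p in self._positions(fingerprint):
                self.bits[p // 8] |= 1 << (p % 8)

        def might_contain(self, fingerprint):
            return all(self.bits[p // 8] & (1 << (p % 8)) for p in self._positions(fingerprint))

    def dedupe_stream(blocks):
        """Yield (logical_block_id, physical_block_id) as blocks arrive."""
        bloom, index, store = BloomFilter(), {}, []   # index: fingerprint -> physical id
        for block_id, data in enumerate(blocks):
            fp = hashlib.sha1(data).digest()          # the block fingerprint
            if bloom.might_contain(fp) and fp in index:
                yield block_id, index[fp]             # duplicate: point at the old copy
            else:
                bloom.add(fp)
                store.append(data)                    # write the new unique block
                index[fp] = len(store) - 1
                yield block_id, index[fp]

A real engine would persist the fingerprint index and size the Bloom filter from the expected block count; both concerns are glossed over here.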
In a dynamic system there will be more data coming in as time progresses and the list will continue to grow, which is where the above scheme becomes even more beneficial compared to the first one. Most commercial storage deduplication solutions use variants of this scheme. However, some implement it in inline mode, i.e. data is deduplicated as it arrives, before being written to storage (e.g. disks), while others implement it in offline mode, i.e. the normal data read/write path is not disturbed and the deduplication engine runs as a post-write process, somewhat similar to a file-system maintenance cycle. Obviously, inline mode requires large deduplication processing capacity, whereas offline mode imposes a smaller processing tax, especially during peak load.

The case of Back-up Appliance

Alright, so far so good. How can we improve this if the data is primarily for backup? One of the characteristics of backup data is that it is written once, never modified and read very rarely. Two things are important here:
1. how much one can squeeze the data, and
2. how one can improve the reliability of reading the data back.
Experience with digital tape tells us that reads often fail with taped data; disk backup solves that problem. With disk backup, one can employ deduplication quite efficiently, since roughly 90% of the data between two consecutive backup images remains unchanged, which means 90% of the second image is outright duplicate! A backup appliance takes advantage of that and organizes the data in a way that extracts the most benefit from deduplication, sometimes even at a small penalty to the read path. Additionally, one can compress the data after deduplication, which should provide further storage savings. Most modern backup appliances compress after deduplication, which means that during a read the backup image has to be decompressed before the request for a specific data block can be served. As long as the logical-data-to-physical-block mapping is preserved in the image, deduplicated blocks continue to remain deduplicated.
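As a rough sketch of that dedupe-then-compress ordering for a backup image; whole-image zlib compression and SHA-1 fingerprints are assumptions here, real appliances compress in smaller segments and use their own on-disk formats.

    import hashlib
    import zlib

    def write_backup_image(blocks):
        """Deduplicate a stream of blocks, then compress the unique data region.
        Returns (compressed_data, block_sizes, recipe): recipe[i] is the index
        of the unique block backing logical block i."""
        index, unique, recipe = {}, [], []      # index: fingerprint -> unique-block id
        for data in blocks:
            fp = hashlib.sha1(data).digest()
            if fp not in index:
                index[fp] = len(unique)
                unique.append(data)             # only unique blocks reach the store
            recipe.append(index[fp])
        sizes = [len(u) for u in unique]        # needed to locate blocks after decompression
        return zlib.compress(b"".join(unique)), sizes, recipe

    def read_block(compressed, sizes, recipe, logical_id):
        """Serve a read: decompress first, then pick out the deduplicated block."""
        raw = zlib.decompress(compressed)       # compression sits below deduplication
        offsets = [0]
        for s in sizes:
            offsets.append(offsets[-1] + s)
        phys = recipe[logical_id]
        return raw[offsets[phys]:offsets[phys + 1]]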

Disadvantages, if any

There is hardly any engineering solution without a flip-side. Deduplication, whether done inline or offline, inflicts one problem: it affects the contiguity of data, thereby reducing the effectiveness of the storage system's read-ahead. To address this issue, commercial vendors typically employ implementation-specific heuristics which vary widely across vendors and are therefore outside the scope of the present discussion.
There are, however, a couple of open-source deduplication solutions; unfortunately, I have no data on how popular they have been. With the growing popularity of cloud storage, the need for and use of deduplication has undoubtedly waned a bit. I hope to visit the case of deduplication in a cloud setup sometime on this blog.

Wednesday, August 3, 2011

Challenges of Big Data

After my last post, I received a few comments regarding the actual nature of the problem. Prima facie it looks as if the expansion of data is the challenge, but that is not the entire story. Big Data is not only about the expansion of data; it is also about finding the hidden [undiscovered] relations among the data, and finding those relations in real time. So the challenge is not only about storage; it is also about analysis of the data and the throughput of the analytics engine. That, however, takes it more into the domain of conventional data warehousing products.
Ovum says, "Combining traditional structured transactional information with unstructured interaction data generated by humans and the Internet (customer records, social media) and, increasingly, machines (sensor data, call detail records) is clearly the sweet spot. These types of interaction data have traditionally been difficult to access or process using conventional BI systems. The appeal of adding these new data types is to allow enterprises to achieve a more complete view of customers, with new insights into relationships and behaviors from social media data." Ovum is referring to unstructured data as the new challenge for traditional data warehousing software. But before we dive into the realm of unstructured data, let's take a quick look at the industry's response to the challenge of big structured data.
A new challenge, as always, translates to a new business opportunity. All the existing data warehouse software vendors rose to the challenge with new solutions, either in the form of an enhanced product or an acquired one: Oracle launched Exadata, while IBM acquired Netezza and EMC acquired Greenplum. In its 2011 Magic Quadrant report [link and diagram available thanks to Teradata], Gartner placed Teradata, IBM, EMC and SAP/Sybase in the leaders quadrant and Aster Data and ParAccel in the visionaries quadrant [note: Magic Quadrant is a Gartner concept]. Incidentally, Aster Data was acquired by Teradata in March this year.
'Big Data' Appliances
To address the need for high throughput and low latency, Netezza and Teradata appliances both use proprietary hardware designs [MPP-based architectures] for speed, throughput and scale. To understand what we are talking about, let me quote Wikipedia:
Teradata provides:
  • Complex ad hoc queries with up to 256 joins.
  • Parallel efficiency, such that the effort for creating 100 records is the same as that for creating 100,000 records.
  • Scalability, so that increasing the number of processors of an existing system linearly increases performance.
While Teradata uses EMC disk arrays for storage, Netezza uses native disks. The EMC storage controller [for Teradata] and Netezza's own FPGA units provide the necessary intelligence to manage the formidable lag between disk I/O speed and CPU processing speed. Both Netezza and Teradata have separate product lines focusing on higher performance as opposed to higher capacity. For example, the Netezza High Capacity Appliance provides as much as 10 petabytes of storage capacity and 4 times the data density of the Netezza TwinFin product, while the TwinFin offers 35% higher processing throughput than its sibling.
EMC Greenplum is at its core a software product, but it is also sold as an appliance whose hardware likewise has an MPP-based scalable architecture for faster data processing. With the new-found enterprise readiness of solid-state devices, many vendors have introduced solid-state storage into their appliances to boost random I/O performance.
Essentially, all these products are targeted at processing structured data like call data records, financial trading data, log files and other forms of machine-generated information. They use a traditional RDBMS for storing and manipulating the logical records. These products are fast and accurate but also expensive [Netezza's lowest cost per terabyte hovers around $2,500].
However, industry veterans tell us that most of the data that is expanding so enormously today is unstructured. It is not machine-generated; it is mostly consumer data [internet-based]: images, video files, text. Analyzing this data requires different algorithms and a different architecture, and a single correlation may require sifting through anything from a small to a really large amount of data. The real challenge, and opportunity, for Big Data analytics is presented by this data.
Quoting Jim Baum, the CEO of Netezza: "I get excited about what we see happening in 'Big Data'. Most of the applications I have seen so far are focused on consumer facing businesses and traditional enterprises leveraging structured data assets. Data warehouses supporting reporting, ETL jobs running in Hadoop, “machine to machine” analytics making decisions in real time. We have just scratched the surface and the opportunities to go further are immense." He is bang on target.
In the next post we will review some of the existing solutions for unstructured Big Data.

Thursday, July 28, 2011

Big Digital Data

Of late, the term Big Data is often seen in news posts and articles. The knowledge industry loves new jargon. Probably it adds a touch of mystery and solidity to the knowledge the jargon is associated with, which might otherwise be lost in the zillion bytes that stream through our awareness every day. In a sense, jargon creates an identity that people love to be identified with.
So, what is Big Data exactly?
In simple language, any data repository for a large data stream, such as GPS data, atmospheric pressure sensor data or Twitter posts, can qualify as Big Data. Such data is large and incremental, and applications require access to the entire database.
IBM qualifies it as:
Big data spans three dimensions: Variety, Velocity and Volume.
Variety – Big data extends beyond structured data, including unstructured data of all varieties: text, audio, video, click streams, log files and more.
Velocity – Often time-sensitive, big data must be used as it is streaming in to the enterprise in order to maximize its value to the business.
Volume – Big data comes in one size: large. Enterprises are awash with data, easily amassing terabytes and even petabytes of information.
In somewhat crude but simpler terms, the data must be online [which means offline backup data does not qualify] and the daily incremental addition should be on the order of hundreds of gigabytes or more for it to qualify as Big Data.
So, why is this particularly important?
The challenge of storing such large data and providing online access to it, which could be mostly random, makes this a non-trivial problem. A non-trivial problem requires a non-trivial solution, and thus it becomes a new opportunity for data storage solution providers. Undoubtedly, the problem is not new: technology for structuring and storing large databases has existed for more than a decade; in fact, all the existing RDBMS providers have the technology for partitioning large data sets and provide the engine to retrieve the information efficiently. One issue with a relational DB [RDB] is that when a table becomes huge [on the order of petabytes or more], a join becomes a very costly operation, and if data retrieval requires multiple joins, performance can degrade quite heavily. That is where the new generation of NoSQL databases comes into the picture. But before we go there, perhaps some idea about the applications that need Big Data would help.
Most of the applications for large data are in data analytics. Analytics is not a new branch; so far, most analytics applications have assumed a structured data set. For example, consumer credit card transactions can be considered big structured data, and many commercially useful analyses can be done by profiling either the customers or the products. Data analysis is a large business, and it is a no-brainer that the larger the input data, the better the accuracy of the analysis. In other words, data analytics needs large data, and the cheaper it is to maintain, the better for the business. It is also important to note that the rate at which data gets generated has multiplied in the last few years [McKinsey says the data expansion rate is around 40% per year], again adding to the problem of Big Data.
So the question really boils down to what technology we can use to create relatively cheap but scalable computing and storage infrastructure that enables reasonably fast access to the data for the analytics engine.
While the existing relational DB-based model does provide a proven solution for large structured data, its cost/performance for large distributed unstructured data does not scale that well. Here, new-age Web 2.0 technologies are bringing a viable alternative to traditional [read: relational] database models.
The other aspect of the story is that most consumers of large data analysis are trying to use cloud storage as the storage infrastructure for their analytics data, as it brings down their CapEx. However, using a NoSQL DB is not without issues. Relational DBs are popular for many reasons; two of the main ones are:
1. they help reduce redundancy in the data [data modelling ensures 3NF], and
2. SQL provides an easy, consistent and standardized data access interface.
With NoSQL, both of these benefits are forfeited. The data structure is driven by what data the users access and how they access it, and the data access interface is designed to meet that need [in other words, no SQL support]; a small sketch of this access-pattern-driven modelling follows below.
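As a rough illustration of access-pattern-driven modelling, here is a minimal sketch using a plain Python dict as a stand-in for a key-value (NoSQL) store; the key layout and the fields are assumptions for the example.

    # A plain dict standing in for a key-value (NoSQL) store: the "schema" is
    # nothing more than the key layout chosen for the queries we expect to run.
    store = {}

    def record_purchase(customer_id, order_id, order):
        """Write the order denormalized, once per access path we need."""
        # Access path 1: "show everything about one order" -> one key, one get.
        store[f"order:{order_id}"] = order
        # Access path 2: "list a customer's orders" -> append to a per-customer bucket.
        store.setdefault(f"customer:{customer_id}:orders", []).append(order_id)

    def orders_for_customer(customer_id):
        """Look up by the exact key we designed for; no joins, no SQL."""
        order_ids = store.get(f"customer:{customer_id}:orders", [])
        return [store[f"order:{oid}"] for oid in order_ids]

    # Usage: the redundancy a relational design would normalize away is accepted
    # here in exchange for constant-time lookups along the chosen access paths.
    record_purchase("c42", "o1001", {"item": "disk array", "qty": 2})
    print(orders_for_customer("c42"))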
In the next post we will review some of the existing large-data infrastructure solutions.
Further Reading
McKinsey Report on opportunities and challenges of Big Data [May 2011]