Tuesday, January 15, 2013

A storage system potpourri for beginners

Storage is easily one of the most talked-about, most attention-grabbing and most confusing technologies around. Anyone who can read this sentence is aware of digital storage as a concept. Any data generated by a computing machine is digital and requires digital storage. However, when it comes to the technology itself, storage is easily the most clouded of concepts, infested with an unending series of acronyms (DAS, NAS, SAN, SCSI, SATA, SAS, NFS, CIFS, RAID...) and multiple technology families such as tape storage, disk storage and solid-state storage, and then there is the all-encompassing Cloud. If you hoped that with Cloud you finally have one thing to take refuge in, hold that hope: you must first ascertain what constitutes Cloud before you can be sure that you can rest with it.
One way to make sense of this apparent forest of acronyms and concepts is to appreciate what we need storage for. Essentially, the entire purpose of all storage technologies is to help us store our ever-expanding digital data in a way that is
  1. safe and persistent, that is data does not get destroyed, lost, mutated or corrupted once stored 
  2. secure against unauthorized access 
  3. accessible when one needs it, and
  4. affordable. 
There is one more complexity that we must be mindful of: the complexity of size. As the size of the data grows, the means to deliver on all four parameters must evolve, often drastically, so that the overall solution remains attractive to the user. For example, if you have only 100 GB of data, a single external hard disk is often good enough for your needs; however, if that data becomes 1 exabyte [1 exabyte is 1,000 petabytes and 1 petabyte is 1,000,000 GB], you need a whole range of technologies to manage it. The difference between personal storage and enterprise storage is, to a large extent, an illustration of how quantity transforms into a qualitative attribute at larger magnitudes.
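To make the unit arithmetic above concrete, here is a small sketch in Python. It assumes decimal (SI) units, as the bracketed note does; the helper name `convert` is made up for this illustration.

```python
# Decimal (SI) storage units, expressed in bytes.
GB = 10**9
TB = 10**12
PB = 10**15
EB = 10**18

def convert(amount, from_unit, to_unit):
    """Convert an amount between two units, each given as bytes per unit."""
    return amount * from_unit // to_unit

exabyte_in_pb = convert(1, EB, PB)
petabyte_in_gb = convert(1, PB, GB)
# How many 1 TB external disks would hold 1 EB of raw data (no redundancy)?
one_tb_disks_per_exabyte = convert(1, EB, TB)

print(exabyte_in_pb, petabyte_in_gb, one_tb_disks_per_exabyte)
```

The last figure is the point of the paragraph: nobody manages that many individual disks by hand, so the tooling has to change with scale.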

Personal Storage

For non-professional personal needs, the 300 GB hard disk that typically comes by default with a laptop is more than sufficient. A 250 GB hard disk, for example, can hold around 50,000 normal-size photos or MP3 music files. If you are an avid user of video, you will probably buy a few 1 TB external hard disks in addition, and that would be a DAS or Directly Attached Storage system for you. If you are a cloud aficionado, you would probably rely on Google Drive or Microsoft SkyDrive for your additional needs, in which case you have both DAS and public Cloud in your system.

Enterprise Storage

When it comes to the enterprise, many aspects such as data growth, data retention, preparedness for recovering data after a site disaster, and access frequency of data come into consideration, making storage planning a costly and complex business. Additionally, with increasing sensitivity towards unstructured data, enterprises are experiencing faster expansion of storage demands. According to IDC's Worldwide Quarterly Disk Storage Systems Tracker, 3Q12 marked the first time that external disk storage systems makers shipped over seven exabytes, or 7,104 petabytes, of capacity in a single quarter, for a year-over-year growth rate of 24.4 percent [source: Infostor]. This suggests that in the next 5-6 years many organizations could hit an exabyte of enterprise data.
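To get a feel for what that growth rate implies, here is a rough compound-growth sketch. It assumes, purely for illustration, that the 24.4 percent year-over-year rate stays constant, which the IDC figures do not claim; the function name `project_capacity` is my own.

```python
def project_capacity(start_pb, yearly_growth, years):
    """Return projected capacity for years 0..years under constant compound growth."""
    return [start_pb * (1 + yearly_growth) ** year for year in range(years + 1)]

# 7,104 PB shipped in 3Q12, growing 24.4% per year (assumed constant).
projection = project_capacity(7104, 0.244, 6)

for year, capacity_pb in enumerate(projection):
    print(f"year {year}: {capacity_pb / 1000:.1f} EB shipped per quarter")
```

Under that assumption the quarterly shipment volume roughly triples in five years, which is the intuition behind expecting exabyte-scale organizations.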

Storage Tiers

To get around this challenge of data explosion, enterprises bring in storage tiers, where data is organized into different classes based on how actively it is used. For example, very active data (data modification rate is high and data access rate is very high) must be kept online in the fastest and most reliable storage tier [let's say tier 1], while the least active data [no data modification, and accessed only in special scenarios like past data audit or recovery] can be archived in offline storage. This way, the enterprise provides the most resources to the most active data and efficiently reduces the cost of storage for less active data.
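The tiering rule described above can be sketched as a simple classifier. Everything here is illustrative: the `DataSet` type, the tier names and the thresholds are assumptions of mine, and real products use much richer policies.

```python
from dataclasses import dataclass

@dataclass
class DataSet:
    name: str
    reads_per_day: float
    writes_per_day: float

def choose_tier(ds, hot_reads=1000, hot_writes=100):
    """Pick a storage tier from access activity (thresholds are hypothetical)."""
    # Very active data: modified often and read very often.
    if ds.reads_per_day >= hot_reads and ds.writes_per_day >= hot_writes:
        return "tier-1"
    # Least active data: never modified and read only on rare occasions.
    if ds.writes_per_day == 0 and ds.reads_per_day < 1:
        return "archive"
    # Everything in between goes to a cheaper online tier.
    return "near-line"

for ds in [
    DataSet("orders-db", reads_per_day=50000, writes_per_day=5000),
    DataSet("team-share", reads_per_day=200, writes_per_day=5),
    DataSet("2009-audit-logs", reads_per_day=0.01, writes_per_day=0),
]:
    print(ds.name, "->", choose_tier(ds))
```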

Fig 1. Storage Tiers based on Data usage
Fig. 2 tapes and disks
Typically, most of the online storage in an enterprise is maintained on disk-based storage. Traditionally, digital tapes were used for all offline storage, because tapes can be preserved with very low electrical power consumption and can be physically moved to a different location at very little cost. But tapes are sequential-access devices and therefore require a different hardware setup. They are also more prone to read failures than disks. The last ten years of innovation have increased the storage density of disks manifold and brought the cost/GB of disk storage below that of tape, eventually establishing disks so strongly for archival storage that most enterprises of late are opting for disk-based backup over tape. It started with VTL [Virtual Tape Library] appliances replacing physical tape backup appliances, and of late VTLs have merged with standard disk-based backup appliances. Almost all backup appliances use deduplication in a major way to reduce the storage footprint. An added advantage of this transition is that archived data can be brought online within a very small time window. Data Domain appliances are a very good example of how disk-based backup appliances have shaped up. Additionally, backup appliances provide some desirable features such as compliance support, where the system can be configured to ensure immutability of data once written for a duration defined by the administrator, or automatic data shredding, where the data gets destroyed when someone tries to access it from disk without going through the proper authentication procedure.
Compared to archival storage, Tier-1 storage employs high-end, faster disks [15K RPM], quite often along with SSDs [Solid State Disks]. SSDs are the new favourite in this segment, with vendors like Samsung and SanDisk competing with each other to bring out new products that are cheaper, denser and longer-lasting. SSDs are a lot faster than disks and support true random read/write. With fast-falling prices, higher capacities and increased lifetimes, solid-state drives are finding their place in a large way in Tier-1 storage gear. Other advantages of SSDs are that they occupy less physical space, consume less electrical power and can transition from offline to online a lot quicker than disks. It will, however, take some time before we see SSDs completely replace disks in this tier.
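The deduplication idea mentioned above can be sketched in a few lines. This is a toy version that assumes fixed-size chunks and SHA-256 fingerprints; it is not how any particular appliance works, and the class `DedupStore` is made up for illustration.

```python
import hashlib

class DedupStore:
    """Toy deduplicating store: identical chunks are kept only once."""

    def __init__(self, chunk_size=4096):
        self.chunk_size = chunk_size
        self.chunks = {}  # fingerprint -> chunk bytes (stored once)
        self.files = {}   # file name -> ordered list of fingerprints

    def write(self, name, data):
        recipe = []
        for start in range(0, len(data), self.chunk_size):
            chunk = data[start:start + self.chunk_size]
            fingerprint = hashlib.sha256(chunk).hexdigest()
            # Store the chunk only if this fingerprint has not been seen before.
            self.chunks.setdefault(fingerprint, chunk)
            recipe.append(fingerprint)
        self.files[name] = recipe

    def read(self, name):
        return b"".join(self.chunks[fp] for fp in self.files[name])

    def stored_bytes(self):
        return sum(len(chunk) for chunk in self.chunks.values())

# Two nightly backups that differ only in their last chunk.
store = DedupStore(chunk_size=4)
monday = b"AAAABBBBCCCC"
tuesday = b"AAAABBBBDDDD"
store.write("monday", monday)
store.write("tuesday", tuesday)

logical_bytes = len(monday) + len(tuesday)
print(logical_bytes, "bytes written,", store.stored_bytes(), "bytes stored")
```

Successive backups of the same system share most of their chunks, which is why the footprint reduction is so useful for backup workloads in particular.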
Fig 3: simple stack comparison - SAN, NAS and DAS
Fig. 4 Tiered NAS storage organization in Data Centre
Sometimes called primary or mission-critical storage appliances, Tier-1 storage gear provides fast, reliable storage for mission-critical data. It often provides multiple levels of redundancy in order to reduce data downtime. Since this gear is the most expensive of the lot, many storage vendors provide a mechanism to transparently move less active data to less expensive disk storage. This low-cost storage tier, sometimes referred to as near-line storage, is often made up of a large set of high-capacity but slower SATA disks [5400/7200 RPM]. NAS (Network Attached Storage) designs are inherently suited to this type of tiered use, which partly explains why NAS sells more than SAN. Also, SAN uses Fibre Channel or SAS disks, making it more expensive than NAS when the data is not mission critical [see slides for an illustrative comparison between NAS and SAN]. In either SAN or NAS, a single disk array must have all its disks of similar type and speed; for example, they will either all be high-speed FC disks or all be SAS. Either way, the higher-level data access semantics are built into the NAS/SAN software. NAS mimics the file access semantics provided by a file system, while SAN provides block access that file systems can use. So NFS (Network File System) and CIFS are the two primary interfaces that a NAS server supports, whereas iSCSI and FC are the two primary interfaces through which a SAN serves the file systems on host servers.
Fig 4 provides an illustration of a typical enterprise with two data centres, both simultaneously serving its users as well as providing storage replication service to the other site, a popular configuration to support site disaster recovery, while internally each data centre organizes data into 3 tiers. Tier-1 storage almost always comes in a primary-standby configuration in order to support high availability.

Cloud Storage

courtesy: HDS: Thin Provisioning with Virtual Volume

Cloud as a concept became popular only after virtualization became successful at large scale. With virtualization, one could have hundreds of virtual servers running on a single physical server. With that came software that made provisioning hundreds of applications a matter of running a few software commands, which could be invoked remotely over HTTP. The ability to dynamically configure servers using software brought in a new paradigm where an application can be commissioned to run across multiple virtual servers (communicating with each other over a common communication structure) and serve a large user base, entirely through software commands that an administrator can execute remotely from a desktop. This type of server provisioning demanded a new way of storage provisioning. The concept of a virtual volume, or logical storage container, became popular. Now one can define multiple containers residing in the same physical storage volume and provision them to the server manager remotely. Thin provisioning became predominant in storage provisioning; the idea is that a server is given a virtual volume that uses little physical storage to start with, and as the volume fills up, the physical storage allocated underneath grows on demand. The advantage is that one does not need to plan for all the storage in advance; as the data grows, one can keep adding more storage to the virtual volume, making the virtual volume grow. That decoupled physical storage planning from the server's storage provisioning. Storage provisioning became dynamic, like virtualized server provisioning. As long as the software can provision, monitor and manage the servers and the virtual volumes allotted to them over a software-defined interface, without errors and within acceptable performance degradation, the model can scale to any size. As is apparent, there is no real category called 'Cloud storage'; what we have, rather, is 'Cloud service'.
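The thin provisioning idea described above can be sketched as follows. This is a minimal model under my own assumptions (block-granular allocation from a shared pool, with made-up class names `StoragePool` and `ThinVolume`), not the design of any vendor's product.

```python
class StoragePool:
    """Shared physical capacity, counted in blocks."""

    def __init__(self, physical_blocks):
        self.free_blocks = physical_blocks

    def allocate(self):
        if self.free_blocks == 0:
            raise RuntimeError("pool exhausted: add physical storage")
        self.free_blocks -= 1

    def add_capacity(self, blocks):
        # Physical storage can be added later, independently of the volumes.
        self.free_blocks += blocks


class ThinVolume:
    """Virtual volume that consumes a physical block only on first write."""

    def __init__(self, pool, virtual_blocks):
        self.pool = pool
        self.virtual_blocks = virtual_blocks  # size advertised to the server
        self.mapping = {}                     # virtual block number -> data

    def write(self, block_no, data):
        if not 0 <= block_no < self.virtual_blocks:
            raise IndexError("block outside the virtual volume")
        if block_no not in self.mapping:
            self.pool.allocate()              # grow physical usage on demand
        self.mapping[block_no] = data

    def read(self, block_no):
        if not 0 <= block_no < self.virtual_blocks:
            raise IndexError("block outside the virtual volume")
        return self.mapping.get(block_no, b"\x00")  # unwritten blocks read as zero

    def used_blocks(self):
        return len(self.mapping)


# Two 100-block volumes are provisioned on a pool of only 4 physical blocks.
pool = StoragePool(physical_blocks=4)
vol_a = ThinVolume(pool, virtual_blocks=100)
vol_b = ThinVolume(pool, virtual_blocks=100)

vol_a.write(0, b"x")
vol_a.write(0, b"y")   # overwrite: no new physical block needed
vol_b.write(42, b"z")
print("free physical blocks:", pool.free_blocks)
```

The servers see 200 blocks in total while only two physical blocks are in use, so the administrator has to watch the pool and call something like `add_capacity` before it runs dry.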
Data centres are still designed and maintained the same way they have been designed and built all along, using a combination of NAS, SAN and DAS.
Cloud provides a software framework to manage the resources in the data centres by bringing them into a common sharable pool. Cloud in that sense is more about integrating and managing the resources and less about which storage technologies or systems per se are used in the data centre(s). Given that Cloud software is the essential element of a Cloud service, as long as the software is designed carefully, one can have any type of device or system below it, ranging from inexpensive JBOD (Just a Bunch Of Disks) storage arrays to highly sophisticated HDS, HP or EMC disk arrays or NAS servers. The figure below from EMC's online literature illustrates this nicely.
It is apparent that as the cloud grows larger and larger, the complexity and sophistication of the software increase manyfold, and so does the cost advantage of data storage. One can look at the cost of provisioning (server and storage) in public clouds like those of Google, Rackspace and Amazon and imagine the complexity and sophistication of their Cloud management software. Fortunately, many have published a version of their software as open source for others to learn from and try.

source: http://managedview.emc.com/2012/08/the-software-defined-data-center/
courtesy: EMC





Further Reading:
Brocade Document on Data centre infrastructure

My slides on slideshare
