The concept of Thin Provisioning was first introduced to the Open Computing world of SANs when DataCore Software Corporation released the revolutionary "Network Managed Volume" feature in their SANsymphony product in early 2002 -- quite some time before 3Par and others introduced their own Thin Provisioning implementations. DataCore renamed the feature "Auto Provisioning" in their SANmelody Storage Server products, but it's the same mature code set found in the SANsymphony suite.
Thin Provisioning has many merits:
Thin Provisioning also makes possible another concept called Over Subscription (or Over Provisioning): publishing volumes whose cumulative advertised capacity exceeds the physical amount of storage available in the pool.
Thin Provisioning Versus Over Subscription
Some industry pundits (usually Vice Presidents of storage vendors whose offerings don't include Thin Provisioning) have little good to say about the feature. Their principal argument against Thin Provisioning runs along the lines of, "it's a lie; you're telling a user they have more storage than they really do. It can create headaches for the storage admins." Their arguments fail to differentiate between Thin Provisioning and Over Subscription, two distinct concepts. Administrators of Thin Provisioning are not by any means required to employ Over Subscription (although there are many cases where they will want to do so, which we will discuss in a later section).
Thin Provisioning in and of itself simplifies storage management, even without Over Subscription, by simplifying storage allocation. New volumes can be created with a simple click of the mouse without regard to finding a large enough contiguous block to hold the new volume.
An Example Sans Auto-Provisioning: Wasted Resources
Let us consider a project of provisioning storage using a non-Thin Provisioning storage array. For the sake of simplicity we will use a small array with only 1TB of physical storage. We will not concern ourselves with the specific composition of the storage (the type of disks or the RAID groups employed). Suffice it to say that the management console of the array will permit us to slice up the storage resources into "partitions" and those partitions will become the logical volumes that we need to complete our project.
Our project requires us to create four volumes for various applications. We have already determined the size of the data requirements, but we wish to allocate additional capacity for each of the volumes in anticipation of future growth. Table 1 below indicates the current sizes of our data, the sizes of the volumes we will create, and the space that will consequently be reserved or allocated for those volumes.
Our total data is only 540GB of space, but we have almost fully utilized the 1TB of storage in the array, leaving us only 50GB of free space. Thus we are effectively only using 57% of the allocated storage, while 43% is wasted, statically reserved for future growth of the four existing volumes, but unusable should other volumes need to be deployed.
An Example with Thin Provisioning: Fully Utilized Storage Resources
Now let us consider the same example using Thin Provisioning. We use the Thin Provisioning manager's GUI to instantly allocate the four volumes without creating any partitions and without any notion of where they will be placed in the pool. We then set the volume sizes as per our plan. The Thin Provisioning manager makes the allocation of the four new volumes as simple as a few mouse clicks.
At the time of creation, no space will have been taken from the pool. It is only when we begin writing data onto the volumes that space will be allocated to them. So in our example, we assign or map the four volumes to their respective application servers. From those servers, we format the volumes and write our data onto them. Table 2 reveals that we have only allocated from the pool the physical disk space required to hold the current data.
So far, there has been no Over Subscription. With little effort we have created four volumes totaling 950 GB, nearly all of our 1TB pool. Our applications could fully utilize their allotted storage as there is enough physical storage in the pool, but on examining the Thin Provisioning manager, we determine that the pool is only about 54% allocated. In fact, as we populated our volumes, the Thin Provisioning manager only allocated the storage necessary to hold the real data. We still have 46% of our storage pool remaining, available should we need to use it.
An Example of Over Subscription: No-More-Headaches Storage Management
Let us now suppose a few months have passed and our applications continue using the available storage of the pool, each growing at some rate. One day we need to create another volume for a project that will require about 90GB today, but could grow to 200GB. We use the Thin Provisioning manager to create another volume from the pool and size it to 200GB.
At this point, we have over subscribed: we have provisioned more storage to volumes than we physically have in the pool. See Table 3 below.
Note that in spite of the fact we are over subscribed by 150GB, the pool is still only 67.5% utilized, and 325GB of storage remains available for the growth of the 5 volumes. Without Thin Provisioning, we would have had to acquire additional physical storage in order to add the new 200GB volume.
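The arithmetic behind the three examples can be checked with a short sketch. The function and key names below are purely illustrative (they are not part of any DataCore tool), and the figures are only the totals quoted in the text, with 1TB taken as 1000GB.

```python
def pool_report(physical_gb, provisioned_gb, allocated_gb):
    """Summarize a storage pool from three totals, all in GB."""
    return {
        # Share of the physical pool actually taken from it.
        "utilization_pct": 100.0 * allocated_gb / physical_gb,
        # Physical space still available for new volumes or growth.
        "free_gb": physical_gb - allocated_gb,
        # Positive only when advertised capacity exceeds physical capacity.
        "oversubscribed_gb": max(0, provisioned_gb - physical_gb),
    }

# Without Thin Provisioning: all 950GB is reserved up front.
thick = pool_report(1000, 950, 950)
# Thin Provisioning, four volumes: 950GB advertised, 540GB written.
thin = pool_report(1000, 950, 540)
# After adding the fifth 200GB volume: 1150GB advertised, 675GB written.
over = pool_report(1000, 1150, 675)

for name, report in (("thick", thick), ("thin", thin), ("over", over)):
    print(name, report)
```

Only the last case reports a non-zero over-subscription figure, even though the thin and thick cases advertise exactly the same volume sizes.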
What Happens When the Pool is Depleted?
What happens, indeed? Well, for a given volume based on the depleted pool, I/O operations will continue as usual: reads will complete successfully as expected, and writes to pre-existing blocks will also complete correctly as expected. All will continue fine until such time as the server or servers using the volume attempt to write a block for which storage has not yet been allocated. The pool manager will attempt to allocate a storage allocation unit from the depleted pool, and as the pool is empty, the write will fail and an error will be returned by the controller. How the error will be handled is entirely dependent on the volume's associated servers' OS and respective application. The OS may mark the volume as failed and take it offline; the application may flag errors and halt or may simply log the error and continue running.
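The behavior described above can be modeled in a few lines. This is a toy sketch, not DataCore's implementation: the class and exception names are hypothetical, and the pool is reduced to a simple counter of free allocation units.

```python
class PoolDepleted(Exception):
    """Raised when a write needs a new allocation unit and none remain."""


class ThinPool:
    def __init__(self, total_units):
        self.free_units = total_units
        self.allocated = set()  # (volume, unit_index) pairs backed by storage

    def write(self, volume, unit_index):
        key = (volume, unit_index)
        if key in self.allocated:
            return "overwrite"        # already backed: always succeeds
        if self.free_units == 0:
            raise PoolDepleted(f"no unit left for {volume}[{unit_index}]")
        self.free_units -= 1          # allocate on first write
        self.allocated.add(key)
        return "allocated"

    def read(self, volume, unit_index):
        # Reads never allocate, so they still work when the pool is empty.
        return (volume, unit_index) in self.allocated


pool = ThinPool(total_units=2)
pool.write("vol1", 0)
pool.write("vol1", 7)              # pool is now depleted
print(pool.write("vol1", 0))       # rewriting an allocated unit still works
try:
    pool.write("vol1", 3)          # first write to an unallocated unit
except PoolDepleted as err:
    print("write failed:", err)
```

What the server does with that failed write is outside the model, just as it is outside the controller's control.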
One thing is clear: if you're the storage administrator (or the director of the IT department), you want to avoid the "depleted pool" situation like the plague. It is, in simple terms, a "bankrupt" situation, similar to having run up a huge credit card bill and finding you can't make the payment when the bill comes due.
Over subscription is a very powerful tool that can greatly simplify your storage management and give you more control over storage acquisition -- buying storage when you actually need it. Nonetheless, just as with your credit card, you need to use it responsibly and monitor your storage pools. DataCore's Thin Provisioning allows you to set depletion thresholds as a percentage, alerting you when the pool reaches a low water mark. You can have emails dispatched to warn you of depletion, or even of a "run away train" type of depletion caused by a misbehaved application.
We'll talk more about Over Subscription later when we discuss best practices for using Thin Provisioning.
Growing The Pool With No Downtime
The abstraction layer created by Storage Pooling allows us to add storage to the pool at any time and without disrupting production. Since our Thin-Provisioned volumes are based on a virtual geometry and not on any particular physical disk structure or RAID group, we can easily add more disks or LUNs to the storage pool without modification to those volumes and without informing their associated application servers. We are, in effect, pouring more storage into the storage pool from which our virtual volumes feed.
If the new physical storage can be added without shutting down the server, the procedure is as simple as discovering the new physical storage (i.e. "Rescan Disks" to find new LUNs) and adding them to the pool.
If adding the new storage requires shutting down the server, using synchronous mirroring between two storage servers can allow you to bring down the server without interrupting production: you simply fail all the storage server's active volumes over to the surviving server, effectively performing a "rolling upgrade".
Thin Provisioning With VMware VMFS Volumes
Using Thin Provisioning with your VMFS volumes radically improves capacity utilization and simplifies allocating space for VMDKs. It also helps to decouple VM Server Sprawl from capacity allocation, meaning your calls to the storage vendor are less frequent. [More on this later as I find time...]
Thin Provisioning With Snapshots
Using Thin Provisioning with point-in-time Snapshots can dramatically improve storage utilization. Thin Provisioned snapshot volumes take the guesswork out of sizing snapshot reserve space and permit the sharing of space among many snapshots. In fact, Snapshots were the mother of DataCore's invention of Thin Provisioning.
DataCore implements Snapshots using Copy-On-First-Write technology: Changes made to original blocks on a snapshot-enabled source volume cause the original blocks to be copied to the snapshot destination volume before the changes are committed, overwriting the original blocks of the source volume. Using Thin Provisioning means you no longer have to pre-allocate large amounts of reserved disk space to hold the snapshot image - physical storage will be allocated from the pool as needed to hold the original blocks when the source is changed.
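A minimal sketch of the Copy-On-First-Write idea follows. It is a generic illustration rather than DataCore's code: the class name is hypothetical, the source volume is modeled as a Python list of blocks, and the snapshot destination is a dictionary that only grows when a source block is changed for the first time.

```python
class CowSnapshot:
    """Point-in-time view of `source`, a list of blocks."""

    def __init__(self, source):
        self.source = source
        self.preserved = {}   # block index -> original contents

    def write_source(self, index, data):
        # First write to this block since the snapshot: save the original.
        if index not in self.preserved:
            self.preserved[index] = self.source[index]
        self.source[index] = data

    def read_snapshot(self, index):
        # Preserved copy if the block has changed, else the live source.
        return self.preserved.get(index, self.source[index])


volume = ["a", "b", "c"]
snap = CowSnapshot(volume)
snap.write_source(1, "B")     # copies "b" to the snapshot, then overwrites
snap.write_source(1, "BB")    # second write to the same block: no copy
print(volume)                                     # current data
print([snap.read_snapshot(i) for i in range(3)])  # data as of the snapshot
print(len(snap.preserved), "block(s) held by the snapshot")
```

The snapshot holds storage only for blocks that have actually changed, which is why a thin-provisioned destination volume needs no up-front reserve.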
Consider the example of taking a nightly snapshot of our volumes for backup purposes. [To be continued...]
Best Practices Using Thin Provisioning
Following are a few "best practices" you should consider when deploying storage pools, creating Thin Provisioned volumes, and choosing their logical sizes. Some of these items will seem contradictory; in fact, this is normal, as every SAN deployment is different: different performance requirements, different numbers of spindles and capacities, different types of I/O. Keep that in mind as you consider these best practices.
Use Over-Subscription As A Credit Instrument Not As Free Money
First and foremost, resist the urge to go "hog-wild" over-subscribing. The Thin Provisioning manager makes it so easy to allocate storage that you can very quickly over-subscribe your storage by several hundred percent.
Think of Over Subscription as a credit instrument. Just like a credit card, it would be unwise to spend, spend, spend till you've reached your credit limit and discover you don't have enough money to pay the bill at the end of the month. It doesn't mean the credit card is a bad thing; it just means you didn't know how to use it.
I've seen numerous cases of newbies "kicking the tires" of DataCore's free SANmelody evaluation software blindly running into Over-Subscription problems. The typical scenario involves setting up a test SANmelody server that may have a few small drives in a RAID group, with which they make a LUN of maybe 270GB of storage. They place the LUN in a storage pool and carve out a virtual volume, leaving it at the default size of 2TB. The virtual volume is then presented to a Windows test server with IOmeter installed. The IOmeter test is quickly set up using defaults. The test runs brilliantly for several minutes and then comes to a screeching halt as IOmeter attempts to write beyond the 270GB available in the storage pool. User error.
As you begin with Thin Provisioned storage pools, I recommend you not necessarily over-subscribe until you've understood your "spending habits". You may already know how certain of your volumes have grown over time. You can use that information to size their new Thin-Provisioned existence to cover their future growth.
Right-Size Your Virtual Volumes
Further to the previous recommendation on Over-Subscription, avoid creating unnecessarily large volumes. For instance, in DataCore's implementation of Thin Provisioning, a new virtual volume has a default 2TB size. Before presenting the volume, you should size it to an appropriate size for the application.
For instance, if you plan to use the virtual volume in a typical ESX environment as VMFS storage for your VMs, it would be unwise to leave it at 2TB as you may be setting yourself up for bad performance. Read this best practices white paper to learn more.
Unnecessarily deploying 2TB volumes can have other implications in the datacenter, right down to simple things like tape backups and restores.
Over-Subscription is indeed a tool to facilitate provisioning and avoid purchasing physical storage until you actually need it. Use it intelligently and size your volumes to the maximum sizes you expect them to grow to.
Use Storage Pools To Tier Your Storage
Proper planning of the number and tiering of your storage pools will give you flexibility and the best performance characteristics. DataCore's SANsymphony and SANmelody products allow you to create multiple Storage Pools so you can create tiered storage. For instance, you might create a "high performance" pool that contains 15K RPM Fibre or SAS drives in RAID-10, and another "capacity" pool of 750GB SATA drives in RAID-5.
With DataCore, the Pool Manager is very flexible and will allow you to place any type of storage into any pool you've created. You can even toss individual 5400 RPM SATA drives in the same pool with your 15K RPM 4Gb FC LUNs if you so choose. Best practices and a little common sense would dictate that wouldn't be such a great idea.
I've observed that as users discover storage tiering, they are likely to deploy one of two extreme strategies: -1- Put all the spindles into one large pool; -2- Over-tier the spindles by excessively deploying pools. On the surface, arguments can be presented for each of these extremes. In practice, the best strategy is often somewhere between the two extremes. Exactly where between those two extremes depends on a number of things: spindle counts, spindle types (SAS, SATA, FC, iSCSI, SSD, RAID levels, RPMs), and application (snapshots, test/dev, production, etc). The next couple of "best practice" points can help steer you to the best strategy to use in your environment.
Name Your Pools Descriptively
The DataCore Storage Pool Manager allows you to name the pools for administrative purposes. Volumes created from the pool sport the pool name in the interface so you can quickly tell from whence they come. I therefore name my pools so I know what type of storage they are comprised of. As DataCore's SANmelody and SANsymphony are multi-node "storage clusters", I also incorporate the node's hostname into the pool name to simplify management when analyzing synch mirror performance, volume provisioning, etc.
Here are a few examples of how I name pools.
|Pool Name||Pool Description|
|A-SAS-R5-15K||RAID5 comprised of 15K SAS drives on SANmelody A|
|B-SSD-R5||RAID5 based on SSD drives on SANmelody B|
|B-SATA-7200-R6||RAID6 comprised of 7200 SATA drives on SANmelody B|
|S3-FC-R5-15K||RAID5 comprised of 15K Fibre Channel drives on SANsymphony node 3|
|S1-XISE-R5-15K||LUNs from Xiotech Emprise 5000, using RAID5 on 15K FC Datapacs, allocated to SANsymphony node 1|
|S2-CX4-R5-15K||LUNs from a CX4-120, using RAID5 on 15K FC, allocated to SANsymphony node 2|
|A-ISCSI-SNAPSHOT||RAID'ed iSCSI LUNs from a Lefthand or EqualLogic appliance, used for snapshot volume clones|
Build Pools That Minimize Idle Spindles
By their very nature, Storage Pools aggregate LUNs. DataCore's Storage Pools stripe the LUNs, which really turbo-charges performance. The pool manager builds the stripe based on the provisioning granularity the administrator chose when creating the pool. This provisioning granularity is referred to as the "Storage Allocation Unit" or SAU size. As new writes to the pool require provisioning, that provisioning will occur based on the Storage Allocation Unit size and be taken from each pool member in a Round Robin fashion, in much the same way as a RAID0 stripe. The pool manager can thus maintain concurrent I/O operations, reading from multiple LUNs at the same time it is writing to the others in the pool.
And what does this do for performance? A lot! Consider a traditional array that does not implement storage pooling. If your particular database needs maximum performance, perhaps you've created a dedicated partition on a RAID5 stripe of nine expensive 15K drives. At 15K RPM, you'll get at best 190 IOPS per spindle; counting eight of the nine spindles (one spindle's worth of capacity goes to parity), that yields 190 x 8 = 1520 IOPS. Now consider a pool with six RAID5 LUNs, each comprised of nine 7200 RPM disks. Counted the same way, we effectively have a pool with 6 x 8 = 48 spindles. At 7200 RPM, we can expect at best only 90 IOPS per spindle, yielding 90 x 48 = 4320 IOPS for the pool. That is over 2.8 times faster, using more spindles, albeit much cheaper spindles… but as much as you need the performance, perhaps you really can't justify 54 spindles' worth of capacity dedicated to that application.
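The comparison is simple enough to reproduce. The sketch below uses the text's rule of thumb (best-case IOPS per spindle, with one spindle per RAID5 group discounted); the function name is illustrative, and real-world figures will depend on workload, cache, and controller.

```python
def raid5_iops(disks_per_group, groups, iops_per_spindle):
    # Following the text, discount one spindle per RAID5 group.
    counted_spindles = (disks_per_group - 1) * groups
    return counted_spindles * iops_per_spindle

dedicated = raid5_iops(9, 1, 190)   # one group of nine 15K drives
pooled = raid5_iops(9, 6, 90)       # six groups of nine 7200 RPM drives

print(dedicated, "IOPS dedicated")
print(pooled, "IOPS pooled")
print(round(pooled / dedicated, 2), "times the dedicated figure")
```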
Here's the payoff: the pool can be shared by many applications, and more than likely you will have many applications that aren't I/O intensive. For instance, consider all those C: drives of your application servers. Once they've booted, some only infrequently touch the disk. If you have isolated those drives onto a separate RAID5, once they've booted, the drives in that RAID5 are now basically "idle spindles", not contributing to the overall performance of the SAN. And as spindles are almost always the bottleneck in your datacenter, wouldn't it be better to take an "all hands on deck" / "all spindles contributing" approach?
Consider The Storage Vendor's Provisioning Recommendations
Some storage vendors may implement their own "in the box" LUN virtualization. For instance, Xiotech's Emprise 5000 has a unique platter-striping system called "RAGS" that radically improves performance over traditional arrays and offers a number of unique benefits (such as self-healing storage). Xiotech's field engineers may make specific recommendations on how to get the extreme maximum performance out of the box.
Similar story for HP's EVA. The "V" in EVA refers to the virtual nature of their RAID groups; the EVA stripes across all spindles. If HP's documentation makes specific recommendations about how to get the best performance out of the box when carving up LUNs, follow it.
A well-implemented pool manager in a Storage Virtualization platform such as DataCore's should be flexible enough to not impede the storage vendor's performance tweaks, and as the spindles are ultimately responsible for performance, you should follow the storage vendor's recommendations for how to carve up their LUNs.
Choose Storage Allocation Sizes Based On Pool Purpose
Hopefully, your chosen vendor's Thin Provisioning system allows you to choose the provisioning granularity. I know there are several traditional storage vendors that boast "Thin Provisioning" that isn't very… thin. With DataCore, you select the granularity of provisioning when you create the storage pool. The granularity is referred to as a Storage Allocation Unit and is expressed in megabytes, in multiples of two, with a default of 128MB.
In the DataCore implementation, the Storage Allocation Unit (or SAU) is the correlation between the virtual volume's logical surface and the physical storage allocated from the pool. For instance, if you choose a storage pool SAU of 16MB, then consider that the pool's member LUNs will be striped and sliced into a matrix of 16MB chunks. Resulting virtual volumes coming from the pool will also be logically sliced into 16MB chunks.
Consider a storage pool with an SAU of 16MB and containing 3 member disks. Suppose we create our first virtual volume from the pool. The first block of storage, taken from Disk 1, will represent logical block 0 of the virtual volume; that is to say, the first byte on the disk and the consecutive 16 megabytes. Imagine the application were writing a 512MB contiguous file. In this case, once the first 16 megabytes [0..15] were written, the pool manager would allocate another SAU, this time from Disk 2, to hold the next 16 megabytes [16..31], then another from Disk 3 for megabytes [32..47], then back to Disk 1 for megabytes [48..63] and so on.
However, if the application were randomly writing small 64-byte chunks, say in blocks [25, 398234, 756678, 986547], given a 16MB SAU size those writes would correspond with the virtual volume's SAUs [0, 12, 23, 30] (counting SAUs from zero and assuming 512-byte blocks). Thus the pool manager would allocate a 16MB SAU from Disk 1 for block 25, then a 16MB SAU for block 398234 from Disk 2, again a 16MB SAU for block 756678 from Disk 3, and finally a 16MB SAU for block 986547 from Disk 1. You can see that although the application has only written 4 * 64 = 256 bytes, we've allocated 64MB from the pool.
While this example may seem a bit far-fetched, it nonetheless underscores the value of being able to choose the granularity of thin provisioning depending on the pool's purpose. For most applications and file systems, such as Windows / NTFS and VMware ESX / VMFS, DataCore's default of 128MB is fine. However, certain file systems and applications such as EXT3 or Snapshots will benefit from a smaller granularity, such as 1 or 2MB.
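The mapping from logical blocks to SAUs, and from SAUs to pool members, can be sketched as follows. Two assumptions are made that the text does not state outright: logical blocks are 512 bytes, and SAUs are numbered from zero (so block 25 falls in SAU 0). The function names are illustrative only.

```python
BLOCK_SIZE = 512                 # bytes per logical block (assumption)
SAU_SIZE = 16 * 1024 * 1024      # 16MB Storage Allocation Unit

def sau_index(block):
    """Zero-based SAU of the virtual volume that contains `block`."""
    return block * BLOCK_SIZE // SAU_SIZE

def allocate(blocks, disks=3):
    """Give each newly touched SAU a pool disk in round-robin order."""
    placement = {}
    for block in blocks:
        sau = sau_index(block)
        if sau not in placement:
            # Disk 1, 2, 3, then back to Disk 1, in allocation order.
            placement[sau] = len(placement) % disks + 1
    return placement

placement = allocate([25, 398234, 756678, 986547])
for sau, disk in placement.items():
    print(f"SAU {sau} -> Disk {disk}")
print(len(placement) * 16, "MB allocated for", 4 * 64, "bytes written")
```

A second write anywhere inside an already allocated SAU costs nothing further, which is why sequential workloads fare much better than this scattered one.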
Carefully Plan Your Snapshot Storage Strategy
As a matter of best practice, DataCore advises you should use a separate storage pool for your snapshot volumes... and for good reason. Keep in mind that most SAN controllers implement snapshots using Copy-On-First-Write technology. In this case, every time a write occurs to a snapshot source volume, the snapshot manager must first determine whether the new write will overwrite the original source data. If so, it must copy a snapshot chunk from source to destination disk. That involves a read from the source volume, with the corresponding write to the snapshot destination volume, and finally the new write is written to the source. One read, two writes for each "first time" we write to an as-yet-uncopied chunk of the snapshot source.
Imagine you've placed the source volume and snapshot destination volume in the same pool. Now imagine that pool contains one pool member. Even if the writes are cached, you can see that if there are frequent writes to the source volume, there will be a corresponding degradation in performance due to the copy operation as the snapshot destination volume is populated with the original source data.
Further, if you plan to use the snapshot for test/dev or tape-backup, etc., sharing a LUN for snapshot source and destination volumes will only further degrade performance as two applications contend for the LUN and its I/O queue.
We've already discussed provisioning granularity and snapshots. Most Copy-On-First-Write implementations will use a small "snapshot chunk" size (such as 64KB) for efficiency. If you're placing your snapshot destination volumes in a production pool with a provisioning granularity of 128MB, the thinly-provisioned snapshot destination volumes will grow by 128MB even as small changes to the source cause Copy-On-First-Write 64KB chunk copies.
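To put rough numbers on this, the sketch below compares a 128MB and a 1MB SAU size for the same set of scattered changes. It assumes (this is an assumption, not a documented behavior) that each copied 64KB chunk lands at the same logical offset on the snapshot destination as on the source; the function name and the sample offsets are made up for illustration.

```python
KB, MB = 1024, 1024 * 1024

def snapshot_allocation(changed_offsets, sau_size, chunk_size=64 * KB):
    """Return (bytes provisioned, bytes actually copied) for a snapshot
    destination, given byte offsets of first-time writes on the source."""
    saus = {offset // sau_size for offset in changed_offsets}
    chunks = {offset // chunk_size for offset in changed_offsets}
    return len(saus) * sau_size, len(chunks) * chunk_size

# Ten small changes scattered 1GB apart on the source volume.
offsets = [i * 1024 * MB for i in range(10)]

for sau in (128 * MB, 1 * MB):
    provisioned, copied = snapshot_allocation(offsets, sau)
    print(f"SAU {sau // MB}MB: {provisioned // MB}MB provisioned "
          f"for {copied // KB}KB copied")
```

This is the worst case, with every change landing in a different SAU; changes clustered close together would share SAUs and waste far less.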
Using a separate storage pool for snapshot destinations should yield a more deterministic and improved performance when snapshots are active, as well as better capacity utilization.
Ultimately, your pool strategy will largely be determined by the number of spindles or LUNs you have available, and this will vary wildly between large enterprises and small shops. Those with SAN "appliances" having limited spindles will need to strike the right balance between spindle counts and capacities reserved for snapshots.
For instance, if you have 16 spindles available, splitting the spindles into two RAID5 groups (one for production data, the other for snapshots) would be impractical and wasteful. On the other hand, one could imagine creating two 7-disk RAID5 groups for production data, one disk as a hot spare, and the remaining disk for your snapshot pool. Sounds good from the point of view of "snapshot reserved capacity", but that single disk will gate the performance of all writes on source volumes with active snapshots. To give you some idea, if you are using 15K spindles, the two 7-disk RAID5 groups in a storage pool would yield roughly 2280 IOPS, compared to 190 IOPS for the single spindle in the snapshot pool. While storage controller caching may somewhat mask the performance disparity, during heavy traffic the single spindle of the snapshot pool will have a significant impact on performance.
In this particular case, given 16 disks in an appliance, you may find it preferable to create 3 RAID5 sets of 5 disks each, keeping a disk as your hot spare. From there, you may find you get the best performance by placing all three of the resulting LUNs in the same storage pool, thus maximizing spindles and I/O queues in the pool, and spreading the snapshot blocks across the production data. Again, _not_ best practice, but it may be the best option given your limited number of spindles. As with all things related to storage performance, "Your mileage may vary" and only by testing will you be able to determine the best strategy for your environment.
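The two 16-disk layouts can be compared with the same rule of thumb used earlier (190 best-case IOPS per 15K spindle, one spindle per RAID5 group discounted). This is back-of-the-envelope arithmetic with illustrative names, not a performance prediction.

```python
IOPS_15K = 190  # best-case IOPS per 15K spindle, the figure used in the text

def raid5_group_iops(disks):
    # Rule of thumb from the text: discount one spindle per RAID5 group.
    return (disks - 1) * IOPS_15K

# Layout A: two 7-disk RAID5 groups for production, one hot spare,
# and one lone disk as the snapshot pool (14 + 1 + 1 = 16 disks).
layout_a = {"production": 2 * raid5_group_iops(7), "snapshot": IOPS_15K}

# Layout B: three 5-disk RAID5 groups in one shared pool, plus one
# hot spare (15 + 1 = 16 disks).
layout_b = {"shared": 3 * raid5_group_iops(5)}

print(layout_a)
print(layout_b)
```

By this estimate the shared pool offers the same raw figure as the production pool of the first layout, but snapshot copies are no longer funneled through a single spindle. They do compete with production I/O instead, which is why testing remains the only reliable guide.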
Do Not Defragment Thin-Provisioned Volumes
I've written an entire paper on the subject of defragmentation. While I highly recommend defragmenting your local storage (such as your desktop or laptop internal drives), there are numerous reasons why you should avoid defragmenting shared "logical" or "virtual" storage. Defragmenting a Thin-Provisioned volume will cause it to grow quickly as the defragmenting engine moves blocks around, likely causing unnecessary storage allocation. You can read more about the impact of disk defragmentation in this white paper.
Actively Monitor Storage Pools
If you over-subscribe storage, I recommend following the vendor's recommendations on implementing active pool monitoring to avoid ever having a pool completely deplete. The DataCore Pool Manager allows you to use Microsoft's events and performance counters to send emails or alerts when pool depletion reaches a low-water threshold. You can also use the same techniques to warn you of "runaway train" situations, where perhaps an application begins rapidly depleting a pool.
The storage pool level in an over-subscribed environment is probably the first and most important metric of the SAN that you should be monitoring, and setting up automatic alerts via email is a simple way to make that monitoring "hassle free".
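The two kinds of alert described above (a low-water threshold and a "runaway train" depletion rate) boil down to a couple of comparisons. The sketch below is generic and does not use any DataCore or Windows API: the function name, the threshold defaults, and the sample readings are all hypothetical, and delivering the alert by email is left out.

```python
def pool_alerts(samples, low_water_pct=20.0, max_drop_pct_per_hour=5.0):
    """samples: list of (hours_elapsed, free_pct) readings, oldest first."""
    alerts = []
    _, latest_free = samples[-1]
    # Alert 1: free space has fallen to or below the low-water mark.
    if latest_free <= low_water_pct:
        alerts.append("low water mark reached")
    # Alert 2: free space is dropping faster than the allowed rate.
    if len(samples) >= 2:
        (t0, f0), (t1, f1) = samples[-2], samples[-1]
        if t1 > t0 and (f0 - f1) / (t1 - t0) > max_drop_pct_per_hour:
            alerts.append("runaway depletion")
    return alerts

print(pool_alerts([(0, 60.0), (1, 58.0)]))   # normal growth
print(pool_alerts([(0, 60.0), (1, 40.0)]))   # sudden drop
print(pool_alerts([(0, 30.0), (1, 15.0)]))   # sudden drop and low pool
```

The rate check matters because a misbehaving application can empty a pool long before a percentage threshold alone would give you time to react.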
Determine A Schedule For Occasional Storage Reclamation
If you have experience using Thin Provisioning today, you may have noticed that deleting large files doesn't cause a corresponding reduction in provisioning for the associated LUN. Indeed, even deleting the file system doesn't release the provisioned storage. That's normal, considering block level storage devices (disks, SAN LUNs) don't have a "delete" command... they can only be mounted, unmounted, read, written. Indeed, deleting files from a file system actually results in writes to the disk: typically, the file system writes a bit in its catalog indicating the file is "deleted".
In short, storage thin-provisioned to a volume remains with the volume for its life, unless the vendor has a means of identifying and reclaiming provisioned Storage Allocation Units no longer in use.
As far as I know, as of today the only vendor offering a reclamation feature for Thin Provisioned volumes is DataCore. Their SANmelody and SANsymphony packages have a reclamation utility that can be used to return provisioned but unused storage allocation units back to their associated pools. The utility can be called by script or set up to run as a scheduled task.
Depending on provisioning trends in your environment, you may find you never need to reclaim storage. On the other hand, you may find that you can benefit from occasional or scheduled reclamation.
If you find that your pools are frequently depleting and requiring reclamation, you probably need to do some analysis to find out what is causing the depletion. Here are a few possibilities you might want to look into:
|Possible Causes||Possible Solutions|
|• Defragmenting volumes causing excessive provisioning.||• Don't defragment SAN volumes. Read this white paper on defragmenting SAN storage.|
|• Snapshots of NTFS volumes mysteriously growing at unexpected rate.||• Verify that the snapshot volumes' associated pool has a provisioning granularity that is suitable for snapshots. Many vendors' snapshot implementations copy only small chunks (e.g. 64KB) when a Copy-On-First-Write occurs. Consider a snapshot pool with a 1MB or 2MB SAU size.|
| ||• May be related to Last Access Writes. Read this white paper on the subject.|
| ||• If the snapshots are of Boot From SAN volumes, the snaps may be growing due to writes to associated page files, temp caches, etc.|
| ||• Is someone performing some action (Defragmentation, Restore From Backup, etc.) on the snapshot's associated source volume? Such actions cause "Copy On First Writes", which is the expected behavior of the snapshot.|
|• VMFS volumes mysteriously growing at unexpected rate.||• Do you have active ESX snapshots on the VMs? Each and every write to a VMDK with an active snapshot will cause the snapshot file to grow... thus resulting in additional provisioning on the VMFS file system's associated LUN.|
| ||• Unless you are reserving memory for your VMs, ESX will create a swap file equal to the size of the machine's RAM when the VM is started. If your ESX hosts are limited on memory, as a last resort, ESX will swap unused pages out to the swap file. This phenomenon is typically associated with poor performance from your VMs.|
Storage Pooling and Thin Provisioning are very powerful tools in a SAN environment, and when
used properly will considerably simplify storage management. Understanding how Thin Provisioning
works and following the best practices above will make a big difference in your experience. Hopefully
this paper will be of some value for you.