
Maximizing Storage Performance With VMWare ESX and Virtual Server Environments

A White Paper by Tim Warden, April 12, 2008

Over the course of the last few years, I have had the opportunity to install numerous SAN storage solutions for customers implementing server virtualization and requiring shared storage. These experiences and my own testing in the lab have led me to several observations about how best to configure your shared file systems and their underlying storage in virtual server environments. The purpose of this paper is to consolidate and share these "storage best practices".

HOW MANY VM'S SHOULD I DEPLOY PER DATASTORE?

One of the most common misunderstandings when configuring shared storage for a virtual server environment has to do with laying out VMs on the shared LUNs. The tendency I've observed is to create one or two large LUNs and place all the VMs on the one or two resulting Datastores. Indeed, when I first began working with ESX and SANmelody, I simply created two LUNs, one from each of my active SANmelody controllers, and distributed the VMs between the two. It worked great.

However, working with several customers and partners over the last few years, I've come to realize that this is by no means the ideal configuration for a typical production environment. I've encountered cases where the customer had well in excess of 30 running VMs, all on a single LUN shared across two or more ESX hosts.

While it may seem obvious that a LUN with multiple VMs, each actively generating I/O, could pose a bottleneck, most admins assume the bottleneck is the media itself rather than the implementation of the protocol over the media. For instance, most will presume the bottleneck is the bandwidth of the 1Gb Ethernet links in the iSCSI "fabric". The reality is a bit more complicated.

The SCSI-3 specification on which Fibre Channel and iSCSI are based defines a mechanism for sharing target devices (disks or LUNs) among cooperating initiators, such as clustered servers. Referred to as "SCSI Reservations", it describes a lock / release mechanism for negotiating which initiator may write to the LUN at any given instant. A clustered server attempting to write to a LUN will acquire a "lock" on the LUN unless some other server already holds it. The reservation is released once the write has completed, so that other servers can subsequently write to the LUN. A clustered server attempting to write to a LUN that is already locked will receive a "Reservation Conflict" and must retry. These "reservation collisions" translate directly into performance degradation on the LUN, as I/O operations are delayed awaiting the retry. In extreme cases, after too many consecutive reservation conflict errors on the retries, the write operation will fail.
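To get an intuition for why the number of hosts sharing a LUN matters, here is a toy simulation of several initiators contending for a single per-LUN reservation. This is not ESX code, and the slot counts and probabilities are invented; the only point it illustrates is that the more hosts trying to write the same LUN in the same window, the more "Reservation Conflict" retries pile up.

```python
import random

# Toy, per-time-slot model of reservation contention on one shared LUN.
# Hosts, slots, and probabilities are invented for illustration; this is
# not how ESX or the SCSI layer is actually implemented.

def simulate(num_hosts, slots=10000, write_prob=0.3, seed=0):
    """Each slot, every host wants to write with probability write_prob.
    Only one host can hold the per-LUN reservation in a slot; the rest
    see a 'Reservation Conflict' and must retry later."""
    rng = random.Random(seed)
    grants = conflicts = 0
    for _ in range(slots):
        contenders = [h for h in range(num_hosts) if rng.random() < write_prob]
        if contenders:
            grants += 1                       # one reservation granted
            conflicts += len(contenders) - 1  # everyone else retries
    return grants, conflicts

for hosts in (1, 2, 4, 8):
    g, c = simulate(hosts)
    print(f"{hosts} host(s): {g} reservations granted, {c} conflicts")
```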

In the case of shared ESX Datastores or XenServer Storage Repositories, having too many VMs all attempting to write to the same LUN from different hosts will result in reservation conflicts. In the case of ESX, you can verify whether you are experiencing conflicts on a LUN by examining /var/log/vmkwarning on the ESX hosts.
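If you would rather not eyeball the log, a few lines of Python can tally the warnings for you. The sketch below is an assumption-laden convenience, not an official tool: it assumes the messages contain the phrase "reservation conflict" and that a device identifier appears on the same line, so adjust both patterns to match what your ESX release actually writes to /var/log/vmkwarning.

```python
import re
import sys
from collections import Counter

# Rough scan of an ESX host's vmkwarning log for reservation conflicts.
# The message text varies by release, so the 'reservation conflict'
# pattern and the device-identifier pattern below are assumptions.

CONFLICT = re.compile(r"reservation conflict", re.IGNORECASE)
DEVICE = re.compile(r"(naa\.\S+|vmhba\S+)")

def count_conflicts(path="/var/log/vmkwarning"):
    hits = Counter()
    with open(path, errors="replace") as log:
        for line in log:
            if CONFLICT.search(line):
                device = DEVICE.search(line)
                hits[device.group(1) if device else "unknown"] += 1
    return hits

if __name__ == "__main__":
    for device, count in count_conflicts(*sys.argv[1:]).most_common():
        print(f"{count:6d}  {device}")
```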

The solution? Avoid the monolithic "1 LUN" approach to building your logical storage architecture. Create multiple LUNs, distribute their presentation across the storage processors, and then spread your VMs among the resulting datastores based on expected or measured I/O load.

So how many VMs are too many? How many VMs per datastore? The answer differs depending on whose white paper or blog you read. I've heard that VMWare recommends between 5 and 16 per datastore; I've seen other articles recommend not exceeding 30. The diversity of answers underscores the fact that every environment is different. If you have 20 VMs on the same datastore doing very little I/O, this may well be a non-issue. However, if you have noticed a correlation between performance degradation and VM sprawl on a datastore, you should consider migrating some of the VMs to other datastores.

If you currently have one or two LUNs for the 10 or 20 VMs shared by your ESX servers, you may notice a remarkable improvement simply by redistributing those VMs over four or six LUNs. If you have a VM that is particularly I/O bound, consider moving its virtual disks to a dedicated datastore or, alternatively, to a Raw Device Mapping.
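If you'd rather reason about the redistribution than guess, a back-of-the-envelope script can help. The sketch below is only an illustration: the VM names and IOPS figures are invented, and the "measured I/O load" would come from esxtop or your own monitoring; it simply packs the busiest VMs onto the least-loaded datastore first.

```python
import heapq

# Greedy "least-loaded first" spread of VMs across datastores by measured
# I/O load. The VM names and IOPS numbers are invented for illustration;
# substitute figures from your own monitoring.

vm_iops = {
    "sql01": 900, "exch01": 600, "web01": 150, "web02": 140,
    "file01": 300, "dc01": 60, "dc02": 55, "build01": 400,
}

def distribute(vm_iops, num_datastores):
    # Min-heap of (current_load, datastore_index, [vms on that datastore])
    heap = [(0, i, []) for i in range(num_datastores)]
    heapq.heapify(heap)
    for vm, iops in sorted(vm_iops.items(), key=lambda kv: -kv[1]):
        load, idx, vms = heapq.heappop(heap)   # least-loaded datastore so far
        vms.append(vm)
        heapq.heappush(heap, (load + iops, idx, vms))
    return sorted(heap, key=lambda entry: entry[1])

for load, idx, vms in distribute(vm_iops, 4):
    print(f"datastore{idx + 1}: ~{load} IOPS -> {', '.join(vms)}")
```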

MATCHING SPINDLES TO APPLICATION REQUIREMENTS

Keep in mind that a VM may have multiple virtual disks. In the case of a Windows SQL Server VM, you might create a system/boot .vmdk for the C: drive, then a virtual disk for the transaction logs and yet another virtual disk for the tables. Having all three .vmdk's on the same datastore will probably not yield the best results.

But let's consider why you would have isolated the different data types on different virtual disks in the first place.

(to be continued -- in short, some I/O is write-intensive, such as DB redo logs, whereas other I/O will be read-oriented. RAID 5 performs well for reads but carries a high overhead for writes. So for your DB, you might place the tables on one or more RAID 5 volumes and the redo logs on RAID 1 or RAID 10.)

Conclusion: in an I/O-intensive environment, performance will ultimately be gated by the latency of the slowest devices in the chain: the physical spindles. If, after implementing all the "best practices", you are still experiencing slow performance on your SQL database, consider respecting the vendor's recommendations... place the redo logs on a datastore backed by a RAID 10 LUN, and the tables on a datastore backed by a RAID 5 LUN.
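To put rough numbers on that reasoning, the following sketch uses the usual rule-of-thumb write penalties (2 back-end I/Os per host write for RAID 10, 4 for RAID 5) and an assumed 180 IOPS per spindle. Both figures are assumptions; replace them with your own drives' characteristics.

```python
# Back-of-the-envelope check of why write-heavy data belongs on RAID 1/10
# rather than RAID 5. Rule-of-thumb write penalties and the per-spindle
# IOPS figure below are assumptions, not measured values.

SPINDLE_IOPS = 180  # assumed ~15k RPM drive

def effective_iops(spindles, write_ratio, write_penalty):
    """Host-visible IOPS a RAID set can sustain for a given read/write mix."""
    raw = spindles * SPINDLE_IOPS
    # Each host write costs `write_penalty` back-end I/Os; each read costs 1.
    return raw / ((1 - write_ratio) + write_ratio * write_penalty)

for name, penalty in (("RAID 10", 2), ("RAID 5", 4)):
    for write_ratio in (0.2, 0.7):
        iops = effective_iops(spindles=8, write_ratio=write_ratio,
                              write_penalty=penalty)
        print(f"{name}, {int(write_ratio * 100)}% writes: ~{iops:.0f} host IOPS")
```

With a write-heavy profile like a redo log, the same eight spindles deliver roughly twice the host IOPS on RAID 10 as on RAID 5 in this simple model, which is exactly the vendor recommendation above.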

VMFS EXTENTS

Extents are logical extensions to a file system, usually implemented as concatenations of disk partitions or logical volumes. Extents are typically used to grow a file system beyond the constraints of its logical or physical device.

For instance, in ESX, you can grow a VMFS file system by adding extents to the original LUN (or Master Extent) on which the file system is based.

In the VMFS implementation, it is possible for files to span multiple extents. The implication is that a single write to a file that traverses extents can result in SCSI reservations across multiple LUNs, with the associated degradation in overall performance.

In VMFS, all metadata is located on the master extent (typically the first LUN on which you created the datastore). If one of the extents goes offline, its data becomes inaccessible... e.g. entire .vmdk's or, in the worst case, the partial .vmdk's that span the extent. If the master extent goes offline, the entire datastore becomes inaccessible.

By default, VMFS will attempt to keep individual files confined to a single extent. Nonetheless, considering that the data on a VMFS file system consists of more than just our defined .vmx and .vmdk files, you can see how, for example, a dynamically created swap file or a dynamically growing snapshot file for a VM could end up spanning extents.

Finally, VMFS issues SCSI bus resets rather than target or LUN resets, which can have a more global effect on the attached fabric. In a poorly designed environment employing extents and suffering excessive SCSI reservations, reservation conflicts, and bus resets, extents can end up becoming inaccessible.

My advice is thus to avoid using extents. If you need additional datastore space, create a new datastore. Plan the storage allocation for your VMs' data so that it is evenly distributed across multiple LUNs.

VMWARE: VMFS or RDM?

work in progress....

SWAP, PAGE FILES AND SHARED STORAGE

(discussion of virtual memory (e.g. swap or paging) in the host and guest VMs, their placement on datastores, shared or local, memory reservations and paging, balloon driver effects, etc.)

CHANNEL FANOUT AND DISTRIBUTING WORKLOADS

(discussion of multiple paths, multiple initiator and target ports, active/active storage processors, etc. Again, multiple datastores on multiple LUNs is beneficial in that you can distribute the I/O for the different LUNs across different paths.)

TWEAKING CACHE

Different storage controllers use different cache algorithms. For instance, some controllers allow you to adjust the amount of cache each processor reserves for writes as opposed to reads. Others will dynamically adjust cache allocation based on usage. Some cache algorithms will limit write cache on a per-LUN basis, preventing over-zealous application servers from saturating the SAN with slow writes. This guarantees fair play across the storage clients while generally prioritizing cache hits for the slower write operations, resulting in overall better performance across the SAN.

Understanding the cache engine will allow you to make intelligent decisions for maximizing performance. As an anecdote, I recall more than one professional services engagement where, in analyzing Clariion performance issues, I found that write cache had never been enabled; turning it on typically yielded an immediate and remarkable improvement in performance. Consult your storage array vendor's documentation for recommendations on configuring cache.

In the case of DataCore's SANmelody and SANsymphony products, the cache engine dynamically manages read and write cache allocation among the LUNs. By design, write cache is limited per LUN, so spreading your writes over several LUNs will result in increased write cache hits.
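A trivial sketch makes the arithmetic plain. The per-LUN cap and controller total below are invented numbers, not DataCore's actual figures; the only point is that the write cache a workload can exploit grows with the number of LUNs it is spread across, up to the controller's total.

```python
# Illustration of a per-LUN write-cache cap. The cap and controller total
# are invented numbers; the point is only that aggregate usable write
# cache scales with the number of LUNs a workload touches.

PER_LUN_WRITE_CACHE_MB = 256      # assumed cap per LUN
CONTROLLER_CACHE_MB = 4096        # assumed controller total

def usable_write_cache(num_luns):
    """Write cache a workload can actually dirty, given a per-LUN cap."""
    return min(num_luns * PER_LUN_WRITE_CACHE_MB, CONTROLLER_CACHE_MB)

for luns in (1, 2, 4, 8):
    print(f"{luns} LUN(s): up to {usable_write_cache(luns)} MB of write cache")
```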

For server virtualization products such as VMWare's ESX, this again supports my argument in favor of having several LUNs for your VMs, rather than one or two monolithic datastores.

SNAPSHOTS AND PERFORMANCE

work in progress....

VMTN UNOFFICIAL STORAGE PERFORMANCE BENCHMARK

In April of 2006, a certain gentleman known only as Christian Z. from Frechen, Germany created a thread on the VMWare VMTN site entitled the "Open unofficial storage performance thread". Christian had performed a handful of IOMeter tests on different storage platforms and published the results, encouraging the rest of the community to do the same. Using the IOMeter profile Christian provided, many people have since posted their own results, and the read is interesting, to say the least.

One of my colleagues has since created several PowerPoint slides summarizing the results for a number of configurations, comparing the performance of SANmelody to SANs from vendors such as EMC, HP, IBM, EqualLogic, LeftHand, NetApp, Compellent, etc. The last slide of the presentation gives a cross-comparison of the different vendors. The graphs show MB/s and IOPS; you'll need to read through the VMTN thread if you want the details of each configuration, but this presentation will give you a good high-level overview.

I recommend you stress the word "unofficial" as you consult the forum. There is much good information to be gleaned from the thread, but performance testing is both an art and a science, and as I read through the results I note inconsistencies between test environments that probably aren't documented. Also, different SAN controllers exhibit completely different performance characteristics depending on the number of LUNs being exercised, which can be attributed to several factors, such as write cache algorithms, hardware architecture (e.g. EMC's CX vs. EMC's DMX), and / or fabric configuration.

For instance, as I've noted in the section on cache above, DataCore's SANmelody and SANsymphony cache engines are designed to favor cached writes across multiple LUNs attached to multiple hosts. They won't yield their highest write cache performance when working against a single LUN. That makes perfect sense: attaching your test server to a single LUN on a SAN is akin to testing DAS, which is not at all a "real life" use of a SAN or shared storage array. However, many of the published results are for a single IOMeter instance writing to a single LUN. SANmelody's excellent "real life" results would have been considerably higher had the tests aggregated I/O across multiple LUNs with multiple IOMeter instances. Caveat lector...

Click here to see the PowerPoint slides.

ENTERPRISE PERFORMANCE AT APPLIANCE PRICES

As a DataCore employee, I'm completely sold on the SANmelody technology. I know that I can implement a Tier-1 SAN storage array by simply choosing an x86 platform with adequate memory and slots, then populating those slots with NICs and RAID controllers with their attached disk shelves. It's an awesome product, offering Synchronous Mirroring, Asynchronous Replication, Snapshots, FC and iSCSI targets, and a superior cache engine and I/O subsystem.

If you want to take DataCore's SANmelody for a test drive, you can download a free, no-obligation 30-day evaluation. You'll see for yourself how SANmelody out-performs traditional hardware-based Enterprise SAN storage arrays and midrange appliances.

For more information, please contact DataCore™ or one of DataCore's Authorized Resellers.

ABOUT THIS SITE

Las Solanas Consulting is not a DataCore reseller. The opinions expressed herein are entirely those of the author and should in no way be construed as representing those of DataCore Software Corporation.

I have the privilege of working with some of the industry's most outstanding engineers, people with vast experience in the domains of VMWare ESX and SAN storage. The following people have contributed much to this paper through numerous casual conversations during professional services engagements or pre-sales technical discussions.

In alphabetical order, thanks to: