Maximizing Storage Performance With VMWare ESX and Virtual Server Environments
A White Paper by Tim Warden, April 12, 2008
Over the course of the last few years, I have had the opportunity to install
numerous SAN storage solutions for customers implementing server virtualization
and requiring shared storage. These experiences and my own testing in the lab
have led me to several observations about how best to configure your shared
file systems and their underlying storage in virtual server environments. The
purpose of this paper is to consolidate and share these "storage best practices".
HOW MANY VMs SHOULD I DEPLOY PER DATASTORE?
One of the most common misunderstandings when configuring shared storage
for a Virtual Server environment has to do with laying out VMs on the shared
LUNs. The tendency I've observed is to create one or two large LUNs and
create all the VMs on the one or two resulting Datastores. Indeed, when
I first began working with ESX with SANmelody, I simply created two LUNs,
one from each of my active SANmelody controllers, and distributed the VMs
between the two. It worked great.
However, working with several customers and partners over the last few years,
I've come to understand that this is by no means the ideal configuration for the
typical production environment. I've encountered cases where the customer
had well in excess of 30 running VMs, all on a single LUN shared
across two or more ESX hosts.
While it may seem obvious that a LUN with multiple VMs, each actively generating
I/O, could pose a bottleneck, most admins will assume that the bottleneck is the
media itself and not the implementation of the protocol over the media. For
instance, most will presume the bottleneck is the bandwidth of a 1Gb Ethernet
of the iSCSI "fabric". The reality is a bit more complicated.
The SCSI-3 recommendation on which Fibre Channel and iSCSI are based
specifies a mechanism for the sharing of target devices (disks or LUNs) among
cooperating initiators, such as clustered servers. Referred to as "SCSI Reservations",
it describes a locking / releasing mechanism designed for negotiating which
initiator can write the LUN at any given instant. A clustered server attempting
to write a LUN will get a "lock" on the LUN unless some other server already
has the lock. The reservation is released once the write has completed,
so that other servers can subsequently write the LUN. A clustered
server attempting to write a LUN which is already locked will get a "Reservation
Conflict" warning and need to retry. These "reservation collisions" correspond
directly to a performance degradation on the LUN as I/O operations are delayed
awaiting the retry. In extreme cases, upon receiving too many consecutive
reservation conflict errors from attempted retries, the write operation will fail.
In the case of shared ESX Datastores or XenServer Storage Repositories, having
too many VMs all attempting to write the same LUN from different hosts will
result in reservation conflicts. In the case of ESX, you can verify whether you
are experiencing conflicts on a LUN by analyzing /var/log/vmkwarning on the
ESX host.
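In practice, checking for conflicts comes down to scanning that log for the conflict messages. Here is a minimal Python sketch; the "reservation conflict" phrase it matches, and the sample log lines, are assumptions (the exact warning text varies by ESX version), so treat it as an illustration rather than a definitive parser:

```python
# Sketch: count SCSI reservation conflict warnings in an ESX host's
# /var/log/vmkwarning. The matched phrase and the sample line format
# are assumptions; the real message text varies by ESX version.
import re

def count_reservation_conflicts(log_lines):
    """Return the number of log lines mentioning a SCSI reservation conflict."""
    pattern = re.compile(r"reservation conflict", re.IGNORECASE)
    return sum(1 for line in log_lines if pattern.search(line))

# Hypothetical sample lines, standing in for the contents of vmkwarning:
sample = [
    "vmkernel: 0:00:12:34.567 cpu1: SCSI reservation conflict on vmhba1:0:3",
    "vmkernel: 0:00:12:35.001 cpu0: link up on vmnic0",
    "vmkernel: 0:00:12:35.890 cpu1: SCSI reservation conflict on vmhba1:0:3",
]
print(count_reservation_conflicts(sample))  # prints 2
```

A steadily climbing count on one LUN is the signal that too many busy VMs share that datastore.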
The solution? Avoid the monolithic "1 LUN" approach to building your
logical storage architecture. Create multiple LUNs, distributing their
presentation across the storage processors, and then distributing your VMs
among the datastores based on expected or measured I/O load.
So how many VMs are too many? How many VMs per datastore? The answer
will differ depending on whose white paper or blog you read. I've heard
that VMWare recommends between 5 and 16 per datastore. I've seen other articles
recommend not exceeding 30. The diversity of answers underscores the fact
that each environment is different. If you have 20 VMs on the same datastore
that are doing very little I/O, this may well be a non-issue. However, if you
have noticed a correlation between performance degradation and VM sprawl on a
datastore, you should consider migrating some of the VMs to other datastores.
If you currently have one or two LUNs for the 10 or 20 VMs shared by your
ESX servers, you may notice a remarkable improvement by simply redistributing
those VMs over four or six LUNs. If you have a VM that is particularly
I/O bound, you should consider moving its virtual disks to a dedicated
datastore or, alternatively, a Raw Device Mapping.
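The redistribution advice above amounts to a simple balancing exercise: assign each VM, heaviest I/O first, to whichever datastore currently carries the least load. A short Python sketch, with hypothetical VM names and IOPS figures:

```python
# Sketch: greedy placement of VMs across datastores by measured IOPS,
# instead of piling every VM onto one or two LUNs. The VM names and
# IOPS numbers below are hypothetical.

def distribute_vms(vm_iops, num_datastores):
    """Assign each VM to the currently least-loaded datastore."""
    loads = [0] * num_datastores
    placement = {}
    # Place the heaviest VMs first so large loads spread out cleanly.
    for vm, iops in sorted(vm_iops.items(), key=lambda kv: -kv[1]):
        target = loads.index(min(loads))
        placement[vm] = target
        loads[target] += iops
    return placement, loads

vms = {"sql01": 900, "exch01": 700, "web01": 200, "web02": 180, "dc01": 50}
placement, loads = distribute_vms(vms, 3)
print(placement)  # the two I/O-heavy VMs land on different datastores
print(loads)
```

The same reasoning applies whether you balance by IOPS, MB/s, or simply by observed latency; the point is that the heavy writers should not share a LUN.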
MATCHING SPINDLES TO APPLICATION REQUIREMENTS
Keep in mind that a VM may have multiple virtual disks. In the case of
a Windows SQL VM, you might create a system/boot .vmdk for the C: drive, and
then create a virtual disk for the transaction logs and yet another virtual
disk for the tables. Having all three .vmdk files on the same datastore will
probably not yield the best results.
But let's consider the reasoning behind why you would have isolated different
data types to different virtual disks in the first place.
(to be continued -- in short, some I/O is write intensive, such as DB Redo
logs, whereas other I/O will be read-oriented. RAID 5 performs well for reads,
but has a high overhead for writes. So for your DB, you might place tables
on one or more RAID 5 volumes, placing the redo logs on RAID 1 or RAID 10.)
Conclusion: in an I/O intensive environment, performance will ultimately
be gated by the latency imposed by the slowest devices in the chain: the physical
spindles. If after implementing all the "best practices" you are still
experiencing slow performance on your SQL database, consider respecting
the database vendor's recommendations: place the redo logs on a datastore
backed by a RAID 10 LUN, and the tables on a datastore backed by a RAID 5 LUN.
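The reasoning comes down to the RAID write penalty: a small random write costs roughly four disk I/Os on RAID 5 (read data, read parity, write data, write parity) but only two on RAID 10 (one write per mirror side). A back-of-the-envelope calculation, using hypothetical workload numbers:

```python
# Sketch: the spindle-level cost of a workload under different RAID
# write penalties. Penalty of 4 for RAID 5, 2 for RAID 10; the
# read/write IOPS figures are made-up illustration values.

def backend_iops(read_iops, write_iops, write_penalty):
    """Disk I/Os per second the physical spindles actually see."""
    return read_iops + write_iops * write_penalty

# A redo-log style workload: almost all writes.
log_r5  = backend_iops(read_iops=50, write_iops=950, write_penalty=4)
log_r10 = backend_iops(read_iops=50, write_iops=950, write_penalty=2)
print(log_r5, log_r10)  # prints 3850 1950
```

For the same host workload, RAID 10 roughly halves the load on the spindles, which is exactly why write-heavy redo logs belong there while read-heavy tables can live happily on RAID 5.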
A WORD ABOUT VMFS EXTENTS
Extents are logical extensions to a file system, usually implemented as
concatenations of disk partitions or logical volumes. Extents are typically
used to grow a file system beyond the constraints of its original logical or
physical device. For instance, in ESX, you can grow a VMFS file system by adding extents
to the original LUN (or Master Extent) on which the file system is based.
In the VMFS implementation it is possible for files to span multiple
volumes. The implication is that a single write to a file traversing extents
can result in SCSI reservations across multiple LUNs, with the associated
degradation on overall performance.
In VMFS, all metadata is located on the master extent (typically the first
LUN on which you created the datastore). If one of the extents goes offline,
its data becomes inaccessible... e.g. entire .vmdk files, or in the worst case,
those partial .vmdk files that span the extent. If the master extent goes offline,
the entire datastore becomes inaccessible.
By default, VMFS will attempt to keep individual files confined to an extent.
Nonetheless, considering that file data on a VMFS file system consists of more
than just our defined .vmx and .vmdk files, you can see how, for example, a
dynamically created swap file, or a dynamically growing snapshot file for a VM,
could potentially span extents.
Finally, VMFS issues SCSI bus resets rather than target or LUN resets,
which can have a more global effect on the attached fabric. In a poorly
designed environment employing extents and experiencing excessive SCSI reservations,
reservation conflicts, and bus resets, this can result in extents becoming
inaccessible.
My advice is thus to avoid using extents. If you need additional datastore
space, create a new datastore. Plan storage allocation for your VMs' associated
data in an evenly distributed manner across multiple LUNs.
VMWARE: VMFS or RDM?
work in progress....
SWAP, PAGE FILES AND SHARED STORAGE
(discussion of virtual memory (e.g. swap or paging) in the host and guest VMs,
their placement on datastores, shared or local, memory reservations and
paging, balloon driver effects, etc.)
CHANNEL FANOUT AND DISTRIBUTING WORKLOADS
(discussion of multiple paths, multiple initiator and target ports, active/active
storage processors, etc. Again, multiple datastores on multiple LUNs is beneficial
in that you can distribute the I/O for the different LUNs across different paths.)
UNDERSTANDING YOUR CONTROLLER'S CACHE
Different storage controllers use different cache algorithms.
For instance, some controllers allow you to adjust the amount of cache
each processor reserves for writes as opposed to reads. Others will
dynamically adjust cache allocation based on usage. Some cache algorithms
will limit write cache on a per-LUN basis, preventing over-zealous application
servers from saturating the SAN with slow writes. This guarantees fair play
across the storage clients while generally prioritizing cache hits for the slower
write operations, resulting in overall better performance across the SAN.
Understanding the cache engine will allow you to make intelligent decisions
for maximizing performance. As an anecdote, I recall more than one professional
services engagement where, in analyzing Clariion performance issues, I would
find that Write Cache had never been enabled; turning the write cache
on typically yielded an immediate and remarkable improvement in performance.
Consult your storage array vendor's documentation for recommendations on
cache configuration.
In the case of DataCore's SANmelody and SANsymphony products, the cache
engine dynamically manages read and write cache allocation among the LUNs.
By design, write cache is limited per LUN, so spreading your writes over
several LUNs will result in increased write cache hits.
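To see why spreading writes over several LUNs raises write cache hits under a per-LUN limit, consider a toy model; the 256 MB cap and the burst sizes below are made-up numbers for illustration only, not DataCore's actual figures.

```python
# Toy model of a controller that caps write cache per LUN, as the
# SANmelody cache engine is described above. The 256 MB cap and the
# burst sizes are hypothetical illustration values.
PER_LUN_CAP_MB = 256

def cached_mb(write_burst_mb_per_lun):
    """MB of a write burst absorbed by cache, given the per-LUN cap."""
    return sum(min(burst, PER_LUN_CAP_MB) for burst in write_burst_mb_per_lun)

# A 1 GB burst aimed at a single LUN: only the first 256 MB is cached.
print(cached_mb([1024]))                 # prints 256
# The same 1 GB spread over four LUNs: all of it is absorbed by cache.
print(cached_mb([256, 256, 256, 256]))   # prints 1024
```

The rest of the single-LUN burst bleeds through to the spindles at disk speed, which is exactly the behavior that multiple datastores on multiple LUNs avoid.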
In terms of server virtualization products such as VMWare's ESX, this
would again support my arguments in favor of having several LUNs for your
VMs, rather than the monolithic one or two datastores.
SNAPSHOTS AND PERFORMANCE
work in progress....
VMTN UNOFFICIAL STORAGE PERFORMANCE BENCHMARK
In April of 2006, a certain gentleman known only as Christian Z. from Frechen,
Germany created a thread on the VMWare VMTN site entitled the
"Open unofficial storage performance thread". Christian had performed a
handful of IOMeter tests using different storage platforms and published the
results, encouraging the rest of the community to do the same. Using
Christian's provided IOMeter profile, many people have since posted their own
results, and the read is interesting, to say the least.
One of my colleagues has since created several PowerPoint slides
that summarize the results of several configurations, comparing the performance
of SANmelody to SANs from vendors such as EMC, HP, IBM, EqualLogic, LeftHand, NetApp,
Compellent, etc. The last slide of the presentation gives a cross-comparison
of the different vendors. The graphs indicate MB/s and IOPS; you'll need to
read through the VMTN thread if you want the details of each configuration, but
this presentation will give you a good high-level overview.
I recommend you stress the word "unofficial" as you consult the forum. There
is much good information to be gleaned from the thread, but performance testing
is an art and a science, and as I read through the results I note there are
inconsistencies between test environments that probably aren't documented.
Also, different SAN controllers have completely different performance characteristics
depending on the number of LUNs presented, which can be attributed to several
factors, such as write cache algorithms, hardware architecture (e.g. EMC's CX vs.
EMC's DMX), and/or fabric configuration.
For instance, as I've noted in the section on cache above, DataCore's SANmelody
and SANsymphony cache engines are designed to favor cache writes across multiple LUNs
attached to multiple hosts. They won't yield their highest write cache performance
when working with a single LUN. Makes perfect sense: attaching your test server to
a single LUN on a SAN is akin to testing DAS, which is not at all a "Real Life"
utilization of a SAN or shared storage array. However, many of the published
results are for a single IOMeter instance writing a single LUN. SANmelody's
excellent "real life" results would have been considerably higher had the
test aggregated multiple LUNs on multiple IOMeter instances.
ENTERPRISE PERFORMANCE AT APPLIANCE PRICES
As a DataCore employee, I'm completely sold on the SANmelody technology.
I know that I can implement a Tier-1 SAN storage array by simply choosing
an x86 platform with adequate memory and slots, populating those slots
with NICs and RAID controller ports with attached disk shelves. It's an
awesome product offering Synchronous Mirroring, Asynchronous Replication,
Snapshots, FC and iSCSI targets, and a superior cache engine and I/O subsystem.
If you want to take a test drive of DataCore's SANmelody, you can download a
30-day evaluation. You'll see for yourself how SANmelody out-performs the traditional hardware
based Enterprise SAN storage arrays and midrange appliances.
For more information, please contact DataCore™ or one of DataCore's resellers.
ABOUT THIS SITE
Las Solanas Consulting is not a DataCore reseller. The opinions expressed
herein are entirely those of the author and in no way should be misconstrued so
as to represent those of DataCore Software Corporation.
I have the privilege of working with some of the industry's most outstanding
engineers having vast experience in the domains of VMWare ESX and SAN Storage.
The following persons have contributed much to this paper through numerous casual
conversations surrounding professional services engagements or pre-sales technical
discussions.
In alphabetical order, thanks to:
- Alex Best (DataCore Germany, Germany)
- Scot Colmer (Access Flow, Inc., Sacramento, CA)
- James Price (DataCore Software, Ft. Lauderdale, FL)
- Bob Strachen (DataCore Software, NY)
- Aaron Schneider (The Pinnacle Group, Carlsbad, CA)
- Brian Sterck (Southland Technology, San Diego, CA)
- Wendy Stutzman (DataCore Software, Ft. Lauderdale, FL)