DR and Asynchronous Replication - Tutorial and Best Practices
A White Paper by Tim Warden
Disaster Recovery (or DR as the industry has come to call it) refers to a
contingency plan to salvage one's business in the event of a "smoking crater"
event, a catastrophe. The scope of the topic is broad and creating a DR plan
can involve many parts of the organization, including management, facilities, etc.
In the scope of IT, Disaster Recovery refers specifically to the implementation
of some mechanism to protect the company's data, usually transporting it or replicating
it offsite where it can then be used to get critical business systems back up and
running. This alternate site can be a cold site, a "colo" (or collocation facility),
or even a production data center at another of your organization's sites.
In the event a major incident occurs at the primary datacenter, a DR site
provides some level of recoverability that could be measured in minutes, hours,
or days depending on how much planning and resources have gone into the DR
implementation.
DR compared to HA or BC
To begin with, let's delineate between a few of the common terms often
employed when discussing matters of data protection and availability of
service: H/A, BC and DR. The concepts are different, but they are not
mutually exclusive.
High Availability or H/A refers to systems and components designed to
withstand a variety of non-catastrophic local failures. The vendors implement
H/A in servers and storage arrays via redundancy: redundant power, cooling,
cabling, switches, RAID groups, dual processors, etc. The idea is that a fault
of a component (such as a GBIC on an HBA) or a pulled cable shouldn't stop the
show. An alternate path or component can take over without missing a beat. With
H/A, users shouldn't notice any disruption in service when such a failure occurs.
Business Continuity / Continuance or BC takes this one step further. It is
the idea of adding some additional level of redundancy to the architecture so
that it can withstand the failure of entire systems without stopping production.
Often when storage vendors talk about Business Continuity, they are implying the
use of Synchronous Mirroring between two of their high-end storage arrays, perhaps
separated over some short distance, such as between two buildings on a campus.
Whereas Business Continuity is usually implemented on a local campus or
metro basis, DR is understood to imply geographical separation. Like BC, that
separation can be "across the parking lot" or "across campus", but more typically
it's "across the state" or "across the country".
As I said, H/A, BC and DR are not mutually exclusive. Many shops have requirements
to be both Highly Available and have a DR plan in place; in some cases the two may
be deployed using the same mechanisms, such as a stretch cluster across a campus
or metropolitan area, where the two ends are both active production sites,
synchronous mirrors of each other.
However, in the majority of cases, DR sites are not just BC extensions of
the main site. Many shops with critical availability requirements will implement
BC locally, and have a contingency DR site in another state or part of the country.
Bringing the DR site into production is a decision taken only when the primary
site is deemed unavailable because of a disaster. How quickly that site is brought
online depends on how much effort and resources have gone into the planning.
Bringing the site online typically requires network changes (DNS, etc.), and an
understanding that once you've flipped the switch, some planning will be required
to flip it back once the primary site is restored to service.
How Much DR is Enough?
A DR project requires planning and the planning phase should not be taken
lightly. While my intention is to orient this paper towards the technical
aspects of a DR implementation, the choices you will make should, obviously
enough, be guided by a risk assessment involving input from those "powers
that be" in your organization.
Example. You run a small datacenter for a manufacturing subsidiary of a
large company whose headquarters is overseas. Your local management expects
you to keep systems running even if a large fire consumes the datacenter.
You decide to implement your "DR plan" using pairs of VMotion/DRS enabled ESX
servers and Synchronous Data Mirroring between pairs of storage servers.
You physically separate the pairs of servers and storage arrays at extreme
ends of the plant, separated by firewalls. You provide distinct redundant
paths for both network and disk traffic. You install an external power
generator at the site. You've architected a solution that combines aspects
of Business Continuance with DR. This architecture can withstand a major
fire or the destruction of half the plant. In fact, this type of solution
should provide continuous service even if one half of the datacenter
is down.
On the other hand, if a terrorist act or a so-called "Act of God" takes out the
entire site, you probably don't care whether there is a DR site at the overseas
headquarters or not; you reason that the plant is "history", so you and your
colleagues will be out looking for new jobs. The subsidiary's data is not your
fiduciary responsibility and therefore not your headache.
Perhaps the plant's management (who happen to be officers of the company)
wouldn't agree with your plan. Depending on the perspective, one could
argue you have not at all implemented DR...
What Do You Have To Lose?
...Which brings us to the topic of risk assessment. The risk assessment can
be relatively straightforward. What external or internal forces could create
an unrecoverable incident at the datacenter and how will the DR plan address
that? For instance, if your datacenter is located in a seismically active
area, you will want your DR site to be at least a few hundred miles away
in a seismically stable zone, rather than 50 miles south near the same
fault line.
As a part of the planning process, you'll need to define exactly what
the DR site is supposed to provide. Which systems will you replicate?
You may have some business systems that are "nice to have" but not
critical or essential. And are you only interested in having a
recent copy of the data offsite in order to satisfy regulatory laws?
Or is the DR site supposed to provide a restoral of service within a
defined window? Partial or full? And at what quality of service?
The common terms you will hear are MTRS (Mean Time to Restore Service)
or RTO (Recovery Time Objective), meaning "How quickly can we flip the switch
and get production running again at the DR site?" Another acronym you will need
to be familiar with is RPO or Recovery Point Objective. The RPO defines the
acceptable rollback point in case of disaster, or how much recent data loss
your organization can tolerate. The RPO is typically expressed in minutes
or hours, depending on the importance of the data you are replicating. You
may have different RPOs for different systems, determined by the nature
of the data and bandwidth / cost limitations.
You will obviously want to consider the feasibility of using existing
resources (another of your organization's sites located in another city)
versus building a site or leasing space and bandwidth at a colo. Is there
sufficient bandwidth between the sites? Is that bandwidth provisioned for
other applications, such as VoIP or Video Conferencing?
[Figure: Three Geographically Distant BC Sites Inter-Replicating]
Have you implemented Server Virtualization or Storage Virtualization?
If not, you should consider both, as they facilitate implementing DR and
can save you a lot of money.
Does your SAN storage array or Storage Virtualization system offer
Thin Provisioning? This can make an enormous difference in bandwidth
utilization for a few of the replication tasks.
Do you have databases with multiple tables spanning multiple LUNs?
How will your replication strategy ensure consistency across those
inter-dependent volumes?
How are you implementing backups today? Are tapes transported offsite?
Do you have a VTL? Could the new DR site replace or augment the current
backup mechanism? Are you implementing CDP (Continuous Data Protection) and
will your CDP system be compatible with the DR implementation?
These are among the questions the consultant or VAR should be asking
when you begin planning your DR project.
Synchronous vs. Asynchronous Mirroring
So far, we've discussed the differences between Availability and Disaster
Recovery. We've also briefly touched on planning, or determining what you
require from a DR implementation. Now let's turn our attention to the
implementation and discuss the replication of your data.
Mirroring or replicating data essentially involves writing two copies of
the data, one to each half of the mirror. There are a few ways to implement
mirroring. Many OS's have the ability to create disk mirrors, or "Software RAID-1".
There are also third party software products that can implement asynch
mirroring on behalf of the host. In general, it is preferable to have an
independent, shared storage server implement the data mirroring, as it can
consolidate the activity for many hosts and offload those hosts from
the performance penalty of the extra writes.
As for Synchronous vs Asynchronous, as the terms would indicate, it's all
about time. In particular, it's about the time that data written to disk
is "committed". In Synchronous Data Mirroring, the write acknowledgement
isn't returned to the requesting server until the data has been written to
both storage arrays. This is akin to "RAID-1": the two mirror halves stay
in "lockstep"; you know that at any moment you have two identical copies
of the data on two devices.
Because the acknowledgement isn't returned until both storage devices have
a copy of the write, this implies that the bandwidth of the channels used for
the writes needs to be sufficient so as not to cause undue latency which would
affect the responsiveness of the production servers.
As this latency directly affects the performance, Synchronous Data Mirroring
is often not a viable solution for DR where distance and limited bandwidth
would cause roundtrip latency to go way up. In such cases, Async
Replication is the method of choice.
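
To put rough numbers on that trade-off, here is a minimal sketch in Python. It assumes a simplified model in which the local and remote writes are issued in parallel and the acknowledgement waits for the slower of the two; the function names and the millisecond figures are illustrative assumptions, not measurements from any product.

```python
def sync_write_latency_ms(local_write_ms, remote_write_ms, round_trip_ms):
    """Host-visible write latency when the ack waits for both arrays."""
    # Assumed model: both writes start together, and the acknowledgement
    # is held until the slower path (local, or remote plus link time) is done.
    return max(local_write_ms, remote_write_ms + round_trip_ms)


def async_write_latency_ms(local_write_ms):
    """Host-visible write latency when the ack follows the local write only."""
    return local_write_ms


# Illustrative values only: a short campus link versus a long-haul link.
campus = sync_write_latency_ms(1.0, 1.0, 0.5)
long_haul = sync_write_latency_ms(1.0, 1.0, 60.0)
local_only = async_write_latency_ms(1.0)
print(f"sync, campus link:    {campus} ms")
print(f"sync, long-haul link: {long_haul} ms")
print(f"async:                {local_only} ms")
```

Under these assumptions the host pays the full round trip on every synchronous write, whereas the asynchronous write costs the same regardless of distance.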
While not a hard and fast rule, most people associate Synchronous Mirroring
with Business Continuity, whereas Asynch Mirroring is typically associated with
Disaster Recovery.
Asynchronous Mirroring: An Overview
In Asynchronous Mirroring, we still have the concept of mirror halves, but
for the purpose of clarity we will refer to them as a "source" volume and a
"destination" or "replicated" volume. The source volume is the local production
volume a host (such as your SQL server) writes to and reads from. Some agent
then copies any of the writes over to the destination volume at the distant
DR or remote site.
The agent is almost always software, implemented either on the host itself,
on a SAN-based or DAS-based Storage Array, or on an intermediary engine, perhaps
a Storage Virtualization platform.
If the agent is running on the host itself or on a file server (a.k.a. "NAS"
storage such as a CIFS or NFS service), the replication can be implemented
at the file level. When the agent is run on a Storage Array or Virtualization
platform, the replication is invariably implemented at a raw disk or block
level, where only the low-level writes or disk blocks that have been modified
are copied to the destination volume. The scope of this article is limited
to disk- or block-level replication.
With Asynchronous Replication it is understood that the bandwidth between
the two sites is limited, and so writes to the local source volume are
acknowledged immediately, and the agent subsequently replicates those
changes to the destination volume. The destination volume is usually
in a "catch up" mode, anywhere from a few writes to a few gigabytes worth
of writes behind the current state of the source. How far the destination
falls out of synch (or lags behind) the source volume is determined by
the quantity of changes occurring on the source volume, the bandwidth of
the intersite link used for the replication, network QoS policies and/or
replication schedules or throttles. There are many other factors affecting
overall replication system performance, some of which we will discuss later.
Finally, network outages that break the replication link can cause the two
volumes to fall further out of synch.
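
The "catch up" behavior can be sketched with a small simulation. This is a simplified model under stated assumptions: writes and link capacity are both expressed in GB per hour, the link is dedicated, and the hourly figures are hypothetical.

```python
def simulate_backlog(hourly_writes_gb, link_gb_per_hour):
    """Return the replication backlog in GB at the end of each hour."""
    backlog = 0.0
    history = []
    for written in hourly_writes_gb:
        # New writes join the backlog; the link drains what it can.
        backlog = max(0.0, backlog + written - link_gb_per_hour)
        history.append(round(backlog, 2))
    return history


# Hypothetical day: 9 busy hours at 2 GB/h, then 15 quiet hours at 0.1 GB/h,
# replicated over a link that moves 1 GB/h.
writes = [2.0] * 9 + [0.1] * 15
lag = simulate_backlog(writes, link_gb_per_hour=1.0)
print("peak backlog (GB):", max(lag))
print("backlog at end of day (GB):", lag[-1])
```

In this model the destination falls steadily behind during the busy hours and only returns to synch once the write rate drops below what the link can carry.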
Clearly, provisions must be made to keep track of changes to the source volume
in order to replicate those changes and bring the source and destination
as close to synch as time and bandwidth permit. There are a number of ways
to do this, ranging from marking changed blocks that need to be replicated,
to buffering the changes in a reserved storage area.
Some vendors offer de-duplicating technologies, which attempt to reduce
the replication traffic by deleting buffered writes for blocks that have
been written (and buffered) more than once; the idea is that they will
only replicate the latest change to the block. There are pros and cons to
the different technologies used; I personally prefer at least having the
option to buffer every change to every block, as that facilitates snapshotting
and newer technologies such as CDP (Continuous Data Protection) at the DR site.
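
As a rough illustration of that de-duplicating idea, the sketch below keeps only the most recent pending write per block. The class and its method names are hypothetical and do not represent any vendor's implementation.

```python
from collections import OrderedDict


class CoalescingBuffer:
    """Buffers writes for replication, keeping only the latest per block."""

    def __init__(self):
        self._pending = OrderedDict()  # block number -> most recent data
        self.writes_seen = 0

    def write(self, block, data):
        self.writes_seen += 1
        # A rewrite replaces the older buffered copy of the same block.
        self._pending.pop(block, None)
        self._pending[block] = data

    def drain(self):
        """Return the writes to send to the destination and empty the buffer."""
        to_send = list(self._pending.items())
        self._pending.clear()
        return to_send


buf = CoalescingBuffer()
for block, data in [(7, b"a"), (9, b"b"), (7, b"c"), (7, b"d")]:
    buf.write(block, data)
sent = buf.drain()
print(f"{buf.writes_seen} writes buffered, {len(sent)} blocks replicated")
```

The saving comes at a cost: the intermediate versions of block 7 are gone, which is why buffering every change is the better fit when you want snapshots or CDP at the DR site.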
Bandwidth: The Replication Challenge
Bandwidth usually presents the biggest challenge. High-speed connections aren't
cheap and if you are an SMB, you may be financially constrained to T1 speeds.
Of course, the asynchronous replication engine is often based on a protocol,
protocol converter, or application running on top of IP or TCP. Between the
transport protocol overhead and the replication protocol overhead, you never get
anything near the full, theoretical bandwidth of the connection.
If you're lucky enough to get 70% of the bandwidth usable, you're likely
to see transfer rates for a dedicated connection similar to those in the
table below.
| Technology | Mb/s | Theoretical GB/h | Expected GB/h |
| --- | --- | --- | --- |
| T1 | 1.536 | 0.66 | 0.46 |
| 10Base-T LAN | 10 | 4.39 | 3.08 |
| DS3 (T3) | 43.2 | 18.98 | 13.29 |
| 100Base-T LAN | 100 | 43.95 | 30.76 |
| OC3 | 155 | 68.12 | 47.68 |
| OC12 | 622 | 273.34 | 191.34 |
Of course, there is also the overhead of Asynchronous Replication housekeeping
(reconstruction of replicated blocks in the destination volume, etc.) not to
mention the contention with other applications that may be consuming storage
and network resources at the source and at the destination sites.
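
The figures in the table can be approximated with a short calculation. The sketch below assumes the conversion Mb/s to MB/s (divide by 8), to MB/h (multiply by 3600), to GB/h (divide by 1024), with 70% of the result treated as usable; the function name is my own. With these assumptions the T1 row comes out slightly higher than the table (about 0.68 rather than 0.66), which suggests the table's T1 figures were derived from a rounded 1.5 Mb/s.

```python
LINKS_MBPS = {
    "T1": 1.536,
    "10Base-T LAN": 10,
    "DS3 (T3)": 43.2,
    "100Base-T LAN": 100,
    "OC3": 155,
    "OC12": 622,
}


def gb_per_hour(mbps, efficiency=1.0):
    """Convert a link speed in Mb/s to GB/h, scaled by a usable fraction."""
    # Mb/s -> MB/s (/8) -> MB/h (*3600) -> GB/h (/1024)
    return mbps / 8 * 3600 / 1024 * efficiency


print(f"{'Technology':14s} {'Theoretical':>12s} {'Expected':>10s}")
for name, mbps in LINKS_MBPS.items():
    print(f"{name:14s} {gb_per_hour(mbps):12.2f} {gb_per_hour(mbps, 0.70):10.2f}")
```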
Ultimately, you'll need to quantify the change rate or the cumulative amount
of data writes that occur over a period of time for the volumes you need to
actively replicate. You will use that value to determine what bandwidth
you will need. We will refer to the period of time as your replication latency
window. It is determined by such factors as production or peak hours and your
target RPO (Recovery Point Objective, the rollback point, or acceptable lag
between where you were when the disaster hit and your "last known good"
at the DR site).
For instance, if your organization is 24/7, has a very aggressive RPO and
you're changing a lot of data on an hourly basis, you'll need more bandwidth.
On the other hand, if most of your data writes occur between 8am and 5pm, and
your RPO is 24 hours, and you only have about 20GB of data changing daily,
then you can tolerate a 24 hour replication latency window. You then determine
you only need about a 3Mb/s connection between the two sites... the replication
will "catch up" during the off-hours.
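
The arithmetic behind that estimate can be sketched as follows, assuming the same 70% usable bandwidth as in the table above and a dedicated link. The function name and parameters are illustrative.

```python
def required_mbps(changed_gb, window_hours, efficiency=0.70):
    """Link speed in Mb/s needed to move changed_gb within window_hours."""
    gb_per_hour_needed = changed_gb / window_hours
    # GB/h -> MB/h (*1024) -> MB/s (/3600) -> Mb/s (*8), then allow for overhead
    return gb_per_hour_needed * 1024 / 3600 * 8 / efficiency


# The example from the text: 20 GB of daily change, 24 hour window.
print(f"{required_mbps(20, 24):.2f} Mb/s")
```

For 20GB over 24 hours this works out to roughly 2.7 Mb/s, which is where the "about 3Mb/s" figure comes from. Shrinking the window to match a more aggressive RPO raises the requirement proportionally.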
Measuring The Change Rate
Many tools are available to measure the rate of change of your data, and
you may already have monitoring tools available in your organization, either
on the servers themselves or perhaps on a SAN storage array.
If you need help collecting the change rate, DataCore offers a package called
SANmaestro that can be used to determine the quantity of writes per volume. If you already
have a DataCore SAN in place, you'll know that SANmaestro can collect and
report on hundreds of parameters from the DataCore storage controllers.
If you have not yet virtualized your storage with DataCore, you can still
use their Windows-based SANmaestro agents to collect the change rate on any
of your Windows servers.
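
Whatever tool you use, the underlying calculation is the same: sample a cumulative "bytes written" counter for each volume at intervals and look at the differences. The sketch below assumes hypothetical samples of the form (hour, cumulative bytes written); it is not the output format of SANmaestro or of any other specific tool.

```python
GIB = 1024 ** 3


def hourly_change_gb(samples):
    """Change rate in GB/h between consecutive (hour, cumulative_bytes) samples."""
    rates = []
    for (h0, b0), (h1, b1) in zip(samples, samples[1:]):
        rates.append((b1 - b0) / GIB / (h1 - h0))
    return rates


# Hypothetical counter readings for one volume, taken hourly.
samples = [(0, 0), (1, 2 * GIB), (2, 5 * GIB), (3, 5 * GIB + GIB // 2)]
rates = hourly_change_gb(samples)
total_gb = (samples[-1][1] - samples[0][1]) / GIB
print("GB/h per interval:", rates)
print("peak GB/h:", max(rates), "total GB:", total_gb)
```

The peak rate tells you how far the destination may lag during busy hours; the cumulative total over your replication latency window is what the link must be able to move.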
Tweaking Performance
The vendor or consultant with whom you will work to deploy the solution
should be able to provide some guidelines for how to maximize performance,
based upon the vendor's particular implementation. This may involve choosing
a suitable media to buffer replication traffic, making sure the buffering
doesn't create contention with production volumes, adjusting buffer sizes and
parameters, etc.
In the replication example at the end of this article, I will give a handful
of general best practices for the DataCore AIM product.
Know Your Data: Know What To Replicate
We've already discussed determining which applications you need to replicate.
Perhaps you have application servers on the floor that are "nice to have" and
not "Critical" in a disaster recovery scenario. Your available bandwidth
(and budget) will ultimately determine which of those "nice to have" servers
can be replicated.
When considering the data to be replicated to the DR site, separate out that
disk data which is "program", "system", or "execution" oriented data from the
real "user" data. The "pipe" between your two sites more than likely will have
limited bandwidth (often a T1 or DS3), so you'll only want to actively replicate
the essential data.
Note there will be some essential servers on the floor that will not be
candidates for replication, although you will certainly need their counterparts
to offer those same services at the DR site. Examples include the servers
providing your DNS, DHCP, or Active Directory services.
But even on your database and file servers, there will be disk-based data
that does not need to be actively replicated. Consider a Windows 2003 based SQL
server. Typically, you'll have a "C:" drive that serves as your Boot and System
partition. In most cases, this drive will have the Windows system, standard
Program Files as well as the SQL application and utilities.
By default, it will also contain the page file used as swap space for
virtual memory. If the server is a memory hog, the page file usage may not
be negligible. This means your "C:" drive can have a lot of transitory write
activity. Unfortunately, a block level replication engine cannot distinguish
between such temporary writes and the real data you need to replicate, so if
you're replicating the "C:" drive, you will also be replicating any of the
page file writes and temporary caches that are located on that volume.
For this reason (among many others, such as performance, stability,
manageability, portability, etc.), it is always preferable to keep user
data well separated from system and program data. In the case of your SQL
database, you'll usually have at least one or two additional drives to hold
logs and tables, perhaps drive letters "E:" for the logs and "F:" for the
tables (let's not make the example too complicated...)
You'll only need to replicate the "E:" and "F:" drives which contain
the real data. You will want to have a clone of the "C:" drive at the DR site.
This may be a ghost image or perhaps you'll use the replication engine to
initially replicate the "C:" drive into a "quiesced" or "crash coherent" state
and then break the replication relationship. This will be your "baseline"
"C:" drive containing the functioning system and applications. You will then
need to determine a policy for applying any updates or patches to not only the
local server (local "C:" drive) but also its dormant DR counterpart, the split
mirror half serving as a baseline at the DR site. Presumably you can use
something like Remote Desktop, VNC, Virtual Console or an ILO type package
to access the DR servers to patch them as your policy dictates. You may
determine that the dormant DR servers do not require the same rigor for
patching as the live production servers.
In this discussion we've talked about one server, replicating its
three drives to the DR site. More typically, you will have several
servers you need to replicate: one or more file servers, database servers,
etc., so your bandwidth will be shared by all.
As such, you may need to look at creative techniques for reducing the
quantity of data you will replicate. For instance, some database vendors
can offer their professional services to help you implement the replication
of a database wherein you only actively replicate the redo logs. You make
a baseline copy of the tables at the DR site, and periodically apply the
replicated redo logs to the baseline. In our example above, that could
potentially eliminate the need to actively replicate the "F:" drive.
Much has been written on this subject by the DB vendors, so we'll
leave it outside the scope of our discussion.
Replication with Virtual Machines a la VMWare
Server Virtualization packages such as VMWare or Virtual Iron facilitate the
implementation of a DR site. Even if you aren't currently using server virtualization
for your production systems, being able to consolidate several of the replicated servers
as guests in a virtualized environment can save your organization a lot in terms of capital
assets, facilities and management.
Virtual servers also make it easier to bring up an isolated environment in which you
run your replicated servers and their associated data to test for coherency.
Several tools exist from companies such as VisionCore, PlateSpin and VMWare to provide
Physical to Virtual services to help you redeploy physical production servers in a virtual
DR environment.
Most shops implementing virtual servers with asynchronous replication simply replicate
the LUNs containing their existing virtual server file systems (a.k.a. VMFS in VMWare).
While this is a viable solution, the additional layer of virtualization implied by the
.vmdk files means some important guest file system information may not be committed to
disk as frequently as required by your RPO. Ultimately, what this means is that although
a replicated VMFS may appear coherent and healthy, one or more of its .vmdk files may not be.
If possible, consider using RDMs (Raw Device Mappings) in ESX to hold the data. RDMs
make it easier to control which volumes you replicate and make it easier to take crash
coherent snapshots of the volumes. In VI3, RDMs work with VMotion in Physical Compatibility
Mode, and so present an alternative to having the data prisoner in a proprietary vmdk.
If you are only planning on using server virtualization at the DR site, RDMs can
facilitate the virtualization, allowing you to leave the replicated volumes as they are,
presenting them to the virtual machines in their native, non-virtualized state.
However, using RDMs has its own set of issues. Most of the ESX admins I've spoken
with don't like the idea of having to manage the multiple RDMs associated with the
"server sprawl" that is typical of virtual server environments. If you have many volumes,
you will need to consider not only that management aspect, but also the disk management
overhead on both the hosts and the storage server.
If you must use VMFS volumes, at least separate the guest VMs' system and program
file data from the user data you need to replicate by placing their corresponding vmdk's
on different VMFS volumes. You will then actively replicate a guest VM's user data and
use a complementary mechanism (storage server-based snapshots, etc.) to replicate the
system volumes.
VDI adds yet another perspective to splitting out data from systems. If you were to
replicate a wholesale VMFS containing many virtual desktops and their associated volumes,
you would also be replicating the page files, web browser caches, etc.
Finally, if you are using or plan to use VCB (VMWare Consolidated Backup), keep in
mind that the current implementation does not support having a VM guest's .vmdk files
on different VMFS's, so this could present a challenge.
Establishing The Asynchronous Mirror
When establishing a relationship between a replication source volume and its
partner destination volume at the DR-site, a baseline mirror will need to be created.
This is to say, before the destination volume can be used, all the data on the source
will need to be initially copied to the destination.
If the volume is large and the intersite bandwidth limited (which is often the
case), some means will need to be employed to initialize the mirror "out of band".
Most vendors' solutions provide a mechanism (or white paper) describing how to
physically transport a tape or disk containing the mirror to the remote site.
The idea is that FedEx or DHL will be faster than your T1.
Depending on the SAN or shared storage array technology you are using, this may
be easy or hard. In the case of the DataCore solution, it is a relatively easy
procedure, and can be accomplished using simple USB-2 or Firewire disks.
The table below is a rough estimate of how long it would take to initialize
a volume given different network technologies. Keep in mind that these measurements
are based on initializing the volumes one at a time, and presuming that each has
dedicated use of the bandwidth. Real life is, obviously, quite a bit different,
but this table should help you get some idea of whether you will be able to
establish the mirrors via the network or whether you'll have to ship disks.
Est. Hours To Replicate, by Capacity in GB

| Technology | 20 | 80 | 120 | 200 | 300 | 730 |
| --- | --- | --- | --- | --- | --- | --- |
| T1 | 42.33 | 169.31 | 253.97 | 423.28 | 634.92 | 1544.97 |
| 10Base-T LAN | 6.50 | 26.01 | 39.01 | 65.02 | 97.52 | 237.31 |
| DS3 / T3 | 1.50 | 6.02 | 9.03 | 15.05 | 22.57 | 54.93 |
| 100Base-T LAN | 0.65 | 2.60 | 3.90 | 6.50 | 9.75 | 23.73 |
| OC3 | 0.42 | 1.68 | 2.52 | 4.19 | 6.29 | 15.31 |
| OC12 | 0.10 | 0.42 | 0.63 | 1.05 | 1.57 | 3.82 |
Again, the numbers above are just a rough guide based on getting about
70% of the advertised bandwidth to move your data. There can be a lot of
other factors that can seriously skew those numbers: network congestion, disk
and server performance, etc. The main point to take home is that if it's
going to take more than a week (> 168 hours) to get everything synched up,
you may want to consider shipping disks.
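
The same rule of thumb can be expressed as a short calculation. This is a sketch under the table's assumptions (70% usable bandwidth, a dedicated link, one volume at a time); the function names and the default 168 hour limit parameter are my own.

```python
def hours_to_initialize(capacity_gb, mbps, efficiency=0.70):
    """Estimated hours to copy capacity_gb over a link of the given speed."""
    # Mb/s -> MB/s (/8) -> MB/h (*3600) -> GB/h (/1024), scaled to usable rate
    usable_gb_per_hour = mbps / 8 * 3600 / 1024 * efficiency
    return capacity_gb / usable_gb_per_hour


def should_ship_disks(capacity_gb, mbps, limit_hours=168):
    """True when the network copy would take longer than the limit (one week)."""
    return hours_to_initialize(capacity_gb, mbps) > limit_hours


for capacity_gb, mbps, label in [(80, 1.536, "T1"), (300, 43.2, "DS3")]:
    hours = hours_to_initialize(capacity_gb, mbps)
    advice = "ship disks" if should_ship_disks(capacity_gb, mbps) else "use the network"
    print(f"{capacity_gb} GB over {label}: {hours:.2f} hours, {advice}")
```

By this measure, even an 80GB volume over a T1 crosses the one week threshold, while 300GB over a DS3 finishes in under a day.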
If you have the luxury of establishing the asynchronous mirror relationship
prior to formatting the source volume, you will not need to do this initial
synchronization, as both source and destination volumes can be considered
in synch inasmuch as they are both un-formatted. So if you are planning
to stand up a new file server that you wish to replicate, create the asynch
mirror relationship before formatting and creating the file shares. Also,
if your OS supports the notion, do a "Quick Format" as opposed to a low-level
format which would write every block on the disk, effectively resulting
in a full synchronization.
Avoid Routines That Generate Unnecessary Write Traffic
The first and most obvious routine that comes to mind is Disk Defragmentation.
Presumably you will be implementing block-level replication in a SAN-based
or shared storage environment where disk defragmentation has less value.
For example, if the volume you are defragmenting is a SAN-based volume
on, say, a RAID-5 group on an EVA-6000, the defragmentation becomes meaningless
as the EVA "virtualizes" that RAID-5 group and actually writes the volume
across all the spindles in the storage array. Defragmenting will simply
move blocks around on the EVA, potentially scattering them further,
potentially increasing seek and rotational latencies, all the while
potentially consuming all the bandwidth of your intersite link by
redundantly replicating the defrag writes.
[Read my white paper on Defragmentation of SAN Volumes.]
Coherency of Replicated Volumes
[discussion of volume coherency, ability of servers and apps to recover
from volumes marked inconsistent, etc.]
Replicated Volumes: A Moving Target
Once the mirror has been established, the destination volume will be in
a constant state of flux as new replicated blocks are received and destaged
to their corresponding logical locations on the volume. In this state, you
will be unable to use the replicated volume. If you attempted to mount it
on a DR site server, you would eventually get errors as the DR site server
would not be aware the volume's contents were changing beneath it.
To that end, the replication solution should provide some means for you
to access the destination volumes while they are actively receiving the
replication. Typically, this is done with point-in-time "Snapshots".
Implementing Coherency Points
Snapshots are point-in-time copies or clones of volumes. They are usually
implemented in the shared storage array, or in the case of a SAN or tiered
storage environment, they are ideally implemented in a Storage Virtualization
engine. While different Snapshot technologies exist, the common denominator
is that they allow you to manipulate, mount, backup, etc. an independent copy
or image of a production volume without disturbing the actual volume.
Snapshots are commonly used to create rollback points, eliminate backup
windows, or generate copies of real data for dev/test, datamining, etc.
They are also an ideal way of making crash-coherent rollback points of
the replicated volumes.
Snapshots provide a means to implement secure off-site tape backup at your
DR site.
The replication technology you choose should allow you to automate the
snapshot process or at least to script the process. It should also allow
you to create more than one snapshot relationship per replicated volume.
For instance, you may want a script to take a snapshot every morning
at 3am and initiate a backup of the snapshot to tape, whereas you may also
have another snapshot on the volume that is updated every thirty minutes
in accordance with your RPO.
Employing Source Snapshots To Control Replication
Depending on the vendor's implementation, you may also be able to use
snapshots at the source site as a means to control or reduce replication
traffic. The idea here is to replicate the snapshot volume and periodic
updates to the snapshot volume instead of replicating the ever-changing
production source volume. This is a means of de-duplicating the replication
stream in that regardless of how many times a block has changed within a given
snapshot interval, only its most recent content will be replicated.
As we have noted, there are a few methods to implement snapshot technology.
For the sake of this discussion we will limit the example to "Copy On Write"
type snapshots with the proviso that the implementation can create complete
snapshot clones where all blocks are copied to the snapshot volume.
In "Copy On Write" snapshots, a relationship is created between a source
volume and some reserve space or snapshot volume that presents an image
of the source at some particular fixed point of time. When the snapshot is
first "taken" or "enabled", we will call that time T0. Most Copy On Write
snapshot engines allow you to periodically increment or update the snapshot
to T1, T2, etc. With each update Tn, we are in effect copying the most
recent changes to all changed blocks over the interval [Tn-1 .. Tn] from
the source volume to the snapshot volume. If we are replicating the snapshot
volume, those most recent changes will be replicated as they are written
to the snapshot volume.
Consider a volume "G:" that you want to replicate to the DR site with an
RPO of 30 minutes. At the production site, create a clone or complete image
snapshot of the source volume "G:". Let's call the snapshot volume "S:".
Replicate that snapshot volume "S:" to the DR site. Once initialized,
we will have a baseline volume of time T0 at the DR site. Call that
replicated snapshot "R:". Now schedule the snapshot "S:" to automatically
increment or update every 30 minutes.
- 8:00 - Snapshot updates to T1
- 8:18 - You create a file on "G:" named "8h18"
- 8:25 - You mount a snapshot of the replicated volume "R:" at the DR
site and notice that file "8h18" does not yet exist.
- 8:30 - Snapshot updates to T2
- 8:37 - You create a file on "G:" named "8h37"
- 8:45 - You mount a snapshot of the replicated volume "R:" at the DR
site and see file "8h18", but not file "8h37" which has not yet been
replicated.
- 9:00 - Snapshot updates to T3
- 9:05 - You mount a snapshot of the replicated volume "R:" at the DR
site and observe that file "8h37" is now available.
The example above makes assumptions about the replication latency window,
and obviously the times chosen were completely arbitrary. Clearly, all the
factors we have discussed in previous sections still apply: you will need
to assure that the snapshot updates that need to be replicated from "S:" to
"R:" can take place within the defined replication latency window,
in this case our RPO of 30 minutes.
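
The timeline above can be modeled with a few lines of Python. The sketch assumes snapshot updates on a fixed 30 minute schedule and a hypothetical 5 minutes for each update to finish replicating to "R:"; that replication time is my assumption, chosen to be consistent with the arbitrary times in the example, and the calendar date is equally arbitrary.

```python
from datetime import datetime, timedelta


def visible_at_dr(created, known_update, interval_min=30, replication_min=5):
    """Earliest time a file created on the source can be seen at the DR site."""
    interval = timedelta(minutes=interval_min)
    elapsed = created - known_update
    # Ceiling division: number of intervals until the first snapshot update
    # at or after the file was created.
    steps = max(0, -(-elapsed // interval))
    capturing_update = known_update + steps * interval
    # The file is visible once that update has finished replicating.
    return capturing_update + timedelta(minutes=replication_min)


t1 = datetime(2024, 1, 1, 8, 0)  # the 8:00 snapshot update (T1)
for name, created in [("8h18", t1.replace(minute=18)), ("8h37", t1.replace(minute=37))]:
    print(name, "visible at DR from", visible_at_dr(created, t1).strftime("%H:%M"))
```

Under these assumptions file "8h18" is not yet visible at 8:25 but is by 8:45, and file "8h37" becomes visible at 9:05, matching the sequence described. If replicating an update took longer than the 30 minute interval, the updates would queue up and the RPO would no longer be met.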
To recap, in some instances replicating snapshots of your production
volumes can be useful for reducing replication traffic by only replicating
the most recent changes of an incremental snapshot interval, effectively
de-duplicating the changed blocks.
Testing The Solution: Is The Replicated Data Usable?
You will want to periodically test your DR environment to assure it is
ready to deploy.
Snapshots can be used to mount copies of the replicated volumes on their
corresponding DR servers.
Using server virtualization at your DR site will facilitate bringing up
your replicated servers with their data in an isolated environment, enabling
you to verify the readiness of the servers and coherency of data without
actually "flipping the switch".
Continue to the next page to read about how DataCore implements Async Replication.