- Recovery Strategies
- Category: Business Continuity Planning (BCP) and Disaster Recovery Planning (DRP)
Author: Cameron Worrell
Date Added: February 6th, 2007
Introduction
In today's market landscape, company leadership is under unprecedented pressure to ensure that sound business continuance measures are implemented. Companies must maintain information availability in the face of increasing threats and regulations, as well as constantly changing IT environments. To make matters worse, IT budgets are often stretched paper thin, which leaves little or no funding for Disaster Recovery Planning (DRP) and Business Continuity Planning (BCP). This creates tremendous obstacles that IT organizations must overcome to ensure successful business recovery after an unplanned outage.
Formulating and deploying the right recovery strategy is pivotal to ensure that a business can survive all types of events, from simple hardware failures to catastrophic disasters. The purpose of this white paper is to discuss various recovery strategies and the considerations that should be made before deploying them. We first discuss the basic considerations that should be part of any recovery strategy. Then, we provide details on several recovery techniques and see how they might support key areas of your DR and BCP requirements.
Recovery Components
When planning a recovery strategy, it is important to consider all aspects of the business, which typically fall under three distinct umbrellas: personnel, processes, and technology.
The personnel that make up a company are the business units that actually conduct business on a day-to-day basis. They include operations, marketing, supply chain, finance, HR, and so on. These business units carry out critical functions that must be maintained during recovery efforts.
Processes are the blueprints for running the business, and they should be incorporated into recovery plans. Finally, technology automates processes and allows information to flow where it needs to. Although this paper focuses on technology, the processes and personnel are equally important.
Technology includes infrastructure, such as systems and network (voice and data) resources, operating systems, and applications. The key thread that ties these components together is the information or data that drives the business. The following sections discuss strategies for recovering the information and resources a business depends on after an unplanned outage.
Basics
There are several basic ingredients that should be addressed before the development of a recovery strategy. First and foremost, it is imperative to have executive level sponsorship behind the initiative. This ensures that the business is committed to the effort and avoids wasting time and resources on an ineffective recovery strategy.
After executive sponsorship is obtained, the next step is to perform a Business Impact Analysis (BIA). The BIA helps to quantify risk levels, such as acceptable downtime parameters and financial impact for business functions. The findings also include two metrics: Recovery Point Objective (RPO) and Recovery Time Objective (RTO). RPO is the maximum amount of data loss that a business can sustain due to an outage, and it is measured in units of time. RTO is the maximum allowable time between an outage and the resumption of business. RPO and RTO play a key role in the strategy development process.
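As a rough illustration of how the two metrics are applied, the sketch below checks a single outage against an RPO and an RTO. The class name, function name, and sample timestamps are hypothetical and are not taken from any particular tool.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass
class Objectives:
    rpo: timedelta  # maximum tolerable data loss, expressed as a span of time
    rto: timedelta  # maximum tolerable time from outage to resumption


def meets_objectives(obj, last_good_copy, outage_start, service_resumed):
    """Return (rpo_met, rto_met) for one outage."""
    # Data written after the last recoverable copy is lost.
    data_loss_window = outage_start - last_good_copy
    # Downtime runs from the outage until business resumes.
    downtime = service_resumed - outage_start
    return data_loss_window <= obj.rpo, downtime <= obj.rto


# Hypothetical values: a 4-hour RPO and a 12-hour RTO.
objectives = Objectives(rpo=timedelta(hours=4), rto=timedelta(hours=12))
result = meets_objectives(
    objectives,
    last_good_copy=datetime(2007, 2, 6, 2, 0),
    outage_start=datetime(2007, 2, 6, 9, 0),
    service_resumed=datetime(2007, 2, 6, 18, 0),
)
print(result)  # (False, True)
```

In this made-up case the business is back within the RTO (9 hours of downtime), but the last recoverable copy is 7 hours old, so the 4-hour RPO is missed.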
Maintaining recovery plans for business and technology recovery is also of paramount importance. Plans should be updated as needed and validated to ensure their effectiveness. This validation is typically performed through testing an actual recovery or a simulation of one.
After the basic components are in place, strategies can be developed to support the requirements. Most programs rank resources by priority in a tiered fashion. The RPO/RTO metrics determine the tiers. The following chart depicts a sample tiered structure:
| Tier | Timeframe (RTO) |
|---|---|
| Tier One | 0-4 hours |
| Tier Two | 4-12 hours |
| Tier Three | 12-24 hours |
| Tier Four | 24-72 hours |
After a particular business function is placed into a tier, all the systems and resources that the function depends on are subject to the required RTO.
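The tiering and dependency rule above can be sketched in a few lines. The tier limits come from the sample table; everything else is an assumption made for illustration: the function and system names are hypothetical, a boundary value such as exactly 4 hours is placed in the stricter tier, and a system shared by several functions is assumed to inherit the strictest RTO among them.

```python
from datetime import timedelta

# Upper RTO bound for each tier, from the sample table.
TIER_LIMITS = [
    ("Tier One", timedelta(hours=4)),
    ("Tier Two", timedelta(hours=12)),
    ("Tier Three", timedelta(hours=24)),
    ("Tier Four", timedelta(hours=72)),
]


def tier_for(rto):
    """Return the first (strictest) tier whose upper bound covers the RTO."""
    for name, limit in TIER_LIMITS:
        if rto <= limit:
            return name
    raise ValueError("RTO exceeds the longest tier in the sample table")


def required_rto_per_system(function_rtos, dependencies):
    """Give every system the strictest RTO of the functions that depend on it."""
    required = {}
    for function, systems in dependencies.items():
        rto = function_rtos[function]
        for system in systems:
            if system not in required or rto < required[system]:
                required[system] = rto
    return required


# Hypothetical business functions and the systems they rely on.
function_rtos = {
    "order entry": timedelta(hours=2),
    "payroll": timedelta(hours=48),
}
dependencies = {
    "order entry": ["erp-db", "web-frontend"],
    "payroll": ["erp-db", "hr-app"],
}

system_rtos = required_rto_per_system(function_rtos, dependencies)
for system, rto in sorted(system_rtos.items()):
    print(system, tier_for(rto))
```

Here the shared database ends up in Tier One because order entry depends on it, even though payroll alone would only place it in Tier Four.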
Recovery Strategies
To make sure IT resources are brought back up as quickly as possible, recovery strategies need to be put in place.
Back Office Resources
Availability of the back office environment is essential for a successful recovery. Before business functions can be recovered, there are many dependencies that must first be met. These include the following:
- Facility (power, space, HVAC, work space)
- Hardware (systems, network gear, platforms: mainframe, Intel, Sun, RS6000, and so on)
- Network (LAN, WAN, remote access, Internet access)
- Software (OS, applications, licensing)
- Data (application, configuration)
- Staff (IT, operations, facilities)
Establishing an Alternate Site
A company should provide access to an alternate facility for data center and work group space in a recovery strategy. There are several methods to obtain access to an alternate facility, depending on specific business requirements. For organizations with multiple facilities, space can be set aside at a remote data center or office for recovery of resources. Reciprocal agreements are also a common strategy to secure alternate facility space.
Care should be taken to ensure the facility is equipped to handle environmental requirements, such as space, power, HVAC, and so on. If there is no alternate location available internally, a third party vendor can be contracted to provide the facility. Many disaster recovery vendors provide facilities throughout the world. These facilities can consist of empty floor space or they might be fully populated with equipment.
For business functions with zero-downtime requirements, redundant mirrored facilities can be implemented. This is typically a high-cost strategy, but it can be warranted based on your BIA findings. Alternate facilities should not be located close to the production facility, because a regional disruption can impact both locations and render the business unrecoverable. For example, if a city is hit by a hurricane or snowstorm, having a redundant data center 10 miles away will not help.
Mobile facilities can also be provisioned or contracted for through a vendor. Mobile recovery introduces additional logistical concerns that should be considered up front. For example, guards might need to be hired if the physical security is not equivalent to the current facility.
Backup Hardware
Can back-up hardware be procured at time of disaster (ATOD), or does it need to be procured up front? This is the key question, and the business requirements answer it, because hardware is a key cost driver. Some businesses might accept a plan that requires hardware to be purchased at the time of disaster, whereas others require a complete inventory of back-up hardware at a cold site. Many hardware vendors offer quick-ship services where hardware can be delivered within 48 hours. Third party vendors also offer back-up hardware that can be contracted for in advance without having to make a capital investment. As stated previously, the strategy deployed depends on the RTO requirements. Because recovery priorities vary across business functions, a mixture of back-up hardware provisioning is often prudent.
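The procurement question reduces to a comparison between the RTO and the time needed to obtain and build the hardware. The sketch below is one way to express that check. The 48-hour default reflects the quick-ship figure mentioned above; the 8-hour build-out time and the function name are assumptions chosen only for illustration.

```python
from datetime import timedelta


def provisioning_for(rto,
                     ship_lead_time=timedelta(hours=48),
                     build_time=timedelta(hours=8)):
    """Suggest a provisioning approach for one business function.

    Hardware can wait until the time of disaster only if delivery plus
    build-out still fits inside the RTO.
    """
    if ship_lead_time + build_time <= rto:
        return "procure at time of disaster"
    return "procure or contract in advance"


print(provisioning_for(timedelta(hours=72)))  # procure at time of disaster
print(provisioning_for(timedelta(hours=24)))  # procure or contract in advance
```

Running the check per business function naturally produces the mixed provisioning described above, with only the tighter RTOs paying for hardware in advance.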
Technologies
Assuming the appropriate alternate facility is available, the next step is to deploy a strategy that addresses the RTO/RPO requirements. This section covers several approaches that support specific recovery time targets, along with the unique requirements and characteristics of each. Multiple approaches can be utilized across business functions to form an overall strategy.
Traditional Recovery
Traditional recovery typically consists of the standard tape-based restoration of systems. This is the most widely deployed strategy to recover back office resources. This strategy offers an RTO of at least 24 to 48 hours, although recovery might take longer, depending on the amount of data being restored. The RPO for this strategy is dictated by the age of the last full backup (for instance, up to 24 hours for shops doing daily backups).
Tapes must be transported to the recovery facility, which adds significant risk to this approach. If the tapes are lost or damaged, the entire recovery can fail. Businesses must maintain tapes off site and implement plans to ensure that they are safely transported to the recovery location in the appropriate time frame.
Before data can be restored from tape, the supporting infrastructure must be in place. This includes the backup environment. In addition, the OS, software, and backup agents must be installed prior to restoring the data. If this is completed at the time of the disaster, the recovery time should account for the build-out.
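A back-of-the-envelope estimate helps show why tape-based recovery rarely beats 24 hours. The sketch below simply adds the three sequential phases described above: tape transport, infrastructure build-out, and the restore itself. All of the input figures are hypothetical and should be replaced with measured values from an actual recovery test.

```python
def tape_recovery_hours(data_gb, restore_gb_per_hour,
                        transport_hours, build_hours):
    """Estimate elapsed recovery time when the phases run one after another."""
    restore_hours = data_gb / restore_gb_per_hour
    return transport_hours + build_hours + restore_hours


# Hypothetical inputs: 2,000 GB restored at 100 GB per hour,
# 4 hours to retrieve tapes, 8 hours to build OS and backup agents.
estimate = tape_recovery_hours(
    data_gb=2000,
    restore_gb_per_hour=100,
    transport_hours=4,
    build_hours=8,
)
print(estimate)  # 32.0
```

With these made-up numbers the restore alone takes 20 hours, and transport plus build-out push the total to 32 hours.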
Electronic Vaulting
Electronic vaulting is a relatively new recovery strategy that is rapidly gaining popularity. This technology shares characteristics with tape technologies, but it has several significant benefits in the context of recovery. There are no tapes, so the concern about transporting tapes to the recovery facility during a disaster is eliminated. Also, restores are typically much faster, as they occur over the wire (often at GigE speeds) and do not depend on a tape drive that can read data only in a linear fashion. With vaulting, data is stored on disk. This allows for random access, which greatly increases read/write performance. For similar reasons, vaulting can be utilized to replace production tape backup systems, which reduces backup windows and lowers overall TCO for backup environments.
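To put the GigE figure in perspective, the sketch below computes how long a restore over the wire would take from the link speed alone. It assumes a nominal 1 Gbit/s link and decimal units (1 TB = 10^12 bytes); the `efficiency` parameter is a hypothetical knob for protocol overhead and contention, not a measured value.

```python
def wire_restore_hours(data_bytes,
                       link_bits_per_second=1_000_000_000,
                       efficiency=1.0):
    """Hours needed to move data_bytes over a link at the given efficiency."""
    if not 0 < efficiency <= 1:
        raise ValueError("efficiency must be in the range (0, 1]")
    seconds = data_bytes * 8 / (link_bits_per_second * efficiency)
    return seconds / 3600


one_terabyte = 10**12
print(round(wire_restore_hours(one_terabyte), 2))                  # 2.22
print(round(wire_restore_hours(one_terabyte, efficiency=0.5), 2))  # 4.44
```

Even at half of line rate, one terabyte arrives in under five hours, and no time is spent locating and shipping media first.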
Stand By OS
Stand by OS provides a remote secondary system that is up and running with the operating system. It does not contain all application or configuration data. This strategy reduces RTO because it eliminates the OS installation and allows recovery efforts to focus on restoring applications and data. Stand by OS can be used in conjunction with traditional recovery to lower RTO by several hours.
On the downside, Stand by OS requires a dedicated secondary facility, and appropriate backup hardware must be procured and maintained. This approach is often ruled out because, when these conditions exist, more desirable options are usually within budgetary reach. Third party vendors can often be utilized to achieve this at a fraction of the cost of an internal solution.
Remote Journaling & Database Shadowing
These techniques utilize database-level technologies to move data off site. Although there is often no tape restoration necessary, transactions must still be restored at the database level. Shadowing has an advantage over journaling because it does not require database restoration before recovery of shadowed transactions. These strategies require live backup hardware and network connectivity between production and the hot site.
Replication, Mirroring and Clustering
Business functions requiring a more aggressive RTO of 0 to 12 hours need to utilize a more advanced recovery strategy. These strategies are often based on replication, mirroring, and clustering. These technologies transport data to the backup facility in a real-time or near-real-time fashion. They consist of a process that replicates disk writes to the remote storage device. Data replication can occur at the storage level or at the host level.
Data replication alone can consist only of remote storage containing the copied data. This storage can be connected to systems for recovery and testing purposes only. Storage level replication technologies include EMC SRDF, IBM Global Mirror, and Hitachi Universal Replicator. Additional recovery time is necessary to attach systems to the storage and bring them online. Mirroring takes it a step further and consists of running systems with OS and application data. Mirroring still requires some steps to prepare the backup environment for production. These might include changes to application configuration, DNS, or network connectivity. Clustering consists of a true redundant environment (N+1) that is capable of recovering automatically or at the flip of a switch. Clustering is the highest level of disaster recovery, although it also carries a hefty price tag. It is normally reserved for mission critical systems.
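The RTO figures given in this section can be turned into a simple first-pass filter. Only two thresholds come from the text: replication, mirroring, and clustering for RTOs of 12 hours or less, and traditional tape recovery for RTOs of 24 hours or more. Placing the remaining techniques in the band between them is an assumption made for this sketch, and the function name is hypothetical.

```python
from datetime import timedelta


def candidate_strategies(rto):
    """Return a first-pass list of strategies worth evaluating for an RTO."""
    if rto <= timedelta(hours=12):
        # Stated in the text: aggressive RTOs need these technologies.
        return ["replication", "mirroring", "clustering"]
    if rto < timedelta(hours=24):
        # Assumption: the intermediate techniques fill the gap.
        return ["electronic vaulting", "stand by OS",
                "remote journaling", "database shadowing"]
    # Stated in the text: tape recovery offers at least 24 to 48 hours.
    return ["traditional tape recovery"]


print(candidate_strategies(timedelta(hours=4)))
print(candidate_strategies(timedelta(hours=72)))
```

A filter like this only narrows the field. The final choice still depends on the RPO, on cost, and on a tested estimate of the actual recovery time.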
Diagram: Recovery Strategies and RTO/RPO Considerations
The following chart depicts the discussed strategies with their respective RTO / RPO:

Summary
Given the types of natural and man-made threats that exist in today's world, it is clear that businesses cannot afford to ignore Business Continuity and Disaster Recovery. Before the development of a recovery strategy, business function owners should quantify and approve requirements, including RPO and RTO. An effective recovery strategy must be in lock step with these requirements. Regardless of the strategy deployed, there are fundamental recovery components that need to be addressed, including facilities, network connectivity, work group space, and so on. Many proven technologies are available to integrate into a strategy, although the technology should ultimately be selected based on its ability to address the business objectives. If diligence is applied in advance, the appropriate components can be developed and implemented to provide an effective recovery strategy for your business.

