If fire, terrorists or hostile aliens zap your data centre, can you be back online before your business collapses? A good IT disaster recovery plan will be vital, as Carl Bradbury explains.
Writing and testing a disaster recovery plan is one of the key elements of business continuity management. Traditionally, business continuity and disaster recovery (DR) planning have been split between the business and the information technology (IT) department. It has long been recognised that this divide creates more problems than it solves. After all, most businesses could not continue to operate successfully if their IT services were unavailable for any length of time. Depending on the nature of the business, the tolerable outage may range from a few hours to several days.
The recent launch of BS 25999 has established a Business Continuity Management (BCM) standard which intrinsically links BCM, Incident Management, and IT DR. Essentially the key message is that to have true business continuity you must also have strong IT DR capability.
An IT disaster recovery plan should interface with the overall business continuity management plan, be clear and concise, focus on the key activities required to recover the critical IT services, be tested, reviewed and updated on a regular basis, have an owner, and enable the recovery objectives to be met.
Recovery objectives
The two key recovery objectives with which many people are familiar are:
■ the recovery time objective: how long can my business continue to function without the critical IT services (how quickly must I recover the service from the ‘decision to invoke’)?
■ the recovery point objective: from what time in my processing cycle am I going to recover my data (how much data am I prepared to lose or have to re-enter from an alternate source)?
There are several options:
■ zero data loss, recovery to the point of failure
■ start of the current business day (SoD)
■ end of the previous business day (EoD)
■ intraday, a point between the last available backup (either SoD or EoD) and the failure, for argument’s sake midday
■ period end, the weekly or monthly backup.
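To make the trade-off between these options concrete, the sketch below estimates how much work would have to be re-entered for a given failure time. It is only an illustration: the business hours (09:00 to 17:30), the option labels, the function names and the failure time are all assumptions, the previous calendar day is treated as the previous business day, and the period-end option is left out because it depends on your own backup schedule.

```python
from datetime import datetime, time, timedelta

MIDDAY = time(12, 0)  # the "for argument's sake" intraday point


def recovery_point(failure, option, start_of_day=time(9, 0), end_of_day=time(17, 30)):
    """Return the moment to which data would be restored (hypothetical helper).

    Business hours are assumed values; the previous calendar day is treated
    as the previous business day, and the intraday point assumes the failure
    happens after midday.
    """
    day = failure.date()
    if option == "zero data loss":
        return failure
    if option == "intraday":
        return datetime.combine(day, MIDDAY)
    if option == "SoD":
        return datetime.combine(day, start_of_day)
    if option == "EoD":
        return datetime.combine(day - timedelta(days=1), end_of_day)
    raise ValueError(f"unknown recovery point option: {option}")


def data_loss_window(failure, option):
    """Time between the recovery point and the failure: the work that is lost."""
    return failure - recovery_point(failure, option)


# Assumed failure time: mid-afternoon on a working day.
failure = datetime(2008, 4, 2, 15, 30)
for option in ("zero data loss", "intraday", "SoD", "EoD"):
    print(f"{option}: up to {data_loss_window(failure, option)} of work to re-enter")
```

The further back the recovery point, the larger the window of work that the business must be prepared to lose or recreate from another source.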
A further measure is the maximum tolerable outage (MTO): the maximum time that a business can survive following the initial service interruption.
The recovery objectives must be based upon solid business requirements identified by the business impact analysis (BIA) process. Figure 1 shows the relationship between the incident starting, the reporting process, the investigation process, the decision-making process, and the recovery process. If the MTO is 12 hours and the IT DR process takes eight hours to perform from the invocation point, then the decision to invoke has to be made within four hours of the initial incident. Knowing this lead time is crucial to implementing an effective incident management and escalation process.
The recovery time objective is where most misunderstanding occurs between the business and IT department. The message from IT to the business is ‘of course we can recover services within your required recovery time’. The hidden proviso is – ‘assuming we start the recovery immediately the incident is detected’.
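The lead-time arithmetic in the 12-hour example can be written down as a minimal sketch. The function name `decision_window` is hypothetical; the figures are the ones used in the example.

```python
from datetime import timedelta


def decision_window(mto, recovery_duration):
    """Time available after the incident starts to decide whether to invoke.

    Hypothetical helper: the window is the maximum tolerable outage (MTO)
    minus the time the IT DR process takes from the invocation point.
    """
    window = mto - recovery_duration
    if window < timedelta(0):
        raise ValueError("recovery takes longer than the maximum tolerable outage")
    return window


# Figures from the example: MTO of 12 hours, recovery takes eight hours.
window = decision_window(timedelta(hours=12), timedelta(hours=8))
print(f"Decision to invoke must be made within {window} of the incident")
```

A negative window means the recovery objectives cannot be met even with an immediate invocation, which is exactly the gap the ‘hidden proviso’ conceals.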
DR plan objectives
Figure 2 shows a high-level incident management and DR invocation flow. The objective of a disaster recovery plan is to detail the key activities required to reinstate the critical IT services within the agreed recovery objectives. The most effective start point for any DR plan is the declaration of a disaster once an incident has been deemed serious enough that ‘forward fixing’ at the primary location is impractical, or is likely to result in an outage exceeding the maximum tolerable outage.
There are a number of common mistakes which organisations make when creating a DR plan. These relate to the level of detail the plans contain and the stand-alone nature of their construction.
Asking the right questions
What level of detail should the plan contain? The answer will depend on who you ask. The more people you ask, the greater the variety of replies you will receive. It is advisable to keep the IT DR plan as concise as possible and focus only on the key information required at the time of a disaster.
What information should the DR plan contain?
As a minimum, the plan should contain a statement detailing the scope and capability of the DR plan, exactly when the plan should be used and what consequences are covered. It is advisable to focus on the consequences of an incident rather than the cause.
Why focus on consequences rather than the cause? Is it really important why the data centre is destroyed? As far as the DR plan is concerned the answer is no. The same process and recovery stages will be followed regardless of the cause. The only relevant questions are: what is the impact? And can I realistically continue to host services from my primary site, or should I invoke, recover and resume the critical services at my secondary site?
What else do I need? You need a description of the key roles and responsibilities so that anyone assigned to a particular role in the recovery team understands what is required of them. Should you name individuals in the plan? Ideally, individuals who are expected to perform a particular role should already be aware that they are likely to be called upon and should have received the relevant training. It is advisable to record the names and contact details of individuals in the relevant section of the overall BCM plan rather than the DR plan. At the time of an incident, there is no reason why the names of the individuals involved cannot be entered into the recovery log as the ‘designated recovery manager’ or other predefined role. You also need a summary of the critical services, their recovery objectives and recovery priorities.
This information may be lifted from the business impact analysis (BIA) performed as part of the overall BCM process. Summarising these details in the invocation plan will remove the inevitable discussions at the time of the incident and provide a reference point for the recovery teams.
You should also include third party contact details, particularly those that may be required to assist in the recovery effort or those that provide recovery services, for example:
■ the secondary (DR) data centre service provider – you will need contact details, address, maps, and of course the invocation process and codes. It is advisable to invoke as soon as it becomes clear the incident is likely to become a disaster recovery situation. You can always ‘stand down’ if the incident can be forward fixed (some service providers may levy a charge for this)
■ your media handling company – are your disaster recovery tapes removed from your data centre and vaulted off-site? If so, you will want to arrange for them to be retrieved and sent to your recovery centre at the earliest opportunity
■ mobilisation of the recovery teams – what teams and individuals need to be contacted to recover the services? At this stage of the recovery, the incident management team will already know the extent of the incident and will have placed the recovery teams on standby (if not, you need to make sure you do this at the earliest opportunity). The plan should show teams and skills required, not individuals. Individual contact details have to be recorded somewhere.
It is normal practice to hold contact lists as part of the overall business continuity management programme. Rather than repeat the detailed contact information, the DR plan should reference the relevant sections in the BCM plan.
In respect of detailed recovery activities and sequence of events, including pre-requisites, dependencies, and responsibilities, what level of detail should you include in this section of the DR plan? This is very much down to personal choice.
However, as a minimum you should include:
■ the recovery process and flow of activities
■ high-level activities, for example, load operating systems, install application software, restore data, synchronise database, make configuration changes, post-recovery checks, open service to users
■ prerequisites and dependencies for each activity
■ responsibilities – who will perform each activity.
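One way to capture the flow, the dependencies and the responsibilities in a single place is a small dependency table from which a valid sequence can be derived. The sketch below uses the high-level activities listed above; the dependency links and the team names are assumptions made purely for illustration and would come from your own recovery design.

```python
from graphlib import TopologicalSorter  # standard library, Python 3.9+

# Each activity maps to the activities that must be complete before it starts.
# These dependency links are assumed for illustration only.
prerequisites = {
    "load operating systems": set(),
    "install application software": {"load operating systems"},
    "restore data": {"install application software"},
    "synchronise database": {"restore data"},
    "make configuration changes": {"install application software"},
    "post recovery checks": {"synchronise database", "make configuration changes"},
    "open service to users": {"post recovery checks"},
}

# Responsibilities are recorded as teams, not named individuals.
# The team names are hypothetical.
responsible_team = {
    "load operating systems": "server team",
    "install application software": "application team",
    "restore data": "storage team",
    "synchronise database": "database team",
    "make configuration changes": "application team",
    "post recovery checks": "service owners",
    "open service to users": "recovery manager",
}

# static_order() yields every activity after all of its prerequisites.
recovery_sequence = list(TopologicalSorter(prerequisites).static_order())

for step, activity in enumerate(recovery_sequence, start=1):
    print(f"{step}. {activity} ({responsible_team[activity]})")
```

Activities with no dependency between them, such as restoring data and making configuration changes in this assumed layout, may appear in either order and could be run in parallel by different teams.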
Should you include the detailed activities for installing an operating system or restoring a database?
The detailed recovery activities should be held locally by the team responsible for performing them. There are several reasons for this. The ‘how do I install Windows’ instructions will be used for business-as-usual activities, minor incidents and disaster recovery. The DR plan only needs to reference these documents. If you find it an absolute necessity to include them in your DR plan, then do so as an appendix and not in the main body of the document. Do not allow the key purpose of the DR plan to be lost in unnecessary or duplicated detail.
Testing the DR plan
IT DR testing should be performed on a regular basis. The exact frequency very much depends on your own organisational needs. However, it is usual for full deployment tests to be performed, as a minimum, on an annual basis. There are of course other trigger points, for example, a change in your infrastructure that affects your disaster recovery strategy. ‘What do I test?’ is probably the most common question asked, and the answer is simple. You test the plans, the process, the people, and the infrastructure: in fact, every component required to recover and resume your critical IT services.
What are the key objectives of a DR test? There are several and the main ones are:
■ exercise the recovery processes and procedures
■ familiarise staff with the recovery process and documentation
■ verify the effectiveness of the recovery documentation
■ verify the effectiveness of the recovery site
■ establish if the recovery objectives are achievable
■ identify improvements required to the DR strategy, infrastructure, and recovery processes.
The scope of a DR test will very much depend on the maturity of your DR strategy and capability. It is important to scope the test to stretch the objectives and success criteria of the previous test. For example, if this is your first test you may not want to have the entire user community scheduled to come in and perform lots of testing. You may wish to limit the scope to just IT staff and maybe a couple of ‘friendly users’ to test functionality. Depending on the complexity of your environment, it may take several tests to build confidence and perform a full deployment test.
Common DR testing mistakes are:
■ operating within your comfort zone, for example, recovering the servers you know you can recover while avoiding the more difficult components
■ not testing the recovery of a service but focusing on the hardware, systems and applications. Remember, a particular service may require several servers to be recovered. It may also require data held on local drives and network attached devices, and network connectivity from the data centre to the user
■ trying to achieve too much too soon and overstating your DR capability and readiness
■ not planning appropriately.
Testing and live invocation are very different. In a live invocation you do not have a live environment to protect. Consider the impact that testing may have on your live services. Engage with the appropriate people at an early stage; a full deployment test may take several weeks to plan.
Carl Bradbury is senior consultant, Siemens Enterprise Communications Ltd. Siemens Enterprise Communications will be exhibiting at the Business Continuity Expo and Conference at EXCEL Docklands on 2-3 April 2008. For further information visit www.businesscontinuityexpo.co.uk