You have a problem and you need to switch to your alternative off-site facility. But will it be the seamless transition you planned for? Jared Landin discusses how you can avoid unpleasant surprises.
In considering business continuity strategies and plans, it is almost inevitable that the question of off-site facilities, sometimes known as DR (disaster recovery) sites or secondary locations, comes up. The thought of a large vacuum of wasted warehouse-style space (and money) in a remote location often follows. In some cases this may be true, but certainly not always. In fact, depending on an organisation's strategic approach towards recovering its mission-critical business processes, its IT systems and its budget constraints, some off-site facilities could be in the centre of a business district, fully fitted with state-of-the-art technology, and actively managed 24/7 by a team of business continuity and IT professionals.
Off-site facilities are business-owned, leased, or vendor-provided sites designated for the timely recovery of business operations and IT systems in the event of a full or partial loss of critical components at the primary site. Off-site facilities are more often than not geographically separated from the normal office location to minimise the chance of both sites being crippled by the same event. This separation, however, by no means guarantees a successful recovery, since we have seen many threats that could affect multiple locations, such as the outbreak of an epidemic, hurricanes, tsunamis, computer viruses or acts of war. Therefore there is no hard and fast rule of minimum distance between primary and secondary sites, although common sense dictates that the two sites should not share the same power grid or telecommunication exchange, nor inhabit the same environmentally vulnerable zones.
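The common-sense separation test described above can be sketched as a simple comparison of site attributes. This is a minimal illustration only: the `Site` record, its field names and the example values are hypothetical, and a real assessment would draw on utility and hazard data rather than hand-entered labels.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Site:
    # Hypothetical record; field names are illustrative assumptions.
    name: str
    power_grid: str
    telecom_exchange: str
    hazard_zones: frozenset  # e.g. {"river flood plain"}


def shared_dependencies(primary: Site, secondary: Site) -> list:
    """List what the two sites have in common; an empty list means none found."""
    shared = []
    if primary.power_grid == secondary.power_grid:
        shared.append(f"power grid: {primary.power_grid}")
    if primary.telecom_exchange == secondary.telecom_exchange:
        shared.append(f"telecom exchange: {primary.telecom_exchange}")
    # Any hazard zone that both sites sit in is a common point of failure.
    for zone in sorted(primary.hazard_zones & secondary.hazard_zones):
        shared.append(f"hazard zone: {zone}")
    return shared


if __name__ == "__main__":
    hq = Site("HQ", "grid-A", "exch-1", frozenset({"river flood plain"}))
    dr = Site("DR", "grid-B", "exch-1", frozenset({"river flood plain", "coastal"}))
    for item in shared_dependencies(hq, dr):
        print("Shared", item)
```

An empty result does not prove the sites are independent; it only shows that none of the listed dependencies overlap.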
It is important to note that although greater geographic separation tends to reduce external risks, it must be balanced against problems of accessibility and cost. There is no point in having a recovery site so remote that staff members, suppliers and customers cannot reach it (especially in the event of a larger-scale disaster) in time to meet the defined recovery time objective (RTO - the maximum allowable time for a process or system to recover).
Some global organisations have processing capabilities which are so time- and mission-critical that they operate with revolving primary and secondary sites across continents. In other words, each of three sites takes its turn at being the primary site for eight hours of the day (during local daylight), while the other two fully mirrored, or replicated, sites serve as the off-site facilities should any disruption occur to the primary.
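The revolving arrangement can be illustrated with a small schedule lookup. This is a sketch under stated assumptions: the three site names and the hand-over hours (00:00, 08:00 and 16:00 UTC) are hypothetical, and a real rotation would be set by each region's daylight hours.

```python
from datetime import datetime, timezone

# Hypothetical sites, in the order they take over as primary.
# Assumed hand-over hours: 00:00, 08:00 and 16:00 UTC.
ROTATION = ["Asia-Pacific", "Europe", "Americas"]
SHIFT_HOURS = 8


def roles_at(moment: datetime) -> dict:
    """Return which site is primary and which are standby at the given time."""
    if moment.tzinfo is None:
        raise ValueError("pass a timezone-aware datetime")
    hour = moment.astimezone(timezone.utc).hour
    primary = ROTATION[hour // SHIFT_HOURS]  # 0-7, 8-15, 16-23
    standby = [site for site in ROTATION if site != primary]
    return {"primary": primary, "standby": standby}


if __name__ == "__main__":
    print(roles_at(datetime.now(timezone.utc)))
```

At any moment exactly one site is primary and the other two are available as off-site facilities for it.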
Choosing a site
There are many ways an organisation can recover from major disruptions and as a result many types of recovery sites, each with its own pros and cons. These in turn drive the organisation's decisions in its recovery site selection processes.
From a technology perspective, a fully mirrored site provides an exact duplicate of systems and data, which enables uninterrupted processing in the event of a disaster. Fully mirrored sites allow all critical systems to switch from the primary site to the secondary site almost instantly. However, this type of recovery site is very costly to build and maintain and can only be justified if the nature of the underlying business process requires an extremely short RTO (under 10 minutes). If fully mirrored sites are adequately controlled they are highly resilient, but they are also very susceptible to technology-based threats, such as data corruption and computer infections, because these threats are often immediately replicated at the mirrored site.
'Hot' sites are less costly than fully mirrored sites, but they are by no means less complex to control. Organisations fail in recovery tests and in real-life recovery, because they falsely assume that since the sites are called 'hot', they must be ready to go. All too often they only find out on the day that there are not enough seats to conduct business; all the desktop computers are still running older versions of operating software making applications inoperable; the virus definitions and patches are not current, posing security risks; the recovery networks are not compatible with the production ones, which means systems cannot talk to each other; backup tapes which contain all the critical transaction details cannot be restored as different drive technology has been implemented - and the list goes on.
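The surprises listed above lend themselves to a routine comparison between the production environment and the hot site. The sketch below is illustrative only: the dictionary keys and sample values are hypothetical, and simple equality is used as a stand-in for real compatibility testing.

```python
# Attributes compared one-for-one between production and the hot site.
# Key names are illustrative assumptions, not a standard schema.
COMPARED_KEYS = ("os_version", "virus_definitions", "network_standard", "tape_drive")


def readiness_gaps(production: dict, recovery: dict, seats_needed: int) -> list:
    """Return a list of mismatches that could stop the hot site being usable."""
    gaps = []
    if recovery["seats"] < seats_needed:
        gaps.append(f"seats: {recovery['seats']} available, {seats_needed} needed")
    for key in COMPARED_KEYS:
        if recovery[key] != production[key]:
            gaps.append(
                f"{key}: recovery has {recovery[key]!r}, production has {production[key]!r}"
            )
    return gaps


if __name__ == "__main__":
    production = {"os_version": "11", "virus_definitions": "2024-05-01",
                  "network_standard": "std-B", "tape_drive": "gen-8"}
    recovery = {"seats": 40, "os_version": "10", "virus_definitions": "2024-05-01",
                "network_standard": "std-B", "tape_drive": "gen-6"}
    for gap in readiness_gaps(production, recovery, seats_needed=60):
        print("GAP", gap)
```

Running a check of this kind on a schedule, rather than only on the day of a test, is one way to avoid assuming that 'hot' means ready.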
Hot sites generally cater for RTOs of less than 12 hours. They are fully equipped with IT, equipment, office space and facilities, and are designed to allow subscribers to continue critical operations for at least 90 days (although contract terms vary) while they prepare for medium to long-term resumption of business. This could be through rebuilding damaged facilities, conversion of non-impacted buildings, or moving to cheaper commercial recovery sites such as partially-equipped 'warm' sites or minimally-equipped office space known as 'cold' sites, which could take two to three days to become operational.
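As a rough aid, the RTO figures quoted for each site type can be turned into a first-pass selection rule. This is a sketch, not a recommendation. It assumes the figures act as band boundaries, that the upper 'three days' figure applies to warm and cold sites, and that a hot site fills the gap between 12 hours and three days. Cost, budget and risk appetite are deliberately left out.

```python
from datetime import timedelta

MIRRORED_LIMIT = timedelta(minutes=10)  # mirrored sites justified for RTOs under 10 minutes
WARM_COLD_READY = timedelta(days=3)     # warm/cold sites may need up to three days (assumed upper bound)


def suggest_site_type(rto: timedelta) -> str:
    """Suggest the least elaborate site type assumed able to meet the RTO."""
    if rto <= timedelta(0):
        raise ValueError("RTO must be positive")
    if rto < MIRRORED_LIMIT:
        return "fully mirrored"
    if rto < WARM_COLD_READY:
        # Covers the under-12-hour band and, by assumption, RTOs too short
        # for a warm or cold site to be made operational.
        return "hot"
    return "warm or cold"


if __name__ == "__main__":
    for rto in (timedelta(minutes=5), timedelta(hours=8), timedelta(days=5)):
        print(rto, "->", suggest_site_type(rto))
```

The output is a starting point for discussion; contract terms and tested recovery times should override any rule of thumb.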
As well as becoming recovering organisations' medium term operating sites, warm and cold sites are also suitable in the short term for recovering non-critical and supporting functions of the business, or if an organisation has a relatively long RTO. In today's business environment, however, they are less popular, as rapid development of mobile technology and high-speed internet now allows a large number of workers to effectively operate and communicate from wherever they are, making the use of these sites less of a necessity.
Many business continuity solution providers support 'on-demand' contracts which are used in conjunction with warm and cold sites. These contracts provide for timely delivery of a wide variety of up-to-date hardware, ranging from basic fax, phone systems, printers, servers and PCs, to mobile air conditioning and even power generators. As a result organisations no longer need to waste their budgets on depreciating redundant equipment which is unused, except during disasters. Such contracts also offer greater location flexibility, as the delivery location for the equipment can often be nominated at the time of invocation. Dependability and timeliness of delivery could create major issues for organisations, however, as service levels range from being guaranteed to 'best efforts'. Therefore it is essential to understand the contract terms and the risks.
Many of the key risks relevant to the internal auditor in providing assurance about recovery sites have been touched on above. However, no site review would be complete without considering governance. Poor governance could lead to the selected recovery site being unable to deliver the level of service required by the organisation, its customers, suppliers and regulators, thereby causing serious reputational and financial loss during contingency events. Moreover, the recovery site's failure to meet the baseline standards and controls of the subscriber (for example, information security) could also expose the business to unnecessary risks when it is already in a high state of risk due to a disaster.
In auditing the governance of a recovery site, especially if the vendor relationship is new, the auditor could begin with the due diligence in selecting the provider. At a high level this includes evaluation of the vendor's technical and industry expertise, operations and controls and financial condition.
Technical and industry expertise covers areas such as currency and design of technology, effectiveness of site invocation methodology, depth of experience, use of third party experts, ability to respond to unforeseeable events, other customers' testimonials and on-site visits to verify the site's operation and support. Operations and controls include performing a gap analysis of the vendor's standards, policies and procedures compared to the organisation's baseline, facilities and resource management, privacy protection, security (physical and data), records retention and even employee background checks. The vendor management's knowledge of relevant regulations and insurance coverage, including fire and property damage, liability, data losses, and so on, should also be evaluated. To ensure the vendor does not have a going-concern issue, its financial condition should be analysed in detail, including the most recent audited financial statements, credit ratings (if available), market share, dependence on any one customer, and prior years' budgets for technology and facilities investments.
Reviewing the contract
Once due diligence has been completed, the single most powerful yet complex control which governs the site vendor relationship becomes the contract. As a result internal audit should ensure that all key control aspects have been documented explicitly in the contract and that it has been reviewed and approved by the subscriber's legal department and senior management prior to contract finalisation.
Sections which should be reviewed include the scope of service, which describes the rights and responsibilities of all parties to the contract. The performance standards stipulated (the service level agreement) should meet the subscriber's overall recovery objectives, especially the RTOs. Furthermore, terms concerning security, confidentiality and ownership of intellectual property should be specifically documented, including the vendor's obligations to report any potential breach and to maintain adequate internal controls to the agreed standards and regulations, and the subscriber's right to access the vendor's audit reports (for example, SAS 70). The subscriber and its audit team should have rights to inspect and audit the vendor and to receive the vendor's performance reports, security reports and financial statements in a timely manner. The contract should oblige the vendor to maintain its own business continuity planning and resiliency of operation and to notify the subscriber of any significant changes to sub-contractors, and should prohibit the vendor from assigning the contract to a third party without the subscriber's explicit consent.
On the financial side, full descriptions of fees and cost calculations for all services and limitations to cost increases should be set out in detail and approved by the finance and procurement departments. Contract duration and termination need to be well thought out and documented. To mitigate litigation risks and ensure continuation of service during a dispute, dispute resolution procedures should be agreed and indemnification should be in place to prevent the subscriber from becoming liable for claims as a result of the vendor's actions.
Many organisations fall into the trap of treating their recovery site as a 'once in a while' test exercise which requires minimal 'business as usual' supervision. By doing so they ignore the fact that the service provider could become insolvent, under-staffed or be operating in a substandard manner which could potentially compromise the subscriber's business. Again the internal auditor plays the important role of identifying these risks, by independently verifying the vendor's financial condition, operations and internal controls, and by ensuring that the right controls are regularly monitored by the vendor manager and that issues are appropriately challenged.
Jared Landin is director of internal controls, Jefferson Wells; Tel: 0870 145 4343, E-mail: firstname.lastname@example.org