Even a high-profile, internationally regarded provider of services to heavyweight organisations, such as government agencies and banks, gets caught out occasionally.
EDS was caught unawares by power failures at least twice within a month of each other, in September and October 2001.
The two events were unrelated, says Phil Faithfull, of EDS's business continuity management team, but have led the company — which already had plenty of redundancy in its equipment and a detailed disaster recovery plan — to change some implementation policies.
“The first event on September 12 was an accident caused by a contractor,” says Faithfull. “The second event was during the commissioning of a new UPS system. After both these events we did a massive amount of analysis on our total, diverse client base and determined, in partnership with the clients, the ‘best’ time of day to undertake potentially disruptive risk-related work. We built a complex matrix that categorises the degree of risk associated with a myriad of different activities, and from there we can determine the appropriate time of the day or night to undertake work. All our high-risk work is now done between 1am and 5am on a Monday morning, which is the quietest processing time of the week,” he says.
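The risk matrix Faithfull describes — categorise each activity by risk, then map that risk to an approved work window — can be sketched roughly as below. All activity names, risk levels and windows here are hypothetical illustrations, not EDS's actual scheme; only the 1am–5am Monday high-risk window comes from the article.

```python
# Hypothetical sketch of a risk-categorisation matrix: each maintenance
# activity is assigned a risk level, and each risk level maps to an
# approved work window. Names and categories are illustrative only.

RISK_LEVELS = {
    "ups_commissioning": "high",
    "power_switchover": "high",
    "os_patching": "medium",
    "log_rotation": "low",
}

# High-risk work is confined to the quietest processing time of the week
# (1am-5am Monday, per the article); lower-risk work gets wider windows.
WORK_WINDOWS = {
    "high": "Monday 01:00-05:00",
    "medium": "any night 01:00-05:00",
    "low": "business hours",
}

def approved_window(activity: str) -> str:
    """Return the approved work window for a given activity.
    Unlisted activities are treated as high-risk by default."""
    risk = RISK_LEVELS.get(activity, "high")
    return WORK_WINDOWS[risk]

print(approved_window("ups_commissioning"))  # Monday 01:00-05:00
```

Defaulting unknown work to the high-risk window is one conservative design choice a scheme like this might make; the real matrix would presumably be far more granular.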
“We also learned that despite the absolute best in terms of external consultant help, work planning and risk mitigation planning, things can [still] go wrong. We now plan around a worst-case view and work back from there. So, if the unthinkable happens, the impact to our client base is minimised.”
Higher expectations, raised overheads
“All this has added a lot of cost to the maintenance of EDS’s critical facilities,” says Faithfull, “but that cost is small in terms of the potential impact [of a failure] on our business, our clients’ business and the public’s perception of IT.
“We have learned that a short power outage can cause major disruption to client services, particularly with customer facing applications.”
The highest-priority categories of threat “are those most likely to occur”, Faithfull says. “These are probably environmental issues, such as failures of a building’s power and cooling, fire or flood, denial of access, data corruption (accidental or deliberate), and bomb threats.
“More serious would be those probably less likely to occur, such as a regional seismic or volcanic event or storm, and terrorism or sabotage. However, EDS, like any IT service provider, must be ready to cope with any disaster regardless of the cause,” he says.
He considers EDS well equipped to deal with disaster, as it operates six data centres and another six service centres throughout New Zealand, with redundant network capability.
“Recovery is therefore possible even for those customers that do not have formal DR plans,” Faithfull says. EDS’s needs in disaster prevention and recovery are similar to those of other IT service providers.
Planning to meet every possible scenario is impracticable, he says, so DR plans are prepared and exercised on the basis of the worst possible scenario – the total loss of a data centre. Recovery plans cover “the critical platforms housed there, network connectivity to customers and support staff availability”.
“Any lesser disaster can normally be coped with. EDS would only relocate processing to an alternate data centre if the outage was likely to be an extended one,” Faithfull says.
“At the data-centre level, planning and recovery is similar to other EDS data centres worldwide, utilising similar standards, practices and tools, and EDS NZ can utilise the company’s global DR experience.”
However, New Zealand clearly presents higher geological risk than some other countries. “We have to be aware of the seismic and volcanic risks associated with any one area, and of course some areas are more risk-prone than others. EDS has several data centres in the Auckland area, which is generally accepted as ‘safer’ than Wellington.”
Incorporating the new
Disaster recovery is co-ordinated by EDS’s Asia-Pacific business continuity management team, based in both New Zealand and Australia, which creates, tests and reviews plans across the Asia-Pacific region. “An annual DR test and plan review is the norm, although some customers require testing twice annually,” Faithfull says.
“As new architectures are introduced, plans are updated. For example, there are now many different methods of replicating data to the backup site, where once tape backups were the only option. Data loss is greatly reduced when these newer techniques are utilised.
“Client expectations for recovery time and data loss have increased as more of their applications become customer-facing. This in turn has pushed up the cost of DR, in line with the increased level of sophistication.”
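Faithfull’s point that newer replication techniques greatly reduce data loss can be sketched as follows. The strategies and intervals below are hypothetical examples, not EDS figures: the worst-case loss is simply everything written since the last copy reached the backup site.

```python
# Hypothetical illustration of how copy frequency bounds worst-case data
# loss (the "recovery point"): if the primary site fails an instant before
# the next copy would have been taken, everything since the last copy is lost.
# The strategy names and intervals are examples, not EDS's actual setup.

def worst_case_loss_minutes(copy_interval_minutes: int) -> int:
    """Upper bound, in minutes, on data lost at the backup site."""
    return copy_interval_minutes

strategies = {
    "nightly tape backup": 24 * 60,       # one copy per day
    "hourly log shipping": 60,            # ship transaction logs every hour
    "async replication (5 min lag)": 5,   # near-continuous replication
}

for name, interval in strategies.items():
    print(f"{name}: up to {worst_case_loss_minutes(interval)} minutes of data lost")
```

The same logic explains the cost pressure Faithfull mentions: shrinking the copy interval from a day to minutes requires dedicated links and replication software rather than a van carrying tapes.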
When it comes to planning, EDS client managers liaise with customers before the specialist DR team come in to analyse their business needs and, in conjunction with EDS’s technical teams, create and test the required DR capability, Faithfull says.
“Service levels to be invoked in a DR situation are negotiated with each client, based on the business need. Service managers liaise with clients during DR situations and tests to ensure that client involvement and satisfaction are achieved.
“EDS’s situation management team co-ordinates all major production issues, and ensures that the correct teams are informed and involved. EDS has been fortunate in that apart from the recent power issues at Mt Wellington it has not suffered a major data centre outage,” Faithfull says.
From time to time service interruptions do occur with cheque processing and printing, he says, and this is circumvented by relocating the work to another of the six service centres nationwide, generally without breaking the service level agreements (SLAs). The six centres all operate identical hardware and processes, and are flexible in approach, he says.
“For example, when all the cheque-reading transports at the Auckland Service Centre failed in March 2001, the workload was redirected to Hamilton and Wellington. Data entry staff from several locations were then able to log into the recovery sites to assist with the keying functions. All work was processed within the SLAs.
“In May 2001 we came close to having to evacuate an Auckland data centre, due to the possible threat of a localised gas leak in a nearby street.
“Fortunately the gas situation was resolved before this was necessary. If this had occurred EDS would have relocated operations and technical staff to another Auckland building and continued to operate the systems from there.”
While EDS provides disaster recovery to clients who are contracted for these services, he says all its clients are recovered on a “best endeavours” basis at another EDS site using the offsite copies of data that EDS holds, regardless of whether formal DR is in place.
“However, recovery on this basis could take a considerable time, depending on the severity of the disaster and the availability of replacement hardware.”
Faithfull appears bemused by companies that fail to take disaster recovery seriously.
“Most major organisations would be critically impacted by the loss of their core application platforms, yet there are still some who do not have a structured disaster recovery strategy and plans in place, and who seem to accept the risk that goes with that approach.”