Getting back to business after a major physical or communications disruption without any repercussions is unlikely. But you can minimise the after-effects by building in plenty of backup capability and having a comprehensive disaster recovery plan.
The much-ballyhooed, somewhat delayed e-government portal will change its function in the event of a major natural disaster. Front-ending government websites that in many cases aren’t functioning would be pointless; instead, the portal will become the first public port of call for information about the disaster’s impact and progress with rescue and remedy.
“In that case we’re not interested in preserving business as usual,” says portal business manager Kent Dustin. “We’ll be giving civil defence [an information channel] to work with”, assisting the centre of the disaster with business continuance in the widest sense.
The portal is based on a replicated network of more than 20 web servers, applications servers and database servers, run by Datacom from replicated centres in Wellington and Auckland. The workload is dynamically balanced between the two, with typically about two-thirds of the capacity in Auckland. In the event of a large-scale failure in one centre, everything naturally fails over to the other. “We will lose capacity, but not capability,” Dustin says.
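The routing behaviour described above can be sketched in a few lines. This is an illustrative model only, not the portal's actual configuration: the centre names and the two-thirds/one-third weighting come from the article, but the function, its health-check interface and the weights' exact values are assumptions for the sake of the example.

```python
import random

# Hypothetical sketch of the workload split described above: traffic is
# weighted roughly two-thirds toward Auckland, and if either centre fails
# its health check, everything fails over to the other.
CENTRES = {"auckland": 0.66, "wellington": 0.34}

def choose_centre(healthy: dict, rand=random.random) -> str:
    """Pick a centre by weight, failing over when one centre is down."""
    up = [c for c in CENTRES if healthy.get(c)]
    if not up:
        raise RuntimeError("no centre available")
    if len(up) == 1:           # large-scale failure: full failover
        return up[0]
    # normal operation: weighted dynamic balancing between the two
    total = sum(CENTRES[c] for c in up)
    r = rand() * total
    for c in up:
        r -= CENTRES[c]
        if r <= 0:
            return c
    return up[-1]
```

The point of the sketch is Dustin's distinction: losing a centre reduces capacity (one site carries the full load) but not capability (every service still has a home).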
The system is over-engineered, with plenty of resilience and spare capacity, says e-government senior adviser Mark Harris.
A plan of defence
As more and more businesses become reliant on their information technology to maintain client response and competitive edge, so well thought-out procedures for recovering from mishaps to the computer facility become more important.
New Zealand’s proverbially unstable geology has always been an occasion for fears of major digital information disruption, and we are now as subject as any other country to the rising tide of viruses and other electronic attacks, with the so far unrealised threat of “cyberterrorism” looming on the horizon.
Yet disaster preparedness still seems patchy, with some organisations evidently taking a simplistic attitude to the problem. There is no visible coherent IT disaster recovery lead from the top in government either, as the process of government becomes ever more wedded to electronic channels.
The business of backing up hardware, communications links, and power supplies is comparatively well understood, Dustin says. What are often given less attention are the “people” aspects — how well staff know their individual responsibilities and courses of action in a disaster.
The portal support staff are in the process of planning these procedures. A back-up member of staff may not immediately know what has happened in a remote centre, but should know when the lack of service has escalated to the extent that that person is required to take control. All relevant contact details should be available to everyone, and the planning should bear in mind that most people’s first priority will be to contact home, reassure family members and check on their safety, alongside their duties in the workplace.
In the absence of information, people should still be able to behave predictably, Dustin says.
People and process
The computer side — as against the people side — is one way of drawing a distinction between the intertwined practices of disaster recovery (primarily the former) and business continuity (primarily the latter). In the wake of the September 11 terrorist attacks, many businesses in affected buildings “recovered” quickly in computer systems terms, but still failed in the longer run because the people aspects were not properly planned.
The other side of disaster, of course, concerns the more regular dangers of attacks from viruses, denial of service (DoS) and internet failure.
In only three or four years DoS and virus attacks have risen from the status of a rare phenomenon to what Dustin calls “a constant background noise”, rising to a peak “on a reasonably frequent basis”.
“Datacom does well at identifying and controlling those issues,” says Dustin. “They have a bunch of software tools to track patterns in the way the network is behaving.” This enables the firm to forestall DoS attacks and the like. “And they keep up to date on the CERT [the international Computer Emergency Response Team] reports [on vulnerabilities] and information like that.” This was one of the reasons Datacom was chosen to host the system and why the portal operators are relieved to “leave the paranoia to Datacom” on the hardware and software continuity side.
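Datacom's actual tools aren't described in any detail, but one common way to "track patterns in the way the network is behaving" is a sliding-window request counter that flags a source exceeding a rate threshold. The class below is a minimal sketch of that idea; the name, window and threshold are invented for illustration and imply nothing about the real tooling.

```python
from collections import deque

# Illustrative only: a sliding-window rate monitor of the kind used to
# spot flood-style denial-of-service traffic before it peaks.
class RateMonitor:
    def __init__(self, window_seconds: float = 10.0, threshold: int = 100):
        self.window = window_seconds      # how far back to look
        self.threshold = threshold        # requests allowed in the window
        self.hits = {}                    # source -> deque of timestamps

    def record(self, source_ip: str, now: float) -> bool:
        """Record a request; return True if the source looks like a flood."""
        q = self.hits.setdefault(source_ip, deque())
        q.append(now)
        while q and q[0] < now - self.window:   # drop events outside window
            q.popleft()
        return len(q) > self.threshold
```

A real deployment would combine signals like this with vulnerability intelligence — the CERT advisories Dustin mentions — rather than relying on rate alone.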
But on the people side, training is a key need, Harris says, and this means having skills tested, rather than just being able to put a tick in the box to say staff have been on a course. Planning for various scenarios, and running simulation exercises based on them, is important.
The world of distributed systems has begun to learn the lessons known and put into practice for decades among the mainframe people, Dustin says.
Because PCs were initially not mission-critical systems and gradually became so, it has taken the distributed systems people time to catch up with the need for disaster recovery and become skilled in the art.