Every organisation should have a simple “paint by numbers” plan for disaster recovery, says Chris Corke, chief information officer for the New Zealand Stock Exchange.
At a fraught time when everyone’s yelling and things have to be done quickly, you don’t want to have to make sudden, on-the-fly decisions, he says.
Corke talked disaster recovery and its counterpart, business continuity, at a recent breakfast meeting of the New Zealand Computer Society in Wellington.
Business continuity, he says, involves ensuring as far as possible that “the world keeps on turning” when a fault occurs. Disaster recovery is what you do when business continuity fails.
“For business continuity, you identify weak spots and architecturally design [ways] to manage and communicate information ... to get around failure at any of those spots.” This plan should be reviewed in advance by those in touch with the detail involved at every stage, says Corke.
“Don’t forget the little things,” he says. This was a persistent theme of his address. Apparently, it’s the insignificant things that often trip people up when it comes to trying to keep the business going or helping it recover.
The NZX has had its share of failures. One major ICT mishap happened two days after Corke took over as CIO in 2003, and it uncovered some failings in the NZX’s own procedures. But it is probably unfair to blame the NZX for its more recent stoppage, which happened as a consequence of the infamous “rat and posthole-digger” incident that took out a large part of Telecom’s network, affecting much of the country. A lot of organisations were affected, but because the NZX provides such a high-profile service it came in for a lot of criticism.
The experience illustrates the necessity of being alert to even small risks, says Corke. The chance of a total telco network collapse may be remote but “so is the chance of winning Lotto and people do that every week”.
Data accuracy also needs to be an important part of the disaster recovery plan, says Corke. An ongoing effort needs to be made to keep both primary and backup data “clean”.
A recovered system with corrupt data loaded from the backup store “is as much use as a chocolate teapot”, he says.
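Corke doesn’t describe how the NZX checks its backups, but one common way to keep primary and backup data “clean” is to routinely compare checksums of the two copies. The sketch below is purely illustrative; the directory layout and function names are hypothetical, not anything the NZX has described.

```python
# Illustrative sketch only: compare primary and backup copies of a data
# set by SHA-256 digest, so corruption is caught before a disaster, not
# during one. Paths and layout are hypothetical.
import hashlib
from pathlib import Path

def file_digest(path: Path) -> str:
    """Return the SHA-256 hex digest of a file, read in chunks."""
    h = hashlib.sha256()
    with path.open("rb") as f:
        for chunk in iter(lambda: f.read(65536), b""):
            h.update(chunk)
    return h.hexdigest()

def find_mismatches(primary_dir: Path, backup_dir: Path) -> list[str]:
    """List files whose backup copy is missing or differs from the primary."""
    mismatches = []
    for primary in primary_dir.rglob("*"):
        if not primary.is_file():
            continue
        backup = backup_dir / primary.relative_to(primary_dir)
        if not backup.is_file() or file_digest(backup) != file_digest(primary):
            mismatches.append(str(primary.relative_to(primary_dir)))
    return mismatches
```

Run on a schedule, a check like this turns “is the backup clean?” from a question asked during a crisis into a routine report.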
Regular “fire drills” need to be conducted to ensure recovery procedures are in place, says Corke. It’s okay to do this when workloads are low, but staff should not be warned in advance or they will not behave naturally.
“Automate the recovery points wherever possible and create a plan to manage the others,” he advises.
Testing should involve doing everything as it would be done if there were actually a disaster. “Recover the data, don’t emulate recovery,” Corke says. The test should also be end-to-end; that is, it should not be assumed that because stages A to B and stages B to C have been successfully tested on different occasions, the whole sequence from A to C will necessarily work.
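The “recover, don’t emulate” principle can be illustrated with a drill that actually restores the backup into a scratch area and verifies what comes back, rather than just ticking off that a backup exists. This is an illustrative sketch under assumed conditions (a plain directory-copy backup, hypothetical paths), not a description of the NZX’s procedures.

```python
# Illustrative sketch only: a fire drill that performs a real restore into
# a scratch area and checks the result end to end. The backup format (a
# plain directory copy) and the paths are hypothetical.
import shutil
from pathlib import Path

def run_restore_drill(backup_dir: Path, scratch_dir: Path,
                      expected_files: list[str]) -> bool:
    """Restore the backup into scratch_dir and verify every expected file
    came back non-empty. Returns True only if the whole chain works."""
    if scratch_dir.exists():
        shutil.rmtree(scratch_dir)            # start from a clean slate
    shutil.copytree(backup_dir, scratch_dir)  # the actual restore step
    for name in expected_files:
        restored = scratch_dir / name
        if not restored.is_file() or restored.stat().st_size == 0:
            return False                      # the drill fails loudly
    return True
```

Because the drill restores and then inspects the restored copy, a test that passes here exercises the full A-to-C sequence in one run, not stages A to B and B to C separately.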
“Cater for the lowest common denominator,” he says. Spell out even the most obvious things so there is no scope for misunderstanding.
“Test and keep testing, and keep your head,” he says. A well-drafted plan is the best assurance you will be able to stay calm in a disaster.