Late last month, Craigslist vanished from the internet. So did LiveJournal and Technorati. CNET.com and Second Life were reportedly gone for a while too. What happened? The datacentre they all shared went dark because of a power failure. Simple enough, right? Except that the main point of using that datacentre was so they’d never have to worry about power failures.
See, a major marketing feature of 365 Main, the humongous San Francisco co-location facility that failed, is that it offers power that just won’t quit. When power from the local utility goes out, a bank of ten 3,000-hp diesel generators is supposed to kick in automatically and keep running until stable power is restored — for days, if necessary.
In fairness to 365 Main, it always worked that way in the past.
But not last week. Early in the afternoon of Tuesday, July 24, external electric power started fluctuating wildly. A nearby underground transformer exploded. Power went out for a large section of downtown San Francisco, including the Financial District — up to 50,000 customers in all.
And for reasons that 365 Main is still investigating, some of its backup generators didn’t fire up as they should have. It took about 45 minutes for onsite engineers to start the generators manually.
By then, the damage was done for Craigslist, LiveJournal and the others — between 20% and 40% of 365 Main’s customers. Their servers went down hard. And instead of the magically continuous service their businesses had counted on, those servers had to be brought back up the hard way, slowly and carefully.
The lucky ones were offline for only a few hours. But even for them, the magic was gone.
It should be gone for the rest of us, too. It’s time to accept some hard reality.
Bad things happen. They happen no matter how carefully we plan for them, because we can’t plan for everything. They happen no matter who we’ve paid to take on the job of handling those bad things, no matter how much we’ve paid, no matter what promises we’ve been given.
Co-location and outsourcing don’t work — at least not if what we expect them to do is solve our business continuity problems.
They won’t do that. They can’t. We shouldn’t expect them to.
In fact, we should assume that they won’t, and plan accordingly.
That’s true even if a company like 365 Main brags that its power can’t go down. It can. Murphy willing, it will. And nothing 365 Main does after the fact can make whole the lost sales, lost customers and lost confidence that come in the wake of that failed boast.
So, is outsourcing always the wrong move? Of course not. Trusting outsourcers — that’s the wrong move.
We have to believe they’ll do their best. Otherwise, we shouldn’t be doing business with them. But we also have to know that they’re not perfect, no matter what their brightly coloured brochures say.
We can hand off work, but we can’t hand off responsibility for our company’s IT functions. That’s still ours.
Which means we can’t outsource sleepless nights. We can’t quit developing what-if scenarios and contingency plans. We can’t stop looking for ways to backstop our vendors’ “bulletproof” services — just in case a bullet somehow gets through.
When it comes to reliability, worry is good. Trust? Not so much.
One 365 Main customer, online retailer RedEnvelope, had the right idea. RedEnvelope maintained a backup datacentre in Cincinnati as insurance against just the sort of failure that struck last month.
But after two years without a glitch in San Francisco, 365 Main issued a press release announcing that RedEnvelope had shut down the Ohio facility.
That was Tuesday morning. That afternoon, RedEnvelope was offline.