Opinion: The failure behind the Amazon outage isn't just Amazon's

David Linthicum looks at the wider issues raised by the outage

When Amazon.com's outage last week (specifically, the failure of its Elastic Block Store, or EBS, subsystem) left popular websites and services such as Reddit, Foursquare, and Hootsuite crippled or outright disabled, the blogosphere blew up with noise about the risks of using the cloud. Although a few defenders spoke up, most of these instant experts panned the cloud and Amazon.com. The story was huge, covered by the New York Times and the national business press; Amazon.com is now "enjoying" the same limelight that fell on Microsoft in the 1990s. It will be watched carefully for any weakness and rapidly kicked when issues occur.

It's the same situation we've seen since we began to use computers: They are not perfect, and from time to time, hardware and software fail in such a way that outages occur. Most cloud providers, including Amazon.com, have spent a lot of time and money creating advanced multitenant architectures and infrastructures to reduce the number and severity of outages. But to think that all potential problems have been eliminated is naive.

Some of the blame for the outage has to go to those who made Amazon.com a single point of failure for their organizations. You have to plan and create architectures that can work around the loss of major components to protect your own services, as well as make sure you live up to your own SLA requirements. Although this incident does indeed show weakness in the Amazon.com cloud, it also highlights the liabilities of those who've become overly dependent on Amazon.com. The affected companies need to create solutions that can fail over to a secondary cloud or locally hosted system, or they will again risk a single outage taking down their core moneymaking machines. I suspect the losses from this outage will easily run into the millions of dollars.

Never trust a single system component, be it a cloud, a network, a router, a database, or whatever. Figure out what to do when a component goes offline or fails in other ways. The typical solution is to fail over to secondary components that can operate until the primary is back online. That used to be a given in IT. Unfortunately, many organizations have put too much trust in clouds, pushing their systems out to providers in the mistaken belief that a third party will provide the resiliency and the redundancy they require. As we've seen so dramatically, clouds have limitations, too. Don't get mad at that fact; just deal with it.
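
For readers who want to see the failover idea in concrete terms, here is a minimal Python sketch. It is an illustration under stated assumptions, not a description of how any affected company or Amazon.com handles failover: the function `call_with_failover` and the two stand-in backends are hypothetical names invented for this example. The code simply tries the primary first and falls back to the secondary when the primary raises an error.

```python
from typing import Callable, Sequence, TypeVar

T = TypeVar("T")


class AllBackendsFailed(Exception):
    """Raised when every backend in the failover chain has failed."""


def call_with_failover(backends: Sequence[Callable[[], T]]) -> T:
    """Try each backend in order and return the first successful result."""
    errors = []
    for backend in backends:
        try:
            return backend()
        except Exception as exc:  # a real system would catch narrower errors
            errors.append(exc)
    # Nothing worked: surface every collected error to the caller.
    raise AllBackendsFailed(errors)


# Hypothetical stand-ins for a primary cloud service and a secondary system.
def primary_cloud_store() -> str:
    raise ConnectionError("primary storage unavailable")


def secondary_store() -> str:
    return "served from secondary"


result = call_with_failover([primary_cloud_store, secondary_store])
print(result)
```

A production design would also need health checks, data replication to the secondary, and a plan for failing back once the primary recovers; the sketch covers only the ordering of attempts.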
