In Cloud Computing, Downtime is Endemic – But Does it Matter?

There is a perennial debate in cloud computing about whether a failure of one cloud service provider can be generalized to a “failure of cloud computing”. It is an important question because availability is a key decision factor in choosing between private and public cloud, and between public cloud providers.

The most recent example of such failures is the power outage at IaaS provider Rackspace’s London facility, but of course, we have seen this before from many public cloud providers – including Rackspace in particular, and not just once. SaaS provider Salesforce.com (and its PaaS arm, Force.com) has also had one outage already this year, an event that is far from unusual, and nothing new. Amazon, Yahoo, Microsoft, GoGrid, RIM, Twitter, PayPal and many others have also had substantial (and often repeated) outages.

There are some who dismiss these failures as one-offs, write off partial or short-term failures as too low-impact to matter, or just give poor DR a pass because it is the cloud, and we should not expect any better. Others reach to find semantic differences, calling it a service outage, an application failure, a facilities outage, a power outage, or a resource shortage. Some just redefine cloud to include only those services that did not go down this week (bonus points for adding a vainglorious reference to the ‘real cloud’ or ‘true cloud’).

YMMV, but I don’t see it that way at all. With so many repeated failures in so many cloud providers, these are not just one-off failures. They don’t just happen to isolated providers; they happen across the board. Regardless of the cause – the application, the facilities, the power supply, the lightning rod – an outage of a cloud service provider is still a cloud outage. And the definition of cloud I use is not dogmatic enough to exclude any of the providers that I have cited (and others), let alone define a ‘true cloud’.

So I see every reason to believe that downtime in the public cloud is not the exception, it is the rule; that outages in the public cloud are endemic, and they are systemic.

However, this judgement is absolute, not relative. Failure in one cloud provider may (and I believe does) implicate all cloud providers, but it does not imply downtime is more of a problem in the public cloud than in traditional enterprise IT. Indeed, there is a strong argument that enterprise IT has as many if not more outages, so uptime and availability are no worse in the public cloud than with traditional IT.

In fact, EMA research has shown average enterprise IT uptime is just ‘two nines’, at 99.5%. For a 24×7 system, that is over 50 minutes of downtime, each and every week. Contrast this with public cloud providers. Even with their problems, Amazon EC2 offers a “reasonable effort” to deliver an annual uptime of at least 99.95% – or about 5 minutes of downtime per week – and offers a 10% credit for “eligible” breaches. Google guarantees ‘three nines’ (99.9%) uptime for its Premier Edition, or around 10 minutes of downtime per week (although it promotes a study that claims an average downtime of 15 minutes a week). The Rackspace SLA promises network, HVAC, and power will be up 100% of the time, though it does not guarantee server availability (beyond promising a 60-minute maximum repair window), and all promises exclude “scheduled maintenance”.

So for the average enterprise, ‘normal’ cloud computing outages, while endemic, can still add up to 5 to 10 times less downtime than in their own data centers.
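The arithmetic behind these figures is easy to check. The short Python sketch below converts each quoted uptime percentage into minutes of downtime per week for a 24×7 system, then compares the enterprise average against the two cloud figures. The function name and labels are my own, chosen for illustration; the percentages are the ones quoted above.

```python
MINUTES_PER_WEEK = 7 * 24 * 60  # 10,080 minutes in a 24x7 week


def weekly_downtime_minutes(uptime_percent: float) -> float:
    """Minutes of downtime per week implied by an uptime percentage."""
    return MINUTES_PER_WEEK * (100.0 - uptime_percent) / 100.0


# Uptime figures quoted in the article; the labels are only descriptive.
quoted_uptime = {
    "enterprise IT average": 99.5,
    "Amazon EC2 SLA target": 99.95,
    "Google Premier Edition": 99.9,
}

for label, pct in quoted_uptime.items():
    minutes = weekly_downtime_minutes(pct)
    print(f"{label}: {pct}% uptime -> {minutes:.2f} min/week")

# How many times more downtime the enterprise average implies.
enterprise = weekly_downtime_minutes(99.5)
for label in ("Google Premier Edition", "Amazon EC2 SLA target"):
    ratio = enterprise / weekly_downtime_minutes(quoted_uptime[label])
    print(f"enterprise vs {label}: about {ratio:.0f}x the downtime")
```

Note that this compares an observed enterprise average with SLA targets, so it shows what the providers promise rather than what they deliver.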

However, it is not a black and white issue, not least because a focus on broad uptime percentages or on single instance failures ignores the huge nuance behind a single uptime number.

For example, many environments report ‘five nines’ (99.999%) or even 100% uptime – less than one second of unplanned downtime each day – for their critical systems by using processes and tools for high availability, fault tolerance, asset maintenance, live migration, etc. EMA has also found that best performers in Virtual Systems Management – 15% of enterprises – report an average of five nines uptime.

If they need to, enterprise CIOs can invest in technology to provide two, three, four or five nines uptime within their own data center. They can implement redundant hardware, HA and FT, multi-site replication, and more – if they want to pay for it. They can monitor for outages, know exactly when they happen, and react automatically to fix them immediately (or even use predictive analytics and automation tools to avoid them entirely). They can provide this as required, as a value-add to their business unit customers, or as an additional charge (or at least an exposed cost) to the business to let them choose how critical their applications really are.
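To see why redundancy is the usual route to more nines, here is a rough Python sketch of an idealized model. It assumes every replica fails independently and that the service stays up as long as any one replica is up. That assumption is mine, not something from the EMA research, and real failures (shared power, shared software bugs) are often correlated, so treat the output as a best case. The function names are hypothetical.

```python
import math


def combined_availability(single: float, replicas: int) -> float:
    """Availability of `replicas` redundant units, assuming independent failures.

    The service counts as up if at least one replica is up (idealized model).
    """
    return 1.0 - (1.0 - single) ** replicas


def nines(availability: float) -> float:
    """Approximate number of 'nines' in an availability fraction below 1.0."""
    return -math.log10(1.0 - availability)


base = 0.995  # the average enterprise uptime quoted earlier, as a fraction
for n in (1, 2, 3):
    a = combined_availability(base, n)
    print(f"{n} replica(s): {a * 100:.5f}% uptime, about {nines(a):.1f} nines")
```

Under those assumptions, each extra replica multiplies the unavailability by the same small factor, which is why HA pairs and multi-site replication can lift an average system toward four or five nines, at a cost.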

However, with the public cloud, neither the business nor the CIO has any real choice. With few or no management or automation tools, public cloud providers simply do not currently offer the same flexibility and accountability as internal IT. Without good management tools, no public cloud provider currently matches enterprise IT at the higher mission-critical reaches of availability.

So, this fight does not end in a knock-out for either side. As is common in the real world, nothing is black and white, but rather many shades of grey.

In the end, the solid achievements of public cloud providers, despite the bad press, do not absolve them of any blame or negate generalizations of downtime being endemic in the public cloud. However, the relatively poor performance of enterprise IT on average still does not ensure public cloud will be any better in any specific case.

What this does show, however, is that CIOs who are planning to build their own private cloud have a surprisingly high bar to reach. They should not dismiss public cloud options out of hand, but rather should strongly consider whether they can realistically and cost-effectively meet the three, four, and even five nines that public cloud providers guarantee.
