Time To Stop Forgiving Cloud Providers for Repeated Failures

Cloud can be a great success, but it is time to stop excusing cloud failures

For a long time now, cloud pundits – service providers, boosters, analysts, vendors, and other mostly vested interests – have stood behind a curtain of “downtime happens, design for failure” when assessing cloud outages.

It seems that with every new failure, the self-styled clouderati repeatedly implore IT leaders to believe that, despite what we are seeing with our own eyes on an almost weekly basis, cloud providers are better at IT than you are.

Over and over, the pundits’ response is yet another chorus of, “Pay no attention to that man behind the curtain!” Rather, CIOs and others are implored to simply trust the great and powerful Oz (or is that ‘Aws’?) to provide better uptime than their own in-house IT services.

I am as much at fault for this increasingly baseless mantra as anyone

I am not blameless in this either. Indeed, I am as much at fault as anyone for my own role in perpetuating this increasingly baseless mantra. For years I have talked about how empirical research shows that cloud providers have better uptime than in-house IT. I have highlighted how cloud is essential to maintaining uptime during peak loads. I have rejected the knee-jerk “no cloud, no way” response to outages and instead advocated a balanced risk mitigation approach to cloud downtime.

The latest iteration of this cycle comes hot on the heels of a major incident affecting Netflix and others around Christmas Eve: a crippling 23-hour outage caused by a failure in the Amazon Web Services (AWS) load balancers. No enterprise IT shop could allow mission-critical services to go down for almost a whole day. It is beyond outrageous to suggest that this level of service is better than what the average enterprise IT shop delivers.

Moreover, AWS does this time and again, with little more than a hollow copy-pasta apology to make up for it. Paul McNamara nails Amazon on this point, with my early front-runner for the most acerbic and amusing words-per-inch of 2013, as he hoists AWS on its own petard by showing how Amazon’s apologies all sound the same.

Yet in response to this incident, for example, cloud consultant David Linthicum trotted out the same old saw, suggesting in his GigaOm blog that everyone is just getting too ‘outage-sensitive’. It is nothing out of the ordinary, he says, and nothing to be concerned about, because you couldn’t do better anyway:

I just don’t see the point of ‘handwringing’ over each of these outages. Just look at the number of times your internal enterprise systems also experience outages. I suspect it’s more than AWS, Rackspace, Microsoft, or the other larger providers.

This may work out in the aggregate, but I think it is a deeply flawed assumption in the specific. While poorly run in-house IT abounds, I do know many enterprise IT shops that maintain better uptime than cloud service providers, and for very good reason. Cloud is immature as an IT discipline, and almost by definition embodies greater risk. Most providers lack the experience and risk-aversion of large enterprise IT, and even of more established IT providers. Most large enterprises have been running larger and more complex IT systems than AWS for 20, 30, 40 years or more. It is almost perverse to assume cloud providers can run more stable IT systems without this long-term experience.

Don’t be misled by Amazon’s SLA claims of 99.95% uptime, either. To start with, the AWS SLA is so porous you can drive a truck through it (probably this truck). The wonderful Beth Pariseau at TechTarget’s SearchCloudComputing has a great new article that makes this point very clearly:

That 99.95% refers to regional uptime, as opposed to uptime by individual availability zone. This in turn makes uptime for individual data centers within Amazon’s cloud, and an apples-to-apples comparison with enterprise data centers, close to impossible to precisely calculate.

No CIO is allowed to count the best uptime of any one of several regional data centers as the uptime for all of their regional data centers. In the real world of enterprise IT, that would be quite absurd! If SLA reporting of this nature were normal for enterprise IT shops, I have no doubt that published research on enterprise uptime would report a much higher average availability.
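
To illustrate just how flattering that kind of accounting is, here is a rough back-of-the-envelope sketch. The per-zone figure, zone count, and independence assumption are my own hypothetical simplifications, not Amazon’s actual numbers or measurement method:

```python
# Hypothetical illustration: counting a "region" as up whenever any one
# availability zone is up makes the headline number look far better than
# the uptime of any individual data center.
# Assumes three availability zones that fail independently - a simplification.
az_uptime = 0.995   # hypothetical per-zone uptime of 99.5%
az_count = 3

# The region is "down" only if every zone is down at the same moment.
all_zones_down = (1 - az_uptime) ** az_count
regional_uptime = 1 - all_zones_down

print(f"Per-zone uptime:            {az_uptime:.3%}")        # 99.500%
print(f"Headline 'regional' uptime: {regional_uptime:.5%}")  # ~99.99999%
```

The point is not the exact numbers, but that a regional figure built this way tells you almost nothing about how often your particular instances, in your particular zone, actually went down.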

Beth also adds up the actual AWS downtime in 2012, which she works out to be about 99.5% uptime (not accounting for Amazon’s questionable way of measuring uptime). This is about the same as the average enterprise IT uptime that EMA research found in 2007 (around 99.3%), but far less than the average of 99.9% that the Ponemon Institute found in 2011 – and still much less than what AWS promises. There is also the pesky “scheduled maintenance”, which never counts toward downtime numbers. Not to mention any unreported downtime – and I for one do not assume that the average cloud provider accounts to their users for every second or even minute of downtime, unlike enterprise IT shops, which are required to do just that.
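
For anyone who prefers hours to nines, here is a rough conversion of those figures into annual downtime. It is a simple back-of-the-envelope calculation that ignores scheduled maintenance and says nothing about how the downtime is distributed:

```python
# Rough conversion of the uptime figures discussed above into hours of
# downtime per year, assuming a 365-day year.
HOURS_PER_YEAR = 365 * 24

figures = [
    ("AWS SLA promise (99.95%)",            0.9995),
    ("Estimated AWS uptime, 2012 (99.5%)",  0.995),
    ("Ponemon enterprise average (99.9%)",  0.999),
    ("EMA enterprise average (99.3%)",      0.993),
]

for label, uptime in figures:
    downtime_hours = (1 - uptime) * HOURS_PER_YEAR
    print(f"{label}: ~{downtime_hours:.1f} hours of downtime per year")
```

The gap between the promised 99.95% (about 4.4 hours a year) and the estimated 99.5% (about 44 hours a year) works out to roughly a full working week of downtime.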

Even at face value, AWS has demonstrably worse uptime than an average IT shop.

Even at face value, then, AWS has at best the same uptime as the average IT shop, and at worst demonstrably worse – let alone compared with an accomplished, proficient, best-practice IT shop. You pay them a premium too, so it is fair to expect much better than just average.

Adding insult to injury is the oft-repeated platitude that, not only are cloud failures to be expected, but cloud providers are not really even at fault. Rather, cloud customers are at fault for expecting more stable systems, and for not writing their applications properly. Customers should expect cloud services to go down, says the mantra, so their applications should be “designed for failure”. For example, says Linthicum:

I’m sure there will be many more outages this year, and next year. You just need to build those types of events into your cloud service usage and operations planning.

‘Design for Failure’ is admirable advice, which I highly recommend, regardless of whose infrastructure or platform you are using. It is especially appropriate for new cloud applications, what I have dubbed ‘cloud-native’ services. However, this approach is arguably not possible for most cloud services.

To start with, it is rarely practical to build such resilience into internal legacy applications you are re-hosting on the cloud – what I have called ‘cloud-migrant’ services, running on IaaS cloud services. Then there are third-party cloud-migrant services, like SharePoint or SAP, that have not been designed for failure – and there is nothing the user can do about it. ‘Design for Failure’ is barely achievable in most PaaS environments too, because there are rarely any compatible alternatives to use for failover, so you simply must ride out any full platform outage. And it is literally impossible for SaaS applications, where the customer has no option but to use the application as it was written, and suffer the downtime when (not if) it happens.
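
For comparison, here is a minimal sketch of what client-side ‘design for failure’ looks like for a cloud-native application. The region names and the fetch_from() call are hypothetical placeholders, not any provider’s real API:

```python
# Minimal sketch of client-side 'design for failure': try the primary region,
# retry with backoff, then fail over to a secondary region.
# fetch_from() and the region names are hypothetical placeholders.
import time

REGIONS = ["primary-region", "secondary-region"]

def fetch_from(region, request):
    """Placeholder for a real service call to the given region."""
    raise NotImplementedError

def resilient_fetch(request, retries=3, backoff_seconds=1.0):
    last_error = None
    for region in REGIONS:
        for attempt in range(retries):
            try:
                return fetch_from(region, request)
            except Exception as error:   # in practice, catch specific errors
                last_error = error
                time.sleep(backoff_seconds * (attempt + 1))
        # All retries against this region failed; fall through to the next one.
    raise RuntimeError("all regions failed") from last_error
```

Retries, failover targets, and the replicated state behind them all have to live in the application itself, which is precisely why cloud-migrant, PaaS-locked, and SaaS customers cannot simply be told to ‘design for failure’.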

Here’s a thought – how about we demand cloud providers ‘design for failure’?

Here’s a thought – how about we all demand cloud providers ‘design for failure’ instead, and demand they supply higher quality – dare I say, “enterprise quality” – cloud services?

After all, if your in-house IT infrastructure failed, your ops team could not get away with blaming your developers because your applications were not ‘designed for failure’. Similarly, there is no reason cloud providers (and their various apologists) should be allowed to get away with it either. I can only assume cloud providers already do ‘design for failure’ to some degree, but given the number, severity, and duration of cloud failures, it certainly seems like many of them are doing a pretty awful job of it.

I must give Brandon Butler of Network World his due for being out in front and at least asking the question. In a great recent article, Brandon asks ‘How long will big-name customers like Netflix put up with Amazon cloud outages?’, putting the video streaming service on the spot for yet another major AWS-induced failure.

Not that this is purely an Amazon AWS issue either. I cite AWS here mainly because this is the most recent and high profile cloud outage, but the same can easily be seen in outages affecting many of the low-grade and/or commoditized IaaS, PaaS, and SaaS cloud providers. Outages in the cloud are becoming so endemic across the board, CRN’s Jack McCarthy recently compiled a list of ‘The 10 Biggest Cloud Outages Of 2012’. This was a sequel to Andrew Hickey’s December 2011 article on The 10 Biggest Cloud Outages Of 2011, which followed up his July 2011 article on The 10 Biggest Cloud Outages Of 2011 (So Far), which in turn followed The 10 Biggest Cloud Outages Of 2010 (So Far). I think I am starting to see a pattern!

From consumer and small business services like Tumblr, DropBox, and GoDaddy, to business-oriented services like Salesforce.com, Google App Engine, and Microsoft Azure (and of course Amazon AWS), downtime of cloud services is proving to be far more than a niggling concern. It is becoming a persistent thorn in the side of anyone who continues to insist there is nothing unusual or concerning in cloud downtime.

This does not mean there is no place for cloud in the modern enterprise. Far from it.

This does not mean there is no place for cloud in the modern enterprise. Far from it: I maintain that cloud computing provides an amazing opportunity for innovation, agility, flexibility, and even (sometimes) cost reduction. And remember, there absolutely are enterprise-grade alternatives to the common commodity cloud providers, although I do not (necessarily) advocate cloud customers jump ship over just one or even a couple of isolated incidents.

However, for large enterprise cases especially, we (pundits, analysts, vendors, commentators, press) need to take a long hard look at ourselves in the mirror, and critically reassess the way we publicly excuse failing cloud providers, time and again, for what really should be a pretty big deal. It also means that CIOs and other IT leaders need to seriously reconsider their own choices of cloud providers, and actively reassess which providers will deliver a quality experience that justifies long-term trust and long-term investment. The right cloud services may not come from the providers you are used to hearing about.

We cannot keep giving cloud providers a pass for downtime, slowdowns, identity thefts, data loss, and other failures.

Because in the face of these repeated outages, we cannot keep giving cloud providers a pass for downtime, slowdowns, identity thefts, data loss, and other failures. And we cannot continue to blame cloud consumers for not designing their cloud applications and operations properly.

It is time for all of us to stop excusing cloud providers for their repeated failures. It is time we all instead start holding them accountable to their promises, and more importantly, accountable to our expectations.