The hue and cry when public cloud services fail – which they do with some frequency – is almost deafening. What we rarely hear about is when the public cloud works the way we hope it will.
To be fair, when Amazon takes a hit, it is an important event that begs to be over-analyzed. After all, how could anyone ever expect to possibly survive a day without harvesting digital crops, streaming an episode of Friends, making a cellphone pic look like crap, or checking into the local dive bar?! Internets are serious business!!*
However, the punditry rarely gives credit where credit is due – particularly in the case of public cloud uptime. So let me now give public cloud its due.
Last week, we saw a great example as the popular community/forum website Reddit.com used cloud computing to successfully weather (well, mostly) one of the biggest peak loads it has ever experienced.
When the Internet found out that US President Barack Obama was planning to host an ‘AMA’ (“Ask Me Anything”) on Reddit’s ‘IAmA’ subreddit (short for “I am a …” – an online Q&A forum with specialists, celebrities, and more), the massive response was perhaps predictable (to a degree).
In the first five minutes, there were 37 comments. By the ten-minute mark, redditors had made 278 comments. Within half an hour that number had jumped to 5,266, and it was over 10,000 by the end of the first hour.
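To put those numbers in perspective, here is a quick back-of-the-envelope sketch that turns the cumulative counts above into comments per minute for each interval. The checkpoint figures come straight from the paragraph above; the zero-comment starting point and the function name `interval_rates` are my own assumptions for illustration, and the last interval is only a lower bound because the hour-mark figure was "over 10,000".

```python
# Cumulative comment counts quoted above: (minutes elapsed, total comments).
# Assumption: the thread started at 0 comments at minute 0.
# The 60-minute figure is a lower bound ("over 10,000").
checkpoints = [(0, 0), (5, 37), (10, 278), (30, 5266), (60, 10000)]


def interval_rates(points):
    """Return (start, end, comments per minute) for each consecutive pair."""
    rates = []
    for (t0, c0), (t1, c1) in zip(points, points[1:]):
        rates.append((t0, t1, (c1 - c0) / (t1 - t0)))
    return rates


for start, end, rate in interval_rates(checkpoints):
    print(f"{start:>2}-{end:<2} min: {rate:6.1f} comments/min")
```

Roughly speaking, the comment rate went from single digits per minute to a few hundred per minute within half an hour, which is the kind of ramp the admins had to provision for.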
There were, in total, almost 3 million page views for this thread on the day, generating an unprecedented 30% of all visitors to Reddit at its peak, and transferring 48 MB of data per second (only compressed text too – no movie files here) to the Internet – between five and ten times the normal traffic for “an extremely popular submission”. This was the most traffic the site had ever seen.
However, Reddit is hosted on AWS, including EC2, S3, and EBS, so despite a record day for pageviews, the site was able to remain (more or less) available throughout, because the admins at Reddit responded to this load spike with the on-demand elasticity and scalability of cloud computing:
In preparation for the IAMA, we initially added 30 dedicated servers (20%~ increase) just for the comment thread. This turned out not to be enough, so we added another 30 dedicated servers to the mix.
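The admins' figures also hint at the scale involved. As a rough sketch only: if 30 servers was about a 20% increase, the implied baseline fleet is around 150 servers, and the second batch brings the total bump to about 40%. The function name `implied_baseline` is hypothetical, and the result is only as precise as the admins' approximate "20%~" figure.

```python
def implied_baseline(added_servers, increase_fraction):
    """Estimate the pre-existing fleet size from a quoted percentage increase."""
    return added_servers / increase_fraction


# Figures quoted by the Reddit admins; the 20% is approximate ("20%~").
first_batch = 30
second_batch = 30
approx_increase = 0.20

baseline = implied_baseline(first_batch, approx_increase)
total_increase = (first_batch + second_batch) / baseline

print(f"Implied baseline fleet: about {round(baseline)} servers")
print(f"Total increase after both batches: about {round(total_increase * 100)}%")
```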
The new servers were completely dedicated to serving the AMA. The Reddit systems are architected for the cloud, and able to isolate some infrastructure to support specific service features. In addition, Reddit uses scripting to automate the provisioning process, “to automatically take a base image to a server running reddit [sic] code in a few minutes.” As a result of this cloudbursting capability (yeah, I said it – wanna fight about it!?), Reddit stayed up, and was able to handle a major load spike, the biggest in its history. Pretty impressive, right? Not that you’d know from the lack of coverage from the usual suspects.
Not to say that this was all fluffy unicorn rainbows. It was not. I logged in several times only to get a Reddit timeout page, so the availability was still a little sketchy. According to Reddit, this was caused by the freakishly high bandwidth overwhelming the Reddit load balancers, compounded by an issue with how the registration service interacted with Reddit’s CDN provider, Akamai.
Notably, the President was not immune from this availability impact, despite having a dedicated server allocated to him. The admins eventually had to give him “access to an internal server that didn’t go through the load balancers” to make sure he could answer the questions streaming in, suggesting that Reddit still runs at least some of its own servers (just like Netflix does).
Yet, despite these small (and quickly resolved) issues, imagine this same scenario without elastic scalability of pooled resources, and you have a much worse outcome – almost certainly, the site would have been completely unavailable until the load spike subsided, and possibly longer.
Of course, this same scenario could have played out with a private cloud. There is nothing in this scenario that could not have been duplicated with an on-premises cloud model, assuming the additional 60 servers were available in – or could be freed up and moved to – the resource pool. However, in this case, it was Amazon’s public cloud service that saved the day. On this occasion, Amazon did not fall over for two days, taking with it dozens of web sites and services, and generating an outcry that could be heard on other planets – and in other galaxies. No, on this occasion, AWS definitely proved its worth, and proved to Reddit and its thousands of users (and more) the value of the public cloud.
To quote Reddit admin rram:
Everything we have runs on Amazon Web Services. Being in the cloud certainly helped us with quickly scaling.
It is just a pity we won’t see a hundred frothy news pieces on that.
P.S. If you want more technical details, you can check out the Reddit code on GitHub, and the AMA that the Reddit admins themselves held a few months back. Because Reddit is sorta awesome that way 🙂
*Fair disclosure – I do not play Farmville, stream Friends from Netflix, post pictures to Instagram, or check in with Foursquare. But I do Reddit, and I know it is serious business! 😉