Clouds over Amazon

If you were hoping to access Reddit, Quora, FourSquare, Hootsuite, SCVNGR, Heroku, Wildfire, parts of the New York Times, ProPublica or any of about 70 other sites last Thursday, you may have been out of luck. It was no fault of their own that they went offline, some for up to 36 hours; it was a cloud problem. More specifically, it was a problem with Amazon's cloud service, Amazon Elastic Compute Cloud, or EC2. Even so, Amazon's damage control has kept the news fairly quiet, considering the number of sites and people affected. Most people would not even know that Amazon was hosting their favorite site.

In what news services are describing as a killer blow to the blossoming cloud services industry, the unquestioned leader of the pack failed, and with it the promise of scalable, flexible, cost-effective and particularly efficient solutions for enterprises lost considerable credibility.

CNN likened it to the Titanic of online services sinking. Mashable called it Cloudgate or Cloudpocalypse. Not wanting to be as over-dramatic as these reputable news services, The Insider is more concerned with the repercussions for a burgeoning sector of our industry. To date, most concerns have been about data security, with less attention paid to the reliability of the systems themselves. It seems quite incredible that Amazon, with all its brilliant, tried and tested technology, could have suffered such a high-profile glitch.

The trouble was apparently due to "excessive re-mirroring of its Elastic Block Storage (EBS) volumes." The crash started at Amazon’s northern Virginia data center, located in one of its East Coast availability zones. In its status log, Amazon said that a “networking event” caused a domino effect across other availability zones in that region, in which many of its storage volumes created new backups of themselves. That filled up Amazon’s available storage capacity and prevented some sites from accessing their data.

These availability zones are supposed to be able to fail independently without bringing the whole system down. Instead, there was a single point of failure that shouldn’t have been there. Amazon has been tight-lipped about the incident, and the company said it won’t be able to fully comment on the situation until it does a “post-mortem.” Amazingly, the theories described above have come from external cloud experts who have already managed to work it out from the available evidence.

The fact that it has taken Amazon so long to explain itself is cause for greater concern. It's one thing to keep customers from running their business for up to 36 hours; it's another to keep the reasons from them for days more. No doubt many lawyers spent their Easter break going over SLAs with "a fine-toothed comb." The incident also highlights the need for customers to do their own redundancy planning rather than leaving everything to the cloud provider. Such planning could also ease concerns about migrating data should the need arise to change service providers.
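
To put the 36-hour figure in SLA terms, here is a rough back-of-the-envelope sketch in Python. The uptime targets in the loop are illustrative round numbers, not the terms of Amazon's or anyone else's actual SLA, and the function names are our own.

```python
HOURS_PER_YEAR = 365 * 24  # 8,760 hours, ignoring leap years


def availability_pct(downtime_hours, period_hours=HOURS_PER_YEAR):
    """Percentage of the period during which the service was up."""
    return 100.0 * (1 - downtime_hours / period_hours)


def allowed_downtime_hours(target_pct, period_hours=HOURS_PER_YEAR):
    """Hours of downtime a given uptime target permits over the period."""
    return period_hours * (1 - target_pct / 100.0)


if __name__ == "__main__":
    # Worst case reported in this post: some sites were down for up to 36 hours.
    print(f"36 hours down in a year = {availability_pct(36):.2f}% availability")

    # Hypothetical uptime targets, for comparison only.
    for target in (99.0, 99.9, 99.95, 99.99):
        hours = allowed_downtime_hours(target)
        print(f"{target}% uptime allows {hours:.2f} hours of downtime a year")
```

On these assumptions, a single 36-hour outage leaves a site at roughly 99.59% availability for the year, while a 99.9% target would allow under nine hours. Whether that counts as a breach depends entirely on how a given SLA defines and measures downtime, which is why the fine-toothed comb matters.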

Nevertheless, those reporting the failures are also emphasizing that this industry sector is still in its infancy and events like this should not damn cloud services completely, although they do highlight the need to expect the unexpected, even with such a reputable partner as Amazon.

For CSPs rushing to get into the cloud space, it highlights the ever-present concern that they don't know what they don't know. If Amazon, with all its experience, can have a catastrophic boo-boo, then CSPs should be doubly prepared for any event.


Posted 04-25-2011 2:30 AM by The Insider

Comments

Kartikeya Karnatak wrote re: Clouds over Amazon
on 04-25-2011 2:35 PM

Well, I still believe it's a lesson to cloud customers to make their fault management strong enough to deal with such situations. Certainly, AWS is to be blamed for this failure, because it doesn't fit Amazon's image and primarily because Amazon was one of the first companies to let an ordinary user build its own cloud.

Apart from the provider's responsibility, I believe the customer shares equal responsibility for managing the services well, especially while cloud services are in their infancy and still maturing.

But all this cannot let Amazon escape from the massacre that happened. Its theory of separate zones and regions seems to have failed, and it looked more like a centrally controlled architecture.

No idea what exactly happened, but I really hope Amazon will share its experience so that it can be taken as a lesson and such things are not repeated.

Swami Vivekanand said, "If you are walking on the path and not facing any troubles, you are certainly travelling on wrong path". So, I believe we are on the right path, and cloud computing is growing and will acquire its rightful position in the near future.

Copyright © 1988-2012, TeleManagement Forum. All Rights Reserved