StatusCake

Lessons Learned From Amazon’s S3 Outage


Over 150,000 businesses rely on Amazon’s Simple Storage Service (S3) to provide cloud-based backend services for their websites. In early 2017, many of those businesses discovered just how dependent they were on the cloud when Amazon S3 experienced an outage lasting almost four hours. Many websites slowed to a crawl and some were unable to load at all.

The outage occurred while Amazon was attempting to fix a problem with a payment and billing system. A command that was supposed to take a few servers offline was entered incorrectly, removing a much larger set of support servers from one of S3’s subsystems and disrupting the websites of many S3 users. Restoring those support servers took much longer than expected.

The outage had a major impact on large e-commerce retailers. Of the top 100 online retailers, 54 saw page load times increase by 20% or more. Across the affected sites, load times rose by an average of 29.7 seconds, with pages taking an average of 42.7 seconds to load. In the world of e-commerce, when pages slow down, revenue falls with them; a site that fails to load at all is the equivalent of a high street retailer closing its doors.

The main lesson from this incident is not to put all your eggs in one basket. At a minimum, you need a contingency plan for handling an outage at a third-party provider, such as keeping backup copies of data and images on local servers that you can switch to if needed.
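As a minimal sketch of that kind of contingency (the bucket name and paths here are hypothetical), a scheduled job using the AWS CLI can keep a local mirror of critical assets up to date:

```shell
# crontab entry: mirror critical site assets from S3 to a local server
# every night at 02:00, so local copies exist if S3 becomes unavailable.
# --delete removes local files that no longer exist in the bucket,
# keeping the mirror an exact copy.
0 2 * * * aws s3 sync s3://example-assets-bucket /var/backups/s3-assets --delete
```

How often you run the sync depends on how fresh the fallback copies need to be; for frequently changing data, a shorter interval (or event-driven replication) would be more appropriate.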

It may cost more, but using more than one provider for cloud services and connecting them with automatic failover can keep your site running smoothly. If you take that approach with two providers, you should not utilise more than 40% of the capacity of each in normal operation; that way, if one provider experiences an outage, the other can absorb the combined load with capacity to spare.
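The failover logic and the 40% rule can be sketched in a few lines. This is an illustrative model, not a real cloud API: the provider names, capacities, and health checks are all hypothetical, and a production setup would typically use DNS failover or a load balancer rather than application code.

```python
# Sketch of automatic failover between two storage providers.
# Keeping steady-state load at or below 40% of each provider's capacity
# means a single survivor can absorb the combined load (at most 80%).

class Provider:
    def __init__(self, name, capacity_requests, healthy=True):
        self.name = name
        self.capacity = capacity_requests  # max requests it can serve
        self.healthy = healthy

def choose_provider(primary, secondary, current_load):
    """Route traffic to the primary; fail over to the secondary if it is down."""
    if primary.healthy:
        return primary
    if secondary.healthy and current_load <= secondary.capacity:
        return secondary
    raise RuntimeError("no healthy provider can serve the current load")

primary = Provider("cloud-a", capacity_requests=1000)
secondary = Provider("cloud-b", capacity_requests=1000)

# Normal operation: traffic goes to the primary.
assert choose_provider(primary, secondary, current_load=400).name == "cloud-a"

# Primary outage: the combined load (800 = 80% of capacity) fits on the secondary.
primary.healthy = False
assert choose_provider(primary, secondary, current_load=800).name == "cloud-b"
```

The key point is the headroom calculation: at 40% utilisation each, the worst case after a failover is 80% on one provider, leaving a margin for traffic spikes during the incident.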

Netflix is a good example of the effectiveness of using multiple sources for cloud services. In 2012, an electrical storm caused a power outage at Amazon and Netflix went down for about three hours, costing the company an estimated $600,000 (£480,000) in revenue. After that incident, Netflix spread its cloud services across 12 locations worldwide, designed to roll over automatically should one or more locations fail. That proved to be a wise decision: Netflix did not experience any performance degradation during the recent Amazon S3 outage.

No third-party service can or will guarantee 100% uptime. Most offer 99.99% uptime, but you still need to plan for that remaining 0.01% of downtime. As Murphy’s Law states, anything that can go wrong will go wrong. Prepare for the worst: build redundancy into your operations, back up your data, and test for vulnerabilities.

One last lesson you should take away from this incident applies to any critical operation you undertake, not just to potential cloud problems: always double-check before you execute a major action. Had Amazon followed that advice, this incident would not have happened, because the outage was caused by a typo in a command.
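Beyond double-checking by hand, the same protection can be built into tooling. The sketch below is purely illustrative (it is not Amazon’s actual tooling, and the function name and threshold are invented): a guardrail that refuses to remove more than a small fraction of a fleet in one command, so a mistyped server count cannot take down an entire subsystem.

```python
# Illustrative guardrail for a destructive operation: cap how much
# capacity a single command may remove, so a typo in the requested
# count fails loudly instead of executing.

def servers_to_remove(fleet_size, requested, max_fraction=0.05):
    """Validate a removal request against a per-operation safety limit."""
    if requested < 0:
        raise ValueError("cannot remove a negative number of servers")
    limit = int(fleet_size * max_fraction)
    if requested > limit:
        raise ValueError(
            f"refusing to remove {requested} of {fleet_size} servers; "
            f"limit is {limit} per operation"
        )
    return requested

# A routine request passes the check.
assert servers_to_remove(fleet_size=1000, requested=10) == 10

# A fat-fingered request (100 instead of 10) is rejected before it can run.
try:
    servers_to_remove(fleet_size=1000, requested=100)
except ValueError:
    pass
```

Notably, Amazon’s own follow-up to the incident included similar safeguards: capacity-removal tooling was changed to remove capacity more slowly and to block operations that would take a subsystem below its minimum required capacity.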
