StatusCake

Lessons Learned From Amazon’s S3 Outage


Over 150,000 businesses rely on Amazon’s Simple Storage Service (S3) to provide cloud-based backend services for their websites. In early 2017, many of those businesses discovered just how dependent they were on the cloud when Amazon S3 experienced an outage lasting almost four hours. Many websites slowed to a crawl and some were unable to load at all.

The outage occurred while Amazon was attempting to fix a problem with a payment and billing system. A command that was supposed to take a few servers offline was entered incorrectly, removing a much larger set of support servers from one of S3’s subsystems and disrupting the websites of many S3 users. Restoring those support servers took much longer than expected.

The outage had a major impact on large e-commerce retailers. Of the top 100 online retailers, 54 saw page load times increase by 20% or more. Across the affected sites, load times rose by an average of 29.7 seconds, with pages taking an average of 42.7 seconds to load. In the world of e-commerce, when pages slow down, revenue falls with them; a site that fails to load at all is the equivalent of a high street retailer closing its doors.

The main lesson from this incident is not to put all your eggs in one basket. At a minimum, you need a contingency plan for handling an outage at a third-party provider, such as keeping backup copies of data and images on local servers that you can switch to if needed.
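As a minimal sketch of that kind of contingency (the bucket name and paths here are hypothetical), a scheduled job using the AWS CLI can keep a local mirror of critical assets up to date:

```shell
# crontab entry: mirror critical site assets from S3 to a local server
# every night at 02:00, so local copies exist if S3 becomes unavailable.
# --delete removes local files that no longer exist in the bucket,
# keeping the mirror an exact copy.
0 2 * * * aws s3 sync s3://example-assets-bucket /var/backups/s3-assets --delete
```

How often you run the sync depends on how fresh the fallback copies need to be; for frequently changing data, a shorter interval (or event-driven replication) would be more appropriate.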

It may cost more, but using more than one provider for cloud services and connecting them with automatic failover can keep your site running smoothly. If you take that approach with two providers, you should not utilise more than 40% of the capacity of each in normal operation; that way, if one provider experiences an outage, the other can absorb the combined load with capacity to spare.
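The failover logic and the 40% rule can be sketched in a few lines. This is an illustrative model, not a real cloud API: the provider names, capacities, and health checks are all hypothetical, and a production setup would typically use DNS failover or a load balancer rather than application code.

```python
# Sketch of automatic failover between two storage providers.
# Keeping steady-state load at or below 40% of each provider's capacity
# means a single survivor can absorb the combined load (at most 80%).

class Provider:
    def __init__(self, name, capacity_requests, healthy=True):
        self.name = name
        self.capacity = capacity_requests  # max requests it can serve
        self.healthy = healthy

def choose_provider(primary, secondary, current_load):
    """Route traffic to the primary; fail over to the secondary if it is down."""
    if primary.healthy:
        return primary
    if secondary.healthy and current_load <= secondary.capacity:
        return secondary
    raise RuntimeError("no healthy provider can serve the current load")

primary = Provider("cloud-a", capacity_requests=1000)
secondary = Provider("cloud-b", capacity_requests=1000)

# Normal operation: traffic goes to the primary.
assert choose_provider(primary, secondary, current_load=400).name == "cloud-a"

# Primary outage: the combined load (800 = 80% of capacity) fits on the secondary.
primary.healthy = False
assert choose_provider(primary, secondary, current_load=800).name == "cloud-b"
```

The key point is the headroom calculation: at 40% utilisation each, the worst case after a failover is 80% on one provider, leaving a margin for traffic spikes during the incident.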

Netflix is a good example of the effectiveness of using multiple sources for cloud services. In 2012, an electrical storm caused a power outage at Amazon and Netflix went down for about three hours, costing the company an estimated $600,000 (£480,000) in revenue. After that incident, Netflix spread its cloud services across 12 locations worldwide, designed to roll over automatically should one or more locations fail. That proved to be a wise decision: Netflix did not experience any performance degradation during the recent Amazon S3 outage.

No third-party service can or will guarantee 100% uptime. Most offer 99.99% uptime, but you still need to plan for that remaining 0.01% of downtime. As Murphy’s Law states, anything that can go wrong will go wrong. Prepare for the worst: build redundancy into your operations, back up your data, and test for vulnerabilities.

One last lesson you should take away from this incident applies to any critical operation you undertake, not just to potential cloud problems: always double-check before you execute a major action. Had Amazon followed that advice, this incident would not have happened, because the outage was caused by a typo in a command.
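Beyond double-checking by hand, the same protection can be built into tooling. The sketch below is purely illustrative (it is not Amazon’s actual tooling, and the function name and threshold are invented): a guardrail that refuses to remove more than a small fraction of a fleet in one command, so a mistyped server count cannot take down an entire subsystem.

```python
# Illustrative guardrail for a destructive operation: cap how much
# capacity a single command may remove, so a typo in the requested
# count fails loudly instead of executing.

def servers_to_remove(fleet_size, requested, max_fraction=0.05):
    """Validate a removal request against a per-operation safety limit."""
    if requested < 0:
        raise ValueError("cannot remove a negative number of servers")
    limit = int(fleet_size * max_fraction)
    if requested > limit:
        raise ValueError(
            f"refusing to remove {requested} of {fleet_size} servers; "
            f"limit is {limit} per operation"
        )
    return requested

# A routine request passes the check.
assert servers_to_remove(fleet_size=1000, requested=10) == 10

# A fat-fingered request (100 instead of 10) is rejected before it can run.
try:
    servers_to_remove(fleet_size=1000, requested=100)
except ValueError:
    pass
```

Notably, Amazon’s own follow-up to the incident included similar safeguards: capacity-removal tooling was changed to remove capacity more slowly and to block operations that would take a subsystem below its minimum required capacity.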
