StatusCake

Fortnite, AWS, and the Importance of Monitoring


The Battle Royale game Fortnite has become a sensation amongst online gamers in no time at all. To explain it in simple terms, 100 players are simultaneously dropped into a battleground measuring several (in-game) square kilometers, and must proceed alone or as part of a team towards a random central point on the map whilst avoiding or confronting the other players. The last player or team standing takes the top spot and wins the game. It all adds up to an intense and at times hilarious experience that can last anywhere from 1 to 20 minutes.


The game's growth in popularity has been epic: from 60,000 players at launch last July to 3,200,000 players in under nine months. Suddenly, keeping the game up and running required some pretty serious infrastructure.

From day one, Epic, the publisher behind Fortnite, has relied on Amazon Web Services (AWS) to keep it online, like so many other large businesses such as Airbnb, Unilever, and Netflix.
AWS gives Epic the ability to cope when player numbers spike; the infrastructure workload at the peaks might be up to ten times that of the troughs.

Epic also takes advantage of AWS’s “availability zones”. These 55 zones are designed so that a problem in any one zone doesn’t take web services down with it. Where one zone fails, another simply takes up the baton. Fortnite currently runs across 24 of these zones.

This isn’t to say that AWS and the use of availability zones are infallible. In February of this year Fortnite experienced multiple outages which even AWS’s availability zoning couldn’t prevent.

It’s also worth remembering that whilst many companies such as Epic rely on AWS for its reliability and stability, Amazon itself can still have problems.

Just last month, on Amazon’s Prime Day, the rush for bargains not only brought Amazon down but also impacted AWS. Whilst the AWS service itself continued to operate normally, AWS customers were unable to log in to their accounts.

More serious, however, was the four-hour outage in AWS’s US-East-1 region in February this year, which impacted over half of the top 100 internet retailers. Many websites saw their performance severely degraded (Disney’s store took over 1000% longer to load than normal), and many others went down completely. The same region had similar issues again in May.
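To put a figure like "1000% longer" in perspective: a page that takes X% longer than normal needs its usual load time plus X% of it, so 1000% longer works out to eleven times the normal load time. The short sketch below shows the arithmetic; the 2-second baseline is a made-up figure for illustration, not a measurement of any real site.

```python
def slowed_load_time(baseline_seconds: float, percent_longer: float) -> float:
    """Load time when a page takes `percent_longer` percent longer than usual."""
    return baseline_seconds * (1 + percent_longer / 100)


# Hypothetical baseline: a page that normally loads in 2 seconds.
normal = 2.0
degraded = slowed_load_time(normal, 1000)
print(f"{normal:.1f}s normally, {degraded:.1f}s when 1000% slower")
```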

All of this highlights that even if you’re using a cloud service provider such as AWS or Google Cloud, monitoring your website is as important as ever.
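As a rough illustration of what the most basic form of monitoring involves, here is a minimal sketch of a single uptime check using only Python's standard library. It is not how StatusCake or any other monitoring service works internally; the names `CheckResult`, `http_status`, and `check`, the 10-second timeout, and the rule that any 2xx or 3xx status counts as "up" are all assumptions made for the example.

```python
import time
import urllib.error
import urllib.request
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass
class CheckResult:
    url: str
    up: bool
    status: Optional[int]   # HTTP status code, or None if no response arrived
    seconds: float          # how long the check took
    error: Optional[str] = None


def http_status(url: str, timeout: float) -> int:
    """Fetch the URL and return the HTTP status code of the response."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as response:
            return response.status
    except urllib.error.HTTPError as exc:
        # 4xx/5xx responses are raised as HTTPError, but the server did answer.
        return exc.code


def check(url: str, timeout: float = 10.0,
          fetch: Callable[[str, float], int] = http_status) -> CheckResult:
    """Run one check and record whether the site answered, and how quickly."""
    started = time.monotonic()
    try:
        status = fetch(url, timeout)
    except Exception as exc:  # DNS failure, timeout, refused connection, ...
        return CheckResult(url, False, None, time.monotonic() - started, str(exc))
    elapsed = time.monotonic() - started
    # Assumption for this sketch: any 2xx or 3xx status counts as "up".
    return CheckResult(url, 200 <= status < 400, status, elapsed)
```

Calling `check("https://example.com")` returns a `CheckResult` that records whether the site responded and how long it took. A real monitoring setup would run checks like this on a schedule, from several locations, and alert someone when they fail. The `fetch` parameter exists only so the network call can be swapped out, for example in tests.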
