StatusCake

Lessons Learned From Major Website Outages


Unfortunately, website outages are common and can occur at very inopportune times. In the US, the National Hurricane Data Center’s website went down this October due to DNS errors just as Hurricane Matthew approached the coast of Florida. Amazon, BT, the BBC, Google and Microsoft have all had website crashes in recent years, and the recent DDoS attack on Dyn caused website outages at many large international companies, including PayPal and Twitter. The question is not “if” your website will go down, but “when.” Here are a few lessons learned from major outages that can help you reduce the likelihood of an outage and cope when your site does go down.

Anticipate potential problems

It may be a twist on an old cliché, but it’s good advice: “The best offense is a good defense.” Be proactive, and defend against potential website crashes by evaluating your network and systems before problems occur. Identify what could cause failures at critical points and where you need to build in system redundancy. Practice restoring critical systems before they go down so you’ll be ready to take quick action if a real failure occurs.

Monitor your website traffic. If it’s steadily growing, be sure you have the capacity to handle future growth. Traffic growth can be unpredictable. A favorable review of your company and its services could cause a spike in traffic, and your site could go down if you’re not equipped to handle it.
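As a rough illustration of the capacity check described above, the sketch below fits a straight line to recent weekly traffic peaks and estimates how many weeks remain before the trend reaches your capacity. The function name `weeks_until_capacity`, the weekly sampling and the linear-growth assumption are all illustrative choices, not part of any particular monitoring product.

```python
from typing import Optional, Sequence


def weeks_until_capacity(weekly_peaks: Sequence[float], capacity: float) -> Optional[float]:
    """Estimate weeks until peak traffic reaches capacity, using a least-squares line.

    Returns 0.0 if the latest peak already meets or exceeds capacity, and None
    if the fitted trend is flat or falling (no projected crossing).
    """
    n = len(weekly_peaks)
    if n < 2:
        raise ValueError("need at least two data points to fit a trend")
    if weekly_peaks[-1] >= capacity:
        return 0.0

    # Ordinary least-squares fit of peak = slope * week + intercept.
    xs = range(n)
    mean_x = sum(xs) / n
    mean_y = sum(weekly_peaks) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, weekly_peaks))
    slope = sxy / sxx
    if slope <= 0:
        return None
    intercept = mean_y - slope * mean_x

    # Week index where the fitted line crosses capacity, measured from the latest sample.
    crossing_week = (capacity - intercept) / slope
    return max(0.0, crossing_week - (n - 1))


if __name__ == "__main__":
    # Hypothetical numbers: peak requests per second over four weeks, capacity of 200.
    print(weeks_until_capacity([100, 110, 120, 130], capacity=200))
```

A trend line like this only covers steady growth. It cannot anticipate a sudden spike, so it is worth keeping headroom well above whatever the projection suggests.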

Communicate with your customers

If your website is down for any length of time, use social media and email to keep your customers informed. Be realistic and honest with them – they will appreciate it, and you’ll get some goodwill out of a difficult situation. Let them know why the site crashed, what steps you are taking to get it back online and how long you think it’s going to take. If it’s taking longer than you anticipated, give your customers an updated report. An uninformed customer becomes an unhappy customer, and an unhappy customer becomes an ex-customer.

Let your customers know when your site is back online and thank them for their patience. If possible, offer your customers something of value to compensate them for their inconvenience. For example, if you provide a paid subscription service, offer your customers the service for free for a short time. Again, your objective is to make your customers happy and retain them.

Don’t try to fix problems on the fly

When your site goes down, you want to get back online as quickly as possible. However, a quick fix may not be a stable fix, and your site may crash again. Roll back to a previous, trusted version of your site while you diagnose what caused your website outage, and take the time to test a fix properly before you implement it.
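The rollback step can be sketched in a few lines, assuming you keep a release history ordered from oldest to newest and mark each release as trusted once it has run without incident. The `Release` record, its `trusted` flag and the `rollback_target` function are hypothetical names used only for this example.

```python
from dataclasses import dataclass
from typing import Optional, Sequence


@dataclass(frozen=True)
class Release:
    version: str
    trusted: bool  # hypothetical flag: this release ran in production without incident


def rollback_target(history: Sequence[Release], current: str) -> Optional[Release]:
    """Return the most recent trusted release deployed before `current`.

    `history` is ordered oldest to newest. Returns None when no earlier
    trusted release exists.
    """
    versions = [release.version for release in history]
    if current not in versions:
        raise ValueError(f"unknown release: {current}")
    current_index = versions.index(current)
    # Walk backwards from the release just before the current one.
    for release in reversed(history[:current_index]):
        if release.trusted:
            return release
    return None
```

Recording which releases are trusted ahead of time means the decision is already made when the site is down, which leaves you free to diagnose the real cause without rushing.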

Monitor your website performance

Always remember that everything is impermanent. Despite your best planning efforts, problems can occur and cause your website to go down. That is why you should monitor your website’s performance: you will know promptly when your site is down and can take immediate corrective action to get back online.
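To make the idea concrete, here is a minimal sketch of a single availability check using only the Python standard library. It is not how any particular monitoring service works. The names `check_site`, `http_status` and `CheckResult`, the 10-second timeout and the rule that any 2xx or 3xx status counts as "up" are all assumptions made for illustration.

```python
import time
import urllib.error
import urllib.request
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass
class CheckResult:
    url: str
    up: bool
    status: Optional[int]
    elapsed_seconds: float
    error: Optional[str] = None


def http_status(url: str, timeout: float) -> int:
    """Fetch the URL and return its HTTP status code."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as response:
            return response.status
    except urllib.error.HTTPError as exc:
        # 4xx/5xx responses raise HTTPError; the server did answer, so report the code.
        return exc.code


def check_site(
    url: str,
    timeout: float = 10.0,
    fetch: Callable[[str, float], int] = http_status,
) -> CheckResult:
    """Run one check and record whether the site responded and how long it took."""
    start = time.monotonic()
    try:
        status = fetch(url, timeout)
    except Exception as exc:  # DNS failure, refused connection, timeout, and so on
        return CheckResult(url, False, None, time.monotonic() - start, str(exc))
    elapsed = time.monotonic() - start
    # Assumption: any 2xx or 3xx status counts as "up".
    return CheckResult(url, 200 <= status < 400, status, elapsed)
```

In practice you would run a check like this on a schedule, from more than one location, and send an alert when `up` is false for several checks in a row. The recorded `elapsed_seconds` also lets you spot a site that is slowing down before it fails outright.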

Perhaps the best lesson is the most obvious – have a plan and react quickly to implement it when you have a website outage.
