Want to know how much website downtime costs, and the impact it can have on your business?
Find out everything you need to know in our new uptime monitoring whitepaper 2021



At present, it seems likely that this downtime was caused by a cache server failure followed by a load balancing issue. This means requests are not being routed to an available server and so are failing at the point of request.
Update 09:24 BST: After around 29 minutes of downtime, Facebook has started to recover for many users, and our global uptime monitoring servers have started to receive status code 200 (OK). There are still some lingering speed issues, and users in some countries are finding that the service fluctuates.
Original: If you thought downtime was just an issue that small independent stores have to deal with, think again. The world’s largest social network is currently experiencing a global blackout. Since around 08:54 BST, the social network has been returning 503 errors (indicating that no server is available to handle the request).
We got alerted to this issue within seconds and are working on a resolution. I don’t expect it will take any longer than half an hour – Facebook Source
This isn’t the first time the social network has experienced downtime, and it won’t be the last. It does, however, highlight the importance of having a third-party monitoring service and, even more importantly, a remotely hosted status page.
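To make the status codes above concrete, here is a minimal sketch of the kind of check an external monitor might run. It is an illustration only, not how our monitoring servers are implemented: the function names `classify_status` and `check`, the labels "up", "down" and "error", and the 10-second timeout are all assumptions made for this example.

```python
import urllib.error
import urllib.request


def classify_status(status_code: int) -> str:
    """Map an HTTP status code to a coarse availability label (hypothetical labels)."""
    if 200 <= status_code < 400:
        return "up"  # e.g. 200 OK, or a redirect
    if status_code >= 500:
        return "down"  # e.g. 503 Service Unavailable
    return "error"  # anything else, such as a 4xx: the server answered but rejected the request


def check(url: str, timeout: float = 10.0) -> str:
    """Fetch `url` once and return its availability label."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as response:
            return classify_status(response.status)
    except urllib.error.HTTPError as exc:
        # urlopen raises HTTPError for 4xx/5xx responses; the status code is on the exception.
        return classify_status(exc.code)
    except OSError:
        # DNS failure, refused connection, or timeout: no HTTP response at all.
        return "down"
```

Running a check like this from several locations outside your own infrastructure is what lets you tell a global outage apart from a regional one, and it keeps working when the monitored site's own servers do not.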