StatusCake

Monitoring StatusCake… With StatusCake


How does the monitor monitor the monitor? No, it’s not a tongue twister, but a question we faced when StatusCake started to be joined by big companies such as the BBC, the NHS and the EA, to name but a few.

What do we do?

StatusCake now monitors over 250,000 websites and has tens of thousands of users who rely on it to let them know if their site goes down. If StatusCake were to face difficulties, it would affect our users on a grand scale, delaying their alerts and missing downtime, so it was clear we needed a good monitoring system.

StatusCake was built around the principle of easily deployable nodes that can come up and go down without impacting service quality. We have a high level of redundancy: around 50% of our node servers can go down at any one point without affecting check rates or alert quality at all. Each node is independent of the others; each grabs its own workload, but also holds the entire system’s workload at any one time. Thanks to this independent structure, a node that is unable to connect to the master servers carries on and tests the servers that have been assigned to it. To ensure tests are not duplicated, the nodes talk to each other, letting each one know which nodes are having trouble and what workload to take on as a result.
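To make the idea concrete, here is a minimal sketch of how independent nodes could split a shared workload without a master and without duplicating tests. It is an illustration only, not StatusCake's actual implementation: the function names, the node and test identifiers, and the choice of rendezvous (highest-random-weight) hashing are all assumptions. Each node holds the full list of tests and a view of which peers are healthy, and works out its own share locally.

```python
import hashlib


def _score(node: str, test_id: str) -> int:
    # Deterministic pseudo-random weight for a (node, test) pair.
    digest = hashlib.sha256(f"{node}:{test_id}".encode()).digest()
    return int.from_bytes(digest[:8], "big")


def owner(test_id: str, healthy_nodes: set[str]) -> str:
    # Hypothetical rule: the healthy node with the highest score runs the test.
    # Every node computes the same answer from the same health view.
    if not healthy_nodes:
        raise ValueError("no healthy nodes available")
    return max(sorted(healthy_nodes), key=lambda n: _score(n, test_id))


def my_workload(me: str, all_tests: list[str], healthy_nodes: set[str]) -> list[str]:
    # Each node knows the whole workload but only runs the tests it owns.
    return [t for t in all_tests if owner(t, healthy_nodes) == me]


if __name__ == "__main__":
    nodes = {"node-a", "node-b", "node-c", "node-d"}  # hypothetical node names
    tests = [f"site-{i}" for i in range(20)]          # hypothetical test ids

    print(len(my_workload("node-a", tests, nodes)))

    # Half the nodes drop out; the survivors pick up the orphaned tests.
    survivors = {"node-a", "node-b"}
    print(len(my_workload("node-a", tests, survivors)))
```

With this kind of scheme, a node that loses peers keeps every test it already had and only picks up tests whose owner disappeared, so the remaining nodes do not reshuffle work among themselves.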

That reduces the chance of something going wrong to an extremely low level, and means we can use StatusCake to monitor StatusCake. We have a StatusCake account set up that monitors all our servers, even the HTTP server! If any part of StatusCake were to go down, another part would notify us almost instantly. We don’t believe there is such a thing as too much redundancy when it comes to assuring our users that their monitoring will remain in place, no matter what difficulties may arise.
