In a recent post, I argued that AI doesn’t fix weak engineering processes; rather it amplifies them. Strong review practices, clear ownership, and solid fundamentals still matter just as much when code is AI-assisted as when it’s not.
That post sparked a follow-up question in the comments that’s worth sitting with:
With AI speeding things up, how do teams realise something’s gone wrong before users do?
It’s the right question to ask next. Because once change velocity increases, prevention alone stops being enough.
AI shortens the distance between an idea and production. That’s the upside.
But it also shortens the distance between a change going live and users feeling its effects.
This isn’t new. What’s new is how little time there is between those moments.
Historically, slower release cycles acted as a buffer. Problems often surfaced during long staging phases, manual QA, or extended rollouts. AI-assisted development compresses those buffers.
The result isn’t more mistakes. It’s less time to notice them.
Good review processes still matter. Tests still matter. None of that goes away.
But once teams are shipping more frequently, the question quietly shifts:
If something does go wrong, how fast will we know — and who will know first?
This isn’t a tooling question. It’s a systems question. And it’s one many teams haven’t fully revisited yet.
In day-to-day engineering terms, this usually looks something like:
If the first signal of trouble is a support ticket, a social post, or a customer escalation, then customers have effectively become your canary in the coal mine.
That’s a risky place to be once change velocity increases.
Playbook Summary: Designing for “Knowing First”
As AI increases change velocity, resilient engineering teams deploy and observe rather than deploy and hope, understand the blast radius of each change, weigh deployment risk alongside build effort, rely on independent external signals, and verify that fixes have actually worked.
The goal isn’t zero failure.
It’s early awareness, smaller blast radius, and faster recovery.
This isn’t about adopting specific tooling. It’s about designing deliberately for early signal.
Here’s how those principles show up in practice.
Fast teams don’t deploy and hope. They deploy and observe.
Every change should come with clear expectations of what healthy looks like. That might be availability, latency, error rates, or reachability from outside your network. If you can’t articulate those expectations, you can’t notice failure quickly.
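One way to make those expectations concrete is to express them as explicit thresholds that every deploy is checked against. The metric names and bounds below are illustrative, not a prescription:

```python
# Minimal sketch: encode post-deploy expectations as data, then evaluate
# observed metrics against them. Names and thresholds are illustrative.

EXPECTATIONS = {
    "availability_pct": {"min": 99.9},
    "p95_latency_ms": {"max": 400},
    "error_rate_pct": {"max": 1.0},
}

def evaluate(observed: dict) -> list[str]:
    """Return a list of violated expectations (empty means healthy)."""
    violations = []
    for metric, bounds in EXPECTATIONS.items():
        value = observed.get(metric)
        if value is None:
            # A missing signal is itself a problem: you can't confirm health.
            violations.append(f"{metric}: no signal")
            continue
        if "min" in bounds and value < bounds["min"]:
            violations.append(f"{metric}: {value} below {bounds['min']}")
        if "max" in bounds and value > bounds["max"]:
            violations.append(f"{metric}: {value} above {bounds['max']}")
    return violations

# A healthy deploy produces no violations; a degraded one names the gap.
print(evaluate({"availability_pct": 99.95, "p95_latency_ms": 320, "error_rate_pct": 0.2}))
print(evaluate({"availability_pct": 99.2, "p95_latency_ms": 900, "error_rate_pct": 0.2}))
```

The point isn’t the specific numbers; it’s that the expectations exist somewhere machine-checkable, so "did this deploy go well?" has a concrete answer.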
Historically, user acceptance testing (UAT) was something we did before shipping.
As deployment frequency increases, validation moves closer to, and beyond, production. Post-deploy checks, real-world validation, and continuous verification become part of the release itself.
Shipping is no longer the end of testing. It’s the beginning of observation.
Fast teams don’t just ask “can we deploy this?” They ask “what breaks if this goes wrong?”
That means understanding which services and users a change can touch, how failures propagate through dependencies, and how quickly the change can be reversed.
It also means being honest about how confident you are in those answers.
In older or more tightly coupled systems, the true blast radius is often wider than expected. Legacy code paths, implicit dependencies, and infrastructure that’s grown organically make outcomes harder to predict.
The less certain you are about how a system behaves, the more you need to pause, reduce scope, and increase validation.
In short, deployment risk isn’t just about what you’re changing. It’s about how well you understand the system you’re changing.
Many teams estimate work based on how long it takes to build.
But as change velocity increases, the cost of deploying a change matters just as much. Riskier changes demand more attention, more validation, and stronger signals.
If story sizing ignores deployment risk, teams are incentivised to move quickly without accounting for operational impact. That gap tends to surface later, usually under pressure.
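One lightweight way to make deployment risk visible in sizing is to score it alongside build effort. The factors and weights below are assumptions for illustration, not a standard model:

```python
# Illustrative sketch: fold deployment risk into sizing, not just build effort.

def deployment_risk(blast_radius: int, confidence: int, reversibility: int) -> int:
    """Each input is a 1-5 rating. Wider blast radius, lower confidence in
    system behaviour, and harder rollback all push the score up."""
    return blast_radius + (6 - confidence) + (6 - reversibility)

def effective_size(build_points: int, risk: int) -> int:
    # High-risk changes earn extra points for validation work,
    # not just for the time it takes to build them.
    return build_points + (risk // 3)

# A small change to a poorly-understood, hard-to-reverse legacy path:
risk = deployment_risk(blast_radius=4, confidence=2, reversibility=2)
print(risk, effective_size(build_points=3, risk=risk))
```

Whether you use points or anything else, the design choice is the same: the cost of deploying a change shows up in planning, not just in the incident afterwards.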
Internal dashboards tell you how the system thinks it’s behaving.
Users experience how it’s actually behaving.
Independent, external signals answer a simple question:
Can someone use this right now?
As change velocity increases, that outside-in view becomes more important, not less.
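As a minimal sketch of that outside-in view, a synthetic check can probe a public URL from a vantage point outside your own infrastructure. The URL, timeout, and thresholds here are illustrative:

```python
# Outside-in sketch: probe a public endpoint and answer the question
# "can someone use this right now?" from outside your own network.
import time
import urllib.request

def check_from_outside(url: str, timeout_s: float = 5.0) -> dict:
    start = time.monotonic()
    try:
        with urllib.request.urlopen(url, timeout=timeout_s) as resp:
            status = resp.status
    except Exception as exc:
        # DNS failures, timeouts, TLS errors: all mean "not usable from here".
        return {"usable": False, "reason": str(exc)}
    elapsed_ms = (time.monotonic() - start) * 1000
    return {"usable": 200 <= status < 400, "status": status, "latency_ms": elapsed_ms}

# Run from a machine outside your infrastructure, e.g.:
# print(check_from_outside("https://example.com/"))
```

Internal dashboards can look green while this check fails; that disagreement is exactly the signal the outside-in view exists to provide.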
Detection doesn’t stop when you act. It stops when you know the action worked.
After a rollback or fix, you still need confirmation: have error rates recovered, has latency returned to normal, and do external checks pass again?
Fast feedback here matters as much as fast detection. Otherwise, velocity just turns into anxious waiting.
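That closing of the loop can be sketched as a recovery check: after a rollback, poll a health signal until it confirms recovery or a deadline passes. The error-rate source below is a stand-in for a real monitoring query:

```python
# Sketch of "detection doesn't stop when you act": wait for the signal
# that confirms a rollback or fix actually worked.
import time

def await_recovery(read_error_rate, threshold_pct=1.0, deadline_s=300, interval_s=0.0):
    """Return True once the error rate drops below the threshold, or False
    if the deadline passes first. read_error_rate is any zero-arg callable
    (here, a stand-in for querying your monitoring system)."""
    deadline = time.monotonic() + deadline_s
    while time.monotonic() < deadline:
        if read_error_rate() < threshold_pct:
            return True
        time.sleep(interval_s)
    return False

# Simulated signal: the error rate falls back under threshold on the third poll.
samples = iter([8.0, 3.5, 0.4])
print(await_recovery(lambda: next(samples), deadline_s=5))
```

A False here is as valuable as a True: it tells you the fix didn’t take, while there’s still time to try something else.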
AI increases throughput. It lowers the cost of making changes.
That’s a good thing, provided awareness keeps up.
When it doesn’t, teams slow down, not because AI failed but because trust in their systems did.
This is one reason external monitoring still matters. Independent availability and performance signals give teams a clear, unbiased view of user experience. They help teams spot issues early and confirm when fixes have actually worked, especially as change velocity increases.
Tools like StatusCake provide that outside-in signal. Not as a replacement for good engineering, but as a complement to it.
Across teams, industries, and stacks, the same pattern shows up: delivery speed outpacing detection speed.
AI doesn’t create this gap. It just reveals it.
Realising something’s gone wrong before users do is only the first step.
Once signals fire, humans still have to interpret them, make decisions under pressure, and act using the tools and processes available to them.
That raises the next question:
Are our systems designed to help humans make good decisions when things go wrong — or to get in their way?
That’s the layer worth exploring next.
So what does this mean in practice?
If AI is an amplifier, awareness is what keeps amplification from turning into instability.
Teams that can see problems early don’t just recover faster. They’re able to ship with more confidence. They take appropriate risks because they understand their systems and trust their signals.
As change velocity increases, the teams that thrive won’t be the ones that try to eliminate failure. They’ll be the ones that design for awareness, act quickly when reality diverges from intent, and learn continuously.
That’s what makes speed sustainable.