
In a recent post, I argued that AI doesn’t fix weak engineering processes; rather, it amplifies them. Strong review practices, clear ownership, and solid fundamentals still matter just as much when code is AI-assisted as when it’s not.
That post sparked a follow-up question in the comments that’s worth sitting with:
With AI speeding things up, how do teams realise something’s gone wrong before users do?
It’s the right question to ask next. Because once change velocity increases, prevention alone stops being enough.
AI shortens the distance between an idea and production. That’s the upside.
But it also shortens the distance between:
- a mistake being introduced, and
- that mistake reaching real users.
This isn’t new. What’s new is how little time there is between those moments.
Historically, slower release cycles acted as a buffer. Problems often surfaced during long staging phases, manual QA, or extended rollouts. AI-assisted development compresses those buffers.
The result isn’t more mistakes. It’s less time to notice them.
Good review processes still matter. Tests still matter. None of that goes away.
But once teams are shipping more frequently, the question quietly shifts:
If something does go wrong, how fast will we know — and who will know first?
This isn’t a tooling question. It’s a systems question. And it’s one many teams haven’t fully revisited yet.
In day-to-day engineering terms, this usually looks something like the following: the first signal of trouble is a support ticket, a social post, or a customer escalation. When that happens, customers have effectively become your canary in the coal mine.
That’s a risky place to be once change velocity increases.
Playbook Summary: Designing for “Knowing First”
As AI increases change velocity, resilient engineering teams:
- attach clear expectations to every change before it ships
- move validation into, and beyond, production
- understand the blast radius before they deploy
- size work by deployment risk, not just build effort
- keep independent, outside-in signals of what users actually experience
- verify that fixes and rollbacks have actually worked
The goal isn’t zero failure.
It’s early awareness, smaller blast radius, and faster recovery.
This isn’t about adopting specific tooling. It’s about designing deliberately for early signal.
Here’s how those principles show up in practice.
Fast teams don’t deploy and hope. They deploy and observe.
Every change should come with clear expectations of what healthy looks like once it’s live: availability, latency, error rates, or reachability from outside your network. If you can’t articulate those expectations, you can’t notice failure quickly.
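As an illustration, here’s a minimal sketch of what articulated expectations might look like if written down as code rather than left implicit. Everything here is hypothetical: the field names, thresholds, and the checkout_v2 change are invented for the example.

```python
from dataclasses import dataclass

@dataclass
class ReleaseExpectations:
    """What 'healthy' should look like after this change ships.

    All values are illustrative; real thresholds belong with the
    change itself, agreed before deploy rather than after.
    """
    availability_pct: float = 99.9      # minimum success rate over the window
    p95_latency_ms: int = 400           # slowest acceptable p95 response time
    max_error_rate_pct: float = 1.0     # 5xx responses as a share of traffic
    externally_reachable: bool = True   # must respond from outside the network

# Hypothetical example: expectations attached to a specific change.
checkout_v2 = ReleaseExpectations(p95_latency_ms=300, max_error_rate_pct=0.5)
```

The value isn’t in the dataclass; it’s in forcing the conversation about what “working” means before the change ships, not after.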
Historically, user acceptance testing (UAT) was something we did before shipping.
As deployment frequency increases, validation moves closer to, and beyond, production. Post-deploy checks, real-world validation, and continuous verification become part of the release itself.
Shipping is no longer the end of testing. It’s the beginning of observation.
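Here’s a deliberately simple sketch of that idea: a post-deploy check that probes a hypothetical endpoint for a short window after release. The URL, thresholds, and rollback hook are all assumptions; real continuous verification would compare error rates and latency percentiles against a pre-deploy baseline, but the shape is the point: verification runs after the release, as part of it.

```python
import time
import urllib.request

def post_deploy_check(url: str, attempts: int = 10, interval_s: int = 30,
                      max_latency_s: float = 2.0) -> bool:
    """Probe a freshly deployed endpoint over a short observation window."""
    failures = 0
    for _ in range(attempts):
        start = time.monotonic()
        try:
            with urllib.request.urlopen(url, timeout=5) as resp:
                healthy = resp.status == 200
        except OSError:  # covers URLError, HTTPError, timeouts
            healthy = False
        if not healthy or time.monotonic() - start > max_latency_s:
            failures += 1
        time.sleep(interval_s)
    # More than one failed probe in the window means the release is suspect.
    return failures <= 1

# Hypothetical usage as the final step of a deploy pipeline:
# if not post_deploy_check("https://example.com/health"):
#     trigger_rollback()  # hypothetical pipeline hook, not a real API
```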
Fast teams don’t just ask “can we deploy this?” They ask “what breaks if this goes wrong?”
That means understanding:
- which services and user journeys depend on the code being changed
- which paths a failure could propagate along
- who would be affected, and how quickly, if the change misbehaves
It also means being honest about how confident you are in those answers.
In older or more tightly coupled systems, the true blast radius is often wider than expected. Legacy code paths, implicit dependencies, and infrastructure that’s grown organically make outcomes harder to predict.
The less certain you are about how a system behaves, the more you need to pause, reduce scope, and increase validation.
In short, deployment risk isn’t just about what you’re changing. It’s about how well you understand the system you’re changing.
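One way to make that trade-off explicit is to let confidence drive rollout scope. The sketch below is illustrative, not a standard: the tiers and numbers are invented, and the confidence score is a subjective judgement, not a measurement.

```python
def rollout_plan(confidence: float) -> dict:
    """Map how well you understand a system to how carefully you change it.

    `confidence` is a subjective 0.0-1.0 judgement of how predictable
    the system is; the tiers and numbers below are illustrative only.
    """
    if confidence >= 0.8:   # well-understood: ship wider, watch briefly
        return {"canary_pct": 25, "soak_minutes": 15, "extra_checks": False}
    if confidence >= 0.5:   # partially understood: shrink scope, watch longer
        return {"canary_pct": 5, "soak_minutes": 60, "extra_checks": True}
    # Poorly understood (legacy paths, implicit dependencies): smallest
    # possible blast radius and the longest observation window.
    return {"canary_pct": 1, "soak_minutes": 240, "extra_checks": True}
```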
Many teams estimate work based on how long it takes to build.
But as change velocity increases, the cost of deploying a change matters just as much. Riskier changes demand more attention, more validation, and stronger signals.
If story sizing ignores deployment risk, teams are incentivised to move quickly without accounting for operational impact. That gap tends to surface later, usually under pressure.
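A crude but honest way to account for that in sizing is to weight build effort by deployment risk. The multipliers below are invented for illustration; the point is only that a “small” change to a fragile path should be planned as the larger task it really is.

```python
# Illustrative multipliers only; the point is that deploy risk is priced in.
RISK_MULTIPLIER = {"low": 1.0, "medium": 1.5, "high": 2.5}

def effective_points(build_points: int, deploy_risk: str) -> float:
    """Weight build effort by deployment risk so planning reflects both."""
    return build_points * RISK_MULTIPLIER[deploy_risk]

# Two 'three-point' stories that are not the same size in practice:
print(effective_points(3, "low"))   # 3.0 -> routine change, well-trodden path
print(effective_points(3, "high"))  # 7.5 -> needs canary and extra validation
```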
Internal dashboards tell you how the system thinks it’s behaving.
Users experience how it’s actually behaving.
Independent, external signals answer a simple question:
Can someone use this right now?
As change velocity increases, that outside-in view becomes more important, not less.
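Asked literally, the question is easy to encode. The sketch below walks a hypothetical user journey from outside the network; the URLs are placeholders, and a real external monitor would also check response content and run from several regions you don’t control. But the shape is the same.

```python
import urllib.request

# Hypothetical journey: the pages a user must reach to get value.
JOURNEY = [
    "https://example.com/",            # landing page loads
    "https://example.com/login",       # users can reach sign-in
    "https://example.com/api/status",  # the backing API answers
]

def journey_ok() -> bool:
    """Ask the outside-in question literally: can someone use this right now?"""
    for url in JOURNEY:
        try:
            with urllib.request.urlopen(url, timeout=10) as resp:
                if resp.status >= 400:
                    return False
        except OSError:  # unreachable, timed out, or returned an error
            return False
    return True
```

Run from inside your own network, this only confirms the dashboard’s view; run from outside, it tells you what users actually see.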
Detection doesn’t stop when you act. It stops when you know the action worked.
After a rollback or fix:
- Have error rates and latency actually returned to baseline?
- Can users complete the journeys that were broken?
- Do independent, external checks agree that recovery is real?
Fast feedback here matters as much as fast detection. Otherwise, velocity just turns into anxious waiting.
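In code terms, that’s the difference between firing a fix and confirming it. The sketch below keeps probing until recovery is demonstrable; `check` is any health probe (the journey check above would do), and the pass counts and timings are illustrative.

```python
import time

def confirm_recovery(check, required_passes: int = 5,
                     interval_s: int = 60, deadline_s: int = 1800) -> bool:
    """Keep probing until a fix has demonstrably worked, or give up.

    `check` is any zero-argument callable that returns True when healthy.
    Requiring several consecutive passes avoids calling one lucky probe
    'recovered'.
    """
    streak, start = 0, time.monotonic()
    while time.monotonic() - start < deadline_s:
        streak = streak + 1 if check() else 0  # any failure resets the streak
        if streak >= required_passes:
            return True
        time.sleep(interval_s)
    return False  # never stabilised within the window; keep the incident open
```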
AI increases throughput. It lowers the cost of making changes.
That’s a good thing, provided awareness keeps up.
When it doesn’t, incidents are found late, confidence erodes, and every release starts to feel like a gamble. Teams slow down not because AI failed, but because trust in their systems did.
This is one reason external monitoring still matters. Independent availability and performance signals give teams a clear, unbiased view of user experience. They help teams spot issues early and confirm when fixes have actually worked, especially as change velocity increases.
Tools like StatusCake provide that outside-in signal. Not as a replacement for good engineering, but as a complement to it.
Across teams, industries, and stacks, the same pattern shows up: the ability to ship grows faster than the ability to notice. AI doesn’t create this gap. It just reveals it.
Realising something’s gone wrong before users do is only the first step.
Once signals fire, humans still have to interpret them, make decisions under pressure, and act using the tools and processes available to them.
That raises the next question:
Are our systems designed to help humans make good decisions when things go wrong — or to get in their way?
That’s the layer worth exploring next.
So what does this mean in practice?
If AI is an amplifier, awareness is what keeps amplification from turning into instability.
Teams that can see problems early don’t just recover faster. They’re able to ship with more confidence. They take appropriate risks because they understand their systems and trust their signals.
As change velocity increases, the teams that thrive won’t be the ones that try to eliminate failure. They’ll be the ones that design for awareness, act quickly when reality diverges from intent, and learn continuously.
That’s what makes speed sustainable.