How AI Is Shifting Software Engineering's Primary Constraint
For most of the history of software engineering, the primary constraint was production. Code was expensive, skilled engineers were scarce, and shipping features required concentrated human effort.
Velocity was limited by how fast people could reason, implement, test, and deploy. That constraint shaped everything from team size and architecture to release cadence and how we thought about technical debt.
When production is expensive, you optimise for output. You remove friction from shipping. You invest in tooling that increases developer productivity, and you accept some structural mess in exchange for forward motion.
For decades, that trade-off made sense. The dominant bottleneck was human output. But AI has materially shifted that constraint: the marginal cost of producing code is falling, and with it the friction to produce change.
But whenever a constraint is relaxed in a system, another becomes dominant. If production is no longer the primary bottleneck, what is?
Increasingly, it is comprehension under operational stress. The constraint has moved.
Every engineering organisation operates within a Production–Comprehension Balance. It is not a metric on a dashboard. It is a structural relationship, and it describes the balance between two capacities:
- Production: the rate at which new code, features, and structural changes are introduced.
- Comprehension: the shared mental models, observability, ownership clarity, documentation, and operational readiness that allow teams to reason about system behaviour, especially when it fails.
As long as production and comprehension scale together, the system feels resilient. The problem therefore isn't velocity; it's imbalance.
When production accelerates faster than comprehension, fragility begins to accumulate. That shift is rarely dramatic at first; it’s gradual.
Imbalance does not typically appear in the roadmap. Velocity may remain high. Features continue to ship, and output is visible and celebrated.
The cost appears elsewhere. The system still functions. Until it doesn't.
During an outage, degraded comprehension reveals itself quickly.
Let’s consider a common pattern.
A team accelerates delivery using AI-assisted development. Deployment frequency increases significantly. New services are introduced quickly, and interfaces evolve rapidly.
Months later, an incident occurs involving an unexpected interaction between two services modified weeks apart by different teams.
The code in isolation is sound. The failure emerges from the interaction between the two changes.
Resolution takes hours; not because the fix is complex, but because reconstructing system behaviour under stress requires rebuilding shared context.
Nothing “went wrong” in the traditional sense, but production had outpaced comprehension.
Modern SRE practice already provides language for managing these trade-offs: deployment frequency, mean time to recovery (MTTR), and change failure rate.
These are not just operational metrics. They are economic signals. They describe how efficiently an organisation converts change into value without incurring unacceptable risk.
When AI increases deployment velocity, several second-order effects follow. For example:
- If MTTR remains stable while deployment frequency rises, production and comprehension are scaling together.
- If MTTR drifts upward while change volume increases, imbalance is emerging.
- If change failure rate rises as output accelerates, the marginal cost of change has not disappeared; it has shifted into recovery.
The Production–Comprehension Balance is visible in these signals, and it is measurable.
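These signals can be computed directly from basic delivery records. Below is a minimal sketch in Python; the record shapes, names, and the "drift" rule are illustrative assumptions, not taken from any particular tool:

```python
from dataclasses import dataclass


@dataclass
class Incident:
    detected_at: float  # epoch seconds
    resolved_at: float  # epoch seconds


def mttr_hours(incidents: list[Incident]) -> float:
    """Mean time to recovery across a set of incidents, in hours."""
    if not incidents:
        return 0.0
    total = sum(i.resolved_at - i.detected_at for i in incidents)
    return total / len(incidents) / 3600


def change_failure_rate(deploys: int, failed_deploys: int) -> float:
    """Fraction of deployments that led to a failure in production."""
    return failed_deploys / deploys if deploys else 0.0


def imbalance_emerging(prev_mttr: float, curr_mttr: float,
                       prev_deploys: int, curr_deploys: int) -> bool:
    """The warning sign described above: MTTR drifting upward
    while change volume increases between two review periods."""
    return curr_mttr > prev_mttr and curr_deploys > prev_deploys
```

Comparing these figures across review periods turns the balance into a quantitative cue rather than a post-incident surprise.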
Whilst AI lowers the friction to produce code, it does not eliminate coordination cost. Parallel change increases the coordination burden.
In distributed systems, interaction effects multiply quickly. The difficulty is rarely in writing the code; it is in reasoning about its interactions.
And whilst AI can suggest improvements to a function, it can’t resolve organisational misalignment. It can’t automatically update the unwritten assumptions that exist between services.
Coordination therefore remains a human constraint.
As production accelerates, coordination load increases unless architecture, communication, and observability evolve in tandem.
Monitoring provides the feedback loop required to manage the Production–Comprehension Balance. It answers the critical question: is comprehension keeping pace with production?
Without instrumentation, imbalance is felt subjectively. With instrumentation, it becomes visible.
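That feedback loop can be as simple as tracking a reliability metric over a rolling window and flagging sustained drift. A minimal sketch, assuming periodic MTTR samples; the window size and the "strictly increasing" drift rule are illustrative assumptions:

```python
from collections import deque


class TrendMonitor:
    """Rolling window over a metric (e.g. weekly MTTR) that flags
    sustained upward drift. Defaults are illustrative, not recommendations."""

    def __init__(self, window: int = 4):
        self.values = deque(maxlen=window)

    def record(self, value: float) -> None:
        """Append the latest sample; the deque drops the oldest one."""
        self.values.append(value)

    def drifting_up(self) -> bool:
        """True only when the window is full and every sample is higher
        than the one before it, i.e. a sustained rise, not a single spike."""
        vals = list(self.values)
        return len(vals) == self.values.maxlen and all(
            b > a for a, b in zip(vals, vals[1:]))
```

A single bad week stays invisible; four consecutive worsening weeks become a signal, which is exactly the transformation from felt fragility to measured fragility.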
So whilst monitoring does not eliminate cognitive debt, it does reveal when production is outpacing comprehension. It transforms fragility from a surprise into a signal. In this sense, monitoring isn't just operational tooling; it's the nervous system of a high-velocity organisation.
Let's be clear: acceleration is rational. Competitive environments reward speed, customers expect rapid iteration, and internal ambition drives improvement.
As such, when the cost of production falls, organisations will produce more. That higher velocity is visible and rewarded, but the degradation of comprehension is subtle.
The responsibility of engineering leadership is not to resist acceleration. It is to preserve balance. That may require deliberate investment in the comprehension side of the balance: observability, documentation, ownership clarity, and operational readiness.
AI changes the slope of production, but it does not (and should not) remove the need for discipline.
Whilst AI has lowered the cost of producing software, it hasn’t lowered the cost of misunderstanding software.
Incidents still require coordinated reasoning. Recovery still depends on shared mental models, and reliability still rests on clarity and observability.
As production accelerates, comprehension becomes the scarce resource.
As such, the primary constraint has shifted. Recognising and managing the Production–Comprehension Balance may be one of the defining engineering leadership challenges of this era.