StatusCake

In the Age of AI, Operational Memory Matters Most During Incidents

Artificial intelligence is making software easier to produce. That much is already obvious. Code that once took hours to scaffold can now be drafted in minutes. Boilerplate, integration logic, tests, refactors and small internal tools can be generated with startling speed. In some cases, even substantial pieces of implementation can be assembled quickly enough to make older assumptions about software effort look dated.

It is tempting, then, to conclude that the hard part of software is receding. Yet this would be a mistake. For while AI may reduce the cost of generating code, it does not reduce the cost of understanding a live system when that system begins to fail. If anything, it may increase it.

That is because AI accelerates the production of change more readily than it accelerates the production of shared understanding around that change. A team may now be able to introduce new logic, new integrations and new dependencies into production at a pace that would recently have seemed implausible. But the ability to explain, under pressure, what the system is doing, why it is behaving unexpectedly, what to check first and what to do next remains stubbornly human. It is also, in many organisations, badly under-recorded.

This is why I suspect one of the least appreciated consequences of AI in software engineering will be a renewed importance for two rather old-fashioned artefacts: runbooks and postmortems.

They are not glamorous. They do not lend themselves to inflated claims about transformation. Yet when systems fail, they are among the clearest expressions of whether an engineering organisation has managed to preserve operational memory, or whether it has merely become proficient at producing software faster than it can truly understand it.

Faster software does not mean easier incidents

Much of the discussion about AI in engineering still takes place at the point of creation. The focus is on generation: how much quicker code can be produced, how much routine effort can be eliminated, how many tasks can be compressed or delegated. These are real gains. But they belong mainly to the front end of the lifecycle. They say comparatively little about what happens once a system is in production and begins, as all systems eventually do, to behave in ways its authors did not fully anticipate.

For production has a habit of ignoring fashionable narratives. A service either remains available under real conditions, or it does not. A deployment either degrades gracefully, or it does not. A dependency either fails in a tolerable way, or it drags something larger down with it. No matter how quickly the software was written, the incident itself arrives at full speed.

This is where the modern conversation about AI sometimes feels oddly incomplete. It tends to assume that a reduction in implementation effort constitutes a reduction in engineering difficulty as such. But incidents remind us that software has never been difficult merely because code was laborious to produce. It has been difficult because software systems are socio-technical systems: they contain history, assumptions, brittle dependencies, operational quirks, failure modes and half-visible decisions layered over time. When something breaks, the real question is not whether code once existed to make it happen. It is whether the organisation still knows enough to respond intelligently.

If AI increases the pace of change, that question becomes more pressing, not less.

What matters in an incident is not just information, but usable memory

There is, of course, no shortage of information in most software environments. There are logs, dashboards, code repositories, deployment records, traces, alert histories and ticket trails. In principle, all this should help. In practice, raw information is not the same thing as operational understanding.

An incident rarely unfolds as a neat puzzle whose answer lies waiting in a single graph or stack trace. It is usually messier than that. Signals conflict. Alerts arrive too late or too often. A symptom appears in one service while the cause lies in another. Engineers must decide what to examine first, which metrics are trustworthy, what has changed recently, what is safe to try, and which actions might make matters worse. Such judgment depends on memory as much as on instrumentation.

That is where runbooks matter.

A good runbook is not merely a document explaining how a service works. It is an attempt to preserve operational judgment in a form that can be used under pressure. It captures, in compressed form, the practical knowledge that teams otherwise leave dispersed among a few individuals: what healthy behaviour looks like, which checks should come first, which dependencies commonly fail, what immediate mitigations are available, what escalation path is sensible, and where an inexperienced responder is most likely to waste time. In short, it turns recollection into procedure without pretending that procedure can eliminate the need for judgment.
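The qualities described above can be made concrete. A minimal runbook skeleton along these lines might look like the following; the service name, checks, thresholds and mitigations are purely illustrative, not drawn from any real system:

```markdown
# Runbook: payments-api (illustrative example)

## Healthy behaviour
- p95 latency below 300 ms; error rate below 0.1%
- Queue depth on the payment-events topic normally under 1,000

## First checks, in order
1. Deployments in the last two hours (what changed most recently?)
2. Error-rate and latency dashboards (links here)
3. Status of the upstream card-processor gateway

## Known failure modes and safe mitigations
- Gateway timeouts: enable the cached-fallback feature flag
- Queue backlog: scale consumers before restarting producers

## Escalation
- If mitigation has not taken effect within 15 minutes,
  page the platform team's on-call engineer
```

The point of the skeleton is not completeness but ordering: it tells a responder where to look first and what is safe to try, which is exactly the judgment that evaporates when it lives only in individual heads.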

This is precisely why runbooks become more rather than less important as AI enters the engineering environment. The more quickly systems can be modified, the less wise it is to assume that the people responding to an incident will possess deep, tacit familiarity with every path by which the current state came into existence. A team may have shipped efficiently and still be underprepared operationally. In such a world, the absence of a good runbook is not merely an inconvenience. It is a sign that the organisation has failed to convert knowledge into resilience.

Runbooks are a test of whether a team truly understands its systems

It is easy to dismiss runbooks as procedural clutter. Many teams do. They are written late, updated reluctantly and consulted only in moments of stress. Yet that is precisely why they reveal so much.

A team that cannot produce a useful runbook for a service is often a team that does not understand that service as well as it thinks. It may understand the code. It may understand the architecture at a whiteboard level. But to write a runbook requires a more exacting form of knowledge. One must know which failures are likely, which indicators are meaningful, which actions are safe, which dependencies matter first, and how someone else should proceed when time is short and confidence is low.

That is a different standard. It is closer to operational truth.

And it is operational truth, not developmental fluency, that determines how an organisation behaves when something is actually broken.

This is where the age of AI may expose weaknesses that were easier to ignore before. If more code can be created more quickly, then more systems may arrive in production carrying less deeply shared understanding with them. That does not mean the code is poor. It means that a certain older source of familiarity — the slow, manual intimacy of building everything by hand — becomes less reliable as a mechanism for preserving knowledge. Teams can no longer count on the effort of implementation to generate enough understanding by itself. If the memory required during incidents is not written down deliberately, it may not exist where and when it is needed.

The practical consequence is straightforward enough. AI may help teams move faster; runbooks help them remain coherent while doing so.

Postmortems are how organisations remember what really happened

If runbooks are what matter during an incident, postmortems are what matter afterwards. They are the means by which an organisation decides whether a failure will become part of its memory or merely another episode to be half-forgotten until conditions happen to recreate it.

This, too, is often underestimated. Postmortems are sometimes treated as rituals of accountability, or as cultural signals, or as administrative exercises reluctantly completed once the service has been restored. They can certainly degenerate into all of those things. But at their best, they serve a much more serious purpose. They record reality before reality is softened by time.

A postmortem captures what the source code does not. It records what was observed, what was assumed, what proved false, what delayed diagnosis, what mitigated impact, which signals were missed, which dependencies were more fragile than expected, and what parts of the response relied too heavily on individual experience. In so doing, it transforms failure into knowledge that can be reused.
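A postmortem that captures those things need not be elaborate. A short template along these lines (illustrative, not prescriptive) is enough to force the right questions:

```markdown
# Postmortem: <incident title>

- Impact: who was affected, for how long, and how severely
- Timeline: detection, diagnosis, mitigation, resolution (with timestamps)
- What we believed at the time, and what turned out to be false
- What delayed diagnosis: missing signals, misleading alerts, runbook gaps
- What limited the impact: mitigations, safeguards, or plain luck
- Dependencies that proved more fragile than expected
- Where the response relied on individual memory rather than recorded knowledge
- Follow-up actions, each with a named owner and a date
```

The discipline lies less in the headings than in answering them honestly while the details are still fresh.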

This is not an optional refinement for mature organisations. It is one of the few ways in which engineering teams can prevent operational lessons from evaporating.

And here again AI sharpens the case rather than weakening it. If the cost of creating software is falling, then the cost of relearning operational lessons becomes relatively higher. A team that can ship more quickly but continues to lose the knowledge generated by incidents is not becoming more capable in any deep sense. It is simply forgetting at a more modern pace.

The function of the postmortem, then, is not merely to explain the last incident. It is to make the next one less opaque. That may sound modest, but in complex systems it is a significant achievement. Reliability is rarely secured once and for all. It is built, in part, through the disciplined accumulation of remembered failure.

Monitoring tells you that something is wrong. Operational memory tells you what to do next.

An uptime monitoring company scarcely needs to be reminded of the value of visibility. Yet visibility alone is not enough. Monitoring can tell you that something is wrong. It can show that latency has climbed, checks are failing, customers are affected and a deployment may have coincided with the start of trouble. It can tell you, in the most important sense, that reality has diverged from expectation.

But it does not, by itself, tell an organisation how to respond well.

That is the work of operational memory. It is what allows monitoring data to become a decision rather than merely an alarm. A runbook tells the responder which path into the problem is most promising and which interventions are safe. A postmortem tells the organisation what it failed to understand last time, and what therefore deserves suspicion sooner this time. Monitoring reveals the incident; runbooks and postmortems help make the incident survivable and intelligible.
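One small, concrete way to join monitoring to operational memory is to attach the runbook directly to the alert. In Prometheus-style alerting rules, for example, a `runbook_url` annotation is a widely used convention; the alert name, expression, threshold and URL below are illustrative only:

```yaml
groups:
  - name: checkout-availability
    rules:
      - alert: CheckoutHighErrorRate   # illustrative alert name
        expr: |
          sum(rate(http_requests_total{job="checkout", code=~"5.."}[5m]))
            / sum(rate(http_requests_total{job="checkout"}[5m])) > 0.05
        for: 5m
        labels:
          severity: page
        annotations:
          summary: "Checkout error rate above 5% for 5 minutes"
          runbook_url: "https://wiki.example.com/runbooks/checkout-high-error-rate"
```

With this in place, the page that wakes an engineer carries its own context: the alert states what has diverged from expectation, and the linked runbook states what to do about it.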

This distinction matters because it clarifies what AI can and cannot be expected to change. AI may help teams generate code more quickly, and in time it may assist with parts of diagnosis and analysis too. But the effectiveness of those interventions will still depend heavily on the quality of the operational context an organisation has bothered to preserve. A machine can only work with the memory a team has made available to it. If the organisation has not recorded how systems fail, what trade-offs were accepted, which signals matter and what the last incident actually taught, then it should not be surprised when both humans and tools reason poorly under pressure.

In that sense, runbooks and postmortems are not merely operational documents. They are repositories of usable context. They make future action more intelligent.

The teams that cope best with AI will be the ones that remember best

It is possible, when confronted with new tooling, to become over-impressed by the visible act of generation. Code appears, and because it appears quickly, one assumes that the engineering system has become more powerful overall. Sometimes it has. But power in software is not measured only by how quickly new things can be made. It is measured also by how well an organisation continues to understand what it has made, and how competently it responds when the real world contests that understanding.

That is why I think the future belongs less to the teams that merely use AI to write faster, and more to the teams that use speed without sacrificing memory. They will be the teams that continue to treat runbooks as operational infrastructure rather than administrative residue. They will be the teams that write postmortems not because process demands it, but because failure that is not remembered is failure that will be paid for again. And they will be the teams that understand that as software becomes easier to generate, the disciplines that preserve context during and after incidents become more valuable still.

One may put the matter more starkly. AI may reduce the cost of software creation. It does not reduce the cost of confusion during an incident. Nor does it reduce the cost of forgetting what an incident has already taught.

That is why, in the age of AI, operational memory matters most when systems fail. And it is why the quiet documents many teams neglect — the runbook consulted at the worst moment, the postmortem written after the noise has subsided — may turn out to be among the most valuable engineering artefacts of all.
