Years ago, I owned a feature of AWS S3. It was a relatively minor thing by Amazon standards. But this was S3: minor means millions of requests per second.
The feature handled data that had error correction built in two different places, just to be sure nothing bad could pass through. To me, it felt almost redundant.
Then one day a power surge hit some networking equipment in some data centre in the middle of some onion field, and flipped a handful of bits. Out of sheer coincidence, the specific way those bits flipped broke the error correction itself. Not triggered it, not exposed it, broke it.
Because we assumed that scenario would never happen, parts of the design turned the error correction from a protection into a failure mode.
The system went down.
I remember thinking: what are the odds? Less than one in a million, probably. It had to be. Normal people typically file “power surge flips bits in exactly the wrong configuration” under “theoretical things that actually don’t happen outside of university textbooks”.
Instead, it happened outside of textbooks, and straight into my Tuesday afternoon.
The post-mortem we ran afterwards was long and rigorous. You know shit’s serious when the boss of your boss sits with you writing the root cause analysis document. And yes, we made data corruption less likely to cause the same problem again.
But that wasn’t all.
The main finding wasn’t the actual event, but rather that our response had been poor. Not the technical response. The human one. We hadn’t communicated promptly with impacted customers. We hadn’t been clear about what was happening, or when it would be resolved. The gap between the incident starting and customers understanding what was going on was far too long, and it caused a lot of trouble for companies using our feature. Tens of millions of dollars of trouble. That’s where most of the actual damage happened.
The post-mortem outcome wasn’t a fix for bit-flipping. It was a structural program to overhaul how we communicated with customers during incidents. A roadmap that took the better part of a year, codenamed “Project Houston” (for “Houston, we have a problem” from Apollo 13).
Let me introduce you to John Edensor Littlewood.
Littlewood’s Law of Miracles
Littlewood was a Cambridge mathematician, mostly known for serious work in analysis and number theory. He also had a habit of applying mathematical thinking to things mathematicians usually leave alone. It’s a move I find myself doing a lot: I’ve used CAP theorem to think about team tradeoffs, and PID controllers to think about how managers calibrate feedback. Littlewood is the same move, applied to incidents.
In his 1953 collection A Mathematician’s Miscellany (a lovely little book in its entirety), he made a small observation:
A “one in a million” event happens to each of us about once a month.
Here’s the arithmetic. When you’re awake and alert, you experience roughly one event every two seconds: something happening in the world, an email, a meeting, a notification, a conversation. Sixteen waking hours give you about 30,000 events per day. After 35 days, you’ve hit a million.
If the odds of any single event being extraordinary are one in a million, you should expect one extraordinary thing per month. Not as a surprise, as a schedule.
Now scale this to a team shipping software. Deploys. Hardware events. Network changes. Dependencies acting funny. You’re generating events fast enough that tail events aren’t rare. They’re routine. The bit-flip wasn’t a black swan. It was a scheduled appointment I hadn’t put in my calendar yet.
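The arithmetic above, and the jump from one person to a system, can be checked in a few lines of Python. The constants come straight from the text; the request rate is illustrative, standing in for the “millions of requests per second” from the opening, not a real S3 figure:

```python
# Littlewood's arithmetic, as stated in the text:
# one event every 2 seconds, 16 waking hours a day,
# "extraordinary" odds of one in a million.
SECONDS_PER_HOUR = 3600
WAKING_HOURS = 16
SECONDS_PER_EVENT = 2

events_per_day = WAKING_HOURS * SECONDS_PER_HOUR // SECONDS_PER_EVENT
print(events_per_day)  # 28800, roughly the 30,000 in the text

days_to_a_million = 1_000_000 / events_per_day
print(round(days_to_a_million))  # ~35 days: about one "miracle" a month

# Now scale to a system. At an illustrative one million
# requests per second, a one-in-a-million event per request
# stops being monthly and becomes roughly once per second.
requests_per_second = 1_000_000
one_in_a_million = 1 / 1_000_000
expected_tail_events_per_day = requests_per_second * 86_400 * one_in_a_million
print(expected_tail_events_per_day)  # 86400.0: one tail event per second, all day
```

The point of the last number isn’t precision. It’s that at system scale the “one in a million” qualifier cancels out entirely, leaving you with a rate, not a rarity.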
The implications of this go further than most people take them. It’s not just a fun mathematical curiosity.
If extraordinary events are scheduled, the orientation of incident management needs to change.
What you prepare for, what you measure, what you ask in a post-mortem: all of it looks different once you accept that you are permanently inside a distribution that will keep producing tail events no matter how carefully you engineer.
What post-mortems are actually for
Most post-mortems ask the wrong question: how do we make sure this never happens again?
It’s seductive because it feels like real progress. You found the thing. You fixed the thing. You’re done. But it treats the incident as an anomaly when it’s a draw from a distribution you’re always inside. You can make that specific failure less likely. You cannot opt out of the distribution. If it wasn’t this, it would have been something else, in some configuration you haven’t anticipated, at some point in the next few months.
Our post-mortem didn’t ask that. It asked: given that something like this will happen again, what do we want our response to look like?
The bit-flip was not fully addressable. You can harden against it, but you can’t eliminate it. But the communication failure was entirely addressable, and it was the thing that caused the damage. Better error correction would have helped one scenario. Better customer communication processes would reduce the damage from every incident, of every kind, forever.
Beyond the fire
The trigger still gets fixed. You’re not ignoring the specific failure mode. But there’s a second fix that good post-mortems find and bad ones miss.
Slow detection. Unclear ownership. Poor customer communication. No runbook. No escalation path. These aren’t the cause of the incident, but they’re the cause of the impact. Think of them as incident response debt: invisible until you need them, expensive when you do. They’re also what will matter for the next incident, whatever form it takes. The S3 post-mortem found both: we fixed the bit-flip vulnerability, then spent a year on Project Houston, because one fix covered one scenario and the other covered all of them.
The shape of the next one
You can’t practise for the exact incident that just happened. You can practise for the shape.
A key system going dark. A key person suddenly unavailable. A third-party dependency vanishing without warning. The specific details change every time, but the first thirty minutes look the same. Game days and chaos engineering are built on this: not simulating last month’s incident, but building muscle memory for a class of situation. Who declares the incident? Who does triage? How does the team communicate internally while managing external communication at the same time? You want those to have boring, automatic answers when things go wrong.
One of the most useful things a game day surfaces isn’t a technical gap but a human one: the moment nobody is sure who owns the decision, or when two people are telling different stories to different audiences.
The same logic applies to what you need to write down: what you remember from the event is useful, but only if it helps figure out what someone with no context needs to do in the first half hour next time.
The next impossible thing will be unrecognisable in every detail that matters in the moment, but it will be identical in every detail that matters for initial response. A good runbook for “critical dependency suddenly unavailable” is worth more than a perfect write-up on the last specific failure.
What you can actually control
Incident rate is partly outside your control. Littlewood tells you that. Time to detect, time to respond, time to communicate, time to resolve: those you can move. Project Houston was measured by one thing above all else: how quickly impacted customers understood what was happening and what to expect. That’s a target you can set and hit. “Never have a power surge” is not.
Also, blame becomes irrational. If extraordinary events are draws from a distribution, attributing them to individual failure is just absurd. Something was always going to go wrong this month. The question was which thing, not whether. Good post-mortems focus on system response rather than individual guilt. This matters more than it sounds: blame culture makes people hide problems, avoid owning incidents, and optimise for not being caught rather than for fast recovery. It makes your first hour worse, every single time.
The bit-flip at S3 was not preventable in any meaningful sense. What was preventable, and what we fixed eventually, was the response that turned a technical incident into a customer relations problem.
That’s what a good post-mortem finds, because it’s asking the right question. Not: how does this never happen again. But: when something like this happens, and it will, next month, in some form, how does our response look?
The goal isn’t fewer incidents.
The goal is a better first half hour.