Fix Forward Was Right for a World That No Longer Exists
Somewhere in the last decade, "fix forward" became a badge of engineering maturity. You can hear it in incident reviews: the seasoned engineer who says, a little wearily, that rolling back is a junior instinct — that real teams understand the system well enough to push a corrective change rather than retreat to the last known-good state. Rollback is framed as panic. Fix-forward is framed as composure. The whole posture carries a moral weight that has very little to do with what's actually safest for users during an incident.
I want to argue that this is a cached belief. It was correct once, for a specific and now-eroding set of reasons, and most teams holding it have never re-derived it against how their deployment process actually works today. The fix-forward doctrine is the right answer to a question — "what do we do when reverting is slow, coarse, and dangerous?" — that instant feature flag rollback has quietly stopped asking.
Where the doctrine came from
The case for fixing forward was never about courage. It was about the mechanics of redeployment-based rollback, and those mechanics were genuinely bad.
In a world where the only way to undo a change is to revert a commit and re-run the pipeline, rollback is a coarse instrument. The bad change is almost never alone on the trunk. By the time you've detected a regression, three other people have merged. Reverting to last-known-good means reverting their work too, or untangling a revert that conflicts with everything that landed since. Then you wait ten to twenty minutes for CI to build and deploy the reverted artifact — the irreducible floor of any redeployment — while the error budget drains the entire time. And at the end of all that, you've often re-introduced the old bug that the bad change was trying to fix in the first place.
Against that, fix-forward genuinely was the calmer, frequently safer move. A small targeted patch that addresses the actual fault, shipped through the same pipeline, doesn't drag anyone else's work backward and doesn't resurrect old problems. The doctrine encoded a real cost structure: when rollback is expensive, coarse-grained, and itself risky, you should think hard before reaching for it, and often you shouldn't.
Every part of that reasoning depends on rollback being a redeployment. Change that premise and the whole conclusion has to be recomputed.
What changed underneath the belief
Feature-flag rollback is not a smaller version of redeployment rollback. It's a different operation with a different cost structure, and the differences attack the fix-forward argument at every point it rests on.
It's surgical, not coarse. Flipping a flag reverts exactly one change — the one behind the flag — and nothing else. The three commits that landed after it are untouched, because they were never coupled to it in the first place. The "rollback drags good work backward" problem simply doesn't occur. You are not retreating to a previous state of the world; you are turning off one specific behavior while everything else stays live.
It's instant, not pipeline-bound. There's no build, no test run, no artifact to propagate. A flag flip is a config change that reaches running instances in seconds. The ten-to-twenty-minute floor that made every rollback an error-budget catastrophe is gone. Instant deployment rollback isn't a tagline here; it's a literal description of the latency.
It doesn't resurrect the old bug. Because the change shipped behind a flag, the "off" path is the same code that was running safely before — not an older release with its own known defects. You revert to a state you were already comfortable with five minutes ago, not to history.
And critically, it can be automatic. Automated deployment rollback stops being a heroic human decision made under pressure and becomes a policy: automatic rollback on error rate spike in production, or on a latency regression against the control cohort, evaluated by the system continuously while the change ramps. The thing fix-forward advocates were right to distrust — a panicked human yanking a lever and making things worse — isn't even in the loop anymore.
Rollback-first is the mature default now
Once rollback is surgical, instant, non-regressive, and automatic, the entire risk calculation inverts. The question during an incident is no longer "is reverting worth the cost and danger of reverting?" The cost is near zero and the danger is gone. The question becomes the much simpler "do we want this specific behavior on right now, or off?"
Phrased that way, rollback-first is obviously correct, and it's correct precisely because it's the conservative move. Turn the suspect change off, restore the known-good behavior in seconds, stop the error-budget bleed — and then, with the incident no longer active and nobody watching a graph climb, diagnose the fault and fix it properly. Fix-forward under pressure asks engineers to design and ship a correct patch while the system is actively failing and adrenaline is high, which is the exact condition under which people ship the second bug of the night. Rollback-first moves the fix off the critical path entirely. You still fix forward — you just do it calmly, the next morning, against a system that isn't on fire.
This reframes mean time to recovery in a way that the fix-forward doctrine actively works against. MTTR is recovery time, not root-cause time. Conflating the two is the core error: fix-forward optimizes for resolving the underlying defect, which can legitimately take hours, while users suffer the whole time. Rollback-first optimizes for restoring service, which now takes seconds, and decouples it from the slower work of understanding what went wrong. Those are two different clocks, and only one of them is burning your error budget.
The remaining objection is that some changes can't be cleanly flipped off — a schema already mutated, a queue already drained, a migration past its point of no return. That's real, and it's exactly the category where you should have been more careful about coupling state to the rollout in the first place. But it's a minority of changes, and "this specific change is genuinely irreversible" is a far narrower and more honest claim than "fixing forward is the mature posture." The narrow claim earns its caution. The broad doctrine just inherited a worldview and never checked whether it still held.
The belief is downstream of the tooling
The deeper point is that engineering doctrines are usually rationalizations of the tools teams happen to have. Fix-forward felt like wisdom because, given redeployment-based rollback, it frequently was the better call — and a generation of engineers internalized the conclusion without keeping the premise attached. Hand those same engineers an instant, surgical, automated revert and most of them will still reach for the corrective patch, because the doctrine has detached from the conditions that justified it and hardened into identity.
This is the bet DeployRamp is built on: that the right move during a regression should be the cheap, boring, automatic one. When a risky change ships behind a flag the platform added, the system watches the flagged cohort against the baseline and pulls the flag the moment the comparison turns — no pipeline, no panic, no human deciding under pressure whether reverting is worth it. The corrective change still gets written. It just gets written tomorrow, by someone who slept, against a system that recovered seconds after it started to fail. Fix forward was a reasonable answer to a world where rollback hurt. That world is mostly gone, and the doctrine should go with it.