If you've ever watched a feature flag dashboard during a rollout, you've played a mental game of statistics. Error rate went from 0.3% to 0.6%. Is that a regression, or is it noise? What about latency creeping up by 12 milliseconds? What about the checkout success rate dropping by half a point? How long do you wait before you're sure?
We wrote the first version of DeployRamp's auto-rollback logic the way most teams write it: with hard thresholds. "If error rate goes up by more than 50%, pause the rollout." It worked about as well as hard thresholds usually work, which is to say it kept firing false alarms on low-traffic flags and missing real regressions on high-traffic ones. This post is the story of what we replaced it with.
Why thresholds don't work
The problem with a threshold like "error rate goes up by more than 50%" is that it pretends 50% means the same thing in every context. It doesn't.
Imagine two flags rolled out at the same moment. One is live on a path that sees a hundred requests per minute. The other is on a path that sees a hundred thousand. Both show a 50% spike in error rate during the first five minutes after rollout.
On the high-traffic path, that 50% jump represents a statistically massive shift. The odds of seeing a swing that big from noise alone are vanishingly small. It's a real regression, and you should pause.
On the low-traffic path, a 50% jump might mean you went from two errors to three. That's well within what you'd expect from random chance. Pausing here would be a false alarm, and doing it repeatedly is how you train engineers to ignore the tool.
A threshold treats these two situations identically. It's blind to traffic volume, blind to baseline variance, and blind to how confident you should actually be. Every team that tries a pure threshold eventually ends up layering hacks on top of it — "don't fire if traffic is below X," "wait for Y samples before evaluating" — which is just a slow rediscovery of statistics.
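The volume-blindness is easy to see with a quick significance calculation. This is a minimal sketch (not DeployRamp's code), using a one-sample z-score for proportions and the intro's 0.3% baseline with a hypothetical 50% relative jump over five minutes of traffic:

```typescript
// Z-score of an observed error rate against a known baseline rate.
// The further from zero, the less plausible "it's just noise" becomes.
function proportionZ(baseline: number, observed: number, n: number): number {
  const standardError = Math.sqrt((baseline * (1 - baseline)) / n);
  return (observed - baseline) / standardError;
}

// Same 50% relative jump (0.3% -> 0.45%), five minutes of traffic:
const lowTraffic = proportionZ(0.003, 0.0045, 500);      // 100 req/min
const highTraffic = proportionZ(0.003, 0.0045, 500_000); // 100k req/min

console.log(lowTraffic.toFixed(1));  // ~0.6: indistinguishable from noise
console.log(highTraffic.toFixed(1)); // ~19.4: overwhelming evidence
```

The identical relative change yields a z-score of roughly 0.6 in one case and roughly 19 in the other, which is exactly the distinction a fixed percentage threshold throws away.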
The sequential test
What we wanted was a decision rule that:
- Fires fast on real regressions, even if the absolute numbers are small.
- Stays quiet on small blips, even if the percentage change is big.
- Adapts automatically to the traffic volume and baseline variance of each flag.
- Doesn't require the engineer to tune any knobs.
The technique that ticks all four boxes is a sequential probability ratio test (SPRT). It's a classic result from Wald in the 1940s, originally developed for wartime quality control, and it's exactly the right shape for a rollout: you're watching data arrive one event at a time, and you want to stop as soon as you're confident — but no sooner.
The short version: for each metric (error rate, latency, success rate), we maintain two competing hypotheses.
- H₀: The flag's "on" path has the same behavior as the "off" path (no regression).
- H₁: The flag's "on" path is worse by at least some meaningful amount (a regression worth rolling back for).
Each new request contributes a log-likelihood ratio toward one hypothesis or the other. The test adds those up. If the cumulative ratio crosses an upper bound, we pause and flip back. If it crosses a lower bound, we promote the flag to the next traffic slice. If it's in between, we keep collecting data.
In code, the inner loop looks approximately like this:
```typescript
function updateSprt(state: SprtState, event: Event) {
  const llr = logLikelihoodRatio(
    event,
    state.baselineRate,
    state.effectSize,
  );
  state.cumulative += llr;
  if (state.cumulative >= state.upperBound) {
    return { decision: "rollback" };
  }
  if (state.cumulative <= state.lowerBound) {
    return { decision: "promote" };
  }
  return { decision: "continue" };
}
```

The upper and lower bounds are set from the false-positive rate (α) and false-negative rate (β) you're willing to tolerate. We currently run with α = 0.01 (one false rollback per hundred healthy rollouts) and β = 0.05 (miss one real regression per twenty).
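The post leaves `logLikelihoodRatio` and the bound computation abstract, so here is one plausible shape for them, assuming the metric is a Bernoulli error/no-error event (this is a sketch, not DeployRamp's actual helpers; it folds `baselineRate` and `effectSize` into explicit H₀ and H₁ rates):

```typescript
// Per-event log-likelihood ratio for a Bernoulli metric.
// p0 is the baseline ("off" path) error rate under H0;
// p1 = p0 + effectSize is the regressed rate under H1.
function bernoulliLlr(isError: boolean, p0: number, p1: number): number {
  return isError
    ? Math.log(p1 / p0)               // an error is evidence for H1
    : Math.log((1 - p1) / (1 - p0));  // a success is evidence for H0
}

// Wald's classic stopping bounds, derived purely from the tolerated
// false-positive rate (alpha) and false-negative rate (beta).
function waldBounds(alpha: number, beta: number) {
  return {
    upperBound: Math.log((1 - beta) / alpha), // cross this -> rollback
    lowerBound: Math.log(beta / (1 - alpha)), // cross this -> promote
  };
}

const { upperBound, lowerBound } = waldBounds(0.01, 0.05);
// upperBound ~ 4.55, lowerBound ~ -2.99
```

Note that the bounds depend only on α and β, not on traffic volume: volume enters through how quickly the cumulative ratio moves, which is what makes the test self-adapting.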
What this looks like in practice
The best way to see why this matters is a concrete example. Here's a rollout we caught earlier this year.
A customer deployed a change that affected about 0.4% of checkout attempts. The absolute numbers were tiny — fewer than fifty extra failed checkouts per minute. The shift in the global error rate was under one percentage point. A traditional threshold system, set loosely enough not to fire on noise, would have missed it entirely.
The SPRT caught it in four minutes and twenty seconds. The bad path had a consistent, reproducible failure signature, and the sequential test didn't need the effect to be huge — it only needed the effect to be consistent. Once the log-likelihood ratio crossed the upper bound, we paused the rollout at 18% traffic, flipped the flag back, and the customer woke up to a Slack message explaining what had happened and pointing at the commit.
The money graph for me is this: on that same rollout, a fixed "spike more than 2x baseline" threshold wouldn't have fired at all. The regression wasn't a spike. It was a slow, consistent bleed, which is exactly the class of bug that human-set thresholds tend to miss.
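A toy version of that dynamic is easy to reproduce. The numbers below are made up for illustration (not the customer's data): a steady bleed from a 0.5% to a 0.9% error rate at 1,000 requests per minute — consistent, but always below a "2x baseline" trip level:

```typescript
const p0 = 0.005; // baseline error rate: 5 errors per 1,000 requests
const p1 = 0.009; // regressed rate: 9 per 1,000 -- never exceeds 2x of p0
const upperBound = Math.log(0.95 / 0.01); // Wald bound, alpha=0.01, beta=0.05

// Each minute contributes 9 error events and 991 success events.
function minutesToDetect(): number {
  const llrPerMinute =
    9 * Math.log(p1 / p0) + 991 * Math.log((1 - p1) / (1 - p0));
  let cumulative = 0;
  for (let minute = 1; minute <= 60; minute++) {
    cumulative += llrPerMinute;
    if (cumulative >= upperBound) return minute;
  }
  return -1; // never detected within the hour
}

console.log(minutesToDetect()); // 4: the steady bleed crosses the bound
```

With these numbers the cumulative ratio crosses the rollback bound in the fourth minute, while the per-minute rate of 0.9% never touches the 1.0% that a "2x baseline" threshold would demand.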
Three things the test does not do
I want to be careful not to oversell this. The SPRT is a decision rule. It is not magic. Specifically:
- It does not work on metrics with no baseline. If a flag is protecting a brand-new code path and there's no "off" counterfactual, there's nothing to compare against. We fall back to a simpler absolute-error-rate watchdog for those cases.
- It does not distinguish the flag's effect from non-stationary baselines. If your error rate is climbing because of an unrelated incident elsewhere in your system, the test will attribute the shift to the flag and pause the rollout. That's usually what you want, but it's worth knowing.
- It does not replace observability. The test tells you that something is wrong. It does not tell you what. We link every auto-rollback to the traces and logs that triggered it, but the engineer still has to read them.
What we learned
The biggest surprise, looking back, is how much of the value came from taking the thresholds away from the engineer. Every time we gave someone a knob, they either forgot to set it or set it wrong — usually too loose on new flags (to avoid false alarms) and too tight on established ones (because they got paranoid after one incident). The SPRT works because it doesn't ask anyone what "too much" means. It asks what a regression would need to look like to be statistically distinguishable from noise, and then it waits for exactly enough data to make that call.
That's a small shift, but it's the difference between a tool that people have to babysit and a tool that just works. The latter is the only kind worth building.