The Failures That Survive Your Canary Are the Silent Ones
Most teams that have invested in progressive delivery tell the same story about how it pays off. A bad change ships behind a flag, ramps to five percent, the error rate spikes, the rollout aborts, the incident never happens. It's a good story, and it's true often enough to be worth the investment. But it quietly trains everyone to believe a comforting falsehood: that the dangerous changes are the ones that throw exceptions. They aren't. The changes that get all the way through a canary and into a hundred percent of traffic are, almost by definition, the ones your error-rate monitoring couldn't see.
This is the structural blind spot in how most engineering teams set rollout abort criteria. You instrument the obvious things (HTTP 500s, unhandled exceptions, failed health checks) because those are the failures that are easy to count and easy to alert on. Then you wire those counts into the rollout: if errors exceed some threshold during the canary phase, hold or revert. The mechanism works. It catches the change that dereferences a null, the migration that breaks a query, the deploy that crashes on boot. What it does not catch is the change that makes everything thirty percent slower, the feature that silently returns empty results for a subset of users, the code path that degrades gracefully into wrongness. Those don't increment an error counter. They sail through.
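To make the blind spot concrete, here is a minimal sketch of an error-rate gate of the kind described above. The `Request` record, the `error_gate_passes` function, and the 1% limit are all hypothetical names and values chosen for illustration, not any particular tool's API.

```python
from dataclasses import dataclass

@dataclass
class Request:
    status: int         # HTTP status code returned to the client
    latency_ms: float   # time taken to serve the request
    result_count: int   # number of items in the response body

def error_rate(requests):
    """Fraction of requests that returned a 5xx status."""
    if not requests:
        return 0.0
    return sum(1 for r in requests if r.status >= 500) / len(requests)

def error_gate_passes(requests, max_error_rate=0.01):
    """Hypothetical canary gate: pass unless the 5xx rate exceeds the limit."""
    return error_rate(requests) <= max_error_rate

# A loud failure: 5% of canary requests crash.
loud = [Request(500, 20.0, 0)] * 5 + [Request(200, 20.0, 10)] * 95

# A silent failure: every request "succeeds", but slowly and with empty results.
silent = [Request(200, 900.0, 0)] * 100

loud_passes = error_gate_passes(loud)      # False: the gate aborts the rollout
silent_passes = error_gate_passes(silent)  # True: the gate lets it through
```

The gate never looks at `latency_ms` or `result_count`, so the second cohort passes with a perfect score.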
Silent failures are the incidents that drag on
If you go back through your last twenty postmortems and sort them by how the failure actually manifested, a pattern tends to emerge. The loud failures — the ones that page someone within ninety seconds — are real, but they're also the ones your existing tooling already handles reasonably well. They get caught fast precisely because they're loud. The incidents that drag on, the ones with a forty-minute detection gap before anyone realizes something is wrong, are overwhelmingly silent. A p99 latency regression that pushed a few percent of requests past the client timeout. A caching change that quietly halved the hit rate. A serialization tweak that dropped a field that three downstream consumers depended on. None of these throw. All of them hurt.
The reason silent failures dominate detection time is that error rate is a binary-ish signal (a request either errored or it didn't), and binary signals are easy to count and threshold. Latency, correctness, and throughput are continuous and contextual, and continuous signals are where the genuinely hard regressions hide. Alerting on p99 latency regression during a progressive rollout is harder than alerting on error count, because there's no obvious threshold. Is a p99 of 340ms bad? It depends entirely on what it was yesterday, what it is on the control group right now, and whether the increase tracks with traffic or with your rollout percentage. A static threshold set tight enough to catch the regression fires constantly; one set loose enough to stay quiet never fires. This is exactly why teams don't do it, and exactly why the failures it would catch keep surviving the canary.
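The threshold problem can be sketched in a few lines. The sample data, the 300ms and 400ms limits, and the 1.2x ratio below are invented for illustration; `static_check` and `relative_check` are hypothetical helpers, and the percentile uses the simple nearest-rank definition.

```python
def percentile(samples, pct):
    """Nearest-rank percentile for an integer pct in 1..100."""
    if not samples:
        raise ValueError("no samples")
    ordered = sorted(samples)
    rank = (pct * len(ordered) + 99) // 100  # ceil(pct/100 * n) in integer math
    return ordered[rank - 1]

def static_check(canary, limit_ms):
    """Fire when the canary p99 exceeds a fixed limit."""
    return percentile(canary, 99) > limit_ms

def relative_check(canary, control, max_ratio):
    """Fire when the canary p99 exceeds the control p99 by more than max_ratio."""
    return percentile(canary, 99) > max_ratio * percentile(control, 99)

# The same canary, with a p99 of 340ms, observed against two different baselines.
canary = [100] * 98 + [340] * 2
control_busy = [100] * 98 + [330] * 2    # upstream is slow for everyone today
control_quiet = [100] * 98 + [200] * 2   # only the canary is slow

loose_static = static_check(canary, 400)                     # False in both situations
tight_static = static_check(canary, 300)                     # True in both situations
fires_when_busy = relative_check(canary, control_busy, 1.2)   # False: 340 <= 396
fires_when_quiet = relative_check(canary, control_quiet, 1.2) # True: 340 > 240
```

Neither static limit can tell the two situations apart, because neither one looks at the control group. The relative check does, although a raw ratio still says nothing about whether the gap is larger than sampling noise.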
Anomaly detection is comparison, not threshold
The way out is to stop thinking about absolute thresholds and start thinking about comparison. During a canary release you have something most production monitoring lacks: a built-in control group. The users still on the old code path are a live baseline running against the same traffic mix, the same time of day, the same upstream weather. The right question during a rollout is never "is p99 above 400ms?" It's "is the canary cohort's p99 statistically worse than the control cohort's, right now, on comparable traffic?" That comparison is what makes anomaly detection during a canary release tractable for an engineering team that doesn't have a dedicated ML group. You're not forecasting; you're doing a two-sample comparison between cohorts that, as long as users are assigned to them at random, differ only in the code they're running.
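One way to do that two-sample comparison without any ML machinery is a permutation test on the p99 gap. This is only a sketch: the choice of a permutation test, the function names, the round count, and the synthetic latencies are all assumptions made for illustration, and other two-sample tests would serve the same purpose.

```python
import random

def p99(samples):
    """Nearest-rank 99th percentile."""
    ordered = sorted(samples)
    rank = (99 * len(ordered) + 99) // 100  # ceil(0.99 * n) in integer math
    return ordered[rank - 1]

def permutation_p_value(canary, control, stat=p99, rounds=1000, seed=0):
    """One-sided permutation test.

    Estimates how often a random relabelling of the pooled samples produces
    a canary-minus-control gap at least as large as the one observed.
    A small value means the gap is unlikely to be noise.
    """
    rng = random.Random(seed)  # seeded so the sketch is reproducible
    observed = stat(canary) - stat(control)
    pooled = list(canary) + list(control)
    n = len(canary)
    hits = 0
    for _ in range(rounds):
        rng.shuffle(pooled)
        if stat(pooled[:n]) - stat(pooled[n:]) >= observed:
            hits += 1
    return (hits + 1) / (rounds + 1)  # add-one correction avoids reporting 0

# Synthetic latencies in ms: control sits between 100 and 119.
control = [100 + (i % 20) for i in range(500)]
# Regressed canary: same shape, but 5% of requests take 900ms.
canary_slow = [100 + (i % 20) for i in range(475)] + [900] * 25
# Healthy canary: indistinguishable from control.
canary_ok = list(control)

p_regressed = permutation_p_value(canary_slow, control)
p_healthy = permutation_p_value(canary_ok, control)
```

With this data `p_regressed` comes out far below a conventional 0.05 cutoff while `p_healthy` stays well above it. The test needs no assumption about what latency "should" be; the control cohort supplies that.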
This reframes what "rollout abort criteria" should even contain. Setting abort criteria before deployment shouldn't mean picking a magic error-rate number. It should mean declaring, per change, which signals matter and how much regression you'll tolerate on each relative to control: error rate, yes, but also p95 and p99 latency, key business counters, and the saturation metrics of whatever the change touches. Detecting silent failures during a progressive rollout is almost entirely a matter of having decided, in advance, that latency and correctness deltas are abort-worthy, not just exceptions. The teams that catch silent regressions aren't smarter; they just instrumented the comparison and gave it the authority to stop the ramp.
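A per-change declaration of this kind might look like the following sketch. The `Criterion` type, the signal names, the tolerances, and the metric values are all hypothetical; the point is only that each signal carries its own tolerated regression relative to control, and that the declaration exists before the ramp starts.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Criterion:
    signal: str                   # metric name shared by both cohorts
    max_relative_increase: float  # tolerated regression vs control, 0.10 = 10%
    higher_is_worse: bool = True  # False for metrics where a drop is the regression

def relative_regression(criterion, canary_value, control_value):
    """Regression of canary vs control as a fraction of the control value."""
    delta = canary_value - control_value
    if not criterion.higher_is_worse:
        delta = -delta
    if control_value == 0:
        # No baseline to scale against: any movement in the bad direction counts.
        return float("inf") if delta > 0 else 0.0
    return delta / abs(control_value)

def violations(criteria, canary, control):
    """Names of the signals that regressed past their declared tolerance."""
    return [
        c.signal
        for c in criteria
        if relative_regression(c, canary[c.signal], control[c.signal])
        > c.max_relative_increase
    ]

# Declared up front for one hypothetical change to a search endpoint.
criteria = [
    Criterion("error_rate", 0.50),
    Criterion("p95_latency_ms", 0.10),
    Criterion("p99_latency_ms", 0.15),
    Criterion("results_per_query", 0.05, higher_is_worse=False),
    Criterion("cpu_utilization", 0.20),
]

control = {"error_rate": 0.002, "p95_latency_ms": 120, "p99_latency_ms": 300,
           "results_per_query": 9.8, "cpu_utilization": 0.55}
canary = {"error_rate": 0.002, "p95_latency_ms": 125, "p99_latency_ms": 390,
          "results_per_query": 6.1, "cpu_utilization": 0.58}

failed = violations(criteria, canary, control)
should_abort = bool(failed)
```

In this example the error rate is identical across cohorts, yet `failed` contains `p99_latency_ms` and `results_per_query`, so the ramp would stop. A production version would pair each tolerance with a significance check rather than comparing point estimates alone.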
Why this almost never gets built by hand
Here's the uncomfortable part. Everything above is well understood. Engineers know p99 matters. They know cohort comparison beats static thresholds. The reason silent failures still dominate incidents isn't ignorance — it's that wiring per-change, cohort-aware anomaly detection into every rollout is a substantial amount of toil, and it has to be redone for every feature, every flag, every service. Nobody has the time, so the default is the cheap signal: count the 500s, ship it. The good intentions die in the gap between knowing what to monitor and having the apparatus to monitor it automatically on every change.
This is the gap we built DeployRamp to close. When the system wraps a risky change in a flag and starts ramping it, it isn't just watching for exceptions — it's holding the flagged cohort up against the unflagged baseline across error rate and latency distributions, looking for the statistically significant divergence that signals a silent regression, and pulling the flag back the moment the comparison turns. The point isn't that humans couldn't set this up. It's that they reliably won't, change after change, and the failures that survive your canary are precisely the ones that exploit that. Making the comparison automatic, and giving it the authority to abort, is how you stop letting the quiet failures through.