Killing stale feature flags automatically

March 24, 2026 · 5 min read

feature flags · tech debt · automation

If you want to know how old a codebase is, count the feature flags that have been at 100% for more than a year. I have yet to meet a team whose answer was anything but "too many, and we're not going to touch them." The cleanup work is never urgent, the risk of touching working code is always non-zero, and the reward is invisible. It's the perfect shape for a task that never gets done.

We shipped an automated cleanup agent a few months ago. I was cautious about it — there's a difference between recommending a change and actually writing the code that deletes a branch. But the response from customers has been the warmest of anything we've launched this year, and I want to walk through why it works and what it took to make it safe enough to trust.

The decision: is a flag actually dead?

Before the agent can delete anything, it has to answer one question with high confidence: is this flag actually finished? A flag at 100% for two weeks is not finished — maybe the team is still validating it, maybe they're planning to ramp it back down. A flag at 100% for twelve months almost certainly is.

We use four signals, and a flag has to clear all four before the agent will touch it:

Time at 100%. Default minimum is 90 days. Configurable per repo, and longer for flags tagged as "kill-switch-only."
Evaluation consistency. Every evaluation in the last 30 days has returned the "on" branch. No exceptions. If even one "off" evaluation shows up — maybe from a cron job, maybe from a specific user segment — the flag is presumed still load-bearing.
No recent edits. Nobody has opened a PR that touches the flag's code boundary in the last 30 days. If the code around the flag is actively changing, the cleanup PR will conflict with someone's in-flight work, so we hold.
No kill-switch tag. Flags that the team has explicitly marked as production kill-switches never get cleaned up, no matter how long they've been quiet. A kill-switch that's never been used is doing its job.

A flag that clears all four is what we call "terminal." The agent files a cleanup PR.
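To make the four-signal gate concrete, here is a minimal sketch of how such a check could be written. It is an illustration, not the agent's actual code: the `FlagSnapshot` and `RepoConfig` shapes, the field names, and the `failedSignals` and `isTerminal` functions are all hypothetical. The 90-day default and the 30-day windows are the only values taken from the description above.

```typescript
// Hypothetical snapshot of what is known about one flag.
// None of these field names come from the product itself.
interface FlagSnapshot {
  name: string;
  daysAtFullRollout: number;              // consecutive days at 100%
  offEvaluationsLast30Days: number;       // evaluations that returned "off"
  daysSinceLastBoundaryPr: number | null; // null = no PR has touched the boundary
  isKillSwitch: boolean;
}

interface RepoConfig {
  minDaysAtFullRollout: number; // 90 by default, configurable per repo
}

const QUIET_WINDOW_DAYS = 30;

// Returns the names of the signals the flag fails; an empty list means "terminal".
function failedSignals(
  flag: FlagSnapshot,
  config: RepoConfig = { minDaysAtFullRollout: 90 }
): string[] {
  const failures: string[] = [];
  if (flag.daysAtFullRollout < config.minDaysAtFullRollout) {
    failures.push("time-at-100%");
  }
  // Assumption: zero evaluations in the window also passes this signal.
  if (flag.offEvaluationsLast30Days > 0) {
    failures.push("evaluation-consistency");
  }
  if (
    flag.daysSinceLastBoundaryPr !== null &&
    flag.daysSinceLastBoundaryPr <= QUIET_WINDOW_DAYS
  ) {
    failures.push("recent-edits");
  }
  if (flag.isKillSwitch) {
    failures.push("kill-switch");
  }
  return failures;
}

function isTerminal(flag: FlagSnapshot, config?: RepoConfig): boolean {
  return failedSignals(flag, config).length === 0;
}
```

Returning the list of failed signals rather than a bare boolean is a design choice for the sketch: it makes it easy to show a team why a flag was skipped.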

The cleanup PR

Here's what a cleanup PR looks like when the agent opens it:

DeployRamp: remove terminal flag checkout_v2_enabled

This flag has been at 100% rollout for 127 days with no off-path evaluations. The change removes the flag boundary, keeps the "on" code path as the default, and deletes the fallback branch.

Files changed: 6 · Lines removed: 184 · Tests updated: 3

Merging this PR is reversible — if something breaks, git revert restores the flag and the fallback path in a single commit.

The diff itself is what you'd write by hand if you sat down to clean up the flag. The agent unwraps the if (flags.isEnabled(...)) block, inlines the "on" path, deletes the "off" path, removes the flag declaration, and drops the flag's import if nothing else in the file uses it.
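As an illustration of that transformation, here is a before-and-after pair for a made-up checkout function. Everything except the `flags.isEnabled(...)` call shape and the flag name is an assumption: the `FlagClient` and `Cart` types and the pricing logic are invented for the example.

```typescript
// Hypothetical flag client; only the isEnabled(...) call shape appears in the post.
interface FlagClient {
  isEnabled(name: string): boolean;
}

interface Cart {
  subtotalCents: number;
  shippingCents: number;
}

// Before cleanup: both paths live inside the flag boundary.
function checkoutTotalBefore(flags: FlagClient, cart: Cart): number {
  if (flags.isEnabled("checkout_v2_enabled")) {
    // "on" path: the new behavior includes shipping in the total
    return cart.subtotalCents + cart.shippingCents;
  } else {
    // "off" path: the fallback the cleanup deletes
    return cart.subtotalCents;
  }
}

// After cleanup: the "on" path is inlined and unconditional, the "off" path is
// gone, and the function no longer needs the flag client at all.
function checkoutTotalAfter(cart: Cart): number {
  return cart.subtotalCents + cart.shippingCents;
}
```

For any caller that was already seeing the flag as on, the two versions return the same value, which is the property that makes the change mechanical.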

There are two things we've learned to be paranoid about:

Don't delete tests

Early versions of the agent cheerfully deleted test cases that only exercised the "off" path. This was technically correct — if the off path no longer exists, a test for it is dead code. It was also a very bad idea. Those tests sometimes contained assertions about the old behavior that engineers wanted to keep as documentation, and deleting them surprised people in a way that eroded trust fast.

The new rule: tests that reference the flag are preserved, not deleted. If a test case calls setFlag("checkout_v2_enabled", false), the agent leaves it alone and adds a comment saying // legacy flag removed; behavior now unconditional. A human can decide whether to delete the test in a follow-up. The cleanup PR stays mechanical.
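A rough sketch of that annotation step, treating the test file as plain text. This is a simplification under stated assumptions: the function name `annotateFlagTests` is hypothetical, the comment is placed on its own line directly above the call (the rule above does not say where it goes), and only double-quoted flag names are matched. A real tool would more plausibly work on a syntax tree.

```typescript
const LEGACY_COMMENT = "// legacy flag removed; behavior now unconditional";

// Inserts the legacy comment above every line that calls setFlag("<flagName>", ...).
// No line is deleted or rewritten, so the test itself is preserved as-is.
function annotateFlagTests(source: string, flagName: string): string {
  const needle = 'setFlag("' + flagName + '"';
  const out: string[] = [];
  for (const line of source.split("\n")) {
    if (line.indexOf(needle) !== -1) {
      const previous = out.length > 0 ? out[out.length - 1] : undefined;
      const alreadyAnnotated =
        previous !== undefined && previous.trim() === LEGACY_COMMENT;
      if (!alreadyAnnotated) {
        // Reuse the call's indentation so the comment lines up with it.
        const match = line.match(/^\s*/);
        const indent = match ? match[0] : "";
        out.push(indent + LEGACY_COMMENT);
      }
    }
    out.push(line);
  }
  return out.join("\n");
}
```

The already-annotated check makes the step idempotent, so running the sketch twice over the same file does not stack comments.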

Don't chase the import graph

The first version of the agent would also try to delete the flag's fallback helper functions if nothing else referenced them after the cleanup. This caused exactly one horrible week when a helper function that looked unused turned out to be called reflectively from a dynamic dispatch table that the static analyzer couldn't see.

The new rule: the agent only deletes code that is syntactically inside the flag's boundary. Helpers, constants, and imports that used to be called from inside the flag are left in place. If they really are dead, the next PR that runs a linter will catch them. We'd rather leave a little dead code behind than delete something that was quietly load-bearing.
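One way to picture the boundary rule is as a containment check: a piece of code is eligible for deletion only if it sits entirely inside the flag's boundary. The sketch below uses inclusive line ranges as a stand-in for whatever syntactic representation the agent really uses. The `LineSpan`, `Candidate`, and `CleanupPlan` types and the `planDeletions` function are hypothetical.

```typescript
// Inclusive line range within a single file.
interface LineSpan {
  startLine: number;
  endLine: number;
}

interface Candidate {
  label: string;
  span: LineSpan;
}

interface CleanupPlan {
  remove: string[];
  keep: string[];
}

// True only when the candidate lies entirely within the flag boundary.
function isInsideBoundary(candidate: LineSpan, boundary: LineSpan): boolean {
  return (
    candidate.startLine >= boundary.startLine &&
    candidate.endLine <= boundary.endLine
  );
}

// Splits candidates into "remove" and "keep". Anything outside the boundary,
// or only partly inside it, is kept even if it looks unused after cleanup.
function planDeletions(boundary: LineSpan, candidates: Candidate[]): CleanupPlan {
  const plan: CleanupPlan = { remove: [], keep: [] };
  for (const candidate of candidates) {
    if (isInsideBoundary(candidate.span, boundary)) {
      plan.remove.push(candidate.label);
    } else {
      plan.keep.push(candidate.label);
    }
  }
  return plan;
}
```

The check deliberately asks nothing about whether a helper is still referenced. That is the point of the rule: reachability is the question the static analyzer got wrong, so the sketch never asks it.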

What happened when we shipped it

I was ready for this feature to be controversial. Deleting code on someone's behalf is a big thing to ask them to trust, and I had a list of conservative defaults queued up for anyone who pushed back.

The pushback never came. The most common response, by a wide margin, was "is there a way to make it run faster?" Teams wanted the agent to be more aggressive, not less. The cleanup debt was so large and so demoralizing to address manually that the existence of any credible automated option was a relief.

The second-most-common response was people sending screenshots of the diff stats. One customer ran the agent against a codebase that had accumulated feature flags for five years and ended up with a PR that removed just over eleven thousand lines of code across two hundred and seventy files. They merged it that week. Their CI got faster.

The things we still worry about

Not everything about this feature feels settled. Two things keep me up at night, or would, if I slept more.

Cross-repo flag dependencies. If a flag is defined in one service but evaluated in another (mobile apps are the usual culprit), the cleanup agent can only see the server-side usage. It can look terminal on the backend while still being evaluated live by last year's iOS build that hasn't auto-updated yet. We currently require mobile flags to be explicitly tagged before the agent will touch them, but that's a workaround, not a solution.

Flags that are decorative. Every so often we find a flag that hasn't been evaluated in ninety days because the code path it guards is genuinely never hit — not because the flag is finished, but because the feature it guards is broken and nobody noticed. Removing the flag in that case is technically correct but also uncovers a latent bug. The agent doesn't know the difference, and I'm not sure any static analysis can.

Both of these are probably solvable. Neither is solved today. If you have strong opinions, come tell us.

The underlying shift

If I had to name the thing that makes automatic cleanup actually work, it's not any particular heuristic. It's that cleanup has to be boring enough to be trustworthy.

Every time we made the agent smarter, we made it scarier. Every time we made it more conservative and more mechanical — fewer inferences, shorter diffs, less "cleverness" — we got more adoption. The customers aren't looking for a clever cleanup agent. They're looking for one they don't have to think about. Boring automation is the feature.

That's been the main lesson of building this whole product, honestly. Feature flags got hard because we kept asking humans to do boring jobs at high stakes. The win isn't a better UI for doing those jobs. The win is deleting the jobs.