
How DeployRamp detects risky diffs in pull requests


The most common question we get from engineers evaluating DeployRamp is some version of this: "Are you just wrapping everything in a flag?" It's a fair question. If the tool is too aggressive, it clutters the codebase and trains people to dismiss the comments. If it's too passive, it misses the changes that actually need a safety net. Getting the line right is the hardest part of the product.

This post is a tour of how we currently draw that line.

The three-stage pipeline

When a pull request opens, DeployRamp runs the diff through three stages in sequence. Each stage is cheap relative to the one after it, so there's no reason to skip it, and each stage exists because the stage before it couldn't make a confident call on its own.

Stage 1: Structural filters

The first pass is mechanical. No LLM, no embeddings, just a parse tree and a list of rules. We look for signals that can be read straight off the diff itself:

  • New or modified route handlers. A new POST /checkout is almost always worth flagging. A rename of an internal helper is not.
  • Schema changes. Any migration that alters a NOT NULL constraint, drops a column, or changes a type gets marked automatically.
  • Auth/permission logic. Files in paths matching the configured auth/ or permissions/ roots get weighted higher.
  • Payment and billing code. Same idea. Configurable per repo.
  • External API calls. A new fetch() against an external host is a candidate because the blast radius extends past your own service.

Stage 1 can reject a PR ("nothing here is risky") or promote a file to Stage 2. It cannot accept a PR for flagging on its own, because structural signals are too blunt. Adding a route handler to a test file doesn't need a flag.
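As a sketch of what Stage 1 looks like in practice — the rule names, regexes, and paths below are illustrative stand-ins, not DeployRamp's actual implementation — you can think of it as a list of predicates over each changed file:

```typescript
// Illustrative Stage 1 structural filter. Rules and paths are hypothetical.
type ChangedFile = { path: string; addedLines: string[] };
type Verdict = "reject" | "promote";

const RULES: { name: string; matches: (f: ChangedFile) => boolean }[] = [
  // New or modified route handlers
  { name: "route-handler",
    matches: f => f.addedLines.some(l => /\b(post|put|delete)\s*\(/i.test(l)) },
  // Risky schema migrations
  { name: "schema-change",
    matches: f => /migrations\//.test(f.path) &&
      f.addedLines.some(l => /NOT NULL|DROP COLUMN|ALTER TYPE/i.test(l)) },
  // Auth/permission roots (configurable per repo)
  { name: "auth-path",
    matches: f => /(^|\/)(auth|permissions)\//.test(f.path) },
];

function stageOne(file: ChangedFile): Verdict {
  // Structural signals are too blunt to flag on their own:
  // a route handler added to a test file is not a risk.
  if (/\.(test|spec)\./.test(file.path)) return "reject";
  return RULES.some(r => r.matches(file)) ? "promote" : "reject";
}
```

The key design point survives even in this toy form: Stage 1 only ever says "reject" or "promote", never "flag".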

Stage 2: Semantic scoring

Files that survive Stage 1 go through a cheap embedding-based classifier. Each hunk is embedded, then compared against a small library of reference diffs — examples of changes that previously caused incidents, plus a set of "this was safe" counterexamples. The classifier returns a risk score between 0 and 1 for each hunk.

This stage was originally an LLM call, but we replaced it with embeddings after measuring that the two agreed 94% of the time and the embeddings ran about 40x cheaper. The LLM only sees a hunk now if the embedding score lands in an ambiguous band (roughly 0.35 to 0.65). Everything above and below gets a fast decision without another round trip.
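A minimal sketch of the scoring and routing logic, using the 0.35–0.65 band from above — the similarity-based scoring function here is an assumed stand-in, since the real classifier isn't public:

```typescript
// Sketch of Stage 2: score a hunk embedding against reference diffs and
// only escalate the ambiguous middle band to the LLM. The riskScore
// formula is illustrative, not DeployRamp's actual classifier.
type Vec = number[];

function cosine(a: Vec, b: Vec): number {
  let dot = 0, na = 0, nb = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i]; na += a[i] ** 2; nb += b[i] ** 2;
  }
  return dot / (Math.sqrt(na) * Math.sqrt(nb));
}

// Risk = how much closer the hunk sits to incident examples than to
// "this was safe" counterexamples, squashed into [0, 1].
function riskScore(hunk: Vec, incidents: Vec[], safe: Vec[]): number {
  const best = (refs: Vec[]) => Math.max(...refs.map(r => cosine(hunk, r)));
  return (best(incidents) - best(safe) + 1) / 2;
}

function route(score: number): "flag" | "pass" | "llm" {
  if (score > 0.65) return "flag"; // confident: recommend a flag
  if (score < 0.35) return "pass"; // confident: safe
  return "llm";                    // ambiguous band: escalate to Stage 3
}
```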

Stage 3: The LLM review

Ambiguous hunks and anything Stage 1 flagged as high-risk get a full LLM pass. This is where the tool earns its keep. The prompt is structured roughly like this:

// hunk is the diff hunk under review; the helper functions are
// described in the list below.
const prompt = {
  diff: hunk.text,
  // 40 lines of surrounding code on either side of the hunk
  surrounding_code: await readContext(hunk.file, hunk.lineRange, 40),
  // summaries of past incidents involving this file
  prior_incidents: await similarIncidents(repoId, hunk.file),
  // per-repo idioms, inferred once and reused across PRs
  repo_conventions: await conventionsFor(repoId),
  question: "Does this change need a feature flag? If yes, where should " +
            "the flag boundary sit, and what's the safe fallback path?"
};

Three things matter about this prompt that aren't obvious from the shape:

  1. Surrounding code, not just the diff. We pull 40 lines of context on either side of the hunk, because "is this risky?" is often a function of what the code is embedded in. A new error path in a hot loop is scarier than the same error path in a cold admin script.
  2. Prior incidents. If this file has been touched in a PR that later caused a rollback, we include a summary of what went wrong. The model gets sharper fast when you tell it "last time someone touched this file, prod went down."
  3. Repo conventions. Every repo has its own idioms — how it handles errors, where its safety checks live, what it considers a "hot path." We infer those once per repo and reuse them across PRs.
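For the first of those, a minimal sketch of what a readContext helper might do — the signature mirrors the prompt above, but this body is an illustration, not DeployRamp's code:

```typescript
// Hypothetical readContext: pull `pad` lines of surrounding code on
// either side of a hunk's line range.
import { promises as fs } from "fs";

type LineRange = { start: number; end: number }; // 1-indexed, inclusive

async function readContext(
  file: string,
  range: LineRange,
  pad: number
): Promise<string> {
  const lines = (await fs.readFile(file, "utf8")).split("\n");
  const from = Math.max(0, range.start - 1 - pad); // clamp at file start
  const to = Math.min(lines.length, range.end + pad); // clamp at file end
  return lines.slice(from, to).join("\n");
}
```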

The model returns a JSON object: needs_flag, flag_boundary, fallback_description, confidence. If confidence is below a threshold, we don't post a comment at all. We'd rather miss a few true positives than train engineers to ignore us.
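The confidence gate is simple enough to sketch directly. The field names below match the JSON shape from the post; the 0.7 threshold is an assumed value, not DeployRamp's actual setting:

```typescript
// Sketch of the confidence gate. Threshold value is an assumption.
type Verdict = {
  needs_flag: boolean;
  flag_boundary: string;
  fallback_description: string;
  confidence: number; // 0..1
};

const CONFIDENCE_THRESHOLD = 0.7; // assumed; tuned against precision

function commentFor(v: Verdict): string | null {
  // Below the threshold we stay silent: a missed true positive costs
  // less than training engineers to ignore the tool.
  if (!v.needs_flag || v.confidence < CONFIDENCE_THRESHOLD) return null;
  return `DeployRamp: flag recommended\n` +
         `Boundary: ${v.flag_boundary}\n` +
         `Fallback: ${v.fallback_description}`;
}
```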

What we got wrong in v1

The first version of the pipeline was just Stage 3. Everything went through an LLM. It worked, sort of. It also cost about fifteen cents per PR and had a p95 latency of eleven seconds, which is death for a tool that's supposed to feel like autocomplete.

More importantly, it was wrong in a specific way: it over-flagged. Any diff that touched error handling, any diff in a file named payments.ts, any diff with more than fifty lines — it all came back as "flag this." Engineers dismissed the comments, so they started dismissing the whole tool.

Adding Stage 1 and Stage 2 in front of it didn't just make the system cheaper. It made the remaining LLM calls better, because the model only saw the ambiguous cases. It never had to decide whether a typo fix in a README was risky. Its calibration got much sharper once the easy calls were taken away from it.

What the output looks like

When the pipeline finishes, DeployRamp leaves a comment on the PR that looks roughly like this:

DeployRamp: flag recommended

The change to apps/api/src/checkout/submit.ts:124 modifies the final commit step of a paid transaction. Recommended flag boundary: around the new captureWithIntent() call. Suggested fallback: the existing captureDirect() path, which remains in the codebase and is covered by tests.

One-click apply · Dismiss · Explain

The "one-click apply" button is where the coding agent takes over. We'll cover that in a separate post, because it deserves its own walkthrough.

Calibration numbers

If you're wondering whether any of this actually works, here are the current numbers on a rolling 30-day window across customer repos:

  • Precision (share of our comments that the engineer accepted): 78%
  • Recall (of PRs that later caused an incident, the share where we had already recommended a flag): 84%
  • Mean latency (PR opened → comment posted): 3.1 seconds
  • Mean cost per PR: 1.4 cents

Precision is the number we watch most closely. If engineers accept our recommendations less than about two thirds of the time, the product becomes noise. 78% feels healthy, but every new language and framework we add puts pressure on it, and we've had to re-tune the thresholds three times this quarter alone.
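The math behind those two headline numbers is worth being precise about. A sketch, with a hypothetical per-PR event shape:

```typescript
// Rolling-window precision/recall as defined in the list above.
// The Event shape is hypothetical.
type Event = {
  commented: boolean;      // did we post a recommendation?
  accepted: boolean;       // did the engineer accept it?
  causedIncident: boolean; // did the PR later cause an incident?
};

// Precision: of the comments we posted, how many were accepted.
function precision(events: Event[]): number {
  const commented = events.filter(e => e.commented);
  return commented.filter(e => e.accepted).length / commented.length;
}

// Recall: of the PRs that caused incidents, how many we had flagged.
function recall(events: Event[]): number {
  const incidents = events.filter(e => e.causedIncident);
  return incidents.filter(e => e.commented).length / incidents.length;
}
```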

What's next

The pipeline's biggest weakness right now is that it treats each PR in isolation. A change that looks safe on its own can be risky in combination with another change someone else is merging the same day. We're experimenting with a cross-PR stage that looks at the whole queue of open PRs and highlights unsafe combinations. It's early, and I'm not sure yet whether the value justifies the complexity, but the signal in our test runs is promising enough to keep digging.

If you're interested in that kind of problem — production safety at the level of a whole repo, not just a single diff — we're hiring. Come talk to us.

