
98.7% Recall, Zero False Positives — and Why We Don't Trust Our Own Benchmark

Our scanner catches 98.7% of the vulnerabilities in our benchmark with zero false positives. That sounds great until you remember who wrote the benchmark. So here's the uncomfortable part — and the fix.

XploitScan Team · 7 min read

As of this week, XploitScan's public benchmark sits at 98.7% recall and 100% precision — 148 true positives, zero false positives, across a corpus of 218 labeled code fixtures. On the same corpus, Semgrep's community rulesets catch 20.7% and Bearer catches 43.3%.
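For anyone checking the arithmetic, here is a minimal sketch of how those two headline numbers fall out of raw counts. The true-positive and false-positive counts come from the figures above; the false-negative count of 2 is an assumption, inferred from 148 hits at 98.7% recall (148 out of 150) rather than taken from the published results.

```python
def precision(tp: int, fp: int) -> float:
    """Share of reported findings that are real vulnerabilities."""
    return tp / (tp + fp) if (tp + fp) else 0.0


def recall(tp: int, fn: int) -> float:
    """Share of labeled vulnerabilities that the scanner reported."""
    return tp / (tp + fn) if (tp + fn) else 0.0


TP = 148  # true positives, as stated in the post
FP = 0    # false positives, as stated in the post
FN = 2    # ASSUMPTION: inferred from 98.7% recall, not a published figure

print(f"precision = {precision(TP, FP):.1%}")  # precision = 100.0%
print(f"recall    = {recall(TP, FN):.1%}")     # recall    = 98.7%
```

If each fixture carries exactly one label (also an assumption), that leaves 68 of the 218 fixtures as safe counter-examples, which is where a false positive would show up if there were one.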

Those are real numbers, rerun on every commit, with the corpus, the runner scripts, and the competitors' configs all in the public repo. You can reproduce them yourself in about five minutes. We're proud of them.

And you should be a little suspicious of them. We are.

The problem with acing your own exam

A detection benchmark is only as honest as its corpus. Ours is hand-curated: when we write a rule, we write fixtures — a vulnerable example the rule should catch, and a safe counter-example it must not flag. That's genuinely good engineering. Every false-positive fix ships with a clean fixture that locks the fix in, which is how we hold precision at 100% while adding rules.

But it has an obvious failure mode: the people who wrote the rules wrote the test. A fixture written by the rule author tends to look like the pattern the rule already matches. Score 98.7% on that and you've proven the rule catches the thing it was built to catch — not that it catches the same vulnerability when a stranger writes it three ways you didn't think of.
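To make that failure mode concrete, here is a toy sketch. The regex rule and all three snippets are hypothetical, written for this illustration only; they are not XploitScan's rules or fixtures. The rule passes its author's fixture pair cleanly and still misses the same command injection written with a template literal.

```python
import re

# HYPOTHETICAL toy rule: flag exec() whose argument starts with a quoted
# string that is then concatenated with something else.
TOY_RULE = re.compile(r"\bexec\(\s*['\"][^'\"]*['\"]\s*\+")


def flags(source: str) -> bool:
    """Return True if the toy rule reports a finding in the snippet."""
    return TOY_RULE.search(source) is not None


# The fixture pair a rule author would naturally write.
vulnerable_fixture = "exec('ls ' + userInput)"
safe_fixture = "execFile('ls', [userInput])"

# The same bug, written by someone who prefers template literals.
stranger_variant = "exec(`ls ${userInput}`)"

print(flags(vulnerable_fixture))  # True: the rule catches what it was built for
print(flags(safe_fixture))        # False: no false positive
print(flags(stranger_variant))    # False: a real vulnerability goes unreported
```

Scored only on the first two snippets, this rule gets 100% recall and 100% precision. The third snippet is the kind of case a self-authored corpus tends not to contain.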

This is the fairest criticism of any security scanner's self-reported numbers, and we'd rather raise it ourselves than have a skeptical buyer raise it for us.

Two things that keep the number honest

It's a floor, not a trophy. A near-perfect score on a small corpus is not a claim that we catch everything in production code. It's the line below which detection quality is not allowed to regress without us noticing: every commit reruns the suite, and a per-rule recall drop fails CI. The corpus grows with almost every release, so the test gets harder over time, not easier. You can watch the trend on the benchmark page.
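A per-rule recall gate of that kind can be sketched in a few lines. This is an illustration under assumptions, not the actual runner in the repo: the function name, the rule IDs, and the choice to treat a missing rule as recall 0 are all made up for the example.

```python
def recall_regressions(
    baseline: dict[str, float],
    current: dict[str, float],
    tolerance: float = 0.0,
) -> list[str]:
    """List every rule whose recall fell below its recorded baseline."""
    failures = []
    for rule, old in sorted(baseline.items()):
        # ASSUMPTION: a rule missing from the current run counts as recall 0.
        new = current.get(rule, 0.0)
        if new < old - tolerance:
            failures.append(f"{rule}: recall {old:.3f} -> {new:.3f}")
    return failures


# Hypothetical rule IDs and scores, for illustration only.
baseline = {"sql-injection": 1.0, "command-injection": 0.95}
current = {"sql-injection": 1.0, "command-injection": 0.90}

problems = recall_regressions(baseline, current)
for line in problems:
    print(line)  # command-injection: recall 0.950 -> 0.900

# A CI wrapper would exit with a non-zero status when `problems` is non-empty.
```

Comparing per rule rather than in aggregate matters here: a new rule with perfect recall can otherwise mask a drop in an older one.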

The numbers exclude our AI filter. XploitScan has an optional AI false-positive reviewer, but the benchmark scores the deterministic scanner with nothing on top — so the precision number is reproducible by anyone with no API key and no model-version variance. Turning the filter on can only remove borderline findings, never add them. The published precision is the worst case, not the best.

The real fix: code we've never seen

The only way to answer “but does it work on code your rule authors didn't write?” is to test against exactly that. So we're building a held-out corpus from public, intentionally vulnerable projects that no one on our side authored: OWASP NodeGoat, Juice Shop, the Damn Vulnerable NodeJS Application, and a Python equivalent.

These are real files with real, documented vulnerabilities — SQL injection through a raw Sequelize query, command injection through exec, XXE through libxmljs with external entities enabled, insecure deserialization, prototype pollution, JWT alg: none acceptance, hardcoded crypto keys — written by other people, years before our rules existed. We strip the hint comments these teaching projects include (so the scanner can't cheat off a // vulnerable marker) and score against the bare code.
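The hint-stripping step might look roughly like the sketch below. It is deliberately naive and rests on assumptions: the list of marker words is a guess (the only marker named above is // vulnerable), and the regex treats any // or # as the start of a comment, so it would mangle a string literal such as a URL that happens to contain a marker word. A production version would use a real tokenizer per language.

```python
import re

# ASSUMPTION: the marker words to look for; only "vulnerable" is from the post.
HINT_WORDS = ("vulnerable", "vulnerability", "insecure")

# Naive: treats the first // or # on a line as the start of a comment.
LINE_COMMENT = re.compile(r"(//|#).*$")


def strip_hints(source: str) -> str:
    """Remove line comments that contain a hint word, keeping line count intact."""
    out = []
    for line in source.splitlines():
        match = LINE_COMMENT.search(line)
        if match and any(word in match.group(0).lower() for word in HINT_WORDS):
            # Cut the comment but keep the (possibly empty) line, so line
            # numbers in the vulnerability labels still point at the same code.
            line = line[: match.start()].rstrip()
        out.append(line)
    return "\n".join(out)


sample = "db.query(sql); // vulnerable: raw SQL\nconst x = 1; // counter"
print(strip_hints(sample))
```

Ordinary comments are left alone on purpose. Only comments that name the bug are removed, so the scanner sees the code roughly as a reviewer without the answer key would.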

We expect our recall to be lower on that set than on our own fixtures. That's the point. A held-out number that's a little worse and a lot more believable beats a self-graded number that's near-perfect and easy to dismiss. When it lands, it'll be on the same public benchmark page, scored the same way, with the sources cited.

Why we're telling you this

Most security tools show you a precision number and dare you to disprove it. We'd rather hand you the corpus, the runner, the competitors' configs, and the most honest critique of our own methodology, and let you check. The whole pitch of a scanner is that you can trust what it tells you. That trust has to start with how we talk about our own results.

Want to see where the numbers stand right now? The benchmark page is live and rerun on every commit. Or just point the scanner at your own code — npx xploitscan scan ., no signup, runs locally.
