Detection Benchmark

We publish the raw numbers on how well our scanner performs against a labeled corpus of vulnerable and clean code samples. The results are regenerated on every pull request and every push to main, so the figures here reflect the current state of the scanner.

Last run: Jul 16, 2026 · Corpus size: 241 fixtures (142 vulnerable, 99 clean)

Precision

100.0%

true positives / all findings

Recall

98.8%

true positives / known vulns

F1 Score

99.4%

harmonic mean of P and R

Counts

160 / 0 / 2

TP / FP / FN
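
The three headline percentages follow directly from the counts above. A minimal sketch of the arithmetic, written in Python for brevity (the function name is illustrative and this is not the code in the benchmark runner):

```python
def metrics(tp: int, fp: int, fn: int) -> dict[str, float]:
    """Precision, recall, and F1 as percentages rounded to one decimal."""
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    # F1 is the harmonic mean of precision and recall.
    f1 = (2 * precision * recall / (precision + recall)
          if (precision + recall) else 0.0)
    return {
        "precision": round(precision * 100, 1),
        "recall": round(recall * 100, 1),
        "f1": round(f1 * 100, 1),
    }


# Counts reported above: TP 160, FP 0, FN 2.
print(metrics(160, 0, 2))
# {'precision': 100.0, 'recall': 98.8, 'f1': 99.4}
```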

Trend over time

49 runs · since May 29, 2026

Precision and recall on each benchmark run. The corpus grows over time, so a flat or rising line means detection kept pace as new fixtures were added.

Recall 98.8% (+0.1 pts)

Precision 100.0%

Corpus: 218 → 241 fixtures. Deterministic scanner only (excludes the optional AI false-positive filter).

How we score

Every fixture in packages/cli/test-fixtures/ is labeled with expectedFindings (rules that must fire, with file and line range) and mustNotFire (rules that must not fire anywhere in the fixture). The runner scans each fixture and counts:

True positive (TP) — a finding whose rule, file, and line fall inside an expected entry.
False negative (FN) — an expected entry with no matching finding.
False positive (FP) — a finding for a rule explicitly listed as mustNotFire on a clean fixture.

Findings on vulnerable fixtures for unrelated rules aren't counted as FPs — they're treated as adjacent observations, neither penalized nor rewarded. Precision and recall are micro-averaged across the whole corpus. The runner lives at scripts/benchmark.js in the repository, and a GitHub Action runs it on every pull request and every push to main.
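
The counting rules above can be sketched roughly as follows. This is an illustration in Python, not the code in scripts/benchmark.js. The field names inside an expected entry (`rule`, `file`, `start_line`, `end_line`), the file names, and the line numbers are assumptions made for the example. We also assume an expected entry matched by several findings counts once, which the text above does not spell out.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Finding:
    rule: str
    file: str
    line: int


@dataclass(frozen=True)
class Expected:
    # Assumed shape of one expectedFindings entry: rule, file, line range.
    rule: str
    file: str
    start_line: int
    end_line: int


def matches(finding: Finding, expected: Expected) -> bool:
    """True when rule and file agree and the line is inside the expected range."""
    return (
        finding.rule == expected.rule
        and finding.file == expected.file
        and expected.start_line <= finding.line <= expected.end_line
    )


def score_fixture(expected, must_not_fire, findings):
    """Return (tp, fp, fn) for a single fixture."""
    # Assumption: an expected entry hit by several findings counts once,
    # so tp + fn always equals the number of expected entries.
    tp = sum(1 for e in expected if any(matches(f, e) for f in findings))
    fn = len(expected) - tp
    # Only rules listed in must_not_fire can produce an FP; findings for
    # other rules are adjacent observations and are ignored.
    fp = sum(1 for f in findings if f.rule in must_not_fire)
    return tp, fp, fn


def micro_average(per_fixture_counts):
    """Sum counts over all fixtures first, then compute precision and recall."""
    tp = sum(c[0] for c in per_fixture_counts)
    fp = sum(c[1] for c in per_fixture_counts)
    fn = sum(c[2] for c in per_fixture_counts)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall


# Hypothetical fixtures; file names and line numbers are made up.
vulnerable = score_fixture(
    expected=[Expected("VC001", "app.js", 10, 14), Expected("VC006", "app.js", 40, 42)],
    must_not_fire=set(),
    findings=[Finding("VC001", "app.js", 12), Finding("VC042", "app.js", 30)],
)
clean = score_fixture(
    expected=[],
    must_not_fire={"VC001"},
    findings=[Finding("VC001", "safe.js", 5)],
)
print(vulnerable, clean, micro_average([vulnerable, clean]))
# (1, 0, 1) (0, 1, 0) (0.5, 0.5)
```

In the vulnerable fixture, the VC042 finding matches no expected entry and is not listed in mustNotFire, so it changes none of the three counts.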

About this baseline. The corpus is deliberately small and curated — growing it is ongoing work. A perfect score on a small corpus is not a claim that the scanner catches everything; it's the floor below which we will not regress. We will keep adding harder cases, and over time the score will become a more demanding indicator.

Held-out set — code we never wrote

93.8% recall

The corpus above is ours: we wrote the rules and the fixtures, so a high score partly measures how well we test ourselves. This set is different. It contains 16 real vulnerabilities pulled from public, intentionally vulnerable projects (OWASP NodeGoat, Juice Shop, DVNA, and a lodash prototype-pollution example) that no rule author ever saw, with the projects' hint comments stripped. We expect a lower number here, and that's the point: 15 of 16 is the honest result. The set includes a real-world broken-access-control (BOLA) case that we added knowing we currently miss it. We flag the common IDOR shapes, but this authenticated-but-not-authorized variant is missed by every scanner we test, ours included, and we'd rather show that gap in the table below than hide it.

For a fair fight, the same corpus is scored against other scanners with the same class-level methodology: did the tool flag this class of vulnerability anywhere in the fixture? XploitScan 93.8% · Semgrep 50.0% · Bearer 56.3%

Vulnerability class	Source	XploitScan	Semgrep	Bearer
Command injection	appsecco/dvna — core/appHandler.js (MIT)	✓ caught	missed	✓ caught
Insecure deserialization	appsecco/dvna — core/appHandler.js (MIT)	✓ caught	✓ caught	missed
Code injection (eval)	OWASP/NodeGoat — app/routes/contributions.js (Apache-2.0)	✓ caught	✓ caught	✓ caught
IDOR / broken access control (BOLA)	appsecco/dvna — core/appHandler.js (MIT)	missed	missed	missed
Hardcoded credentials	juice-shop/juice-shop — lib/insecurity.ts (MIT)	✓ caught	✓ caught	✓ caught
Open redirect	appsecco/dvna — core/appHandler.js (MIT)	✓ caught	✓ caught	✓ caught
Path traversal	juice-shop/juice-shop — routes/fileServer.ts (MIT)	✓ caught	✓ caught	✓ caught
Prototype pollution	kimmobrunfeldt/lodash-merge-pollution-example — index.js (ISC)	✓ caught	missed	missed
SQL injection	appsecco/dvna — core/appHandler.js (MIT)	✓ caught	✓ caught	✓ caught
SQL injection	juice-shop/juice-shop — routes/login.ts (MIT)	✓ caught	✓ caught	✓ caught
SSRF	OWASP/NodeGoat — app/routes/research.js (Apache-2.0)	✓ caught	missed	missed
Server-side template injection	juice-shop/juice-shop — routes/dataErasure.ts (MIT)	✓ caught	missed	missed
Weak password hashing	juice-shop/juice-shop — lib/insecurity.ts, routes/login.ts (MIT)	✓ caught	missed	✓ caught
NoSQL injection	OWASP/NodeGoat — app/data/allocations-dao.js (Apache-2.0)	✓ caught	missed	missed
Cross-site scripting (DOM)	juice-shop/juice-shop — frontend search-result.component.ts (MIT)	✓ caught	missed	missed
XXE	appsecco/dvna — core/appHandler.js (MIT)	✓ caught	✓ caught	✓ caught

The misses are real coverage gaps that we track openly and are actively closing; most share a single root cause we've already scoped. The corpus and its sources are in the public repo under test-fixtures/held-out/.
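
The three held-out percentages can be re-derived from the table above. The sketch below transcribes the caught/missed columns into Python and tallies them. The two SQL injection rows are suffixed with their source project to tell them apart. Percentages are rounded half up, an assumption that happens to reproduce the displayed figures.

```python
from decimal import Decimal, ROUND_HALF_UP

# (class, XploitScan, Semgrep, Bearer); True means "caught" in the table above.
HELD_OUT = [
    ("Command injection", True, False, True),
    ("Insecure deserialization", True, True, False),
    ("Code injection (eval)", True, True, True),
    ("IDOR / broken access control (BOLA)", False, False, False),
    ("Hardcoded credentials", True, True, True),
    ("Open redirect", True, True, True),
    ("Path traversal", True, True, True),
    ("Prototype pollution", True, False, False),
    ("SQL injection (dvna)", True, True, True),
    ("SQL injection (juice-shop)", True, True, True),
    ("SSRF", True, False, False),
    ("Server-side template injection", True, False, False),
    ("Weak password hashing", True, False, True),
    ("NoSQL injection", True, False, False),
    ("Cross-site scripting (DOM)", True, False, False),
    ("XXE", True, True, True),
]


def class_level_recall(column: int) -> str:
    """Share of held-out cases caught by the scanner in the given column."""
    caught = sum(1 for row in HELD_OUT if row[column])
    pct = Decimal(caught * 100) / Decimal(len(HELD_OUT))
    return str(pct.quantize(Decimal("0.1"), rounding=ROUND_HALF_UP))


for name, column in (("XploitScan", 1), ("Semgrep", 2), ("Bearer", 3)):
    print(name, class_level_recall(column))
# XploitScan 93.8
# Semgrep 50.0
# Bearer 56.3
```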

Per-rule scores

Rule	TP	FN	Precision	Recall	F1
VC001	5	0	100.0%	100.0%	100.0%
VC003	2	0	100.0%	100.0%	100.0%
VC005	2	1	100.0%	66.7%	80.0%
VC006	8	0	100.0%	100.0%	100.0%
VC007	5	0	100.0%	100.0%	100.0%
VC015	1	0	100.0%	100.0%	100.0%
VC016	1	0	100.0%	100.0%	100.0%
VC023	4	0	100.0%	100.0%	100.0%
VC025	3	0	100.0%	100.0%	100.0%
VC030	1	0	100.0%	100.0%	100.0%
VC031	3	0	100.0%	100.0%	100.0%
VC033	1	0	100.0%	100.0%	100.0%
VC034	2	0	100.0%	100.0%	100.0%
VC035	2	0	100.0%	100.0%	100.0%
VC037	4	0	100.0%	100.0%	100.0%
VC038	2	0	100.0%	100.0%	100.0%
VC041	6	0	100.0%	100.0%	100.0%
VC042	10	0	100.0%	100.0%	100.0%
VC043	1	0	100.0%	100.0%	100.0%
VC044	6	1	100.0%	85.7%	92.3%
VC045	1	0	100.0%	100.0%	100.0%
VC046	2	0	100.0%	100.0%	100.0%
VC047	2	0	100.0%	100.0%	100.0%
VC048	3	0	100.0%	100.0%	100.0%
VC050	1	0	100.0%	100.0%	100.0%
VC051	2	0	100.0%	100.0%	100.0%
VC052	1	0	100.0%	100.0%	100.0%
VC054	2	0	100.0%	100.0%	100.0%
VC055	1	0	100.0%	100.0%	100.0%
VC057	1	0	100.0%	100.0%	100.0%
VC058	1	0	100.0%	100.0%	100.0%
VC059	1	0	100.0%	100.0%	100.0%
VC060	2	0	100.0%	100.0%	100.0%
VC062	1	0	100.0%	100.0%	100.0%
VC063	1	0	100.0%	100.0%	100.0%
VC072	1	0	100.0%	100.0%	100.0%
VC073	1	0	100.0%	100.0%	100.0%
VC074	1	0	100.0%	100.0%	100.0%
VC075	1	0	100.0%	100.0%	100.0%
VC077	1	0	100.0%	100.0%	100.0%
VC078	1	0	100.0%	100.0%	100.0%
VC079	2	0	100.0%	100.0%	100.0%
VC081	2	0	100.0%	100.0%	100.0%
VC082	5	0	100.0%	100.0%	100.0%
VC083	1	0	100.0%	100.0%	100.0%
VC086	2	0	100.0%	100.0%	100.0%
VC088	2	0	100.0%	100.0%	100.0%
VC090	1	0	100.0%	100.0%	100.0%
VC091	1	0	100.0%	100.0%	100.0%
VC093	1	0	100.0%	100.0%	100.0%
VC094	9	0	100.0%	100.0%	100.0%
VC132	1	0	100.0%	100.0%	100.0%
VC133	1	0	100.0%	100.0%	100.0%
VC135	1	0	100.0%	100.0%	100.0%
VC143	1	0	100.0%	100.0%	100.0%
VC146	1	0	100.0%	100.0%	100.0%
VC152	1	0	100.0%	100.0%	100.0%
VC153	1	0	100.0%	100.0%	100.0%
VC156	1	0	100.0%	100.0%	100.0%
VC158	2	0	100.0%	100.0%	100.0%
VC166	1	0	100.0%	100.0%	100.0%
VC168	1	0	100.0%	100.0%	100.0%
VC178	1	0	100.0%	100.0%	100.0%
VC184	1	0	100.0%	100.0%	100.0%
VC185	1	0	100.0%	100.0%	100.0%
VC186	2	0	100.0%	100.0%	100.0%
VC189	1	0	100.0%	100.0%	100.0%
VC191	1	0	100.0%	100.0%	100.0%
VC192	1	0	100.0%	100.0%	100.0%
VC194	1	0	100.0%	100.0%	100.0%
VC197	2	0	100.0%	100.0%	100.0%
VC198	1	0	100.0%	100.0%	100.0%
VC200	1	0	100.0%	100.0%	100.0%
VC201	2	0	100.0%	100.0%	100.0%
VC203	2	0	100.0%	100.0%	100.0%
VC204	1	0	100.0%	100.0%	100.0%
VC206	1	0	100.0%	100.0%	100.0%
VC207	1	0	100.0%	100.0%	100.0%
VC208	1	0	100.0%	100.0%	100.0%
VC209	1	0	100.0%	100.0%	100.0%
VC210	1	0	100.0%	100.0%	100.0%
VC211	1	0	100.0%	100.0%	100.0%
VC212	1	0	100.0%	100.0%	100.0%

Only rules with at least one ground-truth entry in the corpus appear here. The other 132 rules don't have fixtures yet and are excluded from the score.
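
As a rough illustration of how a per-rule table like the one above can be built, the sketch below groups individual outcomes by rule and skips any rule with no ground-truth entries. The input format, a list of (rule, outcome) pairs, is an assumption for the example, as is reporting precision as 0.0 when a rule produced no findings at all. The sample data reuses the TP and FN counts shown for VC003, VC005, and VC044.

```python
from collections import defaultdict


def per_rule_scores(outcomes):
    """outcomes: iterable of (rule_id, kind), kind being "TP", "FP", or "FN".

    Returns {rule_id: (precision, recall, f1)} as percentages.
    """
    counts = defaultdict(lambda: {"TP": 0, "FP": 0, "FN": 0})
    for rule, kind in outcomes:
        counts[rule][kind] += 1

    table = {}
    for rule, c in sorted(counts.items()):
        ground_truth = c["TP"] + c["FN"]
        if ground_truth == 0:
            continue  # no expected entries for this rule: leave it out
        reported = c["TP"] + c["FP"]
        precision = c["TP"] / reported if reported else 0.0
        recall = c["TP"] / ground_truth
        f1 = 2 * c["TP"] / (2 * c["TP"] + c["FP"] + c["FN"])
        table[rule] = tuple(round(v * 100, 1) for v in (precision, recall, f1))
    return table


sample = (
    [("VC003", "TP")] * 2
    + [("VC005", "TP")] * 2 + [("VC005", "FN")]
    + [("VC044", "TP")] * 6 + [("VC044", "FN")]
)
for rule, scores in per_rule_scores(sample).items():
    print(rule, scores)
# VC003 (100.0, 100.0, 100.0)
# VC005 (100.0, 66.7, 80.0)
# VC044 (100.0, 85.7, 92.3)
```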

Head-to-head vs open-source scanners

XploitScan F1

99.4%

210+ rules, 241 fixtures

Semgrep F1

36.5%

community rules, TP 37 / FP 4 / FN 125

Bearer F1

54.2%

open-source SAST, TP 71 / FP 29 / FN 91
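
The F1 figures on these cards can be recomputed from the TP / FP / FN counts shown with them. The sketch below uses the count form of F1, 2·TP / (2·TP + FP + FN), which is equivalent to the harmonic mean of precision and recall. It also prints the precision and recall implied by those counts, which the cards do not display. The function name is illustrative.

```python
def summarize(tp: int, fp: int, fn: int) -> tuple[float, float, float]:
    """Return (precision, recall, f1) as percentages rounded to one decimal."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * tp / (2 * tp + fp + fn)
    return tuple(round(v * 100, 1) for v in (precision, recall, f1))


# Counts as published on this page.
SCANNERS = {
    "XploitScan": (160, 0, 2),
    "Semgrep": (37, 4, 125),
    "Bearer": (71, 29, 91),
}

for name, counts in SCANNERS.items():
    print(name, summarize(*counts))
# XploitScan (100.0, 98.8, 99.4)
# Semgrep (90.2, 22.8, 36.5)
# Bearer (71.0, 43.8, 54.2)
```

Read this way, the gap between the scanners on this corpus comes mostly from recall rather than precision.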

VC Rule	XploitScan	Semgrep	Semgrep covers?	Bearer	Bearer covers?
VC001	100.0%	0.0%	✗	80.0%	✓
VC003	100.0%	0.0%	✗	0.0%	✗
VC005	66.7%	0.0%	✗	66.7%	✓
VC006	100.0%	12.5%	✓	37.5%	✓
VC007	100.0%	40.0%	✓	60.0%	✓
VC015	100.0%	100.0%	✓	100.0%	✓
VC016	100.0%	100.0%	✓	100.0%	✓
VC023	100.0%	0.0%	✗	0.0%	✗
VC025	100.0%	0.0%	✗	100.0%	✓
VC030	100.0%	0.0%	✗	0.0%	✗
VC031	100.0%	100.0%	✓	100.0%	✓
VC033	100.0%	0.0%	✗	0.0%	✗
VC034	100.0%	0.0%	✗	100.0%	✓
VC035	100.0%	50.0%	✓	100.0%	✓
VC037	100.0%	0.0%	✗	50.0%	✓
VC038	100.0%	0.0%	✗	0.0%	✗
VC041	100.0%	16.7%	✓	83.3%	✓
VC042	100.0%	0.0%	✗	10.0%	✓
VC043	100.0%	0.0%	✗	0.0%	✗
VC044	85.7%	14.3%	✓	100.0%	✓
VC045	100.0%	0.0%	✗	0.0%	✗
VC046	100.0%	0.0%	✗	0.0%	✗
VC047	100.0%	50.0%	✓	50.0%	✓
VC048	100.0%	0.0%	✗	33.3%	✓
VC050	100.0%	0.0%	✗	0.0%	✗
VC051	100.0%	0.0%	✗	0.0%	✗
VC052	100.0%	0.0%	✗	100.0%	✓
VC054	100.0%	0.0%	✗	0.0%	✗
VC055	100.0%	0.0%	✗	0.0%	✗
VC057	100.0%	0.0%	✗	0.0%	✗
VC058	100.0%	0.0%	✗	0.0%	✗
VC059	100.0%	0.0%	✗	0.0%	✗
VC060	100.0%	50.0%	✓	100.0%	✓
VC062	100.0%	0.0%	✗	0.0%	✗
VC063	100.0%	0.0%	✗	100.0%	✓
VC072	100.0%	100.0%	✓	0.0%	✗
VC073	100.0%	100.0%	✓	100.0%	✓
VC074	100.0%	0.0%	✗	100.0%	✓
VC075	100.0%	100.0%	✓	0.0%	✗
VC077	100.0%	0.0%	✗	0.0%	✗
VC078	100.0%	100.0%	✓	0.0%	✗
VC079	100.0%	100.0%	✓	50.0%	✓
VC081	100.0%	50.0%	✓	100.0%	✓
VC082	100.0%	40.0%	✓	40.0%	✓
VC083	100.0%	100.0%	✓	0.0%	✗
VC086	100.0%	0.0%	✗	100.0%	✓
VC088	100.0%	0.0%	✗	50.0%	✓
VC090	100.0%	0.0%	✗	100.0%	✓
VC091	100.0%	0.0%	✗	0.0%	✗
VC093	100.0%	100.0%	✓	100.0%	✓
VC094	100.0%	55.6%	✓	77.8%	✓
VC132	100.0%	0.0%	✗	100.0%	✓
VC133	100.0%	0.0%	✗	0.0%	✗
VC135	100.0%	0.0%	✗	0.0%	✗
VC143	100.0%	0.0%	✗	0.0%	✗
VC146	100.0%	0.0%	✗	0.0%	✗
VC152	100.0%	0.0%	✗	0.0%	✗
VC153	100.0%	100.0%	✓	100.0%	✓
VC156	100.0%	0.0%	✗	0.0%	✗
VC158	100.0%	0.0%	✗	50.0%	✓
VC166	100.0%	0.0%	✗	0.0%	✗
VC168	100.0%	0.0%	✗	0.0%	✗
VC178	100.0%	0.0%	✗	0.0%	✗
VC184	100.0%	100.0%	✓	0.0%	✗
VC185	100.0%	100.0%	✓	0.0%	✗
VC186	100.0%	50.0%	✓	0.0%	✗
VC189	100.0%	0.0%	✗	0.0%	✗
VC191	100.0%	100.0%	✓	100.0%	✓
VC192	100.0%	100.0%	✓	0.0%	✗
VC194	100.0%	100.0%	✓	100.0%	✓
VC197	100.0%	0.0%	✗	0.0%	✗
VC198	100.0%	0.0%	✗	0.0%	✗
VC200	100.0%	0.0%	✗	100.0%	✓
VC201	100.0%	0.0%	✗	0.0%	✗
VC203	100.0%	0.0%	✗	0.0%	✗
VC204	100.0%	0.0%	✗	0.0%	✗
VC206	100.0%	0.0%	✗	0.0%	✗
VC207	100.0%	100.0%	✓	100.0%	✓
VC208	100.0%	0.0%	✗	0.0%	✗
VC209	100.0%	0.0%	✗	0.0%	✗
VC210	100.0%	0.0%	✗	0.0%	✗
VC211	100.0%	0.0%	✗	0.0%	✗
VC212	100.0%	0.0%	✗	0.0%	✗

Methodology. All scanners run against the same 241-fixture labeled corpus. A VC rule counts as "covered" by another scanner if any of that scanner's rules fires within ±10 lines of our expected range in the correct file. We don't require rule-ID equivalence — the question is capability to detect the class of vulnerability, not taxonomy alignment.
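
The coverage test described above amounts to a file match plus a widened line window. A minimal sketch in Python, with hypothetical type names, file paths, and rule IDs (the actual runner may structure this differently):

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ExpectedRange:
    file: str
    start_line: int
    end_line: int


@dataclass(frozen=True)
class ThirdPartyFinding:
    file: str
    line: int
    rule_id: str  # the other scanner's own rule name; never compared


TOLERANCE = 10  # lines allowed on either side of the expected range


def is_covered(expected: ExpectedRange, findings, tolerance: int = TOLERANCE) -> bool:
    """True if any finding lands in the right file within the widened range."""
    low = expected.start_line - tolerance
    high = expected.end_line + tolerance
    return any(f.file == expected.file and low <= f.line <= high for f in findings)


# Hypothetical expected entry spanning lines 30 to 34.
expected = ExpectedRange("routes/login.js", 30, 34)

print(is_covered(expected, [ThirdPartyFinding("routes/login.js", 44, "example.rule-a")]))
# True: line 44 is exactly 10 lines past the end of the range
print(is_covered(expected, [ThirdPartyFinding("routes/login.js", 45, "example.rule-a")]))
# False: one line outside the window
print(is_covered(expected, [ThirdPartyFinding("routes/other.js", 32, "example.rule-b")]))
# False: wrong file
```

Because the rule ID is ignored, a scanner gets credit for flagging the right location under any of its own rule names.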

Semgrep. Version 1.86.0 with community rulesets p/security-audit, p/owasp-top-ten, p/javascript, p/typescript, p/react. Semgrep Pro's proprietary rules would likely score higher — we compare against the free tier because it's what's available to everyone.

Bearer. SAST scanner mode, Bearer's built-in security ruleset. Free OSS; no account required. Bearer's primary focus is PII data-flow analysis and its security rules are a secondary feature, so this is not an apples-to-apples comparison with a dedicated SAST tool, but it's what's on the market and free.

FP counting. Both Semgrep and Bearer use their own rule taxonomies, so we can't attribute FPs rule by rule against our mustNotFire list the way we can with our own scanner. Any finding from those scanners on a clean fixture counts as an FP. This stricter interpretation puts the third-party scanners at a precision disadvantage, which we acknowledge.
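
The asymmetry between the two FP rules is easier to see side by side. In this sketch, findings are reduced to a list of rule names. The function names and the third-party rule name are made up for the example.

```python
def own_scanner_fps(finding_rules, must_not_fire) -> int:
    """Our scanner: only findings for rules listed in mustNotFire are FPs."""
    return sum(1 for rule in finding_rules if rule in must_not_fire)


def third_party_fps(finding_rules, fixture_is_clean: bool) -> int:
    """Third-party scanners: every finding on a clean fixture is an FP."""
    return len(finding_rules) if fixture_is_clean else 0


# One clean fixture whose label says VC001 must not fire.
must_not_fire = {"VC001"}

# Our scanner reports an unrelated rule: not counted.
print(own_scanner_fps(["VC042"], must_not_fire))  # 0

# A third-party scanner reports anything at all: counted.
print(third_party_fps(["example.some-rule"], fixture_is_clean=True))  # 1
```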

Spot a gap?

The corpus is open and the runner is in the repo. If you have a real-world vulnerability pattern that our scanner misses, open a PR with a fixture at xploitscan-benchmark/test-fixtures/ or email admin@xploitscan.com. See the disclosure policy for anything you need to keep private.