
Detection Methodology

How XploitScan detects vulnerabilities, how detection quality is measured, and how to read the public benchmark.

Overview

XploitScan combines a fast pattern-matching pre-filter with a taint-aware AST analysis layer. Most rules use the regex layer alone — it's ~50× faster than a full AST pass and sufficient for things like hardcoded API keys, missing security headers, and configuration mistakes. Rules that depend on data flow — SSRF, prototype pollution, mass assignment, SSTI, command injection from user input — confirm matches with a Babel-parsed AST and a local taint tracker.

Detection quality is scored on a public, version-controlled fixture corpus and refreshed on every push to main. Numbers are visible at xploitscan.com/benchmark.

The two-layer scanner

Layer 1 — Pattern matching

Every rule starts with a regex. Patterns are tuned per category: file-extension and filename allowlists pre-filter irrelevant files; comment lines and string-literal contexts are stripped; well-known mitigation patterns (e.g. crypto.timingSafeEqual, DOMPurify.sanitize) act as local suppressors.

For pattern-only rules — secrets, missing CSP, weak TLS config, insecure cookie attributes, container misconfigs — this layer is the entire scanner. Fast, deterministic, no parser needed.
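
As an illustration, a pattern-layer rule can be thought of as a small record like the following. The field names and both rules are hypothetical (the real schema lives in the repo), but the moving parts are the ones described above: a file pre-filter, a core pattern, and local suppressors.

// Hypothetical shape of a pattern-layer rule; field names are illustrative,
// not XploitScan's actual schema.
interface PatternRule {
  id: string;
  files: RegExp;          // file-extension / filename pre-filter
  pattern: RegExp;        // the core match
  suppressors?: RegExp[]; // nearby mitigations that veto a match
  severity: 'low' | 'medium' | 'high' | 'critical';
}

const hardcodedAwsKey: PatternRule = {
  id: 'hardcoded-aws-access-key',
  files: /\.(jsx?|tsx?|json|ya?ml|env)$/,
  pattern: /AKIA[0-9A-Z]{16}/, // AWS access key ID format
  severity: 'critical',
};

const unsanitizedInnerHtml: PatternRule = {
  id: 'unsanitized-innerhtml',
  files: /\.(jsx?|tsx?)$/,
  pattern: /\.innerHTML\s*=/,
  suppressors: [/DOMPurify\.sanitize/], // recognized mitigation suppresses the finding
  severity: 'high',
};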

Layer 2 — AST + taint analysis

Rules that need to know where data came from — not just that a sink looks dangerous — opt into a Babel parse. The taint tracker recognizes user-controlled sources from Express (req.body, req.query, req.headers), Fastify (request.*), Koa (ctx.request.*), Next.js App Router (await request.json(), request.formData()), Web Fetch API, process.argv, and AWS Lambda event.body. Taint propagates through const/let bindings, destructuring (including renames), assignments, template literals, and member access.
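
To make that concrete, here is the kind of flow the taint layer exists to confirm (a hypothetical Express handler, not a fixture from the corpus):

import express from 'express';

const app = express();

app.get('/preview', async (req, res) => {
  const { url: target } = req.query;  // source: req.query, destructured with a rename
  const full = `${target}/metadata`;  // taint survives the template literal
  const upstream = await fetch(full); // tainted value reaches a request sink: SSRF finding
  res.send(await upstream.text());
});

app.listen(3000);

A pattern-only rule would struggle here: fetch(full) looks identical whether full is user-controlled or a constant. Tracing the value back to req.query is what the AST pass adds.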

Rules currently AST-confirmed: SSRF, prototype pollution, mass assignment, SSTI, XXE option-object inspection, timing-unsafe secret comparisons, and log injection. The AST pass only runs after the regex pre-filter matches, and the parse result is cached per file so multiple AST rules in a single scan pay the parse cost once.
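
The cache itself can be as simple as a map keyed by file path. A minimal sketch, assuming a Babel-based pipeline (the actual implementation details may differ):

import { parse } from '@babel/parser';
import { readFileSync } from 'node:fs';

type Ast = ReturnType<typeof parse>;
const astCache = new Map<string, Ast>();

function parseOnce(filePath: string): Ast {
  let ast = astCache.get(filePath);
  if (!ast) {
    ast = parse(readFileSync(filePath, 'utf8'), {
      sourceType: 'unambiguous',
      plugins: ['typescript', 'jsx'],
    });
    astCache.set(filePath, ast); // every AST rule in this scan reuses the parse
  }
  return ast;
}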

How quality is measured

The corpus

Every detection claim is grounded in a labeled fixture corpus, stored in the public repo at packages/cli/test-fixtures/. Each fixture is a small directory with one or more code files and an expected.json manifest that lists either the exact findings the scanner should produce or the rules it must not fire.
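
Reconstructed from the fields described on this page, a manifest looks roughly like the TypeScript shape below. The exact key names are an assumption; check the test-fixtures README for the authoritative schema.

// Assumed shape of an expected.json manifest; key names reconstructed from
// this page, so verify against the test-fixtures README before relying on them.
interface ExpectedFinding {
  rule: string;                        // rule ID
  file: string;                        // path relative to the fixture directory
  lines: [start: number, end: number]; // inclusive line range
  minSeverity: 'low' | 'medium' | 'high' | 'critical';
  knownGap?: boolean;                  // tracked-but-not-yet-detected (see below)
}

interface ExpectedManifest {
  expected?: ExpectedFinding[]; // vulnerable fixtures: findings the scanner must produce
  mustNotFire?: string[];       // clean fixtures: rules that must not fire
}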

Fixtures are split into two categories:

  • vulnerable — code that contains a real vulnerability with a known location. Each expected finding declares its rule ID, file, inclusive line range, and minimum severity.
  • clean — correctly written code that exercises the same APIs as a vulnerable fixture (e.g. jwt.verify with a pinned algorithm allowlist as the counter to algorithms: ['none']; both halves are sketched below). Lists rules that must not fire.
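
The jwt example as a fixture pair might look like this (paths and file contents are illustrative, not the repo's actual fixtures):

import jwt from 'jsonwebtoken';

declare const token: string;

// vulnerable/jwt-none-algorithm/app.ts
// Accepting 'none' disables signature verification entirely, so a client
// can strip the signature and forge arbitrary claims.
export function decodeUnsafely() {
  return jwt.verify(token, '', { algorithms: ['none'] });
}

// clean/jwt-pinned-algorithm/app.ts
// Same API, pinned allowlist; this fixture's expected.json lists the JWT
// rule under mustNotFire.
export function decodeSafely() {
  return jwt.verify(token, process.env.JWT_SECRET as string, { algorithms: ['HS256'] });
}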

Counting rules — TP / FP / FN

  • True positive (TP) — a finding on a vulnerable fixture that matches an expected entry by rule, file, and line (within the declared range).
  • False negative (FN) — an expected entry the scanner did not produce.
  • False positive (FP) — a finding for a rule listed in the fixture's mustNotFire array, or any finding on a clean fixture for a rule that should not fire there.

Other findings on vulnerable fixtures (rules that are neither expected nor explicitly forbidden) are ignored — they may be legitimate adjacent issues, and treating them as FPs would penalize useful findings.
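
Put together, the counting logic reduces to something like the sketch below, reusing the assumed manifest shape from earlier. The real scoring code lives in the repo's scripts.

// A sketch of the counting rules above, not the repo's actual scoring code.
interface Finding { rule: string; file: string; line: number; }

function score(findings: Finding[], manifest: ExpectedManifest) {
  let tp = 0, fp = 0, fn = 0;
  for (const exp of manifest.expected ?? []) {
    const hit = findings.some(
      (f) => f.rule === exp.rule && f.file === exp.file &&
             f.line >= exp.lines[0] && f.line <= exp.lines[1],
    );
    if (hit) tp++; else fn++; // an unmatched expectation is a false negative
  }
  for (const f of findings) {
    if (manifest.mustNotFire?.includes(f.rule)) fp++; // forbidden rule fired
    // findings that are neither expected nor forbidden are ignored:
    // they may be legitimate adjacent issues
  }
  return { tp, fp, fn };
}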

Tracked-but-not-yet-detected entries

Some fixtures document vulnerabilities the current scanner doesn't catch yet — flagged with knownGap: true. They still count as FNs in the public benchmark (so recall stays honest), but the regression test suite skips them so a documented gap isn't treated as a regression while the rule is being improved. The /benchmark page shows them in a separate Tracked — in progress section.
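
In terms of the sketch above, the only difference between the two consumers is a filter:

// knownGap entries stay in the public FN count but don't gate CI.
// matchedByScanner is a hypothetical helper, not a real repo function.
declare const manifest: ExpectedManifest;
declare function matchedByScanner(e: ExpectedFinding): boolean;

const missed = (manifest.expected ?? []).filter((e) => !matchedByScanner(e));
const benchmarkFn = missed.length;                     // every miss counts in the public benchmark
const regressions = missed.filter((e) => !e.knownGap); // only undocumented misses fail the suite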

Comparing against Semgrep and Bearer

The same corpus is also scanned by Semgrep (community rulesets: p/security-audit, p/owasp-top-ten, p/javascript, p/typescript, p/react) and Bearer (default SAST + security report). Results are normalized into the same TP / FP / FN counts and rendered side-by-side.

Two methodology notes that matter for fairness:

  • Positional coverage, not rule-ID matching. A third-party finding counts as a TP if any of that scanner's rules fires on the expected file within ±10 lines of the expected range (sketched after this list). We don't expect Semgrep or Bearer to use the same rule names we do; we're measuring whether the scanner can detect the vulnerability class at all.
  • Stricter FP definition for third-party scanners. On clean fixtures, any finding from Semgrep or Bearer counts as an FP — we can't per-rule attribute against mustNotFire the way we do for our own rules. This is the more conservative interpretation and doesn't flatter the comparison.
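
The positional check is deliberately loose about everything except location. Reusing the Finding and ExpectedFinding shapes sketched earlier:

// Positional coverage for third-party scanners: any rule, right file,
// within ±10 lines of the expected range. A sketch, not the runner's code.
function positionalHit(f: Finding, exp: ExpectedFinding): boolean {
  return (
    f.file === exp.file &&
    f.line >= exp.lines[0] - 10 &&
    f.line <= exp.lines[1] + 10
    // note: f.rule is intentionally not compared
  );
}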

Reproducibility

Everything in the benchmark is open. The corpus, the runner scripts, the Semgrep config, the Bearer invocation, and the scoring code all live in the public repo at github.com/bgage72590/vibecheck.

To reproduce locally:

git clone https://github.com/bgage72590/vibecheck
cd vibecheck
pnpm install
pnpm --filter xploitscan-shared-rules build

# XploitScan benchmark
node scripts/benchmark.js

# Semgrep comparison (requires `pip install semgrep`)
node scripts/semgrep-benchmark.js

# Bearer comparison (requires the `bearer` CLI on PATH)
node scripts/bearer-benchmark.js

Every push to main reruns all three benchmarks in CI, commits the JSON back to the repo, and Vercel rebuilds the /benchmark page. The numbers you see are never more than a few minutes stale.

What the corpus is — and isn't

It is a curated, growing set of labeled examples covering the vulnerability classes the rules target. The current corpus has 151 fixtures across 25+ vulnerability classes, including realistic multi-file mini-apps (auth flows, file upload pipelines, payment webhooks, GraphQL APIs, OAuth callbacks, admin dashboards) that exercise rules in integration-level contexts rather than only minimal one-off snippets.

It isn't a replacement for real-world testing on a real codebase. A perfect score on a small corpus is not a claim that the scanner catches everything in production code — it's the floor below which detection quality will not regress without notice. The corpus grows with every release, so the bar rises over time.

Contributing fixtures

Found a vulnerability class XploitScan doesn't catch — or catches incorrectly? Open a PR with a fixture. The format is documented in the test-fixtures README, but the gist is: a directory under vulnerable/ or clean/, your code, and an expected.json with the rule ID + file + line range. The next benchmark run will measure whether the scanner detects it; if not, the fixture becomes a tracked gap and a target for the next rule release.
