
We Ran Semgrep Against Our Benchmark. It Missed Half the Bugs.

A head-to-head comparison of XploitScan and Semgrep on a public labeled corpus of AI-generated code. 100% F1 vs 62.5%. The gap is almost entirely about one thing: template-literal SQL injection.

XploitScan Team · 8 min read

Every time we show someone XploitScan, we get the same question: “Isn't this just Semgrep?”

Fair question. Semgrep has been around for years, it's genuinely good, and most of its community rulesets are free. So we ran it against our own detection benchmark to see how it actually compares. The benchmark is public, the runner is public, and anyone can reproduce the numbers. Here's what we found.

The numbers

            XploitScan   Semgrep (community)
F1          100%         62.5%
Precision   100%         83.3%
Recall      100%         50.0%
Corpus      26 labeled fixtures, identical for both scanners.
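
For reference, F1 is the harmonic mean of precision and recall, and Semgrep's 62.5% follows directly from its 83.3% precision and 50% recall:

```javascript
// F1 = harmonic mean of precision and recall.
const f1 = (p, r) => (2 * p * r) / (p + r);

f1(5 / 6, 1 / 2); // 0.625 — Semgrep's column
f1(1, 1);         // 1.0   — perfect on both axes
```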

On 26 labeled fixtures covering SQL injection, XSS, command injection, eval, unvalidated redirects, reflected CORS, missing pagination, and hardcoded secrets (AWS, Anthropic, GitHub PAT, Slack, Supabase service role) — XploitScan detected all of them. Semgrep detected 5 of 10 vulnerability classes.

All the methodology, per-rule scores, and reproducibility details live on the /benchmark page. Short version: a rule counts as “covered” if any finding from that scanner lands within ±10 lines of the expected vulnerability. We don't require rule-ID equivalence; the question is whether the tool can detect the class of bug, not whether the rule name matches. That methodology actually favors Semgrep, because one Semgrep rule firing on a fixture counts as a hit even if it's only a vaguely related finding.
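
The ±10-line coverage check is simple enough to sketch. This is a hypothetical distillation, not the actual runner (which lives in the repo):

```javascript
// A fixture's expected vulnerability counts as covered if any finding
// from the scanner lands within ±10 lines of it.
function isCovered(expectedLine, findings) {
  return findings.some(f => Math.abs(f.line - expectedLine) <= 10);
}
```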

What Semgrep catches

We don't want to undersell Semgrep. Its community rules detected:

  • React XSS via dangerouslySetInnerHTML
  • eval() on request input
  • Express res.redirect(req.query.x) open-redirect
  • child_process.exec with user input
  • Reflected-origin CORS with credentials

These are the classic OWASP patterns Semgrep was designed around. If your code uses these specific shapes, Semgrep will flag them. That's not nothing.
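
For instance, the eval-on-request-input shape is about as simple as it gets. This is a distilled sketch, not one of the corpus fixtures:

```javascript
// The classic shape community rules reliably flag:
// user-controlled input flows straight into eval().
function runUserExpr(expr) {
  // In the real fixture, expr comes from req.query.
  return eval(expr);
}

runUserExpr("1 + 1"); // → 2, but the attacker controls the string
```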

What Semgrep misses

The interesting part is what didn't fire — and why. Five vulnerability classes where Semgrep returned zero findings:

1. SQL injection via template literals

This is the one that matters, because it's what AI coding tools actually write. Ask Cursor to “fetch a user by email” and you'll get something like this:

// What Cursor writes by default when asked for a user lookup:
export async function getUserByEmail(db, email) {
  const result = await db.query(
    `SELECT * FROM users WHERE email = '${email}'`
  );
  return result.rows[0];
}

This is textbook SQL injection. An attacker passes ' OR 1=1-- as the email and dumps your whole users table. Every security-aware developer has seen this pattern a thousand times.
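
To see why it's exploitable, it's enough to look at the string the template literal produces. No database required; this just shows the interpolation:

```javascript
// The interpolated query for a benign and a malicious email value.
const buildQuery = email =>
  `SELECT * FROM users WHERE email = '${email}'`;

buildQuery("alice@example.com");
// SELECT * FROM users WHERE email = 'alice@example.com'

buildQuery("' OR 1=1--");
// SELECT * FROM users WHERE email = '' OR 1=1--'
// The WHERE clause is now always true; the dangling quote is commented out.
```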

Semgrep's community JavaScript rules miss it. All five SQL-injection variants in our benchmark — pg, mysql2, Prisma $queryRawUnsafe, Drizzle sql.raw, knex .raw — fire zero Semgrep findings. Each of them is an actual SQL injection. Each is exactly what an AI assistant writes by default.

Why? Semgrep's community SQL-injection rules pattern-match on an older shape of JavaScript:

// The 2015-era vulnerable shape the rules were written to catch:
const result = await db.query(
  "SELECT * FROM users WHERE email = '" + email + "'"
);

Those rules were written circa 2015, when people wrote db.query(sql, [params]) with two arguments. They pattern-match on “a .query() call with string concatenation and a specific library like mysql or pg”. They don't handle template literals, because template literals were brand-new ES2015 syntax when the rules were written, and the rule maintainers haven't rewritten them since.

AI coding tools, meanwhile, were trained on post-2015 JavaScript. They write template literals by default. The rules and the code have drifted past each other, and nobody closed the gap.
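
For contrast, here is the parameterized rewrite that closes the hole. This is a sketch assuming a pg-style client, where a $1 placeholder carries the value separately from the SQL text:

```javascript
// Safe: the driver sends `email` out-of-band from the SQL string,
// so the input can never change the structure of the query.
async function getUserByEmail(db, email) {
  const result = await db.query(
    "SELECT * FROM users WHERE email = $1",
    [email]
  );
  return result.rows[0];
}
```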

2. Service-specific secret detection

Semgrep's community secret rules are generic, entropy-style detection: “does this string look like base64?” They don't ship specific detectors for formats like Anthropic API keys (sk-ant-api03-...), AWS access keys (AKIA...), or Supabase service-role JWTs.

That matters because the generic entropy approach fires on every high-entropy string in your code: content hashes, build IDs, Tailwind class fingerprints. Lower the threshold enough to catch real keys and you drown users in false positives. Specific prefix rules don't have that tradeoff.
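
A prefix-based detector is essentially a handful of anchored patterns. The regexes below are illustrative approximations, not XploitScan's actual rules; real key formats vary in length and alphabet:

```javascript
// Illustrative prefix patterns — approximations, not production rules.
const SECRET_PATTERNS = [
  { name: "aws-access-key", re: /\bAKIA[0-9A-Z]{16}\b/ },
  { name: "anthropic-key",  re: /\bsk-ant-api03-[\w-]{20,}/ },
  { name: "github-pat",     re: /\bghp_[A-Za-z0-9]{36}\b/ },
];

const findSecrets = source =>
  SECRET_PATTERNS.filter(p => p.re.test(source)).map(p => p.name);
```

Because each pattern is anchored to a known prefix, a content hash or build ID never trips it, no matter how high its entropy.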

3. Configuration and availability bugs

Semgrep is fundamentally a syntax-matching tool. It doesn't model business logic. Two examples that matter for AI code:

  • Missing Stripe webhook signature verification. The vulnerability isn't that any specific function is called unsafely — it's that stripe.webhooks.constructEvent isn't called at all. Semgrep can't express “this function is absent from the handler.”
  • Missing pagination on a list endpoint. Semgrep sees prisma.item.findMany({ where... }) and has no way to know the endpoint will return 100,000 rows. “No take parameter” isn't a syntactic pattern.

XploitScan rules for both patterns look at the shape of the handler — route declarations, function bodies, expected protective calls — not just individual lines. It's narrower than a taint analysis, but broader than regex.
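
A toy version of the pagination check, regex-based purely for illustration (the real rule inspects the parsed handler, and the function name here is hypothetical):

```javascript
// Flag a handler whose findMany() call has no `take` cap on the result set.
function missingPagination(handlerSource) {
  const call = handlerSource.match(/findMany\(([\s\S]*?)\)/);
  if (!call) return false;            // no list query in this handler
  return !/\btake\s*:/.test(call[1]); // unbounded: nothing limits the rows
}
```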

The honest caveats

A few things this comparison deliberately does not claim:

  • Semgrep Pro would likely score higher. The Pro tier has proprietary rules, dataflow analysis, and additional language packs. We tested against the free community rulesets because that's what everyone has access to. XploitScan, which splits its rules into 30 free and 158 Pro, is likewise benchmarked at its free tier here.
  • 26 fixtures is not the full story. A thousand-fixture corpus would give both tools more room to shine and expose more gaps on both sides. We grew the corpus to 41 in the week after first publishing this post, and we're expanding further — see the benchmark page for the current numbers.
  • Semgrep is extensible. If you write your own rules tuned for AI-generated patterns, it can match XploitScan on any specific vulnerability. The benchmark measures out-of-the-box default behavior, which is what 99% of users actually experience.
  • This benchmark is ours. We wrote the fixtures, we wrote the runner, we wrote the scoring methodology. A different corpus would produce different numbers. If you think our corpus is biased, the fixtures are open on GitHub — open a PR with a fixture you think we miss.

When to use which

Semgrep and XploitScan are not really competitors — they're aimed at different audiences for different use cases. Rough heuristic:

  • Semgrep if you have a security engineer writing and maintaining custom rules for your specific codebase, you need cross-file dataflow analysis (Pro), or you're on a language XploitScan doesn't deeply support yet.
  • XploitScan if you're shipping a product built with Cursor, Bolt, Lovable, or Replit and you want out-of-the-box detection for the specific bugs AI coding tools produce, without rule-writing overhead.

They stack. Run both, dedupe findings, ship a safer app. If you do, you'll notice XploitScan catches the bugs Semgrep misses, and Semgrep occasionally catches something XploitScan missed (we fold those back into our rules when they do).
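
“Run both, dedupe” can be as simple as collapsing findings that land in the same file within the benchmark's own ±10-line window. A sketch, with assumed field names:

```javascript
// Keep one finding per file-and-neighborhood, first scanner wins.
function dedupe(findings) {
  const kept = [];
  for (const f of findings) {
    const isDup = kept.some(
      k => k.file === f.file && Math.abs(k.line - f.line) <= 10
    );
    if (!isDup) kept.push(f);
  }
  return kept;
}
```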

Reproduce the numbers

Everything in this post is in a public repo. To reproduce:

  1. Clone github.com/bgage72590/vibecheck
  2. Run node scripts/benchmark.js for XploitScan's numbers
  3. Install Semgrep (pip install semgrep) and run node scripts/semgrep-benchmark.js for the comparison
  4. Both scripts write JSON to the repo root and the /benchmark page renders whatever the latest runs produced

CI reruns both on every PR touching fixtures or rules, so the numbers on the page are always current.

Want to see XploitScan on your own code? The fastest test is the CLI — scans locally, nothing uploaded.

npx xploitscan scan .
Or try the browser scanner →