Difficulty Benchmark

The Difficulty Benchmark is a metric we built to help you pick the right scan tier for your needs. It measures how challenging a vulnerability is to discover and gives a transparent read on how much depth each AuditAgent scan tier adds. The benchmark contextualises Developer Scan and Auditor Scan against a standardised set of security findings.

How we calculate difficulty

We analyse data from public audit contests. The difficulty of finding a specific vulnerability is determined by the number of independent auditors who successfully identified and reported it.

Easier findings. Vulnerabilities reported by more auditors than the median. These are often the more obvious issues that a surface-level analysis can catch.
Harder findings. Vulnerabilities reported by fewer auditors than the median. These typically require deeper analysis, more specialised knowledge, or complex testing to uncover.

We use the median as the threshold because it splits the dataset as evenly as possible. About half of the vulnerabilities in our dataset are "easier" and about half are "harder", which gives a clear and consistent benchmark.
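The median split described above can be sketched in a few lines of Python. This is an illustrative sketch only, not the actual AuditAgent implementation: the function name `split_by_difficulty`, the finding IDs, and the reporter counts are all hypothetical. The source defines only "above" and "below" the median, so the handling of a count exactly equal to the median is an assumption made here.

```python
from statistics import median


def split_by_difficulty(reporter_counts):
    """Label each finding 'easier' or 'harder' relative to the median reporter count.

    reporter_counts maps a finding ID to the number of independent auditors
    who reported that finding in a public contest.
    """
    threshold = median(reporter_counts.values())
    labels = {}
    for finding_id, count in reporter_counts.items():
        # Assumption: a count exactly equal to the median is labelled "harder".
        # The benchmark description only defines "above" and "below".
        labels[finding_id] = "easier" if count > threshold else "harder"
    return threshold, labels


# Hypothetical data, invented for illustration.
example_counts = {"H-01": 12, "H-02": 2, "M-01": 7, "M-02": 1, "M-03": 9, "M-04": 3}
threshold, labels = split_by_difficulty(example_counts)
```

In this made-up dataset the median is 5, so the three findings reported by 7, 9, and 12 auditors are labelled "easier" and the other three are labelled "harder".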

The percentages you see on the scan selection screen represent our tool's detection capability for findings within that category. For example, "Easier 45%" means the scan detects 45% of the "easier to find" vulnerabilities in our dataset.
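To make the percentage concrete, here is a minimal sketch of how a per-category detection rate could be computed. It assumes each benchmark finding already carries an "easier" or "harder" label and that the set of findings a scan detected is known. The function name `detection_rate`, the finding IDs, and the detected set are hypothetical and do not reflect real benchmark data.

```python
def detection_rate(labels, detected_ids, category):
    """Return the percentage of findings in `category` that the scan detected."""
    in_category = [fid for fid, label in labels.items() if label == category]
    if not in_category:
        # No findings in this category, so there is nothing to detect.
        return 0.0
    hits = sum(1 for fid in in_category if fid in detected_ids)
    return 100.0 * hits / len(in_category)


# Hypothetical labels and scan results, invented for illustration.
labels = {
    "H-01": "easier", "H-02": "easier", "M-01": "easier", "M-02": "easier",
    "H-03": "harder", "M-03": "harder", "M-04": "harder", "M-05": "harder",
}
detected = {"H-01", "M-01", "M-04"}

easier_pct = detection_rate(labels, detected, "easier")  # 2 of 4 easier findings
harder_pct = detection_rate(labels, detected, "harder")  # 1 of 4 harder findings
```

With these invented inputs the scan would be shown as "Easier 50%" and "Harder 25%". Each percentage is computed only over the findings in its own category, not over the whole dataset.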

Our benchmark dataset

The current benchmark is built from 5 public audits across 5 different repositories and covers 79 medium- and high-severity security issues. This breadth gives a meaningful read on what each scan tier catches.

Difficulty is not severity

The Difficulty Benchmark is not an indicator of a vulnerability's severity. The two concepts are independent.

A high-severity issue, for example one that could lead to total loss of funds, might be very simple and therefore "easy" to discover.
A low-severity issue could be complex and "hard" to discover.

The benchmark reflects the challenge of detection only. Always evaluate the severity and impact of any reported finding within the context of your specific project.

Noise and false positives

The Auditor Scan uncovers many more findings, and it also generates more noise. Review each finding carefully and remember that LLMs can produce false positives. See Recall Is Not Enough for our published analysis of the recall versus noise trade-off.

Benchmark scope vs scan output

The Difficulty Benchmark is calculated exclusively from high and medium severity issues confirmed in public audit competitions. This ensures the benchmark reflects our ability to find significant threats.

Scans also identify a broad range of other items, including low-severity vulnerabilities, informational findings, and best-practice recommendations. These additional findings are not factored into the benchmark metric, but they appear in your final scan report.
