Difficulty Benchmark

Overview

The Difficulty Benchmark is a metric we developed to help you select the best scan for your security needs. It indicates how challenging it is to discover a potential vulnerability within a given codebase and provides a transparent measure of the depth and power of our different scanning options.

This benchmark helps contextualize the capabilities of each scan type (Basic, Advanced, and Deep) against a standardized set of security findings.

How We Calculate Difficulty

To create an objective benchmark, we analyzed data from public audit contests. The difficulty of finding a specific vulnerability is determined by the number of independent auditors who successfully identified and reported it.

The logic is straightforward:

  • Easier Findings: Vulnerabilities that were reported by a number of auditors above the median are classified as "Easier" to find. These are often more obvious issues that a surface-level analysis can catch.
  • Harder Findings: Vulnerabilities that were reported by a number of auditors below the median are classified as "Harder" to find. These typically require deeper analysis, more specialized knowledge, or complex testing to uncover.

For full transparency, we use the median as the threshold to create a balanced 50/50 split. This means that half of the vulnerabilities in our dataset are categorized as "Easier" and the other half as "Harder," allowing for a clear and consistent benchmark.
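The median split described above can be sketched in a few lines. This is a minimal illustration, not the actual benchmark pipeline; the function name and the sample data are hypothetical:

```python
from statistics import median

def classify_findings(report_counts):
    """Split findings into 'Easier'/'Harder' based on whether the
    number of auditors who reported each one is above the median."""
    m = median(report_counts.values())
    return {
        finding: "Easier" if count > m else "Harder"
        for finding, count in report_counts.items()
    }

# Hypothetical data: finding ID -> number of auditors who reported it.
# The median of [12, 3, 7, 1] is 5, so F1 and F3 fall on the "Easier"
# side and F2 and F4 on the "Harder" side.
counts = {"F1": 12, "F2": 3, "F3": 7, "F4": 1}
print(classify_findings(counts))
# → {'F1': 'Easier', 'F2': 'Harder', 'F3': 'Easier', 'F4': 'Harder'}
```

Note that with an even number of findings the median lands between two values, which is what yields the clean 50/50 split.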

The percentages you see on the scan selection screen represent our tool's detection capability for findings within that category. For example, an "Easier 45%" score means the scan is able to find 45% of the known "easier to find" vulnerabilities in our benchmark dataset.
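In other words, each percentage is a per-category detection rate. A minimal sketch of that calculation, with hypothetical finding IDs and counts chosen to reproduce the "Easier 45%" example:

```python
def detection_rate(detected, benchmark):
    """Percentage of known benchmark findings in a category
    that the scan actually reported."""
    found = sum(1 for f in benchmark if f in detected)
    return round(100 * found / len(benchmark))

# Hypothetical: the scan detects 9 of 20 "easier" benchmark findings.
easier_benchmark = {f"E{i}" for i in range(20)}
scan_results = {f"E{i}" for i in range(9)}
print(f"Easier {detection_rate(scan_results, easier_benchmark)}%")
# → Easier 45%
```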

Our Benchmark Dataset

The current benchmark is built from a dataset comprising:

  • 5 Public Audits
  • 5 Different Repositories
  • 79 Medium to High Severity Security Issues

This diverse dataset allows us to assess the discovery capabilities of our different AuditAgent scan levels.

Important Notes:

1. Difficulty vs. Severity

It is crucial to understand that the Difficulty Benchmark is not an indicator of a vulnerability's severity.

The two concepts are independent:

  • A high-severity issue (e.g., one that could lead to a total loss of funds) might be very simple and therefore "easy" to discover.
  • A low-severity issue could be very complex and extremely "hard" to discover.

The benchmark solely reflects the challenge of detection. Always evaluate the severity and impact of any reported finding within the context of your specific project.

2. Noise and False Positives

While our most capable Deep Scan plan uncovers many more findings, it also generates more noise. It is critical to review each finding carefully and understand that LLMs may generate false positives due to hallucinations.

3. Benchmark Scope vs. Scan Output

It's important to note that the Difficulty Benchmark is calculated exclusively from high- and medium-severity issues confirmed in public audit competitions. This ensures the benchmark reflects our ability to find significant threats.

Our scans, however, also identify a broad range of other items, including low-severity vulnerabilities, informational findings, and best-practice recommendations. These additional findings are not factored into the benchmark metric but will be included in your final scan report.