# Methodology
This page documents the current benchmarking methodology used by scfuzzbench.
## Objectives
- Run different fuzzers under equivalent infrastructure and runtime constraints.
- Pin versions and inputs so runs are reproducible.
- Publish enough raw and processed artifacts for independent inspection.
- Use robust, distribution-aware reporting across repeated runs.
## End-to-End Benchmark Flow
### 1) Define and pin benchmark inputs
Core inputs are defined through Terraform vars and/or workflow dispatch:
- Target: `target_repo_url`, `target_commit`
- Mode: `benchmark_type` (`property` or `optimization`)
- Infra: `instance_type`, `instances_per_fuzzer`, `timeout_hours`
- Fuzzer set: `fuzzers` (or default: all available)
- Tool versions: `foundry_version`, `echidna_version`, `medusa_version`, plus optional `bitwuzla_version`
In CI (`.github/workflows/benchmark-run.yml`), inputs are validated before apply (value ranges, formats, and conservative character constraints).
### 2) Compute run identity and benchmark identity
Terraform computes two IDs used across the pipeline:
- `run_id`:
  - Explicit `var.run_id` if provided.
  - Otherwise `time_static.run.unix` (state-stable; repeated applies can reuse it).
- `benchmark_uuid`: `md5(jsonencode(benchmark_manifest))` in `infrastructure/main.tf`.

`benchmark_manifest` includes pinned context such as:

- `scfuzzbench_commit`, `target_repo_url`, `target_commit`
- `benchmark_type`, `instance_type`, `instances_per_fuzzer`, `timeout_hours`
- `aws_region`, `ubuntu_ami_id`
- tool versions and selected `fuzzer_keys`

This means changing any of those manifest fields changes `benchmark_uuid`.
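The identity scheme can be sketched in Python. This is an illustration only, not the Terraform implementation: Terraform's `jsonencode` serialization rules differ from Python's `json.dumps`, so the exact digest would not match, and the manifest fields shown are a subset with made-up values.

```python
import hashlib
import json

def benchmark_uuid(manifest: dict) -> str:
    """Hash a pinned manifest so that changing any field yields a new identity.

    Illustrative only: Terraform's md5(jsonencode(...)) uses its own
    canonical serialization, so real digests will differ.
    """
    encoded = json.dumps(manifest, sort_keys=True, separators=(",", ":"))
    return hashlib.md5(encoded.encode("utf-8")).hexdigest()

# Hypothetical manifest values for demonstration.
manifest = {
    "target_repo_url": "https://github.com/example/target",
    "target_commit": "abc123",
    "benchmark_type": "property",
    "instance_type": "c6i.2xlarge",
    "timeout_hours": 24,
}
uuid_a = benchmark_uuid(manifest)
uuid_b = benchmark_uuid({**manifest, "timeout_hours": 12})
assert uuid_a != uuid_b  # any manifest change changes the identity
```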
### 3) Provision equivalent runners
Terraform provisions one EC2 instance per (fuzzer, run_index) pair:
- Same AMI family for all (`ubuntu_ami_ssm_parameter`).
- Same instance type and timeout budget for all fuzzers in a run.
- AZ auto-selected from offering data for the requested instance type (unless `availability_zone` is explicitly set).
- `user_data_replace_on_change = true`, so runner behavior changes trigger replacement.
### 4) Execute benchmark on each runner
Runner lifecycle is defined in `infrastructure/user_data.sh.tftpl` and `fuzzers/_shared/common.sh`:
- Install only that runner's fuzzer implementation (`fuzzers/<name>/install.sh`).
- Clone the target repository and check out the pinned commit.
- Build with `forge build`.
- Run the fuzzer command under `timeout` (`SCFUZZBENCH_TIMEOUT_SECONDS`).
- Collect host metrics periodically into `runner_metrics.csv` (enabled by default).
- Upload artifacts to S3, then self-shutdown.
Instances are intentionally one-shot:
- A bootstrap sentinel (`/opt/scfuzzbench/.bootstrapped`) avoids accidental reruns after reboot.
- Shutdown occurs even on failures via trap/finalizer handling.
### 5) Benchmark type switching
`benchmark_type` behavior is applied by `apply_benchmark_type` in `fuzzers/_shared/common.sh`:
- Uses `SCFUZZBENCH_PROPERTIES_PATH` from `fuzzer_env` to locate the properties contract.
- Applies deterministic `sed` transforms for `property` vs `optimization` mode.
- If `optimization` is requested but required markers/files are missing, the run fails early.
### 6) Upload and index artifacts
Each instance uploads:
- Logs zip: `s3://<bucket>/logs/<run_id>/<benchmark_uuid>/i-...-<fuzzer>.zip`
- Optional corpus zip: `s3://<bucket>/corpus/<run_id>/<benchmark_uuid>/i-...-<fuzzer>.zip`
- Benchmark manifest:
  - `logs/<run_id>/<benchmark_uuid>/manifest.json`
  - `runs/<run_id>/<benchmark_uuid>/manifest.json` (timestamp-first index used by docs)
## What Counts as a Complete Run
Docs and release automation use the same completion rule:
`now >= run_id + (timeout_hours * 3600) + 3600`
Notes:

- `run_id` is interpreted as a Unix timestamp.
- `timeout_hours` comes from `manifest.json` (default `24` if missing).
- `3600` is a fixed 1-hour grace window.
This rule is implemented in:

- `scripts/generate_docs_site.py`
- `.github/workflows/benchmark-release.yml`
Only complete runs are listed as benchmark results pages.
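The completion rule can be sketched directly (a minimal restatement, not the implementation in either script):

```python
import time

GRACE_SECONDS = 3600  # fixed 1-hour grace window

def run_is_complete(run_id: int, timeout_hours: float = 24, now=None) -> bool:
    """Apply the rule: now >= run_id + timeout_hours * 3600 + grace.

    run_id is interpreted as a Unix timestamp; timeout_hours defaults to
    24 when missing from manifest.json.
    """
    if now is None:
        now = time.time()
    return now >= run_id + timeout_hours * 3600 + GRACE_SECONDS

# A run started at t=0 with a 24h budget counts as complete only after 25h.
assert not run_is_complete(0, 24, now=24 * 3600)
assert run_is_complete(0, 24, now=25 * 3600)
```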
## Analysis and Reporting Methodology
### Canonical analysis pipeline
The default full pipeline is `make results-analyze-all BUCKET=... RUN_ID=... BENCHMARK_UUID=... DEST=...`, which expands to:
- Download logs/corpus bundles (`scripts/download_run_artifacts.py`)
- Collect `*.log` files into the analysis layout (`scripts/prepare_analysis_logs.py`)
- Parse events + summaries (`scripts/run_analysis_filtered.py` -> `analysis/analyze.py`)
- Convert the event stream to cumulative series (`analysis/events_to_cumulative.py`)
- Build report + charts (`analysis/benchmark_report.py`)
- Build broken-invariant overlap artifacts (`analysis/invariant_overlap_report.py`)
- Build runner CPU/memory artifacts (`analysis/runner_metrics_report.py`)
Optional controls include `EXCLUDE_FUZZERS`, `REPORT_BUDGET`, `REPORT_GRID_STEP_MIN`, `REPORT_CHECKPOINTS`, `REPORT_KS`, `INVARIANT_TOP_K`, and `RUNNER_METRICS_BIN_SECONDS`.
### Event extraction semantics (`analysis/analyze.py`)
- Parser is fuzzer-aware:
  - Foundry: parse JSON lines and count events only from records with `type=invariant_failure`, using the first JSON `timestamp` as the elapsed-time baseline.
  - Medusa: parse elapsed markers and failed assertions/properties from textual logs.
  - Echidna variants: parse falsification markers from textual logs.
  - Unknown fuzzers: fall back to generic pattern parsing.
- Event de-duplication is per run-instance stream (the same event name is counted once per run).
- Outputs:
  - `events.csv` (raw event stream)
  - `summary.csv` (run-level aggregates)
  - `overlap.csv` (cross-fuzzer Jaccard overlap)
  - `exclusive.csv` (events found by exactly one fuzzer)
  - `throughput_samples.csv` (raw tx/s and gas/s samples recovered from logs when available)
  - `throughput_summary.csv` (per-fuzzer tx/s and gas/s distribution summary)
  - `progress_metrics_samples.csv` (raw fuzzer-native progress metrics such as seq/s, coverage proxy, corpus size, favored items, and failure rate, when available)
  - `progress_metrics_summary.csv` (per-fuzzer distribution summary of those progress metrics)
### Cumulative conversion (`analysis/events_to_cumulative.py`)
- Produces long-form CSV: `fuzzer, run_id, time_hours, bugs_found`.
- Run keys are stabilized as `run_id:instance_id`.
- When `--logs-dir` is provided, runs with zero detected events still emit a time-`0` row (unless `--no-zero`).
### Report generation (`analysis/benchmark_report.py`)
- Validates each run's cumulative sequence:
  - non-decreasing time
  - non-decreasing integer bug counts
  - non-negative counts
- Resamples all runs onto a common forward-filled time grid (`REPORT_GRID_STEP_MIN`, default 6 min).
- Computes distribution-oriented metrics per fuzzer:
  - checkpoint medians + IQR
  - normalized AUC
  - plateau time
  - late discovery share
  - time-to-k median + reach rate
  - final distribution (median + IQR)
- Note: these report scorecards are count-based. They do not score severity or root-cause uniqueness.
- If `throughput_summary.csv` is present, the report also includes tx/s and gas/s summary tables.
- If `throughput_samples.csv` is present, the report also emits throughput trend charts (`tx_per_second_over_time.png`, `gas_per_second_over_time.png`).
- If `progress_metrics_summary.csv` is present, the report also includes per-fuzzer progress proxy tables (seq/s, coverage, corpus, favored, failure rate) and progress-metrics summary charts.
- If `progress_metrics_samples.csv` is present, the report also emits progress trend charts (`seq_per_second_over_time.png`, `coverage_proxy_over_time.png`, `corpus_size_over_time.png`, `favored_items_over_time.png`, `failure_rate_over_time.png`).
- Emits:
  - `REPORT.md`
  - `bugs_over_time.png`
  - `time_to_k.png`
  - `final_distribution.png`
  - `plateau_and_late_share.png`
If the input CSV is empty, the report explicitly records the no-data condition and emits placeholder plots.
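The forward-filled resampling and a normalized-AUC score can be sketched as follows. The grid, sample data, and the exact normalization (peak count times duration) are illustrative assumptions; the report script's precise formula may differ.

```python
def forward_fill(samples, grid):
    """Carry the last observed cumulative bug count forward onto each grid point."""
    out, last, i = [], 0, 0
    for t in grid:
        while i < len(samples) and samples[i][0] <= t:
            last = samples[i][1]
            i += 1
        out.append(last)
    return out

def normalized_auc(series, grid):
    """Trapezoidal area under bugs-over-time, normalized by peak count * duration."""
    duration = grid[-1] - grid[0]
    peak = max(series) or 1
    area = sum(
        (series[k] + series[k + 1]) / 2 * (grid[k + 1] - grid[k])
        for k in range(len(series) - 1)
    )
    return area / (peak * duration)

grid = [0.0, 0.1, 0.2, 0.3, 0.4]  # hours; the default grid step is 6 min
run = [(0.05, 1), (0.25, 3)]      # (time_hours, cumulative bugs)
series = forward_fill(run, grid)
assert series == [0, 1, 1, 3, 3]
```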
### Broken-invariant overlap (`analysis/invariant_overlap_report.py`)
- Uses `events.csv` (optionally budget-filtered) to summarize which invariant/event names were observed.
- Emits:
  - `broken_invariants.md`
  - `broken_invariants.csv`
  - `invariant_overlap_upset.png`
- These artifacts provide per-fuzzer totals, exclusives, shared subsets, and normalized invariant labels.
- Important interpretation note: UpSet overlap is approximate, not exact root-cause equivalence.
  - Two assertions inside one target function can represent distinct bugs (for example, one in the `try` success path vs one in the `catch` path, where one indicates an unexpected successful-result condition and the other indicates a DoS/revert behavior).
  - Foundry-side assertion surfacing depends on the current foundry-rs/foundry#13322 behavior (https://github.com/foundry-rs/foundry/issues/13322), so normalized overlap should be read as an approximation.
  - Even in Echidna vs Medusa comparisons, overlap is still approximate: Echidna may falsify `assert(x != y)` while Medusa falsifies `assert(a != b)` in the same target-function body; these are distinct bugs even if function-level normalization groups them together.
- UpSet chart layout follows: Lex A, Gehlenborg N, Strobelt H, Vuillemot R, Pfister H. UpSet: Visualization of Intersecting Sets. IEEE TVCG 20(12), 2014 (doi:10.1109/TVCG.2014.2346248).
### Runner resource reporting (`analysis/runner_metrics_report.py`)
- Uses `runner_metrics*.csv` files collected on each runner to summarize host resource usage over time.
- Emits:
  - `runner_resource_usage.md`
  - `runner_resource_summary.csv`
  - `runner_resource_timeseries.csv`
  - `cpu_usage_over_time.png`
  - `memory_usage_over_time.png`
- CPU is reported as active percentage (`user + system + iowait`), and memory is reported as used percentage/GiB.
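The active-CPU computation can be sketched from two cumulative counter snapshots, in the style of `/proc/stat`. The counter fields and values below are assumptions; the actual `runner_metrics.csv` column layout may differ, and real `/proc/stat` has additional fields (nice, irq, steal, ...) omitted here.

```python
def cpu_active_percent(prev: dict, cur: dict) -> float:
    """Percentage of CPU time spent active (user + system + iowait)
    between two cumulative counter snapshots."""
    fields = ("user", "system", "iowait", "idle")
    deltas = {f: cur[f] - prev[f] for f in fields}
    total = sum(deltas.values())
    if total == 0:
        return 0.0
    active = deltas["user"] + deltas["system"] + deltas["iowait"]
    return 100.0 * active / total

# Hypothetical snapshots taken one sampling interval apart.
prev = {"user": 100, "system": 50, "iowait": 10, "idle": 840}
cur = {"user": 160, "system": 70, "iowait": 10, "idle": 960}
assert round(cpu_active_percent(prev, cur), 1) == 40.0
```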
## Publication and Release
Benchmark Release workflow:
- Discovers complete runs automatically (or accepts explicit `benchmark_uuid` + `run_id`).
- Runs the same analysis pipeline in CI.
- Publishes analysis artifacts to `s3://<bucket>/analysis/<benchmark_uuid>/<run_id>/...`.
- Creates a GitHub release tag: `scfuzzbench-<benchmark_uuid>-<run_id>`.
The docs site also supports legacy analysis under `reports/<benchmark_uuid>/<run_id>/`, but new runs should use `analysis/...`.
## Missing Analysis Triage
If a run is complete but shows missing analysis:
- Re-run the GitHub Actions `Benchmark Release` workflow for that `benchmark_uuid` + `run_id`.
- Or run analysis locally and upload artifacts to `analysis/<benchmark_uuid>/<run_id>/`.
- If the run is junk, delete its S3 prefixes (`runs/`, `logs/`, optional `corpus/`, partial `analysis/`).
These runs remain visible in docs to support triage.
## Choosing Target Projects
- Issue reference: https://github.com/Recon-Fuzz/scfuzzbench/issues/8
- Guideline source: https://github.com/fuzz-evaluator/guidelines
Target selection should follow guideline items A.2.2-A.2.5:
- A.2.2: select a representative set from the target domain.
- A.2.3: include targets used by related work for comparability.
- A.2.4: do not cherry-pick targets based on preliminary outcomes.
- A.2.5: avoid overlapping codebases with substantial shared code.
Recommended operational policy for this repository:
- Freeze the target list before benchmark execution.
- Pin each target to an immutable commit.
- Record a rationale for each target (why it improves representativeness).
- Include related-work targets where feasible, and cite source papers/benchmarks.
- Track overlap groups (for forks/wrappers/shared-core code) and keep only one representative per overlap group unless explicitly justified.
- Keep the selection manifest in-repo so additions/removals are reviewable.
Suggested manifest fields per target:
- repository URL
- pinned commit
- properties path (`SCFUZZBENCH_PROPERTIES_PATH`)
- benchmark mode(s) intended
- rationale
- related-work reference(s)
- overlap group / exclusion notes
## Caveats and Reproducibility Notes
- `timeout_hours` applies to fuzzer execution; clone/build/setup occur before timed fuzzing starts.
- Re-running Terraform without changing state can reuse the `time_static` `run_id`; set an explicit `run_id` for distinct runs.
- Bucket defaults allow public object read (`bucket_public_read = true`) so docs/releases can link directly to S3 artifacts.
- Keep secrets out of Terraform vars and docs; use SSM or environment-based secret handling.
