Benchmarks (micro)

This page covers micro-benchmarks: per-syscall and per-file tracer overhead on synthetic workloads, hashing throughput by algorithm and core count, and S3 proxy latency. For measured overhead on a real multi-GPU pre-training job, see Benchmarks (e2e).

roar captures lineage by observing execution at runtime — watching syscalls, hashing artifacts, recording provenance — rather than requiring manual logging or pipeline declarations. That approach only works if the observer itself stays operationally cheap. This page characterizes exactly how cheap, measured on real hardware, so you can decide whether roar fits your performance budget.

Every number below was produced by scripts in the roar repo on a single reference machine (see Setup); the Reproducing these benchmarks section has the exact commands.

What this means for real workloads

| Workload | Expected roar overhead | Notes |
| --- | --- | --- |
| 1-hour training job (few hundred files) | < 0.1% of wall-clock | Tracer adds ~20–50 ms; post-run hashing takes seconds |
| 10-minute preprocessing stage | typically < 1% | Dominated by the fixed ~2.5 s startup; varies with file count |
| 10 GB of checkpoints, 8 cores | ~1 s to hash | blake3 at ~10 GB/s parallel cold read |
| 1 TB dataset, 16 cores | ~1.2 min to hash | Disk-bound at ~14 GB/s SSD ceiling |
| Sub-second shell commands in a loop | Not recommended | The 2.5 s per-invocation startup dominates |

When the overhead matters

roar run adds a fixed ~2.5 s per invocation (Python CLI startup + tracer attach). This is negligible for:

  • GPU training runs — PyTorch, JAX, TensorFlow jobs that run for minutes to hours
  • Data preprocessing — ETL, feature engineering, dataset generation steps
  • Distributed jobs — Ray, torchrun, and similar workflows where individual tasks run for seconds or longer
  • Notebook and interactive workflows — best for cells or commands that run for seconds or longer; not for tiny exploratory snippets

The fixed startup becomes a problem when wrapping many short-lived commands — a Makefile with hundreds of sub-second targets, or a shell script that calls roar run in a tight loop. For those patterns, wrap the outer script rather than each individual command, or use roar's session model to amortize the startup.
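
The arithmetic behind that advice can be sketched as follows. This is an illustrative calculation only: it uses the ~2.5 s fixed cost quoted above, and the 300-target Makefile at 0.2 s per target is a hypothetical example, not a measured workload.

```python
# Illustrative arithmetic only. FIXED_STARTUP_S is the ~2.5 s per-invocation
# figure quoted on this page; the workload below is hypothetical.
FIXED_STARTUP_S = 2.5


def total_runtime_s(n_commands: int, seconds_per_command: float, wrap_each: bool) -> float:
    """Wall-clock time for n commands, wrapping each one or only the outer script."""
    work = n_commands * seconds_per_command
    invocations = n_commands if wrap_each else 1
    return work + invocations * FIXED_STARTUP_S


# Hypothetical Makefile: 300 targets at 0.2 s each.
per_target = total_runtime_s(300, 0.2, wrap_each=True)
outer_only = total_runtime_s(300, 0.2, wrap_each=False)
print(f"wrap each target:  {per_target:.1f} s")
print(f"wrap outer script: {outer_only:.1f} s")
```

Under these assumptions the per-target wrapping pays the startup 300 times, while wrapping the outer script pays it once.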

Choosing a tracer backend

roar chooses the best available backend automatically; preload is the default fallback, while eBPF is preferred when enabled and supported. The rest of this page breaks the headline numbers into components. The short version:

  • preload (default fallback) — best for most ML workloads. Starts in ~3 ms, adds ~33 µs per file. Lower total overhead than eBPF until ~1,600 files per step.
  • eBPF — best for I/O-heavy or high-file-count workloads (many small shards, data preprocessing over millions of records, shard-heavy dataloaders). Adds ~5 µs per file but carries a ~50 ms daemon startup. Wins decisively above ~1,600 files.
  • ptrace — universal fallback. Works everywhere but costs ~207 µs per file. Use only when preload and eBPF are unavailable.

Tracers

How much wall-clock overhead each tracer backend adds on top of a baseline command. See Tracers for the backend mechanics and tradeoffs that explain why the numbers differ. The short version: ptrace pays two context switches per syscall, preload interposes libc in-process, and eBPF filters in-kernel and ships events to a persistent daemon (roard). Those mechanics produce very different overhead profiles, decomposed below into a fixed startup cost and a marginal per-file / per-syscall cost.

Startup cost

Fixed overhead to wrap a trivial command (true), measured against an untraced baseline of ~0.5 ms.

| Backend | Startup overhead |
| --- | --- |
| ptrace | ~1.3 ms |
| preload | ~2.9 ms |
| eBPF (cold, first traced command) | ~112 ms |
| eBPF + daemon (warm, steady state) | ~50 ms |

eBPF's startup is dominated by its daemon. The first traced command loads the BPF program into the kernel and starts roard, a one-time ~112 ms cost. roard then keeps the program resident, so every subsequent command skips the load and pays only ~50 ms for the per-invocation work (forking the child under SIGSTOP, registering it with the daemon, and setting up the per-run perf-buffer consumer). ptrace and preload have no daemon and no kernel program to load, so their startup is near-instant by comparison.

This is the one place eBPF is not the cheapest backend. Its per-invocation fixed cost is roughly 17× preload's. eBPF earns its keep on marginal cost, below — not on startup.

Per-file overhead

Marginal cost added per file touched (one write + one read each), from a linear regression over workloads of 0–5,000 files. This is the headline tracer number for file-count-bound workloads.

| Backend | Per-file overhead |
| --- | --- |
| ptrace | 207 µs |
| preload | 33 µs |
| eBPF + daemon | 4.7 µs |

eBPF's in-kernel filtering makes each tracked file roughly 7× cheaper than preload and 44× cheaper than ptrace.

Per-syscall overhead

Per-file cost folds in the open/close around each file. The per-read/write cost is lower and worth separating, since a single large file is many read/write syscalls. From a regression over 400–25,600 read+write operations against a fixed set of files:

| Backend | Per-read/write overhead |
| --- | --- |
| ptrace | 86 µs |
| preload | 29 µs |
| eBPF + daemon | ~1.0 µs |

This is where eBPF separates from the field: ~1 µs per syscall versus preload's 29 µs and ptrace's 86 µs. The two context switches ptrace forces on every syscall are the entire story of its 86 µs.
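
To make the per-syscall figures concrete, here is a rough estimate of tracer overhead for reading one large file sequentially. It is a sketch under stated assumptions: the 1 MiB read buffer is a hypothetical choice (real applications use many different buffer sizes, and a read can return fewer bytes than requested), and the per-syscall costs are the regression values from the table above.

```python
# Per-read/write overheads from the table above, in microseconds.
PER_SYSCALL_US = {"ptrace": 86.0, "preload": 29.0, "ebpf_daemon": 1.0}

GIB = 1024 ** 3
MIB = 1024 ** 2


def read_overhead_ms(file_bytes: int, buffer_bytes: int, backend: str) -> float:
    """Estimated tracer overhead for reading one file with fixed-size reads."""
    n_reads = -(-file_bytes // buffer_bytes)  # ceiling division
    return n_reads * PER_SYSCALL_US[backend] / 1000.0


# Assumed workload: one 1 GiB file read in 1 MiB chunks (1,024 read calls).
for backend in PER_SYSCALL_US:
    ms = read_overhead_ms(1 * GIB, 1 * MIB, backend)
    print(f"{backend:12s} {ms:7.1f} ms")
```

Smaller buffers raise the syscall count proportionally, so the gap between backends widens for code that reads in small chunks.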

Projected overhead

Projecting startup + per-file × N linearly across file counts:

| Files | eBPF + daemon | preload | ptrace |
| --- | --- | --- | --- |
| 100 | ~50 ms | ~6 ms | ~22 ms |
| 1,000 | ~55 ms | ~36 ms | ~210 ms |
| 10,000 | ~97 ms | ~340 ms | ~2.1 s |
| 100,000 | ~520 ms | ~3.3 s | ~21 s |

The crossover is the interesting part. Because eBPF carries a ~50 ms fixed daemon cost while preload starts near zero, preload has lower total tracer overhead until a run touches roughly 1,600 files; above that, eBPF's tiny marginal cost wins decisively. Many GPU training workflows sit in the 100–1,000 file range — reading a few shards, writing a handful of checkpoints — where eBPF and preload are within a few tens of milliseconds of each other and both are negligible against a job that runs for minutes. Shard-heavy dataloaders, parquet-based pipelines, webdataset readers, and synthetic data generation can touch far more files; those workloads are where eBPF's marginal cost advantage compounds. eBPF also covers static and setuid binaries that preload cannot intercept. ptrace stays the universal fallback but its per-syscall cost makes it the wrong default for I/O-heavy work.
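
The projection and the crossover point follow directly from the linear model (startup + per-file × N). The sketch below plugs in the startup and per-file figures reported on this page; the function and dictionary names are illustrative, and the results match the projected table only up to rounding.

```python
# (startup_ms, per_file_ms) per backend, taken from the tables on this page.
BACKENDS = {
    "ebpf_daemon": (50.0, 0.0047),
    "preload": (2.9, 0.033),
    "ptrace": (1.3, 0.207),
}


def projected_ms(backend: str, n_files: int) -> float:
    """Linear projection: fixed startup plus marginal per-file cost."""
    startup_ms, per_file_ms = BACKENDS[backend]
    return startup_ms + per_file_ms * n_files


def crossover_files(a: str, b: str) -> float:
    """File count at which backends a and b have equal projected overhead."""
    (startup_a, per_file_a), (startup_b, per_file_b) = BACKENDS[a], BACKENDS[b]
    return (startup_a - startup_b) / (per_file_b - per_file_a)


for n in (100, 1_000, 10_000, 100_000):
    row = ", ".join(f"{name}={projected_ms(name, n):.0f} ms" for name in BACKENDS)
    print(f"{n:>7} files: {row}")

crossover = crossover_files("ebpf_daemon", "preload")
print(f"eBPF/preload crossover: ~{crossover:.0f} files")
```

Solving 50 + 0.0047·N = 2.9 + 0.033·N gives N in the mid-1,600s, which is where the "roughly 1,600 files" figure comes from.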

Measured scaling data

The raw per-file series behind the regressions (overhead in ms over the untraced baseline, mean of 10 iterations):

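| Files (N) | Baseline | ptrace | preload | eBPF | eBPF + daemon |
| --- | --- | --- | --- | --- | --- |
| 0 | 9.6 ms | +3.5 | +2.5 | +44.8 | +40.6 |
| 200 | 28.2 ms | +45.3 | +10.4 | +25.2 | +22.0 |
| 500 | 42.8 ms | +104.5 | +19.3 | +11.4 | +7.2 |
| 1,000 | 66.1 ms | +204.3 | +36.6 | +36.8 | +34.1 |
| 2,000 | 112.8 ms | +396.4 | +69.5 | +41.7 | +37.3 |
| 5,000 | 252.5 ms | +1041.5 | +170.0 | +52.5 | +48.2 |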

ptrace grows steeply and linearly; preload grows gently; both eBPF columns stay roughly flat (a fixed daemon floor plus a marginal cost so small it barely registers at these counts). The eBPF and eBPF+daemon columns track each other closely — in steady state both run through the persistent daemon. Their fixed floor sits around 40–50 ms here, consistent with the ~50 ms warm startup measured directly above.

A note on the eBPF startup figures. The per-file regression extrapolates an eBPF intercept of ~25–28 ms, but direct measurement of a warm daemon (including replicating the benchmark's exact roard startup) consistently shows ~50 ms per invocation and ~112 ms for a cold first run. The regression intercept is pulled below the true floor by the curvature and run-to-run noise of the small-N points; the directly measured ~50 ms / ~112 ms figures are what the Startup cost table reports.
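
For readers who want to check the regression themselves, the following is a minimal least-squares fit over the mean overheads published in the measured scaling table. It is a sketch, not the benchmark script's own analysis: the repo's regression may use per-iteration samples rather than these means, so small differences are expected.

```python
# Overhead in ms over baseline, copied from the measured scaling table.
FILES = [0, 200, 500, 1000, 2000, 5000]
OVERHEAD_MS = {
    "ptrace": [3.5, 45.3, 104.5, 204.3, 396.4, 1041.5],
    "preload": [2.5, 10.4, 19.3, 36.6, 69.5, 170.0],
    "ebpf_daemon": [40.6, 22.0, 7.2, 34.1, 37.3, 48.2],
}


def fit_line(xs, ys):
    """Ordinary least squares for y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


fits = {name: fit_line(FILES, ys) for name, ys in OVERHEAD_MS.items()}
for name, (slope_ms, intercept_ms) in fits.items():
    print(f"{name:12s} {slope_ms * 1000:6.1f} us/file, intercept {intercept_ms:6.1f} ms")
```

On these six points the fitted slopes land close to the reported 207, 33, and 4.7 µs per file, and the eBPF + daemon intercept comes out in the mid-20s of milliseconds, well below the directly measured ~50 ms, which illustrates the discrepancy described above.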

End-to-end roar run (synthetic)

What users actually feel when they prefix a command with roar run. This wraps the tracer (auto mode resolved to eBPF on this host), then hashes every output file and records provenance to the local SQLite database. Baseline is the bare command; roar run is the same command wrapped. Workloads write then read N 4 KiB files; 7 iterations after 2 warmups. For overhead measured during a real training run, see Benchmarks (e2e).

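| Files | Baseline | roar run | Overhead |
| --- | --- | --- | --- |
| 0 | 9.8 ms | 2,511 ms | +2,502 ms |
| 100 | 12.4 ms | 2,520 ms | +2,508 ms |
| 500 | 25.1 ms | 2,572 ms | +2,547 ms |
| 1,000 | 40.7 ms | 2,603 ms | +2,563 ms |
| 5,000 | 165.9 ms | 3,107 ms | +2,941 ms |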

The fixed floor is ~2.5 s: the dominant cost for short-lived commands, but a constant that does not scale with your workload. For long-running ML jobs, the only part that grows is the marginal cost. A ROAR_TIMING breakdown of the zero-file case attributes the fixed floor as:

  • roar CLI startup: ~1.6 s — importing the Python package and initializing the run. Paid once per roar run invocation regardless of what the command does.
  • tracer: ~0.66 s — spawning/attaching the eBPF tracer and tracing the wrapped Python interpreter's own startup syscalls (imports do hundreds of file opens before your code runs).
  • post-run: ~0.26 s — hashing outputs and writing provenance (provenance ~31 ms, DB record ~228 ms).

The number that actually scales with your workload is marginal cost: ~0.09 ms (~90 µs) per tracked file. Total overhead grows from +2,502 ms at 0 files to +2,941 ms at 5,000 files, about 88 µs per file for tracing, hashing, and recording each one. So a step that reads 500 shards and writes 10 checkpoints adds well under 100 ms of marginal cost on top of the fixed startup. For a training step that runs for minutes or hours, the entire roar run overhead is rounding error; the ~2.5 s fixed floor matters only if you are wrapping thousands of sub-second commands, in which case a longer-lived session amortizes it.
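
The marginal-cost figure and the 500-shard example can be reproduced with a two-point estimate from the end-to-end table. This is a back-of-the-envelope sketch, not the benchmark's own analysis; the helper names are illustrative, and the 1-hour job duration is an assumed example.

```python
# End-to-end overhead (ms) at 0 and 5,000 files, from the table above.
OVERHEAD_0_MS = 2502.0
OVERHEAD_5000_MS = 2941.0

# Marginal cost per tracked file, in ms (two-point slope).
marginal_ms = (OVERHEAD_5000_MS - OVERHEAD_0_MS) / 5000


def roar_run_overhead_ms(n_files: int) -> float:
    """Fixed floor plus marginal cost for a run touching n_files files."""
    return OVERHEAD_0_MS + marginal_ms * n_files


def overhead_fraction(n_files: int, job_seconds: float) -> float:
    """Share of a job's wall-clock time spent on roar run overhead."""
    return roar_run_overhead_ms(n_files) / 1000.0 / job_seconds


step_files = 500 + 10  # 500 shards read, 10 checkpoints written
print(f"marginal cost: {marginal_ms * 1000:.0f} us/file")
print(f"marginal cost for {step_files} files: {marginal_ms * step_files:.0f} ms")
print(f"share of a 1-hour job: {overhead_fraction(step_files, 3600):.3%}")
```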

Hashes

Every artifact roar records is identified by a content hash (see Hashes). After a job, roar hashes its outputs — so hashing throughput sets the post-run cost on data-heavy steps. The numbers that matter are cold-read numbers: training data and checkpoints are read from disk, not from a warm page cache.

blake3 hashing time by data volume and core count

Wall-clock time to blake3-hash a dataset, by total volume and worker count, using cold-read throughput (data streamed from the SSD, not page cache). This is what you experience hashing real data.

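| Volume | 1 vCPU | 2 vCPU | 4 vCPU | 8 vCPU | 16 vCPU | 30 vCPU |
| --- | --- | --- | --- | --- | --- | --- |
| 1 GB | 0.54 s | 0.28 s | 0.17 s | 0.10 s | 0.071 s | 0.071 s |
| 10 GB | 5.4 s | 2.8 s | 1.7 s | 1.0 s | 0.71 s | 0.71 s |
| 100 GB | 54 s | 28 s | 17 s | 10 s | 7.1 s | 7.1 s |
| 1 TB | 9.3 min | 4.7 min | 2.8 min | 1.7 min | 1.2 min | 1.2 min |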

Cold throughput scales with workers up to ~16, then plateaus at the SSD's parallel read ceiling of ~14 GB/s — so the 16- and 30-vCPU columns are identical (adding cores past the disk ceiling buys nothing). At ~14 GB/s, hashing a 1 TB dataset takes a little over a minute; the per-core ceiling (~1.8 GB/s cold) means a single worker takes ~9 minutes for the same data.

Measured scaling (200 × 50 MiB = 9.77 GiB, blake3, 3 reps). Cold throughput is roughly half of cached throughput through 16 workers, and the two diverge at the top end, where cached keeps scaling on CPU while cold saturates the disk:

| vCPUs | Cached (page cache) | Cold (from SSD) |
| --- | --- | --- |
| 1 | 3.2 GB/s | 1.8 GB/s |
| 2 | 6.5 GB/s | 3.6 GB/s |
| 4 | 12.4 GB/s | 6.0 GB/s |
| 8 | 20.8 GB/s | 9.9 GB/s |
| 16 | 31.6 GB/s | 14.1 GB/s |
| 30 | 37.2 GB/s | 13.5 GB/s |

Cached throughput is CPU-bound and scales nearly linearly to ~16 cores (~37 GB/s at 30). Cold throughput is disk-bound and plateaus at ~14 GB/s by 16 workers — the 30-worker number is marginally lower, contention past the saturation point. Publishing the cached numbers as if they were real-world would overstate large-dataset throughput by ~2.6×.
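
The hashing-time table can be approximated by dividing data volume by the measured cold throughput. The sketch below assumes decimal units (1 TB = 1,000 GB) and only covers the worker counts that were actually measured; the names are illustrative. The 30-worker results differ slightly from the time table, which uses the ~14 GB/s ceiling rather than the measured 13.5 GB/s.

```python
# Measured cold-read blake3 throughput in GB/s by worker count (table above).
COLD_GBPS = {1: 1.8, 2: 3.6, 4: 6.0, 8: 9.9, 16: 14.1, 30: 13.5}


def cold_hash_seconds(volume_gb: float, workers: int) -> float:
    """Estimated wall-clock time to hash volume_gb of data from a cold SSD."""
    return volume_gb / COLD_GBPS[workers]


for volume_gb in (1, 10, 100, 1000):
    row = "  ".join(
        f"{w:>2}w={cold_hash_seconds(volume_gb, w):8.2f}s" for w in COLD_GBPS
    )
    print(f"{volume_gb:>5} GB: {row}")
```

To estimate your own post-run hashing cost, substitute the cold throughput you measure on your storage for the values in COLD_GBPS.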

Algorithm comparison

Single-threaded throughput on the same hardware, page-cached (this isolates raw hash speed from disk):

| Algorithm | Throughput (single-threaded) | Parallel |
| --- | --- | --- |
| blake3 | 3.3 GB/s | scales across cores: ~37 GB/s on 30 cores (cached); see scaling table |
| sha256 | 1.0 GB/s | doesn't parallelize across one input |
| sha512 | 0.54 GB/s | doesn't parallelize across one input |
| md5 | 0.60 GB/s | doesn't parallelize across one input |

blake3 is the default because it is ~3.2× faster than sha256 single-threaded and, unlike the SHA family and MD5, its Merkle-tree structure parallelizes — both with SIMD within a hash and across files when roar hashes a batch of outputs. That directly cuts the wall-clock cost of post-run hashing on large outputs. roar falls back to sha256 automatically if the blake3 module is missing, and can record sha256 alongside blake3 when you need interop with ecosystems that standardize on it (Hugging Face, git, container registries).
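
The fallback behavior described above can be illustrated with a small sketch. This is not roar's implementation, only one way such a fallback could look: it tries the third-party blake3 Python package and uses hashlib's sha256 if the import fails. The function name hash_file and the 1 MiB chunk size are assumptions made for the example.

```python
import hashlib

try:
    # Third-party "blake3" package; it may not be installed.
    from blake3 import blake3 as _new_hasher
    ALGORITHM = "blake3"
except ImportError:
    # Fall back to the standard library when blake3 is unavailable.
    _new_hasher = hashlib.sha256
    ALGORITHM = "sha256"


def hash_file(path: str, chunk_bytes: int = 1 << 20) -> tuple[str, str]:
    """Stream a file through the selected hasher; returns (algorithm, hex digest)."""
    hasher = _new_hasher()
    with open(path, "rb") as f:
        while chunk := f.read(chunk_bytes):
            hasher.update(chunk)
    return ALGORITHM, hasher.hexdigest()
```

Returning the algorithm name alongside the digest keeps records unambiguous, since blake3 and sha256 digests are both 64 hex characters long.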

Wall-clock hashing cost per file

Time to hash a single file at various sizes, page-cached, single-threaded:

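| File size | blake3 | sha256 | sha512 | md5 |
| --- | --- | --- | --- | --- |
| 10 MB | 0.003 s | 0.009 s | 0.018 s | 0.016 s |
| 100 MB | 0.031 s | 0.096 s | 0.18 s | 0.16 s |
| 1 GB | 0.30 s | 0.97 s | 1.8 s | 1.7 s |
| 10 GB | 2.9 s | 9.7 s | 18 s | 17 s |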

These are page-cached (CPU-bound) times. On a cold read from the SSD, single-threaded blake3 runs about 1.8× slower (~1.8 GB/s vs ~3.3 GB/s) because read and hash serialize in one thread; the per-core-count table above is the better guide for cold, multi-worker hashing. The 10 GB blake3 figure is measured; the 10 GB SHA/MD5 figures are linear extrapolations from their (size-independent) throughput.

Native hashing backend

roar ships a Rust extension (roar._hash_native) that hashes files in parallel, computing multiple algorithms in a single pass over each file. Against a pure-Python baseline on a mixed dataset (608 files, 68.7 MiB, blake3 + sha256):

| Backend | Throughput | Files/s | Speedup |
| --- | --- | --- | --- |
| Python | 799 MiB/s | 7,074 | 1.0× |
| native (Rust) | 3,627 MiB/s | 32,101 | 4.5× |

The native backend is what runs by default; the 4.5× speedup comes from one-pass multi-hash and parallelism across files, and matters most on jobs that emit many output files.
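
The two ideas behind that speedup, one pass per file for several algorithms and parallelism across files, can be sketched in plain Python. This is an illustration of the technique, not the Rust extension's code: it uses sha256 and sha512 from the standard library as stand-ins (blake3 is not in hashlib), and the function names, chunk size, and worker count are assumptions.

```python
import hashlib
from concurrent.futures import ThreadPoolExecutor


def multi_hash(path, algorithms=("sha256", "sha512"), chunk_bytes=1 << 20):
    """Read the file once and feed every chunk to each hasher."""
    hashers = {name: hashlib.new(name) for name in algorithms}
    with open(path, "rb") as f:
        while chunk := f.read(chunk_bytes):
            for hasher in hashers.values():
                hasher.update(chunk)
    return {name: hasher.hexdigest() for name, hasher in hashers.items()}


def hash_many(paths, workers=4):
    """Hash a batch of files concurrently; returns {path: {algorithm: digest}}."""
    paths = list(paths)
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return dict(zip(paths, pool.map(multi_hash, paths)))
```

Calling hash_many on a list of output paths returns one digest per algorithm for each file while reading each file only once. Threads are a reasonable fit here because hashlib releases the GIL while hashing large buffers, though a native implementation avoids the interpreter overhead entirely.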

Proxy

The S3 proxy intercepts S3 traffic to capture remote lineage, re-signing and forwarding each request. The proxy is a single Rust binary (roar-proxy) running on localhost; its overhead has two components: the proxy's own CPU work (signing + credential resolution), which is fixed and deployment-independent, and the network round-trip to S3, which the proxy passes through unchanged.

Proxy CPU overhead

The proxy instruments its own phases internally. Across 400 requests (200 PUTs + 200 GETs) per size tier:

| Phase | Median | p95 | p99 |
| --- | --- | --- | --- |
| Credential resolution | 1.4 ms | 3.2 ms | 5.6 ms |
| Request signing | 0.07 ms | 0.09 ms | 0.11 ms |
| Total proxy CPU | ~1.5 ms | ~3.3 ms | ~5.7 ms |

This is the overhead roar controls. It does not vary by network path or deployment. The p99 spikes in credential resolution (~6 ms) are credential cache misses or token refreshes, not sustained overhead.

Total observed overhead

End-to-end latency delta (proxy vs direct-to-S3), measured from an EC2 instance in us-east-1 (same region as the S3 bucket), 200 iterations with paired request ordering. Median values only — tail percentiles of total overhead reflect S3 response jitter, not proxy behavior.

PUT:

| Object size | Direct (median) | Proxy (median) | Overhead |
| --- | --- | --- | --- |
| 64 KB | 28.8 ms | 28.2 ms | ~0 ms |
| 1 MB | 35.8 ms | 40.3 ms | +5 ms |
| 8 MB | 100.0 ms | 133.2 ms | +33 ms |

GET:

| Object size | Direct (median) | Proxy (median) | Overhead |
| --- | --- | --- | --- |
| 64 KB | 25.1 ms | 27.5 ms | +2 ms |
| 1 MB | 27.3 ms | 31.3 ms | +4 ms |
| 8 MB | 95.7 ms | 108.8 ms | +13 ms |

Range GET:

| Object size | Direct (median) | Proxy (median) | Overhead |
| --- | --- | --- | --- |
| 64 KB | 24.8 ms | 28.1 ms | +3 ms |
| 1 MB | 26.6 ms | 31.6 ms | +5 ms |
| 8 MB | 29.0 ms | 31.5 ms | +2 ms |

For small and medium objects (up to 1 MB), total overhead is at most ~5 ms, ranging from ~0 ms for a 64 KB PUT to +5 ms for a 1 MB PUT or Range GET. For 8 MB objects, PUT overhead is higher (~33 ms) because the proxy buffers and re-signs the full request body. GET overhead at 8 MB is ~13 ms. Range GETs stay low regardless of object size because the transferred payload is capped at 1 MB.
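
The overhead column is simply the proxy median minus the direct median. The sketch below recomputes it from the published medians and adds the overhead relative to the direct request, which is a derived figure not reported in the tables; the names are illustrative.

```python
# Median latencies in ms as (direct, proxy), copied from the tables above.
MEDIANS_MS = {
    "PUT": {"64 KB": (28.8, 28.2), "1 MB": (35.8, 40.3), "8 MB": (100.0, 133.2)},
    "GET": {"64 KB": (25.1, 27.5), "1 MB": (27.3, 31.3), "8 MB": (95.7, 108.8)},
    "Range GET": {"64 KB": (24.8, 28.1), "1 MB": (26.6, 31.6), "8 MB": (29.0, 31.5)},
}


def overhead_table(medians):
    """Return {(operation, size): (delta_ms, overhead relative to direct)}."""
    out = {}
    for op, sizes in medians.items():
        for size, (direct, proxy) in sizes.items():
            delta = proxy - direct
            out[(op, size)] = (delta, delta / direct)
    return out


overheads = overhead_table(MEDIANS_MS)
for (op, size), (delta, rel) in overheads.items():
    print(f"{op:9s} {size:>5s}  {delta:+6.1f} ms  ({rel:+.1%})")
```

A slightly negative delta, as for the 64 KB PUT, means the difference is within S3 response jitter rather than a real speedup.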

Proxy measurements were taken from an EC2 instance (m5.2xlarge) in the same AWS region as the S3 bucket (us-east-1), not from the Lambda Cloud reference machine described in Setup. The proxy's own CPU work (~1.5 ms median) is hardware-independent; total observed overhead depends on network path and object size. The proxy buffers request and response bodies up to 16 MB by default (configurable via ROAR_PROXY_BUFFER_RESPONSE_BYTES); objects above this threshold are streamed.

These benchmarks isolate steady-state single-request latency overhead. Large multipart transfers, retry-heavy environments, concurrent worker fanout, and WAN deployments introduce additional variability not characterized here. To benchmark the proxy against your own endpoint, see tests/benchmarks/bench_proxy_s3_live.py in the roar repo.

Reproducing these benchmarks

All benchmarks live in the roar repository. Install roar 0.3.3+, which includes the benchmark scripts and performance improvements measured here:

pip install 'roar-cli>=0.3.3' blake3 boto3
roar tracer enable ebpf              # sets CAP_BPF caps + sysctl (Linux only)

To build from source (needed if you want to modify the tracer binaries):

git clone https://github.com/treqs/roar.git && cd roar
bash scripts/build_wheel_with_bins.sh          # produces dist/roar_cli-*.whl
python3 -m venv .venv
.venv/bin/pip install dist/roar_cli-*.whl blake3 boto3
.venv/bin/roar tracer enable ebpf

Then run each benchmark serially (one at a time, so they don't contend for CPU or disk):

# 1. Tracer overhead (all backends): startup, per-file, per-syscall regressions
ROAR_BIN_DIR=$(.venv/bin/python -c "import roar, os; print(os.path.join(os.path.dirname(roar.__file__), 'bin'))")
PRELOAD_TRACER_BIN="$ROAR_BIN_DIR/roar-tracer-preload" \
EBPF_TRACER_BIN="$ROAR_BIN_DIR/roar-tracer-ebpf" \
ROARD_BIN="$ROAR_BIN_DIR/roard" \
.venv/bin/python tests/benchmarks/bench_tracer_overhead.py

# 2. Hash algorithm throughput (blake3 / sha256 / sha512 / md5)
.venv/bin/python tests/benchmarks/bench_hash_throughput.py

# 5. Artifact hashing: Python vs native Rust backend
.venv/bin/python tests/benchmarks/bench_artifact_hashing.py

Benchmark 3 (cold vs cached blake3), Benchmark 4 (parallel scaling, 1–30 workers), and Benchmark 6 (end-to-end roar run) are short standalone scripts; the cold-read benchmarks drop the page cache via sudo between runs. The ROAR_TIMING=1 environment variable on any roar run prints the tracer / post-run breakdown shown in the End-to-end section.

Setup

All numbers were produced on a single Lambda Cloud 1xA10 instance — a modest single-GPU ML node representative of common training infrastructure. Hashing numbers reflect its SSD bandwidth; tracer overhead is dominated by syscall interposition mechanics rather than GPU or memory class.

| Component | Spec |
| --- | --- |
| CPU | Intel Xeon Platinum 8358 @ 2.60 GHz |
| vCPUs | 30 |
| RAM | 222 GiB |
| Storage | 1.4 TiB virtio SSD (NVMe-backed). Single-stream sequential read ~3.1 GB/s (hdparm) / ~3.5 GB/s (dd, direct). Parallel cold reads sustain up to ~14 GB/s at high queue depth, the relevant ceiling for multi-worker hashing. |
| OS | Ubuntu 24.04.2 LTS |
| Kernel | 6.8.0-62-generic (x86_64) |
| Python | 3.12.3 |
| roar | 0.3.3 |
| Tracer config | eBPF via roard daemon; kernel.perf_event_paranoid=1; CAP_BPF set on the eBPF binaries |

The single-stream versus parallel disk gap is worth flagging: hdparm/dd measure one sequential stream (~3.1–3.5 GB/s), but 16–30 parallel readers hit ~14 GB/s on this SSD. Cold-read hashing throughput tracks the parallel figure, not the single-stream one.

Where to look next

  • Benchmarks (e2e): roar overhead on a real multi-GPU pre-training job, with matched pairs and explicit noise-floor measurements.
  • Tracers: backend mechanics and tradeoffs that explain the per-syscall overhead differences.
  • Proxy: the S3 proxy's request-signing path and architecture.
  • Hashes: why blake3 is the default, and when to record sha256 alongside it.