Benchmarks (e2e)

tl;dr roar adds 1–4% of wall-clock to LLM pre-training on a 2×H100 instance.
The default preload tracer is best for Python scripts and dynamically linked binaries: no kernel configuration, no CAP_BPF, and consistently cheaper than eBPF on long runs despite what the per-syscall micro-benchmarks suggest.

Micro-benchmarks characterize overhead per file and per syscall, but the number that matters in practice is: what does roar run add to a real training job? This page answers that with matched pairs on a real workload (nanochat base pre-training on Lambda 2×H100-SXM5) at three model depths and two tracer backends, with N=3 within-pod cycles per condition at d12 and d15 and a single eBPF pair at d20. The experiment was designed specifically to keep the noise floor below the tracer signal, after an earlier attempt revealed that cross-pod hardware variance alone (~10%) is large enough to drown any measurement that lacks paired controls.

The source data and run scripts are in the roar repo under tests/benchmarks/nanochat_training/.

What this means for real training

For a training run on Lambda 2×H100-SXM5 (or comparable NVLink SXM-class hardware):

Model config	Training time	eBPF overhead	preload overhead	Post-run
d12 (smaller)	~20 min	+4.2%	+3.1%	2–4 s
d15 (mid)	~60 min	+1.9%	+1.5%	3–5 s
d20 (full, N=1)	~3 hr	+1.1%	—	~7 s

Overhead scales inversely with model size and GPU utilization. As compute saturation rises (bf16 MFU goes from 44% at d12 to 59% at d20), the tracer's contribution shrinks as a fraction of total wall-clock. At d20 on a three-hour run, it is about 1%, or roughly two minutes. Post-run hashing and recording scales with output size, not runtime: d20 writes ~6 GB of weights and optimizer shards, which hashed in ~7 seconds in the one eBPF run measured at that depth.

One finding inverts the micro-benchmark ordering: preload is consistently cheaper than eBPF on long training runs (eBPF's overhead is 1.3–1.4× preload's at d12 and d15, the two depths where both were measured), the opposite of what the per-syscall micro-benchmarks predict for short commands. See preload vs eBPF on long training for the explanation. preload also has no kernel prerequisites: no CAP_BPF, no kernel.perf_event_paranoid adjustment, and it works in containers that don't grant elevated privileges.

Workload

nanochat is Andrej Karpathy's compact LLM training pipeline — a ~$100 single-box training run on an 8×H100. We exercise only base_train: the dataloader-intensive pre-training step that dominates wall-clock and is the natural shape for measuring tracer overhead. No in-training eval, no SFT, no post-training.

Parameters, held constant across all comparison runs:

--target-param-data-ratio=8   # total steps derived from model size
--device-batch-size=16
--fp8
--window-pattern L            # SDPA fallback; no FA3 on these nodes
--seed 4242
--core-metric-every -1        # in-training eval disabled

Using the same seed across all comparison runs lets val_bpb serve as a convergence consistency check: any genuine training disturbance from the tracer would show up there.

The three depths measured:

Depth	Steps	Wall-clock (Pod A)	bf16 MFU
d12	1,680	~20 min	43.8%
d15	3,392	~60 min	51.4%
d20	3,320	~3 hr (separate pod, N=1)	~59%

MFU rises with depth because the larger model saturates the GPU more fully. This is also why tracer overhead, as a percentage of wall-clock, falls with depth: the per-step compute denominator grows while the per-step tracer cost stays roughly constant. Token throughput (tok/sec × dt) confirms batch size in tokens is fixed across all depths — so the overhead drop is driven purely by per-step FLOP increase, not by fewer or larger batches.
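
The per-step argument can be checked against the figures on this page. The sketch below is only arithmetic on the published step counts and Pod A means; it spreads the whole wall-clock evenly over the steps, which is a simplifying assumption that ignores compile and startup time. The function names are illustrative and not part of roar.

```python
# Figures copied from the depth table and the Pod A results table on this page.
STEPS = {"d12": 1680, "d15": 3392}
BASELINE_S = {"d12": 1192.0, "d15": 3601.0}  # mean no-roar wall-clock
TRACED_S = {
    ("d12", "preload"): 1229.2,
    ("d12", "eBPF"): 1242.5,
    ("d15", "preload"): 3653.7,
    ("d15", "eBPF"): 3669.6,
}


def baseline_step_ms(depth):
    """Untraced wall-clock per step in ms (assumes time is spread evenly over steps)."""
    return BASELINE_S[depth] / STEPS[depth] * 1000.0


def tracer_step_ms(depth, tracer):
    """Extra wall-clock per step attributed to the tracer, in ms."""
    extra_s = TRACED_S[(depth, tracer)] - BASELINE_S[depth]
    return extra_s / STEPS[depth] * 1000.0


for depth, tracer in TRACED_S:
    print(
        f"{depth} {tracer:8s} "
        f"step={baseline_step_ms(depth):7.1f} ms  "
        f"tracer=+{tracer_step_ms(depth, tracer):5.1f} ms"
    )
```

With these inputs the baseline step time grows from about 0.71 s at d12 to about 1.06 s at d15, while the tracer's added time per step stays in the range of roughly 15–30 ms.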

Methodology

Paired within-pod cycles. All comparison runs at d12 and d15 were executed sequentially on the same physical machine, with the Torch inductor cache cleared between runs for cold-compile parity. This is the key design choice: cross-pod hardware variance within Lambda 2×H100-SXM5 was observed at ~10% in an earlier measurement attempt, which is larger than the tracer effect being measured. Within-pod, the run-to-run SD of the untraced baseline was 0.04–0.29% of the mean, at least 10× smaller than the tracer effect measured at the same depth.

N=3 cycles per condition. Three complete cycles of (no-roar, eBPF, preload) at each depth gives a within-pod variance estimate and confirms the signal is not a single-run artifact. Run order within each cycle is fixed (no-roar → eBPF → preload) — a controlled sequence bias that does not affect the headline numbers since within-cycle drift was well below the measured SDs.
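
As a rough illustration of the cycle structure, the sketch below enumerates the runs for one pod. It only builds the ordered list; launching the runs and clearing the inductor cache are left as a comment. Running all d12 cycles before d15 is an assumption made here for the example, since the page does not state how the depths were interleaved, and build_schedule is a hypothetical helper rather than anything from the roar repo.

```python
from itertools import product

# Fixed order within each cycle, as described above.
CONDITIONS = ("no-roar", "eBPF", "preload")


def build_schedule(depths, n_cycles, conditions=CONDITIONS):
    """Return the ordered (depth, cycle, condition) runs for one pod."""
    cycles = range(1, n_cycles + 1)
    return list(product(depths, cycles, conditions))


schedule = build_schedule(depths=("d12", "d15"), n_cycles=3)
for depth, cycle, condition in schedule:
    # A real driver would clear the inductor cache here, then launch the run.
    print(f"{depth} cycle {cycle}: {condition}")
print(len(schedule), "runs")
```

Two depths, three cycles and three conditions give 18 runs, which matches the run count reported for Pod A.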

Overhead is measured from trace_s: roar's internally-reported wall-clock of the wrapped torchrun command, compared against the mean no-roar wall-clock at the same depth and pod. Post-run time (post_s) is separated out; it does not appear in trace_s.
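
The calculation described above is a relative difference of means. A minimal sketch follows, with overhead_pct as an illustrative name. The per-run trace_s values are not listed on this page, so the example feeds in the published means as single-element lists.

```python
from statistics import mean


def overhead_pct(traced_s, baseline_s):
    """Tracer overhead as a percentage of the mean untraced wall-clock.

    traced_s:   trace_s values for one tracer at one depth (seconds)
    baseline_s: no-roar wall-clock values at the same depth and pod (seconds)
    """
    base = mean(baseline_s)
    return (mean(traced_s) - base) / base * 100.0


# Published Pod A means, used here in place of the per-run values.
d12_preload = overhead_pct([1229.2], [1192.0])
d12_ebpf = overhead_pct([1242.5], [1192.0])
d15_preload = overhead_pct([3653.7], [3601.0])
d15_ebpf = overhead_pct([3669.6], [3601.0])

for name, value in [
    ("d12 preload", d12_preload),
    ("d12 eBPF", d12_ebpf),
    ("d15 preload", d15_preload),
    ("d15 eBPF", d15_ebpf),
]:
    print(f"{name}: +{value:.2f}%")
```

Because the inputs are means rounded to 0.1 s, the results agree with the results table only to within about 0.01 percentage points.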

ptrace is not measured here. The micro-benchmarks establish ptrace's per-syscall overhead at 207 µs — 44× eBPF — making it the universal fallback for environments where neither eBPF nor preload is available, not a practical choice for training workloads. If both eBPF and preload are available, ptrace is never the right selection.

Results

In-training overhead

Pod A (Lambda 2×H100-SXM5, N=3 paired cycles at d12 and d15):

depth	tracer	trace mean (s)	SD	overhead	post (s)
d12	no-roar	1,192.0	0.29%	—	—
d12	preload	1,229.2	0.02%	+3.12%	2.4
d12	eBPF	1,242.5	0.27%	+4.24%	3.9
d15	no-roar	3,601.0	0.04%	—	—
d15	preload	3,653.7	0.28%	+1.47%	2.7
d15	eBPF	3,669.6	0.08%	+1.91%	5.4
d20	eBPF	10,689.9	N=1	+1.10%	7.0

The within-pod no-roar SD is small at both replicated depths (0.29% at d12, 0.04% at d15). Tracer overhead sits at least 10× above that floor, making the signal clearly distinguishable from run-to-run noise.

The d20 row is from an earlier matched pair on a separate Lambda 2×H100 pod (not Pod A) and is included for directional context. It is consistent with the depth-scaling trend but is not replicated.

Model quality (val_bpb)

val_bpb across all 18 runs at d12 and d15, same seed:

depth	val_bpb range	band width
d12 (N=9 runs)	0.870791–0.871072	2.8 × 10⁻⁴
d15 (N=9 runs)	0.805128–0.805449	3.2 × 10⁻⁴

This is nanochat's own run-to-run non-determinism band (FP8 quantization, NCCL collective ordering, CUDA reductions). Tracer choice does not push val_bpb outside the model's inherent noise floor. The traced runs are convergence-equivalent to the untraced baseline at the precision this workload resolves.
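
The band widths in the table are the difference between the largest and smallest val_bpb at each depth. The sketch below reproduces them from the published range endpoints and also expresses each band relative to its midpoint; the relative figure is derived here and is not reported elsewhere on this page.

```python
# Range endpoints copied from the val_bpb table.
VAL_BPB_RANGE = {
    "d12": (0.870791, 0.871072),
    "d15": (0.805128, 0.805449),
}


def band_width(lo, hi):
    """Absolute spread between the lowest and highest val_bpb."""
    return hi - lo


def relative_band_pct(lo, hi):
    """Spread as a percentage of the band's midpoint."""
    return band_width(lo, hi) / ((lo + hi) / 2.0) * 100.0


for depth, (lo, hi) in VAL_BPB_RANGE.items():
    print(
        f"{depth}: width={band_width(lo, hi):.1e}  "
        f"relative={relative_band_pct(lo, hi):.3f}%"
    )
```

Both bands come out at a few hundredths of a percent of val_bpb.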

preload vs eBPF on long training

In the micro-benchmarks, eBPF is 7× cheaper per file than preload on short synthetic commands. On a multi-minute training job this ordering inverts: preload is consistently cheaper, with eBPF's overhead 1.3–1.4× preload's at d12 and d15 (d20 has no preload measurement).
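
The 1.3–1.4× figure is the ratio of added wall-clock between the two tracers. A short sketch using the Pod A means from the results table follows; overhead_ratio is an illustrative helper.

```python
def overhead_ratio(baseline_s, ebpf_s, preload_s):
    """Ratio of eBPF's added wall-clock to preload's added wall-clock."""
    return (ebpf_s - baseline_s) / (preload_s - baseline_s)


# (no-roar mean, eBPF mean, preload mean) in seconds, from the Pod A table.
POD_A_MEANS = {
    "d12": (1192.0, 1242.5, 1229.2),
    "d15": (3601.0, 3669.6, 3653.7),
}

ratios = {depth: overhead_ratio(*means) for depth, means in POD_A_MEANS.items()}
for depth, ratio in ratios.items():
    print(f"{depth}: eBPF adds {ratio:.2f}x the wall-clock that preload adds")
```

Only d12 and d15 appear because d20 has no preload measurement to compare against.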

The working explanation: eBPF's event path crosses the kernel↔userspace boundary for each intercepted syscall — the kernel hook fires, writes to a perf ring buffer, and the roard daemon reads and processes in userspace. On a hot dataloader loop issuing thousands of open/read syscalls per second, the cumulative boundary-crossing cost compounds. preload's LD_PRELOAD shim intercepts the same calls entirely in-process, with no mode switch and no ring-buffer handoff. On a 20-second synthetic workload the fixed startup cost dominates and eBPF's lower marginal rate wins; on a 20-minute training run the per-event rate accumulates.

This does not change the auto-select recommendation for most uses. For long-running training specifically, the practical difference is about 0.4–1.1 percentage points of wall-clock; both backends stay under 5% at d12 and under 2% at d15, and eBPF stays under 2% at d20.

Hardware effects

The largest variable in this table is not which tracer you choose — it's which machine you're on.

provider	hardware	no-roar wall-clock	preload overhead	absolute roar cost
Lambda Pod A	2×H100 SXM5, Sapphire Rapids	1,192 s	+3.1%	~37 s
RunPod (PCIe pod)	2×H100 PCIe, Ice Lake-SP	2,251 s	+9.1%	~208 s

The no-roar baseline on RunPod PCIe is 89% longer than Lambda SXM5 — the same workload, nearly twice the wall-clock, from hardware alone. That gap is 20× larger than the biggest tracer effect measured in this experiment (+4.2% eBPF at d12). Hardware selection determines far more of your training budget than tracer selection.
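
The comparison above is simple arithmetic on the two no-roar baselines. A minimal sketch, using only the wall-clock figures from the table and the d12 eBPF overhead from the results section:

```python
LAMBDA_BASELINE_S = 1192.0   # no-roar, Lambda Pod A, d12
RUNPOD_BASELINE_S = 2251.0   # no-roar, RunPod PCIe pod, d12
LARGEST_TRACER_PCT = 4.24    # eBPF at d12 on Pod A

# How much longer the same untraced workload runs on the slower pod.
hardware_gap_pct = (RUNPOD_BASELINE_S / LAMBDA_BASELINE_S - 1.0) * 100.0

# How many times larger that gap is than the biggest tracer effect.
gap_vs_tracer = hardware_gap_pct / LARGEST_TRACER_PCT

print(f"hardware gap: +{hardware_gap_pct:.1f}%")
print(f"gap / largest tracer effect: {gap_vs_tracer:.1f}x")
```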

The 9.1% preload overhead on RunPod PCIe is higher than Lambda's 3.1%, but it reads differently in context. In absolute terms, roar's overhead on the PCIe pod is ~208 s; on Lambda it's ~37 s. The percentage is higher because the slower CPU means the LD_PRELOAD shim costs more per syscall — PCIe H100s do more host-side bookkeeping per NCCL all-reduce than NVLink hardware, and that bookkeeping is syscall surface the shim intercepts. But both numbers are small relative to the job they sit on top of.

Note on RunPod SKU heterogeneity. The 2x_H100_SECURE catalog entry on RunPod is not a fixed hardware spec. Two runs in this experiment on nominally the same SKU landed on different hardware: one likely SXM5 (MFU 34.7%), one confirmed PCIe (MFU 30.3%, Ice Lake-SP CPU). Any RunPod result should be read as specific to the hardware that pod received, not as representative of RunPod in general. eBPF could not be tested on RunPod because its containers do not grant CAP_BPF.

Setup

Primary measurements on Lambda Cloud 2×H100-SXM5 (Pod A). Cross-provider pair on RunPod.

Component	Spec
CPU	Intel Xeon Platinum 8480+ (Sapphire Rapids)
GPUs	2× H100 80GB HBM3 SXM5, NVLink (NV18)
RAM	442.7 GiB
OS	Ubuntu 22.04.5 LTS, kernel 6.8.0-60-generic
Driver / CUDA	570.148.08 / CUDA 12.8 (PyTorch 2.9.1+cu128)
roar	0.3.3
Tracer config	eBPF via `roard` daemon; `CAP_BPF` set on eBPF binaries; `kernel.perf_event_paranoid=1` (Lambda default is 4 — must be set explicitly before `roar tracer enable ebpf` succeeds).

tracer.fallback_enabled=false was set in roar config throughout to prevent silent backend substitutions; any misconfiguration produced a clear error rather than a silent fallback that would bias the measurement.
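
Before an eBPF run, the kernel setting from the setup table can be checked from Python. The sketch below reads the standard Linux procfs entry for kernel.perf_event_paranoid. The threshold of 1 mirrors the value used in this setup and is an assumption for the example, not a documented roar requirement; both function names are hypothetical.

```python
from pathlib import Path

# Standard procfs location of the kernel.perf_event_paranoid sysctl on Linux.
PARANOID_PATH = Path("/proc/sys/kernel/perf_event_paranoid")


def read_perf_event_paranoid(path=PARANOID_PATH):
    """Return the current perf_event_paranoid level, or None if it cannot be read."""
    try:
        return int(Path(path).read_text().strip())
    except (OSError, ValueError):
        return None


def paranoid_level_ok(level, max_level=1):
    """True if the level is at or below max_level.

    max_level=1 matches the benchmark setup; treat it as an assumption.
    """
    return level is not None and level <= max_level


if __name__ == "__main__":
    level = read_perf_event_paranoid()
    if paranoid_level_ok(level):
        print(f"perf_event_paranoid={level}: matches the benchmark setup")
    else:
        print(f"perf_event_paranoid={level}: lower it before enabling the eBPF tracer")
```

On a host where the check fails, the value can be changed as root with sysctl -w kernel.perf_event_paranoid=1, which is how the setting in the table is normally applied.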

Caveats

Single pod is the primary source. Pod A (18 runs, N=3 per cell) is the only pod with full paired coverage across both depths and both tracers. Pod B provided a single d12 eBPF confirmation at +4.70%, but it landed on the same physical hardware as Pod A (Lambda reassigned the same node after teardown; GPU UUIDs matched). Evidence of true cross-pod variation within Lambda is limited to a single earlier abandoned run that showed ~10% variance between distinct pods, which is why the paired within-pod design was used rather than comparing across pods.

d20 is N=1 from separate hardware. The d20 eBPF row is directionally consistent with the depth-scaling pattern from d12 and d15, but comes from an earlier run on a separate pod and is not replicated here.

preload < eBPF inversion is observed but not mechanistically confirmed. The kernel↔userspace boundary hypothesis is consistent with the data; it has not been verified by direct event counting or roard CPU profiling.

Overhead percentages are hardware-dependent. Extrapolating these numbers to other hardware classes — PCIe vs NVLink, older CPUs, container runtimes with overlayfs — requires caution. The RunPod PCIe result (+9.07% preload at d12) illustrates the range.

Single-node experiment only. All runs used torchrun on a single 2×H100 node. Multi-node behavior — whether roar runs on each rank, only rank-0, or only the launch node, and how per-node lineage is assembled — is not characterized here.

Where to look next

Benchmarks (micro) — per-syscall and per-file tracer overhead on synthetic workloads, hashing throughput, and S3 proxy overhead.
Tracers — backend mechanics that explain the preload/eBPF tradeoff.
Hashes — post-run hashing throughput and why blake3 is the default.