
Hashes


Why hash artifacts?

Every artifact roar records is identified by its content hash, not by filename or path. That makes two things work that wouldn't otherwise:

  • Dedup. If two jobs produce the same bytes, they share one artifact record. The DAG sees them as the same node.
  • Reproducibility. roar reproduce <hash> works because the hash uniquely identifies what you're trying to recreate, independent of which directory or which machine the file currently lives on.

Content hashing is foundational to roar; it's not configurable away.

Default: blake3

roar defaults to blake3 for all artifact hashing. The choice is mostly about throughput. blake3 is dramatically faster than sha256 on modern CPUs — single-threaded around 2× and with SIMD/multi-threading often 5–10× — which directly cuts the wall-clock cost of post-run hashing on large outputs. See Benchmarks for the numbers.

If the blake3 Python module isn't installed, roar falls back to sha256 automatically. No error, no failed run — just a slower hash. roar show <artifact> reports which algorithm was used.
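The fallback pattern that behavior implies looks roughly like this (a sketch, not roar's actual code; the blake3 module's callable and hashlib objects share the same hexdigest interface):

```python
import hashlib

try:
    from blake3 import blake3 as _hasher  # preferred: fast, SIMD/multi-threaded
    ALGORITHM = "blake3"
except ImportError:
    _hasher = hashlib.sha256              # silent fallback: slower, never a failed run
    ALGORITHM = "sha256"

def hash_bytes(data: bytes) -> str:
    """Hash with whichever algorithm is available; both emit 64 hex chars."""
    return _hasher(data).hexdigest()
```

Recording which branch was taken (as roar show does) is what keeps the fallback honest: two machines can hash the same artifact with different algorithms and still compare notes.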

blake3 vs. sha256

The honest tradeoff:

|                                     | blake3                        | sha256                                               |
| ----------------------------------- | ----------------------------- | ---------------------------------------------------- |
| Output size                         | 256-bit (configurable)        | 256-bit                                              |
| Theoretical collision bound         | 2⁻¹²⁸                         | 2⁻¹²⁸                                                |
| Throughput (single-threaded)        | ~2× sha256 on x86_64          | baseline                                             |
| Throughput (SIMD / multi-threaded)  | 5–10× sha256                  | doesn't parallelize across one input                 |
| First publication                   | 2020                          | 2001                                                 |
| Ecosystem footprint                 | growing (cargo, IPFS support) | dominant (git, Docker, Hugging Face, every TLS cert) |

The output sizes match — so the theoretical collision resistance is identical. Where the two differ is age: sha256 has been deployed at internet scale for two-plus decades, and the cryptanalytic literature has had that long to try (and fail) to break it. blake3 is younger; it's seen serious review, but less of it. Both are currently unbroken.

For a practical sense of the bound: if you registered 2⁶⁴ (≈ 1.8 × 10¹⁹) distinct artifacts under blake3, the birthday-paradox collision probability would still be on the order of 2⁻¹²⁹ — less than one in 10³⁸. The same math holds for sha256. The probabilistic risk of accidental collision is not the deciding factor for either choice.
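The arithmetic checks out directly. Birthday bound: n items give n(n−1)/2 pairs, each pair colliding with probability 2⁻²⁵⁶, so:

```python
from fractions import Fraction
from math import log2

BITS = 256   # blake3 / sha256 output size
n = 2**64    # artifacts registered

# Birthday bound: n*(n-1)/2 pairs, each colliding with probability 2**-BITS.
p = Fraction(n * (n - 1), 2 ** (BITS + 1))

print(round(log2(p)))  # -129: astronomically unlikely
```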

When you'd want sha256 (or both)

Some ecosystems standardize on sha256 by convention: Hugging Face datasets and models, git objects, container images, IPFS, almost every package manager. When you produce or consume an artifact that needs to interoperate with those, recording its sha256 is what makes the cross-tool identity check work — even if blake3 stays as your primary.

roar lets you compute multiple algorithms for the same artifact in one pass and stores all of them. Each artifact in .roar/roar.db can have many hash entries; you can look an artifact up by any of them.

roar run --hash sha256 python train.py --output model.pkl
# `model.pkl` ends up with both a blake3 hash (the primary) and a sha256
# hash; `roar show model.pkl` shows both, and either is a valid lookup key.
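Computing several digests in one pass just means feeding every read chunk to every hasher, so the file is read once regardless of how many algorithms are requested. A sketch of the idea (not roar's internals; the defaults here use hashlib names, since blake3 lives outside the stdlib):

```python
import hashlib

def multi_hash(path: str, algorithms=("sha256", "md5")) -> dict[str, str]:
    """Read the file once, updating one hasher per requested algorithm."""
    hashers = {name: hashlib.new(name) for name in algorithms}
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(1 << 20), b""):
            for h in hashers.values():
                h.update(chunk)
    return {name: h.hexdigest() for name, h in hashers.items()}
```

For large artifacts the disk read dominates, which is why adding a second algorithm to a run costs far less than hashing twice.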

Multiple hashes per artifact, mutual reference

In the DB, hashes live in an artifact_hashes table with one row per (artifact_id, algorithm, digest). So:

  • A single artifact can carry blake3, sha256, etag (for S3), and even md5 simultaneously.
  • roar show <hash> accepts any of them as a lookup key and resolves to the same artifact.
  • Multiple uploads of the same content under different conditions (different multipart sizes, say, producing different ETags) can be reconciled if a content hash like blake3 is also recorded — that's what cross-checks the identity claim.
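That shape maps naturally onto a small relational schema. An illustrative sqlite sketch (table layout follows the description above, but the exact DDL and the sample digests are assumptions, not roar's actual schema):

```python
import sqlite3

db = sqlite3.connect(":memory:")
db.executescript("""
CREATE TABLE artifacts (id INTEGER PRIMARY KEY, path TEXT);
CREATE TABLE artifact_hashes (
    artifact_id INTEGER REFERENCES artifacts(id),
    algorithm   TEXT,
    digest      TEXT,
    UNIQUE (algorithm, digest)
);
""")
db.execute("INSERT INTO artifacts VALUES (1, 'model.pkl')")
db.executemany(
    "INSERT INTO artifact_hashes VALUES (1, ?, ?)",
    [("blake3", "b3f1..."), ("sha256", "9a41..."), ("etag", "d8e8-3")],
)

def lookup(digest: str):
    """Any recorded digest resolves to the same artifact row."""
    return db.execute(
        "SELECT a.id, a.path FROM artifacts a "
        "JOIN artifact_hashes h ON h.artifact_id = a.id WHERE h.digest = ?",
        (digest,),
    ).fetchone()

assert lookup("b3f1...") == lookup("d8e8-3") == (1, "model.pkl")
```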

This shape is what Scopes relies on for the "two orgs registering the same dataset" story: identity is one artifact globally, even if each side recorded a different mix of hash algorithms.

Configuration

Set the primary algorithm and any always-additionally-computed algorithms in roar.toml:

[hash]
primary = "blake3"            # default; what every artifact gets
run = ["sha256"]              # additional algos for `roar run` outputs
get = ["sha256"]              # additional algos for `roar get` inputs
put = ["sha256"]              # additional algos for `roar put` uploads

Per-invocation overrides:

| Flag              | Effect                                                                                      |
| ----------------- | ------------------------------------------------------------------------------------------- |
| --hash <algo>     | Add <algo> to the computed set for this invocation, on top of primary + config. Repeatable. |
| --hash-only <algo>| Use only <algo> — skip the primary and config entries. Repeatable.                          |

Valid algorithm names: blake3, sha256, sha512, md5. Plus composite-blake3 for composite artifacts (see below).

Composite artifacts

When roar registers a directory as a single artifact (a composite artifact — model checkpoints, dataset directories, etc.), its hash is composite-blake3 — a Merkle-tree-style hash derived from the blake3 hashes of every component file plus their paths. The properties:

  • Deterministic. Same directory contents and same path layout produce the same composite-blake3.
  • Content-aware. If any component file changes, the composite hash changes.
  • Cheap partial verification. You can verify a subset of components against the composite hash without rehashing the whole tree.

composite-blake3 is recorded alongside any standard hashes of the root, so both lookup styles work.
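A deterministic, path-aware directory hash can be sketched as follows. This is illustrative only: sha256 stands in for blake3, the combining rule is an assumption rather than roar's actual construction, and a flat hash like this gives the determinism and content-awareness properties but not the cheap partial verification, which needs the real Merkle tree.

```python
import hashlib, os

def file_digest(path: str) -> bytes:
    h = hashlib.sha256()  # stand-in for blake3
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(1 << 20), b""):
            h.update(chunk)
    return h.digest()

def composite_digest(root: str) -> str:
    """Hash (relative path, file digest) pairs in sorted order, so the
    result depends on both contents and layout; changing either one
    changes the composite."""
    entries = []
    for dirpath, _, filenames in os.walk(root):
        for name in filenames:
            full = os.path.join(dirpath, name)
            entries.append((os.path.relpath(full, root), file_digest(full)))
    top = hashlib.sha256()
    for rel, digest in sorted(entries):
        top.update(rel.encode())
        top.update(digest)
    return top.hexdigest()
```

Sorting by relative path is what makes the walk order irrelevant: two machines enumerating the same tree in different orders still agree on the digest.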

ETag and S3

Artifacts published or fetched via S3 carry an etag hash recorded by the proxy. ETag is an S3-managed string — for single-part uploads it's the body's MD5, for multipart it's MD5(concat(part MD5s))-<N> which is not a content digest. roar stores ETag as one of the artifact's hash algorithms but treats blake3 (when present) as the source of truth for content identity. See Proxy: ETag and content hashing.
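The multipart formula is easy to reproduce, and doing so shows why an ETag can't serve as a content digest. A sketch (the 8 MiB default part size here is an assumption; S3's actual part size depends on the uploader):

```python
import hashlib

def s3_etag(data: bytes, part_size: int = 8 * 1024 * 1024) -> str:
    """Mimic S3's ETag: plain MD5 for a single part, otherwise
    MD5(concat(part MD5s))-<N>, a hash of hashes rather than of the body."""
    parts = [data[i:i + part_size] for i in range(0, len(data), part_size)] or [b""]
    if len(parts) == 1:
        return hashlib.md5(parts[0]).hexdigest()
    combined = hashlib.md5(b"".join(hashlib.md5(p).digest() for p in parts))
    return f"{combined.hexdigest()}-{len(parts)}"
```

The same bytes uploaded with different part sizes produce different ETags, which is exactly why roar cross-checks ETags against a real content hash like blake3.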

What hashes appear in roar show

roar show /home/ubuntu/mnist/train_feats.npz

The artifact-detail view lists every hash recorded for the artifact, labeled by algorithm. For blake3 specifically, the page shows a ≡ b3sum <path> breadcrumb — the digest roar computed should match what the upstream b3sum CLI tool produces for the same file. That gives you an independent verification path outside roar.

For other algorithms there's no equivalent breadcrumb today; the simplest verifications are sha256sum <path> for sha256 and openssl md5 <path> for md5.

Where to look next

  • Benchmarks — actual throughput numbers for blake3 vs. sha256 across file sizes.
  • Proxy — S3 ETag mechanics and how roar reconciles them with content hashes.
  • Core Concepts — content-addressable identity and how it shapes the DAG.
  • Scopes — the dedup model that makes "same hash, different scopes" work.