## Why hash artifacts?

Every artifact `roar` records is identified by its **content hash**, not by filename or path. That makes two things work that wouldn't otherwise:

- **Dedup.** If two jobs produce the same bytes, they share one artifact record. The DAG sees them as the same node.
- **Reproducibility.** `roar reproduce <hash>` works because the hash uniquely identifies what you're trying to recreate, independent of which directory or which machine the file currently lives on.

Content hashing is foundational to `roar`; it cannot be turned off.
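To illustrate the dedup property, here is a minimal sketch of content-addressed registration. It is not `roar`'s implementation: `content_key` and `register` are hypothetical helpers, and sha256 from the standard library stands in for whichever algorithm is actually in use.

```python
import hashlib
import tempfile
from pathlib import Path


def content_key(path: Path) -> str:
    """Return a hex digest of the file's bytes (sha256 as a stand-in)."""
    return hashlib.sha256(path.read_bytes()).hexdigest()


def register(paths: list[Path]) -> dict[str, list[Path]]:
    """Group paths by content digest: one record per distinct content."""
    records: dict[str, list[Path]] = {}
    for p in paths:
        records.setdefault(content_key(p), []).append(p)
    return records


with tempfile.TemporaryDirectory() as tmp:
    # Two "jobs" write identical bytes under different names and directories.
    a = Path(tmp) / "job1" / "out.bin"
    b = Path(tmp) / "job2" / "renamed.bin"
    for p in (a, b):
        p.parent.mkdir(parents=True)
        p.write_bytes(b"same bytes")
    records = register([a, b])
    print(len(records))  # 1: both paths map to the same record
```

Because the key is derived only from the bytes, the filename and directory play no part in identity.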

## Default: blake3

`roar` defaults to **blake3** for all artifact hashing. The choice is mostly about throughput. blake3 is dramatically faster than sha256 on modern CPUs — single-threaded around 2× and with SIMD/multi-threading often 5–10× — which directly cuts the wall-clock cost of post-run hashing on large outputs. See [Benchmarks](/docs/benchmarks#hashes) for the numbers.

If the `blake3` Python module isn't installed, `roar` falls back to `sha256` automatically. No error, no failed run — just a slower hash. `roar show <artifact>` reports which algorithm was used.
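The fallback behaviour can be pictured with a sketch like the following. It assumes the third-party `blake3` package exposes a `blake3.blake3()` hasher with `update` and `hexdigest`; `new_hasher` and `hash_bytes` are hypothetical names, not `roar` internals.

```python
import hashlib

try:
    import blake3  # optional third-party dependency
    _HAVE_BLAKE3 = True
except ImportError:
    _HAVE_BLAKE3 = False


def new_hasher():
    """Return (algorithm_name, hasher), preferring blake3 when available."""
    if _HAVE_BLAKE3:
        return "blake3", blake3.blake3()
    return "sha256", hashlib.sha256()


def hash_bytes(data: bytes) -> tuple[str, str]:
    """Hash `data` and report which algorithm produced the digest."""
    algo, hasher = new_hasher()
    hasher.update(data)
    return algo, hasher.hexdigest()


algo, digest = hash_bytes(b"abc")
print(algo, digest)
```

Returning the algorithm name alongside the digest is what lets a later reader tell which hash was actually used.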

## blake3 vs. sha256

The honest tradeoff:

| | blake3 | sha256 |
|---|---|---|
| Output size | 256-bit (configurable) | 256-bit |
| Collision resistance (generic birthday bound) | ~2¹²⁸ hash evaluations | ~2¹²⁸ hash evaluations |
| Throughput (single-threaded) | ~2× sha256 on x86_64 | baseline |
| Throughput (with SIMD / multi-threaded) | 5–10× sha256 | sha256 doesn't parallelize across one input |
| First publication | 2020 | 2001 |
| Ecosystem footprint | growing (cargo, IPFS support) | dominant (Docker, Hugging Face, most TLS certificates; optional object format in git) |

The output sizes match — so the *theoretical* collision resistance is identical. Where the two differ is **age**: sha256 has been deployed at internet scale for two-plus decades, and the cryptanalytic literature has had that long to try (and fail) to break it. blake3 is younger; it's seen serious review, but less of it. Both are currently unbroken.

For a practical sense of the bound: if you registered 2⁶⁴ (≈ 1.8 × 10¹⁹) distinct artifacts under blake3, the birthday-paradox collision probability would still be about 2⁻¹²⁹ (n² / 2²⁵⁷ with n = 2⁶⁴), roughly one in 7 × 10³⁸. The same math holds for sha256. The probabilistic risk of accidental collision is not the deciding factor for either choice.
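The standard birthday approximation, p ≈ n² / 2^(b+1) for n items and a b-bit hash, can be evaluated directly in log space. The function names below are illustrative, and the approximation only holds while p is small.

```python
def log2_collision_probability(log2_items: int, hash_bits: int) -> int:
    """Return log2(p) for the birthday approximation p ≈ n^2 / 2^(bits + 1)."""
    return 2 * log2_items - (hash_bits + 1)


def log2_items_for_probability(log2_p: float, hash_bits: int) -> float:
    """Return log2(n): how many items it takes to reach probability 2^log2_p."""
    return (log2_p + hash_bits + 1) / 2


print(log2_collision_probability(64, 256))   # -129
print(log2_items_for_probability(-64, 256))  # 96.5
```

In other words, a one-in-2⁶⁴ chance of any collision would take roughly 2⁹⁶·⁵ distinct artifacts under a 256-bit hash.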

## When you'd want sha256 (or both)

Some ecosystems standardize on sha256 by convention: Hugging Face datasets and models, container images, IPFS, and many package managers (git also offers a SHA-256 object format, though SHA-1 remains its default). When you produce or consume an artifact that needs to interoperate with those, recording its sha256 is what makes the cross-tool identity check work, even if blake3 stays as your primary.

`roar` lets you compute multiple algorithms for the same artifact in one pass and stores all of them. Each artifact in `.roar/roar.db` can have many hash entries; you can look an artifact up by any of them.

```bash
roar run --hash sha256 python train.py --output model.pkl
# `model.pkl` ends up with both a blake3 hash (the primary) and a sha256
# hash; `roar show model.pkl` shows both, and either is a valid lookup key.
```
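Computing several digests in one pass means reading each chunk once and feeding it to every hasher. A minimal sketch, assuming standard-library algorithms only (`multi_hash` is a hypothetical helper; blake3 would need the third-party package and is left out here):

```python
import hashlib
import io
from typing import BinaryIO


def multi_hash(
    stream: BinaryIO, algorithms: list[str], chunk_size: int = 1 << 20
) -> dict[str, str]:
    """Read `stream` once and return {algorithm: hex digest} for each name."""
    hashers = {name: hashlib.new(name) for name in algorithms}
    while chunk := stream.read(chunk_size):
        # One read, many updates: I/O cost is paid a single time.
        for hasher in hashers.values():
            hasher.update(chunk)
    return {name: hasher.hexdigest() for name, hasher in hashers.items()}


digests = multi_hash(io.BytesIO(b"model weights"), ["sha256", "sha512"])
print(sorted(digests))
```

For large outputs the file read usually dominates, so the extra algorithm costs CPU time but no second pass over the data.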

## Multiple hashes per artifact, mutual reference

In the DB, the `artifact_hashes` table holds one row per `(artifact_id, algorithm, digest)`. So:

- A single artifact can carry `blake3`, `sha256`, `etag` (for S3), and even `md5` simultaneously.
- `roar show <hash>` accepts any of them as a lookup key and resolves to the same artifact.
- Multiple uploads of the same content under different conditions (different multipart sizes, say, producing different ETags) can be reconciled if a content hash like blake3 is also recorded — that's what cross-checks the identity claim.

This shape is what [Scopes](/docs/scopes) relies on for the "two orgs registering the same dataset" story: identity is *one* artifact globally, even if each side recorded a different mix of hash algorithms.
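The one-row-per-hash layout described above can be sketched with an in-memory SQLite database. Only the `artifact_hashes` name and its three columns come from the text; the `artifacts` table, the column types, the placeholder digests, and the `lookup` helper are assumptions for illustration, not the real `.roar/roar.db` schema.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE artifacts (id INTEGER PRIMARY KEY, path TEXT);  -- hypothetical
    CREATE TABLE artifact_hashes (
        artifact_id INTEGER NOT NULL REFERENCES artifacts(id),
        algorithm   TEXT NOT NULL,
        digest      TEXT NOT NULL,
        PRIMARY KEY (artifact_id, algorithm, digest)
    );
    """
)
conn.execute("INSERT INTO artifacts (id, path) VALUES (1, 'model.pkl')")
conn.executemany(
    "INSERT INTO artifact_hashes VALUES (?, ?, ?)",
    [
        (1, "blake3", "b3-digest-placeholder"),
        (1, "sha256", "sha-digest-placeholder"),
    ],
)


def lookup(digest: str):
    """Resolve any recorded digest to its artifact id, or None if unknown."""
    row = conn.execute(
        "SELECT artifact_id FROM artifact_hashes WHERE digest = ?", (digest,)
    ).fetchone()
    return row[0] if row else None


print(lookup("b3-digest-placeholder"), lookup("sha-digest-placeholder"))
```

Both digests resolve to the same artifact id, which is the property that lets two parties who recorded different algorithms agree on identity.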

## Configuration

Set the primary algorithm, plus any extra algorithms to compute for each command, in `roar.toml`:

```toml
[hash]
primary = "blake3"            # default; what every artifact gets
run = ["sha256"]              # additional algos for `roar run` outputs
get = ["sha256"]              # additional algos for `roar get` inputs
put = ["sha256"]              # additional algos for `roar put` uploads
```

Per-invocation overrides:

| Flag | Effect |
|---|---|
| `--hash <algo>` | Add `<algo>` to the computed set for this invocation, on top of primary + config. Repeatable. |
| `--hash-only <algo>` | Use *only* `<algo>` — skip the primary and config entries. Repeatable. |

Valid algorithm names are `blake3`, `sha256`, `sha512`, and `md5`, plus `composite-blake3` for composite artifacts (see below).
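The way the config and the two flags combine can be modelled as a small pure function. This is a sketch of the semantics described in the table, not `roar`'s code: `resolve_algorithms` is a hypothetical name, and letting `--hash-only` win when both flags are given is an assumption.

```python
def resolve_algorithms(
    primary: str,
    config_extra: list[str],
    hash_flags: tuple[str, ...] = (),
    hash_only_flags: tuple[str, ...] = (),
) -> list[str]:
    """Return the ordered, de-duplicated algorithms for one invocation."""
    if hash_only_flags:
        # --hash-only: skip the primary and the config entries entirely.
        return list(dict.fromkeys(hash_only_flags))
    # --hash: additive on top of primary + config; dict keeps first-seen order.
    return list(dict.fromkeys([primary, *config_extra, *hash_flags]))


print(resolve_algorithms("blake3", ["sha256"], hash_flags=("sha512",)))
# ['blake3', 'sha256', 'sha512']
print(resolve_algorithms("blake3", ["sha256"], hash_only_flags=("md5",)))
# ['md5']
```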

## Composite artifacts

When `roar` registers a directory as a single artifact (a [composite artifact](/docs/core-concepts#composite-artifacts), such as a model checkpoint or a dataset directory), its hash is `composite-blake3`: a Merkle-tree-style hash derived from the blake3 hashes of every component file plus their paths. The properties:

- **Deterministic.** Same directory contents and same path layout produce the same `composite-blake3`.
- **Content-aware.** If any component file changes, the composite hash changes.
- **Cheap to verify partially.** You can verify a subset of components against the composite hash without rehashing the whole tree.

`composite-blake3` is recorded alongside any standard hashes of the root, so both lookup styles work.
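The first two properties can be demonstrated with a simplified hash-of-hashes. This is not the actual `composite-blake3` format: sha256 stands in for blake3, the byte layout is an assumption, and the flat (single-level) construction does not support the partial verification a real Merkle tree allows. `composite_digest` and `_make_tree` are hypothetical helpers.

```python
import hashlib
import tempfile
from pathlib import Path


def composite_digest(root: Path) -> str:
    """Hash a directory from its files' relative paths and content digests."""
    entries = sorted(
        (p.relative_to(root).as_posix(), p)
        for p in root.rglob("*")
        if p.is_file()
    )
    outer = hashlib.sha256()
    for rel, path in entries:
        rel_bytes = rel.encode("utf-8")
        leaf = hashlib.sha256(path.read_bytes()).digest()
        # Length-prefix the path so adjacent fields cannot be confused.
        outer.update(len(rel_bytes).to_bytes(8, "big"))
        outer.update(rel_bytes)
        outer.update(leaf)
    return outer.hexdigest()


def _make_tree(root: Path, weights: bytes) -> None:
    (root / "shards").mkdir(parents=True)
    (root / "config.json").write_bytes(b"{}")
    (root / "shards" / "0.bin").write_bytes(weights)


with tempfile.TemporaryDirectory() as tmp:
    a, b, c = (Path(tmp) / name for name in ("a", "b", "c"))
    _make_tree(a, b"weights-v1")
    _make_tree(b, b"weights-v1")  # same layout and bytes as `a`
    _make_tree(c, b"weights-v2")  # one component differs
    same = composite_digest(a) == composite_digest(b)
    changed = composite_digest(a) != composite_digest(c)
    print(same, changed)  # True True
```

Sorting by relative path before hashing is what makes the result independent of filesystem enumeration order.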

## ETag and S3

Artifacts published or fetched via S3 carry an `etag` hash recorded by the [proxy](/docs/proxy). ETag is an S3-managed string: for single-part uploads it is typically the MD5 of the body (this does not hold for objects encrypted with SSE-KMS or SSE-C); for multipart uploads it is `MD5(concat(part MD5s))-<N>`, where `<N>` is the number of parts, which is not a digest of the content. `roar` stores ETag as one of the artifact's hash algorithms but treats `blake3` (when present) as the source of truth for content identity. See [Proxy: ETag and content hashing](/docs/proxy#etag-and-content-hashing).
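The multipart formula explains why the same bytes can carry different ETags. A small sketch of the two cases, using only `hashlib` (the function names are illustrative, and server-side encryption is ignored):

```python
import hashlib


def single_part_etag(data: bytes) -> str:
    """ETag for a plain single-part upload: the MD5 of the body."""
    return hashlib.md5(data).hexdigest()


def multipart_etag(data: bytes, part_size: int) -> str:
    """ETag for a multipart upload: MD5 over the parts' raw MD5s, plus -<N>."""
    if part_size <= 0 or not data:
        raise ValueError("need a positive part size and non-empty data")
    parts = [data[i : i + part_size] for i in range(0, len(data), part_size)]
    joined = b"".join(hashlib.md5(part).digest() for part in parts)
    return f"{hashlib.md5(joined).hexdigest()}-{len(parts)}"


body = b"a" * 10
print(single_part_etag(body))
print(multipart_etag(body, 4))  # ends in -3
print(multipart_etag(body, 5))  # ends in -2, different digest for the same bytes
```

Since the result depends on the part size chosen at upload time, an ETag alone cannot establish that two objects hold the same content; a recorded content hash can.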

## What hashes appear in `roar show`

```bash
roar show /home/ubuntu/mnist/train_feats.npz
```

The artifact-detail view lists every hash recorded for the artifact, labeled by algorithm. For blake3 specifically, the view shows a `≡ b3sum <path>` breadcrumb: the digest `roar` computed should match what the upstream `b3sum` CLI tool produces for the same file. That gives you an independent verification path outside `roar`.

For other algorithms there's no equivalent breadcrumb today; the simplest verifications are `sha256sum <path>` for sha256 and `openssl md5 <path>` for md5.

## Where to look next

- [Benchmarks](/docs/benchmarks#hashes) — actual throughput numbers for blake3 vs. sha256 across file sizes.
- [Proxy](/docs/proxy) — S3 ETag mechanics and how `roar` reconciles them with content hashes.
- [Core Concepts](/docs/core-concepts) — content-addressable identity and how it shapes the DAG.
- [Scopes](/docs/scopes) — the dedup model that makes "same hash, different scopes" work.
