Composite Artifacts
Composite artifacts are roar's answer to scale. MNIST has tens of files. LAION-scale image collections have hundreds of millions. The same `roar run python train.py` should produce the same DAG shape for both — one input node for "the dataset," not one per file. That's the invariant; the rest of this page is how it's preserved as cardinality grows.
Why composite artifacts?
Some artifacts aren't files — they're datasets. A Zarr store with a million chunks, a Lance dataset across thousands of fragments, a TFRecord split into hundreds of shards: each individual file means very little on its own. The thing your lineage actually depends on is the dataset as a single coherent unit, with one identity and one content hash.
Composite artifacts are how roar represents that. One artifact, one content hash, but aware of every component file underneath it. When a job consumes a 5-million-file dataset, the DAG shows a single input node — not five million.
Without composites, every chunk is a separate artifact, every job has millions of inputs, the DAG becomes unreadable, and dedup loses meaning ("did this run use the same dataset?" becomes "did all five million hashes match?"). With composites, it collapses to one identity check.
What gets recognized
roar detects composite-shaped outputs automatically. The recognition heuristic scores candidate directories on signals like:
- Known dataset extensions — `.parquet`, `.lance`, `.tfrecord`, `.mcap`, `.csv`, `.jsonl`, image trees, plus the directory-as-dataset patterns each implies.
- Manifest files — `_metadata`, `manifest.json`, `_SUCCESS`, and the other "this is the table of contents" markers.
- Partition layout — directories named `key=value` (Hive-style partitioning) or numbered shards (`part-00001-of-00064.parquet`, `shard_0001.tfrecord`).
- High cardinality with similar shape — many files with the same extension and similar names in one place.
When the score crosses a threshold, that directory becomes one composite artifact with `kind="composite"` and a `composite-blake3` hash (see Hashes). The component files are still recorded internally so you can introspect them — but the artifact in the DAG is one node.
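To make the scoring concrete, here's a minimal sketch of how a heuristic like this might combine the signals above. The weights, threshold, regexes, and function names are illustrative only — roar's actual matchers and scoring weights live in `roar/execution/recording/dataset_identifier.py`.

```python
import re
from collections import Counter
from pathlib import Path

# Illustrative values only -- not roar's real weights or threshold.
DATASET_EXTS = {".parquet", ".lance", ".tfrecord", ".mcap", ".csv", ".jsonl"}
MANIFESTS = {"_metadata", "manifest.json", "_SUCCESS"}
SHARD_RE = re.compile(r"(part-\d+-of-\d+|shard_\d+)")
HIVE_RE = re.compile(r"^[^=/]+=[^=/]+$")   # key=value partition dirs
THRESHOLD = 3.0

def score_directory(root: Path) -> float:
    """Score a directory on composite-shaped signals; higher = more dataset-like."""
    files = [p for p in root.rglob("*") if p.is_file()]
    if not files:
        return 0.0
    score = 0.0
    ext_counts = Counter(p.suffix for p in files)
    # Known dataset extensions.
    if any(ext in DATASET_EXTS for ext in ext_counts):
        score += 1.0
    # Manifest / table-of-contents markers.
    if any(p.name in MANIFESTS for p in files):
        score += 1.0
    # Hive-style partition dirs or numbered shards.
    if any(HIVE_RE.match(d.name) for d in root.rglob("*") if d.is_dir()):
        score += 1.0
    if any(SHARD_RE.search(p.name) for p in files):
        score += 1.0
    # High cardinality with one dominant extension.
    _ext, n = ext_counts.most_common(1)[0]
    if n >= 100 and n / len(files) > 0.9:
        score += 1.0
    return score

def is_composite(root: Path) -> bool:
    return score_directory(root) >= THRESHOLD
```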
You can also annotate explicitly when you want a directory treated as a composite regardless of heuristics. See the `[composites]` config in `roar.toml` for the knobs.
Probabilistic membership (Bloom filters)
For modest datasets — say a few thousand files — roar keeps a full list of component hashes in the composite's record. You can ask "is this file in this dataset?" and get an exact yes/no.
For large datasets — a million chunks, ten million shards — storing every component hash isn't practical. roar switches to a Bloom filter: a compact probabilistic data structure that answers "is this hash a member?" with yes-probably (false positives are possible) or no-definitely (false negatives are not). The composite still has one content hash; what changes is the storage cost (a few bits per component in the filter, instead of a full hash per component) and the cost of the membership query (a handful of hash applications instead of a lookup against the full list).
The trade is the standard Bloom one: a configurable false-positive rate (typically 1 in 10⁶) buys an index of a few bytes per component — roughly 29 bits each at that rate, versus a 32-byte hash in the full list. False negatives don't happen: if the filter says "no", the file is genuinely not part of the composite. A false positive means "probably yes; rehash to confirm if you need certainty."
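A minimal sketch of the idea, sized with the standard Bloom formulas m = −n·ln p / (ln 2)² bits and k = (m/n)·ln 2 hash functions. The `BloomFilter` class, its double-hashing scheme, and the `blake2b` keying are illustrative — this is not roar's internal API.

```python
import hashlib
import math

class BloomFilter:
    """Textbook Bloom filter sized for n items at false-positive rate p."""

    def __init__(self, n: int, p: float = 1e-6):
        # Standard sizing: m bits, k hash functions.
        self.m = math.ceil(-n * math.log(p) / math.log(2) ** 2)  # total bits
        self.k = max(1, round(self.m / n * math.log(2)))          # hash count
        self.bits = bytearray((self.m + 7) // 8)

    def _positions(self, item: bytes):
        # Derive k bit positions via double hashing of one digest.
        digest = hashlib.blake2b(item, digest_size=16).digest()
        h1 = int.from_bytes(digest[:8], "little")
        h2 = int.from_bytes(digest[8:], "little") | 1
        for i in range(self.k):
            yield (h1 + i * h2) % self.m

    def add(self, item: bytes) -> None:
        for pos in self._positions(item):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def __contains__(self, item: bytes) -> bool:
        # "no" is definite; "yes" is probable (false-positive rate ~p).
        return all(self.bits[pos // 8] & (1 << (pos % 8))
                   for pos in self._positions(item))

# At p = 1e-6 this is ~28.8 bits (~3.6 bytes) per component: ten million
# component hashes fit in roughly 36 MB, versus ~320 MB for a full list
# of 32-byte hashes.
bloom = BloomFilter(n=10_000_000, p=1e-6)
bloom.add(b"\x01" * 32)               # a component's content hash
assert (b"\x01" * 32) in bloom        # yes-probably
assert (b"\x02" * 32) not in bloom    # no-definitely (with high probability)
```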
The Bloom approach is what makes the headline invariant hold at the high end. A LAION-scale composite carries a compact membership filter whether it has ten million components or a hundred million; the DB row for "this dataset" stays the same shape; the DAG still shows one input node.
Today: input datasets
In current roar, composite artifacts are primarily about input datasets: the parquet/lance/tfrecord/mcap stores that feed training. When `roar run python train.py` consumes a directory shaped like a dataset, the DAG records a single dataset input rather than the per-shard files.
Output artifacts today are recorded individually — a single `model.pt` is one artifact, a `metrics.json` is another. That works because typical training output is a small number of files.
Future: model bundles
The natural next step is model bundles — multi-file model outputs that should also collapse to one artifact. A Hugging Face checkpoint directory (`config.json` + `tokenizer.json` + `model.safetensors`) is a clear case. A PyTorch sharded checkpoint (`pytorch_model-00001-of-00008.bin`, …, `index.json`) is another. A TensorFlow SavedModel directory is a third. All multi-file, all conceptually one artifact, none currently collapsed.
Naming note. Calling these "model bundles" keeps them distinct from input datasets in CLI output and DAG visualizations. The underlying storage model — `kind="composite"` plus `composite-blake3` — is shared with datasets; only the human-facing label differs. Open to alternative names if "model bundle" doesn't fit; "composite output" and "checkpoint composite" were considered but read as either too generic or too narrow.
Model bundles aren't shipping yet. Composite detection on the output side is a planned next step, layered on top of the same recognition and Bloom-filter machinery already used for input datasets.
Configuration
Under `[composites]` in `roar.toml`:
| Key | Default | Effect |
|---|---|---|
| `composites.detect_inputs` | `true` | Auto-recognize composite-shaped input directories. |
| `composites.detect_outputs` | (varies) | Auto-recognize composite-shaped output directories (where supported). |
| `composites.bloom_false_positive_rate` | `1e-6` | Target false-positive rate for the Bloom membership filter on large composites. |
| `composites.full_membership_threshold` | (varies) | Component count above which roar switches from a full hash-list membership record to a Bloom filter. |
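A sketch of what this looks like in practice — the values below are illustrative, not recommended defaults:

```toml
# roar.toml -- illustrative values, not recommended defaults
[composites]
detect_inputs = true
detect_outputs = false
bloom_false_positive_rate = 1e-6    # ~29 bits per component in the filter
full_membership_threshold = 10000   # below this, keep the exact hash list
```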
For the heuristic implementation and full set of recognized patterns, see `roar/execution/recording/dataset_identifier.py` in the roar repo — the family of regex matchers and scoring weights is documented in code.
Where to look next
- Hashes — how `composite-blake3` is computed and how it interacts with multi-algo storage.
- Core Concepts: Composite Artifacts — the conceptual placement in the DAG model.
- Labels — auto-labels (`dataset.modality`, `dataset.type`) that roar attaches to recognized composites.