Composite Artifacts

Composite artifacts are roar's answer to scale. MNIST has tens of files. LAION-scale image collections have hundreds of millions. The same roar run python train.py should produce the same DAG shape for both — one input node for "the dataset," not one per file. That's the invariant; the rest of this page is how it's preserved as cardinality grows.

Why composite artifacts?

Some artifacts aren't files — they're datasets. A Zarr store with a million chunks, a Lance dataset across thousands of fragments, a TFRecord split into hundreds of shards: each individual file means very little on its own. The thing your lineage actually depends on is the dataset as a single coherent unit, with one identity and one content hash.

Composite artifacts are how roar represents that. One artifact, one content hash, but aware of every component file underneath it. When a job consumes a 5-million-file dataset, the DAG shows a single input node — not five million.

Without composites, every chunk is a separate artifact and every job has millions of inputs. The DAG becomes unreadable, and dedup loses meaning: "did this run use the same dataset?" becomes "did all five million hashes match?". With composites, that question collapses to a single identity check.

What gets recognized

roar uses structural detection — not a scored heuristic. It recognizes six declared dataset formats by looking for format-specific marker files in the directory:

| Format | Marker |
| --- | --- |
| Zarr (v2 or v3) | .zarray, .zgroup, or zarr.json |
| LeRobot | meta/info.json + data/chunk-*/*.parquet layout |
| Lance | *.lance/ directory |
| RLDS / TFDS | dataset_info.json + features.json |
| WebDataset | Two or more .tar shards |
| Flat parquet shards | Two or more .parquet files |

A directory that matches one of these is treated as a single composite artifact, regardless of how many component files it contains. Directories that don't match any structural pattern are recorded as individual files.
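The marker table above can be read as a sequence of structural checks. The following is a minimal sketch of that idea, not roar's actual implementation: the function name, the order of the checks, the returned strings for Lance, WebDataset, and parquet, and the choice to look only at the top level of the directory are all assumptions made for illustration.

```python
from pathlib import Path
from typing import Optional


def detect_dataset_type(root: Path) -> Optional[str]:
    """Return a format name if `root` matches one of the markers, else None.

    Hypothetical helper: check order and return values are assumptions.
    """
    # Zarr v2 stores carry .zarray/.zgroup; Zarr v3 stores carry zarr.json.
    if any((root / name).is_file() for name in (".zarray", ".zgroup", "zarr.json")):
        return "zarr"
    # LeRobot: meta/info.json plus parquet files under data/chunk-*/.
    if (root / "meta" / "info.json").is_file() and any(root.glob("data/chunk-*/*.parquet")):
        return "lerobot"
    # Lance: the directory itself, or a direct child, is named *.lance (assumed reading).
    if root.name.endswith(".lance") or any(p.is_dir() for p in root.glob("*.lance")):
        return "lance"
    # RLDS / TFDS: both metadata files must be present.
    if (root / "dataset_info.json").is_file() and (root / "features.json").is_file():
        return "rlds"
    # WebDataset: two or more .tar shards.
    if len(list(root.glob("*.tar"))) >= 2:
        return "webdataset"
    # Flat parquet shards: two or more .parquet files.
    if len(list(root.glob("*.parquet"))) >= 2:
        return "parquet"
    # No structural match: the caller would record individual files.
    return None
```

Because each check is a yes/no test on directory structure, the result does not depend on any score or threshold; a directory either carries a marker or it does not.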

To force a directory to be treated as a composite regardless of structure, use roar put --dataset <path>.

Nested composites (composites of composites) are also supported. A top-level directory containing multiple manifest-bearing sub-datasets (e.g., a repo with both a Zarr store and a LeRobot dataset) is recorded as a nested composite: each sub-dataset is tracked as its own composite, with the parent acting as a container.

When composites form

The composite artifact — with its content hash and membership index — is formed only at roar get, roar put, and roar register boundaries, not during roar run.

During roar run, roar does detect dataset structure and attaches dataset.id labels to the job, so roar show will display a labeled dataset node. But the composite itself (the artifact entity with its composite-blake3 or composite-sha256 hash) is not materialized until the data crosses one of those boundaries.

In practice this means:

  • roar get hf://... — forms a composite-sha256 anchor at download time (see below).
  • roar put <dir> — forms a composite-blake3 artifact from the directory at upload time.
  • roar register — forms composites from any manifest-bearing inputs or outputs recorded by the run (Zarr, Lance, LeRobot, RLDS).

Downloading datasets from Hugging Face

roar get hf://datasets/<owner>/<name>
roar get hf://datasets/<owner>/<name>@<commit>   # pin to a specific revision
roar get hf://datasets/<owner>/<name> --limit 1000   # partial download

When roar get fetches a Hugging Face dataset, it forms a full-dataset anchor: a composite-sha256 composite over the entire dataset manifest (using LFS sha256 oids from the HF repo metadata, without downloading every file just to compute the hash).
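To make the "hash from metadata alone" idea concrete, here is a rough sketch of how an anchor could be derived from a manifest of LFS oids. The exact serialization roar uses is not described on this page, so the function name, the input shape (a mapping from repo-relative path to sha256 oid), and the path/oid encoding are assumptions.

```python
import hashlib
from typing import Mapping


def manifest_anchor(entries: Mapping[str, str]) -> str:
    """Hash a manifest of {repo-relative path: LFS sha256 oid}.

    Hypothetical encoding: one "path NUL oid newline" record per file,
    in sorted path order so the result is independent of listing order.
    """
    hasher = hashlib.sha256()
    for path in sorted(entries):
        hasher.update(f"{path}\0{entries[path]}\n".encode("utf-8"))
    return hasher.hexdigest()
```

Only the manifest is read here; no file bytes are touched, which is why this kind of anchor can be computed before (or without) downloading the data.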

Partial downloads and views. A --limit N download doesn't mint a separate composite for the subset. Instead the job records the full-dataset anchor (stable across runs) alongside a selector (first:N) describing which portion was consumed. This keeps dataset identity clean — roar show can tell you "job used first 1,000 rows of dataset X@commit Y" without fragmenting the lineage graph into per-subset composites.
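One way to picture the anchor-plus-selector split is as a small record attached to the job. This is an illustrative data structure only: the class and field names are hypothetical, and only the first:N selector form comes from the text above.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class DatasetInput:
    """Hypothetical shape of what a job records for a dataset input."""
    anchor_hash: str          # composite-sha256 of the full dataset manifest
    revision: Optional[str]   # commit pinned with @<commit>, if any
    selector: Optional[str]   # e.g. "first:1000"; None means the whole dataset


def same_dataset(a: DatasetInput, b: DatasetInput) -> bool:
    # Identity is decided by the anchor alone; the selector only narrows the view.
    return a.anchor_hash == b.anchor_hash
```

Under this model, a full download and a --limit 1000 download of the same revision compare as the same dataset while still being distinguishable as different views.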

--full-anchor extends the anchor to include identity-bearing non-LFS files (e.g., dataset_info.json) even when they exceed the standard 64 MB budget cutoff.

Probabilistic membership (Bloom filters)

For every composite, roar builds both a full component list (for local introspection) and a Bloom filter (for fast membership queries regardless of cardinality). The Bloom filter is computed unconditionally — not just for large datasets.

The Bloom filter answers "is this file in this dataset?" with yes-probably or no-definitely:

  • False negatives don't happen — if the filter says no, the file is genuinely absent.
  • False positives are possible at a rate of ~0.1% (1 in 1,000). Rehash to confirm if you need certainty.
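The two guarantees above follow from how a Bloom filter is built. Below is a generic sketch sized for a 0.1% false-positive rate using the standard formulas; it is not roar's implementation. The class name, the use of blake2b with double hashing, and the choice of string keys (a path or a content hash) are assumptions.

```python
import hashlib
import math


class BloomFilter:
    """Generic Bloom filter sketch sized for a target false-positive rate."""

    def __init__(self, capacity: int, fp_rate: float = 0.001):
        capacity = max(1, capacity)
        # Standard sizing: m = -n * ln(p) / (ln 2)^2 bits, k = (m / n) * ln 2 hashes.
        self.num_bits = max(8, math.ceil(-capacity * math.log(fp_rate) / math.log(2) ** 2))
        self.num_hashes = max(1, round(self.num_bits / capacity * math.log(2)))
        self.bits = bytearray((self.num_bits + 7) // 8)

    def _positions(self, item: str):
        # Double hashing: derive k bit positions from two 64-bit halves of one digest.
        digest = hashlib.blake2b(item.encode("utf-8"), digest_size=16).digest()
        h1 = int.from_bytes(digest[:8], "big")
        h2 = int.from_bytes(digest[8:], "big") | 1  # force odd so the stride is never zero
        for i in range(self.num_hashes):
            yield (h1 + i * h2) % self.num_bits

    def add(self, item: str) -> None:
        for pos in self._positions(item):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def __contains__(self, item: str) -> bool:
        # Every bit set by add() stays set, so an added item can never test False.
        return all(self.bits[pos // 8] & (1 << (pos % 8)) for pos in self._positions(item))
```

At a 0.1% target the formulas work out to roughly 14.4 bits and 10 hash positions per component, so the filter stays small relative to a full component list even at high cardinality.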

When uploading to GLaaS, the component list is capped at 1,000 entries / 90 KB for the wire payload. The full local record is not truncated.

Content hashes for composites

There are two composite hash algorithms, depending on how the composite was formed:

  • composite-blake3 — used for locally-formed composites (via roar put or roar register). A path-sensitive Merkle-tree-style hash: blake3 over each component file's hash and relative path, combined up the tree. Same directory contents and layout always produce the same hash.
  • composite-sha256 — used for Hugging Face datasets fetched via roar get hf://. Derived from the LFS sha256 oids in the HF manifest, so the anchor can be computed from metadata alone without downloading every file.
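As an illustration of what "path-sensitive, combined up the tree" can mean, here is a small sketch of a directory tree hash. It is not roar's composite-blake3 algorithm: blake3 is not in the Python standard library, so hashlib.blake2b is swapped in as a stand-in, and the function names, the length-prefixed encoding, and the "file"/"dir" tags are assumptions.

```python
import hashlib
from pathlib import Path


def _h(*parts: bytes) -> bytes:
    # Stand-in hash (blake2b, NOT blake3). Length prefixes keep part boundaries unambiguous.
    hasher = hashlib.blake2b(digest_size=32)
    for part in parts:
        hasher.update(len(part).to_bytes(8, "big"))
        hasher.update(part)
    return hasher.digest()


def tree_hash(root: Path) -> str:
    """Hash a directory so that both file contents and relative layout matter."""

    def node(path: Path) -> bytes:
        if path.is_file():
            # Leaf: hash of the file contents (read whole for brevity).
            return _h(b"file", path.read_bytes())
        # Interior node: bind each child's name to its hash, in sorted order.
        children = sorted(path.iterdir(), key=lambda p: p.name)
        entries = [_h(child.name.encode("utf-8"), node(child)) for child in children]
        return _h(b"dir", *entries)

    return node(root).hex()
```

The root directory's own name is deliberately left out, so copying the same tree to a different location yields the same hash, while renaming or moving a file inside the tree changes it.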

Both algorithms are valid composite hash identifiers. roar show, roar diff, and the GLaaS UI handle both.

Auto-labels on recognized composites

When roar forms a composite artifact, it automatically attaches system-managed labels describing the dataset's content:

  • dataset.id — a file:// URI identifying the local root path.
  • dataset.modality — inferred content type: tabular, image, text, or mixed.
  • dataset.format — primary file format(s) found in the composite (e.g., parquet, zarr).
  • dataset.type — structural kind, matching the detection category (e.g., zarr, lerobot, rlds).

These labels are also queryable in the GLaaS UI for filtering runs by dataset.

Where to look next

  • Hashes — how composite-blake3 and composite-sha256 are computed and how they interact with multi-algo storage.
  • Core Concepts: Composite Artifacts — the conceptual placement in the DAG model.
  • Labels — full reference for auto-labels and how to query by dataset.modality, dataset.type, etc.