Proxy

Why a proxy?

Not all the I/O roar cares about is local. Modern ML workflows pull training data from S3, write checkpoints to S3, push final artifacts to S3. The tracer model — observing read() / write() / open() on the worker's local files — sees none of that. To capture S3 lineage, roar needs to sit between the client and the cloud and watch the requests flow past.

That's what the roar proxy is: a small reverse proxy that forwards S3 traffic from your scripts to AWS unchanged, while recording GetObject and PutObject operations as lineage events. You don't change your scripts. You just point AWS clients at the proxy via AWS_ENDPOINT_URL and the proxy does the rest.

Status: AWS only today. GCP support is on the roadmap. Azure isn't planned in the near term. If you need a different cloud's object store traced, that's a follow-up; see Limitations below.

Quick use

roar proxy                       # show status (enabled? binary found? daemon running?)
roar proxy enable                # enable the proxy for `roar run`
roar proxy disable               # disable it
roar proxy start                 # start a standalone daemon (prints its port)
roar proxy stop                  # stop the standalone daemon

When enabled, roar run will spawn the proxy on a job-scoped port and set AWS_ENDPOINT_URL in the wrapped command's environment automatically. For ad-hoc use outside roar run, start a standalone daemon and point your script at it:

roar proxy start
# … "Proxy daemon started (pid=…, port=9090)." …
export AWS_ENDPOINT_URL=http://127.0.0.1:9090
python my_script.py             # S3 calls are now traced
roar proxy stop

What gets captured

Per S3 operation, the proxy records:

  • Bucket and key (the s3://bucket/key path becomes the artifact's path).
  • Operation type: GetObject → input edge; PutObject / UploadPart / CompleteMultipartUpload → output edge; DeleteObject → publication-style write.
  • Object size in bytes (from Content-Length on responses, request body for uploads).
  • ETag — recorded as the artifact's hash. See ETag and content hashing below.
  • Session and job IDs — propagated through request headers so each captured event lands on the right job in your DAG.

Captured events become regular input/output rows on the job record, just like local file I/O. roar dag and roar show treat S3 artifacts the same as local ones.
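The operation-to-edge mapping above can be sketched in a few lines of Python. This is a hypothetical illustration of the rule, not roar's actual API; the real recording layer lives inside the Rust proxy:

```python
# Illustrative sketch of the S3 operation -> lineage edge mapping
# described above. The function name and return values are hypothetical.
OUTPUT_OPS = {"PutObject", "UploadPart", "CompleteMultipartUpload"}

def edge_for(op: str) -> str:
    """Map an S3 operation name to the lineage edge it produces."""
    if op == "GetObject":
        return "input"
    if op in OUTPUT_OPS:
        return "output"
    if op == "DeleteObject":
        return "publication-style write"
    return "ignored"   # other operations (e.g. listings) produce no edge
```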

How it works

The proxy is a small Rust binary (roar-proxy) running an axum HTTP server. It:

  1. Listens on a loopback port (default 9090; job-scoped when launched by roar run).
  2. Receives each S3 request that the AWS client would have sent directly.
  3. Re-signs the request with the caller's credentials and forwards it to the real S3 endpoint (https://s3.<region>.amazonaws.com by default).
  4. Streams the response back to the caller — small/medium responses buffered, large objects streamed through so download throughput isn't bottlenecked on the proxy.
  5. Extracts the relevant metadata from request/response headers (bucket, key, size, ETag) and emits a structured log line that the recording layer picks up.

Re-signing is what makes the proxy transparent to your AWS clients: from boto3's perspective, the response came from S3 normally. Your credentials never leave the local process — the proxy uses the standard AWS credential provider chain to re-sign.
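The metadata extraction in step 5 can be illustrated with a short Python sketch. The real proxy does this in Rust; the function below is hypothetical and handles the two common S3 addressing styles (path-style, where the bucket leads the URL path, and virtual-hosted-style, where it leads the Host header):

```python
from urllib.parse import unquote, urlsplit

def bucket_and_key(path: str, host: str = "") -> tuple[str, str]:
    """Hypothetical sketch: recover (bucket, key) from a proxied S3 request.

    Assumes percent-encoded keys and no encoded "/" inside key names,
    which is enough for illustration.
    """
    path = unquote(urlsplit(path).path).lstrip("/")
    # Virtual-hosted style: bucket is the first Host label, key is the path.
    label, dot, rest = host.partition(".")
    if dot and rest.startswith("s3"):
        return label, path
    # Path-style: first path segment is the bucket, the rest is the key.
    bucket, _, key = path.partition("/")
    return bucket, key
```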

For chained setups (LocalStack, MinIO, behind a corporate proxy), the daemon accepts an --upstream override that points it at a non-AWS endpoint. Same observation behavior; different forwarding target.

ETag and content hashing

S3 returns an ETag on every successful upload and read. roar records it as the artifact's hash — but ETag is not the strict content digest you might expect:

  • Single-part PutObject: ETag is the MD5 of the object body. Effectively a content digest for that object size.
  • Multipart upload (anything above the client's multipart threshold; S3 parts must be at least 5 MB, except the last): ETag is MD5(concat(MD5(part_1), MD5(part_2), …))-<N>, where N is the number of parts. This is not a content digest: it depends on the part-size choice made by the uploading client, not just on the bytes.

The practical consequence: two byte-identical multipart uploads with different part sizes produce different ETags. An S3 artifact's etag hash is therefore reliable as an identity check within a single client, but it is not a safe basis for asserting content equality across clients.
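The multipart ETag scheme above can be checked directly in Python. This is a sketch of S3's documented construction, not roar code:

```python
import hashlib

def multipart_etag(data: bytes, part_size: int) -> str:
    """Compute an S3-style ETag: MD5 of the body for a single part,
    else MD5 over the concatenated per-part MD5 digests, plus "-<N>"."""
    parts = [data[i:i + part_size] for i in range(0, len(data), part_size)]
    if len(parts) <= 1:
        return hashlib.md5(data).hexdigest()
    digests = b"".join(hashlib.md5(p).digest() for p in parts)
    return f"{hashlib.md5(digests).hexdigest()}-{len(parts)}"

blob = b"x" * (12 * 1024 * 1024)                    # 12 MB of identical bytes
etag_5mb = multipart_etag(blob, 5 * 1024 * 1024)    # 3 parts
etag_8mb = multipart_etag(blob, 8 * 1024 * 1024)    # 2 parts
assert etag_5mb != etag_8mb                          # same bytes, different ETags
```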

When roar can record both an ETag and a content blake3 for the same artifact (typically when the same file is also opened locally via builtins.open and streamed through _TrackedWriteFile), it stores both — see Hashes for how multiple hash algorithms are kept per artifact.

For perf characteristics of the proxy under load — overhead per request, throughput compared to direct-to-S3 — see Benchmarks.

Configuration

Under [proxy] in roar.toml:

| Key | Default | Effect |
| --- | --- | --- |
| proxy.enabled | false | When true, roar run auto-spawns the proxy and sets AWS_ENDPOINT_URL in the wrapped command. |
| proxy.port | 9090 | Default listen port for the standalone daemon. roar run uses a job-scoped port to avoid collisions across concurrent runs. |
| proxy.upstream_endpoint | (none) | If set, the proxy forwards to this URL instead of https://s3.<region>.amazonaws.com. Use to chain through LocalStack, MinIO, or a corporate proxy. |
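Assembled from the keys above, an illustrative roar.toml fragment (the LocalStack URL is an example, not a default):

```toml
[proxy]
enabled = true                                  # let roar run auto-spawn the proxy
port = 9090                                     # standalone daemon listen port
# upstream_endpoint = "http://localhost:4566"   # uncomment to chain through e.g. LocalStack
```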

Environment variables:

  • AWS_ENDPOINT_URL — set by roar run (when proxy is enabled) so AWS clients route through the proxy. Override for ad-hoc use as shown in Quick use.
  • ROAR_UPSTREAM_S3_ENDPOINT — alternative way to configure the upstream override. Useful when you want to set the proxy mid-script.

Supported clouds

  • AWS S3 — fully supported. Uses the AWS SDK's request format and re-signs requests with the caller's AWS credentials.
  • S3-compatible endpoints (LocalStack, MinIO, R2, etc.) — supported via --upstream. Same request format as AWS S3; the proxy just forwards to a different URL.
  • GCP Cloud Storage — on the roadmap. GCS uses a different request format and auth model; the proxy needs work to handle that.
  • Azure Blob Storage — not planned for the immediate term.

Validation

The proxy is exercised end-to-end against MinIO in CI (running in Docker via the same Compose harness used for the Ray integration). Tests cover single-part uploads, multipart uploads, downloads, attribution to the correct job under nested subprocess boundaries, and proxy log collection across Ray worker nodes. See tests/backends/ray/e2e/test_proxy_logs_collection.py and test_nested_subprocess_s3_lineage.py in the roar repo for the coverage.

For real-AWS validation (real network, real S3 credentials), see the live test suite under tests/backends/ray/live/.

Limitations

  • AWS only today. GCP is planned; Azure is not in scope near term.
  • ETag is not a content digest for multipart uploads. See the section above. If you need strict content equality across uploaders, also capture the local content hash (via roar's open patches or by writing the artifact locally before uploading).
  • Non-HTTPS transports aren't observed. The proxy only handles the standard HTTPS request flow; custom transports (raw sockets, proprietary wire formats) won't be captured.
  • Credentials are passed through, not stored. The proxy uses the standard AWS credential provider chain — it never reads or caches your secret keys to disk. But the credentials are in process memory while a request is being signed; treat the proxy daemon's process the same way you treat any other AWS client on the box.
  • Pre-signed URLs (get_object via signed URL, etc.) — captured if and only if the request flows through AWS_ENDPOINT_URL. Pre-signed URLs that include the AWS hostname in the URL bypass the env-var setting and won't be intercepted.
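The pre-signed URL caveat is easy to see by inspecting one. The URL below is a fabricated example (real pre-signed URLs carry more query parameters, e.g. X-Amz-Signature); the point is that the AWS hostname is baked into the URL itself, so the client never consults AWS_ENDPOINT_URL:

```python
from urllib.parse import urlsplit

# Hypothetical pre-signed URL for illustration only.
presigned = (
    "https://mybucket.s3.us-east-1.amazonaws.com/model.pt"
    "?X-Amz-Algorithm=AWS4-HMAC-SHA256&X-Amz-Expires=3600"
)
host = urlsplit(presigned).netloc
print(host)   # the client connects to this host directly,
              # bypassing the proxy on 127.0.0.1:9090
```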

Where to look next

  • Hashes — how roar reconciles ETags with content-based digests, and configures multiple hash algorithms per artifact.
  • Ray — Ray workers spawn per-node proxies automatically; see the Ray integration page for the multi-node story.
  • Benchmarks — proxy overhead numbers.
  • roar Guide — the full CLI reference.