## Why a proxy?

Not all the I/O `roar` cares about is local. Modern ML workflows pull training data from S3, write checkpoints to S3, push final artifacts to S3. The tracer model — observing `read()` / `write()` / `open()` on the worker's local files — sees none of that. To capture S3 lineage, `roar` needs to sit between the client and the cloud and watch the requests flow past.

That's what the **`roar` proxy** is: a small reverse proxy that forwards S3 traffic from your scripts to AWS unchanged, while recording object operations such as `GetObject` and `PutObject` as lineage events (the full list is under [What gets captured](#what-gets-captured)). You don't change your scripts. You just point AWS clients at the proxy via `AWS_ENDPOINT_URL` and the proxy does the rest.

> **Status: AWS only today.** GCP support is on the roadmap. Azure isn't planned in the near term. If you need a different cloud's object store traced, that's a follow-up; see [Limitations](#limitations) below.

## Quick use

```bash
roar proxy                       # show status (enabled? binary found? daemon running?)
roar proxy enable                # enable the proxy for `roar run`
roar proxy disable               # disable it
roar proxy start                 # start a standalone daemon (prints its port)
roar proxy stop                  # stop the standalone daemon
```

When enabled, `roar run` will spawn the proxy on a job-scoped port and set `AWS_ENDPOINT_URL` in the wrapped command's environment automatically. For ad-hoc use outside `roar run`, start a standalone daemon and point your script at it:

```bash
roar proxy start
# … "Proxy daemon started (pid=…, port=9090)." …
export AWS_ENDPOINT_URL=http://127.0.0.1:9090
python my_script.py             # S3 calls are now traced
roar proxy stop
```
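The same ad-hoc setup can be driven from Python when a script launches its own child processes. This is a minimal sketch, assuming the standalone daemon is already running on the port that `roar proxy start` printed; `proxied_env` is a hypothetical helper, not part of `roar`.

```python
import os


def proxied_env(port, base_env=None):
    """Return a copy of the environment with AWS_ENDPOINT_URL set to a local proxy.

    `port` is whatever `roar proxy start` printed. This helper only builds a
    dict; it does not start, stop, or talk to the proxy itself.
    """
    env = dict(os.environ if base_env is None else base_env)
    env["AWS_ENDPOINT_URL"] = f"http://127.0.0.1:{port}"
    return env


# The parent's environment is left untouched; only the returned copy changes.
child_env = proxied_env(9090)
print(child_env["AWS_ENDPOINT_URL"])
```

Passing the result as `env=` to `subprocess.run([...])` scopes the proxy to that one child process instead of exporting the variable for the whole shell.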

## What gets captured

Per S3 operation, the proxy records:

- **Bucket and key** (the `s3://bucket/key` path becomes the artifact's path).
- **Operation type** — `GetObject` → input edge, `PutObject` / `UploadPart` / `CompleteMultipartUpload` → output edge, `DeleteObject` → publication-style write.
- **Object size** in bytes (from `Content-Length` on responses, request body for uploads).
- **ETag** — recorded as the artifact's hash. See [ETag and content hashing](#etag-and-content-hashing) below.
- **Session and job IDs** — propagated through request headers so each captured event lands on the right job in your DAG.

Captured events become regular input/output rows on the job record, just like local file I/O. `roar dag` and `roar show` treat S3 artifacts the same as local ones.
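To make the shape of a captured event concrete, here is an illustrative sketch of the fields listed above and the operation-to-edge mapping. The class name `S3Event` and its field names are assumptions made for this example; the source does not show `roar`'s actual record format.

```python
from dataclasses import dataclass

# Operation -> lineage edge kind, as listed in the bullets above.
EDGE_BY_OPERATION = {
    "GetObject": "input",
    "PutObject": "output",
    "UploadPart": "output",
    "CompleteMultipartUpload": "output",
    "DeleteObject": "publication-style write",
}


@dataclass(frozen=True)
class S3Event:
    """Hypothetical shape of one captured event (illustrative field names)."""

    operation: str
    bucket: str
    key: str
    size: int
    etag: str
    session_id: str
    job_id: str

    @property
    def artifact_path(self):
        # The s3://bucket/key path becomes the artifact's path.
        return f"s3://{self.bucket}/{self.key}"

    @property
    def edge(self):
        return EDGE_BY_OPERATION[self.operation]


event = S3Event("GetObject", "training-data", "shards/0001.parquet",
                1048576, "example-etag", "session-1", "job-1")
print(event.artifact_path, event.edge)
```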

## How it works

The proxy is a small Rust binary (`roar-proxy`) running an `axum` HTTP server. It:

1. Listens on a loopback port (default `9090`; job-scoped when launched by `roar run`).
2. Receives each S3 request that the AWS client would have sent directly.
3. Re-signs the request with the caller's credentials and forwards it to the real S3 endpoint (`https://s3.<region>.amazonaws.com` by default).
4. Returns the response to the caller. Small and medium responses are buffered; large objects are streamed through so download throughput isn't bottlenecked on the proxy.
5. Extracts the relevant metadata from request/response headers (bucket, key, size, ETag) and emits a structured log line that the recording layer picks up.
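Step 5 depends on recognizing which S3 operation a raw HTTP request represents. The sketch below shows one way to do that from the method, path, and query string, following the public S3 REST API. It is written in Python for readability and is not the Rust code in `roar-proxy`; it assumes path-style URLs (`/bucket/key`) and ignores headers, so it cannot tell `CopyObject` apart from `PutObject`.

```python
from urllib.parse import parse_qs, unquote, urlsplit


def classify(method, url):
    """Map a path-style S3 request to (operation, bucket, key), or None.

    Only the five operations the proxy records are recognized; anything else
    (listing, tagging, CreateMultipartUpload, ...) returns None.
    """
    parts = urlsplit(url)
    query = parse_qs(parts.query, keep_blank_values=True)
    bucket, _, key = parts.path.lstrip("/").partition("/")
    if not bucket or not key:
        return None  # bucket-level or service-level request
    key = unquote(key)

    has_upload_id = "uploadId" in query
    if method == "GET" and not query:
        op = "GetObject"
    elif method == "PUT" and has_upload_id and "partNumber" in query:
        op = "UploadPart"
    elif method == "PUT" and not query:
        op = "PutObject"
    elif method == "POST" and has_upload_id:
        op = "CompleteMultipartUpload"
    elif method == "DELETE" and not query:
        op = "DeleteObject"
    else:
        return None
    return op, bucket, key


print(classify("GET", "http://127.0.0.1:9090/my-bucket/data/train.csv"))
print(classify("PUT", "http://127.0.0.1:9090/my-bucket/ckpt.bin?partNumber=1&uploadId=abc"))
```

A production classifier would also need to handle virtual-hosted-style requests (bucket in the `Host` header) and `GetObject` calls that carry query parameters such as `versionId`.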

Re-signing is what makes the proxy transparent to your AWS clients: from boto3's perspective, the response came from S3 as usual. Your credentials never leave the local machine; the proxy uses the standard AWS credential provider chain to re-sign.
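Why re-sign at all instead of forwarding the client's signature? AWS Signature Version 4 covers the `Host` header, so a request signed for the proxy's loopback address would presumably be rejected if it were forwarded to S3 unchanged. The sketch below is a simplified SigV4 computation for a query-less request, written from the public SigV4 specification; it is not `roar-proxy`'s code, and the secret key is a placeholder.

```python
import hashlib
import hmac


def _hmac(key: bytes, msg: str) -> bytes:
    return hmac.new(key, msg.encode("utf-8"), hashlib.sha256).digest()


def sigv4_signature(secret_key, method, host, path, amz_date, region, payload=b""):
    """Compute a SigV4 signature for a query-less S3 request (simplified).

    Assumes `path` is already URI-encoded and that only three headers are signed.
    """
    date = amz_date[:8]  # YYYYMMDD prefix of the ISO 8601 basic timestamp
    payload_hash = hashlib.sha256(payload).hexdigest()
    # Canonical headers: lowercase names, sorted, each terminated by "\n".
    canonical_headers = (
        f"host:{host}\n"
        f"x-amz-content-sha256:{payload_hash}\n"
        f"x-amz-date:{amz_date}\n"
    )
    signed_headers = "host;x-amz-content-sha256;x-amz-date"
    canonical_request = "\n".join(
        [method, path, "", canonical_headers, signed_headers, payload_hash]
    )
    scope = f"{date}/{region}/s3/aws4_request"
    string_to_sign = "\n".join([
        "AWS4-HMAC-SHA256",
        amz_date,
        scope,
        hashlib.sha256(canonical_request.encode("utf-8")).hexdigest(),
    ])
    # Signing key: chained HMACs over date, region, service, and a fixed suffix.
    key = _hmac(("AWS4" + secret_key).encode("utf-8"), date)
    key = _hmac(key, region)
    key = _hmac(key, "s3")
    key = _hmac(key, "aws4_request")
    return hmac.new(key, string_to_sign.encode("utf-8"), hashlib.sha256).hexdigest()


common = dict(secret_key="not-a-real-secret", method="GET",
              path="/my-bucket/data.csv", amz_date="20240101T000000Z",
              region="us-east-1")
via_proxy = sigv4_signature(host="127.0.0.1:9090", **common)
direct = sigv4_signature(host="s3.us-east-1.amazonaws.com", **common)
print(via_proxy != direct)  # True: the signed Host differs, so the signature differs
```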

For chained setups (LocalStack, MinIO, behind a corporate proxy), the daemon accepts an `--upstream` override that points it at a non-AWS endpoint. Same observation behavior; different forwarding target.

## ETag and content hashing

S3 returns an **ETag** on every successful upload and read. `roar` records it as the artifact's hash — but ETag is not the strict content digest you might expect:

- **Single-part `PutObject`**: ETag is the MD5 of the object body, so it is effectively a content digest. (Objects encrypted with SSE-KMS or SSE-C are an exception: their ETags are not MD5 digests.)
- **Multipart upload** (used above the client's multipart threshold; boto3's managed transfers default to 8 MiB for both the threshold and the part size, and S3 requires every part except the last to be at least 5 MiB): ETag is `MD5(concat(MD5(part_1), MD5(part_2), …))-<N>`, where `N` is the number of parts. **This is not a content digest**: it depends on the part size chosen by the uploading client, not just on the bytes.

The practical consequence: two byte-identical multipart uploads with different part sizes produce different ETags. So an S3 artifact's `etag` hash is reliable as an identity check *within a single client* configuration, but you cannot assume that equal content yields equal ETags across clients.
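The two ETag rules can be reproduced locally with the standard library. This is a sketch of the formulas described above, not code from `roar`; the part sizes are tiny so the example runs instantly, whereas real S3 multipart uploads need parts of at least 5 MiB (except the last).

```python
import hashlib


def single_part_etag(data: bytes) -> str:
    """ETag of an unencrypted single-part upload: the hex MD5 of the body."""
    return hashlib.md5(data, usedforsecurity=False).hexdigest()


def multipart_etag(data: bytes, part_size: int) -> str:
    """ETag of a multipart upload: MD5 over the concatenated binary part MD5s, then -N.

    Assumes non-empty `data`, since a multipart upload has at least one part.
    """
    parts = [data[i:i + part_size] for i in range(0, len(data), part_size)]
    digests = b"".join(hashlib.md5(p, usedforsecurity=False).digest() for p in parts)
    return f"{hashlib.md5(digests, usedforsecurity=False).hexdigest()}-{len(parts)}"


data = bytes(range(256)) * 64  # 16 KiB of sample bytes

print(single_part_etag(data))
print(multipart_etag(data, 4096))  # 4 parts
print(multipart_etag(data, 8192))  # 2 parts, same bytes, different ETag
```

All three values describe the same bytes, yet none of them match, which is why an ETag alone cannot establish content equality across uploaders.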

When `roar` can record both an ETag and a content blake3 for the same artifact (typically when the same file is also opened locally via `builtins.open` and streamed through `_TrackedWriteFile`), it stores both — see [Hashes](/docs/hashes) for how multiple hash algorithms are kept per artifact.

For the proxy's performance characteristics under load (overhead per request, throughput compared to direct-to-S3), see [Benchmarks](/docs/benchmarks#proxy).

## Configuration

Under `[proxy]` in `roar.toml`:

| Key | Default | Effect |
|---|---|---|
| `proxy.enabled` | `false` | When `true`, `roar run` auto-spawns the proxy and sets `AWS_ENDPOINT_URL` in the wrapped command. |
| `proxy.port` | `9090` | Default listen port for the standalone daemon. `roar run` uses a job-scoped port to avoid collisions across concurrent runs. |
| `proxy.upstream_endpoint` | (none) | If set, the proxy forwards to this URL instead of `https://s3.<region>.amazonaws.com`. Use to chain through LocalStack, MinIO, or a corporate proxy. |

Environment variables:

- `AWS_ENDPOINT_URL`: set by `roar run` (when the proxy is enabled) so AWS clients route through the proxy. Set it yourself for ad-hoc use, as shown in [Quick use](#quick-use).
- `ROAR_UPSTREAM_S3_ENDPOINT`: an alternative way to configure the upstream override. Useful when you want to set the upstream from within a script rather than in `roar.toml`.

## Supported clouds

- **AWS S3** — fully supported. Uses the AWS SDK's request format and re-signs requests with the caller's AWS credentials.
- **S3-compatible endpoints (LocalStack, MinIO, R2, etc.)** — supported via `--upstream`. Same request format as AWS S3; the proxy just forwards to a different URL.
- **GCP Cloud Storage** — on the roadmap. GCS uses a different request format and auth model; the proxy needs work to handle that.
- **Azure Blob Storage** — not planned for the immediate term.

## Validation

The proxy is exercised end-to-end against MinIO in CI (running in Docker via the same Compose harness used for the Ray integration). Tests cover single-part uploads, multipart uploads, downloads, attribution to the correct job under nested subprocess boundaries, and proxy log collection across Ray worker nodes. See `tests/backends/ray/e2e/test_proxy_logs_collection.py` and `test_nested_subprocess_s3_lineage.py` in the `roar` repo for the coverage.

For real-AWS validation (real network, real S3 credentials), see the live test suite under `tests/backends/ray/live/`.

## Limitations

- **AWS only today.** GCP is planned; Azure is not in scope near term.
- **ETag is not a content digest for multipart uploads.** See [the section above](#etag-and-content-hashing). If you need strict content equality across uploaders, also capture the local content hash (via `roar`'s `open` patches or by writing the artifact locally before uploading).
- **Non-HTTP transports are not observed.** The proxy only handles the standard S3 request flow over HTTP(S): plain HTTP from the client to the loopback proxy, and HTTPS from the proxy to S3 by default. Custom transports (raw sockets, proprietary wire formats) won't be observed.
- **Credentials are passed through, not stored.** The proxy uses the standard AWS credential provider chain — it never reads or caches your secret keys to disk. But the credentials *are* in process memory while a request is being signed; treat the proxy daemon's process the same way you treat any other AWS client on the box.
- **Pre-signed URLs (`get_object` via signed URL, etc.)** — captured if and only if the request flows through `AWS_ENDPOINT_URL`. Pre-signed URLs that include the AWS hostname in the URL bypass the env-var setting and won't be intercepted.
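For the pre-signed URL case, a quick way to predict whether a given URL will be intercepted is to compare its host and port with the proxy endpoint. The helper below is a hypothetical illustration using only the standard library; it compares strings and performs no DNS resolution, so `localhost` and `127.0.0.1` count as different hosts.

```python
from urllib.parse import urlsplit

_DEFAULT_PORTS = {"http": 80, "https": 443}


def routes_through_proxy(url, endpoint_url):
    """Return True if `url` targets the same scheme, host, and port as the proxy."""

    def origin(value):
        parts = urlsplit(value)
        port = parts.port or _DEFAULT_PORTS.get(parts.scheme)
        return parts.scheme, parts.hostname, port

    return origin(url) == origin(endpoint_url)


proxy = "http://127.0.0.1:9090"
print(routes_through_proxy("http://127.0.0.1:9090/bucket/key?X-Amz-Signature=abc", proxy))
print(routes_through_proxy("https://bucket.s3.us-east-1.amazonaws.com/key?X-Amz-Signature=abc", proxy))
```

The first URL targets the proxy and would be captured; the second embeds the AWS hostname and would bypass it.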

## Where to look next

- [Hashes](/docs/hashes) — how `roar` reconciles ETags with content-based digests, and configures multiple hash algorithms per artifact.
- [Ray](/docs/ray) — Ray workers spawn per-node proxies automatically; see the Ray integration page for the multi-node story.
- [Benchmarks](/docs/benchmarks#proxy) — proxy overhead numbers.
- [roar Guide](/docs/roar-guide) — the full CLI reference.
