HUD Documentation — Evaluations and RL Environments.

A built environment image is the end product for your tasks: one build packs every task from a single definition, and the same image runs unchanged on HUD, on your own infra, in CI, or on your laptop. Running one task is always the same exchange — start (get the prompt), the agent works, grade (get the reward). That’s the HUD protocol; packaging just decides where the container that serves it comes from.
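The start / work / grade exchange described above can be sketched with a toy in-memory environment. This is only an illustration of the shape of the exchange: `ToyEnvironment`, `run_rollout`, and the `add` task are hypothetical names invented here, not part of the HUD SDK, and a real environment serves the same exchange from a container.

```python
from dataclasses import dataclass, field


@dataclass
class ToyEnvironment:
    """Hypothetical stand-in for a container that serves tasks (not a HUD class)."""
    prompts: dict
    expected: dict
    started: set = field(default_factory=set)

    def start(self, task: str) -> str:
        # start: hand the agent the prompt for this task
        self.started.add(task)
        return self.prompts[task]

    def grade(self, task: str, answer: str) -> float:
        # grade: score the agent's answer and return the reward
        if task not in self.started:
            raise RuntimeError(f"task {task!r} was never started")
        return 1.0 if answer.strip() == self.expected[task] else 0.0


def run_rollout(env: ToyEnvironment, task: str, agent) -> float:
    prompt = env.start(task)        # 1. start: get the prompt
    answer = agent(prompt)          # 2. the agent works
    return env.grade(task, answer)  # 3. grade: get the reward


env = ToyEnvironment(prompts={"add": "What is 2 + 3?"}, expected={"add": "5"})
reward = run_rollout(env, "add", agent=lambda prompt: "5")
print(reward)
```

Whatever serves the environment, the caller only ever sees these two calls, which is why the packaging choices below do not change the task itself.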

Package it: `hud deploy`

This is the recommended path. hud deploy builds your environment on HUD from its Dockerfile.hud (scaffolded by hud init) and registers it under the name in your Environment(...) declaration, in one step and with no local Docker required. Then publish your tasks as a named taskset:

hud deploy
hud sync tasks my-taskset

hud deploy uploads the build context, builds the image on HUD, streams the build logs, and registers the environment (rebuilding in place if the name already exists).
hud sync tasks my-taskset diffs your tasks against the remote taskset and uploads only what changed.
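To make "uploads only what changed" concrete, here is a minimal sketch of one way such a diff could work, using content hashes. It is an assumption for illustration, not a description of how hud sync is implemented; `fingerprint`, `plan_sync`, and the task dictionaries are hypothetical.

```python
import hashlib
import json


def fingerprint(task: dict) -> str:
    # Sort keys so that dict ordering does not change the digest.
    payload = json.dumps(task, sort_keys=True).encode("utf-8")
    return hashlib.sha256(payload).hexdigest()


def plan_sync(local: dict, remote_fingerprints: dict) -> dict:
    """Compare local task definitions against fingerprints of the remote taskset."""
    upload, unchanged = [], []
    for name, task in sorted(local.items()):
        if remote_fingerprints.get(name) == fingerprint(task):
            unchanged.append(name)   # identical content: nothing to send
        else:
            upload.append(name)      # new or modified: needs uploading
    # Tasks that exist only remotely are reported, not acted on, in this sketch.
    remote_only = sorted(set(remote_fingerprints) - set(local))
    return {"upload": upload, "unchanged": unchanged, "remote_only": remote_only}


local = {"fix_bug_easy": {"difficulty": 1}, "fix_bug_hard": {"difficulty": 3}}
remote = {
    "fix_bug_easy": fingerprint({"difficulty": 1}),  # same content as local
    "fix_bug_hard": fingerprint({"difficulty": 2}),  # edited locally since last sync
    "old_task": "stale-fingerprint",                 # no longer defined locally
}
plan = plan_sync(local, remote)
print(plan)
```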

Pass build-time config with --env KEY=VALUE / --env-file .env, --build-arg, and --secret. From the platform UI you then run batches, compare models on the same taskset, and browse every trace.

Pick where it runs: the runtime

In code, where a task runs is decided by the runtime you pass at execution time; the task definition never changes. The same task.run(agent, runtime=…) call targets any substrate:

run.py

from hud import HUDRuntime, HostedRuntime, LocalRuntime, DockerRuntime, Runtime

HUDRuntime()                       # local agent loop against a HUD-hosted env
HostedRuntime()                    # run the whole rollout on HUD's hosted infra
LocalRuntime("env.py")             # a local child process (fastest iteration)
DockerRuntime("my-env")            # a fresh local container per rollout
Runtime("tcp://host:8765")         # attach to a container started elsewhere

run.py

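from hud import HUDRuntime
from hud.agents import create_agent

agent = create_agent("claude-sonnet-4-5")
# fix_bug is a task from your environment; await this from inside an async function
job = await fix_bug(difficulty=3).run(agent, runtime=HUDRuntime())
print(job.reward)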

HUDRuntime() is the natural pair with hud deploy: the platform leases an instance, brings your deployed image up on it, and the SDK drives the env through the runtime tunnel. Use HostedRuntime() when the whole rollout should run remotely on the platform.

Run on your own infra

A runtime is just a function: given a task, start a container somewhere and yield its control-channel URL. That one function is the entire integration surface for any sandbox provider — Daytona, Modal, E2B, Runloop, or your own Kubernetes:

run.py

from contextlib import asynccontextmanager
from hud import Runtime

@asynccontextmanager
async def modal_runtime(task):
    sandbox = await start_my_sandbox(image="my-env")   # placeholder for your provider's call that spins the container up
    try:
        yield Runtime(f"tcp://{sandbox.host}:{sandbox.port}")
    finally:
        await sandbox.terminate()                       # …and tears it down

job = await fix_bug(difficulty=3).run(agent, runtime=modal_runtime)

DockerRuntime and LocalRuntime are just the built-in versions of this. Anything that can start your image and hand back a URL plugs in with no change to the environment or the task — that’s what “run anywhere” means concretely.
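The lifecycle this pattern guarantees (start, hand back a URL, always tear down) can be checked without any infrastructure. The sketch below is self-contained and makes several assumptions: `FakeSandbox`, `fake_runtime`, and `rollout` are hypothetical stand-ins, and the context manager yields a plain URL string rather than a Runtime object so that it runs without the SDK installed.

```python
import asyncio
from contextlib import asynccontextmanager


class FakeSandbox:
    """Hypothetical sandbox handle; a real provider returns its own object."""

    def __init__(self, host: str, port: int):
        self.host, self.port = host, port

    async def terminate(self) -> None:
        pass  # a real provider would stop the container here


events = []  # records the order of lifecycle steps


@asynccontextmanager
async def fake_runtime(task):
    sandbox = FakeSandbox("127.0.0.1", 8765)
    events.append("started")
    try:
        yield f"tcp://{sandbox.host}:{sandbox.port}"  # the control-channel URL
    finally:
        await sandbox.terminate()  # runs even if the rollout raises
        events.append("terminated")


async def rollout(task, runtime, fail=False) -> float:
    async with runtime(task) as url:
        events.append(f"connected {url}")
        if fail:
            raise RuntimeError("agent crashed")
        return 1.0


async def main() -> float:
    reward = await rollout("fix_bug", fake_runtime)
    try:
        await rollout("fix_bug", fake_runtime, fail=True)
    except RuntimeError:
        pass  # teardown has already happened by the time we get here
    return reward


reward = asyncio.run(main())
print(events)
```

The try/finally is the important design choice: a crashed rollout still releases its sandbox, so a failing agent does not leave containers running on your infra.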

A self-contained image

For a fully-local artifact with no HUD account, build the image directly from the scaffolded Dockerfile.hud and drive a task with the packaged CLI — docker exec runs the commands inside the container, so nothing needs to be exposed:

docker build -f Dockerfile.hud -t my-env .

docker run -d --name run1 my-env
docker exec run1 hud task start fix_bug          # -> the prompt
docker exec run1 hud task grade fix_bug --answer "…"   # -> the reward
docker rm -f run1

hud task start returns the prompt; the agent works; hud task grade returns the reward — no source, no open port (hud task list shows what an image exposes).
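If you want to drive the same commands from a script rather than a shell, a thin wrapper over `subprocess` is enough. This is a hedged sketch: the helper names are hypothetical, the argument lists simply mirror the commands shown above, and the format of what the CLI prints is not assumed, so the output is returned as raw text.

```python
import subprocess


def exec_argv(container: str, *hud_args: str) -> list:
    # Build the argument list for: docker exec <container> hud <args...>
    return ["docker", "exec", container, "hud", *hud_args]


def start_argv(container: str, task: str) -> list:
    return exec_argv(container, "task", "start", task)


def grade_argv(container: str, task: str, answer: str) -> list:
    return exec_argv(container, "task", "grade", task, "--answer", answer)


def run(argv: list) -> str:
    # Needs Docker and a running container; raises CalledProcessError on a
    # non-zero exit status. Returns whatever the command printed, unparsed.
    result = subprocess.run(argv, check=True, capture_output=True, text=True)
    return result.stdout.strip()


# Usage, with the container from the example above already running:
#   prompt = run(start_argv("run1", "fix_bug"))
#   reward_text = run(grade_argv("run1", "fix_bug", answer))
print(start_argv("run1", "fix_bug"))
```

Passing the answer as its own list element avoids shell quoting problems when the agent's answer contains spaces or quotes.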

Reproducible by construction. Each rollout gets its own fresh container, so results reproduce across runs and machines and one rollout never leaks state into the next. Keep per-task setup in @env.initialize so every run starts from the same state.

GPU environments (e.g. robot sims) need extra docker run flags, which you pass through the runtime's run_args: DockerRuntime(image, run_args=["--gpus", "all"]). For sims with multi-minute boots, prefer one long-lived container reused via Runtime(url) over a fresh DockerRuntime per rollout.
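A rough back-of-the-envelope calculation shows why reuse matters once boots take minutes. The numbers are hypothetical (a 180-second boot, 60-second rollouts, 20 rollouts run one after another) and the functions are illustrative only.

```python
def total_seconds_fresh(rollouts: int, boot_s: int, rollout_s: int) -> int:
    # A fresh container per rollout pays the boot cost every time.
    return rollouts * (boot_s + rollout_s)


def total_seconds_reused(rollouts: int, boot_s: int, rollout_s: int) -> int:
    # One long-lived container pays the boot cost once.
    return boot_s + rollouts * rollout_s


fresh = total_seconds_fresh(rollouts=20, boot_s=180, rollout_s=60)
reused = total_seconds_reused(rollouts=20, boot_s=180, rollout_s=60)
print(fresh, reused)  # 4800 1380
```

The trade-off is that a reused container no longer starts each rollout from a fresh state by construction, so per-task setup in @env.initialize has to reset whatever the previous rollout changed.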

Next steps

Run on any model

The agent side: any model or harness drives the same task.

Designing tasks for signal

Compose a taskset that actually trains.

Train on your tasks

Turn the rewards you collected into GRPO advantages.

Harbor interop

Load existing benchmarks straight into the runtime.
