The hud eval command is the central tool for evals: it drives every eval and works the same
whether your environment is deployed or a local env.py. Every rollout traces to your platform.
There are two workflows, and they differ only in where the environment lives:
| Workflow | When to use it | Where you start it |
|---|---|---|
| Deployed environment | The env is built and hosted on the platform | Terminal or platform |
| Local environment | The env.py is on your disk | Terminal or run script |
Running a deployed environment
A deployed environment has been built and published to the platform with hud deploy. It runs entirely on hosted
infra, so there’s no local process to manage - you start a run from the terminal or from the platform,
and both trace to the same place.
From the terminal, pass the taskset name or id to hud eval. The tasks are fetched from the platform
and each rollout runs remotely on hosted infra.
Choosing an agent
The agent name (claude, openai, gemini) selects a built-in harness and routes calls through the
HUD gateway, where one HUD_API_KEY covers every provider. Switching models is a
single flag, and hud models list shows every model the gateway knows.
Reading traces and results
Each rollout is a trace: a replayable timeline of everything the agent did and the reward it earned. Traces are grouped into a job and shown on the platform. You can also read them from the terminal.
Running a local environment
A local environment is an env.py on your disk - the usual case while you’re still developing it.
hud eval is the usual way to run it; when you want programmatic control, you can instead write
your own run script that calls Taskset.run. Both take a runtime, the one argument that decides
where each rollout runs. The environment definition never changes - only the runtime does.
With a HUD_API_KEY set, local runs still trace to the platform. Without one, they
run and grade entirely on your machine with no platform calls.
With hud eval
hud eval spawns the env subprocess for you, so a purely local run needs no hud serve, no Docker, and
no API key. Point it at your task source, pass an agent name, and set --runtime when you want a rollout
to run somewhere other than your machine.
By default each rollout runs in a child process started from your env.py. The --runtime flag
moves that placement elsewhere without touching the environment. See hud eval for the
full flag set.
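When a script needs to launch the same eval under several placements, the command line can be assembled in Python. This is a minimal sketch, not the documented syntax: it assumes the task source and the agent name are positional arguments in that order (the text above says both are passed, but does not spell out the order), build_eval_command is a hypothetical helper, and tasks.json is a placeholder task source. Only the --runtime flag comes from the text above; check hud eval --help before relying on the exact form.

```python
import shlex
from typing import List, Optional


def build_eval_command(source: str, agent: str, runtime: Optional[str] = None) -> List[str]:
    """Assemble an argv list for a hud eval invocation.

    Assumption: task source and agent name are positional, in that order.
    --runtime is the only flag used, because it is the only one named in the text.
    """
    argv = ["hud", "eval", source, agent]
    if runtime is not None:
        argv += ["--runtime", runtime]
    return argv


# Default placement: a child process on this machine.
local_cmd = build_eval_command("tasks.json", "claude")

# Same tasks and agent, but the rollouts run against an env served elsewhere.
served_cmd = build_eval_command("tasks.json", "claude", runtime="tcp://localhost:8765")

print(shlex.join(local_cmd))
print(shlex.join(served_cmd))
```

Either list can be handed to subprocess.run(cmd, check=True) to start the run. Building the arguments as a list rather than a single string avoids shell quoting problems when the task source path contains spaces.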
With a run script
When you want programmatic control - looping over agents, feeding a training pipeline, routing tasks to different infra - call Taskset.run directly and hand it a runtime object. This
is the same eval hud eval runs, written out in Python.
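The shape of such a script can be sketched without the SDK. The block below is not hud code: StubRuntime, StubTaskset, and upper_agent are hypothetical stand-ins, and the real Taskset.run signature may differ. It only illustrates the idea that the tasks and the agent stay fixed while the runtime object is the single thing that changes between runs.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List


@dataclass(frozen=True)
class StubRuntime:
    """Hypothetical stand-in for a runtime object: it only records a placement."""
    placement: str


@dataclass
class StubTaskset:
    """Hypothetical stand-in for a taskset: prompts mapped to expected answers."""
    tasks: Dict[str, str]

    def run(self, agent: Callable[[str], str], runtime: StubRuntime) -> List[dict]:
        # One rollout per task; the reward is 1.0 when the answer matches exactly.
        results = []
        for prompt, expected in self.tasks.items():
            answer = agent(prompt)
            results.append({
                "task": prompt,
                "placement": runtime.placement,
                "reward": 1.0 if answer == expected else 0.0,
            })
        return results


def upper_agent(prompt: str) -> str:
    """Toy agent that upper-cases the prompt."""
    return prompt.upper()


taskset = StubTaskset(tasks={"abc": "ABC", "hud": "hud"})

# Same taskset, same agent; only the runtime argument differs between runs.
for runtime in (StubRuntime("child process"), StubRuntime("container")):
    results = taskset.run(upper_agent, runtime)
    mean_reward = sum(r["reward"] for r in results) / len(results)
    print(runtime.placement, mean_reward)
```

Because the runtime is passed in rather than baked into the taskset, the loop body never needs to know where a rollout executes.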
Changing the runtime object changes where the rollouts run, without touching env.py or the tasks:
| Runtime | Where the env runs |
|---|---|
| LocalRuntime("env.py") | A child process on your machine |
| DockerRuntime("my-env") | A fresh local container per rollout |
| ModalRuntime("my-env") | A fresh Modal sandbox per rollout |
| DaytonaRuntime("my-env") | A fresh Daytona sandbox per rollout |
| HUDRuntime() | Hosted infra, after hud deploy |
| Runtime("tcp://host:port") | A substrate you started yourself |
Serving an environment directly
hud eval starts and stops the env subprocess for you. hud serve instead exposes the same control
channel as a standalone, long-lived process - useful for talking to a running env from a script, testing
a packaged image by hand, or driving hud task start and hud task grade yourself.
Connect to a served environment from a script with Runtime("tcp://localhost:8765"), or from hud eval with
--runtime tcp://localhost:8765. The protocol reference describes every message
the channel speaks.
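Before pointing a runtime at a served environment, it can help to confirm that something is listening on the address. The following is a minimal sketch using only the Python standard library; parse_tcp_address and is_listening are hypothetical helper names, not part of hud. The check only tests that the TCP port accepts connections, not that the control channel behind it is healthy.

```python
import socket
from typing import Tuple
from urllib.parse import urlsplit


def parse_tcp_address(url: str) -> Tuple[str, int]:
    """Split a tcp://host:port string into (host, port)."""
    parts = urlsplit(url)
    if parts.scheme != "tcp" or parts.hostname is None or parts.port is None:
        raise ValueError(f"expected tcp://host:port, got {url!r}")
    return parts.hostname, parts.port


def is_listening(url: str, timeout: float = 1.0) -> bool:
    """Return True if a TCP connection to the address succeeds within the timeout."""
    host, port = parse_tcp_address(url)
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and unresolvable hosts.
        return False


if __name__ == "__main__":
    address = "tcp://localhost:8765"
    print(address, "is listening" if is_listening(address) else "is not reachable")
```

Running the check first turns a confusing mid-run connection failure into an early, explicit message.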
With runs in hand, turn the reward spread into model updates - covered in
training agents.