HUD Documentation - Evaluations and RL Environments.

This guide outlines how to run an evaluation: pointing an agent at tasks, choosing where each rollout runs, and reading the reward. The hud eval command is the central tool: it drives every eval and works the same whether your environment is deployed or a local env.py. Rollouts are traced to the platform; local runs do so only when a HUD_API_KEY is set. There are two workflows, and they differ only in where the environment lives:

Workflow	When to use it	Where you start it
Deployed environment	The env is built and hosted on the platform	Terminal or platform
Local environment	The `env.py` is on your disk	Terminal or run script

Running a deployed environment

A deployed environment has been built and published to the platform with hud deploy. It runs entirely on hosted infra, so there’s no local process to manage - you start a run from the terminal or from the platform, and both trace to the same place.

From the terminal, pass the taskset name or id to hud eval. The tasks are fetched from the platform and each rollout runs remotely on hosted infra.

hud eval "My Taskset" claude
hud eval "My Taskset" claude --all --group 3

From the platform, open hud.ai, pick the environment, choose a taskset and a model, and launch - no CLI required.

Choosing an agent

The agent name (claude, openai, gemini) selects a built-in harness and routes calls through the HUD gateway, where one HUD_API_KEY covers every provider. Switching models is a single flag, and hud models list shows every model the gateway knows.

hud eval "My Taskset" claude --model claude-haiku-4-5   # a cheaper model for fast iteration
hud eval "My Taskset" openai --model gpt-5
hud eval "My Taskset" gemini

For a custom loop - a fine-tuned model, a framework you already use - see bring your own harness.

Reading traces and results

Each rollout is a trace: a replayable timeline of everything the agent did and the reward it earned. Traces are grouped into a job and shown on the platform. You can also read them from the terminal:

hud jobs               # recent jobs - id, name, taskset, status
hud jobs <job-id>      # the traces in one job
hud trace <trace-id>   # a single rollout in full
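Once you have the rewards for a job, a per-task summary makes the spread across rollouts easier to read. The sketch below is plain Python and does not call the HUD SDK or parse CLI output: the (task_id, reward) pairs are a hypothetical input format that you would assemble yourself from whatever the platform or your run script reports.

```python
from collections import defaultdict
from statistics import mean, pstdev


def summarize_rewards(rollouts):
    """Group (task_id, reward) pairs by task and report count, mean, and spread.

    The input format is an assumption for illustration, not a HUD data type.
    """
    by_task = defaultdict(list)
    for task_id, reward in rollouts:
        by_task[task_id].append(float(reward))
    return {
        task_id: {
            "n": len(rewards),
            "mean": mean(rewards),
            # Population standard deviation; 0.0 when a task has one rollout.
            "spread": pstdev(rewards),
        }
        for task_id, rewards in by_task.items()
    }


if __name__ == "__main__":
    # Hypothetical rewards: three rollouts of one task, one rollout of another.
    example = [("task-a", 1.0), ("task-a", 0.0), ("task-a", 1.0), ("task-b", 0.5)]
    for task_id, stats in summarize_rewards(example).items():
        print(task_id, stats)
```

A task whose spread is zero across several rollouts is either always solved or never solved; tasks with a non-zero spread are the ones where the agent's behaviour actually varies.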

Running a local environment

A local environment is an env.py on your disk - the usual case while you’re still developing it. The standard way to run it is hud eval; when you want programmatic control, write a run script that calls Taskset.run instead. Both take a runtime, the one argument that decides where each rollout runs. The environment definition never changes - only the runtime does.

With a HUD_API_KEY set, local runs still trace to the platform. Without one, they run and grade entirely on your machine with no platform calls.

With `hud eval`

hud eval spawns the env subprocess for you, so a purely local run needs no hud serve, no Docker, and no API key. Point it at your task source, pass an agent name, and set --runtime when you want a rollout to run somewhere other than your machine.

By default each rollout runs in a child process started from your env.py. The --runtime flag moves the rollout elsewhere without touching the environment. See hud eval for the full flag set.

terminal

hud eval tasks.py claude               # first task, one rollout
hud eval tasks.py claude --all         # every task
hud eval tasks.py claude --group 3     # 3 rollouts per task
hud eval tasks.py claude --runtime hud # on hosted infra
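If you launch these commands from a script (for example to sweep several agents), it can help to build the argument list in one place. The helper below is a hypothetical convenience, not part of the CLI: it only assembles the flags shown in this guide (--model, --all, --group, --runtime) and assumes they can be combined freely, which you should confirm against the hud eval reference.

```python
import shlex


def build_eval_command(source, agent, *, model=None, all_tasks=False,
                       group=None, runtime=None):
    """Return an argv list for `hud eval` using only flags shown in this guide.

    `source` is a taskset name/id or a task file such as "tasks.py".
    """
    argv = ["hud", "eval", source, agent]
    if model is not None:
        argv += ["--model", model]
    if all_tasks:
        argv.append("--all")
    if group is not None:
        argv += ["--group", str(group)]
    if runtime is not None:
        argv += ["--runtime", runtime]
    return argv


if __name__ == "__main__":
    # Print shell-quoted commands for each built-in agent name.
    for agent in ("claude", "openai", "gemini"):
        print(shlex.join(build_eval_command("tasks.py", agent, group=3)))
```

Passing the returned list to subprocess.run(argv, check=True) avoids shell quoting problems with taskset names that contain spaces.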

With a run script

When you want programmatic control - looping over agents, feeding a training pipeline, routing tasks to different infra - call Taskset.run directly and hand it a runtime object. This is the same eval hud eval runs, written out in Python.

run.py

import asyncio
from hud import Taskset, LocalRuntime
from hud.agents import create_agent

agent = create_agent("claude-sonnet-4-5")
ts = Taskset.from_file("tasks.py")

async def main():
    job = await ts.run(agent, runtime=LocalRuntime("env.py"))
    print(job.reward)

asyncio.run(main())

The runtime is where you set placement. Swap it for another and the env runs somewhere else, with no change to env.py or the tasks:

Runtime	Where the env runs
`LocalRuntime("env.py")`	A child process on your machine
`DockerRuntime("my-env")`	A fresh local container per rollout
`ModalRuntime("my-env")`	A fresh Modal sandbox per rollout
`DaytonaRuntime("my-env")`	A fresh Daytona sandbox per rollout
`HUDRuntime()`	Hosted infra, after `hud deploy`
`Runtime("tcp://host:port")`	A substrate you started yourself

The runtime reference covers each constructor and how to bring your own.
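Because only the runtime (or the agent) changes between runs, a sweep can be an ordinary loop around the call in run.py. The sketch below shows that loop shape without importing the SDK: run_one is a hypothetical coroutine you would write yourself to wrap ts.run(agent, runtime=...), and fake_run is a stand-in so the example runs on its own. The labels "local" and "hud" are illustrative keys, not SDK values.

```python
import asyncio


async def sweep(run_one, agents, runtimes):
    """Run every (agent, runtime) pair in turn and collect the rewards.

    `run_one(agent_name, runtime_name)` is a caller-supplied coroutine; in a
    real script it would build the agent and runtime and await ts.run(...).
    Pairs run sequentially, so no assumption is made about concurrent use.
    """
    results = {}
    for agent_name in agents:
        for runtime_name in runtimes:
            results[(agent_name, runtime_name)] = await run_one(agent_name, runtime_name)
    return results


async def fake_run(agent_name, runtime_name):
    """Stand-in rollout that returns a made-up reward instead of calling HUD."""
    await asyncio.sleep(0)  # yield control, as a real awaited run would
    return 1.0 if runtime_name == "local" else 0.5


if __name__ == "__main__":
    table = asyncio.run(sweep(fake_run, ["claude", "openai"], ["local", "hud"]))
    for (agent_name, runtime_name), reward in table.items():
        print(agent_name, runtime_name, reward)
```

Keeping the rollout behind a single run_one function means the loop stays the same when you swap the stand-in for a real call.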

Serving an environment directly

hud eval starts and stops the env subprocess for you. hud serve instead exposes the same control channel as a standalone, long-lived process - useful for talking to a running env from a script, testing a packaged image by hand, or driving hud task start and hud task grade yourself.

hud serve            # auto-detect env.py, bind :8765
hud serve env.py -p 9000
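A script that starts hud serve in the background may want to confirm the port is accepting connections before it attaches. The helper below is a generic TCP readiness check built on the Python standard library; it is not a HUD API, and it only verifies that something is listening, not that the listener is a healthy env.

```python
import socket
import time


def wait_for_port(host, port, timeout=10.0, interval=0.2):
    """Return True once host:port accepts a TCP connection, False on timeout."""
    deadline = time.monotonic() + timeout
    while True:
        try:
            # Open and immediately close a connection as a liveness probe.
            with socket.create_connection((host, port), timeout=interval):
                return True
        except OSError:
            if time.monotonic() >= deadline:
                return False
            time.sleep(interval)


if __name__ == "__main__":
    # 8765 is the default port shown above; adjust if you passed -p.
    if wait_for_port("localhost", 8765, timeout=1.0):
        print("server is listening on tcp://localhost:8765")
    else:
        print("nothing is listening on port 8765 yet")
```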

Attach to the running server from a script with Runtime("tcp://localhost:8765"), or from hud eval with --runtime tcp://localhost:8765. The protocol reference describes every message the channel speaks.

With runs in hand, turn the reward spread into model updates - covered in training agents.
