HUD Documentation - Evaluations and RL Environments.

This guide outlines how to create an environment in the HUD format - defining the env.py file, registering tasks and capabilities, doing quick local tests, and deploying the environment to our platform. On this page: Start · The env.py file · Capabilities · Lifecycle hooks · Tasks · Testing locally · Deploying

Start

Run hud init in your terminal. It lets you pick a template to start from; the default is a blank template.

hud init my-env
cd my-env

Inside the blank template my-env you now have four files, each with one job:

my-env/
├── env.py            the environment: object, capabilities, and task definitions
├── tasks.py          the concrete tasks to run, filled in with real arguments
├── Dockerfile.hud    how the environment is packaged into an image on deploy
└── pyproject.toml    the Python dependencies

The `env.py` file

env.py is the only file that defines behavior. In this section we’ll go through each part of a full example (see it in full below). It’s an environment that gives the agent a sandboxed shell and file system, asks it to count a letter in a sentence, and scores the answer.

Full env.py example

We’ll use the following env.py example throughout this section:

env.py

from pathlib import Path
from hud.environment import Environment

ROOT = Path("workspace")
env = Environment(name="counter")
env.workspace(ROOT)

@env.initialize
async def _setup():
    (ROOT / "instructions.txt").write_text("Count letters case-insensitively.\n")

@env.shutdown
async def _teardown():
    (ROOT / "instructions.txt").unlink(missing_ok=True)

@env.template(id="count")
async def count(sentence: str, letter: str):
    answer = yield f"How many times does '{letter}' appear in '{sentence}'?"
    yield 1.0 if str(sentence.lower().count(letter.lower())) in (answer or "") else 0.0

1 · Imports and the environment object

env = Environment(name="counter") creates the environment - the central object everything else attaches to. Capabilities, lifecycle hooks, and tasks all register against env; it holds nothing on its own. The name identifies it - how a task finds its way back to this definition, and what it’s published under when you deploy.

env.py

from pathlib import Path
from hud.environment import Environment

ROOT = Path("workspace")
env = Environment(name="counter")

2 · Capabilities

env.workspace(ROOT) attaches a capability: a sandboxed shell and file system the agent reaches over ssh, rooted at the workspace/ directory next to env.py.

env.py

env.workspace(ROOT)

Capabilities

A capability is a connection the agent drives, and it’s what keeps an environment portable: you declare the connection, and any agent that speaks the protocol drives it with its own tools. Above, env.workspace(ROOT) attached a single ssh capability - the common case. An environment isn’t limited to one, though: publish as many as the tasks need, and the agent sees every one. You attach them three ways:

in the Environment(...) constructor, for a service that already exists,
with env.workspace(root), the one-line shortcut for the typical shell case above,
with env.add_capability(...) from a hook, for a service the environment runs itself.

That last form is how you expose a custom tool. A FastMCP server is a common choice: an @env.initialize hook starts it, waits for it to bind, then publishes its address.

import asyncio

# Assumes `server` (a FastMCP server instance) and `Capability` are defined or imported elsewhere.
@env.initialize
async def _start():
    asyncio.create_task(server.run_http_async(host="127.0.0.1", port=8040))   # 1. start the server
    await asyncio.sleep(0.2)                                                   # 2. let it bind
    env.add_capability(Capability.mcp(name="tools",                           # 3. publish its address
                                      url="http://127.0.0.1:8040/mcp"))
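The fixed asyncio.sleep(0.2) in step 2 assumes the server binds within 200 ms. If that turns out to be flaky, one alternative is to poll the port until it accepts a connection. The sketch below is generic asyncio code, not part of HUD or FastMCP; wait_for_port is a hypothetical helper name chosen for illustration.

```python
import asyncio


async def wait_for_port(host: str, port: int, timeout: float = 5.0) -> bool:
    """Return True once host:port accepts a TCP connection, False on timeout.

    Hypothetical helper for illustration; not a HUD or FastMCP API.
    """
    loop = asyncio.get_running_loop()
    deadline = loop.time() + timeout
    while loop.time() < deadline:
        try:
            _, writer = await asyncio.open_connection(host, port)
        except OSError:
            # Nothing is listening yet: back off briefly and try again.
            await asyncio.sleep(0.05)
            continue
        # The connection succeeded, so the server has bound the port.
        writer.close()
        await writer.wait_closed()
        return True
    return False
```

In the hook above, awaiting wait_for_port("127.0.0.1", 8040) would take the place of the sleep, and the capability would be published only if it returns True.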

Other protocols - a browser over cdp, a full desktop over rfb, a robot loop - attach the same way. The capabilities page has the spin-up for each, and the protocol covers how an agent discovers them over the wire.

3 · Lifecycle hooks

@env.initialize runs once before the env serves - the place to seed files or start services - and @env.shutdown runs on teardown, after the run finishes, to release them. Both are optional; here they’re dummy hooks for demonstration, writing and removing a file.

env.py

@env.initialize
async def _setup():
    (ROOT / "instructions.txt").write_text("Count letters case-insensitively.\n")

@env.shutdown
async def _teardown():
    (ROOT / "instructions.txt").unlink(missing_ok=True)

Lifecycle hooks

When a task needs state - seeded files, a running service, a browser - you bring it up in @env.initialize and release it in @env.shutdown. @env.initialize runs once before the env accepts any connection, so by the time an agent connects, everything it needs is already in place. Because the env.workspace(...) hook is registered first, the workspace directory already exists when your own hook runs, so you can write straight into ROOT. Hooks run per environment, not per task, so use them for state every task shares. The environment reference covers the full lifecycle.
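The ordering described above can be pictured with a toy model. MiniEnv below is a made-up stand-in written for this sketch, not the HUD Environment class, and it assumes hooks simply run in registration order. It only shows why a workspace hook registered first lets a later hook write straight into the root.

```python
import asyncio
import tempfile
from pathlib import Path


class MiniEnv:
    """Toy model of hook registration; NOT the HUD Environment class."""

    def __init__(self):
        self._init_hooks = []
        self._shutdown_hooks = []

    def initialize(self, fn):
        self._init_hooks.append(fn)
        return fn

    def shutdown(self, fn):
        self._shutdown_hooks.append(fn)
        return fn

    def workspace(self, root: Path):
        # Registers a hook that creates the workspace directory.
        async def _make_root():
            root.mkdir(parents=True, exist_ok=True)

        self.initialize(_make_root)

    async def start(self):
        # Assumption of this sketch: hooks run in registration order.
        for hook in self._init_hooks:
            await hook()

    async def stop(self):
        for hook in self._shutdown_hooks:
            await hook()


async def demo():
    with tempfile.TemporaryDirectory() as tmp:
        root = Path(tmp) / "workspace"
        env = MiniEnv()
        env.workspace(root)  # registered first, so the directory exists later

        @env.initialize
        async def _setup():
            (root / "instructions.txt").write_text("Count letters case-insensitively.\n")

        @env.shutdown
        async def _teardown():
            (root / "instructions.txt").unlink(missing_ok=True)

        await env.start()
        seeded = (root / "instructions.txt").exists()
        await env.stop()
        removed = not (root / "instructions.txt").exists()
    return seeded, removed
```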

4 · Tasks

@env.template(id="count") defines the task. The first yield sends the prompt and pauses; when the agent’s answer arrives the function resumes, and the second yield returns the reward - 1.0 if the count matches, else 0.0.

env.py

@env.template(id="count")
async def count(sentence: str, letter: str):
    answer = yield f"How many times does '{letter}' appear in '{sentence}'?"
    yield 1.0 if str(sentence.lower().count(letter.lower())) in (answer or "") else 0.0
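One caveat about the reward line above: it is a substring test, so an answer such as "13" or "30" also contains "3" and would score 1.0. If that matters for your tasks, a stricter check is easy to write in plain Python. The sketch below is one possible approach, not a HUD API; strict_count_reward is a hypothetical helper, and it assumes the last number in the agent's answer is the one to grade.

```python
import re


def strict_count_reward(sentence: str, letter: str, answer) -> float:
    """Score 1.0 only if the last integer in the answer equals the true count.

    Hypothetical helper for illustration; assumes the agent's final number
    is its answer.
    """
    expected = sentence.lower().count(letter.lower())
    numbers = re.findall(r"\d+", answer or "")
    if not numbers:
        return 0.0
    return 1.0 if int(numbers[-1]) == expected else 0.0
```

Inside the template, the second yield would then yield strict_count_reward(sentence, letter, answer) in place of the inline expression.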

Tasks

A task is HUD’s unit of evaluation: a prompt the agent attempts and a reward that scores how it did. You write one as an async generator with exactly two yields:

The first yield returns the prompt and pauses the function while the agent works in the environment, driving whatever capabilities it was given. Its answer comes back into answer.
The second yield returns the score - a reward from 0.0 to 1.0. This is where grading goes: compare answer inline, or call a grader to judge what the agent left behind.

@env.template() registers it as a template, not a single task: its parameters fill in per run, so one definition describes a whole space of tasks. Calling it - count(sentence="strawberry", letter="r") - mints one concrete, runnable task.
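The two-yield contract can be seen without HUD at all. The sketch below defines the same generator without the @env.template decorator and drives it by hand. The drive helper and fake_agent are hypothetical stand-ins for what the HUD runtime and a real agent do; they are not HUD APIs.

```python
import asyncio


async def count(sentence: str, letter: str):
    # First yield: hand out the prompt, then pause until an answer is sent in.
    answer = yield f"How many times does '{letter}' appear in '{sentence}'?"
    # Second yield: hand out the reward.
    yield 1.0 if str(sentence.lower().count(letter.lower())) in (answer or "") else 0.0


def fake_agent(prompt: str) -> str:
    """Stand-in for a real agent; always gives the same answer."""
    return "I count 3."


async def drive(task, agent) -> tuple:
    """Hypothetical driver showing the order of operations, not HUD's runtime."""
    prompt = await task.__anext__()      # run to the first yield
    answer = agent(prompt)               # the agent works on the prompt
    reward = await task.asend(answer)    # resume; run to the second yield
    await task.aclose()                  # finish the generator cleanly
    return prompt, reward


prompt, reward = asyncio.run(drive(count("strawberry", "r"), fake_agent))
```

Here the value passed to asend is what the generator sees as answer, and the value asend returns is the reward from the second yield.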

Grading

The second yield is the reward, a number from 0.0 to 1.0. Here it’s plain Python: compare the agent’s answer to the count Python computes. You have three options as tasks grow:

Approach	What it does
Plain Python	Compute the answer inline, like above. Comparison helpers (`exact_match`, `numeric_match`, …) live in `hud.graders`.
Async graders	`BashGrader` scores a shell command by exit code; `LLMJudgeGrader` scores against rubric criteria. This is how you grade the state the agent left behind - a test passing, a file written.
Composed graders	`combine` runs several at once and weights them into one reward, each subscore visible in the trace.

An async grader is an awaited call whose result you yield - BashGrader, for instance, runs a command in the workspace and scores by exit code:

result = await BashGrader.grade(weight=1.0, command="pytest -q", cwd=str(ROOT))
yield result.value
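To make "scores by exit code" concrete, here is a standalone sketch of the idea using only the standard library. It is not BashGrader's implementation, and score_by_exit_code is a hypothetical name; it assumes a simple rule of 1.0 for exit code 0 and 0.0 for anything else.

```python
import subprocess
import sys
from typing import Optional, Sequence


def score_by_exit_code(command: Sequence[str], cwd: Optional[str] = None) -> float:
    """Run a command and map its exit code to a reward (illustrative only)."""
    completed = subprocess.run(command, cwd=cwd, capture_output=True)
    return 1.0 if completed.returncode == 0 else 0.0


# Two tiny commands that exit with a known status, using the current interpreter.
passing = score_by_exit_code([sys.executable, "-c", "raise SystemExit(0)"])
failing = score_by_exit_code([sys.executable, "-c", "raise SystemExit(1)"])
```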

The full set lives in the graders reference; see designing tasks for what separates good rewards from hackable ones.
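As a rough picture of what weighting several subscores into one reward means, the sketch below computes a weighted average. It is an assumption about the arithmetic, not the behavior or signature of HUD's combine; weighted_reward and the subscore names are invented for illustration.

```python
def weighted_reward(subscores):
    """Combine (name, value, weight) triples into one reward in [0.0, 1.0].

    Illustrative only: assumes weights are normalized by their sum.
    """
    total_weight = sum(weight for _, _, weight in subscores)
    if total_weight <= 0:
        raise ValueError("at least one subscore needs a positive weight")
    raw = sum(value * weight for _, value, weight in subscores) / total_weight
    # Clamp so the result always stays a valid reward.
    return max(0.0, min(1.0, raw))


reward = weighted_reward([
    ("tests_pass", 1.0, 3.0),   # hypothetical subscore names
    ("style_ok", 0.0, 1.0),
])
```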

Listing tasks in `tasks.py`

env.py defines the shape of a task. tasks.py lists the concrete ones to run by calling the template with real arguments:

tasks.py

from env import count

tasks = [
    count(sentence="strawberry", letter="r"),
    count(sentence="banana", letter="a"),
]

Each call binds one runnable task. Build them in a loop to turn one definition into a whole dataset.
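A loop-built taskset might look like the sketch below. So that the snippet runs by itself, count here is a stand-in function that just records its arguments; in a real tasks.py you would import the template from env instead. The sentence and letter lists are made-up sample data.

```python
from itertools import product


def count(sentence: str, letter: str) -> dict:
    """Stand-in for the `count` template from env.py (illustration only)."""
    return {"template": "count", "args": {"sentence": sentence, "letter": letter}}


SENTENCES = ["strawberry", "banana", "Mississippi"]  # sample data
LETTERS = ["a", "r", "s"]                            # sample data

# One task per (sentence, letter) pair: 3 x 3 = 9 tasks from one definition.
tasks = [count(sentence=s, letter=l) for s, l in product(SENTENCES, LETTERS)]
```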

Testing the environment locally

Run an agent against the tasks. hud eval spawns the environment locally, runs the agent, grades each task, and prints the reward:

hud eval tasks.py claude

No Docker or separate server needed - the only thing you need is a provider key for the model (ANTHROPIC_API_KEY, …), or a HUD_API_KEY to route through the HUD gateway instead.

hud eval runs on macOS, Windows, and Linux, with two caveats: ssh sandbox isolation is Linux-only, and BashGrader needs bash, so it scores 0.0 on native Windows. Off Linux, you get full behavior with --runtime hud (covered in the next guide, on running evaluations), with WSL2, or with the image built from Dockerfile.hud.
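If you want to catch these conditions before a run, a small preflight check is one option. The sketch below is not a HUD command or API; preflight is a hypothetical helper, and it checks only the two keys named above, so a key for any other provider would go unnoticed.

```python
import os
import shutil
import sys


def preflight(environ=None, platform=None, bash_path=None):
    """Return a list of notes about the local setup (illustrative only)."""
    environ = os.environ if environ is None else environ
    platform = sys.platform if platform is None else platform
    bash_path = shutil.which("bash") if bash_path is None else bash_path

    notes = []
    # Only the two keys mentioned in this guide are checked.
    if not (environ.get("ANTHROPIC_API_KEY") or environ.get("HUD_API_KEY")):
        notes.append("neither ANTHROPIC_API_KEY nor HUD_API_KEY is set")
    if not platform.startswith("linux"):
        notes.append("ssh sandbox isolation is Linux-only")
    if not bash_path:
        notes.append("bash not found, so BashGrader would score 0.0")
    return notes
```

Calling preflight() with no arguments inspects the current machine; passing explicit values makes it easy to see what each platform would report.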

hud eval spawns a fresh env per run; to keep one serving locally for repeated runs or manual inspection, use hud serve.

Deploying to the platform

When it works locally, ship it. Deploying talks to the platform, so set your API key first:

hud set HUD_API_KEY=your-key   # get one at hud.ai

hud deploy builds the image from Dockerfile.hud and publishes the environment under its Environment(name=...):

hud deploy

To run a taskset from the platform, publish it too. hud sync uploads a taskset, sending only what changed:

hud sync tasks my-tasks   # publish tasks.py as a named taskset
hud sync env              # sync environment metadata

Your environment now lives on the platform, ready to run at scale and compare across models. Running those evaluations - locally, on a cloud sandbox, or on the platform - is covered in evaluating agents.

Export to other formats

A HUD environment isn’t locked to HUD: its tasks export to other benchmark formats, like self-contained Harbor task folders.

See environment examples

For more to build on, pick a different starting point with hud init, or read the cookbooks for worked environments like a coding agent and an ops diagnostics task.
