HUD Documentation — Evaluations and RL Environments.

Once the basics are in place, these patterns help you build richer environments. Each builds on Environments and Tasks.

Compose multiple capabilities

An environment can expose several capabilities at once; the harness opens whichever it needs. A task that spans a shell and a browser declares both:

env.py

from hud.capabilities import Capability
from hud.environment import Environment

env = Environment(
    name="full-stack",
    capabilities=[
        Capability.cdp(url="ws://127.0.0.1:9222"),    # cdp: a browser you run
    ],
)
env.workspace("/workspace")                           # ssh: shell + files, served by the env

The same environment serves a shell-only coding task and a browser-driving task — the difference is which capabilities the harness opens, not the environment.
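The idea that one environment serves tasks with different needs can be sketched in plain Python. This is an illustration only, not HUD's harness logic: `TaskSpec` and `capabilities_to_open` are hypothetical names, and capabilities are modeled as simple strings.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TaskSpec:
    """Hypothetical stand-in for a task and the capabilities it needs."""
    name: str
    needs: frozenset


def capabilities_to_open(declared: set, task: TaskSpec) -> set:
    """Return the subset of the environment's declared capabilities a task uses."""
    missing = task.needs - declared
    if missing:
        raise ValueError(f"{task.name} needs undeclared capabilities: {sorted(missing)}")
    return set(task.needs)


# One environment declares both capabilities.
declared = {"ssh", "cdp"}

# Two tasks draw on different subsets of the same environment.
coding = TaskSpec("fix-bug", frozenset({"ssh"}))
browsing = TaskSpec("checkout-flow", frozenset({"ssh", "cdp"}))

print(sorted(capabilities_to_open(declared, coding)))
print(sorted(capabilities_to_open(declared, browsing)))
```

The environment definition stays the same for both tasks; only the requested subset changes.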

Stateful environments and backing daemons

Use @env.initialize / @env.shutdown to manage anything the tasks need running: a database, a seeded service, a fixture. Each hook runs once, initialize before the environment starts serving and shutdown after it stops:

env.py

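import asyncpg

# `env` is the Environment defined above.
# Module-level handle set by the initialize hook.
db: asyncpg.Connection | None = None

@env.initialize
async def _start():
    # Open the connection once, before tasks are served.
    global db
    db = await asyncpg.connect("postgresql://localhost/app")

@env.shutdown
async def _stop():
    # Close the connection only if initialization opened one.
    if db is not None:
        await db.close()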

Keep environment state frozen across rollouts: every run of a task should see the same starting state, so reward differences reflect the agent, not a drifting environment.
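One way to keep the starting state identical is to hand each rollout a private copy of a fixed seed. The sketch below uses an in-memory dictionary as a stand-in for a real backing service; `SEED_STATE` and `FrozenStateStore` are hypothetical names, not part of HUD.

```python
import copy

# Hypothetical seed data; a real environment might load a SQL fixture instead.
SEED_STATE = {"users": [{"id": 1, "name": "ada"}], "orders": []}


class FrozenStateStore:
    """Hands every rollout its own deep copy of the same seed state."""

    def __init__(self, seed: dict):
        # Copy the seed so later changes to the caller's dict cannot leak in.
        self._seed = copy.deepcopy(seed)

    def fresh(self) -> dict:
        # Each call returns an independent copy of the original seed.
        return copy.deepcopy(self._seed)


store = FrozenStateStore(SEED_STATE)

first = store.fresh()
first["orders"].append({"id": 10, "user": 1})  # a rollout mutates its own copy

second = store.fresh()  # the next rollout still starts from the seed
print(second["orders"])
```

With a real database, the equivalent would be restoring a snapshot or re-running a seed script before each rollout.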

Parameterize for a difficulty spread

One task definition should span a range of difficulties. Parameterize the generator and create one concrete task per difficulty level:

tasks.py

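@env.template()
async def fix_bug(difficulty: int = 1):
    # First yield: the prompt. `answer` receives whatever is sent back
    # into the generator; it is unused here.
    answer = yield f"Fix the level-{difficulty} bug in your workspace."
    # Grade by running the test suite.
    result = await BashGrader.grade(weight=1.0, command="pytest -q")
    # Second yield: the grader's result value.
    yield result.value

# One concrete task per difficulty level, 1 through 5.
tasks = [fix_bug(difficulty=d) for d in range(1, 6)]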

A controlled difficulty distribution is what makes a taskset trainable — see Designing tasks for signal.
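A quick way to check a difficulty spread is to compare pass rates per level and flag levels the agent always solves or always fails. The sketch below is a minimal illustration with made-up rewards; `pass_rates` and `saturated_levels` are hypothetical helpers, not HUD functions.

```python
from statistics import mean

# Hypothetical rewards: difficulty level -> rewards from repeated rollouts.
rewards_by_level = {
    1: [1.0, 1.0, 1.0, 1.0],
    2: [1.0, 1.0, 0.0, 1.0],
    3: [0.0, 1.0, 0.0, 1.0],
    4: [0.0, 0.0, 1.0, 0.0],
    5: [0.0, 0.0, 0.0, 0.0],
}


def pass_rates(rewards: dict) -> dict:
    """Mean reward per difficulty level."""
    return {level: mean(rs) for level, rs in rewards.items()}


def saturated_levels(rates: dict, low: float = 0.0, high: float = 1.0) -> list:
    """Levels whose pass rate sits at or beyond either extreme."""
    return [level for level, rate in rates.items() if rate <= low or rate >= high]


rates = pass_rates(rewards_by_level)
print(rates)
print(saturated_levels(rates))
```

Levels at either extreme produce identical rewards across rollouts, so they add little to distinguish one attempt from another.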

Structure a large taskset across files

Keep tasks in separate modules and collect them into a single Taskset in a top-level file:

tasks.py

from hud.eval import Taskset
from coding_tasks import fix_bug, add_feature
from review_tasks import review_pr

taskset = Taskset("engineering-work", [
    *(fix_bug(difficulty=d) for d in range(1, 6)),
    add_feature(spec="health endpoint"),
    review_pr(pr_id=1421),
])

hud eval tasks.py claude --full runs the whole set; hud sync tasks my-taskset publishes it. Give each task a stable slug so it’s identifiable on the platform:

tasks.py

v = fix_bug(difficulty=3)
v.slug = "fix-bug-3"
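If slugs are assigned by hand for many tasks, a small helper can derive them from the task name and its parameters so they stay the same between runs. `make_slug` below is a hypothetical helper, not part of HUD, and its naming scheme is one possible convention (it yields `fix-bug-difficulty-3` rather than `fix-bug-3`).

```python
import re


def make_slug(task_name: str, **params) -> str:
    """Build a deterministic slug from a task name and its parameters."""
    # Sort parameters so keyword order never changes the result.
    parts = [task_name] + [f"{key}-{value}" for key, value in sorted(params.items())]
    slug = "-".join(parts).lower()
    # Collapse every run of non-alphanumeric characters into one hyphen.
    return re.sub(r"[^a-z0-9]+", "-", slug).strip("-")


print(make_slug("fix_bug", difficulty=3))
print(make_slug("add_feature", spec="health endpoint"))
```

The result could then be assigned the same way as above, for example `v.slug = make_slug("fix_bug", difficulty=3)`.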

Group rollouts for variance

To measure variance (or feed training), run each task several times. The group argument sets how many times each task is repeated; repeats of the same task share a GRPO group:

run.py

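# Run this inside an async function; `agent` is the agent under test,
# constructed elsewhere.
taskset = Taskset("bugs", [fix_bug(difficulty=d) for d in range(1, 6)])
job = await taskset.run(
    agent, group=8, max_concurrent=10,
)
# One reward per rollout.
rewards = [run.reward for run in job.runs]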
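Once rewards are collected, per-task variance and group-relative advantages can be computed with the standard library. The sketch below assumes each reward can be paired with its task's slug; the `(slug, reward)` tuples and both helpers are hypothetical, and the normalization follows the usual GRPO-style recipe (subtract the group mean, divide by the group standard deviation) rather than any HUD-specific code.

```python
from collections import defaultdict
from statistics import mean, pstdev


def summarize(runs: list) -> dict:
    """Group (slug, reward) pairs by slug and report mean and std per task."""
    by_task = defaultdict(list)
    for slug, reward in runs:
        by_task[slug].append(reward)
    return {
        slug: {"mean": mean(rs), "std": pstdev(rs)}
        for slug, rs in by_task.items()
    }


def group_advantages(rewards: list, eps: float = 1e-6) -> list:
    """Center rewards on the group mean and scale by the group std."""
    mu, sigma = mean(rewards), pstdev(rewards)
    # eps avoids division by zero when every reward in the group is equal.
    return [(r - mu) / (sigma + eps) for r in rewards]


# Hypothetical rollouts: two tasks, two repeats each.
runs = [("fix-bug-1", 1.0), ("fix-bug-1", 1.0), ("fix-bug-3", 0.0), ("fix-bug-3", 1.0)]

stats = summarize(runs)
print(stats)
print(group_advantages([1.0, 1.0]))  # identical rewards give zero advantage
print(group_advantages([0.0, 1.0]))
```

A group whose rewards are all equal yields zero advantage for every rollout, which is why tasks with some spread in outcomes are the useful ones for training.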

See also

Designing tasks for signal

Environment reference

Package & deploy

Train on rewards

