HUD Documentation — Evaluations and RL Environments.

A complete, runnable example: an environment with a managed shell, a task that asks the agent to make a failing test pass, and a BashGrader that scores by running the test suite.

The environment

The workspace gives the agent a sandboxed shell and files — the env starts it and publishes the shell capability when it serves. We seed a buggy module and a test in @env.initialize, then declare the task — the grader runs pytest and scores by exit code. One design point matters here: the grader runs an authoritative copy of the test that lives outside the agent’s workspace. The agent gets its own copy to read and run, but if the grader re-ran that editable copy, the cheapest path to a passing pytest would be weakening or deleting the test — classic reward hacking.

env.py

from pathlib import Path

from hud.environment import Environment
from hud.graders import BashGrader

ROOT = Path("workspace").resolve()     # the agent's directory
CHECKS = Path("checks").resolve()      # grader-only, outside the workspace

TEST = "from calc import add\n\ndef test_add():\n    assert add(2, 3) == 5\n"

env = Environment(name="coder")
env.workspace(ROOT)

@env.initialize
async def _seed():
    (ROOT / "calc.py").write_text("def add(a, b):\n    return a - b\n")   # bug
    (ROOT / "test_calc.py").write_text(TEST)          # the agent's copy
    CHECKS.mkdir(exist_ok=True)
    (CHECKS / "test_calc.py").write_text(TEST)        # the authoritative copy

@env.template()
async def fix_add(target: str = "test_calc.py"):
    yield f"There's a failing test in {target} in your workspace. Find and fix the bug so the test passes."
    result = await BashGrader.grade(
        weight=1.0,
        command=f"python -m pytest {CHECKS / target} -q",
        cwd=str(ROOT),
    )
    yield result.value

tasks = [fix_add()]

This task has no answer = yield — the deliverable is the state of the workspace, not a text answer.

To start from an existing repo instead of seeding files inline, write it into the workspace root in @env.initialize, or pass mounts= (see Capabilities).

Run it

Point a coding agent at the environment. claude opens the ssh capability, edits calc.py, and the grader re-runs the test:

hud eval env.py claude

For Claude Code (the claude CLI driving the shell over SSH), use the ClaudeSDKAgent in code:

run.py

import asyncio
from hud.agents import ClaudeSDKAgent
from hud.agents.types import ClaudeSDKConfig
from env import fix_add

async def main():
    agent = ClaudeSDKAgent(ClaudeSDKConfig(model="claude-sonnet-4-5"))
    job = await fix_add().run(agent)
    print("reward:", job.reward)

asyncio.run(main())

Read the trace

Every step — the shell commands, the edit, the test run — is on the trace at hud.ai. A reward of 1.0 means pytest exited 0; 0.0 means the test still fails.

Make it a dataset

Parameterize the task definition and create concrete tasks for a spread of bugs:

tasks.py

from env import fix_add

tasks = [fix_add(target=t) for t in ("test_calc.py", "test_utils.py", "test_io.py")]

BashGrader needs bash, so on native Windows it scores 0.0 — grade from macOS/Linux, WSL, or a built image.

Coding agent

The environment

Run it

Read the trace

Make it a dataset

See also

Environment reference

Graders

Designing tasks for signal

Ops diagnostics

​The environment

​Run it

​Read the trace

​Make it a dataset

​See also

Environment reference

Graders

Designing tasks for signal

Ops diagnostics

The environment

Run it

Read the trace

Make it a dataset

See also