HUD Documentation — Evaluations and RL Environments.

A task template is the measurement instrument: one async generator that prompts and grades. Calling it with different arguments mints different tasks, so one function becomes a whole dataset with no duplication. The template ships inside the environment image, so a single image can mint every task in your dataset on demand, with no separate artifact per task.

The two-yield generator

Register a template with @env.template(). The first yield sends the prompt; the value that yield expression evaluates to is the agent’s answer; the second yield is the reward (a float, usually 0.0–1.0).

tasks.py

from hud import Environment

env = Environment(name="letter-count")

@env.template()
async def count_letter(word: str = "strawberry", letter: str = "r"):
    answer = yield f"How many '{letter}'s are in '{word}'? Reply with just the number."
    yield 1.0 if answer and str(word.count(letter)) in answer else 0.0

The template id defaults to the function name; override it with @env.template(id="...").
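
To make the two-yield protocol concrete, here is a minimal sketch of how a runner could drive such a generator using plain asyncio. The drive helper and the fake_agent callable are hypothetical stand-ins for illustration, not HUD's actual runner, and the template is written without the @env.template() decorator so the snippet runs on its own.

```python
import asyncio

async def count_letter(word: str = "strawberry", letter: str = "r"):
    # First yield: the prompt. The yield expression evaluates to the answer.
    answer = yield f"How many '{letter}'s are in '{word}'? Reply with just the number."
    # Second yield: the reward.
    yield 1.0 if answer and str(word.count(letter)) in answer else 0.0

async def drive(gen, agent):
    """Hypothetical runner: get the prompt, send the answer, read the reward."""
    prompt = await gen.asend(None)    # runs the generator up to the first yield
    answer = agent(prompt)            # stand-in for a real agent rollout
    reward = await gen.asend(answer)  # resumes the generator up to the second yield
    await gen.aclose()                # close the generator cleanly
    return prompt, reward

def fake_agent(prompt: str) -> str:
    # Hypothetical agent that always answers "3".
    return "3"

prompt, reward = asyncio.run(drive(count_letter(), fake_agent))
print(prompt)
print(reward)
```

The key point is that the answer travels into the generator through asend, which is why the first yield can be assigned to a variable.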

Tasks: one definition, many data points

Calling the template mints a task — one runnable, parameterized row bound to the environment by name:

tasks.py

tasks = [count_letter(word=w) for w in ("strawberry", "raspberry", "blueberry")]

count_letter(word="raspberry") doesn’t run anything; it returns a Task (a plain row: env name, template id, args). A list of tasks is a dataset, and hud eval tasks.py claude runs each one. This is the core move: parameterize the generator, and a single definition spans a whole spread of difficulties or inputs.
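
The idea that calling a template records a row instead of running anything can be sketched in plain Python. Everything below (the Task dataclass, its field names, and the template decorator) is a hypothetical illustration of the pattern, not HUD's implementation.

```python
from dataclasses import dataclass, field
from typing import Any

@dataclass
class Task:
    """Hypothetical row: which env, which template, which arguments."""
    env: str
    template: str
    args: dict[str, Any] = field(default_factory=dict)

def template(env_name: str):
    """Hypothetical decorator: calling the decorated function mints a Task."""
    def decorator(fn):
        def mint(**kwargs) -> Task:
            # Nothing is executed here; the call is only recorded as data.
            return Task(env=env_name, template=fn.__name__, args=dict(kwargs))
        return mint
    return decorator

@template("letter-count")
async def count_letter(word: str = "strawberry", letter: str = "r"):
    answer = yield f"How many '{letter}'s are in '{word}'?"
    yield 1.0 if answer and str(word.count(letter)) in answer else 0.0

# One definition, three data points.
tasks = [count_letter(word=w) for w in ("strawberry", "raspberry", "blueberry")]
for task in tasks:
    print(task)
```

Because each row is plain data, a dataset built this way can be serialized, filtered, or shipped elsewhere before any rollout happens.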

Grading

The second yield is the reward. You have three options, in increasing power.

1. Plain Python

For simple checks, just compute a float. HUD ships normalized comparison helpers in hud.graders:

tasks.py

from hud.graders import numeric_match

@env.template()
async def count_letter(word: str = "strawberry", letter: str = "r"):
    answer = yield f"How many '{letter}'s are in '{word}'?"
    yield numeric_match(answer, word.count(letter))

Available helpers (each returns a float): exact_match, contains, contains_any, contains_all, numeric_match, f1_score, and normalize (a text-normalization building block). See the Graders reference.
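
To see why a normalized helper is worth using over a raw substring test, note that a check like "3" in answer also accepts "13". The sketch below shows one way a stricter numeric comparison could work. loose_numeric_match is a hypothetical function written for this illustration; it is not the implementation of hud.graders.numeric_match, whose exact behavior is described in the Graders reference.

```python
import re

def loose_numeric_match(answer, expected, tol=1e-9):
    """Hypothetical helper: 1.0 if the answer holds exactly one number equal to `expected`."""
    if not answer:
        return 0.0
    # Drop thousands separators, then pull out every integer or decimal.
    numbers = re.findall(r"-?\d+(?:\.\d+)?", answer.replace(",", ""))
    if len(numbers) != 1:
        return 0.0  # no number, or an ambiguous answer with several
    return 1.0 if abs(float(numbers[0]) - expected) <= tol else 0.0

print(loose_numeric_match("The answer is 3.", 3))
print(loose_numeric_match("13", 3))
```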

2. Async graders

BashGrader runs a shell command and scores by exit code (1.0 if it exits 0); LLMJudgeGrader scores an answer against rubric criteria with an LLM. Both are async and return a SubScore:

tasks.py

from hud.graders import BashGrader

@env.template()
async def fix_tests(target: str = "tests/"):
    answer = yield f"Make the tests in {target} pass."
    result = await BashGrader.grade(weight=1.0, command=f"pytest {target} -q")
    yield result.value
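
Scoring by exit code is simple enough to sketch with the standard library. The function below is a hypothetical stand-in that mirrors the rule described above (1.0 if the command exits 0); it is synchronous, unlike BashGrader, and makes no claim about how BashGrader is implemented. The timeout parameter is an assumption added for the sketch.

```python
import subprocess

def exit_code_score(command: str, cwd=None, timeout: float = 60.0) -> float:
    """Hypothetical grader: 1.0 if the shell command exits with status 0, else 0.0."""
    try:
        completed = subprocess.run(
            command, shell=True, cwd=cwd, capture_output=True, timeout=timeout
        )
    except subprocess.TimeoutExpired:
        return 0.0  # a command that hangs counts as a failure
    return 1.0 if completed.returncode == 0 else 0.0

print(exit_code_score("exit 0"))
print(exit_code_score("exit 3"))
```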

3. Composed graders

combine runs several graders in parallel and combines them into a weighted EvaluationResult you can yield directly. Positive weights are normalized to sum to 1.0:

tasks.py

from hud.graders import BashGrader, LLMJudgeGrader, SubScore, combine, contains

@env.template()
async def implement_feature(spec: str = "add a /health endpoint"):
    answer = yield f"Implement this and summarize what you changed: {spec}"
    yield await combine(
        BashGrader.grade(weight=0.5, command="pytest -q"),
        LLMJudgeGrader.grade(weight=0.3, answer=answer, criteria=["Matches the spec"]),
        # contains, not exact_match: the summary should mention "/health", not equal it.
        SubScore(name="mentions_endpoint", value=contains(answer, "/health"), weight=0.2),
    )

Subscores show up in the trace, so a partial reward is legible: you can see which component earned it. (LLMJudgeGrader needs the rubric package: pip install rubric.)
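
The weight normalization can be worked through by hand. The sketch below shows the arithmetic for positive weights only; weighted_reward is a hypothetical helper written for this illustration, not the combine function itself.

```python
def weighted_reward(subscores):
    """Hypothetical sketch: subscores is a list of (value, weight) pairs with positive weights."""
    total = sum(weight for _, weight in subscores)
    if total <= 0:
        raise ValueError("at least one positive weight is required")
    # Each weight is divided by the total, so the normalized weights sum to 1.0.
    return sum(value * (weight / total) for value, weight in subscores)

# Tests pass (1.0), the judge gives half credit (0.5), the mention check fails (0.0).
print(weighted_reward([(1.0, 0.5), (0.5, 0.3), (0.0, 0.2)]))  # approximately 0.65

# Weights that do not sum to 1.0 are rescaled: 2 and 2 behave like 0.5 and 0.5.
print(weighted_reward([(1.0, 2.0), (0.0, 2.0)]))
```

Keeping the components separate until the final sum is what makes a partial reward readable: each term shows how much one grader contributed.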

A grader that returns a constant, or echoes the answer back as a pass, teaches a model nothing and invites reward hacking. Design graders that actually separate good work from bad — see Designing tasks for signal.

Grade the outcome, not just the answer

A grader doesn’t have to read the agent’s words. Because the agent acts on a real system through its capabilities, the most reliable thing to score is often the state it left behind — tests passing, a file written, a row in a database, a service responding. The task simply skips the answer = and grades the world:

tasks.py

from hud import Environment
from hud.graders import BashGrader

env = Environment(name="api")
ws = env.workspace("workspace")

@env.template()
async def add_endpoint():
    yield "Add a /health endpoint to the app in your workspace and make it return 200."
    result = await BashGrader.grade(weight=1.0, command="pytest tests/test_health.py -q", cwd=str(ws.root))
    yield result.value

This is outcome verification: you score what the agent did, not how it described it — the same rigor as a test suite, with no fixed step-by-step protocol for the agent to conform to. The agent works however it likes through the capability; the grader checks the result.
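
Outcome checks are not limited to test suites. As a minimal sketch, the grader below inspects a file the agent was supposed to leave behind and ignores anything the agent said. The file name config.json, the health_enabled key, and the grade_config_written function are all hypothetical, chosen only for this illustration.

```python
import json
import tempfile
from pathlib import Path

def grade_config_written(workspace: Path) -> float:
    """Hypothetical outcome check: 1.0 only if config.json exists and enables the health route."""
    path = workspace / "config.json"
    if not path.is_file():
        return 0.0
    try:
        config = json.loads(path.read_text())
    except json.JSONDecodeError:
        return 0.0  # a malformed file is a failed outcome
    return 1.0 if isinstance(config, dict) and config.get("health_enabled") is True else 0.0

with tempfile.TemporaryDirectory() as tmp:
    workspace = Path(tmp)
    before = grade_config_written(workspace)  # nothing written yet
    (workspace / "config.json").write_text(json.dumps({"health_enabled": True}))
    after = grade_config_written(workspace)   # the state now satisfies the check
    print(before, after)
```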

Structured answers

By default the answer is the agent’s raw text. To receive a typed, parsed answer, declare returns= with a type; the answer then arrives as an Answer[T], which carries the parsed value as content alongside the original raw text:

tasks.py

from pydantic import BaseModel

class Summary(BaseModel):
    title: str
    bullets: list[str]

@env.template(returns=Summary)
async def summarize(doc: str = "..."):
    answer = yield f"Summarize:\n\n{doc}"
    yield 1.0 if len(answer.content.bullets) >= 3 else 0.0

Use input= and returns= to surface JSON schemas in the environment’s manifest. See the Types reference.
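
The shape of a typed answer can be sketched with the standard library alone. The Answer dataclass, the parse_summary function, and the choice to set content to None on a parse failure are assumptions made for this illustration; how HUD actually parses and validates answers is covered in the Types reference. A plain dataclass stands in for the pydantic model so the snippet has no third-party dependencies.

```python
import json
from dataclasses import dataclass
from typing import Generic, Optional, TypeVar

T = TypeVar("T")

@dataclass
class Summary:
    title: str
    bullets: list[str]

@dataclass
class Answer(Generic[T]):
    """Hypothetical container: parsed content plus the original raw text."""
    content: Optional[T]
    raw: str

def parse_summary(raw: str) -> Answer[Summary]:
    """Hypothetical parser: keep the raw text even when parsing fails."""
    try:
        data = json.loads(raw)
        content = Summary(title=str(data["title"]), bullets=[str(b) for b in data["bullets"]])
    except (json.JSONDecodeError, KeyError, TypeError):
        content = None
    return Answer(content=content, raw=raw)

def grade(answer: Answer[Summary]) -> float:
    # Same rule as the template above, with a guard for unparseable answers.
    return 1.0 if answer.content is not None and len(answer.content.bullets) >= 3 else 0.0

good = parse_summary('{"title": "Notes", "bullets": ["a", "b", "c"]}')
bad = parse_summary("Sorry, here is a summary in prose.")
print(grade(good), grade(bad))
```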

Sync metadata: slug and columns

When you publish a taskset to the platform (hud sync tasks), each task carries optional metadata. slug is its stable id (defaults to the template id plus an args hash); columns are arbitrary fields surfaced as filterable columns and leaderboard facets on the platform:

tasks.py

easy = count_letter(word="strawberry")
easy.slug = "count-strawberry"
easy.columns = {"difficulty": "easy", "length": 10}
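
A default id built from the template id plus an args hash stays stable as long as the arguments do. The sketch below shows one way such an id could be derived; the default_slug function, the use of SHA-256, the 8-character digest, and the hyphenated format are all assumptions for illustration, not the format HUD produces.

```python
import hashlib
import json

def default_slug(template_id: str, args: dict) -> str:
    """Hypothetical slug: template id plus a short hash of the canonicalized args."""
    # Sorting keys makes the hash independent of argument order.
    canonical = json.dumps(args, sort_keys=True, separators=(",", ":"))
    digest = hashlib.sha256(canonical.encode("utf-8")).hexdigest()[:8]
    return f"{template_id}-{digest}"

a = default_slug("count_letter", {"word": "strawberry", "letter": "r"})
b = default_slug("count_letter", {"letter": "r", "word": "strawberry"})
c = default_slug("count_letter", {"word": "raspberry", "letter": "r"})
print(a == b, a == c)
```

Setting slug explicitly, as in the example above, is still the safer choice when you expect to rename arguments later, since a hash-derived id changes whenever the arguments do.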

Run them

While authoring, one command runs your tasks — it loads the env from your source and grades each one:

hud eval tasks.py claude --group 3          # one task, 3 rollouts
hud eval tasks.py claude --full --group 3   # the whole dataset, 3 rollouts each

That’s the loop you’ll live in. In code, calling a template mints a Task; running that Task returns a Job of graded runs. With no runtime= argument, the run serves the source the task was defined in, so it works locally without extra setup:

run.py

import asyncio
from hud.agents import create_agent
from tasks import count_letter

async def main():
    agent = create_agent("claude-sonnet-4-5")
    job = await count_letter(word="strawberry").run(agent)
    print(job.reward)

asyncio.run(main())

From here the path forks — and that’s where runtime= comes in:

Scale — package the environment and run it on your own infra or HUD-hosted. See Run tasks anywhere.
Train — drive a Taskset in a loop and turn rewards into GRPO advantages. See Train on your tasks.
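
The --group flag and the Train path both rest on the same idea: several rollouts of one task form a group, and each reward is judged relative to that group. A minimal sketch of the group-normalized advantage used in GRPO follows. The group_advantages function, the choice of population standard deviation, and the eps value are assumptions for illustration; training code differs in these details.

```python
from statistics import mean, pstdev

def group_advantages(rewards, eps: float = 1e-6):
    """Hypothetical sketch: advantage = (reward - group mean) / (group std + eps)."""
    if not rewards:
        return []
    mu = mean(rewards)
    sigma = pstdev(rewards)  # population standard deviation of the group
    # eps keeps the division finite when every rollout earned the same reward.
    return [(r - mu) / (sigma + eps) for r in rewards]

print(group_advantages([1.0, 0.0, 1.0]))  # the failed rollout gets a negative advantage
print(group_advantages([1.0, 1.0, 1.0]))  # identical rewards carry no learning signal
```

The second call shows why reward spread matters: a task every rollout passes (or every rollout fails) yields zero advantages and teaches nothing.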

Next steps

Designing tasks for signal

Make tasks that actually teach: difficulty, spread, and anti-reward-hacking.

Graders reference

Every grader, comparison helper, and the combine combiner.

Run on any model

Evaluate with Claude, OpenAI, Gemini, or your own endpoint.

Train on your tasks

Turn a group of rewards into GRPO advantages.
