The two-yield generator
Register a template with @env.template(). The first yield is the prompt; the value it returns is the agent’s answer; the second yield is the reward (a float, usually 0.0–1.0).
tasks.py
@env.template(id="...").
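The two-yield protocol can be illustrated without HUD at all. The sketch below is plain Python, not the SDK: `drive` and the `letter` parameter are hypothetical names, and the real template would be decorated with @env.template() and driven by the HUD runtime rather than by this helper.

```python
from typing import Any, Callable, Generator


def count_letter(word: str, letter: str = "r") -> Generator[Any, str, None]:
    # First yield: the prompt goes out; the agent's answer comes back in.
    answer = yield f"How many times does '{letter}' appear in '{word}'?"
    # Second yield: the reward, a float between 0.0 and 1.0.
    yield 1.0 if answer.strip() == str(word.count(letter)) else 0.0


def drive(gen: Generator[Any, str, None], agent: Callable[[str], str]) -> tuple[str, float]:
    """Hypothetical driver: run one two-yield generator against an agent callable."""
    prompt = next(gen)                # advance to the first yield
    reward = gen.send(agent(prompt))  # resume with the answer, receive the reward
    gen.close()                       # nothing is expected after the second yield
    return prompt, float(reward)


# A stand-in "agent" that always answers "3".
prompt, reward = drive(count_letter("raspberry"), agent=lambda p: "3")
```

The point of the shape is that everything between the two yields is ordinary Python, so grading logic can be as simple or as involved as the task needs.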
Tasks: one definition, many data points
Calling the template mints a task: a runnable, parameterized row bound to the environment by name.
tasks.py
count_letter(word="raspberry") doesn’t run anything; it returns a Task (a plain row: env name, template id, args). A list of tasks is a dataset, and hud eval tasks.py claude runs each one. This is the core move: parameterize the generator, and a single definition spans a whole spread of difficulties or inputs.
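To make the "plain row" idea concrete, here is a minimal stand-in written with a dataclass. `TaskRow`, `mint`, and the environment name `letters` are hypothetical and only mirror the three fields named above (env name, template id, args); they are not the SDK's Task type.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class TaskRow:
    """Hypothetical stand-in for a task row: data only, nothing runs."""
    env: str
    template: str
    args: dict = field(default_factory=dict)


def mint(env: str, template: str, **args) -> TaskRow:
    # Minting records the parameters; it does not execute the template.
    return TaskRow(env=env, template=template, args=dict(args))


# One definition, many data points: a list of rows is a dataset.
words = ["raspberry", "strawberry", "mississippi"]
tasks = [mint("letters", "count_letter", word=w) for w in words]
```

Because a row is only data, a dataset can be generated, filtered, or serialized before anything is run.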
Grading
The second yield is the reward. You have three options, in order of increasing power.
1. Plain Python
For simple checks, just compute a float. HUD ships normalized comparison helpers in hud.graders:
tasks.py
float): exact_match, contains, contains_any, contains_all, numeric_match, f1_score, and normalize (a text-normalization building block). See the Graders reference.
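For intuition about what a normalized comparison does, the sketch below reimplements three of these ideas locally. These are illustrative functions that happen to share names with the helpers listed above; the actual normalization rules and signatures in hud.graders may differ, so treat every detail here as an assumption.

```python
import re
import string
from collections import Counter


def normalize(text: str) -> str:
    """Lowercase, drop ASCII punctuation, and collapse whitespace."""
    text = text.lower().translate(str.maketrans("", "", string.punctuation))
    return re.sub(r"\s+", " ", text).strip()


def exact_match(answer: str, expected: str) -> float:
    # 1.0 when the normalized strings are identical, else 0.0.
    return 1.0 if normalize(answer) == normalize(expected) else 0.0


def f1_score(answer: str, expected: str) -> float:
    # Token-level F1: harmonic mean of precision and recall over shared tokens.
    pred, gold = normalize(answer).split(), normalize(expected).split()
    overlap = sum((Counter(pred) & Counter(gold)).values())
    if overlap == 0:
        return 0.0
    precision, recall = overlap / len(pred), overlap / len(gold)
    return 2 * precision * recall / (precision + recall)
```

Normalizing before comparing keeps a correct answer from scoring 0.0 over casing or a trailing period.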
2. Async graders
BashGrader runs a shell command and scores by exit code (1.0 if it exits 0); LLMJudgeGrader scores an answer against rubric criteria with an LLM. Both are async and return a SubScore:
tasks.py
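The exit-code rule is easy to reproduce with the standard library. The sketch below is not BashGrader itself: `score_by_exit_code` is a hypothetical helper that returns a bare float rather than a SubScore, and it launches the current Python interpreter instead of a shell so that it runs the same way on any platform.

```python
import asyncio
import sys


async def score_by_exit_code(*argv: str) -> float:
    """Run a command and map its exit status to a score: 0 -> 1.0, else 0.0."""
    proc = await asyncio.create_subprocess_exec(
        *argv,
        stdout=asyncio.subprocess.DEVNULL,
        stderr=asyncio.subprocess.DEVNULL,
    )
    returncode = await proc.wait()
    return 1.0 if returncode == 0 else 0.0


# Two stand-in commands: one exits 0, the other exits 3.
passing = asyncio.run(score_by_exit_code(sys.executable, "-c", "raise SystemExit(0)"))
failing = asyncio.run(score_by_exit_code(sys.executable, "-c", "raise SystemExit(3)"))
```

Any command that signals success through its exit status, such as a test runner, fits this pattern.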
3. Composed graders
combine runs several graders in parallel and combines them into a weighted EvaluationResult you can yield directly. Positive weights are normalized to sum to 1.0:
tasks.py
LLMJudgeGrader needs the rubric package: pip install rubric.
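The weighting arithmetic can be sketched independently of the SDK. Below, `combine_weighted` is a hypothetical helper, not HUD's combine: it returns a bare float instead of an EvaluationResult, and it only accepts positive weights because how other weights are treated is not covered here.

```python
import asyncio
from typing import Awaitable, Callable, Sequence

Grader = Callable[[], Awaitable[float]]


async def combine_weighted(graders: Sequence[tuple[float, Grader]]) -> float:
    """Run graders concurrently and return the weight-normalized score."""
    if not graders or any(weight <= 0 for weight, _ in graders):
        raise ValueError("this sketch expects at least one grader and only positive weights")
    scores = await asyncio.gather(*(grader() for _, grader in graders))
    total = sum(weight for weight, _ in graders)
    # Dividing by the total makes the effective weights sum to 1.0.
    return sum((weight / total) * score for (weight, _), score in zip(graders, scores))


async def tests_pass() -> float:   # stand-in grader that fully passes
    await asyncio.sleep(0)
    return 1.0


async def style_ok() -> float:     # stand-in grader that fully fails
    await asyncio.sleep(0)
    return 0.0


combined = asyncio.run(combine_weighted([(3.0, tests_pass), (1.0, style_ok)]))
```

With weights 3 and 1, the effective weights are 0.75 and 0.25, so a pass on the first grader and a fail on the second gives 0.75.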
Grade the outcome, not just the answer
A grader doesn’t have to read the agent’s words. Because the agent acts on a real system through its capabilities, the most reliable thing to score is often the state it left behind: tests passing, a file written, a row in a database, a service responding. The task simply skips the answer = assignment and grades the world:
tasks.py
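As one illustration of grading state rather than words, the sketch below scores whether a file with the expected contents exists. The file name `config.json`, the `port` key, and `grade_config_written` are all hypothetical, and a temporary directory stands in for the agent's workspace.

```python
import json
import tempfile
from pathlib import Path


def grade_config_written(workdir: Path) -> float:
    """Score the state left behind: 1.0 only if config.json sets port 8080."""
    path = workdir / "config.json"
    if not path.is_file():
        return 0.0
    try:
        data = json.loads(path.read_text())
    except json.JSONDecodeError:
        return 0.0
    return 1.0 if isinstance(data, dict) and data.get("port") == 8080 else 0.0


with tempfile.TemporaryDirectory() as tmp:
    workdir = Path(tmp)
    before = grade_config_written(workdir)  # nothing written yet
    (workdir / "config.json").write_text(json.dumps({"port": 8080}))  # the "agent" acts
    after = grade_config_written(workdir)   # the state now satisfies the check
```

The grader never sees what the agent said, only what it did, which makes the check harder to satisfy with a confident but wrong reply.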
Structured answers
By default the answer is the agent’s raw text. To receive a typed, parsed answer, declare returns= with a type; the answer arrives as an Answer[T] (parsed content, original raw):
tasks.py
input= and returns= to surface JSON schemas in the environment’s manifest. See the Types reference.
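The shape of a typed answer can be pictured as a small generic container. The class below is a local stand-in, not the SDK's Answer: the field names `content` and `raw` follow the description above, while `parse_int_answer` is a hypothetical parser added for the example.

```python
from dataclasses import dataclass
from typing import Generic, TypeVar

T = TypeVar("T")


@dataclass(frozen=True)
class Answer(Generic[T]):
    """Local stand-in: the parsed value alongside the agent's original text."""
    content: T
    raw: str


def parse_int_answer(raw: str) -> Answer[int]:
    # Parse once, but keep the raw text so a grader can still inspect it.
    return Answer(content=int(raw.strip()), raw=raw)


answer = parse_int_answer(" 3\n")
is_correct = answer.content == "raspberry".count("r")
```

Keeping both fields lets a grader compare typed values while still logging exactly what the agent wrote.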
Sync metadata: slug and columns
When you publish a taskset to the platform (hud sync tasks), each task carries optional metadata. slug is its stable id (defaults to the template id plus an args hash); columns are arbitrary fields surfaced as filterable columns and leaderboard facets on the platform:
tasks.py
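A default of "template id plus an args hash" could be derived along the following lines. This is only a guess at the idea: the hash function, digest length, separator, and the `default_slug` name are assumptions, and the platform's real slug format may differ.

```python
import hashlib
import json


def default_slug(template_id: str, args: dict) -> str:
    """Hypothetical slug: template id plus a short hash of canonicalized args."""
    # Sorting keys makes the slug independent of argument order.
    canonical = json.dumps(args, sort_keys=True, separators=(",", ":"))
    digest = hashlib.sha256(canonical.encode("utf-8")).hexdigest()[:8]
    return f"{template_id}-{digest}"


slug_a = default_slug("count_letter", {"word": "raspberry", "letter": "r"})
slug_b = default_slug("count_letter", {"letter": "r", "word": "raspberry"})
slug_c = default_slug("count_letter", {"word": "strawberry", "letter": "r"})
```

The property that matters is stability: the same template and args should map to the same id across syncs, while different args map to different ids.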
Run them
While authoring, one command runs your tasks. It loads the env from your source and grades each one:
Task; run it for a Job of graded runs. With no runtime=, it serves the source the task was defined in, so it just works locally:
run.py
runtime= comes in:
- Scale: package the environment and run it on your own infra or HUD-hosted. See Run tasks anywhere.
- Train: drive a Taskset in a loop and turn rewards into GRPO advantages. See Train on your tasks.
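For readers unfamiliar with the term, a GRPO advantage is commonly computed by standardizing each reward against the other rewards in its group. The sketch below shows that arithmetic only; `grpo_advantages` is a hypothetical helper, and the use of population standard deviation and the `eps` value are assumptions rather than a description of HUD's trainer.

```python
from statistics import mean, pstdev


def grpo_advantages(rewards: list[float], eps: float = 1e-8) -> list[float]:
    """Group-relative advantages: (reward - group mean) / (group std + eps)."""
    mu = mean(rewards)
    sigma = pstdev(rewards)  # population std; some implementations use sample std
    return [(r - mu) / (sigma + eps) for r in rewards]


# Four runs of the same task: two solved, two failed.
advantages = grpo_advantages([1.0, 0.0, 1.0, 0.0])

# A group with identical rewards carries no learning signal.
flat = grpo_advantages([1.0, 1.0, 1.0])
```

This is why a spread of rewards within a group matters: if every run scores the same, every advantage is zero.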
Next steps
Designing tasks for signal
Make tasks that actually teach: difficulty, spread, and anti-reward-hacking.
Graders reference
Every grader, comparison helper, and the combine combiner.
Run on any model
Evaluate with Claude, OpenAI, Gemini, or your own endpoint.
Train on your tasks
Turn a group of rewards into GRPO advantages.