- a template is the generator you author with @env.template(). It’s callable, and calling it doesn’t run anything - it mints a task.
- a task is a filled-in template: one set of arguments bound into a single runnable row (the environment name, the template id, the args). You run a task.
- a taskset is a named, ordered collection of tasks - a table of those rows.
- a job is the receipt a run produces: the graded runs and their mean reward.
Defining a task
A task is an async generator with exactly two yields. The first yield is the prompt; the generator pauses while the agent works; the agent’s answer comes back into the generator; the second yield is the reward (0.0-1.0).
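The two-yield protocol can be sketched with nothing but the standard library. This is an illustration of the control flow only: `count_letters`, `drive`, and `fixed_agent` are hypothetical names, and the driver stands in for whatever HUD's runner actually does.

```python
import asyncio


async def count_letters(word: str, letter: str):
    """Hypothetical task body: first yield is the prompt, second is the reward."""
    # First yield: hand the prompt out; the generator pauses here while the agent works.
    answer = yield f"How many times does '{letter}' appear in '{word}'?"
    # Second yield: grade the answer that was sent back in, as a float in [0.0, 1.0].
    yield 1.0 if answer.strip() == str(word.count(letter)) else 0.0


async def drive(task_gen, agent):
    """Hypothetical driver standing in for the runner (not HUD's API)."""
    prompt = await task_gen.__anext__()    # advance to the first yield
    answer = await agent(prompt)           # the agent works while the generator is paused
    reward = await task_gen.asend(answer)  # resume with the answer, receive the second yield
    await task_gen.aclose()
    return prompt, reward


async def fixed_agent(prompt: str) -> str:
    # Stand-in agent that always submits the same answer.
    return "3"


prompt, reward = asyncio.run(drive(count_letters("strawberry", "r"), fixed_agent))
```

The answer reaches the generator as the value of the first `yield` expression, which is why the grading logic can sit in the same function as the prompt.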
@env.template() registers the generator as a template; calling it mints a concrete Task.
Set returns=T on the template and the answer arrives as a parsed
Answer[T] (.content parsed, .raw the original string); without it, answer is
the raw string the agent submitted.
One definition, many tasks
The core move: one template spans a space of tasks. Call it with different arguments and a single function becomes a whole dataset, with no separate artifact per task.
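As a rough sketch of that idea, the snippet below uses a hypothetical `template` factory (a plain-Python stand-in, not `@env.template()` itself) whose call binds arguments into a row without running anything. The environment name, template id, and argument names are made up for illustration.

```python
from itertools import product


def template(env: str, template_id: str):
    """Hypothetical stand-in for a template: calling it binds args into a row."""
    def mint(**args):
        # Nothing runs here; the call only records which template and which arguments.
        return {"env": env, "id": template_id, "args": args}
    return mint


count_letters = template("letters", "count_letters")

words = ["strawberry", "banana", "mississippi"]
letters = ["a", "s"]

# One definition, six rows: every (word, letter) pair becomes its own task.
tasks = [count_letters(word=w, letter=l) for w, l in product(words, letters)]
```

Growing the dataset is then a matter of widening the argument lists rather than authoring more files.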
The Task row
A Task is a Pydantic model - one portable, validated row of data. It holds no live environment: env
is a name, the join key between the row and whatever brings that environment up at run time. So a task
runs anywhere without an env object in-process - the prompt and reward arrive over the wire from the
substrate that placement brings up.
| Field | Type | Description |
|---|---|---|
| env | str | Name of the environment the row belongs to. |
| id | str | Task id registered on the environment. |
| args | dict | Bound arguments (what the template was called with). |
| slug | str \| None | Stable id for sync, filtering, and lookup. |
| columns | dict \| None | Metadata surfaced as filter/leaderboard facets. |
| validation | list[dict] \| None | Platform/sync metadata. |
| agent_config | dict \| None | Per-task agent overrides (e.g. {"max_steps": 50}). |
| runtime_config | RuntimeConfig \| None | Per-row launch hints (image, resources); the runtime applies what it supports. |
task.model_dump() and Task.model_validate(data) are the whole codec:
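The round trip can be pictured with a standard-library stand-in. `TaskRow` below is a hypothetical dataclass, not HUD's Pydantic model: `asdict` plays the role of `model_dump()` and the constructor plays the role of `model_validate(data)`, without Pydantic's validation.

```python
import json
from dataclasses import asdict, dataclass, field
from typing import Optional


@dataclass
class TaskRow:
    """Hypothetical stdlib stand-in for the Task row (the real one is a Pydantic model)."""
    env: str
    id: str
    args: dict = field(default_factory=dict)
    slug: Optional[str] = None


row = TaskRow(
    env="letters",
    id="count_letters",
    args={"word": "banana", "letter": "a"},
    slug="banana-a",
)

wire = json.dumps(asdict(row))          # plays the role of task.model_dump()
restored = TaskRow(**json.loads(wire))  # plays the role of Task.model_validate(data)
```

Because the row carries only a name for its environment, the serialized form is all another process needs to rebuild it.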
Grading: what the reward measures
The second yield is the reward. The first decision is what you score:
- Grade the answer - compute a float from what the agent said.
- Grade the world - score the state the agent left behind: tests passing, a file written, a service responding. Because the agent acts on a real system through its capabilities, this is often the most reliable signal. It’s outcome verification: the rigor of a test suite, with no fixed protocol the agent has to follow.
1. Plain Python - compute a float yourself
For simple checks, just return a number. HUD ships normalized comparison helpers in hud.graders.
Helpers include exact_match, contains, numeric_match, f1_score, and normalize - each returns a float.
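To make "normalized comparison" concrete, here are plain-Python re-implementations of three such helpers. They borrow the helper names for readability, but the signatures and normalization rules are assumptions and may differ from what hud.graders does.

```python
import re
from collections import Counter


def normalize(text: str) -> str:
    """Lowercase, drop punctuation, collapse whitespace (assumed normalization rules)."""
    text = re.sub(r"[^\w\s]", "", text.lower())
    return " ".join(text.split())


def exact_match(answer: str, expected: str) -> float:
    return 1.0 if normalize(answer) == normalize(expected) else 0.0


def contains(answer: str, expected: str) -> float:
    return 1.0 if normalize(expected) in normalize(answer) else 0.0


def f1_score(answer: str, expected: str) -> float:
    """Token-level F1: harmonic mean of precision and recall over shared tokens."""
    pred, gold = normalize(answer).split(), normalize(expected).split()
    overlap = sum((Counter(pred) & Counter(gold)).values())
    if overlap == 0:
        return 0.0
    precision, recall = overlap / len(pred), overlap / len(gold)
    return 2 * precision * recall / (precision + recall)
```

Each function returns a float, so any of them can be yielded directly as the reward.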
2. Async graders - score a command or an LLM judgment
BashGrader runs a shell command and scores by exit code; LLMJudgeGrader scores an answer against
rubric criteria. This is also how you grade the world - point a command at the state the agent left.
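Scoring by exit code can be sketched with asyncio's subprocess support. `grade_command` is a hypothetical helper, not BashGrader: it runs an argument vector rather than a shell string, and treating a timeout as a failure is an assumption of this sketch.

```python
import asyncio
import sys


async def grade_command(*argv: str, timeout: float = 30.0) -> float:
    """Hypothetical exit-code grader: 1.0 if the command exits 0, else 0.0."""
    proc = await asyncio.create_subprocess_exec(
        *argv,
        stdout=asyncio.subprocess.DEVNULL,
        stderr=asyncio.subprocess.DEVNULL,
    )
    try:
        code = await asyncio.wait_for(proc.wait(), timeout)
    except asyncio.TimeoutError:
        proc.kill()        # assumption: a hung check counts as a failure
        await proc.wait()
        return 0.0
    return 1.0 if code == 0 else 0.0


# The Python interpreter stands in for a real check such as a test suite.
passing = asyncio.run(grade_command(sys.executable, "-c", "raise SystemExit(0)"))
failing = asyncio.run(grade_command(sys.executable, "-c", "raise SystemExit(3)"))
```

Pointing the command at a test runner or a health check is what turns this from grading the answer into grading the world.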
3. Composed graders - combine several into one reward
combine runs several graders in parallel and weights them into one reward, with each subscore visible
in the trace so a partial reward is legible.
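A weighted combination of this kind might look like the sketch below. `combine_weighted` is a hypothetical function, not HUD's combine: the input shape and the choice to normalize by total weight are assumptions made for the example.

```python
import asyncio


async def combine_weighted(graders):
    """Hypothetical combiner: graders maps name -> (weight, awaitable returning a float)."""
    names = list(graders)
    # Run every grader concurrently; results come back in the same order as names.
    scores = await asyncio.gather(*(graders[n][1] for n in names))
    subscores = dict(zip(names, scores))
    total_weight = sum(graders[n][0] for n in names)
    # Assumption: weights are normalized so the reward stays within 0.0-1.0.
    reward = sum(graders[n][0] * subscores[n] for n in names) / total_weight
    return reward, subscores


async def tests_pass() -> float:
    return 1.0


async def style_ok() -> float:
    return 0.5


reward, subscores = asyncio.run(
    combine_weighted({"tests": (3.0, tests_pass()), "style": (1.0, style_ok())})
)
```

Returning the subscores alongside the total is what keeps a partial reward such as 0.875 explainable.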
Tasksets
A Taskset is a named collection of task rows. Build one in code, or load it from a source.
Save one with ts.to_file("tasks.json") (or .jsonl). Tasksets are also ordered
collections:
| Operation | Description |
|---|---|
| len(ts) / iter(ts) | Count / iterate tasks in order. |
| ts["slug"] | Look up one task by slug. |
| ts.filter(slugs) / ts.exclude(slugs) | Keep / drop matching slugs (returns a new taskset). |
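The operations in the table can be mirrored by a small stand-in class. `MiniTaskset` is hypothetical and holds plain dicts; it shows the semantics described above (ordered, slug lookup, filter and exclude returning new collections), not HUD's implementation.

```python
class MiniTaskset:
    """Hypothetical stand-in mirroring the operations in the table above."""

    def __init__(self, name, tasks):
        self.name = name
        self._tasks = list(tasks)  # insertion order is preserved

    def __len__(self):
        return len(self._tasks)

    def __iter__(self):
        return iter(self._tasks)

    def __getitem__(self, slug):
        for task in self._tasks:
            if task["slug"] == slug:
                return task
        raise KeyError(slug)

    def filter(self, slugs):
        keep = set(slugs)
        return MiniTaskset(self.name, [t for t in self._tasks if t["slug"] in keep])

    def exclude(self, slugs):
        drop = set(slugs)
        return MiniTaskset(self.name, [t for t in self._tasks if t["slug"] not in drop])


ts = MiniTaskset("letters", [{"slug": s} for s in ("easy", "medium", "hard")])
subset = ts.filter(["easy", "hard"])
```

Because filter and exclude return new objects, the original taskset is left untouched and can be sliced several ways in one script.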
Running
taskset.run(agent, ...) executes every task and returns a Job. task.run(...) is the same
call over a taskset of one, with identical semantics:
- runtime= chooses where each rollout runs (local subprocess, container, cloud sandbox, HUD). Swap it freely without touching the tasks; omit it and placement is inferred (a locally-authored source serves itself, platform/file rows go HUD-hosted). See Runtime.
- group= repeats each task N times so you can see the reward spread (the grouping GRPO trains on).
- max_concurrent= caps how many rollouts run in parallel.
Errors are recorded on the Run inside the job rather than raising, so one bad rollout
never collapses a batch.
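How repetition, a concurrency cap, and captured failures fit together can be sketched with asyncio. `run_all` and `flaky_rollout` are hypothetical; the expansion order and the choice to score a failed rollout as 0.0 are assumptions of this sketch rather than documented HUD behavior.

```python
import asyncio


async def run_all(tasks, rollout, group=1, max_concurrent=4):
    """Hypothetical runner: repeat each task `group` times, cap parallelism, keep errors."""
    gate = asyncio.Semaphore(max_concurrent)

    async def one(task):
        async with gate:  # at most max_concurrent rollouts are in flight
            try:
                return {"slug": task, "reward": await rollout(task), "error": None}
            except Exception as exc:  # record the failure on the run instead of raising
                return {"slug": task, "reward": 0.0, "error": repr(exc)}

    # Assumed expansion order: all repeats of the first task, then the second, and so on.
    expanded = [task for task in tasks for _ in range(group)]
    return await asyncio.gather(*(one(task) for task in expanded))


async def flaky_rollout(task):
    if task == "broken":
        raise RuntimeError("sandbox failed to start")
    return 1.0


runs = asyncio.run(run_all(["ok", "broken"], flaky_rollout, group=2))
```

The batch completes with four runs even though two of them failed, and the failures stay inspectable on the individual records.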
Jobs
A Job is the receipt for one execution. Every run reports under a job - there are no standalone
traces, so even a single task.run returns a job of one.
| Member | Type | Description |
|---|---|---|
| id | str | HUD job id. |
| name | str | Display name. |
| runs | list[Run] | The graded Runs, in expansion order. |
| group | int | Rollouts per task. |
| reward | float | Mean reward across all runs. |
| results | dict[str, list[Run]] | Runs grouped by task slug - the alignment-safe alternative to zip(tasks, runs) (list-valued since group > 1 gives several runs per task). |
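The relationship between runs, reward, and results can be shown on plain dicts. The records below are made-up stand-ins for Run objects, and the per-task mean at the end is an extra illustration, not a documented Job member.

```python
from collections import defaultdict
from statistics import mean

# Made-up run records for two tasks with group=2.
runs = [
    {"slug": "easy", "reward": 1.0},
    {"slug": "easy", "reward": 0.5},
    {"slug": "hard", "reward": 0.0},
    {"slug": "hard", "reward": 0.5},
]

# Mean reward across all runs, as described for the reward member.
job_reward = mean(r["reward"] for r in runs)

# Runs grouped by task slug, as described for results; list-valued because group > 1.
results = defaultdict(list)
for r in runs:
    results[r["slug"]].append(r)

# Per-task spread, which is what repeating each task makes visible.
per_task = {slug: mean(r["reward"] for r in rs) for slug, rs in results.items()}
```

Grouping by slug avoids the misalignment that zip(tasks, runs) would cause once each task has more than one run.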
Each run call mints its own job. To gather many calls under one id - a training session, a
multi-turn chat - open one with Job.start and pass it as job=:
Syncing to the platform
Sync is only for the platform: it publishes a locally-authored taskset to hud.ai so you can run it there, compare models on it, and browse its traces. Local runs never need it.
hud sync tasks <name> uploads a taskset, sending only what changed. In code, diff() shows that
comparison as a SyncPlan:
| Field | Description |
|---|---|
| to_create | Local tasks not present remotely. |
| to_update | Local tasks whose content differs from remote. |
| unchanged | Local tasks that match remote. |
| remote_only | Remote tasks with no local counterpart. |
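The four buckets follow from a straightforward comparison keyed by slug. `plan_sync` below is a hypothetical function over plain dicts, meant only to illustrate how the fields relate; how HUD actually compares task content is not shown here.

```python
def plan_sync(local: dict, remote: dict) -> dict:
    """Hypothetical diff keyed by slug; values are the task content to compare."""
    plan = {"to_create": [], "to_update": [], "unchanged": [], "remote_only": []}
    for slug, content in local.items():
        if slug not in remote:
            plan["to_create"].append(slug)   # local only
        elif remote[slug] != content:
            plan["to_update"].append(slug)   # present on both sides, content differs
        else:
            plan["unchanged"].append(slug)   # present on both sides, identical
    plan["remote_only"] = [slug for slug in remote if slug not in local]
    return plan


local = {"a": {"args": {"n": 1}}, "b": {"args": {"n": 2}}, "c": {"args": {"n": 3}}}
remote = {"b": {"args": {"n": 2}}, "c": {"args": {"n": 9}}, "d": {"args": {"n": 4}}}
plan = plan_sync(local, remote)
```

Only the to_create and to_update buckets need to be uploaded, which matches the "only what changed" behavior described above.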
See also
Environments
Where the agent acts, and the capabilities a task grades against.
Graders
Every grader, comparison helper, and the combine combiner.
Runtime
Where each task runs, chosen at execution time.
Advice
Design tasks that actually teach: difficulty, spread, and resisting reward hacking.