An environment is where the agent acts. A task is the work you measure there. A task only ever does two things: it prompts the agent, and it grades what happened into a single number, the reward. That one number is the whole point - it’s what an evaluation reports and what training learns from. Four words do all the work, and they’re worth keeping apart:
  • a template is the generator you author with @env.template(). It’s callable, and calling it doesn’t run anything - it mints a task.
  • a task is a filled-in template: one set of arguments bound into a single runnable row (the environment name, the template id, the args). You run a task.
  • a taskset is a named, ordered collection of tasks - a table of those rows.
  • a job is the receipt a run produces: the graded runs and their mean reward.

Defining a task

A task is an async generator with exactly two yields. The first yield is the prompt; the generator pauses while the agent works; the agent’s answer comes back into the generator; the second yield is the reward (0.0-1.0).
tasks.py
from hud import Environment

env = Environment(name="letter-count")

@env.template()
async def count_letter(word: str = "strawberry", letter: str = "r"):
    answer = yield f"How many '{letter}'s are in '{word}'?"   # 1st yield: the prompt
    yield 1.0 if answer == str(word.count(letter)) else 0.0   # 2nd yield: the reward
This shape is deliberate. Everything the agent does - every step, tool call, and observation - happens in the gap between the two yields, so you describe a task as plain Python: ask on one line, score on the next. The agent loop in the middle is HUD’s job, not yours. @env.template() registers the generator as a template; calling it mints a concrete Task:
task = count_letter(word="raspberry")   # a Task row, not yet run
Declare returns=T on the template and the answer arrives as a parsed Answer[T] (.content parsed, .raw the original string); without it, answer is the raw string the agent submitted.
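To make the two-yield handshake concrete, here is a minimal sketch of how a driver could step such a generator using plain asyncio. It is not HUD's implementation: drive and fake_agent are hypothetical names, and the generator is left undecorated so the snippet runs without hud installed.

```python
import asyncio

async def count_letter(word: str = "strawberry", letter: str = "r"):
    answer = yield f"How many '{letter}'s are in '{word}'?"   # 1st yield: the prompt
    yield 1.0 if answer == str(word.count(letter)) else 0.0   # 2nd yield: the reward

async def drive(task_gen, agent):
    """Hypothetical driver: run one two-yield generator against an agent callable."""
    prompt = await task_gen.__anext__()     # advance to the first yield
    answer = await agent(prompt)            # the agent works while the generator is paused
    reward = await task_gen.asend(answer)   # resume with the answer, stop at the second yield
    await task_gen.aclose()                 # finalize the generator
    return prompt, reward

async def fake_agent(prompt: str) -> str:
    return "3"   # stand-in for a real agent loop

prompt, reward = asyncio.run(drive(count_letter(), fake_agent))
print(prompt, reward)
```

The answer enters the generator through asend, which is why the first yield appears on the right-hand side of an assignment.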

One definition, many tasks

The core move: one template spans a space of tasks. Call it with different arguments and a single function becomes a whole dataset, with no separate artifact per task:
tasks.py
tasks = [count_letter(word=w) for w in ("strawberry", "raspberry", "blueberry")]
Parameterize across difficulties, inputs, or seeds and one definition becomes an entire benchmark.
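As a rough illustration of parameterizing across inputs, the sketch below expands a small grid of arguments into rows. make_row is a hypothetical stand-in for calling a template; it builds plain dicts with the env, id, and args described above rather than real Task objects.

```python
from itertools import product

def make_row(word: str, letter: str) -> dict:
    # Hypothetical stand-in for calling a template: one bound set of arguments per row.
    return {"env": "letter-count", "id": "count_letter", "args": {"word": word, "letter": letter}}

words = ("strawberry", "raspberry", "blueberry")
letters = ("r", "b")

# Every (word, letter) combination becomes its own row.
rows = [make_row(w, l) for w, l in product(words, letters)]
print(len(rows))
```

Three words and two letters give six rows from one definition; adding a value to either axis grows the dataset without any new task code.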

The Task row

A Task is a Pydantic model - one portable, validated row of data. It holds no live environment: env is a name, the join key between the row and whatever brings that environment up at run time. So a task runs anywhere without an env object in-process - the prompt and reward arrive over the wire from the substrate that placement brings up.
Field           Type                   Description
env             str                    Name of the environment the row belongs to.
id              str                    Task id registered on the environment.
args            dict                   Bound arguments (what the template was called with).
slug            str | None             Stable id for sync, filtering, and lookup.
columns         dict | None            Metadata surfaced as filter/leaderboard facets.
validation      list[dict] | None      Platform/sync metadata.
agent_config    dict | None            Per-task agent overrides (e.g. {"max_steps": 50}).
runtime_config  RuntimeConfig | None   Per-row launch hints (image, resources); the runtime applies what it supports.
When you don’t have the template in hand (data pipelines, generated rows), build the model directly - the model is the row, so task.model_dump() and Task.model_validate(data) are the whole codec:
from hud import Task

task = Task(env="letter-count", id="count_letter", args={"word": "strawberry"}, slug="count-straw")
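Because a row is just data, it should survive a trip through JSON unchanged. The sketch below shows that round trip with a standard-library dataclass named Row, which is a stand-in invented for illustration and not the hud Task model; with the real model, model_dump() and Task.model_validate(data) play the roles of asdict and the constructor.

```python
import json
from dataclasses import asdict, dataclass, field
from typing import Optional

@dataclass
class Row:
    """Stand-in for a task row; NOT the hud Task model."""
    env: str
    id: str
    args: dict = field(default_factory=dict)
    slug: Optional[str] = None

row = Row(env="letter-count", id="count_letter", args={"word": "strawberry"}, slug="count-straw")

line = json.dumps(asdict(row))       # serialize: one row becomes one JSON line
restored = Row(**json.loads(line))   # deserialize: rebuild the row from plain data

print(restored == row)
```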

Grading: what the reward measures

The second yield is the reward. The first decision is what you score:
  • Grade the answer - compute a float from what the agent said.
  • Grade the world - score the state the agent left behind: tests passing, a file written, a service responding. Because the agent acts on a real system through its capabilities, this is often the most reliable signal. It’s outcome verification: the rigor of a test suite, with no fixed protocol the agent has to follow.
You reach for one of three approaches, in increasing power: a direct comparison, a single grader, or a weighted combination of graders.
For simple checks, yield the number directly. HUD ships normalized comparison helpers in hud.graders:
tasks.py
from hud.graders import numeric_match

@env.template()
async def count_letter(word: str = "strawberry", letter: str = "r"):
    answer = yield f"How many '{letter}'s are in '{word}'?"
    yield numeric_match(answer, word.count(letter))
Helpers include exact_match, contains, numeric_match, f1_score, and normalize - each returns a float.
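To show what "normalized comparison" can mean in practice, here is a sketch of three such checks written from scratch. The functions carry a _sketch suffix because they are illustrative re-implementations under my own assumptions (lowercasing, stripping punctuation, token-level F1); the helpers in hud.graders may normalize differently.

```python
import re
import string
from collections import Counter

def normalize_text(text: str) -> str:
    # Assumed normalization: lowercase, drop punctuation, collapse whitespace.
    text = text.lower().translate(str.maketrans("", "", string.punctuation))
    return " ".join(text.split())

def exact_match_sketch(answer, expected) -> float:
    return 1.0 if normalize_text(str(answer)) == normalize_text(str(expected)) else 0.0

def numeric_match_sketch(answer, expected, tol: float = 1e-9) -> float:
    # Compare the first number found in the answer against the expected value.
    found = re.search(r"-?\d+(?:\.\d+)?", str(answer))
    if found is None:
        return 0.0
    return 1.0 if abs(float(found.group()) - float(expected)) <= tol else 0.0

def f1_sketch(answer: str, expected: str) -> float:
    # Token-level F1: harmonic mean of precision and recall over shared tokens.
    pred, gold = normalize_text(answer).split(), normalize_text(expected).split()
    overlap = sum((Counter(pred) & Counter(gold)).values())
    if overlap == 0:
        return 0.0
    precision, recall = overlap / len(pred), overlap / len(gold)
    return 2 * precision * recall / (precision + recall)

print(numeric_match_sketch("There are 3 of them", 3))
```

Each function returns a float, so any of them can sit directly on the second yield.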
BashGrader runs a shell command and scores by exit code; LLMJudgeGrader scores an answer against rubric criteria. This is also how you grade the world - point a command at the state the agent left:
tasks.py
from hud.graders import BashGrader

@env.template()
async def add_endpoint():
    yield "Add a /health endpoint to the app in your workspace and make it return 200."
    result = await BashGrader.grade(command="pytest tests/test_health.py -q")
    yield result.value   # score what the agent did, not how it described it
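Scoring by exit code is simple enough to sketch with the standard library. The function below is a hypothetical illustration of the idea, not BashGrader itself: it runs a command, treats exit code 0 as a pass, and treats a timeout or a command that cannot start as a fail.

```python
import subprocess
import sys

def grade_by_exit_code(command: list, timeout: float = 60.0) -> float:
    """Hypothetical helper: 1.0 if the command exits with 0, else 0.0."""
    try:
        completed = subprocess.run(command, capture_output=True, timeout=timeout)
    except (subprocess.TimeoutExpired, OSError):
        return 0.0   # a hung or unlaunchable check counts as a failure
    return 1.0 if completed.returncode == 0 else 0.0

# Use the current interpreter as a portable stand-in for a test command.
passing = grade_by_exit_code([sys.executable, "-c", "raise SystemExit(0)"])
failing = grade_by_exit_code([sys.executable, "-c", "raise SystemExit(1)"])
print(passing, failing)
```

In a real task the command would be something like the pytest invocation above, pointed at the workspace the agent modified.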
combine runs several graders in parallel and weights them into one reward, with each subscore visible in the trace so a partial reward is legible:
tasks.py
from hud.graders import BashGrader, LLMJudgeGrader, combine

@env.template()
async def implement_feature(spec: str = "add a /health endpoint"):
    answer = yield f"Implement this and summarize what you changed: {spec}"
    yield await combine(
        BashGrader.grade(weight=0.5, command="pytest -q"),
        LLMJudgeGrader.grade(weight=0.5, answer=answer, criteria=["Matches the spec"]),
    )
The full catalog lives in the Graders reference.
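The arithmetic behind a weighted combination can be sketched in a few lines. combine_sketch, tests_pass, and judge_score are hypothetical names, and dividing by the total weight is my assumption; HUD's combine may handle weights and subscores differently.

```python
import asyncio

async def tests_pass() -> float:
    await asyncio.sleep(0)   # stand-in for running a test suite
    return 1.0

async def judge_score() -> float:
    await asyncio.sleep(0)   # stand-in for calling a judge model
    return 0.5

async def combine_sketch(*weighted):
    """Takes (name, weight, coroutine) triples; returns (reward, subscores)."""
    names, weights, coros = zip(*weighted)
    scores = await asyncio.gather(*coros)   # run the graders concurrently
    # Assumption: normalize by total weight so the reward stays in 0.0-1.0.
    reward = sum(w * s for w, s in zip(weights, scores)) / sum(weights)
    return reward, dict(zip(names, scores))

reward, subscores = asyncio.run(combine_sketch(
    ("tests", 0.5, tests_pass()),
    ("judge", 0.5, judge_score()),
))
print(reward, subscores)
```

Keeping the named subscores next to the total is what makes a partial reward such as 0.75 explainable: the tests passed and the judge was half convinced.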
A grader that returns a constant, or just echoes the answer back as a pass, teaches a model nothing and invites reward hacking. A good reward separates good work from bad - see Advice.

Tasksets

A Taskset is a named collection of task rows. Build one in code, or load it from a source:
from hud import Taskset

# in code - the authoring case
ts = Taskset("letters", [count_letter(word="strawberry"), count_letter(word="raspberry")])

# from a Python source (.py file or directory) - scans it for Task / Taskset objects
ts = Taskset.from_file("tasks.py")

# from a data file (.json / .jsonl) - portable rows, no source needed
ts = Taskset.from_file("tasks.jsonl")

# from the platform - by taskset name or id (uses HUD_API_KEY)
ts = Taskset.from_api("SheetBench-50")
Write rows back out with ts.to_file("tasks.json") (or .jsonl). Tasksets are also ordered collections:
Operation                               Description
len(ts) / iter(ts)                      Count / iterate tasks in order.
ts["slug"]                              Look up one task by slug.
ts.filter(slugs) / ts.exclude(slugs)    Keep / drop matching slugs (returns a new taskset).

Running

taskset.run(agent, ...) executes every task and returns a Job. task.run(...) is the same call over a taskset of one, with identical semantics:
from hud import LocalRuntime

# one task
job = await count_letter(word="strawberry").run(agent, runtime=LocalRuntime("env.py"))

# a whole taskset: 8 rollouts per task, capped concurrency
job = await ts.run(agent, runtime=LocalRuntime("env.py"), group=8, max_concurrent=10)
print(job.reward)
  • runtime= chooses where each rollout runs (local subprocess, container, cloud sandbox, HUD). Swap it freely without touching the tasks; omit it and placement is inferred (a locally-authored source serves itself, platform/file rows go HUD-hosted). See Runtime.
  • group= repeats each task N times so you can see the reward spread (the grouping GRPO trains on).
  • max_concurrent= caps how many rollouts run in parallel.
A crashed rollout comes back as a failed Run inside the job rather than raising, so one bad rollout never collapses a batch.
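The three behaviors above (repeating each task, capping parallelism, and isolating crashes) can be sketched with plain asyncio. Everything here is hypothetical scaffolding, not HUD's scheduler: rollout fakes an agent run, and recording a crash as a run with reward 0.0 and an error string is my assumption about how a failed run might be represented.

```python
import asyncio

async def rollout(task_name: str, attempt: int) -> float:
    if task_name == "broken":
        raise RuntimeError("environment crashed")   # simulate a crashed rollout
    await asyncio.sleep(0)
    return 1.0 if attempt % 2 == 0 else 0.0         # fake, deterministic reward

async def run_all(task_names, group: int, max_concurrent: int):
    gate = asyncio.Semaphore(max_concurrent)        # caps rollouts in flight

    async def guarded(name: str, attempt: int) -> dict:
        async with gate:
            try:
                return {"task": name, "reward": await rollout(name, attempt), "error": None}
            except Exception as exc:                # record the failure instead of raising
                return {"task": name, "reward": 0.0, "error": repr(exc)}

    # group= expansion: each task is attempted `group` times.
    jobs = [guarded(name, i) for name in task_names for i in range(group)]
    return await asyncio.gather(*jobs)

runs = asyncio.run(run_all(["strawberry", "broken"], group=4, max_concurrent=2))
failed = [r for r in runs if r["error"] is not None]
print(len(runs), len(failed))
```

Because each rollout catches its own exception, the four crashed runs sit alongside the four healthy ones in the result instead of aborting the batch.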

Jobs

A Job is the receipt for one execution. Every run reports under a job - there are no standalone traces, so even a single task.run returns a job of one.
Member   Type                  Description
id       str                   HUD job id.
name     str                   Display name.
runs     list[Run]             The graded Runs, in expansion order.
group    int                   Rollouts per task.
reward   float                 Mean reward across all runs.
results  dict[str, list[Run]]  Runs grouped by task slug - the alignment-safe alternative to zip(tasks, runs) (list-valued since group > 1 gives several runs per task).
job = await ts.run(agent, runtime=LocalRuntime("env.py"), group=4)
job.reward                          # mean across every run
job.runs[0].trace.content           # what the agent answered on the first run
for slug, runs in job.results.items():   # per-task: its runs, keyed by slug
    print(slug, sum(r.reward for r in runs) / len(runs))
By default each run call mints its own job. To gather many calls under one id - a training session, a multi-turn chat - open one with Job.start and pass it as job=:
from hud import Job

job = await Job.start("grpo-session", group=8)
epochs = 3   # example value: number of passes to accumulate under one job
for step in range(epochs):
    await ts.run(agent, runtime=LocalRuntime("env.py"), job=job)   # all runs accumulate here

Syncing to the platform

Sync is only for the platform: it publishes a locally-authored taskset to hud.ai so you can run it there, compare models on it, and browse its traces. Local runs never need it. hud sync tasks <name> uploads a taskset and only what changed. In code, diff() shows that comparison as a SyncPlan:
from hud import Taskset
from hud.eval.sync import diff

plan = diff(Taskset.from_file("tasks.py"), Taskset.from_api("SheetBench-50"))
print(plan.summary())
Field        Description
to_create    Local tasks not present remotely.
to_update    Local tasks whose content differs from remote.
unchanged    Local tasks that match remote.
remote_only  Remote tasks with no local counterpart.
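The four buckets follow from a straightforward comparison keyed by slug. The sketch below computes them over plain dicts; plan_sync is a hypothetical function for illustration, and the real diff() may compare rows on different fields.

```python
def plan_sync(local: dict, remote: dict) -> dict:
    """Hypothetical sketch: bucket slugs by how local rows compare to remote rows."""
    plan = {"to_create": [], "to_update": [], "unchanged": [], "remote_only": []}
    for slug, row in local.items():
        if slug not in remote:
            plan["to_create"].append(slug)    # exists only locally
        elif row != remote[slug]:
            plan["to_update"].append(slug)    # same slug, different content
        else:
            plan["unchanged"].append(slug)    # identical on both sides
    plan["remote_only"] = [slug for slug in remote if slug not in local]
    return plan

local = {"a": {"word": "strawberry"}, "b": {"word": "raspberry"}, "c": {"word": "blueberry"}}
remote = {"b": {"word": "raspberry"}, "c": {"word": "cranberry"}, "d": {"word": "gooseberry"}}

plan = plan_sync(local, remote)
print(plan)
```

Uploading only to_create and to_update is what lets a sync send just what changed.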

See also

Environments

Where the agent acts, and the capabilities a task grades against.

Graders

Every grader, comparison helper, and the combine combiner.

Runtime

Where each task runs, chosen at execution time.

Advice

Design tasks that actually teach: difficulty, spread, and resisting reward hacking.