HUD Documentation - Evaluations and RL Environments.

A task is a teacher, not a test. A test grades a deliverable once; a training task gets optimized against, repeatedly, by gradient descent. That changes the design rules: anything you don’t actively reward gets ignored, and anything you accidentally reward gets exploited. This page distills the principles that make a task actually train a model.

Signal lives in within-group spread

Modern RL post-training (GRPO and its relatives) computes each rollout’s advantage by subtracting the group mean from its reward. If every rollout in a group earns the same reward, every advantage is zero and no gradient is produced - the task taught nothing, no matter how healthy the average looks. So the operational unit of trainability is spread within a group, not the mean. Run each task as a group and check that outcomes differ:

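```python
# Run inside an async function or a notebook that supports top-level await.
# Taskset, my_task, and agent are assumed to be imported or defined already.
taskset = Taskset("spread-check", [my_task(seed=s) for s in range(5)])
job = await taskset.run(agent, group=16)
rewards = [run.reward for run in job.runs]
# All 0.0 (or all 1.0) -> no signal. You want a non-degenerate spread.
```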

All-zero at small group sizes may still be learnable at training scale (larger k surfaces occasional successes), but it’s a red flag worth investigating.
All-one (saturated) produces no spread at any scale - the task is too easy and is wasted training surface.
Variance destruction: a task where the agent does real work but a hard cap, vocabulary gate, or oversized penalty clamps the reward to a narrow band is just as useless as one the agent can’t engage with. Keep the reward responsive to the quality of the work.
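To make the zero-gradient case concrete, here is a minimal sketch of the group-relative advantage described above. The function names, the `normalize` option (some variants also divide by the group standard deviation), and the floor and ceiling defaults are assumptions for illustration, not part of any HUD API.

```python
from statistics import mean, pstdev


def group_advantages(rewards: list[float], normalize: bool = False) -> list[float]:
    """Group-relative advantages: each reward minus the group mean."""
    baseline = mean(rewards)
    centered = [r - baseline for r in rewards]
    if not normalize:
        return centered
    spread = pstdev(rewards)
    if spread == 0:
        # A constant group has no direction to push in; advantages stay at zero.
        return [0.0 for _ in rewards]
    return [c / spread for c in centered]


def spread_status(rewards: list[float], floor: float = 0.0, ceiling: float = 1.0) -> str:
    """Label a group as 'all-floor', 'saturated', 'constant', or 'spread'."""
    if all(r == rewards[0] for r in rewards):
        if rewards[0] <= floor:
            return "all-floor"
        if rewards[0] >= ceiling:
            return "saturated"
        return "constant"
    return "spread"


dead_group = [0.0, 0.0, 0.0, 0.0]
mixed_group = [0.0, 1.0, 1.0, 0.0]
print(group_advantages(dead_group), spread_status(dead_group))
# [0.0, 0.0, 0.0, 0.0] all-floor
print(group_advantages(mixed_group), spread_status(mixed_group))
# [-0.5, 0.5, 0.5, -0.5] spread
```

Both groups could sit inside a taskset with a healthy-looking average, but only the second one contributes a learning signal.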

Difficulty is relative to a specific model

Difficulty has no absolute meaning - every claim of “hard” is anchored to a specific model, version, and reasoning effort. A task that spreads nicely for one model can saturate for a stronger one. State which model and regime you calibrated against, and re-check when you change it.

Compare across a span, not a cluster. If you only ever check a task against a few similar top-tier models, you can’t tell a well-calibrated task from a saturated one. Validate against both a weak anchor and a strong anchor: a wide capability range shows where the task actually sits on the difficulty scale.
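One way to picture the span-versus-cluster point is to label each model's mean reward on the same task. This is a sketch only: the model names, the reward values, and the 0.05 / 0.95 thresholds are made up for illustration and should be replaced with your own calibration anchors.

```python
def label(mean_reward: float, low: float = 0.05, high: float = 0.95) -> str:
    """Classify one model's mean reward on a task (thresholds are illustrative)."""
    if mean_reward <= low:
        return "floor"
    if mean_reward >= high:
        return "saturated"
    return "in range"


def calibration_report(mean_reward_by_model: dict[str, float]) -> dict[str, str]:
    """Label every model the task was checked against."""
    return {model: label(r) for model, r in mean_reward_by_model.items()}


# Hypothetical numbers: a cluster of similar strong models vs. a weak-to-strong span.
cluster = {"strong-a": 0.97, "strong-b": 0.99}
span = {"weak-anchor": 0.10, "strong-anchor": 0.97}

print(calibration_report(cluster))
# {'strong-a': 'saturated', 'strong-b': 'saturated'}
print(calibration_report(span))
# {'weak-anchor': 'in range', 'strong-anchor': 'saturated'}
```

The cluster only says the task is saturated for strong models; the span additionally shows it still produces signal lower in the capability range.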

Resist the cheapest path

The single most important grader property: the highest reward an agent can get without doing the work the task is about must sit at or below the floor. If there’s a shortcut, gradient descent will find it. Common exploits to design against:

Hardcoding outputs or substituting a constant for computation.
Symptom mitigation instead of a root-cause fix (e.g. a try/except that swallows a failing test).
Using the grader’s vocabulary without doing the underlying analysis.
Retrieving an upstream artifact (clone/fetch/install) when the task expects in-workspace work.

Never ship a grader that returns a constant. echo PASS, default-on-crash, or shape-only checks (“did it return a number?” instead of “did it return 86?”) give positive reward regardless of behavior - they are pure reward-hacking surface. Grade substance, not surface form: credit a correct answer in a different format (thousands separators, casing, whitespace), but never credit the shape alone.
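As an illustration of grading substance rather than shape, the sketch below accepts a correct number in several surface forms but gives the floor to a wrong number or an unparseable answer. `parse_number` and `grade_numeric` are hypothetical helpers written for this example, not HUD graders.

```python
import re
from typing import Optional

# Plain digits, or digits grouped with thousands separators, plus optional decimals.
_NUMBER = re.compile(r"[+-]?(\d{1,3}(,\d{3})+|\d+)(\.\d+)?")


def parse_number(text: str) -> Optional[float]:
    """Parse a number, tolerating surrounding whitespace and thousands separators."""
    cleaned = text.strip()
    if not _NUMBER.fullmatch(cleaned):
        return None
    return float(cleaned.replace(",", ""))


def grade_numeric(answer: str, expected: float, tol: float = 1e-9) -> float:
    """Return 1.0 only when the parsed value matches the expected value."""
    value = parse_number(answer)
    if value is None:
        return 0.0  # unparseable answers get the floor, never a default pass
    return 1.0 if abs(value - expected) <= tol else 0.0


print(grade_numeric("86", 86))       # 1.0
print(grade_numeric(" 86.0 ", 86))   # 1.0  (different format, same substance)
print(grade_numeric("1,234", 1234))  # 1.0
print(grade_numeric("42", 86))       # 0.0  (right shape, wrong answer)
print(grade_numeric("PASS", 86))     # 0.0
```

The important property is that no input earns reward without matching the expected value, so the cheapest path stays at the floor.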

Make it multi-step

A task where one inference call produces the deliverable doesn’t give RL enough rollout structure to learn from. Real training tasks require multiple steps (several observations, tool calls, or turns) so the trajectory carries learnable structure. If your task is single-shot, give the agent something to do: a capability to act through and a problem that requires integrating evidence across more than one observation.

Keep the answer out of the environment

A task that tests investigation must not hand over the conclusion. Watch for leakage:

Root-cause leakage - a diff, PR description, comment, or doc that names the bug/fix the agent is supposed to find.
Grader leakage - sentinel phrases or required vocabulary in the prompt that exist only to satisfy the grader. Weave any needed guidance into natural context instead.
Eval-context leakage - text implying the task is a test, rollout, or judged exercise. (It changes behavior.)
Author artifacts - oracle solutions, grading harnesses, or local paths left where the agent can read them.
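A cheap first pass over the leakage types above is a phrase scan of everything the agent can read. The sketch below assumes you can collect the prompt and workspace files into a dictionary of text; the file names, contents, and forbidden phrases are invented for illustration, and a phrase scan will not catch paraphrased leaks.

```python
def find_leaks(visible_text: dict[str, str], forbidden_phrases: list[str]) -> list[tuple[str, str]]:
    """Return (source, phrase) pairs where agent-visible text contains a forbidden phrase."""
    hits = []
    for source, text in visible_text.items():
        lowered = text.lower()
        for phrase in forbidden_phrases:
            if phrase.lower() in lowered:
                hits.append((source, phrase))
    return hits


# Hypothetical task material: the README names the root cause the agent should find.
visible = {
    "prompt": "The checkout service is returning 500s. Investigate and fix it.",
    "README.md": "Known issue: the cache TTL is set to 0, see the pending fix.",
}
forbidden = ["cache TTL", "oracle solution", "this is a rollout"]

print(find_leaks(visible, forbidden))
# [('README.md', 'cache TTL')]
```

Build the forbidden list from the root cause, the grader's required vocabulary, and the names of your own authoring artifacts.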

Align the prompt and the grader

What the prompt sets up, the grader should test - and vice versa. Two related properties:

Prompt-grader alignment: don’t score for content the prompt never asked for, and don’t ask for work the grader ignores.
Score-quality monotonicity: a rollout whose substantive work is better must not score lower. If a generic memo that did no investigation can outscore a thorough one, the grader is measuring shape, not substance.

Compose graders so a partial reward is legible (see combine) - subscores let you see which component earned the reward, which is how you catch monotonicity violations.
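The sketch below shows how visible subscores expose a monotonicity violation. It is not HUD's combine: `weighted_score` and `monotonicity_violations` are hypothetical helpers, and the quality ranks stand in for a human judgment of which rollout did better work.

```python
def weighted_score(subscores: dict[str, float], weights: dict[str, float]) -> float:
    """Weighted average of named subscores."""
    total = sum(weights.values())
    return sum(weights[name] * subscores[name] for name in weights) / total


def monotonicity_violations(rollouts: list[tuple[str, int, float]]) -> list[tuple[str, str]]:
    """Given (name, quality_rank, score), return (better, worse) pairs where
    the better-ranked rollout received the lower score."""
    violations = []
    for name_a, rank_a, score_a in rollouts:
        for name_b, rank_b, score_b in rollouts:
            if rank_a > rank_b and score_a < score_b:
                violations.append((name_a, name_b))
    return violations


# Illustrative subscores: a polished memo with no investigation vs. a rough but thorough one.
generic = {"format": 1.0, "root_cause": 0.0}
thorough = {"format": 0.0, "root_cause": 0.75}


def check(weights: dict[str, float]) -> list[tuple[str, str]]:
    scored = [
        ("generic", 1, weighted_score(generic, weights)),
        ("thorough", 2, weighted_score(thorough, weights)),
    ]
    return monotonicity_violations(scored)


print(check({"format": 0.5, "root_cause": 0.5}))
# [('thorough', 'generic')]  -> format is outweighing substance
print(check({"format": 0.25, "root_cause": 0.75}))
# []
```

Because each component is reported separately, it is clear that the format subscore caused the inversion, which a single opaque score would hide.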

Source substrate that isn’t memorized

If the agent saw your task’s material during pretraining, you’re measuring recall, not capability. Prefer proprietary, self-generated, or transformed substrate over public benchmarks:

Avoid contamination: popular public benchmarks and widely-scraped repos are overrepresented in pretraining - a model can recognize the source instead of solving the problem.
Public as inspiration, not substrate: a public codebase operated to generate fresh logs/traces is fine; the same codebase handed to the agent verbatim is not.
Authenticity is the value: real failures, partial successes, and edge cases carry the signal. Don’t sanitize them away, and don’t fabricate synthetic substrate to look real.

Compose a taskset that isn’t all one shape

A single great task isn’t a dataset. A taskset where every task does the same thing in a different costume - same operation, different proper nouns - won’t train general capability.

Diversify across failure modes targeted, substrate sources, deliverable shapes, and capabilities exercised. Diagnostic: if you can summarize every task with one sentence varying only the nouns, it’s too same-shape.
Spread the difficulty distribution. Concentrating tasks at score 0 or at saturation wastes training surface; aim for a controlled range against your calibration anchor.
Size it to the training run so it doesn’t overfit in the first few steps.

Checklist

The grader’s cheapest path scores at or below the floor (no constant/echo/shape-only passes).

A group of rollouts produces non-degenerate reward spread.

Difficulty is calibrated against a named model + reasoning regime, checked across a weak-to-strong span.

The task is multi-step and requires integrating evidence.

No root-cause, grader, or eval-context leakage in the environment or prompt.

Prompt and grader are aligned; better work never scores lower.

Substrate isn’t a memorized public benchmark.

The taskset is diverse and spans a difficulty distribution.

See also

Tasks & Tasksets

Graders

Training

Composing richer environments