HUD Documentation — Evaluations and RL Environments.

Graders turn an agent's answer into a reward. HUD ships reusable ones so you don't hand-build common scoring logic. Yield the result (a float or an `EvaluationResult`) as the task's second yield.

from hud.graders import (
    BashGrader, LLMJudgeGrader, Grader,
    SubScore, EvaluationResult,
    combine, combine_any, combine_all,
    exact_match, contains, contains_any, contains_all,
    numeric_match, f1_score, normalize,
)

Comparison helpers

Every helper except `normalize` returns a float (0.0–1.0) you can yield directly or wrap in a `SubScore`; `normalize` returns the cleaned string.

Helper	Signature	Returns
`exact_match`	`exact_match(answer, expected, *, normalize_text=True)`	`1.0` if equal (normalized)
`contains`	`contains(answer, substring, *, case_sensitive=False)`	`1.0` if substring present
`contains_any`	`contains_any(answer, substrings, *, case_sensitive=False)`	`1.0` if any present
`contains_all`	`contains_all(answer, substrings, *, case_sensitive=False)`	`1.0` if all present
`numeric_match`	`numeric_match(answer, expected, *, tolerance=0.0)`	`1.0` if the first number in the answer matches `expected` within `tolerance`
`f1_score`	`f1_score(answer, reference)`	token-level F1
`normalize`	`normalize(text) -> str`	lowercased, punctuation/articles stripped
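
To make the table concrete, here is a minimal sketch of how helpers like these could be written. It is not HUD's implementation: the `_sketch` names, the regexes, and the exact normalization order are assumptions chosen to match the descriptions above (lowercasing, stripping punctuation and articles, token-level F1, first-number matching).

```python
import re
import string
from collections import Counter

# Hypothetical stand-ins; the _sketch suffix keeps them apart from hud.graders.
_ARTICLES = re.compile(r"\b(a|an|the)\b")
_PUNCT = str.maketrans("", "", string.punctuation)
_NUMBER = re.compile(r"-?\d+(?:\.\d+)?")


def normalize_sketch(text: str) -> str:
    """Lowercase, strip punctuation and English articles, collapse whitespace."""
    text = text.lower().translate(_PUNCT)
    text = _ARTICLES.sub(" ", text)
    return " ".join(text.split())


def exact_match_sketch(answer: str, expected: str, *, normalize_text: bool = True) -> float:
    """1.0 if the two strings are equal, optionally after normalization."""
    if normalize_text:
        answer, expected = normalize_sketch(answer), normalize_sketch(expected)
    return 1.0 if answer == expected else 0.0


def numeric_match_sketch(answer: str, expected: float, *, tolerance: float = 0.0) -> float:
    """1.0 if the first number found in the answer is within tolerance of expected."""
    match = _NUMBER.search(answer)
    if match is None:
        return 0.0
    return 1.0 if abs(float(match.group()) - expected) <= tolerance else 0.0


def f1_score_sketch(answer: str, reference: str) -> float:
    """Token-level F1: harmonic mean of token precision and recall."""
    pred = normalize_sketch(answer).split()
    gold = normalize_sketch(reference).split()
    if not pred or not gold:
        return 1.0 if pred == gold else 0.0
    overlap = sum((Counter(pred) & Counter(gold)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred)
    recall = overlap / len(gold)
    return 2 * precision * recall / (precision + recall)
```

With this sketch, "The capital is Paris" scored against "Paris" gets an F1 of 0.5 (one shared token, precision 1/3, recall 1), which is the kind of partial credit `exact_match` cannot give.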

@env.template()
async def capital(country: str = "France"):
    answer = yield f"What is the capital of {country}?"
    yield exact_match(answer, "Paris")
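
The task above is an async generator with two yields: the prompt, then the reward. The sketch below shows that protocol in plain Python, without HUD. The `run_task` driver and the inline comparison are hypothetical stand-ins; how HUD actually drives a template is not shown in this page.

```python
import asyncio


async def capital_sketch(country: str = "France"):
    # First yield: hand the prompt out and receive the agent's answer back.
    answer = yield f"What is the capital of {country}?"
    # Second yield: hand out the reward (a plain float here).
    yield 1.0 if answer.strip().lower() == "paris" else 0.0


async def run_task(task_gen, agent_answer: str):
    """Hypothetical driver: advance to the prompt, send the answer, read the reward."""
    prompt = await task_gen.__anext__()
    reward = await task_gen.asend(agent_answer)
    await task_gen.aclose()
    return prompt, reward


prompt, reward = asyncio.run(run_task(capital_sketch(), "Paris"))
```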

`BashGrader`

Runs a shell command via `/bin/bash -lc` and scores by exit code: 1.0 if the command exits 0. `grade` is async and returns a `SubScore`. It requires bash (macOS, Linux, WSL, or a built image); on native Windows it scores 0.0 with a "/bin/bash not found" error.

@env.template()
async def fix_tests():
    answer = yield "Make the tests pass."
    result = await BashGrader.grade(weight=1.0, command="pytest -q", cwd="/workspace")
    yield result.value

`cwd` is the host directory to run in. For a workspace-backed task, pass the workspace root so the grader sees the same files the agent edited.

Parameter	Default	Description
`weight`	—	Weight in a composed grade.
`command`	—	Shell command to run.
`cwd`	`None`	Working directory.
`timeout_seconds`	`600`	Kill + score `0.0` on timeout.
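
The scoring rule described in this section (exit code 0 gives 1.0, a timeout or a missing bash gives 0.0) can be sketched with the standard library. `bash_exit_score` is a hypothetical helper, not part of HUD, and it is synchronous for brevity whereas `BashGrader.grade` is async.

```python
import subprocess
from typing import Optional


def bash_exit_score(
    command: str,
    cwd: Optional[str] = None,
    timeout_seconds: float = 600,
) -> float:
    """Hypothetical helper: 1.0 if the command exits 0, otherwise 0.0."""
    try:
        proc = subprocess.run(
            ["/bin/bash", "-lc", command],
            cwd=cwd,
            capture_output=True,
            timeout=timeout_seconds,
        )
    except FileNotFoundError:
        # No /bin/bash on this host (for example native Windows).
        return 0.0
    except subprocess.TimeoutExpired:
        # subprocess.run kills the child process before raising.
        return 0.0
    return 1.0 if proc.returncode == 0 else 0.0
```

Any non-zero exit code maps to 0.0 in this sketch, so a test runner that signals failure through its exit status (as `pytest` does) needs no output parsing.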

`LLMJudgeGrader`

Scores an answer against weighted criteria with an LLM judge (uses the HUD gateway). Each criterion is graded MET/UNMET in parallel and combined by weight; no extra install needed.

# `answer` and `prompt` come from the enclosing task (prompt sent, answer received).
result = await LLMJudgeGrader.grade(
    weight=1.0,
    answer=answer,
    criteria=["Correct", ("Well-reasoned", 2.0)],
    question=prompt,
    model="claude-haiku-4-5",
)

Each item in `criteria` is either a string or a `(requirement, weight)` tuple.
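
As a rough illustration of "graded MET/UNMET and combined by weight", the sketch below computes a weighted fraction of met criteria. The function names, the default weight of 1.0 for bare strings, and the weighted-fraction formula are all assumptions; the judge call itself is replaced by a dictionary of verdicts.

```python
from typing import Dict, List, Tuple, Union

Criterion = Union[str, Tuple[str, float]]


def split_criteria(criteria: List[Criterion]) -> List[Tuple[str, float]]:
    """Expand bare strings to (requirement, 1.0); the 1.0 default is an assumption."""
    return [(c, 1.0) if isinstance(c, str) else (c[0], float(c[1])) for c in criteria]


def weighted_met_fraction(criteria: List[Criterion], verdicts: Dict[str, bool]) -> float:
    """Share of total criterion weight whose verdict is MET (True)."""
    pairs = split_criteria(criteria)
    total = sum(weight for _, weight in pairs)
    if total <= 0:
        return 0.0
    met = sum(weight for requirement, weight in pairs if verdicts.get(requirement, False))
    return met / total


criteria = ["Correct", ("Well-reasoned", 2.0)]
score = weighted_met_fraction(criteria, {"Correct": True, "Well-reasoned": False})
```

Under these assumptions, meeting only "Correct" earns one third of the reward, because "Well-reasoned" carries twice the weight.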

`combine` — compose multiple graders

`combine` resolves `SubScore`s and grader coroutines in parallel and merges them into a weighted `EvaluationResult`. Positive weights are normalized to sum to 1.0; negative weights are penalties.

@env.template()
async def composed():
    answer = yield "Solve the task."
    yield await combine(
        BashGrader.grade(weight=0.5, command="pytest -q"),
        LLMJudgeGrader.grade(weight=0.3, answer=answer, criteria=["Matches the spec"]),
        SubScore(name="format", value=exact_match(answer, "42"), weight=0.2),
    )
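
The weighting rule can be worked through in a small stand-alone sketch. `SubScoreSketch` and `combine_sketch` are hypothetical stand-ins that return a bare float rather than an `EvaluationResult`. Two details are assumptions not stated in this page: that a penalty subtracts `value * |weight|` without normalization, and that the final reward is clamped to the 0–1 range.

```python
import asyncio
import inspect
from dataclasses import dataclass


@dataclass
class SubScoreSketch:  # hypothetical stand-in for hud.graders.SubScore
    name: str
    value: float
    weight: float = 1.0


async def combine_sketch(*items) -> float:
    """Resolve awaitables concurrently, then apply the weighting rule."""

    async def resolve(item):
        return await item if inspect.isawaitable(item) else item

    scores = await asyncio.gather(*(resolve(item) for item in items))
    positive_total = sum(s.weight for s in scores if s.weight > 0)
    reward = 0.0
    for s in scores:
        if s.weight > 0:
            reward += s.value * s.weight / positive_total  # normalized share
        else:
            reward += s.value * s.weight  # penalty: assumed to subtract as-is
    return max(0.0, min(1.0, reward))  # clamping is an assumption


async def slow_grader(name: str, value: float, weight: float) -> SubScoreSketch:
    await asyncio.sleep(0.01)  # stands in for a test run or a judge call
    return SubScoreSketch(name, value, weight)


reward = asyncio.run(
    combine_sketch(
        slow_grader("tests", 1.0, 0.5),
        slow_grader("judge", 1.0, 0.3),
        SubScoreSketch("format", 0.0, 0.2),
    )
)
```

Here the tests and the judge pass while the format check fails, so the reward is 0.5 + 0.3 + 0.0 = 0.8.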

Function	Description
`await combine(*items)`	Resolve `SubScore` / `Awaitable[SubScore]` in parallel → `EvaluationResult`.
`combine_any(weight, subscores)`	Boolean OR: a `SubScore` that passes if any input passes (max).
`combine_all(weight, subscores)`	Boolean AND: a `SubScore` that passes only if all inputs pass (min).

The subscores appear in the trace, so a partial reward is legible. `combine_any` and `combine_all` collapse alternatives into a single component you can feed to `combine`, for example "tests pass via pytest OR via make test" as one 0/1 subscore.
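
The OR/AND behaviour reduces to max and min over the input values, as the table notes. A minimal sketch follows, with hypothetical `_sketch` names, assumed result names ("any", "all"), and an assumed score of 0.0 for an empty input list.

```python
from typing import NamedTuple, Sequence


class SubScoreSketch(NamedTuple):  # hypothetical stand-in for hud.graders.SubScore
    name: str
    value: float
    weight: float = 1.0


def combine_any_sketch(weight: float, subscores: Sequence[SubScoreSketch]) -> SubScoreSketch:
    """Boolean OR: take the best input value (0.0 for no inputs, an assumption)."""
    value = max((s.value for s in subscores), default=0.0)
    return SubScoreSketch("any", value, weight)


def combine_all_sketch(weight: float, subscores: Sequence[SubScoreSketch]) -> SubScoreSketch:
    """Boolean AND: take the worst input value (0.0 for no inputs, an assumption)."""
    value = min((s.value for s in subscores), default=0.0)
    return SubScoreSketch("all", value, weight)


# "Tests pass via pytest OR via make test": pytest failed, make test passed.
alternatives = [SubScoreSketch("pytest", 0.0), SubScoreSketch("make_test", 1.0)]
either = combine_any_sketch(0.5, alternatives)
both = combine_all_sketch(0.5, alternatives)
```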

Custom graders

Subclass `Grader` and implement an async `compute_score` classmethod that returns a float or a `(float, metadata)` tuple:

class LengthGrader(Grader):
    name = "length"

    @classmethod
    async def compute_score(cls, answer: str = "", target: int = 100, **kwargs):
        return 1.0 if len(answer) >= target else 0.0

result = await LengthGrader.grade(weight=1.0, answer=answer, target=200)
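
To show the shape of the hand-off between `grade` and `compute_score`, including the `(float, metadata)` return form, here is a self-contained sketch. `GraderSketch` and `SubScoreSketch` are hypothetical stand-ins; how the real `Grader.grade` builds its `SubScore` is an assumption.

```python
import asyncio
from dataclasses import dataclass, field


@dataclass
class SubScoreSketch:  # hypothetical stand-in for hud.graders.SubScore
    name: str
    value: float
    weight: float = 1.0
    metadata: dict = field(default_factory=dict)


class GraderSketch:
    """Hypothetical base class showing the grade -> compute_score hand-off."""

    name = "grader"

    @classmethod
    async def grade(cls, weight: float, **kwargs) -> SubScoreSketch:
        result = await cls.compute_score(**kwargs)
        # compute_score may return a float or a (float, metadata) pair.
        value, metadata = result if isinstance(result, tuple) else (result, {})
        return SubScoreSketch(cls.name, float(value), weight, dict(metadata))

    @classmethod
    async def compute_score(cls, **kwargs):
        raise NotImplementedError


class LengthGraderSketch(GraderSketch):
    name = "length"

    @classmethod
    async def compute_score(cls, answer: str = "", target: int = 100, **kwargs):
        score = 1.0 if len(answer) >= target else 0.0
        # Returning metadata alongside the score records why it was given.
        return score, {"length": len(answer), "target": target}


sub = asyncio.run(LengthGraderSketch.grade(weight=1.0, answer="x" * 250, target=200))
```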

`SubScore` and `EvaluationResult`

A `SubScore` is one component of a grade: `name`, `value` (0–1), `weight` (default 1.0; negative = penalty), optional `metadata`. An `EvaluationResult` is the combined grade payload you can yield from a task:

Field	Default	Description
`reward`	`0.0`	Final score.
`done`	`True`	Episode complete.
`subscores`	`None`	Optional breakdown (shown in the trace).
`info`	`{}`	Extra metadata.
`content`	`None`	Human-readable explanation.
`isError`	`False`	Whether grading itself failed.

`EvaluationResult.from_float(value)` wraps a bare float reward in an `EvaluationResult`.
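
The field table can be mirrored as a plain dataclass to see what a wrapped float looks like. `EvaluationResultSketch` is a hypothetical stand-in built only from the fields and defaults listed above; the real class may validate or serialize differently.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class EvaluationResultSketch:  # hypothetical mirror of the documented fields
    reward: float = 0.0
    done: bool = True
    subscores: Optional[list] = None
    info: dict = field(default_factory=dict)
    content: Optional[str] = None
    isError: bool = False

    @classmethod
    def from_float(cls, value: float) -> "EvaluationResultSketch":
        """Wrap a bare reward, leaving every other field at its default."""
        return cls(reward=float(value))


wrapped = EvaluationResultSketch.from_float(0.75)
```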

See also

Tasks & grading

Designing tasks for signal