Graders turn an agent’s answer into a reward. HUD ships reusable ones so you don’t hand-build common scoring logic. Yield the result (a float or an EvaluationResult) as the task’s second yield.
from hud.graders import (
    BashGrader, LLMJudgeGrader, Grader,
    SubScore, EvaluationResult,
    combine, combine_any, combine_all,
    exact_match, contains, contains_any, contains_all,
    numeric_match, f1_score, normalize,
)

Comparison helpers

Each returns a float (0.0-1.0) you can yield directly or wrap in a SubScore.
| Helper | Signature | Returns |
| --- | --- | --- |
| exact_match | exact_match(answer, expected, *, normalize_text=True) | 1.0 if equal (normalized) |
| contains | contains(answer, substring, *, case_sensitive=False) | 1.0 if substring present |
| contains_any | contains_any(answer, substrings, *, case_sensitive=False) | 1.0 if any present |
| contains_all | contains_all(answer, substrings, *, case_sensitive=False) | 1.0 if all present |
| numeric_match | numeric_match(answer, expected, *, tolerance=0.0) | 1.0 if first number matches |
| f1_score | f1_score(answer, reference) | token-level F1 |
| normalize | normalize(text) -> str | lowercased, punctuation/articles stripped |
@env.template()
async def capital(country: str = "France"):
    answer = yield f"What is the capital of {country}?"
    yield exact_match(answer, "Paris")
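To make the less obvious helpers concrete, here is a minimal pure-Python sketch of what "normalized", "token-level F1", and "first number matches" could mean. The names `normalize_sketch`, `f1_sketch`, and `numeric_match_sketch` are hypothetical stand-ins written for illustration; they are not HUD's implementations, and the real helpers may tokenize or parse numbers differently.

```python
import re
import string
from collections import Counter

# Assumption: "articles" means the English articles a / an / the.
_ARTICLES = {"a", "an", "the"}


def normalize_sketch(text: str) -> str:
    """Lowercase, strip ASCII punctuation and articles, collapse whitespace."""
    no_punct = text.lower().translate(str.maketrans("", "", string.punctuation))
    return " ".join(t for t in no_punct.split() if t not in _ARTICLES)


def f1_sketch(answer: str, reference: str) -> float:
    """Token-level F1: harmonic mean of token precision and recall."""
    pred = normalize_sketch(answer).split()
    gold = normalize_sketch(reference).split()
    if not pred or not gold:
        # Both empty counts as a match; one empty counts as a miss.
        return float(pred == gold)
    overlap = sum((Counter(pred) & Counter(gold)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred)
    recall = overlap / len(gold)
    return 2 * precision * recall / (precision + recall)


def numeric_match_sketch(answer: str, expected: float, tolerance: float = 0.0) -> float:
    """Compare only the first number found in the answer against expected."""
    match = re.search(r"-?\d+(?:\.\d+)?", answer)
    if match is None:
        return 0.0
    return 1.0 if abs(float(match.group()) - expected) <= tolerance else 0.0
```

Under this sketch, an answer with extra words still earns partial F1 credit, while a numeric check ignores everything after the first number, which is worth keeping in mind when the agent's answer contains several figures.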

BashGrader

Runs a shell command via /bin/bash -lc and scores by exit code: 1.0 if the command exits 0, otherwise 0.0. The grader is async and returns a SubScore. It requires bash, so it works on macOS, Linux, WSL, or inside a built image; on native Windows it scores 0.0 with a "/bin/bash not found" error.
@env.template()
async def fix_tests():
    answer = yield "Make the tests pass."
    result = await BashGrader.grade(weight=1.0, command="pytest -q", cwd="/workspace")
    yield result.value
cwd is the host directory the command runs in. For a workspace-backed task, pass the workspace root so the grader sees the same files the agent edited.
| Parameter | Default | Description |
| --- | --- | --- |
| weight | - | Weight in a composed grade. |
| command | - | Shell command to run. |
| cwd | None | Working directory. |
| timeout_seconds | 600 | Kill + score 0.0 on timeout. |
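The exit-code rule described above can be sketched with the standard library alone. This is an illustrative approximation, not BashGrader's source: `score_by_exit_code` is a hypothetical name, it is synchronous where the real grader is async, and it returns a bare float rather than a SubScore.

```python
import subprocess
from typing import Optional


def score_by_exit_code(
    command: str,
    cwd: Optional[str] = None,
    timeout_seconds: float = 600,
) -> float:
    """Return 1.0 if the command exits 0 under /bin/bash -lc, else 0.0."""
    try:
        completed = subprocess.run(
            ["/bin/bash", "-lc", command],
            cwd=cwd,
            stdout=subprocess.DEVNULL,  # only the exit code matters here
            stderr=subprocess.DEVNULL,
            timeout=timeout_seconds,
        )
    except FileNotFoundError:
        # /bin/bash (or the cwd) does not exist, e.g. on native Windows.
        return 0.0
    except subprocess.TimeoutExpired:
        # subprocess.run kills the child before raising; score the run as failed.
        return 0.0
    return 1.0 if completed.returncode == 0 else 0.0
```

Because only the exit status is scored, a command such as `pytest -q` gives an all-or-nothing signal; partial credit would need a different grader or several commands combined.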

LLMJudgeGrader

Scores an answer against weighted criteria with an LLM judge (uses the HUD gateway). Each criterion is graded MET/UNMET in parallel and combined by weight; no extra install needed.
result = await LLMJudgeGrader.grade(
    weight=1.0,
    answer=answer,
    criteria=["Correct", ("Well-reasoned", 2.0)],
    question=prompt,
    model="claude-haiku-4-5",
)
criteria items are strings, or (requirement, weight) tuples.
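As a rough illustration of "graded MET/UNMET and combined by weight", the sketch below turns mixed criteria items into (requirement, weight) pairs and computes the weighted fraction that was met. The function names are hypothetical, the default weight of 1.0 for bare strings is an assumption, and the verdicts are supplied by hand here rather than produced by a judge model.

```python
from typing import Dict, List, Tuple, Union

Criterion = Union[str, Tuple[str, float]]


def parse_criteria(criteria: List[Criterion]) -> List[Tuple[str, float]]:
    """Expand bare strings to (requirement, 1.0); pass tuples through."""
    parsed = []
    for item in criteria:
        if isinstance(item, str):
            parsed.append((item, 1.0))  # assumed default weight
        else:
            requirement, weight = item
            parsed.append((requirement, float(weight)))
    return parsed


def weighted_verdict_score(criteria: List[Criterion], verdicts: Dict[str, bool]) -> float:
    """Weighted share of criteria marked MET (True); missing verdicts count as UNMET."""
    parsed = parse_criteria(criteria)
    total = sum(weight for _, weight in parsed)
    if total <= 0:
        return 0.0
    met = sum(weight for requirement, weight in parsed if verdicts.get(requirement, False))
    return met / total
```

With the criteria from the example above, meeting "Correct" but not "Well-reasoned" would give 1.0 / 3.0 under this sketch, since the second criterion carries twice the weight.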

combine - compose multiple graders

combine resolves SubScores and grader coroutines in parallel and combines them into a weighted EvaluationResult. Positive weights are normalized to sum to 1.0; negative weights are penalties.
@env.template()
async def composed():
    answer = yield "Solve the task."
    yield await combine(
        BashGrader.grade(weight=0.5, command="pytest -q"),
        LLMJudgeGrader.grade(weight=0.3, answer=answer, criteria=["Matches the spec"]),
        SubScore(name="format", value=exact_match(answer, "42"), weight=0.2),
    )
| Function | Description |
| --- | --- |
| await combine(*items) | Resolve SubScore / Awaitable[SubScore] in parallel → EvaluationResult. |
| combine_any(weight, subscores) | Boolean OR: a SubScore that passes if any input passes (max). |
| combine_all(weight, subscores) | Boolean AND: a SubScore that passes only if all inputs pass (min). |
The subscores appear in the trace, so a partial reward is legible. combine_any/combine_all collapse alternatives into a single component you can feed to combine - e.g. “tests pass via pytest OR via make test” as one 0/1 subscore.
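The weighting rules can be sketched in a few lines of plain Python. Everything here is a hypothetical stand-in: `Part` plays the role of a SubScore, `combine_sketch` is synchronous, and two details are assumptions rather than documented behavior, namely that penalties are applied with their raw (un-normalized) weight and that the final reward is clamped to the 0-1 range.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Part:
    """Stand-in for a SubScore: a named value in [0, 1] with a weight."""
    name: str
    value: float
    weight: float = 1.0


def combine_sketch(parts: List[Part]) -> float:
    """Normalize positive weights to sum to 1.0; negative weights subtract."""
    positive_total = sum(p.weight for p in parts if p.weight > 0)
    if positive_total == 0:
        return 0.0
    reward = 0.0
    for p in parts:
        if p.weight > 0:
            reward += p.value * (p.weight / positive_total)
        else:
            reward += p.value * p.weight  # assumption: raw penalty weight
    return max(0.0, min(1.0, reward))  # assumption: clamp to [0, 1]


def any_sketch(weight: float, parts: List[Part], name: str = "any") -> Part:
    """Boolean OR as the maximum of the input values."""
    return Part(name, max(p.value for p in parts), weight)


def all_sketch(weight: float, parts: List[Part], name: str = "all") -> Part:
    """Boolean AND as the minimum of the input values."""
    return Part(name, min(p.value for p in parts), weight)
```

For instance, `any_sketch(0.5, [pytest_part, make_part])` would yield one component that passes when either test runner passes, which can then sit alongside other parts in `combine_sketch`.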

Custom graders

Subclass Grader and implement async compute_score (return a float, or (float, metadata)):
class LengthGrader(Grader):
    name = "length"

    @classmethod
    async def compute_score(cls, answer: str = "", target: int = 100, **kwargs):
        return 1.0 if len(answer) >= target else 0.0

result = await LengthGrader.grade(weight=1.0, answer=answer, target=200)

SubScore and EvaluationResult

A SubScore is one component of a grade: name, value (0-1), weight (default 1.0; negative = penalty), optional metadata. An EvaluationResult is the combined grade payload you can yield from a task:
| Field | Default | Description |
| --- | --- | --- |
| reward | 0.0 | Final score. |
| done | True | Episode complete. |
| subscores | None | Optional breakdown (shown in the trace). |
| info | {} | Extra metadata. |
| content | None | Human-readable explanation. |
| isError | False | Whether grading itself failed. |
EvaluationResult.from_float(value) wraps a bare reward.
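To show how the fields and defaults above fit together, here is a small stand-in written as a dataclass. `EvaluationResultSketch` is a hypothetical mirror of the table for illustration only; the real EvaluationResult class may be built differently and may validate its fields.

```python
from dataclasses import dataclass, field
from typing import Any, Dict, List, Optional


@dataclass
class EvaluationResultSketch:
    """Hypothetical mirror of the fields table, not the HUD class itself."""
    reward: float = 0.0
    done: bool = True
    subscores: Optional[List[Any]] = None
    info: Dict[str, Any] = field(default_factory=dict)  # fresh dict per instance
    content: Optional[str] = None
    isError: bool = False

    @classmethod
    def from_float(cls, value: float) -> "EvaluationResultSketch":
        """Wrap a bare reward, leaving every other field at its default."""
        return cls(reward=float(value))
```

In this reading, yielding a float from a task and yielding `from_float(value)` carry the same information; the fuller form matters once you want subscores or an explanation attached.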

See also

Tasks & Tasksets

Designing tasks