float or an EvaluationResult) as the task’s second yield.
Comparison helpers
Each returns a float (0.0–1.0) that you can yield directly or wrap in a SubScore.
| Helper | Signature | Returns |
|---|---|---|
| exact_match | exact_match(answer, expected, *, normalize_text=True) | 1.0 if equal (normalized) |
| contains | contains(answer, substring, *, case_sensitive=False) | 1.0 if substring present |
| contains_any | contains_any(answer, substrings, *, case_sensitive=False) | 1.0 if any present |
| contains_all | contains_all(answer, substrings, *, case_sensitive=False) | 1.0 if all present |
| numeric_match | numeric_match(answer, expected, *, tolerance=0.0) | 1.0 if first number matches |
| f1_score | f1_score(answer, reference) | token-level F1 |
| normalize | normalize(text) -> str | lowercased, punctuation/articles stripped |
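To make the table rows more concrete, the following is a minimal pure-Python sketch of what normalization, token-level F1, and a first-number match could look like. These are stand-in reimplementations written from the one-line descriptions above, not the library's code. The names `sketch_normalize`, `sketch_f1`, and `sketch_numeric_match` are hypothetical, and the details (ASCII punctuation only, English articles only, whitespace tokenization, the number regex) are assumptions.

```python
import re
import string
from collections import Counter

# Assumption: "articles" means the English articles a, an, the.
_ARTICLES = re.compile(r"\b(a|an|the)\b")


def sketch_normalize(text: str) -> str:
    """Lowercase, strip ASCII punctuation and articles, collapse whitespace."""
    text = text.lower()
    text = text.translate(str.maketrans("", "", string.punctuation))
    text = _ARTICLES.sub(" ", text)
    return " ".join(text.split())


def sketch_f1(answer: str, reference: str) -> float:
    """Token-level F1 over normalized, whitespace-split tokens."""
    answer_tokens = sketch_normalize(answer).split()
    reference_tokens = sketch_normalize(reference).split()
    if not answer_tokens or not reference_tokens:
        # Both empty counts as a match; one empty counts as a miss.
        return float(answer_tokens == reference_tokens)
    # Multiset intersection counts each shared token at most min(count) times.
    overlap = sum((Counter(answer_tokens) & Counter(reference_tokens)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(answer_tokens)
    recall = overlap / len(reference_tokens)
    return 2 * precision * recall / (precision + recall)


def sketch_numeric_match(answer: str, expected: float, *, tolerance: float = 0.0) -> float:
    """1.0 if the first number found in the answer is within tolerance of expected."""
    match = re.search(r"-?\d+(?:\.\d+)?", answer)
    if match is None:
        return 0.0
    return 1.0 if abs(float(match.group()) - expected) <= tolerance else 0.0
```

The real helpers may tokenize or parse numbers differently, so treat this only as a guide to the kind of score each row produces.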
BashGrader
Runs a shell command via /bin/bash -lc and scores by exit code (1.0 if it exits 0). Async; returns a SubScore. Needs bash — macOS, Linux, WSL, or a built image; on native Windows it scores 0.0 with a /bin/bash not found error.
cwd is the host directory to run in — for a workspace-backed task, pass the workspace root so the grader sees the same files the agent edited.
| Parameter | Default | Description |
|---|---|---|
| weight | — | Weight in a composed grade. |
| command | — | Shell command to run. |
| cwd | None | Working directory. |
| timeout_seconds | 600 | Kill + score 0.0 on timeout. |
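The exit-code scoring described above can be approximated with the standard library alone. This is a sketch of the behavior, not BashGrader's implementation: the function name `bash_exit_score` and the `bash_path` parameter are hypothetical, it returns a bare float rather than a SubScore, and it discards the command's output.

```python
import asyncio
from typing import Optional


async def bash_exit_score(
    command: str,
    cwd: Optional[str] = None,
    timeout_seconds: float = 600.0,
    bash_path: str = "/bin/bash",  # hypothetical parameter, for illustration
) -> float:
    """Return 1.0 if `command` exits 0 under `bash -lc`, otherwise 0.0."""
    try:
        proc = await asyncio.create_subprocess_exec(
            bash_path, "-lc", command,
            cwd=cwd,
            stdout=asyncio.subprocess.DEVNULL,
            stderr=asyncio.subprocess.DEVNULL,
        )
    except FileNotFoundError:
        # bash (or the cwd) does not exist, e.g. on native Windows.
        return 0.0
    try:
        return_code = await asyncio.wait_for(proc.wait(), timeout=timeout_seconds)
    except asyncio.TimeoutError:
        # Kill the shell and reap it so no zombie process is left behind.
        proc.kill()
        await proc.wait()
        return 0.0
    return 1.0 if return_code == 0 else 0.0
```

For a workspace-backed task, the equivalent of passing the workspace root would be `await bash_exit_score("pytest -q", cwd=workspace_root)`, where `workspace_root` is whatever host path your task exposes.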
LLMJudgeGrader
Scores an answer against weighted criteria with an LLM judge (uses the HUD gateway). Each criterion is graded MET/UNMET in parallel and combined by weight; no extra install needed.
criteria items are strings, or (requirement, weight) tuples.
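As an illustration of how mixed criteria might be combined, the sketch below normalizes strings and (requirement, weight) tuples into pairs and computes the weighted fraction of criteria judged MET. It makes no LLM call: the verdicts are passed in as booleans. The function names are hypothetical, and two details are assumptions rather than documented behavior: a bare string gets weight 1.0, and the score is the MET weight divided by the total weight.

```python
from typing import List, Sequence, Tuple, Union

Criterion = Union[str, Tuple[str, float]]


def normalize_criteria(criteria: Sequence[Criterion]) -> List[Tuple[str, float]]:
    """Turn strings and (requirement, weight) tuples into uniform pairs."""
    pairs: List[Tuple[str, float]] = []
    for item in criteria:
        if isinstance(item, str):
            pairs.append((item, 1.0))  # assumption: default weight is 1.0
        else:
            requirement, weight = item
            pairs.append((requirement, float(weight)))
    return pairs


def weighted_met_fraction(criteria: Sequence[Criterion], verdicts: Sequence[bool]) -> float:
    """Combine per-criterion MET (True) / UNMET (False) verdicts by weight."""
    pairs = normalize_criteria(criteria)
    if len(pairs) != len(verdicts):
        raise ValueError("need exactly one verdict per criterion")
    total = sum(weight for _, weight in pairs)
    if total <= 0:
        return 0.0
    met = sum(weight for (_, weight), ok in zip(pairs, verdicts) if ok)
    return met / total


criteria = ["Cites a source", ("Answers the question asked", 3.0)]
score = weighted_met_fraction(criteria, [False, True])  # 3.0 / 4.0
```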
combine — compose multiple graders
combine resolves SubScores and grader coroutines in parallel and combines them into a weighted EvaluationResult. Positive weights are normalized to sum to 1.0; negative weights are penalties.
| Function | Description |
|---|---|
| await combine(*items) | Resolve SubScore / Awaitable[SubScore] in parallel → EvaluationResult. |
| combine_any(weight, subscores) | Boolean OR: a SubScore that passes if any input passes (max). |
| combine_all(weight, subscores) | Boolean AND: a SubScore that passes only if all inputs pass (min). |
combine_any/combine_all collapse alternatives into a single component you can feed to combine — e.g. “tests pass via pytest OR via make test” as one 0/1 subscore.
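The arithmetic behind these functions can be sketched without the library. Everything below is a stand-in: `SketchSubScore`, `sketch_any`, `sketch_all`, and `sketch_reward` are hypothetical names, and the exact treatment of penalties (applied unnormalized, with the result clamped to 0–1) is an assumption, since the text above only says that positive weights are normalized and negative weights are penalties.

```python
from dataclasses import dataclass
from typing import Sequence


@dataclass
class SketchSubScore:
    """Stand-in for a SubScore: name, value in [0, 1], and a weight."""
    name: str
    value: float
    weight: float = 1.0


def sketch_any(weight: float, subscores: Sequence[SketchSubScore], name: str = "any") -> SketchSubScore:
    """Boolean OR: take the best (max) value among the alternatives."""
    return SketchSubScore(name, max(s.value for s in subscores), weight)


def sketch_all(weight: float, subscores: Sequence[SketchSubScore], name: str = "all") -> SketchSubScore:
    """Boolean AND: take the worst (min) value among the requirements."""
    return SketchSubScore(name, min(s.value for s in subscores), weight)


def sketch_reward(subscores: Sequence[SketchSubScore]) -> float:
    """Weighted sum: positive weights normalized to 1.0, negatives subtract."""
    positive_total = sum(s.weight for s in subscores if s.weight > 0)
    reward = 0.0
    for s in subscores:
        if s.weight > 0:
            reward += s.value * s.weight / positive_total
        else:
            reward += s.value * s.weight  # assumption: penalty applied as-is
    return min(1.0, max(0.0, reward))  # assumption: clamp to [0, 1]


# "tests pass via pytest OR via make test" as one component with weight 3.
tests = sketch_any(3.0, [SketchSubScore("pytest", 0.0), SketchSubScore("make test", 1.0)], name="tests")
lint = SketchSubScore("lint", 0.5, 1.0)
leak = SketchSubScore("leaked_secret", 1.0, -0.25)
reward = sketch_reward([tests, lint, leak])  # 0.75 + 0.125 - 0.25 = 0.625
```

Collapsing the alternatives first matters: feeding both test runners into the weighted sum separately would reward a run for passing twice, whereas the OR component contributes its weight once.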
Custom graders
Subclass Grader and implement async compute_score, returning either a float or a (float, metadata) tuple.
SubScore and EvaluationResult
A SubScore is one component of a grade: name, value (0–1), weight (default 1.0; negative = penalty), optional metadata.
An EvaluationResult is the combined grade payload you can yield from a task:
| Field | Default | Description |
|---|---|---|
| reward | 0.0 | Final score. |
| done | True | Episode complete. |
| subscores | None | Optional breakdown (shown in the trace). |
| info | {} | Extra metadata. |
| content | None | Human-readable explanation. |
| isError | False | Whether grading itself failed. |
EvaluationResult.from_float(value) wraps a bare reward.