`hud.native` includes reusable grader helpers for scenarios that want structured scoring without hand-building `EvaluationResult` objects each time.
All graders are async. `Grade.gather` runs them in parallel and combines the results into an `EvaluationResult` you can yield directly from a scenario.
## Quick Example
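A sketch of what a scenario might look like. Only `Grade`, `SubScore`, and `BashGrader` are taken from this page; the scenario function's shape and the `SubScore` constructor arguments are assumptions.

```python
# Sketch, not a definitive example: assumes SubScore(name=..., value=...)
# and an async-generator scenario that yields the EvaluationResult.
from hud.native import BashGrader, Grade, SubScore

async def evaluate(answer: str):
    # Mix a pre-computed SubScore with a grader coroutine; Grade.gather
    # resolves them in parallel and combines the weighted result.
    result = await Grade.gather(
        SubScore(name="mentions_paris", value=1.0 if "paris" in answer.lower() else 0.0),
        BashGrader(command="test -f /tmp/output.txt", timeout_seconds=30).grade(),
    )
    yield result  # an EvaluationResult, yielded directly from the scenario
```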
## Grade
`Grade` combines `SubScore` values into a single `EvaluationResult`.
### `Grade.from_subscores(subscores)`
Combines already-resolved subscores into a weighted result.
- Positive weights are normalized to sum to `1.0`
- Negative weights are preserved as penalties
- Duplicate subscore names are de-duplicated
- Per-subscore metadata is copied into `EvaluationResult.info`
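These rules can be illustrated in plain Python, independent of the library. This is not hud's implementation; which duplicate wins during de-duplication is an assumption (first occurrence kept here).

```python
# Plain-Python illustration of the documented combination rules.
# Subscores are modeled as (name, value, weight) tuples.
def combine(subscores: list[tuple[str, float, float]]) -> float:
    seen: dict[str, tuple[float, float]] = {}
    for name, value, weight in subscores:
        seen.setdefault(name, (value, weight))  # duplicates de-duplicated (first kept)
    positives = [(v, w) for v, w in seen.values() if w > 0]
    penalties = [(v, w) for v, w in seen.values() if w < 0]
    total = sum(w for _, w in positives) or 1.0
    score = sum(v * (w / total) for v, w in positives)  # weights normalized to 1.0
    score += sum(v * w for v, w in penalties)           # penalties preserved as-is
    return score

combine([("a", 1.0, 2.0), ("b", 0.5, 2.0)])  # both weights normalize to 0.5
```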
### `Grade.gather(*items)`
Resolves subscores and grader coroutines in parallel, then combines them. Accepts a mix of `SubScore` objects and awaitables (e.g. `Grader.grade()`).
## Grader
`Grader` is the async base class for reusable scoring helpers. Subclasses implement `compute_score(...)` (async), and `grade(...)` packages the result as a `SubScore`.
`grade(...)` records JSON-safe copies of the grader parameters in subscore metadata under `_parameters`.
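A self-contained analog of the subclassing pattern — not the real hud classes, and the exact signatures are assumptions; it only mirrors the described flow (an async `compute_score()` plus a `grade()` wrapper that records parameters under `_parameters`).

```python
# Stand-in classes illustrating the Grader pattern described above.
import asyncio
from dataclasses import dataclass, field

@dataclass
class SubScore:
    name: str
    value: float
    weight: float = 1.0
    metadata: dict = field(default_factory=dict)

class Grader:
    name = "grader"

    async def compute_score(self) -> float:
        raise NotImplementedError

    async def grade(self) -> SubScore:
        # Package the computed score as a SubScore, recording the
        # grader's parameters in metadata under "_parameters".
        value = await self.compute_score()
        return SubScore(name=self.name, value=value,
                        metadata={"_parameters": dict(vars(self))})

class ContainsGrader(Grader):
    name = "contains"

    def __init__(self, answer: str, substring: str):
        self.answer = answer
        self.substring = substring

    async def compute_score(self) -> float:
        return 1.0 if self.substring.lower() in self.answer.lower() else 0.0

sub = asyncio.run(ContainsGrader("Paris is the capital", "paris").grade())
```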
### Combinators
`Grader.any(...)` and `Grader.all(...)` combine multiple subscores into a single summary subscore.
- `any(...)` uses the maximum input score
- `all(...)` uses the minimum input score
## BashGrader
Runs a shell command via `/bin/bash -lc` and scores by exit code. Fully async.
| Parameter | Type | Default | Description |
|---|---|---|---|
| `command` | `str` | required | Shell command to run |
| `cwd` | `str \| None` | `None` | Working directory |
| `timeout_seconds` | `int \| None` | `600` | Timeout in seconds |
- exit code `0` → `1.0`
- non-zero exit code → `0.0`
- timeout → `0.0` with timeout metadata

Metadata includes `stdout`, `stderr`, and `exit_code`.
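The scoring semantics can be reproduced with the stdlib — a stand-in for `BashGrader`, not its implementation; only the `/bin/bash -lc` invocation and the exit-code mapping come from this page.

```python
# Stdlib illustration of the documented exit-code -> score mapping.
import subprocess

def bash_score(command: str, timeout_seconds: float = 600) -> tuple[float, dict]:
    try:
        proc = subprocess.run(
            ["/bin/bash", "-lc", command],
            capture_output=True, text=True, timeout=timeout_seconds,
        )
    except subprocess.TimeoutExpired:
        return 0.0, {"timeout": True}       # timeout -> 0.0 with timeout metadata
    score = 1.0 if proc.returncode == 0 else 0.0
    return score, {"stdout": proc.stdout, "stderr": proc.stderr,
                   "exit_code": proc.returncode}

bash_score("true")   # exit code 0 -> score 1.0
bash_score("false")  # non-zero exit code -> score 0.0
```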
## LLMJudgeGrader
Grade an answer against rubric criteria using an LLM judge. Requires the `rubric` package (`pip install rubric`). Uses the HUD inference gateway by default.
| Parameter | Type | Default | Description |
|---|---|---|---|
| `answer` | `str` | `""` | The answer to evaluate |
| `criteria` | `list[str \| tuple[str, float]]` | `None` | Rubric criteria; tuples set custom weight |
| `question` | `str` | `""` | The original question/prompt for context |
| `model` | `str` | `"claude-haiku-4-5"` | LLM model for judging |
Each criterion may be a plain string (weight `1.0`) or a `(requirement, weight)` tuple.
Metadata includes per-criterion verdicts, reasons, and the model used.
## Answer Comparisons
These functions return `float` (`1.0` or `0.0`) for direct use as `SubScore.value`.
### `exact_match(answer, expected, *, normalize_text=True)`
`1.0` if `answer` matches `expected` after normalization, `0.0` otherwise.
### `contains(answer, substring, *, case_sensitive=False)`
`1.0` if `answer` contains `substring`, `0.0` otherwise.
### `contains_any(answer, substrings, *, case_sensitive=False)`
`1.0` if `answer` contains at least one of the substrings.
### `contains_all(answer, substrings, *, case_sensitive=False)`
`1.0` if `answer` contains all substrings.
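The described behavior, reimplemented for reference. The exact normalization applied by `normalize_text` is not specified on this page; lowercasing plus whitespace collapsing is an assumption.

```python
# Reference reimplementation of the comparison helpers' documented semantics.
def _norm(text: str) -> str:
    # Assumed normalization: lowercase and collapse whitespace.
    return " ".join(text.lower().split())

def exact_match(answer: str, expected: str, *, normalize_text: bool = True) -> float:
    if normalize_text:
        answer, expected = _norm(answer), _norm(expected)
    return 1.0 if answer == expected else 0.0

def contains(answer: str, substring: str, *, case_sensitive: bool = False) -> float:
    if not case_sensitive:
        answer, substring = answer.lower(), substring.lower()
    return 1.0 if substring in answer else 0.0

def contains_any(answer, substrings, *, case_sensitive=False) -> float:
    # At least one substring present -> max of per-substring scores.
    return max(contains(answer, s, case_sensitive=case_sensitive) for s in substrings)

def contains_all(answer, substrings, *, case_sensitive=False) -> float:
    # All substrings present -> min of per-substring scores.
    return min(contains(answer, s, case_sensitive=case_sensitive) for s in substrings)

exact_match("  Hello World ", "hello world")  # -> 1.0
```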