HUD Documentation - Evaluations and RL Environments.

The serializable shapes agents, tasks, and graders exchange.

```python
from hud import Grade, Run, Trace
from hud.types import Step
from hud.agents.types import AgentStep, Citation, SubagentStep, ToolStep
from hud.environment import Answer
```

`Run`

The live handle for one task: its lifecycle plus the agent’s `Trace`. Runs come back in `job.runs` from `task.run(agent)` or `taskset.run(agent)`; you can also construct one over a connected client to drive it manually.

| Member | Type | Description |
| --- | --- | --- |
| `run.prompt` | `str \| list \| None` | The task’s opening prompt as `tasks.start` returned it (text, or chat-style message list). |
| `run.prompt_messages` | `list[PromptMessage]` | The prompt as normalized user/assistant turns - what agents consume. |
| `run.prompt_text` | `str` | The prompt flattened to plain text, for string-only backends. |
| `run.trace` | `Trace` | The trajectory the agent fills. The answer is `run.trace.content`. |
| `run.grade` | `Grade` | Structured grade result. |
| `run.reward` | `float` | The graded reward (`grade.reward`, set on exit). |
| `run.evaluation` | `dict` | The raw grade payload (`grade.raw`). |
| `run.runtime` | `str \| None` | Control-channel URL the run executed against (the placement record). |
| `run.trace_id` | `str \| None` | Keys the trajectory for training. |
| `run.job_id` / `run.group_id` | `str \| None` | Batch + GRPO group, set by the runner. |

A rollout that fails before its session is live comes back as a synthesized failed run (no prompt, no runtime). A mid-run failure keeps the real run (prompt, runtime, partial trace), with the error recorded on `run.trace`.
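The two failure shapes described above can be told apart from the run's fields alone. The following is a minimal sketch, not SDK code: `RunView` and `TraceView` are stand-in dataclasses that mirror only a few documented members, `classify_failure` is a hypothetical helper, and it assumes a failed run reports `status == "error"` on its trace.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class TraceView:
    """Stand-in for Trace: only the field this sketch reads."""
    status: Optional[str] = None  # "completed" | "error" | "cancelled" | None


@dataclass
class RunView:
    """Stand-in for Run, mirroring documented members; not the SDK class."""
    trace: TraceView
    prompt: Optional[str] = None
    runtime: Optional[str] = None


def classify_failure(run: RunView) -> str:
    """Hypothetical helper: label a run by whether and when it failed."""
    # Assumption: a failed run carries status "error" on its trace.
    if run.trace.status != "error":
        return "not-failed"
    # A synthesized failed run has neither a prompt nor a runtime.
    if run.prompt is None and run.runtime is None:
        return "failed-before-session"
    # Otherwise the real run was kept, with whatever partial trace exists.
    return "failed-mid-run"
```

A caller might use such a label to decide whether a partial trace is worth inspecting, since only the mid-run case keeps one.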

`Grade`

Structured result from grading one run, parsed from the wire grade frame (`{"score": ..., "done": ..., "isError": ..., ...}`).

| Field | Type | Description |
| --- | --- | --- |
| `reward` | `float` | The frame’s `score`. |
| `done` | `bool` | Whether the task is complete. |
| `content` | `str \| None` | Human-readable grade content. |
| `info` | `dict` | Extra metadata. |
| `is_error` | `bool` | Whether grading failed. |
| `raw` | `dict` | The full original frame. |
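To make the frame-to-field mapping concrete, here is a minimal sketch of how a wire grade frame could map onto these fields. `GradeView` and `parse_grade_frame` are hypothetical stand-ins, not the SDK's parser. The keys `score`, `done`, and `isError` come from the frame shown above; the `content` and `info` key names and every default value are assumptions.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class GradeView:
    """Stand-in mirroring the documented Grade fields; not the SDK class."""
    reward: float
    done: bool
    content: Optional[str]
    info: dict
    is_error: bool
    raw: dict


def parse_grade_frame(frame: dict) -> GradeView:
    """Hypothetical parser: map a wire grade frame onto the Grade fields."""
    return GradeView(
        reward=float(frame.get("score", 0.0)),       # the frame's `score` becomes `reward`
        done=bool(frame.get("done", False)),
        content=frame.get("content"),                # assumed key name
        info=dict(frame.get("info") or {}),          # assumed key name
        is_error=bool(frame.get("isError", False)),  # camelCase on the wire
        raw=dict(frame),                             # keep a copy of the full frame
    )


grade = parse_grade_frame({"score": 1, "done": True, "isError": False})
print(grade.reward, grade.done, grade.is_error)
```

Keeping `raw` alongside the parsed fields means any key the parser does not know about is still recoverable from the original frame.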

`Trace`

The agent’s trajectory for one rollout - an ordered collection of Steps plus the run summary, and the unit of training data. Every recorded step also streams to the platform as one schema-tagged span.

| Field | Type | Description |
| --- | --- | --- |
| `steps` | `list[Step]` | The ordered trajectory. |
| `status` | `"completed" \| "error" \| "cancelled" \| None` | How the run ended (`trace.is_error` reads it). |
| `content` | `str \| None` | The final answer (graded). |
| `trace_id` | `str \| None` | Keys server-side trajectories. |

`hud.types.Step` is the shared skeleton (source, timing, error, plus the harness payloads: prompt messages and `task_call` lifecycle RPCs). The tool-agent family subclasses it in `hud.agents.types`, flat on the skeleton:

- `AgentStep` - the model’s turn: `content`, `reasoning`, `tool_calls`, `done`, plus `model`, `usage`, and token-level `sample` when the backend is trainable.
- `ToolStep` - one tool round-trip: the `MCPToolCall` paired with its `MCPToolResult`.
- `SubagentStep` - a nested rollout’s `Trace`, embedded whole.

Derived reads go through the trace’s two query shapes: `trace.final(get)` returns the newest non-`None` answer (`trace.error` is a view on it), and `trace.collect(get)` returns every answer in step order. Family vocabulary stays at the call site:

```python
samples = trace.collect(lambda s: s.sample if isinstance(s, AgentStep) else None)
citations = trace.final(lambda s: s.citations if isinstance(s, AgentStep) else None)
```
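The semantics of the two queries can be sketched with stand-in classes. This is an illustration under stated assumptions, not the SDK implementation: `StepView` and `TraceView` are hypothetical, and the sketch assumes `collect` skips steps for which the getter returns `None`, as the getters above suggest.

```python
from dataclasses import dataclass, field
from typing import Any, Callable, List, Optional


@dataclass
class StepView:
    """Stand-in step carrying one optional payload."""
    sample: Optional[str] = None


@dataclass
class TraceView:
    """Stand-in trace; mirrors only `steps` and the two query shapes."""
    steps: List[StepView] = field(default_factory=list)

    def collect(self, get: Callable[[StepView], Any]) -> list:
        """Every non-None answer, in step order (skipping None is an assumption)."""
        values = (get(step) for step in self.steps)
        return [value for value in values if value is not None]

    def final(self, get: Callable[[StepView], Any]) -> Any:
        """The newest non-None answer, or None if no step yields one."""
        for step in reversed(self.steps):
            value = get(step)
            if value is not None:
                return value
        return None


trace = TraceView([StepView("a"), StepView(None), StepView("b")])
print(trace.collect(lambda s: s.sample))
print(trace.final(lambda s: s.sample))
```

Because the getter is supplied by the caller, the trace itself needs no knowledge of any particular step family's fields.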

Answer & result types

`Answer[T]`

When a task declares `returns=T`, the answer arrives wrapped (`from hud.environment import Answer`). `content` is the answer parsed into `T`, or the original string when parsing failed, so grade it accordingly; `raw` is always the string as submitted.

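```python
# `env` is assumed to be an environment instance defined elsewhere.
@env.template(returns=int)
async def count(word: str = "strawberry"):
    answer = yield f"How many letters in '{word}'?"
    # answer.content is an int when parsing succeeded, else the original string.
    yield 1.0 if answer.content == len(word) else 0.0
```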
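The parse-or-fall-back behavior described for `Answer[T]` can be sketched without the SDK. `AnswerView`, `wrap_answer`, and `grade_count` below are hypothetical stand-ins, and treating `ValueError`/`TypeError` as the parse-failure signal is an assumption; the real wrapper may parse differently.

```python
from dataclasses import dataclass
from typing import Callable, Generic, TypeVar, Union

T = TypeVar("T")


@dataclass
class AnswerView(Generic[T]):
    """Stand-in for Answer[T]: parsed content plus the raw submission."""
    content: Union[T, str]
    raw: str


def wrap_answer(raw: str, parse: Callable[[str], T]) -> AnswerView[T]:
    """Hypothetical wrapper: parse into T, keep the string if parsing fails."""
    try:
        return AnswerView(content=parse(raw), raw=raw)
    except (ValueError, TypeError):
        # Parsing failed: content falls back to the original string.
        return AnswerView(content=raw, raw=raw)


def grade_count(answer: AnswerView[int], word: str) -> float:
    """Grade defensively: an unparsed (string) answer scores zero."""
    if not isinstance(answer.content, int):
        return 0.0
    return 1.0 if answer.content == len(word) else 0.0


print(grade_count(wrap_answer("10", int), "strawberry"))
print(grade_count(wrap_answer("ten", int), "strawberry"))
```

Checking the type of `content` before comparing keeps a failed parse from being graded as if it were a typed value.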

`Citation`

A normalized citation across providers (`hud.agents.types.Citation`) with the fields `type`, `text`, `source`, `title`, `start_index`, and `end_index`. It is a reply annotation, not a grading input: provider agents attach citations to `AgentStep.citations`, and chat surfaces read the final reply’s citations via the `trace.final(...)` query above. A task that wants to grade sources should declare them in its `returns=` schema so the agent submits them as part of the answer.

Grading shapes

SubScore and EvaluationResult live with the graders - see Graders.

Typed task I/O

Declare `input=` / `returns=` on `@env.template` to surface JSON schemas in the manifest and parse the agent’s answer into a typed `Answer[T]`. Any Pydantic model or standard type works.

See also

Tasks & Tasksets

Graders