The serializable shapes agents, tasks, and graders exchange.
from hud import Grade, Run, Trace
from hud.types import Step
from hud.agents.types import AgentStep, Citation, SubagentStep, ToolStep
from hud.environment import Answer

Run

The live handle for one task: the lifecycle plus the agent’s Trace. Runs arrive in job.runs from task.run(agent) / taskset.run(agent), or you can construct one over a connected client for manual driving (see Running a Task).
Member                     Type                 Description
run.prompt                 str | list | None    The task’s opening prompt as tasks.start returned it (text, or chat-style message list).
run.prompt_messages        list[PromptMessage]  The prompt as normalized user/assistant turns, which is what agents consume.
run.prompt_text            str                  The prompt flattened to plain text, for string-only backends.
run.trace                  Trace                The trajectory the agent fills. The answer is run.trace.content.
run.grade                  Grade                Structured grade result.
run.reward                 float                The graded reward (grade.reward, set on exit).
run.evaluation             dict                 The raw grade payload (grade.raw).
run.runtime                str | None           Control-channel URL the run executed against (the placement record).
run.trace_id               str | None           Keys the trajectory; satisfies Rewarded.
run.job_id / run.group_id  str | None           Batch + GRPO group, set by the runner.
A rollout that fails before its session is live comes back as a synthesized failed run (no prompt, no runtime); a mid-run failure keeps the real run — prompt, runtime, partial trace — with the error on run.trace.

Grade

Structured result from grading one run, parsed from the wire grade frame ({"score": ..., "done": ..., "isError": ..., ...}).
Field     Type        Description
reward    float       The frame’s score.
done      bool        Whether the task is complete.
content   str | None  Human-readable grade content.
info      dict        Extra metadata.
is_error  bool        Whether grading failed.
raw       dict        The full original frame.
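
The mapping from wire frame to fields can be sketched roughly as follows. This is an illustration only, not the SDK's parser: GradeSketch and parse_grade_frame are hypothetical names, the "score", "done", and "isError" keys come from the frame shown above, and the "content" and "info" keys plus every default value are assumptions.

```python
from __future__ import annotations

from dataclasses import dataclass, field
from typing import Any


@dataclass
class GradeSketch:
    """Hypothetical stand-in mirroring the Grade fields listed above."""

    reward: float
    done: bool
    content: str | None
    info: dict[str, Any]
    is_error: bool
    raw: dict[str, Any] = field(default_factory=dict)


def parse_grade_frame(frame: dict[str, Any]) -> GradeSketch:
    # "score", "done", and "isError" appear in the documented frame.
    # The "content" and "info" keys and all fallbacks are assumptions.
    return GradeSketch(
        reward=float(frame.get("score", 0.0)),
        done=bool(frame.get("done", False)),
        content=frame.get("content"),
        info=dict(frame.get("info") or {}),
        is_error=bool(frame.get("isError", False)),
        raw=dict(frame),  # keep the full original frame
    )


frame = {"score": 0.5, "done": True, "isError": False}
grade = parse_grade_frame(frame)
```

Keeping raw next to the parsed fields means a grader-specific key that the structured view does not model is still reachable.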

Trace

The agent’s trajectory for one rollout — an ordered collection of Steps plus the run summary, and the unit of training data. Every recorded step also streams to the platform as one schema-tagged span.
Field     Type                                        Description
steps     list[Step]                                  The ordered trajectory.
status    "completed" | "error" | "cancelled" | None  How the run ended (trace.is_error reads it).
content   str | None                                  The final answer (graded).
trace_id  str | None                                  Keys server-side trajectories.
hud.types.Step is the shared skeleton (source, timing, error, plus the harness payloads: prompt messages and task_call lifecycle RPCs). The tool-agent family subclasses it in hud.agents.types, flat on the skeleton:
  • AgentStep — the model’s turn: content, reasoning, tool_calls, done, plus model, usage, and token-level sample when the backend is trainable.
  • ToolStep — one tool round-trip: the MCPToolCall paired with its MCPToolResult.
  • SubagentStep — a nested rollout’s Trace, embedded whole.
Derived reads go through the trace’s two query shapes — trace.final(get) (newest non-None answer wins; trace.error is a view on it) and trace.collect(get) (every answer, in step order). Family vocabulary stays at the call site:
samples = trace.collect(lambda s: s.sample if isinstance(s, AgentStep) else None)
citations = trace.final(lambda s: s.citations if isinstance(s, AgentStep) else None)
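
To make the two query shapes concrete, here is a minimal sketch of the semantics described above, written against stand-in classes rather than the real Trace. StepSketch and TraceSketch are hypothetical, and treating None as "no answer" in collect is an assumption drawn from the getter pattern in the two lines above.

```python
from __future__ import annotations

from dataclasses import dataclass, field
from typing import Any, Callable, Optional


@dataclass
class StepSketch:
    """Hypothetical step carrying only the fields this sketch needs."""

    sample: Optional[str] = None
    citations: Optional[list[str]] = None


@dataclass
class TraceSketch:
    """Hypothetical trace: an ordered list of steps plus two queries."""

    steps: list[StepSketch] = field(default_factory=list)

    def collect(self, get: Callable[[StepSketch], Any]) -> list[Any]:
        # Every answer in step order; None means "this step has no answer".
        return [value for value in map(get, self.steps) if value is not None]

    def final(self, get: Callable[[StepSketch], Any]) -> Any:
        # Newest non-None answer wins: scan from the last step backwards.
        for step in reversed(self.steps):
            value = get(step)
            if value is not None:
                return value
        return None


trace = TraceSketch(
    steps=[
        StepSketch(sample="a", citations=["c1"]),
        StepSketch(),  # contributes nothing to either query
        StepSketch(sample="b"),
    ]
)
samples = trace.collect(lambda s: s.sample)
latest_citations = trace.final(lambda s: s.citations)
```

Here the last step has no citations, so final falls back to the first step's list, while collect returns both samples in order.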

Answer & result types

Answer[T]

When a task declares returns=T, the answer arrives wrapped (from hud.environment import Answer): content is the answer parsed into T (or the original string when parsing failed — grade it accordingly), raw is always the string as submitted.
@env.template(returns=int)
async def count(word: str = "strawberry"):
    answer = yield f"How many letters in '{word}'?"
    yield 1.0 if answer.content == len(word) else 0.0
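
Because content falls back to the original string when parsing fails, a grader may want to check the type before comparing. The sketch below illustrates that contract with stand-ins: AnswerSketch, wrap_answer, and grade_count are hypothetical names, and the exception types caught are an assumption about what a failed parse looks like.

```python
from __future__ import annotations

from dataclasses import dataclass
from typing import Any, Callable


@dataclass
class AnswerSketch:
    """Hypothetical stand-in for the wrapped answer described above."""

    content: Any  # parsed value, or the original string when parsing failed
    raw: str      # always the string as submitted


def wrap_answer(raw: str, parse: Callable[[str], Any]) -> AnswerSketch:
    # Try to parse; on failure keep the submitted string as content.
    try:
        return AnswerSketch(content=parse(raw), raw=raw)
    except (ValueError, TypeError):
        return AnswerSketch(content=raw, raw=raw)


def grade_count(answer: AnswerSketch, expected: int) -> float:
    # A failed parse leaves a str in content, so check the type first.
    if not isinstance(answer.content, int):
        return 0.0
    return 1.0 if answer.content == expected else 0.0


ok = wrap_answer("10", int)
bad = wrap_answer("ten", int)
ok_reward = grade_count(ok, len("strawberry"))
bad_reward = grade_count(bad, len("strawberry"))
```

The explicit isinstance check keeps an unparsed answer from being scored by accident when the expected value happens to compare equal to a string.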

Citation

A normalized citation across providers (hud.agents.types.Citation): type, text, source, title, start_index, end_index. It is a reply annotation, not a grading input: provider agents attach citations to AgentStep.citations, and chat surfaces read the final reply’s citations via the trace.final(...) query above. A task that wants to grade sources should declare them in its returns= schema so the agent submits them as part of the answer.

Grading shapes

SubScore and EvaluationResult live with the graders — see Graders.

Training types

from hud.eval import group_relative
  • Rewarded — the protocol training needs: anything with trace_id: str | None and reward: float (a Run qualifies).
  • group_relative(rewards, *, normalize_std=True) — GRPO advantages over one group.
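
As a rough illustration of these two pieces, the sketch below spells out the protocol shape and one common way to compute group-relative advantages. It is not the hud.eval implementation: RewardedSketch, RunStub, and group_relative_sketch are hypothetical names, and the use of population standard deviation and the eps guard are assumptions.

```python
from __future__ import annotations

import statistics
from dataclasses import dataclass
from typing import Optional, Protocol, Sequence


class RewardedSketch(Protocol):
    """Hypothetical mirror of the protocol: a trace id and a reward."""

    trace_id: Optional[str]
    reward: float


@dataclass
class RunStub:
    """Hypothetical object that satisfies the protocol structurally."""

    trace_id: Optional[str]
    reward: float


def group_relative_sketch(
    rewards: Sequence[float],
    *,
    normalize_std: bool = True,
    eps: float = 1e-6,  # assumption: guards against a zero-variance group
) -> list[float]:
    if not rewards:
        return []
    mean = statistics.fmean(rewards)
    centered = [r - mean for r in rewards]  # advantage relative to the group
    if not normalize_std:
        return centered
    std = statistics.pstdev(rewards)  # assumption: population std
    return [c / (std + eps) for c in centered]


group: list[RewardedSketch] = [RunStub("t1", 1.0), RunStub("t2", 0.0)]
rewards = [item.reward for item in group]
centered_only = group_relative_sketch(rewards, normalize_std=False)
normalized = group_relative_sketch(rewards)
flat_group = group_relative_sketch([1.0, 1.0, 1.0])
```

When every rollout in a group earns the same reward, the centered values are all zero, so that group contributes no learning signal under this scheme.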

Typed task I/O

Declare input= / returns= on @env.template to surface JSON schemas in the manifest and parse the agent’s answer into a typed Answer[T]. Any Pydantic model or standard type works.

See also

Tasks & Tasksets

Graders