Bring your own harness
Because an environment only exposes capabilities and never a fixed agent, any loop or framework plugs in as a harness. Wrapping one is a thin adapter, not protocol work: you get aRun, drive the environment off it, and fill
run.trace.content.
The Agent seam
SubclassAgent and implement __call__. Open the capabilities you need off run.client, do your
work, and write the answer to run.trace.content (graded on exit):
harness.py
run - so one
instance drives many concurrent rollouts.
The Run you drive
Therun is the one object you work with for the whole task. Three things you do with it:
Read the prompt - what the task is asking.
| Member | Description |
|---|---|
run.prompt_messages | The prompt as normalized user/assistant turns - what most agents consume. |
run.prompt_text | The same flattened to plain text, for string-only backends. |
run.client is the live connection to the served environment.
| Call | Description |
|---|---|
run.client.open(protocol) | Open a managed capability client (shell, browser, …) to act through. |
run.client.binding(protocol) | Get a capability’s raw wire address, to hand to an external SDK. |
run.trace is the Trace you fill.
| Call | Description |
|---|---|
run.record(step) | Append a step and stream it to the platform live (step types in Types). |
run.trace.content = ... | Set the final answer, graded when the run ends. |
Reusing HUD’s loop: ToolAgent
There are two base classes, depending on how much of HUD’s loop you want:
Agent(hud.agents.base) - the bare seam above. Best for wrapping an external framework or a fully custom loop.ToolAgent(hud.agents.tool_agent, also exported asMCPAgent) - HUD’s catalog-driven tool-call loop, the base every provider agent subclasses. Implement the provider hooks (get_response, message/result formatting) and it handles capability wiring, the step loop, and recording.
AgentStep (a model turn), ToolStep (a tool
round-trip), or SubagentStep (a nested rollout); see Types. ToolAgent does all of
this for you.
Wrap an existing framework: browser-use on cdp
The bundled BrowserUseAgent is exactly this adapter - browser-use driving the cdp (browser)
capability:
run.py
ssh, mcp,
rfb, robot).
Any OpenAI-compatible endpoint
OpenAIChatAgent speaks the OpenAI Chat Completions API, so vLLM servers, local models, and hosted
checkpoints all work - point base_url at the server:
run.py
Composing richer environments
These patterns build on Environments once the basics are in place.Multiple capabilities at once
An environment can expose several capabilities; the harness opens whichever it needs. A task that spans a shell and a browser declares both:env.py
Stateful environments and backing daemons
Use@env.initialize / @env.shutdown to manage anything the tasks need running - a database, a
seeded service, a fixture. The hooks run once around serving:
env.py
Scaling a taskset
Parameterize for a difficulty spread
One task definition should span a range. Parameterize the generator and create a concrete task per point:tasks.py
Structure a large taskset across files
Keep tasks in modules and collect them into aTaskset at the top:
tasks.py
hud eval tasks.py claude --full runs the whole set; hud sync tasks my-taskset publishes it. Give
each task a stable slug so it’s identifiable on the platform:
tasks.py
Group rollouts for variance
To measure variance (or feed training), run each task several times.group repeats share a GRPO
group:
run.py
Route tasks to different substrates
A runtime is called once per rollout with the task row, so a callable can place heavier rows on heavier substrates:run.py
Subagents as tools
An MCP tool is just a function. A subagent is just a function that runs an agent over a task and returns its answer. Put the two together and an orchestrating agent can call a specialist sub-agent as a single tool call - no special class, nothing HUD-specific beyond the rollout you already write.Write the subagent as a function
Calling an@env.template mints a task; running it drives a fresh rollout whose Job carries the
result. Wrap that in a function and return the agent’s answer:
subagents.py
issue_id: str becomes the one parameter, the docstring becomes the description.
Register it as an MCP tool
Use a baseline FastMCP server - type hints + docstring become the schema, no subclass required:subagents.py
Expose it as an mcp capability
An orchestrating environment declares an mcp capability pointing at that server, so any harness that
opens it sees investigate_issue as a callable tool:
env.py
tools.run(transport="http", host="127.0.0.1", port=8080); in a built image, start it from your
container entrypoint or an
@env.initialize hook.
How it looks to the orchestrator
The orchestrating agent opens themcp capability, sees one tool - investigate_issue(issue_id) -
calls it, and gets the specialist’s findings back as the tool result. From its side it’s a single tool
call; underneath, a whole sub-rollout ran. Each subagent rollout streams under its own trace, so you
can inspect the specialist’s work separately from the orchestrator’s. Because the tool is an ordinary
function, everything composes normally: add retries, fan out to several specialists, or swap the model
- all in plain Python.
Chat and multi-turn
Most tasks yield a single text prompt. A chat-style task yields a list of messages instead, so the agent works against a multi-turn conversation. TheChat runner drives that conversation turn by
turn and keeps the history for you.
Reach for chat when the interaction itself is the thing - assistants, tool-use dialogues, anything
where the agent needs prior turns. For evals and training, the default single-turn task
is what you want. Either way the grading model is the same: you still yield a reward.
A chat-style task
A task’s prompt can be plain text or a list ofPromptMessages. To accept a running conversation,
take a messages parameter and yield it as the prompt:
tasks.py
run.prompt becomes the message list, and agents consume it as normalized turns through
run.prompt_messages.
Driving it with Chat
Chat wraps a concrete Task plus an Agent. Each send() appends the user message, runs the
agent over a fresh run with the full history, appends the reply, and returns the Trace:
chat.py
Chat is imported from hud.eval (also re-exported as hud.Chat). The task’s messages argument is
replaced with the running conversation on every send; pass runtime= to place each turn’s rollout
(omit it and the task’s source serves locally when minted in-process, else HUD-hosted by env name).
Managing history
The conversation history is the publicchat.messages list - persist it, restore it, or reset it
directly:
| Operation | Description |
|---|---|
await chat.send(message) | Send a user turn; returns the reply Trace. |
chat.messages | The history ({"role", "content"} dicts) - json.dumps to persist, assign to restore, clear to reset. |
Serving a chat
Chat is protocol-agnostic: any frontend - a web handler, a notebook, a wire protocol - just calls
await chat.send(...). For example, behind FastAPI: