Skip to main content
The standard workflow - declare an environment, write tasks, run a built-in agent - covers most of what you need. This page collects the patterns for going further: plugging in your own agent, composing richer environments, scaling tasksets, delegating to subagents, and driving multi-turn chats. Each is independent; jump to what you need.

Bring your own harness

Because an environment only exposes capabilities and never a fixed agent, any loop or framework plugs in as a harness. Wrapping one is a thin adapter, not protocol work: you get a Run, drive the environment off it, and fill run.trace.content.

The Agent seam

Subclass Agent and implement __call__. Open the capabilities you need off run.client, do your work, and write the answer to run.trace.content (graded on exit):
harness.py
from hud.agents.base import Agent
from hud import Run

class MyHarness(Agent):
    async def __call__(self, run: Run) -> None:
        prompt = run.prompt_text          # or run.prompt_messages for structured turns
        # ... drive your framework against a capability ...
        run.record(...)                   # stream steps to the platform live (optional)
        run.trace.content = "the final answer"
That is the whole seam. An agent keeps no per-run state - everything comes from the run - so one instance drives many concurrent rollouts.

The Run you drive

The run is the one object you work with for the whole task. Three things you do with it: Read the prompt - what the task is asking.
MemberDescription
run.prompt_messagesThe prompt as normalized user/assistant turns - what most agents consume.
run.prompt_textThe same flattened to plain text, for string-only backends.
Drive the environment - run.client is the live connection to the served environment.
CallDescription
run.client.open(protocol)Open a managed capability client (shell, browser, …) to act through.
run.client.binding(protocol)Get a capability’s raw wire address, to hand to an external SDK.
Record the result - run.trace is the Trace you fill.
CallDescription
run.record(step)Append a step and stream it to the platform live (step types in Types).
run.trace.content = ...Set the final answer, graded when the run ends.

Reusing HUD’s loop: ToolAgent

There are two base classes, depending on how much of HUD’s loop you want:
  • Agent (hud.agents.base) - the bare seam above. Best for wrapping an external framework or a fully custom loop.
  • ToolAgent (hud.agents.tool_agent, also exported as MCPAgent) - HUD’s catalog-driven tool-call loop, the base every provider agent subclasses. Implement the provider hooks (get_response, message/result formatting) and it handles capability wiring, the step loop, and recording.
Record the step family that matches what happened - AgentStep (a model turn), ToolStep (a tool round-trip), or SubagentStep (a nested rollout); see Types. ToolAgent does all of this for you.

Wrap an existing framework: browser-use on cdp

The bundled BrowserUseAgent is exactly this adapter - browser-use driving the cdp (browser) capability:
run.py
from hud.agents.browser_use import BrowserUseAgent
from hud.agents.types import BrowserUseConfig

agent = BrowserUseAgent(BrowserUseConfig(model="claude-sonnet-4-5", max_steps=25))
job = await my_browser_task().run(agent)
Use it as a template for wrapping other frameworks over whichever capability they need (ssh, mcp, rfb, robot).

Any OpenAI-compatible endpoint

OpenAIChatAgent speaks the OpenAI Chat Completions API, so vLLM servers, local models, and hosted checkpoints all work - point base_url at the server:
run.py
from hud.agents import OpenAIChatAgent
from hud.agents.types import OpenAIChatConfig

agent = OpenAIChatAgent(OpenAIChatConfig(
    model="my-model",
    base_url="http://localhost:8000/v1",
    api_key="local",
))

Composing richer environments

These patterns build on Environments once the basics are in place.

Multiple capabilities at once

An environment can expose several capabilities; the harness opens whichever it needs. A task that spans a shell and a browser declares both:
env.py
from hud.capabilities import Capability
from hud.environment import Environment

env = Environment(
    name="full-stack",
    capabilities=[
        Capability.cdp(url="ws://127.0.0.1:9222"),    # cdp: a browser you run
    ],
)
env.workspace("/workspace")                           # ssh: shell + files, served by the env
The same environment serves a shell-only coding task and a browser-driving task - the difference is which capabilities the harness opens, not the environment.

Stateful environments and backing daemons

Use @env.initialize / @env.shutdown to manage anything the tasks need running - a database, a seeded service, a fixture. The hooks run once around serving:
env.py
import asyncpg

db: asyncpg.Connection | None = None

@env.initialize
async def _start():
    global db
    db = await asyncpg.connect("postgresql://localhost/app")

@env.shutdown
async def _stop():
    if db is not None:
        await db.close()
Keep environment state frozen across rollouts: every run of a task should see the same starting state, so reward differences reflect the agent, not a drifting environment.

Scaling a taskset

Parameterize for a difficulty spread

One task definition should span a range. Parameterize the generator and create a concrete task per point:
tasks.py
@env.template()
async def fix_bug(difficulty: int = 1):
    answer = yield f"Fix the level-{difficulty} bug in your workspace."
    result = await BashGrader.grade(weight=1.0, command="pytest -q")
    yield result.value

tasks = [fix_bug(difficulty=d) for d in range(1, 6)]
A controlled difficulty distribution is what makes a taskset trainable - see Designing tasks.

Structure a large taskset across files

Keep tasks in modules and collect them into a Taskset at the top:
tasks.py
from hud.eval import Taskset
from coding_tasks import fix_bug, add_feature
from review_tasks import review_pr

taskset = Taskset("engineering-work", [
    *(fix_bug(difficulty=d) for d in range(1, 6)),
    add_feature(spec="health endpoint"),
    review_pr(pr_id=1421),
])
hud eval tasks.py claude --full runs the whole set; hud sync tasks my-taskset publishes it. Give each task a stable slug so it’s identifiable on the platform:
tasks.py
v = fix_bug(difficulty=3)
v.slug = "fix-bug-3"

Group rollouts for variance

To measure variance (or feed training), run each task several times. group repeats share a GRPO group:
run.py
taskset = Taskset("bugs", [fix_bug(difficulty=d) for d in range(1, 6)])
job = await taskset.run(agent, group=8, max_concurrent=10)
rewards = [run.reward for run in job.runs]

Route tasks to different substrates

A runtime is called once per rollout with the task row, so a callable can place heavier rows on heavier substrates:
run.py
def placer(task):
    gpus = 4 if task.args.get("big_model") else 1
    return DockerRuntime(f"hud/{task.env}", run_args=["--gpus", str(gpus)])(task)

await taskset.run(agent, runtime=placer)

Subagents as tools

An MCP tool is just a function. A subagent is just a function that runs an agent over a task and returns its answer. Put the two together and an orchestrating agent can call a specialist sub-agent as a single tool call - no special class, nothing HUD-specific beyond the rollout you already write.

Write the subagent as a function

Calling an @env.template mints a task; running it drives a fresh rollout whose Job carries the result. Wrap that in a function and return the agent’s answer:
subagents.py
from hud.agents import create_agent
from tasks import investigate   # an @env.template you defined

_specialist = create_agent("claude-haiku-4-5")   # one stateless instance drives every call

async def investigate_issue(issue_id: str) -> str:
    """Investigate an issue and return the root-cause findings."""
    job = await investigate(issue_id=issue_id).run(_specialist)
    return job.runs[0].trace.content or ""
The function’s signature and docstring are all an MCP server needs to build the tool schema: issue_id: str becomes the one parameter, the docstring becomes the description.

Register it as an MCP tool

Use a baseline FastMCP server - type hints + docstring become the schema, no subclass required:
subagents.py
from fastmcp import FastMCP

tools = FastMCP(name="specialists")
tools.tool(investigate_issue)        # or write @tools.tool above the function

Expose it as an mcp capability

An orchestrating environment declares an mcp capability pointing at that server, so any harness that opens it sees investigate_issue as a callable tool:
env.py
from hud.environment import Environment
from hud.capabilities import Capability

env = Environment(
    name="orchestrator",
    capabilities=[Capability.mcp(name="specialists", url="http://127.0.0.1:8080/mcp")],
)
Run the FastMCP server alongside the environment so the URL is live - for local iteration, tools.run(transport="http", host="127.0.0.1", port=8080); in a built image, start it from your container entrypoint or an @env.initialize hook.

How it looks to the orchestrator

The orchestrating agent opens the mcp capability, sees one tool - investigate_issue(issue_id) - calls it, and gets the specialist’s findings back as the tool result. From its side it’s a single tool call; underneath, a whole sub-rollout ran. Each subagent rollout streams under its own trace, so you can inspect the specialist’s work separately from the orchestrator’s. Because the tool is an ordinary function, everything composes normally: add retries, fan out to several specialists, or swap the model
  • all in plain Python.

Chat and multi-turn

Most tasks yield a single text prompt. A chat-style task yields a list of messages instead, so the agent works against a multi-turn conversation. The Chat runner drives that conversation turn by turn and keeps the history for you. Reach for chat when the interaction itself is the thing - assistants, tool-use dialogues, anything where the agent needs prior turns. For evals and training, the default single-turn task is what you want. Either way the grading model is the same: you still yield a reward.

A chat-style task

A task’s prompt can be plain text or a list of PromptMessages. To accept a running conversation, take a messages parameter and yield it as the prompt:
tasks.py
from hud import Environment
from mcp.types import PromptMessage

env = Environment(name="assistant")

@env.template()
async def assistant(messages: list[PromptMessage]):
    answer = yield messages          # the conversation so far is the prompt
    yield 1.0 if answer else 0.0     # grade the final turn however you like
run.prompt becomes the message list, and agents consume it as normalized turns through run.prompt_messages.

Driving it with Chat

Chat wraps a concrete Task plus an Agent. Each send() appends the user message, runs the agent over a fresh run with the full history, appends the reply, and returns the Trace:
chat.py
import asyncio
from hud import Chat
from hud.agents import create_agent
from tasks import assistant

async def main():
    chat = Chat(assistant(messages=[]), create_agent("claude-sonnet-4-5"))
    r1 = await chat.send("Book me a flight")
    r2 = await chat.send("SFO to JFK")
    print(r2.content)            # the assistant's latest reply

asyncio.run(main())
Chat is imported from hud.eval (also re-exported as hud.Chat). The task’s messages argument is replaced with the running conversation on every send; pass runtime= to place each turn’s rollout (omit it and the task’s source serves locally when minted in-process, else HUD-hosted by env name).

Managing history

The conversation history is the public chat.messages list - persist it, restore it, or reset it directly:
OperationDescription
await chat.send(message)Send a user turn; returns the reply Trace.
chat.messagesThe history ({"role", "content"} dicts) - json.dumps to persist, assign to restore, clear to reset.

Serving a chat

Chat is protocol-agnostic: any frontend - a web handler, a notebook, a wire protocol - just calls await chat.send(...). For example, behind FastAPI:
app = FastAPI()
chat = Chat(assistant(messages=[]), create_agent("claude-sonnet-4-5"))

@app.post("/api/chat")
async def chat_endpoint(message: str):
    result = await chat.send(message)
    return {"response": result.content}
For a complete A2A endpoint (sessions per context, agent card, citations transport), see the runnable A2A chat cookbook - the protocol adapter is deliberately not part of the SDK.

See also

Agents

Capabilities

Run & deploy

Harbor interop