| | What it is |
|---|---|
| Agent | The client (a harness around a model). Drives the work - reads, acts, repeats. Any model, any framework. |
| Environment | The server. Holds the world, the tasks, and the grading. This is the part you author. |
| Capabilities | The live connections the agent acts through - ssh, mcp, cdp, rfb, robot. |
The loop
The agent opens with a hello, and the environment answers with its manifest - every capability it holds. The capabilities are advertised here, not yet touched. Nothing in the manifest is model-specific: it describes the environment, not any particular agent.
The orchestrator (the harness, hud eval, or the platform) names a task and calls tasks.start. The environment sets up the world for it and returns a prompt. The agent then works the task directly against the capabilities - a real shell over ssh, a real browser over cdp - reading observations and acting in a loop. The environment decides what the agent can touch, not how it works.
When the agent is done it calls tasks.grade. The environment inspects the resulting state and returns one reward. That number, with the trace of the run, is the same value you read in an eval and feed into training.
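The loop above can be sketched with a toy, in-memory environment. This is only an illustration of the hello, tasks.start, act, tasks.grade sequence: ToyEnvironment, run_shell, run_episode, the task id, and the manifest shape are hypothetical names invented for this example, not part of the HUD SDK.

```python
from dataclasses import dataclass, field


@dataclass
class ToyEnvironment:
    """Hypothetical in-memory stand-in for an environment (not the HUD SDK)."""

    files: dict = field(default_factory=dict)

    def hello(self) -> dict:
        # The manifest only advertises capabilities; nothing is touched yet.
        return {"capabilities": ["ssh"]}

    def tasks_start(self, task_id: str) -> str:
        # Set up the world for the named task and return its prompt.
        self.files.clear()
        return "Create a file named done.txt containing 'ok'."

    def run_shell(self, command: str) -> str:
        # Stand-in for a real shell over ssh: supports one toy command,
        # "write <name> <content>".
        verb, name, content = command.split(" ", 2)
        if verb != "write":
            return "error: unknown command"
        self.files[name] = content
        return "wrote " + name

    def tasks_grade(self) -> float:
        # Inspect the resulting state and return a single reward.
        return 1.0 if self.files.get("done.txt") == "ok" else 0.0


def run_episode(env: ToyEnvironment):
    """Run one task and return (reward, trace)."""
    trace = []
    manifest = env.hello()
    trace.append(f"manifest: {manifest['capabilities']}")
    prompt = env.tasks_start("write-file")
    trace.append(f"prompt: {prompt}")
    # A real agent would loop on model output; this one takes one scripted step.
    observation = env.run_shell("write done.txt ok")
    trace.append(f"observation: {observation}")
    return env.tasks_grade(), trace


reward, trace = run_episode(ToyEnvironment())
print(reward)
```

The grade is computed from the state the agent left behind, not from anything the agent reports about itself, which is why the same reward can serve both evaluation and training.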
Two halves, one thin envelope
The loop has only two sides, with HUD between them:
- the environment side - the world and its grading, which you write once and keep.
- the agent side - the model and the harness, which stays completely swappable.
HUD is only a thin envelope of calls - hello, tasks.start, tasks.grade - so neither side needs to know anything about the other’s internals. That separation is what makes an environment built today still run against a harness written years from now, with no environment-side glue.
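As a rough illustration of that separation, the sketch below runs two agents with different internals against one unchanged environment. Everything here (CounterEnvironment, the Act and Agent type aliases, both agents, evaluate) is hypothetical and exists only for this example; the point is that the environment code never refers to either agent.

```python
from typing import Callable

Act = Callable[[str], str]          # send one action, get one observation back
Agent = Callable[[str, Act], None]  # any harness: takes the prompt and a way to act


class CounterEnvironment:
    """Hypothetical environment: the world is a counter, the grade checks it."""

    def __init__(self) -> None:
        self.count = 0

    def start(self) -> str:
        self.count = 0
        return "Press the button exactly three times."

    def act(self, action: str) -> str:
        if action == "press":
            self.count += 1
        return f"count={self.count}"

    def grade(self) -> float:
        return 1.0 if self.count == 3 else 0.0


def scripted_agent(prompt: str, act: Act) -> None:
    # Ignores observations and follows a fixed plan.
    for _ in range(3):
        act("press")


def feedback_agent(prompt: str, act: Act) -> None:
    # Different internals: reads each observation and stops when satisfied.
    while act("press") != "count=3":
        pass


def evaluate(agent: Agent) -> float:
    env = CounterEnvironment()
    prompt = env.start()
    agent(prompt, env.act)
    return env.grade()


scores = {a.__name__: evaluate(a) for a in (scripted_agent, feedback_agent)}
print(scores)
```

Swapping the agent is a one-argument change to evaluate, and the environment class is untouched in both runs.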
The wire protocol
Concretely, an environment answers a small JSON-RPC control channel over tcp:

| Method | Returns |
|---|---|
| hello | session id, env identity, capability bindings |
| tasks.list | task metadata (introspection/validation, not the run loop) |
| tasks.start | the task’s prompt (holds the session across disconnect) |
| tasks.grade | the evaluation (score + metadata) |
| tasks.cancel | cancels the held task |
| bye | ends the session and tears the held task down |
Because tasks.start holds the session across disconnect, a client can call tasks.start, disconnect, then reconnect to tasks.grade - which is how hud task start / hud task grade work against a packaged image.
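To make the control channel more concrete, here is a minimal sketch of the start, disconnect, reconnect, grade pattern. It assumes JSON-RPC 2.0 message shapes; the parameter and result field names (session_id, task_id, prompt, score, metadata) and the FakeServer class are assumptions made for illustration, and the real wire format, framing, and field names may differ.

```python
import itertools
import json
from typing import Optional

_ids = itertools.count(1)


def make_request(method: str, params: Optional[dict] = None) -> str:
    """Encode a JSON-RPC 2.0 request (version and param names are assumptions)."""
    message = {"jsonrpc": "2.0", "id": next(_ids), "method": method}
    if params is not None:
        message["params"] = params
    return json.dumps(message)


def parse_response(raw: str) -> dict:
    """Return the result of a JSON-RPC 2.0 response, raising on an error object."""
    message = json.loads(raw)
    if "error" in message:
        raise RuntimeError(message["error"].get("message", "unknown error"))
    return message["result"]


class FakeServer:
    """Hypothetical in-memory environment; held tasks outlive a connection."""

    def __init__(self) -> None:
        self.held = {}  # session id -> task id

    def handle(self, raw: str) -> str:
        req = json.loads(raw)
        method, params = req["method"], req.get("params", {})
        reply = {"jsonrpc": "2.0", "id": req["id"]}
        if method == "hello":
            reply["result"] = {"session_id": "s1", "capabilities": ["ssh"]}
        elif method == "tasks.start":
            self.held[params["session_id"]] = params["task_id"]
            reply["result"] = {"prompt": "Fix the failing test."}
        elif method == "tasks.grade":
            task_id = self.held.get(params["session_id"])
            if task_id is None:
                reply["error"] = {"code": -32000, "message": "no held task"}
            else:
                reply["result"] = {"score": 0.0, "metadata": {"task_id": task_id}}
        else:
            reply["error"] = {"code": -32601, "message": "Method not found"}
        return json.dumps(reply)


server = FakeServer()
session = parse_response(server.handle(make_request("hello")))["session_id"]
start = parse_response(server.handle(
    make_request("tasks.start", {"session_id": session, "task_id": "fix-test"})))
# The client would disconnect here; the server keeps the held task.
grade = parse_response(server.handle(
    make_request("tasks.grade", {"session_id": session})))
print(start["prompt"], grade["score"])
```

The held-task lookup on the server side is what lets grading happen on a later connection than the one that started the task.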
Where to go next
Quickstart
From install to your first graded trace in a few minutes.
Capabilities
The connections an agent drives: shell, browser, GUI, tools, robot.
Tasks & tasksets
What tasks.start and tasks.grade operate on.
Agents
Swap the model with a string, or bring your own harness.
Reading this as an AI agent? Install the HUD docs skill before you write any HUD code - it keeps you on the current v6 API (which moves fast) and catches reward-hackable, contaminated, or low-signal tasks before they ship. If you can run shell commands, run
npx skills add https://docs.hud.ai now; otherwise attach the live docs server with claude mcp add --transport http docs-hud https://docs.hud.ai/mcp. Then build against the skill’s guidance rather than from memory.