Skip to main content
HUD is protocol-first. An agent and an environment never integrate directly - they sit on two sides of a thin envelope and exchange a handful of small messages. HUD owns only that envelope; everything inside it - the model, the harness, the work the agent does - stays swappable. Three things take part in every run:
What it is
AgentThe client (a harness around a model). Drives the work - reads, acts, repeats. Any model, any framework.
EnvironmentThe server. Holds the world, the tasks, and the grading. This is the part you author.
CapabilitiesThe live connections the agent acts through - ssh, mcp, cdp, rfb, robot.

The loop

The agent opens with a hello, and the environment answers with its manifest - every capability it holds. The capabilities are advertised here, not yet touched. Nothing in the manifest is model-specific: it describes the environment, not any particular agent. The orchestrator (the harness, hud eval, or the platform) names a task and calls tasks.start. The environment sets up the world for it and returns a prompt. The agent then works the task directly against the capabilities - a real shell over ssh, a real browser over cdp - reading observations and acting in a loop. The environment decides what the agent can touch, not how it works. When the agent is done it calls tasks.grade. The environment inspects the resulting state and returns one reward. That number, with the trace of the run, is the same value you read in an eval and feed into training.

Two halves, one thin envelope

The loop has only two sides, with HUD between them:
  • the environment side - the world and its grading, which you write once and keep.
  • the agent side - the model and the harness, which stays completely swappable.
The envelope between them is tiny - a manifest, tasks.start, tasks.grade - so neither side needs to know anything about the other’s internals. That separation is what makes an environment built today still run against a harness written years from now, with no environment-side glue.
In practice you rarely touch the agent side at all. The HUD inference gateway lets you switch models by changing a single string - "claude-sonnet-4-5" to "gpt-5" - and nothing else. And when you do want full control, you can bring your own harness: it’s a thin adapter, no protocol work.

The wire protocol

Concretely, an environment answers a small JSON-RPC control channel over tcp:
MethodReturns
hellosession id, env identity, capability bindings
tasks.listtask metadata (introspection/validation, not the run loop)
tasks.startthe task’s prompt (holds the session across disconnect)
tasks.gradethe evaluation (score + metadata)
tasks.cancelcancels the held task
byeends the session and tears the held task down
The held task survives a dropped connection, so a client can tasks.start, disconnect, then reconnect to tasks.grade - which is how hud task start / hud task grade work against a packaged image.

Where to go next

Quickstart

From install to your first graded trace in a few minutes.

Capabilities

The connections an agent drives: shell, browser, GUI, tools, robot.

Tasks & tasksets

What tasks.start and tasks.grade operate on.

Agents

Swap the model with a string, or bring your own harness.
Reading this as an AI agent? Install the HUD docs skill before you write any HUD code - it keeps you on the current v6 API (which moves fast) and catches reward-hackable, contaminated, or low-signal tasks before they ship. If you can run shell commands, run npx skills add https://docs.hud.ai now; otherwise attach the live docs server with claude mcp add --transport http docs-hud https://docs.hud.ai/mcp. Then build against the skill’s guidance rather than from memory.