HUD Documentation - Evaluations and RL Environments.

v6 is a leaner spec. The environment is no longer an MCP server that hands tools to the agent - it’s a small control channel that exposes capabilities (connections the agent drives itself) and tasks (prompt then reward). The agent’s harness owns the tools, so the environment side gets noticeably smaller.

What stays compatible

Environments are mostly backwards compatible. The v6 SDK still runs environments written against the v5 surface: @env.scenario, @env.tool / env.add_tool, env("scenario"), and env.run(...) all keep working - each emits a DeprecationWarning and adapts to v6 under the hood. New (v6) agents can evaluate your existing environments unchanged.

The break is on the agent side. v6 serves a new control channel instead of MCP stdio/http, so old (v5) agents cannot run old or new environments - once an environment is served by the v6 SDK (whether authored in the v5 or v6 style), only a v6 client can drive it. Upgrade the side that runs agents to v6.

So you can upgrade the SDK first and keep your environments as-is, then convert at your own pace. Converting is worth it: the v6 spec removes most of the tool-wiring boilerplate.

At a glance

| v5 | v6 | Notes |
| --- | --- | --- |
| `Environment("name")` | `Environment(name="name", capabilities=[...])` | positional name still works; declare capabilities up front |
| `@env.scenario("count")` | `@env.template()` | same `yield prompt` then `yield reward` generator |
| `@env.tool` / `env.add_tool(ComputerTool())` | a capability (`ssh` / `mcp` / `cdp` / `rfb` / `robot`) | the agent’s harness brings the tools now |
| `env("count", word=...)` | `count(word=...)` | keep the `@env.template` return value; calling it builds a `Task` |
| `task.run("claude")` / `hud.eval(task)` | `await task.run(agent)` | or just `hud eval tasks.py claude` |
| `env.run(transport=...)` | `await env.serve()` / `hud serve` / `hud deploy` | v6 serves a control channel, not MCP |
| `.slug`, `.columns` on a task | `.slug`, `.columns` on the `Task` | unchanged |

The CLI you already use is stable: hud init, hud deploy, hud eval, and hud sync tasks all carry over. hud dev is now hud serve (the old name remains as a deprecated alias).

Walk through a conversion

Here’s a small v5 coding environment - a couple of tools and one scenario:

env.py (v5)

from hud import Environment
from hud.tools import BashTool, EditTool
from hud.native import BashGrader

env = Environment("coder")
env.add_tool(BashTool())
env.add_tool(EditTool())

@env.scenario("fix-tests")
async def fix_tests(target: str = "tests/"):
    answer = yield f"Make the tests in {target} pass."
    yield await BashGrader.grade(command=f"pytest {target} -q")

Replace tools with a capability

This is the biggest change. In v5 you registered tools and the environment forwarded them, translating per provider. In v6 you declare a capability - a connection - and the agent’s harness attaches its own tools to it. Shell and file tools become a Workspace: the environment starts the sandboxed workspace and publishes its ssh capability when it serves:

env.py (v6)

from hud.environment import Environment

env = Environment(name="coder")
env.workspace("/workspace")

Other tool kinds map the same way: a browser becomes cdp, full computer-use becomes rfb, a robot becomes robot, and any custom MCP tools become an mcp capability via Capability.mcp(name=..., url=...). You no longer hand-wire ComputerTool() / BashTool() or call env.as_claude_tools() - the harness does that.
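When planning a conversion, it can help to list which capability each registered v5 tool implies. The following is a minimal sketch and not part of the HUD SDK: `IMPLIED_CAPABILITY` and `implied_capabilities` are hypothetical names, and the mapping only restates the tool-to-capability pairs this guide names. Tools without an entry are returned separately so you can decide whether to serve them over mcp or drop them.

```python
# Hypothetical planning helper; nothing here imports or calls the HUD SDK.
# The pairs below restate the mapping described in this guide.
IMPLIED_CAPABILITY = {
    "BashTool": "ssh",
    "EditTool": "ssh",
    "ShellTool": "ssh",
    "ApplyPatchTool": "ssh",
    "HudComputerTool": "rfb",
    "AnthropicComputerTool": "rfb",
    "OpenAIComputerTool": "rfb",
    "GeminiComputerTool": "rfb",
    "QwenComputerTool": "rfb",
    "PlaywrightTool": "cdp",
}


def implied_capabilities(tool_names):
    """Return (sorted capabilities, tool names with no mapping listed)."""
    caps = set()
    unknown = []
    for name in tool_names:
        cap = IMPLIED_CAPABILITY.get(name)
        if cap is None:
            unknown.append(name)  # e.g. a custom tool that may belong behind mcp
        else:
            caps.add(cap)
    return sorted(caps), unknown


# The v5 example environment registers BashTool and EditTool;
# "MyCustomTool" is an invented name to show the fallback path.
caps, unknown = implied_capabilities(["BashTool", "EditTool", "MyCustomTool"])
print(caps, unknown)
```

For the coder environment above, both registered tools collapse into a single ssh capability, which is why the v6 version needs only the one env.workspace(...) call.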

Rename @env.scenario to @env.template

The generator body is identical - yield a prompt, receive the answer, yield a reward. Just swap the decorator and keep a reference to the decorated function; calling it is what builds a Task:

env.py (v6)

from hud.graders import BashGrader

@env.template()
async def fix_tests(target: str = "tests/"):
    answer = yield f"Make the tests in {target} pass."
    yield await BashGrader.grade(command=f"pytest {target} -q")

@env.template() also accepts id=, description=, and optional input= / returns= types (surfaced as JSON schemas in the manifest). The decorated function is a template that mints Task rows when called. The v5 scenario options (chat, returns, exclude_tools, …) still parse through the compatibility layer if you keep @env.scenario.
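The yield-prompt, receive-answer, yield-reward shape is an ordinary Python async generator, so it can be exercised without the SDK. The sketch below is an assumption about the mechanics, not HUD's actual runner: `drive` is a hypothetical helper, and the reward line is a stand-in for BashGrader so the example runs on its own.

```python
import asyncio


async def fix_tests(target="tests/"):
    # First yield hands the prompt out and receives the agent's answer back.
    answer = yield f"Make the tests in {target} pass."
    # Stand-in for a real grader: reward 1.0 if the answer mentions "done".
    yield 1.0 if "done" in answer else 0.0


async def drive(gen, answer):
    """Hypothetical driver: read the prompt, send an answer, read the reward."""
    prompt = await gen.__anext__()    # runs the body up to the first yield
    reward = await gen.asend(answer)  # resumes with the answer, runs to the second yield
    await gen.aclose()                # finish the generator cleanly
    return prompt, reward


prompt, reward = asyncio.run(drive(fix_tests("tests/unit"), "done"))
print(prompt, reward)
```

Because the body is the same in v5 and v6, a driver like this is a cheap way to check that a converted template still yields the prompt and reward you expect before serving it.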

Build tasks by calling the template

env("fix-tests", target="tests/") becomes a direct call on the template function. It returns a Task - the runnable unit - and .slug / .columns work exactly as before:

tasks.py (v6)

from env import fix_tests

easy = fix_tests(target="tests/unit")
easy.slug = "fix-unit-tests"
easy.columns = {"suite": "unit"}
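Because a template is an ordinary callable, several tasks can be built in a loop. This is a sketch under stated assumptions: `Task` and `fix_tests` below are stand-ins defined only so the example runs without the SDK (in a real tasks.py you would import fix_tests from env.py instead), and this guide does not confirm how hud eval discovers tasks held in a list.

```python
from dataclasses import dataclass, field


@dataclass
class Task:
    """Stand-in for the SDK's Task; only models args, slug, and columns."""
    args: dict
    slug: str = ""
    columns: dict = field(default_factory=dict)


def fix_tests(target="tests/"):
    """Stand-in for the @env.template function defined in env.py."""
    return Task(args={"target": target})


# One task per test suite, each with its own slug and columns.
tasks = []
for suite in ["unit", "integration"]:
    task = fix_tests(target=f"tests/{suite}")
    task.slug = f"fix-{suite}-tests"
    task.columns = {"suite": suite}
    tasks.append(task)

print([t.slug for t in tasks])
```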

Run it

Locally, hud eval is unchanged:

hud eval tasks.py claude

Programmatically, the hud.eval(task) context manager and task.run(model) are replaced by handing an agent to the task - it returns a Job holding the graded runs:

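import asyncio

from hud.agents import create_agent
from env import fix_tests  # the template defined in env.py

async def main():
    agent = create_agent("claude-sonnet-4-5")
    job = await fix_tests(target="tests/").run(agent)  # returns a Job
    print(job.reward)

asyncio.run(main())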

create_agent routes any model (claude-..., gpt-..., gemini-..., grok-...) through the HUD gateway and wires the tools for whichever capabilities the environment exposes.

Serve and deploy

v5 served an MCP server via env.run(transport=...). v6 serves its control channel - use hud serve while iterating and hud deploy to publish (it builds and publishes in one step). await env.serve(host, port) is the in-code equivalent.

Converting with an agent

The conversion is mechanical, so the fastest path is to let your coding agent do it. Add the HUD docs to your agent - they’re available as an MCP server at docs.hud.ai/mcp, or use the Copy / Claude / ChatGPT buttons at the top of any docs page - then point it at this guide and the Environment reference and ask it to adapt your env.py. A prompt like:

Convert this v5 HUD environment to v6 using the migration guide at docs.hud.ai. Rename scenarios to templates, replace registered tools with the capability they imply (shell/files → ssh, browser → cdp, computer-use → rfb, custom tools → mcp), switch env("name", ...) to calling the template, and fix the hud.tools imports below.

Because nearly every old import still resolves (the SDK ships shims; hud.shared.* is the listed exception) and registered tools are auto-promoted to capabilities at serve time, your environment keeps running throughout - convert incrementally and let the DeprecationWarnings tell you what’s left.
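One way to see what is left is to record the DeprecationWarnings raised while your environment module loads. The sketch below uses only the standard library warnings module; `collect_deprecations` and `load_legacy_env` are hypothetical names, and the warning text is invented for illustration, since the SDK's actual messages are not quoted in this guide.

```python
import warnings


def collect_deprecations(fn):
    """Call fn() and return the messages of any DeprecationWarning it raised."""
    with warnings.catch_warnings(record=True) as caught:
        warnings.simplefilter("always")  # record every warning, including repeats
        fn()
    return [
        str(w.message)
        for w in caught
        if issubclass(w.category, DeprecationWarning)
    ]


def load_legacy_env():
    # Stand-in for importing a v5-style env.py; the message text is invented.
    warnings.warn("@env.scenario is deprecated", DeprecationWarning, stacklevel=2)


remaining = collect_deprecations(load_legacy_env)
print(remaining)
```

Python hides most DeprecationWarnings by default unless they are triggered from the main script, so recording them explicitly, or running with `python -W error::DeprecationWarning` to turn the first one into an exception, avoids mistaking a filtered log for a finished conversion.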

Imports to update

In v6, hud.tools was removed entirely - tools are capabilities now - but every old import still resolves with a DeprecationWarning:

| v5 import | What it resolves to now | What to do |
| --- | --- | --- |
| `AgentTool`, `BaseTool` | removed - resolve to a no-op stand-in | drop the class; expose a sub-agent as a plain function on a FastMCP server and attach it as an `mcp` capability - see Subagents as tools |
| Result types: `EvaluationResult`, `ScenarioResult`, `SubScore`, `AgentAnswer`, `Citation` | redirected to their v6 homes: `hud.graders` (`ScenarioResult` is now `EvaluationResult`), `hud.environment` (`AgentAnswer` is now `Answer`, without `citations`), `hud.agents.types` | change the import to the module the warning names |
| `ContentResult` | supported in v6 at `hud.agents.types` | `from hud.agents.types import ContentResult` - `.to_content_blocks()` builds a tool’s `list[ContentBlock]` from text + an optional image |
| `ToolError` | removed (no v6 counterpart) | return an error result (`ContentResult(error=...).to_content_blocks()`) or raise an ordinary exception - the loop surfaces it to the agent and continues |
| `hud.server.MCPServer` | removed - now plain `fastmcp.FastMCP` (a deprecation shim keeps the old import working and warns) | `from fastmcp import FastMCP` (same `@server.tool` / `run_async`); manage its lifecycle with `@env.initialize` / `@env.shutdown` |
| Shell/edit tools: `BashTool`, `EditTool`, `ShellTool`, `ApplyPatchTool`, … | removed - resolve to a marker that synthesizes an `ssh` capability at serve | call `env.workspace(root)` instead |
| Computer tools: `HudComputerTool`, `AnthropicComputerTool`, `OpenAIComputerTool`, `GeminiComputerTool`, `QwenComputerTool`, … | removed - resolve to a marker that synthesizes an `rfb` capability at serve | declare an `rfb` (computer-use) or `cdp` (browser) capability instead |
| Anything else under `hud.tools`: `PlaywrightTool`, `JupyterTool`, `MemoryTool`, filesystem tools, executors, `SubmitTool`, `BaseHub` | no-op stand-in (silently does nothing) | remove it - declare a capability (`cdp` for browser) or serve your own tool over `mcp` |
| Graders: `hud.native` (`BashGrader`, `LLMJudgeGrader`, `exact_match`, …) | aliased to `hud.graders` | change the import to `from hud.graders import ...` |
| Chat: `hud.services.Chat` | aliased to `hud.eval.chat` (re-exported as `hud.Chat`) | change the import to `from hud import Chat` |
| `hud.services.ChatService` | removed - the A2A executor left the SDK | copy the reference server in `cookbooks/a2a-chat/server.py` (a thin A2A adapter over `Chat`) |
| `hud.shared.*` (`exceptions`, `requests`, …) | merged into `hud.utils` (no alias - no environment imported it) | change the import to `from hud.utils... import ...` |

The rule of thumb: grading types move to hud.graders, tools become capabilities, and everything else under hud.tools is going away. When the deprecation log is quiet, the conversion is done.
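For the rows that are a pure module rename, the edit can be scripted. The sketch below is a hypothetical helper, not an SDK tool: `MODULE_MOVES` and `rewrite_imports` are illustrative names, and only the hud.native to hud.graders move is included because the other rows also rename classes or need manual changes.

```python
import re

# Pure module renames from the table above; extend only with moves that
# keep the imported names unchanged.
MODULE_MOVES = {
    "hud.native": "hud.graders",
}

# Matches the `from <module> import` prefix of an import line.
FROM_IMPORT = re.compile(r"^(\s*from\s+)([\w.]+)(\s+import\b)", re.MULTILINE)


def rewrite_imports(source):
    """Swap the module path on `from <old> import ...` lines listed in MODULE_MOVES."""
    def swap(match):
        old = match.group(2)
        return f"{match.group(1)}{MODULE_MOVES.get(old, old)}{match.group(3)}"
    return FROM_IMPORT.sub(swap, source)


before = "from hud import Environment\nfrom hud.native import BashGrader\n"
after = rewrite_imports(before)
print(after)
```

Imports from modules that are not listed pass through untouched, so the helper is safe to run repeatedly while you convert the remaining rows by hand.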

Next steps

Environment reference

Define capabilities, lifecycle hooks, and tasks.

Tasks & Tasksets

Define tasks, collect tasksets, and grade runs.

Package & deploy

Publish with hud deploy and run at scale.