# Build Environments
Source: https://docs.hud.ai/build-environments/index
From `hud init` to production in five steps
Build MCP environments that wrap any software for agent interaction. Think of it in three phases:
**Phase 1: Environment** - Wrap software in MCP tools
**Phase 2: Tasks** - Define evaluation scenarios
**Phase 3: Agents** - Run evaluations and training
## Phase 1 · Create a project (2 min)
```bash theme={null}
# Pick a template: blank, deep-research, browser
hud init my-env
cd my-env
```
Start development servers:
```bash theme={null}
# Terminal 1 - Environment backend
cd environment && uv run uvicorn server:app --reload
# Terminal 2 - MCP server
cd server && uv run hud dev
```
### Edit-save-test flow
1. Open `server/tools.py` and add or tweak a tool (see the sketch below).
2. Save – the MCP server restarts instantly.
3. Visit `http://localhost:8765/docs` to test your tools.
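A hypothetical minimal tool for step 1, using the `MCPServer` decorator pattern shown elsewhere in these docs (your template's `tools.py` may be structured differently):
```python theme={null}
# server/tools.py — a trivial tool to exercise the edit-save-test loop
from hud.server import MCPServer

server = MCPServer("my-env")

@server.tool()
def echo(text: str) -> str:
    """Return the input unchanged; handy for verifying hot-reload."""
    return text
```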
## Phase 2 · Write Tasks (2 min)
Build your environment image first (from the project's top-level folder):
```bash theme={null}
hud build
```
Create `tasks.json` using `docker run`:
```json theme={null}
{
"prompt": "Complete task",
"mcp_config": {
"local": {
"command": "docker",
"args": ["run", "--rm", "-i", "my-env:0.1.0"]
}
},
...your setup and evaluation tools
}
```
See [Task System](/core-concepts/task-system) or the `hud init` README for details.
## Phase 3 · Run Agents
```bash theme={null}
# Test with agents
hud eval tasks.json
# Deploy to registry
hud push
# Train agents on your tasks
hud rl tasks.json
```
***
### Cheatsheet
| Action | Command |
| ---------------- | -------------------------- |
| Create env | `hud init my-env -p blank` |
| Hot-reload dev | `hud dev --build` |
| Interactive test | `hud dev --interactive` |
| Troubleshoot | `hud debug my-env:dev` |
| Build image | `hud build` |
| Push to registry | `hud push` |
| RL training | `hud rl tasks.json` |
***
### Learn more →
* **Blank template walkthrough**: `environments/blank/README.md` in the repo.
* **Technical spec**: [/build-environments/spec](/build-environments/spec)
* **CLI reference**: [`hud init`](/reference/cli/init) • [`hud dev`](/reference/cli/dev) • [`hud build`](/reference/cli/build) • [`hud push`](/reference/cli/push) • [`hud run`](/reference/cli/run)
Have fun – and remember: *stderr for logs, stdout for MCP!*
# Environment Spec
Source: https://docs.hud.ai/build-environments/spec
What qualifies as a HUD environment, and how to run it
If your Docker image starts a stdio MCP server on container startup, it is a HUD environment.
## Architecture
```mermaid theme={null}
graph TD
Agent[Any Agent] -->|MCP JSON‑RPC| Server[MCP Server in Docker]
Server -->|optional IPC/HTTP/gRPC| App[Your Software]
```
* The MCP server can be written in any language or framework.
* Your software can be anything (service, desktop app, game, scripts). The connection between them is your choice.
* Logs must go to stderr; stdout is reserved for MCP protocol packets.
## Minimal requirements
* Docker image starts a process that speaks MCP on stdout/stdin (stdio) by default.
* No non‑MCP output on stdout (all logging to stderr; see the sketch below).
* No required file layout, framework, or endpoints.
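In Python, satisfying the stderr rule can be as simple as the following (a sketch; any language works as long as stdout stays clean):
```python theme={null}
# Send all logs to stderr so stdout carries only MCP JSON-RPC
import logging
import sys

logging.basicConfig(stream=sys.stderr, level=logging.INFO)
logging.info("environment starting")  # goes to stderr, not stdout
```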
Recommended (for HUD RL/evals): provide tools named `setup` and `evaluate`.
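A sketch of those two lifecycle tools, using the counter example from the `tasks.json` shown below (state handling is illustrative):
```python theme={null}
from hud.server import MCPServer

server = MCPServer("counter-env")
state = {"count": 0}

@server.tool()
def setup() -> str:
    """Reset state before an episode starts."""
    state["count"] = 0
    return "ready"

@server.tool()
def evaluate(target: int) -> dict:
    """Score the episode; the framework reads the 'reward' key."""
    return {"reward": 1.0 if state["count"] >= target else 0.0}
```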
## Make it runnable remotely (mcp.hud.ai)
Remote execution is built‑in. Push your image, then either:
* CLI (default is remote):
```bash theme={null}
export HUD_API_KEY=...
hud push
hud run myorg/my-env:latest # remote container with MCP
```
* Programmatic (Task config):
```json theme={null}
{
"mcp_config": {
"hud": {
"url": "https://mcp.hud.ai/v3/mcp",
"headers": {
"Authorization": "Bearer ${HUD_API_KEY}",
"Mcp-Image": "myorg/my-env:latest"
}
}
}
}
```
## Run arbitrary configs locally or remotely
* Local Docker (stdio):
```bash theme={null}
hud run myorg/my-env:latest --local --transport stdio
```
* Local HTTP proxy (for inspectors):
```bash theme={null}
hud run myorg/my-env:latest --local --transport http --port 8765
```
* Remote (default):
```bash theme={null}
hud run myorg/my-env:latest
```
## Notes
* Use `hud dev` only for local hot‑reload workflows; for production, build once and `hud run`.
* Use `hud debug` only when startup/compliance issues occur.
* Any MCP‑speaking Docker image works; controller/backend split is optional.
## Task config examples
```python theme={null}
from hud.datasets import Task
task = Task(
prompt="...",
mcp_config={
"hud": {
"url": "https://mcp.hud.ai/v3/mcp",
"headers": {
"Authorization": "Bearer ${HUD_API_KEY}",
"Mcp-Image": "myorg/my-env:latest"
}
}
},
)
```
The same structure is used by `hud init`’s template and by programmatic tasks.
* Programmatic (inline in code):
```json theme={null}
{
"mcp_config": {
"hud": {
"url": "https://mcp.hud.ai/v3/mcp",
"headers": {
"Authorization": "Bearer ${HUD_API_KEY}",
"Mcp-Image": "myorg/my-env:latest"
}
}
},
"agent_config": {
"allowed_tools": ["act"]
},
"setup_tool": { "name": "setup", "arguments": {} },
"evaluate_tool": { "name": "evaluate", "arguments": {"target": 10} }
}
```
* File `tasks.json` (created by `hud init` templates), local dev variant:
```json theme={null}
[
{
"prompt": "Increment the counter to reach 10",
"mcp_config": {
"local": {
"command": "docker",
"args": ["run", "--rm", "-i", "my-env:latest"]
}
},
"agent_config": {
"allowed_tools": ["act"]
},
"setup_tool": { "name": "setup", "arguments": {} },
"evaluate_tool": { "name": "evaluate", "arguments": {"target": 10} }
}
]
```
Switching this file to remote is as simple as replacing the `mcp_config` with the `hud` section shown above (or using `hud rl`, which will help convert it automatically).
Run tasks with either the CLI or an agent:
```bash theme={null}
# Evaluate tasks with an agent backend
hud eval tasks.json --agent claude
```
or programmatically by passing the Task to your agent (e.g., `ClaudeAgent`, `GenericOpenAIChatAgent`).
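For example, a minimal programmatic sketch that loads the `tasks.json` above (the agent choice is illustrative):
```python theme={null}
import asyncio
import json

from hud.agents import ClaudeAgent
from hud.datasets import Task

async def main():
    # Task(**dict) resolves ${VAR} placeholders at load time
    with open("tasks.json") as f:
        tasks = [Task(**d) for d in json.load(f)]
    result = await ClaudeAgent().run(tasks[0])
    print(f"Reward: {result.reward}")

asyncio.run(main())
```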
# Architecture
Source: https://docs.hud.ai/core-concepts/architecture
How HUD connects agents, environments, and evaluation
HUD is built as a thin orchestration layer on top of MCP, providing structure for agent evaluation, environment management, and telemetry.
## System Overview
```mermaid theme={null}
graph TB
subgraph "Remote"
Dashboard["📊 Dashboard
(hud.ai)"]
API["🔌 MCP Orchestrator
(mcp.hud.ai)"]
end
subgraph Code["Your Code"]
Agent["🤖 Agent
(Claude/Operator)"]
Task["📋 Task
(Prompt + Evaluation)"]
SDK["📦 hud-python"]
end
subgraph "Environments"
LocalEnv["🖥️ Local Docker
(Development)"]
RemoteEnv["☁️ Remote Docker
(100s Parallel)"]
end
subgraph "OpenTelemetry"
Trace["📡 Traces & Metrics"]
end
Dataset["📚 Dataset
(HuggingFace)"]
AnyMCP["🔗 Any MCP Client
(Cursor, Claude, Custom)"]
Agent <--> SDK
Task --> SDK
Dataset <-.-> Task
SDK <-->|"MCP"| LocalEnv
SDK <-->|"MCP"| API
API <-->|"MCP"| RemoteEnv
SDK -->|"hud.trace"| OpenTelemetry
Trace --> Dashboard
AnyMCP -->|"MCP"| API
style Dashboard fill:#3b82f6,stroke:#1e40af,stroke-width:2px,color:#ffffff
style API fill:#6b7280,stroke:#374151,stroke-width:2px,color:#ffffff
style Agent fill:#ef4444,stroke:#dc2626,stroke-width:2px,color:#ffffff
style Task fill:#f97316,stroke:#ea580c,stroke-width:2px,color:#ffffff
style SDK fill:#f59e0b,stroke:#d97706,stroke-width:2px,color:#ffffff
style LocalEnv fill:#06b6d4,stroke:#0891b2,stroke-width:2px,color:#ffffff
style RemoteEnv fill:#10b981,stroke:#047857,stroke-width:2px,color:#ffffff
style Trace fill:#84cc16,stroke:#65a30d,stroke-width:2px,color:#ffffff
style Dataset fill:#a855f7,stroke:#9333ea,stroke-width:2px,color:#ffffff
style AnyMCP fill:#8b5cf6,stroke:#7c3aed,stroke-width:2px,color:#ffffff,stroke-dasharray: 5 5
```
## Core Components
### 1. Agents (`hud.agents`)
Agents make decisions and call tools:
```python theme={null}
from hud.agents import ClaudeAgent, OperatorAgent
# Built-in agents
agent = ClaudeAgent() # Uses Anthropic's Claude
agent = OperatorAgent() # Uses OpenAI Operator
# All agents inherit from MCPAgent
class MCPAgent:
async def initialize(task: str | Task | None)
async def run(prompt_or_task: str | Task, max_steps: int = 10) -> Trace
async def call_tools(tool_calls: list[MCPToolCall]) -> list[MCPToolResult]
```
Agents can auto-create MCP clients from `task.mcp_config` - no manual client setup needed
### 2. Tasks (`hud.Task`)
Tasks define what agents should accomplish:
```python theme={null}
from hud.datasets import Task
task = Task(
prompt="Complete 3 items in the TODO list",
mcp_config={
"hud": {
"url": "https://mcp.hud.ai/v3/mcp",
"headers": {
"Authorization": f"Bearer {api_key}",
"Mcp-Image": "hudpython/hud-browser:latest"
}
}
},
# Tool names and arguments map directly to MCP server tools
setup_tool={
"name": "setup", # Calls def setup(name, num_items) on the environment
"arguments": {
"name": "todo_seed",
"num_items": 5
}
},
evaluate_tool={
"name": "evaluate", # Calls def evaluate(name, expected_count) on the environment
"arguments": {
"name": "todo_completed",
"expected_count": 3
}
}
)
```
The `name` and `arguments` in setup/evaluate tools correspond exactly to the tool names and parameters exposed by the MCP server, as sketched below.
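For illustration, the environment side of this task might define matching tools like this (signatures follow the comments above; bodies are placeholders):
```python theme={null}
from hud.server import MCPServer

server = MCPServer("todo-env")

@server.tool()
def setup(name: str, num_items: int) -> str:
    """Called by setup_tool: seed the TODO list."""
    # ... create num_items items for the 'todo_seed' setup ...
    return f"seeded {num_items} items"

@server.tool()
def evaluate(name: str, expected_count: int) -> dict:
    """Called by evaluate_tool: score completion."""
    completed = 3  # placeholder: read the real count from state
    return {"reward": 1.0 if completed >= expected_count else 0.0}
```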
### 3. MCP Clients (`hud.clients`)
Clients handle the MCP protocol:
```python theme={null}
from hud.clients import MCPClient
# Auto-created by agents, or manual:
client = MCPClient(mcp_config)
await client.initialize()
tools = await client.list_tools()
```
### 4. Environments
Environments are MCP servers exposing tools:
```python theme={null}
from hud.server import MCPServer
server = MCPServer("my-env")
@server.tool()
def move(direction: str) -> str:
    # Environment logic goes here; return a result for the agent
    return f"moved {direction}"
```
### 5. Telemetry (`hud.trace`)
Real-time observability:
```python theme={null}
import hud
async with hud.async_trace("my-evaluation"):
result = await agent.run(task)
# View at hud.ai/traces/{trace_id}
```
## Execution Flow
1. Create a `Task` with prompt and MCP configuration
2. Agent creates MCP client (if needed) and connects to environment
3. Execute `setup_tool` to initialize environment state
4. Agent receives observations, makes decisions, calls tools
5. Execute `evaluate_tool` to score performance
6. All interactions streamed to HUD backend for analysis
## Key Design Principles
1. **Protocol-First**: Everything speaks MCP
2. **Composable**: Mix and match agents, environments, evaluations
3. **Observable**: Built-in telemetry for every interaction
4. **Testable**: Reproducible evaluations with Docker
5. **Extensible**: Easy to add new agents or environments
The `MCPServer` class wraps FastMCP with lifecycle management, making it easy to build Docker-based environments
## Next Steps
* Test and benchmark agents on standardized tasks
* Create MCP-compatible environments for your software
# MCP Protocol
Source: https://docs.hud.ai/core-concepts/mcp-protocol
Understanding the Model Context Protocol that connects agents to environments
HUD uses the Model Context Protocol (MCP) to provide a standard way for AI agents to interact with any software environment through tool calls.
## Why MCP?
Traditional agent frameworks couple agents tightly to specific environments. MCP decouples them.

**Without MCP:**

* Agent code hardcoded for each environment
* No standardization across tools
* Difficult to swap agents or environments

**With MCP:**

* Any agent works with any environment
* Standard protocol for all interactions
* Easy to swap components
## How It Works
MCP standardizes agent-environment communication through JSON-RPC messages. Agents call tools exposed by environments and receive structured responses.
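For example, a `tools/call` request travels as a JSON-RPC 2.0 envelope like this (shown here as a Python dict; the values are illustrative):
```python theme={null}
# The JSON-RPC message an agent's client sends to call a tool
request = {
    "jsonrpc": "2.0",
    "id": 1,
    "method": "tools/call",
    "params": {"name": "move", "arguments": {"direction": "up"}},
}
```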
## Core Concepts
### Tools
[Tools](https://modelcontextprotocol.io/specification/server/tools) are functions exposed by the environment:
```json theme={null}
{
"name": "move",
"description": "Move in a direction",
"inputSchema": {
"type": "object",
"properties": {
"direction": {
"type": "string",
"enum": ["up", "down", "left", "right"]
}
}
}
}
```
### Tool Calls & Results
Agents [call tools](https://modelcontextprotocol.io/specification/server/tools#calling-tools) and receive results:
```python theme={null}
# Agent makes a tool call
result = await client.call_tool("move", {"direction": "up"})
# Environment returns result
{
"content": [{
"type": "text",
"text": "Moved up. New board state: ..."
}]
}
```
### Lifecycle Management
MCP defines a [rigorous lifecycle](https://modelcontextprotocol.io/specification/basic/lifecycle) for connections:
1. **Initialization**: Client and server negotiate capabilities and protocol version with `client.initialize()`
2. **Operation**: Normal tool calling and message exchange
3. **Shutdown**: Clean termination of the connection
The protocol ensures both sides understand each other's capabilities before proceeding.
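A minimal sketch of that lifecycle with the `MCPClient` shown in the Architecture section (the shutdown step is an assumption; the exact cleanup call depends on the client):
```python theme={null}
from hud.clients import MCPClient

# Local stdio config, as in the Transport Options section below
mcp_config = {
    "hud-text-2048": {
        "command": "docker",
        "args": ["run", "--rm", "-i", "hudpython/hud-text-2048:v1.2"],
    }
}

client = MCPClient(mcp_config)
await client.initialize()                 # 1. capability & protocol negotiation
tools = await client.list_tools()         # 2. operation: discovery
result = await client.call_tool("move", {"direction": "up"})  # ...and tool calls
# 3. shutdown: close the client when done (method depends on the client)
```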
## HUD's MCP Extensions
HUD adds conventions on top of MCP:
1. **Setup Tools**: Initialize environment state (`setup_board`, `navigate_to_url`)
2. **Evaluate Tools**: Score agent performance (`evaluate_max_tile`, `contains_text`)
3. **Lifecycle Management**: Clean initialization and shutdown with `client.initialize()` and proper cleanup
See the [Tools Reference](/reference/tools) for implementation details.
## Transport Options
HUD environments are designed for 100% reproducibility through Docker:
### Local Development (stdio)
Run environments locally for development and debugging using [stdio transport](https://modelcontextprotocol.io/specification/basic/transports#stdio):
```python theme={null}
mcp_config = {
"hud-text-2048": {
"command": "docker",
"args": ["run", "--rm", "-i", "hudpython/hud-text-2048:v1.2"]
}
}
```
* **Transport**: [stdio](https://modelcontextprotocol.io/specification/basic/transports#stdio) - JSON-RPC over stdin/stdout
* **Pros**: Full control, easy debugging, no network latency
* **Use case**: Development, testing, single-agent runs
The `-i` flag enables Docker's interactive mode, allowing stdio communication between the client and server process.
### Remote Execution (Streamable HTTP)
Scale to parallel runs with cloud infrastructure using [Streamable HTTP transport](https://modelcontextprotocol.io/specification/basic/transports#streamable-http):
```python theme={null}
mcp_config = {
"hud": {
"url": "https://mcp.hud.ai/v3/mcp",
"headers": {
"Authorization": f"Bearer {os.getenv('HUD_API_KEY')}",
"Mcp-Image": "hudpython/hud-text-2048:v1.2"
}
}
}
```
* **Transport**: [Streamable HTTP](https://modelcontextprotocol.io/specification/basic/transports#streamable-http) with SSE
* **Pros**: Infinite parallelization, no local setup, managed infrastructure
* **Use case**: Benchmarks, CI/CD, multi-agent experiments
HUD's orchestrator manages Docker containers in the cloud, exposing them via HTTP endpoints with Server-Sent Events for streaming.
Both approaches use the exact same Docker image, ensuring identical behavior whether running locally or remotely.
## Next Steps
* See how HUD builds on MCP for agent evaluation
# Task System
Source: https://docs.hud.ai/core-concepts/task-system
How tasks define agent objectives and evaluation criteria
Tasks define what agents should do and how to measure success. They combine a prompt, environment configuration, and optional setup/evaluation phases.
## Lifecycle
```mermaid theme={null}
graph LR
Task["📋 Task
(Prompt + Config)"]
Setup["🔧 Setup Phase
(Initialize State)"]
Execute["🤖 Execute Phase
(Agent Works)"]
Evaluate["📊 Evaluate Phase
(Score Result)"]
Task --> Setup
Setup --> Execute
Execute --> Evaluate
style Task fill:#6366f1,stroke:#4338ca,stroke-width:2px,color:#ffffff
style Setup fill:#ef4444,stroke:#dc2626,stroke-width:2px,color:#ffffff
style Execute fill:#f59e0b,stroke:#d97706,stroke-width:2px,color:#ffffff
style Evaluate fill:#10b981,stroke:#047857,stroke-width:2px,color:#ffffff
```
## Task Structure
```python theme={null}
from hud.datasets import Task
import uuid
task = Task(
# Required fields
prompt="Navigate to the login page and sign in as testuser@example.com",
mcp_config={
"hud": {
"url": "https://mcp.hud.ai/v3/mcp",
"headers": {
"Authorization": "Bearer ${HUD_API_KEY}",
"Mcp-Image": "hudpython/hud-browser:latest"
}
}
},
# Optional fields
id=str(uuid.uuid4()), # Required for HuggingFace datasets
system_prompt="You are an expert web automation agent. Always verify page loads before interacting with elements.",
setup_tool={
"name": "playwright",
"arguments": {
"action": "navigate",
"url": "https://example.com"
}
},
evaluate_tool={
"name": "evaluate",
"arguments": {
"name": "url_contains",
"substring": "/dashboard"
}
},
metadata={"category": "authentication", "difficulty": "easy"}
)
```
## Field Reference
* **`prompt`** (required): The instruction given to the agent. Be specific and clear about success criteria.
* **`mcp_config`** (required): Environment connection configuration. Supports environment variables with `${VAR}` syntax.
* **`id`**: Unique identifier. Required when creating HuggingFace datasets. Use `str(uuid.uuid4())`.
* **`system_prompt`**: Custom system prompt for the agent. Overrides the agent's default system prompt.
* **`setup_tool`**: Tool(s) to initialize environment state, with `name` and `arguments`.
* **`evaluate_tool`**: Tool(s) to score performance, with `name` and `arguments`.
* **`metadata`**: Additional task information for analysis and filtering.
## Environment Variables
Tasks automatically resolve `${VAR}` patterns when deserialized from dictionaries. This is why HuggingFace datasets should store raw dictionaries, not Task objects.
```python theme={null}
# In dataset JSON:
{
"prompt": "Complete the TODO list",
"mcp_config": {
"hud": {
"url": "https://mcp.hud.ai/v3/mcp",
"headers": {
"Authorization": "Bearer ${HUD_API_KEY}",
"Mcp-Image": "${BROWSER_IMAGE}"
}
}
}
}
# When loaded:
task = Task(**task_dict) # Variables resolved here!
# Now task.mcp_config["hud"]["headers"]["Authorization"] = "Bearer sk-hud-..."
```
This enables:
* **Public datasets** without exposing secrets
* **Environment-specific** configurations
* **CI/CD pipelines** with different credentials
## Running Tasks
```python theme={null}
# Agent automatically handles all phases
result = await agent.run(task)
print(f"Success: {result.reward}") # 0.0 to 1.0
```
The agent will:
1. Execute `setup_tool` if provided
2. Work on the `prompt` using available tools
3. Execute `evaluate_tool` to calculate reward
## Working with Datasets
Tasks integrate with HuggingFace datasets:
```python theme={null}
from datasets import load_dataset
from hud.datasets import run_dataset
# Load dataset from HuggingFace
dataset = load_dataset("hud-evals/SheetBench-50", split="train")
# Run agent on entire dataset with automatic parallelization
results = await run_dataset(
"SheetBench Run",
dataset, # Pass the dataset object or name (e.g. "hud-evals/SheetBench-50")
agent_class=ClaudeAgent
)
```
## Creating Datasets
```python theme={null}
from hud.datasets import save_tasks
# Create task dictionaries (NOT Task objects!)
task_dicts = [
{
"prompt": "Navigate to the login page",
"mcp_config": {
"hud": {
"url": "${MCP_URL}",
"headers": {"Authorization": "Bearer ${HUD_API_KEY}"}
}
},
"setup_tool": {"name": "playwright", "arguments": {"action": "navigate", "url": "https://example.com"}},
"evaluate_tool": {"name": "url_match", "arguments": {"pattern": ".*/login"}}
},
# More task dicts...
]
# Save to HuggingFace (preserves ${VAR} templates)
save_tasks(task_dicts, "my-org/my-benchmark")
```
Always save dictionaries, not Task objects. Task objects have already resolved environment variables!
## Best Practices
1. **Use UUIDs**: Always include `id=str(uuid.uuid4())` for dataset tasks
2. **Clear Prompts**: Be specific about success criteria
3. **Template Variables**: Use `${VAR}` syntax for shareable configs
4. **Rich Tools**: Include both `name` and `arguments` in tool definitions
## Next Steps
* Create, run, and publish evaluations
* Build your own MCP-compatible agent
# Tricks
Source: https://docs.hud.ai/core-concepts/tricks
Collection of unique examples with the SDK
## Computer Tool
Working on any web app? Run this simple Python server and let the agent see your screen:
```python theme={null}
from hud.server import MCPServer
from hud.tools import HudComputerTool
mcp = MCPServer(name="HUD Computer", host="0.0.0.0", port=8777)
mcp.add_tool(HudComputerTool())
mcp.run(transport="streamable-http")
```
Run: `python computer_server.py`
Add to cursor:
```
{
"hud-computer": {
"url": "http://localhost:8777/mcp"
}
}
```
More patterns coming soon! Have a cool pattern? [Share it on Discord](https://discord.gg/wkjtmHYYjm).
# Benchmarks
Source: https://docs.hud.ai/evaluate-agents/benchmarks
Create, run, and publish agent evaluations
This page covers how to build datasets, run evaluations, and publish results.
## Quick Start
**CLI**
```bash theme={null}
# Run a local tasks file (interactive agent selection)
hud eval tasks.json
# Run a hosted dataset with Claude
hud eval hud-evals/SheetBench-50 claude --full
```
**SDK**
```python theme={null}
from hud.agents import ClaudeAgent
from hud.datasets import run_dataset
results = await run_dataset(
name="SheetBench Eval",
dataset="hud-evals/SheetBench-50",
agent_class=ClaudeAgent,
max_concurrent=50,
)
```
## Build Benchmarks
### Explore Evaluators
```bash theme={null}
hud analyze hudpython/hud-remote-browser:latest
```
### Create Tasks
```python theme={null}
import uuid
web_tasks = []
web_tasks.append({
"id": str(uuid.uuid4()),
"prompt": "Navigate to the documentation page",
"mcp_config": {
"hud": {
"url": "https://mcp.hud.ai/v3/mcp",
"headers": {
"Authorization": "Bearer ${HUD_API_KEY}",
"Mcp-Image": "hudpython/hud-remote-browser:latest"
}
}
},
"setup_tool": {"name": "setup", "arguments": {"name": "navigate", "arguments": {"url": "https://example.com"}}},
"evaluate_tool": {"name": "evaluate", "arguments": {"name": "url_match", "arguments": {"pattern": ".*/docs.*"}}},
"metadata": {"difficulty": "easy", "category": "navigation"}
})
```
### Save to HuggingFace
```python theme={null}
from hud.datasets import save_tasks
save_tasks(
web_tasks,
repo_id="my-org/web-navigation-benchmark",
private=False,
tags=["web", "navigation", "automation"],
)
```
## Leaderboards
After running, visit your dataset leaderboard and publish a scorecard:
```python theme={null}
from hud.datasets import run_dataset
from hud.agents import ClaudeAgent
results = await run_dataset(
name="Claude Sonnet SheetBench",
dataset="hud-evals/SheetBench-50",
agent_class=ClaudeAgent,
)
# Open https://hud.ai/leaderboards/hud-evals/SheetBench-50
# Click "My Jobs" to see runs and create a scorecard
```
## Best Practices
* Clear, measurable prompts (binary or graded)
* Isolated task state and deterministic setup
* Use metadata tags (category, difficulty)
* Validate locally, then parallelize
* Version datasets; include a `system_prompt.txt`
## See Also
* [`hud eval`](/reference/cli/eval)
* [`hud rl`](/reference/cli/rl)
* [Tasks](/reference/tasks)
* [Agents (SDK)](/reference/agents)
# Create Agents
Source: https://docs.hud.ai/evaluate-agents/create-agents
Build your own MCP-compatible agent for HUD evaluation
Build custom agents that interact with MCP tools to complete tasks. An agent is essentially a loop that calls your LLM, executes tools based on its decisions, and continues until the task is complete.
## How Agents Work
An agent follows this lifecycle:
```mermaid theme={null}
graph LR
A[System Prompt] --> B[User Prompt]
B --> C[LLM Response]
C --> D{Has Tool Calls?}
D -->|Yes| E[Execute Tools]
E --> F[Format Results]
F --> C
D -->|No| G[Task Complete]
```
The agent keeps calling your LLM and executing tools until the LLM stops requesting tools, indicating the task is complete.
## The Four Required Methods
To create an agent, you implement four methods that bridge your LLM with MCP's tool system:
```python theme={null}
from hud.agents import MCPAgent
from hud.types import AgentResponse, MCPToolCall, MCPToolResult
class MyAgent(MCPAgent):
"""Your custom agent implementation."""
async def get_system_messages(self) -> list[Any]:
"""1. Called ONCE at start - returns your LLM's system prompt."""
pass
async def get_response(self, messages: list[Any]) -> AgentResponse:
"""2. Called EACH TURN - sends messages to your LLM, returns its response, optionally adds the assistant message to messages."""
pass
async def format_blocks(self, blocks: list[ContentBlock]) -> list[Any]:
"""3. Called at START - converts initial prompt/context to your LLM format."""
pass
async def format_tool_results(
self, tool_calls: list[MCPToolCall],
tool_results: list[MCPToolResult]
) -> list[Any]:
"""4. Called AFTER TOOLS - converts tool results to your LLM format."""
pass
```
### Understanding When Each Method is Called
The agent loop calls your methods in this sequence:
1. **`get_system_messages()`** - Once at start
2. **`format_blocks()`** - Converts initial task prompt
3. **`get_response()`** - Gets LLM decision, adds assistant message to messages
4. **`format_tool_results()`** - After each tool execution
5. Back to step 3 until done
## What MCPAgent Does For You
### The Agent Loop
The base `MCPAgent` class handles the entire execution loop. When you call `agent.run(task)`:
1. **Initialization Phase**
* Connects to MCP servers (auto-creates client from `task.mcp_config` if needed)
* Discovers available tools from all connected servers
* Applies tool filtering (allowed/disallowed lists)
* Identifies lifecycle tools (setup, evaluate, response)
2. **Setup Phase** (if task.setup\_tool provided)
* Executes setup tools (e.g., navigate to website, initialize environment)
* Optionally appends setup output to initial context (controlled by `append_setup_output`)
* Can include initial screenshots (controlled by `initial_screenshot`)
3. **Main Execution Loop**
```python theme={null}
while not done and step < max_steps:
# Your get_response() is called here
response = await agent.get_response(messages)
if response.tool_calls:
# MCPAgent executes tools for you
results = await agent.call_tools(response.tool_calls)
# Your format_tool_results() is called here
messages.extend(await agent.format_tool_results(tool_calls, results))
else:
done = True
```
4. **Evaluation Phase** (if `task.evaluate_tool` provided)
* Runs evaluation tools to calculate reward
* Extracts reward from result (looks for "reward", "grade", "score" keys)
* Returns Trace object with full execution history
### Tool Management
**Tool Discovery & Filtering**
```python theme={null}
agent = ClaudeAgent(
allowed_tools=["anthropic_computer"], # Only these tools
disallowed_tools=["openai_computer"], # Never these tools
)
```
* **Available Tools**: Retrieved via `self.get_available_tools()` - already filtered
* **Lifecycle Tools**: Automatically detected and hidden from your LLM
* **Response Tools**: Auto-detected (tools with "response" in name) for task completion
### Client Management
MCPAgent handles complex client lifecycle:
```python theme={null}
# Option 1: Provide your own client
from hud.clients import MCPClient
client = MCPClient(mcp_config={...})
agent = MyAgent(mcp_client=client)
# Option 2: Auto-create from task
task = Task(mcp_config={...})
agent = MyAgent() # No client needed
await agent.run(task) # Client created automatically
```
**Auto-cleanup**: Clients created automatically are properly shut down after execution.
### Error Handling
MCPAgent provides robust error handling:
* **Connection Errors**: Helpful messages about MCP server availability
* **Tool Errors**: Captured and returned as MCPToolResult with isError=True
* **Timeout Handling**: Graceful shutdown on tool execution timeouts
* **Trace Always Returns**: Even on errors, you get a Trace object with details
### Message Accumulation
Messages build up over the conversation:
```
[System] → [User Prompt] → [LLM Response] → [Tool Results] → [LLM Response] → ...
```
Your `get_response()` receives the full conversation history each time, allowing your LLM to maintain context.
### Advanced Features
**Response Agent Integration**
```python theme={null}
from hud.agents.misc import ResponseAgent
agent = MyAgent(
response_agent=ResponseAgent() # Auto-decides when to stop/continue
)
```
The ResponseAgent can analyze ambiguous LLM responses like "Should I submit?" and decide whether to continue.
**Telemetry & Tracing**
```python theme={null}
agent = MyAgent(
auto_trace=True, # Automatic span creation
verbose=True # Detailed logging
)
```
**System Prompt Augmentation**
```python theme={null}
task = Task(
system_prompt="Additional instructions...", # Appended to agent's system prompt
...
)
```
## Testing Your Agent
Test your agent on a simple task:
```python theme={null}
import asyncio
import hud
import os
from hud.datasets import Task
async def test_agent():
async with hud.async_trace("test-custom-agent"):
task = Task(
prompt="Navigate to example.com",
mcp_config={
"hud": {
"url": "https://mcp.hud.ai/v3/mcp",
"headers": {
"Authorization": f"Bearer {os.getenv('HUD_API_KEY')}",
"Mcp-Image": "hudpython/hud-remote-browser:latest"
}
}
},
setup_tool={
"name": "setup",
"arguments": {
"name": "navigate",
"arguments": {"url": "https://example.com"}
}
},
evaluate_tool={
"name": "evaluate",
"arguments": {
"name": "url_match",
"arguments": {"pattern": "example.com"}
}
}
)
# Use your custom agent
agent = MyAgent()
result = await agent.run(task)
print(f"Reward: {result.reward}")
asyncio.run(test_agent())
```
## Built-in Agents
HUD provides built-in agents for common LLM providers:
```python theme={null}
from hud.agents import ClaudeAgent, OperatorAgent
# Claude (Anthropic)
claude_agent = ClaudeAgent(
model="claude-sonnet-4-20250514",
)
# Operator (OpenAI-based)
operator_agent = OperatorAgent()
```
Always test your agent with the actual MCP servers you'll use in production.
## Next Steps
* Create, run, and publish evaluations
* API details and built-in agents
## See Also
* [`hud eval`](/reference/cli/eval) - Run agents on tasks/datasets from the CLI
* [`hud rl`](/reference/cli/rl) - Train agents with GRPO on your datasets
* [Agents (SDK Reference)](/reference/agents) – API details and built-in agents
# Introduction
Source: https://docs.hud.ai/index
OSS RL environment + evals toolkit.
**Version 0.4.62** - Latest stable release
* Test Claude, Operator, or custom agents on benchmarks like SheetBench and OSWorld
* Wrap any software in dockerized MCP for scalable and generalizable agent evaluation
* Use reinforcement learning and GRPO on evaluations to improve agent performance
## What is HUD?
HUD connects AI agents to software environments using the Model Context Protocol (MCP). Whether you're evaluating existing agents, building new environments, or training models with RL, HUD provides the infrastructure.
```mermaid theme={null}
graph LR
Agent["🤖 Any Agent
(Claude, Operator, etc.)"]
MCP["🔌 MCP Protocol
(Tool Calls)"]
Env["📦 Any Environment
(Browser, OS, etc.)"]
Agent -->|"call_tool()"| MCP
MCP -->|"click(x, y)"| Env
Env -->|"screenshot"| MCP
MCP -->|"get_response()"| Agent
style Agent fill:#3b82f6,stroke:#1e40af,stroke-width:2px,color:#ffffff
style MCP fill:#f59e0b,stroke:#d97706,stroke-width:2px,color:#ffffff
style Env fill:#10b981,stroke:#047857,stroke-width:2px,color:#ffffff
```
## Why HUD?
* **🔌 MCP-native**: Any agent can connect to any environment
* **📡 Live telemetry**: Debug every tool call at [hud.ai](https://hud.ai)
* **🚀 Production-ready**: From local Docker to cloud scale
* **🎯 Built-in benchmarks**: OSWorld-Verified, SheetBench-50, and more
* **🔧 CLI tools**: Create, develop, run, and train with `hud init`, `hud dev`, `hud run`, `hud eval`, `hud rl`
Run your first agent evaluation with zero setup
```bash theme={null}
uvx hud-python quickstart
```
## Quick Example
```python theme={null}
import asyncio, os, hud
from hud.datasets import Task
from hud.agents import ClaudeAgent
async def main():
# Define evaluation task with remote MCP
task = Task(
prompt="Win a game of 2048 by reaching the 128 tile",
mcp_config={
"hud": {
"url": "https://mcp.hud.ai/v3/mcp",
"headers": {
"Authorization": f"Bearer {os.getenv('HUD_API_KEY')}",
"Mcp-Image": "hudevals/hud-text-2048:0.1.3"
}
}
},
setup_tool={"name": "setup", "arguments": {"name": "board", "arguments": { "board_size": 4}}},
evaluate_tool={"name": "evaluate", "arguments": {"name": "max_number", "arguments": {"target": 64}}}
)
# Run agent (auto-creates MCP client)
agent = ClaudeAgent()
result = await agent.run(task)
print(f"Score: {result.reward}")
asyncio.run(main())
```
## Community
Star the repo and contribute
Join our community
### Are you a startup building agents?
[📅 Hop on a call](https://cal.com/jay-ram-z6st6w/demo) or [📧 founders@hud.ai](mailto:founders@hud.ai)
# LLM Quickstart
Source: https://docs.hud.ai/llm-quickstart
Add context about hud to any coding agent
## Setup
```bash theme={null}
claude mcp add --transport http docs-hud https://docs.hud.ai/mcp
```
Add to MCP settings:
```json theme={null}
"docs-hud": {
"url": "https://docs.hud.ai/mcp"
}
```
Or use one-click install:
[Install in Cursor](https://cursor.com/en/install-mcp?name=docs-hud-so&config=eyJ1cmwiOiJodHRwczovL2RvY3MuaHVkLnNvL21jcCJ9)
Alternatively, copy-paste the full `llms-full.txt` into your assistant's context.
## What you get
Your AI assistant gains access to:
* Complete HUD API reference and methods
* Working code examples and implementations
* Architecture patterns and best practices
* Real-time updates as documentation evolves
Try asking your assistant: "How do I create a custom agent in HUD?" or "Help me debug MCP tool calls"
## Next steps
* Get started with HUD in 3 minutes
* Create custom MCP environments
# Quickstart
Source: https://docs.hud.ai/quickstart
Run your first agent evaluation in 3 minutes
Get up and running with HUD in minutes. This guide walks you through installation, basic setup, and running your first agent evaluation.
## Quick Clone
The fastest way to get started:
```bash theme={null}
# Clone a complete example project with uv
uvx hud-python quickstart
```
This sets you up with a working agent evaluation example you can run immediately.
## Installation
```bash theme={null}
pip install hud-python
```
```bash theme={null}
# Includes AI providers and telemetry
pip install "hud-python[agent]"
```
```bash theme={null}
# Install CLI in isolated environment
uv tool install hud-python
```
## API Keys
Set your API keys as environment variables:
```bash theme={null}
export HUD_API_KEY="sk-hud-..." # Get from hud.ai
export ANTHROPIC_API_KEY="sk-ant-..." # For Claude agents
export OPENAI_API_KEY="sk-..." # For OpenAI agents
```
Create a `.env` file in your project root to manage API keys locally
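One way to load that `.env` file in Python is python-dotenv (an assumption on our part; it is not bundled with `hud-python`, and any loader works):
```python theme={null}
# pip install python-dotenv
import os

from dotenv import load_dotenv

load_dotenv()  # reads .env from the current directory
assert os.getenv("HUD_API_KEY"), "HUD_API_KEY not set"
```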
## Your First Agent
Run an agent on the 2048 game environment:
```python theme={null}
import asyncio, os
import hud
from hud.datasets import Task
from hud.agents import ClaudeAgent
async def main():
# The trace context captures ALL agent interactions within a "task run"
# Everything inside this context shows up as one trace on hud.ai
async with hud.async_trace("quickstart-2048"):
# Define task with remote MCP environment
task = Task(
prompt="Win a game of 2048 by reaching the 128 tile",
mcp_config={
"hud": {
"url": "https://mcp.hud.ai/v3/mcp",
"headers": {
"Authorization": f"Bearer {os.getenv('HUD_API_KEY')}",
"Mcp-Image": "hudevals/hud-text-2048:0.1.3"
}
}
},
setup_tool={"name": "setup", "arguments": {"name": "board", "arguments": { "board_size": 4}}},
evaluate_tool={"name": "evaluate", "arguments": {"name": "max_number", "arguments": {"target": 64}}}
)
# Run agent (auto-creates MCP client from task.mcp_config)
agent = ClaudeAgent()
result = await agent.run(task)
print(f"Max tile reached: {result.reward}")
asyncio.run(main())
```
The trace context ensures that the entire task run - from setup through evaluation - appears as one coherent trace on the platform
## What just happened?
1. **Task Definition**: We created a `Task` with:
* A prompt telling the agent what to do
* An `mcp_config` pointing to a remote MCP environment
* Setup and evaluation tools to initialize and score the game
2. **Auto Client**: The agent automatically created an MCP client from `task.mcp_config`
3. **Telemetry**: The `trace` context captured all interactions for debugging
4. **Evaluation**: The `evaluate` tool (running the `max_number` evaluator) returned the highest tile as the reward
## Next Steps
* Learn how agents connect to environments
* Test on SheetBench and OSWorld
* Create your own MCP environment
## CLI Quick Reference
```bash theme={null}
# Create sample environment
hud init
# Start hot-reload development server for an environment
hud dev . --build --interactive
# Train a model on your environment
hud rl
```
See the [CLI Reference](/reference/cli/overview) for detailed command documentation
# Agents
Source: https://docs.hud.ai/reference/agents
SDK reference for HUD agent classes
The HUD SDK provides a base `MCPAgent` class and several pre-built agent implementations for interacting with MCP environments.
## Base Class
### MCPAgent
```python theme={null}
from hud.agents import MCPAgent
```
Abstract base class for all MCP-enabled agents. Handles the agent loop, MCP client lifecycle, tool discovery/filtering (including lifecycle tools like `setup`, `evaluate`, `response`), and standardized telemetry/logging.
**Constructor Parameters:**
| Parameter | Type | Description | Default |
| --------------------- | ---------------- | -------------------------------------------- | -------------- |
| `mcp_client` | `AgentMCPClient` | MCP client for server connections | `None` |
| `allowed_tools` | `list[str]` | List of tool names to allow | `None` (all) |
| `disallowed_tools` | `list[str]` | List of tool names to disallow | `[]` |
| `system_prompt` | `str` | System prompt to seed the conversation | Default prompt |
| `append_setup_output` | `bool` | Append setup tool output to initial context | `True` |
| `initial_screenshot` | `bool` | Include an initial screenshot when supported | `True` |
| `model_name` | `str` | Model label for telemetry/logging | `"mcp-agent"` |
| `response_agent` | `ResponseAgent` | Optional auto-continue/stop helper | `None` |
| `auto_trace` | `bool` | Enable automatic tracing spans | `True` |
| `verbose` | `bool` | Verbose console logs for development | `False` |
**Key Methods:**
```python theme={null}
async def initialize(task: str | Task | None = None) -> None
"""Initialize agent with task-specific configuration."""
async def run(prompt_or_task: str | Task | dict[str, Any], max_steps: int = 10) -> Trace
"""Run agent with prompt or task. Returns Trace with results."""
async def call_tools(tool_call: MCPToolCall | list[MCPToolCall]) -> list[MCPToolResult]
"""Execute tool calls through MCP client."""
def get_available_tools() -> list[types.Tool]
"""Get filtered list of available tools (excludes lifecycle)."""
def get_tool_schemas() -> list[dict]
"""Get tool schemas formatted for the model."""
```
**Abstract Methods (must implement):**
```python theme={null}
async def get_system_messages() -> list[Any]
"""Get system prompt formatted for the model."""
async def get_response(messages: list[Any]) -> AgentResponse
"""Get model response including tool calls."""
async def format_blocks(blocks: list[ContentBlock]) -> list[Any]
"""Format content blocks into model messages."""
async def format_tool_results(tool_calls: list[MCPToolCall],
tool_results: list[MCPToolResult]) -> list[Any]
"""Format tool results for the model."""
```
**Class Variables:**
* `metadata: dict[str, Any] | None` - Injected into MCP initialize requests
* `required_tools: list[str]` - Tools that must be present or initialization fails
**Auto-Client Creation:**
If no `mcp_client` is provided but a `Task` with `mcp_config` is passed to `run()`, an MCPClient is automatically created and cleaned up.
**Lifecycle & Filtering:**
* Lifecycle tools (e.g., `setup`, `evaluate`, and auto-detected `response`) are hidden from the model but available to the framework
* Use `allowed_tools` / `disallowed_tools` to control the tool surface visible to your model
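For illustration, a subclass might set the class variables like so (the values here are hypothetical):
```python theme={null}
from hud.agents import MCPAgent

class MyAgent(MCPAgent):
    metadata = {"session": "demo"}  # hypothetical payload injected into MCP initialize
    required_tools = ["computer"]   # initialization fails if the environment lacks this tool
```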
## Pre-built Agents
### ClaudeAgent
```python theme={null}
from hud.agents import ClaudeAgent
```
Claude-specific implementation using Anthropic's API.
**Constructor Parameters:**
| Parameter | Type | Description | Default |
| ------------------- | ---------------- | --------------------------------- | ---------------------------- |
| `model_client` | `AsyncAnthropic` | Anthropic client | Auto-created |
| `model` | `str` | Claude model to use | `"claude-sonnet-4-20250514"` |
| `max_tokens` | `int` | Maximum response tokens | `4096` |
| `use_computer_beta` | `bool` | Enable computer-use beta features | `True` |
| `validate_api_key` | `bool` | Validate key on init | `True` |
**Features:**
* Native Claude tool calling
* Automatic prompt caching
* Computer-use beta support
* Display metadata injection
**Example:**
```python theme={null}
agent = ClaudeAgent(
model="claude-sonnet-4-20250514",
max_tokens=8192,
)
result = await agent.run(
Task(
prompt="Navigate to example.com",
mcp_config={
"hud": {
"url": "https://mcp.hud.ai/v3/mcp",
"headers": {
"Authorization": "Bearer ${HUD_API_KEY}",
"Mcp-Image": "hudpython/hud-remote-browser:latest"
}
}
},
evaluate_tool={"name": "evaluate", "arguments": {"name": "url_match", "arguments": {"pattern": "example.com"}}}
)
)
```
### OperatorAgent
```python theme={null}
from hud.agents import OperatorAgent
```
OpenAI Operator-style agent built on the Responses API with computer-use.
**Constructor Parameters:**
| Parameter | Type | Description | Default |
| ------------------ | -------------------------------------------- | ---------------------- | ------------------------ |
| `model_client` | `AsyncOpenAI` | OpenAI client | Auto-created |
| `model` | `str` | Responses model to use | `"computer-use-preview"` |
| `environment` | `Literal["windows","mac","linux","browser"]` | Computer environment | `"linux"` |
| `validate_api_key` | `bool` | Validate key on init | `True` |
**Features:**
* OpenAI Responses computer-use
* Operator-style system prompt guidance
* Display metadata injection
### GenericOpenAIChatAgent
```python theme={null}
from hud.agents import GenericOpenAIChatAgent
```
OpenAI-compatible chat.completions agent that works with any endpoint implementing the OpenAI schema (OpenAI, vLLM, Ollama, Together, custom, etc.).
**Constructor Parameters:**
| Parameter | Type | Description | Default |
| ------------------- | ---------------- | ------------------------------------------------- | --------------- |
| `openai_client` | `AsyncOpenAI` | OpenAI-compatible client instance | Required |
| `model_name` | `str` | Chat model name | `"gpt-4o-mini"` |
| `completion_kwargs` | `dict[str, Any]` | Extra args forwarded to `chat.completions.create` | `{}` |
**Example (local or custom endpoint):**
```python theme={null}
from openai import AsyncOpenAI
openai_client = AsyncOpenAI(
base_url="http://localhost:11434/v1", # e.g., Ollama
api_key="not-needed",
)
agent = GenericOpenAIChatAgent(
openai_client=openai_client,
model_name="llama3.1",
completion_kwargs={"temperature": 0.2},
)
```
### LangChainAgent
```python theme={null}
from hud.agents import LangChainAgent
```
LangChain integration for using any LangChain-compatible model.
**Constructor Parameters:**
| Parameter | Type | Description | Default |
| --------- | ------------------- | -------------------------- | -------- |
| `llm` | `BaseLanguageModel` | LangChain-compatible model | Required |
**Example:**
```python theme={null}
from langchain_anthropic import ChatAnthropic
llm = ChatAnthropic(model="claude-3-opus-20240229")
agent = LangChainAgent(llm=llm)
```
### GroundedOpenAIChatAgent
```python theme={null}
from hud.agents.grounded_openai import GroundedOpenAIChatAgent
```
OpenAI chat agent with a separate grounding model for element detection. The planning model issues natural-language "computer" tool calls that are grounded to coordinates by a dedicated vision model.
**Constructor Parameters:**
| Parameter | Type | Description | Default |
| --------------------- | ---------------- | ------------------------------------------ | ------------------- |
| `grounder_config` | `GrounderConfig` | Configuration for the grounding model | Required |
| `model_name` | `str` | OpenAI chat model name | `"gpt-4o-mini"` |
| `allowed_tools` | `list[str]` | Exposed tool list (usually `["computer"]`) | `None` |
| `append_setup_output` | `bool` | Append setup output to first turn | `False` |
| `system_prompt` | `str` | System prompt to use | Opinionated default |
> Note: The grounded agent is for advanced use-cases where you want to decouple planning from grounding. For most users, `ClaudeAgent` or `OperatorAgent` is sufficient.
## Helper Classes
### ResponseAgent
Base class for auto-response handlers that decide when to continue or stop.
```python theme={null}
from hud.agents.misc import ResponseAgent
class MyResponseAgent(ResponseAgent):
async def determine_response(self, agent_output: str) -> str:
if "task complete" in agent_output.lower():
return "STOP"
return "Continue with the next step"
```
## Common Types
### AgentResponse
Key fields:
* `tool_calls: list[MCPToolCall]`
* `done: bool`
* `content: str | None`
* `reasoning: str | None`
* `info: dict[str, Any]`
* `isError: bool`
### MCPToolCall
Key fields:
* `id: str` — unique identifier (auto-generated if not provided)
* `name: str`
* `arguments: dict[str, Any]`
### MCPToolResult
Key fields:
* `content: list[ContentBlock]`
* `structuredContent: dict[str, Any] | None`
* `isError: bool`
### Trace
Key fields:
* `reward: float`
* `done: bool`
* `content: str | None`
* `isError: bool`
* `info: dict[str, Any]`
* `task: Task | None`
* `trace: list[TraceStep]` — execution trace steps
* `messages: list[Any]` — final conversation state
## Usage Examples
### Simple Prompt Execution
```python theme={null}
from hud.agents import ClaudeAgent
from hud.datasets import Task
agent = ClaudeAgent()
result = await agent.run(
Task(
prompt="Click the submit button",
mcp_config={
"hud": {
"url": "https://mcp.hud.ai/v3/mcp",
"headers": {
"Authorization": "Bearer ${HUD_API_KEY}",
"Mcp-Image": "hudpython/hud-remote-browser:latest"
}
}
}
),
max_steps=5,
)
print("Done:", result.done, "Error:", result.isError)
```
### Task Execution with Auto-Client
```python theme={null}
from hud.agents import OperatorAgent
from hud.datasets import Task
# No client needed - auto-created from task
agent = OperatorAgent()
task = Task(
prompt="Find the price of the product",
mcp_config={
"hud": {
"url": "https://mcp.hud.ai/v3/mcp",
"headers": {
"Authorization": "Bearer ${HUD_API_KEY}",
"Mcp-Image": "hudpython/hud-remote-browser:latest"
}
}
},
setup_tool={
"name": "setup",
"arguments": {"url": "https://example.com"}
},
evaluate_tool={
"name": "evaluate",
"arguments": {"check": "price_found"}
}
)
# Client created automatically
result = await agent.run(task, max_steps=20)
print(f"Reward: {result.reward}")
```
### Custom Agent Implementation
```python theme={null}
from hud.agents import MCPAgent
from hud.types import AgentResponse
import hud
class MyCustomAgent(MCPAgent):
metadata = {"custom": "metadata"}
async def get_system_messages(self) -> list[dict]:
return [{
"role": "system",
"content": self.system_prompt
}]
@hud.instrument(span_type="agent", record_args=False, record_result=True)
async def get_response(self, messages: list[dict]) -> AgentResponse:
# Your LLM call here
response = await self.llm.chat(messages)
return AgentResponse(
content=response.content,
tool_calls=[
MCPToolCall(name=tc.name, arguments=tc.args)
for tc in response.tool_calls
],
done=response.stop_reason == "stop"
)
async def format_blocks(self, blocks: list[ContentBlock]) -> list[dict]:
content = []
for block in blocks:
if block.type == "text":
content.append({"type": "text", "text": block.text})
elif block.type == "image":
content.append({
"type": "image",
"image": {"data": block.data, "format": "png"}
})
return [{"role": "user", "content": content}]
async def format_tool_results(self, tool_calls, tool_results) -> list[dict]:
return [{
"role": "tool",
"content": result.content,
"tool_call_id": call.name
} for call, result in zip(tool_calls, tool_results)]
```
## See Also
* [Create Agents](/evaluate-agents/create-agents) - Tutorial on building agents
* [Tasks](/reference/tasks) - Task configuration reference
* [Architecture](/core-concepts/architecture) - How agents fit in HUD
* [`hud eval`](/reference/cli/eval) - Run agents on tasks/datasets
* [`hud rl`](/reference/cli/rl) - Train agents with GRPO
# hud analyze
Source: https://docs.hud.ai/reference/cli/analyze
Inspect MCP environments quickly from metadata or live containers
The `hud analyze` command inspects MCP environments to discover tools and capabilities. By default it uses cached metadata; pass `--live` (or Docker args) to run a container.
## Usage
```bash theme={null}
hud analyze [DOCKER_ARGS...] [OPTIONS]
hud analyze --config mcp.json
hud analyze --cursor
```
## Arguments
* `IMAGE`: Docker image to analyze (omit when using `--config` or `--cursor`)
## Options
* `--format`, `-f`: Output format: `interactive`, `json`, or `markdown`
* `--verbose`, `-v`: Show input schemas in fast mode and more details in live mode
* `--live`: Run container for live analysis (slower but exact)
* `--config`, `-c`: JSON file with MCP configuration (always live)
* `--cursor`: Analyze a Cursor MCP server (always live)
## Analysis Modes
### Fast Mode (Default)
Sources (in order):
1. Local HUD cache: `~/.hud/envs//hud.lock.yaml`
2. HUD registry metadata (if available)
3. Basic Docker manifest info
```bash theme={null}
# Instant results from metadata
hud analyze myorg/my-env:latest
```
### Live Mode
Runs the container and queries the server:
```bash theme={null}
# Full analysis with running container
hud analyze myorg/my-env:latest --live
# Passing env vars implies live mode
hud analyze myorg/my-env:latest --live -e API_KEY=secret
```
> Providing Docker args (after the image) automatically switches to live mode.
## Output Formats
* `interactive` (default): rich table/tree views
* `json`: structured output for tooling
* `markdown`: docs-friendly summaries
## Examples
```bash theme={null}
# Fast metadata analysis
hud analyze hudpython/text-2048:latest
# From lock file
hud analyze ./locks/text-2048.yaml --format json
# Live analysis with env vars
hud analyze my-env:latest --live -e API_KEY=test
# Cursor and config
hud analyze --cursor my-dev-server --live
hud analyze --config mcp-config.json
```
## Tips
* Prefer fast mode during iteration; switch to `--live` for final validation.
* If metadata isn’t found, the command suggests pulling or running live.
## See Also
* [`hud pull`](/reference/cli/pull)
* [`hud debug`](/reference/cli/debug)
* [`hud build`](/reference/cli/build)
* [`hud run`](/reference/cli/run)
# hud build
Source: https://docs.hud.ai/reference/cli/build
Build production images and generate lock files for reproducible environments
The `hud build` command builds your Docker image, analyzes its MCP server, and writes a reproducible `hud.lock.yaml`.
## Usage
```bash theme={null}
hud build [DIRECTORY] [OPTIONS]
```
## Arguments
* `DIRECTORY`: Environment directory containing Dockerfile and pyproject.toml
## Options
* `--tag`: Docker image tag to use. Defaults to `[tool.hud.image]` or `{dir}:dev`.
* `--no-cache`: Build without using Docker cache.
* `--verbose`, `-v`: Show detailed build output.
* `--platform`: Target platform for the Docker build. Defaults to `linux/amd64` if unspecified.
## What It Does
1. Runs `docker build` (twice): a temporary build for analysis, then a labeled build for the final image and version tags.
2. Runs the container via MCP to measure initialize time and list tools (count and input schemas).
3. Writes `hud.lock.yaml` with:
   * `version`: lock format version
   * `image`: `{tag}@sha256:` after final build
   * `build`: generatedAt, hudVersion, directory, internal `version` (semver, auto‑incremented), `sourceHash`, optional `sourceFiles`
   * `environment`: initializeMs, toolCount, and optional `variables` with `provided`, `required`, `optional`
   * `tools`: name, description, inputSchema (for RL/evals tooling)
4. Adds labels (`org.hud.*`) and tags the image with both the chosen tag and the internal version.
## Examples
```bash theme={null}
# Basic build (auto tag)
hud build
# Custom tag
hud build . --tag v1.2.0
# Clean build with logs
hud build . --no-cache --verbose
# Force cross‑platform (useful on Apple Silicon)
hud build . --platform linux/amd64
```
## Image Naming
If `--tag` is not provided:
1. Read `[tool.hud.image]` from `pyproject.toml`
2. Otherwise use `{directory-name}:dev`
## Environment Variables
Pass runtime environment variables during analysis with `-e`/`--env` flags:
```bash theme={null}
hud build . -e API_KEY=secret -e DEBUG=true
```
They’re recorded in the lock file as placeholders under `environment.variables.provided`; missing required variables are listed in `environment.variables.required`.
## Lock File
Minimal structure you’ll see:
```yaml theme={null}
version: "1.0"
image: "my-env:dev@sha256:..."
build:
generatedAt: "2025-01-01T12:00:00Z"
hudVersion: "0.x.y"
directory: "my-env"
version: "0.1.0"
sourceHash: "..."
environment:
initializeMs: 450
toolCount: 3
variables:
provided: { API_KEY: "${API_KEY}" }
required: ["OTHER_KEY"]
tools:
- name: setup
description: Initialize environment
inputSchema: { type: object }
```
## Next Steps
```bash theme={null}
# Hot‑reload development
hud dev
# Run the built image locally (stdio)
hud run my-env:dev --local
# Push to registry
hud push
```
## See Also
* [`hud init`](/reference/cli/init)
* [`hud dev`](/reference/cli/dev)
* [`hud push`](/reference/cli/push)
* [`hud analyze`](/reference/cli/analyze)
* [Build Environments](/build-environments)
# hud debug
Source: https://docs.hud.ai/reference/cli/debug
Test MCP environments through 5 validation phases
The `hud debug` command validates MCP environments through 5 progressive phases.
## Synopsis
```bash theme={null}
hud debug [IMAGE|DIRECTORY] [OPTIONS] [DOCKER_ARGS...]
hud debug --config PATH
hud debug --cursor SERVER_NAME
```
## Options
| Option | Description | Default |
| ------------------- | --------------------------------------------- | ------- |
| `--config`, `-c` | JSON config file with MCP configuration | - |
| `--cursor` | Debug a Cursor MCP server | - |
| `--build`, `-b` | Build image before debugging (directory mode) | false |
| `--max-phase`, `-p` | Maximum debug phase (1-5) | 5 |
## Debug Phases
### Phase 1: Container Startup
```
🚀 Phase 1: Container startup
✓ Container started successfully
✓ Received output on stderr
```
Checks: Process starts, outputs to stderr, doesn't exit
### Phase 2: MCP Initialization
```
🔗 Phase 2: MCP initialization
✓ Received valid initialize response
✓ Protocol version: 1.0
```
Checks: JSON-RPC communication, protocol compatibility
### Phase 3: Tool Discovery
```
🔧 Phase 3: Tool discovery
✓ Found 8 tools:
• setup_board - Initialize game board
• move - Move in direction
```
Checks: Tools registered, valid schemas
### Phase 4: Tool Execution
```
🧪 Phase 4: Tool testing
→ Testing tool: setup_board
✓ Tool executed successfully
```
Checks: Tools callable, return valid responses
### Phase 5: Readiness Check
```
✅ Phase 5: Readiness check
✓ Ready for agent interaction
```
Checks: Overall health, state persistence
Use `hud dev` for hot-reload development after debug passes to speed up iteration.
## Examples
### Docker Image
```bash theme={null}
# Basic
hud debug my-server:latest
# With environment variables (Docker args after image)
hud debug my-server:latest -e DEBUG=true -e LOG_LEVEL=debug
# Stop at phase 3
hud debug my-server:latest --max-phase 3
```
### Directory Mode
```bash theme={null}
# Build then debug from a directory containing a Dockerfile and pyproject.toml
hud debug . --build
```
## Common Issues
### Phase 1 Failures
* **Container exits**: Check CMD/ENTRYPOINT
* **No stderr**: Route logs to stderr
* **Permission denied**: Check file permissions
### Phase 2 Failures
* **Invalid JSON**: Only JSON-RPC on stdout
* **Timeout**: Check for blocking operations
### Phase 3 Failures
* **No tools**: Check `@server.tool()` decorators
* **Invalid schema**: Add type hints and docstrings (see the sketch below)
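A tool whose schema validates cleanly can be as simple as this sketch (assuming the `MCPServer` decorator pattern):
```python theme={null}
from hud.server import MCPServer

server = MCPServer("my-env")

@server.tool()
def move(direction: str) -> str:
    """Move in a direction: up, down, left, or right."""
    # The type hint and docstring are what generate a valid input schema
    return f"moved {direction}"
```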
## Advanced Usage
### Incremental Debugging
```bash theme={null}
for phase in 1 2 3 4 5; do
echo "Testing phase $phase..."
hud debug my-env:latest --max-phase $phase || break
done
```
### Parallel Testing
```bash theme={null}
# Test multiple environments
images=("env1:latest" "env2:latest")
for image in "${images[@]}"; do
hud debug "$image" > "debug_${image//:/}.log" 2>&1 &
done
wait
```
## Best Practices
1. Start with phase 1 and work up
2. Use `--max-phase` for incremental debugging
3. Check stderr output first
4. Keep tools simple for easier debugging
5. Add health check tools
If debug passes, agents should work reliably with your environment.
## Next Step
* Explore environment capabilities after debugging with [`hud analyze`](/reference/cli/analyze)
# hud dev
Source: https://docs.hud.ai/reference/cli/dev
Hot-reload development server for MCP environments
The `hud dev` command provides development-time proxying for MCP environments, starting your image and exposing it over HTTP (default) or stdio.
## Synopsis
```bash theme={null}
hud dev [DIRECTORY] [OPTIONS]
```
## Description
`hud dev` launches a lightweight MCP proxy that runs your Docker image and forwards MCP traffic. It auto-detects or builds your image, mounts your project directory, and optionally opens an inspector or interactive tool runner.
**Key Features:**
* **Auto-detection**: Determines image from `[tool.hud.image]` or `dir-name:dev`
* **Project mount**: Mounts your project at `/app` (sets `PYTHONPATH=/app`)
* **Interactive testing**: TUI for calling tools (HTTP transport only)
* **HTTP/Stdio protocols**: Choose transport method (`http` default)
* **Inspector support**: Launch MCP Inspector (HTTP mode)
* **Per-connection containers**: Each client gets its own container
## Arguments
| Argument | Description | Default |
| ----------- | -------------------------------- | ------------- |
| `DIRECTORY` | Environment directory (optional) | `.` (current) |
## Options
| Option | Description | Default |
| ------------------- | ---------------------------------------------------------- | ------------- |
| `--image`, `-i` | Docker image name (overrides auto-detection) | Auto-detected |
| `--build`, `-b` | Build image before starting | `false` |
| `--no-cache` | Force rebuild without cache | `false` |
| `--transport`, `-t` | Transport protocol: `http` or `stdio` | `http` |
| `--port`, `-p` | HTTP server port (ignored for stdio) | `8765` |
| `--interactive` | Launch interactive tool testing interface | `false` |
| `--no-reload` | Disable hot-reload supervision | `false` |
| `--verbose`, `-v` | Show detailed server logs | `false` |
| `--inspector` | Launch MCP Inspector (HTTP mode only) | `false` |
| `--no-logs` | Disable streaming Docker logs | `false` |
| `--full-reload` | Restart entire container on file changes (not implemented) | `false` |
## Examples
### Auto-Detection Mode (Recommended)
```bash theme={null}
hud dev
```
### Build and Start
```bash theme={null}
hud dev --build
```
### Specific Directory
```bash theme={null}
hud dev environments/my-env --build
```
### Custom Image
```bash theme={null}
hud dev . --image my-custom-env:dev --build
```
### HTTP Mode with Inspector
```bash theme={null}
hud dev . --build --transport http --inspector
```
### Stdio Mode
```bash theme={null}
hud dev . --build --transport stdio
```
### Clean Rebuild
```bash theme={null}
hud dev . --build --no-cache
```
### Verbose Logging
```bash theme={null}
hud dev . --build --verbose
```
### Interactive Testing (HTTP only)
```bash theme={null}
hud dev . --build --interactive
```
Interactive mode disables log streaming and hot-reload supervision to provide a clean UI.
## How It Works
1. **Image Resolution**:
* Reads `pyproject.toml` `[tool.hud.image]`
* Auto-generates `dir-name:dev` if absent; writes it back on build
* Uses `--image` override if provided
2. **Container Startup**:
* Runs your image with its original `CMD` (the container controls reload behavior)
* Mounts your project at `/app` and sets `PYTHONPATH=/app`
* For HTTP transport, the proxy hosts `http://localhost:<port>/mcp` (default port `8765`)
3. **Proxy Server**:
* Each client connection gets its own container
* Inspector and interactive UI are available in HTTP mode
4. **Hot-Reload**:
* Depends on your container `CMD` (templates include `--reload` flags)
* For best results, separate the restartable MCP controller from any long-lived backend
### Process Separation Architecture
For stateful environments, `hud dev` supports a critical design pattern: separating the MCP server from the environment process. This separation enables hot-reload without losing state.
**Why Separation Matters:**
* MCP server can restart instantly for code changes
* Environment state (browsers, databases, games) persists
* Development is faster without constant state resets
**Example Architecture:**
```mermaid theme={null}
graph LR
MCP["🔄 MCP Server
(Restartable)
• Lightweight
• Fast startup
• MCP protocol"]
ENV["🏗️ Environment Process
(Persistent)
• Maintains state
• Long-lived
• Resource heavy"]
MCP -->|"MCP Calls"| ENV
ENV -->|"Unix Socket
TCP, gRPC, etc."| MCP
style MCP fill:#e0e7ff,stroke:#6366f1,stroke-width:2px
style ENV fill:#d1fae5,stroke:#10b981,stroke-width:2px
```
**Connection Methods:**
* Unix sockets (recommended for local dev)
* TCP/HTTP endpoints
* gRPC services
* Shared memory/IPC
Separating the MCP server (restartable) from the backend (stateful) enables code reloads without losing state. See the browser template for a full example.
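A minimal sketch of the pattern, assuming a persistent backend that serves `POST /act` on port `8005` (as in the `hud init` templates) and `httpx` as the HTTP client — the controller holds no state, so restarting it loses nothing:
```python theme={null}
# controller/tools.py — the restartable MCP side of the split (sketch).
import httpx

from . import mcp  # the MCPServer instance from controller/__init__.py

@mcp.tool()
async def act(action: str) -> dict:
    """Forward an action to the persistent environment backend."""
    async with httpx.AsyncClient() as client:
        resp = await client.post("http://localhost:8005/act", json={"action": action})
        resp.raise_for_status()
        return resp.json()
```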
## File Mounting
Project directory is mounted as `/app` inside the container:
```
Local: ./ → Container: /app/
Local: ./controller/ → Container: /app/controller/
Local: ./environment/ → Container: /app/environment/
```
## Transport Modes
### HTTP Transport (Default)
```bash theme={null}
hud dev . --transport http --port 8765
```
**Benefits:**
* Web browser access and MCP Inspector
* Multiple simultaneous connections
* Easier debugging
**URL:** `http://localhost:8765/mcp`
### Stdio Transport
```bash theme={null}
hud dev . --transport stdio
```
**Benefits:**
* Direct MCP protocol, lower latency
* Single connection focused
## Integration Examples
### Cursor Integration
1. Start development server:
```bash theme={null}
hud dev . --build --transport http --port 8765
```
2. Add to Cursor's MCP config:
```json theme={null}
{
"mcpServers": {
"my-dev-env": {
"url": "http://localhost:8765/mcp"
}
}
}
```
3. Edit files — changes apply immediately based on your container's reload behavior.
### Testing During Development
```bash theme={null}
# Quick inspection (optional)
hud analyze my-env:dev
# Deep check (only when broken)
hud debug my-env:dev
```
### MCP Inspector
```bash theme={null}
hud dev . --build --inspector
```
Opens an inspector that shows:
* Available tools and schemas
* Real-time tool calls and protocol messages
## Troubleshooting
* Use `--verbose` to see container logs
* If the container never starts: run `hud debug <image>`
* Port conflicts: choose a new `--port`
## See Also
* [`hud init`](/reference/cli/init) — Create new environments
* [`hud build`](/reference/cli/build) — Build production images
* [`hud push`](/reference/cli/push) — Share to registry
* [`hud analyze`](/reference/cli/analyze) — Inspect tools
* [`hud debug`](/reference/cli/debug) — Validate environment
* [`hud run`](/reference/cli/run) — Run locally/remotely
* [Build Environments](/build-environments)
# hud eval
Source: https://docs.hud.ai/reference/cli/eval
Run agents on tasks or datasets
The `hud eval` command runs an agent on a tasks file or a HuggingFace dataset.
## Usage
```bash theme={null}
hud eval [SOURCE] [AGENT] [OPTIONS]
```
## Arguments
* `SOURCE` — HuggingFace dataset (e.g., `hud-evals/SheetBench-50`) or task JSON/JSONL file. If omitted, looks for a tasks file in the current directory.
* `AGENT` — Agent backend to use: `claude`, `openai`, or `vllm`. If omitted, an interactive selector appears (including HUD hosted models).
## Options
* `--full` — Run the entire dataset (omit for single-task debug mode)
* `--model` — Model name for the chosen agent (required for some agents)
* `--allowed-tools` — Comma-separated list of allowed tools
* `--max-concurrent` — Maximum concurrent tasks (1-200 recommended); adjust based on your API rate limits and system resources
* `--max-steps` — Maximum steps per task (default: 10 for a single task, 50 for a full dataset)
* `--verbose` — Enable verbose agent output
* Enable debug-level logs for maximum visibility
* `--vllm-base-url` — Base URL for the vLLM server (when using `--agent vllm` or HUD hosted models)
* Number of times to run each task (mini-batch style)
## Examples
```bash theme={null}
# Single task (debug mode)
hud eval hud-evals/SheetBench-50
# Entire huggingface dataset with Claude
hud eval hud-evals/SheetBench-50 claude --full
# High concurrency for faster evaluation
hud eval hud-evals/SheetBench-50 claude --full --max-concurrent 100
# Limit concurrency to prevent rate limits
hud eval hud-evals/SheetBench-50 openai --full --max-concurrent 20
# Local task config
hud eval tasks.json claude
# Local task config with verbose output for debugging
hud eval tasks.json claude --verbose
# vLLM with explicit base URL
hud eval tasks.json vllm --model llama3.1 --vllm-base-url http://localhost:8000
# Limit tools and concurrency
hud eval tasks.json claude --allowed-tools click,type --max-concurrent 10
```
## Notes
* If you select a HUD hosted model, `hud eval` will route through vLLM with the appropriate base model.
* When `SOURCE` is omitted, an interactive file picker helps locate a tasks file.
## See Also
* [`hud rl`](/reference/cli/rl)
* [`hud run`](/reference/cli/run)
* [`hud analyze`](/reference/cli/analyze)
## Pricing & Billing
See hosted vLLM and training GPU rates in the [Training Quickstart → Pricing](/train-agents/quickstart#pricing). Manage usage and billing at `https://hud.ai/project/billing`.
# hud init
Source: https://docs.hud.ai/reference/cli/init
Create a new HUD environment from a preset
The `hud init` command scaffolds a working MCP environment using templates from the public SDK.
## Usage
```bash theme={null}
hud init [NAME] [OPTIONS]
```
## Arguments
* `NAME` — Environment name. If omitted, the current directory name is used.
## Options
* `-p` — Template preset: `blank`, `deep-research`, or `browser`
* `-d` — Target directory where the environment will be created
* `--force`, `-f` — Overwrite existing files if they exist
## What It Creates
A minimal but complete environment with controller/frontend and optional backend:
```
my-env/
├── Dockerfile # Container configuration
├── pyproject.toml # Dependencies and metadata
├── README.md # Template instructions
├── tasks.json # Example tasks
├── controller/ # MCP server (stdio)
│ ├── __init__.py # mcp = MCPServer()
│ ├── __main__.py # python -m controller → mcp.run()
│ ├── hooks.py # @mcp.initialize / @mcp.shutdown
│ └── tools.py # @mcp.tool act / setup / evaluate
└── environment/ # Backend (FastAPI example)
└── server.py # /health /act /reset /state
```
### Dockerfile (template)
```dockerfile theme={null}
FROM python:3.11-slim
WORKDIR /app
COPY pyproject.toml ./
COPY controller/ ./controller/
COPY environment/ ./environment/
RUN pip install --no-cache-dir -e .
ENV ENV_SERVER_PORT=8005
# Start backend then launch MCP controller on stdio
CMD ["sh", "-c", "uvicorn environment.server:app --host 0.0.0.0 --port $ENV_SERVER_PORT --log-level warning & python -m controller"]
```
Templates may include hot-reload flags for development. Remove them for production images.
## Examples
```bash theme={null}
# Choose preset interactively (default blank)
hud init
# Create a blank template in a new directory
hud init my-env -p blank
# Browser presets
hud init my-browser -p browser
# Deep research preset (remote browser)
hud init my-deep -p deep-research
# Force overwrite
hud init my-env -p blank --force
```
## Next Steps
Run with hot-reload and choose your preferred UI:
```bash theme={null}
# Inspector (HTTP, visual)
hud dev --inspector
# Interactive TUI (arrow keys)
hud dev --interactive
```
Add tools in `controller/tools.py`; use `@mcp.tool`.
```bash theme={null}
hud build
hud push
```
## Presets
* **blank**: Minimal controller + FastAPI backend with `/health`, `/act`, `/reset`, `/state` and example tools.
* **browser**: Local browser environment preset.
* **deep-research**: Remote browser environment preset (maps to `remote_browser`).
## See Also
* [Build Environments](/build-environments) – Quickstart tutorial
* [Technical Spec](/build-environments/spec) – Exact runtime requirements
* [hud dev](/reference/cli/dev) – Hot-reload server
* [hud build](/reference/cli/build) – Build production images
# Other CLI Commands
Source: https://docs.hud.ai/reference/cli/misc
Miscellaneous HUD CLI utilities
Quick reference for auxiliary CLI utilities.
## hud pull
Fetch an environment from a registry with a metadata preview.
```bash theme={null}
hud pull <org/name:tag> [--yes] [--verify-only]
# From a shared lock file
hud pull ./hud.lock.yaml
```
Notes:
* Shows tool list, init time, variables (when available)
* `--verify-only` skips `docker pull`
## hud get
```bash theme={null}
hud get hud-evals/2048-basic --split train -o tasks.jsonl --limit 100 --format jsonl
```
Downloads a HuggingFace dataset split and saves as JSON or JSONL.
## hud quickstart
```bash theme={null}
hud quickstart
```
Clones the HUD quickstart repository.
## hud cursor-list
```bash theme={null}
hud cursor-list
```
Lists MCP servers configured in Cursor and shows the config location.
## hud version
```bash theme={null}
hud version
```
Shows the HUD CLI version.
## hud clone
```bash theme={null}
hud clone https://github.com/user/repo.git
```
Clones a repository and prints any configured tutorial snippet.
## hud set
```bash theme={null}
hud set HUD_API_KEY=... ANTHROPIC_API_KEY=...
```
Saves KEY=VALUE pairs to `~/.hud/.env` for default use in HUD.
# CLI Overview
Source: https://docs.hud.ai/reference/cli/overview
Complete reference for HUD command-line tools
The HUD CLI provides a complete toolkit for creating, developing, and running MCP environments. Commands are organized into two main workflows:
**Directory-based commands** for creating and sharing environments:
* `hud init` — Create new environment
* `hud dev` — Develop with hot‑reload
* `hud build` — Build and generate lock file
* `hud push` — Share to registry
**Target-based commands** for using environments and agents:
* `hud analyze` — Inspect capabilities (fast/live)
* `hud debug` — 5‑phase compliance test
* `hud run` — Execute (Python module/command/Docker)
* `hud eval` — Run agents on tasks/datasets
* `hud rl` — Train with GRPO on tasks
## Installation
```bash uv (Recommended) theme={null}
uv tool install hud-python
hud --version
```
```bash pip theme={null}
pip install hud-python
python -m hud.cli --help
```
```bash pipx theme={null}
pipx install hud-python
hud --version
```
## Commands
### Building Workflow
| Command | Input | Description | Example |
| ----------- | --------- | ----------------------- | ------------------------- |
| `hud init` | Directory | Create new environment | `hud init my-env` |
| `hud dev` | Directory | Hot-reload development | `hud dev . --interactive` |
| `hud build` | Directory | Build image & lock file | `hud build . --tag v1.0` |
| `hud push` | Directory | Share to registry | `hud push . --tag prod` |
### Running Workflow
| Command | Input | Description | Example |
| ------------- | -------------------- | ----------------------------- | ----------------------------- |
| `hud analyze` | Image or config | Inspect tools & capabilities | `hud analyze org/env` |
| `hud debug` | Image/dir/config | 5‑phase compliance test | `hud debug my-env:latest` |
| `hud run` | Module/command/image | Execute server (local/remote) | `hud run controller --reload` |
| `hud eval` | Tasks/dataset | Run agent on tasks | `hud eval tasks.json claude` |
| `hud rl` | Tasks/dataset | Train with GRPO | `hud rl tasks.json --local` |
### Other Commands
| Command | Description | Example |
| ----------------- | ---------------------------------- | --------------------------------------------- |
| `hud get` | Download HF dataset to tasks file | `hud get hud-evals/2048-basic -o tasks.jsonl` |
| `hud quickstart` | Clone quickstart repo | `hud quickstart` |
| `hud cursor-list` | List Cursor MCP servers | `hud cursor-list` |
| `hud version` | Show CLI version | `hud version` |
| `hud clone` | Clone any git repo (pretty output) | `hud clone https://github.com/...` |
| `hud set` | Persist API keys to `~/.hud/.env` | `hud set HUD_API_KEY=...` |
## Complete Workflows
### Building an Environment
Create a new HUD environment with minimal boilerplate:
```bash theme={null}
hud init my-env && cd my-env
```
Creates `Dockerfile`, `pyproject.toml`, `controller/` (MCP server), optional `environment/` backend, `tasks.json`.
Run with hot-reload and interactive testing:
```bash theme={null}
hud dev
```
Your changes reload automatically. Test tools interactively with arrow keys.
Create production image and lock file:
```bash theme={null}
hud build
```
Generates `hud.lock.yaml` with metadata and labels image for reproducibility.
Share to Docker and HUD registries:
```bash theme={null}
hud push
```
Requires `HUD_API_KEY`. Auto-detects registry from Docker login.
### Running an Environment
Quick inspection without running:
```bash theme={null}
hud analyze hudpython/text_init # Fast (from metadata)
hud analyze hudpython/text_init --live # Full (runs container)
```
Test MCP protocol compliance:
```bash theme={null}
hud debug hudpython/text_init:latest
```
Validates through 5 phases of initialization.
Execute in production mode:
```bash theme={null}
hud run hudpython/text_init:latest # Remote (default)
hud run hudpython/text_init:latest --local # Local Docker
```
## Common Usage
### Docker Images
```bash theme={null}
# Basic
hud debug my-image:latest
# With options
hud debug my-image:latest -e DEBUG=true -p 8080:8080
# From registry
hud analyze ghcr.io/org/image:v1.0.0
```
### Arbitrary Commands (Python/Node/etc.)
```bash theme={null}
# Run any command as MCP server (stdio/http)
hud run --cmd "python -m controller" -t http -p 8765
hud run --cmd "node mcp-server.js"
```
### Cursor Integration
```bash theme={null}
# Debug Cursor server
hud debug --cursor my-dev-server
# List all servers
hud cursor-list
```
## Output Formats
### Interactive (Default)
```bash theme={null}
hud analyze my-env
🔍 Analyzing MCP environment: my-env
✓ Connected successfully
✓ Found 12 tools
Available Tools:
• click - Click at coordinates
• type - Type text
...
```
### JSON
```bash theme={null}
hud analyze my-env --format json
{
"tools": [{
"name": "click",
"description": "Click at coordinates",
"parameters": {...}
}]
}
```
### Markdown
```bash theme={null}
hud analyze my-env --format markdown > docs/tools.md
```
## CI/CD Example
```bash theme={null}
#!/bin/bash
set -e
# Test environment
hud debug "$IMAGE_NAME"
# Verify tools
TOOLS=$(hud analyze "$IMAGE_NAME" --format json | jq '.tools | length')
if [ "$TOOLS" -lt 3 ]; then
echo "Not enough tools!"
exit 1
fi
```
## Python Scripting
```python theme={null}
import subprocess
import json
def get_tools(image):
result = subprocess.run(
["hud", "analyze", image, "--format", "json"],
capture_output=True,
text=True,
check=True
)
return json.loads(result.stdout)["tools"]
# Use
tools = get_tools("my-env:latest")
for tool in tools:
print(f"- {tool['name']}: {tool['description']}")
```
## Exit Codes
| Code | Meaning | Description |
| ---- | ---------------- | ------------------- |
| 0 | Success | Command completed |
| 1 | General Error | Command failed |
| 2 | Usage Error | Invalid arguments |
| 3 | Connection Error | Failed to connect |
| 4 | Timeout | Operation timed out |
| 5 | Protocol Error | MCP violation |
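Exit codes make the CLI easy to script; for example, a sketch that branches on the codes documented above:
```python theme={null}
import subprocess

# Run a compliance check and branch on the documented exit codes.
proc = subprocess.run(["hud", "debug", "my-env:latest"])
if proc.returncode == 0:
    print("Environment passed all phases")
elif proc.returncode == 5:
    print("Protocol error: look for non-MCP output on stdout")
else:
    print(f"hud debug failed with exit code {proc.returncode}")
```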
## Environment Variables
```bash theme={null}
# Debug output
export HUD_CLI_DEBUG=true
# Custom timeout
export HUD_CLI_TIMEOUT=120
# Provider keys
export ANCHOR_API_KEY=...
```
## Next Steps
### Building Commands
* [`hud init`](/reference/cli/init) — Create new environments from scratch
* [`hud dev`](/reference/cli/dev) — Develop with hot-reload and interactive testing
* [`hud build`](/reference/cli/build) — Build images and generate lock files
* [`hud push`](/reference/cli/push) — Share environments to registry
### Running Commands
* [`hud analyze`](/reference/cli/analyze) — Inspect tools and capabilities
* [`hud debug`](/reference/cli/debug) — Test MCP protocol compliance
* [`hud run`](/reference/cli/run) — Execute servers locally or remotely
# hud push
Source: https://docs.hud.ai/reference/cli/push
Share HUD environments to Docker and HUD registries
The `hud push` command publishes your environment image to a Docker registry and uploads metadata to the HUD registry.
## Usage
```bash theme={null}
hud push [DIRECTORY] [OPTIONS]
```
## Arguments
* `DIRECTORY` — Environment directory containing `hud.lock.yaml`
## Options
* `--image`, `-i` — Override the registry image name (e.g., `myorg/myenv`)
* `--tag`, `-t` — Override the tag (e.g., `v1.0`, `latest`)
* Sign the image with cosign (not yet implemented)
* `--yes`, `-y` — Skip confirmation prompts
* `--verbose`, `-v` — Show detailed output
## Prerequisites
Requires `HUD_API_KEY`:
```bash theme={null}
export HUD_API_KEY="your-api-key"
```
Login to your Docker registry first (e.g., Docker Hub or GHCR):
```bash theme={null}
docker login
# or
docker login ghcr.io
```
## What It Does
1. Ensures a recent `hud build` exists (lock file present), prompting to build if missing.
2. Determines the target image:
   * `--image` if provided
   * Otherwise, auto-detects your Docker Hub username and uses `username/{name}:{tag}`
   * The tag comes from `--tag`, else the lock file's internal `build.version`, else the current tag
3. Tags and pushes via Docker; captures the pushed digest.
4. Updates `hud.lock.yaml` with:
   * `image`: full registry reference with digest
   * `push`: `source`, `pushedAt`, `registry`, and `image_with_tag`
5. Uploads lock metadata to the HUD registry path `.../registry/envs/{org}/{name}:{tag}`.
## Registry Detection
1. Explicit `--image` → used as‑is.
2. Docker config → reads logged‑in username (Docker Hub).
3. Default → prompts if no login and `--image` missing.
## Examples
```bash theme={null}
# Basic (auto username)
hud push
# Custom registry and tag
hud push . --image ghcr.io/myorg/myenv --tag v1.0.0
# Non‑interactive (CI)
hud push . --yes --tag "prod-$(date +%Y%m%d)"
```
## Lock File After Push
```yaml theme={null}
image: "myorg/myenv:latest@sha256:..."
push:
source: "my-env:dev@sha256:..."
pushedAt: "2025-01-01T11:00:00Z"
registry: "ghcr.io"
image_with_tag: "myorg/myenv:latest"
```
## Notes
* If HUD registry upload fails, the Docker push still succeeds; you can share `hud.lock.yaml` manually.
* Image names must include `org/name` for HUD registry uploads.
## See Also
* [`hud build`](/reference/cli/build)
* [`hud pull`](/reference/cli/pull)
* [`hud analyze`](/reference/cli/analyze)
* [Build Environments](/build-environments) - Getting started with HUD environments
# hud rl
Source: https://docs.hud.ai/reference/cli/rl
Run GRPO reinforcement learning on tasks
The `hud rl` command trains an agent with GRPO on tasks, locally or via the HUD remote service.
## Usage
```bash theme={null}
hud rl [TASKS_FILE|DATASET] [MODEL] [OPTIONS]
```
## Arguments
* `TASKS_FILE|DATASET` — Path to a tasks JSON/JSONL file or HuggingFace dataset name. If omitted, looks for a tasks file in the current directory.
* `MODEL` — Model to train (default: interactive selection)
## Options
* `--config`, `-c` — Path to an existing configuration file
* `-o` — Output directory for checkpoints
* Restart the vLLM server before training
* `--verbose`, `-v` — Enable verbose output
* Disable DistributedDataParallel (even with multiple GPUs)
* `--ddp-gpus` — Specific GPUs for DDP (e.g., `0,1,2,3`)
* `--vllm-gpu` — Specific GPU for the vLLM server
* `--local` — Run training locally instead of on the remote HUD service
## Behavior
* If no tasks file is provided, an interactive picker helps locate one.
* Remote mode (default) converts tasks to remote MCP automatically (build/push as needed) and launches remote training.
* Local mode runs training on your machine (delegated to `local_runner`).
## Examples
```bash theme={null}
# Remote (default): auto-convert tasks to remote, then train
hud rl tasks.json --model claude-rl
# Local training with GPU selection
hud rl tasks.json llama3.1 --local --ddp-gpus 0,1 --vllm-gpu 0
# Use a dataset directly (remote)
hud rl hud-evals/SheetBench-50 --model claude-rl
```
## See Also
* [`hud eval`](/reference/cli/eval)
* [`hud get`](/reference/cli/get)
* [`hud build`](/reference/cli/build)
* [`hud push`](/reference/cli/push)
## Pricing & Billing
See hosted vLLM and training GPU rates in the [Training Quickstart → Pricing](/train-agents/quickstart#pricing). Manage usage and billing at `https://hud.ai/project/billing`.
# hud run
Source: https://docs.hud.ai/reference/cli/run
Execute MCP environments locally or remotely for production use
The `hud run` command executes MCP servers in three ways: (1) Python module/file (decorator‑based), (2) arbitrary command, or (3) Docker image (local/remote). Unlike `hud dev`, `hud run` is optimized for execution, not hot‑reload.
## Usage
```bash theme={null}
# Python module or file (decorator-based)
hud run controller[:attr] [OPTIONS]
hud run path/to/server.py[:attr] [OPTIONS]
# Arbitrary command
hud run --cmd "python -m controller" [OPTIONS]
# Docker image (supports Docker args after --)
hud run [OPTIONS] [-- DOCKER_ARGS...]
```
## Arguments
* Positional target — a Python module/file (e.g., `controller` or `controller:server`) or a Docker image (e.g., `myorg/env:latest`)
* `DOCKER_ARGS` — Additional Docker arguments, accepted positionally after the image (following the `--` separator)
## Options
* `--local` — Run locally with Docker instead of remote execution
* `--remote` — Explicitly run remotely (the default when neither `--local` nor `--remote` is set)
* `--transport`, `-t` — Transport protocol: `stdio` or `http`
* `--port`, `-p` — Port for HTTP transport (ignored for stdio)
* `--url` — Remote MCP server URL
* `--api-key` — API key for the remote server (defaults to the `HUD_API_KEY` env var)
* `--run-id` — Run ID for tracking (remote only)
* `--verbose`, `-v` — Show detailed output
* `--interactive` — Launch interactive testing mode (HTTP transport only; local Docker only)
* `--reload` — Auto-reload when files change (Python mode only)
* `--watch` — Directories to watch when using `--reload` (Python mode)
* `--cmd` — Run an arbitrary command as an MCP server (e.g., `python -m controller`)
## Python Mode (Decorator-based)
Run a Python module, package, or file containing an MCP server (default attribute `mcp`; override with `module:attr`).
```bash theme={null}
# Run module or file (stdio by default)
hud run controller
hud run controller:mcp
hud run server.py:mcp
# Auto-reload on changes (watches module/package)
hud run controller --reload
hud run controller --reload --watch controller tools
# HTTP transport
hud run controller -t http -p 8765
```
## Remote Execution (Default for Docker)
By default, `hud run` executes your Docker image on the HUD platform's remote infrastructure. This provides:
* **Scalability**: Run hundreds of concurrent instances
* **Resource Management**: Automatic container lifecycle management
* **Monitoring**: Live telemetry and performance metrics
* **Geographic Distribution**: Global edge deployment
### Remote Examples
```bash theme={null}
# Basic remote execution
hud run hudpython/text-2048:latest
# With environment variables
hud run myorg/server:v1 -- -e API_KEY=secret -e DEBUG=true
# HTTP transport with custom port
hud run myorg/server:v1 --transport http --port 9000
# Custom remote URL and API key
hud run my-image:latest --url https://custom.mcp.server --api-key my-key
# With run tracking
hud run production-env:v1.0 --run-id "deploy-$(date +%Y%m%d)"
```
### Authentication
Remote execution requires authentication via API key:
```bash theme={null}
# Set environment variable (recommended)
export HUD_API_KEY=your-api-key-here
hud run my-image:latest
# Or pass directly (less secure)
hud run my-image:latest --api-key your-api-key-here
```
### Run Tracking
Use `--run-id` to track specific execution instances:
```bash theme={null}
hud run my-image:latest --run-id "experiment-001"
```
## Local Execution
Use `--local` to run the image locally with Docker:
```bash theme={null}
# Basic local execution
hud run hudpython/text-2048:latest --local
# With Docker arguments
hud run myorg/server:v1 --local -- -e API_KEY=secret -v /data:/app/data
# HTTP transport locally
hud run myorg/server:v1 --local --transport http --port 8080
# With memory limits and volumes
hud run my-env:latest --local -- --memory 2g -v $(pwd)/data:/app/data
```
Local execution is useful for:
* Development and testing
* Environments without internet access
* Custom Docker configurations
* Debugging before remote deployment
## Transport Modes
### Stdio Transport (Default)
```bash theme={null}
hud run my-image:latest --transport stdio
```
**Characteristics:**
* Direct JSON-RPC over stdin/stdout
* Single client connection
* Lower latency
* MCP protocol native
### HTTP Transport
```bash theme={null}
hud run my-image:latest --transport http --port 8765
```
**Characteristics:**
* HTTP proxy at specified port
* Multiple concurrent connections
* Web-compatible
* Inspector support
## Docker Arguments
Pass Docker arguments after `--` separator:
```bash theme={null}
# Environment variables
hud run my-image:latest -- -e DEBUG=true -e API_KEY=secret
# Volume mounts
hud run my-image:latest -- -v /host/data:/container/data
# Network settings
hud run my-image:latest -- --network host
# Memory limits
hud run my-image:latest -- --memory 2g
# Combined arguments
hud run my-image:latest -- -e API_KEY=secret -v /data:/app/data --memory 1g
```
The `--` separator is required to distinguish HUD options from Docker arguments.
## Environment Variables
| Variable | Description | Default |
| ------------- | --------------------------------- | --------------------------- |
| `HUD_MCP_URL` | Remote MCP server URL | `https://mcp.hud.ai/v3/mcp` |
| `HUD_API_KEY` | API key for remote authentication | *required for remote* |
## Comparison with Other Commands
| Command | Purpose | Hot-reload | Target Use Case |
| ----------- | ----------------------------- | ---------- | ----------------------------- |
| `hud dev` | Development with auto-restart | ✅ Yes | Local development iteration |
| `hud run` | Production execution | ❌ No | Production workloads, testing |
| `hud debug` | Environment validation | ❌ No | Debugging and validation |
## Examples
### Basic Usage
```bash theme={null}
# Run remotely (default)
hud run hudpython/text-2048:latest
# Run locally for testing
hud run hudpython/text-2048:latest --local
# Verbose output for debugging
hud run my-image:latest --local --verbose
```
### Production Deployment
```bash theme={null}
# Remote execution with tracking
export HUD_API_KEY=your-production-key
hud run production-env:v1.0 --run-id "prod-deploy-$(date +%s)"
# With environment configuration
hud run production-env:v1.0 --run-id "prod-$(git rev-parse --short HEAD)" -- \
-e DATABASE_URL=postgres://... \
-e REDIS_URL=redis://... \
-e LOG_LEVEL=info
```
### Development Testing
```bash theme={null}
# Test locally before remote deployment
hud run my-env:dev --local -- -e DEBUG=true
# Test with HTTP transport
hud run my-env:dev --local --transport http --port 9000
# Test with custom data volume
hud run my-env:dev --local -- -v $(pwd)/test-data:/app/data
```
### CI/CD Integration
```yaml theme={null}
# GitHub Actions example
- name: Run HUD environment
env:
HUD_API_KEY: ${{ secrets.HUD_API_KEY }}
run: |
# Pull latest
hud pull myorg/env:latest --yes
# Run tests
hud run myorg/env:latest --run-id "ci-${{ github.sha }}"
```
## Error Handling
`hud run` provides clear error messages for common issues:
* **Missing image**: Suggests available images or build commands
* **Authentication failure**: Guides through API key setup
* **Port conflicts**: Suggests alternative ports
* **Docker errors**: Shows Docker-specific troubleshooting
## See Also
* [`hud pull`](/reference/cli/pull) - Get environments before running
* [`hud analyze`](/reference/cli/analyze) - Inspect capabilities first
* [`hud debug`](/reference/cli/debug) - Test environment compliance
* [`hud dev`](/reference/cli/dev) - Development with hot-reload
* [Build Environments](/build-environments) - Environment development guide
# Environments
Source: https://docs.hud.ai/reference/environments
SDK reference for building MCP environments
The HUD SDK provides `MCPServer` for building MCP-compatible environments that work with any MCP client.
## MCPServer
```python theme={null}
from hud.server import MCPServer
```
Enhanced FastMCP server with Docker-friendly features for building HUD environments.
**Constructor Parameters:**
| Parameter | Type | Description | Default |
| ------------------ | ----- | ------------------------------- | -------- |
| `name` | `str` | Server name for MCP handshake | Required |
| `instructions` | `str` | Server instructions/description | `None` |
| `**fastmcp_kwargs` | `Any` | Additional FastMCP parameters | - |
**Key Features:**
1. **SIGTERM handling** - Graceful shutdown in containers via custom runner
2. **Initialize decorator** - Async setup during MCP initialize request (stdout is temporarily redirected to stderr during initialization to avoid corrupting MCP output)
3. **Shutdown decorator** - Runs only on SIGTERM (container termination), not on hot‑reload/SIGINT
4. **Enhanced add\_tool()** - Automatically handles `BaseTool` instances and raw FastMCP Tool objects
5. **Tool decorator passthrough** - `@mcp.tool` returns the original function for easy composition
6. **FastMCP inheritance** - All FastMCP methods available (`mount`, `resource`, `tool`)
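Feature 5 in practice: because `@mcp.tool` hands back the original function, registered tools remain ordinary coroutines you can call or compose directly (a small sketch):
```python theme={null}
from hud.server import MCPServer

mcp = MCPServer(name="example")

@mcp.tool()
async def add(a: int, b: int) -> dict:
    """Add two numbers."""
    return {"sum": a + b}

# The decorator returns `add` itself, so other code can still await it:
#     result = await add(1, 2)  # {"sum": 3}
```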
### Decorators
#### @initialize
Run async setup during MCP initialize request:
```python theme={null}
mcp = MCPServer(name="my-env")
@mcp.initialize
async def setup_environment(ctx):
"""
Initialize environment resources.
Args:
ctx: RequestContext with:
- ctx.meta: Client metadata dict
- ctx.session: MCP ServerSession
"""
# Access metadata from agent (if provided)
if ctx.meta:
progress_token = ctx.meta.get("progressToken")
display_width = ctx.meta.get("display_width", 1920)
display_height = ctx.meta.get("display_height", 1080)
# Send progress notifications
if progress_token:
await ctx.session.send_progress_notification(
progress_token=progress_token,
progress=50,
total=100,
message="Initializing environment..."
)
```
#### @shutdown
Run cleanup on SIGTERM (container termination only):
```python theme={null}
@mcp.shutdown
async def cleanup():
"""Clean up resources on shutdown."""
if browser_provider:
browser_provider.close()
logger.info("Cleanup complete")
```
### Tool Registration
Three ways to register tools:
```python theme={null}
# 1. Decorator for simple functions
@mcp.tool()
async def my_tool(param: str) -> dict:
return {"result": param}
# 2. Add BaseTool instances
from hud.tools import BashTool
bash = BashTool()
mcp.add_tool(bash) # Automatically uses bash.mcp internally
# 3. Add non-BaseTool instances directly
from custom import PlaywrightTool
playwright = PlaywrightTool()
mcp.add_tool(playwright) # Added as-is
```
### Hub Pattern (mount)
Use BaseHub for organized tool namespaces:
```python theme={null}
from hud.tools import BaseHub
from mcp.types import TextContent
# Create hub
setup_hub = BaseHub("setup")
# Add internal tools (hidden from agents)
@setup_hub.tool("board")
async def setup_board(size: int = 4):
    game = setup_hub.env
    game.reset(size=size)
    return [TextContent(text=f"{size}x{size} board initialized", type="text")]
# Mount hub on server
mcp.mount(setup_hub)
# Agents call via dispatcher: setup(name="board", arguments={"size": 4})
```
### Resources
Expose metadata via MCP resources:
```python theme={null}
import os
from datetime import datetime
@mcp.resource("telemetry://live")
async def get_telemetry():
"""Expose live telemetry data."""
return {
"provider": os.getenv("BROWSER_PROVIDER"),
"status": "running" if browser_provider else "stopped",
"live_url": browser_provider.get_live_view_url() if browser_provider else None,
"timestamp": datetime.now().isoformat()
}
```
### Running the Server
```python theme={null}
if __name__ == "__main__":
# Run with SIGTERM handling (stdio by default)
mcp.run()
# Or use development transports (HTTP/SSE)
mcp.run(transport="http", port=8765)
mcp.run(transport="sse", port=8080)
```
When using HTTP/SSE, HUD development helper endpoints are available:
* `GET /hud` – overview
* `GET /hud/tools` – list tools with schemas
* `GET /hud/resources` – list resources
* `GET /hud/prompts` – list prompts
## Real Environment Examples
### Minimal Environment
```python theme={null}
# src/hud_controller/server.py
from hud.server import MCPServer
from mcp.types import TextContent
mcp = MCPServer(name="counter-env")
counter = {"value": 0}
@mcp.tool()
async def setup(start_value: int = 0):
"""Initialize counter."""
counter["value"] = start_value
return {"status": "ready", "counter": counter["value"]}
@mcp.tool()
async def increment():
"""Increment counter."""
counter["value"] += 1
return [TextContent(text=f"Counter: {counter['value']}", type="text")]
@mcp.tool()
async def evaluate(target: int):
"""Check if target reached."""
from hud.tools.types import EvaluationResult
return EvaluationResult(
reward=1.0 if counter["value"] >= target else 0.0,
done=counter["value"] >= target
)
if __name__ == "__main__":
mcp.run()
```
### text\_2048 Environment
From `environments/text_2048/src/hud_controller/server.py`:
```python theme={null}
from hud.server import MCPServer
from .game import Game2048
from .tools import MoveTool
from .setup import setup as setup_hub
from .evaluate import evaluate as evaluate_hub
mcp = MCPServer(name="text-2048")
game = None
@mcp.initialize
async def initialize_environment(ctx):
global game
# Progress notifications
progress_token = getattr(ctx.meta, "progressToken", None) if ctx.meta else None
async def send_progress(progress: int, message: str):
if progress_token:
await ctx.session.send_progress_notification(
progress_token=progress_token,
progress=progress,
total=100,
message=message
)
await send_progress(0, "Starting 2048 game environment...")
# Create game
game = Game2048()
game.reset()
await send_progress(50, "Setting up game board...")
# Set game on hubs
setup_hub.env = game
evaluate_hub.env = game
# Mount hubs
mcp.mount(setup_hub)
mcp.mount(evaluate_hub)
await send_progress(70, "Configuring tools...")
# Add move tool
mcp.add_tool(MoveTool(env=game))
await send_progress(100, "2048 environment ready")
```
### remote\_browser Environment
From `environments/remote_browser/src/hud_controller/server.py`:
```python theme={null}
from hud.server import MCPServer
from hud.tools.computer import HudComputerTool, AnthropicComputerTool, OpenAIComputerTool
from .tools import PlaywrightToolWithMemory, BrowserExecutor
from .setup import setup as setup_hub
from .evaluate import evaluate as evaluate_hub
from .providers import get_provider
mcp = MCPServer(
name="HUD Remote Browser Environment",
instructions="""Remote browser automation environment..."""
)
# Global state
browser_provider = None
playwright_tool = None
@mcp.resource("telemetry://live")
async def get_telemetry_resource():
"""MCP resource with live browser status."""
return {
"provider": os.getenv("BROWSER_PROVIDER", "unknown"),
"status": "running" if browser_provider else "stopped",
"live_url": browser_provider.get_live_view_url() if browser_provider else None,
"cdp_url": browser_provider.cdp_url if browser_provider else None
}
@mcp.initialize
async def initialize_environment(ctx):
global browser_provider, playwright_tool
# Get metadata
metadata = ctx.meta
progress_token = metadata.get("progressToken", None)
# Initialize provider
provider_name = os.getenv("BROWSER_PROVIDER")
provider_class = get_provider(provider_name)
browser_provider = provider_class(config)
# Launch browser
cdp_url = await browser_provider.launch()
# Create playwright tool
playwright_tool = PlaywrightToolWithMemory(cdp_url=cdp_url)
await playwright_tool._ensure_browser()
# Add playwright tool (not a BaseTool, added directly)
mcp.add_tool(playwright_tool)
# Create computer tools
executor = BrowserExecutor(playwright_tool)
tool_kwargs = {"executor": executor}
# Add display dimensions from metadata
if metadata:
width = metadata.get("display_width")
height = metadata.get("display_height")
if width and height:
tool_kwargs["width"] = width
tool_kwargs["height"] = height
# Add computer tools (all are BaseTool subclasses)
mcp.add_tool(HudComputerTool(**tool_kwargs))
mcp.add_tool(AnthropicComputerTool(**tool_kwargs))
mcp.add_tool(OpenAIComputerTool(**tool_kwargs))
# Mount hubs
setup_hub.env = playwright_tool
evaluate_hub.env = playwright_tool
mcp.mount(setup_hub)
mcp.mount(evaluate_hub)
@mcp.shutdown
async def shutdown_environment():
"""Cleanup browser resources."""
global browser_provider
if browser_provider:
browser_provider.close()
browser_provider = None
```
## Standard Structure
### Directory Layout
```
my-environment/
├── Dockerfile
├── pyproject.toml
├── controller/ # MCP controller (stdio)
│ ├── __init__.py # mcp = MCPServer()
│ ├── __main__.py # python -m controller → mcp.run()
│ ├── hooks.py # @mcp.initialize / @mcp.shutdown
│ └── tools.py # @mcp.tool(...)
└── environment/ # Optional backend (HTTP/IPC)
└── server.py # e.g., FastAPI app
```
### Dockerfile
```dockerfile theme={null}
FROM python:3.11-slim
WORKDIR /app
# Copy and install
COPY pyproject.toml ./
COPY controller/ ./controller/
COPY environment/ ./environment/
RUN pip install --no-cache-dir -e .
ENV ENV_SERVER_PORT=8005
# Start optional backend, then MCP controller on stdio
CMD ["sh", "-c", "uvicorn environment.server:app --host 0.0.0.0 --port $ENV_SERVER_PORT --log-level warning & python -m controller"]
```
### Hub Module Pattern
Example from text\_2048:
```python theme={null}
# src/hud_controller/setup/__init__.py
from hud.tools.base import BaseHub
setup = BaseHub("setup")
# Import all setup functions to register them
from . import board
__all__ = ["setup"]
# src/hud_controller/setup/board.py
from mcp.types import TextContent
from . import setup
@setup.tool("board")
async def setup_board(board_size: int = 4):
"""Initialize game board."""
game = setup.env # Access environment from hub
game.reset(size=board_size)
    return [TextContent(text=f"{board_size}x{board_size} game initialized", type="text")]
```
## Key Concepts
### Environment State
Three patterns for managing state:
1. **Global variables** (simple environments):
```python theme={null}
game = None
@mcp.initialize
async def initialize_environment(ctx):
global game
game = Game2048()
```
2. **Context class** (complex environments):
```python theme={null}
class EnvironmentContext:
def __init__(self):
self.browser = None
self.page = None
env = EnvironmentContext()
```
3. **Hub env attribute** (for tool access):
```python theme={null}
setup_hub.env = game # Tools access via hub.env
```
### Tool Lifecycle
1. **Setup tools** - Hidden from agents, prepare environment state
2. **Interaction tools** - Available to agents for control
3. **Evaluate tools** - Hidden from agents, score performance
### Progress Notifications
Send [progress updates](https://modelcontextprotocol.io/specification/basic/utilities/progress) during long-running operations:
```python theme={null}
async def send_progress(progress: int, message: str):
if progress_token:
await ctx.session.send_progress_notification(
progress_token=progress_token,
progress=progress,
total=100,
message=message
)
```
Progress notifications follow the [MCP progress specification](https://modelcontextprotocol.io/specification/basic/utilities/progress#progress-flow). The `progressToken` comes from the client's request [metadata](https://modelcontextprotocol.io/specification/basic/index#_meta).
### Metadata Access
Agent metadata flows through initialization:
```python theme={null}
@mcp.initialize
async def initialize_environment(ctx):
# From agent's metadata class variable
width = ctx.meta.get("display_width", 1920) if ctx.meta else 1920
height = ctx.meta.get("display_height", 1080) if ctx.meta else 1080
```
## Testing
```bash theme={null}
# CLI testing
hud debug my-env:latest
hud analyze my-env:latest
```
```python theme={null}
# Python testing
import asyncio
from hud.clients import MCPClient

async def test():
    client = MCPClient({
        "env": {"command": "docker", "args": ["run", "-i", "my-env"]}
    })
    async with client:
        tools = await client.list_tools()
        result = await client.call_tool("setup", {"value": 0})

asyncio.run(test())
```
## See Also
* [Build Environments](/build-environments) - Getting started guide
* [Tools](/reference/tools) - Tool implementation reference
* [Environment Spec](/build-environments/spec) - Technical specification and architecture
# Tasks
Source: https://docs.hud.ai/reference/tasks
SDK reference for task configuration and dataset utilities
The HUD SDK provides the `Task` class for defining agent objectives and dataset utilities for managing task collections.
## Task Class
```python theme={null}
from hud.datasets import Task
```
Pydantic model that defines an agent's objective, setup, and evaluation criteria.
**Fields:**
| Field | Type | Description | Default |
| --------------- | ------------------------------------------ | ---------------------------------------------------------- | -------- |
| `id` | `str \| None` | Unique identifier (UUID recommended) | `None` |
| `prompt` | `str` | Task instruction for the agent | Required |
| `mcp_config` | `dict[str, Any]` | MCP server configuration | Required |
| `setup_tool` | `MCPToolCall \| list[MCPToolCall] \| None` | Tool(s) to prepare environment | `None` |
| `evaluate_tool` | `MCPToolCall \| list[MCPToolCall] \| None` | Tool(s) to score performance | `None` |
| `agent_config` | `dict[str, Any] \| None` | Agent configuration (system\_prompt, allowed\_tools, etc.) | `None` |
| `metadata` | `dict[str, Any]` | Extra task metadata | `{}` |
### Environment Variable Substitution
The `mcp_config` field automatically resolves environment variables using `${VAR_NAME}` syntax:
```python theme={null}
task = Task(
prompt="Navigate to the dashboard",
mcp_config={
"browser": {
"url": "${HUD_MCP_URL:https://mcp.hud.ai/v3/mcp}",
"headers": {
"Authorization": "Bearer ${HUD_API_KEY}",
"Mcp-Image": "hudpython/hud-browser:latest"
}
}
}
)
```
**Substitution happens automatically when Task is created from a dict** - environment variables are resolved using Python's `Template.substitute()` with a defaultdict that returns empty strings for missing variables.
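The mechanism can be illustrated roughly like this (plain `Template.substitute()` as shown here does not handle the `${VAR:default}` fallback syntax, which the SDK resolves on top):
```python theme={null}
import os
from collections import defaultdict
from string import Template

os.environ["HUD_API_KEY"] = "sk-hud-example"
raw = "Bearer ${HUD_API_KEY}${MISSING_VAR}"
# defaultdict(str, ...) makes missing variables resolve to "" instead of raising.
resolved = Template(raw).substitute(defaultdict(str, os.environ))
print(resolved)  # "Bearer sk-hud-example"
```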
### Field Validators
Task automatically:
1. **Parses JSON strings** - `mcp_config` and `metadata` can be JSON strings
2. **Converts dicts to MCPToolCall** - `setup_tool` and `evaluate_tool` dicts are converted
3. **Resolves environment variables** - Only when created from dict (preserves templates in model\_dump())
```python theme={null}
# All of these work:
task = Task(
prompt="Test",
mcp_config='{"server": {"url": "..."}}', # JSON string → dict
setup_tool='{"name": "setup"}' # JSON string → MCPToolCall
)
task = Task(
prompt="Test",
mcp_config={...},
setup_tool={"name": "setup", "arguments": {...}} # dict → MCPToolCall
)
task = Task(
prompt="Test",
mcp_config={...},
evaluate_tool=[ # List of tool calls
{"name": "check_text", "arguments": {"text": "Success"}},
{"name": "check_url", "arguments": {"url": "example.com"}}
]
)
```
## Recommended Evaluation Workflow
When developing and testing agents, follow this progression for optimal debugging and performance:
### Step 1: Single Task Development
Start with individual tasks to debug your agent and environment setup:
```python theme={null}
import hud
from hud.agents import ClaudeAgent
from hud.datasets import Task
task = Task(prompt="Click the submit button", mcp_config={...})
async with hud.async_trace(task.prompt):
agent = ClaudeAgent()
result = await agent.run(task, max_steps=50)
print(f"Result: {result.reward}")
```
**Benefits:**
* Full error stack traces
* Clear log output
* Quick iteration cycle
* Automatic telemetry tracking
### Step 2: Full Dataset Evaluation
Once single tasks work reliably, scale up to full dataset evaluation:
```python theme={null}
from hud.datasets import run_dataset
from hud.agents import ClaudeAgent
results = await run_dataset(
name="SheetBench Evaluation",
dataset="hud-evals/SheetBench-50",
agent_class=ClaudeAgent,
agent_config={"model": "claude-sonnet-4-5-20250929", "validate_api_key": False},
max_concurrent=50,
max_steps=50,
auto_respond=True
)
```
**Concurrency Guidelines:**
* Start with `max_concurrent=50` and adjust based on results
* Increase to 100-200 for faster evaluation (if API limits allow)
* Decrease to 10-20 if hitting rate limits
### Quick Reference
| Stage | Method | Concurrency | Use Case | Debugging |
| ----------- | ------------- | ----------- | ----------------- | --------- |
| Development | Single task | 1 | Initial debugging | Excellent |
| Production | `run_dataset` | 50-200 | Full evaluation | Good |
## Dataset Functions
### run\_dataset
```python theme={null}
from hud.datasets import run_dataset
from hud.agents import ClaudeAgent
results = await run_dataset(
name="SheetBench Evaluation",
dataset="hud-evals/SheetBench-50",
agent_class=ClaudeAgent,
agent_config={"model": "claude-sonnet-4-5-20250929", "validate_api_key": False},
max_concurrent=50,
max_steps=50,
auto_respond=True
)
```
**Parameters:**
| Parameter | Type | Description | Default |
| ---------------- | ------------------------------ | ---------------------------------------------- | --------- |
| `name` | `str` | Job name for tracking | Required |
| `dataset` | `str \| Dataset \| list[dict]` | HF dataset ID, Dataset object, or task dicts | Required |
| `agent_class` | `type[MCPAgent]` | Agent class to instantiate | Required |
| `agent_config` | `dict[str, Any] \| None` | Constructor kwargs for agent | `None` |
| `max_concurrent` | `int` | Maximum concurrent tasks (recommended: 50-200) | `30` |
| `max_steps` | `int` | Max steps per task | `10` |
| `auto_respond` | `bool` | Use ResponseAgent for continuations | `False` |
| `metadata` | `dict[str, Any] \| None` | Job metadata | `None` |
| `split` | `str` | Dataset split when loading by ID | `"train"` |
**Returns:** `list[Any]` - Results in dataset order
**Examples:**
```python theme={null}
from hud.datasets import run_dataset
from hud.agents import ClaudeAgent, OperatorAgent
# From HuggingFace dataset
results = await run_dataset(
"Eval", "hud-evals/SheetBench-50", ClaudeAgent,
{"model": "claude-sonnet-4-5-20250929"}, max_concurrent=50
)
# From Dataset object
from datasets import load_dataset
dataset = load_dataset("hud-evals/OSWorld-Verified", split="train")
results = await run_dataset("Eval", dataset, OperatorAgent)
# From list of dicts
tasks = [{"prompt": "...", "mcp_config": {...}, "setup_tool": {...}}]
results = await run_dataset("Eval", tasks, ClaudeAgent)
```
### fetch\_system\_prompt\_from\_dataset
```python theme={null}
from hud.datasets import fetch_system_prompt_from_dataset
system_prompt = await fetch_system_prompt_from_dataset("hud-evals/SheetBench-50")
```
Fetch `system_prompt.txt` from a HuggingFace dataset repository.
**Returns:** `str | None` - System prompt text if found
**Note:** Requires `huggingface_hub` to be installed.
### save\_tasks
```python theme={null}
from hud.datasets import save_tasks
# IMPORTANT: Pass dicts, not Task objects!
task_dicts = [task.model_dump() for task in tasks]
save_tasks(
tasks=task_dicts,
repo_id="my-org/my-tasks",
private=False,
dataset_name="My Tasks v1.0"
)
```
Save tasks to HuggingFace dataset with JSON string serialization.
**Parameters:**
| Parameter | Type | Description | Default |
| ---------- | ---------------------- | ------------------------------------ | -------- |
| `tasks` | `list[dict[str, Any]]` | Task dictionaries (NOT Task objects) | Required |
| `repo_id` | `str` | HuggingFace repository ID | Required |
| `**kwargs` | `Any` | Additional args for `push_to_hub()` | - |
**Important:** Always pass dictionaries to preserve environment variable templates:
```python theme={null}
# ✅ GOOD - Preserves ${HUD_API_KEY} in the dataset
task_dicts = [task.model_dump() for task in tasks]
save_tasks(task_dicts, "my-dataset")
# ❌ BAD - Would expose resolved API keys!
# save_tasks(tasks, "my-dataset") # Don't do this!
```
**Dataset Schema:**
The function converts complex fields to JSON strings for clean HuggingFace schema:
* `mcp_config` → JSON string
* `setup_tool` → JSON string (if present)
* `evaluate_tool` → JSON string (if present)
* `metadata` → JSON string (if present)
## MCPToolCall Type
```python theme={null}
from hud.types import MCPToolCall
```
Defines a tool invocation with arguments.
**Fields:**
| Field | Type | Description | Default |
| ----------- | ---------------- | ----------------- | -------- |
| `name` | `str` | Tool name to call | Required |
| `arguments` | `dict[str, Any]` | Tool arguments | `{}` |
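For example, the nested 2048 `evaluate` call shown later on this page can be built directly:
```python theme={null}
from hud.types import MCPToolCall

call = MCPToolCall(
    name="evaluate",
    arguments={"name": "max_number", "arguments": {"target": 256}},
)
```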
## Real-World Examples
### Loading Tasks from Datasets
From `examples/run_evaluation.py`:
```python theme={null}
# Load dataset
dataset = load_dataset("hud-evals/SheetBench-50", split="train")
# Get system prompt
dataset_system_prompt = await fetch_system_prompt_from_dataset("hud-evals/SheetBench-50")
# Run single task
sample_task = dataset[0]
task = Task(**sample_task) # Auto-converts dict to Task
# Or run all tasks
results = await run_dataset(
"Full SheetBench Evaluation",
dataset,
ClaudeAgent,
{"model": "claude-3-5-sonnet-20241022"},
max_concurrent=50
)
```
### Task Structure in Datasets
From `environments/text_2048/2048_taskconfigs.json`:
```json theme={null}
{
"id": "2048_target_256",
"prompt": "You are playing 2048 and your goal is to reach the 256 tile.",
"mcp_config": {
"local": {
"command": "docker",
"args": ["run", "--rm", "-i", "hud-text-2048"]
}
},
"setup_tool": {
"name": "setup",
"arguments": {
"name": "board",
"arguments": {"board_size": 4}
}
},
"evaluate_tool": {
"name": "evaluate",
"arguments": {
"name": "max_number",
"arguments": {"target": 256}
}
},
"metadata": {
"difficulty": 256,
"game": "2048"
}
}
```
### Creating and Saving Tasks
```python theme={null}
import os
from uuid import uuid4
from hud.datasets import Task, save_tasks
from hud.types import MCPToolCall
# Set environment variables
os.environ["HUD_API_KEY"] = "sk-hud-..."
os.environ["BROWSER_PROVIDER"] = "anchorbrowser"
# Create tasks with env var templates
tasks = []
for url in ["example.com", "test.com"]:
task = Task(
id=str(uuid4()),
prompt=f"Navigate to {url} and find the login button",
mcp_config={
"browser": {
"url": "${HUD_MCP_URL:https://mcp.hud.ai/v3/mcp}",
"headers": {
"Authorization": "Bearer ${HUD_API_KEY}",
"X-Provider": "${BROWSER_PROVIDER}"
}
}
},
setup_tool=MCPToolCall(
name="setup",
arguments={"name": "navigate", "arguments": {"url": url}}
),
evaluate_tool=MCPToolCall(
name="evaluate",
arguments={"name": "element_exists", "arguments": {"selector": "button.login"}}
),
metadata={
"category": "navigation",
"difficulty": "easy"
}
)
tasks.append(task)
# Save to HuggingFace (preserves env var templates)
task_dicts = [t.model_dump() for t in tasks]
save_tasks(task_dicts, "my-org/navigation-tasks")
```
### Agent Integration
Tasks automatically configure agents:
```python theme={null}
task = Task(
prompt="Complete the form",
mcp_config={...},
setup_tool={...},
evaluate_tool={...},
agent_config={
"system_prompt": "Be very careful with form fields",
"allowed_tools": ["computer", "search"],
"append_setup_output": True
}
)
agent = ClaudeAgent()
# When agent.run(task) is called:
# 1. If no mcp_client, creates one from task.mcp_config
# 2. Applies task.agent_config to agent (system_prompt, allowed_tools, etc.)
# 3. Adds lifecycle tools to agent.lifecycle_tools
# 4. Runs setup_tool (hidden from agent)
# 5. Executes task.prompt
# 6. Runs evaluate_tool (hidden from agent)
# 7. Returns Trace with evaluation reward
```
### Agent Config Options
The `agent_config` field supports the following options:
| Option | Type | Description |
| --------------------- | ----------- | ----------------------------------------------------- |
| `system_prompt` | `str` | Custom system prompt appended to agent's default |
| `allowed_tools` | `list[str]` | Tools the agent can use (replaces `agent_tools`) |
| `disallowed_tools` | `list[str]` | Tools to exclude from the agent |
| `append_setup_output` | `bool` | Include setup output in first message (default: True) |
| `initial_screenshot` | `bool` | Take screenshot before first action (default: True) |
## Best Practices
1. **Use UUIDs for task IDs** - Required for HuggingFace datasets
2. **Save dictionaries, not objects** - Preserves env var templates
3. **Use agent\_config for agent settings** - Centralize agent configuration in one place
4. **Use metadata for filtering** - Category, difficulty, tags
5. **Test locally first** - Before uploading to HuggingFace
6. **Version your datasets** - Use meaningful repo names
## Common Patterns
### Filtering Tasks
```python theme={null}
# Load all tasks
dataset = load_dataset("hud-evals/WebArena-Lite", split="train")
# Filter in Python
login_tasks = [
task for task in dataset
if "login" in task["prompt"].lower()
]
# Convert and run
results = await run_dataset(
"Login Tasks Only",
login_tasks,
ClaudeAgent
)
```
### Custom System Prompts
```python theme={null}
# Override dataset system prompt
results = await run_dataset(
"Custom Evaluation",
"hud-evals/SheetBench-50",
ClaudeAgent,
custom_system_prompt="You are an expert spreadsheet user. Be precise."
)
```
### Environment Variable Management
```python theme={null}
# Tasks preserve templates
task = Task(
prompt="Test",
mcp_config={
"server": {
"api_key": "${API_KEY:default-key}"
}
}
)
# model_dump() preserves the template
data = task.model_dump()
assert data["mcp_config"]["server"]["api_key"] == "${API_KEY:default-key}"
# But task object has resolved value
os.environ["API_KEY"] = "real-key"
task2 = Task(**data)
# task2.mcp_config["server"]["api_key"] is now "real-key"
```
## See Also
* [Task System](/core-concepts/task-system) - Conceptual overview
* [Benchmarks](/evaluate-agents/benchmarks) - Building and running datasets
* [Agents](/reference/agents) - How agents use tasks
# Tools
Source: https://docs.hud.ai/reference/tools
SDK reference for HUD tools and executors
The HUD SDK provides a comprehensive set of tools for environment interaction. All tools inherit from `BaseTool` and return standardized MCP content blocks.
## Base Classes
### BaseTool
```python theme={null}
from hud.tools import BaseTool
```
Abstract base class for all MCP tools. Provides standardized output formatting and MCP registration.
**Constructor Parameters:**
| Parameter | Type | Description | Default |
| ------------- | ----- | --------------------------------------------------- | ------------------- |
| `env` | `Any` | Optional stateful context (game, executor, browser) | `None` |
| `name` | `str` | Tool name for MCP registration | Auto from class |
| `title` | `str` | Human-readable display name | Auto from class |
| `description` | `str` | Tool description | Auto from docstring |
**Abstract Methods:**
```python theme={null}
async def __call__(self, **kwargs: Any) -> list[ContentBlock]
```
**Properties:**
* `mcp` - FastMCP FunctionTool wrapper for registration
**Usage:**
```python theme={null}
from mcp.types import ContentBlock, TextContent

class MyTool(BaseTool):
async def __call__(self, param: str) -> list[ContentBlock]:
result = await self.env.do_something(param)
return [TextContent(text=result, type="text")]
# MCPServer automatically handles BaseTool instances
tool = MyTool(env=my_context)
mcp_server.add_tool(tool) # No need for .mcp property
```
### BaseHub
```python theme={null}
from hud.tools import BaseHub
```
A composition-friendly FastMCP server for organizing tools into nested namespaces.
**Key Features:**
* Internal tool dispatcher pattern
* Automatic resource catalog generation
* Hidden internal tools from MCP clients
**Usage:**
```python theme={null}
from hud.tools.types import EvaluationResult

hub = BaseHub("evaluators")
@hub.tool() # Internal tool, hidden from agents
async def url_match(url: str) -> EvaluationResult:
return EvaluationResult(reward=1.0, done=True)
# Access via dispatcher when mounted on server
# Agents call: evaluators(name="url_match", arguments={"url": "..."})
```
## Core Tools
### BashTool
```python theme={null}
from hud.tools import BashTool
```
Execute bash commands in a persistent shell session.
**Constructor Parameters:**
| Parameter | Type | Description | Default |
| --------- | -------------- | --------------------------- | ------- |
| `session` | `_BashSession` | Pre-configured bash session | `None` |
**Tool Parameters:**
| Parameter | Type | Description | Default |
| --------- | ------ | ------------------------ | -------- |
| `command` | `str` | Bash command to execute | Required |
| `restart` | `bool` | Restart the bash session | `False` |
**Example:**
```python theme={null}
bash = BashTool()
result = await bash(command="ls -la")
# Returns: [TextContent(text="file1.txt\nfile2.txt", type="text")]
# Restart session
await bash(restart=True)
```
### EditTool
```python theme={null}
from hud.tools import EditTool
```
File system editor with undo support.
**Constructor Parameters:**
| Parameter | Type | Description | Default |
| -------------- | ----------------------- | --------------------- | ------- |
| `file_history` | `dict[Path, list[str]]` | Edit history per file | `None` |
**Commands:**
| Command | Parameters | Description |
| ------------- | -------------------------------- | ------------------------------- |
| `view` | `path`, `view_range` | View file or directory contents |
| `create` | `path`, `file_text` | Create new file |
| `str_replace` | `path`, `old_str`, `new_str` | Replace string in file |
| `insert` | `path`, `insert_line`, `new_str` | Insert text at line |
| `undo_edit` | `path` | Undo last edit |
**Example:**
```python theme={null}
editor = EditTool()
# View file
await editor(command="view", path="/path/to/file.py", view_range=[1, 20])
# Create file
await editor(command="create", path="/new/file.py", file_text="print('hello')")
# Replace text
await editor(command="str_replace", path="/file.py",
old_str="old_text", new_str="new_text")
```
### PlaywrightTool
```python theme={null}
from hud.tools import PlaywrightTool
```
Web automation using Playwright browser.
**Constructor Parameters:**
| Parameter | Type | Description | Default |
| --------- | ------ | ---------------------------- | ------- |
| `page` | `Page` | Existing Playwright page | `None` |
| `cdp_url` | `str` | Chrome DevTools Protocol URL | `None` |
**Actions:**
| Action | Parameters | Description |
| ------------------ | ---------------------------- | -------------------------- |
| `navigate` | `url`, `wait_for_load_state` | Navigate to URL |
| `screenshot` | - | Capture page screenshot |
| `click` | `selector` | Click element |
| `type` | `selector`, `text` | Type text in element |
| `get_page_info` | - | Get page title and URL |
| `wait_for_element` | `selector` | Wait for element to appear |
**Example:**
```python theme={null}
tool = PlaywrightTool()
# Navigate
await tool(action="navigate", url="https://example.com")
# Click button
await tool(action="click", selector="button.submit")
# Type text
await tool(action="type", selector="input#search", text="query")
```
### JupyterTool
```python theme={null}
from hud.tools import JupyterTool
```
Execute Python code in a Jupyter kernel.
**Constructor Parameters:**
| Parameter | Type | Description | Default |
| ------------- | ----- | ----------------------------------------------------------------- | ---------------- |
| `url_suffix` | `str` | (Optional) Kernel gateway host:port | "localhost:8888" |
| `kernel_name` | `str` | (Optional) Kernel name | "python3" |
| `kernel_id`   | `str` | (Optional) If set, connect to the existing kernel with this ID    | `""` (empty string) |
**Tool Parameters:**
| Parameter | Type | Description | Default |
| ------------------- | ----- | ---------------------------- | -------- |
| `code` | `str` | Python code to execute | Required |
| `execution_timeout` | `int` | Execution timeout in seconds | 15 |
**Example: Tool Calling**
```python theme={null}
jupyter_tool = JupyterTool()
result = await jupyter_tool(code="print(\"hello world!\")")
# Returns: [TextContent(text="hello world\n", type="text")]
```
**Example: Reuse the same kernel**
```python theme={null}
# When initializing, register the kernel
jupyter_tool = JupyterTool()
JupyterTool.register_shared_kernel("my-jupyter", jupyter_tool.get_kernel_id())
# Reuse the same kernel in other places:
jupyter_tool_reuse = JupyterTool.from_shared_kernel("my-jupyter")
```
## Computer Control Tools
### HudComputerTool
```python theme={null}
from hud.tools import HudComputerTool
```
Universal computer control with automatic scaling and executor selection.
**Constructor Parameters:**
| Parameter | Type | Description | Default |
| ---------------- | -------------------------------- | ------------------- | ----------- |
| `executor` | `BaseExecutor` | Executor to use | Auto-detect |
| `platform_type` | `"auto"`, `"xdo"`, `"pyautogui"` | Executor type | `"auto"` |
| `display_num` | `int` | X display number | From env |
| `width` | `int` | Agent screen width | 1280 |
| `height` | `int` | Agent screen height | 720 |
| `rescale_images` | `bool` | Rescale screenshots | `True` |
**Actions:**
| Action | Parameters | Description |
| ------------ | ------------------------------------------ | --------------------- |
| `screenshot` | - | Capture screen |
| `click` | `x`, `y`, `button`, `pattern`, `hold_keys` | Mouse click |
| `write` | `text`, `enter_after`, `delay` | Type text |
| `press` | `keys` | Press key combination |
| `scroll` | `x`, `y`, `scroll_x`, `scroll_y` | Scroll |
| `drag` | `start_x`, `start_y`, `end_x`, `end_y` | Drag mouse |
**Example:**
```python theme={null}
computer = HudComputerTool()
# Take screenshot
await computer(action="screenshot")
# Click at coordinates
await computer(action="click", x=100, y=200)
# Type text
await computer(action="write", text="Hello World", enter_after=True)
# Press hotkey
await computer(action="press", keys=["ctrl", "c"])
```
### AnthropicComputerTool
```python theme={null}
from hud.tools import AnthropicComputerTool
```
Computer control optimized for Anthropic's Claude models.
**Features:**
* Pre-configured for 1280x720 resolution
* Optimized action names for Claude
* Built-in screenshot scaling
### OpenAIComputerTool
```python theme={null}
from hud.tools import OpenAIComputerTool
```
Computer control optimized for OpenAI models.
**Features:**
* Pre-configured for 1920x1080 resolution
* Simplified action interface
* No automatic screenshot scaling
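Both variants are `BaseTool` instances, so registering them works the same way as any other tool. A minimal sketch (the server name is hypothetical):
```python theme={null}
from hud.server import MCPServer
from hud.tools import AnthropicComputerTool, OpenAIComputerTool

mcp = MCPServer(name="computer-env")  # hypothetical name
# Each variant exposes the action schema its model family expects
mcp.add_tool(AnthropicComputerTool())
mcp.add_tool(OpenAIComputerTool())
```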
## Executors
Executors provide platform-specific implementations for computer control actions.
### BaseExecutor
```python theme={null}
from hud.tools.executors import BaseExecutor
```
Abstract base providing simulation mode for all actions.
**Core Methods:**
* `click(x, y, button, pattern, hold_keys)` - Mouse click
* `write(text, enter_after, delay)` - Type text
* `press(keys)` - Press key combination
* `scroll(x, y, scroll_x, scroll_y)` - Scroll
* `drag(start_x, start_y, end_x, end_y)` - Drag mouse
* `screenshot()` - Capture screen
* `get_screen_size()` - Get display dimensions
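Because unimplemented actions fall back to simulation, `BaseExecutor` is a convenient base for custom executors. A sketch of a hypothetical executor that logs clicks to stderr before delegating (the parameter defaults are assumptions based on the method list above):
```python theme={null}
import sys

from hud.tools.executors import BaseExecutor

class LoggingExecutor(BaseExecutor):  # hypothetical subclass
    async def click(self, x=None, y=None, button="left", pattern=None, hold_keys=None):
        # Log to stderr; stdout is reserved for MCP traffic
        print(f"click at ({x}, {y}) button={button}", file=sys.stderr)
        return await super().click(x=x, y=y, button=button, pattern=pattern, hold_keys=hold_keys)
```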
### PyAutoGUIExecutor
```python theme={null}
from hud.tools.executors import PyAutoGUIExecutor
```
Cross-platform executor using PyAutoGUI library.
**Features:**
* Works on Windows, macOS, Linux
* Real mouse/keyboard control
* Screenshot capture
* Automatic failsafe
**Example:**
```python theme={null}
executor = PyAutoGUIExecutor()
computer = HudComputerTool(executor=executor)
```
### XDOExecutor
```python theme={null}
from hud.tools.executors import XDOExecutor
```
Linux/X11 executor using xdotool.
**Features:**
* Native X11 integration
* Faster than PyAutoGUI on Linux
* Support for X display selection
* Window management capabilities
**Example:**
```python theme={null}
executor = XDOExecutor(display_num=1)
computer = HudComputerTool(executor=executor)
```
## Common Types
### ContentBlock
MCP standard output format (from `mcp.types`):
```python theme={null}
from mcp.types import TextContent, ImageContent
# Text output
TextContent(text="Operation complete", type="text")
# Image output
ImageContent(data="base64_data", mimeType="image/png", type="image")
```
### EvaluationResult
```python theme={null}
from hud.tools.types import EvaluationResult
result = EvaluationResult(
reward=0.8, # Score 0-1
done=True, # Task complete
content="Details", # Optional text
info={"score": 80} # Metadata
)
```
### ContentResult
```python theme={null}
from hud.tools.types import ContentResult
# Helper for building complex outputs
result = ContentResult(
output="Success message",
error="Error if any",
base64_image="screenshot_data",
system="System message"
)
# Convert to ContentBlocks
blocks = result.to_content_blocks()
```
## Integration Examples
### Adding Tools to Environment
```python theme={null}
from hud.server import MCPServer
from hud.tools import BashTool, EditTool
mcp = MCPServer(name="my-env")
# MCPServer handles BaseTool instances automatically
bash = BashTool()
mcp.add_tool(bash) # Internally uses bash.mcp
editor = EditTool()
mcp.add_tool(editor) # Same here
```
### Custom Tool Implementation
```python theme={null}
from hud.tools import BaseTool
from mcp.types import ContentBlock, TextContent
class DatabaseTool(BaseTool):
def __init__(self, db_connection):
super().__init__(
env=db_connection,
name="database",
title="Database Query Tool",
description="Execute SQL queries"
)
async def __call__(self, query: str) -> list[ContentBlock]:
try:
results = await self.env.execute(query)
return [TextContent(text=str(results), type="text")]
except Exception as e:
return [TextContent(text=f"Error: {e}", type="text")]
```
### Hub Pattern for Evaluators
```python theme={null}
from hud.tools import BaseHub
from hud.tools.types import EvaluationResult
evaluators = BaseHub("evaluate")
@evaluators.tool("text_contains")
async def check_text(text: str, target: str) -> EvaluationResult:
return EvaluationResult(
reward=1.0 if target in text else 0.0,
done=True,
content=f"Checking if '{target}' in text"
)
# Use in environment
@mcp.tool()
async def evaluate(name: str, **kwargs):
return await evaluators.call_tool(name, kwargs)
```
### Callback Functions
Callback functions let you monitor certain kinds of actions and hook behavior onto them. You can manage callbacks with three main methods:
```python theme={null}
add_callback(self, event_type: str, callback: Callable)
remove_callback(self, event_type: str, callback: Callable)
_trigger_callbacks(self, event_type: str, **kwargs)
```
*Special note*: callback functions must be defined with `async def`.
For example, to take a screenshot every time the webpage changes in `PlaywrightTool`:
```python theme={null}
import base64
from typing import Any

from hud.tools import PlaywrightTool

class MyPlaywrightTool(PlaywrightTool):
    def __init__(self, context: Any = None, cdp_url: str | None = None) -> None:
        super().__init__(cdp_url=cdp_url)
        self.screenshots: list[str] = []
self.add_callback("webpage_change", self._on_webpage_change)
async def _on_webpage_change(self, **kwargs) -> None:
"""Callback to capture screenshots on webpage changes"""
try:
await self._ensure_browser()
if self.page:
screenshot_bytes = await self.page.screenshot(full_page=True)
screenshot_b64 = base64.b64encode(screenshot_bytes).decode("utf-8")
self.screenshots.append(screenshot_b64)
        except Exception:
            pass  # never let a screenshot failure break the main action
async def navigate(self, url: str, wait_for_load_state="networkidle"):
result = await super().navigate(url, wait_for_load_state)
# Trigger callbacks with event data
await self._trigger_callbacks("webpage_change")
return result
async def click(self, selector: str, **kwargs):
result = await super().click(selector, **kwargs)
await self._trigger_callbacks("webpage_change")
return result
```
## See Also
* [Build Environments](/build-environments) - Using tools in environments
* [Architecture](/core-concepts/architecture) - How tools fit in HUD
* [Adapting Software](/build-environments/adapting-software) - Tool patterns
# RL Quickstart
Source: https://docs.hud.ai/train-agents/quickstart
## Prerequisites
* HUD API key: Remote training requires authentication. Set `HUD_API_KEY` before running:
```bash theme={null}
export HUD_API_KEY="sk-hud-..." # get one at https://hud.ai
# Or persist it locally:
hud set HUD_API_KEY=sk-hud-...
```
* Docker daemon: For local runs (using `--local`) or when training against a local Docker image, ensure Docker Desktop is installed and the Docker daemon is running.
## Quickstart
Install and download a taskset:
```bash theme={null}
uv tool install hud-python
hud get hud-evals/2048-basic
```
### 1) Simple: Train (remote by default)
```bash theme={null}
hud rl 2048-basic.json
```
This launches training remotely and automatically provisions a vLLM server and a trainer for you. You can monitor progress on [https://hud.ai](https://hud.ai). The server persists between runs, so you can rerun training or evaluate against the same endpoint.
Optional baseline first (Claude or Operator):
```bash theme={null}
hud eval 2048-basic.json
```
### 2) Run on your own hardware
Use any provider with at least 2 GPUs (one for inference, one for training) and pass the `--local` flag:
```bash theme={null}
uv tool install hud-python
hud get hud-evals/2048-basic
hud rl 2048-basic.json --local
```
### Recommended setups
* 2× A100: quick iteration, shorter runs
* 8× A100: higher throughput for larger tasksets
Training throughput depends on task complexity and parallelism (`max_parallel_episodes`).
### 3) Build your own environment (hud init)
Create a new MCP environment, develop with hot-reload, and train on a production image:
```bash theme={null}
hud init my-env && cd my-env
hud dev --interactive
# When ready to run:
hud rl
```
Edit `tasks.json` to include the other tasks you want to train on.
See [hud init](/reference/cli/init) for options and details.
## Getting the best performance
Training a good model often requires many iterations over the trainer's parameters. Take the config generated by `hud rl` and vary its values to run a hyperparameter sweep.
For easy launching, specify the tasks and config upfront, and add `--yes` to launch vLLM and training automatically.
```bash theme={null}
hud rl taskset.json --config rl-config.json --yes
```
Additionally, it can be helpful to run an initial analysis on the dataset to determine which tasks are most informative to train on. In that case, either start with a deployed model or run `hud rl` without training, and then:
```bash theme={null}
hud eval taskset.json --full --group-size 6 --max-steps 5
```
This will prompt you for the model choice and produce a table of per-task accuracies. Prefer tasks that score between 10% and 60% for training.
Some general findings from our internal training runs:
* Use as many different tasks per gradient update as possible (runs with 4+ GPUs and a batch size of 50+ are much more stable than single-GPU runs).
* Set the batch size to roughly 2/X, where X is the untrained model's accuracy on the task; for example, a task the untrained model solves 10% of the time (X = 0.1) suggests a batch size of about 20.
### Pricing
Below is pricing by GPU type. Actual prices vary; see [https://hud.ai/project/billing](https://hud.ai/project/billing) for current rates.
**vLLM GPU pricing (2 hosted GPUs)**
| GPU type | Memory | Est. price/hr |
| --------- | ------ | ------------- |
| A100 80GB | 80 GB | \$4.95 |
| H100 80GB | 80 GB | \$7.95 |
**Training GPU pricing**
| GPU type | Memory | Est. price/hr |
| --------- | ------ | ------------- |
| A100 80GB | 80 GB | \$3.95 |
| H100 80GB | 80 GB | \$5.40 |
***
### Learn more
* **Build environments**: complete guide to building environments from scratch
* **`hud rl`**: full command options and usage
# Dataset Design
Source: https://docs.hud.ai/train-agents/tasks
## Tasks format
HUD tasksets can be provided in two formats, both supported:
1. A single JSON file containing a list of task objects (recommended)
```json theme={null}
[
{
"id": "browser_2048_128",
"prompt": "Reach 128 in 2048.",
"mcp_config": {
"hud": {
"url": "https://mcp.hud.ai/v3/mcp",
"headers": {
"Authorization": "Bearer ${HUD_API_KEY}",
"Mcp-Image": "hudevals/hud-browser:0.1.3"
}
}
},
"setup_tool": {"name": "launch_app", "arguments": {"app_name": "2048"}},
"evaluate_tool": {"name": "evaluate", "arguments": {"name": "game_2048_max_number", "arguments": {"target": 128}}}
}
]
```
Save as `2048-basic.json` and run:
```bash theme={null}
hud eval 2048-basic.json
hud rl 2048-basic.json
```
2. A JSONL file with one task object per line (see the one-line sketch after the field list below). Each task object uses the same fields:
* prompt: instruction for the agent
* mcp\_config: where to run the environment (local docker or remote MCP)
* setup\_tool (optional): a tool call to prepare the environment
* evaluate\_tool: a tool call to compute reward
* system\_prompt (optional): extra guidance for the agent
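A single JSONL line carries the same fields as the JSON example above, collapsed onto one line:
```json theme={null}
{"id": "browser_2048_128", "prompt": "Reach 128 in 2048.", "mcp_config": {"hud": {"url": "https://mcp.hud.ai/v3/mcp", "headers": {"Authorization": "Bearer ${HUD_API_KEY}", "Mcp-Image": "hudevals/hud-browser:0.1.3"}}}, "evaluate_tool": {"name": "evaluate", "arguments": {"name": "game_2048_max_number", "arguments": {"target": 128}}}}
```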
## Hosting on HuggingFace
You can host tasksets on the Hub and fetch them with:
```bash theme={null}
hud get hud-evals/2048-basic
```
The command downloads the JSONL task file and places it in your project directory.
This lets you run the full dataset or train on it directly:
```bash theme={null}
hud eval hud-evals/2048-basic
hud rl hud-evals/2048-basic
```
## Tips
* Keep tasks self-contained; use `setup_tool` to open apps or load data
* Ensure `evaluate_tool` returns a numeric reward per episode
* Use small task counts to iterate quickly; scale up once stable
### Learn more
* Learn how to run benchmarks
* Deep-dive into MCP configs and tools