# Build Environments
Source: https://docs.hud.ai/build-environments/index
From `hud init` to production in five steps
Build MCP environments that wrap any software for agent interaction. Think of it in three phases:
**Phase 1: Environment** - Wrap software in MCP tools
**Phase 2: Tasks** - Define evaluation scenarios
**Phase 3: Agents** - Run evaluations and training
## Phase 1 · Create a project (2 min)
```bash theme={null}
# Pick a template: blank, deep-research, browser
hud init my-env
cd my-env
```
Start development servers:
```bash theme={null}
# Terminal 1 - Environment backend
cd environment && uv run uvicorn server:app --reload
# Terminal 2 - MCP server
cd server && uv run hud dev
```
### Edit-save-test flow
1. Open `server/tools.py` and add or tweak a tool (see the sketch below).
2. Save – the MCP server restarts instantly.
3. Visit `http://localhost:8765/docs` to test your tools.
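A hypothetical minimal tool for step 1, using the `MCPServer` decorator pattern shown elsewhere in these docs (your template's `tools.py` may be structured differently):
```python theme={null}
# server/tools.py — a trivial tool to exercise the edit-save-test loop
from hud.server import MCPServer

server = MCPServer("my-env")

@server.tool()
def echo(text: str) -> str:
    """Return the input unchanged; handy for verifying hot-reload."""
    return text
```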
## Phase 2 · Write Tasks (2 min)
Build your environment image first (from the project's top-level folder):
```bash theme={null}
hud build
```
Create `tasks.json` using `docker run`:
```json theme={null}
{
"prompt": "Complete task",
"mcp_config": {
"local": {
"command": "docker",
"args": ["run", "--rm", "-i", "my-env:0.1.0"]
}
},
...your setup and evaluation tools
}
```
See [Task System](/core-concepts/task-system) or the `hud init` README for details.
## Phase 3 · Run Agents
```bash theme={null}
# Test with agents
hud eval tasks.json
# Deploy to registry
hud push
# Train agents on your tasks
hud rl tasks.json
```
***
### Cheatsheet
| Action | Command |
| ---------------- | -------------------------- |
| Create env | `hud init my-env -p blank` |
| Hot-reload dev | `hud dev --build` |
| Interactive test | `hud dev --interactive` |
| Troubleshoot | `hud debug my-env:dev` |
| Build image | `hud build` |
| Push to registry | `hud push` |
| RL training | `hud rl tasks.json` |
***
### Learn more →
* **Blank template walkthrough**: `environments/blank/README.md` in the repo.
* **Technical spec**: [/build-environments/spec](/build-environments/spec)
* **CLI reference**: [`hud init`](/reference/cli/init) • [`hud dev`](/reference/cli/dev) • [`hud build`](/reference/cli/build) • [`hud push`](/reference/cli/push) • [`hud run`](/reference/cli/run)
Have fun – and remember: *stderr for logs, stdout for MCP!*
# Environment Spec
Source: https://docs.hud.ai/build-environments/spec
What qualifies as a HUD environment, and how to run it
If your Docker image starts a stdio MCP server on container startup, it is a HUD environment.
## Architecture
```mermaid theme={null}
graph TD
Agent[Any Agent] -->|MCP JSON‑RPC| Server[MCP Server in Docker]
Server -->|optional IPC/HTTP/gRPC| App[Your Software]
```
* The MCP server can be written in any language or framework.
* Your software can be anything (service, desktop app, game, scripts). The connection between them is your choice.
* Logs must go to stderr; stdout is reserved for MCP protocol packets.
## Minimal requirements
* Docker image starts a process that speaks MCP on stdout/stdin (stdio) by default.
* No non‑MCP output on stdout (all logging to stderr; see the sketch below).
* No required file layout, framework, or endpoints.
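In Python, satisfying the stderr rule can be as simple as the following (a sketch; any language works as long as stdout stays clean):
```python theme={null}
# Send all logs to stderr so stdout carries only MCP JSON-RPC
import logging
import sys

logging.basicConfig(stream=sys.stderr, level=logging.INFO)
logging.info("environment starting")  # goes to stderr, not stdout
```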
Recommended (for HUD RL/evals): provide tools named `setup` and `evaluate`.
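A sketch of those two lifecycle tools, using the counter example from the `tasks.json` shown below (state handling is illustrative):
```python theme={null}
from hud.server import MCPServer

server = MCPServer("counter-env")
state = {"count": 0}

@server.tool()
def setup() -> str:
    """Reset state before an episode starts."""
    state["count"] = 0
    return "ready"

@server.tool()
def evaluate(target: int) -> dict:
    """Score the episode; the framework reads the 'reward' key."""
    return {"reward": 1.0 if state["count"] >= target else 0.0}
```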
## Make it runnable remotely (mcp.hud.ai)
Remote execution is built‑in. Push your image, then either:
* CLI (default is remote):
```bash theme={null}
export HUD_API_KEY=...
hud push
hud run myorg/my-env:latest # remote container with MCP
```
* Programmatic (Task config):
```json theme={null}
{
"mcp_config": {
"hud": {
"url": "https://mcp.hud.ai/v3/mcp",
"headers": {
"Authorization": "Bearer ${HUD_API_KEY}",
"Mcp-Image": "myorg/my-env:latest"
}
}
}
}
```
## Run arbitrary configs locally or remotely
* Local Docker (stdio):
```bash theme={null}
hud run myorg/my-env:latest --local --transport stdio
```
* Local HTTP proxy (for inspectors):
```bash theme={null}
hud run myorg/my-env:latest --local --transport http --port 8765
```
* Remote (default):
```bash theme={null}
hud run myorg/my-env:latest
```
## Notes
* Use `hud dev` only for local hot‑reload workflows; for production, build once and `hud run`.
* Use `hud debug` only when startup/compliance issues occur.
* Any MCP‑speaking Docker image works; controller/backend split is optional.
## Task config examples
```python theme={null}
from hud.datasets import Task
task = Task(
prompt="...",
mcp_config={
"hud": {
"url": "https://mcp.hud.ai/v3/mcp",
"headers": {
"Authorization": "Bearer ${HUD_API_KEY}",
"Mcp-Image": "myorg/my-env:latest"
}
}
},
)
```
The same structure is used by `hud init`’s template and by programmatic tasks.
* Programmatic (inline in code):
```json theme={null}
{
"mcp_config": {
"hud": {
"url": "https://mcp.hud.ai/v3/mcp",
"headers": {
"Authorization": "Bearer ${HUD_API_KEY}",
"Mcp-Image": "myorg/my-env:latest"
}
}
},
"agent_config": {
"allowed_tools": ["act"]
},
"setup_tool": { "name": "setup", "arguments": {} },
"evaluate_tool": { "name": "evaluate", "arguments": {"target": 10} }
}
```
* File `tasks.json` (created by `hud init` templates), local dev variant:
```json theme={null}
[
{
"prompt": "Increment the counter to reach 10",
"mcp_config": {
"local": {
"command": "docker",
"args": ["run", "--rm", "-i", "my-env:latest"]
}
},
"agent_config": {
"allowed_tools": ["act"]
},
"setup_tool": { "name": "setup", "arguments": {} },
"evaluate_tool": { "name": "evaluate", "arguments": {"target": 10} }
}
]
```
Switching this file to remote is as simple as replacing the `mcp_config` with the `hud` section shown above (or using `hud rl`, which will help convert it automatically).
Run tasks with either the CLI or an agent:
```bash theme={null}
# Evaluate tasks with an agent backend
hud eval tasks.json --agent claude
```
or programmatically by passing the Task to your agent (e.g., `ClaudeAgent`, `GenericOpenAIChatAgent`).
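For example, a minimal programmatic sketch that loads the `tasks.json` above (the agent choice is illustrative):
```python theme={null}
import asyncio
import json

from hud.agents import ClaudeAgent
from hud.datasets import Task

async def main():
    # Task(**dict) resolves ${VAR} placeholders at load time
    with open("tasks.json") as f:
        tasks = [Task(**d) for d in json.load(f)]
    result = await ClaudeAgent().run(tasks[0])
    print(f"Reward: {result.reward}")

asyncio.run(main())
```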
# Architecture
Source: https://docs.hud.ai/core-concepts/architecture
How HUD connects agents, environments, and evaluation
HUD is built as a thin orchestration layer on top of MCP, providing structure for agent evaluation, environment management, and telemetry.
## System Overview
```mermaid theme={null}
graph TB
subgraph "Remote"
Dashboard["📊 Dashboard
(hud.ai)"]
API["🔌 MCP Orchestrator
(mcp.hud.ai)"]
end
subgraph Code["Your Code"]
Agent["🤖 Agent
(Claude/Operator)"]
Task["📋 Task
(Prompt + Evaluation)"]
SDK["📦 hud-python"]
end
subgraph "Environments"
LocalEnv["🖥️ Local Docker
(Development)"]
RemoteEnv["☁️ Remote Docker
(100s Parallel)"]
end
subgraph "OpenTelemetry"
Trace["📡 Traces & Metrics"]
end
Dataset["📚 Dataset
(HuggingFace)"]
AnyMCP["🔗 Any MCP Client
(Cursor, Claude, Custom)"]
Agent <--> SDK
Task --> SDK
Dataset <-.-> Task
SDK <-->|"MCP"| LocalEnv
SDK <-->|"MCP"| API
API <-->|"MCP"| RemoteEnv
SDK -->|"hud.trace"| OpenTelemetry
Trace --> Dashboard
AnyMCP -->|"MCP"| API
style Dashboard fill:#3b82f6,stroke:#1e40af,stroke-width:2px,color:#ffffff
style API fill:#6b7280,stroke:#374151,stroke-width:2px,color:#ffffff
style Agent fill:#ef4444,stroke:#dc2626,stroke-width:2px,color:#ffffff
style Task fill:#f97316,stroke:#ea580c,stroke-width:2px,color:#ffffff
style SDK fill:#f59e0b,stroke:#d97706,stroke-width:2px,color:#ffffff
style LocalEnv fill:#06b6d4,stroke:#0891b2,stroke-width:2px,color:#ffffff
style RemoteEnv fill:#10b981,stroke:#047857,stroke-width:2px,color:#ffffff
style Trace fill:#84cc16,stroke:#65a30d,stroke-width:2px,color:#ffffff
style Dataset fill:#a855f7,stroke:#9333ea,stroke-width:2px,color:#ffffff
style AnyMCP fill:#8b5cf6,stroke:#7c3aed,stroke-width:2px,color:#ffffff,stroke-dasharray: 5 5
```
## Core Components
### 1. Agents (`hud.agents`)
Agents make decisions and call tools:
```python theme={null}
from hud.agents import ClaudeAgent, OperatorAgent
# Built-in agents
agent = ClaudeAgent() # Uses Anthropic's Claude
agent = OperatorAgent() # Uses OpenAI Operator
# All agents inherit from MCPAgent
class MCPAgent:
async def initialize(task: str | Task | None)
async def run(prompt_or_task: str | Task, max_steps: int = 10) -> Trace
async def call_tools(tool_calls: list[MCPToolCall]) -> list[MCPToolResult]
```
Agents can auto-create MCP clients from `task.mcp_config` - no manual client setup needed
### 2. Tasks (`hud.Task`)
Tasks define what agents should accomplish:
```python theme={null}
from hud.datasets import Task
task = Task(
prompt="Complete 3 items in the TODO list",
mcp_config={
"hud": {
"url": "https://mcp.hud.ai/v3/mcp",
"headers": {
"Authorization": f"Bearer {api_key}",
"Mcp-Image": "hudpython/hud-browser:latest"
}
}
},
# Tool names and arguments map directly to MCP server tools
setup_tool={
"name": "setup", # Calls def setup(name, num_items) on the environment
"arguments": {
"name": "todo_seed",
"num_items": 5
}
},
evaluate_tool={
"name": "evaluate", # Calls def evaluate(name, expected_count) on the environment
"arguments": {
"name": "todo_completed",
"expected_count": 3
}
}
)
```
The `name` and `arguments` in setup/evaluate tools correspond exactly to the tool names and parameters exposed by the MCP server, as sketched below.
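For illustration, the environment side of this task might define matching tools like this (signatures follow the comments above; bodies are placeholders):
```python theme={null}
from hud.server import MCPServer

server = MCPServer("todo-env")

@server.tool()
def setup(name: str, num_items: int) -> str:
    """Called by setup_tool: seed the TODO list."""
    # ... create num_items items for the 'todo_seed' setup ...
    return f"seeded {num_items} items"

@server.tool()
def evaluate(name: str, expected_count: int) -> dict:
    """Called by evaluate_tool: score completion."""
    completed = 3  # placeholder: read the real count from state
    return {"reward": 1.0 if completed >= expected_count else 0.0}
```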
### 3. MCP Clients (`hud.clients`)
Clients handle the MCP protocol:
```python theme={null}
from hud.clients import MCPClient
# Auto-created by agents, or manual:
client = MCPClient(mcp_config)
await client.initialize()
tools = await client.list_tools()
```
### 4. Environments
Environments are MCP servers exposing tools:
```python theme={null}
from hud.server import MCPServer
server = MCPServer("my-env")
@server.tool()
def move(direction: str) -> str:
    # Environment logic goes here; return a result for the agent
    return f"moved {direction}"
```
### 5. Telemetry (`hud.trace`)
Real-time observability:
```python theme={null}
import hud
async with hud.async_trace("my-evaluation"):
result = await agent.run(task)
# View at hud.ai/traces/{trace_id}
```
## Execution Flow
1. Create a `Task` with prompt and MCP configuration
2. Agent creates MCP client (if needed) and connects to environment
3. Execute `setup_tool` to initialize environment state
4. Agent receives observations, makes decisions, calls tools
5. Execute `evaluate_tool` to score performance
6. All interactions streamed to HUD backend for analysis
## Key Design Principles
1. **Protocol-First**: Everything speaks MCP
2. **Composable**: Mix and match agents, environments, evaluations
3. **Observable**: Built-in telemetry for every interaction
4. **Testable**: Reproducible evaluations with Docker
5. **Extensible**: Easy to add new agents or environments
The `MCPServer` class wraps FastMCP with lifecycle management, making it easy to build Docker-based environments
## Next Steps
* Test and benchmark agents on standardized tasks
* Create MCP-compatible environments for your software
# MCP Protocol
Source: https://docs.hud.ai/core-concepts/mcp-protocol
Understanding the Model Context Protocol that connects agents to environments
HUD uses the Model Context Protocol (MCP) to provide a standard way for AI agents to interact with any software environment through tool calls.
## Why MCP?
Traditional agent frameworks couple agents tightly to specific environments. MCP decouples them.

**Without MCP:**

* Agent code hardcoded for each environment
* No standardization across tools
* Difficult to swap agents or environments

**With MCP:**

* Any agent works with any environment
* Standard protocol for all interactions
* Easy to swap components
## How It Works
MCP standardizes agent-environment communication through JSON-RPC messages. Agents call tools exposed by environments and receive structured responses.
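For example, a `tools/call` request travels as a JSON-RPC 2.0 envelope like this (shown here as a Python dict; the values are illustrative):
```python theme={null}
# The JSON-RPC message an agent's client sends to call a tool
request = {
    "jsonrpc": "2.0",
    "id": 1,
    "method": "tools/call",
    "params": {"name": "move", "arguments": {"direction": "up"}},
}
```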
## Core Concepts
### Tools
[Tools](https://modelcontextprotocol.io/specification/server/tools) are functions exposed by the environment:
```json theme={null}
{
"name": "move",
"description": "Move in a direction",
"inputSchema": {
"type": "object",
"properties": {
"direction": {
"type": "string",
"enum": ["up", "down", "left", "right"]
}
}
}
}
```
### Tool Calls & Results
Agents [call tools](https://modelcontextprotocol.io/specification/server/tools#calling-tools) and receive results:
```python theme={null}
# Agent makes a tool call
result = await client.call_tool("move", {"direction": "up"})
# Environment returns result
{
"content": [{
"type": "text",
"text": "Moved up. New board state: ..."
}]
}
```
### Lifecycle Management
MCP defines a [rigorous lifecycle](https://modelcontextprotocol.io/specification/basic/lifecycle) for connections:
1. **Initialization**: Client and server negotiate capabilities and protocol version with `client.initialize()`
2. **Operation**: Normal tool calling and message exchange
3. **Shutdown**: Clean termination of the connection
The protocol ensures both sides understand each other's capabilities before proceeding.
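A minimal sketch of that lifecycle with the `MCPClient` shown in the Architecture section (the shutdown step is an assumption; the exact cleanup call depends on the client):
```python theme={null}
from hud.clients import MCPClient

# Local stdio config, as in the Transport Options section below
mcp_config = {
    "hud-text-2048": {
        "command": "docker",
        "args": ["run", "--rm", "-i", "hudpython/hud-text-2048:v1.2"],
    }
}

client = MCPClient(mcp_config)
await client.initialize()                 # 1. capability & protocol negotiation
tools = await client.list_tools()         # 2. operation: discovery
result = await client.call_tool("move", {"direction": "up"})  # ...and tool calls
# 3. shutdown: close the client when done (method depends on the client)
```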
## HUD's MCP Extensions
HUD adds conventions on top of MCP:
1. **Setup Tools**: Initialize environment state (`setup_board`, `navigate_to_url`)
2. **Evaluate Tools**: Score agent performance (`evaluate_max_tile`, `contains_text`)
3. **Lifecycle Management**: Clean initialization and shutdown with `client.initialize()` and proper cleanup
See the [Tools Reference](/reference/tools) for implementation details.
## Transport Options
HUD environments are designed for 100% reproducibility through Docker:
### Local Development (stdio)
Run environments locally for development and debugging using [stdio transport](https://modelcontextprotocol.io/specification/basic/transports#stdio):
```python theme={null}
mcp_config = {
"hud-text-2048": {
"command": "docker",
"args": ["run", "--rm", "-i", "hudpython/hud-text-2048:v1.2"]
}
}
```
* **Transport**: [stdio](https://modelcontextprotocol.io/specification/basic/transports#stdio) - JSON-RPC over stdin/stdout
* **Pros**: Full control, easy debugging, no network latency
* **Use case**: Development, testing, single-agent runs
The `-i` flag enables Docker's interactive mode, allowing stdio communication between the client and server process.
### Remote Execution (Streamable HTTP)
Scale to parallel runs with cloud infrastructure using [Streamable HTTP transport](https://modelcontextprotocol.io/specification/basic/transports#streamable-http):
```python theme={null}
mcp_config = {
"hud": {
"url": "https://mcp.hud.ai/v3/mcp",
"headers": {
"Authorization": f"Bearer {os.getenv('HUD_API_KEY')}",
"Mcp-Image": "hudpython/hud-text-2048:v1.2"
}
}
}
```
* **Transport**: [Streamable HTTP](https://modelcontextprotocol.io/specification/basic/transports#streamable-http) with SSE
* **Pros**: Infinite parallelization, no local setup, managed infrastructure
* **Use case**: Benchmarks, CI/CD, multi-agent experiments
HUD's orchestrator manages Docker containers in the cloud, exposing them via HTTP endpoints with Server-Sent Events for streaming.
Both approaches use the exact same Docker image, ensuring identical behavior whether running locally or remotely.
## Next Steps
* See how HUD builds on MCP for agent evaluation
# Task System
Source: https://docs.hud.ai/core-concepts/task-system
How tasks define agent objectives and evaluation criteria
Tasks define what agents should do and how to measure success. They combine a prompt, environment configuration, and optional setup/evaluation phases.
## Lifecycle
```mermaid theme={null}
graph LR
Task["📋 Task
(Prompt + Config)"]
Setup["🔧 Setup Phase
(Initialize State)"]
Execute["🤖 Execute Phase
(Agent Works)"]
Evaluate["📊 Evaluate Phase
(Score Result)"]
Task --> Setup
Setup --> Execute
Execute --> Evaluate
style Task fill:#6366f1,stroke:#4338ca,stroke-width:2px,color:#ffffff
style Setup fill:#ef4444,stroke:#dc2626,stroke-width:2px,color:#ffffff
style Execute fill:#f59e0b,stroke:#d97706,stroke-width:2px,color:#ffffff
style Evaluate fill:#10b981,stroke:#047857,stroke-width:2px,color:#ffffff
```
## Task Structure
```python theme={null}
from hud.datasets import Task
import uuid
task = Task(
# Required fields
prompt="Navigate to the login page and sign in as testuser@example.com",
mcp_config={
"hud": {
"url": "https://mcp.hud.ai/v3/mcp",
"headers": {
"Authorization": "Bearer ${HUD_API_KEY}",
"Mcp-Image": "hudpython/hud-browser:latest"
}
}
},
# Optional fields
id=str(uuid.uuid4()), # Required for HuggingFace datasets
system_prompt="You are an expert web automation agent. Always verify page loads before interacting with elements.",
setup_tool={
"name": "playwright",
"arguments": {
"action": "navigate",
"url": "https://example.com"
}
},
evaluate_tool={
"name": "evaluate",
"arguments": {
"name": "url_contains",
"substring": "/dashboard"
}
},
metadata={"category": "authentication", "difficulty": "easy"}
)
```
## Field Reference
* **`prompt`** (required): The instruction given to the agent. Be specific and clear about success criteria.
* **`mcp_config`** (required): Environment connection configuration. Supports environment variables with `${VAR}` syntax.
* **`id`**: Unique identifier. Required when creating HuggingFace datasets. Use `str(uuid.uuid4())`.
* **`system_prompt`**: Custom system prompt for the agent. Overrides the agent's default system prompt.
* **`setup_tool`**: Tool(s) to initialize environment state, with `name` and `arguments`.
* **`evaluate_tool`**: Tool(s) to score performance, with `name` and `arguments`.
* **`metadata`**: Additional task information for analysis and filtering.
## Environment Variables
Tasks automatically resolve `${VAR}` patterns when deserialized from dictionaries. This is why HuggingFace datasets should store raw dictionaries, not Task objects.
```python theme={null}
# In dataset JSON:
{
"prompt": "Complete the TODO list",
"mcp_config": {
"hud": {
"url": "https://mcp.hud.ai/v3/mcp",
"headers": {
"Authorization": "Bearer ${HUD_API_KEY}",
"Mcp-Image": "${BROWSER_IMAGE}"
}
}
}
}
# When loaded:
task = Task(**task_dict) # Variables resolved here!
# Now task.mcp_config["hud"]["headers"]["Authorization"] = "Bearer sk-hud-..."
```
This enables:
* **Public datasets** without exposing secrets
* **Environment-specific** configurations
* **CI/CD pipelines** with different credentials
## Running Tasks
```python theme={null}
# Agent automatically handles all phases
result = await agent.run(task)
print(f"Success: {result.reward}") # 0.0 to 1.0
```
The agent will:
1. Execute `setup_tool` if provided
2. Work on the `prompt` using available tools
3. Execute `evaluate_tool` to calculate reward
## Working with Datasets
Tasks integrate with HuggingFace datasets:
```python theme={null}
from datasets import load_dataset
from hud.datasets import run_dataset
# Load dataset from HuggingFace
dataset = load_dataset("hud-evals/SheetBench-50", split="train")
# Run agent on entire dataset with automatic parallelization
results = await run_dataset(
"SheetBench Run",
dataset, # Pass the dataset object or name (e.g. "hud-evals/SheetBench-50")
agent_class=ClaudeAgent
)
```
## Creating Datasets
```python theme={null}
from hud.datasets import save_tasks
# Create task dictionaries (NOT Task objects!)
task_dicts = [
{
"prompt": "Navigate to the login page",
"mcp_config": {
"hud": {
"url": "${MCP_URL}",
"headers": {"Authorization": "Bearer ${HUD_API_KEY}"}
}
},
"setup_tool": {"name": "playwright", "arguments": {"action": "navigate", "url": "https://example.com"}},
"evaluate_tool": {"name": "url_match", "arguments": {"pattern": ".*/login"}}
},
# More task dicts...
]
# Save to HuggingFace (preserves ${VAR} templates)
save_tasks(task_dicts, "my-org/my-benchmark")
```
Always save dictionaries, not Task objects. Task objects have already resolved environment variables!
## Best Practices
1. **Use UUIDs**: Always include `id=str(uuid.uuid4())` for dataset tasks
2. **Clear Prompts**: Be specific about success criteria
3. **Template Variables**: Use `${VAR}` syntax for shareable configs
4. **Rich Tools**: Include both `name` and `arguments` in tool definitions
## Next Steps
* Create, run, and publish evaluations
* Build your own MCP-compatible agent
# Tricks
Source: https://docs.hud.ai/core-concepts/tricks
Collection of unique examples with the SDK
## Computer Tool
Working on any web app? Run this simple Python server and let the agent see your screen:
```python theme={null}
from hud.server import MCPServer
from hud.tools import HudComputerTool
mcp = MCPServer(name="HUD Computer", host="0.0.0.0", port=8777)
mcp.add_tool(HudComputerTool())
mcp.run(transport="streamable-http")
```
Run: `python computer_server.py`
Add to cursor:
```
{
"hud-computer": {
"url": "http://localhost:8777/mcp"
}
}
```
More patterns coming soon! Have a cool pattern? [Share it on Discord](https://discord.gg/wkjtmHYYjm).
# Benchmarks
Source: https://docs.hud.ai/evaluate-agents/benchmarks
Create, run, and publish agent evaluations
This page covers how to build datasets, run evaluations, and publish results.
## Quick Start
**CLI**
```bash theme={null}
# Run a local tasks file (interactive agent selection)
hud eval tasks.json
# Run a hosted dataset with Claude
hud eval hud-evals/SheetBench-50 claude --full
```
**SDK**
```python theme={null}
from hud.agents import ClaudeAgent
from hud.datasets import run_dataset
results = await run_dataset(
name="SheetBench Eval",
dataset="hud-evals/SheetBench-50",
agent_class=ClaudeAgent,
max_concurrent=50,
)
```
## Build Benchmarks
### Explore Evaluators
```bash theme={null}
hud analyze hudpython/hud-remote-browser:latest
```
### Create Tasks
```python theme={null}
import uuid
web_tasks = []
web_tasks.append({
"id": str(uuid.uuid4()),
"prompt": "Navigate to the documentation page",
"mcp_config": {
"hud": {
"url": "https://mcp.hud.ai/v3/mcp",
"headers": {
"Authorization": "Bearer ${HUD_API_KEY}",
"Mcp-Image": "hudpython/hud-remote-browser:latest"
}
}
},
"setup_tool": {"name": "setup", "arguments": {"name": "navigate", "arguments": {"url": "https://example.com"}}},
"evaluate_tool": {"name": "evaluate", "arguments": {"name": "url_match", "arguments": {"pattern": ".*/docs.*"}}},
"metadata": {"difficulty": "easy", "category": "navigation"}
})
```
### Save to HuggingFace
```python theme={null}
from hud.datasets import save_tasks
save_tasks(
web_tasks,
repo_id="my-org/web-navigation-benchmark",
private=False,
tags=["web", "navigation", "automation"],
)
```
## Leaderboards
After running, visit your dataset leaderboard and publish a scorecard:
```python theme={null}
from hud.datasets import run_dataset
from hud.agents import ClaudeAgent
results = await run_dataset(
name="Claude Sonnet SheetBench",
dataset="hud-evals/SheetBench-50",
agent_class=ClaudeAgent,
)
# Open https://hud.ai/leaderboards/hud-evals/SheetBench-50
# Click "My Jobs" to see runs and create a scorecard
```
## Best Practices
* Clear, measurable prompts (binary or graded)
* Isolated task state and deterministic setup
* Use metadata tags (category, difficulty)
* Validate locally, then parallelize
* Version datasets; include a `system_prompt.txt`
## See Also
* [`hud eval`](/reference/cli/eval)
* [`hud rl`](/reference/cli/rl)
* [Tasks](/reference/tasks)
* [Agents (SDK)](/reference/agents)
# Create Agents
Source: https://docs.hud.ai/evaluate-agents/create-agents
Build your own MCP-compatible agent for HUD evaluation
Build custom agents that interact with MCP tools to complete tasks. An agent is essentially a loop that calls your LLM, executes tools based on its decisions, and continues until the task is complete.
## How Agents Work
An agent follows this lifecycle:
```mermaid theme={null}
graph LR
A[System Prompt] --> B[User Prompt]
B --> C[LLM Response]
C --> D{Has Tool Calls?}
D -->|Yes| E[Execute Tools]
E --> F[Format Results]
F --> C
D -->|No| G[Task Complete]
```
The agent keeps calling your LLM and executing tools until the LLM stops requesting tools, indicating the task is complete.
## The Four Required Methods
To create an agent, you implement four methods that bridge your LLM with MCP's tool system:
```python theme={null}
from hud.agents import MCPAgent
from hud.types import AgentResponse, MCPToolCall, MCPToolResult
class MyAgent(MCPAgent):
"""Your custom agent implementation."""
async def get_system_messages(self) -> list[Any]:
"""1. Called ONCE at start - returns your LLM's system prompt."""
pass
async def get_response(self, messages: list[Any]) -> AgentResponse:
"""2. Called EACH TURN - sends messages to your LLM, returns its response, optionally adds the assistant message to messages."""
pass
async def format_blocks(self, blocks: list[ContentBlock]) -> list[Any]:
"""3. Called at START - converts initial prompt/context to your LLM format."""
pass
async def format_tool_results(
self, tool_calls: list[MCPToolCall],
tool_results: list[MCPToolResult]
) -> list[Any]:
"""4. Called AFTER TOOLS - converts tool results to your LLM format."""
pass
```
### Understanding When Each Method is Called
The agent loop calls your methods in this sequence:
1. **`get_system_messages()`** - Once at start
2. **`format_blocks()`** - Converts initial task prompt
3. **`get_response()`** - Gets LLM decision, adds assistant message to messages
4. **`format_tool_results()`** - After each tool execution
5. Back to step 3 until done
## What MCPAgent Does For You
### The Agent Loop
The base `MCPAgent` class handles the entire execution loop. When you call `agent.run(task)`:
1. **Initialization Phase**
* Connects to MCP servers (auto-creates client from `task.mcp_config` if needed)
* Discovers available tools from all connected servers
* Applies tool filtering (allowed/disallowed lists)
* Identifies lifecycle tools (setup, evaluate, response)
2. **Setup Phase** (if task.setup\_tool provided)
* Executes setup tools (e.g., navigate to website, initialize environment)
* Optionally appends setup output to initial context (controlled by `append_setup_output`)
* Can include initial screenshots (controlled by `initial_screenshot`)
3. **Main Execution Loop**
```python theme={null}
while not done and step < max_steps:
# Your get_response() is called here
response = await agent.get_response(messages)
if response.tool_calls:
# MCPAgent executes tools for you
results = await agent.call_tools(response.tool_calls)
# Your format_tool_results() is called here
messages.extend(await agent.format_tool_results(tool_calls, results))
else:
done = True
```
4. **Evaluation Phase** (if `task.evaluate_tool` provided)
* Runs evaluation tools to calculate reward
* Extracts reward from result (looks for "reward", "grade", "score" keys)
* Returns Trace object with full execution history
### Tool Management
**Tool Discovery & Filtering**
```python theme={null}
agent = ClaudeAgent(
allowed_tools=["anthropic_computer"], # Only these tools
disallowed_tools=["openai_computer"], # Never these tools
)
```
* **Available Tools**: Retrieved via `self.get_available_tools()` - already filtered
* **Lifecycle Tools**: Automatically detected and hidden from your LLM
* **Response Tools**: Auto-detected (tools with "response" in name) for task completion
### Client Management
MCPAgent handles complex client lifecycle:
```python theme={null}
# Option 1: Provide your own client
from hud.clients import MCPClient
client = MCPClient(mcp_config={...})
agent = MyAgent(mcp_client=client)
# Option 2: Auto-create from task
task = Task(mcp_config={...})
agent = MyAgent() # No client needed
await agent.run(task) # Client created automatically
```
**Auto-cleanup**: Clients created automatically are properly shut down after execution.
### Error Handling
MCPAgent provides robust error handling:
* **Connection Errors**: Helpful messages about MCP server availability
* **Tool Errors**: Captured and returned as MCPToolResult with isError=True
* **Timeout Handling**: Graceful shutdown on tool execution timeouts
* **Trace Always Returns**: Even on errors, you get a Trace object with details
### Message Accumulation
Messages build up over the conversation:
```
[System] → [User Prompt] → [LLM Response] → [Tool Results] → [LLM Response] → ...
```
Your `get_response()` receives the full conversation history each time, allowing your LLM to maintain context.
### Advanced Features
**Response Agent Integration**
```python theme={null}
from hud.agents.misc import ResponseAgent
agent = MyAgent(
response_agent=ResponseAgent() # Auto-decides when to stop/continue
)
```
The ResponseAgent can analyze ambiguous LLM responses like "Should I submit?" and decide whether to continue.
**Telemetry & Tracing**
```python theme={null}
agent = MyAgent(
auto_trace=True, # Automatic span creation
verbose=True # Detailed logging
)
```
**System Prompt Augmentation**
```python theme={null}
task = Task(
system_prompt="Additional instructions...", # Appended to agent's system prompt
...
)
```
## Testing Your Agent
Test your agent on a simple task:
```python theme={null}
import asyncio
import hud
import os
from hud.datasets import Task
async def test_agent():
async with hud.async_trace("test-custom-agent"):
task = Task(
prompt="Navigate to example.com",
mcp_config={
"hud": {
"url": "https://mcp.hud.ai/v3/mcp",
"headers": {
"Authorization": f"Bearer {os.getenv('HUD_API_KEY')}",
"Mcp-Image": "hudpython/hud-remote-browser:latest"
}
}
},
setup_tool={
"name": "setup",
"arguments": {
"name": "navigate",
"arguments": {"url": "https://example.com"}
}
},
evaluate_tool={
"name": "evaluate",
"arguments": {
"name": "url_match",
"arguments": {"pattern": "example.com"}
}
}
)
# Use your custom agent
agent = MyAgent()
result = await agent.run(task)
print(f"Reward: {result.reward}")
asyncio.run(test_agent())
```
## Built-in Agents
HUD provides built-in agents for common LLM providers:
```python theme={null}
from hud.agents import ClaudeAgent, OperatorAgent
# Claude (Anthropic)
claude_agent = ClaudeAgent(
model="claude-sonnet-4-20250514",
)
# Operator (OpenAI-based)
operator_agent = OperatorAgent()
```
Always test your agent with the actual MCP servers you'll use in production.
## Next Steps
* Create, run, and publish evaluations
* API details and built-in agents
## See Also
* [`hud eval`](/reference/cli/eval) - Run agents on tasks/datasets from the CLI
* [`hud rl`](/reference/cli/rl) - Train agents with GRPO on your datasets
* [Agents (SDK Reference)](/reference/agents) – API details and built-in agents
# Introduction
Source: https://docs.hud.ai/index
OSS RL environment + evals toolkit.
**Version 0.4.62** - Latest stable release
* Test Claude, Operator, or custom agents on benchmarks like SheetBench and OSWorld
* Wrap any software in dockerized MCP for scalable and generalizable agent evaluation
* Use reinforcement learning and GRPO on evaluations to improve agent performance
## What is HUD?
HUD connects AI agents to software environments using the Model Context Protocol (MCP). Whether you're evaluating existing agents, building new environments, or training models with RL, HUD provides the infrastructure.
```mermaid theme={null}
graph LR
Agent["🤖 Any Agent
(Claude, Operator, etc.)"]
MCP["🔌 MCP Protocol
(Tool Calls)"]
Env["📦 Any Environment
(Browser, OS, etc.)"]
Agent -->|"call_tool()"| MCP
MCP -->|"click(x, y)"| Env
Env -->|"screenshot"| MCP
MCP -->|"get_response()"| Agent
style Agent fill:#3b82f6,stroke:#1e40af,stroke-width:2px,color:#ffffff
style MCP fill:#f59e0b,stroke:#d97706,stroke-width:2px,color:#ffffff
style Env fill:#10b981,stroke:#047857,stroke-width:2px,color:#ffffff
```
## Why HUD?
* **🔌 MCP-native**: Any agent can connect to any environment
* **📡 Live telemetry**: Debug every tool call at [hud.ai](https://hud.ai)
* **🚀 Production-ready**: From local Docker to cloud scale
* **🎯 Built-in benchmarks**: OSWorld-Verified, SheetBench-50, and more
* **🔧 CLI tools**: Create, develop, run, and train with `hud init`, `hud dev`, `hud run`, `hud eval`, `hud rl`
Run your first agent evaluation with zero setup
```bash theme={null}
uvx hud-python quickstart
```
## Quick Example
```python theme={null}
import asyncio, os, hud
from hud.datasets import Task
from hud.agents import ClaudeAgent
async def main():
# Define evaluation task with remote MCP
task = Task(
prompt="Win a game of 2048 by reaching the 128 tile",
mcp_config={
"hud": {
"url": "https://mcp.hud.ai/v3/mcp",
"headers": {
"Authorization": f"Bearer {os.getenv('HUD_API_KEY')}",
"Mcp-Image": "hudevals/hud-text-2048:0.1.3"
}
}
},
setup_tool={"name": "setup", "arguments": {"name": "board", "arguments": { "board_size": 4}}},
evaluate_tool={"name": "evaluate", "arguments": {"name": "max_number", "arguments": {"target": 64}}}
)
# Run agent (auto-creates MCP client)
agent = ClaudeAgent()
result = await agent.run(task)
print(f"Score: {result.reward}")
asyncio.run(main())
```
## Community
Star the repo and contribute
Join our community
### Are you a startup building agents?
[📅 Hop on a call](https://cal.com/jay-ram-z6st6w/demo) or [📧 founders@hud.ai](mailto:founders@hud.ai)
# LLM Quickstart
Source: https://docs.hud.ai/llm-quickstart
Add context about hud to any coding agent
## Setup
```bash theme={null}
claude mcp add --transport http docs-hud https://docs.hud.ai/mcp
```
Add to MCP settings:
```json theme={null}
"docs-hud": {
"url": "https://docs.hud.ai/mcp"
}
```
Or use one-click install:
[Install in Cursor](https://cursor.com/en/install-mcp?name=docs-hud-so&config=eyJ1cmwiOiJodHRwczovL2RvY3MuaHVkLnNvL21jcCJ9)
Alternatively, copy-paste the full `llms-full.txt` into your assistant's context.
## What you get
Your AI assistant gains access to:
* Complete HUD API reference and methods
* Working code examples and implementations
* Architecture patterns and best practices
* Real-time updates as documentation evolves
Try asking your assistant: "How do I create a custom agent in HUD?" or "Help me debug MCP tool calls"
## Next steps
* Get started with HUD in 3 minutes
* Create custom MCP environments
# Quickstart
Source: https://docs.hud.ai/quickstart
Run your first agent evaluation in 3 minutes
Get up and running with HUD in minutes. This guide walks you through installation, basic setup, and running your first agent evaluation.
## Quick Clone
The fastest way to get started:
```bash theme={null}
# Clone a complete example project with uv
uvx hud-python quickstart
```
This sets you up with a working agent evaluation example you can run immediately.
## Installation
```bash theme={null}
pip install hud-python
```
```bash theme={null}
# Includes AI providers and telemetry
pip install "hud-python[agent]"
```
```bash theme={null}
# Install CLI in isolated environment
uv tool install hud-python
```
## API Keys
Set your API keys as environment variables:
```bash theme={null}
export HUD_API_KEY="sk-hud-..." # Get from hud.ai
export ANTHROPIC_API_KEY="sk-ant-..." # For Claude agents
export OPENAI_API_KEY="sk-..." # For OpenAI agents
```
Create a `.env` file in your project root to manage API keys locally
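One way to load that `.env` file in Python is python-dotenv (an assumption on our part; it is not bundled with `hud-python`, and any loader works):
```python theme={null}
# pip install python-dotenv
import os

from dotenv import load_dotenv

load_dotenv()  # reads .env from the current directory
assert os.getenv("HUD_API_KEY"), "HUD_API_KEY not set"
```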
## Your First Agent
Run an agent on the 2048 game environment:
```python theme={null}
import asyncio, os
import hud
from hud.datasets import Task
from hud.agents import ClaudeAgent
async def main():
# The trace context captures ALL agent interactions within a "task run"
# Everything inside this context shows up as one trace on hud.ai
async with hud.async_trace("quickstart-2048"):
# Define task with remote MCP environment
task = Task(
prompt="Win a game of 2048 by reaching the 128 tile",
mcp_config={
"hud": {
"url": "https://mcp.hud.ai/v3/mcp",
"headers": {
"Authorization": f"Bearer {os.getenv('HUD_API_KEY')}",
"Mcp-Image": "hudevals/hud-text-2048:0.1.3"
}
}
},
setup_tool={"name": "setup", "arguments": {"name": "board", "arguments": { "board_size": 4}}},
evaluate_tool={"name": "evaluate", "arguments": {"name": "max_number", "arguments": {"target": 64}}}
)
# Run agent (auto-creates MCP client from task.mcp_config)
agent = ClaudeAgent()
result = await agent.run(task)
print(f"Max tile reached: {result.reward}")
asyncio.run(main())
```
The trace context ensures that the entire task run - from setup through evaluation - appears as one coherent trace on the platform
## What just happened?
1. **Task Definition**: We created a `Task` with:
* A prompt telling the agent what to do
* An `mcp_config` pointing to a remote MCP environment
* Setup and evaluation tools to initialize and score the game
2. **Auto Client**: The agent automatically created an MCP client from `task.mcp_config`
3. **Telemetry**: The `trace` context captured all interactions for debugging
4. **Evaluation**: The `evaluate` tool (running the `max_number` evaluator) returned the highest tile as the reward
## Next Steps
* Learn how agents connect to environments
* Test on SheetBench and OSWorld
* Create your own MCP environment
## CLI Quick Reference
```bash theme={null}
# Create sample environment
hud init
# Start hot-reload development server for an environment
hud dev . --build --interactive
# Train a model on your environment
hud rl
```
See the [CLI Reference](/reference/cli/overview) for detailed command documentation
# Agents
Source: https://docs.hud.ai/reference/agents
SDK reference for HUD agent classes
The HUD SDK provides a base `MCPAgent` class and several pre-built agent implementations for interacting with MCP environments.
## Base Class
### MCPAgent
```python theme={null}
from hud.agents import MCPAgent
```
Abstract base class for all MCP-enabled agents. Handles the agent loop, MCP client lifecycle, tool discovery/filtering (including lifecycle tools like `setup`, `evaluate`, `response`), and standardized telemetry/logging.
**Constructor Parameters:**
| Parameter | Type | Description | Default |
| --------------------- | ---------------- | -------------------------------------------- | -------------- |
| `mcp_client` | `AgentMCPClient` | MCP client for server connections | `None` |
| `allowed_tools` | `list[str]` | List of tool names to allow | `None` (all) |
| `disallowed_tools` | `list[str]` | List of tool names to disallow | `[]` |
| `system_prompt` | `str` | System prompt to seed the conversation | Default prompt |
| `append_setup_output` | `bool` | Append setup tool output to initial context | `True` |
| `initial_screenshot` | `bool` | Include an initial screenshot when supported | `True` |
| `model_name` | `str` | Model label for telemetry/logging | `"mcp-agent"` |
| `response_agent` | `ResponseAgent` | Optional auto-continue/stop helper | `None` |
| `auto_trace` | `bool` | Enable automatic tracing spans | `True` |
| `verbose` | `bool` | Verbose console logs for development | `False` |
**Key Methods:**
```python theme={null}
async def initialize(task: str | Task | None = None) -> None
"""Initialize agent with task-specific configuration."""
async def run(prompt_or_task: str | Task | dict[str, Any], max_steps: int = 10) -> Trace
"""Run agent with prompt or task. Returns Trace with results."""
async def call_tools(tool_call: MCPToolCall | list[MCPToolCall]) -> list[MCPToolResult]
"""Execute tool calls through MCP client."""
def get_available_tools() -> list[types.Tool]
"""Get filtered list of available tools (excludes lifecycle)."""
def get_tool_schemas() -> list[dict]
"""Get tool schemas formatted for the model."""
```
**Abstract Methods (must implement):**
```python theme={null}
async def get_system_messages() -> list[Any]
"""Get system prompt formatted for the model."""
async def get_response(messages: list[Any]) -> AgentResponse
"""Get model response including tool calls."""
async def format_blocks(blocks: list[ContentBlock]) -> list[Any]
"""Format content blocks into model messages."""
async def format_tool_results(tool_calls: list[MCPToolCall],
tool_results: list[MCPToolResult]) -> list[Any]
"""Format tool results for the model."""
```
**Class Variables:**
* `metadata: dict[str, Any] | None` - Injected into MCP initialize requests
* `required_tools: list[str]` - Tools that must be present or initialization fails
**Auto-Client Creation:**
If no `mcp_client` is provided but a `Task` with `mcp_config` is passed to `run()`, an MCPClient is automatically created and cleaned up.
**Lifecycle & Filtering:**
* Lifecycle tools (e.g., `setup`, `evaluate`, and auto-detected `response`) are hidden from the model but available to the framework
* Use `allowed_tools` / `disallowed_tools` to control the tool surface visible to your model
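For illustration, a subclass might set the class variables like so (the values here are hypothetical):
```python theme={null}
from hud.agents import MCPAgent

class MyAgent(MCPAgent):
    metadata = {"session": "demo"}  # hypothetical payload injected into MCP initialize
    required_tools = ["computer"]   # initialization fails if the environment lacks this tool
```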
## Pre-built Agents
### ClaudeAgent
```python theme={null}
from hud.agents import ClaudeAgent
```
Claude-specific implementation using Anthropic's API.
**Constructor Parameters:**
| Parameter | Type | Description | Default |
| ------------------- | ---------------- | --------------------------------- | ---------------------------- |
| `model_client` | `AsyncAnthropic` | Anthropic client | Auto-created |
| `model` | `str` | Claude model to use | `"claude-sonnet-4-20250514"` |
| `max_tokens` | `int` | Maximum response tokens | `4096` |
| `use_computer_beta` | `bool` | Enable computer-use beta features | `True` |
| `validate_api_key` | `bool` | Validate key on init | `True` |
**Features:**
* Native Claude tool calling
* Automatic prompt caching
* Computer-use beta support
* Display metadata injection
**Example:**
```python theme={null}
agent = ClaudeAgent(
model="claude-sonnet-4-20250514",
max_tokens=8192,
)
result = await agent.run(
Task(
prompt="Navigate to example.com",
mcp_config={
"hud": {
"url": "https://mcp.hud.ai/v3/mcp",
"headers": {
"Authorization": "Bearer ${HUD_API_KEY}",
"Mcp-Image": "hudpython/hud-remote-browser:latest"
}
}
},
evaluate_tool={"name": "evaluate", "arguments": {"name": "url_match", "arguments": {"pattern": "example.com"}}}
)
)
```
### OperatorAgent
```python theme={null}
from hud.agents import OperatorAgent
```
OpenAI Operator-style agent built on the Responses API with computer-use.
**Constructor Parameters:**
| Parameter | Type | Description | Default |
| ------------------ | -------------------------------------------- | ---------------------- | ------------------------ |
| `model_client` | `AsyncOpenAI` | OpenAI client | Auto-created |
| `model` | `str` | Responses model to use | `"computer-use-preview"` |
| `environment` | `Literal["windows","mac","linux","browser"]` | Computer environment | `"linux"` |
| `validate_api_key` | `bool` | Validate key on init | `True` |
**Features:**
* OpenAI Responses computer-use
* Operator-style system prompt guidance
* Display metadata injection
### GenericOpenAIChatAgent
```python theme={null}
from hud.agents import GenericOpenAIChatAgent
```
OpenAI-compatible chat.completions agent that works with any endpoint implementing the OpenAI schema (OpenAI, vLLM, Ollama, Together, custom, etc.).
**Constructor Parameters:**
| Parameter | Type | Description | Default |
| ------------------- | ---------------- | ------------------------------------------------- | --------------- |
| `openai_client` | `AsyncOpenAI` | OpenAI-compatible client instance | Required |
| `model_name` | `str` | Chat model name | `"gpt-4o-mini"` |
| `completion_kwargs` | `dict[str, Any]` | Extra args forwarded to `chat.completions.create` | `{}` |
**Example (local or custom endpoint):**
```python theme={null}
from openai import AsyncOpenAI
openai_client = AsyncOpenAI(
base_url="http://localhost:11434/v1", # e.g., Ollama
api_key="not-needed",
)
agent = GenericOpenAIChatAgent(
openai_client=openai_client,
model_name="llama3.1",
completion_kwargs={"temperature": 0.2},
)
```
### LangChainAgent
```python theme={null}
from hud.agents import LangChainAgent
```
LangChain integration for using any LangChain-compatible model.
**Constructor Parameters:**
| Parameter | Type | Description | Default |
| --------- | ------------------- | -------------------------- | -------- |
| `llm` | `BaseLanguageModel` | LangChain-compatible model | Required |
**Example:**
```python theme={null}
from langchain_anthropic import ChatAnthropic
llm = ChatAnthropic(model="claude-3-opus-20240229")
agent = LangChainAgent(llm=llm)
```
### GroundedOpenAIChatAgent
```python theme={null}
from hud.agents.grounded_openai import GroundedOpenAIChatAgent
```
OpenAI chat agent with a separate grounding model for element detection. The planning model issues natural-language "computer" tool calls that are grounded to coordinates by a dedicated vision model.
**Constructor Parameters:**
| Parameter | Type | Description | Default |
| --------------------- | ---------------- | ------------------------------------------ | ------------------- |
| `grounder_config` | `GrounderConfig` | Configuration for the grounding model | Required |
| `model_name` | `str` | OpenAI chat model name | `"gpt-4o-mini"` |
| `allowed_tools` | `list[str]` | Exposed tool list (usually `["computer"]`) | `None` |
| `append_setup_output` | `bool` | Append setup output to first turn | `False` |
| `system_prompt` | `str` | System prompt to use | Opinionated default |
> Note: The grounded agent is for advanced use-cases where you want to decouple planning from grounding. For most users, `ClaudeAgent` or `OperatorAgent` is sufficient.
## Helper Classes
### ResponseAgent
Base class for auto-response handlers that decide when to continue or stop.
```python theme={null}
from hud.agents.misc import ResponseAgent
class MyResponseAgent(ResponseAgent):
async def determine_response(self, agent_output: str) -> str:
if "task complete" in agent_output.lower():
return "STOP"
return "Continue with the next step"
```
## Common Types
### AgentResponse
Key fields:
* `tool_calls: list[MCPToolCall]`
* `done: bool`
* `content: str | None`
* `reasoning: str | None`
* `info: dict[str, Any]`
* `isError: bool`
### MCPToolCall
Key fields:
* `id: str` — unique identifier (auto-generated if not provided)
* `name: str`
* `arguments: dict[str, Any]`
### MCPToolResult
Key fields:
* `content: list[ContentBlock]`
* `structuredContent: dict[str, Any] | None`
* `isError: bool`
### Trace
Key fields:
* `reward: float`
* `done: bool`
* `content: str | None`
* `isError: bool`
* `info: dict[str, Any]`
* `task: Task | None`
* `trace: list[TraceStep]` — execution trace steps
* `messages: list[Any]` — final conversation state
## Usage Examples
### Simple Prompt Execution
```python theme={null}
from hud.agents import ClaudeAgent
from hud.datasets import Task
agent = ClaudeAgent()
result = await agent.run(
Task(
prompt="Click the submit button",
mcp_config={
"hud": {
"url": "https://mcp.hud.ai/v3/mcp",
"headers": {
"Authorization": "Bearer ${HUD_API_KEY}",
"Mcp-Image": "hudpython/hud-remote-browser:latest"
}
}
}
),
max_steps=5,
)
print("Done:", result.done, "Error:", result.isError)
```
### Task Execution with Auto-Client
```python theme={null}
from hud.agents import OperatorAgent
from hud.datasets import Task
# No client needed - auto-created from task
agent = OperatorAgent()
task = Task(
prompt="Find the price of the product",
mcp_config={
"hud": {
"url": "https://mcp.hud.ai/v3/mcp",
"headers": {
"Authorization": "Bearer ${HUD_API_KEY}",
"Mcp-Image": "hudpython/hud-remote-browser:latest"
}
}
},
setup_tool={
"name": "setup",
"arguments": {"url": "https://example.com"}
},
evaluate_tool={
"name": "evaluate",
"arguments": {"check": "price_found"}
}
)
# Client created automatically
result = await agent.run(task, max_steps=20)
print(f"Reward: {result.reward}")
```
### Custom Agent Implementation
```python theme={null}
from hud.agents import MCPAgent
from hud.types import AgentResponse
import hud
class MyCustomAgent(MCPAgent):
metadata = {"custom": "metadata"}
async def get_system_messages(self) -> list[dict]:
return [{
"role": "system",
"content": self.system_prompt
}]
@hud.instrument(span_type="agent", record_args=False, record_result=True)
async def get_response(self, messages: list[dict]) -> AgentResponse:
# Your LLM call here
response = await self.llm.chat(messages)
return AgentResponse(
content=response.content,
tool_calls=[
MCPToolCall(name=tc.name, arguments=tc.args)
for tc in response.tool_calls
],
done=response.stop_reason == "stop"
)
async def format_blocks(self, blocks: list[ContentBlock]) -> list[dict]:
content = []
for block in blocks:
if block.type == "text":
content.append({"type": "text", "text": block.text})
elif block.type == "image":
content.append({
"type": "image",
"image": {"data": block.data, "format": "png"}
})
return [{"role": "user", "content": content}]
async def format_tool_results(self, tool_calls, tool_results) -> list[dict]:
return [{
"role": "tool",
"content": result.content,
"tool_call_id": call.name
} for call, result in zip(tool_calls, tool_results)]
```
## See Also
* [Create Agents](/evaluate-agents/create-agents) - Tutorial on building agents
* [Tasks](/reference/tasks) - Task configuration reference
* [Architecture](/core-concepts/architecture) - How agents fit in HUD
* [`hud eval`](/reference/cli/eval) - Run agents on tasks/datasets
* [`hud rl`](/reference/cli/rl) - Train agents with GRPO
# hud analyze
Source: https://docs.hud.ai/reference/cli/analyze
Inspect MCP environments quickly from metadata or live containers
The `hud analyze` command inspects MCP environments to discover tools and capabilities. By default it uses cached metadata; pass `--live` (or Docker args) to run a container.
## Usage
```bash theme={null}
hud analyze [DOCKER_ARGS...] [OPTIONS]
hud analyze --config mcp.json
hud analyze --cursor
```
## Arguments
* `IMAGE`: Docker image to analyze (omit when using `--config` or `--cursor`)
## Options
* `--format`, `-f`: Output format: `interactive`, `json`, or `markdown`
* `--verbose`, `-v`: Show input schemas in fast mode and more details in live mode
* `--live`: Run container for live analysis (slower but exact)
* `--config`, `-c`: JSON file with MCP configuration (always live)
* `--cursor`: Analyze a Cursor MCP server (always live)
## Analysis Modes
### Fast Mode (Default)
Sources (in order):
1. Local HUD cache: `~/.hud/envs//hud.lock.yaml`
2. HUD registry metadata (if available)
3. Basic Docker manifest info
```bash theme={null}
# Instant results from metadata
hud analyze myorg/my-env:latest
```
### Live Mode
Runs the container and queries the server:
```bash theme={null}
# Full analysis with running container
hud analyze myorg/my-env:latest --live
# Passing env vars implies live mode
hud analyze myorg/my-env:latest --live -e API_KEY=secret
```
> Providing Docker args (after the image) automatically switches to live mode.
## Output Formats
* `interactive` (default): rich table/tree views
* `json`: structured output for tooling
* `markdown`: docs-friendly summaries
## Examples
```bash theme={null}
# Fast metadata analysis
hud analyze hudpython/text-2048:latest
# From lock file
hud analyze ./locks/text-2048.yaml --format json
# Live analysis with env vars
hud analyze my-env:latest --live -e API_KEY=test
# Cursor and config
hud analyze --cursor my-dev-server --live
hud analyze --config mcp-config.json
```
## Tips
* Prefer fast mode during iteration; switch to `--live` for final validation.
* If metadata isn’t found, the command suggests pulling or running live.
## See Also
* [`hud pull`](/reference/cli/pull)
* [`hud debug`](/reference/cli/debug)
* [`hud build`](/reference/cli/build)
* [`hud run`](/reference/cli/run)
# hud build
Source: https://docs.hud.ai/reference/cli/build
Build production images and generate lock files for reproducible environments
The `hud build` command builds your Docker image, analyzes its MCP server, and writes a reproducible `hud.lock.yaml`.
## Usage
```bash theme={null}
hud build [DIRECTORY] [OPTIONS]
```
## Arguments
* `DIRECTORY`: Environment directory containing Dockerfile and pyproject.toml
## Options
* `--tag`: Docker image tag to use. Defaults to `[tool.hud.image]` or `{dir}:dev`.
* `--no-cache`: Build without using Docker cache.
* `--verbose`, `-v`: Show detailed build output.
* `--platform`: Target platform for the Docker build. Defaults to `linux/amd64` if unspecified.
## What It Does
1. Runs `docker build` (twice): a temporary build for analysis, then a labeled build for the final image and version tags.
2. Runs the container via MCP to measure initialize time and list tools (count and input schemas).
3. Writes `hud.lock.yaml` with:
   * `version`: lock format version
   * `image`: `{tag}@sha256:` after final build
   * `build`: generatedAt, hudVersion, directory, internal `version` (semver, auto‑incremented), `sourceHash`, optional `sourceFiles`
   * `environment`: initializeMs, toolCount, and optional `variables` with `provided`, `required`, `optional`
   * `tools`: name, description, inputSchema (for RL/evals tooling)
4. Adds labels (`org.hud.*`) and tags the image with both the chosen tag and the internal version.
## Examples
```bash theme={null}
# Basic build (auto tag)
hud build
# Custom tag
hud build . --tag v1.2.0
# Clean build with logs
hud build . --no-cache --verbose
# Force cross‑platform (useful on Apple Silicon)
hud build . --platform linux/amd64
```
## Image Naming
If `--tag` is not provided:
1. Read `[tool.hud.image]` from `pyproject.toml`
2. Otherwise use `{directory-name}:dev`
## Environment Variables
Pass runtime environment variables during analysis with `-e`/`--env` flags:
```bash theme={null}
hud build . -e API_KEY=secret -e DEBUG=true
```
They’re recorded in the lock file as placeholders under `environment.variables.provided`; missing required variables are listed in `environment.variables.required`.
## Lock File
Minimal structure you’ll see:
```yaml theme={null}
version: "1.0"
image: "my-env:dev@sha256:..."
build:
generatedAt: "2025-01-01T12:00:00Z"
hudVersion: "0.x.y"
directory: "my-env"
version: "0.1.0"
sourceHash: "..."
environment:
initializeMs: 450
toolCount: 3
variables:
provided: { API_KEY: "${API_KEY}" }
required: ["OTHER_KEY"]
tools:
- name: setup
description: Initialize environment
inputSchema: { type: object }
```
## Next Steps
```bash theme={null}
# Hot‑reload development
hud dev
# Run the built image locally (stdio)
hud run my-env:dev --local
# Push to registry
hud push
```
## See Also
* [`hud init`](/reference/cli/init)
* [`hud dev`](/reference/cli/dev)
* [`hud push`](/reference/cli/push)
* [`hud analyze`](/reference/cli/analyze)
* [Build Environments](/build-environments)
# hud debug
Source: https://docs.hud.ai/reference/cli/debug
Test MCP environments through 5 validation phases
The `hud debug` command validates MCP environments through 5 progressive phases.
## Synopsis
```bash theme={null}
hud debug [IMAGE|DIRECTORY] [OPTIONS] [DOCKER_ARGS...]
hud debug --config PATH
hud debug --cursor SERVER_NAME
```
## Options
| Option | Description | Default |
| ------------------- | --------------------------------------------- | ------- |
| `--config`, `-c` | JSON config file with MCP configuration | - |
| `--cursor` | Debug a Cursor MCP server | - |
| `--build`, `-b` | Build image before debugging (directory mode) | false |
| `--max-phase`, `-p` | Maximum debug phase (1-5) | 5 |
## Debug Phases
### Phase 1: Container Startup
```
🚀 Phase 1: Container startup
✓ Container started successfully
✓ Received output on stderr
```
Checks: Process starts, outputs to stderr, doesn't exit
### Phase 2: MCP Initialization
```
🔗 Phase 2: MCP initialization
✓ Received valid initialize response
✓ Protocol version: 1.0
```
Checks: JSON-RPC communication, protocol compatibility
### Phase 3: Tool Discovery
```
🔧 Phase 3: Tool discovery
✓ Found 8 tools:
• setup_board - Initialize game board
• move - Move in direction
```
Checks: Tools registered, valid schemas
### Phase 4: Tool Execution
```
🧪 Phase 4: Tool testing
→ Testing tool: setup_board
✓ Tool executed successfully
```
Checks: Tools callable, return valid responses
### Phase 5: Readiness Check
```
✅ Phase 5: Readiness check
✓ Ready for agent interaction
```
Checks: Overall health, state persistence
Use `hud dev` for hot-reload development after debug passes to speed up iteration.
## Examples
### Docker Image
```bash theme={null}
# Basic
hud debug my-server:latest
# With environment variables (Docker args after image)
hud debug my-server:latest -e DEBUG=true -e LOG_LEVEL=debug
# Stop at phase 3
hud debug my-server:latest --max-phase 3
```
### Directory Mode
```bash theme={null}
# Build then debug from a directory containing a Dockerfile and pyproject.toml
hud debug . --build
```
## Common Issues
### Phase 1 Failures
* **Container exits**: Check CMD/ENTRYPOINT
* **No stderr**: Route logs to stderr
* **Permission denied**: Check file permissions
### Phase 2 Failures
* **Invalid JSON**: Only JSON-RPC on stdout
* **Timeout**: Check for blocking operations
### Phase 3 Failures
* **No tools**: Check `@server.tool()` decorators
* **Invalid schema**: Add type hints and docstrings (see the sketch below)
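A tool whose schema validates cleanly can be as simple as this sketch (assuming the `MCPServer` decorator pattern):
```python theme={null}
from hud.server import MCPServer

server = MCPServer("my-env")

@server.tool()
def move(direction: str) -> str:
    """Move in a direction: up, down, left, or right."""
    # The type hint and docstring are what generate a valid input schema
    return f"moved {direction}"
```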
## Advanced Usage
### Incremental Debugging
```bash theme={null}
for phase in 1 2 3 4 5; do
echo "Testing phase $phase..."
hud debug my-env:latest --max-phase $phase || break
done
```
### Parallel Testing
```bash theme={null}
# Test multiple environments
images=("env1:latest" "env2:latest")
for image in "${images[@]}"; do
hud debug "$image" > "debug_${image//:/}.log" 2>&1 &
done
wait
```
## Best Practices
1. Start with phase 1 and work up
2. Use `--max-phase` for incremental debugging
3. Check stderr output first
4. Keep tools simple for easier debugging
5. Add health check tools
If debug passes, agents should work reliably with your environment.
## Next Step
* Explore environment capabilities after debugging with [`hud analyze`](/reference/cli/analyze)
# hud dev
Source: https://docs.hud.ai/reference/cli/dev
Hot-reload development server for MCP environments
The `hud dev` command provides development-time proxying for MCP environments, starting your image and exposing it over HTTP (default) or stdio.
## Synopsis
```bash theme={null}
hud dev [DIRECTORY] [OPTIONS]
```
## Description
`hud dev` launches a lightweight MCP proxy that runs your Docker image and forwards MCP traffic. It auto-detects or builds your image, mounts your project directory, and optionally opens an inspector or interactive tool runner.
**Key Features:**
* **Auto-detection**: Determines image from `[tool.hud.image]` or `dir-name:dev`
* **Project mount**: Mounts your project at `/app` (sets `PYTHONPATH=/app`)
* **Interactive testing**: TUI for calling tools (HTTP transport only)
* **HTTP/Stdio protocols**: Choose transport method (`http` default)
* **Inspector support**: Launch MCP Inspector (HTTP mode)
* **Per-connection containers**: Each client gets its own container
## Arguments
| Argument | Description | Default |
| ----------- | -------------------------------- | ------------- |
| `DIRECTORY` | Environment directory (optional) | `.` (current) |
## Options
| Option | Description | Default |
| ------------------- | ---------------------------------------------------------- | ------------- |
| `--image`, `-i` | Docker image name (overrides auto-detection) | Auto-detected |
| `--build`, `-b` | Build image before starting | `false` |
| `--no-cache` | Force rebuild without cache | `false` |
| `--transport`, `-t` | Transport protocol: `http` or `stdio` | `http` |
| `--port`, `-p` | HTTP server port (ignored for stdio) | `8765` |
| `--interactive` | Launch interactive tool testing interface | `false` |
| `--no-reload` | Disable hot-reload supervision | `false` |
| `--verbose`, `-v` | Show detailed server logs | `false` |
| `--inspector` | Launch MCP Inspector (HTTP mode only) | `false` |
| `--no-logs` | Disable streaming Docker logs | `false` |
| `--full-reload` | Restart entire container on file changes (not implemented) | `false` |
## Examples
### Auto-Detection Mode (Recommended)
```bash theme={null}
hud dev
```
### Build and Start
```bash theme={null}
hud dev --build
```
### Specific Directory
```bash theme={null}
hud dev environments/my-env --build
```
### Custom Image
```bash theme={null}
hud dev . --image my-custom-env:dev --build
```
### HTTP Mode with Inspector
```bash theme={null}
hud dev . --build --transport http --inspector
```
### Stdio Mode
```bash theme={null}
hud dev . --build --transport stdio
```
### Clean Rebuild
```bash theme={null}
hud dev . --build --no-cache
```
### Verbose Logging
```bash theme={null}
hud dev . --build --verbose
```
### Interactive Testing (HTTP only)
```bash theme={null}
hud dev . --build --interactive
```
Interactive mode disables log streaming and hot-reload supervision to provide a clean UI.
## How It Works
1. **Image Resolution**:
* Reads `pyproject.toml` `[tool.hud.image]`
* Auto-generates `dir-name:dev` if absent; writes it back on build
* Uses `--image` override if provided
2. **Container Startup**:
* Runs your image with its original `CMD` (the container controls reload behavior)
* Mounts your project at `/app` and sets `PYTHONPATH=/app`
* For HTTP transport, the proxy hosts `http://localhost:<port>/mcp` (default port `8765`)
3. **Proxy Server**:
* Each client connection gets its own container
* Inspector and interactive UI are available in HTTP mode
4. **Hot-Reload**:
* Depends on your container `CMD` (templates include `--reload` flags)
* For best results, separate the restartable MCP controller from any long-lived backend
### Process Separation Architecture
For stateful environments, `hud dev` supports a critical design pattern: separating the MCP server from the environment process. This separation enables hot-reload without losing state.
**Why Separation Matters:**
* MCP server can restart instantly for code changes
* Environment state (browsers, databases, games) persists
* Development is faster without constant state resets
**Example Architecture:**
```mermaid theme={null}
graph LR
MCP["🔄 MCP Server
(Restartable)
• Lightweight
• Fast startup
• MCP protocol"]
ENV["🏗️ Environment Process
(Persistent)
• Maintains state
• Long-lived
• Resource heavy"]
MCP -->|"MCP Calls"| ENV
ENV -->|"Unix Socket
TCP, gRPC, etc."| MCP
style MCP fill:#e0e7ff,stroke:#6366f1,stroke-width:2px
style ENV fill:#d1fae5,stroke:#10b981,stroke-width:2px
```
**Connection Methods:**
* Unix sockets (recommended for local dev)
* TCP/HTTP endpoints
* gRPC services
* Shared memory/IPC
Separating the MCP server (restartable) from the backend (stateful) enables code reloads without losing state. See the browser template for a full example.
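A minimal sketch of the pattern, assuming a persistent backend that serves `POST /act` on port `8005` (as in the `hud init` templates) and `httpx` as the HTTP client — the controller holds no state, so restarting it loses nothing:
```python theme={null}
# controller/tools.py — the restartable MCP side of the split (sketch).
import httpx

from . import mcp  # the MCPServer instance from controller/__init__.py

@mcp.tool()
async def act(action: str) -> dict:
    """Forward an action to the persistent environment backend."""
    async with httpx.AsyncClient() as client:
        resp = await client.post("http://localhost:8005/act", json={"action": action})
        resp.raise_for_status()
        return resp.json()
```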
## File Mounting
Project directory is mounted as `/app` inside the container:
```
Local: ./ → Container: /app/
Local: ./controller/ → Container: /app/controller/
Local: ./environment/ → Container: /app/environment/
```
## Transport Modes
### HTTP Transport (Default)
```bash theme={null}
hud dev . --transport http --port 8765
```
**Benefits:**
* Web browser access and MCP Inspector
* Multiple simultaneous connections
* Easier debugging
**URL:** `http://localhost:8765/mcp`
### Stdio Transport
```bash theme={null}
hud dev . --transport stdio
```
**Benefits:**
* Direct MCP protocol, lower latency
* Single connection focused
## Integration Examples
### Cursor Integration
1. Start development server:
```bash theme={null}
hud dev . --build --transport http --port 8765
```
2. Add to Cursor's MCP config:
```json theme={null}
{
"mcpServers": {
"my-dev-env": {
"url": "http://localhost:8765/mcp"
}
}
}
```
3. Edit files — changes apply immediately based on your container's reload behavior.
### Testing During Development
```bash theme={null}
# Quick inspection (optional)
hud analyze my-env:dev
# Deep check (only when broken)
hud debug my-env:dev
```
### MCP Inspector
```bash theme={null}
hud dev . --build --inspector
```
Opens an inspector that shows:
* Available tools and schemas
* Real-time tool calls and protocol messages
## Troubleshooting
* Use `--verbose` to see container logs
* If the container never starts: run `hud debug <image>`
* Port conflicts: choose a new `--port`
## See Also
* [`hud init`](/reference/cli/init) — Create new environments
* [`hud build`](/reference/cli/build) — Build production images
* [`hud push`](/reference/cli/push) — Share to registry
* [`hud analyze`](/reference/cli/analyze) — Inspect tools
* [`hud debug`](/reference/cli/debug) — Validate environment
* [`hud run`](/reference/cli/run) — Run locally/remotely
* [Build Environments](/build-environments)
# hud eval
Source: https://docs.hud.ai/reference/cli/eval
Run agents on tasks or datasets
The `hud eval` command runs an agent on a tasks file or a HuggingFace dataset.
## Usage
```bash theme={null}
hud eval [SOURCE] [AGENT] [OPTIONS]
```
## Arguments
* `SOURCE` — HuggingFace dataset (e.g., `hud-evals/SheetBench-50`) or task JSON/JSONL file. If omitted, looks for a tasks file in the current directory.
* `AGENT` — Agent backend to use: `claude`, `openai`, or `vllm`. If omitted, an interactive selector appears (including HUD hosted models).
## Options
* `--full` — Run the entire dataset (omit for single-task debug mode)
* `--model` — Model name for the chosen agent (required for some agents)
* `--allowed-tools` — Comma-separated list of allowed tools
* `--max-concurrent` — Maximum concurrent tasks (1-200 recommended); adjust based on your API rate limits and system resources
* `--max-steps` — Maximum steps per task (default: 10 for a single task, 50 for a full dataset)
* `--verbose` — Enable verbose agent output
* Enable debug-level logs for maximum visibility
* `--vllm-base-url` — Base URL for the vLLM server (when using `--agent vllm` or HUD hosted models)
* Number of times to run each task (mini-batch style)
## Examples
```bash theme={null}
# Single task (debug mode)
hud eval hud-evals/SheetBench-50
# Entire huggingface dataset with Claude
hud eval hud-evals/SheetBench-50 claude --full
# High concurrency for faster evaluation
hud eval hud-evals/SheetBench-50 claude --full --max-concurrent 100
# Limit concurrency to prevent rate limits
hud eval hud-evals/SheetBench-50 openai --full --max-concurrent 20
# Local task config
hud eval tasks.json claude
# Local task config with verbose output for debugging
hud eval tasks.json claude --verbose
# vLLM with explicit base URL
hud eval tasks.json vllm --model llama3.1 --vllm-base-url http://localhost:8000
# Limit tools and concurrency
hud eval tasks.json claude --allowed-tools click,type --max-concurrent 10
```
## Notes
* If you select a HUD hosted model, `hud eval` will route through vLLM with the appropriate base model.
* When `SOURCE` is omitted, an interactive file picker helps locate a tasks file.
## See Also
* [`hud rl`](/reference/cli/rl)
* [`hud run`](/reference/cli/run)
* [`hud analyze`](/reference/cli/analyze)
## Pricing & Billing
See hosted vLLM and training GPU rates in the [Training Quickstart → Pricing](/train-agents/quickstart#pricing). Manage usage and billing at `https://hud.ai/project/billing`.
# hud init
Source: https://docs.hud.ai/reference/cli/init
Create a new HUD environment from a preset
The `hud init` command scaffolds a working MCP environment using templates from the public SDK.
## Usage
```bash theme={null}
hud init [NAME] [OPTIONS]
```
## Arguments
* `NAME` — Environment name. If omitted, the current directory name is used.
## Options
* `-p` — Template preset: `blank`, `deep-research`, or `browser`
* `-d` — Target directory where the environment will be created
* `--force`, `-f` — Overwrite existing files if they exist
## What It Creates
A minimal but complete environment with controller/frontend and optional backend:
```
my-env/
├── Dockerfile # Container configuration
├── pyproject.toml # Dependencies and metadata
├── README.md # Template instructions
├── tasks.json # Example tasks
├── controller/ # MCP server (stdio)
│ ├── __init__.py # mcp = MCPServer()
│ ├── __main__.py # python -m controller → mcp.run()
│ ├── hooks.py # @mcp.initialize / @mcp.shutdown
│ └── tools.py # @mcp.tool act / setup / evaluate
└── environment/ # Backend (FastAPI example)
└── server.py # /health /act /reset /state
```
### Dockerfile (template)
```dockerfile theme={null}
FROM python:3.11-slim
WORKDIR /app
COPY pyproject.toml ./
COPY controller/ ./controller/
COPY environment/ ./environment/
RUN pip install --no-cache-dir -e .
ENV ENV_SERVER_PORT=8005
# Start backend then launch MCP controller on stdio
CMD ["sh", "-c", "uvicorn environment.server:app --host 0.0.0.0 --port $ENV_SERVER_PORT --log-level warning & python -m controller"]
```
Templates may include hot-reload flags for development. Remove them for production images.
## Examples
```bash theme={null}
# Choose preset interactively (default blank)
hud init
# Create a blank template in a new directory
hud init my-env -p blank
# Browser presets
hud init my-browser -p browser
# Deep research preset (remote browser)
hud init my-deep -p deep-research
# Force overwrite
hud init my-env -p blank --force
```
## Next Steps
Run with hot-reload and choose your preferred UI:
```bash theme={null}
# Inspector (HTTP, visual)
hud dev --inspector
# Interactive TUI (arrow keys)
hud dev --interactive
```
Add tools in `controller/tools.py`; use `@mcp.tool`.
```bash theme={null}
hud build
hud push
```
## Presets
* **blank**: Minimal controller + FastAPI backend with `/health`, `/act`, `/reset`, `/state` and example tools.
* **browser**: Local browser environment preset.
* **deep-research**: Remote browser environment preset (maps to `remote_browser`).
## See Also
* [Build Environments](/build-environments) – Quickstart tutorial
* [Technical Spec](/build-environments/spec) – Exact runtime requirements
* [hud dev](/reference/cli/dev) – Hot-reload server
* [hud build](/reference/cli/build) – Build production images
# Other CLI Commands
Source: https://docs.hud.ai/reference/cli/misc
Miscellaneous HUD CLI utilities
Quick reference for auxiliary CLI utilities.
## hud pull
Fetch an environment from a registry with a metadata preview.
```bash theme={null}
hud pull <org/name:tag> [--yes] [--verify-only]
# From a shared lock file
hud pull ./hud.lock.yaml
```
Notes:
* Shows tool list, init time, variables (when available)
* `--verify-only` skips `docker pull`
## hud get
```bash theme={null}
hud get hud-evals/2048-basic --split train -o tasks.jsonl --limit 100 --format jsonl
```
Downloads a HuggingFace dataset split and saves as JSON or JSONL.
## hud quickstart
```bash theme={null}
hud quickstart
```
Clones the HUD quickstart repository.
## hud cursor-list
```bash theme={null}
hud cursor-list
```
Lists MCP servers configured in Cursor and shows the config location.
## hud version
```bash theme={null}
hud version
```
Shows the HUD CLI version.
## hud clone
```bash theme={null}
hud clone https://github.com/user/repo.git
```
Clones a repository and prints any configured tutorial snippet.
## hud set
```bash theme={null}
hud set HUD_API_KEY=... ANTHROPIC_API_KEY=...
```
Saves KEY=VALUE pairs to `~/.hud/.env` for default use in HUD.
# CLI Overview
Source: https://docs.hud.ai/reference/cli/overview
Complete reference for HUD command-line tools
The HUD CLI provides a complete toolkit for creating, developing, and running MCP environments. Commands are organized into two main workflows:
**Directory-based commands** for creating and sharing environments:
* `hud init` — Create new environment
* `hud dev` — Develop with hot‑reload
* `hud build` — Build and generate lock file
* `hud push` — Share to registry
**Target-based commands** for using environments and agents:
* `hud analyze` — Inspect capabilities (fast/live)
* `hud debug` — 5‑phase compliance test
* `hud run` — Execute (Python module/command/Docker)
* `hud eval` — Run agents on tasks/datasets
* `hud rl` — Train with GRPO on tasks
## Installation
```bash uv (Recommended) theme={null}
uv tool install hud-python
hud --version
```
```bash pip theme={null}
pip install hud-python
python -m hud.cli --help
```
```bash pipx theme={null}
pipx install hud-python
hud --version
```
## Commands
### Building Workflow
| Command | Input | Description | Example |
| ----------- | --------- | ----------------------- | ------------------------- |
| `hud init` | Directory | Create new environment | `hud init my-env` |
| `hud dev` | Directory | Hot-reload development | `hud dev . --interactive` |
| `hud build` | Directory | Build image & lock file | `hud build . --tag v1.0` |
| `hud push` | Directory | Share to registry | `hud push . --tag prod` |
### Running Workflow
| Command | Input | Description | Example |
| ------------- | -------------------- | ----------------------------- | ----------------------------- |
| `hud analyze` | Image or config | Inspect tools & capabilities | `hud analyze org/env` |
| `hud debug` | Image/dir/config | 5‑phase compliance test | `hud debug my-env:latest` |
| `hud run` | Module/command/image | Execute server (local/remote) | `hud run controller --reload` |
| `hud eval` | Tasks/dataset | Run agent on tasks | `hud eval tasks.json claude` |
| `hud rl` | Tasks/dataset | Train with GRPO | `hud rl tasks.json --local` |
### Other Commands
| Command | Description | Example |
| ----------------- | ---------------------------------- | --------------------------------------------- |
| `hud get` | Download HF dataset to tasks file | `hud get hud-evals/2048-basic -o tasks.jsonl` |
| `hud quickstart` | Clone quickstart repo | `hud quickstart` |
| `hud cursor-list` | List Cursor MCP servers | `hud cursor-list` |
| `hud version` | Show CLI version | `hud version` |
| `hud clone` | Clone any git repo (pretty output) | `hud clone https://github.com/...` |
| `hud set` | Persist API keys to `~/.hud/.env` | `hud set HUD_API_KEY=...` |
## Complete Workflows
### Building an Environment
Create a new HUD environment with minimal boilerplate:
```bash theme={null}
hud init my-env && cd my-env
```
Creates `Dockerfile`, `pyproject.toml`, `controller/` (MCP server), optional `environment/` backend, `tasks.json`.
Run with hot-reload and interactive testing:
```bash theme={null}
hud dev
```
Your changes reload automatically. Test tools interactively with arrow keys.
Create production image and lock file:
```bash theme={null}
hud build
```
Generates `hud.lock.yaml` with metadata and labels image for reproducibility.
Share to Docker and HUD registries:
```bash theme={null}
hud push
```
Requires `HUD_API_KEY`. Auto-detects registry from Docker login.
### Running an Environment
Quick inspection without running:
```bash theme={null}
hud analyze hudpython/text_init # Fast (from metadata)
hud analyze hudpython/text_init --live # Full (runs container)
```
Test MCP protocol compliance:
```bash theme={null}
hud debug hudpython/text_init:latest
```
Validates through 5 phases of initialization.
Execute in production mode:
```bash theme={null}
hud run hudpython/text_init:latest # Remote (default)
hud run hudpython/text_init:latest --local # Local Docker
```
## Common Usage
### Docker Images
```bash theme={null}
# Basic
hud debug my-image:latest
# With options
hud debug my-image:latest -e DEBUG=true -p 8080:8080
# From registry
hud analyze ghcr.io/org/image:v1.0.0
```
### Arbitrary Commands (Python/Node/etc.)
```bash theme={null}
# Run any command as MCP server (stdio/http)
hud run --cmd "python -m controller" -t http -p 8765
hud run --cmd "node mcp-server.js"
```
### Cursor Integration
```bash theme={null}
# Debug Cursor server
hud debug --cursor my-dev-server
# List all servers
hud cursor-list
```
## Output Formats
### Interactive (Default)
```bash theme={null}
hud analyze my-env
🔍 Analyzing MCP environment: my-env
✓ Connected successfully
✓ Found 12 tools
Available Tools:
• click - Click at coordinates
• type - Type text
...
```
### JSON
```bash theme={null}
hud analyze my-env --format json
{
"tools": [{
"name": "click",
"description": "Click at coordinates",
"parameters": {...}
}]
}
```
### Markdown
```bash theme={null}
hud analyze my-env --format markdown > docs/tools.md
```
## CI/CD Example
```bash theme={null}
#!/bin/bash
set -e
# Test environment
hud debug "$IMAGE_NAME"
# Verify tools
TOOLS=$(hud analyze "$IMAGE_NAME" --format json | jq '.tools | length')
if [ "$TOOLS" -lt 3 ]; then
echo "Not enough tools!"
exit 1
fi
```
## Python Scripting
```python theme={null}
import subprocess
import json
def get_tools(image):
result = subprocess.run(
["hud", "analyze", image, "--format", "json"],
capture_output=True,
text=True,
check=True
)
return json.loads(result.stdout)["tools"]
# Use
tools = get_tools("my-env:latest")
for tool in tools:
print(f"- {tool['name']}: {tool['description']}")
```
## Exit Codes
| Code | Meaning | Description |
| ---- | ---------------- | ------------------- |
| 0 | Success | Command completed |
| 1 | General Error | Command failed |
| 2 | Usage Error | Invalid arguments |
| 3 | Connection Error | Failed to connect |
| 4 | Timeout | Operation timed out |
| 5 | Protocol Error | MCP violation |
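Exit codes make the CLI easy to script; for example, a sketch that branches on the codes documented above:
```python theme={null}
import subprocess

# Run a compliance check and branch on the documented exit codes.
proc = subprocess.run(["hud", "debug", "my-env:latest"])
if proc.returncode == 0:
    print("Environment passed all phases")
elif proc.returncode == 5:
    print("Protocol error: look for non-MCP output on stdout")
else:
    print(f"hud debug failed with exit code {proc.returncode}")
```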
## Environment Variables
```bash theme={null}
# Debug output
export HUD_CLI_DEBUG=true
# Custom timeout
export HUD_CLI_TIMEOUT=120
# Provider keys
export ANCHOR_API_KEY=...
```
## Next Steps
### Building Commands
* [`hud init`](/reference/cli/init) — Create new environments from scratch
* [`hud dev`](/reference/cli/dev) — Develop with hot-reload and interactive testing
* [`hud build`](/reference/cli/build) — Build images and generate lock files
* [`hud push`](/reference/cli/push) — Share environments to registry
### Running Commands
* [`hud analyze`](/reference/cli/analyze) — Inspect tools and capabilities
* [`hud debug`](/reference/cli/debug) — Test MCP protocol compliance
* [`hud run`](/reference/cli/run) — Execute servers locally or remotely
# hud push
Source: https://docs.hud.ai/reference/cli/push
Share HUD environments to Docker and HUD registries
The `hud push` command publishes your environment image to a Docker registry and uploads metadata to the HUD registry.
## Usage
```bash theme={null}
hud push [DIRECTORY] [OPTIONS]
```
## Arguments
* `DIRECTORY` — Environment directory containing `hud.lock.yaml`
## Options
* `--image`, `-i` — Override the registry image name (e.g., `myorg/myenv`)
* `--tag`, `-t` — Override the tag (e.g., `v1.0`, `latest`)
* Sign the image with cosign (not yet implemented)
* `--yes`, `-y` — Skip confirmation prompts
* `--verbose`, `-v` — Show detailed output
## Prerequisites
Requires `HUD_API_KEY`:
```bash theme={null}
export HUD_API_KEY="your-api-key"
```
Login to your Docker registry first (e.g., Docker Hub or GHCR):
```bash theme={null}
docker login
# or
docker login ghcr.io
```
## What It Does
1. Ensures a recent `hud build` exists (lock file present), prompting to build if missing.
2. Determines the target image:
   * `--image` if provided
   * Otherwise, auto-detects your Docker Hub username and uses `username/{name}:{tag}`
   * The tag comes from `--tag`, else the lock file's internal `build.version`, else the current tag
3. Tags and pushes via Docker; captures the pushed digest.
4. Updates `hud.lock.yaml` with:
   * `image`: full registry reference with digest
   * `push`: `source`, `pushedAt`, `registry`, and `image_with_tag`
5. Uploads lock metadata to the HUD registry path `.../registry/envs/{org}/{name}:{tag}`.
## Registry Detection
1. Explicit `--image` → used as‑is.
2. Docker config → reads logged‑in username (Docker Hub).
3. Default → prompts if no login and `--image` missing.
## Examples
```bash theme={null}
# Basic (auto username)
hud push
# Custom registry and tag
hud push . --image ghcr.io/myorg/myenv --tag v1.0.0
# Non‑interactive (CI)
hud push . --yes --tag "prod-$(date +%Y%m%d)"
```
## Lock File After Push
```yaml theme={null}
image: "myorg/myenv:latest@sha256:..."
push:
source: "my-env:dev@sha256:..."
pushedAt: "2025-01-01T11:00:00Z"
registry: "ghcr.io"
image_with_tag: "myorg/myenv:latest"
```
## Notes
* If HUD registry upload fails, the Docker push still succeeds; you can share `hud.lock.yaml` manually.
* Image names must include `org/name` for HUD registry uploads.
## See Also
* [`hud build`](/reference/cli/build)
* [`hud pull`](/reference/cli/pull)
* [`hud analyze`](/reference/cli/analyze)
* [Build Environments](/build-environments) - Getting started with HUD environments
# hud rl
Source: https://docs.hud.ai/reference/cli/rl
Run GRPO reinforcement learning on tasks
The `hud rl` command trains an agent with GRPO on tasks, locally or via the HUD remote service.
## Usage
```bash theme={null}
hud rl [TASKS_FILE|DATASET] [MODEL] [OPTIONS]
```
## Arguments
* `TASKS_FILE|DATASET` — Path to a tasks JSON/JSONL file or HuggingFace dataset name. If omitted, looks for a tasks file in the current directory.
* `MODEL` — Model to train (default: interactive selection)
## Options
* `--config`, `-c` — Path to an existing configuration file
* `-o` — Output directory for checkpoints
* Restart the vLLM server before training
* `--verbose`, `-v` — Enable verbose output
* Disable DistributedDataParallel (even with multiple GPUs)
* `--ddp-gpus` — Specific GPUs for DDP (e.g., `0,1,2,3`)
* `--vllm-gpu` — Specific GPU for the vLLM server
* `--local` — Run training locally instead of on the remote HUD service
## Behavior
* If no tasks file is provided, an interactive picker helps locate one.
* Remote mode (default) converts tasks to remote MCP automatically (build/push as needed) and launches remote training.
* Local mode runs training on your machine (delegated to `local_runner`).
## Examples
```bash theme={null}
# Remote (default): auto-convert tasks to remote, then train
hud rl tasks.json --model claude-rl
# Local training with GPU selection
hud rl tasks.json llama3.1 --local --ddp-gpus 0,1 --vllm-gpu 0
# Use a dataset directly (remote)
hud rl hud-evals/SheetBench-50 --model claude-rl
```
## See Also
* [`hud eval`](/reference/cli/eval)
* [`hud get`](/reference/cli/get)
* [`hud build`](/reference/cli/build)
* [`hud push`](/reference/cli/push)
## Pricing & Billing
See hosted vLLM and training GPU rates in the [Training Quickstart → Pricing](/train-agents/quickstart#pricing). Manage usage and billing at `https://hud.ai/project/billing`.
# hud run
Source: https://docs.hud.ai/reference/cli/run
Execute MCP environments locally or remotely for production use
The `hud run` command executes MCP servers in three ways: (1) Python module/file (decorator‑based), (2) arbitrary command, or (3) Docker image (local/remote). Unlike `hud dev`, `hud run` is optimized for execution, not hot‑reload.
## Usage
```bash theme={null}
# Python module or file (decorator-based)
hud run controller[:attr] [OPTIONS]
hud run path/to/server.py[:attr] [OPTIONS]
# Arbitrary command
hud run --cmd "python -m controller" [OPTIONS]
# Docker image (supports Docker args after --)
hud run [OPTIONS] [-- DOCKER_ARGS...]
```
## Arguments
* Positional target — a Python module/file (e.g., `controller` or `controller:server`) or a Docker image (e.g., `myorg/env:latest`)
* `DOCKER_ARGS` — Additional Docker arguments, accepted positionally after the image (following the `--` separator)
## Options
* `--local` — Run locally with Docker instead of remote execution
* `--remote` — Explicitly run remotely (the default when neither `--local` nor `--remote` is set)
* `--transport`, `-t` — Transport protocol: `stdio` or `http`
* `--port`, `-p` — Port for HTTP transport (ignored for stdio)
* `--url` — Remote MCP server URL
* `--api-key` — API key for the remote server (defaults to the `HUD_API_KEY` env var)
* `--run-id` — Run ID for tracking (remote only)
* `--verbose`, `-v` — Show detailed output
* `--interactive` — Launch interactive testing mode (HTTP transport only; local Docker only)
* `--reload` — Auto-reload when files change (Python mode only)
* `--watch` — Directories to watch when using `--reload` (Python mode)
* `--cmd` — Run an arbitrary command as an MCP server (e.g., `python -m controller`)
## Python Mode (Decorator-based)
Run a Python module, package, or file containing an MCP server (default attribute `mcp`; override with `module:attr`).
```bash theme={null}
# Run module or file (stdio by default)
hud run controller
hud run controller:mcp
hud run server.py:mcp
# Auto-reload on changes (watches module/package)
hud run controller --reload
hud run controller --reload --watch controller tools
# HTTP transport
hud run controller -t http -p 8765
```
## Remote Execution (Default for Docker)
By default, `hud run` executes your Docker image on the HUD platform's remote infrastructure. This provides:
* **Scalability**: Run hundreds of concurrent instances
* **Resource Management**: Automatic container lifecycle management
* **Monitoring**: Live telemetry and performance metrics
* **Geographic Distribution**: Global edge deployment
### Remote Examples
```bash theme={null}
# Basic remote execution
hud run hudpython/text-2048:latest
# With environment variables
hud run myorg/server:v1 -- -e API_KEY=secret -e DEBUG=true
# HTTP transport with custom port
hud run myorg/server:v1 --transport http --port 9000
# Custom remote URL and API key
hud run my-image:latest --url https://custom.mcp.server --api-key my-key
# With run tracking
hud run production-env:v1.0 --run-id "deploy-$(date +%Y%m%d)"
```
### Authentication
Remote execution requires authentication via API key:
```bash theme={null}
# Set environment variable (recommended)
export HUD_API_KEY=your-api-key-here
hud run my-image:latest
# Or pass directly (less secure)
hud run my-image:latest --api-key your-api-key-here
```
### Run Tracking
Use `--run-id` to track specific execution instances:
```bash theme={null}
hud run my-image:latest --run-id "experiment-001"
```
## Local Execution
Use `--local` to run the image locally with Docker:
```bash theme={null}
# Basic local execution
hud run hudpython/text-2048:latest --local
# With Docker arguments
hud run myorg/server:v1 --local -- -e API_KEY=secret -v /data:/app/data
# HTTP transport locally
hud run myorg/server:v1 --local --transport http --port 8080
# With memory limits and volumes
hud run my-env:latest --local -- --memory 2g -v $(pwd)/data:/app/data
```
Local execution is useful for:
* Development and testing
* Environments without internet access
* Custom Docker configurations
* Debugging before remote deployment
## Transport Modes
### Stdio Transport (Default)
```bash theme={null}
hud run my-image:latest --transport stdio
```
**Characteristics:**
* Direct JSON-RPC over stdin/stdout
* Single client connection
* Lower latency
* MCP protocol native
### HTTP Transport
```bash theme={null}
hud run my-image:latest --transport http --port 8765
```
**Characteristics:**
* HTTP proxy at specified port
* Multiple concurrent connections
* Web-compatible
* Inspector support
## Docker Arguments
Pass Docker arguments after `--` separator:
```bash theme={null}
# Environment variables
hud run my-image:latest -- -e DEBUG=true -e API_KEY=secret
# Volume mounts
hud run my-image:latest -- -v /host/data:/container/data
# Network settings
hud run my-image:latest -- --network host
# Memory limits
hud run my-image:latest -- --memory 2g
# Combined arguments
hud run my-image:latest -- -e API_KEY=secret -v /data:/app/data --memory 1g
```
The `--` separator is required to distinguish HUD options from Docker arguments.
## Environment Variables
| Variable | Description | Default |
| ------------- | --------------------------------- | --------------------------- |
| `HUD_MCP_URL` | Remote MCP server URL | `https://mcp.hud.ai/v3/mcp` |
| `HUD_API_KEY` | API key for remote authentication | *required for remote* |
## Comparison with Other Commands
| Command | Purpose | Hot-reload | Target Use Case |
| ----------- | ----------------------------- | ---------- | ----------------------------- |
| `hud dev` | Development with auto-restart | ✅ Yes | Local development iteration |
| `hud run` | Production execution | ❌ No | Production workloads, testing |
| `hud debug` | Environment validation | ❌ No | Debugging and validation |
## Examples
### Basic Usage
```bash theme={null}
# Run remotely (default)
hud run hudpython/text-2048:latest
# Run locally for testing
hud run hudpython/text-2048:latest --local
# Verbose output for debugging
hud run my-image:latest --local --verbose
```
### Production Deployment
```bash theme={null}
# Remote execution with tracking
export HUD_API_KEY=your-production-key
hud run production-env:v1.0 --run-id "prod-deploy-$(date +%s)"
# With environment configuration
hud run production-env:v1.0 --run-id "prod-$(git rev-parse --short HEAD)" -- \
-e DATABASE_URL=postgres://... \
-e REDIS_URL=redis://... \
-e LOG_LEVEL=info
```
### Development Testing
```bash theme={null}
# Test locally before remote deployment
hud run my-env:dev --local -- -e DEBUG=true
# Test with HTTP transport
hud run my-env:dev --local --transport http --port 9000
# Test with custom data volume
hud run my-env:dev --local -- -v $(pwd)/test-data:/app/data
```
### CI/CD Integration
```yaml theme={null}
# GitHub Actions example
- name: Run HUD environment
env:
HUD_API_KEY: ${{ secrets.HUD_API_KEY }}
run: |
# Pull latest
hud pull myorg/env:latest --yes
# Run tests
hud run myorg/env:latest --run-id "ci-${{ github.sha }}"
```
## Error Handling
`hud run` provides clear error messages for common issues:
* **Missing image**: Suggests available images or build commands
* **Authentication failure**: Guides through API key setup
* **Port conflicts**: Suggests alternative ports
* **Docker errors**: Shows Docker-specific troubleshooting
## See Also
* [`hud pull`](/reference/cli/pull) - Get environments before running
* [`hud analyze`](/reference/cli/analyze) - Inspect capabilities first
* [`hud debug`](/reference/cli/debug) - Test environment compliance
* [`hud dev`](/reference/cli/dev) - Development with hot-reload
* [Build Environments](/build-environments) - Environment development guide
# Environments
Source: https://docs.hud.ai/reference/environments
SDK reference for building MCP environments
The HUD SDK provides `MCPServer` for building MCP-compatible environments that work with any MCP client.
## MCPServer
```python theme={null}
from hud.server import MCPServer
```
Enhanced FastMCP server with Docker-friendly features for building HUD environments.
**Constructor Parameters:**
| Parameter | Type | Description | Default |
| ------------------ | ----- | ------------------------------- | -------- |
| `name` | `str` | Server name for MCP handshake | Required |
| `instructions` | `str` | Server instructions/description | `None` |
| `**fastmcp_kwargs` | `Any` | Additional FastMCP parameters | - |
**Key Features:**
1. **SIGTERM handling** - Graceful shutdown in containers via custom runner
2. **Initialize decorator** - Async setup during MCP initialize request (stdout is temporarily redirected to stderr during initialization to avoid corrupting MCP output)
3. **Shutdown decorator** - Runs only on SIGTERM (container termination), not on hot‑reload/SIGINT
4. **Enhanced add\_tool()** - Automatically handles `BaseTool` instances and raw FastMCP Tool objects
5. **Tool decorator passthrough** - `@mcp.tool` returns the original function for easy composition
6. **FastMCP inheritance** - All FastMCP methods available (`mount`, `resource`, `tool`)
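Feature 5 in practice: because `@mcp.tool` hands back the original function, registered tools remain ordinary coroutines you can call or compose directly (a small sketch):
```python theme={null}
from hud.server import MCPServer

mcp = MCPServer(name="example")

@mcp.tool()
async def add(a: int, b: int) -> dict:
    """Add two numbers."""
    return {"sum": a + b}

# The decorator returns `add` itself, so other code can still await it:
#     result = await add(1, 2)  # {"sum": 3}
```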
### Decorators
#### @initialize
Run async setup during MCP initialize request:
```python theme={null}
mcp = MCPServer(name="my-env")
@mcp.initialize
async def setup_environment(ctx):
"""
Initialize environment resources.
Args:
ctx: RequestContext with:
- ctx.meta: Client metadata dict
- ctx.session: MCP ServerSession
"""
# Access metadata from agent (if provided)
if ctx.meta:
progress_token = ctx.meta.get("progressToken")
display_width = ctx.meta.get("display_width", 1920)
display_height = ctx.meta.get("display_height", 1080)
# Send progress notifications
if progress_token:
await ctx.session.send_progress_notification(
progress_token=progress_token,
progress=50,
total=100,
message="Initializing environment..."
)
```
#### @shutdown
Run cleanup on SIGTERM (container termination only):
```python theme={null}
@mcp.shutdown
async def cleanup():
"""Clean up resources on shutdown."""
if browser_provider:
browser_provider.close()
logger.info("Cleanup complete")
```
### Tool Registration
Three ways to register tools:
```python theme={null}
# 1. Decorator for simple functions
@mcp.tool()
async def my_tool(param: str) -> dict:
return {"result": param}
# 2. Add BaseTool instances
from hud.tools import BashTool
bash = BashTool()
mcp.add_tool(bash) # Automatically uses bash.mcp internally
# 3. Add non-BaseTool instances directly
from custom import PlaywrightTool
playwright = PlaywrightTool()
mcp.add_tool(playwright) # Added as-is
```
### Hub Pattern (mount)
Use BaseHub for organized tool namespaces:
```python theme={null}
from hud.tools import BaseHub
from mcp.types import TextContent
# Create hub
setup_hub = BaseHub("setup")
# Add internal tools (hidden from agents)
@setup_hub.tool("board")
async def setup_board(size: int = 4):
    game = setup_hub.env
    game.reset(size=size)
    return [TextContent(text=f"{size}x{size} board initialized", type="text")]
# Mount hub on server
mcp.mount(setup_hub)
# Agents call via dispatcher: setup(name="board", arguments={"size": 4})
```
### Resources
Expose metadata via MCP resources:
```python theme={null}
import os
from datetime import datetime
@mcp.resource("telemetry://live")
async def get_telemetry():
"""Expose live telemetry data."""
return {
"provider": os.getenv("BROWSER_PROVIDER"),
"status": "running" if browser_provider else "stopped",
"live_url": browser_provider.get_live_view_url() if browser_provider else None,
"timestamp": datetime.now().isoformat()
}
```
### Running the Server
```python theme={null}
if __name__ == "__main__":
# Run with SIGTERM handling (stdio by default)
mcp.run()
# Or use development transports (HTTP/SSE)
mcp.run(transport="http", port=8765)
mcp.run(transport="sse", port=8080)
```
When using HTTP/SSE, HUD development helper endpoints are available:
* `GET /hud` – overview
* `GET /hud/tools` – list tools with schemas
* `GET /hud/resources` – list resources
* `GET /hud/prompts` – list prompts
## Real Environment Examples
### Minimal Environment
```python theme={null}
# src/hud_controller/server.py
from hud.server import MCPServer
from mcp.types import TextContent
mcp = MCPServer(name="counter-env")
counter = {"value": 0}
@mcp.tool()
async def setup(start_value: int = 0):
"""Initialize counter."""
counter["value"] = start_value
return {"status": "ready", "counter": counter["value"]}
@mcp.tool()
async def increment():
"""Increment counter."""
counter["value"] += 1
return [TextContent(text=f"Counter: {counter['value']}", type="text")]
@mcp.tool()
async def evaluate(target: int):
"""Check if target reached."""
from hud.tools.types import EvaluationResult
return EvaluationResult(
reward=1.0 if counter["value"] >= target else 0.0,
done=counter["value"] >= target
)
if __name__ == "__main__":
mcp.run()
```
### text\_2048 Environment
From `environments/text_2048/src/hud_controller/server.py`:
```python theme={null}
from hud.server import MCPServer
from .game import Game2048
from .tools import MoveTool
from .setup import setup as setup_hub
from .evaluate import evaluate as evaluate_hub
mcp = MCPServer(name="text-2048")
game = None
@mcp.initialize
async def initialize_environment(ctx):
global game
# Progress notifications
progress_token = getattr(ctx.meta, "progressToken", None) if ctx.meta else None
async def send_progress(progress: int, message: str):
if progress_token:
await ctx.session.send_progress_notification(
progress_token=progress_token,
progress=progress,
total=100,
message=message
)
await send_progress(0, "Starting 2048 game environment...")
# Create game
game = Game2048()
game.reset()
await send_progress(50, "Setting up game board...")
# Set game on hubs
setup_hub.env = game
evaluate_hub.env = game
# Mount hubs
mcp.mount(setup_hub)
mcp.mount(evaluate_hub)
await send_progress(70, "Configuring tools...")
# Add move tool
mcp.add_tool(MoveTool(env=game))
await send_progress(100, "2048 environment ready")
```
### remote\_browser Environment
From `environments/remote_browser/src/hud_controller/server.py`:
```python theme={null}
from hud.server import MCPServer
from hud.tools.computer import HudComputerTool, AnthropicComputerTool, OpenAIComputerTool
from .tools import PlaywrightToolWithMemory, BrowserExecutor
from .setup import setup as setup_hub
from .evaluate import evaluate as evaluate_hub
from .providers import get_provider
mcp = MCPServer(
name="HUD Remote Browser Environment",
instructions="""Remote browser automation environment..."""
)
# Global state
browser_provider = None
playwright_tool = None
@mcp.resource("telemetry://live")
async def get_telemetry_resource():
"""MCP resource with live browser status."""
return {
"provider": os.getenv("BROWSER_PROVIDER", "unknown"),
"status": "running" if browser_provider else "stopped",
"live_url": browser_provider.get_live_view_url() if browser_provider else None,
"cdp_url": browser_provider.cdp_url if browser_provider else None
}
@mcp.initialize
async def initialize_environment(ctx):
global browser_provider, playwright_tool
# Get metadata
metadata = ctx.meta
progress_token = metadata.get("progressToken", None)
# Initialize provider
provider_name = os.getenv("BROWSER_PROVIDER")
provider_class = get_provider(provider_name)
browser_provider = provider_class(config)
# Launch browser
cdp_url = await browser_provider.launch()
# Create playwright tool
playwright_tool = PlaywrightToolWithMemory(cdp_url=cdp_url)
await playwright_tool._ensure_browser()
# Add playwright tool (not a BaseTool, added directly)
mcp.add_tool(playwright_tool)
# Create computer tools
executor = BrowserExecutor(playwright_tool)
tool_kwargs = {"executor": executor}
# Add display dimensions from metadata
if metadata:
width = metadata.get("display_width")
height = metadata.get("display_height")
if width and height:
tool_kwargs["width"] = width
tool_kwargs["height"] = height
# Add computer tools (all are BaseTool subclasses)
mcp.add_tool(HudComputerTool(**tool_kwargs))
mcp.add_tool(AnthropicComputerTool(**tool_kwargs))
mcp.add_tool(OpenAIComputerTool(**tool_kwargs))
# Mount hubs
setup_hub.env = playwright_tool
evaluate_hub.env = playwright_tool
mcp.mount(setup_hub)
mcp.mount(evaluate_hub)
@mcp.shutdown
async def shutdown_environment():
"""Cleanup browser resources."""
global browser_provider
if browser_provider:
browser_provider.close()
browser_provider = None
```
## Standard Structure
### Directory Layout
```
my-environment/
├── Dockerfile
├── pyproject.toml
├── controller/ # MCP controller (stdio)
│ ├── __init__.py # mcp = MCPServer()
│ ├── __main__.py # python -m controller → mcp.run()
│ ├── hooks.py # @mcp.initialize / @mcp.shutdown
│ └── tools.py # @mcp.tool(...)
└── environment/ # Optional backend (HTTP/IPC)
└── server.py # e.g., FastAPI app
```
### Dockerfile
```dockerfile theme={null}
FROM python:3.11-slim
WORKDIR /app
# Copy and install
COPY pyproject.toml ./
COPY controller/ ./controller/
COPY environment/ ./environment/
RUN pip install --no-cache-dir -e .
ENV ENV_SERVER_PORT=8005
# Start optional backend, then MCP controller on stdio
CMD ["sh", "-c", "uvicorn environment.server:app --host 0.0.0.0 --port $ENV_SERVER_PORT --log-level warning & python -m controller"]
```
### Hub Module Pattern
Example from text\_2048:
```python theme={null}
# src/hud_controller/setup/__init__.py
from hud.tools.base import BaseHub
setup = BaseHub("setup")
# Import all setup functions to register them
from . import board
__all__ = ["setup"]
# src/hud_controller/setup/board.py
from mcp.types import TextContent
from . import setup
@setup.tool("board")
async def setup_board(board_size: int = 4):
"""Initialize game board."""
game = setup.env # Access environment from hub
game.reset(size=board_size)
    return [TextContent(text=f"{board_size}x{board_size} game initialized", type="text")]
```
## Key Concepts
### Environment State
Three patterns for managing state:
1. **Global variables** (simple environments):
```python theme={null}
game = None
@mcp.initialize
async def initialize_environment(ctx):
global game
game = Game2048()
```
2. **Context class** (complex environments):
```python theme={null}
class EnvironmentContext:
def __init__(self):
self.browser = None
self.page = None
env = EnvironmentContext()
```
3. **Hub env attribute** (for tool access):
```python theme={null}
setup_hub.env = game # Tools access via hub.env
```
### Tool Lifecycle
1. **Setup tools** - Hidden from agents, prepare environment state
2. **Interaction tools** - Available to agents for control
3. **Evaluate tools** - Hidden from agents, score performance
### Progress Notifications
Send [progress updates](https://modelcontextprotocol.io/specification/basic/utilities/progress) during long-running operations:
```python theme={null}
async def send_progress(progress: int, message: str):
if progress_token:
await ctx.session.send_progress_notification(
progress_token=progress_token,
progress=progress,
total=100,
message=message
)
```
Progress notifications follow the [MCP progress specification](https://modelcontextprotocol.io/specification/basic/utilities/progress#progress-flow). The `progressToken` comes from the client's request [metadata](https://modelcontextprotocol.io/specification/basic/index#_meta).
### Metadata Access
Agent metadata flows through initialization:
```python theme={null}
@mcp.initialize
async def initialize_environment(ctx):
# From agent's metadata class variable
width = ctx.meta.get("display_width", 1920) if ctx.meta else 1920
height = ctx.meta.get("display_height", 1080) if ctx.meta else 1080
```
## Testing
```bash theme={null}
# CLI testing
hud debug my-env:latest
hud analyze my-env:latest
```
```python theme={null}
# Python testing
import asyncio
from hud.clients import MCPClient

async def test():
    client = MCPClient({
        "env": {"command": "docker", "args": ["run", "-i", "my-env"]}
    })
    async with client:
        tools = await client.list_tools()
        result = await client.call_tool("setup", {"value": 0})

asyncio.run(test())
```
## See Also
* [Build Environments](/build-environments) - Getting started guide
* [Tools](/reference/tools) - Tool implementation reference
* [Environment Spec](/build-environments/spec) - Technical specification and architecture
# Tasks
Source: https://docs.hud.ai/reference/tasks
SDK reference for task configuration and dataset utilities
The HUD SDK provides the `Task` class for defining agent objectives and dataset utilities for managing task collections.
## Task Class
```python theme={null}
from hud.datasets import Task
```
Pydantic model that defines an agent's objective, setup, and evaluation criteria.
**Fields:**
| Field | Type | Description | Default |
| --------------- | ------------------------------------------ | ---------------------------------------------------------- | -------- |
| `id` | `str \| None` | Unique identifier (UUID recommended) | `None` |
| `prompt` | `str` | Task instruction for the agent | Required |
| `mcp_config` | `dict[str, Any]` | MCP server configuration | Required |
| `setup_tool` | `MCPToolCall \| list[MCPToolCall] \| None` | Tool(s) to prepare environment | `None` |
| `evaluate_tool` | `MCPToolCall \| list[MCPToolCall] \| None` | Tool(s) to score performance | `None` |
| `agent_config` | `dict[str, Any] \| None` | Agent configuration (system\_prompt, allowed\_tools, etc.) | `None` |
| `metadata` | `dict[str, Any]` | Extra task metadata | `{}` |
### Environment Variable Substitution
The `mcp_config` field automatically resolves environment variables using `${VAR_NAME}` syntax:
```python theme={null}
task = Task(
prompt="Navigate to the dashboard",
mcp_config={
"browser": {
"url": "${HUD_MCP_URL:https://mcp.hud.ai/v3/mcp}",
"headers": {
"Authorization": "Bearer ${HUD_API_KEY}",
"Mcp-Image": "hudpython/hud-browser:latest"
}
}
}
)
```
**Substitution happens automatically when Task is created from a dict** - environment variables are resolved using Python's `Template.substitute()` with a defaultdict that returns empty strings for missing variables.
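The mechanism can be illustrated roughly like this (plain `Template.substitute()` as shown here does not handle the `${VAR:default}` fallback syntax, which the SDK resolves on top):
```python theme={null}
import os
from collections import defaultdict
from string import Template

os.environ["HUD_API_KEY"] = "sk-hud-example"
raw = "Bearer ${HUD_API_KEY}${MISSING_VAR}"
# defaultdict(str, ...) makes missing variables resolve to "" instead of raising.
resolved = Template(raw).substitute(defaultdict(str, os.environ))
print(resolved)  # "Bearer sk-hud-example"
```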
### Field Validators
Task automatically:
1. **Parses JSON strings** - `mcp_config` and `metadata` can be JSON strings
2. **Converts dicts to MCPToolCall** - `setup_tool` and `evaluate_tool` dicts are converted
3. **Resolves environment variables** - Only when created from dict (preserves templates in model\_dump())
```python theme={null}
# All of these work:
task = Task(
prompt="Test",
mcp_config='{"server": {"url": "..."}}', # JSON string → dict
setup_tool='{"name": "setup"}' # JSON string → MCPToolCall
)
task = Task(
prompt="Test",
mcp_config={...},
setup_tool={"name": "setup", "arguments": {...}} # dict → MCPToolCall
)
task = Task(
prompt="Test",
mcp_config={...},
evaluate_tool=[ # List of tool calls
{"name": "check_text", "arguments": {"text": "Success"}},
{"name": "check_url", "arguments": {"url": "example.com"}}
]
)
```
## Recommended Evaluation Workflow
When developing and testing agents, follow this progression for optimal debugging and performance:
### Step 1: Single Task Development
Start with individual tasks to debug your agent and environment setup:
```python theme={null}
import hud
from hud.agents import ClaudeAgent
from hud.datasets import Task
task = Task(prompt="Click the submit button", mcp_config={...})
async with hud.async_trace(task.prompt):
agent = ClaudeAgent()
result = await agent.run(task, max_steps=50)
print(f"Result: {result.reward}")
```
**Benefits:**
* Full error stack traces
* Clear log output
* Quick iteration cycle
* Automatic telemetry tracking
### Step 2: Full Dataset Evaluation
Once single tasks work reliably, scale up to full dataset evaluation:
```python theme={null}
from hud.datasets import run_dataset
from hud.agents import ClaudeAgent
results = await run_dataset(
name="SheetBench Evaluation",
dataset="hud-evals/SheetBench-50",
agent_class=ClaudeAgent,
agent_config={"model": "claude-sonnet-4-5-20250929", "validate_api_key": False},
max_concurrent=50,
max_steps=50,
auto_respond=True
)
```
**Concurrency Guidelines:**
* Start with `max_concurrent=50` and adjust based on results
* Increase to 100-200 for faster evaluation (if API limits allow)
* Decrease to 10-20 if hitting rate limits
### Quick Reference
| Stage | Method | Concurrency | Use Case | Debugging |
| ----------- | ------------- | ----------- | ----------------- | --------- |
| Development | Single task | 1 | Initial debugging | Excellent |
| Production | `run_dataset` | 50-200 | Full evaluation | Good |
## Dataset Functions
### run\_dataset
```python theme={null}
from hud.datasets import run_dataset
from hud.agents import ClaudeAgent
results = await run_dataset(
name="SheetBench Evaluation",
dataset="hud-evals/SheetBench-50",
agent_class=ClaudeAgent,
agent_config={"model": "claude-sonnet-4-5-20250929", "validate_api_key": False},
max_concurrent=50,
max_steps=50,
auto_respond=True
)
```
**Parameters:**
| Parameter | Type | Description | Default |
| ---------------- | ------------------------------ | ---------------------------------------------- | --------- |
| `name` | `str` | Job name for tracking | Required |
| `dataset` | `str \| Dataset \| list[dict]` | HF dataset ID, Dataset object, or task dicts | Required |
| `agent_class` | `type[MCPAgent]` | Agent class to instantiate | Required |
| `agent_config` | `dict[str, Any] \| None` | Constructor kwargs for agent | `None` |
| `max_concurrent` | `int` | Maximum concurrent tasks (recommended: 50-200) | `30` |
| `max_steps` | `int` | Max steps per task | `10` |
| `auto_respond` | `bool` | Use ResponseAgent for continuations | `False` |
| `metadata` | `dict[str, Any] \| None` | Job metadata | `None` |
| `split` | `str` | Dataset split when loading by ID | `"train"` |
**Returns:** `list[Any]` - Results in dataset order
**Examples:**
```python theme={null}
from hud.datasets import run_dataset
from hud.agents import ClaudeAgent, OperatorAgent
# From HuggingFace dataset
results = await run_dataset(
"Eval", "hud-evals/SheetBench-50", ClaudeAgent,
{"model": "claude-sonnet-4-5-20250929"}, max_concurrent=50
)
# From Dataset object
from datasets import load_dataset
dataset = load_dataset("hud-evals/OSWorld-Verified", split="train")
results = await run_dataset("Eval", dataset, OperatorAgent)
# From list of dicts
tasks = [{"prompt": "...", "mcp_config": {...}, "setup_tool": {...}}]
results = await run_dataset("Eval", tasks, ClaudeAgent)
```
### fetch\_system\_prompt\_from\_dataset
```python theme={null}
from hud.datasets import fetch_system_prompt_from_dataset
system_prompt = await fetch_system_prompt_from_dataset("hud-evals/SheetBench-50")
```
Fetch `system_prompt.txt` from a HuggingFace dataset repository.
**Returns:** `str | None` - System prompt text if found
**Note:** Requires `huggingface_hub` to be installed.
### save\_tasks
```python theme={null}
from hud.datasets import save_tasks
# IMPORTANT: Pass dicts, not Task objects!
task_dicts = [task.model_dump() for task in tasks]
save_tasks(
tasks=task_dicts,
repo_id="my-org/my-tasks",
private=False,
dataset_name="My Tasks v1.0"
)
```
Save tasks to HuggingFace dataset with JSON string serialization.
**Parameters:**
| Parameter | Type | Description | Default |
| ---------- | ---------------------- | ------------------------------------ | -------- |
| `tasks` | `list[dict[str, Any]]` | Task dictionaries (NOT Task objects) | Required |
| `repo_id` | `str` | HuggingFace repository ID | Required |
| `**kwargs` | `Any` | Additional args for `push_to_hub()` | - |
**Important:** Always pass dictionaries to preserve environment variable templates:
```python theme={null}
# ✅ GOOD - Preserves ${HUD_API_KEY} in the dataset
task_dicts = [task.model_dump() for task in tasks]
save_tasks(task_dicts, "my-dataset")
# ❌ BAD - Would expose resolved API keys!
# save_tasks(tasks, "my-dataset") # Don't do this!
```
**Dataset Schema:**
The function converts complex fields to JSON strings for clean HuggingFace schema:
* `mcp_config` → JSON string
* `setup_tool` → JSON string (if present)
* `evaluate_tool` → JSON string (if present)
* `metadata` → JSON string (if present)
## MCPToolCall Type
```python theme={null}
from hud.types import MCPToolCall
```
Defines a tool invocation with arguments.
**Fields:**
| Field | Type | Description | Default |
| ----------- | ---------------- | ----------------- | -------- |
| `name` | `str` | Tool name to call | Required |
| `arguments` | `dict[str, Any]` | Tool arguments | `{}` |
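For example, the nested 2048 `evaluate` call shown later on this page can be built directly:
```python theme={null}
from hud.types import MCPToolCall

call = MCPToolCall(
    name="evaluate",
    arguments={"name": "max_number", "arguments": {"target": 256}},
)
```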
## Real-World Examples
### Loading Tasks from Datasets
From `examples/run_evaluation.py`:
```python theme={null}
# Load dataset
dataset = load_dataset("hud-evals/SheetBench-50", split="train")
# Get system prompt
dataset_system_prompt = await fetch_system_prompt_from_dataset("hud-evals/SheetBench-50")
# Run single task
sample_task = dataset[0]
task = Task(**sample_task) # Auto-converts dict to Task
# Or run all tasks
results = await run_dataset(
"Full SheetBench Evaluation",
dataset,
ClaudeAgent,
{"model": "claude-3-5-sonnet-20241022"},
max_concurrent=50
)
```
### Task Structure in Datasets
From `environments/text_2048/2048_taskconfigs.json`:
```json theme={null}
{
"id": "2048_target_256",
"prompt": "You are playing 2048 and your goal is to reach the 256 tile.",
"mcp_config": {
"local": {
"command": "docker",
"args": ["run", "--rm", "-i", "hud-text-2048"]
}
},
"setup_tool": {
"name": "setup",
"arguments": {
"name": "board",
"arguments": {"board_size": 4}
}
},
"evaluate_tool": {
"name": "evaluate",
"arguments": {
"name": "max_number",
"arguments": {"target": 256}
}
},
"metadata": {
"difficulty": 256,
"game": "2048"
}
}
```
### Creating and Saving Tasks
```python theme={null}
import os
from uuid import uuid4
from hud.datasets import Task, save_tasks
from hud.types import MCPToolCall
# Set environment variables
os.environ["HUD_API_KEY"] = "sk-hud-..."
os.environ["BROWSER_PROVIDER"] = "anchorbrowser"
# Create tasks with env var templates
tasks = []
for url in ["example.com", "test.com"]:
task = Task(
id=str(uuid4()),
prompt=f"Navigate to {url} and find the login button",
mcp_config={
"browser": {
"url": "${HUD_MCP_URL:https://mcp.hud.ai/v3/mcp}",
"headers": {
"Authorization": "Bearer ${HUD_API_KEY}",
"X-Provider": "${BROWSER_PROVIDER}"
}
}
},
setup_tool=MCPToolCall(
name="setup",
arguments={"name": "navigate", "arguments": {"url": url}}
),
evaluate_tool=MCPToolCall(
name="evaluate",
arguments={"name": "element_exists", "arguments": {"selector": "button.login"}}
),
metadata={
"category": "navigation",
"difficulty": "easy"
}
)
tasks.append(task)
# Save to HuggingFace (preserves env var templates)
task_dicts = [t.model_dump() for t in tasks]
save_tasks(task_dicts, "my-org/navigation-tasks")
```
### Agent Integration
Tasks automatically configure agents:
```python theme={null}
task = Task(
prompt="Complete the form",
mcp_config={...},
setup_tool={...},
evaluate_tool={...},
agent_config={
"system_prompt": "Be very careful with form fields",
"allowed_tools": ["computer", "search"],
"append_setup_output": True
}
)
agent = ClaudeAgent()
# When agent.run(task) is called:
# 1. If no mcp_client, creates one from task.mcp_config
# 2. Applies task.agent_config to agent (system_prompt, allowed_tools, etc.)
# 3. Adds lifecycle tools to agent.lifecycle_tools
# 4. Runs setup_tool (hidden from agent)
# 5. Executes task.prompt
# 6. Runs evaluate_tool (hidden from agent)
# 7. Returns Trace with evaluation reward
```
### Agent Config Options
The `agent_config` field supports the following options:
| Option | Type | Description |
| --------------------- | ----------- | ----------------------------------------------------- |
| `system_prompt` | `str` | Custom system prompt appended to agent's default |
| `allowed_tools` | `list[str]` | Tools the agent can use (replaces `agent_tools`) |
| `disallowed_tools` | `list[str]` | Tools to exclude from the agent |
| `append_setup_output` | `bool` | Include setup output in first message (default: True) |
| `initial_screenshot` | `bool` | Take screenshot before first action (default: True) |
## Best Practices
1. **Use UUIDs for task IDs** - Required for HuggingFace datasets
2. **Save dictionaries, not objects** - Preserves env var templates
3. **Use agent\_config for agent settings** - Centralize agent configuration in one place
4. **Use metadata for filtering** - Category, difficulty, tags
5. **Test locally first** - Before uploading to HuggingFace
6. **Version your datasets** - Use meaningful repo names
## Common Patterns
### Filtering Tasks
```python theme={null}
# Load all tasks
dataset = load_dataset("hud-evals/WebArena-Lite", split="train")
# Filter in Python
login_tasks = [
task for task in dataset
if "login" in task["prompt"].lower()
]
# Convert and run
results = await run_dataset(
"Login Tasks Only",
login_tasks,
ClaudeAgent
)
```
### Custom System Prompts
```python theme={null}
# Override dataset system prompt
results = await run_dataset(
"Custom Evaluation",
"hud-evals/SheetBench-50",
ClaudeAgent,
custom_system_prompt="You are an expert spreadsheet user. Be precise."
)
```
### Environment Variable Management
```python theme={null}
# Tasks preserve templates
task = Task(
prompt="Test",
mcp_config={
"server": {
"api_key": "${API_KEY:default-key}"
}
}
)
# model_dump() preserves the template
data = task.model_dump()
assert data["mcp_config"]["server"]["api_key"] == "${API_KEY:default-key}"
# But task object has resolved value
os.environ["API_KEY"] = "real-key"
task2 = Task(**data)
# task2.mcp_config["server"]["api_key"] is now "real-key"
```
## See Also
* [Task System](/core-concepts/task-system) - Conceptual overview
* [Benchmarks](/evaluate-agents/benchmarks) - Building and running datasets
* [Agents](/reference/agents) - How agents use tasks
# Tools
Source: https://docs.hud.ai/reference/tools
SDK reference for HUD tools and executors
The HUD SDK provides a comprehensive set of tools for environment interaction. All tools inherit from `BaseTool` and return standardized MCP content blocks.
## Base Classes
### BaseTool
```python theme={null}
from hud.tools import BaseTool
```
Abstract base class for all MCP tools. Provides standardized output formatting and MCP registration.
**Constructor Parameters:**
| Parameter | Type | Description | Default |
| ------------- | ----- | --------------------------------------------------- | ------------------- |
| `env` | `Any` | Optional stateful context (game, executor, browser) | `None` |
| `name` | `str` | Tool name for MCP registration | Auto from class |
| `title` | `str` | Human-readable display name | Auto from class |
| `description` | `str` | Tool description | Auto from docstring |
**Abstract Methods:**
```python theme={null}
async def __call__(self, **kwargs: Any) -> list[ContentBlock]
```
**Properties:**
* `mcp` - FastMCP FunctionTool wrapper for registration
**Usage:**
```python theme={null}
from mcp.types import ContentBlock, TextContent

class MyTool(BaseTool):
async def __call__(self, param: str) -> list[ContentBlock]:
result = await self.env.do_something(param)
return [TextContent(text=result, type="text")]
# MCPServer automatically handles BaseTool instances
tool = MyTool(env=my_context)
mcp_server.add_tool(tool) # No need for .mcp property
```
### BaseHub
```python theme={null}
from hud.tools import BaseHub
```
A composition-friendly FastMCP server for organizing tools into nested namespaces.
**Key Features:**
* Internal tool dispatcher pattern
* Automatic resource catalog generation
* Hidden internal tools from MCP clients
**Usage:**
```python theme={null}
from hud.tools.types import EvaluationResult

hub = BaseHub("evaluators")
@hub.tool() # Internal tool, hidden from agents
async def url_match(url: str) -> EvaluationResult:
return EvaluationResult(reward=1.0, done=True)
# Access via dispatcher when mounted on server
# Agents call: evaluators(name="url_match", arguments={"url": "..."})
```
## Core Tools
### BashTool
```python theme={null}
from hud.tools import BashTool
```
Execute bash commands in a persistent shell session.
**Constructor Parameters:**
| Parameter | Type | Description | Default |
| --------- | -------------- | --------------------------- | ------- |
| `session` | `_BashSession` | Pre-configured bash session | `None` |
**Tool Parameters:**
| Parameter | Type | Description | Default |
| --------- | ------ | ------------------------ | -------- |
| `command` | `str` | Bash command to execute | Required |
| `restart` | `bool` | Restart the bash session | `False` |
**Example:**
```python theme={null}
bash = BashTool()
result = await bash(command="ls -la")
# Returns: [TextContent(text="file1.txt\nfile2.txt", type="text")]
# Restart session
await bash(restart=True)
```
### EditTool
```python theme={null}
from hud.tools import EditTool
```
File system editor with undo support.
**Constructor Parameters:**
| Parameter | Type | Description | Default |
| -------------- | ----------------------- | --------------------- | ------- |
| `file_history` | `dict[Path, list[str]]` | Edit history per file | `None` |
**Commands:**
| Command | Parameters | Description |
| ------------- | -------------------------------- | ------------------------------- |
| `view` | `path`, `view_range` | View file or directory contents |
| `create` | `path`, `file_text` | Create new file |
| `str_replace` | `path`, `old_str`, `new_str` | Replace string in file |
| `insert` | `path`, `insert_line`, `new_str` | Insert text at line |
| `undo_edit` | `path` | Undo last edit |
**Example:**
```python theme={null}
editor = EditTool()
# View file
await editor(command="view", path="/path/to/file.py", view_range=[1, 20])
# Create file
await editor(command="create", path="/new/file.py", file_text="print('hello')")
# Replace text
await editor(command="str_replace", path="/file.py",
old_str="old_text", new_str="new_text")
```
### PlaywrightTool
```python theme={null}
from hud.tools import PlaywrightTool
```
Web automation using Playwright browser.
**Constructor Parameters:**
| Parameter | Type | Description | Default |
| --------- | ------ | ---------------------------- | ------- |
| `page` | `Page` | Existing Playwright page | `None` |
| `cdp_url` | `str` | Chrome DevTools Protocol URL | `None` |
**Actions:**
| Action | Parameters | Description |
| ------------------ | ---------------------------- | -------------------------- |
| `navigate` | `url`, `wait_for_load_state` | Navigate to URL |
| `screenshot` | - | Capture page screenshot |
| `click` | `selector` | Click element |
| `type` | `selector`, `text` | Type text in element |
| `get_page_info` | - | Get page title and URL |
| `wait_for_element` | `selector` | Wait for element to appear |
**Example:**
```python theme={null}
tool = PlaywrightTool()
# Navigate
await tool(action="navigate", url="https://example.com")
# Click button
await tool(action="click", selector="button.submit")
# Type text
await tool(action="type", selector="input#search", text="query")
```
### JupyterTool
```python theme={null}
from hud.tools import JupyterTool
```
Execute Python code in a Jupyter kernel.
**Constructor Parameters:**
| Parameter | Type | Description | Default |
| ------------- | ----- | ----------------------------------------------------------------- | ---------------- |
| `url_suffix` | `str` | (Optional) Kernel gateway host:port | "localhost:8888" |
| `kernel_name` | `str` | (Optional) Kernel name | "python3" |
| `kernel_id`   | `str` | (Optional) If set, connect to the existing kernel with this ID    | `""` (empty string) |
**Tool Parameters:**
| Parameter | Type | Description | Default |
| ------------------- | ----- | ---------------------------- | -------- |
| `code` | `str` | Python code to execute | Required |
| `execution_timeout` | `int` | Execution timeout in seconds | 15 |
**Example: Tool Calling**
```python theme={null}
jupyter_tool = JupyterTool()
result = await jupyter_tool(code="print(\"hello world!\")")
# Returns: [TextContent(text="hello world\n", type="text")]
```
**Example: Reuse the same kernel**
```python theme={null}
# When initializing, register the kernel
jupyter_tool = JupyterTool()
JupyterTool.register_shared_kernel("my-jupyter", jupyter_tool.get_kernel_id())
# Reuse the same kernel in other places:
jupyter_tool_reuse = JupyterTool.from_shared_kernel("my-jupyter")
```
## Computer Control Tools
### HudComputerTool
```python theme={null}
from hud.tools import HudComputerTool
```
Universal computer control with automatic scaling and executor selection.
**Constructor Parameters:**
| Parameter | Type | Description | Default |
| ---------------- | -------------------------------- | ------------------- | ----------- |
| `executor` | `BaseExecutor` | Executor to use | Auto-detect |
| `platform_type` | `"auto"`, `"xdo"`, `"pyautogui"` | Executor type | `"auto"` |
| `display_num` | `int` | X display number | From env |
| `width` | `int` | Agent screen width | 1280 |
| `height` | `int` | Agent screen height | 720 |
| `rescale_images` | `bool` | Rescale screenshots | `True` |
**Actions:**
| Action | Parameters | Description |
| ------------ | ------------------------------------------ | --------------------- |
| `screenshot` | - | Capture screen |
| `click` | `x`, `y`, `button`, `pattern`, `hold_keys` | Mouse click |
| `write` | `text`, `enter_after`, `delay` | Type text |
| `press` | `keys` | Press key combination |
| `scroll` | `x`, `y`, `scroll_x`, `scroll_y` | Scroll |
| `drag` | `start_x`, `start_y`, `end_x`, `end_y` | Drag mouse |
**Example:**
```python theme={null}
computer = HudComputerTool()
# Take screenshot
await computer(action="screenshot")
# Click at coordinates
await computer(action="click", x=100, y=200)
# Type text
await computer(action="write", text="Hello World", enter_after=True)
# Press hotkey
await computer(action="press", keys=["ctrl", "c"])
```
### AnthropicComputerTool
```python theme={null}
from hud.tools import AnthropicComputerTool
```
Computer control optimized for Anthropic's Claude models.
**Features:**
* Pre-configured for 1280x720 resolution
* Optimized action names for Claude
* Built-in screenshot scaling
### OpenAIComputerTool
```python theme={null}
from hud.tools import OpenAIComputerTool
```
Computer control optimized for OpenAI models.
**Features:**
* Pre-configured for 1920x1080 resolution
* Simplified action interface
* No automatic screenshot scaling
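Both variants are `BaseTool` instances, so registering them works the same way as any other tool. A minimal sketch (the server name is hypothetical):
```python theme={null}
from hud.server import MCPServer
from hud.tools import AnthropicComputerTool, OpenAIComputerTool

mcp = MCPServer(name="computer-env")  # hypothetical name
# Each variant exposes the action schema its model family expects
mcp.add_tool(AnthropicComputerTool())
mcp.add_tool(OpenAIComputerTool())
```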
## Executors
Executors provide platform-specific implementations for computer control actions.
### BaseExecutor
```python theme={null}
from hud.tools.executors import BaseExecutor
```
Abstract base providing simulation mode for all actions.
**Core Methods:**
* `click(x, y, button, pattern, hold_keys)` - Mouse click
* `write(text, enter_after, delay)` - Type text
* `press(keys)` - Press key combination
* `scroll(x, y, scroll_x, scroll_y)` - Scroll
* `drag(start_x, start_y, end_x, end_y)` - Drag mouse
* `screenshot()` - Capture screen
* `get_screen_size()` - Get display dimensions
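Because unimplemented actions fall back to simulation, `BaseExecutor` is a convenient base for custom executors. A sketch of a hypothetical executor that logs clicks to stderr before delegating (the parameter defaults are assumptions based on the method list above):
```python theme={null}
import sys

from hud.tools.executors import BaseExecutor

class LoggingExecutor(BaseExecutor):  # hypothetical subclass
    async def click(self, x=None, y=None, button="left", pattern=None, hold_keys=None):
        # Log to stderr; stdout is reserved for MCP traffic
        print(f"click at ({x}, {y}) button={button}", file=sys.stderr)
        return await super().click(x=x, y=y, button=button, pattern=pattern, hold_keys=hold_keys)
```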
### PyAutoGUIExecutor
```python theme={null}
from hud.tools.executors import PyAutoGUIExecutor
```
Cross-platform executor using PyAutoGUI library.
**Features:**
* Works on Windows, macOS, Linux
* Real mouse/keyboard control
* Screenshot capture
* Automatic failsafe
**Example:**
```python theme={null}
executor = PyAutoGUIExecutor()
computer = HudComputerTool(executor=executor)
```
### XDOExecutor
```python theme={null}
from hud.tools.executors import XDOExecutor
```
Linux/X11 executor using xdotool.
**Features:**
* Native X11 integration
* Faster than PyAutoGUI on Linux
* Support for X display selection
* Window management capabilities
**Example:**
```python theme={null}
executor = XDOExecutor(display_num=1)
computer = HudComputerTool(executor=executor)
```
## Common Types
### ContentBlock
MCP standard output format (from `mcp.types`):
```python theme={null}
from mcp.types import TextContent, ImageContent
# Text output
TextContent(text="Operation complete", type="text")
# Image output
ImageContent(data="base64_data", mimeType="image/png", type="image")
```
### EvaluationResult
```python theme={null}
from hud.tools.types import EvaluationResult
result = EvaluationResult(
reward=0.8, # Score 0-1
done=True, # Task complete
content="Details", # Optional text
info={"score": 80} # Metadata
)
```
### ContentResult
```python theme={null}
from hud.tools.types import ContentResult
# Helper for building complex outputs
result = ContentResult(
output="Success message",
error="Error if any",
base64_image="screenshot_data",
system="System message"
)
# Convert to ContentBlocks
blocks = result.to_content_blocks()
```
## Integration Examples
### Adding Tools to Environment
```python theme={null}
from hud.server import MCPServer
from hud.tools import BashTool, EditTool
mcp = MCPServer(name="my-env")
# MCPServer handles BaseTool instances automatically
bash = BashTool()
mcp.add_tool(bash) # Internally uses bash.mcp
editor = EditTool()
mcp.add_tool(editor) # Same here
```
### Custom Tool Implementation
```python theme={null}
from hud.tools import BaseTool
from mcp.types import ContentBlock, TextContent
class DatabaseTool(BaseTool):
def __init__(self, db_connection):
super().__init__(
env=db_connection,
name="database",
title="Database Query Tool",
description="Execute SQL queries"
)
async def __call__(self, query: str) -> list[ContentBlock]:
try:
results = await self.env.execute(query)
return [TextContent(text=str(results), type="text")]
except Exception as e:
return [TextContent(text=f"Error: {e}", type="text")]
```
### Hub Pattern for Evaluators
```python theme={null}
from hud.tools import BaseHub
from hud.tools.types import EvaluationResult
evaluators = BaseHub("evaluate")
@evaluators.tool("text_contains")
async def check_text(text: str, target: str) -> EvaluationResult:
return EvaluationResult(
reward=1.0 if target in text else 0.0,
done=True,
content=f"Checking if '{target}' in text"
)
# Use in environment
@mcp.tool()
async def evaluate(name: str, **kwargs):
return await evaluators.call_tool(name, kwargs)
```
### Callback Functions
Callback functions let you monitor certain kinds of actions and hook behavior onto them. You can manage callbacks with three main methods:
```python theme={null}
add_callback(self, event_type: str, callback: Callable)
remove_callback(self, event_type: str, callback: Callable)
_trigger_callbacks(self, event_type: str, **kwargs)
```
*Special note*: callback functions must be defined with `async def`.
For example, to take a screenshot every time the webpage changes in `PlaywrightTool`:
```python theme={null}
import base64
from typing import Any

from hud.tools import PlaywrightTool

class MyPlaywrightTool(PlaywrightTool):
    def __init__(self, context: Any = None, cdp_url: str | None = None) -> None:
        super().__init__(cdp_url=cdp_url)
        self.screenshots: list[str] = []
self.add_callback("webpage_change", self._on_webpage_change)
async def _on_webpage_change(self, **kwargs) -> None:
"""Callback to capture screenshots on webpage changes"""
try:
await self._ensure_browser()
if self.page:
screenshot_bytes = await self.page.screenshot(full_page=True)
screenshot_b64 = base64.b64encode(screenshot_bytes).decode("utf-8")
self.screenshots.append(screenshot_b64)
        except Exception:
            pass  # never let a screenshot failure break the main action
async def navigate(self, url: str, wait_for_load_state="networkidle"):
result = await super().navigate(url, wait_for_load_state)
# Trigger callbacks with event data
await self._trigger_callbacks("webpage_change")
return result
async def click(self, selector: str, **kwargs):
result = await super().click(selector, **kwargs)
await self._trigger_callbacks("webpage_change")
return result
```
## See Also
* [Build Environments](/build-environments) - Using tools in environments
* [Architecture](/core-concepts/architecture) - How tools fit in HUD
* [Adapting Software](/build-environments/adapting-software) - Tool patterns
# RL Quickstart
Source: https://docs.hud.ai/train-agents/quickstart
## Prerequisites
* HUD API key: Remote training requires authentication. Set `HUD_API_KEY` before running:
```bash theme={null}
export HUD_API_KEY="sk-hud-..." # get one at https://hud.ai
# Or persist it locally:
hud set HUD_API_KEY=sk-hud-...
```
* Docker daemon: For local runs (using `--local`) or when training against a local Docker image, ensure Docker Desktop is installed and the Docker daemon is running.
## Quickstart
Install and download a taskset:
```bash theme={null}
uv tool install hud-python
hud get hud-evals/2048-basic
```
### 1) Simple: Train (remote by default)
```bash theme={null}
hud rl 2048-basic.json
```
This launches training remotely and automatically provisions a vLLM server and a trainer for you. You can monitor progress on [https://hud.ai](https://hud.ai). The server persists between runs, so you can rerun training or evaluate against the same endpoint.
Optional baseline first (Claude or Operator):
```bash theme={null}
hud eval 2048-basic.json
```
### 2) Run on your own hardware
Use any provider with at least 2 GPUs (one for inference, one for training) and pass the `--local` flag:
```bash theme={null}
uv tool install hud-python
hud get hud-evals/2048-basic
hud rl 2048-basic.json --local
```
### Recommended setups
* 2× A100: quick iteration, shorter runs
* 8× A100: higher throughput for larger tasksets
Training throughput depends on task complexity and parallelism (`max_parallel_episodes`).
### 3) Build your own environment (hud init)
Create a new MCP environment, develop with hot-reload, and train on a production image:
```bash theme={null}
hud init my-env && cd my-env
hud dev --interactive
# When ready to run:
hud rl
```
Edit `tasks.json` to include the other tasks you want to train on.
See [hud init](/reference/cli/init) for options and details.
## Getting the best performance
Training a good model often requires many iterations over the trainer's parameters. Take the config generated by `hud rl` and vary its values to run a hyperparameter sweep.
For easy launching, specify the tasks and config upfront, and add `--yes` to launch vLLM and training automatically.
```bash theme={null}
hud rl taskset.json --config rl-config.json --yes
```
Additionally, it can be helpful to run an initial analysis on the dataset to determine which tasks are most informative to train on. In that case, either start with a deployed model or run `hud rl` without training, and then:
```bash theme={null}
hud eval taskset.json --full --group-size 6 --max-steps 5
```
This will prompt you for the model choice and produce a table of per-task accuracies. Prefer tasks that score between 10% and 60% for training.
Some general findings from our internal training runs:
* Use as many different tasks per gradient update as possible (runs with 4+ GPUs and a batch size of 50+ are much more stable than single-GPU runs).
* Set the batch size to roughly 2/X, where X is the untrained model's accuracy on the task; for example, a task the untrained model solves 10% of the time (X = 0.1) suggests a batch size of about 20.
### Pricing
Below is pricing by GPU type. Actual prices vary; see [https://hud.ai/project/billing](https://hud.ai/project/billing) for current rates.
**vLLM GPU pricing (2 hosted GPUs)**
| GPU type | Memory | Est. price/hr |
| --------- | ------ | ------------- |
| A100 80GB | 80 GB | \$4.95 |
| H100 80GB | 80 GB | \$7.95 |
**Training GPU pricing**
| GPU type | Memory | Est. price/hr |
| --------- | ------ | ------------- |
| A100 80GB | 80 GB | \$3.95 |
| H100 80GB | 80 GB | \$5.40 |
***
### Learn more
* **Build environments**: complete guide to building environments from scratch
* **`hud rl`**: full command options and usage
# Dataset Design
Source: https://docs.hud.ai/train-agents/tasks
## Tasks format
HUD tasksets can be provided in two formats, both supported:
1. A single JSON file containing a list of task objects (recommended)
```json theme={null}
[
{
"id": "browser_2048_128",
"prompt": "Reach 128 in 2048.",
"mcp_config": {
"hud": {
"url": "https://mcp.hud.ai/v3/mcp",
"headers": {
"Authorization": "Bearer ${HUD_API_KEY}",
"Mcp-Image": "hudevals/hud-browser:0.1.3"
}
}
},
"setup_tool": {"name": "launch_app", "arguments": {"app_name": "2048"}},
"evaluate_tool": {"name": "evaluate", "arguments": {"name": "game_2048_max_number", "arguments": {"target": 128}}}
}
]
```
Save as `2048-basic.json` and run:
```bash theme={null}
hud eval 2048-basic.json
hud rl 2048-basic.json
```
2. A JSONL file with one task object per line (see the one-line sketch after the field list below). Each task object uses the same fields:
* prompt: instruction for the agent
* mcp\_config: where to run the environment (local docker or remote MCP)
* setup\_tool (optional): a tool call to prepare the environment
* evaluate\_tool: a tool call to compute reward
* system\_prompt (optional): extra guidance for the agent
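A single JSONL line carries the same fields as the JSON example above, collapsed onto one line:
```json theme={null}
{"id": "browser_2048_128", "prompt": "Reach 128 in 2048.", "mcp_config": {"hud": {"url": "https://mcp.hud.ai/v3/mcp", "headers": {"Authorization": "Bearer ${HUD_API_KEY}", "Mcp-Image": "hudevals/hud-browser:0.1.3"}}}, "evaluate_tool": {"name": "evaluate", "arguments": {"name": "game_2048_max_number", "arguments": {"target": 128}}}}
```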
## Hosting on HuggingFace
You can host tasksets on the Hub and fetch them with:
```bash theme={null}
hud get hud-evals/2048-basic
```
The command downloads the JSONL task file and places it in your project directory.
This lets you run the full dataset or train on it directly:
```bash theme={null}
hud eval hud-evals/2048-basic
hud rl hud-evals/2048-basic
```
## Tips
* Keep tasks self-contained; use `setup_tool` to open apps or load data
* Ensure `evaluate_tool` returns a numeric reward per episode
* Use small task counts to iterate quickly; scale up once stable
### Learn more
* Learn how to run benchmarks
* Deep-dive into MCP configs and tools