Tasks define what agents should do and how to measure success. They combine a prompt, environment configuration, and optional setup/evaluation phases.
Lifecycle

A task moves through up to three phases when an agent runs it: an optional setup step (setup_tool), the main prompt the agent works on, and an optional evaluation step (evaluate_tool) that produces the reward.
Task Structure
from hud.datasets import Task
import uuid

task = Task(
    # Required fields
    prompt="Navigate to the login page and sign in as [email protected]",
    mcp_config={
        "hud": {
            "url": "https://mcp.hud.ai/v3/mcp",
            "headers": {
                "Authorization": "Bearer ${HUD_API_KEY}",
                "Mcp-Image": "hudpython/hud-browser:latest"
            }
        }
    },

    # Optional fields
    id=str(uuid.uuid4()),  # Required for HuggingFace datasets
    system_prompt="You are an expert web automation agent. Always verify page loads before interacting with elements.",
    setup_tool={
        "name": "playwright",
        "arguments": {
            "action": "navigate",
            "url": "https://example.com"
        }
    },
    evaluate_tool={
        "name": "evaluate",
        "arguments": {
            "name": "url_contains",
            "substring": "/dashboard"
        }
    },
    metadata={"category": "authentication", "difficulty": "easy"}
)
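Only the prompt and mcp_config fields are required, so a minimal task can omit everything else (the prompt below is just an illustrative example):

minimal_task = Task(
    prompt="Open https://example.com and report the page title",
    mcp_config={
        "hud": {
            "url": "https://mcp.hud.ai/v3/mcp",
            "headers": {
                "Authorization": "Bearer ${HUD_API_KEY}",
                "Mcp-Image": "hudpython/hud-browser:latest"
            }
        }
    },
)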
Environment Variables
# In dataset JSON:
{
    "prompt": "Complete the TODO list",
    "mcp_config": {
        "hud": {
            "url": "https://mcp.hud.ai/v3/mcp",
            "headers": {
                "Authorization": "Bearer ${HUD_API_KEY}",
                "Mcp-Image": "${BROWSER_IMAGE}"
            }
        }
    }
}

# When loaded:
task = Task(**task_dict)  # Variables resolved here!
# Now task.mcp_config["hud"]["headers"]["Authorization"] = "Bearer sk-hud-..."
This enables:
- Public datasets without exposing secrets
- Environment-specific configurations
- CI/CD pipelines with different credentials
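For local runs, those variables usually come from your shell or a .env file. The sketch below sets them in-process purely for illustration, assuming (as shown above) that substitution happens when the Task is constructed:

import os

from hud.datasets import Task

# Illustration only: in practice, export these in your shell or load them from .env.
os.environ["HUD_API_KEY"] = "sk-hud-..."
os.environ["BROWSER_IMAGE"] = "hudpython/hud-browser:latest"

task_dict = {
    "prompt": "Complete the TODO list",
    "mcp_config": {
        "hud": {
            "url": "https://mcp.hud.ai/v3/mcp",
            "headers": {
                "Authorization": "Bearer ${HUD_API_KEY}",
                "Mcp-Image": "${BROWSER_IMAGE}"
            }
        }
    }
}

task = Task(**task_dict)  # ${HUD_API_KEY} and ${BROWSER_IMAGE} are substituted here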
Running Tasks
# Agent automatically handles all phases
result = await agent.run(task)
print(f"Success: {result.reward}") # 0.0 to 1.0
The agent will:
- Execute `setup_tool` if provided
- Work on the `prompt` using available tools
- Execute `evaluate_tool` to calculate reward
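If you need a binary pass/fail out of the scalar reward, you have to pick a threshold yourself; the cutoff below is illustrative, not something the SDK defines:

result = await agent.run(task)

# reward ranges from 0.0 to 1.0; here only full credit counts as a pass (arbitrary choice).
passed = result.reward is not None and result.reward >= 1.0
print(f"{'PASS' if passed else 'FAIL'} (reward={result.reward})")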
Working with Datasets
Tasks integrate with HuggingFace datasets:
from hud.datasets import run_tasks
from hud.types import AgentType
from hud.utils.tasks import load_tasks

# Load tasks from HuggingFace
tasks = load_tasks("hud-evals/SheetBench-50")

# Run agent on all tasks with automatic parallelization
results = await run_tasks(
    tasks=tasks,
    agent_type=AgentType.CLAUDE,
    name="SheetBench Run",
)
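run_tasks returns one result per task; assuming each result exposes the same reward field as a single agent.run call, you can aggregate them directly:

# Summarize the run (assumes each result has a .reward like agent.run above).
rewards = [r.reward or 0.0 for r in results]
print(f"{len(rewards)} tasks, mean reward = {sum(rewards) / len(rewards):.2f}")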
Creating Datasets
from hud.datasets import save_tasks

# Create task dictionaries (NOT Task objects!)
task_dicts = [
    {
        "prompt": "Navigate to the login page",
        "mcp_config": {
            "hud": {
                "url": "${MCP_URL}",
                "headers": {"Authorization": "Bearer ${HUD_API_KEY}"}
            }
        },
        "setup_tool": {"name": "playwright", "arguments": {"action": "navigate", "url": "https://example.com"}},
        "evaluate_tool": {"name": "url_match", "arguments": {"pattern": ".*/login"}}
    },
    # More task dicts...
]

# Save to HuggingFace (preserves ${VAR} templates)
save_tasks(task_dicts, "my-benchmark")
Always save dictionaries, not Task objects. Task objects have already resolved environment variables!
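To sanity-check a saved dataset, you can load it back with the same helper used for SheetBench above. Note that if the loaded entries are constructed as Task objects, the ${VAR} templates resolve at that point, so the referenced variables must be set in your environment:

from hud.utils.tasks import load_tasks

tasks = load_tasks("my-benchmark")  # same name passed to save_tasks above
print(f"Loaded {len(tasks)} tasks")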
Best Practices
- Use UUIDs: Always include `id=str(uuid.uuid4())` for dataset tasks
- Clear Prompts: Be specific about success criteria
- Template Variables: Use `${VAR}` syntax for shareable configs
- Rich Tools: Include both `name` and `arguments` in tool definitions
Next Steps