This page covers how to build datasets, run evaluations, and publish results.

Quick Start

  • CLI
# Run a local tasks file (interactive agent selection)
hud eval tasks.json

# Run a hosted dataset with Claude
hud eval hud-evals/SheetBench-50 claude --full
  • SDK (Context Manager)
import hud

# Single task evaluation
async with hud.eval("hud-evals/SheetBench-50:0") as ctx:
    agent = MyAgent()
    result = await agent.run(ctx)
    ctx.reward = result.reward

# All tasks with variants
async with hud.eval(
    "hud-evals/SheetBench-50:*",
    variants={"model": ["claude-sonnet", "gpt-4o"]},
    group=3,
    max_concurrent=50,
) as ctx:
    agent = create_agent(model=ctx.variants["model"])
    result = await agent.run(ctx)
    ctx.reward = result.reward
  • SDK (Batch Execution)
from hud.datasets import run_tasks
from hud.types import AgentType
from hud.utils.tasks import load_tasks

tasks = load_tasks("hud-evals/SheetBench-50")
results = await run_tasks(
    tasks=tasks,
    agent_type=AgentType.CLAUDE,
    max_concurrent=50,
)
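
The SDK snippets above use top-level await for brevity; in a standalone script they need an async entry point. A minimal sketch, reusing only the calls shown above:

import asyncio

from hud.datasets import run_tasks
from hud.types import AgentType
from hud.utils.tasks import load_tasks

async def main():
    # Same batch call as above, wrapped so it can run as a plain script
    tasks = load_tasks("hud-evals/SheetBench-50")
    return await run_tasks(
        tasks=tasks,
        agent_type=AgentType.CLAUDE,
        max_concurrent=50,
    )

if __name__ == "__main__":
    results = asyncio.run(main())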

Build Benchmarks

Explore Evaluators

List the tools an environment image exposes, including its setup and evaluate functions, before writing tasks against it:

hud analyze hudpython/hud-remote-browser:latest

Create Tasks

Each task is a dictionary with a prompt, an mcp_config pointing at the environment image, setup and evaluate tool calls, and optional metadata:

import uuid

web_tasks = []
web_tasks.append({
    "id": str(uuid.uuid4()),
    "prompt": "Navigate to the documentation page",
    "mcp_config": {
        "hud": {
            "url": "https://mcp.hud.ai/v3/mcp",
            "headers": {
                "Authorization": "Bearer ${HUD_API_KEY}",
                "Mcp-Image": "hudpython/hud-remote-browser:latest"
            }
        }
    },
    "setup_tool": {"name": "setup", "arguments": {"name": "navigate", "arguments": {"url": "https://example.com"}}},
    "evaluate_tool": {"name": "evaluate", "arguments": {"name": "url_match", "arguments": {"pattern": ".*/docs.*"}}},
    "metadata": {"difficulty": "easy", "category": "navigation"}
})
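
The same dictionary shape can be generated in a loop. The sketch below adds a few more navigation tasks from placeholder targets; the descriptions and URL patterns are hypothetical, not from a real benchmark:

# Hypothetical targets; the start page stays https://example.com, as in the task above
targets = [
    ("the pricing page", ".*/pricing.*", "easy"),
    ("the contact form", ".*/contact.*", "medium"),
]

for description, pattern, difficulty in targets:
    web_tasks.append({
        "id": str(uuid.uuid4()),
        "prompt": f"Navigate to {description}",
        "mcp_config": {
            "hud": {
                "url": "https://mcp.hud.ai/v3/mcp",
                "headers": {
                    "Authorization": "Bearer ${HUD_API_KEY}",
                    "Mcp-Image": "hudpython/hud-remote-browser:latest"
                }
            }
        },
        # Reset to the start page, then check the final URL against the pattern
        "setup_tool": {
            "name": "setup",
            "arguments": {"name": "navigate", "arguments": {"url": "https://example.com"}}
        },
        "evaluate_tool": {
            "name": "evaluate",
            "arguments": {"name": "url_match", "arguments": {"pattern": pattern}}
        },
        "metadata": {"difficulty": difficulty, "category": "navigation"}
    })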

Save to HuggingFace

from hud.utils.tasks import save_tasks

save_tasks(
    web_tasks,
    repo_id="web-navigation-benchmark",
    private=False,
    tags=["web", "navigation", "automation"],
)
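
To confirm the upload round-trips, load the dataset back with load_tasks (a sketch, assuming the repo resolves under your HuggingFace namespace and load_tasks returns a list):

from hud.utils.tasks import load_tasks

# Reload from HuggingFace and sanity-check the task count
tasks = load_tasks("web-navigation-benchmark")
print(f"Loaded {len(tasks)} tasks")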

Leaderboards

After running, visit your dataset leaderboard and publish a scorecard:
from hud.datasets import run_tasks
from hud.types import AgentType
from hud.utils.tasks import load_tasks

tasks = load_tasks("hud-evals/SheetBench-50")
results = await run_tasks(
    tasks=tasks,
    agent_type=AgentType.CLAUDE,
    name="Claude Sonnet SheetBench",
)

# Open https://hud.ai/leaderboards/hud-evals/SheetBench-50
# Click "My Jobs" to see runs and create a scorecard

Best Practices

  • Clear, measurable prompts (binary or graded)
  • Isolated task state and deterministic setup
  • Use metadata tags (category, difficulty)
  • Validate locally, then parallelize (see the sketch after this list)
  • Version datasets; include a system_prompt.txt
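
For "validate locally, then parallelize", a minimal sketch using run_tasks from above (assuming load_tasks returns a plain list that can be sliced):

from hud.datasets import run_tasks
from hud.types import AgentType
from hud.utils.tasks import load_tasks

tasks = load_tasks("hud-evals/SheetBench-50")

# Smoke test: a few tasks, one at a time, to surface setup/evaluate errors early
await run_tasks(tasks=tasks[:3], agent_type=AgentType.CLAUDE, max_concurrent=1)

# Full run once the smoke test looks right
await run_tasks(tasks=tasks, agent_type=AgentType.CLAUDE, max_concurrent=50)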
