This page covers how to build datasets, run evaluations, and publish results.

Quick Start

  • CLI
# Run a local tasks file (interactive agent selection)
hud eval tasks.json

# Run a hosted dataset with Claude
hud eval hud-evals/SheetBench-50 claude --full
  • SDK (Context Manager)
import hud

# Single task evaluation
async with hud.eval("hud-evals/SheetBench-50:0") as ctx:
    agent = MyAgent()
    result = await agent.run(ctx)
    ctx.reward = result.reward

# All tasks with variants
async with hud.eval(
    "hud-evals/SheetBench-50:*",
    variants={"model": ["claude-sonnet", "gpt-4o"]},
    group=3,
    max_concurrent=50,
) as ctx:
    agent = create_agent(model=ctx.variants["model"])
    result = await agent.run(ctx)
    ctx.reward = result.reward
  • SDK (Batch Execution)
from hud.datasets import run_tasks
from hud.types import AgentType
from hud.utils.tasks import load_tasks

tasks = load_tasks("hud-evals/SheetBench-50")
results = await run_tasks(
    tasks=tasks,
    agent_type=AgentType.CLAUDE,
    max_concurrent=50,
)
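
The SDK snippets above use top-level await for brevity; in a standalone script they need an async entry point. A minimal sketch, reusing only the calls shown above:

import asyncio

from hud.datasets import run_tasks
from hud.types import AgentType
from hud.utils.tasks import load_tasks

async def main():
    # Same batch call as above, wrapped so it can run as a plain script
    tasks = load_tasks("hud-evals/SheetBench-50")
    return await run_tasks(
        tasks=tasks,
        agent_type=AgentType.CLAUDE,
        max_concurrent=50,
    )

if __name__ == "__main__":
    results = asyncio.run(main())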

Build Benchmarks

Explore Evaluators

List the tools an environment image exposes, including its setup and evaluate functions, before writing tasks against it:

hud analyze hudpython/hud-remote-browser:latest

Create Tasks

Each task is a dictionary with a prompt, an mcp_config pointing at the environment image, setup and evaluate tool calls, and optional metadata:

import uuid

web_tasks = []
web_tasks.append({
    "id": str(uuid.uuid4()),
    "prompt": "Navigate to the documentation page",
    "mcp_config": {
        "hud": {
            "url": "https://mcp.hud.ai/v3/mcp",
            "headers": {
                "Authorization": "Bearer ${HUD_API_KEY}",
                "Mcp-Image": "hudpython/hud-remote-browser:latest"
            }
        }
    },
    "setup_tool": {"name": "setup", "arguments": {"name": "navigate", "arguments": {"url": "https://example.com"}}},
    "evaluate_tool": {"name": "evaluate", "arguments": {"name": "url_match", "arguments": {"pattern": ".*/docs.*"}}},
    "metadata": {"difficulty": "easy", "category": "navigation"}
})
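
The same dictionary shape can be generated in a loop. The sketch below adds a few more navigation tasks from placeholder targets; the descriptions and URL patterns are hypothetical, not from a real benchmark:

# Hypothetical targets; the start page stays https://example.com, as in the task above
targets = [
    ("the pricing page", ".*/pricing.*", "easy"),
    ("the contact form", ".*/contact.*", "medium"),
]

for description, pattern, difficulty in targets:
    web_tasks.append({
        "id": str(uuid.uuid4()),
        "prompt": f"Navigate to {description}",
        "mcp_config": {
            "hud": {
                "url": "https://mcp.hud.ai/v3/mcp",
                "headers": {
                    "Authorization": "Bearer ${HUD_API_KEY}",
                    "Mcp-Image": "hudpython/hud-remote-browser:latest"
                }
            }
        },
        # Reset to the start page, then check the final URL against the pattern
        "setup_tool": {
            "name": "setup",
            "arguments": {"name": "navigate", "arguments": {"url": "https://example.com"}}
        },
        "evaluate_tool": {
            "name": "evaluate",
            "arguments": {"name": "url_match", "arguments": {"pattern": pattern}}
        },
        "metadata": {"difficulty": difficulty, "category": "navigation"}
    })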

Save to HuggingFace

from hud.utils.tasks import save_tasks

save_tasks(
    web_tasks,
    repo_id="web-navigation-benchmark",
    private=False,
    tags=["web", "navigation", "automation"],
)
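
To confirm the upload round-trips, load the dataset back with load_tasks (a sketch, assuming the repo resolves under your HuggingFace namespace and load_tasks returns a list):

from hud.utils.tasks import load_tasks

# Reload from HuggingFace and sanity-check the task count
tasks = load_tasks("web-navigation-benchmark")
print(f"Loaded {len(tasks)} tasks")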

Leaderboards

After running, visit your dataset leaderboard and publish a scorecard:
from hud.datasets import run_tasks
from hud.types import AgentType
from hud.utils.tasks import load_tasks

tasks = load_tasks("hud-evals/SheetBench-50")
results = await run_tasks(
    tasks=tasks,
    agent_type=AgentType.CLAUDE,
    name="Claude Sonnet SheetBench",
)

# Open https://hud.ai/leaderboards/hud-evals/SheetBench-50
# Click "My Jobs" to see runs and create a scorecard

Best Practices

  • Clear, measurable prompts (binary or graded)
  • Isolated task state and deterministic setup
  • Use metadata tags (category, difficulty)
  • Validate locally, then parallelize (see the sketch after this list)
  • Version datasets; include a system_prompt.txt
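
For "validate locally, then parallelize", a minimal sketch using run_tasks from above (assuming load_tasks returns a plain list that can be sliced):

from hud.datasets import run_tasks
from hud.types import AgentType
from hud.utils.tasks import load_tasks

tasks = load_tasks("hud-evals/SheetBench-50")

# Smoke test: a few tasks, one at a time, to surface setup/evaluate errors early
await run_tasks(tasks=tasks[:3], agent_type=AgentType.CLAUDE, max_concurrent=1)

# Full run once the smoke test looks right
await run_tasks(tasks=tasks, agent_type=AgentType.CLAUDE, max_concurrent=50)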
