HUD Documentation — Evaluations and RL Environments.

The hud eval command runs an agent on a tasks file or HuggingFace dataset.

Local Execution Dependencies: Running Claude or Gemini agents locally requires additional packages:

uv add "hud-python[agents]"

This is not needed for --remote execution, which runs on HUD infrastructure.

Usage

hud eval [SOURCE] [AGENT] [OPTIONS]

Arguments

source (string)

HuggingFace dataset (e.g., hud-evals/SheetBench-50) or task JSON/JSONL file.

agent (string)

Agent to use: claude, openai, operator, gemini, gemini_cua, openai_compatible. If omitted, an interactive preset selector appears.

Options

Execution Mode

--full (boolean, default: false)

Run the entire dataset. Without this flag, only the first task runs (debug mode).

--remote (boolean, default: false)

Submit tasks to the HUD platform for remote execution. This is fire-and-forget: the job is submitted and you monitor it at hud.ai.

--task-ids (string)

Comma-separated task IDs to run (e.g., task_1,task_5). Overrides --full.

--group-size (integer, default: 1)

Number of times to run each task (for variance estimation).
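
The way these flags interact can be sketched in Python. This is an illustration of the rules stated above, not HUD's implementation: the function name select_runs and the assumption that each task is a dict with an "id" key are hypothetical.

```python
def select_runs(tasks, full=False, task_ids=None, group_size=1):
    """Return the task IDs to run, one entry per run.

    Hypothetical helper: assumes each task is a dict with an "id" key.
    """
    if task_ids:
        # --task-ids overrides --full: keep only the requested tasks.
        wanted = set(task_ids)
        selected = [t for t in tasks if t["id"] in wanted]
    elif full:
        # --full: run every task in the dataset.
        selected = list(tasks)
    else:
        # Default debug mode: only the first task.
        selected = tasks[:1]
    # --group-size repeats each selected task that many times.
    return [t["id"] for t in selected for _ in range(group_size)]


tasks = [{"id": f"task_{i}"} for i in range(1, 6)]
print(select_runs(tasks))                                            # ['task_1']
print(select_runs(tasks, full=True, task_ids=["task_1", "task_5"]))  # ['task_1', 'task_5']
print(len(select_runs(tasks, full=True, group_size=3)))              # 15
```

With five tasks, --full and --group-size 3 give 15 runs, which is worth keeping in mind when estimating cost.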

Agent Configuration

--model, -m (string)

Model/checkpoint name (e.g., claude-sonnet-4-5, gpt-5).

--config, -c (string)

Agent config overrides as key=value. Supports namespaced keys like claude.max_tokens=32768.

--allowed-tools (string)

Comma-separated tools to expose to the agent.

--disallowed-tools (string)

Comma-separated tools to hide from the agent.
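
To make the key=value form of --config concrete, here is a minimal Python sketch of how one override could be split into a namespace, a key, and a value. It is not taken from the HUD source: the helper name parse_override is hypothetical, and two behaviors are assumptions made for illustration only, namely that keys without a namespace apply to the selected agent and that numeric-looking values are converted to numbers.

```python
def parse_override(item, agent):
    """Split one 'key=value' override into (namespace, key, value).

    Hypothetical helper. Assumes keys without a namespace apply to the
    selected agent, and that numeric-looking values are converted.
    """
    key, sep, raw = item.partition("=")
    if not sep or not key:
        raise ValueError(f"expected key=value, got {item!r}")
    namespace, dot, name = key.partition(".")
    if not dot:
        # No "namespace." prefix: fall back to the selected agent.
        namespace, name = agent, key
    # Best-effort conversion: int, then float, else keep the string.
    for cast in (int, float):
        try:
            return namespace, name, cast(raw)
        except ValueError:
            pass
    return namespace, name, raw


print(parse_override("claude.max_tokens=32768", agent="claude"))
print(parse_override("temperature=0.7", agent="openai"))
print(parse_override("base_url=http://localhost:8000/v1", agent="openai_compatible"))
```

Splitting on the first "=" only means values that themselves contain "=" or ":" (such as URLs with query strings) are kept intact.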

Execution Limits

--max-concurrent (integer, default: 30)

Maximum concurrent tasks (local execution only).

--max-steps (integer)

Maximum steps per task. Default: 10 (single task) or 100 (--full).

--auto-respond (boolean)

Use ResponseAgent to decide when to stop/continue. Default: True for --full.
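
The mode-dependent defaults for --max-steps and --auto-respond can be summarized in a short Python sketch. The function name resolve_limits is hypothetical, and the value used for auto-respond without --full (off) is an assumption, since this page only states the default for --full.

```python
def resolve_limits(full, max_steps=None, auto_respond=None):
    """Fill in mode-dependent defaults when the flags are not given."""
    if max_steps is None:
        # Documented defaults: 10 for a single task, 100 with --full.
        max_steps = 100 if full else 10
    if auto_respond is None:
        # Documented: True for --full. Assumption: off otherwise.
        auto_respond = full
    return {"max_steps": max_steps, "auto_respond": auto_respond}


print(resolve_limits(full=False))               # {'max_steps': 10, 'auto_respond': False}
print(resolve_limits(full=True))                # {'max_steps': 100, 'auto_respond': True}
print(resolve_limits(full=True, max_steps=25))  # {'max_steps': 25, 'auto_respond': True}
```

An explicit value always wins over the mode-dependent default.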

Output & Confirmation

--verbose, -v (boolean, default: false)

Enable verbose agent output.

--very-verbose, -vv (boolean, default: false)

Enable debug-level logs.

--yes, -y (boolean, default: false)

Skip confirmation prompt.

Configuration File

hud eval supports a .hud_eval.toml config file. Settings are merged, with CLI arguments taking precedence:

CLI arguments > .hud_eval.toml > defaults

On first run, a template is created:

# .hud_eval.toml
[eval]
# source = "hud-evals/SheetBench-50"
# agent = "claude"
# full = false
# max_concurrent = 30
# max_steps = 10
# group_size = 1
# task_ids = ["task_1", "task_2"]
# auto_respond = true

[agent]
# allowed_tools = ["computer", "playwright"]
# disallowed_tools = []

[claude]
# model = "claude-sonnet-4-5"
# max_tokens = 16384

[openai]
# model = "gpt-4o"
# temperature = 0.7
# max_output_tokens = 4096

[gemini]
# model = "gemini-2.5-pro"
# temperature = 1.0

[openai_compatible]
# model = "my-model"
# base_url = "http://localhost:8000/v1"

Examples

# Single task (debug mode)
hud eval tasks.json claude

# Full dataset evaluation
hud eval hud-evals/SheetBench-50 claude --full

# Run specific tasks by ID
hud eval tasks.json claude --task-ids task_1,task_5

# With model override
hud eval tasks.json openai --model gpt-4o

# Agent config overrides
hud eval tasks.json claude --config max_tokens=32768
hud eval tasks.json openai --config temperature=0.7

# High concurrency
hud eval hud-evals/SheetBench-50 claude --full --max-concurrent 100

# Variance estimation (run each task 3 times)
hud eval tasks.json claude --full --group-size 3

# Remote execution on HUD platform
hud eval hud-evals/SheetBench-50 claude --full --remote

# OpenAI-compatible endpoint (vLLM, Ollama, etc.)
hud eval tasks.json openai_compatible \
    --config base_url=http://localhost:8000/v1 \
    --model llama3.1

# Skip confirmation
hud eval tasks.json claude --full -y

# Verbose debugging
hud eval tasks.json claude -vv

Interactive Mode

When agent is omitted, an interactive selector shows presets:

? Select an agent:
❯ Claude Sonnet 4.5
  GPT-5
  Operator (OpenAI Computer Use)
  Gemini 2.5 Computer Use
  Grok 4.1 Fast

Remote Execution

With --remote, both the agent and environment run on HUD infrastructure:

hud eval hud-evals/SheetBench-50 claude --full --remote

Remote agent: Runs on HUD workers (no local compute needed)
Remote environment: Tasks must use URL-based mcp_config (not local Docker)
Uses HUD Gateway - no model-specific API keys needed
Monitor progress at https://hud.ai/jobs/{job_id}
Cancel with hud cancel

Tasks with local Docker configs (command-based mcp_config) cannot be run remotely. Convert them first:

hud convert tasks.json

Remote execution requires HUD_API_KEY. Gemini and Operator agents are not supported remotely.
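
Before submitting with --remote, it can help to check which tasks would be rejected. The Python sketch below flags tasks whose mcp_config is command-based. The task schema is an assumption, not something this page specifies: it assumes mcp_config maps a server name to a dict that has either a "url" key or a "command" key. The helper name local_only_tasks and the sample data are hypothetical.

```python
import json


def local_only_tasks(tasks):
    """Return IDs of tasks that have at least one command-based server.

    Assumed schema: task["mcp_config"] maps server names to dicts that
    contain either "url" (remote) or "command" (local Docker).
    """
    blocked = []
    for task in tasks:
        servers = task.get("mcp_config", {})
        if any("command" in server for server in servers.values()):
            blocked.append(task.get("id"))
    return blocked


# Hypothetical sample standing in for the contents of tasks.json.
sample = json.loads("""
[
  {"id": "task_1", "mcp_config": {"env": {"url": "https://example.com/mcp"}}},
  {"id": "task_2", "mcp_config": {"env": {"command": "docker", "args": ["run", "my-image"]}}}
]
""")
print(local_only_tasks(sample))  # ['task_2']
```

Any task ID this reports would need hud convert before a remote run, under the schema assumption above.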

Cancellation

Cancel remote jobs:

# Cancel a specific job
hud cancel <job_id>

# Cancel a specific trace within a job
hud cancel <job_id> --trace-id <trace_id>

# Cancel ALL your active jobs
hud cancel --all
