Skip to main content
The hud eval command runs an agent on a tasks file or HuggingFace dataset.

Usage

hud eval [SOURCE] [AGENT] [OPTIONS]

Arguments

source
string
HuggingFace dataset (e.g., hud-evals/SheetBench-50) or task JSON/JSONL file.
agent
string
Agent to use: claude, openai, operator, gemini, openai_compatible. If omitted, an interactive preset selector appears.

Options

Execution Mode

--full
boolean
default:"false"
Run the entire dataset. Without this flag, only the first task runs (debug mode).
--remote
boolean
default:"false"
Submit tasks to HUD platform for remote execution. Fire-and-forget - monitor at hud.ai.
--task-ids
string
Comma-separated task IDs to run (e.g., task_1,task_5). Overrides --full.
--group-size
integer
default:"1"
Number of times to run each task (for variance estimation).

Agent Configuration

--model, -m
string
Model/checkpoint name (e.g., claude-sonnet-4-5, gpt-5).
--config, -c
string
Agent config overrides as key=value. Supports namespaced keys like claude.max_tokens=32768.
--allowed-tools
string
Comma-separated tools to expose to the agent.
--disallowed-tools
string
Comma-separated tools to hide from the agent.

Execution Limits

--max-concurrent
integer
default:"30"
Maximum concurrent tasks (local execution only).
--max-steps
integer
Maximum steps per task. Default: 10 (single task) or 100 (--full).
--auto-respond
boolean
Use ResponseAgent to decide when to stop/continue. Default: True for --full.

Output & Confirmation

--verbose, -v
boolean
default:"false"
Enable verbose agent output.
--very-verbose, -vv
boolean
default:"false"
Enable debug-level logs.
--yes, -y
boolean
default:"false"
Skip confirmation prompt.

Configuration File

hud eval supports a .hud_eval.toml config file. Settings are merged with CLI args taking precedence: CLI arguments > .hud_eval.toml > defaults On first run, a template is created:
# .hud_eval.toml
[eval]
# source = "hud-evals/SheetBench-50"
# agent = "claude"
# full = false
# max_concurrent = 30
# max_steps = 10
# group_size = 1
# task_ids = ["task_1", "task_2"]
# auto_respond = true

[agent]
# allowed_tools = ["computer", "playwright"]
# disallowed_tools = []

[claude]
# model = "claude-sonnet-4-5"
# max_tokens = 16384

[openai]
# model = "gpt-4o"
# temperature = 0.7
# max_output_tokens = 4096

[gemini]
# model = "gemini-2.5-pro"
# temperature = 1.0

[openai_compatible]
# model = "my-model"
# base_url = "http://localhost:8000/v1"

Examples

# Single task (debug mode)
hud eval tasks.json claude

# Full dataset evaluation
hud eval hud-evals/SheetBench-50 claude --full

# Run specific tasks by ID
hud eval tasks.json claude --task-ids task_1,task_5

# With model override
hud eval tasks.json openai --model gpt-4o

# Agent config overrides
hud eval tasks.json claude --config max_tokens=32768
hud eval tasks.json openai --config temperature=0.7

# High concurrency
hud eval hud-evals/SheetBench-50 claude --full --max-concurrent 100

# Variance estimation (run each task 3 times)
hud eval tasks.json claude --full --group-size 3

# Remote execution on HUD platform
hud eval hud-evals/SheetBench-50 claude --full --remote

# OpenAI-compatible endpoint (vLLM, Ollama, etc.)
hud eval tasks.json openai_compatible \
    --config base_url=http://localhost:8000/v1 \
    --model llama3.1

# Skip confirmation
hud eval tasks.json claude --full -y

# Verbose debugging
hud eval tasks.json claude -vv

Interactive Mode

When agent is omitted, an interactive selector shows presets:
? Select an agent:
❯ Claude Sonnet 4.5
  GPT-5
  Operator (OpenAI Computer Use)
  Gemini 2.5 Computer Use
  Grok 4.1 Fast

Remote Execution

With --remote, both the agent and environment run on HUD infrastructure:
hud eval hud-evals/SheetBench-50 claude --full --remote
  • Remote agent: Runs on HUD workers (no local compute needed)
  • Remote environment: Tasks must use URL-based mcp_config (not local Docker)
  • Uses HUD Gateway - no model-specific API keys needed
  • Monitor progress at https://hud.ai/jobs/{job_id}
  • Cancel with hud cancel
Tasks with local Docker configs (command-based mcp_config) cannot be run remotely. Convert them first:
hud convert tasks.json
Remote execution requires HUD_API_KEY. Gemini and Operator agents are not supported remotely.

Cancellation

Cancel remote jobs:
# Cancel a specific job
hud cancel --job <job_id>

# Cancel a specific task
hud cancel --task <job_id> <task_id>

# Cancel ALL your active jobs
hud cancel --all

See Also