The hud eval command runs an agent on a tasks file or a HuggingFace dataset.

Usage

hud eval [SOURCE] [AGENT] [OPTIONS]

Arguments

source
string
HuggingFace dataset (e.g., hud-evals/SheetBench-50) or task JSON/JSONL file. If omitted, looks for a tasks file in the current directory.
agent
string
Agent backend to use: claude, openai, or vllm. If omitted, an interactive selector appears (including HUD hosted models).
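A local tasks file holds one JSON object per line (JSONL). As a hedged sketch, the field names below (prompt, mcp_config, evaluate_tool) are illustrative assumptions, not the authoritative HUD SDK task schema; check the SDK docs for the exact format. A minimal file could be generated like this:

```python
import json

# Illustrative task entries; the field names here are assumptions,
# consult the HUD SDK task schema for the authoritative format.
tasks = [
    {
        "prompt": "Open the spreadsheet and sum column B.",
        "mcp_config": {"hud": {"url": "https://mcp.hud.ai/v3/mcp"}},
        "evaluate_tool": {"name": "evaluate", "arguments": {"name": "sheet_sum"}},
    },
]

# Write one JSON object per line so `hud eval tasks.jsonl claude` can read it.
with open("tasks.jsonl", "w") as f:
    for task in tasks:
        f.write(json.dumps(task) + "\n")
```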

Options

--full
boolean
default:"false"
Run the entire dataset (omit for single-task debug mode)
--model
string
Model name for the chosen agent (required for some agents)
--allowed-tools
string
Comma-separated list of allowed tools
--max-concurrent
integer
default:"30"
Maximum concurrent tasks (1-200 recommended). Adjust based on your API rate limits and system resources.
--max-steps
integer
default:"50"
Maximum steps per task (10 in single-task debug mode; 50 with --full)
--verbose
boolean
default:"false"
Enable verbose agent output
--very-verbose
boolean
default:"false"
Enable debug-level logs for maximum visibility
--vllm-base-url
string
Base URL for the vLLM server (when using the vllm agent or HUD hosted models)
--group-size
integer
default:"1"
Number of times to run each task (mini-batch style)
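With --group-size, every task is run that many times, so the total run count grows multiplicatively and is then capped by --max-concurrent. A small arithmetic illustration (plain Python, not HUD SDK code):

```python
# Total runs for a full evaluation: one run per task per group member.
def total_runs(num_tasks: int, group_size: int = 1) -> int:
    return num_tasks * group_size

# e.g. a 50-task dataset with --group-size 4 and the default --max-concurrent 30:
runs = total_runs(50, 4)   # 200 runs in total
# Rough number of concurrent "waves" (ceiling division); in practice tasks
# finish at different times, so this is only an approximation.
waves = -(-runs // 30)
```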

Examples

# Single task (debug mode)
hud eval hud-evals/SheetBench-50

# Entire Hugging Face dataset with Claude
hud eval hud-evals/SheetBench-50 claude --full

# High concurrency for faster evaluation
hud eval hud-evals/SheetBench-50 claude --full --max-concurrent 100

# Limit concurrency to prevent rate limits
hud eval hud-evals/SheetBench-50 openai --full --max-concurrent 20

# Local task config
hud eval tasks.json claude

# Local task config with verbose output for debugging
hud eval tasks.json claude --verbose

# vLLM with explicit base URL
hud eval tasks.json vllm --model llama3.1 --vllm-base-url http://localhost:8000

# Limit tools and concurrency
hud eval tasks.json claude --allowed-tools click,type --max-concurrent 10

Notes

  • If you select a HUD hosted model, hud eval will route through vLLM with the appropriate base model.
  • When SOURCE is omitted, an interactive file picker helps locate a tasks file.

See Also

Pricing & Billing

See hosted vLLM and training GPU rates in the Training Quickstart → Pricing. Manage usage and billing at https://hud.ai/project/billing.