The hud eval command runs an agent on a tasks file or a HuggingFace dataset.
Usage
Arguments
HuggingFace dataset (e.g., hud-evals/SheetBench-50) or task JSON/JSONL file.
Agent to use: claude, openai, operator, gemini, openai_compatible. If omitted, an interactive preset selector appears.
Options
Execution Mode
Run the entire dataset. Without this flag, only the first task runs (debug mode).
Submit tasks to HUD platform for remote execution. Fire-and-forget - monitor at hud.ai.
Comma-separated task IDs to run (e.g., task_1,task_5). Overrides --full.
Number of times to run each task (for variance estimation).
Agent Configuration
Model/checkpoint name (e.g., claude-sonnet-4-5, gpt-5).
Agent config overrides as key=value. Supports namespaced keys like claude.max_tokens=32768.
Comma-separated tools to expose to the agent.
Comma-separated tools to hide from the agent.
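To illustrate the key=value override format with namespaced keys, here is a minimal Python sketch. It is not hud's actual parser: the function names parse_overrides and coerce, the nesting behaviour, and the type coercion rules are all assumptions made for illustration.

```python
from typing import Any


def coerce(raw: str) -> Any:
    """Best-effort conversion of a string value to bool, int, or float."""
    lowered = raw.lower()
    if lowered in ("true", "false"):
        return lowered == "true"
    for cast in (int, float):
        try:
            return cast(raw)
        except ValueError:
            continue
    return raw  # leave anything else as a plain string


def parse_overrides(pairs: list[str]) -> dict[str, Any]:
    """Hypothetical helper: turn key=value strings into a nested dict."""
    result: dict[str, Any] = {}
    for pair in pairs:
        key, sep, raw = pair.partition("=")
        if not sep or not key:
            raise ValueError(f"expected key=value, got {pair!r}")
        # A dotted key such as claude.max_tokens becomes one nested
        # dict per namespace segment, with the last segment as the leaf.
        *namespaces, leaf = key.split(".")
        node = result
        for name in namespaces:
            node = node.setdefault(name, {})
        node[leaf] = coerce(raw)
    return result
```

Under these assumptions, parse_overrides(["claude.max_tokens=32768"]) yields a dict with a claude namespace holding an integer max_tokens value.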
Execution Limits
Maximum concurrent tasks (local execution only).
Maximum steps per task. Default: 10 (single task) or 100 (--full).
Use ResponseAgent to decide when to stop/continue. Default: True for --full.
Output & Confirmation
Enable verbose agent output.
Enable debug-level logs.
Skip confirmation prompt.
Configuration File
hud eval supports a .hud_eval.toml config file. Settings are merged with CLI args taking precedence:
CLI arguments > .hud_eval.toml > defaults
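The precedence order above can be sketched in Python as a layered merge. This is only an illustration of the rule, not hud's implementation: the resolve_settings function, the setting names, and the convention that an unset CLI argument is None are assumptions.

```python
from typing import Any, Optional

# Hypothetical defaults, used only to demonstrate the merge order.
DEFAULTS: dict[str, Any] = {"max_steps": 10, "verbose": False}


def resolve_settings(
    cli_args: dict[str, Any],
    file_settings: dict[str, Any],
    defaults: Optional[dict[str, Any]] = None,
) -> dict[str, Any]:
    """Merge settings so that CLI arguments > config file > defaults."""
    merged = dict(DEFAULTS if defaults is None else defaults)
    # Values from the config file override the defaults.
    merged.update(file_settings)
    # CLI arguments override everything, but only when actually provided
    # (assumption: an omitted argument is represented as None).
    merged.update({k: v for k, v in cli_args.items() if v is not None})
    return merged
```

Treating None as "not provided" keeps an omitted CLI flag from masking a value set in the config file.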
On first run, a template is created:
Examples
Interactive Mode
When agent is omitted, an interactive selector shows presets:
Remote Execution
With --remote, both the agent and environment run on HUD infrastructure:
- Remote agent: Runs on HUD workers (no local compute needed)
- Remote environment: Tasks must use URL-based mcp_config (not local Docker)
- Uses HUD Gateway - no model-specific API keys needed
- Monitor progress at https://hud.ai/jobs/{job_id}
- Cancel with hud cancel
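The URL-based requirement in the list above can be checked locally before submitting. The Python sketch below assumes a task is a dict whose mcp_config maps server names to entries containing either a url key (remote) or a command key (local Docker). That schema and the is_remote_compatible helper are assumptions for illustration, not a documented hud API.

```python
def is_remote_compatible(task: dict) -> bool:
    """Return True if every MCP server entry in the task is URL-based.

    Assumed schema (hypothetical): task["mcp_config"] is a dict of
    server name -> {"url": ...} or {"command": ..., "args": [...]}.
    """
    servers = task.get("mcp_config") or {}
    if not servers:
        # A task without any server entry has nothing to connect to.
        return False
    return all(
        isinstance(entry, dict) and "url" in entry and "command" not in entry
        for entry in servers.values()
    )
```

A task list could be filtered with this check to find entries that still need converting before a remote run.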
Tasks with local Docker configs (command-based mcp_config) cannot be run remotely. Convert them first:
Remote execution requires HUD_API_KEY. Gemini and Operator agents are not supported remotely.
Cancellation
Cancel remote jobs:
See Also
- Tasks Reference - Task configuration
- Agents Reference - Agent options
- hud rft - Reinforcement fine-tuning
- hud cancel - Cancel remote jobs