Prerequisites
- A task to run (see Tasks).
- A
HUD_API_KEYfor gateway routing + tracing, or a provider key (ANTHROPIC_API_KEY,OPENAI_API_KEY,GEMINI_API_KEY) to call a provider directly.
The fastest path: hud eval
Pass a task source and an agent name. The agent names are claude, openai, gemini, and openai_compatible:
ANTHROPIC_API_KEY, etc.) it goes straight to the provider; with only your HUD_API_KEY, it routes through the HUD gateway automatically. Pass --gateway to force the gateway even when a provider key is present:
| Flag | Effect |
|---|---|
--full | Run the whole dataset (--all --auto-respond --max-steps 100) |
--all | Run every task instead of just the first |
--model, -m | Pin a specific model id |
--group N | Run each task N times — a group, to see reward variance |
--max-concurrent N | Cap parallel rollouts |
--max-steps N | Cap agent steps per task |
In code: the agent contract
Every agent implements one method —await agent(run) — which drives a live Run to completion by filling run.trace. create_agent builds one routed through the HUD gateway for any model id:
run.py
create_agent accepts any model id the gateway knows — claude-..., gpt-..., gemini-..., grok-... — and wires the capability-backed tools for whatever the environment exposes. The gateway is an OpenAI-compatible endpoint at inference.hud.ai.
Calling a provider directly
To use your own provider key instead of the gateway, construct a provider agent with its config:run.py
ClaudeAgent, OpenAIAgent, GeminiAgent, and OpenAIChatAgent, each with a matching config in hud.agents.types (ClaudeConfig, OpenAIConfig, GeminiConfig, OpenAIChatConfig). ClaudeSDKAgent runs the claude CLI (Claude Code) over an ssh capability.
Your own vLLM / OpenAI-compatible endpoint
OpenAIChatAgent speaks the OpenAI Chat Completions API, so any compatible server (vLLM, a local model, a hosted checkpoint) works — point it at the base_url:
run.py
hud eval tasks.py openai_compatible --model my-model with the base_url set in your eval config.
Bring your own harness
A harness is just attach to a capability + define a tool spec, so wrapping another agent framework is a thin adapter — no protocol work. SubclassAgent and implement __call__:
harness.py
run.trace.content is the answer that gets graded on exit. The bundled BrowserUseAgent (in hud.agents.browser_use) is exactly this pattern — browser-use driving the cdp capability.
Next steps
Deploy & scale
Package once, run anywhere.
Train on your tasks
Turn a group of rewards into GRPO advantages.
Agents reference
Every agent class, config, and the
Run contract.Capabilities
What a harness can attach to.