Why HUD
Why HUD instead of building my own eval or RL harness?
Why HUD instead of building my own eval or RL harness?
- The environment never needs rebuilding as models change. It exposes a capability - a real connection like an
sshshell or a browser - that any model or harness drives directly, so a harness released years from now still runs it. - One task definition is a whole dataset. A generative task mints as many concrete tasks as you want from a single definition; you donβt author and store one artifact per task.
- Nothing downstream is locked in. A graded rollout is just a
trace_idand areward, so the same runs you eval today feed any trainer tomorrow - your own loop or a stack like Tinker, slime, or Fireworks - with no environment-side glue, on any rollout infra.
Setup & requirements
Do I need Docker?
Do I need Docker?
hud eval, hud serve, and gateway runs need no Docker - you write a tasks.py and run it. You only need Docker for the packaging path: building a portable image from Dockerfile.hud and the build step of hud deploy. See Package & deploy.Do I need an API key?
Do I need an API key?
- A
HUD_API_KEY(hud.ai/project/api-keys) - routes models through the HUD gateway (the default when no provider key is set) and traces every rollout. One key for everything. - A provider key (
ANTHROPIC_API_KEY,OPENAI_API_KEY,GEMINI_API_KEY) - to call that provider directly instead of the gateway.
Do I need a GPU?
Do I need a GPU?
My environment imports a package hud can't find - why?
My environment imports a package hud can't find - why?
uv tool install hud-python) runs in its own Python environment, so it canβt see packages from your projectβs venv (e.g. playwright in your envβs dependencies). Inside a project with its own deps, add hud-python to the project and run it from the venv:What platforms are supported (macOS / Windows / Linux)?
What platforms are supported (macOS / Windows / Linux)?
ssh sandbox isolation is Linux-only (the shell still runs without it elsewhere), and BashGrader needs bash, so on native Windows it scores 0.0. Both are fine for local iteration and resolved inside a built Linux image. See Capabilities.Privacy & cost
What does HUD see? Is my data private?
What does HUD see? Is my data private?
- Gateway (the default with just
HUD_API_KEY, or forced with--gateway/create_agent): model calls route through HUDβs OpenAI-compatible endpoint atinference.hud.ai, which forwards to the provider. - Tracing: when
HUD_API_KEYis set, each rolloutβs trace is recorded on the hud.ai platform so you can replay it. Run without the key (or with a provider key directly) to skip the gateway.
How much does it cost?
How much does it cost?
hud serve, hud eval ... claude) incurs no HUD charge beyond your providerβs usage. The gateway uses hosted compute. For current pricing, quotas, and any free tier, see hud.ai.Concepts & commands
Environment vs task vs taskset?
Environment vs task vs taskset?
- Environment - where the agent acts; exposes capabilities (
ssh,cdp, β¦). - Task definition - a
@env.templateasync generator that prompts and grades. - Task - calling a definition (
count_letter(word="β¦")) mints one runnable, parameterized data row. - Taskset - a collection of tasks you evaluate one agent over, with optional GRPO grouping. See Tasks & tasksets.
hud eval vs hud serve vs hud deploy - which when?
hud eval vs hud serve vs hud deploy - which when?
hud eval tasks.py claude- run an agent over your tasks and grade them. Your main loop.hud serve env.py- serve the environment locally so you can drive one task by hand (hud task start/hud task grade).hud deploy- build a portable Docker image and publish to HUD infra in one step.
hud eval env.py or tasks.py?
hud eval env.py or tasks.py?
hud init) thatβs tasks.py (it imports templates from env.py); in a single-file setup itβs whatever file holds both the Environment and the tasks = [...] list. The CLI spawns env.py for the control channel automatically - you donβt pass both. Point it at . to scan a directory. See the CLI reference.Can I use my own model or a local endpoint?
Can I use my own model or a local endpoint?
OpenAIChatAgent speaks the OpenAI Chat Completions API, so any compatible server (vLLM, a local model, a hosted checkpoint) works - point base_url at it. From the CLI use the openai_compatible agent. See Run on any model and Integrations.Do I have to train, or can I just run evals?
Do I have to train, or can I just run evals?
Can I bring an existing benchmark or tasks?
Can I bring an existing benchmark or tasks?
Taskset (integrations.harbor.load), no conversion round-trip needed. And a whole benchmark can become one generative task definition. See Harbor interop.Does HUD support robotics / VLA policies?
Does HUD support robotics / VLA policies?
openpi/0 capability is a schema-driven observation/action loop over WebSocket for simulator and robot environments, with a LeRobot-ready agent harness and trace playback with action-chunk markers. See the Robots reference and the robot benchmark cookbook.I'm upgrading from v5 - what changed?
I'm upgrading from v5 - what changed?