Skip to main content
Short answers to the questions that come up most. For the full story, each answer links to the page that covers it.

Why HUD

Rolling your own usually means three recurring chores: re-wiring tools for each model, re-packaging an artifact per task, and gluing rewards into a trainer. HUD removes all three:
  • The environment never needs rebuilding as models change. It exposes a capability — a real connection like an ssh shell or a browser — that any model or harness drives directly, so a harness released years from now still runs it.
  • One task definition is a whole dataset. A generative task mints as many concrete tasks as you want from a single definition; you don’t author and store one artifact per task.
  • Nothing downstream is locked in. A graded rollout is just a trace_id and a reward, so the same runs you eval today feed any trainer tomorrow — your own loop or a stack like Tinker, slime, or Fireworks — with no environment-side glue, on any rollout infra.
You write the environment once; the model, harness, trainer, and infra all stay swappable. See Introduction.

Setup & requirements

Not for the quickstart. hud eval, hud serve, and gateway runs need no Docker — you write a tasks.py and run it. You only need Docker for the packaging path: building a portable image from Dockerfile.hud and the build step of hud deploy. See Package & deploy.
You need one of:
  • A HUD_API_KEY (hud.ai/project/api-keys) — routes models through the HUD gateway (the default when no provider key is set) and traces every rollout. One key for everything.
  • A provider key (ANTHROPIC_API_KEY, OPENAI_API_KEY, GEMINI_API_KEY) — to call that provider directly instead of the gateway.
See Run on any model.
No — not to build environments, write tasks, or run evals. Inference happens through the gateway or your provider. Training feeds HUD’s rewards into your own GRPO/PPO loop (or a stack like Tinker, slime, or Fireworks), which brings its own compute. See Train on rewards.
A globally installed CLI (uv tool install hud-python) runs in its own Python environment, so it can’t see packages from your project’s venv (e.g. playwright in your env’s dependencies). Inside a project with its own deps, add hud-python to the project and run it from the venv:
uv add hud-python
uv run hud eval tasks.py claude
The CLI and SDK run on macOS, Windows, and Linux. Two caveats: ssh sandbox isolation is Linux-only (the shell still runs without it elsewhere), and BashGrader needs bash, so on native Windows it scores 0.0. Both are fine for local iteration and resolved inside a built Linux image. See Capabilities.

Privacy & cost

Two data paths to know about:
  • Gateway (the default with just HUD_API_KEY, or forced with --gateway / create_agent): model calls route through HUD’s OpenAI-compatible endpoint at inference.hud.ai, which forwards to the provider.
  • Tracing: when HUD_API_KEY is set, each rollout’s trace is recorded on the hud.ai platform so you can replay it. Run without the key (or with a provider key directly) to skip the gateway.
For data-handling specifics, see hud.ai or contact the team.
Running locally with your own provider key (hud serve, hud eval ... claude) incurs no HUD charge beyond your provider’s usage. The gateway uses hosted compute. For current pricing, quotas, and any free tier, see hud.ai.

Concepts & commands

  • Environment — where the agent acts; exposes capabilities (ssh, cdp, …).
  • Task definition — a @env.template async generator that prompts and grades.
  • Task — calling a definition (count_letter(word="…")) mints one runnable, parameterized data row.
  • Taskset — a collection of tasks you evaluate one agent over, with optional GRPO grouping. See Tasks & tasksets.
  • hud eval tasks.py claude — run an agent over your tasks and grade them. Your main loop.
  • hud serve env.py — serve the environment locally so you can drive one task by hand (hud task start / hud task grade).
  • hud deploy — build a portable Docker image and publish to HUD infra in one step.
Full surface in the CLI reference.
Yes. OpenAIChatAgent speaks the OpenAI Chat Completions API, so any compatible server (vLLM, a local model, a hosted checkpoint) works — point base_url at it. From the CLI use the openai_compatible agent. See Run on any model and Integrations.
Evals are a complete use on their own — write tasks, run them across models, read rewards and traces. Training is optional: because every rollout returns a reward and a trace, the same tasks become training data if and when you want them to. See Train on rewards.
Yes. The Harbor integration loads Harbor-format tasks straight into a Taskset (integrations.harbor.load), no conversion round-trip needed. And a whole benchmark can become one generative task definition. See Harbor interop.
Yes, in beta: the openpi/0 capability is a schema-driven observation/action loop over WebSocket for simulator and robot environments, with a LeRobot-ready agent harness and trace playback with action-chunk markers. See the Robots reference and the robot benchmark cookbook.
Scenarios became tasks, registered tools became capabilities, and the env serves a control channel instead of an MCP server. Old environments keep running; convert at your own pace. See Migrate to v6.

Still stuck?

Quickstart

Zero to a first graded trace.

Designing tasks for signal

What makes a task actually worth training on.