HUD Documentation — Evaluations and RL Environments.

An evaluation produces one trace: an agent works the task against the environment and gets graded. Because the environment only exposes capabilities (never a fixed agent), any model or harness plugs in — you choose the agent at run time, not at authoring time.

Prerequisites

A task to run (see Tasks).
A HUD_API_KEY for gateway routing + tracing, or a provider key (ANTHROPIC_API_KEY, OPENAI_API_KEY, GEMINI_API_KEY) to call a provider directly.

The fastest path: `hud eval`

Pass a task source and an agent name. The agent names are claude, openai, gemini, and openai_compatible:

hud eval tasks.py claude --group 3
hud eval tasks.py openai --model gpt-5 --group 3
hud eval tasks.py gemini --group 3

Which path a call takes depends on your keys: with a provider key set (ANTHROPIC_API_KEY, etc.) it goes straight to the provider; with only your HUD_API_KEY, it routes through the HUD gateway automatically. Pass --gateway to force the gateway even when a provider key is present:

hud eval tasks.py claude --gateway

Useful flags:

Flag	Effect
`--full`	Run the whole dataset (`--all --auto-respond --max-steps 100`)
`--all`	Run every task instead of just the first
`--model`, `-m`	Pin a specific model id
`--group N`	Run each task `N` times — a group, to see reward variance
`--max-concurrent N`	Cap parallel rollouts
`--max-steps N`	Cap agent steps per task

In code: the agent contract

Every agent implements one method — await agent(run) — which drives a live Run to completion by filling run.trace. create_agent builds one routed through the HUD gateway for any model id:

run.py

import asyncio
from hud.agents import create_agent
from tasks import count_letter

async def main():
    agent = create_agent("claude-sonnet-4-5")
    job = await count_letter(word="strawberry").run(agent)
    print(job.reward)

asyncio.run(main())

create_agent accepts any model id the gateway knows — claude-..., gpt-..., gemini-..., grok-... — and wires the capability-backed tools for whatever the environment exposes. The gateway is an OpenAI-compatible endpoint at inference.hud.ai.

Calling a provider directly

To use your own provider key instead of the gateway, construct a provider agent with its config:

run.py

from hud.agents import ClaudeAgent
from hud.agents.types import ClaudeConfig

agent = ClaudeAgent(ClaudeConfig(model="claude-sonnet-4-5"))

The provider agents are ClaudeAgent, OpenAIAgent, GeminiAgent, and OpenAIChatAgent, each with a matching config in hud.agents.types (ClaudeConfig, OpenAIConfig, GeminiConfig, OpenAIChatConfig). ClaudeSDKAgent runs the claude CLI (Claude Code) over an ssh capability.

Your own vLLM / OpenAI-compatible endpoint

OpenAIChatAgent speaks the OpenAI Chat Completions API, so any compatible server (vLLM, a local model, a hosted checkpoint) works — point it at the base_url:

run.py

from hud.agents import OpenAIChatAgent
from hud.agents.types import OpenAIChatConfig

agent = OpenAIChatAgent(OpenAIChatConfig(
    model="my-model",
    base_url="http://localhost:8000/v1",
    api_key="local",
))

From the CLI, the equivalent is hud eval tasks.py openai_compatible --model my-model with the base_url set in your eval config.

Bring your own harness

A harness is just attach to a capability + define a tool spec, so wrapping another agent framework is a thin adapter — no protocol work. Subclass Agent and implement __call__:

harness.py

from hud.agents.base import Agent
from hud import Run

class EchoAgent(Agent):
    async def __call__(self, run: Run) -> None:
        # Read run.prompt_text, do work, then write the answer:
        run.trace.content = "my answer"

run.trace.content is the answer that gets graded on exit. The bundled BrowserUseAgent (in hud.agents.browser_use) is exactly this pattern — browser-use driving the cdp capability.

Next steps

Deploy & scale

Package once, run anywhere.

Train on your tasks

Turn a group of rewards into GRPO advantages.

Agents reference

Every agent class, config, and the Run contract.

Capabilities

What a harness can attach to.

​Prerequisites

​The fastest path: hud eval

​In code: the agent contract

​Calling a provider directly

​Your own vLLM / OpenAI-compatible endpoint

​Bring your own harness

​Next steps

Deploy & scale

Train on your tasks

Agents reference

Capabilities

Prerequisites

The fastest path: `hud eval`

In code: the agent contract

Calling a provider directly

Your own vLLM / OpenAI-compatible endpoint

Bring your own harness

Next steps