HUD Documentation - Evaluations and RL Environments.

An agent is what acts inside an environment: it works a task through the environment’s capabilities and produces the answer that gets graded. Concretely it’s a model wrapped in a harness - the loop that feeds the model observations and turns its output into actions. In the framework an agent is anything callable as await agent(run), where the run is the live handle for one task: its prompt, its connection to the environment, and the trace it fills. Because an environment only exposes capabilities, the agent isn’t baked in - use a built-in agent for a standard model, or bring your own harness for a custom loop.
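
To make the `await agent(run)` contract concrete, here is a minimal sketch that does not import the SDK. `SketchRun` and `SketchTrace` are hypothetical stand-ins for the real run handle, and the `prompt` attribute name is an assumption. Only the shape matters: an async callable that receives the run and leaves its answer on the trace.

```python
import asyncio
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class SketchTrace:
    """Hypothetical stand-in for the trace a run fills."""
    content: Optional[str] = None


@dataclass
class SketchRun:
    """Hypothetical stand-in for the live run handle (not the SDK class)."""
    prompt: str
    trace: SketchTrace = field(default_factory=SketchTrace)


class EchoAgent:
    """Anything awaitable as `await agent(run)` fits the agent shape."""

    async def __call__(self, run: SketchRun) -> None:
        # A real harness would call a model here; this one just echoes.
        run.trace.content = f"echo: {run.prompt}"


async def demo() -> Optional[str]:
    run = SketchRun(prompt="What is 2 + 2?")
    agent = EchoAgent()
    await agent(run)          # the whole agent contract is this one call
    return run.trace.content  # the final answer lives on the trace


answer = asyncio.run(demo())
print(answer)
```

A built-in agent has the same outer shape; the difference is that its body runs a provider model in a loop instead of echoing the prompt.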

Built-in agents

The SDK ships one agent per major provider, reachable in two ways:

create_agent(model) - the preferred path. It selects the matching provider agent for a model id and routes every call through the HUD gateway.
a provider agent directly (e.g. ClaudeAgent(ClaudeConfig(...))) - the same class constructed yourself, for full config control or to call the provider with your own key instead of the gateway.

from hud.agents import create_agent

agent = create_agent("claude-sonnet-4-5")   # routed through the gateway

The HUD gateway is an OpenAI-compatible endpoint (inference.hud.ai) that fronts every provider behind your single HUD_API_KEY, so you switch between Claude, GPT, Gemini, or Grok by name alone, with unified tracing. create_agent accepts any id the gateway knows (claude-..., gpt-..., gemini-..., grok-...); extra kwargs pass through to the agent’s config.

This takes only one line because built-in agents are catalog-driven. On each run they read the environment’s manifest, open the capabilities they support, build the matching provider tools, and loop against run.prompt_messages. Declaring a capability on the environment is enough; you never wire tools yourself.
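
The catalog-driven step can be pictured as matching what the environment declares against what the agent supports. The sketch below is an illustration under stated assumptions, not the SDK's code: the manifest layout, the capability names, and `build_tools` are all hypothetical.

```python
from typing import Callable, Dict, List

# Hypothetical manifest: the capabilities an environment declares.
manifest = {"capabilities": ["ssh", "browser", "gamepad"]}

# Hypothetical agent-side catalog: capability name -> tool builder.
TOOL_BUILDERS: Dict[str, Callable[[], dict]] = {
    "ssh": lambda: {"name": "ssh", "kind": "function"},
    "browser": lambda: {"name": "browser", "kind": "function"},
}


def build_tools(manifest: dict) -> List[dict]:
    """Build a tool for every declared capability this agent supports."""
    tools = []
    for capability in manifest["capabilities"]:
        builder = TOOL_BUILDERS.get(capability)
        if builder is None:
            continue  # declared by the env, but unsupported by this agent
        tools.append(builder())
    return tools


tools = build_tools(manifest)
print([tool["name"] for tool in tools])
```

In this picture, adding a capability to the environment changes the manifest, and the agent picks it up on the next run without any change on the agent side.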

Provider agents

Each model maps to a provider agent - the class that speaks that provider’s API. Construct one directly to set its full config or use your own provider key:

from hud.agents import ClaudeAgent
from hud.agents.types import ClaudeConfig

agent = ClaudeAgent(ClaudeConfig(model="claude-sonnet-4-5", max_steps=30))

| Agent | Config | Default model |
| --- | --- | --- |
| `ClaudeAgent` | `ClaudeConfig` | `claude-sonnet-4-6` |
| `OpenAIAgent` | `OpenAIConfig` | `gpt-5.5` |
| `GeminiAgent` | `GeminiConfig` | `gemini-3-pro-preview` |
| `OpenAIChatAgent` | `OpenAIChatConfig` | `gpt-5.4-mini` |
| `ClaudeSDKAgent` | `ClaudeSDKConfig` | `claude-sonnet-4-6` |

Each config lives in hud.agents.types. OpenAIChatAgent speaks the OpenAI Chat Completions API, so it can point at any compatible server (vLLM, a local model) via base_url; ClaudeSDKAgent runs the claude CLI over an ssh capability, against the environment’s filesystem. Every knob (model, max_steps, system_prompt, citations_enabled) lives on the config; __call__(run) takes only the run.
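
The split between config and call can be sketched without the SDK. In this illustration `SketchConfig`, `SketchAgent`, and the scripted `model_fn` are hypothetical; only the pattern comes from the text above: knobs sit on the config, `__call__` receives just the run, and `max_steps` caps the loop.

```python
import asyncio
from dataclasses import dataclass
from typing import Callable, List, Optional


@dataclass
class SketchConfig:
    """Hypothetical config; field names mirror the knobs named above."""
    model: str = "sketch-model"
    max_steps: int = 3
    system_prompt: Optional[str] = None


class SketchAgent:
    def __init__(self, config: SketchConfig, model_fn: Callable[[int], str]):
        self.config = config
        self.model_fn = model_fn  # stand-in for a provider call

    async def __call__(self, run: dict) -> None:
        # Only the run is passed in; every setting is read from self.config.
        steps: List[str] = []
        for step in range(self.config.max_steps):
            output = self.model_fn(step)
            steps.append(output)
            if output == "done":
                break  # the model finished before hitting the cap
        run["steps"] = steps


# Scripted "model" that never finishes, so the step cap ends the loop.
agent = SketchAgent(SketchConfig(max_steps=2), model_fn=lambda step: f"action-{step}")
run: dict = {}
asyncio.run(agent(run))
print(run["steps"])
```

Keeping the knobs on the config means the same call site works for every agent, and changing behavior never changes the call signature.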

Running an agent

You can run a task with an agent in two ways. Programmatically - pass the agent to task.run / taskset.run with a runtime:

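import asyncio

from hud.agents import create_agent
from hud.eval import LocalRuntime
from tasks import TASKS  # TASKS comes from your own tasks.py

async def main():
    agent = create_agent("claude-sonnet-4-5")
    # `await` needs an async context, so the run happens inside a coroutine.
    job = await TASKS.run(agent, runtime=LocalRuntime("env.py"))
    print(job.reward)

asyncio.run(main())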

From the CLI - hud eval takes a task source (.py, a directory, or .json/.jsonl) and an agent name (claude, openai, gemini, openai_compatible), runs each rollout in a fresh env subprocess, grades it, and prints the reward:

hud eval tasks.py claude                       # first task, one rollout
hud eval tasks.py openai -m gpt-5 --group 3    # a pinned model, 3 rollouts each
hud eval tasks.py claude --all                 # every task in the source

Flags override the agent’s config for that run:

| Flag | Effect |
| --- | --- |
| `--model`, `-m` | Pin a specific model id. |
| `--group N` | Run each task N times, to see the reward spread. |
| `--max-steps N` | Cap agent steps per task. |
| `--all` / `--full` | Run the whole source (`--full` also auto-responds, 100 steps). |
| `--gateway` | Force calls through the gateway even when a provider key is set. |
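
What `--group N` gives you is a distribution rather than a single number. As a rough sketch of reading that spread, the snippet below summarizes rewards from repeated rollouts of one task. The reward values are made up for illustration and the helper name `summarize` is hypothetical; it is not part of the CLI.

```python
from statistics import mean, pstdev


def summarize(rewards):
    """Return mean, population std dev, and range for one task's rollouts."""
    return {
        "mean": mean(rewards),
        "std": pstdev(rewards),
        "min": min(rewards),
        "max": max(rewards),
    }


# Illustrative rewards from three rollouts of the same task (not real data).
rewards = [1.0, 0.0, 1.0]
summary = summarize(rewards)
print(summary)
```

A high mean with a wide spread suggests the agent can solve the task but does not do so reliably, which a single rollout would hide.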

With only a HUD_API_KEY set, calls route through the gateway; with a provider key present they go straight to the provider. See the CLI reference for the full flag set and key resolution.
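
The routing rule in the paragraph above reduces to a small decision function. This is a sketch of the described behavior only, not the CLI's implementation: `resolve_route` is hypothetical, the provider key name `ANTHROPIC_API_KEY` is an assumed example, and raising when no key is set is also an assumption.

```python
def resolve_route(env, provider_key_name, force_gateway=False):
    """Return "gateway" or "provider" following the rule described above."""
    has_provider_key = bool(env.get(provider_key_name))
    has_hud_key = bool(env.get("HUD_API_KEY"))

    if force_gateway:        # mirrors the --gateway flag
        return "gateway"
    if has_provider_key:     # a provider key takes the direct path
        return "provider"
    if has_hud_key:          # only HUD_API_KEY is set
        return "gateway"
    # Assumption: with no key at all there is nothing to route to.
    raise RuntimeError("no API key available")


both = {"HUD_API_KEY": "placeholder", "ANTHROPIC_API_KEY": "placeholder"}
print(resolve_route({"HUD_API_KEY": "placeholder"}, "ANTHROPIC_API_KEY"))
print(resolve_route(both, "ANTHROPIC_API_KEY"))
print(resolve_route(both, "ANTHROPIC_API_KEY", force_gateway=True))
```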

Bring your own harness

Any loop or framework can be an agent: subclass Agent, drive the environment off the run, and write the final answer to run.trace.content (what gets graded). Because this sits outside the standard workflow, the details live in Extending HUD: the seam, the Run object you work with, the step types you record, and worked examples.

See also

Bring your own harness

Capabilities

Types: Run & Trace

Robots (beta)