An evaluation produces one trace: an agent works the task against the environment and gets graded. Because the environment only exposes capabilities (never a fixed agent), any model or harness plugs in — you choose the agent at run time, not at authoring time.

Prerequisites

  • A task to run (see Tasks).
  • A HUD_API_KEY for gateway routing and tracing, or a provider key (ANTHROPIC_API_KEY, OPENAI_API_KEY, GEMINI_API_KEY) to call a provider directly.

The fastest path: hud eval

Pass a task source and an agent name. The agent names are claude, openai, gemini, and openai_compatible:
hud eval tasks.py claude --group 3
hud eval tasks.py openai --model gpt-5 --group 3
hud eval tasks.py gemini --group 3
Which path a call takes depends on your keys: with a provider key set (ANTHROPIC_API_KEY, etc.) it goes straight to the provider; with only your HUD_API_KEY, it routes through the HUD gateway automatically. Pass --gateway to force the gateway even when a provider key is present:
hud eval tasks.py claude --gateway
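The routing rule described above can be written as a small decision function. This is only an illustration of the documented behavior, not hud's actual implementation: resolve_route and the PROVIDER_KEYS table are hypothetical names, and the sketch assumes each agent name uses the provider key listed in the prerequisites.

```python
from typing import Mapping

# Hypothetical lookup (assumption): the provider key each agent name would use.
PROVIDER_KEYS = {
    "claude": "ANTHROPIC_API_KEY",
    "openai": "OPENAI_API_KEY",
    "gemini": "GEMINI_API_KEY",
}


def resolve_route(agent: str, env: Mapping[str, str], force_gateway: bool = False) -> str:
    """Return "provider" or "gateway" following the rule described in the text."""
    provider_key = PROVIDER_KEYS.get(agent)
    has_provider_key = bool(provider_key and env.get(provider_key))
    has_hud_key = bool(env.get("HUD_API_KEY"))

    # A provider key wins unless the caller forces the gateway (like --gateway).
    if has_provider_key and not force_gateway:
        return "provider"
    # Otherwise the HUD key routes the call through the gateway.
    if has_hud_key:
        return "gateway"
    raise RuntimeError("no usable API key found for this route")


both_keys = {"ANTHROPIC_API_KEY": "sk-example", "HUD_API_KEY": "hud-example"}
print(resolve_route("claude", both_keys))                      # provider
print(resolve_route("claude", both_keys, force_gateway=True))  # gateway
```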
Useful flags:
Flag                Effect
--full              Run the whole dataset (--all --auto-respond --max-steps 100)
--all               Run every task instead of just the first
--model, -m         Pin a specific model id
--group N           Run each task N times (a group) to see reward variance
--max-concurrent N  Cap parallel rollouts
--max-steps N       Cap agent steps per task
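
To make "reward variance" concrete, the sketch below summarizes the rewards from one group. It assumes the N rewards have already been collected into a plain list; how to read them out of hud's results is not shown, and summarize_group is a hypothetical helper rather than part of hud.

```python
from statistics import mean, pvariance


def summarize_group(rewards: list[float]) -> dict[str, float]:
    """Summarize one group of rewards (N runs of the same task)."""
    if not rewards:
        raise ValueError("a group needs at least one reward")
    return {
        "n": len(rewards),
        "mean": mean(rewards),
        # Population variance: the group is the whole sample of interest here.
        "variance": pvariance(rewards),
    }


# Example: three runs of one task, two of which succeeded.
print(summarize_group([1.0, 0.0, 1.0]))
```

A variance of zero means every run in the group scored the same; a larger value suggests the agent's result on that task is unstable and may be worth inspecting trace by trace.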

In code: the agent contract

Every agent implements one method — await agent(run) — which drives a live Run to completion by filling run.trace. create_agent builds one routed through the HUD gateway for any model id:
run.py
import asyncio
from hud.agents import create_agent
from tasks import count_letter

async def main():
    agent = create_agent("claude-sonnet-4-5")
    job = await count_letter(word="strawberry").run(agent)
    print(job.reward)

asyncio.run(main())
create_agent accepts any model id the gateway knows — claude-..., gpt-..., gemini-..., grok-... — and wires the capability-backed tools for whatever the environment exposes. The gateway is an OpenAI-compatible endpoint at inference.hud.ai.

Calling a provider directly

To use your own provider key instead of the gateway, construct a provider agent with its config:
run.py
from hud.agents import ClaudeAgent
from hud.agents.types import ClaudeConfig

agent = ClaudeAgent(ClaudeConfig(model="claude-sonnet-4-5"))
The provider agents are ClaudeAgent, OpenAIAgent, GeminiAgent, and OpenAIChatAgent, each with a matching config in hud.agents.types (ClaudeConfig, OpenAIConfig, GeminiConfig, OpenAIChatConfig). ClaudeSDKAgent runs the claude CLI (Claude Code) over an ssh capability.

Your own vLLM / OpenAI-compatible endpoint

OpenAIChatAgent speaks the OpenAI Chat Completions API, so any compatible server (vLLM, a local model, a hosted checkpoint) works — point it at the base_url:
run.py
from hud.agents import OpenAIChatAgent
from hud.agents.types import OpenAIChatConfig

agent = OpenAIChatAgent(OpenAIChatConfig(
    model="my-model",
    base_url="http://localhost:8000/v1",
    api_key="local",
))
From the CLI, the equivalent is hud eval tasks.py openai_compatible --model my-model with the base_url set in your eval config.
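
If the endpoint differs between machines, one option is to assemble the config arguments from environment variables before constructing the agent. The sketch below is an assumption-laden example: the variable names LOCAL_MODEL, LOCAL_BASE_URL, and LOCAL_API_KEY and the helper chat_config_kwargs are hypothetical, and only the model, base_url, and api_key fields come from the example above.

```python
import os
from typing import Mapping


def chat_config_kwargs(env: Mapping[str, str] = os.environ) -> dict[str, str]:
    """Build keyword arguments for an OpenAI-compatible endpoint config."""
    # Hypothetical variable names; the defaults mirror the example above.
    base_url = env.get("LOCAL_BASE_URL", "http://localhost:8000/v1")
    return {
        "model": env.get("LOCAL_MODEL", "my-model"),
        "base_url": base_url.rstrip("/"),  # normalize a trailing slash
        "api_key": env.get("LOCAL_API_KEY", "local"),
    }


print(chat_config_kwargs({}))
# With hud installed, the result would be passed on as:
#   OpenAIChatAgent(OpenAIChatConfig(**chat_config_kwargs()))
```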

Bring your own harness

A harness only has to attach to a capability and define a tool spec, so wrapping another agent framework takes a thin adapter, with no protocol work. Subclass Agent and implement __call__:
harness.py
from hud.agents.base import Agent
from hud import Run

class EchoAgent(Agent):
    async def __call__(self, run: Run) -> None:
        # Read run.prompt_text, do work, then write the answer:
        run.trace.content = "my answer"
run.trace.content is the answer that gets graded on exit. The bundled BrowserUseAgent (in hud.agents.browser_use) is exactly this pattern — browser-use driving the cdp capability.
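
The adapter pattern can be tried without hud installed. The sketch below uses stand-in classes (FakeRun, FakeTrace) that model only the two attributes mentioned on this page, prompt_text and trace.content; they are assumptions for illustration, not hud's real Run. AdapterAgent and count_r are likewise hypothetical. In a real harness you would subclass hud.agents.base.Agent and receive an actual Run.

```python
import asyncio
from dataclasses import dataclass, field
from typing import Awaitable, Callable


@dataclass
class FakeTrace:
    content: str = ""  # stand-in for the graded answer


@dataclass
class FakeRun:
    prompt_text: str
    trace: FakeTrace = field(default_factory=FakeTrace)


class AdapterAgent:
    """Wraps any async `prompt -> answer` callable behind `await agent(run)`."""

    def __init__(self, solve: Callable[[str], Awaitable[str]]) -> None:
        self._solve = solve

    async def __call__(self, run: FakeRun) -> None:
        # Delegate the work to the wrapped framework, then record the answer.
        run.trace.content = await self._solve(run.prompt_text)


async def count_r(prompt: str) -> str:
    # Hypothetical "framework": counts the letter r in the prompt.
    return str(prompt.count("r"))


run = FakeRun(prompt_text="strawberry")
asyncio.run(AdapterAgent(count_r)(run))
print(run.trace.content)  # 3
```

Keeping the wrapped framework behind a single async callable means the adapter itself stays a few lines long, which is the point of the contract.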

Next steps

  • Deploy & scale: Package once, run anywhere.
  • Train on your tasks: Turn a group of rewards into GRPO advantages.
  • Agents reference: Every agent class, config, and the Run contract.
  • Capabilities: What a harness can attach to.