HUD Documentation — Evaluations and RL Environments.

HUD is a platform for building RL environments for AI agents. An RL environment is an isolated sandbox where an agent can take actions (via tools), complete a task, and receive a reward signal. Running agents through these environments produces traces: full recordings of what the agent did, which tools it called, and how it scored. These traces serve both as evaluation results and as training signal for reinforcement learning. People use HUD for:

Evaluating model performance. Build a taskset, run it across models, compare scores. Find out which model handles your use case before you commit to one.
Building RL environments. Create frontier-grade post-training data for capabilities such as coding, computer use, tool use, deep research, and more.
Training specialized agents. Use RL training to produce a model that’s better at your specific tasks.

The platform gives you three pieces:

Environment SDK — Define agent-callable tools and evaluation logic. Each environment spins up fresh and isolated for every run.
Eval & Training Platform — Run evaluations at scale on hud.ai. Collect traces. Train models on successful runs.
Model Gateway — One OpenAI-compatible endpoint at inference.hud.ai for Claude, GPT, Gemini, Grok, and more.

Read Scaffolding to get started!

Install

# Install CLI
uv tool install hud-python --python 3.12

# Login
hud login

hud login opens your browser, authenticates with hud.ai, and stores your API key in ~/.hud/.env. You can also set the key manually with hud set HUD_API_KEY=your-key-here.
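
As a quick sanity check after logging in, the sketch below looks for the key first in the HUD_API_KEY environment variable and then in ~/.hud/.env. It assumes the file uses plain KEY=value lines, which this page does not confirm. parse_env_text and find_api_key are illustrative helper names, not part of the HUD SDK.

```python
from __future__ import annotations

import os
from pathlib import Path


def parse_env_text(text: str) -> dict[str, str]:
    """Parse KEY=value lines, skipping blanks and # comments (assumed file format)."""
    values = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        # Drop surrounding whitespace and optional quotes around the value
        values[key.strip()] = value.strip().strip('"').strip("'")
    return values


def find_api_key(env_file: Path = Path.home() / ".hud" / ".env") -> str | None:
    """Return the key from the environment, else from the env file, else None."""
    key = os.environ.get("HUD_API_KEY")
    if key:
        return key
    if env_file.is_file():
        return parse_env_text(env_file.read_text()).get("HUD_API_KEY")
    return None


if __name__ == "__main__":
    print("HUD_API_KEY found" if find_api_key() else "HUD_API_KEY not found")
```

If the script reports that the key is missing, running hud login again or setting the key manually as described above should resolve it.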

1. Environments: Define Your Agent’s Harness

An environment wraps your code as tools agents can call, and defines scenarios that evaluate what agents do. Each environment spins up fresh and isolated for every evaluation — no shared state, fully reproducible.

from hud import Environment

env = Environment("my-env")

@env.scenario("count")
async def count(word: str, letter: str):
    # First yield: send a prompt to the agent, get its answer back
    answer = yield f"How many '{letter}' in '{word}'?"

    # Second yield: score the answer as a reward between 0.0 and 1.0
    correct = str(word.lower().count(letter.lower()))
    yield 1.0 if answer and correct in answer else 0.0

The scenario has two yields: the first sends a prompt to the agent and receives its answer. The second scores the result as a reward. Learn more about scenarios.
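
The two-yield pattern relies on Python's async generator protocol. The sketch below is not HUD's runner. It is a minimal stand-in (drive_scenario and the canned fake_agent are hypothetical) showing how a caller can receive the prompt from the first yield, send the answer back in, and read the reward from the second yield.

```python
import asyncio


async def count(word: str, letter: str):
    # Same scenario body as above, without the @env.scenario decorator
    answer = yield f"How many '{letter}' in '{word}'?"
    correct = str(word.lower().count(letter.lower()))
    yield 1.0 if answer and correct in answer else 0.0


async def drive_scenario(scenario_gen, agent) -> float:
    """Hypothetical driver: prompt out, answer in, reward out."""
    prompt = await scenario_gen.__anext__()     # run to the first yield
    answer = await agent(prompt)                # let the agent respond
    reward = await scenario_gen.asend(answer)   # resume; runs to the second yield
    await scenario_gen.aclose()                 # finalize the generator
    return reward


async def fake_agent(prompt: str) -> str:
    # Canned reply standing in for a real model call
    return "There are 3."


if __name__ == "__main__":
    reward = asyncio.run(drive_scenario(count("strawberry", "r"), fake_agent))
    print(reward)
```

Note that the scoring line does a substring check, so any answer containing the correct digit counts as correct. A stricter scenario could parse the number out of the answer before comparing.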

Example Workflow

hud login                   # Authenticate (one-time)
hud init my-env             # Scaffold environment
cd my-env
hud dev env:env -w env.py   # Run MCP server locally with hot-reload on watched paths
hud eval tasks.json claude  # Run an eval locally
hud deploy                  # Deploy to platform → run at scale

→ More on Environments · Deploy to Platform

2. Tasks & Training: Evaluate and Train

A task is a scenario with specific arguments. Group tasks into tasksets and run them across models. Train models on successful traces to produce a model that’s better at your specific use case.

import asyncio

import hud
from hud.agents import create_agent

# `env` is the Environment defined in section 1
task = env("count", word="strawberry", letter="r")
agent = create_agent("claude-sonnet-4-5")

async def main():
    async with hud.eval(task) as ctx:
        result = await agent.run(ctx)
    print(f"Reward: {result.reward}")  # 1.0 if the agent's answer contains "3"

asyncio.run(main())

Create tasks on hud.ai, run evaluations across models, and train on successful traces. → More on Tasks & Training
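
To make the taskset idea concrete, here is a local sketch that does not call HUD. It treats a taskset as a list of argument dictionaries for the count scenario, scores made-up answers with the same rule as the scenario, and reports the mean reward per model. score_count, mean_reward, the placeholder model names, and the sample answers are all illustrative assumptions, not platform output.

```python
from statistics import mean


def score_count(word: str, letter: str, answer: str) -> float:
    """Same rule as the `count` scenario: 1.0 if the correct count appears in the answer."""
    correct = str(word.lower().count(letter.lower()))
    return 1.0 if answer and correct in answer else 0.0


# A taskset: the same scenario with different arguments
taskset = [
    {"word": "strawberry", "letter": "r"},
    {"word": "banana", "letter": "a"},
    {"word": "Mississippi", "letter": "s"},
]

# Made-up answers for two placeholder models, in taskset order
answers = {
    "model-a": ["3", "3", "4"],
    "model-b": ["2", "3", "4"],
}


def mean_reward(model_answers: list[str]) -> float:
    """Average reward of one model over the whole taskset."""
    return mean(
        score_count(**task, answer=answer)
        for task, answer in zip(taskset, model_answers)
    )


if __name__ == "__main__":
    for model, model_answers in answers.items():
        print(model, round(mean_reward(model_answers), 2))
```

On the platform, the per-task rewards would come from recorded traces rather than from a hand-written dictionary.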

3. Models: Any Model, One API

HUD integrates with Anthropic, OpenAI, Gemini, xAI, and more out of the box. Point any OpenAI-compatible client at inference.hud.ai and use any model. Browse all available models at hud.ai/models.

import asyncio
import os

from openai import AsyncOpenAI

client = AsyncOpenAI(
    base_url="https://inference.hud.ai",
    api_key=os.environ["HUD_API_KEY"],
)

async def main():
    response = await client.chat.completions.create(
        model="claude-sonnet-4-5",  # or gpt-4o, gemini-2.5-pro, grok-4-1-fast...
        messages=[{"role": "user", "content": "Hello!"}],
    )
    print(response.choices[0].message.content)

asyncio.run(main())

Every call is traced. View the traces at hud.ai/home. → More on Models
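
Because every model sits behind the same endpoint, comparing models can come down to changing the model string. The sketch below assumes an AsyncOpenAI-style client like the one in this section. ask_all is a hypothetical helper, not part of HUD or the openai package.

```python
import asyncio


async def ask_all(client, models: list[str], prompt: str) -> dict[str, str]:
    """Send one prompt to several models concurrently; return model -> reply text."""

    async def ask(model: str) -> str:
        response = await client.chat.completions.create(
            model=model,
            messages=[{"role": "user", "content": prompt}],
        )
        return response.choices[0].message.content

    # gather preserves input order, so replies line up with `models`
    replies = await asyncio.gather(*(ask(m) for m in models))
    return dict(zip(models, replies))


# Usage inside an async function, given a configured client
# (needs network access and a valid HUD_API_KEY):
#   replies = await ask_all(client, ["claude-sonnet-4-5", "gpt-4o"], "Hello!")
```

Passing the client in as a parameter keeps the helper easy to exercise with a stub object before spending real inference calls.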

Next Steps

Scaffolding

Create environments, define tools and scenarios.

Tasks & Evaluation

Define tasks, test locally, iterate.

Deploy & Go Remote

Deploy and run evaluations at scale.

Environments as Data

Design for useful training signal.

Community

GitHub

Star the repo and contribute

Discord

Join the community

Enterprise

Building agents at scale? We work with teams on custom environments, benchmarks, and training pipelines. 📅 Book a call · 📧 founders@hud.ai
