HUD is a platform for building RL environments for AI agents. An RL environment is an isolated sandbox where an agent can take actions (via tools), complete a task, and receive a reward signal. Running agents through these environments produces traces — full recordings of what the agent did, what tools it called, and how it scored. These traces are useful for model evaluation and as a reinforcement learning training signal. People use HUD for:
  • Evaluating model performance. Build a taskset, run it across models, compare scores. Find out which model handles your use case before you commit to one.
  • Building RL environments. Create frontier-grade post-training data for different capabilities such as coding, computer use, tool use, deep research, and more.
  • Training specialized agents. Use reinforcement fine-tuning (RFT) to produce a model that’s better at your specific tasks.
The platform gives you three pieces:
  1. Environment SDK — Define agent-callable tools and evaluation logic. Each environment spins up fresh and isolated for every run.
  2. Eval & Training Platform — Run evaluations at scale on hud.ai. Collect traces. Train models on successful runs.
  3. Model Gateway — One OpenAI-compatible endpoint at inference.hud.ai for Claude, GPT, Gemini, Grok, and more.
Read Core Concepts before getting started!

Install

# Install CLI
uv tool install hud-python --python 3.12

# Set your API key
hud set HUD_API_KEY=your-key-here
Get your API key at hud.ai/project/api-keys.

1. Environments: Define Your Agent’s Harness

An environment wraps your code as tools agents can call, and defines scenarios that evaluate what agents do. Each environment spins up fresh and isolated for every evaluation — no shared state, fully reproducible.
from hud import Environment

env = Environment("my-env")

@env.scenario("count")
async def count(word: str, letter: str):
    # First yield: send a prompt to the agent, get its answer back
    answer = yield f"How many '{letter}' in '{word}'?"

    # Second yield: score the answer as a reward between 0.0 and 1.0
    correct = str(word.lower().count(letter.lower()))
    yield 1.0 if answer and correct in answer else 0.0
The scenario has two yields: the first sends a prompt to the agent and receives its answer. The second scores the result as a reward. Learn more about scenarios.
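Under the hood, this two-yield shape is the async-generator handshake: the runner pulls the prompt from the first yield, sends the agent's answer back in, and reads the reward from the second yield. Here is a minimal sketch of that handshake in plain Python, with a stub agent standing in for a real model — the `drive` function is an illustration of the protocol, not HUD's actual runner:

```python
import asyncio

async def count(word: str, letter: str):
    # First yield: hand the prompt out, receive the agent's answer back
    answer = yield f"How many '{letter}' in '{word}'?"
    # Second yield: emit the reward
    correct = str(word.lower().count(letter.lower()))
    yield 1.0 if answer and correct in answer else 0.0

async def drive(scenario, agent):
    prompt = await scenario.asend(None)   # advance to the first yield -> prompt
    answer = await agent(prompt)          # ask the (stub) agent
    return await scenario.asend(answer)   # resume with the answer -> reward

async def main():
    async def agent(prompt: str) -> str:
        return "3"  # stub agent that always answers "3"
    return await drive(count("strawberry", "r"), agent)

reward = asyncio.run(main())
print(reward)  # 1.0: "strawberry" contains three r's
```

The same scenario code runs unchanged whether the answer comes from a stub or a frontier model; only the driver differs.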

Example Workflow

hud init my-env       # Scaffold environment
cd my-env
hud dev env:env -w env.py   # Run MCP server locally with hot-reload on watched paths
hud eval tasks.json claude  # Run an eval locally
hud deploy                  # Deploy to platform → run at scale
More on Environments · Deploy to Platform

2. Tasks & Training: Evaluate and Train

A task is a scenario with specific arguments. Group tasks into tasksets and run them across models. Train models on successful traces to produce a model that’s better at your specific use case.
import hud
from hud.agents import create_agent

task = env("count", word="strawberry", letter="r")  # env from section 1
agent = create_agent("claude-sonnet-4-5")

async with hud.eval(task) as ctx:
    result = await agent.run(ctx)

print(f"Reward: {result.reward}")  # 1.0 if agent answers "3"
Create tasks on hud.ai, run evaluations across models, and train on successful traces. More on Tasks & Training
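Comparing models over a taskset ultimately reduces to aggregating per-trace rewards by model. A minimal sketch of that aggregation in plain Python — the `(model, reward)` pair shape here is an illustration, not HUD's actual trace format:

```python
from collections import defaultdict

def mean_reward_by_model(traces: list[tuple[str, float]]) -> dict[str, float]:
    """Average the reward of each (model, reward) trace per model."""
    rewards: dict[str, list[float]] = defaultdict(list)
    for model, reward in traces:
        rewards[model].append(reward)
    return {model: sum(r) / len(r) for model, r in rewards.items()}

traces = [
    ("claude-sonnet-4-5", 1.0),
    ("claude-sonnet-4-5", 0.0),
    ("gpt-4o", 1.0),
    ("gpt-4o", 1.0),
]
print(mean_reward_by_model(traces))
# {'claude-sonnet-4-5': 0.5, 'gpt-4o': 1.0}
```

The platform computes comparisons like this for you; the sketch just shows what "compare scores across models" means in terms of the reward signal.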

3. Models: Any Model, One API

Integrations with Anthropic, OpenAI, Gemini, xAI, and more out of the box. Point any OpenAI-compatible client at inference.hud.ai and use any model. Browse all available models at hud.ai/models.
from openai import AsyncOpenAI
import os

client = AsyncOpenAI(
    base_url="https://inference.hud.ai",
    api_key=os.environ["HUD_API_KEY"]
)

response = await client.chat.completions.create(
    model="claude-sonnet-4-5",  # or gpt-4o, gemini-2.5-pro, grok-4-1-fast...
    messages=[{"role": "user", "content": "Hello!"}]
)
Every call is traced. View them at hud.ai/home. More on Models

Next Steps

Core Concepts

Environments, tools, scenarios, tasks — defined in one place.

Environments

Tools, scenarios, and iteration.

Tasks & Training

Evaluate and train models.

Best Practices

Patterns for reliable environments and evals.

Community

GitHub

Star the repo and contribute

Discord

Join the community

Enterprise

Building agents at scale? We work with teams on custom environments, benchmarks, and training pipelines. 📅 Book a call · 📧 founders@hud.ai