Run agents against your tasksets, analyze the results, and train models on successful traces.
Before running evaluations, you need a deployed environment and a taskset with tasks. See Environments and Deploy.

Running Evaluations

Open your taskset on hud.ai/evalsets, click Run Taskset, and configure your run:
Run taskset configuration modal
  • Models — Select one or more models to evaluate. Multi-select runs the same tasks across all selected models.
  • Group Size — How many times to run each task per model; more runs give higher confidence in the results.
  • Max Steps — The maximum number of agent actions allowed per task.
Jobs queue and run in parallel across HUD infrastructure. Each run generates a trace containing the full conversation, tool calls, and reward. Results appear in real time on the taskset's Leaderboard tab: rankings, success rates, and model comparisons.
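To see how group size feeds the leaderboard, here is a minimal sketch (in plain Python, with a hypothetical trace shape — the real platform computes this for you) of aggregating repeated runs into a per-model ranking:

```python
from collections import defaultdict
from statistics import mean

# Hypothetical trace records: with group size 2, each (model, task) pair
# is run twice, and each run yields a reward in [0, 1].
traces = [
    {"model": "claude", "task": "t1", "reward": 1.0},
    {"model": "claude", "task": "t1", "reward": 0.0},
    {"model": "gpt-4o", "task": "t1", "reward": 1.0},
    {"model": "gpt-4o", "task": "t1", "reward": 1.0},
]

by_model = defaultdict(list)
for t in traces:
    by_model[t["model"]].append(t["reward"])

# Mean reward per model, sorted best-first, as a leaderboard would rank them.
leaderboard = sorted(
    ((mean(rewards), model) for model, rewards in by_model.items()),
    reverse=True,
)
```

A larger group size simply adds more rewards per model before averaging, which is why repeats tighten the confidence in each model's score.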
Tasksets and leaderboards

Training Models

Training turns your evaluation traces into better models:
  1. Go to hud.ai/models and find a trainable base model in Explore
  2. Click Fork to create your copy; this gives you your model ID
  3. Click Train Model and select a taskset as training data
  4. Training creates a new checkpoint in your model’s tree
Model checkpoints after training
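The "train on successful traces" idea can be sketched in plain Python. The trace shape and threshold below are hypothetical, for illustration only — the platform handles this selection when you pick a taskset as training data:

```python
# Hypothetical: keep only successful traces (reward at or above a
# threshold) as supervised training data for the next checkpoint.
SUCCESS_THRESHOLD = 0.8

traces = [
    {"task": "t1", "reward": 1.0, "messages": ["..."]},
    {"task": "t2", "reward": 0.3, "messages": ["..."]},
    {"task": "t3", "reward": 0.9, "messages": ["..."]},
]

training_data = [t for t in traces if t["reward"] >= SUCCESS_THRESHOLD]
```

Filtering by reward is what turns an evaluation run into a training signal: only the behavior you want reinforced makes it into the next checkpoint.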
Set any checkpoint as HEAD to use it for inference. Your model ID works through the same gateway:
result = await task.run("your-model-id")
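Since `task.run` is a coroutine, it needs an event loop. Here is a runnable sketch with a stand-in `Task` stub (hypothetical — the real task object comes from the HUD SDK after loading it from your taskset):

```python
import asyncio

class Task:
    """Hypothetical stand-in for a HUD task object, stubbed for illustration."""

    async def run(self, model_id: str) -> dict:
        # On the platform, this would execute the agent against the task
        # and return a trace with a reward.
        return {"model": model_id, "reward": 1.0}

async def main() -> dict:
    task = Task()
    # Same call shape as above: your forked model ID routes through the
    # gateway to whichever checkpoint is currently set as HEAD.
    return await task.run("your-model-id")

result = asyncio.run(main())
```

Because inference goes through the gateway, promoting a different checkpoint to HEAD changes what the same model ID serves, with no code changes on the caller's side.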
See Platform Models for training details.

CLI Alternative

Prefer the command line? Use hud eval for running evaluations locally or remotely:
# Run a platform taskset with a model
hud eval "My Tasks" claude --full

# Run with multiple repeats for variance
hud eval "My Tasks" claude --full --group-size 5

# Run remotely on HUD infrastructure
hud eval "My Tasks" claude --full --remote

# Run from a local file, linked to a platform taskset
hud eval tasks.json claude --full --taskset "My Tasks"
See hud eval CLI reference for all options.

The Loop

Deploy your environment, create tasks, run evaluations, train on successful traces, use the trained model. Repeat. Every evaluation generates traces. Every training run creates a better model. Agents get better at your environment, your tasks, your success criteria.

What’s Next

Platform Models

Model training and checkpoints

Platform Tasksets

Full taskset management guide

Publishing Leaderboards

Make your benchmarks public

Best Practices

Design effective environments and evals