Once your environment is deployed, the platform handles everything: spinning up instances, running agents, collecting traces, and training models. Create tasks from your scenarios, run them across models, and use successful runs as training data.
Before you can run evaluations or train models, your environment must be deployed via hud deploy. See Hosted Running.

Tasks and Tasksets

Tasks are instances of scenarios with specific arguments. Tasksets group related tasks for batch evaluation.
  1. Go to your environment on hud.ai/environments
  2. Click the Scenarios tab and select a scenario
  3. Fill in the arguments and add to a taskset
[Image: Creating tasks from scenarios]
Your scenario might take arguments:
@env.scenario("checkout")
async def checkout_flow(product_name: str, apply_coupon: bool = False):
    # First yield: the prompt handed to the agent
    yield f"Complete checkout for {product_name}" + (" with coupon" if apply_coupon else "")
    # Second yield: the reward once the run finishes
    yield 1.0 if order_confirmed() else 0.0
Create multiple tasks from it—checkout-laptop, checkout-phone-coupon, checkout-headphones. Group them in a taskset and run them all at once. Tasksets become your benchmarks—run them against new model versions to track progress. See Platform Tasksets for the full guide.
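To make the fan-out concrete, here is a minimal sketch; the names and structure are illustrative, not the platform's task API. Each task simply pairs the scenario with one set of arguments, mirroring the fields you fill in on the Scenarios tab:
# Illustrative only: one task per argument combination for the same scenario
checkout_tasks = [
    ("checkout-laptop", {"product_name": "laptop"}),
    ("checkout-phone-coupon", {"product_name": "phone", "apply_coupon": True}),
    ("checkout-headphones", {"product_name": "headphones"}),
]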

Running Evaluations

Open your taskset on hud.ai/evalsets, click Run Taskset, and configure your run:
[Image: Run taskset configuration modal]
  • Models — Select one or more models to evaluate. Multi-select runs the same tasks across all selected models.
  • Group Size — How many times to run each task per model (more runs = higher confidence)
  • Max Steps — Limit agent actions per task
Jobs queue and run in parallel across HUD infrastructure. Each run generates a trace with the full conversation, tool calls, and reward. Results show up in real time on the taskset's Leaderboard tab: rankings, success rates, and model comparisons.
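As a rough illustration of what Group Size buys you (plain arithmetic, not platform code): with a group size of 5, a task's score is the mean reward across its five runs, and the uncertainty of that mean shrinks as the group grows:
import math
import statistics

# Rewards from five runs of one task (Group Size = 5)
rewards = [1.0, 0.0, 1.0, 1.0, 1.0]

success_rate = statistics.mean(rewards)                       # 0.8
stderr = statistics.stdev(rewards) / math.sqrt(len(rewards))  # ~0.2; shrinks with more runs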
[Image: Tasksets and leaderboards]

Training Models

Training turns your evaluation traces into better models:
  1. Go to hud.ai/models and find a trainable base model in Explore
  2. Click Fork to create your copy—this gives you your model ID
  3. Click Train Model and select a taskset as training data
  4. Training creates a new checkpoint in your model’s tree
[Image: Model checkpoints after training]
Set any checkpoint as HEAD to use it for inference. Your model ID works through the same gateway:
import hud
from hud.agents import create_agent

# Your forked model - evaluate at any time
agent = create_agent("your-model-id")

# task comes from your taskset
async with hud.eval(task) as ctx:
    result = await agent.run(ctx)
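Putting those pieces together, a rough local loop over several tasks might look like the sketch below. It uses only the calls shown above; tasks is assumed to be a list of task objects from your taskset (loading them is not shown), and reading the reward off the result is an assumption for illustration:
import asyncio

import hud
from hud.agents import create_agent

async def evaluate_all(tasks):
    # tasks: task objects from your taskset (loading them is not shown here)
    agent = create_agent("your-model-id")
    rewards = []
    for task in tasks:
        async with hud.eval(task) as ctx:
            result = await agent.run(ctx)
        rewards.append(result.reward)  # assumption: the result exposes its reward
    return sum(rewards) / len(rewards)  # mean reward across the taskset

# asyncio.run(evaluate_all(my_tasks))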
See Platform Models for training details.

CLI Alternative

Prefer the command line? Use hud eval for running evaluations locally or remotely:
# Run a taskset with a model
hud eval my-taskset claude --full

# Run with multiple repeats for variance
hud eval my-taskset claude --full --group-size 5

# Run remotely on HUD infrastructure
hud eval my-taskset claude --full --remote
See the hud eval CLI reference for all options.

The Loop

Deploy your environment, create tasks, run evaluations, train on successful traces, use the trained model. Repeat. Every evaluation generates traces. Every training run creates a better model. Agents get better at your environment, your tasks, your success criteria.

What’s Next