Run agents against your tasksets, analyze the results, and train models on successful traces.
Before running evaluations, you need a deployed environment and a taskset with tasks. See Environments and Deploy.

Running Evaluations

Open your taskset on hud.ai/evalsets, click Run Taskset, and configure your run:
Run taskset configuration modal
  • Models — Select one or more models to evaluate. Multi-select runs the same tasks across all selected models.
  • Group Size — How many times to run each task per model; more runs give higher confidence in the results.
  • Max Steps — The maximum number of agent actions allowed per task.
Jobs queue and run in parallel across HUD infrastructure. Each run generates a trace containing the full conversation, tool calls, and reward. Results appear in real time on the taskset's Leaderboard tab: rankings, success rates, and model comparisons.
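To see how group size feeds the leaderboard, here is a minimal sketch (in plain Python, with a hypothetical trace shape — the real platform computes this for you) of aggregating repeated runs into a per-model ranking:

```python
from collections import defaultdict
from statistics import mean

# Hypothetical trace records: with group size 2, each (model, task) pair
# is run twice, and each run yields a reward in [0, 1].
traces = [
    {"model": "claude", "task": "t1", "reward": 1.0},
    {"model": "claude", "task": "t1", "reward": 0.0},
    {"model": "gpt-4o", "task": "t1", "reward": 1.0},
    {"model": "gpt-4o", "task": "t1", "reward": 1.0},
]

by_model = defaultdict(list)
for t in traces:
    by_model[t["model"]].append(t["reward"])

# Mean reward per model, sorted best-first, as a leaderboard would rank them.
leaderboard = sorted(
    ((mean(rewards), model) for model, rewards in by_model.items()),
    reverse=True,
)
```

A larger group size simply adds more rewards per model before averaging, which is why repeats tighten the confidence in each model's score.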
Tasksets and leaderboards

Training Models

Training turns your evaluation traces into better models:
  1. Go to hud.ai/models and find a trainable base model in Explore
  2. Click Fork to create your copy; this gives you your model ID
  3. Click Train Model and select a taskset as training data
  4. Training creates a new checkpoint in your model’s tree
Model checkpoints after training
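The "train on successful traces" idea can be sketched in plain Python. The trace shape and threshold below are hypothetical, for illustration only — the platform handles this selection when you pick a taskset as training data:

```python
# Hypothetical: keep only successful traces (reward at or above a
# threshold) as supervised training data for the next checkpoint.
SUCCESS_THRESHOLD = 0.8

traces = [
    {"task": "t1", "reward": 1.0, "messages": ["..."]},
    {"task": "t2", "reward": 0.3, "messages": ["..."]},
    {"task": "t3", "reward": 0.9, "messages": ["..."]},
]

training_data = [t for t in traces if t["reward"] >= SUCCESS_THRESHOLD]
```

Filtering by reward is what turns an evaluation run into a training signal: only the behavior you want reinforced makes it into the next checkpoint.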
Set any checkpoint as HEAD to use it for inference. Your model ID works through the same gateway:
result = await task.run("your-model-id")
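Since `task.run` is a coroutine, it needs an event loop. Here is a runnable sketch with a stand-in `Task` stub (hypothetical — the real task object comes from the HUD SDK after loading it from your taskset):

```python
import asyncio

class Task:
    """Hypothetical stand-in for a HUD task object, stubbed for illustration."""

    async def run(self, model_id: str) -> dict:
        # On the platform, this would execute the agent against the task
        # and return a trace with a reward.
        return {"model": model_id, "reward": 1.0}

async def main() -> dict:
    task = Task()
    # Same call shape as above: your forked model ID routes through the
    # gateway to whichever checkpoint is currently set as HEAD.
    return await task.run("your-model-id")

result = asyncio.run(main())
```

Because inference goes through the gateway, promoting a different checkpoint to HEAD changes what the same model ID serves, with no code changes on the caller's side.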
See Platform Models for training details.

CLI Alternative

Prefer the command line? Use hud eval for running evaluations locally or remotely:
# Run a platform taskset with a model
hud eval "My Tasks" claude --full

# Run with multiple repeats for variance
hud eval "My Tasks" claude --full --group-size 5

# Run remotely on HUD infrastructure
hud eval "My Tasks" claude --full --remote

# Run from a local file, linked to a platform taskset
hud eval tasks.json claude --full --taskset "My Tasks"
See hud eval CLI reference for all options.

The Loop

Deploy your environment, create tasks, run evaluations, train on successful traces, use the trained model. Repeat. Every evaluation generates traces. Every training run creates a better model. Agents get better at your environment, your tasks, your success criteria.

What’s Next

Platform Models

Model training and checkpoints

Platform Tasksets

Full taskset management guide

Publishing Leaderboards

Make your benchmarks public

Best Practices

Design effective environments and evals