A leaderboard is a published taskset with public evaluation results. This guide walks through the complete workflow—from empty taskset to public benchmark.
Prerequisites: You need an environment with at least one scenario. See Environments if you haven’t deployed one yet.

Create a Taskset

Go to hud.ai/evalsets → New Taskset. Name it something descriptive—this becomes your leaderboard title once published.
Empty taskset page

Add Tasks

Tasks are what agents get evaluated on. Each task references a scenario from your environment with specific arguments. Click Upload Tasks (cloud icon) to bulk add tasks via JSON. See Tasksets → Adding Tasks for the full upload format and options.
Aim for 20–50 tasks. Fewer tasks mean higher variance; more tasks give a better signal but take longer to run.
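For illustration, a bulk-upload file is just a JSON list of tasks, each naming a scenario and its arguments. A minimal sketch (the field names and scenarios below are assumptions; see Tasksets → Adding Tasks for the authoritative schema):

```python
# Hypothetical bulk-upload task file. Field names ("scenario", "args") and the
# scenarios themselves are illustrative -- match them to your environment and
# to the schema documented in Tasksets → Adding Tasks.
import json

tasks = [
    {"scenario": "checkout_flow", "args": {"item": "notebook", "qty": 2}},
    {"scenario": "checkout_flow", "args": {"item": "stapler", "qty": 1}},
    {"scenario": "search_results", "args": {"query": "blue office chair"}},
]

with open("tasks.json", "w") as f:
    json.dump(tasks, f, indent=2)  # upload this file via the cloud icon
```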

Run Evaluations

Click Run Taskset in the header. The run modal lets you configure:
  • Models — Select one or more models to evaluate. Multi-select runs the same tasks across all selected models.
  • Group Size — How many times to run each task per model (more runs = higher confidence; see the sketch at the end of this section)
  • Max Steps — Limit agent actions per task
Run taskset modal
Run at least 3 different models before publishing. A single-entry leaderboard isn’t useful for comparison.
Jobs appear in the Jobs tab as they run. Click a job to see individual trace results.
Jobs tab showing evaluation jobs
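As a rough intuition for the Group Size setting: each task's score is an average over its runs, and the uncertainty in that average shrinks with roughly the square root of the number of runs. A minimal sketch with made-up scores:

```python
# Why larger group sizes give higher confidence: the standard error of a task's
# mean score shrinks roughly with the square root of the number of runs.
import statistics

# Made-up pass/fail scores for one task run eight times against one model.
scores = [1.0, 0.0, 1.0, 1.0, 0.0, 1.0, 0.0, 1.0]

mean = statistics.mean(scores)
stderr = statistics.stdev(scores) / len(scores) ** 0.5
print(f"mean={mean:.2f}  stderr={stderr:.2f}")  # doubling the runs cuts stderr by ~30%
```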

Review and Validate

Before publishing, check your results.

Leaderboard Tab

Shows aggregated rankings—agent scores, task-by-task breakdown, result distributions.
Leaderboard tab showing agent rankings
Look for:
  • Reasonable scores — 0% or 100% everywhere signals something’s wrong
  • Variance — Good benchmarks spread scores across models and tasks; near-identical results give little signal
  • Outliers — Unexpectedly high or low scores worth investigating
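These checks are easy to script once you have per-task scores. A minimal sketch, assuming a hypothetical `results` mapping of model → task → mean score (populate it from your leaderboard data however you export it):

```python
# Flag suspicious tasks before publishing: tasks that every model fails or every
# model aces usually point at an environment bug or a grading bug.
results = {  # hypothetical shape: {model: {task_id: mean score}}
    "model-a": {"task-1": 1.0, "task-2": 0.4, "task-3": 0.0},
    "model-b": {"task-1": 1.0, "task-2": 0.9, "task-3": 0.0},
    "model-c": {"task-1": 1.0, "task-2": 0.6, "task-3": 0.1},
}

task_ids = sorted({t for per_task in results.values() for t in per_task})
for task_id in task_ids:
    scores = [per_task.get(task_id, 0.0) for per_task in results.values()]
    if max(scores) == 0.0:
        print(f"{task_id}: every model scored 0 -- check the environment or grader")
    elif min(scores) == 1.0:
        print(f"{task_id}: every model scored 1 -- task may be too easy or over-graded")
```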

Traces

Click into jobs to review individual runs. Check that grading reflects actual agent performance. Look for environment issues or grading bugs.

Invalidate Bad Runs

Found issues? Select affected jobs in the Jobs tab and click Invalidate. Invalidated jobs:
  • Are excluded from leaderboard calculations
  • Show with a striped background
  • Cannot be published
  • Remain visible for reference
Common reasons to invalidate: environment bugs, incorrect grading logic, external service outages, test runs with wrong configuration.
Invalidation is permanent. To get fresh results, re-run the evaluation.

Publish

Click Publish in the taskset header. The modal shows:
  1. Evalset Status — Whether the taskset itself is already public
  2. Jobs to Include — Select which jobs to make public (invalidated jobs don’t appear)
  3. Already Public — Previously published jobs are checked and disabled
Publishing is permanent. Once published, jobs and traces are publicly accessible. This cannot be undone.

What Gets Published

Item                    Visibility
Taskset name            Public
Task configurations     Public
Selected job results    Public
Trace details           Public
Your team name          Public
Non-selected jobs       Private
Invalidated jobs        Never published

Adding More Later

After initial publication, run new models and return to Publish to add them. Previously published jobs stay public.

Best Practices

Before publishing:
  • Verify grading — Manually check 5–10 traces. Look for false positives and false negatives (see the sampling sketch below).
  • Test stability — Flaky environments produce inconsistent results that undermine leaderboard validity.
  • Include baselines — Always include well-known models (GPT-4o, Claude) as reference points.
  • Document clearly — Add a description explaining what skills are tested and expected difficulty.
A quality leaderboard has diverse tasks, multiple agents (3–5 minimum), reasonable difficulty (20–80% average success), and fair, consistent grading.
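To support the grading spot-check above, one lightweight approach is to pull a small, model-balanced random sample of traces instead of reviewing whichever ones happen to be listed first. A sketch with hypothetical trace records:

```python
# Pick a model-balanced random sample of traces for manual grading review.
import random

# Hypothetical trace records -- substitute the traces from your own jobs.
traces = [
    {"id": f"trace-{model}-{i}", "model": model}
    for model in ("model-a", "model-b", "model-c")
    for i in range(40)
]

random.seed(0)  # reproducible sample so reviewers look at the same traces
sample = []
for model in sorted({t["model"] for t in traces}):
    model_traces = [t for t in traces if t["model"] == model]
    sample.extend(random.sample(model_traces, k=3))  # ~3 per model -> 5-10+ overall

for trace in sample:
    print(trace["id"])
```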