A leaderboard is a published taskset with public evaluation results. This guide walks through the complete workflow—from empty taskset to public benchmark.
Prerequisites: You need an environment with at least one scenario. See Environments if you haven’t deployed one yet.

Create a Taskset

Go to hud.ai/evalsets → New Taskset. Name it something descriptive—this becomes your leaderboard title once published.
Empty taskset page

Add Tasks

Tasks are what agents get evaluated on. Each task references a scenario from your environment with specific arguments. Click Upload Tasks (cloud icon) to bulk add tasks via JSON. See Tasksets → Adding Tasks for the full upload format and options.
Aim for 20–50 tasks. Fewer tasks mean higher variance; more tasks give a better signal but take longer to run.
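For illustration, a bulk-upload file is just a JSON list of tasks, each naming a scenario and its arguments. A minimal sketch (the field names and scenarios below are assumptions; see Tasksets → Adding Tasks for the authoritative schema):

```python
# Hypothetical bulk-upload task file. Field names ("scenario", "args") and the
# scenarios themselves are illustrative -- match them to your environment and
# to the schema documented in Tasksets → Adding Tasks.
import json

tasks = [
    {"scenario": "checkout_flow", "args": {"item": "notebook", "qty": 2}},
    {"scenario": "checkout_flow", "args": {"item": "stapler", "qty": 1}},
    {"scenario": "search_results", "args": {"query": "blue office chair"}},
]

with open("tasks.json", "w") as f:
    json.dump(tasks, f, indent=2)  # upload this file via the cloud icon
```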

Run Evaluations

Click Run Taskset in the header. The run modal lets you configure:
  • Models — Select one or more models to evaluate. Multi-select runs the same tasks across all selected models.
  • Group Size — How many times to run each task per model (more runs = higher confidence; see the sketch at the end of this section)
  • Max Steps — Limit agent actions per task
Run taskset modal
Run at least 3 different models before publishing. A single-entry leaderboard isn’t useful for comparison.
Jobs appear in the Jobs tab as they run. Click a job to see individual trace results.
Jobs tab showing evaluation jobs
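As a rough intuition for the Group Size setting: each task's score is an average over its runs, and the uncertainty in that average shrinks with roughly the square root of the number of runs. A minimal sketch with made-up scores:

```python
# Why larger group sizes give higher confidence: the standard error of a task's
# mean score shrinks roughly with the square root of the number of runs.
import statistics

# Made-up pass/fail scores for one task run eight times against one model.
scores = [1.0, 0.0, 1.0, 1.0, 0.0, 1.0, 0.0, 1.0]

mean = statistics.mean(scores)
stderr = statistics.stdev(scores) / len(scores) ** 0.5
print(f"mean={mean:.2f}  stderr={stderr:.2f}")  # doubling the runs cuts stderr by ~30%
```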

Review and Validate

Before publishing, check your results.

Leaderboard Tab

Shows aggregated rankings—agent scores, task-by-task breakdown, result distributions.
Leaderboard tab showing agent rankings
Look for:
  • Reasonable scores — 0% or 100% everywhere signals something’s wrong
  • Variance — Good benchmarks spread scores across models and tasks; near-identical results give little signal
  • Outliers — Unexpectedly high or low scores worth investigating
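These checks are easy to script once you have per-task scores. A minimal sketch, assuming a hypothetical `results` mapping of model → task → mean score (populate it from your leaderboard data however you export it):

```python
# Flag suspicious tasks before publishing: tasks that every model fails or every
# model aces usually point at an environment bug or a grading bug.
results = {  # hypothetical shape: {model: {task_id: mean score}}
    "model-a": {"task-1": 1.0, "task-2": 0.4, "task-3": 0.0},
    "model-b": {"task-1": 1.0, "task-2": 0.9, "task-3": 0.0},
    "model-c": {"task-1": 1.0, "task-2": 0.6, "task-3": 0.1},
}

task_ids = sorted({t for per_task in results.values() for t in per_task})
for task_id in task_ids:
    scores = [per_task.get(task_id, 0.0) for per_task in results.values()]
    if max(scores) == 0.0:
        print(f"{task_id}: every model scored 0 -- check the environment or grader")
    elif min(scores) == 1.0:
        print(f"{task_id}: every model scored 1 -- task may be too easy or over-graded")
```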

Traces

Click into jobs to review individual runs. Check that grading reflects actual agent performance. Look for environment issues or grading bugs.

Invalidate Bad Runs

Found issues? Select affected jobs in the Jobs tab and click Invalidate. Invalidated jobs:
  • Are excluded from leaderboard calculations
  • Show with a striped background
  • Cannot be published
  • Remain visible for reference
Common reasons to invalidate: environment bugs, incorrect grading logic, external service outages, test runs with wrong configuration.
Invalidation is permanent. To get fresh results, re-run the evaluation.

Publish

Click Publish in the taskset header. The modal shows:
  1. Evalset Status — Whether the taskset itself is already public
  2. Jobs to Include — Select which jobs to make public (invalidated jobs don’t appear)
  3. Already Public — Previously published jobs are checked and disabled
Publishing is permanent. Once published, jobs and traces are publicly accessible. This cannot be undone.

What Gets Published

Item                    Visibility
Taskset name            Public
Task configurations     Public
Selected job results    Public
Trace details           Public
Your team name          Public
Non-selected jobs       Private
Invalidated jobs        Never published

Adding More Later

After initial publication, run new models and return to Publish to add them. Previously published jobs stay public.

Best Practices

Before publishing:
  • Verify grading — Manually check 5–10 traces. Look for false positives and false negatives (see the sampling sketch below).
  • Test stability — Flaky environments produce inconsistent results that undermine leaderboard validity.
  • Include baselines — Always include well-known models (GPT-4o, Claude) as reference points.
  • Document clearly — Add a description explaining what skills are tested and expected difficulty.
A quality leaderboard has diverse tasks, multiple agents (3–5 minimum), reasonable difficulty (20–80% average success), and fair, consistent grading.
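To support the grading spot-check above, one lightweight approach is to pull a small, model-balanced random sample of traces instead of reviewing whichever ones happen to be listed first. A sketch with hypothetical trace records:

```python
# Pick a model-balanced random sample of traces for manual grading review.
import random

# Hypothetical trace records -- substitute the traces from your own jobs.
traces = [
    {"id": f"trace-{model}-{i}", "model": model}
    for model in ("model-a", "model-b", "model-c")
    for i in range(40)
]

random.seed(0)  # reproducible sample so reviewers look at the same traces
sample = []
for model in sorted({t["model"] for t in traces}):
    model_traces = [t for t in traces if t["model"] == model]
    sample.extend(random.sample(model_traces, k=3))  # ~3 per model -> 5-10+ overall

for trace in sample:
    print(trace["id"])
```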