You’ve built an environment, written scenarios, defined tasks, and tested locally. Now run the same tasks at scale on HUD infrastructure — hundreds of parallel runs, no local compute.

Prerequisites

You need:
  • A working environment (env.py with tools and scenarios)
  • Tasks that pass locally (hud eval tasks/ claude)
  • HUD_API_KEY set (hud set HUD_API_KEY=your-key)

Step 1: Deploy

Deploy your environment before syncing tasks or running remotely.

hud deploy

The simplest path. One command builds and deploys your environment directly to HUD:
hud deploy
This:
  1. Packages your build context (respects .dockerignore)
  2. Uploads to HUD’s build service
  3. Builds remotely via AWS CodeBuild
  4. Streams logs in real-time
  5. Links this directory to the deployed environment
Once complete, your environment appears on the platform:
Deployed environment on the platform
See your environment’s tools, scenarios, and builds at hud.ai/environments. For full details on managing environments through the platform UI, see Platform Environments. The first build takes 2-5 minutes; subsequent deploys are faster thanks to layer caching.
If your environment is pure Python (no system deps), this step is still required for remote execution — the platform needs a container image to spin up isolated instances.

Rebuilding

Run hud deploy again in the same directory. HUD reads .hud/deploy.json to find your existing environment and builds a new version:
hud deploy  # v0.1.0
# make changes...
hud deploy  # v0.1.1

Configuration

Three flags for different purposes:
Flag         When                              Use For
--env / -e   Runtime                           API keys, config
--build-arg  Build time                        Repo URLs, build modes
--secret     Build time (not stored in image)  Private repo tokens
# Runtime env vars (encrypted, injected when container runs)
hud deploy -e API_KEY=secret

# Build args (for Dockerfile ARG directives)
hud deploy --build-arg REPO_URL=https://github.com/org/repo

# Build secrets (for private repos, not baked into image)
hud deploy --secret id=GITHUB_TOKEN,env=GITHUB_TOKEN
See hud deploy reference for full details.

GitHub Auto-Deploy

For teams and CI/CD, connect a GitHub repository. HUD rebuilds automatically when you push:
  1. Go to hud.ai → New → Environment
  2. Click Connect GitHub and install the HUD GitHub App
  3. Select your repository and branch
  4. Push changes—rebuilds happen automatically
Connecting a GitHub repository
This is better for long-term projects because:
  • CI/CD integration: Rebuilds on every push to your branch
  • Team collaboration: Anyone with repo access can trigger deploys
  • Version history: See which commit each build came from
  • Rollback: Deploy previous commits if needed

Step 2: Sync Tasks

Push your local task definitions to a platform taskset:
# Sync from a tasks.py file
hud sync tasks my-taskset

# Sync from a tasks/ directory (spelling.py, counting.py, etc.)
hud sync tasks my-taskset tasks/

# Re-sync after changes — shows a diff
hud sync tasks
#   create: add-negative (new)
#   update: add-simple (args changed)
#   remote-only: old-task (exists on platform but not locally)
This creates a taskset called “my-taskset” on the platform, uploads your tasks, and stores the taskset ID locally in .hud/config.json. On subsequent runs, hud sync tasks re-syncs to the same taskset.

The sync is diff-aware: it compares local tasks against the platform by slug, creates new tasks, updates changed ones, and reports tasks that exist remotely but not locally (without deleting them). Any custom columns you defined on tasks sync automatically. Version control and task history are managed on the platform, so you always have a record of what changed. See the hud sync reference for full details on task discovery, diff behavior, and options.
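The diff behavior described above amounts to set logic over task slugs. A simplified sketch of that classification — our illustration, not hud sync’s actual implementation:

```python
def diff_tasks(local: dict[str, dict], remote: dict[str, dict]) -> dict[str, list[str]]:
    """Classify tasks by slug: create (local only), update (same slug, changed
    definition), remote-only (on the platform but not in the local files)."""
    return {
        "create": sorted(set(local) - set(remote)),
        "update": sorted(s for s in set(local) & set(remote) if local[s] != remote[s]),
        "remote-only": sorted(set(remote) - set(local)),  # reported, never deleted
    }

local = {"add-negative": {"args": {"a": -1}}, "add-simple": {"args": {"a": 2}}}
remote = {"add-simple": {"args": {"a": 1}}, "old-task": {"args": {}}}
print(diff_tasks(local, remote))
# {'create': ['add-negative'], 'update': ['add-simple'], 'remote-only': ['old-task']}
```

Because remote-only tasks are only reported, a sync can never destroy platform state — deletions stay a deliberate, manual action.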

Step 3: Run Remotely

hud eval my-taskset claude --remote --full
Both the agent and environment run on HUD infrastructure. No local compute, no local Docker. Results stream to the platform in real-time.
Flag            What it does
--remote        Run on HUD infrastructure instead of locally
--full          Run all tasks (without this, only the first task runs)
--group-size 3  Run each task 3 times (for variance estimation)
Monitor progress at hud.ai/jobs and view results on your taskset’s Leaderboard tab. Or click Run Taskset in the platform UI — select models, group size, and max steps, then launch.
Run taskset configuration modal
Results appear in real time on the taskset’s Leaderboard tab — rankings, success rates, and model comparisons.
Tasksets and leaderboards
See hud eval CLI reference for all options.

Working on the Platform

Once your taskset is live, the platform becomes your primary interface for managing evaluations, inspecting results, and iterating.

Taskset Management

Your taskset is at hud.ai/evalsets. You can create and edit tasks through the platform UI — useful for large-scale management, team collaboration, or one-off additions.
Creating tasks from scenarios
The platform UI and hud sync work together. Edit locally and sync up, or edit on the platform — both are valid. See Platform Tasksets for the full guide.

Inspecting Traces

After a run completes, click any task row in the Leaderboard to open the Trace Viewer. The trace shows:
  • Conversation — The full message history between the agent and environment
  • Tool calls — Every tool invocation with arguments and results
  • Reward — The final score and any subscore breakdown
  • LOGS tab — Container stdout/stderr from the environment
  • DEBUG tab — Orchestrator and worker logs for infrastructure issues
Use traces to understand why an agent scored the way it did. If it got stuck, the conversation shows where. If it failed silently, the logs show container errors.

Iterating

Once you have results, the iteration cycle is fast:
What changed                What to do
Task args or columns        hud sync tasks
Added/removed tasks         hud sync tasks
Grading logic or tools      hud deploy then re-run
Dockerfile or system deps   hud deploy --no-cache then re-run
Just want to re-run         hud eval my-taskset claude --remote --full
The fastest cycle when developing a single task:
  1. Edit scenario/grading locally
  2. task.run("claude-sonnet-4-5") — verify it works
  3. Deploy only if env code changed: hud deploy
  4. Sync only if tasks changed: hud sync tasks --task <slug>
  5. hud eval "My Taskset" claude --remote --task-ids my_task
  6. Check the trace, repeat

Debugging Zero Scores

When a task scores 0.0 remotely but works locally:
  1. Trace — Did the agent attempt the task or get stuck? If it never acted, the issue is the prompt or model, not grading.
  2. Logs — Check the LOGS and DEBUG tabs for container errors, missing deps, or grader failures.
  3. Grade locally — If it passes locally but fails remotely, the deployed environment diverged. Verify your latest build version matches .hud/deploy.json.

QA Workflows

After running evaluations, use QA workflows to automatically analyze traces — detect grading errors, classify failures, and flag reward hacking. Attach a QA workflow as a column on your taskset and every completed trace is analyzed automatically. HUD ships four standard QA workflows:
Workflow          What it detects
False Negative    Agent succeeded but grader scored it wrong
False Positive    Agent got credit without genuinely solving
Failure Analysis  Root cause classification (10 categories)
Reward Hacking    Agent gamed the evaluation mechanism
To add a QA column: open your taskset → Add Column → QA Workflow → pick a workflow. You can also build custom QA workflows — see QA Workflows for the full guide.
Remote runs use the HUD Gateway for model access. Store your provider API keys at hud.ai/project/api-keys (BYOK, lower credit cost) or use HUD Credits with pooled keys. Either way, you only need HUD_API_KEY — no provider-specific keys required.

Running Externally

Every HUD image supports scenario operations via hud scenario. Setup and grading are shell commands; agents interact with tools via the MCP server at :8080/mcp. This is the same interface used by Harbor-compatible benchmarks — converting an existing benchmark to HUD format produces exactly this structure.
The default Dockerfile CMD uses --stdio for the HUD platform. For external use, override the command to start an HTTP server:

With Docker

# Build and push the image to a registry
hud build .
docker tag my-env:latest <your-registry>/my-env:latest
docker push <your-registry>/my-env:latest

# Start the environment with HTTP server (overrides default stdio CMD)
docker run -d --name my-env -p 8080:8080 my-env:latest \
  hud dev env:env --port 8080

# List available scenarios
docker exec my-env hud scenario list

# Setup a scenario (prints the prompt)
docker exec my-env hud scenario setup count \
  --args '{"text": "strawberry", "letter": "r"}'

# Your agent runs against MCP tools at localhost:8080/mcp

# Grade (prints reward as JSON)
docker exec my-env hud scenario grade count --answer "3"

# Test graders without an agent (setup + grade in one shot)
docker exec my-env hud scenario run count \
  --args '{"text": "mississippi", "letter": "s"}' --answer "4"

With a Sandbox SDK (Python)

Any platform that can run a Docker image and exec into it works. For example, Daytona spins up HUD images as sandboxed workspaces:
import json
from daytona import Daytona, CreateSandboxFromImageParams

daytona = Daytona()
sandbox = daytona.create(CreateSandboxFromImageParams(
    image="my-env:latest",
    language="python",
))

# Setup — returns the prompt
prompt = sandbox.process.exec(
    'hud scenario setup count --args \'{"text": "strawberry", "letter": "r"}\''
).result

# Agent runs against MCP tools at the sandbox
# ... your agent loop here ...

# Grade
reward = json.loads(sandbox.process.exec(
    'hud scenario grade count --answer "3"'
).result)

daytona.delete(sandbox)
The same pattern works on Kubernetes (kubectl exec), E2B, Fly.io, or any platform that runs containers.

What’s Next

  • Environments as Data — Design environments that produce useful training signal
  • Platform Tasksets — Full taskset management guide
  • Sync Reference — Task sync details and diff behavior
  • hud eval Reference — All eval CLI options