Here’s the complete environment we’ll use throughout this page — a tool and a scenario in a single env.py:
```python
from hud import Environment

env = Environment("letter-counter")


@env.tool()
def count_letter(text: str, letter: str) -> int:
    """Count occurrences of a letter in text."""
    return text.lower().count(letter.lower())


@env.scenario("count")
async def count(word: str, letter: str):
    answer = yield f"How many '{letter}' in '{word}'?"
    correct = str(word.lower().count(letter.lower()))
    yield 1.0 if answer and correct in answer else 0.0
```
One tool, one scenario. The agent gets a counting question, can optionally call the count_letter tool, and gets scored on whether it answers correctly. Everything below builds on this.
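The scoring check at the end of the scenario is ordinary Python: compute the correct count, stringify it, and test whether it appears anywhere in the agent's answer. For example:

```python
# Reproduce the scenario's scoring logic for one question.
word, letter = "strawberry", "r"
correct = str(word.lower().count(letter.lower()))  # "3"
answer = "There are 3 occurrences."
reward = 1.0 if answer and correct in answer else 0.0
print(correct, reward)  # → 3 1.0
```

Substring matching keeps grading forgiving: the agent can answer in a full sentence as long as the right number appears somewhere in it.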
A task is a scenario instantiated with specific arguments. Define them in a tasks.py file using scenario.task(). Each task needs a unique slug — a stable, kebab-case identifier used for syncing, filtering with --task-ids, and matching across local/remote:
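A tasks.py for the counting scenario might look like the sketch below. It assumes scenario.task() takes the slug first, followed by the scenario's keyword arguments — check the SDK reference for the exact signature:

```python
# tasks.py — sketch only; the task() signature is assumed, not confirmed.
from env import count

# Each call pins the scenario to concrete arguments under a stable slug.
strawberry_r = count.task("strawberry-r", word="strawberry", letter="r")
abracadabra_a = count.task("abracadabra-a", word="abracadabra", letter="a")
```

The slugs here ("strawberry-r", "abracadabra-a") are the same identifiers you later pass to --task-ids.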
Tasks group into tasksets — batches of related tasks used for benchmarking. Create a taskset, add tasks with different arguments, and run the whole set across models to compare performance.
Renaming a slug creates a new task on the platform (the old one remains). Choose slugs carefully.
Tasks can carry custom metadata via columns. Columns show up as filterable fields on the platform and as prefixed headers in CSV exports (col:category, col:complexity, etc.):
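As a sketch of what attaching columns to a task could look like — the columns= keyword below is a hypothetical parameter name for illustration, not the confirmed API:

```python
# Sketch only; `columns=` is an assumed parameter name.
from env import count

strawberry_r = count.task(
    "strawberry-r",
    word="strawberry",
    letter="r",
    # These values would surface as col:category / col:complexity in CSV exports.
    columns={"category": "fruit", "complexity": "easy"},
)
```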
When syncing, the CLI auto-infers column types (text, number, multi-select) from the values across all tasks and merges them into the taskset’s column schema. Columns already defined on the platform are preserved — sync only adds new columns and expands select options.
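A rough model of what "auto-infers column types from the values across all tasks" could mean — an illustrative sketch, not the CLI's actual code:

```python
def infer_column_type(values):
    """Guess a column type from the values a column takes across tasks.
    Illustrative sketch only; the hud CLI's real inference may differ."""
    if values and all(
        isinstance(v, (int, float)) and not isinstance(v, bool) for v in values
    ):
        return "number"
    if values and all(isinstance(v, list) for v in values):
        return "multi-select"
    return "text"


print(infer_column_type([1, 2, 3.5]))          # → number
print(infer_column_type([["a"], ["b", "c"]]))  # → multi-select
print(infer_column_type(["easy", "hard"]))     # → text
```

The merge behavior described above follows from this: once every task's value for a column is seen, the broadest matching type wins, and new select options only ever extend the existing schema.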
Both hud eval and hud sync can point at the tasks/ directory and will discover all task files automatically. See how tasks are discovered for the full resolution order and advanced patterns. For validation sequences and prompt overrides, see the hud sync reference.
For running all your tasks at once. Everything runs in-process — no Docker, no server, just Python:
```shell
# Run first task
hud eval tasks.py

# Run all tasks
hud eval tasks.py --full

# Run specific tasks by slug
hud eval tasks.py --task-ids strawberry-r,abracadabra-a

# Run with a specific model
hud eval tasks.py claude --full

# Run from a tasks directory
hud eval tasks/ claude --full

# Run each task 3 times for variance estimation
hud eval tasks/ claude --full --group-size 3
```
hud eval prints a reward distribution summary after each run so you can see how the taskset is performing at a glance:
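The summary boils down to simple statistics over per-task rewards. As a plain-Python illustration of the kind of numbers it reports (not hud eval's actual output format):

```python
from statistics import mean

# One reward per task run — example data, not real eval output.
rewards = [1.0, 1.0, 0.0, 1.0, 0.0]
summary = {
    "runs": len(rewards),
    "mean": mean(rewards),
    "solved": sum(r == 1.0 for r in rewards),
    "failed": sum(r == 0.0 for r in rewards),
}
print(summary)  # → {'runs': 5, 'mean': 0.6, 'solved': 3, 'failed': 2}
```

With --group-size 3, each task contributes three rewards, so the spread of these numbers doubles as a rough variance estimate.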
Your coding agent can call your tools directly. Edit watched paths (-w), save, and the controller reloads automatically. This is great for developing and debugging individual scenarios interactively.

The env:env syntax works like uvicorn's module:attribute: it tells hud dev to import env.py and run the env object as an MCP server.

If you have a Dockerfile in your project root, hud dev detects it automatically and runs in Docker mode, building the image and starting the container with hot-reload on watched paths.
For environments that need system dependencies (PostgreSQL, browsers, VNC, GPU libraries). Build the image, then connect to it from a test script:
```shell
hud build .
```
```python
from hud import Environment

env = Environment("my-env")
env.connect_image("my-env:latest")

result = await env("checkout", product="laptop").run("claude-sonnet-4-5")
```
connect_image spins up the container, connects to it via MCP, and tears it down when done. Your tools run inside the container, where the system dependencies live; your test script runs outside.

Note: hud eval tasks.py imports your env.py directly (in-process). For Docker environments, write a separate test script that uses connect_image as shown above, or use hud dev for interactive Docker development: