You have an environment with tools and scenarios. Now turn scenarios into runnable tasks, test them locally, and iterate.
The Sample Environment
Here’s the complete environment we’ll use throughout this page, a tool and a scenario in a single env.py:
The agent uses the count_letter tool and gets scored on whether it answers correctly. Everything below builds on this.
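As a framework-free illustration of the logic described here, the sketch below implements a letter-counting helper and a scoring rule in plain Python. It assumes case-insensitive matching and a reward of 1.0 or 0.0. `score_answer` is a hypothetical name, and the hud tool and scenario registration that env.py would contain is deliberately left out.

```python
import re


def count_letter(text: str, letter: str) -> int:
    """Count case-insensitive occurrences of a single letter in text."""
    return text.lower().count(letter.lower())


def score_answer(answer: str, word: str, letter: str) -> float:
    """Return 1.0 if the answer mentions the correct count, else 0.0.

    Hypothetical scoring rule: compare whole integers found in the answer,
    so that "13" does not count as a match for 3.
    """
    expected = count_letter(word, letter)
    found = {int(n) for n in re.findall(r"\d+", answer)}
    return 1.0 if expected in found else 0.0
```

Comparing whole integers rather than substrings is one possible design choice; a stricter scorer could require the answer to contain exactly one number.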
Defining Tasks
A task is a scenario instantiated with specific arguments. Define tasks in a tasks.py file using scenario.task(). Each task needs a unique slug (a stable, kebab-case identifier) that is used for syncing, filtering with --task-ids, and matching across local/remote:
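The slug requirements can be checked before syncing. The following is a minimal sketch in plain Python, not the hud API: `TaskSpec`, `validate_slugs`, and the example slugs and arguments are hypothetical stand-ins that only model "a scenario plus arguments plus a unique kebab-case slug".

```python
import re
from dataclasses import dataclass, field
from typing import Any

# Lowercase alphanumeric words joined by single hyphens, e.g. "count-r-strawberry".
SLUG_PATTERN = re.compile(r"^[a-z0-9]+(?:-[a-z0-9]+)*$")


@dataclass
class TaskSpec:
    """Hypothetical stand-in for a task: a scenario name plus arguments."""
    slug: str
    scenario: str
    args: dict[str, Any] = field(default_factory=dict)


def validate_slugs(tasks: list[TaskSpec]) -> None:
    """Raise ValueError on a non-kebab-case or duplicate slug."""
    seen: set[str] = set()
    for task in tasks:
        if not SLUG_PATTERN.fullmatch(task.slug):
            raise ValueError(f"slug is not kebab-case: {task.slug!r}")
        if task.slug in seen:
            raise ValueError(f"duplicate slug: {task.slug!r}")
        seen.add(task.slug)


tasks = [
    TaskSpec("count-r-strawberry", "count", {"word": "strawberry", "letter": "r"}),
    TaskSpec("count-s-mississippi", "count", {"word": "mississippi", "letter": "s"}),
]
validate_slugs(tasks)  # passes: both slugs are unique and kebab-case
```

Keeping slugs stable matters more than making them pretty: renaming a slug makes the task look like a new one to anything that matches on it.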
Columns
Tasks can carry custom metadata via columns. Columns show up as filterable fields on the platform and as prefixed headers in CSV exports (col:category, col:complexity, etc.):
Sync infers each column’s type (text, number, multi-select) from the values across all tasks and merges them into the taskset’s column schema. Columns already defined on the platform are preserved; sync only adds new columns and expands select options.
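The merge behaviour described above can be pictured with a small sketch. The inference rules here are assumptions made for illustration (any list value means multi-select, all-numeric values mean number, everything else is text), and `infer_column_type` and `merge_schema` are hypothetical helpers, not part of hud.

```python
from typing import Any


def infer_column_type(values: list[Any]) -> str:
    """Guess a column type from every value seen for it (assumed rules)."""
    if any(isinstance(v, (list, tuple)) for v in values):
        return "multi-select"
    if values and all(
        isinstance(v, (int, float)) and not isinstance(v, bool) for v in values
    ):
        return "number"
    return "text"


def merge_schema(
    existing: dict[str, dict], task_columns: list[dict[str, Any]]
) -> dict[str, dict]:
    """Add new columns and widen select options; never change an existing type."""
    by_name: dict[str, list[Any]] = {}
    for columns in task_columns:
        for name, value in columns.items():
            by_name.setdefault(name, []).append(value)

    merged = {name: dict(spec) for name, spec in existing.items()}
    for name, values in by_name.items():
        if name not in merged:
            merged[name] = {"type": infer_column_type(values)}
        spec = merged[name]
        if spec["type"] == "multi-select":
            options = list(spec.get("options", []))
            for value in values:
                items = value if isinstance(value, (list, tuple)) else [value]
                for item in items:
                    if item not in options:
                        options.append(item)
            spec["options"] = options
    return merged


existing = {"category": {"type": "text"}}
task_columns = [
    {"category": "counting", "complexity": 1, "tags": ["easy", "letters"]},
    {"category": "counting", "complexity": 3, "tags": ["hard"]},
]
schema = merge_schema(existing, task_columns)
```

In this sketch the existing `category` column is left untouched, `complexity` is added as a number, and `tags` becomes a multi-select whose options are the union of all values seen.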
Structuring Task Files
For small sets, a single tasks.py works. For larger sets, organize tasks into a tasks/ directory with one file per category:
hud eval and hud sync can point at the tasks/ directory and will discover all task files automatically. See how tasks are discovered for the full resolution order and advanced patterns.
For validation sequences and prompt overrides, see the hud sync reference.
Running Locally
Quick Run — task.run()
The simplest way to run a single task. One line:
Batch Eval — hud eval
For running all your tasks at once. Everything runs in-process — no Docker, no server, just Python:
hud eval prints a reward distribution summary after each run so you can see how the taskset is performing at a glance:
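To make the idea of a reward distribution concrete, here is a rough sketch of how such a summary could be computed from a list of per-task rewards. It assumes rewards fall in the range 0.0 to 1.0 and uses five equal-width bins; the function names and the text layout are hypothetical and do not reproduce the actual hud eval output.

```python
from statistics import mean


def reward_histogram(rewards: list[float], bins: int = 5) -> list[int]:
    """Count rewards per equal-width bin over [0, 1]; 1.0 lands in the last bin."""
    counts = [0] * bins
    for reward in rewards:
        index = min(max(int(reward * bins), 0), bins - 1)
        counts[index] += 1
    return counts


def format_summary(rewards: list[float], bins: int = 5) -> str:
    """Render the mean plus one text bar per bin."""
    if not rewards:
        return "no rewards recorded"
    lines = [f"n={len(rewards)} mean={mean(rewards):.2f}"]
    for i, count in enumerate(reward_histogram(rewards, bins)):
        low, high = i / bins, (i + 1) / bins
        lines.append(f"{low:.1f}-{high:.1f} | {'#' * count} {count}")
    return "\n".join(lines)


print(format_summary([0.0, 0.5, 1.0, 1.0]))
```

A distribution that is all zeros or all ones usually says more about the taskset (too hard, too easy, or a broken scorer) than about the agent.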
Interactive — hud dev
Spawn your environment as an MCP server and connect from Cursor, Claude Code, or any MCP client:
Edit a file under a watched path (-w), save, and the controller reloads automatically. Great for developing and debugging individual scenarios interactively.
The env:env syntax is like uvicorn — module:attribute. It tells hud dev to import env.py and run the env object as an MCP server.
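The module:attribute convention is typically resolved along the following lines. This is a sketch using only the standard library to illustrate the convention in general; `resolve_target` is a hypothetical helper and is not a description of how hud dev is implemented.

```python
import importlib
from typing import Any


def resolve_target(spec: str) -> Any:
    """Import the module before the colon and return the named attribute."""
    module_name, sep, attribute = spec.partition(":")
    if not sep or not module_name or not attribute:
        raise ValueError(f"expected 'module:attribute', got {spec!r}")
    module = importlib.import_module(module_name)
    return getattr(module, attribute)


# Demonstrated with a standard-library module so the example runs anywhere.
dumps = resolve_target("json:dumps")
print(dumps({"ok": True}))  # prints {"ok": true}
```

Under this convention, `env:env` reads as "import the module named env, then take its attribute named env", which only works if env.py is importable from the current working directory.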
If you have a Dockerfile in your project root, hud dev automatically detects it and runs in Docker mode — building the image and starting the container with hot-reload on watched paths.
Docker — hud build + connect_image
For environments that need system dependencies (PostgreSQL, browsers, VNC, GPU libraries). Build the image, then connect to it from a test script:
connect_image spins up the container, connects via MCP, and tears it down when done. Your tools run inside the container where the system deps live; your test script runs outside.
Note: hud eval tasks.py imports your env.py directly (in-process). For Docker environments, write a separate test script that uses connect_image as shown above, or use hud dev for interactive Docker development:
Debugging Docker Builds
When something goes wrong with your container, use hud debug:
Custom Agent Loop
Build your own agent loop using the format converters. See Integrations for OpenAI, Anthropic, LangChain, and more:
When to Use What
| Mode | System deps? | Speed | Use case |
|---|---|---|---|
| task.run() | No | Fastest | Single task, quick iteration |
| hud eval | No | Fastest | Batch eval, pure Python envs |
| hud dev | Optional | Fast (hot-reload) | Interactive development, single scenario |
| hud build + connect_image | Yes | Slower (container) | Databases, browsers, GPU, full integration |
| Custom agent loop | No | Varies | When you need full control |
What’s Next
Deploy & Go Remote
Deploy your environment, sync to platform, run evaluations remotely
Environments as Data
Design environments that produce useful training signal