env.py file, registering tasks
and capabilities, doing quick local tests, and deploying the environment to our platform.
Start
Run hud init in your terminal. It lets you pick a template to start from; the default is a blank template.
In my-env you now have four files, each with one job:
my-env/
├── env.py            the environment: object, capabilities, and task definitions
├── tasks.py          the concrete tasks to run, filled in with real arguments
├── Dockerfile.hud    how the environment is packaged into an image on deploy
└── pyproject.toml    the Python dependencies
The env.py file
env.py is the only file that defines behavior. In this section we’ll go through each part of a full example (see it in full below).
It’s an environment that gives the agent a sandboxed shell and file system, asks it to count a letter in a sentence, and scores the answer.
Full env.py example
We’ll use the following env.py example in this section.
1 · Imports and the environment object
env = Environment(name="counter") creates the environment - the central object everything else
attaches to. Capabilities, lifecycle hooks, and tasks all register against env; it holds nothing on
its own. The name identifies it - how a task finds its way back to this definition, and what it’s
published under when you deploy.
2 · Capabilities
env.workspace(ROOT) attaches a capability: a sandboxed shell and file system the agent reaches
over ssh, rooted at the workspace/ directory next to env.py.
Capabilities
A capability is a connection the agent drives, and it’s what keeps an environment portable: you declare the connection, and any agent that speaks the protocol drives it with its own tools. Above, env.workspace(ROOT) attached a single ssh capability - the common case. An environment isn’t
limited to one, though: publish as many as the tasks need, and the agent sees every one. You attach them
three ways:
- in the Environment(...) constructor, for a service that already exists,
- with env.workspace(root), the one-line shortcut for the typical shell case above,
- with env.add_capability(...) from a hook, for a service the environment runs itself.
In the last case, an @env.initialize hook starts the service, waits for it to bind, then publishes its address.
Other capabilities, such as a browser over cdp, a full desktop over rfb, or a robot loop, attach the same way.
The capabilities page has the spin-up for each, and the
protocol covers how an agent discovers them over the wire.
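The start, wait-for-bind, publish sequence can be sketched in plain Python without the HUD library. This is only an illustration under stated assumptions: the `published` dict is a hypothetical stand-in for the point where a real hook would call env.add_capability(...), and the echo service and its `tcp://` address format are invented for the example.

```python
import asyncio

# Hypothetical stand-in for the environment's capability registry; a real
# env.py would call env.add_capability(...) at this point instead.
published: dict[str, str] = {}


async def handle(reader: asyncio.StreamReader, writer: asyncio.StreamWriter) -> None:
    """Echo one line back to the client, then close the connection."""
    line = await reader.readline()
    writer.write(line)
    await writer.drain()
    writer.close()
    await writer.wait_closed()


async def start_service() -> asyncio.AbstractServer:
    """Start the service, wait for it to bind, then publish its address."""
    # Port 0 asks the OS for a free port; start_server returns once bound.
    server = await asyncio.start_server(handle, "127.0.0.1", 0)
    host, port = server.sockets[0].getsockname()[:2]
    published["echo"] = f"tcp://{host}:{port}"
    return server


async def demo() -> bytes:
    """Bring the service up, make one round trip as a client, tear it down."""
    server = await start_service()
    host, port = server.sockets[0].getsockname()[:2]
    reader, writer = await asyncio.open_connection(host, port)
    writer.write(b"ping\n")
    await writer.drain()
    reply = await reader.readline()
    writer.close()
    await writer.wait_closed()
    server.close()
    return reply
```

Publishing only after the bind has completed means a client that reads the address can connect immediately, which is the property the hook ordering is meant to give an agent.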
3 · Lifecycle hooks
@env.initialize runs once before the env serves - the place to seed files or start services - and
@env.shutdown runs on teardown, after the run finishes, to release them. Both are optional; here
they’re dummy hooks for demonstration, writing and removing a file.
Lifecycle hooks
When a task needs state - seeded files, a running service, a browser - you bring it up in @env.initialize and release it in @env.shutdown. @env.initialize runs once before the env accepts
any connection, so by the time an agent connects, everything it needs is already in place. Because the
env.workspace(...) hook is registered first, the workspace directory already exists when your own
hook runs, so you can write straight into ROOT.
Hooks run per environment, not per task, so use them for state every task shares. The environment
reference covers the full lifecycle.
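The ordering described above can be illustrated with a toy registry. ToyEnvironment is a hypothetical stand-in written for this sketch, not the HUD Environment class; it assumes initialize hooks run in registration order before serving and that shutdown hooks run on teardown even if serving fails.

```python
import asyncio
from typing import Awaitable, Callable

Hook = Callable[[], Awaitable[None]]


class ToyEnvironment:
    """Toy stand-in for an environment; not the HUD Environment class."""

    def __init__(self) -> None:
        self._init_hooks: list[Hook] = []
        self._shutdown_hooks: list[Hook] = []

    def initialize(self, fn: Hook) -> Hook:
        """Decorator: register a hook to run before serving."""
        self._init_hooks.append(fn)
        return fn

    def shutdown(self, fn: Hook) -> Hook:
        """Decorator: register a hook to run on teardown."""
        self._shutdown_hooks.append(fn)
        return fn

    async def run(self, serve: Hook) -> None:
        """Run init hooks in registration order, serve, then always tear down."""
        for hook in self._init_hooks:
            await hook()
        try:
            await serve()
        finally:
            for hook in self._shutdown_hooks:
                await hook()


env = ToyEnvironment()
events: list[str] = []


@env.initialize
async def make_workspace() -> None:
    events.append("workspace created")


@env.initialize
async def seed_files() -> None:
    # Registered second, so the workspace already exists at this point.
    events.append("files seeded")


@env.shutdown
async def cleanup() -> None:
    events.append("cleaned up")


async def serve_tasks() -> None:
    events.append("serving")


asyncio.run(env.run(serve_tasks))
print(events)
```

Because the hooks are registered once on the environment, every task served in between sees the same seeded state.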
4 · Tasks
@env.template(id="count") defines the task. The first yield sends the prompt and pauses; when the
agent’s answer arrives the function resumes, and the second yield returns the reward - 1.0 if the
count matches, else 0.0.
Tasks
A task is HUD’s unit of evaluation: a prompt the agent attempts and a reward that scores how it did. You write one as an async generator with exactly two yields:
- The first yield returns the prompt and pauses the function while the agent works in the environment, driving whatever capabilities it was given. Its answer comes back into answer.
- The second yield returns the score - a reward from 0.0 to 1.0. This is where grading goes: compare answer inline, or call a grader to judge what the agent left behind.
@env.template() registers it as a template, not a single task: its parameters fill in per run, so
one definition describes a whole space of tasks. Calling it - count(sentence="strawberry", letter="r") - mints one concrete, runnable task.
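The two-yield shape is ordinary Python, so it can be tried without the HUD library. In the sketch below the generator is left undecorated and `run_once` is a hypothetical driver that plays the role of the harness; the prompt wording and the string comparison are assumptions made for the example.

```python
import asyncio
from typing import AsyncGenerator


async def count(sentence: str, letter: str) -> AsyncGenerator[object, str]:
    """Two-yield task: first yield the prompt, then yield the reward."""
    # The first yield hands out the prompt and pauses until an answer is sent in.
    answer = yield f"How many times does '{letter}' appear in '{sentence}'?"
    expected = sentence.count(letter)
    # The second yield is the reward: 1.0 on a match, else 0.0.
    yield 1.0 if answer.strip() == str(expected) else 0.0


async def run_once(task: AsyncGenerator[object, str], agent_answer: str) -> tuple[str, float]:
    """Hypothetical driver: play the role of the harness for one run."""
    prompt = await task.asend(None)          # advance to the first yield
    reward = await task.asend(agent_answer)  # resume with the agent's answer
    await task.aclose()
    return prompt, reward


if __name__ == "__main__":
    prompt, reward = asyncio.run(run_once(count("strawberry", "r"), "3"))
    print(prompt, reward)
```

Sending the answer with asend is what makes the value appear on the left-hand side of the first yield, which is why the function can keep its setup, prompt, and grading in one place.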
Grading
The second yield is the reward, a number from 0.0 to 1.0. Here it’s plain Python: compare the
agent’s answer to the count Python computes. You have three options as tasks grow:
| Approach | What it does |
|---|---|
| Plain Python | Compute the answer inline, like above. Comparison helpers (exact_match, numeric_match, …) live in hud.graders. |
| Async graders | BashGrader scores a shell command by exit code; LLMJudgeGrader scores against rubric criteria. This is how you grade the state the agent left behind - a test passing, a file written. |
| Composed graders | combine runs several at once and weights them into one reward, each subscore visible in the trace. |
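To make the idea of weighting several subscores into one reward concrete, here is a minimal sketch in plain Python. `combine_weighted` is a hypothetical helper written for this example and is not the signature of combine in hud.graders; it assumes the reward is a weighted mean clamped to the 0.0 to 1.0 range.

```python
def combine_weighted(subscores: dict[str, tuple[float, float]]) -> float:
    """Weighted mean of subscores given as name -> (score, weight)."""
    total_weight = sum(weight for _, weight in subscores.values())
    if total_weight <= 0:
        raise ValueError("weights must sum to a positive number")
    reward = sum(score * weight for score, weight in subscores.values()) / total_weight
    # Clamp so the result is always a valid reward.
    return min(1.0, max(0.0, reward))


if __name__ == "__main__":
    # Tests pass (weight 3), style check fails (weight 1).
    print(combine_weighted({"tests": (1.0, 3.0), "style": (0.0, 1.0)}))
```

Keeping the named subscores alongside the final number is what lets each one be shown separately when reviewing a run.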
An async grader is an awaited call whose result you yield - BashGrader, for instance, runs a command
in the workspace and scores by exit code.
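Scoring by exit code can be sketched with the standard library alone. `grade_by_exit_code` below is a hypothetical stand-in, not BashGrader itself: it assumes a reward of 1.0 for exit status 0 and 0.0 otherwise, and it takes an argument list rather than a shell string so the example does not depend on bash.

```python
import asyncio
import sys


async def grade_by_exit_code(*argv: str, timeout: float = 30.0) -> float:
    """Return 1.0 if the command exits with status 0, else 0.0."""
    proc = await asyncio.create_subprocess_exec(
        *argv,
        stdout=asyncio.subprocess.DEVNULL,
        stderr=asyncio.subprocess.DEVNULL,
    )
    try:
        code = await asyncio.wait_for(proc.wait(), timeout)
    except asyncio.TimeoutError:
        # A command that hangs counts as a failure.
        proc.kill()
        await proc.wait()
        return 0.0
    return 1.0 if code == 0 else 0.0


if __name__ == "__main__":
    passed = asyncio.run(grade_by_exit_code(sys.executable, "-c", "raise SystemExit(0)"))
    print(passed)
```

Inside a task, the awaited result would be the value handed to the second yield, so the grade reflects the state the agent left behind instead of the text of its answer.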
Listing tasks in tasks.py
env.py defines the shape of a task. tasks.py lists the concrete ones to run by calling the template
with real arguments.
Testing the environment locally
Run an agent against the tasks. hud eval spawns the environment locally, runs the agent, grades each
task, and prints the reward.
The agent needs a model provider key (ANTHROPIC_API_KEY, …), or a HUD_API_KEY to route through the HUD gateway instead.
This runs on macOS, Windows, and Linux, but ssh sandbox isolation is Linux-only and BashGrader needs bash (it scores 0.0 on native Windows). Off Linux, get full behavior via --runtime hud (see the next guide for running), WSL2, or the built image (Dockerfile.hud).
hud eval spawns a fresh env per run; to keep one serving locally for repeated runs or manual inspection, use hud serve.
Deploying to the platform
When it works locally, ship it. Deploying talks to the platform, so set your API key first.
hud deploy builds the image from Dockerfile.hud and publishes the environment under the name given in Environment(name=...).
hud sync uploads a taskset, sending only what changed.
Export to other formats
A HUD environment isn’t locked to HUD: its tasks export to other benchmark formats, like self-contained Harbor task folders.
See environment examples
For more to build on, pick a different starting point with hud init, or read the
cookbooks for worked environments like a coding agent and an ops diagnostics task.