HUD Documentation - Evaluations and RL Environments.

This page reviews HUD-specific core concepts and overviews a full workflow. We’ll go through the workflow and introduce all the core concepts as they arise. We’ll include some code snippets, but for specific guides on the main uses see Guides. The workflow can be split into 5 steps:

1 · Declare your environment

In HUD any workflow starts with creating an environment. An environment is some closed container for your agent to act in. Fundamentally it’s defined by:

the contents of the container like files or environment state
the tasks to be performed inside it
the grading mechanism associated with each task
the capabilities the agent can use to perform these tasks

HUD maps each of these abstractions to a core concept you can compose and reuse:

Concept	What it is in HUD
Environment	The closed container the agent acts in - its contents, state, and lifecycle.
Tasks & Tasksets	A single task bound to an environment, each with its own prompt; bundle many task instances into a taskset.
Graders	The mechanism associated with a specific task that scores an attempt and turns it into a reward.
Capabilities	The interfaces the agent drives to act - shell, browser, screen, robot.

All of these abstractions appear in the env.py file - the central declarative file that describes everything there is to a HUD environment. For a dedicated overview, see our guide on creating environments.

Part 1: Declare your environment

The first and key part of any HUD workflow is declaring your environment in a declaration file env.py - here is a standard scaffold:

env.py

from hud.environment import Environment
from hud.capabilities import Capability
from hud.graders import LLMJudgeGrader

# VITAL: an env with at least one capability - this is what the agent connects to and drives
env = Environment(name="...", capabilities=[
    Capability.ssh(name="shell", url="<url>", host_pubkey="<key>"),  # a real shell over ssh
])

# OPTIONAL: lifecycle hooks - only if the task needs setup/teardown (fixtures, services, seed state)
@env.initialize               # runs once before serving
async def _up():
    ...                       # write fixtures, stand up services, etc.

@env.shutdown                 # runs on env.stop()
async def _down():
    ...

# VITAL: at least one task definition - prompts the agent and returns a reward
@env.template()               # one definition = a whole space of tasks
async def some_task_1(...):
    answer = yield "<prompt>"      # the prompt handed to the agent; the agent's answer comes back
    # ── everything the agent does happens here: it drives the capability until it's done ──
    result = await LLMJudgeGrader.grade(answer=answer, criteria=[...])   # score the result → reward
    yield result.value           # VITAL: the final yield is the reward

This scaffold is general on purpose - it describes any environment. A one-line shell task, a full GUI desktop, a robot simulator - they’re all just environments with some bespoke content, tasks, and associated capabilities. The complexity hidden under this file is hidden in the HUD protocol Its thin envelope lets any model or harness plug into any environment.

2 · Choose your taskset

Once an environment is defined or chosen, the next part is to simply select the set of tasks to use on that environment for evaluation. The core abstraction for this in HUD is the Taskset.

Part 2: Choose your taskset

To form a taskset (one or more tasks with parameters) do this directly in code by importing from env.py or load them from a file. HUD provides various ways to load, select, and run tasks. For a dedicated overview see our guide on evaluating agents.

tasks.py

from hud.eval import Taskset
from env import some_task_1, some_task_2

# VITAL: a named taskset of concrete tasks to evaluate (parametrize one definition into many)
TASKS = Taskset("my-taskset", [some_task_1(<args1>), some_task_1(<args2>), some_task_2(<args3>)])

3 · Choose your runtime

Any kind of environment needs to actually run somewhere. An environment shouldn’t care where it runs - it should just work. HUD lets you run agent evaluations by deploying your environment to our platform on hud.ai. For more customizability and local development, however, we use the Runtime.

Concept	What it is in HUD
Runtime	Where an environment runs - locally, on a third-party provider, or the HUD platform - selected without changing the environment definition.

HUD provides you freedom to flexibly switch between running your environment locally for fast environment development, on third-party runtime providers like Daytona, Modal, or E2B for scale, or deploy to the HUD platform. The environment definition never changes - just the Runtime you pass.

Part 3: Choose your runtime

There are three ways to run your declared environments. The main distinction is simply where your environment file lives - on your local drive, or packaged and deployed to the HUD platform.1. From the CLI with hud eval (preferred). Point it at your on-disk env.py (or tasks.py) and choose where each rollout runs with --runtime:

hud eval env.py claude                  # run locally against your on-disk env
hud eval env.py claude --runtime hud    # same env, executed on HUD's hosted infra

2. From a script. The same eval embedded in Python when you want programmatic control - pick a runtime and run a taskset against it:

from hud.eval import LocalRuntime, DockerRuntime, ModalRuntime, HUDRuntime

LocalRuntime("env.py")     # local child process - fastest iteration
DockerRuntime("my-env")    # a fresh container per rollout
ModalRuntime("my-env")     # a Modal cloud sandbox per rollout
HUDRuntime()               # HUD's hosted infra (after `hud deploy`)

3. Deploy to the platform. Build a portable image once and push it to HUD - now your environment lives remotely, so you can run tasksets from the platform, compare models, and browse every trace with no local infra:

hud deploy                 # build + register your env image on HUD
hud sync tasks my-taskset  # publish a taskset to run from the platform

4 · Run your agent

The next step is to choose the agent you want to evaluate. For standard models like Claude, GPT, or Gemini our prebuilt harnesses and our optional inference gateway let you switch between models just by choosing their name. Running the agent evaluation produces a run (one rollout). Everry run is recorded into a trace - a full, replayable timeline of everything the agent did and how it was graded. Running a whole taskset bundles all the runs into a single job. HUD enables executing runs in parallel with full isolation out of the box, and every run is traced on the platform, so you can see exactly what the agent did in realtime.

Concept	What it is in HUD
Agent	A model paired with a harness that drives it - plugged into an environment’s capabilities to attempt tasks and be graded.
Run	A single rollout - one agent attempting one task - that produces a reward.
Trace	The full, replayable timeline of a run: every action, observation, and grade.
Job	A collection of runs grouped together - e.g. a whole taskset evaluated by an agent.

Part 4: Run your agent

You can run this programmatically:

from hud.agents import create_agent
from hud.eval import LocalRuntime
from tasks import TASKS

agent = create_agent("claude-sonnet-4-5")               # routed through the HUD gateway

job = await TASKS.run(agent, runtime=LocalRuntime("env.py"))   # start the run
print(job.reward)

or run it from the CLI:

hud eval env.py claude --group 3

5 · Learn

With runs in hand, you can learn from their signals - evaluating a model, benchmarking it against others, or training it to improve. For training, the training client turns the rewards from your runs into model updates you can plug straight into your RL stack.

Concept	What it is in HUD
Training Client	Drives managed training for a model - turning the rewards collected from runs into gradient updates you feed into your RL loop.

Part 5: Learn

HUD can directly provide rewards for your runs based on the grader. These can then be used for your training: run a group per task and feed the spread straight into your own GRPO/PPO loop - or a stack like Tinker, slime, or Fireworks.

Where to next

Go hands-on with the Guides, or dig into the full Reference.

​1 · Declare your environment

​2 · Choose your taskset

​3 · Choose your runtime

​4 · Run your agent

​5 · Learn