HUD Documentation - Evaluations and RL Environments.

HUD

HUD is a platform for building environments. You define an environment, write tasks for that environment, and run any agent to perform those tasks, at any scale. Our SDK is an open-source Python framework for all of this. The workflow has five parts, covered in the sections below.

Define any environment

An environment is a closed container for your agent to act in. It is defined by three things:

the contents of the container (Environment)
the tasks (and their rewards) to be performed inside it (Tasks & Tasksets)
the capabilities the agent can use to perform these tasks (Capabilities)

The v6 SDK provides modular abstractions for each of these, so you can build on or reuse existing parts.

Part 1: Declare your environment

The first and most important part of any HUD workflow is declaring your environment in a declaration file, env.py. Here is a standard scaffold:

env.py

from hud.environment import Environment
from hud.capabilities import Capability
from hud.graders import LLMJudgeGrader

# VITAL: an env with at least one capability - this is what the agent connects to and drives
env = Environment(name="...", capabilities=[
    Capability.ssh(name="shell", url="<url>", host_pubkey="<key>"),  # a real shell over ssh
])

# OPTIONAL: lifecycle hooks - only if the task needs setup/teardown (fixtures, services, seed state)
@env.initialize               # runs once before serving
async def _up():
    ...                       # write fixtures, stand up services, etc.

@env.shutdown                 # runs on env.stop()
async def _down():
    ...

# VITAL: at least one task definition - prompts the agent and returns a reward
@env.template()               # one definition = a whole space of tasks
async def some_task_1(...):
    answer = yield "<prompt>"      # the prompt handed to the agent; the agent's answer comes back
    # ── everything the agent does happens here: it drives the capability until it's done ──
    result = await LLMJudgeGrader.grade(answer=answer, criteria=[...])   # score the result → reward
    yield result.value           # VITAL: the final yield is the reward

This scaffold is general on purpose: it describes any environment. A one-line shell task, a full GUI desktop, and a robot simulator are all just environments with some bespoke content, tasks, and associated capabilities. The complexity beneath this file lives in the HUD protocol. Its thin envelope lets any model or harness plug into any environment.
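The two-yield shape of a task definition (the first yield hands out the prompt, the final yield is the reward) can be illustrated without the SDK. The sketch below is plain Python and is not HUD's implementation: `word_count_task`, `run_rollout`, and `toy_agent` are hypothetical names, and how HUD actually drives a definition is an assumption here.

```python
import asyncio


async def word_count_task(text: str):
    """Hypothetical task: the first yield is the prompt, the final yield is the reward."""
    answer = yield f"How many words are in: {text!r}?"
    expected = len(text.split())
    yield 1.0 if answer.strip() == str(expected) else 0.0


async def toy_agent(prompt: str) -> str:
    """Stand-in agent that always answers "3"."""
    return "3"


async def run_rollout(task_gen, agent):
    """Drive one task generator: fetch the prompt, send the answer, read the reward."""
    prompt = await task_gen.__anext__()      # run the task up to its first yield
    answer = await agent(prompt)             # the agent does its work here
    reward = await task_gen.asend(answer)    # resume the task; it yields the reward
    await task_gen.aclose()                  # finish the generator cleanly
    return prompt, reward


prompt, reward = asyncio.run(run_rollout(word_count_task("one two three"), toy_agent))
print(prompt, reward)
```

The point of the pattern is that setup, prompting, and grading live in one function body, so the grader can see any local state created before the prompt was issued.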

Part 2: Choose your taskset

Next, form a taskset (one or more tasks with parameters) in code, or load one from a file.

tasks.py

from hud.eval import Taskset
from env import some_task_1, some_task_2

# VITAL: a named taskset of concrete tasks to evaluate (parametrize one definition into many)
TASKS = Taskset("my-taskset", [some_task_1(<args1>), some_task_1(<args2>), some_task_2(<args3>)])
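When one definition has to cover many argument combinations, the argument sets can be generated rather than written by hand. The following is a minimal sketch in plain Python, assuming a parameter grid is what you want. `TaskSpec` and `expand` are hypothetical helpers and are not part of the HUD SDK; each resulting spec would still be turned into a real task by calling the definition with those arguments.

```python
from dataclasses import dataclass
from itertools import product


@dataclass(frozen=True)
class TaskSpec:
    """Hypothetical record: which definition to run, and with which arguments."""
    template: str
    args: tuple  # ((name, value), ...) pairs


def expand(template: str, **grid: list) -> list:
    """Return one TaskSpec per combination of the given parameter values."""
    names = list(grid)
    return [
        TaskSpec(template, tuple(zip(names, values)))
        for values in product(*(grid[name] for name in names))
    ]


# 2 difficulties x 3 seeds = 6 concrete argument sets for one definition
specs = expand("some_task_1", difficulty=["easy", "hard"], seed=[0, 1, 2])
for spec in specs:
    print(spec.template, dict(spec.args))
```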

Spin it up anywhere

Once defined, an environment shouldn't care where it runs. It should just work. The SDK lets you switch between running your environment locally for development, running it on Daytona, Modal, or E2B for scale, and deploying it to the HUD platform. The environment definition never changes, only the Runtime you pass.

Part 3: Choose your runtime

There are two main ways to run your declared environments.

1. Package & deploy to the platform. Build a portable image once, push it to HUD, and run any tasks against it from the platform - compare models on a taskset and browse every trace, no local infra needed:

hud deploy                 # build + register your env image on HUD
hud sync tasks my-taskset  # publish a taskset to run from the platform

2. Run programmatically. Drive rollouts from Python by picking a runtime - the same taskset runs against any of them:

from hud.eval import LocalRuntime, DockerRuntime, ModalRuntime, HUDRuntime

LocalRuntime("env.py")     # local child process - fastest iteration
DockerRuntime("my-env")    # a fresh container per rollout
ModalRuntime("my-env")     # a Modal cloud sandbox per rollout
HUDRuntime()               # HUD's hosted infra (after `hud deploy`)

Evaluate and train any AI agent inside it

Since an environment only exposes capabilities, any agent plugs in. For standard models, the HUD inference gateway and our prebuilt harnesses let you switch between models like Claude, GPT, or Gemini just by choosing the model name. Run rollouts in parallel with full isolation out of the box. Every rollout in the job is traced on the platform, so you can see exactly what the agent did in real time and how it was graded.

Part 4: Run your agent

You can run this programmatically:

import asyncio

from hud.agents import create_agent
from hud.eval import LocalRuntime
from tasks import TASKS


async def main():
    agent = create_agent("claude-sonnet-4-5")               # routed through the HUD gateway
    job = await TASKS.run(agent, runtime=LocalRuntime("env.py"))   # start the run
    print(job.reward)


asyncio.run(main())   # a bare top-level `await` only works in a notebook or async REPL

or run it from the CLI:

hud eval env.py claude --group 3
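Assuming `--group 3` means three rollouts per task (which matches the "group per task" wording in Part 5), the general pattern of running grouped rollouts concurrently under a parallelism cap can be sketched in plain asyncio. Everything here is a stand-in: `fake_rollout`, `run_groups`, and `max_parallel` are hypothetical and say nothing about how HUD schedules work internally.

```python
import asyncio
import random


async def fake_rollout(task_id: str, seed: int) -> float:
    """Stand-in for one agent rollout; returns a reward in [0, 1)."""
    await asyncio.sleep(0)                       # yield control, as real I/O would
    return random.Random(f"{task_id}-{seed}").random()


async def run_groups(task_ids, group_size, max_parallel=4):
    """Run group_size rollouts per task, with at most max_parallel in flight."""
    limit = asyncio.Semaphore(max_parallel)

    async def bounded(task_id, seed):
        async with limit:                        # wait for a free slot
            return await fake_rollout(task_id, seed)

    async def one_group(task_id):
        return await asyncio.gather(*(bounded(task_id, s) for s in range(group_size)))

    groups = await asyncio.gather(*(one_group(t) for t in task_ids))
    return dict(zip(task_ids, groups))           # task id -> list of rewards


rewards = asyncio.run(run_groups(["task-a", "task-b"], group_size=3))
print(rewards)
```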

Part 5: Learn

The rewards can then be used for your training: run a group of rollouts per task and feed the spread of rewards within each group straight into your own GRPO/PPO loop, or into a stack like Tinker, slime, or Fireworks.
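As an illustration of what "feeding the spread" can mean, GRPO-style training commonly turns the rewards of one task's group into advantages by subtracting the group mean and dividing by the group standard deviation. The sketch below shows that normalization in plain Python. `group_advantages` and `eps` are hypothetical names, the choice of population standard deviation is an assumption, and your trainer may normalize differently.

```python
from statistics import mean, pstdev


def group_advantages(rewards: list, eps: float = 1e-6) -> list:
    """Center and scale the rewards of one task's group of rollouts."""
    mu = mean(rewards)
    sigma = pstdev(rewards)                      # population std over the group
    # eps keeps the division defined when every rollout got the same reward
    return [(r - mu) / (sigma + eps) for r in rewards]


# Four rollouts of the same task: two succeeded, two failed
advantages = group_advantages([1.0, 0.0, 0.0, 1.0])
print(advantages)
```

Rollouts that beat their group's average get a positive advantage and the rest get a negative one. A group with identical rewards contributes zero signal.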

Where to go next

To see what HUD hides under the hood, read about the Protocol. To go in depth on each part of the workflow, start with Environments.

Build

A high-level guide on how to work with HUD.

Reference

The actual object reference: classes, objects, and abstractions.
