1 · Declare your environment
In HUD, any workflow starts with creating an environment. An environment is a closed container for your agent to act in. Fundamentally, it’s defined by:
- the contents of the container, like files or environment state
- the tasks to be performed inside it
- the grading mechanism associated with each task
- the capabilities the agent can use to perform these tasks
| Concept | What it is in HUD |
|---|---|
| Environment | The closed container the agent acts in - its contents, state, and lifecycle. |
| Tasks & Tasksets | A task is a single unit of work bound to an environment, with its own prompt; a taskset bundles many task instances together. |
| Graders | The mechanism associated with a specific task that scores an attempt and turns it into a reward. |
| Capabilities | The interfaces the agent drives to act - shell, browser, screen, robot. |
The env.py file is the central declarative file that describes
everything there is to a HUD environment. For a dedicated overview, see our guide on creating environments.
Part 1: Declare your environment
The first and key part of any HUD workflow is declaring your environment
in a declaration file, env.py, built from a standard scaffold. This scaffold is general on purpose - it describes
any environment. A one-line shell task, a full GUI desktop, a robot simulator - they’re all just
environments with some bespoke content, tasks, and associated capabilities. The complexity under
this file is hidden in the HUD protocol, whose thin envelope lets any model or harness plug into any environment.
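To make the four parts of an environment concrete (contents, tasks, graders, capabilities), here is a minimal sketch in plain Python. It does not use the HUD SDK: `Environment`, `Task`, and `file_contains` are hypothetical names chosen for illustration, and the "agent" is simulated by editing the state directly.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List

# Hypothetical names throughout: this models the concepts, not the HUD SDK.
Grader = Callable[[Dict[str, str]], float]


@dataclass
class Task:
    prompt: str      # what the agent is asked to do
    grader: Grader   # maps the final container state to a reward


@dataclass
class Environment:
    name: str
    contents: Dict[str, str] = field(default_factory=dict)  # files / state
    capabilities: List[str] = field(default_factory=list)   # e.g. "shell"
    tasks: List[Task] = field(default_factory=list)


def file_contains(path: str, text: str) -> Grader:
    """Build a grader that returns 1.0 when `path` contains `text`."""
    def grade(state: Dict[str, str]) -> float:
        return 1.0 if text in state.get(path, "") else 0.0
    return grade


env = Environment(
    name="hello-shell",
    contents={"greeting.txt": ""},
    capabilities=["shell"],
    tasks=[
        Task(
            prompt="Write 'hello' into greeting.txt",
            grader=file_contains("greeting.txt", "hello"),
        )
    ],
)

# Simulate the effect of an agent acting in the container, then grade it.
env.contents["greeting.txt"] = "hello\n"
reward = env.tasks[0].grader(env.contents)
print(reward)
```

The grader is attached to the task rather than to the environment, which mirrors the description above: each task carries its own scoring mechanism.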
2 · Choose your taskset
Once an environment is defined or chosen, the next part is to simply select the set of tasks to use on that environment for evaluation. The core abstraction for this in HUD is the Taskset.
Part 2: Choose your taskset
To form a taskset (one or more tasks with parameters), either build it directly in code,
for example in a tasks.py that imports from env.py, or load it from a file.
HUD provides various ways to load, select, and run tasks. For a dedicated overview, see our guide on
evaluating agents.
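The two routes described above, building a taskset in code or loading it from a file, can be sketched in plain Python. This is an illustration under assumptions, not HUD's loader: `TaskInstance`, `taskset_in_code`, `taskset_from_file`, and the JSON layout are all hypothetical.

```python
import json
import tempfile
from dataclasses import dataclass
from pathlib import Path
from typing import Dict, List


@dataclass(frozen=True)
class TaskInstance:
    task: str                # name of a task declared by the environment
    params: Dict[str, str]   # parameters for this particular instance


def taskset_in_code() -> List[TaskInstance]:
    """Route 1: build the taskset directly in Python."""
    return [TaskInstance("write_file", {"text": t}) for t in ("hello", "world")]


def taskset_from_file(path: Path) -> List[TaskInstance]:
    """Route 2: load the taskset from a JSON file (assumed layout)."""
    rows = json.loads(path.read_text())
    return [TaskInstance(row["task"], row["params"]) for row in rows]


in_code = taskset_in_code()

# Round-trip through a temporary file to show both routes agree.
with tempfile.TemporaryDirectory() as tmp:
    path = Path(tmp) / "tasks.json"
    payload = [{"task": t.task, "params": t.params} for t in in_code]
    path.write_text(json.dumps(payload))
    from_file = taskset_from_file(path)

print(from_file == in_code)
```

Keeping each instance as a task name plus parameters means the same task can appear many times in one taskset with different inputs.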
3 · Choose your runtime
Any kind of environment needs to actually run somewhere. An environment shouldn’t care where it runs - it should just work. HUD lets you run agent evaluations by deploying your environment to our platform on hud.ai. For more customizability and local development, however, we use the Runtime.

| Concept | What it is in HUD |
|---|---|
| Runtime | Where an environment runs - locally, on a third-party provider, or the HUD platform - selected without changing the environment definition. |
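The key property in the table above is that the runtime is chosen without touching the environment definition. A minimal sketch of that separation in plain Python follows; `EnvironmentSpec`, `LocalRuntime`, `RemoteRuntime`, and the returned addresses are hypothetical and do not reflect HUD's actual runtime classes.

```python
from dataclasses import dataclass
from typing import Protocol


@dataclass(frozen=True)
class EnvironmentSpec:
    """The environment definition: identical no matter where it runs."""
    name: str
    image: str


class Runtime(Protocol):
    def launch(self, spec: EnvironmentSpec) -> str:
        """Start the environment and return an address for it."""
        ...


class LocalRuntime:
    def launch(self, spec: EnvironmentSpec) -> str:
        return f"local://{spec.name}"


class RemoteRuntime:
    def __init__(self, host: str) -> None:
        self.host = host

    def launch(self, spec: EnvironmentSpec) -> str:
        return f"https://{self.host}/{spec.name}"


def start(spec: EnvironmentSpec, runtime: Runtime) -> str:
    # Only the runtime argument changes between local and remote execution.
    return runtime.launch(spec)


spec = EnvironmentSpec(name="hello-shell", image="hello-shell:latest")
local_addr = start(spec, LocalRuntime())
remote_addr = start(spec, RemoteRuntime("runtime.example.com"))
print(local_addr, remote_addr)
```

Because `EnvironmentSpec` is frozen, a runtime cannot mutate the definition it is handed, which is one simple way to enforce the separation.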
Part 3: Choose your runtime
There are three ways to run your declared environments. The main distinction is simply where your
environment file lives - on your local drive, or packaged and deployed to the HUD platform.
1. From the CLI with hud eval (preferred). Point it at your on-disk env.py (or tasks.py) and choose
where each rollout runs with --runtime.
2. From a script. The same eval embedded in Python when you want programmatic control - pick a
runtime and run a taskset against it.
3. Deploy to the platform. Build a portable image once and push it to HUD - now your environment lives
remotely, so you can run tasksets from the platform, compare models, and browse every
trace with no local infra.
4 · Run your agent
The next step is to choose the agent you want to evaluate. For standard models like Claude, GPT, or Gemini, our prebuilt harnesses and our optional inference gateway let you switch between models just by choosing their name. Running the agent evaluation produces a run (one rollout). Every run is recorded into a trace - a full, replayable timeline of everything the agent did and how it was graded. Running a whole taskset bundles all the runs into a single job. HUD executes runs in parallel with full isolation out of the box, and every run is traced on the platform, so you can see exactly what the agent did in real time.

| Concept | What it is in HUD |
|---|---|
| Agent | A model paired with a harness that drives it - plugged into an environment’s capabilities to attempt tasks and be graded. |
| Run | A single rollout - one agent attempting one task - that produces a reward. |
| Trace | The full, replayable timeline of a run: every action, observation, and grade. |
| Job | A collection of runs grouped together - e.g. a whole taskset evaluated by an agent. |
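The relationship between runs, traces, and jobs can be sketched with a few dataclasses and `asyncio.gather` for parallel rollouts. Everything here is hypothetical and independent of the HUD SDK: the `rollout` function stands in for an agent attempt, and its grading rule is a toy.

```python
import asyncio
from dataclasses import dataclass, field
from typing import List, Tuple


@dataclass
class Trace:
    # Ordered timeline of (kind, detail) events for one run.
    events: List[Tuple[str, str]] = field(default_factory=list)


@dataclass
class Run:
    task: str
    reward: float
    trace: Trace


@dataclass
class Job:
    runs: List[Run]

    @property
    def mean_reward(self) -> float:
        return sum(r.reward for r in self.runs) / len(self.runs)


async def rollout(task: str) -> Run:
    """One agent attempting one task; each run gets its own trace."""
    trace = Trace()
    trace.events.append(("action", f"attempt {task}"))
    await asyncio.sleep(0)  # stand-in for real agent/environment I/O
    reward = 1.0 if task.startswith("easy") else 0.0  # toy grading rule
    trace.events.append(("grade", str(reward)))
    return Run(task, reward, trace)


async def run_job(tasks: List[str]) -> Job:
    # gather runs the rollouts concurrently and preserves input order.
    runs = await asyncio.gather(*(rollout(t) for t in tasks))
    return Job(list(runs))


job = asyncio.run(run_job(["easy-1", "easy-2", "hard-1"]))
print(job.mean_reward)
```

Each rollout builds its own `Trace`, so concurrent runs never share state, a small-scale analogue of the isolation described above.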
Part 4: Run your agent
5 · Learn
With runs in hand, you can learn from their signals - evaluating a model, benchmarking it against others, or training it to improve. For training, the training client turns the rewards from your runs into model updates you can plug straight into your RL stack.

| Concept | What it is in HUD |
|---|---|
| Training Client | Drives managed training for a model - turning the rewards collected from runs into gradient updates you feed into your RL loop. |
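As a rough illustration of how rewards from runs become a training signal, the sketch below normalizes a batch of rewards into advantages using a mean baseline, a common preprocessing step in policy-gradient methods. This is not the HUD training client, whose interface is not shown here; `advantages` is a hypothetical helper, and the choice of normalization is an assumption.

```python
from statistics import mean, pstdev
from typing import List


def advantages(rewards: List[float], eps: float = 1e-8) -> List[float]:
    """Center rewards on their mean and scale by their standard deviation.

    Runs that beat the batch average get a positive weight, and runs
    below it get a negative one. `eps` avoids division by zero when
    every reward in the batch is identical.
    """
    baseline = mean(rewards)
    scale = pstdev(rewards) + eps
    return [(r - baseline) / scale for r in rewards]


# Rewards collected from four runs of the same task.
rewards = [1.0, 0.0, 1.0, 0.0]
adv = advantages(rewards)
print(adv)
```

An RL loop would then weight each run's log-probabilities by its advantage when computing the gradient update.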
Part 5: Learn