HUD Documentation — Evaluations and RL Environments.

Everything that authors tasks (HUD’s own env.py, platform rows, Harbor task dirs) is a frontend that loads into the same primitives: Environment, Task, and Taskset. Integrations are loaders, not converters, so running foreign tasks needs no codegen round trip. The Harbor integration lives in the SDK repo at integrations/harbor.py. It is a recipe built only on the public SDK surface; copy it into your project or run it from a checkout.

Prerequisites

A Harbor task directory. Each task has task.toml and instruction.md, and usually an environment/ directory (with a Dockerfile) and a tests/ directory.
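
The layout described above can be checked before loading. The following is a minimal sketch using only the Python standard library; check_task_dir is a hypothetical helper written for illustration, not part of the HUD SDK or Harbor, and the SDK's own detect may apply different rules.

```python
from pathlib import Path

# Files the prerequisites say every Harbor task has.
REQUIRED_FILES = ("task.toml", "instruction.md")
# Entries the prerequisites describe as usual but not guaranteed.
USUAL_ENTRIES = ("environment/Dockerfile", "tests")


def check_task_dir(task_dir):
    """Return (missing_required, missing_usual) for one candidate task dir.

    Hypothetical helper for illustration; not an SDK function.
    """
    root = Path(task_dir)
    missing_required = [n for n in REQUIRED_FILES if not (root / n).is_file()]
    missing_usual = [n for n in USUAL_ENTRIES if not (root / n).exists()]
    return missing_required, missing_usual
```

An empty first list means the directory has the two required files; entries in the second list are worth a look but do not necessarily make the task invalid.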

Load Harbor tasks

load(path) parses a Harbor task dir (or a dataset of them) directly into a Taskset. Each task dir becomes one row whose id is the dir name, and task dirs with the same environment/ build context share one declarative Environment:

from integrations.harbor import detect, load

assert detect("./terminal-bench")
taskset = load("./terminal-bench")

for task in taskset:
    print(task.env, task.id)
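
To preview how many distinct build contexts a dataset contains, one possible approach is to fingerprint each environment/ directory and group the task dirs by fingerprint. This is only a sketch: the content hash is an assumption about what "distinct" could mean, the helper names are hypothetical, and the SDK may decide sharing differently.

```python
import hashlib
from collections import defaultdict
from pathlib import Path


def context_fingerprint(env_dir):
    """Hash the relative paths and bytes of every file under env_dir."""
    root = Path(env_dir)
    digest = hashlib.sha256()
    for path in sorted(p for p in root.rglob("*") if p.is_file()):
        digest.update(path.relative_to(root).as_posix().encode())
        digest.update(b"\0")
        digest.update(path.read_bytes())
        digest.update(b"\0")
    return digest.hexdigest()


def group_by_build_context(dataset_dir):
    """Map fingerprint -> names of task dirs whose environment/ matches.

    Hypothetical helper; task dirs without an environment/ are skipped.
    """
    groups = defaultdict(list)
    for task_dir in sorted(Path(dataset_dir).iterdir()):
        env_dir = task_dir / "environment"
        if env_dir.is_dir():
            groups[context_fingerprint(env_dir)].append(task_dir.name)
    return dict(groups)
```

Under this assumption, the number of groups estimates how many Environments the loaded rows would share.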

Like every task row, the result carries no placement, so you supply one to run it. Today that means a substrate already serving the control channel (runtime=Runtime(url)); a docker provider that builds and runs each task’s environment/ image is the planned follow-up:

from hud import Runtime

# `agent` is an agent you have already constructed; run this inside an async function.
job = await taskset.run(agent, runtime=Runtime("tcp://127.0.0.1:8765"))
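
Because the run depends on a substrate that is already serving, a quick pre-flight check can save a confusing failure. The sketch below assumes the control channel accepts plain TCP connections at the tcp://host:port URL shown above; endpoint_reachable is a hypothetical helper, and a successful connect only shows that something is listening, not that it speaks the HUD control protocol.

```python
import socket
from urllib.parse import urlsplit


def endpoint_reachable(url, timeout=2.0):
    """Return True if a TCP connection to the host:port in `url` succeeds.

    Hypothetical helper for illustration; not part of the HUD SDK.
    """
    parts = urlsplit(url)
    if parts.scheme != "tcp" or parts.hostname is None or parts.port is None:
        raise ValueError(f"expected tcp://host:port, got {url!r}")
    try:
        with socket.create_connection((parts.hostname, parts.port), timeout=timeout):
            return True
    except OSError:
        # Refused, timed out, or unroutable: nothing usable is listening.
        return False
```

Calling endpoint_reachable("tcp://127.0.0.1:8765") before taskset.run gives an early, readable signal when the substrate is not up.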

Export HUD tasks to Harbor

export(source, out_dir) goes the other way: it turns a HUD task source (a .py file/dir exposing Tasks, or a .json/.jsonl taskset next to its env.py) into self-contained Harbor task folders:

from integrations.harbor import export

created = await export("tasks.py", "harbor_tasks")

harbor_tasks/
└── <slug>/
    ├── task.toml             # Harbor-native config (+ hud_task/hud_args metadata)
    ├── instruction.md        # the materialized prompt + answer-file convention
    ├── environment/          # the env build context + baked HUD entrypoint
    │   ├── Dockerfile
    │   └── hud_entrypoint.sh
    └── tests/test.sh         # grades over the in-container control channel
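
After an export, the tree above can be verified mechanically. This is a minimal sketch that assumes every exported folder contains exactly the files shown in the tree; missing_export_files is a hypothetical helper, not something the integration provides.

```python
from pathlib import Path

# Relative paths taken from the exported layout shown above.
EXPECTED = (
    "task.toml",
    "instruction.md",
    "environment/Dockerfile",
    "environment/hud_entrypoint.sh",
    "tests/test.sh",
)


def missing_export_files(out_dir):
    """Map each task folder under out_dir to the expected files it lacks.

    Hypothetical helper; folders with nothing missing are omitted.
    """
    report = {}
    for task_dir in sorted(p for p in Path(out_dir).iterdir() if p.is_dir()):
        missing = [rel for rel in EXPECTED if not (task_dir / rel).is_file()]
        if missing:
            report[task_dir.name] = missing
    return report
```

An empty result for missing_export_files("harbor_tasks") means every task folder has the full layout.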

How the lifecycle maps:

| HUD | Harbor |
| --- | --- |
| serving (`python -m hud.environment.server`) + task start | the baked image ENTRYPOINT serves the control channel and parks the run |
| the agent works, writes `answer.txt` | the agent works in the container |
| task evaluate (`grade`) | `tests/test.sh` grades the parked run, writes `reward.txt` |

Only environments whose capabilities are ssh/mcp are exportable (Harbor is shell-centric; rfb/cdp don’t map). The exported task grades over the HUD control channel, so it needs Harbor’s default same-container verifier — don’t set [verifier.environment] in task.toml.

Review, then rely

The mapping is mechanical, so review the result before relying on it: confirm the prompt reads naturally, the grader scores what the prompt asks for, and no answer leakage is left over (see Designing tasks for signal).


See also

Package & deploy

Tasks & placement

Designing tasks for signal

CLI reference