HUD Documentation — Evaluations and RL Environments.

Harbor is a framework for evaluating agents in container environments. HUD can convert any Harbor dataset (including Terminal-Bench) into HUD environments and run them on the platform.

Quick Start

# 1. Clone the benchmark
git clone https://github.com/laude-institute/terminal-bench-2.git

# 2. Convert to HUD format
hud convert ./terminal-bench-2/ --output ./tb2-hud

# 3. Deploy all environments
hud deploy ./tb2-hud --all

# 4. Run evaluation
hud eval ./tb2-hud/taskset.json

That’s it. The converter handles Dockerfile adaptation, build context, test scripts, and reward parsing automatically.

What Gets Converted

A Harbor task directory:

task-name/
├── task.toml              # Config (timeout, metadata)
├── instruction.md         # Agent prompt
├── environment/
│   ├── Dockerfile         # Container setup
│   └── (build context)    # Any files the Dockerfile references
├── tests/
│   └── test.sh            # Verification script
└── solution/              # Ignored

Becomes a HUD environment:

hud-harbor-dataset/
├── env.py                 # MCP environment with run-task scenario
├── Dockerfile.hud         # Harbor Dockerfile + HUD MCP layer
├── pyproject.toml         # Dependencies
├── (build context files)  # Copied from environment/
└── tasks/
    └── task-name/
        ├── instruction.md
        ├── task.toml
        └── tests/test.sh

Plus a taskset.json that references all tasks across all environments.

How It Works

Environment Grouping

Tasks with identical Dockerfiles are grouped into a single HUD environment. If every task has a unique Dockerfile (common in Terminal-Bench), each gets its own environment.
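The grouping rule can be pictured with a short sketch. This is not the converter's code: it assumes "identical" means byte-identical Dockerfile contents, and the function name `group_tasks_by_dockerfile` and the `no-dockerfile` bucket are hypothetical names used only for illustration.

```python
import hashlib
from collections import defaultdict
from pathlib import Path


def group_tasks_by_dockerfile(dataset_root: Path) -> dict[str, list[str]]:
    """Map a Dockerfile content hash to the names of the tasks that share it."""
    groups: dict[str, list[str]] = defaultdict(list)
    for task_dir in sorted(p for p in dataset_root.iterdir() if p.is_dir()):
        dockerfile = task_dir / "environment" / "Dockerfile"
        if dockerfile.is_file():
            # Assumption: two Dockerfiles are "identical" when their bytes match.
            key = hashlib.sha256(dockerfile.read_bytes()).hexdigest()
        else:
            # Hypothetical bucket for tasks that ship no Dockerfile.
            key = "no-dockerfile"
        groups[key].append(task_dir.name)
    return dict(groups)
```

Under this assumption, a dataset where every task has its own Dockerfile yields one group per task, which matches the Terminal-Bench case described above.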

Dockerfile Adaptation

The converter takes the Harbor Dockerfile verbatim and appends a HUD layer:

Installs uv standalone (works on any base image — Debian, Ubuntu, Alpine, etc.)
Installs hud-python and openai as dependencies
Copies task data into /harbor/tasks/
Sets the MCP server as the entrypoint

CMD and ENTRYPOINT from the original Dockerfile are commented out and replaced.
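As a rough illustration of the adaptation step, the sketch below comments out `CMD` and `ENTRYPOINT` instructions (including backslash-continued lines) and appends a layer. It is a simplified model, not the converter's implementation: `adapt_dockerfile` and `HYPOTHETICAL_HUD_LAYER` are made-up names, the appended layer is a placeholder rather than the real HUD layer, and heredocs and comment lines inside a continued instruction are not handled.

```python
import re

# Matches a CMD or ENTRYPOINT instruction at the start of a line.
_RUNTIME_INSTRUCTION = re.compile(r"^\s*(CMD|ENTRYPOINT)\b", re.IGNORECASE)

# Placeholder only: the real HUD layer is generated by the converter.
HYPOTHETICAL_HUD_LAYER = "# --- HUD MCP layer would be appended here ---\n"


def adapt_dockerfile(source: str, hud_layer: str = HYPOTHETICAL_HUD_LAYER) -> str:
    """Comment out CMD/ENTRYPOINT instructions, then append hud_layer."""
    out: list[str] = []
    in_continuation = False  # True while the previous line ended with a backslash
    commenting = False       # True while inside a CMD/ENTRYPOINT instruction
    for line in source.splitlines():
        if not in_continuation:
            # Only the first line of an instruction decides whether to comment it out,
            # so a continued RUN line that happens to start with "cmd" is left alone.
            commenting = bool(_RUNTIME_INSTRUCTION.match(line))
        out.append(f"# {line}" if commenting else line)
        in_continuation = line.rstrip().endswith("\\")
    return "\n".join(out) + "\n\n" + hud_layer
```

Everything else in the Harbor Dockerfile passes through unchanged, which mirrors the "verbatim plus appended layer" behavior described above.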

Reward Parsing

Harbor test scripts write their results to /logs/verifier/. The converter reads either of two formats:

reward.txt — a single float (1.0 for pass, 0.0 for fail)
reward.json — {"reward": 1.0} or just a float
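A minimal sketch of reading either format is shown below. The function name `read_reward` is hypothetical, and checking reward.txt before reward.json is an assumption made for the example; the source does not state which file takes precedence when both exist.

```python
import json
from pathlib import Path
from typing import Optional


def read_reward(verifier_dir: Path) -> Optional[float]:
    """Return the reward found in verifier_dir, or None if no reward file exists."""
    txt_file = verifier_dir / "reward.txt"
    if txt_file.is_file():
        # reward.txt holds a single float such as "1.0" or "0.0".
        return float(txt_file.read_text().strip())

    json_file = verifier_dir / "reward.json"
    if json_file.is_file():
        data = json.loads(json_file.read_text())
        # reward.json holds either {"reward": 1.0} or a bare number.
        if isinstance(data, dict):
            return float(data["reward"])
        return float(data)

    return None
```

Calling `read_reward(Path("/logs/verifier"))` inside the container would then return the float written by test.sh, or `None` if the script produced no reward file.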

Running Programmatically

You can also run converted tasks from Python using the SDK:

import asyncio

import hud
from hud.agents.claude import ClaudeAgent
from hud.eval.task import Task


async def main():
    # Point at one converted environment and select a single task by its task_id.
    task = Task(
        env={"name": "hud-harbor-terminal-bench-2-sample-g1"},
        scenario="hud-harbor-terminal-bench-2-sample-g1:run-task",
        args={"task_id": "build-pmars"},
    )

    agent = ClaudeAgent.create(model="claude-sonnet-4-20250514")

    # Run the agent inside an evaluation context, capped at 30 steps.
    async with hud.eval(task, name="harbor-demo") as ctx:
        await agent.run(ctx, max_steps=30)

    print(f"Reward: {ctx.reward}")


asyncio.run(main())

Or load the full taskset:

import json
from pathlib import Path

from hud.eval.task import Task

# taskset.json is produced by the converter; build one Task per entry.
taskset = json.loads(Path("./tb2-hud/taskset.json").read_text())
tasks = [Task(**t) for t in taskset]
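To run only part of a taskset, the loaded entries can be filtered before they are turned into Task objects. The sketch below assumes each entry mirrors the Task fields shown earlier (`env`, `scenario`, `args` with a `task_id`); that shape is inferred from the example, not a documented schema, and `select_tasks` is a hypothetical helper.

```python
from typing import Any


def select_tasks(taskset: list[dict[str, Any]], task_ids: set[str]) -> list[dict[str, Any]]:
    """Keep only the entries whose args.task_id is in task_ids."""
    return [
        entry
        for entry in taskset
        # Entries without args or task_id are skipped rather than raising.
        if entry.get("args", {}).get("task_id") in task_ids
    ]
```

For example, `select_tasks(taskset, {"build-pmars"})` would keep only the build-pmars entry, which can then be passed through `Task(**entry)` as above.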

Supported Harbor Patterns

Pattern	Status
Simple Dockerfiles (`FROM` + `RUN`)	Supported
`COPY` from local build context	Supported
Multi-stage builds	Supported
`ENV`, `ARG`, build scripts	Supported
`CMD` / `ENTRYPOINT` replacement	Supported
Tasks without Dockerfile	Supported (fallback image)
`task.toml` metadata passthrough	Supported
`docker-compose.yaml` (multi-service)	Not yet supported

Limitations

Docker Compose: Tasks using docker-compose.yaml for multi-service setups are not currently supported (HUD environments are single-container).
Pre-built images: The converter rebuilds from the source Dockerfile rather than using the docker_image field in task.toml. This ensures full reproducibility but takes longer on first deploy.
