Harbor is a framework for evaluating agents in container environments. HUD can convert any Harbor dataset (including Terminal-Bench) into HUD environments and run them on the platform.

Quick Start

# 1. Clone the benchmark
git clone https://github.com/laude-institute/terminal-bench-2.git

# 2. Convert to HUD format
hud convert ./terminal-bench-2/ --output ./tb2-hud

# 3. Deploy all environments
hud deploy ./tb2-hud --all

# 4. Run evaluation
hud eval ./tb2-hud/taskset.json
That’s it. The converter handles Dockerfile adaptation, build context, test scripts, and reward parsing automatically.

What Gets Converted

A Harbor task directory:
task-name/
├── task.toml              # Config (timeout, metadata)
├── instruction.md         # Agent prompt
├── environment/
│   ├── Dockerfile         # Container setup
│   └── (build context)    # Any files the Dockerfile references
├── tests/
│   └── test.sh            # Verification script
└── solution/              # Ignored
Becomes a HUD environment:
hud-harbor-dataset/
├── env.py                 # MCP environment with run-task scenario
├── Dockerfile.hud         # Harbor Dockerfile + HUD MCP layer
├── pyproject.toml         # Dependencies
├── (build context files)  # Copied from environment/
└── tasks/
    └── task-name/
        ├── instruction.md
        ├── task.toml
        └── tests/test.sh
Plus a taskset.json that references all tasks across all environments.

How It Works

Environment Grouping

Tasks with identical Dockerfiles are grouped into a single HUD environment. If every task has a unique Dockerfile (common in Terminal-Bench), each gets its own environment.
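The grouping rule above can be sketched as follows. This is an illustration, not the converter's actual code: it assumes "identical" means byte-for-byte equal Dockerfile contents, and the function name `group_tasks_by_dockerfile` and the `no-dockerfile` bucket key are hypothetical.

```python
import hashlib
from collections import defaultdict
from pathlib import Path


def group_tasks_by_dockerfile(dataset_dir: Path) -> dict[str, list[str]]:
    """Group Harbor task names by the SHA-256 digest of their Dockerfile bytes."""
    groups: dict[str, list[str]] = defaultdict(list)
    for task_dir in sorted(p for p in dataset_dir.iterdir() if p.is_dir()):
        dockerfile = task_dir / "environment" / "Dockerfile"
        if dockerfile.is_file():
            key = hashlib.sha256(dockerfile.read_bytes()).hexdigest()
        else:
            # Hypothetical bucket for tasks that ship no Dockerfile.
            key = "no-dockerfile"
        groups[key].append(task_dir.name)
    return dict(groups)
```

Each key in the returned mapping would correspond to one HUD environment. A dataset where every task has a unique Dockerfile yields one single-task group per task.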

Dockerfile Adaptation

The converter takes the Harbor Dockerfile verbatim and appends a HUD layer:
  • Installs uv standalone (works on any base image — Debian, Ubuntu, Alpine, etc.)
  • Installs hud-python and openai as dependencies
  • Copies task data into /harbor/tasks/
  • Sets the MCP server as the entrypoint
CMD and ENTRYPOINT from the original Dockerfile are commented out and replaced.
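A minimal sketch of that adaptation step, assuming it works line by line on the Dockerfile text. The function name `adapt_dockerfile` and the contents of `HUD_LAYER` are placeholders for illustration. The real layer also installs uv, hud-python, and openai and sets the MCP entrypoint, and those commands are not reproduced here.

```python
import re

# Dockerfile instructions are case-insensitive, so match CMD/ENTRYPOINT loosely.
_RUNTIME_INSTRUCTION = re.compile(r"^\s*(CMD|ENTRYPOINT)\b", re.IGNORECASE)

# Placeholder layer (assumption): only the task-data copy described above.
HUD_LAYER = """\
# --- HUD MCP layer (illustrative placeholder) ---
COPY tasks/ /harbor/tasks/
"""


def adapt_dockerfile(source: str, hud_layer: str = HUD_LAYER) -> str:
    """Comment out CMD/ENTRYPOINT instructions and append the HUD layer."""
    out = []
    continuing = False  # True while inside a backslash-continued CMD/ENTRYPOINT
    for line in source.splitlines():
        if continuing or _RUNTIME_INSTRUCTION.match(line):
            out.append(f"# {line}")
            continuing = line.rstrip().endswith("\\")
        else:
            out.append(line)
    return "\n".join(out) + "\n\n" + hud_layer
```

Commenting the instructions out rather than deleting them keeps the original startup command visible in Dockerfile.hud when debugging a converted task.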

Reward Parsing

Harbor test scripts write results to /logs/verifier/. The converter supports both formats:
  • reward.txt: a single float (1.0 for pass, 0.0 for fail)
  • reward.json: either {"reward": 1.0} or just a float
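The two formats can be read with a few lines of Python. This is a sketch under assumptions: the function name `read_reward` is hypothetical, and checking reward.txt before reward.json is an arbitrary choice here, not documented converter behavior.

```python
import json
from pathlib import Path
from typing import Optional


def read_reward(verifier_dir: Path = Path("/logs/verifier")) -> Optional[float]:
    """Return the reward written by a Harbor test script, or None if absent."""
    txt_file = verifier_dir / "reward.txt"
    if txt_file.is_file():
        # reward.txt holds a single float such as "1.0".
        return float(txt_file.read_text().strip())

    json_file = verifier_dir / "reward.json"
    if json_file.is_file():
        data = json.loads(json_file.read_text())
        # reward.json is either {"reward": 1.0} or a bare float.
        if isinstance(data, dict):
            return float(data["reward"])
        return float(data)

    return None
```

Returning None when neither file exists lets the caller distinguish "the test script wrote no result" from a genuine 0.0 failure.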

Running Programmatically

You can also run converted tasks from Python using the SDK:
import asyncio
import hud
from hud.agents.claude import ClaudeAgent
from hud.eval.task import Task

async def main():
    task = Task(
        env={"name": "hud-harbor-terminal-bench-2-sample-g1"},
        scenario="hud-harbor-terminal-bench-2-sample-g1:run-task",
        args={"task_id": "build-pmars"},
    )

    agent = ClaudeAgent.create(model="claude-sonnet-4-20250514")

    async with hud.eval(task, name="harbor-demo") as ctx:
        result = await agent.run(ctx, max_steps=30)

    print(f"Reward: {ctx.reward}")

asyncio.run(main())
Or load the full taskset:
import json
from pathlib import Path

from hud.eval.task import Task

taskset = json.loads(Path("./tb2-hud/taskset.json").read_text())
tasks = [Task(**t) for t in taskset]

Supported Harbor Patterns

Pattern                                 Status
--------------------------------------  --------------------------
Simple Dockerfiles (FROM + RUN)         Supported
COPY from local build context           Supported
Multi-stage builds                      Supported
ENV, ARG, build scripts                 Supported
CMD / ENTRYPOINT replacement            Supported
Tasks without Dockerfile                Supported (fallback image)
task.toml metadata passthrough          Supported
docker-compose.yaml (multi-service)     Not yet supported

Limitations

  • Docker Compose: Tasks using docker-compose.yaml for multi-service setups are not currently supported (HUD environments are single-container).
  • Pre-built images: The converter rebuilds from the source Dockerfile rather than using the docker_image field in task.toml. This keeps every environment reproducible from source but takes longer on the first deploy.
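Given the Docker Compose limitation, it can help to scan a dataset before converting it. The sketch below is not part of the hud CLI. The function name `find_compose_tasks` is hypothetical, and because the exact location of the compose file inside a task directory is an assumption, it searches the whole task tree.

```python
from pathlib import Path

# File names treated as Docker Compose configs (assumption: both spellings occur).
COMPOSE_NAMES = ("docker-compose.yaml", "docker-compose.yml")


def find_compose_tasks(dataset_dir: Path) -> list[str]:
    """Return the names of task directories that contain a compose file."""
    flagged = []
    for task_dir in sorted(p for p in dataset_dir.iterdir() if p.is_dir()):
        if any(p.name in COMPOSE_NAMES for p in task_dir.rglob("docker-compose.*")):
            flagged.append(task_dir.name)
    return flagged


if __name__ == "__main__":
    for name in find_compose_tasks(Path("./terminal-bench-2")):
        print(f"multi-service task, not yet supported: {name}")
```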