Harbor is a framework for evaluating agents in container environments. HUD can convert any Harbor dataset (including Terminal-Bench) into HUD environments and run them on the platform.
Quick Start
What Gets Converted
A Harbor task directory:
`taskset.json` that references all tasks across all environments.
How It Works
Environment Grouping
Tasks with identical Dockerfiles are grouped into a single HUD environment. If every task has a unique Dockerfile (common in Terminal-Bench), each task gets its own environment.
Dockerfile Adaptation
The converter takes the Harbor Dockerfile verbatim and appends a HUD layer that:
- Installs `uv` standalone (works on any base image: Debian, Ubuntu, Alpine, etc.)
- Installs `hud-python` and `openai` as dependencies
- Copies task data into `/harbor/tasks/`
- Sets the MCP server as the entrypoint

`CMD` and `ENTRYPOINT` from the original Dockerfile are commented out and replaced.
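The adaptation described in this section can be pictured with a rough sketch. This is not the converter's actual code: the function name `adapt_dockerfile` and the contents of `HUD_LAYER` are hypothetical placeholders, and the real layer's install and entrypoint lines are not shown in this document. The sketch only illustrates the idea of keeping the original Dockerfile, commenting out `CMD` and `ENTRYPOINT` (including backslash-continued lines), and appending a layer.

```python
import re

# Hypothetical placeholder for the appended layer. The real converter's
# install and entrypoint lines are not given in this document.
HUD_LAYER = """\
# --- HUD layer (illustrative placeholder) ---
# (install uv, hud-python, and openai here)
COPY tasks/ /harbor/tasks/
# (set the MCP server as the entrypoint here)
"""

# Matches a line that starts a CMD or ENTRYPOINT instruction.
_CMD_RE = re.compile(r"^\s*(CMD|ENTRYPOINT)\b", re.IGNORECASE)


def adapt_dockerfile(source: str, hud_layer: str = HUD_LAYER) -> str:
    """Comment out CMD/ENTRYPOINT instructions and append the HUD layer."""
    out = []
    in_cmd = False  # True while inside a backslash-continued CMD/ENTRYPOINT
    for line in source.splitlines():
        if in_cmd or _CMD_RE.match(line):
            out.append("# " + line)
            in_cmd = line.rstrip().endswith("\\")
        else:
            out.append(line)
    return "\n".join(out) + "\n\n" + hud_layer
```

Every other instruction passes through unchanged, which matches the "verbatim" behavior described above. A production version would also need to handle cases this sketch ignores, such as heredoc syntax.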
Reward Parsing
Harbor test scripts write results to `/logs/verifier/`. The converter supports both formats:
- `reward.txt`: a single float (`1.0` for pass, `0.0` for fail)
- `reward.json`: `{"reward": 1.0}` or just a float
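As a minimal sketch of how the two formats could be read, the function below accepts either file. The name `parse_reward`, checking `reward.txt` before `reward.json`, and returning `None` when neither file exists are all assumptions made for illustration; the converter's actual precedence and error handling are not described here.

```python
import json
from pathlib import Path
from typing import Optional


def parse_reward(verifier_dir: Path) -> Optional[float]:
    """Return the reward found in verifier_dir, or None if no reward file exists."""
    txt = verifier_dir / "reward.txt"
    if txt.is_file():
        # reward.txt holds a single float such as "1.0" or "0.0"
        return float(txt.read_text().strip())

    js = verifier_dir / "reward.json"
    if js.is_file():
        data = json.loads(js.read_text())
        # reward.json is either {"reward": <float>} or a bare float
        if isinstance(data, dict):
            return float(data["reward"])
        return float(data)

    return None
```

In a converted environment this would be called with `Path("/logs/verifier")` after the Harbor test script has finished.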
Running Tasks
Option 1: Upload as a Taskset (recommended)
The generated `taskset.json` can be uploaded directly to the HUD platform for managed evaluation, leaderboards, and comparison across models:
- Go to hud.ai/evalsets and create a new taskset
- Click Upload Tasks and paste the contents of `taskset.json`
- Run evaluations from the platform UI or via `hud eval`
Option 2: CLI eval
Run the taskset directly from the command line:
Option 3: Python SDK
Run tasks programmatically with any agent:
Supported Harbor Patterns
| Pattern | Status |
|---|---|
| Simple Dockerfiles (`FROM` + `RUN`) | Supported |
| `COPY` from local build context | Supported |
| Multi-stage builds | Supported |
| `ENV`, `ARG`, build scripts | Supported |
| `CMD` / `ENTRYPOINT` replacement | Supported |
| Tasks without Dockerfile | Supported (fallback image) |
| `task.toml` metadata passthrough | Supported |
| `docker-compose.yaml` (multi-service) | Not yet supported |
Limitations
- Docker Compose: Tasks using `docker-compose.yaml` for multi-service setups are not currently supported (HUD environments are single-container).
- Pre-built images: The converter rebuilds from the source Dockerfile rather than using the `docker_image` field in `task.toml`. This ensures full reproducibility but takes longer on first deploy.
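Because of the Docker Compose limitation, it may help to scan a dataset before converting it. The sketch below is not part of HUD or Harbor. The function name `find_compose_tasks` is hypothetical, and it assumes each immediate subdirectory of the dataset directory is one task and that a compose file, if present, is named `docker-compose.yaml` somewhere inside that task directory.

```python
from pathlib import Path
from typing import List


def find_compose_tasks(dataset_dir: Path) -> List[str]:
    """Return names of task directories containing a docker-compose.yaml."""
    flagged = []
    # Assumption: every immediate subdirectory of dataset_dir is one task.
    for task_dir in sorted(p for p in dataset_dir.iterdir() if p.is_dir()):
        # Search the whole task directory, since the file's exact location
        # within a task is not specified in this document.
        if next(task_dir.rglob("docker-compose.yaml"), None) is not None:
            flagged.append(task_dir.name)
    return flagged
```

Any task names it returns would fall under the "Not yet supported" row of the table above.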