Why Environments?
Your production API is a single live instance with shared state — you can't run 500 tests against it in parallel without causing chaos. Environments spin up fresh for every evaluation: isolated, deterministic, reproducible. Run thousands in parallel, each starting from the exact state you define, each generating training data. Under the hood, an environment is an MCP server; when you deploy, HUD spins up a fresh, isolated instance for every evaluation, so there is no shared state and no interference between parallel runs.

Create an Environment
Scaffold a new environment with hud init. It works on existing codebases too.
Tools
Tools are functions that an agent can call while it's working on a task. Decorate a function with @env.tool() and agents can call it:
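A sketch of the tool shape this describes. The @env.tool() decorator name comes from the text above; the Env class here is a minimal stand-in registry, not the HUD implementation, and the tool name and signature are illustrative:

```python
class Env:
    # Stand-in for a HUD environment object; only shows the registration idea.
    def __init__(self):
        self.tools = {}

    def tool(self):
        # Minimal stand-in for @env.tool(): register the function by name.
        def register(fn):
            self.tools[fn.__name__] = fn
            return fn
        return register

env = Env()

@env.tool()
def add_item(name: str, quantity: int = 1) -> str:
    """Add an item to the cart. The docstring and type hints describe
    the tool to the agent."""
    return f"Added {quantity} x {name}"
```

The agent would then see an add_item tool it can call with a name and quantity.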
Tool Hooks
Use @tool.before and @tool.after to add validation, logging, or access control to any tool without modifying its implementation:
@tool.before can modify arguments, pass through unchanged, or raise to block. @tool.after can modify or pass through the result.
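The before/after semantics can be sketched in plain Python. This is not HUD's implementation — the with_hooks helper below is a hypothetical stand-in that shows the contract: a before hook may rewrite arguments or raise to block the call, and an after hook may rewrite the result.

```python
def with_hooks(fn, before=None, after=None):
    # Stand-in for @tool.before / @tool.after semantics.
    def wrapped(**kwargs):
        if before is not None:
            kwargs = before(kwargs)      # modified or passed through; may raise
        result = fn(**kwargs)
        if after is not None:
            result = after(result)       # modified or passed through
        return result
    return wrapped

def read_file(path: str) -> str:
    return f"contents of {path}"

def block_secrets(kwargs):
    # Access control: raise to block the call entirely.
    if kwargs["path"].startswith("/secrets"):
        raise PermissionError("access denied")
    return kwargs

def truncate(result):
    # Post-processing: cap what the agent sees.
    return result[:50]

guarded_read = with_hooks(read_file, before=block_secrets, after=truncate)
```

read_file itself never changes; the hooks wrap it from the outside.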
Complex Stateful Tools
For tools that need internal state, connections, or complex initialization, subclass BaseTool. See the Tools SDK Reference for architecture details, base classes, native specs, and complete implementation examples.
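BaseTool's actual interface lives in the Tools SDK Reference; as a generic illustration of why a class helps, here is a hypothetical stateful tool that initializes a resource once and keeps state across calls (every name below is illustrative, not HUD API):

```python
class CounterTool:
    # Generic sketch: a connection-like resource created once at init,
    # plus state that persists across tool calls.
    def __init__(self, db_url: str):
        self.db_url = db_url   # e.g. a connection you open once
        self.calls = 0         # internal state across calls

    def __call__(self, delta: int = 1) -> int:
        self.calls += delta
        return self.calls

tool = CounterTool("sqlite:///:memory:")
```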
Scenarios
Scenarios are the core of HUD. A scenario defines what you ask the agent to do and how you score the result. This is where you spend your creative energy — writing prompts, designing scoring logic, deciding what success looks like. A scenario is an async generator function with two yields: the first yield sends a prompt to the agent, and the second yield returns a reward.

| Section | Where | What it does |
|---|---|---|
| Setup (optional) | Before the first yield | Seed a database, navigate to a URL, prepare initial state |
| Prompt | The first yield | Sends instructions to the agent; receives the agent’s answer |
| Scoring | After the first yield, ending with the second yield | Checks results and returns a reward between 0.0 and 1.0 |
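The three sections in the table map onto the two-yield shape like this. The generator mirrors the structure described above; the manual asend() driver stands in for the HUD runtime, which normally advances the scenario for you, and the workspace and prompt are illustrative:

```python
import asyncio

async def count_files(detail_level: str = "high"):
    # Setup (optional): prepare initial state before the first yield.
    workspace = {"a.txt": "", "b.txt": ""}
    # Prompt: the first yield sends instructions and receives the answer.
    answer = yield f"How many files are in the workspace? (detail: {detail_level})"
    # Scoring: the second yield returns a reward between 0.0 and 1.0.
    yield 1.0 if answer.strip() == str(len(workspace)) else 0.0

async def drive():
    scenario = count_files(detail_level="low")
    prompt = await scenario.asend(None)   # advance to the first yield
    answer = "2"                          # stand-in for the agent's reply
    return prompt, await scenario.asend(answer)

prompt, reward = asyncio.run(drive())
```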
Taking detail_level as a parameter lets you create both step-by-step and high-level task variants from the same code.
Everything upstream (environments, tools) exists to support scenarios. Everything downstream (tasks, tasksets, traces, training) flows from them. You write scenarios in your IDE — that’s where the creative work lives.
Built-in Capabilities
HUD ships with pre-built tools, connectors, and graders so you can assemble environments without writing everything from scratch.

Native Tools
Each model provider (Anthropic, OpenAI, Google) has its own tool specification. HUD handles the translation — add a tool once, and it adapts to whatever agent connects: Claude gets its native computer_20250124 and bash_20250124 specs, while OpenAI gets compatible function calls. Same environment, every agent.
Tools declare native_specs that map to each provider’s format. When an agent connects, HUD checks for a matching spec and registers using the provider’s native format — or falls back to standard function calling. Tools with the same role (e.g. two shell tools) are mutually exclusive.
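The resolution logic just described can be sketched as follows. The field names (native_specs, role) come from the text; the data shapes and the resolve_tools helper are assumptions for illustration:

```python
def resolve_tools(tools, provider):
    # For each tool: use the provider's declared native spec if present,
    # else fall back to standard function calling. Tools sharing a role
    # (e.g. two shell tools) are mutually exclusive: first one wins.
    registered, taken_roles = [], set()
    for tool in tools:
        role = tool.get("role")
        if role in taken_roles:
            continue
        spec = tool.get("native_specs", {}).get(provider, "function_call")
        registered.append((tool["name"], spec))
        if role is not None:
            taken_roles.add(role)
    return registered

tools = [
    {"name": "bash", "role": "shell", "native_specs": {"anthropic": "bash_20250124"}},
    {"name": "sh", "role": "shell"},   # skipped: shell role already taken
    {"name": "lookup"},                # no spec -> standard function calling
]
```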
Match tools to your agent:
| Agent | Computer | Shell | Editor | Memory |
|---|---|---|---|---|
| Claude | AnthropicComputerTool | BashTool | EditTool | ClaudeMemoryTool |
| OpenAI | OpenAIComputerTool | ShellTool | ApplyPatchTool | SessionMemoryTool |
| Gemini | GeminiComputerTool | GeminiShellTool | GeminiEditTool | GeminiMemoryTool |

| Style | Read | Search | Glob | List |
|---|---|---|---|---|
| OpenCode | ReadTool | GrepTool | GlobTool | ListTool |
| Gemini CLI | GeminiReadTool | GeminiSearchTool | GeminiGlobTool | GeminiListTool |
Connectors
Connectors let you pull external tools into your HUD environment — from other HUD environments, external MCP servers, or existing APIs.

Native Graders
hud.native includes reusable scoring helpers so you don’t have to hand-build grading logic for common patterns. Use Grade.gather to run multiple graders in parallel and combine them into a single result:
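Grade.gather itself is HUD's helper; this sketch shows the underlying pattern using asyncio.gather — run several graders concurrently, then combine weighted scores into one reward. The grader names and weighting scheme are illustrative:

```python
import asyncio

async def exact(answer: str, target: str) -> float:
    return 1.0 if answer.strip().lower() == target else 0.0

async def concise(answer: str, target: str) -> float:
    return 1.0 if len(answer) <= 100 else 0.0

async def grade(answer: str, target: str, weights=(0.7, 0.3)) -> float:
    # Run both graders in parallel, then blend into a single reward.
    scores = await asyncio.gather(exact(answer, target), concise(answer, target))
    return sum(w * s for w, s in zip(weights, scores))

reward = asyncio.run(grade("Paris", "paris"))
```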
| Grader | What it does |
|---|---|
| BashGrader | Runs a shell command, scores by exit code (0 → 1.0) |
| LLMJudgeGrader | Grades against rubric criteria using an LLM judge |
| exact_match | Normalized string comparison |
| contains / contains_any / contains_all | Substring checks |
| numeric_match | Extracts the first number, checks it is within tolerance |
| f1_score | Token-level F1 between answer and reference |
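To make the table concrete, here are minimal stand-ins for three of these helpers. They illustrate the semantics listed above; hud.native's real implementations may normalize or tokenize differently:

```python
import re
from collections import Counter

def exact_match(answer: str, target: str) -> float:
    # Normalized string comparison: strip whitespace, ignore case.
    return 1.0 if answer.strip().lower() == target.strip().lower() else 0.0

def numeric_match(answer: str, target: float, tol: float = 1e-2) -> float:
    # Extract the first number from the answer, compare within tolerance.
    m = re.search(r"-?\d+(?:\.\d+)?", answer)
    return 1.0 if m and abs(float(m.group()) - target) <= tol else 0.0

def f1_score(answer: str, reference: str) -> float:
    # Token-level F1 between answer and reference.
    pred = Counter(answer.lower().split())
    ref = Counter(reference.lower().split())
    overlap = sum((pred & ref).values())
    if overlap == 0:
        return 0.0
    precision = overlap / sum(pred.values())
    recall = overlap / sum(ref.values())
    return 2 * precision * recall / (precision + recall)
```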
How It All Fits Together
- You write Scenarios in your IDE — the prompt, scoring logic, and arguments
- Tools give agents capabilities; Environments package tools + scenarios for deployment
- A Scenario + specific arguments = a Task
- Tasks group into Tasksets for benchmarking
- Run a taskset across models → collect Traces with rewards
- Use the traces to compare models, generate training data, or fine-tune your own agent
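The hierarchy above can be sketched as a small data model. This is a conceptual illustration, not the HUD API — the class names match the concepts in the list, but the fields are assumptions:

```python
from dataclasses import dataclass, field

@dataclass
class Task:
    # A scenario bound to concrete arguments.
    scenario: str
    args: dict = field(default_factory=dict)

@dataclass
class Taskset:
    # Tasks grouped for benchmarking.
    name: str
    tasks: list

@dataclass
class Trace:
    # One run of a task against a model, carrying the reward.
    task: Task
    model: str
    reward: float

taskset = Taskset("checkout-bench", [
    Task("checkout_flow", {"detail_level": "high"}),
    Task("checkout_flow", {"detail_level": "low"}),
])
```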
What You Have Now
At this point you have an environment with tools and scenarios — the static definition of what agents can do and how they're scored. No running, no iteration yet.

Tasks & Evaluation
Define tasks, test locally, iterate, sync to the platform
Tool Categories
| Category | What it covers |
|---|---|
| Agent Tools | Run sub-agents as tools |
| Computer | Mouse, keyboard, screenshots |
| Coding | Shell execution, file editing |
| Filesystem | Read, search, and list files |
| Memory | Persistent storage |
| Web | Browser automation, search |
| Grounding | Element description → coordinates |
Advanced Topics
| Topic | What it is | When you’ll need it |
|---|---|---|
| Harbor conversion | Importing external benchmarks | Migrating existing benchmarks |
| REST API | Programmatic platform access | Custom integrations |
| Framework integrations | LangChain, CrewAI, AutoGen, etc. | When using those frameworks |
| Chat scenarios | Multi-turn conversational agents | Building chat products |
| AgentTool | Hierarchical sub-agent delegation | Complex multi-agent workflows |
| Slack integration | Running agents from Slack | Team workflows |