An environment is everything an agent can interact with—your APIs, services, databases, wrapped as tools. It also defines how agents are evaluated through scenarios. When you deploy an environment, you’re creating a sandbox that agents can learn from at scale.

Why Environments?

Your production API is a single live instance with shared state—you can’t run 500 tests against it in parallel without causing chaos. Environments spin up fresh for every evaluation: isolated, deterministic, reproducible. Run thousands in parallel, each starting from the exact state you define, each generating training data.

Under the hood, an environment is an MCP server. When you deploy, HUD spins up a fresh, isolated instance for every evaluation: no shared state, no interference between parallel runs.

Create an Environment

Scaffold a new environment with hud init. Works on existing codebases too:
```bash
hud init my-env
cd my-env
```
This creates the basic structure:
```python
from hud import Environment

env = Environment("my-env")
```
Tools

Tools are functions that an agent can call while it’s working on a task. Decorate a function with @env.tool() and agents can call it:
```python
from hud import Environment

env = Environment("my-env")

@env.tool()
def count_letter(text: str, letter: str) -> int:
    """Count occurrences of a letter in text."""
    return text.lower().count(letter.lower())
```
The docstring becomes the tool’s description that the agent sees. The type hints become the tool’s parameter schema. That’s it — your function is now something any AI model can invoke.
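HUD’s actual introspection machinery isn’t shown here, but the idea of deriving a tool schema from a function’s docstring and type hints can be sketched in plain Python (the `tool_schema` helper and its output shape are illustrative, not HUD’s API):

```python
import inspect
from typing import get_type_hints

def count_letter(text: str, letter: str) -> int:
    """Count occurrences of a letter in text."""
    return text.lower().count(letter.lower())

# Map Python annotations to JSON-schema-style type names
TYPE_NAMES = {str: "string", int: "integer", float: "number", bool: "boolean"}

def tool_schema(fn):
    """Build a minimal tool description from a function's signature."""
    hints = get_type_hints(fn)
    params = {
        name: {"type": TYPE_NAMES.get(hints.get(name), "object")}
        for name in inspect.signature(fn).parameters
    }
    return {
        "name": fn.__name__,
        "description": inspect.getdoc(fn),
        "parameters": params,
    }

schema = tool_schema(count_letter)
# schema["description"] == "Count occurrences of a letter in text."
# schema["parameters"]["text"] == {"type": "string"}
```

The agent never sees your implementation, only this description, which is why clear docstrings and precise type hints matter.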

Tool Hooks

Use @tool.before and @tool.after to add validation, logging, or access control to any tool without modifying its implementation:
```python
from hud import Environment
from hud.tools import BashTool
from hud.tools.types import ToolError

env = Environment("my-env")
bash = BashTool()

@bash.before
async def block_dangerous(command: str | None = None, **kwargs):
    if command and "rm -rf" in command:
        raise ToolError("Blocked dangerous command")

env.add_tool(bash)
```
@tool.before can modify arguments, pass through unchanged, or raise to block. @tool.after can modify or pass through the result.
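That contract can be sketched in plain Python. This is a rough stand-in for what the SDK does around each tool call, not HUD’s internals; hooks here are synchronous for brevity, and `run_with_hooks`, `truncate`, and `ToolError` are illustrative names:

```python
class ToolError(Exception):
    """Raised by a before hook to block a call."""

def run_with_hooks(fn, kwargs, before=None, after=None):
    """Run a tool with optional before/after hooks."""
    if before is not None:
        updated = before(**kwargs)
        if updated is not None:      # hook may return modified arguments
            kwargs = updated
    result = fn(**kwargs)
    if after is not None:
        reviewed = after(result)
        if reviewed is not None:     # hook may return a modified result
            result = reviewed
    return result

def block_dangerous(command=None, **kwargs):
    if command and "rm -rf" in command:
        raise ToolError("Blocked dangerous command")

def truncate(result):
    """Cap tool output so the agent's context stays small."""
    return result[:100]
```

Returning `None` from a hook leaves the arguments or result untouched, so logging-only hooks need no return statement.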

Complex Stateful Tools

For tools that need internal state, connections, or complex initialization, subclass BaseTool. See the Tools SDK Reference for architecture details, base classes, native specs, and complete implementation examples.
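BaseTool’s actual interface lives in that reference; as a rough illustration of why state matters, a stateful tool is an object that keeps its connection alive between calls rather than reconnecting each time (plain Python sketch with an illustrative `NotesTool`, not HUD’s API):

```python
import sqlite3

class NotesTool:
    """Illustrative stateful tool: owns a database connection for its lifetime."""
    name = "notes"

    def __init__(self, path: str = ":memory:"):
        # The connection is created once and reused across every call
        self.conn = sqlite3.connect(path)
        self.conn.execute("CREATE TABLE IF NOT EXISTS notes (body TEXT)")

    def add(self, body: str) -> int:
        """Store a note and return how many notes exist."""
        self.conn.execute("INSERT INTO notes (body) VALUES (?)", (body,))
        return self.conn.execute("SELECT COUNT(*) FROM notes").fetchone()[0]
```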

Scenarios

Scenarios are the core of HUD. A scenario defines what you ask the agent to do and how you score the result. This is where you spend your creative energy — writing prompts, designing scoring logic, deciding what success looks like. A scenario is an async generator function with two yields — the first yield sends a prompt to the agent, and the second yield returns a reward:
```python
@env.scenario("count")
async def count(word: str, letter: str):
    answer = yield f"How many '{letter}' in '{word}'?"

    correct = str(word.lower().count(letter.lower()))
    yield 1.0 if answer and correct in answer else 0.0
```
Every scenario follows this structure:
| Section | Where | What it does |
| --- | --- | --- |
| Setup (optional) | Before the first yield | Seed a database, navigate to a URL, prepare initial state |
| Prompt | The first yield | Sends instructions to the agent; receives the agent’s answer |
| Scoring | After the first yield, ending with the second yield | Checks results and returns a reward between 0.0 and 1.0 |
The agent runs between the two yields. It calls tools, reasons, and produces an answer. Your scoring logic then checks the environment state and/or the answer to determine a reward.

Scenarios are parameterized. The same scenario with different arguments produces different evaluation tasks:
```python
count.task(word="strawberry", letter="r")  # one task (answer: 3)
count.task(word="banana", letter="a")      # another task (answer: 3)
count.task(word="mississippi", letter="s") # another task (answer: 4)
```
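Under the hood the two-yield pattern is just the async generator protocol: the runner advances to the first yield to get the prompt, sends the agent’s answer back in, and reads the reward from the second yield. A minimal stand-in runner, in plain Python with a hard-coded "agent" (illustrative only, not HUD’s actual runner):

```python
import asyncio

async def count(word: str, letter: str):
    answer = yield f"How many '{letter}' in '{word}'?"
    correct = str(word.lower().count(letter.lower()))
    yield 1.0 if answer and correct in answer else 0.0

async def run(scenario, agent):
    prompt = await scenario.asend(None)   # advance to the first yield: the prompt
    answer = agent(prompt)                # the agent does its work here
    reward = await scenario.asend(answer) # resume scoring, read the second yield
    return reward

# A stand-in "agent" that always answers "3"
reward = asyncio.run(run(count("strawberry", "r"), lambda prompt: "3"))
# reward == 1.0
```

In a real run the agent step is a full tool-calling loop against the environment, but the scenario code is oblivious to that: it only sees the prompt go out and the answer come back.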
Design scenarios to be expressive — the more you can control through parameters, the easier it is to calibrate difficulty later without rewriting the scenario. For example, a scenario that takes detail_level as a parameter lets you create both step-by-step and high-level task variants from the same code.

Everything upstream (environments, tools) exists to support scenarios. Everything downstream (tasks, tasksets, traces, training) flows from them. You write scenarios in your IDE — that’s where the creative work lives.

Built-in Capabilities

HUD ships with pre-built tools, connectors, and graders so you can assemble environments without writing everything from scratch.

Native Tools

Each model provider (Anthropic, OpenAI, Google) has its own tool specification. HUD handles the translation — add a tool once, and it adapts to whatever agent connects:
```python
from hud import Environment
from hud.tools import AnthropicComputerTool, BashTool, EditTool

env = Environment("desktop-agent")
env.add_tool(AnthropicComputerTool())
env.add_tool(BashTool())
env.add_tool(EditTool())
```
Claude gets native computer_20250124 and bash_20250124. OpenAI gets compatible function calls. Same environment, every agent.

Tools declare native_specs that map to each provider’s format. When an agent connects, HUD checks for a matching spec and registers using the provider’s native format, or falls back to standard function calling. Tools with the same role (e.g. two shell tools) are mutually exclusive.

Match tools to your agent:
| Agent | Computer | Shell | Editor | Memory |
| --- | --- | --- | --- | --- |
| Claude | AnthropicComputerTool | BashTool | EditTool | ClaudeMemoryTool |
| OpenAI | OpenAIComputerTool | ShellTool | ApplyPatchTool | SessionMemoryTool |
| Gemini | GeminiComputerTool | GeminiShellTool | GeminiEditTool | GeminiMemoryTool |
Filesystem tools are agent-agnostic — choose based on output style:
| Style | Read | Search | Glob | List |
| --- | --- | --- | --- | --- |
| OpenCode | ReadTool | GrepTool | GlobTool | ListTool |
| Gemini CLI | GeminiReadTool | GeminiSearchTool | GeminiGlobTool | GeminiListTool |
Example — computer use environment:
```python
from hud import Environment
from hud.tools import AnthropicComputerTool, BashTool, EditTool
from hud.tools.filesystem import ReadTool, GrepTool

env = Environment("desktop-agent")
env.add_tool(AnthropicComputerTool())
env.add_tool(BashTool())
env.add_tool(EditTool())
env.add_tool(ReadTool())
env.add_tool(GrepTool())
```
See the full Tools Reference for all available tools (computer, coding, filesystem, memory, web, grounding).
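The spec-matching behavior described above can be sketched roughly like this (the dict shapes and `register_tool` helper are illustrative, not HUD’s internals):

```python
def register_tool(tool: dict, provider: str) -> dict:
    """Use the provider's native spec if the tool declares one,
    otherwise fall back to standard function calling."""
    native = tool.get("native_specs", {})
    if provider in native:
        return {"type": native[provider]}             # e.g. bash_20250124
    return {"type": "function", "name": tool["name"]}

bash = {"name": "bash", "native_specs": {"anthropic": "bash_20250124"}}

register_tool(bash, "anthropic")  # {"type": "bash_20250124"}
register_tool(bash, "openai")     # {"type": "function", "name": "bash"}
```

The fallback is what makes environments portable: a tool with no native spec for a provider still works, just through generic function calling.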

Connectors

Connectors let you pull external tools into your HUD environment — from other HUD environments, external MCP servers, or existing APIs:
```python
env.connect_fastapi(app)                                    # FastAPI → tools
env.connect_openapi("https://api.example.com/openapi.json") # OpenAPI spec → tools
env.connect_hub("hud-evals/browser")                        # HUD Hub environments
env.connect_image("my-service:v1")                          # Docker images
```
You don’t need connectors to get started. They’re useful when you want to compose environments or wrap existing services.
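As a rough sketch of what an OpenAPI connector does, each operation in the spec becomes one callable tool, named after its operationId (plain Python, illustrative of the idea rather than HUD’s implementation):

```python
def tools_from_openapi(spec: dict) -> list[dict]:
    """Turn each (path, method) operation in an OpenAPI spec into a tool stub."""
    tools = []
    for path, methods in spec.get("paths", {}).items():
        for method, op in methods.items():
            tools.append({
                "name": op.get("operationId", f"{method}_{path.strip('/')}"),
                "description": op.get("summary", ""),
            })
    return tools

spec = {
    "paths": {
        "/orders": {
            "get": {"operationId": "list_orders", "summary": "List all orders"},
            "post": {"operationId": "create_order", "summary": "Create an order"},
        }
    }
}
# tools_from_openapi(spec) produces two tool stubs: list_orders and create_order
```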

Native Graders

hud.native includes reusable scoring helpers so you don’t have to hand-build grading logic for common patterns. Use Grade.gather to run multiple graders in parallel and combine them into a single result:
```python
from hud import Environment
from hud.native import BashGrader, Grade, exact_match

env = Environment("coding-env")

@env.scenario("fix-tests")
async def fix_tests():
    yield "Make the checkout tests pass"

    yield await Grade.gather(
        BashGrader.grade(weight=0.7, command="pytest tests/test_checkout.py -q"),
        BashGrader.grade(weight=0.3, command="ruff check ."),
    )
```
| Grader | What it does |
| --- | --- |
| BashGrader | Runs a shell command, scores by exit code (0 → 1.0) |
| LLMJudgeGrader | Grades against rubric criteria using an LLM judge |
| exact_match | Normalized string comparison |
| contains / contains_any / contains_all | Substring checks |
| numeric_match | Extracts first number, checks within tolerance |
| f1_score | Token-level F1 between answer and reference |
See the full Native Graders Reference for all options and parameters.
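The arithmetic behind a weighted combination like the 0.7/0.3 split above works out as follows, assuming each grader returns a score in [0, 1] and the reward is the weight-normalized sum (a plain-Python sketch, not Grade.gather’s actual code):

```python
def combine(grades: list[tuple[float, float]]) -> float:
    """grades: (weight, score) pairs; returns the weighted average."""
    total = sum(weight for weight, _ in grades)
    return sum(weight * score for weight, score in grades) / total

# Tests pass (1.0) but lint fails (0.0): the run still earns partial credit
combine([(0.7, 1.0), (0.3, 0.0)])  # 0.7
```

Partial credit is the point: a weighted blend rewards agents that make real progress even when one check still fails, which gives training a smoother signal than all-or-nothing scoring.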

How It All Fits Together

  1. You write Scenarios in your IDE — the prompt, scoring logic, and arguments
  2. Tools give agents capabilities; Environments package tools + scenarios for deployment
  3. A Scenario + specific arguments = a Task
  4. Tasks group into Tasksets for benchmarking
  5. Run a taskset across models → collect Traces with rewards
  6. Use the traces to compare models, generate training data, or fine-tune your own agent
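The nouns in that flow map onto a simple data model, sketched here with plain dataclasses (illustrative names, not the SDK’s actual classes):

```python
from dataclasses import dataclass, field

@dataclass
class Task:
    scenario: str   # which scenario to run
    args: dict      # its arguments; scenario + args = one task

@dataclass
class Taskset:
    name: str
    tasks: list[Task] = field(default_factory=list)

@dataclass
class Trace:
    task: Task
    model: str
    reward: float   # produced by the scenario's scoring yield

counting = Taskset("letter-counting", [
    Task("count", {"word": "strawberry", "letter": "r"}),
    Task("count", {"word": "banana", "letter": "a"}),
])
```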

What You Have Now

At this point you have an environment with tools and scenarios — the static definition of what agents can do and how they’re scored. No running, no iteration yet.

Tasks & Evaluation

Define tasks, test locally, iterate, sync to the platform

Tool Categories

  - Agent Tools: run sub-agents as tools
  - Computer: mouse, keyboard, screenshots
  - Coding: shell execution, file editing
  - Filesystem: read, search, and list files
  - Memory: persistent storage
  - Web: browser automation, search
  - Grounding: element description → coordinates

Advanced Topics

| Topic | What it is | When you’ll need it |
| --- | --- | --- |
| Harbor conversion | Importing external benchmarks | Migrating existing benchmarks |
| REST API | Programmatic platform access | Custom integrations |
| Framework integrations | LangChain, CrewAI, AutoGen, etc. | When using those frameworks |
| Chat scenarios | Multi-turn conversational agents | Building chat products |
| AgentTool | Hierarchical sub-agent delegation | Complex multi-agent workflows |
| Slack integration | Running agents from Slack | Team workflows |