Why Environments?
Your production API is a single live instance with shared state; you can't run 500 tests against it in parallel without causing chaos. Environments spin up fresh for every evaluation: isolated, deterministic, reproducible. Run thousands in parallel, each starting from the exact state you define, each generating training data.

Tools
Start with hud init to scaffold an environment; it works on existing codebases too.
Define a Python function, decorate it with @env.tool(), and agents can call it:
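Here is a minimal sketch of that pattern. The @env.tool() decorator comes from the text above; the Environment import path and constructor are assumptions, so check the env.py that hud init scaffolds for the exact names.

```python
# env.py (sketch) - the import path and constructor below are assumptions;
# only the @env.tool() decorator is taken from the docs above.
from hud import Environment  # assumed import

env = Environment(name="my-env")  # assumed constructor

@env.tool()
def read_file(path: str) -> str:
    """Agent-callable tool: return the contents of a file in the environment."""
    with open(path, "r", encoding="utf-8") as f:
        return f.read()
```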
Scenarios
To evaluate an agent, you need two things: what to tell it, and how to score what it did. Scenarios capture both with two yield statements:
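For example, a scenario might look roughly like this sketch, which continues the env.py example above. The two yields are described in the text; the @env.scenario() decorator name and the 0-to-1 reward convention are assumptions:

```python
# Sketch of a two-yield scenario; the decorator name and reward scale are assumptions.
# Assumes the env object from the earlier env.py sketch.
@env.scenario()
async def fix_typo(path: str):
    # First yield: what to tell the agent (the prompt, built from the parameters).
    yield f"Open {path} and correct the spelling mistake in it."

    # The agent's rollout happens between the two yields.

    # Second yield: how to score what the agent did.
    with open(path, "r", encoding="utf-8") as f:
        text = f.read()
    yield 0.0 if "teh" in text else 1.0
```

The scenario's parameters (path here) are what become the tool spec described in the next section.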
Scenarios as Subagents
The first yield is more than just a prompt: it's context management combined with dynamic input from the scenario's parameters. The parameters become a tool spec that other agents can call. We've found that agents train much better within a scenario structure than on standalone random tasks. Scenarios define boundaries: what the agent should focus on, what success looks like, and how to measure it. This structure also makes agents easier to compose: wrap a scenario with AgentTool and an orchestrator can call it as a specialized subagent.
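As a rough sketch of that composition (AgentTool is named above, but its import path and call signature are assumptions):

```python
# Sketch only: wrap the scenario from the previous section as a subagent tool.
# AgentTool's import path and constructor arguments are assumptions.
from hud import AgentTool  # assumed import

fix_typo_tool = AgentTool(fix_typo)

# An orchestrator given fix_typo_tool sees a tool whose input schema is the
# scenario's parameters (here: path); calling it runs the scenario as a subagent.
```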
See the Ops Diagnostics Cookbook for a complete example of hierarchical agents calling subagent scenarios.
Iterating on Your Environment
Three ways to develop and test your environment:

1. Agent Loop with create_agent
Run a full agent loop locally. This mirrors exactly what happens in remote rollouts:
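A hedged sketch of what that loop could look like; create_agent is named above, but the import path, keyword arguments, and run method shown here are assumptions:

```python
# Sketch of a local agent loop; import path, arguments, and method names are assumptions.
import asyncio
from hud import create_agent  # assumed import

async def main():
    agent = create_agent(model="your-model-id")  # assumed signature
    # Drive the scenario defined earlier; the exact rollout call may differ.
    result = await agent.run(fix_typo(path="notes.txt"))
    print(result)

asyncio.run(main())
```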
2. MCP Server with hud dev

Spawn your environment as an MCP server that Cursor, Claude Code, or any MCP client can connect to by running hud dev env:env. The env:env syntax works like uvicorn's module:attribute convention: it tells hud dev to import env.py and run the env object as an MCP server.
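Concretely, restating the earlier env.py sketch to show the mapping (the import is still an assumption):

```python
# env.py                       <- the module part of "env:env"
from hud import Environment    # assumed import, as in the earlier sketch

env = Environment(name="my-env")  # <- the attribute part of "env:env": what hud dev serves

# Then, from the project root:
#   hud dev env:env
```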