# Sandboxing
Agents need isolated state. You can’t point an agent at production — it’ll make real changes, hit real APIs, and affect real users. These patterns keep things safe.

## Database Isolation
In-memory SQLite — fastest, resets automatically:
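A minimal sketch of the pattern using the standard `sqlite3` module (the table and helper names are illustrative, not part of the SDK):

```python
import sqlite3

# A fresh in-memory database per test: nothing touches disk, and all
# state vanishes when the connection closes, so every run starts clean.
def make_sandbox_db() -> sqlite3.Connection:
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE orders (id INTEGER PRIMARY KEY, status TEXT)")
    return conn

conn = make_sandbox_db()
conn.execute("INSERT INTO orders (status) VALUES ('pending')")
assert conn.execute("SELECT COUNT(*) FROM orders").fetchone()[0] == 1
```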
## Mocking External Services

env.mock() intercepts at the tool layer. Agents only see tools, so this is usually all you need to test agent logic without hitting real services:
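To make the idea concrete, here is a toy stand-in (the `FakeEnv` class is hypothetical, not the SDK's implementation). It shows why tool-layer interception is enough: the agent resolves tools by name, so swapping the registry entry fakes the whole service.

```python
# Hypothetical sketch of tool-layer mocking. The agent never calls the
# external API directly; it only calls tools by name, so replacing the
# registry entry is all the isolation needed.
class FakeEnv:
    def __init__(self):
        self.tools = {}

    def mock(self, tool_name, response):
        # Register a canned response under the tool's name.
        self.tools[tool_name] = lambda **kwargs: response

    def call_tool(self, tool_name, **kwargs):
        return self.tools[tool_name](**kwargs)

env = FakeEnv()
env.mock("send_email", {"status": "sent"})
assert env.call_tool("send_email", to="user@example.com") == {"status": "sent"}
```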
For stateful mocking (tracking what happened for assertions):
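One common shape is a callable that records its invocations. This `RecordingMock` is an illustrative helper, not an SDK class:

```python
class RecordingMock:
    """Returns a canned response and keeps a call log for later assertions."""

    def __init__(self, response):
        self.response = response
        self.calls = []

    def __call__(self, **kwargs):
        self.calls.append(kwargs)   # remember every invocation
        return self.response

send_email = RecordingMock({"status": "sent"})
send_email(to="user@example.com", subject="hi")

# Assert on what the agent actually did, not just the final answer.
assert send_email.calls == [{"to": "user@example.com", "subject": "hi"}]
```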
## Testing Scenarios Directly
Scenarios are async generators. `hud.eval()` drives them automatically, but you can test the grading logic directly:
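For instance, a hand-driven run might look like this. The scenario itself is hypothetical; `asend(None)` advances to the first `yield` (the prompt), and a second `asend()` delivers the answer and resumes into the grading code:

```python
import asyncio

# Hypothetical scenario: code before the first yield is setup, the yielded
# string is the prompt, and the value sent back in is graded afterwards.
async def my_scenario():
    expected = "42"                       # setup
    answer = yield "What is 6 * 7?"       # prompt handed to the agent
    yield {"reward": 1.0 if answer.strip() == expected else 0.0}

async def run():
    scenario = my_scenario()
    prompt = await scenario.asend(None)    # drive to the first yield
    answer = "42"                          # stand-in for the agent's reply
    result = await scenario.asend(answer)  # resume into the grading code
    return prompt, result

prompt, result = asyncio.run(run())
assert result == {"reward": 1.0}
```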
If the scenario grades correctly when driven by hand, `hud.eval()` will behave identically.
## Scenario MCP Protocol Mapping
Each scenario registers two MCP endpoints:

| Phase | MCP Type | Endpoint | What it does |
|---|---|---|---|
| Setup | Prompt | `get_prompt("{env}:{scenario}", args)` | Runs the code before the first `yield`, returns the prompt |
| Evaluate | Resource | `read_resource("{env}:{scenario}")` | Runs the code after the first `yield`, returns `{"reward": float}` |
## Useful Environment Properties
## Common Issues
- **`evaluate_tool: NULL` but using v5 scenarios** — v5 scenarios return rewards via `read_resource`, not `evaluate_tool`. Ensure your orchestrator calls `read_resource()` after the agent completes.
- **`TypeError` with complex args like `list[dict]`** — MCP passes all arguments as strings; the SDK deserializes them. Add logging to check `type(arg)` at scenario entry.
- **Scenario setup works but evaluate returns no reward** — `submit()` wasn’t called before `read_resource()`. Call `await env.submit(scenario_name, answer)` first.
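For the string-argument issue above, a small guard at scenario entry makes the behavior explicit. `coerce` is an illustrative helper, assuming complex arguments arrive as JSON-encoded strings:

```python
import json

# MCP transports arguments as strings, so a list[dict] arrives as JSON
# text. Normalize (or at least log) the type before using the value.
def coerce(arg):
    return json.loads(arg) if isinstance(arg, str) else arg

assert coerce('[{"id": 1}]') == [{"id": 1}]   # JSON text is decoded
assert coerce([{"id": 1}]) == [{"id": 1}]     # already-typed args pass through
```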