Sandboxing

Agents need isolated state. You can’t point an agent at production — it’ll make real changes, hit real APIs, affect real users. These patterns keep things safe.

Database Isolation

In-memory SQLite — fastest, resets automatically:
import sqlite3
from pathlib import Path

db = sqlite3.connect(":memory:")

@env.scenario("update-order")
async def update_order(order_id: str):
    db.executescript(Path("fixtures/orders.sql").read_text())
    answer = yield f"Update order {order_id} to shipped"
    row = db.execute("SELECT status FROM orders WHERE id=?", (order_id,)).fetchone()
    yield 1.0 if row and row[0] == "shipped" else 0.0
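Note that the connection above is module-level, so writes persist across scenario runs in the same process. If you want a truly fresh database per run, a variation (a sketch, not an SDK feature) opens a new :memory: connection each time:

```python
import sqlite3

SCHEMA = """
CREATE TABLE orders (id TEXT PRIMARY KEY, status TEXT);
INSERT INTO orders VALUES ('o-1', 'pending');
"""

def fresh_db() -> sqlite3.Connection:
    # Every call returns an isolated in-memory database seeded from the fixture.
    conn = sqlite3.connect(":memory:")
    conn.executescript(SCHEMA)
    return conn

run1 = fresh_db()
run1.execute("UPDATE orders SET status='shipped' WHERE id='o-1'")

run2 = fresh_db()  # unaffected by run1's writes
```

Each connection owns its own in-memory database, so runs cannot observe each other's state.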
Transaction rollback — use your real DB, undo changes:
@env.scenario("process-refund")
async def process_refund(order_id: str):
    conn = await asyncpg.connect(DATABASE_URL)
    tx = conn.transaction()
    await tx.start()
    try:
        answer = yield f"Process refund for order {order_id}"
        # Grade inside the transaction: the agent's writes are still visible here
        row = await conn.fetchrow("SELECT status FROM orders WHERE id=$1", order_id)
        yield 1.0 if row and row["status"] == "refunded" else 0.0
    finally:
        await tx.rollback()
        await conn.close()
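The same rollback idea in a runnable stdlib sqlite3 sketch, so you can see the mechanics without a Postgres instance (real scenarios would use the asyncpg pattern above; table and values here are illustrative):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (id TEXT, status TEXT)")
conn.execute("INSERT INTO orders VALUES ('o-1', 'paid')")
conn.commit()  # committed baseline state

try:
    # the "agent" mutates real tables during the episode...
    conn.execute("UPDATE orders SET status='refunded' WHERE id='o-1'")
finally:
    conn.rollback()  # ...and every uncommitted change is undone afterwards

status = conn.execute("SELECT status FROM orders WHERE id='o-1'").fetchone()[0]
```

Because the baseline was committed and the episode's writes were not, the rollback restores exactly the pre-episode state.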
Fixture seeding — deterministic starting state:
await db.execute("TRUNCATE orders, users CASCADE")
await db.executemany("INSERT INTO users ...", fixtures["users"])
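The same idea as a runnable sqlite3 sketch (table and fixture names are illustrative): seeding wipes then repopulates, so it is idempotent and reruns always converge on identical state:

```python
import sqlite3

FIXTURES = {"users": [("u-1", "alice"), ("u-2", "bob")]}

def seed(conn: sqlite3.Connection) -> None:
    # Wipe then repopulate so every run starts from identical rows.
    conn.execute("DELETE FROM users")
    conn.executemany("INSERT INTO users (id, name) VALUES (?, ?)", FIXTURES["users"])
    conn.commit()

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE users (id TEXT PRIMARY KEY, name TEXT)")
seed(conn)
seed(conn)  # running twice leaves the same two rows
count = conn.execute("SELECT COUNT(*) FROM users").fetchone()[0]
```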

Mocking External Services

env.mock() intercepts at the tool layer. Agents only see tools, so this is usually all you need for testing agent logic without hitting real services:
env.mock()
env.mock_tool("send_email", {"status": "sent", "id": "mock-123"})
env.mock_tool("charge_card", {"success": True, "transaction_id": "tx-mock"})
Your agent code stays the same — toggle env.mock() for testing. For stateful mocking (tracking what happened for assertions):
class MockPaymentService:
    def __init__(self):
        self.charges = []
    
    async def charge(self, amount: int, card_token: str) -> dict:
        self.charges.append({"amount": amount, "token": card_token})
        return {"success": True, "id": f"ch-{len(self.charges)}"}

payments = MockPaymentService()

@env.scenario("checkout")
async def checkout(cart_total: int):
    _ = yield f"Complete checkout for ${cart_total}"
    yield 1.0 if any(c["amount"] == cart_total for c in payments.charges) else 0.0
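One caveat with stateful mocks: they accumulate across runs in the same process, so reset them between evaluations. A self-contained sketch (the reset pattern is ours, not an SDK feature):

```python
import asyncio

class MockPaymentService:
    def __init__(self):
        self.charges = []

    def reset(self):
        # Clear accumulated state so runs don't see each other's charges.
        self.charges.clear()

    async def charge(self, amount: int, card_token: str) -> dict:
        self.charges.append({"amount": amount, "token": card_token})
        return {"success": True, "id": f"ch-{len(self.charges)}"}

payments = MockPaymentService()

async def run_once(total: int) -> bool:
    payments.reset()  # fresh mock state per scenario run
    await payments.charge(total, "tok-1")
    return any(c["amount"] == total for c in payments.charges)

ok = asyncio.run(run_once(50))
```

Without the reset, a charge recorded by an earlier run could make a later run's grader pass spuriously.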

Testing Scenarios Directly

Scenarios are async generators. hud.eval() drives them automatically, but you can test the grading logic directly:
async def test():
    gen = checkout(50)
    prompt = await anext(gen)            # setup phase
    reward = await gen.asend("Success!") # evaluate phase
    assert reward == 1.0
If your scenario tests pass, hud.eval() will behave identically.

Scenario MCP Protocol Mapping

Each scenario registers two MCP endpoints:
Phase      MCP Type   Endpoint                                What it does
Setup      Prompt     get_prompt("{env}:{scenario}", args)    Runs code before the first yield, returns the prompt
Evaluate   Resource   read_resource("{env}:{scenario}")       Runs code after the first yield, returns {"reward": float}
If a scenario isn’t working, test each phase directly:
async with env:
    prompt_result = await env.get_prompt(
        "myenv:checkout", 
        {"product": "laptop", "user_id": "alice"}
    )
    print(f"Prompt: {prompt_result.messages[0].content}")
    
    await env.submit("checkout", answer="Order completed successfully")
    resource_result = await env.read_resource("myenv:checkout")
    print(f"Reward: {resource_result}")  # {"reward": 1.0}

Useful Environment Properties

env.is_parallelizable  # True if all connections are remote
env.connections        # Dict of connection names → connectors
env.is_connected       # True if in async context

await env.list_resources()  # MCP resources
await env.list_prompts()    # MCP prompts
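One use for is_parallelizable is gating fan-out. A sketch with a stand-in env object (run_one is a hypothetical wrapper around an evaluation call, not an SDK function):

```python
import asyncio
from types import SimpleNamespace

async def run_one(env, name: str) -> str:
    # Placeholder for a real evaluation call against env.
    return f"{name}: done"

async def run_all(env, scenarios):
    # Fan out only when every connection is remote; local connections
    # may share state, so run those serially.
    if env.is_parallelizable:
        return list(await asyncio.gather(*(run_one(env, s) for s in scenarios)))
    return [await run_one(env, s) for s in scenarios]

fake_env = SimpleNamespace(is_parallelizable=True)  # stand-in for a real env
results = asyncio.run(run_all(fake_env, ["checkout", "process-refund"]))
```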

Common Issues

evaluate_tool: NULL but using v5 scenarios — v5 scenarios return rewards via read_resource, not evaluate_tool. Ensure your orchestrator calls read_resource() after agent completion.

TypeError with complex args like list[dict] — MCP passes all arguments as strings; the SDK deserializes them. Add logging to check type(arg) at scenario entry.

Scenario setup works but evaluate returns no reward — submit() wasn't called before read_resource(). Call await env.submit(scenario_name, answer) first.
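For the list[dict] issue, a defensive coercion sketch (coerce_items is a hypothetical helper; normally the SDK deserializes for you, so this is only for debugging what actually arrived):

```python
import json

def coerce_items(items):
    # MCP may deliver structured args as a JSON string; normalize before use.
    if isinstance(items, str):
        items = json.loads(items)
    if not isinstance(items, list):
        raise TypeError(f"expected list, got {type(items).__name__}")
    return items

parsed = coerce_items('[{"sku": "a-1", "qty": 2}]')  # string form
same = coerce_items([{"sku": "a-1", "qty": 2}])      # already deserialized
```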