# Sandboxing
Agents need isolated state. You can’t point an agent at production — it’ll make real changes, hit real APIs, and affect real users. These patterns keep things safe.

## Database Isolation
In-memory SQLite — fastest, resets automatically:
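A minimal sketch of the pattern using the standard `sqlite3` module (the table and helper names are illustrative, not part of the SDK):

```python
import sqlite3

# A fresh in-memory database per test: nothing touches disk, and all
# state vanishes when the connection closes, so every run starts clean.
def make_sandbox_db() -> sqlite3.Connection:
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE orders (id INTEGER PRIMARY KEY, status TEXT)")
    return conn

conn = make_sandbox_db()
conn.execute("INSERT INTO orders (status) VALUES ('pending')")
assert conn.execute("SELECT COUNT(*) FROM orders").fetchone()[0] == 1
```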
## Mocking External Services

env.mock() intercepts at the tool layer. Agents only see tools, so this is usually all you need to test agent logic without hitting real services:
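To make the idea concrete, here is a toy stand-in (the `FakeEnv` class is hypothetical, not the SDK's implementation). It shows why tool-layer interception is enough: the agent resolves tools by name, so swapping the registry entry fakes the whole service.

```python
# Hypothetical sketch of tool-layer mocking. The agent never calls the
# external API directly; it only calls tools by name, so replacing the
# registry entry is all the isolation needed.
class FakeEnv:
    def __init__(self):
        self.tools = {}

    def mock(self, tool_name, response):
        # Register a canned response under the tool's name.
        self.tools[tool_name] = lambda **kwargs: response

    def call_tool(self, tool_name, **kwargs):
        return self.tools[tool_name](**kwargs)

env = FakeEnv()
env.mock("send_email", {"status": "sent"})
assert env.call_tool("send_email", to="user@example.com") == {"status": "sent"}
```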
For stateful mocking (tracking what happened for assertions):
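One common shape is a callable that records its invocations. This `RecordingMock` is an illustrative helper, not an SDK class:

```python
class RecordingMock:
    """Returns a canned response and keeps a call log for later assertions."""

    def __init__(self, response):
        self.response = response
        self.calls = []

    def __call__(self, **kwargs):
        self.calls.append(kwargs)   # remember every invocation
        return self.response

send_email = RecordingMock({"status": "sent"})
send_email(to="user@example.com", subject="hi")

# Assert on what the agent actually did, not just the final answer.
assert send_email.calls == [{"to": "user@example.com", "subject": "hi"}]
```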
## Testing Scenarios Directly
Scenarios are async generators. `hud.eval()` drives them automatically, but you can test the grading logic directly:
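For instance, a hand-driven run might look like this. The scenario itself is hypothetical; `asend(None)` advances to the first `yield` (the prompt), and a second `asend()` delivers the answer and resumes into the grading code:

```python
import asyncio

# Hypothetical scenario: code before the first yield is setup, the yielded
# string is the prompt, and the value sent back in is graded afterwards.
async def my_scenario():
    expected = "42"                       # setup
    answer = yield "What is 6 * 7?"       # prompt handed to the agent
    yield {"reward": 1.0 if answer.strip() == expected else 0.0}

async def run():
    scenario = my_scenario()
    prompt = await scenario.asend(None)    # drive to the first yield
    answer = "42"                          # stand-in for the agent's reply
    result = await scenario.asend(answer)  # resume into the grading code
    return prompt, result

prompt, result = asyncio.run(run())
assert result == {"reward": 1.0}
```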
If the scenario grades correctly when driven by hand, `hud.eval()` will behave identically.
## Scenario MCP Protocol Mapping
Each scenario registers two MCP endpoints:

| Phase | MCP Type | Endpoint | What it does |
|---|---|---|---|
| Setup | Prompt | `get_prompt("{env}:{scenario}", args)` | Runs the code before the first `yield`, returns the prompt |
| Evaluate | Resource | `read_resource("{env}:{scenario}")` | Runs the code after the first `yield`, returns `{"reward": float}` |
## Useful Environment Properties
## Common Issues
- **`evaluate_tool: NULL` but using v5 scenarios** — v5 scenarios return rewards via `read_resource`, not `evaluate_tool`. Ensure your orchestrator calls `read_resource()` after the agent completes.
- **`TypeError` with complex args like `list[dict]`** — MCP passes all arguments as strings; the SDK deserializes them. Add logging to check `type(arg)` at scenario entry.
- **Scenario setup works but evaluate returns no reward** — `submit()` wasn’t called before `read_resource()`. Call `await env.submit(scenario_name, answer)` first.
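For the string-argument issue above, a small guard at scenario entry makes the behavior explicit. `coerce` is an illustrative helper, assuming complex arguments arrive as JSON-encoded strings:

```python
import json

# MCP transports arguments as strings, so a list[dict] arrives as JSON
# text. Normalize (or at least log) the type before using the value.
def coerce(arg):
    return json.loads(arg) if isinstance(arg, str) else arg

assert coerce('[{"id": 1}]') == [{"id": 1}]   # JSON text is decoded
assert coerce([{"id": 1}]) == [{"id": 1}]     # already-typed args pass through
```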