Documentation Index
Fetch the complete documentation index at: https://docs.hud.ai/llms.txt
Use this file to discover all available pages before exploring further.
Sandboxing
Agents need isolated state. You can’t point an agent at production — it’ll make real changes, hit real APIs, affect real users. These patterns keep things safe.
Database Isolation
In-memory SQLite — fastest, resets automatically:
import sqlite3
from pathlib import Path

db = sqlite3.connect(":memory:")

@env.scenario("update-order")
async def update_order(order_id: str):
    # Setup: seed a fresh schema and fixtures before handing the prompt to the agent
    db.executescript(Path("fixtures/orders.sql").read_text())
    answer = yield f"Update order {order_id} to shipped"
    # Evaluate: check that the agent actually updated the row
    row = db.execute("SELECT status FROM orders WHERE id=?", (order_id,)).fetchone()
    yield 1.0 if row and row[0] == "shipped" else 0.0
Transaction rollback — use your real DB, undo changes:
@env.scenario("process-refund")
async def process_refund(order_id: str):
conn = await asyncpg.connect(DATABASE_URL)
tx = conn.transaction()
await tx.start()
try:
answer = yield f"Process refund for order {order_id}"
yield reward
finally:
await tx.rollback()
await conn.close()
Fixture seeding — deterministic starting state:
await db.execute("TRUNCATE orders, users CASCADE")
await db.executemany("INSERT INTO users ...", fixtures["users"])
Mocking External Services
env.mock() intercepts at the tool layer. Agents only see tools, so this is usually all you need for testing agent logic without hitting real services:
env.mock()
env.mock_tool("send_email", {"status": "sent", "id": "mock-123"})
env.mock_tool("charge_card", {"success": True, "transaction_id": "tx-mock"})
Your agent code stays the same — toggle env.mock() for testing.
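A minimal sketch of that toggle, gated on an environment variable (USE_MOCKS is an arbitrary name, not an SDK convention):

import os

if os.environ.get("USE_MOCKS") == "1":
    env.mock()  # canned tool responses for test runs
    env.mock_tool("charge_card", {"success": True, "transaction_id": "tx-mock"})
# otherwise the same agent code hits the real tools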
For stateful mocking (tracking what happened for assertions):
class MockPaymentService:
    def __init__(self):
        self.charges = []

    async def charge(self, amount: int, card_token: str) -> dict:
        self.charges.append({"amount": amount, "token": card_token})
        return {"success": True, "id": f"ch-{len(self.charges)}"}

payments = MockPaymentService()

@env.scenario("checkout")
async def checkout(cart_total: int):
    _ = yield f"Complete checkout for ${cart_total}"
    # Grade by inspecting what the mock recorded during the run
    yield 1.0 if any(c["amount"] == cart_total for c in payments.charges) else 0.0
Testing Scenarios Directly
Scenarios are async generators. hud.eval() drives them automatically, but you can test the grading logic directly:
async def test():
    gen = checkout(50)
    prompt = await anext(gen)               # setup phase: returns the prompt
    await payments.charge(50, "tok_test")   # stand in for the agent's tool call so grading finds a matching charge
    reward = await gen.asend("Success!")    # evaluate phase: returns the reward
    assert reward == 1.0
If your scenario tests pass, hud.eval() will behave identically.
Scenario MCP Protocol Mapping
Each scenario registers two MCP endpoints:
| Phase | MCP Type | Endpoint | What it does |
|---|---|---|---|
| Setup | Prompt | get_prompt("{env}:{scenario}", args) | Runs code before first yield, returns the prompt |
| Evaluate | Resource | read_resource("{env}:{scenario}") | Runs code after first yield, returns {"reward": float} |
If a scenario isn’t working, test each phase directly:
async with env:
    prompt_result = await env.get_prompt(
        "myenv:checkout",
        {"product": "laptop", "user_id": "alice"}
    )
    print(f"Prompt: {prompt_result.messages[0].content}")

    await env.submit("checkout", answer="Order completed successfully")
    resource_result = await env.read_resource("myenv:checkout")
    print(f"Reward: {resource_result}")  # {"reward": 1.0}
Useful Environment Properties
env.is_parallelizable # True if all connections are remote
env.connections # Dict of connection names → connectors
env.is_connected # True if in async context
await env.list_resources() # MCP resources
await env.list_prompts() # MCP prompts
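A quick sketch of checking these at runtime (the prints are only illustrative):

async with env:
    assert env.is_connected                 # True inside the async context
    print(list(env.connections))            # connection names
    if env.is_parallelizable:
        print("all connections are remote")
    print(await env.list_prompts())         # one prompt per scenario
    print(await env.list_resources())       # one resource per scenario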
Common Issues
evaluate_tool: NULL but using v5 scenarios — v5 scenarios return rewards via read_resource, not evaluate_tool. Ensure your orchestrator calls read_resource() after agent completion.
TypeError with complex args like list[dict] — MCP passes all arguments as strings; the SDK deserializes them. Add logging to check type(arg) at scenario entry (see the sketch below).
Scenario setup works but evaluate returns no reward — submit() wasn’t called before read_resource(). Call await env.submit(scenario_name, answer) first.
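For the type issue above, a minimal sketch of logging at scenario entry (the scenario name and items parameter are made up for illustration):

@env.scenario("bulk-update")
async def bulk_update(items: list[dict]):
    # Confirm the SDK deserialized the argument instead of leaving a raw string
    print(type(items), items[0] if items else None)
    answer = yield f"Update {len(items)} orders"
    yield 1.0  # placeholder grading for this debugging sketch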