Local Testing
| Environment | How to connect in local_test.py |
|---|---|
| No Docker | from env import env |
| Docker | env.connect_url("http://localhost:8765/mcp") |
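For example, a minimal local_test.py might look like the sketch below; the import and the URL follow the table, while the body of main() is a placeholder for your own checks:

```python
# local_test.py — minimal sketch. `from env import env` and the connect URL come
# from the table above; everything inside main() is a placeholder.
import asyncio

from env import env  # no Docker: import the environment object directly

async def main():
    # If the environment is running in Docker instead, point at the local MCP server:
    # env.connect_url("http://localhost:8765/mcp")

    ...  # exercise tools / scenarios here

if __name__ == "__main__":
    asyncio.run(main())
```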
Variants
LLM outputs vary from run to run; ask the same question twice and you might get different quality answers. Variants let you test different configurations side-by-side.
Groups
Run each variant multiple times to see the distribution, not just one lucky or unlucky result.
The hud.eval manager parallelizes automatically. Total runs = len(tasks) × len(variant_combinations) × group.
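To make the formula concrete, here is the arithmetic spelled out; the task names and variant axes below are invented purely for illustration:

```python
# Run-count arithmetic only — no SDK calls. Task names and variant axes are made up.
from itertools import product

tasks = ["checkout", "search"]                 # 2 tasks
variants = {
    "model": ["model-a", "model-b"],           # 2 options
    "temperature": [0.0, 0.7],                 # 2 options
}
group = 4                                      # repetitions per variant combination

variant_combinations = list(product(*variants.values()))   # 2 * 2 = 4 combinations
total_runs = len(tasks) * len(variant_combinations) * group
print(total_runs)                              # 2 * 4 * 4 = 32 runs
```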
Mock Mode
env.mock() intercepts at the tool layer. Agents only see tools, so env.mock() is usually all you need for testing agent logic without hitting real services:
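A minimal sketch of what a mock-mode test can look like; only env.mock() comes from the text above, and the agent entry point is a placeholder:

```python
# Sketch: exercise agent logic without real services. Only env.mock() is from the
# docs above; run_my_agent is a placeholder for however you invoke your agent.
import asyncio

from env import env

async def test_agent_logic():
    env.mock()                           # tool calls are intercepted at the tool layer
    # result = await run_my_agent(env)   # the agent only sees tools, so its logic runs unchanged
    # assert result is not None

if __name__ == "__main__":
    asyncio.run(test_agent_logic())
```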
Testing Scenarios Directly
Scenarios are async generators. hud.eval() drives them automatically, but you can test the logic directly:
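The sketch below is a self-contained toy: a scenario written as an async generator and driven by hand the way hud.eval() would drive it. The two-yield shape mirrors the setup/evaluate mapping described later; how real scenarios receive the agent's answer may differ.

```python
# Self-contained toy scenario driven by hand. The exact way real scenarios receive
# the answer is an assumption here.
import asyncio

async def toy_scenario(expected: str):
    # Setup phase: everything before the first yield.
    prompt = f"Reply with the single word '{expected}'."
    answer = yield prompt                      # first yield hands back the prompt
    # Evaluate phase: everything after the first yield.
    yield {"reward": 1.0 if answer.strip() == expected else 0.0}

async def main():
    gen = toy_scenario("apple")
    prompt = await gen.asend(None)             # run setup, get the prompt
    print(prompt)
    result = await gen.asend("apple")          # feed an answer, run evaluation
    print(result)                              # {'reward': 1.0}

asyncio.run(main())
```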
If the generator works when driven by hand like this, hud.eval() will behave identically.
Hot-Reload
For Docker environments, hud dev -w path reloads Python on save:
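For example, watching a hypothetical src/ directory:

```bash
# Watch the environment source and reload Python inside the container on save.
hud dev -w src/
```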
Debugging Build Failures
hud build runs the exact same pipeline as New → Environment on hud.ai—so if it passes locally, it’ll work in production. If the build fails or the container crashes on startup, use hud debug:
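For example (flags and arguments are omitted here; check each command's --help for your version):

```bash
# Build the environment image locally (same pipeline as New → Environment on hud.ai).
hud build

# If the build fails or the container crashes on startup, debug it.
hud debug
```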
Scenario MCP Protocol Mapping
Understanding how scenarios map to MCP is crucial for debugging. Each scenario registers two MCP endpoints:

| Phase | MCP Type | Endpoint | What it does |
|---|---|---|---|
| Setup | Prompt | get_prompt("{env}:{scenario}", args) | Runs code before first yield, returns the prompt |
| Evaluate | Resource | read_resource("{env}:{scenario}") | Runs code after first yield, returns {"reward": float} |
Debug with raw MCP calls
If a scenario isn’t working, test each phase directly:
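The sketch below uses the MCP Python SDK against a locally running environment (URL as in the table at the top); my_env:checkout and its arguments are placeholders for your own environment and scenario, and client details may differ by SDK version.

```python
# Drive the two scenario phases with raw MCP calls.
import asyncio

from mcp import ClientSession
from mcp.client.streamable_http import streamablehttp_client

async def main():
    async with streamablehttp_client("http://localhost:8765/mcp") as (read, write, _):
        async with ClientSession(read, write) as session:
            await session.initialize()

            # Setup phase: runs scenario code up to the first yield, returns the prompt.
            prompt = await session.get_prompt("my_env:checkout", {"item": "apple"})
            print(prompt)

            # (The agent runs here and its answer is submitted, e.g. via env.submit(...).)

            # Evaluate phase: runs code after the first yield, returns {"reward": float}.
            result = await session.read_resource("my_env:checkout")
            print(result)

asyncio.run(main())
```

Common debugging scenarios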
Problem: evaluate_tool: NULL but using v5 scenarios
- Cause: v5 scenarios don’t use evaluate_tool; they return rewards via read_resource
- Fix: Ensure your orchestrator calls read_resource() after agent completion

Problem: TypeError when evaluating with complex args like list[dict]
- Cause: MCP passes all arguments as strings; the SDK deserializes them
- Debug: Add logging to check type(arg) at scenario entry

- Cause: submit() wasn’t called before read_resource()
- Fix: Call await env.submit(scenario_name, answer) first