Good Environments
A good environment gives agents what they need to succeed—and gives you what you need to evaluate them.
Observable State
Agents need access to the right information. If they can’t see the data they need, they can’t complete the task. Design tools that expose useful state.
Deterministic Setup
Each eval should seed the state it needs. HUD handles container isolation—you handle making sure your scenario sets up the right data before the agent runs.
Isolated Execution
HUD sandboxes each eval—containers don’t share state. But if your environment connects to external services, think about stateful vs. stateless access.
Stateless services are fine. Multiple agents can hit the same read-only API without interference.
Stateful services need care. If 100 agents all hit the same database endpoint that modifies data, they’ll step on each other. Use per-eval instances, transaction isolation, or target different records.
Good Evals
An eval combines a prompt (the first yield) with grading logic (everything after). The prompt tells agents what to do—write short-to-medium-length instructions that ask for an unambiguous change you can verify.
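As an illustration of that shape, here is a minimal sketch assuming a generator-style eval: the harness pulls the prompt from the first yield, runs the agent, and sends the result back in. Every name and the CSV scenario are hypothetical stand-ins, not HUD's actual API.

```python
# Hypothetical sketch of the prompt/grade split; the generator protocol and
# all names here are illustrative assumptions, not HUD's actual API.
def add_total_eval():
    # The first yield is the prompt the agent sees.
    result = yield "In sheet1.csv, add a final row named 'Total' that sums column B."
    # Everything after the first yield is grading logic.
    yield 1.0 if "Total" in result else 0.0

def run_eval(make_eval, agent):
    """Stand-in harness: pull the prompt, run the agent, send the result back."""
    gen = make_eval()
    prompt = next(gen)               # advance to the first yield -> the prompt
    return gen.send(agent(prompt))   # resume with the agent's output -> the grade
```

With a stub agent, `run_eval(add_total_eval, lambda p: "Total,42")` returns 1.0, while an agent whose output never contains the requested row scores 0.0.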
Be Specific
Ambiguous prompts lead to ambiguous grading. Say exactly what you want.
Only Ask for Testable Things
If you can’t observe the result, you can’t grade it. Don’t ask an agent to “think about” something—ask it to do something you can verify.
Create Variations
Evals are easier to write when you have a specific failure mode in mind. If you’ve observed agents struggling with something, incorporate that into future evals. Create different versions with more or less explicit instructions—step-by-step guidance vs. high-level goals. Use variants to test these systematically. Variations make it easier to tune difficulty later.
Good Graders
The grading logic after the first yield determines the grade. Fair grading means useful signal.
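A minimal sketch of such grading logic, assuming the grader is handed the final file contents as a string; the scenario, checks, and weights are all illustrative. Partial credit distinguishes a near-miss from a no-op, which gives more useful signal than a single pass/fail bit.

```python
# Illustrative grader: award partial credit for partial progress. The checks
# and weights are assumptions for demonstration, not a HUD convention.
def grade_config_change(file_text: str) -> float:
    score = 0.0
    if "timeout" in file_text:
        score += 0.5   # the agent touched the right setting
    if "timeout = 30" in file_text:
        score += 0.5   # and set the exact value the prompt asked for
    return score
```

Here an agent that sets `timeout = 30` scores 1.0, one that sets the wrong value scores 0.5, and one that never edits the file scores 0.0.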