Why Contrast Matters
Most RL training frameworks — including GRPO (Group Relative Policy Optimization) — use a group-based approach: run the same task multiple times, then compare the outcomes relative to each other. The model learns by asking “what did the successful runs do differently from the failed ones?” Training doesn’t optimize absolute reward values. It needs contrast — some runs succeed, some fail — on the same task. Typically you run each task 4–16 times (your `--group-size`).
No signal — all runs score roughly the same. Nothing to compare, no learning.
Strong signal — mix of outcomes. The model learns from the differences.
If every run scores the same — all 0.0 or all 1.0 — there’s no contrast and nothing to learn from. The sweet spot is tasks where the model succeeds sometimes but not always. Target 5–75% average success rate across your taskset for useful training signal.
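To see why mechanically, here is a minimal sketch of GRPO-style group-relative normalization (a simplified illustration, not the framework's implementation): a mixed group yields nonzero advantages to learn from, while a uniform group yields all zeros.

```python
def group_advantages(rewards, eps=1e-8):
    """Center each reward against its group mean and scale by the
    group's standard deviation (GRPO-style normalization)."""
    n = len(rewards)
    mean = sum(rewards) / n
    var = sum((r - mean) ** 2 for r in rewards) / n
    std = var ** 0.5
    return [(r - mean) / (std + eps) for r in rewards]

# Mixed outcomes on the same task -> nonzero advantages to learn from.
mixed = group_advantages([1.0, 0.0, 1.0, 0.0])

# Uniform outcomes (all succeeded) -> every advantage is 0.0: no signal.
uniform = group_advantages([1.0, 1.0, 1.0, 1.0])
```

An all-failure group ([0.0, 0.0, 0.0, 0.0]) degenerates the same way, which is why the 5–75% success-rate band matters.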
What Makes a Good Task
| Principle | Avoid | Better |
|---|---|---|
| Observable — tools return data the agent and grader can act on | Tool returns "Done" with no details | Tool returns {"id": uid, "rows_updated": 3} |
| Deterministic — each run starts from the same known state | Scenario assumes data already exists | Scenario seeds DB with fixtures before the first yield |
| Isolated — parallel runs can’t interfere with each other | 100 agents write to the same shared database | Each eval gets its own instance or uses transaction rollback |
| Specific — only one way to solve it, or grader accounts for all valid approaches | "Fix the data issue" | "Mark order #1234 as shipped" (grader checks status field regardless of method) |
| Verifiable — the result produces a checkable state change | "Consider the best approach to optimize" | "Navigate to the checkout page" (grader checks URL), "Add an index on email" (grader runs EXPLAIN) |
| Varied — scenario params let you calibrate difficulty without rewriting the scenario | Hardcoded prompt with one difficulty level | Scenario takes a detail_level param: "step-by-step" vs "high-level" |
| Partial credit — grader breaks the task into 2–4 sub-checks | Binary 0.0 or 1.0 with no breakdown | Weighted sub-checks: cart added (0.3) + order completed (0.7) |
Good Environments
Good environments expose observable state, seed deterministic starting conditions, and isolate each evaluation run.

Observable State
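A sketch of the contrast, using a hypothetical `update_order` tool (the names and schema here are illustrative, not HUD's API):

```python
# Opaque: neither the agent nor the grader learns anything from this.
def update_order_opaque(db, order_id, status):
    db[order_id]["status"] = status
    return "Done"

# Observable: the return value carries state the agent can act on
# and the grader can check.
def update_order(db, order_id, status):
    db[order_id]["status"] = status
    return {"id": order_id, "status": status, "rows_updated": 1}
```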
Agents need to see what happened. If they can’t observe the data, they can’t complete the task — and if you can’t observe it, you can’t grade it. Design tools that return actionable data.

Deterministic Setup
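One way to sketch the pattern is a scenario written as a generator that seeds fixtures before its first yield (the fixture data and generator shape are illustrative assumptions, not HUD's exact API):

```python
FIXTURE_ORDERS = [
    {"id": "1234", "status": "pending"},
    {"id": "5678", "status": "pending"},
]

def scenario(db):
    # Seed a known starting state on every run, so each eval begins
    # from the same data regardless of what earlier runs did.
    db.clear()
    for order in FIXTURE_ORDERS:
        db[order["id"]] = dict(order)

    # First yield: the prompt the agent sees.
    yield "Mark order #1234 as shipped."

    # Everything after the first yield is grading logic.
    yield 1.0 if db["1234"]["status"] == "shipped" else 0.0
```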
Each eval should seed the state it needs. HUD handles container isolation — you handle making sure your scenario sets up the right data before the agent runs.

Isolated Execution
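As a sketch of the transaction-rollback option, using an in-memory SQLite database (the schema and grading query are illustrative; for true parallelism you would still want one connection or instance per eval):

```python
import sqlite3

# Shared, seeded database (autocommit mode so we control transactions).
conn = sqlite3.connect(":memory:", isolation_level=None)
conn.execute("CREATE TABLE orders (id TEXT PRIMARY KEY, status TEXT)")
conn.execute("INSERT INTO orders VALUES ('1234', 'pending')")

def run_eval_isolated(agent_sql):
    """Run one eval's writes in a transaction, grade, then roll back,
    so repeated evals never see each other's changes."""
    conn.execute("BEGIN")
    try:
        conn.execute(agent_sql)
        # Grade against the modified state while still inside the txn.
        status = conn.execute(
            "SELECT status FROM orders WHERE id = '1234'"
        ).fetchone()[0]
        return 1.0 if status == "shipped" else 0.0
    finally:
        conn.rollback()  # leave the shared state untouched
```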
HUD sandboxes each eval — containers don’t share state. But if your environment connects to external services, think about stateful vs stateless. Stateless services are fine: multiple agents can hit the same read-only API without interference. Stateful services need care: if 100 agents all hit the same database endpoint that modifies data, they’ll step on each other. Use per-eval instances, transaction isolation, or target different records. See Advanced Patterns for sandboxing techniques.

Good Evals
An eval combines a prompt (the first `yield`) with grading logic (everything after). The prompt tells agents what to do — write short-to-medium length instructions that ask for an unambiguous change you can verify.
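The two halves can be sketched as a generator (an illustrative shape, not HUD's exact API):

```python
def checkout_eval(state):
    # Prompt: the first yield is what the agent sees.
    yield "Add the blue T-shirt to the cart and complete checkout."

    # Grading: everything after runs once the agent is done,
    # against observable state.
    yield 1.0 if state.get("order_completed") else 0.0
```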
Be Specific
Ambiguous prompts lead to ambiguous grading — and are the most common source of false positives in QA. Say exactly what you want.

Only Ask for Verifiable Things
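For instance, “Navigate to the checkout page” is gradable because it leaves a checkable trace. The helper below is hypothetical and assumes the environment exposes the agent's final URL:

```python
def grade_navigation(final_url):
    """The instruction produces a checkable state change:
    the URL the agent ended on."""
    return 1.0 if final_url.rstrip("/").endswith("/checkout") else 0.0
```

An instruction like “consider the best approach” leaves no such trace, so there is nothing a grader like this could inspect.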
If you can’t observe the result, you can’t grade it. Don’t ask an agent to “think about” something — ask it to do something that produces a checkable state change.

Create Variations
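A sketch of one scenario serving two difficulty levels through a parameter (the `detail_level` values mirror the table above; prompts and seed data are illustrative):

```python
PROMPTS = {
    "step-by-step": (
        "Open the Orders page, find order #1234, set its status "
        "dropdown to 'Shipped', and click Save."
    ),
    "high-level": "Mark order #1234 as shipped.",
}

def make_scenario(detail_level):
    def scenario(db):
        db["1234"] = {"status": "pending"}  # same seed at every level
        yield PROMPTS[detail_level]         # only the prompt varies
        yield 1.0 if db["1234"]["status"] == "shipped" else 0.0
    return scenario
```

The grading logic stays identical across variants, so differences in score reflect difficulty, not grader drift.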
Create different versions of the same task with more or less explicit instructions — step-by-step guidance vs. high-level goals. This gives you natural difficulty range across the taskset, which directly produces the contrast training needs. If you’ve observed agents struggling with specific failure modes, incorporate those into new tasks. Failure Analysis QA workflows can help identify common failure categories to target.

Good Graders
The grading logic after the first `yield` determines the score. Fair grading means useful signal — unfair grading means the model learns the wrong thing.
Match the Prompt
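One way to grade exactly as strictly as a prompt like “create a document with a Japanese car brand” implies is to check membership in the full set of valid answers (the brand list below is illustrative, not exhaustive):

```python
JAPANESE_BRANDS = {"toyota", "honda", "nissan", "mazda", "subaru",
                   "mitsubishi", "suzuki", "lexus", "isuzu", "daihatsu"}

def grade_document(text):
    """Accept any Japanese car brand, not just 'Toyota'.
    An empty or off-topic document still scores 0.0."""
    words = set(text.lower().split())
    return 1.0 if words & JAPANESE_BRANDS else 0.0
```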
If the prompt says “create a document with a Japanese car brand”, check for any Japanese car brand — not just “Toyota”. But don’t accept any document either. Grade exactly as strict as the prompt implies.

Use Partial Credit
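The cart-and-checkout example can be sketched as weighted sub-checks (the weights and state keys are illustrative):

```python
def grade_checkout(state):
    """Weighted sub-checks: partial progress earns partial credit."""
    checks = [
        (0.3, state.get("item_in_cart", False)),     # added to cart
        (0.7, state.get("order_completed", False)),  # completed checkout
    ]
    return sum(weight for weight, passed in checks if passed)
```

An agent that only reaches the cart now scores 0.3 instead of 0.0, which gives training a gradient between total failure and full success.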
Partial grades give training finer-grained signal. Did the agent add to cart but not checkout? That’s a 0.3, not a 0.0. Break complex grading into weighted sub-checks.

Sanity Check
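At its simplest, that check is two assertions against known snapshots (the grader and snapshots here are illustrative):

```python
def grade_shipping(state):
    return 1.0 if state.get("status") == "shipped" else 0.0

# Known state snapshots for the two cases every grader must handle.
UNCHANGED = {"status": "pending"}   # agent did nothing
COMPLETED = {"status": "shipped"}   # agent did exactly what was asked

assert grade_shipping(UNCHANGED) == 0.0  # a no-op must earn nothing
assert grade_shipping(COMPLETED) == 1.0  # completion must earn full credit
```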
At minimum, verify two cases: unchanged state → 0.0, correct completion → 1.0. For grading logic you’ll reuse across many evals, write unit tests. Load a known state snapshot, verify the grade matches what you expect.

What’s Next
Platform Models
Model training and checkpoints
QA Workflows
Automated trace and task analysis
Publishing Leaderboards
Make your benchmarks public
Advanced Patterns
Sandboxing, mocking, and complex environment patterns