The environment
We give the agent shell access to a directory of logs and traces, then ask for a diagnosis. The agent must read across files — no single artifact contains the answer.env.py
answer = yield ...). The judge scores it against weighted criteria via the HUD gateway, no extra install needed.
Why this is a good training task
It satisfies the signal principles:- Multi-channel integration — the cause (a removed index) is in
deploy.log, but the symptom path runs throughdb.logandapi.log. No single file is decisive, so the agent must integrate. - Multi-step — the agent reads several files, forms a hypothesis, and checks it against the evidence.
- Substance over surface — the judge credits a correct, evidence-cited diagnosis, not keywords. A generic “it’s a database issue” with no evidence scores low.
- No leakage — no file names the root cause as “the bug”; the agent has to derive it.