BashGrader that scores by running the test suite.
The environment
The workspace gives the agent a sandboxed shell and files: the env starts it and publishes the shell capability when it serves. We seed a buggy module and a test in @env.initialize, then declare the task; the grader runs pytest and scores by exit code.
One design point matters here: the grader runs an authoritative copy of the test that lives outside the agent’s workspace. The agent gets its own copy to read and run, but if the grader re-ran that editable copy, the cheapest path to a passing pytest would be weakening or deleting the test — classic reward hacking.
env.py
The task ignores the value of answer = yield: the deliverable is the state of the workspace, not a text answer.
To start from an existing repo instead of seeding files inline, write it into the workspace root in @env.initialize, or pass mounts= (see Capabilities).
Run it
Point a coding agent at the environment. claude opens the ssh capability, edits calc.py, and the grader re-runs the test:
Instead of the claude CLI driving the shell over SSH, use the ClaudeSDKAgent in code:
run.py
Read the trace
Every step (the shell commands, the edit, the test run) is on the trace at hud.ai. A reward of 1.0 means pytest exited 0; 0.0 means the test still fails.
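When a reward of 0.0 shows up, the pytest exit code in the trace can help tell a genuinely failing test from a run that never executed any tests. The sketch below assumes the grader maps any nonzero exit code to 0.0, as "scores by exit code" suggests; `reward_from_exit_code` and `explain` are hypothetical helpers, and the table lists pytest's documented exit codes as plain integers.

```python
# pytest's documented exit codes, as plain ints so the sketch has no dependencies.
PYTEST_EXIT_CODES = {
    0: "all tests passed",
    1: "some tests failed",
    2: "test execution interrupted",
    3: "internal error",
    4: "command-line usage error",
    5: "no tests collected",
}


def reward_from_exit_code(code: int) -> float:
    """Assumed mapping: only exit code 0 earns the full reward."""
    return 1.0 if code == 0 else 0.0


def explain(code: int) -> str:
    """Pair the reward with a human-readable reason for the exit code."""
    reason = PYTEST_EXIT_CODES.get(code, "unknown exit code")
    return f"reward={reward_from_exit_code(code)} ({reason})"
```

Under this mapping, a run that collects no tests exits with code 5 and scores 0.0, the same as a failing test.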
Make it a dataset
Parameterize the task definition and create concrete tasks for a spread of bugs:
tasks.py
BashGrader needs bash, so on native Windows it scores 0.0 — grade from macOS/Linux, WSL, or a built image.
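As a rough illustration of parameterizing one task definition over a spread of bugs, the sketch below generates a buggy module and a matching test per case. It is plain Python and does not use the SDK: `BugTask`, `make_task`, `build_tasks`, and the prompt wording are all hypothetical, and the `calc.py` / `test_calc.py` file names are assumed for the example.

```python
from dataclasses import dataclass
from typing import Dict, List

# Templates for the seeded module and its test (illustrative only).
MODULE_TEMPLATE = "def {func}(a, b):\n    return a {op} b\n"
TEST_TEMPLATE = (
    "from calc import {func}\n\n"
    "def test_{func}():\n"
    "    assert {func}({a}, {b}) == {expected}\n"
)


@dataclass(frozen=True)
class BugTask:
    name: str
    prompt: str
    files: Dict[str, str]  # relative path -> content to seed into the workspace


def make_task(func: str, wrong_op: str, a: int, b: int, expected: int) -> BugTask:
    """Build one concrete task: a module with the wrong operator plus its test."""
    return BugTask(
        name=f"fix-{func}",
        prompt=f"The test for {func} in test_calc.py fails. Fix calc.py.",
        files={
            "calc.py": MODULE_TEMPLATE.format(func=func, op=wrong_op),
            "test_calc.py": TEST_TEMPLATE.format(
                func=func, a=a, b=b, expected=expected
            ),
        },
    )


def build_tasks() -> List[BugTask]:
    """One task per bug; inputs are chosen so the wrong operator fails the test."""
    return [
        make_task("add", "-", 2, 3, 5),
        make_task("sub", "+", 5, 3, 2),
        make_task("mul", "+", 2, 3, 6),
    ]
```

Each `BugTask` carries everything needed to seed a workspace, so the same environment and grader can be reused across every case.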