The HUD SDK provides the `LegacyTask` class for defining agent objectives and dataset utilities for managing task collections.

`LegacyTask` is deprecated. For new code, use `env("scenario_name", **args)` to create `Task` objects. See Environments for the recommended approach.
## LegacyTask Class
```python
from hud.datasets import LegacyTask
```
Pydantic model that defines an agent’s objective, setup, and evaluation criteria.
Fields:

| Field | Type | Description | Default |
|---|---|---|---|
| `id` | `str \| None` | Unique identifier (UUID recommended) | `None` |
| `prompt` | `str` | Task instruction for the agent | Required |
| `mcp_config` | `dict[str, Any]` | MCP server configuration | Required |
| `setup_tool` | `MCPToolCall \| list[MCPToolCall] \| None` | Tool(s) to prepare the environment | `None` |
| `evaluate_tool` | `MCPToolCall \| list[MCPToolCall] \| None` | Tool(s) to score performance | `None` |
| `agent_config` | `BaseAgentConfig \| dict[str, Any] \| None` | Agent configuration | `None` |
| `metadata` | `dict[str, Any]` | Extra task metadata | `{}` |
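For tasks that need environment preparation and scoring, `setup_tool` and `evaluate_tool` accept one or more `MCPToolCall` objects. Here is a minimal sketch; the `MCPToolCall` import path, tool names, and arguments are assumptions for illustration, not a fixed API:

```python
from hud.datasets import LegacyTask
from hud.types import MCPToolCall  # assumed import path

task = LegacyTask(
    prompt="Archive all unread emails",
    mcp_config={
        "browser": {
            "url": "${HUD_MCP_URL:https://mcp.hud.ai/v3/mcp}",
            "headers": {"Authorization": "Bearer ${HUD_API_KEY}"},
        }
    },
    # Hypothetical tool calls: names and arguments depend on your environment.
    setup_tool=MCPToolCall(name="setup", arguments={"scenario": "inbox_with_unread"}),
    evaluate_tool=MCPToolCall(name="evaluate", arguments={"check": "all_archived"}),
)
```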
### Environment Variable Substitution
The `mcp_config` field automatically resolves environment variables using `${VAR_NAME}` syntax:
```python
task = LegacyTask(
    prompt="Navigate to the dashboard",
    mcp_config={
        "browser": {
            "url": "${HUD_MCP_URL:https://mcp.hud.ai/v3/mcp}",
            "headers": {
                "Authorization": "Bearer ${HUD_API_KEY}",
                "Mcp-Image": "hudpython/hud-browser:latest"
            }
        }
    }
)
```
Variables are resolved when a `LegacyTask` is created from a dict; this is why datasets should store raw dictionaries rather than resolved objects.
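A minimal sketch of that round trip (the placeholder key value is illustrative; in practice `HUD_API_KEY` comes from your shell):

```python
import os
from hud.datasets import LegacyTask

os.environ["HUD_API_KEY"] = "sk-test"  # illustrative placeholder

# The dataset stores the raw template...
raw = {
    "prompt": "Navigate to the dashboard",
    "mcp_config": {
        "browser": {"headers": {"Authorization": "Bearer ${HUD_API_KEY}"}}
    },
}

# ...and substitution happens when the model is constructed from the dict.
task = LegacyTask(**raw)
print(task.mcp_config["browser"]["headers"]["Authorization"])  # Bearer sk-test
```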
## Running Tasks

### run_single_task
Execute a single task with tracing. This is the core execution primitive.
```python
from hud.datasets import run_single_task
from hud.types import AgentType

result = await run_single_task(
    task=task,
    agent_type=AgentType.CLAUDE,
    agent_params={"checkpoint_name": "claude-sonnet-4-5"},
    max_steps=20,
    trace_name="My Task",
)

print(f"Reward: {result.reward}")
```
Parameters:

| Parameter | Type | Description | Default |
|---|---|---|---|
| `task` | `Task` | Task to execute | Required |
| `agent_type` | `AgentType` | Agent type to use | Required |
| `agent_params` | `dict` | Parameters for `agent.create()` | `None` |
| `max_steps` | `int` | Maximum steps | `10` |
| `job_id` | `str` | Job ID for telemetry | `None` |
| `task_id` | `str` | Task ID for telemetry | `None` |
| `group_id` | `str` | Group ID for variance runs | `None` |
| `trace_name` | `str` | Name for the trace | `None` |
| `metadata` | `dict` | Additional trace metadata | `None` |
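The telemetry parameters matter when you drive several single-task runs yourself rather than through `run_tasks`. A sketch of that pattern (the UUID-based `job_id` and the loop are illustrative, not a required convention):

```python
import uuid

# Illustrative: group hand-rolled single-task runs under one job ID
# so their traces appear together in telemetry.
job_id = str(uuid.uuid4())

for task in tasks:  # assumes `tasks` is a list of Task objects with ids set
    result = await run_single_task(
        task=task,
        agent_type=AgentType.CLAUDE,
        max_steps=20,
        job_id=job_id,    # shared across the batch
        task_id=task.id,  # per-task telemetry ID
    )
```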
### run_tasks
Run multiple tasks with automatic job tracking. This is the primary evaluation function.
```python
from hud.datasets import run_tasks
from hud.types import AgentType

results = await run_tasks(
    tasks=tasks,
    agent_type=AgentType.CLAUDE,
    agent_params={"checkpoint_name": "claude-sonnet-4-5"},
    name="SheetBench Evaluation",
    max_concurrent=50,
    max_steps=50,
)
```
Parameters:

| Parameter | Type | Description | Default |
|---|---|---|---|
| `tasks` | `list[Task]` | Tasks to execute | Required |
| `agent_type` | `AgentType` | Agent type to use | Required |
| `agent_params` | `dict` | Parameters for `agent.create()` | `None` |
| `name` | `str` | Job name for tracking | `"Evaluation"` |
| `max_concurrent` | `int` | Maximum concurrent tasks | `30` |
| `metadata` | `dict` | Job metadata | `None` |
| `max_steps` | `int` | Max steps per task | `10` |
| `group_size` | `int` | Runs per task (variance estimation) | `1` |
| `remote` | `bool` | Submit to HUD platform | `False` |
Returns:

- If `remote=True`: empty list (fire-and-forget)
- If `group_size=1`: list of `Trace` results
- If `group_size>1`: list of statistics dicts
Examples:

```python
from hud.datasets import load_tasks, run_tasks
from hud.types import AgentType

# Run filtered tasks locally
all_tasks = load_tasks("hud-evals/SheetBench-50")
selected = [t for t in all_tasks if "login" in t.prompt.lower()]
results = await run_tasks(selected, AgentType.CLAUDE)

# Run with variance estimation
stats = await run_tasks(tasks, AgentType.CLAUDE, group_size=3)

# Submit for remote execution
await run_tasks(tasks, AgentType.CLAUDE, remote=True)
```
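For a quick aggregate from a local run, a minimal sketch (assuming, as with `run_single_task`'s result above, that each returned `Trace` exposes a `reward` attribute):

```python
# Sketch: summarize rewards from a local (group_size=1) run.
results = await run_tasks(tasks, AgentType.CLAUDE)
rewards = [r.reward for r in results]
print(f"Mean reward over {len(rewards)} tasks: {sum(rewards) / len(rewards):.2f}")
```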
## Remote Execution
Submit tasks to the HUD platform for execution in the cloud:
```python
from hud.datasets import run_tasks
from hud.types import AgentType

# Submit and return immediately
await run_tasks(
    tasks=tasks,
    agent_type=AgentType.CLAUDE,
    agent_params={"checkpoint_name": "claude-sonnet-4-5"},
    name="Remote Evaluation",
    remote=True,
)

# Monitor at https://hud.so/jobs/{job_id}
```
## Cancellation
Cancel running remote jobs:
```python
from hud.datasets.utils import cancel_job, cancel_task, cancel_all_jobs

# Cancel specific job
await cancel_job(job_id)

# Cancel specific task
await cancel_task(job_id, task_id)

# Cancel ALL your active jobs (panic button)
await cancel_all_jobs()
```
Or use the CLI:

```bash
hud cancel --job <job_id>
hud cancel --task <job_id> <task_id>
hud cancel --all
```
## Saving Tasks
```python
from hud.datasets import save_tasks

# IMPORTANT: Pass dicts, not Task objects!
task_dicts = [task.model_dump() for task in tasks]

save_tasks(
    tasks=task_dicts,
    repo_id="my-tasks",
    private=False,
)
```
Always pass dictionaries to preserve environment variable templates: `Task` objects have already resolved `${VAR}` placeholders to actual values!
## Agent Config Options

The `agent_config` field on tasks supports:
| Option | Type | Description |
|---|---|---|
| `system_prompt` | `str` | Appended to the agent's default system prompt |
| `allowed_tools` | `list[str]` | Tools the agent can use |
| `disallowed_tools` | `list[str]` | Tools to hide from the agent |
| `append_setup_output` | `bool` | Include setup output in the first message |
| `initial_screenshot` | `bool` | Take a screenshot before the first action |
```python
task = LegacyTask(
    prompt="Complete the form",
    mcp_config={...},
    agent_config={
        "system_prompt": "Be careful with form fields",
        "allowed_tools": ["computer", "search"],
    }
)
```
## Best Practices

- **Use UUIDs for task IDs** - Required for HuggingFace datasets
- **Save dictionaries, not objects** - Preserves env var templates
- **Start with single tasks** - Debug before scaling up
- **Use `group_size` for variance** - Run each task 3-5 times for reliable metrics
- **Use `remote` for large evals** - Better parallelization and monitoring
## See Also