The HUD SDK provides the LegacyTask class for defining agent objectives and dataset utilities for managing task collections.
LegacyTask is deprecated. For new code, use env("scenario_name", **args) to create Task objects. See Environments for the recommended approach.
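For reference, a minimal sketch of the recommended pattern; the import path and scenario name here are assumptions, see the Environments docs for the real entry point:

from hud import env  # import location is an assumption

# "my_scenario" and its keyword arguments are illustrative placeholders
task = env("my_scenario", difficulty="easy")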

LegacyTask Class

from hud.datasets import LegacyTask
Pydantic model that defines an agent’s objective, setup, and evaluation criteria. Fields:
| Field | Type | Description | Default |
| --- | --- | --- | --- |
| `id` | `str \| None` | Unique identifier (UUID recommended) | `None` |
| `prompt` | `str` | Task instruction for the agent | Required |
| `mcp_config` | `dict[str, Any]` | MCP server configuration | Required |
| `setup_tool` | `MCPToolCall \| list[MCPToolCall] \| None` | Tool(s) to prepare the environment | `None` |
| `evaluate_tool` | `MCPToolCall \| list[MCPToolCall] \| None` | Tool(s) to score performance | `None` |
| `agent_config` | `BaseAgentConfig \| dict[str, Any] \| None` | Agent configuration | `None` |
| `metadata` | `dict[str, Any]` | Extra task metadata | `{}` |
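Putting the fields together, a sketch that assumes MCPToolCall is importable from hud.types and accepts name and arguments (the tool names and values are illustrative):

import uuid

from hud.datasets import LegacyTask
from hud.types import MCPToolCall  # import location is an assumption

task = LegacyTask(
    id=str(uuid.uuid4()),
    prompt="Navigate to the dashboard and open Settings",
    mcp_config={"browser": {"url": "${HUD_MCP_URL}"}},
    setup_tool=MCPToolCall(name="setup", arguments={"page": "home"}),
    evaluate_tool=MCPToolCall(name="evaluate", arguments={"target": "settings"}),
    metadata={"suite": "smoke"},
)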

Environment Variable Substitution

The mcp_config field automatically resolves environment variables written as ${VAR_NAME}; an optional default can follow a colon, as in ${VAR_NAME:default}:
task = LegacyTask(
    prompt="Navigate to the dashboard",
    mcp_config={
        "browser": {
            "url": "${HUD_MCP_URL:https://mcp.hud.ai/v3/mcp}",
            "headers": {
                "Authorization": "Bearer ${HUD_API_KEY}",
                "Mcp-Image": "hudpython/hud-browser:latest"
            }
        }
    }
)
Variables are resolved at the moment a LegacyTask is constructed from a dict, which is why datasets should store raw dictionaries rather than resolved objects.
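A short sketch of that behavior: the dict keeps the template, while the constructed model holds the resolved value:

raw = {
    "prompt": "Navigate to the dashboard",
    "mcp_config": {
        "browser": {"headers": {"Authorization": "Bearer ${HUD_API_KEY}"}}
    },
}
task = LegacyTask(**raw)  # ${HUD_API_KEY} is read from the environment here
# raw still contains the "${HUD_API_KEY}" template; task holds the real key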

Running Tasks

run_single_task

Execute a single task with tracing. This is the core execution primitive.
from hud.datasets import run_single_task
from hud.types import AgentType

result = await run_single_task(
    task=task,
    agent_type=AgentType.CLAUDE,
    agent_params={"checkpoint_name": "claude-sonnet-4-5"},
    max_steps=20,
    trace_name="My Task",
)
print(f"Reward: {result.reward}")
Parameters:
| Parameter | Type | Description | Default |
| --- | --- | --- | --- |
| `task` | `Task` | Task to execute | Required |
| `agent_type` | `AgentType` | Agent type to use | Required |
| `agent_params` | `dict` | Parameters for `agent.create()` | `None` |
| `max_steps` | `int` | Maximum steps | `10` |
| `job_id` | `str` | Job ID for telemetry | `None` |
| `task_id` | `str` | Task ID for telemetry | `None` |
| `group_id` | `str` | Group ID for variance runs | `None` |
| `trace_name` | `str` | Name for the trace | `None` |
| `metadata` | `dict` | Additional trace metadata | `None` |
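A sketch of the telemetry parameters in use; the grouping semantics shown are an assumption based on the descriptions above:

import uuid

# Group three runs of the same task under one job for variance analysis
job_id = str(uuid.uuid4())
results = []
for _ in range(3):
    result = await run_single_task(
        task=task,
        agent_type=AgentType.CLAUDE,
        job_id=job_id,     # all three traces land in the same job
        task_id=task.id,
        group_id=task.id,  # marks the runs as a variance group
        max_steps=20,
    )
    results.append(result)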

run_tasks

Run multiple tasks with automatic job tracking. This is the primary evaluation function.
from hud.datasets import run_tasks
from hud.types import AgentType

results = await run_tasks(
    tasks=tasks,
    agent_type=AgentType.CLAUDE,
    agent_params={"checkpoint_name": "claude-sonnet-4-5"},
    name="SheetBench Evaluation",
    max_concurrent=50,
    max_steps=50,
)
Parameters:
| Parameter | Type | Description | Default |
| --- | --- | --- | --- |
| `tasks` | `list[Task]` | Tasks to execute | Required |
| `agent_type` | `AgentType` | Agent type to use | Required |
| `agent_params` | `dict` | Parameters for `agent.create()` | `None` |
| `name` | `str` | Job name for tracking | `"Evaluation"` |
| `max_concurrent` | `int` | Maximum concurrent tasks | `30` |
| `metadata` | `dict` | Job metadata | `None` |
| `max_steps` | `int` | Max steps per task | `10` |
| `group_size` | `int` | Runs per task (variance estimation) | `1` |
| `remote` | `bool` | Submit to HUD platform | `False` |
Returns:
  • If remote=True: Empty list (fire-and-forget)
  • If group_size=1: List of Trace results
  • If group_size>1: List of statistics dicts
Examples:
from hud.datasets import load_tasks

# Run filtered tasks locally
all_tasks = load_tasks("hud-evals/SheetBench-50")
selected = [t for t in all_tasks if "login" in t.prompt.lower()]
results = await run_tasks(selected, AgentType.CLAUDE)

# Run with variance estimation
stats = await run_tasks(tasks, AgentType.CLAUDE, group_size=3)

# Submit for remote execution
await run_tasks(tasks, AgentType.CLAUDE, remote=True)
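Consuming the two local return shapes might look like this; the exact keys of the statistics dicts are not specified here, so they are left opaque:

# group_size=1: each result is a Trace with a reward, as in run_single_task
results = await run_tasks(tasks, AgentType.CLAUDE)
mean_reward = sum(r.reward for r in results) / len(results)

# group_size>1: each entry is a per-task statistics dict
stats = await run_tasks(tasks, AgentType.CLAUDE, group_size=3)
for entry in stats:
    print(entry)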

Remote Execution

Submit tasks to the HUD platform for execution in the cloud:
from hud.datasets import run_tasks
from hud.types import AgentType

# Submit and return immediately
await run_tasks(
    tasks=tasks,
    agent_type=AgentType.CLAUDE,
    agent_params={"checkpoint_name": "claude-sonnet-4-5"},
    name="Remote Evaluation",
    remote=True,
)
# Monitor at https://hud.so/jobs/{job_id}

Cancellation

Cancel running remote jobs:
from hud.datasets.utils import cancel_job, cancel_task, cancel_all_jobs

# Cancel specific job
await cancel_job(job_id)

# Cancel specific task
await cancel_task(job_id, task_id)

# Cancel ALL your active jobs (panic button)
await cancel_all_jobs()
Or use the CLI:
hud cancel --job <job_id>
hud cancel --task <job_id> <task_id>
hud cancel --all

Saving Tasks

from hud.datasets import save_tasks

# IMPORTANT: Pass dicts, not Task objects!
task_dicts = [task.model_dump() for task in tasks]
save_tasks(
    tasks=task_dicts,
    repo_id="my-tasks",
    private=False,
)
Always pass dictionaries to preserve environment variable templates. Task objects have already resolved ${VAR} to actual values!
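The same caveat applies to the round trip. Assuming load_tasks (used earlier) is the loading counterpart in hud.datasets, the objects it returns already carry resolved values, so only re-save dicts you kept from before construction:

from hud.datasets import load_tasks

tasks = load_tasks("my-tasks")  # Task objects: ${VAR} already resolved
# Re-saving [t.model_dump() for t in tasks] would bake real secrets
# into the dataset, so keep the original template dicts for re-saving.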

Agent Config Options

The agent_config field on tasks supports:
| Option | Type | Description |
| --- | --- | --- |
| `system_prompt` | `str` | Appended to the agent's default system prompt |
| `allowed_tools` | `list[str]` | Tools the agent can use |
| `disallowed_tools` | `list[str]` | Tools to hide from the agent |
| `append_setup_output` | `bool` | Include setup output in the first message |
| `initial_screenshot` | `bool` | Take a screenshot before the first action |
task = LegacyTask(
    prompt="Complete the form",
    mcp_config={...},
    agent_config={
        "system_prompt": "Be careful with form fields",
        "allowed_tools": ["computer", "search"],
    }
)

Best Practices

  1. Use UUIDs for task IDs - Required for HuggingFace datasets
  2. Save dictionaries, not objects - Preserves env var templates
  3. Start with single tasks - Debug before scaling up
  4. Use group_size for variance - Run each task 3-5 times for reliable metrics
  5. Use remote for large evals - Better parallelization and monitoring

See Also