Tasks define what agents should do and how to measure success. They combine a prompt, environment configuration, and optional setup/evaluation phases.
Lifecycle

A task moves through up to three phases when an agent runs it: an optional setup step (setup_tool), the main prompt the agent works on, and an optional evaluation step (evaluate_tool) that produces the reward.
Task Structure
from hud.datasets import Task
import uuid

task = Task(
    # Required fields
    prompt="Navigate to the login page and sign in as [email protected]",
    mcp_config={
        "hud": {
            "url": "https://mcp.hud.ai/v3/mcp",
            "headers": {
                "Authorization": "Bearer ${HUD_API_KEY}",
                "Mcp-Image": "hudpython/hud-browser:latest"
            }
        }
    },

    # Optional fields
    id=str(uuid.uuid4()),  # Required for HuggingFace datasets
    system_prompt="You are an expert web automation agent. Always verify page loads before interacting with elements.",
    setup_tool={
        "name": "playwright",
        "arguments": {
            "action": "navigate",
            "url": "https://example.com"
        }
    },
    evaluate_tool={
        "name": "evaluate",
        "arguments": {
            "name": "url_contains",
            "substring": "/dashboard"
        }
    },
    metadata={"category": "authentication", "difficulty": "easy"}
)
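Only the prompt and mcp_config fields are required, so a minimal task can omit everything else (the prompt below is just an illustrative example):

minimal_task = Task(
    prompt="Open https://example.com and report the page title",
    mcp_config={
        "hud": {
            "url": "https://mcp.hud.ai/v3/mcp",
            "headers": {
                "Authorization": "Bearer ${HUD_API_KEY}",
                "Mcp-Image": "hudpython/hud-browser:latest"
            }
        }
    },
)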
Environment Variables
# In dataset JSON:
{
    "prompt": "Complete the TODO list",
    "mcp_config": {
        "hud": {
            "url": "https://mcp.hud.ai/v3/mcp",
            "headers": {
                "Authorization": "Bearer ${HUD_API_KEY}",
                "Mcp-Image": "${BROWSER_IMAGE}"
            }
        }
    }
}

# When loaded:
task = Task(**task_dict)  # Variables resolved here!
# Now task.mcp_config["hud"]["headers"]["Authorization"] = "Bearer sk-hud-..."
This enables:
- Public datasets without exposing secrets
- Environment-specific configurations
- CI/CD pipelines with different credentials
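For local runs, those variables usually come from your shell or a .env file. The sketch below sets them in-process purely for illustration, assuming (as shown above) that substitution happens when the Task is constructed:

import os

from hud.datasets import Task

# Illustration only: in practice, export these in your shell or load them from .env.
os.environ["HUD_API_KEY"] = "sk-hud-..."
os.environ["BROWSER_IMAGE"] = "hudpython/hud-browser:latest"

task_dict = {
    "prompt": "Complete the TODO list",
    "mcp_config": {
        "hud": {
            "url": "https://mcp.hud.ai/v3/mcp",
            "headers": {
                "Authorization": "Bearer ${HUD_API_KEY}",
                "Mcp-Image": "${BROWSER_IMAGE}"
            }
        }
    }
}

task = Task(**task_dict)  # ${HUD_API_KEY} and ${BROWSER_IMAGE} are substituted here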
Running Tasks
# Agent automatically handles all phases
result = await agent.run(task)
print(f"Success: {result.reward}") # 0.0 to 1.0
The agent will:
- Execute `setup_tool` if provided
- Work on the `prompt` using available tools
- Execute `evaluate_tool` to calculate reward
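If you need a binary pass/fail out of the scalar reward, you have to pick a threshold yourself; the cutoff below is illustrative, not something the SDK defines:

result = await agent.run(task)

# reward ranges from 0.0 to 1.0; here only full credit counts as a pass (arbitrary choice).
passed = result.reward is not None and result.reward >= 1.0
print(f"{'PASS' if passed else 'FAIL'} (reward={result.reward})")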
Working with Datasets
Tasks integrate with HuggingFace datasets:
from hud.datasets import run_tasks
from hud.types import AgentType
from hud.utils.tasks import load_tasks

# Load tasks from HuggingFace
tasks = load_tasks("hud-evals/SheetBench-50")

# Run agent on all tasks with automatic parallelization
results = await run_tasks(
    tasks=tasks,
    agent_type=AgentType.CLAUDE,
    name="SheetBench Run",
)
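run_tasks returns one result per task; assuming each result exposes the same reward field as a single agent.run call, you can aggregate them directly:

# Summarize the run (assumes each result has a .reward like agent.run above).
rewards = [r.reward or 0.0 for r in results]
print(f"{len(rewards)} tasks, mean reward = {sum(rewards) / len(rewards):.2f}")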
Creating Datasets
from hud.datasets import save_tasks

# Create task dictionaries (NOT Task objects!)
task_dicts = [
    {
        "prompt": "Navigate to the login page",
        "mcp_config": {
            "hud": {
                "url": "${MCP_URL}",
                "headers": {"Authorization": "Bearer ${HUD_API_KEY}"}
            }
        },
        "setup_tool": {"name": "playwright", "arguments": {"action": "navigate", "url": "https://example.com"}},
        "evaluate_tool": {"name": "url_match", "arguments": {"pattern": ".*/login"}}
    },
    # More task dicts...
]

# Save to HuggingFace (preserves ${VAR} templates)
save_tasks(task_dicts, "my-benchmark")
Always save dictionaries, not Task objects. Task objects have already resolved environment variables!
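To sanity-check a saved dataset, you can load it back with the same helper used for SheetBench above. Note that if the loaded entries are constructed as Task objects, the ${VAR} templates resolve at that point, so the referenced variables must be set in your environment:

from hud.utils.tasks import load_tasks

tasks = load_tasks("my-benchmark")  # same name passed to save_tasks above
print(f"Loaded {len(tasks)} tasks")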
Best Practices
- Use UUIDs: Always include `id=str(uuid.uuid4())` for dataset tasks
- Clear Prompts: Be specific about success criteria
- Template Variables: Use `${VAR}` syntax for shareable configs
- Rich Tools: Include both `name` and `arguments` in tool definitions
Next Steps