Task class for defining agent objectives and dataset utilities for managing task collections.
## Task Class
| Field | Type | Description | Default |
|---|---|---|---|
| `id` | `str \| None` | Unique identifier (UUID recommended) | `None` |
| `prompt` | `str` | Task instruction for the agent | Required |
| `mcp_config` | `dict[str, Any]` | MCP server configuration | Required |
| `setup_tool` | `MCPToolCall \| list[MCPToolCall] \| None` | Tool(s) to prepare environment | `None` |
| `evaluate_tool` | `MCPToolCall \| list[MCPToolCall] \| None` | Tool(s) to score performance | `None` |
| `agent_config` | `dict[str, Any] \| None` | Agent configuration (`system_prompt`, `allowed_tools`, etc.) | `None` |
| `metadata` | `dict[str, Any]` | Extra task metadata | `{}` |
### Environment Variable Substitution
The `mcp_config` field automatically resolves environment variables using `${VAR_NAME}` syntax. Substitution uses `string.Template.substitute()` with a `defaultdict` that returns empty strings for missing variables.
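A minimal sketch of that mechanism using only the standard library (the actual resolver lives inside the `Task` validators):

```python
from collections import defaultdict
from string import Template

def resolve_env(value: str, env: dict[str, str]) -> str:
    # defaultdict(str) returns "" for missing keys, so unset variables
    # resolve to empty strings instead of raising KeyError
    return Template(value).substitute(defaultdict(str, env))

print(resolve_env("Bearer ${HUD_API_KEY}", {"HUD_API_KEY": "sk-123"}))  # Bearer sk-123
print(resolve_env("Bearer ${MISSING_VAR}", {}))                         # Bearer
```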
### Field Validators
`Task` automatically:

- Parses JSON strings: `mcp_config` and `metadata` can be JSON strings
- Converts dicts to `MCPToolCall`: `setup_tool` and `evaluate_tool` dicts are converted
- Resolves environment variables: only when created from a dict (preserves templates in `model_dump()`)
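For example (the `hud.datasets` import path is an assumption; adjust to your package layout):

```python
import json
from hud.datasets import Task  # assumed import path

task = Task(
    prompt="Reach the 128 tile.",
    # JSON string: parsed into a dict by the validator
    mcp_config=json.dumps({"hud": {"url": "${HUD_MCP_URL}"}}),
    # plain dict: converted to an MCPToolCall
    evaluate_tool={"name": "evaluate", "arguments": {"target": 128}},
)
print(type(task.evaluate_tool).__name__)  # MCPToolCall
```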
## Recommended Evaluation Workflow
When developing and testing agents, follow this progression for optimal debugging and performance.

### Step 1: Single Task Development

Start with individual tasks to debug your agent and environment setup (a runnable sketch follows the list). You get:

- Full error stack traces
- Clear log output
- Quick iteration cycle
- Automatic telemetry tracking
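A single-task run might look like the following (the agent class, its `run` signature, and the import paths are assumptions; substitute your own agent):

```python
import asyncio
from hud.agents import ClaudeAgent  # assumed agent class and import path
from hud.datasets import Task       # assumed import path

async def main() -> None:
    task = Task(
        prompt="Reach the 128 tile.",
        mcp_config={"hud": {"url": "${HUD_MCP_URL}"}},
        evaluate_tool={"name": "evaluate", "arguments": {"target": 128}},
    )
    agent = ClaudeAgent()
    result = await agent.run(task, max_steps=10)  # errors surface with full tracebacks
    print(result)

asyncio.run(main())
```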
### Step 2: Full Dataset Evaluation
Once single tasks work reliably, scale up to full dataset evaluation (example below):

- Start with `max_concurrent=50` and adjust based on results
- Increase to 100-200 for faster evaluation (if API limits allow)
- Decrease to 10-20 if hitting rate limits
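A sketch, assuming `run_dataset` is a coroutine importable from `hud.datasets` (the dataset ID and agent class are placeholders):

```python
import asyncio
from hud.agents import ClaudeAgent    # assumed import path
from hud.datasets import run_dataset  # assumed import path

async def main() -> None:
    results = await run_dataset(
        name="2048-full-eval",
        dataset="your-org/2048-taskset",  # hypothetical dataset ID
        agent_class=ClaudeAgent,
        max_concurrent=50,  # raise to 100-200 if limits allow; drop to 10-20 if rate-limited
    )
    print(len(results))

asyncio.run(main())
```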
### Quick Reference
| Stage | Method | Concurrency | Use Case | Debugging |
|---|---|---|---|---|
| Development | Single task | 1 | Initial debugging | Excellent |
| Production | run_dataset | 50-200 | Full evaluation | Good |
## Dataset Functions

### run_dataset
| Parameter | Type | Description | Default |
|---|---|---|---|
| `name` | `str` | Job name for tracking | Required |
| `dataset` | `str \| Dataset \| list[dict]` | HF dataset ID, `Dataset` object, or task dicts | Required |
| `agent_class` | `type[MCPAgent]` | Agent class to instantiate | Required |
| `agent_config` | `dict[str, Any] \| None` | Constructor kwargs for agent | `None` |
| `max_concurrent` | `int` | Maximum concurrent tasks (recommended: 50-200) | `30` |
| `max_steps` | `int` | Max steps per task | `10` |
| `auto_respond` | `bool` | Use `ResponseAgent` for continuations | `False` |
| `metadata` | `dict[str, Any] \| None` | Job metadata | `None` |
| `split` | `str` | Dataset split when loading by ID | `"train"` |
Returns: `list[Any]` - Results in dataset order.
Examples:
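A representative call (import paths assumed; the dataset ID, model name, and metadata are illustrative):

```python
import asyncio
from hud.agents import ClaudeAgent    # assumed import path
from hud.datasets import run_dataset  # assumed import path

async def main() -> None:
    results = await run_dataset(
        name="sheet-tasks-eval",                  # job name for tracking
        dataset="hud-evals/sheet-tasks",          # hypothetical HF dataset ID
        agent_class=ClaudeAgent,
        agent_config={"model": "claude-sonnet-4-20250514"},  # constructor kwargs (illustrative)
        max_concurrent=50,
        max_steps=15,
        split="train",
        metadata={"run": "nightly"},
    )
    for result in results:  # results come back in dataset order
        print(result)

asyncio.run(main())
```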
### fetch_system_prompt_from_dataset
Fetches `system_prompt.txt` from a HuggingFace dataset repository.

Returns: `str | None` - System prompt text if found.

Note: Requires `huggingface_hub` to be installed.
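Usage sketch (assuming the function takes the dataset repo ID as its argument; the import path is an assumption):

```python
from hud.datasets import fetch_system_prompt_from_dataset  # assumed import path

prompt = fetch_system_prompt_from_dataset("hud-evals/sheet-tasks")  # hypothetical ID
if prompt is None:
    print("no system_prompt.txt found in the repo")
```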
### save_tasks
| Parameter | Type | Description | Default |
|---|---|---|---|
| `tasks` | `list[dict[str, Any]]` | Task dictionaries (NOT `Task` objects) | Required |
| `repo_id` | `str` | HuggingFace repository ID | Required |
| `**kwargs` | `Any` | Additional args for `push_to_hub()` | - |
Complex fields are serialized to JSON strings before upload:

- `mcp_config` → JSON string
- `setup_tool` → JSON string (if present)
- `evaluate_tool` → JSON string (if present)
- `metadata` → JSON string (if present)
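For example (import path assumed; `private=True` is forwarded to `push_to_hub()`):

```python
from hud.datasets import save_tasks  # assumed import path

task_dict = {
    "id": "a3f1c2d4-1b2c-4d5e-8f90-123456789abc",  # UUID string
    "prompt": "Reach the 128 tile.",
    "mcp_config": {"hud": {"url": "${HUD_MCP_URL}"}},  # template string survives
    "evaluate_tool": {"name": "evaluate", "arguments": {"target": 128}},
    "metadata": {"difficulty": "easy"},
}
# Pass dicts, not Task objects, so ${VAR} templates are preserved
save_tasks([task_dict], repo_id="your-org/2048-taskset", private=True)
```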
## MCPToolCall Type
| Field | Type | Description | Default |
|---|---|---|---|
| `name` | `str` | Tool name to call | Required |
| `arguments` | `dict[str, Any]` | Tool arguments | `{}` |
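Constructing tool calls directly (the `hud.types` import path is an assumption):

```python
from hud.types import MCPToolCall  # assumed import path

setup = MCPToolCall(name="setup_board", arguments={"board_size": 4})
evaluate = MCPToolCall(name="evaluate", arguments={"target": 128})
```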
## Real-World Examples

### Loading Tasks from Datasets
From `examples/run_evaluation.py`:
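A sketch of the pattern (the referenced file may differ in detail; the dataset ID is a placeholder):

```python
from datasets import load_dataset
from hud.datasets import Task  # assumed import path

rows = load_dataset("hud-evals/2048-taskset", split="train")  # hypothetical ID
tasks = [Task(**row) for row in rows]  # validators parse JSON-string fields
```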
### Task Structure in Datasets
From `environments/text_2048/2048_taskconfigs.json`:
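An entry in that file has roughly this shape (field names follow the `Task` class; the concrete values here are invented for illustration):

```json
{
  "id": "5c3e4f60-9a1b-4c2d-8e7f-0123456789ab",
  "prompt": "Play 2048 and reach the 64 tile.",
  "mcp_config": {
    "local": {"command": "docker", "args": ["run", "-i", "hud-text-2048"]}
  },
  "setup_tool": {"name": "setup_board", "arguments": {"board_size": 4}},
  "evaluate_tool": {"name": "evaluate", "arguments": {"target": 64}},
  "metadata": {"game": "2048", "difficulty": "easy"}
}
```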
### Creating and Saving Tasks
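A sketch that generates task dictionaries programmatically and uploads them (import path assumed; the repo ID is a placeholder):

```python
import uuid
from hud.datasets import save_tasks  # assumed import path

task_dicts = [
    {
        "id": str(uuid.uuid4()),  # UUIDs are required for HuggingFace datasets
        "prompt": f"Reach the {target} tile.",
        "mcp_config": {"hud": {"url": "${HUD_MCP_URL}"}},
        "evaluate_tool": {"name": "evaluate", "arguments": {"target": target}},
        "metadata": {"difficulty": "easy" if target <= 64 else "hard"},
    }
    for target in (32, 64, 128)
]
save_tasks(task_dicts, repo_id="your-org/2048-taskset")
```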
## Agent Integration

Tasks automatically configure agents.

### Agent Config Options

The `agent_config` field supports the following options:
| Option | Type | Description |
|---|---|---|
| `system_prompt` | `str` | Custom system prompt appended to agent's default |
| `allowed_tools` | `list[str]` | Tools the agent can use (replaces `agent_tools`) |
| `disallowed_tools` | `list[str]` | Tools to exclude from the agent |
| `append_setup_output` | `bool` | Include setup output in first message (default: `True`) |
| `initial_screenshot` | `bool` | Take screenshot before first action (default: `True`) |
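For example, restricting a task's agent to a subset of tools (values illustrative; import path assumed):

```python
from hud.datasets import Task  # assumed import path

task = Task(
    prompt="Fill in the spreadsheet header row.",
    mcp_config={"hud": {"url": "${HUD_MCP_URL}"}},
    agent_config={
        "system_prompt": "Prefer keyboard shortcuts over mouse clicks.",  # appended to default
        "allowed_tools": ["computer", "evaluate"],  # restrict the agent's toolset
        "initial_screenshot": False,
    },
)
```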
## Best Practices
- Use UUIDs for task IDs - Required for HuggingFace datasets
- Save dictionaries, not objects - Preserves env var templates
- Use agent_config for agent settings - Centralize agent configuration in one place
- Use metadata for filtering - Category, difficulty, tags
- Test locally first - Before uploading to HuggingFace
- Version your datasets - Use meaningful repo names
## Common Patterns
### Filtering Tasks
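For example, selecting tasks by `metadata` (dataset ID and keys are illustrative; note that `metadata` is stored as a JSON string in saved datasets):

```python
import json
from datasets import load_dataset
from hud.datasets import Task  # assumed import path

rows = load_dataset("hud-evals/sheet-tasks", split="train")  # hypothetical ID
easy_tasks = [
    Task(**row)
    for row in rows
    if json.loads(row["metadata"]).get("difficulty") == "easy"
]
```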
### Custom System Prompts
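One way to layer a per-task addition on top of a dataset-level prompt (a sketch; import paths assumed):

```python
from hud.datasets import Task, fetch_system_prompt_from_dataset  # assumed import path

base = fetch_system_prompt_from_dataset("hud-evals/sheet-tasks") or ""  # hypothetical ID
task = Task(
    prompt="Sort column B in ascending order.",
    mcp_config={"hud": {"url": "${HUD_MCP_URL}"}},
    # per-task addition on top of the dataset-level prompt
    agent_config={"system_prompt": base + "\nExplain each step before acting."},
)
```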
### Environment Variable Management
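Keep secrets out of saved datasets by leaving `${VAR}` templates in place and setting the variables at runtime (a sketch; the variable names and values are examples):

```python
import os
from hud.datasets import Task  # assumed import path

# Set secrets via the shell, CI secrets, or a .env loader rather than
# hard-coding them; saved task dicts keep the ${...} templates.
os.environ.setdefault("HUD_API_KEY", "sk-dev-key")  # illustrative only

task = Task(
    prompt="Reach the 128 tile.",
    mcp_config={
        "hud": {
            "url": "${HUD_MCP_URL}",  # unset variables resolve to ""
            "headers": {"Authorization": "Bearer ${HUD_API_KEY}"},
        }
    },
)  # variables resolve when the Task is created from a dict
```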
## See Also
- Task System - Conceptual overview
- Benchmarks - Building and running datasets
- Agents - How agents use tasks