QA Workflows are analysis agents that run automatically on your traces. They use environments like trace-explorer to fetch trace data, inspect it with coding tools, and return structured verdicts. A scenario qualifies as a QA workflow when it declares both a platform key arg (hud_api_key) and an entity arg (trace_id for per-trace, task_id for per-task). The platform fills these at runtime — you configure the analysis prompt and attach the workflow to a taskset column.
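The qualification rule can be sketched as a predicate over a scenario's parameter names (a simplification for illustration; the platform's actual check may differ):

```python
def is_qa_workflow(param_names: set[str]) -> bool:
    # A scenario qualifies when it declares the platform key arg plus one
    # entity arg: trace_id (per-trace) or task_id (per-task).
    has_key = "hud_api_key" in param_names
    has_entity = "trace_id" in param_names or "task_id" in param_names
    return has_key and has_entity
```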

Standard QA Workflows

Four pre-built workflows are available out of the box. These appear under the Standard QA Workflows section on the Agents page and can be attached to any taskset with one click.
| Workflow | What it detects | Output |
|---|---|---|
| False Negative | Agent succeeded but grader scored it wrong | `is_false_negative`, `reasoning`, `confidence` |
| False Positive | Agent got credit without genuinely solving | `is_false_positive`, `reasoning`, `confidence` |
| Failure Analysis | Root-cause classification (10 categories) | `failure_category`, `root_cause`, `failed_criteria` |
| Reward Hacking | Agent gamed the evaluation mechanism | `is_reward_hacking`, `hacking_strategy`, `severity` |
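As a sketch, the structured output of the False Negative workflow could be modeled with the same Pydantic pattern used for custom workflows below. The field names come from the table above; the class itself is illustrative, not the platform's actual definition:

```python
from pydantic import BaseModel, Field

class FalseNegativeResult(BaseModel):
    # Illustrative model; field names match the False Negative row above.
    is_false_negative: bool = Field(
        description="True if the agent succeeded but the grader scored it wrong"
    )
    reasoning: str = Field(description="Evidence from the trace supporting the verdict")
    confidence: float = Field(ge=0.0, le=1.0, description="Confidence in the verdict")
```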

How to Use

From the Agents Page

  1. Go to the Agents page
  2. Under Standard QA Workflows, click a recommended workflow to view it
  3. Click Add as Column to attach it to any taskset
  4. Every completed trace is automatically analyzed
  5. To create your own, click New Agent → QA Workflow, select a workflow scenario, configure the analysis prompt, and choose a model. It appears under Your QA Workflows.

From a Taskset

  1. Open any taskset → Add Column → QA Workflow
  2. Pick a recommended workflow or one you’ve created
  3. Results appear as columns in your trace grid

Building Your Own

A QA workflow is just a scenario with trace_id + hud_api_key arguments. Use prepare_qa_context from trace-explorer for the common setup:
```python
from typing import Any

from pydantic import BaseModel, Field

from env import env
from qa_common import prepare_qa_context

class MyResult(BaseModel):
    verdict: str = Field(description="Your analysis verdict")
    confidence: float = Field(ge=0.0, le=1.0)

@env.scenario("my_analysis", returns=MyResult)
async def my_analysis(
    trace_id: str,
    hud_api_key: str,
    query: str = "",
    ground_truth: str | None = None,
) -> Any:
    # Fetch the trace and build the shared analysis context.
    _, _, context = await prepare_qa_context(
        trace_id, hud_api_key, "My analysis"
    )

    prompt = f"""Your analysis instructions here.

{context}

## Focus
{query or "Default analysis question."}"""

    # The model's structured response is parsed into a MyResult.
    response: MyResult = yield prompt

    # When a ground-truth verdict is supplied, score the workflow itself.
    if ground_truth is not None:
        yield 1.0 if response.verdict == ground_truth else 0.0
    else:
        yield 1.0
```
The ground_truth parameter lets you build eval datasets for the workflow itself.
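For example, such a dataset is just a set of argument bundles with a known-correct verdict filled in, so the scenario's final yield becomes a 1.0/0.0 score. The trace IDs and scoring helper below are hypothetical, mirroring the exact-match logic in the scenario above:

```python
# Hypothetical eval dataset for the workflow itself: each entry supplies the
# scenario's args, including a known-correct verdict as ground_truth.
eval_cases = [
    {"trace_id": "trace-001", "query": "Did the agent solve the task?", "ground_truth": "solved"},
    {"trace_id": "trace-002", "query": "Did the agent solve the task?", "ground_truth": "not_solved"},
]

def score(predicted_verdict: str, ground_truth: str) -> float:
    # Mirrors the scenario's final yield: exact match scores 1.0, else 0.0.
    return 1.0 if predicted_verdict == ground_truth else 0.0

# With a (stubbed) predicted verdict of "solved" for every case:
scores = [score("solved", case["ground_truth"]) for case in eval_cases]
```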

See Also