QA Workflows are analysis agents that run automatically on your traces. They use environments like trace-explorer to fetch trace data, inspect it with coding tools, and return structured verdicts. A scenario qualifies as a QA workflow when it declares both a platform key arg (hud_api_key) and an entity arg (trace_id for per-trace, task_id for per-task). The platform fills these at runtime — you configure the analysis prompt and attach the workflow to a taskset column.
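The qualification rule can be sketched as a predicate over a scenario's parameter names (a simplification for illustration; the platform's actual check may differ):

```python
def is_qa_workflow(param_names: set[str]) -> bool:
    # A scenario qualifies when it declares the platform key arg plus one
    # entity arg: trace_id (per-trace) or task_id (per-task).
    has_key = "hud_api_key" in param_names
    has_entity = "trace_id" in param_names or "task_id" in param_names
    return has_key and has_entity
```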

Standard QA Workflows

Four pre-built workflows are available out of the box. These appear under the Standard QA Workflows section on the Agents page and can be attached to any taskset with one click.
| Workflow | What it detects | Output |
|---|---|---|
| False Negative | Agent succeeded but grader scored it wrong | `is_false_negative`, `reasoning`, `confidence` |
| False Positive | Agent got credit without genuinely solving | `is_false_positive`, `reasoning`, `confidence` |
| Failure Analysis | Root-cause classification (10 categories) | `failure_category`, `root_cause`, `failed_criteria` |
| Reward Hacking | Agent gamed the evaluation mechanism | `is_reward_hacking`, `hacking_strategy`, `severity` |
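As a sketch, the structured output of the False Negative workflow could be modeled with the same Pydantic pattern used for custom workflows below. The field names come from the table above; the class itself is illustrative, not the platform's actual definition:

```python
from pydantic import BaseModel, Field

class FalseNegativeResult(BaseModel):
    # Illustrative model; field names match the False Negative row above.
    is_false_negative: bool = Field(
        description="True if the agent succeeded but the grader scored it wrong"
    )
    reasoning: str = Field(description="Evidence from the trace supporting the verdict")
    confidence: float = Field(ge=0.0, le=1.0, description="Confidence in the verdict")
```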

How to Use

From the Agents Page

  1. Go to the Agents page
  2. Under Standard QA Workflows, click a recommended workflow to view it
  3. Click Add as Column to attach it to any taskset
  4. Every completed trace is automatically analyzed
  5. To create your own, click New Agent → QA Workflow, select a workflow scenario, configure the analysis prompt, and choose a model. It appears under Your QA Workflows.

From a Taskset

  1. Open any taskset → Add Column → QA Workflow
  2. Pick a recommended workflow or one you’ve created
  3. Results appear as columns in your trace grid

Building Your Own

A QA workflow is just a scenario with trace_id + hud_api_key arguments. Use prepare_qa_context from trace-explorer for the common setup:
```python
from typing import Any

from pydantic import BaseModel, Field

from env import env
from qa_common import prepare_qa_context

class MyResult(BaseModel):
    verdict: str = Field(description="Your analysis verdict")
    confidence: float = Field(ge=0.0, le=1.0)

@env.scenario("my_analysis", returns=MyResult)
async def my_analysis(
    trace_id: str,
    hud_api_key: str,
    query: str = "",
    ground_truth: str | None = None,
) -> Any:
    # Fetch the trace and build the shared analysis context.
    _, _, context = await prepare_qa_context(
        trace_id, hud_api_key, "My analysis"
    )

    prompt = f"""Your analysis instructions here.

{context}

## Focus
{query or "Default analysis question."}"""

    # The model's structured response is parsed into a MyResult.
    response: MyResult = yield prompt

    # When a ground-truth verdict is supplied, score the workflow itself.
    if ground_truth is not None:
        yield 1.0 if response.verdict == ground_truth else 0.0
    else:
        yield 1.0
```
The ground_truth parameter lets you build eval datasets for the workflow itself.
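For example, such a dataset is just a set of argument bundles with a known-correct verdict filled in, so the scenario's final yield becomes a 1.0/0.0 score. The trace IDs and scoring helper below are hypothetical, mirroring the exact-match logic in the scenario above:

```python
# Hypothetical eval dataset for the workflow itself: each entry supplies the
# scenario's args, including a known-correct verdict as ground_truth.
eval_cases = [
    {"trace_id": "trace-001", "query": "Did the agent solve the task?", "ground_truth": "solved"},
    {"trace_id": "trace-002", "query": "Did the agent solve the task?", "ground_truth": "not_solved"},
]

def score(predicted_verdict: str, ground_truth: str) -> float:
    # Mirrors the scenario's final yield: exact match scores 1.0, else 0.0.
    return 1.0 if predicted_verdict == ground_truth else 0.0

# With a (stubbed) predicted verdict of "solved" for every case:
scores = [score("solved", case["ground_truth"]) for case in eval_cases]
```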

See Also