hud.eval() is the primary way to run evaluations. It creates an EvalContext with telemetry, handles parallel execution, and integrates with the HUD platform.

hud.eval()

import hud

async with hud.eval() as ctx:
    # ctx is an EvalContext (extends Environment)
    # `client` is an OpenAI-style async client created elsewhere
    response = await client.chat.completions.create(...)
    ctx.reward = 1.0

Parameters

Parameter       Type                            Description                                            Default
source          Task | list[Task] | str | None  Task objects from env(), task slugs, or None           None
variants        dict[str, Any] | None           A/B test configuration (lists expand to combinations)  None
group           int                             Runs per variant for statistical significance          1
group_ids       list[str] | None                Custom group IDs for parallel runs                     None
job_id          str | None                      Job ID to link traces to                               None
api_key         str | None                      API key for backend calls                              None
max_concurrent  int | None                      Maximum concurrent evaluations                         None
trace           bool                            Send telemetry to backend                              True
quiet           bool                            Suppress console output                                False
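
Several of these parameters can be combined in a single call. A minimal sketch, where tasks is a hypothetical variable holding Task objects or task slugs you already have:
import hud

async with hud.eval(
    tasks,                 # Task objects from env() or task slugs
    variants={"model": ["gpt-4o", "claude-sonnet-4-5"]},
    group=3,               # run each variant 3 times
    max_concurrent=5,      # cap parallel evaluations
    quiet=True,            # suppress console output
) as ctx:
    ...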

Source Types

The source parameter accepts:
from hud import Environment

# 1. Direct environment entry (recommended)
env = Environment("my-env")
async with env("checkout", product="laptop") as ctx:
    await agent.run(ctx.prompt)

# 2. Blank eval - manual setup and reward
async with hud.eval() as ctx:
    ctx.reward = compute_reward()

# 3. Task slug (loads from platform)
async with hud.eval("browser-task") as ctx:
    await agent.run(ctx)

Variants

Test multiple configurations in parallel:
async with hud.eval(
    eval,  # a Task (or task slug) defined earlier
    variants={"model": ["gpt-4o", "claude-sonnet-4-5"]},
) as ctx:
    model = ctx.variants["model"]  # Current variant
    response = await client.chat.completions.create(model=model, ...)
Lists expand to all combinations:
variants = {
    "model": ["gpt-4o", "claude"],
    "temperature": [0.0, 0.7],
}
# Creates 4 combinations: gpt-4o+0.0, gpt-4o+0.7, claude+0.0, claude+0.7
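
Inside the context, ctx.variants holds the current assignment for each key. A sketch reusing the eval, client, and variants objects from the examples above:
async with hud.eval(eval, variants=variants) as ctx:
    response = await client.chat.completions.create(
        model=ctx.variants["model"],              # e.g. "gpt-4o"
        temperature=ctx.variants["temperature"],  # e.g. 0.0
        messages=[{"role": "user", "content": ctx.prompt}],
    )
    ctx.reward = 1.0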

Groups

Run each variant multiple times for statistical significance:
async with hud.eval(eval, variants={"model": ["gpt-4o"]}, group=5) as ctx:
    # Runs 5 times - see the distribution of results
    ...
Total runs = len(evals) × len(variant_combinations) × group
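
For example, two evals combined with the 2 × 2 variant grid above and group=3 (task_a and task_b are hypothetical Tasks):
evals = [task_a, task_b]                                               # 2 evals
variants = {"model": ["gpt-4o", "claude"], "temperature": [0.0, 0.7]}  # 4 combinations
group = 3
# Total runs = 2 × 4 × 3 = 24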

Concurrency Control

async with hud.eval(
    evals,  # one or more Tasks defined earlier
    max_concurrent=10,  # Max 10 parallel evaluations
) as ctx:
    ...

EvalContext

EvalContext extends Environment with evaluation tracking.

Properties

Property   Type                  Description
trace_id   str                   Unique trace identifier
eval_name  str                   Evaluation name
prompt     str | None            Task prompt (from scenario or task)
variants   dict[str, Any]        Current variant assignment
reward     float | None          Evaluation reward (settable)
answer     str | None            Submitted answer
error      BaseException | None  Error if failed
results    list[EvalContext]     Results from parallel runs
headers    dict[str, str]        Trace headers for HTTP requests
job_id     str | None            Parent job ID
group_id   str | None            Group ID for parallel runs
index      int                   Index in parallel execution
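
A minimal sketch reading a few of these on a blank eval:
import hud

async with hud.eval() as ctx:
    print(ctx.trace_id)   # unique trace identifier
    print(ctx.eval_name)  # evaluation name
    print(ctx.headers)    # {"Trace-Id": "..."}
    ctx.reward = 0.5      # reward is settable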

Methods

All Environment methods are available, plus:
# Submit answer (passes to scenario for evaluation)
await ctx.submit(answer)

# Set reward directly
ctx.reward = 1.0

# Access tools in provider formats
tools = ctx.as_openai_chat_tools()

# Call tools
result = await ctx.call_tool("my_tool", arg="value")

Headers for Telemetry

Inside an eval context, trace headers are automatically injected into HTTP requests:
async with hud.eval() as ctx:
    # Requests to HUD services include Trace-Id automatically
    response = await client.chat.completions.create(...)
    
    # Manual access
    print(ctx.headers)  # {"Trace-Id": "..."}
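
For clients that are not patched automatically, the same headers can be attached by hand. A sketch using httpx and a placeholder URL, both assumptions rather than part of the hud API:
import httpx
import hud

async with hud.eval() as ctx:
    async with httpx.AsyncClient(headers=ctx.headers) as http:
        # Every request from this client carries the Trace-Id header
        await http.get("https://example.com/health")  # placeholder URL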

Working with Environments

The recommended pattern is to use async with env(...) directly:
from hud import Environment

env = Environment("my-env")

@env.tool()
def count_letter(text: str, letter: str) -> int:
    return text.lower().count(letter.lower())

@env.scenario("count")
async def count_scenario(sentence: str, letter: str):
    answer = yield f"How many '{letter}' in '{sentence}'?"
    correct = str(sentence.lower().count(letter.lower()))
    yield correct in answer

# Run with variants
async with env("count", sentence="Strawberry", letter="r", variants={"model": ["gpt-4o", "claude"]}) as ctx:
    response = await client.chat.completions.create(
        model=ctx.variants["model"],
        messages=[{"role": "user", "content": ctx.prompt}],
        tools=ctx.as_openai_chat_tools(),
    )
    await ctx.submit(response.choices[0].message.content or "")

Results

After parallel runs complete, access results on the context:
async with env("count", sentence="Strawberry", letter="r", variants={"model": ["gpt-4o", "claude"]}, group=3) as ctx:
    ...

# ctx.results contains all individual EvalContexts
for result in ctx.results:
    print(f"{result.variants}: reward={result.reward}, answer={result.answer}")

See Also