A2A Chat, Citations, GPT-5 & CLI Sync
- A2A chat orchestrator — agent-to-agent communication for multi-agent workflows with input handling and follow-up turns
- `hud sync tasks` — new CLI command to sync task definitions from Python files or directories to the platform
- `hud sync env` — new CLI command replacing `hud link`, syncing local environment configs with collision detection
- `hud eval` accepts Python files — run evaluations directly from `.py` files and directories containing `Task` objects
- Chat class — new `Chat` abstraction in the SDK for managing multi-turn agent conversations
- GPT-5 support — `ResponseAgent` defaults to `gpt-5`, with ToolSearch tool support
- Citations — citation support for Claude, Gemini, and OpenAI responses in chat and agent traces
- JPEG compression for screenshots — reduces token usage for Anthropic computer use with configurable quality
- Interactive deploy collision handling — `hud deploy` now prompts when environment names collide instead of silently overwriting
- Configurable bash timeout — computer tool bash sessions support custom timeout values (previously hardcoded)
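The `Chat` class above is part of the HUD SDK; as a rough illustration of what a multi-turn conversation abstraction does (the class and method names here are hypothetical, not the SDK's actual API), a minimal sketch:

```python
from dataclasses import dataclass, field

@dataclass
class Message:
    role: str      # "user" or "assistant"
    content: str

@dataclass
class Chat:
    """Minimal multi-turn conversation holder (illustrative only, not the SDK's Chat)."""
    messages: list = field(default_factory=list)

    def add_user(self, text: str) -> None:
        self.messages.append(Message("user", text))

    def add_assistant(self, text: str) -> None:
        self.messages.append(Message("assistant", text))

    def history(self) -> list:
        # Shape the transcript the way chat-completion APIs expect it.
        return [{"role": m.role, "content": m.content} for m in self.messages]

chat = Chat()
chat.add_user("What tools are available?")
chat.add_assistant("Keyboard, mouse, and bash.")
print(chat.history()[0]["role"])  # → user
```

The point of the abstraction is that follow-up turns append to a single history object that can be replayed to the model on each request.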
Platform
- Click & scroll coordinate overlays — computer use traces render click coordinates and scroll actions directly on screenshots
- Trace-level QA workflows — run QA workflows across all tasks from the trace table, with screenshot input and per-task status tracking
- Evalset environment filtering — filter results by environment version, with earliest-version-only toggle
- EvaluationResult info viewer — inspect the full `info` field of evaluation results directly in the UI
- Individual user spend — usage page now shows per-user spend alongside team totals
- Inline job renaming — rename jobs directly from the jobs page
- Resizable task name column — longer task slugs visible with a resizable column and higher character limit
- Vendor portal — new vendor-facing site for RFP intake and bid management
- Modal integration — run environments on Modal compute infrastructure
- Resources section — new `/resources` page with published articles
Opus 4.6 Computer Use, Streaming & Deploy Improvements
- Opus 4.6 computer tool — native support for Claude Opus 4.6 computer use with zoom and screenshot gating
- Fine-grained tool streaming — opt-in streaming for individual tool results during agent execution
- `hud deploy` build args & secrets — pass build arguments and secrets to environment container builds
- `allowed_tools` in `@env.scenario` — scope tool access per evaluation scenario via the decorator
- Retry logic for MCP errors — automatic retry with backoff for 5xx errors from `mcp.hud.ai`
- Checkpoint configs — configure checkpoint behavior for long-running evaluations
- Subagent instrumentation — telemetry now captures subagent spans for nested agent workflows
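The retry-with-backoff behavior can be sketched generically; this is a simplified stand-alone version (not HUD's actual implementation) that retries only on 5xx-style errors with exponential delays:

```python
import time

class ServerError(Exception):
    """Stand-in for a 5xx response from an MCP endpoint (illustrative only)."""
    def __init__(self, status: int):
        super().__init__(f"server returned {status}")
        self.status = status

def call_with_retry(fn, retries: int = 3, base_delay: float = 0.5):
    """Retry `fn` on 5xx errors with exponential backoff; re-raise anything else."""
    for attempt in range(retries + 1):
        try:
            return fn()
        except ServerError as err:
            if not (500 <= err.status < 600) or attempt == retries:
                raise
            time.sleep(base_delay * (2 ** attempt))  # 0.5s, 1s, 2s, ...
    return None  # unreachable; loop always returns or raises

# Fails twice with a 503, then succeeds on the third call.
attempts = {"n": 0}
def flaky():
    attempts["n"] += 1
    if attempts["n"] < 3:
        raise ServerError(503)
    return "ok"

print(call_with_retry(flaky, base_delay=0.01))  # → ok
```

Keying the retry decision on the status-code range means client errors (4xx) still fail fast, while transient server errors get a bounded number of delayed re-attempts.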
Platform
- Billing refactor — auto top-up, redesigned billing page, and per-key pricing for HUD-managed API keys
- Trace viewer enhancements — strip review mode, inline run switching, file attachment display
- System prompt in trace viewer — system prompt visible (collapsed by default) in the trace sidebar
- Trace comments — add and edit comments on individual traces, visible as a dedicated column in taskset view
- Training jobs dashboard — dedicated section for RL training jobs with detail pages
- Native binarization toggle — pass/fail binarization for taskset evaluations, built into the platform
- Column ordering — reorder columns in the taskset table view
- Model & environment sorting — sort taskset results by model, environment, and environment version
CLI Refinements & Leaderboard Redesign
- Build args for `hud deploy` — pass custom build arguments to environment container builds
- Subagent telemetry — telemetry instrumentation for subagent spans within nested workflows
- Server output validation — runtime validation of MCP server responses
- Wildcard tools — environments can expose `*` to allow all tools without explicit registration
- CLI mode distinction — `hud build` and `hud analyze` distinguish between HTTP and stdio modes
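As a rough sketch of how a `*` wildcard can interact with explicit tool registration (the helper name and registry shape are hypothetical, not HUD's code):

```python
def resolve_tools(exposed, registry):
    """Return the callable tools an environment exposes (illustrative only).

    A single "*" entry allows every registered tool; otherwise only the
    explicitly listed names are kept.
    """
    if "*" in exposed:
        return dict(registry)
    return {name: fn for name, fn in registry.items() if name in exposed}

registry = {
    "screenshot": lambda: "png-bytes",
    "click": lambda: "clicked",
    "bash": lambda: "shell",
}

print(sorted(resolve_tools(["*"], registry)))      # → ['bash', 'click', 'screenshot']
print(sorted(resolve_tools(["click"], registry)))  # → ['click']
```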
Platform
- Leaderboard redesign — redesigned leaderboards with publishing flow, public visibility, and embedding support
- Slack bot — Slack integration for job notifications and external integration provider support
- Trace compact view — compact trace view with column reorder, inline comments, and truncated task names
- BYOK API keys — bring-your-own-key support with a `use_hud_key` option for user-managed API keys
- Per-key pricing — individual pricing tiers for HUD-managed API keys
- Jobs page improvements — compact job list view, stats section updates
v0.5.0: MCP-First Architecture
- Environments decoupled — environment definitions moved to separate repos, enabling independent versioning and community contributions
- Unified scenario/tool/prompt/resource handling — single abstraction layer for MCP servers and client-side tools, with caching and hot-reload
- New telemetry — OpenTelemetry-based instrumentation with trace IDs, subagent spans, and structured logging
- Scenario decorator —
@env.scenariofor defining evaluation scenarios with typed configuration - Anthropic RFT beta — initial support for reinforcement fine-tuning via the Anthropic API
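A decorator-based scenario registry in the style of `@env.scenario` can be sketched as follows; the class, parameter names, and config fields here are hypothetical, not the SDK's actual signature:

```python
from dataclasses import dataclass, field

@dataclass
class Env:
    """Toy environment that collects scenarios via a decorator (illustrative only)."""
    scenarios: dict = field(default_factory=dict)

    def scenario(self, name, allowed_tools=None):
        def register(fn):
            # Store the scenario function alongside its typed configuration.
            self.scenarios[name] = {"fn": fn, "allowed_tools": allowed_tools or []}
            return fn
        return register

env = Env()

@env.scenario("checkout", allowed_tools=["click", "type"])
def checkout_flow():
    return "add item, pay, verify receipt"

print(env.scenarios["checkout"]["allowed_tools"])  # → ['click', 'type']
```

Registering scenarios declaratively like this lets the runner enumerate them, and the per-scenario config is where a field such as `allowed_tools` naturally lives.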
Platform
- Inference API usage tracking — track inference API usage on the usage page
- HUD-managed API keys — platform-side API key management with `set api_key` support
Bedrock, Gemini & Expanded Model Support
- AWS Bedrock — `hud-python[bedrock]` extra for running Claude agents via AWS Bedrock
- Gemini CUA — Gemini computer use agent support with checkpoint management
- Qwen computer tool — QwenComputerTool for Qwen-series models
- MCP server support — use HUD environments as MCP servers, integrating with any MCP-compatible client
- Telemetry tracing — structured telemetry for agent runs with trace export
Platform
- Text trace viewer — view text-only agent traces with dedicated viewer
- Leaderboard embeds — embed leaderboards in external pages
- Versioned models — unified evalsets and leaderboards with versioned model support
- Usage tracking & billing — Stripe integration, subscription management, and usage analytics
CLI & Claude Agent
- `hud` CLI — full CLI for the development lifecycle: `init`, `dev`, `build`, `deploy`, `eval`, `analyze`, `debug`
- Claude agent with prompt caching — built-in Claude agent with Anthropic prompt caching for reduced latency and cost
- Pre-filtered tools — agents receive only the tools relevant to their current scenario
- User-provided system prompts — custom system prompts for tasksets and individual tasks
Platform
- Trace viewer — full trace exploration UI with step-by-step replay of agent actions and screenshots
- Leaderboards & scorecards — evalset leaderboards with scorecard breakdowns
- Jobs & runs display — view agent runs with step-by-step screenshots and action metadata
- Public trace sharing — publish and share individual traces publicly
Environment Controllers & Docker Support
- Client-side environment management — local Docker-based environment execution with copy-to/from support
- Claude adapter — built-in adapter for Anthropic Claude computer use and Operator
- Gymnasium wrapper —
gym.make()compatibility for RL-style agent training loops - Evaluator framework — pluggable evaluators with structured logging and result export
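"Pluggable evaluators" generally means scoring functions that share one interface and are looked up by name; a minimal stand-alone sketch (the registry and function names are illustrative, not HUD's API):

```python
# Registry mapping evaluator names to scoring functions (illustrative only).
EVALUATORS = {}

def evaluator(name):
    """Register a scoring function under a name, plug-in style."""
    def register(fn):
        EVALUATORS[name] = fn
        return fn
    return register

@evaluator("exact_match")
def exact_match(expected, actual):
    return 1.0 if expected == actual else 0.0

@evaluator("contains")
def contains(expected, actual):
    return 1.0 if expected in actual else 0.0

def score(name, expected, actual):
    return EVALUATORS[name](expected, actual)

print(score("exact_match", "42", "42"))  # → 1.0
print(score("contains", "4", "142"))     # → 1.0
```

Because evaluators are resolved by name at runtime, new scoring strategies can be added without touching the harness that invokes them.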
Platform
- Platform launch — dashboard at hud.ai with authentication and evalset browsing
- API keys management — create and manage API keys from the dashboard
- Profile & team pages — user profiles with team membership and settings
Initial Release
- Open-source SDK — `pip install hud-python` for AI agent evaluation and RL environments
- Core primitives — environments, tasks, evaluators, and runs as first-class objects
- Computer use actions — keyboard, mouse, scroll, keyup/keydown, and hold-key actions for desktop environments
- Mintlify docs — documentation site at docs.hud.ai
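The computer-use action set (keyboard, mouse, scroll, keyup/keydown, hold-key) maps naturally onto typed action objects that serialize to a wire format; a sketch with hypothetical names, not the SDK's actual action classes:

```python
from dataclasses import dataclass

@dataclass
class Click:
    x: int
    y: int
    button: str = "left"

@dataclass
class Scroll:
    dx: int
    dy: int

@dataclass
class KeyPress:
    key: str
    down: bool = True  # keydown vs keyup

def serialize(action):
    """Flatten an action into the dict an executor might consume (illustrative)."""
    payload = {"type": type(action).__name__.lower()}
    payload.update(vars(action))
    return payload

print(serialize(Click(x=100, y=200)))
# → {'type': 'click', 'x': 100, 'y': 200, 'button': 'left'}
```

Typed actions like these give the executor a closed vocabulary to validate against, rather than free-form strings.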