- Evaluating model performance. Build a taskset, run it across models, compare scores. Find out which model handles your use case before you commit to one.
- Building RL environments. Create frontier-grade post-training data for capabilities such as coding, computer use, tool use, and deep research.
- Training specialized agents. Use reinforcement fine-tuning (RFT) to produce a model that’s better at your specific tasks.
- Environment SDK — Define agent-callable tools and evaluation logic. Each environment spins up fresh and isolated for every run.
- Eval & Training Platform — Run evaluations at scale on hud.ai. Collect traces. Train models on successful runs.
- Model Gateway — One OpenAI-compatible endpoint at inference.hud.ai for Claude, GPT, Gemini, Grok, and more.
Install
1. Environments: Define Your Agent’s Harness
An environment wraps your code as tools agents can call, and defines scenarios that evaluate what agents do. Each environment spins up fresh and isolated for every evaluation — no shared state, fully reproducible.
Example Workflow
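In plain Python, that shape looks roughly like the sketch below. The class, tool methods, and `evaluate` function are hypothetical stand-ins to illustrate the idea, not the actual Environment SDK API:

```python
# Hypothetical sketch of an environment: agent-callable tools plus
# evaluation logic. All names here are illustrative, not the real SDK.
from dataclasses import dataclass, field


@dataclass
class TodoEnv:
    """A toy environment: the agent manages a todo list via tools."""
    items: list[str] = field(default_factory=list)

    # --- Tools the agent can call ---
    def add_item(self, text: str) -> str:
        self.items.append(text)
        return f"added: {text}"

    def complete_item(self, text: str) -> str:
        self.items.remove(text)
        return f"completed: {text}"

    # --- Evaluation logic: score what the agent actually did ---
    def evaluate(self, expected_remaining: list[str]) -> float:
        return 1.0 if self.items == expected_remaining else 0.0


# Each evaluation constructs a fresh instance — no shared state.
env = TodoEnv()
env.add_item("write tests")
env.add_item("ship release")
env.complete_item("write tests")
score = env.evaluate(expected_remaining=["ship release"])  # 1.0
```

Because every run builds a new instance, evaluations stay isolated and reproducible, mirroring the fresh-per-run behavior described above.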
2. Tasks & Training: Evaluate and Train
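A task pairs a scenario with specific arguments, and a taskset groups tasks so the same set can be run across models and their scores compared. The sketch below is hypothetical plain Python, not the actual SDK or platform API:

```python
# Hypothetical sketch: a task is a scenario plus arguments; tasks group
# into a taskset that is run across models. Names are illustrative only.
from dataclasses import dataclass, field


@dataclass
class Task:
    scenario: str
    args: dict = field(default_factory=dict)


taskset = [
    Task("checkout_flow", {"items": 2}),
    Task("checkout_flow", {"items": 5, "coupon": "SAVE10"}),
    Task("search_and_filter", {"query": "laptop"}),
]


def best_model(scores: dict[str, list[float]]) -> str:
    """Compare models by mean score over the same taskset."""
    return max(scores, key=lambda m: sum(scores[m]) / len(scores[m]))


# One score per task in the taskset, per model.
winner = best_model({"model-a": [1.0, 0.0, 1.0], "model-b": [1.0, 1.0, 1.0]})  # "model-b"
```

In practice the platform runs each task against a model, records a trace, and scores it; the mean-score comparison here stands in for that loop.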
A task is a scenario with specific arguments. Group tasks into tasksets and run them across models. Train models on successful traces to produce a model that's better at your specific use case.
3. Models: Any Model, One API
Integrations with Anthropic, OpenAI, Gemini, xAI, and more work out of the box. Point any OpenAI-compatible client at inference.hud.ai and use any model. Browse all available models at hud.ai/models.
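Since the gateway speaks the OpenAI API shape, any HTTP client works. The sketch below builds (but does not send) a chat-completion request with only the standard library; the `/v1/chat/completions` path and the model ID are assumptions based on OpenAI conventions, so check hud.ai/models for real model names:

```python
# Build an OpenAI-style chat request against the HUD gateway.
# The URL path and model ID below are assumptions, not confirmed values.
import json
import urllib.request


def build_chat_request(api_key: str, model: str, prompt: str) -> urllib.request.Request:
    """Construct a chat-completion request without sending it."""
    payload = {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
    }
    return urllib.request.Request(
        "https://inference.hud.ai/v1/chat/completions",
        data=json.dumps(payload).encode(),
        headers={
            "Authorization": f"Bearer {api_key}",
            "Content-Type": "application/json",
        },
        method="POST",
    )


req = build_chat_request("YOUR_API_KEY", "claude-sonnet", "Hello!")
```

To actually send it, pass `req` to `urllib.request.urlopen`, or simply point an official OpenAI client's `base_url` at the gateway instead.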
Next Steps
- Core Concepts — Environments, tools, scenarios, tasks, defined in one place.
- Environments — Tools, scenarios, and iteration.
- Tasks & Training — Evaluate and train models.
- Best Practices — Patterns for reliable environments and evals.
Community
- GitHub — Star the repo and contribute.
- Discord — Join the community.