- Evaluating model performance. Build a taskset, run it across models, compare scores. Find out which model handles your use case before you commit to one.
- Building RL environments. Create frontier-grade post-training data for capabilities such as coding, computer use, tool use, and deep research.
- Training specialized agents. Use RL training to produce a model that’s better at your specific tasks.
- Environment SDK — Define agent-callable tools and evaluation logic. Each environment spins up fresh and isolated for every run.
- Eval & Training Platform — Run evaluations at scale on hud.ai. Collect traces. Train models on successful runs.
- Model Gateway — One OpenAI-compatible endpoint at inference.hud.ai for Claude, GPT, Gemini, Grok, and more.
Install
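The SDK is distributed as a Python package; assuming the PyPI package name is `hud-python` (check the repository if this has changed), installation is:

```shell
# Install the HUD SDK (package name assumed to be hud-python)
pip install hud-python
```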
1. Environments: Define Your Agent’s Harness
An environment wraps your code as tools agents can call, and defines scenarios that evaluate what agents do. Each environment spins up fresh and isolated for every evaluation — no shared state, fully reproducible.
Example Workflow
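In sketch form, an environment pairs agent-callable tools with evaluation logic over the resulting state. The class and method names below are illustrative, not the actual SDK API:

```python
# Illustrative sketch only — class and method names are hypothetical,
# not the real HUD SDK API.

class Environment:
    """Holds per-run state; a fresh instance is created for every evaluation."""

    def __init__(self):
        self.files: dict[str, str] = {}  # state is isolated to this run

    # --- tools the agent can call ---
    def write_file(self, path: str, content: str) -> str:
        self.files[path] = content
        return f"wrote {path}"

    def read_file(self, path: str) -> str:
        return self.files.get(path, "")

    # --- evaluation logic: score the final state of the environment ---
    def evaluate(self, expected_path: str, expected_content: str) -> float:
        return 1.0 if self.files.get(expected_path) == expected_content else 0.0


# Each run gets a brand-new instance, so nothing leaks between evaluations.
env = Environment()
env.write_file("notes.txt", "hello")
score = env.evaluate("notes.txt", "hello")  # 1.0
```

Because every run starts from a clean instance, a score depends only on what the agent did during that run — which is what makes results reproducible and usable as training signal.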
2. Tasks & Training: Evaluate and Train
A task is a scenario with specific arguments. Group tasks into tasksets and run them across models. Train models on successful traces to produce a model that’s better at your specific use case.
3. Models: Any Model, One API
Integrations with Anthropic, OpenAI, Gemini, xAI, and more work out of the box. Point any OpenAI-compatible client at inference.hud.ai and use any model. Browse all available models at hud.ai/models.
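Because the gateway speaks the OpenAI chat-completions protocol, any OpenAI-compatible client can target it by swapping the base URL. A stdlib-only sketch of the request shape — note that the `/v1` path and the `example-model` name are assumptions (real model names are listed at hud.ai/models), and you would supply a real API key:

```python
import json
import urllib.request

# Assumed gateway base URL; the /v1 path is an assumption, not confirmed docs.
BASE_URL = "https://inference.hud.ai/v1"

def build_chat_request(model: str, prompt: str, api_key: str) -> urllib.request.Request:
    """Build (but don't send) an OpenAI-style chat-completions request."""
    payload = {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
    }
    return urllib.request.Request(
        f"{BASE_URL}/chat/completions",
        data=json.dumps(payload).encode(),
        headers={
            "Authorization": f"Bearer {api_key}",
            "Content-Type": "application/json",
        },
        method="POST",
    )

# "example-model" is a placeholder; pick a real model from hud.ai/models.
req = build_chat_request("example-model", "Hello!", "YOUR_API_KEY")
# Sending is one call away: urllib.request.urlopen(req)
```

The same shape works with the official `openai` client by passing the gateway URL as `base_url`, which is what "one endpoint for every model" means in practice.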
Next Steps
- Scaffolding — Create environments, define tools and scenarios.
- Tasks & Evaluation — Define tasks, test locally, iterate.
- Deploy & Go Remote — Deploy and run evaluations at scale.
- Environments as Data — Design for useful training signal.
Community
- GitHub — Star the repo and contribute.
- Discord — Join the community.