Each rollout produces a Run carrying a trajectory and a reward. You can feed that signal into HUD’s managed trainer (a trainable model whose weights advance in place) or into your own GRPO/PPO loop.
Prerequisites
- A task and an agent (see Tasks and Models).
- A task with spread in its rewards: a group that all scores 0.0 (or all 1.0) produces zero advantage and teaches nothing. See Designing tasks for signal.
- For the managed trainer: a trainable model (created below).
Create a trainable model
A trainable model is a private, team-owned model whose weights you advance. Fork one from any trainable base; the fork starts from the base’s active checkpoint, so you continue where it left off. The model’s slug (for example, arith-rl) is both what you sample (through the gateway, like any other model) and what you train. Inspect a model’s catalog entry any time with hud models list.
Train it
TrainingClient targets one model slug and advances the weights behind it. The loop is: roll out a batch, hand the Runs to step (one forward_backward with a built-in loss, then one optim_step that checkpoints and promotes), and the next rollout samples the updated policy.
train.py
step is the common case; call forward_backward and optim_step separately when you want the metrics or gradient accumulation (num_substeps) in between. Inputs are Runs (sent inline) or trace_id strings (resolved from trajectories the platform already holds) — mix freely.
Built-in losses (importance_sampling, ppo, cispo, dro, cross_entropy) run server-side and need no local ML deps. List the set a model supports with await trainer.available_losses().
Custom losses
To author the loss yourself (e.g. GLM-style double-sided importance sampling), use forward_backward_custom. The service runs the current-policy forward pass and returns per-token tensors (DatumTensors); your function turns them into per-token gradients (client-side, with torch), which the service applies.
The custom path needs local ML dependencies (pip install 'hud-python[train]'); the built-in path does not. A full GRPO-baseline version lives in the rl-training cookbook.
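To make the "per-token tensors in, per-token gradients out" shape concrete, here is a minimal sketch of one plausible reading of double-sided importance sampling: tokens whose importance ratio leaves a two-sided band contribute no gradient. It uses plain Python lists rather than torch or DatumTensors, and the function name, argument names, and band widths are all assumptions for illustration, not the SDK's contract.

```python
import math


def double_sided_is_grads(new_logprobs, old_logprobs, advantages,
                          eps_low=0.2, eps_high=0.2):
    """Per-token gradient of the loss -ratio * advantage w.r.t. the new logprob.

    Hypothetical helper: inputs are plain lists standing in for whatever
    per-token tensors the service returns. Tokens whose ratio falls outside
    [1 - eps_low, 1 + eps_high] are masked to zero on both sides.
    """
    grads = []
    for lp_new, lp_old, adv in zip(new_logprobs, old_logprobs, advantages):
        # Importance ratio between the current policy and the sampling policy.
        ratio = math.exp(lp_new - lp_old)
        if 1.0 - eps_low <= ratio <= 1.0 + eps_high:
            # d(-ratio * adv) / d(lp_new) = -ratio * adv
            grads.append(-ratio * adv)
        else:
            # Outside the band: drop the token's contribution entirely.
            grads.append(0.0)
    return grads
```

A real custom loss would do the same arithmetic on torch tensors and return them in the form forward_backward_custom expects.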
Inspect progress
Each optim_step adds a node to the model’s checkpoint tree and promotes it to the head, the weights the gateway now serves.
The next optim_step extends the tree from there.
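The checkpoint tree can be pictured with a small in-memory model. This is an illustrative sketch only, not the platform's implementation: the CheckpointTree class, its integer ids, and the explicit promote method are assumptions made to show how a head pointer and a branching tree interact.

```python
from dataclasses import dataclass, field
from typing import Dict, Optional


@dataclass
class Checkpoint:
    id: int
    parent: Optional[int]  # None for the first checkpoint in this toy model


@dataclass
class CheckpointTree:
    """Toy model: a tree of checkpoints plus a head pointer (hypothetical)."""
    nodes: Dict[int, Checkpoint] = field(default_factory=dict)
    head: Optional[int] = None

    def optim_step(self) -> int:
        # Add a child of the current head, then promote it to the head.
        new_id = len(self.nodes)
        self.nodes[new_id] = Checkpoint(new_id, self.head)
        self.head = new_id
        return new_id

    def promote(self, checkpoint_id: int) -> None:
        # Assumed operation: point the head at an existing checkpoint.
        if checkpoint_id not in self.nodes:
            raise KeyError(checkpoint_id)
        self.head = checkpoint_id


tree = CheckpointTree()
first = tree.optim_step()    # head -> first
second = tree.optim_step()   # child of first, head -> second
tree.promote(first)          # move the head back
branch = tree.optim_step()   # new child of first, so the tree now branches
```

In this model, moving the head back and stepping again creates a sibling branch rather than overwriting the later checkpoint.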
Why grouping matters
GRPO advantages are relative within a group: reward - mean, optionally divided by the group’s std. If every rollout in a group earns the same reward, the advantage is zero and the model learns nothing from that task. A good training task produces a spread of rewards across the group, which is a task-design concern covered in Designing tasks for signal.
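The arithmetic above can be sketched in a few lines of plain Python. The helper names, the use of the population standard deviation, and the small eps added to the divisor are assumptions for illustration; a given trainer may normalize differently.

```python
from statistics import mean, pstdev


def group_advantages(rewards, normalize=True, eps=1e-6):
    """Group-relative advantages: reward - mean, optionally divided by std."""
    if not rewards:
        return []
    mu = mean(rewards)
    centered = [r - mu for r in rewards]
    if not normalize:
        return centered
    # eps keeps the division defined when every reward in the group is equal.
    sigma = pstdev(rewards)
    return [c / (sigma + eps) for c in centered]


def has_signal(rewards, tol=1e-9):
    """True when the group's rewards are not all (nearly) identical."""
    return max(rewards) - min(rewards) > tol


flat = group_advantages([1.0, 1.0, 1.0, 1.0])        # every advantage is 0.0
mixed = group_advantages([0.0, 1.0], normalize=False)  # [-0.5, 0.5]
```

A check like has_signal can be run on each group before training to see how many groups would contribute a zero gradient.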
Next steps
Designing tasks for signal
Build tasks that produce within-group spread and resist reward hacking.
Reference: training
TrainingClient, the loss set, custom losses, and hud models.
Run on any model
Choose the policy you’re training.
Package & deploy
Scale the rollouts that feed training.