How training works
Evaluation and training share the same three pieces: a task, an agent, and a reward. An eval runs that once; training repeats it, nudging the weights toward higher-reward rollouts after each batch. The reward is already the signal - nothing new gets graded. The loop repeats four steps:

- Roll out a batch of tasks with the current model.
- Score each rollout - the reward you already get for free.
- Nudge the weights toward the rollouts that scored higher.
- Repeat - the next batch samples the now-improved model.
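
The four steps above can be sketched as a toy loop in plain Python. This is not HUD's API: `ToyPolicy`, `rollout`, `nudge`, and `train` are hypothetical stand-ins, and the update is a simple REINFORCE-style step with a batch-mean baseline on a one-weight policy, chosen only so the loop is runnable end to end.

```python
import math
import random


def sigmoid(x: float) -> float:
    return 1.0 / (1.0 + math.exp(-x))


class ToyPolicy:
    """Hypothetical stand-in for a model: one weight, one binary action."""

    def __init__(self, logit: float = 0.0) -> None:
        self.logit = logit

    def rollout(self, rng: random.Random) -> tuple[int, float]:
        # Step 1: sample an action with the current weight.
        action = 1 if rng.random() < sigmoid(self.logit) else 0
        # Step 2: score it. Toy task: action 1 succeeds, action 0 fails.
        reward = float(action)
        return action, reward

    def nudge(self, batch: list[tuple[int, float]], lr: float = 0.5) -> None:
        # Step 3: shift the weight toward actions that beat the batch mean.
        mean = sum(r for _, r in batch) / len(batch)
        p = sigmoid(self.logit)
        # (a - p) is the gradient of log-probability for a sigmoid policy.
        grad = sum((r - mean) * (a - p) for a, r in batch) / len(batch)
        self.logit += lr * grad


def train(steps: int = 30, batch_size: int = 8, seed: int = 0) -> list[float]:
    rng = random.Random(seed)
    policy = ToyPolicy()
    history = []
    for _ in range(steps):
        batch = [policy.rollout(rng) for _ in range(batch_size)]
        policy.nudge(batch)
        # Step 4: the next iteration samples the updated policy.
        history.append(sigmoid(policy.logit))
    return history


if __name__ == "__main__":
    history = train()
    print(f"success probability: {history[0]:.2f} -> {history[-1]:.2f}")
```

In this toy, a batch where every rollout earns the same reward produces a zero update, which is the same effect the prerequisites below guard against.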
| Prerequisite | Why it matters |
|---|---|
| A trainable model you forked | You can only advance weights you own, so you start by forking a base into a private model (below). |
| A taskset whose rewards spread | Training learns from the gap between rollouts; a task every rollout passes or fails teaches nothing (see Why groups matter). |
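
One way to check the second prerequisite before training is to look at the rewards each task produced across several eval rollouts. A minimal sketch, assuming you have already collected those rewards into a dict keyed by task id; the data shape, the task names, and `spread_report` are hypothetical and not part of HUD.

```python
from statistics import pstdev


def spread_report(
    rewards_by_task: dict[str, list[float]], min_std: float = 1e-6
) -> tuple[list[str], list[str]]:
    """Separate tasks whose rollouts disagree from tasks that always pass or always fail."""
    informative, flat = [], []
    for task_id, rewards in rewards_by_task.items():
        if pstdev(rewards) > min_std:
            informative.append(task_id)
        else:
            flat.append(task_id)
    return informative, flat


# Hypothetical rewards from 4 rollouts per task.
sample = {
    "add-2-digit": [1.0, 1.0, 1.0, 1.0],   # always passes: no gap to learn from
    "add-5-digit": [1.0, 0.0, 1.0, 0.0],   # mixed: rollouts can be compared
    "add-50-digit": [0.0, 0.0, 0.0, 0.0],  # always fails: no gap to learn from
}
informative, flat = spread_report(sample)
print("informative:", informative)
print("flat:", flat)
```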
Fork a trainable model
A trainable model is a gateway model whose weights you’re allowed to advance. Only some bases can be forked. List the models to see which are trainable, then fork one into a private, team-owned model.

The `hud models list` command marks forkable bases in its Trainable column. Running `hud models fork` copies a base into a new slug you own, starting from the base’s current weights, so training continues from there.

The slug (for example, `arith-rl`) is both what you sample and what you train - one string flows through the whole loop.
Run the loop
This is the whole thing in Python: each step rolls out a batch with the current weights, hands the graded rollouts to the trainer, and the trainer nudges the weights and promotes them - so the next step samples an improved model.

1 · Set up the agent and trainer
Two objects, both keyed to the model you forked:

- The agent rolls out like in an eval. The `return_token_ids` flag marks it a training rollout, so each response carries the token ids and logprobs the trainer needs.
- The trainer advances the weights behind the slug in place. It runs on a Tinker client under the hood, so you write no ML infra.
2 · Roll out, then nudge
One job spans the session; each step appends a batch and trains on it:

- Open the job with `group=8` - 8 rollouts per task, so the rewards are comparable (next).
- Roll out the batch, the same eval as the previous guide. The runtime sets where it runs; swap `LocalRuntime` for `HUDRuntime()` unchanged.
- Nudge with `trainer.step` - the one line that learns. It scores each rollout against its group, shifts the weights, then promotes them so the gateway serves the new ones at once.
Why groups matter
By default, HUD trains with GRPO, which scores each rollout relative to its group: its advantage is `reward - group_mean`. A rollout counts as good only next to its siblings on the same task.
So if every rollout in a group earns the same reward, every advantage is zero and nothing moves -
however high the average looks. That is why each task runs as a group (group=8) and why your tasks must
produce a spread of rewards. Trainability is a property of your tasks, not the loop;
designing tasks covers how to build that spread in.
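
The arithmetic is easy to check by hand. The sketch below computes the `reward - group_mean` advantage for two groups of 8 rollouts; `group_advantages` is an illustrative helper, not a HUD function. GRPO implementations often also divide by the group's standard deviation, which is left out here to match the formula in the text.

```python
def group_advantages(rewards: list[float]) -> list[float]:
    """Advantage of each rollout relative to its own group's mean reward."""
    mean = sum(rewards) / len(rewards)
    return [r - mean for r in rewards]


# A group with spread: 3 of 8 rollouts pass, so the mean is 0.375.
mixed = group_advantages([1.0, 0.0, 0.0, 1.0, 1.0, 0.0, 0.0, 0.0])
print(mixed)
# [0.625, -0.375, -0.375, 0.625, 0.625, -0.375, -0.375, -0.375]

# A group with no spread: a perfect average, but nothing to learn from.
all_pass = group_advantages([1.0] * 8)
print(all_pass)
# [0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0]
```

The passing rollouts in the mixed group get a positive advantage and the failing ones a negative advantage, so the update has a direction. In the all-pass group every advantage is zero, even though its mean reward is higher.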
Watch it improve
Each `trainer.step` adds a node to the model’s checkpoint tree and promotes it to the head - the
weights the gateway now serves. Read the tree from the shell, or the mean reward from the loop’s print.
Because every step checkpoints, any node is also a rollback point. Roll the head back when the objective
changes: if you edit the reward or the environment mid-run, set the head to a checkpoint taken before the
change so the run measures the new objective from a clean start.
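
To make the head and rollback behaviour concrete, here is a toy model of a checkpoint tree in plain Python. The `CheckpointTree` class and its method names are hypothetical; they are not HUD's data model or CLI. The sketch assumes only what the text states: each step adds a node under the current head and promotes it, and the head can be set back to any earlier node.

```python
from dataclasses import dataclass, field


@dataclass
class Checkpoint:
    id: int
    parent: "int | None"
    note: str


@dataclass
class CheckpointTree:
    """Toy checkpoint tree: node 0 is the forked base, head is what gets served."""

    nodes: dict = field(default_factory=dict)
    head: int = 0

    def __post_init__(self) -> None:
        self.nodes[0] = Checkpoint(0, None, "forked base")

    def step(self, note: str) -> int:
        # A training step adds a child of the current head and promotes it.
        new_id = len(self.nodes)
        self.nodes[new_id] = Checkpoint(new_id, self.head, note)
        self.head = new_id
        return new_id

    def set_head(self, checkpoint_id: int) -> None:
        # Rollback: point the head at an existing node; nothing is deleted.
        if checkpoint_id not in self.nodes:
            raise KeyError(checkpoint_id)
        self.head = checkpoint_id

    def lineage(self) -> list:
        # Walk parents from the head back to the base.
        path, node = [], self.head
        while node is not None:
            path.append(node)
            node = self.nodes[node].parent
        return path[::-1]


tree = CheckpointTree()
tree.step("old reward, step 1")        # node 1
before_edit = tree.step("old reward, step 2")  # node 2
tree.step("trained while reward was being edited")  # node 3
tree.set_head(before_edit)             # roll back to before the change
tree.step("new reward, step 1")        # node 4, child of node 2
print(tree.lineage())                  # [0, 1, 2, 4]
```

Node 3 stays in the tree as an abandoned branch; only the head moves, so the new objective is measured from weights that never saw the mixed-objective step.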
TrainingClient API, the built-in and
custom losses, and the checkpoint tree in full.