The rewards are the signal: the tasks you evaluate are already training data — every rollout returns a Run carrying a trajectory and a reward. You can feed that signal into HUD’s managed trainer (a trainable model whose weights advance in place) or into your own GRPO/PPO loop.

Prerequisites

  • A task and an agent (see Tasks and Models).
  • A task with spread in its rewards — a group that all scores 0.0 (or all 1.0) produces zero advantage and teaches nothing. See Designing tasks for signal.
  • For the managed trainer: a trainable model (created below).

Create a trainable model

A trainable model is a private, team-owned model whose weights you advance. Fork one from any trainable base — the fork starts from the base’s active checkpoint, so you continue where it left off:
hud models fork Qwen/Qwen3.5-4B --name arith-rl
The new model’s slug (arith-rl) is both what you sample (through the gateway, like any other model) and what you train. Inspect a model’s catalog entry any time with hud models list.

Train it

TrainingClient targets one model slug and advances the weights behind it. The loop is: roll out a batch, hand the Runs to step (one forward_backward with a built-in loss, then one optim_step that checkpoints and promotes), and the next rollout samples the updated policy.
train.py
import asyncio

from hud import TrainingClient
from hud.agents import create_agent
from hud.eval import Job

async def main():
    # return_token_ids marks these as training rollouts: the gateway returns
    # token ids + per-token logprobs, recorded on each turn for training.
    agent = create_agent("arith-rl", completion_kwargs={"extra_body": {"return_token_ids": True}})
    trainer = TrainingClient("arith-rl")
    taskset, runtime = ...  # your taskset + runtime (see Tasks / Deploy)

    session = await Job.start("arith-rl", group=8)   # 8 rollouts per task (GRPO group)
    for _step in range(10):
        start = len(session.runs)
        await taskset.run(agent, runtime=runtime, job=session)
        batch = session.runs[start:]
        result = await trainer.step(batch, learning_rate=1e-5, group_size=8)
        # Log the optimizer step and the sampler path the gateway now serves.
        print(f"optim {result.step} -> {result.sampler_path}")

asyncio.run(main())
step is the common case; call forward_backward and optim_step separately when you want the metrics or gradient accumulation (num_substeps) in between. Inputs are Runs (sent inline) or trace_id strings (resolved from trajectories the platform already holds) — mix freely.
Built-in losses (importance_sampling, ppo, cispo, dro, cross_entropy) run server-side and need no local ML deps. List the set a model supports with await trainer.available_losses().

Custom losses

To author the loss yourself — e.g. GLM-style double-sided importance sampling — use forward_backward_custom. The service runs the current-policy forward pass and returns per-token tensors (DatumTensors); your function turns them into per-token gradients (client-side, with torch), which the service applies:
import torch
from hud.train import DatumTensors

def my_loss(data: list[DatumTensors], logprobs: list[torch.Tensor]):
    loss = logprobs[0].new_zeros(())
    for datum, policy_lp in zip(data, logprobs):
        ratio = torch.exp(policy_lp - torch.tensor(datum.sampling_logprobs))
        mask = torch.tensor(datum.mask)
        loss = loss - (ratio * datum.reward * mask).sum()
    return loss, {}

await trainer.forward_backward_custom(batch, my_loss, group_size=8)
await trainer.optim_step(learning_rate=1e-5)
Requires torch (pip install 'hud-python[train]'); the built-in path does not. A full GRPO-baseline version lives in the rl-training cookbook.
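To make the arithmetic in my_loss concrete, the following is a torch-free sketch of the same per-datum sum. It only evaluates the loss value; it computes no gradients and is not a substitute for the torch version. The name surrogate_loss and the toy numbers are illustrative and not part of hud, and it assumes a mask value of 1 marks a token that contributes to the loss.

```python
import math


def surrogate_loss(policy_lp, sampling_lp, mask, reward):
    """Plain-Python mirror of the per-datum term in my_loss (values only)."""
    total = 0.0
    for new_lp, old_lp, m in zip(policy_lp, sampling_lp, mask):
        ratio = math.exp(new_lp - old_lp)  # importance ratio: current / sampling policy
        total += ratio * reward * m        # tokens with mask 0 contribute nothing
    return -total


sampling = [-0.5, -1.0, -2.0]
mask = [0, 1, 1]  # assumption: 1 = token counts toward the loss

# On-policy case: identical logprobs give ratio 1.0 on every token,
# so the loss is -(reward * number of unmasked tokens).
on_policy = surrogate_loss(sampling, sampling, mask, reward=1.0)
print(on_policy)  # -2.0

# Raising the logprob of an unmasked token on a rewarded rollout lowers the loss.
shifted = surrogate_loss([-0.5, -0.9, -2.0], sampling, mask, reward=1.0)
print(shifted < on_policy)  # True
```

With a positive reward, the loss falls as the current policy assigns more probability to the sampled tokens, which is the direction the optimizer then follows.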

Inspect progress

Each optim_step adds a node to the model’s checkpoint tree and promotes it to the head — the weights the gateway now serves:
hud models checkpoints arith-rl              # the tree, oldest first (▶ = active head)
hud models head arith-rl                     # the active checkpoint + its stats
hud models head arith-rl --set <checkpoint>  # roll back / select a different head
Setting the head points the gateway at a different checkpoint (a rollback or a branch point); the next optim_step extends the tree from there.
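The head-and-tree behavior described above can be pictured with a toy model. This is a conceptual sketch only, not how HUD stores checkpoints: the CheckpointTree class and the checkpoint names ("base", "step-1", and so on) are hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class CheckpointTree:
    """Toy model: parent links plus a movable head pointer."""

    parents: dict = field(default_factory=lambda: {"base": None})
    head: str = "base"

    def optim_step(self, name):
        # A new checkpoint becomes a child of the current head, then is promoted.
        self.parents[name] = self.head
        self.head = name

    def set_head(self, name):
        # Selecting a head only moves the pointer; no checkpoint is deleted.
        if name not in self.parents:
            raise KeyError(name)
        self.head = name


tree = CheckpointTree()
tree.optim_step("step-1")
tree.optim_step("step-2")
tree.set_head("step-1")    # roll back
tree.optim_step("step-3")  # extends from step-1, creating a branch
print(tree.parents["step-3"], tree.head)  # step-1 step-3
```

In this model, step-2 stays in the tree after the rollback; it is simply no longer on the path to the head.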

Why grouping matters

GRPO advantages are relative within a group: reward - mean, optionally divided by the group’s std. If every rollout in a group earns the same reward, the advantage is zero and the model learns nothing from that task. A good training task produces a spread of rewards across the group — a task-design concern, covered in Designing tasks for signal.
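A minimal sketch of that computation shows why a flat group carries no signal. The function name group_advantages, the use of the population standard deviation, and the eps guard are assumptions made for illustration; the managed trainer's exact normalization may differ.

```python
from statistics import fmean, pstdev


def group_advantages(rewards, normalize_std=True, eps=1e-6):
    """Group-relative advantages: reward - group mean, optionally / group std."""
    mean = fmean(rewards)
    centered = [r - mean for r in rewards]
    if not normalize_std:
        return centered
    std = pstdev(rewards)
    # eps keeps the division defined when every reward in the group is equal.
    return [c / (std + eps) for c in centered]


# Two of eight rollouts succeed: successes get positive advantage, failures negative.
mixed = group_advantages([1.0, 0.0, 0.0, 1.0, 0.0, 0.0, 0.0, 0.0], normalize_std=False)
print(mixed)  # [0.75, -0.25, -0.25, 0.75, -0.25, -0.25, -0.25, -0.25]

# Every rollout succeeds: all advantages are zero, so there is nothing to learn from.
flat = group_advantages([1.0] * 8)
print(flat)  # [0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0]
```

The same zero result appears for a group that scores all 0.0, which is why tasks that are either trivially easy or impossible for the current policy contribute no gradient.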

Next steps

Designing tasks for signal

Build tasks that produce within-group spread and resist reward hacking.

Reference: training

TrainingClient, the loss set, custom losses, and hud models.

Run on any model

Choose the policy you’re training.

Package & deploy

Scale the rollouts that feed training.