A TrainingClient drives HUD-managed training for one model: it accumulates gradients from rewarded trajectories and advances the weights behind the model’s gateway slug in place. Inputs are Runs (sent inline) or trace_id strings (resolved server-side); the two can be mixed.
from hud import TrainingClient

trainer = TrainingClient("my-model")   # a trainable gateway slug or model id
The slug comes from forking a trainable model — see hud models.

TrainingClient

TrainingClient(model, *, api_key=None, base_url=None, api_url=None)
| Argument | Default | Meaning |
| --- | --- | --- |
| model | (required) | Trainable model slug or id (the gateway string you also sample). |
| api_key | settings.api_key | HUD API key. |
| base_url | settings.hud_rl_url | Training (RL) service. |
| api_url | settings.hud_api_url | Catalog API (resolves the slug → id once). |

Methods

| Method | Returns | Purpose |
| --- | --- | --- |
| forward_backward(trajectories, *, loss_fn, loss_fn_config=None, group_size=None, reward_scale=1.0, num_substeps=1) | ForwardBackwardResult | Accumulate gradients with a built-in loss_fn. |
| optim_step(*, learning_rate, beta1=0.9, beta2=0.95, eps=1e-8, weight_decay=0.0) | OptimStepResult | Apply gradients, checkpoint, and promote the new weights. |
| step(trajectories, *, learning_rate, ...) | OptimStepResult | One forward_backward then one optim_step. |
| forward_backward_custom(trajectories, loss_fn, *, group_size=None, reward_scale=1.0) | ForwardBackwardResult | Accumulate gradients with a client-side loss (see Custom losses). |
| forward(trajectories, *, group_size=None, reward_scale=1.0) | ForwardResult | Current-policy forward pass returning per-token tensors. |
| backward(forward_id, weights, *, metrics=None) | ForwardBackwardResult | Apply caller-computed per-token gradients to a forward pass. |
| available_losses() | list[str] | Built-in loss_fn names this model’s provider supports. |
Advantages are normalized within contiguous groups of group_size (GRPO); None treats the whole batch as one group. num_substeps splits the batch for gradient accumulation.
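To make the grouping described above concrete, here is a minimal sketch of group-relative normalization over contiguous groups. It assumes the common GRPO form (subtract the group mean, divide by the group standard deviation). The exact formula the HUD service applies, including how reward_scale enters, is not stated here, so normalize_advantages is a hypothetical illustration rather than the service's implementation.

```python
from statistics import mean, pstdev


def normalize_advantages(rewards, group_size=None, eps=1e-6):
    """Hypothetical illustration: GRPO-style advantages per contiguous group.

    group_size=None treats the whole batch as a single group.
    """
    if not rewards:
        return []
    if group_size is None:
        group_size = len(rewards)
    advantages = []
    for start in range(0, len(rewards), group_size):
        group = rewards[start:start + group_size]
        mu = mean(group)
        sigma = pstdev(group)  # population std; an assumption of this sketch
        advantages.extend((r - mu) / (sigma + eps) for r in group)
    return advantages


rewards = [1.0, 0.0, 1.0, 1.0]
grouped = normalize_advantages(rewards, group_size=2)  # two groups of two
whole = normalize_advantages(rewards)                  # one group of four
print(grouped)
print(whole)
```

In this sketch a group whose rewards are all identical produces zero advantages, so it contributes no gradient; only groups with reward spread carry signal.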
for _ in range(steps):
    batch = ...  # a fresh batch of graded Runs
    result = await trainer.step(batch, learning_rate=1e-5, group_size=8)
    print(result.step, result.sampler_path)
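The methods table describes step as one forward_backward followed by one optim_step. If gradients should build up over several batches before a single update, the two calls can be issued separately. The sketch below assumes gradients accumulate across successive forward_backward calls until optim_step runs, as the method descriptions suggest. accumulate_then_step and FakeTrainer are hypothetical names, not part of the SDK; FakeTrainer only stands in for a TrainingClient so the example runs offline.

```python
import asyncio
from types import SimpleNamespace


async def accumulate_then_step(trainer, batches, *, learning_rate, group_size=None):
    """Illustrative helper: several forward_backward calls, then one optim_step."""
    metrics = []
    for batch in batches:
        fb = await trainer.forward_backward(
            batch, loss_fn="importance_sampling", group_size=group_size
        )
        metrics.append(fb.metrics)
    result = await trainer.optim_step(learning_rate=learning_rate)
    return result, metrics


class FakeTrainer:
    """Offline stand-in for TrainingClient; records calls instead of training."""

    def __init__(self):
        self.calls = []

    async def forward_backward(self, trajectories, *, loss_fn, group_size=None):
        self.calls.append("forward_backward")
        return SimpleNamespace(metrics={}, num_datums=len(trajectories))

    async def optim_step(self, *, learning_rate):
        self.calls.append("optim_step")
        return SimpleNamespace(step=1)


fake = FakeTrainer()
result, metrics = asyncio.run(
    accumulate_then_step(fake, [["trace-a"], ["trace-b"]], learning_rate=1e-5)
)
print(fake.calls)
```

With a real TrainingClient the same helper would be awaited from inside an async function in place of the asyncio.run call on the stand-in.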

Inputs

A training input is a recorded trajectory by id, or an inline one:
TrainInput = str | TrajectoryPayload          # trace_id, or inline tokens + reward
Passing a Run builds the right form automatically — inline TrajectoryPayload when it carries token-level samples (local rollout), else its trace_id (remote rollout).
| Type | Fields |
| --- | --- |
| TrajectorySample | prompt_token_ids, output_token_ids, output_logprobs |
| TrajectoryPayload | samples: list[TrajectorySample], reward, trace_id=None |
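The rule for turning a Run into a training input (inline payload when token-level samples exist, otherwise the trace id) can be sketched as follows. The two dataclasses mirror the field table above, with element types guessed from the field names. RunLike and to_train_input are hypothetical stand-ins, since the real Run's attributes and the SDK's conversion code are not shown here.

```python
from dataclasses import dataclass, field
from typing import List, Optional, Union


@dataclass
class TrajectorySample:
    prompt_token_ids: List[int]
    output_token_ids: List[int]
    output_logprobs: List[float]


@dataclass
class TrajectoryPayload:
    samples: List[TrajectorySample]
    reward: float
    trace_id: Optional[str] = None


@dataclass
class RunLike:
    """Hypothetical stand-in for a Run; the real attributes may differ."""

    trace_id: str
    reward: float
    samples: List[TrajectorySample] = field(default_factory=list)


def to_train_input(run: RunLike) -> Union[str, TrajectoryPayload]:
    """Inline payload for local rollouts with tokens, trace_id otherwise."""
    if run.samples:
        return TrajectoryPayload(
            samples=run.samples, reward=run.reward, trace_id=run.trace_id
        )
    return run.trace_id


local_run = RunLike("trace-1", 1.0, [TrajectorySample([1, 2], [3], [-0.1])])
remote_run = RunLike("trace-2", 0.0)
print(type(to_train_input(local_run)).__name__)
print(to_train_input(remote_run))
```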

Built-in losses

loss_fn is an open string validated against the model’s provider; discover the set with await trainer.available_losses(). BuiltinLoss lists the common Tinker names (each is a str):
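| BuiltinLoss | Value | Use |
| --- | --- | --- |
| CROSS_ENTROPY | cross_entropy | Supervised: imitate sampled tokens. |
| IMPORTANCE_SAMPLING | importance_sampling | On-policy policy gradient, weighted by the rollout-logprob ratio. |
| PPO | ppo | Clipped-surrogate policy gradient. |
| CISPO | cispo | Clipped importance-sampling (IS) policy optimization. |
| DRO | dro | Direct reward optimization. |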
loss_fn_config forwards hyperparameters to the loss (e.g. {"epsilon": 0.2} for the ppo clip).
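To show what the epsilon hyperparameter mentioned above controls, here is the textbook PPO clipped surrogate for a single token. This is a sketch of the standard formula only; the provider's server-side ppo loss may differ in detail, and ppo_clipped_objective is a hypothetical name used for illustration.

```python
import math


def ppo_clipped_objective(policy_logprob, sampling_logprob, advantage, epsilon=0.2):
    """Textbook PPO clipped surrogate for one token (illustrative only)."""
    ratio = math.exp(policy_logprob - sampling_logprob)
    clipped = min(max(ratio, 1.0 - epsilon), 1.0 + epsilon)
    # Take the more pessimistic of the unclipped and clipped terms.
    return min(ratio * advantage, clipped * advantage)


# Ratio 1.5 with a positive advantage: the gain is capped at (1 + epsilon) * advantage.
gain = ppo_clipped_objective(math.log(1.5), 0.0, advantage=1.0)
# Ratio 0.5 with a negative advantage: the value is held at (1 - epsilon) * advantage.
penalty = ppo_clipped_objective(math.log(0.5), 0.0, advantage=-1.0)
print(gain, penalty)
```

A larger epsilon lets the policy move further from the rollout policy before the objective stops rewarding the change.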

Custom losses

forward_backward_custom runs the current-policy forward pass server-side, hands you per-token tensors, runs your loss locally (torch autograd), and ships the per-token gradients back. Requires torch (pip install 'hud-python[train]').
import torch
from hud.train import DatumTensors

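def my_loss(data: list[DatumTensors], logprobs: list[torch.Tensor]):
    """Importance-weighted reward objective over action tokens."""
    loss = logprobs[0].new_zeros(())
    for datum, policy_lp in zip(data, logprobs):
        # Rollout constants, cast to match the differentiable policy logprobs.
        sampling_lp = torch.as_tensor(datum.sampling_logprobs, dtype=policy_lp.dtype)
        mask = torch.as_tensor(datum.mask, dtype=policy_lp.dtype)
        ratio = torch.exp(policy_lp - sampling_lp)  # pi_theta / q, per token
        loss = loss - (ratio * datum.reward * mask).sum()  # action tokens only
    return loss, {"trained": float(len(data))}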

await trainer.forward_backward_custom(batch, my_loss, group_size=8)
logprobs[i] holds the current-policy (π_θ) log-probabilities for datum i as a differentiable leaf tensor. Every other field on the matching DatumTensors is a constant:
| DatumTensors | Meaning |
| --- | --- |
| logprobs | Current-policy π_θ, per token (the differentiable leaf). |
| sampling_logprobs | Rollout policy q, per token. |
| mask | 1.0 on action tokens, 0.0 on observation tokens. |
| reward, traj_idx, group_idx | Trajectory reward, source trajectory, GRPO group (or None). |
Under the hood, forward returns a ForwardResult (forward_id + data: list[DatumTensors]); backward(forward_id, weights) applies weights[d][t] = -dC/dlogprobs, the negated gradient of the loss C with respect to the log-probability of token t in datum d.
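For the my_loss example above, the weights passed to backward can be written out by hand, which makes the sign convention easier to see. With C = -Σ ratio_t · reward · mask_t and ratio_t = exp(logprob_t - sampling_logprob_t), the derivative is dC/dlogprob_t = -ratio_t · reward · mask_t, so the weight for each token is ratio_t · reward · mask_t. The sketch below uses plain lists instead of DatumTensors, and weights_for_my_loss is a hypothetical helper; in practice torch autograd produces these values.

```python
import math


def weights_for_my_loss(policy_logprobs, sampling_logprobs, mask, reward):
    """Closed-form -dC/dlogprobs for C = -sum(ratio * reward * mask)."""
    weights = []
    for lp, q, m in zip(policy_logprobs, sampling_logprobs, mask):
        ratio = math.exp(lp - q)  # pi_theta / q for this token
        weights.append(ratio * reward * m)
    return weights


# Two on-policy action tokens (ratio 1) and one masked observation token.
w = weights_for_my_loss(
    policy_logprobs=[-1.0, -2.0, -0.5],
    sampling_logprobs=[-1.0, -2.0, -0.7],
    mask=[1.0, 1.0, 0.0],
    reward=2.0,
)
print(w)
```

Observation tokens get a weight of zero through the mask, and a positive reward yields positive weights on action tokens in this example.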

Results

| Type | Fields |
| --- | --- |
| ForwardBackwardResult | metrics: dict[str, float], num_datums |
| OptimStepResult | step, checkpoint_id, sampler_path, state_path, model |

hud models CLI

Manage trainable models from the shell:
| Command | Purpose |
| --- | --- |
| hud models list | List gateway models. |
| hud models fork <model> --name <slug> | Fork a team-owned trainable model from an existing one. |
| hud models checkpoints <model> | List the checkpoint tree (▶ marks the active head). |
| hud models head <model> [--set <checkpoint-id>] | Show, or set (rollback/select), the active checkpoint. |

See also

Train on rewards: The end-to-end training how-to.

Designing tasks for signal: Produce within-group reward spread so training has signal.