HUD Documentation — Evaluations and RL Environments.

A TrainingClient drives HUD-managed training for one model: it accumulates gradients from rewarded trajectories and advances the weights behind the model’s gateway slug in place. Inputs are Runs (sent inline) or trace_id strings (resolved server-side); the two can be mixed.

from hud import TrainingClient

trainer = TrainingClient("my-model")   # a trainable gateway slug or model id

The slug comes from forking a trainable model — see hud models.

TrainingClient

TrainingClient(model, *, api_key=None, base_url=None, api_url=None)

Argument	Default	Meaning
`model`	—	Trainable model slug or id (the same gateway string you sample from).
`api_key`	`settings.api_key`	HUD API key.
`base_url`	`settings.hud_rl_url`	Training (RL) service.
`api_url`	`settings.hud_api_url`	Catalog API (resolves the slug → id once).

Methods

Method	Returns	Purpose
`forward_backward(trajectories, *, loss_fn, loss_fn_config=None, group_size=None, reward_scale=1.0, num_substeps=1)`	`ForwardBackwardResult`	Accumulate gradients with a built-in `loss_fn`.
`optim_step(*, learning_rate, beta1=0.9, beta2=0.95, eps=1e-8, weight_decay=0.0)`	`OptimStepResult`	Apply gradients, checkpoint, and promote the new weights.
`step(trajectories, *, learning_rate, ...)`	`OptimStepResult`	One `forward_backward` then one `optim_step`.
`forward_backward_custom(trajectories, loss_fn, *, group_size=None, reward_scale=1.0)`	`ForwardBackwardResult`	Accumulate gradients with a client-side loss (see Custom losses).
`forward(trajectories, *, group_size=None, reward_scale=1.0)`	`ForwardResult`	Current-policy forward pass returning per-token tensors.
`backward(forward_id, weights, *, metrics=None)`	`ForwardBackwardResult`	Apply caller-computed per-token gradients to a forward pass.
`available_losses()`	`list[str]`	Built-in `loss_fn` names this model’s provider supports.

Advantages are normalized within contiguous groups of group_size trajectories (GRPO); group_size=None treats the whole batch as one group. num_substeps splits the batch into substeps for gradient accumulation.
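To make the grouping concrete, here is a minimal pure-Python sketch of group-wise advantage normalization. It assumes "normalized" means subtracting the group mean and dividing by the group standard deviation; the service may use a different formula (for example mean-centering only). The function name group_advantages and the eps parameter are hypothetical and not part of the hud API.

```python
from statistics import fmean, pstdev


def group_advantages(rewards, group_size=None, eps=1e-6):
    """Normalize rewards within contiguous groups of `group_size`.

    Assumption: advantage = (reward - group mean) / (group std + eps).
    The formula HUD applies server-side may differ.
    """
    n = len(rewards)
    size = n if group_size is None else group_size  # None -> whole batch is one group
    if size <= 0 or n % size != 0:
        raise ValueError("batch length must be a positive multiple of group_size")
    advantages = []
    for start in range(0, n, size):
        group = rewards[start:start + size]
        mean = fmean(group)
        std = pstdev(group)  # population std; 0.0 when every reward in the group matches
        advantages.extend((r - mean) / (std + eps) for r in group)
    return advantages
```

Under this assumption, a group whose rewards are all identical yields all-zero advantages, so it contributes no gradient signal; only groups with reward spread move the weights.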

for _ in range(steps):
    batch = ...  # a fresh batch of graded Runs
    result = await trainer.step(batch, learning_rate=1e-5, group_size=8)
    print(result.step, result.sampler_path)
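The Methods table describes step as one forward_backward followed by one optim_step. The sketch below unrolls that sequence using only the signatures listed in the table. The helper name unrolled_step is hypothetical, and the choice of loss_fn="importance_sampling" is for illustration only; whatever default loss step itself uses is not stated here.

```python
async def unrolled_step(trainer, batch, *, learning_rate, group_size=None):
    """Sketch: one forward_backward, then one optim_step.

    `trainer` is expected to be a TrainingClient. The loss name is an
    illustrative choice from the built-in list, not a documented default.
    """
    # Accumulate gradients for this batch with a built-in loss.
    fb = await trainer.forward_backward(
        batch, loss_fn="importance_sampling", group_size=group_size
    )
    # Apply the accumulated gradients and promote the new weights.
    result = await trainer.optim_step(learning_rate=learning_rate)
    return fb, result
```

Splitting the two calls is useful when you want to inspect the forward_backward metrics before deciding whether to apply the update.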

Inputs

A training input is a recorded trajectory by id, or an inline one:

TrainInput = str | TrajectoryPayload          # trace_id, or inline tokens + reward

Passing a Run builds the right form automatically — inline TrajectoryPayload when it carries token-level samples (local rollout), else its trace_id (remote rollout).

Type	Fields
`TrajectorySample`	`prompt_token_ids`, `output_token_ids`, `output_logprobs`
`TrajectoryPayload`	`samples: list[TrajectorySample]`, `reward`, `trace_id=None`
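The choice between the inline and by-id forms can be sketched with stand-in types. The dataclasses below only mirror the field names in the table above; the field types, the FakeRun class, its attribute names, and to_train_input are all hypothetical illustrations, not the library's own definitions.

```python
from dataclasses import dataclass, field
from typing import Optional, Union


@dataclass
class TrajectorySample:
    prompt_token_ids: list[int]
    output_token_ids: list[int]
    output_logprobs: list[float]


@dataclass
class TrajectoryPayload:
    samples: list[TrajectorySample]
    reward: float
    trace_id: Optional[str] = None


TrainInput = Union[str, TrajectoryPayload]


@dataclass
class FakeRun:
    """Hypothetical stand-in for a Run; attribute names are assumptions."""
    trace_id: str
    reward: float
    samples: list[TrajectorySample] = field(default_factory=list)


def to_train_input(run: FakeRun) -> TrainInput:
    if run.samples:
        # Local rollout: token-level samples are available, so send them inline.
        return TrajectoryPayload(samples=run.samples, reward=run.reward, trace_id=run.trace_id)
    # Remote rollout: send only the id and let the service resolve the trace.
    return run.trace_id
```

Because both forms satisfy TrainInput, a single batch can contain a mix of payloads and trace_id strings.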

Built-in losses

loss_fn is an open string validated against the model’s provider; discover the set with await trainer.available_losses(). BuiltinLoss lists the common Tinker names (each is a str):

`BuiltinLoss`	Value	Use
`CROSS_ENTROPY`	`cross_entropy`	Supervised — imitate sampled tokens.
`IMPORTANCE_SAMPLING`	`importance_sampling`	On-policy PG, rollout-logprob ratio.
`PPO`	`ppo`	Clipped-surrogate PG.
`CISPO`	`cispo`	Clipped IS policy optimization.
`DRO`	`dro`	Direct reward optimization.

loss_fn_config forwards hyperparameters to the loss (e.g. {"epsilon": 0.2} for the ppo clip).
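For intuition about what an epsilon clip controls, the following is a sketch of the standard PPO clipped surrogate for a single token. It is the textbook formula, not a statement of how the provider implements its ppo loss; the function name and argument names are hypothetical.

```python
import math


def ppo_clipped_objective(policy_logprob, sampling_logprob, advantage, epsilon=0.2):
    """Textbook per-token objective: min(r * A, clip(r, 1 - eps, 1 + eps) * A)."""
    # Probability ratio between the current policy and the rollout policy.
    ratio = math.exp(policy_logprob - sampling_logprob)
    # Keep the ratio inside [1 - epsilon, 1 + epsilon].
    clipped = max(1.0 - epsilon, min(ratio, 1.0 + epsilon))
    # Taking the minimum makes the objective pessimistic: gains from moving
    # the ratio beyond the clip range are ignored.
    return min(ratio * advantage, clipped * advantage)
```

In this formulation a larger epsilon lets each update move the policy further from the rollout policy before the objective stops rewarding the change.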

Custom losses

forward_backward_custom runs the current-policy forward pass server-side, hands you per-token tensors, runs your loss locally (torch autograd), and ships the per-token gradients back. Requires torch (pip install 'hud-python[train]').

import torch
from hud.train import DatumTensors

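def my_loss(data: list[DatumTensors], logprobs: list[torch.Tensor]):
    # Importance-weighted policy-gradient loss over action tokens only.
    loss = logprobs[0].new_zeros(())
    for datum, policy_lp in zip(data, logprobs):
        # Rollout logprobs and mask are constants; match the policy tensor's dtype and device.
        sampling_lp = torch.as_tensor(datum.sampling_logprobs, dtype=policy_lp.dtype, device=policy_lp.device)
        mask = torch.as_tensor(datum.mask, dtype=policy_lp.dtype, device=policy_lp.device)
        ratio = torch.exp(policy_lp - sampling_lp)  # π_θ / q per token
        loss = loss - (ratio * datum.reward * mask).sum()
    # Return the scalar loss plus a dict of float metrics.
    return loss, {"trained": float(len(data))}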

await trainer.forward_backward_custom(batch, my_loss, group_size=8)

logprobs[i] holds the current-policy (π_θ) log-probabilities for datum i as a differentiable leaf tensor. All other fields on the matching DatumTensors are constants:

`DatumTensors`	Meaning
`logprobs`	Current-policy π_θ, per token (the differentiable leaf).
`sampling_logprobs`	Rollout policy q, per token.
`mask`	`1.0` on action tokens, `0.0` on observation tokens.
`reward`, `traj_idx`, `group_idx`	Trajectory reward, source trajectory, GRPO group (or `None`).

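Under the hood, forward returns a ForwardResult (forward_id + data: list[DatumTensors]), and backward(forward_id, weights) applies weights[d][t] = -dC/dlogprobs[d][t], the negated gradient of the loss C with respect to the log-probability of token t in datum d.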
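When the loss has a closed-form gradient, the weights for backward can be computed without autograd. The sketch below derives them for the importance-sampling loss used in my_loss, C = -sum(ratio * reward * mask). It assumes the DatumTensors fields can be read as plain sequences of floats, which is not confirmed here; the function name is hypothetical.

```python
import math


def importance_sampling_weights(policy_logprobs, sampling_logprobs, mask, reward):
    """Per-token weights[t] = -dC/dlogprobs[t] for C = -sum(ratio * reward * mask).

    With ratio[t] = exp(logprobs[t] - sampling_logprobs[t]), the derivative of
    ratio[t] with respect to logprobs[t] is ratio[t] itself, so
    dC/dlogprobs[t] = -ratio[t] * reward * mask[t], and the weight is its negation.
    """
    return [
        math.exp(lp - q) * reward * m
        for lp, q, m in zip(policy_logprobs, sampling_logprobs, mask)
    ]
```

A plausible calling pattern, under the same assumption, is to build weights as one such list per entry of the ForwardResult's data and pass them to backward together with its forward_id. Observation tokens (mask 0.0) receive zero weight and therefore no gradient.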

Results

Type	Fields
`ForwardBackwardResult`	`metrics: dict[str, float]`, `num_datums`
`OptimStepResult`	`step`, `checkpoint_id`, `sampler_path`, `state_path`, `model`

`hud models` CLI

Manage trainable models from the shell:

Command	Purpose
`hud models list`	List gateway models.
`hud models fork <model> --name <slug>`	Fork a team-owned trainable model from an existing one.
`hud models checkpoints <model>`	List the checkpoint tree (▶ marks the active head).
`hud models head <model> [--set <checkpoint-id>]`	Show — or set (rollback/select) — the active checkpoint.

See also

Train on rewards

The end-to-end training how-to.

Designing tasks for signal

Produce within-group reward spread so training has signal.
