The robot capability is in beta. The wire protocol is versioned openpi/0; the contract schema is v0. Expect additive changes while the design settles.
A 50 Hz policy can’t stream actions over tool calls.
So the robot capability is instead a continuous observation/action loop over WebSocket: the
environment streams observations (camera frames, robot state) and the agent streams back actions, as
fast as the policy can run. The wire format is openpi-inspired (msgpack with numpy serialization),
so existing openpi policy servers only need a thin adapter.
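As a rough illustration of what "msgpack with numpy serialization" involves, the sketch below turns an array into plain types (bytes, a dtype string, a shape list) that a msgpack encoder can carry, and rebuilds it on the other side. The field names and the helper names `pack_array` / `unpack_array` are assumptions made for this example, not the exact openpi/0 wire layout.

```python
import numpy as np


def pack_array(arr: np.ndarray) -> dict:
    """Turn an array into plain types a msgpack encoder can carry.

    The key names here are illustrative, not the real wire format.
    """
    arr = np.ascontiguousarray(arr)
    return {
        "__ndarray__": True,
        "data": arr.tobytes(),
        "dtype": arr.dtype.str,
        "shape": list(arr.shape),
    }


def unpack_array(obj: dict) -> np.ndarray:
    """Rebuild the array from the dict produced by pack_array."""
    flat = np.frombuffer(obj["data"], dtype=np.dtype(obj["dtype"]))
    # frombuffer returns a read-only view, so copy to get a writable array.
    return flat.reshape(obj["shape"]).copy()


# Round-trip a small camera frame.
frame = np.zeros((4, 6, 3), dtype=np.uint8)
frame[1, 2] = (255, 128, 0)
restored = unpack_array(pack_array(frame))
```

In a real pipeline, functions like these would typically be plugged into the msgpack encoder and decoder as hooks, so application code only ever sees numpy arrays.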
Everything below ships behind the robot extra, which pulls in numpy and openpi-client.
Overview
Like other HUD workflows, robotics has an environment side (the server - containerized, served on the runtime) and an agent side (the client - swappable, a model with a harness). For robotics, the environment side translates incoming actions into changes in the digital or physical environment and serves observations. The agent side owns the policy: it reads those observations, runs inference, and sends actions back.
Both sides need building, and this is where robotics differs from the rest of HUD. For LLM agents you can lean on a standard inference provider and a stock harness, so often the environment is the only thing you write. For robot policies there is no equivalent - no hosted inference provider, no standard harness. HUD ships tooling for both sides: a handful of small, named abstractions you implement, with the framework owning everything in between (the serve loop, the wire protocol, telemetry to the platform).
Environment side - owns the simulator and serves frames:
- RobotBridge - the one class you implement around your sim: reset / step / get_observation. The framework owns the WebSocket serve loop and the single-agent connection.
- RobotEndpoint - wraps the bridge; it is the environment server’s handle for the sim (even if the sim is running in another process).
Agent side - owns the policy:
- RobotAgent - the harness: connects to the env and bridge, owns the adapter and model, and drives the model until the env terminates.
- Model - the actual stateless checkpoint of the model (includes pre-/post-processing).
- Adapter - translates the env’s observation space to the model’s, and the model’s action space to the env’s.
Environment side
You implement one class - the bridge:
- reset starts a fresh episode for a task and returns its prompt (the text the agent is given).
- step applies one action and advances the sim a tick, setting success / terminated as the episode plays out.
- get_observation returns a structured dict of the current observation plus whether the episode is done.
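To make the three methods concrete, here is a minimal sketch of the logic a bridge wraps, written as a standalone class so it runs without the SDK. `ToyReachBridge`, its constructor arguments, and the feature names `image` and `state` are hypothetical; a real bridge would subclass RobotBridge, whose exact method signatures are not shown here.

```python
import numpy as np


class ToyReachBridge:
    """Hypothetical 1-D reach task: nudge a point until it is near the goal."""

    def __init__(self, goal: float = 1.0, max_steps: int = 50):
        self.goal = goal
        self.max_steps = max_steps
        self.position = 0.0
        self.steps = 0
        self.success = False
        self.terminated = False

    def reset(self, task: str = "reach") -> str:
        # Start a fresh episode and return the prompt the agent is given.
        self.position, self.steps = 0.0, 0
        self.success = self.terminated = False
        return f"Move the point to x = {self.goal}"

    def step(self, action: np.ndarray) -> None:
        # Apply one action, advance one tick, update success/terminated.
        self.position += float(action[0])
        self.steps += 1
        self.success = abs(self.position - self.goal) < 0.05
        self.terminated = self.success or self.steps >= self.max_steps

    def get_observation(self):
        # Return (data, terminated): numpy arrays keyed by feature name.
        data = {
            "image": np.zeros((8, 8, 3), dtype=np.uint8),          # HWC frame
            "state": np.array([self.position], dtype=np.float32),  # 1-D state
        }
        return data, self.terminated


bridge = ToyReachBridge()
prompt = bridge.reset()
while not bridge.terminated:
    bridge.step(np.array([0.25], dtype=np.float32))
data, terminated = bridge.get_observation()
```

The driving loop at the bottom stands in for the framework's serve loop, which in practice calls step once per action received from the agent.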
The get_observation function has a strict output convention, described below.
The openpi observation convention
The data dict is the strict part. It is what the agent indexes by name and feeds straight to the policy, so a few things have to be exactly right:
- Values are numpy arrays - nothing else survives the trip into the adapter and the trace viewer.
- Each key is an observation feature’s name, verbatim from the contract. The agent does data[name] directly off the contract.
- Images are HWC arrays ([H, W, 3], uint8 RGB).
- State is a single 1-D array, passed to the policy as float32; everything rank-1 is treated as state.
- terminated is a sibling, not part of data - return it as the second item of your (data, terminated) tuple and the framework attaches it to the frame.
Actions come back the same way: the agent sends them under openpi’s actions key, and your step(action) receives an already-decoded numpy array - you never touch the codec.
RobotEndpoint is the env’s control handle on the bridge - the one surface it drives an episode
through. start / stop bring the bridge’s socket up and down; capability publishes the robot
binding once that URL exists (call it after start); reset begins an episode and returns its
prompt; result returns the episode’s score. It’s control-plane only - the agent’s observe/act loop
tunnels straight to the bridge’s WebSocket - and the same calls work whether the bridge is local
or in another process.
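The observation convention described earlier in this section can be checked mechanically before a frame leaves the bridge. The helper below is a sketch of such a check under the rules as stated there; `check_observation` and the example feature names are hypothetical and not part of the SDK.

```python
import numpy as np


def check_observation(data: dict, feature_names: list) -> list:
    """Return a list of convention violations (empty means the dict looks fine)."""
    problems = []
    if "terminated" in data:
        problems.append("terminated belongs in the (data, terminated) tuple, not in data")
    for name in feature_names:
        if name not in data:
            problems.append(f"missing feature: {name}")
    rank1 = []
    for key, value in data.items():
        if key == "terminated":
            continue
        if key not in feature_names:
            problems.append(f"{key}: not a feature name from the contract")
        if not isinstance(value, np.ndarray):
            problems.append(f"{key}: not a numpy array")
        elif value.ndim == 3 and (value.shape[2] != 3 or value.dtype != np.uint8):
            problems.append(f"{key}: images must be [H, W, 3] uint8")
        elif value.ndim == 1:
            rank1.append(key)
    if len(rank1) > 1:
        problems.append(f"more than one rank-1 array would be treated as state: {rank1}")
    return problems


names = ["front_camera", "state"]
good = {
    "front_camera": np.zeros((8, 8, 3), dtype=np.uint8),
    "state": np.zeros(7, dtype=np.float32),
}
bad = {
    "front_camera": np.zeros((3, 8, 8), dtype=np.float32),  # CHW float, not HWC uint8
    "state": [0.0] * 7,                                      # list, not ndarray
    "terminated": False,                                     # should be a sibling
}
good_problems = check_observation(good, names)
bad_problems = check_observation(bad, names)
```

Running a check like this in a test for your bridge catches layout mistakes before they surface as confusing policy behavior.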
Agent side
The harness lives in hud.agents.robot.
We provide a base class called RobotAgent. It connects to the robot
binding, reads the contract, then runs the rollout loop including model inference
until the environment terminates. You supply two objects.
- Model - something with an infer() function that returns action chunks (pre-/post-processing included).
- Adapter - translates env ↔ model spaces.
Run the agent with Taskset(...).run(agent, runtime=...) - against any substrate
serving an env with the robot capability and an adaptable embodiment.
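The division of labor between harness, model, and adapter can be sketched as a small loop. Everything here is a stand-in: `ToyEnv`, `ToyAdapter` (with hypothetical `to_model` / `to_env` methods), `ToyModel`, and `rollout` are invented for illustration and do not mirror RobotAgent's real internals. The only idea taken from the text is that a blocking infer can be run off the event loop in a worker thread.

```python
import asyncio
import numpy as np


class ToyEnv:
    """Stand-in for the env side: a point that must reach x >= 1."""

    def __init__(self):
        self.x = 0.0
        self.terminated = False

    def get_observation(self):
        return {"state": np.array([self.x], dtype=np.float32)}, self.terminated

    def step(self, action):
        self.x += float(action[0])
        self.terminated = self.x >= 1.0


class ToyAdapter:
    """Hypothetical adapter: env observation -> model batch, model chunk -> env actions."""

    def to_model(self, data, prompt):
        return {"state": data["state"], "prompt": prompt}

    def to_env(self, chunk):
        return chunk


class ToyModel:
    def infer(self, batch):
        # Blocking inference returning an action chunk of shape [T=2, A=1].
        return np.full((2, 1), 0.25, dtype=np.float32)


async def rollout(env, model, adapter, prompt, max_steps=100):
    steps = 0
    data, terminated = env.get_observation()
    while not terminated and steps < max_steps:
        batch = adapter.to_model(data, prompt)
        # Run the blocking infer in a worker thread so the loop stays responsive.
        chunk = await asyncio.to_thread(model.infer, batch)
        for action in adapter.to_env(chunk):
            env.step(action)
            steps += 1
            data, terminated = env.get_observation()
            if terminated or steps >= max_steps:
                break
    return steps


steps = asyncio.run(rollout(ToyEnv(), ToyModel(), ToyAdapter(), "reach x = 1"))
```

This toy executes every action of each chunk before inferring again; a real harness may instead re-infer more often and blend overlapping chunks.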
LeRobot integration
HUD integrates with LeRobot natively, so a stock checkpoint is a complete agent in a few lines. The two bundled seams follow the LeRobot convention:
- LeRobotModel(policy, preprocess, postprocess) runs the policy through its own LeRobot pre/post-processors, so the checkpoint behaves exactly as it does upstream. Pass an Ensembler to reduce overlapping action chunks to one action per step.
- LeRobotAdapter(model_image_keys=...) maps the env’s cameras and state onto the policy’s inputs from the contract - HWC uint8 → CHW float, state and prompt passed through.
For anything else, implement your own Model or Adapter; the
LeRobot classes are the batteries-included default. See the
robot benchmark cookbook for a full LIBERO + pi0.5 run.
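The image conversion the adapter performs (HWC uint8 to CHW float) is easy to picture with plain numpy. The function below is a sketch of that transform only; the name `hwc_uint8_to_chw_float` is made up, and scaling to [0, 1] is an assumption exposed as a flag, since the text does not say whether the bundled adapter rescales.

```python
import numpy as np


def hwc_uint8_to_chw_float(image: np.ndarray, scale: bool = True) -> np.ndarray:
    """Convert an [H, W, 3] uint8 frame to a [3, H, W] float32 array.

    scale=True divides by 255 (an assumption, not confirmed SDK behavior).
    """
    if image.ndim != 3 or image.dtype != np.uint8:
        raise ValueError("expected an [H, W, C] uint8 image")
    chw = np.transpose(image, (2, 0, 1)).astype(np.float32)
    return chw / 255.0 if scale else chw


frame = np.zeros((4, 6, 3), dtype=np.uint8)
frame[1, 2, 0] = 255  # one fully red pixel at row 1, column 2
tensor = hwc_uint8_to_chw_float(frame)
```

After the transpose, the channel index comes first, so the red pixel is found at `tensor[0, 1, 2]`.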
The Model
Model owns how to run a policy. To wrap a non-LeRobot checkpoint, subclass it and implement one
method - infer; the episode loop, threading, and the wire are handled for you.
- Input (batch) - the policy-ready inputs your Adapter produced for this step (images, a state vector, the task prompt - whatever your policy consumes). Model and Adapter are a matched pair, so the batch is exactly what your adapter emits.
- Output - a [T, A] float32 numpy array: an action chunk of T timesteps × A action dims, already in the env’s action space. Single-action policies return T = 1.
- reset() - optional; clear per-episode state (an action queue, a chunk buffer) at the start of each episode.
The harness calls ainfer, which runs your (blocking) infer in a worker thread by default -
override ainfer only if your policy is natively async. For chunked policies, reduce each [T, A]
chunk to one action per step with an Ensembler.
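One common way to reduce overlapping [T, A] chunks to a single action per step is temporal ensembling with exponentially decaying weights, as popularized by ACT. The class below is a sketch of that technique, not the SDK's Ensembler: the name `ExponentialEnsembler`, its methods, and the `decay` parameter are all hypothetical.

```python
import numpy as np


class ExponentialEnsembler:
    """Blend every pending chunk's prediction for the current step.

    Older chunks get the larger weight: w_i = exp(-decay * i), i = 0 for the oldest.
    """

    def __init__(self, decay: float = 0.1):
        self.decay = decay
        self.reset()

    def reset(self) -> None:
        # Each entry is [chunk, index of the next unused timestep].
        self._pending = []

    def add_chunk(self, chunk) -> None:
        chunk = np.asarray(chunk, dtype=np.float32)
        if chunk.ndim != 2:
            raise ValueError("expected a [T, A] action chunk")
        self._pending.append([chunk, 0])

    def next_action(self) -> np.ndarray:
        # Drop chunks that have no prediction left for this step.
        self._pending = [p for p in self._pending if p[1] < len(p[0])]
        if not self._pending:
            raise RuntimeError("no action chunk covers this step")
        preds = np.stack([chunk[i] for chunk, i in self._pending])
        weights = np.exp(-self.decay * np.arange(len(preds)))
        weights /= weights.sum()
        for entry in self._pending:
            entry[1] += 1
        return (weights[:, None] * preds).sum(axis=0).astype(np.float32)


# decay=0 gives a plain average, which makes the arithmetic easy to follow.
ens = ExponentialEnsembler(decay=0.0)
ens.add_chunk([[0.0], [2.0]])
a0 = ens.next_action()          # only the first chunk covers step 0
ens.add_chunk([[4.0], [6.0]])
a1 = ens.next_action()          # average of 2.0 (old chunk) and 4.0 (new chunk)
a2 = ens.next_action()          # old chunk is exhausted; only 6.0 remains
```

Calling `reset()` at the start of each episode keeps predictions from one episode from leaking into the next.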
The contract
Embodiments and policies disagree on cameras, state layout, action semantics, and control rate, so pairing a model with an env always needs a wiring step. The contract makes it explicit: a JSON document in the capability manifest that the agent reads back with RobotClient.spaces(), which
splits features into an observation and an action space by each feature’s role - so a policy
wires itself with no shared config.
The smallest contract the bundled adapter accepts has one camera, a state vector, and an action. Only a few things in it are load-bearing:
- role (observation / action) - spaces() splits the contract by it and the Adapter wires against that split. Required on every feature.
- type on image observations - rgb / bgr / gray / depth is how the bundled adapter spots a camera; the first observation without an image type becomes the state. Omit it and your image is mistaken for the state. (On the state and action, type is descriptive.)
- Feature names - each observation feature’s name must match its key in the data dict returned by get_observation (action is the single action feature).
Everything else - robot_type, control_rate, dtype, shape, names, stats - is descriptive and never enforced; add names if
you want labeled state/action slices in the trace viewer. Full list in the reference below.
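To see how little of the contract is load-bearing, the sketch below builds a minimal contract as a Python dict and re-implements the two rules described above: split features by role, then treat image-typed observations as cameras and the first non-image observation as the state. The feature names `front_camera` and `state`, and the helpers `split_spaces` and `find_cameras_and_state`, are hypothetical; this is an illustration of the rules, not the code behind RobotClient.spaces() or the bundled adapter.

```python
IMAGE_TYPES = {"rgb", "bgr", "gray", "depth"}

# A minimal contract: one camera, one state vector, one action.
contract = {
    "features": {
        "front_camera": {"role": "observation", "type": "rgb"},
        "state": {"role": "observation", "type": "joint_pos"},
        "action": {"role": "action", "type": "ee_del"},
    }
}


def split_spaces(contract: dict):
    """Split features into observation and action spaces by their role."""
    observation, action = {}, {}
    for name, spec in contract["features"].items():
        role = spec["role"]  # required on every feature
        if role == "observation":
            observation[name] = spec
        elif role == "action":
            action[name] = spec
        else:
            raise ValueError(f"{name}: unknown role {role!r}")
    return observation, action


def find_cameras_and_state(observation: dict):
    """Image-typed observations are cameras; the first other one is the state."""
    cameras = [n for n, s in observation.items() if s.get("type") in IMAGE_TYPES]
    state = next(
        (n for n, s in observation.items() if s.get("type") not in IMAGE_TYPES),
        None,
    )
    return cameras, state


obs_space, act_space = split_spaces(contract)
cameras, state = find_cameras_and_state(obs_space)

# The trap from the text: drop the camera's type and it is taken for the state.
broken = {"front_camera": {"role": "observation"}, "state": obs_space["state"]}
broken_cameras, broken_state = find_cameras_and_state(broken)
```

The last two lines reproduce the failure mode named above: without an image type, the camera is the first non-image observation and is picked as the state.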
Full field reference
| Field | Where | Meaning |
|---|---|---|
| robot_type | top level | Embodiment id, shown in the trace viewer. Descriptive. |
| control_rate | top level | Control-loop frequency in Hz. Descriptive. |
| features | top level | Map of feature name → feature spec (rows below). |
| role | feature | observation or action - the only field that splits the spaces. Load-bearing. |
| type | feature | Representation tag. Observations: rgb / bgr / gray / depth mark an image (load-bearing for the bundled adapter); others (ee_abs, ee_del, joint_pos, …) are descriptive control/state modes. |
| dtype | feature | image for frames, else a numpy dtype (float32). Descriptive - not checked against your arrays. |
| shape | feature | Declared dims ([H, W, 3], [8]). Descriptive; every feature is rank ≥ 1 (scalars are [1]). |
| names | feature | Per-element labels; what the trace viewer uses to label state/action slices. |
| stats | feature | Per-element mean / std / min / max for a custom adapter. The stock LeRobot path uses the checkpoint’s own normalization, so you can omit it. |
| state_type / state_representation / frame | feature | Closed-symbol embodiment metadata (EEF vs joint, quaternion vs axis-angle, world vs base frame). Descriptive. |
The SDK does not check your arrays against shape / dtype; the full authoring
spec - the closed symbol sets and known traps - lives outside the SDK alongside the contract corpus.
Sim threading
The loop is lockstep - the bridge steps the sim once per received action. A simulator is usually thread-affine (every touch must run on the thread that created its GL/device context), but the bridge’s asyncio loop can’t be stalled by a blocking step. SimRunner is the one-line injection
that decides which thread runs the sim; the bridge routes every sim touch through it:
- InlineSimRunner - runs on the event-loop thread. The default; for cheap/CPU sims and tests.
- ThreadSimRunner - sim on a dedicated worker thread, leaving the loop free during a blocking step. For render-heavy or thread-bound sims.
- MainThreadSimRunner - sim on the main thread, for runtimes that own both the main thread and the loop (Isaac/Omniverse); the owner’s pump loop drains queued sim touches between ticks.
Pass one to the bridge (RobotBridge(sim_runner=ThreadSimRunner())), or subclass SimRunner for an
exotic topology.
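The idea behind a dedicated-thread runner can be sketched with the standard library alone: a single-worker executor guarantees every sim touch lands on the same thread, while the event loop awaits the result instead of blocking. `WorkerThreadRunner` and its `run` / `close` methods are hypothetical names for this illustration and do not describe the real SimRunner interface.

```python
import asyncio
import threading
from concurrent.futures import ThreadPoolExecutor


class WorkerThreadRunner:
    """Route every call through one long-lived worker thread."""

    def __init__(self):
        # max_workers=1 means all submitted calls share a single thread,
        # which is what a thread-affine simulator needs.
        self._pool = ThreadPoolExecutor(max_workers=1, thread_name_prefix="sim")

    async def run(self, fn, *args):
        loop = asyncio.get_running_loop()
        # The event loop stays free while fn blocks on the worker thread.
        return await loop.run_in_executor(self._pool, fn, *args)

    def close(self) -> None:
        self._pool.shutdown(wait=True)


async def demo():
    runner = WorkerThreadRunner()
    try:
        first = await runner.run(threading.get_ident)
        second = await runner.run(threading.get_ident)
    finally:
        runner.close()
    return first, second, threading.get_ident()


first, second, loop_thread = asyncio.run(demo())
```

Both calls report the same thread id, and it differs from the thread running the event loop, which is the property a thread-bound simulator relies on.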
Telemetry
Zero-config: with HUD telemetry configured, RobotAgent streams one span per step - every camera
frame the policy saw plus the executed action - and stamps keyframes where a fresh action chunk
was inferred. The platform’s trace viewer plays the episode back: scrub through all frames, with
markers at each chunk-prediction decision point.
Recording datasets
Set agent.save = True (wire it to a --save flag on your runner) to also record every
(observation, executed action) tick into a LeRobot v3 dataset - the rollouts you just ran,
ready to finetune a policy on. Telemetry streams either way; saving is the opt-in extra.
Recording is agent-side: it consumes the observations the agent already receives and the actions
it already produces, so it runs in your process - not the environment container. That sidesteps
sims (e.g. Isaac/RoboLab) whose dependency stack conflicts with lerobot; only your machine needs
pip install 'lerobot[dataset]'.
One dataset spans the whole run - every episode the shared agent drives appends to it - and is
finalized at process exit. Destination and Hub push come from the environment:
| Env var | Effect |
|---|---|
| RECORD_DIR | Dataset root (default ./data, relative to where the rollout launched) |
| HF_REPO | Also push the finalized dataset to this HF namespace (needs HF_TOKEN) |
| HF_PRIVATE | Push the dataset private |
In the dataset, each camera becomes observation.images.<camera> (encoded to per-episode video), the lone state vector becomes
observation.state, the action becomes action, and the task prompt rides along as each frame’s
task.
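The key mapping above can be written down as a small function. This is a sketch of the naming scheme only: `to_dataset_frame` is a hypothetical helper, cameras and state are told apart by array rank as in the observation convention, and no actual LeRobot dataset is created.

```python
import os
import numpy as np


def to_dataset_frame(data: dict, action: np.ndarray, prompt: str) -> dict:
    """Map one (observation, executed action) tick to dataset-style keys."""
    frame = {"action": np.asarray(action), "task": prompt}
    for name, value in data.items():
        if value.ndim == 3:
            # Rank-3 arrays are camera frames.
            frame[f"observation.images.{name}"] = value
        elif value.ndim == 1:
            # The lone rank-1 array is the state vector.
            frame["observation.state"] = value.astype(np.float32)
    return frame


# Destination follows the documented default when RECORD_DIR is unset.
record_dir = os.environ.get("RECORD_DIR", "./data")

tick = {
    "front_camera": np.zeros((8, 8, 3), dtype=np.uint8),
    "state": np.zeros(7, dtype=np.float32),
}
frame = to_dataset_frame(tick, np.zeros(7, dtype=np.float32), "stack the blocks")
```

The camera name `front_camera` and the prompt are placeholders; with real data the keys follow whatever feature names the contract declares.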
Running a sim in another process
Some simulators must own the process main thread - most notably Isaac Sim / Omniverse, where Kit drives its own main-thread event loop and env.reset() loads USD through a nested
run_until_complete. That can’t run inside hud serve, which already owns the asyncio loop. The fix
is to move the sim into its own process and keep the env code essentially unchanged.
RobotEndpoint is built for exactly this: the same control surface (start / reset / result /
stop) works whether the bridge is local or remote.
- Env process - publish a remote handle with RobotEndpoint.remote(host, port). It dials the sim process and forwards every control call over JSON-RPC.
- Sim process - wrap the real bridge and expose it with RobotEndpoint(bridge).serve(host, port), using a MainThreadSimRunner so every sim touch runs on the main thread.
The two planes stay separate:
- Control plane (start / reset / result) - JSON-RPC between the remote endpoint and the serving process.
- Data plane (the agent’s observe → act loop) - tunnels straight to the bridge’s robot WebSocket; the contract stays env-side.
In the env process (env.py), connect() to the remote endpoint first; the sim process (sim_main.py) serves the bridge.
connect() retries until the sim is listening. Everything
downstream (hud eval, tasksets, the agent) is unchanged; only where the bridge runs moved.
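Forwarding control calls over JSON-RPC amounts to serializing a method name and its parameters on one side and dispatching them to a local object on the other. The sketch below shows that shape with JSON-RPC 2.0 messages; `LocalEndpoint`, `make_request`, `handle_request`, and the method signatures are hypothetical stand-ins, and the SDK's actual transport and message details are not shown in this document.

```python
import json


class LocalEndpoint:
    """Stand-in for the object living in the sim process."""

    def __init__(self):
        self.running = False

    def start(self):
        self.running = True
        return "started"

    def reset(self, task):
        return f"prompt for {task}"

    def result(self):
        return 1.0  # the episode's score

    def stop(self):
        self.running = False
        return "stopped"


ALLOWED = {"start", "reset", "result", "stop"}


def make_request(method: str, params: dict, request_id: int) -> str:
    """What the remote handle would send for one control call."""
    return json.dumps(
        {"jsonrpc": "2.0", "id": request_id, "method": method, "params": params}
    )


def handle_request(endpoint, raw: str) -> str:
    """What the serving process would do with one incoming request."""
    request = json.loads(raw)
    method = request.get("method")
    reply = {"jsonrpc": "2.0", "id": request.get("id")}
    if method not in ALLOWED or not hasattr(endpoint, method):
        reply["error"] = {"code": -32601, "message": "Method not found"}
    else:
        reply["result"] = getattr(endpoint, method)(**request.get("params", {}))
    return json.dumps(reply)


endpoint = LocalEndpoint()
ok = json.loads(handle_request(endpoint, make_request("reset", {"task": "stack-blocks"}, 1)))
bad = json.loads(handle_request(endpoint, make_request("teleport", {}, 2)))
```

Only the control plane travels this way; observations and actions keep flowing over the bridge's WebSocket, as described above.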
API summary
| Symbol | Where | Role |
|---|---|---|
| RobotEndpoint.capability(contract=...) | hud.environment.robot | Build the openpi/0 capability after start() |
| Capability.robot(name, url, contract) | hud.capabilities | Lower-level constructor (usually via endpoint.capability) |
| RobotClient | hud.capabilities.robot | Agent-side wire client (spaces, get_observation, send_action, send_chunk) |
| RobotBridge | hud.environment.robot | Env-side serve loop; subclass with your sim |
| RobotEndpoint | hud.environment.robot | Episode bookkeeping + results (local or .remote()) |
| SimRunner (Inline/Thread/MainThread) | hud.environment.robot | Which thread runs the sim |
| RobotAgent | hud.agents.robot | The episode-loop harness |
| Model / LeRobotModel, Adapter / LeRobotAdapter | hud.agents.robot | Policy + space-translation seams |
See also
Robot benchmark cookbook
LIBERO in Docker, driven by pi0.5, end to end.