Decentralized coordination for humanoid robot teams. Two simulated Unitree G1 robots learn cooperative lifting using verbal cues and force feedback; no central controller or wireless protocol required.
Existing multi-robot coordination systems assume either a central controller or reliable wireless communication. Neither holds up in communication-denied environments (planetary surfaces, disaster zones, degraded infrastructure). Fleet-Coord drops both assumptions. Two simulated Unitree G1 humanoid robots learn to cooperatively lift and carry objects using only human-compatible signals: verbal countdown cues and force feedback through the shared object.
The core constraint is that a human must be able to swap in for either robot mid-task without changing the protocol. No binary handshakes, no wireless pings. Just observable behavior.
Four layers. The environment layer wraps MuJoCo physics at 1000 Hz behind a Gymnasium interface. Each robot gets 202-dimensional observations (joint states, object pose, hand contact forces, partner speech tokens) and outputs 51-dimensional actions (43 joint targets plus an 8-token discrete speech output). A speech channel simulates verbal communication with 20 ms delay and 200 ms persistence.
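As a concrete sketch, the flat 51-dimensional action can be split into joint targets and a speech token roughly like this. The packing order and the greedy decode are illustrative assumptions, not the project's actual layout:

```python
import numpy as np

NUM_JOINTS = 43        # G1 with Dex3-1 hands
SPEECH_VOCAB = 8       # discrete speech tokens
ACTION_DIM = NUM_JOINTS + SPEECH_VOCAB  # 51, matching the action space above

def split_action(action: np.ndarray):
    """Split a flat 51-dim action into joint targets and one speech token."""
    assert action.shape == (ACTION_DIM,)
    motor = action[:NUM_JOINTS]                  # joint position targets
    token = int(np.argmax(action[NUM_JOINTS:]))  # greedy decode for illustration
    return motor, token
```

In training the speech token would be sampled from the policy's distribution rather than argmax-decoded; the greedy version is just easier to inspect.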
The training layer bridges the environment to ETH Zurich's RSL-RL framework for multi-agent PPO. Both robots in each environment share rewards to encourage cooperative behavior. A curriculum manager tracks rolling-window success rates and advances through the training stages (approach, grip, lift, carry) once performance exceeds a threshold.
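The stage-advancement logic can be sketched as follows. The window size, threshold value, and class interface are assumptions for illustration; only the rolling-window-then-advance behavior comes from the text:

```python
from collections import deque

class CurriculumManager:
    """Hypothetical sketch: advance stages on rolling-window success rate."""
    STAGES = ["approach", "grip", "lift", "carry"]

    def __init__(self, window: int = 100, threshold: float = 0.8):
        self.successes = deque(maxlen=window)
        self.threshold = threshold
        self.stage_idx = 0

    @property
    def stage(self) -> str:
        return self.STAGES[self.stage_idx]

    def record(self, success: bool) -> None:
        self.successes.append(float(success))
        window_full = len(self.successes) == self.successes.maxlen
        rate = sum(self.successes) / max(len(self.successes), 1)
        if window_full and rate >= self.threshold and self.stage_idx < len(self.STAGES) - 1:
            self.stage_idx += 1
            self.successes.clear()  # restart the window for the new stage
```

Clearing the window on advancement prevents success from the easier stage from inflating the harder stage's rate.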
Training uses a teacher-student setup. The teacher policy (an MLP with hidden layers [256, 256, 128]) sees privileged partner state during training. After convergence, it is distilled into a student policy that uses only observable inputs; the student is what would run on real hardware.
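A minimal distillation step under this setup might look like the following. The hidden sizes and the 202/51 dimensions follow the text; the 32-dim privileged partner state, the MSE behavior-cloning loss, and all variable names are assumptions:

```python
import torch
import torch.nn as nn

def mlp(in_dim: int, out_dim: int) -> nn.Sequential:
    # [256, 256, 128] hidden layers, as described in the text
    return nn.Sequential(
        nn.Linear(in_dim, 256), nn.ELU(),
        nn.Linear(256, 256), nn.ELU(),
        nn.Linear(256, 128), nn.ELU(),
        nn.Linear(128, out_dim),
    )

PRIV_DIM = 32                        # assumed size of privileged partner state
teacher = mlp(202 + PRIV_DIM, 51)    # sees observable + privileged inputs
student = mlp(202, 51)               # observable inputs only
opt = torch.optim.Adam(student.parameters(), lr=1e-3)

obs = torch.randn(64, 202)           # observable inputs (batch of 64)
priv = torch.randn(64, PRIV_DIM)     # privileged partner state (teacher only)
with torch.no_grad():
    target = teacher(torch.cat([obs, priv], dim=-1))  # teacher's actions
loss = nn.functional.mse_loss(student(obs), target)   # behavior cloning
opt.zero_grad(); loss.backward(); opt.step()
```

In practice DAgger-style distillation (rolling out the student and querying the teacher on the student's own states) is common for this kind of transfer, but the single supervised step above captures the core idea.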
Robots emit discrete tokens from an 8-word vocabulary: silence, "three", "two", "one", "lift", "down", "stop", "regrip". These are sampled from the policy's action distribution alongside motor commands. The channel adds a 1-step transmission delay and 10-step persistence window, giving the partner roughly 200 ms to register each cue.
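The delay-plus-persistence behavior can be modeled with a small buffer. This is a sketch under the parameters stated above (1-step delay, 10-step persistence at 50 Hz); the class name and token encoding are assumptions:

```python
from collections import deque

SILENCE = 0  # assumed encoding: token 0 is silence

class SpeechChannel:
    """Hypothetical channel model: an emitted token arrives at the partner
    after a 1-step delay, then persists in their observation for 10 steps
    (~200 ms at the 50 Hz policy rate)."""

    def __init__(self, delay_steps: int = 1, persist_steps: int = 10):
        self.pipeline = deque([SILENCE] * delay_steps, maxlen=delay_steps)
        self.persist_steps = persist_steps
        self.current = SILENCE
        self.ttl = 0  # steps the current token remains audible

    def step(self, emitted_token: int) -> int:
        """Push this step's emitted token; return what the partner hears."""
        arrived = self.pipeline[0]          # token delayed by delay_steps
        self.pipeline.append(emitted_token) # maxlen deque drops the oldest
        if arrived != SILENCE:
            self.current, self.ttl = arrived, self.persist_steps
        if self.ttl > 0:
            self.ttl -= 1
            return self.current
        return SILENCE
```

A new non-silence arrival resets the persistence window, so rapid-fire cues overwrite rather than queue.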
This makes the system debuggable (replay logs show exactly which token was emitted at which step), and it maps directly to real voice commands for human-robot teaming.
Reward shaping is config-driven through YAML files. Stage 1 rewards include dense signals for distance decrease, stable hand contact, and grip duration, plus penalties for falling. Each component is logged independently to TensorBoard for debugging.
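A config in this style might look like the fragment below. The field names, weights, and nesting are illustrative assumptions, not the project's actual schema:

```yaml
# Hypothetical Stage 1 reward config; keys and weights are illustrative.
stage_1:
  rewards:
    distance_decrease: {weight: 1.0}   # dense shaping toward the box
    hand_contact:      {weight: 0.5}   # stable contact with the box geom
    grip_duration:     {weight: 0.3}   # sustained two-handed grip
  penalties:
    fall:              {weight: -10.0}
  logging:
    tensorboard_per_component: true    # one scalar per reward term
```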
Early GPU runs reported 100% success, but visual inspection showed collapsed robots with spurious contacts. The fix: require box-specific geom contact checks and verify both robots remain upright (base tilt < 0.7 rad). Success metrics now enforce structural invariants, not just contact flags.
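A minimal sketch of such a hardened check, assuming hypothetical helper inputs (per-robot flags for box-specific geom contact, and base tilt angles in radians):

```python
def grip_success(box_contacts: dict, base_tilts: list, tilt_limit: float = 0.7) -> bool:
    """Hypothetical hardened success check: requires box-specific geom
    contact from BOTH robots AND both bases upright (tilt < 0.7 rad),
    so a collapsed robot with spurious contacts no longer counts."""
    both_gripping = all(box_contacts.get(r, False) for r in ("robot_0", "robot_1"))
    both_upright = all(t < tilt_limit for t in base_tilts)
    return both_gripping and both_upright
```

The robot names and input shapes are assumptions; the point is that success is a conjunction of structural invariants rather than a single contact flag.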
| Layer | Technology |
|---|---|
| Physics | MuJoCo 3.4.0 at 1000 Hz |
| Scene Composition | dm-control MJCF builder |
| RL Framework | RSL-RL v1.0.2 (PPO) |
| Neural Network | PyTorch 2.0+ with CUDA |
| Robot Model | Unitree G1 with Dex3-1 hands (43 DOF) |
| Observation Space | 202 dims per robot |
| Action Space | 51 dims (43 motor + 8 speech) |
| Policy Frequency | 50 Hz (20 sim steps per action) |
Stage 1 (approach and grip) is under active training with the hardened evaluation metrics. Next steps are warm-starting from pretrained G1 locomotion checkpoints and running a controlled comparison against training from scratch.