Overview

In this paper, we propose GEM-4D, a dynamics-aware generative model that takes the initial observation and an instruction as input and predicts future frames capturing the motion of the target robot. From the generated frames, an inverse dynamics module then efficiently extracts a robot policy from the predicted trajectory. The extracted policy can be deployed in robotic simulation environments for downstream manipulation tasks. The pipeline is shown in Fig. 1.
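The two-stage flow described above (predict future frames, then recover actions from them) can be sketched as follows. This is only a structural illustration, not the GEM-4D implementation: `plan`, `extract_actions`, `toy_video_model`, and `toy_inverse_dynamics` are hypothetical names, and the two toy functions are placeholders standing in for the learned models. The sketch assumes the common formulation in which an inverse dynamics model maps each pair of consecutive frames to one action.

```python
from typing import Callable, List, Sequence

import numpy as np

Frame = np.ndarray   # H x W x C image
Action = np.ndarray  # hypothetical action vector

VideoModel = Callable[[Frame, str, int], List[Frame]]
InverseDynamics = Callable[[Frame, Frame], Action]


def extract_actions(frames: Sequence[Frame], inverse_dynamics: InverseDynamics) -> List[Action]:
    """Apply the inverse dynamics model to every pair of consecutive frames."""
    return [inverse_dynamics(prev, nxt) for prev, nxt in zip(frames[:-1], frames[1:])]


def plan(first_obs: Frame, instruction: str, video_model: VideoModel,
         inverse_dynamics: InverseDynamics, horizon: int) -> List[Action]:
    """Predict `horizon` future frames, then recover one action per transition."""
    future = video_model(first_obs, instruction, horizon)
    frames = [first_obs, *future]
    return extract_actions(frames, inverse_dynamics)


# Placeholder models (assumptions for illustration only).
def toy_video_model(obs: Frame, instruction: str, horizon: int) -> List[Frame]:
    # Pretend each future frame is the observation brightened by one more unit.
    return [obs + (k + 1) for k in range(horizon)]


def toy_inverse_dynamics(prev: Frame, nxt: Frame) -> Action:
    # Pretend the "action" is the mean pixel change between the two frames.
    return np.array([float((nxt - prev).mean())])


actions = plan(np.zeros((4, 4, 3)), "lift the block", toy_video_model, toy_inverse_dynamics, horizon=3)
```

With a horizon of three frames, the sketch yields three actions, one per frame transition. Swapping the toy callables for real models would leave `plan` unchanged.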

Teaser

Fig. 1: Overview. We propose a world-model-based vision planner for robots. Taking the first observation as input, our world model GEM-4D predicts a video of the robot, which the Dynamic Inverse System then uses to extract a robot policy. Finally, this policy is executed in real-robot experiments.

Figure 2

Fig. 2: Adaptive Inverse Dynamic System. Given a generated video as input, this system extracts a robot policy through the four steps illustrated in the figure.

Real Robot

Put Rubbish in Bin
Prompt: throw away the trash

Videos generated by GEM-4D (ours)
Real robot execution

Lift Numbered Block
Prompt: lift the block with the number three

Videos generated by GEM-4D (ours)
Real robot execution

Figure 3

Fig. 3: Real-robot rollouts. From left to right: ground-truth video, GEM-4D-generated RGB, and the back-projected 3D point cloud. The model produces realistic and geometrically coherent rollouts against unseen backgrounds, supporting transfer to UF ARM manipulation.
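For readers unfamiliar with back-projection, the sketch below shows the standard pinhole-camera way of lifting a per-pixel depth map into a 3D point cloud. It is an assumption that the point clouds in Fig. 3 are produced this way: the page does not state the camera model, and the function name `backproject_depth` and the intrinsics `fx`, `fy`, `cx`, `cy` are hypothetical.

```python
import numpy as np


def backproject_depth(depth: np.ndarray, fx: float, fy: float, cx: float, cy: float) -> np.ndarray:
    """Lift an H x W depth map to an N x 3 point cloud in the camera frame.

    Assumes a pinhole camera with focal lengths (fx, fy) and principal
    point (cx, cy), all in pixels. Pixels with non-positive depth are dropped.
    """
    h, w = depth.shape
    # u indexes columns (x direction), v indexes rows (y direction).
    u, v = np.meshgrid(np.arange(w), np.arange(h))
    z = depth
    x = (u - cx) * z / fx
    y = (v - cy) * z / fy
    points = np.stack([x, y, z], axis=-1).reshape(-1, 3)
    return points[points[:, 2] > 0]


# Toy usage: a flat 2 x 2 depth map two units from the camera.
cloud = backproject_depth(np.full((2, 2), 2.0), fx=1.0, fy=1.0, cx=0.5, cy=0.5)
```

Each valid pixel contributes one 3D point, so per-pixel RGB values can be attached to the cloud by applying the same validity mask to the image.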

Generated Video

Droid Dataset
Prompt: pick up the tiger toy and place it into the black bowl
Ground Truth
Tesseract
GEM-4D (Ours)
Prompt: move the cup closer to the pen
Ground Truth
Tesseract
GEM-4D (Ours)
RLBench Dataset
Prompt: close the middle drawer
Ground Truth
Tesseract
GEM-4D (Ours)
Prompt: screw the orange jar lid on
Ground Truth
Tesseract
GEM-4D (Ours)
Bridge Dataset
Prompt: sweep into pile
Ground Truth
Tesseract
GEM-4D (Ours)
Prompt: unfold the cloth from top left to bottom right
Ground Truth
Tesseract
GEM-4D (Ours)
RT-1 Dataset
Prompt: knock coke can over
Ground Truth
Tesseract
GEM-4D (Ours)
Prompt: pick orange can from bottom shelf of fridge
Ground Truth
Tesseract
GEM-4D (Ours)