In this paper, we propose a dynamics-aware generative model, GEM-4D, which takes an initial observation and an instruction as input and predicts future frames that capture the motion of the target robot. Based on the generated frames, we further introduce an inverse dynamics module that efficiently extracts robot policies from the predicted trajectories. The extracted policies can then be deployed in robotic simulation environments for downstream manipulation tasks. The pipeline is shown in Fig. 1.
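The data flow of this pipeline can be sketched in a few lines of Python. This is only an illustrative sketch: `plan_from_video`, `toy_world_model`, and `toy_inverse_dynamics` are hypothetical names, frames are reduced to short lists of floats, and the toy functions are stand-ins that do not reflect how GEM-4D or the inverse dynamics module are actually implemented.

```python
from typing import Callable, List

Frame = List[float]   # stand-in for an image; a real frame would be an H x W x 3 array
Action = List[float]  # stand-in for a robot command, e.g. an end-effector delta


def plan_from_video(
    observation: Frame,
    instruction: str,
    world_model: Callable[[Frame, str], List[Frame]],
    inverse_dynamics: Callable[[Frame, Frame], Action],
) -> List[Action]:
    """Predict future frames, then recover one action per consecutive frame pair."""
    frames = [observation] + world_model(observation, instruction)
    return [inverse_dynamics(a, b) for a, b in zip(frames, frames[1:])]


# Toy stand-ins (hypothetical, NOT GEM-4D): the "world model" shifts the state
# by one unit per step, and the "inverse dynamics" returns the frame difference.
def toy_world_model(obs: Frame, instruction: str, horizon: int = 3) -> List[Frame]:
    step = 1.0 if "right" in instruction else -1.0
    return [[x + step * (t + 1) for x in obs] for t in range(horizon)]


def toy_inverse_dynamics(prev: Frame, nxt: Frame) -> Action:
    return [b - a for a, b in zip(prev, nxt)]


actions = plan_from_video([0.0, 0.0], "move right", toy_world_model, toy_inverse_dynamics)
```

The sketch keeps the generative model and the policy extractor behind two separate callables, mirroring the two-stage structure described above: either stage could be swapped out without touching the other.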
Fig. 1: Overview. We propose a world-model-based visual planning system for robots. Taking the first observation as input, our world model GEM-4D predicts a video of the robot, which the Dynamic Inverse System then uses to extract a robot policy. Finally, this policy is deployed in real-robot experiments.
Fig. 2: Adaptive Inverse Dynamic System. Given a generated video as input, this system extracts a robot policy through the four steps illustrated in the figure.
Fig. 3: Real-robot rollouts. From left to right: ground-truth video, GEM-4D-generated RGB, and the back-projected 3D point cloud. The model produces realistic and geometrically coherent rollouts against unseen backgrounds, supporting transfer to UF ARM manipulation.
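For readers unfamiliar with back-projection, the following minimal sketch shows the standard pinhole-camera formula that lifts a depth map to a 3D point cloud. It assumes that a per-pixel depth value and the camera intrinsics (`fx`, `fy`, `cx`, `cy`) are available; the function name `back_project` and the convention of skipping non-positive depths are illustrative choices, not details taken from the GEM-4D pipeline.

```python
from typing import List, Sequence, Tuple

Point = Tuple[float, float, float]


def back_project(
    depth: Sequence[Sequence[float]],
    fx: float,
    fy: float,
    cx: float,
    cy: float,
) -> List[Point]:
    """Lift a depth map to camera-frame 3D points with the pinhole model.

    For pixel (u, v) with depth z:  X = (u - cx) * z / fx,  Y = (v - cy) * z / fy,  Z = z.
    Pixels with non-positive depth are treated as invalid and skipped (an assumption).
    """
    points: List[Point] = []
    for v, row in enumerate(depth):       # v indexes image rows
        for u, z in enumerate(row):       # u indexes image columns
            if z <= 0:
                continue
            points.append(((u - cx) * z / fx, (v - cy) * z / fy, z))
    return points


# Tiny 2 x 2 depth map with one invalid pixel, using unit focal length
# and a principal point at the origin for easy checking.
cloud = back_project([[1.0, 0.0], [2.0, 2.0]], fx=1.0, fy=1.0, cx=0.0, cy=0.0)
```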