Modeling the dynamics of deformable objects is challenging due to their diverse physical properties and the difficulty of estimating states from limited visual information. We address these challenges with a neural dynamics framework that combines object particles and spatial grids in a hybrid representation. Our particle-grid model captures global shape and motion information while predicting dense particle movements, enabling the modeling of objects with varied shapes and materials. Particles represent object shapes, while the spatial grid discretizes the 3D space to ensure spatial continuity and enhance learning efficiency. Coupled with Gaussian Splatting for visual rendering, our framework achieves a fully learning-based digital twin of deformable objects and generates 3D action-conditioned videos. Through experiments, we demonstrate that our model learns the dynamics of diverse objects—such as ropes, cloths, stuffed animals, and paper bags—from sparse-view RGB-D recordings of robot-object interactions, while also generalizing at the category level to unseen instances. Our approach outperforms state-of-the-art learning-based and physics-based simulators, particularly in scenarios with limited camera views. Furthermore, we showcase the utility of our learned models in model-based planning, enabling goal-conditioned object manipulation across a range of tasks.
Simulating deformable objects like cloths and ropes is hard because of their complex physics and partial observability. In this work, we overcome these challenges by learning a neural model for object dynamics directly from real-world videos.
Our particle-grid neural dynamics model represents objects as dense 3D particles and predicts their next-step velocities to simulate object dynamics. It consists of three stages: particle encoding, grid velocity prediction, and grid-to-particle velocity transfer.
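For concreteness, below is a minimal PyTorch sketch of one dynamics step following these three stages. The scatter-average particle-to-grid encoder, the small 3D convolutional grid network, the per-particle action features, and names such as ParticleGridStep, grid_size, and feat_dim are simplifying assumptions for illustration, not the exact PGND architecture.

import torch
import torch.nn as nn
import torch.nn.functional as F


class ParticleGridStep(nn.Module):
    def __init__(self, grid_size=32, feat_dim=16):
        super().__init__()
        self.grid_size, self.feat_dim = grid_size, feat_dim
        # Per-particle encoder: position, velocity, and action features -> latent feature.
        self.particle_enc = nn.Sequential(
            nn.Linear(9, feat_dim), nn.ReLU(), nn.Linear(feat_dim, feat_dim))
        # Grid network: maps scattered particle features to a dense grid velocity field.
        self.grid_net = nn.Sequential(
            nn.Conv3d(feat_dim, 32, 3, padding=1), nn.ReLU(),
            nn.Conv3d(32, 3, 3, padding=1))

    def forward(self, pos, vel, act):
        # pos, vel, act: (N, 3) particle positions in [0, 1]^3, current velocities,
        # and per-particle action features (e.g., broadcast end-effector motion).
        N, G, Fd = pos.shape[0], self.grid_size, self.feat_dim
        feat = self.particle_enc(torch.cat([pos, vel, act], dim=-1))      # (N, Fd)

        # 1) Particle encoding: scatter-average particle features into grid cells.
        idx = (pos * (G - 1)).round().long().clamp(0, G - 1)              # (N, 3) cell indices
        flat = idx[:, 0] * G * G + idx[:, 1] * G + idx[:, 2]
        grid = torch.zeros(G * G * G, Fd).index_add_(0, flat, feat)
        count = torch.zeros(G * G * G, 1).index_add_(0, flat, torch.ones(N, 1))
        grid = (grid / count.clamp(min=1)).T.reshape(1, Fd, G, G, G)

        # 2) Grid velocity prediction with a 3D convolutional network.
        grid_vel = self.grid_net(grid)                                    # (1, 3, G, G, G)

        # 3) Grid-to-particle transfer: trilinearly sample velocities at particle positions.
        coords = (pos.view(1, 1, 1, N, 3) * 2 - 1)[..., [2, 1, 0]]        # grid_sample expects (x, y, z) = (W, H, D)
        new_vel = F.grid_sample(grid_vel, coords, align_corners=True).reshape(3, N).T
        return pos + new_vel, new_vel                                     # advected particles and new velocities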
We evaluate our method on 6 diverse deformable object categories, including ropes, cloths, stuffed animals, and paper bags. Our model is trained separately on each category using less than 20 minutes of RGB-D videos of robot-object interactions.
Given initial states and actions, we show the prediction results of the GBND baseline compared to our particle-grid neural dynamics model. We overlay the predictions on the ground-truth final-state images to highlight prediction errors. PGND's predictions align more closely with the ground truth, producing denser particle predictions with fewer artifacts than the baseline.
When plugged into a Gaussian Splatting renderer, PGND can generate high-quality 3D action-conditioned videos. PGND's results align better with the ground truth, while the SOTA baseline method predicts visually unrealistic deformations.
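As a sketch of how predicted particle motion can drive the Gaussian Splatting renderer, the snippet below advects each Gaussian center by an inverse-distance-weighted average of the displacements of its k nearest particles. This binding scheme, and the function advect_gaussians itself, are illustrative assumptions rather than the paper's exact formulation.

import torch


def advect_gaussians(gauss_means, particles_prev, particles_next, k=8, eps=1e-6):
    # gauss_means:    (M, 3) Gaussian centers from the static reconstruction.
    # particles_prev: (N, 3) particle positions before the dynamics step.
    # particles_next: (N, 3) particle positions after the dynamics step.
    disp = particles_next - particles_prev                     # (N, 3) per-particle motion
    d2 = torch.cdist(gauss_means, particles_prev) ** 2         # (M, N) squared distances
    knn_d2, knn_idx = d2.topk(k, dim=1, largest=False)         # k nearest particles per Gaussian
    w = 1.0 / (knn_d2 + eps)                                   # inverse-distance weights
    w = w / w.sum(dim=1, keepdim=True)
    gauss_disp = (w.unsqueeze(-1) * disp[knn_idx]).sum(dim=1)  # (M, 3) blended displacement
    return gauss_means + gauss_disp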
PGND can serve as a deformable-object simulator on top of Gaussian Splatting reconstructions of the scene. Starting from only the initial static reconstruction, we apply PGND to simulate the segmented object under a sequence of actions (red arrows).
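A rollout under a sequence of actions then reduces to a loop over dynamics steps. The model interface below (positions, velocities, and per-particle action features in; next positions and velocities out) matches the step sketch above and is an assumed interface, not PGND's exact API.

import torch


def rollout(model, init_pos, actions):
    # init_pos: (N, 3) particles sampled from the segmented reconstruction.
    # actions:  (T, 3) end-effector motions, one per step (assumed parameterization).
    pos = init_pos.clone()
    vel = torch.zeros_like(pos)                  # the object starts at rest
    trajectory = [pos]
    for a in actions:
        act = a.expand_as(pos)                   # broadcast the action to per-particle features
        pos, vel = model(pos, vel, act)          # one learned dynamics step
        trajectory.append(pos)
    return torch.stack(trajectory)               # (T + 1, N, 3) predicted particle trajectory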
PGND can be integrated with model-predictive control (MPC) to generate actions for manipulating objects. We test on 4 tasks with distinct object types: cloth lifting, box closing, rope manipulation, and plush toy relocation. In all tasks, our method produces results that are closer to the target.
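To illustrate the planning loop, the sketch below uses random-shooting MPC with a Chamfer-style cost between predicted particles and a goal point cloud. The sampling strategy, cost, and helper names (chamfer_cost, plan_action) are illustrative choices and may differ from the planner used in the paper.

import torch


def chamfer_cost(pred_pts, goal_pts):
    # Symmetric nearest-neighbor distance between predicted and goal point clouds.
    d = torch.cdist(pred_pts, goal_pts)
    return d.min(dim=1).values.mean() + d.min(dim=0).values.mean()


def plan_action(model, pos, vel, goal_pts, horizon=5, num_samples=64, act_scale=0.05):
    # Random-shooting MPC: sample action sequences, roll out the learned dynamics,
    # and return the first action of the lowest-cost sequence (then replan).
    best_cost, best_action = float("inf"), None
    for _ in range(num_samples):
        seq = act_scale * torch.randn(horizon, 3)    # candidate end-effector motions
        p, v = pos.clone(), vel.clone()
        for a in seq:
            p, v = model(p, v, a.expand_as(p))       # predicted next state under this action
        cost = chamfer_cost(p, goal_pts).item()
        if cost < best_cost:
            best_cost, best_action = cost, seq[0]
    return best_action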
@inproceedings{zhang2024particle,
  title     = {Particle-Grid Neural Dynamics for Learning Deformable Object Models from RGB-D Videos},
  author    = {Zhang, Kaifeng and Li, Baoyu and Hauser, Kris and Li, Yunzhu},
  booktitle = {Proceedings of Robotics: Science and Systems (RSS)},
  year      = {2025}
}