Structured Deep Dynamics Models For Robot Manipulation

The ability to predict how an environment changes based on forces applied to it is fundamental for a robot to achieve specific goals. Traditionally, dynamics modeling relies on a physics simulator, given strong models of the environment including object shape, mass, friction etc. On the other hand, learning-based methods such as Predictive State Representations or more recent deep learning methods have tried to learn these models directly from raw perceptual information in a model-free manner. In this project, we try to bridge the gap between these paradigms by proposing a specific class of deep networks that explicitly encode strong physics-based priors (specifically, rigid body dynamics) in their structure. We present two instances of these models:

  • SE3-Nets predict scene dynamics by explicitly segmenting the scene into distinct rigid objects whose motions are represented as SE3 transforms. Given a point cloud and a continuous action (such as forces or torques), SE3-Nets predict the next point cloud that results from the applied action.
  • SE3-Pose-Nets additionally predict the poses of the detected objects, modeling object dynamics directly in the low-dimensional pose space. SE3-Pose-Nets can be used for real-time reactive control using just raw depth data.
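
To make the idea of segmenting a scene into rigid parts and moving each part with its own SE3 transform concrete, here is a minimal NumPy sketch. It assumes the network has already produced a soft mask per point and one rotation and translation per part; the function name blend_se3 and all argument names are hypothetical and chosen for illustration, not taken from the authors' code.

```python
import numpy as np

def blend_se3(points, masks, rotations, translations):
    """Predict the next point cloud by mask-weighted rigid transforms.

    points:       (N, 3) input point cloud
    masks:        (N, K) soft assignment of each point to K parts (rows sum to 1)
    rotations:    (K, 3, 3) one rotation matrix per part
    translations: (K, 3) one translation per part
    All names and shapes are assumptions made for this sketch.
    """
    # Apply every part's transform to every point: shape (K, N, 3).
    moved = np.einsum("kij,nj->kni", rotations, points) + translations[:, None, :]
    # Weight each candidate position by the point's mask value and sum over parts.
    return np.einsum("nk,kni->ni", masks, moved)

# Two points, two parts: part 0 stays still, part 1 moves up by one unit.
pts = np.array([[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]])
masks = np.array([[1.0, 0.0], [0.0, 1.0]])
R = np.stack([np.eye(3), np.eye(3)])
t = np.array([[0.0, 0.0, 0.0], [0.0, 0.0, 1.0]])
next_pts = blend_se3(pts, masks, R, t)
```

With one-hot masks, each point simply follows the transform of the part it belongs to; soft masks blend the candidate positions.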

Both our models are trained in a supervised manner, assuming point-wise data associations between pairs of depth frames. No additional supervision is given for learning the segmentation, the SE3 transforms or the poses (for SE3-Pose-Nets). More details below (projects listed in reverse chronological order).

SE3-Pose-Nets: Structured Deep Dynamics Models for Visuomotor Control


In this work, we present an approach to deep visuomotor control using structured deep dynamics models. Our deep dynamics model, a variant of SE3-Nets, learns a low-dimensional pose embedding for visuomotor control via an encoder-decoder structure. Unlike prior work, our dynamics model is structured: given an input scene, our network explicitly learns to segment salient parts and predict their pose embedding, along with their motion modeled as a change in the pose space due to the applied actions. We train our model using pairs of point clouds separated by an action and show that, given supervision only in the form of point-wise data associations between the frames, our network is able to learn a meaningful segmentation of the scene along with consistent poses. We further show that our model can be used for closed-loop control directly in the learned low-dimensional pose space, where the actions are computed by minimizing error in the pose space using gradient-based methods similar to traditional model-based control. We present results on controlling a Baxter robot from raw depth data in simulation and in the real world and compare against two baseline deep networks. Our method runs in real time, achieves good prediction of scene dynamics and outperforms the baseline methods on multiple control runs.

Real-time reactive control with the Baxter:

We show results on controlling the first four joints of the Baxter robot from raw depth data at real-time rates (30 Hz). Our system generates joint velocity commands by minimizing the error (to the target) in the learned pose space using Gauss-Newton-based single-step optimization.
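
As a rough illustration of a single Gauss-Newton step on the control, the sketch below treats the learned transition model as a black-box function predict_pose(pose, u) and builds its Jacobian by finite differences. The function names, the finite-difference Jacobian, and the small damping term are all assumptions made for this example; the source does not say how the Jacobian is computed in the actual system.

```python
import numpy as np

def gauss_newton_step(predict_pose, pose, target, n_controls, eps=1e-4, damping=1e-6):
    """One Gauss-Newton step on the control u, linearized around u = 0.

    predict_pose(pose, u) returns the predicted next pose as a 1-D array.
    It is a hypothetical stand-in for a learned pose transition model.
    """
    u0 = np.zeros(n_controls)
    base = predict_pose(pose, u0)
    residual = base - target  # pose-space error when applying zero control
    # Finite-difference Jacobian of the predicted pose w.r.t. the control.
    J = np.empty((base.size, n_controls))
    for i in range(n_controls):
        du = np.zeros(n_controls)
        du[i] = eps
        J[:, i] = (predict_pose(pose, du) - base) / eps
    # Solve the lightly damped normal equations (J^T J + damping * I) u = -J^T r.
    H = J.T @ J + damping * np.eye(n_controls)
    return np.linalg.solve(H, -J.T @ residual)

# Toy linear model: next pose = pose + B @ u (purely illustrative).
B = np.array([[1.0, 0.0], [0.0, 2.0], [1.0, 1.0]])
toy_model = lambda pose, u: pose + B @ u
pose = np.array([0.1, -0.2, 0.3])
u_star = np.array([0.5, -0.25])
target = toy_model(pose, u_star)
u = gauss_newton_step(toy_model, pose, target, n_controls=2)
```

For this linear toy model, a single step recovers the control that reaches the target up to the damping term; for a nonlinear learned model, one step per frame only gives an approximate descent direction, which is re-evaluated at the next frame in a closed loop.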


SE3-Pose-Nets: Structured Deep Dynamics Models for Visuomotor Control, Arunkumar Byravan, Felix Leeb, Franziska Meier and Dieter Fox, IEEE International Conference on Robotics and Automation (ICRA), 2018.

SE3-Nets: Learning Rigid Body Motion using Deep Neural Networks


The ability to predict how an environment changes based on forces applied to it is fundamental for a robot to achieve specific goals. For instance, in order to arrange objects on a table into a desired configuration, a robot has to be able to reason about where and how to push individual objects, which requires some understanding of physical quantities such as object boundaries, mass, surface friction, and their relationship to forces. In this work, we explore the use of deep learning for learning such a notion of physical intuition.

We introduce SE3-Nets, which are deep networks designed to model rigid body motion from raw point cloud data. Based only on pairs of 3D point clouds along with a continuous action vector and point-wise data associations, SE3-Nets learn to segment affected object parts and predict their motion resulting from the applied force. Rather than learning point-wise flow vectors, SE3-Nets predict SE3 transformations for different parts of the scene. We test the system on three simulated tasks (using the physics simulator Gazebo) where we predict the motion of a varying number of rigid objects under the effect of applied forces and a 14-DOF robot arm with 4 actuated joints. We show that the structure underlying SE3-Nets enables them to generate a far more consistent prediction of object motion than traditional flow-based networks, while also learning a notion of “objects” without explicit supervision.
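
The difference between predicting per-point flow and predicting one SE3 transform per part can be seen in a small sketch. Below, a single rigid transform is turned into the flow it implies for every point on a part. The axis-angle rotation parameterization and the function names are assumptions chosen for illustration, not a description of the network's actual output layer.

```python
import numpy as np

def axis_angle_to_matrix(axis_angle):
    """Rodrigues' formula: 3-vector (unit axis times angle) -> 3x3 rotation matrix."""
    axis_angle = np.asarray(axis_angle, dtype=float)
    theta = np.linalg.norm(axis_angle)
    if theta < 1e-12:
        return np.eye(3)  # no rotation
    k = axis_angle / theta
    # Skew-symmetric cross-product matrix of the unit axis.
    K = np.array([[0.0, -k[2], k[1]],
                  [k[2], 0.0, -k[0]],
                  [-k[1], k[0], 0.0]])
    return np.eye(3) + np.sin(theta) * K + (1.0 - np.cos(theta)) * (K @ K)

def rigid_flow(points, axis_angle, translation):
    """Per-point flow implied by applying one rigid transform to all points."""
    R = axis_angle_to_matrix(axis_angle)
    return points @ R.T + translation - points

# Two points on the same part, rotated 90 degrees about the z-axis.
pts = np.array([[1.0, 0.0, 0.0], [2.0, 0.0, 0.0]])
flow = rigid_flow(pts, [0.0, 0.0, np.pi / 2], np.zeros(3))
moved = pts + flow
```

The two points receive different flow vectors, yet their mutual distance is unchanged after the motion. A network that outputs six numbers per part gets this rigidity for free, whereas independently predicted flow vectors carry no such guarantee.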



SE3-Nets: Learning Rigid Body Motion using Deep Neural Networks, Arunkumar Byravan and Dieter Fox, IEEE International Conference on Robotics and Automation (ICRA), 2017 (Best Vision Paper Finalist)