Abstract
Humans can effortlessly predict, at least approximately, how a scene will change as a result of their actions. Such predictions, which we call mental simulations, are carried out in an abstract visual space that is not tied to a particular camera viewpoint. Inspired by this ability to simulate the environment in a viewpoint- and occlusion-invariant way, we propose a model of action-conditioned dynamics that predicts scene changes caused by object and agent interactions in a viewpoint-invariant 3D neural scene representation space, inferred from RGB-D videos. In this 3D feature space, objects do not interfere with one another, and their appearance persists across viewpoints and over time. This allows our model to predict future scenes simply by "moving" 3D object features according to cumulative object motion predictions, which are provided by a graph neural network operating over the object features extracted from the 3D neural scene representation. Moreover, our 3D representation allows us to alter the observed scene and run counterfactual simulations in multiple ways, such as enlarging objects or moving them around, and then simulating the corresponding outcomes. The mental simulations produced by our model can be decoded by a neural renderer into 2D image projections from any desired viewpoint, which aids interpretability of the latent 3D feature space. We demonstrate strong generalization of our model across camera viewpoints and across varying numbers and appearances of interacting objects, while outperforming multiple existing 2D models by a large margin. We further show effective sim-to-real transfer by applying our model, trained solely in simulation, to a pushing task on a real robotic setup.