Carnegie Mellon University
January 05, 2024

RePOSE: 6D Refinement via Deep Texture Rendering

By Ashlyn Lacovara

Ashlyn Lacovara
  • The Robotics Institute
  • 412-268-9409

The Extended Reality Technology Center introduces RePOSE, "A Novel Iterative Refinement Approach for 6D Object Pose Estimation." 6D object pose estimation is a task in computer vision and robotics: determining the position and orientation of an object in three-dimensional space. The term "6D" refers to the six degrees of freedom that define this pose: three for position (x, y, z coordinates) and three for orientation (often represented as roll, pitch, and yaw angles). Accurately determining an object's 6D pose is essential for a wide range of applications, including:

  1. Robotics and Automation: Robots use 6D pose estimation to interact with objects accurately, such as in assembly lines, where precise manipulation is required.
  2. Augmented Reality (AR): In AR, digital content needs to be accurately overlaid on real-world objects. Knowing the 6D pose of these objects allows for a seamless integration of virtual and physical worlds.
  3. Self-driving Vehicles: Autonomous vehicles use 6D pose estimation to understand the orientation and position of nearby objects, which is crucial for safe navigation.
  4. 3D Modeling and Reconstruction: In creating 3D models from images or in reconstructing scenes, determining the exact pose of objects is necessary to build accurate representations.

The challenge in 6D object pose estimation lies in accurately determining both the position and orientation from sensor data, which can include images, depth data, or a combination of both. Advanced computer vision techniques have significantly improved the accuracy of pose estimation algorithms.
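
To make the six degrees of freedom concrete, the sketch below packs a position (x, y, z) and an orientation (roll, pitch, yaw) into a single 4x4 transform and applies it to a point. It assumes angles in radians and a Z-Y-X (yaw, then pitch, then roll) rotation order, which is one common convention among several. The function names pose_to_matrix and transform_points are illustrative and do not come from the RePOSE code.

```python
import numpy as np

def pose_to_matrix(x, y, z, roll, pitch, yaw):
    """Build a 4x4 homogeneous transform from a 6D pose (angles in radians).

    Assumed convention: R = Rz(yaw) @ Ry(pitch) @ Rx(roll).
    """
    cr, sr = np.cos(roll), np.sin(roll)
    cp, sp = np.cos(pitch), np.sin(pitch)
    cy, sy = np.cos(yaw), np.sin(yaw)

    Rx = np.array([[1.0, 0.0, 0.0], [0.0, cr, -sr], [0.0, sr, cr]])
    Ry = np.array([[cp, 0.0, sp], [0.0, 1.0, 0.0], [-sp, 0.0, cp]])
    Rz = np.array([[cy, -sy, 0.0], [sy, cy, 0.0], [0.0, 0.0, 1.0]])

    T = np.eye(4)
    T[:3, :3] = Rz @ Ry @ Rx   # three degrees of freedom: orientation
    T[:3, 3] = [x, y, z]       # three degrees of freedom: position
    return T

def transform_points(T, points):
    """Apply the pose T to an (N, 3) array of object-frame points."""
    return points @ T[:3, :3].T + T[:3, 3]

# Example: rotate 90 degrees about the z-axis, then shift by (1, 2, 3).
T = pose_to_matrix(1.0, 2.0, 3.0, 0.0, 0.0, np.pi / 2)
moved = transform_points(T, np.array([[1.0, 0.0, 0.0]]))
print(moved)  # approximately [[1. 3. 3.]]
```

A pose estimator's job is to recover these six numbers from sensor data; a refiner such as RePOSE starts from a rough estimate and improves it.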

Traditional techniques in this field have relied on Convolutional Neural Networks (CNNs), a type of deep learning model optimized for processing images. A CNN uses convolutional layers to automatically and adaptively learn spatial arrangements of features from input images. CNNs are widely used for tasks like image and video recognition, as well as image classification, due to their efficiency in detecting patterns. In pose refinement, CNNs process zoomed-in input images and rendered RGB images of the object to refine and update the object's pose. However, these methods are slow because CNNs require a lot of computing power. This issue becomes particularly pronounced when the poses of multiple objects must be refined.

To address these limitations, RePOSE employs a strategy termed "deep texture rendering." This technique uses a 3D model enhanced by a learnable texture for rapid feature extraction. Unlike conventional methods, deep texture rendering relies on a shallow multi-layer perceptron that directly regresses a view-invariant image representation of the object, thereby bypassing the intensive computations typically associated with CNNs.
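
The idea can be illustrated with a rough sketch. The code below is not the authors' implementation: it assumes a toy point-cloud "model" whose vertices each carry a feature vector (the learnable texture, random here instead of trained), replaces proper mesh rasterization with simple nearest-point splatting, and uses a hypothetical two-layer perceptron with made-up sizes. All names (render_features, mlp, focal) are placeholders chosen for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical toy model: N vertices, each with a C-dimensional feature.
N, C, H, W = 500, 8, 32, 32
vertices = rng.uniform(-0.1, 0.1, size=(N, 3))  # object-frame coordinates
texture = rng.normal(size=(N, C))               # "learnable texture" (untrained)

# Hypothetical shallow MLP (C -> 16 -> 3), applied to each pixel independently.
W1, b1 = rng.normal(size=(C, 16)) * 0.1, np.zeros(16)
W2, b2 = rng.normal(size=(16, 3)) * 0.1, np.zeros(3)

def mlp(features):
    """Two-layer perceptron with a ReLU, applied along the last axis."""
    hidden = np.maximum(features @ W1 + b1, 0.0)
    return hidden @ W2 + b2

def render_features(R, t, focal=40.0):
    """Project vertices under pose (R, t) and splat their features into an
    H x W x C image. The closest vertex wins at each pixel (a z-buffer)."""
    cam = vertices @ R.T + t                      # camera-frame coordinates
    u = np.round(focal * cam[:, 0] / cam[:, 2] + W / 2).astype(int)
    v = np.round(focal * cam[:, 1] / cam[:, 2] + H / 2).astype(int)
    image = np.zeros((H, W, C))
    depth = np.full((H, W), np.inf)
    visible = (u >= 0) & (u < W) & (v >= 0) & (v < H) & (cam[:, 2] > 0)
    for i in np.flatnonzero(visible):
        if cam[i, 2] < depth[v[i], u[i]]:
            depth[v[i], u[i]] = cam[i, 2]
            image[v[i], u[i]] = texture[i]
    return image, np.isfinite(depth)

# Render the model half a unit in front of the camera, then run the MLP.
feature_image, mask = render_features(np.eye(3), np.array([0.0, 0.0, 0.5]))
representation = mlp(feature_image)
print(representation.shape, int(mask.sum()))
```

The point of the sketch is the cost profile: because the features live on the 3D model, producing a representation for a new pose takes only a projection and a small per-pixel network, with no CNN pass over a rendered RGB image.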

Another significant advancement in RePOSE is the integration of a differentiable Levenberg-Marquardt (LM) optimizer. LM is a method for solving nonlinear least-squares problems, that is, for finding the parameters that make a model best fit the data. Imagine trying to adjust the settings on a complicated machine to get the best performance. The LM algorithm helps you tweak those settings step by step, so the machine's performance gets closer and closer to what you want. This component is pivotal in refining the pose swiftly and with high precision. The LM algorithm accomplishes this by minimizing the feature-metric error between the input and rendered image representations. Notably, this process eliminates the necessity of zooming into the image, further streamlining the refinement procedure. The image representations in RePOSE are specially trained so that the LM optimization converges smoothly in just a few steps.
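
For readers who want to see the step-by-step tweaking in action, here is a minimal, generic LM loop in NumPy. It is a textbook-style sketch, not RePOSE's differentiable optimizer: instead of a feature-metric error over pose parameters, it fits a toy curve y = a * exp(b * t), and the damping schedule (halve on success, double on failure) is just one simple choice. The function and variable names are illustrative.

```python
import numpy as np

def levenberg_marquardt(residual_fn, jacobian_fn, params, n_iters=100, damping=1e-2):
    """Minimize 0.5 * sum(residual_fn(params) ** 2) with a basic LM loop."""
    params = np.asarray(params, dtype=float)
    cost = 0.5 * np.sum(residual_fn(params) ** 2)
    for _ in range(n_iters):
        r = residual_fn(params)
        J = jacobian_fn(params)
        # Damped normal equations: (J^T J + damping * I) step = -J^T r
        A = J.T @ J + damping * np.eye(len(params))
        step = np.linalg.solve(A, -J.T @ r)
        candidate = params + step
        candidate_cost = 0.5 * np.sum(residual_fn(candidate) ** 2)
        if candidate_cost < cost:
            # Step helped: accept it and behave more like Gauss-Newton.
            params, cost = candidate, candidate_cost
            damping *= 0.5
        else:
            # Step hurt: reject it and take a smaller, more cautious step next.
            damping *= 2.0
    return params

# Toy problem (an assumption for illustration): recover a = 2.0, b = -1.5.
t = np.linspace(0.0, 1.0, 20)
y = 2.0 * np.exp(-1.5 * t)

def residual(p):
    a, b = p
    return a * np.exp(b * t) - y

def jacobian(p):
    a, b = p
    e = np.exp(b * t)
    return np.stack([e, a * t * e], axis=1)  # columns: d/da, d/db

estimate = levenberg_marquardt(residual, jacobian, [1.0, 0.0])
print(estimate)
```

In RePOSE, the residual would instead compare the input image representation with the rendered one, and the parameters being adjusted would be the six pose values.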

RePOSE operates at a speed of 92 frames per second (FPS), or roughly 11 milliseconds per frame. On the Occlusion LineMOD dataset, RePOSE achieved an accuracy of 51.6%, an absolute improvement of 4.1 percentage points over prior methods. It also delivers comparable results on the YCB-Video dataset with a significantly faster runtime.

For those interested in exploring or using this method, the authors have made the code for RePOSE publicly available on GitHub: https://github.com/sh8/repose

Researchers: Shun Iwase, Xingyu Liu, Rawal Khirodkar, Rio Yokota, Kris M. Kitani