Deep Learning Improves Robotic Vision

April 20, 2016

By John Sullivan

Distinguishing an actual book from a photo of a book seems so simple that we humans don't even think about it. But for a computer, it is another story.

In part, that is because most current research into computer vision, critical for tasks such as robotic manufacturing and automated driving, focuses on teaching computers to recognize objects in a two-dimensional plane. But researchers at Princeton University have found that adding a three-dimensional component to computer vision greatly increases both the accuracy and efficiency of the process.

[Photo: Researchers led by Jianxiong Xiao, assistant professor of computer science, are exploring ways to improve computer vision. The research team includes Daniel Suo, Fisher Yu, Xiao, and Shuran Song.]

Called Deep Sliding Shapes, the new process teaches computers to process images in 3D by representing each point as a voxel, a box-shaped element with X, Y and Z dimensions. This contrasts with standard photos, which render images as flat, two-dimensional elements, or pixels.
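
To make the idea concrete, here is a minimal sketch in Python, not the team's code, of how a cloud of 3D points might be discretized into such a voxel grid (the 5-centimeter voxel size and 64-voxel-per-side grid are illustrative assumptions):

    import numpy as np

    def voxelize(points, voxel_size=0.05, grid_shape=(64, 64, 64)):
        """Discretize a point cloud (an N x 3 array of x, y, z coordinates,
        in meters) into a binary grid of box-shaped voxels."""
        grid = np.zeros(grid_shape, dtype=np.uint8)
        # Shift points so the minimum corner sits at the grid origin,
        # then find each point's voxel index along x, y and z.
        idx = np.floor((points - points.min(axis=0)) / voxel_size).astype(int)
        # Keep only points that fall inside the grid bounds.
        inside = np.all(idx < np.array(grid_shape), axis=1)
        grid[tuple(idx[inside].T)] = 1  # mark those voxels as occupied
        return grid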

"We can see depth in photographs because our brains process the information and read it as three dimensions," said Jianxiong Xiao, an assistant professor of computer science who is leading the research project. His team is trying to solve the difficult challenge of training a computer to replicate the brain's process in adding depth to a visual field, without overtaxing the computer.

Early efforts to teach computers to "see" via voxels instead of pixels have been slow, eating up significant processing time. But the Princeton team has found techniques to speed up the process. In results to be presented at the IEEE Conference on Computer Vision and Pattern Recognition in June, the researchers demonstrated a roughly 200-fold increase in processing speed while improving the accuracy of the results.

In part, the research is made possible by the relatively recent arrival of affordable 3D cameras such as those found in Microsoft's Kinect game system. Researchers in Xiao's lab use these cameras to create three-dimensional images that they then use to train computers to recognize objects using depth as well as color.

Armed with a library of 3D images, the Princeton researchers initially compared scenes from the real world with the exemplar objects in the library one by one. The resulting program, called Sliding Shapes, calculated the position of each voxel in an image. But although the program was accurate, the researchers found it impractical.
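
A toy sketch, greatly simplified from the actual Sliding Shapes method, shows why such one-by-one comparison is expensive: every 3D window position in the scene must be scored against every exemplar in the library (the window size, stride and threshold here are made-up parameters):

    import numpy as np

    def sliding_shapes(scene_grid, exemplars, window=16, stride=4, threshold=0.7):
        """Slide a 3D window over a voxelized scene and compare its contents
        against every exemplar grid in turn, a brute-force search."""
        detections = []
        X, Y, Z = scene_grid.shape
        for x in range(0, X - window + 1, stride):
            for y in range(0, Y - window + 1, stride):
                for z in range(0, Z - window + 1, stride):
                    patch = scene_grid[x:x+window, y:y+window, z:z+window]
                    for label, exemplar in exemplars:
                        # Score: fraction of voxels agreeing with the exemplar.
                        score = np.mean(patch == exemplar)
                        if score > threshold:
                            detections.append((label, (x, y, z), score))
        return detections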

"The approach was computationally very difficult," said Shuran Song, a graduate student in computer science and the lead author of the Deep Sliding Shapes paper. "It was very slow."

So the team went back to the beginning and took a different approach: using a deep learning system to teach a computer how to process images in 3D. Deep learning uses multiple algorithms designed to work together as a sort of filter, reducing errors and uncertainty until the computer eventually arrives at a solution. One fascinating aspect of deep learning is that the computer gets better at solving problems as it repeatedly runs the algorithms on new data. That makes it possible to train a computer to solve similar problems more and more accurately, until the machine masters the skill.
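
As a toy illustration of that train-and-improve loop, the sketch below fits a single linear unit by gradient descent. A real deep network stacks many layers of such units, but the cycle of predicting, measuring the error and adjusting is the same idea:

    import numpy as np

    rng = np.random.default_rng(0)
    X = rng.normal(size=(100, 3))        # 100 training examples, 3 features
    true_w = np.array([1.5, -2.0, 0.5])  # the pattern the model should learn
    y = X @ true_w                       # target outputs

    w = np.zeros(3)                      # start knowing nothing
    for epoch in range(300):
        error = X @ w - y                # how wrong the current model is
        w -= 0.1 * X.T @ error / len(X)  # nudge the weights to reduce error

    print(w)  # approaches true_w as training repeats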

But typical deep learning algorithms work only on 1D or 2D signals, such as sentences or photos. To deploy deep learning in 3D space, the team proposed a novel formulation that characterizes geometric shapes using 3D convolution, a procedure for learning models of basic object parts.
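
Conceptually, a 3D convolution slides a small learned filter through the voxel grid the way a 2D convolution slides one across a photo. The naive sketch below spells that out; real systems use optimized GPU libraries rather than Python loops:

    import numpy as np

    def conv3d(volume, kernel):
        """Slide a small 3D filter (e.g. 3 x 3 x 3) through a voxel grid,
        producing a response map that is high wherever the local shape
        matches the pattern the filter encodes."""
        vx, vy, vz = volume.shape
        kx, ky, kz = kernel.shape
        out = np.zeros((vx - kx + 1, vy - ky + 1, vz - kz + 1))
        for x in range(out.shape[0]):
            for y in range(out.shape[1]):
                for z in range(out.shape[2]):
                    region = volume[x:x+kx, y:y+ky, z:z+kz]
                    out[x, y, z] = np.sum(region * kernel)
        return out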

To train the system, Xiao's team ran it on a cluster of graphics processors for several days.

"When we used deep learning, the testing was much faster," Song said.

Once the training was complete, the computer was able to process images in testing about 200 times as fast as the old 3D approach, the exemplar-based Sliding Shapes.

The system is also much more accurate than the non-deep-learning approach. In testing, scientists measure a system's precision – how often its identifications are correct, so that a cat is not labeled as a car – and its recall – how many of the objects present it actually finds, retrieving all the cats from an array of images. The Deep Sliding Shapes process improved the average precision and recall of 3D object detection by 14 percent.
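
For a single scene, those two measures reduce to a few lines of arithmetic. The illustrative computation below ignores details, such as checking 3D bounding-box overlap, that real benchmarks include:

    def precision_recall(detected, actual):
        """detected and actual are sets of object labels in one scene."""
        hits = len(detected & actual)     # correct detections
        precision = hits / len(detected)  # how many detections are right
        recall = hits / len(actual)       # how many objects were found
        return precision, recall

    # A detector that reports 3 objects, 2 correctly, in a scene with 4 objects:
    print(precision_recall({"bed", "chair", "lamp"},
                           {"bed", "chair", "sofa", "table"}))
    # (0.666..., 0.5)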

"Computer vision is moving more and more from analyzing single still images to understanding videos and spatial data," said Dieter Fox, a roboticist and a professor of computer science and engineering at the University of Washington who was not involved in the current project. "The Deep Sliding Shapes approach demonstrates how deep learning networks can be designed for 3D object detection, yielding extremely promising results on several challenging detection tasks. This research direction can also greatly improve the ability of robots to reason about objects in their environment."

The work was supported in part by the National Science Foundation, Intel and NVIDIA.

Xiao said the goal now is to improve the system's ability to process multiple images at high speed.

"Because a robot has a constant stream of video from the camera as input, our next step is to go beyond a static image," he said. The team would like to use "videos with multiple frames fused in 3D to further increase the recognition accuracy and reduce the processing time."