Lots of companies are working to develop self-driving cars. And almost all of them use lidar, a type of sensor that uses lasers to build a three-dimensional map of the world around the car.
But Tesla CEO Elon Musk argues that these companies are making a big mistake.
“They’re all going to dump lidar,” Musk said at an April event showcasing Tesla’s self-driving technology.
“Anyone relying on lidar is doomed.”
“Lidar is really a shortcut,” added Tesla AI guru Andrej Karpathy. “It sidesteps the fundamental problems of visual recognition that is necessary for autonomy. It gives a false sense of progress, and is ultimately a crutch.”
In recent weeks I asked a number of experts about these claims. And I encountered a lot of skepticism.
“In a sense all of these sensors are crutches,” argued Greg McGuire, a researcher at MCity, the University of Michigan’s testing ground for autonomous vehicles. “That’s what we build, as engineers, as a society—we build crutches.”
Self-driving cars are going to need to be extremely safe and reliable to be accepted by society, McGuire said. And a key principle for high reliability is redundancy. Any single sensor will fail eventually. Using several different types of sensors makes it less likely that a single sensor’s failure will lead to disaster.
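To see why redundancy matters, consider a back-of-the-envelope calculation. The failure rates below are made-up numbers, and the math assumes the sensors fail independently—which real-world sensors often don't—but it illustrates why engineers like overlapping sensing modes:

```python
# Toy illustration of sensor redundancy. The failure rates are made-up
# numbers, and the calculation assumes failures are statistically
# independent, which is optimistic for real sensors.
p_camera_fail = 1e-3   # chance a camera misses an object in a given frame
p_lidar_fail = 1e-3    # chance lidar misses the same object
p_radar_fail = 1e-2    # chance radar misses it

# With one sensor, the miss rate is just that sensor's failure rate.
single_sensor_miss = p_camera_fail

# With three independent sensors, all of them must fail at the same time.
all_sensors_miss = p_camera_fail * p_lidar_fail * p_radar_fail

print(f"camera only: {single_sensor_miss:.0e}")           # 1e-03
print(f"camera + lidar + radar: {all_sensors_miss:.0e}")  # 1e-08
```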
“Once you get out into the real world, and get beyond ideal conditions, there’s so much variability,” argues industry analyst (and former automotive engineer) Sam Abuelsamid. “It’s theoretically possible that you can do it with cameras alone, but to really have the confidence that the system is seeing what it thinks it’s seeing, it’s better to have other orthogonal sensing modes”—sensing modes like lidar.
Camera-only algorithms can work surprisingly well
On April 22, the same day Tesla held its autonomy event, a trio of Cornell researchers published a research paper that offered some support for Musk’s claims about lidar. Using nothing but stereo cameras, the computer scientists achieved breakthrough results on KITTI, a popular image recognition benchmark for self-driving systems. Their new technique produced results far superior to previously published camera-only results—and not far behind results that combined camera and lidar data.
Unfortunately, media coverage of the Cornell paper created confusion about what the researchers had actually found. Gizmodo’s writeup, for example, suggested the paper was about where cameras are mounted on a vehicle—a topic that wasn’t even mentioned in the paper. (Gizmodo re-wrote the article after researchers contacted them.)
To understand what the paper actually showed, we need a bit of background about how software converts raw camera images into a labeled three-dimensional model of a car’s surroundings. In the KITTI benchmark, an algorithm is considered a success if it can accurately place a three-dimensional bounding box around each object in a scene.
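Here's a simplified sketch of that scoring idea. KITTI actually evaluates oriented boxes that can rotate around the vertical axis, so this axis-aligned version (with an illustrative function name) only captures the gist:

```python
import numpy as np

def axis_aligned_iou_3d(box_a, box_b):
    """Intersection-over-union of two axis-aligned 3-D boxes.

    Each box is (x_min, y_min, z_min, x_max, y_max, z_max). The real
    KITTI benchmark scores oriented boxes, so this is a simplification.
    """
    a, b = np.asarray(box_a, float), np.asarray(box_b, float)
    # Overlap along each axis, clamped at zero when the boxes don't touch.
    overlap = np.clip(np.minimum(a[3:], b[3:]) - np.maximum(a[:3], b[:3]), 0, None)
    inter = overlap.prod()
    vol_a = (a[3:] - a[:3]).prod()
    vol_b = (b[3:] - b[:3]).prod()
    return inter / (vol_a + vol_b - inter)

# A predicted box counts as a "hit" if its IoU with a ground-truth box
# clears the benchmark's threshold (0.5 or 0.7, depending on the setting).
print(axis_aligned_iou_3d((0, 0, 0, 2, 2, 4), (0.5, 0, 0, 2.5, 2, 4)) >= 0.5)
```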
Software typically tackles this problem in two steps. First, the images are run through an algorithm that assigns a distance estimate to each pixel. This can be done using a pair of cameras and the parallax effect. Researchers have also developed techniques to estimate pixel distances using a single camera. In either case, a second algorithm uses the depth estimates to group pixels together into discrete objects, like cars, pedestrians, or cyclists.
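Here's a rough sketch of that first step using OpenCV's stereo block matcher. The camera parameters and file names are placeholders, and the Cornell work used learned depth estimators rather than classical block matching:

```python
import cv2
import numpy as np

# Step 1 sketch: estimate per-pixel depth from a stereo pair using the
# parallax (disparity) between the left and right images. The camera
# parameters below are placeholders, not values from any real setup.
FOCAL_PX = 721.5      # focal length in pixels (assumed)
BASELINE_M = 0.54     # distance between the two cameras, in meters (assumed)

left = cv2.imread("left.png", cv2.IMREAD_GRAYSCALE)
right = cv2.imread("right.png", cv2.IMREAD_GRAYSCALE)

# Block-matching stereo algorithm from OpenCV; it returns disparity in
# fixed-point format (multiplied by 16), hence the division below.
matcher = cv2.StereoSGBM_create(minDisparity=0, numDisparities=128, blockSize=5)
disparity = matcher.compute(left, right).astype(np.float32) / 16.0

# Depth is inversely proportional to disparity: depth = focal * baseline / d.
valid = disparity > 0
depth = np.zeros_like(disparity)
depth[valid] = FOCAL_PX * BASELINE_M / disparity[valid]
```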
The Cornell computer scientists focused on this second step. Most other researchers working on camera-only approaches have represented the pixel data as a two-dimensional image, with distance as an additional value for each pixel alongside red, green, and blue. Researchers would then typically run these two-dimensional images through a convolutional neural network (see our in-depth explainer here) that has been trained for the task.
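In code, that conventional approach looks something like the toy PyTorch sketch below, where depth is simply stacked as a fourth channel alongside red, green, and blue (the network itself is a stand-in, not any published model):

```python
import torch
import torch.nn as nn

# Sketch of the conventional "2-D" approach: depth becomes a fourth
# channel next to R, G, B, and the stack is fed to a convolutional
# network. The network below is a toy stand-in, not a published model.
rgb = torch.rand(1, 3, 375, 1242)      # a camera image (KITTI-sized)
depth = torch.rand(1, 1, 375, 1242)    # per-pixel depth estimates
rgbd = torch.cat([rgb, depth], dim=1)  # shape: (1, 4, 375, 1242)

toy_cnn = nn.Sequential(
    nn.Conv2d(4, 16, kernel_size=3, padding=1),
    nn.ReLU(),
    nn.Conv2d(16, 32, kernel_size=3, padding=1),
)
features = toy_cnn(rgbd)
```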
But the Cornell team realized that using a two-dimensional representation was counterproductive because pixels that are close together in a two-dimensional image might be far apart in three-dimensional space. A vehicle in the foreground, for example, might appear directly in front of a tree that’s dozens of meters away.
So the Cornell researchers converted the pixels from each stereo image pair into the type of three-dimensional point cloud that is generated natively by lidar sensors. The researchers then fed this “pseudo-lidar” data into existing object recognition algorithms that are designed to take a lidar point cloud as an input.
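Conceptually, the conversion is a standard pinhole-camera back-projection. The sketch below—with an illustrative function name and simplified conventions, not the paper's actual code—shows how a depth map becomes the kind of point cloud a lidar-based detector expects:

```python
import numpy as np

def depth_to_point_cloud(depth, fx, fy, cx, cy):
    """Back-project a per-pixel depth map into an (N, 3) point cloud.

    This mirrors the idea behind "pseudo-lidar": instead of keeping depth
    as an extra image channel, every pixel becomes an (x, y, z) point in
    3-D space, the same kind of data a lidar sensor produces natively.
    The intrinsics (fx, fy, cx, cy) come from camera calibration; the
    conventions here are a simplified sketch, not the paper's code.
    """
    h, w = depth.shape
    us, vs = np.meshgrid(np.arange(w), np.arange(h))
    z = depth
    x = (us - cx) * z / fx   # pinhole model: pixel column -> lateral offset
    y = (vs - cy) * z / fy   # pixel row -> vertical offset
    points = np.stack([x, y, z], axis=-1).reshape(-1, 3)
    return points[points[:, 2] > 0]  # drop pixels with no depth estimate

# Two pixels that sit side by side in the image can end up many meters
# apart in the resulting point cloud if their depths differ -- e.g. a car
# in the foreground directly "in front of" a distant tree.
```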
“You could close the gap significantly”
“Our approach achieves impressive improvements over the existing state-of-the-art in image-based performance,” they wrote. In one version of the KITTI benchmark (“hard” 3-D detection with an IoU of 0.5), for example, the previous best result for camera-only data was an accuracy of 30%. The Cornell team managed to boost this to 66%.
In other words, one reason that cameras plus lidar performed better than cameras alone had nothing to do with the superior accuracy of lidar’s distance measurements. Rather, it was because the “native” data format produced by lidar happened to be easier for machine-learning algorithms to work with.
“What we showed in our paper is you could close the gap significantly” by converting camera-based data into a lidar-style point cloud, said Kilian Weinberger, a co-author of the Cornell paper, in a phone interview.
Still, Weinberger acknowledged, “there’s still a fair margin between lidar and non-lidar.” We mentioned before that the Cornell team achieved 66% accuracy on one version of the KITTI benchmark. Using the same algorithm on actual lidar point cloud data produced an accuracy of 86%.