This article is the first in a series on 3D Perception.
How do robots “see” in 3D? As robots increasingly share our roads, offices, and homes, we trust them to reliably “see” people, tools, obstacles, and much more. What options are available to robot developers for endowing their creations with sight?
I’ll never forget my first point cloud. I was hired by a drone startup to look at options for 3D modeling of buildings. My first line of investigation was photogrammetry. In photogrammetric reconstruction, you take digital pictures of your target from many different perspectives. Then you find common features across the images, like corners of window frames or unique markings on a wall.
Once you have plenty of strong features that appear across many of the images, you estimate the geometry of how the pictures were taken. By refining many such estimates in software, we can often find a good approximation of where all the pictures were taken relative to each other. Finally, we take the pixels from our 2D images, assign them 3D coordinates, and create our 3D reconstruction.
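To make that last step concrete, here is a minimal sketch of how a single pixel with a known depth gets assigned 3D coordinates using a pinhole camera model. The function name and camera intrinsics (focal lengths fx, fy and principal point cx, cy) are made up for illustration, not taken from any particular photogrammetry package.

```python
import numpy as np

def backproject(u, v, depth, fx, fy, cx, cy):
    """Turn a pixel (u, v) with known depth into a 3D point in the
    camera's coordinate frame, using a pinhole camera model."""
    x = (u - cx) * depth / fx
    y = (v - cy) * depth / fy
    return np.array([x, y, depth])

# Hypothetical intrinsics: a pixel at the image center (320, 240)
# maps to a point straight ahead of the camera.
print(backproject(320, 240, 2.0, fx=600, fy=600, cx=320, cy=240))
# → [0. 0. 2.]
```

Real photogrammetry pipelines repeat this for every pixel of every image, after first transforming each camera's points into one shared world frame.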
video showing photogrammetry process aided by LiDAR
By finding features or keypoints (marked with dots in the video), we can stitch together pixels from many images into 3D models. The above video was created using RTAB-MAP, ROS, and an RGB-D sensor, basically a digital camera combined with an IR projector and detector for LiDAR (read on for more!).
To me, point clouds are gorgeous, maybe even sexy. We are digitizing the geometry of our world! A point cloud is just a bunch of points. The points always have 3 coordinates for position, but each point can also hold additional data like color, temperature, or reflection intensity.
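As a concrete sketch, here is one way to represent such a point cloud with a NumPy structured array; the field names and values below are invented for illustration, not from any specific sensor format.

```python
import numpy as np

# Each point always has x, y, z coordinates, plus optional per-point
# data -- here, RGB color and reflection intensity.
cloud = np.zeros(3, dtype=[
    ("x", np.float32), ("y", np.float32), ("z", np.float32),
    ("r", np.uint8), ("g", np.uint8), ("b", np.uint8),
    ("intensity", np.float32),
])

cloud[0] = (1.0, 2.0, 0.5, 255, 0, 0, 0.9)  # a red, highly reflective point
cloud[1] = (1.1, 2.0, 0.5, 0, 255, 0, 0.4)
cloud[2] = (1.2, 2.1, 0.6, 0, 0, 255, 0.1)

# Positions and attributes can be sliced independently.
print(cloud["x"])
```

Formats like PCD and LAS store essentially this: a flat list of points, each with coordinates plus whatever extra channels the sensor provides.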
Similar to photogrammetry, stereo vision is the way we humans see in 3D. Our eyes are separated by a known, fixed distance. When we see the same feature in both eyes, like a finger held out in front of the face, our brain estimates the distance from our eyes to that object. This gives us a perception of depth as we view the world. Robots mimic this by collecting images with two cameras mounted close together at a known, fixed distance apart (the baseline).
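The geometry behind this can be sketched in a few lines: for a calibrated stereo pair, depth equals the focal length times the baseline divided by the disparity, i.e., how far a feature shifts between the left and right images. The numbers below are made up for illustration.

```python
def depth_from_disparity(disparity_px, focal_length_px, baseline_m):
    """Estimate depth from stereo disparity.

    disparity_px: horizontal pixel shift of the same feature
                  between the left and right images.
    focal_length_px: camera focal length, in pixels.
    baseline_m: distance between the two cameras, in meters.
    """
    if disparity_px <= 0:
        raise ValueError("feature must appear shifted between the images")
    return focal_length_px * baseline_m / disparity_px

# Made-up example: a 700 px focal length, cameras 6 cm apart, and a
# feature that shifts 14 px between views sits about 3 m away.
print(depth_from_disparity(14, 700, 0.06))  # → 3.0
```

Note the inverse relationship: nearby objects produce large disparities, while distant objects barely shift at all, which is why stereo depth gets noisy at long range.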
CaliCam stereo camera from ASTAR Robotics
Another way to collect 3D data is LiDAR, which stands for Light Detection And Ranging.
As you may remember from physics class, light comes in many forms: X-rays, gamma rays, infrared waves, radio waves, microwaves, and our familiar visible light waves. Most LiDAR sensors use infrared (IR) light that is invisible to humans. LiDAR sensors send out pulses of IR light; the light hits the surrounding environment, bounces off, and some of it returns to the sensor where it is detected.
The sensor measures the time it takes for the light to travel out and back, then calculates the distance traveled using the equation for velocity and the speed of light: distance_traveled = time_measured * speed_of_light. Since the distance traveled is twice the distance from the sensor to the environment: distance_to_environment = distance_traveled / 2.
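The two formulas above translate directly into code; the helper name and the example round-trip time below are illustrative.

```python
SPEED_OF_LIGHT = 299_792_458  # meters per second

def lidar_distance(time_measured_s):
    """Distance from sensor to environment, given the measured
    round-trip time of an IR pulse."""
    distance_traveled = time_measured_s * SPEED_OF_LIGHT
    return distance_traveled / 2  # the pulse goes out AND back

# A pulse that returns after ~66.7 nanoseconds hit something ~10 m away.
print(round(lidar_distance(66.7e-9), 2))  # → 10.0
```

The tiny time scale is the engineering challenge here: resolving centimeters requires timing electronics accurate to fractions of a nanosecond.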
Graphic for theory of LiDAR: the sensor sends out an IR pulse, it bounces off an obstacle, and returns to the sensor
The most common sensor for detecting obstacles is probably the ultrasonic sensor.
Ultrasonic sensors work using the same principle that helps bats and dolphins navigate their environment--echolocation. The sensor has a speaker and a microphone. The speaker sends out a sound wave, it travels out to the environment, bounces off whatever is present, then part of the wave travels back to the sensor where it is picked up by the microphone.
Just like with LiDAR, we measure the time it takes the wave to travel out and back, then we calculate distance using the speed of sound in air: distance_traveled = time_measured * speed_of_sound. Again, the actual distance from the sensor to the environment is half the distance traveled.
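The calculation is the same sketch as for LiDAR, just with a much slower wave. One caveat worth noting: the speed of sound in air varies with temperature, so the constant below assumes roughly room temperature.

```python
SPEED_OF_SOUND = 343.0  # meters per second in air at ~20 °C

def ultrasonic_distance(time_measured_s):
    """Distance from sensor to obstacle, given the measured
    round-trip time of the sound wave."""
    distance_traveled = time_measured_s * SPEED_OF_SOUND
    return distance_traveled / 2  # the wave travels out and back

# An echo heard ~5.83 milliseconds after the ping puts the
# obstacle about 1 m away.
print(round(ultrasonic_distance(0.00583), 2))  # → 1.0
```

Because sound is about a million times slower than light, the timing hardware can be far simpler and cheaper, which is one reason ultrasonic sensors are so common on hobby robots.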
Graphic for theory of sonar: the sensor has a mic and speaker. The sensors sends out a sound wave with its speaker, it bounces off an obstacle, and the sound wave returns to the sensor where it is picked up by the microphone.
Which method of 3D perception should you use for your robot?
It all depends on the details. Ultrasonic sensors work well for short-range obstacle detection, but common sensors are often too noisy for mapping or more detailed kinds of ranging. LiDAR offers high accuracy even at long range, but working with the enormous point clouds from LiDAR sensors can be much more complicated than working with ultrasonic data or point clouds generated from camera images. A growing number of photogrammetry tools make it easy to feed in a series of 2D images and get back nicely filtered and aligned point clouds (my favorite is https://www.capturingreality.com/).
With companies like Tesla moving away from LiDAR and relying solely on arrays of cameras, it’s important to note the serious drawbacks of camera data.
Unlike LiDAR sensors, cameras need good lighting in the environment to produce good data. We have already seen several self-driving car accidents that likely could have been avoided by properly incorporating LiDAR data instead of relying so heavily on cameras. In one case, a white semi-truck pulled onto the road, and its color was so close to the background color of the sky that the autopilot did not detect the truck at all. In other cases, the night was so dark that pedestrians and cyclists crossing the road could not be picked up in time by the cameras. LiDAR sensors do not “see” according to color or ambient light; they produce their own light for detection.
If you are developing safety-critical applications like self-driving cars, please consider incorporating more LiDAR!
Hopefully, you now have a high-level understanding of a few of the most popular ways that robots “see” in 3D. Stay tuned for deeper dives into these technologies.