4D-Net boosts autonomous driving vision capabilities by fusing point cloud, camera, and time data

Designed to address the problem of accurately detecting objects — like other vehicles and pedestrians — at distance, Google and Waymo's 4D-Net offers a novel and generalizable approach to sensor fusion with some impressive results.


17 Mar, 2022

Designed to find links between 2D imagery and 3D point cloud data captured over time, 4D-Net offers a big boost in long-range object detection.

The key to safe, reliable autonomous vehicles, perhaps even more than the intelligence of the on-board self-driving system itself, is likely to be found in how efficiently they can process sensor data. Just as with visual acuity tests for human drivers, it’s vital to know that an autonomous vehicle can spot hazards and react accordingly, no matter how small or distant the problem may be.

Traditional two-dimensional camera systems and three-dimensional sensors, like LiDAR (Light Detection and Ranging), may not go far enough for full reliability and safety, leading a team at Google and Alphabet’s autonomous vehicle subsidiary Waymo to look into the fourth dimension: 4D-Net, an approach to object detection which fuses two- and three-dimensional data with the fourth dimension, time, with the claim of significantly improved performance.

Time enough

“This is the first attempt to effectively combine both types of sensors, 3D LiDAR point clouds and onboard camera RGB images, when both are in time,” claim Google Research scientists and paper co-authors AJ Piergiovanni and Anelia Angelova in a joint statement on the work. “We also introduce a dynamic connection learning method, which incorporates 4D information from a scene by performing connection learning across both feature representations.”

The 4D-Net approach stems from a simple observation: The majority of modern sensor-equipped vehicles include two- and three-dimensional sensors, typically in the form of multiple camera modules and LiDAR, data from which is gathered across a period of time — but very few efforts have been made to gather everything in a single place and process it as a whole.

A diagram showing how 4D-Net operates. A block labelled "RGB video feature maps" takes multiple maps from camera images captured over time and sends them to a connection search system for projection. A separate box shows point clouds captured over time being sorted into stacked pillars, then learned features, then a pseudo-image processed via a backbone point cloud network for merging with the projected images and the output of recognised 3D bounding boxes. The 4D-Net system aims to boost object recognition accuracy at distance by blending two-dimensional camera imagery with 3D point cloud data, all gathered over time to capture motion.

4D-Net addresses this gap, fusing three-dimensional point-cloud data with visible-light camera imagery and adding a time element by processing a series of each captured over a set period. The secret to its success: a novel learning technique which can autonomously find and build connections between the data, fusing it dynamically at different levels in order to boost performance over any of the data feeds alone.

“Images in time are […] highly informative, and complementary to both a still image and PCiT [Point Cloud in Time],” the researchers explain of the benefit to the approach. “In fact, for challenging detection cases, motion can be a very powerful clue. While motion can be captured in 3D, a purely PC [Point Cloud]-based method might miss such signals simply because of the sensing sparsity” — the same issue, incidentally, which means distant or small objects can be missed by a LiDAR sensor but picked up on visible-light camera systems or the driver’s naked eye.

Machine learning through time

To handle both types of data, the team turns to a range of pre-processing steps. 3D point cloud data is run through PointPillars, a system for converting the data into a pseudo-image which can be further processed using a convolutional neural network (CNN) designed for two-dimensional data, with a time indicator added to each point to create a denser representation including motion. Conversion into fixed-sized representations, effectively sub-sampling the point cloud, is also used — an approach which densifies the point cloud where data is sparse and sparsifies it where data is dense, thus boosting performance at long ranges.
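The pillarization step described above can be sketched in a few lines. This is a toy illustration only, not the paper's implementation: the grid size, cell size, and per-pillar point cap are made-up parameters, and the key point is simply that each point carries its timestamp as an extra channel before the pillars are flattened into a fixed-size pseudo-image a 2D CNN can consume.

```python
import numpy as np

def pillarize(points, times, grid=(4, 4), cell=25.0, max_pts=8):
    """Toy PointPillars-style step: bucket (x, y, z) points into a fixed
    x-y grid of 'pillars', appending each point's timestamp as a fourth
    channel so motion information survives the flattening.
    All names and shapes here are illustrative, not the paper's."""
    gx, gy = grid
    # One fixed-size slot per pillar: (max_pts, 4) -> x, y, z, t
    pillars = np.zeros((gx, gy, max_pts, 4), dtype=np.float32)
    counts = np.zeros((gx, gy), dtype=int)
    for (x, y, z), t in zip(points, times):
        i, j = int(x // cell), int(y // cell)
        # The cap on points per pillar sub-samples dense regions,
        # while sparse regions keep every point they have.
        if 0 <= i < gx and 0 <= j < gy and counts[i, j] < max_pts:
            pillars[i, j, counts[i, j]] = (x, y, z, t)
            counts[i, j] += 1
    # Collapse each pillar to one per-cell feature vector (here a mean),
    # yielding a dense "pseudo-image" for a 2D convolutional backbone.
    pseudo_image = pillars.sum(axis=2) / np.maximum(counts[:, :, None], 1)
    return pseudo_image  # shape (gx, gy, 4)
```

The fixed per-pillar capacity is what gives the sub-sampling behaviour the article describes: dense regions are trimmed, while sparse long-range regions keep everything.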

The two-dimensional camera data, meanwhile, is processed into feature maps through Tiny Video Networks, with the data then projected to align the 3D points with their corresponding points on the 2D imagery — a process which assumes “calibrated and synchronized sensors.” For point cloud data which lies outside the view of the vehicle’s cameras, a vector of zeroes is applied.
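The projection-and-alignment step might look something like the following toy sketch. The pinhole intrinsic matrix, image size, and placeholder feature values are all illustrative assumptions; the only behaviour taken from the article is that each 3D point is mapped onto the image plane, and points falling outside the camera's view receive a zero feature vector.

```python
import numpy as np

def project_point_features(points, K, image_size, feat_dim=3):
    """Toy projection step: map 3D points (in camera coordinates) onto
    the 2D image plane with a pinhole intrinsic matrix K, attaching a
    zero feature vector to any point outside the camera's view.
    A real rig needs calibrated, synchronized extrinsics as well."""
    H, W = image_size
    feats = np.zeros((len(points), feat_dim), dtype=np.float32)
    pixels = np.full((len(points), 2), -1, dtype=int)
    for n, (x, y, z) in enumerate(points):
        if z <= 0:           # behind the camera: keep the zero vector
            continue
        u = K[0, 0] * x / z + K[0, 2]   # perspective divide + offset
        v = K[1, 1] * y / z + K[1, 2]
        if 0 <= u < W and 0 <= v < H:
            pixels[n] = (int(u), int(v))
            feats[n] = 1.0   # stand-in for sampled RGB features
    return pixels, feats
```

In the real system the sampled values would be entries from the Tiny Video Networks feature maps rather than constants, but the in-view/out-of-view split is the same.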

A diagram showing a multi-feed variant of 4D-Net, in which high-resolution image, medium-resolution image, and video data are linked to the point cloud backbone for generation of 3D boxes. A variant of the 4D-Net system using multiple resolutions of image and video feed proved ideal, offering additional accuracy gains over the single-feed variant in benchmark testing.

The truly smart part of the 4D-Net system, though, comes in the form of its connection architecture search — the means by which it is able to extract the most, and most suitable, information from the fused data. A one-shot lightweight differentiable architecture search finds related information in both 3D and in time and connects it across the two different sensing modalities — and learns the combination of feature representations at various abstraction levels for both sensors.

“[This] is very powerful,” the team explains, “as it allows to learn the relations between different levels of feature abstraction and different sources of features.” To further tweak the approach for autonomous vehicles, the team modified the connections to be dynamic, based on the concept of self-attention mechanisms, allowing the network to dynamically select particular visible-light data blocks for information extraction — meaning it can learn how and where to select features based on variable input.
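The dynamic, self-attention-style connections can be illustrated with a minimal sketch. Everything here is a made-up simplification: a single learned projection `W` turns the point-cloud feature into one softmax weight per candidate RGB feature block, so different inputs emphasise different abstraction levels, which is the behaviour the researchers describe.

```python
import numpy as np

def dynamic_connection(pc_feat, rgb_blocks, W):
    """Toy dynamic-connection step in the spirit of self-attention:
    a small projection W scores each candidate RGB feature block from
    the point-cloud feature, and a softmax over those scores decides
    how much of each block to blend in. Shapes are illustrative only."""
    scores = W @ pc_feat                    # one score per RGB block
    exp = np.exp(scores - scores.max())     # numerically stable softmax
    weights = exp / exp.sum()
    fused = sum(w * b for w, b in zip(weights, rgb_blocks))
    # Concatenate the fused RGB feature onto the point-cloud feature
    # for the downstream detection head.
    return np.concatenate([pc_feat, fused])
```

Because the weights are recomputed per input rather than fixed at training time, the network can lean on high-resolution image features for small distant objects and coarser ones elsewhere.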

Impressive results

Testing both a single-stream and a multi-stream variant of the system, the latter pulling in additional input streams in the form of still-image and video feeds running at different resolutions, the team claims some impressive gains over rival state-of-the-art approaches.

Tested against the Waymo Open Dataset, 4D-Net improved on average precision (AP) across all tested rival approaches. While its performance proved, on average, weaker at shorter distances, its ability to recognize more distant objects, particularly in the 50-meter-plus range, is reportedly unsurpassed, especially when running in multi-stream mode.

A chart showing the performance of 4D-Net compared to rival approaches StarNet, LaserNet, PointPillars, MVF, Huang et al, PillarMultiView, and PVRCNN. Overall performance favours 4D-Net; accuracy is lower at detection distances under 30m, but noticeably higher at 30-50m and higher still at 50m and beyond. Runtime, meanwhile, is above LaserNet and PillarMultiView but below PVRCNN. The team's experimentation shows a significant accuracy gain for 4D-Net over rival approaches at mid- to long-range, though a drop in accuracy for shorter detection distances.

“We demonstrate improved state-of-the-art performance and competitive inference run-times,” the team concludes, “despite using 4D sensing and both modalities in time. Without loss of generality, the same approach can be extended to other streams of RGB images, e.g., the side cameras providing critical information for highly occluded objects, or to diverse learnable feature representations for PC [Point Cloud] or images, or to other sensors.”

The researchers suggest that the 4D-Net approach can also be used outside the field of autonomous driving, wherever there’s a need to capture different aspects of the same domain by aligning audio, video, text, and image data automatically.

The team’s work was presented at the International Conference on Computer Vision (ICCV) 2021, and has been made available under open-access terms. A supporting write-up by AJ Piergiovanni and Anelia Angelova is available on the Google AI Blog. The researchers have pledged to make their code available under an open-source license, but it had not been published at the time of writing.


AJ Piergiovanni, Vincent Casser, Michael S. Ryoo, and Anelia Angelova: 4D-Net for Learned Multi-Modal Alignment, IEEE/CVF ICCV '21. DOI 10.1109/ICCV48922.2021.01515.


A freelance technology and science journalist and author of best-selling books on the Raspberry Pi, MicroPython, and the BBC micro:bit, Gareth is a passionate technologist with a love for both the cutting edge and more vintage topics.
