Computer vision: Finding the best teaching frame in a video to help fight fake videos
The frame in which a human marks out the boundaries of an object makes a huge difference in how well AI software can identify that object through the rest of the video.
To improve computer vision in emerging technologies, University of Michigan researchers are working on BubbleNets, a new deep learning method to help computers delineate boundaries for annotation in the real world.
Contributing to a project that aims to detect “deepfake” videos, University of Michigan engineers developed software that improves a computer’s ability to track an object through a video clip by 11% on average.
The software, called BubbleNets, chooses the best frame for a human to annotate. In addition to helping train algorithms for spotting doctored clips, it could improve computer vision in many emerging areas such as driverless cars, drones, surveillance and home robotics.
“The U.S. government has a real concern about state-sponsored groups manipulating videos and releasing them on social media,” said Brent Griffin, U-M assistant research scientist in electrical and computer engineering. “There are way too many videos for analysts to assess, so we need autonomous systems that can detect whether or not a video is authentic.”
Current software for parsing video clips relies on humans to mark up the objects—such as people, animals and vehicles—in the video. “Video object segmentation” algorithms then follow the boundaries of the marked objects through the videos.
Today’s advanced “deep learning” programs require the human to mark only a single frame. The frame presented to the human is typically the first frame in the video, which is rarely the best choice. But until now, there was no automated way to choose a preferable frame.
When the Defense Advanced Research Projects Agency (DARPA) requested this capability, the U-M team was skeptical that it was even possible. The software wouldn’t even know what in the video you were trying to track, so how could it recommend a frame?
But with deep learning techniques, Griffin and senior author Jason Corso, U-M professor of electrical engineering and computer science, didn’t have to figure out how to choose the best annotation frame—the algorithm would do that. Their challenge was creating enough “training” data so that the algorithm could draw its own conclusions from a large set of examples.
Griffin and Corso started with 60 videos in which every frame had been annotated. If they posed the question the obvious way (“Which frame is the best annotation frame in each video?”), they would have only 60 training examples. Instead, they designed their “BubbleNets” software to compare two frames at a time. The software predicts which frame, if selected for a human to annotate, will enable the segmentation software to stay truer to the object’s boundaries. Because each video contains many possible pairings of its frames, this gave them nearly 745,000 pairs of frames for training the algorithm.
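To make the pairwise idea concrete, here is a minimal Python sketch. It is an illustration under stated assumptions, not the researchers' code. The function names, the toy frame counts and the toy scores are all hypothetical, and a simple score lookup stands in for the trained network. The article also does not say how pairwise predictions are combined into a final choice. The single pass below, which carries the preferred frame forward in the manner of a bubble sort, is one plausible reading suggested by the software's name.

```python
from typing import Callable, Sequence


def count_ordered_pairs(frames_per_video: Sequence[int]) -> int:
    """Count ordered pairs (i, j) with i != j, formed within each video."""
    return sum(n * (n - 1) for n in frames_per_video)


def select_annotation_frame(num_frames: int,
                            prefers: Callable[[int, int], bool]) -> int:
    """Pick a frame using only pairwise comparisons.

    prefers(a, b) should return True when frame a is predicted to be a
    better annotation frame than frame b. In a real system a trained
    network would make this prediction; here it is a hypothetical stand-in.
    """
    if num_frames < 1:
        raise ValueError("need at least one frame")
    best = 0
    for candidate in range(1, num_frames):
        # Keep whichever frame the comparator prefers and carry it forward.
        if prefers(candidate, best):
            best = candidate
    return best


# Toy data (made up): three short videos with 4, 5 and 6 frames.
toy_frame_counts = [4, 5, 6]
print(len(toy_frame_counts), "videos give",
      count_ordered_pairs(toy_frame_counts), "ordered frame pairs")

# Made-up quality scores for one four-frame video, used as the comparator.
toy_scores = [0.2, 0.9, 0.5, 0.7]
chosen = select_annotation_frame(
    len(toy_scores), lambda a, b: toy_scores[a] > toy_scores[b]
)
print("frame suggested for annotation:", chosen)
```

Even in this toy case, three videos yield 62 ordered pairs rather than three whole-video examples, which is the same effect that turned 60 annotated videos into hundreds of thousands of training pairs.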
It is hard to say exactly what BubbleNets looks for in an annotation frame, but testing showed it preferred frames that:
- Weren’t too near the beginning or end of the video.
- Looked most like other frames in the video.
- Showed a clear view of the objects in the video.
BubbleNets is already “a small cog” in DARPA’s multi-university media forensics program, Griffin said. In an effort to identify falsified propaganda videos, DARPA needs to train its own algorithms on manipulated videos. BubbleNets helps other software automatically erase objects from videos to create training data.
But BubbleNets could also be useful in other robotics and computer vision tasks. For instance, future home robots will need to learn the layout and contents of a house. The robot would be able to present its owner with a set of frames that contain unidentified objects.
“Think about a toddler. A toddler sort of knows what they know, and then at some point, they realize they don’t really know something. So they ask a question. And that’s what we want to enable the computer to do,” Corso said.
Computer vision algorithms that have to operate without human input, such as those for driverless cars or drones, could also benefit. In these cases, the software would sift through training video clips looking for things that it didn’t recognize. Then, when it found a problematic clip, BubbleNets would choose the best frame for a human to explain.
The research was presented today at the IEEE/CVF Computer Vision and Pattern Recognition conference and was a finalist for best paper.
Corso is a member of Michigan Robotics and is also co-founder and CEO of the computer vision company Voxel51. Griffin is an affiliate of Michigan Robotics.