Many times a day, people effortlessly grasp objects, yet human grasping is a complex phenomenon that has proven challenging to emulate and analyze. If robots could better grasp objects, they could be more useful in homes and factories. If AI systems could better perceive human grasping, they could more naturally interact and collaborate with people. If researchers better understood human grasping, they might find new ways to help people whose hands are impaired. A key aspect of human grasping that has been hidden from view is the contact that occurs between the human hand and the object. We present a novel method that reveals this hidden interaction, opening up a new path for research on human grasping.
This blog post introduces ContactDB, the first large-scale dataset of contact maps from human grasps of household objects.
A contact map is a textured mesh of the object, where the texture indicates contact. A typical contact map looks like this (note the detailed, continuous and real-world nature of our data, which distinguishes ContactDB from previous attempts at capturing contact):
Why Capture Contact Maps?
Human grasping has traditionally been captured in the hand-pose space. Contact maps capture grasping from the novel perspective of contact. In addition, we envision that this data can be used to design ergonomic tools and soft robotic grippers capable of executing human contact patterns, and to develop a deeper understanding of hands in action from images and videos.
How We Capture Contact Maps
During a grasp, heat transfers from the warm human hand to the object surface. The resulting contact pattern remains clearly visible to a thermal camera even after the object is released.
To capture this pattern, we 3D print a set of household objects and abstract shapes to ensure uniform thermal properties. 50 participants were invited to our laboratory to hold the objects. All objects are grasped with the functional intent of handing them off, and more than half are also grasped with the intent of using them.
Once grasped, the objects are put on a turntable and scanned with a Kinect V2 RGB-D camera and a FLIR Boson 640 thermal camera. The thermal images are texture-mapped to the object 3D mesh to generate a contact map. A typical scan from the RGB, depth and thermal cameras looks like this:
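As a rough illustration of what a contact map stores, suppose each mesh vertex has already been assigned a thermal reading by the texture-mapping step. The sketch below rescales those readings to [0, 1] and optionally thresholds them. The function names, the sample temperatures, and the 0.5 threshold are hypothetical choices for illustration, not taken from the ContactDB processing code.

```python
def normalize_contact(temps):
    """Min-max rescale per-vertex thermal readings to the range [0, 1]."""
    lo, hi = min(temps), max(temps)
    if hi == lo:
        # No temperature variation: treat the whole surface as untouched.
        return [0.0 for _ in temps]
    return [(t - lo) / (hi - lo) for t in temps]


def binarize(values, threshold=0.5):
    """Mark a vertex as 'in contact' if its rescaled value reaches the threshold.

    The threshold is an assumed value for illustration only.
    """
    return [v >= threshold for v in values]


# Hypothetical per-vertex temperatures in degrees Celsius.
vertex_temps = [20.0, 25.0, 30.0]
contact_values = normalize_contact(vertex_temps)
contact_mask = binarize(contact_values)
```

Keeping the continuous values alongside the binary mask preserves the detailed, continuous nature of the data; the mask is only a convenience for later measurements.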
What Insights Does This Data Reveal?
Grasps are significantly influenced by the functional intent:
Soft tissue of the human hand in the palm and the proximal parts of the fingers plays a large role in grasping. This is shown by the following figure, which plots the average contact area for each object, calculated from the observed grasps. The red line indicates an upper bound on the contact area if the grasp were fingertip-only. For many objects, the average contact area is significantly higher than this fingertip-only upper bound.
This motivates the inclusion of non-fingertip areas in grasp prediction and modeling algorithms, and presents an opportunity to inform the design of soft robotic manipulators.
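One simple way to measure contact area on a mesh, given a binary per-vertex contact mask, is to sum the areas of the triangles whose three vertices are all in contact. The sketch below uses that assumed rule; the paper may compute contact area differently, and the function names and the toy mesh are hypothetical.

```python
import math


def triangle_area(a, b, c):
    """Area of a 3D triangle: half the norm of the cross product of two edges."""
    u = [b[i] - a[i] for i in range(3)]
    v = [c[i] - a[i] for i in range(3)]
    cross = (
        u[1] * v[2] - u[2] * v[1],
        u[2] * v[0] - u[0] * v[2],
        u[0] * v[1] - u[1] * v[0],
    )
    return 0.5 * math.sqrt(sum(x * x for x in cross))


def contact_area(vertices, faces, in_contact):
    """Sum the areas of faces whose vertices are all marked as in contact."""
    total = 0.0
    for face in faces:
        if all(in_contact[i] for i in face):
            total += triangle_area(*(vertices[i] for i in face))
    return total


# Toy mesh: a unit square split into two triangles.
vertices = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0), (1.0, 1.0, 0.0), (0.0, 1.0, 0.0)]
faces = [(0, 1, 2), (0, 2, 3)]
in_contact = [True, True, True, False]  # the fourth vertex was not touched
area = contact_area(vertices, faces, in_contact)
```

Averaging this quantity over all participants' grasps of an object gives a per-object figure that can be compared against a fingertip-only bound.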
See the paper for more analysis and insights.
Predicting Contact Maps from Object Shape
Robots are required to manipulate known objects in many situations. For example, teams had access to object 3D models in the DARPA Robotics Challenge. Models that can predict optimal contact regions from known object shape can help in the positioning of the robot before manipulation.
We have also developed the ContactGrasp algorithm, which synthesizes human-like functional grasps for diverse robotic end-effectors from ContactDB contact maps.
Since each participant’s contact map is a valid way to grasp the object, we need to learn a one-to-many mapping from object shape to contact map.
We adopted two approaches from the literature for this: DiverseNet and Stochastic Multiple Choice Learning (sMCL). Our experiments used two 3D object shape representations: voxel occupancy grid (processed by a CNN architecture with 3D convolutions) and point cloud (processed by the PointNet architecture).
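The core idea behind multiple choice learning can be sketched with an "oracle" loss: the model emits several candidate predictions, and only the candidate closest to the observed ground truth is penalized, so different candidates are free to specialize in different valid grasps. The sketch below is a simplified, framework-free illustration; the squared-error metric, the function names, and the toy data are assumptions, not the training code used in our experiments.

```python
def squared_error(pred, target):
    """Mean squared error between two equal-length per-vertex contact vectors."""
    return sum((p - t) ** 2 for p, t in zip(pred, target)) / len(target)


def oracle_loss(hypotheses, target):
    """Return (loss, index) of the hypothesis that best matches the target.

    In training, only this winning hypothesis would receive a gradient
    for the given ground-truth contact map.
    """
    losses = [squared_error(h, target) for h in hypotheses]
    winner = min(range(len(losses)), key=losses.__getitem__)
    return losses[winner], winner


# Three hypothetical predictions for an object with four surface points.
hypotheses = [
    [0.0, 0.0, 0.0, 0.0],
    [1.0, 1.0, 0.0, 0.0],
    [1.0, 0.0, 0.0, 0.0],
]
observed = [1.0, 1.0, 0.0, 0.0]  # one participant's contact map
loss, winner = oracle_loss(hypotheses, observed)
```

Because a different participant's contact map may select a different winner, the set of hypotheses as a whole can cover several distinct ways of grasping the same object.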
We found in our experiments that the voxel occupancy grid representation is better suited for this task, probably because the CNN with 3D convolutions learns a hierarchical representation, while PointNet does not.
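To make the voxel occupancy grid representation concrete, the sketch below converts a set of 3D surface points into a cubic grid of 0/1 occupancy values, which is the kind of input a CNN with 3D convolutions consumes. The grid resolution, the scaling rule, and the function name are illustrative assumptions and do not reproduce our preprocessing pipeline.

```python
def voxelize(points, resolution=4):
    """Map 3D points into a resolution^3 occupancy grid (nested lists of 0/1).

    Points are shifted to the origin and scaled uniformly by the largest
    bounding-box side so that the object's aspect ratio is preserved.
    """
    mins = [min(p[i] for p in points) for i in range(3)]
    maxs = [max(p[i] for p in points) for i in range(3)]
    extent = max(maxs[i] - mins[i] for i in range(3)) or 1.0  # avoid divide-by-zero
    grid = [
        [[0] * resolution for _ in range(resolution)]
        for _ in range(resolution)
    ]
    for p in points:
        # Clamp so points on the upper boundary fall into the last cell.
        ix, iy, iz = (
            min(int((p[i] - mins[i]) / extent * resolution), resolution - 1)
            for i in range(3)
        )
        grid[ix][iy][iz] = 1
    return grid


# Two hypothetical surface points at opposite corners of a unit cube.
grid = voxelize([(0.0, 0.0, 0.0), (1.0, 1.0, 1.0)], resolution=2)
```

A point cloud, by contrast, would keep the raw point coordinates as an unordered set rather than binning them into a regular grid.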
We evaluated our models on three object classes not seen during training (mug, pan, and wine glass) to test how well they generalize across objects. Some example predictions are shown below; of the 10 predictions made by our model, we show the 3 that are most realistic.
This particular model was trained with the voxel occupancy grid shape representation using DiverseNet for predicting contact maps for the ‘use’ intent. For more detailed quantitative evaluations of all our models, read our paper.
To test the generalization ability across object shapes, we also evaluated our model on new shapes of objects seen during training:
The camera is grasped from the side with the contact at the shutter button and the hammer is grasped at the handle, which are good ways to grasp these objects.
Dataset, Code and Trained Models
You can download the entire ContactDB dataset along with code to perform deep-learning experiments on contact maps at this GitHub repository. If you want to record your own contact maps or access our raw data, we have also open-sourced our data collection and processing code at this GitHub repository (data collection requires ROS).