Why Does Machine Learning Require So Much Training Data?

Last updated on 08 Jan, 2019

Over the last decade or so, the world of AI has been obsessed with the problem of pattern recognition. Ask any budding engineer, “Give me an algorithm to find X in an environment Y”, and expect to hear, “Give me T examples in environment Y, and I will predict X with such-and-such accuracy”.

Recent advances in ML techniques have given upcoming entrepreneurs the belief that computers can do magic.

Having watched this video, I wonder who we are to say otherwise. On top of that, a recent stroll through the CVPR’17 poster session convinced me that the academic world is testing the strength of these techniques on previously unimaginable, jaw-dropping problems, and the algorithms are passing with flying colors.

A basic computer is actually a pretty dumb entity working on binary electrical voltages, so it is safe to assume it can do no magic. It relies heavily on smart algorithms and tuning to solve the aforementioned problems.

This brings us to an important question:

How can I make a computer solve my problems?

The answer to this question has seen many paradigm shifts since computers were created, and the current approach is exactly what our budding engineer described: build ‘smart algorithms’ and present the computer with ‘enough’ real-world examples of the environment (training data), so that when the computer sees ‘similar data’, it knows what to do. With the business of “smart algorithms” being taken care of by computer scientists all over the world, one piece of the puzzle that might still haunt upcoming startups is “how much data is actually enough?”.

Before answering this question, let’s try to understand why this is even a question. Why can’t we have infinite data and just focus on solving the problem? The reason is that every piece of data collected has a dollar value attached to it. Data is usually collected manually, so people have to be trained and compensated for the tedious work of data collection. If I can make do with T training samples, I will not collect 10 × T training samples. This brings us back to the main question: how large is this T for my problem, and can I make do with a smaller T? The answer depends mainly on two factors:

  • How catastrophic is it for my product to make errors?
  • How diverse can the input to my product be?

While everyone wants to come up with a perfect product, real-world ML products are seldom perfect. They always have the potential to make errors, and different products have different tolerances for error. For example, an application that predicts which team will win a football game has a higher tolerance for error than goal-line technology that decides whether a goal was scored.

Whatever the product type, no one likes products that make mistakes. If your business revolves around customer satisfaction, the gap between 98.95% and 99.0% accuracy matters, and so every new training point matters. Imagine a 0.0001% drop in the accuracy of the left/right prediction in your self-driving car, and you would never think about reducing your training data.
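To see why such small accuracy gaps matter, it helps to turn them into a count of mistakes. The sketch below is only back-of-the-envelope arithmetic: the function name and the workload of one million predictions are assumptions made for illustration, and the 0.0001% drop is read as a drop of 0.0001 percentage points.

```python
def expected_errors(accuracy_percent: float, num_predictions: int) -> float:
    """Expected number of wrong predictions at a given accuracy (in percent)."""
    if not 0.0 <= accuracy_percent <= 100.0:
        raise ValueError("accuracy_percent must be between 0 and 100")
    return (100.0 - accuracy_percent) / 100.0 * num_predictions


# Hypothetical workload: one million predictions.
n = 1_000_000
errors_at_99_00 = expected_errors(99.0, n)    # about 10,000 mistakes
errors_at_98_95 = expected_errors(98.95, n)   # about 10,500 mistakes
extra_mistakes = errors_at_98_95 - errors_at_99_00

# A drop of 0.0001 percentage points is one extra mistake per million decisions.
extra_per_million = expected_errors(99.9999, n) - expected_errors(100.0, n)

print(round(extra_mistakes), round(extra_per_million))  # prints: 500 1
```

At this scale, the "small" gap between 98.95% and 99.0% is roughly 500 additional unhappy outcomes, which is why the tolerance for error drives how much data a product needs.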

Just as companies want error-free products, they also want products that work in any environment in the world. “A face detection algorithm developed for detecting faces in Africa should also detect an alien’s face when they arrive in 2050”. To meet such high standards, the algorithm needs to be presented with every possible kind of input, so that it learns a better notion of “similar data”.

The recent growth of deep learning architectures is necessary because our world is complex. To build something reliable enough to handle all its subtleties, we need to train our product on as much data as we can. To illustrate the impact of deep architectures on training data size, let us analyze the problem of face detection/recognition.

The famous Haar classifier or SVM-based face detectors can be trained on a few thousand samples to reach their optimal performance. In comparison, the FaceNet architecture for detecting and recognizing faces was trained on more than 400K training samples, and it has effectively “solved” the closed-set face recognition problem. However, detecting faces is a relatively simple problem in the field of computer vision, and more complex ML problems (language translation, self-driving) will only need more and more data before they are “solved”. If only Apple had access to infinite training data, Siri would be able to recognize my Indian accent as well as it recognizes the accents of my American colleagues.

The cost of collecting training data has inspired scientists to look for alternatives to manual data annotation. Two approaches at the forefront are transfer learning and synthetic data generation. Using a lot of data to train a dog detector and then using that detector as a starting point for detecting cats is a simple example of transfer learning. An important point to understand when discussing transfer learning is that it can draw on the power of unsupervised learning, which means training on data without annotations (which is extremely easy to collect). Algorithms can be built to learn important features from this unlabelled data and then be tuned with limited labelled training data. Synthetic data can be thought of as heuristics-based, computer-generated data that simulates the environment as closely as possible. The importance of this direction is substantiated by Apple’s recent effort, which received the best paper award at CVPR 2017. These approaches are still in their early stages of development. Only time will tell whether we can replace human annotations altogether.
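The dog-to-cat idea can be shown on a deliberately tiny problem. The following is a toy sketch, not a real detector: a one-variable linear model is first trained on a larger "source" data set, then fine-tuned for a few steps on three "target" points, and compared with the same model trained from scratch on those three points. All data values, step counts, and function names are assumptions made up for the illustration.

```python
def mse(w, b, data):
    """Mean squared error of the line y = w*x + b on (x, y) pairs."""
    return sum((w * x + b - y) ** 2 for x, y in data) / len(data)


def train(w, b, data, steps, lr=0.1):
    """Plain gradient descent on mean squared error, starting from (w, b)."""
    n = len(data)
    for _ in range(steps):
        grad_w = sum(2 * (w * x + b - y) * x for x, y in data) / n
        grad_b = sum(2 * (w * x + b - y) for x, y in data) / n
        w, b = w - lr * grad_w, b - lr * grad_b
    return w, b


# Source task: a larger labelled set (stand-in for the "dog" data), y = 2x + 1.
source = [(i / 10, 2 * (i / 10) + 1.0) for i in range(11)]
# Target task: related but different, with only three labelled points
# (stand-in for the "cat" data), y = 2x + 1.5.
target = [(0.0, 1.5), (0.5, 2.5), (1.0, 3.5)]

pretrained = train(0.0, 0.0, source, steps=2000)    # learn on the source task
fine_tuned = train(*pretrained, target, steps=20)   # short tuning on the target
from_scratch = train(0.0, 0.0, target, steps=20)    # same budget, no transfer

fine_tuned_loss = mse(*fine_tuned, target)
scratch_loss = mse(*from_scratch, target)
print(fine_tuned_loss < scratch_loss)  # prints: True
```

With the same small training budget on the target task, the model that starts from the source-task weights ends up with the lower error, which is the whole appeal of transfer learning when labelled target data is scarce.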

Two unavoidable factors that increase the need to collect more training data are human error and data shelf-life. Even though modern ML algorithms are robust to noise in training data, human errors end up reducing the effective training size in almost all cases. In addition, data collected using sensors (camera, microphone) suffers from the fact that what is state-of-the-art data collection now might be insufficient in a few years. Imagine classifying the ImageNet images using pixelated images obtained five years ago.
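To put a rough number on the human-error point above, one crude model is to assume that a mislabelled sample contributes nothing, so only the correctly labelled fraction counts toward the effective training size. That assumption, the function name, and the 5% error rate are illustrative choices rather than measured values.

```python
import math
from fractions import Fraction


def samples_to_collect(clean_needed: int, label_error_rate: float) -> int:
    """Samples to collect so that, on average, `clean_needed` are labelled correctly.

    Crude assumption: a mislabelled sample contributes nothing to training.
    """
    if not 0.0 <= label_error_rate < 1.0:
        raise ValueError("label_error_rate must be in [0, 1)")
    # Fraction avoids floating-point rounding surprises in the division.
    correct_fraction = 1 - Fraction(str(label_error_rate))
    return math.ceil(clean_needed / correct_fraction)


# Hypothetical project: 10,000 clean samples wanted, annotators wrong 5% of the time.
print(samples_to_collect(10_000, 0.05))  # prints: 10527
```

Under this simple model, a 5% labelling error rate means collecting about 5% more data just to stand still, before accounting for any harm the wrong labels themselves do.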

The quantity and quality of training data is undeniably a hot topic right now, and people are exploring different techniques to address it. In a few years, we might have a clearer picture of how much training data is sufficient for a given problem. But for now, it is safe to conclude: “the more, the better”.