This article outlines the three primary aspects of Big Data, known as the 3 V's, along with two secondary aspects, to help you understand the characteristics that determine whether a given dataset qualifies as Big Data.
Contrary to popular belief, Artificial Intelligence is not a new field. The theoretical designs of most AI algorithms in use today have existed for over seven decades. What we lacked was access to the amount of data needed to make such intelligent predictions. Only recently have we been able to gather data at this global scale and deploy the computational power to process all of it in parallel.
But what happens when massive datasets keep getting larger, more varied, and more complex in structure? They become what is collectively known as Big Data: data that is increasingly difficult to store, analyze, or visualize for the processes critical to deriving results within AI. The availability of large amounts of data enables organizations to make decisions based on evidence rather than intuition.
As of 2021, 2.5 quintillion bytes of data are produced by humans every single day. A quintillion is a number with 18 zeros. It is estimated that 463 exabytes of data will be generated by 5 billion internet users all over the world by the year 2025. These statistics reflect the importance of the first V, i.e., data volume. Big data is all about volume. Large amounts of data enable organizations to get a more holistic view of a customer by using current as well as historical data to derive insights. With such huge volumes of data arises the need for different and unique data processing and storage technologies. These datasets are simply too large to be processed by a traditional desktop computer and processor.
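To put the quoted figures on a common scale, here is a quick back-of-the-envelope calculation. The numbers are the article's cited statistics, not measurements of ours; one exabyte is taken as 10^18 bytes.

```python
# A quintillion is 10**18, so one exabyte (EB) is a quintillion bytes.
daily_bytes_2021 = 2.5e18        # article's 2021 figure: 2.5 quintillion bytes/day
eb_2021 = daily_bytes_2021 / 1e18  # the same figure expressed in exabytes

# Ratio against the article's projected 463 EB for 2025
growth = 463 / eb_2021

print(eb_2021)            # 2.5 (EB per day in 2021)
print(round(growth, 1))   # 185.2 (how many times larger the 2025 estimate is)
```

The point of the conversion is only that both statistics describe the same quantity, bytes, at scales no desktop machine can hold, let alone process.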
Every individual generates around 1.7MB of data per second, and there are over 4.66 billion active internet users. Big data velocity refers to the speed at which data is generated, acquired, and distributed. This is the massive pace at which data flows in from different sources such as machines, business processes, and human interactions with social media sites. High-velocity data requires special processing techniques with advanced analytics and algorithmic tools. For certain applications, the speed of data creation becomes even more important than the actual volume. Financial trading organizations use fast-flowing data to their advantage.
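The core idea behind handling high-velocity data is to process values as they arrive instead of storing the whole stream first. A minimal sketch, using a fixed-size sliding window over a toy stream of trade prices (the prices are made up for illustration):

```python
from collections import deque

def rolling_mean(stream, window=3):
    """Consume a stream one value at a time, keeping only a fixed-size
    window in memory -- the basic pattern for high-velocity data."""
    buf = deque(maxlen=window)
    for value in stream:
        buf.append(value)
        yield sum(buf) / len(buf)

# Toy "stream" of trade prices arriving one at a time.
prices = [100, 102, 101, 105, 107]
means = list(rolling_mean(prices))
print(means)
```

A real trading system would consume a live feed rather than a list, but the design choice is the same: memory use stays constant no matter how fast or how long the data keeps flowing.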
The different sources of both structured and unstructured data are relatively new and include GPS signals from phones, sensor readings, images on social networks, and more. With the increased penetration of the internet worldwide, smartphones and other mobile devices have become a crucial source of data about people, activity, and locations. Traditionally available data like spreadsheets and text make up structured data, which is easier to store. The large variety of unstructured data, including videos, images, and audio, is difficult to store and analyze to derive results. It is very rare to find data that is perfectly ordered and can be processed readily.
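The practical difference between the two kinds of data shows up as soon as you try to read them. A small sketch with invented example data: a spreadsheet-style row parses directly into named fields, while the same fact buried in free text needs extraction logic of its own.

```python
import csv
import io
import re

# Structured data: a CSV row parses straight into named fields.
structured = "user_id,city,steps\n42,Berlin,9500\n"
row = next(csv.DictReader(io.StringIO(structured)))
print(row["steps"])   # the value comes out by column name

# Unstructured data: the same fact in free text needs custom extraction,
# and a slightly different sentence would already break this pattern.
unstructured = "Yesterday user 42 walked about 9500 steps around Berlin."
match = re.search(r"(\d+)\s+steps", unstructured)
print(match.group(1))
```

For images, audio, or video, even this kind of pattern matching is unavailable, which is why unstructured data dominates the storage and analysis cost.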
Some sources hold that the first 3 V's, i.e., volume, velocity, and variety, are the only factors needed to characterize big data. The reason for this assumption is the slight ambiguity in the definitions of the remaining 2 V's: veracity and value.
Data veracity is the degree to which a data set is accurate and truthful. Conversely, in the context of big data, veracity also refers to the biases, abnormalities, and noise in the data.
The accuracy of data is determined by a variety of factors, including the source and type of the data as well as the preprocessing applied to it. The steps involved in improving data accuracy include removing abnormalities, inconsistencies, duplicates, and biases. The processing carried out on the data must be sensible and align with the business needs to generate the required outputs. Pete Warden, the author of the Big Data Glossary, writes in his book: “I probably spend more time turning messy source data into something usable than I do on the rest of the data analysis process combined.” This statement underlines the importance of accurate data and the need to preprocess data before analysis.
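Two of the steps named above, removing duplicates and removing abnormalities, can be sketched in a few lines. This is a deliberately minimal illustration on invented sensor readings; real cleaning pipelines involve far more than range checks and deduplication.

```python
def clean(readings, low, high):
    """Minimal cleaning pass: drop duplicate readings and
    out-of-range outliers. A sketch, not a full pipeline."""
    seen = set()
    cleaned = []
    for r in readings:
        if r in seen:               # duplication
            continue
        if not (low <= r <= high):  # abnormality / noise
            continue
        seen.add(r)
        cleaned.append(r)
    return cleaned

# Hypothetical temperature readings; -999.0 is a common sensor error value.
raw = [21.5, 21.5, 22.0, -999.0, 22.4, 22.0]
cleaned = clean(raw, low=-50, high=60)
print(cleaned)
```

Even this toy version makes Warden's point: before any analysis happens, someone has to decide which values are duplicates, which are noise, and which bounds count as plausible, and those decisions shape every result downstream.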
Data is defined as the set of qualitative and quantitative values collected by observation. Processing the collected data gives rise to information, which helps to derive the value of the data. Data by itself is of no use unless it can be used to derive actionable insights that help a business or organization grow. Data value is a concept often quantified as the potential economic value the data might hold. The value of the same data may vary from one organization to another. For example, the GPS data from a mobile phone may be used to calculate a navigation route by an app like Google Maps. The same GPS data can also be used to calculate the number of steps taken and the number of calories burnt by a fitness app like Apple Health. Hence, the concept of the value of data is quite weakly defined.
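The GPS example can be made concrete. The sketch below derives two different "values" from the same pair of coordinates: a distance (what a navigation app cares about) and a step estimate (what a fitness app cares about). The coordinates and the 0.78 m average stride length are illustrative assumptions, not figures from the article.

```python
from math import asin, cos, radians, sin, sqrt

def haversine_km(p, q):
    """Great-circle distance between two (lat, lon) points in km."""
    lat1, lon1, lat2, lon2 = map(radians, (*p, *q))
    a = (sin((lat2 - lat1) / 2) ** 2
         + cos(lat1) * cos(lat2) * sin((lon2 - lon1) / 2) ** 2)
    return 2 * 6371 * asin(sqrt(a))  # 6371 km: mean Earth radius

# Two hypothetical GPS fixes from the same phone.
track = [(52.5200, 13.4050), (52.5230, 13.4110)]

# A navigation app values the distance itself...
dist_km = haversine_km(track[0], track[1])

# ...while a fitness app might value the same fix as a step count
# (0.78 m is an assumed average stride, not a measured one).
steps = dist_km * 1000 / 0.78

print(round(dist_km, 2), int(steps))
```

Identical input, two different outputs, two different economic values: which one the data "has" depends entirely on who is processing it.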
The five V’s of big data are the characteristics used to distinguish ordinary data from big data: together, they determine whether an available dataset qualifies as big.
While data volume, velocity, and variety are aspects that can be measured quantitatively or qualitatively, the same is not true of data veracity and value. Noise and abnormalities can be measured by multiple methods, and there is no single right way to determine data veracity. Similarly, as the GPS example above shows, the value of a dataset is extrinsic to the data itself and depends on the business problem being solved.
How much data is created every day in 2021? (2021, March 18). Retrieved April 03, 2021, from https://techjury.net/blog/how-much-data-is-created-every-day