Companies are collecting large amounts of data to understand users' patterns and provide personalized recommendations. To truly benefit from this data, companies must constantly analyze and process it to extract value. Otherwise, the collected data will simply be treated as digital garbage. As the volume and diversity of data increase, traditional data analysis and data science methods become inadequate. This has led to the emergence of a new field called Big Data!
Big Data refers to data whose volume and complexity make it difficult for traditional storage and analytical methods to handle. It often cannot be queried efficiently using standard SQL approaches or stored in a single traditional relational database. As a result, advanced data processing frameworks (Hadoop, Spark, etc.) have been developed to support the processing of Big Data.
The emergence of Big Data is closely linked to the growth of companies like Google and Facebook. These companies generate and store petabytes of data in various forms like text, images, videos, and social media interactions. This data is highly diverse, unstructured, and dynamic, which makes it challenging to process and analyze.
Here are some examples of popular sources of Big Data:
Stock Exchange Data: Investment firms and banks analyze data from various stock exchanges, such as buy and sell orders, profits, losses, and other financial information, to make informed investment decisions.
Social Media Data: Facebook, Instagram, and Twitter collect a wide range of data on their users like their interests, preferences, and demographics. They then use this data to personalize user experiences by customizing their news feeds.
Black Box Data: Aviation companies use black boxes in aeroplanes and helicopters to record a wide range of information about a flight like pilot actions, altitude, and speed. If there is an incident or accident, they analyze the data from the black box to understand what went wrong and take steps to prevent similar incidents in the future.
Map and Transport Data: Applications like Google Maps continuously collect data on the location and movement of devices that use them. This data helps them suggest the best routes to destinations, provide traffic updates, and support planning for transportation infrastructure.
The five characteristics of Big Data (also known as the Five V's of Big Data) are Velocity, Volume, Variety, Veracity and Value. Let's go through them one by one.
1. Velocity is the rate at which data is generated, collected, and processed. For example, millions of users simultaneously watch videos on YouTube. Every second, YouTube stores and processes data about their activity, like the videos they are watching, the time spent on the platform, and other information. Velocity is crucial to understand because it determines the resources needed for collecting, storing, and processing data.
2. Volume is the quantity of data being collected. The world population recently crossed the 8 billion mark, and a majority of these people are connected to the internet. Together they produce enormous amounts of data every day, which requires large-scale storage systems to store and process. Some estimates suggest that approximately 3 quintillion bytes of data are recorded daily. 1 quintillion bytes is equivalent to 1 billion gigabytes (GB), or 1 million terabytes (TB)!
3. Variety: Data collected can be present in multiple formats like text, image, video, audio, and many more. It includes structured, semi-structured, and unstructured data from a variety of sources like devices, people, the internet, processes, and sometimes from nature.
4. Veracity: One of the biggest problems with today's data is verifying its quality and authenticity. Data is collected from millions of different sources, which makes traceability extremely difficult. Because technology for publishing information is easily available and fact-checking is often neglected, it becomes critical to know whether the information captured in the data is true or false.
5. Value: Data scientists invest a great deal of time in Big Data techniques to extract value from data. Value therefore refers to how useful the data becomes once it has been processed and analyzed. Value need not always be monetary. Sometimes, the data helps verify a critical hypothesis, for example, in the medical or defence domains.
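The Volume and Velocity figures above are easier to reason about with a quick calculation. The sketch below is only a back-of-the-envelope estimate: it assumes decimal units (1 GB = 10^9 bytes, 1 TB = 10^12 bytes) and takes the 3 quintillion bytes per day estimate at face value. The function names are made up for illustration.

```python
# Back-of-the-envelope conversions for the Volume and Velocity discussion.
# Assumption: decimal (SI) units, i.e. 1 GB = 10**9 bytes, 1 TB = 10**12 bytes.
BYTES_PER_GB = 10**9
BYTES_PER_TB = 10**12
QUINTILLION = 10**18
SECONDS_PER_DAY = 24 * 60 * 60


def bytes_to_gb(n_bytes):
    """Convert a byte count to gigabytes."""
    return n_bytes / BYTES_PER_GB


def bytes_to_tb(n_bytes):
    """Convert a byte count to terabytes."""
    return n_bytes / BYTES_PER_TB


def average_rate_bytes_per_sec(bytes_per_day):
    """Average ingest rate needed to absorb a given daily volume."""
    return bytes_per_day / SECONDS_PER_DAY


# The daily estimate quoted in the article (an estimate, not a measurement).
daily_volume = 3 * QUINTILLION

print("1 quintillion bytes in GB:", bytes_to_gb(QUINTILLION))
print("1 quintillion bytes in TB:", bytes_to_tb(QUINTILLION))
print("Average TB per second:", bytes_to_tb(average_rate_bytes_per_sec(daily_volume)))
```

Under these assumptions, the daily estimate works out to roughly 34.7 TB arriving every second on average, which gives a feel for why velocity drives the resources needed for collection and storage.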
Big Data can be classified into three main categories based on its structure:
Structured Data: This type of data is usually stored in relational database management systems (RDBMS) in a tabular format consisting of rows and columns. This data can be queried and accessed using Structured Query Language (SQL).
Unstructured Data: This type of data does not have a specific format or structure. It typically includes text, images, videos, or audio data. Unstructured data is usually stored in non-relational databases (NoSQL), which do not require a predefined schema.
Semi-structured Data: This is a type of data that has some structure but cannot be represented in a traditional tabular format with rows and columns, like structured data. Instead, semi-structured data often contain some form of metadata or hierarchical structure that allows for more flexible querying and analysis. Common examples of semi-structured data formats include XML, JSON, and HTML.
JSON Data Example:
[
  {
    "color": "red",
    "value": "#f00"
  },
  {
    "color": "green",
    "value": "#0f0"
  },
  {
    "color": "blue",
    "value": "#00f"
  }
]
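To make the difference between semi-structured and structured data concrete, here is a minimal sketch that parses the color records as JSON and then loads them into a relational table that can be queried with SQL. It uses only Python's standard `json` and `sqlite3` modules. The table name `colors` and the column name `hex_code` are assumptions chosen for illustration. Note that strict JSON parsers such as `json.loads` require keys to be wrapped in double quotes.

```python
import json
import sqlite3

# Semi-structured input: a JSON array of objects (keys must be double-quoted).
raw = """
[
  {"color": "red",   "value": "#f00"},
  {"color": "green", "value": "#0f0"},
  {"color": "blue",  "value": "#00f"}
]
"""

# Parse the JSON text into a list of Python dictionaries.
records = json.loads(raw)

# Structured storage: an in-memory relational table with a fixed schema.
# The table and column names here are hypothetical.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE colors (color TEXT PRIMARY KEY, hex_code TEXT NOT NULL)")

# Named placeholders (:color, :value) are filled from each dictionary.
conn.executemany(
    "INSERT INTO colors (color, hex_code) VALUES (:color, :value)",
    records,
)

# Once the data is tabular, it can be queried with ordinary SQL.
row = conn.execute(
    "SELECT hex_code FROM colors WHERE color = ?", ("green",)
).fetchone()
print(row[0])
```

The JSON form tolerates missing or extra fields per record, while the table enforces one schema for every row. That trade-off is the practical difference between the two categories.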
Big Data requires advanced technologies to store and process large amounts of information. To meet this demand, a variety of scalable databases and distributed computing systems have emerged. The technologies used for Big Data can be divided into two categories: Operational Technologies and Analytical Technologies.
Among these solutions, Hadoop stands out as a widely used framework that supports both operational and analytical workloads. As an open-source platform, Hadoop can scale from a single server to thousands of machines and provides distributed computing through simple programming models.
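One such programming model is MapReduce, which Hadoop includes. The sketch below is not Hadoop code and does not use any Hadoop API. It is a single-process Python illustration of the map, shuffle, and reduce steps, using a word count as the example. All function names and the sample input are assumptions made for this illustration. In a real cluster, each step would run in parallel across many machines.

```python
from collections import defaultdict


def map_phase(line):
    """Map step: emit a (word, 1) pair for every word in a line."""
    for word in line.lower().split():
        yield word, 1


def shuffle(pairs):
    """Shuffle step: group all emitted values by their key."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_phase(key, values):
    """Reduce step: combine all values for one key into a single result."""
    return key, sum(values)


def word_count(lines):
    """Run map, shuffle, and reduce in sequence on an iterable of lines."""
    mapped = (pair for line in lines for pair in map_phase(line))
    grouped = shuffle(mapped)
    return dict(reduce_phase(key, values) for key, values in grouped.items())


# Hypothetical sample input standing in for files spread across a cluster.
sample_lines = ["big data needs big clusters", "data moves fast"]
counts = word_count(sample_lines)
print(counts)
```

The appeal of this model is that the map and reduce functions are written as if for a single machine, while the framework takes care of distributing the work and grouping intermediate results.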
Big Data engineers face several significant challenges.
In this article, we explored the idea of Big Data, which addresses the challenge of managing vast amounts of complex and diverse data. We discussed what constitutes Big Data, its types, characteristics, examples, use cases, the various technologies used, advantages, and the challenges faced by engineers working with Big Data. If you have any queries or feedback, please write to us at contact@enjoyalgorithms.com. Enjoy learning, Enjoy data science!