Big Data refers to data at a scale where traditional techniques fail to store and analyze it, so we need a new tech stack that can handle its analytical and operational needs. Hadoop is one such technology, designed specifically to store and process massive datasets efficiently. It is also a frequently mentioned technology in job descriptions for Data Scientist positions.
Key takeaways from this blog
After going through this blog, you will learn the following things:
Let’s start by knowing more about Hadoop.
Hadoop is an open-source framework that enables large-scale storage and processing of Big Data. But what does open-source mean? Open-source software is software whose source code is available to the public, allowing anyone to inspect, modify, and enhance it. Such software is continuously improved through community-driven development, which brings many benefits, including cost savings, flexibility, and a larger pool of developers working to improve it.
Hadoop has evolved into a powerful technology with the support of many contributors and users. The name Hadoop comes from a yellow toy elephant that belonged to the son of its creator. It is also known as Apache Hadoop, as it is a top-level project of the Apache Software Foundation and is licensed under Apache License 2.0.
But why? Why is Hadoop so important in the context of Big Data? Let’s understand.
Why do traditional systems fail to process Big Data, and which of its challenges does Hadoop address? Let’s see.
The three main challenges faced by engineers working on BIG Data are:
Big Data comprises various datasets classified as structured, semi-structured, and unstructured. Previously, datasets were primarily structured and very easy to manage; storing a structured dataset on a single system is straightforward. Modern datasets, however, are more complex. They include video, audio, and text in unstructured and semi-structured formats, and they need non-conventional approaches.
Solution: Hadoop manages the diverse nature of modern datasets by splitting them into blocks and distributing those blocks across multiple low-end machines. The default block size is 128 MB, so storing 1000 MB of data requires eight blocks (7 × 128 MB + 1 × 104 MB) spread over the cluster. Additionally, Hadoop keeps multiple copies (replicas) of each block, so if one machine fails, the data is automatically served from replicas on other available machines. This decentralized approach makes it more reliable.
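As a rough sketch of that arithmetic (assuming the default 128 MB block size and the commonly used replication factor of 3), here is how the block count works out in Python:

```python
# Sketch: how a 1000 MB file is split into HDFS-style blocks.
# Assumes the default block size of 128 MB and a replication factor of 3.
BLOCK_SIZE_MB = 128
REPLICATION_FACTOR = 3

def split_into_blocks(file_size_mb, block_size_mb=BLOCK_SIZE_MB):
    """Return the sizes of the blocks a file would be split into."""
    full_blocks = file_size_mb // block_size_mb
    remainder = file_size_mb % block_size_mb
    return [block_size_mb] * full_blocks + ([remainder] if remainder else [])

blocks = split_into_blocks(1000)
print(blocks)       # [128, 128, 128, 128, 128, 128, 128, 104]
print(len(blocks))  # 8 blocks
print(sum(blocks) * REPLICATION_FACTOR)  # 3000 MB of raw storage with 3 replicas
```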
Big Data is a large and varied collection of data, which makes it difficult to process on a single machine; running the analysis sequentially would take a significant amount of time.
Solution: To handle this efficiently, Hadoop enables parallel data processing across multiple low-end machines. It divides the analysis task among the machines and then combines their results, as illustrated by the sketch below.
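The core idea can be illustrated on a single machine with plain Python: split the data into chunks, process the chunks in parallel, then merge the partial results. This is only a conceptual sketch of the divide-and-combine principle, not how Hadoop itself is implemented.

```python
# Divide-and-combine: split the data, process chunks in parallel, merge results.
from multiprocessing import Pool

def process_chunk(chunk):
    # Stand-in for a real analysis task: sum the numbers in this chunk.
    return sum(chunk)

if __name__ == "__main__":
    data = list(range(1_000_000))
    n_workers = 4
    chunk_size = len(data) // n_workers
    chunks = [data[i:i + chunk_size] for i in range(0, len(data), chunk_size)]

    with Pool(n_workers) as pool:
        partial_results = pool.map(process_chunk, chunks)  # processed in parallel

    total = sum(partial_results)  # combine the partial results
    print(total)  # same answer as sum(data), computed in parallel
```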
In traditional data processing methods, resources such as CPU, GPU, RAM, and disk are not explicitly taken into account, making it challenging to assign jobs and schedule them efficiently.
Solution: Hadoop addresses this issue with YARN (Yet Another Resource Negotiator), its resource management framework. YARN allocates the necessary resources to each process based on availability. We will learn more about it later in this blog.
To address these challenges of Big Data, Hadoop is organized into several modules.
There are mainly four modules present in Hadoop:
Hadoop Distributed File System (HDFS) is the distributed storage system used in Hadoop. Think of an employee who is overloaded with work: we can hire additional people to share the load and increase productivity. Similarly, when the storage of one machine gets overloaded with Big Data, Hadoop distributes the data across multiple machines to manage it effectively. All these machines together form a Hadoop cluster, and HDFS is what ties them together.
HDFS has two main components: the Name Node and the Data Node. The Name Node and Data Nodes work together using a Master-slave mechanism, where the Name Node is the Master, and the Data Node is the Slave. The Name Node is responsible for managing the smooth functioning of the Data Nodes.
The Data Nodes read, write, process, and replicate the Big Data stored on them according to their capacity. They also send heartbeat signals to the Name Node to report that they are alive and functioning. This provides high fault tolerance: if one machine goes down, its data is automatically re-replicated to other available machines, as sketched below.
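Here is a toy Python sketch of that heartbeat idea; the class names, timeout value, and data structures are hypothetical and greatly simplified compared to the real HDFS implementation:

```python
# Toy model of the NameNode/DataNode heartbeat idea (hypothetical names, not the
# real HDFS classes): a DataNode is considered dead if it has not sent a heartbeat
# recently, and any block with a replica on a dead node needs a new copy.
import time

HEARTBEAT_TIMEOUT_SECONDS = 30  # illustrative value, not HDFS's real default

class ToyNameNode:
    def __init__(self):
        self.last_heartbeat = {}   # data node id -> timestamp of last heartbeat
        self.block_locations = {}  # block id -> set of data node ids holding a replica

    def receive_heartbeat(self, datanode_id):
        self.last_heartbeat[datanode_id] = time.time()

    def dead_datanodes(self):
        now = time.time()
        return [node for node, ts in self.last_heartbeat.items()
                if now - ts > HEARTBEAT_TIMEOUT_SECONDS]

    def blocks_to_rereplicate(self):
        """Blocks that lost a replica on a dead node and need a new copy."""
        dead = set(self.dead_datanodes())
        return [block for block, nodes in self.block_locations.items() if nodes & dead]
```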
The Name Node is connected to multiple Data Nodes, which use commodity, low-end hardware to store the Big Data. But why is low-end hardware used instead of standard enterprise servers?
Using low-end hardware for Data Nodes is far more cost-effective than using enterprise-grade servers. An enterprise server costs around $10k per terabyte with full processing power, so for Big Data workloads, where 100 TB of data is not uncommon, enterprise hardware would require hundreds of servers and cost millions of dollars.
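A quick back-of-the-envelope calculation makes the cost gap concrete; the enterprise figure comes from the text above, while the commodity-hardware price is only an assumed placeholder:

```python
# Back-of-the-envelope storage cost comparison.
# The $10k/TB enterprise figure is quoted above; the commodity figure is an
# assumption for illustration only, and real prices vary widely.
DATA_TB = 100
ENTERPRISE_COST_PER_TB = 10_000   # quoted above
COMMODITY_COST_PER_TB = 500       # assumed placeholder

print(DATA_TB * ENTERPRISE_COST_PER_TB)  # 1,000,000 dollars on enterprise servers
print(DATA_TB * COMMODITY_COST_PER_TB)   # 50,000 dollars on commodity hardware
```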
MapReduce is a framework that provides support for parallel processing in Hadoop. While discussing the challenges of Big Data processing, we mentioned that processing speed would be very slow because the data is enormous. But what if we distribute the processing across multiple systems and later combine the results? This is far more efficient than the traditional approach, and it is exactly what Hadoop does with the MapReduce technique.
There are two components in MapReduce: the Map task and the Reduce task. The Map task takes the input data and transforms it into key-value pairs. For example, suppose we have a list of animal records and need to count the number of dog, cat, and mouse entries.
Hadoop first splits the entire list into smaller sets and stores them on different low-end machines. Then, at the Map stage, each data sample becomes a ‘key’ and is given a value of ‘1’, much like a Python dictionary entry such as {‘dog’: 1}. Hadoop then operates on these key-value pairs; sorting is one such operation. After these operations, it brings identical keys together (the shuffle). Finally, the Reduce phase aggregates the results from the parallel processing and produces the final output. That’s MapReduce for us.
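The whole pipeline can be mimicked in a few lines of plain Python. Real Hadoop MapReduce jobs are typically written in Java against the MapReduce API, so treat this purely as a conceptual illustration of the map, shuffle, and reduce stages:

```python
# Conceptual sketch of MapReduce's three stages for the animal-counting example.
from collections import defaultdict

records = ["dog", "cat", "mouse", "dog", "dog", "cat"]

# Map: emit a (key, 1) pair for every record.
mapped = [(animal, 1) for animal in records]

# Shuffle: group the values of identical keys together.
grouped = defaultdict(list)
for key, value in mapped:
    grouped[key].append(value)

# Reduce: aggregate each key's values into the final count.
counts = {key: sum(values) for key, values in grouped.items()}
print(counts)  # {'dog': 3, 'cat': 2, 'mouse': 1}
```

The shuffle step is what allows each reducer to work on one key independently, which is why the reduce phase can also run in parallel across machines.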
In traditional data analysis and pre-processing, we rarely think about the available resources. For example, while filtering outliers, we are not usually concerned about CPU, GPU, or RAM usage. With Big Data, however, we need a mechanism to manage these resources smartly, and Hadoop has a solution.
YARN (Yet Another Resource Negotiator) is the management module for Hadoop. It is designed to coordinate among large numbers of machines working together in a Hadoop cluster. YARN continuously monitors and manages the cluster nodes and systematically allocates resources to prevent any system from becoming overloaded. It also schedules jobs and tasks that need to be performed on Big Data, ensuring efficient use of resources and smooth operation of the system.
It comprises four main components: the Resource Manager, the Node Manager, the Application Master, and the Containers.
The smooth functioning of all these components in YARN helps Hadoop perform operations efficiently and effectively.
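To make the division of responsibilities more concrete, here is a toy sketch of the allocation idea; the class names and numbers are hypothetical, and the real Resource Manager and its schedulers are far more sophisticated:

```python
# Toy sketch of YARN's allocation idea (hypothetical names, not the real API):
# a container is placed on a node only if that node still has enough free
# memory and CPU for the request.
class ToyNode:
    def __init__(self, name, memory_mb, vcores):
        self.name = name
        self.free_memory_mb = memory_mb
        self.free_vcores = vcores

class ToyResourceManager:
    def __init__(self, nodes):
        self.nodes = nodes

    def allocate_container(self, memory_mb, vcores):
        """Place a container on the first node with enough free resources."""
        for node in self.nodes:
            if node.free_memory_mb >= memory_mb and node.free_vcores >= vcores:
                node.free_memory_mb -= memory_mb
                node.free_vcores -= vcores
                return node.name
        return None  # no node can run this container right now

rm = ToyResourceManager([ToyNode("node-1", 8192, 4), ToyNode("node-2", 4096, 2)])
print(rm.allocate_container(memory_mb=6144, vcores=2))  # node-1
print(rm.allocate_container(memory_mb=4096, vcores=2))  # node-2
print(rm.allocate_container(memory_mb=6144, vcores=4))  # None: nothing fits
```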
Hadoop Common is the central component of the Hadoop framework. It includes all the standard utilities, libraries and APIs that support YARN, HDFS, and MapReduce. It also contains the necessary Java libraries to start the Hadoop system.
Hadoop Common also includes the documentation and source code of the framework and a section for contributions showcasing software built on top of Hadoop. It is freely available to the public, and one can visit the GitHub repository for Hadoop for more detailed information.
Now let’s put all these modules together and see how Hadoop works end to end.
Let’s walk through the complete working of Hadoop step by step. Remember the three main challenges of Big Data that Hadoop is trying to solve: massive data volume, a single heavy process, and wasted resources. Let’s see how Hadoop helps:
Hadoop is widely used by tech organizations that collect and manage large amounts of data. Some well-known organizations that use Hadoop systems include Google, Meta, Netflix, Uber, Twitter, British Airways, and the US National Security Agency. As a result, Hadoop has become an essential tool for data scientists working at these organizations.
If Hadoop is so efficient and appealing, why don’t we use it everywhere? Let’s understand why.
Some of the most popular components present in the Hadoop ecosystem are:
HBase provides a high-performance, low-latency, and flexible data model, making it an attractive option for Big Data storage and management.
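HBase is usually accessed from Java, but the third-party happybase Python client gives a feel for its flexible, column-family data model. The host, table name, and column family below are assumptions for illustration, and the code needs a running HBase Thrift server to actually execute:

```python
# A feel for HBase's column-family model via the third-party happybase client.
# The host, table name, and column family are assumptions for illustration.
import happybase

connection = happybase.Connection("localhost")  # assumed Thrift server address
table = connection.table("user_events")         # assumed table with family 'cf'

# Write a row: column qualifiers are created on the fly (flexible schema).
table.put(b"user-42", {b"cf:last_login": b"2023-01-15", b"cf:country": b"IN"})

# Low-latency point read by row key.
row = table.row(b"user-42")
print(row[b"cf:country"])  # b'IN'
```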
It has proven linear scalability, which means the system can be scaled out at any time as user traffic grows, with throughput increasing roughly in proportion to the number of nodes. One example of a linearly scalable system is shown in the image below, where Netflix uses Cassandra and write throughput grows linearly as nodes are added (horizontal scaling).
Source: Netflix Blog
While discussing MapReduce in Hadoop, we saw how Map and Reduce tasks are performed on Big Data to achieve parallel processing. We walked through a simple counting example there, but in practice, these tasks can be complex to write.
One challenge with Hadoop was the lack of real-time analytics support. Hadoop’s MapReduce only supports batch processing, and the need for a technique better suited to real-time data processing led to the development of Apache Spark.
We will learn about Apache Spark and PySpark in our next blog.
Hadoop has become a valuable tool for companies that handle large amounts of data, offering benefits such as scalability, reliable storage, speed, and analytics. In this article, we provided an overview of the basics of Hadoop and how it works, and highlighted some popular applications built on top of the Hadoop framework that support various Big Data use cases. We hope you found the information helpful and enjoyed reading the article.