Introduction to Hadoop Framework for Big Data

Introduction

Big Data refers to data at a scale where traditional techniques fail to store and analyze it, so we need a new tech stack that can handle its analytical and operational needs. Hadoop is one such technology, specially designed to store and process massive datasets efficiently. It is also a frequently mentioned technology in job descriptions for Data Scientist positions.

Key takeaways from this blog

After going through this blog, you will learn the following things:

  • What is Hadoop, and what makes it so important? 
  • What challenges do engineers working on Big Data face, and how does Hadoop provide solutions? 
  • What are HDFS, YARN, and MapReduce in the Hadoop Framework? 
  • What are the challenges of using Hadoop? 
  • What are the various applications built on top of the Hadoop Framework?

Let’s start by knowing more about Hadoop.

What is Hadoop?

Hadoop is an open-source framework that enables distributed storage and large-scale processing of Big Data. But what does open-source software mean? Open-source software is software whose source code is available to the public, allowing anyone to inspect, modify, and enhance the base code. This type of software is continuously improved through community-driven development, providing many benefits, including cost savings, flexibility, and a larger pool of developers working to improve the software.

Hadoop has evolved into a powerful technology through the contributions and support of a large community of developers and users. The name "Hadoop" has no technical meaning: its creator, Doug Cutting, named it after his son's yellow toy elephant, which is also why the project's mascot is a yellow elephant. It is also known as Apache Hadoop, as it is a top-level project of the Apache Software Foundation and is licensed under the Apache License 2.0.

But why? Why is Hadoop so important in the context of Big Data? Let’s understand.

Why is Hadoop so important?

  • Hadoop supports all three formats of data: structured, semi-structured, and unstructured.
  • Hadoop runs on a cluster of machines working together, which makes it a scalable solution for handling massive datasets. For example, if our website sees a sudden surge of incoming users, we can add more machines (as nodes) to the same Hadoop cluster to manage the load.
  • With Hadoop, an organization can run several different analyses on the same stored data simultaneously.
  • It is resilient to hardware failures. If a node in the cluster goes down, its tasks are automatically redirected to other available nodes, and the data remains accessible through its replicas.

Why do traditional systems fail to process Big Data, and how does Hadoop handle these challenges? Let's see.

How does Hadoop handle the challenges of Big Data?

The three main challenges faced by engineers working on Big Data are:

Challenge 1: Huge datasets with a heterogeneous nature

Big Data comprises various datasets classified as structured, semi-structured, and unstructured. Previously, datasets were mostly structured and easy to manage: storing a structured dataset on a single system is straightforward. Modern datasets, however, are more complex. They include video, audio, and text, often in semi-structured or unstructured form, and they need non-conventional approaches.

Solution: Hadoop manages the diverse nature of modern datasets by splitting them into fixed-size blocks and distributing those blocks across multiple low-end machines. The default HDFS block size is 128 MB, so storing 1000 MB of data requires eight blocks (7 full blocks of 128 MB plus one block of 104 MB), spread over the cluster. Additionally, Hadoop keeps multiple copies (replicas) of every block, so if a machine fails, the data is still available on other machines and is automatically re-replicated. This distributed, replicated approach makes storage far more reliable.
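As a quick sanity check of this arithmetic, here is a tiny Python sketch (not part of Hadoop itself, just an illustration) that computes how many 128 MB blocks a file needs and how many block replicas the cluster keeps with the default replication factor of 3:

```python
import math

def hdfs_block_count(file_size_mb: float, block_size_mb: int = 128) -> int:
    """Number of HDFS blocks needed to store a file (default block size: 128 MB)."""
    return math.ceil(file_size_mb / block_size_mb)

# 1000 MB of data -> 7 full 128 MB blocks (896 MB) + 1 partial 104 MB block = 8 blocks
print(hdfs_block_count(1000))  # 8

# With the default replication factor of 3, the cluster stores 3 copies of every block
replication_factor = 3
print(hdfs_block_count(1000) * replication_factor)  # 24 block replicas in total
```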

Challenge 2: Traditional ways of processing data are very slow

Big Data is a large and varied collection of data, which makes it impractical to process on a single machine in one go: a sequential pass over the entire dataset would take a significant amount of time.

Solution: To handle this efficiently, Hadoop enables parallel data processing on multiple low-end machines. It divides the analysis task among the machines that hold the data, runs the pieces in parallel, and then combines the partial results.
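The split-process-combine idea itself is not tied to Hadoop, and a toy version of it can be sketched in plain Python using the standard multiprocessing module. This is only an illustration of the principle, not how Hadoop runs jobs internally:

```python
from multiprocessing import Pool

def process_chunk(chunk):
    # Stand-in for real analysis work: sum the squares of the values in this chunk
    return sum(x * x for x in chunk)

if __name__ == "__main__":
    data = list(range(1_000_000))            # the "big" dataset
    chunks = [data[i::4] for i in range(4)]  # split the work into 4 pieces

    with Pool(processes=4) as pool:
        partial_results = pool.map(process_chunk, chunks)  # process pieces in parallel

    total = sum(partial_results)             # combine the partial results
    print(total)
```

Hadoop applies the same pattern, except the "workers" are whole machines and the chunks are the HDFS blocks already stored on them.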

Challenge 3: Process overload and mismanagement of jobs running on the hardware

In traditional data processing methods, available resources such as CPU, GPU, memory (RAM), and disk are not explicitly tracked, which makes it challenging to assign jobs to machines and schedule them efficiently.

Solution: Hadoop addresses this with its resource management framework, YARN (Yet Another Resource Negotiator). YARN keeps track of the cluster's resources and allocates them to jobs based on availability. We will learn more about it later in this blog.

How does the Hadoop framework work?

Hadoop is organized into modules, each designed to address one or more of these Big Data challenges.

Hadoop Modules

There are four main modules in Hadoop:

1. Hadoop Distributed File System (HDFS)

Hadoop Distributed File System (HDFS) is the distributed storage layer of Hadoop. Think of how a team works: if one employee is overloaded, we hire additional people to share the load and increase productivity. Similarly, when the storage of one machine is not enough for Big Data, Hadoop distributes the data across multiple machines. HDFS is the component that ties all these machines together into a single storage cluster.

HDFS has two main components: the Name Node and the Data Nodes. They work together in a master-slave arrangement, where the Name Node is the master and the Data Nodes are the slaves. The Name Node keeps the file system metadata (which blocks make up which file and on which Data Nodes they live) and oversees the smooth functioning of the Data Nodes.

The Data Nodes store the actual data blocks and serve read, write, and replication requests according to their capacity. They also send periodic heartbeat signals to the Name Node to report that they are alive and functioning. This gives the system strong fault tolerance: if one machine goes down, its blocks are automatically re-replicated onto other available machines.
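To make the heartbeat idea concrete, here is a toy model in Python. It is purely illustrative (the names, timeout value, and logic are simplifications, not HDFS internals): a pretend Name Node records the last heartbeat from each Data Node and treats any node that stays silent too long as dead, so its blocks can be re-replicated elsewhere:

```python
import time

HEARTBEAT_TIMEOUT = 10  # seconds (illustrative value, not the HDFS default)

class ToyNameNode:
    def __init__(self):
        self.last_heartbeat = {}  # data node id -> timestamp of its last heartbeat

    def receive_heartbeat(self, node_id: str):
        self.last_heartbeat[node_id] = time.time()

    def dead_nodes(self):
        """Data nodes whose last heartbeat is older than the timeout; their blocks
        would be re-replicated onto healthy nodes."""
        now = time.time()
        return [node for node, ts in self.last_heartbeat.items()
                if now - ts > HEARTBEAT_TIMEOUT]

name_node = ToyNameNode()
name_node.receive_heartbeat("datanode-1")
name_node.receive_heartbeat("datanode-2")
print(name_node.dead_nodes())  # [] while both nodes keep reporting in
```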

[Figure: Hadoop Distributed File System (HDFS) architecture, with a Name Node managing multiple Data Nodes]

The Name Node is connected to multiple Data Nodes, which use standard, low-end hardware to store the Big Data. But why is commodity, low-end hardware used instead of powerful enterprise servers?

Using low-end (commodity) hardware for Data Nodes is far more cost-effective than buying enterprise-grade servers. An enterprise server costs around $10k per terabyte with full processing power, so a 100 TB deployment, which is not uncommon for Big Data, would already run into roughly a million dollars, and larger clusters would need hundreds of such servers and cost several million. Commodity machines bring this cost down dramatically, while HDFS's replication compensates for their lower reliability.

2. MapReduce

MapReduce is the framework that provides parallel processing in Hadoop. While discussing the challenges of Big Data processing, we mentioned that processing would be very slow because the data is enormous. But what if we distribute the processing across multiple machines and later combine the results? That is far more efficient than the traditional approach, and it is exactly what Hadoop does with the MapReduce technique.

There are two components in MapReduce: the Map task and the Reduce task. The Map task takes the input data and transforms it into key-value pairs. For example, in the dataset below, we need to count the number of dog, cat, and mouse entries in the list.

[Figure: How MapReduce works in Hadoop, showing the Map and Reduce tasks on the animal list]

Hadoop first splits the entire list into smaller chunks and stores them on different low-end machines. At the Map stage, each data sample becomes a 'key' and is paired with the value 1, much like a Python dictionary entry {'dog': 1}. Hadoop then operates on these key-value pairs; for example, it sorts them and brings all pairs with the same key together. Finally, at the Reduce stage, the values for each key are aggregated across all the parallel workers, and the final output (the count per animal) is produced. That's MapReduce for us.
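The same counting task can be sketched in a few lines of plain Python to show the map, shuffle, and reduce steps. In real Hadoop these run as distributed tasks on separate machines, but the logic is the same:

```python
from collections import defaultdict
from itertools import chain

animals = ["dog", "cat", "mouse", "dog", "dog", "cat"]  # the input list

# Split: pretend each half of the list lives on a different machine
splits = [animals[:3], animals[3:]]

# Map: emit a (key, 1) pair for every record in a split
def map_task(split):
    return [(animal, 1) for animal in split]

mapped = [map_task(split) for split in splits]

# Shuffle: bring all pairs with the same key together
grouped = defaultdict(list)
for key, value in chain.from_iterable(mapped):
    grouped[key].append(value)

# Reduce: aggregate the values for each key
counts = {key: sum(values) for key, values in grouped.items()}
print(counts)  # {'dog': 3, 'cat': 2, 'mouse': 1}
```

Hadoop runs exactly this pattern, except each split lives on a different Data Node and the map and reduce tasks are scheduled across the cluster.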

3. Yet Another Resource Negotiator (YARN)

In traditional data analysis and pre-processing, we rarely think about the available resources. For example, while filtering outliers, we are not usually concerned about CPU, GPU, or RAM usage. But in Big Data, we need a mechanism to manage these resources smartly, and Hadoop has a solution.

YARN (Yet Another Resource Negotiator) is the management module for Hadoop. It is designed to coordinate among large numbers of machines working together in a Hadoop cluster. YARN continuously monitors and manages the cluster nodes and systematically allocates resources to prevent any system from becoming overloaded. It also schedules jobs and tasks that need to be performed on Big Data, ensuring efficient use of resources and smooth operation of the system.

[Figure: The components of YARN in Hadoop]

It comprises four main components: the Resource Manager, the Node Manager, the Application Master, and the Containers.

  • Containers: A container is a bundle of physical resources (CPU cores, RAM, and disk) on a specific machine in which a task runs.
  • Resource Manager: The cluster-wide master to which we, as clients, submit a job (a specific operation or analysis on Big Data); it decides how the cluster's resources are shared among all running jobs.
  • Application Master: For each submitted job, an application master is started; it negotiates the containers the job needs from the resource manager and works with the node managers to launch and monitor them.
  • Node Manager: A node manager runs on every machine in the cluster; it launches and monitors the containers on that machine and reports its available resources back to the resource manager.

The smooth functioning of all these components in YARN helps Hadoop perform operations efficiently and effectively.
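To picture how these components interact, here is a deliberately simplified toy allocator in Python. It is a mental model only, not YARN's real scheduler or API: a job requests a container, and the request is granted only if some node still has enough free CPU and memory:

```python
class ToyNode:
    def __init__(self, name, cpus, memory_gb):
        self.name, self.free_cpus, self.free_memory_gb = name, cpus, memory_gb

    def try_allocate(self, cpus, memory_gb):
        """Grant a container on this node if enough resources are free."""
        if self.free_cpus >= cpus and self.free_memory_gb >= memory_gb:
            self.free_cpus -= cpus
            self.free_memory_gb -= memory_gb
            return True
        return False

class ToyResourceManager:
    def __init__(self, nodes):
        self.nodes = nodes

    def request_container(self, cpus, memory_gb):
        """Find any node that can host the requested container."""
        for node in self.nodes:
            if node.try_allocate(cpus, memory_gb):
                return f"container granted on {node.name}"
        return "request queued: no node has enough free resources"

rm = ToyResourceManager([ToyNode("node-1", cpus=4, memory_gb=16),
                         ToyNode("node-2", cpus=4, memory_gb=16)])
print(rm.request_container(cpus=2, memory_gb=8))   # container granted on node-1
print(rm.request_container(cpus=4, memory_gb=16))  # container granted on node-2
print(rm.request_container(cpus=4, memory_gb=16))  # request queued: ...
```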

4. Hadoop Common

Hadoop Common is the set of shared utilities that the rest of the framework depends on. It includes the common libraries and APIs that support YARN, HDFS, and MapReduce, along with the Java libraries and scripts needed to start the Hadoop system.

The framework's documentation and source code are freely available to the public, along with a list of related projects built on top of Hadoop; one can visit the official Apache Hadoop repository on GitHub for more detailed information.

Now let's put all these modules together and see how Hadoop works end to end.

How does Hadoop work?

Let's understand the complete picture of how Hadoop works, step by step. Remember the three main challenges of Big Data that Hadoop is trying to solve: enormous size, a single heavy process, and wasted resources. Here is how Hadoop helps:

  • Hadoop splits the entire dataset into blocks (128 MB each by default) and stores them across multiple low-end machines using the distributed file system (HDFS). Each block is replicated 3 times (by default) on different machines to avoid data loss in case of hardware failure.
  • As the dataset is enormous, we cannot run the entire operation in one go on a single machine. For example, suppose processing one data sample takes 1 millisecond and our Big Data has 100 million samples. A sequential run would then take 0.001 seconds × 100 million = 100,000 seconds, roughly 28 hours of continuous processing (a quick back-of-the-envelope check follows this list). Worse, if the run fails partway through, for example because one sample is corrupt, we would have to start the whole process again. With MapReduce, Hadoop solves this by processing the blocks in parallel on the machines that store them and then aggregating the partial results from each data node into the final output. This is the job of the MapReduce module.
  • Whenever a client submits a job, YARN's resource manager allocates sufficient resources from the cluster and avoids overloading any single machine.
  • After the reduce tasks complete, the final output is written back to HDFS.
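Here is the quick back-of-the-envelope check promised above, showing why parallelism matters. It assumes an idealized setup with no coordination overhead:

```python
samples = 100_000_000       # 100 million data samples
time_per_sample_s = 0.001   # 1 millisecond per sample

sequential_hours = samples * time_per_sample_s / 3600
print(round(sequential_hours, 1))  # ~27.8 hours on a single machine

# With the work spread over N nodes (ignoring coordination overhead),
# the wall-clock time shrinks roughly by a factor of N
for nodes in (10, 100):
    print(nodes, "nodes ->", round(sequential_hours / nodes, 2), "hours")
```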

Which companies use Hadoop Systems?

Hadoop is widely used by tech companies that collect and manage large amounts of data. The framework itself grew out of Google's MapReduce and Google File System papers, and well-known users include Meta, Netflix, Uber, Twitter, British Airways, and the US National Security Agency. As a result, Hadoop has become an essential tool for data scientists working at such companies.

If Hadoop is so efficient and charming, why don't we use it directly everywhere? In practice, writing raw MapReduce jobs is tedious, and Hadoop on its own only supports batch processing. To address such limitations, a whole ecosystem of applications has been built on top of the Hadoop framework. Let's look at some of them.

What are the various applications built over the Hadoop framework?

Some of the most popular components present in the Hadoop ecosystem are:

HBase

  • Apache HBase was designed with two primary goals: (1) hosting very large tables (billions of rows by millions of columns) on clusters of low-end hardware, and (2) providing real-time read/write access to that vast dataset.
  • It is a non-relational (NoSQL) distributed database built on top of the Hadoop Distributed File System (HDFS).
  • It provides fast, horizontally scalable, fault-tolerant access to Big Data. Horizontal scaling means adding more machines to the cluster, while vertical scaling means adding more power (CPU, RAM, storage, etc.) to an existing machine.
  • Some famous use cases of HBase are online analytics, time-series data, and real-time data processing.

HBase provides a high-performance, low-latency, and flexible data model, making it an attractive option for Big Data storage and management.
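To give a feel for how an application talks to HBase, here is a short sketch using the happybase Python client. It assumes a running HBase instance with its Thrift server on localhost, and the table name and column family ('page_views', 'stats') are hypothetical names invented for this example:

```python
import happybase

# Connect to the HBase Thrift server (assumed to be running on localhost)
connection = happybase.Connection('localhost')
table = connection.table('page_views')  # hypothetical table with column family 'stats'

# Write a cell: row key 'user-42', column 'stats:visits', value '17'
table.put(b'user-42', {b'stats:visits': b'17'})

# Random read by row key, the kind of low-latency access HBase is built for
row = table.row(b'user-42')
print(row[b'stats:visits'])  # b'17'

connection.close()
```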

Hive

  • Apache Hive is a data warehouse built on top of the Hadoop framework to store and query massive datasets efficiently. A data warehouse can be thought of as a central store of information on which we can run analyses efficiently.
  • Hive supports querying and analyzing large datasets stored in HDFS using HiveQL, a SQL-like query language. This SQL-like interface makes it popular with data scientists and analysts, since they barely need to learn a new framework to query the data (see the short sketch after this list).
  • It is designed to handle enormous datasets (Petabytes) and support distributed computing using MapReduce.
  • Hive is also extensible, allowing developers to add custom functions to support their specific data processing needs.
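Here is the short sketch of Hive's SQL-like interface mentioned above, using the pyhive package. It assumes a HiveServer2 instance reachable at localhost:10000 and a hypothetical table named 'orders' already present in the warehouse:

```python
from pyhive import hive

# Connect to HiveServer2 (assumed to be running on localhost:10000)
cursor = hive.connect(host='localhost', port=10000).cursor()

# HiveQL looks just like SQL, but under the hood it compiles into distributed jobs
cursor.execute("""
    SELECT country, COUNT(*) AS order_count
    FROM orders
    GROUP BY country
    ORDER BY order_count DESC
    LIMIT 10
""")

for country, order_count in cursor.fetchall():
    print(country, order_count)
```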

Cassandra

  • Apache Cassandra is a distributed NoSQL database management system designed for always-on availability. Availability in databases refers to the guarantee that an organization's data remains accessible to its end-users (a minimal usage sketch follows this list).
  • It allows organizations to store large amounts of data across multiple servers with no single point of failure.
  • It has proven linear scalability, which means we can add nodes to the cluster at any time and throughput grows roughly in proportion to the cluster size. One example is shown in the image below, where Netflix scales its Cassandra cluster horizontally as write traffic increases.

    [Figure: Netflix uses Cassandra's linear scalability to manage surges in write traffic. Source: Netflix Blog]

  • It can be easily integrated with other Big Data technologies like Apache Hadoop and Apache Spark.
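As a small illustration of how an application talks to Cassandra, here is a sketch using the DataStax cassandra-driver package for Python. The local node address, the keyspace 'shop', and the table 'users' are assumptions made for this example:

```python
from cassandra.cluster import Cluster

# Connect to a Cassandra node (assumed to be running on 127.0.0.1)
cluster = Cluster(['127.0.0.1'])
session = cluster.connect('shop')  # hypothetical keyspace

# Writes and reads use CQL, Cassandra's SQL-like query language
session.execute(
    "INSERT INTO users (user_id, name) VALUES (%s, %s)",
    (42, "Ada"),
)
row = session.execute("SELECT name FROM users WHERE user_id = %s", (42,)).one()
print(row.name)  # Ada

cluster.shutdown()
```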

Pig

While discussing MapReduce in Hadoop, we saw how Map and Reduce tasks perform parallel processing on Big Data. We looked at a simple counting example there, but in practice these tasks can be complex to write.

  • Apache Pig is a high-level platform for creating MapReduce programs that process and analyze large data sets. It provides an easier-to-use interface for writing complex data transformations, such as aggregation, summarization, and sorting, which can be difficult to express using MapReduce alone.
  • Pig is often used for big data processing, mining, and machine learning tasks. It allows developers to express complex data flows and logic in a simple script.
  • Pig also supports parallel processing on large datasets, making it efficient and effective for big data analytics. As a result, Pig is a popular tool among data engineers and data scientists working with big data.

Spark

One limitation of Hadoop is the lack of real-time analytics support: MapReduce only supports batch processing. The need for a technique better suited to fast, near-real-time data processing led to the development of Apache Spark.

  • Apache Spark has an advanced in-memory analytics engine that can run certain workloads up to 100 times faster than Hadoop MapReduce.
  • It is widely used for fast data analysis and applying machine learning algorithms on large datasets, making it a popular choice for Big Data processing.
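As a tiny teaser before that blog, here is a minimal PySpark snippet. It assumes the pyspark package is installed locally and that a hypothetical CSV file of events exists:

```python
from pyspark.sql import SparkSession

# Start a local Spark session
spark = SparkSession.builder.appName("quick-demo").getOrCreate()

# Load a (hypothetical) CSV file and run a simple aggregation in memory
events = spark.read.csv("events.csv", header=True, inferSchema=True)
events.groupBy("event_type").count().show()

spark.stop()
```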

We will learn about Apache Spark and Pyspark in our next blog.

Conclusion

Hadoop has become a valuable tool for companies that handle large amounts of data. It offers benefits such as scalability, distributed storage, speed, and analytics. In this article, we provided an overview of the basics of Hadoop and how it works. We also highlighted some popular applications built on top of the Hadoop framework that support various Big Data use cases. We hope you found the information helpful and enjoyed reading the article.


Enjoy Learning!
