Random Forest, a.k.a. Random Decision Trees, is a supervised learning algorithm in machine learning that can solve both classification and regression problems. It belongs to the family of CART (Classification and Regression Trees) algorithms: it combines the predictions of multiple decision trees and provides the best possible output. It can achieve good accuracy even on simple datasets and is hence very popular in communities offering data science competitions. In this article, we will discuss the Random Forest algorithm in detail.
So let's start without any further delay.
Random Forest leverages the power of Decision Trees, which are also its building blocks. The later part of this discussion uses the basics of Decision Trees, so we recommend looking at our Decision Tree Algorithm blog to familiarize yourself with them.
Decision trees often perform well on training data but poorly on the testing dataset. In other words, decision trees are prone to overfitting, especially when a tree is particularly deep. Hence, a single Decision Tree is not the best fit for complex real-life problems. One intuitive way to tackle this problem is to build multiple decision trees, train them, and then make a conclusive decision based on all the trees' predictions. That is exactly what Random Forest does for us.
Random forest is a flexible, easy-to-use supervised machine learning algorithm that falls under the Ensemble learning approach. It strategically combines multiple decision trees (a.k.a. weak learners) to solve a particular computational problem. If we talk about all the ensemble approaches in machine learning, the two most popular ensemble methods are Bagging and Boosting. To understand the Random Forest, we require the Bagging approach. So, let's learn about it first.
Bagging, also called Bootstrap Aggregating, is an ensemble technique in machine learning designed to improve the stability and accuracy of machine-learning algorithms. It helps reduce overfitting by lowering the variance of the output. To understand how Bagging works, let's first understand what bootstrapping is and how it works.
Bootstrapping is a statistical technique used for data resampling. It involves repeatedly resampling a dataset with replacement, a phrase that appears in almost every definition of bootstrapping. The objective is to create multiple training datasets by drawing random samples from the original training set. Ordinarily, once a sample is selected in a random trial, it is removed from the pool for subsequent trials. In bootstrapping, we do not do that: an already-selected sample keeps the same probability of being selected in subsequent trials, which is why we call it resampling a dataset "with replacement".
To understand this better, take the example of a bag containing 5 balls (1 Red, 1 Blue, 1 Pink, 1 Brown, 1 Purple). We pick a random ball from the bag and note its color. We then put the ball back into the same bag, so the probability of picking any color remains the same on the next draw.
In bootstrapping, we create multiple datasets by selecting random samples from the original training set. A single observation might appear multiple times in a newly formed dataset, also called a bootstrapped dataset. The number of observations in each bootstrapped dataset equals the number of observations in the original training set. These bootstrapped training datasets are then used to train the weak learners. This technique helps reduce the variance of the predictions, vastly improving the overall predictive performance.
Complete Dataset:        X1  X2  X3  X4  X5
Bootstrapped Dataset 1:  X3  X1  X3  X3  X5
Bootstrapped Dataset 2:  X5  X5  X3  X1  X2
Bootstrapped Dataset 3:  X5  X5  X1  X2  X1
Bootstrapped Dataset 4:  X3  X1  X3  X3  X5
Bootstrapped Dataset 5:  X4  X4  X4  X4  X1
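The idea behind the table above can be reproduced in a few lines of code. Below is a minimal numpy sketch, using hypothetical observation labels X1 to X5, of drawing bootstrapped datasets with replacement (your random draws will differ):

import numpy as np

rng = np.random.default_rng(seed=0)
original_dataset = np.array(["X1", "X2", "X3", "X4", "X5"])   # the complete dataset

# Each bootstrapped dataset is the same size as the original and is drawn
# with replacement, so the same observation can appear more than once.
for i in range(5):
    bootstrapped = rng.choice(original_dataset, size=len(original_dataset), replace=True)
    print(f"Bootstrapped Dataset {i + 1}:", bootstrapped)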
Now that we have enough understanding of bootstrapping, let's continue with Bagging. Bagging comprises the following three steps, which are more or less common to Random Forest:
In Random Forest, we apply the same general Bagging technique using Decision Trees as the weak learners, along with one extra modification. Let's learn about this modification, which differentiates Random Forest from the general Bagging approach.
The fundamental idea behind a Random Forest is to combine the predictions made by many decision trees into a single model. Predictions made by individual decision trees may be inaccurate on their own, but when combined, the aggregate prediction tends to land closer to the actual value for the classification or regression problem at hand.
Random Forest combines the simplicity of Decision Trees with flexibility, resulting in vastly improved accuracy. But the question remains: what exactly makes Random Forest different from plain Bagging?
When training any tree-based algorithm, we split a node by selecting the best feature among all the features present in the training dataset. In Random Forest, we instead randomly select only a subset of features from the total features, and the best splitting feature from that subset is used to split each node in a tree. In Bagging, by contrast, all features are considered when splitting a node; Random Forest is therefore a natural extension of Bagging. For example, if a dataset has ten different features, the trees in a random forest might consider only 3 of the 10 features at each split.
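To make the contrast concrete, here is a minimal scikit-learn sketch (the parameter values are illustrative, not tuned) comparing plain Bagging of decision trees, where every split may consider all features, with a Random Forest, where each split considers only a random subset:

from sklearn.ensemble import BaggingClassifier, RandomForestClassifier

# Plain Bagging: the default base learner is a decision tree, and every
# split may evaluate all available features.
bagging = BaggingClassifier(n_estimators=100, random_state=0)

# Random Forest: every split evaluates only a random subset of features
# (here, the square root of the total number of features).
forest = RandomForestClassifier(n_estimators=100, max_features="sqrt", random_state=0)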
Multiple bootstrapped datasets are created by drawing samples randomly with replacement from the original dataset. The same observation can appear more than once, and each bootstrapped dataset contains the same number of observations as the original dataset.
bootstrap = True
# When bootstrap=False, every tree is trained on the whole original
# dataset instead of a random bootstrapped sample.
Build decision trees over the bootstrapped datasets formed in the previous step, but let these decision trees consider only a random subset of features at each split. Generally, the size of this subset is the square root of the total number of features in the original dataset, though it can be tuned for optimal performance. If there are 36 independent features in the training dataset, Random Forest will randomly select six of them at each split while building the decision trees. A variant based on the log of the number of features is also popular in industry. These choices are popular because they have repeatedly worked well in experiments across many datasets.
max_features = sqrt(n_features)
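As a quick sanity check of these two heuristics, here is a tiny sketch for a hypothetical dataset with 36 features (the log variant is shown with base 2, as in scikit-learn's 'log2' option):

import numpy as np

n_features = 36
print(int(np.sqrt(n_features)))    # sqrt rule -> 6 features per split
print(int(np.log2(n_features)))    # log2 rule -> 5 features per split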
Finally, new samples are fed through the trained decision trees, and every tree makes its own prediction. For classification, majority voting among the trees determines the new sample's final class. For a regression task, the final prediction is calculated by averaging the predictions of the individual decision trees.
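Here is a tiny numpy sketch of this aggregation step, using hypothetical per-tree outputs for a single new sample:

import numpy as np

class_votes = np.array([1, 0, 1, 1, 0])            # classification: each tree votes for a class
reg_preds = np.array([3.2, 2.9, 3.5, 3.1, 3.0])    # regression: each tree predicts a value

final_class = np.bincount(class_votes).argmax()    # majority vote -> class 1
final_value = reg_preds.mean()                     # average of tree predictions -> 3.14
print(final_class, final_value)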
The figure below illustrates how the Random Forest algorithm works.
Statistically, about one-third of the original data never appears in a given bootstrapped dataset. This left-out portion is known as the Out-of-Bag dataset.
Not all the data points from the original dataset appear in a given bootstrapped dataset; those that are left out are collectively known as the Out-of-Bag dataset and can be used to test the Random Forest's accuracy. We measure the accuracy of our Random Forest model as the proportion of Out-of-Bag samples it classifies correctly. Conversely, the proportion of incorrectly classified Out-of-Bag samples is called the "Out-of-Bag Error".
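The "one-third" figure comes from the probability that a particular observation is never drawn into a bootstrapped dataset of size n, which is (1 - 1/n)^n and approaches e^-1 ≈ 0.368 as n grows. A quick numerical check:

# Probability that a given observation never appears in a bootstrapped
# dataset of the same size n as the original.
for n in (10, 100, 1000, 10000):
    print(n, round((1 - 1 / n) ** n, 4))
# 0.3487, 0.366, 0.3677, 0.3679 -> roughly one-third of the data stays "out of bag"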
Now that we know most of the essentials of this algorithm, let's quickly look at one more important capability it provides, i.e.,
Feature importance is another reason for using Random Forest. Among feature-selection techniques, embedded methods are known for their strong performance, and an embedded method requires an algorithm that can compute feature importance. Random Forest does this well, determining relative feature importance more naturally than linear models.
Feature importance describes how strongly each independent feature influences the target class by assigning each feature a relative importance score.
There are many, but let's list some of the important ones:
Let's get familiar with the Hyperparameters of Random Forest, which we will need for tuning the performance of the Random Forest algorithm.
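As a reference, here is a sketch of the most commonly tuned hyper-parameters as they appear in scikit-learn's RandomForestClassifier (the values shown are the library defaults or illustrative choices, not recommendations):

from sklearn.ensemble import RandomForestClassifier

rf = RandomForestClassifier(
    n_estimators=100,      # number of decision trees in the forest
    max_depth=None,        # maximum depth of each tree (None = grow fully)
    max_features="sqrt",   # size of the random feature subset tried at each split
    min_samples_split=2,   # minimum samples required to split an internal node
    min_samples_leaf=1,    # minimum samples required at a leaf node
    bootstrap=True,        # train each tree on a bootstrapped sample
    oob_score=True,        # evaluate the model on the Out-of-Bag samples
    n_jobs=-1,             # build trees in parallel on all CPU cores
    random_state=42        # for reproducible results
)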
Enough theory. Let's see Random Forest in action!
For demonstration purposes, we will use the Pima Indian Diabetes Dataset, a supervised binary-classification dataset that labels patients as diabetic or non-diabetic.
# Importing libraries
import graphviz
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from sklearn import tree
from sklearn.model_selection import GridSearchCV
from sklearn.ensemble import RandomForestClassifier
# Loading Data
df = pd.read_csv("diabetes.csv")
df.head(5)
# Separating the independent features (X) and the target column (y)
X = df.drop('Outcome', axis=1)
y = df['Outcome']
# Initialize Base-Line Random Forest Model
classifier = RandomForestClassifier(random_state=90, oob_score=True)
# Fitting data over the baseline model
classifier.fit(X, y)
# Evaluate the performance over the Out-of-Bag dataset
classifier.oob_score_
# 0.7734375
We have achieved 77.34% accuracy without any tuning! Let's see how far we can improve this using hyper-parameter tuning.
# Define the parameter grid
params = {
    'max_depth': [15, 20, 25],
    'max_features': ['auto', 'sqrt'],  # 'auto' equals 'sqrt' for classifiers; it is removed in newer scikit-learn
    'min_samples_split': [10, 20, 25],
    'min_samples_leaf': [5, 10],
    'n_estimators': [10, 25, 30]
}
# Initialize the Grid Search with accuracy metrics
grid_search = GridSearchCV(estimator=classifier,
                           param_grid=params,
                           cv=5,
                           scoring="accuracy")
# Fitting 5 Folds for each of 108 candidates, total 540 fits
# Fit the grid search to the data
grid_search.fit(X, y)
# Let's check the score
grid_search.best_score_
# 0.783868
# 1.35% improvement
Accuracy has slightly improved! The tuned model's accuracy is about 1.35% higher (in relative terms) than the baseline model's: 78.39% vs. 77.34%.
# Let's check the parameters of our best model
best_model = grid_search.best_estimator_
print(best_model)
# RandomForestClassifier(max_depth=15, min_samples_leaf=10,
# min_samples_split=25, n_estimators=25, random_state=90)
# Visualize a single decision tree from the forest, limited to depth = 4 for readability
dot_data = tree.export_graphviz(best_model.estimators_[0],
                                out_file=None,
                                max_depth=4,
                                feature_names=X.columns,
                                class_names=["Non-Diabetic", "Diabetic"],
                                filled=True)
graph = graphviz.Source(dot_data, format="png")
graph
# Plotting the relative feature importance as per the tuned model
feature_importance = best_model.feature_importances_
feature_importance = 100.0 * (feature_importance / feature_importance.max())
sorted_idx = np.argsort(feature_importance)   # order features from least to most important
pos = np.arange(sorted_idx.shape[0]) + .5
plt.figure(figsize=(10,12))
plt.barh(pos, feature_importance[sorted_idx], align='center')
plt.yticks(pos, X.columns[sorted_idx])
plt.xlabel('Relative Importance')
plt.title('Variable Importance')
plt.show()
Random Forest is one of the most popular and best-loved algorithms in Machine Learning. A wide range of applications is built on top of it, and it is hence considered one of the important topics asked about in machine learning interviews.
In this blog, we developed an understanding of Random Forest and how it works. We also learned how Bagging works and how it relates to the Random Forest algorithm. Finally, we covered model building, evaluation, the hyper-parameters involved, and finding important features with the help of Scikit-learn. We hope you enjoyed the article.
Next Blog: Random forest industrial application: How Uber uses Machine Learning?