Random Forest, a.k.a. Random Decision Trees, is a supervised learning algorithm in machine learning that can solve both classification and regression problems. It belongs to the family of CART (Classification and Regression Trees) algorithms: it combines the predictions of multiple decision trees and provides the best possible output. It can achieve good accuracy even on simple datasets and is hence very popular in communities offering data science competitions. In this article, we will discuss the Random Forest algorithm in detail.
So let's start without any further delay.
Random Forest leverages the power of Decision Trees, which are also its building blocks. The later part of this discussion uses the basics of Decision Trees, so we recommend looking at our Decision Tree Algorithm blog to familiarize yourself with them.
Decision trees often perform well on training data but poorly on the testing dataset. In other words, decision trees are prone to overfitting, especially when a tree is particularly deep. Hence, a single Decision Tree is not the best fit for complex real-life problems. One intuitive way to tackle this problem is to build multiple decision trees, train them, and then make a conclusive decision based on all the trees' predictions. That is exactly what Random Forest does for us.
Random forest is a flexible, easy-to-use supervised machine learning algorithm that falls under the Ensemble learning approach. It strategically combines multiple decision trees (a.k.a. weak learners) to solve a particular computational problem. If we talk about all the ensemble approaches in machine learning, the two most popular ensemble methods are Bagging and Boosting. To understand the Random Forest, we require the Bagging approach. So, let's learn about it first.
Bagging, also called Bootstrap Aggregating, is an ensemble technique in machine learning designed to improve the stability and accuracy of machine-learning algorithms. It helps reduce overfitting by lowering the variance of the output. To understand how Bagging works, let's first understand what bootstrapping is and how it works.
Bootstrapping is a statistical technique used for data resampling. It involves repeatedly resampling a dataset with replacement, a phrase that appears in almost every definition of bootstrapping. The objective is to create multiple training datasets by drawing random samples from the original training set. Ordinarily, once a sample is selected in a random trial, it is removed from the pool for subsequent trials. In bootstrapping, we do not do that: an already-selected sample keeps the same probability of being selected in subsequent trials, which is why we call it resampling a dataset "with replacement".
To understand this better, take the example of a bag containing 5 balls (1 Red, 1 Blue, 1 Pink, 1 Brown, 1 Purple). We pick a random ball from the bag and note its color. We then put the ball back into the same bag, so the probability of picking any color remains the same on the next draw.
In bootstrapping, we create multiple datasets by selecting random samples from the original training set. A single observation might appear multiple times in a newly formed dataset, also called a bootstrapped dataset. The number of observations in each bootstrapped dataset equals the number of observations in the original training set. These bootstrapped training datasets are then used to train the weak learners. This technique helps reduce the variance of the predictions, vastly improving the overall predictive performance.
Complete Dataset:        X1  X2  X3  X4  X5
Bootstrapped Dataset 1:  X3  X1  X3  X3  X5
Bootstrapped Dataset 2:  X5  X5  X3  X1  X2
Bootstrapped Dataset 3:  X5  X5  X1  X2  X1
Bootstrapped Dataset 4:  X3  X1  X3  X3  X5
Bootstrapped Dataset 5:  X4  X4  X4  X4  X1
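The idea behind the table above can be reproduced in a few lines of code. Below is a minimal numpy sketch, using hypothetical observation labels X1 to X5, of drawing bootstrapped datasets with replacement (your random draws will differ):

import numpy as np

rng = np.random.default_rng(seed=0)
original_dataset = np.array(["X1", "X2", "X3", "X4", "X5"])   # the complete dataset

# Each bootstrapped dataset is the same size as the original and is drawn
# with replacement, so the same observation can appear more than once.
for i in range(5):
    bootstrapped = rng.choice(original_dataset, size=len(original_dataset), replace=True)
    print(f"Bootstrapped Dataset {i + 1}:", bootstrapped)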
Now that we have enough understanding of bootstrapping, let's continue with Bagging. Bagging comprises the following three steps, which are more or less common to Random Forest:
In Random Forest, we apply the same general Bagging technique using Decision Trees as the weak learners, along with one extra modification. Let's learn about this modification, which differentiates Random Forest from the general Bagging approach.
The fundamental idea behind a Random Forest is to combine the predictions made by many decision trees into a single model. Predictions made by individual decision trees may be inaccurate on their own, but when combined, the aggregate prediction tends to land closer to the actual value for the classification or regression problem at hand.
Random Forest combines the simplicity of Decision Trees with flexibility, resulting in vastly improved accuracy. But the question remains: what exactly makes Random Forest different from plain Bagging?
When training any tree-based algorithm, we split a node by selecting the best feature among all the features present in the training dataset. In Random Forest, we instead randomly select only a subset of features from the total features, and the best splitting feature from that subset is used to split each node in a tree. In Bagging, by contrast, all features are considered when splitting a node; Random Forest is therefore a natural extension of Bagging. For example, if a dataset has ten different features, the trees in a random forest might consider only 3 of the 10 features at each split.
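To make the contrast concrete, here is a minimal scikit-learn sketch (the parameter values are illustrative, not tuned) comparing plain Bagging of decision trees, where every split may consider all features, with a Random Forest, where each split considers only a random subset:

from sklearn.ensemble import BaggingClassifier, RandomForestClassifier

# Plain Bagging: the default base learner is a decision tree, and every
# split may evaluate all available features.
bagging = BaggingClassifier(n_estimators=100, random_state=0)

# Random Forest: every split evaluates only a random subset of features
# (here, the square root of the total number of features).
forest = RandomForestClassifier(n_estimators=100, max_features="sqrt", random_state=0)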
Multiple bootstrapped datasets are created by drawing samples randomly with replacement from the original dataset. The same observation can appear more than once, and each bootstrapped dataset contains the same number of observations as the original dataset.
bootstrap = True
# When bootstrap=False, every tree is trained on the whole original
# dataset instead of a random bootstrapped sample.
Build decision trees over the bootstrapped datasets formed in the previous step, but let these decision trees consider only a random subset of features at each split. Generally, the size of this subset is the square root of the total number of features in the original dataset, though it can be tuned for optimal performance. If there are 36 independent features in the training dataset, Random Forest will randomly select six of them at each split while building the decision trees. A variant based on the log of the number of features is also popular in industry. These choices are popular because they have repeatedly worked well in experiments across many datasets.
max_features = sqrt(n_features)
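As a quick sanity check of these two heuristics, here is a tiny sketch for a hypothetical dataset with 36 features (the log variant is shown with base 2, as in scikit-learn's 'log2' option):

import numpy as np

n_features = 36
print(int(np.sqrt(n_features)))    # sqrt rule -> 6 features per split
print(int(np.log2(n_features)))    # log2 rule -> 5 features per split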
Finally, new samples are fed through the trained decision trees, and every tree makes its own prediction. For classification, majority voting among the trees determines the new sample's final class. For a regression task, the final prediction is calculated by averaging the predictions of the individual decision trees.
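Here is a tiny numpy sketch of this aggregation step, using hypothetical per-tree outputs for a single new sample:

import numpy as np

class_votes = np.array([1, 0, 1, 1, 0])            # classification: each tree votes for a class
reg_preds = np.array([3.2, 2.9, 3.5, 3.1, 3.0])    # regression: each tree predicts a value

final_class = np.bincount(class_votes).argmax()    # majority vote -> class 1
final_value = reg_preds.mean()                     # average of tree predictions -> 3.14
print(final_class, final_value)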
The figure below illustrates how the Random Forest algorithm works.
Statistically, about one-third of the original data never appears in a given bootstrapped dataset. This left-out portion is known as the Out-of-Bag dataset.
Not all the data points from the original dataset appear in a given bootstrapped dataset; those that are left out are collectively known as the Out-of-Bag dataset and can be used to test the Random Forest's accuracy. We measure the accuracy of our Random Forest model as the proportion of Out-of-Bag samples it classifies correctly. Conversely, the proportion of incorrectly classified Out-of-Bag samples is called the "Out-of-Bag Error".
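The "one-third" figure comes from the probability that a particular observation is never drawn into a bootstrapped dataset of size n, which is (1 - 1/n)^n and approaches e^-1 ≈ 0.368 as n grows. A quick numerical check:

# Probability that a given observation never appears in a bootstrapped
# dataset of the same size n as the original.
for n in (10, 100, 1000, 10000):
    print(n, round((1 - 1 / n) ** n, 4))
# 0.3487, 0.366, 0.3677, 0.3679 -> roughly one-third of the data stays "out of bag"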
Now that we know most of the essentials of this algorithm, let's quickly look at one more important capability it provides, i.e.,
Feature importance is another reason for using Random Forest. Among feature-selection techniques, embedded methods are known for their strong performance, and an embedded method requires an algorithm that can compute feature importance. Random Forest does this well, determining relative feature importance more naturally than linear models.
Feature importance describes how strongly each independent feature influences the target class by assigning each feature a relative importance score.
There are many, but let's list some of the important ones:
Let's get familiar with the Hyperparameters of Random Forest, which we will need for tuning the performance of the Random Forest algorithm.
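As a reference, here is a sketch of the most commonly tuned hyper-parameters as they appear in scikit-learn's RandomForestClassifier (the values shown are the library defaults or illustrative choices, not recommendations):

from sklearn.ensemble import RandomForestClassifier

rf = RandomForestClassifier(
    n_estimators=100,      # number of decision trees in the forest
    max_depth=None,        # maximum depth of each tree (None = grow fully)
    max_features="sqrt",   # size of the random feature subset tried at each split
    min_samples_split=2,   # minimum samples required to split an internal node
    min_samples_leaf=1,    # minimum samples required at a leaf node
    bootstrap=True,        # train each tree on a bootstrapped sample
    oob_score=True,        # evaluate the model on the Out-of-Bag samples
    n_jobs=-1,             # build trees in parallel on all CPU cores
    random_state=42        # for reproducible results
)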
Enough theory. Let's see Random Forest in action!
For demonstration purposes, we will use the Pima Indian Diabetes Dataset, a supervised binary-classification dataset that labels patients as diabetic or non-diabetic.
# Importing libraries
import graphviz
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from sklearn import tree
from sklearn.model_selection import GridSearchCV
from sklearn.ensemble import RandomForestClassifier
# Loading Data
df = pd.read_csv("diabetes.csv")
df.head(5)
# Separating the independent features (X) and the target column (y)
X = df.drop('Outcome', axis=1)
y = df['Outcome']
# Initialize Base-Line Random Forest Model
classifier = RandomForestClassifier(random_state=90, oob_score=True)
# Fitting data over the baseline model
classifier.fit(X, y)
# Evaluate the performance over the Out-of-Bag dataset
classifier.oob_score_
# 0.7734375
We have achieved 77.34% accuracy without any tuning! Let's see how far we can improve this using hyper-parameter tuning.
# Define the parameter grid
params = {
    'max_depth': [15, 20, 25],
    'max_features': ['auto', 'sqrt'],  # 'auto' equals 'sqrt' for classifiers; it is removed in newer scikit-learn
    'min_samples_split': [10, 20, 25],
    'min_samples_leaf': [5, 10],
    'n_estimators': [10, 25, 30]
}
# Initialize the Grid Search with accuracy metrics
grid_search = GridSearchCV(estimator=classifier,
                           param_grid=params,
                           cv=5,
                           scoring="accuracy")
# Fitting 5 Folds for each of 108 candidates, total 540 fits
# Fit the grid search to the data
grid_search.fit(X, y)
# Let's check the score
grid_search.best_score_
# 0.783868
# 1.35% improvement
Accuracy has slightly improved! The tuned model's accuracy is about 1.35% higher (in relative terms) than the baseline model's: 78.39% vs. 77.34%.
# Let's check the parameters of our best model
best_model = grid_search.best_estimator_
print(best_model)
# RandomForestClassifier(max_depth=15, min_samples_leaf=10,
# min_samples_split=25, n_estimators=25, random_state=90)
# Visualize a single decision tree from the forest, limited to depth = 4 for readability
dot_data = tree.export_graphviz(best_model.estimators_[0],
                                out_file=None,
                                max_depth=4,
                                feature_names=X.columns,
                                class_names=["Non-Diabetic", "Diabetic"],
                                filled=True)
graph = graphviz.Source(dot_data, format="png")
graph
# Plotting the relative feature importance as per the tuned model
feature_importance = best_model.feature_importances_
feature_importance = 100.0 * (feature_importance / feature_importance.max())
sorted_idx = np.argsort(feature_importance)   # order features from least to most important
pos = np.arange(sorted_idx.shape[0]) + .5
plt.figure(figsize=(10,12))
plt.barh(pos, feature_importance[sorted_idx], align='center')
plt.yticks(pos, X.columns[sorted_idx])
plt.xlabel('Relative Importance')
plt.title('Variable Importance')
plt.show()
Random Forest is one of the most popular and best-loved algorithms in Machine Learning. A wide range of applications is built on top of it, and it is hence considered one of the important topics asked about in machine learning interviews.
In this blog, we developed an understanding of Random Forest and how it works. We also learned how Bagging works and how it relates to the Random Forest algorithm. Finally, we covered model building, evaluation, the hyper-parameters involved, and finding important features with the help of Scikit-learn. We hope you enjoyed the article.
Next Blog: Random forest industrial application: How Uber uses Machine Learning?