Tree-based algorithms are a crucial part of Machine Learning. We can use the Decision Tree algorithm to automate the creation of flowcharts for making predictions, which can be either categorical or continuous numerical values. However, relying on a single decision tree is risky because it is prone to overfitting. To overcome this issue, we use an ensemble approach: we build multiple decision trees and combine their decisions. There are two main types of ensemble approaches: Bagging and Boosting.
After completing this blog, we will understand the following concepts:
- What bagging is and how boosting differs from it
- The working principle of boosting through its pseudo code
- The AdaBoost and Gradient Boosting variants of boosting
- Improvements that reduce overfitting in Gradient Boosting
- Possible interview questions on this topic
So let's begin our journey with a quick revision of the bagging concept.
Using the bootstrap approach, we create multiple copies of the original training dataset, fit different decision trees on these copies, and then combine their predictions to form a single predictive model. All these decision trees are treated independently.
Random Forest is a variation of Bagging in which, instead of using all the features, we use only a random subsample of the features present in the data to build each decision tree on its bootstrapped dataset.
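As a quick, hedged refresher, here is a minimal sketch of the two ideas, assuming scikit-learn is available (the dataset and parameter values below are our own illustrative choices, not part of the original discussion):

# Bagging vs. Random Forest refresher (illustrative sketch).
from sklearn.datasets import make_regression
from sklearn.ensemble import BaggingRegressor, RandomForestRegressor

X, y = make_regression(n_samples=500, n_features=20, noise=10, random_state=0)

# Bagging: each tree is fit on a bootstrapped copy and sees all 20 features
bagging = BaggingRegressor(n_estimators=100, random_state=0).fit(X, y)

# Random Forest: bootstrapped copies plus a random subsample of features per split
forest = RandomForestRegressor(n_estimators=100, max_features="sqrt", random_state=0).fit(X, y)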
Boosting is similar to Bagging in terms of outcome: we ensemble the predictions made by all the decision trees. The difference is that this process happens sequentially; every decision tree is grown using the information from the previously grown trees. Boosting does not involve bootstrapped data; instead, each tree is fit on a modified form of the original dataset, where the modification is based on the information present in the previously grown trees. So we can say that in boosting, the trees are not independent.
Now that we know what boosting is, let's walk through its pseudo code to grasp its working principle.
Let's say we are trying to solve a regression problem using boosting. Suppose we want to grow M trees sequentially (one after the other). The function we want to learn is f(x), and we have n samples of supervised data with input variables X and target variable y. Now, let's define the step-wise process.
# Initializing the function that we want to learn
fx = 0

# Initializing the learning residual for every sample. We will use this
# variable to keep track of what has already been learnt and what we want
# to learn next.
rk = yk   # for every k in [1, n], as we have n samples
In the boosting process, we grow trees sequentially so that newer trees do not waste effort re-learning what the previously grown trees have already learned. This makes the overall process faster and smoother.
for i in range(0, M):
    # Restrict the growth of the tree by limiting it to d splits
    fit DT(i) with only d splits (d + 1 terminal nodes) on (X, r)

    # As boosting is an ensemble process, we need to combine the predictions
    # with a certain weightage. That weightage is λ.
    update fx as: fx := fx + λ*DT(i)

    # As a property of boosting, we reuse the learning already done by the
    # previous DTs. Hence each new DT only needs to learn the things that the
    # previous DTs failed to learn.
    update the learning residuals as: rk := rk - λ*DT(i)(Xk)
We generally use smaller trees in the boosting approach. Their size can be controlled using the “d” parameter in the above code. If the number of allowed splits is d, the number of terminal nodes (also called leaf nodes) will be d + 1. How?
After every split, we break the parent node into two child nodes. So in the figure above, we have performed four splits: one at the root node (blue) and three at the decision nodes (orange). This resulted in 5 terminal nodes (green).
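To see the splits-versus-leaves relationship in code, here is a small illustrative sketch (assuming scikit-learn; the dataset and numbers are our own choices): a regression tree grown with max_leaf_nodes = d + 1 performs exactly d splits.

# Checking the "d splits -> d + 1 terminal nodes" relationship (illustrative sketch).
from sklearn.datasets import make_regression
from sklearn.tree import DecisionTreeRegressor

X, y = make_regression(n_samples=200, n_features=5, random_state=0)

d = 4  # number of splits we allow
tree = DecisionTreeRegressor(max_leaf_nodes=d + 1, random_state=0).fit(X, y)

print(tree.get_n_leaves())    # 5 terminal (leaf) nodes
print(tree.tree_.node_count)  # 9 nodes in total: d internal nodes + (d + 1) leaves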
In the final step of training one DT, we need to check what it has already learned so that we can fit the next DT on whatever remains to be learned. In statistical learning, approaches that learn slowly tend to perform well. Hence, the term λ is introduced as a hyperparameter that slows down the residual update process, so that many diverse decision trees get a chance to capture the parts of the data that are still unlearned.
# Final output of the boosted model:
fx = ∑ λ*DT(i)   # summed over all M trees
Looking at this pseudo code, we can see that the DTs formed are not independent: the growth of any tree strongly depends on the trees grown before it.
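The pseudo code above translates almost line by line into Python. Below is a minimal runnable sketch, assuming scikit-learn's DecisionTreeRegressor as the weak learner; the variable names M, d and lam mirror the pseudo code, and the dataset and values are illustrative only:

# Runnable sketch of the boosting pseudo code (illustrative, not production code).
import numpy as np
from sklearn.datasets import make_regression
from sklearn.tree import DecisionTreeRegressor

X, y = make_regression(n_samples=500, n_features=10, noise=5, random_state=0)

M = 200     # number of trees grown sequentially
d = 3       # splits allowed per tree -> d + 1 terminal nodes
lam = 0.1   # shrinkage parameter (lambda)

trees = []
r = y.copy()   # learning residuals, initialised to y because f(x) starts at 0

for i in range(M):
    # fit a small tree on the current residuals, not on the raw targets
    tree = DecisionTreeRegressor(max_leaf_nodes=d + 1, random_state=i).fit(X, r)
    trees.append(tree)
    # shrink the tree's contribution and update the residuals
    r = r - lam * tree.predict(X)

def boosted_predict(X_new):
    # final model: f(x) = sum over all trees of lambda * DT(i)(x)
    return sum(lam * tree.predict(X_new) for tree in trees)

print(np.mean((y - boosted_predict(X)) ** 2))  # training MSE of the boosted model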
If we look into the pseudo code of boosting, we have three tunable parameters that can affect the performance of our boosting algorithms. These are:
1. The number of trees, M
2. The shrinkage parameter, λ, which controls the rate at which boosting learns
3. The number of splits in each tree, d, which controls the complexity of the weak learners
There are three prevalent variants of boosting algorithms:
1. AdaBoost (Adaptive Boosting)
2. Gradient Boosting
3. XGBoost (Extreme Gradient Boosting)
In this article, we will keep our discussion limited to AdaBoost and Gradient Boosting. XGBoost will be covered in a separate blog.
AdaBoost is a boosting algorithm; hence, it works on the same principle discussed above. There are two significant updates from the basic variant of boosting:
1. Instead of fitting each new tree on the residuals, AdaBoost assigns weights to the training samples and updates these weights after every tree.
2. Instead of a single fixed shrinkage λ, each tree receives its own weight in the final ensemble based on how accurately it performed.
In AdaBoost, we first assign equal weightage to all the training samples while training the first Decision Tree. After that, the algorithm evaluates the tree and assigns higher weightage to the samples on which the earlier DTs struggled to perform well, so that the next tree focuses more on them.
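As a concrete, hedged illustration, here is a minimal sketch using scikit-learn's AdaBoostClassifier, which by default boosts decision stumps (the dataset and parameter values are our own choices):

# Minimal AdaBoost sketch: the library re-weights the training samples after every tree.
from sklearn.datasets import make_classification
from sklearn.ensemble import AdaBoostClassifier
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=1000, n_features=20, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

ada = AdaBoostClassifier(n_estimators=100, learning_rate=0.5, random_state=0)
ada.fit(X_train, y_train)

print(ada.score(X_test, y_test))  # accuracy on held-out data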
Gradient Boosting is another variant of the boosting algorithm, where we use gradient descent to add the weak learners in a stage-wise manner so that the overall loss/error in prediction keeps reducing. Let's understand it in more detail.
There are mainly three components of this algorithm (illustrated in the short sketch below):
1. A loss function to be optimised (for example, squared error for regression)
2. Weak learners, i.e., small decision trees, to make the predictions
3. An additive model that adds the weak learners one at a time so that the loss keeps decreasing
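To make these components concrete, here is a minimal sketch assuming scikit-learn's GradientBoostingRegressor; the loss, tree size and number of trees map onto the three components above, and all values are illustrative:

# Gradient Boosting sketch: a loss function, small trees as weak learners,
# and an additive model that adds them stage-wise (illustrative only).
from sklearn.datasets import make_regression
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.model_selection import train_test_split

X, y = make_regression(n_samples=1000, n_features=10, noise=10, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

gbr = GradientBoostingRegressor(
    loss="squared_error",  # 1. the loss function to be optimised
    max_depth=3,           # 2. weak learners: shallow regression trees
    n_estimators=300,      # 3. additive model: trees added one at a time
    random_state=0,
)
gbr.fit(X_train, y_train)
print(gbr.score(X_test, y_test))  # R^2 on held-out data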
We always add a fixed number of DTs, as boosting techniques are highly prone to overfitting. But at the same time, we need more accurate models, which demand a larger number of trees. Hence, certain improvements have been made to the basic Gradient Boosting mechanism, so let's look at them.
Gradient Boosting is a greedy algorithm and can suffer from overfitting, and we already know a technique that can tackle this problem: regularization. We penalize the individual steps and improve performance by reducing overfitting. Apart from regularization, there are some other techniques as well, so let's understand them.
Our weak learners must capture the patterns in the data, but at the same time, we want to keep them weak. Hence, we need to control the construction of these weak learners, i.e., the trees. If we do not constrain their construction, fewer trees are needed, but the learning will not be as accurate as an ensemble of many constrained weak learners.
There can be multiple ways to do so (a small sketch follows the list):
- Limiting the depth of each tree
- Limiting the number of terminal nodes (leaves) in each tree
- Requiring a minimum number of samples in a node before it can be split
- Requiring a minimum improvement in the loss for a split to be made
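A hedged sketch of how such constraints typically look in scikit-learn's gradient boosting implementation (the exact values are arbitrary):

# Constraining the weak learners so that each tree stays small and weak.
from sklearn.ensemble import GradientBoostingRegressor

constrained_gbr = GradientBoostingRegressor(
    n_estimators=500,            # total number of trees in the ensemble
    max_depth=2,                 # limit the depth of each tree
    min_samples_split=20,        # a node needs enough samples before it is split
    min_impurity_decrease=1e-3,  # a split must improve the loss by a minimum amount
    random_state=0,
)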
We add the prediction of every DT sequentially, in a stage-wise manner. Instead of adding the DTs' predictions directly, we can slow down the learning by multiplying them with shrinkage parameters. This allows us to add more trees, making the predictions more accurate.
Y = alpha*DT1 + beta*DT2 + gamma*DT3 + ... + error
where alpha, beta and gamma are shrinkage parameters.
We prefer to keep these values small, such as 0.1 or 0.01.
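In most libraries, this shrinkage appears as a single learning rate applied to every tree rather than a separate alpha, beta and gamma per tree. A minimal sketch assuming scikit-learn, with illustrative values:

# Shrinkage: a small learning_rate scales down every tree's contribution,
# which usually calls for a larger number of trees.
from sklearn.ensemble import GradientBoostingRegressor

slow_gbr = GradientBoostingRegressor(
    learning_rate=0.01,  # shrinkage parameter, kept small (e.g. 0.1 or 0.01)
    n_estimators=2000,   # many small, slow steps instead of a few large ones
    max_depth=3,
    random_state=0,
)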
If we recall the Bagging and Random Forest process, we build different trees on different subsamples of the dataset so that the trees are uncorrelated. If we do the same in boosting, the variant is called Stochastic Gradient Boosting.
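A minimal sketch of this idea, assuming scikit-learn (the values are illustrative): the subsample parameter makes each tree see only a random fraction of the rows, and max_features additionally subsamples the features considered at each split.

# Stochastic Gradient Boosting: row and feature subsampling de-correlate the trees.
from sklearn.ensemble import GradientBoostingRegressor

stochastic_gbr = GradientBoostingRegressor(
    subsample=0.5,        # each tree is fit on 50% of the rows (without replacement)
    max_features="sqrt",  # each split considers only a random subset of features
    n_estimators=500,
    learning_rate=0.05,
    random_state=0,
)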
This additional constraint can be applied to parametrized trees along with the structural limits. Instead of traditional DTs, a modified form called regression trees is used as the weak learners: the terminal nodes of these trees store numerical values, commonly referred to as weights. These weight values can be regularized using either L1 or L2 regularization techniques. We will cover the implementation in the next blog.
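As a brief, hedged preview of that implementation, libraries such as XGBoost expose these penalties directly; the sketch below assumes the xgboost package is installed and uses arbitrary values:

# Penalised leaf weights: L1 (reg_alpha) and L2 (reg_lambda) terms shrink the
# numerical values stored in the terminal nodes (preview sketch only).
from xgboost import XGBRegressor

xgb = XGBRegressor(
    n_estimators=500,
    learning_rate=0.05,
    max_depth=3,
    reg_alpha=0.1,   # L1 penalty on the leaf weights
    reg_lambda=1.0,  # L2 penalty on the leaf weights
)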
Boosting is a sophisticated and advanced algorithm, and knowledge of it is always a plus. The following questions can be framed on this topic:
In this article, we learned about one of the advanced algorithms in Machine Learning: Boosting. It is considered a building block for more sophisticated algorithms like XGBoost, one of the most used algorithms in Kaggle competitions, which gives accuracy almost comparable to neural networks. We saw the boosting variants AdaBoost and Gradient Boosting and the possible improvements that can be made to them. We hope you enjoyed the article.
Next Blog: Introduction to Random Forest in Machine Learning