The subscription market has become more challenging with the increase in available options. OTT (over-the-top) platforms such as Netflix, Prime Video, Hotstar, and Zee5 are all competing to capture the largest possible market share. They can do this by retaining existing customers and attracting new ones. With advances in Data Science and Machine Learning, companies can now predict which customers are likely to cancel their subscriptions.
Companies spend more money on retaining existing customers than on acquiring new ones. Hence, they invest heavily in identifying which customers might stop using their products or services; the term for this is customer churn. Besides running customer retention campaigns and product publicity programs, churn prediction is a common practice among companies operating in competitive markets.
In this article, we will build a decision-tree-based churn prediction model for an investment bank so that it can retain more customers.
First, let's understand how companies lose customers.
Suppose Jack has been a customer of bank X for more than ten years. With inflation rising quickly and the bank's interest rates growing slowly, he is worried about his savings. He is also disappointed by the outdated product designs and the tedious application process for new services. So Jack decides to move his account to another bank.
A report by Sells Agency states that 20% of customers leave their banks due to poor in-bank experience and tiring procedures, and 3.6% leave their banks for new products and services.
Industries like airline ticketing, telecommunications, OTT platforms, banking, and finance all need to retain customers. These are competitive markets in which companies constantly hustle to grab the largest share. To accomplish this, they continuously collect customer data, build ML models, and learn customer behaviour. One such model is the customer churn prediction model we are building in this blog.
Understanding the problem statement is the first step of model building. Here we are given a dataset of bank customers, and we want to build a churn prediction model that classifies customers into two classes: potential churners and non-churners. This classification problem is solved via a supervised learning approach, meaning we have labelled input and output samples. The model should be precise, as its predictions might affect the bank's future investments. Now, let's study the dataset in detail.
The dataset used to build the model is available on Kaggle and can be downloaded from "https://www.kaggle.com/datasets/mathchi/churn-for-bank-customers". It has 14 columns/features and 10k rows/samples.
Let's see the effect of these features on the model's prediction.
Other columns are RowNumber, CustomerId, Surname, Gender, and Age, having their usual meaning.
Let's look at some rows of the Dataset and the different datatypes used to store them.
import pandas as pd

# Load the bank-customer churn dataset (the path is local to the author's machine)
df = pd.read_csv(r"N:\Machine learning\Algorithms\churn.csv")

# Preview the first five rows and the column datatypes
df.head()
df.info()
Here we can infer that the columns RowNumber, CustomerId, and Surname only identify customers uniquely and do not affect the target variable, so we can drop them. The other observation is that we have two more categorical columns, Geography and Gender, which we will encode using the LabelEncoder() provided by Python's sklearn library.
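A minimal sketch of these two steps, assuming df is the DataFrame loaded above:

from sklearn.preprocessing import LabelEncoder

# Drop identifier columns that do not influence the target variable
df = df.drop(columns=['RowNumber', 'CustomerId', 'Surname'])

# Encode the categorical columns Geography and Gender as integers
le = LabelEncoder()
for col in ['Geography', 'Gender']:
    df[col] = le.fit_transform(df[col])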
Now let's jump to the data preprocessing steps and clean the data.
Data preprocessing is the process of cleaning the dataset before feeding it to the model. There are some standard preprocessing steps we must follow.
# Check for missing values in each column
df.isnull().sum()

# Check for duplicate rows
df.duplicated().sum()
'''
RowNumber 0
CustomerId 0
Surname 0
CreditScore 0
Geography 0
Gender 0
Age 0
Tenure 0
Balance 0
NumOfProducts 0
HasCrCard 0
IsActiveMember 0
EstimatedSalary 0
'''
There are no missing or duplicate values in any column, so we do not need to drop anything. Otherwise, we could use df.dropna() to drop rows with missing values and df.drop_duplicates() to remove duplicate rows.
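For completeness, a quick sketch of what that cleanup would look like if the checks above had flagged problems:

df = df.dropna()             # drop rows containing missing values
df = df.drop_duplicates()    # drop exact duplicate rows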
Outliers are data samples that lie far away from the rest of the data. They can drastically affect the model's learning and skew its predictions towards them. There are two standard methods to remove outliers: 1. the interquartile range (IQR) and 2. the standard deviation. We will use the IQR method to detect and remove outliers:
import numpy as np

# numcols is the list of numeric feature columns
for i in numcols:
    q75, q25 = np.percentile(df[i], [75, 25])
    iqr = q75 - q25
    min_val = q25 - (iqr * 1.5)
    max_val = q75 + (iqr * 1.5)
    # Keep only the samples that fall inside the IQR fences
    df = df[(df[i] > min_val) & (df[i] < max_val)]
We have removed the outliers from the customer dataset and can validate this with the boxplots below.
import matplotlib.pyplot as plt
import seaborn as sns

fig, ax = plt.subplots(2, 2, figsize=(15, 15))
for i, subplot in zip(numcols, ax.flatten()):
    # color_0_1 is a custom two-colour palette for the Exited classes
    sns.boxplot(x='Exited', y=i, data=df, ax=subplot, palette=color_0_1)
plt.show()
The Age column still has some outliers, but fewer than before preprocessing. The plots show that the data is balanced around the box's centre line, except for the Balance column. Boxplots also help in understanding how the data is distributed around the quartiles.
Feature scaling generally refers to scaling and normalizing the data. We have discussed in this blog that the parameter updates of many models are affected by feature magnitudes, which is why features are usually scaled to the same range. However, tree-based algorithms are immune to feature scaling because different feature magnitudes and variances do not affect them. Therefore, it is unnecessary to scale the data for our model.
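For reference only, if we later tried a scale-sensitive model (e.g., logistic regression or k-NN), a minimal scaling sketch might look like this; it is not required for the decision tree we build below:

from sklearn.preprocessing import StandardScaler

# Transform the numeric columns to zero mean and unit variance
scaler = StandardScaler()
df[numcols] = scaler.fit_transform(df[numcols])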
Now, we will perform exploratory data analysis to understand the trend followed by the data.
This section will present some data visualizations of the bank customer dataset.
A heatmap is a tabular representation of the feature correlation matrix. We study heatmaps to determine how features depend on each other and on the target variable. Observing the heatmap below, the features are largely independent because their correlation values are near zero.
import matplotlib.pyplot as plt
import seaborn as sns
plt.figure(figsize=(8,8))
sns.heatmap(df.corr(), cmap='Blues', annot=True)
plt.show()
The other use of a heatmap is in feature selection. When two features are highly correlated (positively or negatively), we generally keep only one of them and drop the rest. Read more about why we need feature selection here.
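A minimal sketch of this idea, assuming we drop any feature whose absolute correlation with another feature exceeds an arbitrary threshold of 0.9 (not needed for this dataset, since correlations are near zero):

import numpy as np

corr = df.corr().abs()
# Look only at the upper triangle so each feature pair is checked once
upper = corr.where(np.triu(np.ones(corr.shape), k=1).astype(bool))
to_drop = [col for col in upper.columns if (upper[col] > 0.9).any()]
df = df.drop(columns=to_drop)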
Count plots are best for observing how feature values relate to the target variable. We can see that females churn more, and customers with credit cards are less likely to turn their backs on the bank. Similarly, customers who are active with their transactions or bank visits are less likely to leave or change banks.
# categorical_features is a list of three categorical columns, e.g. ['Gender', 'HasCrCard', 'IsActiveMember']
fig, ax = plt.subplots(1, 3, figsize=(15, 15))
for i, subplot in zip(categorical_features, ax.flatten()):
    sns.countplot(x=i, hue='Exited', data=df, ax=subplot, palette=color_0_1)
plt.show()
These observations suggest that banks should spend more money on maintaining the quality of service and retaining their customers. Products like credit cards, loans, tax rebates, and fixed deposits are essential to hook customers and ensure their engagement.
We are ready to build our Customer churn prediction model! But first, we must select which machine learning algorithm is best for churn prediction.
We have a small dataset of 10,000 rows. Thus, we need an algorithm that learns well even with a small training set. A decision tree is the best option because tree-based algorithms are easy to implement, have great explainability, work well with small datasets, and require minimal data preprocessing.
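Before training, we need train and test sets. A minimal sketch, assuming df is the preprocessed DataFrame from above (the 80/20 split ratio and stratification are assumptions):

from sklearn.model_selection import train_test_split

# 'Exited' is the target column; everything else is a feature
X = df.drop(columns=['Exited'])
Y = df['Exited']

# Hold out 20% of the samples for testing, keeping the churn ratio balanced
X_train, X_test, Y_train, Y_test = train_test_split(X, Y, test_size=0.2,
                                                    random_state=1, stratify=Y)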
from sklearn.tree import DecisionTreeClassifier

# Baseline decision tree with default parameters
dtc = DecisionTreeClassifier()
dtc.fit(X_train, Y_train)
eval_classification(dtc)

############### Tuning the parameters ##############
dtc_new = DecisionTreeClassifier(criterion='entropy', min_samples_split=10,
                                 min_samples_leaf=6, max_features='sqrt',
                                 random_state=1)
dtc_new.fit(X_train, Y_train)
eval_classification(dtc_new)
We can import the Decision Tree Classifier from the sklearn library and change the default parameter values.
Other parameters, like max_depth, max_features, and class_weight, can also be tuned. Explore more about them here.
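As a rough sketch of how such tuning could be automated (the grid values below are illustrative assumptions, not tuned results), a grid search over a few of these parameters might look like this:

from sklearn.model_selection import GridSearchCV
from sklearn.tree import DecisionTreeClassifier

# Small, illustrative parameter grid
param_grid = {
    'criterion': ['gini', 'entropy'],
    'max_depth': [4, 6, 8, None],
    'min_samples_leaf': [2, 6, 10],
    'class_weight': [None, 'balanced'],
}

clf = GridSearchCV(DecisionTreeClassifier(random_state=1), param_grid,
                   scoring='f1', cv=5)
clf.fit(X_train, Y_train)
print(clf.best_score_, clf.best_params_)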
Let's evaluate the model's performance before and after tuning the parameters.
An evaluation metric measures how well the model performs on the train and test data. The metric is chosen based on the problem type. For a regression problem, mean squared error, root mean squared error, and R-squared are the most commonly used metrics. The confusion matrix, accuracy, F1 score, and area under the curve (AUC) are the most common metrics for a classification problem. For a detailed explanation of evaluation metrics, hit this link.
A confusion matrix is the best way to see the model's actual predictions. It divides predictions into four categories: true positives, true negatives, false positives, and false negatives. As we can see from the matrix below, the high value in the true-negative cell means that our model is good at predicting non-churners but needs more tuning for better overall performance.
from sklearn.metrics import confusion_matrix

# Confusion matrix of the tuned model's predictions on the test set
cm = confusion_matrix(Y_test, dtc_new.predict(X_test))

plt.figure(figsize=(12, 12))
sns.heatmap(cm, annot=True, linewidths=1, square=True, cmap='Blues_r', fmt='0.4g')
plt.ylabel('Actual label')
plt.xlabel('Predicted label')
plt.show()
# Record the best cross-validation score and parameters from the hyperparameter
# search ('scores' is a list, 'model_name' a string, and 'clf' a fitted GridSearchCV)
scores.append({
    'model': model_name,
    'best_score': clf.best_score_,
    'best_params': clf.best_params_
})
We try to optimize model performance by changing the classification threshold and deciding which value fits the model best. Imagine plotting a confusion matrix each time we vary the threshold; it would result in a lot of confusion. So we use the ROC curve instead. Considering all threshold values, it plots the true positive rate against the false positive rate. The optimum threshold is the point with the highest true positive rate and the lowest false positive rate. For more details, check out this blog.
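A minimal sketch of plotting the ROC curve for our tuned tree, assuming dtc_new, X_test, Y_test, and plt from the code above:

from sklearn.metrics import roc_curve, roc_auc_score

# Predicted probabilities for the positive (churn) class
y_prob = dtc_new.predict_proba(X_test)[:, 1]

fpr, tpr, thresholds = roc_curve(Y_test, y_prob)
print("AUC:", roc_auc_score(Y_test, y_prob))

plt.plot(fpr, tpr, label='Decision tree')
plt.plot([0, 1], [0, 1], linestyle='--', label='Random guess')
plt.xlabel('False positive rate')
plt.ylabel('True positive rate')
plt.legend()
plt.show()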
Other evaluation metrics results are also listed below.
from sklearn.metrics import accuracy_score, f1_score

def eval_classification(model):
    # Predictions and class probabilities on the test and train sets
    y_pred = model.predict(X_test)
    y_pred_train = model.predict(X_train)
    y_pred_prob = model.predict_proba(X_test)
    y_pred_prob_train = model.predict_proba(X_train)
    print("Accuracy (Test Set): %.2f" % accuracy_score(Y_test, y_pred))
    print("F1-Score (Test Set): %.2f" % f1_score(Y_test, y_pred))
Model performance before tuning the parameters:
Accuracy (Test Set): 0.78
F1-Score (Test Set): 0.47

Model performance after tuning the parameters:
Accuracy (Test Set): 0.83
F1-Score (Test Set): 0.52
These results could be better, but they serve as a benchmark to start with. The path from good to best is long and complicated, marked by many research papers claiming to improve accuracy by 1% or more. Let's look at one research paper that uses an advanced evaluation metric for profit-driven decision trees.
A churn prediction model should follow a profit-driven approach. This means that the cost added by wrong predictions and the cost of retaining a customer should both be considered when predicting the output. The aim of a churn model is not only to identify potential churners but also to predict which customers are most profitable to the company and thus worth retaining. ProfTree is one such decision-tree-based machine learning algorithm; it uses an evolutionary algorithm to learn profit-driven decision trees.
Standard classification metrics like AUC or ROC are not recommended for evaluating profit-driven decision trees because they treat the costs of all misclassifications as equal. So a new evaluation metric is proposed, the "expected maximum profit measure for customer churn," which helps identify the most profitable model. The paper presents a metric based on customer lifetime value (CLV), according to which a churner is defined as a customer whose CLV is decreasing. CLV is the discounted value of future marginal earnings related to the customer's activity and can be calculated with the following equations.
CLTV = (Customer Value / Churn Rate) x Profit Margin
Customer Value = Average Order Value * Purchase Frequency
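As a quick worked example with purely hypothetical numbers:

# Hypothetical numbers for illustration only
average_order_value = 500      # average revenue per transaction
purchase_frequency = 4         # transactions per year
churn_rate = 0.20              # 20% of customers leave each year
profit_margin = 0.25           # 25% profit on revenue

customer_value = average_order_value * purchase_frequency   # 2000
cltv = (customer_value / churn_rate) * profit_margin        # 2500
print("Estimated CLTV:", cltv)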
For reference, read this paper https://paperswithcode.com/paper/profit-driven-decision-trees-for-churn/review/
Netflix is unbeatable, with the OTT industry's lowest churn rate of 2.3%. The reason for this is Netflix's recommendation system and data analysis. Netflix analyses everything: which movie is watched and for how long, which genre is most liked, and which web series is trending, and it updates its system every 24 hours to catch customers' attention by providing them with something new every time they visit. Over 80% of the content a user watches comes from Netflix recommendations. Netflix constantly monitors its potential churners and keeps improving its user experience.
DTH(Direct-to-home) is one of the fastest-growing industries where customer retention is the key to success. Customers have a plethora of options and will be influenced by offers, service, network quality, and publicity. Tata Sky's dedicated team studies customer recharge patterns and identifies customer cohorts needing focused retention efforts. This helped them immensely during the Covid lockdown period to design packages satisfying customer needs.
Customer churn models will become an integral part of every industry in the future, so it is crucial to understand how they work and what advancements are possible. In this blog, we discussed the complete process of model development, from understanding the problem statement, data preprocessing, and data visualization to model evaluation. We then discussed a research paper presenting an advanced decision tree approach and a new evaluation metric for measuring the performance of profit-driven decision trees. We hope you enjoyed the article.
If you have any queries/doubts/feedback, please write us at contact@enjoyalgorithms.com. Enjoy learning, Enjoy algorithms!