With advances in machine learning and data science, we can now estimate life expectancy with reasonable accuracy from a handful of essential parameters. In this blog post, we explore the parameters that affect life expectancy in different countries and how machine learning models can be used to estimate it. We also focus on the parameters that have the most significant impact on life expectancy.
Let’s start by understanding Life Expectancy.
The term “life expectancy” refers to the number of years a person can expect to live. By definition, life expectancy is based on an estimate of the average age members of a particular population group will be when they die.
Life expectancy depends on several factors, the most important being gender and birth year. Generally, females have a slightly higher life expectancy than males due to biological differences. Other factors that influence life expectancy include:
However, that’s hardly the entire list! As we work our way through the data analysis, we will explore additional hidden factors that influence the life expectancy of an individual.
We use the Life Expectancy (WHO) Kaggle dataset for this demonstration.
Let’s start by loading the data.
import pandas as pd
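life_exp = pd.read_csv('Life Expectancy Data.csv')
# Column headers in the CSV may carry stray spaces; strip them so names like 'Life expectancy' match
life_exp.columns = life_exp.columns.str.strip()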
life_exp.head()
Most lifespans lie between 45 and 90 years, with an average of about 69 years.
import seaborn as sns
sns.histplot(life_exp['Life expectancy'].dropna(), kde=True, color='orange')
A correlation heat map is a graphical representation of a correlation matrix, which holds the pairwise correlations between variables. It helps in understanding how strongly the variables depend linearly on one another. Correlation is always calculated between two variables and lies in the range [-1, 1].
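To make the [-1, 1] range concrete, here is a minimal sketch of the Pearson correlation coefficient, which is what pandas computes by default in `DataFrame.corr()`. The helper name `pearson` and the toy numbers are illustrative assumptions, not values from the WHO dataset.

```python
import math

def pearson(xs, ys):
    """Pearson correlation between two equal-length numeric sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Co-movement of the two variables around their means
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    # Spread of each variable around its own mean
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)

# Toy data (made up for illustration)
schooling = [8, 10, 12, 14, 16]
rising = [2 * s + 40 for s in schooling]    # moves exactly with schooling
falling = [100 - 3 * s for s in schooling]  # moves exactly against schooling

r_up = pearson(schooling, rising)
r_down = pearson(schooling, falling)
print(round(r_up, 6), round(r_down, 6))
```

A perfectly increasing linear relationship sits at the top of the range and a perfectly decreasing one at the bottom; real-world pairs such as Schooling and Life expectancy fall somewhere in between.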
Let’s implement the heat map in Python for dependency visualization:
cmap = sns.diverging_palette(500, 10, as_cmap=True)
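# numeric_only=True (pandas 1.5+) skips text columns such as Country, which corr() cannot handle
sns.heatmap(life_exp.corr(numeric_only=True), cmap=cmap, center=0, annot=False, square=True);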
Life expectancy correlates considerably with Adult Mortality, BMI, Schooling, HIV/AIDS, Income Composition of Resources (ICOR), and GDP. The following insights can be drawn from the correlation heatmap:
What is adult mortality?
The adult mortality rate is the probability that a person who has reached age 15 will die before age 60, expressed per 1,000 persons.
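import plotly.express as px
# to_bubble is the data frame prepared for this plot; its construction is not shown in this post
fig = px.scatter(to_bubble, x='GDP', y='Life expectancy',
                 size='Population', color='Continent',
                 hover_name='Country', log_x=True, size_max=40)
fig.show()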
This bubble plot is very informative for understanding the trend of life expectancy across continents. The size of each bubble represents the population of the respective country. The following inferences can safely be drawn from the bubble plot:
Let’s analyze how GDP relates to life expectancy on each continent.
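import matplotlib.pyplot as plt
# The figure setup is not shown in the original post; a 2x3 grid of subplots is an assumption
fig, axs = plt.subplots(2, 3, figsize=(15, 8))
for continent, ax in zip(set(life_exp['Continent']), axs.flat):
    continents = life_exp[life_exp['Continent'] == continent]
    sns.regplot(x=continents['GDP'], y=continents['Life expectancy'], color='red', ax=ax).set_title(continent)
plt.tight_layout()
plt.show()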
High GDP is strongly and positively associated with life expectancy! In other words, a person residing in a developed country with a high GDP can be expected to live relatively longer than a person living in a developing country.
What is ICOR (Income Composition of Resources)?
ICOR measures how well a country utilizes its resources. It is graded between 0 and 1, and a higher ICOR indicates better utilization of the available resources. ICOR has a considerably high correlation with life expectancy. Let’s visualize the impact of ICOR on life expectancy continent-wise.
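import matplotlib.pyplot as plt
# The figure setup is not shown in the original post; a 2x3 grid of subplots is an assumption
fig, axs = plt.subplots(2, 3, figsize=(15, 8))
for continent, ax in zip(set(life_exp['Continent']), axs.flat):
    continents = life_exp[life_exp['Continent'] == continent]
    sns.regplot(x=continents['Income composition of resources'], y=continents['Life expectancy'], color='blue', ax=ax).set_title(continent)
plt.tight_layout()
plt.show()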
As expected, higher ICOR goes hand in hand with higher life expectancy. If a country utilizes its resources productively, its citizens are more likely to live longer.
The question arises: is there a way to predict life expectancy based on the 22 independent features discussed? The answer is yes, but first we must choose a suitable supervised regression algorithm for the task.
There are many algorithms available for regression tasks, and each has its own advantages and disadvantages. One algorithm might produce better results than others but be harder to interpret. Even if interpretability is not an issue, deploying complex algorithms can be difficult. In other words, there is a trade-off between accuracy, model complexity, and model interpretability. An ideal algorithm would be interpretable, accurate, and easy to deploy, but no algorithm is perfect on all three counts.
For example, Linear Regression is a relatively simple and interpretable algorithm. It requires minimal effort to deploy, but its accuracy can be limited when the data is non-linear. Complex algorithms may perform better on non-linear datasets, but the model may lack interpretability.
Let’s proceed with Linear Regression for this task.
Linear Regression is a supervised regression algorithm that models the target as a linear function of the input features. We try to predict a continuous value for a given data point by generalizing from the data we have in hand. The “linear” part refers to the straight-line form of that generalization.
The idea is to predict the dependent variable (Y) using a given independent variable (X). This can be accomplished by fitting a best-fit line to the data. The line with the least sum of squared residual errors is the best-fit line, or regression line.
What is a residual error?
A residual error measures how far a point lies vertically from the regression line. Simply put, it is the difference between a predicted value and the observed actual value.
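The idea of a best-fit line and its residuals can be sketched in a few lines of plain Python. This is an illustrative sketch for a single feature, not the code used later in this post: the helper names `fit_line` and `sum_squared_residuals` are hypothetical and the numbers are made up.

```python
def fit_line(xs, ys):
    """Ordinary least squares fit of y = intercept + slope * x."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    slope = (sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
             / sum((x - mean_x) ** 2 for x in xs))
    intercept = mean_y - slope * mean_x
    return slope, intercept

def sum_squared_residuals(xs, ys, slope, intercept):
    """Sum of squared vertical distances between the points and the line."""
    return sum((y - (intercept + slope * x)) ** 2 for x, y in zip(xs, ys))

# Toy data (made up for illustration): years of schooling vs. life expectancy
schooling = [8, 10, 12, 14, 16]
life = [60, 66, 69, 75, 80]

slope, intercept = fit_line(schooling, life)
# Residual = predicted value - actual value, one per data point
residuals = [(intercept + slope * x) - y for x, y in zip(schooling, life)]

best = sum_squared_residuals(schooling, life, slope, intercept)
# Any other line, here one with a steeper slope, has a larger squared error
other = sum_squared_residuals(schooling, life, slope + 0.5, intercept)
print(slope, intercept, best, other)
```

The residuals of the fitted line sum to roughly zero, and nudging the slope away from the fitted value only increases the squared error, which is what makes it the best-fit line.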
Let’s predict Life expectancy by using Linear Regression. Before building the model, we need to split the dataset into training and testing sets. We will use this test set to evaluate the model's performance.
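Conceptually, a train/test split shuffles the rows and holds a fraction of them back for evaluation. The sketch below shows that idea in plain Python. The helper `split_rows` is a hypothetical name, and this is not how scikit-learn implements `train_test_split`.

```python
import random

def split_rows(rows, test_size=0.3, seed=0):
    """Shuffle rows reproducibly, then hold out a fraction for testing."""
    shuffled = list(rows)
    random.Random(seed).shuffle(shuffled)  # fixed seed keeps the split repeatable
    n_test = int(round(len(shuffled) * test_size))
    return shuffled[n_test:], shuffled[:n_test]

rows = list(range(10))  # stand-in for 10 dataset rows
train_rows, test_rows = split_rows(rows, test_size=0.3, seed=0)
print(len(train_rows), len(test_rows))
```

The model only ever sees the training rows while fitting, so its score on the held-out rows estimates how well it generalizes to unseen data.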
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
target = life_exp['Life expectancy']
features = life_exp[life_exp.columns.difference(['Life expectancy', 'Year'])]
#----- Splitting the dataset (assumes missing values were already handled; LinearRegression rejects NaN) -----#
x_train, x_test, y_train, y_test = train_test_split(pd.get_dummies(features), target, test_size=0.3)
#----- Linear Regression -----#
lr = LinearRegression()
#----- Fitting model over training data -----#
lr.fit(x_train, y_train)
#----- Evaluating the model over test data (score() returns R-squared) -----#
lr_confidence = lr.score(x_test, y_test)
print("lr confidence: ", lr_confidence)
#lr confidence: 0.9538309850283277
The coefficient of determination (R-squared) came out at about 0.95, close to 1, indicating that the model explains roughly 95% of the variance in life expectancy on the test set.
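For intuition, R-squared compares the model's squared errors with those of a baseline that always predicts the mean. Here is a minimal sketch of that formula. The helper `r_squared` is a hypothetical name and the numbers are made up for illustration.

```python
def r_squared(actual, predicted):
    """R-squared = 1 - (residual sum of squares / total sum of squares)."""
    mean_actual = sum(actual) / len(actual)
    ss_res = sum((a - p) ** 2 for a, p in zip(actual, predicted))
    ss_tot = sum((a - mean_actual) ** 2 for a in actual)
    return 1 - ss_res / ss_tot

# Toy numbers (made up for illustration)
actual = [60, 66, 69, 75, 80]
good_preds = [61, 65, 70, 74, 80]  # close to the actual values
mean_only = [70, 70, 70, 70, 70]   # always predicts the average

r2_good = r_squared(actual, good_preds)
r2_mean = r_squared(actual, mean_only)
print(r2_good, r2_mean)
```

Predictions that track the actual values closely push the score toward 1, while a model no better than predicting the average scores 0.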
For validation of the model, let’s check the distribution of residuals.
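# Residuals on the test set, defined as predictions minus actual values
residuals = lr.predict(x_test) - y_test
sns.histplot(residuals, kde=True, stat="density", color="orange")
plt.title('Residual Plot')
plt.xlabel('Residuals: (Predictions - Actual)')
plt.ylabel('Density');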
The residual distribution is approximately normal, with a mean close to zero. This is precisely what we are looking for. Let’s visualize the residuals in a scatter plot!
Residuals are centered around zero, and the coefficient of determination (R-squared) is close to 1. An R-squared value close to 1 indicates a good fit on the test dataset. With these results, our model performs well at predicting life expectancy.
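The "centered around zero" check can also be done numerically rather than by eye. The sketch below uses made-up predictions and actual values, not the model's real output, and only the standard library.

```python
import statistics

# Toy values (made up for illustration); residual = prediction - actual
predictions = [61.0, 65.0, 70.0, 74.0, 80.5, 58.5]
actuals = [60.0, 66.0, 69.0, 75.0, 80.0, 59.0]
residuals = [p - a for p, a in zip(predictions, actuals)]

mean_residual = statistics.mean(residuals)  # should sit near zero
spread = statistics.stdev(residuals)        # typical size of an error
# Fraction of residuals within two standard deviations of zero
within_two_sd = sum(abs(r) <= 2 * spread for r in residuals) / len(residuals)
print(mean_residual, spread, within_two_sd)
```

A mean far from zero would suggest the model systematically over- or under-predicts, which a high R-squared alone does not rule out.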
The World Health Organization (WHO) keeps track of the health status of all countries and many other related factors. It is responsible for monitoring public health risks, promoting human health, and coordinating responses to health emergencies. WHO relies heavily on statistical algorithms like Linear Regression for studying life expectancy and the impact of pandemics on lifespan.
Blue Shield of California (BSOC) is at the forefront of innovation in the healthcare domain. BSOC is a non-profit health insurance organization that provides medical treatment coverage and insurance plans. BSOC uses the Linear Regression model to estimate medical expenses based on insurance data.
JPMorgan Chase is a global leader in financial services offering solutions to the world’s most important corporations and government institutions. JPMC has generalised the use of Linear Regression in their Capital Asset Pricing Model (CAPM), where risky assets are merged with non-risky assets to reduce the unsystematic risk. Moreover, JPMC uses Linear Regression for Forecasting and Financial Analysis.
Johnson & Johnson (J&J) is an American multinational corporation that develops medical devices, pharmaceuticals, and consumer packaged goods. J&J uses linear regression to estimate the remaining shelf life of medicine stocks.
Walmart is a well-known retail corporation that operates various hypermarkets, department stores, grocery stores, and garment buying houses. Walmart relies on regression analysis for sales forecasting and improved decision making.
Predicting life expectancy is a popular machine learning project that commonly appears on the resumes of freshers. Possible interview questions on this project include:
We began by understanding life expectancy and the factors that affect it. We then visualized these affecting parameters and correlated them to make inferences. Finally, we covered linear regression and used it to predict life expectancy.
One can potentially increase one's lifespan by adopting a healthy lifestyle, getting a proper education, and getting vaccinated. Geographic location also plays a significant role. Our analysis found that people living in Europe have a higher life expectancy than people on other continents. A country's GDP and income composition also have a broad impact on life expectancy. Some parameters, such as pollution and environmental index, were not included in this analysis but are expected to correlate strongly with life expectancy.
Next Blog: Introduction to Logistic Regression