Using a Linear Regression Model to Predict Life Expectancy 📈

Shanzeh Haji
6 min readJan 6, 2023

Almost everyone has heard of the famous Steve Jobs- Co-creator of the Apple II, Macintosh, iPod, iPhone, iPad, and first Apple Stores.

Steve Jobs died of respiratory arrest on October 5, 2011, at the age of 56. At that time, the average lifespan in the USA was 78.7 years; he died 22 years before the average death in the country.

The shocking difference made me question the average life expectancy. What factors play a part in this prediction?

How I created my regression model:

1. Importing Libraries and Datasets

The World Health Organization (WHO) published a health data set on life expectancy that included many factors from the four major categories related to health, social, economic, and mortality.

The data set contained information from 193 countries from 2000 to 2015 with 22 columns and 2938 rows.

This image is just a snippet of the information produced by The World Health Organization. The rows show the different factors that were analyzed through this data set.

2. Perform Exploratory Data Analysis and Data Visualization

After analyzing the different columns, I was able to see which features correlate with each other to avoid multicollinearity. Multicollinearity is when independent variables (the different factors) are highly correlated. If the variables are similar, the data would be displayed twice; this undermines the statistical significance. To avoid the problem, we can create a correlation matrix.

Notice that a correlation matrix is perfectly symmetrical. For example, the top right cell shows the exact same value as the bottom left cell. Because a correlation matrix is symmetrical, half of the correlation coefficients shown in the matrix are redundant and unnecessary. Also, notice that the correlation coefficients along the diagonal of the table are all equal to 1 because each variable is perfectly correlated with itself. These cells aren’t useful for interpretation.

We can also look at a scatterplot to see how each feature is correlated to life expectancy.

By analyzing the different scatterplots, you can further look at some of the scatterplots that have strong correlations to life expectancy. These variables are schooling, GDP, income composition of resources and HIV/AIDS death rates.

3. Understanding the correlation between the factors with strong correlations to life expectancy.

Schooling: Low educational attainment is linked to self-reported poor health, decreased life expectancy, and decreased survival. Higher education is the foundation for stable or well-paid jobs. Increased income can assist in paying for more nutritious food, better housing and quality.

Although there is a discrepancy between schooling and stress- many believe that obtaining a degree could shorten your lifespan. In reality, the top cause of stress is income or money which can be prevented through a job brought by a degree.

If you want to find out more about how stress impacts your lifespan, check out this article:

GDP: GDP per capita raises life expectancy at birth by promoting economic growth and development in a country, resulting in increased longevity. Economists traditionally use the gross domestic product (GDP) to measure economic progress. If GDP is rising, the economy is in solid shape, and the nation is moving forward.

Higher GDP is often associated with positive outcomes in a wide range of areas such as better health, more education, and even greater life satisfaction.

Income Composition of Resources: There is a strong relationship between a country’s income and its life expectancy. In general, countries with higher incomes tend to have higher life expectancies, while countries with lower incomes tend to have lower life expectancies. The income composition of a country’s resources can include agriculture, industry, and services. The income composition of a country’s resources can have a significant impact on its overall development and standard of living.

This relationship can be explained by a variety of factors, including access to quality healthcare, nutrition, education, and other resources that contribute to overall health and well-being.

HIV/AIDS: HIV/AIDS can have a significant impact on life expectancy, particularly in countries with high rates of HIV infection. HIV is a virus that attacks the immune system, making it difficult for the body to fight off infections and diseases. If left untreated, HIV can progress to AIDS, which is the most advanced stage of the infection.

In the early years of the HIV/AIDS epidemic, life expectancy for people with HIV was significantly shorter than for those without the virus. However, with the development of antiretroviral therapy (ART), life expectancy for people with HIV has improved dramatically. ART is a combination of medications that can suppress the virus and help to prevent it from progressing to AIDS. When taken consistently, ART can help people with HIV to live long and healthy lives.

Despite these advances, HIV/AIDS continues to have a significant impact on life expectancy in many parts of the world, particularly in low-income countries where access to ART and other healthcare resources may be limited. In these countries, HIV/AIDS is still one of the leading causes of death, and it can have a significant impact on overall life expectancy.

4. Train a Linear regression model in Scikit-Learn

Scikit Learn uses machine learning in python as a tool for predictive analysis. Using this program, you are able to test accuracy and create the regression model.

Using the 4 main data inputs (schooling, GDP, income composition of resources and HIV/AIDS), I could get an accurate representation of life expectancy. The model and equation I created had an accuracy rate of 81% as some of the null values within the data needed to be filled- I did this by finding the average or the mean and working accordingly.

Why do we predict life expectancy using regression models?

Regression models are statistical tools that can be used to predict the value of a dependent variable based on one or more independent variables. In the context of predicting life expectancy, regression models can be used to understand the relationship between different factors (such as income, healthcare access, and nutrition) and life expectancy.

There are several reasons why regression models might be used to predict life expectancy:

  1. Regression models can help to identify the factors that are most strongly associated with life expectancy, allowing policymakers to target interventions and resources to the areas that are most likely to have the greatest impact.
  2. Regression models can be used to predict life expectancy under different scenarios, such as if healthcare access were to improve or if income were to increase. This can help policymakers to understand the potential impact of different policy interventions.
  3. Regression models can help to quantify the relationship between different factors and life expectancy, allowing policymakers to better understand the magnitude of the impact that different interventions might have.
Check out a video version

I appreciate your reading, and I hope you learnt something 😊. My code is posted on GitHub if you wanted to take a look. Feel free to connect with me on Linkedin and send me a note if you enjoyed reading this post or have any questions. You can also follow my Medium page and remain updated on all the content I produce!



Shanzeh Haji

I'm a 15y/o longevity enthusiast on a mission to make a positive contribution to society by exploring ways to increase lifespan