Machine Learning Model to Predict Heart Disease

Introduction

Heart disease is a leading cause of death worldwide, with an estimated 17.9 million deaths each year. As a data scientist, I understand the challenges of early detection and diagnosis, especially given the complex and multifactorial nature of the disease. In recent years, I have explored machine learning as a promising approach for predicting and identifying risk factors associated with heart disease. In this article, I present a case study using the Autogluon library to develop machine-learning models for heart disease prediction and discuss the implications and future applications of the project results.

Tools

Python programming language
Jupyter Notebook as the development environment
AWS Sagemaker Autopilot
Pandas for data manipulation
Numpy for numerical operations
Seaborn and Matplotlib for data visualization
Scikit-learn for evaluation metrics

Uses

Some possible uses of this model are:

Physicians and healthcare professionals can use these models to predict the likelihood of heart disease in patients based on their age, weight, gender, lifestyle choices, blood pressure, cholesterol, and glucose levels.
Patients can also use these models to assess their risk of heart disease and make necessary lifestyle changes to prevent or manage the disease.
Healthcare organizations and policymakers can use these models to identify risk factors and develop targeted prevention and treatment strategies for heart disease.

Overall, the trained models have the potential to contribute to better patient outcomes and improved public health by providing accurate predictions and insights related to heart disease.

Data

Input:

Age, height, weight, gender

Smoking, alcohol intake, physical activity

Systolic blood pressure, diastolic blood pressure

Cholesterol, glucose

All of these inputs are used to produce a binary output (0 or 1) indicating whether or not the patient has heart disease.

The dataset used can be found here

Setup

The first step was to set up my work environment and make sure I had all the resources and libraries I needed for the project. I installed the following packages using pip: autogluon, pandas, numpy, seaborn, and matplotlib. I then imported them and changed the theme of my Jupyter notebook to make the data easier to visualize.

Then I read the disease data using pandas.

The age was recorded in days, so I converted it to years and removed the id column.
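The loading and cleaning steps above can be sketched roughly as follows. This is a minimal illustration rather than my exact notebook cell: the two sample rows are made up, the column names other than id, age, and cardio are assumptions, and dividing by 365.25 is just one reasonable way to turn days into years.

```python
import io

import pandas as pd

# Two made-up rows standing in for the real file (assumption: comma-separated).
SAMPLE_CSV = """id,age,gender,cardio
0,18250,1,0
1,21900,2,1
"""


def load_and_clean(source):
    """Read the CSV, convert age from days to years, and drop the id column."""
    df = pd.read_csv(source)
    df["age"] = df["age"] / 365.25  # days -> approximate years
    return df.drop(columns=["id"])


df = load_and_clean(io.StringIO(SAMPLE_CSV))
print(df)
```

With the real dataset, a file path would be passed to load_and_clean in place of the in-memory sample.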

Visualization

For this part, I first checked the data summary to make sure there were no null values in the data.
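A null check like the one described above might look like this. The small DataFrame is a made-up stand-in for the real data, and its column names are assumptions.

```python
import pandas as pd

# Made-up stand-in for the heart disease data (column names are assumptions).
df = pd.DataFrame(
    {
        "age": [50.4, 55.1, 61.9],
        "cholesterol": [1, 3, 2],
        "cardio": [0, 1, 1],
    }
)

df.info()  # column types and non-null counts
missing = df.isnull().sum()  # number of missing values per column
print(missing)
```

If every entry in missing is zero, no imputation or row dropping is needed before training.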

I then used some data visualization tools to analyze the data.

Histogram:

From this, I can tell that there are not many smokers in the dataset, and even fewer drinkers.
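Histograms like these can be drawn for every column with a single pandas call. The sketch below uses made-up values, and the column names smoke and alco are assumptions about how the smoking and alcohol fields are labeled.

```python
import matplotlib

matplotlib.use("Agg")  # headless backend for scripts; omit this line in a notebook
import matplotlib.pyplot as plt
import pandas as pd

# Made-up sample: few smokers and even fewer drinkers (column names assumed).
df = pd.DataFrame(
    {
        "age": [41, 45, 48, 50, 52, 55, 57, 60, 62, 64],
        "smoke": [0, 0, 0, 1, 0, 0, 0, 1, 0, 0],
        "alco": [0, 0, 0, 0, 0, 1, 0, 0, 0, 0],
    }
)

axes = df.hist(figsize=(8, 4), bins=10)  # one histogram per numeric column
plt.tight_layout()
plt.close("all")
```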

Then I created a correlation matrix to show how the features relate to one another.

The 1s on the diagonal are there because every column correlates perfectly with itself. From this, we can tell there is some correlation between height and gender, and also between glucose and cholesterol.
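To illustrate why the diagonal is always 1, here is a small sketch that builds a correlation matrix with pandas. The data is randomly generated for the example and has nothing to do with the real dataset.

```python
import numpy as np
import pandas as pd

# Synthetic data for illustration only: weight is built to depend on height.
rng = np.random.default_rng(0)
height = rng.normal(165, 8, 200)
df = pd.DataFrame(
    {
        "height": height,
        "weight": 0.5 * height + rng.normal(0, 5, 200),
        "smoke": rng.integers(0, 2, 200),
    }
)

corr = df.corr()  # pairwise Pearson correlation between columns
print(corr.round(2))

# In a notebook, seaborn can render the matrix as a heatmap:
# sns.heatmap(corr, annot=True)
```

Each column compared with itself gives exactly 1, and the matrix is symmetric, so only the off-diagonal entries carry information.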

Training

Before training my model, I split the data into two parts: I randomly set aside 20% of the data for testing and kept the remaining 80% for training, like so:

You can see the order has been shuffled. This is important, as we do not want the model to learn the order of the data.
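One way to do a random 80/20 split with pandas alone is sketched below. The helper name split_train_test and the seed are my own choices for the example, and the toy DataFrame stands in for the real data.

```python
import pandas as pd


def split_train_test(df, test_frac=0.2, seed=42):
    """Randomly hold out test_frac of the rows; the remainder is for training."""
    test = df.sample(frac=test_frac, random_state=seed)
    # Drop the held-out rows, then shuffle what is left.
    train = df.drop(test.index).sample(frac=1.0, random_state=seed)
    return train, test


# Toy data: 20 rows, so the split is 16 for training and 4 for testing.
df = pd.DataFrame({"age": range(40, 60), "cardio": [0, 1] * 10})
train, test = split_train_test(df)
print(len(train), len(test))
```

Fixing the seed makes the split reproducible, which helps when comparing runs.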

Then I trained the model. I set the training quality to medium, which is the default, set the label (the column to predict) to cardio, and set problem_type to binary.

Here is a summary of the models trained:

I created a leaderboard to show which of the models performed best:
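The training steps above could be written along these lines with AutoGluon's tabular API. Treat it as a sketch under assumptions: the function name train_and_rank is hypothetical, and train_df and test_df are assumed to be pandas DataFrames that both contain the cardio column.

```python
LABEL = "cardio"
PROBLEM_TYPE = "binary"
PRESET = "medium_quality"  # AutoGluon's default quality preset


def train_and_rank(train_df, test_df):
    """Fit AutoGluon on train_df, print a summary, and rank models on test_df."""
    # Imported lazily so this file still loads where AutoGluon is not installed.
    from autogluon.tabular import TabularPredictor

    predictor = TabularPredictor(label=LABEL, problem_type=PROBLEM_TYPE).fit(
        train_df, presets=PRESET
    )
    predictor.fit_summary()  # prints details of the models that were trained
    leaderboard = predictor.leaderboard(test_df)  # scores each model on test_df
    return predictor, leaderboard
```

Passing the held-out data to leaderboard ranks the models on rows they have not seen during training.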

Testing

Here I used the 20% of the original data that I had set aside to generate predictions.
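Generating predictions on the held-out rows could be sketched like this. The helper predict_on_holdout is hypothetical, and MajorityStub is only a stand-in so the example runs without a trained model; a fitted AutoGluon predictor exposes the same predict method.

```python
import pandas as pd


def predict_on_holdout(predictor, test_df, label="cardio"):
    """Return the true labels and the predictor's labels for the held-out rows."""
    features = test_df.drop(columns=[label])  # hide the answer from the model
    y_true = test_df[label]
    y_pred = predictor.predict(features)
    return y_true, y_pred


class MajorityStub:
    """Stand-in for a trained predictor: always predicts class 0."""

    def predict(self, features):
        return pd.Series(0, index=features.index)


# Made-up holdout rows for illustration.
holdout = pd.DataFrame({"age": [50, 61, 45], "cardio": [0, 1, 0]})
y_true, y_pred = predict_on_holdout(MajorityStub(), holdout)
print(list(y_true), list(y_pred))
```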

Then I created a confusion matrix. Confusion matrices are used to compare a model's predicted labels against the true labels.

This matrix shows that 10,296 samples (5558 + 4738) were correctly classified and 3,704 (1511 + 2193) were misclassified.
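As a quick check on those counts, the overall accuracy can be worked out directly from the four cells of the matrix. The arithmetic does not depend on which off-diagonal cell holds the false positives and which holds the false negatives.

```python
# Counts reported in the confusion matrix above.
correct = 5558 + 4738  # diagonal cells: predictions that match the true label
wrong = 1511 + 2193  # off-diagonal cells: misclassified samples
total = correct + wrong

accuracy = correct / total
print(f"{correct} correct, {wrong} wrong, accuracy = {accuracy:.3f}")
# prints "10296 correct, 3704 wrong, accuracy = 0.735"
```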

The classification report shows how precise and accurate the model is.
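For readers who have not used it, scikit-learn can produce both the confusion matrix and the classification report from two label sequences. The labels below are made up for illustration and are not the project's predictions.

```python
from sklearn.metrics import classification_report, confusion_matrix

# Made-up labels for illustration only.
y_true = [0, 0, 0, 1, 1, 1, 1, 0]
y_pred = [0, 0, 1, 1, 1, 0, 1, 0]

cm = confusion_matrix(y_true, y_pred)  # rows: true class, columns: predicted class
print(cm)

# Text report with precision, recall, and F1 score per class.
print(classification_report(y_true, y_pred))

# The same numbers as a dictionary, handy for further processing.
report = classification_report(y_true, y_pred, output_dict=True)
```

Precision tells us how many of the predicted positives were right, while recall tells us how many of the actual positives were found.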

Conclusion

From this project, I learned that machine learning can be a powerful tool for predicting heart disease and identifying risk factors. I also gained experience in using the Autogluon library and exploring data using various visualization techniques.

Moving forward, this knowledge can be implemented in several ways. For instance, the models developed in this project could be further refined and optimized using additional data and feature engineering techniques. Additionally, the models could be integrated into a clinical decision support system to assist physicians in making accurate and timely diagnoses of heart disease.

Furthermore, the approach taken in this project could be applied to other health conditions or diseases, such as diabetes or cancer, by modifying the input features and target variable. By using machine learning to identify patterns and risk factors associated with various health conditions, we can develop more targeted and effective prevention and treatment strategies, ultimately improving health outcomes for patients.