- Project Overview
- About Dataset
- Goal of the Project
- Data Sourcing
- Tools
- Data Cleaning/Preparation
- Exploratory Data Analysis
- Data Analysis
- Results/Findings
- Recommendation
- Limitation
This project focuses on cleaning and transforming a Kaggle diabetes prediction dataset to ensure data quality. The work involved handling missing values, identifying and correcting inconsistencies, handling outliers, and ensuring consistent data formats. It also included an exploratory data analysis (EDA) to get a sense of the distribution of the variables and the relationships between them.
The Diabetes Prediction Dataset contains a collection of medical and demographic features (age, BMI, hypertension, etc.) associated with patients' diabetes status (positive/negative), enabling analysis and prediction of diabetes risk.
The goal is to uncover hidden patterns and prepare the data for accurate prediction models.
A Kaggle dataset: https://www.kaggle.com/datasets/iammustafatz/diabetes-prediction-dataset
- Python: data cleaning
- Excel: data cleaning
In the initial stage of data cleaning, we performed the following tasks:
- Data Loading and Inspection
- Handling and treating missing values
- Handling the outliers
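The cleaning steps listed above can be sketched roughly as follows. This is a minimal illustration, not the exact procedure used in the project: the tiny DataFrame is made-up sample data standing in for the Kaggle CSV, the `clean` helper is hypothetical, and the choice of median imputation and the 1.5 × IQR outlier rule is an assumption.

```python
import pandas as pd

# Made-up sample standing in for the Kaggle CSV; values are illustrative only.
raw = pd.DataFrame({
    "gender": ["Female", "Male", "Female", "Male", "Female", "Male"],
    "age": [44.0, 61.0, None, 35.0, 52.0, 29.0],
    "bmi": [24.1, 27.3, 25.8, 95.0, 26.4, 23.9],
    "diabetes": [0, 1, 0, 0, 1, 0],
})

def clean(df: pd.DataFrame) -> pd.DataFrame:
    """Hypothetical helper: impute missing ages, then drop BMI outliers."""
    out = df.copy()

    # Inspection: count missing values per column before changing anything.
    print(out.isna().sum())

    # Missing values: fill the numeric 'age' column with its median (assumed strategy).
    out["age"] = out["age"].fillna(out["age"].median())

    # Outliers: keep rows whose BMI lies within 1.5 * IQR of the quartiles (assumed rule).
    q1, q3 = out["bmi"].quantile([0.25, 0.75])
    iqr = q3 - q1
    within_bounds = out["bmi"].between(q1 - 1.5 * iqr, q3 + 1.5 * iqr)
    return out[within_bounds].reset_index(drop=True)

cleaned = clean(raw)
print(cleaned)
```

On the real dataset, `raw` would instead come from reading the downloaded CSV, and the imputation and outlier rules should be chosen per column after inspecting the data.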
- Is there any relationship between the demographic features and diabetes?
- What are the key metrics for diabetes prediction?
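One way to start exploring these questions is to compare the share of diabetes-positive records across demographic groups. The sketch below uses made-up sample data, and the age-band boundaries are arbitrary assumptions chosen only for illustration.

```python
import pandas as pd

# Made-up sample; column names mirror those used in this project.
sample = pd.DataFrame({
    "gender": ["Female", "Male", "Female", "Male", "Female", "Male", "Female", "Male"],
    "age": [25, 34, 47, 52, 63, 68, 71, 80],
    "diabetes": [0, 0, 0, 1, 0, 1, 1, 1],
})

# Bucket age into bands (boundaries are an assumption for this sketch).
sample["age_band"] = pd.cut(
    sample["age"], bins=[0, 40, 60, 120], labels=["<=40", "41-60", ">60"]
)

# The mean of a 0/1 column is the proportion of positive cases in each group.
rate_by_age = sample.groupby("age_band", observed=True)["diabetes"].mean()
rate_by_gender = sample.groupby("gender")["diabetes"].mean()

print(rate_by_age)
print(rate_by_gender)
```

Large differences in these group rates would suggest that the feature is worth keeping as a predictor; small differences do not rule it out, since features can interact.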
Some interesting code we worked with:
import matplotlib.pyplot as plt

# diabetes_prediction is the cleaned dataset loaded earlier (e.g. a pandas DataFrame)
x = diabetes_prediction['age']
y = diabetes_prediction['hypertension']

# Scatter plot of hypertension status (0/1) against age
plt.scatter(x, y)
plt.xlabel('Age', fontsize=16)
plt.ylabel('Hypertension (1 for yes, 0 for no)', fontsize=16)
plt.title('Relationship between Age and Hypertension', fontsize=20)
plt.show()
The analysis results are summarized as follows:
- No significant relationship between age and hypertension in diabetes.
- There is little to no difference in the range of BMI values (the difference between the minimum and maximum) between males and females with hypertension.
- BMI distribution across genders and hypertension groups shows minimal differences. Interestingly, no individuals with "other" gender classification have hypertension in this dataset.
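The BMI comparisons above can be reproduced with a grouped summary along these lines. The numbers below are made-up sample values, not results from the dataset; only the column names follow the ones used in this project.

```python
import pandas as pd

# Made-up sample; the real analysis would use the cleaned Kaggle data.
sample = pd.DataFrame({
    "gender": ["Female", "Female", "Female", "Male", "Male", "Male", "Other", "Other"],
    "hypertension": [1, 1, 0, 1, 1, 0, 0, 0],
    "bmi": [22.0, 31.0, 27.5, 23.0, 32.5, 26.0, 24.0, 28.0],
})

# BMI min, max, and median for every gender / hypertension combination.
summary = sample.groupby(["gender", "hypertension"])["bmi"].agg(["min", "max", "median"])
summary["range"] = summary["max"] - summary["min"]

# Number of hypertensive records per gender; a zero flags an empty subgroup.
hypertension_counts = sample.groupby("gender")["hypertension"].sum()

print(summary)
print(hypertension_counts)
```

Checking the per-group counts alongside the summary helps avoid reading too much into categories with very few records.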
Based on the analysis conducted, these are the recommendations:
- Collect data on other risk factors (family history, lifestyle, etc.) beyond age for predicting hypertension in diabetic patients.
- The dataset provides insufficient data on the "Other" gender category, in both size and representativeness.
BMI Categorization ©️ See Here
Kaggle DataSet ©️ Link