Author: Ryan Harper
This data on health and heart disease was made available by UC Irvine's Machine Learning Repository (http://archive.ics.uci.edu/ml/index.php). It includes categorical variables (e.g., attributes with roughly 2-4 discrete levels) and continuous variables (blood pressure, cholesterol, and age). The aim of the original research team was to use the 13 features (from age to thal) to predict the existing heart disease diagnosis (num).
Variables (health indicators):
age: age in years
sex: sex (1 = male; 0 = female)
cp: chest pain type (1 = typical angina; 2 = atypical angina; 3 = non-anginal pain; 4 = asymptomatic)
trestbps: resting blood pressure (in mm Hg)
chol: serum cholesterol (in mg/dl)
fbs: fasting blood sugar > 120 mg/dl (1 = true; 0 = false)
restecg: resting electrocardiographic results (0 = normal) (0-2)
thalach: maximum heart rate achieved
exang: exercise induced angina (1 = yes; 0 = no)
oldpeak: ST depression induced by exercise relative to rest
slope: the slope of the peak exercise ST segment (1 = upsloping; 2 = flat; 3 = downsloping)
ca: number of major vessels colored by fluoroscopy (0-3)
thal: 3 = normal; 6 = fixed defect; 7 = reversible defect
num: diagnosis of heart disease (angiographic disease status) (0 = no presence; 1-4 = increasing heart disease severity)
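For readability in later analysis, the coded categorical values can be decoded into labels. A minimal sketch; the mapping dicts simply restate the data dictionary above, while the dataframe df and the derived cp_label column are illustrative, not part of the original notebook:
# illustrative decoding of categorical codes into readable labels
cp_labels = {1: 'typical angina', 2: 'atypical angina', 3: 'non-anginal pain', 4: 'asymptomatic'}
slope_labels = {1: 'upsloping', 2: 'flat', 3: 'downsloping'}
thal_labels = {3: 'normal', 6: 'fixed defect', 7: 'reversible defect'}
# example usage, assuming a dataframe df with a numeric cp column:
# df['cp_label'] = df['cp'].map(cp_labels)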
Observations:
The Switzerland sample had a high percentage of null data in certain categories. This suggests that different locations might use different data collection tools and processes.
chol at the VA hospital (Long Beach) had a non-normal distribution, but chol at the other locations appeared to be normally distributed.
thalach had similar parametric distributions per location, but the mean varied noticeably by location. This could indicate that patients in one location (e.g., Cleveland) have a higher central tendency for thalach than patients in Switzerland or Hungary.
There is a statistically significant difference in oldpeak between the Cleveland, VA, and Hungarian samples.
Hypothesis:
Patients can be classified by hospital location using 14 physical health measurements.
Thoughts:
Data for this machine learning classification project was collected from four different hospitals (two in the US and two in Europe). My assumption is that diet, weather, and exercise habits vary between the populations of patients that attend each of the four hospitals. Patients from a community in Switzerland may have a very different min/max/mean range for the 14 health indicators than patients from Cleveland, USA. I predict that there is variation in the data samples' distributions by hospital location and that hospital location can be determined with the use of a data classification model.
There is a total of 920 samples from Cleveland (303), the VA in Long Beach (200), Switzerland (123), and Hungary (294). Because of the large number of health indicators collected per sample for this project, I believe it is possible that hospital location can be predicted from the 14 health features in this data set.
There is a high number of predictor variables (physical health measurements) relative to the small number of samples, which could result in overfitting.
Because of the relatively low number of samples, this research project is exploratory, with the purpose of providing researchers new data to work with and new ways of considering sample location for patients and physical health measurements.
Method:
I will use either a naive Bayes classifier or a KNN classifier to determine hospital location from the 14 physical health measurements. For preprocessing, I will check each health feature to determine whether the samples vary by location.
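A minimal sketch of the two candidate models, assuming scikit-learn is available and reusing the cleaned heart dataframe built later in this notebook; the 5-fold cross-validation is one way to keep an eye on the overfitting risk noted above:
# sketch: compare naive Bayes and KNN for predicting hospital location
from sklearn.naive_bayes import GaussianNB
from sklearn.neighbors import KNeighborsClassifier
from sklearn.model_selection import cross_val_score

X = heart.drop(columns=['location'])  # remaining physical health measurements
y = heart['location']                 # hospital location labels
for model in (GaussianNB(), KNeighborsClassifier(n_neighbors=5)):
    scores = cross_val_score(model, X, y, cv=5)
    print(type(model).__name__, scores.mean())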
# import libraries
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import calendar
import seaborn as sns
import ast
import math
import re
import warnings
from scipy.stats import ttest_ind, mannwhitneyu, median_test, f_oneway
import missingno as msno
from IPython.display import display
from IPython.core.debugger import set_trace
%matplotlib inline
# change optional settings for libraries
sns.set_style("whitegrid")
pd.set_option('display.show_dimensions', False)
warnings.filterwarnings('ignore')
filename = ['processed.cleveland.data','processed.hungarian.data','processed.switzerland.data','processed.va.data']
colnames=['age','sex','cp','trestbps','chol','fbs','restecg','thalach','exang','oldpeak','slope','ca','thal','num']
heartdisease = []
for path in filename:
    tempdf = pd.read_csv('../data/heart/' + path, names=colnames)
    # extract the location name from the filename (e.g., 'cleveland')
    tempname = re.findall(r'\.([a-z]*)\.', path)[0]
    # add the location of the data to the dataframe
    tempdf['location'] = tempname
    heartdisease.append(tempdf)
heartdf = pd.concat(heartdisease)
# count samples without heart disease (num == 0) per location
heartdf.loc[heartdf['num'] == 0, 'location'].value_counts()
heartdf['location'] = pd.Categorical(heartdf['location'])
Missing Data
# check for missing data
heartdf = heartdf.replace('?',np.nan)
msno.matrix(heartdf,color=(.3, .3, .5))
plt.show()
slope, ca, and thal appear to have a large number of missing values. trestbps, thalach, exang, and oldpeak appear to be missing in a smaller percentage of rows.
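To quantify this per location, the null share of each column can be broken down by hospital; a quick sketch using the heartdf built above:
# fraction of missing values per column, split by hospital location
print(heartdf.groupby('location').apply(lambda g: g.isnull().mean()).round(2))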
Delete variables with a large number of null values
heartdf = heartdf.drop(columns=['slope', 'ca', 'thal'])
Clean null-value rows from the dataframe and convert strings to floats
print("Location Count Before DropNA:\n{}".format(heartdf['location'].value_counts()))
heart = heartdf.loc[heartdf['chol'] != 0].dropna()
for column in heart.columns:
if column != 'location':
heart[column] = pd.to_numeric(heart[column])
print("\nLocation Count After DropNA:\n{}".format(heart['location'].value_counts()))
heart = heart.loc[heart['trestbps'] != 0]
heart = heart.loc[heart['chol'] != 0]
Check again for Missing Data
msno.matrix(heart,color=(.3, .3, .5))
plt.show()
180 rows were lost from data cleaning, 3 variables were removed, and string object columns were converted to floats.
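A quick sanity check of the cleanup, assuming heartdf (pre-clean) and heart (post-clean) are both still in scope:
# verify the row loss and the resulting column dtypes
print("Rows removed: {}".format(heartdf.shape[0] - heart.shape[0]))
print(heart.dtypes)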
Data Summary:
# reporting of data
display(heart.head(3))
print(heart.shape)
heart.info()
A. Distribution and scatter plots (univariate and bivariate methods)
Before cleaning data
# per-location KDE distributions for selected features
featurelist = ['age','trestbps','chol','thalach','oldpeak','exang','num']
locations = ['cleveland', 'hungarian', 'switzerland', 'va']
palette = ['r','b','g','y']
i = 1
fig = plt.figure(figsize=(15,10))
for column in featurelist:
    plt.subplot(4, 2, i)
    i = i + 1
    for idx, location in enumerate(locations):
        # coerce object columns to numeric and drop nulls so kdeplot works on the pre-clean data
        values = pd.to_numeric(heartdf.loc[heartdf['location'] == location, column], errors='coerce').dropna()
        sns.kdeplot(values, color=palette[idx])
    plt.title(column)
    plt.legend(locations)
plt.show()
After cleaning data
# Make the scatterplot matrix
featurelist = ['age','trestbps','chol','thalach','oldpeak']
g = sns.PairGrid(data=heart, diag_sharey=False, hue="location", vars=featurelist, palette=['r','b','g','y'])
g.map_offdiag(plt.scatter, alpha=.5)
g.map_diag(sns.kdeplot, lw=3)
g.add_legend()
plt.show()
A. Mood’s Median test (2+ Non-Normally Distributed Independent Samples)
Null Hypothesis: Assumes no statistically significant difference between samples.
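The split between Mood's median test here and the ANOVA in section B rests on whether each sample looks normally distributed. A quick sketch of that check with scipy's D'Agostino-Pearson normaltest; the feature and location lists below simply mirror the tests that follow:
# normality check for chol and oldpeak per location
from scipy.stats import normaltest
for col in ['chol', 'oldpeak']:
    for loc in ['hungarian', 'va', 'cleveland']:
        stat, p = normaltest(heart.loc[heart['location'] == loc, col])
        print("{} / {}: p = {:.4f}".format(col, loc, p))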
# Mood's median test on chol across three locations
stat, p, med, tbl = median_test(heart.loc[heart['location'] == 'hungarian', 'chol'],
                                heart.loc[heart['location'] == 'va', 'chol'],
                                heart.loc[heart['location'] == 'cleveland', 'chol'])
print("Statistic: {}, Grand median: {}".format(stat, med))
print("p-value: {}".format(p))
The p-value is not below 0.05, so we fail to reject the null hypothesis; there may be no statistically significant difference between the Hungarian, VA, and Cleveland cholesterol samples.
# Mood's median test on oldpeak across three locations
stat, p, med, tbl = median_test(heart.loc[heart['location'] == 'hungarian', 'oldpeak'],
                                heart.loc[heart['location'] == 'va', 'oldpeak'],
                                heart.loc[heart['location'] == 'cleveland', 'oldpeak'])
print("Statistic: {}, Grand median: {}".format(stat, med))
print("p-value: {}".format(p))
The p-value is below 0.05, so we reject the null hypothesis; the differences between the Hungarian, VA, and Cleveland oldpeak samples are statistically significant.
B. One-Way ANOVA Test (2+ Normally Distributed Independent Samples)
Null Hypothesis: Assumes no statistically significant difference between samples.
# one-way ANOVA on chol across three locations
f, p = f_oneway(heart.loc[heart['location'] == 'hungarian', 'chol'],
                heart.loc[heart['location'] == 'va', 'chol'],
                heart.loc[heart['location'] == 'cleveland', 'chol'])
print("F-Value: {}, p: {}".format(f, p))
The p-value is not below 0.05, so we fail to reject the null hypothesis; there may be no statistically significant difference between the Hungarian, VA, and Cleveland cholesterol samples.
# one-way ANOVA on oldpeak across three locations
f, p = f_oneway(heart.loc[heart['location'] == 'hungarian', 'oldpeak'],
                heart.loc[heart['location'] == 'va', 'oldpeak'],
                heart.loc[heart['location'] == 'cleveland', 'oldpeak'])
print("F-Value: {}, p: {}".format(f, p))
The p-value is below 0.05, so we reject the null hypothesis; the differences between the Hungarian, VA, and Cleveland oldpeak samples are statistically significant.
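Since these omnibus tests only indicate that at least one location differs, pairwise follow-ups can show which pairs drive the oldpeak result. A sketch using the mannwhitneyu import from above, with a Bonferroni-adjusted threshold of 0.05 / 3 comparisons:
# post-hoc pairwise Mann-Whitney U tests on oldpeak (Bonferroni-corrected)
from itertools import combinations
for a, b in combinations(['hungarian', 'va', 'cleveland'], 2):
    u, p = mannwhitneyu(heart.loc[heart['location'] == a, 'oldpeak'],
                        heart.loc[heart['location'] == b, 'oldpeak'],
                        alternative='two-sided')
    print("{} vs {}: p = {:.4f}, significant: {}".format(a, b, p, p < 0.05 / 3))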