Capstone #2: Unit 1 Narrative Analytics and Experimentation


Heart Disease Hospital Location Classification

Author: Ryan Harper



Example visualization of heart disease classification (stages 1-4).

[Image: The Stages of Heart Failure - http://www.heartfailure.org/heart-failure/the-stages-of-heart-failure/index.html]

Overview

This data on health and heart disease was made available by UC Irvine's Machine Learning Repository (http://archive.ics.uci.edu/ml/index.php). It includes categorical variables (i.e. attributes with roughly 2-4 possible values) and continuous variables (blood pressure, cholesterol, and age). The aim of the research team was to use the 13 features (from age to thal) to predict the recorded heart disease diagnosis (num).

Variables (health indicators):

age: age in years
sex: sex (1 = male; 0 = female)
cp: chest pain type (1 = typical angina; values 1-4)
trestbps: resting blood pressure (in mm Hg)
chol: serum cholesterol (in mg/dl)
fbs: fasting blood sugar > 120 mg/dl (1 = true; 0 = false)
restecg: resting electrocardiographic results (0 = normal) (0-2)
thalach: maximum heart rate achieved
exang: exercise induced angina (1 = yes; 0 = no)
oldpeak: ST depression induced by exercise relative to rest
slope: the slope of the peak exercise ST segment (1 = upsloping, 2 = flat, 3 = downsloping)
ca: number of major vessels (0-3)
thal: 3 = normal; 6 = fixed defect; 7 = reversible defect
num: diagnosis of heart disease (angiographic disease status) (0: no presence, 1-4: increasing classification of heart disease severity)

Observations:

Switzerland had a high percentage of null values in certain columns. This suggests that the different locations might also have different data collection tools and processes.

chol in Virginia had a non-normal distribution, but chol in the other locations appeared to be normally distributed.

thalach had similarly shaped distributions across locations, but the mean differed noticeably from location to location. This could indicate that patients in one location (e.g. Cleveland) have a higher central tendency for thalach than patients in Switzerland or Hungary.

There is a statistically significant difference in oldpeak between the Cleveland, VA, and Hungarian samples.
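These per-location differences can be summarized numerically with a quick groupby. A minimal sketch (assuming the cleaned heart DataFrame constructed in Part 1 below):

# per-location central tendency and spread for a few features
summary = (heart.groupby('location')[['chol', 'thalach', 'oldpeak']]
           .agg(['mean', 'median', 'std']))
print(summary)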

Experiment

Hypothesis:

Patients can be classified by hospital location using 14 physical health measurements.

Thoughts:

Data for this machine learning classification project was collected from four different hospitals (two in the US and two in Europe). My assumption is that diet, weather, and exercise habits vary between the populations of patients that attend each of the four hospitals. Patients from a community in Switzerland may have a very different min/max/mean range for the 14 health indicators than patients from Cleveland, USA. I predict that there is variation in the data samples' distributions by hospital location and that hospital location can be determined with the use of a data classification model.

There is a total of 920 samples: Cleveland (303), Virginia (200), Switzerland (123), and Hungary (294). Because of the large number of health indicators collected per sample, I believe it is possible that hospital location can be predicted from the 14 health features in this data set.

There is a high number of predictor variables (physical health measurements) relative to the small number of samples, which could result in overfitting.

Because of the relatively low number of samples, this research project is exploratory, with the purpose of providing researchers new data to work with and new ways of considering the relationship between sample location and physical health measurements.

Method:

I will use either a naive Bayes classifier or a KNN classifier to determine the hospital location from the 14 physical health measurements. For preprocessing, I will check each health feature to determine whether the samples vary by location.
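As a rough sketch of this plan (not part of the original analysis), the two candidate classifiers could be compared with scikit-learn on the cleaned heart DataFrame built in Part 1, using only the features that survive cleaning:

# Minimal sketch, assuming the cleaned `heart` DataFrame from Part 1 below
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB
from sklearn.neighbors import KNeighborsClassifier

features = ['age', 'sex', 'cp', 'trestbps', 'chol', 'fbs',
            'restecg', 'thalach', 'exang', 'oldpeak']
X = heart[features]
y = heart['location']

# hold out 25% of samples for a rough accuracy estimate
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=42, stratify=y)

for name, model in [('Naive Bayes', GaussianNB()),
                    ('KNN (k=5)', KNeighborsClassifier(n_neighbors=5))]:
    model.fit(X_train, y_train)
    print("{}: {:.3f}".format(name, model.score(X_test, y_test)))

Since KNN is distance-based, the continuous features would likely need scaling (e.g. StandardScaler) before the two models can be compared fairly.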


Part 1: Cleaning Data

In [3]:
# import libraries
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import calendar
import seaborn as sns
from matplotlib.pyplot import subplots, show
import ast
import math
import re
from scipy.stats import ttest_ind, mannwhitneyu, median_test, f_oneway
import missingno as msno
from IPython.display import display
from IPython.core.debugger import Tracer

%matplotlib inline
In [4]:
# change optional settings for libraries
sns.set_style("whitegrid")
pd.set_option('show_dimensions', False)
import warnings
warnings.filterwarnings('ignore')
In [5]:
filename = ['processed.cleveland.data','processed.hungarian.data','processed.switzerland.data','processed.va.data']
colnames=['age','sex','cp','trestbps','chol','fbs','restecg','thalach','exang','oldpeak','slope','ca','thal','num']
In [6]:
heartdisease = []

for path in filename:
    tempdf = pd.read_csv('../data/heart/' + path, names=colnames)
    tempname = re.findall(r'\.([a-z]*)\.',path)[0]
    # adds location of data to the dataframe
    tempdf['location'] = tempname
    heartdisease.append(tempdf)

heartdf = pd.concat(heartdisease)
In [7]:
# count samples without heart disease (num == 0) per location
print(heartdf['location'].loc[heartdf['num'] == 0].value_counts())

# treat location as a categorical variable
heartdf['location'] = pd.Categorical(heartdf['location'])

Missing Data

In [8]:
# check for missing data
heartdf = heartdf.replace('?',np.nan)

msno.matrix(heartdf,color=(.3, .3, .5))
plt.show()

slope, ca, and thal appear to have a large number of missing values. trestbps, thalach, exang, and oldpeak appear to have a smaller percentage of missing values.
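To quantify what the matrix shows, the percentage of missing values per column can be computed for each location (a quick check on the same heartdf, not part of the original notebook):

# percentage of null values per column, broken out by location
null_pct = heartdf.set_index('location').isnull().groupby(level='location').mean() * 100
print(null_pct.round(1))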

Delete variables with a large number of null values

In [9]:
del heartdf['slope']
del heartdf['ca']
del heartdf['thal']

Drop rows with null values from the dataframe and convert strings to floats

In [10]:
print("Location Count Before DropNA:\n{}".format(heartdf['location'].value_counts()))

heart = heartdf.loc[heartdf['chol'] != 0].dropna()

for column in heart.columns:
    if column != 'location':
        heart[column] = pd.to_numeric(heart[column])
        
print("\nLocation Count After DropNA:\n{}".format(heart['location'].value_counts()))

heart = heart.loc[heart['trestbps'] != 0]
heart = heart.loc[heart['chol'] != 0]
Location Count Before DropNA:
cleveland      303
hungarian      294
va             200
switzerland    123
Name: location, dtype: int64

Location Count After DropNA:
cleveland      303
hungarian      261
va             130
switzerland      0
Name: location, dtype: int64

Check again for Missing Data

In [11]:
msno.matrix(heart,color=(.3, .3, .5))
plt.show()

259 rows were lost during data cleaning (including every Switzerland sample), 3 variables were removed, and the string object columns were converted to floats.

Data Summary:

In [12]:
# reporting of data
display(heart.head(3),heart.shape,heart.info())
<class 'pandas.core.frame.DataFrame'>
Int64Index: 661 entries, 0 to 199
Data columns (total 12 columns):
age         661 non-null float64
sex         661 non-null float64
cp          661 non-null float64
trestbps    661 non-null float64
chol        661 non-null float64
fbs         661 non-null float64
restecg     661 non-null float64
thalach     661 non-null float64
exang       661 non-null float64
oldpeak     661 non-null float64
num         661 non-null int64
location    661 non-null category
dtypes: category(1), float64(10), int64(1)
memory usage: 62.8 KB
age sex cp trestbps chol fbs restecg thalach exang oldpeak num location
0 63.0 1.0 1.0 145.0 233.0 1.0 2.0 150.0 0.0 2.3 0 cleveland
1 67.0 1.0 4.0 160.0 286.0 0.0 2.0 108.0 1.0 1.5 2 cleveland
2 67.0 1.0 4.0 120.0 229.0 0.0 2.0 129.0 1.0 2.6 1 cleveland
(661, 12)
None

Part 2: Exploring the Data

A. Univariate and bivariate visualizations (KDE plots and scatterplot matrix)

Before cleaning data

In [24]:
# Plot per-location KDE distributions for each feature
featurelist = ['age','trestbps','chol','thalach','oldpeak','exang','num']
locations = ['cleveland', 'hungarian', 'switzerland', 'va']
palette = ['r','b','g','y']

i = 1
fig=plt.figure(figsize=(15,10))

for column in featurelist:
    plt.subplot(4, 2, i)
    i = i + 1
    for idx, location in enumerate(locations):
        sns.kdeplot(heartdf[column].loc[heartdf['location'] == location], color = palette[idx])
    plt.title(column)
    plt.legend(locations)

plt.show()

After cleaning data

In [26]:
# Make the scatterplot matrix
featurelist = ['age','trestbps','chol','thalach','oldpeak']


g = sns.PairGrid(data=heart, diag_sharey=False, hue="location", vars=featurelist, palette=['r','b','g','y'])
g.map_offdiag(plt.scatter, alpha=.5)
g.map_diag(sns.kdeplot, lw=3)

g.add_legend()

plt.show()

Part 3: Statistical Significance

A. Mood’s Median test (2+ Non-Normally Distributed Independent Samples)

Null Hypothesis: The samples are drawn from populations with the same median.

In [19]:
stat, p, med, tbl = median_test(
    heart['chol'].loc[heart['location'] == 'hungarian'],
    heart['chol'].loc[heart['location'] == 'va'],
    heart['chol'].loc[heart['location'] == 'cleveland'])
print(stat,med)
print(p)
2.762559861123221 240.0
0.2512567559639496

The p-value is not less than 5%, so we fail to reject the null hypothesis. There may not be a statistically significant difference in cholesterol between the Hungarian, VA, and Cleveland samples.

In [21]:
stat, p, med, tbl = median_test(
    heart['oldpeak'].loc[heart['location'] == 'hungarian'],
    heart['oldpeak'].loc[heart['location'] == 'va'],
    heart['oldpeak'].loc[heart['location'] == 'cleveland'])
print(stat,med)
print(p)
31.283575034307557 0.5
1.6101190716440596e-07

The p-value is less than 5%, so the null hypothesis is rejected. The differences in oldpeak between the Hungarian, VA, and Cleveland samples are statistically significant.
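The choice between Mood's median test and one-way ANOVA rests on whether each sample looks normally distributed. A minimal sketch of such a check (scipy's normaltest is an addition here; it is not used elsewhere in this notebook):

from scipy.stats import normaltest

# D'Agostino-Pearson normality test per feature and location;
# small p-values suggest the sample is not normally distributed
for column in ['chol', 'oldpeak']:
    for location in ['hungarian', 'va', 'cleveland']:
        sample = heart[column].loc[heart['location'] == location]
        stat, p = normaltest(sample)
        print("{} / {}: p = {:.4f}".format(column, location, p))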

B. One-Way ANOVA Test (2+ Normally Distributed Independent Samples)

Null Hypothesis: The samples are drawn from populations with the same mean.

In [20]:
f, p = f_oneway(
    heart['chol'].loc[heart['location'] == 'hungarian'],
    heart['chol'].loc[heart['location'] == 'va'],
    heart['chol'].loc[heart['location'] == 'cleveland'])

print("F-Value: {}, p:{}".format(f,p))
F-Value: 0.9757825070704461, p:0.377442015599753

The p-value is not less than 5%, so we fail to reject the null hypothesis. There may not be a statistically significant difference in cholesterol between the Hungarian, VA, and Cleveland samples.

In [23]:
f, p = f_oneway(
    heart['oldpeak'].loc[heart['location'] == 'hungarian'],
    heart['oldpeak'].loc[heart['location'] == 'va'],
    heart['oldpeak'].loc[heart['location'] == 'cleveland'])

print("F-Value: {}, p:{}".format(f,p))
F-Value: 18.519691973497217, p:1.496999637910008e-08

The p-value is less than 5%, so the null hypothesis is rejected. The differences in oldpeak between the Hungarian, VA, and Cleveland samples are statistically significant.
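Since both overall tests flag oldpeak, a pairwise follow-up could show which location pairs drive the difference. A minimal sketch using mannwhitneyu (already imported in the first cell but not used above); a multiple-comparison correction such as Bonferroni would be appropriate across the three pairwise tests:

from itertools import combinations

# pairwise Mann-Whitney U tests on oldpeak between location pairs
for loc_a, loc_b in combinations(['hungarian', 'va', 'cleveland'], 2):
    u, p = mannwhitneyu(heart['oldpeak'].loc[heart['location'] == loc_a],
                        heart['oldpeak'].loc[heart['location'] == loc_b],
                        alternative='two-sided')
    print("{} vs {}: p = {:.6f}".format(loc_a, loc_b, p))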

