Author: Ryan Harper
Dataset: Behavioral Risk Factor Surveillance System Data
Source: https://www.cdc.gov/brfss/
Summary: The data set is a collection of responses from ~400,000 questionnaires across the United States on health related topics. BRFSS started in the 1980s and has been used to monitor pervasive illnesses in the US through questionnaires. The research is a standardized questionairre with additional questions varying by state. Interviewers reach partipant in house and by telephone. The telephone surveys are conducted using “Random Digit Dialing (RDD) techniques on both landlines and cell phones” so much of the data appears to be from random sampling. The research is retroactive and not a designed experiment so while correlation can be inferred, causality cannot.
The features in the data set are both continuous and categorical. Some features are missing values so data preprocessing is essential. Because the participants were selected randomly the sample can be generalized.
Goals: Explore any correlations among gender, weight, and age
library(ggplot2)
library(dplyr)
library(Rgraphviz)
library(knitr)
library(grid)
library(gridExtra)
load("brfss2013.RData")
# group and count a feature with discrete values
feature_vcounts <- function(df, f) {
df %>%
group_by_at(f) %>%
count()}
# method for binning values
bin_min_sample <- function(p) {
n = 10
a = 10/p
b = 10/(1-p)
max(c(a,b))}
# create a new df for simulating binom probability distribution
binom_prob_df <- function(df, f, target) {
new_df <- feature_vcounts(df,f)
new_df$n[new_df[f] == target]/sum(new_df$n)}
# filtering df with subgroup value
subgroup_df <- function(df,f, group) {
filter(df,df[f]==group)}
# calc the vector probability
binom_prob_vec <- function(v, target) {
sum(v == target)/length(v)}
# sample from df
binom_sample <- function(s,v)
sample(v, size=s, replace=TRUE)
# create the binomial sample distribution
binom_sample_dist <- function(df,f,target) {
sample_dist <- c()
for (i in 1:10001) {
prob <- binom_prob_vec(binom_sample(100,df[,f]),target)
sample_dist <- append(sample_dist,prob)}
return(sample_dist)}
# convert decimal to percent
to_percent <- function(pvalue) {
paste(round(pvalue*100,digits= 2),"%",sep="")}
Import and filter the data to only include important features relevant to diabetes, gender, weight, and age. The features are reduced 4 columns and then the data is cleaned to remove rows that have invalid data or data that has less pertinent information.
# Import original file:
orig_dim <- dim(brfss2013)
# Select only relevant features:
weight_diabetes <- brfss2013 %>%
select(sex, X_ageg5yr, weight2,diabete3)
# ------------------Cleaning data------------------
# 1.Weight strings -> numeric
weight_diabetes$weight2 <- as.numeric(as.character(weight_diabetes$weight2))
new_dim <- dim(weight_diabetes)
# 2. Remove Null Weights and Weights over 400
weight_diabetes <- na.omit(weight_diabetes)
weight_diabetes <- filter(weight_diabetes, weight2 <= 400)
# 3. Remove Diabetes Responses
target <- c("Yes", "No")
weight_diabetes <- filter(weight_diabetes, diabete3 %in% target)
# 4. Add index and reorder
weight_diabetes$index <- seq.int(nrow(weight_diabetes))
weight_diabetes <- weight_diabetes[c(4,3,1,2)]
clean_dim <- dim(weight_diabetes)
# Show data:
kable(head(weight_diabetes,n=5), caption="Diabetes Data Set",padding=0, format = "markdown",align="l")
diabete3 | weight2 | sex | X_ageg5yr |
---|---|---|---|
No | 250 | Female | Age 60 to 64 |
No | 127 | Female | Age 50 to 54 |
No | 160 | Female | Age 55 to 59 |
No | 128 | Female | Age 60 to 64 |
No | 265 | Male | Age 65 to 69 |
The data looks simplified and only contains the features wanted for this project. Because the data needs to be anonymized, age ranges are a safe alternative to specific ages. Age ranges will be used as categorial information for this data set.
Original dimensions:
[491775] x [330]
Reduced dimensions:
[491775] x [4]
After cleaning:
[453836] x [4]
Because gender is a key variable in biometrics, it is important to explore whether gender might correllate to other variables. In this case, we are looking at whether or not sex correllates with weight.
The goal of this exploratory project is to check if weight/gender/age correlates with diabetes. Knowledge of any correlation could be helpful with regards to informing patients of their likelihood of having diabetes based on their gender and weight.
To further explore possible correlations with diabetes, we will also look at the relationship between four variables.
Weight2 is continuous so it is important to first check the distribution of the data. Sex is binary categorical so we will visualize it’s distribution with a bar plot.
centered <- theme(plot.title = element_text(hjust = 0.5))
hist_weight <- ggplot(data=weight_diabetes,aes(weight2, fill=weight2))+
geom_histogram(fill='salmon',color='white') + ggtitle("Histogram [Weight]") + centered
weight_diabetes$log_weight <- log(weight_diabetes$weight2)
hist_log_weight <- ggplot(data=weight_diabetes,aes(log_weight, fill=log_weight))+
geom_histogram(fill='mediumturquoise',color='white') + ggtitle("Histogram [Log_Weight]") + centered
grid.arrange(hist_weight, hist_log_weight, ncol = 2)
For Weight2, the distribution is right skewed while the log of Weight2 is nearly normal. Because the lognormed version of the data is nearly normal and unimodal, the weight can be used for subsequent analyses in inferential statistics.
gender_count <- ggplot(data=weight_diabetes,aes(sex, fill = sex))+
geom_bar(color='white') + ggtitle("Gender") + centered
gender_count
There are more female participants than male participants by a margin much larger than the general population in the US. This could be an indicator that the sampling method was not entirely random with regards to gender sampling. Nonetheles, the data sample is large enough to continue evaluating health risk factors.
age_count <- ggplot(data=weight_diabetes,aes(X_ageg5yr)) +
geom_bar(fill='salmon',color='white') + ggtitle("Age") + centered
age_count
The age range appears to be left skewed with extremes at both tails. The large tails are most likely a result of binning as the youngest and oldest age ranges account for a larger range of ages.
ggplot(weight_diabetes, aes(x=X_ageg5yr,y=weight2, color=sex)) +
geom_boxplot()
When comparing age to weight, there does seem to be a clear difference in weigh distributions for genders. Male’s appear to weigh more than females.
It should also be noted that there also appear to be adult patients who weigh less than 50 lbs and many patients who weigh around 400 lbs. Future analysis of the data collection process should explore whether these low and high outliers are mistakes or if they reflect patients with serious health issues.
ggplot(weight_diabetes, aes(x = factor(sex), fill = factor(diabete3))) +
geom_bar(position="fill")
When looking at female and male participants in the sample, the ratio of reporting diabetes appears is very similar. It is likely that difference in gender does not correllate to reporting of diabetes.
ggplot(weight_diabetes, aes(x = factor(X_ageg5yr), fill = factor(diabete3))) +
geom_bar(position="fill")
As age increases, the ratio of reported diabetes also appears to increase until 80 years of age and older. It is likely that age has some degree of correllation with diabetes.
ggplot(weight_diabetes, aes(x = factor(weight2), fill = factor(diabete3))) +
geom_bar(position="fill") +
theme(axis.text.x=element_blank(),
axis.ticks.x=element_blank())
As weight increases, the ratio of reported diabetes increases. Weight appears to have a strong correllation with reported diabetes and should be explored further.
ggplot(weight_diabetes, aes(x=sex,y=weight2, color=diabete3)) +
geom_boxplot() +
scale_y_continuous()
There is a high weight distribution for both men and women who report having diabetes. Males have a heavier weight distribution than women.
ggplot(weight_diabetes, aes(x=X_ageg5yr,y=weight2, color=diabete3)) +
geom_boxplot() +
scale_y_continuous()
This collection of boxplots depicts a much clearer story. Patients who report having diabetes appear to be heavier in every age range. Younger patients who report having diabetes appear to have a larger weight range than older patients. While it is unclear how age correllates with diabetes and weight, this relationship should be further explored.
From initial exploration of the data, it is clear that some features have stronger correllations than others. Weight has some relationship with gender. Gender does not appear to correllate with weight. Diabetes does, however, seem to correllate with age and strongly correllates with weight.