The goals of these exercises are:
Please add results/information/plots/key points to your Google Slides show as you work through the exercises, so we can all discuss the findings afterwards.
alcodata1: https://docs.google.com/presentation/d/1XsBlKE4_zFnFEwrBCvM8yLc_oPrURFAoD6rXv6pqJMo/edit?usp=sharing
alcodata2: https://docs.google.com/presentation/d/19Z40NJioCzrx71m9UcM31TnJyDIvlUgjqF2H3JQWj00/edit?usp=sharing
Note: All code examples below are written for alcodata1. If you were assigned alcodata2, please remember to change the code accordingly.
#For dataset 1:
load("./data/alcodata1.rda")
#For dataset 2:
load("./data/alcodata2.rda")
Inspect the data (here for alcodata1):
#Look at the first observations in the data
head(alcodata1)
#Look at the full dataset
View(alcodata1)
#Look at summarizing statistics for each variable in the dataset
summary(alcodata1)
alcodata1$completecase <- "Non-complete"
alcodata1$completecase[complete.cases(alcodata1)] <- "Complete"
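To check that the new variable behaves as intended, it can help to tabulate it. The following is a minimal sketch on a small made-up data frame (`toy` and its values are hypothetical and not part of the course data); the same `table()` call should work on `alcodata1$completecase`.

```r
# Toy data frame (hypothetical values, not the course data)
toy <- data.frame(
  gender    = factor(c("F", "M", "M", "F", "M")),
  age       = c(34, NA, 51, 47, 29),
  education = factor(c("High", "Low", NA, "Low", "High"))
)

# Same two-step construction as used for alcodata1
toy$completecase <- "Non-complete"
toy$completecase[complete.cases(toy)] <- "Complete"

# Counts and proportions of complete vs. non-complete rows
counts <- table(toy$completecase)
print(counts)
print(prop.table(counts))
```

A row counts as "Complete" only if none of its variables are missing, so a single NA is enough to make it "Non-complete".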
Use View() or head() to inspect the data.
Plot each variable stratified by completecase (barplots for categorical variables, histograms for numerical variables).
#Code example: Plot gender stratified by "completecase" status (two methods)
#Method 1: Using the ggplot2 package:
library(ggplot2)
qplot(gender, facets = ~ completecase, data = alcodata1)
#Method 2: Using the standard built-in plotting tools (use hist() for numerical variables):
par(mfrow = c(1,2)) #change plot window to 1 x 2
plot(alcodata1$gender[alcodata1$completecase == "Non-complete"], main = "Non-complete")
plot(alcodata1$gender[alcodata1$completecase == "Complete"], main = "Complete")
par(mfrow = c(1,1)) #change plot window back to 1 x 1
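For a numerical variable, Method 2 can be adapted by swapping plot() for hist(). The sketch below uses simulated data; the data frame `toy` and the variable `age` are assumptions for illustration and may not match the variable names in alcodata1. Using the same breaks in both panels makes the two histograms easier to compare.

```r
# Simulated data (hypothetical; replace with alcodata1 and one of its numerical variables)
set.seed(1)
toy <- data.frame(
  age = round(rnorm(200, mean = 45, sd = 10)),
  completecase = sample(c("Complete", "Non-complete"), 200,
                        replace = TRUE, prob = c(0.7, 0.3))
)

# Common breaks covering the full range, so both panels share the same bins
brks <- pretty(range(toy$age), n = 15)

par(mfrow = c(1, 2)) # change plot window to 1 x 2
h_non <- hist(toy$age[toy$completecase == "Non-complete"], breaks = brks,
              main = "Non-complete", xlab = "age")
h_com <- hist(toy$age[toy$completecase == "Complete"], breaks = brks,
              main = "Complete", xlab = "age")
par(mfrow = c(1, 1)) # change plot window back to 1 x 1
```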
We will now look more directly at the missing information and inspect the patterns in which it occurs.
#Missing information visualization using the naniar package:
library(naniar)
gg_miss_var(alcodata1)
gg_miss_var(alcodata1, facet = country, show_pct = TRUE)
vis_miss(alcodata1)
gg_miss_upset(alcodata1)
gg_miss_case(alcodata1)
gg_miss_fct(alcodata1, gender)
#Missing information visualization using the mice package:
library(mice)
md.pattern(alcodata1, rotate.names = TRUE)
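The numbers behind these plots can also be computed with base R, which may be useful when adding key figures to the slides. This is a rough sketch on a made-up data frame (`toy` and its values are hypothetical); it counts missing values per variable and tabulates the missingness patterns, coding 1 for observed and 0 for missing as md.pattern() does.

```r
# Toy data frame (hypothetical values, not the course data)
toy <- data.frame(
  gender    = factor(c("F", "M", "M", "F", "M", "F")),
  education = factor(c("High", NA, NA, "Low", "High", NA)),
  income    = c(30, 42, NA, 25, NA, NA)
)

# Number and percentage of missing values per variable
n_miss   <- colSums(is.na(toy))
pct_miss <- 100 * colMeans(is.na(toy))
print(data.frame(n_miss, pct_miss))

# Missingness pattern per row: 1 = observed, 0 = missing,
# one digit per variable in column order
pattern <- apply(!is.na(toy), 1, function(r) paste(as.integer(r), collapse = ""))

# How often each pattern occurs, most frequent first
pattern_counts <- sort(table(pattern), decreasing = TRUE)
print(pattern_counts)
```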
We will now investigate whether some variables are predictive of missingness status. Note that there are code examples after the exercises you can use.
For each variable with any missing information, repeat the modeling from 4.a, but replace the outcome with an indicator of whether that particular variable is missing, e.g. whether education is missing.
#overall non-response analysis:
model1 <- glm(completecase == "Non-complete" ~ country + gender + partner + prevTreat,
data = alcodata1, family = "binomial")
#variable-specific non-response analysis:
model2 <- glm(is.na(education) ~ country + gender + partner + prevTreat,
data = alcodata1, family = "binomial")
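One possible way to read such a model is to look at the coefficient table and the odds ratios. The sketch below fits the variable-specific model on simulated data, where the missingness in education is made more likely for men by construction; the data frame `toy`, the effect sizes, and the choice of Wald intervals via confint.default() are all assumptions for illustration, not results from alcodata1.

```r
# Simulated data (hypothetical; not the course data)
set.seed(42)
n <- 500
toy <- data.frame(
  gender  = factor(sample(c("F", "M"), n, replace = TRUE)),
  partner = factor(sample(c("No", "Yes"), n, replace = TRUE))
)

# Assumed missingness mechanism: education is more often missing for men
p_miss <- plogis(-1.5 + 1.0 * (toy$gender == "M"))
toy$education <- factor(sample(c("Low", "High"), n, replace = TRUE))
toy$education[runif(n) < p_miss] <- NA

# Variable-specific non-response model, as in model2
fit <- glm(is.na(education) ~ gender + partner,
           data = toy, family = "binomial")

# Coefficients on the log-odds scale with standard errors and p-values
print(summary(fit)$coefficients)

# Odds ratios with Wald 95% confidence intervals
or_table <- exp(cbind(OR = coef(fit), confint.default(fit)))
print(round(or_table, 2))
```

An odds ratio clearly above 1 for a predictor suggests that missingness in the variable is more common in that group, which is evidence against the data being missing completely at random.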