The goals of these exercises are:
Please add results/information/plots/key points to your Google Slides show as you work through the exercises, so we can all discuss the findings afterwards: https://docs.google.com/presentation/d/1McihIS04U21zAQ3nDzRGSjmn9OhoXappB6tpSj6HNRw/edit?usp=sharing
Open the project called “MissingData” in the (unzipped) course day folder.
Load the dataset using the following lines of code:
load("./data/alcodata1.rda")
Ensure that you have loaded the data correctly - and have a look at them - by using a few summary functions. Here are some commands you may try:
#Look at the first observations in the data
head(alcodata1)
#Look at the full dataset
View(alcodata1)
#Look at summarizing statistics for each variable in the dataset
summary(alcodata1)
Add a new variable to your dataset that indicates whether each observation is complete or not:
alcodata1$completecase <- "Non-complete"
alcodata1$completecase[complete.cases(alcodata1)] <- "Complete"
Make sure you understand what this variable contains. Remember
that you can use View()
or head()
to inspect
the data.
Make distribution plots for each variable, stratified by
completecase
(barplots for categorical variables,
histograms for numerical variables)
#Code example: Plot gender stratified by "completecase" status (two methods)
#Method 1: Using the ggplot2 package:
library(ggplot2)
#Categorical variable
ggplot(alcodata1, aes(x = gender)) +
geom_bar() +
facet_wrap( ~ completecase)
#Numerical variable:
ggplot(alcodata1, aes(x = ageBeyond60)) +
geom_histogram() +
facet_wrap( ~ completecase)
#Method 2: Using the standard built-in plotting tools (use hist() for numerical variables):
#Categorical variable
par(mfrow = c(1,2)) #change plot window to 1 x 2
plot(alcodata1$gender[alcodata1$completecase == "Non-complete"], main = "Non-complete")
plot(alcodata1$gender[alcodata1$completecase == "Complete"], main = "Complete")
par(mfrow = c(1,1)) #change plot window back to 1 x 1
#Numerical variable
par(mfrow = c(1,2)) #change plot window to 1 x 2
hist(alcodata1$ageBeyond60[alcodata1$completecase == "Non-complete"], main = "Non-complete")
hist(alcodata1$ageBeyond60[alcodata1$completecase == "Complete"], main = "Complete")
par(mfrow = c(1,1)) #change plot window back to 1 x 1
We will now look more directly at the missing information and try to inspect the patterns in which it occurs.
Try some of the functions below and discuss what each plot shows and what it tells you about the data and the missing information.
Try replacing “gender” and “country” in the stratified plots with other variables that you find interesting.
#Missing information visualization using the naniar package:
library(naniar)
gg_miss_var(alcodata1)
gg_miss_var(alcodata1, facet = country, show_pct = TRUE)
vis_miss(alcodata1)
gg_miss_upset(alcodata1)
gg_miss_case(alcodata1)
gg_miss_fct(alcodata1, gender)
#Missing information visualization using the mice package:
library(mice)
md.pattern(alcodata1, rotate.names = TRUE)
We will now investigate whether some variables are predictive of missingness status. Note that there is a code example after the exercises you can use.
For each variable with any missing information, repeat the modeling
from 4.a, but replace the outcome with an indicator that shows only if
there is missing information in that variable, e.g. if
education
is missing.
#overall non-response analysis:
model1 <- glm(completecase == "Non-complete" ~ country + gender + partner + prevTreat,
data = alcodata1, family = "binomial")
summary(model1)
drop1(model1, test = "LRT")
#variable specific non-response analysis example: drinks
model2 <- glm(is.na(drinks) ~ country + gender + partner + prevTreat,
data = alcodata1, family = "binomial")
summary(model2)
drop1(model2, test = "LRT")