Overview

The goals of these exercises are:

  1. To get an overview of how much information is missing in the data, and in which variables
  2. To form first ideas of whether the missing information is problematic for the analyses we will perform later
  3. To investigate whether there are systematic relationships between information being missing and some of the observed variables

Please add results, plots, and key points to your Google Slides presentation as you work through the exercises, so that we can discuss the findings together afterwards: https://docs.google.com/presentation/d/1McihIS04U21zAQ3nDzRGSjmn9OhoXappB6tpSj6HNRw/edit?usp=sharing

1. Load data and project

2. Compare distributions for complete cases and cases with missing information
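
One possible way to set up such a comparison is sketched below. It is only an illustration: the data frame toy and all its values are simulated stand-ins, not the course data, and the choice of gender as the variable to compare is an assumption.

```r
# Simulated stand-in data (hypothetical, not the course data set)
set.seed(1)
n <- 200
toy <- data.frame(
  gender    = factor(sample(c("Female", "Male"), n, replace = TRUE)),
  prevTreat = factor(sample(c("No", "Yes"), n, replace = TRUE)),
  drinks    = rpois(n, lambda = 10)
)
# Set 40 randomly chosen values of drinks to missing
toy$drinks[sample(n, 40)] <- NA

# Flag each row as complete or not (complete = no missing values in any variable)
toy$completecase <- factor(ifelse(complete.cases(toy), "Complete", "Non-complete"))

# Distribution of a fully observed variable within each group (row proportions)
tab <- table(toy$completecase, toy$gender)
prop.table(tab, margin = 1)
```

If the row proportions differ clearly between the two groups, the complete cases may not be representative of the full sample with respect to that variable.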

3. Inspect patterns of missing information

We will now look more directly at the missing information and try to inspect the patterns in which it occurs.
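
As a minimal sketch of what such an inspection could look like in base R, the code below counts missing values per variable and tabulates the combinations in which they occur. The data frame toy is simulated for illustration only; the variable names are borrowed from these exercises, but the values and missingness are made up.

```r
# Simulated stand-in data (hypothetical, not the course data set)
set.seed(2)
n <- 100
toy <- data.frame(
  gender    = sample(c("Female", "Male"), n, replace = TRUE),
  drinks    = rpois(n, 10),
  education = sample(c("Low", "Medium", "High"), n, replace = TRUE)
)
toy$drinks[sample(n, 20)] <- NA
toy$education[sample(n, 15)] <- NA

# Number and proportion of missing values per variable
n_missing    <- colSums(is.na(toy))
prop_missing <- colMeans(is.na(toy))
n_missing
prop_missing

# Missingness patterns: one 0/1 string per row (1 = missing),
# with one digit per variable in column order
pattern <- apply(is.na(toy), 1, function(r) paste(as.integer(r), collapse = ""))
pattern_counts <- sort(table(pattern), decreasing = TRUE)
pattern_counts
```

Each pattern string has one digit per column, so "010" here would mean that only drinks is missing in those rows.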

4. Non-response analysis

We will now investigate whether some variables are predictive of missingness status. A code example that you can adapt follows the exercises.

4.a. Overall assessment of non-response

  • Fit a logistic regression model with:
    • Outcome: An indicator (0/1) of whether the observation is complete in all variables.
    • Predictors: All four completely observed variables.
  • Look at the parameter estimates. Do you think that the observations are missing completely at random?
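
The steps in 4.a can be sketched on simulated data as follows. This is an illustration under stated assumptions, not the course solution: the data frame toy, the indicator name complete, and the missingness mechanism (drinks missing more often for men) are all made up for the example.

```r
# Simulated stand-in data (hypothetical, not the course data set)
set.seed(3)
n <- 500
toy <- data.frame(
  gender    = factor(sample(c("Female", "Male"), n, replace = TRUE)),
  prevTreat = factor(sample(c("No", "Yes"), n, replace = TRUE))
)
toy$drinks <- rpois(n, 10)

# Assumed mechanism: drinks is missing with probability 0.4 for men, 0.1 for women
p_miss <- ifelse(toy$gender == "Male", 0.4, 0.1)
toy$drinks[rbinom(n, 1, p_miss) == 1] <- NA

# 0/1 indicator: 1 = observation is complete in all variables
toy$complete <- as.integer(complete.cases(toy))

# Logistic regression of completeness on the fully observed variables
fit_full <- glm(complete ~ gender + prevTreat, data = toy, family = binomial)
fit_null <- glm(complete ~ 1, data = toy, family = binomial)

summary(fit_full)$coefficients
# Joint likelihood ratio test of all predictors against the intercept-only model
anova(fit_null, fit_full, test = "LRT")
```

Under missing completely at random, none of the observed variables should predict completeness, so clearly non-zero estimates or a small p-value in the joint test speak against that assumption.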

4.b. Variable-specific assessment of non-response

For each variable with any missing information, repeat the modeling from 4.a, but replace the outcome with an indicator of whether that particular variable is missing, e.g. whether education is missing.

Code example:

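# Overall non-response analysis.
# Note: the outcome here is TRUE for non-complete cases, i.e. the reverse of the
# indicator described in 4.a; this only flips the sign of the estimates.
model1 <- glm(completecase == "Non-complete" ~ country + gender + partner + prevTreat,
              data = alcodata1, family = "binomial")
summary(model1)              # parameter estimates with Wald tests
drop1(model1, test = "LRT")  # likelihood ratio test for dropping each predictor

# Variable-specific non-response analysis, example: drinks.
# is.na(drinks) is TRUE when drinks is missing.
model2 <- glm(is.na(drinks) ~ country + gender + partner + prevTreat,
              data = alcodata1, family = "binomial")
summary(model2)
drop1(model2, test = "LRT")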
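
To avoid copying the model once per variable in 4.b, the fits could be produced in a loop. The sketch below is one way to do this, again on simulated data: toy, predictors, vars_with_na and fits are hypothetical names introduced for illustration, and the predictor set is shortened to two variables.

```r
# Simulated stand-in data (hypothetical, not the course data set)
set.seed(4)
n <- 300
toy <- data.frame(
  gender    = factor(sample(c("Female", "Male"), n, replace = TRUE)),
  partner   = factor(sample(c("No", "Yes"), n, replace = TRUE)),
  drinks    = rpois(n, 10),
  education = sample(c("Low", "Medium", "High"), n, replace = TRUE)
)
toy$drinks[sample(n, 60)] <- NA
toy$education[sample(n, 45)] <- NA

# Fully observed predictors and the variables that have any missing values
predictors   <- c("gender", "partner")
vars_with_na <- names(toy)[colSums(is.na(toy)) > 0]

# One logistic regression per incomplete variable, with outcome is.na(<variable>)
fits <- lapply(vars_with_na, function(v) {
  f <- as.formula(paste0("is.na(", v, ") ~ ", paste(predictors, collapse = " + ")))
  glm(f, data = toy, family = binomial)
})
names(fits) <- vars_with_na

# Likelihood ratio tests for each predictor, per incomplete variable
lapply(fits, drop1, test = "LRT")
```

Building the formula from strings keeps the predictor set in one place, so adding or removing a predictor changes all models at once.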