The goals of these exercises are to practice handling missing information using the MICE package in R.

Add your results to the slides: https://docs.google.com/presentation/d/1x-2TsJyE_dfvNQhfGqwS1ofUof2h1NNshDbxFk9wHgs/edit?usp=sharing
In the following, we will fit a number of models. As today's purpose is to explore methods for handling missing information, you should not focus on model selection or model diagnostics. Rather, use a linear normal regression model with drinks as the outcome and additive effects of all other variables, i.e.:
\[\begin{align} \text{drinks} &= \alpha + \beta_1 \cdot \text{ageBeyond60} + \beta_2 \cdot 1(\text{country: DE}) + \beta_3 \cdot 1(\text{country: USA}) + \beta_4 \cdot 1(\text{dependence: intermediate}) + \\ &\quad \beta_5 \cdot 1(\text{dependence: severe}) + \beta_6 \cdot 1(\text{education: undergrad.}) + \beta_7 \cdot 1(\text{education: grad./post grad.}) + \\ &\quad \beta_8 \cdot 1(\text{gender: female}) + \beta_9 \cdot 1(\text{partner: TRUE}) + \beta_{10} \cdot 1(\text{prev. treat: 1-2}) + \beta_{11} \cdot 1(\text{prev. treat: 3+}) + \epsilon \end{align}\]
where \(\epsilon\) is a normally distributed error term. 1(*) are indicator functions (dummies) that are 1 if * is true and 0 otherwise. The intercept (\(\alpha\)) is the expected number of drinks for a person who is 60 years old, lives in Denmark, has dependence "low", education "no degree", gender "male", no partner and 0 previous treatments.
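To see how R produces such indicator variables, here is a toy illustration (made-up values, not the exercise data) of how a factor is expanded into 0/1 dummies, with the first level absorbed into the intercept:

```r
# Toy example: R's dummy coding of a factor via model.matrix()
gender <- factor(c("male", "female", "male"), levels = c("male", "female"))
mm <- model.matrix(~ gender)
mm
# The column "genderfemale" is 1 for females and 0 for males;
# the reference level "male" is captured by the intercept column,
# just like the reference levels (DK, low, no degree, ...) above.
```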
Work through the following steps:

1. Which observations would we be able to use in R if there were no missing information? Add a variable, completecase, to your dataset that marks these observations. (Hint: the function complete.cases() might be handy.)
2. Fit the model on the complete cases and save it as alco_ccmodel. (Hint: the default behavior of most R model functions, including lm(), is to remove all incomplete cases before fitting a model.)
3. Look at the results by calling the summary() function on your model.
4. Plot the estimates with the plotEstimates() function, sourced from the functions.R file in the project folder, by running the following lines of code:

source("./R/functions.R")
plotEstimates(`Complete cases` = alco_ccmodel)
This plot shows your estimates together with 95% confidence intervals.
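The complete.cases() hint above can be illustrated on a toy data frame (made-up values, not the exercise dataset):

```r
# Toy sketch: complete.cases() flags rows with no missing values
d <- data.frame(x = c(1, NA, 3), y = c("a", "b", NA))
complete.cases(d)        # TRUE FALSE FALSE - FALSE for any row with an NA
d[complete.cases(d), ]   # keeps only the fully observed rows
```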
We will now try to use the mice package for performing multiple imputation by chained equations (MICE). MICE consists of three steps:

1. Imputation: fill in the missing values several times, creating multiple completed datasets.
2. Analysis: fit the analysis model to each completed dataset.
3. Pooling: combine the results from the analyses into a single set of estimates.

We will first conduct an analysis using MICE with standard settings and then, in the next exercise, take a closer look at the imputation step (1).
To make everything a bit easier, remove the variable
completecase
from your dataset again:
alcodata1$completecase <- NULL
Run the code chunk below one line at a time.

#Step 1: impute the missing values (default settings)
alco_imp <- mice(alcodata1)
#Step 2: fit the regression model on each imputed dataset
alco_fit <- with(alco_imp,
                 lm(drinks ~ ageBeyond60 + prevTreat + country + gender + education + partner + dependence))
#Step 3: pool the results across the imputed datasets
alco_micemodel <- pool(alco_fit)
summary(alco_micemodel, conf.int = TRUE)
plotEstimates(`Complete cases` = alco_ccmodel, `MICE` = alco_micemodel)
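Behind the scenes, pool() combines the per-dataset fits using Rubin's rules: the pooled estimate is the average of the estimates, and the total variance is the average within-imputation variance plus the between-imputation variance inflated by (1 + 1/m). A minimal base-R sketch with made-up numbers (not output from the exercise data):

```r
# Rubin's rules for one coefficient, m = 3 imputed datasets (toy numbers)
est <- c(2.1, 1.9, 2.3)     # estimate from each imputed dataset
se  <- c(0.50, 0.55, 0.48)  # corresponding standard errors
m   <- length(est)

pooled_est <- mean(est)          # pooled point estimate
W <- mean(se^2)                  # average within-imputation variance
B <- var(est)                    # between-imputation variance
total_var <- W + (1 + 1/m) * B   # Rubin's total variance
pooled_se <- sqrt(total_var)
c(estimate = pooled_est, se = pooled_se)
```

Note how the between-imputation variance B widens the pooled standard error relative to a single-dataset fit.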
We will now compare the results with the “true model” - i.e. the model we could have fitted if we had had access to the whole dataset. Compared to a real-life data analysis, this is of course cheating, as we would normally not be able to know this model.
Compare the estimates from the true model with those from the mice model and the complete case model.

#Load in "true" model
load("./data/ex3model.rda")
#Look at the estimates
summary(m_true)
#Plot estimates together with complete case model and true model
plotEstimates(`Full data` = m_true,
`MICE` = alco_micemodel,
`Complete cases` = alco_ccmodel)
In the imputation step we can vary a number of settings, including:

- the number of imputed datasets, \(m\)
- which variables are included in each imputation model (the predictor matrix)

Below, we take a closer look at each of these settings in 4.a and 4.b, respectively. Choose which topic you would like to work with first - you may not have time to do them both.
You can set the number of imputed datasets in the mice() function by use of the argument m.
Look at a summary of your imputed datasets to see how many datasets are imputed as the default:
summary(alco_imp)
Try running the MICE steps from above with \(m = 1\), \(m = 5\), \(m = 10\), \(m = 50\), \(m = 100\), and possibly more values of \(m\). Save the (pooled) models under the names mice_m1, mice_m5, …, mice_m200, respectively. (Hint: use the argument print = FALSE for the mice() function to avoid having a lot of information written on the screen when \(m\) is large.)

Compare the results, e.g. by use of the plotEstimates() function (see code example below). Discuss the following points:

- How do the estimates and their confidence intervals change as \(m\) increases?
- Which value of \(m\) do you think is sufficient?
- Add the true model (m_true) to the plot. Does this change your opinion?
#Code example: Plot estimates from models with varying numbers of imputations
plotEstimates(`m = 1` = mice_m1,
`m = 5` = mice_m5,
`m = 10` = mice_m10,
`m = 50` = mice_m50,
`m = 100` = mice_m100)
We will now look at what happens if we change which variables are included in the imputation models. This is specified using a so-called predictor matrix of 0s and 1s. Here is an example of such a matrix for a small dataset with only three variables, X, Y and Z:
## X Y Z
## X 0 0 1
## Y 1 0 0
## Z 0 0 0
The matrix is read row by row as follows:

- When imputing X, use Z as a predictor variable
- When imputing Y, use X as a predictor variable
- When imputing Z, use no predictor variables

Work through the following exercises:
Look at the predictor matrix for your imputed datasets to see which variables are included in each imputation model:
alco_imp$predictorMatrix
Why are there 0s in the diagonal?
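If you want to experiment, a predictor matrix like the three-variable example above can be built directly in base R (a toy sketch, not the exercise data; row/column names are added so that rows and columns can be addressed by variable name):

```r
# Build the example 3-variable predictor matrix by hand
vars <- c("X", "Y", "Z")
pm <- matrix(0, nrow = 3, ncol = 3, dimnames = list(vars, vars))
pm["X", "Z"] <- 1  # when imputing X, use Z as a predictor
pm["Y", "X"] <- 1  # when imputing Y, use X as a predictor
# Z's row stays all zero: no predictors when imputing Z
pm
```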
We will now change the choice of predictors for each imputation model. This can be specified using the predictorMatrix argument in the mice() function.
First, we construct a new predictor matrix where only the
variable education
is used as a predictor in the missing
data models (using two different, equivalent methods). Look closely at
the code and make sure you understand the structure of the matrix:
#Make new predictor matrix: method 1 - use an existing predictor matrix and modify it
mat1 <- alco_imp$predictorMatrix
#change all entries except the column "education" to zero
mat1[, c("country", "gender", "ageBeyond60", "partner",
"dependence", "prevTreat", "drinks")] <- 0
#Look at the result
mat1
#Make new predictor matrix: method 2 - first make a matrix only of 1s, then edit it
#make a matrix with 1 in all entries
mat1 <- matrix(1, 8, 8)
#change the diagonal to be zero
diag(mat1) <- 0
#change all the entries, except for the 4th column, to be zero
mat1[, -4] <- 0
#Look at the result
mat1
Perform MICE (all three steps) with this new predictor matrix by using the argument predictorMatrix = mat1 in your mice() call. Compare the results to your previous model (alco_micemodel) and the true model (m_true) and discuss the differences. Was it a good idea to remove the other variables from the predictor matrix?
Try handpicking which variables you think are needed for each imputation model to construct one or more new predictor matrices. Run MICE again and compare the results with alco_micemodel and the true model m_true.