Overview

The goals of these exercises are:

  1. To compare complete case analysis with analysis based on multiple imputation
  2. To get a bit of practical experience with using the MICE package in R

Add your results to the slides: https://docs.google.com/presentation/d/1x-2TsJyE_dfvNQhfGqwS1ofUof2h1NNshDbxFk9wHgs/edit?usp=sharing

1. A model

In the following, we will fit a number of models. As today’s purpose is to explore methods for handling missing information, you should not focus on model selection or model diagnostics. Rather, you should use a linear normal regression model with drinks as the outcome and additive effects of all variables, i.e.:

\[\begin{align} \text{drinks} &= \alpha + \beta_1 \cdot \text{agebeyond60} + \beta_2 \cdot 1(\text{country: DE}) + \beta_3 \cdot 1(\text{country: USA}) + \beta_4 \cdot 1(\text{dependence: intermediate}) + \\ &\quad \beta_5 \cdot 1(\text{dependence: severe}) + \beta_6 \cdot 1(\text{education: undergrad.}) + \beta_7 \cdot 1(\text{education: grad./post grad.}) + \\ &\quad \beta_8 \cdot 1(\text{gender: female}) + \beta_9 \cdot 1(\text{partner: TRUE}) + \beta_{10} \cdot 1(\text{prev. treat: 1-2}) + \beta_{11} \cdot 1(\text{prev. treat: 3+}) + \epsilon \end{align}\]

where \(\epsilon\) is a normally distributed error term. 1(*) are indicator functions (dummies) that are 1 if * is true and 0 otherwise. The intercept (\(\alpha\)) is the expected number of drinks for a person who is 60 years old, lives in Denmark, has dependence: “low”, education: “no degree”, gender: “male”, no partner and 0 previous treatments.

2. Complete case analysis

source("./R/functions.R")
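The object alco_ccmodel used below is assumed to be the complete case fit of the model from exercise 1. A minimal sketch, assuming the data are in a data frame called alco (the data frame name is a guess; adjust it to your own):

```r
# Complete case analysis: lm() drops rows with any missing values by default
# (na.action = na.omit), so fitting the model directly gives the complete case fit
alco_ccmodel <- lm(drinks ~ ageBeyond60 + country + dependence + education +
                     gender + partner + prevTreat,
                   data = alco)

summary(alco_ccmodel)
```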

plotEstimates(`Complete cases`  = alco_ccmodel)

This plot shows your estimates together with 95% confidence intervals.

3. Multiple imputation using chained equations

We will now try to use the mice package for performing multiple imputation by chained equations (MICE). MICE consists of three steps:

  1. Impute \(m\) different datasets
  2. Fit the same model on each of the \(m\) datasets
  3. Pool the results across the \(m\) models to obtain a final combined model

We will first conduct an analysis using MICE with standard settings and then, in the next exercise, take a closer look at the imputation step (1).
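The three steps above map onto three function calls in the mice package. A sketch with default settings, assuming the data frame is called alco and that alco_imp and alco_micemodel are the object names used later in these exercises:

```r
library(mice)

# Step 1: impute m datasets (mice() uses m = 5 by default)
alco_imp <- mice(alco, seed = 123)

# Step 2: fit the same linear model on each of the imputed datasets
alco_fits <- with(alco_imp, lm(drinks ~ ageBeyond60 + country + dependence +
                                 education + gender + partner + prevTreat))

# Step 3: pool the results across the fitted models into one combined model
alco_micemodel <- pool(alco_fits)
summary(alco_micemodel)
```

Setting a seed makes the imputations reproducible, which is convenient when comparing results across the exercises.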

4. A closer look at the imputation step

In the imputation step we can vary a number of settings, including:

  1. The number of imputed datasets.
  2. Which variables to use in the imputation models

Below, we take a closer look at each of these settings in 4.a and 4.b, respectively. Choose which topic you would like to work with first - you may not have time to do both.

4.a. The number of imputed datasets

You can set the number of imputed datasets in the mice() function by use of the argument m.

  • Look at a summary of your imputed datasets to see how many datasets are imputed by default:

    summary(alco_imp)
  • Try running the MICE steps from above with \(m = 1\), \(m = 5\), \(m = 10\), \(m = 50\), \(m = 100\), and possibly more values of \(m\). Save the (pooled) models under the names mice_m1, mice_m5, …, mice_m100, respectively.

    • Tip: You may opt to use the argument print = FALSE for the mice() function to avoid having a lot of information written on the screen when \(m\) is large.
  • Compare the results, e.g. by use of the plotEstimates() function (see code example below). Discuss the following points:

    • What happens when \(m = 1\)?
    • How large do you think \(m\) has to be in this specific analysis before the results are sufficiently stable?
  • Add the true model (m_true) to the plot. Does this change your opinion?
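One way to produce the models compared below is to repeat the three MICE steps for each value of \(m\). A sketch for \(m = 5\) (object names follow the exercise; alco is an assumed data frame name):

```r
# Impute, fit, and pool with m = 5; repeat with m = 1, 10, 50, 100, ...
imp5 <- mice(alco, m = 5, print = FALSE, seed = 123)
fit5 <- with(imp5, lm(drinks ~ ageBeyond60 + country + dependence +
                        education + gender + partner + prevTreat))
mice_m5 <- pool(fit5)
```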

# Code example: plot estimates from models with varying numbers of imputations
plotEstimates(`m = 1` = mice_m1, 
              `m = 5` = mice_m5, 
              `m = 10` = mice_m10, 
              `m = 50` = mice_m50, 
              `m = 100` = mice_m100)

4.b. Choice of variables to use in imputation models

We will now look at what happens if we change what variables are included in the imputation models. This is specified using a so-called predictor matrix of 0s and 1s. Here is an example of such a matrix for a small dataset with only three variables, X, Y and Z:

##   X Y Z
## X 0 0 1
## Y 1 0 0
## Z 0 0 0

The matrix is read row by row as follows:

  1. For the imputation model for X, use Z as a predictor variable
  2. For the imputation model for Y, use X as a predictor variable
  3. For the imputation model for Z, use no predictor variables
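The small example matrix above can be built by hand, which makes the row-wise reading concrete:

```r
# Build the 3-variable predictor matrix from the example above
vars <- c("X", "Y", "Z")
pm <- matrix(0, nrow = 3, ncol = 3, dimnames = list(vars, vars))
pm["X", "Z"] <- 1  # impute X using Z as a predictor
pm["Y", "X"] <- 1  # impute Y using X as a predictor
pm                 # row Z stays all zeros: no predictors for Z
```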

Work through the following exercises:

  • Look at the predictor matrix for your imputed datasets to see what variables are included for each imputation model now:

    alco_imp$predictorMatrix
  • Why are there 0s on the diagonal?

  • We will now change the choice of predictors for each imputation model. This can be specified using the predictorMatrix argument in the mice() function.

    • First, we construct a new predictor matrix where only the variable education is used as a predictor in the missing data models (using two different, equivalent methods). Look closely at the code and make sure you understand the structure of the matrix:

      # Make new predictor matrix, method 1: use an existing predictor matrix and modify it
      mat1 <- alco_imp$predictorMatrix
      
      # Change all entries except the column "education" to zero
      mat1[, c("country", "gender", "ageBeyond60", "partner", 
               "dependence", "prevTreat", "drinks")] <- 0
      
      # Look at the result
      mat1
      
      # Make new predictor matrix, method 2: first make a matrix of only 1s, then edit it
      # Make a matrix with 1 in all entries
      mat1 <- matrix(1, 8, 8)
      
      # Change the diagonal to be zero
      diag(mat1) <- 0
      
      # Change all entries, except for the 4th column (education), to be zero
      mat1[, -4] <- 0
      
      # Look at the result
      mat1
  • Perform MICE (all three steps) with this new predictor matrix by passing the argument predictorMatrix = mat1 in your mice() call. Compare the results to your previous model (alco_micemodel) and the true model (m_true) and discuss the differences. Was it a good idea to remove the other variables from the predictor matrix?

  • Try handpicking what variables you think are needed for each imputation model to construct one or more new predictor matrices. Run MICE again and compare the results with alco_micemodel and the true model m_true.

    • Tip: Note that it does not matter which variables you pick for the rows corresponding to variables without any missing information - they won’t be imputed anyway.
    • Discuss: Is this a good strategy? Why/why not?
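Supplying a custom predictor matrix (such as mat1 above) only changes the imputation step; the fit and pool steps are unchanged. A sketch, assuming the same object names as in the earlier exercises:

```r
# Step 1 with the custom predictor matrix
imp_edu <- mice(alco, predictorMatrix = mat1, print = FALSE, seed = 123)

# Steps 2 and 3 as before
fit_edu <- with(imp_edu, lm(drinks ~ ageBeyond60 + country + dependence +
                              education + gender + partner + prevTreat))
micemodel_edu <- pool(fit_edu)
```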