## Overview

The goals of these exercises are:

1. To compare complete case analysis with analysis based on multiple imputation
2. To get a bit of practical experience with using the MICE package in R

Please add results/information/plots/key points to your Google Slides show as you work through the exercises, so we can all discuss the findings afterwards.

## 1. A model

In the following, we will fit a number of models. As today’s purpose is to explore methods for handling missing information, you should not focus on model selection or model diagnostics. Rather, you should use a linear normal regression model with drinks as the outcome and additive effects of all variables, i.e.:

\begin{align} \text{drinks} &= \alpha + \beta_1 \cdot \text{ageBeyond60} + \beta_2 \cdot 1(\text{country: DE}) + \beta_3 \cdot 1(\text{country: USA}) + \beta_4 \cdot 1(\text{dependence: intermediate}) + \\ &\quad \beta_5 \cdot 1(\text{dependence: severe}) + \beta_6 \cdot 1(\text{education: undergrad.}) + \beta_7 \cdot 1(\text{education: grad./post grad.}) + \\ &\quad \beta_8 \cdot 1(\text{gender: female}) + \beta_9 \cdot 1(\text{partner: TRUE}) + \beta_{10} \cdot 1(\text{prev. treat: 1-2}) + \beta_{11} \cdot 1(\text{prev. treat: 3+}) + \epsilon \end{align}

where $$\epsilon$$ is a normally distributed error term. 1(*) are indicator functions that are 1 if * is true and 0 otherwise. The intercept ($$\alpha$$) is the expected number of drinks for a person who is 60 years old, lives in Denmark, has dependence: “low”, education: “no degree”, gender: “male”, no partner and 0 previous treatments.

• What is the interpretation of each of the remaining regression parameters of this model ($$\beta_1, ..., \beta_{11}$$)?
• How could you fit it in R if there were no missing information?
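With no missing information, the model could be fitted directly with lm(). A minimal sketch, assuming the data frame is called alcodata1 and contains the variable names used in the later code chunks:

```r
#Fit the linear normal regression model with additive effects of all variables
#(assumes a data frame alcodata1 with the variables used later in the exercises)
fit <- lm(drinks ~ ageBeyond60 + country + dependence + education +
            gender + partner + prevTreat,
          data = alcodata1)

#Look at the estimated coefficients
summary(fit)
```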

## 2. Complete case analysis

• How many complete cases are there in the data? (Hint: The function na.omit() removes incomplete cases and the function nrow() counts the number of observations in a dataset.)
• Fit the model from above on the complete cases and save it for later under the name alco_ccmodel. (Hint: the default behavior of most R model functions, including lm(), is to remove all incomplete cases before fitting a model)
• Look at the estimates by using the summary() function on your model.
• Plot the model estimates using a helper function plotEstimates() sourced from the functions.R file in the project folder by running the following lines of code:
source("./R/functions.R")

plotEstimates(`Complete cases` = alco_ccmodel)

This plot shows your estimates together with 95% confidence intervals.
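The complete-case count from the first bullet can be sketched as follows (assuming, as in the later code chunks, that the data frame is called alcodata1):

```r
#na.omit() drops all rows with at least one missing value,
#and nrow() counts the rows that remain
nrow(na.omit(alcodata1))

#For comparison: the total number of rows, incomplete ones included
nrow(alcodata1)
```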

## 3. Multiple imputation using chained equations

We will now try to use the mice package for performing multiple imputation by chained equations (MICE). MICE consists of three steps:

1. Impute $$m$$ different datasets
2. Fit the same model on each of the $$m$$ datasets
3. Pool the results across the $$m$$ models to obtain a final combined model

We will first conduct an analysis using MICE with standard settings and then, in the next exercise, take a closer look at the imputation step (1).

• To make everything a bit easier, remove the variable completecase from your dataset again:

alcodata1$completecase <- NULL

• Run the code chunk below one line at a time.
• What does each line do?
• Do you get the same result if you run the code again? Why/why not?
• Look at the plot and discuss: What can you conclude from comparing the MICE estimates with the CC analysis estimates?

alco_imp <- mice(alcodata1)

alco_fit <- with(alco_imp,
                 lm(drinks ~ ageBeyond60 + prevTreat + country + gender +
                      education + partner + dependence))

alco_micemodel <- pool(alco_fit)

summary(alco_micemodel, conf.int = TRUE)

plotEstimates(`Complete cases` = alco_ccmodel, MICE = alco_micemodel)

• We will now compare the results with the “true model” - i.e. the model we could have fitted if we had had access to the whole dataset. Compared to a real-life data analysis, this is of course cheating, as we would normally not know this model.
• Run the code below to load the true model and plot it together with your MICE model and the complete case model.
• Discuss the differences between the three models.
• Update your best guesses for how the missing data was generated for each variable with missing information (MCAR/MAR/MNAR). If MAR or MNAR, consider which variables you think caused the missing information.

#Load in "true" model
load("./data/ex3model.rda")

#Look at the estimates
summary(m_true)

#Plot estimates together with complete case model and true model
plotEstimates(`Full data` = m_true, MICE = alco_micemodel,
              `Complete cases` = alco_ccmodel)

## 4. A closer look at the imputation step

In the imputation step we can vary a number of settings, including:

1. The number of imputed datasets
2. Which variables to use in the imputation models

Below, we take a closer look at each of these settings in 4.a and 4.b, respectively. Choose which topic you would like to work with first - you may not have time to do them both.

### 4.a. The number of imputed datasets

You can set the number of imputed datasets in the mice() function by use of the argument m.
• Look at a summary of your imputed datasets to see how many datasets are imputed as the default:

summary(alco_imp)

• Try running the MICE steps from above with $$m = 1$$, $$m = 5$$, $$m = 10$$, $$m = 50$$, $$m = 100$$, and possibly more values of $$m$$. Save the (pooled) models under the names mice_m1, mice_m5, …, mice_m200, respectively.
• Tip: You may opt to use the argument print = FALSE in the mice() function to avoid having a lot of information written on the screen when $$m$$ is large.
• Compare the results, e.g. by use of the plotEstimates() function (see code example below). Discuss the following points:
• What happens when $$m = 1$$?
• How large do you think $$m$$ has to be in this specific analysis before the results are sufficiently stable?
• Add the true model (m_true) to the plot. Does this change your opinion?

#Code example: Plot estimates from models with varying numbers of imputations
plotEstimates(`m = 1` = mice_m1, `m = 5` = mice_m5, `m = 10` = mice_m10,
              `m = 50` = mice_m50, `m = 100` = mice_m100)

### 4.b. Choice of variables to use in imputation models

We will now look at what happens if we change which variables are included in the imputation models. This is specified using a so-called predictor matrix of 0s and 1s. Here is an example of such a matrix for a small dataset with only three variables, X, Y and Z:

##   X Y Z
## X 0 0 1
## Y 1 0 0
## Z 0 0 0

The matrix is read row by row as follows:

1. For the imputation model for X, use Z as a predictor variable
2. For the imputation model for Y, use X as a predictor variable
3. For the imputation model for Z, use no predictor variables

Work through the following exercises:

• Look at the predictor matrix for your imputed datasets to see which variables are currently included in each imputation model:

alco_imp$predictorMatrix
• Why are there 0s in the diagonal?
• We will now change the choice of predictors for each imputation model. This is specified using the predictorMatrix argument in the mice() function.
• First, we construct a new predictor matrix where only the variable education is used as a predictor in the imputation models (using two different, equivalent methods). Look closely at the code and make sure you understand the structure of the matrix:
#Make new predictor matrix: method 1 - use an existing predictor matrix and modify it
mat1 <- alco_imp$predictorMatrix

#change all entries except the column "education" to zero
mat1[, c("country", "gender", "ageBeyond60", "partner",
"dependence", "prevTreat", "drinks")] <- 0

#Look at the result
mat1

#Make new predictor matrix: method 2 - first make a matrix only of 1s, then edit it
#make a matrix with 1 in all entries
mat1 <- matrix(1, 8, 8)

#change the diagonal to be zero
diag(mat1) <- 0

#change all the entries, except for the 4th column, to be zero
mat1[, -4] <- 0

#Look at the result
mat1
• Perform MICE (all three steps) with this new predictor matrix by using the argument predictorMatrix = mat1 in your mice() call. Compare the results to your previous model (alco_micemodel) and the true model (m_true) and discuss the differences. Was it a good idea to remove the other variables from the predictor matrix?
• Try handpicking what variables you think are needed for each imputation model to construct one or more new predictor matrices. Run MICE again and compare the results with alco_micemodel and the true model m_true.
• Tip: Note that it does not matter which variables you pick for the rows corresponding to variables without any missing information - they won’t be imputed anyway.
• Discuss: Is this a good strategy? Why/why not?
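One way to approach the handpicking exercise is to copy an existing predictor matrix and edit it row by row. A minimal sketch; the predictors chosen for drinks below are purely illustrative assumptions, not a recommendation:

```r
#Start from the default predictor matrix
mat2 <- alco_imp$predictorMatrix

#Illustrative choice only: impute drinks using just dependence and prevTreat
mat2["drinks", ] <- 0
mat2["drinks", c("dependence", "prevTreat")] <- 1

#Run all three MICE steps with the handpicked matrix
alco_imp2 <- mice(alcodata1, predictorMatrix = mat2)
alco_fit2 <- with(alco_imp2,
                  lm(drinks ~ ageBeyond60 + prevTreat + country + gender +
                       education + partner + dependence))
alco_micemodel2 <- pool(alco_fit2)
summary(alco_micemodel2, conf.int = TRUE)

#Compare with the earlier models
plotEstimates(Handpicked = alco_micemodel2, MICE = alco_micemodel,
              `Full data` = m_true)
```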