The goals of these exercises are:

- To compare complete case analysis with analysis based on multiple imputation
- To get a bit of practical experience with using the `mice` package in R

Please add results/information/plots/key points to your Google Slides show as you work through the exercises, so we can all discuss the findings afterwards.

In the following, we will fit a number of models. As today’s purpose is to explore methods for handling missing information, you should **not** focus on model selection or model diagnostics. Rather, you shall use a linear normal regression model with `drinks` as the outcome and additive effects of all variables, i.e.:

\[\begin{align} \text{drinks} &= \alpha + \beta_1 \cdot \text{ageBeyond60} + \beta_2 \cdot 1(\text{country: DE}) + \beta_3 \cdot 1(\text{country: USA}) + \beta_4 \cdot 1(\text{dependence: intermediate}) + \\ &\quad \beta_5 \cdot 1(\text{dependence: severe}) + \beta_6 \cdot 1(\text{education: undergrad.}) + \beta_7 \cdot 1(\text{education: grad./post grad.}) + \\ &\quad \beta_8 \cdot 1(\text{gender: female}) + \beta_9 \cdot 1(\text{partner: TRUE}) + \beta_{10} \cdot 1(\text{prev. treat: 1-2}) + \beta_{11} \cdot 1(\text{prev. treat: 3+}) + \epsilon \end{align}\]

where \(\epsilon\) is a normally distributed error term, and `1(*)` is an indicator function that equals `1` if `*` is true and `0` otherwise. The intercept (\(\alpha\)) is the expected number of drinks for a person who is 60 years old, lives in Denmark, has dependence: “low”, education: “no degree”, gender: “male”, no partner and 0 previous treatments.

- What is the interpretation of each of the remaining regression parameters of this model (\(\beta_1, ..., \beta_{11}\))?
- How could you fit it in `R` if there were no missing information?
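As a sketch of how the model could be fitted if no values were missing: the snippet below simulates a stand-in dataset (the variable names are taken from the code chunks later in these exercises; the real `alcodata1` is not reproduced here) and fits the additive model with `lm()`.

```r
# Simulated stand-in for the real data - values are arbitrary,
# only the variable names and types mimic the exercise dataset
set.seed(1)
n <- 200
alcodata1 <- data.frame(
  drinks      = rpois(n, 10),
  ageBeyond60 = rnorm(n),
  country     = factor(sample(c("DK", "DE", "USA"), n, replace = TRUE)),
  dependence  = factor(sample(c("low", "intermediate", "severe"), n, replace = TRUE)),
  education   = factor(sample(c("no degree", "undergrad.", "grad./post grad."), n, replace = TRUE)),
  gender      = factor(sample(c("male", "female"), n, replace = TRUE)),
  partner     = sample(c(TRUE, FALSE), n, replace = TRUE),
  prevTreat   = factor(sample(c("0", "1-2", "3+"), n, replace = TRUE))
)

# With complete data, lm() fits the additive model directly;
# each factor is expanded into indicator (dummy) variables
fit <- lm(drinks ~ ageBeyond60 + country + dependence + education +
            gender + partner + prevTreat, data = alcodata1)
length(coef(fit))  # intercept plus 11 beta parameters = 12
```

Note how the single formula term `country` produces the two indicator coefficients \(\beta_2\) and \(\beta_3\) from the equation above, and similarly for the other factors.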

- How many complete cases are there in the data? (Hint: The function `na.omit()` removes incomplete cases and the function `nrow()` counts the number of observations in a dataset.)
- Fit the model from above on the complete cases and save it for later under the name `alco_ccmodel`. (Hint: the default behavior of most R model functions, including `lm()`, is to remove all incomplete cases before fitting a model.)
- Look at the estimates by using the `summary()` function on your model.
- Plot the model estimates using the helper function `plotEstimates()`, sourced from the `functions.R` file in the project folder, by running the following lines of code:

```
source("./R/functions.R")
plotEstimates(`Complete cases` = alco_ccmodel)
```

This plot shows your estimates together with 95% confidence intervals.
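As a small self-contained illustration of the hint above: `na.omit()` drops every row containing at least one `NA`, and `nrow()` then counts what is left.

```r
# Toy data frame: rows 2 and 3 each contain a missing value
d <- data.frame(x = c(1, NA, 3, 4),
                y = c("a", "b", NA, "d"))
nrow(d)           # 4 rows in total
nrow(na.omit(d))  # 2 complete cases (rows 1 and 4)
```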

We will now try to use the `mice` package for performing multiple imputation by chained equations (MICE). MICE consists of three steps:

- Impute \(m\) different datasets
- Fit the same model on each of the \(m\) datasets
- Pool the results across the \(m\) models to obtain a final combined model

We will first conduct an analysis using MICE with standard settings and then, in the next exercise, take a closer look at the imputation step (1).

- To make everything a bit easier, remove the variable `completecase` from your dataset again:

`alcodata1$completecase <- NULL`

- Run the code chunk below one line at a time.
- What does each line do?
- Do you get the same result if you run the code again? Why/why not?
- Look at the plot and discuss: What can you conclude from comparing the MICE estimates with the CC analysis estimates?

```
library(mice)
alco_imp <- mice(alcodata1)
alco_fit <- with(alco_imp,
                 lm(drinks ~ ageBeyond60 + prevTreat + country + gender +
                      education + partner + dependence))
alco_micemodel <- pool(alco_fit)
summary(alco_micemodel, conf.int = TRUE)
plotEstimates(`Complete cases` = alco_ccmodel, `MICE` = alco_micemodel)
```
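A point relevant to the rerun question above: the imputation step draws random values, so repeated runs of `mice()` generally give slightly different results unless the random seed is fixed. The `mice()` function has a `seed` argument for this; a small sketch using the `nhanes` example data that ships with the `mice` package:

```r
library(mice)

# Two runs with the same seed on the nhanes example data
imp1 <- mice(nhanes, seed = 123, print = FALSE)
imp2 <- mice(nhanes, seed = 123, print = FALSE)

# Same seed, same imputed values
all.equal(complete(imp1), complete(imp2))
```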

- We will now compare the results with the “true model” - i.e. the model we could have fitted if we had had access to the whole dataset. Compared to a real-life data analysis, this is of course cheating, as we would normally not be able to know this model.
- Run the code below to load in the true model and plot it together with your `mice` model and the complete case model.
- Discuss the differences between the three models.
- Update your best guesses for how the missing data was generated for each variable with missing information (MCAR/MAR/MNAR). If MAR or MNAR, consider which variables you think caused the missing information.

```
#Load in "true" model
load("./data/ex3model.rda")
#Look at the estimates
summary(m_true)
#Plot estimates together with the MICE model and the complete case model
plotEstimates(`Full data` = m_true,
              `MICE` = alco_micemodel,
              `Complete cases` = alco_ccmodel)
```

In the imputation step we can vary a number of settings, including:

- The number of imputed datasets.
- Which variables to use in the imputation models.

Below, we take a closer look at each of these settings in 4.a and 4.b, respectively. Choose which topic you would like to work with first - you may not have time to do them both.

You can set the number of imputed datasets in the `mice()` function by use of the argument `m`.

- Look at a summary of your imputed datasets to see how many datasets are imputed by default:

`summary(alco_imp)`

- Try running the MICE steps from above with \(m = 1\), \(m = 5\), \(m = 10\), \(m = 50\), \(m = 100\), and possibly more values of \(m\). Save the (pooled) models under the names `mice_m1`, `mice_m5`, …, `mice_m200`, respectively.
  - Tip: You may opt to use the argument `print = FALSE` for the `mice()` function to avoid having a lot of information written on the screen when \(m\) is large.
- Compare the results, e.g. by use of the `plotEstimates()` function (see code example below). Discuss the following points:
  - What happens when \(m = 1\)?
  - How large do you think \(m\) has to be in this specific analysis before the results are sufficiently stable?
- Add the true model (`m_true`) to the plot. Does this change your opinion?

```
#Code example: Plot estimates from models with varying numbers of imputations
plotEstimates(`m = 1` = mice_m1,
              `m = 5` = mice_m5,
              `m = 10` = mice_m10,
              `m = 50` = mice_m50,
              `m = 100` = mice_m100)
```
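If you want to avoid repeating the three MICE steps by hand for every value of \(m\), one option is to wrap them in a small helper function. The sketch below uses the `nhanes` example data shipped with `mice`; for the exercise you would substitute `alcodata1` and the `drinks` formula:

```r
library(mice)

# Helper: run all three MICE steps (impute, fit, pool) for a given m.
# nhanes and the bmi formula stand in for the exercise data and model.
run_mice <- function(m) {
  imp <- mice(nhanes, m = m, print = FALSE, seed = 1)
  fit <- with(imp, lm(bmi ~ age + chl))
  pool(fit)
}

mice_m5  <- run_mice(5)
mice_m10 <- run_mice(10)
summary(mice_m10)
```

Note that pooling needs at least two imputed datasets, which is one way to see what goes wrong in the \(m = 1\) question above.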

We will now look at what happens if we change which variables are included in the imputation models. This is specified using a so-called predictor matrix of 0s and 1s. Here is an example of such a matrix for a small dataset with only three variables, `X`, `Y` and `Z`:

```
## X Y Z
## X 0 0 1
## Y 1 0 0
## Z 0 0 0
```

The matrix is read row by row as follows:

- For the imputation model for `X`, use `Z` as a predictor variable
- For the imputation model for `Y`, use `X` as a predictor variable
- For the imputation model for `Z`, use no predictor variables
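The example matrix above can be built and inspected directly in R; a small self-contained sketch:

```r
# Build the 3x3 example predictor matrix row by row:
# each row says which variables predict that row's variable
pm <- matrix(c(0, 0, 1,   # X is imputed using Z
               1, 0, 0,   # Y is imputed using X
               0, 0, 0),  # Z is imputed using no predictors
             nrow = 3, byrow = TRUE,
             dimnames = list(c("X", "Y", "Z"), c("X", "Y", "Z")))
pm
pm["X", "Z"]  # 1: Z is a predictor in the imputation model for X
```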

Work through the following exercises:

- Look at the predictor matrix for your imputed datasets to see which variables are currently included in each imputation model:

`alco_imp$predictorMatrix`

- Why are there 0s in the diagonal?
- We will now change the choice of predictors for each imputation model. This can be specified using the `predictorMatrix` argument in the `mice()` function.
  - First, we construct a new predictor matrix where only the variable `education` is used as a predictor in the imputation models (using two different, equivalent methods). Look closely at the code and make sure you understand the structure of the matrix:

```
#Make new predictor matrix: method 1 - use an existing predictor matrix and modify it
mat1 <- alco_imp$predictorMatrix
#Change all entries except the column "education" to zero
mat1[, c("country", "gender", "ageBeyond60", "partner",
         "dependence", "prevTreat", "drinks")] <- 0
#Look at the result
mat1
#Make new predictor matrix: method 2 - first make a matrix only of 1s, then edit it
#Make a matrix with 1 in all entries
mat1 <- matrix(1, 8, 8)
#Change the diagonal to be zero
diag(mat1) <- 0
#Change all the entries, except for the 4th column (education), to be zero
mat1[, -4] <- 0
#Look at the result
mat1
```
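Note that method 2 produces a matrix without row or column names, so its entries are matched to the data purely by position. If you prefer a named version, you can supply `dimnames`; in the sketch below the column order is a guess and should be matched to `names(alcodata1)` in practice:

```r
# Hypothetical column order - check against names(alcodata1) before use
vars <- c("drinks", "country", "gender", "education",
          "ageBeyond60", "partner", "dependence", "prevTreat")

# Named variant of method 2: all 1s, zero the diagonal,
# then zero every column except "education"
mat2 <- matrix(1, 8, 8, dimnames = list(vars, vars))
diag(mat2) <- 0
mat2[, setdiff(vars, "education")] <- 0
rowSums(mat2)  # every imputation model now uses at most one predictor
```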

- Perform MICE (all three steps) with this new predictor matrix by using the argument `predictorMatrix = mat1` in your `mice()` call. Compare the results to your previous model (`alco_micemodel`) and the true model (`m_true`) and discuss the differences. Was it a good idea to remove the other variables from the predictor matrix?
- Try handpicking which variables you think are needed for each imputation model to construct one or more new predictor matrices. Run MICE again and compare the results with `alco_micemodel` and the true model `m_true`.
  - Tip: Note that it does not matter which variables you pick for the rows corresponding to variables without any missing information - they won’t be imputed anyway.
  - Discuss: Is this a good strategy? Why/why not?
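As an aside to the handpicking exercise: the `mice` package also provides `quickpred()`, which constructs a predictor matrix automatically by keeping predictors whose correlation with the target variable (or with its missingness indicator) exceeds a threshold. A sketch on the `nhanes` example data, since the exercise data is not reproduced here:

```r
library(mice)

# quickpred() keeps predictors correlated with the target variable
# (or with its missingness indicator) above the mincor threshold
pred <- quickpred(nhanes, mincor = 0.2)
pred

# Use the automatically constructed predictor matrix in mice()
imp <- mice(nhanes, predictorMatrix = pred, print = FALSE, seed = 1)
```

You could compare such an automatically chosen matrix against your handpicked ones when discussing whether handpicking is a good strategy.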