The goals of these exercises are:
MICE
package in RPlease add results/information/plots/key points to your Google Slides show as you work through the exercises, so we can all discuss the findings afterwards.
In the following, we will fit a number of models. As today’s purpose is to explore methods for handling missing information, you should not focus on model selection or model diagnostics. Rather, you shall use a linear normal regression model with drinks
as the outcome and additive effects of all variables, i.e.:
\[\begin{align} \text{drinks} &= \alpha + \beta_1 \cdot \text{agebeyond60} + \beta_2 \cdot 1(\text{country: DE}) + \beta_3 \cdot 1(\text{country: USA}) + \beta_4 \cdot 1(\text{dependence: intermediate}) + \\ &\quad \beta_5 \cdot 1(\text{dependence: servere}) + \beta_6 \cdot 1(\text{education: undergrad.}) + \beta_7 \cdot 1(\text{education: grad./post grad.}) + \\ &\quad \beta_8 \cdot 1(\text{gender: female}) + \beta_9 \cdot 1(\text{partner: TRUE}) + \beta_{10} \cdot 1(\text{prev. treat: 1-2}) + \beta_{11} \cdot 1(\text{prev.treat: 3+}) + \epsilon \end{align}\]
where \(\epsilon\) is a normally distributed error term. 1(*)
are indicator functions that are 1
if *
is true and 0 otherwise. The intercept (\(\alpha\)) is the expected number of drinks for a person who is 60 years old, lives in Denmark, has dependence: “low”, education: “no degree”, gender: “male”, no partner and 0 previous treatments.
R
if there were no missing information?na.omit()
removes incomplete cases and the function nrow()
counts the number of observations in a dataset.)alco_ccmodel
. (Hint: the default behavior of most R model functions, including lm()
, is to remove all incomplete cases before fitting a model)summary()
function on your model.plotEstimates()
sourced from the functions.R
file in the project folder by running the following lines of code:source("./R/functions.R")
plotEstimates(`Complete cases` = alco_ccmodel)
This plot shows your estimates together with 95% confidence intervals.
We will now try to use the mice
package for performing multiple imputation by chanied equations (MICE). MICE consists of three steps:
We will first conduct an analysis using MICE with standard settings and then, in the next exercise, take a closer look at the imputation step (1).
completecase
from your dataset again:alcodata1$completecase <- NULL
alco_imp <- mice(alcodata1)
alco_fit <- with(alco_imp,
lm(drinks ~ ageBeyond60 + prevTreat + country + gender + education + partner + dependence))
alco_micemodel <- pool(alco_fit)
summary(alco_micemodel, conf.int = TRUE)
plotEstimates(`Complete cases` = alco_ccmodel, `MICE` = alco_micemodel)
mice
model and the complete case model.#Load in "true" model
load("./data/ex3model.rda")
#Look at the estimates
summary(m_true)
#Plot estimates together with complete case model and true model
plotEstimates(`Full data` = m_true,
`MICE` = alco_micemodel,
`Complete cases` = alco_ccmodel)
In the imputation step we can vary a number of settings, including:
Below, we take a closer look at each of these settings in 4.a and 4.b, respectively. Choose which topic you would like to work with first - you may not have time to do them both.
You can set the number of imputed datasets in the mice()
function by use of the argument m
.
summary(alco_imp)
mice_m1
, mice_m5
, …, mice_m200
respectively.
print = FALSE
for the mice()
function to avoid having a lot of information written on the screen when \(m\) is large.plotEstimates()
function (see code example below). Discuss the following points:
m_true
) to the plot. Does this change your opinion?#Code example: Plot estimates from models with varying numbers of imputations
plotEstimates(`m = 1` = mice_m1,
`m = 5` = mice_m5,
`m = 10` = mice_m10,
`m = 50` = mice_m50,
`m = 100` = mice_m100)
We will now look at what happens if we change what variables are included in the imputation models. This is specified using a so-called predictor matrix of 0s and 1s. Here is an example of such a matrix for a small dataset with only three variables, X
, Y
and Z
:
## X Y Z
## X 0 0 1
## Y 1 0 0
## Z 0 0 0
The matrix is read row by row as follows:
X
, use Z
as a predictor variableY
, use X
as a predictor variableZ
, use no predictor variablesWork through the following exercises:
alco_imp$predictorMatrix
predictorMatrix
argument in the mice()
function.
education
is used as a predictor the missing data models (using two different, equivalent methods). Look closely at the code and make sure you understand the structure of the matrix:#Make new predictor matrix: method 1 - use an existing predictor matrix and modify it
mat1 <- alco_imp$predictorMatrix
#change all entries except the column "education" to zero
mat1[, c("country", "gender", "ageBeyond60", "partner",
"dependence", "prevTreat", "drinks")] <- 0
#Look at the result
mat1
#Make new predictor matrix: method 2 - first make a matrix only of 1s, then edit it
#make a matrix with 1 in all entires
mat1 <- matrix(1, 8, 8)
#change the diagonal to be zero
diag(mat1) <- 0
#change all the entries, except for the 4th column, to be zero
mat1[, -4] <- 0
#Look at the result
mat1
predictorMatrix = mat1
) in your mice()
call. Compare the results to your previous model (alco_micemodel
) and the true model (m_true
) and discuss the differences. Was it a good idea to remove the other variables from the predictor matrix?alco_micemodel
and the true model m_true
.