The goals of these exercises are to practice handling missing information using the MICE package in R.

Add your results to the slides: https://docs.google.com/presentation/d/1x-2TsJyE_dfvNQhfGqwS1ofUof2h1NNshDbxFk9wHgs/edit?usp=sharing
In the following, we will fit a number of models. As today's purpose is to explore methods for handling missing information, you should not focus on model selection or model diagnostics. Rather, use a linear normal regression model with drinks as the outcome and additive effects of all other variables, i.e.:
\[\begin{align} \text{drinks} &= \alpha + \beta_1 \cdot \text{ageBeyond60} + \beta_2 \cdot 1(\text{country: DE}) + \beta_3 \cdot 1(\text{country: USA}) + \beta_4 \cdot 1(\text{dependence: intermediate}) + \\ &\quad \beta_5 \cdot 1(\text{dependence: severe}) + \beta_6 \cdot 1(\text{education: undergrad.}) + \beta_7 \cdot 1(\text{education: grad./post grad.}) + \\ &\quad \beta_8 \cdot 1(\text{gender: female}) + \beta_9 \cdot 1(\text{partner: TRUE}) + \beta_{10} \cdot 1(\text{prev. treat: 1-2}) + \beta_{11} \cdot 1(\text{prev. treat: 3+}) + \epsilon \end{align}\]
where \(\epsilon\) is a normally distributed error term. 1(*) are indicator functions (dummies) that are 1 if * is true and 0 otherwise. The intercept (\(\alpha\)) is the expected number of drinks for a person who is 60 years old, lives in Denmark, has dependence "low", education "no degree", gender "male", no partner and 0 previous treatments.
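To see how R produces such indicator variables, here is a toy illustration (made-up values, not the exercise data) of how a factor is expanded into 0/1 dummies, with the first level absorbed into the intercept:

```r
# Toy example: R's dummy coding of a factor via model.matrix()
gender <- factor(c("male", "female", "male"), levels = c("male", "female"))
mm <- model.matrix(~ gender)
mm
# The column "genderfemale" is 1 for females and 0 for males;
# the reference level "male" is captured by the intercept column,
# just like the reference levels (DK, low, no degree, ...) above.
```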
Work through the following steps:

1. Which observations would we be able to use in R if there were no missing information? Add a variable, completecase, to your dataset that marks these observations. (Hint: the function complete.cases() might be handy.)
2. Fit the model on the complete cases and save it as alco_ccmodel. (Hint: the default behavior of most R model functions, including lm(), is to remove all incomplete cases before fitting a model.)
3. Look at the results by calling the summary() function on your model.
4. Plot the estimates with the plotEstimates() function, sourced from the functions.R file in the project folder, by running the following lines of code:

source("./R/functions.R")
plotEstimates(`Complete cases` = alco_ccmodel)
This plot shows your estimates together with 95% confidence intervals.
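The complete.cases() hint above can be illustrated on a toy data frame (made-up values, not the exercise dataset):

```r
# Toy sketch: complete.cases() flags rows with no missing values
d <- data.frame(x = c(1, NA, 3), y = c("a", "b", NA))
complete.cases(d)        # TRUE FALSE FALSE - FALSE for any row with an NA
d[complete.cases(d), ]   # keeps only the fully observed rows
```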
We will now try to use the mice package for performing multiple imputation by chained equations (MICE). MICE consists of three steps:

1. Imputation: fill in the missing values several times, creating multiple completed datasets.
2. Analysis: fit the analysis model to each completed dataset.
3. Pooling: combine the results from the analyses into a single set of estimates.

We will first conduct an analysis using MICE with standard settings and then, in the next exercise, take a closer look at the imputation step (1).
To make everything a bit easier, remove the variable
completecase
from your dataset again:
alcodata1$completecase <- NULL
Run the code chunk below one line at a time.

#Step 1: impute the missing values (default settings)
alco_imp <- mice(alcodata1)
#Step 2: fit the regression model on each imputed dataset
alco_fit <- with(alco_imp,
                 lm(drinks ~ ageBeyond60 + prevTreat + country + gender + education + partner + dependence))
#Step 3: pool the results across the imputed datasets
alco_micemodel <- pool(alco_fit)
summary(alco_micemodel, conf.int = TRUE)
plotEstimates(`Complete cases` = alco_ccmodel, `MICE` = alco_micemodel)
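Behind the scenes, pool() combines the per-dataset fits using Rubin's rules: the pooled estimate is the average of the estimates, and the total variance is the average within-imputation variance plus the between-imputation variance inflated by (1 + 1/m). A minimal base-R sketch with made-up numbers (not output from the exercise data):

```r
# Rubin's rules for one coefficient, m = 3 imputed datasets (toy numbers)
est <- c(2.1, 1.9, 2.3)     # estimate from each imputed dataset
se  <- c(0.50, 0.55, 0.48)  # corresponding standard errors
m   <- length(est)

pooled_est <- mean(est)          # pooled point estimate
W <- mean(se^2)                  # average within-imputation variance
B <- var(est)                    # between-imputation variance
total_var <- W + (1 + 1/m) * B   # Rubin's total variance
pooled_se <- sqrt(total_var)
c(estimate = pooled_est, se = pooled_se)
```

Note how the between-imputation variance B widens the pooled standard error relative to a single-dataset fit.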
We will now compare the results with the “true model” - i.e. the model we could have fitted if we had had access to the whole dataset. Compared to a real-life data analysis, this is of course cheating, as we would normally not be able to know this model.
Compare the estimates from the true model with those from the mice model and the complete case model.

#Load in "true" model
load("./data/ex3model.rda")
#Look at the estimates
summary(m_true)
#Plot estimates together with complete case model and true model
plotEstimates(`Full data` = m_true,
`MICE` = alco_micemodel,
`Complete cases` = alco_ccmodel)
In the imputation step we can vary a number of settings, including:

- the number of imputed datasets, \(m\)
- which variables are included in each imputation model (the predictor matrix)

Below, we take a closer look at each of these settings in 4.a and 4.b, respectively. Choose which topic you would like to work with first - you may not have time to do them both.
You can set the number of imputed datasets in the mice() function by use of the argument m.
Look at a summary of your imputed datasets to see how many datasets are imputed as the default:
summary(alco_imp)
Try running the MICE steps from above with \(m = 1\), \(m = 5\), \(m = 10\), \(m = 50\), \(m = 100\), and possibly more values of \(m\). Save the (pooled) models under the names mice_m1, mice_m5, …, mice_m200, respectively. (Hint: use the argument print = FALSE for the mice() function to avoid having a lot of information written on the screen when \(m\) is large.)

Compare the results, e.g. by use of the plotEstimates() function (see code example below). Discuss the following points:

- How do the estimates and their confidence intervals change as \(m\) increases?
- Which value of \(m\) do you think is sufficient?
- Add the true model (m_true) to the plot. Does this change your opinion?
#Code example: Plot estimates from models with varying numbers of imputations
plotEstimates(`m = 1` = mice_m1,
`m = 5` = mice_m5,
`m = 10` = mice_m10,
`m = 50` = mice_m50,
`m = 100` = mice_m100)
We will now look at what happens if we change which variables are included in the imputation models. This is specified using a so-called predictor matrix of 0s and 1s. Here is an example of such a matrix for a small dataset with only three variables, X, Y and Z:
## X Y Z
## X 0 0 1
## Y 1 0 0
## Z 0 0 0
The matrix is read row by row as follows:

- When imputing X, use Z as a predictor variable
- When imputing Y, use X as a predictor variable
- When imputing Z, use no predictor variables

Work through the following exercises:
Look at the predictor matrix for your imputed datasets to see which variables are included in each imputation model:
alco_imp$predictorMatrix
Why are there 0s in the diagonal?
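If you want to experiment, a predictor matrix like the three-variable example above can be built directly in base R (a toy sketch, not the exercise data; row/column names are added so that rows and columns can be addressed by variable name):

```r
# Build the example 3-variable predictor matrix by hand
vars <- c("X", "Y", "Z")
pm <- matrix(0, nrow = 3, ncol = 3, dimnames = list(vars, vars))
pm["X", "Z"] <- 1  # when imputing X, use Z as a predictor
pm["Y", "X"] <- 1  # when imputing Y, use X as a predictor
# Z's row stays all zero: no predictors when imputing Z
pm
```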
We will now change the choice of predictors for each imputation model. This can be specified using the predictorMatrix argument in the mice() function.
First, we construct a new predictor matrix where only the
variable education
is used as a predictor in the missing
data models (using two different, equivalent methods). Look closely at
the code and make sure you understand the structure of the matrix:
#Make new predictor matrix: method 1 - use an existing predictor matrix and modify it
mat1 <- alco_imp$predictorMatrix
#change all entries except the column "education" to zero
mat1[, c("country", "gender", "ageBeyond60", "partner",
"dependence", "prevTreat", "drinks")] <- 0
#Look at the result
mat1
#Make new predictor matrix: method 2 - first make a matrix only of 1s, then edit it
#make a matrix with 1 in all entries
mat1 <- matrix(1, 8, 8)
#change the diagonal to be zero
diag(mat1) <- 0
#change all the entries, except for the 4th column, to be zero
mat1[, -4] <- 0
#Look at the result
mat1
Perform MICE (all three steps) with this new predictor matrix by using the argument predictorMatrix = mat1 in your mice() call. Compare the results to your previous model (alco_micemodel) and the true model (m_true) and discuss the differences. Was it a good idea to remove the other variables from the predictor matrix?
Try handpicking which variables you think are needed for each imputation model to construct one or more new predictor matrices. Run MICE again and compare the results with alco_micemodel and the true model m_true.