The goals of these exercises are:

- To get an overview of how much information is missing in the data, and in what variables
- To form first ideas of whether the missing information is problematic for the analyses we will perform later
- To investigate whether there are systematic relationships between information being missing and some of the observed variables

Please add results/information/plots/key points to your Google Slides show as you work through the exercises, so we can all discuss the findings afterwards.

- Slides for `alcodata1`: https://docs.google.com/presentation/d/1XsBlKE4_zFnFEwrBCvM8yLc_oPrURFAoD6rXv6pqJMo/edit?usp=sharing
- Slides for `alcodata2`: https://docs.google.com/presentation/d/19Z40NJioCzrx71m9UcM31TnJyDIvlUgjqF2H3JQWj00/edit?usp=sharing

Note: All code examples below are written for `alcodata1`. If you were assigned `alcodata2`, please remember to change the code accordingly.

- Open the project called “MissingData” in the (unzipped) course day folder.
- Load the dataset that you wish to work on using the following lines of code:

```
#For dataset 1:
load("./data/alcodata1.rda")
#For dataset 2:
load("./data/alcodata2.rda")
```
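If you are unsure whether the file was read correctly, it can help to know that `load()` returns the names of the objects it restored. The sketch below illustrates this on a small stand-in file; the data frame `toy` and the temporary file are made up for the illustration and are not part of the course material.

```r
# Toy stand-in for the course data (hypothetical values)
toy <- data.frame(id = 1:3, education = c("low", NA, "high"))

# Save it to a temporary .rda file and remove it from the workspace
tmp <- tempfile(fileext = ".rda")
save(toy, file = tmp)
rm(toy)

# load() restores the object and (invisibly) returns its name
loaded <- load(tmp)
print(loaded)

# Quick checks that the object is there and looks as expected
print(exists("toy"))
print(dim(toy))
str(toy)
```

With the course data, the same idea would be `loaded <- load("./data/alcodata1.rda")` followed by `print(loaded)`.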

- Ensure that you have loaded the data correctly - and have a look at them - by using a few summary functions. Here are some commands you may try (written for `alcodata1`):

```
#Look at the first observations in the data
head(alcodata1)
#Look at the full dataset
View(alcodata1)
#Look at summarizing statistics for each variable in the dataset
summary(alcodata1)
```

- Add a new variable to your dataset that indicates whether each observation is complete or not:

```
alcodata1$completecase <- "Non-complete"
alcodata1$completecase[complete.cases(alcodata1)] <- "Complete"
```
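To see what `complete.cases()` does, it can be useful to try it on a data frame small enough to check by eye. The following is a minimal sketch on made-up data (the data frame `toy` and its values are hypothetical, not the course data), using the same two-step construction as above:

```r
# Toy data frame (hypothetical values, not the course data)
toy <- data.frame(
  gender    = c("F", "M", "F", "M", "F"),
  age       = c(34, NA, 51, 42, 29),
  education = c("low", "high", NA, "high", NA)
)

# complete.cases() returns TRUE for rows without any NA
is_complete <- complete.cases(toy)
print(is_complete)

# Same two-step construction as in the exercise
toy$completecase <- "Non-complete"
toy$completecase[is_complete] <- "Complete"
print(toy)

# Counts and proportions of complete vs. non-complete rows
print(table(toy$completecase))
print(prop.table(table(toy$completecase)))
```

Running `table(alcodata1$completecase)` on your own dataset gives a first impression of how many observations a complete case analysis would discard.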

- Make sure you understand what this variable contains. Remember that you can use `View()` or `head()` to inspect the data.
- Make distribution plots for each variable, stratified by `completecase` (barplots for categorical variables, histograms for numerical variables).
- See the code examples below if you want inspiration for how this can be done.
- Discuss: What do you see? Do you think a complete case analysis will give the same estimates as an analysis on the full dataset where no observations are missing? Why/why not?

```
#Code example: Plot gender stratified by "completecase" status (two methods)
#Method 1: Using the ggplot2 package:
library(ggplot2)
ggplot(alcodata1, aes(x = gender)) + geom_bar() + facet_wrap(~ completecase) #qplot() is deprecated in newer ggplot2 versions
#Method 2: Using the standard built-in plotting tools (use hist() for numerical variables):
par(mfrow = c(1,2)) #change plot window to 1 x 2
plot(alcodata1$gender[alcodata1$completecase == "Non-complete"], main = "Non-complete")
plot(alcodata1$gender[alcodata1$completecase == "Complete"], main = "Complete")
par(mfrow = c(1,1)) #change plot window back to 1 x 1
```
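For a numerical variable, the same stratified comparison can be done with `hist()`. The sketch below uses simulated data, since the numerical variables in the course data are not named here: the variable `age`, the data frame `toy`, and the way missingness is generated are all assumptions made for illustration. Using the same breaks in both panels makes the two histograms easier to compare.

```r
# Simulated toy data (hypothetical; not the course data)
set.seed(1)
n <- 200
toy <- data.frame(age = round(rnorm(n, mean = 45, sd = 12)))

# By construction, education is more likely to be missing for older people
toy$education <- ifelse(runif(n) < plogis((toy$age - 45) / 10), NA, "observed")
toy$completecase <- ifelse(complete.cases(toy), "Complete", "Non-complete")

# Common breaks covering the full range of age, so the panels are comparable
breaks <- pretty(range(toy$age), n = 15)

par(mfrow = c(1, 2)) #change plot window to 1 x 2
for (status in c("Non-complete", "Complete")) {
  hist(toy$age[toy$completecase == status], breaks = breaks,
       main = status, xlab = "age")
}
par(mfrow = c(1, 1)) #change plot window back to 1 x 1

# Numerical summaries per group as a supplement to the plots
print(tapply(toy$age, toy$completecase, summary))
```

If the two histograms differ clearly in location or shape, that is a hint that missingness is related to the plotted variable.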

We will now look more directly at the missing information and try to inspect the patterns in which it occurs.

- Try some of the functions below and discuss what each plot shows and what it tells you about the data and the missing information.
- Try replacing `gender` and `country` in the stratified plots with other variables that you find interesting.

```
#Missing information visualization using the naniar package:
library(naniar)
gg_miss_var(alcodata1)
gg_miss_var(alcodata1, facet = country, show_pct = TRUE)
vis_miss(alcodata1)
gg_miss_upset(alcodata1)
gg_miss_case(alcodata1)
gg_miss_fct(alcodata1, gender)
```

```
#Missing information visualization using the mice package:
library(mice)
md.pattern(alcodata1, rotate.names = TRUE)
```
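The plots can be supplemented with plain numbers. The sketch below computes the amount of missing information per variable and tabulates the missingness patterns using base R only. The data frame `toy` is made up for the illustration; the pattern coding (1 = observed, 0 = missing) follows the convention used by `md.pattern()`.

```r
# Toy data frame (hypothetical values, not the course data)
toy <- data.frame(
  country   = c("DK", "DK", "SE", "SE", "NO", "NO"),
  age       = c(34, NA, 51, 42, NA, 29),
  education = c("low", NA, NA, "high", "low", "high")
)

# Logical matrix: TRUE where a value is missing
miss <- is.na(toy)

# Number and percentage of missing values per variable
n_missing <- colSums(miss)
pct_missing <- 100 * colMeans(miss)
print(data.frame(n_missing, pct_missing))

# One pattern string per row: 1 = observed, 0 = missing
pattern <- apply(!miss, 1, function(r) paste(as.integer(r), collapse = ""))

# How often does each pattern occur?
print(table(pattern))
```

On your own data, `colSums(is.na(alcodata1))` gives the per-variable counts directly.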

We will now investigate whether some variables are predictive of missingness status. Note that there are code examples after the exercises you can use.

- Fit a logistic regression model with:
  - Outcome: An indicator (0/1) of whether the observation is complete in all variables.
  - Predictors: All four completely observed variables.

- Look at the parameter estimates. Do you think that the observations are missing completely at random?

For each variable with any missing information, repeat the modeling from the previous exercise, but replace the outcome with an indicator that shows only if there is missing information in **that** variable, e.g. if `education` is missing.

```
#overall non-response analysis:
model1 <- glm(completecase == "Non-complete" ~ country + gender + partner + prevTreat,
              data = alcodata1, family = "binomial")
#look at the parameter estimates (log-odds scale):
summary(model1)
#variable-specific non-response analysis:
model2 <- glm(is.na(education) ~ country + gender + partner + prevTreat,
              data = alcodata1, family = "binomial")
summary(model2)
```
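One possible way to inspect the estimates and to repeat the variable-specific model for every incompletely observed variable is sketched below. It runs on simulated data in which missingness in `education` depends on `gender` by construction. The data frame `toy`, the variable `age`, and the missingness probabilities are assumptions made for the illustration, and the likelihood ratio test is just one of several reasonable ways to assess the predictors jointly.

```r
# Simulated toy data (hypothetical; not the course data)
set.seed(2)
n <- 500
toy <- data.frame(
  gender  = factor(sample(c("F", "M"), n, replace = TRUE)),
  partner = factor(sample(c("no", "yes"), n, replace = TRUE)),
  age     = round(rnorm(n, mean = 45, sd = 12))
)
# age: missing completely at random for about 20% of the rows
toy$age[runif(n) < 0.2] <- NA
# education: by construction more often missing for men
p_miss <- ifelse(toy$gender == "M", 0.4, 0.1)
toy$education <- ifelse(runif(n) < p_miss, NA, "observed")

# Variable-specific non-response model for education
fit <- glm(is.na(education) ~ gender + partner, data = toy, family = "binomial")
print(summary(fit)$coefficients)  # estimates on the log-odds scale
print(exp(coef(fit)))             # the same estimates as odds ratios

# Likelihood ratio test against a model with no predictors
fit0 <- update(fit, . ~ 1)
print(anova(fit0, fit, test = "Chisq"))

# Repeat the model for every variable with missing values
vars_with_na <- names(toy)[colSums(is.na(toy)) > 0]
fits <- lapply(vars_with_na, function(v) {
  f <- as.formula(paste0("is.na(", v, ") ~ gender + partner"))
  glm(f, data = toy, family = "binomial")
})
names(fits) <- vars_with_na

# One column of coefficients per variable
print(sapply(fits, coef))
```

If observations were missing completely at random, none of the predictors should be systematically related to the missingness indicator, so clearly non-zero estimates speak against that assumption.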