1. Main analysis
Run this code to load the causalDisco package and the
example dataset:
# load the causalDisco package
library("causalDisco")
# load the data
load(url("https://biostatistics.dk/teaching/advtopics/data/nlsdata.rda"))
You will now have an object available in your global environment:
nlsdata (the dataset).
1.1 Inspect data and expert graph
- Have a quick look at the dataset, e.g. by using the
summary() command on it, to make sure it has been loaded
correctly. You can find more information about the dataset in the
previous exercise (“Human generated DAG”).
Now, load the expert graph:
# load the expert graph
load(url("https://biostat.ku.dk/ahp/adv/expertm.rda"))
You will now have a new object available: expertm, an
adjacency matrix representation of the “expert graph” we constructed
together.
- Have a look at the expert graph and check that it corresponds to the
graph we have drawn together. You can plot it in two different ways, try
both:
plot(expertm) and tplot(expertm).
1.2 Run PC
Run the PC algorithm using significance level \(\alpha = 0.05\):
# run the PC algorithm
pcres <- pc(nlsdata, sparsity = 0.05)
Plot your result using either plot() or
tplot() (see code example below), and have a look at the
resulting graph.
- Do you find any meaningful proposed direct causal relationships?
What about indirect (mediated) causal relationships?
- Do you find any proposed causal relationships you do not think are
correct?
Next, compare the CPDAG from the PC algorithm with our jointly made
expert graph:
- Where do the two graphs agree on adjacencies (i.e., presence of an
edge, disregarding its orientation)?
- Where do the two graphs disagree on adjacencies?
- Among the adjacencies included in both graphs, where do the two
agree concerning the orientation of the edge? Where do they
disagree?
- Which graph do you trust more to be a good representation of the
data-generating mechanism?
# CODE HINTS:
# plot PC algorithm results with temporal layout
plot(tamat(amat(pcres), order = c("r1", "r6", "r12")))
tplot(tamat(amat(pcres), order = c("r1", "r6", "r12")))
# if you want to save the plots, you can use the option "keepfiles = TRUE" and
# perhaps add an optional file name. This places a file nlsdata-pc-plots.pdf
# with the output figure in your working directory
tplot(tamat(amat(pcres), order = c("r1", "r6", "r12")), keepfiles = TRUE,
file = "nlsdata-pc-plots")
2. Additional analyses
The following exercises 2.1-2.4 can be done in whichever order you
prefer. There are way too many for you to do all of them – choose the
ones you find interesting!
The exercises cover these topics:
- 2.1: Try a more quantitative approach to comparing graphs
- 2.2: Run PC with other choices of test significance level
- 2.3: Run PC using a different test of conditional independence
- 2.4: Try adding more variables to the dataset (but subject to more
missing information)
2.1 Comparing graphs
Here we will consider different ways to compare two graphs. First, we
consider the confusion function for comparing edges, which
computes a confusion
matrix. We start by comparing adjacencies, i.e., where edges are
placed but disregarding their orientations:
# first argument is the estimated graph, second is the "true" graph
confusion(pcres, expertm)
This outputs 4 numbers:
- The number of true positives (tp)
- The number of true negatives (tn)
- The number of false positives (fp)
- The number of false negatives (fn)
What do you think these numbers mean? If you get stuck, try having a
look at the documentation for the confusion function by
typing ?confusion.
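Once you have worked out what the four counts mean, you can combine them into summary metrics. A minimal sketch (the counts below are made-up example values; substitute the numbers you read off from the confusion() output):

```r
# Made-up example counts -- replace with the values from confusion()
tp <- 10; tn <- 20; fp <- 5; fn <- 3
precision <- tp / (tp + fp)  # share of estimated adjacencies also present in the expert graph
recall <- tp / (tp + fn)     # share of expert-graph adjacencies recovered by PC
precision
recall
```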
Now try comparing edge directions:
confusion(pcres, expertm, type = "dir")
Again, you may have a look at the documentation for the
confusion function if you get stuck.
2.2 Changing the significance level
In the main analysis we used a significance level (\(\alpha\)) of 0.05 to estimate the graphs.
Rerun the analysis above using different significance levels (try
e.g. 0.1 and 0.001) and investigate how the results change:
- For which significance levels do the graphs become
denser/sparser?
- For which significance levels is the result closer to the expert
graph?
- Do the estimated graphs become more/less plausible?
- When does it take more/less time to run the algorithm? Why do you
think that is?
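To avoid repeating yourself, you can loop over the significance levels. A sketch using only the functions from the main analysis (assumes nlsdata is loaded):

```r
# Rerun PC for several significance levels and plot each result
for (a in c(0.1, 0.05, 0.001)) {
  res <- pc(nlsdata, sparsity = a)
  plot(tamat(amat(res), order = c("r1", "r6", "r12")))
}
```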
2.3 Changing the test
So far, we have used the regTest option to test for
conditional independence. This test accommodates mixed numeric/binary
data. For purely numeric data, we can also use the test
corTest, which tests for vanishing partial correlations. Note
that if the data are truly jointly normally distributed, this is a test
of conditional independence.
We now recode the binary variables to 0-1 variables (rather than
factors), so that corTest can be applied to them:
nlsdata_num <- nlsdata
nlsdata_num$r1_urban <- ifelse(nlsdata$r1_urban == "Urban", 1, 0)
nlsdata_num$r1_mcollege <- ifelse(nlsdata$r1_mcollege == "Yes", 1, 0)
nlsdata_num$r6_depressed <- ifelse(nlsdata$r6_depressed == "Sometimes", 1, 0)
nlsdata_num$r12_health <- ifelse(nlsdata$r12_health == "Good", 1, 0)
nlsdata_num$r12_depressed <- ifelse(nlsdata$r12_depressed == "Sometimes", 1, 0)
And we run the PC algorithm again on this “new” data using the
corTest option:
pcres_cortest <- pc(nlsdata_num, sparsity = 0.05, test = corTest)
Compare the results with what you found in Exercise 1.2.
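To make the comparison easier, you can plot the new result with the same temporal layout as in Exercise 1.2:

```r
# plot the corTest-based result with the temporal layout from the main analysis
tplot(tamat(amat(pcres_cortest), order = c("r1", "r6", "r12")))
```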
2.4 More variables or more observations?
Load in a new version of the NLS97 dataset:
# load the large dataset
load(url("https://biostatistics.dk/teaching/advtopics/data/nlsdata2.rda"))
The dataset nlsdata2 contains additional variables, but
has missing values – also in some of the variables we have already
worked with. In fact, nlsdata includes only the complete
cases for these variables.
- Get an overview of the data, and focus especially on missing
information (you can e.g. run
summary(nlsdata2)).
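Beyond summary(), a quick base-R way to see how much information is missing:

```r
# number of missing values per variable
colSums(is.na(nlsdata2))
# number of complete observations (rows with no missing values)
sum(complete.cases(nlsdata2))
```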
Missing data is a big problem in causal discovery (just as in other
statistical analyses), and there are multiple ways to deal with it (see
e.g. Witte et al., 2022). There is always a trade-off between including
additional variables that may have missing information, thereby
reducing the number of complete observations, and excluding variables
with missing information, thereby potentially leaving out important
variables.
- Construct a new version of nlsdata (with the same
variables as above), where missing information has not been removed.
Count how many observations you have in this new dataset, and compare
with the number of observations in the original nlsdata to
count how many have been removed due to missing information. (See code
hint 1 below).
- Apply the PC algorithm to the new dataset that includes missing
information. You will get an error. Try instead using test-wise deletion
to handle missing information. Test-wise deletion removes observations
with missing information test-by-test in the conditional independence
tests conducted by the PC algorithm. To find out how to do this, look at
the help page for the pc() function (e.g., by typing
?pc). Compare the resulting graph with the one you got in
Exercise 1.2. Which do you find more plausible?
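If you get stuck, a call along the following lines should work – but note that the argument value "twd" is an assumption on our part; check ?pc for the exact value the package expects:

```r
# PC with test-wise deletion ("twd" is assumed here -- verify in ?pc)
pcres_twd <- pc(nlsdata_withmiss, sparsity = 0.05, methodNA = "twd")
plot(tamat(amat(pcres_twd), order = c("r1", "r6", "r12")))
```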
- Try to construct other subsets of the data with additional
variables from nlsdata2 that you find interesting (see code
hint 2 below), and when you run PC, either remove all missing
information (using the argument methodNA = "cc") or use
test-wise deletion. Think about the trade-off between including more
variables or having less missing information. Are there specific
important variables you think ought to be included for the analysis to
be more valid?
It is also possible to use multiple imputation of missing values
together with the PC algorithm, but this is more advanced and we will
not cover it here. See Witte et al. (2022) for more information on
multiple imputation for PC.
# Code hint 1: Construct new version of nlsdata with missing values
nlsdata_withmiss <- nlsdata2[, names(nlsdata)]
# Code hint 2: Make a new version of nlsdata_withmiss where you add more variables,
# here for example adding r1_health (perception of own health at round 1):
nlsdata_withmiss_extravars <- nlsdata_withmiss
nlsdata_withmiss_extravars$r1_health <- nlsdata2$r1_health
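Two more sketches building on the code hints above: the first counts the observations lost to missingness, and the second runs PC on the extended data using complete-case analysis via the methodNA = "cc" argument mentioned in the exercise (both assume the objects from the code hints exist):

```r
# How many observations were removed from nlsdata due to missing information?
nrow(nlsdata_withmiss) - nrow(nlsdata)

# Run PC on the extended data, dropping all incomplete rows (complete-case analysis)
pcres_extra <- pc(nlsdata_withmiss_extravars, sparsity = 0.05, methodNA = "cc")
```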
References
Janine Witte, Ronja Foraita, and Vanessa Didelez. "Multiple
imputation and test-wise deletion for causal discovery with incomplete
cohort data." Statistics in Medicine 41(23): 4716-4743, 2022.