1. Main analysis
Run this code to load the causalDisco package, the example dataset, and our jointly made expert graph:
# load the causalDisco package
library("causalDisco")
# load the data
load(url("https://biostatistics.dk/teaching/eurocim/data/nlsdata.rda"))
# load the expert graph
load(url("https://biostatistics.dk/teaching/eurocim/data/expertm.rda"))
You will now have two objects available in your global environment: nlsdata (the dataset) and expertm (an adjacency matrix representation of the “expert graph” we constructed together).
The example code provided below uses the example data nlsdata, but you can use whichever dataset you prefer if you would rather try the functions on your own data. Note that exercise 2.5 shows how you can type in your own expert graph and use it in the package.
1.1 Inspect data and expert graph
Have a quick look at the dataset, e.g. by using the summary() command on it, to make sure it has been loaded correctly. You can find more information about the dataset here.
Have a look at the expert graph and check that it corresponds to the graph we have drawn together. You can plot it in two different ways; try both: plot(expertm) and tplot(expertm). A code sketch for these checks follows below.
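A minimal sketch of these checks, using only base R and the two plotting functions mentioned above:
# variable-by-variable summary of the dataset
summary(nlsdata)
# compact overview of the variable types
str(nlsdata)
# draw the expert graph in both available ways
plot(expertm)
tplot(expertm)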
1.2 Run PC
Run the PC algorithm using sparsity = 0.05:
# run the PC algorithm
pcres <- pc(nlsdata, sparsity = 0.05)
Plot your result using either plot() or tplot() (see the code example below), and have a look at the resulting graph.
- Do you find any meaningful proposed direct causal relationships? What about indirect (mediated) causal relationships?
- Do you find any proposed causal relationships you do not think are correct?
Next, compare the CPDAG from the PC algorithm with our jointly made
expert graph:
- Where do the two graphs agree on adjacencies (i.e., presence of an
edge, disregarding its orientation)?
- Where do the two graphs disagree on adjacencies?
- Among the adjacencies included in both graphs, where do the two
agree concerning the orientation of the edge? Where do they
disagree?
- Which graph do you trust more to be a good representation of the data generating mechanism?
# plot PC algorithm results with temporal layout
plot(tamat(amat(pcres), order = c("r1", "r6", "r12")))
tplot(tamat(amat(pcres), order = c("r1", "r6", "r12")))
# if you want to save the plots, you can use the option "keepfiles = TRUE" and
# perhaps add an optional file name. This places a file nlsdata-pc-plots.pdf
# with the output figure in your working directory
tplot(tamat(amat(pcres), order = c("r1", "r6", "r12")), keepfiles = TRUE,
      file = "nlsdata-pc-plots")
2. Additional analyses
The following exercises 2.1-2.5 can be done in whichever order you
prefer. There are way too many for you to do all of them – choose the
ones you find interesting!
The exercises cover these topics:
- 2.1: Try a more quantitative approach to comparing graphs
- 2.2: Run PC with other choices of test/sparsity level
- 2.3: Run PC using a different test of conditional independence
- 2.4: Try adding more variables to the dataset (but subject to more
missing information)
- 2.5: Understand how DAGs/CPDAGs are represented by adjacency
matrices
2.1 Comparing graphs
Here we will consider different ways to compare two graphs. First, we consider the confusion function for comparing edges, which computes a confusion matrix:
# first argument is the estimated graph, second is the "true" graph
confusion(pcres, expertm)
This outputs 4 numbers:
- The number of true positives (tp)
- The number of true negatives (tn)
- The number of false positives (fp)
- The number of false negatives (fn)
What do you think these numbers mean? Try having a look at the documentation for the confusion function by typing ?confusion if you get stuck.
Now try comparing edge directions:
confusion(pcres, expertm, type = "dir")
Again, you may have a look at the documentation for the confusion function if you get stuck.
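If you want single summary numbers rather than raw counts, you can compute precision and recall from the confusion counts. A minimal sketch, assuming the output of confusion() can be coerced to a named vector with entries tp, tn, fp and fn (check ?confusion for the actual structure):
# derive precision and recall from the confusion counts
cm <- unlist(confusion(pcres, expertm))
cm["tp"] / (cm["tp"] + cm["fp"])   # precision
cm["tp"] / (cm["tp"] + cm["fn"])   # recall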
2.2 Changing the sparsity level
In the main analysis we used a sparsity level of 0.05 to estimate the graphs. Rerun the analysis above using different sparsity levels (try e.g. 0.1 and 0.001) and investigate how the results change (a code sketch follows after these questions):
- For which sparsity levels do the graphs become denser/sparser?
- For which sparsity levels is the result closer to the expert graph?
- Do the estimated graphs become more/less plausible?
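A sketch of the reruns, reusing only functions already introduced above:
# rerun PC with a smaller and a larger sparsity level
pcres_001 <- pc(nlsdata, sparsity = 0.001)
pcres_01 <- pc(nlsdata, sparsity = 0.1)
# inspect the results with the temporal layout
tplot(tamat(amat(pcres_001), order = c("r1", "r6", "r12")))
tplot(tamat(amat(pcres_01), order = c("r1", "r6", "r12")))
# compare each result with the expert graph
confusion(pcres_001, expertm)
confusion(pcres_01, expertm)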
2.3 Changing the test
The default test for conditional independence in the pc function is the regTest, which accommodates mixed numeric/binary data. For numeric data, the function also offers the test corTest, which tests for vanishing partial correlations. Note that if the data are truly jointly normally distributed, this is a test of conditional independence.
We now recode the binary variables to 0-1 variables (rather than factors), so we can run the corTest on them:
# copy the data and recode the binary factors as 0-1 numeric variables
nlsdata_num <- nlsdata
nlsdata_num$r1_urban <- ifelse(nlsdata$r1_urban == "Urban", 1, 0)
nlsdata_num$r1_mcollege <- ifelse(nlsdata$r1_mcollege == "Yes", 1, 0)
nlsdata_num$r6_depressed <- ifelse(nlsdata$r6_depressed == "Sometimes", 1, 0)
nlsdata_num$r12_health <- ifelse(nlsdata$r12_health == "Good", 1, 0)
nlsdata_num$r12_depressed <- ifelse(nlsdata$r12_depressed == "Sometimes", 1, 0)
And we run the PC algorithm again on this “new” data using the corTest option:
pcres_cortest <- pc(nlsdata_num, sparsity = 0.05, test = corTest)
Compare the results with what you found in exercise 1.2.
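For example, you can reuse the temporal plotting code from exercise 1.2 (a sketch; nlsdata_num has the same variable names, so the same period prefixes apply):
# plot the corTest result with the temporal layout from exercise 1.2
tplot(tamat(amat(pcres_cortest), order = c("r1", "r6", "r12")))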
The pc function also allows the user to apply their own independence tests. It then requires the arguments suffStat and varnames instead of the data argument. The micd package (Foraita and Witte, 2023; Andrews et al., 2021) provides an alternative, flexible conditional independence test. The package includes the functions getSuff and flexCItest, which we can use directly in the pc function:
# load the micd package
library(micd)
# run the PC algorithm with flexCItest
pcres_flextest <- pc(suffStat = getSuff(nlsdata, test = "flexCItest"),
                     sparsity = 0.05,
                     test = flexCItest,
                     varnames = names(nlsdata))
Again, compare the results with what you found in exercise 1.2.
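As before, a sketch reusing the plotting code from exercise 1.2:
# plot the flexCItest result with the temporal layout
tplot(tamat(amat(pcres_flextest), order = c("r1", "r6", "r12")))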
2.4 More variables or more observations?
Load in a new version of the NLS97 dataset:
# load the large dataset
load(url("https://biostatistics.dk/teaching/eurocim/data/nlsdata2.rda"))
The dataset nlsdata2 contains more variables, but has missing values.
- Get an overview of the data, and focus especially on missing information (you can e.g. run summary(nlsdata2); see the sketch below).
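A minimal sketch for this overview, using only base R; colSums(is.na(...)) counts the missing values per variable:
# variable-by-variable summary, including NA counts
summary(nlsdata2)
# number of missing values for each variable
colSums(is.na(nlsdata2))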
Missing data is a big problem in causal discovery (just like in other statistical analyses) and there are multiple ways to deal with it (see e.g. Witte et al., 2022), but for now we will do a complete case analysis:
- Remove observations with missing values using the function na.omit() and count how many observations remain (using e.g. nrow()).
- Estimate a CPDAG using the PC algorithm on the fully observed version of the nlsdata2 dataset and compare it with your CPDAG from exercise 1.2. Reflect on how selection bias might be affecting your results.
- Try to construct other subsets of the data and run PC on them using the methodNA = "cc" argument to ensure complete case analysis is used. Think about the trade-off between including more variables and having less missing information. Are there specific important variables you think need to be included for the analysis to be more valid? (A code sketch follows after this list.)
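A minimal sketch of the complete case analysis described above (which columns to subset on is up to you):
# complete case analysis by hand
nlsdata2_cc <- na.omit(nlsdata2)
nrow(nlsdata2_cc)  # how many observations remain?
pcres2 <- pc(nlsdata2_cc, sparsity = 0.05)
# alternatively, let pc() do the complete case handling itself
pcres2_cc <- pc(nlsdata2, sparsity = 0.05, methodNA = "cc")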
2.5 Adjacency matrices
The causalDisco package uses adjacency matrices to represent graphs.
Look at the provided adjacency matrix expertm, and compare it with the output of plot(expertm). What do you think a 0 in the adjacency matrix means? What does a 1 mean?
Test your hypothesis about the meaning of 0s and 1s by comparing the adjacency matrix found in pcres with the result of plot(pcres) (see the sketch below).
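A sketch for putting the two representations side by side:
# the expert graph: adjacency matrix vs. drawn graph
expertm
plot(expertm)
# the PC result: extract its adjacency matrix and compare with the plot
amat(pcres)
plot(pcres)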
Here is some code for constructing a new (temporal) adjacency matrix for an imaginary dataset. See if you can draw the resulting CPDAG by hand, and compare your drawing with the output of plot(extra_cpdag).
# type in the adjacency matrix
thism <- matrix(c(0, 0, 0, 0,
                  0, 0, 0, 1,
                  1, 1, 0, 0,
                  0, 1, 0, 0), 4, 4, byrow = TRUE)
# add variable names, prefixed with their time periods (p1, p2)
rownames(thism) <- colnames(thism) <- c("p1_X1", "p1_X2", "p2_X3", "p2_X4")
# define temporal adjacency matrix object
extra_cpdag <- tamat(thism, order = c("p1", "p2"))
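And to check your hand-drawn graph:
# draw the CPDAG encoded by the matrix above
plot(extra_cpdag)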
References
Ryan M. Andrews, Ronja Foraita, Vanessa Didelez, and Janine Witte. “A practical guide to causal discovery with cohort data”. arXiv preprint arXiv:2108.13395, https://arxiv.org/abs/2108.13395, 2021.
Diego Colombo and Marloes H Maathuis. “Order-independent
constraint-based causal structure learning”. Journal of Machine
Learning Research, 15(1):3741–3782, 2014.
Ronja Foraita and Janine Witte. “micd: Multiple Imputation in Causal
Graph Discovery”, R package version 1.1.1, 2023.
Christopher Meek. “Causal inference and causal explanation with
background knowledge”. Proceedings of the Eleventh conference on
Uncertainty in artificial intelligence, pages 403–410, 1995.
Anne Helby Petersen. “causalDisco: Tools for Causal Discovery on
Observational Data”. R package version 0.9.2, https://github.com/annennenne/causalDisco, 2023.
Janine Witte, Ronja Foraita, and Vanessa Didelez. “Multiple imputation and test-wise deletion for causal discovery with incomplete cohort data”. Statistics in Medicine, 41(23):4716–4743, 2022.
See https://cran.r-project.org/web//packages/causalDisco/causalDisco.pdf
and https://cran.r-project.org/web//packages/micd/micd.pdf
for details. These packages use the pcalg package as a backend for most of the computations; details about pcalg can be found on CRAN. Another R package that offers causal discovery tools is bnlearn.