Getting started

1. Main analysis

Run this code to load the causalDisco package and the example dataset (we will load our jointly made expert graph in a moment):

# load the causalDisco package
library("causalDisco")

# load the data
load(url("https://biostatistics.dk/teaching/advtopics/data/nlsdata.rda"))

You will now have an object available in your global environment: nlsdata (the dataset).

1.1 Inspect data and expert graph

  • Have a quick look at the dataset, e.g. by using the summary() command on it, to make sure it has been loaded correctly. You can find more information about the dataset in the previous exercise (“Human generated DAG”).
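A quick inspection might look like this (summary() and str() are standard base R functions):

```r
# overview of the distributions of all variables
summary(nlsdata)

# variable types and dimensions
str(nlsdata)
```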

Now, load the expert graph:

# load the expert graph
load(url("https://biostat.ku.dk/ahp/adv/expertm.rda"))

You will now have a new object available: expertm, an adjacency matrix representation of the “expert graph” we constructed together.

  • Have a look at the expert graph and check that it corresponds to the graph we have drawn together. You can plot it in two different ways, try both: plot(expertm) and tplot(expertm).

1.2 Run PC

Run the PC algorithm using significance level \(\alpha = 0.05\):

# run the PC algorithm; the sparsity argument sets the significance level alpha
pcres <- pc(nlsdata, sparsity = 0.05)

Plot your results using either plot() or tplot() (see code example below), and have a look at the resulting graph.

  • Do you find any meaningful proposed direct causal relationships? What about indirect (mediated) causal relationships?
  • Do you find any proposed causal relationships you do not think are correct?

Next, compare the CPDAG from the PC algorithm with our jointly made expert graph:

  • Where do the two graphs agree on adjacencies (i.e., presence of an edge, disregarding its orientation)?
  • Where do the two graphs disagree on adjacencies?
  • Among the adjacencies included in both graphs, where do the two agree concerning the orientation of the edge? Where do they disagree?
  • Which graph do you trust more to be a good representation of the data-generating mechanism?


# CODE HINTS:

# plot PC algorithm results with temporal layout
plot(tamat(amat(pcres), order = c("r1", "r6", "r12")))
tplot(tamat(amat(pcres), order = c("r1", "r6", "r12")))

# if you want to save the plots, you can use the option "keepfiles = TRUE" and 
# perhaps add an optional file name. This places a file nlsdata-pc-plots.pdf
# with the output figure in your working directory
tplot(tamat(amat(pcres), order = c("r1", "r6", "r12")), keepfiles = TRUE,
      file = "nlsdata-pc-plots")

2. Additional analyses

The following exercises 2.1-2.4 can be done in whichever order you prefer. There are way too many for you to do all of them – choose the ones you find interesting!

The exercises cover these topics:

2.1 Comparing graphs

Here we will consider different ways to compare two graphs. First, we consider the confusion function for comparing edges, which computes a confusion matrix. We start by comparing adjacencies, i.e., where edges are placed but disregarding their orientations:

# first argument is the estimated graph, second is the "true" graph
confusion(pcres, expertm)

This outputs 4 numbers:

  1. The number of true positives (tp)

  2. The number of true negatives (tn)

  3. The number of false positives (fp)

  4. The number of false negatives (fn)

What do you think these numbers mean? If you get stuck, try having a look at the documentation for the confusion function by typing ?confusion.

Now try comparing edge directions:

confusion(pcres, expertm, type = "dir")

  • Which results do you get now?

  • In what way do you think these results are connected to the results above?

Again, you may have a look at the documentation for the confusion function if you get stuck.

2.2 Changing the significance level

In the main analysis we used a significance level (\(\alpha\)) of 0.05 to estimate the graphs. Rerun the analysis above using different significance levels (try e.g. 0.1 and 0.001) and investigate how the results change:

  • For which significance levels do the graphs become denser/sparser?

  • For which significance levels is the result closer to the expert graph?

  • Do the estimated graphs become more/less plausible?

  • When does it take more/less time to run the algorithm? Why do you think that is?
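The reruns can be sketched as a loop over a few significance levels, reusing the pc() call and plotting helpers from the main analysis (system.time() is base R and records how long each run takes):

```r
# rerun the PC algorithm at several significance levels and time each run
for (alpha in c(0.001, 0.05, 0.1)) {
  cat("alpha =", alpha, "\n")
  print(system.time(
    res <- pc(nlsdata, sparsity = alpha)
  ))
  # plot with the temporal layout used in the main analysis
  tplot(tamat(amat(res), order = c("r1", "r6", "r12")))
}
```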

2.3 Changing the test

So far, we have used the regTest option to test for conditional independence. This test accommodates mixed numeric/binary data. For purely numeric data, we can also use the corTest test, which tests for vanishing partial correlations. Note that if the data are truly jointly normally distributed, this is a test of conditional independence.

We now recode the binary variables to 0-1 variables (rather than factors), so we can run the corTest test on them:

nlsdata_num <- nlsdata
nlsdata_num$r1_urban <- ifelse(nlsdata$r1_urban == "Urban", 1, 0)
nlsdata_num$r1_mcollege <- ifelse(nlsdata$r1_mcollege == "Yes", 1, 0)
nlsdata_num$r6_depressed <- ifelse(nlsdata$r6_depressed == "Sometimes", 1, 0)
nlsdata_num$r12_health <- ifelse(nlsdata$r12_health == "Good", 1, 0)
nlsdata_num$r12_depressed <- ifelse(nlsdata$r12_depressed == "Sometimes", 1, 0)

And we run the PC algorithm again on this “new” data using the corTest option:

pcres_cortest <- pc(nlsdata_num, sparsity = 0.05, test = corTest)

Compare the results with what you found in exercise 1.2:

  • Which edges are shared and which edges differ?

  • Which estimated graphs do you find most plausible and why?
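One way to compare the two results is to plot both with the same temporal layout, reusing the plotting calls from the code hints in exercise 1.2:

```r
# plot both estimated graphs with the same temporal layout
tplot(tamat(amat(pcres), order = c("r1", "r6", "r12")))
tplot(tamat(amat(pcres_cortest), order = c("r1", "r6", "r12")))
```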

2.4 More variables or more observations?

Load in a new version of the NLS97 dataset:

# load the large dataset
load(url("https://biostatistics.dk/teaching/advtopics/data/nlsdata2.rda"))

The dataset nlsdata2 contains additional variables, but has missing values – also in some of the variables we have already worked with. In fact, nlsdata only included complete cases for the variables we have worked with so far.

  1. Get an overview of the data, and focus especially on missing information (you can e.g. run summary(nlsdata2)).
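To quantify the missingness, a small base-R sketch (is.na(), colSums() and complete.cases() are all base R):

```r
# number of missing values in each variable
colSums(is.na(nlsdata2))

# number of rows without any missing values
sum(complete.cases(nlsdata2))
```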

Missing data is a big problem in causal discovery (just like other statistical analyses) and there are multiple ways to deal with it (see e.g. Witte et al., 2022). There is always a trade-off between including additional variables that may have missing information and thereby reducing the number of complete observations, or excluding variables with missing information and thereby potentially leaving out important variables.

  2. Construct a new version of nlsdata (with the same variables as above) where missing information has not been removed. Count how many observations you have in this new dataset, and compare with the number of observations in the original nlsdata to see how many were removed due to missing information. (See code hint 1 below.)

  3. Apply the PC algorithm to the new dataset that includes missing information. You will get an error. Try instead using test-wise deletion to handle missing information. Test-wise deletion removes observations with missing information test-by-test in the conditional independence tests conducted by the PC algorithm. To find out how to do this, look at the help page for the pc() function (e.g., by typing ?pc). Compare the resulting graph with the one you got in exercise 1.2. Which do you find more plausible?

  4. Try constructing other subsets of the data with additional variables from nlsdata2 that you find interesting (see code hint 2 below). When you run PC, either remove all observations with missing information (using the argument methodNA = "cc") or use test-wise deletion. Think about the trade-off between including more variables and having less missing information. Are there specific important variables you think ought to be included for the analysis to be more valid?

It is also possible to use multiple imputation of missing values together with the PC algorithm, but this is more advanced and we will not cover it here. See Witte et al. (2022) for more information on multiple imputation for PC.

# Code hint 1: Construct new version of nlsdata with missing values
nlsdata_withmiss <- nlsdata2[, names(nlsdata)]

# Code hint 2: Make a new version of nlsdata_withmiss where you add more variables,
# here for example adding r1_health (perception of own health at round 1):
nlsdata_withmiss_extravars <- nlsdata_withmiss
nlsdata_withmiss_extravars$r1_health <- nlsdata2$r1_health
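Combining the two hints, a complete-case run on the extended subset could then be sketched as follows (reusing the pc() arguments from exercise 1.2 and the methodNA = "cc" option mentioned above):

```r
# run PC on the extended dataset, dropping all rows with missing values
pcres_extra <- pc(nlsdata_withmiss_extravars, sparsity = 0.05, methodNA = "cc")

# plot with the temporal layout
tplot(tamat(amat(pcres_extra), order = c("r1", "r6", "r12")))
```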

References

Janine Witte, Ronja Foraita, and Vanessa Didelez. "Multiple imputation and test-wise deletion for causal discovery with incomplete cohort data." Statistics in Medicine, 41(23):4716–4743, 2022.


Software for causal discovery in R

Anne Helby Petersen. “causalDisco: Tools for Causal Discovery on Observational Data”. R package version 0.9.4, https://github.com/annennenne/causalDisco, 2023.

Ronja Foraita and Janine Witte. “micd: Multiple Imputation in Causal Graph Discovery”, R package version 1.1.1, 2023.

See https://cran.r-project.org/web//packages/causalDisco/causalDisco.pdf and https://cran.r-project.org/web//packages/micd/micd.pdf for details. These packages use the pcalg package as a backend for most of the computations.

Another R package that provides causal discovery tools is bnlearn.