Getting started

1. Main analysis

Run this code to load the causalDisco package, the example dataset, and our jointly made expert graph:

# load the causalDisco package
library("causalDisco")

# load the data
load(url("https://biostatistics.dk/teaching/advtopicsB/data/nlsdata.rda"))

# load the expert graph
load(url("https://biostat.ku.dk/ahp/advb/expertm.rda"))

You will now have two objects available in your global environment: nlsdata (the dataset) and expertm (an adjacency matrix representation of the “expert graph” we constructed together).

The example code below uses the example data nlsdata, but you are welcome to try the functions on your own data instead. Note that exercise 2.5 shows how to type in your own expert graph and use it in the package.

1.1 Inspect data and expert graph

  • Have a quick look at the dataset, e.g. by using the summary() command on it, to make sure it has been loaded correctly. You can find more information about the dataset in the previous exercise (“Human generated DAG”).

  • Have a look at the expert graph and check that it corresponds to the graph we have drawn together. You can plot it in two different ways, try both: plot(expertm) and tplot(expertm).
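For example, using the commands mentioned above:

# summarize the dataset to check it loaded correctly
summary(nlsdata)

# plot the expert graph in two different ways
plot(expertm)
tplot(expertm)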

1.2 Run PC

Run the PC algorithm using sparsity = 0.05:

# run the PC algorithm
pcres <- pc(nlsdata, sparsity = 0.05)

Plot your result using either plot() or tplot() (see the code example below), and have a look at the resulting graph.

  • Do you find any meaningful proposed direct causal relationships? What about indirect (mediated) causal relationships?
  • Do you find any proposed causal relationships that you do not think are correct?

Next, compare the CPDAG from the PC algorithm with our jointly made expert graph:

  • Where do the two graphs agree on adjacencies (i.e., presence of an edge, disregarding its orientation)?
  • Where do the two graphs disagree on adjacencies?
  • Among the adjacencies included in both graphs, where do the two agree concerning the orientation of the edge? Where do they disagree?
  • Which graph do you trust more to be a good representation of the data generating mechanism?


# plot PC algorithm results with temporal layout
plot(tamat(amat(pcres), order = c("r1", "r6", "r12")))
tplot(tamat(amat(pcres), order = c("r1", "r6", "r12")))

# if you want to save the plots, you can use the option "keepfiles = TRUE" and 
# perhaps add an optional file name. This places a file nlsdata-pc-plots.pdf
# with the output figure in your working directory
tplot(tamat(amat(pcres), order = c("r1", "r6", "r12")), keepfiles = TRUE,
      file = "nlsdata-pc-plots")

2. Additional analyses

The following exercises 2.1-2.5 can be done in whichever order you prefer. There are way too many for you to do all of them – choose the ones you find interesting!

The exercises cover these topics:

  • 2.1 Comparing graphs
  • 2.2 Changing the sparsity level
  • 2.3 Changing the test
  • 2.4 More variables or more observations?
  • 2.5 Adjacency matrices

2.1 Comparing graphs

Here we will consider different ways to compare two graphs. First, we consider the confusion function for comparing edges, which computes a confusion matrix:

# first argument is the estimated graph, second is the "true" graph
confusion(pcres, expertm)

This outputs 4 numbers:

  1. The number of true positives (tp)

  2. The number of true negatives (tn)

  3. The number of false positives (fp)

  4. The number of false negatives (fn)

What do you think these numbers mean? Try having a look at the documentation for the confusion function by typing ?confusion if you get stuck.

Now try comparing edge directions:

confusion(pcres, expertm, type = "dir")

  • Which results do you get now?

  • In what way do you think these results are conditional on the adjacency results above?

Again, you may have a look at the documentation for the confusion function if you get stuck.
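If you want to summarize the confusion matrix in single numbers, you can compute e.g. precision and recall by hand. The sketch below assumes that confusion() returns its four counts as named entries tp, tn, fp and fn; check ?confusion or inspect the output with str() to confirm:

# a sketch: summarize the adjacency confusion matrix
# (assumes the output has named entries tp, tn, fp, fn)
conf_adj <- confusion(pcres, expertm)
conf_adj$tp / (conf_adj$tp + conf_adj$fp)  # precision
conf_adj$tp / (conf_adj$tp + conf_adj$fn)  # recall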

2.2 Changing the sparsity level

In the main analysis we used a sparsity level of 0.05 to estimate the graphs. Rerun the analysis above using different sparsity levels (try e.g. 0.1 and 0.001) and investigate how the results change:

  • For which sparsity levels do the graphs become denser/sparser?

  • For which sparsity levels is the result closer to the expert graph?

  • Do the estimated graphs become more/less plausible?
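A minimal sketch for rerunning the analysis at the suggested sparsity levels, using only functions introduced above:

# rerun PC at different sparsity levels
pcres_sparse10  <- pc(nlsdata, sparsity = 0.1)
pcres_sparse001 <- pc(nlsdata, sparsity = 0.001)

# plot with the temporal layout, as in exercise 1.2
tplot(tamat(amat(pcres_sparse10), order = c("r1", "r6", "r12")))
tplot(tamat(amat(pcres_sparse001), order = c("r1", "r6", "r12")))

# compare each result to the expert graph
confusion(pcres_sparse10, expertm)
confusion(pcres_sparse001, expertm)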

2.3 Changing the test

The default test for conditional independence in the pc function is the regTest, which accommodates mixed numeric/binary data. For numeric data, the function also offers the test corTest, which tests for vanishing partial correlations. Note that if the data are truly jointly normally distributed, this is a test of conditional independence.

We now recode the binary variables to 0-1 variables (rather than factors), so we can run the corTest on them:

nlsdata_num <- nlsdata
nlsdata_num$r1_urban <- ifelse(nlsdata$r1_urban == "Urban", 1, 0)
nlsdata_num$r1_mcollege <- ifelse(nlsdata$r1_mcollege == "Yes", 1, 0)
nlsdata_num$r6_depressed <- ifelse(nlsdata$r6_depressed == "Sometimes", 1, 0)
nlsdata_num$r12_health <- ifelse(nlsdata$r12_health == "Good", 1, 0)
nlsdata_num$r12_depressed <- ifelse(nlsdata$r12_depressed == "Sometimes", 1, 0)
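
Before rerunning PC, it is worth checking that all variables are now numeric:

# check that all variables are numeric after recoding
all(sapply(nlsdata_num, is.numeric))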

We then run the PC algorithm again on this “new” dataset using the corTest option:

pcres_cortest <- pc(nlsdata_num, sparsity = 0.05, test = corTest)

Compare the results with what you found in exercise 1.2:

  • Which edges are shared and which edges differ?

  • Which estimated graphs do you find most plausible and why?
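One simple way to do the comparison is to plot both CPDAGs with the temporal layout from exercise 1.2:

# plot both CPDAGs for visual comparison
tplot(tamat(amat(pcres), order = c("r1", "r6", "r12")))
tplot(tamat(amat(pcres_cortest), order = c("r1", "r6", "r12")))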

The pc function allows the user to apply their own independence tests as well. It then requires the arguments suffStat and varnames instead of the data argument. The micd package (Foraita and Witte, 2023; Andrews et al., 2021) provides an alternative flexible conditional independence test. The package includes the functions getSuff and flexCItest, which we can use directly in the pc function:

# load the micd package
library(micd)

# run the PC algorithm with flexCItest
pcres_flextest <- pc(suffStat = getSuff(nlsdata, test = "flexCItest"),
                     sparsity = 0.05,
                     test = flexCItest,
                     varnames = names(nlsdata))

Again, compare the results with what you found in exercise 1.2:

  • Which edges are shared and which edges differ?

  • Which estimated graphs do you find most plausible and why?
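As before, you can inspect the new result with the temporal layout and compare it to the expert graph:

# plot the flexCItest result and compare it to the expert graph
tplot(tamat(amat(pcres_flextest), order = c("r1", "r6", "r12")))
confusion(pcres_flextest, expertm)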

2.4 More variables or more observations?

Load in a new version of the NLS97 dataset:

#load the large dataset
load(url("https://biostatistics.dk/teaching/advtopicsB/data/nlsdata2.rda"))

The dataset nlsdata2 contains more variables, but has missing values.

  1. Get an overview of the data, and focus especially on missing information (you can e.g. run summary(nlsdata2)).

Missing data is a big problem in causal discovery (just as in other statistical analyses) and there are multiple ways to deal with it (see e.g. Witte et al., 2022), but for now we will do a complete case analysis:

  2. Remove observations with missing values using the function na.omit() and count how many observations remain (using e.g. nrow()).

  3. Estimate a CPDAG using the PC algorithm on the fully observed version of the nlsdata2 dataset and compare it with your CPDAG from exercise 1.2. Reflect on how selection bias might affect your results.

  4. Try constructing other subsets of the data and run PC on them using the methodNA = "cc" argument to ensure a complete case analysis is used (see the sketch after this list). Think about the trade-off between including more variables and having less missing information. Are there specific important variables you think need to be included for the analysis to be more valid?
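
A minimal sketch covering steps 2-4 (the variable subset in the last call is only a hypothetical example; choose your own):

# step 2: complete case analysis
nlsdata2_cc <- na.omit(nlsdata2)
nrow(nlsdata2_cc)  # how many observations remain?

# step 3: run PC on the complete cases
pcres2 <- pc(nlsdata2_cc, sparsity = 0.05)

# step 4: run PC on a subset of the variables, letting pc() drop
# incomplete observations via methodNA = "cc"
# (the column selection below is hypothetical; pick your own subset)
pcres2_sub <- pc(nlsdata2[, 1:5], sparsity = 0.05, methodNA = "cc")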

2.5 Adjacency matrices

The causalDisco package uses adjacency matrices to represent graphs.

  1. Look at the provided adjacency matrix expertm, and compare it with the output of plot(expertm). What do you think a 0 in the adjacency matrix means? What does a 1 mean?

  2. Test your hypothesis about the meaning of 0s and 1s by comparing the adjacency matrix found in pcres (e.g. extracted with amat(pcres)) with the result of plot(pcres).

  3. Here is some code for constructing a new (temporal) adjacency matrix for an imaginary dataset. See if you can draw the resulting CPDAG by hand, and compare your drawing with the output of plot(extra_cpdag).

    # type in matrix
    thism <- matrix(c(0, 0, 0, 0,
                      0, 0, 0, 1, 
                      1, 1, 0, 0, 
                      0, 1, 0, 0), 4, 4, byrow = TRUE)
    
    # add variable names with time period prefixes (p1_, p2_)
    rownames(thism) <- colnames(thism) <- c("p1_X1", "p1_X2", "p2_X3", "p2_X4")
    
    # define temporal adjacency matrix object
    extra_cpdag <- tamat(thism, order = c("p1", "p2"))
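
    Once you have made your drawing, plot the object to check it:

    # plot the CPDAG to compare with your hand-drawn version
    plot(extra_cpdag)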


References

Ryan M. Andrews, Ronja Foraita, Vanessa Didelez, and Janine Witte. “A practical guide to causal discovery with cohort data”. arXiv preprint arXiv:2108.13395, https://arxiv.org/abs/2108.13395, 2021.

Diego Colombo and Marloes H Maathuis. “Order-independent constraint-based causal structure learning”. Journal of Machine Learning Research, 15(1):3741–3782, 2014.

Ronja Foraita and Janine Witte. “micd: Multiple Imputation in Causal Graph Discovery”, R package version 1.1.1, 2023.

Christopher Meek. “Causal inference and causal explanation with background knowledge”. Proceedings of the Eleventh conference on Uncertainty in artificial intelligence, pages 403–410, 1995.

Anne Helby Petersen. “causalDisco: Tools for Causal Discovery on Observational Data”. R package version 0.9.2, https://github.com/annennenne/causalDisco, 2023.

Janine Witte, Ronja Foraita, and Vanessa Didelez. “Multiple imputation and test-wise deletion for causal discovery with incomplete cohort data”. Statistics in Medicine, 41(23):4716–4743, 2022.


See https://cran.r-project.org/web//packages/causalDisco/causalDisco.pdf and https://cran.r-project.org/web//packages/micd/micd.pdf for details. These packages use the pcalg package as a backend for most of the computations; see Colombo and Maathuis (2014) for details about pcalg.

Another R package that provides causal discovery tools is bnlearn.

