Data-Driven Model Building for Life-Course Epidemiology

Claus Thorn Ekstrøm
(joint work with Anne Helby Petersen and Merete Osler)

August 30, 2023

Purpose

Learn/infer (new) causal graphs from life-course data

  • Learn something new (not just confirmatory)
  • When to intervene?
  • On what?

How?

  • Extend the PC algorithm to temporal and finite data.

PC algorithm

Extending the PC algorithm

  1. After creating the skeleton but before orienting \(v\)-structures:
    Orient edges according to temporal information.

  2. No oracle assumption. Need to test conditional independencies.
    What test should we use to investigate \(X_i \perp X_j | Z_1, \ldots, Z_m\)?

  • \(M_{0a}: g(X_i) = \sum_{k=1}^m f_k(Z_k) \;\;\) vs \(\;\; M_{1a}: g(X_i) = f_0(X_j) + \sum_{k=1}^m f_k(Z_k)\)
  • \(M_{0b}: \widetilde{g}(X_j) = \sum_{k=1}^m \widetilde{f}_k(Z_k) \;\;\) vs \(\;\; M_{1b}: \widetilde{g}(X_j) = \widetilde{f}_0(X_i) + \sum_{k=1}^m \widetilde{f}_k(Z_k)\)
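
Step 1 above can be sketched in a few lines of Python. This is an illustration only, not the causalDisco implementation: the function name `orient_by_time`, the `period_of` mapping, and the example variable names are hypothetical. Edges between variables from different periods are directed from the earlier to the later period; edges within a period stay undirected and are left to the later orientation steps.

```python
# Periods in temporal order (taken from the case study later in the talk).
PERIODS = ["birth", "childhood", "youth", "adulthood", "early old age"]


def orient_by_time(skeleton, period_of):
    """Split skeleton edges into time-directed and still-undirected edges.

    skeleton  : set of frozensets, each holding two variable names
    period_of : dict mapping variable name -> period name (hypothetical input)
    """
    rank = {name: i for i, name in enumerate(PERIODS)}
    directed, undirected = set(), set()
    for edge in skeleton:
        a, b = sorted(edge)
        if rank[period_of[a]] < rank[period_of[b]]:
            directed.add((a, b))          # a is measured earlier, so a -> b
        elif rank[period_of[a]] > rank[period_of[b]]:
            directed.add((b, a))          # b is measured earlier, so b -> a
        else:
            undirected.add(frozenset((a, b)))  # same period: no temporal information
    return directed, undirected


# Hypothetical variables, for illustration only.
period_of = {
    "birth_weight": "birth",
    "school_test": "childhood",
    "bullied": "childhood",
    "depression": "early old age",
}
skeleton = {
    frozenset(("birth_weight", "school_test")),
    frozenset(("school_test", "bullied")),
    frozenset(("bullied", "depression")),
}
directed, undirected = orient_by_time(skeleton, period_of)
print(sorted(directed))
print([sorted(e) for e in undirected])
```

Because the orientation only uses the period ranks, an arrow can never point backwards in time, whatever the later orientation rules do with the remaining undirected edges.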

Testing conditional independence

\(g\) and \(\widetilde{g}\) are identity or logit link functions, depending on the type of the response variable.

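\[f_k(Z_k) = \left\{\begin{array}{cl}\alpha_k Z_k & \text{if } Z_k \text{ binary} \\ s_k(Z_k) & \text{if } Z_k \text{ numeric}\end{array}\right.\]

Here \(\alpha_k\) is a regression coefficient and \(s_k\) is a smooth function (e.g. a spline).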

Due to lack of symmetry we test

\[H_{0a}: M_{0a}=M_{1a} \text{ and } H_{0b}: M_{0b}=M_{1b}\]

Conclude conditional independence if at least one of the two hypotheses is not rejected.

This tests a necessary (but not sufficient) condition for conditional independence.
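
The two-directional test can be sketched as follows. This is a simplified stand-in, not the method used in the talk: every \(f_k\) is taken to be linear, both responses are numeric (identity link, Gaussian errors), and each nested pair of models is compared with a likelihood-ratio test on 1 degree of freedom. The function names `rss`, `lrt_pvalue`, and `symmetric_test` and the toy data are assumptions made for illustration. In this purely linear setting the two directions give the same statistic; they can differ once smooth terms or a logit link are involved, which is why both are tested.

```python
import math
from itertools import product


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def rss(y, predictors):
    """Residual sum of squares of y regressed on an intercept plus predictors."""
    n = len(y)
    basis = []  # orthogonal basis of the design columns (Gram-Schmidt)
    for col in [[1.0] * n] + [list(p) for p in predictors]:
        v = [float(c) for c in col]
        for b in basis:
            coef = dot(v, b) / dot(b, b)
            v = [vi - coef * bi for vi, bi in zip(v, b)]
        if dot(v, v) > 1e-12:  # skip numerically redundant columns
            basis.append(v)
    r = [float(yi) for yi in y]
    for b in basis:
        coef = dot(r, b) / dot(b, b)
        r = [ri - coef * bi for ri, bi in zip(r, b)]
    return dot(r, r)


def lrt_pvalue(y, extra, adjust):
    """P-value for adding `extra` to a linear model of y on `adjust`.

    Assumes the larger model does not fit perfectly (its RSS is > 0).
    """
    stat = len(y) * math.log(rss(y, adjust) / rss(y, adjust + [extra]))
    stat = max(stat, 0.0)                    # guard against rounding noise
    return math.erfc(math.sqrt(stat / 2.0))  # chi-square tail, 1 df


def symmetric_test(x_i, x_j, zs, psi):
    """Test both directions; independence if either test is not rejected."""
    p_a = lrt_pvalue(x_i, x_j, zs)  # M_0a vs M_1a
    p_b = lrt_pvalue(x_j, x_i, zs)  # M_0b vs M_1b
    p = max(p_a, p_b)
    return p, p >= psi


# Toy data: a full 2^3 factorial design, so the three components are
# exactly orthogonal and the outcome of each test is known in advance.
design = list(product([-1.0, 1.0], repeat=3))
z = [d[0] for d in design]
e_i = [d[1] for d in design]
e_j = [d[2] for d in design]
x_i = [a + b for a, b in zip(z, e_i)]
x_j = [a + b for a, b in zip(z, e_j)]                         # independent of x_i given z
x_j_dep = [a + b + 0.5 * c for a, b, c in zip(z, e_i, e_j)]  # depends on x_i given z

print(symmetric_test(x_i, x_j, [z], psi=0.05))
print(symmetric_test(x_i, x_j_dep, [z], psi=0.05))
```

Taking the larger of the two p-values and comparing it with \(\psi\) is the same as concluding independence whenever at least one of the two hypotheses is not rejected.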

Choice of significance level

Instead of considering only a single significance level, we assess how the model develops over a sequence of significance levels

\[\psi \in \{0.05, 0.01, 0.001, 0.0001, \ldots \}\]

The result is then a sequence of life-course models, one for each \(\psi\).

These are not valid significance levels in the sense that they do not measure the risk of type I errors: the result of one test has implications for which other tests will be conducted.
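
To illustrate how one graph per threshold arises, and why the tests are not independent of each other, here is a minimal sketch of a simplified PC-style skeleton search that is rerun for each \(\psi\). It is not the causalDisco implementation: the function `skeleton`, the `max_cond` limit, and the p-values returned by `toy_pvalue` are invented for illustration. The search also counts how many tests it performs, since removing an edge early changes which conditioning sets are tried later.

```python
from itertools import combinations

LEVELS = [0.05, 0.01, 0.001, 0.0001]


def skeleton(variables, pvalue, psi, max_cond=2):
    """Simplified PC skeleton search; returns (edges, number of tests run)."""
    adj = {v: set(variables) - {v} for v in variables}  # start fully connected
    n_tests = 0
    for size in range(max_cond + 1):          # size of the conditioning set
        for x in variables:
            for y in sorted(adj[x]):          # sorted() copies, so removal is safe
                others = sorted(adj[x] - {y})  # current neighbours only
                for cond in combinations(others, size):
                    n_tests += 1
                    if pvalue(x, y, cond) >= psi:  # independence not rejected
                        adj[x].discard(y)
                        adj[y].discard(x)
                        break
    edges = {frozenset((x, y)) for x in variables for y in adj[x]}
    return edges, n_tests


def toy_pvalue(x, y, cond):
    """Made-up p-values mimicking a chain A - B - C plus a weak C - D link."""
    pair = frozenset((x, y))
    if pair in (frozenset("AB"), frozenset("BC")):
        return 1e-6                           # strong dependence
    if pair == frozenset("CD"):
        return 0.003                          # weak dependence
    if pair == frozenset("AC"):
        return 0.6 if "B" in cond else 0.003  # explained away by B
    return 0.8                                # everything else looks independent


variables = ["A", "B", "C", "D"]
graphs = {psi: skeleton(variables, toy_pvalue, psi) for psi in LEVELS}
for psi, (edges, n_tests) in graphs.items():
    print(psi, sorted("".join(sorted(e)) for e in edges), n_tests)
```

In this toy example the weak C - D edge survives at the two largest thresholds and disappears at the stricter ones, while the A - C edge is removed either by conditioning on B or directly in the marginal test, depending on \(\psi\).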

Case study: Metropolit cohort

  • Danish men born in 1953. Followed from birth until age 65.

  • Surveys at age 12 and 51.
    Extensive administrative register data from the Danish national registers. \(N = 2928\).

  • Consider 33 variables measured in 5 periods over the life course:
    birth, childhood (age \(\sim12\)), youth (age 18-30), adulthood (age \(\sim51\)), and early old age (age \(\sim65\)).

  • Outcome: clinical depression at age 65.

Summary and reference

Extended the PC algorithm such that

  • it incorporates temporal causal information
  • it includes flexible, fast testing of conditional independence (necessary but not sufficient)

Metropolit data: early imprinting might not be the biggest problem.

Implemented in the causalDisco R package.

Reference

Petersen, Anne Helby, Merete Osler, and Claus Thorn Ekstrøm (2021). “Data-Driven Model Building for Life Course Epidemiology”. American Journal of Epidemiology