Previous seminars

Friday, April 05, 2024, 14:00

Mirthe van Diepen
Radboud University, Nijmegen, Netherlands
Detecting New Causal Risk Factors via Causal Discovery in Aortic Surgery

Understanding the causal relationships between demographic information and biomarkers can be extremely useful for gaining insight into causal risk factors in healthcare. It can motivate future studies to search for an intervention that lowers the risk or for possible treatment alternatives that can improve quality of life expectations. Using randomized controlled trials (RCTs), we can try to infer specific causal relationships. However, it is not always possible to directly intervene on (proxy) variables due to ethical reasons, or it is just impossible in practice. Causal discovery algorithms try to address this problem by searching for the causal structure between variables in an observational data set instead of using interventions on the variables. Nonetheless, the data analysis methods currently used in medical journals are usually not based on causal discovery, because its assumptions are difficult to test and the field requires non-intuitive definitions. In this research, we aim to show how to handle these using a specific case study that exhibits many of these challenges.

This study is motivated by a data set containing subjects who had aortic surgery at the St. Antonius Hospital in Nieuwegein. We use this data set to demonstrate what important steps are needed for the analysis. Challenges of this aortic surgery data set are (1) a small sample size, (2) a complex combination of very different variables, both discrete and continuous, (3) an unknown causal structure (there might be unknown confounders in the causal structure), (4) context variables and time-dependent variables (variables from the different phases in the perioperative period), and (5) missing values. We suggest how to combine the outputs of a causal discovery method with bootstrapping to make it more robust for small data sets, how to deal with context variables, and how to deal with mixed data.
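As a rough illustration of the bootstrapping idea, the sketch below resamples the rows of a data set, reruns a discovery step on each resample, and reports how often each edge appears. It is not the method presented in the talk: `toy_discover` is a hypothetical stand-in that merely links strongly correlated variables, and all names, thresholds, and data are assumptions made for this example.

```python
import random
from collections import Counter
from itertools import combinations
from math import sqrt


def pearson(x, y):
    """Pearson correlation; returns 0.0 if either variable is constant."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        return 0.0
    return sxy / sqrt(sxx * syy)


def toy_discover(rows, names, threshold=0.5):
    """Hypothetical stand-in for a real causal discovery algorithm.

    Returns undirected edges between variables whose absolute
    correlation is at least `threshold`.
    """
    cols = {name: [row[i] for row in rows] for i, name in enumerate(names)}
    return {
        frozenset((a, b))
        for a, b in combinations(names, 2)
        if abs(pearson(cols[a], cols[b])) >= threshold
    }


def bootstrap_edge_frequencies(rows, names, discover, n_boot=200, seed=0):
    """Fraction of bootstrap resamples in which each edge is found."""
    rng = random.Random(seed)
    counts = Counter()
    for _ in range(n_boot):
        # Resample rows with replacement, keeping the original sample size.
        sample = [rng.choice(rows) for _ in rows]
        counts.update(discover(sample, names))
    return {edge: count / n_boot for edge, count in counts.items()}


# Invented data: y is a deterministic function of x, z is unrelated noise-like.
names = ["x", "y", "z"]
rows = [(i, 2 * i, (i * 7) % 5) for i in range(30)]
freqs = bootstrap_edge_frequencies(rows, names, toy_discover)
```

Edges that appear in nearly every resample can be treated as more stable than edges that appear only occasionally, which is one way to temper the sensitivity of a single run on a small sample.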

Monday, March 04, 2024, 11:00

Søren Wengel Mogensen
Department of Automatic Control, Lund University
Graphical models of local independence in stochastic processes

Graphs are often used as representations of conditional independence structures of random vectors. In stochastic processes, one may use graphs to represent so-called local independence. Local independence is an asymmetric notion of independence which describes how a system of stochastic processes (e.g., point processes or diffusions) evolves over time. Let A, B, and C be three subsets of the coordinate processes of the stochastic system. Intuitively speaking, B is locally independent of A given C if at every point in time knowing the past of both A and C is not more informative about the present of B than knowing the past of C only. Directed graphs can be used to describe the local independence structure of the stochastic processes using a separation criterion which is analogous to d-separation. In such a local independence graph, each node represents an entire coordinate process rather than a single random variable.

In this talk, we will describe various properties of graphical models of local independence and then turn our attention to the case where the system is only partially observed, i.e., some coordinate processes are unobserved. In this case, one can use so-called directed mixed graphs to describe the local independence structure of the observed coordinate processes. Several directed mixed graphs may describe the same local independence model, and therefore it is of interest to characterize such equivalence classes of directed mixed graphs. It turns out that directed mixed graphs satisfy a certain maximality property which allows one to construct a simple graphical representation of an entire Markov equivalence class of marginalized local independence graphs. This is convenient as the equivalence class can be learned from data and its graphical representation concisely describes what underlying structure could have generated the observed local independencies.

Deciding Markov equivalence of two directed mixed graphs is computationally hard, and we introduce a class of equivalence relations that are weaker than Markov equivalence, i.e., lead to larger equivalence classes. The weak equivalence classes enjoy many of the same properties as the Markov equivalence classes, and they provide a computationally feasible framework while retaining a clear interpretation. We discuss how this can be used for graphical modeling and causal structure learning based on local independence.

Friday, February 23, 2024, 11:00

Ivana Malenica
Wojcicki Troper HDSI Fellow, Department of Statistics, Harvard University
Personalized Decision-Making in Highly Dependent Settings

Effective management of emerging and existing epidemics requires strategic decisions on where, when, and to whom interventions should be applied. However, personalized decision-making in infectious disease applications introduces new and unique statistical challenges. For instance, the individuals at risk of infection are unknown, the true outcome of interest (positive infection status) is often a latent variable, and the presence of complex dependence reduces data to a single observation. In this work, we investigate an adaptive sequential design under latent outcome structures and unspecified dependence through space and time. The statistical problem is addressed within a nonparametric model that respects the unknown dependence structure. I will begin by formalizing a treatment allocation strategy that utilizes up-to-date data to inform who is at risk of infection in real-time, with favorable theoretical properties. The optimal allocation strategy, or optimal policy, maximizes the mean latent outcome under a resource constraint. The proposed estimator learns the optimal policy over time and exploits the double-robust structure of the efficient influence function of the target parameters of interest. In the second part of the talk, I will present the study of data-adaptive inference on the mean under the optimal policy, where the target parameter adapts over time in response to the observed data (state of the epidemic). Lastly, I present a novel paradigm in nonparametric efficient estimation particularly suited for target parameters with complex dependence.

Monday, January 22, 2024, 15:15

Michael P. Fay
National Institute of Allergy and Infectious Diseases, NIH
Individual-level and Population-level Causal Estimands in Randomized Clinical Trials

Randomized trials are one of the best ways to establish causal effects without making strong untestable assumptions. Although randomization can ensure that the apparent causal effect is not due to a confounding factor that affects both the treatment choice and the response, the interpretation of the causal estimand is sometimes not straightforward. To avoid some common misinterpretations of causal estimands from randomized trials, I discuss two overlapping classes of estimands: individual-level and population-level causal estimands. The individual-level causal estimand first compares potential outcomes on each of the two treatment arms within an individual, then summarizes those comparisons across a population. In contrast, the population-level causal estimand first summarizes the marginal distribution of each of the two potential outcomes, then compares the two summaries. Difference-in-means estimands are members of both classes, but some other common estimands (e.g., the Mann-Whitney parameter or the hazard ratio) are only population-level estimands and are often causally misinterpreted as individual-level estimands. I discuss these issues using a placebo-controlled randomized vaccine trial as an example.
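The distinction can be made concrete with a small numerical sketch. The potential outcomes below are invented for illustration and are not taken from the talk; in a real trial only one potential outcome per person is observed, so the individual-level quantities cannot be computed directly.

```python
from fractions import Fraction
from itertools import product

# Hypothetical potential outcomes for three individuals (higher is better).
y0 = [1, 2, 3]  # outcome each person would have under control
y1 = [2, 3, 1]  # outcome each person would have under treatment
n = len(y0)

# Difference in means: compare-then-summarize and summarize-then-compare agree.
individual_diff = Fraction(sum(b - a for a, b in zip(y0, y1)), n)
population_diff = Fraction(sum(y1), n) - Fraction(sum(y0), n)

# Individual-level: share of individuals who do better under treatment.
individual_benefit = Fraction(sum(b > a for a, b in zip(y0, y1)), n)

# Population-level Mann-Whitney parameter: P(Y1 > Y0) + 0.5 * P(Y1 = Y0)
# for independent draws from the two marginal distributions.
wins = sum(b > a for a, b in product(y0, y1))
ties = sum(b == a for a, b in product(y0, y1))
mann_whitney = Fraction(2 * wins + ties, 2 * n * n)

print(individual_diff, population_diff)   # 0 0
print(individual_benefit, mann_whitney)   # 2/3 1/2
```

Here the two marginal distributions are identical, so the Mann-Whitney parameter is 1/2, yet two of the three individuals do better under treatment. Reading the population-level value as "half of all individuals benefit" would be the kind of misinterpretation the abstract warns about.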

Thursday, November 30, 2023, 15:15

Jan Beyersmann
Institute of Statistics, Ulm University
Can today’s intention to treat have a causal effect on tomorrow’s hazard function?

Hazards condition on previous survival, which makes them both identifiable based on censored data and the inferential key quantities of survival analysis. It also makes them subject to critique from a causal point of view. The worry is that after randomization of the intention to treat, a more beneficial treatment will help sicker patients to survive longer, rendering treatment intention and markers of sickness dependent after time origin. Called ‘collider bias’, this is interpreted as breaking randomization and therefore complicating detection of a causal treatment effect. The strange part of this argument is that the situation at later times is explained as a causal consequence of treatment. I will try to review this dilemma - identifiability vs. causal concerns - and argue that there is a causal effect of today’s intention to treat on the future hazard function, if interpreted in a functional way. I will also argue that things are the way they should be and ‘collider bias’ is really a ‘collider effect’, that the latter has little to do with time-to-event, and that piecewise constant hazard ratios carry information on how treatment works. My impression is that the debate is a bit pointed, but that there is general agreement that analyses of hazards - where the causal effect is hidden or perhaps obvious - should routinely be translated onto the probability scale. My worry is that these subtleties are lost in translation and I will illustrate matters with a (typical?) example from benefit-risk assessment in Germany, where a company managed to both claim a better and a worse safety profile of their drug, while only partially acknowledging the need to account for censoring. Time permitting, I will also discuss a multistate approach to g-computation motivated by a phase 3 trial of non-small-cell lung cancer patients where the experimental treatment was put on (‘clinical’) hold by the FDA for some months shortly before recruitment was completed. The aim of the analysis is to estimate the survival distributions (sic) in the hypothetical scenario where the put-on-hold hazard is equated with zero (sic). The difficulty is that time-to-clinical-hold and time-to-death are not independent.

Tuesday, October 03, 2023, 10:00

Christine Winther Bang
PhD student at Leibniz Institute for Prevention Research and Epidemiology - BIPS
Improving causal discovery with temporal background knowledge

Causal discovery methods aim to estimate a (causal) graph from data. These methods have well-known issues: the output in the form of an estimated equivalence class (represented by a so-called CPDAG) can be sensitive to statistical errors and is often not very informative. Including background knowledge, if correct, can only improve (and never harm) the result of causal discovery. This talk will focus on temporal background knowledge as would be available in longitudinal or cohort studies, but the results presented here are valid for any kind of data that has a tiered ordering of the variables. This type of background knowledge is reliable, straightforward to incorporate, and the resulting estimated graphs have desirable properties.

First, I will describe how to incorporate temporal background knowledge in a causal discovery algorithm, and provide a practical example of how it can be applied to cohort data. This algorithm outputs restricted equivalence classes (represented by so-called tiered MPDAGs) that are more informative, and more robust to statistical errors compared to CPDAGs.

Second, I will show how tiered MPDAGs can be characterised as distinct from MPDAGs based on other types of background knowledge, and how this allows us to determine exactly when temporal knowledge adds new information, and when it is redundant. Finally, I will show that this class of graphs inherits key properties of CPDAGs so that they retain the usual interpretation as well as computational efficiency.

Monday, October 02, 2023, 15:15

Xiao-Hua Zhou
PKU Endowed Chair Professor, Beijing International Center for Mathematical Research, Chair, Department of Biostatistics, Peking University
Some Statistical Methods in Causal Inferences and Diagnostic Medicine

Two important areas in biostatistics are causal inference and statistical methods in diagnostic medicine. In this talk, I give an overview of my research interests in these two areas. In particular, I discuss some new developments in the statistical methodology for making causal inferences, and discuss some future research directions. In addition, I give an overview of some new developments in statistical methods for evaluating the accuracy of medical devices.

Monday, August 21, 2023, 15:15

Klaus Rostgaard
Senior Statistician, Danish Cancer Society
Bettering scientific reporting by replacing p values with simple, informative, objective and flexible Bayes factors

This talk is about the situation where we (in principle) have a d‐dimensional parameter estimate of interest and the d x d dimensional (co‐)variance of it and make inference from that. Often, we test a null hypothesis H0: that the parameter of interest is 0, versus the alternative H1: that the parameter can be anywhere in the d-dimensional parameter space.

The testing mindset has many unfortunate behavioural side effects on what is reported and how in the scientific literature. Furthermore, traditional significance testing, i.e., using p values, does not suffice to compare the evidence in favour of H0 and H1, respectively, as it only makes assumptions about H0. In practice, it is therefore biased in the direction of favouring H1. Bayesian methodology takes H1 into consideration, but often in a way that is either subjective (contradicting scientific ideals) or objective by way of assuming very little information in the prior, which by itself is untrustworthy and often clearly favours H0.

Here we develop an approximation of the so‐called Bayes factor (BF) applicable in the above setting; BF is the Bayesian equivalent of a likelihood ratio. By design the approximation is monotone in the p value. It is thus a tool to transform p values into evidence (probabilities of H0 and H1, respectively). This BF depends on a parameter that expresses the ratio of information contained in the likelihood and the prior. We give suggestions for choosing this parameter. The standard version of our BF corresponds to a continuous version of the Akaike information criterion for model (hypothesis) selection.

Posterior odds of H1 and H0, i.e., Pr(H1|X)/Pr(H0|X) (and hence probabilities and evidence for each), are obtained by multiplying BF with prior (pre-data) odds of H1 and H0, i.e., Pr(H1)/Pr(H0). We suggest that for scientific reporting and discussion prior odds should be set to 1; the reader can modify prior odds to fit their own a priori beliefs and obtain the corresponding posterior inferences. BF=1 represents equiprobability of the hypotheses, H0 and H1. BF is thus centered at the right value, for the purpose of making immediate judgments about which hypothesis is the more likely and how strong the evidence is for that based only on the likelihood function.
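The odds arithmetic in the paragraph above can be sketched in a few lines. This covers only the generic step from a Bayes factor and prior odds to a posterior probability; it does not implement the BF approximation developed in the talk, and the function name and example values are assumptions made for illustration.

```python
def posterior_prob_h1(bf: float, prior_odds: float = 1.0) -> float:
    """Posterior probability of H1 given a Bayes factor for H1 against H0.

    posterior odds = BF * prior odds, where both odds are H1 relative to H0,
    and probability = odds / (1 + odds).
    """
    posterior_odds = bf * prior_odds
    return posterior_odds / (1.0 + posterior_odds)


# With the suggested reporting default of prior odds 1:
print(posterior_prob_h1(1.0))  # 0.5, i.e. H0 and H1 equally probable
print(posterior_prob_h1(3.0))  # 0.75

# A reader who a priori considers H0 three times as likely as H1
# supplies prior odds of 1/3 and obtains a different posterior.
sceptical = posterior_prob_h1(3.0, prior_odds=1 / 3)
```

Keeping the prior odds as an explicit argument mirrors the suggestion that readers can substitute their own a priori beliefs without recomputing the BF.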

Replacing p-values (and implicit tests by confidence intervals) by BFs should allow for shorter, more informative, and less biased reporting of many scientific studies.

We exemplify the calculations and interpretations and illustrate the flexibility of our approach based on a real-world epidemiologic example where we a priori believe H0 to be a good approximation of physical reality. H0 is that an 8-dimensional predictor has exactly the same non-trivial effect (measured by a hazard ratio) on two distinct disease outcomes.

Finally, we compare these new BF-based inferences with those based on p values. Although there is a bijection between BF and p for fixed d it is non-trivial – so you need to calculate BF. Generally, BF-based inference is more in favour of H0 than p-value inference, i.e., less biased in favour of the alternative, H1. The BF is easy to calculate (only requires d and p or a test statistic), flexible and objective. It is a Bayesian solution to the Fisherian project of making statistical inference based exclusively on the likelihood function.

Wednesday, June 14, 2023, 15:15

Mike Daniels
Professor and Chair, Andrew Banks Family Endowed Chair, Department of Statistics, University of Florida
A Bayesian Non-parametric Approach for Causal Mediation with a Post-treatment Confounder

We propose a new Bayesian non-parametric (BNP) method for estimating the causal effects of mediation in the presence of a post-treatment confounder. We specify an enriched Dirichlet process mixture (EDPM) to model the joint distribution of the observed data (outcome, mediator, post-treatment confounder, treatment, and baseline confounders). For identifiability, we use the extended version of the standard sequential ignorability as introduced in Hong et al. (2022, Biometrics). The observed data model and causal identification assumptions enable us to estimate and identify the causal effects of mediation, i.e., the natural direct effects (NDE), and indirect effects (NIE). Our method enables easy computation of NDE and NIE for a subset of confounding variables and addresses missing data through data augmentation under the assumption of ignorable missingness. We conduct simulation studies to assess the performance of our proposed method. Furthermore, we apply this approach to evaluate the causal mediation effect in the Rural LITE trial, demonstrating its practical utility in real-world scenarios.

Monday, June 12, 2023, 15:15

Martin Bladt
Associate Professor in Insurance Mathematics at the Department of Mathematical Sciences, University of Copenhagen
Conditional Aalen–Johansen estimation

Aalen–Johansen estimation targets transition probabilities in multi-state Markov models subject to right-censoring. In particular, it belongs to the standard toolkit of statisticians specializing in health and disability. We introduce for the first time the conditional Aalen–Johansen estimator, a kernel-based estimator that allows for the inclusion of covariates and, importantly, is also applicable in non-Markov models. We establish uniform strong consistency and asymptotic normality under lax regularity conditions; here, the theory of empirical processes plays a central role and leads to a transparent treatment. We also illustrate the practical implications and strength of the estimation methodology.

Monday, May 15, 2023, 15:15

Niels Lundtorp Olsen
Assistant professor, The Danish Technical University
Local Inference for Functional Data on Manifold Domains

This talk is about local inference in functional data analysis: that is, assessing which part of a domain is ‘significant’ in terms of a given (pointwise) null hypothesis. Due to the continuous domain, this is an extreme case of the multiple testing problem. A popular approach is the interval-wise testing procedure. We extend this to a general setting where the domain is a Riemannian manifold. This requires new methodology such as how to define adjustment sets on product manifolds and how to handle non-zero curvature. We present data and simulation examples and also relate to another recent local inference procedure.

Monday, May 08, 2023, 15:15

Benoit Liquet-Weiland
Laboratory of Mathematics and their Applications, University of Pau and Pays de l’Adour
Best Subset Selection for Linear Dimension Reduction Models using Continuous Optimization

Choosing the most important variables in supervised and unsupervised learning is a difficult task, especially when dealing with high-dimensional data where the number of variables far exceeds the number of observations. In this study, we focus on two popular multivariate statistical methods - principal component analysis (PCA) and partial least squares (PLS) - both of which are linear dimensionality reduction techniques used in a variety of fields such as genomics, biology, environmental science, and engineering. Both PCA and PLS generate new variables, known as principal components, that are combinations of the original variables. However, interpreting these components can be challenging when working with large numbers of variables. To address this issue, we propose a method that incorporates the best subset selection approach into the PCA and PLS frameworks using a continuous optimization algorithm. Our empirical results demonstrate the effectiveness of our method in identifying the most relevant variables. We illustrate the use of our algorithm on two real datasets - one analyzed using PCA and the other using PLS.

Wednesday, April 26, 2023, 15:15

Todd Ogden
Department of Biostatistics, Mailman School of Public Health, Columbia University
Analysis of shape data with applications to mitochondria

In many modern data applications there is a need for an objective framework for the analysis of data that can be represented as shapes (or curves or functions, etc.). While standard analytic techniques could be applied to scalar-valued summary measures of such objects, a more objective approach would involve comparing the data objects in shape space (curve space, function space, etc.). This would require the determination of a measure of ‘distance’ between objects, ideally one that respects the topology of the space. Once such a metric has been established, many traditional statistical modeling techniques can be applied to such data. This talk will describe some potential metrics for closed curves and propose some corresponding adaptations of statistical inference procedures. The analysis will be applied to data from an experiment in animal cell biology, in which exercise regimen is thought to have an effect on mitochondrial morphology.

Friday, March 03, 2023, 10:00

Stephen Senn
Statistical Consultant, self-employed, and University of Sheffield, UK
Covariate adjustment in clinical trials: issues, opportunities and problems

Using analysis of covariance to improve the efficiency of clinical trials has a long tradition within drug development and is explicitly recognised as being a valuable thing to do by regulatory guidelines. Nevertheless it continues to attract criticism and it also raises various issues. In this talk I shall look at some of them in particular.

1. What the difference is between stratification and analysis of covariance.
2. How this relates to type I and type II sums of squares.
3. Whether propensity score adjustment is a valid alternative to analysis of covariance.
4. What problems arise in connection with hierarchical data.
5. What the Rothamsted approach teaches us and its relevance to Lord’s paradox.
6. What changes when we move from common two parameter models, such as the Normal model, to single parameter models such as the Poisson distribution.
7. Whether marginal or conditional estimates are generally to be preferred or if there is a role for both.
8. What care must be taken when considering covariate by treatment interaction.

I shall conclude that using covariates wisely does require care but is valuable, that despite general regulatory approval it is underused, and that it would make a much bigger contribution to design efficiency than the currently fashionable topic of flexible designs.

Friday, February 03, 2023, 15:00

Stijn Vansteelandt
Ghent University
Assumption-lean regression

Twenty years ago, the late Leo Breiman sent a wake-up call to the statistical community, thereby criticizing the dominant use of data models’ (Breiman, 2001). In this talk, I will revisit his critiques in light of the developments on algorithmic modeling, debiased machine learning and targeted learning that have taken place over the past 2 decades, largely within the causal inference literature (Vansteelandt, 2021). I will argue that these developments resolve Breiman's critiques, but are not ready for mainstream use by researchers without in-depth training in causal inference. They focus almost exclusively on evaluating the effects of dichotomous exposures; when even slightly more complex settings are envisaged, then this restrictive focus encourages poor practice (such as dichotomization of a continuous exposure) or makes users revert to the traditional modeling culture. Moreover, while there is enormous value in the ability to quantify the effects of specific interventions, this focus is also artificial in the many scientific studies where no specific interventions are targeted.<br><br>I will accommodate these concerns via a general conceptual framework on assumption-lean regression, which I recently introduced in a discussion paper that was read before the Royal Statistical Society (Vansteelandt and Dukes, 2022). This framework builds heavily on the debiased / targeted machine learning literature, but intends to be as broadly useful as standard regression methods, while continuing to resolve Breiman's concerns and other typical concerns about regression.<br><br>A large part of this talk will be conceptual and is aimed to be widely accessible; parts of the talk will demonstrate in more detail how assumption-lean regression works in the context of generalised linear models and Cox proportional hazard models (Vansteelandt et al., 2022).<br><br>References:<br>Breiman, L. (2001). Statistical modeling: The two cultures (with comments and a rejoinder by the author). 
Statistical science, 16(3), 199-231.<br>Vansteelandt, S. (2021). Statistical Modelling in the Age of Data Science. Observational Studies, 7(1), 217-228.<br>Vansteelandt, S and Dukes, O. (2022) Assumption-lean inference for generalised linear model parameters (with discussion). Journal of the Royal Statistical Society: Series B (Statistical Methodology), 84(3), 657– 685. <br>Vansteelandt, S., Dukes, O., Van Lancker, K., & Martinussen, T. (2022). Assumption-lean Cox regression. Journal of the American Statistical Association, 1-10.<br></div></p></div><div class="panel-footer"><h4 class="panel-title">Room: 35.3.13 </h4></div></p></div><div class="panel panel-primary"><div class="panel-heading"><h3 class="panel-title"> Monday, September 26, 2022, 15:15 </h3></div><div class="panel-body"><div class="abstract-name"> Andrew Vickers </div><div class="abstract-affiliation"> Memorial Sloan Kettering Cancer Center, Attending Research Methodologist </div><div class="abstract-title"><a class="abstract-title" data-toggle="collapse" href="#abstract16">If calibration, discrimination, Brier, lift gain, precision recall, F1, Youden, AUC, and 27 other accuracy metrics can’t tell you if a prediction model (or diagnostic test, or marker) is of clinical value, what should you use instead?</a><p class="collapse abstract" align = "left" id="abstract16">A typical paper on a prediction model (or diagnostic test or marker) presents some accuracy metrics - say, an AUC of 0.75 and a calibration plot that doesn’t look too bad – and then recommends that the model (or test or marker) can be used in clinical practice. But how high an AUC (or Brier or F1 score) is high enough? What level of miscalibration would be too much? The problem is redoubled when comparing two different models (or tests or markers). What if one prediction model has better discrimination but the other has better calibration? What if one diagnostic test has better sensitivity but worse specificity? 
Note that it doesn’t help to state a general preference, such as “if we think sensitivity is more important, we should take the test with the higher sensitivity” because this does not allow to evaluate trade-offs (e.g. test A with sensitivity of 80% and specificity of 70% vs. test B with sensitivity of 81% and specificity of 30%). The talk will start by showing a series of everyday examples of prognostic models, demonstrating that it is difficult to tell which is the better model, or whether to use a model at all, on the basis of routinely reported accuracy metrics such as AUC, Brier or calibration. We then give the background to decision curve analysis, a net benefit approach first introduced about 15 years ago, and show how this methodology gives clear answers about whether to use a model (or test or marker) and which is best. Decision curve analysis has been recommended in editorials in many major journals, including JAMA, JCO and the Annals of Internal Medicine, and is very widely used in the medical literature, with well over 1000 empirical uses a year.</div></p></div><div class="panel-footer"><h4 class="panel-title">Room: 5.2.46 (Biostats library) </h4></div></p></div><div class="panel panel-primary"><div class="panel-heading"><h3 class="panel-title"> Tuesday, June 21, 2022, 15:15 </h3></div><div class="panel-body"><div class="abstract-name"> Benoit Liquet-Weiland </div><div class="abstract-affiliation"> School of Mathematics and physical sciences, Macquarie University and Laboratory of Mathematics and their Applications, University of Pau and Pays de l’Adour </div><div class="abstract-title"><a class="abstract-title" data-toggle="collapse" href="#abstract17">Leveraging pleiotropic association using sparse group variable selection</a><p class="collapse abstract" align = "left" id="abstract17">Genome-wide association studies (GWAS) have identified genetic variants associated with multiple complex diseases. 
We can leverage this phenomenon, known as pleiotropy, to integrate multiple data sources in a joint analysis. Often integrating additional information such as gene pathway knowledge can improve statistical efficiency and biological interpretation. In this talk, we propose frequentist ad Bayesian statistical methods which incorporate both gene pathway and pleiotropy knowledge to increase statistical power and identify important risk variants affecting multiple traits. Our methods are applied to identify potential pleiotropy in an application considering the joint analysis of thyroid and breast cancers.</div></p></div><div class="panel-footer"><h4 class="panel-title">Room: 5.2.46 (Biostats library) </h4></div></p></div><div class="panel panel-primary"><div class="panel-heading"><h3 class="panel-title"> Wednesday, June 08, 2022, 15:15 </h3></div><div class="panel-body"><div class="abstract-name"> Carolin Herrmann </div><div class="abstract-affiliation"> Institute of Biometry and Clinical Epidemiology, Charité – University Medicine Berlin </div><div class="abstract-title"><a class="abstract-title" data-toggle="collapse" href="#abstract18">Sample size adaptations during ongoing clinical trials – possibilities and challenges</a><p class="collapse abstract" align = "left" id="abstract18">One central design aspect of clinical trials is a valid sample size calculation. The sample size needs to be large enough to detect an existing effect with sufficient power and at the same time it needs to be ethically feasible. Sample size calculations are based on the applied test statistic as well as the significance level and desired power. However, it is not always straightforward to determine the underlying parameter values, such as the expected treatment effect size and variance, during the planning stage of a clinical trial.<br><br>Adaptive designs provide the possibility of adapting the sample size during an ongoing trial. 
At so called interim analyses, nuisance parameters can be re-estimated. Alternatively, unblinded interim analyses may be performed where the treatment effect can be re-estimated and the trial may also be stopped early for efficacy or futility. In this talk, we will focus on unblinded interim analyses and its different possibilities for recalculating the sample size. We discuss their performance evaluation as well as possibilities for improving existing and optimizing sample size recalculation approaches.</div></p></div><div class="panel-footer"><h4 class="panel-title">Room: 5.2.46 (Biostats library) </h4></div></p></div><div class="panel panel-primary"><div class="panel-heading"><h3 class="panel-title"> Monday, May 30, 2022, 15:15 </h3></div><div class="panel-body"><div class="abstract-name"> Robin Evans </div><div class="abstract-affiliation"> Associate Professor, Department of Statistics at the University of Oxford </div><div class="abstract-title"><a class="abstract-title" data-toggle="collapse" href="#abstract19">Parameterizing and Simulating from Causal Models</a><p class="collapse abstract" align = "left" id="abstract19">Many statistical problems in causal inference involve a probability distribution other than the one from which data are actually observed; as an additional complication, the object of interest is often a marginal quantity of this other probability distribution. This creates many practical complications for statistical inference, even where the problem is non-parametrically identified. In particular, it is difficult to perform likelihood-based inference, or even to simulate from the model in a general way.<br><br>We introduce the frugal parameterization, which places the causal effect of interest at its centre, and then build the rest of the model around it. We do this in a way that provides a recipe for constructing a regular, non-redundant parameterization using causal quantities of interest. 
In the case of discrete variables we can use odds ratios to complete the parameterization, while in the continuous case copulas are the natural choice; other possibilities are also discussed.

This is joint work with Vanessa Didelez (University of Bremen and Leibniz Institute for Prevention Research and Epidemiology).

Monday, May 16, 2022, 15:15

Philip Hougaard
Vice President, Biometrics, Lundbeck
The use of Bayesian statistical methods during drug development

Over the last few decades, Bayesian methods have also gained momentum within pharmaceutical drug development. During this talk, I will try to dig into this issue. This first covers how the Bayesian philosophy considers probability, parameters and populations. I will give my personal assessment of whether drug development has obtained a completely new paradigm or just an expansion of the statistical toolbox. This also includes a prioritized list of where Bayesian methods can add value compared to standard frequentist methods.

Monday, March 07, 2022, 15:15

Benjamin Christoffersen
Department of Medical Epidemiology and Biostatistics, Karolinska Institutet
Joint models with multiple markers and multiple time-to-event outcomes using variational approximations

Joint models are well suited to modelling linked data from laboratories and health registers. However, there are few examples of joint models that allow for (a) multiple markers, (b) multiple survival outcomes, (c) delayed entry and (d) scalability. We propose a full likelihood approach for joint models based on a Gaussian variational approximation to satisfy criteria (a)-(d). Our simulations and applications show that the variational approximation is close to the full likelihood, very fast to optimize, and scalable. Our open source implementation is available with support for general joint models and computation in parallel.

Monday, February 21, 2022, 15:15

CANCELLED Benjamin Christoffersen
CANCELLED Department of Medical Epidemiology and Biostatistics, Karolinska Institutet
CANCELLED - will be POSTPONED

Monday, December 13, 2021, 15:15

Michael Sachs
Biostatistical Researcher, Department of Medical Epidemiology and Biostatistics, Karolinska Institutet
Event History Regression with Pseudo-Observations: Computational Approaches and an Implementation in R

Due to tradition and ease of estimation, the vast majority of clinical and epidemiological papers with time-to-event data report hazard ratios from Cox proportional hazards regression models. Although hazard ratios are well known, they can be difficult to interpret, particularly as causal contrasts, in many settings. Nonparametric or fully parametric estimators allow for the direct estimation of more easily causally interpretable estimands such as the cumulative incidence and restricted mean survival. However, modeling these quantities as functions of covariates is limited to a few categorical covariates with nonparametric estimators, and often requires simulation or numeric integration with parametric estimators. Combining pseudo-observations based on non-parametric estimands with parametric regression on the pseudo-observations allows for the best of these two approaches and has many nice properties. In this talk, I will describe an implementation of these methods in the eventglm R package, focusing on the computational approach, usage from the average data analyst’s perspective, and features for further development and extension.

Tuesday, October 12, 2021, 15:15

Samir Bhatt
Professor of Machine Learning and Public Health, University of Copenhagen and Senior Lecturer in Geostatistics, Imperial College London
A brief tour of renewal equations for epidemic modelling

In this talk I’ll discuss the renewal equation and its use in epidemic modelling. I will briefly discuss how many popular approaches result in a renewal equation, and highlight a few applications where the renewal equation is used. I will then discuss new work we have recently done to rigorously derive how the renewal equation arises from age-dependent branching processes. I will then discuss a few interesting (I think) implications of this derivation, including overdispersion and a link to generalised Fibonacci numbers! The paper is currently under second review and is available at https://arxiv.org/abs/2107.05579
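As a rough illustration of what a renewal equation looks like in practice, the sketch below iterates a discrete-time version, I_t = R · Σ_s g_s · I_{t−s}, where g is a generation-interval distribution. The function name, the deterministic discrete-time form and all parameter values are assumptions made for this example; they are not taken from the talk or the paper.

```python
def simulate_renewal(r, generation_pmf, seed_incidence, n_steps):
    """Iterate the discrete renewal equation I_t = r * sum_s g_s * I_{t-s}."""
    incidence = [float(seed_incidence)]
    for t in range(1, n_steps):
        # Infectious pressure: past incidence weighted by the generation-interval weights.
        pressure = sum(
            generation_pmf[s - 1] * incidence[t - s]
            for s in range(1, min(t, len(generation_pmf)) + 1)
        )
        incidence.append(r * pressure)
    return incidence


# Hypothetical inputs: generation interval spread over 1-3 time steps, R = 1.5.
g = [0.2, 0.5, 0.3]
curve = simulate_renewal(r=1.5, generation_pmf=g, seed_incidence=10, n_steps=30)
print(curve[:5])
```

As a toy observation only (not the derivation discussed in the talk), calling `simulate_renewal(1, [1, 1], 1, 6)` reduces the recursion to I_t = I_{t−1} + I_{t−2}, which is the Fibonacci recursion.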

Friday, October 01, 2021, 15:15

Erin Gabriel
Researcher and Docent of Biostatistics, Department of Medical Epidemiology and Biostatistics, Karolinska Institutet
Causal Bounds for Outcome-Dependent Sampling in Observational Studies

Outcome-dependent sampling designs are common in many different scientific fields including epidemiology, ecology, and economics. As with all observational studies, such designs often suffer from unmeasured confounding, which generally precludes the nonparametric identification of causal effects. Nonparametric bounds can provide a way to narrow the range of possible values for a nonidentifiable causal effect without making additional untestable assumptions. The nonparametric bounds literature has almost exclusively focused on settings with random sampling, and the bounds have often been derived with a particular linear programming method. We derive novel bounds for the causal risk difference, often referred to as the average treatment effect, in six settings with outcome-dependent sampling and unmeasured confounding for a binary outcome and exposure. Our derivations of the bounds illustrate two approaches that may be applicable in other settings where the bounding problem cannot be directly stated as a system of linear constraints.
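To make the idea of nonparametric bounds concrete, here is a minimal sketch of the classical assumption-free bounds on the causal risk difference for a binary exposure X and binary outcome Y under random sampling. This is the simple textbook setting, not the outcome-dependent sampling settings treated in the talk; the function name and the example probabilities are hypothetical.

```python
def risk_difference_bounds(p_y1_x1, p_y0_x1, p_y1_x0, p_y0_x0):
    """Assumption-free bounds on P(Y(1)=1) - P(Y(0)=1) from the joint law of (X, Y).

    Arguments are the observed joint probabilities P(Y=y, X=x).
    """
    total = p_y1_x1 + p_y0_x1 + p_y1_x0 + p_y0_x0
    if abs(total - 1.0) > 1e-9:
        raise ValueError("joint probabilities must sum to 1")
    # P(Y(1)=1) lies in [P(Y=1,X=1), P(Y=1,X=1) + P(X=0)], and
    # P(Y(0)=1) lies in [P(Y=1,X=0), P(Y=1,X=0) + P(X=1)].
    lower = -(p_y0_x1 + p_y1_x0)
    upper = p_y1_x1 + p_y0_x0
    return lower, upper


# Hypothetical joint distribution of exposure and outcome.
lo, hi = risk_difference_bounds(p_y1_x1=0.3, p_y0_x1=0.2, p_y1_x0=0.1, p_y0_x0=0.4)
print(lo, hi)
```

These bounds always have width one, which is why sharper bounds under additional structure are of interest.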

Monday, August 30, 2021, 15:15

Giuliana Cortese
Dept. of Statistical Sciences, University of Padova

Tuesday, June 15, 2021, 16:00

Niels Richard Hansen and Lasse Petersen
Dept. of Mathematical Sciences, University of Copenhagen
Causal event process models, local independence and nonparametric inference

In the first part of the talk, we will give a review of how causal models of event processes and local independence graphs are linked. In particular, how local independence graphs can represent partially observed systems and how local independence testing can be used to infer local independence graphs.

In the second part of the talk, we will show how to build flexible models of event intensities using a deep learning framework, specifically Tensorflow. Combined with double machine learning techniques, this makes nonparametric local independence testing feasible. However, the Tensorflow implementation may be of independent interest for other nonparametric modeling purposes.

Wednesday, April 07, 2021, 16:00

Theis Lange, Søren Rasmussen, and Zeyi Wang
University of Copenhagen, Novo Nordisk, and UC Berkeley
A series of three presentations of methodological developments in longitudinal mediation analysis and clinical applications

The Joint Initiative for Causal Inference Webinar Series is a series of presentations on utilizing causal inference and targeted learning methods to answer pressing health questions in the modern methodological and data ecosystem. Targeted learning methods bring the rigor and power of classical statistics and causal inference together with advances in machine learning to bring robust insight and evidence to the important health challenges. This program is organized by the University of California, Berkeley’s Center for Targeted Machine Learning, University of Copenhagen, and Novo Nordisk, a leading global healthcare company headquartered in Denmark. The talks will range from those targeted at a general audience with an interest in the future of trials and real-world evidence generation to statisticians and data scientists working at or interested in the intersection of causal inference, machine learning, and statistics.

Wednesday, January 20, 2021, 15:15

Michael Höhle
Department of Mathematics, Stockholm University
Transmission Risk Classification in Digital Contact Tracing Apps for COVID-19

Inspired by the influential paper of Ferretti et al. (2020), many countries have decided to use a digital contact tracing app as part of their COVID-19 response. Due to its widespread availability on standard mobile phones and its privacy-preserving decentralized approach, Google and Apple’s Exposure Notification (GAEN) framework, based on Bluetooth Low Energy proximity tracing, has become the de facto standard on which such digital contact tracing apps are based.

In this data-free talk, I will give a short introduction to the aims of digital contact tracing and then focus on the mathematical calculations occurring within the GAEN while determining the risk of being infected. In particular, I will focus on the possibility of performing a more detailed computation of the so-called Transmission Risk Level (TRL), which is an indication of how infectious a given individual is at the time of the potential exposure event. This TRL score computation is used as part of the German Corona-Warn-App (CWA) and consists of deducing infectiousness based on the day of upload via a stochastic model. I will end the talk with some remarks about the importance of transparency when using mathematical risk scoring in applications that heavily depend on widespread voluntary use in the population, a transparency that the Danish smitte|stop app currently lacks but appears to have planned for 2021.

Literature:

CWA Team (2020), Epidemiological Motivation of the Transmission Risk Level, https://github.com/corona-warn-app/cwa-documentation/blob/master/transmission_risk.pdf
Ferretti, L., Wymant, C., Kendall, M., Zhao, L., Nurtay, A., Abeler-Dörner, L., Parker, M., Bonsall, D., & Fraser, C. (2020). Quantifying SARS-CoV-2 transmission suggests epidemic control with digital contact tracing. Science (New York, N.Y.), 368(6491), eabb6936. https://doi.org/10.1126/science.abb6936
Höhle, M. (2020), Risk Scoring in Digital Contact Tracing Apps, https://staff.math.su.se/hoehle/blog/2020/09/17/gaen_riskscoring.html

Monday, December 07, 2020, 15:15

Andreas Bjerre-Nielsen
Head of Studies for MSc. Social Data Science, Center for Social Data Science (SODAS), University of Copenhagen
‘Big data’ vs. the ‘right data’: Balancing privacy and prediction in higher (online) education

Increasingly, human behavior can be monitored through the collection of data from digital devices revealing information on behaviors and locations. In the context of higher education, a growing number of schools and universities collect data on their students with the purpose of assessing or predicting behaviors and academic performance, and the COVID-19 induced move to online education dramatically increases what can be accumulated in this way, raising concerns about students’ privacy. We focus on academic performance and ask whether predictive performance for a given data set can be achieved with less privacy-invasive, but more task-specific, data. We draw on a unique data set on a large student population containing both highly detailed measures of behavior and personality and high-quality third-party reported individual-level administrative data. We find that models estimated using the big behavioral data are indeed able to accurately predict academic performance out-of-sample. However, models using only low-dimensional and arguably less privacy-invasive administrative data perform considerably better and, importantly, do not improve when we add the high-resolution, privacy-invasive behavioral data. We argue that combining big behavioral data with ‘ground truth’ administrative registry data can ideally allow the identification of privacy-preserving task-specific features that can be employed instead of current indiscriminate troves of behavioral data, with better privacy and better prediction resulting.

Friday, April 24, 2020, 14:00

Helene Charlotte Rytgaard
Biostatistics, University of Copenhagen
PhD defence: Targeted causal learning for longitudinal data

This thesis develops statistical methodology for causal inference based on observational longitudinal data. The work is motivated by problems in pharmacoepidemiology, where hazard ratios are routinely used to assess the association of time-fixed and time-dependent exposure with time-to-event outcomes. However, the interpretation of hazard ratios as the measure of treatment effect is hampered for many reasons. Causal effect parameters may instead be formulated as intervention-specific mean outcomes, for instance to target the effect of dynamic treatment regimes on the absolute risk scale.

Targeted minimum loss-based estimation (TMLE) provides a general template for efficient estimation of such causal parameters in semiparametric models. The main part of my thesis is concerned with a generalization of the TMLE template to a continuous-time setting. In this setting, the number and schedule of covariate changes and intervention time-points are allowed to be subject-specific and to occur in continuous time. I propose a novel targeting estimation algorithm, where nuisance parameters are handled by super learning, and derive the asymptotic distribution of the resulting estimator.

In my thesis I also suggest extensions of generalized random forests for conditional and marginal causal effect estimation with time-to-event outcomes observed in the presence of right-censoring and competing risks. I apply these methods to Danish registry data to search through all drugs on the market for repurposing effects.



Assessment committee:
Associate Professor Andreas Kryger Jensen, Section of Biostatistics, Department of Public Health, University of Copenhagen
Assistant Professor Edward H. Kennedy, Department of Statistics & Data Science, Carnegie Mellon University
Professor Søren Feodor Nielsen, Center for Statistics, Department of Finance, Copenhagen Business School

Monday, March 09, 2020, 17:00

Mark van der Laan
University of California, Berkeley
Targeted Learning, Super Learning, and the Highly Adaptive Lasso

We review targeted minimum loss estimation (TMLE), which provides a general template for the construction of asymptotically efficient plug-in estimators of a target estimand under realistic statistical assumptions. TMLE involves maximizing a parametric likelihood along a so-called least favourable parametric model that uses as offset an initial estimator (e.g., an ensemble super-learner) of the relevant functional of the data distribution. The asymptotic normality and efficiency of the TMLE rely on the asymptotic negligibility of a second-order term. We present a general Highly Adaptive Lasso (HAL) estimator of the data distribution and its functionals that converges at a sufficient rate of $n^{-1/3}$ regardless of the dimensionality of the data/model, under almost no additional regularity. This allows us to propose a general TMLE that is asymptotically efficient in great generality. We also discuss the appealing properties of HAL, due to HAL being an MLE over a big function class, and present several of its implications for super-learning and TMLE.

Monday, February 24, 2020, 15:15

Klaus Larsen
Lundbeck
A principal stratum estimand investigating the treatment effect in patients who would comply, if treated with a specific treatment

The draft ICH E9 (R1) addendum by the International Conference on Harmonisation working group opens up the use of a principal stratum in the analysis of study data for regulatory purposes, if a relevant estimand can be justified. Inspired by the so-called complier average causal effect and work within this framework, I will propose a new estimator – Extrapolation based on propensity to comply – that estimates the treatment effect of an active treatment A relative to a comparator B (active or placebo), in the principal stratum of patients who would comply if they were treated with treatment A. Sensitivity of the approach to the number of covariates and their ability to predict principal stratum membership will be shown based on data from a placebo-controlled study of brexpiprazole in schizophrenia. The performance of the estimator is compared with another estimator that is also based on principal stratification. A simulation study supports that the proposed estimator has a negligible bias even with a small sample size, except when the covariate predicting compliance is very weak. Not surprisingly, precision of the estimate increases substantially with stronger predictors of compliance.

Thursday, November 28, 2019, 15:00

Alberto Cairo
Director of the Visualization Program at the Center for Computational Science, University of Miami
How Charts Lie

We’ve all heard that a picture is worth a thousand words, but what if we don’t understand what we’re looking at?

Charts, infographics, and diagrams are ubiquitous. They are useful because they can reveal patterns and trends hidden behind the numbers we encounter in our lives. Good charts make us smarter—if we know how to read them.

However, they can also deceive us. Charts lie in a variety of ways (displaying incomplete or inaccurate data, suggesting misleading patterns, and concealing uncertainty) or are frequently misunderstood. Many of us are ill-equipped to interpret the visuals that politicians, journalists, advertisers, and even our employers present each day. This talk teaches how to not only spot the lies in deceptive visuals, but also to take advantage of good ones.

Thursday, November 28, 2019, 11:00

Alberto Cairo
Director of the Visualization Program at the Center for Computational Science, University of Miami
Visualization and Graphic Design for Scientists

When designing a data visualization, showing the data comes first. After all, the main goal of a visualization is letting the reader spot patterns and trends behind numbers. But what if the visualization we design is to be presented to a general audience? In that case we may want to think deeply about visual design elements such as typography, color, composition, and hierarchy. This talk teaches non-designers such as scientists and statisticians how to make our charts, graphs, publications, and conference posters look better.

Thursday, November 14, 2019, 15:15

Michael Væth
Department of Biostatistics, University of Aarhus
Estimating survival benefit in a clinical trial

In a general population, a proportional change of mortality results in a change in life expectancy which, to a close approximation, is proportional to the logarithm of the change in mortality.

Using censored follow-up data, this relationship may be used to predict the difference in average remaining lifetime between two groups of individuals with approximately proportional mortality. The usefulness of the methodology in a clinical trial setting is explored using follow-up data from a clinical trial of breast cancer patients. Two methods are considered. One approach applies standardized mortality ratios with special attention to non-proportionality early in the follow-up period, the second approach uses a hazard ratio estimated in Cox regression analysis. Advantages and disadvantages of these approaches are discussed. The results are not discouraging, and the methodology seems potentially useful in a cost-effectiveness analysis of a new treatment option.
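The stated relationship between a proportional change in mortality and life expectancy can be checked numerically in a simple case. The sketch below assumes a Gompertz hazard a·exp(b·t) with made-up parameters and multiplies it by a factor theta; both the hazard model and the numbers are assumptions for illustration, not the methodology or data of the talk. Under this assumed model the change in life expectancy is roughly −log(theta)/b.

```python
import math


def life_expectancy(theta, a=1e-4, b=0.1, horizon=150.0, step=0.01):
    """Life expectancy when a Gompertz hazard a*exp(b*t) is multiplied by theta."""

    def surv(t):
        # Survival function S(t) = exp(-theta * (a / b) * (exp(b t) - 1)).
        return math.exp(-theta * a / b * (math.exp(b * t) - 1.0))

    # Integrate S(t) over [0, horizon] with the trapezoidal rule.
    n = int(horizon / step)
    total = 0.5 * (surv(0.0) + surv(horizon))
    for k in range(1, n):
        total += surv(k * step)
    return total * step


baseline = life_expectancy(1.0)
for theta in (0.5, 2.0):
    change = life_expectancy(theta) - baseline
    # Compare the exact change with the log-proportional approximation.
    print(theta, round(change, 2), round(-math.log(theta) / 0.1, 2))
```

Halving and doubling the hazard give changes of nearly equal size and opposite sign, as the logarithmic approximation suggests.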

Wednesday, November 06, 2019, 15:15

Yu Shen
Department of Biostatistics, M. D. Anderson Cancer Center, University of Texas
Estimation of Longitudinal Medical Cost Trajectory

Estimating the average monthly medical costs from disease diagnosis to a terminal event such as death for an incident cohort of patients is a topic of immense interest to researchers in health policy and health economics because patterns of average monthly costs over time reveal how medical costs vary across phases of care. The statistical challenges to estimating monthly medical costs longitudinally are multifold; the longitudinal cost trajectory (formed by plotting the average monthly costs from diagnosis to the terminal event) is likely to be nonlinear, with its shape depending on the time of the terminal event, which can be subject to right censoring. We tackle this statistically challenging topic by estimating the conditional mean cost at any month given the time of the terminal event. The longitudinal cost trajectories with different terminal event times form a bivariate surface, under some constraint. We propose to estimate this surface using bivariate penalized splines in an Expectation-Maximization algorithm that treats the censored terminal event times as missing data. We evaluate the proposed model and estimation method in simulations and apply the method to the medical cost data of an incident cohort of stage IV breast cancer patients from the Surveillance, Epidemiology and End Results–Medicare Linked Database. This is joint work by Li, Wu, Ning, Huang, Shih and Shen.

Monday, October 21, 2019, 15:15

Halina Frydman
NYU Stern School of Business
An Ensemble Method for Interval-Censored Time-to-Event Data

Interval-censored data analysis is important in biomedical statistics for any type of time-to-event response where the time of response is not known exactly, but rather only known to occur between two assessment times. Many clinical trials and longitudinal studies generate interval-censored data; one common example occurs in medical studies that entail periodic follow-up. In this paper, we propose a survival forest method for interval-censored data based on the conditional inference framework. We describe how this framework can be adapted to the situation of interval-censored data. We show that the tuning parameters have a non-negligible effect on the survival forest performance, and guidance is provided on how to tune the parameters in a data-dependent way to improve the overall performance of the method. Using Monte Carlo simulations, we show that the proposed survival forest is at least as effective as a survival tree method when the underlying model has a tree structure, performs similarly to an interval-censored Cox proportional hazards model when the true relationship is linear, and outperforms the survival tree method and Cox model when the true relationship is nonlinear. We illustrate the application of the method on a breast cancer data set.

Monday, October 07, 2019, 15:15

Jan Feifel
Institute of Statistics, Ulm University
Subcohorting methods for rare time-dependent exposures in time-to-event data

Antimicrobial resistance is one of the major burdens for today’s society. The challenges for researchers conducting studies on the effect of those rare exposures on the hospital stay are manifold.

For large cohort studies with rare outcomes, nested case-control designs are favorable due to the efficient use of limited resources. In our setting, nested case-control designs apply but do not lead to truly reduced sample sizes, because the outcome is not rare. We therefore study a modified nested case-control design, which samples all exposed patients but not all unexposed ones. Here, the inclusion probability of observed events evolves over time. This new scheme improves on the classical nested case-control design where for every observed event controls are chosen at random.

We will discuss several options on how to account for past time-dependent exposure status within a nested case-control design and their related merits. It will be seen that a smart utilization of the available information at each point in time can lead to a powerful and simultaneously less expensive design. We will also sketch alternative designs, e.g. treating exposure as a left-truncation event that generates matched controls, and time-simultaneous inference of the baseline hazard using the wild bootstrap. The methods will be applied to observational data on the impact of hospital-acquired pneumonia on the length-of-stay in hospital, which is an outcome commonly used to express both the impact and the costs of such adverse events.

Tuesday, September 03, 2019, 14:15

Christina Boschini
Biostatistics, Department of Public Health, University of Copenhagen and the Cancer Society
Ph.D.-defence: Excess risk estimation in matched cohort studies

The work presented in this thesis aims at contributing to the field of statistical methodology for the analysis of excess risk in matched cohort studies. The project was initiated by the Danish Cancer Society Research Center and motivated by the desire to investigate long-term health consequences in childhood cancer survivors. During the last five decades, as a consequence of improved survival rates, the major concern of childhood cancer research shifted from survival to late effects related to childhood cancer diagnosis and treatment. In 2009, thanks to dedicated childhood cancer researchers and to the resourceful Nordic national registries, the Adult Life after Childhood Cancer in Scandinavia (ALiCCS) study was established to improve knowledge about late effects of childhood cancer. This study has a matched cohort design where, for each childhood cancer survivor, five healthy comparison subjects of the same sex, age and country were randomly selected. The statistical models introduced in this thesis exploit the matching structure of the data to get a representative estimate of the excess risk of late effects in childhood cancer survivors. Two methods are described: the first models the excess risk in terms of excess hazard, while the second estimates the excess cumulative incidence function. Both approaches assume that the risk for a childhood cancer survivor is the sum of a cluster-specific background risk defined on the age time scale and an excess term defined on the time since exposure time scale. Estimates of the excess model parameters are obtained by pairwise comparisons between the cancer survivor and all the other matched comparison members in the same cluster. The contribution of the models introduced in this thesis to the public health area is illustrated by an application to the 5-year soft-tissue sarcoma survivor data from the ALiCCS study. 
By handling different features of registry data, such as multiple events, different time scales, right censoring and left truncation, this approach offers an easy tool to study how the excess risk develops in time and how it is affected by important risk factors, such as treatment.

Functions estimating the excess risk models were implemented in R and are publicly available.

Supervisors: Thomas Scheike, Klaus K. Andersen, Christian Dehlendorff and Jeanette Falck Winther

Evaluators: Thomas Alexander Gerds, Martin Bøgsted, Bjørn Møller.

Friday, July 19, 2019, 15:15

Yang Zhao
Department of Biostatistics, School of Public Health, Nanjing Medical University, P.R.China
Mediation Analysis and Random Forests

In this presentation, we will introduce the possibility and practice of using random forests (RF), an ensemble machine learning method, in causal mediation analysis. We will also discuss the advantages and potential risks of using RF-based methods in causal inference.

We first describe the limitations of traditional regression-based mediation analysis and then briefly outline the basic procedure of random forests. We propose a residual-based method to remove confounding effects in RF analysis and introduce its applications in high-dimensional genetic analysis[1]. The proposed RF-based mediation analysis framework includes three steps. First, we build a causal forest model under the counterfactual framework to model the relationship between outcome, treatment, mediators and covariates[2]. Next, we predict the mediators with traditional random forests, using treatment and covariates as predictors. Finally, the average effects are estimated using weighted methods; possible candidates for the weights include the inverses of probabilities and of variances. We performed extensive computer simulations to evaluate the performance of random forests in mediation analysis. We observed that the proposed methods obtain accurate estimates of the direct and indirect effects. The results also demonstrated that RF-based methods are more flexible than traditional regression-based methods: because the RF-based method can handle non-linear relationships and high-order interactions, we do not need to specify whether exposure-mediator interactions are present, or their form, as traditional regression-based methods require.

Data from phase II and III clinical trials of a novel small-molecule multi-targeted cancer drug, which is already marketed in China, are used to illustrate the application of the RF-based mediation analysis. We evaluated the mediation effects of some measurements from routine blood tests, such as platelets, on the progression and death outcomes for non-small cell lung cancer patients.

Conclusions are that RF-based methods have their advantages in the mediation analysis.

Monday, May 27, 2019, 14:15

Liis Starkopf
Biostatistics, Department of Public Health, University of Copenhagen
Ph.D.-defence: Statistical methods for causal inference and mediation analysis

Many clinical or epidemiological studies aim to estimate the causal effect of some exposure or intervention on some outcome. The use of causal inference helps to design statistical analyses that come as close as possible to answering the causal questions of interest. In this thesis we focus on the statistical methodology for causal inference in general and mediation analysis in particular. Specifically, we compare five existing software solutions for mediation analysis to provide practical advice for applied researchers interested in mediation analysis. We further focus on natural effect models and propose a new estimation approach that is especially advantageous in settings where the mediator and the outcome distributions are difficult to model, but the exposure is a single binary variable. Finally, we propose a penalized g-computation estimator of marginal structural models with monotonicity constraints to estimate the counterfactual 30-day survival probability in cardiac arrest patients receiving/not receiving cardiopulmonary resuscitation (CPR) as a non-increasing function of ambulance response time.

Supervisors: Theis Lange, Thomas A. Gerds

Evaluators: Frank Eriksson, Jacob v. B. Hjelmborg, Ingeborg Waernbaum.

Monday, May 06, 2019, 15:15

Benoit Liquet
Laboratory of Mathematics and their Applications, University of Pau and Pays de l’Adour
Variable Selection and Dimension Reduction methods for high dimensional and Big-Data Set

It is well established that incorporation of prior knowledge on the structure existing in the data for potential grouping of the covariates is key to more accurate prediction and improved interpretability.

In this talk, I will present new multivariate methods that incorporate grouping structure in frequentist methodology for variable selection and dimension reduction, to tackle the analysis of high-dimensional and big data sets.

Friday, April 26, 2019, 15:15

Morten Overgaard
Aarhus Universitet
When do pseudo-observations have the appropriate conditional expectation?

A regression approach based on replacing observed and unobserved outcome values with pseudo-observations ought to work if the pseudo-observations have the appropriate conditional expectation. The pseudo-observations under study are jack-knife pseudo-values of some estimator and are closely related to the influence function of the estimator they are based on.

In this talk, we will have a look at some examples of such influence functions and look at potential problems and solutions concerning the conditional expectation. Specifically, influence functions from inverse probability of censoring weighted estimators where the estimate of the censoring distribution is allowed to take covariates into account and influence functions of the Kaplan–Meier estimator in a delayed entry setting will be considered.
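For readers unfamiliar with jack-knife pseudo-values, the sketch below computes them for the Kaplan–Meier estimate of S(t) in the simplest right-censored setting: the i-th pseudo-observation is n·θ̂ − (n−1)·θ̂₍₋ᵢ₎, where θ̂₍₋ᵢ₎ is the estimate with subject i left out. The function names and the tiny data set are hypothetical, and the covariate-dependent censoring weights and delayed entry discussed in the talk are not handled here.

```python
def kaplan_meier(times, events, t):
    """Kaplan-Meier estimate of S(t) from right-censored data (event flag 1 = observed)."""
    surv = 1.0
    event_times = sorted({ti for ti, di in zip(times, events) if di and ti <= t})
    for u in event_times:
        at_risk = sum(1 for ti in times if ti >= u)
        deaths = sum(1 for ti, di in zip(times, events) if di and ti == u)
        surv *= 1.0 - deaths / at_risk
    return surv


def pseudo_observations(times, events, t):
    """Jack-knife pseudo-values n*theta_hat - (n-1)*theta_hat_(-i) for S(t)."""
    n = len(times)
    full = kaplan_meier(times, events, t)
    pseudo = []
    for i in range(n):
        # Recompute the estimator with subject i removed.
        loo_times = times[:i] + times[i + 1:]
        loo_events = events[:i] + events[i + 1:]
        pseudo.append(n * full - (n - 1) * kaplan_meier(loo_times, loo_events, t))
    return pseudo


# Hypothetical data: follow-up times with event indicators (0 = censored).
times = [1.0, 2.0, 3.0, 4.0, 5.0]
events = [1, 0, 1, 1, 0]
print(pseudo_observations(times, events, t=3.5))
```

Without censoring, the pseudo-value for subject i reduces to the indicator that the subject survives beyond t, which is what makes the values usable as regression outcomes.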

Friday, January 18, 2019, 14:15

Silke Szymczak
Institut für Medizinische Informatik und Statistik, Universitätsklinikum Schleswig-Holstein
Looking into the black box of random forests

Machine learning methods and in particular random forests (RFs) are promising approaches for classification and regression based on omics data sets. I will first give a short introduction to RFs and variable selection, i.e. the identification of variables that are important for prediction. In the second part of my talk I will present some results of our current methodological work on RFs. We performed a simulation based comparison of different variable selection methods where Boruta (Kursa & Rudnicki, 2010, J Stat Softw) and Vita (Janitza et al. 2016 Adv Data Anal Classif) were consistently superior to the other approaches. Furthermore, we developed a novel method called surrogate minimal depth (SMD). It is based on the structure of the decision trees in the forest and additionally takes into account relationships between variables. In simulation studies we showed that correlation patterns can be reconstructed and that SMD is more powerful than existing variable selection methods. We are currently working on an evaluation of extensions of the RF algorithm that integrate pathway membership information into the model building process and I will show the first preliminary results.

Tuesday, November 20, 2018, 13:00

Ditte Nørbo Sørensen
Biostatistics, UCPH
PhD defence: Causal proportional hazards estimation in the presence of an instrumental variable

Causation and correlation are two fundamentally different concepts, but too often correlation is misunderstood as causation. Based on given data, correlations are straightforward to establish, whereas the underlying causal structures that can explain a given association are hypothetically endless in their variety. The importance of the statistical discipline known as causal inference has been recognized in the past decades, and the field is still expanding. In this thesis we turn our attention to survival outcome, and how to estimate proportional hazards from which we can learn about causation. Our focus is specifically the case where an instrumental variable is present.

Monday, November 05, 2018, 15:15

Lars Endahl and Henrik Ravn
Biostatistics, Novo Nordisk A/S
Estimands and missing data - two hot topics in the pharmaceutical industry

A 2012 report commissioned by the US Food and Drug Administration (FDA) on the prevention and analysis of trial results in the presence of missing data has recently led to significant changes in clinical drug development. The report also introduced estimands as a new concept, one elaborated on in recently updated statistical guidelines for the pharmaceutical industry (the ICH-E9(R1), still in draft). The focus of the ICH-E9(R1) guideline is to discuss how intercurrent events, such as death or discontinuation of the randomised trial product, can be embraced in the estimation of a treatment effect rather than just seen as a source of bias. In this talk we will outline how the estimand concept and the focus on prevention of missing data have changed the way clinical trials for new drug approvals are designed and conducted, how the data is analysed and how the results are communicated.

Monday, September 24, 2018, 15:15

Boris Hejblum
Universite Bordeaux
Controlling Type-I error in RNA-seq differential analysis through a variance component score test

Gene expression measurement technology has shifted from microarrays to sequencing, producing ever richer high-throughput data for transcriptomics studies. As studies using these data grow in size, frequency, and importance, it is becoming urgent to develop and refine the statistical tools available for their analysis. In particular, there is a need for methods that better control the type-I error as clinical RNA-seq studies are including a growing number of subjects (measurements being cheaper) resulting in larger sample sizes. We model RNA-seq counts as continuous variables using nonparametric regression to account for their inherent heteroscedasticity, in a principled, model-free, and efficient manner for detecting differentially expressed genes from RNA-seq data. Our method can identify the genes whose expression is significantly associated with one or several factors of interest in complex experimental designs, including studies with longitudinal measurement of gene expression. We rely on a powerful variance component score test that can account for both adjustment covariates and data heteroscedasticity without assuming any specific parametric distribution for the (transformed) RNA-seq counts. Despite the presence of a nonparametric component, our test statistic has a simple form and limiting distribution, which can be computed quickly. A permutation version of the test is also derived for small sample sizes, but this leads to issues in controlling the False Discovery Rate. Finally we also propose an extension of the method for Gene Set Analysis. Applied to both simulated data and real benchmark datasets, we show that our test has good statistical properties when compared to state-of-the-art methods limma/voom, edgeR, and DESeq2. In particular, we show that those three methods can all fail to control the type I error and the False Discovery Rate under realistic settings, while our method behaves as expected.
We apply our proposed method to two candidate vaccine phase-I studies with repeated gene expression measurements: one public dataset investigating a candidate vaccine against Ebola, and one original dataset investigating a candidate vaccine against HIV.

Thursday, June 21, 2018, 11:00

Ramon Oller Piqué
Central University of Catalonia
A nonparametric test for the association between longitudinal covariates and censored survival data

Many biomedical studies focus on the association between a longitudinal measurement and a time-to-event outcome and quantify this association by means of a longitudinal-survival joint model. In this paper we propose the LLR test, a longitudinal extension of the log-rank test statistic given by Peto and Peto (1972), to provide evidence of a plausible association between a time-to-event outcome (right- or interval-censored) and a longitudinal covariate. As joint model methods are complex and hard to interpret, a preliminary test for the association between both processes, such as LLR, is wise. The statistic LLR can be expressed in the form of a weighted difference of hazards, yielding a broad class of weighted log-rank test statistics, LWLR, which allow one to assess the association between the longitudinal covariate and the survival time while stressing early, middle or late hazard differences through different weighting functions. The asymptotic distribution of LLR is derived by means of a permutation approach under the assumption that the underlying censoring process is identical for all individuals. A simulation study is conducted to evaluate the performance of the test statistics LLR and LWLR and shows that the empirical size is close to the significance level and that the power of the test depends on the association between the covariates and the survival time. Four data sets together with a toy example are used to illustrate the LLR test. Three of the data sets involve right-censored data and correspond to the European Randomized Screening for Prostate Cancer study (Serrat and others, 2015) and two well-known data sets given in the R package JM. The fourth data set explores the study Epidemiology of Diabetes Interventions and Complications (Sparling and others, 2006) which includes interval-censored data.

Monday, June 18, 2018, 15:15

Jacob Fiksel
Johns Hopkins Bloomberg School of Public Health, Baltimore, USA
Optimized Survival Evaluation to Guide Bone Metastases Management: Developing an Improved Statistical Approach

In managing bone metastases, estimation of life expectancy is central for individualizing patient care given a range of radiotherapy (RT) treatment options. With access to larger volume and more complex patient data and statistical models, oncologists and statisticians must develop methods for optimal decision support. Approaches incorporating many covariates should identify complex interactions and effects while also managing missing data. In this talk, I discuss how a statistical learning approach, random survival forests (RSF), handles these challenges in building survival prediction models. I show how we applied RSF to develop a clinical model which predicts survival for patients with bone metastases using 26 predictor variables and outperforms two simpler, validated Cox regression models. I will conclude by introducing a simple bootstrap based procedure, which can be used for both simple and complex prediction models, to produce valid confidence interval estimates for model performance metrics using internal validation.

Tuesday, March 20, 2018, 15:15

Philip Hougaard (joint with Jacob von Hjelmborg)
Lundbeck A/S
Survival of Danish twins born 1870-2000 – preliminary report

Hougaard, Harvald and Holm (JASA, 1992) used frailty models to consider the survival of same-sex Danish twins born between 1881 and 1930, with follow-up until 1980, for twins where both were alive at age 15. This presentation gives an update to that analysis. For the birth cohorts 1870-1930, same-sex twins, where both were alive at age 6, are considered. For the birth cohorts 1931-2000, all twins are included. Follow-up is to 2016. Besides presenting the results, I will discuss the appropriateness of shared frailty models for studying this problem.

Tuesday, March 06, 2018, 15:15

Xiang Zhou
Department of Biostatistics, University of Michigan
Non-parametric genetic prediction of complex traits with latent Dirichlet process regression models

There has been a growing interest in using genotype data to perform genetic prediction of complex traits. Accurate genetic prediction can facilitate genomic selection in animal and plant breeding programs, and can aid in the development of personalized medicine in humans. Because most complex traits have a polygenic architecture and are each influenced by many genetic variants with small effects, accurate genetic prediction requires the development of polygenic methods that can model all genetic variants jointly. Many recently developed polygenic methods make parametric modeling assumptions on the effect size distribution, and different polygenic methods differ in this assumption. However, depending on how well the effect size distribution assumption matches the unknown truth, existing polygenic methods can perform well for some traits but poorly for others. To enable robust phenotype prediction performance across a range of phenotypes, we develop a novel polygenic model with a flexible assumption on the effect size distribution. We refer to our model as the latent Dirichlet Process Regression (DPR). DPR relies on the Dirichlet process to assign a prior on the effect size distribution itself, is non-parametric in nature, and is capable of inferring the effect size distribution from the data at hand. Because of the flexible modeling assumption, DPR is able to adapt to a broad spectrum of genetic architectures and achieves robust predictive performance for a variety of complex traits. We compare the predictive performance of DPR with several commonly used polygenic methods in simulations. We further illustrate the benefits of DPR by applying it to predict gene expressions using cis-SNPs, to conduct a PrediXcan-based gene set test, to perform genomic selection of four traits in two species, and to predict five complex traits in a human cohort. Our method is implemented in the DPR software, freely available at www.xzlab.org/software.html.

Monday, December 04, 2017, 15:15

Federico Ambrogi
Laboratory of Medical Statistics and Biometry, University of Milan
Predicting survival probabilities using pseudo-observations

Pseudo-values may provide a way to use ‘standard’ estimation procedures in survival analysis, where ‘standard’ refers to methods not specifically designed to account for censoring. In this work a generalized additive linear model is analyzed using pseudo-values to provide a smooth estimate of the survival function by using P-spline basis functions. The performance of the estimator compared to both standard tools of survival analysis and machine learning techniques is presented through simulations and a real example.

Monday, November 20, 2017, 15:15

Klaus Groes Larsen
Lundbeck Denmark
Network Meta Analysis

Network Meta Analysis (NMA) is a statistical framework that allows for comparison of several pharmacological treatments based on results reported in clinical trials. The value of NMAs lies in that they permit the summary of the overall evidence and the ranking of different treatments in terms of efficacy and safety endpoints, combining both direct and indirect evidence. The statistical model is itself relatively simple and allows for addressing specific model assumptions such as heterogeneity and consistency (both of which will be defined and discussed). The methodology will be introduced through two examples, one concerning the efficacy and safety of SSRIs/SNRIs in the treatment of depression, and one that compares the cognitive performance, as measured by the digit-symbol-substitution test (DSST), in patients with depression.
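
To make the notion of indirect evidence concrete, here is a minimal sketch of the simplest building block: an adjusted indirect comparison of treatments A and C through a common comparator B. This is an illustration under stated assumptions, not the model presented in the talk: the function name `indirect_comparison` is hypothetical, the effects are assumed to be on an additive scale (for example mean differences or log odds ratios), and the two sets of trials are assumed independent.

```python
import math


def indirect_comparison(d_ab, se_ab, d_cb, se_cb):
    """Indirect estimate of A versus C from A-versus-B and C-versus-B effects.

    Returns the point estimate and its standard error, assuming the two
    input estimates are independent and measured on an additive scale.
    """
    d_ac = d_ab - d_cb
    se_ac = math.sqrt(se_ab ** 2 + se_cb ** 2)
    return d_ac, se_ac
```

The standard error of the indirect estimate is never smaller than that of either input, which is one reason a network analysis combines indirect with direct evidence where both exist. Comparing the direct and indirect estimates of the same contrast is the basic idea behind checking consistency.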

Monday, November 06, 2017, 15:15

Mireille Schnitzer
Biostatistics, Université de Montréal
Longitudinal variable selection in causal inference with collaborative targeted minimum loss-based estimation

Causal inference methods have been developed for longitudinal observational study designs where confounding is thought to occur over time. In particular, marginal structural models model the expectation of the counterfactual outcome conditional only on past treatment and possibly a set of baseline covariates. In such contexts, model covariates (potential time-varying confounders) are generally identified using domain-specific knowledge. However, this may leave an analyst with a large set of potential confounders that may hinder estimation. Previous approaches to data-adaptive variable selection in causal inference were generally limited to the single time-point setting. We develop a longitudinal extension of collaborative targeted minimum loss-based estimation (C-TMLE) for the estimation of the parameters in a marginal structural model that can be applied to perform variable selection in propensity score models. We demonstrate the properties of this estimator through a simulation study and apply the method to investigate the safety of trimester-specific exposure to inhaled corticosteroids during pregnancy in women with mild asthma.

Thursday, November 02, 2017, 15:15

Arvid Sjölander
Department of Medical Epidemiology and Biostatistics, Karolinska
Confounding, mediation and colliding - what types of shared covariates does the sibling comparison design control for?

The sibling comparison design is an important epidemiological tool to control for unmeasured confounding, in studies of the causal effect of an exposure on an outcome. It is routinely argued that within-sibling associations are automatically controlled for all measured and unmeasured covariates that are shared (constant) within sets of siblings, such as early childhood environment and parental genetic make-up. However, an important lesson from modern causal inference theory is that not all types of covariate control are desirable. In particular, it has been argued that collider control always leads to bias, and that mediator control may or may not lead to bias, depending on the research question. In this presentation we use Directed Acyclic Graphs (DAGs) to distinguish between shared confounders, shared mediators and shared colliders, and we examine which of these shared covariates the sibling comparison design really controls for.

Monday, October 30, 2017, 15:15

Sebastien Haneuse
Harvard T.H. Chan School of Public Health
Adjusting for selection bias in electronic health records-based research

Electronic health records (EHR) data provide unique opportunities for public health and medical research. From a methodological perspective, much of the focus in the literature has been on the control of confounding bias. In contrast, selection due to incomplete data is an under-appreciated source of bias in analyzing EHR data. When framed as a missing-data problem, standard methods could be applied to control for selection bias in the EHR context. In such studies, however, the process by which data are complete for any given patient likely involves the interplay of numerous clinical decisions made by patients, health care providers, and the health system. In this sense, standard methods fail to capture the complexity of the data so that residual selection bias may remain. Building on a recently-proposed framework for characterizing how data arise in EHR-based studies, sometimes referred to as the data provenance, we develop and evaluate a statistical framework for regression modeling based on inverse probability weighting that adjusts for selection bias in the complex setting of EHR-based research. We show that the resulting estimator is consistent and asymptotically Normal, and derive the form of the asymptotic variance. Plug-in estimators for the latter are proposed. We use simulations to: (i) highlight the potential for bias in EHR studies when standard approaches are used to account for selection bias, and (ii) evaluate the small-sample operating characteristics of the proposed framework. Finally, the methods are illustrated using data from an on-going, multi-site EHR-based study of bariatric surgery on BMI.
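
The basic idea of inverse probability weighting can be sketched in a few lines. The example below is a deliberately simplified illustration and not the framework developed in the talk: it estimates a population mean from complete cases only, it assumes the probability of having complete data is known for each patient (in practice it would be modeled, possibly in several stages), and the function name `ipw_mean` is hypothetical.

```python
def ipw_mean(values, observed, probs):
    """Inverse-probability-weighted mean using complete cases only.

    values[i]   outcome for subject i (ignored when not observed)
    observed[i] True if subject i has complete data
    probs[i]    probability that subject i has complete data (assumed known)
    """
    numerator = 0.0
    denominator = 0.0
    for y, r, p in zip(values, observed, probs):
        if r:
            # Each complete case stands in for 1 / p subjects like it.
            numerator += y / p
            denominator += 1.0 / p
    return numerator / denominator
```

For example, take four patients with outcome 1 who are always observed and four with outcome 3 who are observed with probability 0.5, two of whom happen to be complete. The unweighted complete-case mean is pulled towards 1, whereas the weighted mean recovers the full-population mean of 2.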

Thursday, September 28, 2017, 14:00

Anna Bellach
University of Copenhagen
Ph.D.-defence: Competing risks regression models based on pseudo risk sets

Competing risks frequently occur in medical studies, when individuals are exposed to several mutually exclusive event types. A common approach is to model the cause specific hazards. Challenges arise from the fact that the relation between the cause specific hazard and the corresponding cumulative incidence function is complex. The product limit estimator based on the cause specific hazard systematically overestimates the cumulative incidence function and estimated regression parameters are not interpretable with regard to the cumulative incidence function.
Direct regression modeling of the cumulative incidence function has thus become popular for analyzing such complex time to event data. The special feature of the Fine-Gray model is that regression parameters target the subdistribution hazard, which has a one-to-one correspondence to the cumulative incidence function. This enables the extension to a general likelihood framework that is proposed and further developed in this thesis. In particular we establish a nonparametric maximum likelihood estimation and its extension to the practically relevant setting of recurrent event data with competing terminal events and to independently left-truncated and right-censored competing risks data.
We establish asymptotic properties of the estimated parameters and propose a sandwich estimator for the variance. The solid performance of the proposed method is demonstrated in comprehensive simulation studies. To illustrate its practical utility we provide applications to a bone marrow transplant dataset, a bladder cancer dataset and to an HIV dataset from the CASCADE collaboration.
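
The overestimation mentioned above can be seen in a small numerical sketch. The code below compares the Aalen-Johansen estimate of the cumulative incidence function with one minus the product limit (Kaplan-Meier) estimator that treats competing events as censoring. It is an illustration only, not code from the thesis; the function name `cumulative_incidence` and the coding of `causes` (0 for censoring, positive integers for event types) are assumptions made here.

```python
def cumulative_incidence(times, causes, cause, t):
    """Return (aalen_johansen, one_minus_km) for event type `cause` at time t.

    causes[i] is 0 for a censored observation and 1, 2, ... for an event type.
    """
    overall_surv = 1.0  # all-cause survival just before the current time
    km_cause = 1.0      # product limit with competing events as censoring
    cif = 0.0           # Aalen-Johansen cumulative incidence
    event_times = sorted({x for x, c in zip(times, causes) if c != 0 and x <= t})
    for u in event_times:
        at_risk = sum(1 for x in times if x >= u)
        d_cause = sum(1 for x, c in zip(times, causes) if x == u and c == cause)
        d_any = sum(1 for x, c in zip(times, causes) if x == u and c != 0)
        cif += overall_surv * d_cause / at_risk
        km_cause *= 1.0 - d_cause / at_risk
        overall_surv *= 1.0 - d_any / at_risk
    return cif, 1.0 - km_cause
```

With six uncensored subjects failing at times 1 to 6 and alternating between causes 1 and 2, the Aalen-Johansen estimate for cause 1 at time 6 equals the observed proportion 0.5, while the product limit based quantity is larger, because it ignores that subjects who fail from cause 2 can no longer fail from cause 1.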

Monday, September 18, 2017, 15:15

Kjetil Røysland
Institute of Basic Medical Sciences, Biostatistics, Oslo University
Causal local independence models

Causal inference has lately had a huge impact on how statistical analyses based on non-experimental data are done. The idea is to use data from a non-experimental scenario that could be subject to several spurious effects and then fit a model that would govern the frequencies we would have seen in a related hypothetical scenario where the spurious effects are eliminated. This opens up the possibility of using health registries to answer new and more ambitious questions. However, there has not been much focus on causal inference based on time-to-event data or survival analysis. The now well-established theory of causal Bayesian networks is, for instance, not suitable for handling such processes. Motivated by causal inference for event-history data from the health registries, we have introduced causal local independence models. We show that they offer a generalization of causal Bayesian networks that also enables us to carry out causal inference based on non-experimental data when continuous-time processes are involved. The main purpose of this work, in collaboration with Vanessa Didelez, is to provide new tools for determining identifiability of causal effects of event history data that is subject to censoring. It builds on previous work on local independence graphs and delta-separation by Vanessa Didelez and previous work on causal inference for counting processes by Kjetil Røysland. We provide a new result that gives quite general graphical criteria for when causal validity of a local independence model is preserved in sub-models. If the observable variables, or processes, form a causally valid sub-model, then we can identify most relevant causal effects by re-weighting the actual observations. This is used to prove that the continuous time marginal structural models for event history analysis, based on martingale dynamics, are valid in a much more general context than what has been known previously.

Monday, September 11, 2017, 15:15

Philip Hougaard
Lundbeck and University of Southern Denmark
A personal opinion on personalized medicine

For biomarkers there is a consensus definition from 2001. However, there is no similar thing for personalized medicine. This has created some confusion. Actually, I believe that conceptually there are two contrasting viewpoints on what personalized medicine covers. Besides, there are differences on a smaller scale regarding the technical complexity of the individual information to be used in a treatment strategy. Based on a series of scenarios, I will discuss these issues. I will not end up with a formal definition but rather an informal description of the two possibilities; thus allowing for discussion. Finally, I will have some slides on the drug development program needed for progressing a personalized treatment.

Monday, September 04, 2017, 15:15

Sarah Friedrich
Institute of Statistics, Ulm University
Permutation- and resampling-based inference for semi- and non-parametric effects in dependent data

We consider different resampling approaches for testing general linear hypotheses with dependent data. We distinguish between a repeated measures model, where subjects are repeatedly observed over time, and multivariate data. Furthermore, we consider semi-parametric approaches for metric data, where we test null hypotheses formulated in terms of means, as well as non-parametric rank-based models for ordinal data. In these settings, current state-of-the-art test statistics include the Wald-type statistic (WTS), which is asymptotically chi-square-distributed, and the ANOVA-type statistic (ATS), which is not an asymptotic pivot, but can be approximated by an F-distribution. To improve the small sample behavior of these test statistics in the described settings, we consider different resampling schemes. In each setting, we prove the asymptotic validity of the considered approach(es), analyze the small sample behavior of the tests in simulation studies and apply the resampling approaches to data examples from the life sciences.

Friday, June 23, 2017, 15:15

Pierre Joly
Biostatistics, University of Bordeaux
Pseudo-values for interval censored data

The pseudo-value approach has been developed for estimating regression models for health indicators, such as the absolute risk of developing a disease or life expectancy without disease, when data are right censored. The penalized likelihood approach allows estimating an illness-death model taking into account competing risks and interval censoring of the time of illness. In this work, we propose to use pseudo-values with estimators from an illness-death model estimated by penalized likelihood. We illustrate this approach with cohort data with the aim of estimating the (remaining) lifetime probabilities of developing dementia.