Yang Zhao

Department of Biostatistics, School of Public Health, Nanjing Medical University, P.R.China

Mediation Analysis and Random Forests

In this presentation, we will introduce the possibility and practice of using random forests (RF), an ensemble machine learning method, in causal mediation analysis. We will also discuss the advantages and potential risks of using RF-based methods in causal inference.

We will first describe the limitations of traditional regression-based mediation analysis. We then briefly describe the basic procedure of random forests. We propose a residual-based method to remove confounding effects in RF analysis and introduce its applications in high-dimensional genetic analysis[1]. The proposed RF-based mediation analysis framework includes three steps. First, we build a causal forest model under the counterfactual framework to model the relationship between outcome, treatment, mediators and covariates[2]. Next, we predict the mediators with traditional random forests, using treatment and covariates as predictors. Finally, the average effects are estimated using weighted methods; possible candidates for the weights include inverse probabilities and inverse variances. We performed extensive computer simulations to evaluate the performance of random forests in mediation analysis. We observed that the proposed methods can obtain accurate estimates of the direct and indirect effects. The results also demonstrated that RF-based methods are more flexible than traditional regression-based methods. Because the RF-based method can handle non-linear relationships and high-order interactions, we do not need to specify whether there are exposure-mediator interactions, or their types, as is required in traditional regression-based methods.

Data from phase II and III clinical trials of a novel small-molecule multi-targeted cancer drug, which is already marketed in China, are used to illustrate the application of RF-based mediation analysis. We evaluated the mediation effects of some measurements from routine blood tests, such as platelets, on progression and death outcomes for non-small cell lung cancer patients.

We conclude that RF-based methods offer advantages in mediation analysis, notably their flexibility with respect to non-linear relationships and interactions.
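For contrast with the RF-based framework, the traditional regression-based decomposition into direct and indirect effects can be sketched as follows. This is only an illustration of the baseline approach, not the method proposed in the talk: it assumes linear models without exposure-mediator interaction, and the simulated data, coefficient values and helper names (`slope`, `residuals`) are hypothetical.

```python
import random


def mean(v):
    return sum(v) / len(v)


def slope(x, y):
    """Least-squares slope of y on x (model with intercept)."""
    mx, my = mean(x), mean(y)
    sxy = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))
    sxx = sum((xi - mx) ** 2 for xi in x)
    return sxy / sxx


def residuals(y, x):
    """Residuals of y after a least-squares regression on x."""
    b = slope(x, y)
    a = mean(y) - b * mean(x)
    return [yi - a - b * xi for xi, yi in zip(x, y)]


# Simulated data; every coefficient below is a hypothetical choice.
random.seed(0)
n = 5000
A = [float(random.random() < 0.5) for _ in range(n)]      # binary exposure
M = [0.5 * a + random.gauss(0, 1) for a in A]             # mediator model
Y = [0.3 * a + 0.8 * m + random.gauss(0, 1)               # outcome model
     for a, m in zip(A, M)]

# Partial coefficients obtained by residualising one regressor on the other.
a_hat = slope(A, M)                   # exposure -> mediator
b_hat = slope(residuals(M, A), Y)     # mediator -> outcome, adjusted for exposure
direct = slope(residuals(A, M), Y)    # exposure -> outcome, adjusted for mediator
indirect = a_hat * b_hat              # product-of-coefficients indirect effect
total = slope(A, Y)                   # unadjusted exposure -> outcome

print(f"direct={direct:.3f} indirect={indirect:.3f} total={total:.3f}")
```

In this linear setting the total effect equals the sum of the direct and indirect effects. Once interactions or non-linearities are present, the analyst has to specify them explicitly, which is the limitation the abstract refers to.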

Liis Starkopf

Biostatistics, Department of Public Health, University of Copenhagen

Ph.D.-defence: Statistical methods for causal inference and mediation analysis

Many clinical or epidemiological studies aim to estimate the causal effect of some exposure or intervention on some outcome. The use of causal inference helps to design statistical analyses that come as close as possible to answering the causal questions of interest. In this thesis we focus on the statistical methodology for causal inference in general and mediation analysis in particular. Specifically, we compare five existing software solutions for mediation analysis to provide practical advice for applied researchers interested in mediation analysis. We further focus on natural effect models and propose a new estimation approach that is especially advantageous in settings where the mediator and the outcome distributions are difficult to model, but the exposure is a single binary variable. Finally, we propose a penalized g-computation estimator of marginal structural models with monotonicity constraints to estimate the counterfactual 30-day survival probability in cardiac arrest patients receiving/not receiving cardiopulmonary resuscitation (CPR) as a non-increasing function of ambulance response time.

Supervisors: Theis Lange, Thomas A. Gerds

Evaluators: Frank Eriksson, Jacob v. B. Hjelmborg, Ingeborg Waernbaum.

Benoit Liquet

Laboratory of Mathematics and their Applications, University of Pau and Pays de l’Adour

Variable Selection and Dimension Reduction methods for high dimensional and Big-Data Set

It is well established that incorporation of prior knowledge on the structure existing in the data for potential grouping of the covariates is key to more accurate prediction and improved interpretability.

In this talk, I will present new multivariate methods that incorporate grouping structure in frequentist methodology for variable selection and dimension reduction, to tackle the analysis of high-dimensional and big data sets.

Morten Overgaard

Aarhus Universitet

When do pseudo-observations have the appropriate conditional expectation?

A regression approach based on substituting pseudo-observations for the observed and unobserved outcome values ought to work if the pseudo-observations have the appropriate conditional expectation. The pseudo-observations under study are jack-knife pseudo-values of some estimator and are closely related to the influence function of the estimator they are based on.

In this talk, we will have a look at some examples of such influence functions and look at potential problems and solutions concerning the conditional expectation. Specifically, influence functions from inverse probability of censoring weighted estimators where the estimate of the censoring distribution is allowed to take covariates into account and influence functions of the Kaplan–Meier estimator in a delayed entry setting will be considered.
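As an illustration of what jack-knife pseudo-values look like, the following minimal sketch computes them for the Kaplan-Meier estimate of the survival probability at a fixed time point under right censoring. It is not taken from the talk: the function names and the toy data are hypothetical, and the inverse-probability-of-censoring-weighted and delayed-entry estimators discussed in the abstract are not covered.

```python
def kaplan_meier(times, events, t):
    """Kaplan-Meier estimate of S(t); events[i] is 1 for an event, 0 for censoring."""
    surv = 1.0
    # Distinct observed event times up to t, in increasing order.
    event_times = sorted({ti for ti, ei in zip(times, events) if ei == 1 and ti <= t})
    for u in event_times:
        at_risk = sum(1 for ti in times if ti >= u)
        deaths = sum(1 for ti, ei in zip(times, events) if ti == u and ei == 1)
        surv *= 1.0 - deaths / at_risk
    return surv


def pseudo_values(times, events, t):
    """Jack-knife pseudo-values n*S(t) - (n-1)*S_(-i)(t), one per subject."""
    n = len(times)
    full = kaplan_meier(times, events, t)
    values = []
    for i in range(n):
        loo_times = times[:i] + times[i + 1:]      # leave subject i out
        loo_events = events[:i] + events[i + 1:]
        values.append(n * full - (n - 1) * kaplan_meier(loo_times, loo_events, t))
    return values


times = [1.0, 2.0, 3.0, 5.0, 7.0]

# Without censoring, the pseudo-values reduce to the indicators I(T_i > t).
pv_uncensored = pseudo_values(times, [1, 1, 1, 1, 1], t=4.0)

# With the second subject censored at time 2, that subject's pseudo-value
# is no longer 0 or 1.
pv_censored = pseudo_values(times, [1, 0, 1, 1, 1], t=4.0)

print(pv_uncensored)
print(pv_censored)
```

Each pseudo-value then takes the place of the possibly unobserved indicator I(T_i > t) as the response in a regression model, which is why its conditional expectation given covariates matters.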

Silke Szymczak

Institut für Medizinische Informatik und Statistik, Universitätsklinikum Schleswig-Holstein

Looking into the black box of random forests

Machine learning methods and in particular random forests (RFs) are promising approaches for classification and regression based on omics data sets. I will first give a short introduction to RFs and variable selection, i.e. the identification of variables that are important for prediction. In the second part of my talk I will present some results of our current methodological work on RFs. We performed a simulation based comparison of different variable selection methods where Boruta (Kursa & Rudnicki, 2010, J Stat Softw) and Vita (Janitza et al. 2016 Adv Data Anal Classif) were consistently superior to the other approaches. Furthermore, we developed a novel method called surrogate minimal depth (SMD). It is based on the structure of the decision trees in the forest and additionally takes into account relationships between variables. In simulation studies we showed that correlation patterns can be reconstructed and that SMD is more powerful than existing variable selection methods. We are currently working on an evaluation of extensions of the RF algorithm that integrate pathway membership information into the model building process and I will show the first preliminary results.

Ditte Nørbo Sørensen

Biostatistics, UCPH

PhD defence: Causal proportional hazards estimation in the presence of an instrumental variable

Causation and correlation are two fundamentally different concepts, but too often correlation is misunderstood as causation. Based on given data, correlations are straightforward to establish, whereas the underlying causal structures that can explain a given association are hypothetically endless in their variety. The importance of the statistical discipline known as causal inference has been recognized in the past decades, and the field is still expanding. In this thesis we turn our attention to survival outcomes and to how to estimate proportional hazards from which we can learn about causation. Our focus is specifically the case where an instrumental variable is present.

Lars Endahl and Henrik Ravn

Biostatistics, Novo Nordisk A/S

Estimands and missing data - two hot topics in the pharmaceutical industry

A 2012 report commissioned by the US Food and Drug Administration (FDA) on the prevention of missing data and the analysis of trial results in the presence of missing data has recently led to significant changes in clinical drug development. The report also introduced estimands as a new concept, which is elaborated on in recently updated statistical guidelines for the pharmaceutical industry (the ICH-E9(R1), still in draft). The focus of the ICH-E9(R1) guideline is to discuss how intercurrent events, such as death or discontinuation of the randomised trial product, can be embraced in the estimation of a treatment effect rather than just seen as a source of bias. In this talk we will outline how the estimand concept and the focus on prevention of missing data have changed the way clinical trials for new drug approvals are designed and conducted, how the data are analysed and how the results are communicated.

Boris Hejblum

Universite Bordeaux

Controlling Type-I error in RNA-seq differential analysis through a variance component score test

Gene expression measurement technology has shifted from microarrays to sequencing, producing ever richer high-throughput data for transcriptomics studies. As studies using these data grow in size, frequency, and importance, it is becoming urgent to develop and refine the statistical tools available for their analysis. In particular, there is a need for methods that better control the type-I error as clinical RNA-seq studies are including a growing number of subjects (measurements being cheaper) resulting in larger sample sizes. We model RNA-seq counts as continuous variables using nonparametric regression to account for their inherent heteroscedasticity, in a principled, model-free, and efficient manner for detecting differentially expressed genes from RNA-seq data. Our method can identify the genes whose expression is significantly associated with one or several factors of interest in complex experimental designs, including studies with longitudinal measurement of gene expression. We rely on a powerful variance component score test that can account for both adjustment covariates and data heteroscedasticity without assuming any specific parametric distribution for the (transformed) RNA-seq counts. Despite the presence of a nonparametric component, our test statistic has a simple form and limiting distribution, which can be computed quickly. A permutation version of the test is also derived for small sample sizes, but this leads to issues in controlling the False Discovery Rate. Finally, we also propose an extension of the method for Gene Set Analysis. Applied to both simulated data and real benchmark datasets, we show that our test has good statistical properties when compared to state-of-the-art methods limma/voom, edgeR, and DESeq2. In particular, we show that those three methods can all fail to control the type-I error and the False Discovery Rate under realistic settings, while our method behaves as expected.
We apply our proposed method to two candidate vaccine phase-I studies with repeated gene expression measurements: one public dataset investigating a candidate vaccine against EBOLA, and one original dataset investigating a candidate vaccine against HIV.

Ramon Oller Piqué

Central University of Catalonia

A nonparametric test for the association between longitudinal covariates and censored survival data

Many biomedical studies focus on the association between a longitudinal measurement and a time-to-event outcome and quantify this association by means of a longitudinal-survival joint model. In this paper we propose the LLR test, a longitudinal extension of the log-rank test statistic given by Peto and Peto (1972), to provide evidence of a plausible association between a time-to-event outcome (right- or interval-censored) and a longitudinal covariate. As joint model methods are complex and hard to interpret, a preliminary test for the association between both processes, such as LLR, is wise. The statistic LLR can be expressed in the form of a weighted difference of hazards, yielding a broad class of weighted log-rank test statistics, LWLR, which allow one to assess the association between the longitudinal covariate and the survival time while stressing early, middle or late hazard differences through different weighting functions. The asymptotic distribution of LLR is derived by means of a permutation approach under the assumption that the underlying censoring process is identical for all individuals. A simulation study is conducted to evaluate the performance of the test statistics LLR and LWLR and shows that the empirical size is close to the significance level and that the power of the test depends on the association between the covariates and the survival time. Four data sets together with a toy example are used to illustrate the LLR test. Three of the data sets involve right-censored data and correspond to the European Randomized Screening for Prostate Cancer study (Serrat and others, 2015) and two well-known data sets given in the R package JM. The fourth data set explores the study Epidemiology of Diabetes Interventions and Complications (Sparling and others, 2006) which includes interval-censored data.

Jacob Fiksel

Johns Hopkins Bloomberg School of Public Health, Baltimore, USA

Optimized Survival Evaluation to Guide Bone Metastases Management: Developing an Improved Statistical Approach

In managing bone metastases, estimation of life expectancy is central for individualizing patient care given a range of radiotherapy (RT) treatment options. With access to larger volume and more complex patient data and statistical models, oncologists and statisticians must develop methods for optimal decision support. Approaches incorporating many covariates should identify complex interactions and effects while also managing missing data. In this talk, I discuss how a statistical learning approach, random survival forests (RSF), handles these challenges in building survival prediction models. I show how we applied RSF to develop a clinical model which predicts survival for patients with bone metastases using 26 predictor variables and outperforms two simpler, validated Cox regression models. I will conclude by introducing a simple bootstrap based procedure, which can be used for both simple and complex prediction models, to produce valid confidence interval estimates for model performance metrics using internal validation.

Philip Hougaard (joint with Jacob von Hjelmborg)

Lundbeck A/S

Survival of Danish twins born 1870-2000 – preliminary report

Hougaard, Harvald and Holm (JASA, 1992) used frailty models to consider the survival of same-sex Danish twins born between 1881 and 1930, with follow-up until 1980, for twins where both were alive at age 15. This presentation gives an update to that analysis. For the birth cohorts 1870-1930, same-sex twins where both were alive at age 6 are considered. For the birth cohorts 1931-2000, all twins are included. Follow-up is to 2016. Besides presenting the results, I will discuss the appropriateness of shared frailty models for studying this problem.

Xiang Zhou

Department of Biostatistics, University of Michigan

Non-parametric genetic prediction of complex traits with latent Dirichlet process regression models

There has been a growing interest in using genotype data to perform genetic prediction of complex traits. Accurate genetic prediction can facilitate genomic selection in animal and plant breeding programs, and can aid in the development of personalized medicine in humans. Because most complex traits have a polygenic architecture and are each influenced by many genetic variants with small effects, accurate genetic prediction requires the development of polygenic methods that can model all genetic variants jointly. Many recently developed polygenic methods make parametric modeling assumptions on the effect size distribution, and the methods differ in these assumptions. However, depending on how well the assumed effect size distribution matches the unknown truth, existing polygenic methods can perform well for some traits but poorly for others. To enable robust phenotype prediction performance across a range of phenotypes, we develop a novel polygenic model with a flexible assumption on the effect size distribution. We refer to our model as the latent Dirichlet Process Regression (DPR). DPR relies on the Dirichlet process to assign a prior on the effect size distribution itself, is non-parametric in nature, and is capable of inferring the effect size distribution from the data at hand. Because of the flexible modeling assumption, DPR is able to adapt to a broad spectrum of genetic architectures and achieves robust predictive performance for a variety of complex traits. We compare the predictive performance of DPR with several commonly used polygenic methods in simulations. We further illustrate the benefits of DPR by applying it to predict gene expression using cis-SNPs, to conduct a PrediXcan-based gene set test, to perform genomic selection of four traits in two species, and to predict five complex traits in a human cohort. Our method is implemented in the DPR software, freely available at www.xzlab.org/software.html.

Federico Ambrogi

Laboratory of Medical Statistics and Biometry, University of Milan

Predicting survival probabilities using pseudo-observations

Pseudo-values may provide a way to use ‘standard’ estimation procedures in survival analysis, where ‘standard’ refers to methods not specifically designed to account for censoring. In this work a generalized additive linear model is analyzed using pseudo-values to provide a smooth estimate of the survival function by using P-spline basis functions. The performance of the estimator, compared with both standard tools of survival analysis and machine learning techniques, is presented through simulations and a real example.

Klaus Groes Larsen

Lundbeck Denmark

Network Meta Analysis

Network Meta Analysis (NMA) is a statistical framework that allows for comparison of several pharmacological treatments based on results reported in clinical trials. The value of NMAs lies in that they permit the summary of the overall evidence and the ranking of different treatments in terms of efficacy and safety endpoints, combining both direct and indirect evidence. The statistical model is itself relatively simple and allows for addressing specific model assumptions such as heterogeneity and consistency (both of which will be defined and discussed). The methodology will be introduced through two examples, one concerning the efficacy and safety of SSRIs/SNRIs in the treatment of depression, and one that compares cognitive performance, as measured by the digit symbol substitution test (DSST), in patients with depression.
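The notion of indirect evidence can be illustrated with the simplest possible network, in which treatments B and C have each been compared with a common comparator A but never with each other. The sketch below uses the standard adjusted indirect comparison, which is a much simpler technique than a full NMA model. The function name and all numbers are hypothetical, and effects are assumed to be on an additive scale (for example mean differences or log odds ratios) and to come from independent trials.

```python
import math


def indirect_comparison(d_ab, se_ab, d_ac, se_ac):
    """Indirect estimate of C versus B through the common comparator A.

    d_ab: effect of B relative to A, with standard error se_ab.
    d_ac: effect of C relative to A, with standard error se_ac.
    """
    d_bc = d_ac - d_ab                          # effects subtract on the additive scale
    se_bc = math.sqrt(se_ab ** 2 + se_ac ** 2)  # variances add for independent trials
    ci = (d_bc - 1.96 * se_bc, d_bc + 1.96 * se_bc)
    return d_bc, se_bc, ci


# Hypothetical inputs: B vs A = -0.30 (SE 0.10), C vs A = -0.50 (SE 0.12).
d_bc, se_bc, ci = indirect_comparison(-0.30, 0.10, -0.50, 0.12)
print(f"C vs B: {d_bc:.2f} (SE {se_bc:.3f}), 95% CI {ci[0]:.2f} to {ci[1]:.2f}")
```

The consistency assumption mentioned in the abstract amounts to requiring that such an indirect estimate agrees, up to sampling error, with a direct B versus C comparison when one exists.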

Mireille Schnitzer

Biostatistics, Université de Montréal

Longitudinal variable selection in causal inference with collaborative targeted minimum loss-based estimation

Causal inference methods have been developed for longitudinal observational study designs where confounding is thought to occur over time. In particular, marginal structural models model the expectation of the counterfactual outcome conditional only on past treatment and possibly a set of baseline covariates. In such contexts, model covariates (potential time-varying confounders) are generally identified using domain-specific knowledge. However, this may leave an analyst with a large set of potential confounders that may hinder estimation. Previous approaches to data-adaptive variable selection in causal inference were generally limited to the single time-point setting. We develop a longitudinal extension of collaborative targeted minimum loss-based estimation (C-TMLE) for the estimation of the parameters in a marginal structural model that can be applied to perform variable selection in propensity score models. We demonstrate the properties of this estimator through a simulation study and apply the method to investigate the safety of trimester-specific exposure to inhaled corticosteroids during pregnancy in women with mild asthma.

Arvid Sjölander

Department of Medical Epidemiology and Biostatistics, Karolinska

Confounding, mediation and colliding - what types of shared covariates does the sibling comparison design control for?

The sibling comparison design is an important epidemiological tool to control for unmeasured confounding in studies of the causal effect of an exposure on an outcome. It is routinely argued that within-sibling associations are automatically controlled for all measured and unmeasured covariates that are shared (constant) within sets of siblings, such as early childhood environment and parental genetic make-up. However, an important lesson from modern causal inference theory is that not all types of covariate control are desirable. In particular, it has been argued that collider control always leads to bias, and that mediator control may or may not lead to bias, depending on the research question. In this presentation we use Directed Acyclic Graphs (DAGs) to distinguish between shared confounders, shared mediators and shared colliders, and we examine which of these shared covariates the sibling comparison design really controls for.

Sebastien Haneuse

Harvard T.H. Chan School of Public Health

Adjusting for selection bias in electronic health records-based research

Electronic health records (EHR) data provide unique opportunities for public health and medical research. From a methodological perspective, much of the focus in the literature has been on the control of confounding bias. In contrast, selection due to incomplete data is an under-appreciated source of bias in analyzing EHR data. When framed as a missing-data problem, standard methods could be applied to control for selection bias in the EHR context. In such studies, however, the process by which data are complete for any given patient likely involves the interplay of numerous clinical decisions made by patients, health care providers, and the health system. In this sense, standard methods fail to capture the complexity of the data so that residual selection bias may remain. Building on a recently-proposed framework for characterizing how data arise in EHR-based studies, sometimes referred to as the data provenance, we develop and evaluate a statistical framework for regression modeling based on inverse probability weighting that adjusts for selection bias in the complex setting of EHR-based research. We show that the resulting estimator is consistent and asymptotically Normal, and derive the form of the asymptotic variance. Plug-in estimators for the latter are proposed. We use simulations to: (i) highlight the potential for bias in EHR studies when standard approaches are used to account for selection bias, and (ii) evaluate the small-sample operating characteristics of the proposed framework. Finally, the methods are illustrated using data from an on-going, multi-site EHR-based study of bariatric surgery on BMI.
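To make the missing-data framing concrete, the sketch below shows standard inverse probability weighting with a single selection mechanism. This is the kind of baseline approach the abstract argues may be too simple for EHR data, not the proposed framework. The simulated data, the selection probabilities and all variable names are hypothetical.

```python
import random

random.seed(1)
n = 5000
records = []
for _ in range(n):
    x = 1 if random.random() < 0.4 else 0        # covariate, recorded for everyone
    y = 10.0 + 10.0 * x + random.gauss(0, 1)     # outcome of interest
    p_complete = 0.8 if x == 1 else 0.5          # completeness depends on x
    observed = random.random() < p_complete
    records.append((x, y, observed))

# Mean over everyone; unavailable in practice, shown here only as a benchmark.
full_mean = sum(y for _, y, _ in records) / n

# Complete-case analysis: over-represents patients with x == 1.
complete = [(x, y) for x, y, obs in records if obs]
cc_mean = sum(y for _, y in complete) / len(complete)

# Estimate P(complete | x) by the observed fraction within each level of x.
p_hat = {}
for level in (0, 1):
    flags = [obs for x, _, obs in records if x == level]
    p_hat[level] = sum(flags) / len(flags)

# Weight each complete case by the inverse of its estimated selection probability.
weights = [1.0 / p_hat[x] for x, _ in complete]
ipw_mean = sum(w * y for w, (_, y) in zip(weights, complete)) / sum(weights)

print(f"full={full_mean:.2f} complete-case={cc_mean:.2f} ipw={ipw_mean:.2f}")
```

Here the weighting removes the selection bias because completeness depends on a single recorded covariate. When completeness instead results from several interacting decisions, one such weight may not be enough, which motivates the framework described in the abstract.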

Anna Bellach

University of Copenhagen

Ph.D.-defence: Competing risks regression models based on pseudo risk sets

Competing risks frequently occur in medical studies, when individuals are exposed to several mutually exclusive event types. A common approach is to model the cause specific hazards. Challenges arise from the fact that the relation between the cause specific hazard and the corresponding cumulative incidence function is complex. The product limit estimator based on the cause specific hazard systematically overestimates the cumulative incidence function and estimated regression parameters are not interpretable with regard to the cumulative incidence function.
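The overestimation can be seen in a small numerical sketch that compares the naive estimate, one minus the product-limit estimate with competing events treated as censoring, with the Aalen-Johansen estimate of the cumulative incidence function. The function name and toy data are hypothetical, and the code is a plain illustration rather than the methodology developed in the thesis.

```python
def cumulative_incidence(times, causes, t):
    """Return (Aalen-Johansen CIF for cause 1, naive 1 - KM) at time t.

    causes[i]: 0 = censored, 1 = event of interest, 2 = competing event.
    """
    overall_surv = 1.0   # probability of being free of any event just before u
    naive_surv = 1.0     # product-limit estimate treating cause 2 as censoring
    cif = 0.0
    event_times = sorted({ti for ti, ci in zip(times, causes) if ci != 0 and ti <= t})
    for u in event_times:
        at_risk = sum(1 for ti in times if ti >= u)
        d1 = sum(1 for ti, ci in zip(times, causes) if ti == u and ci == 1)
        d2 = sum(1 for ti, ci in zip(times, causes) if ti == u and ci == 2)
        cif += overall_surv * d1 / at_risk       # uses survival just before u
        naive_surv *= 1.0 - d1 / at_risk
        overall_surv *= 1.0 - (d1 + d2) / at_risk
    return cif, 1.0 - naive_surv


times = [1, 2, 3, 4, 5, 6]
causes = [1, 2, 1, 2, 0, 1]
cif, naive = cumulative_incidence(times, causes, t=4)
print(f"Aalen-Johansen CIF: {cif:.3f}, naive 1 - KM: {naive:.3f}")
```

The naive estimate is larger because it implicitly assumes that subjects removed by the competing event could still experience the event of interest later.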

Direct regression modeling of the cumulative incidence function has thus become popular for analyzing such complex time-to-event data. The special feature of the Fine-Gray model is that regression parameters target the subdistribution hazard, which has a one-to-one correspondence to the cumulative incidence function. This enables the extension to a general likelihood framework that is proposed and further developed in this thesis. In particular, we establish nonparametric maximum likelihood estimation and extend it to the practically relevant setting of recurrent event data with competing terminal events and to independently left-truncated and right-censored competing risks data.

We establish asymptotic properties of the estimated parameters and propose a sandwich estimator for the variance. The solid performance of the proposed method is demonstrated in comprehensive simulation studies. To illustrate its practical utility we provide applications to a bone marrow transplant dataset, a bladder cancer dataset and to an HIV dataset from the CASCADE collaboration.

Kjetil Røysland

Institute of Basic Medical Sciences, Biostatistics, Oslo University

Causal local independence models

Causal inference has lately had a huge impact on how statistical analyses based on non-experimental data are done. The idea is to use data from a non-experimental scenario that could be subject to several spurious effects and then fit a model that would govern the frequencies we would have seen in a related hypothetical scenario where the spurious effects are eliminated. This opens up the possibility of using health registries to answer new and more ambitious questions. However, there has not been much focus on causal inference based on time-to-event data or survival analysis. The now well-established theory of causal Bayesian networks is, for instance, not suitable for handling such processes. Motivated by causal inference for event-history data from the health registries, we have introduced causal local independence models. We show that they offer a generalization of causal Bayesian networks that also enables us to carry out causal inference based on non-experimental data when continuous-time processes are involved. The main purpose of this work, in collaboration with Vanessa Didelez, is to provide new tools for determining identifiability of causal effects from event-history data that are subject to censoring. It builds on previous work on local independence graphs and delta-separation by Vanessa Didelez and previous work on causal inference for counting processes by Kjetil Røysland. We provide a new result that gives quite general graphical criteria for when causal validity of a local independence model is preserved in sub-models. If the observable variables, or processes, form a causally valid sub-model, then we can identify most relevant causal effects by re-weighting the actual observations. This is used to prove that the continuous-time marginal structural models for event history analysis, based on martingale dynamics, are valid in a much more general context than what has been known previously.

Philip Hougaard

Lundbeck and University of Southern Denmark

A personal opinion on personalized medicine

For biomarkers there is a consensus definition from 2001. However, there is no similar thing for personalized medicine. This has created some confusion. Actually, I believe that conceptually there are two contrasting viewpoints on what personalized medicine covers. Besides, there are differences on a smaller scale regarding the technical complexity of the individual information to be used in a treatment strategy. Based on a series of scenarios, I will discuss these issues. I will not end up with a formal definition but rather an informal description of the two possibilities; thus allowing for discussion. Finally, I will have some slides on the drug development program needed for progressing a personalized treatment.

Sarah Friedrich

Institute of Statistics, Ulm University

Permutation- and resampling-based inference for semi- and non-parametric effects in dependent data

We consider different resampling approaches for testing general linear hypotheses with dependent data. We distinguish between a repeated measures model, where subjects are repeatedly observed over time, and multivariate data. Furthermore, we consider semi-parametric approaches for metric data, where we test null hypotheses formulated in terms of means, as well as non-parametric rank-based models for ordinal data. In these settings, current state-of-the-art test statistics include the Wald-type statistic (WTS), which is asymptotically chi-square-distributed, and the ANOVA-type statistic (ATS), which is not an asymptotic pivot but can be approximated by an F-distribution. To improve the small sample behavior of these test statistics in the described settings, we consider different resampling schemes. In each setting, we prove the asymptotic validity of the considered approach(es), analyze the small sample behavior of the tests in simulation studies and apply the resampling approaches to data examples from the life sciences.

Pierre Joly

Biostatistics, University of Bordeaux

Pseudo-values for interval censored data

The pseudo-value approach has been developed for estimating regression models for health indicators, such as the absolute risk of developing a disease or life expectancy without disease, when data are right-censored. The penalized likelihood approach allows estimation of an illness-death model while taking into account competing risks and interval censoring of the time of illness. In this work, we propose to use pseudo-values based on estimators from an illness-death model estimated by penalized likelihood. We illustrate this approach with cohort data, with the aim of estimating the (remaining) lifetime probability of developing dementia.