Mark van der Laan

University of California, Berkeley

Targeted Learning, Super Learning, and the Highly Adaptive Lasso

We review targeted minimum loss estimation (TMLE), which provides a general template for the construction of asymptotically efficient plug-in estimators of a target estimand under realistic statistical assumptions. TMLE involves maximizing a parametric likelihood along a so-called least favourable parametric model that uses as off-set an initial estimator (e.g., ensemble super-learner) of the relevant functional of the data distribution. The asymptotic normality and efficiency of the TMLE relies on the asymptotic negligibility of a second-order term. We present a general Highly Adaptive Lasso (HAL) estimator of the data distribution and its functionals that converges at a sufficient n-1/3 regardless of the dimensionality of the data/model, under almost no additional regularity. This allows us to propose a general TMLE that is asymptotically efficient in great generality. We also discuss the appealing properties of HAL, due to HAL being an MLE over a big function class, and present various of its implications for super- learning and TMLE.

Klaus Larsen

Lundbeck

A principal stratum estimand investigating the treatment effect in patients who would comply, if treated with a specific treatment

The draft ICH E9 (R1) addendum by the International Conference on Harmonisation working group opens for the use of a principal stratum in the analysis of study data for regulatory purpose, if a relevant estimand can be justified. Inspired by the so-called complier average causal effect and work within this framework, I will propose a new estimator – Extrapolation based on propensity to comply – that estimates the treatment effect of an active treatment A relative to a comparator B (active or placebo), in the principal stratum of patients who would comply, if they were treated with treatment A. Sensitivity of the approach to the number of covariates and their ability to predict principal stratum membership will be shown based on data from a placebo-controlled study of brexpiprazole in schizophrenia. The performance of the estimator is compared with another estimator that is also based on principal stratification. A simulation study supports that the proposed estimator has a negligible bias even with a small sample size, except when the covariate predicting compliance is very weak. Not surprisingly, precision of the estimate increases substantially with stronger predictors of compliance.

Alberto Cairo

Director of the Visualization Program at the Center for Computational Science, University of Miami

How Charts Lie

We’ve all heard that a picture is worth a thousand words, but what if we don’t understand what we’re looking at?

Charts, infographics, and diagrams are ubiquitous. They are useful because they can reveal patterns and trends hidden behind the numbers we encounter in our lives. Good charts make us smarter—if we know how to read them.

However, they can also deceive us. Charts lie in a variety of ways—displaying incomplete or inaccurate data, suggesting misleading patterns, and concealing uncertainty— or are frequently misunderstood. Many of us are ill-equipped to interpret the visuals that politicians, journalists, advertisers, and even our employers present each day. This talk teaches to not only spot the lies in deceptive visuals, but also to take advantage of good ones.

Link to the slides

Alberto Cairo

Director of the Visualization Program at the Center for Computational Science, University of Miami

Visualization and Graphic Design for Scientists

When designing a data visualization, showing the data comes first. After all, the main goal of a visualization is letting the reader spot patterns and trends behind numbers. But what if the visualization we design is to be presented to a general audience? In that case we may want to think deeply about visual design elements such as typography, color, composition, and hierarchy. This talk teaches non-designers such as scientists and statisticians how to make our charts, graphs, publications, and conference posters look better.

Link to the slides

Michael Væth

Department of Biostatistics, Univeristy of Aarhus

Estimating survival benefit in a clinical trial

In a general population, a proportional change of mortality results in a change in life expectancy which, to a close approximation, is proportional to the logarithm of the change in mortality.

Using censored follow-up data, this relationship may be used to predict the difference in average remaining lifetime between two groups of individuals with approximately proportional mortality. The usefulness of the methodology in a clinical trial setting is explored using follow-up data from a clinical trial of breast cancer patients. Two methods are considered. One approach applies standardized mortality ratios with special attention to non-proportionality early in the follow-up period, the second approach uses a hazard ratio estimated in Cox regression analysis. Advantages and disadvantages of these approaches are discussed. The results are not discouraging, and the methodology seems potentially useful in a cost-effectiveness analysis of a new treatment option.

Yu Shen

Department of Biostatistics, M. D. Anderson Cancer Center, University of Texas

Estimation of Longitudinal Medical Cost Trajectory

Estimating the average monthly medical costs from disease diagnosis to a terminal event such as death for an incident cohort of patients is a topic of immense interest to researchers in health policy and health economics because patterns of average monthly costs over time reveal how medical costs vary across phases of care. The statistical challenges to estimating monthly medical costs longitudinally are multifold; the longitudinal cost trajectory (formed by plotting the average monthly costs from diagnosis to the terminal event) is likely to be nonlinear, with its shape depending on the time of the terminal event, which can be subject to right censoring. We tackle this statistically challenging topic by estimating the conditional mean cost at any month given the time of the terminal event. The longitudinal cost trajectories with diﬀerent terminal event times form a bivariate surface, under some constraint. We propose to estimate this surface using bivariate penalized splines in an Expectation-Maximization algorithm that treats the censored terminal event times as missing data. We evaluate the proposed model and estimation method in simulations and apply the method to the medical cost data of an incident cohort of stage IV breast cancer patients from the Surveillance, Epidemiology and End Results–Medicare Linked Database. This is a joint work of Li, Wu, Ning, Huang, Shih and Shen.

Halina Frydman

NYU Stern School of Business

An Ensemble Method for Interval-Censored Time-to-Event Data

Interval-censored data analysis is important in biomedical statistics for any type of time-to-event response where the time of response is not known exactly, but rather only known to occur between two assessment times. Many clinical trials and longitudinal studies generate interval-censored data; one common example occurs in medical studies that entail periodic follow-up. In this paper, we propose a survival forest method for interval-censored data based on the conditional inference framework. We describe how this framework can be adapted to the situation of interval-censored data. We show that the tuning parameters have a non-negligible effect on the survival forest performance and guidance is provided on how to tune the parameters in a data-dependent way to improve the overall performance of the method. Using Monte Carlo simulations, we show that the proposed survival forest is at least as effective as a survival tree method when the underlying model has a tree structure, performs similarly to an interval-censored Cox proportional hazards model when the true relationship is linear, and outperforms the survival tree method and Cox model when the true relationship is nonlinear. We illustrate the application of the method on a breast cancer data.

Jan Feifel

Institute of Statistics, Ulm University

Subcohorting methods for rare time-dependent exposures in time-to-event data

Antimicrobial resistance is one of the major burdens for today’s society. The challenges for researches conducting studies on the effect of those rare exposures on the hospital stay are manifold.

For large cohort studies with rare outcomes nested case-control designs are favorable due to the efficient use of limited resources. In our setting, nested case-control designs apply but do not lead to truly reduced sample sizes, because the outcome is not rare. We, therefore, study a modified nested case-control design, which samples all exposed patients but not all unexposed ones. Here, the inclusion probability of observed events evolves over time. This new scheme improves on the classical nested case-control design where for every observed event controls are chosen at random.

We will discuss several options on how to account for past time-dependent exposure status within a nested case-control design and their related merits. It will be seen that a smart utilization of the available information at each point in time can lead to a powerful and simultaneously less expensive design. We will also sketch alternative designs, e.g. treating exposure as a left-truncation event that generates matched controls, and time-simultaneous inference of the baseline hazard using the wild bootstrap. The methods will be applied to observational data on the impact of hospital-acquired pneumonia on the length-of-stay in hospital, which is an outcome commonly used to express both the impact and the costs of such adverse events.

Christina Boschini

Biostatistics, Department of Public Health, University of Copenhagen and the Cancer Society

Ph.D.-defence: Excess risk estimation in matched cohort studies

The work presented in this thesis aims at contributing to the field of statistical methodology for the analysis of excess risk in matched cohort studies. The project was initiated by the Danish Cancer Society Research Center and motivated by the desire to investigate long-term health consequences of childhood cancer survivors. During the last five decades, as a consequence of improved survival rates, the major concern of childhood cancer research shifted from survival to late effects related to childhood cancer diagnosis and treatment. In 2009, thanks to dedicated childhood cancer researchers and to the resourceful Nordic national registries, the Adult Life after Childhood Cancer in Scandinavia (ALiCCS) was established to improve knowledge about late effects of childhood cancer. This study has a matched cohort design where for each childhood cancer survivor, five healthy comparison subjects of the same sex, age and country were randomly selected. The statistical models introduced in this thesis exploit the matching structure of the data to get a representative estimate of the excess risk of late effects in childhood cancer survivors. Two are the methods described: the first models the excess risk in terms of excess hazard, while the second estimates the excess cumulative incidence function. Both approaches assume that the risk for a childhood cancer survivor is the sum of a cluster-specific background risk defined on the age time scale and an excess term defined on the time since exposure time scale. Estimates of the excess model parameters are obtained by pairwise comparisons between the cancer survivor and all the other matched comparison members in the same cluster. The contribution of the models introduced in this thesis on the public health area is presented by an application on the 5-year soft-tissue sarcoma survivor data from the ALiCCS study. By handling different features of registry data, such as multiple events, different time scales, right censoring and left truncation, this approach offers an easy tool to study how the excess risk develops in time and how it is affected by important risk factors, such as treatment.

Functions estimating the excess risk models were implemented in R and are publicly available.

Supervisors: Thomas Scheike, Klaus K. Andersen, Christian Dehlendorff and Jeanette Falck Winther

Evaluators: Thomas Alexander Gerds, Martin Bøgsted, Bjørn Møller.

Yang Zhao

Department of Biostatistics, School of Public Health, Nanjing Medical University, P.R.China

Mediation Analysis and Random Forests

In this presentation, we will introduce the possibility and practice of using random forests, an ensembled machine learning method, in causal mediation analysis. We will also discuss the advantages and potential risks of using RF-based methods in causal inference.

We would firstly describe the limitations of the traditional regression-based mediation analysis. We then briefly describe the basic procedure of random forests. We proposed a residual based method to remove confounding effects in RF analysis and introduce its applications in high dimensional genetic analysis[1]. The proposed RF-based mediation analysis framework includes three steps. First, we build a causal forest model under the counterfactual framework to model the relationship between outcome, treatment, mediators and covariates[2]. Next, we predict the mediators using traditional random forests using predictors including treatment and covariates. The average effects are then estimated using weighted methods. Possible candidates for the weights include the inverses of probabilities and variances. We performed extensive computer simulations to evaluate the performance of random forests in mediation analysis. We observed that the proposed methods can obtain accurate estimates on the direct and in-direct effects. Meanwhile, The results demonstrated that RF-based methods is more flexible than traditional regression based methods. As the RF-based method can handle non-linear relationship and high order interactions, we do not need to specify whether there is exposure-mediator interactions and their types as that in traditional regression-based methods.

Data from phase-II and III clinical trials of a novel small molecular multi-targeted cancer drug , which is already marketed in China, is used to illustrate the application of the RF-based mediation analysis. We evaluated the mediation effects of some measurements from the blood regular tests, such as platelets, on the progression and death outcome for non-small cell lung cancer patients.

Conclusions are that RF-based methods have their advantages in the mediation analysis.

Liis Starkopf

Biostatistics, Department of Public Health, University of Copenhagen

Ph.D.-defence: Statistical methods for causal inference and mediation analysis

Many clinical or epidemiological studies aim to estimate the casual effect of some exposure or intervention on some outcome. The use of causal inference helps to design statistical analyses that come as close as possible to answering the causal questions of interest. In this thesis we focus on the statistical methodology for causal inference in general and mediation analysis in particular. Specifically, we compare five existing software solutions for mediation analysis to provide practical advice for the applied researchers interested in mediation analysis. We further focus on natural effect models and propose a new estimation approach that is especially advantageous in settings where the mediator and the outcome distributions are difficult to model, but the exposure is a single binary variable. Finally, we propose a penalized g-computation estimator of marginal structural models with monotonicity constraints to estimate the counterfactual 30-day survival probability in cardiac arrest patients receiving/not receiving cardiopulmonary resuscitation (CPR) as a non-increasing function of ambulance response time.

Supervisors: Theis Lange, Thomas A. Gerds

Evaluators: Frank Eriksson, Jacob v. B. Hjelmborg, Ingeborg Waernbaum.

Benoit Liquet

Laboratory of Mathematics and their Applications, University of Pau and Pays de l’Adour

Variable Selection and Dimension Reduction methods for high dimensional and Big-Data Set

It is well established that incorporation of prior knowledge on the structure existing in the data for potential grouping of the covariates is key to more accurate prediction and improved interpretability.

In this talk, I will present new multivariate methods incorporating grouping structure in frequentist methodology for variable selection and dimension reduction to tackle the analysis of high dimensional and Big-Data set.

Morten Overgaard

Aarhus Universitet

When do pseudo-observations have the appropriate conditional expectation?

A regression approach based on substituting observed and unobserved outcome values for pseudo-observations ought to work if the pseudo-observations have the appropriate conditional expectation. The pseudo-observations under study are jack-knife pseudo-values of some estimator and are closely related to the influence function of the estimator they are based on.

In this talk, we will have a look at some examples of such influence functions and look at potential problems and solutions concerning the conditional expectation. Specifically, influence functions from inverse probability of censoring weighted estimators where the estimate of the censoring distribution is allowed to take covariates into account and influence functions of the Kaplan–Meier estimator in a delayed entry setting will be considered.

Silke Szymczak

Institut für Medizinische Informatik und Statistik, Universitätsklinikum Schleswig-Holstein

Looking into the black box of random forests

Machine learning methods and in particular random forests (RFs) are promising approaches for classification and regression based on omics data sets. I will first give a short introduction to RFs and variable selection, i.e. the identification of variables that are important for prediction. In the second part of my talk I will present some results of our current methodological work on RFs. We performed a simulation based comparison of different variable selection methods where Boruta (Kursa & Rudnicki, 2010, J Stat Softw) and Vita (Janitza et al. 2016 Adv Data Anal Classif) were consistently superior to the other approaches. Furthermore, we developed a novel method called surrogate minimal depth (SMD). It is based on the structure of the decision trees in the forest and additionally takes into account relationships between variables. In simulation studies we showed that correlation patterns can be reconstructed and that SMD is more powerful than existing variable selection methods. We are currently working on an evaluation of extensions of the RF algorithm that integrate pathway membership information into the model building process and I will show the first preliminary results.

Ditte Nørbo Sørensen

Biostatistics, UCPH

PhD defence: Causal proportional hazards estimation in the presence of an instrumental variable

Causation and correlation are two fundamentally different concepts, but too often correlation is misunderstood as causation. Based on given data, correlations are straightforward to establish, whereas the underlying causal structures that can explain a given association are hypothetically endless in their variety. The importance of the statistical discipline known as causal inference has been recognized in the past decades, and the field is still expanding. In this thesis we turn our attention to survival outcome, and how to estimate proportional hazards from which we can learn about causation. Our focus is specifically the case where an instrumental variable is present.

Lars Endahl and Henrik Ravn

Biostatistics, Novo Nordisk A/S

Estimands and missing data - two hot topics in the pharmaceutical industry

A 2012 report commissioned by the US Food and Drug Administration (FDA) on the prevention and analysis of trial results in the presence of missing data, has recently lead to significant changes in the clinical drug development. The report also introduced estimands as a new concept - a concept elaborated on in recently updated statistical guidelines for the pharmaceutical industry (the ICH-E9(R1) still in draft). The focus of the ICH-E9(R1) guideline is to discuss how intercurrent events, such as death or discontinuation of the randomised trial product can be embraced in the estimation of a treatment effect rather than just seen as a source of bias. In this talk we will outline how the estimand concept and the focus on prevention of missing data have changed the way clinical trials for new drug approvals are designed and conducted, how the data is analysed and how the results are communicated.

Boris Hejblum

Universite Bordeaux

Controlling Type-I error in RNA-seq differential analysis through a variance component score test

Gene expression measurement technology has shifted from microarrays to sequencing, producing ever richer high-througput data for transcriptomics studies. As studies using these data grow in size, frequency, and importance, it is becoming urgent to develop and refine the statistical tools available for their analysis. In particular, there is a need for methods that better control the type-I error as clinical RNA-seq studies are including a growing number of subjects (measurements being cheaper) resulting in larger sample sizes. We model RNA-seq counts as continuous variables using nonparametric regression to account for their inherent heteroscedasticity, in a principled, model-free, and efficient manner for detecting differentially expressed genes from RNA-seq data. Our method can identify the genes whose expression is significantly associated with one or several factors of interest in complex experimental designs, including studies with longitudinal measurement of gene expression. We rely on a powerful variance component score test that can account for both adjustement covariates and data heteroscedasticity without assuming any specific parametric distribution for the (transformed) RNA-seq counts. Despite the presence of a nonparametric component, our test statistic has a simple form and limiting distribution, which can be computed quickly. A permutation version of the test is also derived for small sample sizes, but this leads to issues in controlling the False Discovery Rate. Finally we also propose an extension of the method for Gene Set Analysis. Applied to both simulated data and real benchmark datasets, we show that our test has good statistical properties when compared to state-of-the-art methods limma/voom, edgeR, and DESeq2. In particular, we show that those three methods can all fail to control the type I error and the False Discovery Rate under realistic settings, while our method behaves as expected. We apply our proposed method to two candidate vaccine phase-I studies with repeated gene expression measurements: one public dataset investigating a candidate vaccine against EBOLA, and one original dataset investigating a candidate vaccine against HIV.

Ramon Oller Piqué

Central University of Catalonia

A nonparametric test for the association between longitudinal covariates and censored survival data

Many biomedical studies focus on the association between a longitudinal measurement and a time-to-event outcome and quantify this association by means of a longitudinal-survival joint model. In this paper we propose the LLR test, a longitudinal extension of the log-rank test statistic given by Peto and Peto (1972), to provide evidence of a plausible association between a time-to-event outcome (right- or interval-censored) and a longitudinal covariate. As joint model methods are complex and hard to interpret, a preliminar test for the association between both processes, such as LLR, is wise. The statistic LLR can be expressed in the form of a weighted difference of hazards, yielding to a broad class of weighted log-rank test statistics, LWLR, which allow to assess the association between the longitudinal covariate and the survival time stressing earlier, middle or late hazard differences through different weighting functions. The asymptotic distribution of LLR is derived by means of a permutation approach under the assumption that the underlying censoring process is identical for all individuals. A simulation study is conducted to evaluate the performance of the test statistics LLR and LWLR and shows that the empirical size is close to the significance level and that the power of the test depends on the association between the covariates and the survival time. Four data sets together with a toy example are used to illustrate the LLR test. Three of the data sets involve right-censored data and correspond to the European Randomized Screening for Prostate Cancer study (Serrat and others, 2015) and two well-known data sets given in the R package JM. The fourth data set explores the study Epidemiology of Diabetes Interventions and Complications (Sparling and others, 2006) which includes interval-censored data.

Jacob Fiksel

Johns Hopkins Bloomberg School of Public Health, Baltimore, USA

Optimized Survival Evaluation to Guide Bone Metastases Management: Developing an Improved Statistical Approach

In managing bone metastases, estimation of life expectancy is central for individualizing patient care given a range of radiotherapy (RT) treatment options. With access to larger volume and more complex patient data and statistical models, oncologists and statisticians must develop methods for optimal decision support. Approaches incorporating many covariates should identify complex interactions and effects while also managing missing data. In this talk, I discuss how a statistical learning approach, random survival forests (RSF), handles these challenges in building survival prediction models. I show how we applied RSF to develop a clinical model which predicts survival for patients with bone metastases using 26 predictor variables and outperforms two simpler, validated Cox regression models. I will conclude by introducing a simple bootstrap based procedure, which can be used for both simple and complex prediction models, to produce valid confidence interval estimates for model performance metrics using internal validation.

Philip Hougaard (joint with Jacob von Hjelmborg)

Lundbeck A/S

Survival of Danish twins born 1870-2000 – preliminary report

Hougaard, Harvald and Holm (JASA, 1992) used frailty models to consider the survival of same-sex Danish twins born between 1881-1930 with follow-up until 1980 for twins where both were alive at age 15. This presentation gives an update to that analysis. For the birth cohorts 1870-1930, same-sex twins, where both were alive at age 6, are considered. For the birth cohorts 1931-2000, all twins are included. Follow-up is to 2016. Besides presenting the results, I will discuss the appropriateness of shared frailty models for studying this problem.

Xiang Zhou

Department of Biostatistics, University of Michigan

Non-parametric genetic prediction of complex traits with latent Dirichlet process regression models

There has been a growing interest in using genotype data to perform genetic prediction of complex traits. Accurate genetic prediction can facilitate genomic selection in animal and plant breeding programs, and can aid in the development of personalized medicine in humans. Because most complex traits have a polygenic architecture and are each influenced by many genetic variants with small effects, accurate genetic prediction requires the development of polygenic methods that can model all genetic variants jointly. Many recently developed polygenic methods make parametric modeling assumptions on the effect size distribution and different polygenic methods differ in such effect size assumption. However, depending on how well the effect size distribution assumption matches the unknown truth, existing polygenic methods can perform well for some traits but poorly for others. To enable robust phenotype prediction performance across a range of phenotypes, we develop a novel polygenic model with a flexible assumption on the effect size distribution. We refer to our model as the latent Dirichlet Process Regression (DPR). DPR relies on the Dirichlet process to assign a prior on the effect size distribution itself, is non-parametric in nature, and is capable of inferring the effect size distribution from the data at hand. Because of the flexible modeling assumption, DPR is able to adapt to a broad spectrum of genetic architectures and achieves robust predictive performance for a variety of complex traits. We compare the predictive performance of DPR with several commonly used polygenic methods in simulations. We further illustrate the benefits of DPR by applying it to predict gene expressions using cis-SNPs, to conduct PrediXcan based gene set test, to perform genomic selection of four traits in two species, and to predict five complex traits in a human cohort. Our method is implemented in the DPR software, freely available at www.xzlab.org/software.html.

Federico Ambrogi

Laboratory of Medical Statistics and Biometry, University of Milan

Predicting survival probabilities using pseudo-observations

Pseudovalues may provide a way to use ‘standard’ estimation procedures in survival analysis, where ‘standard’ refer to methods not specifically designed for accounting of censoring. In this work a generalized additive linear model is analyzed using pseudo-values to provide a smooth estimate of the survival function by using P-spline basis functions. The performances of the estimator compared to both standard tools of survival analysis and machine learning techniques are presented through simulations and a real example.

Klaus Groes Larsen

Lundbeck Denmark

Network Meta Analysis

Network Meta Analysis (NMA) is a statistical framework that allows for comparison of several pharmacological treatments based on results reported in clinical trials. The value of NMAs lies in that they permit the summary of the overall evidence and ranking of different treatment in terms of efficacy and safety endpoints combining both direct and indirect evidence. The statistical model is itself relatively simple and allows for addressing specific model assumptions such as heterogeneity and consistency (both of which will be defined and discussed). The methodology will be introduced through two examples, one concerning the efficacy and safety of SSRIs/SNRIs in the treatment of Depression, and one that compares the cognitive performance as measured by the digit-symbol-substitution test DSST in patients with Depression

Mireille Schnitzer

Biostatistics, Université de Montréal

Longitudinal variable selection in causal inference with collaborative targeted minimum loss-based estimation

Causal inference methods have been developed for longitudinal observational study designs where confounding is thought to occur over time. In particular, marginal structural models model the expectation of the counterfactual outcome conditional only on past treatment and possibly a set of baseline covariates. In such contexts, model covariates (potential time-varying confounders) are generally identified using domain-specific knowledge. However, this may leave an analyst with a large set of potential confounders that may hinder estimation. Previous approaches to data-adaptive variable selection in causal inference were generally limited to the single time-point setting. We develop a longitudinal extension of collaborative targeted minimum loss-based estimation (C-TMLE) for the estimation of the parameters in a marginal structural model that can be applied to perform variable selection in propensity score models. We demonstrate the properties of this estimator through a simulation study and apply the method to investigate the safety of trimester-specific exposure to inhaled corticosteroids during pregnancy in women with mild asthma.

Arvid Sjölander

Department of Medical Epidemiology and Biostatistics, Karolinska

Confounding, mediation and colliding - what types of shared covariates does the sibling comparison design control for?

The sibling comparison design is an important epidemiological tool to control for unmeasured confounding, in studies of the causal effect of an exposure on an outcome. It is routinely argued that within-sibling associations are automatically controlled for all measured and unmeasured covariates that are shared (constant) within sets of siblings, such as early childhood environment and parental genetic make-up. However, an important lesson from modern causal inference theory is that not all types of covariate control are desirable. In particular, it has been argued that collider control always lead to bias, and that mediator control may or may not lead to bias, depending on the research question. In this presentation we use Directed Acyclic Graphs (DAGs) to distinguish between shared confounders, shared mediators and shared colliders, and we examine which of these shared covariates the sibling comparison design really controls for.

Sebastien Haneuse

Harvard T.H. Chan School of Public Health

Adjusting for selection bias in electronic health records-based research

Electronic health records (EHR) data provide unique opportunities for public health and medical research. From a methodological perspective, much of the focus in the literature has been on the control of confounding bias. In contrast, selection due to incomplete data is an under-appreciated source of bias in analyzing EHR data. When framed as a missing-data problem, standard methods could be applied to control for selection bias in the EHR context. In such studies, however, the process by which data are complete for any given patient likely involves the interplay of numerous clinical decisions made by patients, health care providers, and the health system. In this sense, standard methods fail to capture the complexity of the data so that residual selection bias may remain. Building on a recently-proposed framework for characterizing how data arise in EHR-based studies, sometimes referred to as the data provenance, we develop and evaluate a statistical framework for regression modeling based on inverse probability weighting that adjusts for selection bias in the complex setting of EHR-based research. We show that the resulting estimator is consistent and asymptotically Normal, and derive the form of the asymptotic variance. Plug-in estimators for the latter are proposed. We use simulations to: (i) highlight the potential for bias in EHR studies when standard approaches are used to account for selection bias, and (ii) evaluate the small-sample operating characteristics of the proposed framework. Finally, the methods are illustrated using data from an on-going, multi-site EHR-based study of bariatric surgery on BMI.

Anna Bellach

University of Copenhagen

Ph.D.-defence: Competing risks regression models based on pseudo risk sets

Competing risks frequently occur in medical studies, when individuals are exposed to several mutually exclusive event types. A common approach is to model the cause specific hazards. Challenges arise from the fact that the relation between the cause specific hazard and the corresponding cumulative incidence function is complex. The product limit estimator based on the cause specific hazard systematically overestimates the cumulative incidence function and estimated regression parameters are not interpretable with regard to the cumulative incidence function.

Direct regression modeling of the cumulative incidence function has thus become popular for analyzing such complex time to event data. The special feature of the Fine-Gray model is that regression parameters target the subdistribution hazard, which has a one-to-one correspondence to the cumulative incidence function. This enables the extension to a general likelihood framework that is proposed and further developed in this thesis. In particular we establish a nonparametric maximum likelihood estimation and its extension to the practical relevant setting of recurrent event data with competing terminal events and to independently left-truncated and right-censored competing risks data.

We establish asymptotic properties of the estimated parameters and propose a sandwich estimator for the variance. The solid performance of the proposed method is demonstrated in comprehensive simulation studies. To illustrate its practical utility we provide applications to a bone marrow transplant dataset, a bladder cancer dataset and to an HIV dataset from the CASCADE collaboration.

Kjetil Røysland

Institute of Basic Medical Sciences, Biostatistics, Oslo University

Causal local independence models

Causal inference has lately had a huge impact on how statistical analyses based on non-experimental data are done. The idea is to use data from a non-experimental scenario that could be subject to several spurious effects and then fit a model that would govern the frequencies we would have seen in a related hypothetical scenario where the spurious effects are eliminated.This opens up for using health registries to answer new and more ambitious questions. However, there has not been so much focus on causal inference based time-to-event data or survival analysis. The now well established theory of causal Bayesian networks is for instance not suitable for handling such processes. Motivated by causal inference event-history data from the health registries, we have introduced causal local independence models. We show that they offer a generalization of causal Bayesian networks that also enables us to carry out causal inference based on non-experimental data when there is continuous-time processes involved. The main purpose of this work in collaboration with Vanessa Didelez, is to provide new tools for determining identifiability of causal effects of event history data that is subject to censoring. It builds on previous work on local independence graphs and delta-separation by Vanessa Didelez and previous work on causal inference for counting processes by Kjetil Røysland. We provide a new result that gives quite general graphical criteria for when causal validity of a local independence model is preserved in sub-models. If the observable variables, or processes, form a causally valid sub-model, then we can identify most relevant causal effects by re-weighting the actual observations. This is used to prove that the continuous time marginal structural models for event history analysis, based on martingale dynamics, are valid in a much more general context than what has been known previously.

Philip Hougaard

Lundbeck and University of Southern Denmark

A personal opinion on personalized medicine

For biomarkers there is a consensus definition from 2001. However, there is no similar thing for personalized medicine. This has created some confusion. Actually, I believe that conceptually there are two contrasting viewpoints on what personalized medicine covers. Besides, there are differences on a smaller scale regarding the technical complexity of the individual information to be used in a treatment strategy. Based on a series of scenarios, I will discuss these issues. I will not end up with a formal definition but rather an informal description of the two possibilities; thus allowing for discussion. Finally, I will have some slides on the drug development program needed for progressing a personalized treatment.

Sarah Friedrich

Institute of Statistics, Ulm University

Permutation- and resampling-based inference for semi- and non-parametric effects in dependent data

We consider different resampling approaches for testing general linear hypothesis with dependent data. We distinguish between a repeated measures model, where subjects are repeatedly observed over time, and multivariate data. Furthermore, we consider semi-parametric approaches for metric data, where we test null hypotheses formulated in terms of means, as well as non-parametric rank-based models for ordinal data. In these settings, current state-of-the-art test statistics include the Wald-type statistic (WTS), which is asymptotically chi-square-distributed, and the ANOVA-type statistic (ATS), which is no asymptotic pivot, but can be approximated by an F-distribution. To improve the small sample behavior of these test statistics in the described settings, we consider different resampling schemes. In each setting, we prove the asymptotic validity of the considered approach(es), analyze the small sample behavior of the tests in simulation studies and apply the resampling approaches to data examples from the life sciences.

Pierre Joly

Biostatistics, University Bourdeaux

Pseudo-values for interval censored data

The pseudo value approach has been developed for estimating regression models for health indicators like absolute risk to develop a disease or life expectancy without disease when data are right censored. The Penalized likelihood approach allows estimating an Illness-death model taking into account competing risks and interval censoring of the time of illness. In this work, we propose to use a pseudo value with estimators from an illness death model estimated by penalized likelihood. We illustrate this approach with cohort data with the aim to estimate the (remaining) lifetime probabilities to develop dementia.