Introduction to Bayesian data analysis - Advanced Statistical Topics B

Benjamin Skov Kaas-Hansen

-1. Preamble

Kolmogorov and probability

Let \((\Omega, F, P)\) be a measure space. If \(P(E)\) is the probability of an event \(E\), then \((\Omega, F, P)\) is a probability space with sample space \(\Omega\), event space \(F\) and probability measure \(P\). Three axioms:


The probability of an event is a non-negative real number: \(P(E) \in \mathbb{R} \text{ and } P(E) \geq 0 \quad \forall E \in F\)


At least one of the events in the entire set \(\Omega\) will occur: \(P(\Omega) = 1\)


Any countable sequence of disjoint (= mutually exclusive) sets \(E_1, E_2, ..., E_\infty\) satisfies: \(P \left( \bigcup_{i=1}^\infty E_i \right) = \sum_{i=1}^\infty P(E_i)\)
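A minimal sketch of the three axioms, using a fair six-sided die as the sample space (the helper `p` is illustrative, not part of any library):

```r
# Fair six-sided die: Omega = {1, ..., 6}; events are subsets of Omega
omega <- 1:6
p <- function(E) length(intersect(E, omega)) / length(omega)

p(c(1, 2))        # axiom 1: always a non-negative real number
p(omega)          # axiom 2: P(Omega) = 1
p(c(1, 2)) + p(5) # axiom 3: equals p(c(1, 2, 5)), since the events are disjoint
```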

Wiener process

Flip a coin. Tail: 1 up. Head: 1 down. Where do we end?
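The coin-flip walk is a simple random walk, the discrete-time cousin of the Wiener process; a quick simulation (seed chosen arbitrarily):

```r
set.seed(4131)
steps <- sample(c(1, -1), 1000, replace = TRUE)  # tail: 1 up, head: 1 down
walk <- cumsum(steps)                            # position after each flip
plot(walk, type = "l", xlab = "Flip", ylab = "Position")
```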

Aye, we’re good and lost now

Good — then let’s start over

Probability

A number between 0 and 1 which expresses my (our?) statement about uncertainty/certainty

1 is complete certainty that something is the case

0 is complete certainty that something is not the case

Subjective measure. Can we ever be completely certain about something in the real world?

Today

Very pragmatic and applied

Get real experience with Bayesian analysis

We won’t get lost in technical details

Off-the-shelf software: as easy as other statistical models

Study design and data collection ≠ data analysis

Learning objectives

Don’t be afraid of Bayesian analyses

Describe the applications of Bayesian analysis

Express prior belief with probability distributions

Fit Bayesian GLMs

Complete the five steps of BDA

Follow the guidelines for reporting Bayesian analyses

Programme

Follows the 5 steps of Bayesian analyses

  • Design phase
    1. Characterise data

    2. Define appropriate data model

    3. Specify priors

  • Analysis
    1. Update parameters in light of new data

    2. Diagnostics (convergence and posterior predictive checks)

Frequent breaks. You present the exercises to each other.

Me

Medical doctor with MSc and PhD

Postdoc in intensive care

Bayesian platform trial (methods and data science)

Mainly real-world evidence (routine-care and PRO data)

0. Probability and Bayesianism

Decision problems

“The Bayesian approach is tailored to decision making.

Designing a clinical trial is a decision problem.

Drawing a conclusion from a trial (…) is a decision problem.

Allocating resources among various research projects is a decision problem.

Stopping drug development is a decision problem.”

A formalism for human intuition

\(P(\theta | D) \propto P(D | \theta) \times P(\theta)\)

posterior \(\propto\) likelihood \(\times\) prior

Start with a belief about something

Observe the world to learn something new

Update your belief in light of these new observations

We make probability statements about model parameters

Derived quantities follow directly (as we’ll see)
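The update can be made concrete with a grid approximation, here using the ICU example that appears later (8 women among 22 patients; the uniform prior is an assumption for illustration):

```r
theta <- seq(0, 1, by = 0.001)           # grid of candidate proportions
prior <- dbeta(theta, 1, 1)              # start with a belief (here: uniform)
likelihood <- dbinom(8, 22, theta)       # observe the world: 8 women of 22
posterior <- prior * likelihood          # update the belief...
posterior <- posterior / sum(posterior)  # ...and normalise to sum to 1
theta[which.max(posterior)]              # posterior mode, near 8/22
```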

Scientific questions

What is the average weight of patients?

What is the effect of this treatment? Is it better than placebo?

Will our capacity be exhausted tomorrow?

What’s Bayesian analysis good for?

Formalism for human intuition

Priors express pre-existing knowledge (or expectations)

Easier to build complex models

Use priors to regularise estimation

Modelling uncertainty enables decision analysis

Bayesian barriers

  • Seen as complicated and “mathy”

  • Erroneous idea that Bayesian analyses are subjective (and, thus, unreliable) and frequentist are objective

  • It’s a bit more tricky than conventional analyses

  • Devising reasonable priors not trivial

  • You force yourself to formalise existing knowledge or lack thereof

  • Frequentist methods the “go-to” in health research

Example

The proportion of women among patients at a 22-bed ICU on a given day? Prior: Equal distribution between women and men. Data: 8 women. The next day?

What might be wrong with this conclusion? What do we assume?

Probability distributions

Credible intervals

Interpret this confidence interval (CI):

Repeat after me: it’s as simple as this.

On the CDF we can easily read these credible intervals. Below, the 95% highest-density interval (top, from HPDinterval) and the equal-tailed interval (bottom, from quantile):

          lower     upper
var1 0.01366318 0.4294567
attr(,"Probability")
[1] 0.95
      2.5%      97.5% 
0.02839478 0.46748520 
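Equal-tailed credible intervals come straight from posterior quantiles; a sketch with stand-in Beta draws (parameters arbitrary):

```r
set.seed(4131)
draws <- rbeta(10000, 2, 8)         # stand-in posterior samples
quantile(draws, c(0.025, 0.975))    # read the CDF at 2.5% and 97.5%
```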

Break?

5 steps in a Bayesian analysis

1. Characterise data (outcome)

  • Numeric (interval, ratio, discrete) - Normal, log-Normal, Poisson
  • Binary - Binomial
  • Time to event - survival analysis
  • Categorical or ordinal - Multinomial
  • Other type (e.g. per 1,000 patients or per day) - offset Poisson
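As a reminder of the last item, a rate outcome with an exposure offset fits naturally in a Poisson model; a sketch with made-up data (variable names are illustrative):

```r
# Events per person-days of exposure; offset(log(days)) turns counts into rates
d <- data.frame(events = c(2, 5, 3), days = c(10, 40, 20), arm = c(0, 1, 1))
fit <- glm(events ~ arm + offset(log(days)), family = poisson, data = d)
coef(fit)["arm"]  # log rate ratio between the arms
```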

2. Appropriate data model

Example

  • RCT with 2 arms

  • We’re interested in the probability of death (= mortality risk) \(\theta\) in each arm

  • Count outcome => Binomial likelihood

  • Outcomes across patients assumed independent

             ECMO                                       Control
Deaths       \(z_\text{ecmo}\)                          \(z_\text{ctl}\)
Patients     \(N_\text{ecmo}\)                          \(N_\text{ctl}\)
Risk         \(\frac{z_\text{ecmo}}{N_\text{ecmo}}\)    \(\frac{z_\text{ctl}}{N_\text{ctl}}\)

Estimation

  • Conjugate analysis
    • Easy and simple, inadequate for any real analysis
  • Sampling
    • Simple: naive and inefficient (but correct)
    • Markov chain Monte Carlo (MCMC): efficient and now simple to use

The binomial model

  • \(N\) independent trials

    • Called Bernoulli model if \(N = 1\)
  • Two possible outcomes: success (1) and failure (0)

  • \(z\) is the number of successes observed

  • Each trial has the same probability of success = \(\theta\)

    • The parameter to be estimated

Conjugate analysis

  • The posterior distribution is the same type as the prior

  • Over-simplistic for any realistic analysis

  • Prior: \(\theta_\text{prior} \sim\) Beta(\(\alpha\), \(\beta\))

    • Think of \(\alpha\) and \(\beta\) as the numbers with and without the event
  • Likelihood: \(z \sim\) Binomial(\(N\), \(\theta_\text{prior}\))

    • \(z\) and \(N\) are observed data points
  • Posterior: \(\theta_\text{posterior} \sim\) Beta(\(\alpha + z\), \(\beta + N - z\))
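With the ICU example (8 women of 22 patients) and a uniform Beta(1, 1) prior, the conjugate update is one line of arithmetic:

```r
alpha <- 1; beta <- 1  # Beta(1, 1) = uniform prior
z <- 8; N <- 22        # observed data
post_alpha <- alpha + z      # 9
post_beta  <- beta + N - z   # 15
qbeta(c(0.025, 0.5, 0.975), post_alpha, post_beta)  # median and 95% CrI
```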

Simple sampling

Naive (but correct) approach for Binomial likelihood:

  1. Assume prior distribution of \(\theta\) (any will do)

  2. Draw one \(\theta'\) from the prior distribution

  3. Draw a sample outcome \(z'\) given \(\theta'\)

  4. Keep \(\theta'\) if \(z'\) is the same as the observed \(z\)

  5. Repeat 2. through 4. a large number of times

Simple sampling (cont.)

library(coda)    # HPDinterval, as.mcmc
library(ggplot2)

n_samples <- 1000
z_observed <- 8 # no. women
N_observed <- 22 # no. beds

theta_prime <- runif(n_samples, 0, 1)
z_prime <- rbinom(n_samples, N_observed, theta_prime)
posterior_samples <- theta_prime[z_prime == z_observed]

median_95CrI <- quantile(posterior_samples, probs = c(0.025, 0.5, 0.975))
(HPDinterval(as.mcmc(posterior_samples)))
         lower     upper
var1 0.1739939 0.4917233
attr(,"Probability")
[1] 0.9473684
p <- ggplot() + # shown in next tab
  geom_histogram(aes(x = posterior_samples), bins = 20) +
  geom_vline(aes(xintercept = median_95CrI), linetype = 2) +
  geom_label(aes(y = 0, x = median_95CrI, label = round(median_95CrI, 3))) +
  labs(x = "Proportion", y = "Frequency")
p

Break and exercises

3. Devising priors

Sources

  • Non-informative priors don’t exist

  • Even entirely uniform priors carry information about your expectations

  • Uniform priors are implicit in frequentist results

  • Published results (RCTs, observational studies)

  • Domain experts

  • Logical conclusions (RR > 10 very unlikely)

  • Regularisation of estimates (e.g. lasso or ridge regression)

Example: ARDS, ECMO and mortality

  • Let’s design a new trial
  • Effect of ECMO on 60-day mortality in ARDS patients
  • 2 arms: ECMO vs. no ECMO
  • 10-100 patients in each arm
  • How many survive in each arm?
  • Ask 5 colleagues with lots of domain knowledge in this area:
z_ecmo <- c(20, 10, 50, 20, 5) # no. deaths in ECMO arm
n_ecmo <- c(100, 100, 100, 50, 10) # no. patients in ECMO arm
z_ctl <- c(40, 20, 50, 30, 4) # no. deaths in control arm
n_ctl <- c(100, 100, 100, 50, 9) # no. patients in control arm

Proportions

Beta distribution obvious choice: flexible and easy to intuit

Task: We expect a 30% proportion in a certain group, with the lower bound of the 95% CrI at 10%, and would like to devise the corresponding Beta prior.

find_beta_params(theta = 0.3, boundary_target = 0.1, n_dec = 0)
  alpha beta       p2.5     p50.0     p97.5
1     4   10 0.09092039 0.2752758 0.5381315
find_beta_params(theta = 0.3, boundary_target = 0.1, n_dec = 3)
  alpha  beta      p2.5     p50.0     p97.5
1 4.204 9.809 0.1000033 0.2902668 0.5537915
ggplot() +
  stat_function(fun = ~ dbeta(., 4, 10))

ggplot() +
  stat_function(fun = ~ dbeta(., 4.204, 9.809))
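find_beta_params appears to be a course helper; a hypothetical sketch of the idea, matching the prior mean to theta and the 2.5% quantile to boundary_target by numerical optimisation (function and argument names are my assumptions):

```r
find_beta_sketch <- function(theta, boundary_target) {
  loss <- function(par) {
    a <- exp(par[1]); b <- exp(par[2])  # exp() keeps parameters positive
    (a / (a + b) - theta)^2 + (qbeta(0.025, a, b) - boundary_target)^2
  }
  exp(optim(c(0, 0), loss)$par)  # c(alpha, beta)
}
find_beta_sketch(0.3, 0.1)  # close to alpha = 4.204, beta = 9.809 above
```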

Coefficients in logistic regression

Recall that coefficients in logistic regressions are on the log-OR scale.

So how do we express meaningful and realistic priors for such coefficients?

For a 2x2 table we have that the approximate SD of the log-OR is given by,

\(\qquad \sigma = \sqrt{\frac{1}{z_0} + \frac{1}{N_0 - z_0} + \frac{1}{z_1} + \frac{1}{N_1 - z_1}}\)

If we assume the same event proportion \(r\) in both arms and a total of \(N = N_0 + N_1\) patients, we see that,

\(\qquad z_0 = z_1 = \tfrac{1}{2} N r, \quad N_0 - z_0 = N_1 - z_1 = \tfrac{1}{2} N (1-r)\)

Rearranging yields,

\(\qquad \sigma = \sqrt{\frac{4}{N (1-r)} + \frac{4}{Nr}}\)

making it easy to specify a standard deviation \(\sigma\) for a prior to express no difference in a trial with a certain event proportion and of a certain size \(N\)
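The final expression is easy to wrap in a small helper (the function name is mine):

```r
# SD of the log-OR under equal arms: sqrt(4 / (N * (1 - r)) + 4 / (N * r))
log_or_sd <- function(N, r) sqrt(4 / (N * (1 - r)) + 4 / (N * r))
log_or_sd(200, 0.3)  # e.g. 200 patients, 30% events: ~0.31
```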

Break and exercises

4. Update parameters in light of data

MCMC sampling with rstanarm

head(trees)
  Girth Height Volume
1   8.3     70   10.3
2   8.6     65   10.3
3   8.8     63   10.2
4  10.5     72   16.4
5  10.7     81   18.8
6  10.8     83   19.7
f1 <- stan_glm(
  Volume ~ Height,
  data = trees,
  chains = 4,
  cores = 1,
  seed = 4131,
  iter = 4000
)

SAMPLING FOR MODEL 'continuous' NOW (CHAIN 1).
Chain 1: 
Chain 1: Gradient evaluation took 0.000146 seconds
Chain 1: 1000 transitions using 10 leapfrog steps per transition would take 1.46 seconds.
Chain 1: Adjust your expectations accordingly!
Chain 1: 
Chain 1: 
Chain 1: Iteration:    1 / 4000 [  0%]  (Warmup)
Chain 1: Iteration:  400 / 4000 [ 10%]  (Warmup)
Chain 1: Iteration:  800 / 4000 [ 20%]  (Warmup)
Chain 1: Iteration: 1200 / 4000 [ 30%]  (Warmup)
Chain 1: Iteration: 1600 / 4000 [ 40%]  (Warmup)
Chain 1: Iteration: 2000 / 4000 [ 50%]  (Warmup)
Chain 1: Iteration: 2001 / 4000 [ 50%]  (Sampling)
Chain 1: Iteration: 2400 / 4000 [ 60%]  (Sampling)
Chain 1: Iteration: 2800 / 4000 [ 70%]  (Sampling)
Chain 1: Iteration: 3200 / 4000 [ 80%]  (Sampling)
Chain 1: Iteration: 3600 / 4000 [ 90%]  (Sampling)
Chain 1: Iteration: 4000 / 4000 [100%]  (Sampling)
Chain 1: 
Chain 1:  Elapsed Time: 0.187577 seconds (Warm-up)
Chain 1:                0.138795 seconds (Sampling)
Chain 1:                0.326372 seconds (Total)
Chain 1: 

SAMPLING FOR MODEL 'continuous' NOW (CHAIN 2).
Chain 2: 
Chain 2: Gradient evaluation took 1.9e-05 seconds
Chain 2: 1000 transitions using 10 leapfrog steps per transition would take 0.19 seconds.
Chain 2: Adjust your expectations accordingly!
Chain 2: 
Chain 2: 
Chain 2: Iteration:    1 / 4000 [  0%]  (Warmup)
Chain 2: Iteration:  400 / 4000 [ 10%]  (Warmup)
Chain 2: Iteration:  800 / 4000 [ 20%]  (Warmup)
Chain 2: Iteration: 1200 / 4000 [ 30%]  (Warmup)
Chain 2: Iteration: 1600 / 4000 [ 40%]  (Warmup)
Chain 2: Iteration: 2000 / 4000 [ 50%]  (Warmup)
Chain 2: Iteration: 2001 / 4000 [ 50%]  (Sampling)
Chain 2: Iteration: 2400 / 4000 [ 60%]  (Sampling)
Chain 2: Iteration: 2800 / 4000 [ 70%]  (Sampling)
Chain 2: Iteration: 3200 / 4000 [ 80%]  (Sampling)
Chain 2: Iteration: 3600 / 4000 [ 90%]  (Sampling)
Chain 2: Iteration: 4000 / 4000 [100%]  (Sampling)
Chain 2: 
Chain 2:  Elapsed Time: 0.119801 seconds (Warm-up)
Chain 2:                0.107132 seconds (Sampling)
Chain 2:                0.226933 seconds (Total)
Chain 2: 

SAMPLING FOR MODEL 'continuous' NOW (CHAIN 3).
Chain 3: 
Chain 3: Gradient evaluation took 1.9e-05 seconds
Chain 3: 1000 transitions using 10 leapfrog steps per transition would take 0.19 seconds.
Chain 3: Adjust your expectations accordingly!
Chain 3: 
Chain 3: 
Chain 3: Iteration:    1 / 4000 [  0%]  (Warmup)
Chain 3: Iteration:  400 / 4000 [ 10%]  (Warmup)
Chain 3: Iteration:  800 / 4000 [ 20%]  (Warmup)
Chain 3: Iteration: 1200 / 4000 [ 30%]  (Warmup)
Chain 3: Iteration: 1600 / 4000 [ 40%]  (Warmup)
Chain 3: Iteration: 2000 / 4000 [ 50%]  (Warmup)
Chain 3: Iteration: 2001 / 4000 [ 50%]  (Sampling)
Chain 3: Iteration: 2400 / 4000 [ 60%]  (Sampling)
Chain 3: Iteration: 2800 / 4000 [ 70%]  (Sampling)
Chain 3: Iteration: 3200 / 4000 [ 80%]  (Sampling)
Chain 3: Iteration: 3600 / 4000 [ 90%]  (Sampling)
Chain 3: Iteration: 4000 / 4000 [100%]  (Sampling)
Chain 3: 
Chain 3:  Elapsed Time: 0.137662 seconds (Warm-up)
Chain 3:                0.109331 seconds (Sampling)
Chain 3:                0.246993 seconds (Total)
Chain 3: 

SAMPLING FOR MODEL 'continuous' NOW (CHAIN 4).
Chain 4: 
Chain 4: Gradient evaluation took 1.4e-05 seconds
Chain 4: 1000 transitions using 10 leapfrog steps per transition would take 0.14 seconds.
Chain 4: Adjust your expectations accordingly!
Chain 4: 
Chain 4: 
Chain 4: Iteration:    1 / 4000 [  0%]  (Warmup)
Chain 4: Iteration:  400 / 4000 [ 10%]  (Warmup)
Chain 4: Iteration:  800 / 4000 [ 20%]  (Warmup)
Chain 4: Iteration: 1200 / 4000 [ 30%]  (Warmup)
Chain 4: Iteration: 1600 / 4000 [ 40%]  (Warmup)
Chain 4: Iteration: 2000 / 4000 [ 50%]  (Warmup)
Chain 4: Iteration: 2001 / 4000 [ 50%]  (Sampling)
Chain 4: Iteration: 2400 / 4000 [ 60%]  (Sampling)
Chain 4: Iteration: 2800 / 4000 [ 70%]  (Sampling)
Chain 4: Iteration: 3200 / 4000 [ 80%]  (Sampling)
Chain 4: Iteration: 3600 / 4000 [ 90%]  (Sampling)
Chain 4: Iteration: 4000 / 4000 [100%]  (Sampling)
Chain 4: 
Chain 4:  Elapsed Time: 0.107725 seconds (Warm-up)
Chain 4:                0.098684 seconds (Sampling)
Chain 4:                0.206409 seconds (Total)
Chain 4: 

Default weakly informative priors: normal with scale 2.5 (autoscaled by rstanarm) for the intercept and the coefficients; see the prior_summary() output.

prior_summary(f1)
Priors for model 'f1' 
------
Intercept (after predictors centered)
  Specified prior:
    ~ normal(location = 30, scale = 2.5)
  Adjusted prior:
    ~ normal(location = 30, scale = 41)

Coefficients
  Specified prior:
    ~ normal(location = 0, scale = 2.5)
  Adjusted prior:
    ~ normal(location = 0, scale = 6.4)

Auxiliary (sigma)
  Specified prior:
    ~ exponential(rate = 1)
  Adjusted prior:
    ~ exponential(rate = 0.061)
------
See help('prior_summary.stanreg') for more details
summary(f1, probs = c(0.025, 0.5, 0.975))

Model Info:
 function:     stan_glm
 family:       gaussian [identity]
 formula:      Volume ~ Height
 algorithm:    sampling
 sample:       8000 (posterior sample size)
 priors:       see help('prior_summary')
 observations: 31
 predictors:   2

Estimates:
              mean   sd     2.5%   50%    97.5%
(Intercept)  -86.4   30.2 -146.8  -86.7  -25.7 
Height         1.5    0.4    0.7    1.5    2.3 
sigma         13.8    1.8   10.8   13.6   18.0 

Fit Diagnostics:
           mean   sd   2.5%   50%   97.5%
mean_PPD 30.1    3.5 23.2   30.1  37.0   

The mean_ppd is the sample average posterior predictive distribution of the outcome variable (for details see help('summary.stanreg')).

MCMC diagnostics
              mcse Rhat n_eff
(Intercept)   0.4  1.0  7236 
Height        0.0  1.0  7357 
sigma         0.0  1.0  5966 
mean_PPD      0.0  1.0  7395 
log-posterior 0.0  1.0  3235 

For each parameter, mcse is Monte Carlo standard error, n_eff is a crude measure of effective sample size, and Rhat is the potential scale reduction factor on split chains (at convergence Rhat=1).

Shows 50% and 95% credible intervals.

plot(f1)

MCMC concepts

  • Chains: a positive integer specifying the number of Markov chains. Necessary for checking convergence.
  • Iterations: a positive integer specifying the number of iterations for each chain (including warmup). More iterations (= longer chains) yield more precise CrIs. Usually better to increase the number of chains as well.
  • Warm-up: a positive integer specifying the number of warmup iterations per chain. Warmup iterations are not part of the resulting posterior samples.
  • Thinning: a positive integer specifying the period for saving samples. The default is 1; thinning is rarely needed with modern, efficient MCMC methods.

Parameter estimates

posterior <- as.matrix(f1)
head(posterior)
          parameters
iterations (Intercept)   Height    sigma
      [1,]   -60.23942 1.225287 14.70179
      [2,]  -104.71891 1.727595 12.33015
      [3,]  -163.47424 2.601016 14.52079
      [4,]  -152.94004 2.444259 16.84465
      [5,]   -89.00094 1.550048 14.34543
      [6,]   -55.66791 1.153183 15.39242
mcmc_areas(posterior, pars = "Height", prob = 0.8)

# Sample outcomes and check whether they are compatible with those observed
post_samples <- posterior_predict(f1, draws = 500)
ppc_dens_overlay(trees$Volume, post_samples[1:50, ])

prior_samples <- update(f1, prior_PD = TRUE, refresh = 0)
mcmc_areas(prior_samples, area_method = "equal height")

launch_shinystan(f1)

5. Model diagnostics and reporting

For all their usefulness, probabilistic results and posterior draws are all conditional on the assumptions you put into the full model.

No different from frequentist results, but because Bayesian analyses make these assumptions explicit, their results tend to come across as more prone to the whims of the researcher.

Diagnostics

  • Ascertain convergence of the chains
  • Ascertain that the chains mix well
  • Any divergent transitions render the results invalid (see the control argument in ?rstan::stan)

Reporting

  • Specify the priors
  • Explain how the priors were selected
  • Describe the statistical model used
  • Describe the techniques used in the analysis (incl. chain numbers and lengths)
  • Identify the statistical software program used in the analysis
  • Summarize the posterior distribution with a measure of central tendency and a credibility interval
  • Assess the sensitivity of the analysis to different priors
  • SAMPL guidelines: https://doi.org/10.1016/j.ijnurstu.2014.09.006

Break and exercises

Generating data

ICU-LOS

library(tidyverse) # tibble, dplyr, readr

expit <- function(x) exp(x) / (1 + exp(x))

N <- 1000
N_women <- 0.4 * N
N_men <- N - N_women
set.seed(4131)
tibble(
  id = seq_len(N),
  treatment = sample(0:1, N, replace = TRUE),
  age = round(runif(N, 18, 85)),
  woman = c(rep(1, N_women), rep(0, N_men)),
  sepsis = c(sample(0:1, N_women, TRUE, c(0.3, 0.7)), sample(0:1, N_men, TRUE)),
  icu_los = round(c( # subset sepsis per group, lest men reuse the women's values
    rlnorm(N_women, log(5 + sepsis[seq_len(N_women)]), log(5)), 
    rlnorm(N_men, log(7 + sepsis[-seq_len(N_women)]), log(4))
  ), 2),
  hosp_los = round(pmax(icu_los, icu_los * (3.5 + rnorm(N, 0, 0.5))), 2)
) %>% 
  filter(icu_los < 365 | hosp_los < 365) %>% 
  mutate(
    dead_at_1y = 0L + (0.5 < expit(-18 - 0.5 * treatment + 0.3 * age - 0.2 * woman + 0.4 * sepsis))
  ) %>% 
  # summarise(mean(dead_at_1y))
  write_tsv("data/icu.tsv")

Description: We would like to estimate the association between length of stay in the ICU (ICU-LOS) and the total hospital length of stay (hosp-LOS). We know the biological sex of each patient and whether they were admitted with sepsis or not.