All of these exercises use the R program package. We will be needing
a few packages (all on CRAN). You can install these with the
install.packages()
function:
install.packages(c(
"rstan", "rstanarm", "bayesplot", "coda",
"dplyr", "ggplot2", "purrr",
"adaptr"
))
for (p in c("dplyr", "ggplot2", "purrr", "adaptr", "coda")) {
library(p, character.only = TRUE)
}
The exercises revolve around two types of problems: introductory code that you should run through to make sure you understand how to use the function(s) and interpret the output, and additional questions to explore and discuss to dig a bit deeper into Bayesian data analysis and inference.
In this exercise we will take the example with the ICU patients from the lecture and investigate what happens to the conclusions from a Bayesian analysis when we modify the prior and have more observations. This will also address why the straight-forward sampling approach might not be sufficiently efficient and why specialised software would be needed.
Run the code below and make sure you understand what is going on in each line
<- 10000
n_samples <- 22
n_obs <- 8
n_women
<- runif(n_samples)
theta <- rbinom(n_samples, size = n_obs, prob = theta)
cond_likelihood_draws <- theta[cond_likelihood_draws == n_women] posterior_draws
Plot a histogram of the posterior draws of theta (i.e. theta values you have kept because they gave rise to an outcome matching our original outcome)
ggplot() +
geom_histogram(aes(posterior_draws), bins = 10)
ggplot() +
stat_ecdf(aes(posterior_draws))
Compute the median and 95% CrI of the posterior; provide a guess of the MAP
<- function(x, ...) { # ... passed on to density()
dist_mode with(density(x, ...), x[which.max(y)])
}
<- function(x, n_digits = 2) { # helper function; finds mode of x
dist_mode_bad <- round(x, n_digits)
x <- unique(x)
ux which.max(tabulate(match(x, ux)))][1] # index in case of ties
ux[
}
median(posterior_draws)
## [1] 0.3669366
quantile(posterior_draws, c(0.025, 0.5, 0.975))
## 2.5% 50% 97.5%
## 0.1887885 0.3669366 0.5565065
dist_mode(posterior_draws)
## [1] 0.3487001
Now use a non-uniform distribution
Hint: you are free to pick any non-uniform distribution that yields only values between 0 and 1; after all, it’s a prior over a proportion (which can be neither below 0 nor above 1)
Hint: a beta distribution could be interesting. Use the
rbeta()
function and select some interesting shape
parameters that would express realistic prior expectations about the
distribution between women and men
<- function(a, b, N = n_obs, z = n_women) {
posterior_theta <- rbeta(n_samples, a, b)
theta <- rbinom(n_samples, size = N, prob = theta)
cond_likelihood_draws return(theta[z == cond_likelihood_draws])
}
<- rbeta(n_samples, 11, 11)
prior_theta_11_11 ggplot() +
geom_density(aes(prior_theta_11_11))
<- posterior_theta(11, 11) post_theta_11_11
Visualise the prior distribution
ggplot() +
stat_density(aes(x = prior_theta_11_11), geom = "line", linetype = 2) +
stat_density(aes(x = post_theta_11_11), geom = "line") +
stat_density(aes(x = posterior_draws), geom = "line", colour = "red") +
labs(x = NULL, y = NULL)
Repeat the analysis from point 1 above with the new prior. Plot the posterior distribution, and compute the MAP and median posterior value.
# plot: see above
median(post_theta_11_11)
## [1] 0.4348256
dist_mode(post_theta_11_11)
## [1] 0.4490562
Conclude on your analysis with words
# The distribution is pulled downward towards lower proportion of women
How did the results with your own prior change compared to the initial analysis? Why?
How many thetas are part of the posterior distribution? How does
that compare to n_samples
? What is the problem here?
length(post_theta_11_11)
## [1] 800
length(post_theta_11_11) / n_samples
## [1] 0.08
map_dfr(1:10, ~ quantile(posterior_theta(11, 11), c(0.025, 0.5, 0.975)))
## # A tibble: 10 × 3
## `2.5%` `50%` `97.5%`
## <dbl> <dbl> <dbl>
## 1 0.300 0.431 0.584
## 2 0.297 0.432 0.581
## 3 0.288 0.434 0.578
## 4 0.289 0.435 0.567
## 5 0.297 0.435 0.583
## 6 0.289 0.435 0.574
## 7 0.280 0.426 0.579
## 8 0.296 0.433 0.592
## 9 0.280 0.431 0.572
## 10 0.296 0.428 0.576
Give a rule a thumb as to the number of effective samples (i.e. those prior samples giving rise to the observed outcome) you need to obtain stable results.
Now assume that we had had 10 ICUs (all with 22 beds) but with
the same distribution between sexes (i.e. n_obs <- 220
and n_women <- 80
). Rerun the analysis with your own
prior and compare the results from this larger dataset to the previous
analysis. How are they similar? How are they different?
# Data clearly dominate the prior
<- posterior_theta(11, 11, 220, 80)
posterior_draws_large <- posterior_theta(110, 110, 220, 80)
posterior_draws_large_11_11
median(post_theta_11_11)
## [1] 0.4348256
median(posterior_draws_large)
## [1] 0.3824527
dist_mode(post_theta_11_11)
## [1] 0.4490562
dist_mode(posterior_draws_large)
## [1] 0.3795257
ggplot() +
stat_density(aes(x = prior_theta_11_11), geom = "line", linetype = 2) +
stat_density(aes(posterior_draws, colour = "1 ICU"), geom = "line") +
stat_density(aes(posterior_draws_large, colour = "10 ICUs"), geom = "line") +
stat_density(aes(posterior_draws_large_11_11, colour = "10 ICUs, strong prior"), geom = "line")
Reusing the posterior samples from exercise A6, find the
percentile 95% CrI of the posterior distribution using
quantile()
. Then, compute also the 95% highest posterior
density interval (HPDI) and comment on any differences that may
emerge.
Hint: convert first to an mcmc
object (the
coda
package has the functions to do this)
Hint: try to plot the distribution of the posterior samples and draw both the percentile CrI and the HPDI.
<- quantile(post_theta_11_11, c(0.025, 0.975))
cri <- HPDinterval(as.mcmc(post_theta_11_11), prob = 0.95)
hpdi
<- with(density(post_theta_11_11), tibble(x, dens = y)) %>%
plot_df mutate(
in_95cri = between(x, cri[1], cri[2]),
in_95hpdi = between(x, hpdi[1], hpdi[2])
)
ggplot(plot_df, aes(x = x, y = dens)) +
geom_line() +
geom_area(aes(fill = "95% CrI"), ~ filter(., in_95cri), alpha = 0.2) +
geom_area(aes(fill = "95% HPDI"), ~ filter(., in_95hpdi), alpha = 0.2)
ggplot() +
geom_density(aes(rbeta(10000, 11, 11)))
Expecting an in-ICU mortality risk of 20% in patients with septic shock, devise a couple of priors that express this expectation with multiple levels of certainty. Compute the median and 95% interval
find_beta_params()
from the adaptr
packagemap_dfr(c(0.025, 0.05, 0.1), ~ find_beta_params(0.2, ., n_dec = 2))
## alpha beta p2.5 p50.0 p97.5
## 1 1.85 7.41 0.02490928 0.1777423 0.4937889
## 2 3.22 12.88 0.04998349 0.1874077 0.4197214
## 3 9.43 37.71 0.10003411 0.1957777 0.3240885
Devise priors with the same levels of certainty but with other in-ICU mortality risks
expand.grid(theta = c(0.15, 0.2, 0.25), boundary = c(0.025, 0.05, 0.1)) %>%
bind_cols(with(., map2_dfr(theta, boundary, find_beta_params, n_dec = 2)))
## theta boundary alpha beta p2.5 p50.0 p97.5
## 1 0.15 0.025 2.32 13.14 0.02502351 0.1348900 0.3588380
## 2 0.20 0.025 1.85 7.41 0.02490928 0.1777423 0.4937889
## 3 0.25 0.025 1.59 4.76 0.02519434 0.2231835 0.6188157
## 4 0.15 0.050 4.70 26.66 0.04992283 0.1423983 0.2917302
## 5 0.20 0.050 3.22 12.88 0.04998349 0.1874077 0.4197214
## 6 0.25 0.050 2.53 7.59 0.05000309 0.2330860 0.5420524
## 7 0.15 0.100 24.89 141.04 0.10000338 0.1485954 0.2079829
## 8 0.20 0.100 9.43 37.71 0.10003411 0.1957777 0.3240885
## 9 0.25 0.100 5.77 17.31 0.09999493 0.2426845 0.4408792
Assume that you are working at a production facility with quality control of a novel drug miraculixaban. You want to estimate the probability of producing a faulty batch in the coming year. In the previous year 365 batches were produced; none were faulty.
Estimate the probability of faulty batch based on a frequentist analysis of the data. What is your estimated proportion? What is your forecast of faulty batches for the coming year? Can any happen?
Hint: the maximum likelihood estimate of a binomial variable with \(k\) successes (here, faulty batches) in \(N\) trials (here, days in a year) is \(\hat \theta = \tfrac{k}{N}\)
Hint: recall the definitions of probability from the slides
#' The maximum likelihood estimator theta (MLE) of a binomial variable with
#' k successes in n k/n. Because k=0, the MLE is zero, leaving no uncertainty,
#' and so there’s no standard error, making it impossible to actually derive
#' a confidence interval. So the forecast would be 0 failures.
Use a Bayesian modeling approach to estimate the probability of a faulty batch for the coming year. How/why can the conclusion change? Which result is more reasonable?
#' For a beta(a, b) prior and a binomial(n, theta) likelihood, the posterior is
#' beta(a + k, b + n - k), because the beta distribution is a conjugate prior
#' to the binomial likelihood. Thus, prop_failure is the posterior
#' distribution of the expected probability of producing a failed drug,
#' assuming a beta(0.5, 0.5) prior (Jeffrey’s prior).
<- rbeta(10000, 0.5 + 0, 0.5 + 365)
prob_failure ggplot() + geom_density(aes(prob_failure), trim = TRUE)
ggplot() +
geom_density(aes(rbeta(10000, 0.5 + 0, 0.5 + 365), colour = "0.5, 0.5")) +
geom_density(aes(rbeta(10000, 1 + 0, 1 + 365), colour = "1, 1")) +
geom_density(aes(rbeta(10000, 1 + 0, 364 + 365), colour = "1, 364"))
# Distribution of the expected number of faults with 365 batches produced:
ggplot() + geom_histogram(aes(prob_failure * 365), binwidth = 1)
Imagine you know 10 experts with experience in the same type of production line as your factory. You would like to survey these experts and use their responses to devise an appropriate prior for the annual fault probability. How would you go about that?
#' One could ask them to give the number of faulty batches out of 365
#' produced. Then, one could compute the pooled proportion and "rescale"
#' to arrive at a 365-day time frame. For example:
<- c(1, 1, 0, 0, 2, 1, 2, 1, 1, 0)
survey_responses mean(survey_responses)
## [1] 0.9
ggplot() +
stat_function(fun = ~ dbeta(., 0.9, 365 - 0.9), n = 1000, xlim = c(0, 0.05))
We will now perform the same analysis as we did above but using the
rstanarm
package. We will start by loading the relevant
package and create a dataset that can be used for logistic
regression.
library("rstanarm")
library("bayesplot")
<- data.frame(woman = rep(0:1, times = c(22 - 8, 8))) dta
Use the stan_glm()
function to fit a logistic model.
Remember that a logistic model fits parameters on the log-odds (= logit)
scale.
Hint: remember to set family = binomial
to run a
logistic regression
Hint: to fit a regression model without covariates, the formula
should have the form y ~ 1
where y
is the
outcome variable in the data.
<- stan_glm(
fit ~ 1,
woman data = dta,
family = binomial,
cores = 1,
refresh = 0 # just to suppress sampler output (avoids cluttering
)
Extract the posterior values and plot the posterior distribution of the intercept.
as.matrix()
and
mcmc_areas()
<- as.matrix(fit)
posteriors as_tibble(fit)
## # A tibble: 4,000 × 1
## `(Intercept)`
## <dbl>
## 1 -0.698
## 2 -0.379
## 3 -0.695
## 4 -0.223
## 5 0.0707
## 6 -0.0314
## 7 -0.214
## 8 -0.440
## 9 -0.601
## 10 -0.445
## # … with 3,990 more rows
mcmc_areas(posteriors)
Compute the MAP and median posterior on the probability scale.
Hint: recall that \(g(x) = \frac{\exp(x)}{1 + \exp(x)}\) is the inverse function to the logistic transform
Hint: look at ?posterior_epred
<- function(logit) {
g exp(logit) / (1 + exp(logit))
}
ggplot() +
geom_density(aes(g(posteriors[, 1])))
stan_glm()
uses a zero-mean normal distribution as
prior for the intercept by default. Thus, we might not get the same
posterior distribution as we did above. How could we modify the prior
distribution to make it resemble a uniform distribution? Try to refit
the model with this uniform distribution
?stan_glm
and https://mc-stan.org/rstanarm/articles/priors.html<- stan_glm(
fit_uniform_prior ~ 1,
woman data = dta,
family = binomial,
prior_intercept = NULL, # uses uniform prior
cores = 1,
refresh = 0 # just to suppress sampler output (avoids cluttering
)
Try to change the prior distribution to
e.g. normal(0.0, 0.1)
and rerun the analysis. How does this
impact the conclusion? Why?
<- stan_glm(
fit_normal_prior ~ 1,
woman data = dta,
family = binomial,
prior_intercept = normal(0.0, 0.1),
cores = 1,
refresh = 0 # just to suppress sampler output (avoids cluttering)
)
Try your own prior and see how it influences the result.
#' Same as above but using whatever prior you used
Prior predictive checks are like posterior predictive checks,
only that they ignore the data and generates data based solely on the
priors. This can often be useful to check that the priors you have
specified will yield data that look reasonable for the modelling task at
hand. Carry out a prior predictive check of the three models above
(default prior, normal(0.0, 0.1)
and your own). All
good?
prior_PD
argument in
?stan_glm
.# Create new larger dataset
update(fit, prior_PD = TRUE)
## stan_glm
## family: binomial [logit]
## formula: woman ~ 1
## observations: 22
## predictors: 1
## ------
## Median MAD_SD
## (Intercept) 0.1 2.5
##
## ------
## * For help interpreting the printed output see ?print.stanreg
## * For info on the priors used see ?prior_summary.stanreg
Now assume, again, that we had had data from 10 ICUs (as above). Construct a new data frame akin to the one above but with more observations. Then, rerun the analysis with your own prior and compare the results from this larger dataset to those of the previous analysis. How are they similar? How do they differ?
# Create new larger dataset
<- data.frame(woman = rep(0:1, times = c(220 - 80, 80)))
dta_large
<- update(fit, data = dta_large) # do the same with the other fits
fit_large summary(fit_large)
##
## Model Info:
## function: stan_glm
## family: binomial [logit]
## formula: woman ~ 1
## algorithm: sampling
## sample: 4000 (posterior sample size)
## priors: see help('prior_summary')
## observations: 220
## predictors: 1
##
## Estimates:
## mean sd 10% 50% 90%
## (Intercept) -0.6 0.1 -0.8 -0.6 -0.4
##
## Fit Diagnostics:
## mean sd 10% 50% 90%
## mean_PPD 0.4 0.0 0.3 0.4 0.4
##
## The mean_ppd is the sample average posterior predictive distribution of the outcome variable (for details see help('summary.stanreg')).
##
## MCMC diagnostics
## mcse Rhat n_eff
## (Intercept) 0.0 1.0 1038
## mean_PPD 0.0 1.0 1903
## log-posterior 0.0 1.0 971
##
## For each parameter, mcse is Monte Carlo standard error, n_eff is a crude measure of effective sample size, and Rhat is the potential scale reduction factor on split chains (at convergence Rhat=1).
In this exercise we will use data from a fictive RCT with ICU
patients randomised to miraculixaban (treatment = 1
)
vs. standard of care (treatment = 0
) . The dataset also
contains a few baseline covariates (age,
woman
and sepsis
) along with outcome variables
(icu_los
, hosp_los
and
dead_at_1y
).
Download the data and take a look:
<- read.delim("https://raw.githubusercontent.com/epiben/course_adv_stats_ucph/main/bayesian/icu.tsv")
icu head(icu)
## id treatment age woman sepsis icu_los hosp_los dead_at_1y
## 1 1 0 74 1 1 1.69 5.64 1
## 2 2 0 62 1 0 11.82 54.12 1
## 3 3 0 21 1 1 13.75 48.53 0
## 4 4 0 44 1 1 18.05 57.66 0
## 5 5 0 62 1 1 0.34 1.51 1
## 6 6 1 74 1 1 2.93 12.57 1
Analyse the data with a regular frequentist model, estimating the
effect of relevant covariates on the ICU length of stay. Ignore the
treatment
covariate for now. What are your overall
conclusions? What are your conclusions about the impact of sepsis on ICU
length-of-stay?
<- lm(icu_los ~ age + sepsis + woman, data = icu)
freq_fit summary(freq_fit)
##
## Call:
## lm(formula = icu_los ~ age + sepsis + woman, data = icu)
##
## Residuals:
## Min 1Q Median 3Q Max
## -22.48 -18.12 -13.05 -0.03 338.19
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 23.62306 3.89321 6.068 1.84e-09 ***
## age -0.04298 0.06370 -0.675 0.500
## sepsis -1.23795 2.57159 -0.481 0.630
## woman 0.05376 2.59385 0.021 0.983
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 39.28 on 994 degrees of freedom
## Multiple R-squared: 0.0006794, Adjusted R-squared: -0.002337
## F-statistic: 0.2252 on 3 and 994 DF, p-value: 0.8789
Analyse the same model using a Bayesian analysis with the
stan_glm()
function. Try to specify reasonable priors
<- stan_glm(
bayes_fit ~ age + sepsis + woman,
icu_los data = icu,
seed = 4131,
refresh = 0
)
summary(bayes_fit)
##
## Model Info:
## function: stan_glm
## family: gaussian [identity]
## formula: icu_los ~ age + sepsis + woman
## algorithm: sampling
## sample: 4000 (posterior sample size)
## priors: see help('prior_summary')
## observations: 998
## predictors: 4
##
## Estimates:
## mean sd 10% 50% 90%
## (Intercept) 23.6 3.9 18.6 23.5 28.6
## age 0.0 0.1 -0.1 0.0 0.0
## sepsis -1.3 2.6 -4.6 -1.3 2.1
## woman 0.1 2.6 -3.2 0.1 3.4
## sigma 39.3 0.9 38.2 39.3 40.4
##
## Fit Diagnostics:
## mean sd 10% 50% 90%
## mean_PPD 20.7 1.8 18.5 20.7 23.0
##
## The mean_ppd is the sample average posterior predictive distribution of the outcome variable (for details see help('summary.stanreg')).
##
## MCMC diagnostics
## mcse Rhat n_eff
## (Intercept) 0.1 1.0 5692
## age 0.0 1.0 6092
## sepsis 0.0 1.0 5721
## woman 0.0 1.0 5333
## sigma 0.0 1.0 5639
## mean_PPD 0.0 1.0 4921
## log-posterior 0.0 1.0 1859
##
## For each parameter, mcse is Monte Carlo standard error, n_eff is a crude measure of effective sample size, and Rhat is the potential scale reduction factor on split chains (at convergence Rhat=1).
Examine the fit of the model. Convergence okay?
rhat(bayes_fit)
## (Intercept) age sepsis woman sigma
## 0.9996638 0.9998776 0.9995471 0.9996142 0.9995893
mcmc_dens_overlay(bayes_fit)
mcmc_trace(bayes_fit)
Use shinystan to investigate the model.
launch_shinystan(bayes_fit)
Increase the number of chains to, say, 8 and the double the number of iterations. Examine the fitted model (for example with shinystan). What’s the impact on the results? What problems could more and longer chains potentially fix?
<- stan_glm(
bayes_fit_more_longer ~ age + sepsis + woman,
icu_los data = icu,
iter = 8000,
chains = 8,
seed = 4131,
refresh = 0
)
summary(bayes_fit_more_longer)
##
## Model Info:
## function: stan_glm
## family: gaussian [identity]
## formula: icu_los ~ age + sepsis + woman
## algorithm: sampling
## sample: 32000 (posterior sample size)
## priors: see help('prior_summary')
## observations: 998
## predictors: 4
##
## Estimates:
## mean sd 10% 50% 90%
## (Intercept) 23.6 3.9 18.7 23.6 28.6
## age 0.0 0.1 -0.1 0.0 0.0
## sepsis -1.2 2.6 -4.5 -1.2 2.1
## woman 0.0 2.6 -3.3 0.0 3.4
## sigma 39.3 0.9 38.2 39.3 40.4
##
## Fit Diagnostics:
## mean sd 10% 50% 90%
## mean_PPD 20.7 1.8 18.5 20.7 23.0
##
## The mean_ppd is the sample average posterior predictive distribution of the outcome variable (for details see help('summary.stanreg')).
##
## MCMC diagnostics
## mcse Rhat n_eff
## (Intercept) 0.0 1.0 46903
## age 0.0 1.0 46806
## sepsis 0.0 1.0 43739
## woman 0.0 1.0 41971
## sigma 0.0 1.0 44268
## mean_PPD 0.0 1.0 36564
## log-posterior 0.0 1.0 13082
##
## For each parameter, mcse is Monte Carlo standard error, n_eff is a crude measure of effective sample size, and Rhat is the potential scale reduction factor on split chains (at convergence Rhat=1).
Look at the posterior distribution of the sepsis
parameter. What is your conclusion about the effect of sepsis on ICU
length of stay? What is the probability that the effect of sepsis is
associated with shortened ICU-LOS? Shortening by more than 1 day?
mcmc_areas(bayes_fit, pars = "sepsis", prob = 0.95)
mean(as.data.frame(bayes_fit)$sepsis < 0)
## [1] 0.68725
mean(as.data.frame(bayes_fit)$sepsis < -1)
## [1] 0.54175
mean(as.data.frame(bayes_fit)$sepsis < -2)
## [1] 0.39125
mean(as.data.frame(bayes_fit)$sepsis < -5)
## [1] 0.0715
How might you use a survey to devise an informative prior on the effect of age on ICU length of stay?
#' One could, for example, ask respondents to distribute 20 X's on a grid with
#' how many days longer a patient would be expected to remain admitted for
#' each decade they are older. Then, combine all those X's into one and fit
#' a Normal distribution to it whose mean and sd would be the prior values.
#'
#' One respondents might submit this:
#'
#' ^
#' | X X X
#' | X X X X X
#' | X X X X X
#' | X X X X X X X
#' |----------------------->
We can use leave-one-out-cross-validation (LOO) information criterion (IC) to see how well different models predict the outcome. These LOOICs can then be compared. Like other information criteria, lower is better. Importantly, the LOOIC comes with a standard error. Fit a model without the sepsis covariate and compare the LOOIC between the original and the reduced models.
?loo
and ?loo_compare
<- stan_glm(
bayes_fit_wo_sepsis ~ age + woman,
icu_los data = icu,
seed = 4131,
refresh = 0
)
<- loo(bayes_fit)
loo_fit <- loo(bayes_fit_wo_sepsis)
loo_fit_wo_sepsis loo_compare(loo_fit, loo_fit_wo_sepsis)
## elpd_diff se_diff
## bayes_fit_wo_sepsis 0.0 0.0
## bayes_fit -0.7 0.5
A friend of yours wants to test a new treatment obelixaban against standard of care, in ICU patients with septic shock. Pre-trial evidence suggest that the 30-day mortality in this patient group receiving standard of care is roughly 20% (95% CrI: 5%-35%). Your friend’s main results with respect to 30-day mortality are in the following 2x2 table (assume no loss to follow-up):
Standard of care | Miraculixaban | |
---|---|---|
No. alive at 30 days | 161 | 164 |
No. dead at 30 days | 46 | 35 |
Using an appropriate prior, estimate the effect size of obelixaban compared to standard of care.
Hint: recall that derived quantities, based on posterior parameter estimates, can be computed directly with the posterior samples of the parameters
Hint: effect-size metrics include risk ratio (RR), risk difference (RD) and number-needed-to-treat (NNT)
Hint: use what you learnt in exercises A6 and B1
find_beta_params(0.2, 0.05)
## alpha beta p2.5 p50.0 p97.5
## 1 3 13 0.04331201 0.1743207 0.4046027
<- tibble(
post_estimates mortality_prop_control = rbeta(10000, 3 + 46, 13 + 161),
mortality_prop_obelixaban = rbeta(10000, 3 + 35, 13 + 164),
RR = mortality_prop_obelixaban / mortality_prop_control
)quantile(post_estimates$RR, c(0.025, 0.5, 0.975))
## 2.5% 50% 97.5%
## 0.5460058 0.8038889 1.1755551
After helping your friend with their unadjusted effect-size
estimation above, you return to your own results of the effect of
miraculixaban against standard of care on 1-year mortality (same data as
in section D).
Model the 1-year mortality risk using the four baseline covariates in
the dataset (treatment
, age
,
woman
and sepsis
) and default priors. What do
you conclude about the conditional effects of miraculixaban on the
1-year mortality risk? Can you conclude anything about the effects of
the other covariates?
<- stan_glm(
mortality_fit ~ treatment + age + woman + sepsis,
dead_at_1y data = icu,
family = binomial,
seed = 4131,
refresh = 0 # suppress fitting progress output
)
summary(mortality_fit, probs = c(0.025, 0.5, 0.975))
##
## Model Info:
## function: stan_glm
## family: binomial [logit]
## formula: dead_at_1y ~ treatment + age + woman + sepsis
## algorithm: sampling
## sample: 4000 (posterior sample size)
## priors: see help('prior_summary')
## observations: 998
## predictors: 5
##
## Estimates:
## mean sd 2.5% 50% 97.5%
## (Intercept) -43.3 4.1 -51.9 -43.2 -35.8
## treatment -0.9 0.4 -1.8 -0.9 0.0
## age 0.7 0.1 0.6 0.7 0.9
## woman 0.0 0.4 -0.8 0.0 0.9
## sepsis 0.5 0.4 -0.3 0.5 1.4
##
## Fit Diagnostics:
## mean sd 2.5% 50% 97.5%
## mean_PPD 0.4 0.0 0.4 0.4 0.4
##
## The mean_ppd is the sample average posterior predictive distribution of the outcome variable (for details see help('summary.stanreg')).
##
## MCMC diagnostics
## mcse Rhat n_eff
## (Intercept) 0.1 1.0 1737
## treatment 0.0 1.0 2886
## age 0.0 1.0 1753
## woman 0.0 1.0 2614
## sepsis 0.0 1.0 3110
## mean_PPD 0.0 1.0 3532
## log-posterior 0.0 1.0 1871
##
## For each parameter, mcse is Monte Carlo standard error, n_eff is a crude measure of effective sample size, and Rhat is the potential scale reduction factor on split chains (at convergence Rhat=1).
# Convergence diagnostics
rhat(mortality_fit)
## (Intercept) treatment age woman sepsis
## 1.0036576 0.9997484 1.0035293 1.0003075 0.9997862
mcmc_dens_overlay(mortality_fit)
mcmc_trace(mortality_fit)
mcmc_areas(
mortality_fit, point_est = "median",
pars = vars(-"(Intercept)"),
prob = 0.95
+
) scale_x_continuous(
breaks = log(c(0.1, 0.2, 0.33, 0.5, 0.67, 1, 1.5, 2, 3, 5, 10)),
labels = exp
)
## Scale for x is already present.
## Adding another scale for x, which will replace the existing scale.
Since you need to get your new drug approved by the authorities, you decide to use a somewhat skeptical prior on the treatment effect so that an estimated effect will convince even the grumpiest regulator. How could you devise such a skeptical prior for the effect of the treatment?
<- 0.35
r <- 200
N <- sqrt(4/(1-r)/N + 4/r/N)) (sigma
## [1] 0.2964997
ggplot() +
geom_density(aes(rlnorm(1000, meanlog = 0, sdlog = sigma))) +
scale_x_log10(breaks = c(0.33, 0.5, 0.67, 0.8, 1, 1.25, 1.5, 2, 3))
Rerun the logistic regression with this new informative prior. What’s the impact on the (estimated) conditional treatment effect?
# Only difference is the scale of the first parameter
<- normal(
priors location = c(0, 0, 0, 0),
scale = c(sigma, 2.5, 2.5, 2.5)
)
<- stan_glm(
mortality_fit_skeptical ~ treatment + age + woman + sepsis,
dead_at_1y data = icu,
prior = priors,
family = binomial,
seed = 4131,
refresh = 0 # suppress fitting progress output
)
summary(mortality_fit_skeptical, probs = c(0.025, 0.5, 0.975))
##
## Model Info:
## function: stan_glm
## family: binomial [logit]
## formula: dead_at_1y ~ treatment + age + woman + sepsis
## algorithm: sampling
## sample: 4000 (posterior sample size)
## priors: see help('prior_summary')
## observations: 998
## predictors: 5
##
## Estimates:
## mean sd 2.5% 50% 97.5%
## (Intercept) -75.2 9.0 -93.8 -74.9 -58.4
## treatment -0.3 0.3 -0.8 -0.3 0.2
## age 1.2 0.1 1.0 1.2 1.5
## woman 0.1 0.6 -1.0 0.1 1.2
## sepsis 1.3 0.6 0.1 1.3 2.5
##
## Fit Diagnostics:
## mean sd 2.5% 50% 97.5%
## mean_PPD 0.4 0.0 0.4 0.4 0.4
##
## The mean_ppd is the sample average posterior predictive distribution of the outcome variable (for details see help('summary.stanreg')).
##
## MCMC diagnostics
## mcse Rhat n_eff
## (Intercept) 0.2 1.0 1969
## treatment 0.0 1.0 3317
## age 0.0 1.0 1991
## woman 0.0 1.0 3542
## sepsis 0.0 1.0 2901
## mean_PPD 0.0 1.0 4117
## log-posterior 0.0 1.0 1685
##
## For each parameter, mcse is Monte Carlo standard error, n_eff is a crude measure of effective sample size, and Rhat is the potential scale reduction factor on split chains (at convergence Rhat=1).
prior_summary(mortality_fit_skeptical)
## Priors for model 'mortality_fit_skeptical'
## ------
## Intercept (after predictors centered)
## ~ normal(location = 0, scale = 2.5)
##
## Coefficients
## ~ normal(location = [0,0,0,...], scale = [0.3,2.5,2.5,...])
## ------
## See help('prior_summary.stanreg') for more details
map(list(mortality_fit, mortality_fit_skeptical), as.data.frame) %>%
map_dfc("treatment") %>%
setNames(c("default", "skeptical")) %>%
ggplot() +
geom_density(aes(x = exp(default), colour = "default")) +
geom_density(aes(x = exp(skeptical), colour = "skeptical")) +
scale_x_log10(breaks = c(0.1, 0.2, 0.33, 0.5, 0.67, 1, 1.5, 2, 3, 5, 10)) +
labs(x = "OR", y = NULL)
Compute average treatment effect using G-computation
<- map_dfc(
posterior_marginal_estimates c(miraculixaban = 1, control = 0),
~ rowMeans(posterior_epred(mortality_fit, mutate(icu, treatment = .)))
%>%
) mutate(
risk_ratio = miraculixaban / control,
risk_diff = control - miraculixaban,
NNT = 1/risk_diff
)
ggplot(posterior_marginal_estimates) +
geom_density(aes(risk_ratio))
NB! We didn’t make it here
Build and run your own Stan model of a linear regression as the one in D2 above. Analyze the full data with the lung capacity as a Stan model and run it. Argue for all the choices that you have made so far regarding priors, likelihood, etc. Recall from the lecture that a basic Stan program contains three sections: data, parameters, and model. Below is the sample code for a standard linear regression analysis:
data {
int<lower=1> N;
vector[N] y;
vector[N] x;
}
parameters {
real intercept;
real beta;
real<lower=0> sigma; // note lower limit
}
model {
// Prior distributions
0, 10);
intercept ~ normal(0, 10);
beta ~ normal(0, 2.5);
sigma ~ cauchy(
// Likelihood
y ~ normal(intercept + beta * x, sigma); }
Modify this code to use all covariates and run the analysis
Hint: start by including and using the two numeric variables first.
Hint: then include the categorical variables as numeric variables (with values either 0 or 1). This works because the categorical predictors are binary.