All of these exercises use the R program. We will need the MESS, isdals, rstan, brms, rstanarm, and bayesplot packages, which can all be installed using the install.packages() function.
install.packages(c("isdals", "rstan", "brms", "rstanarm", "bayesplot", "MESS"))
The exercises consist of two types of problems: introductory text that you should work through to make sure you understand how to use the function(s) and interpret the output, and a few additional questions for you to explore and discuss.
In this exercise we will take the horoscope example from the lecture and investigate what happens to the conclusions from a Bayesian analysis when we modify the prior and let the number of observations increase. We will also investigate why this straightforward approach might not be the most efficient way to do Bayesian analysis, and why we need specialized software.
Run the code below and make sure you understand what is going on in each of the single lines.
nsim <- 1000            # The number of samples to generate
ssize <- 84             # The sample size (horoscopes)
observed_succes <- 27   # The number of observed correct guesses
theta <- runif(nsim)    # Draw from prior distribution
res <- rbinom(nsim, size=ssize, prob=theta)  # Simulate an outcome for each theta
keep <- theta[res==observed_succes]          # Keep the thetas that reproduce the observed data
Plot a histogram of the \(\theta\)s that you have kept (i.e., the thetas that gave rise to an outcome matching our original outcome).
Compute the posterior median and provide a guess of the MAP (maximum a posteriori estimate).
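A minimal sketch of these steps, assuming the objects from the code above:
hist(keep, breaks = 20, main = "Posterior draws of theta")  # posterior distribution
median(keep)                                                # posterior median
d <- density(keep)
d$x[which.max(d$y)]                                         # rough MAP from the density peak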
Now change your prior distribution such that it is not uniform. You are free to pick any non-uniform distribution, but note that it should only result in values between 0 and 1 since the value will subsequently be used as a probability. [Hint: a beta distribution could be interesting. Use the rbeta() function and select some interesting shape parameters.]
Draw a histogram or density of your prior distribution.
Run the same algorithm as above. Plot the posterior distribution, and guess the MAP and posterior median.
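A sketch of these two steps, here with an arbitrary Beta(2, 5) prior (pick your own shape parameters):
theta <- rbeta(nsim, shape1 = 2, shape2 = 5)  # draw from the new prior
hist(theta, breaks = 20, main = "Prior draws")
res <- rbinom(nsim, size=ssize, prob=theta)
keep <- theta[res==observed_succes]
hist(keep, breaks = 20, main = "Posterior draws")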
State a conclusion of your analysis in words.
How did the results change compared to the initial analysis? Why?
How many thetas are part of the posterior distribution? How does that compare to nsim? What is the problem here?
Now assume that we had had 10 times the number of observations but with the same rate of success: ssize <- 840 ; observed_succes <- 270. Rerun the analysis with your own prior and compare the results from this larger dataset to the previous analysis. How are they similar? How are they different?
We will now perform the same analysis as we did above but using the rstanarm package. We start by loading the relevant packages and creating a dataset that can be used for logistic regression.
library("rstanarm") # Load package
library("bayesplot") # Load package
DF <- data.frame(correct=rep(0:1, times=c(84-27, 27)))  # 27 correct guesses out of 84
Use the stan_glm() function to fit a logistic model. Remember that a logistic model fits parameters on the log odds scale. [Hint: remember that glm() requires the argument family=binomial to run a logistic regression.]
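A minimal sketch of such a fit with the default priors (adjust to your liking):
fit <- stan_glm(correct ~ 1, family = binomial, data = DF)
summary(fit)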
Extract the posterior values and plot the posterior distribution of the intercept. [Hint: use as.matrix() and mcmc_areas().]
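A sketch, assuming the fitted model object is called fit:
post <- as.matrix(fit)                  # posterior draws, one column per parameter
mcmc_areas(post, pars = "(Intercept)")  # from the bayesplot package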
Compute the MAP and posterior median on the probability scale. [Hint: recall that \(g(x) = \frac{\exp(x)}{1 + \exp(x)}\) is the inverse function of the logistic transform.]
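A sketch of the transformation to the probability scale (plogis() computes the same function):
p <- exp(post[, "(Intercept)"]) / (1 + exp(post[, "(Intercept)"]))
median(p)                # posterior median on the probability scale
dp <- density(p)
dp$x[which.max(dp$y)]    # rough MAP on the probability scale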
Note that stan_glm() uses a zero-mean normal distribution as the prior for the intercept. Thus, we might not get the same posterior distribution as we did above. How could we modify the prior distribution to make it resemble a uniform distribution? Try to refit the model with this uniform distribution. [Hint: look at the prior_intercept argument and the rstanarm documentation on prior distributions.]
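One possibility is a very wide normal prior, or a completely flat prior via prior_intercept = NULL; a sketch:
fit_flat <- stan_glm(correct ~ 1, family = binomial, data = DF,
                     prior_intercept = normal(0, 10))  # or prior_intercept = NULL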
Try to change the prior distribution to, say, normal(0, .1) and rerun the analysis. How do the conclusions change? Why?
Try your own prior and see how it influences the result.
Now assume that we had had 10 times the number of observations but with the same rate of success: ssize <- 840 ; observed_succes <- 270. Rerun the analysis with your own prior and compare the results from this larger dataset to the previous analysis. How are they similar? How are they different?
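A sketch of the corresponding data frame and fit (object names are placeholders):
DF10 <- data.frame(correct=rep(0:1, times=c(840-270, 270)))  # 270 correct out of 840
fit10 <- stan_glm(correct ~ 1, family = binomial, data = DF10)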
Assume that you are working at a production facility with quality control of a particular drug. You want to estimate the probability that the production will fail. In the previous year 365 batches were produced and none of them failed.
Estimate the probability of failure based on a frequentist analysis of the data. What is your estimated proportion? What is your forecast of failures for the next year? Can any failures happen?
Use a Bayesian modeling approach to estimate the probability of failure for the next year. How/why can the conclusion change? Which result is more reasonable?
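A minimal sketch of the Bayesian part (the prior choice is up to you; here the rstanarm default is used):
batches <- data.frame(fail = rep(0, 365))                    # 365 batches, none failed
fitq <- stan_glm(fail ~ 1, family = binomial, data = batches)
p_fail <- plogis(as.matrix(fitq)[, "(Intercept)"])           # posterior draws of the failure probability
median(p_fail)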
In this exercise we will use the fev dataset from the isdals package. The data frame contains results from a study of 600+ children and their lung capacity. The primary aim is to investigate how smoking influences lung capacity, FEV. The data also contain information on Age, height (Ht) in inches, Gender, and Smoke.
Load the package and get the data:
library("isdals")
data(fev)
Analyze the data with a regular frequentist model. What are your overall conclusions? What are your conclusions about the impact of smoking on lung capacity?
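A sketch of a frequentist analysis (the exact model choice is up to you):
fit_freq <- lm(FEV ~ Age + Ht + Gender + Smoke, data = fev)
summary(fit_freq)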
Analyze the same model using a Bayesian analysis with the stan_glm() function. Make sure you have a reasonable prior.
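A sketch using weakly informative normal priors (adjust as needed):
fit_bayes <- stan_glm(FEV ~ Age + Ht + Gender + Smoke, data = fev,
                      prior = normal(0, 2.5), prior_intercept = normal(0, 10))
summary(fit_bayes)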
Examine the fit of the model. Convergence okay?
Use shinystan to investigate the model.
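For example (assuming the fitted object is called fit_bayes):
library("shinystan")          # install.packages("shinystan") if needed
launch_shinystan(fit_bayes)   # opens an interactive session in the browser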
Increase the number of chains to, say, 8 and double the number of iterations. Examine the fitted model (for example with shinystan). How did modifying the number of chains and iterations change the results? What could changing these potentially fix?
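A sketch (the rstanarm defaults are 4 chains and 2000 iterations per chain):
fit_long <- stan_glm(FEV ~ Age + Ht + Gender + Smoke, data = fev,
                     chains = 8, iter = 4000)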
Look at the posterior distribution of the smoking parameter. What is your conclusion about smoking? What is the probability that the smoking effect is greater than zero?
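A sketch of the posterior probability (the name of the smoking coefficient column may differ depending on how Smoke is coded):
post <- as.matrix(fit_bayes)
mean(post[, "Smoke"] > 0)   # proportion of posterior draws above zero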
Now fit the same model with the brm() function from the brms package. Look at the output from this model, and how it differs from the stan_glm model. What extra information do you get here?
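A sketch (brms compiles a Stan program, so the first run takes a while):
library("brms")
fit_brm <- brm(FEV ~ Age + Ht + Gender + Smoke, data = fev)
summary(fit_brm)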
Use the get_prior() function on a model formula to see the parameters that can have their priors changed. Class b is about the mean parameters.
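For example:
get_prior(FEV ~ Age + Ht + Gender + Smoke, data = fev)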
If we first define the priors then they can be included in brm using the prior argument. Tweak the following code to work with the fev data:
## define some priors
prior <- c(prior_string("normal(0,10)", class = "b"),    # Default example
           prior(normal(1,2), class = b, coef = treat),  # Specific for treat
           prior(cauchy(0,2), class = "sigma"))           # For the residual standard deviation
Rerun the analysis with the updated priors (of your choice). Check the model convergence.
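A sketch, assuming the prior object has been adapted to the fev variables:
fit_prior <- brm(FEV ~ Age + Ht + Gender + Smoke, data = fev, prior = prior)
summary(fit_prior)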
We can use the leave-one-out cross-validation information criterion (LOOIC) to see how well different models predict the outcome. These LOOICs can then be compared. Like other information criteria, lower is better. Importantly, the LOOIC comes with a standard error. Fit a model without Smoke and compare the LOOIC between the original model and the reduced model.
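A sketch of the comparison, assuming the full brms model is called fit_brm:
fit_red <- brm(FEV ~ Age + Ht + Gender, data = fev)  # model without Smoke
loo(fit_brm)                                         # LOOIC for the full model
loo(fit_red)                                         # LOOIC for the reduced model
loo_compare(loo(fit_brm), loo(fit_red))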
Load the data. This data frame contains information on the Class, Sex, Age (binary), and Survived. The data is stored as a 4-dimensional array, so we need to expand it to have one row per individual on the ship. This can be accomplished with the expand_table() function from the MESS package.
library("MESS")
data(Titanic)
ship <- expand_table(Titanic)
Analyze the data with a regular frequentist logistic regression model. What are your conclusions? [Hint: you can use the glm() function.]
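A sketch of the frequentist analysis:
fit_glm <- glm(Survived ~ Class + Sex + Age, family = binomial, data = ship)
summary(fit_glm)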
Analyze the model using a Bayesian approach. Remember to consider your priors and check model convergence.
What are the conclusions from the Bayesian analysis?
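A sketch of a Bayesian counterpart (the prior is only an example):
fit_ship <- stan_glm(Survived ~ Class + Sex + Age, family = binomial,
                     data = ship, prior = normal(0, 2.5))
summary(fit_ship)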
Complete separation happens when a predictor variable, or a combination of predictor variables, separates the outcome variable completely. This is a common problem in logistic regression. When such a dataset is analyzed using standard frequentist methods with maximum likelihood, the estimate for the separating variable does not exist. In this exercise we will see what happens when the data is analyzed using Bayesian statistics.
Load the data frame qsep using the command
load(url("https://biostatistics.dk/teaching/advtopicsB/data/qsep.rda"))
Analyze y as a function of x and g using a standard logistic regression model. Notice the size and standard error of the variable x.
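A sketch of the standard analysis:
fit_sep <- glm(y ~ x + g, family = binomial, data = qsep)
summary(fit_sep)  # note the estimate and standard error for x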
Analyze the same data with a Bayesian model with a vague prior and look at the estimate and standard error. How (and why) does the Bayesian analysis circumvent the problem with complete separation?
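A sketch of the Bayesian version with a vague (but proper) prior:
fit_sep_bayes <- stan_glm(y ~ x + g, family = binomial, data = qsep,
                          prior = normal(0, 10))
summary(fit_sep_bayes)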
Analyze the full lung capacity data with a Stan model and run it. Argue for all the choices that you have made so far regarding priors etc. Recall from the lecture that a basic Stan program contains three sections: data, parameters, and model.
Below is the sample code for a standard linear regression analysis. Modify this code to run the full analysis (see hints below).
data {
int<lower=1> N;
vector[N] y;
vector[N] x;
}
parameters {
real alpha;
real beta;
real<lower=0> sigma; // Note lower limit
}
model {
alpha ~ normal(0, 10); // Prior
beta ~ normal(0, 10); // distributions
sigma ~ cauchy(0, 2.5); // defined here
y ~ normal(alpha + beta * x, sigma);
}
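A sketch of how such a program might be run from R, assuming it is saved in a file called fev.stan (the file name and the single predictor are placeholders):
library("rstan")
standata <- list(N = nrow(fev), y = fev$FEV, x = fev$Ht)  # data for the single-predictor version
fit_stan <- stan(file = "fev.stan", data = standata)
print(fit_stan)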
Hint: