All of these exercises use R. We will need the MASS, MESS, mlbench, pls, caret and bootstrap packages, which can be installed using the install.packages() function as shown below.

install.packages(c("MESS", "MASS", "pls", "caret", "mlbench", "bootstrap"))

The exercises contain two types of problems: introductory material that you should work through to make sure you understand how to use the functions and interpret the output, and a few additional questions for you to explore.

Multiple testing

The multiple testing problem refers to the situation where several tests or statistical inferences are considered simultaneously as a group. Each test has a false positive rate (the type I error rate) which is equal to the threshold set for statistical significance, generally 5%. However, when more than one test is done the overall type I error rate (i.e., the probability that at least one of the test results is a false positive) is much greater than 5%. For example, with 20 independent tests at the 5% level the probability of at least one false positive is \(1 - 0.95^{20} \approx 64\%\).

Multiple testing corrections adjust \(p\) values derived from multiple statistical tests in order to control the overall type I error rate and reduce the occurrence of false positives.

The p.adjust() function takes a vector of \(p\)-values as input and returns a vector of \(p\)-values adjusted for multiple comparisons. The method argument sets the correction method, and the values "holm" for Holm’s correction (the default), "bonferroni" for the Bonferroni correction, and "fdr" (for false discovery rate) are among the possibilities. The Holm and Bonferroni methods both seek to control the overall false positive rate, with Holm’s method being less conservative than the Bonferroni method. The fdr method controls the false discovery rate, which is a less stringent criterion than the overall false positive rate.
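
As a quick illustration of the calling pattern, here is a minimal example using a small made-up vector of \(p\)-values (for illustration only; it is unrelated to the data analysed below):

p <- c(0.001, 0.01, 0.02, 0.04, 0.30)   # made-up p-values
p.adjust(p, method = "holm")            # Holm's correction (the default)
p.adjust(p, method = "bonferroni")      # Bonferroni correction
p.adjust(p, method = "fdr")             # false discovery rate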

We illustrate the p.adjust() function with a typical microarray gene expression experiment found as the superroot2 data frame in the MESS package. The expression levels of 21,500 different genes are examined and for each gene we wish to determine if there is any difference in gene expression between two plant types (mutant and wild type) after correcting for dye color and array. A standard analysis of variance model is used to test the hypothesis of no difference in gene expressions between the two types of plants for each of the 21,500 genes. The by function is combined with anova and lm to calculate the 21,500 \(p\)-values.

  1. How many genes would you expect to be deemed “significant” by chance with a standard \(p\)-value cut-off of 0.05?
  2. In the code below, what is going on in the line starting with pval <-?
  3. Run the code to obtain the \(p\)-values.
library("MESS")
data(superroot2)

pval <- by(superroot2,
           superroot2$gene,
           FUN=function(dat) {anova(lm(log(signal) ~ array + color + plant, data=dat))[3,5]} )
  4. List the 8 smallest \(p\)-values. [Hint: use sort]
  5. List the 8 smallest \(p\)-values after using Holm’s correction.
  6. List the 8 smallest \(p\)-values after using FDR correction.

Your output should show that the smallest \(p\)-value is 0.2254 after correction with the Holm method.

  7. Would you conclude that there are any significantly differentially expressed genes?
  8. Does this conclusion change for any of the methods?
  9. Apply the Bonferroni correction yourself and see how that changes your results.
  10. Why are the smallest \(p\)-values for the FDR correction all identical?
  11. When might these multiple testing methods be inadequate?
  12. Why might it not always be a good idea to focus on \(p\)-values? (Think about prediction and discovery, and the experimental setup.)

Bootstrapping

In this exercise we will return to the Pima Indians data. Load the dataset with the following commands.

library(mlbench)
data(PimaIndiansDiabetes2)
library(bootstrap)

We will only be looking at the insulin variable here. Remember it contains missing observations which we will remove.

insulin <- PimaIndiansDiabetes2$insulin[!is.na(PimaIndiansDiabetes2$insulin)]
  1. Plot the insulin values to see the distribution. It is clearly not symmetric.
  2. Compute the 90% quantile insulin level. [Hint: use the quantile() function with argument p=.9]
  3. Compute a jackknife estimate of the SE of the 90% quantile, and use it to compute a 95% confidence interval for the 90% quantile. [Hint: you can pass extra arguments to quantile through jackknife by supplying an extra argument; a sketch of the calling pattern is shown after this list]
  4. Compute a simple bootstrap-based 95% confidence interval for the 90% quantile. How much does it differ from the jackknife-based interval? Why?
  5. Compute a bias-corrected and accelerated (BCa) bootstrap percentile-based 95% confidence interval for the 90% quantile.
  6. Describe the differences between the three confidence intervals computed.
  7. Discuss how you could use jackknifing for identifying influential observations/outliers from an analysis.
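
If you get stuck, the sketch below shows one possible calling pattern; the jackknife() and bcanon() functions come from the bootstrap package, and the 10,000 bootstrap replicates are an arbitrary choice.

# Jackknife: extra arguments are passed on to the statistic
jk <- jackknife(insulin, quantile, probs = 0.9)
jk$jack.se                                    # jackknife estimate of the SE

# Simple percentile bootstrap of the 90% quantile
boot.q90 <- replicate(10000,
                      quantile(sample(insulin, replace = TRUE), probs = 0.9))
quantile(boot.q90, probs = c(0.025, 0.975))   # simple bootstrap 95% CI

# Bias-corrected and accelerated (BCa) interval
bcanon(insulin, 10000, quantile, probs = 0.9, alpha = c(0.025, 0.975))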

Principal component analysis

Principal component analysis is a useful technique when you have obtained measurements on a number of (possibly correlated) observed variables and you wish to reduce the number of observed variables to a smaller number of artificial variables (called principal components) that account for most of the variance in the observed variables. The principal components may then be used as predictors in subsequent analyses.

The prcomp function performs principal component analysis on a given data matrix. By default, R centers the variables in the data matrix, but does not scale them to have unit variance. In order to scale the variables we should set the logical option scale.=TRUE.

In the following code we will make a principal component analysis on the nine explanatory variables found in the biopsy dataset from the MASS package. There are missing data present in the data frame so we restrict our attention to the complete cases.

library("MASS")
data(biopsy)
names(biopsy)
predictors <- biopsy[complete.cases(biopsy),2:10]
fit <- prcomp(predictors, scale. = TRUE)
summary(fit)
  1. What is the interpretation of the results that you have obtained?
  2. Try using plot(fit) and biplot(fit) and explain what you see. Make sure you produce these two plots and understand what they show; the biplot in particular is important.
  3. Which variable(s) provide(s) any real contribution to the second principal component?
  4. Look at the loadings of the fitted object. Do you see any patterns in the composition of the principal components?

Input to prcomp can also be specified as a formula with no response variable. In this case, missing values are handled directly by the na.action argument.

  5. Try to do that (fit the same model using a model formula as input). A possible call is sketched after this list.
  6. Based on these results, what would you expect the correlations between the variables to be? Use the cor() function to see if that is correct.
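
A possible call using the formula interface, assuming we still use the nine predictor columns of biopsy and scaled variables:

# Formula with no response; missing values handled through na.action
fit2 <- prcomp(~ ., data = biopsy[, 2:10], scale. = TRUE, na.action = na.omit)
summary(fit2)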

Principal component regression

This exercise continues, and partly overlaps with, the previous exercise.

Principal component analysis is often used to reduce data dimensionality and to overcome problems with collinearity among observed variables.

Principal component regression uses the principal components identified through principal component analysis as explanatory variables in a regression model instead of the original variables, and it is often seen as a natural “next step” after principal component analysis. Typically, only a subset of the principal components is used in the regression.

We start with a principal component analysis of the nine explanatory variables found in the biopsy dataset from the MASS package. There are missing data in the data frame, so we restrict our attention to the complete cases. We then use the predicted principal components as input to a logistic regression model where the response is the breast tumor type: benign or malignant.

  1. Fit a principal component regression model. [If there are problems with missing observations then look at the complete.cases function to identify rows with full information and just use those rows.] A possible sketch is shown after this list.
  2. Try to formulate an interpretation of this model.
  3. Is the model useful for prediction or discovery or both?
  4. Try fitting a model with only the two most important principal components. How does that change your fit and your interpretations?
  5. How would you decide on the number of principal components to include?
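
If you get stuck, one possible sketch of the workflow is given below; the object names and the choice to keep all nine principal components are assumptions that you may well want to change.

# Keep only rows with complete information
ok <- complete.cases(biopsy)
predictors <- biopsy[ok, 2:10]

# PCA of the predictors, then logistic regression on the scores
pca <- prcomp(predictors, scale. = TRUE)
dd <- data.frame(class = biopsy$class[ok], pca$x)      # outcome + PC scores
pcreg <- glm(class ~ ., data = dd, family = binomial)
summary(pcreg)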

Partial least squares

Partial least squares (PLS) resembles PCA but makes a dimension reduction that optimizes the prediction; it is thus a supervised approach, in contrast to the unsupervised PCA. PLS has the same benefits as PCA/PCR. In this exercise we will fit PLS to two datasets: the biopsy data from the earlier exercises and a dataset on cola preferences. For the cola preferences the outcome is numeric/quantitative, and for the biopsy data the outcome is categorical. We will start with the cola data.

  1. Start by loading the pls package.
  2. Load the cola data by running the command cola <- read.csv("https://biostatistics.dk/teaching/advtopicsA/data/cola.csv"). There should be 34 predictor variables and the outcome pref.

Our purpose is to find out how well we can predict the preference of a consumer based on information about the (mental) associations they have with the product. Furthermore, we wish to find out which items consumers use to discriminate between the preferences.

  3. Fit a PLS model to the data. Consider the fit of the model (visually through a plot). How well do you think the model is performing and why? (A sketch of a possible call is shown after this list.)
  4. How many components would you use to predict the preference?
  5. Look at the loadings plot. Are there any of the predictors that you would find to be particularly influential on the prediction?
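
A sketch of one possible call with the pls package; the 10 components and the leave-one-out cross-validation are arbitrary choices, not requirements.

library("pls")
cola <- read.csv("https://biostatistics.dk/teaching/advtopicsA/data/cola.csv")

# PLS regression of preference on all predictors, with cross-validation
colafit <- plsr(pref ~ ., ncomp = 10, data = cola, validation = "LOO")
summary(colafit)
validationplot(colafit)                            # cross-validated error
plot(colafit, plottype = "loadings", comps = 1:2)  # loadings plot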

Now consider the biopsy dataset from the previous exercise. Since the outcome here is categorical we need to use the plsda() function from the caret package.

  6. Generate a vector of outcomes. [Hint: use the class variable from the biopsy data frame. Make sure that you only keep the elements where we have no missing information, similarly to what we did before.]
  7. Fit a PLSDA model (PLS with subsequent discriminant analysis) using the plsda() function from caret. It needs two inputs, x and y, in the same way that glmnet did. The argument ncomp can be set to determine the maximum number of components. (A sketch of a possible call is shown after this list.)
  8. Call the fitted model res and plot it. Try to interpret what is shown on the \(x\) and \(y\) axes.
  9. Try to run the command confusionMatrix(predict(res), y) in order to get the confusion matrix. How can you use the information shown? And what is the major problem with the results from this analysis?
  10. Fix the “problem” from question 9 by looking at a relevant subset of the data.
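
One possible sketch of the plsda() workflow, reusing the complete-case biopsy predictors from the earlier exercises; the maximum of five components is an arbitrary choice.

library("caret")

# Predictors and outcome restricted to rows with complete information
ok <- complete.cases(biopsy)
x <- biopsy[ok, 2:10]
y <- biopsy$class[ok]

res <- plsda(x, y, ncomp = 5)      # PLS with discriminant analysis
plot(res)
confusionMatrix(predict(res), y)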

Last updated 2021