**Topics for module**

- Big-p small-n problems
- Multiple testing techniques (inference correction, false discovery rates, q-values)
- The correlation vs. causation and prediction vs. hypothesis distinctions
- Generalized linear models refresher
- Penalized regression
- Principal component analysis and principal component regression

**To read:**

- Slides: Multiple comparisons, penalized regression, and PCA
- R code from the exercises

All of these exercises use R. We will need the `MASS`, `MESS`, and `glmnet` packages, which can all be installed using the `install.packages()` function:

`install.packages(c("MASS", "MESS", "glmnet"))`

The exercises are built up from two types of problems: introductory text that you should run through to make sure you understand how to use the functions and interpret the output, and a few additional questions for you to explore.

The multiple testing problem refers to the situation where several tests or statistical inferences are considered simultaneously as a group. Each test has a false positive rate (the type I error rate) which is equal to the threshold set for statistical significance, generally 5%. However, when more than one test is done then the overall type I error rate (i.e., the probability that at least one of the test results is a false positive) is much greater than 5%.
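For independent tests this inflation can be computed directly: the probability of at least one false positive among \(m\) tests at level 0.05 is \(1 - (1 - 0.05)^m\). A quick illustration (the numbers of tests are arbitrary):

```r
alpha <- 0.05
m <- c(1, 10, 100)             # number of simultaneous tests
overall <- 1 - (1 - alpha)^m   # P(at least one false positive)
round(overall, 3)              # 0.050 0.401 0.994
```

So with 100 independent tests a false positive is virtually guaranteed.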

Multiple testing corrections adjust \(p\) values derived from multiple statistical tests in order to control the overall type I error rate and reduce the occurrence of false positives.

The `p.adjust()` function takes a vector of \(p\)-values as input and returns a vector of \(p\)-values adjusted for multiple comparisons. The `method` argument sets the correction method, and the values `"holm"` for Holm’s correction (the default), `"bonferroni"` for Bonferroni correction, and `"fdr"` (for false discovery rate) are among the possibilities. The Holm and Bonferroni methods both seek to control the overall false positive rate, with Holm’s method being less conservative than the Bonferroni method. The `"fdr"` method controls the false discovery rate, which is a less stringent criterion than the overall false positive rate.
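As a small illustration of how the three methods differ, here is `p.adjust()` applied to five made-up \(p\)-values:

```r
p <- c(0.001, 0.008, 0.039, 0.041, 0.27)   # made-up p-values
p.adjust(p, method = "holm")               # 0.005 0.032 0.117 0.117 0.270
p.adjust(p, method = "bonferroni")         # 0.005 0.040 0.195 0.205 1.000
p.adjust(p, method = "fdr")                # 0.005 0.020 0.05125 0.05125 0.270
```

Note how the Bonferroni adjustment is always at least as large as Holm’s, and how the FDR adjustment can produce ties.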

We illustrate the `p.adjust()` function with a typical microarray gene expression experiment found as the `superroot2` data frame in the `MESS` package. The expression levels of 21,500 different genes are examined, and for each gene we wish to determine if there is any difference in gene expression between two plant types (mutant and wild type) after correcting for dye color and array. A standard analysis of variance model is used to test the hypothesis of no difference in gene expression between the two types of plants for each of the 21,500 genes. The `by()` function is combined with `anova()` and `lm()` to calculate the 21,500 \(p\)-values.

- How many genes would you expect to be deemed ``significant’’ by chance with a standard \(p\)-value cut-off of 0.05?
- In the code below, what is going on in the line starting with `pval <-`?
- Run the code to obtain the \(p\)-values.

```
library("MESS")
data(superroot2)
pval <- by(superroot2, superroot2$gene,
           FUN = function(dat) {
             anova(lm(log(signal) ~ array + color + plant, data = dat))[3, 5]
           })
```

- List the 8 smallest \(p\)-values. [Hint: use `sort()`]
- List the 8 smallest \(p\)-values after using Holm’s correction.
- List the 8 smallest \(p\)-values after using FDR correction.

Your output should show that the smallest \(p\)-value is 0.2254 after correction with the Holm method.

- Would you conclude that there are any significantly differentially expressed genes?
- Does this conclusion change for any of the methods?
- Do the Bonferroni correction yourself and see how that changes your results.
- Why are the smallest \(p\)-values for the FDR correction all identical?
- When might these multiple testing methods be inadequate?
- Why might it not always be a good idea to focus on \(p\)-values (think about prediction and discovery, and the experimental setup)?
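For the question on doing the Bonferroni correction yourself: it amounts to multiplying each \(p\)-value by the number of tests and capping the result at 1, which you can check against `p.adjust()` (shown here on made-up values rather than the `superroot2` results):

```r
p <- c(0.001, 0.008, 0.039, 0.27)
p_bonf <- pmin(1, length(p) * p)                       # Bonferroni by hand
all.equal(p_bonf, p.adjust(p, method = "bonferroni"))  # TRUE
```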

Penalized regression analysis is often used to shrink the parameters of a regression model in order to accommodate more variables and/or provide better predictions. In R we can fit penalized generalized linear models using the `glmnet()` function from the `glmnet` package.

`glmnet()` expects a predictor matrix and an outcome vector as input. The matrix has a row for each unit and a column for each variable. The vector contains the outcomes and should have the same length as the number of rows of the design matrix.
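A minimal sketch of the expected input format, using simulated data (all dimensions and effect sizes here are made up for illustration):

```r
library(glmnet)
set.seed(1)
x <- matrix(rnorm(100 * 10), nrow = 100)   # 100 units, 10 variables
y <- x[, 1] - 0.5 * x[, 2] + rnorm(100)    # outcome vector of length 100
fit <- glmnet(x, y)                        # alpha = 1 (the lasso) is the default
plot(fit)                                  # coefficient paths along the penalty
```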

In this exercise we shall use some sample data from a GWAS study. You can load the data directly from R using the command

`load(url("https://www.biostatistics.dk/teaching/advtopicsA/data/lassodata.rda"))`

Now you should have two objects in your workspace: `genotype`, which is a matrix of genotypes for 2000 individuals, and `phenotypes` (the outcome for the 2000 individuals).

- Fit a lasso model to the genotype/phenotype data.
- Is it necessary to standardize the input data before running the analysis? [Hint: look at the `standardize` argument]
- Why would it normally make sense to standardize the columns of the predictors? Explain what might happen if we do not, and how the penalty will influence the different predictors.
- Use cross-validation to obtain a reasonable estimate for the penalty parameter.
- Use the estimated penalty parameter and extract the corresponding list of coefficients. How many predictors are selected?
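For the cross-validation step, `cv.glmnet()` is the standard tool; a sketch on simulated data (the GWAS data would be used in the same way):

```r
library(glmnet)
set.seed(1)
x <- matrix(rnorm(200 * 50), nrow = 200)
y <- x[, 1] + rnorm(200)
cvfit <- cv.glmnet(x, y)        # 10-fold cross-validation by default
cvfit$lambda.min                # penalty minimizing the CV error
cvfit$lambda.1se                # larger penalty within 1 SE (sparser model)
b <- coef(cvfit, s = "lambda.min")
sum(b != 0) - 1                 # number of selected predictors (excluding intercept)
```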

Above we just considered the outcome continuous (even though it is a series of 0s and 1s). A better model would be a binomial model such as logistic regression. To analyze a dichotomous outcome such as case/control status we use `family = "binomial"`.

- Try to do that and compare the results. What should/shouldn’t you be looking for here?
- Run the same analysis using ridge regression and compare to the lasso results.
- Although none of the parameters are set to zero for ridge regression, do you think it would still be possible to at least get information about a sparse solution? How? [Hint: this is an *ad hoc* question/answer, so just state a general idea]
- Run the same analysis using elastic net and compare to the previous results.

Run the previous lasso analysis using adaptive lasso. How will that change the results? What are the advantages?
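`glmnet` has no built-in adaptive lasso, but it can be pieced together with the `penalty.factor` argument: first obtain preliminary coefficient estimates (e.g. from a ridge fit), then weight each variable's lasso penalty by the inverse magnitude of its preliminary estimate. A sketch on simulated data:

```r
library(glmnet)
set.seed(1)
x <- matrix(rnorm(200 * 20), nrow = 200)
y <- x[, 1] - x[, 2] + rnorm(200)
# Step 1: preliminary estimates from a ridge fit
ridge <- cv.glmnet(x, y, alpha = 0)
beta0 <- as.numeric(coef(ridge, s = "lambda.min"))[-1]  # drop the intercept
# Step 2: lasso with penalties inversely proportional to |beta0|
alasso <- cv.glmnet(x, y, alpha = 1,
                    penalty.factor = 1 / (abs(beta0) + 1e-6))
coef(alasso, s = "lambda.min")
```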

- Fit a highly adaptive lasso model to the data and see how that might improve the fit and MSPE.
- Modify `max_degree`, `smoothness_orders`, and `num_knots` to see if that improves the fit.
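Assuming the exercise refers to the `fit_hal()` function from the `hal9001` package (where `max_degree`, `smoothness_orders`, and `num_knots` are arguments), a minimal sketch on simulated data might look like:

```r
library(hal9001)
set.seed(1)
x <- matrix(rnorm(100 * 3), nrow = 100)
y <- x[, 1]^2 + rnorm(100)
hal_fit <- fit_hal(X = x, Y = y,
                   max_degree = 2,         # allow two-way interactions
                   smoothness_orders = 1,  # piecewise linear basis
                   num_knots = 5)          # knots per covariate
mean((predict(hal_fit, new_data = x) - y)^2)   # in-sample MSPE
```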

Principal component analysis is a useful technique when you have obtained measurements on a number of (possibly correlated) observed variables and you wish to reduce them to a smaller number of artificial variables (called principal components) that account for most of the variance in the observed variables. The principal components may then be used as predictors in subsequent analyses.

The `prcomp()` function performs principal component analysis on a given data matrix. By default, R centers the variables in the data matrix but does not scale them to have unit variance. In order to scale the variables we should set the logical option `scale. = TRUE`.

In the following code we will make a principal component analysis of the nine explanatory variables found in the `biopsy` dataset from the `MASS` package. There are missing data present in the data frame, so we restrict our attention to the complete cases.

```
library(MASS)
data(biopsy)
names(biopsy)
predictors <- biopsy[complete.cases(biopsy),2:10]
fit <- prcomp(predictors, scale. = TRUE)
summary(fit)
```

- What is the interpretation of the results that you have obtained?
- Try using `plot(fit)` and `biplot(fit)` and explain what you see. Make sure you produce these two plots and understand what they show; the biplot in particular is important.
- Which variable(s) provide(s) any real contribution to the second principal component?
- Look at the loadings for the result. Do you see any patterns in the composition of the principal components?
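The loadings are stored in the `rotation` component of the `prcomp` object, so they can be inspected directly:

```r
library(MASS)
data(biopsy)
predictors <- biopsy[complete.cases(biopsy), 2:10]
fit <- prcomp(predictors, scale. = TRUE)
round(fit$rotation[, 1:2], 2)   # loadings for the first two principal components
```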

Input to `prcomp()` can also be specified as a formula with no response variable. In this case, missing values are handled directly by the `na.action` argument.

- Try to do that (fit the same model using a model formula as input).
- Based on these results, what would you expect the correlations between the variables to be? Use the `cor()` function to see if you are correct.
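One possible way to use the formula interface (a sketch; note that the formula has an empty left-hand side):

```r
library(MASS)
data(biopsy)
fit2 <- prcomp(~ ., data = biopsy[, 2:10],
               na.action = na.omit,   # drop incomplete rows directly
               scale. = TRUE)
summary(fit2)
```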

This exercise continues/overlaps the exercise from above.

Principal component analysis is often used to reduce data dimensionality and to overcome problems with collinearity among observed variables.

Principal component regression uses the principal components identified through principal component analysis as explanatory variables in a regression model instead of the original variables, and principal component regression is often seen as a natural ``next step’’ to principal component analysis. Typically, only a subset of the principal components are used in the regression.

In the following code we start with a principal component analysis of the nine explanatory variables found in the `biopsy` dataset from the `MASS` package. There are missing data present in the data frame, so we restrict our attention to the available complete cases. We then use the predicted principal components as input to a logistic regression model where the response is the breast tumor type: `benign` or `malignant`.

- Fit a principal component regression model. [If there are problems with missing observations, then look at the `complete.cases()` function to identify rows with full information and just use those rows.]
- Try to formulate an interpretation of this model.
- Is the model useful for prediction or discovery or both?

- Try fitting a model with only the two most important principal components. How does that change your fit and your interpretations?
- How would you decide on the number of principal components to include?
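A minimal sketch of one way to set this up (using the first two principal components for illustration; other choices of components only change the formula):

```r
library(MASS)
data(biopsy)
complete <- biopsy[complete.cases(biopsy), ]
pca <- prcomp(complete[, 2:10], scale. = TRUE)
scores <- as.data.frame(pca$x)    # predicted principal components
scores$class <- complete$class    # benign / malignant outcome
pcr_fit <- glm(class ~ PC1 + PC2, data = scores, family = binomial)
summary(pcr_fit)
```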

Claus Ekstrøm 2024