All of these exercises use the R program package. We will be needing the `glmnet`

, `MASS`

, `mlbench`

, `boot`

, `bootstrap`

, and `selectiveInference`

packages, which can be installed using the `install.packages()`

function as shown below

```
install.packages(c("glmnet", "mlbench", "bootstrap",
"boot", "selectiveInference"))
```

The exercises are constructed so they contain two types of problems: introductory text that you should run through to make sure you understand how to use the functions and interpret the output, and a few additional questions for you to explore.

Use the `biopsy`

data for this analysis. In particular we want to look at the prediction based on the principal component analysis (we will get back to principal components on Friday). For the first part of the analysis we will keep the estimated principal component fixed.

```
library(MASS)
data(biopsy)
predictors <- biopsy[complete.cases(biopsy),2:10]
fit <- prcomp(predictors, scale=TRUE)
outcome <- biopsy$class[complete.cases(biopsy)]
DF <- data.frame(outcome, PC1=fit$x[,1], PC2=fit$x[,2], PC3=fit$x[,3], PC4=fit$x[,4])
res <- glm(outcome ~ PC1 + PC2 + PC3 + PC4, data=DF, family=binomial)
```

- Estimate the cross-validation error rate

In this case the outcome is really binary but the `cv.glm`

function uses the squared error as a default cost function. A 0-1 cost function that works for binary data is

`cost <- function(r, pi = 0) mean(abs(r-pi) > 0.5)`

- Rerun the cross-validation error estimation but use the binary cost function defined above. [Hint: the function can be passed as a argument to
`cv.glm`

] - How much does the two error rates differ? How does their interpretation differ?
- Now rerun the analysis again but use 10-fold cross-validation. Does that change your error rate in this case?

Penalized regression analysis is often used to shrink the parameters of a regression model in order to accommodate more variables and/or provide better predictions. In R we can fit penalized generalized regression model using the function `glmnet`

from the `glmnet`

package.

`glmnet`

expects a matrix and a vector as input — the has matrix has a row for each unit and a column for each variable. The vector is the vector of outcomes and should have the same length as the number of rows of the design matrix.

In this exercise we shall use some sample data from a GWAS study. You can load the data directly from R using the command

`load(url("https://www.biostatistics.dk/teaching/advtopicsA/data/lassodata.rda"))`

Now you should have two objects in your workspace: `genotype`

which is a matrix of genotypes for 2000 individuals and `phenotype`

(the outcome for the 2000 individuals).

- Fit a lasso model to the genotype/phenotype data.
- Is it necessary to standardize the input data before running the analysis? [Hint: look at the
`standardize`

argument] - Why would it normally make sense to standardize the columns of the predictors? Explain what might happen if we do not and how the penalty will influence the different predictors.
- Use cross-validation to obtain a reasonable estimate for the penalty parameter.
- Use the estimated penalty parameter and extract the corresponding list of coefficients. How many predictors are selected?

Above we just considered the outcome continuous (even though it is a series of 0s and 1s). A better model would be to use a binomial model like logistic regression. To analyze a dichotomous outcome such as case/control status we use `family=”binomial”`

.

- Try to do that and compare the results. What should/shouldn’t you be looking for here?
- Run the same analysis using ridge regression and compare to the lasso results.
- Although none of the parameters are set to zero for ridge regression would you still think it would be possible to at least get information about a sparse solution? How? [Hint: this is an
*ad hoc*question/answer so just state a general idea] - Run the same analysis using elastic net and compare to the previous results.
- Try to delasso the results and compare a few of the predictions. [Hint: use
`predict()`

] - Try to see if post-selection inference works in this case (use the
`selectiveInference`

package in the`fixedLassoInf()`

function. See the last slide from the lectuers).

Last updated 2021