Exercises

For these exercises we will need the glmnet, gglasso, hal9001, MESS, MASS, caret, mlbench and bootstrap packages, which can be installed using the install.packages() function as shown below.

install.packages(c("glmnet", "gglasso", "MESS", "hal9001"))
install.packages(c("MASS", "caret", "mlbench", "bootstrap"))

The exercises contain two types of problems: introductory text that you should work through to make sure you understand how to use the functions and interpret the output, and a few additional questions for you to explore.

Multiple testing

The multiple testing problem refers to the situation where several tests or statistical inferences are considered simultaneously as a group. Each test has a false positive rate (the type I error rate) which is equal to the threshold set for statistical significance, generally 5%. However, when more than one test is done then the overall type I error rate (i.e., the probability that at least one of the test results is a false positive) is much greater than 5%.

Multiple testing corrections adjust \(p\) values derived from multiple statistical tests in order to control the overall type I error rate and reduce the occurrence of false positives.

The p.adjust() function takes a vector of \(p\)-values as input and returns a vector of \(p\)-values adjusted for multiple comparisons. The method argument sets the correction method, and the values "holm" for Holm’s correction (the default), "bonferroni" for Bonferroni correction, and "fdr" (for false discovery rate) are among the possibilities. The Holm and Bonferroni methods both seek to control the overall false positive rate with Holm’s method being less conservative than the Bonferroni method. The fdr method controls the false discovery rate which is a less stringent criterion than the overall false positive rate.
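As a quick illustration of how the corrections differ, consider the following toy example with four hypothetical \(p\)-values (the numbers are made up for illustration):

``` r
# Four hypothetical (made-up) p-values
p <- c(0.01, 0.02, 0.03, 0.04)

# Bonferroni: multiply each p-value by the number of tests (capped at 1)
p.adjust(p, method = "bonferroni")
#> 0.04 0.08 0.12 0.16

# Holm: step-down version; never more conservative than Bonferroni
p.adjust(p, method = "holm")
#> 0.04 0.06 0.06 0.06

# FDR (Benjamini-Hochberg): controls the false discovery rate; here all
# adjusted values coincide because p[i] * n / i is constant across i
p.adjust(p, method = "fdr")
#> 0.04 0.04 0.04 0.04
```

Note how the FDR-adjusted values can tie even when the raw \(p\)-values are all distinct.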

We illustrate the p.adjust() function with a typical microarray gene expression experiment found as the superroot2 data frame in the MESS package. The expression levels of 21,500 different genes are examined, and for each gene we wish to determine if there is any difference in gene expression between two plant types (mutant and wild type) after correcting for dye color and array. A standard analysis of variance model is used to test the hypothesis of no difference in gene expression between the two types of plants for each of the 21,500 genes. The by() function is combined with anova() and lm() to calculate the 21,500 \(p\)-values.

  1. How many genes would you expect to be deemed "significant" by chance with a standard \(p\)-value cut-off of 0.05?
  2. In the code below, what is going on in the line starting with pval <-?
  3. Run the code to obtain the \(p\)-values.
library("MESS")
data(superroot2)

pval <- by(superroot2,
           superroot2$gene,
           FUN=function(dat) {anova(lm(log(signal) ~ array + color + plant, data=dat))[3,5]} )
  4. List the 8 smallest \(p\)-values. [Hint: use sort()]
  5. List the 8 smallest \(p\)-values after using Holm’s correction.
  6. List the 8 smallest \(p\)-values after using FDR correction.

Your output should show that the smallest \(p\)-value is 0.2254 after correction with the Holm method.

  7. Would you conclude that there are any significantly differentially expressed genes?
  8. Does this conclusion change for any of the methods?
  9. Apply the Bonferroni correction yourself and see how that changes your results.
  10. Why are the smallest \(p\)-values for the FDR correction all identical?
  11. When might these multiple testing methods be inadequate?
  12. Why might it not always be a good idea to focus on \(p\)-values? [Think about prediction and discovery, and the experimental setup.]

Bootstrapping

In this exercise we use a dataset concerning diabetes among Pima Indians. Access the dataset with the following commands:

library(mlbench)
data(PimaIndiansDiabetes2)
library(bootstrap)

We will only be looking at the insulin variable here. Remember that it contains missing observations, which we will remove.

insulin <- PimaIndiansDiabetes2$insulin[!is.na(PimaIndiansDiabetes2$insulin)]
  1. Plot the insulin values to see the distribution. It is clearly not symmetric.
  2. Compute the 90% quantile of the insulin levels. [Hint: use the quantile() function with argument p=.9]
  3. Compute a jackknife estimate of the SE of the 90% quantile, and compute a 95% confidence interval for the 90% quantile. [Hint: you can pass extra arguments to quantile() through jackknife() by supplying an extra argument]
  4. Compute a simple bootstrap-based 95% confidence interval for the 90% quantile. How much does that differ from the jackknife estimate? Why?
  5. Compute a bias-corrected and accelerated (BCa) bootstrap percentile-based 95% confidence interval for the 90% quantile. [Hint: check the bcanon() function.]
  6. Describe the differences between the three confidence intervals computed.
  7. Discuss how you could use jackknifing to identify influential observations/outliers in an analysis.
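If you want to see the mechanics, a jackknife standard error for the 90% quantile can be computed by hand in base R. The sketch below uses simulated right-skewed data as a stand-in for the insulin values, so the numbers themselves are illustrative only:

``` r
set.seed(1)
x <- rexp(200, rate = 1/100)   # simulated skewed data (stand-in for insulin)
n <- length(x)

theta <- function(v) quantile(v, p = 0.9, names = FALSE)

# Leave-one-out estimates of the 90% quantile
loo <- sapply(seq_len(n), function(i) theta(x[-i]))

# Jackknife standard error
jack_se <- sqrt((n - 1) / n * sum((loo - mean(loo))^2))

# Normal-approximation 95% confidence interval
ci <- theta(x) + c(-1, 1) * qnorm(0.975) * jack_se
jack_se
ci
```

The jackknife() function in the bootstrap package wraps essentially this computation; extra arguments such as p=0.9 are passed on to the statistic through its ... argument.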

Amyotrophic Lateral Sclerosis (Lou Gehrig’s disease)

Penalized regression analysis is often used to shrink the parameters of a regression model in order to accommodate more variables and/or provide better predictions. In R we can fit penalized generalized regression models using the function glmnet() from the glmnet package.

glmnet() expects a matrix and a vector as input — the matrix should be a design matrix with a row for each unit and a column for each variable. The vector is the vector of outcomes and should have the same length as the number of rows of the design matrix.
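A minimal sketch of this input format, using simulated data (the design matrix X, outcome y, and dimensions below are invented for illustration):

``` r
library(glmnet)

set.seed(1)
n <- 100; p <- 10
X <- matrix(rnorm(n * p), nrow = n, ncol = p)   # design matrix: one row per unit
y <- X[, 1] - 0.5 * X[, 2] + rnorm(n)           # outcome vector of length n

fit <- glmnet(X, y)   # lasso is the default (alpha = 1)
fit                   # prints df, %Dev and lambda along the penalty path
```

Passing a data frame instead of a matrix will fail, which is why the exercises below ask you to convert with as.matrix().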

In this exercise we shall use data from 1822 individuals with ALS. The goal is to predict the rate of progression dFRS of a functional rating score, using 369 predictors based on measurements (and derivatives of these) obtained from patient visits.

The first variable in the file is testset, a logical variable indicating our division into a training (FALSE) and a test (TRUE) set. The next variable dFRS is the response, and the remaining columns are predictors.

You can read the data directly into R using the command

als <- read.table("http://hastie.su.domains/CASI_files/DATA/ALS.txt", header=TRUE)

You can see a codebook with the variables here.

Prepare data

We start with a small bit of data wrangling to get the data in the right format.

  1. Extract a subset of the data to be used to train the model.

    als_train <- als[als$testset==FALSE, ]

    Confirm that this training dataset has 1197 observations and 371 variables.

  2. Extract a vector dFRS from the als_train data frame to use as outcome. [Hint: make sure that this is a vector and not a data frame.]

  3. Prepare the design matrix for the training dataset. Make sure it is a matrix. [Hint: you can use the as.matrix() function to convert to a matrix.]

  4. You need to do the exact same steps for the test dataset.

Now we are ready to fit the penalized regression model.

Fit the penalized regression model

  1. Fit a lasso model to the ALS data.
  2. What does the output show?
  3. Is it necessary to standardize the input data before running the analysis? [Hint: look at the standardize argument to glmnet()]
  4. Why would it normally make sense to standardize the columns of the predictors? Explain what might happen if we do not and how the penalty will influence the different predictors.
  5. Use cross-validation to obtain a reasonable estimate for the penalty parameter. What number do you obtain?
  6. Extract the relevant non-zero coefficients. How many predictors are selected?
  7. Use the fitted model to predict the outcomes of the test data. Compute the mean squared prediction error (MSPE) for the test data. [Hint: you need to do something similar to the following to make that computation. Below, Y2 is the outcome for the test dataset, X2 holds the predictors for the test dataset, and m1 is the model fitted from the training data.]
    mean((Y2 - predict(m1, newx=X2, s=0.04472))^2)
  8. Compare the coefficients to the coefficients you get from a delassoed analysis. [Hint: you can use the glm() or lm() function to fit a linear model.]
  9. Compare the mean squared prediction error from the lasso model to the MSPE from the delassoed analysis. Which model performs better?
  10. How would these results change if you did not standardize? [Hint: run the analysis and see.]

Part 2

For part 2 of this analysis we continue where we left off.

  1. Run the same analysis using ridge regression and compare to the lasso results. How does the MSPE compare?
  2. Although none of the parameters are set to zero for ridge regression would you still think it would be possible to at least get information about a sparse solution? How? [Hint: this is an ad hoc question/answer so just state a general idea]
  3. Run the same analysis using elastic net and compare to the previous results.

Adaptive lasso

Run the previous lasso analysis using adaptive lasso. How will that change the results? What are the advantages?
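One common way to implement the adaptive lasso with glmnet is a two-stage fit: first obtain initial coefficient estimates (for instance from ridge regression), then use their inverse absolute values as coefficient-specific penalty weights via the penalty.factor argument. The sketch below uses simulated data, and the small epsilon guard against division by zero is our own choice:

``` r
library(glmnet)

set.seed(1)
n <- 100; p <- 10
X <- matrix(rnorm(n * p), n, p)
y <- X[, 1] + rnorm(n)

# Stage 1: initial estimates from cross-validated ridge regression
ridge <- cv.glmnet(X, y, alpha = 0)
b0 <- as.vector(coef(ridge, s = "lambda.min"))[-1]   # drop the intercept

# Stage 2: lasso with weights 1/|b0|; large initial coefficients
# are penalized less, small ones are penalized more
w <- 1 / (abs(b0) + 1e-8)   # epsilon avoids division by zero
alasso <- cv.glmnet(X, y, alpha = 1, penalty.factor = w)
coef(alasso, s = "lambda.min")
```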

Group lasso

  1. The dataset contains several variables that are dummy-encoded. Consider the set of variables beginning with Symptom.. Group those together and rerun the lasso analysis. What do you find? How should the group of variables be interpreted? What are the pros/cons of smaller/larger groups?
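The gglasso package fits the group lasso with a syntax close to glmnet's, except that a vector of consecutive group indices (one per column) must be supplied. A small sketch with simulated data (the grouping below is invented; for the ALS data the groups would collect the Symptom. dummies):

``` r
library(gglasso)

set.seed(1)
n <- 100
X <- matrix(rnorm(n * 6), n, 6)
y <- X[, 1] + X[, 2] + rnorm(n)

# Groups must be consecutive integers, one entry per column of X
grp <- c(1, 1, 2, 2, 3, 3)

fit <- gglasso(X, y, group = grp, loss = "ls")   # least-squares loss
# Coefficients at one value of lambda: members of a group
# enter or leave the model together
fit$beta[, 10]
```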

HAL

  1. Fit a highly adaptive lasso model to the ALS data and see how that might improve the fit and MSPE.
  2. Modify the max_degree, smoothness_orders and num_knots to see if that improves the fit.
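A sketch of a HAL fit with hal9001 on simulated data (the data and tuning values are illustrative; fit_hal() can be slow on the full ALS data, so num_knots is kept small here):

``` r
library(hal9001)

set.seed(1)
n <- 100; p <- 3
X <- matrix(rnorm(n * p), n, p)
y <- sin(X[, 1]) + rnorm(n, sd = 0.3)

hal_fit <- fit_hal(X = X, Y = y,
                   max_degree = 2,          # interactions up to 2-way
                   smoothness_orders = 1,   # piecewise-linear basis
                   num_knots = 5)           # few knots to keep it fast

pred <- predict(hal_fit, new_data = X)
mean((y - pred)^2)   # in-sample MSE
```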

Parkinson’s disease

The data used in this study were gathered from 188 patients with Parkinson’s disease (107 men and 81 women) with ages ranging from 33 to 87 (65.1 years \(\pm\) 10.9). The control group consists of 64 healthy individuals (23 men and 41 women) with ages varying between 41 and 82 (61.1 \(\pm\) 8.9).

Various speech signal processing algorithms including Time Frequency Features, Mel Frequency Cepstral Coefficients (MFCCs), Wavelet Transform based Features, Vocal Fold Features and TWQT features have been applied to the speech recordings of Parkinson’s Disease (PD) patients to extract clinically useful information for PD assessment.

During the data collection process, the microphone used for recording speech was set to 44.1 kHz, and following the physician’s examination, the sustained phonation of the vowel /a/ was collected from each subject with three repetitions.

The class variable contains information on PD status (0 = control), and we will only be using the first repetition for each individual.

  1. Read in the data using

    pd <- read.csv("http://www.biostatistics.dk/pd_speech_features.csv", header=TRUE, skip=1)

    and use only the first repetition

    PD <- pd[seq(1, nrow(pd), 3),]
  2. Do the data wrangling

  3. Analyze the data similarly to the above. Remember to use family=binomial since this is a classification problem (and hence base the model on an underlying logistic regression model). When possible, use the underlying probability rather than the accuracy.
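For the logistic-regression case the glmnet call changes only slightly. A sketch with simulated data (the predictors and effect sizes are invented):

``` r
library(glmnet)

set.seed(1)
n <- 200; p <- 20
X <- matrix(rnorm(n * p), n, p)
pr <- plogis(X[, 1] - X[, 2])          # true success probabilities
y <- rbinom(n, size = 1, prob = pr)

# Cross-validated lasso for a binary outcome; the deviance criterion
# uses the fitted probabilities rather than hard 0/1 classifications
cvfit <- cv.glmnet(X, y, family = "binomial", type.measure = "deviance")

# Predicted probabilities (type = "response"), not class labels
phat <- predict(cvfit, newx = X, s = "lambda.min", type = "response")
head(phat)
```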


2025