For these exercises we will need the glmnet, gglasso, hal9001, stabs, and MESS packages, which can be installed using the install.packages() function as shown below:
install.packages(c("glmnet", "gglasso", "MESS", "hal9001", "stabs"))
Penalized regression analysis is often used to shrink the parameters of a regression model in order to accommodate more variables and/or provide better predictions. In R we can fit penalized generalized regression models using the function glmnet() from the glmnet package. glmnet() expects a matrix and a vector as input: the matrix should be a design matrix with a row for each unit and a column for each variable, and the vector holds the outcomes and should have the same length as the number of rows of the design matrix.
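As a minimal sketch of this input format (using simulated data, and assuming the glmnet package has been installed as described above):

```r
library(glmnet)

set.seed(1)
X <- matrix(rnorm(100 * 10), nrow = 100, ncol = 10)  # design matrix: 100 units, 10 variables
y <- rnorm(100)                                      # outcome vector, length equal to nrow(X)

fit <- glmnet(X, y)  # the lasso penalty (alpha = 1) is the default
```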
In this exercise we shall use data from 1822 individuals with ALS. The goal is to predict dFRS, the rate of progression of a functional rating score, using 369 predictors based on measurements (and derivatives of these) obtained from patient visits.
The first variable in the file is testset, a logical variable indicating our division into a training (FALSE) and a test (TRUE) set. The next variable dFRS is the response, and the remaining columns are predictors.
You can read the data directly into R using the command
als <- read.table("https://hastie.su.domains/CASI_files/DATA/ALS.txt", header=TRUE)
You can see a codebook with the variables here.
We start with a small bit of data wrangling to get the data into the right format.
Extract a subset of the data to be used to train the model.
als_train <- als[als$testset==FALSE, ]
Confirm that this training dataset has 1197 observations and 371 variables.
Extract a vector dFRS from the als_train data frame to use as the outcome. [Hint: make sure that this is a vector and not a data frame]
Prepare the design matrix for the training dataset. Make sure it is a matrix. [Hint: you can use the as.matrix() function to convert to a matrix]
You need to do the exact same steps for the test dataset.
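Put together, these wrangling steps might look as follows (a sketch assuming, as described above, that testset and dFRS are the first two columns of als):

```r
# split into training and test data
als_train <- als[als$testset == FALSE, ]
als_test  <- als[als$testset == TRUE, ]

# outcome vectors ($ extracts a vector, not a data frame)
Y  <- als_train$dFRS
Y2 <- als_test$dFRS

# design matrices: drop testset and dFRS (columns 1 and 2) and convert
X  <- as.matrix(als_train[, -(1:2)])
X2 <- as.matrix(als_test[, -(1:2)])
```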
Now we are ready to fit the penalized regression model.
Fit a lasso model to the ALS data.
What does the output show?
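A sketch of the fit, assuming the training design matrix and outcome vector are stored in X and Y:

```r
library(glmnet)

m1 <- glmnet(X, Y)  # alpha = 1 (lasso) and standardize = TRUE are the defaults
print(m1)           # one row per penalty value: number of non-zero coefficients (Df),
                    # percent deviance explained (%Dev) and the penalty (Lambda)
plot(m1)            # coefficient paths as the penalty varies
```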
Is it necessary to standardize the input data before running the analysis? [Hint: look at the standardize argument to glmnet()]
Why would it normally make sense to standardize the columns of the predictors? Explain what might happen if we do not and how the penalty will influence the different predictors.
Use cross-validation to obtain a reasonable estimate for the penalty parameter. What number do you obtain?
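One way to do this is with cv.glmnet(), which runs 10-fold cross-validation by default. The folds are drawn at random, so set a seed if you want a reproducible value; cv1 is just an assumed name for the fit:

```r
set.seed(123)          # the CV folds are random
cv1 <- cv.glmnet(X, Y)
cv1$lambda.min         # penalty that minimizes the cross-validated error
cv1$lambda.1se         # largest penalty within one SE of the minimum
plot(cv1)
```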
Extract the relevant non-zero coefficients. How many predictors are selected?
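The coefficients at a chosen penalty can be pulled out with coef(). A sketch, assuming cv1 is the cross-validation fit from the previous step (s = "lambda.min" works the same way):

```r
b <- coef(cv1, s = "lambda.1se")   # sparse column of coefficients
rownames(b)[which(b[, 1] != 0)]    # names of the non-zero terms (incl. intercept)
sum(b[-1, 1] != 0)                 # number of selected predictors
```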
Use the fitted model to predict the outcomes of the test data.
Compute the mean squared prediction error for the test data. [Hint: you need to do something similar to the following to make that computation. Below Y2 is the outcome for the test dataset, X2 contains the predictors for the test dataset, and m1 is the model fitted from the training data.]
mean((Y2 - predict(m1, newx=X2, s=0.04472))^2)
Compare the coefficients to the coefficients you get from a delassoed analysis. [Hint: you can use the glm() or lm() function to fit a linear model.]
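A sketch of the delassoed analysis: refit an unpenalized linear model using only the predictors selected by the lasso. Here cv1, X and Y are assumed to be the cross-validation fit, design matrix and outcome from before:

```r
b   <- coef(cv1, s = "lambda.1se")
sel <- rownames(b)[which(b[, 1] != 0)][-1]  # selected predictors, intercept dropped

m_ols <- lm(Y ~ X[, sel])  # ordinary least squares on the selected columns
coef(m_ols)                # compare to the shrunken lasso coefficients
```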
Compare the mean squared prediction error from the lasso model to the MSPE from the delassoed analysis. Which model performs better?
How would these results change if you did not standardize? [Hint: Run the analysis and see]
For part 2 of this analysis we continue where we left off.
Run the previous lasso analysis using adaptive lasso. How will that change the results? What are the advantages?
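A sketch of one common adaptive lasso recipe: obtain preliminary coefficient estimates from an initial ridge fit and use their inverse absolute values as penalty weights (the choice of initial estimator and weight power are tuning decisions, not fixed by the method). X and Y are assumed to be the training design matrix and outcome:

```r
# step 1: initial ridge fit for preliminary estimates
cv_ridge <- cv.glmnet(X, Y, alpha = 0)
b0 <- as.numeric(coef(cv_ridge, s = "lambda.min"))[-1]  # drop the intercept

# step 2: lasso with coefficient-specific weights 1/|b0|
w <- 1 / abs(b0)  # small preliminary estimate => heavy penalty
cv_ada <- cv.glmnet(X, Y, alpha = 1, penalty.factor = w)
coef(cv_ada, s = "lambda.1se")
```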
Several of the predictors are naturally related, for example the variables whose names start with the Symptom. prefix. Group those together and rerun the lasso analysis. What do you find? How should the group of variables be interpreted? What are the pros/cons of
smaller/larger groups?
Fit a highly adaptive lasso model to the training data using the hal9001 package. Try changing the arguments max_degree, smoothness_orders and num_knots to see if that improves the fit.

The data used in this study were gathered from 188 patients with Parkinson’s disease (107 men and 81 women) with ages ranging from 33 to 87 (65.1 years \(\pm\) 10.9). The control group consists of 64 healthy individuals (23 men and 41 women) with ages varying between 41 and 82 (61.1 \(\pm\) 8.9).
Various speech signal processing algorithms including Time Frequency Features, Mel Frequency Cepstral Coefficients (MFCCs), Wavelet Transform based Features, Vocal Fold Features and TWQT features have been applied to the speech recordings of Parkinson’s Disease (PD) patients to extract clinically useful information for PD assessment.
During the data collection process, the microphone for recording speech is set to 44.1 KHz and following the physician’s examination, the sustained phonation of the vowel /a/ was collected from each subject with three repetitions.
The class variable contains information on PD status (0 = control) and we will only be using the first repetition for each individual.
Read in the data using
pd <- read.csv("http://www.biostatistics.dk/pd_speech_features.csv", header=TRUE, skip=1)
and use only the first repetition
PD <- pd[seq(1, nrow(pd), 3),]
Do the data wrangling needed to prepare the outcome vector and the design matrix, as above.
Analyze the data similarly to above. Remember to use family=binomial since we are doing classification in this case (and hence base the model on an underlying logistic regression model). When possible use the underlying probability rather than accuracy.
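A sketch of the classification analysis, assuming the wrangled PD design matrix and 0/1 class outcome are stored in Xpd and ypd:

```r
set.seed(321)
cv_pd <- cv.glmnet(Xpd, ypd, family = "binomial")

# predicted probabilities of PD rather than hard 0/1 classifications
phat <- as.numeric(predict(cv_pd, newx = Xpd, s = "lambda.1se", type = "response"))

mean((phat - ypd)^2)  # Brier score: a probability-based alternative to accuracy
```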
Claus Thorn Ekstrøm and Mikkel Meyer Andersen 2024