Exercises for day 2
All of these exercises use R. We will need the glmnet, gglasso, MESS, and hal9001 packages, which can be installed using the install.packages() function as shown below:

install.packages(c("glmnet", "gglasso", "MESS", "hal9001"))
Penalized regression
Penalized regression analysis is often used to shrink the parameters of a regression model in order to accommodate more variables and/or provide better predictions. In R we can fit penalized generalized linear models using the glmnet() function from the glmnet package. glmnet() expects a matrix and a vector as input: the matrix has a row for each unit and a column for each variable, and the vector contains the outcomes and should have the same length as the number of rows of the design matrix.
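As a minimal sketch of this input format (with simulated x and y standing in for real data; the dimensions are arbitrary):

```r
library(glmnet)

set.seed(1)
x <- matrix(rnorm(100 * 20), nrow = 100, ncol = 20)  # 100 units, 20 variables
y <- rnorm(100)                                      # outcome; length equals nrow(x)

fit <- glmnet(x, y)  # gaussian lasso fit by default
```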
In this exercise we shall use some sample data from a GWAS study. You
can load the data directly from R using the command
load(url("http://www.biostatistics.dk/teaching/advtopicsA/data/lassodata.rda"))
Now you should have two objects in your workspace: genotype, a matrix of genotypes for 2000 individuals, and phenotypes, the outcome for the same 2000 individuals.
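To check that the data loaded as expected (the object names are taken from the description above):

```r
load(url("http://www.biostatistics.dk/teaching/advtopicsA/data/lassodata.rda"))

ls()            # should list the loaded objects
dim(genotype)   # rows = individuals, columns = genetic markers
length(phenotypes)
```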
- Fit a lasso model to the genotype/phenotype data.
- Is it necessary to standardize the input data before running the analysis? [Hint: look at the standardize argument]
- Why would it normally make sense to standardize the columns of the
predictors? Explain what might happen if we do not and how the penalty
will influence the different predictors.
- Pick a penalty and extract the relevant non-zero coefficients. How
many predictors are selected? [Base the choice of penalty on a graph of
the coefficients]
- Compare the coefficients to the coefficients you get from a
delassoed analysis.
- Compare the mean squared prediction error from the lasso model to
the MSPE from the delassoed analysis.
- How would these results change if you did not standardize?
[Run the analysis]
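The steps above can be sketched as follows. The penalty value lambda0 here is purely hypothetical; in the exercise it should be chosen from the coefficient-path plot.

```r
library(glmnet)

# Assumes the objects `genotype` and `phenotypes` loaded above
fit <- glmnet(genotype, phenotypes, standardize = TRUE)
plot(fit, xvar = "lambda")            # coefficient paths, to guide the choice of penalty

lambda0 <- 0.05                       # hypothetical penalty picked from the plot
beta <- coef(fit, s = lambda0)
sel <- which(beta[-1] != 0)           # selected predictors (drop the intercept)
length(sel)                           # number of predictors selected

# Delassoed analysis: refit ordinary least squares on the selected predictors only
refit <- lm(phenotypes ~ genotype[, sel])

# In-sample mean squared prediction error for both fits
mspe_lasso <- mean((phenotypes - predict(fit, newx = genotype, s = lambda0))^2)
mspe_refit <- mean(residuals(refit)^2)
```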
For part 2 of this analysis we continue where we left off.
- Use cross-validation to obtain a reasonable estimate for the penalty
parameter.
- Use the estimated penalty parameter and extract the corresponding
list of coefficients. How many predictors are selected?
- Compare the MSPE to the results obtained in the previous questions.
- Run the same analysis using ridge regression and compare to the
lasso results.
- Although none of the parameters are set to zero for ridge regression
would you still think it would be possible to at least get information
about a sparse solution? How? [Hint: this is an ad hoc
question/answer so just state a general idea]
- Run the same analysis using elastic net and compare to the previous
results.
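One way to sketch these steps (the alpha value for the elastic net is an arbitrary illustration; any value strictly between 0 and 1 mixes the two penalties):

```r
library(glmnet)

set.seed(1)  # cross-validation assigns folds at random
cvfit <- cv.glmnet(genotype, phenotypes)   # lasso: alpha = 1 is the default
cvfit$lambda.min                           # penalty minimizing the CV error
coef(cvfit, s = "lambda.min")              # coefficients at that penalty

# Ridge regression uses alpha = 0; elastic net uses 0 < alpha < 1
cvridge <- cv.glmnet(genotype, phenotypes, alpha = 0)
cvenet  <- cv.glmnet(genotype, phenotypes, alpha = 0.5)
```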
Above we just considered the outcome continuous (even though it is a series of 0s and 1s). A better model would be to use a binomial model like logistic regression. To analyze a dichotomous outcome such as case/control status we use family = "binomial".
- Try to do that and compare the results. What should/shouldn’t you be
looking for here?
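A minimal sketch, assuming phenotypes is coded 0/1 as described above:

```r
library(glmnet)

set.seed(1)
cvbin <- cv.glmnet(genotype, phenotypes, family = "binomial")
coef(cvbin, s = "lambda.min")   # selected coefficients on the log-odds scale
```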
Adaptive lasso
Run the previous lasso analysis using adaptive lasso. How will that
change the results? What are the advantages?
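One common way to set up the adaptive lasso with glmnet is to derive weights from an initial ridge fit and pass them through the penalty.factor argument; this is a sketch of that approach, not the only possible choice of initial estimator.

```r
library(glmnet)

set.seed(1)
# Initial ridge fit to obtain per-coefficient weights
cvridge <- cv.glmnet(genotype, phenotypes, alpha = 0)
b0 <- as.numeric(coef(cvridge, s = "lambda.min"))[-1]  # drop the intercept
w <- 1 / abs(b0)                                       # adaptive weights

# penalty.factor rescales the lasso penalty separately for each coefficient
cvada <- cv.glmnet(genotype, phenotypes, penalty.factor = w)
coef(cvada, s = "lambda.min")
```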
Group lasso
- Set up a group variable that contains chunks of 10 genes.
- Run the analysis using the group lasso [Hint: use the gglasso() function]
- When might a group lasso be relevant? Can it fix some of the
problems with the traditional lasso?
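A sketch of the group setup, assuming the columns of genotype are to be grouped in consecutive chunks of 10:

```r
library(gglasso)

# Group label for each column: columns 1-10 get group 1, 11-20 get group 2, ...
grp <- ceiling(seq_len(ncol(genotype)) / 10)

# Least-squares loss for the continuous treatment of the outcome
gfit <- gglasso(genotype, phenotypes, group = grp, loss = "ls")
plot(gfit)   # group-wise coefficient paths
```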
HAL
- Fit a highly adaptive lasso model to the data above and see how that
might improve the fit and MSPE
- Modify the max_degree, smoothness_orders, and num_knots arguments to see if that improves the fit.
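A sketch using fit_hal() from hal9001; the tuning values below are arbitrary starting points for the arguments named above, and fitting HAL to a large genotype matrix can be slow.

```r
library(hal9001)

set.seed(1)
# Fit the highly adaptive lasso; arguments shown are illustrative values only
hfit <- fit_hal(X = genotype, Y = phenotypes,
                max_degree = 2, smoothness_orders = 1, num_knots = 5)

pred <- predict(hfit, new_data = genotype)
mean((phenotypes - pred)^2)   # in-sample MSPE for comparison with earlier fits
```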
Last updated 2023