All of these exercises use the R program package. We will be needing
the glmnet
, MASS
, mlbench
,
boot
, bootstrap
, and
selectiveInference
packages, which can be installed using
the install.packages()
function as shown below
install.packages(c("glmnet", "mlbench", "bootstrap",
"boot", "selectiveInference"))
The exercises are constructed so they contain two types of problems: introductory text that you should run through to make sure you understand how to use the functions and interpret the output, and a few additional questions for you to explore.
Use the biopsy
data for this analysis. In particular we
want to look at the prediction based on the principal component analysis
(we will get back to principal components on Friday). For the first part
of the analysis we will keep the estimated principal component
fixed.
library(MASS)
data(biopsy)
predictors <- biopsy[complete.cases(biopsy),2:10]
fit <- prcomp(predictors, scale=TRUE)
outcome <- biopsy$class[complete.cases(biopsy)]
DF <- data.frame(outcome, PC1=fit$x[,1], PC2=fit$x[,2], PC3=fit$x[,3], PC4=fit$x[,4])
res <- glm(outcome ~ PC1 + PC2 + PC3 + PC4, data=DF, family=binomial)
In this case the outcome is really binary but the cv.glm
function uses the squared error as a default cost function. A 0-1 cost
function that works for binary data is
cost <- function(r, pi = 0) mean(abs(r-pi) > 0.5)
cv.glm
]Penalized regression analysis is often used to shrink the parameters
of a regression model in order to accommodate more variables and/or
provide better predictions. In R we can fit penalized generalized
regression model using the function glmnet
from the
glmnet
package.
glmnet
expects a matrix and a vector as input — the has
matrix has a row for each unit and a column for each variable. The
vector is the vector of outcomes and should have the same length as the
number of rows of the design matrix.
In this exercise we shall use some sample data from a GWAS study. You can load the data directly from R using the command
load(url("https://www.biostatistics.dk/teaching/advtopicsA/data/lassodata.rda"))
Now you should have two objects in your workspace:
genotype
which is a matrix of genotypes for 2000
individuals and phenotype
(the outcome for the 2000
individuals).
standardize
argument]Above we just considered the outcome continuous (even though it is a
series of 0s and 1s). A better model would be to use a binomial model
like logistic regression. To analyze a dichotomous outcome such as
case/control status we use family=”binomial”
.
predict()
]selectiveInference
package in the
fixedLassoInf()
function. See the last slide from the
lectuers).Last updated 2023