All of these exercises use the R program package. We will need the MASS, MESS, mlbench, pls and caret packages, which can be installed using the install.packages() function as shown below:

install.packages(c("MESS", "MASS", "pls", "caret", "mlbench"))
The exercises are constructed so they contain two types of problems: introductory text that you should run through to make sure you understand how to use the functions and interpret the output, and a few additional questions for you to explore.
The multiple testing problem refers to the situation where several tests or statistical inferences are considered simultaneously as a group. Each test has a false positive rate (the type I error rate) which is equal to the threshold set for statistical significance, generally 5%. However, when more than one test is done then the overall type I error rate (i.e., the probability that at least one of the test results is a false positive) is much greater than 5%. For example, with ten independent tests each at the 5% level, the probability of at least one false positive is \(1 - 0.95^{10} \approx 40\%\).
Multiple testing corrections adjust \(p\) values derived from multiple statistical tests in order to control the overall type I error rate and reduce the occurrence of false positives.
The p.adjust() function takes a vector of \(p\)-values as input and returns a vector of \(p\)-values adjusted for multiple comparisons. The method argument sets the correction method, and the values "holm" for Holm's correction (the default), "bonferroni" for the Bonferroni correction, and "fdr" (for false discovery rate) are among the possibilities. The Holm and Bonferroni methods both seek to control the overall false positive rate, with Holm's method being less conservative than the Bonferroni method. The "fdr" method controls the false discovery rate, which is a less stringent criterion than the overall false positive rate.
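As a small illustration, the methods can be compared on a handful of made-up \(p\)-values:

p <- c(0.001, 0.02, 0.04, 0.20, 0.50)
p.adjust(p, method = "holm")        # the default method
p.adjust(p, method = "bonferroni")  # multiplies each p-value by the number of tests (capped at 1)
p.adjust(p, method = "fdr")         # Benjamini-Hochberg false discovery rate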
We illustrate the p.adjust() function with a typical microarray gene expression experiment, available as the superroot2 data frame in the MESS package. The expression levels of 21,500 different genes are examined, and for each gene we wish to determine if there is any difference in gene expression between two plant types (mutant and wild type) after correcting for dye color and array. A standard analysis of variance model is used to test the hypothesis of no difference in gene expression between the two types of plants for each of the 21,500 genes. The by() function is combined with anova() and lm() to calculate the 21,500 \(p\)-values.
library("MESS")
data(superroot2)
pval <- by(superroot2,
superroot2$gene,
FUN=function(dat) {anova(lm(log(signal) ~ array + color + plant, data=dat))[3,5]} )
Use p.adjust() to adjust the \(p\)-values for multiple comparisons and sort() the result to find the smallest adjusted values. Your output should show that the smallest \(p\)-value is 0.2254 after correction with the Holm method.
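A minimal sketch of this step, using the pval object computed above:

adjusted <- p.adjust(unlist(pval), method = "holm")  # Holm is also the default
head(sort(adjusted))                                 # the smallest adjusted p-values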
In this exercise we use a data set concerning diabetes among Pima Indians. Access the dataset with the following commands
library(mlbench)
data(PimaIndiansDiabetes2)
library(bootstrap)
We will only be looking at the insulin variable here. Remember that it contains missing observations, which we will remove.
insulin <- PimaIndiansDiabetes2$insulin[!is.na(PimaIndiansDiabetes2$insulin)]
Estimate the 90th percentile of the insulin values using the quantile() function with argument p=.9.

Use the jackknife() function from the bootstrap package to estimate the standard error of the 90th percentile. Note that you can use quantile in jackknife by supplying an extra argument.

Compute a bootstrap confidence interval for the 90th percentile using the bcanon() function. A sketch covering all three steps follows.
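A sketch of how these steps might look, assuming the insulin vector created above (the number of bootstrap samples is an arbitrary choice):

library(bootstrap)
quantile(insulin, p = .9)            # point estimate; p partially matches the probs argument

jk <- jackknife(insulin, quantile, p = .9)
jk$jack.se                           # jackknife estimate of the standard error

set.seed(1)                          # for reproducibility
bca <- bcanon(insulin, 1000, quantile, p = .9, alpha = c(0.025, 0.975))
bca$confpoints                       # BCa confidence limits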
In this exercise we use a data set concerning forced expiratory volume in children. Access the dataset called fev with the following commands
library(isdals)
data(fev)
Perform a \(t\) test to investigate if there is a difference in forced expiratory volume between children who smoke and children who do not smoke. Use the t.test() or lm() function or some other function that enables this.
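For example, a sketch assuming the smoking variable in fev is called Smoke (run names(fev) to confirm the variable names):

t.test(FEV ~ Smoke, data = fev)           # Welch two-sample t test
summary(lm(FEV ~ Smoke, data = fev))      # linear-model version (assumes equal variances)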
Randomize the outcome (FEV) using the sample() function. Decide on a reasonable test statistic to investigate this hypothesis. What null hypothesis does this correspond to? Is this the null hypothesis you want to test?
Run the randomization test a large number of times (for example using the code below). How would you compute a test statistic?
for (i in 1:1000) {
# Your code here
}
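One possible completion, as a sketch: take the difference in mean FEV between the two smoking groups as the test statistic and permute the outcome (assumes columns FEV and Smoke in the fev data).

diff_means <- function(y, g) diff(tapply(y, g, mean))
obs <- diff_means(fev$FEV, fev$Smoke)                # observed test statistic
stat <- numeric(1000)
for (i in 1:1000) {
  stat[i] <- diff_means(sample(fev$FEV), fev$Smoke)  # outcome randomized
}
mean(abs(stat) >= abs(obs))                          # two-sided permutation p-value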
What would have happened if you had randomized the predictor? What would have happened if you had chosen a different test statistic?
What is the conclusion? Compare to the parametric test.
Now consider the inclusion of additional variables in the model. We are still primarily interested in the association between smoking and fev. Perform a randomization test where the outcome is randomized. What is the conclusion? What null hypothesis does this correspond to?
Perform a randomization test where the smoking variable is randomized. What null hypothesis does this correspond to?
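A sketch of this variant, permuting only the smoking variable while keeping the covariates fixed (the covariate names Age, Ht and Gender are assumptions; check names(fev), and adjust the coefficient name if Smoke is a factor):

obs <- coef(lm(FEV ~ Age + Ht + Gender + Smoke, data = fev))["Smoke"]
stat <- numeric(1000)
for (i in 1:1000) {
  perm <- fev
  perm$Smoke <- sample(perm$Smoke)        # break only the smoking-FEV link
  stat[i] <- coef(lm(FEV ~ Age + Ht + Gender + Smoke, data = perm))["Smoke"]
}
mean(abs(stat) >= abs(obs))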
What are the advantages and potential caveats of randomization tests?
Principal component analysis is a useful technique when you have obtained measurements on a number of (possibly correlated) observed variables and you wish to reduce the number of observed variables to a smaller number of artificial variables (called principal components) that account for most of the variance in the observed variables. The principal components may then be used as predictors in subsequent analyses.
The prcomp() function performs principal component analysis on a given data matrix. By default, R centers the variables in the data matrix, but does not scale them to have unit variance. In order to scale the variables we should set the logical option scale.=TRUE.
In the following code we will make a principal component analysis of the nine explanatory variables found in the biopsy dataset from the MASS package. There are missing data present in the data frame, so we restrict our attention to the complete cases.
library("MASS")
data(biopsy)
names(biopsy)
predictors <- biopsy[complete.cases(biopsy),2:10]
fit <- prcomp(predictors, scale=TRUE)
summary(fit)
Run plot(fit) and biplot(fit) and explain what you see. Make sure you produce these two plots and understand what they show; the biplot is especially important.

Input to prcomp() can also be specified as a formula with no response variable. In this case, missing values are handled directly by the na.action argument.
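For example, a sketch of the formula interface applied to the biopsy predictors:

fit2 <- prcomp(~ ., data = biopsy[, 2:10], scale. = TRUE, na.action = na.omit)
summary(fit2)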
The biplot may suggest that some of the variables are highly correlated; use the cor() function to see if that is correct.

This exercise continues/overlaps the exercise from above.
Principal component analysis is often used to reduce data dimensionality and to overcome problems with collinearity among observed variables.
Principal component regression uses the principal components identified through principal component analysis as explanatory variables in a regression model instead of the original variables, and principal component regression is often seen as a natural "next step" to principal component analysis. Typically, only a subset of the principal components are used in the regression.
In the following code we start with a principal component analysis of the nine explanatory variables found in the biopsy dataset from the MASS package. There are missing data present in the data frame, so we restrict our attention to the available complete cases. We then use the predicted principal components as input to a logistic regression model where the response is the breast tumor type: benign or malignant.
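A minimal sketch of such an analysis (retaining three components here is purely illustrative):

library("MASS")
data(biopsy)
complete <- biopsy[complete.cases(biopsy), ]
fit <- prcomp(complete[, 2:10], scale. = TRUE)
pcs <- as.data.frame(fit$x[, 1:3])   # scores on the first three components
pcs$class <- complete$class          # benign or malignant
model <- glm(class ~ ., data = pcs, family = binomial)
summary(model)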
Use the complete.cases() function to identify rows with full information and just use those rows.

Last updated 2023