These exercises will primarily use functions in R but we will also
briefly touch upon the plink
software package. The
exercises are build up over two types of problems: introductory text
that you should run through to make sure you understand how to use and a
few additional questions for you to explore.
We will be needing the MESS
, lme4
,
coxme
, and glmnet
packages, which can all be
installed using the install.packages()
function.
install.packages(c("MESS", "lme4", "coxme", "glmnet"))
In this exercise we will try to run a standard genome-wide association study. As much as possible we will be using standard functionality in R and try to use the techniques we looked at this Monday. There are a few specialized packages that wrap some of fucntionality into complete functions but we will try to do as much of the analyses by hand to make sure we understand what is going on. Also, we are considering only a part of a full GWAS dataset.
Here we will look at a dataset concerned with the genetic associations to body-mass index (BMI). The data can be downloaded directly from the web page using
options(timeout = 5*60)
load(url("https://www.biostatistics.dk/teaching/bioinformatics/data/gwasgt.rda")) # Genotypes
load(url("https://www.biostatistics.dk/teaching/bioinformatics/data/gwaspt.rda")) # Phenotypes
and it contains information from two chromosomes: the first 11283 columns in the genotype file are from chromosome 1 and the rest for chromosome 2.
mfastLmCpp()
function from the
MESS
package to run an analysis between an outcome and each
predictor.]glmnet()
so we should instead compare the set of
interesting findings.]A group of researchers are nervous that there is unmeasured population admixture in the data. This essentially means that the full population in reality contains two or more “hidden” sub-populations.
We can use principal component analysis based on all SNPs to produce a combined measure (of the SNPs). If there is population substructure then the allele frequencies are likely to be different among different sub-populations and across many SNPs and this effect is likely to be captured by the principal component since they are computed regardless of the outcome.
Use the data from the previous exercise for this exercise. You may have to look at a single chromosome only, if your computer is slightly slow
prcomp()
function that we
have used to far is rather slow on large datasets like this. It will
take 5 minutes or so to compute the PCAs using prcomp()
. I
can really recommend the function batchpca()
from the
package onlinePCA
]In this exercise we will try to estimate the proportion of variance
explained. To do this we will use the lmekin()
function
from the coxme
package. Use the large dataset from the
exercises above for the following.
We should start by computing the allele frequencies for each of
the SNPs. Use the colMeans()
function.
The mean of each SNP is \(2p\)
where \(p\) is its allele frequency,
and it has a standard deviation of \(\sqrt{2p(1-p)}\). Standardize the values
for each gene by subtracting the correct mean and dividing by its
standard deviation. Call the result W
in R. See the code
below for inspiration.
Run the following code
MAF <- 1-colMeans(genotypes)/2 # Minor allele frequency
W <- t((t(genotypes)/2 - MAF)/sqrt(2*MAF*(1-MAF)))
y <- phenotypes$BMI
N <- length(y)
A <- tcrossprod(W) / ncol(genotypes)
library("coxme")
id <- 1:N
res <- lmekin(y ~ 1 + (1|id), varlist=list(id=A))
The first couple of lines compute the correlation matrix we should use for the calculations. The last line fits the model.
Explain what goes on in the code above.
Estimate the proportion of variance explained. What is the heritability in this case?
The following text explains how to make a heat map.
Heat map plots are used to visualize data from a two-dimensional
array using colors to indicate the values in the array. In R, the
function heatmap()
plots a heat map, and it requires a
numeric matrix as its first argument.
There are several formulas for computing linkage disequilibrium
between two genes. A simple measure that is fast to compute and easy to
interpret is the correlation coefficient. cor()
computes
the correlation matrix for all pairwise correlations.
heatmap(x, Rowv=NA, Colv=NA)
(x
is the
matrix of correlation values computed above). What do you see in the
graph?Rowv=NA, Colv=NA
?The color scheme of the heat map plot can be set with the
col
option. By default it uses heat colors obtained from
the heat.colors
function, but that can be changed to give
other color schemes. If the options ColSideColors
and
RowSideColors
are provided, they should contain a vector of
color names for a horizontal/vertical side bar that annotates the
columns and rows of the data matrix, respectively. This can be used to
indicate known groupings by color code as shown below.
We want to see if there is any reason to consider a burden/SKAT test in order to find association between rare variants and the outcome.
colMeans()
function to compute the mean of the columns in
the genotype data.library("lme4")
# Include, say, V6 and V7 as random variable
y <- phenotypes$BMI
fullmodel <- lmer(y ~ 1 + (1|V6) + (1|V7), data=geno2, REML=FALSE)
smallmodel <- lm(y ~ 1, data=geno2)
logLik(fullmodel)
logLik(smallmodel)
This exercise is to try the software program plink
. You
need to download and install the programme before you can run the
exercise. plink
is super versatile and has many more
options than what we can consider here. Instead we will try to make
small extensions to the analyses we made above.
Download the plink
program from
http://zzz.bwh.harvard.edu/plink/download.shtml
and install it.
The plink
executable file should be placed in either the
current working directory or somewhere in the command path. All commands
involve typing plink at the command prompt followed by a number of
options (all starting with --option
) to specify the data
files methods to be used.
plink
needs two standard files: a ped-file and a
map-file. The necssary files can be downloaded from the course
website.
Check the files and make sure you understand their formats. The map file contains chromosome number, the SNP name, the position in morgans on the chromosome, and the position in base pairs on the chromosome.
Run the following commands and discuss the output:
plink --noweb --file hapmap1
and look at the
output. It takes two files (a ped file and a map file) as input. They
can be downloaded from the course website.plink --noweb --missing --file hapmap1
and look at
the output. What information is contained in the missing files.plink --noweb --freq --file hapmap1
and look at
the output. What information is contained in the frequency files. How
does the results compare to the results you computed manually earlier in
R?plink --noweb --assoc --file hapmap1
and look at
the output. What information is contained in the association files.plink --noweb --assoc --adjust --file hapmap1
and
look at the output. What information is contained in the adjusted
association files.Claus Ekstrøm 2024