These exercises will primarily use functions in R, but we will also briefly touch upon the plink software package. The exercises are built up from two types of problems: introductory text that you should work through to make sure you understand the methods, and a few additional questions for you to explore.

We will need the MESS, lme4, coxme, and glmnet packages, which can all be installed using the install.packages() function.

install.packages(c("MESS", "lme4", "coxme", "glmnet"))

Standard GWAS analysis

In this exercise we will run a standard genome-wide association study. As much as possible we will use standard functionality in R and the techniques we looked at on Monday. There are a few specialized packages that wrap some of this functionality into complete functions, but we will do as much of the analysis by hand as possible to make sure we understand what is going on. Also, we are considering only a part of a full GWAS dataset.

Here we will look at a dataset concerned with the genetic associations to body-mass index (BMI). The data can be downloaded directly from the web page using

options(timeout = 5*60)
load(url("https://www.biostatistics.dk/teaching/bioinformatics/data/gwasgt.rda")) # Genotypes
load(url("https://www.biostatistics.dk/teaching/bioinformatics/data/gwaspt.rda")) # Phenotypes

and it contains information from two chromosomes: the first 11283 columns in the genotype file are from chromosome 1 and the rest are from chromosome 2.

  1. How many SNPs are available in our dataset? How many individuals are we analyzing? What else is available?
  2. What type of analysis would we do if we had just observed a single gene variant?
  3. Run a simple analysis corresponding to the model you proposed in 2.
  4. Make a full GWAS analysis as shown at the lectures. That means you should run an analysis for each column in your genotype dataset. [Hint: you can use the mfastLmCpp() function from the MESS package to run an analysis between an outcome and each predictor.]
  5. Create a Manhattan plot with the resulting \(p\)-values. Note: on the \(y\)-axis of a Manhattan plot we typically plot \(-\log_{10}(p)\) instead of the \(p\)-value itself.
  6. Try to analyze the data using the lasso instead. Compare the results to the previous marginal findings. Is anything consistent? Does anything disappear? [Note: we cannot get \(p\)-values directly out of glmnet(), so we should instead compare the sets of interesting findings.]
  7. Try to analyze the data using the elastic net instead with at least half of the penalty weight put on ridge regression. What should be the advantage of that in this situation compared to the regular lasso?
  8. Compute the genomic control factor and correct the analyses if necessary.
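
The per-SNP scan, the Manhattan plot, and the genomic control factor can be sketched as below. This is a minimal base-R sketch on simulated data with no true signal; with the real data you would replace G by the genotype matrix and y by phenotypes$BMI, and use mfastLmCpp() from MESS instead of the apply() loop for speed.

```r
## Minimal sketch of a marginal GWAS scan on simulated (null) data
set.seed(1)
n <- 200; m <- 500                       # toy numbers of individuals and SNPs
G <- matrix(rbinom(n*m, size = 2, prob = 0.3), nrow = n)
y <- rnorm(n)                            # phenotype with no true association

## One linear regression per SNP; extract the p-value for the slope
pvals <- apply(G, 2, function(g) summary(lm(y ~ g))$coefficients[2, 4])

## Manhattan plot: -log10(p) against position (here just the column index)
plot(-log10(pvals), pch = 16, cex = 0.5,
     xlab = "SNP index", ylab = expression(-log[10](p)))
abline(h = -log10(5e-8), lty = 2)        # conventional genome-wide threshold

## Genomic control factor: median observed chi-square relative to its
## expected median under the null
lambda <- median(qchisq(pvals, df = 1, lower.tail = FALSE)) / qchisq(0.5, df = 1)
lambda                                   # close to 1 when there is no inflation
```

If lambda is clearly above 1, the test statistics can be deflated by dividing each chi-square statistic by lambda before recomputing the p-values.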

Handling unobserved population admixture

A group of researchers are nervous that there is unmeasured population admixture in the data. This essentially means that the full population in reality contains two or more “hidden” sub-populations.

  1. Why is population admixture a potentially big problem in GWAS studies?

We can use principal component analysis based on all SNPs to produce combined measures of the SNPs. If there is population substructure, then the allele frequencies are likely to differ between sub-populations across many SNPs, and this effect is likely to be captured by the leading principal components, since they are computed without regard to the outcome.
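
The idea can be sketched as follows: compute the leading principal components from the standardized genotype matrix and include them as covariates in each per-SNP regression. This is a base-R sketch on simulated data; with the real data G would be the genotype matrix and y would be phenotypes$BMI.

```r
## Sketch: adjust a per-SNP regression for population structure using
## principal components computed from the genotype matrix (simulated here)
set.seed(2)
n <- 100; m <- 300
G <- matrix(rbinom(n*m, 2, 0.25), nrow = n)
y <- rnorm(n)

## Standardize the columns, then take the two leading principal components
Gs  <- scale(G)
pcs <- prcomp(Gs)$x[, 1:2]               # PC scores, one row per individual

## Marginal test for one SNP, with and without adjustment for the PCs
fit_unadj <- lm(y ~ G[, 1])
fit_adj   <- lm(y ~ G[, 1] + pcs)
summary(fit_adj)$coefficients["G[, 1]", ]  # slope, SE, t, p for the SNP
```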

Use the data from the previous exercise for this exercise. You may have to look at a single chromosome only if your computer is somewhat slow.

  1. Make a full analysis where you also take potential population structure into account. How does this change the results? [Hint: compute a principal component or two and check if they contribute anything to the analyses. The prcomp() function that we have used so far is rather slow on large datasets like this; it will take 5 minutes or so to compute the principal components using prcomp(). I can really recommend the function batchpca() from the package onlinePCA.]
  2. Which markers should ideally be used for the principal components? What risks does this approach pose for the power of the GWAS analysis?


Estimating the proportion of variance explained

In this exercise we will estimate the proportion of variance explained. To do this we will use the lmekin() function from the coxme package. Use the large dataset from the exercises above for the following.

  1. We should start by computing the allele frequencies for each of the SNPs. Use the colMeans() function.

  2. The mean of each SNP is \(2p\), where \(p\) is its allele frequency, and its standard deviation is \(\sqrt{2p(1-p)}\). Standardize the values for each SNP by subtracting its mean and dividing by its standard deviation. Call the result W in R. See the code below for inspiration.

  3. Run the following code

    MAF <- 1 - colMeans(genotypes)/2                # Minor allele frequency
    # Standardize each SNP: subtract its mean 2p and divide by sqrt(2p(1-p)),
    # where p = 1 - MAF is the frequency of the counted allele
    W <- t((t(genotypes) - 2*(1 - MAF))/sqrt(2*MAF*(1 - MAF)))
    y <- phenotypes$BMI
    N <- length(y)
    A <- tcrossprod(W) / ncol(genotypes)            # genomic relationship matrix
    id <- 1:N
    dimnames(A) <- list(id, id)                     # names must match the grouping variable
    res <- lmekin(y ~ 1 + (1|id), varlist=list(id=A))

    The first lines compute the standardized genotype matrix and the genomic relationship matrix A, which serves as the correlation structure for the random effect. The last line fits the linear mixed model.

  4. Explain what goes on in the code above.

  5. Estimate the proportion of variance explained. What is the heritability in this case?
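
Once the variance components have been read off the fitted model, the proportion of variance explained is just their ratio. The arithmetic is sketched below with hypothetical numbers; in practice you would take the genetic variance from the random-effect component of the lmekin fit and the residual variance from its residual standard deviation (the exact slot names depend on the coxme version, so inspect the fit with str(res)).

```r
## Hypothetical variance components, standing in for the values reported
## by the lmekin fit (these numbers are made up for illustration)
sigma2_g <- 0.4                          # genetic variance component
sigma2_e <- 0.6                          # residual variance component
h2 <- sigma2_g / (sigma2_g + sigma2_e)   # proportion of variance explained
h2
```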

Find linkage disequilibrium (LD) blocks

The following text explains how to make a heat map.

Heat map plots are used to visualize data from a two-dimensional array using colors to indicate the values in the array. In R, the function heatmap() plots a heat map, and it requires a numeric matrix as its first argument.

There are several measures of linkage disequilibrium between two markers. A simple measure that is fast to compute and easy to interpret is the (squared) correlation coefficient. Applying cor() to a matrix computes all pairwise correlations between its columns.
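
The combination of cor() and heatmap() can be sketched as below on simulated data, where the first five columns are deliberately constructed to be in perfect LD; with the real data you would pass a slice of the genotype matrix instead of G.

```r
## Sketch: pairwise correlation among a window of SNPs, shown as a heat map
set.seed(3)
n <- 150
block <- matrix(rbinom(n, 2, 0.4), n, 5)           # 5 identical SNPs (perfect LD)
G <- cbind(block, matrix(rbinom(n*5, 2, 0.4), n))  # plus 5 independent SNPs

R <- cor(G)                          # all pairwise correlations between columns
heatmap(R^2, Rowv = NA, Colv = NA)   # r^2 is the usual LD measure
```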

  1. Use the same data as before. Pick 50 SNPs on each side around a region of interest - for example one of the SNPs found to be associated with the outcome. The columns in the genotype dataset are ordered according to their position on the genome. If we use all the SNPs then the computations will take too long.
  2. Compute the correlation matrix for the 100 genes/SNPs.
  3. Use heatmap(x, Rowv=NA, Colv=NA) (x is the matrix of correlation values computed above). What do you see in the graph?
  4. What happens if you do not include the arguments Rowv=NA, Colv=NA?

The color scheme of the heat map plot can be set with the col option. By default it uses heat colors obtained from the heat.colors function, but that can be changed to give other color schemes. If the options ColSideColors and RowSideColors are provided, they should contain a vector of color names for a horizontal/vertical side bar that annotates the columns and rows of the data matrix, respectively. This can be used to indicate known groupings by color code.
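
A small sketch of these options, using random data and a made-up grouping of the columns:

```r
## Sketch: annotate heat-map columns with a color-coded (hypothetical) grouping
set.seed(4)
x <- matrix(rnorm(20*10), nrow = 20)
group <- rep(c("A", "B"), each = 5)              # assumed grouping of the 10 columns
sidecols <- ifelse(group == "A", "steelblue", "orange")
heatmap(x, Rowv = NA, Colv = NA,
        col = heat.colors(16),                   # the default palette, made explicit
        ColSideColors = sidecols)                # one color per column
```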

Rare variants

We want to see if there is any reason to consider a burden/SKAT test in order to find association between rare variants and the outcome.

  1. Compute the minor allele frequency for each of the genes. What is the distribution of allele frequencies? You can use the colMeans() function to compute the mean of the columns in the genotype data.
  2. Identify the SNPs found to be in the same clusters using the heatmap from above and make a simple burden test for each of the clusters identified. An example of this can be analyzed in R as follows (the code needs to be modified to match your clusters):
    library(lme4)
    # Include, say, V6 and V7 as random variables
    y <- phenotypes$BMI
    fullmodel  <- lmer(y ~ 1 + (1|V6) + (1|V7), data=geno2, REML=FALSE)
    smallmodel <- lm(y ~ 1, data=geno2)
  3. How should we compare the results? Has the power improved?
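
A classical burden test collapses the rare variants in a region into a single per-person allele count and regresses the outcome on that score. A base-R sketch on simulated data, with an assumed rarity threshold of 5%:

```r
## Sketch of a simple burden test: collapse rare variants into one score
set.seed(5)
n <- 200; m <- 30
G <- matrix(rbinom(n*m, 2, 0.02), nrow = n)   # a region of rare variants
y <- rnorm(n)

maf  <- colMeans(G)/2
rare <- maf < 0.05                            # assumed rarity threshold
burden <- rowSums(G[, rare, drop = FALSE])    # total rare-allele count per person

full  <- lm(y ~ burden)
small <- lm(y ~ 1)
anova(small, full)                            # test the burden score's contribution
```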

Claus Ekstrøm 2024