**Topics for module**- GWAS
- Imputation
- Common variants vs rare variants. Sequence Kernel Association Test
- Enrichment approaches, gene-set analyses

**To read:**Nothing- Slides: GWAS
- R code from the exercises

These exercises will primarily use functions in R but we will also
briefly touch upon the `plink`

software package. The
exercises are build up over two types of problems: introductory text
that you should run through to make sure you understand how to use and a
few additional questions for you to explore.

We will be needing the `MESS`

, `lme4`

,
`coxme`

, and `glmnet`

packages, which can all be
installed using the `install.packages()`

function.

`install.packages(c("MESS", "lme4", "coxme", "glmnet"))`

In this exercise we will try to run a standard genome-wide association study. As much as possible we will be using standard functionality in R and try to use the techniques we looked at this Monday. There are a few specialized packages that wrap some of fucntionality into complete functions but we will try to do as much of the analyses by hand to make sure we understand what is going on. Also, we are considering only a part of a full GWAS dataset.

Here we will look at a dataset concerned with the genetic associations to body-mass index (BMI). The data can be downloaded directly from the web page using

```
options(timeout = 5*60)
load(url("https://www.biostatistics.dk/teaching/bioinformatics/data/gwasgt.rda")) # Genotypes
load(url("https://www.biostatistics.dk/teaching/bioinformatics/data/gwaspt.rda")) # Phenotypes
```

and it contains information from two chromosomes: the first 11283 columns in the genotype file are from chromosome 1 and the rest for chromosome 2.

- How many SNPs are available in our dataset? How many individuals are we analyzing? What else is available?
- What type of analysis would we do if we had just observed a single gene variant?
- Run a simple analysis corresponding to the model you proposed in 2.
- Make a full GWAS analysis as shown at the lectures. That means you
should run an analysis for each column in your genotype dataset. [Hint:
you can use the
`mfastLmCpp()`

function from the`MESS`

package to run an analysis between an outcome and each predictor.] - Create a Manhattan plot with the resulting \(p\)-values. Note: on the \(y\)-axis of a Manhattan plot we typically plot \(-\log_{10}(p)\) instead of the \(p\)-value itself.
- Try to analyze the data using the lasso instead. Compare the results
to the previous marginal findings. Is something consistant? Is something
disappearing? [Note: we cannot get \(p\)-values directly out of
`glmnet()`

so we should instead compare the set of interesting findings.] - Try to analyze the data using the elastic net instead with at least half of the penalty weight put on ridge regression. What should be the advantage of that in this situation compared to the regular lasso?
- Compute the genomic control factor and correct the analyses if necessary.

A group of researchers are nervous that there is unmeasured population admixture in the data. This essentially means that the full population in reality contains two or more “hidden” sub-populations.

- Why is population admixture a potentially big problem in GWAS studies?

We can use principal component analysis based on all SNPs to produce a combined measure (of the SNPs). If there is population substructure then the allele frequencies are likely to be different among different sub-populations and across many SNPs and this effect is likely to be captured by the principal component since they are computed regardless of the outcome.

Use the data from the previous exercise for this exercise. You may have to look at a single chromosome only, if your computer is slightly slow

- Make a full analysis where you also take potential population
structure into account. How does this change the results? [ Hint:
compute a principal component or two and check if they contribute
anything to the analyses. The
`prcomp()`

function that we have used to far is rather slow on large datasets like this. It will take 5 minutes or so to compute the PCAs using`prcomp()`

. I can really recommend the function`batchpca()`

from the package`onlinePCA`

] - Which markers should ideally be used for the principal components? What risks for the study power are there for this approach on the GWAS findings?

In this exercise we will try to estimate the proportion of variance
explained. To do this we will use the `lmekin()`

function
from the `coxme`

package. Use the large dataset from the
exercises above for the following.

We should start by computing the allele frequencies for each of the SNPs. Use the

`colMeans()`

function.The mean of each SNP is \(2p\) where \(p\) is its allele frequency, and it has a standard deviation of \(\sqrt{2p(1-p)}\). Standardize the values for each gene by subtracting the correct mean and dividing by its standard deviation. Call the result

`W`

in R. See the code below for inspiration.Run the following code

`MAF <- 1-colMeans(genotypes)/2 # Minor allele frequency W <- t((t(genotypes)/2 - MAF)/sqrt(2*MAF*(1-MAF))) y <- phenotypes$BMI N <- length(y) A <- tcrossprod(W) / ncol(genotypes) library("coxme") id <- 1:N res <- lmekin(y ~ 1 + (1|id), varlist=list(id=A))`

The first couple of lines compute the correlation matrix we should use for the calculations. The last line fits the model.

Explain what goes on in the code above.

Estimate the proportion of variance explained. What is the heritability in this case?

The following text explains how to make a heat map.

Heat map plots are used to visualize data from a two-dimensional
array using colors to indicate the values in the array. In R, the
function `heatmap()`

plots a heat map, and it requires a
numeric matrix as its first argument.

There are several formulas for computing linkage disequilibrium
between two genes. A simple measure that is fast to compute and easy to
interpret is the correlation coefficient. `cor()`

computes
the correlation matrix for all pairwise correlations.

- Use the same data as before. Pick 50 SNPs on each side around a region of interest - for example one of the SNPs found to be associated with the outcome. The columns in the genotype dataset are ordered according to their position on the genome. If we use all the SNPs then the computations will take too long.
- Compute the correlation matrix for the 100 genes/SNPs.
- Use
`heatmap(x, Rowv=NA, Colv=NA)`

(`x`

is the matrix of correlation values computed above). What do you see in the graph? - What happens if you do not include the arguments
`Rowv=NA, Colv=NA`

?

The color scheme of the heat map plot can be set with the
`col`

option. By default it uses heat colors obtained from
the `heat.colors`

function, but that can be changed to give
other color schemes. If the options `ColSideColors`

and
`RowSideColors`

are provided, they should contain a vector of
color names for a horizontal/vertical side bar that annotates the
columns and rows of the data matrix, respectively. This can be used to
indicate known groupings by color code as shown below.

We want to see if there is any reason to consider a burden/SKAT test in order to find association between rare variants and the outcome.

- Compute the minor allele frequency for each of the genes. What is
the distribution of allele frequencies? You can use the
`colMeans()`

function to compute the mean of the columns in the genotype data. - Identify the SNPs found to be in the same clusters using the heatmap from above and make a simple burden test for each of the clusters identified. An example of this can be analyzed in R as follows (needs to be modified):

```
library("lme4")
# Include, say, V6 and V7 as random variable
y <- phenotypes$BMI
fullmodel <- lmer(y ~ 1 + (1|V6) + (1|V7), data=geno2, REML=FALSE)
smallmodel <- lm(y ~ 1, data=geno2)
logLik(fullmodel)
logLik(smallmodel)
```

- How should we compare the results? Has the power improved?

This exercise is to try the software program `plink`

. You
need to download and install the programme before you can run the
exercise. `plink`

is super versatile and has many more
options than what we can consider here. Instead we will try to make
small extensions to the analyses we made above.

Download the

`plink`

program from`http://zzz.bwh.harvard.edu/plink/download.shtml`

and install it.

The `plink`

executable file should be placed in either the
current working directory or somewhere in the command path. All commands
involve typing plink at the command prompt followed by a number of
options (all starting with `--option`

) to specify the data
files methods to be used.

`plink`

needs two standard files: a ped-file and a
map-file. The necssary files can be downloaded from the course
website.

Check the files and make sure you understand their formats. The map file contains chromosome number, the SNP name, the position in morgans on the chromosome, and the position in base pairs on the chromosome.

Run the following commands and discuss the output:

- Type
`plink --noweb --file hapmap1`

and look at the output. It takes two files (a ped file and a map file) as input. They can be downloaded from the course website. - Type
`plink --noweb --missing --file hapmap1`

and look at the output. What information is contained in the missing files. - Type
`plink --noweb --freq --file hapmap1`

and look at the output. What information is contained in the frequency files. How does the results compare to the results you computed manually earlier in R? - Type
`plink --noweb --assoc --file hapmap1`

and look at the output. What information is contained in the association files. - Type
`plink --noweb --assoc --adjust --file hapmap1`

and look at the output. What information is contained in the adjusted association files.

- Type

Claus Ekstrøm 2024