These exercises will primarily use functions in R, but we will also briefly touch upon the freebayes software package. The exercises are built up from two types of problems: introductory text that you should work through to make sure you understand how to use the methods, and a few additional questions for you to explore.

We will be needing the MESS, energy, multiway, pscl and compositions packages, which can all be installed using the install.packages() function.

install.packages(c("MESS", "energy", "compositions", "multiway", "pscl"))

Distance correlations

Here we will look at a dataset concerned with the genetic associations to body-mass index (BMI) that we looked at earlier. The data can be downloaded directly from the web page; first increase the download timeout so the download is not cut short:

options(timeout = 5*60)

We are interested in looking for other relationships that we did not find previously.

  1. Use the same data as before. Pick the 20 SNPs in columns 791 to 810, which are located around one of the SNPs of interest. If we used all the SNPs the computations would take too long.

  2. Compute the Pearson correlation matrix for the 20 genes/SNPs selected above. You can use the cor() function for this.
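    A minimal sketch of these two steps. The object name genotypes is a hypothetical stand-in for the downloaded BMI data (the real object name may differ), and the data here are simulated so the snippet is self-contained:

    ```r
    # Hypothetical stand-in for the downloaded genotype data:
    # 100 individuals, 810 SNP columns coded 0/1/2.
    set.seed(1)
    genotypes <- matrix(rbinom(100 * 810, size = 2, prob = 0.3), nrow = 100)

    snps <- genotypes[, 791:810]   # the 20 SNPs around the SNP of interest
    cormat <- cor(snps)            # 20 x 20 Pearson correlation matrix
    dim(cormat)
    ```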

  3. Load the energy package to get access to the dcor() function. Compute the distance correlation for each of the SNPs against SNP 802. Do the results differ from the findings from cor()? [ Hint: you can use something like

    sapply(791:810, function(i) { dcor(genotypes[,802], genotypes[,i])})

    to do the computations. Make sure you understand what goes on here. ]

  4. Use dcor() to compare the SNPs to the BMI outcome for the SNPs that were found on Wednesday to be related to the outcome. Do you find any apparent relationships between the SNPs and the phenotypes that you did not find using the traditional regression model?

  5. Compute t tests for each of the distance correlations. What are your findings? [Hint: use the dcorT.test() function. ]
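    A sketch of the t tests, again on simulated stand-in data so it runs on its own. dcorT.test() returns an htest object, so the p-values can be collected with sapply() just like in the hint above:

    ```r
    library(energy)

    # Simulated stand-in for the genotype data (hypothetical)
    set.seed(1)
    genotypes <- matrix(rbinom(100 * 810, size = 2, prob = 0.3), nrow = 100)

    # Bias-corrected distance-correlation t test of each SNP against SNP 802
    pvals <- sapply(791:810, function(i) {
      dcorT.test(genotypes[, 802], genotypes[, i])$p.value
    })
    pvals
    ```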

  6. Why does it make sense to compare the outcome from the regression model to results from dcor()?

Zero-inflated and hurdle models

Zero-inflated and hurdle regression models with Poisson and negative-binomial distributions can be fitted in R using the pscl package. There are two primary functions we will be using: hurdle() and zeroinfl(). Both functions work similarly to the glm() function that we have used previously.

  1. Load the data from the web page by running the following command


The data consist of two conditions, A and B, the relative abundance of 15 taxa, and 35 samples from each condition. There is also information on the age and sex of the individuals who provided the samples.

  1. Fit a zero-inflated Poisson model for the raw relative abundance of taxon 1 (otu 1) to compare the two conditions using the zeroinfl() function. Hint: you can use the subset argument to restrict the analysis to a single taxon.

    Also, the count models used by zeroinfl() require integer outcomes (Poisson is the default distribution), so it is necessary to round the relative abundance to the nearest integer using, say, the round() function.

    What is the conclusion? What are the directions of the effect of the conditions? Is this what you would expect?
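    A minimal self-contained sketch of the fit. The real exercise uses the downloaded microbiome data; here the counts and the variable names (count, condition) are simulated stand-ins:

    ```r
    library(pscl)

    # Hypothetical stand-in: rounded abundances for otu 1 under conditions
    # A and B, with extra zeros mixed in to mimic zero inflation.
    set.seed(1)
    df <- data.frame(
      count = rpois(70, lambda = rep(c(3, 6), each = 35)) *
              rbinom(70, size = 1, prob = 0.8),
      condition = factor(rep(c("A", "B"), each = 35))
    )

    fit_zip <- zeroinfl(count ~ condition, data = df)  # Poisson is the default
    summary(fit_zip)
    ```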

  2. Fit the same model with the argument dist="negbin" (still for the raw relative abundance of taxon 1). Which of the two models fits best?
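    One way to compare the two fits is by AIC. Sketched below on the same simulated stand-in data as above (the real analysis uses the downloaded data):

    ```r
    library(pscl)

    # Simulated stand-in data (hypothetical)
    set.seed(1)
    df <- data.frame(
      count = rpois(70, lambda = rep(c(3, 6), each = 35)) *
              rbinom(70, size = 1, prob = 0.8),
      condition = factor(rep(c("A", "B"), each = 35))
    )

    fit_zip  <- zeroinfl(count ~ condition, data = df)                  # Poisson
    fit_zinb <- zeroinfl(count ~ condition, data = df, dist = "negbin") # neg. binomial
    AIC(fit_zip, fit_zinb)  # lower AIC indicates the better-fitting model
    ```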

  3. Now try the same two models but include additional information on gender and age. You may run into problems with the model fit / convergence. You can specify separate model formulas for the count distribution and for the logistic (zero) part by splitting the formula with a vertical bar, such as y ~ x | x + z.
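    A sketch of the two-part formula, again on simulated stand-in data with hypothetical age and sex variables:

    ```r
    library(pscl)

    # Simulated stand-in data (hypothetical)
    set.seed(1)
    df <- data.frame(
      count = rpois(70, lambda = rep(c(3, 6), each = 35)) *
              rbinom(70, size = 1, prob = 0.8),
      condition = factor(rep(c("A", "B"), each = 35)),
      age = round(runif(70, 20, 70)),
      sex = factor(sample(c("F", "M"), 70, replace = TRUE))
    )

    # Count part: condition + age + sex; zero-inflation part: condition only
    fit2 <- zeroinfl(count ~ condition + age + sex | condition, data = df)
    summary(fit2)
    ```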

  4. Try fitting the same models with a hurdle model (use the hurdle() function).
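    The hurdle() calls mirror the zeroinfl() ones. A self-contained sketch on the same simulated stand-in data:

    ```r
    library(pscl)

    # Simulated stand-in data (hypothetical)
    set.seed(1)
    df <- data.frame(
      count = rpois(70, lambda = rep(c(3, 6), each = 35)) *
              rbinom(70, size = 1, prob = 0.8),
      condition = factor(rep(c("A", "B"), each = 35))
    )

    fit_hp  <- hurdle(count ~ condition, data = df)                  # Poisson counts
    fit_hnb <- hurdle(count ~ condition, data = df, dist = "negbin") # neg. binomial
    AIC(fit_hp, fit_hnb)
    ```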

Compositional data analysis

Use the microbiome data from the exercise above. The data consist of two conditions, A and B, the relative abundance of 15 taxa, and 35 samples from each condition. There is also information on the age and sex of the individuals who provided the samples.

We start by extracting the values as a matrix with 70 rows and 15 columns (note that the data should be arranged by otu first):

exprmatrix <- matrix(microbiome$value, nrow=70)
  1. Load the compositions package and compute the isometric log-ratio (ilr) for each row of the expression matrix. Store the result in some object for later use. [ Hint: You can use the ilr() function.]
  2. What is the dimension of the resulting matrix? What is returned?
  3. We can now analyse the data in different ways:
    • We could use each of the columns in the transformed matrix independently of each other one by one and use those as outcomes. The predictors could be condition, age, and gender.
    • We could use the full set of transformed values to predict the condition based on the microbiome composition. This could be a regular logistic regression model or a penalized regression model.
    Do one or both of these analyses.
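A sketch of the ilr transform and the first (column-by-column) analysis. The composition matrix here is simulated so the snippet is self-contained; in the exercise you would use exprmatrix from above:

```r
library(compositions)

# Hypothetical stand-in for the 70 x 15 otu matrix: rows sum to 1
set.seed(1)
exprmatrix <- matrix(rexp(70 * 15), nrow = 70)
exprmatrix <- exprmatrix / rowSums(exprmatrix)

ilrmat <- ilr(exprmatrix)   # isometric log-ratio, row by row
dim(ilrmat)                 # 70 x 14: one fewer column than taxa

# Option 1: use each transformed column as an outcome in its own model
condition <- factor(rep(c("A", "B"), each = 35))
pvals <- apply(unclass(ilrmat), 2, function(y) {
  coef(summary(lm(y ~ condition)))["conditionB", "Pr(>|t|)"]
})
```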


PARAFAC decomposition

For this exercise we will be using the multiway package and the array x, which is grabbed from an R dataset on the web. We wish to decompose the 5 small metabolite arrays found in x and possibly get information on the basis functions.

load(url("http://www.biostatistics.dk/parafac.rda")) # Loads x

The individual metabolite scans can be accessed through the last index of the array. For example, x[,,1] gives the \(120\times 120\) matrix of values for individual 1. To get an idea of the data, try making a 3D plot of the output from person 1:
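One possible base-R choice is persp(). The sketch below runs on a simulated stand-in array with the same \(120\times 120\times 5\) shape and two peaks per scan, since the real x comes from parafac.rda:

```r
# Simulated stand-in for x: five 120 x 120 scans, each with two peaks
peak <- function(m, s) dnorm(1:120, mean = m, sd = s)
x <- array(0, dim = c(120, 120, 5))
for (i in 1:5) {
  x[, , i] <- outer(peak(40, 8), peak(40, 8)) +
              (i / 5) * outer(peak(80, 6), peak(80, 6))
}

# Static 3D perspective plot of individual 1
persp(x[, , 1], theta = 30, phi = 30, col = "lightblue",
      xlab = "", ylab = "", zlab = "intensity")
```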


If you have the rgl package installed you can create a 3D plot that can be rotated:

library(rgl)
persp3d(x[,,1], col="lightblue")

There are two peaks on the array. The same is true for the other individuals, although the relative height of the peaks varies from person to person.

Let us estimate the matrices that make up the tensor-product (PARAFAC) decomposition:

res <- parafac(x, nfac=2)
  1. Look at the output from the analysis. What does res contain? In particular look at the elements A, B, and C. Why do they have the dimensions they have?
  2. Try to plot the “basis” functions that make up the different components. For example, the first component of the first matrix can be extracted with res$A[,1]. Plot the two components from the first matrix and interpret them.
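A sketch of plotting the mode-A components. It uses the same simulated stand-in for x as above (two peaks per scan), since the real array sits on the web:

```r
library(multiway)

# Simulated stand-in for x (hypothetical): five 120 x 120 scans, two peaks each
peak <- function(m, s) dnorm(1:120, mean = m, sd = s)
x <- array(0, dim = c(120, 120, 5))
for (i in 1:5) {
  x[, , i] <- outer(peak(40, 8), peak(40, 8)) +
              (i / 5) * outer(peak(80, 6), peak(80, 6))
}

res <- parafac(x, nfac = 2)

# Each column of res$A is one "basis" function along the first mode
matplot(res$A, type = "l", lty = 1,
        xlab = "index", ylab = "loading", main = "Mode-A components")
```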

We can create our approximation of the tensor with the fitted() function.

appx <- fitted(res)
  1. Compare the approximated tensor with the original tensor. How well would you say the tensor was approximated? (You should just be eye-balling here.)
  2. In the computations above we used 2 components, but this can be changed with the nfac argument. Try to approximate the tensor with 3 components. How much gain is there in the tensor approximation now?
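One simple numerical way to compare the 2- and 3-component approximations is the sum of squared errors. Sketched below on the same simulated stand-in for x as above:

```r
library(multiway)

# Simulated stand-in for x (hypothetical)
peak <- function(m, s) dnorm(1:120, mean = m, sd = s)
x <- array(0, dim = c(120, 120, 5))
for (i in 1:5) {
  x[, , i] <- outer(peak(40, 8), peak(40, 8)) +
              (i / 5) * outer(peak(80, 6), peak(80, 6))
}

res2 <- parafac(x, nfac = 2)
res3 <- parafac(x, nfac = 3)

# Sum of squared errors of the two tensor approximations
sse2 <- sum((x - fitted(res2))^2)
sse3 <- sum((x - fitted(res3))^2)
c(sse2 = sse2, sse3 = sse3)  # smaller SSE = closer approximation
```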

Claus Ekstrøm 2024