**Topics for module**- Advanced correlations
- Zero-inflated and hurdle models (microbiome data and RNA-seq revisited)
- Analysis of compositional data
- Combining data from multiple platforms and experiments
- Inference methods for combined (and simultaneous) data

**To read:**Nothing- Slides
- R code from the exercises

These exercises will primarily use functions in R but we will also
briefly touch upon the `freebayes`

software package. The
exercises are build up over two types of problems: introductory text
that you should run through to make sure you understand how to use and a
few additional questions for you to explore.

We will be needing the `MESS`

, `energy`

,
`multiway`

, `pscl`

and `compositions`

packages, which can all be installed using the
`install.packages()`

function.

`install.packages(c("MESS", "energy", "compositions", "multiway", "pscl"))`

Here we will look at a dataset concerned with the genetic associations to body-mass index (BMI) that we looked at earlier. The data can be downloaded directly from the web page using

```
options(timeout = 5*60)
load(url("https://www.biostatistics.dk/teaching/bioinformatics/data/gwasgt.rda"))
load(url("https://www.biostatistics.dk/teaching/bioinformatics/data/gwaspt.rda"))
```

We are interested in looking for *other* relationships that we
did not find previously.

Use the same data as before. Pick 20 SNPs from columns 791 to 810 which are located around one of the SNPs of interest. If we use all the SNPs then the computations will take too long.

Compute the Pearson correlation matrix for the 20 genes/SNPs selected above. You can use the

`cor()`

function for this.Load the

`energy`

package to get access to the`dcor()`

function. Compute the distance correlation for each of the SNPs against SNP 802. Does the results differ from the findings from`cor()`

? [ Hint: you can use something like`sapply(791:810, function(i) { dcor(genotypes[,802], genotypes[,i])})`

to do the computations. Make sure you understand what goes on here. ]

Use

`dcor()`

to compare the SNPs to the BMI outcome for the SNPs that were found on Wednesday to be related to the outcome. Do you find any apparent relationships between the SNPs and the phenotypes, that you did not find using the traditional regression model?Compute t tests for each of the distance correlations. What are your findings? [Hint: use the

`dcorT.test()`

function. ]Why does it make sense to compare the outcome from the regression model to results from

`dcor()`

?

Zero-inflated and hurdle regression models with Poisson and
negative-binomial models can be modeled in R using the `pscl`

package. There are two primary functions we will be using:
`hurdle()`

and `zeroinfl()`

. Both functions work
similary to the `glm()`

function that we have used
previously.

Load the data from webpage by running the following command

`load(url("http://www.biostatistics.dk/microbiome.rda"))`

The data consists of two conditions, A and B, the relative abundance from 15 taxa and 35 samples from each condition. There is also information on age an sex of the individuals the provided the samples.

Fit a zero-inflated Poisson model for the raw relative abundance of taxa 1 (otu 1) to compare the two conditions using the

`zeroinfl()`

function. Hint: you can use the`subset`

argument to restrict the analysis to only one taxa.Also, the

`zeroinfl()`

function uses the Poisson model by default so it is necessary to round the relative abundance to the closest integer using, say, the`round()`

function.What is the conclusion? What are the directions of the effect of the conditions? Is this what you would expect?

Fit the same model with argument

`dist="negbin"`

(still for the raw relative abundance of taxa 1). Which of the two models fit best?Now try the same two models but include additional information on the gender and age. You may run into problems with the model fit / convergence. You can specify individual models for the proper distribution and the logistic regression model by specifying individual models. This can be done in the model formula by giving two models separate split with a vertical bar such as

`y~x | x+z`

.Try fitting the same models with a hurdle model (use the

`hurdle()`

function)

Use the `microbiome`

data from the exercise above. The
data consists of two conditions, A and B, the relative abundance from 15
taxa and 35 samples from each condition. There is also information on
age an sex of the individuals the provided the samples. We start by
extracting the intensities as a matrix with 15 columns and 70 rows.

We start be extracting the values as a matrix with 70 rows and 15 columns

Arrange by otu!

```
exprmatrix <- matrix(microbiome$value, nrow=70)
```

- Load the
`compositions`

package and compute the isometric-log-ratio for each row of the expression matrix. Store the result in some object for later use. [ Hint: You can use the`ilr()`

function.] - What is the dimension of the resulting matrix? What is returned?
- We can now analyse the data in different ways:
- We could use each of the columns in the transformed matrix independently of each other one by one and use those as outcomes. The predictors could be condition, age, and gender.
- We could use the full set of trsansformed values to predict the condition based on the microbiome composition. This could be a regular logistic regression model or a penalized regression model.

For this exercise we will be using the `multiway`

package
and the array `x`

that is grabbed from an R dataset on the
web. We wish to decompose 5 small metabolite arrays that are found in
`x`

and possibly get information on the basis functions.

```
library("multiway")
load(url("http://www.biostatistics.dk/parafac.rda")) # Loads x
```

The individual metabolite scans can be accessed through the last
index of the array. For example will `x[,,1]`

give the \(120\times 120\) matrix of values for
individual 1. To get an idea of the data try plotting a 3D plot of the
output from person 1:

`persp(x[,,1])`

If you have the `rgl`

package installed you can create a
3D plot that can be rotated:

```
library("rgl")
persp3d(x[,,1], col="lightblue")
```

There are two peaks on the array. The same is true for the other individuals although the relative height of the peaks vary from person to person.

Let us create the matrices used to create the tensor product:

```
res <- parafac(x, nfac=2)
summary(res)
```

- Look at the output from the analysis. What does
`res`

contain? In particular look at the elements`A`

,`B`

, and`C`

. Why do they have the dimensions they have? - Try to plot the “basis” functions that make up the different
components. For example, the first component of the first matrix can be
extracted with
`res$A[1,]`

. Plot the three components from the first matrix and interpret them.

We can create our matrix approximation with the `fitted()`

function.

`appx <- fitted(res)`

- Compare the approximated tensor with the original tensor. How well would you say the matrix was approximated (you should just be eye-balling here).
- In the computations above we used 2 components but this can be
changed with the
`nfac`

argument. Try to approximate the tensor with 3 components. How much gain is there in the tensor approximation now?

Claus Ekstrøm 2024