Introduction

Aim of the exercise

The goal of this exercise is to introduce you to Bioconductor and some basic analyses on raw read counts including normalization, transformation, visualization and differential expression. You will also examine the variance and the dispersion in RNA-seq data - two concepts that are needed to understand how DESeq2 and edgeR pipelines work for differential expression.

R Markdown

This is an R Markdown document. Markdown is a simple formatting syntax for authoring HTML, PDF, and MS Word documents. For more details on using R Markdown see http://rmarkdown.rstudio.com. When you click the Knit button in Rstudio a document will be generated that includes both content as well as the output of any embedded R code chunks within the document.

Bioconductor

About Bioconductor

Bioconductor project host several open source R packages that are designed for the analysis of biological assays. The main tools edgeR and DESeq2 are part of Bioconductor repository. R packages from Bioconductor comes with vignettes that are designed to run a full analysis for each tool deposited there. Vignettes and detailed documents about how to use the R package is of great help for anyone who performs bioinformatics analysis. For example, both edgeR and DESeq2 has great manuals (perhaps the best in Bioconductor).

Data

The exercise is based on the following two sources:

Read counts come from a study about the profiling of poly(A)-selected RNA in LNCaP cells before and after androgen stimulation. Briefly, the study used LNCaP cells, which are androgen-sensitive human prostate adenocarcinoma cells, and treated the cells with DHT (endogenous androgen sex steroid and hormone) and with a mock treatment as the control. The poly-A RNAs were captured and sequenced on 7 lanes.

Read count table

file_url <- 'http://davetang.org/eg/pnas_expression.txt'
data <- read.table(file_url, header=T, sep="\t", stringsAsFactors=F, row.names=1)
head(data)
##                 lane1 lane2 lane3 lane4 lane5 lane6 lane8  len
## ENSG00000215696     0     0     0     0     0     0     0  330
## ENSG00000215700     0     0     0     0     0     0     0 2370
## ENSG00000215699     0     0     0     0     0     0     0 1842
## ENSG00000215784     0     0     0     0     0     0     0 2393
## ENSG00000212914     0     0     0     0     0     0     0  384
## ENSG00000212042     0     0     0     0     0     0     0   92
rmCol <- 8 # column "len" with no count information
rmZero <- rowSums(data[,-rmCol])>0 # genes zero counts across all lanes
table(rmZero)
## rmZero
## FALSE  TRUE 
## 15558 21877
data_subset <- data[rmZero,-rmCol]
Question: How many genes have more than zero expression?
Click for answer 21877 genes have some expression from at least one lane

Basic concepts of DGE analysis

Counts per million (CPM) normalization

data_subset_cpm <- data_subset
colSums(data_subset_cpm)
##   lane1   lane2   lane3   lane4   lane5   lane6   lane8 
##  978576 1156844 1442169 1485604 1823460 1834335  681743
norm_cpm <- function(x){
x * 1000000 / sum(x)
}
data_subset_cpm <- apply(data_subset_cpm, 2, norm_cpm)
colSums(data_subset_cpm)
## lane1 lane2 lane3 lane4 lane5 lane6 lane8 
## 1e+06 1e+06 1e+06 1e+06 1e+06 1e+06 1e+06

Mean and variance of CPM normalized read counts

## Basic statistics
mock_cols <- 1:4 # Column ID for mock treated samples
dht_cols <- 5:7  # Column ID for DHT treated samples
mock_var <- apply(data_subset_cpm[,mock_cols], 1, var)   # Variance for mock group
mock_mean <- apply(data_subset_cpm[,mock_cols], 1, mean) # Mean for mock group
dht_var <- apply(data_subset_cpm[,dht_cols], 1, var)     # Variance for DHT group
dht_mean <- apply(data_subset_cpm[,dht_cols], 1, mean)   # Mean for DHT group
cor(mock_mean, mock_var, method="spearman") # Mean-Variance correlation in mock group
## [1] 0.9353861
cor(dht_mean, dht_var, method="spearman")   # Mean-Variance correlation in DHT group
## [1] 0.9139193
## Plot mean-variance trends (mock)
mock <- as.data.frame(cbind(mock_mean, mock_var))
# Linear scale
ggplot(mock, aes(mock_mean, mock_var)) + 
  geom_point() + 
  geom_smooth(method="loess")
## `geom_smooth()` using formula = 'y ~ x'

# Log2 scale
ggplot(mock, aes(mock_mean, mock_var)) + 
  geom_point() + 
  geom_smooth(method="loess") + 
  scale_x_continuous(trans='log2') + 
  scale_y_continuous(trans='log2') + 
  labs(x = 'mock_mean (log2 scale)', y = 'mock_var (log2 scale)')
## `geom_smooth()` using formula = 'y ~ x'

## Plot mean-variance trends (DHT)
dht <- as.data.frame(cbind(dht_mean, dht_var))
# Linear scale
ggplot(dht, aes(dht_mean, dht_var)) + 
  geom_point() + 
  geom_smooth(method="loess")
## `geom_smooth()` using formula = 'y ~ x'

# Log2 scale
ggplot(dht, aes(dht_mean, dht_var)) + 
  geom_point() + 
  geom_smooth(method="loess") + 
  scale_x_continuous(trans='log2') + 
  scale_y_continuous(trans='log2') + 
  labs(x = 'dht_mean (log2 scale)', y = 'dht_var (log2 scale)')
## `geom_smooth()` using formula = 'y ~ x'

Question: How are mean and variance correlated?
Click for answer The mean and variance of gene expression is correlated; the higher the expression, the higher the variance.
Question How do mean and variance differ between Control and DHT treatment samples?
Click for answer

The gene with the highest variance in the DHT treatment samples is almost ten fold (max(dht_var)/max(mock_var)) higher than the gene with the highest variance in the mock samples.

  max(dht_var)/max(mock_var)
  ## [1] 8.572825

Poisson distribution

Simulate counts that follow a Poisson distribution

set.seed(31)
sim <- t(sapply(dht_mean, rpois, n=4))
sim_mean <- apply(sim, 1, mean)
sim_var <- apply(sim, 1, var)
cor(sim_mean, sim_var, method="spearman")
## [1] 0.9446539
sim.df <- as.data.frame(cbind(sim_mean, sim_var))
ggplot(sim.df, aes(sim_mean, sim_var)) + 
  geom_point() + 
  geom_smooth()
## `geom_smooth()` using method = 'gam' and formula = 'y ~ s(x, bs = "cs")'

ggplot(sim.df, aes(sim_mean, sim_var)) + 
  geom_point() + 
  geom_smooth() + 
  scale_x_continuous(trans='log2') + 
  scale_y_continuous(trans='log2')
## `geom_smooth()` using method = 'gam' and formula = 'y ~ s(x, bs = "cs")'

Question: How does the mean versus variance plot look for counts that follow a Poisson distribution?
Click for answer The mean versus the variance scatter plot of data simulated from a Poisson distribution is more symmetrical than real data.

edgeR analysis for simulated counts that follow the Poisson distribution

sim.e <- DGEList(counts = sim, group=c(rep("mock",2),c(rep("dht", 2))))

# calculate edgeR’s average log2(CPM) and tagwise dispersion
sim.e <- calcNormFactors(sim.e)
sim.e <- estimateCommonDisp(sim.e, verbose=T)
## Disp = 1e-04 , BCV = 0.01
sim.e <- estimateTagwiseDisp(sim.e)
summary(sim.e$tagwise.dispersion)
##      Min.   1st Qu.    Median      Mean   3rd Qu.      Max. 
## 1.571e-06 1.571e-06 1.571e-06 1.796e-03 4.838e-03 6.434e-03
# sanity check between edgeR’s average log2 cpm calculations versus our own
cor(sim.e$AveLogCPM, log2(sim_mean), method="spearman")
## [1] 0.9998214
plot(sim.e$AveLogCPM, log2(sim_mean), pch='.')

# plot the tagwise (gene-wise) dispersion against the average log2(CPM)
sim.df <- as.data.frame(cbind(sim.e$AveLogCPM, sim.e$tagwise.dispersion)); colnames(sim.df) <- c("AveLogCPM", "Dispersion")
ggplot(sim.df, aes(AveLogCPM, Dispersion)) + geom_point() + geom_smooth(method="loess")
## `geom_smooth()` using formula = 'y ~ x'

Question: How are the average log2-CPM calculated by edgeR and log2 mean raw count of the simulated data correlated?
Click for answer There is very high correlation despite of low counts.
Question: How are tagwise (gene-wise) dispersion and the average log2-CPM (expression level) correlated in the count data from the Poisson distribution?
Click for answer There is no dependency of dispersion and mean in the count data following the theoretical Poisson distribution.

Gene dispersion

Assume that you have 2 genes with the following counts:

# Create simple gene count data
simple_counts <- t(data.frame(Gene_1 = c(1000, 1010, 990, 1000), Gene_2 = c(10, 20, 0, 10)))
simple_counts
##        [,1] [,2] [,3] [,4]
## Gene_1 1000 1010  990 1000
## Gene_2   10   20    0   10
# Calculate variance
simple_vars <- apply(simple_counts, 1, var)
simple_vars
##   Gene_1   Gene_2 
## 66.66667 66.66667
# Calculate % fluctuation from mean
perc_fluctuate <- function(x){
  ((x-mean(x))/mean(x)) * 100
}
simple_fluc <- apply(simple_counts, 1, perc_fluctuate)
t(simple_fluc)
##        [,1] [,2] [,3] [,4]
## Gene_1    0    1   -1    0
## Gene_2    0  100 -100    0

Although the variance of these two genes are exactly the same, Gene_1 has 1% fluctuations around mean, while Gene_2 has 100% fluctuations around mean. Clearly, Gene_2 is more variable and this is handled by dispersion estimations. Dispersion estimation is sophisticated, however most modern tools like edgeR and DESeq2 handles it automatically.

The bigger dispersion means more variability across genes. For instance, if you estimate dispersion as 0.19, biological coefficient of variation (BCV, square root of dispersion) is equal to 0.44. This means that the expression values vary up and down by 44% between replicates.

Inspect dispersion of real data with edgeR

d <- data[rmZero,-rmCol]
# Working only with "mock" samples for simplicity. 
# In real life, we include all the samples for better dispersion estimation (below: estimateDispersion)
d <- d[,mock_cols] 
group <- c(rep("mock",4))
d.e <- DGEList(counts = d, group=group)

# calculate edgeR’s average log2(CPM) and tagwise dispersion
d.e <- calcNormFactors(d.e)
d.e <- estimateCommonDisp(d.e, verbose=T)
## Disp = 0.01932 , BCV = 0.139
d.e <- estimateTagwiseDisp(d.e)
summary(d.e$tagwise.dispersion)
##      Min.   1st Qu.    Median      Mean   3rd Qu.      Max. 
## 0.0003019 0.0209960 0.0536757 0.1563791 0.1994373 1.2338966
# sanity check between edgeR’s average log2 cpm calculations versus our own
cor(d.e$AveLogCPM, log2(mock_mean), method="spearman")
## [1] 0.9999541
plot(d.e$AveLogCPM, log2(mock_mean), pch='.')

# plot the tagwise (gene-wise) dispersion against the average log2(CPM)
d.df <- data.frame(AveLogCPM = d.e$AveLogCPM, Dispersion  = d.e$tagwise.dispersion)
ggplot(d.df, aes(AveLogCPM, Dispersion)) + 
  geom_point() + 
  geom_smooth(method="loess")
## `geom_smooth()` using formula = 'y ~ x'

Question: The square root of the negative binomial dispersion is the biological coefficient of variation (BCV) across the replicate samples. So how large is the variability of the normalized read counts in the (1) mock samples and (2) DHT treatment samples?
Click for answer The biological variability (BCV) of (1) mock samples is 13.9% and (2) DHT treatment samples is 14.8%.
Question: Are edgeR’s average log2-CPM calculation and our own log2 CPM the same?
Click for answer edgeR’s average log2-CPM calculations and our own are the same despite of the low counts.
Question: How are tagwise (gene-wise) dispersion and the average log2 CPM (expression level) correlated?
Click for answer In the real RNA-seq data we observe that the higher the expression, the lower the tagwise dispersion.

Filtering counts

Genes with low counts across libraries provide little evidence for differential expression. Because,

  • Biological aspect: The gene expression level should be at a certain minimal level before it is transcribed into proteins
  • Statistical aspect: It is presumed that the expression difference between the genes with low counts are not relevant; therefore, including them in multiple testing would inflate the type-1 error. If less statistical tests are performed, then the correction for multiple testing becomes less stringent.

One should consider removing the lowly expressed genes before proceeding with differential analysis. There are different strategies on filtering lowly expressed genes. For example:

  • DESeq2: Uses “independent” filtering using filter_p function from genefilter R package.
  • edgeR: Uses internal function filterByExprto remove lowly expressed genes across groups of interest.

Examples of these steps are provided below for each analysis pipeline.

DESeq2 differential expression analysis

Read count matrix

d <- data[,-rmCol]
sample_info <- data.frame(condition = c(rep("mock",4),rep("dht",3)), row.names = colnames(d))
sample_info
##       condition
## lane1      mock
## lane2      mock
## lane3      mock
## lane4      mock
## lane5       dht
## lane6       dht
## lane8       dht
#create DE dataset in DESeq2
DESeq.ds <- DESeqDataSetFromMatrix(countData = d, colData = sample_info, design = ~ condition)
#pre-filtering counts: although not necesseary, it will increase the speed due to less memory consumption 
mincount <- 0
DESeq.ds <- DESeq.ds[rowSums(counts(DESeq.ds)) > mincount , ]

Normalize counts

Counts normalized using DESeq2’s size factor:

#calculate and normalize with DESeq's size factor
colSums(counts(DESeq.ds))
##   lane1   lane2   lane3   lane4   lane5   lane6   lane8 
##  978576 1156844 1442169 1485604 1823460 1834335  681743
DESeq.ds <- estimateSizeFactors(DESeq.ds)
sizeFactors(DESeq.ds)
##     lane1     lane2     lane3     lane4     lane5     lane6     lane8 
## 0.7912131 0.9433354 1.1884939 1.2307145 1.4099376 1.4224762 0.5141056
colData(DESeq.ds)
## DataFrame with 7 rows and 2 columns
##       condition sizeFactor
##        <factor>  <numeric>
## lane1      mock   0.791213
## lane2      mock   0.943335
## lane3      mock   1.188494
## lane4      mock   1.230714
## lane5      dht    1.409938
## lane6      dht    1.422476
## lane8      dht    0.514106
counts.sf_normalized <- counts(DESeq.ds, normalized = TRUE)

#put two plots in the same plotting space
par(mfrow=c(1, 2))
#plot DESeq's library size normalized counts
boxplot(counts.sf_normalized, notch = TRUE, main = "untransformed read counts", ylab = "read counts")
#plot log2 transformed DESeq's normalized counts
log.norm.counts <- log2(counts.sf_normalized + 1)
boxplot(log.norm.counts, notch = TRUE, main = "log2 - transformed read counts", ylab = "log2(read counts)")

DESeq2’s rlog() function returns values that are both normalized for sequencing depth and transformed to the log2 scale where the values are adjusted to fit the experiment-wide trend of the variance-mean relationship

#Transformation of read counts including variance shrinkage
DESeq.rlog <- rlog(DESeq.ds, blind = TRUE )
rlog.norm.counts <- assay(DESeq.rlog)

par(mfrow=c(2, 2))
#plot raw counts
plot(counts(DESeq.ds)[,1:2], cex =.1 , main = "Raw read counts" )
#plot DESeq's library size normalized counts
plot(counts.sf_normalized[,1:2], cex =.1 , main = "library size normalized read counts" )
#plot log2 transformed DESeq's library size normalized counts
plot(log.norm.counts[,1:2], cex =.1 , main = "library size normalized log2 (read counts)")
#plot rlog-transformed counts
plot(rlog.norm.counts[,1:2], cex =.1, main = "rlog-transformed read counts")

#mean-sd-plots based on all samples of the experiment
msd_plot <- meanSdPlot(log.norm.counts, ranks=FALSE, plot=FALSE)
msd_plot$gg + ggtitle("library size normalized log2 (read counts)") + ylab (" standard deviation") + ylim(0,1)

msd_plot <- meanSdPlot(rlog.norm.counts, ranks=FALSE, plot=FALSE)
msd_plot$gg + ggtitle("rlog-transformed read counts") + ylab ("standard deviation") + ylim(0,1)

Question: Visually explore the normalized and transformed read counts between sample 1 and 2? Describe the differences between (1) raw read counts, (2) DESeq’s library size normalized counts, (3) log2 transformed DESeq’s library size normalized counts, and (4) DESeq’s rlog-transformed counts.
Click for answer The different normalization and transformation steps make the read counts between samples more comparable.
Question: Many statistical tests and analyses assume that data is homoskedastic, i.e. that all variables have similar variance. However, data with large differences among the sizes of the individual observations often shows heteroskedastic behavior. Do you see a dependence of the standard deviation (or variance) on the mean for (1) log2 transformed DESeq’s library size normalized counts, and (2) DESeq’s rlog-transformed counts?
Click for answer The dependence of the variance on the mean in the log transformed counts violate the assumption of homoskedasticity. The transformation of read counts including variance shrinkage with DESeq2’s rlog() function has removed this dependency (the standard deviation is roughly constant along the whole dynamic range).

Exploring global read count patterns

Hierarchical clustering

For hierarchical clustering of samples you have to decide about a similarity measure for pairwise comparison of individual samples and a linkage measure to define clusters. For background, see: https://chagall.med.cornell.edu/RNASEQcourse/Intro2RNAseq.pdf#subsubsection.5.3.2

distance.m_rlog <- as.dist(1-cor(rlog.norm.counts, method="pearson"))
plot(hclust(distance.m_rlog), labels = colnames(rlog.norm.counts), main="rlog transformed read counts\ndistance: Pearson correlation")

Question: Can you make a dendrogram with Euclidean distance and linkage method “average”?
Click for answer Run as.dist and hclust with the appropriate parameter settings. To find them check the help pages, e.g. by typing “?as.dist”.

Dimension reduction: PCA (Principal component analysis)

P <- plotPCA(DESeq.rlog)
P + theme_bw() + ggtitle("Rlog transformed counts")

Question: How well can the different conditions be separated through dimension reduction? How much of the variance is described by the principle component that separates conditions?
Click for answer The first principle component separates the two conditions and explains 93% of the variance in the data (which is pretty high!)

Differential expression

# Setup the factor levels for design 
str(colData(DESeq.ds)$condition)
##  Factor w/ 2 levels "dht","mock": 2 2 2 2 1 1 1
# We set "mock" as the first level which willbe treated as reference condition in the differential expression analysis
colData(DESeq.ds)$condition <- relevel(colData(DESeq.ds)$condition, "mock")
str(colData(DESeq.ds)$condition)
##  Factor w/ 2 levels "mock","dht": 1 1 1 1 2 2 2
# Re-defining design. 
# Although not necessary, it would be beneficial in case that you want to change your design down the pipeline
design(DESeq.ds) <- ~ condition

# Run DESeq
#instead of calling the function DESeq you can also run the following sequence of commands:
#dds <- estimateSizeFactors(dds)
#dds <- estimateDispersions(dds)
#dds <- nbinomWaldTest(dds)
DESeq.ds <- DESeq(DESeq.ds)
## using pre-existing size factors
## estimating dispersions
## gene-wise dispersion estimates
## mean-dispersion relationship
## final dispersion estimates
## fitting model and testing
# Differential expression results 
# Note the contrast = c("condition", "dht", "mock")
# We are comparing "dht" vs "mock" in column 'condition'
DGE.results <- results(object = DESeq.ds, 
                       independentFiltering = TRUE, 
                       alpha = 0.05, 
                       contrast = c("condition", "dht", "mock")
                       )
summary(DGE.results)
## 
## out of 21877 with nonzero total read count
## adjusted p-value < 0.05
## LFC > 0 (up)       : 1663, 7.6%
## LFC < 0 (down)     : 1468, 6.7%
## outliers [1]       : 0, 0%
## low counts [2]     : 8483, 39%
## (mean count < 3)
## [1] see 'cooksCutoff' argument of ?results
## [2] see 'independentFiltering' argument of ?results
Question: How many differentially expressed genes are found with multiple testing adjusted p<0.05?
Click for answer 1663 differentially expressed genes with LFC > 0 (up): 1663, 7.6%; and 1468 differentially expressed genes with LFC < 0 (down).

Visualization of DGE results

p-value histogram

The p-values follow a uniform distribution under null hypothesis. If you have a peak around zero, your alternative hypotheses live- along with some potential false positives.

If you would like to learn more about potential scenarious about how your p-value distribution might look like, here is a detailed blog post by David Robinson about “How to interpret a p-value histogram”.

# Histogram of p-values for all genes tested for no differential expression between the two conditions
hist(DGE.results$pvalue, col = "grey" , border = "white", xlab = "", ylab = "", main = "frequencies of p-values")

MA-plots

# MA plot
DESeq2::plotMA(DGE.results, alpha = 0.05, main = "Control vs. DHT-Treated" , ylim = c(-9,9))

Heatmaps

# Heatmap
DGE.results.sorted <- DGE.results[order(DGE.results$padj), ]
DGEgenes <- rownames(subset(DGE.results.sorted, padj < 0.05))
hm.mat_DGEgenes <- log.norm.counts[DGEgenes,]
# No clustering and no row (gene) scaling
aheatmap(hm.mat_DGEgenes, Rowv = NA, Colv = NA, main = "No clustering")

# Clustering and no row (gene) scaling.
aheatmap(hm.mat_DGEgenes, Rowv = TRUE, Colv = TRUE, distfun = "euclidean", hclustfun = "average", main = "Clustering without row (gene) scaling")

# Clustering and row (gene) scaling.
aheatmap(hm.mat_DGEgenes, Rowv = TRUE, Colv = TRUE, distfun = "euclidean", hclustfun = "average", scale = "row", main = "Clustering with row (gene) scaling")

# Gene scaling before hierarchical clustering.
aheatmap(t(scale(t(hm.mat_DGEgenes))), Rowv = TRUE, Colv = TRUE, distfun = "euclidean", hclustfun = "average", main = "Clustering after row (gene) scaling, and without sample-wise clustering")

Volcano plots

volDat <- as.data.frame(DGE.results.sorted)
volDat <- volDat[!is.na(volDat$padj), ]
volDat$DGE <- volDat$padj < 0.05

ggplot(volDat, aes(x = log2FoldChange, y = -log10(padj), color = DGE)) + 
  geom_point(alpha = 0.50)

Annotating DGE results

# Read counts of a single gene
#keytypes(org.Hs.eg.db)
#columns(org.Hs.eg.db)
anno <- select(org.Hs.eg.db, keys = DGEgenes, keytype = "ENSEMBL", columns = c("GENENAME", "SYMBOL"))
## 'select()' returned 1:many mapping between keys and columns
DGE.results.sorted_logFC <- DGE.results[order(DGE.results$log2FoldChange),]
DGEgenes_logFC <- rownames(subset(DGE.results.sorted_logFC, padj < 0.05))
head(anno[match(DGEgenes_logFC, anno$ENSEMBL),])
##              ENSEMBL                                     GENENAME   SYMBOL
## 391  ENSG00000091972                               CD200 molecule    CD200
## 825  ENSG00000100373                                 uroplakin 3A    UPK3A
## 975  ENSG00000118513     MYB proto-oncogene, transcription factor      MYB
## 1062 ENSG00000081237 protein tyrosine phosphatase receptor type C    PTPRC
## 1057 ENSG00000196660           solute carrier family 30 member 10 SLC30A10
## 1131 ENSG00000117245                     kinesin family member 17    KIF17
# merge DFE analysis with gene annotation
out.df <- merge(as.data.frame(DGE.results), anno, by.x = "row.names", by.y = "ENSEMBL")
# filter gene of interest
out.df %>% filter(SYMBOL=="TMEFF2")
##         Row.names baseMean log2FoldChange     lfcSE      stat      pvalue
## 1 ENSG00000144339 209.3953      -1.021127 0.1544868 -6.609805 3.84826e-11
##           padj
## 1 9.562819e-10
##                                                                 GENENAME SYMBOL
## 1 transmembrane protein with EGF like and two follistatin like domains 2 TMEFF2
write.table(out.df, file = "de_exercise_DESeq2results.txt", sep = "\t", quote = FALSE, row.names = FALSE)


# Plot individual genes
subset(anno, SYMBOL == "TMEFF2") #known differential expressed gene in LNCaP human prostate cancer progression model
##             ENSEMBL
## 545 ENSG00000144339
##                                                                   GENENAME
## 545 transmembrane protein with EGF like and two follistatin like domains 2
##     SYMBOL
## 545 TMEFF2
plotCounts(dds = DESeq.ds , gene = "ENSG00000144339", normalized = TRUE, transform = FALSE, main = expression(atop("Expression of "*italic("TMEFF2"))))

Question: What is the function of TMEFF2? Does the expression profile of the gene fit with your expectations?
Click for answer Would love to hear your interpretation personally :) For instance check the following publication: https://www.nature.com/articles/1205142