The goal of this exercise is to introduce you to Bioconductor and some basic
analyses on raw read counts including normalization, transformation,
visualization and differential expression. You will also examine the
variance and the dispersion in RNA-seq data - two concepts that are
needed to understand how DESeq2
and edgeR
pipelines work for differential expression.
This is an R Markdown document. Markdown is a simple formatting syntax for authoring HTML, PDF, and MS Word documents. For more details on using R Markdown see http://rmarkdown.rstudio.com. When you click the Knit button in Rstudio a document will be generated that includes both content as well as the output of any embedded R code chunks within the document.
Bioconductor project host
several open source R packages that are designed for the analysis of
biological assays. The main tools edgeR
and DESeq2
are part of Bioconductor repository. R packages from Bioconductor comes
with vignettes that are designed to run a full analysis for each tool
deposited there. Vignettes and detailed documents about how to use the R
package is of great help for anyone who performs bioinformatics
analysis. For example, both edgeR
and DESeq2
has great manuals (perhaps the best in Bioconductor).
The exercise is based on the following two sources:
Read counts come from a study about the profiling of poly(A)-selected RNA in LNCaP cells before and after androgen stimulation. Briefly, the study used LNCaP cells, which are androgen-sensitive human prostate adenocarcinoma cells, and treated the cells with DHT (endogenous androgen sex steroid and hormone) and with a mock treatment as the control. The poly-A RNAs were captured and sequenced on 7 lanes.
file_url <- 'http://davetang.org/eg/pnas_expression.txt'
data <- read.table(file_url, header=T, sep="\t", stringsAsFactors=F, row.names=1)
head(data)
## lane1 lane2 lane3 lane4 lane5 lane6 lane8 len
## ENSG00000215696 0 0 0 0 0 0 0 330
## ENSG00000215700 0 0 0 0 0 0 0 2370
## ENSG00000215699 0 0 0 0 0 0 0 1842
## ENSG00000215784 0 0 0 0 0 0 0 2393
## ENSG00000212914 0 0 0 0 0 0 0 384
## ENSG00000212042 0 0 0 0 0 0 0 92
rmCol <- 8 # column "len" with no count information
rmZero <- rowSums(data[,-rmCol])>0 # genes zero counts across all lanes
table(rmZero)
## rmZero
## FALSE TRUE
## 15558 21877
data_subset <- data[rmZero,-rmCol]
Question: How many genes have more than zero
expression?
data_subset_cpm <- data_subset
colSums(data_subset_cpm)
## lane1 lane2 lane3 lane4 lane5 lane6 lane8
## 978576 1156844 1442169 1485604 1823460 1834335 681743
norm_cpm <- function(x){
x * 1000000 / sum(x)
}
data_subset_cpm <- apply(data_subset_cpm, 2, norm_cpm)
colSums(data_subset_cpm)
## lane1 lane2 lane3 lane4 lane5 lane6 lane8
## 1e+06 1e+06 1e+06 1e+06 1e+06 1e+06 1e+06
## Basic statistics
mock_cols <- 1:4 # Column ID for mock treated samples
dht_cols <- 5:7 # Column ID for DHT treated samples
mock_var <- apply(data_subset_cpm[,mock_cols], 1, var) # Variance for mock group
mock_mean <- apply(data_subset_cpm[,mock_cols], 1, mean) # Mean for mock group
dht_var <- apply(data_subset_cpm[,dht_cols], 1, var) # Variance for DHT group
dht_mean <- apply(data_subset_cpm[,dht_cols], 1, mean) # Mean for DHT group
cor(mock_mean, mock_var, method="spearman") # Mean-Variance correlation in mock group
## [1] 0.9353861
cor(dht_mean, dht_var, method="spearman") # Mean-Variance correlation in DHT group
## [1] 0.9139193
## Plot mean-variance trends (mock)
mock <- as.data.frame(cbind(mock_mean, mock_var))
# Linear scale
ggplot(mock, aes(mock_mean, mock_var)) +
geom_point() +
geom_smooth(method="loess")
## `geom_smooth()` using formula = 'y ~ x'
# Log2 scale
ggplot(mock, aes(mock_mean, mock_var)) +
geom_point() +
geom_smooth(method="loess") +
scale_x_continuous(trans='log2') +
scale_y_continuous(trans='log2') +
labs(x = 'mock_mean (log2 scale)', y = 'mock_var (log2 scale)')
## `geom_smooth()` using formula = 'y ~ x'
## Plot mean-variance trends (DHT)
dht <- as.data.frame(cbind(dht_mean, dht_var))
# Linear scale
ggplot(dht, aes(dht_mean, dht_var)) +
geom_point() +
geom_smooth(method="loess")
## `geom_smooth()` using formula = 'y ~ x'
# Log2 scale
ggplot(dht, aes(dht_mean, dht_var)) +
geom_point() +
geom_smooth(method="loess") +
scale_x_continuous(trans='log2') +
scale_y_continuous(trans='log2') +
labs(x = 'dht_mean (log2 scale)', y = 'dht_var (log2 scale)')
## `geom_smooth()` using formula = 'y ~ x'
The gene with the highest variance in the DHT treatment samples is
almost ten fold (max(dht_var)/max(mock_var)
) higher than
the gene with the highest variance in the mock samples.
max(dht_var)/max(mock_var)
## [1] 8.572825
set.seed(31)
sim <- t(sapply(dht_mean, rpois, n=4))
sim_mean <- apply(sim, 1, mean)
sim_var <- apply(sim, 1, var)
cor(sim_mean, sim_var, method="spearman")
## [1] 0.9446539
sim.df <- as.data.frame(cbind(sim_mean, sim_var))
ggplot(sim.df, aes(sim_mean, sim_var)) +
geom_point() +
geom_smooth()
## `geom_smooth()` using method = 'gam' and formula = 'y ~ s(x, bs = "cs")'
ggplot(sim.df, aes(sim_mean, sim_var)) +
geom_point() +
geom_smooth() +
scale_x_continuous(trans='log2') +
scale_y_continuous(trans='log2')
## `geom_smooth()` using method = 'gam' and formula = 'y ~ s(x, bs = "cs")'
edgeR
analysis for simulated counts that follow the
Poisson distributionsim.e <- DGEList(counts = sim, group=c(rep("mock",2),c(rep("dht", 2))))
# calculate edgeR’s average log2(CPM) and tagwise dispersion
sim.e <- calcNormFactors(sim.e)
sim.e <- estimateCommonDisp(sim.e, verbose=T)
## Disp = 1e-04 , BCV = 0.01
sim.e <- estimateTagwiseDisp(sim.e)
summary(sim.e$tagwise.dispersion)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 1.571e-06 1.571e-06 1.571e-06 1.796e-03 4.838e-03 6.434e-03
# sanity check between edgeR’s average log2 cpm calculations versus our own
cor(sim.e$AveLogCPM, log2(sim_mean), method="spearman")
## [1] 0.9998214
plot(sim.e$AveLogCPM, log2(sim_mean), pch='.')
# plot the tagwise (gene-wise) dispersion against the average log2(CPM)
sim.df <- as.data.frame(cbind(sim.e$AveLogCPM, sim.e$tagwise.dispersion)); colnames(sim.df) <- c("AveLogCPM", "Dispersion")
ggplot(sim.df, aes(AveLogCPM, Dispersion)) + geom_point() + geom_smooth(method="loess")
## `geom_smooth()` using formula = 'y ~ x'
Assume that you have 2 genes with the following counts:
# Create simple gene count data
simple_counts <- t(data.frame(Gene_1 = c(1000, 1010, 990, 1000), Gene_2 = c(10, 20, 0, 10)))
simple_counts
## [,1] [,2] [,3] [,4]
## Gene_1 1000 1010 990 1000
## Gene_2 10 20 0 10
# Calculate variance
simple_vars <- apply(simple_counts, 1, var)
simple_vars
## Gene_1 Gene_2
## 66.66667 66.66667
# Calculate % fluctuation from mean
perc_fluctuate <- function(x){
((x-mean(x))/mean(x)) * 100
}
simple_fluc <- apply(simple_counts, 1, perc_fluctuate)
t(simple_fluc)
## [,1] [,2] [,3] [,4]
## Gene_1 0 1 -1 0
## Gene_2 0 100 -100 0
Although the variance of these two genes are exactly the same, Gene_1
has 1% fluctuations around mean, while Gene_2 has 100% fluctuations
around mean. Clearly, Gene_2 is more variable and this is handled by
dispersion estimations. Dispersion estimation is
sophisticated, however most modern tools like edgeR
and
DESeq2
handles it automatically.
The bigger dispersion means more variability across genes. For instance, if you estimate dispersion as 0.19, biological coefficient of variation (BCV, square root of dispersion) is equal to 0.44. This means that the expression values vary up and down by 44% between replicates.
edgeR
d <- data[rmZero,-rmCol]
# Working only with "mock" samples for simplicity.
# In real life, we include all the samples for better dispersion estimation (below: estimateDispersion)
d <- d[,mock_cols]
group <- c(rep("mock",4))
d.e <- DGEList(counts = d, group=group)
# calculate edgeR’s average log2(CPM) and tagwise dispersion
d.e <- calcNormFactors(d.e)
d.e <- estimateCommonDisp(d.e, verbose=T)
## Disp = 0.01932 , BCV = 0.139
d.e <- estimateTagwiseDisp(d.e)
summary(d.e$tagwise.dispersion)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.0003019 0.0209960 0.0536757 0.1563791 0.1994373 1.2338966
# sanity check between edgeR’s average log2 cpm calculations versus our own
cor(d.e$AveLogCPM, log2(mock_mean), method="spearman")
## [1] 0.9999541
plot(d.e$AveLogCPM, log2(mock_mean), pch='.')
# plot the tagwise (gene-wise) dispersion against the average log2(CPM)
d.df <- data.frame(AveLogCPM = d.e$AveLogCPM, Dispersion = d.e$tagwise.dispersion)
ggplot(d.df, aes(AveLogCPM, Dispersion)) +
geom_point() +
geom_smooth(method="loess")
## `geom_smooth()` using formula = 'y ~ x'
Genes with low counts across libraries provide little evidence for differential expression. Because,
One should consider removing the lowly expressed genes before proceeding with differential analysis. There are different strategies on filtering lowly expressed genes. For example:
DESeq2
: Uses “independent” filtering using
filter_p
function from genefilter
R package.edgeR
: Uses internal function
filterByExpr
to remove lowly expressed genes across groups
of interest.Examples of these steps are provided below for each analysis pipeline.
DESeq2
differential expression analysisd <- data[,-rmCol]
sample_info <- data.frame(condition = c(rep("mock",4),rep("dht",3)), row.names = colnames(d))
sample_info
## condition
## lane1 mock
## lane2 mock
## lane3 mock
## lane4 mock
## lane5 dht
## lane6 dht
## lane8 dht
#create DE dataset in DESeq2
DESeq.ds <- DESeqDataSetFromMatrix(countData = d, colData = sample_info, design = ~ condition)
#pre-filtering counts: although not necesseary, it will increase the speed due to less memory consumption
mincount <- 0
DESeq.ds <- DESeq.ds[rowSums(counts(DESeq.ds)) > mincount , ]
Counts normalized using DESeq2’s size factor:
#calculate and normalize with DESeq's size factor
colSums(counts(DESeq.ds))
## lane1 lane2 lane3 lane4 lane5 lane6 lane8
## 978576 1156844 1442169 1485604 1823460 1834335 681743
DESeq.ds <- estimateSizeFactors(DESeq.ds)
sizeFactors(DESeq.ds)
## lane1 lane2 lane3 lane4 lane5 lane6 lane8
## 0.7912131 0.9433354 1.1884939 1.2307145 1.4099376 1.4224762 0.5141056
colData(DESeq.ds)
## DataFrame with 7 rows and 2 columns
## condition sizeFactor
## <factor> <numeric>
## lane1 mock 0.791213
## lane2 mock 0.943335
## lane3 mock 1.188494
## lane4 mock 1.230714
## lane5 dht 1.409938
## lane6 dht 1.422476
## lane8 dht 0.514106
counts.sf_normalized <- counts(DESeq.ds, normalized = TRUE)
#put two plots in the same plotting space
par(mfrow=c(1, 2))
#plot DESeq's library size normalized counts
boxplot(counts.sf_normalized, notch = TRUE, main = "untransformed read counts", ylab = "read counts")
#plot log2 transformed DESeq's normalized counts
log.norm.counts <- log2(counts.sf_normalized + 1)
boxplot(log.norm.counts, notch = TRUE, main = "log2 - transformed read counts", ylab = "log2(read counts)")
DESeq2’s rlog() function returns values that are both normalized for sequencing depth and transformed to the log2 scale where the values are adjusted to fit the experiment-wide trend of the variance-mean relationship
#Transformation of read counts including variance shrinkage
DESeq.rlog <- rlog(DESeq.ds, blind = TRUE )
rlog.norm.counts <- assay(DESeq.rlog)
par(mfrow=c(2, 2))
#plot raw counts
plot(counts(DESeq.ds)[,1:2], cex =.1 , main = "Raw read counts" )
#plot DESeq's library size normalized counts
plot(counts.sf_normalized[,1:2], cex =.1 , main = "library size normalized read counts" )
#plot log2 transformed DESeq's library size normalized counts
plot(log.norm.counts[,1:2], cex =.1 , main = "library size normalized log2 (read counts)")
#plot rlog-transformed counts
plot(rlog.norm.counts[,1:2], cex =.1, main = "rlog-transformed read counts")
#mean-sd-plots based on all samples of the experiment
msd_plot <- meanSdPlot(log.norm.counts, ranks=FALSE, plot=FALSE)
msd_plot$gg + ggtitle("library size normalized log2 (read counts)") + ylab (" standard deviation") + ylim(0,1)
msd_plot <- meanSdPlot(rlog.norm.counts, ranks=FALSE, plot=FALSE)
msd_plot$gg + ggtitle("rlog-transformed read counts") + ylab ("standard deviation") + ylim(0,1)
For hierarchical clustering of samples you have to decide about a similarity measure for pairwise comparison of individual samples and a linkage measure to define clusters. For background, see: https://chagall.med.cornell.edu/RNASEQcourse/Intro2RNAseq.pdf#subsubsection.5.3.2
distance.m_rlog <- as.dist(1-cor(rlog.norm.counts, method="pearson"))
plot(hclust(distance.m_rlog), labels = colnames(rlog.norm.counts), main="rlog transformed read counts\ndistance: Pearson correlation")
as.dist
and hclust
with the appropriate
parameter settings. To find them check the help pages, e.g. by typing
“?as.dist”.
P <- plotPCA(DESeq.rlog)
P + theme_bw() + ggtitle("Rlog transformed counts")
# Setup the factor levels for design
str(colData(DESeq.ds)$condition)
## Factor w/ 2 levels "dht","mock": 2 2 2 2 1 1 1
# We set "mock" as the first level which willbe treated as reference condition in the differential expression analysis
colData(DESeq.ds)$condition <- relevel(colData(DESeq.ds)$condition, "mock")
str(colData(DESeq.ds)$condition)
## Factor w/ 2 levels "mock","dht": 1 1 1 1 2 2 2
# Re-defining design.
# Although not necessary, it would be beneficial in case that you want to change your design down the pipeline
design(DESeq.ds) <- ~ condition
# Run DESeq
#instead of calling the function DESeq you can also run the following sequence of commands:
#dds <- estimateSizeFactors(dds)
#dds <- estimateDispersions(dds)
#dds <- nbinomWaldTest(dds)
DESeq.ds <- DESeq(DESeq.ds)
## using pre-existing size factors
## estimating dispersions
## gene-wise dispersion estimates
## mean-dispersion relationship
## final dispersion estimates
## fitting model and testing
# Differential expression results
# Note the contrast = c("condition", "dht", "mock")
# We are comparing "dht" vs "mock" in column 'condition'
DGE.results <- results(object = DESeq.ds,
independentFiltering = TRUE,
alpha = 0.05,
contrast = c("condition", "dht", "mock")
)
summary(DGE.results)
##
## out of 21877 with nonzero total read count
## adjusted p-value < 0.05
## LFC > 0 (up) : 1663, 7.6%
## LFC < 0 (down) : 1468, 6.7%
## outliers [1] : 0, 0%
## low counts [2] : 8483, 39%
## (mean count < 3)
## [1] see 'cooksCutoff' argument of ?results
## [2] see 'independentFiltering' argument of ?results
Question: How many differentially expressed genes are
found with multiple testing adjusted p<0.05?
The p-values follow a uniform distribution under null hypothesis. If you have a peak around zero, your alternative hypotheses live- along with some potential false positives.
If you would like to learn more about potential scenarious about how your p-value distribution might look like, here is a detailed blog post by David Robinson about “How to interpret a p-value histogram”.
# Histogram of p-values for all genes tested for no differential expression between the two conditions
hist(DGE.results$pvalue, col = "grey" , border = "white", xlab = "", ylab = "", main = "frequencies of p-values")
# MA plot
DESeq2::plotMA(DGE.results, alpha = 0.05, main = "Control vs. DHT-Treated" , ylim = c(-9,9))
# Heatmap
DGE.results.sorted <- DGE.results[order(DGE.results$padj), ]
DGEgenes <- rownames(subset(DGE.results.sorted, padj < 0.05))
hm.mat_DGEgenes <- log.norm.counts[DGEgenes,]
# No clustering and no row (gene) scaling
aheatmap(hm.mat_DGEgenes, Rowv = NA, Colv = NA, main = "No clustering")
# Clustering and no row (gene) scaling.
aheatmap(hm.mat_DGEgenes, Rowv = TRUE, Colv = TRUE, distfun = "euclidean", hclustfun = "average", main = "Clustering without row (gene) scaling")
# Clustering and row (gene) scaling.
aheatmap(hm.mat_DGEgenes, Rowv = TRUE, Colv = TRUE, distfun = "euclidean", hclustfun = "average", scale = "row", main = "Clustering with row (gene) scaling")
# Gene scaling before hierarchical clustering.
aheatmap(t(scale(t(hm.mat_DGEgenes))), Rowv = TRUE, Colv = TRUE, distfun = "euclidean", hclustfun = "average", main = "Clustering after row (gene) scaling, and without sample-wise clustering")
volDat <- as.data.frame(DGE.results.sorted)
volDat <- volDat[!is.na(volDat$padj), ]
volDat$DGE <- volDat$padj < 0.05
ggplot(volDat, aes(x = log2FoldChange, y = -log10(padj), color = DGE)) +
geom_point(alpha = 0.50)
# Read counts of a single gene
#keytypes(org.Hs.eg.db)
#columns(org.Hs.eg.db)
anno <- select(org.Hs.eg.db, keys = DGEgenes, keytype = "ENSEMBL", columns = c("GENENAME", "SYMBOL"))
## 'select()' returned 1:many mapping between keys and columns
DGE.results.sorted_logFC <- DGE.results[order(DGE.results$log2FoldChange),]
DGEgenes_logFC <- rownames(subset(DGE.results.sorted_logFC, padj < 0.05))
head(anno[match(DGEgenes_logFC, anno$ENSEMBL),])
## ENSEMBL GENENAME SYMBOL
## 391 ENSG00000091972 CD200 molecule CD200
## 825 ENSG00000100373 uroplakin 3A UPK3A
## 975 ENSG00000118513 MYB proto-oncogene, transcription factor MYB
## 1062 ENSG00000081237 protein tyrosine phosphatase receptor type C PTPRC
## 1057 ENSG00000196660 solute carrier family 30 member 10 SLC30A10
## 1131 ENSG00000117245 kinesin family member 17 KIF17
# merge DFE analysis with gene annotation
out.df <- merge(as.data.frame(DGE.results), anno, by.x = "row.names", by.y = "ENSEMBL")
# filter gene of interest
out.df %>% filter(SYMBOL=="TMEFF2")
## Row.names baseMean log2FoldChange lfcSE stat pvalue
## 1 ENSG00000144339 209.3953 -1.021127 0.1544868 -6.609805 3.84826e-11
## padj
## 1 9.562819e-10
## GENENAME SYMBOL
## 1 transmembrane protein with EGF like and two follistatin like domains 2 TMEFF2
write.table(out.df, file = "de_exercise_DESeq2results.txt", sep = "\t", quote = FALSE, row.names = FALSE)
# Plot individual genes
subset(anno, SYMBOL == "TMEFF2") #known differential expressed gene in LNCaP human prostate cancer progression model
## ENSEMBL
## 545 ENSG00000144339
## GENENAME
## 545 transmembrane protein with EGF like and two follistatin like domains 2
## SYMBOL
## 545 TMEFF2
plotCounts(dds = DESeq.ds , gene = "ENSG00000144339", normalized = TRUE, transform = FALSE, main = expression(atop("Expression of "*italic("TMEFF2"))))