\begin{center} \date{September, 19, 2017}
\small{$^{1}$ Centre for Genomic Sciences, LKS Faculty of Medicine, The}\protect \ \small{University of Hong Kong, Hong Kong SAR, China;}\protect \ \small{$^{2}$ School of Biomedical Sciences, LKS Faculty of Medicine, The}\protect \ \small{University of Hong Kong, Hong Kong SAR, China;}\protect \ \small{$^{3}$ Department of Health Sciences Research and Center for Individualized}\protect \ \small{Medicine, Mayo Clinic, Scottsdale, AZ, 85259, USA;}\protect \ \small{$^{4}$ Department of Psychiatry, LKS Faculty of Medicine, The}\protect \ \small{University of Hong Kong, Hong Kong SAR, China;}\protect \ \small{$^{5}$ State Key Laboratory in Cognitive and Brain Sciences, The}\protect \ \small{University of Hong Kong, Hong Kong SAR, China;}\protect \ \small{$^{6}$ Department of Biomedical Informatics, Arizona State University,}\protect \ \small{Scottsdale, AZ, 85259, USA.}\protect \
\end{center}
\bigskip
\begin{abstract} Linnorm is an R package for the analysis of single cell RNA-seq (scRNA-seq), RNA-seq, ChIP-seq count data or any large scale count data. Its main function is to normalize and transform such datasets for parametric tests\footnote{Implemented as Linnorm and Linnorm.Norm}. Various functions and analysis pipelines are also implemented for users' convenience, including data imputation\footnote{Implemented as the Linnorm.DataImput function.}, stable gene selection for scRNA-seq\footnote{for scRNA-seq data without spike-in genes or users who do not want to rely on spike-in genes. Implemented as the Linnorm.SGenes function.}, using \Biocpkg{limma} for differential expression analysis or differential peak detection\footnote{The Linnorm-limma pipeline is implemented as the "Linnorm.limma" function. Please cite both Linnorm and limma if you use this function for publication.}, co-expression network analysis\footnote{Implemented as the Linnorm.Cor function.}, highly variable gene discovery\footnote{Implemented as the Linnorm.HVar function.} and cell subpopulation analysis with t-SNE K-means clustering\footnote{Implemented as the Linnorm.tSNE function.}, PCA K-means clustering\footnote{Implemented as the Linnorm.PCA function.} and hierarchical clustering analysis\footnote{Implemented as the Linnorm.HClust function.}. Linnorm can work with raw count, CPM, RPKM, FPKM and TPM and is compatible with data generated from simple count algorithms\footnote{Such as \href{http://www-huber.embl.de/HTSeq/doc/overview.html}{HTSeq}, \Biocpkg{Rsubread} and etc} and supervised learning algorithms\footnote{Such as \href{http://cole-trapnell-lab.github.io/cufflinks/}{Cufflinks}, \href{http://bio.math.berkeley.edu/eXpress/index.html}{eXpress}, \href{https://pachterlab.github.io/kallisto/}{Kallisto}, \href{http://deweylab.github.io/RSEM/}{RSEM}, \href{http://www.cs.cmu.edu/~ckingsf/software/sailfish/}{Sailfish}, and etc}. Additionally, the RnaXSim function is included for simulating RNA-seq data for the evaluation of DEG analysis methods.
\end{abstract}
\newpage
\setcounter{tocdepth}{4} \tableofcontents
\newpage
Linnorm is a normalization and transformation method.
The Linnorm R package contains a variety of functions for single cell RNA-seq data analysis by utilizing Linnorm.
Compared to RNA-seq or other large scale count data, scRNA-seq data have more noises and more zero counts. Linnorm is developed to work with scRNA-seq data. Since RNA-seq data are similar to scRNA-seq data, but with less noise, Linnorm is also compatible with RNA-seq data analysis.
Linnorm accepts any RNA-seq Expression data, including but not limited to
We suggest RPKM, FPKM or TPM for most purposes. Please do NOT input Log transformed datasets, as Linnorm will perform log transformation.
Linnorm accepts matrix as its data type. Data frames are also accepted and will be automatically converted into the matrix format before analysis. Each column in the matrix should be a sample or replicate. Each row should be a Gene/Exon/Isoform/etc.
Example:
| |Sample 1|Sample 2|Sample 3|... | |:---:|:------:|:------:|:------:|:----| |Gene1|1 |2 |1 |... | |Gene2|3 |0 |4 |... | |Gene3|10.87 |11.56 |12.98 |... | |... |... |... |... |... | |... |... |... |... |... |
Please note that undefined values such as NA, NaN, INF, etc are NOT supported.
By setting the RowSamples argument as TRUE, Linnorm can accept a matrix where each column is a sample or replicate and each row should be a Gene/Exon/Isoform/etc. Example:
| |Gene 1|Gene 2|Gene 3|... | |:-----:|:----:|:----:|:----:|:----| |Sample1|1 |3 |10.87 |... | |Sample2|2 |0 |11.56 |... | |Sample3|1 |4 |12.98 |... | |..... |... |... |... |... | |..... |... |... |... |... |
Note that Linnorm should work slightly faster with RowSamples set to TRUE.
By using the argument, "input = "Linnorm"", several functions provided by the Linnorm package can also accept Linnorm transformed datasets as input. Therefore, the dataset doesn't need to be re-transformed multiple times for multiple functions.
Instructions can be found on Linnorm's \href{https://bioconductor.org/packages/release/bioc/html/Linnorm.html}{bioconductor page}. Please note that some functions in this vignette are only available in Linnorm version 1.99.x+.
\newpage
Here, we will demonstrate how to generate and ouput Linnorm Transformed dataset into a tab delimited file.
library(Linnorm) #Obtain data data(Islam2011) #Do not filter gene list: Transformed <- Linnorm(Islam2011) #Filter low count genes FTransformed <- Linnorm(Islam2011, Filter=TRUE)
#You can use this file with Excel. #This file includes all genes. write.table(Transformed, "Transformed.txt", quote=FALSE, sep="\t", col.names=TRUE, row.names=TRUE) #This file filtered low count genes. write.table(FTransformed, "Transformed.txt", quote=FALSE, sep="\t", col.names=TRUE, row.names=TRUE)
Here, we will demonstrate how to generate and ouput Linnorm normalized dataset into a tab delimited file.
library(Linnorm) Normalized <- Linnorm.Norm(Islam2011) #Important: depending on downstream analysis, please #set output to be "XPM" or "Raw". #Set to "XPM" if downstream tools will convert the #dataset into CPM or TPM. #Set to "Raw" if input is raw counts and downstream #tools will work with raw counts.
#You can use this file with Excel. write.table(Normalized, "Normalized.txt", quote=FALSE, sep="\t", col.names=TRUE, row.names=TRUE)
Fold change can be calculated by the Linnorm.limma function. It is included in differential expression analysis results. However, for users who would like to calculate fold change from Linnorm transformed dataset and analyze it. Here is an example.
library(Linnorm) data(LIHC)
#Now, we can calculate fold changes between #sample set 1 and sample set 2. #Index of sample set 1 set1 <- 1:5 #Index of sample set 2 set2 <- 6:10 #Normalization LIHC2 <- Linnorm.Norm(LIHC,output="XPM") #Optional: Only use genes with at least 50% #of the values being non-zero LIHC2 <- LIHC2[rowSums(LIHC2 != 0) >= ncol(LIHC2)/2,]
#Put resulting data into a matrix FCMatrix <- matrix(0, ncol=1, nrow=nrow(LIHC2)) rownames(FCMatrix) <- rownames(LIHC2) colnames(FCMatrix) <- c("Log 2 Fold Change") FCMatrix[,1] <- log((rowMeans(LIHC2[,set1])+1)/(rowMeans(LIHC2[,set2])+1),2) #Now FCMatrix contains fold change results. #If the optional filtering step is not done, #users might need to remove infinite and zero values: #Remove Infinite values. FCMatrix <- FCMatrix[!is.infinite(FCMatrix[,1]),,drop=FALSE] #Remove Zero values FCMatrix <- FCMatrix[FCMatrix[,1] != 0,,drop=FALSE] #Now FCMatrix contains fold change results.
Density <- density(FCMatrix[,1]) plot(Density$x,Density$y,type="n",xlab="Log 2 Fold Change of TPM+1", ylab="Probability Density",) lines(Density$x,Density$y, lwd=1.5, col="blue") title("Probability Density of Fold Change.\nTCGA Partial LIHC data Paired Tumor vs Adjacent Normal") legend("topright",legend=paste("mean = ", round(mean(FCMatrix[,1]),2), "\nStdev = ", round(sd(FCMatrix[,1]),2)))
#Normalization with BE_strength set to 1. #This increases normalization strength. LIHC2 <- Linnorm.Norm(LIHC,output="XPM",BE_strength=1) #Optional: Only use genes with at least 50% #of the values being non-zero LIHC2 <- LIHC2[rowSums(LIHC2 != 0) >= ncol(LIHC2)/2,] FCMatrix <- matrix(0, ncol=1, nrow=nrow(LIHC2)) rownames(FCMatrix) <- rownames(LIHC2) colnames(FCMatrix) <- c("Log 2 Fold Change") FCMatrix[,1] <- log((rowMeans(LIHC2[,set1])+1)/(rowMeans(LIHC2[,set2])+1),2) #Now FCMatrix contains fold change results. Density <- density(FCMatrix[,1]) plot(Density$x,Density$y,type="n",xlab="Log 2 Fold Change of TPM+1", ylab="Probability Density",) lines(Density$x,Density$y, lwd=1.5, col="blue") title("Probability Density of Fold Change.\nTCGA Partial LIHC data Paired Tumor vs Adjacent Normal") legend("topright",legend=paste("mean = ", round(mean(FCMatrix[,1]),2), "\nStdev = ", round(sd(FCMatrix[,1]),2)))
We can see that using a stronger normalization strength can decrease the amount of differences between the two conditions; this causes the mean of fold change to decrease. By default, BE_strength is set to 0.5 to prevent overfitting. However, user can adjust it based on the dataset. Please note that the Linnorm() function for normalizing transformation also allows users to control normalization strength.
By default, Linnorm's data imputation function replaces zerores in the dataset.
However, we do not recommend performing data imputation unless it is shown to improve results.
library(Linnorm) data(Islam2011) #Obtain transformed data Transformed <- Linnorm(Islam2011) #Data Imputation DataImputed <- Linnorm.DataImput(Transformed) #Or, users can directly perform data imputation during transformation. DataImputed <- Linnorm(Islam2011,DataImputation=TRUE)
#You can use this file with Excel. write.table(DataImputed, "DataImputed.txt", quote=FALSE, sep="\t", col.names=TRUE, row.names=TRUE)
This function selects stable gene/features from the dataset. It is implemented for users who need spike-in genes or do not want to rely on spike-in genes in scRNA-seq data analysis.
library(Linnorm) data(Islam2011) #Obtain stable genes StableGenes <- Linnorm.SGenes(Islam2011)
#You can use this file with Excel. write.table(StableGenes, "StableGenes.txt", quote=FALSE, sep="\t", col.names=TRUE, row.names=TRUE)
We use 10 samples of RNA-seq data. These 10 samples are paired tumor and adjacent normal tissue samples from 5 individuals from the TCGA LIHC dataset.
library(Linnorm) data(LIHC) datamatrix <- LIHC
The Linnorm-limma pipeline only consists of two steps.
#5 samples for condition 1 and 5 samples for condition 2. #You might need to edit this line design <- c(rep(1,5),rep(2,5)) #These lines can be copied directly. design <- model.matrix(~ 0+factor(design)) colnames(design) <- c("group1", "group2") rownames(design) <- colnames(datamatrix)
a. Basic Differential Expression Analysis. (Follow this if you are not sure what to do.)
library(Linnorm) #The Linnorm-limma pipeline only consists of one line. DEG_Results <- Linnorm.limma(datamatrix,design) #The DEG_Results matrix contains your DEG analysis results.
b. Advanced: to output both DEG analysis results and the transformed matrix for further analysis
library(Linnorm) #Just add output="Both" into the argument list BothResults <- Linnorm.limma(datamatrix,design,output="Both") #To separate results into two matrices: DEG_Results <- BothResults$DEResults TransformedMatrix <- BothResults$Linnorm #The DEG_Results matrix now contains DEG analysis results. #The TransformedMatrix matrix now contains a Linnorm #Transformed dataset.
write.table(DEG_Results, "DEG_Results.txt", quote=FALSE, sep="\t", col.names=TRUE,row.names=TRUE)
Genes10 <- DEG_Results[order(DEG_Results[,"adj.P.Val"]),][1:10,] #Users can print the gene list by the following command: #print(Genes10) #logFC: log 2 fold change of the gene. #XPM: If input is raw count or CPM, XPM is CPM. # If input is RPKM, FPKM or TPM, XPM is TPM. #t: moderated t statistic. #P.Value: p value. #adj.P.Val: Adjusted p value. This is also called False Discovery Rate or q value.} #B: log odds that the feature is differential. #Note all columns are printed here
library(knitr) kable(Genes10[,c(1:5)], digits=4)
SignificantGenes <- DEG_Results[DEG_Results[,5] <= 0.05,1]
plot(x=DEG_Results[,1], y=DEG_Results[,5], col=ifelse(DEG_Results[,1] %in% SignificantGenes, "red", "green"),log="y", ylim = rev(range(DEG_Results[,5])), main="Volcano Plot", xlab="log2 Fold Change", ylab="q values", cex = 0.5)
In this section, we use Islam et al. (2011)'s single cell RNA-seq dataset to perform DEG analysis. In this analysis, we are using 48 mouse embryonic stem cells and 44 mouse embryonic fibroblasts.
library(Linnorm) data(Islam2011) IslamData <- Islam2011[,1:92]
#48 samples for condition 1 and 44 samples for condition 2. #You might need to edit this line design <- c(rep(1,48),rep(2,44)) #There lines can be copied directly. design <- model.matrix(~ 0+factor(design)) colnames(design) <- c("group1", "group2") rownames(design) <- colnames(IslamData)
#Example 1: Filter low count genes. #and genes with high technical noise. scRNAseqResults <- Linnorm.limma(IslamData,design,Filter=TRUE) #Example 2: Do not filter gene list. scRNAseqResults <- Linnorm.limma(IslamData,design)
In this section, we are going to use Islam2011 single cell RNA-seq embryonic stem cell data and perform Gene Correlation Analysis.
data(Islam2011) #Obtain embryonic stem cell data datamatrix <- Islam2011[,1:48]
#Setting plotNetwork to TRUE will create a figure file in your current directory. #Setting it to FALSE will stop it from creating a figure file, but users can plot it #manually later using the igraph object in the output. #For this vignette, we will plot it manually in step 4. results <- Linnorm.Cor(datamatrix, plotNetwork=FALSE, #Edge color when correlation is positive plot.Pos.cor.col="red", #Edge color when correlation is negative plot.Neg.cor.col="green")
Genes10 <- results$Results[order(results$Results[,5]),][1:10,] #Users can print the gene list by the following command: #print(Genes10)
library(knitr) kable(Genes10, digits=4, row.names=FALSE)
We will demonstrate how to plot the figure "networkplot.png" here.
library(igraph) Thislayout <- layout_with_kk(results$igraph) plot(results$igraph, #Vertex size vertex.size=2, #Vertex color, based on clustering results vertex.color=results$Cluster$Cluster, #Vertex frame color vertex.frame.color="transparent", #Vertex label color (the gene names) vertex.label.color="black", #Vertex label size vertex.label.cex=0.05, #This is how much the edges should be curved. edge.curved=0.1, #Edge width edge.width=0.05, #Use the layout created previously layout=Thislayout )
For example, what are the genes that belong to the same cluster as the Mmp2 gene?
TheCluster <- which(results$Cluster[,1] == "Mmp2")
#Index of the genes ListOfGenes <- which(results$Cluster[,2] == TheCluster) #Names of the genes GeneNames <- results$Cluster[ListOfGenes,1] #Users can write these genes into a file for further analysis.
top100 <- results$Results[order(results$Results[,4],decreasing=FALSE)[1:100],1]
Top100.Cor.Matrix <- results$Cor.Matrix[top100,top100]
library(RColorBrewer) library(gplots) heatmap.2(as.matrix(Top100.Cor.Matrix), #Hierarchical clustering on both row and column Rowv=TRUE, Colv=TRUE, #Center white color at correlation 0 symbreaks=TRUE, #Turn off level trace trace="none", #x and y axis labels xlab = 'Genes', ylab = "Genes", #Turn off density info density.info='none', #Control color key.ylab=NA, col=colorRampPalette(c("blue", "white", "yellow"))(n = 1000), lmat=rbind(c(4, 3), c(2, 1)), #Gene name font size cexRow=0.3, cexCol=0.3, #Margins margins = c(8, 8) )
In this section, we will perform highly variable gene discovery on the embryonic stem cells from Islam(2011).
data(Islam2011)
#The first 48 columns are the embryonic stem cells. results <- Linnorm.HVar(Islam2011[,1:48])
resultsdata <- results$Results Genes10 <- resultsdata[order(resultsdata[,"q.value"]),][1:10,3:5] #Users can print the gene list by the following command: #print(Genes10)
library(knitr) kable(Genes10, digits=4)
print(results$plot$plot) #By default, this plot highlights genes/features with p value less than 0.05.
In this section, we use Islam et al. (2011)'s single cell RNA-seq dataset to perform subpopluation analysis. The 96 samples are consisted of 48 mouse embryonic stem cells, 44 mouse embryonic fibroblasts and 4 negative controls. We do not use the negative controls here.
library(Linnorm) data(Islam2011)
tSNE.results <- Linnorm.tSNE(Islam2011[,1:92])
#Here, we can see two clusters. print(tSNE.results$plot$plot)
#The first 48 samples belong to mouse embryonic stem cells. Groups <- rep("ES_Cells",48) #The next 44 samples are mouse embryonic fibroblasts. Groups <- c(Groups, rep("EF_Cells",44))
#Useful arguments: #Group: #allows user to provide each sample's information. #num_center: #how many clusters are supposed to be there. #num_PC: #how many principal components should be used in k-means #clustering. tSNE.results <- Linnorm.tSNE(Islam2011[,1:92], Group=Groups, num_center=2)
#Here, we can see two clusters. print(tSNE.results$plot$plot)
In this section, we use Islam et al. (2011)'s single cell RNA-seq dataset to perform subpopluation analysis. The 96 samples are consisted of 48 mouse embryonic stem cells, 44 mouse embryonic fibroblasts and 4 negative controls. We do not use the negative controls here. This section is basically identical to the t-SNE K-means Clustering section above.
library(Linnorm) data(Islam2011)
PCA.results <- Linnorm.PCA(Islam2011[,1:92])
#Here, we can see multiple clusters. print(PCA.results$plot$plot)
#The first 48 samples belong to mouse embryonic stem cells. Groups <- rep("ES_Cells",48) #The next 44 samples are mouse embryonic fibroblasts. Groups <- c(Groups, rep("EF_Cells",44))
#Useful arguments: #Group: #allows user to provide each sample's information. #num_center: #how many clusters are supposed to be there. #num_PC #how many principal components should be used in k-means #clustering. PCA.results <- Linnorm.PCA(Islam2011[,1:92], Group=Groups, num_center=2, num_PC=3)
#Here, we can see two clusters. print(PCA.results$plot$plot)
In this section, we will perform hierarchical clustering on Islam2011 data.
data(Islam2011) Islam <- Islam2011[,1:92]
#48 ESC, 44 EF, and 4 NegCtrl Group <- c(rep("ESC",48),rep("EF",44)) colnames(Islam) <- paste(colnames(Islam),Group,sep="_")
#Note that there are 3 known clusters. HClust.Results <- Linnorm.HClust(Islam,Group=Group, num_Clust=2, fontsize=1.5, Color = c("Red","Blue"), RectColor="Green")
We can simply print the plot out.
print(HClust.Results$plot$plot)
In this section, we will run RnaXSim with default settings as a demonstration.
library(Linnorm) data(SEQC) SampleA <- SEQC
#This will generate two sets of RNA-seq data with 5 replicates each. #It will have 20000 genes totally with 2000 genes being differentially #expressed. It has the Negative Binomial distribution. SimulatedData <- RnaXSim(SampleA)
#Index of differentially expressed genes. DE_Index <- SimulatedData[[2]] #Expression Matrix ExpMatrix <- SimulatedData[[1]] #Sample Set 1 Sample1 <- ExpMatrix[,1:3] #Sample Set 2 Sample2 <- ExpMatrix[,4:6]
In this section, we will show an example where RnaXSim is run with customized settings.
data(SEQC) SampleA <- SEQC
library(Linnorm) SimulatedData <- RnaXSim(SampleA, distribution="Gamma", #Distribution in the simulated dataset. #Put "NB" for Negative Binomial, "Gamma" for Gamma, #"Poisson" for Poisson and "LogNorm" for Log Normal distribution. NumRep=5, #Number of replicates in each sample set. #5 will generate 10 samples in total. NumDiff = 1000, #Number of differentially expressed genes. NumFea = 5000 #Total number of genes in the dataset )
#Index of differentially expressed genes. DE_Index <- SimulatedData[[2]] #Expression Matrix ExpMatrix <- SimulatedData[[1]] #Simulated Sample Set 1 Sample1 <- ExpMatrix[,1:3] #Simulated Sample Set 2 Sample2 <- ExpMatrix[,4:6]
\newpage
Can I use Linnorm Transformed dataset to calculate Fold Change?
Answer: Linnorm Transformed dataset is a log transformed dataset. You should not use it to calculate fold change directly. To do it correctly, please refer to the calculate fold change section.
I only have two samples in total. Can I perform Linnorm Transformation?
Answer: No, you cannot. Linnorm requires a minimum of 3 samples.
I only have 1 replicate for each sample set. Can I perform Differential Expression Analysis with Linnorm and limma?
Answer: No, linear model based methods must have replicates. So, limma wouldn't work.
There are a lot of fold changes with INF values in Linnorm.limma output. Can I convert them into numerical units like those in the voom-limma pipeline?
Answer: Since the expression in one set of sample can be zero, while the other can be otherwise, it is arithmetically correct to generate INFs. However, it is possible for the Linnorm.limma function to prevent generating INFs by setting the noINF argument as TRUE, which is the default.
Do I need to run Linnorm.Norm() in addition to transforming the datset with Linnorm()?
Answer: Linnorm()'s transformation also performs Linnorm.Norm()'s normalization step. Therefore, please DO NOT rerun Linnorm.Norm() before or after Linnorm().
\newpage
We appreciate and welcome any Bug Reports, Questions and Suggestions. They can be posted on bioconductor's support site, \href{https://support.bioconductor.org}{https://support.bioconductor.org}. Please remember to add the Linnorm tag in your post so that Ken will be notified. Bioconductor's posting guide can be found at \href{http://www.bioconductor.org/help/support/}{http://www.bioconductor.org/help/support/}.
Any scripts or data that you put into this service are public.
Add the following code to your website.
For more information on customizing the embed code, read Embedding Snippets.