This vignette has been changed in BioC 3.14, when each data package (LRBase.XXX.eg.db) is deprecated and the way to provide LRBase data has changed to AnnotationHub-style.

Introduction

We explain the way to create a matrix, in which the row names are NCBI Gene ID (ENTREZID), for specifying an input of r Biocpkg("scTensor"). Typical r Biocpkg("scTensor") workflow can be described as below.

library("scTensor")
library("AnnotationHub")
library("LRBaseDbi")

# Input matrix
input <- ...
sce <- SingleCellExperiment(assays=list(counts = input))
# Celltype vector
label <- ...
# LRBase.XXX.eg.db
ah <- AnnotationHub()
dbfile <- query(ah, c("LRBaseDb", "Homo sapiens", "v002"))[[1]]
LRBase.Hsa.eg.db <- LRBaseDbi::LRBaseDb(dbfile)
# Setting
cellCellSetting(sce, LRBase.Hsa.eg.db, label)

In r Biocpkg("scTensor"), the row names of the input matrix is limited only NCBI Gene ID to cooperate with the other R packages (cf. data("Germline")). Since the user has many different types of the data matrix, here we introduce some situations and the way to convert the row names of the user's matrix as NCBI Gene ID.

Step.1: Create a gene-level expression matrix

First of all, we have to prepare the expression matrix (gene $\times$ cell). There are many types of single-cell RNA-Seq (scRNA-Seq) technologies and the situation will be changed by the used experimental methods and quantification tools described below.

Case I: Gene-level quantification

In Plate-based scRNA-Seq experiment (i.e. Smart-Seq2, Quart-Seq2, CEL-Seq2, MARS-Seq,...etc), the FASTQ file is generated in each cell. After the mapping of reads in each FASTQ file to the reference genome, the same number of BAM files will be generated. By using some quantification tools such as featureCounts, or HTSeq-count, user can get the expression matrix and used as the input of r Biocpkg("scTensor"). These tools simply count the number of reads in union-exon in each gene. One downside of these tools is that such tools do not take "multimap" of not unique reads into consideration and the quantification is not accurate. Therefore, we recommend the transcript-level quantification and gene-level summarization explained below.

Case II: Transcript-level quantification

Some quantification tools such as RSEM, Sailfish, Salmon, Kallisto, and StringTie use the reference transcriptome instead of genome, and quantify the expression level in each transcript. After the quantification, the transcript-level expression can be summarized to gene-level by using summarizeToGene function of r Biocpkg("tximport"). The paper of tximport showed that the transcript-level expression summalized to gene-level is more accurate than the gene-level expression calculated by featureCounts.

Note that if you use the reference transcriptome of GENCODE, this step becomes slightly complicated. Although the number of transcripts of GENCODE and that of Ensembl is almost the same, and actually, most of the transcript is duplicated in these two databases, the gene identifier used in GENCODE looks complicated like "ENST00000456328.2|ENSG00000223972.5|OTTHUMG00000000961.2|OTTHUMT00000362751.1|DDX11L1-202|DDX11L1|1657|processed_transcript|". In such a case, firstly only Ensembl Transcript ID should be extracted (e.g. ENST00000456328.2), removed the version (e.g. ENST00000456328), summarized to Ensembl Gene ID by tximport (e.g. ENSG00000223972), and then converted to NCBI Gene ID (e.g. 100287102) by each organism package such as r Biocpkg("Homo.sapiens").

Case III: UMI-count

In the droplet-based scRNA-Seq experiment (i.e. Drop-Seq, inDrop RNA-Seq, 10X Chromium), unique molecular identifier (UMI) is introduced for avoiding the bias of PCR amplification, and after multiplexing by cellular barcode, digital gene expression (DGE) matrix is generated by counting the number of types of UMI mapped in each gene.

When user perform Drop-seq, Drop-Seq tool can generate the DGE matrix.

Another tool Alevin, which is a subcommand of Salmon is also applicable to Drop-seq data. In such case [tximport] with option "type = 'alevin'" can import the result of Alevin into R and summarize the DGE matrix.

When the user performs 10X Chromium, using Cell Ranger developed by 10X Genomics is straightforward.

Although Cell Ranger is implemented by Python, starting from the files generated by Cell Ranger (e.g. filtered_gene_bc_matrices/{hg19,mm10}/{barcodes.tsv,genes.tsv,matrix.mtx}), r CRANpkg("Seurat") can import these files to an R object.

For example, according to the tutorial of Seurat (Seurat - Guided Clustering Tutorial), PBMC data of Cell Ranger can be imported by the Read10X function and DGE matrix of UMI-count is available by the output of CreateSeuratObject function.

if(!require(Seurat)){
    BiocManager::install("Seurat")
    library(Seurat)
}

# Load the PBMC dataset
pbmc.data <- Read10X(data.dir = "filtered_gene_bc_matrices/hg19/")

# Initialize the Seurat object with the raw (non-normalized data).
pbmc <- CreateSeuratObject(counts = pbmc.data,
  project = "pbmc3k", min.cells = 3, min.features = 200)

Note that the matrix is formatted as a sparse matrix of r CRANpkg("Matrix") package (MM: Matrix market), but the r Biocpkg("scTensor") assumes dense matrix for now. By using as.matrix function, the sparse matrix is easily converted to a dense matrix as follows.

# Sparse matrix to dense matrix
for_sc <- as.matrix(pbmc.data)

Step.2: Convert the row names of a matrix as NCBI Gene ID (ENTREZID)

Even after creating the gene-level expression matrix in Step.1, many kinds of gene-level gene identifiers can be assigned as row names of the matrix such as Ensembl Gene ID, RefSeq, or Gene Symbol. Again, only NCBI Gene ID can be used as row names of the input matrix of r Biocpkg("scTensor"). To do such a task, we originally implemented a function convertRowID function of r Biocpkg("scTGIF"). The only user has to prepare for using this function is the 1. input matrix (or data.frame) filled with only numbers, 2. current gene-level gene identifier in each row of the input matrix, and 3. corresponding table containing current gene-level gene identifier (left) and corresponding NCBI Gene ID (right). The usage of this function is explained below.

Case I: Ensembl Gene ID to NCBI Gene ID

In addition to 1. and 2., the user has to prepare the 3. corresponding table. Here we introduce two approaches to assign the user's Ensembl Gene ID to NCBI Gene ID. First approarch is using Organism DB packages such as r Biocpkg("Homo.sapiens"), r Biocpkg("Mus.musculus"), and r Biocpkg("Rattus.norvegicus").

Using the select function of Organism DB, the corresponding table can be retrieved like below.

suppressPackageStartupMessages(library("scTensor"))
if(!require(Homo.sapiens)){
    BiocManager::install("Homo.sapiens")
    suppressPackageStartupMessages(library(Homo.sapiens))
}
if(!require(scTGIF)){
    BiocManager::install("scTGIF")
    suppressPackageStartupMessages(library(scTGIF))
}

# 1. Input matrix
input <- matrix(1:20, nrow=4, ncol=5)
# 2. Gene identifier in each row
rowID <- c("ENSG00000204531", "ENSG00000181449",
  "ENSG00000136997", "ENSG00000136826")
# 3. Corresponding table
LefttoRight <- select(Homo.sapiens,
  column=c("ENSEMBL", "ENTREZID"),
  keytype="ENSEMBL", keys=rowID)
# ID conversion
(input <- convertRowID(input, rowID, LefttoRight))

Second approarch is using r Biocpkg("AnnotationHub") package.

Although only three Organism DB packages are explicitly developed, even if the data is generated from other species (e.g. Zebrafish, Arabidopsis thaliana), similar database is also available from r Biocpkg("AnnotationHub"), and select function can be performed like below.

suppressPackageStartupMessages(library("AnnotationHub"))

# 1. Input matrix
input <- matrix(1:20, nrow=4, ncol=5)
# 3. Corresponding table
ah <- AnnotationHub()
# Database of Human
hs <- query(ah, c("OrgDb", "Homo sapiens"))[[1]]
LefttoRight <- select(hs,
  column=c("ENSEMBL", "ENTREZID"),
  keytype="ENSEMBL", keys=rowID)
(input <- convertRowID(input, rowID, LefttoRight))

Case II: Gene Symbol to NCBI Gene ID

When using cellranger or r CRANpkg("Seurat") to quantify UMI-count (cf. Step1, Case III), the row names of the input matrix might be Gene Symbol, and have to be converted to NCBI Gene ID. As well as the Case I described above, Organism DB and r Biocpkg("AnnotationHub") will support such a task like below.

# 1. Input matrix
input <- matrix(1:20, nrow=4, ncol=5)
# 2. Gene identifier in each row
rowID <- c("POU5F1", "SOX2", "MYC", "KLF4")
# 3. Corresponding table
LefttoRight <- select(Homo.sapiens,
  column=c("SYMBOL", "ENTREZID"),
  keytype="SYMBOL", keys=rowID)
# ID conversion
(input <- convertRowID(input, rowID, LefttoRight))
# 1. Input matrix
input <- matrix(1:20, nrow=4, ncol=5)
# 3. Corresponding table
ah <- AnnotationHub()
# Database of Human
hs <- query(ah, c("OrgDb", "Homo sapiens"))[[1]]
LefttoRight <- select(hs,
  column=c("SYMBOL", "ENTREZID"),
  keytype="SYMBOL", keys=rowID)
(input <- convertRowID(input, rowID, LefttoRight))

Step.3: Normalize the count matrix

Finally, we introduce some situations to perform some normalization methods of gene expression matrix.

If a user converts a Seurat object to a SingleCellExperient object by using as.SingleCellExperiment, the result of the NormalizeData function (log counts) is inherited to the SingleCellExperient object as follows;

pbmc2 <- NormalizeData(pbmc, normalization.method = "LogNormalize",
    scale.factor = 10000)
sce <- as.SingleCellExperiment(pbmc2)
assayNames(sce) # counts, logcounts

If the user want to use r Biocpkg("scater") package, calculateCPM or normalize function can calculate the normalized expression values as follows; (see also the vignette of scater).

if(!require(scater)){
    BiocManager::install("scater")
    library(scater)
}
sce <- SingleCellExperiment(assays=list(counts = input))
cpm(sce) <- calculateCPM(sce)
sce <- normalize(sce)
assayNames(sce) # counts, normcounts, logcounts, cpm

Any original normalization can be stored in the sce. For example, we can calculate the value of count per median (CPMED) as follows;

# User's Original Normalization Function
CPMED <- function(input){
    libsize <- colSums(input)
    median(libsize) * t(t(input) / libsize)
}
# Normalization
normcounts(sce) <- log10(CPMED(counts(sce)) + 1)

We recommend using the normcounts slot to save such original normalization values. After the normalization, such values can be specified by assayNames option in cellCellRanks cellCellDecomp and cellCellReport functions.

Session information {.unnumbered}

sessionInfo()


rikenbit/scTensor documentation built on Jan. 28, 2023, noon