Characterization of miRNA and isomiR molecules

library(BiocStyle)
knitr::opts_chunk$set(tidy=FALSE, cache=FALSE,
                      dev="png",
                      message=FALSE, error=FALSE,
                      warning=TRUE)

Lorena Pantano Harvard TH Chan School of Public Health, Boston, US

Introduction

miRNAs are small RNA fragments (18-23 nt long) that influence gene expression during development and cell stability. Morin et al [@morin], discovered isomiRs first time after sequencing human stem cells.

IsomiRs are miRNAs that vary slightly in sequence, which result from variations in the cleavage site during miRNA biogenesis (5’-trimming and 3’-trimming variants), nucleotide additions to the 3’-end of the mature miRNA (3’-addition variants) and nucleotide modifications (substitution variants)[@emarti].

There are many tools designed for isomiR detection, however the majority are web application where user can not control the analysis. The two main command tools for isomiRs mapping are SeqBuster and sRNAbench [@barturen2014]. r Biocpkg("isomiRs"). package is designed to analyze the output of SeqBuster tool or any other tool after converting to the desire format.

Citing isomiRs

If you use the package, please cite this paper [@isomirs].

Input format

The input should be the output of SeqBuster-miraligner tool (*.mirna files). It is compatible with mirTOP tool as well, which parses BAM files with alignments against miRNA precursors.

For each sample the file should have the following format:

seq                     name           freq mir           start end  mism  add  t5  t3
TGTAAACATCCTACACTCAGCT  seq_100014_x23  23  hsa-miR-30b-5p  17  40   0       0  0   GT
TGTAAACATCCCTGACTGGAA   seq_100019_x4   4   hsa-miR-30d-5p  6   26   13TC    0  0   g
TGTAAACATCCCTGACTGGAA   seq_100019_x4   4   hsa-miR-30e-5p  17  37   12CT    0  0   g
CAAATTCGTATCTAGGGGATT   seq_100049_x1   1   hsa-miR-10a-3p  63  81   0       TT 0   ata
TGACCTAGGAATTGACAGCCAGT seq_100060_x1   1   hsa-miR-192-5p  25  47   8GT     0  c   agt

This is the standard output of SeqBuster-miraligner tool, but can be converted from any other tool having the mapping information on the precursors. Read more on miraligner manual.

IsomirDataSeq class

This object will store all raw data from the input files and some processed information used for visualization and statistical analysis. It is a subclass of SummarizedExperiment with colData and counts methods. Beside that, the object contains raw and normalized counts from miraligner allowing to update the summarization of miRNA expression.

Access data

The user can access the normalized count matrix with counts(object, norm=TRUE).

You can browse for the same miRNA or isomiRs in all samples with isoSelect method.

library(isomiRs)
data(mirData)
head(isoSelect(mirData, mirna="hsa-let-7a-5p", 1000))

metadata(mirData) contains two lists: rawList is a list with same length than number of samples and stores the input files for each sample; isoList is a list with same length than number of samples and stores information for each isomiR type summarizing the different changes for the different isomiRs (trimming at 3', trimming a 5', addition and substitution). For instance, you can get the data stored in isoList for sample 1 and 5' changes with this code metadata(ids)[["isoList"]][[1]]["t5sum"].

isomiRs annotation

IsomiR names follows this structure:

In general nucleotides in UPPER case mean insertions respect to the reference sequence, and nucleotides in LOWER case mean deletions respect to the reference sequence.

Reading input

We are going to use a small RNAseq data from human brain samples [@pantano2016] to give some basic examples of isomiRs analyses.

In this data set we will find two groups:

pc: 7 control individuals pt: 7 patients with Parkinson's Disease in early stage.

library(isomiRs)
data(mirData)

The function IsomirDataSeqFromFiles needs a vector with the paths for each file and a data frame with the design experiment similar to the one used for a mRNA differential expression analysis. Row names of the data frame should be the names for each sample in the same order than the list of files.

ids <- IsomirDataSeqFromFiles(fn_list, design=de)

Manipulation

Descriptive analysis

You can plot isomiRs expression with isoPlot. In this figure you will see how abundant is each type of isomiRs at different positions considering the total abundance and the total number of sequences. The type parameter controls what type of isomiRs to show. It can be trimming (iso5 and iso3), addition (add) or substitution (subs) changes.

ids <- isoCounts(mirData)
isoPlot(ids, type="all")

Count data

isoCounts gets the count matrix that can be used for many different downstream analyses changing the way isomiRs are collapsed. The following command will merge all isomiRs into one feature: the reference miRNA.

head(counts(ids))

The normalization uses rlog from r Biocpkg("DESeq2") package and allows quick integration to another analyses like heatmap, clustering or PCA.

library(pheatmap)
ids = isoNorm(ids, formula = ~ condition)
pheatmap(counts(ids, norm=TRUE)[1:100,], 
         annotation_col = data.frame(colData(ids)[,1,drop=FALSE]),
         show_rownames = FALSE, scale="row")

Annotation

To get a detail description for each isomiR, the function isoAnnotate can return the naming, sequence and importance value for each sample and isomiR. The importance is calculated by:

$$importance = \frac{isomiR_reads}{miRNA_reads}$$

The columns are:

head(isoAnnotate(ids))

Classification

Differential expression analysis

The isoDE uses functions from r Biocpkg("DESeq2") package. This function has parameters to create a matrix using only the reference miRNAs, all isomiRs, or some of them. This matrix and the design matrix are the inputs for DESeq2. The output will be a DESeqDataSet object, allowing to generate any plot or table explained in DESeq2 package vignette.

dds <- isoDE(ids, formula=~condition)
library(DESeq2)
plotMA(dds)
head(results(dds, format="DataFrame"))

You can differentiate between reference sequences and isomiRs at 5' end with this command:

dds = isoDE(ids, formula=~condition, ref=TRUE, iso5=TRUE)
head(results(dds, tidy=TRUE))

Alternative, for more complicated cases or if you want to control more the differential expression analysis paramters you can use directly r Biocpkg("DESeq2") package feeding it with the output of counts(ids) and colData(ids) like this:

dds = DESeqDataSetFromMatrix(counts(ids),
                             colData(ids), design = ~condition)

Supervised classification

Partial Least Squares Discriminant Analysis (PLS-DA) is a technique specifically appropriate for analysis of high dimensionality data sets and multicollineality [@perezenciso]. PLS-DA is a supervised method (i.e. makes use of class labels) with the aim to provide a dimension reduction strategy in a situation where we want to relate a binary response variable (in our case young or old status) to a set of predictor variables. Dimensionality reduction procedure is based on orthogonal transformations of the original variables (isomiRs) into a set of linearly uncorrelated latent variables (usually termed as components) such that maximizes the separation between the different classes in the first few components [@xia]. We used sum of squares captured by the model (R2) as a goodness of fit measure. We implemented this method using the DiscriMiner into isoPLSDA function. The output p-value of this function will tell about the statistical significant of the group separation using miRNA expression data. Moreover, the function isoPLSDAplot helps to visualize the results. It will plot the samples using the significant components (t1, t2, t3 ...) from the PLS-DA analysis and the samples distribution along the components.

ids = isoCounts(ids, iso5=TRUE, minc=10, mins=6)
ids = isoNorm(ids, formula = ~ condition)
pls.ids = isoPLSDA(ids, "condition", nperm = 2)
df = isoPLSDAplot(pls.ids)

The analysis can be done again using only the most important discriminant isomiRS from the PLS-DA models based on the analysis. We used Variable Importance for the Projection (VIP) criterion to select the most important features, since takes into account the contribution of a specific predictor for both the explained variability on the response and the explained variability on the predictors.

pls.ids = isoPLSDA(ids,"condition", refinment = FALSE, vip = 0.8)

Gene - miRNA integration

The package offers a correlation analysis of miRNA and gene expression data. Having two SummarizedExperiments objects with their expression, the target prediction for each miRNA, the function isoNetwork and isoPlotNetwork can generate a summarized figure showing the relationship between expression profile, miRNA repression and enrichment analysis:

# library(org.Mm.eg.db)
# library(clusterProfiler)
data(isoExample)

# ego <- enrichGO(row.names(assay(gene_ex_rse, "norm")),
#                 org.Mm.eg.db, "ENSEMBL", ont = "BP")
data = isoNetwork(mirna_ex_rse, gene_ex_rse, target = ma_ex,
                  enrich = ego, summarize = "group")
isoPlotNet(data,  minGenes = 4)

As an option, org can be org.Mm.eg.db and genename can be ENSEMBL and it will run enrcihGO internally.

To create the ma_ex matrix, the user can use findTargets:

mirna_ma <- data.frame(gene = names(gene_ex_rse)[1:20],
                       mir = names(mirna_ex_rse))
ma_ex <- findTargets(mirna_ex_rse, gene_ex_rse, mirna_ma)
head(ma_ex[,1:4])

And to get the mirna_ma data.frame with the miRNA-target information, the user can use mirna2targetscan function:

library(targetscan.Hs.eg.db)
mirna_ma <- mirna2targetscan(c("hsa-miR-34c-5p"))
head(mirna_ma)

Session info

Here is the output of sessionInfo on the system on which this document was compiled:

sessionInfo()

References



Try the isomiRs package in your browser

Any scripts or data that you put into this service are public.

isomiRs documentation built on Jan. 31, 2021, 2 a.m.