This document guides one through all available functions of the anamiR
package. Package anamiR aims to find potential miRNA-target gene interactions from both miRNA and mRNA expression data.
Traditional miRNA analysis method is to use online databases to predict miRNA-target gene interactions. However, the inconsistent results make interactions are less reliable. To address this issue, anamiR integrates the whole expression analysis with expression data into workflow, including normalization, differential expression, correlation and then databases intersection, to find more reliable interactions. Moreover, users can identify interactions from interested pathways or gene sets.
anamiR is on Bioconductor and can be installed following standard installation procedure.
install.packages("BiocManager") BiocManager::install("anamiR")
To use,
library(anamiR)
The general workflow can be summarized as follows,
Basically there are six steps, corresponding to six R functions, to complete the whole analysis:
As shown in the workflow, not only samples of paired miRNA and mRNA expression data, but phenotypical information of miRNA and mRNA are also required for the analysis. Since anamiR reads data in expression matrices, data sources are platform and technology agnostic. particularly, expression data from microarray or next generation sequencing are all acceptable for anamiR. However, this also indicates the raw data should be pre-processd to the expression matrices before using anamiR.
Columns for samples. Rows for genes
GENE SmapleA SamplB ... A 0.1 0.2 B -0.5 -0.3 C 0.4 0.1
Columns for samples. Rows for miRNAs
miRNA SampleA SampleB ... A 0.1 0.5 B -0.5 2.1 C 0.4 0.3
Cloumns for samples. Rows for feature name, including two groups, multiple groups, continuous data.
Feature groupA groupA groupA ... SampleA 123.33 A A SampleB 120.34 B C SampleC 121.22 A B
Now, we show an example using internal data for anamiR workflow.
To demonstrate the usage of the anamiR
package, the package contains 30 paired
miRNA and mRNA breast cancer samples, which are selected from 101 miRNA samples and
114 mRNA samples from GSE19536. As for phenotype data (hybridization information),
there are three types of information in it, including two-groups, multi-groups,
continuous data.
The mRNA data was conducted by Agilent-014850 Whole Human Genome Microarray 4x44K G4112F (Probe Name version) platform and the miRNA data was generated from Agilent-019118 Human miRNA Microarray 2.0 G4470B (miRNA ID version).
First of all, load the internal data and check the format.
data(mrna) data(mirna) data(pheno.mirna) data(pheno.mrna)
Basically, the format of input data should be the same as the internal data. For mRNA expression data should be like,
mrna[1:5, 1:5]
As for miRNA expression data,
mirna[1:5, 1:5]
And the phenotype data, (NOTICE:users should arrange the case columns front of the control columns.)
pheno.mrna[1:3, 1:3] pheno.mrna[28:30, 1:3]
Actually, the phenotype data of miRNA and mRNA share the same contents,but in this case, we still make it in two data to prevent users from being confused about it.
Secondly, we normalize data. (If you use the normalized data, you can skip this step.)
se <- normalization(data = mirna, method = "quantile")
For this function, there are three methods provided, including quantile
,
rank.invariant
, normal
. For more detail, Please refer to their
documentation.
Note that internal data have already been normalized, here is only for demonstration.
Before entering the main workflow, we should put our data and phenotype
information into SummarizedExperiment
class first, which you can get
more information from SummarizedExperiment.
mrna_se <- SummarizedExperiment::SummarizedExperiment( assays = S4Vectors::SimpleList(counts=mrna), colData = pheno.mrna) mirna_se <- SummarizedExperiment::SummarizedExperiment( assays = S4Vectors::SimpleList(counts=mirna), colData = pheno.mirna)
Third, we will find differential expression genes and miRNAs.
There are three statitical methods in this function. here, we use
t.test
for demonstration.
mrna_d <- differExp_discrete(se = mrna_se, class = "ER", method = "t.test", t_test.var = FALSE, log2 = FALSE, p_value.cutoff = 0.05, logratio = 0.5 ) mirna_d <- differExp_discrete(se = mirna_se, class = "ER", method = "t.test", t_test.var = FALSE, log2 = FALSE, p_value.cutoff = 0.05, logratio = 0.5 )
This function will delete genes and miRNAs (rows), which do not differential express, and add another three columns represent fold change (log2), p-value, adjusted p-value.
Take differential expressed mRNA data for instance,
nc <- ncol(mrna_d) mrna_d[1:5, (nc-4):nc]
Before using collected databases for intersection with potential miRNA-target gene interactions, we have to make sure all miRNA are in the latest annotation version (miRBase 21). If not, we could use this function to do it.
mirna_21 <- miR_converter(data = mirna_d, remove_old = TRUE, original_version = 17, latest_version = 21)
Now, we can compare these two data,
# Before head(row.names(mirna_d)) # After head(row.names(mirna_21))
Note that user must put the right original version into parameter, because it is an important information for function to convert annotation.
To find potential miRNA-target gene interactions, we should
combine the information in two differential expressed data,
which we obtained from differExp_discrete
.
cor <- negative_cor(mrna_data = mrna_d, mirna_data = mirna_21, method = "pearson", cut.off = -0.5)
For output,
head(cor)
As the showing list
, each row is a potential interaction,
and only the row that correlation coefficient < cut.off would
be kept in list.
Note that in our assumption, miRNAs negatively regulate expression of their target genes, that is, cut.off basically should be negative decimal.
There is a function for user to see the heatmaps about the miRNA-target gene interactions remaining in the correlation analysis table.
heat_vis(cor, mrna_d, mirna_21)
After correlation analysis, we have some potential interactions,
and then using database_support
helps us to get information
that whether there are databases predict or validate these
interactions.
sup <- database_support(cor_data = cor, org = "hsa", Sum.cutoff = 3)
From output, column Sum
tells us the total hits by 8 predict
databases and column Validate
tells us if this interaction
have been validated.
head(sup)
Note that we have 8 predict databases (DIANA microT CDS, EIMMo, Microcosm, miRDB, miRanda, PITA, rna22, TargetScan) and 2 validate databases (miRecords, miRTarBase).
The last, after finding reliable miRNA-target gene interactions, we are also interested in pathways, which may be enriched by these genes.
path <- enrichment(data_support = sup, org = "hsa", per_time = 500)
Note that for parameter per_time, we only choose 500 times, because it is for demonstration here. Default is 5000 times.
The output from this data not only shows P-Value generated by hypergeometric test, but Empirical P-Value, which means the value of average permutation test in each pathway.
head(path)
This package also provides another workflow for analyzing expression data.
Basically there are only two steps with two R functions, to complete the whole analysis:
Now, we show an example using internal data for anamiR workflow.
To demonstrate the usage of the anamiR
package, the package contains 99 paired
miRNA and mRNA breast cancer samples, which are selected from 101 miRNA samples and
114 mRNA samples from GSE19536. As for phenotype data (hybridization information).
The mRNA data was conducted by Agilent-014850 Whole Human Genome Microarray 4x44K G4112F (Probe Name version) platform and the miRNA data was generated from Agilent-019118 Human miRNA Microarray 2.0 G4470B (miRNA ID version).
The same as the format in the first workflow.
require(data.table) aa <- system.file("extdata", "GSE19536_mrna.csv", package = "anamiR") mrna <- fread(aa, fill = TRUE, header = TRUE) bb <- system.file("extdata", "GSE19536_mirna.csv", package = "anamiR") mirna <- fread(bb, fill = TRUE, header = TRUE) cc <- system.file("extdata", "pheno_data.csv", package = "anamiR") pheno.data <- fread(cc, fill = TRUE, header = TRUE)
transform the data format to matrix.
mirna_name <- mirna[["miRNA"]] mrna_name <- mrna[["Gene"]] mirna <- mirna[, -1] mrna <- mrna[, -1] mirna <- data.matrix(mirna) mrna <- data.matrix(mrna) row.names(mirna) <- mirna_name row.names(mrna) <- mrna_name pheno_name <- pheno.data[["Sample"]] pheno.data <- pheno.data[, -1] pheno.data <- as.matrix(pheno.data) row.names(pheno.data) <- pheno_name
mrna expression data should be the same format as,
mrna[1:5, 1:5]
as for mirna,
mirna[1:5, 1:5]
and phenotype data,
pheno.data[1:5, 1] pheno.data[94:98, 1]
Before entering the main workflow, we should put our data and phenotype
information into SummarizedExperiment
class first, which you can get
more information from SummarizedExperiment.
mrna_se <- SummarizedExperiment::SummarizedExperiment( assays = S4Vectors::SimpleList(counts=mrna), colData = pheno.data) mirna_se <- SummarizedExperiment::SummarizedExperiment( assays = S4Vectors::SimpleList(counts=mirna), colData = pheno.data)
First step, we use GSEA_ana
function to find the pathways
which are the most likely enriched in given expreesion data.
table <- GSEA_ana(mrna_se = mrna_se, mirna_se = mirna_se, class = "ER", pathway_num = 2)
the result would be a list containg related genes and miRNAs matrix for each pathway.
Note that because it would take a few minutes to run GSEA_ana, here we use the pre-calculated data to show the output.
data(table_pre)
For the first pathway,
names(table_pre)[1] table_pre[[1]][1:5, 1:5] names(table_pre)[2] table_pre[[2]][1:5, 1:5]
GSEA_ana
intersects the related genes and miRNAs found
from databases with the given expression data.
After doing GSEA analysis, we have selected miRNA and gene
expression data for each enriched pathway. As for the second
step, the generated odject would be put into GSEA_res
.
result <- GSEA_res(table = table_pre, pheno.data = pheno.data, class = "ER", DE_method = "limma", cor_cut = 0)
This function helps us calculate P-value, Fold-Change, Correlation for each miRNA-gene pair and show these value to users.
names(result)[1]
result[[1]]
As for the data, which classify samples into more than two
groups, anamiR provides function multi_Differ
. User can
get more information about this function by refering to
its documentation.
The data with continuous phenotype feature are also supported,
differExp_continuous
contains linear regression model, which
can fit the continuous data series. User can get more
information about this function by refering to its documentation.
Any scripts or data that you put into this service are public.
Add the following code to your website.
For more information on customizing the embed code, read Embedding Snippets.