metaseqr2 | R Documentation |
This function is the main metaseqr2 workhorse and implements the main metaseqr2 workflow which performs data read, filtering, normalization and statistical selection, creates diagnostic plots and exports the results and a report if requested. The metaseqr2 function is responsible for assembling all the steps of the metaseqr2 pipeline which i) reads the input gene or exon read count table ii) performs prelimininary filtering of data by removing chrM and other non-essential information for a typical differential gene expression analysis as well as a preliminary expression filtering based on the exon counts, if an exon read count file is provided. iii) performs data normalization with one of currently widely used algorithms, including EDASeq (Risso et al., 2011), DESeq (Anders and Huber, 2010), edgeR (Robinson et al., 2010), NOISeq (Tarazona et al., 2012) or no normalization iv) performs a second stage of filtering based on the normalized gene expression according to several gene filters v) performs statistical testing with one or more of currently widely used algorithms, including DESeq (Anders and Huber, 2010), edgeR (Robinson et al., 2010), NOISeq (Tarazona et al., 2012), limma (Smyth et al., 2005) for RNA-Seq data vi) in the case of multiple statistical testing algorithms, performs meta-analysis using one of six vailable methods (see the meta.p argument) vii) exports the resulting differentially expressed gene list in text tab-delimited format viii) creates a set of diagnostic plots either available in the aforementioned packages or metaseqr2 specific ones and ix) creates a comprehensive HTML report which summarizes the run information, the results and the diagnostic plots. Certain diagnostic plots (e.g. the volcano plot) can be interactive with the use of the external Highcharts (http://www.highcharts.com) JavaScript library for interactive graphs. Although the inputs to the metaseqr2 workflow are many, in practice, setting only very few of them and accepting the defaults as the rest can result in quite comprehensible results for mainstream organisms like mouse, human, fly and rat.
metaseqr2(counts, sampleList, excludeList = NULL,
fileType = c("auto", "sam", "bam", "bed"),
path = NULL, contrast = NULL, libsizeList = NULL,
embedCols = list(idCol = 4, gcCol = NA, nameCol = NA,
btCol = NA),
annotation = NULL, org = c("hg18", "hg19", "hg38", "mm9",
"mm10", "rn5", "rn6", "dm3", "dm6", "danrer7",
"pantro4", "susscr3", "tair10", "equcab2"),
refdb = c("ensembl", "ucsc", "refseq"), version = "auto",
transLevel = c("gene", "transcript", "exon"),
countType = c("gene", "exon","utr"),
utrOpts = list(frac = 1, minLength = 300, downstream = 50),
exonFilters = list(minActiveExons = list(exonsPerGene = 5,
minExons = 2, frac = 1/5)),
geneFilters = list(length = list(length = 500),
avgReads = list(averagePerBp = 100, quantile = 0.25),
expression = list(median = TRUE, mean = FALSE,
quantile = NA, known = NA, custom = NA),
biotype = getDefaults("biotypeFilter", org[1]),
presence = list(frac = 0.25, minCount = 10,
perCondition = FALSE)),
whenApplyFilter = c("postnorm", "prenorm"),
normalization = c("deseq", "deseq2", "edaseq", "edger",
"noiseq", "nbpseq", "absseq", "dss", "each", "none"),
normArgs = NULL,
statistics = c("deseq", "deseq2", "edger", "noiseq",
"limma", "nbpseq", "absseq", "dss"),
statArgs = NULL,
adjustMethod = sort(c(p.adjust.methods, "qvalue")),
metaP = if (length(statistics) > 1) c("simes",
"bonferroni", "fisher", "dperm_min", "dperm_max",
"dperm_weight", "fperm", "whitlock", "minp", "maxp",
"weight", "pandora", "none") else "none",
weight = rep(1/length(statistics), length(statistics)),
nperm = 10000, pcut = NA, logOffset = 1, pOffset = NULL,
preset = NULL, qcPlots = c("mds", "biodetection",
"countsbio", "saturation", "readnoise","filtered",
"correl", "pairwise", "boxplot", "gcbias",
"lengthbias", "meandiff", "meanvar", "rnacomp",
"deheatmap", "volcano", "biodist", "mastat",
"statvenn", "foldvenn", "deregulogram"),
figFormat = c("png", "jpg", "tiff", "bmp", "pdf", "ps"),
outList = FALSE, exportWhere = NA,
exportWhat = c("annotation", "p_value", "adj_p_value",
"meta_p_value", "adj_meta_p_value", "fold_change",
"stats", "counts","flags"),
exportScale = c("natural", "log2", "log10", "vst",
"rpgm"),
exportValues = c("raw", "normalized"),
exportStats = c("mean", "median", "sd", "mad", "cv",
"rcv"),
exportCountsTable = FALSE,
restrictCores = 0.6, report = TRUE, reportTop = 0.1,
reportTemplate = "default", saveGeneModel = TRUE,
verbose = TRUE, runLog = TRUE,
reportDb = c("dexie", "sqlite"),
localDb = file.path(system.file(package = "metaseqR2"),
"annotation.sqlite"),
offlineReport = TRUE,
createTracks = FALSE, overwriteTracks = FALSE,
trackExportPath = file.path(exportWhere, "tracks"),
trackInfo = list(stranded = FALSE, normTo = 1e+9,
urlBase = "http://www.trackserver.me",
fasta = NULL, gtf = NULL,
hubInfo = list(name = "MyHub", shortLabel = "My hub",
longLabel = "My hub long",
email = "someone@example.com")), .progressFun = NULL,
.exportR2C = FALSE,...)
counts |
a text tab-delimited file containing gene,
exon or 3'UTR counts in one of the following formats: i) the
first column contains unique gene or exon identifiers and
the rest of the columns contain the read counts for each
sample. ii) The first n columns should contain only **gene**
annotation elements like chromosomal locations, gene
accessions, exon accessions, GC content etc. and the rest
columns should contain gene read counts, iii) |
sampleList |
a list containing condition names and the samples under each condition or a small tab-delimited file with the experiment description. Not needed when restoring a previous analysis. See Details for analytical description. |
excludeList |
a list of samples to exclude, in the
same (list) format as |
path |
an optional path where all the BED/BAM files are placed, to be prepended to the BAM/BED file names in the targets file. See Details for further information. |
fileType |
the type of raw input files. It can be
|
contrast |
a character vector of contrasts to be
tested in the statistical testing step(s) of the pipeline.
Each element of contrast should STRICTLY have the format
"ConditionA_vs_ConditionB_vs_...". Special attention is
needed as fold change calculations are based on this
argument. If it is |
libsizeList |
an optional named list where names
represent samples (MUST be the same as the samples in
|
embedCols |
a named list with column numbers to guide the case of embedded annotation. See Details for further information. |
annotation |
It can be one of i) |
org |
the supported organisms by |
refdb |
the reference annotation repository from
which to retrieve annotation elements to use with
metaseqr2. It can be one of |
version |
the version of the annotation to use. See Details. |
transLevel |
perform differential expression
analysis at which transcriptional unit, can be one of
|
countType |
the type of reads inside the counts
file. It can be one of |
utrOpts |
a named list with members |
exonFilters |
a named list whose names are the names of the supported exon filters and its members the filter parameters. See section "Exon filters" below for details. |
geneFilters |
a named list whose names are the names of the supported gene filters and its members the filter parameters. See section "Gene filters" below for details. |
whenApplyFilter |
a character string determining
when to apply the exon and/or gene filters, relative to
normalization. It can be |
normalization |
the normalization algorithm to be
applied on the count data. It can be one of
|
normArgs |
a named list whose names are the names
of the normalization algorithm parameters and its members
parameter values. See section "Normalization parameters"
below for details. Leave |
statistics |
one or more statistical analyses to be
performed by the metaseqr2 pipeline. It can be one or more
of |
statArgs |
a named list whose names are the names
of the statistical algorithms used in the pipeline. Each
member is another named list whose names are the
algorithm parameters and its members are the parameter
values. See section "Statistics parameters" below for
details. Leave |
adjustMethod |
the multiple testing p-value
adjustment method. It can be one of
|
metaP |
the meta-analysis method to combine
p-values from multiple statistical tests . It can be
one of |
weight |
a vector of weights with the same length as
the |
nperm |
the number of permutations performed to
derive the meta p-value when |
pcut |
a p-value cutoff for exporting differentially genes, default is to export all the non-filtered genes. |
logOffset |
an offset to be added to values during
logarithmic transformations in order to avoid Infinity
(default is |
pOffset |
a value between |
preset |
an analysis strictness preset.
|
qcPlots |
a set of diagnostic plots to show/create.
It can be one or more of |
figFormat |
the format of the output diagnostic
plots. It can be one or more of |
outList |
a logical controlling whether to export a list with the results in the running environment. |
exportWhere |
an output directory for the project results (report, lists, diagnostic plots etc.) |
exportWhat |
the content of the final lists. It can
be one or more of |
exportScale |
export values from one or more
transformations applied to the data. It can be one or
more of |
exportValues |
It can be one or more of
|
exportStats |
calculate and export several
statistics on raw and normalized counts, condition-wise.
It can be one or more of |
exportCountsTable |
exports also the calculated
read counts table when input is read from bam files
and exports also the normalized count table in all
cases. Defaults to |
restrictCores |
in case of parallel execution of
several subfunctions, the fraction of the available cores
to use. In some cases if all available cores are used
( |
report |
a logical value controlling whether to
produce a summary report or not. Defaults to
|
reportTop |
a fraction of top statistically
significant genes to append to the HTML report. This
helps in keeping the size of the report as small as
possible, as appending the total gene list might
create a huge HTML file. Users can always retrieve
the whole gene lists from the report links. Defaults
to |
reportTemplate |
an HTML template to use for the report. Do not change this unless you know what you are doing. |
saveGeneModel |
in case of exon analysis, a list
with exon counts for each gene will be saved to the file
|
verbose |
print informative messages during
execution? Defaults to |
runLog |
write a log file of the |
reportDb |
database system to use for storing the
report intereactive graphs. Can be |
localDb |
the metaseqR2 annotaation database location.
See also |
offlineReport |
|
.
createTracks |
option to create normalized bigWig
files to display in a genome browser (e.g. UCSC). Defaults
to |
overwriteTracks |
overwrite tracks if they already
exist? Defaults to |
trackExportPath |
where to export the bigWig files,
defaults to |
trackInfo |
if |
.progressFun |
a function which updates a
|
.exportR2C |
export additional RData along with
|
... |
further arguments that may be passed to
plotting functions, related to |
When counts
is a tab-delimited file, the following
restrictions apply:
In the case (i) the first cell of each row is a
gene or exon accession and the rest are integers
representing the counts for that accession. In that
case, the annotation
parameter should strictly
be NULL
or an external file in GTF format.
In the case (ii) the annotation
parameter
can also be "embedded"
. The ideal embedded
annotation contains 8 columns, chromosome, gene or
exon start, gene or exon end, gene or exon accession,
GC-content (fraction or percentage), strand, HUGO
gene symbol and gene biotype (e.g. "protein_coding"
or "ncRNA"). When the annotation
parameter is
"embedded"
, certain of these features are
mandatory (co-ordinates and accessions). If they are
not present, the pipeline will not run. If additional
elements are not present (e.g. GC content or biotypes),
certain features of metaseqr2 will not be available.
For example, EDASeq normalization will not be
performed based on a GC content covariate but based
on gene length which is not what the authors of
EDASeq suggest. If biotypes are not present, a lot of
diagnostic plots will not be available. If the HUGO
gene symbols are missing, the final annotation will
contain only gene accessions and thus be less
comprehensible. Counts can be a data frame satisfying
the above conditions. It is a data frame by default
when read2count
is used.
In the case (iii) the .RData file
(output of save
function contains
static input elements (list containing the gene
model (exon counts for each gene), gene and exon
annotation to avoid re-(down)loading and/or gene
counts depending on countType
). This kind of
input facilitates the re-analysis of the same
experiment, using different filtering, normalization
and statistical algorithms. This .RData
file
is produced when saveGeneModel=TRUE
.
In the case (iv) counts
can be a list
representing the gene model (exon/UTR counts for each
gene). This .RData
file can be generated by
setting saveGeneModel=TRUE
when performing data
analysis for the first time.
Regarding sampleList
it should have the format
sampleList <-
list(ConditionA=c("Sample_A1",
"Sample_A2","Sample_A3"),
ConditionB=c("Sample_B1","Sample_B2"),
ConditionC=c("Sample_C1","Sample_C2"))
. The names of
the samples in list members MUST match the column names
containing the read counts in the counts file. If they do
not match, the pipeline will either crash or at best, ignore
several of your samples. Alternative, sampleList
can
be a small tab-delimited file structured as follows: the
first line of the external tab delimited file should contain
column names (names are not important). The first column MUST
contain UNIQUE sample names and the second column MUST
contain the biological condition where each of the samples
in the first column should belong to. If the counts
argument is missing, the sampleList
argument MUST be
a targets text tab-delimited file which contains the sample
names, the BAM/BED file names and the biological
conditions/groups for each sample/file. The file should be
text tab-delimited and structured as follows: the first line
of the external tab delimited file should contain column
names (names are not important). The first column MUST
contain UNIQUE sample names. The second column MUST contain
the raw BAM/BED files WITH their full path. Alternatively,
the path
argument should be provided. If path
is not provided and if the files in the second column of
the targets file do not contain a path to a directory, the
current directory is assumed to be the BAM/BED file
container. The third column MUST contain the biological
condition where each of the samples in the first column
should belong to.
Regarding contrast
, a valid example based on the
sampleList
above is
contrast <- c("ConditionA_vs_ConditionB",
"ConditionA_vs_ConditionC",
"ConditionA_vs_ConditionB_vs_ConditionC")
. The
first element of pairwise contrasts (e.g. "ConditionA"
above) MUST be the control condition or any reference
that ConditionB is checked against. metaseqr2 uses this
convention to properly calculate fold changes.
Regarding embedCols
, this list must contain four
members, named idCol
, gcCol
, nameCol
and btCol
, which hold the position in the delimited
file with embedded annotation, where unique gene ids, GC
content, gene names and gene biotypes respectively are
located. More specifically:
idCol
is an integer denoting the column
number in the file (or data frame) provided with the
counts argument, where the unique gene accessions are.
Default to 4
which is the standard feature name
column in a BED file.
gcCol
is an integer denoting the column
number in the file (or data frame) provided with the
counts
argument, where each gene's GC content
is given. If not provided, GC content normalization
provided by EDASeq will not be available.
nameCol
is an integer denoting the column
number in the file (or data frame) provided with the
counts argument, where the HUGO gene symbols are given.
If not provided, it will not be available when reporting
results. In addition, the "known"
gene filter
will not be available for application.
btCol
is an integer denoting the column
number in the file (or data frame) provided with the
counts argument, where the gene biotypes are given.
If not provided, the "biodetection"
,
"countsbio"
, "saturation"
,
"filtered"
and "biodist"
plots will not
be available.
Regarding annotation instructs metaseqr2 where to find the
annotation for the given counts file. It can be one of i)
"download"
(default) for automatic downloading of
the annotation for the organism specified by the org
parameter (using biomaRt), ii) "embedded"
if the
annotation elements are embedded in the read counts file
or iv) a file specified by the user which should be as
similar as possible to the "download"
case, in
terms of column structure.
Regarding org
, it can be, for human genomes
"hg18"
, "hg19"
or "hg38"
, for mouse
genomes "mm9"
, "mm10"
, for rat genomes
"rn5"
or "rn6"
, for drosophila genome
"dm3"
or "dm6"
, for zebrafish genome
"danrer7"
, "danrer10"
or "danrer11"
,
for chimpanzee genome "pantro4"
, "pantro5"
,
for pig genome "susscr3"
, "susscr11"
, for
Arabidopsis thaliana genome "tair10"
and for
Equus caballus genome "equcab2"
. Finally, it can
be "USER_NAMED_ORG"
with a custom organism which
has been imported to the annotation database by the user
using a GTF file. For example org="mm10_p1"
.
Regarding version
, this is an integer denoting the
version of the annotation to use from the local annotation
database or fetch on the fly. For Ensembl, it corresponds
to Ensembl releases, while for UCSC/RefSeq, it is the
date of creation (locally).
Regarding whenApplyFilter
, in the case of
whenApplyFilter="prenorm"
, a first normalization
round is applied to a copy of the gene counts matrix in
order to derive the proper normalized values that will
constitute the several expression-based filtering
cutoffs.
Regarding metaP
, for the "fisher"
and
"fperm"
methods, see the documentation of the R
package MADAM. For the "whitlock"
method, see the
documentation of the survcomp Bioconductor package. With
the "maxp"
option, the final p-value is the maximum
p-value out of those returned by each statistical test.
This is equivalent to an "intersection" of the results
derived from each algorithm so as to have a final list with
the common genes returned by all statistical tests.
Similarly, when meta.p="minp"
, is equivalent to a
"union" of the results derived from each algorithm so as
to have a final list with all the genes returned by all
statistical tests. The latter can be used as a very lose
statistical threshold to aggregate results from all methods
regardless of their False Positive Rate. With the
"simes"
option, the method proposed by Simes
(Simes, R. J., 1986) is used. With the "dperm_min"
,
"dperm.max"
, "dperm.weight"
options, a
permutation procedure is initialed, where nperm
permutations are performed across the samples of the
normalized counts matrix, producing nperm
permuted
instances of the initital dataset. Then, all the chosen
statistical tests are re-executed for each permutation.
The final p-value is the number of times that the p-value
of the permuted datasets is smaller than the original
dataset. The p-value of the original dataset is created
based on the choice of one of dperm.min
,
dperm.max
or dperm.weight
options. In case of
dperm.min
, the intial p-value vector is consists
of the minimum p-value resulted from the applied
statistical tests for each gene. The maximum p-value
is used with the dperm.max
option. With the
dperm.weight
option, the weight
weighting vector for each statistical test is used to
weight each p-value according to the power of
statistical tests (some might work better for a
specific dataset). Be careful as the permutation
procedure usually requires a lot of time. However,
it should be the most accurate. This method will NOT
work when there are no replicated samples across
biological conditions. In that case, use
meta.p="simes"
instead. Finally, there are the
"minp"
, "maxp"
and "weight"
options which correspond to the latter three methods
but without permutations. Generally, permutations
would be accurate to use when the experiment includes
>5 samples per condition (or even better 7-10) which
is rather rare in RNA-Seq experiments. Finally,
"pandora"
is the same as "weight"
and is
added to be in accordance with the main algorithm name.
Regarding pOffset
, it is used to correct for
the case of a p-value which is equal to 0 as a result
of internal numerical and approximation procedures.
When NULL
, random numbers greater than 0 and
less than or equal to 0.5 are used to multiply the
offending p-values with the lowest provided non-zero
p-value, maintaining thus a virtual order of
significance, avoiding having the same p-values for
two tests and assuming that all zero p-values represent
extreme statistical significance. When a numeric
between 0 and 1, this number is used for the above
multiplication instead.
Regarding qcPlots
The "mds"
stands
for Mutlti-Dimensional Scaling and it creates a PCA-like
plot but using the MDS dimensionality reduction instead.
It has been succesfully used for NGS data (e.g. see the
package htSeqTools) and it shows how well samples from
the same condition cluster together. For
"biodetection"
, "countsbio"
,
"saturation"
, "rnacomp"
,
"readnoise"
, "biodist"
see the vignette of
NOISeq package. The "saturation"
case has been
rewritten in order to display more samples in a more
simple way. In addition, the "readnoise"
plots
represent an older version or the RNA composition plot
included in older versions of NOISeq. For "gcbias"
,
"lengthbias"
, "meandiff"
, "meanvar"
see the vignette of EDASeq package. "lenghtbias"
is similar to "gcbias"
but using the gene length
instead of the GC content as covariate. The "boxplot"
option draws boxplots of log2 transformed gene counts. The
"filtered"
option draws a 4-panel figure with the
filtered genes per chromosome and per biotype, as
absolute numbers and as fractions of the genome. See also
the help page of diagplotFiltered
. The
"deheatmap"
option performs hierarchical
clustering and draws a heatmap of differentially
expressed genes. In the context of diagnostic plots, it's
useful to see if samples from the same groups cluster
together after statistical testing. The "volcano"
option draws a volcano plot for each contrast and if a
report is requested, an interactive volcano plot is
presented in the HTML report. The "venn"
option
will draw an up to 5-way Venn diagram depicting the
common and specific to each statistical algorithm genes
and for each contrast, when meta-analysis is performed.
The "correl"
option creates two correlation
graphs: the first one is a correlation heatmap (a
correlation matrix which depicts all the pairwise
correlations between each pair of samples in the counts
matrix is drawn as a clustered heatmap) and the second
one is a correlogram plot, which summarizes the
correlation matrix in the form of ellipses (for an
explanation please see the vignette/documentation of the
R package corrplot. Set qcPlots=NULL
if you don't
want any diagnostic plots created.
Regarding reportDb
, contrary with the first version of
metaseqR, all graphs in the metaseqR2 report are interactive
with the usage of the JavaScript libraries Highcharts, plotly
(heatmaply) and jvenn.js. However, this adds a great burden
regarding rendering the final HTML file and its content,
a burden which becomes heavier by the fact the metaseqR2
report is rendered using knitr
and rmarkdown
instead of raw HTML (previously, brew
). Therefore, the
pre-calculated JSON objects representing the graphs are
stored either in a report-specific IndexedDB
(https://javascript.info/indexeddb) flavor called Dexie
(https://dexie.org/) (default) or in an SQLite database
and then queried using sql.js
(https://github.com/kripken/sql.js/). Dexie is
prefered because it is very efficient and can produce an
independent report that does not need to be served through
a web-server and can be viewed locally. Although Dexie is
very efficient, some caution is required as knitr
and
render
from rmarkdown
are not very
memory efficient when rendering larger HTML files. A large
HTML file may be produced when analyzing a large dataset
with a lot of contrasts that may result in a lot of tables.
In such cases, if the report generation crashes with errors
related to memory, try lowering the reportTop
argument. reportTop
does not affect the final lists
of differentially expressed genes, only the report tables.
The same must be applied also if the report takes too much
time to load. If the report is to be served through a web
server like Apache (e.g. when the report is provided by a
facility to end users), reportDb="sqlite"
may be
preferred as the total report size will be smaller because
of an SQLite database hosting all plots which are queried
when required but from the SQLite database and not from
the in-browser database (Dexie). **Note** that when using
an SQLite database, you will **NOT** be able to view the
report in any browser other than Microsoft Edge because
of security policies regarding local file access. **Note**
also that sql.js
is a rather large JavaScript library
(around 2.5MB).
Regarding trackInfo
, it is a helper list to guide the
bigWig track creation and has the following members:
stranded
, which can be TRUE
or
FALSE
depending on whether you wish to create
stranded tracks by separating + and - strand reads.
In the case of stranded tracks, a UCSC Genome Brower
trackhub is created. Individual tracks can be retrieved
from the trackhub.
normTo
, which is a large integer, denoting
the total sum of signal to be used as the normalization
target. It defaults to 1e+9
. This means that if
for a particular sample the sum of signal is 1.5e+9
(sum(sapply(coverage(x),sum)) == 1.5e+9
) then
this is linearly scaled to 1e+9.
urlBase
, which is a base url appended to
the bigWig files produced (the base path of the
bigDataUrl
in UCSC Genome Browser track lines).
hubInfo
, a list with the track hub
description in case of stranded tracks. Please see the
track hub specifications at the UCSC Genome Browser site.
fasta
, reference genome in FASTA format for
the case of analyzing a custom, non-directly supported
organism. It will be converted to the .2bit format and
written along with a track hub.
gtf
, a GTF file describing gene models in
the case of analyzing a custom, non-directly supported
organism. It will be converted to the .bigBed format and
written along with a track hub. Essentially the same as
annotation$gtf
.
All files (bigWig files, track/trackhub info) are written in
the tracks
subdirectory of the main path where the
report and the outputs are written.
If outList
is TRUE
, a named list whose
length is the same as the number of requested contrasts.
Each list member is named according to the corresponding
contrast and contains a data frame of differentially
expressed genes for that contrast. The contents of the
data frame are defined by the exportWhat,
exportScale, exportStats, exportValues
parameters. If
report
is TRUE
, the output list contains
two main elements. The first is described above (the
analysis results) and the second contains the same
results but in HTML formatted tables.
The exon filters are a set of filters which are applied
after the gene models are assembled from the read counts
of individual exons and before the gene expression is
summarized from the exons belonging to each gene. These
filters can be applied when the input read counts file
contains exon reads. It is not applicable when the input
file already contains gene counts. Such filters can be
for example "accept genes where all the exons contain
more than x reads" or "accept genes where there is read
presence in at least m/n exons, n being the total exons
of the gene". Such filters are NOT meant for detecting
differential splicing as also the whole metaseqr2
pipeline, thus they should not be used in that context.
The exonFilters
argument is a named list of
filters, where the names are the filter names and the
members are the filter parameters (named lists with
parameter name, parameter value). See the usage of the
metaseqr2
function for an example of how these
lists are structured. The supported exon filters in the
current version are: i) minActiveExons
which
implements a filter for demanding m out of n exons of a
gene to have a certain read presence with parameters
exonsPerGene
, minExons
and frac
.
The filter is described as follows: if a gene has up to
exonsPerGene
exons, then read presence is
required in at least minExons
of them, else read
presence is required in a frac
fraction of the
total exons. With the default values, the filter
instructs that if a gene has up to 5 exons, read presence
is required in at least 2, else in at least 20
exons, in order to be accepted. More filters will be
implemented in future versions and users are encouraged
to propose exon filter ideas to the author by mail. See
metaseqr2
usage for the defaults. Set
exonFilters=NULL
to not apply any exon filtering.
The gene filters are a set of filters applied to gene
expression as this is manifested through the read
presence on each gene and are preferably applied after
normalization. These filters can be applied both when the
input file or data frame contains exon read counts and
gene read counts. Such filter can be for example "accept
all genes above a certain count threshold" or "accept all
genes with expression above the median of the normalized
counts distribution" or "accept all with length above a
certain threshold in kb" or "exclude the 'pseudogene'
biotype from further analysis". The supported gene
filters in the current version, which have the same
structure as the exon filters (named list of lists with
filter names, parameter names and parameter arguments)
are: i) length
which implements a length filter
where genes are accepted for further analysis if they are
above length
(its parameter) kb. ii)
avg.reads
which implements a filter where a gene
is accepted for further analysis if it has more average
reads than the quantile
of the average count
distribution per averagePerBp
base pairs. In
summary, the reads of each gene are averaged per
averagePerBp
based on each gene's length (in
case of exons, input the "gene's length" is the sum of
the lengths of exons) and the quantile
quantile of
the average counts distribution is calculated for each
sample. Genes passing the filter should have an average
read count larger than the maximum of the vector of the
quantiles calculated above. iii) expression
which
implements a filter based on the overall expression of a
gene. The parameters of this filter are: median
,
where genes below the median of the overall count
distribution are not accepted for further analysis (this
filter has been used to distinguish between "expressed"
and "not expressed" genes in several cases, e.g. (Mokry
et al., NAR, 2011) with a logical as value, mean
which is the same as median
but using the mean,
quantile
which is the same as the previous two but
using a specific quantile of the total counts
distribution, known
, where in this case, a set of
known not-expressed genes in the system under
investigation are used to estimate an expression cutoff.
This can be quite useful, as the genes are filtered based
on a "true biological" cutoff instead of a statistical
cutoff. The value of this filter is a character vector of
HUGO gene symbols (MUST be contained in the annotation,
thus it's better to use annotation="download"
)
whose counts are used to build a "null" expression
distribution. The 90th quantile of this distribution is
then the expression cutoff. This filter can be combined
with any other filter. Be careful with gene names as they
are case sensitive and must match exactly ("Pten" is
different from "PTEN"!). iv) biotype
where in this
case, genes with a certain biotype (MUST be contained in
the annotation, thus it's better to use
annotation="download"
) are excluded from the
analysis. This filter is a named list of logical, where
names are the biotypes in each genome and values are
TRUE
or FALSE
. If the biotype should be
excluded, the value should be TRUE
else
FALSE
. See the result of
get.defaults("biotype.filter","hg19")
for an
example. Finally, in future versions there will be
support for user-defined filters in the form of a
function. v) presence
where in this case, a gene
is further considered for statistical testing if
frac
(x100 for a percentage value) have more
than minCount
reads across all samples
(perCondition=FALSE
) or across the samples
of each condition (perCondition=TRUE
).
The normalization parameters are passed again as a named
list where the names of the members are the normalization
parameter names and the values are the normalization
parameter values. You should check the documentation of
the packages EDASeq, DESeq, edgeR, NOISeq and NBPSeq for
the parameter names and parameter values. There are a few
exceptions in parameter names: in case of
normalization="edaseq"
the only parameter names
are within.which
and between.which
,
controlling the withing lane/sample and between
lanes/samples normalization algorithm. In the case
of normalization="nbpseq"
, there is one
additional parameter called main.method
which can
take the calues "nbpseq"
or "nbsmyth"
.
These values correspond to the two different workflows
available in the NBPSeq package. Please, consult the
NBPSeq package documentation for further details. For the
rest of the algorithms, the parameter names are the same
as the names used in the respective packages. For
examples, please use the getDefaults
function.
The statistics parameters as passed to statistical
algorithms in metaseqr2, exactly with the same way as the
normalization parametes above. In this case, there is one
more layer in list nesting. Thus, statArgs
is a
named list whose names are the names the algorithms used
(see the statistics
parameter). Each member is
another named list,with parameters to be used for each
statistical algorithm. Again, the names of the member
lists are parameter names and the values of the member
lists are parameter values. You should check the
documentations of DESeq, edgeR, NOISeq, limma and
NBPSeq for these parameters. There are a few exceptions
in parameter names: In case of statistics="edger"
,
apart from the rest of the edgeR statistical testing
arguments, there is the argument mainMethod
which
can be either "classic"
or "glm"
, again
defining whether the binomial test or GLMs will be used
for statistical testing. For examples, please use the
getDefaults
function. When
statistics="nbpseq"
, apart from the rest arguments
of the NBPSeq functions estimate.disp
and
estimate.dispersion
, there is the argument
mainMethod
which can be "nbpseq"
or
"nbsmyth"
. This argument determines the parameters
to be used by the estimate.dispersion
function or
by the estimate.disp
function to estimate RNA-Seq
count dispersions. The difference between the two is that
they constitute different starting points for the two
workflows in the package NBPSeq. The first worklfow (with
mainMethod="nbpseq"
and the
estimate.dispersion
function is NBPSeq package
specific, while the second (with
mainMethod="nbsmyth"
and the estimate.disp
function is similar to the workflow of the edgeR package.
For additional information regarding the statistical
testing in NBPSeq, please consult the documentation of
the NBPSeq package.
The analysis presets are a set of keywords (only one can
be used) that predefine some of the parameters of the
metaseqr2 pipeline. For the time being they are quite
simple and they control i) the strictness of filtering
and statistical thresholding with three basic levels
("all", "medium", "strict") and ii) the data columns that
are exported, again in three basic ways ("basic",
"normal", "full") controlling the amount of data to be
exported. These keywords can be combined with a dot in
the middle (e.g. "all.basic"
to define an analysis
preset. When using analysis presets, the following
argumentsof metaseqr2 are overriden: exonFilters
,
geneFilters
, pcut
, exportWhat
,
exportScale
, exportValues
,
exonStats
. If you want to explicitly control the
above arguments, the preset
argument should be set
to NULL
(default). Following is a synopsis of the
different presets and the values of the arguments they
moderate:
"all_basic"
: use all genes (do not filter)
and export all genes and basic annotation and statistics
elements. In this case, the above described arguments become:
exonFilters=NULL
geneFilters=NULL
pcut=1
exportWhat=c("annotation","p_value",
"adj_p_value","meta_p_value",
"adj_meta_p_value","fold_change")
exportScale=c("natural","log2")
exportValues=c("normalized")
exportStats=c("mean")
"all_normal"
: use all genes (do not filter)
and export all genes and normal annotation and statistics
elements. In this case, the above described arguments
become:
exonFilters=NULL
geneFilters=NULL
pcut=1
exportWhat=c("annotation","p_value",
"adj_p_value","meta_p_value",
"adj_meta_p_value","fold_change","stats",
"counts")
exportScale=c("natural","log2")
exportValues=c("normalized")
exportStats=c("mean","sd","cv")
"all_full"
: use all genes (do not filter)
and export all genes and all annotation and statistics
elements. In this case, the above described arguments
become:
exonFilters=NULL
geneFilters=NULL
pcut=1
exportWhat=c("annotation","p_value",
"adj_p_value","meta_p_value",
"adj_meta_p_value","fold_change","stats","counts")
exportScale=c("natural","log2","log10","vst")
exportValues=c("raw","normalized")
exportStats=c("mean","median","sd","mad",
"cv","rcv")
"medium_basic"
: apply a medium set of filters
and and export statistically significant genes and basic
annotation and statistics elements. In this case, the above
above described arguments become:
exonFilters=list(minActiveExons=
list(exonsPerGene=5,minExons=2,frac=1/5))
geneFilters=list(length=list(length=500),
avgReads=list(averagePerBp=100,quantile=0.25),
expression=list(median=TRUE,mean=FALSE,quantile=NA,
known=NA,custom=NA),
biotype=getDefaults("biotypeFilter",org[1]))
pcut=0.05
exportWhat=c("annotation","p_value",
"adj_p_value","meta_p_value",
"adj_meta_p_value","fold_change")
exportScale=c("natural","log2")
exportValues=c("normalized")
exportStats=c("mean")
"medium_normal"
: apply a medium set of filters
and export statistically significant genes and normal
annotation and statistics elements. In this case, the
above described arguments become:
exonFilters=list(minActiveExons=
list(exonsPerGene=5,minExons=2,frac=1/5))
geneFilters=list(length=list(length=500),
avgReads=list(averagePerBp=100,quantile=0.25),
expression=list(median=TRUE,mean=FALSE,
quantile=NA,known=NA,custom=NA),
biotype=getDefaults("biotypeFilter",org[1]))
pcut=0.05
export.what=c("annotation","p_value",
"adj_p_value","meta_p_value",
"adj_meta_p_value","fold_change",
"stats","counts")
exportScale=c("natural","log2")
exportValues=c("normalized")
exportStats=c("mean","sd","cv")
"medium_full"
: apply a medium set of filters
and export statistically significant genes and full
annotation and statistics elements. In this case, the
above described arguments become:
exonFilters=list(minActiveExons=
list(exonsPerGene=5,minExons=2,frac=1/5))
geneFilters=list(length=list(length=500),
avgReads=list(averagePerBp=100,quantile=0.25),
expression=list(median=TRUE,mean=FALSE,
quantile=NA,known=NA,custom=NA),
biotype=getDefaults("biotypeFilter",org[1]))
pcut=0.05
exportWhat=c("annotation","p_value",
"adj_p_value","meta_p_value",
"adj_meta_p_value","fold_change",
"stats","counts")
exportScale=c("natural","log2","log10",
"vst")
exportValues=c("raw","normalized")
exportStats=c("mean","median","sd","mad",
"cv","rcv")
"strict_basic"
: apply a strict set of filters
and and export statistically significant genes and basic
annotation and statistics elements. In this case, the
above described arguments become:
exonFilters=list(minActiveExons=
list(exonsPerGene=4,minExons=2,frac=1/4))
geneFilters=list(length=list(length=750),
avgReads=list(averagePerBp=100,quantile=0.5),
expression=list(median=TRUE,mean=FALSE,
quantile=NA,known=NA,custom=NA),
biotype=getDefaults("biotypeFilter",org[1]))
pcut=0.01
exportWhat=c("annotation","p_value",
"adj_p_value","meta.p.value",
"adj_meta_p_value","fold_change")
exportScale=c("natural","log2")
exportValues=c("normalized")
exportStats=c("mean")
"strict_normal"
: apply a strict set of filters
and export statistically significant genes and normal
annotation and statistics elements. In this case, the
above described arguments become:
exonFilters=list(minActiveExons=
list(exonsPerGene=4,minExons=2,frac=1/4))
geneFilters=list(length=list(length=750),
avgReads=list(averagePerBp=100,quantile=0.5),
expression=list(median=TRUE,mean=FALSE,
quantile=NA,known=NA,custom=NA),
biotype=getDefaults("biotypeFilter",org[1]))
pcut=0.01
exportWhat=c("annotation","p_value",
"adj_p_value","meta_p_value",
"adj_meta_p_value","fold_change",
"stats","counts")
exportScale=c("natural","log2")
exportValues=c("normalized")
exportStats=c("mean","sd","cv")
"strict_full"
: apply a strict set of filters
and export statistically significant genes and full
annotation and statistics elements. In this case, the
above described arguments become:
exonFilters=list(minActiveExons=
list(exonsPerGene=4,minExons=2,frac=1/4))
geneFilters=list(length=list(length=750),
avgReads=list(averagePerBp=100,quantile=0.5),
expression=list(median=TRUE,mean=FALSE,
quantile=NA,known=NA,custom=NA),
biotype=getDefaults("biotypeFilter",org[1]))
pcut=0.01
exportWhat=c("annotation","p_value",
"adj_p_value","meta_p_value",
"adj_meta_p_value","fold_change",
"stats","counts")
exportScale=c("natural","log2","log10",
"vst")
exportValues=c("raw","normalized")
exportStats=c("mean","median","sd","mad",
"cv","rcv")
Currently only gene and exon annotation from Ensembl
(http://www.ensembl.org), UCSC and RefSeq are supported.
In addition, the user may choose to use own GTF file on
the fly or import to the backend annotation database (see
buildAnnotationDatabase
). Thus, the unique
gene ids in the counts files should correspond to valid
Ensembl, UCSC or RefSeq gene or exon accessions for the
organism of interest, or according to the user's GTF.
If you are not sure about the source of your counts file or
do not know how to produce it, it's better to start from the
original BAM/BED files (metaseqr2 will use the
read2count
function to create a counts
file). Keep in mind that in the case of BED files, the
performance will be significantly lower and the overall
running time significanlty higher as the R functions
which are used to read BED files to proper structures
(GenomicRanges) and calculate the counts are quite slow.
An alternative way is maybe the easyRNASeq package
(Delhomme et al, 2012). The read2count
function does not use this package but rather makes use
of standard Bioconductor functions to handle NGS data. If
you wish to work outside R, you can work with other
popular read counters such as the HTSeq read counter
(http://www-huber.embl.de/users/anders/HTSeq/doc/overview.html).
Please also note that in the current version, the members
of the geneFilters
and exonFilters
lists
are not checked for validity so be careful to supply with
correct names otherwise the pipeline will crash or at the
best case scenario, will ignore the filters. Also note
that when you are supplying metaseqr2 wtih an exon counts
table, gene annotation is always downloaded so please be
sure to have a working internet connection. In addition
to the above, if you have a multiple core system, be very
careful on how you are using the restrictCores
argument and generally how many cores you are using with
scripts purely written in R. The analysis with exon read
data can very easily cause memory problems, so unless you
have more than 64Gb of RAM available, consider setting
restrict.cores to something like 0.2 when working with
exon data. Finally, if you do not wish to download the
same annotation again and again when performing multiple
analyses, it is best to use the
buildAnnotationDatabase
function to download
and store the resulting data frames in local SQLite
database and then use these files with the
org
, refdb
and version
options.
Please note that the meta-analysis feature provided by metaseqr2 does not satisfy the strict definition of "meta-analysis", which is the combination of multiple similar datasets under the same statistical methodology. Instead it is the use of mulitple statistical tests applied to the same data. For the Simes method, please consult also "Simes, R. J. (1986). "An improved Bonferroni procedure for multiple tests of significance". Biometrika 73 (3): 751–754."
Also, if weight="meta_perm"
ideally one would
want to create the same set of indices for a given
dataset so as to create reproducible p-values. To
achieve this, use the set.seed
function prior to
any calculations. When metaseqR2 is loaded, the random
seed is set to 42
.
Panagiotis Moulos
# An example pipeline with gene counts
data("mm9GeneData",package="metaseqR2")
result <- metaseqr2(
counts=mm9GeneCounts,
sampleList=sampleListMm9,
contrast=c("adult_8_weeks_vs_e14.5"),
libsizeList=libsizeListMm9,
annotation="embedded",
org="mm9",
countType="gene",
normalization="edger",
statistics="edger",
pcut=0.05,
figFormat="png",
qcPlots="mds",
exportWhat=c("annotation","p_value","adj_p_value","fold_change"),
exportScale="natural",
exportValues="normalized",
exportStats="mean",
exportWhere=file.path(tempdir(),"test1"),
restrictCores=0.01,
geneFilters=list(
length=list(
length=500
),
avgReads=list(
averagePerBp=100,
quantile=0.25
),
expression=list(
median=TRUE,
mean=FALSE,
quantile=NA,
known=NA,
custom=NA
),
biotype=getDefaults("biotypeFilter","mm9")
),
outList=TRUE
)
head(result$data[["adult_8_weeks_vs_e14.5"]])
Add the following code to your website.
For more information on customizing the embed code, read Embedding Snippets.