BiocStyle::markdown()
knitr::opts_chunk$set(
    collapse = TRUE,
    comment = "#>",
    crop = NULL ## Related to
    ## https://stat.ethz.ch/pipermail/bioc-devel/2020-April/016656.html
)
## Track time spent on making the vignette
startTime <- Sys.time()

## Bib setup
library("knitcitations")

## Load knitcitations with a clean bibliography
cleanbib()
cite_options(hyperlink = "to.doc", citation_format = "text", style = "html")

## Write bibliography information
bib <- c(
    R = citation(),
    BiocStyle = citation("BiocStyle")[1],
    knitcitations = citation("knitcitations")[1],
    knitr = citation("knitr")[1],
    rmarkdown = citation("rmarkdown")[1],
    sessioninfo = citation("sessioninfo")[1],
    testthat = citation("testthat")[1],
    ISAnalytics = citation("ISAnalytics")[1]
)

write.bibtex(bib, file = "how_to_import_functions.bib")
ISAnalytics import functions family

In this vignette we explain in more detail how the functions of the import family should be used, along with the most common workflows to follow.
To install the package run the following code:
## For release version
if (!requireNamespace("BiocManager", quietly = TRUE)) {
    install.packages("BiocManager")
}
BiocManager::install("ISAnalytics")

## For devel version
if (!requireNamespace("BiocManager", quietly = TRUE)) {
    install.packages("BiocManager")
}
# The following initializes usage of Bioc devel
BiocManager::install(version = "devel")
BiocManager::install("ISAnalytics")
To install from GitHub:
# For release version
if (!require(devtools)) {
    install.packages("devtools")
}
devtools::install_github("calabrialab/ISAnalytics",
    ref = "RELEASE_3_12",
    dependencies = TRUE,
    build_vignettes = TRUE
)

## Safer option for vignette building issue
devtools::install_github("calabrialab/ISAnalytics",
    ref = "RELEASE_3_12"
)

# For devel version
if (!require(devtools)) {
    install.packages("devtools")
}
devtools::install_github("calabrialab/ISAnalytics",
    ref = "master",
    dependencies = TRUE,
    build_vignettes = TRUE
)

## Safer option for vignette building issue
devtools::install_github("calabrialab/ISAnalytics",
    ref = "master"
)
library(ISAnalytics)
ISAnalytics has a verbose option that allows some functions to print additional information to the console while they're executing.
To disable this feature do:
# DISABLE
options("ISAnalytics.verbose" = FALSE)
# ENABLE
options("ISAnalytics.verbose" = TRUE)
Some functions also produce reports in a user-friendly HTML format. To enable or disable this feature:
# DISABLE HTML REPORTS
options("ISAnalytics.widgets" = FALSE)
# ENABLE HTML REPORTS
options("ISAnalytics.widgets" = TRUE)
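Because both settings are ordinary R options, they can also be inspected and changed temporarily with base R. The following is a minimal sketch rather than anything the package prescribes; the fallback value of TRUE passed to getOption() is an assumption made for illustration, not the package's documented default.

```r
# Switch both features off and keep the previous values.
# options() invisibly returns the old values of the options it changes.
old_opts <- options(
    ISAnalytics.verbose = FALSE,
    ISAnalytics.widgets = FALSE
)

# Inspect the current settings; the second argument is only a fallback
# used when the option is unset (assumed here for illustration).
verbose_now <- getOption("ISAnalytics.verbose", default = TRUE)
widgets_now <- getOption("ISAnalytics.widgets", default = TRUE)
print(c(verbose = verbose_now, widgets = widgets_now))

# Restore whatever was set before
options(old_opts)
```

Saving and restoring the old values this way keeps a script from silently changing the settings of the session it runs in.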
The vast majority of the functions included in this package are designed to work in combination with the Vispa2 pipeline. If you don't know what it is, we strongly recommend taking a look at these links:
Vispa2 produces a standard file system structure starting from a folder you specify as your workbench or root. The structure always follows this schema:
We've included 2 examples of this structure in our package, one correct and the
other one including errors or potential problems. They are both in .zip format,
so you might want to unzip them if you plan to experiment with them.
An example on how to access them:
root_correct <- system.file("extdata", "fs.zip", package = "ISAnalytics")
root_correct <- unzip_file_system(root_correct, "fs")
fs::dir_tree(root_correct)
If you want to import a single integration matrix you can do so by using the import_single_Vispa2Matrix function.
This function reads the file and converts it into a tidy structure: several
different formats can be read, since you can specify the column separator.
If you're not familiar with the "tidy" concept, we recommend taking a look at
this link to get the basics:
This package is in fact based on the tidyverse and tries to follow its philosophy and guidelines as closely as possible.
The Vispa2 pipeline and the associated Create Matrix tool produce matrices with a
standard structure which we'll refer to as "messy", because data from different
experiments is spread across different columns and there are a lot of
NA values.
example_matrix_path <- system.file("extdata", "ex_annotated_ISMatrix.tsv.xz",
    package = "ISAnalytics"
)
example_matrix <- read.csv(example_matrix_path,
    sep = "\t", header = TRUE,
    stringsAsFactors = FALSE, check.names = FALSE
)
knitr::kable(head(example_matrix),
    caption = "A simple example of messy matrix.",
    align = "l"
)
example_matrix_path <- system.file("extdata", "ex_annotated_ISMatrix.tsv.xz",
    package = "ISAnalytics"
)
imported_im <- import_single_Vispa2Matrix(
    path = example_matrix_path,
    to_exclude = NULL,
    separator = "\t"
)
knitr::kable(head(imported_im), caption = "Example of tidy integration matrix")
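To make the difference between the two layouts concrete, here is a minimal base R sketch of a messy-to-tidy conversion on a toy matrix. It is not the package's implementation: the sample column names and the output column names `sample` and `value` are hypothetical and chosen only for illustration.

```r
# Toy "messy" matrix: one column per sample, many NA values
messy <- data.frame(
    chr = c("1", "1", "2"),
    integration_locus = c(1000L, 2000L, 3000L),
    strand = c("+", "-", "+"),
    sample_A = c(5, NA, 2),
    sample_B = c(NA, 7, NA),
    stringsAsFactors = FALSE
)

id_cols <- c("chr", "integration_locus", "strand")
sample_cols <- setdiff(names(messy), id_cols)

# Stack one block of rows per sample column, keeping the identifying columns
tidy <- do.call(rbind, lapply(sample_cols, function(s) {
    data.frame(messy[id_cols],
        sample = s, value = messy[[s]],
        stringsAsFactors = FALSE
    )
}))

# Drop the rows that only carried NA in the messy layout
tidy <- tidy[!is.na(tidy$value), ]
rownames(tidy) <- NULL
tidy
```

In the tidy layout each row is one observation (an integration site seen in one sample), so the NA cells of the wide layout disappear.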
We will refer to the structure generated by import_single_Vispa2Matrix as "integration matrix" for convenience.
To be considered an integration matrix the data frame must contain the mandatory
variables, which are "chr" (chromosome), "integration_locus" and "strand".
It might also contain annotation variables if the matrix was annotated during
the Vispa2 pipeline run.
You can access these names by using two functions:
# Displays the mandatory vars, can be called also for manipulation purposes
# on tibble instead of calling individual variables
mandatory_IS_vars()
# Displays the annotation variables
annotation_IS_vars()
You can of course operate on the integration matrices as you would on any other data frame, but some functions will check the presence of specific columns because they're needed in that context.
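As an illustration of this kind of check, the sketch below tests a data frame for the mandatory variables named above using base R only. The helper `missing_mandatory` is hypothetical and is not part of the package; the package's own checks may work differently.

```r
# Names of the mandatory variables, as listed in this vignette
required_vars <- c("chr", "integration_locus", "strand")

# Hypothetical helper: returns the mandatory columns missing from a data frame
missing_mandatory <- function(df, required = required_vars) {
    setdiff(required, colnames(df))
}

ok_df <- data.frame(
    chr = "1", integration_locus = 1000L, strand = "+", Value = 3
)
bad_df <- data.frame(chr = "1", Value = 3)

missing_mandatory(ok_df) # empty: all mandatory columns are present
missing_mandatory(bad_df) # "integration_locus" "strand"
```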
While you can import single matrices for brief analyses, what you will want to
do most of the time is import multiple matrices at once, based on certain
parameters. To do that you must first import the association file, which is the
file that holds all associated metadata and information about every project,
pool and single experiment.
The function that imports this file does not simply read it into your R
environment, but performs an alignment check with your file system, so you
have to specify the path to the root folder where your Vispa2 runs produce
output (see the previous section).
To import the association file do:
path_as_file <- system.file("extdata", "ex_association_file.tsv",
    package = "ISAnalytics"
)
withr::with_options(list(ISAnalytics.widgets = FALSE), {
    association_file <- import_association_file(
        path = path_as_file,
        root = root_correct,
        tp_padding = 4,
        dates_format = "dmy",
        separator = "\t"
    )
})
association_file
If you have the "widgets" option active, this will produce a visual HTML report
of the results of the alignment check, either in RStudio or in your browser.
If projects or pools are missing you will be notified: those elements will be
ignored until you fix the problems and re-import the association file.
If you're not interested in scanning the file system you can set the 'root' parameter to NULL and this step will be skipped.
The function can read multiple file formats, including Excel files. However, since metadata are crucial for a correct workflow, we recommend using the .tsv or .csv format to avoid potential parsing problems. Additionally, you can specify a filter to obtain an association file that is already filtered for your needs:
withr::with_options(list(ISAnalytics.widgets = FALSE), {
    association_file_filtered <- import_association_file(
        path = path_as_file,
        root = root_correct,
        tp_padding = 4,
        dates_format = "dmy",
        separator = "\t",
        filter_for = list(ProjectID = "CLOEXP")
    )
})
association_file_filtered
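To give an idea of what filtering with a named list can look like, here is a base R sketch on a toy table. It is not the package's code: the helper `filter_by_list`, the toy values, and the assumption that conditions on several columns are combined with a logical AND are all illustrative.

```r
# Toy association table (values are made up for illustration)
toy_af <- data.frame(
    ProjectID = c("CLOEXP", "CLOEXP", "PROJECT1"),
    PoolID = c("POOL6", "POOL7", "POOL1"),
    stringsAsFactors = FALSE
)

# Hypothetical helper: names of the list are column names,
# values are the entries to keep. Conditions are combined with AND (assumed).
filter_by_list <- function(df, filter_for) {
    stopifnot(all(names(filter_for) %in% colnames(df)))
    keep <- rep(TRUE, nrow(df))
    for (col in names(filter_for)) {
        keep <- keep & df[[col]] %in% filter_for[[col]]
    }
    df[keep, , drop = FALSE]
}

filtered <- filter_by_list(toy_af, list(ProjectID = "CLOEXP"))
filtered
```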
There are 2 different functions for importing multiple matrices in parallel:
import_parallel_Vispa2Matrices_interactive
import_parallel_Vispa2Matrices_auto
The interactive version asks you to enter your choices directly in the
console; the automatic version does not, but it has some limitations.
Both functions rely on the association file and some basic parameters, most
notably:
quantification_type: a string or a character vector indicating which quantification types you want the function to look for. The possible values are those returned by quantification_types().
matrix_type: tells the function whether it should consider annotated or not annotated matrices. The only possible options are "annotated" and "not_annotated".
workers: indicates the number of parallel workers to instantiate when importing. Keep in mind that the higher the number, the faster the process, but also the higher the peak RAM usage, so be aware of this especially if you're dealing with really big matrices. Set this parameter according to your needs and your hardware specifications.

Both versions produce an HTML report summarizing the import process. The report includes:
Both functions, by default, return a multi-quantification matrix (see comparison_matrix).
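When choosing a value for workers, one possible starting point is the number of cores reported by the machine. The sketch below uses the parallel package that ships with R; leaving one core free and capping the value at 4 are arbitrary assumptions for illustration, and available RAM should still be taken into account as noted above.

```r
# Number of logical cores reported by the machine (can be NA on some platforms)
available <- parallel::detectCores()
if (is.na(available)) {
    available <- 1L
}

# Arbitrary illustrative cap: more workers also means a higher RAM peak
max_workers <- 4L

# Leave one core free, but always use at least one worker
workers <- max(1L, min(max_workers, available - 1L))
workers
```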
As stated before, with the interactive version you have more control and you can directly choose:
If you haven't imported the association file yet, you can directly pass the path to the association file and the path to the root folder into the function: in this way the association file will automatically be imported.
Example:
withr::with_options(list(ISAnalytics.widgets = FALSE), {
    matrices <- import_parallel_Vispa2Matrices_interactive(
        association_file = path_as_file,
        root = root_correct,
        quantification_type = c("fragmentEstimate", "seqCount"),
        matrix_type = "annotated",
        workers = 2
    )
})
If you've already imported the association file you can instead call the function like this:
matrices <- import_parallel_Vispa2Matrices_interactive(
    association_file = association_file,
    root = NULL,
    quantification_type = c("fragmentEstimate", "seqCount"),
    matrix_type = "annotated",
    workers = 2
)
You can simply access the data frames by doing:
matrices$fragmentEstimate
matrices$seqCount
If you choose to opt for the automatic version you should keep in mind that the function automatically considers everything included in the association file, so if you want to import only a subset of projects and/or pools you should filter the association file according to your criteria before calling the function:
library(magrittr)
refined_af <- association_file %>%
    dplyr::filter(.data$ProjectID == "CLOEXP")
The automatic version has no way of choosing between duplicates, so you can specify additional patterns to look for in file names to mitigate this problem. However, if duplicates are still found after matching the additional patterns, they are simply discarded.
There are 2 additional parameters to set:
patterns: a string or a character vector containing regular expressions to be matched on file names. If you're not familiar with regular expressions, we suggest starting from the stringr cheatsheet.
matching_opt: a single string that tells the function how to match the patterns. The possible values for this parameter are those returned by matching_options().

You can call the function with patterns set to NULL if you don't wish to match anything:
withr::with_options(list(ISAnalytics.widgets = FALSE), {
    matrices_auto <- import_parallel_Vispa2Matrices_auto(
        association_file = refined_af,
        root = NULL,
        quantification_type = c("fragmentEstimate", "seqCount"),
        matrix_type = "annotated",
        workers = 2,
        patterns = NULL,
        matching_opt = "ANY" # Same if you choose "ALL" or "OPTIONAL"
    )
})
matrices_auto
Let's do an example with a file system where there are issues, such as duplicates:
root_err <- system.file("extdata", "fserr.zip", package = "ISAnalytics")
root_err <- unzip_file_system(root_err, "fserr")
fs::dir_tree(root_err)
withr::with_options(list(ISAnalytics.widgets = FALSE), {
    association_file_fserr <- import_association_file(path_as_file, root_err)
    refined_af_err <- association_file_fserr %>%
        dplyr::filter(.data$ProjectID == "CLOEXP")
    matrices_auto2 <- import_parallel_Vispa2Matrices_auto(
        association_file = refined_af_err,
        root = NULL,
        quantification_type = c("fragmentEstimate", "seqCount"),
        matrix_type = "annotated",
        workers = 2,
        patterns = "NoMate",
        matching_opt = "ANY" # Same if you choose "ALL" or "OPTIONAL"
    )
})
# Print the result of this second import
matrices_auto2
As you can see, in the file system with issues there is more than one file per quantification type, and the duplicates have a "NoMate" suffix in their file name. By specifying this pattern in the function, we're only going to import those files.
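The idea of matching patterns on file names can be sketched with base R regular expressions. This is not the package's implementation: the file names and the helper `match_files` are hypothetical, and the meaning given here to "ANY" (at least one pattern matches) and "ALL" (every pattern matches) is an assumption based on the option names; "OPTIONAL" is left out.

```r
# Hypothetical file names, for illustration only
file_names <- c(
    "pool1_fragmentEstimate_matrix.tsv",
    "pool1_fragmentEstimate_matrix.NoMate.tsv",
    "pool1_seqCount_matrix.NoMate.tsv"
)

# Hypothetical helper: keep the files whose names match the patterns
match_files <- function(files, patterns, matching_opt = c("ANY", "ALL")) {
    matching_opt <- match.arg(matching_opt)
    # One column of TRUE/FALSE per pattern, one row per file
    hits <- vapply(
        patterns, function(p) grepl(p, files),
        logical(length(files))
    )
    hits <- matrix(hits, nrow = length(files))
    keep <- if (matching_opt == "ANY") {
        rowSums(hits) > 0
    } else {
        rowSums(hits) == length(patterns)
    }
    files[keep]
}

match_files(file_names, "NoMate", "ANY")
match_files(file_names, c("NoMate", "seqCount"), "ALL")
```

With the single pattern "NoMate", only the two files carrying that suffix are kept; requiring both "NoMate" and "seqCount" narrows the result to one file.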
As with the interactive version, you can call the function with the path to the association file and the root if you simply want to import everything without filtering.
The r Biocpkg("ISAnalytics") package r citep(bib[["ISAnalytics"]]) was made possible thanks to:

r citep(bib[["R"]])
r Biocpkg("BiocStyle") r citep(bib[["BiocStyle"]])
r CRANpkg("knitcitations") r citep(bib[["knitcitations"]])
r CRANpkg("knitr") r citep(bib[["knitr"]])
r CRANpkg("rmarkdown") r citep(bib[["rmarkdown"]])
r CRANpkg("sessioninfo") r citep(bib[["sessioninfo"]])
r CRANpkg("testthat") r citep(bib[["testthat"]])
This package was developed using r BiocStyle::Githubpkg("lcolladotor/biocthis").
R session information.
## Session info
library("sessioninfo")
options(width = 120)
session_info()
This vignette was generated using r Biocpkg("BiocStyle") r citep(bib[["BiocStyle"]]) with r CRANpkg("knitr") r citep(bib[["knitr"]]) and r CRANpkg("rmarkdown") r citep(bib[["rmarkdown"]]) running behind the scenes.
Citations made with r CRANpkg("knitcitations") r citep(bib[["knitcitations"]]).
## Print bibliography
bibliography()