knitr::opts_chunk$set( collapse = TRUE, comment = "#>", eval = TRUE, crop = NULL )
Sys.setenv(
    PATH = paste(
        Sys.getenv("PATH"),
        paste0(Sys.getenv("HOME"), "/.local/bin"),
        paste0(Sys.getenv("HOME"), "/Documents/Programs/salmon-1.9.0_linux_x86_64/bin/"),
        "/opt/STAR-2.7.9a/bin/Linux_x86_64_static/",
        "/opt/bin/",
        "/opt/salmon-1.9.0_linux_x86_64/bin/",
        "/opt/kallisto/",
        "/opt/subread-2.0.3-Linux-x86_64/bin/",
        "/opt/stringtie-2.1.7.Linux_x86_64/",
        "/opt/taco-v0.7.3.Linux_x86_64/",
        sep = ":"
    )
)

options(timeout = 60000)
Over the past decades, there has been an exponential accumulation of RNA-seq
data in public repositories. This steep increase paved the way for the creation
of gene expression atlases, which consist of comprehensive collections of
expression data from public databases, analyzed with a single pipeline for
consistency and cross-project comparison.
r BiocStyle::Githubpkg("almeidasilvaf/bears")
is a package that allows you to create your own gene expression atlas
for a given species using public data. The package features:
To install r BiocStyle::Githubpkg("almeidasilvaf/bears"), use the following code:
remotes::install_github("almeidasilvaf/bears")
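Note that this requires the remotes package, which can be installed from CRAN if needed:

install.packages("remotes")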
Then, create a standard directory structure with create_dir_structure() to
store your results. This is optional,
but it will make your life much easier. The output is a list of paths to
common directories that you will need to specify in several functions of this
package.
library(bears)

# Create directory structure using a temporary directory as root
ds <- create_dir_structure(rootdir = tempdir())

# Look at the output
ds
To run the full pipeline implemented in
r BiocStyle::Githubpkg("almeidasilvaf/bears"), you will need some external
software tools installed on your machine. The tools are listed below. In the
Functions column, you can see the names of the functions in
r BiocStyle::Githubpkg("almeidasilvaf/bears") that require each tool.
software_df <- data.frame(
    Software = c(
        "fastp", "kallisto", "RSeQC", "salmon", "SortMeRNA",
        "STAR", "StringTie", "Subread", "TACO"
    ),
    Version = c(
        ">=0.22.0", ">=0.11.9", ">=4.0.0", ">=1.8.0", ">=4.3.4",
        ">=2.7.10a", ">=2.2.1", ">=2.0.1", ">=0.7.3"
    )
)
software_df$Functions <- c(
    "trim_reads()",
    "kallisto_index(), kallisto_quantify()",
    "infer_strandedness()",
    "salmon_index(), salmon_quantify()",
    "remove_rrna()",
    "star_genome_index(), star_align()",
    "stringtie_assemble(), stringtie_quantify()",
    "fcount()",
    "taco_merge()"
)
knitr::kable(software_df)
To make your life easier, we have created .yml files with Conda environments
containing each of these external tools. You can create an environment for
each tool and manage Conda environments from the R session with
the Bioconductor package r BiocStyle::Biocpkg("Herper")
(see the FAQ section for details).
Alternatively, you can see here the code we used to install all external dependencies in the Ubuntu 20.04 virtual machine provided by GitHub Actions, which is how this package is tested and how this document was created.
As a sanity check, let's see if all external dependencies are installed.
# Test installation of external dependencies
fastp_is_installed()
star_is_installed()
sortmerna_is_installed()
rseqc_is_installed()
salmon_is_installed()
kallisto_is_installed()
subread_is_installed()
stringtie_is_installed()
taco_is_installed()
First of all, you need to choose which samples you want to download and create
a metadata data frame for them. To create this data frame, you will pass a
search term to the function create_sample_info(). The search term follows the
SRA search syntax. For example, you can search by:
Let's create a metadata data frame for a human RNA-seq sample that is included
in the r BiocStyle::Biocpkg("airway")
Bioconductor package.
# Create metadata data frame
term <- "SAMN02422669[BSPL]"
metadata <- create_sample_info(term)
metadata
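Other search terms work the same way. The example below is purely illustrative; the bracketed field tag is an assumption based on the standard Entrez/SRA search syntax, so double-check it against the SRA advanced search builder before running a large query.

# Hypothetical example of a broader search (field tag assumed from Entrez syntax)
term_organism <- "Homo sapiens[ORGN]"
# metadata_organism <- create_sample_info(term_organism)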
To download the .fastq files from ENA, you will use the function
download_from_ena(). As input, you only need to give the metadata data frame
and the path to the output directory where the .fastq files will be stored.
IMPORTANT: Note in the code below that you must increase R's download timeout (via options(timeout = ...)), or your downloads will be interrupted after the default limit of 60 seconds. You will probably need more than 60 seconds to download some samples, especially if your internet connection is not very good.
Here's an example of how to run download_from_ena():
# Change default timeout limit
options(timeout = 6000)

# Download sample to temporary directory
download <- download_from_ena(metadata, fastqdir = ds$fastqdir)
To save running time, we will simply copy the example FASTQ files from the
extdata subdirectory of this package to ds$fastqdir. These example files
contain a subset of the SAMN02422669 BioSample we mentioned before.
# To save running time, copy example FASTQ files to ds$fastqdir
f <- list.files(
    system.file("extdata", package = "bears"),
    pattern = ".fastq.gz",
    full.names = TRUE
)
copy <- file.copy(f, ds$fastqdir)
After downloading FASTQ files, it's important to check their integrity.
The function check_md5() does that for you by checking whether the md5 checksum
of each downloaded file matches the md5 checksum of the original file.
check_md5(run_accessions = metadata$Run, fastqdir = ds$fastqdir)
Here, as expected, the function reported an issue. Because the example files
we are using contain only a subset of the SAMN02422669 BioSample, the
md5 checksums are different. If the md5 checksums matched, the Status
variable would be TRUE.
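In a real-world run with many samples, you could use this output to keep only the samples whose checksums matched. Below is a minimal sketch, assuming (as the output above suggests) that check_md5() returns a data frame with a Run column and a logical Status column.

# Minimal sketch (assumes check_md5() returns Run and logical Status columns)
md5_check <- check_md5(run_accessions = metadata$Run, fastqdir = ds$fastqdir)
ok_runs <- md5_check$Run[md5_check$Status]
metadata_ok <- metadata[metadata$Run %in% ok_runs, ]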
After downloading the .fastq files, you need to perform some filtering steps to remove low-quality bases, adapters, and rRNA. Read filtering consists of two steps:

1. trim_reads() - trim adapters and low quality bases (if any). This function runs fastp [@chen2018fastp] and saves filtered .fastq files in a directory named filtdir.

2. remove_rrna() - remove rRNA (if any) from .fastq files. rRNA removal relies on the SortMeRNA [@kopylova2012sortmerna] program.
First, let's use trim_reads() to filter reads. [^1]

[^1]: Friendly tip: filtered FASTQ files are stored in the directory specified in the filtdir parameter, even if no filtering was performed (no adapters and no low-quality bases, for instance). If you want to delete the unfiltered FASTQ files to save disk space, you can set delete_raw = TRUE.
if(fastp_is_installed()) {
    # Trim reads
    fastp_status <- trim_reads(
        metadata,
        fastqdir = ds$fastqdir,
        filtdir = ds$filtdir,
        qcdir = ds$qcdir
    )
    fastp_status # check run status
}
The function trim_reads() stores .json files containing fastp summary
statistics for each sample in the directory specified in qcdir. You can read
and parse them into a data frame with the function summary_stats_fastp().
Let's demonstrate how it works.
# Path to directory containing .json file from fastp
qcdir <- system.file("extdata", package = "bears")
qc_table <- summary_stats_fastp(qcdir)
qc_table
In this vignette, we will use a small rRNA database as an example. In real-world applications, your rRNA database directory should contain all FASTA files distributed in the SortMeRNA GitHub repo. However, if you think some of these files (e.g., 5s and 5.8s rRNA) are not a concern in your data set, you don't need to include them in the database.
# Create a directory to store the rRNA db
rrna_db_dir <- file.path(tempdir(), "rrna")
dir.create(rrna_db_dir)

# Copy the example 16S rRNA file to the db directory
rrna_file <- system.file("extdata", "bac_16s_subset.fa", package = "bears")
copy <- file.copy(from = rrna_file, to = rrna_db_dir)

# Remove rRNA (if any)
if(sortmerna_is_installed()) {
    rrna_removal <- remove_rrna(
        metadata,
        fastqdir = ds$fastqdir,
        filtdir = ds$filtdir,
        rrna_db_dir = rrna_db_dir
    )
    rrna_removal # check run status
}
Now that we have performed all quality checks, we're good to go. [^1]
Quantification of transcript abundance can be done in two ways:
Alignment-based approaches, which involve mapping reads to a reference genome using STAR [@dobin2013star] and quantifying expression based on uniquely mapped reads with featureCounts [@liao2014featurecounts] (in raw counts) and/or StringTie [@pertea2015stringtie] (in TPM).
Alignment-free approaches, which involve pseudo-aligning or quasi-mapping reads to a reference transcriptome with kallisto [@bray2016near] or salmon [@patro2017salmon], respectively.
Below, we will describe how to perform both approaches.
To start with, we will need to map the reads to a reference genome.
In r BiocStyle::Githubpkg("almeidasilvaf/bears")
, reads are mapped
to the reference genome with the software tool STAR.
Here, for the purpose of demonstration, we will map reads to a subset of the human genome. The FASTA and GTF files corresponding to the subset of the genome are available in the extdata/ subdirectory of this package.
Before mapping reads, we need to create a genome index. This can be done with star_genome_index().
# Get paths to genome subset
genome_path <- system.file(
    "extdata", "Homo_sapiens.GRCh37.75_subset.fa", package = "bears"
)
gff_path <- system.file(
    "extdata", "Homo_sapiens.GRCh37.75_subset.gtf", package = "bears"
)

# Create genome index
if(star_is_installed()) {
    genome_idx <- star_genome_index(
        genome_path = genome_path,
        gff_path = gff_path,
        mappingdir = ds$mappingdir
    )
    genome_idx # check run status
}
Now that we have the genome index, we can map reads to it.
# Map reads to the genome
if(star_is_installed()) {
    read_mapping <- star_align(
        metadata,
        filtdir = ds$filtdir,
        qc_table = qc_table,
        mappingdir = ds$mappingdir,
        gff_path = gff_path
    )
    read_mapping # check run status
}
Finally, let's get read mapping statistics with the function
summary_stats_star(). Here, we will use an example Log.final.out file
(the mapping report that STAR generates) stored in the extdata/ subdirectory
of this package.
# Obtaining read mapping statistics
star_dir <- system.file("extdata", package = "bears")
star_stats <- summary_stats_star(star_dir = star_dir)
star_stats
Now, let's check if samples passed the minimum quality criteria. Here, samples are excluded if:
The function mapping_pass()
takes the metadata data frame and returns the
same data frame, but only with the samples that passed the minimum criteria.
# Check if samples passed the filtering criterion
align_passed <- mapping_pass(star_stats, metadata)
align_passed # inspect data

# Compare to the original data set
nrow(metadata)
nrow(align_passed)
As you can see, the sample we used for read mapping passed the minimum quality criteria. Good, huh? We can now proceed to the next step.
Before quantification, we need to infer library strandedness with the
RSeQC [@wang2012rseqc] tool. The function infer_strandedness() runs RSeQC and
returns the metadata data frame with an additional column named Orientation,
containing library strandedness information.
This function requires the annotation in BED format, not GTF/GFF. To convert
from GTF/GFF to BED, use the function gff2bed().
# Convert GFF to BED
bedpath <- gff2bed(gff_path)
bedpath # check path

# Infer strandedness
if(rseqc_is_installed()) {
    new_metadata <- infer_strandedness(
        mapping_passed = align_passed,
        bedpath = bedpath,
        mappingdir = ds$mappingdir
    )
    new_metadata
}
As you can see, there is a new Orientation
column with strandedness info
for this sample.
Now that we have the .bam files from STAR and information on library strandedness for each sample, we can quantify the expression with featureCounts. This tool quantifies gene expression measured in raw read counts per gene.
To quantify gene expression with featureCounts, use the function
featureCounts(). This function runs featureCounts and returns a gene
expression matrix with genes in rows and samples in columns.
# Get gene expression in raw read counts
if(subread_is_installed()) {
    fcounts_quant <- featureCounts(
        new_metadata,
        mappingdir = ds$mappingdir,
        gff_path = gff_path,
        fcountsdir = ds$fcountsdir
    )
    # Explore the expression matrix
    head(fcounts_quant)
}
Whenever you are working with gene expression data, we recommend storing your
data as SummarizedExperiment
objects, so you have the expression matrix and
sample metadata in a single object. If you are not familiar with
SummarizedExperiment
objects, take a look at the documentation of the
r BiocStyle::Biocpkg("SummarizedExperiment")
package.
To get a SummarizedExperiment object from featureCounts, use the function
featureCounts2se().
# Create a SummarizedExperiment object from featureCounts output
fcounts_se <- featureCounts2se(
    new_metadata,
    fc_output = fcounts_quant
)

# Take a look at the SummarizedExperiment object
fcounts_se

# Exploring sample metadata
SummarizedExperiment::colData(fcounts_se)

# Exploring gene expression matrix
SummarizedExperiment::assay(fcounts_se)
StringTie quantifies transcript-level and gene-level transcript abundances in normalized values (transcripts per million, TPM).
To obtain gene expression levels in TPM with StringTie, use the function
stringtie_quantify().
# Quantify expression in TPM with StringTie
if(stringtie_is_installed()) {
    stringtie_quant <- stringtie_quantify(
        new_metadata,
        qc_table = qc_table,
        mappingdir = ds$mappingdir,
        gff_path = gff_path,
        stringtiedir = ds$stringtiedir
    )
    stringtie_quant # check run status
}
Now, let's read the output from StringTie as a SummarizedExperiment object
with the function stringtie2se(). You can choose whether you want the
expression at the gene level, at the transcript level, or both. Here, let's
get the gene-level expression. For that, you will need to give a 2-column data
frame with transcript IDs and their corresponding genes.
# Load transcript-to-gene correspondence
data(tx2gene)
head(tx2gene)

# Read StringTie output as a SummarizedExperiment object
stringtiese <- stringtie2se(
    new_metadata,
    stringtiedir = ds$stringtiedir,
    level = "gene",
    tx2gene = tx2gene
)

# Exploring the SummarizedExperiment object
stringtiese

# Looking at gene expression matrix
SummarizedExperiment::assay(stringtiese, "gene_TPM")
Besides quantifying transcript abundance, StringTie can also be used to assemble transcripts for each BioSample. Assembled transcripts for each BioSample are represented as .gtf files.
However, if you want a single .gtf file with the assembled transcripts for all BioSamples you are studying, you can merge the individual .gtf files from StringTie with the software tool TACO [@niknafs2017taco]. Below, we will demonstrate how that can be achieved.
# Transcript assembly with StringTie
if(stringtie_is_installed()) {
    assembled_transcripts <- stringtie_assemble(
        new_metadata,
        qc_table = qc_table,
        mappingdir = ds$mappingdir,
        gff_path = gff_path,
        stringtiedir = ds$stringtiedir
    )
    assembled_transcripts # check run status
}
In this vignette, we have a single BioSample. However, in real-life scenarios,
you would have several samples. To merge the .gtf files for each sample into
a single .gtf file, use the function taco_merge().
# Merge assembled transcripts with TACO
if(taco_is_installed()) {
    merged_transcripts <- taco_merge(
        new_metadata,
        stringtiedir = ds$stringtiedir
    )
    merged_transcripts # check run status
}
The merged transcript assembly will be stored in a file named
final_assembly.gtf in the subdirectory assembly/merged_assembly, inside
stringtiedir. To get the path to the .gtf file, use:
# Get path to merged transcript assembly
final_assembly <- file.path(
    ds$stringtiedir, "assembly", "merged_assembly", "final_assembly.gtf"
)
final_assembly
But why would I want to assemble transcripts and merge them if I already have a .gtf file with the transcript annotations?
That's a great question! This is a way to identify novel transcripts that are not present in your reference .gtf file. Transcripts can be missing from the reference annotation (.gtf file) mainly because: i. the genome assembly is not of good quality, so these transcripts could not be predicted; ii. there are false positives from the transcript annotation software tool that was used.
If you want a more comprehensive quantification of transcript abundance, you can assemble transcripts for each sample, merge them, and pass the resulting final_assembly.gtf file to the quantification functions. This way, instead of using the reference transcript annotation, you will use your own transcript annotation, which may contain novel transcripts.
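For instance, reusing the objects created above, a minimal sketch of this idea (assuming the merged assembly file has already been created) could look like:

# Sketch: quantify against the merged assembly instead of the reference GTF
if(stringtie_is_installed() && file.exists(final_assembly)) {
    requant_merged <- stringtie_quantify(
        new_metadata,
        qc_table = qc_table,
        mappingdir = ds$mappingdir,
        gff_path = final_assembly,
        stringtiedir = ds$stringtiedir
    )
}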
To quantify the expression without mapping reads to the genome, you have two options:
For both kallisto and salmon, you will need to have a
reference transcriptome, not a reference genome. This is a FASTA file
containing the sequences of all annotated transcripts in your genome. You can
easily create this file with the function extractTranscriptSeqs()
from the
r BiocStyle::Biocpkg("GenomicFeatures")
package.
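As a rough sketch of how that could be done with the example genome and GTF used above (the temporary copy and output file name are just illustrations, not part of the r BiocStyle::Githubpkg("almeidasilvaf/bears") API):

library(GenomicFeatures)
library(Biostrings)

# Work on a copy of the example genome in a writable temporary directory
genome_copy <- file.path(tempdir(), basename(genome_path))
copied <- file.copy(genome_path, genome_copy)
Rsamtools::indexFa(genome_copy)

# Build a TxDb from the GTF and extract transcript sequences
txdb <- makeTxDbFromGFF(gff_path)
tx_seqs <- extractTranscriptSeqs(
    Rsamtools::FaFile(genome_copy),
    exonsBy(txdb, by = "tx", use.names = TRUE)
)

# Write the transcriptome FASTA (hypothetical file name)
writeXStringSet(tx_seqs, filepath = file.path(tempdir(), "transcripts.fa"))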
First of all, we will need to index the reference transcriptome with the
function salmon_index().
# Path to reference transcriptome
transcriptome_path <- system.file(
    "extdata", "Homo_sapiens.GRCh37.75_subset_transcripts.fa.gz",
    package = "bears"
)

# Index the transcriptome
if(salmon_is_installed()) {
    idx_salmon <- salmon_index(
        salmonindex = ds$salmonindex,
        transcriptome_path = transcriptome_path
    )
    idx_salmon # check run status
}
Now, we can quantify transcript abundance with salmon_quantify().
# Quantify transcript abundance
if(salmon_is_installed()) {
    quant_salmon <- salmon_quantify(
        new_metadata,
        filtdir = ds$filtdir,
        salmonindex = ds$salmonindex,
        salmondir = ds$salmondir
    )
    quant_salmon # check run status
}
After running salmon_quantify(), salmon output in .sf format will be stored
in the directory specified in salmondir.

To read the salmon output as a SummarizedExperiment object, use the function
salmon2se(). You can choose whether you want the expression at the gene level,
at the transcript level, or both. Here, let's get the gene-level expression.
For that, you will need to give a 2-column data frame with transcript IDs and
their corresponding genes.
# Load transcript-to-gene data frame
data(tx2gene)
head(tx2gene)

# Read salmon output as a SummarizedExperiment object
salmon_se <- salmon2se(
    new_metadata,
    level = "gene",
    salmondir = ds$salmondir,
    tx2gene
)

# Exploring the output
salmon_se

# Get gene expression matrix in TPM
SummarizedExperiment::assay(salmon_se, "gene_TPM")

# Get gene expression matrix as raw counts
SummarizedExperiment::assay(salmon_se, "gene_counts")
As we did with salmon, we will start by indexing the transcriptome.
# Index the transcriptome
if(kallisto_is_installed()) {
    idx_kallisto <- kallisto_index(
        kallistoindex = ds$kallistoindex,
        transcriptome_path = transcriptome_path
    )
    idx_kallisto # check run status
}
Now, we can quantify the transcript abundance.
# Quantify transcript abundance
if(kallisto_is_installed()) {
    quant_kallisto <- kallisto_quantify(
        new_metadata,
        qc_table,
        filtdir = ds$filtdir,
        kallistoindex = ds$kallistoindex,
        kallistodir = ds$kallistodir
    )
    quant_kallisto # check run status
}
To read the kallisto output as a SummarizedExperiment object, use the function
kallisto2se(). Again, you will have to specify whether you want the expression
at the gene level, transcript level, or both. Let's get the gene-level
expression here.
# Read kallisto output to SummarizedExperiment object
kallisto_se <- kallisto2se(
    new_metadata,
    level = "gene",
    kallistodir = ds$kallistodir,
    tx2gene
)

# Exploring the output
kallisto_se

# Get gene expression matrix in TPM
SummarizedExperiment::assay(kallisto_se, "gene_TPM")

# Get gene expression matrix as raw counts
SummarizedExperiment::assay(kallisto_se, "gene_counts")
If you are using r BiocStyle::Githubpkg("almeidasilvaf/bears"), there are two
things you must keep in mind. First, this package was designed to be as
complete as possible, but you don't need to run the whole pipeline for your
own project. For instance, if you just want gene expression values in TPM for
a particular BioProject or set of BioProjects, you can simply go through the
salmon path of the pipeline, skipping the read mapping section. Likewise, if
you are using r BiocStyle::Githubpkg("almeidasilvaf/bears") for your own data
set and you have already cleaned the reads, you can skip the sequence quality
checks and read filtering sections. The second thing to consider is that
r BiocStyle::Githubpkg("almeidasilvaf/bears") is a work in progress.
Bioinformatics is a fast-evolving field, and new (and better) methods to
address a particular question are developed continuously. Hence, we aim to
keep r BiocStyle::Githubpkg("almeidasilvaf/bears") up to date with
state-of-the-art methods.
How do I manage Conda environments from the R session?

You only need to do the following 2 things:

1. Define the path where your Miniconda installation will live. For example:

miniconda_path <- "~/tools/miniconda"

2. Use the Bioconductor package r BiocStyle::Biocpkg("Herper") to create an environment for each tool from the .yml files stored in the extdata/ subdirectory of this package. To avoid conflicts, it is important to keep each tool in its own environment. Here, each tool will be installed in an environment named <tool-name-lowercase>_env (e.g., star_env, rseqc_env, salmon_env).

library(Herper)

# Path to .yml files to create environments
envs <- list.files(
    system.file("extdata", package = "bears"),
    pattern = ".yml",
    full.names = TRUE
)

# Install miniconda in `miniconda_path` and create the environments
create_envs <- sapply(envs, function(x) {
    import_CondaEnv(x, pathToMiniConda = miniconda_path)
})
Now, you can run functions that call external tools inside a call to
Herper::withCondaEnv(). This function allows you to run an R function inside
a particular Conda environment, which you need to specify. For example, to run
the function salmon_quantify() inside the environment salmon_env, you would do:
withCondaEnv(
    "salmon_env",
    quant_salmon <- salmon_quantify(
        new_metadata,
        filtdir = ds$filtdir,
        salmonindex = ds$salmonindex,
        salmondir = ds$salmondir
    ),
    pathToMiniConda = miniconda_path
)
Can I use this package with my own RNA-seq data (not from a database)?
You surely can. For that, you will first need to create a metadata data frame
for your samples that looks like the data frame created by
create_sample_info() (see the example data set sample_info). While you can add
as many columns as you want, 5 columns MUST be present (a minimal example is
sketched after this list):
BioSample: BioSample ID. These can be fictional sample names, e.g., "Sample1", "Sample2", "Sample3", etc.
Run: Run accession. These must match the basenames of your files. Make sure you don't add separate run accessions for the two files of a paired-end sample, i.e., the files example_1.fastq.gz and example_2.fastq.gz should both be represented by the same run accession. Examples of file names and the run accessions they should have:
| Layout | File name                                 | Run accession              |
|:-------|:------------------------------------------|:---------------------------|
| PAIRED | SampleA_1.fastq.gz and SampleA_2.fastq.gz  | SampleA                    |
| SINGLE | Sample1_control_replicate1.fastq.gz        | Sample1_control_replicate1 |
BioProject: Project ID. If all files come from a single project, you can give it whatever name you want, such as MyNicePhDProject.
Instrument: Sequencing instrument (e.g., Illumina HiSeq 2500). Note that long-read technologies (e.g., PacBio) are not supported and, hence, will be skipped.
Layout: Sequencing protocol. Either "PAIRED" or "SINGLE".
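Putting these requirements together, a minimal hand-made metadata data frame could look like the sketch below (all sample, run, and project names are hypothetical):

# Sketch of a custom metadata data frame (hypothetical names)
# Run values must match the FASTQ basenames, e.g. Sample1_run1_1.fastq.gz and
# Sample1_run1_2.fastq.gz for the paired-end sample, and Sample2_run1.fastq.gz
# for the single-end sample
custom_metadata <- data.frame(
    BioSample = c("Sample1", "Sample2"),
    Run = c("Sample1_run1", "Sample2_run1"),
    BioProject = "MyNicePhDProject",
    Instrument = "Illumina HiSeq 2500",
    Layout = c("PAIRED", "SINGLE")
)
custom_metadata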
Finally, if you want to create a standard directory structure with
create_dir_structure(), you will have to either move your files to the
directory indicated in ds$fastqdir or change ds$fastqdir manually to include
the path to the directory where your FASTQ files are. For example:
ds$fastqdir <- "/home/myusername/my_nice_project/fastq_files"
This document was created under the following conditions:
sessioninfo::session_info()