knitr::opts_chunk$set(
    collapse = TRUE,
    comment = "#>",
    eval = TRUE,
    crop = NULL
)
Sys.setenv(
    PATH = paste(
        Sys.getenv("PATH"),
        paste0(Sys.getenv("HOME"), "/.local/bin"),
        paste0(
            Sys.getenv("HOME"), 
            "/Documents/Programs/salmon-1.9.0_linux_x86_64/bin/"
        ),
        "/opt/STAR-2.7.9a/bin/Linux_x86_64_static/",
        "/opt/bin/",
        "/opt/salmon-1.9.0_linux_x86_64/bin/",
        "/opt/kallisto/",
        "/opt/subread-2.0.3-Linux-x86_64/bin/",
        "/opt/stringtie-2.1.7.Linux_x86_64/",
        "/opt/taco-v0.7.3.Linux_x86_64/",
        sep = ":"
    )
)
options(timeout = 60000)

Introduction

In the past decades, there has been an exponential accumulation of RNA-seq data in public repositories. This steep increase paved the way to the creation of gene expression atlases, which consist of comprehensive collections of expression data from public databases, analyzed with a single pipeline for consistency and cross-project comparison. r BiocStyle::Githubpkg("almeidasilvaf/bears") is a package that allows you to create your own gene expression atlas for a given species using public data. The package covers every step of the process, from retrieving sample metadata and downloading FASTQ files to read filtering, alignment-based and alignment-free quantification of transcript abundance, and transcript assembly.

Installation and setup

To install r BiocStyle::Githubpkg("almeidasilvaf/bears"), use the following code:

remotes::install_github("almeidasilvaf/bears")

Then, create a standard directory structure with create_dir_structure() to store your results. This is optional, but it will make your life much easier. The output is a list of paths to common directories that you will need to specify in several functions of this package.

library(bears)

# Create directory structure using a temporary directory as root
ds <- create_dir_structure(rootdir = tempdir())

# Look at the output
ds

To run the full pipeline implemented in r BiocStyle::Githubpkg("almeidasilvaf/bears"), you will need to have some external software tools installed on your machine. The names of the tools are listed below. In the Functions column, you can see the names of functions in r BiocStyle::Githubpkg("almeidasilvaf/bears") that require each tool.

software_df <- data.frame(
    Software = c("fastp", "kallisto", "RSeQC",
                 "salmon", "SortMeRNA", "STAR", "StringTie",
                 "Subread", "TACO"),
    Version = c(">=0.22.0", ">=0.11.9", ">=4.0.0",
                ">=1.8.0", ">=4.3.4", ">=2.7.10a", ">=2.2.1",
                ">=2.0.1", ">=0.7.3")
)

software_df$Functions <- c(
    "trim_reads()", "kallisto_index(), kallisto_quantify()",
    "infer_strandedness()", 
    "salmon_index(), salmon_quantify()", "remove_rrna()",
    "star_genome_index(), star_align()", 
    "stringtie_assemble(), stringtie_quantify()",
    "fcount()", "taco_merge()"
)
knitr::kable(software_df)

To make your life easier, we have created .yml files with Conda environments containing each of these external tools. You can create an environment for each tool and manage Conda environments from the R session with the Bioconductor package r BiocStyle::Biocpkg("Herper") (see the FAQ section for details).

Alternatively, you can see here the code we used to install all external dependencies in the Ubuntu 20.04 virtual machine provided by GitHub Actions, which is how this package is tested and how this document was created.

As a sanity check, let's see if all external dependencies are installed.

# Test installation of external dependencies
fastp_is_installed()
star_is_installed()
sortmerna_is_installed()
rseqc_is_installed()
salmon_is_installed()
kallisto_is_installed()
subread_is_installed()
stringtie_is_installed()
taco_is_installed()

Retrieving sample metadata

First of all, you need to choose which samples you want to download and create a metadata data frame for your samples. To create this data frame, you will pass a search term to the function create_sample_info(). The search term follows the same syntax as searches in NCBI's SRA database; for example, you can search by BioSample accession (as we do below), BioProject accession, or any other field supported by the SRA search syntax.

Let's create a metadata data frame for a human RNA-seq sample that is included in the r BiocStyle::Biocpkg("airway") Bioconductor package.

# Create metadata data frame
term <- "SAMN02422669[BSPL]"
metadata <- create_sample_info(term)
metadata

Downloading FASTQ files

To download the .fastq files from ENA, you will use the function download_from_ena(). As input, you only need to give the metadata data frame and the path to the output directory where .fastq files will be stored.

IMPORTANT: Note below that you must increase the timeout limit, or your downloads will be interrupted after the default limit of 60 seconds. You will probably need more than 60 seconds to download some samples, especially if your internet connection is not very good.

Here's an example of how to run download_from_ena():

# Change default timeout limit
options(timeout = 6000)

# Download sample to temporary directory
download <- download_from_ena(metadata, fastqdir = ds$fastqdir)

To keep the running time short, we will simply copy the example FASTQ files from the extdata subdirectory of this package to ds$fastqdir. These example files contain a subset of the SAMN02422669 BioSample we mentioned before.

# To keep running time short, copy example FASTQ files to ds$fastqdir
f <- list.files(
    system.file("extdata", package = "bears"), 
    pattern = ".fastq.gz", 
    full.names = TRUE
)
copy <- file.copy(f, ds$fastqdir)

After downloading FASTQ files, it's important to check the integrity of your files. The function check_md5() does that for you by checking if the md5 checksum of the downloaded file matches the md5 checksum of the original file.

check_md5(run_accessions = metadata$Run, fastqdir = ds$fastqdir)

Here, as expected, the function reported an issue: because the example files we are using contain only a subset of the SAMN02422669 BioSample, the md5 checksums differ from those of the original files. If the md5 checksums were the same, the Status variable would be TRUE.
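
If some files fail the check, you can re-download or exclude them before moving on. Below is a minimal sketch of how you could subset the metadata to runs that passed; it assumes check_md5() returns a data frame with columns Run and Status (Status is mentioned above, but verify the exact column names in your version). In this vignette, this would drop our sample, since we use truncated files.

# Keep only runs whose md5 checksums matched (column names are assumptions)
md5_table <- check_md5(run_accessions = metadata$Run, fastqdir = ds$fastqdir)
metadata_passed <- metadata[metadata$Run %in% md5_table$Run[md5_table$Status], ]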

Read QC and filtering

After downloading the .fastq files, you need to perform some filtering steps to remove low-quality bases, adapters, and rRNA. Read filtering consists of two steps:

  1. trim_reads() - trim adapters and low-quality bases (if any). This function runs fastp [@chen2018fastp] and saves filtered .fastq files in the directory specified in the filtdir parameter.

  2. remove_rrna() - remove rRNA (if any) from .fastq files. rRNA removal relies on the SortMeRNA [@kopylova2012sortmerna] program.

Trimming adapters and low quality bases

First, let's use trim_reads() to filter reads. [^1]

[^1]: Friendly tip: filtered FASTQ files are stored in the directory specified in the filtdir parameter, even if no filtering was performed (no adapters and no low-quality bases, for instance). If you want to delete the unfiltered FASTQ files to save disk space, you can set delete_raw = TRUE.

if(fastp_is_installed()) {

    # Trim reads
    fastp_status <- trim_reads(
        metadata, 
        fastqdir = ds$fastqdir, 
        filtdir = ds$filtdir,
        qcdir = ds$qcdir
    )

    fastp_status # check run status
}

The function trim_reads() stores .json files containing fastp summary statistics for each sample in the directory specified in qcdir. You can read and parse them into a data frame with the function summary_stats_fastp(). Let's demonstrate how it works.

# Path to directory containing .json file from fastp
qcdir <- system.file("extdata", package = "bears")
qc_table <- summary_stats_fastp(qcdir)

qc_table

Removing rRNA

In this vignette, we will use a small rRNA database as an example. In real-world applications, your rRNA database directory should contain all FASTA files distributed in the SortMeRNA GitHub repo (see the download sketch after the next code chunk). However, if you think some of these files (e.g., 5S and 5.8S rRNA) are not a concern in your data set, you don't need to include them in the database.

# Create a directory to store the rRNA db
rrna_db_dir <- file.path(tempdir(), "rrna")
dir.create(rrna_db_dir)

# Copy the example 16S rRNA file to the db directory.
rrna_file <- system.file("extdata", "bac_16s_subset.fa", package = "bears")
copy <- file.copy(from = rrna_file, to = rrna_db_dir)
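
In a real-world run, you could populate this same directory with the full set of FASTA files from the SortMeRNA repository instead. Here is a hedged sketch; the file names and URL layout below are assumptions based on the SortMeRNA GitHub repository, so verify them against the repo before use.

# Download SortMeRNA rRNA databases (file names and URL are assumptions;
# check https://github.com/sortmerna/sortmerna before running)
base_url <- "https://raw.githubusercontent.com/sortmerna/sortmerna/master/data/rRNA_databases"
rrna_fastas <- c(
    "silva-bac-16s-id90.fasta", "silva-bac-23s-id98.fasta",
    "silva-arc-16s-id95.fasta", "silva-euk-18s-id95.fasta",
    "rfam-5s-database-id98.fasta", "rfam-5.8s-database-id98.fasta"
)
for (f in rrna_fastas) {
    download.file(paste0(base_url, "/", f), destfile = file.path(rrna_db_dir, f))
}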

# Remove rRNA (if any)
if(sortmerna_is_installed()) {
    rrna_removal <- remove_rrna(
        metadata,
        fastqdir = ds$fastqdir,
        filtdir = ds$filtdir,
        rrna_db_dir = rrna_db_dir
    )

    rrna_removal # check run status
}

Now that we have performed all quality checks, we're good to go. [^1]

Quantification of transcript abundance

Quantification of transcript abundance can be done in two ways:

  1. Alignment-based approaches, which involve mapping reads to a reference genome using STAR [@dobin2013star] and quantifying expression based on uniquely mapped reads with featureCounts [@liao2014featurecounts] (in raw counts) and/or StringTie [@pertea2015stringtie] (in TPM).

  2. Alignment-free approaches, which involve pseudo-aligning or quasi-mapping reads to a reference transcriptome with kallisto [@bray2016near] or salmon [@patro2017salmon], respectively.

Below, we will describe how to perform both approaches.

Alignment-based quantification

To start with, we will need to map the reads to a reference genome. In r BiocStyle::Githubpkg("almeidasilvaf/bears"), reads are mapped to the reference genome with the software tool STAR.

Read mapping

Here, for the purpose of demonstration, we will map reads to a subset of the human genome. The FASTA and GTF files corresponding to the subset of the genome are available in the extdata/ subdirectory of this package.

Before mapping reads, we need to create a genome index. This can be done with star_genome_index().

# Get paths to genome subset
genome_path <- system.file(
    "extdata", "Homo_sapiens.GRCh37.75_subset.fa", package = "bears"
)
gff_path <- system.file(
    "extdata", "Homo_sapiens.GRCh37.75_subset.gtf", package = "bears"
)

# Create genome index
if(star_is_installed()) {
    genome_idx <- star_genome_index(
        genome_path = genome_path, 
        gff_path = gff_path, 
        mappingdir = ds$mappingdir
    )

    genome_idx # check run status
}

Now that we have the genome index, we can map reads to it.

# Map reads to the genome
if(star_is_installed()) {
    read_mapping <- star_align(
        metadata, 
        filtdir = ds$filtdir,
        qc_table = qc_table,
        mappingdir = ds$mappingdir,
        gff_path = gff_path
    )

    read_mapping # check run status
}

Finally, let's get read mapping statistics with the function summary_stats_star(). Here, we will use an example Log.final.out file (the per-sample log that STAR returns) stored in the extdata/ subdirectory of this package.

# Obtaining read mapping statistics
star_dir <- system.file("extdata", package = "bears")

star_stats <- summary_stats_star(star_dir = star_dir)
star_stats

Now, let's check if samples passed the minimum quality criteria. Here, samples are excluded if:

  1. >=50% of the reads fail to map; or
  2. >=40% of the reads fail to uniquely map.

The function mapping_pass() takes the metadata data frame and returns the same data frame, but only with the samples that passed the minimum criteria.

# Check if samples passed the filtering criterion
align_passed <- mapping_pass(star_stats, metadata)
align_passed # inspect data

# Compare to the original data set
nrow(metadata)
nrow(align_passed)

As you can see, the sample we used for read mapping passed the minimum quality criteria. Good, huh? We can now proceed to the next step.

Inferring library strandedness

Before quantification, we need to infer library strandedness with the RSeQC [@wang2012rseqc] tool. The function infer_strandedness() runs RSeQC and returns the metadata data frame with an additional column named Orientation containing library strandedness information.

This function requires the annotation in BED format, not GTF/GFF. To convert from GTF/GFF to BED, use the function gff2bed().

# Convert GFF to BED
bedpath <- gff2bed(gff_path)
bedpath # check path

# Infer strandedness
if(rseqc_is_installed()) {
    new_metadata <- infer_strandedness(
        mapping_passed = align_passed,
        bedpath = bedpath,
        mappingdir = ds$mappingdir
    )

    new_metadata
}

As you can see, there is a new Orientation column with strandedness info for this sample.

Using featureCounts

Now that we have the .bam files from STAR and information on library strandedness for each sample, we can quantify the expression with featureCounts. This tool quantifies gene expression measured in raw read counts per gene.

To quantify gene expression with featureCounts, use the function featureCounts(). This function runs featureCounts and returns a gene expression matrix with genes in rows and samples in columns.

# Get gene expression in raw read counts
if(subread_is_installed()) {
    fcounts_quant <- featureCounts(
        new_metadata, 
        mappingdir = ds$mappingdir,
        gff_path = gff_path,
        fcountsdir = ds$fcountsdir
    )

    # Explore the expression matrix
    head(fcounts_quant)
}

Whenever you are working with gene expression data, we recommend storing your data as SummarizedExperiment objects, so you have the expression matrix and sample metadata in a single object. If you are not familiar with SummarizedExperiment objects, take a look at the documentation of the r BiocStyle::Biocpkg("SummarizedExperiment") package.
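If you have never used them, the toy example below shows what a SummarizedExperiment bundles together: an assay matrix plus per-sample metadata (the data here are made up purely for illustration).

# Toy SummarizedExperiment: a 3-gene x 2-sample count matrix + sample metadata
library(SummarizedExperiment)

toy_counts <- matrix(
    rpois(6, lambda = 10), nrow = 3,
    dimnames = list(paste0("gene", 1:3), c("sampleA", "sampleB"))
)
toy_se <- SummarizedExperiment(
    assays = list(counts = toy_counts),
    colData = DataFrame(
        Condition = c("control", "treated"),
        row.names = c("sampleA", "sampleB")
    )
)
toy_se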

To get a SummarizedExperiment object from featureCounts, use the function featureCounts2se().

# Create a SummarizedExperiment object from featureCounts output
fcounts_se <- featureCounts2se(
    new_metadata, fc_output = fcounts_quant
)

# Take a look at the SummarizedExperiment object
fcounts_se

# Exploring sample metadata
SummarizedExperiment::colData(fcounts_se)

# Exploring gene expression matrix
SummarizedExperiment::assay(fcounts_se)

Using StringTie

StringTie quantifies transcript-level and gene-level transcript abundances in normalized values (transcripts per million, TPM).

To obtain gene expression levels in TPM with StringTie, use the function stringtie_quantify().

# Quantify expression in TPM with StringTie
if(stringtie_is_installed()) {
    stringtie_quant <- stringtie_quantify(
        new_metadata,
        qc_table = qc_table,
        mappingdir = ds$mappingdir,
        gff_path = gff_path,
        stringtiedir = ds$stringtiedir
    )

    stringtie_quant # check run status
}

Now, let's read the output from StringTie as a SummarizedExperiment object with the function stringtie2se(). You can choose whether you want the expression at the gene level, at the transcript level, or both. Here, let's get the gene-level expression. For that, you will need to give a 2-column data frame with transcript IDs and their corresponding genes.

# Load transcript-to-gene correspondence
data(tx2gene)
head(tx2gene)

# Read StringTie output as a SummarizedExperiment object
stringtiese <- stringtie2se(
    new_metadata,
    stringtiedir = ds$stringtiedir,
    level = "gene",
    tx2gene = tx2gene
)

# Exploring the SummarizedExperiment object
stringtiese

# Looking at gene expression matrix
SummarizedExperiment::assay(stringtiese, "gene_TPM")

Bonus: Transcript assembly and merging

Besides quantifying transcript abundance, StringTie can also be used to assemble transcripts for each BioSample. Assembled transcripts for each BioSample are represented as .gtf files.

However, if you want a single .gtf file with the assembled transcripts for all BioSamples you are studying, you can merge the individual .gtf files from StringTie with the software tool TACO [@niknafs2017taco]. Below, we will demonstrate how that can be achieved.

# Transcript assembly with StringTie
if(stringtie_is_installed()) {
    assembled_transcripts <- stringtie_assemble(
        new_metadata,
        qc_table = qc_table,
        mappingdir = ds$mappingdir,
        gff_path = gff_path,
        stringtiedir = ds$stringtiedir
    )

    assembled_transcripts # check run status
}

In this vignette, we have a single BioSample. However, in real-life scenarios, you would have several samples. To merge the .gtf files for each sample into a single .gtf file, use the function taco_merge().

# Merge assembled transcripts with TACO
if(taco_is_installed()) {
    merged_transcripts <- taco_merge(
        new_metadata,
        stringtiedir = ds$stringtiedir
    )

    merged_transcripts # check run status
}

The merged transcript assembly will be stored in a file named final_assembly.gtf in the subdirectory assembly/merged_assembly, inside stringtiedir. To get the path to the .gtf file, use:

# Get path to merged transcript assembly
final_assembly <- file.path(
    ds$stringtiedir, "assembly", "merged_assembly", "final_assembly.gtf"
)

final_assembly

But why would I want to assemble transcripts and merge them if I already have a .gtf file with the transcript annotations?

That's a great question! This is a way to identify novel transcripts that are not present in your reference .gtf file. Transcripts can be missing from the reference annotation (.gtf file) mainly because: i. the genome assembly is not of good quality, so these transcripts could not be predicted; or ii. the transcript annotation software tool that was used failed to predict them.

If you want a more comprehensive quantification of transcript abundance, you can assemble transcripts for each sample, merge them, and use the output file final_assembly.gtf as input to the quantification functions, as sketched below. This way, instead of using the reference transcript annotation, you will use your own transcript annotation, which may contain novel transcripts.
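
For example, here is a sketch of re-running the StringTie quantification step against the merged assembly; it simply reuses the objects created above, swapping the reference GTF for final_assembly.

# Re-quantify using the merged assembly instead of the reference annotation
if(stringtie_is_installed()) {
    requant <- stringtie_quantify(
        new_metadata,
        qc_table = qc_table,
        mappingdir = ds$mappingdir,
        gff_path = final_assembly,
        stringtiedir = ds$stringtiedir
    )
}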

Alignment-free quantification

To quantify the expression without mapping reads to the genome, you have two options: pseudo-alignment with kallisto or quasi-mapping with salmon.

For both kallisto and salmon, you will need to have a reference transcriptome, not a reference genome. This is a FASTA file containing the sequences of all annotated transcripts in your genome. You can easily create this file with the function extractTranscriptSeqs() from the r BiocStyle::Biocpkg("GenomicFeatures") package.
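
As a quick illustration, here is a hedged sketch of extracting transcript sequences with r BiocStyle::Biocpkg("GenomicFeatures"), reusing the genome_path and gff_path objects defined earlier; it assumes the chromosome names in the FASTA file match those in the annotation.

# Extract transcript sequences from a genome FASTA + GTF annotation
library(GenomicFeatures)
library(Biostrings)

txdb <- makeTxDbFromGFF(gff_path, format = "gtf")
genome <- readDNAStringSet(genome_path)
names(genome) <- sub(" .*", "", names(genome)) # keep only sequence IDs

tx_seqs <- extractTranscriptSeqs(genome, exonsBy(txdb, by = "tx", use.names = TRUE))
writeXStringSet(tx_seqs, filepath = file.path(tempdir(), "transcripts.fa"))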

Using salmon

First of all, we will need to index the reference transcriptome with the function salmon_index().

# Path to reference transcriptome
transcriptome_path <- system.file(
    "extdata", "Homo_sapiens.GRCh37.75_subset_transcripts.fa.gz",
    package = "bears"
)

# Index the transcriptome
if(salmon_is_installed()) {
    idx_salmon <- salmon_index(
        salmonindex = ds$salmonindex,
        transcriptome_path = transcriptome_path
    )

    idx_salmon # check run status
}

Now, we can quantify transcript abundance with salmon_quantify().

# Quantify transcript abundance
if(salmon_is_installed()) {
    quant_salmon <- salmon_quantify(
        new_metadata,
        filtdir = ds$filtdir,
        salmonindex = ds$salmonindex,
        salmondir = ds$salmondir
    )

    quant_salmon # check run status
}

After running salmon_quantify(), salmon output in .sf format will be stored in the directory specified in salmondir.
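
For instance, to locate the per-sample quantification files (salmon writes one quant.sf file per sample; the exact directory layout inside salmondir is an assumption here):

# List per-sample quant.sf files produced by salmon
list.files(ds$salmondir, pattern = "quant.sf", recursive = TRUE, full.names = TRUE)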

To read salmon output as a SummarizedExperiment object, use the function salmon2se(). You can choose if you want the expression at the gene level, at the transcript level, or both. Here, let's get the gene-level expression. For that, you will need to give a 2-column data frame with transcript IDs and their corresponding genes.

# Load transcript-to-gene data frame
data(tx2gene)
head(tx2gene)

# Read salmon output as a SummarizedExperiment object
salmon_se <- salmon2se(
    new_metadata, 
    level = "gene", 
    salmondir = ds$salmondir,
    tx2gene
)

# Exploring the output
salmon_se

# Get gene expression matrix in TPM
SummarizedExperiment::assay(salmon_se, "gene_TPM")

# Get gene expression matrix as raw counts
SummarizedExperiment::assay(salmon_se, "gene_counts")

Using kallisto

As with salmon, we will start by indexing the transcriptome.

# Index the transcriptome
if(kallisto_is_installed()) {
    idx_kallisto <- kallisto_index(
        kallistoindex = ds$kallistoindex,
        transcriptome_path = transcriptome_path
    )

    idx_kallisto # check run status
}

Now, we can quantify the transcript abundance.

# Quantify transcript abundance
if(kallisto_is_installed()) {
    quant_kallisto <- kallisto_quantify(
        new_metadata,
        qc_table, 
        filtdir = ds$filtdir,
        kallistoindex = ds$kallistoindex,
        kallistodir = ds$kallistodir
    )

    quant_kallisto # check run status
}

To read kallisto output as a SummarizedExperiment object, use the function kallisto2se(). Again, you will have to specify whether you want the expression at the gene level, transcript level, or both. Let's get the gene-level expression here.

# Read kallisto output to SummarizedExperiment object
kallisto_se <- kallisto2se(
    new_metadata, 
    level = "gene", 
    kallistodir = ds$kallistodir,
    tx2gene
)

# Exploring the output
kallisto_se

# Get gene expression matrix in TPM
SummarizedExperiment::assay(kallisto_se, "gene_TPM")

# Get gene expression matrix as raw counts
SummarizedExperiment::assay(kallisto_se, "gene_counts")

Closing remarks

If you are using r BiocStyle::Githubpkg("almeidasilvaf/bears"), there are two things you should keep in mind. First, this package was designed to be as complete as possible, but that does not mean you need to run the complete pipeline for your own project. For instance, if you just want gene expression values in TPM for a particular BioProject or set of BioProjects, you can simply go through the salmon path of the pipeline, skipping the read mapping section. Likewise, if you are using r BiocStyle::Githubpkg("almeidasilvaf/bears") for your own data set and you have already cleaned the reads, you can skip the sequence quality checks and read filtering sections. The second thing to consider is that r BiocStyle::Githubpkg("almeidasilvaf/bears") is a work in progress. Bioinformatics is a fast-evolving field, and new (and better) methods to address a particular question are developed continuously. Hence, we aim to keep r BiocStyle::Githubpkg("almeidasilvaf/bears") up to date with state-of-the-art methods.

FAQ {.unnumbered}

How do I manage Conda environments from the R session?

You only need to do the following 2 things:

  1. Install miniconda. After installing miniconda in your machine, create an R object containing the path to your miniconda installation. For example:
miniconda_path <- "~/tools/miniconda"
  2. Create a different environment for each external tool. You can use the r BiocStyle::Biocpkg("Herper") package to create an environment for each tool from .yml files stored in the extdata/ subdirectory of this package. To avoid conflicts, it is important to keep each tool in its own environment. Here, each tool will be installed in an environment named <tool-name-lowercase>_env (e.g., star_env, rseqc_env, salmon_env).
library(Herper)

# Path to .yml files to create environments
envs <- list.files(
    system.file("extdata", package = "bears"), pattern = ".yml",
    full.names = TRUE
)

# Install miniconda (if needed) in `miniconda_path` and create envs
create_envs <- sapply(envs, function(x) {
    import_CondaEnv(x, pathToMiniConda = miniconda_path)
})

Now, you can run functions that call external tools inside a call to Herper::with_CondaEnv(). This function allows you to run an R function inside a particular Conda environment, which you need to specify. For example, to run the function salmon_quantify() inside the environment salmon_env, you would do:

with_CondaEnv(
    "salmon_env", 
    quant_salmon <- salmon_quantify(
        new_metadata,
        filtdir = ds$filtdir,
        salmonindex = ds$salmonindex,
        salmondir = ds$salmondir
    ),
    pathToMiniConda = miniconda_path
)

Can I use this package with my own RNA-seq data (not from a database)?

You surely can. For that, you will first need to create a metadata data frame for your samples that looks like the data frame created by create_sample_info() (see the example data set sample_info). While you can add as many columns as you want, 5 columns MUST be present (a sketch of such a data frame is shown after the list below):

  1. BioSample: BioSample ID. These can be fictional sample names, e.g., "Sample1", "Sample2", "Sample3", etc.

  2. Run: Run accession. These must be the basenames of your files. Make sure you don't split paired-end reads into separate run accessions, i.e., files example_1.fastq.gz and example_2.fastq.gz should both be represented by the same run accession. Examples of file names and the run accessions they should have:

| Layout | File name | Run accession |
|:-------|:----------|:--------------|
| PAIRED | SampleA_1.fastq.gz and SampleA_2.fastq.gz | SampleA |
| SINGLE | Sample1_control_replicate1.fastq.gz | Sample1_control_replicate1 |

  3. BioProject: Project ID. If all files come from a single project, you can give it whatever name you want, such as MyNicePhDProject.

  4. Instrument: Sequencing instrument (e.g., Illumina HiSeq 2500). Note that long-read technologies (e.g., PacBio) are not supported and, hence, will be skipped.

  5. Layout: Sequencing protocol. Either "PAIRED" or "SINGLE".
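
For instance, a hypothetical metadata data frame for two in-house samples could look like this (all values are illustrative):

# Hypothetical metadata data frame for in-house FASTQ files
my_metadata <- data.frame(
    BioSample = c("Sample1", "Sample2"),
    Run = c("Sample1_control_replicate1", "SampleA"),
    BioProject = "MyNicePhDProject",
    Instrument = "Illumina HiSeq 2500",
    Layout = c("SINGLE", "PAIRED")
)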

Finally, if you want to create a standard directory structure with create_dir_structure(), you will have to either move your files to the directory indicated in ds$fastqdir or change ds$fastqdir manually to include the path to the directory where your FASTQ files are. For example:

ds$fastqdir <- "/home/myusername/my_nice_project/fastq_files"

Session information {.unnumbered}

This document was created under the following conditions:

sessioninfo::session_info()

References {.unnumbered}


