knitr::opts_chunk$set(
  collapse = TRUE,
  comment = "#>",  
  fig.width = 6, 
  fig.height = 4
)

Introduction

The following is a quick but in-depth tutorial showing all of txtools 'core' functions.

We will show how to:

  1. Load read-alignments data, and the genomic and transcriptomic reference files into R.
  2. Assign alignments to genes.
  3. Summarize mapped-reads into different kinds of tables.
  4. Use tx_add_* functions to add useful data and calculated metrics to the summarized tables.
  5. Plot regions and analyses results using the tx_plot_* functions family.
  6. Use tx_get_*() functions to extract non-standard information structures from txtools tables.
  7. Use tx_test_*() functions to apply statiscal testing to detect transcriptomics positions that have unexpected values at either continuous or count data variables.

The aim of txtools is to provide a versatile framework to analyze transcriptomic data in a simple manner. To accomplish this we propose the processing of data in two stages:

  1. Transfer read-alignments from a genomic to a transcriptomic coordinates space.
  2. Summarize transcriptomic-alignments into table structures.

Fig 1. txtools main processing workflow

The intermediate output in Step 1 is of class GRanges from the r BiocStyle::Biocpkg("GenomicRanges") package, while that of Step 2, and final output of the main txtools processing pipeline, is of class data.table from the r CRANpkg("data.table") package, which is an extension of the data.frame class.

Current txtools development has favored functionality on the more common table structure, as it offers extendability and versatility but at the price of big memory allocation. This should be considered when developing analysis pipelines using txtools. Users will benefit benefit of not using the whole transcriptome at early development stages, unless necessary, and instead use a subset of the gene annotation while developing the analysis pipeline and only at late stages run over the complete transcriptome.

For this example we will use the data of the r BiocStyle::Biocpkg("pasillaBamSubset") package. A subset of reads aligned to the chromosome 4 of D. melanogaster.

library(txtools)
library(pasillaBamSubset)

Loading data tx_load_*()

To work with the reads in the transcriptomic space, txtools requires a Gene Annotation file in the form of a BED-12 or BED-6 file, loaded with tx_load_bed().

Importantly, a BED-12 file contains the information to construct the exon structure, when using a BED-6 file, txtools will assume that the genes in the annotation consist of a single exon block.

BED_path <- tx_dm3_geneAnnot() # geneAnnot built-in txtools
geneAnnot <- tx_load_bed(BED_path)

Additionally, it is recommended to have a Reference Genome sequence file in the form of a FASTA file, in which each entry corresponds to a chromosome, loaded with tx_load_genome().

FASTA_path <- pasillaBamSubset::dm3_chr4() # D. melanogaster chr4 built-in pasillaBamSubset
genome <- tx_load_genome(fastaFile = FASTA_path)

The main data that we will work with and process using txtools are read mappings, or read alignments, in the form of BAM files. These files can be loaded into R using tx_load_bam(). In this case we will set the pairedEnd argument to TRUE, to specify that the library is paired-end; as we will use the sequence information from the alignments we need to set the loadSeq argument also to TRUE.

BAM_path <- pasillaBamSubset::untreated3_chr4() # Paired-end BAM file built-in pasillaBamSubset
genAligns <- tx_load_bam(file = BAM_path, pairedEnd = TRUE, loadSeq = TRUE, loadSecondaryAligns = FALSE)

These three functions make loading the necessary input data a painless process.

Assigning alignments to genes tx_reads()

Reads mapped to a reference genome have to be assigned and processed to their genes. This is done using tx_reads().

tx_reads() accepts bam files loaded with tx_load_bam() recognizing if they are paired-end or single-end libraries. Additionally it is required to provide a gene-annotation as loaded with tx_load_bed().

By default tx_reads() will only process genes with 50 or more reads, will not output the processed sequence for each read, and will use only one core; all these options can be changed in its arguments.

txReads <- tx_reads(genAligns, geneAnnot = geneAnnot, minReads = 1, withSeq = TRUE)

To retrieve the genomic reads that where not assigned to any gene, due to being outside of the gene annotation, you can use tx_getUnassignedAlignments(). Background on the GenomicRanges package is necessary for further examination and manipulation of not assigned reads

tx_getUnassignedAlignments()

In this toy example most reads are not mapped as we only used a gene annotation with ten genes for assigning reads.

The resulting object txReads comprises a list of GRanges, genomic ranges by name, but which consist of the ranges of the mapped reads in the transcriptomic space, not in genomic space.

Summarizing reads into a txDT tx_makeDT_*()

After consolidating the paired-end reads, assigning them to their respective genes and converting their ranges into transcriptomic space, we can summarize their data into table format. These tables contain data such as the coverage, and nucleotide frequency counts; each of the rows in the table corresponds to a nucleotide position unique in the transcriptome, but could have duplicated genomic coordinate, as is the case for overlapping genes.

By default R base uses the class object data.frame to accomodate table data structures which contain different classes of data per column. For the case of txtools we have opted to use the class object data.table from the eponymous package, as the data.table object operates very similar to base-R's data.frames yet are optimized for speedier calculations and assignment.

There are three functions to generate summarized txDTs, each containing summarized data per nucleotide position in the transcriptome:

DT_cov <- tx_makeDT_coverage(txReads, geneAnnot = geneAnnot)
DT_cov
DT_nucFreq <- tx_makeDT_nucFreq(txReads, geneAnnot = geneAnnot)
DT_nucFreq
DT_covNucFreq <- tx_makeDT_covNucFreq(txReads, geneAnnot = geneAnnot) 

The three of these tables contain 5 columns that identify each transcriptomic position and link it to their respective genomic position as well:

txcoor includes UTR regions if present, i.e. the first nucleotide of the 5'UTR is considered to be position "1".

Additionally, in each of the functions when the genome argument is input with its corresponding reference genome sequence, the nucleotide identity per position is added in the column refSeq.

DT_covNucFreq <- tx_makeDT_covNucFreq(txReads, geneAnnot = geneAnnot, genome = genome) 
DT_covNucFreq

Adding useful data and metrics to txDTs tx_add_*()

After generating a txDT we already have all the information we need to do most of our count data analysis. Yet, sometimes we need this data as other metrics or ratios, consider for example the ratio of read-starts compared to coverage, which we call the read-start ratio or more simply startRatio.

We could manually calculate the startRatio by dividing the start_5p column over the cov column of our txDT. Yet other metrics in which the limits of the genes are important would be off calculated, e.g. startRatio 1bp upstream. To easily calculate different metrics and add additional information to a txDT we provide a set of functions that start with the tx_add_ prefix.

Adding metrics related to read-starts, read-ends, coverage, and nucleotide frequency.

Additional to the counts of coverage (cov), read-starts (start_5p), read-ends (end_3p), and nucleotide frequencies (A, T, C, G) for each nucleotide in the transcriptome, it can be useful to add certain related metrics.

For example the rate at which reads start compared to coverage, which we call startRatio. To create such common metrics we provide several functions under the tx_add_*() functions family. Bellow a list of them and a brief description of what they add.

The result of using a tx_add_() function is the same txDT input with the difference of a new column of the desired metric, hence it is possible to chain tx_add functions in succession with the pipe operator %>% from the r CRANpkg("magrittr") package.

library("magrittr")
DT_covNucFreq <- DT_covNucFreq %>% 
    tx_add_startRatio() %>% 
    tx_add_startRatio1bpDS() %>% 
    tx_add_startRatio1bpUS() %>% 
    tx_add_endRatio() %>% 
    tx_add_endRatio1bpDS() %>% 
    tx_add_endRatio1bpUS() %>% 
    tx_add_nucTotal() %>% 
    tx_add_misincCount() %>% 
    tx_add_misincRate() %>% 
    tx_add_misincRateNucSpec(refNuc = "C", misNuc = "T") %>% 
    tx_add_geneRegion(geneAnnot = geneAnnot)

Adding unique position identifier

If needed, by combining the gene and relative coordinate we can create a single unique identifier for each position in the transcriptome. We can add this with tx_add_pos().

DT_covNucFreq <- tx_add_pos(DT_covNucFreq)

IMPORTANT NOTE: Due to being unique identifiers the pos column is of character class and it requires a big amount of memory to be allocated. If not needed is recommended to omit the pos column or delete it after no longer is needed.

Adding reference sequence

If the genome sequence was not specified in the tx_makeDT_*() function call, adding the transcriptomic reference sequence after generating a txDT is possible with tx_add_refSeqDT().

Importantly, the sequence in refSeq has already been processed to represent the transcriptomic sequence considering the strand of the gene.

Note: Importantly, the nucleotide uracil is represented with the letter "T" instead of with the letter "U". This might change in future versions if poses a conflict or confusion to the users.

DT_covNucFreq <- tx_add_refSeqDT(DT_covNucFreq, genome = genome, geneAnnot = geneAnnot)

Adding sites annotation

Sometimes we want to identify previously annotated positions of a transcriptome in our data. For this, we can use tx_add_siteAnnotation() to add a logical vector that indicates the sites of interest to perform further analysis and plotting.

To add a sites annotation we only need to input a GRanges object that consists of ranges of length 1, which can also be loaded with tx_load_bed().

Here we use the in-built site annotation object annotSites_RRACH which contains 636 sites that fall in the arbitrarily selected RRACH motif.

DT_covNucFreq <- tx_add_siteAnnotation(DT_covNucFreq, GRanges = annotSites_RRACH, colName = "annotSites")

We will visualize the effect of annotating this RRACH sites with a seqlogo plot in section \@ref(ggseqlogo-plot)

Adding sequence motif presence

To mark all the positions at which specific sequences appear on transcripts in the txDT we can use tx_add_motifPresence().

Importantly, this function will take a character string that can consist of the characters in the IUPAC nucleotide ambiguity code

DT_covNucFreq <- tx_add_motifPresence(DT_covNucFreq, motif = "DRACH", nucPositions = "center")

Adding rolling mean

Sometime we need a smoother version of a metric, which can be acomplished by computing a roll mean across a moving window. For example we can smooth the coverage by applying a rolling mean across the cov column. The winSize argument determines the length of the window over which the mean is calculatates and the align.

DT_covNucFreq <- tx_add_rollingMean(DT_covNucFreq, colName = "cov", winSize = 50, align = "left")

Plotting functions tx_plot_*()

txtools provides a diverse set of functions to plot. For ease of use all of them start with the prefix tx_plot_

Start End Coverage plot

tx_plot_staEndCov(): This plot is useful to visualize the coverage, read-start, and read-end patterns. The required input is a txDT, the gene, and the transcriptomic range to be plot.

It can be used to plot the whole gene as bellow:

tx_plot_staEndCov(DT_covNucFreq, gene = "NM_001272152", show_yLabels = FALSE, bar_border = FALSE)

Or part of the gene specifying the relative coordinates range in the txRange argument.

tx_plot_staEndCov(DT_covNucFreq, gene = "NM_001272152", txRange = 330:360)

Nucleotide frequency plot

tx_plot_nucFreq(): This plot shows the nucleotide frequency and importantly, shows in color the nucleotides that are not matching the reference sequence. It is useful to visualize misincorporation rate, which can be originated due to a SNP, chemical treatment on RNA or (less informative) sequencing errors.

Importantly, this plot show the coverage due to inserts, which has not been assigned a nucleotide identity, as is the unsequenced gap between read1 and read2 in paired-end libraries

selSite <- DT_covNucFreq[which(misincRate > 0.5 & nucTotal > 40),
                         c("gene", "txcoor", "cov", "nucTotal", "misincCount", "misincRate")]
selSite
tx_plot_nucFreq(DT_covNucFreq, gene = selSite$gene, txRange = window_around(selSite$txcoor, 20), removeInsert = T)

ggseqlogo plot

tx_plot_ggseqlogo(): This function allows to quickly plot a r BiocStyle::CRANpkg("ggseqlogo") using as input a txDT and a logical annotation which marks the center position to be plot. It is useful to identify motif sequences in sites that have been selected, for example for passing certain thresholds of specific metrics.

In this example we use the sites we annotated while using tx_add_motifPresence() for annotating the DRACH motifs in our small transcriptome set.

tx_plot_ggseqlogo(DT_covNucFreq, logi_col = "DRACH_motif_center", upFlank = 5, doFlank = 5)

Metagene plot at CDS start/end

tx_plot_metageneAtCDS(): Calculates and plots metagenes using as input a txDT and the column names of the metrics to be plot.

tx_plot_metageneAtCDS(txDT = DT_covNucFreq, geneAnnot = geneAnnot,
                      colVars = c("A", "G", "C", "T"), CDS_align = "spliceSite", 
                      upFlank = 200, doFlank = 200)
tx_plot_metageneAtCDS(txDT = DT_covNucFreq, geneAnnot = geneAnnot,
                      colVars = c("cov", "nucTotal"), CDS_align = "spliceSite", 
                      upFlank = 100, doFlank = 100)
tx_plot_metageneExons(txDT = DT_covNucFreq, colVars = c("A", "T", "G", "C"), nBins = 30, geneAnnot = geneAnnot)
tx_plot_metageneRegions(txDT = DT_covNucFreq, colVars = c("cov", "nucTotal"),
                        geneAnnot = geneAnnot, nBins_5UTR = 40, nBins_CDS = 80,
                        nBins_3UTR = 80, summ_fun = "mean")

Get functions tx_get_*()

Get flanks from annotation

tx_get_flanksFromLogicAnnot(): Extracts the surrounding values of annotates sites from any column. The required input is the txDT containing the column from were the values will be extracted alogn a logical vector marking the annotated sites (TRUE = site of interest, FALSE = not annotated site). The output is a matrix in which rows correspond to annoted sites, and columns to the values at each of the positions up-stream, center, and down-stream from the site of interest.

Getting this values may be useful to confirm results derived from previous analysis, or to generate an aggregated analysis of the sites of interest. Bellow is an example in which we extract the coverage of the sites annotated in the DRACH_motif_center column.

tx_get_flanksFromLogicAnnot(DT_covNucFreq, logi_col = "DRACH_motif_center", 
                            values_col = "cov", upFlank = 4, doFlank = 4) %>% head(5) 

Getting flanking sequences

tx_get_flankSequence(): Extracts the contiguous sequence of sites of interest. Works similarly to tx_get_flanksFromLogicAnnot() but it is a specific case for getting only the sequences in a collapsed format. Its output is a character vector in which each element is the sequence at the site of interest and its upstream and downstream flanking regions.

Bellow is an example showing the flanking sequence of the sites annotated in the DRACH_morif_center column.

tx_get_flankSequence(DT_covNucFreq, logi_col = "DRACH_motif_center", 
                     upFlank = 4, doFlank = 4) %>% head(5) 

Get gene lengths

tx_get_geneLengths: Gets the length of genes contained in a txDT, as measured by the number of positions (rows) the gene has in the table.

tx_get_geneLengths(DT_covNucFreq)

Get metagene values

tx_get_metageneAtCDS(): Gets the metagene profile of desired column variables, centered at either the start or end of the CDS.

Bellow is an example in which we retrieve the values of coverage at the start of the CDS as well as upstream and downstream flanks.

tx_get_metageneAtCDS(txDT = DT_covNucFreq, geneAnnot = geneAnnot,
                      colVars = c("cov", "nucTotal"), CDS_align = "start", 
                      upFlank = 3, doFlank = 5)

Get transcriptome sequences

tx_get_transcriptSeqs(): Retrieves the spliced sequence of genes. Using as input the reference genome and the gene annotation loaded with tx_load_bed().

The output of this function can be either a DNAStringSet from the r BiocStyle::Biocpkg("Biostrings") package, or (when providing a file name) a FASTA file.

Bellow an example showing the sequences of the genes in the D. melanogaster gene annotation, and the DNAStringSet as output.

tx_get_transcriptSeqs(genome = genome, geneAnnot = geneAnnot)

Statistical tests tx_test_*()

txtools provides two functions to perform statistical tests to detect sites that show unexpected values at each site in a treatment group compared to a control group.

Both of these functions require as main input a list of txDTs (txDTL), each txDT representing a sample on the experimental design.

For examples on the usage of the tx_test functions, please consult the case studies vignette available at: [PENDING URL]

Session info {.unnumbered}

sessionInfo()


AngelCampos/tx_tools documentation built on Sept. 17, 2024, 1:27 a.m.