Ribo-seq is a specific form of RNA-seq expression assay in which the sequenced fragments are footprints of actively translating ribosomes. Ribo-seq experiments of necessity have much shorter read-lenegths than RNA-seq experiments, which can complicate quantification. In principle, ribo-seq experiments provide nucleotide resolution information about the location of ribosomes, and can thus be used to elucidate the dynamics of ribosomal elongation and initiation. In practice, determining where the ribosome's A/P-site is in relation to the footprint is complicated by random variations in footprint size and location, which "blur" the positions of ribosomes. Ribostan is a collection of tools for the analysis of ribo-seq data which includes isoform-aware quantification, P/A-site alignment via multiple methods, and uORF identification via multitaper periodicity test (similar to ORFquant and RiboTaper).
First, let's prepare some working data: a GTF annotation for human chromosome 22, its fasta sequence, and ribo-seq reads mapped in transcript-space to chromosome 22.
library(Ribostan)
# Prepare GTF annotation anno_file <- here::here('test.gc32.gtf') if(!file.exists(anno_file)){ library(AnnotationHub) ah <- AnnotationHub() gencode32 <- ah[['AH75191']] seqlevels(gencode32) <- 'chr22' rtracklayer::export(gencode32, anno_file) } # Prepare fasta sequence fafile <- here::here('chr22.fa') if(!file.exists(fafile)){ library(BSgenome.Hsapiens.UCSC.hg38) seq <- Biostrings::DNAStringSet(BSgenome.Hsapiens.UCSC.hg38[['chr22']]) names(seq) <- 'chr22' Biostrings::writeXStringSet(seq, fafile) } # Prepare ribo-seq alignments testbam <- system.file('extdata', 'nchr22.bam', package='Ribostan', mustWork=TRUE)
Ribostan includes various functionalities for a) filtering ORFs in an existing
annotation and b) finding potential uORFs. By default, load_annotation()
will
keep only those ORFs which are multiples of 3 bp long, and which begin with a
start codon and end wth a stop codon. Ribostan will also search for uORFs for
those ORFs, by default allowing uORFs that are as short as 2 bp. The function
load_annotation()
carries out these filterng steps and creates an object
with an attached fasta file for use by other Ribostan functions. The function
get_readgr()
loads ribosomal footprint alignments (including multimappers).
# Load our annotation, filter out suspect ORFs, and find potential uORFs chr22_anno <- load_annotation(anno_file, fafile, add_uorfs=TRUE) rpfs <- get_readgr(testbam, chr22_anno)
An RPF alignment is ambiguous with respect to the position of the underlying
ribosome because a) it maybe be multimapping and its actual origin is thus
uncertain, and b) stochastic processes underlying footprint size and location
mean that the precise location of the P-site must be determined. Various methods
exist to do this. Ribostan makes use of the method described by Ahmed et al
2019, in which for each phase and read length, an offset is chosen that
maximizes the number of reads within the CDS. The function get_offsets()
creates a data frame describing the optimal offsets using this process. And the
function get_psite_gr()
applies these offsets to the alignments, annotating
their ORF of origin (which is chosen randomly in the rare case where a
footprint's P-site plausibly overlaps more than one ORF since more than one
phase/offset is possible).
# Determine offsets by maximum CDS occupancy offsets_df <- get_offsets(rpfs, chr22_anno) # Use our offsets to determine P-site locations psites <- get_psite_gr(rpfs, offsets_df, chr22_anno)
Most methods of determinig P-site offsets, including the one above, are
vulnerable to error when unusual patterns of footprints exist at the start/end
of the CDS. An orthogonal means of determining A/P-site offsets is to plot
KL-divergence in "metacodon" profiles. KL-divergence measures the degree to
which the underlying codon predicts density at a given location relative to it,
and will typically have two large peaks due to cut-site bias at 0 and
-read_length, along with a peak between these corresponding to the influence of
codon-specific dwell time at the A- and P-site (see O'Connor et al 2016).
The function get_metacodon_profs()
derives average, normalized profiles around
each codon, while get_kl_df()
derives the KL-divergence per
read_length/sample/location. Plotting these provides an orthogonal means of
verifying P-site offsets.
# Determine P-site offsets per read length covgrs <- list(sample1 = rpfs) metacodondf <- get_metacodon_profs(covgrs, chr22_anno) kl_df <- get_kl_df(metacodondf, chr22_anno) kl_offsets <- select_offsets(kl_df) kl_offsets %>% dplyr::select(p_offset, length=nreadlen) %>% readr::write_tsv('offsets_rustvar.tsv') # Plot KL-divergence kl_div_plot <- plot_kl_dv(kl_df, kl_offsets) # Generate data frame with A/P/E-site occupancies allcodondt <- export_codon_dts(metacodondf, kl_offsets)
ORF periodicity is a good diagnostic tool for identifying bona fide translation
(non periodic noise such as RPFs with ribosome-like footprints can generate
ribo-seq signal in untranslated regions). The function
periodicity_filter_uORFs()
filters the uORFs found by load_annotation()
by
searching for periodicity in P-sites. The function get_ritpms()
furthermore
carries out optimization of ribosome densities, and similarly to software like
Salmon or RSEM, performs isoform-aware quantification. Multimapping P-sites can
then be sampled according to these TPMs, to give an estimate of true ribosome
locations.
# Use our P-sites to filter the annotation chr22_anno <- periodicity_filter_uORFs( psites, chr22_anno, remove=TRUE) # Use Stan to estimate normalized P-site densities (ritpms), ritpms <- get_ritpms(psites, chr22_anno) # and at the gene level (ignoring uORFs) gritpms <- gene_level_expr(ritpms, chr22_anno) # Remove multimapped alignments from the P-sites psites <- psites[psites$orf %in% names(chr22_anno$trspacecds)] psites <- sample_cov_gr(psites, chr22_anno, ritpms)
knitr::kable(tibble::enframe(head(ritpms))) knitr::kable(head(gritpms))
With P-site locations determined, the nucleotide resolution information on ribosome positioning can be used to estimate which codons show high or low occupancy, which given relative independence between codons will be linearly proportional to dwell time in a given sample. These codon level occupancies can be averaged for all codons in an ORF to predict which ORFs are fast or slow.
# Get codon-level occupancies using RUST (at least 1000 should be used # for n_genes on a real run) rust_codon_occ_df <- get_codon_occs(psites, offsets_df, chr22_anno, n_genes = 20, method = 'RUST_glm') # Get predicted mean elongation rates for ORFs, orf_elong <- get_orf_elong(chr22_anno, rust_codon_occ_df) # and at the gene level (ingoring uORFs) gn_elong = gene_level_elong(orf_elong, ritpms, chr22_anno)
knitr::kable(head(gn_elong))
sessionInfo()
Add the following code to your website.
For more information on customizing the embed code, read Embedding Snippets.