View source: R/riboseq_analysis.R
RiboseQC_analysis | R Documentation |
This function loads the annotation created by the prepare_annotation_files function and analyzes one or more BAM files.
RiboseQC_analysis(annotation_file, bam_files, read_subset = T,
readlength_choice_method = "max_coverage", chunk_size = 5000000L,
write_tmp_files = T, dest_names = NA, rescue_all_rls = FALSE,
fast_mode = T, create_report = T, sample_names = NA,
report_file = NA, extended_report = F, pdf_plots = T)
annotation_file |
Full path to the annotation file (*Rannot), or a vector with paths to one annotation file per BAM file. |
bam_files |
Character vector containing the full paths to the BAM files. |
read_subset |
Select read lengths up to 99 percent of the reads. Defaults to TRUE. |
readlength_choice_method |
Method used to subset relevant read lengths (see choose_readlengths). Defaults to "max_coverage". |
chunk_size |
The number of alignments to read at each iteration. Defaults to 5000000; increase when more RAM is available. Must be between 10000 and 100000000. |
write_tmp_files |
Should all the results be written to file (in *results_RiboseQC_all)? Defaults to TRUE. |
dest_names |
Character vector containing the prefixes to use for the result output files. Defaults to same as bam_files. |
rescue_all_rls |
Set a cutoff of 12 for read lengths ignored because of insufficient coverage. Defaults to FALSE. |
fast_mode |
Use only the top 500 genes to build profiles? Defaults to TRUE. |
create_report |
Create an html report showing the RiboseQC analysis results. Defaults to TRUE. |
sample_names |
character vector containing the names for each sample analyzed (for the html report). Defaults to "sample1", "sample2" ... |
report_file |
Desired filename for the html report file. Defaults to the first entry of |
extended_report |
Creates a large html report including codon occupancy for each read length. Defaults to FALSE. |
pdf_plots |
Creates a pdf file for each produced plot. Defaults to TRUE. |
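As an illustration of the arguments described above, a call could be set up along the following lines. This is a minimal sketch, not a reference workflow: the file paths and sample names are hypothetical placeholders, the chosen chunk_size is arbitrary, and the analysis only runs if the RiboseQC package and the input files are actually present.

```r
# Hypothetical input paths; replace with your own files.
annotation_file <- "path/to/annotation_Rannot"
bam_files <- c("path/to/sample1.bam", "path/to/sample2.bam")

# Arguments named exactly as in the call signature shown above.
args <- list(
  annotation_file = annotation_file,
  bam_files = bam_files,
  chunk_size = 2000000L,   # must lie between 10000 and 100000000
  fast_mode = TRUE,
  create_report = TRUE,
  sample_names = c("sample1", "sample2"),
  extended_report = FALSE
)

# Basic sanity checks before starting a long run.
stopifnot(
  args$chunk_size >= 10000, args$chunk_size <= 100000000,
  length(args$sample_names) == length(args$bam_files)
)

# Run only when the package and the input files are actually available.
inputs_present <- all(file.exists(c(annotation_file, bam_files)))
if (requireNamespace("RiboseQC", quietly = TRUE) && inputs_present) {
  do.call(RiboseQC::RiboseQC_analysis, args)
} else {
  message("RiboseQC or the input files were not found; skipping the analysis.")
}
```

Collecting the arguments in a list first makes it easy to check them before a long-running job starts; calling the function directly with the same named arguments is equivalent.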
This function loads different genomic regions created in the prepare_annotation_files step, separating features on different recognized organelles. The BAM files are then analyzed in chunks to minimize RAM usage.
The complete list of analyses and outputs is as follows:
read_stats
: contains rld, the read length distribution per organelle; positions, which contains mapping statistics on different genomic regions; and reads_pos1, which contains 5' end mapping positions for each read, separated by read length.
counts_cds_genes
: contains read mapping statistics on CDS regions of protein coding genes, including gene symbols, counts, RPKM and TPM values
counts_all_genes
: is a similar object, but contains statistics on all annotated genes.
reads_summary
: reports mapping statistics on different genomic regions and divided by read length and organelle.
profiles_fivepr
contains:
five_prime_bins
: a DataFrame object (one for each read length and compartment) with signal values over 50 5'UTR bins, 100 CDS bins and 50 3'UTR bins; one representative transcript (reprentative_mostcommon) is selected for each gene. five_prime_subcodon contains a similar structure, but for the 25nt downstream of the Transcription Start Site (TSS), 25nt upstream of the start codon, 33nt downstream of the start codon, 33nt in the middle of the ORF, 33nt upstream of the stop codon, 25nt downstream of the stop codon, and 25nt upstream of the Transcription End Site (TES).
selection_cutoffs
contains:
results_choice
: contains the calculated cutoffs and selected read lengths, together with data about the different methods. results_cutoffs has statistics about the calculated cutoffs, while analysis_frame_cutoff has extensive statistics concerning cutoff calculations and read length selection; see calc_cutoffs_from_profiles for more details.
P_sites_stats
: contains the list of calculated P_sites, from all reads (P_sites_all), uniquely mapping reads (P_sites_all_uniq), or uniquely mapping reads with mismatches (P_sites_uniq_mm). junctions contains statistics on read mapping on annotated splice junctions. Coverage for entire reads (not transformed to 5' ends or P_sites), on different strands and for all and uniquely mapping reads, is also calculated.
profiles_P_sites
contains:
P_sites_bins
: profiles for each organelle and read length around binned transcript locations.
P_sites_subcodon
: profiles for each organelle and read length around transcript start/ends and ORF start/ends.
Codon_counts
: codon occurrences in the first 11 codons, middle 11 codons, and last 11 codons for each ORF.
P_sites_percodon
: P_sites counts on each codon, separated by ORF positions as described above. Values are separated by organelle and read length.
P_sites_percodon_ratio
: ratio of P_sites_percodon/Codon_counts, as a measure of P_site occupancy on each codon, divided again by organelle and read length, for different ORF positions.
sequence_analysis
: contains a DataFrame object with the top 50 mapping locations in the genome, with the corresponding DNA sequence, the number of reads mapping (also as a percentage of the total number of reads), and genomic feature annotation.
summary_P_sites
: contains a DataFrame object summarizing the P_sites calculation and read length selection, including statistics on percentage of total reads used.
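The P_sites_percodon_ratio entry in the list above is described as the ratio P_sites_percodon/Codon_counts. The toy sketch below illustrates that arithmetic on made-up numbers for three codons. The vector names and values are invented for illustration, and the handling of codons with zero occurrences (set to NA) is an assumption of this sketch, not necessarily what the package does.

```r
# Toy values for three codons in one ORF window (not real data).
codon_counts     <- c(AAA = 10, GCT = 4, TGG = 0)
p_sites_percodon <- c(AAA = 50, GCT = 8, TGG = 0)

# Element-wise ratio as a simple occupancy measure. Codons that never
# occur are set to NA rather than NaN (an assumption of this sketch).
ratio <- ifelse(codon_counts > 0, p_sites_percodon / codon_counts, NA_real_)
names(ratio) <- names(codon_counts)

print(ratio)
```

In the actual output these values are additionally separated by organelle, read length and ORF position, as described above.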
The function saves a "results_RiboseQC_all" R file appended to the bam_files path, including the complete list of outputs described here. In addition, bigwig files for coverage values and P_sites positions are appended to the bam_files path, together with a summary of P_sites selection statistics, a smaller "results_RiboseQC" R file used for creating a dynamic html report, and a "for_SaTAnn" R object that can be used in the SaTAnn pipeline.
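Since the output files are written next to the input BAM files, a small helper can collect them after a run. The sketch below is hypothetical: find_riboseqc_outputs is not part of the package, and the exact file names (including the "_" separator used for the placeholder files) are assumptions. It only relies on the output names starting with the BAM file name and containing "RiboseQC", so it would not pick up the "for_SaTAnn" object or files written under a custom dest_names prefix.

```r
# Hypothetical helper: list files next to a BAM file whose names start with
# the BAM file name and mention "RiboseQC".
find_riboseqc_outputs <- function(bam_file) {
  candidates <- list.files(dirname(bam_file), full.names = TRUE)
  is_match <- startsWith(basename(candidates), basename(bam_file)) &
    grepl("RiboseQC", basename(candidates), fixed = TRUE)
  candidates[is_match]
}

# Demonstration on empty placeholder files in a temporary directory.
# The file names below are assumed for illustration only.
demo_dir <- file.path(tempdir(), "rqc_demo")
dir.create(demo_dir, showWarnings = FALSE)
bam <- file.path(demo_dir, "sample1.bam")
file.create(
  bam,
  paste0(bam, "_results_RiboseQC_all"),
  paste0(bam, "_results_RiboseQC")
)

outputs <- find_riboseqc_outputs(bam)
print(basename(outputs))
```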
Lorenzo Calviello, calviello.l.bio@gmail.com
prepare_annotation_files, calc_cutoffs_from_profiles, choose_readlengths, create_html_report.