prepare_annotation_files | R Documentation |
This function processes a GTF file and a twobit file (created using faToTwoBit from the UCSC tools: http://hgdownload.soe.ucsc.edu/admin/exe/ ) to create a comprehensive set of genomic regions of interest in genomic and transcriptomic space (e.g. introns, UTRs, start/stop codons). In addition, by linking genome sequence and annotation, it extracts additional information, such as gene and transcript biotypes, genetic codes for different organelles, and chromosome and transcript lengths.
prepare_annotation_files(annotation_directory, twobit_file, gtf_file,
scientific_name = "Homo.sapiens", annotation_name = "genc25",
export_bed_tables_TxDb = TRUE, forge_BSgenome = TRUE,
genome_seq = NULL, circ_chroms = DEFAULT_CIRC_SEQS,
create_TxDb = TRUE)
annotation_directory |
The target directory which will contain the output files |
twobit_file |
Full path to the genome file in twobit format |
gtf_file |
Full path to the annotation file in GTF format |
scientific_name |
A name to give to the organism studied; must be two words separated by a ".", defaults to Homo.sapiens |
annotation_name |
A name to give to the annotation used; defaults to genc25 |
export_bed_tables_TxDb |
Export coordinates and info about different genomic regions in the annotation_directory? It defaults to TRUE |
forge_BSgenome |
Forge and install a BSgenome package? It defaults to TRUE |
create_TxDb |
Create a TxDb object? It defaults to TRUE |
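The scientific_name argument must be two words separated by a ".". A minimal sketch of a pre-flight check is shown here; is_valid_scientific_name is a hypothetical helper, not part of the package, and it assumes that each word consists of letters only, which the documentation does not state.

```r
# Hypothetical helper (not part of the package): returns TRUE when `x` is a
# single string made of exactly two words separated by one ".".
# Assumption: each word contains letters only.
is_valid_scientific_name <- function(x) {
  is.character(x) && length(x) == 1 && !is.na(x) &&
    grepl("^[A-Za-z]+\\.[A-Za-z]+$", x)
}

ok_default   <- is_valid_scientific_name("Homo.sapiens")             # documented default
ok_space     <- is_valid_scientific_name("Homo sapiens")             # space instead of "."
ok_three     <- is_valid_scientific_name("Mus.musculus.domesticus")  # three words
```

Running such a check before calling prepare_annotation_files may catch a malformed name early, before any annotation processing starts.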
This function uses the makeTxDbFromGFF function to create a TxDb object and extract genomic regions and other info to a *Rannot R file; the mapToTranscripts and mapFromTranscripts functions are used to map features to genomic or transcript-level coordinates. The GTF file must contain "exon" and "CDS" lines, where each line contains "transcript_id" and "gene_id" values. Additional values such as "gene_biotype" or "gene_name" are also extracted.
Regarding sequences, the twobit file, together with the input scientific and annotation names, is used to forge and install a BSgenome package using the forgeBSgenomeDataPkg function.
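The GTF requirements described above ("exon" and "CDS" lines, each carrying "transcript_id" and "gene_id") can be checked before running the function. The following is a sketch in base R only; check_gtf_lines is a hypothetical helper, not part of the package, and the toy GTF lines are invented for illustration.

```r
# Hypothetical helper (not part of the package): inspects GTF lines and
# reports whether "exon" and "CDS" features are present and whether every
# such line carries both a transcript_id and a gene_id attribute.
check_gtf_lines <- function(gtf_lines) {
  # Drop comment/header lines and empty lines
  gtf_lines <- gtf_lines[!grepl("^#", gtf_lines) & nzchar(gtf_lines)]
  fields <- strsplit(gtf_lines, "\t", fixed = TRUE)
  # Keep only lines with the 9 tab-separated GTF columns
  fields <- fields[lengths(fields) >= 9]
  feature <- vapply(fields, `[`, character(1), 3)  # column 3: feature type
  attrs   <- vapply(fields, `[`, character(1), 9)  # column 9: attributes
  relevant <- feature %in% c("exon", "CDS")
  has_ids <- grepl("transcript_id ", attrs, fixed = TRUE) &
    grepl("gene_id ", attrs, fixed = TRUE)
  list(
    has_exon     = any(feature == "exon"),
    has_cds      = any(feature == "CDS"),
    ids_complete = all(has_ids[relevant])
  )
}

# Toy input, invented for illustration only
example_gtf <- c(
  "#!genome-build example",
  paste("chr1", "src", "exon", "100", "200", ".", "+", ".",
        'gene_id "G1"; transcript_id "T1"; gene_biotype "protein_coding";',
        sep = "\t"),
  paste("chr1", "src", "CDS", "120", "180", ".", "+", "0",
        'gene_id "G1"; transcript_id "T1";',
        sep = "\t")
)
gtf_check <- check_gtf_lines(example_gtf)
```

For a real annotation, the lines could be read with readLines(gtf_file); for large files, reading only the first few thousand lines may be enough for a quick sanity check.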
The resulting GTF_annotation object (obtained after running load_annotation) contains:
txs
: annotated transcript boundaries.
txs_gene
: GRangesList including transcripts grouped by gene.
seqinfo
: indicating chromosomes and chromosome lengths.
start_stop_codons
: the set of annotated start and stop codons, with respective transcript and gene_ids.
reprentative_mostcommon, reprentative_boundaries and reprentative_5len represent, respectively, the most common start/stop codons,
the most upstream/downstream start/stop codons, and the start/stop codons residing on transcripts with the longest 5'UTRs.
cds_txs
: GRangesList including CDS grouped by transcript.
introns_txs
: GRangesList including introns grouped by transcript.
cds_genes
: GRangesList including CDS grouped by gene.
exons_txs
: GRangesList including exons grouped by transcript.
exons_bins
: the list of exonic bins with associated transcripts and genes.
junctions
: the list of annotated splice junctions, with associated transcripts and genes.
genes
: annotated gene coordinates.
threeutrs
: collapsed set of 3'UTR regions, with corresponding gene_ids. This set does not overlap CDS regions.
fiveutrs
: collapsed set of 5'UTR regions, with corresponding gene_ids. This set does not overlap CDS regions.
ncIsof
: collapsed set of exonic regions of protein_coding genes, with corresponding gene_ids. This set does not overlap CDS regions.
ncRNAs
: collapsed set of exonic regions of non_coding genes, with corresponding gene_ids. This set does not overlap CDS regions.
introns
: collapsed set of intronic regions, with corresponding gene_ids. This set does not overlap exonic regions.
intergenicRegions
: set of intergenic regions, defined as regions with no annotated genes on either strand.
trann
: DataFrame object including (when available) the mapping between gene_id, gene_name, gene_biotypes, transcript_id and transcript_biotypes.
cds_txs_coords
: transcript-level coordinates of ORF boundaries, for each annotated coding transcript. Additional columns are the same as for the start_stop_codons object.
genetic_codes
: an object containing the list of genetic code ids used for each chromosome/organelle. See GENETIC_CODE_TABLE for more info.
genome
: the name of the forged BSgenome package, or an FaFile_Circ object. Loaded with the load_annotation function.
stop_in_gtf
: stop codon, as defined in the annotation.
A TxDb file and a *Rannot file are created in the specified annotation_directory.
In addition, a BSgenome object is forged, installed, and linked to the *Rannot object.
Lorenzo Calviello, calviello.l.bio@gmail.com
load_annotation, forgeBSgenomeDataPkg, makeTxDbFromGFF.
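A minimal usage sketch follows, built from the signature shown in the usage section. All file paths are hypothetical placeholders, and the call is guarded so that it only runs when prepare_annotation_files is available in the session and the input files exist. Creating the output directory beforehand is a precaution assumed here, not a documented requirement.

```r
# Argument names come from the documented signature; all paths are
# hypothetical placeholders and must be replaced with real files.
call_args <- list(
  annotation_directory   = file.path(tempdir(), "annotation"),
  twobit_file            = "/path/to/genome.2bit",      # hypothetical path
  gtf_file               = "/path/to/annotation.gtf",   # hypothetical path
  scientific_name        = "Homo.sapiens",
  annotation_name        = "genc25",
  export_bed_tables_TxDb = TRUE,
  forge_BSgenome         = TRUE,
  create_TxDb            = TRUE
)

# Only run when the function is available and both input files exist.
can_run <- exists("prepare_annotation_files", mode = "function") &&
  file.exists(call_args$twobit_file) &&
  file.exists(call_args$gtf_file)

if (can_run) {
  # Precaution (assumption): make sure the target directory exists.
  dir.create(call_args$annotation_directory,
             showWarnings = FALSE, recursive = TRUE)
  do.call(prepare_annotation_files, call_args)
}
```

After a successful run, the annotation_directory should contain the TxDb and *Rannot files described above, which can then be loaded with load_annotation.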