FindGenes: Find Genes in a Genome
In DECIPHER: Tools for curating, analyzing, and manipulating biological sequences


Predicts the start and stop positions of protein coding genes in a genome.

FindGenes(myDNAStringSet,
          geneticCode = getGeneticCode("11"),
          minGeneLength = 60,
          allowEdges = TRUE,
          allScores = FALSE,
          showPlot = FALSE,
          verbose = TRUE)

`myDNAStringSet`	A `DNAStringSet` object of unaligned sequences representing a genome.
`geneticCode`	A named character vector defining the translation from codons to amino acids. Optionally, an `"alt_init_codons"` attribute can be used to specify alternative initiation codons. By default, the bacterial and archaeal genetic code is used, which has seven possible initiation codons: ATG, GTG, TTG, CTG, ATA, ATT, and ATC.
`minGeneLength`	Integer specifying the minimum length of genes to find in the genome.
`allowEdges`	Logical determining whether to allow genes that run off the edge of the sequences. If `TRUE` (the default), genes can be identified with implied starts or ends outside the boundaries of `myDNAStringSet`, although the boundary will be set to the last possible codon position. This is useful when genome sequences are circular or incomplete.
`allScores`	Logical indicating whether to return information about all possible open reading frames or only the predicted genes (the default).
`showPlot`	Logical determining whether a plot is displayed with the distribution of gene lengths and scores. (See details section below.)
`verbose`	Logical indicating whether to print information about the predictions on each iteration. (See details section below.)
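
As a rough sketch of how these arguments combine, the call below overrides several defaults on the test genome shipped with DECIPHER. The specific values (`minGeneLength = 90`, `allowEdges = FALSE`) are arbitrary choices for illustration, not recommendations, and the variable names `code` and `orfs` are hypothetical.

```r
library(DECIPHER)

# test genome distributed with the package
fas <- system.file("extdata",
	"Chlamydia_trachomatis_NC_000117.fas.gz",
	package="DECIPHER")
genome <- readDNAStringSet(fas)

# the default genetic code and its alternative initiation codons
code <- getGeneticCode("11")
attr(code, "alt_init_codons")

# illustrative (arbitrary) settings: a longer minimum gene length,
# no genes running off the sequence edges, scores for all
# open reading frames, and no per-iteration output
orfs <- FindGenes(genome,
	geneticCode=code,
	minGeneLength=90,
	allowEdges=FALSE,
	allScores=TRUE,
	verbose=FALSE)
orfs
```

With `allScores=TRUE` the result describes every candidate open reading frame rather than only the predicted genes, so it is typically much larger than the default output.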

Protein coding genes are identified by learning their characteristic signature directly from the genome, i.e., ab initio prediction. Gene signatures are derived from the content of the open reading frame and surrounding signals that indicate the presence of a gene. Genes are assumed to not contain introns or frame shifts, making the function best suited for prokaryotic genomes.
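
To make the notion of an open reading frame concrete, here is a toy forward-strand scan in base R. It is not how `FindGenes` works: it ignores the reverse strand, applies no scoring, and always takes the first initiation codon it meets. The function name `find_orfs` and its arguments are hypothetical; the start codons are the seven listed above, and the stop codons are assumed to be TAA, TAG, and TGA.

```r
# Toy open reading frame scan (illustration only, not DECIPHER's method).
find_orfs <- function(dna,
                      starts = c("ATG", "GTG", "TTG", "CTG", "ATA", "ATT", "ATC"),
                      stops = c("TAA", "TAG", "TGA"),
                      min_len = 60) {
  n <- nchar(dna)
  begin <- integer(0)
  end <- integer(0)
  frame <- integer(0)
  for (f in 0:2) {
    if (n - 2 < f + 1) next  # sequence too short for this frame
    pos <- seq(f + 1, n - 2, by = 3)        # first base of each codon
    codons <- substring(dna, pos, pos + 2)
    open <- NA_integer_                      # start of the currently open ORF
    for (i in seq_along(codons)) {
      if (is.na(open) && codons[i] %in% starts) {
        open <- pos[i]                       # first start codon opens the ORF
      } else if (!is.na(open) && codons[i] %in% stops) {
        stop_end <- pos[i] + 2               # last base of the stop codon
        if (stop_end - open + 1 >= min_len) {
          begin <- c(begin, open)
          end <- c(end, stop_end)
          frame <- c(frame, f + 1)
        }
        open <- NA_integer_                  # close the ORF
      }
    }
  }
  data.frame(Begin = begin, End = end, Frame = frame)
}

# a 19-nucleotide toy sequence with one short ORF in the third frame
toy <- find_orfs("CCATGAAACCCGGGTAAGG", min_len = 15)
toy
```

On this toy input the scan reports a single open reading frame spanning positions 3 to 17 in frame 3. A real genome contains many such candidates, most of which are not genes, which is why the function learns scoring models to separate genes from background open reading frames.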

If `showPlot` is `TRUE`, a plot is displayed with four panels. The upper left panel shows the fitted distribution of background open reading frame lengths. The upper right panel shows this distribution on top of the fitted distribution of predicted gene lengths. The lower left panel shows the fitted distribution of scores for the intergenic spacing between genes on the same and opposite genome strands. The lower right panel shows the total score of open reading frames and predicted genes by length.

If verbose is TRUE, information is shown about the predictions at each iteration of gene finding. The mean score difference between genes and non-genes is updated at each iteration, unless it is negative, in which case the score is dropped and a "-" is displayed. The columns denote the number of iterations ("Iter"), number of codon scoring models ("Models"), start codon scores ("Start"), upstream k-mer motif scores ("Motif"), mRNA folding scores ("Fold"), initial codon bias scores ("Init"), upstream nucleotide bias scores ("UpsNt"), termination codon bias scores ("Term"), ribosome binding site scores ("RBS"), codon autocorrelation scores ("Auto"), stop codon scores ("Stop"), and number of predicted genes ("Genes").

An object of class Genes, which is stored as a matrix with information corresponding to each open reading frame.
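
Because the result is stored as a matrix, its underlying structure can be inspected with ordinary base R functions. The snippet below is a minimal sketch; the variable names `x` and `m` are hypothetical, and the column names are printed rather than assumed.

```r
library(DECIPHER)

fas <- system.file("extdata",
	"Chlamydia_trachomatis_NC_000117.fas.gz",
	package="DECIPHER")
genome <- readDNAStringSet(fas)

x <- FindGenes(genome, verbose=FALSE)

# drop the class to view the plain matrix behind the Genes object
m <- unclass(x)
dim(m)       # one row per reported open reading frame
colnames(m)  # names of the stored fields
head(m)      # first few rows
```

See Genes-class for the meaning of each column.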

Erik Wright eswright@pitt.edu

ExtractGenes, Genes-class, WriteGenes

library(DECIPHER)

# import a test genome
fas <- system.file("extdata",
	"Chlamydia_trachomatis_NC_000117.fas.gz",
	package="DECIPHER")
genome <- readDNAStringSet(fas)

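# predict genes and print a summary of the result
x <- FindGenes(genome)
x

# extract the predicted genes as nucleotide and amino acid sequences
genes <- ExtractGenes(x, genome)
proteins <- ExtractGenes(x, genome, type="AAStringSet")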