packSearch: packFinder Algorithm Pipeline
In packFinder: de novo Annotation of Pack-TYPE Transposable Elements


General use pipeline function for the Pack-TYPE transposon finding algorithm.

packSearch(
  tirSeq,
  Genome,
  mismatch = 0,
  elementLength,
  tsdLength,
  tsdMismatch = 0
)

`tirSeq`	A `DNAString` object containing the TIR sequence to be searched for.
`Genome`	A `DNAStringSet` object to be searched.
`mismatch`	The maximum edit distance to be considered for TIR matches (indels + substitutions). See `matchPattern` for details.
`elementLength`	The minimum and maximum element lengths to be considered, as a vector of two integers. E.g. `c(300, 3500)`
`tsdLength`	Integer referring to the length of the flanking TSD region.
`tsdMismatch`	An integer referring to the allowable mismatch (substitutions or indels) between a transposon's TSD sequences. `matchPattern` from Biostrings is used for pattern matching.

Finds potential Pack-TYPE elements based on:

Similarity of TIR sequence to tirSeq
Proximity of potential TIR sequences
Directionality of TIR sequences
Similarity of TSD sequences

The algorithm finds potential forward and reverse TIR sequences using identifyTirMatches and their associated TSD sequences via getTsds. The main filtering stage, identifyPotentialPackElements, filters matches to obtain a dataframe of potential Pack-TYPE elements. Note that this pipeline cannot distinguish Pack-TYPE elements from autonomous elements, so it is recommended to cluster and/or BLAST the predicted elements for further analysis. Furthermore, with the default tsdMismatch = 0 only exact TSD matches are considered, so supplying a long tsdLength may lead to false-negative results.
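The four criteria above can be illustrated with a minimal base R sketch. This is not the packFinder implementation: it only handles exact matches on a single character string, and toyPackSearch, revComp, findStarts and the toy genome are hypothetical names and data introduced here for illustration.

```r
# Toy illustration only: exact matching on one string, not the packFinder code.

# Reverse complement of a DNA string (hypothetical helper).
revComp <- function(s) {
  paste(rev(strsplit(chartr("ACGT", "TGCA", s), "")[[1]]), collapse = "")
}

# Start positions of exact, non-overlapping matches of pattern in subject.
findStarts <- function(pattern, subject) {
  hits <- gregexpr(pattern, subject, fixed = TRUE)[[1]]
  as.integer(hits[hits > 0])
}

toyPackSearch <- function(tirSeq, genome, elementLength, tsdLength) {
  # 1. Similarity: forward TIRs, and reverse-complement TIRs (directionality).
  fwdStarts <- findStarts(tirSeq, genome)
  revEnds <- findStarts(revComp(tirSeq), genome) + nchar(tirSeq) - 1L
  out <- data.frame(start = integer(0), end = integer(0), width = integer(0))
  for (s in fwdStarts) {
    for (e in revEnds) {
      # 2. Proximity: the forward TIR must precede the reverse TIR and the
      #    element width must fall inside c(min, max).
      width <- e - s + 1L
      if (width < elementLength[1] || width > elementLength[2]) next
      # Skip pairs whose flanking TSD would run off the sequence.
      if (s - tsdLength < 1L || e + tsdLength > nchar(genome)) next
      # 3. TSD similarity: flanking bases on both sides must be identical.
      leftTsd <- substr(genome, s - tsdLength, s - 1L)
      rightTsd <- substr(genome, e + 1L, e + tsdLength)
      if (leftTsd == rightTsd) {
        out <- rbind(out, data.frame(start = s, end = e, width = width))
      }
    }
  }
  out
}

# Toy sequence: TSD "ATT" + forward TIR + filler + reverse TIR + TSD "ATT".
genome <- paste0("GGATT", "CACTACAA", "ACGTACGTACGT", "TTGTAGTG", "ATTCC")
hits <- toyPackSearch("CACTACAA", genome, elementLength = c(20, 40), tsdLength = 3)
print(hits)
```

On this toy sequence the sketch reports one candidate spanning positions 6 to 33 (width 28). The real pipeline additionally tolerates TIR mismatches and searches every sequence in a DNAStringSet.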

A dataframe containing the elements identified by the algorithm. These may be autonomous or Pack-TYPE elements. It will contain the following features:

start - the predicted element's start base sequence position.
end - the predicted element's end base sequence position.
seqnames - character string giving the name of the sequence in Genome to which start and end refer.
width - the width of the predicted element.
strand - the strand direction of the transposable element. This will be set to "*" as the packSearch function does not consider transposons to have a direction - only TIR sequences. Passing the packMatches dataframe to packClust will assign a direction to each predicted Pack-TYPE element.

This dataframe is in the format produced by coercing a GRanges object to a dataframe: data.frame(GRanges). Downstream functions, such as packClust, use this dataframe to manipulate predicted transposable elements.
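As a brief illustration of that format, the sketch below coerces a GRanges object to a dataframe and back. It assumes the Bioconductor package GenomicRanges is installed; the sequence name "Chr1" and the coordinates are hypothetical values, not packSearch output.

```r
library(GenomicRanges)

# Hypothetical single range; strand "*" mirrors what packSearch reports.
gr <- GRanges(
    seqnames = "Chr1",
    ranges = IRanges(start = 6, end = 33),
    strand = "*"
)

# Coercion yields the columns described above.
df <- as.data.frame(gr)
names(df)

# A dataframe in this format can be converted back to GRanges if needed.
gr2 <- makeGRangesFromDataFrame(df)
```

Converting back to GRanges can be convenient when the predicted elements are to be used with other Bioconductor range-based tools.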

This algorithm does not consider:

Autonomous elements - autonomous elements will be predicted by this algorithm as there is no BLAST step. It is recommended that, after clustering elements using packClust, the user analyses each group to determine which predicted elements are autonomous and which are likely Pack-TYPE elements. Alternatively, databases such as Repbase (https://www.girinst.org/repbase/) supply annotations for autonomous transposable elements that can be used to filter autonomous matches.
TSD Mismatches - with the default tsdMismatch = 0, two TIRs whose target site duplications do not match exactly will be ignored. Supplying longer TSD sequences will likely lower the false-positive rate, but may also increase the rate of false-negative results.
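The trade-off around TSD matching can be sketched with base R's utils::adist, which computes a generalized Levenshtein (edit) distance. This is a stand-in for illustration only: packSearch itself relies on matchPattern from Biostrings, and tsdMatches is a hypothetical helper, not part of packFinder.

```r
# Hypothetical helper: do two flanking TSDs agree within a given edit distance?
tsdMatches <- function(leftTsd, rightTsd, tsdMismatch = 0) {
  drop(utils::adist(leftTsd, rightTsd)) <= tsdMismatch
}

exactSame <- tsdMatches("ATT", "ATT")                    # identical TSDs
exactDiff <- tsdMatches("ATT", "ATG")                    # one substitution, no tolerance
tolerant  <- tsdMatches("ATT", "ATG", tsdMismatch = 1)   # one substitution allowed

print(c(exactSame, exactDiff, tolerant))
```

With zero tolerance a single diverged base is enough to reject a TIR pair, which is why longer TSDs tend to increase false negatives; allowing a mismatch recovers such pairs at the cost of more chance matches.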

Pattern matching is done via matchPattern.

Jack Gisby

identifyTirMatches, getTsds, identifyPotentialPackElements, packClust, packMatches, DNAStringSet, DNAString, matchPattern

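library(packFinder)

# Load the example Arabidopsis thaliana sequence set shipped with packFinder
data(arabidopsisThalianaRefseq)

# Search for elements of 300-3500 bp flanked by the given TIR and 3 bp TSDs;
# mismatch and tsdMismatch are left at their defaults (0, exact matching)
packMatches <- packSearch(
    Biostrings::DNAString("CACTACAA"),
    arabidopsisThalianaRefseq,
    elementLength = c(300, 3500),
    tsdLength = 3
)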