clusterCTSS: Clustering CTSSs into tag clusters (TCs)
In ge11232002/CAGEr: Analysis of CAGE (Cap Analysis of Gene Expression) sequencing data for precise mapping of transcription start sites and promoterome mining


Clusters individual CAGE transcription start sites (CTSSs) along the genome into tag clusters (TCs) using a specified "ab initio" method, or assigns them to predefined genomic regions.

clusterCTSS(object, threshold = 1, nrPassThreshold = 1,
            thresholdIsTpm = TRUE, method = "distclu", maxDist = 20, 
            removeSingletons = FALSE, keepSingletonsAbove = Inf, 
            minStability = 1, maxLength = 500, 
            reduceToNonoverlapping = TRUE, customClusters = NULL, 
            useMulticore = FALSE, nrCores = NULL)

`object`	A `CAGEset` object
`threshold, nrPassThreshold`	Only CTSSs with signal `>= threshold` in `>= nrPassThreshold` experiments will be used for clustering and will contribute towards total signal of the cluster.
`thresholdIsTpm`	Logical; whether `threshold` is given as normalized signal (`TRUE`) or as a raw tag count (`FALSE`).
`method`	Method to be used for clustering. One of `"distclu"`, `"paraclu"` or `"custom"`. See Details.
`maxDist`	Maximal distance between two neighbouring CTSSs for them to be part of the same cluster. Used only when `method = "distclu"`, otherwise ignored.
`removeSingletons`	Logical, should tag clusters containing only one CTSS be removed. Ignored when `method = "custom"`.
`keepSingletonsAbove`	Controls which singleton tag clusters will be removed. When `removeSingletons = TRUE`, only singletons with signal `< keepSingletonsAbove` will be removed. Useful to prevent removing highly supported singleton tag clusters. Default value `Inf` results in removing all singleton TCs when `removeSingletons = TRUE`. Ignored when `removeSingletons = FALSE` or `method = "custom"`.
`minStability`	Minimal stability of the cluster, where stability is defined as the ratio between the maximal and minimal density values for which this cluster is maximal scoring. For the definition of stability refer to Frith et al., Genome Research, 2007. Clusters with stability `< minStability` will be discarded. Used only when `method = "paraclu"`, otherwise ignored.
`maxLength`	Maximal length of cluster in base-pairs. Clusters with length `> maxLength` will be discarded. Ignored when `method = "custom"`.
`reduceToNonoverlapping`	Logical, should smaller clusters contained within a bigger cluster be removed to make the final set of tag clusters non-overlapping. Used only when `method = "paraclu"`. See Details.
`customClusters`	Genomic coordinates of predefined regions to be used to segment the CTSSs. It has to be a `data.frame` with the following columns: `chr` (chromosome name), `start` (0-based start coordinate), `end` (end coordinate), `strand` (either `"+"` or `"-"`). Used only when `method = "custom"`.
`useMulticore`	Logical, should multicore be used. `useMulticore = TRUE` is supported only on Unix-like platforms.
`nrCores`	Number of cores to use when `useMulticore = TRUE`. Default value `NULL` uses all detected cores.
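
The filter described by `threshold` and `nrPassThreshold` can be illustrated with a minimal base R sketch. The toy matrix and the helper `passes_filter` are hypothetical and exist only for this illustration; they are not part of CAGEr and do not show how the package stores or filters CTSSs internally.

```r
# Hypothetical toy data: rows are CTSSs, columns are experiments,
# values are signal (normalized or raw, matching thresholdIsTpm).
signal <- matrix(
  c(0.5, 2.0, 0.0,
    1.2, 0.8, 3.1,
    0.0, 0.0, 0.4,
    5.0, 4.2, 6.3),
  ncol = 3, byrow = TRUE,
  dimnames = list(paste0("ctss", 1:4), paste0("sample", 1:3))
)

# Hypothetical helper (not a CAGEr function): TRUE for CTSSs with
# signal >= threshold in at least nrPassThreshold experiments.
passes_filter <- function(signal, threshold = 1, nrPassThreshold = 1) {
  rowSums(signal >= threshold) >= nrPassThreshold
}

# CTSSs kept when one passing experiment is enough
keep_one <- passes_filter(signal, threshold = 1, nrPassThreshold = 1)

# CTSSs kept when at least two experiments must pass
keep_two <- passes_filter(signal, threshold = 1, nrPassThreshold = 2)
```

Raising `nrPassThreshold` makes the filter stricter: a CTSS supported in only one experiment passes the first call but not the second.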

Two "ab initio" methods for clustering TSSs along the genome are supported: "distclu" and "paraclu". "distclu" is an implementation of simple distance-based clustering of data attached to sequences, in which two neighbouring TSSs are joined together if they are closer than a specified distance. "paraclu" is an implementation of the Paraclu algorithm for parametric clustering of data attached to sequences, developed by M. Frith (Frith et al., Genome Research, 2007, http://www.cbrc.jp/paraclu/). Since Paraclu, unlike distclu, finds clusters within clusters, additional parameters (removeSingletons, keepSingletonsAbove, minStability, maxLength and reduceToNonoverlapping) can be specified to simplify the output by discarding clusters that are too small (singletons) or too big, and to reduce the clusters to a final set of non-overlapping clusters.

Clustering is done separately for every CAGE dataset within the CAGEset object, resulting in a different set of tag clusters for every CAGE dataset. TCs from different datasets can be further aggregated into a single reference set of consensus clusters by calling the aggregateTagClusters function.
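
The distance-based idea behind "distclu", together with the singleton rule controlled by `removeSingletons` and `keepSingletonsAbove`, can be sketched in base R for CTSSs on a single chromosome and strand. The function `distclu_sketch` and its output columns are hypothetical and written only for illustration; this is not the CAGEr implementation. It also assumes that a gap of exactly `maxDist` still joins two CTSSs, which the text above does not settle.

```r
# Hypothetical helper, not the CAGEr implementation.
# pos:    CTSS positions on one chromosome and strand
# signal: signal of each CTSS, in the same order as pos
distclu_sketch <- function(pos, signal, maxDist = 20,
                           removeSingletons = FALSE,
                           keepSingletonsAbove = Inf) {
  stopifnot(length(pos) == length(signal), length(pos) > 0)

  # Sort CTSSs by position so that neighbours are adjacent
  ord <- order(pos)
  pos <- pos[ord]
  signal <- signal[ord]

  # Start a new cluster whenever the gap to the previous CTSS
  # exceeds maxDist (assumption: a gap equal to maxDist still joins)
  cluster_id <- cumsum(c(TRUE, diff(pos) > maxDist))

  # Summarise each cluster: span, number of CTSSs, total signal
  clusters <- data.frame(
    start   = as.vector(tapply(pos, cluster_id, min)),
    end     = as.vector(tapply(pos, cluster_id, max)),
    nr_ctss = as.vector(tapply(pos, cluster_id, length)),
    signal  = as.vector(tapply(signal, cluster_id, sum))
  )

  # Drop singletons whose signal is below keepSingletonsAbove
  if (removeSingletons) {
    drop <- clusters$nr_ctss == 1 & clusters$signal < keepSingletonsAbove
    clusters <- clusters[!drop, , drop = FALSE]
    rownames(clusters) <- NULL
  }
  clusters
}

# Toy input: two multi-CTSS clusters and two singletons
tc <- distclu_sketch(
  pos    = c(100, 105, 118, 200, 300, 400, 410),
  signal = c(3, 10, 2, 150, 5, 1, 4),
  maxDist = 20, removeSingletons = TRUE, keepSingletonsAbove = 100
)
```

In this toy input the singleton at position 200 is kept because its signal is not below `keepSingletonsAbove`, while the weakly supported singleton at position 300 is removed.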

The slots clusteringMethod, filteredCTSSidx and tagClusters of the provided CAGEset object are populated with the clustering method used, the CTSSs included in the clusters and the list of tag clusters per CAGE experiment, respectively. To retrieve tag clusters for an individual CAGE dataset, use the tagClusters function.

Vanja Haberle

Frith et al. (2007) A code for transcription initiation in mammalian genomes. Genome Research 18(1):1-12 (http://www.cbrc.jp/paraclu/).

tagClusters
aggregateTagClusters

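# Load the example CAGEset object shipped with the package
load(system.file("data", "exampleCAGEset.RData", package="CAGEr"))

# Cluster CTSSs with normalized signal >= 50 in at least one experiment
# using distance-based clustering (maxDist = 20); singleton clusters
# with signal < 100 are removed
clusterCTSS(object = exampleCAGEset, threshold = 50, thresholdIsTpm = TRUE,
            nrPassThreshold = 1, method = "distclu", maxDist = 20,
            removeSingletons = TRUE, keepSingletonsAbove = 100)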