View source: R/cdhit-methods.R
CDHIT is a greedy algorithm to cluster amino acid or DNA sequences based on a
minimum identity.
By default, in this package it is configured to perform ungapped, global
alignments with no clipping at the start or end.
The identity is the number of identical characters in the alignment
divided by the full length of the shorter sequence.
Set s < 1 to change the minimum coverage of the shorter sequence, which
will allow clipping at the start or end.
Setting G = 0 changes the meaning of the identity to be the number of
identical characters in the alignment divided by the length of the alignment.
In this case, you must also set the alignment coverage controls aL, AL, aS
and AS.
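The two definitions of identity described above can be compared on a toy alignment. The following is a minimal sketch in base R, not the way CDHIT itself computes alignments: the helper `aligned_matches` and the example sequences are hypothetical, and the aligned segments are supplied by hand rather than found by an aligner.

```r
# Hypothetical helper (not part of CellaRepertorium or CDHIT):
# count identical positions between two equal-length, ungapped aligned segments.
aligned_matches <- function(seg_a, seg_b) {
  stopifnot(nchar(seg_a) == nchar(seg_b))
  sum(strsplit(seg_a, "")[[1]] == strsplit(seg_b, "")[[1]])
}

# Made-up example: the alignment covers 12 of the 14 residues of the
# shorter sequence (its first two residues are clipped).
shorter     <- "CASSLGQGNTEAFF"   # full shorter sequence, 14 residues
seg_shorter <- "SSLGQGNTEAFF"     # aligned part of the shorter sequence
seg_longer  <- "SSLGAGNTEAFF"     # aligned part of the other sequence, one mismatch

matches <- aligned_matches(seg_shorter, seg_longer)

# Identity relative to the full length of the shorter sequence (the default meaning)
global_identity <- matches / nchar(shorter)

# Identity relative to the length of the alignment (the meaning when G = 0)
local_identity <- matches / nchar(seg_shorter)

# Fraction of the shorter sequence covered by the alignment
coverage_shorter <- nchar(seg_shorter) / nchar(shorter)

round(c(global = global_identity, local = local_identity,
        coverage = coverage_shorter), 3)
```

Here the same 11 matching residues give a lower identity under the global definition (11/14) than under the local one (11/12), which is why the coverage controls matter once G = 0 is set.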
seqs |
identity | Minimum proportion identity.
kmerSize | Word size. If NULL, it will be chosen automatically based on the identity. You may need to lower it below 5 for AAseq with identity less than 0.7.
min_length | Minimum length for sequences to be clustered. An error is raised if a shorter sequence is passed.
s | Fraction of the shorter sequence covered by the alignment.
only_index | If TRUE, only return the integer cluster indices; otherwise return a tibble.
showProgress | Show a status bar.
... | Other arguments that can be passed to cdhit; see https://github.com/weizhongli/cdhit/wiki/3.-User's-Guide#CDHIT for details. These will override any default values.
CDHIT is by Fu, Niu, Zhu, Wu and Li (2012). The R interface was originally written by Thomas Lin Pedersen and was transcribed here because it is not exported from the package FindMyFriends, which is orphaned.
An integer vector with the same length as seqs, providing the cluster ID for each sequence, or a tibble. See details.
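As an illustration of working with the integer form of the return value, the sketch below groups sequences by cluster ID using base R only. The cluster indices and sequence names are invented for the example and are not output from cdhit.

```r
# Hypothetical cluster indices, standing in for the integer vector
# returned when only_index = TRUE (one entry per input sequence).
cluster_idx <- c(1L, 2L, 1L, 3L, 2L, 1L)

# Hypothetical sequence names, in the same order as the input sequences
seq_names <- paste0("seq", seq_along(cluster_idx))

# List the members of each cluster, keyed by cluster ID
members <- split(seq_names, cluster_idx)

# Number of sequences in each cluster
cluster_sizes <- table(cluster_idx)

members
cluster_sizes
```

Because the vector has one entry per input sequence, it can also be attached directly as a column to any table that keeps the sequences in the same order.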
library(CellaRepertorium)
fasta_path = system.file('extdata', 'demo.fasta', package='CellaRepertorium')
aaseq = Biostrings::readAAStringSet(fasta_path)
# 100% identity, global alignment
cdhit(aaseq, identity = 1, only_index = TRUE)[1:10]
# 100% identity, local alignment with no padding of endpoints
cdhit(aaseq, identity = 1, G = 0, aL = 1, aS = 1, only_index = TRUE)[1:10]
# 100% identity, local alignment with .9 padding of endpoints
cdhit(aaseq, identity = 1, G = 0, aL = .9, aS = .9, only_index = TRUE)[1:10]
# a tibble
tbl = cdhit(aaseq, identity = 1, G = 0, aL = .9, aS = .9, only_index = FALSE)
[1] 100 101 162 102 6 245 103 49 163 164
[1] 100 101 162 102 6 245 103 49 163 164
[1] 100 101 162 102 6 245 103 49 163 164