searchAln | R Documentation |
Scans a pairwise alignment of nucleotide sequences with the pattern represented by the PWMatrix. It reports only those hits that are overlapped in the alignment of of the two sequences and exceed a specified threshold score in both, AND are found in regions of the alignment above the specified conservation cutoff value.
searchAln(pwm, aln1, aln2, seqname1="Unknown1", seqname2="Unknown2",
min.score="80%", windowSize=51L,
cutoff=0.7, strand="*", type="any", conservation=NULL,
mc.cores=1L)
pwm |
A PWMatrix object or a PWMatrixList object. |
aln1 |
A |
aln2 |
A |
seqname1 , seqname2 |
A |
min.score |
The minimum score for the hit. Can be given an character string in the format of "80%" or as a single absolute value. When it is percentage value, it means the percentage of the maximal possible from the PWM. |
windowSize |
The size of the sliding window (in nucleotides) for calculating local conservation in the alignment. This should be an odd value. |
cutoff |
The conservation cutoff can be from 0 (0% identity) to 1 (100% identity).
The regions which have lower conservation than the cutoff
will be discarded from the results of the pattern searching.
The conservation is calculated by comparing the alignments
within the |
strand |
When searching the alignment, we can search the positive strand or negative strand. While strand is "*", it will search both strands and return the results based on the positvie strand coordinate. |
type |
This argument can be "any" or "all". When it is "any", one motif will be kept if the maximal conservation value of the motif is larger than the cutoff. When it is "all", one motif will be kept if the minimal conservation value of the motif is larger than the cutoff. |
conservation |
A vector of conservation profile. If not supplied, the conservation profile will be computed internally on the fly. |
mc.cores |
The number of cpu threads to use when searching |
In brief, given a pairwise alignment of two sequences,
first of all, we remove the gaps ("-", "-", ".").
Then we scan both ungapped sequences with the pwm and
return the hits that above min.score
.
Since we only want to keep the conserved hits,
we choose the pair of motifs that overlap most in the alignment.
Finally, the pair of motifs have to be conserved above
the threshold cutoff
.
In the returned SitePairSet
, the coordinates of start, end are based
on the ungapped sequences, instead of the original alignment.
This is due to we are more concerned about the actual location of motif
in the genome rather than in the alignment.
A SitePairSet
object is returned when pwm is a PWMatrix
,
while a SitePairSetList
is returned when pwm is a PWMatrixList
.
Ge Tan
searchSeq
data(MA0003.2)
data(MA0004.1)
pwm1 <- toPWM(MA0003.2)
pwm2 <- toPWM(MA0004.1)
pwmList <- PWMatrixList(pwm1=pwm1, pwm2=pwm2)
# Two character objects
aln1 <- "ACCACATTGCCTCAGGGCAGGTAAGTTGATC---AAAGG---AAACGCAAAGTTTTCAAG"
aln2 <- "GTTTCACTACATTGCTTCAGGGCAGTAAATATATAAATATATAAAAATATAATTTTCATC"
aln <- c(aln1=aln1, aln2=aln2)
library(Biostrings)
alnDNAStringSet <- DNAStringSet(c(aln1=aln1, aln2=aln2))
# PWMatrix, character, character
## Only scan the positive strand of the alignments
sitePairSet <- searchAln(pwm1, aln1, aln2, seqname1="aln1", seqname2="aln2",
min.score="70%", cutoff=0.5,
strand="+", type="any")
## Only scan the negative strand of the alignments
sitePairSet <- searchAln(pwm1, aln1, aln2, seqname1="aln1", seqname2="aln2",
min.score="70%", cutoff=0.5,
strand="-", type="any")
## Scan the both strands of the alignments
sitePairSet <- searchAln(pwm1, aln1, aln2, seqname1="aln1", seqname2="aln2",
min.score="70%", cutoff=0.5,
strand="*", type="any")
## Convert the SitePairSet object into other R objects
as(sitePairSet, "data.frame")
as.data.frame(sitePairSet)
as(sitePairSet, "DataFrame")
as(sitePairSet, "GRanges")
writeGFF3(sitePairSet)
writeGFF2(sitePairSet)
# PWMatrix, character, missing
sitePairSet <- searchAln(pwm1, aln,
min.score="70%", cutoff=0.5,
strand="*", type="any")
# PWMatrix, DNAString, DNAString
sitePairSet <- searchAln(pwm1, DNAString(aln1), DNAString(aln2),
seqname1="aln1", seqname2="aln2",
min.score="70%", cutoff=0.5,
strand="*", type="any")
# PWMatrix, DNAStringSet, missing
sitePairSet <- searchAln(pwm1, alnDNAStringSet,
min.score="70%", cutoff=0.5,
strand="*", type="any")
# PWMatrixList, character, character
sitePairSetList <- searchAln(pwmList, aln1, aln2,
seqname1="aln1", seqname2="aln2",
min.score="70%", cutoff=0.5,
strand="*", type="any")
## elementLenths of each pwm hits
elementNROWS(sitePairSetList)
## output
writeGFF2(sitePairSetList)
writeGFF3(sitePairSetList)
as(sitePairSetList, "DataFrame")
as(sitePairSetList, "data.frame")
as.data.frame(sitePairSetList)
as(sitePairSetList, "GRanges")
# PWMatrix, Axt, missing
library(CNEr)
axtFilesHg19DanRer7 <- file.path(system.file("extdata", package="TFBSTools"),
"hg19.danRer7.net.axt")
axtHg19DanRer7 <- readAxt(axtFilesHg19DanRer7)
sitePairSetList <- searchAln(pwm1, axtHg19DanRer7, min.score="80%",
windowSize=51L, cutoff=0.7, strand="*",
type="any", conservation=NULL, mc.cores=1)
## We may want to coordinates of motif in the genome
GRangesTFBS <- toGRangesList(sitePairSetList, axtHg19DanRer7)
Add the following code to your website.
For more information on customizing the embed code, read Embedding Snippets.