View source: R/prepare_data_from_FASTA.R
prepare_data_from_FASTA | R Documentation |
Given a set of sequences in a FASTA file, this function returns a sparse matrix of one-hot encoded sequences. In this matrix, sequence features are along the rows and sequences are along the columns. Currently, mono- and dinucleotide features for DNA sequences are supported. Because the DNA alphabet has four characters, the feature vector of each sequence has length 4 times the sequence length for mononucleotide features and 16 times the sequence length for dinucleotide features.
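The dimension arithmetic for the mononucleotide case can be illustrated with a minimal base-R sketch. The helper `one_hot_mono` is hypothetical and not part of seqArchR. It builds a dense matrix rather than a sparse one, and its row ordering (one block of rows per letter) is an assumption made for illustration, so it may differ from the layout the package uses internally.

```r
# Hypothetical helper (not part of seqArchR): one-hot encode equal-length DNA
# sequences over the alphabet A, C, G, T. Features are along rows and
# sequences along columns, as in the description above.
one_hot_mono <- function(seqs, alphabet = c("A", "C", "G", "T")) {
  seq_len_nt <- unique(nchar(seqs))
  stopifnot(length(seq_len_nt) == 1L)  # all sequences must share one length
  encode_one <- function(s) {
    chars <- strsplit(s, "")[[1]]
    # One block of `seq_len_nt` rows per letter: the A block, then C, G, T
    # (an assumed ordering, chosen for illustration only).
    as.integer(unlist(lapply(alphabet, function(a) chars == a)))
  }
  vapply(seqs, encode_one, integer(length(alphabet) * seq_len_nt),
         USE.NAMES = FALSE)
}

toy_seqs <- c("ACGT", "TTAA", "GGCC")
mono <- one_hot_mono(toy_seqs)
dim(mono)      # 16 3: 4 x sequence length 4 feature rows, 3 sequence columns
colSums(mono)  # 4 4 4: each sequence sets exactly one entry per position
```

The same reasoning gives the factor of 16 for dinucleotide features, since there are 4 x 4 possible dinucleotides.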
prepare_data_from_FASTA(fasta_fname, raw_seq = FALSE, sinuc_or_dinuc = "sinuc")
fasta_fname |
Provide the name (with complete path) of the input FASTA file. |
raw_seq |
Logical, TRUE or FALSE. Set this to TRUE to return the raw sequences instead of the one-hot encoded matrix. |
sinuc_or_dinuc |
character string, 'sinuc' or 'dinuc' to select for mono- or dinucleotide profiles. |
A sparse matrix of sequences represented with one-hot encoding. If raw_seq is TRUE, the sequences are returned as a Biostrings::DNAStringSet object instead.
get_one_hot_encoded_seqs for directly using a DNAStringSet object
Other input functions:
get_one_hot_encoded_seqs()
fname <- system.file("extdata", "example_data.fa.gz",
                     package = "seqArchR", mustWork = TRUE)

# mononucleotides feature matrix
rawSeqs <- prepare_data_from_FASTA(fasta_fname = fname,
                                   sinuc_or_dinuc = "sinuc")

# dinucleotides feature matrix
rawSeqs <- prepare_data_from_FASTA(fasta_fname = fname,
                                   sinuc_or_dinuc = "dinuc")

# FASTA sequences as a Biostrings::DNAStringSet object
rawSeqs <- prepare_data_from_FASTA(fasta_fname = fname,
                                   raw_seq = TRUE)