#' @title
#' Convert .uc Files to Dataframe
#'
#' @description
#' Reads .uc files (USEARCH Cluster Format) generated by
#' the VSEARCH clustering and alignment algorithms.
#'
#' @param file
#' The file path of the .uc file.
#'
#' @param output
#' The type of analysis that was carried out to produce the
#' .uc file.
#' \itemize{
#' \item If output is specified as "cluster", VSEARCH
#' clustering was carried out.
#' \item If output is specified as "alignment",
#' VSEARCH pairwise global alignment was carried out.
#' }
#'
#' Note that clustering produces one "H" record
#' for each sequence, and one "C" record for
#' each cluster, while an alignment produces an
#' "H" record for each alignment (see details).
#'
#' @return
#' A dataframe containing the converted .uc file. The fields
#' contained within are as follows:
#' \itemize{
#' \item Record type - "H, C or N", see details for
#' further information.
#' \item Cluster designation
#' (\code{output = "cluster"} only)
#' \item Sequence length, or cluster size
#' \item Percent identity to target
#' \item The nucleotide strand
#' (\code{output = "cluster"} only)
#' \item A compressed alignment - see details for
#' further information.
#' \item ID of query sequence
#' \item ID of target sequence ("H" records only)
#' }
#'
#' @details
#' USEARCH cluster format is a tab separated text file that
#' contains clustering and/or alignment information for a
#' set of sequences. For each sequence a record type,
#' "H, C or N", is provided providing information about the
#' type of "hit" in the dataframe. These refer to:
#'
#' \itemize{
#' \item H - Hit - for alignments, indicates an
#' identified alignment of two supplied sequences. For
#' clustering, indicates the cluster assignment for a query.
#' \item C - Cluster record - a record for each cluster generated.
#' \item N - No hit - indicates that no cluster was
#' assigned or no alignment was found with a target
#' sequence. For clustering, a query with no hits becomes
#' the centroid of a new cluster.
#' }
#'
#' Additionally, for each record a "compressed alignment"
#' is generated. This is the alignment represented in a
#' compact format including the letters "M", "D", and "I".
#' Before each letter, the number of consecutive columns
#' of the given letter type is also given. The letter types
#' are as follows:
#'
#' \itemize{
#' \item "M" - Match - Identical bases between the query
#' and target sequence
#' \item "D" - Deletion - A gap in the target sequence
#' \item "I" - Insertion - A gap in the query sequence
#' }
#'
#' An example of this would be "13M", referring to 13
#' consecutive matches between the query and target sequence.
#'
#' @seealso
#' code{\link{tirClust}}, code{\link{packAlign}},
#' code{\link{readBlast}}, code{\link{packClust}}
#'
#' @references
#' VSEARCH may be downloaded from
#' \url{https://github.com/torognes/vsearch}. See
#' \url{https://www.ncbi.nlm.nih.gov/pubmed/27781170}
#' for further information.
#'
#' @examples
#' readUc(system.file(
#' "extdata",
#' "packMatches.uc",
#' package = "packFinder"
#' ))
#'
#' @author
#' Jack Gisby
#'
#' @export
readUc <- function(file, output = "cluster") {
checkPermissions(file)
if (output != "cluster" & output != "alignment") {
stop("Argument 'output' must be specified as 'cluster' or 'alignment'")
}
packClusts <- utils::read.table(file, sep = "\t")
colnames(packClusts) <- c(
"type",
"cluster",
"width",
"identity",
"strand",
"6",
"7",
"cigarAlignment",
"query",
"target"
)
cluster <- packClusts$cluster
strand <- packClusts$strand
# different parts of the uc file are required for different purposes
if (output == "cluster") {
return(subset(packClusts, select = -c(6, 7)))
} else if (output == "alignment") {
return(subset(packClusts, select = -c(cluster, strand, 6, 7)))
}
}
Add the following code to your website.
For more information on customizing the embed code, read Embedding Snippets.