make_trim | R Documentation |
make_trim
uses parallel processing to generate trimmed reads.
make_trim(
input,
output,
indels = TRUE,
concat = 12,
check_mem = FALSE,
threads = 1,
chunk_size = NULL,
polyG = c(type = NULL, min = NULL, mismatch = NULL),
adapt_3_set = c(type = "hard_rm", min = 10, mismatch = 0.1),
adapt_3 = "AGATCGGAAGAGCACACGTCTGAACTCCAGTCACTA",
adapt_5_set = c(type = NULL, min = NULL, mismatch = NULL),
adapt_5 = NULL,
seq_range = c(min = NULL, max = NULL),
quality = c(threshold = 20, percent = 0.8)
)
input |
Character indicating the path to a directory containing fastq formatted sequence files either compressed (.fastq.gz) or text (.fastq). |
output |
Character indicating the path to a directory where the compressed trimmed fastq files should be saved (as basename trim.fastq.gz). |
indels |
Logical whether indels should be allowed in the mismatch between read and adaptor sequences. Note that indel trimming will only be done on fragments that have failed to be trimmed by other means. Default=TRUE. |
concat |
Integer setting the threshold for trimming concatemer-like
adaptors. Even when adaptor synthesis and ligation are strictly controlled,
concatemer-like ("di-") adaptor sequences are formed. When |
check_mem |
Logical whether a memory check should be performed. A memory check estimates the approximate memory usage given the fastq file sizes and number of threads. The check happens before entering the parallel trimming loop, and will give an immediate warning given intermediary risky memory estimates, and an error if the risk is very high. Note, this option depends on the parallel and benchmarkme packages. (default=FALSE) |
threads |
Integer stating the number of parallel jobs. Note, that
reading multiple fastq files drains memory, using up to 10Gb per fastq
file. To avoid crashing the system due to memory shortage, make sure that
each thread on the machine have at least 10 Gb of memory available, unless
your fastq files are very small or very large. Use
|
chunk_size |
Integer indicating the number of reads to be read into memory in each chunk. When chunk_size=NULL (default), the whole fastq will be processed as one. Thus, running 5 threads will keep 5 full fastq files in memory simultaneously (fast but demanding). Setting chunk_size to an integer will instead read only a portion of each fastq at a time and then append the trimmed results on disc (slow but less demanding). Try using chunk_size=1000000 reads, less will be extremely slow. Rather decrease threads if you encounter memory problems. |
polyG |
Character vector with three inputs named 'type', 'min' and 'mismatch' that controls poly G trimming. This trimming might be necessary for fastq files from two channel illumina sequencers (e.g. NextSeq and NovaSeq) where no color signals are interpreted as 'G', but where read failure sometimes also results in long stretches of 'no-signal-Gs'. Works similar to |
adapt_3_set |
Character vector with three inputs named 'type', 'min' and 'mismatch' that controls 3' trimming. As default, adapt_3_set=c(type="hard_rm", min=10, mismatch=0.1). Options for each input are:
|
adapt_3 |
Character string defining the 3' adaptor sequence. In most library preparation protocols, the 3' adaptor is the only adaptor needed trimming. |
adapt_5_set |
Currently not supported, but will be in future updates.
Same as |
adapt_5 |
Character string defining a 5' adaptor sequence. 5' adaptor sequences are only trimmed in specialized protocols or in very short paired-end sequencing. |
seq_range |
Numeric vector with two inputs named 'min' and 'max', that
controls the sequence size filter. For example, if
|
quality |
Numeric vector with two inputs named 'threshold' and 'percent'
that controls the fastq quality filter. For example, if
|
Given a file path to sequence files in fastq format (extension .fastq.gz or .fastq) this function will first trim the sequences from adaptor sequence, and then apply sequence size and phred score quality filters. On systems with low memory or very large fastq files, users may choose to handle each file in chunks on the expense of time performance.
Exports a compressed trimmed fastq file to the output folder (file name: basename+trim.fastq.gz). If any type="save" has been set, an additional fastq file (file name includes 'REMOVED') will be exported. In addition, overview statisics (progress report) will be returned as a data.frame.
https://github.com/Danis102 for updates on the current package.
Other PAC generation:
PAC_check()
,
create_PAC()
,
make_PAC()
,
make_counts()
,
make_cutadapt()
,
make_pheno()
,
merge_lanes()
############################################################
### Seqpac fastq trimming with the make_trim function
### (for more streamline options see the make_counts function)
############################################################
### Seqpac fastq trimming with the make_counts function
### using default settings for NEBNext small RNA adaptor
# Seqpac includes strongly down-sampled smallRNA fastq.
sys_path = system.file("extdata", package = "seqpac", mustWork = TRUE)
input <- list.files(path = sys_path, pattern = "fastq", all.files = FALSE,
full.names = TRUE)
# Run make_trim using NEB-next adaptor
# (Here we store the trimmed files in the input folder)
# (but you may choose where to store them using the output option)
input <- unique(dirname(input)) # mak_trim searches for fastq in input dir
output <- tempdir() # save in tempdir
list.files(input, pattern = "fastq") #before
prog_report <- make_trim(
input=input, output=output, threads=1,
adapt_3_set=c(type="hard_rm", min=10, mismatch=0.1),
adapt_3="AGATCGGAAGAGCACACGTCTGAACTCCAGTCACTA",
polyG=c(type="hard_trim", min=20, mismatch=0.1),
seq_range=c(min=14, max=70),
quality=c(threshold=20, percent=0.8))
list.files(path = output, pattern = "fastq") #after
# How did it go? Check progress report:
prog_report
## The principle:
# input <- "/some/path/to/untrimmed_fastq_folder"
# output <- "/some/path/to/output_folder_for_trimmed_fastq"
#
# prog_report <- make_trim(
# input=input, output=output,
# threads=<how_many_parallel>,
# check_mem=<should_memory_be_checked?>,
# adapt_3_set=<options_for_main_trimming,
# adapt_3=<sequence_to be trimmed,
# polyG=<Illumina_Nextseq_type_poly_G_trimming>,
# seq_range=<what_sequence_range_to_save>,
# quality=<quality_filtering_options>))
Add the following code to your website.
For more information on customizing the embed code, read Embedding Snippets.