title: "An Ultra-Fast All-in-One FASTQ preprocessor"
author: "Wei Wang periwinkle.david@gmail.com"
date: "`r format(Sys.Date(), '%m/%d/%Y')`"
package: Rfastp
output:
  BiocStyle::html_document:
    number_sections: yes
    toc: true
vignette: >
  %\VignetteIndexEntry{Rfastp}
  %\VignetteEngine{knitr::rmarkdown}
  %\VignetteEncoding{UTF-8}
  %\usepackage[utf8]{inputenc}
bibliography:
  - fastp.bib
knitr::opts_chunk$set(tidy=FALSE, cache=FALSE,
                      #dev="png",
                      message=FALSE, error=FALSE, warning=TRUE)
options(width=100)
The Rfastp package provides an R interface to fastp, an all-in-one preprocessing toolkit for FASTQ files [@10.1093/bioinformatics/bty560].
Use the BiocManager package to download and install the package from Bioconductor as follows:
if (!requireNamespace("BiocManager", quietly = TRUE))
    install.packages("BiocManager")
BiocManager::install("Rfastp")
If required, the latest development version of the package can also be installed from GitHub.
BiocManager::install("remotes")
BiocManager::install("RockefellerUniversity/Rfastp")
Once the package is installed, load it into your R session:
library(Rfastp)
The package contains three example FASTQ files: one single-end file and a pair of paired-end files.
se_read1 <- system.file("extdata","Fox3_Std_small.fq.gz",package="Rfastp")
pe_read1 <- system.file("extdata","reads1.fastq.gz",package="Rfastp")
pe_read2 <- system.file("extdata","reads2.fastq.gz",package="Rfastp")
outputPrefix <- tempfile(tmpdir = tempdir())
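Because system.file() returns an empty string when a file cannot be found, it can be worth confirming that the example paths resolved before calling rfastp. A minimal sketch in base R follows; the helper name check_inputs is hypothetical and not part of Rfastp.

```r
# Hypothetical helper (not part of Rfastp): stop early if any input path
# is empty or does not point to an existing file.
check_inputs <- function(paths) {
  bad <- !nzchar(paths) | !file.exists(paths)
  if (any(bad)) {
    stop("Missing input file(s): ",
         paste(sprintf("'%s'", paths[bad]), collapse = ", "))
  }
  invisible(paths)
}

# Usage with the paths defined above:
# check_inputs(c(se_read1, pe_read1, pe_read2))
```

Failing at this point gives a clearer error than letting an empty path reach the preprocessing step.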
Rfastp supports multiple threads; set the number of threads with the thread parameter.
se_json_report <- rfastp(read1 = se_read1, outputFastq = paste0(outputPrefix, "_se"), thread = 4)
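Rather than hard-coding the thread count, it could be derived from the machine at hand. The sketch below uses parallel::detectCores() from base R; the helper name pick_threads, the cap of 4, and the choice to leave one core free are all assumptions made for illustration.

```r
# Hypothetical helper: choose a thread count from the available cores,
# capped at max_threads and leaving one core free.
# detectCores() can return NA on some platforms, so fall back to 1.
pick_threads <- function(max_threads = 4L) {
  cores <- parallel::detectCores()
  if (is.na(cores)) return(1L)
  as.integer(max(1L, min(max_threads, cores - 1L)))
}

n_threads <- pick_threads()
# n_threads could then be passed on, e.g. rfastp(..., thread = n_threads)
```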
pe_json_report <- rfastp(read1 = pe_read1, read2 = pe_read2, outputFastq = paste0(outputPrefix, "_pe"))
pe_merge_json_report <- rfastp(read1 = pe_read1, read2 = pe_read2, merge = TRUE, outputFastq = paste0(outputPrefix, '_unpaired'), mergeOut = paste0(outputPrefix, "_merged.fastq.gz"))
umi_json_report <- rfastp(read1 = pe_read1, read2 = pe_read2, outputFastq = paste0(outputPrefix, '_umi1'), umi = TRUE, umiLoc = "read1", umiLength = 16)
The following example adds a prefix string before the UMI sequence in the sequence name. By default, an "_" is added between the prefix string and the UMI sequence, and the UMI is inserted into the sequence name before the first space; the umiNoConnection and umiIgnoreSeqNameSpace arguments set below override these two defaults.
umi_json_report <- rfastp(read1 = pe_read1, read2 = pe_read2, outputFastq = paste0(outputPrefix, '_umi2'), umi = TRUE, umiLoc = "read1", umiLength = 16, umiPrefix = "#", umiNoConnection = TRUE, umiIgnoreSeqNameSpace = TRUE)
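To make the naming rule described above concrete, here is a toy base-R illustration of inserting a prefixed UMI tag before the first space of a read name. It is not the package's implementation: the function tag_read_name is hypothetical, the read name is made up, and the ":" used to join the original name and the tag is an assumption.

```r
# Toy illustration only; this does not call Rfastp or fastp.
# Assumption: the tag is joined to the original name with ":".
tag_read_name <- function(name, umi, prefix = "", connect = TRUE) {
  # Build the tag: prefix, optional "_" connector, then the UMI.
  tag <- if (nzchar(prefix)) {
    paste0(prefix, if (connect) "_" else "", umi)
  } else {
    umi
  }
  # Insert the tag before the first space, or append if there is none.
  if (grepl(" ", name, fixed = TRUE)) {
    sub(" ", paste0(":", tag, " "), name, fixed = TRUE)
  } else {
    paste0(name, ":", tag)
  }
}

tag_read_name("@read1 1:N:0:ATCACG", umi = "ACGT", prefix = "UMI")
tag_read_name("@read1 1:N:0:ATCACG", umi = "ACGT", prefix = "#",
              connect = FALSE)
```

Comparing the two calls shows the effect of dropping the "_" connector, which is the kind of change the umiNoConnection argument controls.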
Trim low-quality bases from the 3' end one base at a time, removing bases with quality below 5; trim low-quality bases from the 5' end using a 29 bp sliding window, removing windows with mean quality below 20; disable polyG trimming; and specify the adapter sequence for read1.
clipr_json_report <- rfastp(read1 = se_read1, outputFastq = paste0(outputPrefix, '_clipr'), disableTrimPolyG = TRUE, cutLowQualFront = TRUE, cutFrontWindowSize = 29, cutFrontMeanQual = 20, cutLowQualTail = TRUE, cutTailWindowSize = 1, cutTailMeanQual = 5, minReadLength = 29, adapterSequenceRead1 = 'GTGTCAGTCACTTCCAGCGG' )
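The sliding-window idea behind the 5' trimming settings can be sketched in a few lines of base R. This is a simplified illustration under stated assumptions, not fastp's actual algorithm: the function cut_front and the quality values are hypothetical, and the window simply advances one base at a time until its mean quality reaches the threshold.

```r
# Simplified sketch of 5' sliding-window trimming; not fastp's code.
# quals: numeric vector of per-base quality scores for one read.
cut_front <- function(quals, window = 4L, min_mean = 20) {
  n <- length(quals)
  start <- 1L
  # Slide from the 5' end; stop at the first window that passes.
  while (start + window - 1L <= n &&
         mean(quals[start:(start + window - 1L)]) < min_mean) {
    start <- start + 1L
  }
  # No full window passed the threshold: the whole read is dropped.
  if (start + window - 1L > n) return(quals[0])
  quals[start:n]
}

quals <- c(2, 5, 8, 30, 32, 35, 36, 34)  # made-up scores
cut_front(quals, window = 3L, min_mean = 20)
```

With a window size of 1, as used for the 3' end in the call above, the same rule reduces to a base-by-base quality check.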
rfastp can accept multiple input files; it concatenates them into one file and then runs fastp.
pe001_read1 <- system.file("extdata","splited_001_R1.fastq.gz", package="Rfastp")
pe002_read1 <- system.file("extdata","splited_002_R1.fastq.gz", package="Rfastp")
pe003_read1 <- system.file("extdata","splited_003_R1.fastq.gz", package="Rfastp")
pe004_read1 <- system.file("extdata","splited_004_R1.fastq.gz", package="Rfastp")
inputfiles <- c(pe001_read1, pe002_read1, pe003_read1, pe004_read1)
cat_rjson_report <- rfastp(read1 = inputfiles, outputFastq = paste0(outputPrefix, "_merged1"))
pe001_read2 <- system.file("extdata","splited_001_R2.fastq.gz", package="Rfastp")
pe002_read2 <- system.file("extdata","splited_002_R2.fastq.gz", package="Rfastp")
pe003_read2 <- system.file("extdata","splited_003_R2.fastq.gz", package="Rfastp")
pe004_read2 <- system.file("extdata","splited_004_R2.fastq.gz", package="Rfastp")
inputR2files <- c(pe001_read2, pe002_read2, pe003_read2, pe004_read2)
catfastq(output = paste0(outputPrefix,"_merged2_R2.fastq.gz"), inputFiles = inputR2files)
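When several files per mate are concatenated, the R1 and R2 inputs presumably need to be listed in the same order so that mates stay paired. A small base-R check along those lines is sketched below, assuming that paired file names differ only by _R1 and _R2 as in the example data; same_order is a hypothetical helper, not part of Rfastp.

```r
# Hypothetical helper: TRUE if the two file lists have the same length
# and match one-to-one once the _R1/_R2 mate label is removed.
same_order <- function(r1, r2) {
  length(r1) == length(r2) &&
    identical(sub("_R1", "", basename(r1), fixed = TRUE),
              sub("_R2", "", basename(r2), fixed = TRUE))
}

# File names modeled on the example data above.
r1 <- c("splited_001_R1.fastq.gz", "splited_002_R1.fastq.gz")
r2 <- c("splited_001_R2.fastq.gz", "splited_002_R2.fastq.gz")
same_order(r1, r2)
same_order(r1, rev(r2))
```

With the vectors from the example, the equivalent call would be same_order(inputfiles, inputR2files).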
dfsummary <- qcSummary(pe_json_report)
p1 <- curvePlot(se_json_report)
p1
p2 <- curvePlot(se_json_report, curve="content_curves")
p2
dfTrim <- trimSummary(pe_json_report)
usage of rfastp:
?rfastp
usage of catfastq:
?catfastq
usage of qcSummary:
?qcSummary
usage of trimSummary:
?trimSummary
usage of curvePlot:
?curvePlot
Thank you to Ji-Dung Luo for testing/vignette review/critical feedback, Doug Barrows for critical feedback/vignette review and Ziwei Liang for their support.
sessionInfo()