sc_atac_trim_barcode: demultiplex raw single-cell ATAC-Seq fastq reads

View source: R/sc_atac_trim_barcode.R

sc_atac_trim_barcode    R Documentation

demultiplex raw single-cell ATAC-Seq fastq reads


Single-cell data need to be demultiplexed so that each read retains the cell barcode it belongs to. This function reformats FASTQ files so that the barcode(s) (and, if available, the UMI sequences) are moved from the sequence into the read name. Since scATAC-Seq data are mostly paired-end, both 'r1' and 'r2' are demultiplexed by this function.
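To make the idea of moving a barcode into the read name concrete, here is a minimal base R sketch for a single FASTQ record. The helper `tag_read_name` and the `@<barcode>#<original name>` layout are assumptions made for illustration; they are not taken from the scPipe source and the real output format may differ.

```r
# Hypothetical helper: prepend a cell barcode to a FASTQ read name.
# The "@<barcode>#<original name>" layout is an illustrative assumption.
tag_read_name <- function(read_name, barcode) {
  stopifnot(startsWith(read_name, "@"))
  paste0("@", barcode, "#", substring(read_name, 2))
}

# One FASTQ record (name, sequence, separator, qualities) and the
# barcode sequence that was read for the same fragment.
read    <- c("@read1 1:N:0", "ACGTACGTAC", "+", "IIIIIIIIII")
barcode <- "TTGACCAA"

# Only the name line changes; sequence and qualities stay as they are.
tagged    <- read
tagged[1] <- tag_read_name(read[1], barcode)
print(tagged[1])
```

Carrying the barcode in the read name means that downstream steps can recover the cell of origin from the name alone.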


sc_atac_trim_barcode(
  r1,
  r2,
  bc_file = NULL,
  valid_barcode_file = "",
  output_folder = "",
  umi_start = 0,
  umi_length = 0,
  umi_in = "both",
  rmN = FALSE,
  rmlow = FALSE,
  min_qual = 20,
  num_below_min = 2,
  id1_st = -0,
  id1_len = 16,
  id2_st = 0,
  id2_len = 16,
  no_reverse_complement = FALSE
)



r1: read one for paired-end reads.


r2: read two for paired-end reads, NULL if single-read.


bc_file: the barcode information, either in FASTQ format (e.g. from 10x-ATAC) or as a .csv file (in which case the barcode is expected in the second column). For the FASTQ approach, this can currently be a list of barcode files.


valid_barcode_file: optional path to a file of valid (expected) barcode sequences to be found in the bc_file (.txt, can be .txt.gz). It must contain one barcode per line, in the second column, with columns separated by commas (default = ""). If given, each barcode from bc_file is matched to the best-fitting valid barcode (allowing a Hamming distance of 1). If a FASTQ bc_file is provided, barcodes with higher quality, as given by the FASTQ read quality scores, are prioritised.


output_folder: the output directory for the demultiplexed FASTQ files, in which the barcode and UMI have been moved into the read name. Files ending in .gz will be automatically compressed.


umi_start: if available, the start position of the molecular identifier.


umi_length: if available, the length of the molecular identifier.




rmN: logical, whether to remove reads that contain N in the UMI or cell barcode.


rmlow: logical, whether to remove reads that have low-quality barcode sequences.


min_qual: the minimum base quality score that is allowed (default = 20).


num_below_min: the maximum number of bases allowed below the quality threshold (default = 2).


id1_st: barcode start position (0-indexed) in read 1; an extra parameter needed when the bc_file is in .csv format.


id1_len: barcode length in read 1; an extra parameter needed when the bc_file is in .csv format.


id2_st: barcode start position (0-indexed) in read 2; an extra parameter needed when the bc_file is in .csv format.


id2_len: barcode length in read 2; an extra parameter needed when the bc_file is in .csv format.


no_reverse_complement: specifies whether the reverse complement of the barcode sequence may be used for barcode error correction (only applies when barcode sequences are provided as FASTQ files). FALSE (default) lets the function decide whether to use the reverse complement; TRUE forces the function to use the forward barcode sequences.
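The matching against valid_barcode_file with a Hamming distance of 1 can be pictured with a small base R sketch. The helpers `hamming` and `correct_barcode` are hypothetical names, and returning NA for ties is an assumption made here for simplicity; the package itself is described as using FASTQ quality scores to prioritise barcodes, which this sketch does not attempt.

```r
# Hamming distance between two equal-length strings.
hamming <- function(a, b) {
  stopifnot(nchar(a) == nchar(b))
  sum(strsplit(a, "")[[1]] != strsplit(b, "")[[1]])
}

# Hypothetical helper: return the valid barcode within `max_dist`
# mismatches of `observed`, or NA when nothing is close enough or the
# closest match is ambiguous (tie handling is an assumption).
correct_barcode <- function(observed, valid, max_dist = 1) {
  same_len <- valid[nchar(valid) == nchar(observed)]
  if (length(same_len) == 0) return(NA_character_)
  d    <- vapply(same_len, hamming, integer(1), b = observed)
  best <- which(d == min(d))
  if (min(d) > max_dist || length(best) > 1) return(NA_character_)
  unname(same_len[best])
}

valid <- c("AAAA", "CCCC", "GGTT")
print(correct_barcode("AAAT", valid))  # one mismatch from a valid barcode
print(correct_barcode("ACAC", valid))  # two mismatches from the nearest ones
```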

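The interplay of rmlow, min_qual and num_below_min can be sketched in base R as follows. This assumes Phred+33 quality encoding and assumes that a barcode is dropped once more than num_below_min bases fall strictly below min_qual; the exact comparison used by the package is not stated above, and `phred_scores` and `keep_barcode` are hypothetical helper names.

```r
# Decode a FASTQ quality string into Phred scores (assumes Phred+33).
phred_scores <- function(qual) utf8ToInt(qual) - 33L

# Hypothetical helper: TRUE when a barcode passes the quality filter,
# i.e. at most `num_below_min` bases score below `min_qual`.
keep_barcode <- function(qual, min_qual = 20, num_below_min = 2) {
  sum(phred_scores(qual) < min_qual) <= num_below_min
}

print(keep_barcode("IIIIIIII"))  # every base is Q40
print(keep_barcode("II##IIII"))  # two bases at Q2, still within the limit
print(keep_barcode("I###IIII"))  # three bases at Q2, over the limit
```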

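For no_reverse_complement, it may help to see what a reverse complement is and how an orientation could be chosen automatically. The sketch below is an illustration only: `reverse_complement` and `choose_orientation` are hypothetical helpers, and counting exact matches against the valid barcodes is an assumed heuristic, not a description of how scPipe decides.

```r
# Reverse complement of a DNA string (N is left as N).
reverse_complement <- function(seq) {
  comp <- chartr("ACGTN", "TGCAN", seq)
  paste(rev(strsplit(comp, "")[[1]]), collapse = "")
}

# Hypothetical heuristic: use whichever orientation of the observed
# barcodes has more exact matches in the valid barcode list.
choose_orientation <- function(observed, valid) {
  fwd <- sum(observed %in% valid)
  rc  <- sum(vapply(observed, reverse_complement, character(1)) %in% valid)
  if (rc > fwd) "reverse_complement" else "forward"
}

valid    <- c("AACG", "CCCC")
observed <- c("CGTT", "GGGG")  # reverse complements of the valid barcodes
print(reverse_complement("AACG"))
print(choose_orientation(observed, valid))
```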
None (invisible 'NULL')


library(scPipe)

data.folder <- system.file("extdata", package = "scPipe", mustWork = TRUE)
r1 <- file.path(data.folder, "small_chr21_R1.fastq.gz")
r2 <- file.path(data.folder, "small_chr21_R3.fastq.gz")

# Using a barcode fastq file:
barcode_fastq <- file.path(data.folder, "small_chr21_R2.fastq.gz")

sc_atac_trim_barcode(
  r1            = r1,
  r2            = r2,
  bc_file       = barcode_fastq,
  rmN           = TRUE,
  rmlow         = TRUE,
  output_folder = tempdir()
)

# Using a barcode csv file:
barcode_1000 <- file.path(data.folder, "chr21_modified_barcode_1000.csv")

## Not run: 
sc_atac_trim_barcode(
  r1            = r1,
  r2            = r2,
  bc_file       = barcode_1000,
  id1_st        = 0,
  rmN           = TRUE,
  rmlow         = TRUE,
  output_folder = tempdir()
)

## End(Not run)
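Because id1_st is 0-indexed while R's own string functions are 1-indexed, the following base R sketch shows how a start position and length such as id1_st = 0 and id1_len = 16 select a barcode. `split_barcode` is a hypothetical helper; keeping any bases upstream of the barcode in the remaining sequence is an assumption, and the matching quality string is ignored for brevity.

```r
# Hypothetical helper: cut a barcode out of a read sequence.
# `start` is 0-indexed (like id1_st); substr() is 1-indexed, so the
# barcode spans positions start + 1 to start + len.
split_barcode <- function(seq, start = 0, len = 16) {
  first <- start + 1
  last  <- start + len
  list(
    barcode = substr(seq, first, last),
    rest    = paste0(substr(seq, 1, start),
                     substr(seq, last + 1, nchar(seq)))
  )
}

# A 16-base barcode followed by 8 bases of genomic sequence.
seq <- "AAAACCCCGGGGTTTTACGTACGT"
res <- split_barcode(seq, start = 0, len = 16)
print(res$barcode)
print(res$rest)
```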

LuyiTian/scPipe documentation built on Dec. 11, 2023, 8:21 p.m.