import_bam: Import bam files

View source: R/data_import.R

import_bamR Documentation

Import bam files

Description

Import single-end or paired-end bam files as GRanges objects, with various processing options. It is highly recommend to index the BAM file first.

Usage

import_bam(
  file,
  mapq = 20L,
  revcomp = FALSE,
  shift = 0L,
  trim.to = c("whole", "5p", "3p", "center"),
  ignore.strand = FALSE,
  field = "score",
  paired_end = NULL,
  yieldSize = NA,
  ncores = 1L
)

import_bam_PROseq(
  file,
  mapq = 20L,
  revcomp = TRUE,
  shift = -1L,
  trim.to = "3p",
  ignore.strand = FALSE,
  field = "score",
  paired_end = NULL,
  yieldSize = NA,
  ncores = 1L
)

import_bam_PROcap(
  file,
  mapq = 20L,
  revcomp = FALSE,
  shift = 0L,
  trim.to = "5p",
  ignore.strand = FALSE,
  field = "score",
  paired_end = NULL,
  yieldSize = NA,
  ncores = 1L
)

import_bam_ATACseq(
  file,
  mapq = 20L,
  revcomp = FALSE,
  shift = 0L,
  plus_offset = 4,
  minus_offset = -4,
  trim.to = "5p",
  ignore.strand = TRUE,
  field = "score",
  paired_end = TRUE,
  yieldSize = NA,
  ncores = 1L
)

Arguments

file

Path of a bam file, or a vector of paths.

mapq

Filter reads by a minimum MAPQ score. This is the correct way to filter multi-aligners.

revcomp

Logical indicating if aligned reads should be reverse-complemented.

shift

Either an integer giving the number of bases by which to shift the entire read upstream or downstream, or a pair of integers indicating shifts to be applied to the 5' and 3' ends of the reads, respectively. Shifting is strand-specific, with negative numbers shifting the reads upstream, and positive numbers shiftem them downstream. This option is applied after the revcomp, but before trim.to and ignore.strand options are applied.

trim.to

Option for selecting specific bases from the reads, applied after the revcomp and shift options. By default, the entire read is maintained. Other options are to take only the 5' base, only the 3' base, or the only the center base of the read.

ignore.strand

Logical indicating if the strand information should be discarded. If TRUE, strand information is discarded after revcomp, trim.to, and shift options are applied.

field

Metadata field name to use for readcounts, usually "score". If set to NULL, identical reads (or identical positions if trim.to options applied) are not combined, and the length of the output GRanges will be equal to the number of input reads.

paired_end

Logical indicating if reads should be treated as paired end reads. When set to NULL (the default), the first 100k reads are checked.

yieldSize

The number of bam file records to process simultaneously, e.g. the "chunk size". Setting a higher chunk size will use more memory, which can increase speed if there is enough memory available. If chunking is not necessary, set to NA.

ncores

Number of cores to use for importing bam files. Currently, multicore is only implemented for simultaneously importing multiple bam files. For smaller datasets or machines with higher memory, this can increase performance, but can otherwise lead to substantial performance penalties.

plus_offset

For importing ATAC-seq, the shift to apply to plus strand alignments. By default, plus strand reads are shifted 4 bp downstream.

minus_offset

For importing ATAC-seq, the shift to apply to minus strand alignments. By default, minus strand reads are shifted 4 bp upstream (in terms of the genomic coordinates).

Value

A GRanges object.

ATAC-seq data import

By default, import_bam_ATACseq will shift plus-aligned reads downstream 4 bp, minus-aligned reads upstream 4 bp, and then take the strand-specific start site of the reads before removing strand information and collapsing identical reads. These steps account for the 9bp gap between opposing fragments generated from the same Tn5 reaction, selecting the central base in the 9bp duplication.

While other sources often state that the offset should be +4 on plus strand and -5 on minus strand alignments (or alternatively +5, -4), this does not result in the two positions overlapping. I have verified that this is true based on the expected result of the Tn5 reaction and adapter ligation and sequencing steps, and also using real sequencing data, which confirms that only the +4/-4 shift results in a significant increase in the number positions that overlap. However, these arguments are left to the user if they insist on doing it differently.

Note that the order of operations performed is the same as the order of the associated arguments in the function proper, but not in the argument documentation i.e., the plus_offset and minus_offset arguments are applied after the shift argument and before the trim.to argument.

While this single-base precision analysis of ATAC-seq may be useful in some cases, for most users it is unlikely to be useful. Instead, you might use the plus_offset and minus_offset arguments correctly, but set trim.to = "whole" (and keep ignore.strand = TRUE). This will keep the entire ATAC-seq reads, which is the most common analysis approach. It is also common to use coverage data with ATAC-seq, but this eliminates read count information.

Author(s)

Mike DeBerardine

References

Hojoong Kwak, Nicholas J. Fuda, Leighton J. Core, John T. Lis (2013). Precise Maps of RNA Polymerase Reveal How Promoters Direct Initiation and Pausing. Science 339(6122): 950–953. https://doi.org/10.1126/science.1229386

Jason D. Buenrostro, Paul G. Giresi, Lisa C. Zaba, Howard Y. Chang, William J. Greenleaf (2013). Transposition of native chromatin for fast and sensitive epigenomic profiling of open chromatin, dna-binding proteins and nucleosome position. Nature Methods 10: 1213–1218. https://doi.org/10.1038/nmeth.2688

Examples

# get local address for included bam file
ps.bam <- system.file("extdata", "PROseq_dm6_chr4.bam",
                      package = "BRGenomics")

#--------------------------------------------------#
# Import entire reads
#--------------------------------------------------#

# Note that PRO-seq reads are sequenced as reverse complement
import_bam(ps.bam, revcomp = TRUE, paired_end = FALSE)

#--------------------------------------------------#
# Import entire reads, 1 range per read
#--------------------------------------------------#

import_bam(ps.bam, revcomp = TRUE, field = NULL,
           paired_end = FALSE)

#--------------------------------------------------#
# Import PRO-seq reads at basepair-resolution
#--------------------------------------------------#

# the typical manner to import PRO-seq data:
import_bam(ps.bam, revcomp = TRUE, trim.to = "3p",
           paired_end = FALSE)

#--------------------------------------------------#
# Import PRO-seq reads, removing the run-on base
#--------------------------------------------------#

# the best way to import PRO-seq data; removes the
# most 3' base, which was added in the run-on
import_bam(ps.bam, revcomp = TRUE, trim.to = "3p",
           shift = -1, paired_end = FALSE)

#--------------------------------------------------#
# Import 5' ends of PRO-seq reads
#--------------------------------------------------#

# will include bona fide TSSes as well as hydrolysis products
import_bam(ps.bam, revcomp = TRUE, trim.to = "5p",
           paired_end = FALSE)

mdeber/BRGenomics documentation built on Aug. 3, 2024, 3:43 a.m.