plyranges provides a consistent interface for importing and wrangling genomics data from a variety of sources. The package defines a grammar of genomic data transformation based on dplyr and the Bioconductor packages IRanges, GenomicRanges, and rtracklayer. It does this by providing a set of verbs for developing analysis pipelines based on Ranges objects that represent genomic regions:
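- Modify genomic regions with the mutate() and stretch() functions.
- Modify genomic regions while fixing the start, end, or center coordinates with the anchor_ family of functions.
- Sort genomic ranges with arrange().
- Modify, subset, and aggregate genomic data with the mutate(), filter(), and summarise() functions.
- Perform any of the above operations on partitions of the data with group_by().
- Find nearest neighbour genomic regions with the join_nearest_ family of functions.
- Find overlaps between ranges with the join_overlap_ family of functions.
- Merge all overlapping and adjacent genomic regions with reduce_ranges().
- Merge the end points of all genomic regions with disjoin_ranges().
- Import and write common genomic data formats with the read_/write_ family of functions.

For more details on the features of plyranges, read the vignette. For a complete case study on using plyranges to combine ATAC-seq and RNA-seq results, read the fluentGenomics workflow.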
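As a rough illustration of how the verbs listed above chain together, here is a minimal sketch on a toy GRanges object. The coordinates and the score column are invented for the example and are not part of the package.

```r
library(plyranges)

# Toy ranges; coordinates and the 'score' column are made up for illustration
gr <- as_granges(data.frame(
  seqnames = c("chr1", "chr1", "chr1", "chr2"),
  start = c(1L, 5L, 20L, 3L),
  width = c(10L, 10L, 5L, 4L),
  strand = "+",
  score = c(0.1, 0.8, 0.5, 0.9)
))

# Subset on a metadata column, add a new column, then sort by start
kept <- gr %>%
  filter(score > 0.2) %>%
  mutate(high = score > 0.6) %>%
  arrange(start)

# Merge overlapping and adjacent ranges
merged <- reduce_ranges(gr)

# Fix the start coordinate, then set every range to width 20
widened <- gr %>%
  anchor_start() %>%
  mutate(width = 20L)
```

Each step returns a Ranges object, so the result of one verb can be piped straight into the next.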
plyranges is part of the tidyomics project, providing a dplyr-based interface for many types of genomics datasets represented in Bioconductor.
plyranges can be installed from the latest Bioconductor release:
    # install.packages("BiocManager")
    BiocManager::install("plyranges")
To install the development version from GitHub:
    BiocManager::install("tidyomics/plyranges")
Ranges
Ranges objects can either represent sets of integers as IRanges (which have start, end and width attributes) or represent genomic intervals (which have two additional attributes: sequence name and strand) as GRanges. In addition, both types of Ranges can store information about their intervals as metadata columns (for example GC content over a genomic interval).
Ranges objects follow the tidy data principle: each row of a Ranges object corresponds to an interval, while each column will represent a variable about that interval, and generally each object will represent a single unit of observation (like gene annotations).
We can construct an IRanges object from a data.frame with start and width columns (or, alternatively, start and end columns) using the as_iranges() method.
    library(plyranges)
    df <- data.frame(start = 1:5, width = 5)
    as_iranges(df)
    # alternatively with end
    df <- data.frame(start = 1:5, end = 5:9)
    as_iranges(df)
We can also construct a GRanges object in a similar manner. Note that a GRanges object requires at least a seqnames column to be present in the data.frame (but not necessarily a strand column).
    df <- data.frame(seqnames = c("chr1", "chr2", "chr2", "chr1", "chr2"),
                     start = 1:5,
                     width = 5)
    as_granges(df)
    # strand can be specified with `+`, `*` (missing) and `-`
    df$strand <- c("+", "+", "-", "-", "*")
    as_granges(df)
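Any extra columns in the data.frame are carried along as metadata columns, which is what makes the tidy layout described earlier work in practice. The sketch below assumes a made-up gc column standing in for GC content; the values are invented.

```r
library(plyranges)

# Hypothetical intervals with an invented GC content column
df <- data.frame(
  seqnames = "chr1",
  start = c(100L, 200L, 300L),
  width = 50L,
  gc = c(0.41, 0.55, 0.38)
)
gr <- as_granges(df)

# 'gc' is kept as a metadata column: one row per interval, one column per variable
gr

# Metadata columns can be referred to by name inside verbs
gc_rich <- filter(gr, gc > 0.4)

# Converting back gives an ordinary data.frame with one row per interval
back <- as.data.frame(gr)
```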
Let's look at a more realistic example (taken from the HelloRanges vignette).
    dir <- system.file(package = "HelloRangesData", "extdata/")
    genome <- as_granges(
      read.delim(file.path(dir, "hg19.genome"), header = FALSE),
      seqnames = V1, start = 1L, width = V2
    )
    gwas <- read_bed(file.path(dir, "gwas.bed"), genome_info = genome)
    exons <- read_bed(file.path(dir, "exons.bed"), genome_info = genome)
Suppose we have two GRanges objects: one containing coordinates of known exons and another containing SNPs from a GWAS.
The first and last 5 exons are printed below; there are two additional columns corresponding to the exon name and a score.
We could check the number of exons per chromosome using group_by and summarise.
    exons
    exons %>%
      group_by(seqnames) %>%
      summarise(n = n())
Next we create a column representing the transcript_id with mutate:
    exons <- exons %>%
      mutate(tx_id = sub("_exon.*", "", name))
To find all GWAS SNPs that overlap exons, we use join_overlap_inner. This will create a new GRanges with the coordinates of SNPs that overlap exons, as well as metadata from both objects.
    olap <- join_overlap_inner(gwas, exons)
    olap
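To see what the join does without the HelloRangesData files, here is a small self-contained sketch. The objects snps and features and all their coordinates and names are invented stand-ins for the GWAS and exon tracks.

```r
library(plyranges)

# Toy stand-ins for the SNP and exon tracks; coordinates and names are invented
snps <- as_granges(data.frame(
  seqnames = "chr1",
  start = c(5L, 50L, 120L),
  width = 1L,
  name = c("rs_a", "rs_b", "rs_c")
))
features <- as_granges(data.frame(
  seqnames = "chr1",
  start = c(1L, 100L),
  width = c(20L, 30L),
  name = c("exon_1", "exon_2")
))

# Keep only SNPs that overlap a feature; the ranges come from the first argument
hits <- join_overlap_inner(snps, features)
hits
```

Both inputs have a name column, so the joined object disambiguates them with the default .x and .y suffixes, which is where a column such as name.x comes from.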
For each SNP we can count the number of times it overlaps a transcript.
    olap %>%
      group_by(name.x, tx_id) %>%
      summarise(n = n())
We can also generate 2bp splice sites on either side of the exon using flank_left and flank_right. We add a column indicating the side of flanking for illustrative purposes. The interweave function pairs the left and right ranges objects.
    left_ss <- flank_left(exons, 2L)
    right_ss <- flank_right(exons, 2L)
    all_ss <- interweave(left_ss, right_ss, .id = "side")
    all_ss
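The effect of flanking is easier to check on a single range. The sketch below uses one invented interval covering positions 10 to 19 on a hypothetical chr1.

```r
library(plyranges)

# One made-up interval covering positions 10 to 19
gr <- as_granges(data.frame(seqnames = "chr1", start = 10L, width = 10L))

# 2bp immediately to the left of the start and to the right of the end
left <- flank_left(gr, 2L)
right <- flank_right(gr, 2L)

# Pair the two results, recording their origin in a 'side' column
both <- interweave(left, right, .id = "side")
both
```

Here the left flank should cover positions 8 to 9 and the right flank positions 20 to 21, regardless of strand.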
The fluentGenomics workflow package shows you how to combine differentially expressed genes and differentially accessible chromatin peaks using plyranges. It extends the case study by Michael Love for using plyranges with tximeta.
The extended vignette in the plyrangesWorkshops package has a detailed walkthrough of using plyranges for coverage analysis.
The Bioc 2018 Workshop book has worked examples of using plyranges to analyse publicly available genomics data.
If you found plyranges useful for your work please cite our paper:
    @ARTICLE{Lee2019,
      title = "plyranges: a grammar of genomic data transformation",
      author = "Lee, Stuart and Cook, Dianne and Lawrence, Michael",
      journal = "Genome Biol.",
      volume = 20,
      number = 1,
      pages = "4",
      month = jan,
      year = 2019,
      url = "http://dx.doi.org/10.1186/s13059-018-1597-8",
      doi = "10.1186/s13059-018-1597-8",
      pmc = "PMC6320618"
    }
We welcome contributions from the R/Bioconductor community. We ask that contributors follow the code of conduct and the guide outlined here.