require(gread) knitr::opts_chunk$set( comment = "#", error = FALSE, tidy = FALSE, cache = FALSE, collapse=TRUE) # options(datatable.auto.index=FALSE)
We've developed three packages for performing differential analysis of NGS
data, namely gread
, gcount
and ganalyse
. In short,
gread enables loading or reading in the required data quickly from many different formats in which NGS data and gene annotations are available.
gcount counts the reads depending on user configuration on raw counts.
ganalyse then allows to perform differential gene expression analysis
using many methods such as limma, voomlimma (for FPKM
), edger on the
read counts.
In this vignette, we'll discuss the gread
package.
gread
is an R-package that allows to read and process gene annotations and
NGS data. Currently supported formats include:
gff
,gtf
(also referred to as gff v2.5
at times),bed
format, andbam
formatYou can use the function supported_formats()
to check all available formats.
supported_formats()
There are four functions, one for each type of format.
read_gtf
(gtf = read_gtf("sample.gtf")) class(gtf)
read_gff
(gff = read_gff("sample.gff")) class(gff)
read_bed
(bed = read_bed("sample.bed")) ## bed9, bed12 and bed15 formats class(bed)
read_bam
(bam = read_bam("sample.bam")) class(bam)
All functions have a set of common arguments, and might contain additional
arguments specific to that particular format. See ?<function_name>
for more
details.
You can filter invalid rows by using the filter
argument. Default value
is FALSE
, i.e., load all rows.
There are some additional arguments for read_bam
function to load extra
tags. If you are not familiar with it, it's better to leave this argument
untouched.
All these functions are objects of class gtf
, gff
bed
and bam
respectively. They inherit from data.table
.
read_format
While these functions are quite easy and straightforward to use, there's
another function read_format()
that makes things even simpler and is
sufficient in most cases.
# gtf file (gtf = read_format("sample.gtf")) # gff file (gff = read_format("sample.gff")) # bam file (bam = read_format("sample.bam")) # bed file (bed = read_format("sample.bed"))
The gread()
package provides an useful function extract()
to extract
useful features from gtf
and gff
objects. This vignette provides a short
description of the function. See ?extract
for more.
In a typical RNA-seq
experiment, we would want to extract the genes
start-end
coordinates and also just the exons. This is because, the
default methodology is to extract the reads that overlap exons and count
the total number of reads that overlap only those exonic regions for that
gene. Extracting genes and exons into separate data.tables makes
things easier to obtain counts then using the gcount
package.
gtf = read_format("sample.gtf") ## Extract a unified gene model by combining all overlapping exons (uni_genes = extract(gtf, feature="gene_exon", type="union", transcript_id="transcript_id", gene_id="gene_id")) ## extract genes as such (genes = extract(gtf, feature="gene", type="default"))
Sometimes the gene annotation formats do not have intronic coordintes
annotated but might be essential. In that case, we can use the
construct_introns
function on a gtf
or gff
object a follows:
gtf = read_format("sample.gtf") (gtf = construct_introns(gtf, update=TRUE)) (introns = construct_introns(gtf, update=FALSE)) class(gtf) class(introns)
By default, the argument update
is equal to TRUE
, which returns an
updated gtf
object with the intron coordinates added to gtf
object.
The class of the input object is retained in result.
If you require just the introns, set update = FALSE
.
Add the following code to your website.
For more information on customizing the embed code, read Embedding Snippets.