The r Biocpkg("Organism.dplyr")
creates an on disk sqlite database to hold
data of an organism combined from an 'org' package (e.g.,
r Biocpkg("org.Hs.eg.db")
) and a genome coordinate functionality of the
'TxDb' package (e.g., r Biocpkg("TxDb.Hsapiens.UCSC.hg38.knownGene")
). It
aims to provide an integrated presentation of identifiers and genomic
coordinates. And a src_organism object is created to point to the database.
The src_organism object is created as an extension of src_sql and
src_sqlite from dplyr, which inherited all dplyr methods. It also
implements the select()
interface from r Biocpkg("AnnotationDbi")
and
genomic coordinates extractors from r Biocpkg("GenomicFeatures")
.
The src_organism()
constructor creates an on disk sqlite database file with
data from a given 'TxDb' package and corresponding 'org' package. When dbpath
is given, file is created at the given path, otherwise temporary file is
created.
suppressPackageStartupMessages({ library(Organism.dplyr) library(GenomicRanges) library(ggplot2) })
library(Organism.dplyr)
Running src_organism()
without a given path will save the sqlite file to a
tempdir()
:
src <- src_organism("TxDb.Hsapiens.UCSC.hg38.knownGene")
Alternatively you can provide explicit path to where the sqlite file should be saved, and re-use the data base at a later date.
path <- "path/to/my.sqlite" src <- src_organism("TxDb.Hsapiens.UCSC.hg38.knownGene", path)
supportedOrganisms()
provides a list of organisms with corresponding 'org'
and 'TxDb' packages being supported.
supportedOrganisms()
Organism name, genome and id could be specified to create sqlite database. Organism name (either Organism or common name) must be provided to create the database, if genome and/or id are not provided, most recent 'TxDb' package is used.
src <- src_ucsc("human", path)
An existing on-disk sqlite file can be accessed without recreating the database. A version of the database created with TxDb.Hsapiens.UCSC.hg38.knownGene, with just 50 Entrez gene identifiers, is distributed with the Organism.dplyr package
src <- src_organism(dbpath = hg38light()) src
All methods from package dplyr can be used for a src_organism object.
Look at all available tables.
src_tbls(src)
Look at data from one specific table.
tbl(src, "id")
Look at fields of one table.
colnames(tbl(src, "id"))
Below are some examples of querying tables using dplyr.
SNORD%
is from
SQL, with %
representing a wild-card match to any string)tbl(src, "id") %>% filter(symbol %like% "SNORD%") %>% dplyr::select(entrez, map, ensembl, symbol) %>% distinct() %>% arrange(symbol) %>% collect()
inner_join(tbl(src, "id"), tbl(src, "id_go")) %>% filter(symbol == "ADA") %>% dplyr::select(entrez, ensembl, symbol, go, evidence, ontology)
txcount <- inner_join(tbl(src, "id"), tbl(src, "ranges_tx")) %>% dplyr::select(symbol, tx_id) %>% group_by(symbol) %>% summarize(count = n()) %>% dplyr::select(symbol, count) %>% arrange(desc(count)) %>% collect(n=Inf) txcount library(ggplot2) ggplot(txcount, aes(x = symbol, y = count)) + geom_bar(stat="identity") + theme(axis.text.x = element_text(angle = 45, hjust = 1)) + ggtitle("Transcript count") + labs(x = "Symbol") + labs(y = "Count")
inner_join(tbl(src, "id"), tbl(src, "ranges_gene")) %>% filter(symbol %in% c("ADA", "NAT2")) %>% dplyr::select(gene_chrom, gene_start, gene_end, gene_strand, symbol, map) %>% collect() %>% GenomicRanges::GRanges()
Methods select()
, keytypes()
, keys()
, columns()
and mapIds
from
r Biocpkg("AnnotationDbi")
are implemented for src_organism objects.
Use keytypes()
to discover which keytypes can be passed to keytype argument
of methods select()
or keys()
.
keytypes(src)
Use columns()
to discover which kinds of data can be returned for the
src_organism object.
columns(src)
keys()
returns keys for the src_organism object. By default it returns the
primary keys for the database, and returns the keys from that keytype when the
keytype argument is used.
Keys of entrez
head(keys(src))
Keys of symbol
head(keys(src, "symbol"))
select()
retrieves the data as a tibble based on parameters for selected
keys columns and keytype arguments. If requested columns that have multiple
matches for the keys, select_tbl()
will return a tibble with one row for
each possible match, and select()
will return a data frame.
keytype <- "symbol" keys <- c("ADA", "NAT2") columns <- c("entrez", "tx_id", "tx_name","exon_id") select_tbl(src, keys, columns, keytype)
mapIds()
gets the mapped ids (column) for a set of keys that are of a
particular keytype. Usually returned as a named character vector.
mapIds(src, keys, column = "tx_name", keytype) mapIds(src, keys, column = "tx_name", keytype, multiVals="CharacterList")
Eleven genomic coordinates extractor methods are available in this
package: transcripts()
, exons()
, cds()
, genes()
,
promoters()
, transcriptsBy()
, exonsBy()
, cdsBy()
,
intronsByTranscript()
, fiveUTRsByTranscript()
,
threeUTRsByTranscript()
. Data can be returned in two versions, for
instance tibble (transcripts_tbl()
) and GRanges or GRangesList
(transcripts()
).
Filters can be applied to all extractor functions. The output can be resctricted
by an AnnotationFilter
, an AnnotationFilterList
, or an expression that can
be tranlated into an AnnotationFilterList
. Valid filters can be retrieved by
supportedFilters(src)
.
supportedFilters(src)
All filters take two parameters: value and condition, condition could be one of "==", "!=", "startsWith", "endsWith", ">", "<", ">=" and "<=", default condition is "==".
EnsemblFilter("ENSG00000196839") SymbolFilter("SNORD", "startsWith")
The following illustrates several ways of inputting filters to an extractor function.
smbl <- SymbolFilter("SNORD", "startsWith") transcripts_tbl(src, filter=smbl) filter <- AnnotationFilterList(smbl) transcripts_tbl(src, filter=filter) transcripts_tbl(src, filter=~smbl)
A GRangesFilter()
can also be used as a filter for the methods with
result displaying as GRanges or GRangesList.
gr <- GRangesFilter(GenomicRanges::GRanges("chr15:25062333-25065121")) transcripts(src, filter=~smbl & gr)
Filters in extractor functions support &
, |
, !
(negation), and
()
(grouping). Transcript coordinates of gene symbol equal to "ADA" and
transcript start position less than 44619810.
transcripts_tbl(src, filter = AnnotationFilterList( SymbolFilter("ADA"), TxStartFilter(44619810,"<"), logicOp="&") ) ## Equivalent to transcripts_tbl(src, filter = ~symbol == "ADA" & tx_start < 44619810)
Transcripts coordinates of gene symbol equal to "ADA" or transcript end position equal to 243843236.
txend <- TxEndFilter(243843236, '==') transcripts_tbl(src, filter = ~symbol == "ADA" | txend)
Using negation to find transcript coordinates of gene symbol equal to "ADA" and transcript start positions NOT less than 44619810.
transcripts_tbl(src, filter = ~symbol == "ADA" & !tx_start < 44618910)
Using grouping to find transcript coordinates of a long filter statement.
transcripts_tbl(src, filter = ~(symbol == 'ADA' & !(tx_start >= 44619810 | tx_end < 44651742)) | (smbl & !tx_end > 25056954) )
sessionInfo()
Any scripts or data that you put into this service are public.
Add the following code to your website.
For more information on customizing the embed code, read Embedding Snippets.