library(ssrch)
library(msigTextM)
library(DT)
MSigDb is a curated collection of gene sets, with enumerations of gene symbols or ENTREZ identifiers in sets that occupy higher level categories.
The main categories (and subcategories, indented) are
Here we tabulate the counts of gene sets by category and sub-category.
               main category
subcategory   ARCHIVED   C1   C2   C3   C4   C5   C6   C7    H
                     0  326    0    0    0    0  189 4872   50
  BP                 0    0    0    0    0 4436    0    0    0
  C5_BP            527    0    0    0    0    0    0    0    0
  C5_CC            159    0    0    0    0    0    0    0    0
  C5_MF            178    0    0    0    0    0    0    0    0
  CC                 0    0    0    0    0  580    0    0    0
  CGN                0    0    0    0  427    0    0    0    0
  CGP                0    0 3433    0    0    0    0    0    0
  CM                 0    0    0    0  431    0    0    0    0
  CP                 0    0  252    0    0    0    0    0    0
  CP:BIOCARTA        0    0  217    0    0    0    0    0    0
  CP:KEGG            0    0  186    0    0    0    0    0    0
  CP:REACTOME        0    0  674    0    0    0    0    0    0
  MF                 0    0    0    0    0  901    0    0    0
  MIR                0    0    0  221    0    0    0    0    0
  TFT                0    0    0  615    0    0    0    0    0
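A cross-tabulation of this kind can be produced with base R's table(). The following is a minimal sketch on a small made-up data frame: the column names `category` and `subcategory` and all rows are illustrative assumptions, not the metadata shipped with the package.

```r
# Toy metadata: one row per gene set (values are illustrative only).
meta <- data.frame(
  set         = c("s1", "s2", "s3", "s4", "s5"),
  category    = c("C2", "C2", "C5", "C7", "H"),
  subcategory = c("CGP", "CP:KEGG", "BP", "", ""),
  stringsAsFactors = FALSE
)

# Rows are subcategories, columns are main categories.
# Sets without a subcategory fall in the row with an empty label.
counts <- table(meta$subcategory, meta$category,
                dnn = c("subcategory", "main category"))
print(counts)
```

Collections that have no subcategory, such as C7 and H in the table above, appear in the row whose label is empty.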
The vocabularies of the gene set collections are important for many aspects of their use. The STANDARD_NAME attributes for C7 gene sets look like this:
[1] "KAECH_NAIVE_VS_DAY8_EFF_CD8_TCELL_UP"
[2] "KAECH_NAIVE_VS_DAY8_EFF_CD8_TCELL_DN"
[3] "KAECH_NAIVE_VS_DAY15_EFF_CD8_TCELL_UP"
[4] "KAECH_NAIVE_VS_DAY15_EFF_CD8_TCELL_DN"
[5] "KAECH_NAIVE_VS_MEMORY_CD8_TCELL_UP"
[6] "KAECH_NAIVE_VS_MEMORY_CD8_TCELL_DN"
[7] "KAECH_DAY8_EFF_VS_DAY15_EFF_CD8_TCELL_UP"
[8] "KAECH_DAY8_EFF_VS_DAY15_EFF_CD8_TCELL_DN"
This indicates that authorship, time (or other experimental design factors), cell type, and regulatory association may be encoded in standard gene set names. The authorship information can be retrieved systematically from the PMID attribute, so its role in the set name can probably be ignored.
Can we parse the set names, using the fact that underscores separate the key descriptive tokens? Perhaps, but examples like
GSE19888_ADENOSINE_A3R_INH_PRETREAT_AND_ACT_BY_A3R_VS_A3R_INH_AND_TCELL_MEMBRANES_ACT_MAST_CELL_UP
are not encouraging.
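The underscore-splitting idea can be tried directly in base R. The sketch below uses two of the KAECH names shown earlier. Treating the first token as the source and the last token as the direction of regulation is a heuristic assumption made here for illustration, not a documented MSigDb convention.

```r
# Standard names copied from the C7 examples shown earlier.
nms <- c(
  "KAECH_NAIVE_VS_DAY8_EFF_CD8_TCELL_UP",
  "KAECH_NAIVE_VS_MEMORY_CD8_TCELL_DN"
)

# Split each name on literal underscores.
tokens <- strsplit(nms, "_", fixed = TRUE)

# Heuristic assumption: first token = author or accession,
# last token = direction (UP or DN), everything else = free-form middle.
parsed <- data.frame(
  source    = vapply(tokens, function(x) x[1], character(1)),
  direction = vapply(tokens, function(x) x[length(x)], character(1)),
  middle    = vapply(tokens,
                     function(x) paste(x[-c(1, length(x))], collapse = " "),
                     character(1)),
  stringsAsFactors = FALSE
)
print(parsed)
```

The `middle` column is where the difficulty lies: it mixes conditions, cell types, and connectives such as VS and AND with no fixed order, which is why a name like the GSE19888 example resists this kind of parsing.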
The DESCRIPTION_BRIEF field is more human-readable. Here are the entries associated with the first four standard names listed above.
[1] "Genes up-regulated in naïve CD8 T cells compared to effector CD8 T cells at the peak expansion phase (day 8 after LCMV-Armstrong infection)."
[2] "Genes down-regulated in naïve CD8 T cells compared to effector CD8 T cells at the peak expansion phase (day 8 after LCMV-Armstrong infection)."
[3] "Genes up-regulated in naïve CD8 T cells compared to effector CD8 T cells at contraction phase (day 15 after LCMV-Armstrong infection)."
[4] "Genes down-regulated in naïve CD8 T cells compared to effector CD8 T cells at contraction phase (day 15 after LCMV-Armstrong infection)."
Some readers may find it a little challenging to extract the key distinctions among these sets: up- and down-regulation are distinguished, but so are phases (peak expansion vs. contraction), cell types, and times after LCMV-Armstrong infection.
The purpose of this package is to mobilize text mining technologies to help users interrogate gene set information in MSigDb through verbalization of relevant scientific concepts.
We have created a DocSet instance for 300 gene sets from the immunologic set collection.
immu300
This is essentially a collection of environments mapping from gene set names and descriptions to genes and back. We have a searchDocs method to query the DocSet.
dsl = searchDocs("down.*systemic lupus", immu300)
datatable(dsl)
It is possible to search for occurrence of a gene in the gene sets catalogued in the DocSet instance.
brocc = searchDocs("BRCA2", immu300)
brocc
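The idea of environments that map in both directions can be illustrated in base R. This is a toy sketch only: it does not reproduce the ssrch classes or the behaviour of searchDocs, and the set names, gene memberships, and the helper `search_keys` are all hypothetical.

```r
# Hypothetical gene sets (names and memberships are made up).
sets <- list(
  SET_A_UP = c("BRCA2", "TP53"),
  SET_B_DN = c("BRCA2", "MYC")
)

# Forward map: set name -> genes.
set2gene <- list2env(sets, envir = new.env())

# Reverse map: gene -> names of the sets that contain it.
gene2set <- new.env()
for (s in names(sets)) {
  for (g in sets[[s]]) {
    prev <- if (exists(g, envir = gene2set, inherits = FALSE)) {
      get(g, envir = gene2set)
    } else {
      character(0)
    }
    assign(g, c(prev, s), envir = gene2set)
  }
}

# Hypothetical helper: regular-expression search over the keys of a map,
# returning the matching keys with their stored values.
search_keys <- function(pattern, env) {
  hits <- grep(pattern, ls(env), value = TRUE)
  mget(hits, envir = env)
}

res <- search_keys("^BRCA2$", gene2set)
print(res)
```

Calling `search_keys("_UP$", set2gene)` would query in the other direction, from set names to genes.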
The abstracts (DESCRIPTION_FULL fields) of the sets need to be introduced. This would entail enhancement to the ssrch package.
The full C7 collection still needs to be processed to see how large the resulting DocSet is.
The other collections should be handled as well.