knitr::opts_chunk$set(fig.retina = 1)
CTDquerier
R packageCTDquerier
is an R package that allows R users to download essential data from CTDbase about genes, chemicals and diseases. First, CTDquerier
validates the user's input in CTDbase vocabulary files. Second, the validated results are used to query CTDbase and their data are downloaded into the R session.
The package can be installed using:
if ( !requireNamespace( "BiocManager", quietly = TRUE ) ) { install.packages( "BiocManager" ) } BiocManager::install( "CTDquerier" )
The case-study is based on the results obtained in the article entitled "Case-control admixture mapping in Latino populations enriches for known asthma-associated genes", from Torgerson et al. [@Torgerson2012]. The genes from Table E1 of that paper have been collected and used as starting point for this document.
This case study aims to illustrate how CTDquerier
can be used to provide further evidence about a given hypothesis or how to provide biological insights from genetic, toxicological or environmental epidemiological studies. In our case, we are using a real example where the authors are interested in providing a set of genes associated with asthma.
By using CTDquerier
we will provide information about, among others, what is the level of curated evidence, from CTDbase, of the association between those genes and asthma or which chemicals are related to this list of genes and asthma according to CTDbase.
Asthma is a common long term inflammatory disease of the airways of the lungs. It is characterized by variable and recurring symptoms, reversible airflow obstruction, and bronchospasm. Symptoms include episodes of wheezing, coughing, chest tightness, and shortness of breath. Asthma is thought to be caused by a combination of genetic and environmental factors. Environmental factors include exposure to air pollution and allergens. Diagnosis is usually based on the pattern of symptoms, response to therapy over time, and spirometry. Asthma is classified according to the frequency of symptoms, forced expiratory volume in one second, and peak expiratory flow rate. It may also be classified as atopic or non-atopic where atopy refers to a predisposition toward developing a type 1 hypersensitivity reaction.
The Genetics of Asthma in Latino Americans (GALA) study [@Price2009] includes subjects with asthma and their biological parents recruited from schools, clinics, and hospitals at four sites: San Francisco Bay Area, New York City, Puerto Rico, and Mexico City. Patients were contacted to participate in the study if approved by their primary physician. On the basis of interviews and questionnaire data, subjects were included in the study if they were between the ages of 8 and 40 years with physician-diagnosed mild to moderate-to-severe asthma and had experienced 2 or more symptoms in the previous 2 years at the time of recruitment (including wheezing, coughing, and/or shortness of breath) and if both parents and 4 sets of grandparents self-identified as being of either Puerto Rican or Mexican ethnicity.
After quality control [@Torgerson2012], the total number of SNPs included in the study was 729,685, and the total number of subjects was 529 children with asthma (253 Mexican and 276 Puerto Rican subjects) and 347 control subjects (158 Mexican and 189 Puerto Rican subjects).
The authors aim to identify novel asthma-related genes in Latino populations by using case-control admixture mapping. By comparing local ancestry between cases and control subjects, the authors identified 62 admixture mapping peaks (29 in Puerto Ricans and 33 in Mexicans). They also observed a significant enrichment of previously identified asthma-associated genes within the 62 admixture mapping peaks (P = .0051). Table E1 provides the information about all the genes in those candidate regions from the paper [@Torgerson2012].
The Table E1 from the GALA Study [@Torgerson2012] was downloaded and converted into a .csv
file. This table is included in the package and can be loaded into R
by:
table_e1 <- read.csv( "https://raw.githubusercontent.com/isglobal-brge/brgedata/master/inst/extdata/CTDquerier_examples/gala_table_e1.csv" , stringsAsFactors = FALSE )
The column Genes of the table has the genes of interest. Let us start by removing regions without annotated genes.
dim( table_e1 ) table_e1 <- table_e1[ table_e1$Genes != "NA ", ] dim( table_e1 )
We have filtered 17 regions that had no genes. Now we will get the gene list:
gala_genes <- trimws( unlist( strsplit( table_e1$Genes, "," ) ) ) length( gala_genes ) gala_genes[1:15]
NOTE: Each region can have several genes separated by ,
.
We end up with a total of 305 candidate genes related to asthma as proposed by Torgerson et al. [@Torgerson2012]. Let us call to this list GALA genes
.
Any enrichment analysis of a given list of genes should start by querying CTDabse. The function query_ctd_gene
is in charge to do it. First, it validates the list of genes into CTDbase gene-vocabulary. Second, it performs a different query per gene and downloads the associated information into your computer.
library( CTDquerier )
gala <- query_ctd_gene( terms = gala_genes, ask = FALSE, verbose = TRUE )
Since this process can be high time-consuming, CTDquerier
package already contains the result of the query encapsulated in an object called gala
. We load this object into R
using the function data
.
data( gala, package = "CTDquerier" ) gala
Let us illustrate the type of questions that can be addressed by using CTDquerier
when the user wants to provide biological insights from a gene list. In our case, we are interested in providing existing literature evidence (annotated in CTDbase) that our GALA genes are enriched in chemical or genetic processes related to asthma.
Recall that gala_genes
has our genes of interest (305 genes) and that the result obtained from querying CTDbase with those genes is in the object gala
(258 terms). There is a difference of 47 genes, meaning that the missing 47 genes are not present in CTDbase. We can plot these numbers using the method plot
of a CTDdata
object.
library( ggplot2 ) plot( gala ) + ggtitle( "Lost & Found Genes from GALA Study" )
The names of the lost genes from GALA study can be obtained using the method get_terms
:
get_terms( gala )[[ "lost" ]]
The call to query_ctd_gene
done in the previous section has downloaded "all the information available" in CTDbase of GALA genes. We can inspect the information available just printing our object of class CTDdata
:
gala
The line starting with #Diseases
indicates the number of relationships between a gene (or chemical) and diseases. For the GALA genes, there are a total of 223,462 relations between our genes of interest and diseases. The table with the relations between genes and diseases can be obtained using the method get_table
.
gala_all_diseases <- get_table( gala, index_name = "diseases" ) colnames( gala_all_diseases ) dim( gala_all_diseases )
We observe that the table with the relations between gene and diseases has close to ~220K rows. This table provides with all the connections that link GALA genes with diseases, including the curated and the inferred ones.
We can also check that the number of genes used to create this table are the correct ones:
length( unique( gala_all_diseases$GeneSymbol ) ) sum( get_terms( gala )[[ "found" ]] %in% unique( gala_all_diseases$GeneSymbol ) ) sum( !get_terms( gala )[[ "found" ]] %in% unique( gala_all_diseases$GeneSymbol ) )
We can also check the name of the genes that do not contribute to creating the table of associations between genes and diseases:
get_terms( gala )[[ "found" ]][ !get_terms( gala )[[ "found" ]] %in% unique( gala_all_diseases$GeneSymbol ) ]
Finally, the answer to our question of interest can be obtained by:
length( unique( gala_all_diseases$Disease.Name ) )
This is the number of unique diseases associated to the GALA genes using all the relations. If we are interested in getting only the associations between the GALA genes and the diseases, understood as the curated relations:
gala_all_diseases_cu <- gala_all_diseases[ !is.na( gala_all_diseases$Direct.Evidence ), ] gala_all_diseases_cu <- gala_all_diseases_cu[ gala_all_diseases_cu$Direct.Evidence != "", ] dim( gala_all_diseases_cu ) length( unique( gala_all_diseases_cu$Disease.Name ) )
So, in summary: There are 301 diseases linked to GALA genes, supported by 436 curated associations.
To answer this question, we first need to get "Asthma"
in the column containing disease information in the table of associations between genes and diseases that we have previously created.
gala_asthma <- gala_all_diseases[ gala_all_diseases$Disease.Name == "Asthma" , ] dim( gala_asthma )
The subsetting resulted in a table with 209 entries. The first column to inspect is the Direct.Evidence
that provides a clue of the relations that are curated and the ones that are inferred.
sum( gala_asthma$Direct.Evidence != "" & !is.na( gala_asthma$Direct.Evidence ) )
So, only two of the 214 relations are curated associations. The other 212 (92 + 120) are inferred links.
For the ones that have no direct evidence, CTDbase informs about the level of evidence of the relation between a gene and a disease with the columns Inference.Score
and Reference.Count
. The first contains the CTDbase inference-score of gene-disease association in the base of a series of criterion defined by CTDbase's developers while the second indicates the number of bibliographical references enforcing the links.
Then, the mean of the Inference Score can be interpreted as the level of evidence of the association between GALA genes and asthma according to CTDbase:
mean( gala_asthma$Inference.Score, na.rm = TRUE )
This index provides evidence about the degree of similarity between CTD chemical-gene-disease networks and a similar scale-free random network. The higher the score, the more likely the inference network has atypical connectivity. In other words, this large score indicates that the network of GALA genes with asthma is real and not build by chance (http://ctdbase.org/help/chemDiseaseDetailHelp.jsp).
We can also provide the total number of papers relating GALA genes with asthma:
sum( gala_asthma$Reference.Count, na.rm = TRUE )
We can provide the Inference Score of each gene of interest with asthma by:
plot( gala, index_name = "disease", subset.disease = "Asthma", filter.score = 20 ) + ggtitle( "Evidence of the association between GALA genes and Asthma" )
Note that here only genes having a Inference Score higher than 20 are plotted (argument filter.score
)
The object gala
has a total of r nrow(gala@chemicals_interactions)
gene-chemical interaction registers. We can obtain this table using the get_table
method:
gala_chem <- get_table( gala, index_name = "chemical interactions" ) colnames( gala_chem ) length( unique( gala_chem$Chemical.Name ) )
This output is indicating that GALA genes are associated with r length( unique( gala_chem$Chemical.Name ) )
unique chemicals according to CTDabse. The distribution of the number of associations (Reference.Count
column) can be obtained by:
knitr::kable( t( table( gala_chem$Reference.Count ) ) )
This information can also be obtained in a heat-map plot. The arguments subset.genes
and subset.chemicals
are used to filter the elements in X-axis (genes) and Y-axis (chemicals). The argument filter.score
allows filtering by a minimum number of present papers (variable Reference.Count
) providing evidence about the association between chemicals and genes.
plot( gala, index_name = "chemical interactions", filter.score = 6 )
In that case, we are interested in knowing the gene-associations and chemicals-associations to our disease (or group of disorders) of interest
The way we can obtain all the information related to a disease from CTDbase is similar to the one we got the information of genes. In that case the function to be used is query_ctd_dise
:
asthma <- query_ctd_dise( terms = "Asthma", ask = FALSE, verbose = TRUE )
As we can see from the log obtained when verbose
is set to TRUE
, the information downloaded from CTDabse by CTDquerier
for diseases is composed by disease-gene interactions, disease-chemical interactions and pathway interactions (KEGG).
asthma
To know the exact number of genes related to Asthma according to CTDbase we extract the proper table and count the unique genes that appear on it.
ctd_asthma <- get_table( asthma, index_name = "gene interactions" ) length( unique( ctd_asthma$Gene.Symbol ) )
We can also look for teh curated associations between genes and Asthma:
sum( !is.na( ctd_asthma$Direct.Evidence ) & ctd_asthma$Direct.Evidence != "" )
According to CTDbase, there are r length( unique( ctd_asthma$Gene.Symbol ) )
genes related to asthma and r sum( !is.na( ctd_asthma$Direct.Evidence ) & ctd_asthma$Direct.Evidence != "" )
of them are curated associations. However the term "Asthma"
may appear in different ways. Here we illustrate how to get the real number of genes related with Asthma.
library( knitr ) tt <- as.data.frame( table( ctd_asthma$Disease.Name ) ) if( nrow( tt ) > 0 ) { colnames( tt ) <- c( "Disease", "Frequency" ) kable( tt[ order( tt$Frequency, decreasing = TRUE ), ] ) x <- tt[ order( tt$Frequency, decreasing = TRUE ), 2][1] # For text } else { x <- NA # For text }
Hence, the true answer is r x
gene-diseases relations and not r, length( unique( ctd_asthma$Gene.Symbol ) )
.
First, let us retrieve the chemicals related to asthma from asthma
object.
ctd_asthma_chem <- get_table( asthma, index_name = "chemical interactions" ) colnames( ctd_asthma_chem ) length( unique( ctd_asthma_chem$Chemical.Name ) )
Hence, the answer is r length( unique( ctd_asthma_chem$Chemical.Name ) )
. Now lets see how many of these relations are curated associations:
sum( !is.na( ctd_asthma_chem$Direct.Evidence ) & ctd_asthma_chem$Direct.Evidence != "" )
We can plot the Inference.Score
or the Reference.Count
(for both curated and inferred relations). Let us assume that we are interested in providing information about the chemicals associated to GALA genes having a score larger than 30.
plot( asthma, index_name = "chemical interactions", subset.disease = "Asthma", filter.score = 30 ) #+ #ggtitle( "Evidence of the association between GALA genes and chemicals" )
In some occasions, the user may be interested in determining how chemicals, genes and a given disease are linked in the literature.
The table gala_chem
contains the relation between GALA genes and the chemicals associated with these genes according to CTDbase. The table ctd_asthma_chem
contains the chemicals related to Asthma according to CTDbase. Therefore, to find the chemicals that are linked to both GALA genes and asthma, according to CTDbase we get the intersection of both tables. This can be performed by
intr_chem <- intersect( gala_chem$Chemical.Name, ctd_asthma_chem$Chemical.Name ) length( intr_chem )
There are r length( intr_chem )
chemicals sharing related to GALA genes and asthma at the same time. Next code can be used to quantify the overlap.
length( intr_chem ) / nrow( gala_chem ) * 100 length( intr_chem ) / nrow( ctd_asthma_chem ) * 100
p1 <- round(length( intr_chem ) / nrow( gala_chem ) * 100, 2) p2 <- round(length( intr_chem ) / nrow( ctd_asthma_chem ) * 100, 2)
That is, the r p1
% of the total number of associations of GALA genes with chemicals are shared with the observed associations of asthma with chemicals. On the other hand, the r p2
% of the chemicals associated with asthma are shared with the substances associated with GALA genes.
On the other and we count the curated chemical associations on both sides of GALA genes with:
a <- ctd_asthma_chem$Chemical.Name[ !is.na( ctd_asthma_chem$Direct.Evidence ) & ctd_asthma_chem$Direct.Evidence != "" ] intr_chem <- intersect( gala_chem$Chemical.Name, a ) length( intr_chem )
It can be really useful to have a data.frame
with three columns: chemical names, and the number of references in the literature associated with GALA genes and asthma. We bould this table with:
gala_chem_r <- gala_chem[ gala_chem$Chemical.Name %in% intr_chem, ] gala_chem_r <- gala_chem_r[ !duplicated( gala_chem_r$Chemical.Name ), ] ctd_asthma_chem_r <- ctd_asthma_chem[ ctd_asthma_chem$Chemical.Name %in% intr_chem, ] ctd_asthma_chem_r <- ctd_asthma_chem_r[ !duplicated( ctd_asthma_chem_r$Chemical.Name ), ] if( nrow( gala_chem_r ) > 0 & nrow( ctd_asthma_chem_r ) > 0 ) { dta <- merge( gala_chem_r[ , c( "Chemical.Name", "Reference.Count" ) ], ctd_asthma_chem_r[ , c( "Chemical.Name", "Reference.Count" ) ], by = "Chemical.Name" ) colnames( dta ) <- c( "Chemical.Name", "Reference.Gala", "Reference.Asthma" ) dta <- dta[ order( dta$Reference.Gala, dta$Reference.Asthma, decreasing = TRUE ), ] dta[1:5, ] }
Having this data.frame
, we can use the function leaf_plot
to compare the number of references in the gene list and in asthma for the top 25 chemicals by:
if( nrow( gala_chem_r ) > 0 & nrow( ctd_asthma_chem_r ) > 0 ) { leaf_plot( dta[1:25, ], label = "Chemical.Name", valueLeft = "Reference.Gala", valueRight = "Reference.Asthma", titleLeft = "GALA", titleRight = "Asthma" ) }
Finally, one may be interested in providing statistical evidence about the enrichment of a gene list and the set of genes that have been described in the literature to be associated with a disease of interest or in a given chemical that is relevant into our health problem.
This can be answered by using the method enrich
. This method performs a Fisher test of enrichment given two objects of class CTDdata
at gene level. For this analysis we need to provide to method enrich
with a gene universe. CTDquerier
provides a gene universe obtained from HGNC (HUGO Gene Nomenclature Committee) at date 2018/01/10 (https://www.genenames.org/cgi-bin/download). The universe can be loaded with:
hgnc_universe <- read.delim( "https://github.com/isglobal-brge/brgedata/blob/master/inst/extdata/CTDquerier_examples/HGNC_Genes.tsv?raw=true", sep = "\t", stringsAsFactor = FALSE )
The enrich
method allows using all relations between genes and asthma or only the curated ones. By setting the argument use
to "all"
, all inferred and curated relations are used to test for enrichment:
enrich( gala, asthma, universe = hgnc_universe$Approved.Symbol, use = "all" )
The function enrich
takes the genes in the gala
object (e.g. target genes), the genes in the asthma
object (e.g. gene set/pathway) and performs a Fisher test (see fisher.test
) with the HGNC gene universes. We can observe an enrichment of our GALA genes with asthma (p < 2.2e-16).
Alternatively, using only the curated associations we see no enrichment:
enrich( gala, asthma, universe = hgnc_universe$Approved.Symbol, use = "curated" )
First of all, we need to obtain the information about air pollutants from CTDbase. The function query_ctd_chem
deals with it:
air <- query_ctd_chem( terms = "Air Pollutants", ask = FALSE, verbose = FALSE ) air
Once we obtained all the information about air pollutants from CTDbase the enrichment analysis is performed by using the method enrich
. In this case, our second set corresponds to the genes related to air pollutants.
enrich( gala, air, universe = hgnc_universe$Approved.Symbol, use = "all" )
We observe no enrichment between the genes associated with Air Pollutants and GALA genes.
sessionInfo()
Add the following code to your website.
For more information on customizing the embed code, read Embedding Snippets.