knitr::opts_chunk$set( collapse = TRUE, comment = "#>" )
# library(idepGolem) devtools::load_all()
# Make data for enrichment functions idep_data <- get_idep_data() DATABASE <- Sys.getenv("GE_DATABASE")[1] YOUR_DATA_PATH <- paste0(DATABASE, "data_go/BcellGSE71176_p53.csv") YOUR_EXPERIMENT_PATH <- paste0(DATABASE, "data_go/BcellGSE71176_p53_sampleInfo.csv") expression_file <- data.frame( datapath = YOUR_DATA_PATH ) experiment_file <- data.frame( datapath = YOUR_EXPERIMENT_PATH ) load_data <- input_data( expression_file = expression_file, experiment_file = experiment_file, go_button = FALSE, demo_data_file = idep_data$demo_data_file, demo_metadata_file = idep_data$demo_metadata_file ) converted <- convert_id( query = rownames(load_data$data), idep_data = idep_data, select_org = "BestMatch" ) all_gene_info <- gene_info( converted = converted, select_org = "BestMatch", idep_data = idep_data ) converted_data <- convert_data( converted = converted, no_id_conversion = FALSE, data = load_data$data ) gene_names <- get_all_gene_names( mapped_ids = converted_data$mapped_ids, all_gene_info = all_gene_info ) processed_data <- pre_process( data = converted_data$data, missing_value = "geneMedian", data_file_format = 1, low_filter_fpkm = NULL, n_min_samples_fpkm = NULL, log_transform_fpkm = NULL, log_start_fpkm = NULL, min_counts = .5, n_min_samples_count = 1, counts_transform = 1, counts_log_start = 4, no_fdr = NULL )
One important aspect of iDEP is the ability to take a subset of genes and perform pathway analysis to determine the pathways that are significantly enriched with different testing effects. The iDEP data base has biological pathways for each species in the database, and with each pathway there is a list of gene IDs that correspond with that process. We then calculate a p-value for the proportion of genes from a pathway in the subset to the proportion of genes subsetted to the entire background. The link below explains the most basic and popular method of determining overlap and enrichment. There are two functions to perform pathway analysis. The first one will be described in the section below.
Enrichment Link https://tv.qiagenbioinformatics.com/video/19605716/understanding-the-p-value-of
Performing pathway analysis for the subsetted list of genes requires only the
pathways that have at least one gene in their gene list. We use a database
query to ensure that all the pathways that have genes from the subset will be
returned in the gene sets. Creating a query requires a subset of the
gene_names
data frame. For this example, we will simply use the first 100 rows of the data. iDEP has many different methods to create interesting subsets, including differential expression analysis. Depending on the structure of the subsetted genes, use the code chunk below to filter gene_names
.
# gene_names_query <- gene_names[gene_names$ensembl_ID == rownames(example_data), ] gene_names_query <- gene_names[gene_names$ensembl_ID == rownames(processed_data$data), ]
The converted
parameter will be the return from the convert_id
function
call. The input for go
determines which portion of the pathway database to use
for the analysis. To determine the choices for this input, we can run the
function gmt_category
. This will return the portions of the database that have
pathways corresponding to the matched species. The first two parameters are
returned objects in the "Load_Data" instruction. The next is the organism that
the expression data is for, an input that has been used in previous functions.
gmt_file
is a datapath to a gmt file, but only if the species is new and not
in the iDEP database. Lastly, the data from the get_idep_data
call is filled
in. An example call for the demonstration data is in the code chunk below.
gmt_choices <- gmt_category( converted = converted, converted_data = converted_data$data, select_org = "BestMatch", gmt_file = NULL, idep_data = idep_data )
Search this list for the desired portion, and specify it with either
gmt_choices[[2]]
or the string denoting the section. For gmt_choices[[2]]
it
would be "GOBP". The next parameter, select_org
, should be the same as the
chunk above. The input gmt_file
is only used if you are working with a
species that is not in the iDEP database. If this is the case, and there is a
GMT file that goes with the species, the input will be the datapath stored in a data frame. For details on this, see the first instruction Load_Data
. The next inputs are objects that are also covered in the first instruction. With all the correct inputs, we are now going to actually subset gene_names
and create a query. We will use the process_heatmap_data
function which is described in the "Clustering" instruction. This final workflow is demonstrated in the chunk
below.
heatmap_data <- process_heatmap_data( data = processed_data$data, n_genes_max = 150, # n_genes_min = 0, gene_centering = TRUE, gene_normalize = TRUE, sample_centering = FALSE, sample_normalize = FALSE, all_gene_names = gene_names, select_gene_id = "symbol" ) gene_names_query <- gene_names[gene_names$symbol %in% rownames(heatmap_data), ] gene_sets <- read_pathway_sets( all_gene_names_query = gene_names_query, converted = converted, go = "GOBP", select_org = "BestMatch", gmt_file = NULL, idep_data = idep_data, gene_info = all_gene_info )
To use the find_overlap
function, we are going to work with the returned
object gene_sets
from above. This function performs the operation to
statistically determine the pathways that are enriched according to the gene
query that was submitted. The function above returns a list with objects that
will be inputted into parameters of the find_overlap
function. The parameter
pathway_table
is found at gene_sets$pathway_table
. This input is a table
that is returned from the read_pathway_sets
function. It contains information
on the pathways that were found by querying the database. The parameter
query_set
is the vector of ensembl IDs that were used as the query subset.
The next input parameter total_genes
, is the total number of genes in the
database for the matched species. processed_data
is the input parameter for
the data matrix that has gone through the pre-processing functions. The
gene_info
parameter takes in the all_gene_info
list that was create in the
"Load_Data" instruction. The input parameter go
takes in the selection for the
portion of the database. This input should be the same as what was inputted for
go
in the function read_pathway_sets
.
The next parameter takes in the idep_data
list that is created from the iDEP
database. use_filtered_background
is TURE/FALSE for which gene set to use as
the background when calculating overlap. For TRUE, the background gene set will
be the genes in the processed data matrix that passed the filtering. If the
value is set to FALSE, the background will be all the gene IDs that were found
in the iDEP database for the matched species. It is recommended to use the value TRUE, as this will give the most accurate p-value for which pathways are enriched. The video in the introduction gives details about the meaning of enriched. select_org
is the input that specifies the species that the
expression data is for. Finally, the last input parameter is reduced
and can be set to FALSE, or a decimal value. If a decimal value is inputted, this will filter out all the pathways that have a greater proportion of genes in common with another pathway than the inputted decimal. If FALSE is inputted, no pathways will be removed with redudant gene sets. This results of this function be examined in the next section.
pathway_info <- find_overlap( pathway_table = gene_sets$pathway_table, query_set = gene_sets$query_set, total_genes = gene_sets$total_genes, processed_data = processed_data$data, gene_info = all_gene_info, go = "GOBP", idep_data = idep_data use_filtered_background = TRUE, select_org = "BestMatch", reduced = .75 )
The table below is the first four columns of the results data frame that was
created in the example enrichment analysis. The fifth column is difficult to put
into a table because it contains the gene sets for all the significant pathways
that were found. The first column in this table is the adjusted p-value for the
significance of the overlap calculated for the pathway. The next column denotes
the amount of genes from the query that were associated with the given pathway.
The next column gives the total number of genes that were associated with the
pathway. Finally, the last column is the functional category or description of
what the pathway is biologically responsible for. To see the genes for the
functional category from the query, run pathway_info[, 5]
. If there is no
significantly enriched pathways, the data frame will say "No significant
enrichment found!"
DT::datatable( pathway_info[, 1:4], options = list(dom = "ft"), rownames = FALSE, selection = "single" )
This instruction covered the steps needed to perform enrichment analysis on a subset of genes. Using these steps it is possible to determine the pathways that are enriched based off the ratio of genes in the subset corresponding to the pathway and genes in the background corresponding to the pathway. The goal is to determine the pathways that are being affected by the different treatments and effects. For troubleshooting, all functions have documentation and the code is available on Github. (https://github.com/gexijin/idepGolem)
Add the following code to your website.
For more information on customizing the embed code, read Embedding Snippets.