knitr::opts_chunk$set(collapse = TRUE, cache = FALSE, comment = "#>")
This vignette is the second chapter in the "Pathway Significance Testing with pathwayPCA
" workflow, providing a detailed perspective to the Import Data section of the Quickstart Guide. This vignette will discuss using the the read_gmt
function to import Gene Matrix Transposed (.gmt
) pathway collection files as a list object with class pathwayCollection
. Also, we will discuss importing assay and response data, and how to make your assay data tidy. For our pathway analysis to be meaningful, we need gene expression data (from a microarray or something similar), corresponding phenotype information (such as weight, type of cancer, or survival time and censoring indicator), and a pathway collection.
Before we move on, we will outline our steps. After reading this vignette, you should be able to
.gmt
file and save the pathways stored therein as a pathwayCollection
object using the read_gmt
function..csv
file with the read_csv
function from the readr
package, and transpose this data frame into "tidy" form with the TransposeAssay
function..csv
file, and join (merge) it to the assay data frame with the inner_join
function from the dplyr
package.First, load the pathwayPCA
package and the tidyverse
package suite.
library(tidyverse) # Set tibble data frame print options options(tibble.max_extra_cols = 10) library(pathwayPCA)
The .gmt
format is a commonly used file format for storing pathway collections. Lists of pathways in the Molecular Signatures Database (MSigDB) can be downloaded from the MSigDB Collections page.
GMT-formatted files follow a very specific set of rules:
"KEGG_STEROID_BIOSYNTHESIS"
."http://www.broadinstitute.org/gsea/msigdb/cards/KEGG_STEROID_BIOSYNTHESIS"
."SOAT1" "LSS" "SQLE" "EBP" "CYP51A1" "DHCR7" "CYP27B1" "DHCR24" "HSD17B7" "MSMO1" "FDFT1" "SC5DL" "LIPA" "CEL" "TM7SF2" "NSDHL" "SOAT2"
.read_gmt
Based on the clearly-organized .gmt
file format, we were able to write a very fast function to read .gmt
files into R. The read_gmt
function takes in a path specifying where your .gmt
file is stored, and outputs a pathways list.
gmt_path <- system.file("extdata", "c2.cp.v6.0.symbols.gmt", package = "pathwayPCA", mustWork = TRUE) cp_pathwayCollection <- read_gmt(gmt_path, description = TRUE)
We now carefully discuss the form of this information. This cp_pathwayCollection
object has class pathwayCollection
and contains the following components:
pathways
: A list of character vectors. Each character vector should contain a subset of the names of the -Omes measured in your assay data frame. These pathways should not be too short, otherwise we devolve the problem into simply testing individual genes. The pathwayPCA
package requires each pathway to have a minimum of three genes recorded in the assay data frame.Important: some protein set lists have proteins markers recorded as character numerics (e.g. "3"), so make sure the feature names of your assay have an overlap with the gene or protein names in the pathwayCollection
list. Ensure that there is a non-empty overlap between the gene names in the pathways list and the feature names of the assay. Not every gene in your assay data frame will be in the pathways list, and not every gene in each pathway will have a corresponding measurement in the assay data frame. However, for meaningful results, there should be a significant overlap between the genes measured in the assay data frame and the gene names stored in the pathways list. If your pathways list has very few matching genes in your assay, then your pathway-based analysis results will be significantly degraded. Make sure your pathways list and assay data are compatible.
TERMS
: A character vector comprised of the proper name of each pathway in the pathway collection.description
: (OPTIONAL) A character vector the same length as the pathways list with descriptive information. For instance, the .gmt
file included with this package has hyperlinks to the MSigDB description card for that pathway in this field. This field will be imported by the read_gmt
function when description = TRUE
(it defaults to FALSE
).setsize
: the number of genes originally recorded in each pathway, stored as an integer vector. NOTE: this information is calculated and added to the pathways list at Omics
-class object creation (later in the workflow). This information is useful to measure the ratio of the number of genes from each pathway recorded in your assay to the number of genes defined to be in that pathway. For each pathway, this ratio should be at least 0.5 for best pathway analysis results.The object itself has the following structure:
cp_pathwayCollection
This object will be the list supplied to the pathwayCollection_ls
argument in the CreateOmics
function.
pathwayCollection
ListAdditionally, you can create a pathwayCollection
object from scratch with the CreatePathwayCollection
function. This may be useful to users who have their pathway information stored in some form other than a .gmt
file. You must supply a list of vectors of gene names to the pathways
argument, and a vector of the proper names of each pathway to the TERMS
argument. You could also store any other pertinant pathway information by passing a <name> = <value>
pair to this function.
myPathways_ls <- list( pathway1 = c("Gene1", "Gene2"), pathway2 = c("Gene3", "Gene4", "Gene5"), pathway3 = "Gene6" ) myPathway_names <- c( "KEGG_IMPORTANT_PATHWAY_1", "KEGG_IMPORTANT_PATHWAY_2", "SOME_OTHER_PATHWAY" ) CreatePathwayCollection( sets_ls = myPathways_ls, TERMS = myPathway_names, website = "URL_TO_PATHWAY_CITATION" )
To download a .gmt
file from Wikipathways, we recommend the R
package rWikiPathways
. From their vignette:
WikiPathways also provides a monthly data release archived at http://data.wikipathways.org. The archive includes GPML, GMT and SVG collections by organism and timestamped. There’s an R function for grabbing files from the archive...
downloadPathwayArchive()
This will simply open the archive in your default browser so you can look around (in case you don’t know what you are looking for). By default, it opens to the latest collection of GPML files. However, if you provide an organism, then it will download that file to your current working directory or specified destpath. For example, here’s how you’d get the latest GMT file for mouse:
downloadPathwayArchive(organism = "Mus musculus", format = "gmt")
And if you might want to specify an archive date so that you can easily share and reproduce your script at any time in the future and get the same result. Remember, new pathways are being added to WikiPathways every month and existing pathways are improved continuously!
downloadPathwayArchive(date = "20171010", organism = "Mus musculus", format = "gmt")
pathwayCollection
Object to a .gmt
FileFinally, we can save the pathwayCollection
object we just created via the write_gmt()
function:
write_gmt( pathwayCollection = cp_pathwayCollection, file = "../test.gmt" )
We assume that the assay data (e.g. transcriptomic data) is either in an Excel file or flat text file. For example, your data may look like this:
In this data set, the columns are individual samples. The values in each row are the -Omic expression measurements for the gene in that row.
readr
To import data files in .csv
(comma-separated), .fwf
(fixed-width), or .txt
(tab-delimited) format, we recommend the readr
package. You can .csv
files with the read_csv
function, fixed-width files with read_fwf
, and general delimited files with read_delim
. These functions are all from the readr
package. Additionally, for data in .xls
or .xlsx
format, we recommend the readxl
package. We would read a .csv
data file via
assay_path <- system.file("extdata", "ex_assay_subset.csv", package = "pathwayPCA", mustWork = TRUE) assay_df <- read_csv(assay_path)
The read_csv
function warns us that the name of the first column is missing, but then automatically fills it in as X1
. Further, this function prints messages to the screen informing you of the assumptions it makes when importing your data. Specifically, this message tells us that all the imported data is numeric (.default = col_double()
) except for the gene name column (X1 = col_character()
).
Let's inspect our assay data frame. Note that the gene names were imported as a character column, as shown by the <chr>
tag at the top of the first column. This data import step stored the row names (the gene names) as the first column, and preserved the column names (sample labels) of the data.
assay_df
The assay input to the pathwayPCA
package must be in tidy data format. The "Tidy Data" format requires that each observation be its own row, and each measurement its own column. This means that we must transpose our assay data frame, while preserving the row and column names.
To do this, we can use the TransposeAssay
function. This function takes in a data frame as imported by the three readr
functions based on data in a format similar to that shown above: genes are the rows, gene names are the first column, samples are stored in the subsequent columns, and all values in the assay (other than the gene names in the first column) are numeric.
(assayT_df <- TransposeAssay(assay_df))
This transposed data frame has the gene names as the column names and the sample names as a column of character (chr
) values. Notice that the data itself is 17 genes measured on 36 samples. Before transposition, we had 37 columns because the feature names were stored in the first column. After transposition, we have 36 rows but 18 columns: the first column stores the sample names. This transposed data frame (after filtering to match the response data) will be supplied to the assayData_df
argument in the CreateOmics
function. (See the Creating Omics
Data Objects vignette for more information on creating Omics
-class objects.)
If ever we need to extract individual components of a tidy data frame, we can use the assay[row, col]
syntax. If we need entire measurements (columns), then we can call the column by name with the assay$ColName
syntax. For example,
assayT_df
---corresponding to Sample "T21101312"---then we typeassayT_df[2, ]
Notice that the tibble
object has 1 row and 18 columns.
- If we need the third column of assayT_df
---corresponding to Gene "LSS"---then we type
assayT_df[, 3]
This tibble
object has 36 rows and 1 column.
- If we need the intersection of these two (the expression level of Gene "LSS" in Sample "T21101312"), then we type
assayT_df[2, 3, drop = TRUE]
This output would normally be a 1 by 1 tibble
(which isn't terribly helpful), so we add the drop = TRUE
argument to "drop" the dimensions of the table. This gives us a single basic number (scalar).
- If we need the third column of assayT_df
, but we want the result back as a vector instead of a tibble
, we call the column by name:
assayT_df$LSS
SummarizedExperiment
ObjectOftentimes, genomic experiment data is stored in a SummarizedExperiment
-class object. If your assay and response data are stored in such an object, use the SE2Tidy()
function to extract the necessary information and return it as a tidy data frame. Because SummarizedExperiment
objects can have more than one assay, you must specify the index for the assay of your choice with the whichAssay
argument. Here is an example using the airway
data:
library(SummarizedExperiment) data(airway, package = "airway") airway_df <- SE2Tidy(airway)
Now we can look at a nice summary of the tidied assay and response data. This will drop all of the gene-specific metadata, as well as any experiment metadata. However, pathwayPCA
can't make use of this data anyway, so we haven't lost much.
airway_df[, 1:20]
We now have an appropriate pathways list and a tidy -Omics assay data frame. All we need now is some response data. Let's imagine that your phenotype data looks something like this:
We next import this response information. We can use the read_csv
function once again:
pInfo_path <- system.file("extdata", "ex_pInfo_subset.csv", package = "pathwayPCA", mustWork = TRUE) pInfo_df <- read_csv(pInfo_path)
This phenotype data frame has a column for the sample labels (Sample
) and the response information. In this case, our response is a survival response with an event time and observation indicator.
pInfo_df
This pInfo
data frame has the sample names as a column of character values, just like the transposed assay data frame. This is crucially important for the "joining" step. We can use the inner_join
function from the dplyr
library to retain only the rows of the assayT_df
data frame which have responses in the pInfo
data frame and vice versa. This way, every response in the phenotype data has matching genes in the assay, and every recorded gene in the assay matches a response in the phenotype data.
joinedExperiment_df <- inner_join(pInfo_df, assayT_df, by = "Sample") joinedExperiment_df
This requires you to have a key column in both data frames with the same name. If the key column was called "Sample" in the pInfo_df
data set but "SampleID" in the assay, then the by
argument should be changed to by = c("Sample" = "SampleID")
. It's much nicer to just keep them with the same names, however. Moreover, it is vitally important that you check your sample IDs. Obviously the recorded genetic data should pair with the phenotype information, but it is your responsibility as the user to confirm that the assay rows match the correct responses. You are ultimately responsible to defend the integrity of your data and to use this package properly.
Included in this package, we have a small tidy assay and corresponding gene subset list. We will load and inspect this assay. This data set has 656 gene expression measurements on 250 colon cancer patients. Further notice that the assay and overall survival response information have already been matched.
data("colonSurv_df") colonSurv_df
We also have a small list of 15 pathways which correspond to our example colon cancer assay. To create a toy example, we have curated this artificial pathways list to include seven significant pathways and eight non-significant pathways.
data("colon_pathwayCollection") colon_pathwayCollection
The pathways list and tidy assay (with matched phenotype information) are all the information we need to create an Omics
-class data object.
We now summarize our steps so far. We have
.gmt
file and saved the pathways stored therein as a pathwayCollection
object using the read_gmt
function..csv
file with the read_csv
function from the readr
package, and transposed this data frame into "tidy" form with the TransposeAssay
function..csv
file, and joined it to the assay data frame with the inner_join
function from the dplyr
package.Now we are prepared to create our first Omics
-class object for analysis with either AES-PCA or Supervised PCA. Please read vignette chapter 3: Creating Omics
Data Objects.
Here is the R session information for this vignette:
sessionInfo()
Any scripts or data that you put into this service are public.
Add the following code to your website.
For more information on customizing the embed code, read Embedding Snippets.