knitr::opts_chunk$set( collapse = TRUE, comment = "#>" ) options(rmarkdown.html_vignette.check_title = FALSE)
# library(idepGolem) devtools::load_all()
# Make data for pre-processing functions idep_data <- get_idep_data() DATABASE <- Sys.getenv("GE_DATABASE")[1] YOUR_DATA_PATH <- paste0(DATABASE, "data_go/BcellGSE71176_p53.csv") YOUR_EXPERIMENT_PATH <- paste0(DATABASE, "data_go/BcellGSE71176_p53_sampleInfo.csv") expression_file <- data.frame( datapath = YOUR_DATA_PATH ) experiment_file <- data.frame( datapath = YOUR_EXPERIMENT_PATH ) load_data <- input_data( expression_file = expression_file, experiment_file = experiment_file, go_button = FALSE, demo_data_file = idep_data$demo_data_file, demo_metadata_file = idep_data$demo_metadata_file ) converted <- convert_id( query = rownames(load_data$data), idep_data = idep_data, select_org = "BestMatch" ) all_gene_info <- gene_info( converted = converted, select_org = "BestMatch", idep_data = idep_data ) converted_data <- convert_data( converted = converted, no_id_conversion = FALSE, data = load_data$data ) gene_names <- get_all_gene_names( mapped_ids = converted_data$mapped_ids, all_gene_info = all_gene_info )
This instruction builds off the objects and functions from the "Load_Data" vignette. Please see that document before using the functions described below. We will now transform and process the data based off the type that is being studied. The package also provides visualization functions that will provide a foundation for exploratory data analysis and good overview of the data.
pre_process
FunctionProcessing the data requires many options to ensure the data gets to the desired
form. This is the data that will be used in the exploratory analysis to come.
The data to be processed is the converted_data
from the "Load_Data" vignette.
The parameters missing_value
and data_file_format
have a list of input
values to select from for processing. These values and descriptions can be found
in the tables below. The missing_value
parameter determines how missing data
entries will be filled in, and the data_file_format
parameter tells the
function what type of data is being processed. Depending on the
data_file_format
chosen, additional parameters will need to be provided. These
are explained below.
Input (missing_value
) Description
geneMedian
Substitute median expression value for gene across all samples
treatAsZero
Substitute "0" for all missing values
geneMedianInGroup
Substitute median expression value for gene within sample group
Input (data_file_format
) Description
1
Read counts data (recommended)
2
Normalized expression values (RNA-seq FPKM, micro-array, etc.)
3
Fold-changes and corrected P values from CuffDiff or any other program
For read counts data, the parameters min_counts
, n_min_samples_counts
,
counts_transform
and counts_log_start
need input values. The parameters
min_counts
and n_min_samples
work to filter the genes out of the data that
do not have sufficient counts. The function counts the columns that have more
counts than min_counts
for each gene and filters out the genes with less
columns than n_min_samples
. For example, if you only wanted genes with more
than 0 counts for each sample, you would set min_counts = .5
and
n_min_samples = ncol(converted_data)
. The parameter counts_transform
specifies the transformation to perform on the counts data. This parameter has 3
input options which are outlined below. If counts_transform = 1
,
counts_log_start
determines the constant to add to CPM to ensure that there
are no log(0)
errors. All other parameters can be set to NULL
Input (counts_transform
) Description
1
EdgeR: log2(CPM+c) (Counts per Million + constant) (Visit https://www.rdocumentation.org/packages/SparkR/versions/2.1.2/topics/log2)
2
VST: variance stabilizing transform (Visit https://www.rdocumentation.org/packages/DESeq2/versions/1.12.3/topics/vst)
3
rlog: regularized log (slow) (Visit https://www.rdocumentation.org/packages/DESeq2/versions/1.12.3/topics/rlog)
If the data being processed is for normalized expression values, the parameters
low_filter_fpkm
, n_min_samples_fpkm
, log_transform_fpkm
and
log_start_fpkm
need input values. The parameters low_filter_fpkm
and
n_min_samples_fpkm
work together to filter out the genes that do not have
enough counts. The function will count the columns that are greater than
low_filter_fpkm
and discard the gene if it is not more than
n_min_samples_fpkm
columns. The parameter log_transform_fpkm
is TRUE/FALSE
for whether to perform a log transform on the data or not. If
log_transform = TRUE
, then log_start_fpkm
is the value that will be added to
ensure there are no log(0)
errors. All other values can be set to NULL
.
For fold-changes and corrected p-values, there is only one additional parameter
to provide an input for, no_fdr
. This is a logical TRUE/FALSE
to specify
whether the data contains corrected p-values or not. TRUE
denotes fold-change
only data. All other parameters can be set to NULL
.
The demonstration loaded in the "Load_Data" instruction is read counts data. The
code below shows an example use of the pre_process
function using the example
data. This data has the converted ensembl IDs. We will return a large list of
the results of the processing from this function. The objects in this list are
defined in the table below.
Output Description
data Data matrix that has the pre-process operations performed to it.
mean_kurtosis The calculated kurtosis of the mean counts for a gene. (Visit https://www.rdocumentation.org/packages/e1071/versions/1.7-2/topics/kurtosis)
raw_counts The raw counts before pre-processing.
data_type_warning Set to "1" if iDEP recognizes the data as different than the inputted data_file_format
.
data_size Length 4 vector giving the dimensions of the data before and after processing.
p_vals For fold-change data has p-values, they are returned here.
processed_data <- pre_process( data = converted_data$data, missing_value = "geneMedian", data_file_format = 1, low_filter_fpkm = NULL, n_min_samples_fpkm = NULL, log_transform_fpkm = NULL, log_start_fpkm = NULL, min_counts = .5, n_min_samples_count = 1, counts_transform = 1, counts_log_start = 4, no_fdr = NULL )
Now that the data has been processed, we can begin data analysis. It is possible to create unique and informative plots with the processed data, and iDEP has built in functions to assist with that. These plots can then also be saved in an object and have additional ggplot features added to them.
total_counts_ggplot
FunctionThis function will only work for read counts data, so if that is not the format
of the processed data, please skip this function. For read counts data, this
function creates a barplot to show the counts in millions for each sample in the
data. counts_data
is the raw counts that is returned from the pre_process
function. sample_info
is the experiment design data from the input_data
function.
total_counts_ggplot( counts_data = processed_data$raw_counts, sample_info = load_data$sample_info )
eda_scatter
FunctionAnother useful plot is a scatter plot of the transformed expression for two
samples in the data. eda_scatter
provides a scatter plot of two specified
sample (columns) of the processed data. plot_xaxis
and plot_yaxis
are
string names of the columns to be plotted. The options for these inputs can be
seen using colnames(processed_data$data)
. processed_data
is the returned
data from the pre_process
function.
colnames(processed_data$data) eda_scatter( processed_data = processed_data$data, plot_xaxis = "p53_IR_1", plot_yaxis = "p53_mock_1" )
eda_density
FunctionThis plotting function provides a plot of the distribution of counts values for
each sample in the data. The information is displayed in a density ridge plot in
the ggplot format. Again, the data provided is returned from the pre_process
function. The sample_info
provides information on each sample in the dataset.
eda_density( processed_data = processed_data$data, sample_info = load_data$sample_info )
merge_data
FunctionThe current data is using the converted ensembl IDs as rownames in a large data
matrix, but the following function will create a data frame containing the gene
names from the get_all_gene_names
function. merge_data
takes the
gene_names
data frame and a data matrix with ID rownames and creates a data
frame with columns for each ID that idep has matched from the original IDs.
This is useful to view different IDs for the same gene. It also enables writing
CSV files that can be saved and be used to reproduce results from iDEP. The
following code will merge the gene names with the raw counts data and the
transformed data from pre_process
.
merged_raw_counts <- merge_data( all_gene_names = gene_names, data = processed_data$raw_counts, merge_ID = "ensembl_ID" ) merged_data <- merge_data( all_gene_names = gene_names, data = processed_data$data, merge_ID = "ensembl_ID" )
iDEP also has a function that visualizes individual genes and their expression
behavior across the samples. It is possible to determine the genes to plot by
looking at the merged or processed data. It is also possible to use the
gene_names
data frame to search for a gene ID. The code below demonstrates how
to search for a gene ID.
gene_names[grep("H4c1", gene_names$symbol), ]
Depending on the type of gene ID that is desired for the plot, the function below will swap the gene IDs with the processed data matrix. This new data matrix will be used in the individual plot function using the selected ID type and the desired gene. The code below shows how to swap the gene IDs to the symbol name.
ind_plot_data <- rowname_id_swap( data_matrix = processed_data$data, all_gene_names = gene_names, select_gene_id = "symbol" )
The function individual_plots
will create the plot for the desired gene. We
will use the data that we created in the code above for individual_data
. If
the data comes with an experiment file, the sample information that was loaded
in the first instruction will be used in sample_info
. The gene to be plotted
can be specified in the selected_gene
parameter. Multiple genes can be
inputted as a vector to plot them at the same time and compare. There are two
different types of plots that can be outputted with this function. The parameter
gene_plot_box
controls which plot will be outputted. Setting this input to
TRUE will output a lineplot that gives the expression across all samples. FALSE
will output a bar plot that gives the mean expression across the groups detected
in the sample names. For the barplot, use_sd
as TRUE will give a standard
deviation bar instead of a standard error bar. lab_rotate
controls the
rotation of the x-axis labels. The outputted object is a formatted ggplot.
# individual_plots( # individual_data = ind_plot_data, # sample_info = load_data$sample_info, # selected_gene = " H4c1", # gene_plot_box = FALSE, # use_sd = FALSE, # lab_rotate = 90 # ) individual_plots( individual_data = ind_plot_data, sample_info = load_data$sample_info, selected_gene = " Ankrd13b", gene_plot_box = FALSE, use_sd = FALSE, lab_rotate = 90 ) individual_plots( individual_data = ind_plot_data, sample_info = load_data$sample_info, selected_gene = " Ankrd13b", gene_plot_box = TRUE, use_sd = FALSE, lab_rotate = 90 )
This concludes all the functions utilized in the Pre-Process portion of the iDEP package. The data can be plotted in many different ways, but hopefully the functions detailed above provide a good starting point. For troubleshooting, all functions have documentation and the code is available on Github. (https://github.com/gexijin/idepGolem)
Add the following code to your website.
For more information on customizing the embed code, read Embedding Snippets.