# No further user input required below this point. # Developers might change code below if needed. # Define name and version of this report template. Version should be updated # in each release. template_name <- "qc_report" template_version <- "1.0.0" # Specify knitr options. knitr::opts_chunk$set( cache = FALSE, # Set `TRUE` to enable rmarkdown cache. cache.lazy = FALSE, eval = TRUE, echo = TRUE, message = TRUE, warning = TRUE, error = FALSE # Important to have `FALSE` here in order to stop after an error. ) # Load all required packages. library(hermes)
# Load Data and get object HermesData se <- get(params$input_summarized_experiment) stopifnot(is(se, "SummarizedExperiment")) # Section 1 - Pre Filtering, with added QC flags control <- control_quality(params$min_cpm, params$min_cpm_prop, params$min_corr, params$min_depth) object_original <- HermesData(se) %>% add_quality_flags(control) # Section 2 - Post Filtering & Normalization object_filtered_normalized <- object_original %>% filter() %>% normalize() # QC reference default values: Reference object for standard quality control setting. control_default <- control_quality() object_default_qc_flags <- HermesData(se) %>% add_quality_flags(control_default)
This template is r template_name
version r template_version
.
This report file was generated by r Sys.info()[["user"]]
on r Sys.time()
in directory
r getwd()
using hermes
version r packageVersion("hermes")
.
sessionInfo()
The dataset used is "r params$input_summarized_experiment
".
The data set was composed of r ncol(object_original)
samples and r nrow(object_original)
genes.
First, we check how the distribution of library sizes looks like across all samples. The median library size is r sprintf("%0.3G", median(colSums(counts(object_original))))
.
According to Standards, Guidelines and Best Practices for RNA-Seq (The ENCODE Consortium), the sequencing depth/library size is usually determined by the goals of the experiment and the nature of the RNA sample. The minimum library size required is around 20-30 million aligned reads.
draw_libsize_hist(object_original)
The distribution of library sizes is then compared to the normal distribution. The linearity of the points will suggest if the distribution of library sizes is normally distributed.
draw_libsize_qq(object_original)
In this plot, each line represents density of expression of all the genes within one sample. Lines that deviate from the majority may suggest inconsistent read depth or technical failure of the samples. The peaks in log2 (count) distribution indicate the log2(count) value at which the majority of genes are expressed.
draw_libsize_densities(object_original)
This step is necessary to flag out genes with low expression. Genes with very low counts across all libraries provide little evidence for differential expression and they interfere with some of the statistical analyses conducted later in the pipeline. Therefore, these genes should be filtered out prior to the analysis.
There are many ways to filter out genes with lower counts. When there are n biological replicates in each group, we usually filter on a minimum counts per million threshold present in at least n samples. In this dataset, we choose to retain genes if they are expressed at a counts-per-million (CPM) above r params$min_cpm
in at least in r (params$min_cpm_prop)*100
% of the samples. These thresholds can be modified.
This barplot shows the chromosomes with their proportions of low expression genes.
draw_genes_barplot(object_original) + theme(axis.text.x = element_text(angle = 90, vjust = 0.5, hjust = 1))
object_tg <- top_genes( object_original, n_top = params$top_gene_n_top, min_threshold = params$top_gene_min_threshold )
This bar plot shows the shows the expression counts (y axis) for each of the r nrow(object_tg)
top genes (x-axis) with.
autoplot(object_tg, y_lab = "Maximum Count", title = "Top genes")
object_sds <- top_genes( object_original, summary_fun = rowSds, n_top = params$top_gene_n_top, min_threshold = params$top_gene_min_threshold )
This bar plot shows the shows the expression standard deviation (y axis) for each of the r nrow(object_tg)
top genes (x-axis) with.
autoplot( object_sds, y_lab = "Standard Deviation across samples", title = "Top variability genes" )
This step is necessary to filter out samples with low depth or technical failure. There are three approaches:
Identified either by lower quartile minus 1.5*IQR (default) or user-defined threshold: r ifelse(is.null(params$min_depth), "still default here", params$min_depth)
Humans usually have a similar RNAseq profile; therefore, correlation of RNAseq vectors between patients should be high. If a patient's average correlation score with the other patients is lower than a threshold (default = r control_default$min_corr
, r params$min_corr
used in this report for r params$cor_method
's method), it is likely that the sample preparation went wrong. We are flagging such samples here as technical failures.
Correlation matrix between the sample vectors of counts from a specified assay, containing quality flags (TechnicalFailureFlag
, LowDepthFlag
) describing the original input samples:
object_cor <- hermes::correlate(object_original, method = params$cor_method)
autoplot( object_cor, row_names_gp = grid::gpar(fontsize = 8), column_names_gp = grid::gpar(fontsize = 8) )
In this plot, each data point represents the number of non-zero expressed genes in a sample. Samples with a lower number non-zero count genes may be also of interest.
draw_nonzero_boxplot(object_original)
In this section we look at the results after filtering and normalizing the RNAseq counts.
Total number of genes before filtering: r nrow(object_original)
Total number of genes after filtering: r nrow(object_filtered_normalized)
Rate of passing: r round((100/nrow(object_original))*nrow(object_filtered_normalized),1)
%
Total number of samples before filtering: r ncol(object_original)
Total number of samples after filtering: r ncol(object_filtered_normalized)
Rate of passing: r round((100/ncol(object_original))*ncol(object_filtered_normalized),1)
%
Here we use the counts assay of the filtered object, i.e. less samples might be included here.
draw_libsize_hist(object_filtered_normalized)
draw_libsize_qq(object_filtered_normalized)
draw_libsize_densities(object_filtered_normalized)
Batch effects are unwanted sources of variation caused by different processing date, handling personnel, reagent lots, equipment/machines, etc. Batch effects is a big challenge faced in biological research, especially towards translational research and precision medicine.
PCA r ifelse(is.null(params$pca_batch_var), "is not performed because 'pca_batch_var' was not provided", paste("is performed with batch variable", params$pca_batch_var))
.
Background: If batch information is provided in the column data, principal component analysis will be performed below on autosomes for checking batch effect. In general, it is a good sign to see that data points from different batches are mixing up well. This suggests that the batch effect is negligible.
Note that genes with constant counts across all samples are excluded from the analysis internally. Centering and scaling is applied internally.
Here the original counts are used, after filtering samples and genes as described above.
object_pca <- calc_pca(object_filtered_normalized, assay_name = "counts") autoplot( object_pca, label = TRUE, label.repel = TRUE, data = as.data.frame(colData(object_filtered_normalized)), colour = params$pca_batch_var )
In this analysis the r params$pca_assay_name
assay is used.
object_norm_pca <- calc_pca( object_filtered_normalized, assay_name = params$pca_assay_name ) autoplot( object_norm_pca, label = TRUE, label.repel = TRUE, data = as.data.frame(colData(object_filtered_normalized)), colour = params$pca_batch_var )
We list the samples where the default technical failure flags are different from the resulting ones, using the custom correlation score threshold:
df_comp_tech_failure <- data.frame( default = get_tech_failure(object_default_qc_flags), result = get_tech_failure(object_original) )
DT::datatable(subset(df_comp_tech_failure, default != result))
Next we list the samples where the default low depth flags are different from the resulting ones, using the custom depth threshold:
df_comp_low_depth <- data.frame( default = get_low_depth(object_default_qc_flags), result = get_low_depth(object_original) )
DT::datatable(subset(df_comp_low_depth, default != result))
We list the genes where the default low expression flags are different from the resulting ones, using the custom CPM and proportion thresholds:
df_comp_low_expression <- data.frame( default = get_low_expression(object_default_qc_flags), result = get_low_expression(object_original) )
DT::datatable(subset(df_comp_low_expression, default != result))
The used HermesData
object called object_filtered_normalized
has been saved in r params$output_data_file
if that file did not exist beforehand.
if (!file.exists(params$output_data_file)) { save(object_filtered_normalized, file = params$output_data_file) }
Add the following code to your website.
For more information on customizing the embed code, read Embedding Snippets.