In insightsengineering/hermes: Preprocessing, analyzing, and reporting of RNA-seq data

# No further user input required below this point.
# Developers might change code below if needed.

# Define name and version of this report template. Version should be updated
# in each release.
template_name <- "qc_report"
template_version <- "1.0.0"

# Specify knitr options.
knitr::opts_chunk$set(
  cache = FALSE, # Set `TRUE` to enable rmarkdown cache.
  cache.lazy = FALSE,
  eval = TRUE,
  echo = TRUE,
  message = TRUE,
  warning = TRUE,
  error = FALSE # Important to have `FALSE` here in order to stop after an error.
)

# Load all required packages.
library(hermes)

# Load Data and get object HermesData
se <- get(params$input_summarized_experiment)
stopifnot(is(se, "SummarizedExperiment"))

# Section 1 - Pre Filtering, with added QC flags
control <- control_quality(params$min_cpm, params$min_cpm_prop, params$min_corr, params$min_depth)

object_original <- HermesData(se) %>%
  add_quality_flags(control)

# Section 2 - Post Filtering & Normalization

object_filtered_normalized <- object_original %>%
  filter() %>%
  normalize()

# QC reference default values: Reference object for standard quality control setting.
control_default <- control_quality()

object_default_qc_flags <- HermesData(se) %>%
  add_quality_flags(control_default)

Template and Session Info {.tabset}

This template is r template_name version r template_version.

Session Info {-}

This report file was generated by r Sys.info()[["user"]] on r Sys.time() in directory r getwd() using hermes version r packageVersion("hermes").

Details {-}

sessionInfo()

Dataset Summary

The dataset used is "r params$input_summarized_experiment". The data set was composed of r ncol(object_original) samples and r nrow(object_original) genes.

Technical Metrics {.tabset}

Histogram of Library Sizes

First, we check how the distribution of library sizes looks like across all samples. The median library size is r sprintf("%0.3G", median(colSums(counts(object_original)))). According to Standards, Guidelines and Best Practices for RNA-Seq (The ENCODE Consortium), the sequencing depth/library size is usually determined by the goals of the experiment and the nature of the RNA sample. The minimum library size required is around 20-30 million aligned reads.

draw_libsize_hist(object_original)

Q-Q Plot of Library Sizes

The distribution of library sizes is then compared to the normal distribution. The linearity of the points will suggest if the distribution of library sizes is normally distributed.

draw_libsize_qq(object_original)

Density Plot of (Log) Counts Distributions

In this plot, each line represents density of expression of all the genes within one sample. Lines that deviate from the majority may suggest inconsistent read depth or technical failure of the samples. The peaks in log2 (count) distribution indicate the log2(count) value at which the majority of genes are expressed.

draw_libsize_densities(object_original)

Gene filtering {.tabset}

This step is necessary to flag out genes with low expression. Genes with very low counts across all libraries provide little evidence for differential expression and they interfere with some of the statistical analyses conducted later in the pipeline. Therefore, these genes should be filtered out prior to the analysis.

There are many ways to filter out genes with lower counts. When there are n biological replicates in each group, we usually filter on a minimum counts per million threshold present in at least n samples. In this dataset, we choose to retain genes if they are expressed at a counts-per-million (CPM) above r params$min_cpm in at least in r (params$min_cpm_prop)*100% of the samples. These thresholds can be modified.

Stacked Barplot of Low Expression Genes by Chromosome

This barplot shows the chromosomes with their proportions of low expression genes.

draw_genes_barplot(object_original) +
  theme(axis.text.x = element_text(angle = 90, vjust = 0.5, hjust = 1))

Genes with extremely high counts

object_tg <- top_genes(
  object_original,
  n_top = params$top_gene_n_top,
  min_threshold = params$top_gene_min_threshold
)

This bar plot shows the shows the expression counts (y axis) for each of the r nrow(object_tg) top genes (x-axis) with.

autoplot(object_tg, y_lab = "Maximum Count", title = "Top genes")

Genes with high variability

object_sds <- top_genes(
  object_original,
  summary_fun = rowSds,
  n_top = params$top_gene_n_top,
  min_threshold = params$top_gene_min_threshold
)

This bar plot shows the shows the expression standard deviation (y axis) for each of the r nrow(object_tg) top genes (x-axis) with.

autoplot(
  object_sds,
  y_lab = "Standard Deviation across samples",
  title = "Top variability genes"
)

Sample filtering {.tabset}

This step is necessary to filter out samples with low depth or technical failure. There are three approaches:

Low depth samples

Identified either by lower quartile minus 1.5*IQR (default) or user-defined threshold: r ifelse(is.null(params$min_depth), "still default here", params$min_depth)

Correlation between samples

Humans usually have a similar RNAseq profile; therefore, correlation of RNAseq vectors between patients should be high. If a patient's average correlation score with the other patients is lower than a threshold (default = r control_default$min_corr, r params$min_corr used in this report for r params$cor_method's method), it is likely that the sample preparation went wrong. We are flagging such samples here as technical failures.

Correlation matrix between the sample vectors of counts from a specified assay, containing quality flags (TechnicalFailureFlag, LowDepthFlag) describing the original input samples:

object_cor <- hermes::correlate(object_original, method = params$cor_method)

autoplot(
  object_cor,
  row_names_gp = grid::gpar(fontsize = 8),
  column_names_gp = grid::gpar(fontsize = 8)
)

Boxplot of non-zero genes per sample

In this plot, each data point represents the number of non-zero expressed genes in a sample. Samples with a lower number non-zero count genes may be also of interest.

draw_nonzero_boxplot(object_original)

Filtering and Normalization Effects

In this section we look at the results after filtering and normalizing the RNAseq counts.

Top level gene numbers

Total number of genes before filtering: r nrow(object_original)
Total number of genes after filtering: r nrow(object_filtered_normalized)
Rate of passing: r round((100/nrow(object_original))*nrow(object_filtered_normalized),1)%

Top level sample numbers

Total number of samples before filtering: r ncol(object_original)
Total number of samples after filtering: r ncol(object_filtered_normalized)
Rate of passing: r round((100/ncol(object_original))*ncol(object_filtered_normalized),1)%

Technical Metrics After Sample Filtering {.tabset}

Here we use the counts assay of the filtered object, i.e. less samples might be included here.

Histogram of Library Sizes

draw_libsize_hist(object_filtered_normalized)

Q-Q Plot of Library Sizes

draw_libsize_qq(object_filtered_normalized)

Density Plot of (Log) Counts Distributions

draw_libsize_densities(object_filtered_normalized)

Principal Components Analyses for batch effects {.tabset}

Batch effects are unwanted sources of variation caused by different processing date, handling personnel, reagent lots, equipment/machines, etc. Batch effects is a big challenge faced in biological research, especially towards translational research and precision medicine.

PCA r ifelse(is.null(params$pca_batch_var), "is not performed because 'pca_batch_var' was not provided", paste("is performed with batch variable", params$pca_batch_var)).

Background: If batch information is provided in the column data, principal component analysis will be performed below on autosomes for checking batch effect. In general, it is a good sign to see that data points from different batches are mixing up well. This suggests that the batch effect is negligible.

Note that genes with constant counts across all samples are excluded from the analysis internally. Centering and scaling is applied internally.

PCA before normalization

Here the original counts are used, after filtering samples and genes as described above.

object_pca <- calc_pca(object_filtered_normalized, assay_name = "counts")

autoplot(
  object_pca,
  label = TRUE,
  label.repel = TRUE,
  data = as.data.frame(colData(object_filtered_normalized)),
  colour = params$pca_batch_var
)

PCA after normalization

In this analysis the r params$pca_assay_name assay is used.

object_norm_pca <- calc_pca(
  object_filtered_normalized,
  assay_name = params$pca_assay_name
)

autoplot(
  object_norm_pca,
  label = TRUE,
  label.repel = TRUE,
  data = as.data.frame(colData(object_filtered_normalized)),
  colour = params$pca_batch_var
)

Comparing QC results on samples

Technical failure flags

We list the samples where the default technical failure flags are different from the resulting ones, using the custom correlation score threshold:

df_comp_tech_failure <- data.frame(
  default = get_tech_failure(object_default_qc_flags),
  result = get_tech_failure(object_original)
)

DT::datatable(subset(df_comp_tech_failure, default != result))

Low depth flags

Next we list the samples where the default low depth flags are different from the resulting ones, using the custom depth threshold:

df_comp_low_depth <- data.frame(
  default = get_low_depth(object_default_qc_flags),
  result = get_low_depth(object_original)
)

DT::datatable(subset(df_comp_low_depth, default != result))

Comparing QC results on genes

We list the genes where the default low expression flags are different from the resulting ones, using the custom CPM and proportion thresholds:

df_comp_low_expression <- data.frame(
  default = get_low_expression(object_default_qc_flags),
  result = get_low_expression(object_original)
)

DT::datatable(subset(df_comp_low_expression, default != result))

Output

The used HermesData object called object_filtered_normalized has been saved in r params$output_data_file if that file did not exist beforehand.

if (!file.exists(params$output_data_file)) {
  save(object_filtered_normalized, file = params$output_data_file)
}

insightsengineering/hermes documentation built on Dec. 15, 2024, 8:07 a.m.

rdrr.io home R language documentation Run R code online

CRAN packages Bioconductor packages R-Forge packages GitHub packages

Note that we can't provide technical support on individual packages. You should contact the package authors for that.

insightsengineering/hermes
Preprocessing, analyzing, and reporting of RNA-seq data

In insightsengineering/hermes: Preprocessing, analyzing, and reporting of RNA-seq data

Template and Session Info {.tabset}

Session Info {-}

Details {-}

Dataset Summary

Technical Metrics {.tabset}

Histogram of Library Sizes

Q-Q Plot of Library Sizes

Density Plot of (Log) Counts Distributions

Gene filtering {.tabset}

Stacked Barplot of Low Expression Genes by Chromosome

Genes with extremely high counts

Genes with high variability

Sample filtering {.tabset}

Low depth samples

Correlation between samples

Boxplot of non-zero genes per sample

Filtering and Normalization Effects

Technical Metrics After Sample Filtering {.tabset}

Histogram of Library Sizes

Q-Q Plot of Library Sizes

Density Plot of (Log) Counts Distributions

Principal Components Analyses for batch effects {.tabset}

PCA before normalization

PCA after normalization

Comparing QC results on samples

Technical failure flags

Low depth flags

Comparing QC results on genes

Output

R Package Documentation

Browse R Packages

We want your feedback!

insightsengineering/hermes Preprocessing, analyzing, and reporting of RNA-seq data

In insightsengineering/hermes: Preprocessing, analyzing, and reporting of RNA-seq data

Template and Session Info {.tabset}

Session Info {-}

Details {-}

Dataset Summary

Technical Metrics {.tabset}

Histogram of Library Sizes

Q-Q Plot of Library Sizes

Density Plot of (Log) Counts Distributions

Gene filtering {.tabset}

Stacked Barplot of Low Expression Genes by Chromosome

Genes with extremely high counts

Genes with high variability

Sample filtering {.tabset}

Low depth samples

Correlation between samples

Boxplot of non-zero genes per sample

Filtering and Normalization Effects

Technical Metrics After Sample Filtering {.tabset}

Histogram of Library Sizes

Q-Q Plot of Library Sizes

Density Plot of (Log) Counts Distributions

Principal Components Analyses for batch effects {.tabset}

PCA before normalization

PCA after normalization

Comparing QC results on samples

Technical failure flags

Low depth flags

Comparing QC results on genes

Output

R Package Documentation

Browse R Packages

We want your feedback!

insightsengineering/hermes
Preprocessing, analyzing, and reporting of RNA-seq data