ASE-GLM-edgeR: Using Generalised linear models (GLMs) to analyse...

ASE-GLM-edgeRR Documentation

Using Generalised linear models (GLMs) to analyse differential ASEs using edgeR

Description

These functions allow users to fit custom GLMs included/excluded counts using edgeR for differential Alternative Splice Events (ASEs)

Usage

fitASE_edgeR(
  se,
  strModelFormula,
  strASEFormula,
  useQL = TRUE,
  IRmode = c("all", "annotated", "annotated_binary"),
  filter_antiover = TRUE,
  filter_antinear = FALSE
)

fitASE_edgeR_custom(
  se,
  model_IncExc,
  model_ASE,
  useQL = TRUE,
  IRmode = c("all", "annotated", "annotated_binary"),
  filter_antiover = TRUE,
  filter_antinear = FALSE
)

testASE_edgeR(
  se,
  fit,
  coef_IncExc = ncol(fit[["model_IncExc"]]),
  contrast_IncExc = NULL,
  coef_ASE = ncol(fit[["model_ASE"]]),
  contrast_ASE = NULL
)

addPSI_edgeR(results, se, condition, conditionList)

Arguments

se

The NxtSE object created by makeSE(). To reduce runtime and avoid excessive multiple testing, consider filtering the object using applyFilters

strModelFormula

A string specifying the model formula to fit isoform counts to assess differential expression in isolation. Should take the form of "~0 + batch1 + batch2 + test_factor", where batch1 and batch2 are batch factors (if any), and test_factor is the variate of interest.

strASEFormula

A string specifying the model formula to fit PSIs (isoform ratios). The variate of interest should be specified as an interactiion term with ASE. For example, following the above example, the ASE formula should be "~0 + batch1 + batch2 + test_factor + test_factor:ASE"

useQL

(default TRUE) Whether to use edgeR's quasi-likelihood method to help reduce false positives from near-zero junction / intron counts. NB: edgeR's quasi-likelihood method is run with legacy method (Lun and Smyth (2017)).

IRmode

(default all) Choose the approach to quantify IR events. Default all considers all introns as potentially retained, and calculates IR-ratio based on total splicing across the intron using the "SpliceOver" or "SpliceMax" approach (see collateData). Other options include annotated which calculates IR-ratios for annotated introns only, and annotated_binary which calculates PSI considering the "included" isoform as the IR-transcript, and the "excluded" transcript is quantified from splice counts only across the exact intron (but not that of overlapping introns). IR-ratio are denoted as "IR" events, whereas PSIs calculated using IR and intron-spliced binary alternatives are denoted as "RI" events.

filter_antiover, filter_antinear

Whether to remove novel IR events that overlap over or near anti-sense genes. Default will exclude antiover but not antinear introns. These are ignored if strand-specific RNA-seq protocols are used.

model_IncExc

A model matrix in which to model differential expression of isoform counts in isolation. The number of rows must equal that of the number of samples in se

model_ASE

A model matrix in which to model differential PSIs. The number of rows must be twice that of the number of samples in se, the first half are for included counts, and the second half are for excluded counts. See example below.

fit

The output returned by the fitASE_edgeR and fitASE_edgeR_custom functions.

coef_IncExc, coef_ASE

model coefficients to be dropped for LRT test between full and reduced models. Directly parsed onto edgeR::glmQLFTest. See ?edgeR::glmQLFTest for details

contrast_IncExc, contrast_ASE

numeric vector specifying one or more #' contrasts of the linear model coefficients to be tested. Directly parsed onto edgeR::glmQLFTest. See ?edgeR::glmQLFTest for details

results

The return value of testASE_edgeR(), to be used as input to append mean and delta PSI values onto.

condition

The name of the column containing the condition values in colData(se)

conditionList

A list (or vector) of condition values of which to calculate mean PSIs

Details

edgeR accounts appropriately for zero-counts which are often problematic as PSI approaches zero or one, leading to false positives. The following functions allow users to define model formulas to test relative expressions of included / excluded counts (to assess whether isoforms are differentially regulated, in isolation), as well as together as an interaction (the latter provides results of differential ASE analysis)

See the examples section for a brief explanation of how to use these functions.

See also ASE-methods for further explanations of results output.

Value

fitASE_edgeR and fitASE_edgeR_custom returns a named list containing the following:

  • IncExc and ASE are DGEGLM objects containing the fitted models for isoform counts and PSIs, respectively

  • model_IncExc and model_ASE are model matrices of the above fitted models.

testASE_edgeR() returns a data.table containing the following:

  • EventName: The name of the ASE event. This identifies each ASE in downstream functions including makeMeanPSI, makeMatrix, and plotCoverage

  • EventType: The type of event. See details section above.

  • EventRegion: The genomic coordinates the event occupies. This spans the most upstream and most downstream splice junction involved in the ASE, and is use to guide the plotCoverage function.

  • flags: Indicates which isoforms are NMD substrates and/or which are formed by novel splicing only.

edgeR specific output equivalent to statistics returned by edgeR::topTags():

  • logFC, logCPM, F, PValue, FDR: log fold change, log counts per million, F statistic, p value and (Benjamini Hochberg) adjusted p values of the differential PSIs for the contrasts or coefficients tested.

  • inc/exc_(...): edgeR statistics corresponding to differential expression testing for raw included / excluded counts in isolation (not of the PSIs).

addPSI_edgeR() appends the following columns to the above output

  • AvgPSI_X: the average percent spliced in / percent IR levels for condition X. Note this is a geometric mean, based on the arithmetic mean of logit PSI values.

  • deltaPSI: The difference in PSI between the mean values of the two conditions.

  • abs_deltaPSI: The absolute value of difference in PSI between the mean values of the two conditions.

Functions

  • fitASE_edgeR(): Use edgeR to fit counts and ASE models with a given design formula

  • fitASE_edgeR_custom(): Use edgeR to fit counts and ASE models with a given design formula

  • testASE_edgeR(): Use edgeR to return differential ASE results. coef and contrast are parsed onto edgeR's glmQLFTest function

  • addPSI_edgeR(): Adds average and delta PSIs of conditions of interest onto results produced by testASE_edgeR(). Note this is done automatically for other methods described in ASE-methods.

References

Lun A, Smyth G (2017). 'No counts, no variance: allowing for loss of degrees of freedom when assessing biological variability from RNA-seq data' Stat Appl Genet Mol Biol, 017 Apr 25;16(2):83-93. https://doi.org/10.1515/sagmb-2017-0010

Examples

# Load the NxtSE object and set up the annotations
# - see ?makeSE on example code of generating this NxtSE object
se <- SpliceWiz_example_NxtSE()

colData(se)$treatment <- rep(c("A", "B"), each = 3)
colData(se)$replicate <- rep(c("P","Q","R"), 2)
require("edgeR")

fit <- fitASE_edgeR(
    se, 
    strModelFormula = "~0 + replicate + treatment", 
    strASEFormula = "~0 + replicate + treatment + treatment:ASE"
)

# Get coefficient terms of Included / Excluded counts isolated model
colnames(fit$model_IncExc)
# [1] "replicateP" "replicateQ" "replicateR" "treatmentB"

# Get coefficient terms of PSI model
colnames(fit$model_ASE)
# [1] "replicateP" "replicateQ" "replicateR" "treatmentB"            
# [5] "treatmentA:ASEIncluded" "treatmentB:ASEIncluded"

# Contrast between treatment "B" against treatment "A"
res <- testASE_edgeR(se, fit,
    contrast_IncExc = c(0,0,0,1),
    contrast_ASE = c(0,0,0,0,-1,1)
)

### # Add mean PSI values to results:
res_withPSI <- addPSI_edgeR(res, se, "treatment", c("B", "A"))


### Using custom model matrices to model counts
#   - the equivalent analysis can be performed as follows:

# Sample annotations for isoform count expressions
colData <- as.data.frame(colData(se))

# Sample annotations for isoform count PSI analysis
colData_ASE <- rbind(colData, colData)
colData_ASE$ASE <- rep(c("Included", "Excluded"), each = nrow(colData))
rownames(colData_ASE) <- c(
    paste0(rownames(colData), ".Included"),
    paste0(rownames(colData), ".Excluded")
)

model_IncExc <- model.matrix(
    ~0 + replicate + treatment,
    data = colData
)

model_ASE <- model.matrix(
    ~0 + replicate + treatment + treatment:ASE,
    data = colData_ASE
)

fit <- fitASE_edgeR_custom(se, model_IncExc, model_ASE)

res_customModel <- testASE_edgeR(se, fit,
    contrast_IncExc = c(0,0,0,1),
    contrast_ASE = c(0,0,0,0,-1,1)
)

# Check this produces identical results:
identical(res_customModel, res)

### Time series examples using edgeR and splines 
# - similar to section 4.8 in the edgeR vignette

colData(se)$timepoint <- rep(c(1,2,3), each = 2)
colData(se)$batch <- rep(c("1", "2"), 3)

# First, we set up a polynomial spline with 2 degrees of freedom:
Time <- poly(colData(se)$timepoint, df = 2)

# Next, we define the batch factor:
Batch <- factor(colData(se)$batch)

# Finally, we construct the same factors for ASE analysis. Note that
#   each factor must be repeated twice

Time_ASE <- rbind(Time, Time)
Batch_ASE <- c(Batch, Batch)
ASE <- factor(
    rep(c("Included", "Excluded"), each = nrow(colData(se)))
)

# Now, we set up the model matrices for isoform and PSI count modelling
model_IncExc <- model.matrix(~0 + Batch + Time)
model_ASE <- model.matrix(~0 + Batch_ASE + Time_ASE + Time_ASE:ASE)

fit <- fitASE_edgeR_custom(se, model_IncExc, model_ASE)

# Note the coefficients of interest in the constructed models:

colnames(model_IncExc)
# [1] "Batch1" "Batch2" "Time1"  "Time2" 

colnames(model_ASE)
# [1] "Batch_ASE1" "Batch_ASE2" "Time_ASE1" "Time_ASE2"
# [5] "Time_ASE1:ASEIncluded" "Time_ASE2:ASEIncluded"

# We are interested in a model in which `Time` is excluded, thus:

res <- testASE_edgeR(se, fit,
    coef_IncExc = 3:4,
    coef_ASE = 5:6
)

# Finally, add PSI values for each time point:

res_withPSI <- addPSI_edgeR(res, se, "timepoint", c(1, 2, 3))


alexchwong/SpliceWiz documentation built on Oct. 15, 2024, 10:12 a.m.