Objective

We would like to be able to use existing R APIs for annotating genes based on gene IDs as the starting point. Thereby we can keep up to date with new Ensembl versions that are published and are flexible whether IGIS (gRED) or GI-HDAP (pRED) is used.

For gRED molecules, the canonical source for Ensembl (Entrez for legacy) annotations is IGIS. Therefore we want to be able to fetch the required feature data columns for the genes based on the IDs from IGIS.

Fortunately, the nice R-package igis is available.

Workflow

  1. Establish a connection to data base
  2. provide a wrapper that loads the right IGIS version based on the gene IDs.
  3. e.g. connection <- connect(prefix = prefix(object), db = "igis") where object is a HermesData object and db identifies the data base to use.
  4. prefix() is a new generic function/method to access slot of same name in HermesData object: it is either "ENSG" or "GeneID" and filled at creation.
  5. In the future we might have other data bases, such as the pRED one, to connect to.
  6. then connection result has a class: here Igis
  7. internally that calls igis::Igis where we use "human" species and the version is determined by the gene IDs format found in object.
  8. Then annotate the HermesData object
  9. e.g. define an annotation method: this can have a getter function, where we just give back the relevant data frame columns from rowData(object).
  10. and then the replacement method can be used like: annotation(object) <- query(genes = genes(object), connection)
  11. genes() is a new method for generic defined in package GenomicFeatures to access gene names in object
  12. So here we need a new generic function query and a method for character, Igis signature.

Prototypes

Additional prefix slot and getter method

Just for illustration we create a new class here, but it would be the same for HermesData and RangedHermesData: just an additional slot prefix.

# this will be folded into both HermesData and RangedHermesData class definitions:
.HD <- setClass(
  Class = "HD",
  contains = "HermesData",
  slots = c(prefix = "character")
)

# this will be like this:
setGeneric("prefix", def = function(object, ...) {
  object@prefix
})

# the constructor then also fills the prefix slot - this will later be folded into HermesData()
# constructor:
HD <- function(object) {
  start <- HermesData(object)
  gene_ids <- rownames(object)
  prefix <- if (all(grepl("^ENSG", gene_ids))) {
    "ENSG"
  } else if (all(grepl("^GeneID", gene_ids))) {
    "GeneID"
  } else {
    stop("hermes currently only supports EnsemblID or EntrezID as gene IDs")
  }
  .HD(
    start,
    prefix = prefix
  )
}

Example

So now our typical example is:

object <- HD(summarized_experiment)
prefix(object)

genes getter method

Actually this is just a synonym for rownames() really. In order to ensure that, we need to add another validation check for HermesData objects: we require that its rownames equal the GeneID column in rowData.

library(igis)

#' @importFrom GenomicFeatures genes
setMethod(
  f = "genes",
  signature = c(x = "AnyHermesData"),
  definition = function(x, ...) {
    rownames(x)
  }
)

# later assert this via validation method:
identical(rownames(object), rowData(object)$GeneID)

# then we can safely use this:
head(genes(object))

Connection function

As mentioned above, this can later be extended easily to other db options.

connect <- function(prefix = c("ENSG", "GeneID"), db = "igis") {
  prefix <- match.arg(prefix)
  db <- match.arg(db)
  if (db == "igis") {
    version <- switch(prefix,
      ENSG = "4.0",
      GeneID = "3.0"
    )
    igis::Igis(species = "human", version = version)
  }
}

connection <- connect(prefix(object))
class(connection)

annotation method

This gives back the relevant data frame columns from rowData(object).

# Define a package constant:
.row_data_annotation_cols <- c(
  "symbol",
  "desc",
  "GeneID",
  "chromosome",
  "size"
)

#' @importFrom BiocGenerics annotation
setMethod(
  f = "annotation",
  signature = c(object = "AnyHermesData"),
  definition = function(object, ...) {
    rowData(object)[, .row_data_annotation_cols]
  }
)

head(annotation(object))

query method

Now we are ready to query from Igis e.g. - again we define this as a generic function so that it can be easily extended once we need other connections.

setGeneric("query", def = function(genes, connection) {})

setMethod(
  f = "query",
  signature = c(genes = "character", connection = "Igis"),
  definition = function(genes, connection) {
    trans <- igis::transcriptsByGeneId(
      geneIds = genes,
      canonical = TRUE,
      igis = connection
    )
    df <- cbind(
      setNames(
        mcols(trans),
        c(
          "symbol",
          "GeneID",
          "desc"
        )
      ),
      Chromosome = as.vector(seqnames(trans)),
      StartBP = start(trans),
      EndBP = end(trans),
      WidthBP = width(trans)
    )
    rownames(df) <- genes
    df[, .row_data_annotation_cols]
  }
)

Let's try this out:

new_annotations <- query(genes(object), connection)
head(new_annotations)

Note: If there is too much time delay between the connection creation and the query, then the error:

could not run statement: MySQL server has gone away 

can be thrown. In that case just create a new connection again.

Annotation replacement method

This basically just uses the data frame replacement method: That part of the rowData which is the annotation gets replaced with the provided data frame.

setReplaceMethod(
  f = "annotation",
  signature = c(object = "AnyHermesData", value = "DataFrame"),
  function(object, value) {
    # In production add here more checks/assertions if needed (e.g. test
    # whether wrong things could be inserted here into rowData) -
    # do we need to align the row names etc.
    # and then finally insert:
    rowData(object)[, .row_data_annotation_cols] <- value[, .row_data_annotation_cols]
    object
  }
)

Let's try:

# first remove some info
# rowData(object)[, "CanonicalTranscript"] <- ""

# then fill in
annotation(object) <- new_annotations

head(annotation(object))

But the class of object is still:

class(object)
rd <- rowData(object)
head(rd)
head(annotation(object))

Annotation feature: getting gene type

Gene type can be found using igis::variationConsequence(), in the biotype slot.

Single gene lookup

For a single gene this works well:

connection <- connect_igis("ENSG")
gene <- "ENSG00000163734"

transcripts <- igis::transcriptsByGeneId(
  geneIds = gene,
  canonical = TRUE,
  igis = connection
)
class(transcripts)
transcripts

cons <- igis::variationConsequence(
  igis = connection,
  allele = "T",
  location = transcripts,
  species = "human"
)

cons$biotype

Multiple genes lookup

So probably we can do the same principle as what was done for igis::transcriptsByGeneId() in that instead of providing one specific gene, we provide a list of all genes. Note that below we also have the other prefix (GeneID which means Entrez or IGIS version 3).

object <- HermesData(summarized_experiment)
prefix(object)
connection <- connect_igis(prefix(object))

all_transcripts <- igis::transcriptsByGeneId(
  geneIds = genes(object),
  canonical = TRUE,
  igis = connection
)

all_transcripts

So let's try this:

all_cons <- igis::variationConsequence(
  igis = connection,
  allele = "T",
  location = all_transcripts,
  species = "human"
)

This gives

Error in getListElement(x, i, ...) : 
  GRanges objects don't support [[, as.list(), lapply(), or unlist() at the moment

So seems to me that there is a bug in igis here. I looked this up on Google and found other people reporting this, and being referred to R 4.0.3 with later BioConductor release. I tested this in my local RStudio with R 4.0.3 and indeed this runs there fine:

library(igis)
connection <- Igis(species = "human", version = "4.0")
genes <- c("ENSG00000163734", "ENSG00000163735")
transcripts <- igis::transcriptsByGeneId(
  geneIds = genes,
  canonical = TRUE,
  igis = connection
)
cons <- variationConsequence(
  igis = connection,
  allele = "T",
  location = transcripts,
  species = "human"
)
cons$biotype

As the NEST team overall will switch to R 4.0.x in the next cycle, it seems better to postpone this feature implementation to the next cycle.

Alternative considered

As an alternative we could have worked around this for now, e.g. via querying each range separately:

for (i in 1:length(genes(object))) {
  gene_type <- igis::variationConsequence(
    allele = "T",
    location = transcripts[i, ],
    igis = connection,
    species = "human"
  )

  df <- cbind(transcripts[i, ], gene_type$biotype)
}

However this will be pretty slow, so does not seem like a good option. Also, the feature request is not urgent at all, so it's fine to have this in an upcoming release.



insightsengineering/hermes documentation built on Dec. 15, 2024, 8:07 a.m.