library(knitr) options(width=80) knitr::opts_chunk$set( collapse=TRUE, comment="", fig.align="center", fig.wide=TRUE )
r Biocpkg("GenomicScores")
is an R package distributed as part of the
Bioconductor project. To install the package, start R and enter:
install.packages("BiocManager") BiocManager::install("GenomicScores")
Once r Biocpkg("GenomicScores")
is installed, it can be loaded with the following command.
library(GenomicScores)
Often, however, r Biocpkg("GenomicScores")
will be automatically loaded when
working with an annotation package that uses r Biocpkg("GenomicScores")
, such
as r Biocpkg("phastCons100way.UCSC.hg38")
.
Genomewide scores assign each genomic position a numeric value denoting an estimated measure of constraint or impact on variation at that position. They are commonly used to filter single nucleotide variants or assess the degree of constraint or functionality of genomic features. Genomic scores are built on the basis of different sources of information such as sequence homology, functional domains, physical-chemical changes of amino acid residues, etc.
One particular example of genomic scores are phastCons scores. They provide a
measure of conservation obtained from genomewide alignments using the program
phast (Phylogenetic Analysis with Space/Time models)
from @siepel05. The r Biocpkg("GenomicScores")
package allows one to retrieve
these scores through annotation packages (Section
\@ref(availability-and-retrieval-of-genomic-scores)) or as
r Biocpkg("AnnotationHub")
resources (Section
\@ref(genomic-scores-as-annotationhub-resources)).
Often, genomic scores such as phastCons are used within workflows running on
top of R and Bioconductor. The purpose of the r Biocpkg("GenomicScores")
package is to enable an easy and interactive access to genomic scores within
those workflows.
Storing and accessing genomic scores within R is challenging when their values cover large regions of the genome, resulting in gigabytes of double-precision numbers. This is the case, for instance, for phastCons [@siepel05] or CADD [@kircher14].
We address this problem by using lossy compression, also called quantization, coupled with run-length encoding (Rle) vectors. Lossy compression attempts to trade off precision for compression without compromising the scientific integrity of the data [@zender16].
Sometimes, measurements and statistical estimates under certain models generate false precision. False precision is essentialy noise that wastes storage space and it is meaningless from the scientific point of view [@zender16]. In those circumstances, lossy compression not only saves storage space, but also removes false precision.
The use of lossy compression leads to a subset of quantized values much
smaller than the original set of genomic scores, resulting in long runs of
identical values along the genome. These runs of identical values can be
further compressed using the implementation of Rle vectors available in the
r Biocpkg("S4Vectors")
Bioconductor package.
To enable a seamless access to genomic scores stored with quantized values
in compressed vectors the r Biocpkg("GenomicScores")
defines the GScores
class of objects. This class manages the location, loading and dequantization
of genomic scores stored separately on each chromosome, while it also provides
rich metadata on the provenance, citation and licensing of the original data.
The access to genomic scores through GScores
objects is available either
through annotation packages or as r Biocpkg("AnnotationHub")
resources. To
find out what kind of genomic scores are available, through which mechanism,
and in which organism, we may use the function availableGScores()
.
avgs <- availableGScores() avgs
For example, if we want to use the phastCons conservation scores available
through the annotation package r Biocpkg("phastCons100way.UCSC.hg38")
, we
should first install it (we only need to do this once).
BiocManager::install("phastCons100way.UCSC.hg38")
Second, we should load the package, and a GScores
object will be created and
named after the package name, during the loading operation. It is often handy
to shorten that name.
library(phastCons100way.UCSC.hg38) phast <- phastCons100way.UCSC.hg38 class(phast)
Typing the name of the GScores
object shows a summary of its contents and
some of its metadata.
phast
The bibliographic reference to cite the genomic score data stored in a GScores
object can be accessed using the citation()
method either on the package name
(in case of annotation packages), or on the GScores
object.
citation(phast)
Other methods tracing provenance and other metadata are provider()
,
providerVersion()
, organism()
and seqlevelsStyle()
; please consult
the help page of the GScores
class for a comprehensive list of available
methods.
provider(phast) providerVersion(phast) organism(phast) seqlevelsStyle(phast)
To retrieve genomic scores for specific consecutive positions we should use the
method gscores()
, as follows.
gscores(phast, GRanges(seqnames="chr22", IRanges(start=50528591:50528596, width=1)))
For a single position we may use this other GRanges()
constructor.
gscores(phast, GRanges("chr22:50528593"))
We may also retrieve the score values only with the method score()
.
score(phast, GRanges(seqnames="chr22", IRanges(start=50528591:50528596, width=1))) score(phast, GRanges("chr22:50528593"))
Let's illustrate how to retrieve phastCons scores using data from the GWAS
catalog available through the Bioconductor package r Biocpkg("gwascat")
. For
the purpose of this vignette, we will filter the GWAS catalog data by (1)
discarding entries with NA
values in either chromosome name or position, or
with multiple positions; (2) storing the data into a GRanges
object, including
the GWAS catalog columns STRONGEST SNP-RISK ALLELE
and MAPPED_TRAIT
, and the
reference and alternate alleles, as metadata columns; (4) restricting variants
to those located in chromosomes 20 to 22; and (3) excluding variants with
multinucleotide alleles, or where reference and alternate alleles are identical.
library(BSgenome.Hsapiens.UCSC.hg38) library(gwascat) gwc <- get_cached_gwascat() mask <- !is.na(gwc$CHR_ID) & !is.na(gwc$CHR_POS) & !is.na(as.integer(gwc$CHR_POS)) gwc <- gwc[mask, ] grstr <- sprintf("%s:%s-%s", gwc$CHR_ID, gwc$CHR_POS, gwc$CHR_POS) gwcgr <- GRanges(grstr, RISK_ALLELE=gwc[["STRONGEST SNP-RISK ALLELE"]], MAPPED_TRAIT=gwc$MAPPED_TRAIT) seqlevelsStyle(gwcgr) <- "UCSC" mask <- seqnames(gwcgr) %in% c("chr20", "chr21", "chr22") gwcgr <- gwcgr[mask] ref <- as.character(getSeq(Hsapiens, gwcgr)) alt <- gsub("rs[0-9]+-", "", gwcgr$RISK_ALLELE) mask <- (ref %in% c("A", "C", "G", "T")) & (alt %in% c("A", "C", "G", "T")) & nchar(alt) == 1 & ref != alt gwcgr <- gwcgr[mask] mcols(gwcgr)$REF <- ref[mask] mcols(gwcgr)$ALT <- alt[mask]
library(BSgenome.Hsapiens.UCSC.hg38) library(gwascat) gwcgr <- readRDS(system.file("extdata", "gwcgr_chr20-22_151123.rds", package="GenomicScores"))
gwcgr
Finally, let's obtain the phastCons scores for this GWAS catalog variant set, and examine their summary and cumulative distribution.
pcsco <- score(phast, gwcgr) summary(pcsco) round(cumsum(table(na.omit(pcsco))) / sum(!is.na(pcsco)), digits=2)
We can observe that only 10% of the variants in chromosomes 20 to 22 have a conservation phastCons score above 0.5. Let's examine which traits have more fully conserved variants.
xtab <- table(gwcgr$MAPPED_TRAIT[pcsco == 1]) head(xtab[order(xtab, decreasing=TRUE)])
The r Biocpkg("AnnotationHub")
(AH), is a Bioconductor web resource that
provides a central location where genomic files (e.g., VCF, bed, wig) and other
resources from standard (e.g., UCSC, Ensembl) and distributed sites, can be
found. An AH web resource creates and manages a local cache of files retrieved
by the user, helping with quick and reproducible access.
We can quickly check for the available AH resources by subsetting as follows
the resources names from the previous table obtained with availableGScores()
.
rownames(avgs)[avgs$AnnotationHub]
The selected resource can be downloaded with the function getGScores(). After the resource is downloaded the first time, the cached copy will enable a quicker retrieval later. Let's download other conservation scores, the phyloP scores [@pollard2010detection], for human genome version hg38.
obj20 <- readRDS(system.file("extdata", "phyloP100way.UCSC.hg38.chr20.rds", package="GenomicScores")) obj21 <- readRDS(system.file("extdata", "phyloP100way.UCSC.hg38.chr21.rds", package="GenomicScores")) obj22 <- readRDS(system.file("extdata", "phyloP100way.UCSC.hg38.chr22.rds", package="GenomicScores")) mdobj <- metadata(obj20) phylop <- GScores(provider=mdobj$provider, provider_version=mdobj$provider_version, download_url=mdobj$download_url, download_date=mdobj$download_date, reference_genome=mdobj$reference_genome, data_pkgname=mdobj$data_pkgname, data_dirpath="../inst/extdata", data_serialized_objnames=c(phyloP100way.UCSC.hg38.chr20.rds="phyloP100way.UCSC.hg38.chr20.rds", phyloP100way.UCSC.hg38.chr21.rds="phyloP100way.UCSC.hg38.chr21.rds", phyloP100way.UCSC.hg38.chr22.rds="phyloP100way.UCSC.hg38.chr22.rds"), data_tag="phyloP", license="", license_url="", license_reqconsent=FALSE) gscopops <- get(mdobj$data_pkgname, envir=phylop@.data_cache) gscopops[["default"]] <- RleList(compress=FALSE) gscopops[["default"]][[mdobj$seqname]] <- obj20 assign(mdobj$data_pkgname, gscopops, envir=phylop@.data_cache)
phylop <- getGScores("phyloP100way.UCSC.hg38")
phylop
Let's retrieve the phyloP conservation scores for the previous set of GWAS catalog variants and compare them in Figure \@(fig:phastvsphylop).
ppsco <- score(phylop, gwcgr) plot(pcsco, ppsco, xlab="phastCons", ylab="phyloP", cex.axis=1.2, cex.lab=1.5, las=1)
We may observe that the values match in a rather discrete manner due to the
quantization of the scores. In the case of the phastCons annotation package
r Biocpkg("phastCons100way.UCSC.hg38")
, the GScore
object gives access
in fact to two score populations, the default one in which conservation scores
are rounded to 1 decimal place, and an alternative one, named DP2
, in which
they are rounded to 2 decimal places. To figure out what are the available
score populations in a GScores
object, we should use the method
populations()
.
populations(phast)
Whenever one of these populations is called default
, this is the one used
by default. In other cases we can find out which is the default population as
follows:
defaultPopulation(phast)
To use one of the available score populations we should use the argument pop
in the corresponding method, as follows.
pcsco2 <- score(phast, gwcgr, pop="DP2") head(pcsco2)
Figure \@ref(fig:phastvsphylop2) below shows again the comparison of phastCons and phyloP conservation scores, this time at the higher resolution provided by the phastCons scores rounded at two decimal places.
plot(pcsco2, ppsco, xlab="phastCons", ylab="phyloP", cex.axis=1.2, cex.lab=1.5, las=1)
Retrieving genomic scores through AnnotationHub
resources requires an internet
connection and we may want to work with such resources offline, for instance in
high-performance computing (HPC) environments. For that purpose, we can create
ourselves an annotation package, such as
phastCons100way.UCSC.hg38,
from a GScores
object corresponding to a downloaded AnnotationHub
resource.
To do that we use the function makeGScoresPackage()
as follows:
makeGScoresPackage(phast, maintainer="Me <me@example.com>", author="Me", version="1.0.0")
cat("Creating package in ./phastCons100way.UCSC.hg38\n")
An argument, destDir
, which by default points to the current working
directory, can be used to change where in the filesystem the package is created.
Afterwards, we should still build and install the package via, e.g.,
R CMD build
and R CMD INSTALL
, to be able to use it offline.
One particular type of genomic scores that are accessible through
the GScores
class is minor allele frequency (MAF) data.
There are currently 15 annotation packages that store MAF values
using the r Biocpkg("GenomicScores")
package, named using the
prefix MafDb
or MafH5
; see Table \@ref(tab:tableMafDb) below.
Annotation Package | Description
--------------------------- | --------------------------------------------------------------------------------------------
r Biocpkg("MafDb.1Kgenomes.phase1.hs37d5")
| MAF data from the 1000 Genomes Project Phase 1 for the human genome version GRCh37.
r Biocpkg("MafDb.1Kgenomes.phase1.GRCh38")
| MAF data from the 1000 Genomes Project Phase 1 for the human genome version GRCh38.
r Biocpkg("MafDb.1Kgenomes.phase3.hs37d5")
| MAF data from the 1000 Genomes Project Phase 3 for the human genome version GRCh37.
r Biocpkg("MafDb.1Kgenomes.phase3.GRCh38")
| MAF data from the 1000 Genomes Project Phase 3 for the human genome version GRCh38.
r Biocpkg("MafDb.ExAC.r1.0.hs37d5")
| MAF data from ExAC 60706 exomes for the human genome version GRCh37.
r Biocpkg("MafDb.ExAC.r1.0.GRCh38")
| MAF data from ExAC 60706 exomes for the human genome version GRCh38.
r Biocpkg("MafDb.ExAC.r1.0.nonTCGA.hs37d5")
| MAF data from ExAC 53105 nonTCGA exomes for the human genome version GRCh37.
r Biocpkg("MafDb.ExAC.r1.0.nonTCGA.GRCh38")
| MAF data from ExAC 53105 nonTCGA exomes for the human genome version GRCh38.
r Biocpkg("MafDb.gnomAD.r2.1.hs37d5")
| MAF data from gnomAD 15496 genomes for the human genome version GRCh37.
r Biocpkg("MafDb.gnomAD.r2.1.GRCh38")
| MAF data from gnomAD 15496 genomes for the human genome version GRCh38.
r Biocpkg("MafDb.gnomADex.r2.1.hs37d5")
| MAF data from gnomADex 123136 exomes for the human genome version GRCh37.
r Biocpkg("MafDb.gnomADex.r2.1.GRCh38")
| MAF data from gnomADex 123136 exomes for the human genome version GRCh38.
r Biocpkg("MafH5.gnomAD.v4.0.GRCh38")
| MAF data from gnomAD 76156 genomes for the human genome version GRCh38.
r Biocpkg("MafDb.TOPMed.freeze5.hg19")
| MAF data from NHLBI TOPMed 62784 genomes for the human genome version GRCh37.
r Biocpkg("MafDb.TOPMed.freeze5.hg38")
| MAF data from NHLBI TOPMed 62784 genomes for the human genome version GRCh38.
: (#tab:tableMafDb) Bioconductor annotation packages storing MAF data.
In this type of package, the scores populations correspond to populations of individuals from which the MAF data were derived, and all MAF data were compressed using a precision of one significant figure for MAF < 0.1 and two significant figures for MAF >= 0.1. Let's load the MAF package for the release v4.0 of gnomAD [@chen2024genomic].
library(MafH5.gnomAD.v4.0.GRCh38) mafh5 <- MafH5.gnomAD.v4.0.GRCh38 mafh5 populations(mafh5)
Let's retrieve the gnomAD MAF values for the previous GWAS catalog variant set and examine its distribution, and how many variants occur in less than 1% of all gnomAD populations and what fraction do they represent among the analyzed variants.
mafs <- score(mafh5, gwcgr, pop="AF_allpopmax") summary(mafs) sum(mafs < 0.01, na.rm=TRUE) sum(mafs < 0.01, na.rm=TRUE) / sum(!is.na(mafs))
Finally, let's examine which traits have more such rare variants.
xtab <- table(gwcgr$MAPPED_TRAIT[mafs < 0.01]) head(xtab[order(xtab, decreasing=TRUE)])
Among the score sets available as
AnnotationHub
web resources shown in the previous section, some of them, such as CADD
[@kircher14], M-CAP [@jagadeesh16] or AlphaMissense [@cheng2023accurate],
provide multiple scores per genomic position that capture the tolerance to
mutations of single nucleotides. Such type of scores, often used to establish
the potential pathogenicity of variants, are sometimes released under some sort
of license for a non-commercial use. In such cases, the function getGScores()
will ask us interactively to accept the license. We can also set the argument
accept.license=TRUE
to accept it non-interactively. We will illustrate such
a case using the AlphaMissense scores [@cheng2023accurate].
obj20 <- readRDS(system.file("extdata", "AlphaMissense.v2023.hg38.chr20.rds", package="GenomicScores")) obj21 <- readRDS(system.file("extdata", "AlphaMissense.v2023.hg38.chr21.rds", package="GenomicScores")) obj22 <- readRDS(system.file("extdata", "AlphaMissense.v2023.hg38.chr22.rds", package="GenomicScores")) mdobj <- metadata(obj20) am23 <- GScores(provider=mdobj$provider, provider_version=mdobj$provider_version, download_url=mdobj$download_url, download_date=mdobj$download_date, reference_genome=mdobj$reference_genome, data_pkgname=mdobj$data_pkgname, data_dirpath="../inst/extdata", data_serialized_objnames=c(AlphaMissense.v2023.hg38.chr20.rds="AlphaMissense.v2023.hg38.chr20.rds", AlphaMissense.v2023.hg38.chr21.rds="AlphaMissense.v2023.hg38.chr21.rds", AlphaMissense.v2023.hg38.chr22.rds="AlphaMissense.v2023.hg38.chr22.rds"), data_tag="AlphaMissense", license=mdobj$license, license_url=mdobj$license_url, license_reqconsent=TRUE) gscopops <- get(mdobj$data_pkgname, envir=am23@.data_cache) gscopops[["default"]] <- RleList(compress=FALSE) gscopops[["default"]][[mdobj$seqname]] <- obj20 assign(mdobj$data_pkgname, gscopops, envir=am23@.data_cache)
am23 <- getGScores("AlphaMissense.v2023.hg38") These data is shared under the license CC BY-NC-SA 4.0 (see https://creativecommons.org/licenses/by-nc-sa/4.0), do you accept it? [y/n]: y
Let's retrieve the AlphaMissense scores for the reference and alternate alleles in our GWAS catalog variant set.
am23 amsco <- score(am23, gwcgr, ref=gwcgr$REF, alt=gwcgr$ALT) summary(amsco)
Using the cutoffs for AlphaMissense scores reported in [@cheng2023accurate] to classify variants into "likely benign", "ambiguous" and "likely pathogenic", and 0.01 and 0.1 as MAF cutoffs, let's cross-tabulate the proportions of these two factors.
mask <- !is.na(amsco) & !is.na(mafs) amscofac <- cut(amsco[mask], breaks=c(0, 0.34, 0.56, 1)) amscofac <- relevel(amscofac, ref="(0.56,1]") maffac <- cut(mafs[mask], breaks=c(0, 0.01, 0.1, 1)) xtab <- table(maffac, amscofac) t(xtab) xtab <- t(xtab / rowSums(xtab)) round(xtab, digits=2)
Figure \@ref(fig:ChengEtAl2024Fig5b) below displays graphically these proportions in an analogous way to the one shown in Figure 5B from @cheng2023accurate. While these proportions are quite different to the original figure, due to the much lower number of variants analyzed here, we still can see, like in [@cheng2023accurate], that the proportion of variants classified as likely pathogenic by AlphaMissense scores is much larger for rare variants with MAF < 0.01 than for common variants with MAF > 0.01.
par(mar=c(4, 5, 3, 1)) barplot(xtab, horiz=TRUE, col=c("red", "royalblue", "gray"), xlab="Proportion", ylab="MAF", cex.axis=1.2, cex.names=1.2, cex.lab=1.5) par(xpd=TRUE) legend(0.17, 4.1, c("likely pathogenic", "likely benign", "ambiguous"), fill=c("red", "royalblue", "gray"), inset=0.01, horiz=TRUE) par(xpd=FALSE)
The input genomic ranges to the gscores()
method may have widths larger than one
nucleotide. In those cases, and when there is only one score per position, the
gscores()
method calculates, by default, the arithmetic mean of the scores across
each range.
gr1 <- GRanges(seqnames="chr22", IRanges(start=50528591:50528596, width=1)) gr1sco <- gscores(phast, gr1) gr1sco mean(gr1sco$default) gr2 <- GRanges("chr22:50528591-50528596") gscores(phast, gr2)
However, we may change the way in which scores from multiple-nucleotide ranges are
summarized with the argument summaryFun
, as follows.
gscores(phast, gr2, summaryFun=max) gscores(phast, gr2, summaryFun=min) gscores(phast, gr2, summaryFun=median)
A typical use case of the r Biocpkg("GenomicScores")
package is in the context
of annotating variants with genomic scores, such as phastCons conservation
scores. For this purpose, we load the r Biocpkg("VariantAnnotaiton")
and
r Biocpkg("TxDb.Hsapiens.UCSC.hg38.knownGene")
packages. The former will allow
us annotate variants, and the latter contains the gene annotations from UCSC
that will be used in this process.
library(VariantAnnotation) library(TxDb.Hsapiens.UCSC.hg38.knownGene) txdb <- TxDb.Hsapiens.UCSC.hg38.knownGene
We annotate the location of previous set of filtered GWAS variants, using the
function locateVariants()
from the r Biocpkg("VariantAnnotation")
package.
loc <- locateVariants(gwcgr, txdb, AllVariants()) loc[1:3] table(loc$LOCATION)
Now we annotate phastCons conservation scores on the variants and store
those annotations as an additional metadata column of the GRanges
object.
For this specific purpose we should use the method score()
that returns
the genomic scores as a numeric vector, instead of doing it as a metadata
column in the input ranges object, done by the gscores()
function.
loc$PHASTCONS <- score(phast, loc, pop="DP2") loc[1:3]
Using the following code we can examine the distribution of phastCons conservation scores of variants across the different annotated regions, shown in Figure \@ref(fig:plot1).
x <- split(loc$PHASTCONS, loc$LOCATION) mask <- elementNROWS(x) > 0 boxplot(x[mask], ylab="phastCons score", las=1, cex.axis=1.2, cex.lab=1.5, col="gray") points(1:length(x[mask])+0.25, sapply(x[mask], mean, na.rm=TRUE), pch=23, bg="black")
Next, we can annotate AlphaMissense and CADD scores as follows. Note that we
use the QUERYID
column of the annotations to fetch back reference and
alternative alleles from the original data container.
loc$AM <- score(am23, loc, ref=gwcgr$REF[loc$QUERYID], alt=gwcgr$ALT[loc$QUERYID])
obj20 <- readRDS(system.file("extdata", "cadd.v1.6.hg38.chr20.rds", package="GenomicScores")) obj21 <- readRDS(system.file("extdata", "cadd.v1.6.hg38.chr21.rds", package="GenomicScores")) obj22 <- readRDS(system.file("extdata", "cadd.v1.6.hg38.chr22.rds", package="GenomicScores")) mdobj <- metadata(obj20) cadd <- GScores(provider=mdobj$provider, provider_version=mdobj$provider_version, download_url=mdobj$download_url, download_date=mdobj$download_date, reference_genome=mdobj$reference_genome, data_pkgname=mdobj$data_pkgname, data_dirpath="../inst/extdata", data_serialized_objnames=c(cadd.v1.6.hg38.chr20.rds="cadd.v1.6.hg38.chr20.rds", cadd.v1.6.hg38.chr21.rds="cadd.v1.6.hg38.chr21.rds", cadd.v1.6.hg38.chr22.rds="cadd.v1.6.hg38.chr22.rds"), data_tag="CADD") gscopops <- get(mdobj$data_pkgname, envir=cadd@.data_cache) gscopops[["default"]] <- RleList(compress=FALSE) gscopops[["default"]][[mdobj$seqname]] <- obj20 assign(mdobj$data_pkgname, gscopops, envir=cadd@.data_cache)
cadd loc$CADD <- score(cadd, loc, ref=gwcgr$REF[loc$QUERYID], alt=gwcgr$ALT[loc$QUERYID])
Using the code below we can produce the plot of Figure \@ref(fig:am23vscadd) comparing AlphaMissense and CADD scores and labeling the location of the variants from which they are derived.
library(RColorBrewer) par(mar=c(4, 5, 1, 1)) hmcol <- colorRampPalette(brewer.pal(nlevels(loc$LOCATION), "Set1"))(nlevels(loc$LOCATION)) plot(loc$AM, jitter(loc$CADD, factor=2), pch=19, col=hmcol, xlab="AlphaMissense scores", ylab="CADD scores", las=1, cex.axis=1.2, cex.lab=1.5, panel.first=grid()) legend("bottomright", levels(loc$LOCATION), pch=19, col=hmcol, inset=0.01)
sessionInfo()
Add the following code to your website.
For more information on customizing the embed code, read Embedding Snippets.