knitr::opts_chunk$set(comment = "", message=FALSE, warning = FALSE)
To install and load RNAAgeCalc
if (!requireNamespace("BiocManager", quietly = TRUE)) install.packages("BiocManager") BiocManager::install("RNAAgeCalc")
library(RNAAgeCalc)
It has been shown that both DNA methylation and RNA transcription are linked to chronological age and age related diseases. Several estimators have been developed to predict human aging from DNA level and RNA level. Most of the human transcriptional age predictor are based on microarray data and limited to only a few tissues. To date, transcriptional studies on aging using RNASeq data from different human tissues is limited. The aim of this package is to provide a tool for across-tissue and tissue-specific transcriptional age calculation based on Genotype-Tissue Expression (GTEx) RNASeq data [@lonsdale2013genotype].
We utilized the GTEx data to construct our across-tissue and tissue-specific
transcriptional age calculator. GTEx is a public available genetic database
for studying tissue specific gene expression and regulation. GTEx V6 release
contains gene expression data at gene, exon, and transcript level of 9,662
samples from 30 different tissues. To avoid the influence of tumor on gene
expression, the 102 tumor samples from GTEx V6 release are dropped and
the remaining 9,560 samples were used in the subsequent analysis. To
facilitate integrated analysis and direct comparison of multiple datasets,
we utilized recount2 [@collado2017reproducible] version of GTEx data, where
all samples were processed with the same analytical pipeline. FPKM values
were calculated for each individual sample using getRPKM
function in
Bioconductor package
recount.
For the tissue-specific RNASeq age calculator, elastic net [@zou2005regularization] algorithm was used to train the predictors for each individual tissue. Chronological age was response variable whereas logarithm transformed FPKM of genes were predictors. The across-tissue calculator was constructed by first performing differential expression analysis on the RNASeq counts data for each individual tissue. To identify genes consistently differentially expressed across tissues, we adapted the binomial test discussed in de Magalhaes et al. [@de2009meta] to find the genes with the largest number of age-related signals. A detailed explanation can be found in our paper.
The package is implemented as follows. For each tissue, signature and sample type (see below for the descriptions), we pre-trained the calculator using elastic net based on the GTEx samples. We saved the pre-trained model coefficients as internal data in the package. The package takes gene expression data as input and then match the input genes to the genes in the internal data. This matching process is automatic so that the users just need to provide gene expression data without having to pull out the internal coefficients.
The main functions to calculate RNASeq age are predict_age
and
predict_age_fromse
. Users can use either of them. predict_age
function
takes data frame as input whereas predict_age_fromse
function takes
SummarizedExperiment
as input. Both functions work in the same way internally. In this section,
we explain the arguments in predict_age
and predict_age_fromse
respectively.
exprdata
is a matrix or data frame which contains gene expression data
with each row represents a gene and each column represents a sample. Users are
expected to use the argument exprtype
to specify raw count or FPKM
(see below). The rownames of exprdata
should be gene ids and colnames
of exprdata
should be sample ids. Here is an example of FPKM expression data:
data(fpkmExample) head(fpkm)
tissue
is a string indicates which tissue the gene expression data is
obtained from. Users are expected to provide one of the following tissues.
If the tissue argument is not provide or the provided tissue is not in this
list, the age predictor trained on all tissues will be used to calculate
RNA age.
exprtype
is either "counts" or "FPKM". If exprtype
is counts, the
expression data will be converted to FPKM by count2FPKM
automatically and
the calculator will be applied on FPKM data. When calculating FPKM, by default
gene length is obtained from the package's internal database. The internal
gene length information was obtained from recount2. However, users are able
to provide their own gene length information by using genelength
argument
(see below).
idtype
is a string which indicates the gene id type in exprdata
. Default
is "SYMBOL". The following id types are supported.
stype
is a string which specifies which version of pre-trained calculators
to be used. Two versions are provided. If stype="all"
, the calculator
trained on samples from all races (American Indian/Alaska Native, Asian,
Black/African American, and Caucasian) will be used. If stype="Caucasian"
,
the calculator trained on Caucasian samples only will be used. We found that
RNA Age signatures could be different in different races (see our paper for
details). Thus we provide both the universal calculator and race specific
calculator. The race specific calculator for American Indian/Alaska Native,
Asian, or Black/African American are not provided due to the small sample
size in GTEx data.
signature
is a string which indicate the age signature to use when
calculating RNA age. This argument is not required.
In the case that this argument is not provided, if tissue
argument is also
provided and the tissue is in the list above, the tissue specific age
signature given by our DESeq2 analysis result on GTEx data will be used.
Otherwise, the across tissue signature "GTExAge" will be used.
In the case that this argument is provided, it should be one of the following signatures.
If the genes in exprdata
do not cover all the genes in the signature,
imputation will be made automatically by the impute.knn
function in
Bioconductor package
impute.
genelength
is a vector of gene length in bp. The size of genelength
should
be equal to the number of rows in exprdata
. This argument is optional.
When using exprtype = "FPKM"
, genelength
argument is ignored. When using
exprtype = "counts"
, the raw count will be converted to FPKM. If genelength
is provided, the function will convert raw count to FPKM based on the
user-supplied gene length. Otherwise, gene length is obtained from the
internal database.
chronage
is a data frame which contains the chronological age of each
sample. This argument is optional.
If provided, it should be a dataframe with 1st column sample id and 2nd column
chronological age. The sample order in chronage
doesn't have to be in the
same order as in exprdata
. However, the samples in chronage
and exprdata
should be the same. If some samples' chronological age are not available,
users are expected to set the chronological age in chronage
to NA. If
chronage
contains more than 2 columns, only the first 2 columns will be
considered. If more than 30 samples' chronological age are available, age
acceleration residual will be calculated. Age acceleration residual is
defined as the residual of linear regression with RNASeq age as dependent
variable and chronological age as independent variable.
If this argument is not provided, the age acceleration residual will not be calculated.
chronage = data.frame(sampleid = colnames(fpkm), age = c(30,50)) res = predict_age(exprdata = fpkm, tissue = "brain", exprtype = "FPKM", chronage = chronage) head(res)
In the above example, we calculated the RNASeq age for 2 samples based on
their gene expression data coming from brain. Since the sample size is small,
age acceleration residual are not calculated.
Here is an example with sample size > 30:
# This example is just for illustration purpose. It does not represent any # real data. # construct a large gene expression data fpkm_large = cbind(fpkm, fpkm+1, fpkm+2, fpkm+3) fpkm_large = cbind(fpkm_large, fpkm_large, fpkm_large, fpkm_large) colnames(fpkm_large) = paste0("sample",1:32) # construct the samples' chronological age chronage2 = data.frame(sampleid = colnames(fpkm_large), age = 31:62) res2 = predict_age(exprdata = fpkm_large, exprtype = "FPKM", chronage = chronage2) head(res2)
The main difference between predict_age_fromse
and predict_age
is that
predict_age_fromse
takes SummarizedExperiment as input. The se
argument
is a SummarizedExperiment object.
assays(se)
should contain gene expression data. The name of
assays(se)
should be either "FPKM" or "counts". Use exprtype
argument to
specify the type of gene expression data provided. colData(se)
. This is optional. If provided, the column name for chronological
age in colData(se)
should be "age". If some samples' chronological age are
not available, users are expected to set the chronological age in colData(se)
to NA. If chronological age is not provided, the age acceleration residual will
not be calculated. rowData(se)
. This is
also optional. If using exprtype = "FPKM"
, the provided gene length will be
ignored. If provided, the column name for gene length in rowData(se)
should
be "bp_length". The function will convert raw count to FPKM by the
user-supplied gene length. Otherwise, gene length is obtained from the internal
database. tissue
, exprtype
, idtype
, stype
, signature
are
exactly the same as described in predict_age
function.library(SummarizedExperiment) colData = data.frame(age = c(40, 50)) se = SummarizedExperiment(assays=list(FPKM=fpkm), colData=colData) res3 = predict_age_fromse(se = se, exprtype = "FPKM") head(res3)
We suggest visualizing the results by plotting RNAAge vs chronological age.
This can be done by calling makeplot
function and passing in the data frame
returned by predict_age
function.
makeplot(res2)
sessionInfo()
Any scripts or data that you put into this service are public.
Add the following code to your website.
For more information on customizing the embed code, read Embedding Snippets.