suppressPackageStartupMessages({
  library(ggplot2)
  library(plotly)
  library(metametrics)
  library(ssrch)
})
Using the Omicidx system, we harvested metadata about human samples for which RNA-seq data was deposited in NCBI SRA.
We work with a subset of 1009 studies for which a cancer-related term was present in the study title as recorded at NCBI SRA.
library(ggplot2)
library(plotly)
library(metametrics)
data(study_publ_dates) # harvesting omicidx early 2019
library(lubridate)
ds_ca = DocSet_ca1009()
ds_ca
We accumulate (over dates of study submissions) the set of fields used in the sample annotation of the 1009 cancer studies.
study_publ_dates = na.omit(study_publ_dates)
studs1009 = ls(docs2kw(ds_ca)) # in cancer corpus
stud_dates = as_datetime(study_publ_dates[,2])
names(stud_dates) = study_publ_dates[,1]
stud_dates = stud_dates[studs1009] # limit to corpus
stud_dates = sort(stud_dates)
ofields = lapply(names(stud_dates), function(x) names(retrieve_doc(x, ds_ca)))
freqs = table(unlist(ofields))
#sort(freqs,decreasing=TRUE)[1:20]
cumfields = ofields
for (i in 2:length(cumfields)) cumfields[[i]] = union(cumfields[[i]], cumfields[[i-1]])
csiz = sapply(cumfields,length)
bag_fields_ca1009 = unique(unlist(cumfields))
nfields = length(bag_fields_ca1009)
mydf = data.frame(date_published=stud_dates, nfields=csiz)
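The accumulation step can be checked on a small example. The sketch below uses three hypothetical studies (`toy_fields`, `cum_toy`, and `csiz_toy` are illustrative names, not objects from metametrics) and assumes the list is already sorted by publication date. It uses `Reduce(..., accumulate = TRUE)` as one possible alternative to the explicit loop.

```r
# Toy stand-in for `ofields`: field names used by three hypothetical studies,
# assumed to be ordered by publication date already.
toy_fields <- list(
  S1 = c("tissue", "age"),
  S2 = c("tissue", "sex"),
  S3 = c("age", "cell_line", "treatment")
)

# Running union: element i holds every field seen in studies 1..i.
cum_toy <- Reduce(union, toy_fields, accumulate = TRUE)

# Size of the accumulated vocabulary after each study.
csiz_toy <- lengths(cum_toy)
print(csiz_toy)
```

The sizes can only stay flat or grow, because each element is a union with everything before it.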
The growth in size of the set of fields in use over time is displayed here:
ggplot(mydf, aes(x=date_published, y=nfields)) + geom_point()
library(plotly)
ddf = data.frame(date=stud_dates[-1], newly_introduced_fields=diff(csiz),
   study=paste0(names(stud_dates[-1]), "\na"))
The next display is interactive -- hover over points to see study accession number and newly introduced field names.
incrs = lapply(2:length(cumfields), function(x) setdiff(cumfields[[x]], cumfields[[x-1]]))
incrs = unlist(lapply(incrs, function(x) paste0(x, collapse="\n")))
sn = names(stud_dates[-1])
incrs = paste(sn, incrs, sep="\n")
dddf = cbind(ddf, incrs)
g2 = ggplot(dddf, aes(x=date, y=newly_introduced_fields, text=incrs)) + geom_point()
ggplotly(g2)
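The hover text is built from set differences between consecutive cumulative field sets. A minimal sketch on made-up data follows; `cum_toy`, `new_toy`, and `labels_toy` are hypothetical names used only for illustration.

```r
# Hypothetical cumulative field sets for three studies, in date order.
cum_toy <- list(
  c("tissue", "age"),
  c("tissue", "age", "sex"),
  c("tissue", "age", "sex", "cell_line", "treatment")
)

# Fields that appear for the first time at study i (i = 2, 3, ...).
new_toy <- lapply(2:length(cum_toy),
                  function(i) setdiff(cum_toy[[i]], cum_toy[[i - 1]]))

# One newline-separated label per study, suitable for a tooltip.
labels_toy <- vapply(new_toy, paste0, character(1), collapse = "\n")

# The number of new fields matches the first difference of the set sizes.
n_new <- lengths(new_toy)
print(n_new)
```

Because each cumulative set contains the previous one, the count of newly introduced fields equals `diff()` of the cumulative sizes, which is what the y axis of the interactive plot shows.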
Use of common data elements is promoted by various initiatives. Dictionaries, thesauri, and ontologies are all relevant. We have examples of each in the metametrics package.
A snapshot of the Genomic Data Commons gdcdictionary, with fields and values related to diagnosis and sample characteristics, is provided in gdc_dx_sam.
gdc_dx_sam
A table with all entries from several ontologies and the NCI Thesaurus is provided by load_ontolookup:
olook = load_ontolookup()
olook
We use robust linear modeling to estimate growth in the vocabulary of fields employed over time. The data.frame mydf includes a variable nfields taking a value for each study publication date. The value of nfields associated with date $d$ records the number of fields used to annotate all studies up to date $d$.
library(MASS)
nsecpy = 3600*24*365
summary( mm <- rlm(nfields~I(as.numeric(date_published)/nsecpy), data=mydf))
plot(nfields~I(as.numeric(date_published)/nsecpy), data=mydf)
abline(mm)
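Dividing the numeric date (seconds since the epoch) by nsecpy puts the predictor on a scale of years, so the fitted slope reads as fields added per year. The sketch below illustrates this on synthetic data only: the dates, the growth rate of 100 fields per year, and the injected outlier are all assumptions made for the example and say nothing about the cancer corpus.

```r
library(MASS)  # provides rlm()

set.seed(1)
nsecpy <- 3600 * 24 * 365

# Fifty synthetic publication dates spread over roughly five years.
dates <- as.POSIXct("2012-01-01", tz = "UTC") + sort(runif(50, 0, 5 * nsecpy))
years <- as.numeric(dates) / nsecpy

# Synthetic vocabulary size: 100 fields per year plus noise,
# with one gross outlier that an ordinary least-squares fit would chase.
nf <- 100 * (years - min(years)) + rnorm(50, sd = 5)
nf[25] <- nf[25] + 400

fit <- rlm(nf ~ years)
slope <- unname(coef(fit)["years"])
print(slope)  # estimated fields per year on the synthetic data
```

With its default settings, rlm downweights observations with large residuals, so the single outlier has little influence on the estimated yearly growth rate.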