knitr::opts_chunk$set(echo = TRUE)
This doc describes how to add resources to AnnotationHub or ExperimentHub. In general, these instructions pertain to core team members only.
This requires generating two types of resources: GRanges objects built from the Ensembl GTF files and TwoBit files built from the Ensembl FASTA files.
On Local Machine:
Navigate to the AnnotationHub_docker directory or create such a directory by following the instructions here.
Start the docker:
export MYSQL_REMOTE_PASSWORD=***   # (See credentials doc)
docker-compose up
options(AH_SERVER_POST_URL="http://localhost:3000/resource")
options(ANNOTATION_HUB_URL="http://localhost:3000")
url <- getOption("AH_SERVER_POST_URL")
library(AnnotationHubData)

# Since this grabs the file and converts on the fly, there is actually
# no need to run with metadataOnly=FALSE.
#
# This periodically fails. We realized that after a certain number of
# subsequent hits to the Ensembl FTP site, the site starts asking for a
# username:password or simply blocks entirely. We have tried increasing
# the sleep time between failed attempts with a max retry of 3. Normally
# this function will run completely after 1-3 attempts. An alternative is
# to temporarily increase the wait time in the ftpFileInfo function in
# webAccessFunctions.R. In the future, consider adding a conditional
# userpwd argument to getURL.
#
# FIX ME:
# More troubling in recent builds is that there is not a consistent mapping
# from the Ensembl repository name (species) to the species in the species file
# https://ftp.ensembl.org/pub/release-108/species_EnsemblVertebrates.txt
# This results in an error.
# Hot fix: for now a browser() call can be added in makeEnsemblGtfToGranges.R
# in makeEnsemblGtfToAHM before running meta <- .ensemblMetadataFromUrl(sourceUrls).
# When you hit this browser, manually subset the values to remove the bad entry.
# Ideally we want to come up with a better mapping strategy so these could be included.
meta <- updateResources("EnsemblGTFtoGranges",
                        BiocVersion = "3.6",
                        preparerClasses = "EnsemblGtfImportPreparer",
                        metadataOnly = TRUE, insert = FALSE,
                        justRunUnitTest = FALSE, release = "89")

# test/check meta

pushMetadata(meta, url)
# You could rerun updateResources with insert=TRUE to do the push,
# but it is better to check the resource data first.
exit R
Convert db to sqlite (puts the file in the data/ directory)
sudo docker exec annotationhub_annotationhub_1 bash /bin/backup_db.sh
# or, depending on the container name:
docker exec annotationhub_docker_annotationhub_1 bash /bin/backup_db.sh
This takes several hours to run. If the process is stopped before finishing, see the section on cross checking for how to pick up the uploads in the middle.
Generate (or have someone generate) a SAS token/URL on Microsoft Azure Data Lake for the staging bucket. The general recommendation is to have the SAS expire after two weeks. The scripts potentially need both the token and the URL. Export the SAS URL as an environment variable AZURE_SAS_URL with something like the following.
export AZURE_SAS_URL='https://bioconductorhubs.blob.core.windows.net/staginghub?sp=racwl&st=2022-02-08T15:57:00Z&se=2022-02-22T23:57:00Z&spr=https&sv=2020-08-04&sr=c&sig=fBtPzgrw1Akzlz%2Fwkne%2BQrxOKOdCzP1%2Fk5S%2FHk1LguE%3D'
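As an optional sanity check (a minimal sketch; the variable name simply follows the export above), you can confirm from R that the SAS URL is visible before running the recipe:

## returns TRUE if the AZURE_SAS_URL environment variable is set and non-empty
nzchar(Sys.getenv("AZURE_SAS_URL"))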
library(AnnotationHubData)

# To populate the bucket, metadataOnly must be FALSE;
# metadataOnly controls the upload, insert controls inserting into the database.
meta <- updateResources("TwoBit",
                        BiocVersion = "3.12",
                        preparerClasses = "EnsemblTwoBitPreparer",
                        metadataOnly = FALSE, insert = FALSE,
                        justRunUnitTest = FALSE, release = "101")

# Because of the upload, meta will not be an AHM object; run again with
# metadataOnly=TRUE to save an object to copy to your local machine.
meta <- updateResources("TwoBit",
                        BiocVersion = "3.11",
                        preparerClasses = "EnsemblTwoBitPreparer",
                        metadataOnly = TRUE, insert = FALSE,
                        justRunUnitTest = FALSE, release = "99")

# A suggested step is to save(meta, file="metadataForTwoBit")
# if running as a script or non-interactively.
save(meta, file = "TwoBitObj.RData")
Check that the Azure bucket is being populated.
Once finished, cross check that everything populated correctly. You can check whether any resources failed and need to be rerun with the following.
meta <- updateResources("TwoBit",
                        BiocVersion = "3.12",
                        preparerClasses = "EnsemblTwoBitPreparer",
                        metadataOnly = TRUE, insert = FALSE,
                        justRunUnitTest = FALSE, release = "101")

temp <- rep(NA, length(meta))
for (i in seq_along(meta)) {
    temp[i] <- metadata(meta[[i]])$RDataPath
}

## When loading, AzureStor will ask about saving credentials; recommend `no`.
library("AzureStor")

## You will need the SAS token generated for access
sas <- "sp=racwl&st=2022-02-08T15:57:00Z&se=2022-02-22T23:57:00Z&spr=https&sv=2020-08-04&sr=c&sig=fBtPzgrw1Akzlz%2Fwkne%2BQrxOKOdCzP1%2Fk5S%2FHk1LguE%3D"

## Check what files were uploaded
url <- "https://bioconductorhubs.blob.core.windows.net"
ep <- storage_endpoint(url, sas = sas)
container <- storage_container(ep, "staginghub")

## If you changed the first argument of updateResources to use a different
## directory, replace TwoBit with that directory.
filesOn <- gsub(pattern = "TwoBit/", "",
                list_storage_files(container, "TwoBit")[, "name"])

## Difference between the metadata and what has been uploaded
paths <- setdiff(temp, filesOn)

## If any were not uploaded, subset and try again
newMeta <- meta[match(paths, temp)]
pushResources(newMeta, uploadToRemote = TRUE, download = TRUE)

## It is also good practice to check the setdiff in the opposite direction.
Navigate to the AnnotationHub_docker directory
Start the docker:
export MYSQL_REMOTE_PASSWORD=***   # (See credentials doc)
docker-compose up
options(AH_SERVER_POST_URL="http://localhost:3000/resource")
options(ANNOTATION_HUB_URL="http://localhost:3000")
url <- getOption("AH_SERVER_POST_URL")
library(AnnotationHubData)

# Option 1: regenerate the metadata
meta <- updateResources("TwoBit",
                        BiocVersion = "3.12",
                        preparerClasses = "EnsemblTwoBitPreparer",
                        metadataOnly = TRUE, insert = FALSE,
                        justRunUnitTest = FALSE, release = "101")

# Option 2: if you saved the meta
load("metadataForTwoBit")

pushMetadata(meta, url)
# You could rerun updateResources with insert=TRUE to do the push,
# but it is better to check the resource data first.
exit R
Convert db to sqlite (puts the file in the data/ directory)
sudo docker exec annotationhub_annotationhub_1 bash /bin/backup_db.sh
# or, depending on the container name:
docker exec annotationhub_docker_annotationhub_1 bash /bin/backup_db.sh
After the GTF or TwoBit resources are added to the production database, you should be able to retrieve the resources and see the updated database timestamp with something like the following.
library(AnnotationHub)
hub <- AnnotationHub()
length(query(hub, c("ensembl", "gtf", "release-101")))
length(query(hub, c("fasta", "twobit", "release-101")))
Contributors will generally reach out when they want to update or include annotations in AnnotationHub. In the past they provided the annotations through an application like Dropbox; we have since updated the process and contributors now upload files directly to S3 in a temporary location. Send them the instructions found here.
A key will have to be generated for them to access and use the AnnotationContributor account. Go here. Under the 'Security credentials' tab, click 'Create access key'. Send the Access key ID to the contributor; the Secret access key is stored in AWS. When the contributor is done you can delete the key (the small 'x' at the right of the key row).
Advise that their data should be in a directory with the same name as the software package that will access the annotations; subdirectories to keep track of versions are strongly encouraged.
Once the data is uploaded to S3, move the data to the proper location.
We will need a copy of the package to generate and test the annotations. Request a link to the package from the user.
Follow the instructions here. In general, generate the list of AnnotationHubMetadata objects with makeAnnotationHubMetadata() or updateResources(). To test that the metadata.csv is properly formatted, run makeAnnotationHubMetadata().
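A minimal, hedged example of validating a contributed package's metadata.csv; the path below is a placeholder for wherever the contributed package was checked out locally:

library(AnnotationHubData)
## "/path/to/ContributedPackage" is a placeholder, not a real location
meta <- makeAnnotationHubMetadata("/path/to/ContributedPackage")
## meta is a list of AnnotationHubMetadata objects; inspect before pushing
meta[[1]]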
Some suggested testing procedures can be found here.
When satisfied, start the AnnotationHub docker and add the resource to the docker database.
Navigate to the AnnotationHub_docker directory
Start the docker:
export MYSQL_REMOTE_PASSWORD=***   # (See credentials doc)
docker-compose up
options(AH_SERVER_POST_URL="http://localhost:3000/resource")
options(ANNOTATION_HUB_URL="http://localhost:3000")
library(AnnotationHubData)
url <- getOption("AH_SERVER_POST_URL")

# Run the appropriate makeAnnotationHubMetadata() call and then
# pushMetadata(meta[[1]], url)
exit R
Test
From the list of running dockers (sudo docker ps), find the process with db in the name, for example test_db1. Connect to the container with sudo docker exec -ti test_db bash. Log into mysql with mysql -p -u ahuser (the password will be the same as the exported MYSQL_REMOTE_PASSWORD). Explore with mysql commands like select * from resources order by id desc limit 5;
Convert the db to sqlite (this puts the file in the data/ directory). Which command to run will depend on the name of the process, but it should be one of the following:
sudo docker exec annotationhub_annotationhub_1 bash /bin/backup_db.sh
docker exec annotationhub_docker_annotationhub_1 bash /bin/backup_db.sh
This recipe should be run after the new OrgDb packages built for the release are available in the devel repo. The code essentially loads the current packages, extracts the sqlite file, and creates some basic metadata.
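For illustration only (not part of the recipe itself), this is roughly what the recipe does per package: load an installed OrgDb package and locate its sqlite file. org.Hs.eg.db is used here purely as an example.

## path to the sqlite database shipped inside the installed OrgDb package
library(org.Hs.eg.db)
AnnotationDbi::dbfile(org.Hs.eg.db)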
The BiocVersion should be whatever the next release version will be, i.e. the current devel that is about to become the release. The OrgDb resources get the same name when they are regenerated - they aren't tied to a genome build, so that's not a distinguishing feature in the title. We only want one OrgDb per species available in a release, and the BiocVersion is what we use to filter the records exposed.
New per Nov 2021: We want the devel OrgDbs to be exposed as soon as they are generated, close to release time, so users can update accordingly. Before running and adding new resources, manually remove the references in the biocversions table for the current release OrgDbs associated with the current devel version. The quickest way to get the references: in R, perform the query
library(AnnotationHub)
ah <- AnnotationHub()
table(mcols(query(ah, "OrgDb"))$rdatadateadded)
## use the dates in this query to filter in sqlite3 to get ids
In sqlite3:
select id from resources where preparerclass="OrgDbFromPkgsImportPreparer"
    and rdatadateadded="2021-10-08" limit 1;
select id from resources where preparerclass="OrgDbFromPkgsImportPreparer"
    and rdatadateadded="2021-10-08" order by id desc limit 1;

-- use the resource ids to get ids from the biocversions table
-- check first and get ids of only the devel version (soon to be release);
-- in this scenario 3.13 is the current release, prepping for the 3.14 devel to be released
select * from biocversions where biocversion="3.14"
    and resource_id between 95950 and 95968;

-- use the highest and lowest id to make the delete call
delete from biocversions where id between 298567 and 298585;
Generate (or have someone generate) a SAS token/URL on Microsoft Azure Data Lake for the staging bucket. The general recommendation is to have the SAS expire after two weeks. The scripts potentially need both the token and the URL. Export the SAS URL as an environment variable AZURE_SAS_URL with something like the following.
export AZURE_SAS_URL='https://bioconductorhubs.blob.core.windows.net/staginghub?sp=racwl&st=2022-02-08T15:57:00Z&se=2022-02-22T23:57:00Z&spr=https&sv=2020-08-04&sr=c&sig=fBtPzgrw1Akzlz%2Fwkne%2BQrxOKOdCzP1%2Fk5S%2FHk1LguE%3D'
On Local Machine:
Navigate to the AnnotationHub_docker directory
Start the docker:
export MYSQL_REMOTE_PASSWORD=***   # (See credentials doc)
sudo docker-compose up
options(AH_SERVER_POST_URL="http://localhost:3000/resource")
options(ANNOTATION_HUB_URL="http://localhost:3000")
url <- getOption("AH_SERVER_POST_URL")
library(AnnotationHubData)

# See the man page for clarification on testing and actively pushing:
# ?makeStandardOrgDbsToAHM
# To populate the bucket, metadataOnly must be FALSE;
# metadataOnly controls the upload, insert controls inserting into the database.
meta <- updateResources("OrgDbs",
                        BiocVersion = "3.5",
                        preparerClasses = "OrgDbFromPkgsImportPreparer",
                        metadataOnly = TRUE, insert = FALSE,
                        justRunUnitTest = FALSE,
                        downloadOrgDbs = TRUE)  # can be FALSE for subsequent runs

pushMetadata(meta, url)
exit R
Convert db to sqlite (puts the file in the data/ directory)
sudo docker exec annotationhub_annotationhub_1 bash /bin/backup_db.sh
# or, depending on the container name:
docker exec annotationhub_docker_annotationhub_1 bash /bin/backup_db.sh
If satisfied, copy this file to annotationhub.bioconductor.org and follow the instructions for updating the production database.
New as of Nov 2021: Manually add the new OrgDbs so they are visible in the next devel. If the current release is 3.13 and devel is 3.14, and this is in preparation for the 3.14 release, this means adding a biocversion reference for 3.15.
In sqlite3 database:
-- use whatever date the resources were added
select id from resources where preparerclass="OrgDbFromPkgsImportPreparer"
    and rdatadateadded="2021-10-08" limit 1;
select id from resources where preparerclass="OrgDbFromPkgsImportPreparer"
    and rdatadateadded="2021-10-08" order by id desc limit 1;

-- test: should show the current devel; use the two values above for between
select * from biocversions where resource_id between 95950 and 95968;
In a separate R terminal you can quickly concatenate the values to add all versions at once, replacing 3.15 with whatever the next devel is and using the low/high of the resource ids:
paste0("(3.15,",95950:95968, ")", collapse=",")
Paste the R output into the sqlite call:
insert into biocversions (biocversion, resource_id) values [R output];

-- now check again that both the current devel and the future devel show
select * from biocversions where resource_id between 95950 and 95968;
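As an alternative to copy/pasting, the full insert statement can be assembled in R; a hedged sketch, assuming biocversion is stored as text (hence the quotes) and reusing the example id range above:

## build the complete insert statement and print it for pasting into sqlite3
vals <- paste0("('3.15',", 95950:95968, ")", collapse = ",")
cat("insert into biocversions (biocversion, resource_id) values ", vals, ";\n", sep = "")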
query(hub, "OrgDb") table(mcols(query(hub, "OrgDb"))$rdatadateadded)
This recipe should be run after the new TxDbs have been built and are in the devel repo. The code loads the packages, extracts the sqlite file and creates metadata.
The BiocVersion should be whatever the next release version will be, i.e. the current devel that is about to become the release. The OrgDb resources get the same name when they are regenerated - they aren't tied to a genome build, so that's not a distinguishing feature in the title. We only want one OrgDb per species available in a release, and the BiocVersion is what we use to filter the records exposed.
Generate (or have someone generate) a SAS token/URL on Microsoft Azure Data Lake for the staging bucket. The general recommendation is to have the SAS expire after two weeks. The scripts potentially need both the token and the URL. Export the SAS URL as an environment variable AZURE_SAS_URL with something like the following.
export AZURE_SAS_URL='https://bioconductorhubs.blob.core.windows.net/staginghub?sp=racwl&st=2022-02-08T15:57:00Z&se=2022-02-22T23:57:00Z&spr=https&sv=2020-08-04&sr=c&sig=fBtPzgrw1Akzlz%2Fwkne%2BQrxOKOdCzP1%2Fk5S%2FHk1LguE%3D'
On Local Machine:
Navigate to the AnnotationHub_docker directory
Start the docker:
export MYSQL_REMOTE_PASSWORD=***   # (See credentials doc)
sudo docker-compose up
options(AH_SERVER_POST_URL="http://localhost:3000/resource")
options(ANNOTATION_HUB_URL="http://localhost:3000")
url <- getOption("AH_SERVER_POST_URL")
library(AnnotationHubData)

# See the man page for clarification on testing and actively pushing, and
# follow the example there. You will need the list of updated or added TxDbs
# to make a character vector of files.
?makeStandardTxDbsToAHM

pushMetadata(meta, url)
exit R
Convert db to sqlite (puts the file in the data/ directory)
sudo docker exec annotationhub_annotationhub_1 bash /bin/backup_db.sh
# or, depending on the container name:
docker exec annotationhub_docker_annotationhub_1 bash /bin/backup_db.sh
If satisfied, copy this file to annotationhub.bioconductor.org and follow the instructions for updating the production database.
After the resources are uploaded to production, you can test that they are available in release and devel:
query(hub, "TxDb") table(mcols(query(hub, "TxDb"))$rdatadateadded)
This code generates ~1700 non-standard OrgDb sqlite files from NCBI. These are less comprehensive than the standard OrgDb packages. It's best to run this on an EC2 instance. You can run it locally if your machine has enough space to download the files from NCBI, but keep in mind this code takes several hours to run.
The BiocVersion should be whatever the next release version will be, i.e. the current devel that is about to become the release. The OrgDb resources get the same name when they are regenerated - they aren't tied to a genome build, so that's not a distinguishing feature in the title. We only want one OrgDb per species available in a release, and the BiocVersion is what we use to filter the records exposed.
Before running, make sure the AnnotationForge/inst/extdata/viableIDs.rda and GenomeInfoDbData/data/specData.rda files have been updated and pushed in their respective packages. The scripts to generate these data files are in the packages' inst/scripts directories, as viableIDs.R and updateGenomeInfoDbData.R respectively.
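An optional sanity check (a hedged sketch; it assumes only the file locations named above and that the data object inside specData.rda is called specData) to confirm your installed packages ship the updated files:

## modification time of the viableIDs data file in the installed AnnotationForge
va <- system.file("extdata", "viableIDs.rda", package = "AnnotationForge")
file.info(va)$mtime

## load and glance at the species data shipped with GenomeInfoDbData
data("specData", package = "GenomeInfoDbData")
head(specData)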
New per Nov 2021: We want the devel OrgDbs to be exposed as soon as they are generated, close to release time, so users can update accordingly. Before running and adding new resources, manually remove the references in the biocversions table for the current release OrgDbs associated with the current devel version. The quickest way to get the references: in R, perform the query
library(AnnotationHub)
ah <- AnnotationHub()
table(mcols(query(ah, "OrgDb"))$rdatadateadded)
## use the dates in this query to filter in sqlite3 to get ids
In sqlite3:
select id from resources where preparerclass="NCBIImportPreparer"
    and rdatadateadded="2021-10-13" limit 1;
select id from resources where preparerclass="NCBIImportPreparer"
    and rdatadateadded="2021-10-13" order by id desc limit 1;

-- use the resource ids to get ids from the biocversions table
-- check first and get ids of only the devel version (soon to be release);
-- in this scenario 3.13 is the current release, prepping for the 3.14 devel to be released
select * from biocversions where biocversion="3.14"
    and resource_id between 95973 and 97713;

-- use the highest and lowest id to make the delete call
delete from biocversions where id between 298567 and 298585;
Generate (or have someone generate) a SAS token/URL on Microsoft Azure Data Lake for the staging bucket. The general recommendation is to have the SAS expire after two weeks. The scripts potentially need both the token and the URL. Export the SAS URL as an environment variable AZURE_SAS_URL and the SAS token as AZURE_SAS_TOKEN, with something like the following.
export AZURE_SAS_URL='https://bioconductorhubs.blob.core.windows.net/staginghub?sp=racwl&st=2022-02-08T15:57:00Z&se=2022-02-22T23:57:00Z&spr=https&sv=2020-08-04&sr=c&sig=fBtPzgrw1Akzlz%2Fwkne%2BQrxOKOdCzP1%2Fk5S%2FHk1LguE%3D'
export AZURE_SAS_TOKEN='sp=racwl&st=2022-02-08T15:57:00Z&se=2022-02-22T23:57:00Z&spr=https&sv=2020-08-04&sr=c&sig=fBtPzgrw1Akzlz%2Fwkne%2BQrxOKOdCzP1%2Fk5S%2FHk1LguE%3D'
library(AnnotationHubData)

# See the man page for clarification on testing and actively pushing:
# ?makeNCBIToOrgDbsToAHM
# To populate the bucket, metadataOnly must be FALSE;
# metadataOnly controls the upload, insert controls inserting into the database.
#
# This can be run in the directory where the example in
# ?AnnotationForge::makeOrgPackageFromNCBI was run, to save time building the
# cache if the cache was recently generated. You will need ~62 GB of space to
# run the code.

# If rebuilding the cache, you will likely need to increase R's default timeout:
options(timeout = 10000)

meta <- updateResources("NonStandardOrgDb",
                        BiocVersion = "3.5",
                        preparerClasses = "NCBIImportPreparer",
                        metadataOnly = TRUE, insert = FALSE,
                        justRunUnitTest = FALSE)

# A suggested step is to save(meta, file="metadataForNonStandardOrgs")
# and scp it to your local machine; you will need to have run with
# metadataOnly=TRUE to get the correct form of the saved object.
Note: We have had issues in the past with the recipe completing for all desired resources. The recipe, when repeatedly run, will check the appropriate bucket to compare what resources still need to be processed. The helper function needToRerunNonStandardOrgDb can be run to determine if a repeat call should be made. This is important because if all the resources are on Azure and metadataOnly=FALSE, it will assume you want to overwrite all the files and begin the generation over.

needToRerunNonStandardOrgDb(biocVersion = "3.5",
                            resourceDir = "NonStandardOrgDb",
                            justRunUnitTest = TRUE)
On Local Machine:
Navigate to the AnnotationHub_docker directory
Start the docker:
export MYSQL_REMOTE_PASSWORD=***   # (See credentials doc)
docker-compose up
options(AH_SERVER_POST_URL="http://localhost:3000/resource")
options(ANNOTATION_HUB_URL="http://localhost:3000")
url <- getOption("AH_SERVER_POST_URL")
library(AnnotationHubData)

# Option 1: regenerate the metadata
meta <- updateResources("NonStandardOrgDb",
                        BiocVersion = "3.5",
                        preparerClasses = "NCBIImportPreparer",
                        metadataOnly = TRUE, insert = FALSE,
                        justRunUnitTest = FALSE)

# Option 2: if you saved the meta from the EC2 instance
load("metadataForNonStandardOrgs")

pushMetadata(meta, url)
exit R
Convert db to sqlite (puts the file in the data/ directory)
sudo docker exec annotationhub_annotationhub_1 bash /bin/backup_db.sh
# or, depending on the container name:
docker exec annotationhub_docker_annotationhub_1 bash /bin/backup_db.sh
If satisfied, copy this file to annotationhub.bioconductor.org and follow the instructions for updating the production database.
New as of Nov 2021: Manually add the new OrgDbs so they are visible in the next devel. If the current release is 3.13 and devel is 3.14, and this is in preparation for the 3.14 release, this means adding a biocversion reference for 3.15.
In sqlite3 database:
-- use whatever date the resources were added
select id from resources where preparerclass="NCBIImportPreparer"
    and rdatadateadded="2021-10-13" limit 1;
select id from resources where preparerclass="NCBIImportPreparer"
    and rdatadateadded="2021-10-13" order by id desc limit 1;

-- test: should show the current devel; use the two values above for between
select * from biocversions where resource_id between 95973 and 97713;
In a separate R terminal you can quickly concatenate the values to add all versions at once, replacing 3.15 with whatever the next devel is and using the low/high of the resource ids:
paste0("(3.15,",95973:97713, ")", collapse=",")
Paste the R output into the sqlite call:
insert into biocversions (biocversion, resource_id) values [R output];

-- now check again that both the current devel and the future devel show
select * from biocversions where resource_id between 95973 and 97713;
query(hub, "OrgDb") table(mcols(query(hub, "OrgDb"))$rdatadateadded)
ExperimentHub resources are added upon request or when it is recommended that a package be an Experiment Data Package rather than a Software package. The maintainer of the package that will use the data will reach out. The process is then similar to adding contributor resources through AnnotationHubData.
The user will upload files to S3 in a temporary location. Send them the instructions found here.
A key will have to be generated for them to access and use the AnnotationContributor account. Go here. Under the 'Security credentials' tab, click 'Create access key'. Send the Access key ID to the contributor; the Secret access key is stored in AWS. When the contributor is done you can delete the key (the small 'x' at the right of the key row).
Advise that their data should be in a directory with the same name as the software package that will use the Experiment Data when uploading (i.e. a software package Test would upload files to S3 in a folder Test -> Test/file1, Test/file2, etc.). If subdirectories are needed that is okay, but ensure the RDataPath in the metadata.csv reflects this structure.
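A hypothetical illustration of matching RDataPath values for a package named Test with a versioned subdirectory (the file names and version directory are made up, and the other required metadata.csv columns are omitted here):

## RDataPath must mirror the S3 layout, including any subdirectories
data.frame(
    Title     = c("file1", "file2"),
    RDataPath = c("Test/v1.0/file1.rds", "Test/v1.0/file2.rds")
)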
Once the data is uploaded to S3, move the data to the proper location.
We will need a copy of the package to generate and test the annotations. Request a link to the package from the user. The following should be sent to the user to ensure the package is set up correctly: instructions.
In general, generate the list of ExperimentHubMetadata objects with makeExperimentHubMetadata() or addResources(). To test that the metadata.csv is properly formatted, run makeExperimentHubMetadata().
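A minimal, hedged example of validating an Experiment Data package's metadata.csv; the path below is a placeholder for the local checkout of the contributed package:

library(ExperimentHubData)
## "/path/to/ContributedExperimentPackage" is a placeholder, not a real location
meta <- makeExperimentHubMetadata("/path/to/ContributedExperimentPackage")
## meta is a list of ExperimentHubMetadata objects; inspect before pushing
meta[[1]]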
Info on ExperimentHub docker and how to set up docker directory.
When satisfied start the ExperimentHub docker and add resource to docker.
Navigate to the ExperimentHub_docker directory
Start the docker:
export MYSQL_REMOTE_PASSWORD=***   # (See credentials doc)
docker-compose up
options(EXPERIMENT_HUB_SERVER_POST_URL="http://localhost:4000/resource")
options(EXPERIMENT_HUB_URL="http://localhost:4000")
library(ExperimentHubData)
url <- getOption("EXPERIMENT_HUB_SERVER_POST_URL")

# Run the appropriate makeExperimentHubMetadata() call and then, if necessary,
# pushMetadata(meta, url)
exit R
Test
From the list of running dockers (sudo docker ps), find the process with db in the name, for example test_db1. Connect to the container with sudo docker exec -ti test_db bash. Log into mysql with mysql -p -u hubuser (the password will be the same as the exported MYSQL_REMOTE_PASSWORD). Explore with mysql commands like select * from resources order by id desc limit 5;
Convert the db to sqlite (this puts the file in the data/ directory). Which command to run will depend on the name of the process, but it should be one of the following:
sudo docker exec experimenthubdocker_experimenthub_1 bash /bin/backup_db.sh
docker exec experimenthub_docker_experimenthub_1 bash /bin/backup_db.sh
Some other Notes and helpful hints:
If a new recipe needs to be added, the recipe is added in AnnotationHub. Be sure to then update the version of the AnnotationHub dependency in the DESCRIPTION of ExperimentHub and bump the version of ExperimentHub.
Remember that when an ExperimentHub package is accepted, it is uploaded to a different repo.