knitr::opts_chunk$set( collapse = TRUE, comment = "#>" )
ReUseData provides functionalities to construct workflow-based data recipes for fully tracked and reproducible data processing. Evaluation of data recipes generates curated data resources in their generic formats (e.g., VCF, BED), as well as a YAML manifest file recording the recipe parameters, data annotations, and data file paths for subsequent reuse. The datasets are locally cached using a database infrastructure, which makes updating and searching specific data easy. Data reusability is assured through cloud hosting and enhanced interoperability with downstream software tools and analysis workflows. The workflow strategy enables cross-platform reproducibility of curated data resources.
if (!requireNamespace("BiocManager", quietly = TRUE))
    install.packages("BiocManager")
BiocManager::install("ReUseData")
Use the development version:
BiocManager::install("ReUseData", version = "devel")
suppressPackageStartupMessages(library(Rcwl))
library(ReUseData)
ReUseData core functions for data management

Here we introduce the core functions of ReUseData for data management and reuse: getData for reproducible data generation, dataUpdate for syncing and updating the data cache, and dataSearch for multi-keyword searching of datasets of interest.
First, we construct data recipes by transforming shell or other ad hoc data preprocessing scripts into workflow-based data recipes (a minimal sketch of building a custom recipe from a shell script follows the next code chunk). Some prebuilt data recipes for public data resources (e.g., downloading, unzipping, and indexing) are available for direct use through the recipeSearch and recipeLoad functions. Then we assign values to the input parameters and evaluate the recipe to generate the data of interest.
## set cache in tempdir for test
Sys.setenv(cachePath = file.path(tempdir(), "cache"))
recipeUpdate()
recipeSearch("echo")
echo_out <- recipeLoad("echo_out")
inputs(echo_out)
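The echo_out recipe loaded above is prebuilt. For turning an ad hoc shell script into a recipe yourself, a minimal sketch along the following lines is possible using recipeMake(); the script, parameter IDs, and argument usage here are illustrative, so check ?recipeMake for the authoritative interface.

## Hedged sketch: wrap a tiny shell script as a data recipe.
## The script and IDs below are made up for illustration; see ?recipeMake.
script <- '
input=$1
outfile=$2
echo "Print the input: $input" > $outfile.txt
'
rcp <- recipeMake(shscript = script,
                  paramID = c("input", "outfile"),
                  paramType = c("string", "string"),
                  outputID = "echoout",
                  outputGlob = "*.txt")
inputs(rcp)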
Users can then assign values to the input parameters and evaluate the recipe with getData to generate the data of interest. Users need to specify an output directory for all files (the desired data file, plus intermediate files that are internally generated, such as workflow scripts and annotation files). Detailed notes for the data are encouraged, as they are used for keyword matching in later data searches.
We can install cwltool first to make sure a cwl-runner is available.
invisible(Rcwl::install_cwltool())
echo_out$input <- "Hello World!"
echo_out$outfile <- "outfile"
outdir <- file.path(tempdir(), "SharedData")
res <- getData(echo_out,
               outdir = outdir,
               notes = c("echo", "hello", "world", "txt"))
The file path to the newly generated dataset can be easily retrieved. It can also be retrieved with dataSearch() using multiple keywords; before that, dataUpdate() needs to be run.
res$output
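As a quick preview of that flow (the same calls appear in context below), the cache can be synced with the output directory and then queried by keywords:

## Preview only: sync the local cache with outdir, then search by keywords.
dataUpdate(dir = outdir)
dataSearch(c("echo", "txt"))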
Several files are generated automatically to help track the data recipe evaluation: a *.sh file recording the original shell script, a *.cwl file containing the workflow script that was internally submitted for the recipe evaluation, a *.yml file used in the CWL workflow evaluation that also records the data annotations, and a *.md5 checksum file to verify the integrity of the generated data file.
list.files(outdir, pattern = "echo")
The *.yml file contains the recipe input parameters, the file path to the output file, the notes for the dataset, and an automatically added date of data generation. A later data search with dataSearch() refers to this file for keyword matching.
readLines(res$yml)
dataUpdate() creates (on first use), syncs, and updates the local cache for curated datasets. It recursively finds and reads all the .yml files in the provided data folder, creates a cache record for each associated dataset (including those newly generated with getData()), and updates the local cache for later data searching and reuse.
IMPORTANT: it is recommended that users create a dedicated folder for data archival (e.g., file/path/to/SharedData) that other group members have access to, and use sub-folders for different kinds of datasets (e.g., those generated from the same recipe).
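A hedged sketch of such a layout (the paths and sub-folder names are illustrative only; in practice the top-level folder would live on shared, group-accessible storage):

## Illustrative layout: one shared top-level folder, one sub-folder per recipe/data type.
shared <- file.path(tempdir(), "SharedData_layout")
dir.create(file.path(shared, "echo_outputs"), recursive = TRUE, showWarnings = FALSE)
dir.create(file.path(shared, "reference_genomes"), recursive = TRUE, showWarnings = FALSE)
## Recipes would write into the sub-folders via getData(..., outdir = ...);
## a single dataUpdate(dir = shared) then indexes all .yml files recursively.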
(dh <- dataUpdate(dir = outdir))
dataUpdate and dataSearch return a dataHub object with a list of all available or matching datasets.
One can subset the list with [ and use getter functions to retrieve annotation information about the data, e.g., the data names, the parameter values passed to the recipe, the notes, the tags, and the corresponding YAML file.
dh[1]
## dh["BFC1"]
dh[dataNames(dh) == "outfile.txt"]
dataNames(dh)
dataParams(dh)
dataNotes(dh)
dataTags(dh)
dataYml(dh)
ReUseData, as the name suggests, is committed to promoting data reuse. Data can be prepared in standard input formats with toList, e.g., YAML and JSON, so that they can be easily integrated into locally or cloud-hosted workflow methods.
(dh1 <- dataSearch(c("echo", "hello", "world")))
toList(dh1, listNames = c("input_file"))
toList(dh1, format = "yaml", listNames = c("input_file"))
toList(dh1, format = "json", file = file.path(tempdir(), "data.json"))
Data can also be aggregated from different resources by tagging them with specific software tools.
dataSearch()
dataTags(dh[1]) <- "#gatk"
dataSearch("#gatk")
The package can also be used to add annotations and notes to existing data resources or experiment data for management. Here we add the existing "exp_data" directory to the local data repository.
exp_data <- file.path(tempdir(), "exp_data")
dir.create(exp_data)
We first add notes to the data, and then update the data repository with information from the new dataset.
annData(exp_data, notes = c("experiment data"))
dataUpdate(exp_data)
Now our data hub caches meta information from two different directories: one from the data recipe and one from existing data. The data can be retrieved by keywords.
dataSearch("experiment")
NOTE: if the argument cloud=TRUE is enabled, dataUpdate() will also cache the pregenerated datasets (from evaluation of public ReUseData recipes) that are available on the ReUseData Google bucket, and return them in the dataHub object so that they are fully searchable. Please see the following section for details.
Using the prebuilt data recipes for curation (e.g., downloading, unzipping, indexing) of commonly used public data resources, we have pregenerated some datasets and put them in cloud storage for direct use.
Before searching, one needs to run dataUpdate(cloud=TRUE) to sync the existing datasets on the cloud; then dataSearch() can be used to search any available dataset, either in the local cache or on the cloud.
gcpdir <- file.path(tempdir(), "gcpData")
dataUpdate(gcpdir, cloud = TRUE)
If the data of interest already exists on the cloud, getCloudData will download the data directly to your computer. Add it to the local caching system with dataUpdate() for later use.
(dh <- dataSearch(c("ensembl", "GRCh38")))
getCloudData(dh[1], outdir = gcpdir)
Now we create the data cache with only local data files, and we can see that the downloaded data is available.
dataUpdate(gcpdir)  ## Update local data cache (without cloud data)
dataSearch()        ## data is available locally!!!
The data supports user-friendly discovery and access through the ReUseData portal, where detailed instructions are provided for straightforward incorporation into data analysis pipelines run on local computing nodes, web resources, and cloud computing platforms (e.g., Terra, CGC).
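As one hedged illustration of such incorporation, the JSON file exported by toList() earlier can be read back with a standard JSON parser (jsonlite is used here only for illustration and is not a ReUseData requirement) and handed to whatever workflow engine runs the pipeline:

## Read the JSON input list written by toList() above and inspect it.
if (requireNamespace("jsonlite", quietly = TRUE)) {
    inlist <- jsonlite::fromJSON(file.path(tempdir(), "data.json"))
    str(inlist)
}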
Here we provide a function meta_data() that creates a data frame containing all information about the datasets in the specified file path (recursively), including the annotation file ($yml column), parameter values for the recipe ($params column), the data file path ($output column), keywords for the data file ($notes column), the date of data generation ($date column), and any tag if available ($tag column).
Use cleanup = TRUE to clean up any invalid or expired/older intermediate files.
mt <- meta_data(outdir)
head(mt)
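Per the note above, a cleanup pass can be combined with the metadata collection; the usage below follows the text above, so check ?meta_data for the exact interface:

## Collect metadata and remove invalid or expired intermediate files (per the note above).
mt <- meta_data(outdir, cleanup = TRUE)
head(mt)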
sessionInfo()