This package provides a simple API to the Broad Institute's Cromwell workflow engine. Follow development at [GitHub](https://github.com/seandavi/wdlRunR).

For the time being, use the `develop` branch:

```r
require(devtools)
devtools::install_github('seandavi/wdlRunR', ref = "develop")
```
Big data in biology is a reality. Operating on large-scale biological data, particularly when the data flows involve interdependencies and multiple steps, requires smart workflow management tools.
Cromwell is one such workflow system. The [wdlRunR](https://github.com/seandavi/wdlRunR) R package allows users of the R statistical programming environment to interact with the Cromwell workflow system, acting as the "orchestrator" for job submission. The R ecosystem provides functionality for documentation, literate programming, publication-quality graphics, statistical analysis, and powerful data and metadata manipulation. Workflows of nearly arbitrary size can be described and then run by Cromwell to perform large-scale data processing, with the entire workflow (data munging, workflow metadata management, cloud orchestration, data management, and analysis of processed results) described as R code or literate programming documents.
Cromwell is available as a Java jar file and runs as a server or as a command-line executor of single workflows. This package focuses on interacting with Cromwell as a server, accessing it via its documented REST API.
Cromwell jobs are described using the Workflow Description Language (WDL).
As a first step, we download the current released version of Cromwell. We then run the server, writing the log file into `tempdir()`, using R code and a call to `system()`.

```r
library(wdlRunR)
cromwell_jar = getCromwellJar(cromwell_version = '36')
cromwell_log = file.path(tempdir(), 'cromwell.log')
system(sprintf('java -jar %s server > %s 2>&1 &', cromwell_jar, cromwell_log))
# let the server start up
Sys.sleep(20)
```
Cromwell should now be running in the background; we can verify by checking for processes with "cromwell" in the name.

```r
system('ps aux | grep cromwell')
```
The simplest example (and one that runs in the time needed for a demo!) is a "Hello, WDL" workflow. For now, we simply describe a WDL workflow as text or as a text file (see below).

```r
hello_wdl = "
task hello {
  String name
  command {
    echo 'Hello ${name}!'
  }
  output {
    File response = stdout()
  }
}
workflow test {
  call hello
}"
```
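The same WDL can just as easily live in a standalone file, which makes it simpler to version and edit independently of the R code. A minimal base-R sketch (the file name and truncated WDL content here are purely illustrative):

```r
# Write an (abbreviated, illustrative) WDL document to a file in tempdir().
wdl_file = file.path(tempdir(), "hello.wdl")
writeLines(c("workflow test {", "  call hello", "}"), wdl_file)
# Read it back as a single string, ready to pass along as workflow source.
wdl_text = paste(readLines(wdl_file), collapse = "\n")
```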
We are going to submit a batch of jobs (10 of them) with randomly generated names. The input to the `cromwellBatch()` function is a `data.frame` with columns named for the WDL inputs. Each row of the `data.frame` becomes a workflow that will be run (locally, in this case) by Cromwell.
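To make that mapping concrete, here is a small base-R sketch (the values are illustrative) of how one row of such a `data.frame` corresponds to the key/value inputs a single workflow receives:

```r
# Columns are named with the fully-qualified workflow.task.variable
# convention; each row supplies the inputs for one workflow run.
wdlInputs = data.frame(test.hello.name = c("ALICE", "BOB"),
                       stringsAsFactors = FALSE)
row1 = as.list(wdlInputs[1, , drop = FALSE])
# Hand-rolled JSON, for illustration only:
json1 = sprintf('{"%s":"%s"}', names(row1)[1], row1[[1]])
json1  # {"test.hello.name":"ALICE"}
```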
```r
options(cromwell_base = 'http://localhost:8000')
randomStrings = sapply(1:10, function(r) paste(sample(LETTERS, 10), collapse = ""))
wdlInputs = data.frame(test.hello.name = randomStrings)
```
We can submit a batch of jobs to Cromwell by simply POSTing to the correct API endpoint; the `cromwellBatch()` function wraps this process in R.
```r
res = cromwellBatch(wdlSource = hello_wdl, workflowInputs = wdlInputs)
# allow the jobs to get running
Sys.sleep(20)
```
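For reference, a sketch of what such a batch submission looks like at the REST level: a multipart POST carrying the workflow source and a JSON *array* of inputs, one object per workflow. The endpoint path and field names below reflect Cromwell's documented REST API, but verify them against your Cromwell version; the request is constructed here without being sent.

```r
# Build (but do not send) the pieces of a batch submission request.
body = list(
  workflowSource = "workflow test { call hello }",  # abbreviated WDL
  workflowInputs = '[{"test.hello.name":"ALICE"},{"test.hello.name":"BOB"}]'
)
# With httr, the actual call would look roughly like:
# httr::POST('http://localhost:8000/api/workflows/v1/batch',
#            body = body, encode = 'multipart')
```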
Once jobs are submitted, they enter the Cromwell workflow monitoring system, which we can query at any time. See the API documentation and `help('cromwellQuery')` in R for more details.
```r
cromwellQuery()
```
We can use any R functionality to manage our WDL documents and inputs. For example, we can fetch a workflow from a GitHub-hosted WDL file; here we retrieve the same "Hello, WDL" workflow.
```r
# read a WDL from GitHub or any other URL
library(httr)
hello_remote_wdl = content(
  GET("https://raw.githubusercontent.com/DockstoreTestUser/dockstore-whalesay/master/Dockstore.wdl"),
  'text')
```
And we can use R to create inputs however we like. Here, we use the babynames package to generate the names.
```r
if (require(babynames))
  wdlInputs = data.frame(test.hello.name = sample(babynames$name, 10))
z = cromwellBatch(wdlSource = hello_remote_wdl, workflowInputs = wdlInputs)
Sys.sleep(20)
cromwellQuery()
```
To make things a bit more interesting, we can simulate long-running jobs using a "sleeping" variation of "Hello, WDL."
```r
hello_wdl_sleep = "
task hello {
  String name
  command {
    echo 'Hello ${name}!' && sleep 60
  }
  output {
    File response = stdout()
  }
}
workflow test {
  call hello
}"
```
```r
randomStrings = sapply(1:10, function(r) paste(sample(LETTERS, 10), collapse = ""))
wdlInputs = data.frame(test.hello.name = randomStrings)
z = cromwellBatch(wdlSource = hello_wdl_sleep, workflowInputs = wdlInputs)
# let Cromwell get started with the jobs
Sys.sleep(10)
```
At this point, `cromwellQuery()` will show running jobs (if checked within 60 seconds, of course). Limiting the query to "Running" jobs is also easy.
```r
cromwellQuery(term = "status=Running")
```
And we can check for outputs as well.
```r
results = cromwellQuery(term = "status=Succeeded")
outputs = cromwellOutputs(results$id)
str(outputs, list.len = 4)
readLines(outputs[[1]]$outputs$test_hello_response)
```
Task-level logs are available via `cromwellLogs()`.

```r
logs = cromwellLogs(results$id)
# stderr -- should be empty for this task
readLines(logs[[1]]$test.hello[[1]]$stderr)
# stdout
readLines(logs[[1]]$test.hello[[1]]$stdout)
```
```r
# messy! remove the jar and log, and kill the background server
unlink(cromwell_jar)
unlink(cromwell_log)
system('pgrep -f cromwell | xargs -I {} kill -9 {}')
```
```r
sessionInfo()
```