# Pre-format the shell commands that are rendered later in this vignette.
out_docker_pull_stable <- scdrake::format_shell_command(glue::glue("docker pull {DOCKER_IMAGE_STABLE}"))
out_docker_pull_latest <- scdrake::format_shell_command(glue::glue("docker pull {DOCKER_IMAGE_LATEST}"))
out_singularity_pull_stable <- scdrake::format_shell_command(glue::glue("singularity pull docker://{DOCKER_IMAGE_STABLE}"))

Docker (wiki) is software for running isolated containers, which are based on images. You can think of containers as small, isolated operating systems.

Bioconductor's Docker page itself provides a nice introduction to Docker and its usage with R, and we recommend reading it through. In this vignette, we will mainly give practical examples related to {scdrake}.

You can also run the image in SingularityCE (without RStudio) - see Singularity below.

Platform-specific notes

Linux hosts (important)

On a Linux host, two different Docker distributions can be installed: Docker Engine (DE) and Docker Desktop (DD).

If you are using a Linux host to run the {scdrake} image, we recommend sticking with Docker Engine (see the official FAQ on DD and Linux).

See here for how to quickly switch to DE if you have DD installed.

Otherwise, if you stay with Docker Desktop, mind the notes for Docker Desktop on Linux hosts throughout this vignette (they mainly concern file ownership and the -u parameter).

Windows

We recommend using WSL2 with its Ubuntu distribution for Docker Desktop (guide), as we have successfully tested this setup.


Cheatsheet

A quick summary of the more detailed steps below. Note that it does not cover the special case of a Linux host using Docker Desktop.


Get the scdrake image

Linux, Windows, and amd64 Mac

Latest stable version:

r knitr::knit(text = out_docker_pull_stable)

Latest development version:

r knitr::knit(text = out_docker_pull_latest)

Create a shared directory

mkdir ~/scdrake_projects
cd ~/scdrake_projects

Run the image

For Linux (using Docker Engine), Mac and Windows.

out_docker_run_rstudio <- scdrake::format_shell_command(c(
  "docker run -d",
  "-v $(pwd):/home/rstudio/scdrake_projects",
  "-p 8787:8787",
  "-e USERID=$(id -u)",
  "-e GROUPID=$(id -g)",
  "-e PASSWORD=1234",
  DOCKER_IMAGE_STABLE
))

r knitr::knit(text = out_docker_run_rstudio)

Find the ID or name of the running container

docker ps
CONTAINER ID   IMAGE                            COMMAND   CREATED        STATUS        PORTS                                       NAMES
d47b4d265052   scdrake:1.4.0-bioc3.15           "/init"   24 hours ago   Up 24 hours   0.0.0.0:8787->8787/tcp, :::8787->8787/tcp   condescending_payne

Run scdrake through its CLI

docker exec -it -u rstudio -w /home/rstudio/scdrake_projects <CONTAINER ID or NAME> \
  scdrake -h


Obtaining the Docker image

A Docker image based on the official Bioconductor image (version r BIOC_VERSION) is available. This is the handiest and most reproducible way to use {scdrake}, as all the dependencies are already installed and their versions are fixed. In addition, the parent Bioconductor image comes bundled with RStudio Server.

You can pull the Docker image with the latest stable {scdrake} version using

r knitr::knit(text = out_docker_pull_stable)

or list available versions in our Docker Hub repository.

For the latest development version use

r knitr::knit(text = out_docker_pull_latest)


Running the image

The {scdrake} image is based on the Bioconductor image, which is in turn based on the Rocker Project's rocker/rstudio image. Thanks to this, the {scdrake} image comes bundled with RStudio Server.

Docker allows you to mount local (host) directories into containers. We recommend creating a local directory in which individual {scdrake} projects (and other files, if needed) will live. This way you won't lose data when a container is destroyed. Let's create such a shared directory in your home directory and switch to it:

mkdir ~/scdrake_projects
cd ~/scdrake_projects

Running detached with RStudio Server

Important: this does not work on Linux hosts using Docker Desktop.

RStudio Server is a comfortable IDE for R. You can run the image together with an RStudio Server instance using:

r knitr::knit(text = out_docker_run_rstudio)

Let's decompose the command above:

-d runs the container in detached mode, i.e. in the background.

-v $(pwd):/home/rstudio/scdrake_projects mounts the current (shared) host directory to /home/rstudio/scdrake_projects inside the container.

-p 8787:8787 exposes port 8787, on which RStudio Server is listening inside the container (more on that below).

The -e VAR=VALUE arguments are used to set environment variables inside the container. You can check those for the parent rocker/rstudio image here.

Finally, an image can have a default command which is executed on container run. For our image, it's the /init script, which starts RStudio Server.

Now you can check that the container is running:

docker ps

You should see something like this:

CONTAINER ID   IMAGE                            COMMAND   CREATED        STATUS        PORTS                                       NAMES
d47b4d265052   scdrake:1.4.0-bioc3.15           "/init"   24 hours ago   Up 24 hours   0.0.0.0:8787->8787/tcp, :::8787->8787/tcp   condescending_payne

Values in the CONTAINER ID and NAMES columns can then be used to reference the running container.
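For example (a small sketch using the example name condescending_payne shown above; yours will differ), you can inspect or manage the running container by its name or ID:

docker logs condescending_payne    # show the container's logs (e.g. RStudio Server startup messages)
docker stop condescending_payne    # stop the container
docker start condescending_payne   # start it again later; mounts and exposed ports are preserved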

You can see that we have exposed port 8787. What is it useful for? It allows us to connect to the RStudio Server instance, which is running on port 8787 inside the container. Now you can open your browser, navigate to localhost:8787, and log in as rstudio with the password 1234.

TIP: if you are using a remote server that you can SSH into and that allows SSH tunneling, you can forward the exposed port to your host:

ssh -NL 8787:localhost:8787 user@server
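If port 8787 is already taken on your machine, you can forward any other free local port instead (9000 below is an arbitrary choice) and then browse to localhost:9000:

ssh -NL 9000:localhost:8787 user@server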

Now you can start using {scdrake} from RStudio, or read on for alternative ways.

Running detached without RStudio Server

If you just want to run the image detached without RStudio Server, use

out <- scdrake::format_shell_command(c(
  "docker run -d",
  "-u rstudio",
  "-v $(pwd):/home/rstudio/scdrake_projects",
  DOCKER_IMAGE_STABLE,
  "sleep infinity"
))

r knitr::knit(text = out)

Note for Docker Desktop on a Linux host: omit the -u rstudio (or use -u root) to run the image as root. This will mount /home/rstudio/scdrake_projects as UID 0 in the container, but the UID of the Docker Desktop user will be used for files in the host directory.

Running once ("run and forget")

Running the image in detached mode gives us the chance, for example, to find out more details about what went wrong in a failed pipeline run, or to work interactively in RStudio. However, we can also run the image in a "run and forget" mode where a container is started, {scdrake} is run, and finally the container is stopped and removed.

out <- scdrake::format_shell_command(c(
  "docker run -it --rm",
  "-v $(pwd):/home/rstudio/scdrake_projects",
  DOCKER_IMAGE_STABLE,
  "scdrake ..."
))

r knitr::knit(text = out)

Note for Linux hosts using Docker Engine: the active user in the container will be root and created files will belong to root. If your user ID (UID) on the host is 1000, you can add -u rstudio, and created files will then belong to you. If your UID is different, then either run the image including RStudio Server or refer to Problems with filesystem permissions below.
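As a concrete sketch, a run-and-forget invocation that initializes a new project could look like the following (the image reference is a placeholder for the tag you actually pulled, and -u rstudio assumes a host UID of 1000 as discussed above):

docker run -it --rm \
  -u rstudio \
  -v $(pwd):/home/rstudio/scdrake_projects \
  -w /home/rstudio/scdrake_projects \
  <scdrake image> \
  scdrake -d my_first_project init-project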


Issuing commands in the running container

You can start a bash or R process inside the container and attach to it using

docker exec -it -u rstudio <CONTAINER ID or NAME> bash
docker exec -it -u rstudio <CONTAINER ID or NAME> R

Note for Docker Desktop on a Linux host: omit the -u rstudio (or use -u root).
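Besides interactive sessions, you can run one-off commands the same way. For example (a sketch; as above, omit -u rstudio on Docker Desktop for Linux), to check which {scdrake} version is installed in the container:

docker exec -u rstudio <CONTAINER ID or NAME> \
  Rscript -e 'packageVersion("scdrake")'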


Problems with filesystem permissions {.tabset}

In the first example, when we ran the image including RStudio Server, we specified -e USERID=$(id -u) -e GROUPID=$(id -g), and so the ownership of /home/rstudio was changed to match the host user*. When you then write to the shared directory /home/rstudio/scdrake_projects, you are effectively doing so as the host user.

* This is actually not a feature of Docker, but of the rocker/rstudio image, which uses those environment variables in the /rocker_scripts/init_userconf.sh script that manages the ownership.

By default, the rstudio user in the container has a UID of 1000, which is commonly used for the default user in most Linux distributions. You can check your own UID on the host by executing id -u. If it is 1000, you can safely skip the instructions below.

Note for Linux hosts using Docker Desktop: in this case, Docker manages the file ownership as described here. For you it means the root user (UID 0) inside the container will write to the shared directory using the UID of the Docker Desktop user, so you don't have to care about the -u rstudio parameter of the docker exec command.


So if you don't want to run RStudio Server and your host user ID is not 1000, you have to take care of the ownership yourself. There are several ways to accomplish that. You can also read about this problem here or here.

Modify the user and group ID in the container

The rocker/rstudio parent image already comes with a script which can add or modify user information (it is actually run when you start the container including RStudio Server).

docker exec \
  -e USERID=$(id -u) -e GROUPID=$(id -g) \
  <CONTAINER ID or NAME> \
  bash /rocker_scripts/init_userconf.sh

Note that this just changes the UID and group of the rstudio user, so don't forget to later run commands as this user, i.e. docker exec -u rstudio ...
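You can verify that the change took effect by comparing the UID inside the container with your own (a quick sanity check, not a required step):

id -u                                                # your UID on the host
docker exec -u rstudio <CONTAINER ID or NAME> id -u  # should now print the same number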

"Chowning" (tedious)

If you have root privileges on the host, you can just change ownership each time you write to the shared directory from the container:

sudo chown -R $(id -u):$(id -g) scdrake_projects

Issuing {scdrake} commands through its command line interface (CLI)

We now assume that the {scdrake} container is running and the shared directory is mounted at /home/rstudio/scdrake_projects. The {scdrake} commands can be easily issued through its CLI (see vignette("scdrake_cli")):

docker exec -it -u rstudio <CONTAINER ID or NAME> \
  scdrake -h

A typical workflow is similar to the following:

  1. Initialize a new project:

docker exec -it -u rstudio -w /home/rstudio/scdrake_projects <CONTAINER ID or NAME> \
  scdrake -d my_first_project init-project

  2. Modify the configs on your host in my_first_project/config/.

  3. Run the pipeline:

docker exec -it -u rstudio -w /home/rstudio/scdrake_projects/my_first_project <CONTAINER ID or NAME> \
  scdrake --pipeline-type single_sample run

Note for Docker Desktop on a Linux host: omit the -u rstudio (or use -u root).


Create a command alias

To reduce typing/copy-pasting a bit, you can create a command alias in your shell, here for bash:

alias scdrake_docker="docker exec -it -u rstudio <CONTAINER ID or NAME> scdrake"

Note for Docker Desktop on a Linux host: omit the -u rstudio (or use -u root).

And then use the alias, e.g.

scdrake_docker -d /home/rstudio/scdrake_projects/my_first_project --pipeline-type single_sample run

Notice that we are using {scdrake}'s -d parameter, as the alias doesn't contain the -w parameter which instructs Docker to execute the command in a specific working directory.

Finally, you can put the alias permanently inside your ~/.bashrc file. Just don't forget to change <CONTAINER ID or NAME> when you start a new container :)
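For example (a sketch; the single quotes keep the embedded double quotes intact), you can append the alias and reload your shell configuration like this:

echo 'alias scdrake_docker="docker exec -it -u rstudio <CONTAINER ID or NAME> scdrake"' >> ~/.bashrc
source ~/.bashrc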

Alternatively, you can utilize the "run and forget" way described above:

out <- scdrake::format_shell_command(c(
  # The whole aliased command must be quoted so the shell treats it as a single alias value.
  'alias scdrake_docker="docker run -it --rm',
  "-v $(realpath ~/scdrake_projects):/home/rstudio/scdrake_projects",
  DOCKER_IMAGE_STABLE,
  'scdrake"'
))

r knitr::knit(text = out)

Singularity

Singularity is an alternative container runtime that does not require root access, so it can be used e.g. on HPC clusters.

You will probably use Singularity when you don't have root permissions on the host machine. It can be easily installed into user space, e.g. via the conda virtual environment/package manager as the conda-forge/singularity package.

Docker images can be pulled from Docker Hub similarly to docker pull:

r knitr::knit(text = out_singularity_pull_stable)

The only difference is the docker:// prefix, which tells Singularity where to look for the image. If the image is already present in the local Docker storage, you can use the docker-daemon: prefix instead.

After pulling, Singularity needs to convert the image to the SIF format, which can take quite some time. By default, the SIF file is saved in the current working directory.
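If you want to control where the SIF file ends up and how it is named, singularity pull also accepts an explicit output path as its first argument (the image reference below is a placeholder for the one you actually pull):

singularity pull scdrake_stable.sif docker://<scdrake image>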

Running the image in Singularity

There are some key differences compared to Docker: the container runs under your host user (so there are no filesystem permission issues), directories are mounted with --bind instead of -v, and the home directory is handled differently (note the --no-home flag below).

First, we will create a directory structure. These directories will be mounted into the container. Note that we also create a fake home directory; this is because some R packages need a cache directory, which is located there.

mkdir -p ~/scdrake_singularity
cd ~/scdrake_singularity
mkdir -p home/${USER} scdrake_projects/pbmc1k

Now we can issue {scdrake} commands through its CLI:

singularity exec \
    -e \
    --no-home \
    --bind "home/${USER}/:/home/${USER},scdrake_projects/:/home/${USER}/scdrake_projects" \
    --pwd "/home/${USER}/scdrake_projects" \
    path/to/scdrake_image.sif \
    scdrake <args> <command>

For example, to initialize a new project:

singularity exec \
    -e \
    --no-home \
    --bind "home/${USER}/:/home/${USER},scdrake_projects/:/home/${USER}/scdrake_projects" \
    --pwd "/home/${USER}/scdrake_projects/pbmc1k" \
    path/to/scdrake_image.sif \
    scdrake --download-example-data init-project

And to run the pipeline using that project:

singularity exec \
    -e \
    --no-home \
    --bind "home/${USER}/:/home/${USER},scdrake_projects/:/home/${USER}/scdrake_projects" \
    --pwd "/home/${USER}/scdrake_projects/pbmc1k" \
    path/to/scdrake_image.sif \
    scdrake --pipeline-type single_sample run

Note that you can also start a bash or R session and use {scdrake} within that, e.g.

singularity exec \
    -e \
    --no-home \
    --bind "home/${USER}/:/home/${USER},scdrake_projects/:/home/${USER}/scdrake_projects" \
    --pwd "/home/${USER}/scdrake_projects/pbmc1k" \
    path/to/scdrake_image.sif \
    R

and then in R

scdrake::run_single_sample_r()
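Alternatively (a sketch reusing the same bind, home, and working-directory options as above), singularity shell drops you into an interactive shell inside the container, from which you can call scdrake or R directly:

singularity shell \
    -e \
    --no-home \
    --bind "home/${USER}/:/home/${USER},scdrake_projects/:/home/${USER}/scdrake_projects" \
    --pwd "/home/${USER}/scdrake_projects/pbmc1k" \
    path/to/scdrake_image.sif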

