External Scripts with Dockers
From Array Suite Wiki
This page explains how to run an EScript with Docker support. Some sections are described in more detail on other pages, which are linked from the relevant section.
External Scripts (EScripts) are designed to run pipelines and workflows that use public bioinformatics tools not included in the OmicSoft distribution. To simplify distributing these tools, EScripts support Docker-based tools. Docker packages each tool into a container, a small prepackaged unit that can be deployed in any environment with Docker support. This means that once you have configured Docker, you do not need to install any other dependencies or packages on your server to run new tools!
To use Docker, the following conditions must be met:
- AMI/Server: Docker (recommended version: 19.03.8)
- ECR: EC2s must have permission to ECR if you wish to use private Docker images
- InstanceType: The InstanceType must be set to match the resources the job needs, because different bioinformatics tools require vastly different computational resources
To run Dockerized tools on a server, you will need to install Docker (v19.03.8 tested) and configure any necessary permissions so that the OmicSoft Server account can run the "docker" command.
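One way to check this requirement is to run a small script as the OmicSoft Server account. The Python sketch below is only an illustration: the function name `docker_available` is hypothetical, and it assumes that a successful `docker ps` means the account can both find the Docker client and reach the Docker daemon.

```python
import shutil
import subprocess


def docker_available(timeout_seconds: float = 30.0) -> bool:
    """Return True if `docker` is on PATH and `docker ps` exits successfully."""
    # Step 1: is there a docker executable visible to this account?
    if shutil.which("docker") is None:
        return False
    # Step 2: can this account talk to the Docker daemon?
    # `docker ps` exits non-zero when the daemon is unreachable
    # or the account lacks permission to use it.
    try:
        result = subprocess.run(
            ["docker", "ps"],
            capture_output=True,
            text=True,
            timeout=timeout_seconds,
        )
    except (OSError, subprocess.TimeoutExpired):
        return False
    return result.returncode == 0


if __name__ == "__main__":
    print("docker usable by this account:", docker_available())
```

If the script reports `False` while Docker is installed, the usual cause is that the account is not allowed to access the Docker daemon.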
To run Dockerized tools in the cloud, you will need the OmicSoft Cloud add-on and must specify an AMI that includes Docker; see Ubuntu Cloud AMIs for a compatible AMI for your region, or build your own AMI with the appropriate dependencies.
As far as we are aware, any Docker image can be integrated with the current external script solution. Several Docker repositories maintain pre-built Docker images that are quick to deploy. Please find here a list of maintained images for STAR and Kallisto.
Users can also build their own Docker images to run tools from the OmicSoft framework.
Data Flow Diagram
EScript Syntax for Docker commands
The syntax for using an External Tool is described on the EScript_Syntax_Updates page.
Dockerized External Scripts can be run on:
- client (Studio)
- server (SendToQueue, oshell)
- cluster (SendToQueue, oshell)
- cloud (SendToQueue, oshell)
Temporary limitations when running an external tool on the server
When running an external tool on the server that needs one or more Resources (see below), please set the ParallelJob option value to 1. Otherwise, resource contention between multiple external processes trying to read from the same Resource can cause one or more of these processes to produce no output. If your external tool does not need any resource files, it is safe to set the ParallelJob option to a value greater than 1. This is not an issue for Cloud-based analyses.
Follow these general patterns for building an EScript with Docker images. (For more details, please see the Syntax page):
Begin RunEScript /RunOnServer=True;
Resources "(Any Resource Files you need. They need to be in the same folder, but you can list multiple files)";
Files "(Any input files. Depending on whether each file should be processed independently, as read pairs according to OmicSoft's pairing logic, or all files in one analysis, /Mode should be set to Single, Paired, or Multiple)";
EScriptName AnyNameYouLike;
Command (the exact command you would like to run, with parameters specified as literals or macros);
Options /Mode=(Single|Paired|Multiple) /RunOnDocker=True /ImageName="Repo/Image:Version" /UseCloud=(True|False) /OutputFolder=(OutputFolderPath);
End;
In the Resources section, specify any file(s) that are used by the EScript command for every input file, such as a genome reference or an annotation file.
As with other OScripts, most scripts act on one or more input files. These input files are listed in the Files section and are read according to the /Mode set in the options. The supported modes are Single, Paired and Multiple. Each input file must be enclosed in quotes, and the section always ends with a semicolon.
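The three modes can be pictured as different ways of grouping the input files into jobs. The Python sketch below is an assumption-laden illustration, not OmicSoft's implementation: the function name `group_files` is hypothetical, and in Paired mode it simply pairs adjacent files after sorting, whereas OmicSoft's actual pairing logic is not described on this page.

```python
from typing import List


def group_files(files: List[str], mode: str) -> List[List[str]]:
    """Group input files into per-job lists under an assumed reading of /Mode."""
    mode = mode.lower()
    if mode == "single":
        # One job per file.
        return [[f] for f in files]
    if mode == "paired":
        # ASSUMPTION: mates sort next to each other (e.g. *_1 and *_2).
        if len(files) % 2 != 0:
            raise ValueError("Paired mode needs an even number of files")
        ordered = sorted(files)
        return [ordered[i:i + 2] for i in range(0, len(ordered), 2)]
    if mode == "multiple":
        # One job that receives every file.
        return [list(files)]
    raise ValueError(f"Unknown mode: {mode}")


fastqs = [
    "SRR521461_1.fastq.gz",
    "SRR521461_2.fastq.gz",
    "SRR521462_1.fastq.gz",
    "SRR521462_2.fastq.gz",
]
for mode in ("Single", "Paired", "Multiple"):
    print(mode, len(group_files(fastqs, mode)), "job(s)")
```

With four FASTQ files, this grouping yields four jobs in Single mode, two in Paired mode, and one in Multiple mode.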
Files
"/GhindariuCloudFolder/ArrayServer/Input/Fastqs/SRR521461_1.fastq.gz"
"/GhindariuCloudFolder/ArrayServer/Input/Fastqs/SRR521461_2.fastq.gz"
"/GhindariuCloudFolder/ArrayServer/Input/Fastqs/SRR521462_1.fastq.gz"
"/GhindariuCloudFolder/ArrayServer/Input/Fastqs/SRR521462_2.fastq.gz";
After the Files section, the user has to provide a name for the EScript under EScriptName. This name is later used to gather the output results and to present any error logs in the solution project. As before, the section ends with a semicolon.
In the Command section, the user can enter one or more commands, which will be executed in the Docker environment. Each command must be prefixed with the keyword Command.
When running a Docker command, OmicSoft automatically constructs the Docker syntax, including the mapping of Resource, Input and Output files for each executed Docker container.
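To make the idea of file mapping concrete, here is a rough Python sketch of how a tool command could be wrapped in a `docker run` call with bind mounts. It is not OmicSoft's actual implementation, which this page does not show. The function name `build_docker_command`, the choice to mount each host folder at the same path inside the container, and the use of `sh -c` (which assumes the image provides a shell) are all assumptions made for illustration.

```python
import shlex
from typing import List


def build_docker_command(image: str, tool_command: str, host_folders: List[str]) -> str:
    """Wrap a tool command in `docker run`, bind-mounting each host folder."""
    parts = ["docker", "run", "--rm"]
    # De-duplicate folders while keeping their order.
    for folder in dict.fromkeys(host_folders):
        # ASSUMPTION: mount at the same path so file paths need no rewriting.
        parts += ["-v", f"{folder}:{folder}"]
    # ASSUMPTION: the image contains `sh`, so redirections like 2>&1 work.
    parts += [image, "sh", "-c", tool_command]
    return " ".join(shlex.quote(p) for p in parts)


cmd = build_docker_command(
    image="quay.io/biocontainers/star:2.7.3a--0",
    tool_command="STAR --version",
    host_folders=[
        "/GhindariuCloudFolder/ArrayServer/Input/Fastqs",
        "/GhindariuCloudFolder/Output/Results/star",
    ],
)
print(cmd)
```

Mounting folders at identical paths is a simplification; any real mapping only needs the paths used inside the command to resolve inside the container.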
Make use of reserved macros like %FilePath% and %OutputFolder% to allow OmicSoft to automatically substitute files into the command.
Just like before, each command has to be terminated with a semicolon.
Command kallisto quant -i %Resource1% -o "%OutputFolder%" -b 100 %FilePath1% %FilePath2%;
This command pulls the first listed resource as %Resource1%, specifies the /OutputFolder parameter from the Options section (below) as %OutputFolder%, and sends each pair of files from the Files section to a Docker command as %FilePath1% and %FilePath2%.
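The substitution described above can be sketched as a simple string replacement. This Python example is illustrative only: `expand_macros` is a hypothetical helper, and the resource path `/resources/transcripts.idx` is an invented placeholder, not a path from this page.

```python
from typing import Dict


def expand_macros(template: str, values: Dict[str, str]) -> str:
    """Replace each %Name% placeholder in the template with its value."""
    expanded = template
    for name, value in values.items():
        expanded = expanded.replace(f"%{name}%", value)
    return expanded


template = 'kallisto quant -i %Resource1% -o "%OutputFolder%" -b 100 %FilePath1% %FilePath2%'
values = {
    "Resource1": "/resources/transcripts.idx",  # hypothetical resource file
    "OutputFolder": "/GhindariuCloudFolder/Output/Abundances",
    "FilePath1": "/GhindariuCloudFolder/ArrayServer/Input/Fastqs/SRR521461_1.fastq.gz",
    "FilePath2": "/GhindariuCloudFolder/ArrayServer/Input/Fastqs/SRR521461_2.fastq.gz",
}
print(expand_macros(template, values))
```

In Paired mode, the same template would be expanded once per pair of input files, with only %FilePath1% and %FilePath2% changing between jobs.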
As with all OScripts, the Options section accepts some parameters that work the same way as in a regular OScript (/ParallelJobNumber, /ThreadNumberPerJob).
In addition, certain Escript parameters (/Mode, /ErrorOnStdErr, /ErrorOnMissingOutput) are often useful in Docker commands as well.
The key parameters for running a Dockerized tool are /RunOnDocker=True and /ImageName=Repo/ImageName:Version, which determine which tool will be deployed on the EC2 instance. Some available commands are presented in Usage. In addition, /OutputFolder is required in the Options section; its value is substituted for %OutputFolder% in each command of the Command section. /UseCloud=True is required if you want to run AWS-based analyses (requires the Cloud add-on and proper configuration). Additional parameters are defined on the Syntax page.
Options /ParallelJobNumber=1 /ThreadNumberPerJob=8 /Mode=Single /ErrorOnStdErr=False /ErrorOnMissingOutput=True /RunOnDocker=True /ImageName="quay.io/biocontainers/star:2.7.3a--0" /UseCloud=True /OutputFolder="/GhindariuCloudFolder/Output/Results/star" /InstanceType=m4.4xlarge /VolumeSize=50;
Finally, in the Output section, External Tool EScripts support transforming the results of an analysis. This is often required because running the same command on multiple input files produces files with the same output name, which are written to the same output directory and overwrite each other. To prevent this, Output Transformation renames each output file as it is produced.
Output "/GhindariuCloudFolder/Output/Abundances/abundance.tsv => /GhindariuCloudFolder/Output/Abundances/%PairName%_abundance.tsv" /Type=tsv;
In this pattern, every time a file is written from one of the parallel jobs to "abundance.tsv" it is transformed to "%PairName%_abundance.tsv", substituting the individual sample's name for "%PairName%".
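The effect of such a rule can be sketched in Python. This is an illustration under assumptions: `transform_output_name` is a hypothetical helper that only computes the target name, and the sample names passed as pair names are assumed, since how OmicSoft derives %PairName% is not described here.

```python
from typing import Tuple


def transform_output_name(rule: str, pair_name: str) -> Tuple[str, str]:
    """Split a 'source => target' rule and fill %PairName% into the target."""
    source, target = (part.strip() for part in rule.split("=>"))
    return source, target.replace("%PairName%", pair_name)


rule = (
    "/GhindariuCloudFolder/Output/Abundances/abundance.tsv => "
    "/GhindariuCloudFolder/Output/Abundances/%PairName%_abundance.tsv"
)
# ASSUMPTION: the pair names are the sample accessions.
for pair_name in ("SRR521461", "SRR521462"):
    source, target = transform_output_name(rule, pair_name)
    print(source, "->", target)
```

Because every sample maps to a distinct target name, results from parallel jobs no longer collide in the shared output directory.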
Frequent errors & Troubleshooting
- File paths must not contain spaces
- When submitting an EScript through the GUI's "Run Script (Send To Queue)" option, the /OutputFolder parameter doesn't accept Global Macros, only Reserved Macros (e.g. placeholders derived from the input/resource files: %PairName%, %FileName%, %ResourceFolder%).
- Be careful with the Reserved Macros! The available macros differ depending on the mode (e.g. the FilePath and FileName macros are not supported in Multiple mode)
- Global Macros should work everywhere in the EScript except for the OutputFolder
- Many bioinformatics tools write to "Standard Error" (STDERR), which can cause OmicSoft to flag the run as failed. You can avoid this by redirecting STDERR to STDOUT
- e.g. kallisto quant (parameters) 2>&1;
- For example, both kallisto index and kallisto quant write their output to the error stream. This is a limitation of the kallisto tool itself.
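The effect of the `2>&1` redirection can be demonstrated with a small Python sketch. The child process here is a stand-in for a tool that reports progress on STDERR; it is not kallisto, and the message text is made up for the demonstration.

```python
import subprocess
import sys

# Child process standing in for a tool that writes progress to STDERR.
child = [sys.executable, "-c", "import sys; sys.stderr.write('progress message\\n')"]

# Without redirection: the message arrives on the error stream.
separate = subprocess.run(child, capture_output=True, text=True)

# With stderr=subprocess.STDOUT (the equivalent of `2>&1`):
# the same message arrives on STDOUT and the error stream is not captured.
merged = subprocess.run(
    child, stdout=subprocess.PIPE, stderr=subprocess.STDOUT, text=True
)

print("separate:", repr(separate.stdout), repr(separate.stderr))
print("merged:  ", repr(merged.stdout), repr(merged.stderr))
```

A caller that treats any STDERR output as a failure would flag the first run but not the second, even though the tool behaved identically in both.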
Exposing a Dockerized script to GUI
The External Tool script can also be run from the GUI by importing the pscript.
It exposes the same parameters as above:
ParallelJobNumber, ThreadNumberPerJob, Mode, UseCloud, ErrorOnStdErr and ErrorOnMissingOutput have standard predefined values. The rest of the fields are fully editable by the user. The script cannot run without input files, an output folder, and a solution open in ArrayStudio.
More info here: Kallisto on EScript.
More info: STAR on EScript.