External Script Updates
From Array Suite Wiki
External Tool run reference
The External Tool is a form of EScript designed to run pipelines/workflows using public bioinformatics tools that are not included in the Omicsoft distribution.
The current update integrates support for both cloud and Docker, which makes the old syntax more flexible: users can define their own tools, or use predefined tools without complicated environment configuration.
An EScript can now be run on:
- client (ArrayStudio) with or without Docker
- server (ArrayServer > SendToQueue, oshell) with or without Docker
- cluster (ArrayServer > SendToQueue, oshell)
- cloud (ArrayServer > SendToQueue, oshell) with or without Docker
To use Docker, the following conditions must be met:
- AMI/Server: Docker (recommended version: 19.03.8)
- ECR: EC2s must have permission to ECR
- InstanceType: AMI must be set depending on the type of the instance that will be run
Current limitations when running an external tool on the server:
When an external tool running on the server needs one or more resource files (specified via the Resources section), the ParallelJobNumber option must be set to 1. With a value greater than 1, the multiple external processes may contend for the same resource files, causing one or more of them to produce no output. If the external tool does not need any resource files, it is safe to set ParallelJobNumber to a value greater than 1.
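The limitation above can be expressed as a small pre-flight check. This is only a sketch: `check_parallel_jobs` is a hypothetical helper, not part of ArrayServer or oshell, and the resource path is a made-up example.

```python
def check_parallel_jobs(resource_files, parallel_job_number):
    """Return a warning string if the settings risk resource contention, else None."""
    # Hypothetical rule taken from the limitation described above:
    # resource files plus more than one parallel job may lead to missing output.
    if resource_files and parallel_job_number > 1:
        return (
            "parallel job count should be 1 when resource files are used "
            f"(got {parallel_job_number} with {len(resource_files)} resource file(s))"
        )
    return None


# Made-up resource path, used only for illustration.
print(check_parallel_jobs(["/refs/transcripts.idx"], 4))
print(check_parallel_jobs([], 4))
```

A check of this kind could be run before sending the script to the queue, so that a risky combination is caught before any job produces empty output.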
The External Tool syntax can be found here: External Script Syntax
General steps to follow when building an EScript:
A processing step in the script might use the result of a previous analysis. This result must be provided under the Resources section. Because the results of the previous analysis might not be ready when the script is first read, these files need to be distinguished from the input files. More details about the syntax of the Resources section can be found in the syntax page.
Most scripts require several input files in order to run. These input files are provided in the Files section and are read according to the Mode provided in the options. Supported read modes are single, paired and multiple. Each input file must be enclosed in quotes, and the section always ends with a semicolon.
After the files, the user has to provide a name for the EScript, under EScriptName. This name is later used to gather the output results and to present any error logs in the solution project. Just like before, the section ends with a semicolon.
The user can enter several commands. Each command must be prefixed with the keyword Command. In the background, each command is converted into a Docker command and linked with a Docker input and output directory. In this way, the user can seamlessly use the tool without worrying about Docker parameters. Just like before, each command has to be terminated with a semicolon.
Command kallisto quant -i %Resource1% -o "%OutputFolder%" -b 100 %FilePath1% %FilePath2%;
Command kallisto version;
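To make the idea of macro expansion and Docker wrapping more concrete, here is a rough sketch of how a command such as the kallisto one above might be turned into a `docker run` invocation. It does not reproduce what ArrayServer actually does: the helper names, the `/input` and `/output` mount points, the host directories and the image name `example/kallisto:latest` are all assumptions made for illustration. Only the `docker run`, `--rm` and `-v` options are standard Docker CLI.

```python
import re
import shlex


def expand_macros(command, values):
    """Replace each %Name% placeholder with values[Name]; unknown names raise KeyError."""
    return re.sub(r"%(\w+)%", lambda m: values[m.group(1)], command)


def to_docker_command(image, command, input_dir, output_dir):
    """Wrap a tool command in `docker run`, mounting an input and an output directory.

    The /input and /output mount points are illustrative, not the product's real ones.
    """
    return [
        "docker", "run", "--rm",
        "-v", f"{input_dir}:/input:ro",   # read-only input mount
        "-v", f"{output_dir}:/output",    # writable output mount
        image,
        *shlex.split(command),
    ]


template = 'kallisto quant -i %Resource1% -o "%OutputFolder%" -b 100 %FilePath1% %FilePath2%'
# Hypothetical macro values, as seen from inside the container.
values = {
    "Resource1": "/input/transcripts.idx",
    "OutputFolder": "/output",
    "FilePath1": "/input/sample_1.fastq.gz",
    "FilePath2": "/input/sample_2.fastq.gz",
}
expanded = expand_macros(template, values)
argv = to_docker_command("example/kallisto:latest", expanded, "/data/in", "/data/out")
print(shlex.join(argv))
```

Building the command as an argument list rather than a single string avoids quoting problems when a value contains special characters.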
The Options section contains parameters that work the same way as in a regular EScript (ParallelJobNumber, ThreadNumberPerJob, Mode, ErrorOnStdErr and ErrorOnMissingOutput), plus UseDev2=true, which uses a dev environment to update the EC2 instance. The parameter RunOnDocker=true is mandatory for using the script with external tool support. ImageName dictates which tool will be deployed on the EC2 instance. The available tools are presented in Usage. OutputFolder is required in the options because it is different from the script's output folder: the script's output folder is a virtual path, while the OutputFolder in the options is the output folder on an EC2 instance, which cannot access the virtual paths defined by the user. More details about the Options section are found here.
Options /ParallelJobNumber=1 /ThreadNumberPerJob=8 /Mode=Single /ErrorOnStdErr=False /ErrorOnMissingOutput=True /RunOnDocker=True /ImageName="quay.io/biocontainers/star:2.7.3a--0" /UseCloud=True /UseDev2=True /OutputFolder="/GhindariuCloudFolder/Output/Results/star" /InstanceType=m4.4xlarge /VolumeSize=50;
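When checking an Options line by hand, it can help to split it into key/value pairs. The following sketch assumes the `/Key=Value` layout shown in the example above. `parse_options` is a hypothetical helper and is not how the EScript engine itself reads options.

```python
import shlex


def parse_options(line):
    """Parse an 'Options /Key=Value ...;' line into a dict of strings."""
    body = line.strip().rstrip(";").strip()
    tokens = shlex.split(body)  # also removes the quotes around quoted values
    if not tokens or tokens[0].lower() != "options":
        raise ValueError("line does not start with 'Options'")
    options = {}
    for token in tokens[1:]:
        if not token.startswith("/") or "=" not in token:
            raise ValueError(f"unexpected token: {token!r}")
        key, value = token[1:].split("=", 1)
        options[key] = value
    return options


opts = parse_options(
    'Options /ParallelJobNumber=1 /Mode=Single /RunOnDocker=True '
    '/ImageName="quay.io/biocontainers/star:2.7.3a--0" /VolumeSize=50;'
)
print(opts["ImageName"])
```

All values are kept as strings, since the sketch does not know which options the engine treats as numbers or booleans.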
Finally, in the Output section, External Tool EScripts support transforming the result of an analysis. This is required because running the same command on multiple input files produces output files with the same name. The output transformation ensures that these files are not overwritten.
Output "/GhindariuCloudFolder/Output/Abundances/abundance.tsv => /GhindariuCloudFolder/Output/Abundances/%PairName%_abundance.tsv" /Type=tsv;
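The effect of such a transformation rule can be illustrated with a short sketch. It assumes the rule has the `source => target` form shown above and that `%PairName%` is replaced once per input pair. `apply_output_rule` and the sample names are hypothetical.

```python
def apply_output_rule(rule, pair_name):
    """Split a 'source => target' rule and fill %PairName% in the target."""
    source, sep, target = rule.partition("=>")
    if not sep:
        raise ValueError("rule must contain '=>'")
    return source.strip(), target.strip().replace("%PairName%", pair_name)


rule = (
    "/GhindariuCloudFolder/Output/Abundances/abundance.tsv => "
    "/GhindariuCloudFolder/Output/Abundances/%PairName%_abundance.tsv"
)
# Hypothetical pair names: every pair produces the same source file name,
# but the expanded target names differ, so nothing is overwritten.
for pair in ["sampleA", "sampleB"]:
    source, target = apply_output_rule(rule, pair)
    print(source, "->", target)
```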
Frequent errors & Troubleshooting
- File paths must not contain spaces
- Options > /OutputFolder parameter doesn't accept Global Macros, only Reserved Macros (e.g. placeholders from the input/resource files: %PairName%, %FileName%, %ResourceFolder%)
- Be careful with the Reserved Macros! Macros differ depending on the mode (e.g. FilePath and FileName macros are not supported in multiple mode)
- Global Macros should work everywhere in the EScript except for the OutputFolder
- Both kallisto index and kallisto quant write their output to the error stream. This is a limitation of the bioconda kallisto tool itself.
More info here: Kallisto on EScript.
More info: STAR on EScript.
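Two of the pitfalls listed above (spaces in file paths, and FilePath/FileName macros in multiple mode) can be caught with a simple check before submitting a script. The sketch below is a hypothetical linter, not an Omicsoft tool. It assumes that macros are written as `%Name%`, optionally followed by a number as in `%FilePath1%`.

```python
import re


def lint_escript_inputs(file_paths, mode, commands):
    """Return a list of human-readable problems; an empty list means none were found."""
    problems = []
    for path in file_paths:
        if " " in path:
            problems.append(f"file path contains a space: {path!r}")
    if mode.lower() == "multiple":
        for command in commands:
            # FilePath/FileName macros are reported as unsupported in multiple mode.
            for macro in re.findall(r"%(FilePath\d*|FileName\d*)%", command):
                problems.append(f"%{macro}% is not supported in Multiple mode")
    return problems


# Made-up inputs, used only to show both kinds of findings.
for problem in lint_escript_inputs(
    ["/data/my sample.fastq"], "Multiple", ["sometool %FilePath1%"]
):
    print(problem)
```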
The External Tool script can also be run from the GUI by importing the pscript attached in Jira: ARRS-1003
It exposes the same parameters as above:
ParallelJobNumber, ThreadNumberPerJob, Mode, UseCloud, ErrorOnStdErr and ErrorOnMissingOutput have standard predefined values. The rest of the fields are fully editable by the user. The script cannot run without input files, an output folder, and a solution open in ArrayStudio.