EScript Syntax Updates

From Array Suite Wiki

With the addition of Cloud and Docker support, several new parameters have been introduced to EScripts.

Prerequisites

AMI/Server: Docker (recommended version: 19.03.8)

ECR: EC2 instances must have permission to access ECR if you wish to use private Docker images

If running on Cloud, make sure your AMI has Docker installed on it, or use the latest OmicSoft Docker-compatible AMI.

Syntax

Minimal Escript Skeleton for Docker runs

Begin RunEScript /RunOnServer=True;
Resources
"
(Any Resource Files you need. They need to be in the same folder, but you can list multiple files)
";
Files
"
(Any input files. Depending on whether each file should be processed independently, as read pairs according to OmicSoft's pairing logic,
or all files in one analysis, /Mode should be set to Single, Paired, or Multiple)
";
EScriptName AnyNameYouLike;
Command (the exact command you would like to run, with parameters specified as literals or macros);
Options /Mode=(Single|Paired|Multiple) /RunOnDocker=True /ImageName="Repo/Image:Version" /UseCloud=(True|False) /OutputFolder=(OutputFolderPath);
End;

Updates to Escript syntax

To support Docker and Cloud runs, several additional parameters were introduced to the External Script syntax.

Files and Folders

Most scripts require several input files in order to run. These are listed in the Files section and are read according to the /Mode given in the Options. Supported modes are Single, Paired and Multiple. The file list must be enclosed in quotes, and the section always ends with a semicolon.


  • Resources section - Allows specification of one or more files to be used as a "resource" for all samples analyzed
    • All files specified in a Resources section must be in the same folder
    • Files may be referred to with %Resource1%, %Resource2%, etc.
    • Use %ResourceFolder% to refer to the folder (useful for STAR and other commands that need to know where a folder is)
  • Files section - all files specified in the Files section must be inside the same folder (does not apply to /Mode=Single)
  • Usage in Command:
    • /Mode=Single: %FilePath% refers to the input file
    • /Mode=Paired: %FilePath1% and %FilePath2% refer to the two files of each input pair
  • /OutputFolder=some-path (in Options section)
    • This option is required. Inside the Command, refer to it as %OutputFolder%; the placeholder is resolved according to the run context (for example, to /app/_Output_ inside a Docker container)
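
The way /Mode groups the Files list and how the placeholders are filled in can be illustrated with a small Python model. This is only a sketch under stated assumptions, not OmicSoft's implementation: the function names group_inputs and expand_placeholders are hypothetical, and Paired mode is modelled as consecutive pairs in listed order, which may differ from OmicSoft's actual pairing logic. Placeholders for Multiple mode are not described above and are therefore not modelled.

```python
from pathlib import PurePosixPath


def group_inputs(files, mode):
    """Split the Files list into per-job groups for a given /Mode.

    Assumption: Paired mode is modelled as consecutive pairs in listed
    order; the real pairing logic may differ.
    """
    mode = mode.lower()
    if mode == "single":
        return [[f] for f in files]
    if mode == "paired":
        if len(files) % 2:
            raise ValueError("Paired mode needs an even number of files")
        return [files[i:i + 2] for i in range(0, len(files), 2)]
    if mode == "multiple":
        return [list(files)]
    raise ValueError(f"unknown mode: {mode}")


def expand_placeholders(command, group, resources, output_folder):
    """Fill the %...% placeholders of a Command string for one job."""
    values = {"%OutputFolder%": output_folder}
    if len(group) == 1:                       # /Mode=Single
        values["%FilePath%"] = group[0]
    elif len(group) == 2:                     # /Mode=Paired
        values["%FilePath1%"], values["%FilePath2%"] = group
    for i, path in enumerate(resources, start=1):
        values[f"%Resource{i}%"] = path
    if resources:
        # All resources share one folder, so the first one defines it.
        values["%ResourceFolder%"] = str(PurePosixPath(resources[0]).parent)
    for name, value in values.items():
        command = command.replace(name, value)
    return command


files = ["/path/to/files/File1a.fastq", "/path/to/files/File1b.fastq"]
job = group_inputs(files, "Paired")[0]
print(expand_placeholders(
    'kallisto quant -i "%Resource1%" -o "%OutputFolder%" %FilePath1% %FilePath2%',
    job, ["/path/to/resources/MyResource.fastq"], "/path/to/files/OutputFolder"))
```

With four files and /Mode=Paired, group_inputs returns two groups, one per job, which is why a /ParallelJobNumber of 2 can process both pairs at once.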

Additional EScript Options

These options will be specified in the Options section of the command.

  • /RunOnDocker=True - required - indicates whether script should be run in a Docker container (cloud or locally)
  • /ImageName=myDockerImage:v1 - required - indicates the docker image to be used by the command
  • /DockerArgs=--rm -i -t - optional - additional docker run arguments (e.g. --rm tells Docker to remove the container after the job is finished)
    • default value, if not specified, is: --rm

Cloud Support

  • /UseCloud=True - determines whether the analysis is performed on an EC2 instance (True) or on the local server (False)
  • /InstanceType=c5.xlarge for specifying custom instance types
    • Default is OSummaryInstanceType defined in ArrayServer.cfg, or m4.large if not specified
  • /VolumeRatio - volume size as a multiple of the input size (e.g. /VolumeRatio=4 for 4 x input-size)
  • /VolumeSize - volume size as a specific value in GB (e.g. /VolumeSize=1000)
    • default is 4 x input-size which will be attached only if (4 x input-size) < 5GB
    • a specific size >= 0 will always be attached

Image Repository Access

  • /DockerRegistry=DockerHubPrivate|DockerHubPublic|ECR - specifies type of registry
    • default value, if not specified is: DockerHubPublic
    • ECR: supported only on cloud
    • DockerHubPrivate: not yet supported (to be added)

Image Repository Types

Public Docker hub repository: no additional configuration needed

AWS ECR: docker login requires, at a minimum, the "GetAuthorizationToken" permission in your AWS policy.

  • the docker login command is run automatically before the Command
  • the login is only valid for 12 hours on that instance

Other Options (generally useful)

  • /ParallelJobNumber = number of analyses that can be run in parallel (on different machines)
  • /ThreadNumberPerJob = number of threads used by each analysis
  • /Mode = Input file mode. It can be Single, Paired or Multiple, and determines how the input files are read. If Paired, the files are grouped in pairs and each pair is submitted to the command together. If Multiple, all input files are passed to a single command (e.g. when merging many files together).
  • /ErrorOnStdErr = Throw an error if the command writes to Standard Error

WARNING: Kallisto, Tophat, and other tools write to STDERR during normal operation; redirecting it with 2>&1 avoids jobs always ending as Finished With Errors

  • /ErrorOnMissingOutput = raise an Error flag if no output files were generated
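
The interaction between /ErrorOnStdErr and a 2>&1 redirect can be reproduced outside of Array Server. The Python sketch below is an illustration only: the child process is a hypothetical stand-in for a tool such as Kallisto that logs progress to STDERR while exiting successfully, and flagged_as_error is an assumed model of the /ErrorOnStdErr check, not OmicSoft's code.

```python
import subprocess
import sys

# Hypothetical stand-in for a tool that logs progress to stderr
# but still exits with status 0.
CHILD = [sys.executable, "-c",
         "import sys; print('progress: 50%', file=sys.stderr); print('result')"]


def run(merge_stderr):
    """Run the child; optionally merge stderr into stdout (like 2>&1)."""
    proc = subprocess.run(
        CHILD,
        stdout=subprocess.PIPE,
        stderr=subprocess.STDOUT if merge_stderr else subprocess.PIPE,
        text=True,
    )
    # proc.stderr is None when stderr was merged into stdout.
    return proc.returncode, proc.stdout, proc.stderr or ""


def flagged_as_error(returncode, stderr_text, error_on_stderr=True):
    """Assumed model of /ErrorOnStdErr: any stderr output counts as a failure."""
    return returncode != 0 or (error_on_stderr and bool(stderr_text.strip()))


for merge in (False, True):
    code, out, err = run(merge)
    print(f"merged={merge} flagged={flagged_as_error(code, err)}")
```

Without the redirect the progress message alone is enough to flag the job; with it, STDERR stays empty and only the exit status matters. The alternative is to set /ErrorOnStdErr=False, as the Kallisto example on this page does.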

A complicated Escript example for Docker runs

This Escript example demonstrates many powerful aspects of External Scripts with Docker images:

  1. Macros are specified at the beginning, allowing quick tweaks of parameters within the Oscript
  2. A custom InstanceType is specified using a macro
  3. Multiple Resources are specified (although only one is used)
  4. Four input files are specified, as two file pairs.
  5. Two parallel jobs are specified, so the two file pairs are processed simultaneously.
  6. The Command uses %FilePath1% and %FilePath2% to specify the input pairs
  7. The output generated by Kallisto is renamed from the generic "abundance.tsv" to "%PairName%_abundance.tsv".

Begin Macro;
@ThreadNumberPerJob@ 2;
@Bootstrap@ 100;
@ParallelJobNumber@ 2;
@Mode@ Paired;
@InstanceType@ "m4.xlarge";
@ErrorOnMissingOutput@ True;
@OutputFolderName@ "/path/to/files/OutputFolder";
@UseCloud@ True;
End;
Begin RunEScript /RunOnServer=True;
Resources
"
/path/to/resources/MyResource.fastq
/path/to/resources/AnotherResource.idx
/path/to/resources/OneMoreResource.tsv
";
Files
"
/path/to/files/File1a.fastq
/path/to/files/File1b.fastq
/path/to/files/File2a.fastq
/path/to/files/File2b.fastq
";
EScriptName MyEscriptName;
Command kallisto quant -i "%Resource1%" -t @ThreadNumberPerJob@ -o "%OutputFolder%" -b @Bootstrap@ %FilePath1% %FilePath2%;
Options /ParallelJobNumber=@ParallelJobNumber@ /ThreadNumberPerJob=@ThreadNumberPerJob@ /Mode=@Mode@ /InstanceType=@InstanceType@ /ErrorOnStdErr=False /ErrorOnMissingOutput=@ErrorOnMissingOutput@ /RunOnDocker=True /ImageName="omicdocker/kallisto:testing" /UseCloud=@UseCloud@ /OutputFolder="@OutputFolderName@/%PairName%";
Output "@OutputFolderName@/%PairName%/abundance.tsv => @OutputFolderName@/%PairName%_abundance.tsv" /Type=tsv;
End;
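
The macro mechanism used in this example behaves like plain text substitution of @Name@ tokens. A minimal Python sketch of that idea follows. The helper names parse_macros and apply_macros are hypothetical, and the sketch assumes that the quotes around a macro value are dropped on substitution, which the usage "@OutputFolderName@/%PairName%" above suggests but does not confirm.

```python
import re


def parse_macros(block):
    """Read '@Name@ value;' lines from a Begin Macro ... End block."""
    macros = {}
    for match in re.finditer(r'@(\w+)@\s+(.*?);', block):
        name, value = match.group(1), match.group(2).strip()
        # Assumption: surrounding double quotes are not part of the value.
        macros[name] = value.strip('"')
    return macros


def apply_macros(text, macros):
    """Replace every @Name@ with its value; unknown names are left as-is."""
    return re.sub(r'@(\w+)@',
                  lambda m: macros.get(m.group(1), m.group(0)),
                  text)


macro_block = '''Begin Macro;
@ThreadNumberPerJob@ 2;
@InstanceType@ "m4.xlarge";
@OutputFolderName@ "/path/to/files/OutputFolder";
End;'''

macros = parse_macros(macro_block)
print(apply_macros(
    '/InstanceType=@InstanceType@ /OutputFolder="@OutputFolderName@/%PairName%"',
    macros))
```

Note that %PairName% is untouched by this step: macros (@...@) are resolved once for the whole script, while %...% placeholders are resolved separately for each job.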

Command Syntax

   Command syntax: Command python user-script.py %FilePath% %OutputFolder%
       NOTE: an actual command can be used in place of a script file such as user-script.py
       The command tells the (pre-existing) python in the image to run user-script.py with the arguments %FilePath% and %OutputFolder%
       The command is expanded to
           docker run /DockerArgs  -v /input_file_path:/app/_Input_  -v /OutputFolder:/app/_Output_  /ImageName  python user-script.py "/app/_Input_/file" "/app/_Output_"
       Command breakdown
           -v /input_file_path:/app/_Input_ - maps the local input file path to a Docker-internal path, so the input files are accessible from within the container
           -v /OutputFolder:/app/_Output_ - maps the local output folder path to a Docker-internal path, so files written there are available outside the container
           /DockerArgs - any additional docker run options are inserted after docker run
           python user-script.py ... - the user's command is appended after the image name
           file paths and other macros in the command are translated into Docker-internal paths
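
The expansion described above can be sketched in Python for a /Mode=Single run. This is an illustration of the documented pattern, not the code Array Server uses: build_docker_command and its parameter names are hypothetical, and the sketch assumes POSIX-style host paths.

```python
import posixpath
import shlex

INPUT_MOUNT = "/app/_Input_"    # container paths named in the text above
OUTPUT_MOUNT = "/app/_Output_"


def build_docker_command(command, file_path, output_folder, image_name,
                         docker_args="--rm"):
    """Return the argv list of the docker run call for one input file."""
    input_dir, file_name = posixpath.split(file_path)
    # Translate the host placeholders into their in-container equivalents.
    inner = (command
             .replace("%FilePath%",
                      shlex.quote(posixpath.join(INPUT_MOUNT, file_name)))
             .replace("%OutputFolder%", shlex.quote(OUTPUT_MOUNT)))
    return (["docker", "run", *shlex.split(docker_args),
             "-v", f"{input_dir}:{INPUT_MOUNT}",
             "-v", f"{output_folder}:{OUTPUT_MOUNT}",
             image_name]
            + shlex.split(inner))


argv = build_docker_command(
    "python user-script.py %FilePath% %OutputFolder%",
    file_path="/data/in/sample.fastq",
    output_folder="/data/out",
    image_name="myDockerImage:v1",
)
print(" ".join(argv))
```

Because the user's script only ever sees /app/_Input_ and /app/_Output_, the same Command works unchanged whether the host paths are on-premise folders or temporary folders on an EC2 instance.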


Other considerations

   Before the user's command is run, the /app/_Input_, /app/_Output_ and /app/_Resource_ folders are automatically created inside the Docker container
       Docker automatically creates any directory mapped with -v inside the container if it does not exist
   Paths passed to the user's script inside Docker will always be under /app/_Input_ or /app/_Output_
   When running Docker on-premise, the actual input file paths and the OutputFolder path are mapped to /app/_Input_ and /app/_Output_
   When running Docker in the cloud
       temporary paths are created on the virtual machine (EC2), e.g. /opt/temp/_Input_ and /opt/temp/_Output_
       cloud input files are downloaded into /opt/temp/_Input_
       the two temporary paths are mapped to Docker's /app/_Input_ and /app/_Output_
       results written by the script inside Docker to /app/_Output_ are therefore stored in /opt/temp/_Output_
       all files in the temporary path /opt/temp/_Output_ are uploaded to the user-specified OutputFolder, which is a cloud path
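
The cloud sequence above (download, run, upload) can be simulated locally in Python. This is a rough sketch only: local directories stand in for both cloud storage and the EC2 temporary paths, the job callback stands in for the docker run step, and run_staged is a hypothetical name. Only flat files in the output folder are handled.

```python
import shutil
import tempfile
from pathlib import Path


def run_staged(input_files, output_folder, job):
    """Stage inputs, call job(input_dir, output_dir), then publish outputs.

    Returns the names of the published files.
    """
    with tempfile.TemporaryDirectory() as temp:
        staged_in = Path(temp, "_Input_")      # plays the role of /opt/temp/_Input_
        staged_out = Path(temp, "_Output_")    # plays the role of /opt/temp/_Output_
        staged_in.mkdir()
        staged_out.mkdir()
        for src in input_files:                # "download" step
            shutil.copy(src, staged_in)
        job(staged_in, staged_out)             # stands in for the docker run
        dest = Path(output_folder)
        dest.mkdir(parents=True, exist_ok=True)
        published = []
        for result in sorted(staged_out.iterdir()):   # "upload" step
            shutil.copy(result, dest)
            published.append(result.name)
        return published
```

The temporary directory is removed when the function returns, so only files that the job wrote to the staged output folder survive, mirroring the point that results must land in %OutputFolder% to be uploaded.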