External Script Integration

From Array Suite Wiki

Overview

OmicSoft supports execution of third-party commands within OmicScript through the External Script (EScript) syntax, which lets users build pipelines/workflows from public bioinformatics tools. Since most third-party tools are Linux-only, users should run Escripts in Oshell or ArrayServer on a Linux machine.

Feature Highlights

  • Escripts can wrap and run public bioinformatics tools, such as BWA, Bowtie, TopHat, and Cufflinks, in the OmicSoft project environment.
  • With Version 10.2, Escript supports Docker-based tools. See External Scripts with Dockers for details on additional syntax and parameters.
  • Escripts can be pre-configured and managed in OmicSoft Server as pipeline scripts and exposed in the OmicSoft Studio GUI.
  • Escript runs can be submitted to the job queue in OmicSoft Server and run in Grid Engine if the server has been configured (see EnableCluster).
  • Escript jobs are monitored and tracked in OmicSoft Server.

Getting Started

A note of caution

Escripts provide a very powerful way to extend OmicSoft Suite capabilities by calling external tools from an Oscript or through the GUI. However, because bioinformatics tools use a wide variety of syntaxes, dependencies, and resources, the Escript syntax can seem complicated when you get started. It is recommended that you start with an existing Escript and modify it to fit new commands until you become familiar with the patterns.

Escript Instructions

The basic Escript pattern is:

Begin RunEScript;
Files
"/path/to/input/files";
EScriptName ANiceName;
Command MyExternalCommand;
Options /ParallelJobNumber=1 /ThreadNumber=4 /Mode=(Single|Paired|Multiple) (Other Options) /OutputFolder="/Path/To/OutputFolder";
End;

SearchFiles can be used in lieu of Files.

EScriptName just defines the name of the script. It is useful for display/log.

Command

For Command and Output, the following file- and pair-specific macros are defined automatically based on Mode. Generally speaking, the Command you run depends on the Mode: if each input file is single-end, set /Mode=Single and use %FilePath%; if the input files are paired, set /Mode=Paired and use %PairPath% (or %FilePath1% and %FilePath2%) instead.

    • /Mode=Single: %FilePath%, %FileName%, %FileNameNoExt%, %FileDirectory%
    • /Mode=Paired: %FilePath1%, %FileName1%, %FileNameNoExt1%, %FilePath2%, %FileName2%, %FileNameNoExt2%, %PairPath%, %PairName%.
    • FilePath/PairPath specify the full path, while FileName/PairName contain no path but keep the extension.
  • On Windows, all rendered commands are concatenated, written to a .bat file, and executed. On Linux, bash or qsub is called to run the script.
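
As a rough illustration of the /Mode=Single macros listed above, the following Python sketch derives them from one input path and renders a command template. It is not OmicSoft's implementation: the helper names single_macros and render are hypothetical, and it assumes that %FileNameNoExt% drops only the last extension.

```python
from pathlib import PurePosixPath


def single_macros(file_path: str) -> dict:
    """Return the /Mode=Single macros for one input file (illustrative only)."""
    p = PurePosixPath(file_path)
    return {
        "%FilePath%": str(p),              # full path
        "%FileName%": p.name,              # no directory, extension kept
        "%FileNameNoExt%": p.stem,         # assumption: only the last extension is dropped
        "%FileDirectory%": str(p.parent),  # containing directory
    }


def render(command: str, macros: dict) -> str:
    """Substitute every macro occurrence in a Command template."""
    for key, value in macros.items():
        command = command.replace(key, value)
    return command


macros = single_macros("/home/garyge/Test/_Raw/SRR243575.s.1.fastq")
print(render('echo "%FileName%" is stored in "%FileDirectory%"', macros))
```

With /Mode=Paired the same idea would apply to each mate separately (%FilePath1%, %FilePath2%, and so on).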


We support partial failure: if one run fails while another succeeds, the pipeline continues and an error report is provided.
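
The partial-failure behaviour can be pictured with the Python sketch below, which runs several independent commands, keeps going after a failure, and returns an error report. It is only a model under stated assumptions: run_all and its error_on_stderr argument are hypothetical names (the latter mimics the /ErrorOnStdErr option described in the list that follows), and treating a non-zero exit code as a failure is an assumption rather than documented OmicSoft behaviour.

```python
import subprocess
import sys


def run_all(commands, error_on_stderr=False):
    """Run each command; record failures instead of stopping at the first one."""
    failures = {}
    for name, argv in commands.items():
        result = subprocess.run(argv, capture_output=True, text=True)
        crashed = result.returncode != 0  # assumption: non-zero exit means failure
        if error_on_stderr and result.stderr.strip():
            crashed = True  # any stderr output counts as a crash
        if crashed:
            failures[name] = result.stderr.strip() or f"exit code {result.returncode}"
    return failures


# Three stand-in "runs" implemented with the current Python interpreter.
commands = {
    "ok": [sys.executable, "-c", "print('aligned')"],
    "noisy": [sys.executable, "-c", "import sys; print('warning', file=sys.stderr)"],
    "broken": [sys.executable, "-c", "import sys; sys.exit(2)"],
}

print(run_all(commands))                        # stderr treated as a log message
print(run_all(commands, error_on_stderr=True))  # stderr treated as a crash
```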

  • ThreadNumberPerJob has to be defined in both Options and the command, because the script engine has no way of knowing that “-p” defines the thread number per job.
  • ErrorOnStdErr
    • if /ErrorOnStdErr=True, the run is treated as crashed whenever a message is written to stderr (standard error);
    • if /ErrorOnStdErr=False, messages in stderr are treated as regular log messages.
  • If OutputFolder needs to be specified, please use /OutputFolder="$$@OutputFolder@" in the Options section.
  • If one sample has multiple files (e.g. multiple fastq files for one sample), users can use sample registration to register those samples and then use @GroupedFileNames@. The multiple files will be passed together (separated by ",") to the external command, and %GroupName% will be the corresponding sample ID of the multiple input files. Please note that @GroupedFileNames@ and %GroupName% are only available for paired-end read files. A simple example is shown below:
Begin RunEScript;
Files "@GroupedFileNames@";
EScriptName echo;
Command echo "%FilePath%" > "D:\temp.txt";
Command echo "%GroupName%" >> "D:\temp.txt";
End;

Simple Bowtie Escript

Here, I will introduce the Escript using a simple example wrapping Bowtie. The script assumes that:

  • Bowtie is installed and can be found in PATH
  • ebwt indexes are located in /home/garyge/App/bowtie-0.12.9/indexes

The script syntax is like this:

Begin NewProject;
File "/tmp/test.osprj";
Options /Distributed=True;
End; 

Begin RunEScript;
Files
"/home/garyge/Test/_Raw/SRR243575.s.1.fastq
/home/garyge/Test/_Raw/SRR243575.s.2.fastq";
EScriptName Bowtie;
Command mkdir "/tmp/test/alignment";
Command bowtie "/home/garyge/App/bowtie-0.12.9/indexes/hg19" -1 "%FilePath1%" -2 "%FilePath2%" -p 8 -a -m 1 -v 2 -t -S "/tmp/test/alignment/BowtieAlignment.sam";
Options /ParallelJobNumber=1 /ThreadNumberPerJob=8 /Mode=Paired /ErrorOnStdErr=False;
End;

Begin AddMappedDnaSeqReads /Namespace=NgsLib;
Files
"/tmp/test/alignment/BowtieAlignment.sam";
Reference Human.hg19;
Filter ;
Options /FileFormat=SAM /ThreadNumber=4 /NoCopy=True /UseVirtualBams=False;
Output BAMFile;
End;

Begin SaveProject;
File "/tmp/test.osprj";
Options /Distributed=True;
End;

Save this script as SimpleBowtieExample.oscript and run it using mono oshell.exe --runscript:

/[path where mono was installed]/bin/mono /home/garyge/Oshell/oshell.exe --runscript /home/garyge/Omicsoft /tmp/SimpleBowtieExample.oscript /tmp /[path where mono was installed]/bin/mono > /tmp/run.log

Please read Oshell and Running OmicScript Pipeline if you are not familiar with Oshell and OmicScript.
Here is the run log file.

  • Here, mkdir and bowtie run sequentially in the shell. Users can add more shell commands.

Run Bowtie2 via ArrayStudio GUI

The script to be used:

Bowtie2

Begin RunEScript;
Files
"/Users/jeffdu/SRAGSE60052/SRR1797220_1.fastq.gz
/Users/jeffdu/SRAGSE60052/SRR1797220_2.fastq.gz";
EScriptName Bowtie;
Command "/home/jdu/tools/bowtie2/bowtie2-2.4.2-sra-linux-x86_64/bowtie2" -x "/scratch/temp/bowtieindex/hg19" -1 "%FilePath1%" -2 "%FilePath2%" -S "/scratch/temp/BowtieAlignment/testBowtieAlignSRR179.sam";
Options /ParallelJobNumber=1 /ThreadNumberPerJob=8 /Mode=Paired /ErrorOnStdErr=False;
End;

Tophat Escript with Macro and parameter function

In this section, I will extend the EScript with macros and parameter functions.

Tophat Escript

The script syntax is like this:

Begin Macro;
@FileSeparator@ "/";
@ProjectName@ TophatTest;
@ProjectFolder@ "/tmp/EscriptTest";
@OutputFolder@ "/tmp/EscriptTest/TophatTest/output";
@TophatPath@ "/home/garyge/App/tophat-2.0.7/tophat";
@ReferenceIndexFolder@ "/home/garyge/Omicsoft/ReferenceLibrary";
End;

Begin NewProject;
File "@ProjectFolder@@FileSeparator@@ProjectName@.osprj";
Options /Distributed=True;
End;

Begin RunEScript;
Files
"/home/garyge/Test/_Raw/SRR243575.s.1.fastq
/home/garyge/Test/_Raw/SRR243575.s.2.fastq";
EScriptName Tophat;
// run bowtie-build if it is first time using the reference
// Command bowtie-build -f FASTA(Human.hg19) "@ReferenceIndexFolder@@FileSeparator@Human.hg19";
Command mkdir "@OutputFolder@";
Command "@TophatPath@" -p 8 -r 250 --mate-std-dev 50 -G GTF(Human.hg19$RefGene) -o "@OutputFolder@" "@ReferenceIndexFolder@@FileSeparator@Human.hg19" "%FilePath1%" "%FilePath2%";
Options /ParallelJobNumber=1 /ThreadNumberPerJob=8 /Mode=Paired /ErrorOnStdErr=False;
Output "@OutputFolder@@FileSeparator@accepted_hits.bam => @OutputFolder@@FileSeparator@%PairName%.bam" /Type=bam;
Output "@OutputFolder@@FileSeparator@junctions.bed => @OutputFolder@@FileSeparator@%PairName%.junction.bed" /Type=junction_bed;
End;

Begin AddGenomeMappedRnaSeqReads /Namespace=NgsLib;
Files 
"@OutputFiles.bam@";
Reference Human.hg19;
GeneModel RefGene;
Options /FileFormat=BAM /ThreadNumber=8 /NoCopy=True /UseVirtualBams=False;
Output BAMFile;
End;

Begin SaveProject;
File "@ProjectFolder@@FileSeparator@@ProjectName@.osprj";
Options /Distributed=True;
End;

Save this script as TophatEscript.oscript and run it using mono oshell.exe --runscript:

/[path where mono was installed]/bin/mono /home/garyge/Oshell/oshell.exe --runscript /home/garyge/Omicsoft /tmp/TophatEscript.oscript /tmp /[path where mono was installed]/bin/mono > /tmp/run.log


Escript Instruction

  • All parameters can be parameterized with the @parameter@ key syntax:
    • We will call these macros general macros (see Macro), to differentiate them from the file/pair-specific macros such as %PairPath%;
    • Each @parameter@ is substituted with its value throughout the whole script before execution.
  • We introduced a new parameter function syntax, such as FASTA(Human.hg19) and GTF(Human.hg19$RefGene):
    • This is called a parameter function. Our engine evaluates the function and provides the output (usually a file path);
    • Human.hg19 and Human.hg19$RefGene are the parameters (use $ to separate multiple parameters).
  • Always double quote any parameter that may contain a space or “/” after macro rendering (i.e. all file paths). A parameter function does not need to be double quoted unless its parameter contains a space or “/” before function evaluation (the returned value is automatically double quoted if it contains a space). In an Output statement, always double quote the full transformation, not the individual files.
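
The two mechanisms described in this list can be sketched in Python as follows. This is a simplified model, not the OmicSoft engine: the regular expressions and the function names expand_macros and parse_functions are hypothetical, and the sketch only splits a parameter function into its name and $-separated parameters without resolving it to a file path.

```python
import re

MACRO = re.compile(r"@([A-Za-z_][A-Za-z0-9_.]*)@")
FUNCTION = re.compile(r"\b([A-Z]+)\(([^()]*)\)")


def expand_macros(text, macros):
    """Replace each @Name@ with its value; unknown names are left untouched."""
    return MACRO.sub(lambda m: macros.get(m.group(1), m.group(0)), text)


def parse_functions(text):
    """Return (function name, [parameters]) for each NAME(a$b) occurrence."""
    return [(m.group(1), m.group(2).split("$")) for m in FUNCTION.finditer(text)]


# Macro values taken from the Tophat example above.
macros = {
    "FileSeparator": "/",
    "OutputFolder": "/tmp/EscriptTest/TophatTest/output",
    "TophatPath": "/home/garyge/App/tophat-2.0.7/tophat",
    "ReferenceIndexFolder": "/home/garyge/Omicsoft/ReferenceLibrary",
}
command = ('"@TophatPath@" -G GTF(Human.hg19$RefGene) -o "@OutputFolder@" '
           '"@ReferenceIndexFolder@@FileSeparator@Human.hg19"')

rendered = expand_macros(command, macros)
print(rendered)
print(parse_functions(rendered))
```

Because substitution is purely textual, a path that is quoted around the macro in the script stays quoted after rendering, which is why the quoting advice above is given in terms of the rendered command.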

Tip: It is good practice to also quote literal values, especially if they contain special characters, to ensure they are not misinterpreted by OmicSoft, e.g.

Command mkdir "-p" "/tmp/test/alignment" "@WorkDir@";

These quotes will be included in the command submitted to the shell, so you can't simply quote the whole command, e.g.

// This will fail
Command "mkdir -p /tmp/test/alignment @WorkDir@";
// As will this
Command mkdir "-p /tmp/test/alignment @WorkDir@";

In cases where you want to chain or pipe (shell) commands, use "bash -c" like this:

Command bash -c "cd /app/Output/ && tar -czvf hg38.star_ref.tar.gz hg38 --remove-files";

(This approach is also an easy way to perform shell commands in general instead of having to quote specific special characters.)
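
To see why the placement of quotes matters, the Python sketch below uses the standard shlex module, which splits a line into words using POSIX shell quoting rules. It only approximates what bash does with the rendered command (no expansion or redirection is modelled), and the example lines are adapted from the ones above with the @WorkDir@ macro left out.

```python
import shlex

# Each argument quoted on its own: the shell sees three separate words.
print(shlex.split('mkdir "-p" "/tmp/test/alignment"'))
# ['mkdir', '-p', '/tmp/test/alignment']

# Whole command quoted: the shell looks for a program with this entire name.
print(shlex.split('"mkdir -p /tmp/test/alignment"'))
# ['mkdir -p /tmp/test/alignment']

# Option and path quoted together: mkdir receives one odd argument.
print(shlex.split('mkdir "-p /tmp/test/alignment"'))
# ['mkdir', '-p /tmp/test/alignment']

# bash -c receives the chained commands as a single script string.
print(shlex.split('bash -c "cd /app/Output/ && tar -czvf hg38.star_ref.tar.gz hg38"'))
# ['bash', '-c', 'cd /app/Output/ && tar -czvf hg38.star_ref.tar.gz hg38']
```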

  • If it runs successfully on at least one file/pair, RunEScript automatically defines the macro @OutputFiles.type@, which can be used by downstream processes. Its value comes entirely from the Output transformation:
    • @OutputFiles.bam@ represents the file path to the .bam file;
    • @OutputFiles.junction_bed@ represents the file path to the junction.bed file;
    • Users can define their own /Type=mytype and refer to it as @OutputFiles.mytype@.
  • If OutputFolder needs to be specified, please use /OutputFolder="$$@OutputFolder@" in the Options section.
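
One way to picture how the Output transformations feed @OutputFiles.type@ is the Python sketch below. It is an interpretation, not the engine's code: parse_output and collect_outputs are hypothetical helpers, the pair name SRR243575 is an assumed value for %PairName%, and the sketch only computes the target paths without renaming any files.

```python
def parse_output(statement, pair_name):
    """Split 'source => target' and fill in %PairName% in both halves."""
    source, target = (part.strip().replace("%PairName%", pair_name)
                      for part in statement.split("=>"))
    return source, target


def collect_outputs(outputs, pair_name):
    """Group target paths by /Type, mimicking @OutputFiles.<type>@."""
    result = {}
    for statement, file_type in outputs:
        _, target = parse_output(statement, pair_name)
        result.setdefault(f"@OutputFiles.{file_type}@", []).append(target)
    return result


folder = "/tmp/EscriptTest/TophatTest/output"
outputs = [
    (f"{folder}/accepted_hits.bam => {folder}/%PairName%.bam", "bam"),
    (f"{folder}/junctions.bed => {folder}/%PairName%.junction.bed", "junction_bed"),
]

# "SRR243575" is an assumed pair name, used for illustration only.
for macro, paths in collect_outputs(outputs, "SRR243575").items():
    print(macro, "->", paths)
```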

EScript with conditional command

We can define any condition for commands, as long as the condition also appears in the Options statement. These conditions make the external executable run with different parameter settings in different scenarios.

  • Define the conditions (e.g. single mode or paired mode) as a macro
  • Make sure to include the macro in the Options statement (e.g. /Mode=@AnalysisMode@)
  • Set the condition for each command (e.g. Command bowtie /Mode=Paired)

Taking the Tophat Escript as an example, we can define two conditional commands, with and without GTF input:

Begin RunEScript;
Files
"/home/garyge/Test/_Raw/SRR243575.s.1.fastq
/home/garyge/Test/_Raw/SRR243575.s.2.fastq";
EScriptName Tophat;
Command "@TophatPath@" -p 8 -r 250 --mate-std-dev 50 -G GTF(Human.hg19$RefGene) -o "@OutputFolder@" "@ReferenceIndexFolder@@FileSeparator@Human.hg19" "%FilePath1%" "%FilePath2%" /Mode=Paired /UseGTF=True;
Command "@TophatPath@" -p 8 -r 250 --mate-std-dev 50 -o "@OutputFolder@" "@ReferenceIndexFolder@@FileSeparator@Human.hg19" "%FilePath1%" "%FilePath2%" /Mode=Paired  /UseGTF=False;
Options /ParallelJobNumber=1 /ThreadNumberPerJob=8 /Mode=@AnalysisMode@ /ErrorOnStdErr=False /UseGTF=@UseGTFFile@;
End;


Users can control the workflow with a Macro block in the Oscript or through the GUI:

Begin Macro;
@UseGTFFile@ True;
@AnalysisMode@ Paired;
End;
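
The selection logic implied by conditional commands can be modelled with the Python sketch below: a command is kept only if every trailing /Key=Value condition matches the value given in Options. This is an assumption about the behaviour, not a description of the actual engine; split_conditions and select_commands are hypothetical names, the shortened tophat commands are placeholders, and matching is assumed to be case-sensitive.

```python
import re


def split_conditions(command):
    """Separate trailing /Key=Value conditions from the command text.

    Note: re-joining the tokens collapses repeated whitespace, which is
    acceptable for this sketch but would not be for real commands.
    """
    conditions = {}
    tokens = command.split()
    while tokens and re.fullmatch(r"/\w+=\w+", tokens[-1]):
        key, value = tokens.pop()[1:].split("=")
        conditions[key] = value
    return " ".join(tokens), conditions


def select_commands(commands, options):
    """Keep commands whose every condition matches the Options values."""
    selected = []
    for command in commands:
        text, conditions = split_conditions(command)
        if all(options.get(key) == value for key, value in conditions.items()):
            selected.append(text)
    return selected


commands = [
    "tophat -G genes.gtf reads_1.fq reads_2.fq /Mode=Paired /UseGTF=True",
    "tophat reads_1.fq reads_2.fq /Mode=Paired /UseGTF=False",
]
options = {"Mode": "Paired", "UseGTF": "True"}  # values after macro rendering

print(select_commands(commands, options))
```

Changing the UseGTF value in options to "False" selects the second command instead, which mirrors switching @UseGTFFile@ in the Macro block.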

Expose the OmicScript in GUI

An Escript can be embedded in a pipeline OmicScript and exposed in the ArrayStudio GUI; details can be found here.

In the <Input> section, add lines similar to the following, customized to your script:

ExternalScriptInputType=Files
ExternalScriptMenuText=BWA-MEM alignment
ExternalScriptMenuStructure=NGS\DNA-Seq\Alignment
ExternalScriptFileFilter=FASTQ files|*.fastq|.gz|*.gz

TopHat alignment example

[Image: 20130516 EscriptExample.png]

Cufflinks example

[Image: Pipeline OmicScript using Cufflinks exposed in ArrayStudio]

Cluster Support

The user may need to add environment variables to the special file ~/.sge_request using the -v option, for example:

-v PATH=/path/to/bowtie
-v BOWTIE2_INDEXES=/path/to/bowtie2indexes

OmicScript

Begin RunEScript;
Files "YourFiles";
EScriptName NameOScriptForTaggingPurposes;
Command touch "/path/tofile";
Options /ParallelJobNumber=99 /ThreadNumberPerJob=8 /Mode=Paired /ErrorOnStdErr=False /UseCluster=True /ClusterCustomOption= /ErrorOnMissingOutput=False /Group="" /OutputFolder="";
End;