External Script Integration

From Array Suite Wiki

Overview

Omicsoft has implemented external script (EScript) integration for building pipelines and workflows from public bioinformatics tools. Since most third-party tools are Linux-only, users should run EScript in Oshell or ArrayServer on a Linux machine.

Feature Highlights

  • EScript can wrap and run public bioinformatics tools, such as BWA, Bowtie, TopHat, and Cufflinks, in the OmicSoft project environment;
  • EScript can be pre-configured and managed in ArrayServer as pipeline scripts and exposed in the ArrayStudio GUI;
  • EScript runs can be submitted to the job queue in ArrayServer and run on Grid Engine if the server has been configured for it (see EnableCluster);
  • EScript jobs are monitored and tracked in ArrayServer.

Get Started

Simple bowtie Escript

This section introduces EScript with a simple example that wraps Bowtie. The script assumes that:

  • Bowtie is installed and can be found in PATH;
  • the ebwt indexes are located in /home/garyge/App/bowtie-0.12.9/indexes.

The script syntax is like this:

Begin NewProject;
File "/tmp/test.osprj";
Options /Distributed=True;
End; 

Begin RunEScript;
Files
"/home/garyge/Test/_Raw/SRR243575.s.1.fastq
/home/garyge/Test/_Raw/SRR243575.s.2.fastq";
EScriptName Bowtie;
Command mkdir "/tmp/test/alignment";
Command bowtie "/home/garyge/App/bowtie-0.12.9/indexes/hg19" -1 "%FilePath1%" -2 "%FilePath2%" -p 8 -a -m 1 -v 2 -t -S "/tmp/test/alignment/BowtieAlignment.sam";
Options /ParallelJobNumber=1 /ThreadNumberPerJob=8 /Mode=Paired /ErrorOnStdErr=False;
End;

Begin AddMappedDnaSeqReads /Namespace=NgsLib;
Files
"/tmp/test/alignment/BowtieAlignment.sam";
Reference Human.hg19;
Filter ;
Options /FileFormat=SAM /ThreadNumber=4 /NoCopy=True /UseVirtualBams=False;
Output BAMFile;
End;

Begin SaveProject;
File "/tmp/test.osprj";
Options /Distributed=True;
End;

Save this script as SimpleBowtieExample.oscript and run it using mono oshell.exe --runscript:

/opt/mono-2.10.9/bin/mono /home/garyge/Oshell/oshell.exe --runscript /home/garyge/Omicsoft /tmp/SimpleBowtieExample.oscript /tmp /opt/mono-2.10.9/bin/mono > /tmp/run.log

Please read Oshell and Running OmicScript Pipeline if you are not familiar with Oshell and OmicScript.
The run log is written to /tmp/run.log.

Escript Instructions

  • EScriptName only defines the name of the script. It is used for display and logging.
  • In Command and Output statements, the following file- or pair-specific macros are defined automatically, depending on Mode:
    • /Mode=Single: %FilePath%, %FileName%, %FileNameNoExt%, %FileDirectory%. Note that the file path is the full path, and the file name has no path but keeps its extension.
    • /Mode=Paired: %FilePath1%, %FileName1%, %FileNameNoExt1%, %FilePath2%, %FileName2%, %FileNameNoExt2%, %PairPath%, %PairName%.
  • On Windows, all rendered commands are concatenated into a .bat file, which is then executed. On Linux, bash or qsub is called to run the script.
  • Here, mkdir and bowtie run sequentially in the shell. Users can add more shell commands.
  • Partial failure is supported: if one run fails and another succeeds, the pipeline continues and provides an error report.
  • The thread number per job has to be set both in Options (/ThreadNumberPerJob) and in the command itself, because the script engine does not know that “-p” sets the number of threads per job.
  • ErrorOnStdErr
    • if /ErrorOnStdErr=True, any message written to stderr (standard error) is treated as a crash of the run;
    • if /ErrorOnStdErr=False, messages written to stderr are treated as regular log messages.
  • If OutputFolder needs to be specified, use /OutputFolder="$$@OutputFolder@" in the Options section.
  • If one sample has multiple files (e.g. multiple fastq files for one sample), users can use sample registration to register those samples and then use @GroupedFileNames@. The files are passed together (separated by ",") to the external command, and %GroupName% is the sample ID corresponding to those input files. Note that @GroupedFileNames@ and %GroupName% are only available for paired-end read files. A simple example is shown below:

Begin RunEScript;
Files "@GroupedFileNames@";
EScriptName echo;
Command echo "%FilePath%" > "D:\temp.txt";
Command echo "%GroupName%" >> "D:\temp.txt";
End;
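
The rules listed above can be pictured with a small Python sketch. It is not OmicSoft's implementation: the function names are hypothetical, only the /Mode=Single macros are rendered, the extension is assumed to be the last dot-suffix of the file name, and a non-zero exit code is assumed to count as a failure. It illustrates per-file macro rendering, sequential commands, the ErrorOnStdErr rule, and partial failure.

```python
import posixpath
import subprocess


def single_macros(path):
    """File-specific macros for /Mode=Single (assumed derivation from the path)."""
    name = posixpath.basename(path)
    return {
        "%FilePath%": path,                                # full path
        "%FileName%": name,                                # no path, with extension
        "%FileNameNoExt%": posixpath.splitext(name)[0],    # last extension removed
        "%FileDirectory%": posixpath.dirname(path),
    }


def render(template, macros):
    """Substitute every macro that occurs in a command template."""
    for key, value in macros.items():
        template = template.replace(key, value)
    return template


def run_for_files(paths, command_templates, error_on_stderr=False):
    """Run the rendered commands in order for each file.

    Returns {path: error message} for the files that failed. The other files
    are still processed, which mirrors the partial-failure behavior.
    """
    failures = {}
    for path in paths:
        macros = single_macros(path)
        for template in command_templates:
            result = subprocess.run(
                render(template, macros),
                shell=True, capture_output=True, text=True,
            )
            stderr = result.stderr.strip()
            # Assumption: a non-zero exit code always fails the run; output on
            # stderr fails it only when error_on_stderr is set.
            if result.returncode != 0 or (error_on_stderr and stderr):
                failures[path] = stderr or f"exit code {result.returncode}"
                break  # skip the remaining commands for this file only
    return failures
```

For example, run_for_files(["/data/a.fastq", "/data/b.fastq"], ['echo "%FileName%" >&2'], error_on_stderr=True) reports both (hypothetical) files as failed, while the same call with error_on_stderr=False reports none.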

Tophat Escript with Macro and parameter function

This section extends the EScript with macros and parameter functions.

Tophat Escript

The script syntax is like this:

Begin Macro;
@FileSeparator@ "/";
@ProjectName@ TophatTest;
@ProjectFolder@ "/tmp/EscriptTest";
@OutputFolder@ "/tmp/EscriptTest/TophatTest/output";
@TophatPath@ "/home/garyge/App/tophat-2.0.7/tophat";
@ReferenceIndexFolder@ "/home/garyge/Omicsoft/ReferenceLibrary";
End;

Begin NewProject;
File "@ProjectFolder@@FileSeparator@@ProjectName@.osprj";
Options /Distributed=True;
End;

Begin RunEScript;
Files
"/home/garyge/Test/_Raw/SRR243575.s.1.fastq
/home/garyge/Test/_Raw/SRR243575.s.2.fastq";
EScriptName Tophat;
// run bowtie-build if it is first time using the reference
// Command bowtie-build -f FASTA(Human.hg19) "@ReferenceIndexFolder@@FileSeparator@Human.hg19";
Command mkdir "@OutputFolder@";
Command "@TophatPath@" -p 8 -r 250 --mate-std-dev 50 -G GTF(Human.hg19$RefGene) -o "@OutputFolder@" "@ReferenceIndexFolder@@FileSeparator@Human.hg19" "%FilePath1%" "%FilePath2%";
Options /ParallelJobNumber=1 /ThreadNumberPerJob=8 /Mode=Paired /ErrorOnStdErr=False;
Output "@OutputFolder@@FileSeparator@accepted_hits.bam => @OutputFolder@@FileSeparator@%PairName%.bam" /Type=bam;
Output "@OutputFolder@@FileSeparator@junctions.bed => @OutputFolder@@FileSeparator@%PairName%.junction.bed" /Type=junction_bed;
End;

Begin AddGenomeMappedRnaSeqReads /Namespace=NgsLib;
Files 
"@OutputFiles.bam@";
Reference Human.hg19;
GeneModel RefGene;
Options /FileFormat=BAM /ThreadNumber=8 /NoCopy=True /UseVirtualBams=False;
Output BAMFile;
End;

Begin SaveProject;
File "@ProjectFolder@@FileSeparator@@ProjectName@.osprj";
Options /Distributed=True;
End;

Save this script as TophatEscript.oscript and run it using mono oshell.exe --runscript:

/opt/mono-2.10.9/bin/mono /home/garyge/Oshell/oshell.exe --runscript /home/garyge/Omicsoft /tmp/TophatEscript.oscript /tmp /opt/mono-2.10.9/bin/mono > /tmp/run.log


Escript Instructions

  • Any value in the script can be parameterized with the @parameter@ key syntax:
    • These are called general macros (see Macro), to distinguish them from file- or pair-specific macros such as %PairPath%;
    • Each @parameter@ key is replaced with its value throughout the script.
  • A parameter function syntax is also available, e.g. FASTA(Human.hg19) and GTF(Human.hg19$RefGene):
    • The engine evaluates the function and substitutes its output (usually a file path);
    • Human.hg19 and Human.hg19$RefGene are the parameters (use $ to separate multiple parameters).
  • Always double quote parameters that may contain a space or “/” after macro rendering (i.e. all file paths). A parameter function does not need to be double quoted unless the parameter itself contains a space or “/” before function evaluation (the returned value is double quoted automatically if it contains a space). In an Output statement, always double quote the full transformation, not the individual files;
  • If the run succeeds on at least one file or pair, RunEScript automatically defines the macro @OutputFiles.type@, which can be used by downstream steps. Its value comes entirely from the Output transformation:
    • @OutputFiles.bam@ represents the file path to the .bam file;
    • @OutputFiles.junction_bed@ represents the file path to the junction.bed file;
    • Users can define their own /Type=mytype and refer to it as @OutputFiles.mytype@.
  • If OutputFolder needs to be specified, use /OutputFolder="$$@OutputFolder@" in the Options section.
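
As a rough illustration of how general macros and the Output transformation fit together, the following Python sketch expands @key@ macros and then derives the values behind @OutputFiles.type@. It is a model under stated assumptions, not the engine's code: the function names are hypothetical, the pair name passed in is a made-up value (the page does not say how %PairName% is derived), and representing @OutputFiles.type@ as a list with one entry per pair is also an assumption.

```python
import re


def expand_macros(text, macros):
    """Replace each @key@ that has a definition; unknown keys stay as written."""
    return re.sub(
        r"@([\w.]+)@",
        lambda m: macros.get(m.group(1), m.group(0)),
        text,
    )


def parse_output(statement):
    """Split 'Output "source => target" /Type=name;' into (source, target, name)."""
    match = re.match(
        r'\s*Output\s+"(.+?)\s*=>\s*(.+?)"\s+/Type=(\w+)\s*;', statement
    )
    if match is None:
        raise ValueError(f"not an Output statement: {statement!r}")
    return match.groups()


def collect_output_files(statements, macros, pair_names):
    """Build {'OutputFiles.<type>': [target for each pair]} from Output statements."""
    outputs = {}
    for statement in statements:
        _source, target, type_name = parse_output(expand_macros(statement, macros))
        outputs[f"OutputFiles.{type_name}"] = [
            target.replace("%PairName%", name) for name in pair_names
        ]
    return outputs
```

With the macros from the script above and a hypothetical pair name "SRR243575.s", the first Output statement would yield a single @OutputFiles.bam@ entry, /tmp/EscriptTest/TophatTest/output/SRR243575.s.bam.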

EScript with conditional command

Commands can be made conditional on any setting, as long as that setting also appears in the Options statement. Such conditions let the external executable run with different parameters in different scenarios.

  • Define the condition (e.g. single mode or paired mode) as a macro;
  • Make sure to include the macro in the Options statement (e.g. /Mode=@AnalysisMode@);
  • Set the condition for each command (e.g. Command bowtie /Mode=Paired).

Taking the TopHat EScript as an example, we can define two conditional commands, one with and one without GTF input:

Begin RunEScript;
Files
"/home/garyge/Test/_Raw/SRR243575.s.1.fastq
/home/garyge/Test/_Raw/SRR243575.s.2.fastq";
EScriptName Tophat;
Command "@TophatPath@" -p 8 -r 250 --mate-std-dev 50 -G GTF(Human.hg19$RefGene) -o "@OutputFolder@" "@ReferenceIndexFolder@@FileSeparator@Human.hg19" "%FilePath1%" "%FilePath2%" /Mode=Paired /UseGTF=True;
Command "@TophatPath@" -p 8 -r 250 --mate-std-dev 50 -o "@OutputFolder@" "@ReferenceIndexFolder@@FileSeparator@Human.hg19" "%FilePath1%" "%FilePath2%" /Mode=Paired  /UseGTF=False;
Options /ParallelJobNumber=1 /ThreadNumberPerJob=8 /Mode=@AnalysisMode@ /ErrorOnStdErr=False /UseGTF=@UseGTFFile@;
End;


Users can control the workflow through Macro in the OmicScript or the GUI:

Begin Macro;
@UseGTFFile@ True;
@AnalysisMode@ Paired;
End;
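
One way to picture the selection of conditional commands is the Python sketch below. It assumes that conditions are the trailing /Key=Value tokens of a Command line and that a command runs only when every one of its conditions equals the corresponding Options value; both the helper names and this matching rule are illustrative assumptions, not the documented behavior of the engine.

```python
import re

# A trailing " /Key=Value" token at the end of a command line.
_CONDITION = re.compile(r"\s+/(\w+)=(\w+)\s*$")


def split_conditions(command):
    """Separate trailing /Key=Value conditions from the command text."""
    conditions = {}
    while True:
        match = _CONDITION.search(command)
        if match is None:
            break
        conditions[match.group(1)] = match.group(2)
        command = command[:match.start()]
    return command, conditions


def select_commands(commands, options):
    """Keep the commands whose conditions all match the Options values.

    Commands without conditions are always kept.
    """
    selected = []
    for command in commands:
        text, conditions = split_conditions(command)
        if all(options.get(key) == value for key, value in conditions.items()):
            selected.append(text)
    return selected
```

With options {"Mode": "Paired", "UseGTF": "True"}, which is what the Macro block above renders to, only the first of the two TopHat commands would be kept under these assumptions.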

Expose the OmicScript in GUI

EScript can be embedded in a pipeline OmicScript and exposed in the ArrayStudio GUI.

In the <Input> section, add lines similar to the following, customized to your script:

ExternalScriptInputType=Files
ExternalScriptMenuText=BWA-MEM alignment
ExternalScriptMenuStructure=NGS\DNA-Seq\Alignment
ExternalScriptFileFilter=FASTQ files|*.fastq|.gz|*.gz
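
The Python sketch below reads such settings into a dictionary and splits the file filter into pairs. It assumes that the lines are plain key=value pairs and that ExternalScriptFileFilter alternates label and pattern separated by "|", as in Windows file dialog filters; this page does not confirm that interpretation, and the function names are hypothetical.

```python
def parse_input_settings(lines):
    """Read key=value lines into a dict, ignoring blank or malformed lines."""
    settings = {}
    for line in lines:
        line = line.strip()
        if not line or "=" not in line:
            continue
        key, value = line.split("=", 1)
        settings[key] = value
    return settings


def split_file_filter(value):
    """Split 'label|pattern|label|pattern' into (label, pattern) pairs."""
    parts = value.split("|")
    if len(parts) % 2 != 0:
        raise ValueError("file filter must alternate label and pattern")
    return list(zip(parts[0::2], parts[1::2]))
```

Under that reading, the filter above offers two choices in the file dialog: "FASTQ files" matching *.fastq, and ".gz" matching *.gz.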

TopHat alignment example

(Screenshot: 20130516 EscriptExample.png)

Cufflinks example

Pipeline OmicScript using Cufflinks exposed in ArrayStudio

Cluster Support

Users may need to add environment variables to the special file ~/.sge_request using the -v option, for example:

-v PATH=/path/to/bowtie
-v BOWTIE2_INDEXES=/path/to/bowtie2indexes

OmicScript

Begin RunEScript;
Files "YourFiles";
EScriptName NameOScriptForTaggingPurposes;
Command touch "/path/tofile";
Options /ParallelJobNumber=99 /ThreadNumberPerJob=8 /Mode=Paired /ErrorOnStdErr=False /UseCluster=True /ClusterCustomOption= /ErrorOnMissingOutput=False;
End;