Kallisto on EScript

From Array Suite Wiki

Revision as of 10:41, 9 November 2020 by JeffDu (Talk | contribs)

Goal

Enable customers to run Kallisto on multiple FASTQ samples in parallel, on a server, an HPC cluster, or AWS Cloud, resulting in a pair of OmicData objects in an OmicSoft Suite project.

Kallisto Workflow via escript

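A complete Kallisto workflow consists of the following steps: building the transcriptome index, quantifying the RNA-seq reads, merging the Kallisto output, and importing the merged table into ArrayStudio.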

Build transcriptome index

Kallisto quantifies reads against a transcriptome index. An EScript procedure is used to build that index from a transcript FASTA file.

Kallisto Index
Begin RunEScript /RunOnServer=True;

Files "/VirtualCloudFolder/ArrayServer/Input/Transcripts/transcripts.fasta.gz";
EScriptName KallistoIndex;
Command kallisto index -i %OutputFolder%generated_transcripts.idx %FilePath%;
Options /ParallelJobNumber=1 /ThreadNumberPerJob=8 /Mode=Single /ErrorOnStdErr=False /ErrorOnMissingOutput=True /RunOnDocker=True /ImageName="omicdocker/kallisto:testing" /UseCloud=True /UseDev3=True /OutputFolder="/VirtualCloudFolder/Output/Transcripts";
End;


To run the above EScript, first download the transcript FASTA file (not FASTQ reads) for the gene model you want to use. For example, I used Gencode.V24, downloaded from ftp://ftp.ebi.ac.uk/pub/databases/gencode/Gencode_human/release_24/gencode.v24.transcripts.fa.gz

The output *.idx file will appear in the output folder when the job finishes.

[Screenshot: Build index kallisto.png]

RNA-seq quantification

For RNA-seq quantification, create a new external script that uses the transcriptome index built in the previous step, as in the command below.

Kallisto Quantification
Begin RunEScript /RunOnServer=True;

Resources
"/VirtualCloudFolder/Output/Transcripts/generated_transcripts.idx";
Files
"/VirtualCloudFolder/ArrayServer/Input/Fastqs/SRR521461_1.fastq.gz"
"/VirtualCloudFolder/ArrayServer/Input/Fastqs/SRR521461_2.fastq.gz"
"/VirtualCloudFolder/ArrayServer/Input/Fastqs/SRR521462_1.fastq.gz"
"/VirtualCloudFolder/ArrayServer/Input/Fastqs/SRR521462_2.fastq.gz";
EScriptName KallistoQuant;
Command kallisto quant -i %Resource1% -o "%OutputFolder%" -b 100 %FilePath1% %FilePath2%;
Options /ParallelJobNumber=2 /ThreadNumberPerJob=8 /Mode=Paired /ErrorOnStdErr=False /ErrorOnMissingOutput=True /RunOnDocker=True /ImageName="omicdocker/kallisto:testing" /UseCloud=True /UseDev3=True /OutputFolder="/VirtualCloudFolder/Output/Abundances/";
Output "/VirtualCloudFolder/Output/Abundances/abundance.tsv => /VirtualCloudFolder/Output/Abundances/%PairName%_abundance.tsv" /Type=tsv;
End;
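
With /Mode=Paired and /ParallelJobNumber=2, the four FASTQ files listed above are expected to run as two jobs, one per read pair. The following Python sketch illustrates that grouping outside of EScript; it is not OmicSoft's implementation. The `_1`/`_2` suffixes and the `kallisto quant` arguments come from the script above, while the helper names `pair_fastqs` and `quant_command`, and the rule that the pair name is the file name without its mate suffix, are assumptions made for illustration.

```python
import re
from collections import defaultdict


def pair_fastqs(paths):
    """Group paired-end FASTQ paths by sample name (file name minus _1/_2)."""
    # Assumption: mates are named <sample>_1.fastq.gz and <sample>_2.fastq.gz.
    pattern = re.compile(r"(?P<name>[^/]+)_(?P<mate>[12])\.fastq\.gz$")
    pairs = defaultdict(dict)
    for path in paths:
        match = pattern.search(path)
        if match is None:
            raise ValueError(f"not a paired FASTQ name: {path}")
        pairs[match["name"]][match["mate"]] = path
    for name, mates in pairs.items():
        if set(mates) != {"1", "2"}:
            raise ValueError(f"incomplete pair for {name}")
    # Return samples in sorted order, each mapped to (mate 1, mate 2).
    return {name: (m["1"], m["2"]) for name, m in sorted(pairs.items())}


def quant_command(index, out_dir, mate1, mate2, bootstraps=100):
    """Build the argument list for one `kallisto quant` job."""
    return ["kallisto", "quant", "-i", index, "-o", out_dir,
            "-b", str(bootstraps), mate1, mate2]


files = [
    "/VirtualCloudFolder/ArrayServer/Input/Fastqs/SRR521461_1.fastq.gz",
    "/VirtualCloudFolder/ArrayServer/Input/Fastqs/SRR521461_2.fastq.gz",
    "/VirtualCloudFolder/ArrayServer/Input/Fastqs/SRR521462_1.fastq.gz",
    "/VirtualCloudFolder/ArrayServer/Input/Fastqs/SRR521462_2.fastq.gz",
]
index = "/VirtualCloudFolder/Output/Transcripts/generated_transcripts.idx"

for name, (mate1, mate2) in pair_fastqs(files).items():
    # One command per pair, mirroring the two parallel jobs.
    print(name, " ".join(quant_command(index, "/VirtualCloudFolder/Output/Abundances/", mate1, mate2)))
```

Each pair writes its own abundance.tsv, which is why the Output line in the script renames it with %PairName% to avoid one sample overwriting another.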

Merge Kallisto output

The per-sample Kallisto output can be merged into a single table for import in the ArrayStudio Analyses tab. A script supporting this step is included in the following Docker image: omicdocker/pandas:latest. It can be added to the Kallisto pipeline using the following syntax:

Kallisto Merge
Begin RunEScript /RunOnServer=True;

Files
"/VirtualCloudFolder/ArrayServer/Output/Abundances/SRR521461_abundance.tsv"
"/VirtualCloudFolder/ArrayServer/Output/Abundances/SRR521462_abundance.tsv";
EScriptName KallistoMerge;
Command python Anisto.py -i %FileDirectory% -o %OutputFolder%;
Options /ParallelJobNumber=1 /ThreadNumberPerJob=8 /Mode=Multiple /ErrorOnStdErr=False /ErrorOnMissingOutput=True /RunOnDocker=True /ImageName="omicdocker/pandas:latest" /UseCloud=True /UseDev3=True /OutputFolder="/VirtualCloudFolder/Output/Results";
End;
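
The merge script itself is not shown on this page. As a rough illustration of what such a merge could look like, the Python sketch below stacks the per-sample abundance files into one long table with an added SampleID column. The long layout is an assumption inferred from the import step (RowID target_id, SplitBy SampleID, Value est_counts); the function name `merge_abundances` and deriving the sample ID from the `_abundance.tsv` file name are hypothetical and may not match the script in the Docker image.

```python
import csv
import os


def merge_abundances(paths, out_path, suffix="_abundance.tsv"):
    """Stack per-sample abundance tables into one tab-delimited long table."""
    with open(out_path, "w", newline="") as out:
        writer = None
        for path in paths:
            # Assumption: the sample ID is the file name without the suffix.
            sample_id = os.path.basename(path)
            if sample_id.endswith(suffix):
                sample_id = sample_id[: -len(suffix)]
            with open(path, newline="") as handle:
                reader = csv.DictReader(handle, delimiter="\t")
                if writer is None:
                    # The first file defines the columns; SampleID is prepended.
                    fields = ["SampleID"] + list(reader.fieldnames)
                    writer = csv.DictWriter(out, fieldnames=fields,
                                            delimiter="\t", lineterminator="\n")
                    writer.writeheader()
                for row in reader:
                    row["SampleID"] = sample_id
                    writer.writerow(row)
```

Calling `merge_abundances(["SRR521461_abundance.tsv", "SRR521462_abundance.tsv"], "merged.txt")` would produce one row per transcript per sample, which is the shape the import step then splits by SampleID.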

Import in ArrayStudio

The merged table can be imported interactively in ArrayStudio or with a script:

Import
Begin UnstackTableFile /Namespace=Table /RunOnServer=True /RunOnCloud=False /RunOnGenomicCloud=False;

File "@OutputFolderName@/merged.txt";
RowID target_id;
SplitBy SampleID;
Value est_counts;
Covariate ;
Options /Format=TabDelimited;
Output mergedCounts;
End;
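
To make the RowID, SplitBy, and Value settings concrete, the following Python sketch shows the kind of long-to-wide pivot they describe: one row per target_id, one column per SampleID, filled with est_counts. It is only an illustration of the idea, not the UnstackTableFile implementation; the function name `unstack` and the in-memory row format are assumptions.

```python
def unstack(rows, row_id="target_id", split_by="SampleID", value="est_counts"):
    """Pivot long-format rows (dicts) into a wide table keyed by row_id."""
    samples = []  # column order: samples in order of first appearance
    table = {}    # {row_id value: {sample: numeric value}}
    for row in rows:
        sample = row[split_by]
        if sample not in samples:
            samples.append(sample)
        table.setdefault(row[row_id], {})[sample] = float(row[value])
    return samples, table


# Hypothetical rows, as they might appear in a merged long table.
rows = [
    {"SampleID": "SRR521461", "target_id": "T1", "est_counts": "5"},
    {"SampleID": "SRR521461", "target_id": "T2", "est_counts": "0"},
    {"SampleID": "SRR521462", "target_id": "T1", "est_counts": "7"},
    {"SampleID": "SRR521462", "target_id": "T2", "est_counts": "3"},
]

samples, table = unstack(rows)
print("\t".join(["target_id"] + samples))
for target, counts in table.items():
    # Missing sample/target combinations are printed as empty cells.
    print("\t".join([target] + [str(counts.get(s, "")) for s in samples]))
```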

[Screenshot: KallistoOnStudio.JPG]