Kallisto Merge Python Script

From Array Suite Wiki

Kallisto outputs one "abundance.tsv" file per sample, which is useful for downstream analyses with Kallisto/Sleuth, but can be inconvenient when you are used to working with expression matrices.

To make Kallisto output easier to work with, we wrote a simple Python script that merges all "abundance.tsv" files in a directory into a pair of expression matrices (counts and TPM).

This Python script is available as a Docker image (the image name is given under Command syntax).

The full script can be found below.

Calling Anisto.py as a Dockerized EScript

Anisto.py will merge all abundance.tsv files into two merged abundance files (Counts and TPM).
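The merge described above can be sketched in a few lines of Python. This is an illustration only, not the code of Anisto.py: the function names are hypothetical, the sketch uses only the standard library, and it assumes each input file is named like SRR521461_abundance.tsv, so that the sample name is the part before "_abundance.tsv". It relies on the standard Kallisto abundance.tsv columns (target_id, length, eff_length, est_counts, tpm).

```python
import csv
import glob
import os


def read_column(path, column):
    """Return {target_id: float value} for one Kallisto abundance.tsv file."""
    with open(path, newline="") as handle:
        reader = csv.DictReader(handle, delimiter="\t")
        return {row["target_id"]: float(row[column]) for row in reader}


def merge_abundances(input_dir, column):
    """Merge one column ("est_counts" or "tpm") across all *abundance.tsv files.

    Returns (sample_names, rows), where each row is [target_id, value, ...].
    """
    paths = sorted(glob.glob(os.path.join(input_dir, "*abundance.tsv")))
    samples, per_sample = [], []
    for path in paths:
        # Assumed naming: "SRR521461_abundance.tsv" -> sample "SRR521461".
        name = os.path.basename(path)[: -len("abundance.tsv")].rstrip("_")
        samples.append(name)
        per_sample.append(read_column(path, column))
    # Keep the transcript order of the first file; fill any gaps with 0.0.
    targets = list(per_sample[0]) if per_sample else []
    rows = [[t] + [s.get(t, 0.0) for s in per_sample] for t in targets]
    return samples, rows


def write_matrix(path, samples, rows):
    """Write a tab-delimited matrix with a target_id header row."""
    with open(path, "w", newline="") as handle:
        writer = csv.writer(handle, delimiter="\t", lineterminator="\n")
        writer.writerow(["target_id"] + samples)
        writer.writerows(rows)
```

Calling merge_abundances once with "est_counts" and once with "tpm", and passing each result to write_matrix, would produce the two matrices.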

Input files

Command syntax

The script syntax is

python3 Anisto.py -i (InputDirectory) -o (OutputDirectory) -p (prefix)

The prefix parameter is optional. If given, it is prepended to the output file names, as in prefix_result.count and prefix_result.tpm. If it is not specified, the date is prepended instead.
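One way to handle these three options is sketched below with argparse. This is a hypothetical sketch, not the argument handling of Anisto.py: the function names are invented for illustration, the ISO date format (for example 2020-04-20) used for the default prefix is an assumption, and the "_result" suffix follows the output file names shown under Example Usage.

```python
import argparse
import datetime
import os


def parse_args(argv=None):
    """Parse the -i, -o and optional -p options."""
    parser = argparse.ArgumentParser(
        description="Merge Kallisto abundance.tsv files")
    parser.add_argument("-i", dest="input_dir", required=True,
                        help="directory containing the abundance.tsv files")
    parser.add_argument("-o", dest="output_dir", required=True,
                        help="directory for the merged matrices")
    parser.add_argument("-p", dest="prefix", default=None,
                        help="optional prefix for the output file names")
    return parser.parse_args(argv)


def output_paths(output_dir, prefix=None, today=None):
    """Return the (count, tpm) output paths for a given prefix."""
    if prefix is None:
        # Assumption: fall back to today's date in ISO format.
        today = today or datetime.date.today()
        prefix = today.isoformat()
    base = os.path.join(output_dir, prefix + "_result")
    return base + ".count", base + ".tpm"


if __name__ == "__main__":
    args = parse_args()
    print(output_paths(args.output_dir, args.prefix))
```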

The script is available as a Docker image in the omicdocker/pandas:latest repository.

Begin RunEScript /RunOnServer=True;
Files 
"/GhindariuCloudFolder/Output/Abundances/SRR521461_abundance.tsv"
"/GhindariuCloudFolder/Output/Abundances/SRR521462_abundance.tsv";
EScriptName KallistoMerge;
Command python3 Anisto.py -i %FileDirectory% -o %OutputFolder%;
Options /ParallelJobNumber=1 /ThreadNumberPerJob=8 /Mode=Multiple /ErrorOnStdErr=False /ErrorOnMissingOutput=True /RunOnDocker=True /ImageName="omicdocker/pandas:latest" /UseCloud=True /OutputFolder="/GhindariuCloudFolder/Output/Results";
End;

Example Usage

Given a set of Kallisto quant output files named

/Users/joseph/RnaSeqTutorial2013/KallistoPythonPipeline/Test04202020/SRR521461_abundance.tsv
/Users/joseph/RnaSeqTutorial2013/KallistoPythonPipeline/Test04202020/SRR521462_abundance.tsv
/Users/joseph/RnaSeqTutorial2013/KallistoPythonPipeline/Test04202020/SRR521463_abundance.tsv
/Users/joseph/RnaSeqTutorial2013/KallistoPythonPipeline/Test04202020/SRR521522_abundance.tsv
/Users/joseph/RnaSeqTutorial2013/KallistoPythonPipeline/Test04202020/SRR521523_abundance.tsv
/Users/joseph/RnaSeqTutorial2013/KallistoPythonPipeline/Test04202020/SRR521524_abundance.tsv

and using this EScript command:

Begin RunEScript /RunOnServer=True;
SearchFiles "/Users/joseph/RnaSeqTutorial2013/KallistoPythonPipeline/Test04202020" /Pattern=*.tsv /Recursive=False;
EScriptName KallistoMergePython;
Command python3 Anisto.py -i "%FileDirectory%" -o "%FileDirectory%" -p "merged";
Options  /Mode=Multiple /ErrorOnStdErr=False /ErrorOnMissingOutput=False /RunOnDocker=True /ImageName="omicdocker/pandas:latest" /OutputFolder="/Users/joseph/RnaSeqTutorial2013/KallistoPythonPipeline/Test04202020";
End;

Anisto.py will merge the results of these files into

/Users/joseph/RnaSeqTutorial2013/KallistoPythonPipeline/Test04202020/merged_result.count
/Users/joseph/RnaSeqTutorial2013/KallistoPythonPipeline/Test04202020/merged_result.tpm

where each file contains the merged abundance results as a text matrix:

--merged_result.count--

target_id	SRR521523	SRR521462	SRR521463	SRR521522	SRR521524	SRR521461
ENST00000456328.2	5.55646	23.8902	15.5573	4.33379	3.79016	54.3487
ENST00000450305.2	0.0	0.0	0.0	0.0	0.0	0.0
ENST00000488147.1	90.1691	1017.74	1058.87	114.598	57.6045	706.806
ENST00000619216.1	0.0	0.25	0.5	0.0	0.0	0.0
ENST00000473358.1	1.54822	0.0	0.0	0.0	0.0	8.02154
ENST00000469289.1	0.0	0.0	0.0	0.0	0.0	0.0
ENST00000607096.1	0.0	0.0	0.0	0.0	0.0	0.0

--merged_result.tpm--

target_id	SRR521523	SRR521462	SRR521463	SRR521522	SRR521524	SRR521461
ENST00000456328.2	0.155124	0.528399	0.374834	0.111086	0.112607	1.26689
ENST00000450305.2	0.0	0.0	0.0	0.0	0.0	0.0
ENST00000488147.1	3.1746	28.3221	32.0945	3.70874	2.15809	20.7556
ENST00000619216.1	0.0	0.471365	1.00522	0.0	0.0	0.0
ENST00000473358.1	0.119864	0.0	0.0	0.0	0.0	0.514807
ENST00000469289.1	0.0	0.0	0.0	0.0	0.0	0.0
ENST00000607096.1	0.0	0.0	0.0	0.0	0.0	0.0
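Because the merged files are plain tab-delimited matrices, they can be read back with any TSV reader. The sketch below is a minimal, standard-library example with hypothetical function names: it loads a matrix and keeps only the transcripts that have a non-zero value in at least one sample. The inline sample reuses the first two rows and first two columns of merged_result.tpm above.

```python
import csv
import io


def load_matrix(handle):
    """Read a merged matrix into (sample_names, {target_id: [values]})."""
    reader = csv.reader(handle, delimiter="\t")
    samples = next(reader)[1:]  # header row: target_id, then sample names
    values = {row[0]: [float(v) for v in row[1:]] for row in reader if row}
    return samples, values


def expressed(values):
    """Keep transcripts with a non-zero value in at least one sample."""
    return {t: v for t, v in values.items() if any(x > 0 for x in v)}


# Two rows and two sample columns taken from merged_result.tpm above.
SAMPLE = (
    "target_id\tSRR521523\tSRR521462\n"
    "ENST00000456328.2\t0.155124\t0.528399\n"
    "ENST00000450305.2\t0.0\t0.0\n"
)

samples, values = load_matrix(io.StringIO(SAMPLE))
print(samples)
print(sorted(expressed(values)))
```

To read a real file, pass an open file handle, such as open("merged_result.tpm", newline=""), in place of the io.StringIO object.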

Anisto.py

The full script can be found here: File:Anisto.txt.