Kallisto Merge Python Script


Kallisto outputs one "abundance.tsv" file per sample, which is useful for downstream analyses with Kallisto/Sleuth, but can be inconvenient when you are used to working with expression matrices.

To simplify this workflow, we generated a simple Python script, Anisto.py, that merges all "abundance.tsv" files in a directory into a pair of merged expression matrices (counts and TPM).

The script is distributed as a Docker image (omicdocker/pandas:latest); the full script can be found below.
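
For orientation, the core of the merge can also be sketched directly with pandas. The snippet below is only an illustration of the idea, not part of Anisto.py; the glob pattern and output file names are assumptions.

import glob
import os
import pandas as pd

# Collect the est_counts and tpm columns of every <sample>_abundance.tsv in the current directory.
counts, tpms = {}, {}
for path in glob.glob("*_abundance.tsv"):
    sample = os.path.basename(path).split("_")[0]
    df = pd.read_csv(path, sep="\t", index_col="target_id")
    counts[sample] = df["est_counts"]
    tpms[sample] = df["tpm"]

# Dictionaries of Series align on target_id, giving one matrix per measure.
pd.DataFrame(counts).to_csv("merged_result.count", sep="\t")
pd.DataFrame(tpms).to_csv("merged_result.tpm", sep="\t")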


Calling the Anisto.py Dockerized EScript

Anisto.py will merge all abundance.tsv files into two merged abundance files (Counts and TPM).

Input files

The input is a directory containing one Kallisto abundance.tsv file per sample; every file in that directory whose name ends in .tsv is included in the merge.

Command syntax

The script syntax is

python3 Anisto.py -i (InputDirectory) -o (OutputDirectory) -p (prefix)

The prefix parameter is optional, and will be prepended to the output files as prefix_result.count and prefix_result.tpm. If not specified, the current date and time (formatted as YYYY_MM_DD_HH_MM) is used as the prefix instead.
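
For example, a call like the following (the paths are only illustrative) reads every .tsv file in /data/abundances and writes /data/merged/myproject_result.count and /data/merged/myproject_result.tpm:

python3 Anisto.py -i /data/abundances -o /data/merged -p myproject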

The script is available as a Docker image in the omicdocker/pandas:latest repository.

Begin RunEScript /RunOnServer=True;
Files 
"/GhindariuCloudFolder/Output/Abundances/SRR521461_abundance.tsv"
"/GhindariuCloudFolder/Output/Abundances/SRR521462_abundance.tsv";
EScriptName KallistoMerge;
Command python3 Anisto.py -i %FileDirectory% -o %OutputFolder%;
Options /ParallelJobNumber=1 /ThreadNumberPerJob=8 /Mode=Multiple /ErrorOnStdErr=False /ErrorOnMissingOutput=True /RunOnDocker=True /ImageName="omicdocker/pandas:latest" /UseCloud=True /OutputFolder="/GhindariuCloudFolder/Output/Results";
End;

Example Usage

Given a set of Kallisto quant output files named

/Users/joseph/RnaSeqTutorial2013/KallistoPythonPipeline/Test04202020/SRR521461_abundance.tsv
/Users/joseph/RnaSeqTutorial2013/KallistoPythonPipeline/Test04202020/SRR521462_abundance.tsv
/Users/joseph/RnaSeqTutorial2013/KallistoPythonPipeline/Test04202020/SRR521463_abundance.tsv
/Users/joseph/RnaSeqTutorial2013/KallistoPythonPipeline/Test04202020/SRR521522_abundance.tsv
/Users/joseph/RnaSeqTutorial2013/KallistoPythonPipeline/Test04202020/SRR521523_abundance.tsv
/Users/joseph/RnaSeqTutorial2013/KallistoPythonPipeline/Test04202020/SRR521524_abundance.tsv

and using this EScript command:

Begin RunEScript /RunOnServer=True;
SearchFiles "/Users/joseph/RnaSeqTutorial2013/KallistoPythonPipeline/Test04202020" /Pattern=*.tsv /Recursive=False;
EScriptName KallistoMergePython;
Command python3 Anisto.py -i "%FileDirectory%" -o "%FileDirectory%" -p "merged";
Options  /Mode=Multiple /ErrorOnStdErr=False /ErrorOnMissingOutput=False /RunOnDocker=True /ImageName="omicdocker/pandas:latest" /OutputFolder="/Users/joseph/RnaSeqTutorial2013/KallistoPythonPipeline/Test04202020";
End;

Anisto.py will merge the results of these files into

/Users/joseph/RnaSeqTutorial2013/KallistoPythonPipeline/Test04202020/merged_result.count
/Users/joseph/RnaSeqTutorial2013/KallistoPythonPipeline/Test04202020/merged_result.tpm

where each file contains the merged abundance results as a text matrix:

--merged_result.count--

target_id	SRR521523	SRR521462	SRR521463	SRR521522	SRR521524	SRR521461
ENST00000456328.2	5.55646	23.8902	15.5573	4.33379	3.79016	54.3487
ENST00000450305.2	0.0	0.0	0.0	0.0	0.0	0.0
ENST00000488147.1	90.1691	1017.74	1058.87	114.598	57.6045	706.806
ENST00000619216.1	0.0	0.25	0.5	0.0	0.0	0.0
ENST00000473358.1	1.54822	0.0	0.0	0.0	0.0	8.02154
ENST00000469289.1	0.0	0.0	0.0	0.0	0.0	0.0
ENST00000607096.1	0.0	0.0	0.0	0.0	0.0	0.0

--merged_result.tpm--

target_id	SRR521523	SRR521462	SRR521463	SRR521522	SRR521524	SRR521461
ENST00000456328.2	0.155124	0.528399	0.374834	0.111086	0.112607	1.26689
ENST00000450305.2	0.0	0.0	0.0	0.0	0.0	0.0
ENST00000488147.1	3.1746	28.3221	32.0945	3.70874	2.15809	20.7556
ENST00000619216.1	0.0	0.471365	1.00522	0.0	0.0	0.0
ENST00000473358.1	0.119864	0.0	0.0	0.0	0.0	0.514807
ENST00000469289.1	0.0	0.0	0.0	0.0	0.0	0.0
ENST00000607096.1	0.0	0.0	0.0	0.0	0.0	0.0
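
Because the merged matrices are plain tab-separated text, they can be read back into pandas (or any other tool) for downstream work. A minimal illustration, assuming the file sits in the working directory:

import pandas as pd

# Load the merged count matrix with transcripts as rows and samples as columns.
counts = pd.read_csv("merged_result.count", sep="\t", index_col="target_id")
print(counts.shape)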

Anisto.py

#!/usr/bin/python3
# coding=utf-8
"""
@author: SimonPr.
@release: 07.04.2020
"""

import pandas as pd
import numpy as np
import os
from collections import OrderedDict, defaultdict
import argparse
import time
import datetime
from copy import deepcopy

def main():
    parser = argparse.ArgumentParser(description='Merge Kallisto abundance.tsv files', prog='Merger')
    parser.add_argument('--inputdir', '-i', required=True,
                        help="Input directory containing the abundance.tsv files to merge")
    parser.add_argument('--prefix', '-p', default=None, dest="prefix", type=str,
                        help="Prefix for the output files (defaults to the current date and time)")
    parser.add_argument('--output', '-o', required=True,
                        help="Output directory; the merged count and TPM files are written here")
    args = parser.parse_args()
    date = datetime.datetime.fromtimestamp(time.time()).strftime('%Y_%m_%d_%H_%M')
    # Use the supplied prefix, or fall back to the current date and time.
    if args.prefix is None:
        prefix = date
    else:
        prefix = args.prefix
    if not os.path.exists(args.output):
        os.makedirs(args.output)
    out_tpm = os.path.join(args.output, ("%s_result.tpm" % prefix))
    out_count = os.path.join(args.output, ("%s_result.count" % prefix))
    _tsvlist = os.listdir(args.inputdir)
    tsvlist = [t for t in _tsvlist if t.split('.')[-1] == 'tsv']

    for f in range(0,len(tsvlist)):
        if f == 0:
            st_count, st_tpm = tsv2dict(args.inputdir, tsvlist[f])
        elif f == 1:
            #dd = defaultdict(list)
            nd_count, nd_tpm = tsv2dict(args.inputdir, tsvlist[f])
            dictmerged_count = merged_phase1(st_count, nd_count)
            dictmerged_tpm = merged_phase1(st_tpm,nd_tpm)
        elif f >= 2:
            nn_count, nn_tpm = tsv2dict(args.inputdir, tsvlist[f])
            # Targets present in the new file but not yet in the merged dictionaries.
            dain = [key for key in nn_count.keys() if key not in dictmerged_count.keys()]
            for key, val in dictmerged_count.items():
                if key in nn_count.keys():
                    dictmerged_count[key].append(nn_count[key])
                    dictmerged_tpm[key].append(nn_tpm[key])
                else:
                    dictmerged_count[key].append(np.nan)
                    dictmerged_tpm[key].append(np.nan)
            maind_count = main_dict(dictmerged_count, nn_count, dain)
            maind_tpm = main_dict(dictmerged_tpm, nn_tpm, dain)
            # Carry newly added targets forward so later files are merged against the full set.
            dictmerged_count, dictmerged_tpm = maind_count, maind_tpm
    if len(tsvlist) > 2:
        toframe(maind_count, out_count, "Count", tsvlist)
        toframe(maind_tpm, out_tpm, "TPM", tsvlist)
    elif len(tsvlist) == 2:
        toframe(dictmerged_count, out_count, "Count", tsvlist)
        toframe(dictmerged_tpm, out_tpm, "TPM", tsvlist)
    else:
        toframe(st_count, out_count, "Count", tsvlist)
        toframe(st_tpm, out_tpm, "TPM", tsvlist)

def tsv2dict(path, tsv):
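    # Read one Kallisto abundance.tsv and return {target_id: est_counts} and {target_id: tpm}
    # (columns 1, 4 and 5 of the file; the header line is skipped).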
    count_dict, tpm_dict = {}, {}
    with open(os.path.join(path, tsv)) as fh:
        for line in fh:
            if line.strip().split('\t')[0] != 'target_id':
                target_cell, count_cell, tpm_cell = line.strip().split('\t')[0], float(line.strip().split('\t')[3]), \
                                                    float(line.strip().split('\t')[4])
                count_dict[target_cell] = count_cell
                tpm_dict[target_cell] = tpm_cell
                # print("{}\t{}\t{}".format(target_cell, count_cell, tpm_cell))
    #count_dict.pop('target_id')
    return count_dict, tpm_dict

def merged_phase1(st_count_dictionary, nd_count_dictionary):
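    # Combine the first two per-sample dictionaries into {target_id: [value_file1, value_file2]},
    # inserting NaN where a target is missing from one of the two files.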
    dd = defaultdict(list)
    for d in (st_count_dictionary, nd_count_dictionary):
        for key, value in d.items():
            if key in st_count_dictionary.keys() or key not in nd_count_dictionary.keys():
                dd[key].append(value)
            if key not in st_count_dictionary.keys() and key in nd_count_dictionary.keys():
                dd[key].append(np.nan)
            if key in nd_count_dictionary.keys() and key not in st_count_dictionary.keys():
                dd[key].append(value)
            if key not in nd_count_dictionary.keys() and key in st_count_dictionary.keys():
                dd[key].append(np.nan)
    return dd

def main_dict(dictmerge, nn, da):
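    # Add targets that appear in the newest file (nn) but not yet in the merged dictionary:
    # earlier samples are padded with NaN before the new value is appended.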
    dap = da
    dictando, maindict = defaultdict(list), defaultdict(list)
    counter = []
    for s, v in dictmerge.items():
        counter = [np.nan for i in range(0, len(v) - 1)]
        break
    for i in dap:
        newcount = [*counter]
        dictando[i] = newcount
        if i in nn.keys():
            dictando[i].append(nn[i])
    # print(dictando)
    maindict = {**dictmerge, **dictando}
    return  maindict

def toframe(adict, outname, sis, ltsv):
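    # Write the merged dictionary as a tab-separated matrix; sample names in the header are
    # taken from the part of each input file name before the first underscore.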
    frame = pd.DataFrame.from_dict(adict, orient="index")
    frame.reset_index(inplace=True)
    ltsv = [x.split('_')[0] for x in ltsv]
    ltsv.insert(0, 'target_id')
    frame.to_csv(outname, header=ltsv, index=None, sep='\t')
    print("{} File Completed!".format(sis))

if __name__ == "__main__":
    main()