Scaling Land Text Dump RNA-seq FPKM values to TPM

From Array Suite Wiki


OmicSoft Land bulk RNA-seq data is reported as FPKM-normalized values, but a scaling factor is pre-calculated for each sample to enable convenient conversion to TPM within the OmicSoft Studio GUI.

The same scaling factor is included in the full Land text dumps, in "Samples.txt", so you can use it to quickly convert the Gene and Transcript FPKM values to TPM with your favorite scripting language.

Converting FPKM to TPM using Awk

R and Python environments can be used to calculate the TPM-scaled expression values, but in this example I will show how to use the Linux awk program to quickly output the files.

Before starting, identify the column in Samples.txt that contains "TPM Scaling Factor", since its position may change from Land to Land.

A quick way to do this is to print individual header fields until you land on the right one; the column is usually in the 30s:

head -n 1 Samples.txt | awk 'BEGIN{FS="\t"};{print $30}'
> StudyName
head -n 1 Samples.txt | awk 'BEGIN{FS="\t"};{print $35}'
> Treatment
head -n 1 Samples.txt | awk 'BEGIN{FS="\t"};{print $33}'
> TPM Scaling Factor
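
Rather than probing column numbers one at a time, the header can also be searched by name. The following is a minimal sketch, assuming the header label is exactly "TPM Scaling Factor" and the file is tab-delimited; find_col is a hypothetical helper name used only for illustration, not part of any OmicSoft tooling.

```shell
# find_col FILE NAME
# Print the 1-based index of the tab-separated header column whose label
# is exactly NAME. Prints nothing if no column matches.
# (find_col is a hypothetical helper name, not an OmicSoft command.)
find_col() {
  head -n 1 "$1" | awk -v name="$2" '
    BEGIN { FS = "\t" }
    {
      for (i = 1; i <= NF; i++) {
        if ($i == name) { print i; exit }
      }
    }'
}
```

Calling find_col Samples.txt "TPM Scaling Factor" should print 33 for the Land used in this example; other Lands may give a different number.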

Then you can use this awk one-liner with Samples.txt and GeneFPKM.txt as input. Change the column number in ScalingFactor[$1]=$33 to match the column identified above (33 in this example).

awk 'BEGIN{FS="\t";OFS="\t"};{if (NR==FNR){ScalingFactor[$1]=$33;SampleNames[$1]=$2} else if (FNR==1){print $1, SampleNames[$1],$2,"TPM"} else  {print $1,SampleNames[$1],$2,$3*ScalingFactor[$1]}}' Samples.txt GeneFPKM.txt
>SampleIndex     SampleID        GeneIndex       TPM
>0       C000S5B1        0       0.69198
>1       C000S5B4        0       0.403456
>2       C000WYB3        0       0.664014
>3       C0010KB1        0       0.339472
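
One way to sanity-check the result is to sum the TPM values per sample. The following is a rough sketch, assuming the four-column output shown above (without the leading ">" display markers) is piped in or has been saved to a file; tpm_sums is a hypothetical helper name. If the scaling factor was derived over the same set of genes that appears in the file, each sum should come out close to 1,000,000, but that is an assumption about how the factor was calculated, not something stated in the Land documentation.

```shell
# tpm_sums
# Read the four-column, tab-separated output (SampleIndex, SampleID,
# GeneIndex, TPM) on stdin, skip the header line, and print each SampleID
# with the sum of its TPM values, sorted by SampleID.
# (tpm_sums is a hypothetical helper name.)
tpm_sums() {
  awk '
    BEGIN { FS = "\t" }
    NR > 1 { total[$2] += $4 }
    END { for (id in total) printf "%s\t%.1f\n", id, total[id] }
  ' | sort
}
```

For example, with the converted values saved to a file of your choosing (GeneTPM.txt here is only a placeholder name), run tpm_sums < GeneTPM.txt and look for samples whose total is far from the expected value.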