TPM and FPKM

From Array Suite Wiki

Revision as of 13:05, 13 November 2018

The RSEM software described in the published paper can quantify gene expression only from reads mapped to the transcriptome. Our reimplementation allows quantification directly from genome-mapped BAM files. It is described in more detail in our published Oshell paper and on the wiki page Omicsoft RPKM/FPKM/Count values.

RSEM is one way to calculate TPM, and for any given sample RPKM is proportional to TPM [1]. Both have transcript length in the denominator. TPM is simply RPKM multiplied by a per-sample constant chosen so that the values sum to 1 million.

In our Land results, we scale the FPKM/RPKM values once more so that the 75th percentile (upper quartile) is the same for all samples. This upper-quartile normalization is very common in practice; the same approach was used for the TCGA level 3 dataset. If the input is TPM from the public RSEM implementation, upper-quartile normalization yields the same values as the scaled FPKM: because RPKM = TPM*k for a per-sample constant k, RPKM.Normalized = TPM.Normalized whenever both are scaled to the same final upper-quartile value.
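The equivalence described above can be checked with a small sketch. It assumes a single sample, toy TPM values, an arbitrary per-sample constant `k`, and a hypothetical target upper-quartile value of 10; none of these numbers come from Land or TCGA, and the quartile definition used here (Python's `statistics.quantiles`) may differ from the one used in production pipelines.

```python
from statistics import quantiles


def upper_quartile_normalize(values, target=10.0):
    """Scale values so that their upper quartile equals `target`.

    `target` is a hypothetical choice made for this illustration only.
    """
    q3 = quantiles(values, n=4)[2]  # third of the three quartile cut points
    if q3 == 0:
        raise ValueError("upper quartile is zero; cannot scale")
    return [v * target / q3 for v in values]


# Toy TPM values for one sample (they sum to 1,000,000).
tpm = [0.0, 50_000.0, 100_000.0, 150_000.0, 250_000.0, 450_000.0]

# RPKM differs from TPM only by a per-sample constant (arbitrary here).
k = 0.37
rpkm = [v * k for v in tpm]

tpm_norm = upper_quartile_normalize(tpm)
rpkm_norm = upper_quartile_normalize(rpkm)

# The constant k cancels, so both normalized vectors agree.
max_diff = max(abs(a - b) for a, b in zip(tpm_norm, rpkm_norm))
```

Because the scaling factor is `target / q3` and `q3` itself carries the factor `k`, the constant drops out, which is why the choice between RPKM and TPM as input does not matter after this normalization.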

If a user does want TPM values, they can be computed from the following facts:

  • RSEM estimates a value θ with the EM algorithm. θ represents the relative expression level in a measure called “the probability of nucleotides”: θi is the probability that a mapped read nucleotide belongs to isoform i.
  • RPKM = (1,000,000*1,000*θi*TotalNumberOfMappedReads) / (ℓi*TotalNumberOfMappedReads) = (1,000,000,000*θi)/ℓi, where ℓi is the length, in nucleotides, of isoform i.
  • TPM (transcripts per million) = 1,000,000*θi/(ℓi*c), where c is the constant sum_j(θj/ℓj), so that sum_i(TPMi) = 1,000,000.
  • TotalNumberOfMappedReads is the sum of reads mapped to exon or exon-junction regions on the chromosome only. It is neither the total number of alignments in the BAM file nor the total number of aligned reads in the alignment report.
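The definitions above can be illustrated with a minimal sketch. The θ values and isoform lengths are invented for the example, and the function name `rpkm_and_tpm_from_theta` is hypothetical; this is not the Omicsoft or RSEM implementation.

```python
def rpkm_and_tpm_from_theta(theta, lengths):
    """Compute RPKM and TPM per isoform from theta and isoform lengths.

    RPKM_i = 1e9 * theta_i / l_i
    TPM_i  = 1e6 * theta_i / (l_i * c), with c = sum_j(theta_j / l_j)
    """
    rate = [t / l for t, l in zip(theta, lengths)]  # theta_i / l_i
    c = sum(rate)
    if c == 0:
        raise ValueError("all theta values are zero")
    rpkm = [1e9 * r for r in rate]
    tpm = [1e6 * r / c for r in rate]
    return rpkm, tpm, c


# Invented inputs: three isoforms whose theta values sum to 1.
theta = [0.5, 0.3, 0.2]
lengths = [2000, 1000, 500]  # nucleotides

rpkm, tpm, c = rpkm_and_tpm_from_theta(theta, lengths)

# Per-isoform RPKM/TPM ratios; each should equal c * 1,000.
ratios = [r / t for r, t in zip(rpkm, tpm)]
```

In this toy case every entry of `ratios` equals `c * 1000`, and `tpm` sums to 1,000,000, matching the definitions in the list.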

Then

  • The ratio RPKM/TPM is c*1,000 for every transcript i, which is a constant within a sample.
  • If sum_i(RPKMi) = Z, then because sum_i(TPMi) = 1,000,000, Z = c*1,000*1,000,000, so c*1,000 = Z/1,000,000.
  • TPM = RPKM*1,000,000/Z
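As a sketch of the final conversion, the following rescales one sample's RPKM values to TPM using only their sum Z. The input values and the function name `rpkm_to_tpm` are hypothetical, and the function assumes the list covers every transcript quantified in that sample, since Z must be the sum over all of them.

```python
def rpkm_to_tpm(rpkm):
    """Convert one sample's RPKM values to TPM: TPM_i = RPKM_i * 1e6 / Z."""
    z = sum(rpkm)  # Z = sum of RPKM over all transcripts in the sample
    if z == 0:
        raise ValueError("sum of RPKM is zero; TPM is undefined")
    return [v * 1e6 / z for v in rpkm]


# Invented RPKM values for a three-transcript sample.
rpkm_example = [250_000.0, 300_000.0, 400_000.0]
tpm_example = rpkm_to_tpm(rpkm_example)

total_tpm = sum(tpm_example)
```

The conversion must be applied per sample, because Z differs from sample to sample.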

Reference

  1. Section "Comparison to RPKM estimation" in RSEM paper: Bioinformatics (2010) 26 (4): 493-500

Related Articles

  • RNA-Seq Normalized FPKM Values in Land
  • Omicsoft RPKM/FPKM/Count values