Transcript quantification is affected by positional bias

From Array Suite Wiki


Transcript quantification is based on RSEM and the provided gene model. However, we have found that the EM estimation is affected by 5’->3’ read coverage bias. Some TCGA tumor types show a strong 3’ bias, where the 3’ region of a gene tends to have higher coverage than the 5’ region. For example, the TCGA BLCA dataset clearly shows this coverage pattern, while the TCGA OV dataset seems to be fine:


When there is a strong 3’ coverage bias, RSEM tends to assign reads to a shorter transcript that shares most of its exons and exon junctions with a longer one, as the transcript bar chart below shows:


uc002mqh.4 is estimated to be the most abundant transcript in BLCA, whereas uc010dxq.3 is the most abundant in OV.
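
The effect described above can be reproduced in a toy setting. The following is a minimal sketch, not RSEM itself: it assumes one hypothetical gene with two isoforms, where the short isoform is simply the 3’-most part of the long one, and it runs a plain EM under a uniform-coverage model. The lengths, function names and the linear coverage ramp are all assumptions made for illustration. Every read is generated from the long isoform; only the positional distribution of the reads differs between the two cases.

```python
import math

LONG_LEN = 2000                      # hypothetical long isoform length (bases)
SHORT_LEN = 800                      # hypothetical short isoform: 3'-most 800 bases of the long one
SHORT_START = LONG_LEN - SHORT_LEN   # first long-isoform coordinate covered by the short isoform


def uniform_positions(n):
    """n read positions spread evenly along the long isoform (no bias)."""
    return [LONG_LEN * (i + 0.5) / n for i in range(n)]


def three_prime_biased_positions(n):
    """n read positions whose density rises linearly from 5' to 3'.

    Uses the inverse CDF of a linear ramp on an even grid, so the
    result is deterministic (an assumed bias shape, not a measured one).
    """
    return [LONG_LEN * math.sqrt((i + 0.5) / n) for i in range(n)]


def em_short_fraction(positions, n_iter=200):
    """Fraction of reads EM assigns to the short isoform.

    The model assumes uniform coverage: a read is equally likely to
    start anywhere on an isoform it is compatible with.
    """
    s = 0.5  # initial guess for the short isoform's share of reads
    for _ in range(n_iter):
        total = 0.0
        for p in positions:
            w_long = (1.0 - s) / LONG_LEN
            # The short isoform can only explain reads in the shared 3' region.
            w_short = s / SHORT_LEN if p >= SHORT_START else 0.0
            total += w_short / (w_long + w_short)   # E-step: responsibility
        s = total / len(positions)                  # M-step: update share
    return s


if __name__ == "__main__":
    for name, reads in [("uniform", uniform_positions(200)),
                        ("3' biased", three_prime_biased_positions(200))]:
        s = em_short_fraction(reads)
        print(name, "short share:", round(s, 4),
              "per-base long:", (1 - s) / LONG_LEN,
              "per-base short:", s / SHORT_LEN)
```

In this toy setup the share given to the short isoform decays toward zero when coverage is uniform, but settles at about 40% of the reads when coverage ramps up toward the 3’ end, even though no read was drawn from the short isoform. After dividing by length, the short isoform then looks like the more abundant one, which is the same kind of switch seen between the two datasets.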

In general, library preparation affects RNA-Seq data quality, and some low-input RNA-Seq library preparation protocols can lead to severe 5’->3’ read coverage bias. In TCGA, the read length in most datasets is 51 bp, while in CCLE the reads are longer (101 bp). Read length also affects the accuracy of RSEM estimation.
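
To check whether a given library shows this kind of coverage bias, one rough approach is to bin read positions along the transcript body and compare the 3’ half with the 5’ half. The sketch below is an illustration under stated assumptions, not a feature of Array Suite or RSEM: the function names are hypothetical, and the input is assumed to be read positions already scaled to the range 0 to 1 in transcript orientation (0 at the 5’ end), so reads on minus-strand genes must be flipped beforehand.

```python
def coverage_profile(rel_positions, n_bins=10):
    """Fraction of reads in each of n_bins equal-width bins, ordered 5' to 3'.

    rel_positions: read positions scaled to [0, 1], with 0 at the 5' end.
    """
    if not rel_positions:
        raise ValueError("no reads given")
    counts = [0] * n_bins
    for r in rel_positions:
        # Clamp so that a position of exactly 1.0 falls in the last bin.
        idx = min(int(r * n_bins), n_bins - 1)
        counts[idx] += 1
    return [c / len(rel_positions) for c in counts]


def three_prime_ratio(profile):
    """Reads in the 3' half divided by reads in the 5' half.

    With an odd number of bins the middle bin is ignored.
    Values well above 1 suggest 3' bias.
    """
    half = len(profile) // 2
    five_prime = sum(profile[:half])
    three_prime = sum(profile[-half:])
    if five_prime == 0:
        return float("inf")
    return three_prime / five_prime


if __name__ == "__main__":
    # Hypothetical inputs: evenly spread reads vs. reads piled toward the 3' end.
    even = [(i + 0.5) / 100 for i in range(100)]
    skewed = [((i + 0.5) / 100) ** 0.5 for i in range(100)]
    print("even  :", three_prime_ratio(coverage_profile(even)))
    print("skewed:", three_prime_ratio(coverage_profile(skewed)))
```

A single ratio per gene is a coarse summary. Averaging the binned profiles over many genes gives a global picture, but it hides gene-to-gene differences in the bias pattern.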

Internally, we tried to improve the RSEM model by adding positional bias parameters. However, every gene seems to have a different bias pattern, so correcting with a single global pattern does not improve the estimation much.
