Fusion gene detection in RNA-Seq
From Array Suite Wiki
In RNA-Seq datasets, fusion genes can be detected based on both paired- and single-end reads. In a paired-end NGS dataset,a discordant read pair is one that is not aligned to the reference genome with the expected distance or orientation. If a set of discordant read pairs are mapped to two different genes, a fusion gene is suggested (Fusion PE detection). On the other hand, single-end reads that span the fusion junctions provide base-pair evidence for the fusion events (Fusion SE detection). (also reads Ge, H, et al. "FusionMap: detecting fusion genes from next-generation sequencing data at base-pair resolution." Bioinformatics 27.14 (2011): 1922-1928.)
In paired-end datasets, the fusion report from junction-spanning reads is complementary to that from discordant read pairs. Moreover, fusion detection based on junction-spanning reads is more powerful than the paired-end approach when the inner distance between read pairs is short. In theory, for a set of 75 nt paired-end reads generated from a 400 bp fragment library, the ratio between the number of junction-spanning reads and discordant read pairs is close to 75*2/(400-75*2)=0.6. The ratio will increase when the usable read lengths increase with new sequencing techniques or reagents. However, when read length is less than 50bp, single-end approach is less sensitive compared to paired-end approach.
The bioinformatics workflow in fusion detection includes
- detect fusion genes from fusion junction-spanning reads using raw datasets or unmapped reads
- detect fusion genes from inter-transcript paired-end reads based on RNA-Seq alignment
If you have ArrayStudio License, you can run the fusion detection easily through its GUI. Please read the following two guides:
Or run the two in one function starting from NgsData (BAM file) containing unmapped reads (it requires at least junction spanning support)
- Example of OmicScript to run FusionMap alignment
- Example of OmicScript to run FusionMap alignment and paired fusion report in the RNA-Seq pipeline
Both fusion detection functions generate their own fusion reports: Fusion SE report & Fusion PE report. Fusion SE report provides base-pair resolution information, such as SplicePattern, FrameShift and FusionJunctionSequence.
We usually recommend users to use relaxed parameters to run fusion detection and then use varied filters to remove false positives. The fusion table could be very big especially when user merges fusion results from many samples and with relaxed detection parameters. Below are useful information to reduce false positves:
Fusion report contains a Filter column label fusion candidates using our accumulated black list on gene and gene pair level. The possible value could be empty, InBlackList, InFamilyList, InParalogueList or SameEnsemblGene.
- Gene blacklist includes:
- Mitochondrial & ribosomal genes (based on Gene Ontology)
- Pseudogenes according to three sources: Ensembl annotations, Entrez Gene Database and HUGO Gene Nomenclature Committee (HGNC).
Fusion candidates involving genes in this blacklist will be marked as InBlackList.
- Gene pair blacklist includes
- Gene family definition according to HGNC
- Paralog gene pairs according to Ensembl BioMart.
If the value is InFamilyList, it means the gene pair belongs to the same gene family (defined by HGNC). If the value is InParalogueList, it means the gene pair does not belong to HGNC based gene family, but they belong to two genes that are in the same paralogue group (defined by Ensembl v74, which we finally downloaded the database from Biomart). If it is SameEnsemblGene, it means that these two genes actually are overlapping a “super” ensemble gene (note the fusion comes from UCSC gene model).
Canonical splice patterns
In the RNA-Seq datasets, studies have shown that most of fusion junctions are exon-exon splice junctions of fusion genes. A TRUE fusion junction is likely to preserved a canonical splice pattern: GT-AG (~98.71%), GC-AG (~0.56%) and AT-AC (<0.05%). You can use the filter panel on the right to reduce false positives by selecting:
- Canonical splice patterns [major]: GT-AG
- Canonical splice patterns [minor]: GC-AG and AT-AC
- Non-canonical splice patterns: all other detected di-nucleotides
In the Fusion SE report, there is one more column for FrameShift prediction. Three nucleotide position in each frame is note as position 0, 1, 2.  is amino acid coding pattern. If the FrameShift jumping from gene A to B is 0->1, 1->2, 2->0, then there is no frame shift in both gene A and B due to the fusion event. User can filter the result by
- In-frame: contain any of 0->1, 1->2, 2->0 in the FrameShift column
- Frame-Shift: predicted to be others
- Unknown: something like ‘->0’ or '->'. no CDS information. Since 08/2013, we also considered them as In-frame.
- Frame shift is inferred based on CDS information in gene model (get from GTF files). If the frame shift is just’ ->’, then should I assume that the breakpoint is not in the coding region for either gene and *similarly, something like ‘->0’ would imply the breakpoint is in the coding region only for the second gene (in phase 0).
- Be aware that it is limited to the gene model you used. For example, you may see it is "->" using RefGene model but may see it is annotated in Ensembl.
- Because the fusion position can be in multiple transcript, you will see multiple FrameShift predictions.
Fusion distance filter
By computing the genomic distance between two fusion breakpoints, we can infer the fusion type: read-through or translocation. For fusion A-B, we can create a value column by
- -1: A and B are on two different chromosome
- The gap size between genomic coordinates of fusion positions in A and B
More filters can be applied outside of Omicsoft
Fusion detection agreement between single-end and paired-end
SE and PE are two independent approaches to detect fusions. We can add a column to PE and SE fusion result tables showing the agreement between SE and PE detection for each fusion. The value in this column can be a statistical testing score, such as the Fisher exact test. Such test can only be done after we merge fusion results from a group of RNA-Seq samples.
Gene expression of two fusion partners and their fusion product
The number of seed reads in SE and fusion read pairs in PE provides confident estimate for the abundance of a fusion gene or transcript. The number of seed reads covering the fusion junction is a function of sequencing depth and read length. The number of inter-transcript read pairs covering the fusion junction is a function of sequencing depth and the fragment size (the inner distance). We can normalize read number by number of mappable reads in the whole NGS dataset and also by the adjusted read/fragment length. It provided a measure close to the RPKM value of the fusion gene or transcript, and facilitated the comparison of abundances of fusion products both between samples and between different read lengths. The proposed normalization formula are:
Paired-end detection result
- FusionRPKM.PE = #FusionReadPairs/(#mappable reads)/(adjusted inner distance)*10^6*10^3
- adjusted inner distance = MedianInsertSize – 2*ReadLength+2*2
Single-end detection results (using seed reads only, seed reads per kilobase of seed region and per million mapped reads (SRPKM) values mentioned in the FusionMap paper)
- SRPKM = #SeedReads/(#mappable reads)/(seed region length)*10^6*10^3
- seed region length= ReadLength-2*MinimalFusionAlignmentLength then adjusted by the number of allowable mismatches
Single-end detection results (using all fusion reads)
- FusionRPKM.SE = #FusionReads/(#mappable reads)/(adjusted read length)*10^6*10^3
- adjusted read length= ReadLength * 92%
- FusionReads = #SeedReads + #RescuedReads
Based on them, we can create additional filters, such as
- RPKMRatio.SE.Gene1 = FusionRPKM.SE/(Gene1.RPKM+1)
- RPKMRatio.SE.Gene2 = FusionRPKM.SE/(Gene2.RPKM+1)
- RPKMRatio.SE.PE = (FusionRPKM.SE+1)/(FusionRPKM.PE+1)
- FusionRPKM.summary = sum(FusionRPKM)/(#FusionRPKM>0) showing the average fusion expression levels in samples with the fusion
Filters on tumor/normal count
- TumorCount = #tumor (unique sample ID) having the fusion
- NormalCount = #normal (unique sample ID) having the fusion
- Tumor_Normal_Ratio = (TumorCount+1)/(NormalCount+1);
Filters on gene annotation
- Cancer gene: YES, NA (Not Available)
- Kinase class: AGC, CMGC, CAMK, CK1, STE, TK, TKL, NA
- Surface gene
- Filter to find fusions with type (surface Gene)—(TK Gene) is particular interesting.
Fusion SAM files
- In Report Paired End Fusion Genes, each SAM file entry are extracted from the original alignment file.
- In Map Fusion Reads, each entry is based on fusion alignment. Each fusion read will be cut into two ends which aligned to different locations. In the SAM output, we use two entries to describe the alignment of each fusion reads in the paired-end fashion. Each line also contain the tag of fusion read type and fusion junction ID. Example of a fusion read in the SAM file:
BI:081030_SL-XBF_0001_FC30CB2AAXX:7:1:954:637 67 22 23632564 255 37M39S 9 133729451 0 ATCGTCCACTCAGCCACTGGATTTAAGCAGAGTTCAAAAGCCCTTCAGCGGCCAGTAGCATCTGACTTTGAGCCTC IIIIIII>III/III1I7II.0F;*8>I=I'>,+I(**&<I>I*/I)4?=0<C&8%&;<&&4&4%3**/2%,35#7 ZF:Z:FUS_1665438551_2873549802(++) ZT:Z:Seed BI:081030_SL-XBF_0001_FC30CB2AAXX:7:1:954:637 131 9 133729451 255 37S39M 22 23632564 0 ATCGTCCACTCAGCCACTGGATTTAAGCAGAGTTCAAAAGCCCTTCAGCGGCCAGTAGCATCTGACTTTGAGCCTC IIIIIII>III/III1I7II.0F;*8>I=I'>,+I(**&<I>I*/I)4?=0<C&8%&;<&&4&4%3**/2%,35#7 ZF:Z:FUS_1665438551_2873549802(++) ZT:Z:Seed
BI:081030_SL-XBF_0001_FC30CB2AAXX:7:1:954:637 has been cut into two partial reads, one with 37nt and the other with 39nt. The first line represents the alignment of 37nt end using CIGAR 37M39S; the second line represents the alignment of 39nt end using CIGAR 37S39M.
Add fusion SAM/BAM files to genome browser
User can add track of each individual SAM files into genome browser by Add Track | Add Track from local files | Add SAM/BAM file.
If User have many samples and want to group them in genome browser, the SAM files have to be imported into a project as NgsData and attach it with sample design. Below is one example based on 130 TCGA BAM files:
- Create or open a OmicSoft project
- Add Data | Add NGS Data | Add RNA-Seq Data | Add Genome-Mapped Reads | Choose file format as SAM/BAM; output same (such as FusionAlignment)
- Once the NGS Data is created, right click on the design folder and Import the sample design for all SAM files
- Go to genome browser tab, Add Track | Add Track from local analysis| Alignment track: NGS data
- You can create track for each sam file or use "create a track for each group" by specifying a sample group, such as sample type.
Once, fusion alignment files are loaded in genome browser, user can jump to any fusion locations from fusion table by right clicking on the row name in either result table or detail table:
The fusion region shows up in genome browser spitted fusion view. User can show reads and junctions: