Align Ion Torrent reads

From Array Suite Wiki


Mapping Ion Torrent NGS data with a "Two-step" alignment protocol

Next-generation sequencing (NGS) data generated on Thermo Fisher's Ion Torrent platforms are attractive because of the high throughput offered. However, Ion Torrent reads contain insertion/deletion (indel) errors, especially at homopolymers, at a much higher rate than Illumina reads.

To deal with this, Thermo Fisher has proposed an alternative "two-step" alignment approach that maximizes the number of usable reads from Ion Torrent data.

In short, raw data (after QC-based filtering) are processed as follows:

  1. Reads are aligned using a short-read aligner such as OSA
  2. Unmapped reads are separated from mapped reads
  3. Unmapped reads are aligned again using less stringent "local alignment" parameters
  4. Reads from the first and second rounds are combined for downstream analysis
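The control flow of the four steps can be sketched in a few lines of Python. This is only an illustration of the idea, not how Array Studio or OSA is implemented: the function name `two_step_align` and the two aligner callables are hypothetical, and each "aligner" is assumed to return a position, or `None` when the read cannot be placed.

```python
from typing import Callable, Dict, List, Optional, Tuple

# Hypothetical aligner type: takes a read sequence, returns a position or None.
Aligner = Callable[[str], Optional[int]]


def two_step_align(
    reads: Dict[str, str],
    first_pass: Aligner,
    second_pass: Aligner,
) -> Tuple[Dict[str, Tuple[int, str]], List[str]]:
    """Align with a stringent pass, retry the failures with a relaxed pass,
    and return the combined hits plus the names of reads that never mapped."""
    merged: Dict[str, Tuple[int, str]] = {}
    unmapped: Dict[str, str] = {}

    # Steps 1 and 2: stringent alignment, then separate mapped from unmapped.
    for name, seq in reads.items():
        pos = first_pass(seq)
        if pos is None:
            unmapped[name] = seq
        else:
            merged[name] = (pos, "round1")

    # Step 3: retry only the unmapped reads with the relaxed aligner.
    still_unmapped: List[str] = []
    for name, seq in unmapped.items():
        pos = second_pass(seq)
        if pos is None:
            still_unmapped.append(name)
        else:
            # Step 4: second-round hits join the first-round hits.
            merged[name] = (pos, "round2")

    return merged, still_unmapped
```

Tagging each hit with the round that produced it mirrors the advice later in this article to compare first-round and second-round alignments before trusting the merged set.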

These steps can be performed entirely within Array Studio, and data at each stage can be analyzed for quality parameters to maximize confidence in your Ion Torrent data.

For a more general overview of the Array Studio RNA-seq workflow, please see our tutorial or RNA-seq analysis video series.

Test Data Set

For this analysis, two sets of RNA-seq data from the "ABRF NGS study" (GSE46876) were used:

Ion Torrent Proton runs ABRF-PRO-MSK-B-1, ABRF-PRO-MSK-B-2, ABRF-PRO-MSK-B-3 (SRX307047-SRX307049)

Illumina HiSeq 2500 runs ABRF-ILMN-RNA-B-1, ABRF-ILMN-RNA-B-2, ABRF-ILMN-RNA-B-3, ABRF-ILMN-RNA-B-4 (SRX307101-SRX307104)

In both cases, poly-A mRNA from "FirstChoice Human Brain Reference Total RNA" was processed through the appropriate library protocols.

Step 1: Quality Control of Raw Reads

The Array Studio Raw QC wizard runs multiple analysis modules to gauge overall read quality and to identify potential problems that can be remedied with proper filtering.

NGS rawQC menu.png

Step 2: Align Ion Torrent reads with OSA

Click Add Data | Add RNA-seq Data | Map Reads to Genome (Illumina):

Ngs MapRNAseqReadsToGenome Menu.png

You can leave most mapping parameters at their default values:

TwoStep MapReads General Window.png TwoStep MapReads Advanced Window.png

However, if the 3' adapter sequence was not removed beforehand, you may want to strip 'GGCCAAGGCG' from the 3' end of reads (under the Advanced tab):

TwoStep StripIonTorrentAdapters Window.png
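To make concrete what stripping a 3' adapter does to a read, here is a minimal Python sketch. It is not Array Studio's implementation. The function name `strip_3prime_adapter` and the `min_overlap` parameter are hypothetical, and the trimming rule (cut at the first full adapter match, or at a partial adapter overhanging the 3' end) is an assumption chosen for illustration.

```python
ADAPTER = "GGCCAAGGCG"  # adapter sequence named in this article


def strip_3prime_adapter(seq, qual, adapter=ADAPTER, min_overlap=4):
    """Return (seq, qual) with the adapter and everything 3' of it removed.

    Assumed rule (illustrative only): trim at the first full adapter match;
    otherwise trim a partial adapter of at least `min_overlap` bases that
    hangs off the 3' end of the read.
    """
    cut = seq.find(adapter)
    if cut == -1:
        # Look for the longest adapter prefix that ends the read.
        longest = min(len(adapter) - 1, len(seq))
        for k in range(longest, min_overlap - 1, -1):
            if seq.endswith(adapter[:k]):
                cut = len(seq) - k
                break
    if cut == -1:
        return seq, qual  # no adapter found, read is unchanged
    # Trim the quality string to the same length as the sequence.
    return seq[:cut], qual[:cut]
```

The quality string is trimmed together with the sequence so that the two stay the same length, which FASTQ requires.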

Also, be sure to specify an output folder, because you will need to locate the output .bam files in the next step. If you do not specify one, Array Studio will generate a folder to contain your alignment.

Optional step 2b: Aligned QC of first-round reads

If you choose, you can run RNA-seq Aligned QC metrics to gauge the alignment rate. These metrics may be helpful for comparison with the second-round alignments.

Step 3: Separate mapped and unmapped reads

Now, click NGS | Manipulation | Export to separate the first-round .bam file into mapped reads (in a .sam file) and unmapped reads (in a .fastq file):

Ngs ExportData Menu.png

In the Export NGS Data window, select Mapped+Unmapped, specify an output folder, and click Send to Queue.

TwoStep ExportMappedSamUnmappedFastq Window.png
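For readers who want to see what this split amounts to at the file level, the following Python sketch separates SAM records by the "segment unmapped" flag bit (0x4 in the SAM specification) and rewrites unmapped records as FASTQ. It is an illustration only and says nothing about how the Export module works internally. The function name `split_sam` is hypothetical, and the sketch assumes single-end reads stored in their original orientation.

```python
UNMAPPED_FLAG = 0x4  # SAM FLAG bit: segment unmapped


def split_sam(sam_lines):
    """Split SAM text lines into (header, mapped records, unmapped FASTQ records)."""
    header, mapped, unmapped_fastq = [], [], []
    for line in sam_lines:
        line = line.rstrip("\n")
        if not line:
            continue
        if line.startswith("@"):
            header.append(line)  # header lines travel with the mapped output
            continue
        fields = line.split("\t")
        flag = int(fields[1])
        if flag & UNMAPPED_FLAG:
            # QNAME, SEQ and QUAL are SAM columns 1, 10 and 11.
            name, seq, qual = fields[0], fields[9], fields[10]
            unmapped_fastq.append(f"@{name}\n{seq}\n+\n{qual}")
        else:
            mapped.append(line)
    return header, mapped, unmapped_fastq
```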

Then, convert the mapped .sam reads back into .bam reads using ConvertNgsFiles:

TwoStep ConvertNgsFiles Menu.png

Locate the mapped .sam files (their names should end with .mapped.sam), select SAM as source and BAM as target, and specify an output folder. The output files will be named *.mapped.bam.

TwoStep ConvertNgsFiles Window.png

Step 4: Align unmapped reads using OSA Long Read settings

The reads that were left unmapped under OSA's standard parameters might contain homopolymeric indels that prevented a good alignment. The "Long Read Aligner" uses much less stringent alignment rules to attempt short local alignments of sub-sequences within the read. In particular, this module tolerates multiple indels in a read.
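A small Python illustration shows why homopolymer-length errors defeat stringent matching. It does not describe how OSA or the Long Read Aligner works, and the function names are hypothetical. Two reads that differ only in the length of a homopolymer run are different strings, yet they become identical once each run is collapsed to a single base.

```python
from itertools import groupby


def homopolymer_runs(seq):
    """Return the read as a list of (base, run_length) pairs."""
    return [(base, len(list(run))) for base, run in groupby(seq)]


def collapse_homopolymers(seq):
    """Collapse every homopolymer run to a single base."""
    return "".join(base for base, _ in groupby(seq))
```

For example, `ACGTTTGA` and `ACGTTTTGA` (one extra `T`) differ by a single-base insertion, so an aligner that allows few or no indels may reject the second read even though both collapse to `ACGTGA`.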

To run the Long Read Aligner, click Add Data | Add NGS Data | Add RNA-Seq Data | Map Reads To Genome (Long Reads):

Ngs MapLongRnaSeqReads Menu.png

Several parameters, such as the Seed Length and gap penalties, can be adjusted. Specify an output name (and an output folder!) and send the job to the queue:

TwoStep MapLongRNAseqReads Window.png

Step 5: Merge First and Second round mapped reads

Now the mapped reads from the first round and second round can be merged.

To merge the .bam files, you must first create a grouping file: a two-column file in which the first column holds the full ArrayServer path to each file, and the second column holds the name of the group to which that file belongs.

In this grouping file, include the first-round .bam files that contain mapped reads only (i.e. those converted from .sam files), along with the full second-round .bam files.

To get the paths to the aligned .bam files easily, use the Array Server file browser to navigate to the folder containing the first-round reads (aligned only) or the second-round reads (aligned from unmapped first-round reads), select the appropriate .bam files, then right-click and select Copy File Paths:

TwoStep FileBrowser FindFiles.png

Paste this into the first column of an Excel file, specify file groupings in the second column, save as tab-delimited text, and upload to ArrayServer.
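If you prefer to script this step instead of using Excel, a grouping file can be written with a few lines of Python. This is a minimal sketch under stated assumptions: the function name `write_grouping_file` is hypothetical, the paths in the usage example are placeholders, and the file is written without a header row because the format description above does not mention one.

```python
import csv


def write_grouping_file(groups, handle):
    """Write a two-column, tab-delimited grouping file.

    `groups` maps a group name to the list of .bam paths in that group.
    Column 1 is the file path, column 2 is the group name.
    """
    writer = csv.writer(handle, delimiter="\t", lineterminator="\n")
    for group, paths in groups.items():
        for path in paths:
            writer.writerow([path, group])


if __name__ == "__main__":
    # Placeholder paths and group name, for illustration only.
    example = {
        "SampleA": [
            "/path/to/round1/SampleA.mapped.bam",
            "/path/to/round2/SampleA.bam",
        ],
    }
    with open("grouping.txt", "w", newline="") as out:
        write_grouping_file(example, out)
```

Each group here pairs one first-round file with one second-round file, so that the merge produces one combined file per sample.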

Then, click NGS | Tools | Merge Files:

Ngs MergeFiles Menu.png

then select the files to merge and grouping file:

TwoStep MergeNgs Window.png

Once the files have been merged, you can import the merged reads into the project with Add Genome-Mapped RNA-seq Reads:

Ngs AddGenomeMappedRNAseqReads Menu.png

Optional step 5b: Aligned QC on merged reads

Check the alignment of the merged files.

Step 6: Compare mapping quality of alignment steps

Once you have merged your files, you can delete the intermediate files, such as the round 1 and round 2 .bam files, the .sam files, and the unmapped .fastq files.

But first, you may want to inspect the alignments in the OmicSoft Genome Browser, or quantitate gene expression with the different alignments and then use Microarray-Microarray integration to compare the datasets.

If the second-round alignments look comparable to the first-round alignments, you can confidently use the merged alignments for downstream analysis. If the second-round alignments look excessively "noisy", consider using only the first-round reads, or change the Long Read alignment parameters.

Example Results

An Ion Torrent test dataset from the ABRF-NGS study, which sequenced brain poly-A mRNA on both Illumina HiSeq 2500 and Ion Torrent Proton platforms, was run through standard OSA4 RNA-seq alignment, "Long Read" alignment only, or the two-step protocol. Each result was compared with an Illumina dataset aligned with OSA4.

Aligned QC

TwoStep AlignedQC Results.png

Ion Torrent reads tended to map more poorly than Illumina reads, but the unique mapping rate was ~40% to ~55% with the two-step protocol.

Correlation of Gene Expression Quantification with Illumina

TwoStep PairwiseCorrelation Plot.png

Gene-level quantification (Log2-transformed, UQ-normalized FPKM) was nearly identical between Ion Torrent reads aligned only with OSA4 and reads mapped with the two-step protocol. Reads aligned with the two-step protocol also maintained a strong correlation with Illumina data, similar to OSA4-mapped reads.
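A comparison of this kind can be reproduced outside Array Studio with a short Python sketch: log2-transform two vectors of gene-level FPKM values and compute their Pearson correlation. The function names are hypothetical, and the `offset` added before the log transform is an assumption, since the offset used for the plot above is not stated here. Upper-quartile normalization is assumed to have been applied already.

```python
import math


def log2_transform(values, offset=1.0):
    """log2(value + offset); the offset (an assumption) avoids log2(0)."""
    return [math.log2(v + offset) for v in values]


def pearson(x, y):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(x)
    if n != len(y) or n < 2:
        raise ValueError("x and y must have the same length (at least 2)")
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        raise ValueError("correlation is undefined for a constant vector")
    return sxy / math.sqrt(sxx * syy)
```

To compare two alignments, pass the per-gene FPKM vectors (in the same gene order) through `log2_transform` and then call `pearson` on the two transformed lists.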

Similarity of Genome Browser coverage profiles between OSA4 and recovered Long Reads

TwoStep GenomeBrowser APLP2.png

Aligning unmappable Ion Torrent reads using the "Long Reads" parameters allows recovery of sequence data, improving confidence in gene expression and isoform usage estimates.
