Align Ion Torrent reads
From Array Suite Wiki
Mapping Ion Torrent NGS data with a "Two-step" alignment protocol
Next Generation Sequencing data generated on Thermo Fisher's Ion Torrent platforms are attractive because of the high throughput offered. However, Ion Torrent reads contain insertion/deletion errors, especially at homopolymers, at a much higher rate than Illumina reads.
To deal with this, Thermo Fisher has proposed an alternative "two-step" alignment approach that maximizes the number of usable reads from Ion Torrent data.
In short, raw data (after QC-based filtering) are processed as follows:
- Reads are aligned using a short read aligner such as OSA
- Unmapped reads are separated from mapped reads
- Unmapped reads are aligned using less-stringent "local alignment" parameters
- Reads from the first and second rounds are combined for downstream analysis
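The control flow of the steps above can be sketched in a few lines of Python. This is only an illustration of the two-pass logic, not of how OSA or Array Studio work internally: the `strict` and `lenient` functions are toy stand-ins invented for this example (exact substring matching, and matching after collapsing homopolymer runs), and `REFERENCE` and the read names are hypothetical.

```python
import re
from typing import Callable, Dict, List, Optional, Tuple

Read = Tuple[str, str]                    # (read name, sequence)
Aligner = Callable[[str], Optional[str]]  # returns a locus label, or None if unmapped


def two_step_align(reads: List[Read], strict: Aligner, lenient: Aligner) -> Dict[str, Tuple[str, int]]:
    """Return {read_name: (locus, round)} for reads mapped in round 1 or round 2."""
    merged: Dict[str, Tuple[str, int]] = {}
    unmapped: List[Read] = []

    # Round 1: stringent aligner; keep hits and set misses aside.
    for name, seq in reads:
        locus = strict(seq)
        if locus is None:
            unmapped.append((name, seq))
        else:
            merged[name] = (locus, 1)

    # Round 2: only the round-1 failures are passed to the lenient aligner.
    for name, seq in unmapped:
        locus = lenient(seq)
        if locus is not None:
            merged[name] = (locus, 2)

    return merged


# --- Toy stand-ins (hypothetical, for illustration only) ---
REFERENCE = "ACGTTTTGCAAGGCT"


def collapse(seq: str) -> str:
    """Collapse each homopolymer run to a single base, e.g. TTTT -> T."""
    return re.sub(r"(.)\1+", r"\1", seq)


def strict(seq: str) -> Optional[str]:
    pos = REFERENCE.find(seq)  # exact match only
    return None if pos < 0 else f"ref:{pos}"


def lenient(seq: str) -> Optional[str]:
    # Tolerates homopolymer-length errors by comparing collapsed sequences.
    return "ref:collapsed" if collapse(seq) in collapse(REFERENCE) else None


reads = [
    ("r1", "ACGTTTTGC"),  # exact match
    ("r2", "ACGTTTGC"),   # one T missing from the homopolymer run
    ("r3", "TTAACCGG"),   # unrelated sequence
]
result = two_step_align(reads, strict, lenient)
```

In this toy run, `r1` is placed in round 1, `r2` is recovered in round 2, and `r3` stays unmapped, which mirrors the intent of the protocol: the second pass only sees reads the first pass rejected.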
These steps can be performed entirely within Array Studio, and data at each stage can be analyzed for quality parameters to maximize confidence in your Ion Torrent data.
Test Data Set
For this analysis, two sets of RNA-seq data from the "ABRF NGS study" (GSE46876) were used:
Ion Torrent Proton runs ABRF-PRO-MSK-B-1, ABRF-PRO-MSK-B-2, ABRF-PRO-MSK-B-3 (SRX307047-SRX307049)
Illumina HiSeq 2500 runs ABRF-ILMN-RNA-B-1, ABRF-ILMN-RNA-B-2, ABRF-ILMN-RNA-B-3, ABRF-ILMN-RNA-B-4 (SRX307101-SRX307104)
In both cases, poly-A mRNA from "FirstChoice Human Brain Reference Total RNA" was processed through the appropriate library protocols.
Step 1: Quality Control of Raw Reads
The Array Studio Raw QC wizard runs multiple analysis modules to gauge overall quality and to identify potential problems that can be remedied with proper filtering.
Step 2: Align Ion Torrent reads with OSA
Click Add Data | Add RNA-seq Data | Map Reads to Genome (Illumina):
You can leave most mapping parameters at their defaults, but you may want to strip 'GGCCAAGGCG' from the 3' end of reads (under the Advanced tab) in case the 3' adapter sequence was not removed.
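To make the effect of 3' adapter stripping concrete, here is a minimal Python sketch. It is not the trimming algorithm used by OSA, whose behavior is not described here; it assumes exact matching only (no mismatches or indels), and the `min_overlap` threshold of 3 bases is an arbitrary choice for this example.

```python
ADAPTER = "GGCCAAGGCG"  # 3' adapter sequence quoted in the text


def strip_3prime_adapter(read: str, adapter: str = ADAPTER, min_overlap: int = 3) -> str:
    """Remove a full adapter occurrence, or a partial adapter at the read's 3' end."""
    # Case 1: the whole adapter occurs in the read; cut at its first occurrence.
    pos = read.find(adapter)
    if pos != -1:
        return read[:pos]
    # Case 2: the read ends partway through the adapter; try the longest overlap first.
    longest = min(len(adapter) - 1, len(read))
    for k in range(longest, min_overlap - 1, -1):
        if read.endswith(adapter[:k]):
            return read[:-k]
    # No adapter found: return the read unchanged.
    return read


full = strip_3prime_adapter("ACGTACGTGGCCAAGGCG")   # adapter fully present
partial = strip_3prime_adapter("ACGTACGTGGCCA")     # read ends inside the adapter
clean = strip_3prime_adapter("ACGTACGT")            # no adapter
```

When trimming real FASTQ records, the quality string has to be truncated to the same length as the trimmed sequence.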
Also, be sure to specify an output folder, because you will need to find the output .bam files in the next step! Otherwise, Array Studio will generate a folder to contain your alignment.
Optional step 2b: Aligned QC of first-round reads
If you choose, you can run RNA-seq Aligned QC metrics to gauge the alignment rate. These metrics may be helpful for comparison with the second-round alignments.
Step 3: Separate mapped and unmapped reads
Now, click NGS | Manipulation | Export to separate the first round .bam file into mapped reads (in a .sam file) and unmapped reads (as a fastq file):
In the Export NGS Data window, select Mapped+Unmapped, specify an output folder, and click Send to Queue.
Then, convert the mapped .sam reads back into .bam reads using ConvertNgsFiles:
Locate the mapped .sam reads (should end with .mapped.sam), select SAM as source and BAM as target, and specify an output folder. The files will be named *.mapped.bam.
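For readers who want to see what the mapped/unmapped split amounts to, the sketch below does the equivalent on plain-text SAM lines in Python. It is not the Array Studio Export implementation. It relies on the SAM convention that bit 0x4 of the FLAG field marks an unmapped read, and it assumes single-end reads and text SAM input (reading .bam directly would need a dedicated library). The example records are made up.

```python
from typing import Iterable, List, Tuple

UNMAPPED_FLAG = 0x4  # SAM FLAG bit meaning "segment unmapped"


def split_sam(lines: Iterable[str]) -> Tuple[List[str], List[str]]:
    """Split SAM text into (header + mapped SAM lines, unmapped reads as FASTQ records)."""
    mapped: List[str] = []
    unmapped_fastq: List[str] = []
    for line in lines:
        line = line.rstrip("\n")
        if not line:
            continue
        if line.startswith("@"):
            # Header lines stay with the mapped output so it remains valid SAM.
            mapped.append(line)
            continue
        fields = line.split("\t")
        name, flag, seq, qual = fields[0], int(fields[1]), fields[9], fields[10]
        if flag & UNMAPPED_FLAG:
            # Re-emit the unmapped read as a four-line FASTQ record.
            unmapped_fastq.append(f"@{name}\n{seq}\n+\n{qual}")
        else:
            mapped.append(line)
    return mapped, unmapped_fastq


# Hypothetical records for illustration.
example = [
    "@HD\tVN:1.6\tSO:unsorted",
    "read1\t0\tchr1\t100\t60\t8M\t*\t0\t0\tACGTACGT\tIIIIIIII",
    "read2\t4\t*\t0\t0\t*\t*\t0\t0\tGGGTTTCC\tIIIIIIII",
]
mapped, unmapped = split_sam(example)
```

Here `mapped` holds the header plus `read1`, and `unmapped` holds a single FASTQ record for `read2`, ready to be passed to a second alignment round.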
Step 4: Align unmapped reads using OSA Long Read settings
The reads that were left unmapped by OSA's standard parameters might have homopolymeric indels that prevented a good alignment. The "Long Read Aligner" uses much less stringent alignment rules to attempt short local alignments of sub-sequences within the read. In particular, this module allows multiple indels to be tolerated in a read.
To run the Long Read Aligner, click Add Data | Add NGS Data | Add RNA-Seq Data | Map Reads To Genome (Long Reads):
Users can change several parameters controlling the Seed Length, gap penalties, etc. Specify an output name (and output folder!) and send to queue:
Step 5: Merge First and Second round mapped reads
Now the mapped reads from the first round and second round can be merged.
To merge the .bam files, a Grouping file must be generated, which is a two-column file where the first column is the full ArrayServer path to the files, and the second column is the name of the group to which each file belongs.
In this grouping file, include the .bam files containing only the mapped first-round reads (i.e. those converted from .sam files), along with the full set of second-round .bam files.
To easily get the paths to the aligned .bam files, use the ArrayServer file browser to navigate to the folder containing the first-round (aligned only) or second-round reads (aligned from unmapped first round), select the appropriate .bam files, then right-click and select Copy File Paths:
Paste this into the first column of an Excel file, specify file groupings in the second column, save as tab-delimited text, and upload to ArrayServer.
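If you prefer to build the grouping file with a script instead of Excel, a minimal Python sketch follows. The paths and group names are hypothetical placeholders; substitute the paths copied from the ArrayServer file browser. The sketch assumes the file has no header row, which is not stated above, so check this against what Merge Files expects.

```python
import csv
import io
from typing import Iterable, TextIO, Tuple


def write_grouping_file(rows: Iterable[Tuple[str, str]], handle: TextIO) -> None:
    """Write (file path, group name) pairs as two tab-delimited columns."""
    writer = csv.writer(handle, delimiter="\t", lineterminator="\n")
    for path, group in rows:
        writer.writerow([path, group])


# Hypothetical paths: first-round mapped-only .bam plus second-round .bam per sample.
rows = [
    ("/Users/demo/round1/ABRF-PRO-MSK-B-1.mapped.bam", "ABRF-PRO-MSK-B-1"),
    ("/Users/demo/round2/ABRF-PRO-MSK-B-1.bam", "ABRF-PRO-MSK-B-1"),
    ("/Users/demo/round1/ABRF-PRO-MSK-B-2.mapped.bam", "ABRF-PRO-MSK-B-2"),
    ("/Users/demo/round2/ABRF-PRO-MSK-B-2.bam", "ABRF-PRO-MSK-B-2"),
]

buffer = io.StringIO()
write_grouping_file(rows, buffer)
grouping_text = buffer.getvalue()
```

Files that share a group name in the second column are the ones intended to be merged together, so each sample gets one first-round and one second-round entry. To write to disk instead, pass a file opened with `open("grouping.txt", "w", newline="")`.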
Then, click NGS | Tools | Merge Files:
Then select the files to merge and the grouping file:
Once the files have been merged, you can import the merged reads into the project with Add Genome-Mapped RNA-seq Reads:
Optional step 5b: Aligned QC on merged reads
Check the alignment of the merged files.
Step 6: Compare mapping quality of alignment steps
Once you have merged your files, you can delete the intermediate files, such as the round 1 and round 2 .bam files, .sam files, and the unmapped .fastq files.
But first, you may want to inspect the alignments in the OmicSoft Genome Browser, or quantitate gene expression with the different alignments, then use Microarray-Microarray integration to compare the datasets.
If the alignments from the second round alignment look comparable to the first round, you can confidently use the merged alignments for downstream analysis, but if the second round alignments look excessively "noisy", consider using only first-round reads, or change the Long Read alignment parameters.
An Ion Torrent test dataset from the ABRF-NGS study, which sequenced brain poly-A mRNA on both Illumina HiSeq 2500 and Ion Torrent Proton instruments, was run through three workflows: standard OSA4 RNA-seq alignment, "Long Read" alignment only, or the two-step protocol. Each result was compared with an Illumina dataset aligned with OSA4.
Ion Torrent reads tended to map more poorly than Illumina reads, but unique mapping rate was ~40% to ~55% by the two-step protocol.
Correlation of Gene Expression Quantification with Illumina
Gene-level quantifications (Log2-transformed, UQ-normalized FPKM) were nearly identical between Ion Torrent reads aligned only with OSA4 and reads mapped with the two-step protocol. Reads aligned with the two-step protocol also maintained a strong correlation with Illumina data, similar to OSA4-mapped reads.
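As a rough illustration of the kind of comparison described here, the Python sketch below upper-quartile normalizes two FPKM vectors, log2-transforms them, and computes their Pearson correlation. It is not the Array Studio implementation: the rescaling target of 10, the pseudo-count of 0.1, the use of nonzero values for the quartile, and the FPKM values themselves are all assumptions made for this example.

```python
import math
from typing import List


def upper_quartile(values: List[float]) -> float:
    """75th percentile of the nonzero values, with linear interpolation."""
    nonzero = sorted(v for v in values if v > 0)
    if not nonzero:
        raise ValueError("no nonzero values to normalize against")
    pos = 0.75 * (len(nonzero) - 1)
    lo, hi = math.floor(pos), math.ceil(pos)
    return nonzero[lo] + (nonzero[hi] - nonzero[lo]) * (pos - lo)


def log2_uq(fpkm: List[float], target: float = 10.0, pseudo: float = 0.1) -> List[float]:
    """Rescale so the upper quartile equals `target`, then take log2(value + pseudo)."""
    scale = target / upper_quartile(fpkm)
    return [math.log2(v * scale + pseudo) for v in fpkm]


def pearson(x: List[float], y: List[float]) -> float:
    """Pearson correlation coefficient of two equal-length vectors."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


# Hypothetical gene-level FPKM values; the second set is the first scaled by 2.
fpkm_a = [0.0, 1.0, 2.0, 3.0, 4.0, 5.0]
fpkm_b = [2.0 * v for v in fpkm_a]

r = pearson(log2_uq(fpkm_a), log2_uq(fpkm_b))
```

Because upper-quartile normalization removes a global scaling difference between the two samples, the correlation in this toy case is 1 up to floating-point error. Real comparisons between alignment protocols would use per-gene values exported from each quantification.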
Similarity of Genome Browser coverage profiles between OSA4 and recovered Long Reads
Aligning unmappable Ion Torrent reads using the "Long Reads" parameters allows recovery of sequence data, improving confidence in gene expression and isoform usage estimates.
- Latest Tutorials
- Align RNAseq reads to genome
- Align Long RNAseq reads to genome
- Omicsoft aligner wiki and publication