Aligned Data QC for RNA-Seq
From Array Suite Wiki
There are two main modules for QC of aligned data. We suggest running both modules on your samples.
The first module generates a table of metrics, along with a visualization for each metric. The table gives an overview of alignment statistics, coverage statistics, and more in one place. The module also generates a ProfileView showing a chart for each metric.
The metrics are grouped into several types.
The alignment metrics give an overall idea of the quality of the alignment for your samples.
One of the most important metrics is the AlignmentUniquelyMappedRate (for single end) or AlignmentUniquelyPairedRate (for paired end). This indicates the percentage of reads that mapped uniquely (aligned to only one location in the genome). For paired end data, both mates must also align to the same region (within the expected insert size and standard deviation).
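As a rough illustration of what a uniquely mapped rate measures, the sketch below computes it from a per-read count of alignment locations. The function name, the input layout, and the choice of all reads as the denominator are assumptions made for this example, not a description of how the module computes the metric.

```python
def uniquely_mapped_rate(hit_counts):
    """Fraction of reads with exactly one alignment location.

    hit_counts holds one integer per read: the number of genomic
    locations it aligned to (0 means unmapped). Using all reads as the
    denominator is an assumption of this sketch.
    """
    if not hit_counts:
        raise ValueError("hit_counts must not be empty")
    unique = sum(1 for hits in hit_counts if hits == 1)
    return unique / len(hit_counts)


# Five reads: three unique, one multi-mapped, one unmapped.
print(uniquely_mapped_rate([1, 1, 3, 0, 1]))  # 0.6
```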
The NM rates provide metrics on the number of reads that aligned with a given number of mismatches (or indels). This gives you an idea of the quality of the alignment. Metrics are provided for up to 4 mismatches per read, and then for more than 4 mismatches. Metrics are also provided on whether a read maps across an intron (i.e., an exon junction), has an insertion, or has a deletion.
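The binning described above (0 through 4, then more than 4) can be sketched as follows. The input is assumed to be one NM value per aligned read (NM is the edit-distance tag in SAM records); the function name and the bin labels are hypothetical and chosen only for this example.

```python
from collections import Counter


def nm_rates(nm_values, max_bin=4):
    """Return the fraction of reads in each bin: "0".."4" and ">4"."""
    if not nm_values:
        raise ValueError("nm_values must not be empty")
    overflow = f">{max_bin}"
    counts = Counter(
        str(nm) if nm <= max_bin else overflow for nm in nm_values
    )
    labels = [str(i) for i in range(max_bin + 1)] + [overflow]
    total = len(nm_values)
    # Bins with no reads are reported as 0.0 rather than omitted.
    return {label: counts[label] / total for label in labels}


rates = nm_rates([0, 0, 1, 2, 0, 5, 7, 0])
for label, rate in rates.items():
    print(f"NM {label}: {rate:.1%}")
```

A sample whose reads concentrate in the "0" and "1" bins aligned cleanly; a heavy ">4" bin is worth a closer look.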
Flag metrics are generally only useful for paired end reads when the data has been aligned with OSA.
They use the SAM flags to report the number of reads (Read1/Read2), the number of failed reads, the number of reads marked as duplicates, and the number of alignments marked as secondary.
Note: OSA does not mark reads as duplicates for RNA-Seq. OSA does not mark reads as failed.
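For readers who want to see how these counts relate to the SAM FLAG field, here is a minimal sketch that tallies the relevant bits from a list of integer flags. The bit values come from the SAM specification; the function name and the output keys are hypothetical.

```python
from collections import Counter

# FLAG bits defined by the SAM specification.
READ1 = 0x40       # first segment in the template
READ2 = 0x80       # last segment in the template
SECONDARY = 0x100  # secondary alignment
QC_FAIL = 0x200    # failed platform/vendor quality checks
DUPLICATE = 0x400  # PCR or optical duplicate


def summarize_flags(flags):
    """Count how many records have each flag bit of interest set."""
    bits = {
        "read1": READ1,
        "read2": READ2,
        "secondary": SECONDARY,
        "failed": QC_FAIL,
        "duplicate": DUPLICATE,
    }
    counts = Counter()
    for flag in flags:
        for name, bit in bits.items():
            if flag & bit:
                counts[name] += 1
    return counts


# 99 and 147 are a typical proper pair; 355 = 99 + 0x100; 1123 = 99 + 0x400.
summary = summarize_flags([99, 147, 355, 1123])
print(dict(summary))
```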
Profile Metrics provide important overall statistics based on the provided gene model.
- The rate of reads mapped to an exon, exon junction, intron, anywhere in a known gene
- The rate of reads in a known gene with an insertion or deletion
- The rate of reads in an inter-gene region, inter-gene region with insertion/deletion
- Profile_InterGene_FPK (fragments per kilobase of intergenic sequence) measures the number of reads mapped to intergenic regions, normalized per kilobase of intergenic genome (based on the specified gene model). This metric can be used as a "noise" threshold for gene FPKM quantification.
- The rate of reads in a deep inter-gene region (>5kb outside the known gene model)
Use these metrics to determine the overall success of the profiling. For instance, if a large percentage of reads map to deep inter-gene regions, this could indicate genomic DNA contamination.
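Taking the name Profile_InterGene_FPK literally, the value is a fragment count divided by the intergenic length in kilobases. The sketch below shows only that arithmetic. Whether the module applies any further normalization (for example, by library size) is not stated here, so treat the function as an assumption-laden illustration rather than the module's formula.

```python
def intergenic_fpk(intergenic_fragments, intergenic_length_bp):
    """Fragments per kilobase of intergenic sequence (literal reading)."""
    if intergenic_length_bp <= 0:
        raise ValueError("intergenic_length_bp must be positive")
    return intergenic_fragments / (intergenic_length_bp / 1000)


# Hypothetical numbers: 2,000 fragments over 1 Mb of intergenic sequence.
print(intergenic_fpk(2000, 1_000_000))  # 2.0
```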
An ExpressionProfilingProficiency measurement is returned as well, giving you an overall efficiency rate: (reads mapped to exons + reads mapped to exon junctions) / total reads.
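The efficiency rate above is a simple ratio. As a small worked example with made-up counts (the function and parameter names are hypothetical):

```python
def expression_profiling_proficiency(exon_reads, junction_reads, total_reads):
    """(reads mapped to exons + exon junctions) / total reads."""
    if total_reads <= 0:
        raise ValueError("total_reads must be positive")
    return (exon_reads + junction_reads) / total_reads


# 700 exon reads and 150 junction reads out of 1,000 total reads.
print(expression_profiling_proficiency(700, 150, 1000))  # 0.85
```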
Source metrics are based on the provided gene model. They provide the most information with gene models like Ensembl that have detailed information on the source of each transcript.
These metrics can be used to get a sense of the overall types of transcripts that are being aligned. For instance, in the experiment shown below, most reads are mapped to protein-coding regions. For specific types of RNA-Seq experiments (e.g., miRNA-Seq), you'd expect most reads to align to a specific source. Compare metrics across files to check for patterns.
Insert Size Metrics
Insert size metrics provide some basic metrics on the insert sizes for paired end experiments. Insert Size mean, median, median absolute deviation, and the 5th and 95th percentile are provided.
Use these metrics to ensure that the paired end experiment is performing as expected, and to look for any outlier values.
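To make the listed statistics concrete, the following sketch computes the mean, median, median absolute deviation, and 5th/95th percentiles from a list of insert sizes using only the Python standard library. The percentile interpolation method is a choice made for this example and may differ from what the module uses.

```python
import statistics


def insert_size_summary(sizes):
    """Summarize insert sizes; needs at least two values."""
    median = statistics.median(sizes)
    # Median absolute deviation: median distance from the median.
    mad = statistics.median(abs(s - median) for s in sizes)
    # n=20 yields 19 cut points at 5% steps; first is 5th, last is 95th.
    cuts = statistics.quantiles(sizes, n=20, method="inclusive")
    return {
        "mean": statistics.mean(sizes),
        "median": median,
        "mad": mad,
        "p5": cuts[0],
        "p95": cuts[-1],
    }


# One outlier (1000) pulls the mean up but barely moves median and MAD.
summary = insert_size_summary([180, 190, 200, 210, 220, 1000])
for name, value in summary.items():
    print(f"{name}: {value}")
```

The contrast between mean and median in this toy input is the kind of signal to look for when checking for outlier values.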
The duplication metrics give you an idea of the total level of duplication for an experiment (after alignment). They are based on alignment coordinates (start position), unlike the raw data QC, which is based on read sequence.
An RNA-Seq experiment is expected to have a large amount of duplication, so do not be alarmed if these metrics show high values.
Interesting values include the rate of Unique Starting Positions for reads, as well as the rates for each level of duplication. Usually, the higher the coverage, the more duplication will be seen. This is expected with RNA-Seq experiments.
Omicsoft does not recommend "removing duplicates" from RNA-Seq experiments, and does not provide a tool to do so.
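Coordinate-based duplication can be illustrated by grouping reads on their start position. In the sketch below, each read is represented as a (chromosome, start, strand) tuple. The exact definitions of the unique-start rate and the per-level rates are assumptions made for this example.

```python
from collections import Counter


def duplication_summary(starts):
    """Summarize duplication from a list of read start positions."""
    if not starts:
        raise ValueError("starts must not be empty")
    per_position = Counter(starts)
    total = len(starts)
    # For each duplication level k, count reads at positions seen k times.
    reads_at_level = Counter()
    for k in per_position.values():
        reads_at_level[k] += k
    return {
        "unique_start_rate": len(per_position) / total,
        "level_rates": {
            level: reads / total
            for level, reads in sorted(reads_at_level.items())
        },
    }


starts = (
    [("chr1", 100, "+")] * 3
    + [("chr1", 200, "+")]
    + [("chr2", 50, "-")] * 2
)
print(duplication_summary(starts))
```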
For paired end experiments, this module reports the Fragment range, which is the difference between the left coordinate of the left read and the right coordinate of the right read. In a normal experiment, a large number of the fragment ranges are expected to be unique.
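A fragment range can be derived from the two mates' coordinates as in the sketch below. Each mate is given as a hypothetical (start, end) tuple on the same reference, and the length is taken as a plain difference, following the wording above. Whether the module treats coordinates as inclusive is not specified here.

```python
def fragment_range(mate_a, mate_b):
    """Return (left, right, length) for a read pair.

    left is the leftmost start, right is the rightmost end, and
    length is right - left.
    """
    left = min(mate_a[0], mate_b[0])
    right = max(mate_a[1], mate_b[1])
    return left, right, right - left


def unique_range_rate(pairs):
    """Fraction of pairs whose (left, right) span occurs only once."""
    spans = [fragment_range(a, b)[:2] for a, b in pairs]
    unique = sum(1 for span in spans if spans.count(span) == 1)
    return unique / len(spans)


pairs = [
    ((100, 150), (300, 350)),
    ((100, 150), (300, 350)),  # same span as the first pair
    ((120, 170), (400, 450)),
    ((500, 550), (640, 690)),
]
print(fragment_range(*pairs[0]))   # (100, 350, 250)
print(unique_range_rate(pairs))    # 0.5
```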
The coverage metrics give you an overall idea of the mean coverage of your experiment. For RNA-Seq, this looks at the (total length of aligned reads/total exon length of your gene model).
It also gives metrics on the number of genes with coverage and the rate of this coverage. Finally, it gives a metric on the number of genes with at least 1 RPKM of coverage, as well as 10 RPKM of coverage. Use these metrics to get an idea of the scope of your RNA-Seq experiment. In the example below, between 64 and 72% of genes were covered, but that rate drops off significantly at the higher RPKM levels. Even at 1 RPKM, the rate drops to as low as 27%.
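The coverage figures described above can be sketched as follows. The mean coverage follows the ratio given in the text, and RPKM uses the standard definition (reads per kilobase of gene length per million mapped reads). The function names, the input layout, and the rule that a gene counts as "covered" when it has at least one read are assumptions for illustration.

```python
def mean_coverage(total_aligned_bases, total_exon_length):
    """Total length of aligned reads / total exon length of the model."""
    return total_aligned_bases / total_exon_length


def rpkm(read_count, gene_length_bp, total_mapped_reads):
    """Reads per kilobase of gene per million mapped reads."""
    return read_count * 1e9 / (gene_length_bp * total_mapped_reads)


def gene_coverage_rates(genes, total_mapped_reads, thresholds=(1, 10)):
    """genes maps a gene name to (read_count, gene_length_bp)."""
    n = len(genes)
    covered = sum(1 for count, _ in genes.values() if count > 0)
    rates = {"covered": covered / n}
    for t in thresholds:
        hits = sum(
            1
            for count, length in genes.values()
            if rpkm(count, length, total_mapped_reads) >= t
        )
        rates[f"rpkm>={t}"] = hits / n
    return rates


# Hypothetical toy gene model with four genes.
genes = {
    "A": (100, 2000),  # RPKM 50
    "B": (5, 1000),    # RPKM 5
    "C": (0, 1500),    # no reads
    "D": (1, 5000),    # RPKM 0.2
}
print(gene_coverage_rates(genes, total_mapped_reads=1_000_000))
```

As in the real data, the covered rate in this toy input drops once an RPKM threshold is applied.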
The strand metrics give you the rates at which reads are aligned to the sense or anti-sense strand. For most Illumina RNA-Seq experiments (in which the reads are unstranded), it's expected that reads align in roughly equal proportion (50/50). For stranded protocols, however, this will not be the case. Additional information is provided for paired end experiments (for Illumina, the two mates of a pair are on opposite strands).
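A sense rate can be pictured as the fraction of reads whose alignment strand matches the strand of the gene they overlap. The sketch below assumes that each read has already been assigned to a gene and is supplied as a (read_strand, gene_strand) tuple. This input layout and the function name are hypothetical.

```python
def sense_rate(pairs):
    """Fraction of reads aligned to the same strand as their gene."""
    pairs = list(pairs)
    if not pairs:
        raise ValueError("pairs must not be empty")
    sense = sum(1 for read, gene in pairs if read == gene)
    return sense / len(pairs)


unstranded = [("+", "+"), ("-", "+"), ("+", "-"), ("-", "-")]
stranded = [("+", "+"), ("+", "+"), ("-", "-"), ("-", "+")]
print(sense_rate(unstranded))  # 0.5
print(sense_rate(stranded))    # 0.75
```

A rate near 0.5 is what an unstranded library would produce; a rate far from 0.5 in either direction suggests a stranded protocol.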
The first example shows a normal unstranded single-end Illumina RNA-Seq experiment and results.
The next example shows a paired end dataset with a normal Illumina protocol (unstranded).
The next example shows a single end dataset prepared with a stranded library (forward strand only).
The next example shows a paired end dataset prepared with a stranded library. As expected, most of the forward strand reads map to the sense strand, with most of the reverse strand reads mapping to the antisense strand.
Feature metrics measure the rate of CDS/Exon/Gene/Transcript coverage by RNA-Seq data.
Example Metrics Table
RNA-Seq 5'->3' Trend
This module generates a table and plots that can be used to identify any 5' to 3' bias in your RNA-Seq samples. Use the Q4/Q1 column to quickly check whether there is any bias in your samples. For potentially degraded RNA samples, this ratio can be high for longer transcripts.
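One plausible reading of the Q4/Q1 column is the ratio of mean coverage in the last quarter of a transcript (the 3' end) to mean coverage in the first quarter (the 5' end). The sketch below implements that reading. The quartile definition is an assumption made for this example and is not confirmed by the text above.

```python
def q4_over_q1(coverage):
    """Ratio of mean depth in the 3' quarter to the 5' quarter.

    coverage is per-base depth along one transcript, ordered 5' to 3'.
    """
    quarter = len(coverage) // 4
    if quarter == 0:
        raise ValueError("need at least four positions")
    q1 = sum(coverage[:quarter]) / quarter
    q4 = sum(coverage[-quarter:]) / quarter
    if q1 == 0:
        return float("inf")
    return q4 / q1


print(q4_over_q1([5, 5, 5, 5, 5, 5, 5, 5]))  # 1.0 (no bias)
print(q4_over_q1([1, 1, 2, 2, 3, 3, 4, 4]))  # 4.0 (3' bias)
```

Under this reading, a value near 1 indicates even coverage, while values well above 1 indicate 3' bias of the kind associated with degraded RNA.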
In the example below, we contrast two samples (a heart sample and kidney sample). The kidney sample shows some sign of degradation, with increasing transcript sizes, whereas the heart only shows some 3' bias at the highest transcript sizes.
In the following example, we used the Filter for the column TranscriptBin to look at the 5000+ bin size across the 16 samples. Notice that there is a range, with the Kidney showing the most degradation vs the Heart sample showing the least degradation.