Raw Data QC of RNA-Seq Data




Before running a time-consuming alignment, it is worth checking the quality of your data and looking for other issues. For instance, if you are unsure whether adapters or barcodes remain on your sequences, the Raw Data QC Wizard can help determine this. The wizard runs nearly all of the Raw Data QC modules and scans each sequence only once, which is faster than scanning each file separately for each module. Using the wizard is therefore recommended over running each module manually. The exception is the adapter searching module, which is not part of the QC Wizard.


Basic Statistics

The basic statistics table contains some important information about your samples, including total Sample #, Minimum and Maximum read length (if pre-filtering has occurred), total Nucleotide #, and GC%. Use this table to confirm any expected values and to get an idea of the overall size of your experiment. In the screenshot below, note the wide range in the number of sequences. Download the table below to see a full example of the Basic Statistics table.

BasicStatistics.png

File:BasicStatisticsTable.xls
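
To make the columns of the table concrete, the following is a minimal Python sketch of how such statistics could be computed from a list of read sequences. It is an illustration only, not the Raw Data QC Wizard's implementation; the names `BasicStats` and `basic_stats` are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class BasicStats:
    sequence_count: int
    min_length: int
    max_length: int
    nucleotide_count: int
    gc_percent: float


def basic_stats(reads):
    """Summarize an iterable of read sequences (plain strings such as 'ACGT')."""
    count = total = gc = 0
    min_len = max_len = None
    for seq in reads:
        n = len(seq)
        s = seq.upper()
        count += 1
        total += n
        gc += s.count("G") + s.count("C")
        min_len = n if min_len is None else min(min_len, n)
        max_len = n if max_len is None else max(max_len, n)
    # GC% is taken over all nucleotides; an empty input yields 0.0.
    gc_percent = 100.0 * gc / total if total else 0.0
    return BasicStats(count, min_len or 0, max_len or 0, total, gc_percent)


if __name__ == "__main__":
    print(basic_stats(["ACGT", "GGCC", "AT"]))
```

A minimum length that differs from the maximum length is the kind of value to check against expectations, for example when reads have been pre-filtered or trimmed.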

Base Distribution

Base distribution gives a plot for each sample (or, for paired-end reads, each file). Use this to see if there are any unexpected patterns in the data. In the screenshot below (an miRNA-Seq experiment), the sample still contains an adapter, leading to an unequal distribution across the length of the read. For this experiment, the user should supply the adapter information when running the alignment.

Note that an unequal distribution over the first few bases is expected for Illumina experiments. This is due to a bias in random hexamer priming (see the article "Biases in Illumina transcriptome sequencing caused by random hexamer priming" for more details).

BaseDistribution.pngBaseDistributionLegend.png
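
As a rough illustration of what the plot shows, the sketch below computes the fraction of each base at every read position. It is not the module's own code; `base_distribution` is a hypothetical helper, and reads are assumed to be plain strings already loaded into memory.

```python
from collections import Counter


def base_distribution(reads):
    """Return one dict per read position, mapping base -> fraction of reads."""
    counts = []  # counts[i] tallies the bases observed at position i
    for seq in reads:
        for i, base in enumerate(seq.upper()):
            if i == len(counts):
                counts.append(Counter())
            counts[i][base] += 1
    return [
        {base: n / sum(c.values()) for base, n in c.items()}
        for c in counts
    ]


if __name__ == "__main__":
    for pos, fractions in enumerate(base_distribution(["ACGT", "ACGA", "AC"])):
        print(pos, fractions)
```

A position where one base dominates across most reads, far from the start of the read, is the kind of pattern a remaining adapter would produce.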

Overall Quality

The Overall Quality Report summarizes the quality of all reads in each sample, across all base pairs. The quality score is shown on the X-axis, and the number of nucleotides with that score is shown on the Y-axis.

RawQCOverallQuality.png

Quality BoxPlot

The Quality BoxPlot module produces two tables and charts:

1. PerSequence Quality

The per sequence quality plot shows, for each file, the distribution of quality scores of all reads. The quality score of each read is determined by averaging over all of its bases. Quality scores usually range from 0 to 40. In a good experiment, most reads should score above 10, but this varies.

PerSequenceQuality.png

2. QualityBoxPlot

The quality box plot shows, for each file, the quality distribution at each base position. In most experiments, quality declines along the read. Omicsoft's aligner handles this automatically, trimming low-quality bases from the right end of the reads.

QualityBoxPlot.png
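
The two views above can be sketched in Python as follows. This is an illustration under stated assumptions, not the module's implementation: quality strings are assumed to be Phred+33 encoded (the `offset` parameter), all function names are hypothetical, and the per-position summary reports only the minimum, median, and maximum rather than the full set of box plot statistics.

```python
import statistics


def decode_phred(quality_string, offset=33):
    """Convert an ASCII quality string to integer scores (assumes Phred+33)."""
    return [ord(ch) - offset for ch in quality_string]


def per_sequence_quality(quality_strings, offset=33):
    """Mean quality of each non-empty read, averaged over all of its bases."""
    means = []
    for q in quality_strings:
        if q:
            scores = decode_phred(q, offset)
            means.append(sum(scores) / len(scores))
    return means


def per_position_summary(quality_strings, offset=33):
    """(min, median, max) of the scores observed at each read position."""
    columns = []  # columns[i] collects the scores seen at position i
    for q in quality_strings:
        for i, score in enumerate(decode_phred(q, offset)):
            if i == len(columns):
                columns.append([])
            columns[i].append(score)
    return [(min(c), statistics.median(c), max(c)) for c in columns]


if __name__ == "__main__":
    quals = ["II5", "I5!", "5"]
    print(per_sequence_quality(quals))
    print(per_position_summary(quals))
```

In the toy input the median drops from position to position, which mirrors the decline in quality toward the right end of the reads described above.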

K-Mer Analysis

The K-Mer QC module can be used to detect anomalies in your data. It returns both a report and a profile view of the top 5 most abundant K-mer sequences.

In the example below, the sample had a clear irregularity. It turned out to be a 5 bp barcode located before the adapter on the 3' end, and it was identified using this module.

Kmer.png
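
The counting step behind a report like this can be sketched in a few lines of Python. This is only an illustration: the function name `top_kmers` is hypothetical, the value of k used by the module is not stated here (the default of 5 below is an assumption), and the positional profile view is not reproduced.

```python
from collections import Counter


def top_kmers(reads, k=5, n=5):
    """Return the n most abundant k-mers across all reads as (kmer, count)."""
    counts = Counter()
    for seq in reads:
        s = seq.upper()
        # Slide a window of width k along the read; short reads add nothing.
        for i in range(len(s) - k + 1):
            counts[s[i:i + k]] += 1
    return counts.most_common(n)


if __name__ == "__main__":
    print(top_kmers(["AAAAAA", "AAAAAC"], k=5, n=5))
```

A k-mer that is far more abundant than the rest, as with the barcode in the example above, is worth checking against known barcode and adapter sequences.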

Sequence Duplication

The sequence duplication module looks for duplication of full sequences in your data. Use this module to scan for contamination (e.g., amplification of adapters or primers) and to get an overall idea of the level of duplication in your experiment. Note that RNA-Seq experiments are expected to show a high level of duplication, but a large percentage of overrepresented sequences is not expected.

Overrepresented Sequences

In the example below, a set of overrepresented sequences is shown. Any sequence that makes up more than 0.01% of the data is returned. Note that the full sequence is compared, so partial sequence matches will not be shown here. An amplified Illumina adapter or primer, if present, will appear in this report.

Note the overrepresented sequences in the following example. This could indicate some level of contamination; however, keep in mind that the percentages are relatively low. Also keep in mind that these sequences likely won't be aligned properly and thus might result in lower alignment percentages than you were expecting.

OverrepresentedSequences.png

In this miRNA-Seq example, there are sequences that are detected as overrepresented. However, this is not a source of worry, as a BLAT of the top sequence shows that it is an miRNA sequence that appears to be abundant in this sample.

OverrepresentedmiRNA.png
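
The reporting rule described above (full sequences that exceed a percentage threshold) can be sketched in Python as follows. It is an illustration only, not the module's code; `overrepresented` and `threshold_percent` are hypothetical names, and the default threshold mirrors the 0.01% stated above.

```python
from collections import Counter


def overrepresented(reads, threshold_percent=0.01):
    """Return (sequence, count, percent) for full sequences above the threshold."""
    counts = Counter(reads)  # exact, full-length matches only
    total = sum(counts.values())
    return [
        (seq, n, 100.0 * n / total)
        for seq, n in counts.most_common()
        if 100.0 * n / total > threshold_percent
    ]


if __name__ == "__main__":
    reads = ["ACGT"] * 3 + ["TTTT"]
    print(overrepresented(reads, threshold_percent=50))
```

Because whole reads are compared, a read that contains an adapter plus a few varying bases is counted as a different sequence, which is why partial matches do not appear in a report of this kind.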

Duplication Graph

In the example below, there is little duplication for the RNA-Seq experiment. This is a good example, with DuplicationLevel=1 (i.e., unique) representing over 80% of the sequences. This is indicative of a good level of coverage for your experiment.

DuplicationLevelNormal.png

The following graph shows a sample with lower coverage, and thus a large amount of duplication at the 10+ level. This is somewhat expected with RNA-Seq, but it is something to keep in mind when considering downstream analysis.

DuplicationLevelLowerCoverage.png
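
One way to tabulate duplication levels like those in the graphs above is sketched below in Python. It rests on assumptions that may differ from the module: percentages are taken over distinct sequences rather than over reads, all levels of 10 or more are pooled into a "10+" bin, and `duplication_levels` is a hypothetical name.

```python
from collections import Counter


def duplication_levels(reads, cap=10):
    """Percent of distinct sequences at each duplication level, pooled at cap."""
    copies = Counter(reads)  # sequence -> number of identical reads
    levels = Counter(min(n, cap) for n in copies.values())
    distinct = len(copies)
    return {
        (f"{cap}+" if level == cap else str(level)): 100.0 * levels[level] / distinct
        for level in sorted(levels)
    }


if __name__ == "__main__":
    reads = ["A"] * 12 + ["C", "G", "T"]
    print(duplication_levels(reads))
```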

Searching for Adapters

Use the Search Adapters module when you are not sure whether adapter sequences remain in your data, or which adapter sequences were used. This is common when analyzing public domain data, where this information is not always available.

In the following example, the software was able to detect the small RNA-Seq Illumina adapter:

AdapterSearching.png
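
The general idea of checking reads against a list of candidate adapters can be sketched as below. This is not the Search Adapters module's algorithm, which is not described here; it is a simple exact-match illustration. The function name, the `min_match` parameter, and the candidate sequence in the usage example are all hypothetical placeholders.

```python
def adapter_hit_rates(reads, candidates, min_match=8):
    """Percent of reads containing the first min_match bases of each candidate.

    candidates maps a label to an adapter sequence supplied by the caller.
    """
    upper_reads = [r.upper() for r in reads]
    rates = {}
    for name, adapter in candidates.items():
        seed = adapter.upper()[:min_match]  # exact substring match on the seed
        hits = sum(seed in r for r in upper_reads)
        rates[name] = 100.0 * hits / len(upper_reads) if upper_reads else 0.0
    return rates


if __name__ == "__main__":
    reads = ["ACGTACGTTTGGAATTCC", "GGGGGGGGGGGG", "TTTGGAATTCAAAA", "CCCC"]
    # Placeholder sequence for illustration; substitute the adapters you suspect.
    candidates = {"hypothetical_adapter": "TGGAATTCTCGG"}
    print(adapter_hit_rates(reads, candidates))
```

A candidate found in a large share of reads is a likely adapter; exact matching will miss adapters with sequencing errors, so a low hit rate does not rule one out.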