Raw Data QC of RNA-Seq Data
From Array Suite Wiki
Before running a time-consuming alignment, it is worth checking the quality of your data and looking for any other issues. For instance, if you are unsure whether adapters or barcodes still remain on your sequences, the Raw Data QC Wizard can reveal some of this information. The wizard runs nearly all of the Raw Data QC modules and scans each sequence only once, which is faster than scanning each file separately for every module. For this reason, using the wizard is recommended over running each module manually. The exception is the adapter searching module, which is not part of the QC Wizard.
The basic statistics table contains some important information about your samples, including total Sample #, Minimum and Maximum read length (these differ if pre-filtering has occurred), total Nucleotide #, and GC%. Use this table to confirm any expected values and to get a sense of the overall size of your experiment. In the screenshot below, note the wide range in the number of sequences. Download the table below to see a full example of the Basic Statistics table.
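As a rough illustration of what such a table summarizes, the sketch below computes the same kinds of values from a FASTQ stream. It is not the wizard's implementation: the function names (`read_fastq`, `basic_statistics`) and the output keys are hypothetical, and it assumes a simple FASTQ layout with exactly four lines per record.

```python
from io import StringIO


def read_fastq(handle):
    """Yield (sequence, quality) pairs, assuming 4 lines per FASTQ record."""
    while True:
        header = handle.readline()
        if not header:
            return
        seq = handle.readline().strip()
        handle.readline()  # '+' separator line
        qual = handle.readline().strip()
        yield seq, qual


def basic_statistics(handle):
    """Single pass over the file: read count, length range, bases, GC%."""
    n_reads = n_bases = gc = 0
    min_len = None
    max_len = 0
    for seq, _ in read_fastq(handle):
        n_reads += 1
        n_bases += len(seq)
        gc += sum(1 for base in seq.upper() if base in "GC")
        min_len = len(seq) if min_len is None else min(min_len, len(seq))
        max_len = max(max_len, len(seq))
    return {
        "sequences": n_reads,
        "min_length": min_len or 0,
        "max_length": max_len,
        "nucleotides": n_bases,
        "gc_percent": 100.0 * gc / n_bases if n_bases else 0.0,
    }


# Tiny in-memory example with two reads of different length.
example = StringIO("@r1\nACGT\n+\nIIII\n@r2\nGGCCAT\n+\nIIIIII\n")
print(basic_statistics(example))
```

A minimum read length that differs from the maximum is the kind of value to check against expectations, since it suggests the reads were trimmed or filtered before QC.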
Base distribution gives a plot for each sample (or for paired end reads, each file). Use this to see if there are any unexpected patterns in the data. In the screenshot below (an miRNA-Seq experiment), the sample still contains an adapter, leading to an unequal distribution across the length of the read. For this experiment, the user would want to make sure to input the adapter information when doing the alignment.
Note that an unequal distribution across the first few bases is expected for Illumina experiments. This is due to a bias in random hexamer priming (see this article for more details: Biases in Illumina transcriptome sequencing caused by random hexamer priming).
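The idea behind a base distribution plot can be sketched as a per-position tally of nucleotides. This is only an illustration under simple assumptions (reads held in memory as plain strings, a hypothetical function name `base_distribution`), not the module's actual code.

```python
from collections import Counter


def base_distribution(reads):
    """Return one dict per read position mapping base -> fraction of reads."""
    counts = []
    for seq in reads:
        for i, base in enumerate(seq.upper()):
            if i == len(counts):
                counts.append(Counter())  # first read to reach this position
            counts[i][base] += 1
    fractions = []
    for position_counts in counts:
        total = sum(position_counts.values())
        fractions.append({b: n / total for b, n in position_counts.items()})
    return fractions


# Toy reads: position 1 is always 'C', position 0 is mostly 'A'.
reads = ["ACGT", "ACGA", "TCGA", "ACG"]
for position, fractions in enumerate(base_distribution(reads)):
    print(position, fractions)
```

A position where a single base approaches a fraction of 1.0 deep into the read, as with a leftover adapter, stands out against the roughly even mix expected elsewhere.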
The Overall Quality Report summarizes the quality of all reads in each sample, along each base pair. The score is along the X-axis, and the number of nucleotides with that score is displayed on the Y-axis.
The Quality BoxPlot module produces two tables and charts:
1. PerSequence Quality
The per sequence quality plot shows, for each file, the distribution of quality scores across all reads. The quality score of each read is determined by averaging over all of its bases. Quality scores usually range from 0 to 40. In a good experiment, most reads should be above 10, but this varies.
The quality box plot shows, for each file, the quality distribution across each base. In most experiments, quality declines across reads. Omicsoft's aligner handles this automatically, trimming low quality bases from the right end of the reads.
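Both views can be sketched from the quality strings alone. The example below assumes Phred+33 encoding (an assumption; older Illumina files use a different offset) and uses hypothetical function names. It reports only the per-position median rather than a full box plot, and it does not reproduce the aligner's trimming.

```python
from statistics import median

PHRED_OFFSET = 33  # assumption: Sanger / Illumina 1.8+ quality encoding


def phred_scores(quality_string):
    """Convert an ASCII quality string into a list of integer scores."""
    return [ord(ch) - PHRED_OFFSET for ch in quality_string]


def per_sequence_quality(quality_strings):
    """Mean quality of each read, averaged over all of its bases."""
    return [sum(phred_scores(q)) / len(q) for q in quality_strings if q]


def per_base_median(quality_strings):
    """Median quality at each read position across all reads."""
    columns = []
    for q in quality_strings:
        for i, score in enumerate(phred_scores(q)):
            if i == len(columns):
                columns.append([])
            columns[i].append(score)
    return [median(column) for column in columns]


# Toy quality strings whose scores decline toward the right end.
quals = ["IIII", "II5+", "I5+#"]
print(per_sequence_quality(quals))
print(per_base_median(quals))
```

With these toy strings the per-position medians fall from left to right, which mirrors the decline across reads described above.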
The K-Mer QC module can be used to detect anomalies in your data. It returns both a report and a profile view of the top 5 most abundant KMer sequences.
In the example below, this sample had a clear irregularity. This was actually a 5bp barcode before the adapter on the 3' end, and was identified using this module.
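A K-mer report and profile can be approximated by counting every K-mer and then recording where the most abundant one starts within the reads. The sketch below is a minimal illustration with hypothetical function names and toy reads; the choice of `k=5` simply matches the 5bp barcode in the example.

```python
from collections import Counter


def top_kmers(reads, k=5, n=5):
    """Return the n most abundant k-mers as (kmer, count) pairs."""
    counts = Counter()
    for seq in reads:
        for i in range(len(seq) - k + 1):
            counts[seq[i:i + k]] += 1
    return counts.most_common(n)


def kmer_profile(reads, kmer):
    """Count how often `kmer` starts at each read position."""
    profile = Counter()
    for seq in reads:
        start = seq.find(kmer)
        while start != -1:
            profile[start] += 1
            start = seq.find(kmer, start + 1)
    return dict(profile)


# Toy reads sharing the same 5-mer at a fixed position near the 3' end.
reads = ["AAAACGTAC", "TTTTCGTAC", "GGGGCGTAC"]
best_kmer, best_count = top_kmers(reads, k=5, n=1)[0]
print(best_kmer, best_count, kmer_profile(reads, best_kmer))
```

A K-mer that is both abundant and concentrated at one position, rather than spread along the read, is the kind of irregularity that points to a barcode or adapter.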
The sequence duplication module looks for duplication of full sequences in your data. Use this module to scan for any contamination (e.g., amplification of adapters, primers, etc.) as well as to get an overall idea of the level of duplication in your experiment. Note that RNA-Seq experiments are expected to have a high level of duplication, but a large percentage of overrepresented sequences is not expected.
In the example below, a set of overrepresented sequences is shown. Any sequence that makes up more than 0.01% of the data is returned. Note that this is the full sequence, so partial sequence matches will not be shown here. If you have an amplified Illumina adapter or primer, it will be shown in this report.
Note the overrepresented sequences in the following example. This could indicate some level of contamination; however, keep in mind that the percentages are relatively low. Also keep in mind that these sequences likely won't be aligned properly and thus might result in lower alignment percentages than you were expecting.
In this miRNA-Seq example, there are sequences that are detected as overrepresented. However, this is not a source of worry, as a BLAT of the top sequence shows that it is an miRNA sequence that appears to be abundant in this sample.
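The overrepresentation check amounts to counting exact, full-length sequences and keeping those above a percentage threshold. The sketch below uses the 0.01% cutoff mentioned above as its default; the function name and the in-memory list of reads are assumptions for illustration, not the module's interface.

```python
from collections import Counter


def overrepresented(reads, threshold_percent=0.01):
    """Return (sequence, count, percent) for full sequences above the cutoff."""
    counts = Counter(reads)
    total = len(reads)
    hits = []
    for seq, n in counts.items():
        percent = 100.0 * n / total
        if percent > threshold_percent:
            hits.append((seq, n, percent))
    # Most abundant sequences first, as in a typical report.
    return sorted(hits, key=lambda hit: hit[1], reverse=True)


# Toy data with a deliberately high cutoff so only one sequence qualifies.
reads = ["ACGT", "ACGT", "ACGT", "TTTT"]
print(overrepresented(reads, threshold_percent=50))
```

Because only exact matches are counted, a sequence that differs by a single base is tallied separately, which is why partial matches do not appear in such a report.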
In the example below, there is little duplication for the RNA-Seq experiment. This is a good example, with DuplicationLevel=1 (i.e., unique) representing over 80% of the sequences. This is indicative of a good level of coverage for your experiment.
The following graph shows a sample with lower coverage, in which a large amount of duplication appears at the 10+ level. This is somewhat expected with RNA-Seq, but it is something to keep in mind when considering downstream analysis.
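A duplication-level summary of this kind can be sketched by grouping sequences by how many times each one occurs and pooling everything at 10 or more into a single bin. The normalization here, the percentage of reads falling in each level, is an assumption; the module may instead report percentages of distinct sequences. The function name is hypothetical.

```python
from collections import Counter


def duplication_levels(reads, cap=10):
    """Percent of reads at each duplication level; levels >= cap are pooled."""
    counts = Counter(reads)
    levels = Counter()
    for n in counts.values():
        # A sequence seen n times contributes n reads to level min(n, cap).
        levels[min(n, cap)] += n
    total = len(reads)
    return {level: 100.0 * c / total for level, c in sorted(levels.items())}


# Toy data: three unique reads, one pair, and one sequence seen 15 times.
reads = ["A", "B", "C", "D", "D"] + ["E"] * 15
print(duplication_levels(reads))
```

In this toy data most reads land in the pooled bin at level 10, the pattern described for the sample above, whereas a sample dominated by level 1 would resemble the earlier, low-duplication example.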
Searching for Adapters
Use the Search Adapters module in cases where you are not sure if there are still adapter sequences on your data, or what adapter sequences were used. This is common when analyzing public domain data where the information is not always known.
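One simple way to think about an adapter search is to take the first few bases of each candidate adapter and report the percentage of reads containing that seed. The sketch below illustrates only that idea and is not how the Search Adapters module is implemented. The adapter names and sequences are hypothetical placeholders rather than real Illumina adapters, and `seed_length` is an arbitrary choice.

```python
def adapter_hit_rates(reads, adapters, seed_length=8):
    """Percent of reads containing the first `seed_length` bases of each adapter."""
    rates = {}
    for name, adapter in adapters.items():
        seed = adapter[:seed_length]
        hits = sum(1 for seq in reads if seed in seq)
        rates[name] = 100.0 * hits / len(reads) if reads else 0.0
    return rates


# Hypothetical candidate adapters (placeholder sequences for illustration only).
candidates = {
    "toy_adapter_1": "TGGAATTCAAAA",
    "toy_adapter_2": "CCCCGGGGTTTT",
}
reads = [
    "ACGTACGTTGGAATTC",
    "TTTTGGAATTCAAGGA",
    "ACGTACGTACGTACGT",
    "GATTACATGGAATTCA",
]
print(adapter_hit_rates(reads, candidates))
```

A candidate whose seed appears in a large share of reads is a likely adapter, while one near zero can be ruled out. An exact seed match like this misses adapters with sequencing errors or adapters truncated at the read end, so it is a first-pass check only.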
In the following example, the software was able to detect the small RNA-Seq Illumina adapter: