Different quality encoding

From Array Suite Wiki


When NGS data (e.g. FASTQ files) are imported into a project, Array Studio attempts to detect the quality encoding automatically, i.e. Illumina or Sanger.

If the lowest quality character in the data has an ASCII code below 59, the encoding is assumed to be Sanger; if the lowest ASCII code is 64 or above, it is assumed to be Illumina. If neither condition holds (the lowest code is between 59 and 63), the user needs to specify the encoding type manually.
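
The detection rule above can be sketched in a few lines of Python. This is only an illustration of the stated thresholds, not Array Studio's actual implementation; the function name `detect_quality_encoding` and the returned labels are assumptions made for this example.

```python
from typing import Iterable, Optional


def detect_quality_encoding(quality_lines: Iterable[str]) -> Optional[str]:
    """Guess the encoding from the lowest ASCII code among the quality characters.

    Hypothetical helper: returns "Sanger", "Illumina", or None when the
    encoding cannot be decided and has to be chosen manually.
    """
    lowest = min(
        (ord(ch) for line in quality_lines for ch in line.strip()),
        default=None,
    )
    if lowest is None:
        return None        # no quality characters to inspect
    if lowest < 59:
        return "Sanger"    # Phred+33
    if lowest >= 64:
        return "Illumina"  # Phred+64
    return None            # lowest code is 59..63: ambiguous


# '#' has ASCII code 35, which is below 59
print(detect_quality_encoding(["IIII#III", "5555AAAA"]))
# the lowest character here is 'B' (ASCII 66), which is 64 or above
print(detect_quality_encoding(["hhhhBBBB", "ffffdddd"]))
```

Only the quality lines of each FASTQ record should be passed in; the more reads are inspected, the more likely it is that a low-quality character reveals the true encoding.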

If the user chooses the format manually, it is important to make sure that the selected encoding matches the data.

Fastq Format.png

The Sanger format can encode a quality score from 0 to 93 using ASCII 33 to 126. From Illumina 1.3 up to, but not including, Illumina 1.8, the Illumina format encoded a quality score from 0 to 62 using ASCII 64 to 126. For example, if the quality character for a position is B (ASCII 66), its corresponding quality is 33 in Sanger format (66 - 33) and 2 in Illumina format (66 - 64).
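
As a small worked example of the two offsets, the following Python sketch decodes a quality string under either encoding. The names `SANGER_OFFSET`, `ILLUMINA_OFFSET` and `decode_quality` are made up for this illustration.

```python
SANGER_OFFSET = 33    # Phred+33: ASCII 33..126 -> quality 0..93
ILLUMINA_OFFSET = 64  # Phred+64: ASCII 64..126 -> quality 0..62


def decode_quality(quality_string, offset):
    """Convert each quality character to its numeric score for a given offset."""
    return [ord(ch) - offset for ch in quality_string]


print(decode_quality("B", SANGER_OFFSET))    # [33]
print(decode_quality("B", ILLUMINA_OFFSET))  # [2]
```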

  SSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSS
  ...............................IIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIII
  !"#$%&'()*+,-./0123456789:;<=>?@ABCDEFGHIJKLMNOPQRSTUVWXYZ[\]^_`abcdefghijklmnopqrstuvwxyz{|}~
  |                         |    |        |                              |                     |
 33                        59   64       73                            104                   126
Common ranges
  0........................26...31.......40.43                               
                                 0........9.............................40 

 S - Sanger (including Illumina 1.8+)  Phred+33,  raw reads typically (0, 41)
 I - Illumina 1.3+                     Phred+64,  raw reads typically (0, 40)


Quality is important information for NGS data analysis. In many downstream analyses, a quality cutoff (usually 13, corresponding to a base-call error probability of about 0.05) is set to filter out low-quality reads. In this case, if the reads are in Sanger format but treated as Illumina, most of them would be filtered out, since every quality score is understated by 31 (the difference between the two offsets, 64 - 33). See the chart above.
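
The effect of choosing the wrong encoding can be illustrated with a short Python sketch. It uses the standard Phred relation P = 10^(-Q/10) and, purely as an assumption for this example, filters a read by its mean quality; the actual filtering rule depends on the downstream tool. The function names are hypothetical.

```python
def error_probability(q):
    """Phred relation: probability that a base call with quality q is wrong."""
    return 10 ** (-q / 10)


def passes_cutoff(quality_string, offset, cutoff=13):
    """Assumed filter: keep a read when its mean quality reaches the cutoff."""
    scores = [ord(ch) - offset for ch in quality_string]
    if not scores:
        return False  # an empty quality string cannot pass
    return sum(scores) / len(scores) >= cutoff


print(round(error_probability(13), 4))  # 0.0501

read_quality = "IIIIIIII"               # 'I' is ASCII 73
print(passes_cutoff(read_quality, 33))  # True:  read as Sanger, every score is 40
print(passes_cutoff(read_quality, 64))  # False: read as Illumina, every score is 9
```

A high-quality Sanger read (score 40 at every position) is discarded as soon as it is decoded with the Illumina offset, because each score drops by 31.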

Fastq Format2.png