From Array Suite Wiki

NGS Raw Data QC -- Sequence Duplication

Overview

This module counts the degree of duplication for every sequence in the set, and creates a plot showing the relative number of sequences with different degrees of duplication. In a diverse library, most sequences will occur only once in the final set. Moderate levels of duplication may indicate a very high level of coverage of the target sequence, but a high level of duplication is more likely to indicate some kind of enrichment bias (e.g. PCR over-amplification).


To access the module, please go to Analysis | NGS | Raw Data QC | Sequence Duplication.

SeqDup menu.png

Input Data Requirements

Accepted file formats are FASTQ, FASTA, QSEQ, SFF, SAM, BAM, and AUTO (AUTO accepts any combination of the listed file types).


General Options

SequenceDuplication.png

Add file

Add files to menu

  • Add: adds the samples selected by the user.
  • Add Folder: adds all samples in the selected folder (local projects only).
  • Search: finds files based on sample registration (server projects only).
  • Add List: adds files from a list (a grouping file for alignment functions can also be added this way).

Adapter Stripping

The Adapter Stripping window appears after selecting the "Customize" button.


NGS Filter 4.png


  • No Adapter Stripping: No attempt will be made to remove adapter sequences from reads.
  • Strip 3' end adapters (end only): The 3' ends of reads will be compared to the adapter sequence for a match.
  • Strip right adapter (middle or end): The read will be searched for the adapter sequence. If a match is found, the adapter is trimmed along with any sequence 3' of it.
  • Strip multiple adapters: Multiple adapter sequences can be listed.
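The difference between the stripping modes above can be illustrated with a minimal Python sketch. It is not the module's implementation: it assumes exact string matching (no mismatches and no partial adapter overlaps), and the function names are hypothetical.

```python
def strip_end_only(read, adapter):
    """End only: trim the adapter only if the read ends with it (assumed exact match)."""
    if adapter and read.endswith(adapter):
        return read[: len(read) - len(adapter)]
    return read


def strip_middle_or_end(read, adapter):
    """Middle or end: trim the first adapter match and everything 3' of it."""
    position = read.find(adapter) if adapter else -1
    return read if position == -1 else read[:position]


def strip_multiple(read, adapters):
    """Multiple adapters: apply the middle-or-end rule for each listed adapter in turn."""
    for adapter in adapters:
        read = strip_middle_or_end(read, adapter)
    return read


adapter = "AGATCGG"
end_hit = strip_end_only("ACGTAGATCGG", adapter)           # adapter sits at the 3' end
end_miss = strip_end_only("ACGTAGATCGGTT", adapter)        # adapter is internal, so no trim
middle_hit = strip_middle_or_end("ACGTAGATCGGTT", adapter) # adapter and trailing bases removed
```

With these toy reads, the end-only mode leaves a read untouched when extra bases follow the adapter, while the middle-or-end mode removes the adapter together with those trailing bases.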

Options

  • Job number: The total number of jobs to run at the same time.
  • Zip format: Select which format is used in compressing the files (default is "None").
  • Maximal duplication level: The highest duplication level that is counted separately. This sets the maximum number of bins that are returned (default is "10").
  • Output Name: The user can choose to name the output file.
  • Output folder: The output folder is used to store output files.
  • Include only mapped entries: Selecting this option will only incorporate sequences that have been mapped to the reference sequence.
    • This option only works when .bam or .sam files are added and the File format dropdown menu is set to SAM or BAM.
  • Preview mode (5% systematic sampling): By default, the module is run on all sequences. If this option is checked, the module will run on a 5% sample of the input reads for each file. Use this option as a quick indicator of quality for especially large raw data files.
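Two of the options above, "Include only mapped entries" and "Preview mode", can be sketched in Python as follows. This is an illustration under stated assumptions rather than the module's actual logic: it assumes that 5% systematic sampling means keeping every 20th record, and it relies on the SAM convention that flag bit 0x4 marks an unmapped read. The function names are hypothetical.

```python
from itertools import islice


def is_mapped(sam_line):
    """Return True if the SAM record's FLAG (second column) lacks the 0x4 'unmapped' bit."""
    flag = int(sam_line.split("\t")[1])
    return not (flag & 0x4)


def systematic_sample(records, step=20, start=0):
    """Keep every `step`-th record; step=20 gives a 5% sample (assumed meaning of the option)."""
    return islice(records, start, None, step)


sam_records = [
    "r1\t0\tchr1\t100\t60\t4M\t*\t0\t0\tACGT\tIIII",  # mapped, forward strand
    "r2\t4\t*\t0\t0\t*\t*\t0\t0\tTTGA\tIIII",         # unmapped (flag 4)
]
mapped_only = [line for line in sam_records if is_mapped(line)]
preview = list(systematic_sample(range(100)))  # indices of the records kept from 100 reads
```

Because the sample is taken at a fixed interval rather than from the start of the file, it draws reads from the whole run, which is useful when quality varies along the file.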

Contamination List

  • The user can use either the default contamination list or a custom list. The list format is shown in the screenshot below:

SeqDup2.png

Example Contamination List entry:

Illumina Single End Adapter 1				ACACTCTTTCCCTACACGACGCTGTTCCATCT
Illumina Single End Adapter 2				CAAGCAGAAGACGGCATACGAGCTCTTCCGATCT
Illumina Single End PCR Primer 1			AATGATACGGCGACCACCGAGATCTACACTCTTTCCCTACACGACGCTCTTCCGATCT
Illumina Single End PCR Primer 2			CAAGCAGAAGACGGCATACGAGCTCTTCCGATCT
Illumina Single End Sequencing Primer			ACACTCTTTCCCTACACGACGCTCTTCCGATCT
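As a rough illustration of how entries in this format could be read and compared against an over-represented sequence, here is a minimal Python sketch. It assumes each line holds a name followed by whitespace (tabs) and then the sequence, and that a "possible source" is any entry where one sequence is an exact substring of the other. Both the matching rule and the function names are assumptions, not the module's documented behavior.

```python
def parse_contamination_list(text):
    """Parse lines of 'name <tabs> sequence' into (name, sequence) pairs."""
    entries = []
    for line in text.splitlines():
        parts = line.strip().rsplit(None, 1)  # split off the last whitespace-separated field
        if len(parts) != 2:
            continue  # skip blank or malformed lines
        name, sequence = parts
        entries.append((name.strip(), sequence.upper()))
    return entries


def possible_sources(sequence, entries):
    """Names of entries that contain the sequence or are contained in it (assumed rule)."""
    return [
        name
        for name, contaminant in entries
        if sequence in contaminant or contaminant in sequence
    ]


example = (
    "Illumina Single End PCR Primer 2\t\t\tCAAGCAGAAGACGGCATACGAGCTCTTCCGATCT\n"
    "Illumina Single End Sequencing Primer\t\t\tACACTCTTTCCCTACACGACGCTCTTCCGATCT\n"
)
entries = parse_contamination_list(example)
hits = possible_sources("GACGGCATACGAGCTC", entries)
```

Splitting on the last run of whitespace keeps names that contain spaces intact, regardless of how many tabs separate the two columns.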

Output Results

For each file, a plot shows the percentage of sequences (relative to unique sequences) that fall into each duplication level. To limit the amount of information in the final plot, any sequence with more duplicates than the user-defined 'Maximal duplication level' is placed in the maximum bin. Therefore, it is not unusual to see a small rise in this final category.

If you see a big rise in this final category, then it means you have a large number of sequences with very high levels of duplication.
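The binning described above can be sketched in Python. This is a simplified illustration, not the module's code: it assumes exact sequence identity defines a duplicate, and that each bin is reported as a percentage of the number of sequences seen exactly once. The function names and the toy reads are hypothetical.

```python
from collections import Counter


def duplication_levels(reads, max_level=10):
    """Count distinct sequences per duplication level, capping at `max_level`."""
    copies_per_sequence = Counter(reads)
    bins = {level: 0 for level in range(1, max_level + 1)}
    for copies in copies_per_sequence.values():
        # Anything seen more often than max_level lands in the final bin.
        bins[min(copies, max_level)] += 1
    return bins


def relative_to_unique(bins):
    """Express each bin as a percentage of the level-1 (unique) bin (assumed scaling)."""
    unique = bins[1]
    if unique == 0:
        return {level: 0.0 for level in bins}
    return {level: 100.0 * count / unique for level, count in bins.items()}


# Toy input: two sequences seen once, one seen twice, one seen five times.
reads = ["ACGT", "TTGA"] + ["GGCC"] * 2 + ["CATG"] * 5
bins = duplication_levels(reads, max_level=3)
percentages = relative_to_unique(bins)
```

In this toy input the sequence seen five times exceeds the cap of 3, so it is counted in the final bin. This is why the last category can rise even when intermediate levels are sparsely populated.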

SeqDup 4.png

This function will also output a table that summarizes the duplication levels.

NGS Sequence Duplication Result Table

Finally, a table will be generated that lists the top over-represented sequences, and possible sources of contamination, such as library adapters.

NGS Overrepresented Sequences Result Table

OmicScript

NgsSequenceDuplication

