From Array Suite Wiki
NGS Raw Data QC -- Sequence Duplication
This module counts the degree of duplication for every sequence in the set, and creates a plot showing the relative number of sequences with different degrees of duplication. In a diverse library, most sequences will occur only once in the final set. Moderate levels of duplication may indicate a very high level of coverage of the target sequence, but a high level of duplication is more likely to indicate some kind of enrichment bias (e.g. PCR over-amplification).
To access the module, please go to Analysis | NGS | Raw Data QC | Sequence Duplication.
Input Data Requirements
Accepted file formats include FASTQ, FASTA, QSEQ, SFF, SAM, BAM, and AUTO (AUTO allows any combination of the listed file types).
Add files to menu
- The Add button adds the selected samples.
- Add Folder adds all samples in the selected folder (local projects only).
- Search finds files based on sample registration (server projects only).
- Add List lets users add files from a list (a grouping file can also be added this way for alignment functions).
The Adapter Stripping window appears after selecting the "Customize" button.
- No Adapter Stripping: No attempt will be made to remove adapter sequences from reads.
- Strip 3' end adapters (end only): The 3' ends of reads will be compared to the adapter sequence for a match.
- Strip right adapter (middle or end): The read will be searched for the adapter sequence. If a match is found, the adapter is trimmed along with any sequence 3' to it.
- Strip multiple adapters: Multiple adapter sequences can be listed.
- Job number: The total number of jobs to run at the same time.
- Zip format: Select which format is used in compressing the files (default is "None").
- Maximal duplication level: Sets the depth of the duplication count, and therefore the maximum number of bins that are returned (default is "10").
- Output Name: The user can choose to name the output file.
- Output folder: The output folder is used to store output files.
- Include only mapped entries: Selecting this option will only incorporate sequences that have been mapped to the reference sequence. This option only works when .bam or .sam files are added and the File format dropdown menu is set to SAM or BAM.
- Preview mode (5% systematic sampling): By default, the module is run on all sequences. If this option is checked, the module will run on a 5% sampling of input reads for each file. This option should be used as a quick indicator of quality for especially large raw data files.
- The user can use either the default contamination list or a custom list. The list format is shown in the example below:
Example Contamination List entry:
Illumina Single End Adapter 1 ACACTCTTTCCCTACACGACGCTGTTCCATCT
Illumina Single End Adapter 2 CAAGCAGAAGACGGCATACGAGCTCTTCCGATCT
Illumina Single End PCR Primer 1 AATGATACGGCGACCACCGAGATCTACACTCTTTCCCTACACGACGCTCTTCCGATCT
Illumina Single End PCR Primer 2 CAAGCAGAAGACGGCATACGAGCTCTTCCGATCT
Illumina Single End Sequencing Primer ACACTCTTTCCCTACACGACGCTCTTCCGATCT
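For users preparing a custom list, the following Python sketch shows one way entries like those above could be read into a lookup table. It is not part of Array Suite. The function name `parse_contamination_list` is hypothetical, and the sketch assumes one entry per line, with the sequence as the last whitespace-separated token and everything before it as the name. The exact delimiter the module expects is not stated here.

```python
def parse_contamination_list(text):
    """Parse 'name <whitespace> sequence' lines into a {name: sequence} dict.

    Assumption: one entry per line, sequence is the last token on the line.
    """
    entries = {}
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue  # skip blank lines
        parts = line.rsplit(None, 1)  # split once, from the right, on whitespace
        if len(parts) != 2:
            raise ValueError(f"Expected a name and a sequence: {line!r}")
        name, sequence = parts
        entries[name] = sequence.upper()
    return entries


example = """\
Illumina Single End Adapter 2 CAAGCAGAAGACGGCATACGAGCTCTTCCGATCT
Illumina Single End Sequencing Primer ACACTCTTTCCCTACACGACGCTCTTCCGATCT
"""
contaminants = parse_contamination_list(example)
```

Splitting from the right keeps multi-word names such as "Illumina Single End Adapter 2" intact.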
For each file, a plot will show the percentage of duplicate sequences (relative to unique sequences) that fall into each duplication level. To limit the amount of information in the final plot, any sequence with more duplicates than the user-defined 'Maximal duplication level' is placed in the maximum bin. It is therefore not unusual to see a small rise in this final category.
If you see a big rise in this final category, then it means you have a large number of sequences with very high levels of duplication.
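The binning described above can be illustrated with a short Python sketch. This is not the module's implementation. The function name `duplication_levels` is hypothetical, and the sketch assumes percentages are taken over the number of distinct sequences, which may differ from the normalization the module uses for its plot.

```python
from collections import Counter


def duplication_levels(reads, max_level=10):
    """Return {level: percent of distinct sequences seen `level` times}.

    Sequences seen more than `max_level` times are counted in the final bin.
    Assumption: percentages are relative to the number of distinct sequences.
    """
    copies = Counter(reads)  # sequence -> number of occurrences
    bins = Counter()
    for count in copies.values():
        bins[min(count, max_level)] += 1  # cap high counts into the last bin
    total = len(copies)
    if total == 0:
        return {level: 0.0 for level in range(1, max_level + 1)}
    return {level: 100.0 * bins[level] / total
            for level in range(1, max_level + 1)}


reads = ["AAA", "CCC", "CCC", "GGG", "GGG", "GGG", "GGG", "TTT"]
levels = duplication_levels(reads, max_level=3)
# "GGG" occurs 4 times, so it lands in the final bin (level 3).
# levels == {1: 50.0, 2: 25.0, 3: 25.0}
```

Lowering `max_level` pushes more sequences into the final bin, which is why that bin can rise even in a diverse library.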
This function will also output a table that summarizes the duplication levels.
Finally, a table will be generated that lists the top over-represented sequences, and possible sources of contamination, such as library adapters.
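As a rough illustration of how such a table could be built, the Python sketch below ranks sequences by frequency and labels each with a possible source. The function name `overrepresented`, the `top_n` parameter, the "No hit" label, and the matching rule (exact substring in either direction) are all assumptions made for this example. The module's actual matching criteria are not described here.

```python
from collections import Counter


def overrepresented(reads, contaminants, top_n=5):
    """List the most frequent sequences as (sequence, count, percent, source)."""
    counts = Counter(reads)
    total = len(reads)
    rows = []
    for seq, count in counts.most_common(top_n):
        # Hypothetical rule: report the first contaminant that contains the
        # sequence or is contained in it; otherwise report "No hit".
        source = next(
            (name for name, contaminant in contaminants.items()
             if seq in contaminant or contaminant in seq),
            "No hit",
        )
        rows.append((seq, count, 100.0 * count / total, source))
    return rows


contaminants = {
    "Illumina Single End Sequencing Primer": "ACACTCTTTCCCTACACGACGCTCTTCCGATCT",
}
reads = ["ACACTCTTTCCCTACACGACGCTCTTCCGATCT"] * 3 + ["GATTACAGATTACA"]
table = overrepresented(reads, contaminants)
```

Here the primer sequence accounts for 75% of the reads and is labeled with its source, while the remaining sequence has no match in the list.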