From Array Suite Wiki
Map DNA-Seq Reads to Genome (Illumina)
The Map DNA-Seq Reads to Genome (Illumina) module maps raw sequence reads to the genome and returns a number of summary statistics and an NGS dataset (used for downstream analysis such as mutation generation, paired fusion gene detection, and more).
Notes on duplicates
This function will mark duplicate reads in SAM entries.
The "Input format" options include FASTQ, FASTA, QSEQ, or AUTO (AUTO allows the use of any combination of the listed file types).
In the Basic section, the user has a number of options:
- The user can choose whether this is a paired-end sequencing analysis. If so, the files are paired automatically using numbering logic (e.g. _1, _2 or .1, .2).
- Detect indels enables indel detection in the data and opens up additional options in the Advanced tab of the module.
- The Genome model used for mapping must be selected. The list includes any genome model for which an index has already been built, and the user can build an index on the fly for a new Genome or Gene model. The OmicSoft genome models for Human do not have the recommended decoy sequences. Consider building your own model or using another tool.
- For quality encoding, the user can choose Automatic (recommended) or explicitly set the quality encoding as either Illumina or Sanger.
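The automatic pairing by numbering logic described above can be illustrated with a small sketch. This is not the module's actual implementation: the regular expression, the function name `pair_files`, and the handling of extensions are assumptions made for illustration only.

```python
import re
from collections import defaultdict

# Assumed pattern: a trailing _1/_2 or .1/.2 directly before the extension(s).
_MATE_RE = re.compile(r"^(?P<stem>.+)[_.](?P<mate>[12])(?P<ext>(\.[A-Za-z]+)*)$")


def pair_files(filenames):
    """Group file names into (mate 1, mate 2) pairs.

    Returns (pairs, unpaired). Hypothetical helper, not part of Array Suite.
    """
    groups = defaultdict(dict)
    unpaired = []
    for name in filenames:
        match = _MATE_RE.match(name)
        if match is None:
            unpaired.append(name)  # no _1/_2 or .1/.2 suffix found
            continue
        key = (match.group("stem"), match.group("ext"))
        groups[key][match.group("mate")] = name

    pairs = []
    for key in sorted(groups):
        mates = groups[key]
        if "1" in mates and "2" in mates:
            pairs.append((mates["1"], mates["2"]))
        else:
            unpaired.extend(mates.values())  # a mate without its partner
    return pairs, sorted(unpaired)


if __name__ == "__main__":
    files = ["s1_1.fastq", "s1_2.fastq", "s2.1.fq.gz", "s2.2.fq.gz", "plain.fastq"]
    print(pair_files(files))
```

Checking the pairing in this way before launching a long alignment can catch files whose names do not follow the expected numbering.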
Performance and Reporting
For the performance and reporting section, there are a number of important settings.
- Total penalty is the maximum combined cost of mismatches and indels allowed for a successful mapping. The penalty is defined as the maximal number of mismatches allowed plus the gap penalty if an indel is present in the alignment. The gap penalty is usually set to one or two (default is two). By default, OmicSoft automatically sets the maximal penalty for each read to Max(2, (read length - 31) / 15), based on the trimmed read length. Below is a table of the automatic penalty for reads of 17-106 nt.
- The user can override the automatic penalty setting and set it to any fixed number. Penalty values of 2, 3, and 4 are the most common for reads < 100 nt.
- Thread number is the total number of threads to be allocated to the process. The more threads that are allocated, the faster the algorithm will run. By default, this is set to the number of CPUs on the user’s computer. This should not be set to a greater number of CPUs than available, but can be reduced at the user’s discretion.
- Job number - Specifying the parallel "Job number" will spawn off new processes to run the alignments. If you have 24 samples, you could specify "Job number" = 12 to run 12 alignments at once.
- Search repetitive/low complexity regions (greedy mode) will search the highly repetitive and/or low complexity regions. In most cases, this mode is not recommended and adds significant time to the alignment.
- Non-unique mapping is for handling ties (reads mapped to multiple locations on the genome). You can report up to a specified number of ties, or choose to exclude them completely from the mapping and counting.
- Optionally, SAM files can be generated, and the user can choose not to import the data directly into the project.
- The user can specify the output folder for the results.
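As a worked illustration of the automatic total penalty described in the list above, the following sketch evaluates Max(2, (read length - 31) / 15) for a few trimmed read lengths. The source does not state how the quotient is rounded, so rounding down is an assumption here, as are the function names and the rule that an alignment is accepted when its cost does not exceed the total penalty.

```python
def automatic_penalty(trimmed_read_length: int) -> int:
    """Automatic total penalty: Max(2, (read length - 31) / 15).

    Assumption: the quotient is rounded down (the source does not specify).
    """
    return max(2, (trimmed_read_length - 31) // 15)


def alignment_cost(mismatches: int, has_indel: bool, gap_penalty: int = 2) -> int:
    """Mismatches plus the gap penalty if an indel is present (default gap penalty 2)."""
    return mismatches + (gap_penalty if has_indel else 0)


def is_accepted(mismatches: int, has_indel: bool, read_length: int,
                gap_penalty: int = 2) -> bool:
    """Assumed acceptance rule: cost must not exceed the automatic penalty."""
    return alignment_cost(mismatches, has_indel, gap_penalty) <= automatic_penalty(read_length)


if __name__ == "__main__":
    for length in (17, 50, 76, 100, 106):
        print(length, automatic_penalty(length))
```

Under these assumptions, a 76 nt read gets a total penalty of 3, so an alignment with one mismatch and one indel (cost 1 + 2) would still be accepted, while two mismatches plus an indel would not.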
In the Advanced tab, the user can set a number of options related to paired ends.
Indel penalty can be set, and is defined as the allowable open gap penalty.
Maximal middle insertion size, maximal middle deletion size, maximal end insertion size, maximal end deletion size, and minimal distal end size can be set in this section as well.
- For the read trimming section, the user can choose to trim reads by quality score. If a base has a quality score at or below the specified value, the read is trimmed at that point, although the algorithm will never trim a read to fewer than 17 bases.
- Advanced trimming allows the user to trim by various options:
- Trim first # nucleotides - Will remove the specified number of nucleotides from the beginning of the sequence.
- Trim last # nucleotides - Will remove the specified number of nucleotides from the end of the sequence.
- Trim by quality - See above (default is 2).
- Trim by final length - Will remove nucleotides from the end of the sequence to achieve the specified final length.
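The trimming options above can be sketched as simple operations on a read and its per-base quality scores. This is an illustrative sketch only: the function names are hypothetical, and it assumes that quality trimming cuts the read at the first base whose score is at or below the threshold, which the source does not spell out. The 17-base floor is taken from the description above.

```python
MIN_LENGTH = 17  # per the text, trimming never goes below 17 bases


def trim_by_quality(seq, quals, threshold=2, min_length=MIN_LENGTH):
    """Cut the read at the first base with quality <= threshold (assumed rule)."""
    cut = len(seq)
    for i, q in enumerate(quals):
        if q <= threshold:
            cut = i
            break
    # Never trim below min_length (or below the original length if shorter).
    cut = max(cut, min(min_length, len(seq)))
    return seq[:cut], quals[:cut]


def trim_first(seq, quals, n):
    """Remove n nucleotides from the beginning of the read."""
    return seq[n:], quals[n:]


def trim_last(seq, quals, n):
    """Remove n nucleotides from the end of the read."""
    end = max(len(seq) - n, 0)
    return seq[:end], quals[:end]


def trim_to_length(seq, quals, final_length):
    """Remove nucleotides from the end until the read has final_length bases."""
    return seq[:final_length], quals[:final_length]


if __name__ == "__main__":
    read = "ACGT" * 6                      # 24 bases
    scores = [30] * 20 + [2] * 4           # last four bases are low quality
    print(trim_by_quality(read, scores)[0])  # 20 bases remain
```

A read whose first low-quality base falls before position 17 is kept at 17 bases in this sketch, mirroring the minimum read size mentioned above.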
The Adapter Stripping section allows the user to strip adapters from the 3’ end of the read, by specifying the adapter sequence. Optionally, the user can choose to exclude any unmatched reads (without adapters) from further analysis and mapping.
Note: The user should understand the order of operations that takes place when doing the 3' end adapter stripping during an alignment:
1. Quality trimming/other trimming options
2. Strip adapters
If a read contains any sequence representing the barcode (Multiplex Identifier (MID)) at the end of the read, this sequence may interfere with the adapter stripping module. The 3' adapter stripping performs a localized alignment at the right end of the read, but it is unable to find internal adapters. As a result, an adapter followed by a MID will not be stripped.
You may remove the MID sequence using the MID Extraction + Adapter Stripping module, or trim the number of bases in the MID sequence from the end of the read using the Advanced Trimming options (i.e. "Trim last # nucleotides").
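The effect of a trailing MID on 3' adapter stripping can be shown with a toy example. The sketch below is a deliberate simplification: it uses exact matching anchored at the right end of the read rather than the localized alignment the module performs, and the function name, the `min_overlap` parameter, and the sequences are all made up for illustration.

```python
def strip_3prime_adapter(read, adapter, min_overlap=3):
    """Strip an adapter (or adapter prefix) that runs to the 3' end of the read.

    Simplified stand-in for the module's localized alignment: exact match only,
    and the match must extend to the last base of the read.
    Returns (stripped_read, adapter_found).
    """
    for start in range(len(read) - min_overlap + 1):
        tail = read[start:]
        if adapter.startswith(tail):
            return read[:start], True
    return read, False


if __name__ == "__main__":
    insert = "ACGTACGTAC"    # hypothetical biological sequence
    adapter = "TGGAATTCTC"   # hypothetical adapter
    mid = "CAGT"             # hypothetical 4-base MID

    # Adapter at the very end of the read: stripped.
    print(strip_3prime_adapter(insert + adapter[:6], adapter))

    # Adapter followed by a MID: the adapter is now internal, so nothing is stripped.
    read_with_mid = insert + adapter + mid
    print(strip_3prime_adapter(read_with_mid, adapter))

    # Trimming the last 4 bases first (step 1) lets adapter stripping (step 2) succeed.
    print(strip_3prime_adapter(read_with_mid[:-len(mid)], adapter))
```

Because trimming runs before adapter stripping, removing the MID with "Trim last # nucleotides" places the adapter back at the right end of the read where the stripping step can find it.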
- The user can choose to write unpaired/unmapped files into separate files (for later analysis) and generate an alignment summary report.
- Exclude unmapped reads in BAM file - By default, all unmapped reads are kept in the BAM file and sorted as well. Selecting this option will reduce the amount of time in the sorting step.
- Zip format - Select which format is used in compressing the files (default is "None").
- Output Name - The user can choose to name the output file.
In the "Preview" tab, the user can select the option to "Preview the reads (sampling + align + QC)" and set the "Sampling percentage" to be used.
The resulting files could include an alignment summary table and the NGS dataset (for downstream analysis like creation of mutation data, fusion data, etc.).
The various calculations for the Alignment report are as follows:
- Observation 1 - Name of first half of pair
- Observation 2 - Name of second half of pair
- MeanInsertSize - Average insert size calculated across paired reads
- Total read # - Total # of reads
- Uniquely paired read # - Reads that are both uniquely mapped and paired
- Non-uniquely paired read # - Reads that are not uniquely mapped but are paired
- Uniquely mapped read #1 - Reads that are uniquely mapped, but not paired, from file #1
- Uniquely mapped read #2 - Reads that are uniquely mapped, but not paired, from file #2
- Non-uniquely mapped read #1 - Reads that are non-uniquely mapped, but not paired, from file #1
- Non-uniquely mapped read #2 - Reads that are non-uniquely mapped, but not paired, from file #2
- Unmapped read #1 - Reads that are unmapped from file #1
- Unmapped read #2 - Reads that are unmapped from file #2
- Uniquely paired read % - Percentage of reads that are both uniquely mapped and paired
- Non-uniquely paired read % - Percentage of reads that are not uniquely mapped but are paired
- Uniquely mapped read 1 % - Percentage of reads that are uniquely mapped, but not paired, from file #1
- Uniquely mapped read 2 % - Percentage of reads that are uniquely mapped, but not paired, from file #2
- Non-uniquely mapped read 1 % - Percentage of reads that are non-uniquely mapped, but not paired, from file #1
- Non-uniquely mapped read 2 % - Percentage of reads that are non-uniquely mapped, but not paired, from file #2
- Unmapped read 1 % - Percentage of reads that are unmapped from file #1
- Unmapped read 2 % - Percentage of reads that are unmapped from file #2
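The relationship between the count columns and the percentage columns can be sketched as follows. The source does not state which denominator is used, so this example assumes every percentage is computed against "Total read #"; the function name and the sample counts are hypothetical.

```python
# (count column, percentage column) pairs as listed in the report description.
COLUMN_PAIRS = [
    ("Uniquely paired read #", "Uniquely paired read %"),
    ("Non-uniquely paired read #", "Non-uniquely paired read %"),
    ("Uniquely mapped read #1", "Uniquely mapped read 1 %"),
    ("Uniquely mapped read #2", "Uniquely mapped read 2 %"),
    ("Non-uniquely mapped read #1", "Non-uniquely mapped read 1 %"),
    ("Non-uniquely mapped read #2", "Non-uniquely mapped read 2 %"),
    ("Unmapped read #1", "Unmapped read 1 %"),
    ("Unmapped read #2", "Unmapped read 2 %"),
]


def add_percentages(counts):
    """Return a copy of counts with percentage columns added.

    Assumption: percentages are relative to "Total read #".
    """
    total = counts["Total read #"]
    if total <= 0:
        raise ValueError("Total read # must be positive")
    report = dict(counts)
    for count_name, pct_name in COLUMN_PAIRS:
        report[pct_name] = 100.0 * counts[count_name] / total
    return report


if __name__ == "__main__":
    example = {  # made-up counts for illustration
        "Total read #": 1000,
        "Uniquely paired read #": 800,
        "Non-uniquely paired read #": 100,
        "Uniquely mapped read #1": 20,
        "Uniquely mapped read #2": 20,
        "Non-uniquely mapped read #1": 10,
        "Non-uniquely mapped read #2": 10,
        "Unmapped read #1": 20,
        "Unmapped read #2": 20,
    }
    for name, value in add_percentages(example).items():
        print(f"{name}: {value}")
```

If the eight categories are mutually exclusive, as the descriptions suggest, the percentages computed this way sum to 100, which gives a quick sanity check on a report.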