From Array Suite Wiki
The Map DNA-SEQ Reads to Genome (Illumina) module maps raw sequence reads to the genome and returns a number of summary statistics and an NGS dataset, which is used for downstream analysis such as mutation generation, paired fusion gene detection, and more.
Accepted file formats include FASTQ, FASTA, QSEQ, or AUTO (AUTO allows the use of any combination of the listed file types).
To open this module, go to Add Data | Add NGS Data | Add DNA-Seq Data | Map Reads to Genome (Illumina).
In the Basic section, the user has a number of options:
- Reads are paired - The user can choose whether this is a paired-end sequencing analysis. If so, the reads are automatically paired using a numbering system (_1, _2 or .1, .2; see here for details on paired-end naming conventions).
- Index mode - The user can build the library index using 15-mer, 14-mer, or 12-mer. By default, a 32-bit machine uses 12-mer and a 64-bit machine uses 14-mer. A 15-mer index requires 9 GB of memory to build and run. The advantage of 15-mer is that it is faster than (or comparable to) 14-mer for all alignments, and much faster for gapped alignment and novel exon junctions (essentially deletions).
- Genome - The genome model used for mapping must be selected. The list includes any genome model for which an index has already been built, and the user can build an index on the fly for a new Genome or Gene model.
- Quality encoding - the user can choose Automatic (recommended) or explicitly set the quality encoding as either Illumina or Sanger.
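As an illustration of the paired-read numbering convention described under "Reads are paired", the following minimal Python sketch groups file names by a _1/_2 or .1/.2 suffix. The function name `pair_reads`, the regular expression, and the sample file names are assumptions made for this example; they are not the module's actual implementation, and the real naming rules may be broader.

```python
import re
from collections import defaultdict

# Assumed pattern: "_1"/"_2" or ".1"/".2" immediately before the file extension(s).
_MATE_RE = re.compile(r"^(?P<stem>.+)[_.](?P<mate>[12])(?P<ext>(\.[A-Za-z]+)+)$")


def pair_reads(filenames):
    """Return (pairs, unpaired) for a list of read file names (hypothetical helper)."""
    groups = defaultdict(dict)
    unpaired = []
    for name in filenames:
        match = _MATE_RE.match(name)
        if match is None:
            # No recognizable mate suffix: treat as a single-end file.
            unpaired.append(name)
            continue
        key = (match.group("stem"), match.group("ext"))
        groups[key][match.group("mate")] = name

    pairs = []
    for key in sorted(groups):
        mates = groups[key]
        if "1" in mates and "2" in mates:
            pairs.append((mates["1"], mates["2"]))
        else:
            # Only one mate was found, so the file cannot be paired.
            unpaired.extend(mates.values())
    return pairs, sorted(unpaired)


if __name__ == "__main__":
    files = ["s1_1.fastq", "s1_2.fastq", "s2.1.fastq.gz", "s2.2.fastq.gz", "other.fastq"]
    pairs, unpaired = pair_reads(files)
    print(pairs)
    print(unpaired)
```

Files whose mate is missing are reported as unpaired rather than silently dropped, which makes naming mistakes easier to spot before starting an alignment.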
Performance and Reporting
For the performance and reporting section, there are a number of important settings:
- Total penalty is the total number of indels or mismatches allowed for a successful mapping. The penalty is defined as the maximal number of mismatches allowed, plus the gap penalty if an indel is present in the alignment. The gap penalty is usually set to one or two (default is two). By default, Omicsoft automatically sets the maximal penalty for each read to Max(2, (read length - 31) / 15), based on trimmed read length. Below is a table of the automatic penalty for reads of 17-106 nt.

Read Length | Penalty
17-76 | 2
77-91 | 3
92-106 | 4

- The user can override the automatic penalty setting and set the penalty to any fixed number. Penalty values of 2, 3, and 4 are most common for reads < 100 nt.
- Thread number - The total number of threads allocated to the process. The more threads that are allocated, the faster the algorithm will run. By default, this is set to the number of CPUs on the user's computer. It should not exceed the number of available CPUs, but can be reduced at the user's discretion.
- Job number
- Detect indels - will allow you to detect indels in the data, and opens up additional options in the advanced tab of the module.
- Search repetitive/low complexity regions (greedy mode) - will search the highly repetitive and/or low complexity regions. In most cases, this mode is not recommended and adds significant time to the alignment.
- Non-unique mapping - Handles ties (reads mapped to multiple locations on the genome). The user can report up to a specified number of ties, or choose to exclude them completely from the mapping and counting.
- Generate SAM (mapped) + FASTQ (unmapped) files - Optionally, SAM files can be generated, and the user can choose not to import the data directly into the project.
- Output folder - The user can specify the output folder for the results.
Indel penalty can be set, and is defined as the allowable open gap penalty. Maximal middle insertion size, Maximal middle deletion size, Maximum end insertion size, Maximal end deletion size, and Minimal distal end size can be set in this section as well.
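To make the total penalty rule concrete, here is a minimal Python sketch that looks up the automatic penalty from the read-length table in this section and checks whether an alignment stays within it. The function names are hypothetical. The lookup uses the table values rather than the formula, because this page does not state how the division is rounded, and the scoring (mismatches plus one gap penalty when an indel is present) is a simplified reading of the description above.

```python
# (min_length, max_length, penalty) rows copied from the read-length table.
PENALTY_TABLE = [(17, 76, 2), (77, 91, 3), (92, 106, 4)]


def automatic_penalty(trimmed_length):
    """Look up the automatic total penalty for a trimmed read length."""
    for low, high, penalty in PENALTY_TABLE:
        if low <= trimmed_length <= high:
            return penalty
    raise ValueError("read length outside the 17-106 nt range covered by the table")


def alignment_penalty(mismatches, has_indel, gap_penalty=2):
    """Mismatch count plus the gap penalty if the alignment contains an indel."""
    return mismatches + (gap_penalty if has_indel else 0)


def is_within_penalty(trimmed_length, mismatches, has_indel, gap_penalty=2):
    """True if the alignment does not exceed the automatic total penalty."""
    total_allowed = automatic_penalty(trimmed_length)
    return alignment_penalty(mismatches, has_indel, gap_penalty) <= total_allowed


if __name__ == "__main__":
    # A 50 nt read allows a total penalty of 2, so one mismatch plus an indel
    # (gap penalty 2) exceeds it, while an 80 nt read (penalty 3) accepts it.
    print(is_within_penalty(50, 1, True))
    print(is_within_penalty(80, 1, True))
```

Under these assumptions, a read of 76 nt or shorter with the default gap penalty of two can carry an indel only if it has no mismatches.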
In the Advanced tab, the user can set a number of options related to paired ends, such as
- Expected insert size of the paired end reads
- Standard deviation of the insert size
- Strand mode for the pairs (different strand for Illumina data, or same strand for SOLiD data).
- Trim by quality score - The user can choose to trim the reads using a quality score of a specified amount or below. If a base has a quality score at or below the specified amount, the read is trimmed at that point, although the algorithm will not trim a read to fewer than 17 base pairs.
- Advanced trimming allows the user to trim by various options:
- Trim first # nucleotides - Will remove the specified number of nucleotides from the beginning of the sequence.
- Trim last # nucleotides - Will remove the specified number of nucleotides from the end of the sequence.
- Trim by quality - See above (default is 2).
- Trim by final length - Will remove nucleotides from the end of the sequence to achieve the specified final length.
- The Adapter Stripping window appears after selecting the "Customize" button.
- The adapter stripping section lets the user choose no adapter stripping, stripping adapters from the 3' end of the read, or stripping right adapters (at the middle or end of the reads) by specifying the adapter sequence.
- Exclude unmatched reads - The user can choose to exclude unmatched reads (without adapters) from further analysis and mapping.
- Trim reads first - The user should understand the order of operations that takes place when doing the 3' end adapter stripping during an alignment. See AdapterStripping 3'End for more details.
- Exclude unmapped reads in BAM file - By default, all unmapped reads are kept in the BAM file and sorted as well. Selecting this option will reduce the amount of time in the sorting step.
- Map reads to forward strand - In the case of strand-specific library preparation protocols, users can choose to align reads only to one of the strands.
- Map reads to reverse strand
- Zip format - Select which format is used in compressing the files (default is "None").
- Output Name - The user can choose to name the output file.
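The trimming options listed above (trim first, trim last, trim by quality, and trim by final length) can be sketched in Python as follows. This is only an illustration: the function name `trim_read`, the order in which the steps are applied, and the choice to cut at the first base whose quality is at or below the threshold are assumptions, not the module's documented behavior. Qualities are given as a list of integer Phred scores.

```python
MIN_READ_LENGTH = 17  # quality trimming does not shorten a read below this length


def trim_read(seq, quals, first=0, last=0, min_quality=None, final_length=None):
    """Apply the trimming options to one read (hypothetical helper).

    seq   -- nucleotide string
    quals -- list of integer quality scores, one per base
    """
    # Step 1 (assumed order): remove a fixed number of bases from each end.
    end = len(seq) - last
    seq, quals = seq[first:end], quals[first:end]

    # Step 2: cut at the first base whose quality is at or below the threshold,
    # but keep at least MIN_READ_LENGTH bases.
    if min_quality is not None:
        for index, quality in enumerate(quals):
            if quality <= min_quality:
                cut = max(index, MIN_READ_LENGTH)
                seq, quals = seq[:cut], quals[:cut]
                break

    # Step 3: remove bases from the end to reach the requested final length.
    if final_length is not None:
        seq, quals = seq[:final_length], quals[:final_length]

    return seq, quals


if __name__ == "__main__":
    read = "ACGT" * 6                 # 24 nt example read
    scores = [30] * 20 + [2] * 4      # low-quality tail of four bases
    trimmed_seq, trimmed_quals = trim_read(read, scores, min_quality=2)
    print(len(trimmed_seq))
```

With the default quality threshold of 2, the example read loses only its four-base low-quality tail; a read whose first low-quality base occurs before position 17 would still be kept at 17 bases.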
More information can be found here: Preview Tab
The resulting files could include an alignment summary table and the NGS dataset (for downstream analysis like creation of mutation data, fusion data, etc.).