DNA-Seq Getting Started
From Array Suite Wiki
Analysis of DNA-Seq data usually follows these steps:
Before adding reads
OmicSoft provides NGS Quality Control modules that help users get a better idea of the quality of their sequence data. Users can get basic statistics and summaries by running these modules: Basic Statistics (reports the number of sequences, minimum and maximum sequence length, number of nucleotides, and GC%), Base Distribution, Quality Box Plot, Sequence Duplication, and Paired End Insert Size Profile.
Alternatively, users can use the QC Wizard to run multiple QC commands simultaneously, without having to go through each individual menu.
Quality Control modules accept these input file formats: FASTA, FASTQ, QSEQ, SFF, SAM, BAM and AUTO (AUTO allows the use of any combination of the listed file types).
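To make the Basic Statistics summary concrete, the following is a minimal Python sketch that computes the same kinds of values (sequence count, minimum and maximum length, nucleotide count, GC%) from FASTQ input. It is not OmicSoft's implementation; the function name `fastq_basic_stats` and the output keys are hypothetical, and the sketch assumes plain four-line FASTQ records with no wrapped sequence lines.

```python
import io


def fastq_basic_stats(handle):
    """Summarize a FASTQ stream, assuming exactly four lines per record."""
    n_seqs = 0
    n_bases = 0
    gc = 0
    min_len = None
    max_len = 0
    for i, line in enumerate(handle):
        # In four-line FASTQ, the second line of each record is the sequence.
        if i % 4 != 1:
            continue
        seq = line.strip().upper()
        n_seqs += 1
        n_bases += len(seq)
        gc += seq.count("G") + seq.count("C")
        min_len = len(seq) if min_len is None else min(min_len, len(seq))
        max_len = max(max_len, len(seq))
    return {
        "sequences": n_seqs,
        "min_length": min_len or 0,
        "max_length": max_len,
        "nucleotides": n_bases,
        "gc_percent": 100.0 * gc / n_bases if n_bases else 0.0,
    }


# Two made-up reads, used only to exercise the function.
example = io.StringIO("@r1\nACGT\n+\nIIII\n@r2\nGGCCAT\n+\nIIIIII\n")
stats = fastq_basic_stats(example)
print(stats)
```

For a real file, the same function could be called with an open text file handle in place of the in-memory example.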
Starting from raw data/mapping reads to genome
When the user has raw sequence data files, it is recommended to start from them by first mapping the reads to the genome. In addition to the Map Reads to Genome (Illumina) module, another module, Map Reads to Genome (Long Reads), uses a partial alignment algorithm and is designed to map raw sequence reads that are longer than the standard length (e.g. 454 reads, as well as Pacific Biosciences reads and newer Illumina reads).
This module accepts these input formats: FASTQ, FASTA, QSEQ, or AUTO (AUTO allows the use of any combination of the listed file types).
OmicSoft's method allows for both paired-end and single-end reads.
Multiple samples can be aligned in the same run, and the result is a new "NGS" dataset within the project.
The user should specify whether the raw reads are paired end or single end, and specify the reference genome.
If the user plans to use the BAM files in other programs, it is recommended they set the Output folder. This will export the BAM files to the specified folder. Normally, these BAM files are "hidden" within the project directory, in a folder with a randomized name.
More information can be found here: Map Reads to Genome (Illumina) module
Starting from already mapped reads
The Add Genome-Mapped Reads module is helpful when the user has data aligned outside of OmicSoft and wants to do some downstream analysis with it. This module can import BAM or SAM files and returns a number of summary statistics, an NGS dataset (used for further downstream analysis such as exon junction generation, paired fusion gene detection, and more), and a microarray dataset containing expression values.
The user should provide correct information about the reference genome used for aligning. If the reference genome cannot be found in Array Studio, the user can build one on the fly for a new Genome or Gene model.
More information can be found here: Add Genome-Mapped Reads module
A number of summarization modules can be run (or are run by default) in Array Studio when aligning DNA-Seq data.
Summarizing Amplicon Based Coverage Statistics
This module is useful when the user ran an experiment on only a specific set of genes or a specific area of a chromosome, or is only interested in a particular set of genes or regions, and wants to see the coverage statistics for only those regions. The user must provide an Amplicon file to run this module.
More information can be found here: Summarize Amplicon Based Coverage Statistics
Summarizing Target Sequencing Coverage Statistics
This is another module used to calculate coverage statistics for specified or targeted regions. It is similar to Summarizing Amplicon Based Coverage Statistics, but uses different input files: this module accepts .bed files. The reports produced by the two modules also differ slightly.
More information can be found here: Summarize Target Sequencing Coverage Statistics
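As an illustration of what per-region coverage statistics over a .bed file can look like, here is a minimal Python sketch. It is not the Array Studio module's algorithm or report layout; the function names, the `min_depth` parameter, and the report fields are hypothetical, and per-base depth is assumed to be already available as a simple list per chromosome. BED coordinates are treated as 0-based and half-open, as in the BED format.

```python
def parse_bed(lines):
    """Parse BED lines into (chrom, start, end, name) tuples."""
    regions = []
    for line in lines:
        line = line.strip()
        if not line or line.startswith(("#", "track", "browser")):
            continue
        fields = line.split("\t")
        chrom, start, end = fields[0], int(fields[1]), int(fields[2])
        name = fields[3] if len(fields) > 3 else f"{chrom}:{start}-{end}"
        regions.append((chrom, start, end, name))
    return regions


def region_coverage(regions, depth, min_depth=1):
    """Compute mean depth and covered fraction for each region.

    depth maps a chromosome name to a list of per-base read depths.
    Positions missing from the list count as zero coverage.
    """
    report = []
    for chrom, start, end, name in regions:
        window = depth.get(chrom, [])[start:end]
        length = end - start
        covered = sum(1 for d in window if d >= min_depth)
        report.append({
            "name": name,
            "length": length,
            "mean_depth": sum(window) / length if length else 0.0,
            "fraction_covered": covered / length if length else 0.0,
        })
    return report


# Made-up regions and depths, used only to exercise the functions.
bed = ["chr1\t2\t6\tampliconA", "chr1\t8\t10\tampliconB"]
depth = {"chr1": [0, 0, 5, 10, 10, 5, 0, 0, 0, 2]}
report = region_coverage(parse_bed(bed), depth)
for row in report:
    print(row)
```

In practice the depth values would come from the aligned reads (for example, from exported BAM files) rather than from a hand-written list.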
Mutation data can be generated for DNA-Seq data. The Summarize Mutation Data module allows the user to compare mutation frequencies at individual sites between groups of samples. This module generates a mutation dataset that can be used for further downstream analysis, and potentially a coverage dataset as well.
More information can be found here: Summarize Mutation Data module
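The idea of comparing mutation frequencies at a site between groups of samples can be sketched in Python as follows. This is an assumption-laden illustration, not the module's actual method: mutation frequency is taken here to be the number of reads carrying the mutation divided by total reads at the site, groups are compared by their mean frequency, and all function names, sample names, and counts are made up.

```python
def mutation_frequency(alt_reads, total_reads):
    """Fraction of reads at a site that carry the mutation."""
    return alt_reads / total_reads if total_reads else 0.0


def group_mean_frequency(site_counts, groups):
    """Average the per-sample mutation frequency within each group.

    site_counts maps sample -> (alt_reads, total_reads) at one site.
    groups maps sample -> group label.
    """
    per_group = {}
    for sample, (alt, total) in site_counts.items():
        freq = mutation_frequency(alt, total)
        per_group.setdefault(groups[sample], []).append(freq)
    return {g: sum(f) / len(f) for g, f in per_group.items()}


# Hypothetical read counts at a single site for four samples.
counts = {"S1": (10, 100), "S2": (30, 100), "S3": (0, 50), "S4": (5, 50)}
groups = {"S1": "tumor", "S2": "tumor", "S3": "normal", "S4": "normal"}
means = group_mean_frequency(counts, groups)
print(means)
```

A real comparison would typically also take coverage into account, since a frequency estimated from a handful of reads is far less reliable than one estimated from hundreds.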
It is highly recommended to attach annotation to the Mutation Report for DNA-Seq data. The Annotate Mutation/SNP Report module allows the user to generate additional details, using a supplied mutation report or -OMIC dataset. Each mutation will be annotated with gene name, chromosome, position, reference allele, mutation allele, dbSNP name (if known), Annotation type (intron, non-synonymous, 5' UTR, synonymous, 3' UTR, etc.), AAPosition (amino acid position of the change), AAChange (amino acid change, if there is one), transcript ID, transcript name, transcript strand, distance to 3' end, and distance to 5' end.
More information can be found here: Annotate Mutation/SNP Report module
DNA-Seq data can be used to identify potential fusion genes, where short reads map to exon junctions. The Map Fusion Reads (Illumina) module and the Report Fusion Genes (Paired End) module are efficient and easy-to-use tools for this purpose.
Tutorial for fusion detection can be found here: https://omicsoftdocs.github.io/ArraySuiteDoc/tutorials/RNASeq/RNA-Seq_Fusion_Gene_Detection/
Visualizing Reads in the Genome Browser
DNA-Seq reads can be easily visualized in the Genome Browser. The Genome Browser is based on the concept of tracks. Initially, a Genome must be set (along with Species), and all of the accompanying tracks are based on that Genome or Reference. The Genome Browser is available within Array Studio in the Genome Browser tab at the top of the software.
The user can add tracks from a variety of data types to explore data. The built-in Genome Browser button for every variable ID allows the user to browse the sequence of genes (SNPs, exon junctions, etc.) from within the Analysis Solution Explorer. The following example shows a genome browser with tracks from DNA-Seq reads, a SNP summary of the reads, and dbSNP.
Tutorials for Genome Browser can be found here: