RNA-Seq Getting Started Old

From Array Suite Wiki

Analysis of RNA-Seq data usually follows the steps described in the sections below.

Before mapping reads

OmicSoft provides NGS Quality Control modules that give users a sense of the quality of their sequence data before running time-consuming alignment. Basic statistics and summaries are available from the following modules: Basic Statistics (reports the number of sequences, minimum and maximum sequence length, number of nucleotides, and GC%), Base Distribution, Quality Box Plot, Sequence Duplication, and K-Mer Analysis (K=5).
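
To make the reported quantities concrete, the following is a minimal Python sketch of how the Basic Statistics values and the K=5 k-mer counts could be computed from a FASTQ file. It is an illustration only, not OmicSoft's implementation; the function names (read_fastq, basic_statistics) and the example reads are hypothetical, and it assumes a simple four-line-per-record FASTQ layout.

```python
import io
from collections import Counter


def read_fastq(handle):
    """Yield (sequence, quality) pairs, assuming 4 lines per FASTQ record."""
    while True:
        header = handle.readline()
        if not header:
            break  # end of file
        if not header.strip():
            continue  # skip stray blank lines
        seq = handle.readline().strip()
        handle.readline()  # '+' separator line, ignored
        qual = handle.readline().strip()
        yield seq, qual


def basic_statistics(records, k=5):
    """Return sequence count, min/max length, nucleotide count, GC% and k-mer counts."""
    n_seqs = n_bases = gc = max_len = 0
    min_len = None
    kmers = Counter()
    for seq, _qual in records:
        seq = seq.upper()
        n_seqs += 1
        n_bases += len(seq)
        gc += seq.count("G") + seq.count("C")
        min_len = len(seq) if min_len is None else min(min_len, len(seq))
        max_len = max(max_len, len(seq))
        # Count every overlapping window of length k.
        for i in range(len(seq) - k + 1):
            kmers[seq[i:i + k]] += 1
    return {
        "sequences": n_seqs,
        "min_length": min_len or 0,
        "max_length": max_len,
        "nucleotides": n_bases,
        "gc_percent": 100.0 * gc / n_bases if n_bases else 0.0,
        "kmers": kmers,
    }


# Two made-up reads, used only to exercise the functions.
EXAMPLE_FASTQ = "@read1\nACGTACGT\n+\nIIIIIIII\n@read2\nGGCCA\n+\nIIIII\n"
stats = basic_statistics(read_fastq(io.StringIO(EXAMPLE_FASTQ)))
```

For a real file, the same functions can be applied to an open file handle instead of the in-memory example.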

For QC modules of aligned data, see Aligned Data QC for RNA-Seq.

Alternatively, the user can use the QC Wizard to run multiple QC commands at once, without having to go through each individual menu.

Quality Control modules accept these input file formats: FASTA, FASTQ, QSEQ, SFF, SAM, BAM, and AUTO (AUTO allows the use of any combination of the listed file types).

Starting from raw data/mapping reads to genome

Map Reads to Genome (Illumina) module

When the user has raw data files, it is recommended to start from them by first mapping the reads to the genome. With this method, the reads are first mapped to the transcriptome, and the remaining reads are then mapped to the genome (to find novel genes, exon junctions, etc.).

In most cases this is highly preferred over mapping only to the transcriptome. Mapping to the genome gives more accurate expression values downstream and allows detection of novel exon junctions, novel genes, etc.

OmicSoft's method supports both paired-end and single-end reads.

Multiple samples can be aligned in the same run, and the result is a new "NGS" dataset within the project.

If the user plans to use the BAM files in other programs, it is recommended to set the Output folder, which exports the BAM files to the specified folder. Otherwise, the BAM files are "hidden" within the project directory, in a folder with a randomized name.

More information can be found here: Map Reads to Genome (Illumina) module


Starting from already mapped reads

Add Genome Mapped RNA-Seq Reads module

The Add Genome Mapped RNA-Seq Reads module is helpful when the user has data aligned outside of OmicSoft and wants to run downstream analysis on it. This module imports BAM or SAM files and returns a number of summary statistics, an NGS dataset (used for further downstream analysis such as exon junction generation, paired fusion gene detection, and more), and a microarray dataset containing expression values.

The user should provide the correct information for the reference genome used in the alignment. If the reference genome cannot be found in Array Studio, the user can build one on the fly for a new Genome or Gene model.

More information can be found here: Add Genome Mapped RNA-Seq Reads module


Summarizing Data

A number of summarization modules can be run (or are run by default) in Array Studio when aligning RNA-Seq data.

Reporting Gene/Transcript Counts

The Report Gene/Transcript Counts module reports either the gene counts or the transcript counts for an already imported NGS dataset. This is useful when the user did not have Array Studio count the RNA-Seq alignments on import, or wants to use a different counting method. Array Studio offers various counting methods at either the gene level or the transcript level.

This module generates a microarray dataset containing each gene ID and its expression measurement.

More information can be found here: Report Gene/Transcript Counts module

Reporting Exon/Exon Junction Counts

The Report Exon/Exon Junction Counts module is used for detection of alternative splicing and exon skipping. It summarizes the dataset by both exons and exon junctions. The data is returned as a table dataset that can easily be converted to microarray data for comparisons across samples.

More information can be found here: Report Exon/ExonJunction Counts module

Finding Exon Skipping

Looking for skipped exons is one strategy for finding alternative splicing.

The key to detecting exon skipping is to find genes that are expressed in both sets of samples but have exon counts that differ between the groups. To do this, it is recommended to first perform a Gene Summarization (see above), followed by Reporting of Exon/Exon Junction counts.

  1. Run ReportGeneTranscriptCounts
  2. Use the generated RPKM data to find genes that are expressed in both sets of samples (filter for genes above a specific value in both sets of samples)
  3. Run ReportExonCounts
  4. Filter ReportExonCounts for those genes found by the steps above
  5. Filter exons for those expressed in one set and absent in the other set
  6. Follow up by looking at exon/gene in the genome browser
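
The filtering in steps 2, 4 and 5 above can be sketched in Python as follows, for data exported from Array Studio into plain dictionaries. This is a hedged illustration, not the software's own procedure: the function names, the data layout, and the thresholds (min_rpkm, present, absent) are all assumptions chosen for the example and should be tuned to the dataset.

```python
def expressed_in_both(gene_rpkm, group_a, group_b, min_rpkm=1.0):
    """Step 2: genes whose RPKM is at least min_rpkm in every sample of both groups."""
    samples = list(group_a) + list(group_b)
    return {
        gene
        for gene, values in gene_rpkm.items()
        if all(values[s] >= min_rpkm for s in samples)
    }


def candidate_skipped_exons(exon_counts, genes, group_a, group_b,
                            present=10, absent=0):
    """Steps 4-5: exons of the selected genes that are present in one group
    (count >= present in all samples) and absent in the other (count <= absent)."""
    hits = []
    for (gene, exon), counts in exon_counts.items():
        if gene not in genes:
            continue  # step 4: keep only genes expressed in both groups
        a_present = all(counts[s] >= present for s in group_a)
        b_present = all(counts[s] >= present for s in group_b)
        a_absent = all(counts[s] <= absent for s in group_a)
        b_absent = all(counts[s] <= absent for s in group_b)
        if (a_present and b_absent) or (b_present and a_absent):
            hits.append((gene, exon))
    return sorted(hits)


# Made-up values for two groups of two samples each.
group_a, group_b = ["A1", "A2"], ["B1", "B2"]
gene_rpkm = {
    "GENE1": {"A1": 12.0, "A2": 9.5, "B1": 11.0, "B2": 10.2},
    "GENE2": {"A1": 8.0, "A2": 7.0, "B1": 0.1, "B2": 0.0},
}
exon_counts = {
    ("GENE1", "exon1"): {"A1": 40, "A2": 35, "B1": 38, "B2": 41},
    ("GENE1", "exon2"): {"A1": 30, "A2": 28, "B1": 0, "B2": 0},
    ("GENE2", "exon1"): {"A1": 20, "A2": 22, "B1": 0, "B2": 0},
}
genes = expressed_in_both(gene_rpkm, group_a, group_b)
candidates = candidate_skipped_exons(exon_counts, genes, group_a, group_b)
```

In this toy data GENE2 is excluded because it is not expressed in group B, so its missing exon reflects a lack of expression rather than skipping. The remaining candidates would still need confirmation in the genome browser (step 6).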

Reporting Exon Junctions

The Report Exon Junctions module can be used to generate an exon junctions report for each observation in the dataset. It returns the number of exon junctions reported for each observation, along with details for each JunctionID. The user can also output .bed files.
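
For readers unfamiliar with the .bed format, the sketch below shows how a single junction record could be written as a six-column BED line (chrom, start, end, name, score, strand). It is only an illustration of the general BED convention: the function name and the input fields are hypothetical, the columns Array Studio actually writes may differ, and the input coordinates are assumed to be 1-based and inclusive.

```python
def junction_to_bed_line(chrom, start, end, junction_id, read_count, strand):
    """Format one junction as a 6-column BED line.

    Assumes start/end are 1-based inclusive; BED uses a 0-based start and an
    exclusive end, so only the start is shifted by one.
    """
    score = max(0, min(int(read_count), 1000))  # BED scores range from 0 to 1000
    fields = [chrom, str(start - 1), str(end), junction_id, str(score), strand]
    return "\t".join(fields)


# Made-up junction used only to show the output layout.
line = junction_to_bed_line("chr1", 100, 200, "J1", 25, "+")
```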

More information can be found here: Report Exon Junctions module

Summarizing Mutations

Mutation data can be generated for the RNA-Seq data. The Summarize Mutation Data module allows the user to compare mutation frequencies at individual sites between groups of samples. This module generates a mutation dataset that can be used for further downstream analysis, and potentially a coverage dataset as well.

More information can be found here: Summarize Mutation Data module
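
As a rough illustration of comparing mutation frequency at one site between two groups, the Python sketch below takes per-sample (mutant reads, coverage) pairs and averages the per-sample frequencies within each group. This is one possible interpretation, not the module's documented calculation; the function names and the example counts are hypothetical.

```python
def mutation_frequency(mutant_reads, coverage):
    """Fraction of reads carrying the mutation, or None if the site is uncovered."""
    if coverage <= 0:
        return None
    return mutant_reads / coverage


def group_mean_frequency(site_counts, group):
    """Mean per-sample mutation frequency over the covered samples in a group."""
    freqs = []
    for sample in group:
        freq = mutation_frequency(*site_counts[sample])
        if freq is not None:  # skip samples with no coverage at this site
            freqs.append(freq)
    return sum(freqs) / len(freqs) if freqs else None


# Made-up (mutant reads, coverage) values for a single site.
site_counts = {"S1": (5, 50), "S2": (10, 100), "S3": (0, 80), "S4": (0, 0)}
freq_a = group_mean_frequency(site_counts, ["S1", "S2"])
freq_b = group_mean_frequency(site_counts, ["S3", "S4"])
```

Samples without coverage are excluded rather than counted as zero, which is why a coverage dataset is useful alongside the mutation dataset.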

Annotating Mutations

It is highly recommended to attach annotation to the Mutation Report for RNA-Seq data. The Annotate Mutation/SNP Report module allows the user to generate additional details, using a supplied mutation report or -OMIC dataset. Each mutation will be annotated with gene name, chromosome, position, reference allele, mutation allele, dbSNP name (if known), annotation type (intron, non-synonymous, 5’ UTR, synonymous, 3’ UTR, etc.), AAPosition (amino acid position of change), AAChange (amino acid change, if there is one), transcript ID, transcript name, transcript strand, distance to 3’ end, and distance to 5’ end.

More information can be found here: Annotate Mutation/SNP Report module


Fusion Detection

RNA-Seq data allows researchers to identify potential gene fusions. OmicSoft developed two powerful and easy-to-use tools for this: the Map Fusion Reads (Illumina) module and the Report Fusion Genes (Paired End) module.

The best practice is to map paired-end RNA-Seq data to the reference genome and run the Report Fusion Genes (Paired End) module, then use only the unmapped reads (generated as FASTQ files during the paired-end alignment) as input to the single-end fusion aligner, Map Fusion Reads (Illumina). Combining the paired-end analysis with the single-end fusion aligner helps eliminate false positives and detect the exact position of the fusion junction.
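
The idea behind paired-end fusion evidence can be sketched in Python as counting read pairs whose two mates map to different genes. This is a simplified illustration under stated assumptions, not the algorithm used by the Report Fusion Genes (Paired End) module: the function name, the input layout (one gene name per mate, None when a mate hits no annotated gene), and the min_pairs threshold are all hypothetical.

```python
from collections import Counter


def fusion_candidates(pair_genes, min_pairs=2):
    """Count read pairs whose mates map to two different genes.

    pair_genes: iterable of (gene_of_mate1, gene_of_mate2) tuples.
    Returns {(gene_x, gene_y): supporting_pairs} for gene pairs supported by
    at least min_pairs read pairs; gene names are sorted within each key.
    """
    support = Counter()
    for gene1, gene2 in pair_genes:
        if gene1 is None or gene2 is None or gene1 == gene2:
            continue  # unannotated mate, or an ordinary same-gene pair
        support[tuple(sorted((gene1, gene2)))] += 1
    return {pair: n for pair, n in support.items() if n >= min_pairs}


# Made-up mate assignments for five read pairs.
pairs = [
    ("BCR", "ABL1"),
    ("ABL1", "BCR"),
    ("TP53", "TP53"),
    ("EGFR", None),
    ("GENEA", "GENEB"),
]
candidates = fusion_candidates(pairs)
```

Requiring more than one supporting pair is a simple way to drop isolated mapping artifacts. Locating the exact junction would still rely on reads that span the breakpoint, which is the role of the single-end fusion aligner described above.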

A tutorial for fusion detection can be found here:
http://www.omicsoft.com/software/ArrayStudio/FusionTutorial.pdf
or Fusion Tutorial


Visualizing Reads in the Genome Browser

RNA-Seq reads can be easily visualized in the Genome Browser. The Genome Browser is based upon the concept of tracks. Initially, a Genome must be set (along with Species), and all of the accompanying tracks are based on that Genome or Reference. The Genome Browser is available within Array Studio in the Genome Browser tab at the top of the software.

The user can add tracks from a variety of data types to explore data. The built-in Genome Browser button for every variable ID allows the user to browse the sequence of genes (SNPs, exon junctions, etc.) from within the Analysis Solution Explorer. The following example shows a genome browser with tracks from RNA-Seq reads and expression measurement.

(Figure: GenomeBrowser Example001.jpg)

Tutorials for Genome Browser can be found here:
http://www.omicsoft.com/software/ArrayStudio/GenomeBrowserTutorial_NGS.pdf
or
http://www.omicsoft.com/software/ArrayStudio/GenomeBrowserUserGuide.pdf