Getting Started with RNAseq Analysis

From Array Suite Wiki

Getting Started with RNA-seq pipeline functions

Array Studio provides a suite of tools to process RNA-seq data quickly, easily, and reliably. Users can either execute each step of the analysis one by one or use the RNA-seq pipeline function. The following set of videos walks the user through the functions automatically executed by the standard RNA-seq pipeline, starting with raw reads in .fastq format. By the end, the user will have raw and aligned QC reports; transcript, gene, and exon-junction abundance measurements; and reports of gene variants (mutations) and gene fusions.

  • Overview of RNA-seq pipeline function [00:15]
  • Alternatives to RNA-seq pipeline function [01:11]
  • Advantages of One-step pipeline [01:47]
  • Output of RNA-seq pipeline [02:25]
  • Downstream analysis of RNA-seq data [02:35]


Running the RNA-seq pipeline for a new project

Creating a new project for your RNA-seq dataset and running it through the Array Studio RNA-seq pipeline takes only a few mouse clicks. Output data can be used for downstream analysis.

  • Running the RNA-seq pipeline for a new project [00:50]
  • Submitting and monitoring your Server-based analysis [02:31]
  • Output data types from RNA-seq pipeline [03:21]


Raw Data QC

Before aligning your RNA-seq data, you should first perform quality control (QC) on the raw data to spot common problems such as adapter or barcode sequence contamination, degraded quality at the ends of reads, or problematic samples. The Array Studio Raw Data QC Wizard reports a number of useful measures of raw NGS quality, and its report can be generated as part of the RNA-seq pipeline function. However, this module should be run before running the pipeline, to determine what read filtering or trimming might need to be performed.

  • Running the Raw Data QC Wizard [00:18]
  • Base Distribution [01:03]
  • Basic Stats [01:32]
  • Duplication Level [01:46]
  • Kmer Analysis [02:15]
  • Overall/Per-sequence Quality Reports [02:49]
  • Quality Box plot [03:12]
  • Over-represented Sequences [03:43]
  • Per-sequence GC report [03:57]
  • Sequence Length Report [04:12]
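
As a rough illustration of what a per-position quality report summarizes, the sketch below computes the mean Phred score at each read position from FASTQ text. It is not Array Studio's implementation; the function name `mean_quality_per_position` and the Phred+33 quality encoding are assumptions made for this example.

```python
from collections import defaultdict


def mean_quality_per_position(fastq_text, phred_offset=33):
    """Return the mean Phred score at each read position (0-based).

    Assumes well-formed 4-line FASTQ records and Phred+33 encoding.
    """
    lines = [ln for ln in fastq_text.splitlines() if ln]
    totals = defaultdict(int)
    counts = defaultdict(int)
    # Each record is: header, sequence, '+', quality string.
    # The quality string is therefore every 4th line, starting at index 3.
    for i in range(3, len(lines), 4):
        for pos, ch in enumerate(lines[i]):
            totals[pos] += ord(ch) - phred_offset
            counts[pos] += 1
    return [totals[p] / counts[p] for p in sorted(counts)]


# Two toy reads; the second is shorter and ends in a low-quality base.
example = "@r1\nACGT\n+\nIIII\n@r2\nACG\n+\nII#\n"
print(mean_quality_per_position(example))
```

A drop in mean quality toward the end of the reads, as at the third position of this toy input, is the kind of pattern that suggests trimming before alignment.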


Filtering and Trimming Raw Reads

Array Studio's NGS Filter function can trim low-quality bases from raw NGS data, filter out uniformly low-quality reads, and strip away adapter sequences. The RNA-seq pipeline assumes that input reads are pre-filtered and stripped, so only quality-based trimming and filtering will be performed in the pipeline (no adapter stripping). It is a good idea to run the Filter function on your reads, based on the raw data QC results, before running the RNA-seq pipeline.

  • Filtering low-quality reads [00:28]
  • Trimming ends of reads [00:52]
  • Stripping adapter sequences [01:15]
  • Generating .ff files instead of new .fastq files [02:05]
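
The three operations described above (quality trimming, read filtering, and adapter stripping) can be sketched in a few lines. This is a minimal illustration, not the NGS Filter function itself: the function names, the thresholds, the Phred+33 encoding, and the exact-match adapter search are all assumptions chosen for the example.

```python
def trim_3prime(seq, qual, min_q=20, phred_offset=33):
    """Remove trailing bases whose Phred score is below min_q."""
    end = len(seq)
    while end > 0 and ord(qual[end - 1]) - phred_offset < min_q:
        end -= 1
    return seq[:end], qual[:end]


def strip_adapter(seq, qual, adapter):
    """Cut the read at the first exact occurrence of the adapter, if any."""
    idx = seq.find(adapter)
    if idx == -1:
        return seq, qual
    return seq[:idx], qual[:idx]


def passes_filter(qual, min_mean_q=20, min_len=1, phred_offset=33):
    """Keep a read only if it is long enough and its mean quality is adequate."""
    if not qual or len(qual) < min_len:
        return False
    mean_q = sum(ord(c) - phred_offset for c in qual) / len(qual)
    return mean_q >= min_mean_q


# Hypothetical read with an adapter at its 3' end.
seq, qual = strip_adapter("ACGTACAGATCGG", "IIIIII#######", "AGATCGG")
seq, qual = trim_3prime(seq, qual)
print(seq, qual, passes_filter(qual))
```

Real adapter strippers usually tolerate mismatches and partial adapters at the read end; the exact-match search here is only the simplest case.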


Aligned Data QC

Array Studio automatically generates an Alignment Report after aligning reads to the genome or transcriptome. Additional alignment statistics can be generated by running the Aligned Data QC and RNA-seq 5'->3' Trend modules.

  • Alignment Report [00:05]
  • Running Aligned QC [00:28]
  • Aligned QC Report [01:11]
  • Calculating RNA-seq 5'->3' trend [01:50]


Quantifying RNA-seq expression

Array Studio can calculate gene, transcript, and exon-junction expression, based on the relative number of reads mapped to different features. Results can be reported as raw counts and RPKM/FPKM.

  • Running expression quantification modules [00:20]
  • Adding experimental design metadata to NGS and -Omic data [01:03]
  • Quantification output [02:00]
  • Viewing gene annotation metadata [02:15]
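
For readers unfamiliar with RPKM, the standard definition (reads per kilobase of feature per million mapped reads) is shown below as a short sketch. The function name `rpkm` and the example numbers are hypothetical, and Array Studio's exact calculation (for example, which reads are counted in the denominator) may differ.

```python
def rpkm(count, feature_length_bp, total_mapped_reads):
    """Reads Per Kilobase of feature per Million mapped reads.

    RPKM = count / (feature_length_bp / 1e3) / (total_mapped_reads / 1e6)
         = count * 1e9 / (feature_length_bp * total_mapped_reads)
    """
    if feature_length_bp <= 0 or total_mapped_reads <= 0:
        raise ValueError("length and total read count must be positive")
    return count * 1e9 / (feature_length_bp * total_mapped_reads)


# Hypothetical gene: 500 reads on a 2,000 bp feature, 10 million mapped reads.
print(rpkm(500, 2000, 10_000_000))
```

FPKM uses the same formula with fragments (read pairs) in place of individual reads, which is why the two are often reported together.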


Quantifying exon junction usage

The RNA-seq pipeline will report exon junctions, in addition to transcript- and gene-level expression. In the output Exon Junctions Report, both known and novel exon junction counts will be reported for each sample, as well as matching transcript and gene models. Exon junction data can also be visualized in the Array Studio Genome Browser.

Tip: Users can also generate counts of only known exons/exon junctions with the Report Exon Counts function.


  • Running the Report Exon Junctions module [00:10]
  • The Report Exon Junctions output [00:30]
  • Filtering the junction report for genes of interest [01:15]
  • Viewing junction data in the Genome Browser [01:39]
  • Show only Novel Exon Junctions [03:12]


Annotating Sequence Variants in your RNA-seq data

The RNA-seq pipeline will automatically generate a Sequence Variant Report, indicating every position in your RNA-seq data that differed from the selected reference genome. You can also run the Summarize Variant Data module directly, which will give you more control over variant confidence and output formats. Variants can be annotated in Mutation Reports or VCF files, and visualized directly in the Genome Browser.

  • Running the Summarize Variant Data module [00:08]
  • Variant Data output [01:04]
  • Adding annotations to Variant Reports [01:25]
  • Annotating VCF reports [02:09]
  • Filtering variant data for a gene [02:57]
  • Adding .bam files to Genome Browser [03:58]
  • Viewing variation in the Genome Browser [04:31]
  • View read sequences in the Genome Browser [04:52]


Gene Fusion detection in RNA-seq data

The RNA-seq pipeline will automatically run the Map Fusion Reads module on single-end RNA-seq data, or the Combined Fusion Analysis module on paired-end data. Both modules will run OmicSoft's FusionMap method to identify unmapped reads that span multiple genomic locations, indicating possible gene fusion events. The Combined Fusion Analysis module will also perform Paired-end fusion analysis, which looks for read-pairs mapping to different genes. Results can be viewed, filtered, and sorted in report tables, or viewed directly in the Array Studio Genome Browser.

  • Overview of Array Studio Gene Fusion identification [00:01]
  • Running the Combined Fusion Analysis module [00:37]
  • Fusion Analysis Report [01:27]
  • Filtering the Fusion Report for a gene [03:05]
  • Visualizing fusion reads in the Genome Browser [03:57]
  • Paired-end Fusion Gene Report [05:38]



Downstream Analysis of pipeline data

A large number of visualization and QC functions are available to analyze feature-level RNA-seq data. The following videos will demonstrate some ways to explore your data.

Normalizing and Transforming RNA-seq Data for MicroArray-type analysis

Array Studio has a large number of modules originally designed for expression MicroArray analysis, but these modules are also useful for analyzing feature-level (e.g. gene-level, exon-level) RNA-seq data. However, many of these modules expect normalized and log-transformed input data. Array Studio provides a number of methods for normalizing and transforming -Omic data.

  • Overview of RNA-seq downstream analysis modules [00:01]
  • Normalizing RNA-seq FPKM data to upper-quartile with MicroArray Normalization [01:05]
    • A related function, NGS | Inference | Normalize RNA-seq data, provides additional NGS-specific normalization options
  • Log2-transforming data (with pseudocount) [01:58]
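
The two steps shown in the video, upper-quartile normalization followed by a log2 transform with a pseudocount, can be sketched as follows. This is an illustration under stated assumptions rather than the Array Studio modules themselves: the function names, the target value, the exclusion of zeros when computing the quartile, and the interpolation method are all choices made for this example.

```python
import math


def percentile(values, q):
    """Linearly interpolated percentile (q in [0, 1]) of a non-empty list."""
    s = sorted(values)
    pos = q * (len(s) - 1)
    lo = math.floor(pos)
    hi = math.ceil(pos)
    return s[lo] + (s[hi] - s[lo]) * (pos - lo)


def upper_quartile_normalize(sample, target=10.0):
    """Scale one sample so the 75th percentile of its non-zero values equals target."""
    nonzero = [v for v in sample if v > 0]
    if not nonzero:
        raise ValueError("sample has no non-zero values")
    uq = percentile(nonzero, 0.75)
    return [v * target / uq for v in sample]


def log2_transform(sample, pseudocount=1.0):
    """log2(x + pseudocount); the pseudocount keeps zeros finite."""
    return [math.log2(v + pseudocount) for v in sample]


# Hypothetical FPKM values for one sample.
fpkm = [0.0, 2.0, 4.0, 6.0, 8.0]
normalized = upper_quartile_normalize(fpkm, target=13.0)
print(normalized)
print(log2_transform(normalized))
```

Applying the same target to every sample puts their upper quartiles on a common scale, which is what makes the microarray-style modules downstream comparable across samples.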


Attach new Views to Data

Data can be directly viewed in tables, but can also be displayed in up to 40 Views, depending on the contents of the underlying data. These Views are highly customizable, and are completely interactive.

  • Attach a Variable View [00:12]
  • Re-group samples [00:25]
  • Change to BoxPlot [00:50]
  • Filter Views to specific data [01:41]


Principal Component Analysis on normalized expression data

Principal Component Analysis (PCA) is an effective tool for summarizing data along the components that account for the greatest variance in the dataset. In other words, PCA can group your data based on variance, which should reflect differences between samples. Problematic samples (such as failed samples) will often appear as outliers.

  • Run Principal Component Analysis module [00:10]
  • Show/Hide Hotelling T2 Ellipse [01:08]


Hierarchical Clustering of normalized expression data

Gene expression data can be grouped with Hierarchical Clustering, by Variables (e.g. genes) and by Observations (e.g. samples), to reveal associations in your data.

  • Specify parameters for hierarchical clustering [00:10]
  • Adjust HC color scale [01:55]
  • Add color bar by sample metadata [02:20]
  • Detailed View with Modern Dendrogram [02:52]


RNAseq-MicroArray Integration

Feature-level (genes, transcripts, etc.) results from RNA-seq experiments can be compared directly to microarray data from the same samples, using the MicroArray-MicroArray Integration module.

  • Importing MicroArray .cel files to Array Studio [00:16]
  • Attaching a MicroArray design table [01:08]
  • Running the MicroArray-MicroArray Integration function [02:22]
  • Output of MicroArray-MicroArray Integration [03:48]
  • View correlations of each gene between data sets [04:45]
  • View correlations of each sample for every gene [05:21]
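
To make the per-gene comparison concrete, the sketch below computes a Pearson correlation for each gene across matched samples in two datasets. Whether the integration module uses Pearson correlation is an assumption here, and the function names, gene names, and values are hypothetical.

```python
import math


def pearson(x, y):
    """Pearson correlation of two equal-length lists; NaN if either is constant."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    if sx == 0 or sy == 0:
        return float("nan")
    return cov / (sx * sy)


def per_gene_correlation(rnaseq, microarray):
    """Correlate each gene shared by both datasets across matched samples."""
    shared = sorted(set(rnaseq) & set(microarray))
    return {gene: pearson(rnaseq[gene], microarray[gene]) for gene in shared}


# Hypothetical log2 expression values; samples are in the same order in both.
rnaseq = {"GENE_A": [1.0, 2.0, 3.0, 4.0], "GENE_B": [4.0, 3.0, 2.0, 1.0]}
microarray = {"GENE_A": [2.0, 4.0, 6.0, 8.0], "GENE_B": [1.0, 2.0, 3.0, 4.0]}
print(per_gene_correlation(rnaseq, microarray))
```

Swapping the roles of genes and samples (one list per sample, one value per gene) gives the per-sample correlation instead.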



Advanced Analysis of RNA-seq data

Statistical inference of differential expression of genes and transcripts can be performed on your feature-level data, whether it was quantified in Array Studio or imported from external programs.

ANOVA on RNA-seq Data

Array Studio provides a number of modules for statistical inference of differences between RNA-seq samples. In this video, a two-way Analysis of Variance (ANOVA) is performed on a set of samples from two tissues (lung and skin), and from males and females.

  • The Two-way ANOVA module [00:20]
  • ANOVA output [01:45]
  • Sorting ANOVA output to find significant results [02:25]
  • Confirming ANOVA results in Array Studio Genome Browser [02:38]


DESeq on RNA-seq Data

The DESeq GLM test is a powerful tool for inferring differential expression of genes/transcripts from raw count data. The resulting data objects are fully interactive, and can be explored in Array Studio Views and Genome Browser.


  • Running the DESeq GLM module [00:25]
  • DESeq table output [01:49]
  • Viewing candidate genes in Array Studio Genome Browser [02:09]
  • Viewing DESeq results as a Volcano Plot [02:35]

Identifying Differential Usage of Isoforms

Array Studio uses a straightforward approach to identifying genes with differential transcript usage between groups. The results can be filtered to identify candidate genes, and can be directly inspected in the Array Studio Genome Browser.

  • Array Studio approach to identifying differential isoforms [00:20]
  • Running the Differentially Expressed Isoforms module [00:59]
  • Sorting and filtering Differentially Expressed Isoforms results [02:32]
  • Viewing transcript coverage in Array Studio Genome Browser [03:57]
  • Viewing exon usage in Array Studio Genome Browser [04:47]

