Statistical Inference for RNA-Seq

From Array Suite Wiki

As with any -OMIC data analysis, we recommend having multiple replicates per group when making a statistical comparison of groups. However, both of the modules available for statistical inference of RNA-Seq data can run an analysis with as few as 1 replicate per group.

Currently, the software supports a One-Way DESeq analysis and a test for alternative splicing.

DESeq One Way Test

Background

More information on DESeq can be found here: http://www-huber.embl.de/users/anders/DESeq/.

Omicsoft has re-implemented DESeq, and believes it to be one of the best methods for comparing count data.

To run DESeq on your data, you must first have performed a naive gene counting method. It is important that this method generates counts rather than RPKM/FPKM values, and that no background reads are added to the dataset.
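As a quick sanity check outside the software, one way to tell whether an exported table still holds raw counts is to verify that every value is a non-negative whole number. The sketch below is only a heuristic under that assumption; the function name and the two example matrices are hypothetical and are not part of Array Suite.

```python
import math


def looks_like_raw_counts(matrix):
    """Return True if every value is a finite, non-negative whole number.

    Heuristic only: normalized values such as RPKM/FPKM are usually
    fractional, while raw read counts are always whole numbers.
    """
    for row in matrix:
        for value in row:
            number = float(value)
            if not math.isfinite(number) or number < 0:
                return False
            if number != int(number):
                return False
    return True


# Hypothetical example data: rows are genes, columns are samples.
count_table = [[120, 98, 0], [15, 22, 31]]
fpkm_table = [[12.37, 9.81, 0.0], [1.52, 2.2, 3.1]]

print(looks_like_raw_counts(count_table))
print(looks_like_raw_counts(fpkm_table))
```

A table that passes this check is not guaranteed to be raw counts (rounded values would pass too), but a table that fails it is almost certainly not suitable as DESeq input.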

Running DESeq

1. Generate count data, either with Omicsoft's "Count" method or by importing an outside dataset (see Quantification_of_RNA-Seq). As stated above, it's important that this data is count data, and NOT RPKM/FPKM data, as the module is designed to work on counts.

2. Make sure that the -OMIC data has a design table with a grouping to be used for the analysis.

3. Choose DESeq from the Quantification menu.

If the comparison you are making has only 1 replicate per group, you must choose the option "Each group has only one observation (no replicates)". This will automatically set the options for Fit Type, Sharing model, and Dispersion method to local, fit-only, and blind, respectively.

Generally, the following information should guide your decision on the FitType, SharingModel, and DispersionMethod.

1. FitType - For experiments with more than 1 replicate, Parametric is recommended. However, the original DESeq paper used local, and local is what is used when you have only 1 replicate.

2. SharingModel - fit-only should only be used with a small number of replicates, when you are not overly concerned about false positives from dispersion outliers. Maximum is recommended when you have at least three or four replicates per group, and is the most conservative of the options. gene-est-only is only recommended when you have a large number of replicates; with too few replicates, this option could lead to some false positives.

3. DispersionMethod - blind should be used when you have no biological replicates. Pooled uses the samples from all conditions to estimate a single pooled dispersion value, and should be used when there is more than 1 replicate per group. Per-condition can be used if you have a sufficient number of replicates (3 or more) and multiple conditions; it assigns a dispersion value to each group rather than pooling across all samples.
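The guidance above can be summarized as a small decision helper. This is only a sketch of the rules of thumb on this page, not the module's own logic: the function name is hypothetical, the cutoff of 3 replicates for Maximum is a simplification of "at least three or four", and gene-est-only and Per-condition are deliberately never chosen automatically because the text gives no firm threshold for them.

```python
def suggest_deseq_options(replicates_per_group):
    """Suggest FitType, SharingModel and DispersionMethod settings.

    replicates_per_group maps a group name to its number of replicates.
    The smallest group drives the suggestion.
    """
    smallest = min(replicates_per_group.values())
    if smallest < 1:
        raise ValueError("every group needs at least one sample")

    if smallest == 1:
        # Same settings as "Each group has only one observation".
        return {
            "FitType": "local",
            "SharingModel": "fit-only",
            "DispersionMethod": "blind",
        }

    return {
        "FitType": "Parametric",
        # Assumed cutoff: Maximum from 3 replicates per group upward.
        "SharingModel": "Maximum" if smallest >= 3 else "fit-only",
        # Per-condition is a possible alternative with 3 or more replicates.
        "DispersionMethod": "Pooled",
    }


# Group names taken from the example dataset on this page.
print(suggest_deseq_options({"Adipose": 1, "Liver": 1}))
print(suggest_deseq_options({"Adipose": 4, "Liver": 3}))
```

Treat the output as a starting point; the final choice still depends on how concerned you are about false positives in your experiment.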

See the following for more details on each of the options:

  1. FitType
  2. SharingModel
  3. DispersionMethod

DeSeqOptions.png

Results of DESeq

In the resulting report, the user will get a raw p-value, an adjusted p-value (i.e., FDR), and a fold change.
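To illustrate how an FDR-adjusted p-value relates to the raw p-value, the sketch below applies the Benjamini-Hochberg procedure, a common way to compute FDR-adjusted p-values. This page does not state which adjustment the module uses, so treat the choice of procedure as an assumption; the function name and the example p-values are hypothetical.

```python
def benjamini_hochberg(p_values):
    """Return Benjamini-Hochberg adjusted p-values in the input order."""
    m = len(p_values)
    # Indices of the p-values, from smallest to largest.
    order = sorted(range(m), key=lambda i: p_values[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down so that each adjusted value
    # never exceeds the adjusted value of a larger raw p-value.
    for rank in range(m, 0, -1):
        index = order[rank - 1]
        running_min = min(running_min, p_values[index] * m / rank)
        adjusted[index] = running_min
    return adjusted


# Hypothetical raw p-values for four genes.
raw_p = [0.01, 0.04, 0.03, 0.20]
adjusted_p = benjamini_hochberg(raw_p)
for raw, adj in zip(raw_p, adjusted_p):
    print(f"raw={raw:.3f}  adjusted={adj:.4f}")
```

The adjusted value is never smaller than the raw value, which is why a gene can pass a raw p-value cutoff but fail the same cutoff on the adjusted column.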

DESeqResults.png

Visualization

Right-click a row to quickly zoom to an already opened genome browser, or create a new genome browser with your NGS dataset.

In the example below, we are looking at a gene with clear differences in expression between the Adipose and Liver samples.

GenomeBrowserF13b.png

Alternative Splicing

The alternative splicing detection module runs a two-way chi-square test on imported NgsData to detect alternative splicing. It requires a dataset with an attached design table containing sample annotation and grouping, and it works on data with one or more replicates per group. Note that higher-coverage genes tend to have more significant p-values.
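The idea of a two-way chi-square test, and the reason higher-coverage genes tend to give smaller p-values, can be sketched as follows. The layout of the table (one row per exon of a gene, one column per group) is an assumption made for illustration, since this page does not describe how the module builds its table; the function names and read counts are hypothetical.

```python
import math


def chi_square_statistic(table):
    """Pearson chi-square statistic and degrees of freedom for a 2-D table."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    if min(row_totals) == 0 or min(col_totals) == 0:
        raise ValueError("every row and column needs at least one read")
    grand_total = sum(row_totals)
    statistic = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            expected = row_totals[i] * col_totals[j] / grand_total
            statistic += (observed - expected) ** 2 / expected
    dof = (len(row_totals) - 1) * (len(col_totals) - 1)
    return statistic, dof


def chi_square_sf(x, dof):
    """Upper-tail probability of a chi-square variable (closed forms)."""
    if x <= 0:
        return 1.0
    half = x / 2.0
    if dof % 2 == 0:
        # Even dof: exp(-x/2) * sum_{k < dof/2} (x/2)^k / k!
        term, total = 1.0, 1.0
        for k in range(1, dof // 2):
            term *= half / k
            total += term
        return math.exp(-half) * total
    # Odd dof: normal tail plus a finite correction series.
    root = math.sqrt(x)
    term, total = root, 0.0
    for r in range(1, (dof - 1) // 2 + 1):
        if r > 1:
            term *= x / (2 * r - 1)
        total += term
    tail = math.erfc(root / math.sqrt(2.0))
    return tail + math.sqrt(2.0 / math.pi) * math.exp(-half) * total


# Hypothetical exon read counts: rows are exons, columns are two groups.
low_coverage = [[20, 22], [10, 6], [15, 16]]
# Same exon proportions at ten times the sequencing depth.
high_coverage = [[10 * n for n in row] for row in low_coverage]

low_stat, low_dof = chi_square_statistic(low_coverage)
high_stat, high_dof = chi_square_statistic(high_coverage)
low_p = chi_square_sf(low_stat, low_dof)
high_p = chi_square_sf(high_stat, high_dof)
print(f"low coverage:  chi2={low_stat:.3f}, dof={low_dof}, p={low_p:.4g}")
print(f"high coverage: chi2={high_stat:.3f}, dof={high_dof}, p={high_p:.4g}")
```

Because the statistic grows in proportion to the read counts when the exon proportions stay the same, the deeper table gives a smaller p-value for an identical splicing pattern. This is why the p-value alone is not a good measure of effect size.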

This module should help you find specific exons that are skipped in one group vs other groups in your samples.

Running Alternative Splicing

Go to the Alternative Splicing module, under Inference, in the NGS menu.

Choose your grouping, thread number (the more threads, the faster it will run), and the minimal gene read count (200 by default). The minimal gene read count ensures that the gene is present in the groups being compared. We recommend setting this to a higher number (for instance, 2000) to remove some false positives. The goal is to find genes that are expressed in both groups being compared, and a higher threshold helps to ensure that.

AlternativeSplicingWindow.png

Results of Alternative Splicing

The module generates a table with p-values and the MaxRatio of the exons in the experiment. The MaxRatio can be used to filter for genes with higher ratios, which indicate an exon that is clearly expressed in one group but not the other.

Before looking at the report, it's recommended to filter for "Confounded=False". This will eliminate any exons whose reads could be confounded between one gene and another (a gene on the other strand, for instance).

Confounded.png

The report can be sorted by MaxExonRatio (right-click and choose sort descending) or by ChiSquareValue. The report gives the name of the gene with the splicing event, as well as the coordinates of the exon being skipped.
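If the report is exported and processed outside the software, the same filtering and sorting can be reproduced in a few lines. The sketch below assumes each row is a dictionary with the column names Confounded, MaxExonRatio, and ChiSquareValue used on this page; the Gene column name, the placeholder gene names other than MTUS1, and all of the values are hypothetical.

```python
def filter_and_sort_report(rows, min_ratio=0.0):
    """Keep unconfounded exons and sort them by MaxExonRatio, descending."""
    kept = [
        row for row in rows
        if not row["Confounded"] and row["MaxExonRatio"] >= min_ratio
    ]
    return sorted(kept, key=lambda row: row["MaxExonRatio"], reverse=True)


# Hypothetical report rows; the numbers are made up for illustration.
report = [
    {"Gene": "GENE_A", "Confounded": True, "MaxExonRatio": 9.5, "ChiSquareValue": 310.0},
    {"Gene": "MTUS1", "Confounded": False, "MaxExonRatio": 7.2, "ChiSquareValue": 250.0},
    {"Gene": "GENE_B", "Confounded": False, "MaxExonRatio": 1.3, "ChiSquareValue": 40.0},
    {"Gene": "GENE_C", "Confounded": False, "MaxExonRatio": 3.8, "ChiSquareValue": 95.0},
]

top_hits = filter_and_sort_report(report, min_ratio=2.0)
for row in top_hits:
    print(row["Gene"], row["MaxExonRatio"], row["ChiSquareValue"])
```

Applying the Confounded filter before sorting keeps confounded exons from appearing at the top of the list only because of their ratio.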

SplicingReport.png

Visualizing Results

Right-click on the gene of interest in your report table to quickly go to that region in the genome browser. You will usually need to zoom out a few levels to see the exons around your exon of interest. Clicking "Trim Introns" from the toolbar is also recommended for visualizing the exon skipping. In the example below, it is very clear that MTUS1-005 is expressed in adipose, but not in liver.

MTUS1.png