DESeq2Test.pdf

From Array Suite Wiki

WARNING: This functionality will be deprecated starting with the July 2021 software release (v11.3). It will be replaced by a new R-based implementation of DESeq2: DESeq2Test(R).pdf.


DESeq General Linear Model

This is the command in Array Studio for running differential expression analysis on RNA-Seq count data. It allows the user to fit a linear model to the data and test for differential expression using a Wald test based on the negative binomial distribution. The function should perform similarly to the DESeq2 R package.

Most of the options are the same as in DESeq V1, but the underlying algorithm/implementation used to estimate dispersion and fold change is different. (You will see less extreme fold changes in DESeq V2.)
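As a rough illustration of the Wald test mentioned above, the sketch below computes a two-sided p-value from a coefficient estimate and its standard error. The numeric values are hypothetical and chosen only for the example; this is a minimal base-R sketch of the general technique, not the Array Studio implementation.

```r
# Hypothetical log2 fold change estimate and its standard error for one gene
estimate  <- 1.2
std_error <- 0.4

# Wald statistic: estimate divided by its standard error
wald_stat <- estimate / std_error

# Two-sided p-value from the standard normal distribution
p_value <- 2 * pnorm(-abs(wald_stat))

wald_stat
p_value
```

A larger absolute Wald statistic gives a smaller p-value, so an estimate that is large relative to its standard error is more likely to be called differentially expressed.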


Deseq2GLM 1.jpg

The OmicSoft implementation is benchmarked against DESeq2 v1.10.1, using the following R calls:

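library(DESeq2)
# datacounts: count matrix (genes x samples); designtable: sample annotations with a "modification" column
dds <- DESeqDataSetFromMatrix(countData = round(datacounts), colData = designtable, design = ~ modification)
dds <- DESeq(dds, minReplicatesForReplace = Inf, modelMatrixType = "standard")
output <- results(dds, independentFiltering = FALSE)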

General

Input/Output

  • The window includes a dropdown box to select the Project and Data object on which the command will be run.
  • Selections can be made on which variables should be included in the General Linear Model (options include "all", "selected", "visible", and any pre-generated Lists).
  • Selections can also be made on which observations should be included in the General Linear Model (options include "all", "selected", "visible", and any pre-generated Lists).

The factor(s) used in your statistical design should contain only the characters A-Z, a-z, 0-9, _ (underscore), and . (period). Please note that some R tools/packages integrated into Array Suite allow only letters, numbers, periods, and underscores in variable names or column names.

Other characters, including ~ + - * / : ^ | [ ] { } ( ) # < > , and space may be interpreted improperly in the statistical design.
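The character rule above can be checked before running the analysis. The sketch below is a minimal base-R example; the helper names `is_safe_design_name` and `sanitize_design_name` and the sample names are hypothetical and are not part of Array Studio.

```r
# Hypothetical helper: TRUE for names made only of A-Z, a-z, 0-9, "_" and "."
is_safe_design_name <- function(x) {
  grepl("^[A-Za-z0-9_.]+$", x)
}

# Hypothetical helper: replace every disallowed character with "_"
sanitize_design_name <- function(x) {
  gsub("[^A-Za-z0-9_.]", "_", x)
}

# Hypothetical names taken from a design table
design_names <- c("Treatment", "time.point", "dose (mg/kg)", "cell-line")

is_safe_design_name(design_names)
sanitize_design_name(design_names)
```

Replacing characters can make two different names identical, so it is worth checking the sanitized names for duplicates before using them.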

DesignTableCharacters.png

Options

If you are not familiar with the General Linear Model (GLM), please also read the general linear model function documentation. The Options section of the Linear Model window includes 3 steps:

  • Step 1, which is required, involves specifying the model. This is where the user will specify the terms of the model, main effects and cross/interaction terms:

Deseq2GLM 2.jpg

The "Columns" section contains columns from the Data object's Design Table. If a column should be treated as a Class term, select the checkbox for that column. By default, Array Studio guesses which columns are Class terms: numeric columns are generally not treated as Class terms, while other columns, such as "Factors", are. Consult a statistician if you are unsure whether a column should be a Class term.

The "Construct Model" section is where the user can add the terms to the model. By selecting terms on the left, the user can use the Add, Cross, and Remove buttons to select the terms for that particular model. Selecting "Add" will add one or multiple terms to the model, whereas "Cross" will cross the terms selected on the left.

Clicking "OK" returns the user to the General Linear Model window, where Step 1 is now complete.
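For readers who think in R formulas, the difference between adding and crossing terms can be sketched with `model.matrix`. This assumes that "Add" corresponds to main-effect terms and "Cross" to an interaction term (`a:b`) in an R formula, which the page does not state explicitly; the design table and column names are hypothetical.

```r
# Hypothetical design table with two Class (factor) columns
design <- data.frame(
  treatment = factor(c("ctrl", "ctrl", "drug", "drug", "ctrl", "ctrl", "drug", "drug")),
  time      = factor(c("t0", "t0", "t0", "t0", "t1", "t1", "t1", "t1"))
)

# Both terms added: main effects only
main_effects <- model.matrix(~ treatment + time, data = design)

# Both terms added, then crossed: main effects plus their interaction
with_cross <- model.matrix(~ treatment + time + treatment:time, data = design)

colnames(main_effects)
colnames(with_cross)
```

The crossed model has one extra column for the interaction, which lets the treatment effect differ between time points.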

  • Step 2, which is also required, involves specifying the contrasts involved. This includes any particular comparisons the user is interested in, along with the tests:

Deseq2GLM 3.jpg

The user has the option of manually building contrasts for each comparison or using the "For each" option to let Array Studio generate multiple estimates at once. In the Options section, the user can decide whether Estimates, Fold changes, Raw p-values, Adjusted p-values, Generate significant list, and Split significant list (by direction) will be created for the Inference report generated by this command.
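A contrast can be thought of as a vector of weights applied to the fitted model coefficients. The sketch below assumes a standard parametrization with one reference level and coefficients on the log2 scale; the coefficient values and contrast names are hypothetical, and the way Array Studio stores contrasts internally may differ.

```r
# Hypothetical fitted coefficients (log2 scale) for a three-level factor "group",
# with level "A" as the reference level
coefs <- c("(Intercept)" = 5.0, groupB = 1.0, groupC = -0.5)

# One contrast vector per comparison: one weight per coefficient
contrast_vectors <- rbind(
  B_vs_A = c(0,  1, 0),
  C_vs_A = c(0,  0, 1),
  C_vs_B = c(0, -1, 1)
)
colnames(contrast_vectors) <- names(coefs)

# Estimate for each contrast = weighted sum of the coefficients
estimates <- drop(contrast_vectors %*% coefs)

# Fold change on the linear scale, assuming log2-scale estimates
fold_changes <- 2^estimates

estimates
fold_changes
```

Comparisons that do not involve the reference level, such as C vs B here, are obtained by combining coefficients rather than by refitting the model.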

Advanced

Ngs DESeqGLM AdvancedWindow02.png

  • Fit type: Either "parametric", "local", or "mean" for the type of fitting of dispersions to the mean intensity.
  • Alpha level: P-value cutoff (default 0.05). This value does not affect the DESeq2 results themselves. It is used only as the significance threshold when "Generate significant list" or "Split significant list" is checked for a test.
  • Minimal replicates for replacing: The minimum number of replicates required before the DESeq2 algorithm is allowed to replace outliers with the trimmed mean value. For example, if you have 7 replicates in your dataset and the algorithm finds an outlier expression value for some gene, the outlier value will be replaced with the trimmed mean for that gene. The model is then refit on these new values for differential expression testing. This parameter is called minReplicatesForReplace in the DESeq function. More details are available in the DESeq2 manual: https://bioconductor.org/packages/release/bioc/manuals/DESeq2/man/DESeq2.pdf
  • PFilterAlpha: corresponds to alpha in DESeq2::results(DESeq), the significance cutoff used for optimizing the independent filtering (by default 0.1). If the adjusted p-value cutoff (FDR) will be a value other than 0.1, alpha should be set to that value.
  • Perform independent filtering: Filter genes with a low overall count
    • Genes that were filtered by Independent Filtering will have fold-change, estimate, and raw P-value, but no adjusted P-value (FDR).
  • Export Dispersion Table
  • Export Wald Table: A Wald test for significance is provided as the default inference method.
  • Export outliers: Export an outlier column in the result table that flags genes considered outliers based on Cook's distance. Genes flagged as outliers will have fold-changes, but no P-value calculations.
  • Export group means: Export a column to show mean values for each group in the result table
  • Export maximal group means per contrast: Export a column showing the maximal group mean for each comparison, that is, the larger of the case-group mean and the control-group mean. Means are calculated on normalized counts, using the same normalization as DESeq2: a sizeFactor is calculated for each column/observation, and each count value is divided by the sizeFactor of its column.
  • Export contrast vector table: Export a table with contrast vectors
  • Use alphabetical order factor level: Enable sorting of factors by alphabetical order
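The normalization behind the group-mean options can be sketched in base R. The example below uses a simplified median-of-ratios calculation for the size factors and then takes the larger of the two group means per gene; the count matrix and group labels are hypothetical, and the code is an illustration of the idea rather than the exact Array Studio or DESeq2 code.

```r
# Hypothetical raw counts: 4 genes (rows) x 4 observations (columns)
counts <- matrix(
  c(10,  20,  10,  20,
    100, 200, 100, 200,
    50,  100, 50,  100,
    20,  40,  80,  160),
  nrow = 4, byrow = TRUE,
  dimnames = list(paste0("gene", 1:4), c("ctrl1", "ctrl2", "case1", "case2"))
)

# Log geometric mean of each gene across observations
log_geo_means <- rowMeans(log(counts))
usable <- is.finite(log_geo_means)  # skip genes that contain a zero count

# Size factor per observation: median ratio of its counts to the geometric means
size_factors <- apply(counts, 2, function(col) {
  exp(median(log(col[usable]) - log_geo_means[usable]))
})

# Normalized counts: each column divided by its size factor
normalized <- sweep(counts, 2, size_factors, "/")

# Group means on normalized counts, then the larger mean per gene
group <- factor(c("ctrl", "ctrl", "case", "case"))
group_means <- sapply(levels(group), function(g) {
  rowMeans(normalized[, group == g, drop = FALSE])
})
max_group_mean <- apply(group_means, 1, max)

size_factors
max_group_mean
```

In this toy data the second and fourth observations were sequenced twice as deeply, so dividing by the size factors removes the depth difference before the group means are compared.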


Output Results

  • A DESeq Inference Report will be generated, containing fold-change and p-values for each tested variable. The default visualization, a volcano plot, will also be generated.

DESeqGLM3 02.png

InferenceReport VolcanoPlot.png

  • DispersionTable will be generated under the "Summary" folder.
  • DispersionScatterPlot will be generated automatically for the DispersionTable

Missing Data in the Inference Report

You may notice that some (or many) genes in the inference report are missing one or more values. In essence, if the DESeq algorithm determines that it doesn't make sense to calculate a value, (the gene is not expressed, the gene is flagged as an outlier, etc.) the value won't be reported. Details can be found in the DESeq2 manual.

  • Genes with no counts will not have fold-change or P-value calculations
  • Genes flagged as outliers (as determined by DESeq2 using Cook's distance) will have fold-change, but no P-value calculations
  • Genes filtered by Independent Filtering (as determined by DESeq2) will have raw P-value, but no Adjusted P-value.
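The three cases above can be illustrated with a small table in which NA marks a value that was not calculated. The sketch assumes Benjamini-Hochberg adjustment (the DESeq2 default) applied only to genes that pass the filter; the table, column names, and values are hypothetical.

```r
# Hypothetical results: g3 has no counts, g4 is an outlier, g5 was filtered out
results_table <- data.frame(
  gene          = c("g1", "g2", "g3", "g4", "g5"),
  log2FC        = c(1.8, -0.4, NA, 2.5, 0.1),
  raw_p         = c(0.001, 0.30, NA, NA, 0.60),
  passes_filter = c(TRUE, TRUE, FALSE, FALSE, FALSE)
)

# Adjust p-values only for genes that pass the filter and have a raw p-value
adj_p <- rep(NA_real_, nrow(results_table))
keep <- results_table$passes_filter & !is.na(results_table$raw_p)
adj_p[keep] <- p.adjust(results_table$raw_p[keep], method = "BH")
results_table$adj_p <- adj_p

results_table
```

Because filtered genes are excluded before the adjustment, the multiple-testing correction is applied to fewer tests, which is the purpose of independent filtering.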

DispersionScatterPlot.png

Inconsistent Results for Factor Columns

There is a flaw in the current implementation of DESeq2 v1.10 for multi-level comparisons: the alphabetical ordering of factor levels affects the calculations. This issue may also have been present in the DESeq2 R version we benchmarked against when the feature was implemented in 2015.

Despite the flaw in this implementation, the results and interpretations are still valid, for two reasons:

  • Thousands of projects have been added to OmicSoft Lands, and as part of the QC process we confirm that the biological findings in the source paper are consistent with our results.
  • The divergence between DESeq2 analyses depending on which level is the base level is far less than the difference between results from DESeq2 and Voom.
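To see why level ordering matters, the base-R sketch below shows how alphabetical ordering picks the base level of a factor and how an explicit base level changes the model coefficients. The condition labels are hypothetical, and how the "Use alphabetical order factor level" option maps onto this behavior is an assumption.

```r
# Hypothetical condition labels for six observations
condition <- c("treated", "control", "treated", "vehicle", "control", "vehicle")

# Default: levels sorted alphabetically, so the first level is the base level
alphabetical <- factor(condition)
levels(alphabetical)

# Explicit base level: move "vehicle" to the front
releveled <- relevel(alphabetical, ref = "vehicle")
levels(releveled)

# The model coefficients are defined relative to the base level
colnames(model.matrix(~ alphabetical))
colnames(model.matrix(~ releveled))
```

With a different base level the coefficients describe different comparisons, which is one way multi-level results can shift with level ordering.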


The July 2021 software release (v11.3) will support analysis through the latest R implementation of DESeq2.

Omicscript

DESeq2Test
