GeneSetAnalysis.pdf

From Array Suite Wiki

Gene Set Analysis (GSA)

Overview

Gene Set Analysis helps users who have their own gene set identify, among tens of thousands of comparisons, those that show similar gene set enrichment. Gene Set Analysis can be run from two tabs of Array Studio: Analysis and Land. The Array Studio Gene Set Analysis function lets a user upload a list of gene identifiers (with or without fold-change and p-values from inference reports) to the server, then searches OmicSoft's Gene Set database and returns gene sets that overlap significantly with the input gene list. The function uses Fisher's exact test to compare plain lists, and a Wilcoxon test when fold-change/p-values are provided. Scroll to the appropriate section below for usage instructions for the Analysis and Land tabs.

Analysis Tab

To access the GeneSet Analysis function, click Integration | Gene Set Analysis:

Integration GeneSetAnalysis Menu.png

Tip: The analysis must be run within a server project.


Gene Set Analysis Wizard

The Gene Set Analysis Wizard will guide you through the steps to query the Gene Set databases, using lists of Gene IDs/Symbols, or inference tables with fold-change/p-values.

The OmicSoft GeneSet Database includes curated datasets from Land collections and signatures derived from The Broad Institute's Molecular Signatures collection, and can also include internal GeneSet databases.


Step 1: Choose Input Type

GeneSetAnalysisWizard01.png

Currently, the Gene Set Analysis function supports queries using mouse and human identifiers. After choosing the collection to which your identifiers belong, select the target Gene Sets to query and the format of your query (list(s) or a table with p-value/fold-change values).

  • Universal set name: Users can choose from HumanUniversal_V1 and MouseUniversal_V1; input identifiers (such as probeset ID, EntrezID, EnsemblID) will be mapped to the gene symbols in the universal set based on the OmicSoft ID mapping database.
  • Choose gene set database: The user can select multiple gene set databases (Ctrl+click):
    • Company-curated databases, which each company can add and manage on its server
    • GeneSets from the public domain, such as MSigDB
    • Land-derived gene signature sets. Availability of GeneSets derived from OncoLand, DiseaseLand, and SingleCellLand depends on the subscription status for each Land.
  • Choose user's input type: list of IDs, bidirectional list, or a table with numeric values
  • Choose ID type: gene symbol, probeset ID, EntrezID, or EnsemblID


Step 2: Input gene signature

Depending on whether you have Gene Set list(s) or a table, you will use different interfaces to import your data.

Input a list

If you have one list of significant genes, you can copy and paste your list, load it from a file, or import it from an open Array Studio solution:

GeneSetAnalysisWizard02.png

Input two bi-directional lists

If you have two lists of significant genes (up- and down-regulated), you can copy and paste the lists, or import them from a file or solution:

GeneSetAnalysisWizard03.png

Input gene signatures with numeric values

If you have fold-change/p-values for your gene lists, you can use these values to determine the cutoffs for your significant gene list. You can import from a local tab-delimited file or an inference table in an open project, then select the columns and cutoffs to identify significant genes.

GeneSetAnalysisWizard04.png

  • Value Column (directional): The column that indicates up- or down-regulation (usually log2(fold-change)).
  • Rank Column (e.g. pValue): A column that indicates a score (usually p-value) by which to rank the genes in the Gene Set. Lower is better.
  • ID Column: The table column matching the specified IDs selected in step 1 (e.g. EnsemblID, probeset ID, Gene symbol).
  • Absolute (value) >=: The minimum absolute Value for a gene in the query to be included in the final Gene Set.
  • Rank column value <=: The maximum Rank for a gene in the query to be included in the final Gene Set.
  • Maximal gene set size: At most, this many genes will be included in the final Gene Set.
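The cutoffs above can be pictured with a minimal sketch. It assumes that the size limit is applied after sorting the passing genes by the Rank column and that it counts up- and down-regulated genes together; the wiki does not state either detail. The names `Row` and `select_query_genes` and the gene IDs are hypothetical and are not part of Array Studio.

```python
from typing import NamedTuple


class Row(NamedTuple):
    gene_id: str
    value: float  # Value Column, e.g. log2(fold-change)
    rank: float   # Rank Column, e.g. p-value (lower is better)


def select_query_genes(rows, min_abs_value, max_rank, max_size):
    """Apply the three cutoffs, then split the kept genes by direction."""
    passing = [
        r for r in rows
        if abs(r.value) >= min_abs_value and r.rank <= max_rank
    ]
    # Assumption: the best-ranked genes are kept when the list is too long.
    passing.sort(key=lambda r: r.rank)
    kept = passing[:max_size]
    up = [r.gene_id for r in kept if r.value > 0]
    down = [r.gene_id for r in kept if r.value < 0]
    return up, down


rows = [
    Row("GENE_A", 2.1, 0.001),
    Row("GENE_B", -1.8, 0.004),
    Row("GENE_C", 0.3, 0.0005),  # fails the absolute-value cutoff
    Row("GENE_D", 1.5, 0.20),    # fails the rank cutoff
    Row("GENE_E", -2.5, 0.010),  # passes, but is dropped by the size limit
]
up, down = select_query_genes(rows, min_abs_value=1.0, max_rank=0.05, max_size=2)
print(up, down)
```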


Step 3: Specify Analysis Parameters

GeneSetAnalysisWizard11.png

  • P-value cutoff for fisher exact test: Specify the test cutoff for each comparison between query and target Gene Sets.
  • Multiplicity adjustment: Select the method to adjust for multiple testing.
  • Attach target gene sets in the report: Select whether or not to include target gene sets in the output report.
    • Clicking on a row of the output GeneSet Analysis report will list all genes in the target Gene Set in the Target Genes tab, and a column will indicate whether or not each gene was included in the final Target Gene Set.
  • Test full target sets: By default, the Gene Set analysis function will query the target database multiple times, removing less significant query genes, to identify the query gene set that gives the most significant results. To disable this function, check this box.
  • Generate Volcano plots of enrichment scores: Select whether or not to output a Volcano plot of the Gene Set overlap for each target Gene Set (see below for details).
  • Perform regression analysis: Implemented after server version 10.0.1.73. The user can specify whether to run a linear regression analysis on the overlapping genes. This module checks:
    • 1. whether the input data has associated fold-change values (regression analysis won't run with only a list of genes)
    • 2. whether the number of overlapping genes is >= 5
    • If both conditions are met, the module runs the regression analysis, adds two columns to the report table, and shows a regression plot in the details view. See the output section for more details.
  • Output Name (optional): Specify a name for the output reports.
Publishing Query Gene Sets to Database (Optional)

The user has the option to publish the query gene set to an internal curated gene set database.

If selected, several options for publishing are enabled:

  • Replace existing gene sets: If a Gene Set with the specified name already exists for the selected Database, replace the Gene Set with the current search.
  • Database: Select a destination database for the new Gene Set. Internal Gene Sets can be managed with Manage Gene Sets.
  • Source: Select or type in the category that best describes the source of the Gene Set.
  • Type: Select the type of comparison that generated this Gene Set, such as Condition-specific gene expression, Disease-specific gene expression, etc.
  • Project: Specify the name of the project from which this Gene Set was derived.
  • Name: Specify a name for the Gene Set.
  • Description: Provide a description for the Gene Set, so that others can understand the significance of the genes.
  • Tag: Provide a "tag" that succinctly summarizes the Gene Set, to allow grouping with other similar Gene Sets in the database.


Step 4: Confirm the Query Gene Set

Integration GeneSetAnalysis Step4Window.png

Confirm that your query genes were properly mapped and identified, then click Finish.

The analysis job will be sent to the server job queue. It will scan each Gene Set in the selected databases and return Gene Sets with significant overlaps.



Analysis Details

Fisher's exact test is used for Gene Set Analysis. The Human and Mouse universal sets will match Gene IDs for the two species, to allow identification of mouse-derived Gene Sets using a human query, and vice-versa.

Considering two gene sets A and B, the table for Fisher's exact test is constructed as follows (each cell is a gene count):

Fisher Table    | In Set B    | Not in Set B
In Set A        | A∩B         | A - (A∩B)
Not in Set A    | B - (A∩B)   | UniversalSet - (A∪B)

Fisher's exact test is performed one-sided, with alternative = "greater".
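As an illustration of the table and the one-sided test, the sketch below builds the four cells from two gene sets and computes the "greater" p-value as a hypergeometric tail, using only the Python standard library. The function name `fisher_greater` and the toy gene names are hypothetical; this is not the server's implementation, only a small instance of the same test.

```python
from math import comb


def fisher_greater(set_a, set_b, universe_size):
    """Return the 2x2 table and the one-sided ("greater") Fisher p-value."""
    a, b = set(set_a), set(set_b)
    k = len(a & b)
    n_a, n_b = len(a), len(b)
    table = [
        [k, n_a - k],                           # in A: in B / not in B
        [n_b - k, universe_size - len(a | b)],  # not in A: in B / not in B
    ]
    # Probability of seeing k or more shared genes by chance
    # (hypergeometric upper tail).
    total = comb(universe_size, n_b)
    tail = sum(
        comb(n_a, i) * comb(universe_size - n_a, n_b - i)
        for i in range(k, min(n_a, n_b) + 1)
    )
    return table, tail / total


query = {"g1", "g2", "g3", "g4", "g5"}
target = {"g3", "g4", "g5", "g6", "g7"}
table, p_value = fisher_greater(query, target, universe_size=20)
print(table, p_value)
```

With a universe of 20 genes and 3 shared genes out of 5 and 5, the table is [[3, 2], [2, 13]] and the p-value is 1126/15504, about 0.073.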

If both input and target GeneSets are non-directional, which is the simplest case, only one test will be performed.

Gsa01.png

If the input GeneSet is directional (up-regulated and down-regulated lists), and the target GeneSet is non-directional, three independent analyses will be performed for the input gene set: one for the "Up" input list, one for the "Down" input list, and one for the full input list.

Gsa02.png

If the input GeneSet is non-directional, but the target gene set is directional, three independent analyses will be performed for the input gene set: one for target Up list, one for target Down list, and one for the full target list.

Gsa03.png

If both input and target gene sets are directional, five independent analyses will be performed as shown below:

Gsa04.png
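The four cases can be summarized as a small lookup. The pairing used for the five-test case (full list plus Up/Up, Down/Down, Up/Down and Down/Up) is inferred from the enrichment-score description later on this page rather than stated here, so treat it as an assumption. `planned_tests` is a hypothetical helper name.

```python
def planned_tests(query_directional, target_directional):
    """List the (query subset, target subset) pairs that would be tested."""
    if query_directional and target_directional:
        # Assumed pairing: full lists, two correlated and two anti-correlated tests.
        return [("Full", "Full"), ("Up", "Up"), ("Down", "Down"),
                ("Up", "Down"), ("Down", "Up")]
    if query_directional:
        return [("Up", "Full"), ("Down", "Full"), ("Full", "Full")]
    if target_directional:
        return [("Full", "Up"), ("Full", "Down"), ("Full", "Full")]
    return [("Full", "Full")]


for q in (False, True):
    for t in (False, True):
        print(q, t, len(planned_tests(q, t)))
```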

Iterative searching for the best target gene list

Directional target GeneSets are sorted, so this Gene Set Analysis function will identify rank-based cutoffs for the "best" number of top target genes to return the best Fisher P value between the target and input GeneSets.

  • In the sorted (smallest to largest p-value) target list, the analysis will start with the gene at the top of the target GeneSet list.
  • If the target gene overlaps with the input GeneSet, then a Fisher exact test and P-value will be calculated.
    • If not, then analysis will continue with the next gene.
  • The search stops when the maximum number of overlapping genes is reached.
  • The P-values for each number of target genes considered will be evaluated to identify the "target cutoff" that results in the best score.

For example, if there are 30 overlapping genes between your input gene set and a target gene set, the search will calculate a p-value between your full Gene Set list and the top N target genes for each N at which there are 0,1,...,30 overlapping genes. The best of these P values is picked and reported. Users can turn off this search with the option Test full target sets (best cut-points will not be searched).
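The search described above can be sketched as follows. This is a simplified illustration, not the server's code: it assumes the target list is already sorted by p-value, evaluates a one-sided Fisher (hypergeometric tail) p-value each time a new overlapping gene is reached, and keeps the smallest one. The names `hypergeom_tail` and `best_target_cutoff` and the toy gene IDs are hypothetical.

```python
from math import comb


def hypergeom_tail(k, n_query, n_target, universe_size):
    """One-sided Fisher p-value: chance of k or more overlapping genes."""
    total = comb(universe_size, n_target)
    tail = sum(
        comb(n_query, i) * comb(universe_size - n_query, n_target - i)
        for i in range(k, min(n_query, n_target) + 1)
    )
    return tail / total


def best_target_cutoff(query, ranked_target, universe_size):
    """Return (best p-value, number of top target genes used, overlap)."""
    query = set(query)
    max_overlap = len(query & set(ranked_target))
    best = None
    overlap = 0
    for n_top, gene in enumerate(ranked_target, start=1):
        if gene not in query:
            continue  # no new overlap: move on to the next target gene
        overlap += 1
        p = hypergeom_tail(overlap, len(query), n_top, universe_size)
        if best is None or p < best[0]:
            best = (p, n_top, overlap)
        if overlap == max_overlap:
            break  # every overlapping gene has been reached
    return best


query = {"g1", "g2", "g3", "g4"}
ranked_target = ["g1", "g2"] + [f"x{i}" for i in range(17)] + ["g3"]
best = best_target_cutoff(query, ranked_target, universe_size=100)
print(best)
```

In this toy case the two top-ranked target genes both overlap with the query, so the cutoff at the top 2 genes (p = 6/4950) beats the cutoff at all 20 genes with 3 overlaps.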



Output Results

Two Table objects will be created: a "Gene Set Analysis Result" report and a "Gene Set Enrichment Analysis" report. The latter will not be created if you deselected "Generate Volcano plots of enrichment scores".

Each Comparison is assigned a Category to help you concentrate on GeneSets of interest, such as tissue-specific GeneSets (Tissue-Specific Gene Expression and Tissue1 vs Tissue2) or responses to treatment (Resistant vs. Sensitive, Responder vs. Non-Responder, Treatment vs. Control, and Treatment1 vs. Treatment2).

Gene Set Analysis Result

A summary report will be generated, listing the overlap between your input Gene Set (subset by fold-change direction, if compared to directional target set), and information about the target Gene Set. Clicking on a Gene Set row (ID) will display a Venn diagram of the overlap between Query and Target Gene Sets, as well as additional details about the overlap Gene Set.

  • GeneSet Analysis Table Report (if Perform Regression Analysis option was not checked in step 3)

GeneSetAnalysisWizard06.png

  • GeneSet Analysis Table Report (if Perform Regression Analysis option was checked in step 3 and input is a table with fold change; this feature was implemented after server version 10.0.1.73)

GeneSetAnalysisWizard12.png


In addition, several categories of Gene Set "hits" are displayed, separated by comparison type (e.g. "CellType1 vs CellType2", "Treatment vs. Control", etc.). Top hits are ranked by p-value, with symbols indicating whether each is a full-list match, a match in only a certain direction (e.g. Up-Up), or anti-correlated.

  • Top Hits: overall, overall up and down, top hits by gene set type

GeneSetAnalysisWizard07.png



Gene Set Enrichment Analysis Result

If you provided an input Gene Set with "directions" and selected "Generate Volcano plots", a second data object will be created that shows the trend for up- and down-regulated genes in each target set.

Since each Query-Target Gene Set comparison will be compared for the full set, as well as directional subsets (up- and down-regulated, for correlated and anti-correlated sets), some target Gene Sets will show a stronger statistical result when comparing only down-regulated targets to up-regulated query genes, for example.

The GeneSet Enrichment Analysis result will report the best hit for the five tests for each query/target comparison.

In addition, a Volcano plot will be generated, where the Y-axis is the -log10(p-value) for the best Gene Set subset, and the X-axis is the "Enrichment score" for correlated vs. anti-correlated genes. Gene Sets that are anti-correlated overall will be to the left, while Gene Sets that are correlated overall will be to the right.

Integration GeneSetAnalysis GSEAvolcano.png

Tip: All p-values are first converted to -log10(p-value). The "enrichment score" is then calculated from the converted values as (p-value(up/up) + p-value(down/down)) - (p-value(up/down) + p-value(down/up)).
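A short worked example of the score, assuming the four directional p-values are already available. The function name `enrichment_score` and the input values are made up for illustration; a p-value of exactly 0 would need special handling, which is not shown.

```python
from math import log10


def enrichment_score(p_up_up, p_down_down, p_up_down, p_down_up):
    """Correlated evidence minus anti-correlated evidence, on the -log10 scale."""
    def score(p):
        return -log10(p)

    correlated = score(p_up_up) + score(p_down_down)
    anti_correlated = score(p_up_down) + score(p_down_up)
    return correlated - anti_correlated


# Strong Up/Up and Down/Down overlap, weak anti-correlated overlap:
# (6 + 4) - (1 + 0) = 9, so the gene set plots to the right.
score = enrichment_score(1e-6, 1e-4, 0.1, 1.0)
print(round(score, 6))
```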


Land Tab

Search multiple genes or pathway in Land

  • Perform a search of multiple genes or pathway:

GeneSetAnalysis06.png

  • Gene set analysis can be found under Enrichment Analysis. The analysis is based on Fisher's exact test.

GeneSetAnalysis07.png

Using Custom Gene Set

Users can create their own gene sets. Gene Sets can be imported and managed via "Manage => Genes => Manage Gene Sets".

GeneSetAnalysis01.png

A gene set must have at least one column containing gene symbols. Other related columns can be included as well. In particular, including fold-change and P-Value measurements can improve the utility of the gene sets.

The following terms are recognized as column headers in GeneSet files:

  • Fold-change (recognized Column titles include "Log2FoldChange", "FoldChange", and "Estimate")
  • Raw p-value (recognized Column titles include "RawPValue","p-value", "pvalue", and "PValue")
  • Adjusted p-value (recognized Column titles include "AdjustedPValue")
  • General p-value (recognized column titles include "GeneralPValue")

If any of these terms is found among the column headers, the data in those columns will be used in GeneSet Enrichment Analysis tests.

An example Gene Set is shown below:

GeneSetAnalysis02.png

(Note: The table may display spaces between words in a header column to aid readability, but terms such as "FoldChange" and "RawPValue" will only be recognized as a single term, and will be preserved in the actual table object. Hover your mouse over a column header to confirm proper formatting.)
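The header rules can be checked before import with a small script. The sketch below assumes an exact, case-sensitive match against the titles listed above, which the wiki does not confirm; `RECOGNIZED_HEADERS`, `classify_headers`, and the "GeneSymbol" column name are hypothetical.

```python
# Column titles listed on this page, grouped by the measurement they denote.
RECOGNIZED_HEADERS = {
    "fold_change": {"Log2FoldChange", "FoldChange", "Estimate"},
    "raw_p_value": {"RawPValue", "p-value", "pvalue", "PValue"},
    "adjusted_p_value": {"AdjustedPValue"},
    "general_p_value": {"GeneralPValue"},
}


def classify_headers(headers):
    """Map each recognized header to its role; unrecognized headers are skipped."""
    roles = {}
    for header in headers:
        for role, names in RECOGNIZED_HEADERS.items():
            if header in names:  # assumption: exact, case-sensitive match
                roles[header] = role
    return roles


# "Raw P Value" contains spaces, so it is not recognized as a single term.
roles = classify_headers(["GeneSymbol", "Log2FoldChange", "Raw P Value", "PValue"])
print(roles)
```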

Select Gene Set Analysis under Enrichment Analysis.

GeneSetAnalysis09.png

Choose the gene set to analyze:

GeneSetAnalysis10.png

Statistical Tests

A different statistical test is chosen for Gene Set Analysis depending on whether P-Value and/or fold-change values are available.

Fisher's exact test

  • If neither P-Values nor fold changes are provided in the user's gene set, Fisher's exact test will be used for gene set analysis in Land.
  • When the "Search Multiple Genes" function is used, Fisher's exact test will be used for gene set analysis in Land.
    • GeneSetAnalysis06.png
  • All genes in the user's gene set will be considered significant genes.
  • For each comparison table already included in Land, significant genes are chosen by the criteria below:
    • P-Value < 0.05
    • fold change >= 1.25
    • If more than 500 genes pass the above filters, the 500 genes with the smallest P-Values will be chosen as significant genes.

Wilcoxon test

  • If P-Values and/or fold changes are provided in the user's gene set, Wilcoxon tests (nonparametric) will be used for gene set analysis in Land.
  • If Estimates (log2 fold changes) are provided, the values must be signed.
  • For each comparison table already included in Land, significant genes are chosen by the criteria below:
    • P-Value < 0.05
    • fold change >= 1.25
    • If more than 100 genes pass the above filters, the 100 genes with the smallest P-Values will be chosen as significant genes.
  • These significant genes are used to divide the user's gene set into a significant group and an insignificant group, and the Wilcoxon test is performed on these two groups.
  • If only P-Values are available in the user's gene set, then P-Values will be used to perform the Wilcoxon test.
  • If only fold changes are available in the user's gene set, then fold changes will be used to perform the Wilcoxon test.
  • If both P-Values and fold changes are available in the user's gene set, the Wilcoxon test will be performed independently on P-Values and fold changes, and the smaller P-Value will be picked as the final P-Value.
  • Estimates (log2 fold changes) and fold changes will return the same results, as the Wilcoxon test is a nonparametric test.
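The grouping and test described above can be sketched with the standard library alone. Several details here are assumptions, since the wiki does not specify them: fold changes in the Land comparison are treated as signed and filtered by absolute value, and the Wilcoxon test is taken to be a two-sided rank-sum test with a normal approximation and no tie correction. All function names and the toy data are hypothetical.

```python
from math import erfc, sqrt


def land_significant_genes(comparison, max_genes=100):
    """comparison maps gene -> (p_value, fold_change); apply the Land filters."""
    passing = [(p, gene) for gene, (p, fc) in comparison.items()
               if p < 0.05 and abs(fc) >= 1.25]
    return {gene for _, gene in sorted(passing)[:max_genes]}


def average_ranks(values):
    """Rank values from 1, giving tied values their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        for k in range(i, j + 1):
            ranks[order[k]] = (i + j) / 2 + 1
        i = j + 1
    return ranks


def rank_sum_p(group_1, group_2):
    """Two-sided Wilcoxon rank-sum p-value (normal approximation)."""
    n1, n2 = len(group_1), len(group_2)
    ranks = average_ranks(list(group_1) + list(group_2))
    u1 = sum(ranks[:n1]) - n1 * (n1 + 1) / 2
    z = (u1 - n1 * n2 / 2) / sqrt(n1 * n2 * (n1 + n2 + 1) / 12)
    return erfc(abs(z) / sqrt(2))


user_p_values = {"G1": 0.001, "G2": 0.002, "G3": 0.003,
                 "G4": 0.4, "G5": 0.5, "G6": 0.6}
comparison = {"G1": (0.01, 2.0), "G2": (0.02, -1.5), "G3": (0.03, 1.3),
              "G4": (0.2, 1.1), "G9": (0.001, 1.1)}

significant = land_significant_genes(comparison)
in_group = [v for g, v in user_p_values.items() if g in significant]
out_group = [v for g, v in user_p_values.items() if g not in significant]
p_value = rank_sum_p(in_group, out_group)
print(sorted(significant), p_value)
```

Passing `max_genes=500` to `land_significant_genes` gives the selection used in the Fisher's exact test case instead.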

Gene Set Analysis Result

Gene Set Analysis (plot)

Users can select any comparisons they are interested in, and a table will be displayed under the plot to show all the details.

GeneSetAnalysis04.png

Gene Set Analysis (Table)

GeneSetAnalysis05.png


OmicScript

GeneSetAnalysis_Oscript
