LinearModel.pdf
From Array Suite Wiki
General Linear Model
Overview
The General Linear Model function is the main command in Array Studio for running statistical analysis on data. It is the recommended option for most, if not all, experimental designs. It models the data on a variable-by-variable basis, and the user can specify a fixed, mixed, or random model. Estimates, fold changes, p-values, a prediction dataset, and confidence intervals can be generated using this model. Users who think in terms such as One-way ANOVA, Two-way ANOVA, Repeated Measures ANOVA, or ANCOVA should also use this model.
To run this module, select MicroArray | Inference | General Linear Model from the menu.
Input Data Requirements
This module works on -Omic data types that follow a normal distribution. Most imported microarray data will follow a normal distribution, because the signal data are generally log2-transformed during import.
Users can check whether their data are normally distributed: Array Studio can plot the kernel density of the samples and summarize skewness and kurtosis to show how well the data approximate a normal distribution.
For data that do not follow a normal distribution, the user can consider transformation or normalization.
For NGS count data that follow a negative binomial distribution, the user can use inference methods such as DESeq2.
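The skewness and kurtosis check described above can be illustrated outside Array Studio. The following is a minimal sketch in plain Python, not Array Studio's own implementation: the function name `skewness_and_kurtosis` and the example signal values are hypothetical, and population (biased) moments are assumed. Values near 0 for both statistics suggest an approximately normal shape.

```python
import math

def skewness_and_kurtosis(values):
    """Return (skewness, excess kurtosis) computed from population moments."""
    n = len(values)
    mean = sum(values) / n
    m2 = sum((v - mean) ** 2 for v in values) / n  # variance
    m3 = sum((v - mean) ** 3 for v in values) / n  # third central moment
    m4 = sum((v - mean) ** 4 for v in values) / n  # fourth central moment
    skew = m3 / m2 ** 1.5
    excess_kurtosis = m4 / m2 ** 2 - 3.0           # 0 for a normal distribution
    return skew, excess_kurtosis

# Hypothetical raw intensities for one sample, before and after log2 transformation.
raw_signal = [120.0, 150.0, 200.0, 400.0, 900.0, 3000.0, 12000.0]
log2_signal = [math.log2(v) for v in raw_signal]

raw_skew, raw_kurt = skewness_and_kurtosis(raw_signal)
log_skew, log_kurt = skewness_and_kurtosis(log2_signal)
print(f"raw:  skewness={raw_skew:.2f}, excess kurtosis={raw_kurt:.2f}")
print(f"log2: skewness={log_skew:.2f}, excess kurtosis={log_kurt:.2f}")
```

In this made-up example the log2-transformed values are less skewed than the raw values, which is the usual motivation for the log2 transformation applied during import.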
General Options
Note: for the moderated t-test (limma package in R) option, all data in the Microarray table must be numeric. If there are dots in the table, see [Impute](http://www.arrayserver.com/wiki/index.php?title=Impute.pdf) to replace all dots with 0.
Input/Output
- Project & Data: The window includes a dropdown box to select the Project and Data object to be filtered.
- Variables: Selections can be made on which variables should be included in the filtering (options include All variables, Selected variables, Visible variables, and Customized variables (select any pre-generated Lists)).
- Observations: Selections can be made on which observations should be included in the filtering (options include All observations, Selected observations, Visible observations, and Customized observations (select any pre-generated Lists)).
- Output name: The user can choose to name the output data object.
The factor(s) used in your statistical design should only use characters in A-Z a-z 0-9 _ (underline) and . (period). Please note, for some R tools/packages implemented into Array Suite, only letters, numbers, dots, and underline characters are allowed for variable names or column names.
Other characters, including ~ + - * / : ^ | [ ] { } ( ) # < > , and space may be interpreted improperly in the statistical design.
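The naming rule above can be checked before importing a design table. This is a minimal sketch, assuming the factor names are available as plain strings; the helper names `invalid_factor_names` and `sanitize` are hypothetical and not part of Array Studio.

```python
import re

# Allowed characters per the rule above: A-Z a-z 0-9 _ (underline) and . (period).
ALLOWED_NAME = re.compile(r"[A-Za-z0-9_.]+")

def invalid_factor_names(names):
    """Return the names that contain at least one disallowed character."""
    return [name for name in names if not ALLOWED_NAME.fullmatch(name)]

def sanitize(name):
    """Replace every disallowed character with an underscore (one possible fix)."""
    return re.sub(r"[^A-Za-z0-9_.]", "_", name)

factors = ["time", "dose (mg)", "cell.line", "treated+ctrl"]
for bad in invalid_factor_names(factors):
    print(f"{bad!r} -> {sanitize(bad)!r}")
```

Replacing characters with an underscore is only one option; renaming the factor by hand in the Design Table is equally valid.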
Options
The Options section of the Linear Model window includes three steps, which must be completed in order:
Step 1(required): specify model
This is where the user will specify the terms of the model. Getting the terms correct for each individual experiment is key to a successful General Linear Model. Clicking Specify Model opens the Specify Linear Model window, which has two sections:
- Columns: This section contains columns from the Data object's Design Table.
- Class: If the column should be considered a Class term, the checkbox for that column can be selected. By default, Array Studio will guess what constitutes a Class term. In general, numeric columns will not be considered Class terms by default, while other columns, such as "Factors", will be. Users who are not sure whether a column should be a Class term should consult a statistician. In the example shown below, time should be considered a Class term, but because Array Studio made it a numeric column, it is not by default. Changing this in the Design Table will affect the default behavior here.
- Term: the factors in the design table.
- Construct Model: This section is where the user adds terms to the model. After selecting terms on the left, the user can use the Add, Cross, Nest, and Remove buttons to build the terms for that particular model.
- Add: Clicking this button will add the selected terms to the model.
- Cross: Clicking this button will cross the terms selected on the left (this is discussed in more detail later).
- Nest: Clicking this button will nest the selected term on the left panel to the selected term on the right panel.
- Remove: Clicking this button will remove selected terms in right panel.
- For the example shown below, the user would want to cross time and treatment, as the interaction between time and treatment is of interest in this experiment.
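To make the effect of crossing two Class terms concrete, the sketch below lists the level combinations that a crossed term such as time:treatment spans. It is an illustration only: the design rows, the `cross_levels` helper, and the `a:b` label format are assumptions, not Array Studio's internal representation.

```python
from itertools import product

def cross_levels(design, factor_a, factor_b):
    """List every level combination of two class factors, labelled 'a:b'."""
    levels_a = sorted({row[factor_a] for row in design})
    levels_b = sorted({row[factor_b] for row in design})
    return [f"{a}:{b}" for a, b in product(levels_a, levels_b)]

# Hypothetical design table with two class factors.
design = [
    {"sample": "s1", "time": "0h", "treatment": "control"},
    {"sample": "s2", "time": "0h", "treatment": "drug"},
    {"sample": "s3", "time": "24h", "treatment": "control"},
    {"sample": "s4", "time": "24h", "treatment": "drug"},
]

print(cross_levels(design, "time", "treatment"))
```

Each combination corresponds to one cell of the interaction, which is why crossing is the right choice when the treatment effect is expected to differ between time points.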
Step 2(required): specify tests
This includes any particular comparisons the user is interested in, along with other tests (F-tests, variance, etc.).
TTest (Estimate)
From the left side the user can pick from the terms in the model (from Step 1). The user has the option of manually building estimates (ttests) for each comparison or using the "For each" option to let Array Studio generate multiple estimates at once. This is discussed in more detail below. In the Options section, the user can decide whether Estimates, Fold changes, Raw p-values, Adjusted p-values, Generate significant list, and Split significant list (by direction) will be created for the Inference report generated by this command. Clicking the Add button will add the specified estimates to the Ttests section.
- Term: The terms defined in the model in Step1 specify model.
- For each: Unchecking this box requires users to manually construct the statistics for the T-Test. Checking this box lets Array Studio generate some T-Test statistics automatically. For a main effect, there are no further options to choose (the only option is none). For interaction terms (e.g. factor A: factor B), it allows users to further build tests for factor B within each level of factor A. With an interaction term, users can still leave For each set to (none) and simply compare different levels of the interaction term.
- Level: All possible values in Term.
- Coefficient: Values that users need to specify to conduct a T-Test. By default, all Coefficients are 0 and the Add button is grey. Users need to modify at least 2 coefficients to build statistics for T-Test. Usually the sum of Coefficient is 0.
- Divisor: All Coefficients will be divided by the value defined here.
- Compare to: The baseline that users want to compare to. For main effect, all other levels will compare to this baseline. For interaction term (factor A: factor B), if users define factor A in For each, users can further specify a level in factor B as baseline so that all other levels in factor B will be compared to this baseline.
- Pairwise comparison: Similar to Compare to, but instead of comparing all other levels to one baseline, checking this box will conduct comparisons for all possible level combinations.
- Four way contrasts: This option is only available when users try to do a T-Test for interaction (e.g., factor A : factor B) term and set For each as one of the factors (e.g., factor A). Suppose that factor A: factor B is significant and users want to test for the effect of A within each level of B, users can choose this option. A more detailed example can be found here: Four way contrasts
- Slice: If factor A is also an interaction (e.g., factor A = factor a1: factor a2), users can further slice factor A. If factor A isn't an interaction, only none is available.
- Compare to: The users can choose to compare one level of factor A to all other levels in factor A when comparing the corresponding interaction (e.g. factor A: factor B).
- Pairwise comparison: The user can choose to compare all possible pairwise comparisons within factor A when comparing the corresponding interaction (e.g. factor A: factor B).
- Options: Users can specify which parameter estimates for the TTest will be output.
- Estimates: The estimate of the effect.
- Fold change: The fold change of effect. Note that, in Array Studio, the absolute value of fold change is always larger than 1.
- Raw p-value: The raw p value of the T-test.
- Adjusted p-value: The adjusted p-value, based on raw p value.
- Generate significant list: Generate a list that contains the variables that have an adjusted p value less than the threshold (by default it is 0.05).
- Split significant list: Generate 2 significant lists, split by the sign of estimates.
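The Coefficient, Divisor, Compare to, and Pairwise comparison options above can be illustrated with a small sketch. It assumes a single Class term with known level means and computes each estimate as the coefficient-weighted sum of those means divided by the divisor. The function names, the level means, and the direction of each comparison (first level minus second) are hypothetical.

```python
from itertools import combinations

def contrast_estimate(level_means, coefficients, divisor=1.0):
    """Weighted sum of level means; levels without a coefficient count as 0."""
    total = sum(coefficients.get(level, 0.0) * mean
                for level, mean in level_means.items())
    return total / divisor

def compare_to(levels, baseline):
    """One contrast per non-baseline level: level minus baseline."""
    return [{level: 1.0, baseline: -1.0} for level in levels if level != baseline]

def pairwise(levels):
    """One contrast for every unordered pair of levels."""
    return [{a: 1.0, b: -1.0} for a, b in combinations(levels, 2)]

# Hypothetical log2 level means for one variable.
level_means = {"control": 8.0, "low": 8.5, "high": 10.0}
levels = list(level_means)

# high vs. control: coefficients sum to 0.
print(contrast_estimate(level_means, {"high": 1.0, "control": -1.0}))

# Average of both doses vs. control: coefficients 1, 1, -2 with divisor 2.
print(contrast_estimate(level_means, {"low": 1.0, "high": 1.0, "control": -2.0}, divisor=2.0))

print(len(compare_to(levels, "control")), "compare-to contrasts,",
      len(pairwise(levels)), "pairwise contrasts")
```

The second contrast shows why the Divisor is useful: integer coefficients stay readable while the estimate remains on the scale of a difference between averages.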
FTest(Anova)
The FTest(Anova) tab can be used to generate FTests for any of the terms in the model. In this window, the user can add any of the terms (as well as the Residual) from the model. Options for output from the FTests include Raw p-values, Adjusted p-values, Coefficients, Variance components, and the generation of significant lists. Clicking on one or more terms in the Terms box, selecting options, and clicking Add will add the FTests to the FTests box. This is shown in more detail below.
- Options: Users can specify which parameter estimates for the FTest will be output.
- Raw p-value: The raw p value of the F-test.
- Adjusted p-value: The adjusted p-value, based on raw p value.
- Coefficients: The coefficients for different levels in the selected terms
- Variance components: It only works for random effect. It gives the estimate of the variance of random variable.
- Generate significant list: Generate a list that contains the variables that have an adjusted p value less than the threshold (by default it is 0.05).
Step 3(optional): Change Options
This step provides the user with some additional options available for change (adding LSMeans data, prediction dataset, Multiplicity Adjustments, confidence intervals, etc.). While this is considered optional, the user should verify these settings before proceeding.
General
- ANOVA test type: Type1, Type2, Type3, and Type4 (Type 3 is the default option) - These are the sum of squares types used in ANOVA. They only make a difference if the design is unbalanced. Types 1, 2, and 3 are universally accepted, and Type 4 is SAS specific. Type 3 is the most commonly used and is generally appropriate. For additional information please see the following link: http://en.wikipedia.org/wiki/Explained_sum_of_squares.
- Multiplicity: FDR_BH, FDR_BY, Bonferroni, Sidak, StepDownBonferroni, StepDownSidak, and StepUp (FDR_BH is the default option)
- FC transformation: Method used to calculate Fold change based on Estimates. Exp2 is the default transformation, as it is expected by default that the data is Log2.
- Estimate cutoff: For the generation of estimate Lists, the user can specify an Estimate cutoff. For instance, if the user is only interested in the significant variables with estimates that are greater than 1 or less than -1, 1 would be entered in this box. Note: This can be used to specify fold change cutoffs as well. For Log2 data, if the user wanted significant variables with fold change greater than 2 or less than -2, entering 1 in this box would be appropriate. (By default, this is set to 0, so as to not exclude any rows based on estimate value)
- Alpha level: For the generation of estimate Lists, the user can specify an Alpha level cutoff (p-value cutoff; by default this is 0.05).
- By: Used if the user wants to run the same model, on a number of different levels, for a given factor. For instance, if the user has 6 tissues, and wants to run the same model (potentially time*treatment) on each of the tissues (generating an individual report for each tissue), the By dropdown box should be used.
- Select list folder: Allows the user to select the folder into which any generated lists will go.
- Generate overall significant list: This checkbox will create a "master" list that encompasses all significant rows from all comparisons, FTests, etc.
- Note: The Multiplicity adjustment takes into account the total number of tests performed within a given analysis. The default can instead be set to adjust p-values on a per-test basis. Please refer to the "Statistics" section of the User Guide for details.
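The interplay of Multiplicity, FC transformation, Estimate cutoff, and Alpha level can be sketched as follows. This is an illustration under stated assumptions, not Array Studio's code: it implements only the standard Benjamini-Hochberg procedure (FDR_BH), assumes log2 data with the Exp2 transformation, assumes a signed fold change convention (negative for down-regulation, so the absolute value is never below 1), and treats the estimate cutoff as inclusive so that the default of 0 excludes nothing. All function names and data are hypothetical.

```python
def adjust_bh(p_values):
    """Benjamini-Hochberg adjusted p-values, returned in the input order."""
    n = len(p_values)
    # Walk from the largest p-value down, enforcing monotonicity.
    order = sorted(range(n), key=lambda i: p_values[i], reverse=True)
    adjusted = [0.0] * n
    running_min = 1.0
    for position, i in enumerate(order):
        rank = n - position  # 1-based rank among ascending p-values
        running_min = min(running_min, p_values[i] * n / rank)
        adjusted[i] = running_min
    return adjusted

def signed_fold_change(log2_estimate):
    """Exp2 transformation with the sign carrying the direction (assumed convention)."""
    if log2_estimate >= 0:
        return 2.0 ** log2_estimate
    return -(2.0 ** -log2_estimate)

def significant_rows(estimates, raw_p, alpha=0.05, estimate_cutoff=0.0):
    """Indices with adjusted p < alpha and |estimate| >= cutoff (inclusive by assumption)."""
    adjusted = adjust_bh(raw_p)
    return [i for i, (est, adj) in enumerate(zip(estimates, adjusted))
            if adj < alpha and abs(est) >= estimate_cutoff]

estimates = [1.5, -0.4, 2.0, 0.1]   # hypothetical log2 estimates
raw_p = [0.01, 0.04, 0.03, 0.20]    # hypothetical raw p-values

print(adjust_bh(raw_p))
print([signed_fold_change(e) for e in estimates])
print(significant_rows(estimates, raw_p, alpha=0.05, estimate_cutoff=1.0))
```

With an Estimate cutoff of 1 on log2 data, only rows whose fold change is at least 2 in either direction can enter the list, matching the note in the Estimate cutoff option.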
Filtration
The Filtration section allows the user to apply the special Filtration available in Array Studio. For each variable, a GLM is fitted to obtain a variance estimate for the error term and a residual for each observation. A p-value is then calculated for each observation's residual according to a t-distribution, and an adjusted p-value is generated using the selected Multiplicity method. Observations with an adjusted p-value less than the Alpha level are removed as outliers, and a new GLM is fitted for this variable without them. Filtration is a useful feature that removes outliers from statistical inference and can therefore increase power. It is especially useful for noisy data, such as microarray data, which often contain many extreme observations; enabling it in Array Studio can help find more differentially expressed genes. However, Filtration is not recommended for studies with a low number of observations. Filtration is not checked by default.
- Multiplicity: The method defined to adjust the raw p-value.
- Alpha level: The adjusted p-value cutoff.
- Max iteration: The max number of GLM performed for each variable.
LSMeans | Estimates | Predictions
Least-Squares Means (LSMeans for short) for a linear model are simply predictions, or averages thereof, over all levels of the selected terms in the specified model. The user needs to select the term for which to generate the LSMean data from the drop-down box. Rather than generating only a point estimate of the prediction over all levels of the terms, Array Studio generates a confidence interval for those predictions.
- Generate LSMean data: Checking this box will generate a new Data Object in the Solution Explorer containing the LSMean data for the generated model.
- Append inference report: Checking this box appends the LSMean data to the Inference Report.
- LSMean Confidence interval: The user also needs to enter a desired LSMean Confidence interval for the generation of a confidence interval in the new Data object.
- Generate estimate data: Checking this box will generate the LSMeans for the TTest statistics.
- Estimate Confidence interval: Confidence interval of the LSMeans for the TTest statistics.
- Generate prediction data: Checking this box would generate the predicted LSMeans for each observation based on specified model in step 1.
- Prediction Confidence interval: Confidence interval of the LSMeans for the Prediction Estimates.
- Generate data for significant variables only: Checking this box will only generate the LSMean data for rows that are considered significant in the model. If the user is interested in seeing the LSMean data for all rows of the data, deselect this box.
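The idea that LSMeans are averages of model predictions can be shown with a small sketch. It assumes a model with crossed time and treatment terms, in which the prediction for each cell is the cell mean, and computes the LSMean of each treatment as the unweighted average of its cell means across time levels. The observations and function names are hypothetical, and confidence intervals are omitted.

```python
from collections import defaultdict

def cell_means(observations):
    """Mean response per (time, treatment) cell."""
    sums = defaultdict(lambda: [0.0, 0])
    for time, treatment, value in observations:
        cell = sums[(time, treatment)]
        cell[0] += value
        cell[1] += 1
    return {key: total / count for key, (total, count) in sums.items()}

def lsmeans_for_treatment(observations):
    """Unweighted average of cell means over time levels, per treatment."""
    per_treatment = defaultdict(list)
    for (time, treatment), mean in cell_means(observations).items():
        per_treatment[treatment].append(mean)
    return {t: sum(means) / len(means) for t, means in per_treatment.items()}

# Hypothetical unbalanced data for one variable: (time, treatment, log2 value).
observations = [
    ("0h", "control", 8.0), ("0h", "control", 9.0),
    ("24h", "control", 10.0),
    ("0h", "drug", 9.0),
    ("24h", "drug", 12.0), ("24h", "drug", 13.0),
]

print(lsmeans_for_treatment(observations))
```

For the control group the LSMean is 9.25, while the raw mean of its three observations is 9.0. The two differ because the design is unbalanced and the LSMean weights each time level equally.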
Perform moderated t-test (limma package in R): The moderated t-test is used to rank genes in order of evidence for differential expression. It uses an empirical Bayes method to shrink the probe-wise sample variances towards a common value and to augment the degrees of freedom for the individual variances (Smyth, 2004). The empirical Bayes moderated t-statistics test whether each individual contrast is equal to zero. For each probe (row), the moderated F-statistic tests whether all the contrasts are zero. The F-statistic is an overall test computed from the set of t-statistics for that probe. This is exactly analogous to the relationship between t-tests and F-statistics in conventional ANOVA, except that the residual mean squares and residual degrees of freedom have been moderated between probes. http://rss.acs.unt.edu/Rdoc/library/limma/html/ebayes.html
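The variance shrinkage described above can be sketched for a single probe. The formula follows Smyth (2004): the posterior variance is a degrees-of-freedom-weighted average of the prior variance and the probe's own sample variance. In limma the prior variance and prior degrees of freedom are estimated from all probes; in this sketch they are simply passed in as assumed values, and the function name and numbers are hypothetical.

```python
import math

def moderated_t(estimate, sample_var, residual_df, unscaled_var, prior_var, prior_df):
    """Return (moderated t, posterior variance, total degrees of freedom) for one probe."""
    # Shrink the probe-wise variance towards the common prior value.
    posterior_var = (prior_df * prior_var + residual_df * sample_var) / (prior_df + residual_df)
    # unscaled_var is the contrast's variance factor from the design (e.g. 1/n1 + 1/n2).
    t_stat = estimate / math.sqrt(posterior_var * unscaled_var)
    return t_stat, posterior_var, prior_df + residual_df

# Hypothetical probe with an unusually small sample variance.
estimate, sample_var, residual_df, unscaled_var = 1.0, 0.04, 4, 0.5
t_mod, post_var, total_df = moderated_t(estimate, sample_var, residual_df,
                                        unscaled_var, prior_var=0.10, prior_df=4)
t_ordinary = estimate / math.sqrt(sample_var * unscaled_var)

print(f"ordinary t = {t_ordinary:.2f} on {residual_df} df")
print(f"moderated t = {t_mod:.2f} on {total_df} df (posterior variance {post_var:.3f})")
```

Here the small sample variance is pulled up towards the prior, so the moderated statistic is less extreme than the ordinary t, while the degrees of freedom increase from 4 to 8.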
Output Results
Array Studio will generate an Inference Table in the Inference tab of the Solution Explorer, with a name ending in .Tests. Several Views will be generated automatically, including a VolcanoPlotView and a TableView.
Notice in the example shown below that the VolcanoPlotView will show each individual VolcanoPlot for each test generated by the model.
The TableView will list all of the columns that have been generated by the model. This could include raw pvalues, adjusted pvalues, fold changes, estimates, etc. for each requested estimate. In addition, columns will be generated if Ftests generation was requested. The Annotation Table will also be attached to this Table. Making use of the Filter tab in the View Controller will enable the user to quickly filter by pvalue, fold change, etc., if the Lists generated by the model are not sufficient. An example TableView for the Inference Report is shown below.
As with any TableView, any information in the table can be exported or saved at any time.
Differing views will also be generated if the user selected options for LSMean, Estimate, or Prediction data.
OmicScript