From Array Suite Wiki
Normalize RNA-Seq data
The Normalize RNA-Seq data module will normalize observations/samples based on linear scaling methods.
To access this module, please go to Analysis | NGS | Inference | Normalize RNA-Seq Data
Input Data Requirements
- Project & Data: The window includes a dropdown box to select the Project and Data object to be normalized.
- Variables: Selections can be made on which variables should be included in the normalization (options include All variables, Selected variables, Visible variables, and Customized variables (select any pre-generated Lists)).
- Observations: Selections can be made on which observations should be included in the normalization (options include All observations, Selected observations, Visible observations, and Customized observations (select any pre-generated Lists)).
- Output name: The user can choose to name the output data object.
- Normalization method: Log Geometric Mean, Mean, Median, ScaleQuantile, TMM (edgeR), TotalCount, RPKM to TPM, UpperQuartile, and LandNormalization (for more details on the different normalizations, see below).
- Quantile: used by the ScaleQuantile method; specifies the quantile that is scaled to the specified scale target.
- Set scale target: the target value for the normalized quantile, mean, median, total count, or upper quartile, depending on which method has been selected.
- Reference Library & Gene Model: Specify the corresponding reference and gene model if the user is normalizing transcript level data and Normalize transcripts at gene level is selected, or if normalization requires gene metadata (normalizing by gene length, filtering MT genes, etc).
- Export scale factor: append the scaling factor to the design table
- Normalize transcripts at gene level: first, obtain gene-level expression values/counts by summing the values of each gene's transcripts; then, calculate the scaling factor at the gene level; finally, apply the same scaling factor to the transcript-level data.
- Sequence low count cutoff: variables with mean expression (across samples) below this cutoff value will not be used to calculate the normalization factor.
- Sequence high count cutoff: variables with mean expression (across samples) above this cutoff value will not be used to calculate the normalization factor.
- For example, if excluding genes with mean FPKM < 1 or > 1000 when normalizing by median, only genes expressing between 1 and 1000 FPKM will be used to identify the median gene, and the normalization factor will be calculated to bring this gene's expression to the target value. Then, all genes (including genes expressing <1 or >1000 FPKM) will be normalized by the calculated value.
- Sequence short length cutoff: Genes with exon length less than this value will not be considered when calculating the normalization factor.
- Requires a Gene Model with ExonLength information in the annotation metadata (e.g. OmicsoftGene).
- Remove MT and miRNA sequences: Do not consider mitochondrial or miRNA genes when calculating the normalization factor
- Requires a Gene Model with Source information containing miRBase annotations (for miRNA) or chromosome annotations of MT or M (for mitochondrial genes)
- Only use protein coding genes for normalization: Only include protein-coding genes in normalization
- Requires annotation metadata with a Source column, containing protein_coding annotation
- Remove Filtered Sequences: Genes that are not used in normalization will be excluded from the output -Omic data
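The interaction between the count cutoffs and the normalization factor can be illustrated with a minimal sketch of the median example above. This is not the module's implementation: the function names (`median_scale_factors`, `apply_factors`), the dictionary layout, and the toy FPKM values are all hypothetical, and the default target (the average of the per-observation medians) follows the description of the Median method on this page.

```python
from statistics import mean, median

def median_scale_factors(data, low=1.0, high=1000.0, target=None):
    """data maps gene name -> list of expression values, one per observation.

    Returns one multiply-by factor per observation (hypothetical helper).
    """
    n_obs = len(next(iter(data.values())))
    # Only genes whose mean expression across samples lies within the
    # cutoffs are used to calculate the factor.
    kept = [vals for vals in data.values() if low <= mean(vals) <= high]
    # Median of the kept genes within each observation.
    medians = [median(vals[j] for vals in kept) for j in range(n_obs)]
    # Without an explicit target, use the average of the medians.
    if target is None:
        target = mean(medians)
    return [target / m for m in medians]

def apply_factors(data, factors):
    # Every gene is rescaled, including genes excluded by the cutoffs.
    return {g: [v * f for v, f in zip(vals, factors)]
            for g, vals in data.items()}

# Toy FPKM values for two observations (made up for illustration).
fpkm = {
    "geneA": [0.2, 0.4],        # mean < 1: not used for the factor
    "geneB": [10.0, 20.0],
    "geneC": [50.0, 100.0],
    "geneD": [200.0, 400.0],
    "geneE": [5000.0, 9000.0],  # mean > 1000: not used for the factor
}
factors = median_scale_factors(fpkm)
normalized = apply_factors(fpkm, factors)
```

In this toy data, geneA and geneE do not influence the factors, but they are still rescaled along with every other gene.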
- Log Geometric Mean: This is the same normalization method used in DESeq.
- For a given transcript/gene, it computes the ratio of the read count to the geometric mean of read counts across all samples. The scaling factor is calculated from the median of these ratios over all genes/transcripts that passed the filters.
- Mean: In each observation, expression values/counts are divided by the mean of genes/transcripts in this observation, and multiplied by the average of mean values from all observations (or the target value if specified in the option) of the dataset.
- The mean and scaling factors are calculated from the genes/transcripts that passed the filters.
- Median: In each observation, expression values/counts are divided by the median of genes/transcripts in this observation and multiplied by the average of median values from all observations (or the target value if specified in the option).
- The median and scaling factors are calculated from the genes/transcripts that passed the filters.
- Scale Quantile (previously Quantile): Very similar to the Median method, except that the user-specified quantile is used instead of the median.
- TMM: Trimmed Mean of M-values (TMM) is the normalization method used in edgeR.
- The sample/observation whose average expression is closest to the mean of all samples is taken as the reference sample, and all others are test samples. For each test sample, the scaling factor is calculated as the weighted mean (weighted by estimated asymptotic variance) of the log ratios between the test and the reference, using a gene set from which the highest/lowest expressed genes and the genes with the highest/lowest log ratios have been removed.
- Total count: In each observation, expression values/counts are divided by the total number of mapped read counts in this observation and multiplied by the mean total counts/values across all the observations (or the target value if specified in the option) of the dataset.
- Upper quartile: Very similar to the Median method, except that the 50% quantile (median) is replaced by the 75% quantile.
- RPKM to TPM: Convert RPKM values to TPM based on the linear relationship between TPM and FPKM in each sample.
- LandNormalization: details in RNA-Seq Normalized FPKM Values in Land
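The RPKM to TPM conversion listed above relies on the fact that, within one sample, TPM is the RPKM (or FPKM) value rescaled so that the sample sums to one million. A minimal sketch of that relationship follows; the function name `rpkm_to_tpm` and the input values are hypothetical, and the module's own handling of edge cases is not documented here.

```python
def rpkm_to_tpm(rpkm):
    """Convert the RPKM/FPKM values of one sample to TPM.

    TPM_i = RPKM_i / sum(RPKM) * 1e6, so the result sums to one million.
    """
    total = sum(rpkm)
    if total == 0:
        # Assumption: an all-zero sample stays all zero.
        return [0.0 for _ in rpkm]
    factor = 1e6 / total
    return [value * factor for value in rpkm]

# Made-up RPKM values for a single sample.
sample_rpkm = [10.0, 30.0, 60.0]
sample_tpm = rpkm_to_tpm(sample_rpkm)
```

Because the factor depends only on the per-sample total, the conversion is a linear scaling applied separately to each observation.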
Besides the normalized Omic Data object, a scaling factor column will be appended to the design table. The scale factor is a multiply-by factor: values are multiplied by it. This may differ from the size factor used in other tools (such as the DESeq R package), which is a divide-by factor. A multiply-by factor does not have divide-by-zero issues, and it is more consistent with ArrayStudio's earlier normalization function for microarray data.
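To make the multiply-by versus divide-by distinction concrete, the following sketch computes DESeq-style median-of-ratios size factors (the idea behind the Log Geometric Mean method) and converts them to multiply-by factors. It assumes the multiply-by factor is simply the reciprocal of the divide-by size factor; the module may additionally rescale to a target value, which is not shown. The function names and the count matrix are hypothetical.

```python
import math
from statistics import median

def median_of_ratios_size_factors(counts):
    """counts: list of genes, each a list of read counts per sample.

    Returns DESeq-style divide-by size factors, one per sample.
    """
    n_samples = len(counts[0])
    # A gene with a zero count in any sample has a zero geometric mean,
    # so it cannot contribute a ratio and is skipped.
    usable = [row for row in counts if all(c > 0 for c in row)]
    # Log of the geometric mean of each usable gene across samples.
    log_geo_means = [sum(math.log(c) for c in row) / n_samples
                     for row in usable]
    size_factors = []
    for j in range(n_samples):
        log_ratios = [math.log(row[j]) - lgm
                      for row, lgm in zip(usable, log_geo_means)]
        # Median of the per-gene ratios, computed on the log scale.
        size_factors.append(math.exp(median(log_ratios)))
    return size_factors

def to_scale_factors(size_factors):
    # Assumption: multiply-by factor = 1 / divide-by size factor.
    return [1.0 / s for s in size_factors]

# Made-up counts: sample 2 is sequenced twice as deeply as sample 1.
counts = [[10, 20], [100, 200], [50, 100], [0, 5]]
size_factors = median_of_ratios_size_factors(counts)
scale_factors = to_scale_factors(size_factors)
normalized = [[c * f for c, f in zip(row, scale_factors)] for row in counts]
```

With these counts, multiplying by the scale factors brings the two samples to the same level for every gene that was used to compute the factors.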
- edgeR: a Bioconductor package for differential expression analysis of digital gene expression data
- Differential expression analysis for sequence count data
- A comprehensive evaluation of normalization methods for Illumina high-throughput RNA sequencing data analysis