# Data Normalization

### From Array Suite Wiki

Microarray data must be properly normalized to account for variance and biases in the data that can affect downstream applications such as differential expression (DE) analyses. This Wiki page demonstrates how to transform microarray data to account for this variation and how the choice of transformation can impact DE analyses. For this demonstration, we use the microarray data from the Microarray Tutorial on our Tutorials Wiki page. The data can also be downloaded with the link provided here. The dataset consists of 24 .CEL files from a time series experiment comparing the response to DBP and control treatments. Plotting the raw intensities from these .CEL files with the **OmicData | Summarize | Kernel Density** menu option shows how the raw intensities are distributed.

The raw intensities are asymmetrically distributed, with most values clustered close to zero. To account for this skewed distribution (common in microarray and RNA-seq data), Array Studio applies a logarithmic transformation by default when microarray data is added to a project.

Leaving this option checked automatically applies a log2 transformation to the data, so that the intensities are spread more evenly along the x-axis.
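The effect of the log2 step can be sketched in a few lines of Python. This is only an illustration of the arithmetic, not Array Studio's implementation; the function name `log2_transform` and the `offset` parameter are hypothetical, and the intensities are made-up values.

```python
import math


def log2_transform(intensities, offset=0.0):
    """Return log2(x + offset) for each raw intensity.

    `offset` is a hypothetical guard for zero-valued intensities;
    it is not an Array Studio option.
    """
    return [math.log2(x + offset) for x in intensities]


# Made-up raw intensities spanning several orders of magnitude.
raw = [16.0, 64.0, 1024.0, 8192.0]
logged = log2_transform(raw)
print(logged)
```

On the raw scale the largest value is 512 times the smallest; after the transformation the values lie within a few units of each other, which is why differences among samples become easier to see.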

After this transformation, some variance among the samples becomes visible that was not easily identifiable in the raw data. To account for this variance, Array Studio offers a number of normalization options. For a complete list and explanation of these options, go to the Wiki page: Normalize.pdf. For the purposes of this Wiki page, we show three different normalization methods.

One method to normalize the microarray data is Center to 0. With this method, a single normalization factor per sample is used to bring the sample means to the set value (zero in this case). In this example, the method has a minimal effect on the variance among samples.
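A minimal sketch of mean centering, assuming the normalization factor is an additive shift applied to log2 values. The function name `center_samples`, the chip names, and the values are hypothetical.

```python
from statistics import fmean


def center_samples(samples, target=0.0):
    """Shift each sample so that its mean equals `target`.

    `samples` maps a sample name to a list of log2 intensities.
    """
    centered = {}
    for name, values in samples.items():
        shift = target - fmean(values)  # one factor per sample
        centered[name] = [v + shift for v in values]
    return centered


# Made-up log2 intensities for two chips.
samples = {
    "chip_a": [6.0, 8.0, 10.0],
    "chip_b": [7.0, 9.0, 11.0],
}
centered = center_samples(samples)
print(centered)
```

Because every value in a sample moves by the same amount, the spread within each sample is unchanged; only its location moves.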

Another method to normalize the data is to center a specified quantile at a target value. In this example, we set the 75% quantile to a target value of 10. Notice that there is less variation at the higher intensity levels, but slightly more variance at the lower levels.
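The following sketch illustrates quantile centering under the assumption that it is applied as an additive shift on the log2 scale; how Array Studio computes the quantile or applies the adjustment is not stated here. The names `percentile_75` and `center_quantile` and the sample values are hypothetical.

```python
from statistics import quantiles


def percentile_75(values):
    """Third quartile, interpolating linearly between observed values."""
    return quantiles(values, n=4, method="inclusive")[2]


def center_quantile(samples, target=10.0):
    """Shift each sample so that its 75% quantile equals `target`."""
    shifted = {}
    for name, values in samples.items():
        shift = target - percentile_75(values)
        shifted[name] = [v + shift for v in values]
    return shifted


# Made-up log2 intensities for two chips with different spreads.
samples = {
    "chip_a": [6.0, 7.0, 8.0, 9.0, 10.0],
    "chip_b": [4.0, 6.0, 8.0, 10.0, 12.0],
}
aligned = center_quantile(samples)
print(aligned)
```

In this toy example both chips end up with the same 75% quantile, while their lowest values still differ, which mirrors the pattern of tighter agreement at high intensities and more spread at low intensities.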

Finally, the preferred normalization for microarray data (Bolstad et al., 2003) is Quantile Normalization (or Full Quantile). With this method, the distribution of probe intensities is forced to be the same across all chips. Applying this normalization to the dataset above gives every sample the same intensity distribution.
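As a rough illustration, quantile normalization is commonly described as: sort each chip, average the values at each rank across chips, then give each probe the average for its rank. The sketch below follows that description and breaks ties naively by probe order; it is not the Array Studio implementation, and the function name `quantile_normalize` and the values are hypothetical.

```python
def quantile_normalize(samples):
    """Force every sample to share the same distribution of values.

    `samples` maps a sample name to a list of log2 intensities,
    with probes in the same order in every sample.
    """
    names = list(samples)
    n_probes = len(samples[names[0]])
    if any(len(samples[name]) != n_probes for name in names):
        raise ValueError("all samples must have the same number of probes")

    # For each sample, probe indices ordered from lowest to highest value.
    order = {
        name: sorted(range(n_probes), key=samples[name].__getitem__)
        for name in names
    }

    # Mean across samples of the values at each rank.
    rank_means = [
        sum(samples[name][order[name][rank]] for name in names) / len(names)
        for rank in range(n_probes)
    ]

    # Give each probe the mean that belongs to its rank in its own sample.
    normalized = {}
    for name in names:
        values = [0.0] * n_probes
        for rank, probe_index in enumerate(order[name]):
            values[probe_index] = rank_means[rank]
        normalized[name] = values
    return normalized


# Made-up log2 intensities for two chips.
samples = {
    "chip_a": [2.0, 4.0, 6.0],
    "chip_b": [3.0, 7.0, 5.0],
}
normalized = quantile_normalize(samples)
print(normalized)
```

After normalization each chip contains exactly the same set of values, and only the assignment of values to probes differs between chips.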

To demonstrate how these normalization strategies can affect downstream analysis, such as identifying DE genes, we performed a 2-way ANOVA on each dataset. For each pairwise comparison (at each time point), the number of significantly DE genes differs considerably among the untransformed, log2 transformed, and Full Quantile normalized datasets.

Moreover, when we examine the DE genes from a single time point (3 hrs), there is a substantial lack of overlap among the gene lists obtained from the different data inputs.
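One way to quantify such overlap outside of Array Studio is with plain set operations. The sketch below uses made-up gene identifiers, not results from this dataset, and the function name `overlap_summary` is hypothetical.

```python
def overlap_summary(gene_sets):
    """Summarize how much the DE gene lists from different inputs agree.

    `gene_sets` maps an input label to a set of DE gene identifiers.
    """
    shared = set.intersection(*gene_sets.values())
    union = set.union(*gene_sets.values())
    jaccard = len(shared) / len(union) if union else 0.0
    return {"shared": shared, "union": union, "jaccard": jaccard}


# Made-up DE gene lists for one time point (illustrative only).
de_genes = {
    "untransformed": {"gene_1", "gene_2", "gene_3"},
    "log2": {"gene_2", "gene_3", "gene_4"},
    "full_quantile": {"gene_3", "gene_4", "gene_5"},
}
summary = overlap_summary(de_genes)
print(summary["shared"], summary["jaccard"])
```

A Jaccard index near 1 would mean the inputs largely agree on which genes are DE; a value near 0 indicates the kind of disagreement described above.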

Therefore, we strongly recommend that users carefully choose an appropriate normalization method before performing downstream analyses. While the recommended normalization method for microarrays is Full Quantile Normalization, users are encouraged to examine how these normalization methods perform with their own data.