DiseaseLand Curation Pipeline
We have processed various publicly available projects for DiseaseLand and separated them into two major categories: ImmunoLand and CVMLand. The main disease areas covered by these Lands are as follows:
- IBD (Crohn Disease, Ulcerative Colitis)
- Infectious diseases
- Lung disease (COPD, Pulmonary Fibrosis)
- Multiple Sclerosis
- Neurodegenerative disease (Alzheimer’s, Parkinson’s, ALS)
- Psoriasis and other skin disease
- SLE (Systemic Lupus Erythematosus)
- Mood disorders (Schizophrenia)
- Cardiovascular disease
- Kidney disease (Chronic Kidney disease, Polycystic Kidney disease, Diabetic Nephropathy)
- Diabetes (Type I and Type II)
- Fatty Liver disease
If you have other disease areas that you would like to see added into DiseaseLand, please let us know by contacting us at firstname.lastname@example.org.
We process most commercially available microarray gene expression platforms, including Agilent, Illumina and Affymetrix platforms, and RNA-Seq.
- Most single-channel expression microarray platforms
- miRNA, methylation platforms (late 2016)
Non-supported platforms and projects:
- Two-color expression microarray platforms
- Obsolete or custom microarray platforms without probe sequence annotation
- Badly designed experiments and projects without enough samples or statistical power, which may be removed during pipeline processing
We select public projects from GEO, ArrayExpress, SRA, and other large data repositories such as BluePrint, GTEx, and ImmGen. Customers can also deposit their own in-house data into the Land.
We fully support both human and mouse projects. Rat projects will be supported in the future.
Sample and Comparison Meta Data Curation Process
Providing consistent annotation across projects is one of the most challenging tasks in large-scale meta-analysis. The Omicsoft DiseaseLand curation team uses controlled vocabularies (CVs) to harmonize tissue, cell type, disease state, treatment and other commonly used terms. Terms are selected from major ontologies and thesauri such as MeSH, Cell Ontology, and UBERON. We reference these ontologies when adding to and maintaining our own internal ontologies.
The Omicsoft DiseaseLand curation team works together to generate accurately curated project-level and sample-level metadata. To achieve this goal, we have implemented an internally developed tool, ArrayCurator, to aid in 1) selecting controlled vocabulary (CV) terms from our internally maintained ontologies, 2) comparing independent curations and 3) generating the final metadata in a format suitable for the Land. Each project is assigned to at least two independent curators for quality control. The manager responsible for the specific DiseaseLand edits all outside curations, merges the two curations and maintains a subset of internal CVs and ontologies. After the two curations are merged into a final curation, a second manager reviews the merged metadata as an additional quality control step before it is exported to the Land.
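The comparison of two independent curations can be pictured with a minimal sketch. ArrayCurator is an internal tool and its data model is not described here, so the function name, the dictionary layout, and the sample and field names below (`sample_1`, `Tissue`, `DiseaseState`) are all hypothetical and serve only to illustrate the idea of listing disagreements for a manager to resolve.

```python
def curation_conflicts(curation_a, curation_b):
    """List (sample, field, value_a, value_b) wherever two curations disagree.

    Each curation is assumed to be a dict: sample ID -> {field: CV term}.
    A field missing from one curation is reported with None on that side.
    """
    conflicts = []
    for sample in sorted(set(curation_a) | set(curation_b)):
        a = curation_a.get(sample, {})
        b = curation_b.get(sample, {})
        for field in sorted(set(a) | set(b)):
            if a.get(field) != b.get(field):
                conflicts.append((sample, field, a.get(field), b.get(field)))
    return conflicts


# Hypothetical example: the two curators disagree on one disease state.
curator_1 = {
    "sample_1": {"Tissue": "colon", "DiseaseState": "ulcerative colitis"},
    "sample_2": {"Tissue": "colon", "DiseaseState": "normal control"},
}
curator_2 = {
    "sample_1": {"Tissue": "colon", "DiseaseState": "Crohn's disease"},
    "sample_2": {"Tissue": "colon", "DiseaseState": "normal control"},
}
conflicts = curation_conflicts(curator_1, curator_2)
for row in conflicts:
    print(row)
```

Only the disagreeing entries are returned, so a reviewer can focus on the fields that need a decision rather than re-reading the full metadata.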
Project Analysis Pipeline
Raw data extraction and sample normalization
All raw data are normalized before analysis. For Affymetrix platforms with raw data (“.cel” files), data are extracted directly from the .cel files and normalized using an in-house algorithm similar to RMA (see our white paper). For other microarray platforms, level 3 data are re-scaled to a median target intensity of 500 before use.
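The median re-scaling step for non-Affymetrix arrays can be sketched as follows. This is an illustration under the assumption that each sample is scaled independently by a single multiplicative factor; the function name is hypothetical and the in-house implementation may differ.

```python
from statistics import median


def rescale_to_median(intensities, target=500.0):
    """Scale one sample's intensities so that their median equals `target`."""
    med = median(intensities)
    if med <= 0:
        raise ValueError("median intensity must be positive")
    factor = target / med
    return [value * factor for value in intensities]


# Made-up intensities for one sample; the median is 250, so the factor is 2.
sample = [120.0, 250.0, 400.0, 1000.0, 80.0]
scaled = rescale_to_median(sample)
print(median(scaled))
```

Because every value in a sample is multiplied by the same factor, the ranking of probes within the sample is unchanged; only the overall scale is aligned across samples.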
For RNA-Seq samples, .fastq files are downloaded from SRA and aligned to the genome using the Omicsoft OSA aligner to generate .bam files. The read count and FPKM value for each gene are derived from the .bam files using RSEM. Upper-quartile normalization is then applied to the RNA-Seq FPKM values.
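A minimal sketch of upper-quartile normalization is given below. The details are assumptions, since they are not specified here: the upper quartile is taken over genes with non-zero FPKM, the quartile is computed with the inclusive interpolation method, and the common target value of 10 is hypothetical.

```python
from statistics import quantiles


def upper_quartile_normalize(fpkm, target=10.0):
    """Scale a sample so the 75th percentile of its non-zero FPKM equals `target`."""
    expressed = [value for value in fpkm if value > 0]
    if len(expressed) < 2:
        raise ValueError("need at least two expressed genes")
    # quantiles(..., n=4) returns the three quartile cut points; index 2 is Q3.
    upper_quartile = quantiles(expressed, n=4, method="inclusive")[2]
    return [value * target / upper_quartile for value in fpkm]


# Made-up FPKM values for one sample; zeros are ignored when locating Q3.
sample_fpkm = [0.0, 1.0, 2.0, 3.0, 4.0, 5.0]
normalized = upper_quartile_normalize(sample_fpkm)
print(normalized)
```

Using the upper quartile rather than the total makes the scaling factor less sensitive to a handful of very highly expressed genes.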
Note: For expression arrays, there may be some discrepancies between published data and the values in DiseaseLand. Please see the accompanying wiki page for an explanation of where these differences arise.
Sample level QC
For microarray samples, correlation QC is performed for each project to compute sample MAD scores and remove sample outliers before analysis (http://www.arrayserver.com/wiki/index.php?title=CorrelationQC.pdf). RNA-Seq projects with a median sample alignment rate below 40% are excluded from the Land.
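Both checks can be sketched in a few lines. The MAD score definition used here is an assumption (each sample's mean Pearson correlation with the other samples, centered on the project median and divided by the median absolute deviation); the linked CorrelationQC document is the authoritative description. The function names are hypothetical, and no outlier cutoff is implied.

```python
from statistics import mean, median


def pearson(x, y):
    """Pearson correlation of two equal-length numeric lists."""
    mx, my = mean(x), mean(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sum((a - mx) ** 2 for a in x) ** 0.5
    sy = sum((b - my) ** 2 for b in y) ** 0.5
    return cov / (sx * sy)


def mad_scores(samples):
    """Map sample name -> MAD score of its mean correlation with other samples."""
    names = list(samples)
    avg_corr = {
        n: mean([pearson(samples[n], samples[m]) for m in names if m != n])
        for n in names
    }
    center = median(list(avg_corr.values()))
    mad = median([abs(v - center) for v in avg_corr.values()])
    if mad == 0:
        return {n: 0.0 for n in names}
    return {n: (avg_corr[n] - center) / mad for n in names}


def passes_alignment_qc(alignment_rates, min_median=0.40):
    """True if the project's median sample alignment rate is at least 40%."""
    return median(alignment_rates) >= min_median


# Made-up log2 intensities: s4 does not track the other three samples.
project = {
    "s1": [1.0, 2.0, 3.0, 4.0, 5.0],
    "s2": [1.1, 2.0, 3.2, 3.9, 5.1],
    "s3": [0.9, 2.1, 2.9, 4.2, 4.8],
    "s4": [5.0, 1.0, 4.0, 2.0, 3.0],
}
scores = mad_scores(project)
print(min(scores, key=scores.get))
print(passes_alignment_qc([0.35, 0.38, 0.90]))
```

A strongly negative score marks a sample that correlates poorly with the rest of its project, which makes it a candidate for removal.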
Statistical modeling and inference
Instead of using a t-test for all comparisons, Omicsoft statisticians and bioinformaticians work with the curation scientists to select the best experimental factors to include in the analysis for each project. We also consider block effects, matched samples, and random effects during statistical modeling. For microarray expression data, linear models are fitted to log2-transformed intensities; for RNA-Seq data, DESeq2 is applied to raw counts.
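To illustrate why matched samples matter, the sketch below handles the simplest blocked design: one treated and one control sample per subject. In that case, the treatment coefficient of a linear model on log2 intensity with subject as a block equals the mean within-subject log2 difference, and its t-statistic equals the paired t-statistic. This is a simplified stand-in for illustration, not the model used in the pipeline; the function name and the intensities are made up.

```python
from math import log2, sqrt
from statistics import mean, stdev


def paired_log2_effect(control, treated):
    """Return (mean log2 fold change, paired t-statistic) for matched samples.

    control[i] and treated[i] are assumed to come from the same subject.
    """
    if len(control) != len(treated) or len(control) < 2:
        raise ValueError("need at least two matched pairs")
    diffs = [log2(t) - log2(c) for c, t in zip(control, treated)]
    estimate = mean(diffs)
    std_error = stdev(diffs) / sqrt(len(diffs))
    return estimate, estimate / std_error


# Made-up intensities for one gene in three subjects with very different baselines.
control = [100.0, 200.0, 400.0]
treated = [200.0, 400.0, 1600.0]
log2_fc, t_stat = paired_log2_effect(control, treated)
print(round(log2_fc, 3), round(t_stat, 3))
```

The subjects differ widely in baseline intensity, so an unpaired comparison would mix that between-subject variation into the error term; working on within-subject differences removes it.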