CuratedGWAS

From Array Suite Wiki


GxL.Associations_B37: Public GWAS results curated by OmicSoft for GeneticsLand data subscription

As part of the GeneticsLand data subscription, OmicSoft curates publicly available genome-wide association studies (GWAS).

This Land contains:

  • 7,407 Association Results Sets
  • >1,000 Phenotypes from 161 different trait Categories
  • Each variant in the Land is annotated using the following databases (additional annotation sources are available):
dbSNP, 1000 Genomes, ClinVar, gnomAD, SIFT, GTEx, GWAS catalog, GRASP 2.0, conservation scores from dbNSFP, GWAVA, HaploReg, RegulomeDB, HGNC, OMIM, InterPro, The Drug Gene Interaction Database (DGIdb)


Content

AssociationsLand.png

  • A list of the Categories included in the Land



Addiction | Emphysema | Obsessive-compulsive disorder (OCD)
Adipose-related | Endometrial cancer | Oral cancer
Adverse drug reaction (ADR) | Environment | Oral-related
Age-related macular degeneration (ARMD) | Epigenetics | Ovarian
Aging | Epilepsy | Ovarian cancer
Alcohol | Esophageal cancer | Pancreas
Allergy | Eye-related | Pancreatic cancer
Alzheimer's disease | Female | Parkinson's disease
Amyotrophic lateral sclerosis (ALS) | Gallbladder cancer | Physical activity
Anemia | Gallstones | Platelet
Aneurysm | Gastric cancer | Pregnancy-related
Anthrax | Gastrointestinal | Prostate cancer
Anthropometric | General health | Pulmonary
Arterial | Glaucoma | Radiation
Arthritis | Graft-versus-host | Rectal cancer
Asthma | Grave's disease | Renal
Atrial fibrillation | GxE | Renal cancer
Attention-deficit/hyperactivity disorder (ADHD) | Hair | Reproductive
Autism | Hearing | Rheumatoid arthritis
Behavioral | Heart | Salmonella
Bipolar disorder | Height | Schizophrenia
Bladder cancer | Hepatic | Sickle cell anemia
Blood cancer | Hepatitis | Skin cancer
Blood pressure | HIV/AIDS | Skin-related
Blood-related | Hormonal | Sleep
Body mass index | Huntington's disease | Smallpox
Bone cancer | Imaging | Smoking
Bone-related | Immune-related | Social
Brain cancer | Infection | Stone
Breast cancer | Inflammation | Stroke
Calcium | Influenza | Subclinical CVD
Cancer | Kidney cancer | Surgery
Cancer-related | Leukemia | Systemic lupus erythematosus (SLE)
Cardiomyopathy | Lipids | Testicular cancer
Cardiovascular disease (CVD) | Liver cancer | Thrombosis
Celiac disease | Lung cancer | Thyroid
Cell line | Lymphoma | Thyroid cancer
Cervical cancer | Male | Treatment response
Chronic kidney disease | Melanoma | Tuberculosis
Chronic lung disease | Menarche | Type 1 diabetes (T1D)
Chronic obstructive pulmonary disease (COPD) | Menopause | Type 2 diabetes (T2D)
Cognition | Methylation | Ulcerative colitis
Colorectal cancer | Mood disorder | Upper airway tract cancer
Congenital | Mortality | Urinary
Coronary heart disease (CHD) | Movement-related | Uterine cancer
C-reactive protein (CRP) | Multiple sclerosis (MS) | Uterine fibroids
Crohn's disease | Muscle-related | Vaccine
CVD risk factor (CVD RF) | Musculoskeletal | Valve
Cystic fibrosis | Myasthenia gravis | Vasculitis
Dental | Myocardial infarction (MI) | Venous
Depression | Narcotics | Vitamin
Developmental | Nasal | Weight
Diet-related | Nasal cancer | Wound
Drug response | Neuro |




Curation Process

Data Processing

Publicly available GWAS result sets were obtained and processed into GTT format using an internally developed pipeline. This pipeline performs allele standardization, a key feature of GeneticsLand: all genetic associations are reported on the forward strand of the same genome build, and the effects (e.g. betas, ORs, HRs) are always given as the alternative allele versus the genome reference, regardless of the original input format. This makes cross-study comparisons much easier.
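As an illustration of what "alternative allele versus the genome reference" means for an effect estimate, here is a minimal sketch. The function name `orient_to_alt` and its arguments are hypothetical and are not part of the OmicSoft pipeline. It assumes that additive effects (beta, z-score) change sign and that ratio effects (OR, HR) are inverted when the reported effect allele is the reference allele.

```python
def orient_to_alt(effect, effect_type, effect_allele, ref, alt):
    """Return the effect expressed as ALT versus REF (illustrative only)."""
    if effect_allele == alt:
        # Already reported as ALT vs REF: nothing to change.
        return effect
    if effect_allele != ref:
        raise ValueError("effect allele matches neither REF nor ALT")
    if effect_type in ("Beta", "z-score"):
        # Additive effects change sign when the comparison is reversed.
        return -effect
    if effect_type in ("OR", "HR"):
        # Ratio effects are inverted when the comparison is reversed.
        return 1.0 / effect
    raise ValueError(f"unsupported effect type: {effect_type}")


# A study that reported the REF allele (A) as its effect allele:
print(orient_to_alt(2.0, "OR", "A", ref="A", alt="G"))
```

With this convention, an OR of 2.0 for the reference allele and an OR of 0.5 for the alternative allele describe the same association, which is why a single orientation simplifies comparison across studies.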

Pipeline

We've developed a pipeline for allele standardization that minimizes the loss of variants due to strand ambiguities. An overview of the processing steps is below:

1) First, determine whether the results are reported on the forward strand
  • Use informative SNPs (allele pairs other than A/T and C/G)
  • Calculate the percentage reported on the forward strand
  • If > 97%, assume all variants are reported on the forward strand
2) Flag (and keep) ambiguous alleles using the "Uncertain" column
  • If the forward assumption is met, informative SNPs on the reverse strand: Uncertain = EffectDirection
  • If the forward assumption is not met, non-informative (A/T, C/G) SNPs: Uncertain = EffectDirection
  • In both cases, alleles that do not match the genome reference: Uncertain = TwoNonRefs
3) Standardize the effects to always compare the ALT allele vs the REF allele
  • Set model reference = REF and model effect allele = ALT
  • Flip the effect (beta, OR, HR) and other relevant columns when necessary so results are always shown in the same orientation in the Land
4) Perform additional calculations
  • If the standard error is provided, calculate confidence intervals
  • Calculate OR for binary outcomes
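
Steps 1, 2 and 4 above can be sketched roughly as follows. This is one reading of the outline, not the internal pipeline: all function names are hypothetical, the 95% confidence level (z = 1.96) is an assumption, and the OR calculation assumes the beta is a log-odds estimate from a logistic model.

```python
import math

COMPLEMENT = {"A": "T", "T": "A", "C": "G", "G": "C"}
AMBIGUOUS_PAIRS = {frozenset("AT"), frozenset("CG")}


def is_informative(a1, a2):
    """A SNP is strand-informative unless its alleles are A/T or C/G."""
    return frozenset((a1, a2)) not in AMBIGUOUS_PAIRS


def forward_fraction(records):
    """Fraction of informative SNPs whose alleles include the genome reference.

    records: iterable of (allele1, allele2, genome_ref_base) tuples.
    """
    informative = [r for r in records if is_informative(r[0], r[1])]
    if not informative:
        return 0.0
    forward = sum(1 for a1, a2, ref in informative if ref in (a1, a2))
    return forward / len(informative)


def uncertain_flag(a1, a2, ref, forward_assumed):
    """Return a value for the Uncertain column, or '' if nothing is flagged."""
    alleles = {a1, a2}
    complements = {COMPLEMENT[a1], COMPLEMENT[a2]}
    if ref not in alleles and ref not in complements:
        return "TwoNonRefs"
    informative = is_informative(a1, a2)
    if forward_assumed and informative and ref not in alleles:
        # Looks like a reverse-strand report despite the forward assumption.
        return "EffectDirection"
    if not forward_assumed and not informative:
        # Strand cannot be resolved for A/T or C/G SNPs.
        return "EffectDirection"
    return ""


def confidence_interval(beta, se, z=1.96):
    """Symmetric interval around beta; z = 1.96 assumes a 95% level."""
    return beta - z * se, beta + z * se


def odds_ratio(log_odds_beta):
    """OR for a binary outcome, assuming beta is on the log-odds scale."""
    return math.exp(log_odds_beta)


toy = [("A", "G", "A"), ("T", "C", "A"), ("A", "T", "A"), ("C", "T", "C")]
forward_assumed = forward_fraction(toy) > 0.97
print(forward_assumed, [uncertain_flag(a1, a2, ref, forward_assumed) for a1, a2, ref in toy])
```

In the toy data only two of the three informative SNPs match the reference on the forward strand, so the 97% threshold is not met and the A/T SNP is the one flagged.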


For studies where only an RS ID is provided, we use the following method to determine the alleles:

1) If the RS ID is found in dbSNP (including any merged RS IDs)
  • Use the REF and ALT alleles given in dbSNP
  • Multiallelic RS IDs are written as allele1/allele2
  • Set Uncertain = EffectAllele to indicate the alleles are being inferred
2) If the RS ID is not found in dbSNP
  • REF = genome reference base and ALT = N
  • Set Uncertain = EffectAllele to indicate the alleles are unknown
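
The RS ID lookup above might look like the following sketch. The dictionaries stand in for a dbSNP lookup and its merge history; their shapes, the function name `infer_alleles`, and the toy RS IDs are all assumptions made for illustration. Writing multiallelic sites by joining the ALT alleles with "/" is one interpretation of "allele1/allele2".

```python
def infer_alleles(rs_id, dbsnp, merged, genome_ref_base):
    """Infer REF/ALT for a result that only provides an RS ID (illustrative).

    dbsnp:  hypothetical mapping rs_id -> (ref, [alt, ...])
    merged: hypothetical mapping retired rs_id -> current rs_id
    """
    current = merged.get(rs_id, rs_id)  # follow a merged RS ID if present
    entry = dbsnp.get(current)
    if entry is not None:
        ref, alts = entry
        # Alleles are inferred from dbSNP rather than reported by the study.
        return {"REF": ref, "ALT": "/".join(alts), "Uncertain": "EffectAllele"}
    # Not in dbSNP: only the genome reference base is known.
    return {"REF": genome_ref_base, "ALT": "N", "Uncertain": "EffectAllele"}


# Toy lookup tables; these are not real dbSNP records.
toy_dbsnp = {"rsTOY1": ("C", ["T"]), "rsTOY2": ("G", ["A", "C"])}
toy_merged = {"rsTOY_OLD": "rsTOY1"}

print(infer_alleles("rsTOY_OLD", toy_dbsnp, toy_merged, "C"))
print(infer_alleles("rsTOY_MISSING", toy_dbsnp, toy_merged, "A"))
```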

Metadata Curation

The OmicSoft GeneticsLand curation team, with expertise in GWAS, works to generate accurately curated project and sample level metadata (e.g. GWAS modeling information, dataset stratification, sample sizes, outcome unit, genotyping array, etc.). Additionally, the GWAS outcomes are classified into broad trait categories, allowing for quick phenotype-based filtering and searching.

How To Use

Metadata Data Dictionary

Column | Description | Example Entries
AssociationID | A unique ID for each independent analysis. A publication (grouped by project in Project Name) may perform more than one association (GWAS) analysis. Each is given a unique ID composed of the phenotype, any subsets (e.g. Males only or Stage 1), and the PMID. | e.g. Dementia Late Onset Alzheimers Disease Stage 1 PMID24162737
Ancestry | Ancestry of the individuals used in the association result set | e.g. African
Categories | Broad phenotype category | e.g. Cardiovascular
Consortia | Consortia that generated the association result set | e.g. CHARGE
Discovery Sample Description | Description of the individuals used in the study's discovery phase | e.g. 4,275 Korean ancestry individuals
Effect Type | Defines the type of value given for the "Effect Size" column in GeneticsLand | OR, Beta, HR, z-score
Imputed | Indicates if the study was imputed. NA is used for missing values and could mean either that the study was not imputed or that the data was not collected | Imputed, NA
Inclusion Threshold | Defines if the association result set has an inclusion threshold or if all results are provided. Full = full summary statistics available; TopHits = top hits summary statistics only, a p-value threshold was applied. | Full, TopHits
Model | Model used to generate the association result set. Generally available for studies that contain full summary statistics | Linear Regression, Logistic Regression, Interaction, Cox Regression
Number of Variants | Total number of SNPs and/or indels used in the source GWAS. For top-hits-only studies, this will not represent the number of SNPs/indels in Land |
Outcome Unit | The measurement unit of the phenotype. Generally available only for studies that contain full summary statistics | e.g. g/dL, CaseControl, %
Phenotype | The phenotype/outcome tested in the association result set | e.g. Myocardial Infarction
Platform | Genotyping array used | e.g. Custom Illumina iSelect
Project Name | Name of the project that the association result set belongs to. Either PubMed ID or dbGaP accession number | e.g. PMID21829377
Replication Sample Description | Description of the individuals used in the study's replication phase | e.g. 7579 EA cases, 8236 controls
Source | Source from where the association result set was obtained | e.g. dbGaP
Study Description | A description of the study's general focus |
Subset | Indicates whether a study was restricted to men, women or children | Men/Women/Children
Total Discovery Samples | Total number of individuals (N) used in the study's discovery phase |
Total Replication Samples | Total number of individuals (N) used in the study's replication phase |
Total Sample Size | Total number of individuals (N) in the study |
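
If the metadata is exported from the Land as a table, the columns above can be used for simple filtering outside the application. The sketch below assumes each row has been loaded as a dictionary keyed by column name. The rows, the AssociationID values, and the helper `filter_metadata` are made up for illustration.

```python
def filter_metadata(rows, criteria):
    """Keep rows whose columns equal every value in criteria (illustrative)."""
    return [
        row for row in rows
        if all(row.get(column) == value for column, value in criteria.items())
    ]


# Toy rows using column names from the data dictionary; not real studies.
toy_rows = [
    {"AssociationID": "Toy Phenotype A PMID00000001", "Categories": "Cardiovascular",
     "Inclusion Threshold": "Full", "Effect Type": "OR"},
    {"AssociationID": "Toy Phenotype B PMID00000002", "Categories": "Cardiovascular",
     "Inclusion Threshold": "TopHits", "Effect Type": "Beta"},
]

full_cardio = filter_metadata(
    toy_rows, {"Categories": "Cardiovascular", "Inclusion Threshold": "Full"}
)
print([row["AssociationID"] for row in full_cardio])
```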

Gene or SNP-centric workflow

1) Start by searching the Land for your Gene(s) or SNP(s) of interest. You can search by:

SNP: rs7412
Gene: APOE
Coordinate: 19:45412079
Region: 19:45412070-45412080
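
When preparing a list of search terms in advance, it can help to check which of the four forms each term takes. The sketch below is a hypothetical classifier based only on the examples above. It is not the Land's own parser, and the accepted chromosome names are an assumption.

```python
import re

# Patterns modeled on the four example queries; checked in this order.
QUERY_PATTERNS = [
    ("SNP", re.compile(r"^rs\d+$", re.IGNORECASE)),
    ("Region", re.compile(r"^(\d{1,2}|X|Y|MT):(\d+)-(\d+)$")),
    ("Coordinate", re.compile(r"^(\d{1,2}|X|Y|MT):(\d+)$")),
]


def classify_query(query):
    """Guess the search type of a term; anything unmatched is treated as a gene."""
    term = query.strip()
    for kind, pattern in QUERY_PATTERNS:
        if pattern.match(term):
            return kind
    return "Gene"


for term in ["rs7412", "APOE", "19:45412079", "19:45412070-45412080"]:
    print(term, "->", classify_query(term))
```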


GeneSearch.png



Or search multiple SNPs, genes, coordinates or regions at once

BasicSearch.png


2) Open the annotated association results under Select View | Curated Studies (Table)


CuratedStudiesTable.png


  • Note: multi-variant searches (any search other than a single SNP search) include an All SNPs table


3) Browse, modify, filter, or export results using the filter panel, task bar, and export buttons.

CuratedStudiesTable2.png


Phenotype-centric workflow

There are several ways to open a specific association result set in the Land. In GxL.Associations_B37, you may want to browse all studies related to a specific trait category. For example:


1) Select studies of interest by filtering on phenotype categories (or other metadata columns)

PhenotypeSearch.png



2) Highlight the category and specific studies of interest. Then select "Browse Selected Associations"

SelectPhenotypes.png



3) Select from the available associations views (top hits table, genome plots, etc.) under "Select View"

SelectView.png


  • Note: you can also browse association results by searching for the AssociationID in the search box or under Search Multiple Associations | Add From Land

Views

Association views can be reached by searching for one or more associations. Views include interactive plots:

genome plots GenomePlot.png

and region plots RegionPlot.png

Data Source

All GWAS results sets are open access. Please cite the original source when using this resource.



