From Array Suite Wiki



Annotate Variants (VCF/BED/GTT/RS_ID)


The Annotate Variants module allows the user to generate annotations using VCF, GTT, PLINK BED or RS_ID as input.

This function uses OmicSoft's OSCR database technology to allow efficient annotation, streaming, and filtering of even the largest VCF files. For example, the OSCR-annotated TCGA BRCA variant dataset (26 GB, 3.4 million variants, 2187 samples) can be filtered to 1751 Pathogenic variants in seconds. For an overview of the different methods for annotating variant data in Array Studio, please see Annotate Variants in Array Studio.

To use this module, select Analysis | NGS | Variation | Annotate Variant Files (VCF/BED/GTT/RS_ID) or Analysis | GWAS | Annotation.

[Screenshots: menu location under NGS (NGS annotation.png) or under GWAS (GWAS annotation.png)]

Input Data Requirements

The input file formats can be VCF, GTT, PLINK BED or RS_ID.


General Options



Add file

Add files to menu

  • Add: adds the selected sample files.
  • Add Folder: adds all sample files in the selected folder (local projects only).
  • Search: finds files based on sample registration (server projects only).
  • Add List: adds files from a list (a grouping file can also be added, for alignment functions).


  • Reference: The reference library to use.
  • Gene model: The gene model to use.
  • Generate .oscr files for GeneticsLand: Generates optimized .oscr files for streaming annotated variant data. This option is checked automatically, and the .oscr files are always output.
  • Generate text files: When checked, a text file containing the annotation is also generated.
  • Job number: The number of processing jobs to run in parallel.
  • Output folder: The folder in which the output .oscr and text files are written.
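
The interplay of the Job number and Output folder options can be pictured with a small Python sketch. This is only an illustration of "at most N jobs at a time, one output per input file"; it is not how Array Studio schedules its jobs. The function names `annotate_file` and `run_jobs`, the file names, and the one-output-per-input naming are all assumptions made for the example.

```python
from concurrent.futures import ThreadPoolExecutor
from pathlib import Path


def annotate_file(input_path: Path, output_folder: Path) -> Path:
    """Hypothetical stand-in for annotating one input file.

    It only computes the output path; a real job would write the .oscr file.
    """
    return output_folder / (input_path.stem + ".oscr")


def run_jobs(inputs, output_folder, job_number):
    """Process the input files with at most `job_number` jobs at a time."""
    output_folder = Path(output_folder)
    with ThreadPoolExecutor(max_workers=job_number) as pool:
        # map() returns results in input order, whichever job finishes first.
        return list(
            pool.map(lambda p: annotate_file(Path(p), output_folder), inputs)
        )


# Hypothetical file names, with two jobs allowed to run simultaneously.
outputs = run_jobs(["a.vcf", "b.vcf", "c.vcf"], "results", job_number=2)
print([p.name for p in outputs])
```

A higher job number only helps when several input files are queued; a single file would still occupy one job in this sketch.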

Annotation Source

Array Studio provides a large number of annotation classifiers to improve the identification of relevant genetic variants and genomic regions.

Variant based annotators

Annotation | Description | Current classifier to specify in Land.cfg
dbSNP | rs IDs from NCBI's dbSNP | Snp151 (DbsnpVersion in Land.cfg)
1000 Genomes | Continental allele frequencies from The 1000 Genomes Project | 1000GenomesSimple_20170501
gnomAD | Population allele frequencies from gnomAD | gnomAD_20170501
CADD | Allele deleteriousness score from CADD (Combined Annotation Dependent Depletion) | CADD_20170501
ClinVar | Associations between alleles and phenotypes from ClinVar | ClinVar_20170501
dbNSFP | Predicted effects of non-synonymous SNVs from dbNSFP | DBNSFP_v3.5
GRASP | Allele-trait associations from Genome-wide Repository of Associations between SNPs and Phenotypes | See details
GTEx eQTLs | Associations between alleles and gene expression (eQTLs) from GTEx | GTExEqtl_20170501
GWAS Catalog | Allele-trait associations from the EBI-NHGRI GWAS Catalog | GWASCatalog_20180626.1
GWAVA | Allele deleteriousness score from GWAVA (Genome Wide Annotation of VAriants) | GWAVA_20170501
HaploReg | Annotation on non-coding variants from HaploReg | HaploregV4_20170501
HGMD | Curated allele-disease associations from QIAGEN's Professional Version of the Human Gene Mutation Database | HGMD_2018.4
RegulomeDB | Annotation on non-coding variants from RegulomeDB | RegulomeDB_20170501
UK10K | Cohort allele frequencies from UK10K | UK10K_20170501
Wellderly | Allele frequencies from "healthy elderly" patients enrolled in the Scripps Wellderly study | Wellderly_20170501

Cancer frequency annotators

Annotation | Description | Current classifier to specify in Land.cfg
TCGA Germline | Allele frequencies and counts from joint-calling WES of normal samples from ~11k cancer patients in The Cancer Genome Atlas | TCGAGermline_All (see details for cancer-specific classifiers)

Gene based annotators

Annotation | Description | Current classifier to specify in Land.cfg
Drug-Gene interaction database | Gene functions and drug interactions from Drug-Gene interaction database | DGIdbCategories_20170501.osgc and DGIdbInteractions_20170501.osgc
NCBI Gene ID | Gene ID from NCBI's Gene database | EntrezID_20170501.osgc
Familial Cancer Genes | Gene-related syndromes from the Familial Cancer Database | Familial_20170501.osgc
HGNC | Gene ID from HUGO Gene Nomenclature Committee (HGNC) | HGNC_20170501.osgc
Human DNA Repair Genes | Activity of DNA repair genes from Wood lab | HumanDNARepairGenes_20170501.osgc
OMIM | Gene-disease associations from Online Mendelian Inheritance in Man | OMIM_20170501.osgc

Position and region based annotators

Annotation | Description | Current classifier to specify in Land.cfg
Conservation Scores | Conservation scores including GERP++, PhyloP, and PhastCons | Conservation_20170501
ENCODE Enhancers & Promoters | Enhancer and promoter regions predicted by ENCODE | See details
FitCons | Fitness consequence scores from INSIGHT | FitCons_20170501.osrc
InterPro | Protein domain from InterPro | Interpro_20170501


Output Results

An annotated OmicSoft SNP Classification Result (oscr) object will be created. Each row contains information for a single variant, and clicking a row displays the sample-level genotype details for that variant. Variant annotation columns, such as mutation type, AAposition and 1000genome frequency, can be used as filters.
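
The same kind of column filter can be applied outside Array Studio to the optional text output. The Python sketch below assumes a tab-delimited file; the column names (`Variant`, `MutationType`, `AAPosition`, `AF_1000G`) and the rows are made up for illustration, so a real export will need its actual header names substituted.

```python
import csv
import io

# Made-up rows with hypothetical column names; a real export will differ.
SAMPLE = (
    "Variant\tMutationType\tAAPosition\tAF_1000G\n"
    "chr1:100:A>G\tMissense\t12\t0.002\n"
    "chr1:200:C>T\tSynonymous\t45\t0.150\n"
    "chr2:300:G>A\tMissense\t7\t\n"
)


def rare_variants(handle, mutation_type, max_af):
    """Yield rows of the given mutation type whose frequency is <= max_af.

    Assumption: an empty frequency cell means the variant was not observed,
    so it is treated as frequency 0.
    """
    for row in csv.DictReader(handle, delimiter="\t"):
        af = float(row["AF_1000G"]) if row["AF_1000G"] else 0.0
        if row["MutationType"] == mutation_type and af <= max_af:
            yield row


# io.StringIO stands in for an opened text file.
hits = list(rare_variants(io.StringIO(SAMPLE), "Missense", 0.01))
print([row["Variant"] for row in hits])
```

Because the rows are read one at a time, the filter does not need to hold the whole file in memory.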


Variants affecting multiple genes

In cases where a variant overlaps more than one gene annotation, only the gene most severely affected by the variant will be listed. The purpose of this is to eliminate redundant annotation rows listing each gene for the same variant.

The ranking of variant severity is: Stop Loss > Stop Gain > Splice Site > Exon Coding > Exon Non-Coding > Intron > 5'UTR > 3'UTR > Intergenic.

To get a comprehensive list of affected genes (with repeated rows of relevant variant entries), use the oscript: AnnotateVcfData.
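
The selection rule can be sketched in Python as follows. The severity order is taken from the ranking above; everything else is an assumption for illustration, including the gene names, the `(gene, effect)` tuple layout, and the tie-breaking behavior (the first listed gene wins here, which may not match the module).

```python
# Severity order as listed above, most severe first.
SEVERITY = [
    "Stop Loss", "Stop Gain", "Splice Site", "Exon Coding",
    "Exon Non-Coding", "Intron", "5'UTR", "3'UTR", "Intergenic",
]
RANK = {effect: position for position, effect in enumerate(SEVERITY)}


def most_severe(annotations):
    """Return the (gene, effect) pair with the most severe effect.

    Ties keep the first pair listed, because min() returns the first minimum.
    This tie rule is an assumption, not documented module behavior.
    """
    return min(annotations, key=lambda pair: RANK[pair[1]])


# Made-up variant overlapping two hypothetical genes.
overlaps = [("GENE_A", "Intron"), ("GENE_B", "Splice Site")]
print(most_severe(overlaps))
```

Keeping every pair in `overlaps` instead of calling `most_severe` corresponds to the comprehensive, one-row-per-gene listing.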



