GWAS-VCF


GWAS-VCF format

GWAS-VCF is a variation of the Variant Call Format (VCF) that reports genetic association results instead of genotypes. It was recently developed and proposed for adoption as a new community standard. When GeneticsLand was developed in 2016, no such standard existed, so OmicSoft created the GTT format, which is very similar. However, because GWAS-VCF is a potential community standard that some major institutions have already adopted, it is favored over the OmicSoft-specific GTT format.

ExportFromGxl

The first implementation of this format in OmicSoft Suite was in the ExportFromGxl procedure in GeneticsLand. With the exception of the details noted here, we adhered to the most recent version of the GWAS-VCF spec at the time (v1.2).

FORMAT keys

We extended the reserved FORMAT keys for the additional association results that are stored in GeneticsLand and relevant for export:

Key | Corresponding GTT Column | Description
LB | AssociationResult.EffectSize_LowerBound | Lower bound of the confidence interval around ES
UB | AssociationResult.EffectSize_UpperBound | Upper bound of the confidence interval around ES
MU | AssociationResult.Up | Number of studies within a meta-analysis where the direction of ES was positive
MD | AssociationResult.Down | Number of studies within a meta-analysis where the direction of ES was negative
PH | AssociationResult.PValueHeterogeneity | P-value for any measure of heterogeneity for meta-analyses such as Cochran's heterogeneity statistic
UC | AssociationResult.Uncertain | Indicator(s) of uncertainty of result
EA | (none) | Which of the two ALT alleles is the effect allele when analysis was conducted without the REF allele

The following reserved FORMAT keys are excluded because no such values are stored in GeneticsLand: EZ, NC, AC.

Multi-allelic variants

Although the GWAS-VCF spec indicates that multi-allelic variants should be decomposed so that each ALT allele is tested separately against the REF allele, this is not feasible for existing results from historical analyses, where the genotype data needed to decompose and re-analyze the variants are not available.

GeneticsLand also favors this decomposed form and enforces it for genotypes but allows an exception for association results given this constraint for historical data. Specifically, we store the 2 tested ALT alleles in the ALT field delimited by '/' with the allele after the '/' being the effect allele. We use '/' instead of ',' to distinguish this as an ordered pair (the ',' delimiter is used in the VCF spec for the ALT field for an unordered list of 2 or more values).

When exporting such results to the GWAS-VCF format, to comply with the VCF spec for the ALT field, we replace the '/' delimiter with ',' and report the effect allele using the EA FORMAT key. Examples of this are highlighted in orange text on blue background in the Example file below.
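The conversion described above can be sketched in a few lines of Python. This is only an illustration of the documented rule, not the ExportFromGxl implementation; the function name gtt_alt_to_gwas_vcf is hypothetical.

```python
from typing import Optional, Tuple


def gtt_alt_to_gwas_vcf(alt: str) -> Tuple[str, Optional[str]]:
    """Convert a GeneticsLand ALT value to (VCF ALT, EA value).

    "C/T" is an ordered pair whose second allele is the effect allele,
    so it becomes ALT "C,T" with EA "T". An ALT without '/' is returned
    unchanged and has no EA value.
    """
    if "/" not in alt:
        return alt, None
    # Exactly two alleles are expected; anything else raises ValueError.
    first, effect = alt.split("/")
    return f"{first},{effect}", effect


print(gtt_alt_to_gwas_vcf("C/T"))  # ('C,T', 'T')
print(gtt_alt_to_gwas_vcf("A"))    # ('A', None)
```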

Association Metadata

For the GWAS-VCF's trait and study meta-information header lines, we export the columns from the Association and Project Metadata tables, respectively. The trait ID comes from the primary AssociationID column in the Association Metadata. The study ID comes from the primary ProjectName column in the Project Metadata. The remaining columns in these metadata tables become the key=value pairs in these structured meta-information header lines. Please note the following points, which are highlighted with a green background in the Example file below:

  • Per the VCF spec, all values in these key=value pairs must be strings and enclosed in double quotes (including numeric values).
    • See how the numeric Total Sample Size column of the Association Metadata in the Example file below is reported in the file's ##trait line.
  • Per the VCF spec, the following special characters in values will be percent encoded: LF (%0A), CR (%0D), and " (%22)
    • See how the Description column of the Project Metadata in the Example file below is reported in the file's ##study line.
  • The ProjectName column in the Association Metadata which links to the Project Metadata is reported using study as the key so it can be linked to the study meta-information header line.
    • If the Association Metadata already has a column with ID of study, it will be reported using study. as the key.
      • See how the study column of the Association Metadata in the Example file below is reported in the file's ##trait line.
  • Type is a reserved key with a controlled vocabulary per the VCF spec - if there is a column in either metadata table with ID of Type, it will be reported using Type. as the key.
    • See how the Type column of the Association Metadata in the Example file below is reported in the file's ##trait line.
  • The VCF spec places more stringent constraints on key identifiers than OmicSoft Suite does for column IDs:
    • Keys must begin with a letter - if a column ID begins with any non-letter characters, they will be removed to derive the key.
    • Keys can only contain letters, digits, underscores, or dot/period characters - any other characters in the column ID will be replaced with an underscore to derive the key.
      • See how the Total Sample Size column of the Association Metadata in the Example file below is reported in the file's ##trait line.
    • Keys must be unique, but the derived key may no longer be unique after the above 2 modifications are applied - if this occurs, a .N suffix will be added to establish uniqueness, where N is 1 for the first non-unique key and is incremented for each additional instance.
      • See how the 2019Status column of the Association Metadata in the Example file below is reported in the file's ##trait line.
Tip: Although strongly recommended, metadata is not required to export association results. If Association Metadata is missing for a particular AssociationID selected for export, a blank trait meta-information header line will be included in the GWAS-VCF that only contains the AssociationID as the trait ID. If Association Metadata is present but there is no ProjectName column joining the Project Metadata, then the study meta-information header line in the GWAS-VCF file will be omitted.
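The key and value rules listed above can be sketched in Python as follows. This is an illustration of the documented behaviour only, not the ExportFromGxl implementation: the function names (encode_value, derive_keys, header_line) and the reserved and renames parameters are hypothetical, only the three special characters named above are percent-encoded, and edge cases such as a column ID with no letters are not handled.

```python
import re


def encode_value(value):
    """Report any value as a double-quoted string, percent-encoding LF, CR and "."""
    text = str(value)
    for char, code in (("\n", "%0A"), ("\r", "%0D"), ('"', "%22")):
        text = text.replace(char, code)
    return f'"{text}"'


def derive_keys(column_ids, reserved=("Type",), renames=None):
    """Derive header keys from column IDs, in column order."""
    renames = renames or {}
    seen = {}
    keys = []
    for column_id in column_ids:
        if column_id in renames:
            key = renames[column_id]                    # e.g. ProjectName -> study
        else:
            key = re.sub(r"^[^A-Za-z]+", "", column_id)  # must begin with a letter
            key = re.sub(r"[^A-Za-z0-9_.]", "_", key)    # letters, digits, _ and . only
            if key in reserved:
                key += "."                               # e.g. Type -> Type.
        n = seen.get(key, 0)
        seen[key] = n + 1
        keys.append(f"{key}.{n}" if n else key)          # .1, .2, ... for repeats
    return keys


def header_line(kind, id_value, row, reserved=("Type",), renames=None):
    """Build a ##trait or ##study line from one metadata row (dict of column -> value)."""
    keys = derive_keys(list(row), reserved=reserved, renames=renames)
    pairs = [f"ID={encode_value(id_value)}"]
    pairs += [f"{key}={encode_value(value)}" for key, value in zip(keys, row.values())]
    return f"##{kind}=<{','.join(pairs)}>"


trait_row = {
    "ProjectName": "International Inflammatory Bowel Disease Genetics Consortium",
    "study": "PMID26192919",
    "Type": "CaseControl",
    "Status": "Open",
    "2019Status": "Embargo",
    "Total Sample Size": 34652,
}
trait_line = header_line(
    "trait",
    "Gastrointestinal Inflammatory Bowel Disease PMID26192919",
    trait_row,
    reserved=("Type", "study"),          # study is taken by the ProjectName link
    renames={"ProjectName": "study"},
)
print(trait_line)
```

With the Association Metadata row from the example on this page, the sketch produces the same ##trait line as the Example file below, including the study., Type., Status.1 and Total_Sample_Size keys.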


Other Meta-information header lines

These are highlighted in red text on yellow background in the Example file below.

dbSNP

Although unspecified in the GWAS-VCF spec, the accompanying example reports the dbSNP rs ID as an attribute in the INFO column (key=RSID). This is likely because of the VCF requirement that the values in the ID column be unique and when following the guidance to decompose multi-allelic variants, there will be instances of multiple records in the file having the same rs ID.

The ExportFromGxl procedure's existing behavior for exporting genotypes to a VCF file addressed this uniqueness constraint on the ID column by appending a suffix to non-unique rs ID values of the form :REF:ALT where REF and ALT are the values from the REF and ALT columns. This same behavior is used for exporting association results to a GWAS-VCF file and therefore there is no RSID INFO attribute. Instead, we report the version of dbSNP as a simple meta-information header (key=dbSNP). See rs4970393 in the Example file below.
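As a rough illustration of the suffix rule, the following Python sketch appends :REF:ALT to every ID that occurs more than once among the records being exported. The function name unique_ids and the (ID, REF, ALT) tuple layout are assumptions made for this example; it is not the ExportFromGxl code.

```python
from collections import Counter


def unique_ids(records):
    """records: list of (id, ref, alt) tuples. Returns the IDs to write to the file."""
    counts = Counter(rec_id for rec_id, _, _ in records)
    return [
        # Only IDs shared by several records receive the :REF:ALT suffix.
        f"{rec_id}:{ref}:{alt}" if counts[rec_id] > 1 else rec_id
        for rec_id, ref, alt in records
    ]


records = [
    ("rs4970393", "G", "A"),
    ("rs4970393", "G", "C"),
    ("rs2364118", "G", "A"),
]
print(unique_ids(records))
# ['rs4970393:G:A', 'rs4970393:G:C', 'rs2364118']
```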

source

We indicate which Land the export is from, the Land's version, and the OmicSoft Server software version in the simple meta-information header with key of source.

reference

We indicate which OmicSoft Reference Library the Land was built on in the simple meta-information header with key of reference.

Example file

Corresponding Land Association Metadata for this example:

AssociationID | ProjectName | study | Type | Status | 2019Status | Total Sample Size
Gastrointestinal Inflammatory Bowel Disease PMID26192919 | International Inflammatory Bowel Disease Genetics Consortium | PMID26192919 | CaseControl | Open | Embargo | 34652

Corresponding Land Project Metadata for this example:

ProjectName Description
International Inflammatory Bowel Disease Genetics Consortium In recent years the "IIBDGC" has focused on collecting very large datasets from a diverse set of countries via world-wide collaboration.

In addition to enabling the discovery of all these genes, we also try to dig a little deeper into what these associations actually mean.

Red text on yellow background highlighting - see Other Meta-information header lines section above

Orange text on blue background highlighting - see Multi-allelic variants section above

Green background highlighting - see Association Metadata section above

##fileformat=VCFv4.3
##gwasformat=GWAS-VCFv1.2
##filedate=20210516
##source=GxL.Associations_B37 GeneticsLand GxL.Associations_B37_20181114_v5 from OmicSoft Server 11.1.0.203
##reference=OmicSoft Human.B37.3
##dbSNP=151
##FORMAT=<ID=LP,Number=1,Type=Float,Description="-log10 p-value for effect estimate">
##FORMAT=<ID=ES,Number=1,Type=Float,Description="Effect size estimate relative to the alternative allele">
##FORMAT=<ID=SE,Number=1,Type=Float,Description="Standard error of ES">
##FORMAT=<ID=LB,Number=1,Type=Float,Description="Lower bound of the confidence interval around ES">
##FORMAT=<ID=UB,Number=1,Type=Float,Description="Upper bound of the confidence interval around ES">
##FORMAT=<ID=NS,Number=1,Type=Float,Description="Number of subjects with called genotypes in the analysis">
##FORMAT=<ID=AF,Number=1,Type=Float,Description="Alternative allele frequency in the analyzed subjects">
##FORMAT=<ID=SI,Number=1,Type=Float,Description="Accuracy score of imputed allele doses analyzed">
##FORMAT=<ID=MU,Number=1,Type=Integer,Description="Number of studies within a meta-analyis where the direction of ES was positive">
##FORMAT=<ID=MD,Number=1,Type=Integer,Description="Number of of studies within a meta-analyis where the direction of ES was negative">
##FORMAT=<ID=PH,Number=1,Type=Float,Description="P-value for for any measure of heterogeneity for meta-analyses such as Cochran's heterogeneity statistic">
##FORMAT=<ID=UC,Number=.,Type=String,Description="Indicator(s) of uncertainty of result",Reference="http://www.arrayserver.com/wiki/index.php?title=GTT#2.29_Key_columns">
##FORMAT=<ID=EA,Number=1,Type=String,Description="Which of the two ALT alleles is the effect allele when analysis was conducted without the REF allele">
##study=<ID="International Inflammatory Bowel Disease Genetics Consortium",Description="In recent years the %22IIBDGC%22 has focused on collecting very large datasets from a diverse set of countries via world-wide collaboration.%0AIn addition to enabling the discovery of all these genes, we also try to dig a little deeper into what these associations actually mean.">
##trait=<ID="Gastrointestinal Inflammatory Bowel Disease PMID26192919",study="International Inflammatory Bowel Disease Genetics Consortium",study.="PMID26192919",Type.="CaseControl",Status="Open",Status.1="Embargo",Total_Sample_Size="34652">
#CHROM	POS	ID	REF	ALT	QUAL	FILTER	INFO	FORMAT	Gastrointestinal Inflammatory Bowel Disease PMID26192919
1	962606	rs4970393:G:A	G	A	.	.	.	LP:ES:SE:LB:UB	2.155541:-0.0089:.:-0.014:-0.0058
1	962606	rs4970393:G:C	G	C	.	.	.	LP:ES:SE:LB:UB	4.155541:-0.0089:.:-0.014:-0.0058
1	962606	1:962606:G:C,T	G	C,T	.	.	.	LP:ES:SE:LB:UB:NS:AF:SI:MU:MD:PH:UC:EA	1.013631:0.0087:0.0037:0.005:0.0124:52565:0.12:0.85:9:1:0.52:EffectDirection,TwoNonRefs:T
1	962612	1:962612:G:C,T	G	T,C	.	.	.	LP:ES:SE:LB:UB:NS:AF:SI:MU:MD:PH:UC:EA	1.013631:0.0087:0.0037:0.005:0.0124:52565:0.12:0.85:9:1:0.52:TwoNonRefs:T
2	41198003	rs1200379625	G	C	.	.	.	LP:ES:SE	2.30103:.:.
2	41198011	rs13023575	T	C	.	.	.	LP:ES:SE:LB:UB	.:0.0087:0.0037:0.005:0.0124
2	41198012	rs1429578055	G	A	.	.	.	LP:ES:SE:LB:UB	299.2783:0.0087:.:0.005:0.0124
2	41198060	2:41198060:T:A	T	A	.	.	.	LP:ES:SE:LB:UB	.:0.0087:.:0.005:0.0124
2	41198090	rs1173809375	T	C	.	.	.	LP:ES:SE	.:.:.
3	25451461	rs2364118	G	A	.	.	.	LP:ES:SE:LB:UB	0.3019256:-0.0099:.:-0.014:-0.0058
X	60002	rs1226858834	T	A	.	.	.	LP:ES:SE:LB:UB	3.39794:0.0087:.:0.005:0.0124
Note: This example is simulated and does not reflect any particular real GWAS result.
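For readers who want to consume such a file, a minimal Python sketch of reading one data line is shown here. It assumes a single trait column (as in the example above), treats "." as a missing value, and converts LP back to a p-value as 10 to the power of -LP; the function name parse_record and the returned dictionary layout are hypothetical.

```python
def parse_record(line):
    """Split one tab-delimited GWAS-VCF data line with a single trait column."""
    fields = line.rstrip("\n").split("\t")
    chrom, pos, var_id, ref, alt = fields[:5]
    keys = fields[8].split(":")      # FORMAT column, e.g. LP:ES:SE
    values = fields[9].split(":")    # values for the one trait column
    stats = {k: (None if v == "." else v) for k, v in zip(keys, values)}
    # LP is the -log10 p-value, so the p-value is 10 ** -LP.
    p_value = None
    if stats.get("LP") is not None:
        p_value = 10 ** -float(stats["LP"])
    return {
        "chrom": chrom,
        "pos": int(pos),
        "id": var_id,
        "ref": ref,
        "alt": alt.split(","),
        "stats": stats,
        "p_value": p_value,
    }


line = "2\t41198003\trs1200379625\tG\tC\t.\t.\t.\tLP:ES:SE\t2.30103:.:."
record = parse_record(line)
print(record["id"], record["stats"], record["p_value"])
```

Values are kept as strings in stats so that multi-valued keys such as UC (for example EffectDirection,TwoNonRefs) can be split on ',' by the caller.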

