Genetics GwasQCPipeline2.pdf

From Array Suite Wiki


This module runs a sequence of common variant- and sample-level QC tests designed to prepare a high-quality set of genotype calls for analysis. It is designed for genome-wide data but can be used for any genotype dataset if the options are configured appropriately (e.g. QC filters that are not relevant to non-genome-wide datasets should be skipped).

To run this module, please select Analysis | GWAS | QC (PLINK2)

[Figure: GWAS QC2 menu]

Input Data Requirements

The input files should be in PLINK bed format. Specifically, the data must be normalized such that Allele1 (column 5) in the bim file is the reference allele on the plus strand and Allele2 (column 6) is the alternative allele on the plus strand. Other assumptions:

  • Variants in the pseudo-autosomal region (PAR) are reported on either chromosome 25 or XY in the PLINK bim file with positions corresponding to chromosome X.
  • Clinically reported sex is reported in the PLINK fam file using standard PLINK coding of 1 for males, 2 for females, 0 for missing.
  • Samples are unrelated - any known relationships defined in the pedigree structure will be ignored. Specifically, during the kinship test, if any known relatives have higher kinship than the Kinship cutoff, they will still be excluded as if they were unexpected relationships. Conversely, if any expected relatives per the pedigree structure have lower kinship than expected, they won't be identified as erroneous (if the kinship is below the Kinship cutoff, they will pass QC).
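
These assumptions can be spot-checked before running the module. The following is a minimal Python sketch and is not part of the module itself. The function names are hypothetical, and it assumes the standard whitespace-delimited PLINK layouts (sex code in column 5 of the fam file, chromosome code in column 1 of the bim file).

```python
from collections import Counter

def count_fam_sex_codes(fam_lines):
    """Tally the sex codes in column 5 of a PLINK fam file.

    Keys are "1" (male), "2" (female), "0" (missing) and "other"
    for anything that does not follow the standard coding.
    """
    counts = Counter()
    for line in fam_lines:
        fields = line.split()
        if len(fields) < 5:
            continue  # skip blank or truncated lines
        code = fields[4]
        counts[code if code in ("0", "1", "2") else "other"] += 1
    return counts

def count_par_variants(bim_lines):
    """Count variants whose chromosome code (column 1) is 25 or XY."""
    total = 0
    for line in bim_lines:
        fields = line.split()
        if fields and fields[0] in ("25", "XY"):
            total += 1
    return total

# Hypothetical usage with files on disk:
# with open("mystudy.fam") as fam:
#     print(count_fam_sex_codes(fam))
```

A non-zero "other" count would suggest the fam file does not use the 1/2/0 coding described above.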

If you have exported your data from GeneticsLand then the bim file will already comply with these assumptions but you may need to update the dataset to specify the clinical sex (--update-sex). Be sure to include the --keep-allele-order option so the allele normalization is maintained during this update.
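
As an illustration of the update described above, the sketch below assembles a PLINK command line in Python. The executable name `plink` and the file names `mystudy`, `clinical_sex.txt`, and `mystudy_sexed` are placeholders, not values defined by this module. The sex file is assumed to follow PLINK's default layout for --update-sex (FID, IID, then the sex code); please check the PLINK documentation for your version.

```python
import shlex

def build_update_sex_command(bfile, sex_file, out_prefix, plink_exe="plink"):
    """Return the argument list for a PLINK run that updates sex
    while preserving the existing allele order."""
    return [
        plink_exe,
        "--bfile", bfile,           # input bed/bim/fam prefix
        "--update-sex", sex_file,   # file with the clinical sex per sample
        "--keep-allele-order",      # keep the Allele1/Allele2 normalization
        "--make-bed",               # write a new bed/bim/fam set
        "--out", out_prefix,
    ]

cmd = build_update_sex_command("mystudy", "clinical_sex.txt", "mystudy_sexed")
print(shlex.join(cmd))
# To execute it: import subprocess; subprocess.run(cmd, check=True)
```

Building the command as a list rather than a single string avoids shell quoting problems if a path contains spaces.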

WARNING: If running this module on a Linux Array Server, your server administrator will need to confirm the Mono configuration.

General Options


Add file

Click Add to browse the Server folders for your PLINK bed file(s).

  • There are eight QC options users can specify at the QC stage. The default values are given in the above figure.
    • Marker missing rate threshold (round 1): Before calculating sample missing rate, this first pass marker missingness filter is used to remove any low quality markers with very high missingness.
    • Subject missing rate threshold (round 1): This step removes low quality samples based on missing rate. The default threshold of 0.02 removes samples with > 2% missingness (i.e. < 98% call rate), which is relatively stringent. You may wish to relax this to 0.05 or another more lenient threshold.
    • Marker missing rate threshold (round 2): After low quality samples are removed, low quality markers are removed based on missing rate.
    • Gender cutoff: Genotyped sex is inferred from the homozygosity of chr X. Samples with homozygosity < Female MaxF are reported with a genotyped sex of female, samples with homozygosity > Male MinF are reported with a genotyped sex of male, and samples with homozygosity between these two values are reported with a genotyped sex of unknown. Samples whose genotyped sex is opposite to the sex in the source data (PLINK fam file) are removed, under the assumption that the fam file sex is the phenotyped sex. Such a discrepancy between genotype and phenotype could indicate a misidentified sample (i.e. a sample swap) or a potential chr X aneuploidy, either of which would confound downstream association analyses. Samples with missing sex in the fam file pass regardless of the inferred genotyped sex.
    • Heterogeneity SD Unit: Sample homozygosity is calculated from the autosomal markers, and samples whose homozygosity is more than the specified number of standard deviations (default 3) from the mean are removed.
    • Kinship cutoff: The aim of this QC step is to identify cryptically related samples, which is done by estimating pair-wise Identity-By-Descent (IBD) between each pair of samples. Estimated kinship coefficients of >0.354, 0.177 - 0.354, 0.0884 - 0.177, and 0.0442 - 0.0884 correspond to duplicate/MZ twin, 1st-degree, 2nd-degree, and 3rd-degree relationships, respectively ref. The default kinship cutoff of 0.0884 identifies pairs of samples with 2nd-degree or closer relationships. Samples with kinship values larger than the cutoff are removed iteratively based on missingness (the sample with the highest missingness is removed first) until all remaining kinship values are below the cutoff.
    • HWE p-value cutoff: Checking for Hardy-Weinberg Equilibrium (HWE) is the final step in the quality control analysis of genetic markers. Under HWE assumptions, allele and genotype frequencies can be estimated from one generation to the next, and deviations from this equilibrium may indicate genotyping errors. It is critical to conduct this QC step in a subset of subjects with similar genetic ancestry. To do this, each sample's genetic ancestry is first inferred from principal components analysis (PCA) anchored with the five continental populations from the 1000 Genomes: Africans, Admixed Americans, East Asians, Europeans, and South Asians. The genetic ancestry of each sample is predicted using the k-Nearest Neighbors algorithm (k=3 by default). HWE testing is then conducted in the largest ancestry subgroup of samples. The default cutoff is set at 1e-199 to effectively bypass this test, because genotypes for truly associated markers may deviate from HWE; if the subsequent analysis is an association analysis, removing such markers could inflate type II error. If the subsequent analysis is something like imputation, which is very sensitive to genotyping error, a more stringent cutoff such as 1e-6 or 1e-8 can be set.
  • Specify reference genome: Your input PLINK bed file must be on the specified genome version. The version is used to define regions of high LD for exclusion during calculation of certain QC statistics. The 1000 Genomes data used for ancestry inference must also match this version so that the datasets can be combined correctly.
  • Parallel job number: One job is run per file set. For example, if you select 50 PLINK bed files and set the job number to 10, then 10 jobs (file sets) will run at the same time. Each set is processed independently, so this would produce 50 QC-filtered datasets.
  • Output folder specifies the full path to where output files should be saved.
  • Remove Unmapped Markers indicates whether to remove any markers that are unmapped (as indicated by being on chromosome 0 in the PLINK bim file).
  • Ignore Replicate Removal indicates whether to skip the step that removes the lower quality (lower call rate) probe among replicate probes. For datasets without replicates (e.g. NGS), selecting this option will greatly reduce run time.
  • Delete intermediate files specifies whether to delete the temporary files generated during the process. Keeping these files can sometimes be helpful for troubleshooting.
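
The sample-level rules described in the QC options above (gender cutoff, heterogeneity SD unit, and kinship cutoff) can be sketched in Python as follows. This is an illustrative reading of the text, not the module's actual code. All function names are hypothetical, the Female MaxF / Male MinF defaults of 0.2 and 0.8 are PLINK's --check-sex defaults and are assumed here rather than taken from this page, the label used for kinship below 0.0442 is an assumption, and the greedy removal loop is one plausible interpretation of "iteratively removed based on missingness".

```python
from statistics import mean, stdev

def classify_sex(x_f, female_max_f=0.2, male_min_f=0.8):
    """Infer sex from chr X homozygosity: 2 = female, 1 = male, 0 = unknown."""
    if x_f < female_max_f:
        return 2
    if x_f > male_min_f:
        return 1
    return 0

def fails_sex_check(fam_sex, x_f, female_max_f=0.2, male_min_f=0.8):
    """True only when the inferred sex is the opposite of a non-missing fam sex."""
    inferred = classify_sex(x_f, female_max_f, male_min_f)
    return fam_sex in (1, 2) and inferred in (1, 2) and inferred != fam_sex

def homozygosity_outliers(f_by_sample, sd_units=3.0):
    """Samples whose autosomal homozygosity is more than sd_units SDs from the mean."""
    values = list(f_by_sample.values())
    mu, sd = mean(values), stdev(values)
    return {s for s, f in f_by_sample.items() if abs(f - mu) > sd_units * sd}

def relationship_degree(kinship):
    """Map a kinship coefficient to the relationship ranges listed above."""
    if kinship > 0.354:
        return "duplicate/MZ twin"
    if kinship > 0.177:
        return "1st-degree"
    if kinship > 0.0884:
        return "2nd-degree"
    if kinship > 0.0442:
        return "3rd-degree"
    return "more distant or unrelated"  # label assumed, not from the source

def prune_related(pair_kinship, missing_rate, cutoff=0.0884):
    """Greedily drop the highest-missingness sample still involved in a
    pair above the cutoff, until no such pair remains."""
    removed = set()
    while True:
        involved = set()
        for (a, b), kinship in pair_kinship.items():
            if kinship > cutoff and a not in removed and b not in removed:
                involved.update((a, b))
        if not involved:
            return removed
        removed.add(max(involved, key=lambda s: missing_rate[s]))
```

For example, with pairs A-B (kinship 0.25) and B-C (kinship 0.10), removing B alone resolves both pairs if B has the highest missing rate of the three.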

Output Results

Upon completion of GWAS data QC, the QC results will be summarized as shown below. A _qced file containing the genotypes that passed all filters will also be imported into the project.


QC results are organized into the following four interactive table views:

  • Marker Exclusion List: The set of probes that did not meet marker QC thresholds are summarized in the marker exclusion list along with the reason for exclusion.
  • Sample Exclusion List: The set of samples that did not meet sample QC thresholds are summarized in the sample exclusion list along with the reason for exclusion.
  • PCA combined: PCA scatter plots are generated for pairs of selected PC scores (also called eigenvectors). In the below example, PC1 is plotted against PC2. Subjects from 1000 Genomes are represented by square symbols and your samples are denoted by circles. The five continental populations, Africans (AFR), Admixed Americans (AMR), East Asians (EAS), Europeans (EUR), and South Asians (SAS), are represented by different colors as shown in the right-hand view controller. In this example, the vast majority of the study samples cluster with Europeans. The genetic ancestry of each sample is further predicted using the k-Nearest Neighbors algorithm.
  • PCA results: The top 10 PCs are recalculated in the study samples only (without the 1000 Genomes samples). These PCs are appropriate for use as covariates in association analyses to correct for population stratification.
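
The k-Nearest Neighbors ancestry prediction mentioned above can be illustrated with a short Python sketch. It is a simplified example under stated assumptions: Euclidean distance on the PC scores, a majority vote among the k nearest 1000 Genomes reference samples, and made-up two-dimensional coordinates. The module's actual distance metric, number of PCs, and tie handling are not documented here, and the function names are hypothetical.

```python
from collections import Counter
from math import dist

def predict_ancestry(sample_pcs, reference, k=3):
    """Predict a population label by majority vote among the k reference
    samples closest to sample_pcs.

    reference is a list of (pc_scores, population_label) tuples.
    """
    nearest = sorted(reference, key=lambda ref: dist(sample_pcs, ref[0]))[:k]
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]

def largest_group(labels):
    """Return the most frequent predicted ancestry, e.g. to pick the
    subgroup in which HWE is tested."""
    return Counter(labels).most_common(1)[0][0]

# Made-up PC1/PC2 coordinates for illustration only.
reference = [
    ((0.00, 0.00), "EUR"), ((0.10, 0.00), "EUR"), ((0.00, 0.10), "EUR"),
    ((1.00, 1.00), "AFR"), ((1.10, 1.00), "AFR"), ((1.00, 1.10), "AFR"),
]
study = {"S1": (0.05, 0.05), "S2": (0.02, 0.08), "S3": (0.90, 1.00)}
predicted = {name: predict_ancestry(pcs, reference) for name, pcs in study.items()}
print(predicted, largest_group(predicted.values()))
```

With k=3 and well-separated reference clusters, each study sample takes the label of the cluster it falls in.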


These same results are also summarized in a Microsoft Word attachment. Here is an example report which includes a summary like:


In the output folder, there will be a new PLINK dataset: _qced.bed, _qced.bim, and _qced.fam.
