GWAS: QC (PLINK2)
This module runs a sequence of common variant- and sample-level QC tests designed to prepare a high-quality set of genotype calls for analysis. It is designed for genome-wide data but can be used for any genotype dataset if the options are configured appropriately (e.g. QC filters that are not relevant to non-genome-wide datasets should be skipped).
To run this module, select Analysis | GWAS | QC (PLINK2).
Input Data Requirements
The input files should be in PLINK bed format. Specifically, they must be normalized such that the first allele in the bim file is the reference allele and the second allele is the alternative allele. If you exported your data from GeneticsLand, it will already be normalized.
Click Add to browse the Server folders for your PLINK bed file(s).
- There are eight QC options users can specify at the QC stage. The default values are given in the above figure.
- Marker missing rate threshold (round 1): Before calculating sample missing rate, this first pass marker missingness filter is used to remove any low quality markers with very high missingness.
- Subject missing rate threshold (round 1): This step removes low quality samples based on missing rate. The default threshold of 0.02 removes samples with > 2% missingness (i.e. < 98% call rate), which is relatively stringent. You may wish to adjust this to 0.05 or another more lenient threshold.
- Marker missing rate threshold (round 2): After low quality samples are removed, low quality markers are removed based on missing rate.
- Gender cutoff: Genotyped sex is inferred from the homozygosity (F) of chr X. Samples with homozygosity < Female MaxF are reported with a genotyped sex of female. Samples with homozygosity > Male MinF are reported with a genotyped sex of male. Samples with homozygosity between these two values are reported with a genotyped sex of unknown. Samples whose genotyped sex is opposite to the sex in the source data (PLINK fam file) are removed. This assumes that the fam file sex is the phenotyped sex, so a discrepancy between genotype and phenotype could indicate a mis-identified sample (i.e. a sample swap) or a potential chr X aneuploidy, either of which would confound downstream association analyses.
- Heterogeneity SD Unit: Sample homozygosity is calculated from the autosomal markers, and samples more than the specified number of standard deviations from the mean (3 by default) are removed.
- Kinship cutoff: The aim of this QC step is to identify cryptically related samples by testing pair-wise Identity-By-Descent (IBD) between each pair of samples. Estimated kinship coefficient ranges of > 0.354, 0.354 - 0.177, 0.177 - 0.0884, and 0.0884 - 0.0442 correspond to duplicate/MZ twin, 1st-degree, 2nd-degree, and 3rd-degree relationships, respectively (ref). The default kinship cutoff of 0.0884 identifies pairs of samples with 2nd-degree or closer relationships. Samples with kinship values larger than the cutoff are removed iteratively based on missingness (the sample with the highest missingness is removed first) until all remaining kinship values are below the cutoff.
- HWE p-value cutoff: Checking for Hardy-Weinberg Equilibrium (HWE) is the final step in the quality control analysis of genetic markers. Under HWE assumptions, allele and genotype frequencies can be estimated from one generation to the next, and deviations from this equilibrium may indicate genotyping errors. It is critical to conduct this QC step in a subset of subjects with similar genetic ancestry. To do this, each sample's genetic ancestry is first inferred from principal components analysis (PCA) anchored with the five continental populations from the 1000 Genomes: Africans, Admixed Americans, East Asians, Europeans, and South Asians. The genetic ancestry of a sample is predicted using the k-Nearest Neighbors algorithm (k=3 by default). HWE testing is then conducted in the largest ancestry subgroup of samples. The default cutoff is set at 1e-199 to effectively bypass this test: genotypes for truly associated markers may deviate from HWE, so if the subsequent analysis is an association analysis, removing such markers could inflate type II error. If the subsequent analysis is something like imputation, which is very sensitive to genotyping error, a more stringent cutoff such as 1e-6 or 1e-8 can be set.
- Specify reference genome: The reference genome currently needs to be Human.B37.3, because it is used to define regions of high LD for exclusion during calculation of certain QC statistics. The 1000 Genomes data used for ancestry inference are also on version 37. Your input PLINK bed file should also be on version 37.
- Parallel job number: One job is run per file set. For example, if you select 50 PLINK bed files and set the job number to 10, then 10 jobs (file sets) will run at the same time. Each set is processed independently, so this would produce 50 QC-filtered datasets.
- Output folder specifies the full path to where output files should be saved.
- Remove Unmapped Markers indicates whether to remove any markers that are unmapped (as indicated by being on chr 0 in the input PLINK dataset).
- Delete intermediate files specifies whether to delete the temporary files generated during the process. Keeping them can sometimes be helpful for troubleshooting.
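The sample-level rules in the list above can be sketched in a few lines of Python. This is an illustration only, not the module's implementation. All function names are hypothetical. The 0.2 / 0.8 defaults for Female MaxF and Male MinF are an assumption taken from PLINK's --check-sex defaults (check the dialog for the module's actual values), the handling of values exactly on a boundary is also an assumption, and the pruning loop is just one reading of "remove the sample with the highest missingness first".

```python
from statistics import mean, stdev


def infer_sex(f_chr_x, female_max_f=0.2, male_min_f=0.8):
    """Classify genotyped sex from chr X homozygosity (F).

    The 0.2 / 0.8 defaults are assumed, not taken from the module.
    """
    if f_chr_x < female_max_f:
        return "female"
    if f_chr_x > male_min_f:
        return "male"
    return "unknown"


def sex_mismatch(reported, genotyped):
    """True only when the two calls are opposite; 'unknown' never mismatches."""
    return {reported, genotyped} == {"male", "female"}


def het_outliers(f_by_sample, sd_units=3.0):
    """Samples whose autosomal homozygosity is > sd_units SDs from the mean.

    Needs at least two samples, because a sample SD is computed.
    """
    values = list(f_by_sample.values())
    mu, sd = mean(values), stdev(values)
    return {s for s, f in f_by_sample.items() if abs(f - mu) > sd_units * sd}


def relationship(kinship):
    """Map a kinship coefficient to the relationship bands quoted above."""
    if kinship > 0.354:
        return "duplicate/MZ twin"
    if kinship > 0.177:
        return "1st-degree"
    if kinship > 0.0884:
        return "2nd-degree"
    if kinship > 0.0442:
        return "3rd-degree"
    return "more distant or unrelated"


def prune_related(pairs, missingness, cutoff=0.0884):
    """Iteratively drop samples until no remaining pair exceeds the cutoff.

    pairs: {(sample_a, sample_b): kinship}; missingness: {sample: rate}.
    On each pass, the still-related sample with the highest missingness
    is removed (ties broken alphabetically, an arbitrary choice).
    """
    removed = set()
    while True:
        involved = {
            s
            for (a, b), k in pairs.items()
            if k > cutoff and a not in removed and b not in removed
            for s in (a, b)
        }
        if not involved:
            return removed
        removed.add(max(sorted(involved), key=lambda s: missingness[s]))
```

For example, with pairs A-B at kinship 0.25 and B-C at 0.12, and B having the highest missingness of the three, a single pass removes B and leaves no pair above the default cutoff.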
Upon completion of GWAS data QC, the QC results are summarized as shown below. A _qced file containing the genotypes that passed all the filters is also imported to the project.
QC results are organized into the following four interactive table views:
- Marker Exclusion List: Markers that did not meet the marker QC thresholds are listed along with the reason for exclusion.
- Sample Exclusion List: Samples that did not meet the sample QC thresholds are listed along with the reason for exclusion.
- PCA combined: PCA scatter plots are generated for pairs of selected PC scores (also called eigenvectors). In the example below, PC1 is plotted against PC2. Subjects from 1000 Genomes are represented by square symbols and your samples by circles. The five continental populations, Africans (AFR), Admixed Americans (AMR), East Asians (EAS), Europeans (EUR), and South Asians (SAS), are represented by different colors as shown in the right-hand view controller. In this example, the vast majority of the study samples cluster with Europeans. The genetic ancestry of each sample is further predicted using the k-Nearest Neighbors algorithm.
- PCA results: The top 10 PCs are recalculated in the study samples only (without the 1000 Genomes samples). These PCs are appropriate for use as covariates in association analyses to correct for population stratification.
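The ancestry prediction shown in the PCA combined view (k-Nearest Neighbors on PC scores, anchored with 1000 Genomes samples) can be illustrated with a small pure-Python sketch. It assumes Euclidean distance on two PCs and a simple majority vote. The module's actual distance metric, number of PCs, and tie-breaking are not documented on this page, the function names are hypothetical, and the reference coordinates are made up for illustration.

```python
from collections import Counter
from math import dist


def predict_ancestry(sample_pcs, reference, k=3):
    """Majority vote among the k reference samples closest in PC space.

    reference: list of (pc_coordinates, population_label) tuples.
    """
    nearest = sorted(reference, key=lambda ref: dist(sample_pcs, ref[0]))[:k]
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]


def largest_group(labels):
    """Most common predicted ancestry, i.e. the subgroup used for HWE testing."""
    return Counter(labels).most_common(1)[0][0]


# Made-up PC1/PC2 coordinates standing in for 1000 Genomes reference samples.
REFERENCE = [
    ((0.10, 0.02), "EUR"), ((0.11, 0.03), "EUR"), ((0.09, 0.01), "EUR"),
    ((-0.30, 0.00), "AFR"), ((-0.31, 0.02), "AFR"), ((-0.29, -0.01), "AFR"),
    ((0.05, -0.25), "EAS"), ((0.06, -0.26), "EAS"), ((0.04, -0.24), "EAS"),
]

# Hypothetical study samples, also as PC1/PC2 coordinates.
study = {"S1": (0.10, 0.02), "S2": (0.095, 0.015), "S3": (-0.30, 0.01)}
predicted = {name: predict_ancestry(pcs, REFERENCE) for name, pcs in study.items()}
hwe_group = largest_group(predicted.values())
```

In this toy data two of the three study samples sit in the EUR cluster, so EUR would be the subgroup used for HWE testing.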
These same results are also summarized in a Microsoft Word attachment. Here is an example report which includes a summary like:
In the output folder, there will be a new PLINK dataset: _qced.bed, _qced.bim, and _qced.fam.
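As a quick sanity check on the output, the number of samples and markers that survived QC can be read from the text files of the new dataset: a PLINK .fam file has one line per sample and a .bim file has one line per marker. A minimal sketch, assuming a hypothetical output prefix such as /path/to/output/mystudy_qced (the function names are also illustrative):

```python
def count_lines(path):
    """Count non-empty lines in a text file."""
    with open(path) as handle:
        return sum(1 for line in handle if line.strip())


def qced_summary(prefix):
    """Return sample and marker counts for a PLINK dataset prefix.

    prefix is the dataset path without extension, e.g. '.../mystudy_qced'
    (a hypothetical name used only for illustration).
    """
    return {
        "samples": count_lines(prefix + ".fam"),  # one line per sample
        "markers": count_lines(prefix + ".bim"),  # one line per marker
    }
```

Comparing these counts with the input dataset and with the lengths of the Marker Exclusion List and Sample Exclusion List gives a simple consistency check.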