From Array Suite Wiki
Sentieon DNA-Seq Pipeline
Omicsoft has integrated Sentieon's Genomics Software with Array Studio for germline variant detection using the Sentieon DNA-Seq module. Sentieon recommends the following bioinformatics pipeline for DNA analysis that is based on the Broad Institute's BWA-GATK Best Practice workflow https://www.broadinstitute.org/gatk/guide/best-practices.
After creating a project on the server, users can access the module by selecting Add NGS Data | Add From Pipeline | Sentieon DNA-Seq Pipeline (Beta).
To mirror the sequential process of the overall GATK Best Practice workflow, this module is designed as a streamlined process that lets users efficiently conduct germline variant calling from raw input files. When starting with FASTQ files, Array Studio sets the default Steps to mimic the recommended best practices. BAM/CRAM, VCF, and GVCF files are also accepted as input; when they are used, the Steps of the workflow change automatically to match.
In the Basic section under the General Tab, the user has a number of options:
- Choose a Genome
- The user can choose whether this is a paired end sequencing analysis, and if so, the reads will automatically be paired using a numbering logic (e.g. _1, _2 or .1, .2).
- Replace existing BAM files in the output folder or skip the alignment step for samples already having BAM files in the output folder
- Number of threads for each alignment and number of jobs/samples to run in parallel.
- Output name
- Output folder
- (Optional) Users can designate the location of an interval file. BED and Picard style formats are supported. For more information on when to use interval files, please see the GATK documentation https://gatkforums.broadinstitute.org/gatk/discussion/4133/when-should-i-use-l-to-pass-in-a-list-of-intervals and read further on supported formats https://software.broadinstitute.org/gatk/documentation/article?id=11009
- (Optional) Users can align to a Custom Reference genome by designating the location of a FASTA file. The accompanying index and dictionary files must also be available to the software in the designated location. By default, Sentieon uses reference files from the GATK resource bundle.
- (Optional) A custom dbSNP file can be specified; when provided, it is used in the variant calling step to label known variants.
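The paired-end numbering logic mentioned in the options above can be illustrated with a short sketch. This is not Array Studio's actual implementation: the regular expression, the accepted extensions (.fastq, .fq, optionally .gz), and the function name `pair_fastq_files` are assumptions made for illustration.

```python
import re
from collections import defaultdict

# Assumed naming convention: a read-number suffix "_1"/"_2" or ".1"/".2"
# placed directly before the FASTQ extension.
_MATE_RE = re.compile(
    r"^(?P<sample>.+)[_.](?P<mate>[12])\.(?:fastq|fq)(?:\.gz)?$"
)


def pair_fastq_files(filenames):
    """Group FASTQ file names into (mate1, mate2) tuples keyed by sample name."""
    mates = defaultdict(dict)
    for name in filenames:
        match = _MATE_RE.match(name)
        if match is None:
            raise ValueError(f"cannot infer read number from {name!r}")
        mates[match["sample"]][match["mate"]] = name

    pairs = {}
    for sample, found in mates.items():
        # Both read 1 and read 2 must be present for a paired-end sample.
        if set(found) != {"1", "2"}:
            raise ValueError(f"sample {sample!r} is missing a mate file")
        pairs[sample] = (found["1"], found["2"])
    return pairs
```

For example, `pair_fastq_files(["A_1.fastq.gz", "A_2.fastq.gz"])` groups both files under the sample name `A`.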
A single command performs the alignment with BWA-MEM and creates sorted BAM files using Sentieon software. By default, the -M option is applied to mark shorter split hits as secondary. This option is selected for compatibility with the mark duplicates step, but it can be unchecked under the Customize Steps I tab.
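As a rough sketch of what such a single alignment command might look like, the helper below builds the two halves of a shell pipeline. The `sentieon bwa mem` and `sentieon util sort` invocations follow the pattern in Sentieon's published examples, but the exact command Array Studio runs is not documented here, so the argument layout, the omission of a read group, and the helper names are assumptions.

```python
import shlex


def build_alignment_commands(reference, fastq_1, fastq_2, sorted_bam,
                             threads=4, mark_secondary=True):
    """Return (bwa_argv, sort_argv) for an align-then-sort pipeline."""
    bwa = ["sentieon", "bwa", "mem", "-t", str(threads)]
    if mark_secondary:
        # -M marks shorter split hits as secondary (the pipeline default).
        bwa.append("-M")
    bwa += [reference, fastq_1]
    if fastq_2 is not None:
        bwa.append(fastq_2)

    # Assumed sort step: read SAM from stdin ("-i -") and write a sorted BAM.
    sort = ["sentieon", "util", "sort", "-r", reference, "-o", sorted_bam,
            "-t", str(threads), "--sam2bam", "-i", "-"]
    return bwa, sort


def as_shell_pipeline(bwa, sort):
    """Join both commands into one shell string connected by a pipe."""
    quote = lambda argv: " ".join(shlex.quote(arg) for arg in argv)
    return quote(bwa) + " | " + quote(sort)
```

Passing `mark_secondary=False` corresponds to unchecking the -M option under the Customize Steps I tab.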
Calculate Data Metrics
By default, five statistical summaries of data quality and pipeline analysis quality are generated per BAM file. Users can choose to skip specific outputs by adjusting the Metrics options under the Customize Steps I tab.
Please visit the Sentieon_QC_Metrics wiki page for a complete description of column headers in the Metrics reports.
All of the tables will be output to a 'Metrics' folder in the user specified Output folder on the server. This folder will contain:
- GC_Summary_TXT: GC bias metrics summary.
- GC_Metric_TXT: GC bias metrics report.
- MQ_Metric_TXT: Mapping Quality Metrics report.
- QD_Metric_TXT: Quality/depth metrics results.
- IS_Metric_TXT: insert size metrics results.
- ALN_Metric_TXT: alignment metrics report.
- Metrics_PDF: Full Metrics report file.
This step detects reads indicating that the same DNA molecule was sequenced several times. These duplicates are not informative, and it is recommended that they not be counted as additional evidence. The step requires a sorted BAM file as input and runs two commands: the first collects read information, and the second performs the deduplication. The output is DEDUP_METRICS_TXT and DEDUP_BAM with its associated index file (.bai). Because downstream analysis tools are duplicate-aware, users can instead select the Mark Duplicates Only option.
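The two-command structure of this step can be sketched as follows. The algorithm names `LocusCollector` and `Dedup` and their options are taken from Sentieon's general usage pattern rather than from this page, and the file names are placeholders, so treat the whole block as an assumption about what the pipeline runs.

```python
def build_dedup_commands(sorted_bam, score_file, metrics_txt, dedup_bam,
                         threads=4, mark_only=False):
    """Return (collect_argv, dedup_argv) for the two-pass duplicate step."""
    driver = ["sentieon", "driver", "-t", str(threads), "-i", sorted_bam]

    # Pass 1: collect read information into a score file.
    collect = driver + ["--algo", "LocusCollector",
                        "--fun", "score_info", score_file]

    # Pass 2: use the score file to mark (and, by default, remove) duplicates.
    dedup = driver + ["--algo", "Dedup"]
    if not mark_only:
        # Assumed mapping: "Mark Duplicates Only" simply omits --rmdup.
        dedup.append("--rmdup")
    dedup += ["--score_info", score_file, "--metrics", metrics_txt, dedup_bam]
    return collect, dedup
```

Both commands read the same sorted BAM; only the second one writes a new BAM file.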
The Realigner algorithm performs indel realignment on the dedup_bam file. This algorithm accepts the additional parameters --algo Realigner -k $known_sites --interval_list $regions to set known sites or to specify an interval. Please contact Omicsoft support for help with specifying these Additional Parameters in the pipeline.
This step calculates the required modification of the quality scores assigned to individual read bases of the sequence read data and applies those to the BAM file. A RECAL_result.csv file will be output to the server. Optionally, the recalibrated BAM file and its corresponding index file can be output to the server. This output is unchecked by default because Sentieon variant callers can perform the recalibration on the fly using the pre-recalibration BAM plus the recalibration table. In fact, if users choose to rerun the variant calling, they should NOT use the recalibrated BAM together with the recalibration table, as that would apply the recalibration twice.
A single command is run to call variants and apply the base quality score recalibration (BQSR). The Haplotype Caller (Haplotyper algorithm) is applied by default; however, users can opt to use the Unified Genotyper caller. The following options can be set:
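The warning about applying the recalibration twice can be captured in a small guard. The sketch below assumes the recalibration table is passed to the Sentieon driver with `-q`, as in Sentieon's usage examples; the function name and the `bam_is_recalibrated` flag are hypothetical.

```python
def recalibration_inputs(bam, recal_table=None, bam_is_recalibrated=False):
    """Build the driver input arguments for a variant-calling rerun.

    Raises ValueError when a recalibrated BAM is combined with a
    recalibration table, because that would apply BQSR twice.
    """
    if bam_is_recalibrated and recal_table is not None:
        raise ValueError(
            "recalibrated BAM plus recalibration table would apply BQSR twice"
        )
    args = ["-i", bam]
    if recal_table is not None:
        # On-the-fly recalibration: original BAM plus the table.
        args += ["-q", recal_table]
    return args
```

Either a pre-recalibration BAM with the table or a recalibrated BAM on its own is accepted; the combination of the two is rejected.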
- Emit confidence level: variants with quality less than this threshold will not be added to the output VCF file.
- Call confidence level: variants with quality less than this confidence will not be added to the output VCF file.
- Emit mode: determines what calls will be emitted. Options are (1) variant: emit calls only at confident variant sites (this is the default behavior); (2) confidence: emit calls at confident variant sites or confident reference sites; (3) all: emit all calls regardless of their confidence; and (4) gvcf: emit additional information required for joint calling (for Haplotyper only). This option is required if you plan to perform joint calling.
Sentieon recommends processing multiple samples from a cohort together using one of two options:
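The emit-mode rules in the list above can be written as a small validation check. The mode names are the ones listed on this page; the caller labels and the function itself are illustrative and do not reflect Array Studio's internal code.

```python
HAPLOTYPER = "Haplotyper"
UNIFIED_GENOTYPER = "Unified Genotyper"

# Modes available to both callers, as listed on this page.
COMMON_EMIT_MODES = ("variant", "confidence", "all")


def validate_emit_mode(caller, emit_mode="variant", joint_calling=False):
    """Return emit_mode if it is valid for the caller, else raise ValueError."""
    if caller not in (HAPLOTYPER, UNIFIED_GENOTYPER):
        raise ValueError(f"unknown caller: {caller!r}")

    allowed = COMMON_EMIT_MODES
    if caller == HAPLOTYPER:
        # gvcf output is only offered by the Haplotyper algorithm.
        allowed = allowed + ("gvcf",)
    if emit_mode not in allowed:
        raise ValueError(f"{emit_mode!r} is not available for {caller}")

    # Joint calling needs the extra information stored in gvcf output.
    if joint_calling and emit_mode != "gvcf":
        raise ValueError("joint calling requires emit mode 'gvcf'")
    return emit_mode
```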
- Process each sample individually (up to BQSR for each sample) creating either a recalibrated BAM file or a realigned BAM and a recalibration table for each sample. Then, process all BAM files using the Haplotyper algorithm.
- Process each sample individually and use the Haplotyper algorithm with option --emit_mode gvcf to create a gvcf file containing additional information, then process all gvcf using joint calling (the GVCFtyper algorithm). This method allows for easy and fast reprocessing after additional samples have been processed.
The input for the joint genotyping algorithm is a set of gvcf files, and the output is a vcf containing the joint called variants for all samples. Please review the GATK Forum FAQ for information on the gvcf format https://software.broadinstitute.org/gatk/documentation/topic.php?name=faqs. Options under the Customize Steps III tab allow users to set the Emit confidence level and the Call confidence level. Defaults are set to 10 and 30, respectively.
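A joint genotyping call over several gvcf files might be assembled as in the sketch below. The `-v` input option and the `--emit_conf`, `--call_conf`, and `--emit_mode` options follow Sentieon's GVCFtyper usage as generally documented, and the defaults of 10 and 30 come from this page; the helper name and the file names are placeholders, and the exact command Array Studio issues is an assumption.

```python
def build_gvcftyper_command(reference, gvcf_files, output_vcf,
                            emit_conf=10, call_conf=30, emit_mode="variant"):
    """Return the argv for a joint genotyping run over per-sample gvcf files."""
    if not gvcf_files:
        raise ValueError("at least one gvcf file is required")

    argv = ["sentieon", "driver", "-r", reference, "--algo", "GVCFtyper"]
    for gvcf in gvcf_files:
        # One -v option per input gvcf file.
        argv += ["-v", gvcf]
    argv += ["--emit_conf", str(emit_conf),
             "--call_conf", str(call_conf),
             "--emit_mode", emit_mode,
             output_vcf]
    return argv
```

Because the per-sample gvcf files are kept, adding new samples later only requires rerunning this step with a longer input list.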
Emit mode options are:
- Variant: emit calls only at confident variant sites. This is the default behavior.
- Confident: emit calls at confident variant sites or confident reference sites.
- All: emit all calls, regardless of their confidence.
- Keep the gvcf file: users can also select this option so that the original input file is not overwritten.
- Genotype missing segments: If genotyping was split into multiple genome segments (e.g. 1000 segments) and most segments completed but some failed (e.g. 0, 10, 99, and 144), use this option to genotype only the missing segments.
- Note: Segmentation IDs are zero-indexed, i.e. 0-999 (for 1000 segments).
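The zero-indexed segment numbering can be made concrete with a short helper that lists which segment IDs still need genotyping. The function name and its inputs are hypothetical; the pipeline does not expose such a function.

```python
def missing_segments(total_segments, completed_ids):
    """Return the sorted zero-indexed IDs of segments that did not complete."""
    completed = set(completed_ids)
    # Valid IDs run from 0 to total_segments - 1 (e.g. 0-999 for 1000 segments).
    invalid = sorted(i for i in completed if not 0 <= i < total_segments)
    if invalid:
        raise ValueError(f"segment IDs out of range: {invalid}")
    return [i for i in range(total_segments) if i not in completed]
```

With 1000 segments, where every segment except 0, 10, 99, and 144 completed, the helper returns exactly those four IDs.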
Variant Quality Score Recalibration (VQSR)
VQSR assigns a well-calibrated probability score to individual variant calls, enabling more accurate control in determining the most likely variants. The method uses highly confident known sites to build a recalibration model and determine the probability that called sites are true. For more information on the algorithm, please see the GATK documentation http://gatkforums.broadinstitute.org/discussion/39/variant-quality-score-recalibration-. Sentieon's VarCal algorithm calculates the Variant Quality Score Recalibration, which is then applied using the ApplyVarCal algorithm. The output is a copy of the original vcf containing the additional annotations from the VQSR.
Options that can be applied to the VQSR steps under the Customize Steps II tab:
- Set the Maximum Gaussians: This setting determines the maximum number of Gaussians that will be used for the positive recalibration model. The default is 8 for SNPs and 4 for INDELs. Note that separate models are built for SNPs and INDELs.
- Sensitivity: This sets a normalized quality threshold for each tranche, and the number should be between 0 and 100. The default value of 99.5% is recommended.
- Output alignment file format: Select the output file format from BWA
- Select Keep unrecalibrated vcf in order to keep the raw vcf files.
- Save Individual Log: Choose this option to generate a per-sample summary log file. The main log file is an aggregated log file; with this option selected, you will get additional log files, one per sample. Note that log files are extremely large.
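The VQSR defaults described above (separate SNP and INDEL models, 8 and 4 maximum Gaussians, sensitivity of 99.5) can be summarized in a small settings object. This is only an illustration of the documented defaults and value range; the class and function names are hypothetical and do not correspond to Array Studio or Sentieon APIs.

```python
from dataclasses import dataclass

# Default maximum Gaussians per variant type, as stated on this page.
DEFAULT_MAX_GAUSSIANS = {"SNP": 8, "INDEL": 4}


@dataclass(frozen=True)
class VqsrOptions:
    var_type: str
    max_gaussians: int
    sensitivity: float


def make_vqsr_options(var_type, max_gaussians=None, sensitivity=99.5):
    """Build validated VQSR options for one variant type (SNP or INDEL)."""
    if var_type not in DEFAULT_MAX_GAUSSIANS:
        raise ValueError(f"var_type must be SNP or INDEL, got {var_type!r}")
    if max_gaussians is None:
        max_gaussians = DEFAULT_MAX_GAUSSIANS[var_type]
    if max_gaussians < 1:
        raise ValueError("max_gaussians must be at least 1")
    # Sensitivity is a percentage and should lie between 0 and 100.
    if not 0 <= sensitivity <= 100:
        raise ValueError("sensitivity must be between 0 and 100")
    return VqsrOptions(var_type, max_gaussians, float(sensitivity))


def default_vqsr_models():
    """Return one options object per model, since SNPs and INDELs are separate."""
    return [make_vqsr_options(var_type) for var_type in ("SNP", "INDEL")]
```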