From Array Suite Wiki
Sentieon DNA-Seq Pipeline
OmicSoft has integrated Sentieon's Genomics Software into Array Studio for germline variant detection using the Sentieon DNA-Seq module. The typical use case is to run a bioinformatics pipeline such as the one illustrated below, which is based on the Broad Institute's Best Practices (https://www.broadinstitute.org/gatk/guide/best-practices).
After creating a project on ArrayServer, users can access the module by selecting Add NGS Data | Add From Pipeline | Sentieon DNA-Seq Pipeline (Beta):
Input format: This pipeline is designed to enable streamlined processing for germline variant calling from raw input files. GZIP compression is supported for FASTQ files. Other accepted starting file formats are BAM/CRAM, VCF, and GVCF; when one of these is used, the "Steps" of the pipeline automatically change accordingly.
Add File Names to menu:
- Add button will add samples by selection
- Search will bring up a popup menu to search Samples/Sample Sets registered on the server
- Add List will allow users to upload a list of files and file paths (even add a grouping file for alignment functions http://www.arrayserver.com/wiki/index.php?title=How_to_use_multiple_sequence_files_for_one_sample%3F).
- Remove will clear out selected files that have been added to the input files. You can select the samples with your mouse and choose to remove them.
- Clear will remove all files from the input file list. You do not need to select individual files.
In the Basic section under the General Tab, the user has a number of options:
- Choose a Genome: This pipeline will use the GATK resource reference files, rather than the standard OmicSoft reference genomes provided for other tools.
- The user can choose whether this is a paired-end sequencing analysis. If so, the reads will automatically be paired using a numbering logic (e.g. _1, _2 or .1, .2).
- Replace existing BAM files in the output folder or skip the alignment step for samples already having BAM files in the output folder
- Number of threads for each alignment and number of jobs/samples running in parallel.
- Output name
- Output folder
- (Optional) Users can designate the location of an interval file. BED and Picard-style formats are supported. For more information on when to use interval files, please see the GATK documentation https://gatkforums.broadinstitute.org/gatk/discussion/4133/when-should-i-use-l-to-pass-in-a-list-of-intervals and read further on supported formats https://software.broadinstitute.org/gatk/documentation/article?id=11009
- (Optional) Users can align to a Custom Reference genome by designating the location of a FASTA file. The corresponding index and dictionary files must be available to the software in the designated location. By default, Sentieon will use the reference files from the GATK resource bundle.
- (Optional) When a Custom dbSNP file is specified, it is used in the variant calling step to label known variants.
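The automatic pairing of paired-end reads described in the list above can be pictured with a minimal Python sketch. The regular expression, the function name `pair_fastq_files`, and the accepted extensions are assumptions made for illustration; the matching rules actually used by the pipeline may differ.

```python
import re
from collections import defaultdict

# Hypothetical pattern: a "_1"/"_2" or ".1"/".2" mate tag placed directly
# before the FASTQ extension. The pipeline's real rules may differ.
MATE_TAG = re.compile(
    r"^(?P<sample>.+)[._](?P<mate>[12])(?P<ext>\.(?:fastq|fq)(?:\.gz)?)$"
)


def pair_fastq_files(file_names):
    """Group FASTQ file names into (read1, read2) tuples keyed by sample name."""
    mates = defaultdict(dict)
    for name in file_names:
        match = MATE_TAG.match(name)
        if match is None:
            raise ValueError(f"no mate tag found in {name!r}")
        mates[match["sample"]][match["mate"]] = name

    pairs = {}
    for sample, found in mates.items():
        # Both mates must be present for a paired-end sample.
        if set(found) != {"1", "2"}:
            raise ValueError(f"sample {sample!r} is missing a mate file")
        pairs[sample] = (found["1"], found["2"])
    return pairs
```

For example, `pair_fastq_files(["A_1.fastq.gz", "A_2.fastq.gz"])` groups both files under the sample name `A`.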
Alignment is performed using BWA, and sorting is performed using Sentieon software. By default, we apply the option "-K 100000000" to guarantee that results are independent of the number of threads. The option -M is also applied to mark shorter split hits as secondary for Picard compatibility.
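As a rough illustration of the alignment options above, the following Python sketch assembles a BWA-MEM argument list with -K and -M set. It assumes the Sentieon-bundled BWA is launched as `sentieon bwa mem`; the function name, file names, and thread count are placeholders, and the read group and the downstream sorting step are left out. The command that Array Studio actually generates may differ.

```python
def build_alignment_command(reference, read1, read2=None, threads=8):
    """Return an argument list for one BWA-MEM alignment (illustrative only)."""
    command = [
        "sentieon", "bwa", "mem",   # assumed launcher for the bundled BWA
        "-M",                       # mark shorter split hits as secondary
        "-K", "100000000",          # fixed batch size: results independent of -t
        "-t", str(threads),
        reference,
        read1,
    ]
    if read2 is not None:           # second mate only for paired-end data
        command.append(read2)
    return command


if __name__ == "__main__":
    print(" ".join(build_alignment_command("ref.fasta", "S1_1.fastq.gz", "S1_2.fastq.gz")))
```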
Calculate Data Metrics
Five statistical summaries of quality are generated per sorted BAM file. Please visit the Sentieon_QC_Metrics wiki page for a complete description of the Metrics reports.
Duplicate removal is used to mitigate the effects of PCR amplification bias (i.e., the same DNA molecules were sequenced several times) on variant calling. Duplicates are not informative, and it is recommended that they not be counted as additional evidence. This step requires a sorted BAM file as input and is accomplished in two steps: (1) The LocusCollector algorithm collects read information and generates a score file. (2) The Dedup algorithm removes the duplicates.
The output is a DEDUP_BAM (.bam) with an associated index file (.bam.bai) and a metrics report (dedup_metrics). Because downstream analysis tools are duplicate-aware, users can instead select the Mark Duplicates Only option, which flags the duplicates but does not remove them from the BAM file.
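The two-step duplicate handling can be sketched as a pair of `sentieon driver` argument lists. This is a minimal sketch, not the command Array Studio runs: the flag names (`--fun score_info`, `--rmdup`, `--score_info`, `--metrics`) follow the usage shown in the Sentieon Tools Manual and should be checked against the installed version, and the function name and the score file name are hypothetical.

```python
def build_dedup_commands(sorted_bam, dedup_bam, threads=8, mark_only=False):
    """Return (LocusCollector, Dedup) argument lists for one sorted BAM file."""
    driver = ["sentieon", "driver", "-t", str(threads), "-i", sorted_bam]
    score_file = "score.txt"  # hypothetical name for the intermediate score file

    # Step 1: collect read information and write the score file.
    collect = driver + ["--algo", "LocusCollector", "--fun", "score_info", score_file]

    # Step 2: use the score file to remove (or only flag) duplicates.
    dedup = driver + ["--algo", "Dedup"]
    if not mark_only:
        dedup.append("--rmdup")  # omitted when only marking duplicates
    dedup += ["--score_info", score_file, "--metrics", "dedup_metrics", dedup_bam]
    return collect, dedup
```

Calling the function with `mark_only=True` corresponds to the Mark Duplicates Only option: the Dedup step still runs, but without the removal flag.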
Please note that variant calling workflows that use Haplotyper omit indel realignment. This is based on a change to the GATK Best Practices workflow.
This change does not apply to UnifiedGenotyper workflows. In that case, the Realigner algorithm will perform the indel realignment on the dedup_bam file. By default the Realigner algorithm applies additional parameters to set known sites or to specify an interval (--algo Realigner --known_sites /path_to_OmicSoftDirectory/Variant/Sentieon/Mills_and_1000G_gold_standard.indels.b37.vcf.gz --known_sites /path_to_OmicSoftDirectory/Variant/Sentieon/1000G_phase1.indels.b37.vcf.gz).
Base Quality Score Recalibration (BQSR) is a data pre-processing step that detects systematic errors made by the sequencer when it estimates the quality score of each base call. Variant calling algorithms rely heavily on the quality scores assigned to the individual base calls. The QualCal algorithm calculates the recalibrated quality scores and reports them to recal_data.table. The output of this step is a report (.recal.pdf) summarizing the data before and after recalibration.
Optionally, the recalibrated BAM file and its corresponding index file can be output.
A single command is run to call variants and apply the base quality score recalibration. Users can choose between Haplotype Caller (Haplotyper algorithm) or the Unified Genotyper caller.
The following options can be set:
- Emit confidence level: variants with quality less than this threshold will not be added to the output VCF file.
- Call confidence level: variants with quality less than this threshold will not be called.
- Emit mode: determines what calls will be emitted. Options are (1) variant: emit calls only at confident variant sites (this is the default behavior); (2) confident: emit calls at confident variant sites or confident reference sites; (3) all: emit all calls regardless of their confidence; and (4) gvcf: emit the additional information required for joint calling (for Haplotyper only). The gvcf option is required if you plan to perform joint calling.
GATK Best Practices currently set both the emit and call confidence to 10. Current versions of ArrayServer default both to 30, which was the previous best-practice value.
Joint calling of multiple samples is performed by the GVCFtyper algorithm. It is recommended to first process each genome through variant calling (using Haplotyper) with the option --emit_mode gvcf (https://software.broadinstitute.org/gatk/documentation/topic.php?name=faqs). You can manually enter the file path of the GVCF file for each sample or provide an input list. Then, process all GVCF files together using joint calling (the GVCFtyper algorithm).
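A minimal sketch of how the per-sample GVCF files could be passed to GVCFtyper is shown below. It assumes the `sentieon driver` form from the Sentieon Tools Manual, where each input GVCF is given with its own `-v` argument; the function name and file names are placeholders, and the command generated by the pipeline may include further options.

```python
def build_joint_calling_command(reference, gvcf_paths, output_vcf, threads=8):
    """Return a GVCFtyper argument list for a set of per-sample GVCF files."""
    if not gvcf_paths:
        raise ValueError("at least one GVCF file is required")
    command = [
        "sentieon", "driver",
        "-t", str(threads),
        "-r", reference,
        "--algo", "GVCFtyper",
    ]
    for path in gvcf_paths:
        command += ["-v", path]   # one -v per input GVCF (assumed usage)
    command.append(output_vcf)    # joint-called VCF is the last argument
    return command


if __name__ == "__main__":
    gvcfs = ["sample1.g.vcf.gz", "sample2.g.vcf.gz"]
    print(" ".join(build_joint_calling_command("ref.fasta", gvcfs, "joint.vcf.gz")))
```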
Options under the Customize Steps III Tab allow users to set the confidence levels. The Call Confidence determines the threshold of variant quality to call a variant: variants with quality less than the threshold will not be called. The Emit Confidence determines the threshold of variant quality to emit a variant: variants with quality less than the threshold will not be added to the output VCF file.
Defaults in ArrayServer are currently set to 30 and 30, respectively.
Emit mode options are:
- Variant: emit calls only at confident variant sites. This is the default behavior.
- Confident: emit calls at confident variant sites or confident reference sites.
- All: emit all calls, regardless of their confidence.
- Users can also select "Keep the gvcf file", which will not overwrite the original input file.
- Dividing whole genome into segments: This option will break the genome into smaller chunks and perform the joint calling as 1 job per segment. Please note that the last segment will contain the decoy genome and is likely to be considerably large. The number of segments will not influence the final output file; one vqsr.vcf file will be generated. This option is applied
- Genotype missing segments: Users can specify multiple genome segments (e.g. 1000 segments). If some segments fail, this option can be used to re-run the joint calling on specific segments (e.g. 0, 10, 99, and 144). The final VCF file will be merged.
- Note: Segmentation IDs are zero-indexed, i.e. 0-999 (for 1000 segments).
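Because segment IDs are zero-indexed, it is easy to be off by one when listing segments to re-run. The following Python sketch shows one way to work out which IDs are still missing. The function name and its inputs are hypothetical; how the completed segments are determined (for example, from the job log) is not covered here.

```python
def missing_segments(total_segments, completed_ids):
    """Return the zero-indexed segment IDs that still need joint calling."""
    completed = set(completed_ids)
    # Valid IDs run from 0 to total_segments - 1.
    out_of_range = sorted(i for i in completed if not 0 <= i < total_segments)
    if out_of_range:
        raise ValueError(f"segment IDs out of range: {out_of_range}")
    return [i for i in range(total_segments) if i not in completed]


if __name__ == "__main__":
    failed = {0, 10, 99, 144}
    done = [i for i in range(1000) if i not in failed]
    # Prints the IDs to enter for re-running, separated by commas.
    print(", ".join(str(i) for i in missing_segments(1000, done)))
```

With 1000 segments, the highest valid ID is 999, so an ID of 1000 is rejected.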
Variant Quality Score Recalibration (VQSR)
VQSR assigns a well-calibrated probability score to individual variant calls, to enable more accurate control in determining the most likely variants. The method uses highly confident known sites to build a recalibration model and determine the probability that called sites are true. For more information on the algorithm, please see the GATK documentation https://gatkforums.broadinstitute.org/gatk/discussion/39/variant-quality-score-recalibration
Sentieon's VarCal algorithm calculates the Variant Quality Score Recalibration, which is then applied using the ApplyVarCal algorithm. The output is a copy of the original vcf containing the additional annotations from the VQSR.
Options that can be applied to the VQSR steps under the Customize Steps II tab (see image under Joint calling):
- Set the Maximum Gaussians: This setting determines the maximum number of Gaussians that will be used for the positive recalibration model. The default value is 8 for SNPs and 4 for INDELs. Note that separate models are built for SNPs and INDELs.
- Sensitivity: This sets a normalized quality threshold for each tranche, and the number should be between 0 and 100. The default value of 99.5% is recommended.
- Additional Parameters: Use this field to apply additional options that are supported by the algorithm. For example, if the user adds "/annotation=MQ --annotation MQRankSum --annotation ReadPosRankSum --annotation FS" to the Additional Parameters, then "--annotation MQ --annotation MQRankSum --annotation ReadPosRankSum --annotation FS" will be applied in the Sentieon command.
- Output alignment file format: Select the output file format for BWA. Options are Auto, BAM, or CRAM.
- Select Keep unrecalibrated vcf in order to keep the raw vcf files.
- Save Individual Log: Choose this option to generate a per-sample summary log file. Log files are extremely large and will "freeze" the Array Studio GUI if users try to view the log under the Server Tab. The main log file is a filtered, aggregated log file that can be found in /Path_to_BaseDir/ServerJobLog. If you choose to save individual log files, you will get additional log files, 1 per sample.
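The Sensitivity setting in the list above can be pictured with a small Python sketch. It follows the GATK-style reading of a sensitivity tranche: choose the recalibration score cutoff that retains the requested percentage of known true variants. This is an illustration of the concept only, not Sentieon's implementation; the function name and the input scores are hypothetical.

```python
import math


def score_cutoff(truth_scores, sensitivity=99.5):
    """Return the lowest score kept so that `sensitivity` percent of the
    truth-set variants score at or above the cutoff."""
    if not truth_scores:
        raise ValueError("truth_scores must not be empty")
    if not 0 < sensitivity <= 100:
        raise ValueError("sensitivity must be greater than 0 and at most 100")
    ranked = sorted(truth_scores, reverse=True)      # best scores first
    keep = math.ceil(len(ranked) * sensitivity / 100)
    return ranked[keep - 1]                           # score of the last kept variant
```

A lower sensitivity gives a stricter cutoff and fewer retained calls; a value of 100 keeps every truth-set variant.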
For additional details or answers to further questions, please see the Sentieon Tools Manual https://s3.amazonaws.com/sentieon-release/documentation/Sentieon+genomics+Manual+-+Latest.pdf