Sentieon TNSeq/TNscope Pipeline
This page describes how to use the Sentieon tools as they are automated in OmicSoft's ArrayStudio.
Running these tools requires a Sentieon license. Please contact firstname.lastname@example.org if you would like more information about obtaining a license to use the Sentieon pipelines.
After creating a project on the server, access the module by selecting Add NGS Data | Add From Pipeline | Sentieon TNseq/TNscope Pipeline (Beta):
Tumor vs Normal Mode
This section describes how to use the pipeline in Tumor vs Normal mode to call somatic variants in paired tumor-normal samples. The figure below shows a typical tumor-normal analysis pipeline as recommended by the Broad Institute: https://software.broadinstitute.org/gatk/documentation/presentations.php?id=6007
Pipeline Mode: Tumor vs Normal
Input format: Users can leave input format to AUTO. Accepted input files include FASTQ (GZIP compression is supported) and BAM/CRAM.
Add files to menu
- Add button will add samples by selection
- Search will bring up a popup menu to search Samples/Sample Sets registered on the server
- Add List will allow users to upload a list of files and file paths. The list can also include a grouping file for alignment functions (see http://www.arrayserver.com/wiki/index.php?title=How_to_use_multiple_sequence_files_for_one_sample%3F).
- Remove will clear out selected files that have been added to the input files. You can select the samples with your mouse and choose to remove them.
- Clear will remove all files from the input file list. You do not need to select individual files.
Under the General tab, there are a number of choices that must be selected in order to send the job to queue:
- Choose a Genome Reference: Choices are Human Build 37, Human Build 38, or Others.
An important note about the reference genomes: Sentieon tools support reference genomes from the GATK resource bundle. For Build 37, the Sentieon reference matches OmicSoft's Human.B37.3 and is compatible with downstream functions, including variant annotation and loading into Lands. In contrast, GRCH38DH is now the official GATK reference for alignment and does not match Human.B38, so you will receive errors when annotating and loading files into Lands. Please contact email@example.com if you need more information on how to avoid these errors. OmicSoft is actively working to support GRCH38DH in all of our Genetics functions.
- Define the Normal group based on information that will be populated from the grouping file.
- Choose whether this is a paired-end sequencing analysis. If so, the reads will automatically be paired using a numbering logic (e.g. _1, _2 or .1, .2).
- Replace Existing BAM files: If left Unchecked (default), the alignment step will check the output folder. Alignment will be skipped for samples already having alignment output in the folder. It is designed to allow users to re-run the alignment step on the whole batch, skipping the alignment step for existing BAMs. If this option is checked, the module will re-run alignment for all samples and overwrite old alignment files in the output folder.
- Number of threads for each alignment and number of jobs/samples running in parallel. Thread Number is set to 4 by default. Job Number is set to 1 by default.
- Identify a location for the Output Folder to place the QC metrics reports/tables, bam files, vcf files etc.
- Grouping File: Please see this wiki page for an example of the Grouping File http://www.arrayserver.com/wiki/index.php?title=Grouping_File. The grouping file is used to identify the matched pairs.
- (Optional) User can supply a custom reference FASTA file if selecting Others as the Reference Genome.
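The paired-end numbering logic mentioned in the list above can be illustrated with a minimal Python sketch. This is not ArrayStudio's actual implementation: the function name `pair_fastq_files` and the exact naming rule (a `_1`/`_2` or `.1`/`.2` suffix immediately before a `.fastq` or `.fq` extension, optionally gzip-compressed) are assumptions made for illustration.

```python
import re
from collections import defaultdict

# Assumed pattern: <sample><sep><1|2><extension>, where sep is "_" or "."
# and the extension is .fastq or .fq, optionally followed by .gz.
_MATE_PATTERN = re.compile(
    r"^(?P<sample>.+)[_.](?P<mate>[12])\.(?:fastq|fq)(?:\.gz)?$"
)


def pair_fastq_files(filenames):
    """Group FASTQ file names into (mate1, mate2) pairs keyed by sample name.

    Files that do not match the assumed naming scheme, or that have no
    partner, are returned separately so they can be reviewed by hand.
    """
    mates = defaultdict(dict)
    unpaired = []
    for name in filenames:
        match = _MATE_PATTERN.match(name)
        if match is None:
            unpaired.append(name)
            continue
        mates[match["sample"]][match["mate"]] = name

    pairs = {}
    for sample, found in mates.items():
        if "1" in found and "2" in found:
            pairs[sample] = (found["1"], found["2"])
        else:
            unpaired.extend(found.values())
    return pairs, sorted(unpaired)


example_pairs, example_unpaired = pair_fastq_files(
    [
        "tumorA_1.fastq.gz",
        "tumorA_2.fastq.gz",
        "normalA.1.fq",
        "normalA.2.fq",
        "orphan_1.fastq",
    ]
)
```

Here `example_pairs` maps `tumorA` and `normalA` to their two mate files, while `orphan_1.fastq` ends up in `example_unpaired` because its `_2` partner is missing.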
Steps: The steps below compose the typical pipeline as illustrated above. Users can choose all steps to run as a single automated pipeline or run each step individually:
- BWA Mapping
- Remove Duplicates
- Indel Realigner
- Base Recalibration (BQSR)
Variant Calling: can choose to run one or both algorithms
- TNsnv -- calls SNVs using a Genotyper algorithm and provides results that match MuTect
- TNhaplotyper -- calls SNVs and INDELs using a Haplotyper algorithm and provides results that match MuTect2
- BWA Mapping: A single command is run to efficiently perform the alignment using BWA-MEM and create sorted BAM files. By default, the option -M is applied to mark short split hits as secondary, for compatibility with the mark duplicates step. This option can be turned off by un-checking Mark short split hits as secondary.
- Metrics: Five statistical summaries of the alignment will be generated per BAM file. Users can skip any QC step by unchecking the corresponding option under Metrics: GC bias, Insert size, Alignment summary, Mean quality by cycle, and Quality distribution.
- Remove Duplicates: This step detects reads indicative that the same DNA molecules were sequenced several times. These duplicates are not informative, and it is recommended that they not be counted as additional evidence. This step requires a sorted BAM file and is accomplished in two steps. The first command collects read information, and the second command performs the deduping. The output is DEDUP_METRICS_TXT and DEDUP_BAM with associated index file (.bai). Downstream analysis tools are duplicate aware and therefore, users can choose to select the option to Mark Duplicates only.
- Indel Realignment: This step performs a local realignment around indels. It is performed on the Tumor and Normal samples independently. The Realigner algorithm will perform the indel realignment on the dedup_bam file.
- Base Recalibration: This step calculates the required modification of the quality scores assigned to individual read bases of the sequence read data and applies those to the BAM file. A RECAL_result.csv file will be output to the server. Optionally, the recalibrated BAM file and its corresponding index file can be output to the server. This output is optional (and is unchecked by default) because Sentieon variant callers can perform the recalibration on the fly using the pre-recalibration BAM plus the recalibration table. In fact, if users choose to rerun the variant calling, they should NOT use the recalibrated BAM together with the recalibration table, as that would apply the recalibration twice.
- Indel co-realignment: This step performs a local realignment around indels for the combined data of both tumor and normal samples.
- Variant Discovery: Input is a BAM file and output is a VCF file. Somatic variant calling can be performed using either Mutect (TNsnv) or Mutect 2 (TNhaplotyper). Defaults are set based on recommendations from Sentieon.
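The warning in the Base Recalibration step about applying the recalibration twice can be expressed as a small guard. The sketch below is illustrative Python and not part of the pipeline: `CallerInputs`, `choose_variant_calling_inputs`, and `check_inputs` are hypothetical names, and the file names in the example are placeholders. It assumes the caller knows whether a given BAM has already been recalibrated.

```python
from typing import NamedTuple, Optional


class CallerInputs(NamedTuple):
    bam: str
    recal_table: Optional[str]  # None means "do not recalibrate on the fly"


def choose_variant_calling_inputs(
    pre_recal_bam: str,
    recal_table: str,
    recalibrated_bam: Optional[str] = None,
) -> CallerInputs:
    """Pick a BAM/table combination that applies recalibration exactly once.

    - No recalibrated BAM was written (the default): use the
      pre-recalibration BAM together with the recalibration table.
    - A recalibrated BAM was written: use it alone, without the table.
    """
    if recalibrated_bam is None:
        return CallerInputs(bam=pre_recal_bam, recal_table=recal_table)
    return CallerInputs(bam=recalibrated_bam, recal_table=None)


def check_inputs(bam_is_recalibrated: bool, recal_table: Optional[str]) -> None:
    """Reject the combination the text warns against."""
    if bam_is_recalibrated and recal_table is not None:
        raise ValueError(
            "A recalibrated BAM combined with a recalibration table "
            "would apply the recalibration twice."
        )


# Default case: no recalibrated BAM on disk, so the table is applied on the fly.
default_inputs = choose_variant_calling_inputs("tumor.bam", "RECAL_result.csv")
```

Calling `check_inputs(True, "RECAL_result.csv")` raises a `ValueError`, which is the situation to avoid when rerunning variant calling by hand.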
Each algorithm that is run in the steps above can accept additional options. These are not all outlined here. You can contact firstname.lastname@example.org to ask how to apply additional options.
If users have an additional license for TNscope, it can be enabled here. TNscope will simultaneously call SNVs, INDELs, and structural variants.
- Minimum base quality: determines the filtering quality of the bases used in variant calling. The default value is 10. Any base with quality less than 10 will be ignored.
- Prune Factor: minimum pruning factor in local assembly; paths with fewer supporting kmers than FACTOR will be pruned from the graph. The default value is 2.
- Estimated normal contamination fractions: estimation of the fraction of contamination on the normal sample coming from other samples. The default is set to 0.
- Estimated tumor contamination fractions: estimation of the fraction of contamination on the tumor sample coming from other samples. The default is set to 0.
- Minimum normal LOD: minimum normal log odds used to check that the tumor variant is not a normal variant. The default value is 2.2.
- Minimum tumor LOD: minimum tumor log odds in the final call of variants. The default value is 6.3.
- Minimum normal LOD for candidate selection: minimum normal log odds in the initial pass calling variants. The default value is 0.5.
- Minimum tumor LOD for candidate selection: minimum tumor log odds in the initial pass calling variants. The default value is 4.
- Phasing: Choose to Enable or Disable phasing in the output.
- PCR indel error model: This is used to weed out false positive indels. The possible modes are None (used for PCR free samples), Hostile, Aggressive, and Conservative. They are ordered as Hostile>Aggressive>Conservative based on decreasing aggressiveness. The default value is Conservative.
- Additional Parameters:
- Select the option to input a Custom dbSNP file to label known variants.
- Users can select to designate the location of an optional interval file. BED and Picard style formats are supported.
- Select the output file format from BWA
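For scripted bookkeeping, the variant-calling defaults listed above can be collected in one place. The following Python sketch only mirrors the values stated in the list; the class name `VariantCallingOptions` and all field names are hypothetical and do not correspond to ArrayStudio or Sentieon identifiers. Because the text does not state a default for Phasing, that field has no default here and must be set explicitly.

```python
from dataclasses import dataclass

# PCR indel error models named in the text. "None" is for PCR-free samples;
# the others are listed from most to least aggressive.
PCR_INDEL_MODELS = ("None", "Hostile", "Aggressive", "Conservative")


@dataclass(frozen=True)
class VariantCallingOptions:
    """Defaults copied from the option list above; field names are hypothetical."""

    phasing: bool  # no default is stated in the text, so the caller must choose
    min_base_quality: int = 10
    prune_factor: int = 2
    normal_contamination_fraction: float = 0.0
    tumor_contamination_fraction: float = 0.0
    min_normal_lod: float = 2.2
    min_tumor_lod: float = 6.3
    min_init_normal_lod: float = 0.5
    min_init_tumor_lod: float = 4.0
    pcr_indel_model: str = "Conservative"

    def __post_init__(self):
        # Basic sanity checks on the values a user might override.
        if self.pcr_indel_model not in PCR_INDEL_MODELS:
            raise ValueError(f"Unknown PCR indel model: {self.pcr_indel_model!r}")
        for name in ("normal_contamination_fraction", "tumor_contamination_fraction"):
            value = getattr(self, name)
            if not 0.0 <= value <= 1.0:
                raise ValueError(f"{name} must be between 0 and 1, got {value}")

    def base_is_used(self, base_quality: int) -> bool:
        """Bases with quality below the minimum are ignored in variant calling."""
        return base_quality >= self.min_base_quality


defaults = VariantCallingOptions(phasing=True)
pcr_free = VariantCallingOptions(phasing=True, pcr_indel_model="None")
```

With the defaults, `defaults.base_is_used(9)` is `False` and `defaults.base_is_used(10)` is `True`, matching the rule that bases with quality less than 10 are ignored.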
Tumor Only Mode
The pipeline can be used to process tumor samples when a matched normal sample is not available. To enable this mode, users must supply their own Panel of Normals and a set of known somatic variants to be white-listed during variant calling, preferably from COSMIC.
The required input format under the General tab is VCF. Both the Panel of Normals and COSMIC VCF files are required.
Please note that due to licensing restrictions, OmicSoft cannot distribute the COSMIC file to users, as we do not have a COSMIC license. Users must obtain a license on their own and run the following commands using Sentieon in order to generate the Panel of Normals VCF file:
Common Error message
Cannot allocate memory
[00:01:00] version: sentieon-genomics-201611.03
[00:00:46] [kmc_init] Failed to allocate 11453246080 bytes at bwt.c line 443: Cannot allocate memory
...
[00:01:00] Uncaught exception "std::runtime_error": Failed to open /local/scratch/server_data/ArrayServerFile/ServerTest06Wilson_9065/FtpRoot/Users/qa/Server80_Sentieon2017/20180801/C28264_Test56_Tumor.realigned.bam: No such file or directory
... . Error = Object reference not set to an instance of an object@@@
This error suggests that the server does not have enough memory for the Sentieon algorithm. We have seen this message when running the pipeline on a small Linux machine (16GB memory). The error disappeared when the same pipeline was re-run on a machine with more memory. We therefore suggest that users either use a machine with more memory or reduce the number of parallel jobs running on the same machine while running the Sentieon pipeline.
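As a rough way to reason about this advice, the Python sketch below estimates how many parallel jobs fit in a given amount of memory. It is a back-of-the-envelope heuristic, not a Sentieon or ArrayStudio feature: the function name `max_parallel_jobs`, the `reserved_bytes` parameter, and the 6 GiB reserve used in the example are assumptions. The per-job figure is simply the size of the failed allocation reported in the log (11453246080 bytes), which is only a lower bound on what one alignment job needs.

```python
def max_parallel_jobs(total_memory_bytes, per_job_bytes, reserved_bytes=0):
    """Return how many jobs fit in memory after setting aside a reserve.

    reserved_bytes stands in for memory used by the operating system and
    other processes; its value is an assumption the user must supply.
    """
    if per_job_bytes <= 0:
        raise ValueError("per_job_bytes must be positive")
    usable = total_memory_bytes - reserved_bytes
    if usable <= 0:
        return 0
    return usable // per_job_bytes


GIB = 1024 ** 3
FAILED_ALLOCATION = 11_453_246_080  # bytes, taken from the error log

# With nothing else using memory, one such allocation fits in 16 GiB.
jobs_16 = max_parallel_jobs(16 * GIB, FAILED_ALLOCATION)
# If an assumed 6 GiB is already in use, not even one allocation fits.
jobs_16_busy = max_parallel_jobs(16 * GIB, FAILED_ALLOCATION, reserved_bytes=6 * GIB)
# A larger machine leaves room for several parallel jobs.
jobs_64 = max_parallel_jobs(64 * GIB, FAILED_ALLOCATION)
```

Under these assumptions the estimate is consistent with the behavior described above: a 16 GiB machine has little or no headroom for this allocation, so lowering the Job Number or moving to a larger machine are the practical options.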