From Array Suite Wiki
Sentieon TNSeq Pipeline
The Sentieon TNseq pipelines enable accurate calling of somatic variants in paired tumor-normal samples or in unpaired tumor samples. The figure below represents a typical pipeline for tumor-normal analysis, as recommended by the Broad Institute (https://software.broadinstitute.org/gatk/documentation/presentations.php?id=6007).
The following pipeline can be used to process tumor samples independently:
This wiki describes how to use the Sentieon TNseq pipeline integration in Omicsoft's ArrayStudio, which requires a Sentieon license. Please contact email@example.com if you would like more information about obtaining a license to use the Sentieon pipelines.
After creating a project on the server, users can access the module by selecting Add NGS Data | Add From Pipeline | Sentieon TNseq/TNscope Pipeline (Beta):
Input format: Accepted input files include FASTQ (GZIP compression is supported) and BAM/CRAM.
Add files to menu
- Add button will add samples by selection
- Search will bring up a popup menu to search Samples/Sample Sets registered on the server
- Add List will allow users to add files from a list (even add a grouping file for alignment functions).
- Remove will clear out selected files that have been added to the input files. You can select the samples with your mouse and choose to remove them.
- Clear will remove all files from the input file list. You do not need to select individual files.
In the General tab, there are a number of additional options to select:
- Choose a Genome Reference. Choices are Human_B37, Human_B38 or custom. OmicSoft references are based on the GATK bundle. If users choose to supply their own FASTA file, it should be supplied in the Custom Reference section.
- To run the pipeline on paired Tumor vs Normal samples, define the Normal group based on information in the grouping file.
- The user can choose whether this is a paired end sequencing analysis, and if so, the reads will automatically be paired using a numbering logic (e.g. _1, _2 or .1, .2).
- Replace Existing BAM files: When this option is unchecked (default), the alignment step checks the output folder and skips alignment for samples that already have alignment output there. This allows users to re-run the alignment step on the whole batch without re-aligning samples that already have BAMs. If this option is checked, the module will re-run alignment for all samples and overwrite old alignment files in the output folder.
- Number of threads for each alignment and number of jobs/samples running in parallel. Thread Number is set to 4 by default. Job Number is set to 1 by default.
- Identify a location for the Output Folder to place the QC metrics reports/tables, BAM files, VCF files, etc.
- (Optional) Please see this wiki page for an example of the Grouping File http://www.arrayserver.com/wiki/index.php?title=Grouping_File.
- (Optional) Users can supply a custom reference FASTA file if selecting Others as the Reference Genome.
Note that if you do not have access to a normal (non-tumor) sample matched to the tumor sample, the Panel of Normals VCF and the COSMIC VCF inputs are highly recommended.
- (Optional) Panel of Normals: VCF file with common errors that appear as variants from multiple unrelated normal samples. The contents of this file will be used to identify variants that are more likely to be germline variants, and filter them as such.
- (Optional) Cosmic: VCF file containing data from the Catalogue of Somatic Mutations in Cancer (COSMIC), representing a list of known tumor-related variants. The contents of this file will be used to reduce the germline risk factor of the variants. You need to use the same COSMIC file as the one used to generate the Panel of Normals VCF.
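The automatic pairing of paired-end reads by numbering (e.g. _1, _2 or .1, .2) can be pictured with a minimal Python sketch. The regular expression, the accepted extensions and the function name `pair_fastq_files` are assumptions made for illustration; the exact rules applied by ArrayStudio are not documented on this page.

```python
import re
from collections import defaultdict

# Assumed pattern: a read number ("_1"/"_2" or ".1"/".2") directly before
# the FASTQ extension, optionally followed by ".gz".
READ_SUFFIX = re.compile(r"^(?P<stem>.+)[._](?P<read>[12])\.(?:fastq|fq)(?:\.gz)?$")


def pair_fastq_files(file_names):
    """Group FASTQ file names into {sample_stem: (read1, read2)}.

    Names without a recognisable read number, or without a mate, are
    returned separately so the caller can decide how to handle them.
    """
    mates = defaultdict(dict)
    unpaired = []
    for name in file_names:
        match = READ_SUFFIX.match(name)
        if match is None:
            unpaired.append(name)
            continue
        mates[match.group("stem")][match.group("read")] = name

    pairs = {}
    for stem, reads in mates.items():
        if "1" in reads and "2" in reads:
            pairs[stem] = (reads["1"], reads["2"])
        else:
            unpaired.extend(reads.values())
    return pairs, sorted(unpaired)
```

Checking file names this way before adding them can help spot samples whose mate file is missing or named inconsistently.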
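The effect of the Replace Existing BAM files option from the General tab can be sketched as follows. This is only an illustration of the skip logic: the output naming scheme `<sample>.bam` and the function name `samples_to_align` are assumptions, not the names the module actually uses.

```python
from pathlib import Path


def samples_to_align(sample_names, output_folder, replace_existing=False):
    """Return the samples that still need to go through alignment.

    Assumption: the alignment output of each sample is "<sample>.bam" in
    the output folder. The real pipeline may name its files differently.
    """
    if replace_existing:
        # Option checked: every sample is re-aligned and old files overwritten.
        return list(sample_names)
    out_dir = Path(output_folder)
    # Option unchecked (default): skip samples that already have a BAM.
    return [s for s in sample_names if not (out_dir / f"{s}.bam").exists()]
```

With the default setting, a batch that was interrupted can be resubmitted unchanged and only the samples without an existing BAM are aligned again.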
Steps: Check the boxes to run each individual pipeline module:
- BWA Mapping
- Remove Duplicates
- Indel Realigner
- Base Recalibration
- BWA Mapping: A single command is run to efficiently perform the alignment using BWA-MEM and create sorted BAM files. By default, the option -M is applied to mark short split hits as secondary, for compatibility with the mark duplicates step. This option can be turned off by un-checking Mark short split hits as secondary.
- Metrics: Five statistical summaries of the alignment will be generated per BAM file. Users can skip any QC step by unchecking the corresponding option under Metrics: GC bias, Insert size, Alignment summary, Mean quality by cycle, and Quality distribution.
- Remove Duplicates: This step detects reads indicating that the same DNA molecule was sequenced several times. These duplicates are not informative, and it is recommended that they not be counted as additional evidence. This step requires a sorted BAM file and is accomplished in two steps. The first command collects read information, and the second command performs the deduping. The output is DEDUP_METRICS_TXT and DEDUP_BAM with its associated index file (.bai). Downstream analysis tools are duplicate aware; therefore, users can choose to select the option to Mark Duplicates only.
- Indel Realignment: This step performs a local realignment around indels. It is performed on the Tumor and Normal samples independently. The Realigner algorithm performs the indel realignment on the dedup_bam file. This algorithm accepts additional parameters --algo Realigner -k $known_sites --interval_list $regions to set known sites or to specify an interval. Please contact Omicsoft support for help on specifying these Additional Parameters in the pipeline.
- Base Recalibration: This step calculates the required modification of the quality scores assigned to individual read bases of the sequence read data and applies those to the BAM file. A RECAL_result.csv file will be output to the server. Optionally, the recalibrated BAM file and its corresponding index file can be output to the server. This output is optional (and is unchecked by default) because Sentieon variant callers can perform the recalibration on the fly using the pre-recalibration BAM plus the recalibration table. In fact, if users choose to rerun the variant calling, they should NOT use the recalibrated BAM together with the recalibration table, as that would apply the recalibration twice.
- Indel co-realignment: This step performs a local realignment around indels for the combined data of both tumor and normal samples.
- Variant Discovery: Input is a BAM file and output is a VCF file. Somatic variant calling can be performed on the tumor-normal matched pair or on the tumor and Panel of Normals data, using either the Genotyper algorithm (TNsnv) or the Haplotyper algorithm (TNhaplotyper). Defaults are set based on recommendations from Sentieon.
If users have an additional license for TNscope, it can be enabled by checking TNscope. This algorithm will then be used to perform the somatic variant calling on the tumor-normal matched pair or the tumor-only data, using a Haplotyper algorithm.
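For readers who want to relate the steps above to Sentieon command lines, the Python sketch below only builds argument lists; it does not run anything. The exact commands issued by ArrayStudio are not shown on this page, so the flags here follow Sentieon's public usage examples as best recalled, plus the Realigner parameters quoted above, and should be checked against the manual of the installed release. All function names and file names are hypothetical.

```python
import shlex


def dedup_commands(sorted_bam, score_file, metrics_file, dedup_bam,
                   threads=4, mark_only=False):
    """Return the two driver calls used for duplicate handling.

    Step 1 collects read information; step 2 removes duplicates, or only
    marks them when mark_only is True (the --rmdup flag is then omitted).
    """
    base = ["sentieon", "driver", "-t", str(threads), "-i", sorted_bam]
    collect = base + ["--algo", "LocusCollector", "--fun", "score_info", score_file]
    dedup = base + ["--algo", "Dedup"]
    if not mark_only:
        dedup.append("--rmdup")
    dedup += ["--score_info", score_file, "--metrics", metrics_file, dedup_bam]
    return [collect, dedup]


def realigner_command(reference, dedup_bam, realigned_bam,
                      known_sites=(), interval_list=None, threads=4):
    """Indel realignment with optional known sites (-k) and an interval."""
    cmd = ["sentieon", "driver", "-r", reference, "-t", str(threads),
           "-i", dedup_bam, "--algo", "Realigner"]
    for vcf in known_sites:
        cmd += ["-k", vcf]
    if interval_list:
        cmd += ["--interval_list", interval_list]
    return cmd + [realigned_bam]


def somatic_call_command(reference, tumor_bam, tumor_recal_table, tumor_sample,
                         output_vcf, normal=None, algo="TNhaplotyper",
                         dbsnp=None, pon=None, cosmic=None, threads=4):
    """Somatic calling with on-the-fly recalibration.

    Each pre-recalibration BAM (-i) is paired with its recalibration table
    (-q), so an already recalibrated BAM must not be passed here.
    normal is an optional (bam, recal_table, sample_name) tuple.
    """
    cmd = ["sentieon", "driver", "-r", reference, "-t", str(threads),
           "-i", tumor_bam, "-q", tumor_recal_table]
    if normal is not None:
        normal_bam, normal_recal_table, normal_sample = normal
        cmd += ["-i", normal_bam, "-q", normal_recal_table]
    cmd += ["--algo", algo, "--tumor_sample", tumor_sample]
    if normal is not None:
        cmd += ["--normal_sample", normal_sample]
    for flag, value in (("--dbsnp", dbsnp), ("--pon", pon), ("--cosmic", cosmic)):
        if value:
            cmd += [flag, value]
    return cmd + [output_vcf]


if __name__ == "__main__":
    # Hypothetical file names, printed for inspection only.
    for cmd in dedup_commands("T.sorted.bam", "T.score.txt",
                              "T.dedup_metrics.txt", "T.dedup.bam"):
        print(shlex.join(cmd))
    print(shlex.join(somatic_call_command(
        "ref.fasta", "T.realigned.bam", "T.recal.table", "TUMOR", "out.vcf.gz",
        normal=("N.realigned.bam", "N.recal.table", "NORMAL"))))
```

Building the commands as lists rather than strings keeps the optional pieces (mark-only dedup, known sites, Panel of Normals, COSMIC) easy to toggle, which mirrors the check boxes in the pipeline dialog.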
- Select the option to input a Custom dbSNP file to label known variants.
- Users can designate the location of an optional interval file. BED and Picard style formats are supported.
- Select the output file format from BWA
- Use Centralized Cloud License: If your organization does not have an independent Sentieon License and has obtained a Centralized Cloud License from Omicsoft, please select this option.
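Regarding the interval file option above, BED and Picard style files describe the same regions with different coordinate conventions: BED uses a 0-based start with an exclusive end, while Picard style intervals use a 1-based start with an inclusive end. The sketch below converts BED records to Picard style records to make the difference concrete. The function name is hypothetical, and the "@" sequence header that a complete Picard interval list carries is not generated here.

```python
def bed_to_interval_lines(bed_lines):
    """Convert BED records to Picard style interval records.

    Output columns: chromosome, start (1-based), end (inclusive), strand, name.
    Missing strand defaults to "+", missing name to "." (both assumptions).
    """
    records = []
    for line in bed_lines:
        line = line.strip()
        # Skip blank lines, comments and BED track/browser lines.
        if not line or line.startswith(("#", "track", "browser")):
            continue
        fields = line.split("\t")
        chrom, start, end = fields[0], int(fields[1]), int(fields[2])
        name = fields[3] if len(fields) > 3 else "."
        strand = fields[5] if len(fields) > 5 and fields[5] in ("+", "-") else "+"
        # Only the start shifts by one; the exclusive BED end already equals
        # the inclusive 1-based end.
        records.append(f"{chrom}\t{start + 1}\t{end}\t{strand}\t{name}")
    return records
```

For example, the BED record `chr1 99 200` and the Picard style record `chr1 100 200` cover the same bases, so mixing up the two formats shifts every interval by one position.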