Sentieon QC Metrics

From Array Suite Wiki

Introduction

By default, the Sentieon DNASeq Pipeline runs five QC algorithms in a single driver command:

driver -t 32 -r $ref -i sorted.bam \
  --algo AlignmentStat --adapter_seq '' aln_metrics.txt \
  --algo GCBias --summary gc_summary.txt gc_metrics.txt \
  --algo MeanQualityByCycle mq_metrics.txt \
  --algo QualDistribution qd_metrics.txt \
  --algo InsertSizeMetricAlgo isize_metrics.txt


The user can choose to opt out of these QC metrics by unchecking the Metrics Step in the General Tab.

SentieonQCMetricsOption.png

Options under the Customize Steps Tabs I-II and the Advanced Tabs allow the user to choose which Metrics Reports to output.

The Sentieon command is run on each sample. Omicsoft then concatenates the per-sample metric reports into summary reports, which are appended to the project. In the GUI, a Sentieon QC Metrics folder is generated under Tables and populated with the associated charts and reports. In addition, these reports are written to a Metrics folder in the Output folder location specified by the user.
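The concatenation step can be pictured with a short Python sketch. This is only an illustration and not Omicsoft's implementation: the function name `concatenate_reports`, the added SAMPLE column, and the assumption that each report is a tab-delimited table with optional `#` comment lines are all hypothetical.

```python
import csv
import io


def concatenate_reports(reports):
    """Merge per-sample metric tables into one summary table.

    reports: dict mapping a sample name to the text of its metrics file.
    Assumption (hypothetical): each file is tab-delimited, may contain
    blank lines and '#' comment lines, and starts with a header row.
    Returns a list of rows; the first row is the header with a SAMPLE
    column prepended.
    """
    summary = []
    header = None
    for sample, text in reports.items():
        rows = [
            row
            for row in csv.reader(io.StringIO(text), delimiter="\t")
            if row and not row[0].startswith("#")
        ]
        if not rows:
            continue  # nothing usable in this report
        if header is None:
            header = ["SAMPLE"] + rows[0]
            summary.append(header)
        for row in rows[1:]:
            summary.append([sample] + row)
    return summary
```

Calling it with two small in-memory reports returns a single table whose data rows are labeled by sample, which is the general shape of a summary report.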

SentieonQCMetricsOptionCustomizeStepsI.png

Output

QCMetricsReports.png

Alignment Report

This report provides a summary of alignment metrics from a BAM file, including high level metrics about the quality of read alignments as well as the proportion of reads that passed Illumina's chastity filters (PF reads). Alignment metrics are stored in the aln_metrics.txt file.

Column Definitions for Alignment Metrics produced are:

CATEGORY: One of UNPAIRED (for a fragment run), FIRST_OF_PAIR (metrics for only the first read in a paired run), SECOND_OF_PAIR (metrics for only the second read in a paired run), or PAIR (metrics aggregated for both first and second reads in a pair).

TOTAL_READS: The total number of reads, including all PF and non-PF reads. When CATEGORY equals PAIR, this value will be 2x the number of clusters.

PF_READS: The number of PF reads, where PF is defined as passing Illumina's filter.

PCT_PF_READS: The percentage of reads that are PF (PF_READS / TOTAL_READS).

PCT_PF_READS_ALIGNED: The percentage of PF reads that aligned to the reference sequence. (PF_READS_ALIGNED/PF_READS)

PF_ALIGNED_BASES: The total number of aligned bases, in all mapped PF reads, that are aligned to the reference sequence.

PF_HQ_ALIGNED_READS: The number of PF reads that were aligned to the reference sequence with a mapping quality of Q20 or higher, signifying that the aligner estimates a 1/100 (or smaller) chance that the alignment is wrong.

PF_HQ_ALIGNED_BASES: The number of bases aligned to the reference sequence in reads that were mapped at high quality. Will usually approximate PF_HQ_ALIGNED_READS * READ_LENGTH but may differ when either mixed read lengths are present or many reads are aligned with gaps.

PF_HQ_ALIGNED_Q20_BASES: The subset of PF_HQ_ALIGNED_BASES where the base call quality was Q20 or higher.

PF_HQ_MEDIAN_MISMATCHES: The median number of mismatches versus the reference sequence in reads that were aligned to the reference at high quality (i.e. PF_HQ_ALIGNED_READS).

PF_MISMATCH_RATE: The rate of bases mismatching the reference for all bases aligned to the reference sequence.

PF_HQ_ERROR_RATE: The percentage of bases that mismatch the reference in PF HQ aligned reads.

PF_INDEL_RATE: The number of insertion and deletion events per 100 aligned bases. Uses the number of events as the numerator, not the number of inserted or deleted bases.

MEAN_READ_LENGTH: The mean read length of the set of reads examined. When looking at the data for a single lane with equal length reads this number is just the read length. When looking at data for merged lanes with differing read lengths this is the mean read length of all reads.

READS_ALIGNED_IN_PAIRS: The number of aligned reads whose mate pair was also aligned to the reference.

PCT_READS_ALIGNED_IN_PAIRS: The percentage of reads whose mate pair was also aligned to the reference (READS_ALIGNED_IN_PAIRS / PF_READS_ALIGNED).

BAD_CYCLES: The number of instrument cycles in which 80% or more of base calls were no-calls.

STRAND_BALANCE: The number of PF reads aligned to the positive strand of the genome divided by the number of PF reads aligned to the genome.

PCT_CHIMERAS: The percentage of reads that map outside of a maximum insert size (usually 100kb) or that have the two ends mapping to different chromosomes.

PCT_ADAPTER: The percentage of PF reads that are unaligned and match to a known adapter sequence right from the start of the read.
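The ratio columns above, and the Q20 threshold used for the high-quality (HQ) metrics, can be checked by hand. The following Python sketch is an illustration only: the function names and the example counts are hypothetical, and the results are expressed as fractions (multiply by 100 for a percentage).

```python
def phred_to_error_probability(q):
    """Convert a Phred-scaled quality Q to an error probability, 10^(-Q/10)."""
    return 10 ** (-q / 10)


def derived_ratios(counts):
    """Recompute the ratio columns from the count columns of one row.

    counts: dict with TOTAL_READS, PF_READS, PF_READS_ALIGNED and
    READS_ALIGNED_IN_PAIRS. A zero denominator yields 0.0 (an assumption
    made here to keep the sketch simple).
    """
    def ratio(numerator, denominator):
        return numerator / denominator if denominator else 0.0

    return {
        "PCT_PF_READS": ratio(counts["PF_READS"], counts["TOTAL_READS"]),
        "PCT_PF_READS_ALIGNED": ratio(
            counts["PF_READS_ALIGNED"], counts["PF_READS"]
        ),
        "PCT_READS_ALIGNED_IN_PAIRS": ratio(
            counts["READS_ALIGNED_IN_PAIRS"], counts["PF_READS_ALIGNED"]
        ),
    }


# Hypothetical counts for a single PAIR row.
example = {
    "TOTAL_READS": 1000,
    "PF_READS": 900,
    "PF_READS_ALIGNED": 810,
    "READS_ALIGNED_IN_PAIRS": 729,
}
ratios = derived_ratios(example)
```

With these made-up counts each of the three ratios comes out at about 0.9, and `phred_to_error_probability(20)` gives about 0.01, matching the 1/100 figure quoted for Q20.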

GC Bias

The Mean Base Quality, Normalized Coverage, and Windows charts are fully interactive in the GUI. For example, you can highlight bins on the Mean Base Quality chart and the associated table will appear below the chart.

GCBiasMeanQuality.png


Column Definitions for the GC Bias Table

GC: The G+C content of the reference sequence for each observation, binned from 0% to 100%.

WINDOWS: The number of windows on the reference genome that have this G+C content.

READ_STARTS: The number of reads whose start position is at the start of a window of this GC.

MEAN_BASE_QUALITY: The mean quality (determined via the error rate) of all bases of all reads that are assigned to windows of this GC.

NORMALIZED_COVERAGE: The ratio of "coverage" in this GC bin vs. the mean coverage of all GC bins. A value of 1 represents mean coverage, a value less than one represents lower than mean coverage (e.g. 0.5 means half as much coverage as average), and a value greater than one represents higher than mean coverage (e.g. 3.1 means this GC bin has 3.1 times more reads per window than average).

ERROR_BAR_WIDTH: The radius of error bars in this bin based on the number of observations made. For example if the normalized coverage is 0.75 and the error bar width is 0.1 then the error bars would be drawn from 0.65 to 0.85.
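To make the NORMALIZED_COVERAGE and ERROR_BAR_WIDTH definitions concrete, here is a minimal Python sketch. It assumes that "coverage" in a bin means READ_STARTS divided by WINDOWS and that the mean is taken as total read starts over total windows; the page does not spell this out, so treat both as assumptions. The function names are hypothetical.

```python
def normalized_coverage(read_starts, windows):
    """Return the normalized coverage of each GC bin.

    read_starts[i] and windows[i] are the READ_STARTS and WINDOWS values
    of bin i. Assumption: coverage is reads per window, and the mean is
    total reads divided by total windows. Bins with no windows get 0.0.
    """
    mean_reads_per_window = sum(read_starts) / sum(windows)
    return [
        (reads / count) / mean_reads_per_window if count else 0.0
        for reads, count in zip(read_starts, windows)
    ]


def error_bar_range(coverage, width):
    """Return the (lower, upper) ends of the error bar around a coverage value."""
    return (coverage - width, coverage + width)
```

For two bins with 10 windows each and 10 and 30 read starts, the mean is 2 reads per window, so the bins normalize to 0.5 and 1.5. `error_bar_range(0.75, 0.1)` reproduces the 0.65 to 0.85 range from the example above, up to floating-point rounding.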

Insert Size Distribution

Users can identify the mean insert size for paired-end experiments from the charts provided in the GUI. There is one chart for each experiment pair. Users can change the drop-down option from 1*1 to, for example, 2*1 to display 2 charts at once. Users can also choose to adjust the scale to be the same for all charts using the Toggle Uniform Scale Status button (see arrow). The data shown in the charts is summarized in the Insert Size Distribution Table. For specific information on how insert size distribution is calculated, please see the BWA manual http://bio-bwa.sourceforge.net/bwa.shtml.

InsertSizeDistribution.png
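The mean insert size read off a chart can also be computed from the underlying table. The sketch below assumes the distribution is available as a mapping from insert size to read-pair count; that layout and the function name are hypothetical and do not describe the exact format of the Insert Size Distribution Table.

```python
def mean_insert_size(histogram):
    """Return the count-weighted mean of an insert size histogram.

    histogram: dict mapping insert size (bp) to the number of read pairs
    observed with that size (a hypothetical layout).
    """
    total_pairs = sum(histogram.values())
    if total_pairs == 0:
        raise ValueError("histogram contains no read pairs")
    weighted_sum = sum(size * count for size, count in histogram.items())
    return weighted_sum / total_pairs
```

For example, one pair at 100 bp and three pairs at 200 bp give a mean of 175 bp.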

Mean Quality By Cycle

The chart shows the mean quality by cycle across all reads that passed the PF threshold (including unaligned reads) for each BAM file. The table for the plot is written to the output file mq_metrics.txt.

MeanQualityByCycle.png
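The idea behind the mean quality by cycle table can be sketched in a few lines of Python. This is a simplified illustration, not the Sentieon algorithm: it assumes Phred+33 encoded quality strings, treats the string index as the cycle number, and ignores read orientation and PF filtering. The function name is hypothetical.

```python
def mean_quality_by_cycle(quality_strings, offset=33):
    """Average base quality at each cycle over a set of reads.

    quality_strings: iterable of quality strings, one per read.
    offset: ASCII offset of the encoding (33 assumed here).
    Reads may differ in length; each cycle is averaged over the reads
    that reach it. Returns a list indexed by cycle (0-based).
    """
    sums = []
    counts = []
    for qualities in quality_strings:
        for cycle, symbol in enumerate(qualities):
            if cycle == len(sums):
                # First read long enough to reach this cycle.
                sums.append(0)
                counts.append(0)
            sums[cycle] += ord(symbol) - offset
            counts[cycle] += 1
    return [total / count for total, count in zip(sums, counts)]
```

With the two reads `"II"` (Q40, Q40) and `"5"` (Q20), the first cycle averages to 30 and the second, seen in only one read, to 40.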