From Array Suite Wiki
Summarize Matched Pair Variation Data (VarScan2)
The Summarize Matched Pair Variation Data module uses the same algorithm as VarScan2 to identify variants, including somatic mutations and loss of heterozygosity (LOH). When given an NGS data object containing matched normal/tumor samples, this module will identify and report comparative tumor/normal information for each identified variant, including coverage, frequency, genotype, and a prediction of where the variant arose (e.g. "germline", "somatic", "LOH").
A one-sided Fisher exact test (the same as in VarScan2) is used for the somatic mutation test, and an exact binomial test is used for the variant mutation test.
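The two tests named above can be sketched in plain Python. This is only an illustration, not the module's code: it assumes the Fisher test is applied to a 2x2 table of reference and variant read counts in the normal and tumor samples, and that the binomial test compares the variant read count against an expected error rate. The function names and the `error_rate` parameter are hypothetical.

```python
from math import comb

def fisher_one_sided(normal_ref, normal_var, tumor_ref, tumor_var):
    """One-sided Fisher exact p-value: probability of seeing at least
    `tumor_var` variant reads in the tumor if variant reads were spread
    across tumor and normal by chance (hypergeometric upper tail)."""
    total_var = normal_var + tumor_var
    tumor_depth = tumor_ref + tumor_var
    total = normal_ref + normal_var + tumor_depth
    upper = min(total_var, tumor_depth)
    numerator = sum(
        comb(total_var, k) * comb(total - total_var, tumor_depth - k)
        for k in range(tumor_var, upper + 1)
    )
    return numerator / comb(total, tumor_depth)

def binomial_upper_tail(var_reads, depth, error_rate):
    """Exact binomial p-value: probability of at least `var_reads` variant
    reads out of `depth` reads when each read is wrong with `error_rate`
    (an assumed null hypothesis, not taken from the module)."""
    return sum(
        comb(depth, k) * error_rate**k * (1 - error_rate) ** (depth - k)
        for k in range(var_reads, depth + 1)
    )

# Normal: 20 reference reads, 0 variant reads. Tumor: 12 reference, 8 variant.
p_somatic = fisher_one_sided(20, 0, 12, 8)
p_variant = binomial_upper_tail(8, 20, 0.001)
print(f"somatic p-value: {p_somatic:.3g}, variant p-value: {p_variant:.3g}")
```

A small somatic p-value indicates that the variant reads are concentrated in the tumor sample more than chance would explain.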
To access this module, please go to Analysis | NGS | Variation | Summarize Matched Pair Variant Data (VarScan2)
Input Data Requirements
This module works on NGS data objects and expects that paired samples come from the same subject. The design table for the NGS data object should have columns that identify disease/normal pairs and indicate which sample within each pair is "normal".
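As an illustration of the design table layout described above, the sketch below pairs each tumor sample with the normal sample that shares its pair identifier. The column names (`Sample`, `Pair`, `TumorStatus`), the sample names, and the `build_pairs` helper are hypothetical; an actual design table may use different names.

```python
from collections import defaultdict

# Hypothetical design table: one row per sample.
design = [
    {"Sample": "S1_T", "Pair": "P1", "TumorStatus": "Tumor"},
    {"Sample": "S1_N", "Pair": "P1", "TumorStatus": "Normal"},
    {"Sample": "S2_T", "Pair": "P2", "TumorStatus": "Tumor"},
    {"Sample": "S2_N", "Pair": "P2", "TumorStatus": "Normal"},
]

def build_pairs(rows, pair_col, status_col, normal_value):
    """Return (pair_id, normal_sample, tumor_sample) tuples."""
    groups = defaultdict(lambda: {"normal": [], "tumor": []})
    for row in rows:
        role = "normal" if row[status_col] == normal_value else "tumor"
        groups[row[pair_col]][role].append(row["Sample"])
    pairs = []
    for pair_id, members in sorted(groups.items()):
        # Each pair needs exactly one normal sample to compare against.
        if len(members["normal"]) != 1:
            raise ValueError(f"pair {pair_id} needs exactly one normal sample")
        for tumor in members["tumor"]:
            pairs.append((pair_id, members["normal"][0], tumor))
    return pairs

pairs = build_pairs(design, "Pair", "TumorStatus", "Normal")
for pair_id, normal, tumor in pairs:
    print(pair_id, normal, tumor)
```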
- Project & Data: The window includes a dropdown box to select the Project and Data object to be filtered.
- References: Selections can be made on which references, such as chromosomes, should be included in the filtering (options include All references, Selected references, Visible references, and Customized references (select any pre-generated Lists)).
- Observations: Selections can be made on which observations should be included in the filtering (options include All observations, Selected observations, Visible observations, and Customized observations (select any pre-generated Lists)).
- Output name: The user can choose to name the output data object.
- Compare all samples to a common control: Whether to compare all samples to the same control, instead of comparing individual pairs of samples.
- Control: Specify the common control that all samples are compared to.
- Pair: The Design table column defining sample pairs
- Tumor status: The Design table column defining tumor/normal status of each sample
- Normal: The value in the column specified under Tumor status that indicates a normal sample
- Job number - The number of jobs to run in parallel.
- Exclude mutation if maximal frequency is less than - Exclude a mutation call if the maximal mutation frequency is less than a threshold in all the samples.
- Annotate by DBSNP database - Specify whether to annotate the variants by dbSNPCompact.
- Generate tableland (support up to 100M rows) - A Tableland is almost like a table, but is always in-file (not in-memory) and read-only. It is not only memory efficient, but also fast for a lot of actions such as loading and filtering.
- Generate merged VCF report for all samples - Whether to generate merged VCF report for all samples.
- Array Studio is moving toward VCF-centered analysis, in place of the older Array Studio Mutation Reports.
- Generate individual VCF report for each sample: Whether to generate individual VCF report for each sample.
- Output folder: An output folder can be specified for the reports.
- Base quality cutoff - Will not count any mutation where the base quality is below the specified value for that mutation.
- Map quality cutoff - Will not count any mutation where the mapping quality is below the specified value for that mutation (only applies to BWA mapped reads).
- Minimal indel size - Will only count an indel if its size is greater than or equal to the specified value.
- Left exclusion - Exclude a defined number of base pairs from the left end of the sequence. Note that "left" is always relative to the forward strand.
- Right exclusion - Exclude a defined number of base pairs from the right end of the sequence. Note that "right" is always relative to the forward strand.
- Exclude singletons (paired end required) - Will not count reads where both reads of the pair did not map to the same region.
- Exclude multi-reads (ZC tag required) - Multi reads are considered non-unique (i.e. reads that align to multiple genomic locations with equal or similar numbers of mismatches). Selecting this option will include unique reads only when performing the SNP summarization.
- Exclude duplicate alignments: Only one read in a set of duplicated aligned reads will be counted. Duplicates are identified by the duplication flag.
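The quality cutoffs and end exclusions above can be pictured as a per-base check. The following is a rough sketch under stated assumptions, not the module's implementation: `offset` is taken to be the 0-based position of the base within the read in forward-strand orientation, and the function name and default values are hypothetical.

```python
def base_is_counted(offset, read_length, base_quality, map_quality,
                    base_quality_cutoff=20, map_quality_cutoff=20,
                    left_exclusion=0, right_exclusion=0):
    """Return True if a base passes the cutoffs and lies outside the
    excluded regions at the left and right ends of the read."""
    if base_quality < base_quality_cutoff:
        return False  # base quality below the cutoff
    if map_quality < map_quality_cutoff:
        return False  # mapping quality below the cutoff
    if offset < left_exclusion:
        return False  # inside the excluded left end
    if offset >= read_length - right_exclusion:
        return False  # inside the excluded right end
    return True

# A high-quality base in the middle of a 100 bp read is counted,
# while a base within the last 5 positions is not when right_exclusion=5.
print(base_is_counted(50, 100, base_quality=30, map_quality=60))
print(base_is_counted(97, 100, base_quality=30, map_quality=60, right_exclusion=5))
```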
Variation options can be set for the following parameters:
- Minimal normal hit - Will only count a mutation if the total number of hits for that mutation in normal samples is greater than or equal to the specified value.
- Minimal tumor hit - Will only count a mutation if the total number of hits for that mutation in tumor samples is greater than or equal to the specified value.
- Minimal mutation hit - Will only count a mutation if the number of primary mutation hits at that mutation point is greater than or equal to the specified value.
- Minimal heterozygosity frequency - Will only count as heterozygous if the ratio of mutation hits to total hits is no less than the specified number.
- Minimal homozygosity frequency - Will only count as homozygous if the ratio of mutation hits to total hits is no less than the specified number.
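One way to read the hit and frequency thresholds above is as a simple genotype rule. The sketch below is an assumption about how such thresholds combine, not the module's code; the function name, the returned labels, and the example threshold values are hypothetical.

```python
def call_genotype(mutation_hits, total_hits, *,
                  min_mutation_hit, min_het_freq, min_hom_freq):
    """Classify a position from its mutation hit count and total hit count."""
    if total_hits == 0 or mutation_hits < min_mutation_hit:
        return "reference"  # too few mutation hits to count the mutation
    frequency = mutation_hits / total_hits
    if frequency >= min_hom_freq:
        return "homozygous"
    if frequency >= min_het_freq:
        return "heterozygous"
    return "reference"

# Hypothetical thresholds chosen only for this example.
settings = dict(min_mutation_hit=2, min_het_freq=0.08, min_hom_freq=0.75)
print(call_genotype(3, 30, **settings))   # frequency 0.10
print(call_genotype(28, 30, **settings))  # frequency 0.93
```

The homozygosity threshold is checked first so that a frequency satisfying both thresholds is reported as homozygous.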
Filtering options can be set for the following parameters, and the filtering status will be output in the result table:
- If average mapping quality difference greater than - If the difference in average mapping quality between reference-supporting reads and variant-supporting reads is greater than the specified number, the variant will be marked as "MappingQualityDifference".
- If average read length difference greater than - If the difference in average read length between reference-supporting reads and variant-supporting reads is greater than the specified number, the variant will be marked as "ReadLengthDifference".
- If MMQS difference greater than - If the difference in the quality sum of mismatches (MMQS) between reference-supporting reads and variant-supporting reads is greater than the specified number, the variant will be marked as "MmqsDifference".
- If strandness greater than - If the fraction of variant reads from each strand is greater than the specified number, the variant will be marked as "Strandness".
- If homopolymer greater than - If the number of bases in a flanking homopolymer matching one allele is greater than the specified number, the variant will be marked as "Homopolymer".
- If frequency difference less than (somatic only) - If the difference in variant allele frequency between tumor and normal samples is less than the specified number, the variant will be marked as "FrequencyDifference".
- If average relative variant position less than - If the average variant position in supporting reads, relative to read length, is less than the specified number, the variant will be marked.
- If significance p-value less than - If the p-value is larger than the specified number, the variant will be marked as "NonSignificance".
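The filtering options above can be sketched as a set of threshold rules that each add a label to the filtering status. This is a minimal sketch, not the module's implementation: the metric keys, the example threshold values, the ";" separator, and the "PASS" value for an unflagged variant are all assumptions. The relative variant position rule is left out because its label is not named above.

```python
# (metric key, label) pairs flagged when the metric exceeds its threshold.
GREATER_THAN_RULES = [
    ("mapq_diff", "MappingQualityDifference"),
    ("read_length_diff", "ReadLengthDifference"),
    ("mmqs_diff", "MmqsDifference"),
    ("strandness", "Strandness"),
    ("homopolymer", "Homopolymer"),
    ("p_value", "NonSignificance"),
]
# (metric key, label) pairs flagged when the metric is below its threshold.
LESS_THAN_RULES = [
    ("frequency_diff", "FrequencyDifference"),
]

def filtering_status(metrics, thresholds):
    """Collect the labels of every rule that a variant triggers."""
    flags = [label for key, label in GREATER_THAN_RULES
             if metrics[key] > thresholds[key]]
    flags += [label for key, label in LESS_THAN_RULES
              if metrics[key] < thresholds[key]]
    return ";".join(flags) if flags else "PASS"

# Hypothetical thresholds and per-variant metrics for illustration only.
thresholds = {"mapq_diff": 30, "read_length_diff": 25, "mmqs_diff": 100,
              "strandness": 0.99, "homopolymer": 5, "p_value": 0.05,
              "frequency_diff": 0.05}
variant = {"mapq_diff": 35, "read_length_diff": 3, "mmqs_diff": 40,
           "strandness": 0.6, "homopolymer": 2, "p_value": 0.001,
           "frequency_diff": 0.02}
print(filtering_status(variant, thresholds))
```

Keeping the rules in tables makes it easy to see which options flag values above a threshold and which flag values below it.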
For each variant, the resulting report contains "Normal coverage/frequency", "Tumor coverage/frequency", "Somatic/variant p-value", "Call", "NormalGenotype", "TumorGenotype" and "FilteringStatus", to allow the user to filter data to identify somatic mutations of interest.
The user can sort and filter this report to identify somatic variation between matched-pair samples.
For more details, see SomaticMutation Report.