From Array Suite Wiki


Begin BuildOsp2FromTxt /Namespace=Genetics;
Options  /PanelID=Illumina.MyCustomArray-v1-4_A1.v2

This procedure can be used to create the .osp2 (OmicSoft Panel) file needed for using a custom genotyping array in NormalizeSnpStrand & PublishToGxl.

WARNING: This procedure is currently limited to Human.B37.3 for ReferenceLibraryID.




File should be the path to a single file containing the details of the probe content of the array, in the format described under "Text form of OSP" below.

Text form of OSP

The input is a tab-delimited text table of the records in the .osp2 file. The table has the following nine columns and should contain one record per probe after the single column-name header line:

  1. ProbeID - this should be the value which will be observed in the genotyping datasets which will be processed with the .osp2 file (i.e. 2nd column of bim file). For example, Illumina has both an IlmnID field and a Name field in their array annotations but almost all datasets exported from their GenomeStudio software use the Name value as the probe identifier.
  2. Chromosome - the contig from the ReferenceLibraryID where the variant assayed by the probe is located or 0 for probes which couldn't be confidently mapped or have known quality issues (see details of this special case below)
    • When processing genotypes with the .osp2 file, probes on chromosome 0 will be retained but the 0 value will act as a flag of a known issue
      • Suggest using the Comment field (column 9) to describe the issue
    • Since REF must match ReferenceLibraryID, variants in the PAR (ReferenceLibraryID should have the PAR region of Y masked with Ns) should be reported at their X Chromosome coordinate (Position). The BuildOsp2FromTxt procedure has logic to identify PAR regions and process variants accordingly, so it does not require special coding of the Chromosome (e.g. 25 or XY).
  3. Position - the 1-based location on the Chromosome of the variant assayed by the probe per VCF specifications (e.g. location of the anchor base before an indel variant)
    • May be 0 for probes which couldn't be mapped
    • It is not necessary to set this value to 0 for probes with known quality issues: the 0 value for the Chromosome will be recognized as the flag and any non-zero value here will be ignored. Therefore, you can list the mapping information for such probes here (and in the REF & ALT columns) and report the mapped Chromosome in the Comment field.
  4. REF - allele as observed in ReferenceLibraryID at Chromosome:Position per the VCF specification (i.e. sense strand of ReferenceLibraryID, anchor base for indels at Position, etc)
    • For unmapped probes on Chromosome 0, REF & ALT are irrelevant and can be left blank
    • Although VCF allows N as unknown base (in addition to A, C, G, T), it is not allowed here (put any such ambiguous variants on Chromosome 0 per details below)
    • If the probe assays 2 ALT alleles at a multi-allelic locus (instead of a simple REF/ALT variant), still list the REF allele for context
  5. ALT - The non-REF allele per the VCF specification
    • If the probe assays 2 ALT alleles at a multi-allelic locus (instead of a simple REF/ALT variant), list both delimited by a comma per the VCF specification. Do not list any additional ALT alleles, as the presence of 2 values is the signifier that these are the specific alleles being assayed by the probe.
    • Like REF, using N as unknown base for ambiguous alleles is not allowed. Special values like * or <NamedALT> are also not allowed due to the nature of genotyping arrays.
  6. Strand - manufacturer-specific formatted string indicating relative strand orientations - see details below
  7. Allele1 - same as REF except when probe is on Chromosome 0 (see details of this special case below)
  8. Allele2 - same as ALT except when the probe assays 2 ALT alleles where the comma delimiter in ALT should be replaced by a /
  9. Comment - optional free-text for documentation (will be ignored by the BuildOsp2FromTxt procedure). Even if choosing not to populate this column, it should be present (i.e. there should be a tab after the value in the Allele2 column)
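
The column rules above can be checked before running the procedure. The following is only a minimal sketch, not part of the Suite: the function name check_osp_text and the sample rows (probeA to probeE, including their Strand values) are hypothetical, and it only tests the rules stated in this section (nine tab-delimited columns, A/C/G/T-only alleles, at most 2 ALT alleles, and the Allele1/Allele2 relationship for probes on a non-0 Chromosome). It does not confirm that REF actually matches ReferenceLibraryID.

```python
import csv
import io
import re

BASES = re.compile(r"^[ACGT]+$")  # N, * and <NamedALT> are not allowed


def is_int(text):
    """True for a plain non-negative ASCII integer such as '0' or '1234'."""
    return text.isascii() and text.isdigit()


def check_osp_text(handle):
    """Return a list of (line_number, message) problems found in the table."""
    problems = []
    reader = csv.reader(handle, delimiter="\t", quoting=csv.QUOTE_NONE)
    header = next(reader, None)
    if header is None or len(header) != 9:
        problems.append((1, "header should have 9 tab-delimited columns"))
    for line_no, row in enumerate(reader, start=2):
        if len(row) != 9:  # e.g. missing tab after Allele2 for an empty Comment
            problems.append((line_no, f"expected 9 columns, found {len(row)}"))
            continue
        probe_id, chrom, pos, ref, alt, _strand, allele1, allele2, _comment = row
        if not probe_id:
            problems.append((line_no, "ProbeID is empty"))
        if not is_int(pos):
            problems.append((line_no, "Position is not a non-negative integer"))
        if chrom == "0":
            continue  # REF/ALT are ignored for probes on Chromosome 0
        if is_int(pos) and int(pos) < 1:
            problems.append((line_no, "Position must be 1-based for mapped probes"))
        if not BASES.match(ref):
            problems.append((line_no, "REF may only contain A, C, G, T"))
        alts = alt.split(",")
        if len(alts) > 2 or not all(BASES.match(a) for a in alts):
            problems.append((line_no, "ALT must be 1 or 2 alleles of A, C, G, T"))
        if allele1 != ref:
            problems.append((line_no, "Allele1 should equal REF"))
        if allele2 != "/".join(alts):
            problems.append((line_no, "Allele2 should equal ALT with , replaced by /"))
    return problems


# Hypothetical rows for illustration only
SAMPLE = (
    "ProbeID\tChromosome\tPosition\tREF\tALT\tStrand\tAllele1\tAllele2\tComment\n"
    "probeA\t1\t1000\tA\tG\tI:?++\tA\tG\t\n"
    "probeB\t2\t2000\tG\tT,C\tI:?+-\tG\tT/C\ttwo ALT alleles\n"
    "probeC\t0\t0\t\t\tI:?++\tA\tC\tunmapped\n"
    "probeD\t3\t3000\tN\tG\tI:?++\tN\tG\t\n"
    "probeE\t3\t4000\tA\tG\tI:?++\tA\tG\n"
)

if __name__ == "__main__":
    for line_no, message in check_osp_text(io.StringIO(SAMPLE)):
        print(line_no, message)
```

In this sample, probeD is flagged for using N as REF and probeE for the missing tab after Allele2.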

WARNING: Strongly recommend using BCFtools norm to both validate and left-normalize indel & MNV alleles. This left-normalization may alter Chromosome, Position, REF, and/or ALT (and, therefore, Allele1 and/or Allele2).

Formatting records for probes on Chromosome 0

In an effort to retain a full data trail, NormalizeSnpStrand is capable of processing problematic probes by reporting them on Chromosome 0. This is particularly helpful for ensuring the correct .osp2 file was selected for processing a particular dataset as NormalizeSnpStrand will report the probe-content overlap/intersection between the dataset and the .osp2 file as an indicator of whether the correct .osp2 file was selected. If problematic probes are excluded from the .osp2 file, such reporting will be less meaningful as datasets are likely to contain some/all of these probes and therefore NormalizeSnpStrand will report finding "extra" probes in the dataset which are not present in the .osp2 file suggesting the wrong .osp2 file was selected.

There are generally 2 reasons a probe should be reported on Chromosome 0:

  1. The probe does not map with sufficient quality to any location in ReferenceLibraryID (perhaps it was designed against a different genome version) or maps ambiguously to multiple locations
  2. The alleles the probe is reporting are potentially inaccurate
    • This is common with Illumina's Infinium chemistry at multi-allelic loci if the probe is not sufficiently allele specific. For example, take this probe from the InfiniumQCArray-24v1-0_A3 array:

      where the key fields are:

      SNP AlleleA_ProbeSeq Chr MapInfo RefStrand

      Note, the REF base at 15:74709718 in Human.B37.3 is G and this probe reports the 2 ALT alleles T,C. Specifically, the probe's 3' end aligns to the non-polymorphic base before this multi-allelic SNV (74709717) and when the reaction occurs to anneal the polymorphic base at 74709718, the A and T nucleotides will both be reported with the green fluor and the C and G nucleotides will both be reported with the red fluor which gives the following results:

      DNA's Genotype   Fluor(s) reported   Reported Genotype   Validity
      C/C              red                 C/C                 correct
      C/T              red & green         C/T                 correct
      T/T              green               T/T                 correct
      G/T              red & green         C/T                 incorrect
      G/C              red                 C/C                 incorrect
      G/G              red                 C/C                 incorrect
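
The genotype table for this probe follows mechanically from the two-fluor description. As a rough illustration (a sketch under the assumptions stated in the text, not Illumina's calling algorithm), the hypothetical function reported_genotype below maps each nucleotide to its fluor (A and T green, C and G red) and then translates each fluor back to the allele the probe is annotated to report (T for green, C for red):

```python
# Fluor colour per nucleotide, as described for this probe
FLUOR = {"A": "green", "T": "green", "C": "red", "G": "red"}


def reported_genotype(true_genotype, red_call="C", green_call="T"):
    """Genotype the probe would report for a true genotype such as 'G/T'."""
    calls = sorted(
        {red_call if FLUOR[base] == "red" else green_call
         for base in true_genotype.split("/")}
    )
    if len(calls) == 1:  # a single fluor is read as a homozygous call
        calls = calls * 2
    return "/".join(calls)


def is_correct(true_genotype):
    """True when the reported genotype matches the DNA's actual genotype."""
    return sorted(true_genotype.split("/")) == reported_genotype(true_genotype).split("/")


if __name__ == "__main__":
    for genotype in ["C/C", "C/T", "T/T", "G/T", "G/C", "G/G"]:
        print(genotype, reported_genotype(genotype), is_correct(genotype))
```

Any genotype containing the REF allele G is misreported because G shares the red fluor with C, which is why such a probe belongs on Chromosome 0.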

When reporting probes on Chromosome 0, follow these guidelines for setting the Allele1 and Allele2 values:

  • List one of the alleles from the design (probe) strand as Allele1 and the other as Allele2 (order does not matter)
    • Using the design (probe) strand ensures the alleles can be recovered in the future (e.g. if re-mapping to a newer version of the genome where the probe does align uniquely well)
    • Since these are on Chromosome 0, values relevant to the ReferenceLibraryID will be ignored (i.e. REF and ALT). So, even if the probe assays 2 ALT alleles, list one as Allele1 and the other as Allele2 (don't put both into Allele2 and REF into Allele1 as you would for a probe on a non-0 Chromosome)
  • List alleles in their "native" form, i.e. as you would expect to observe them in the genotyping datasets which will be processed with the .osp2 file (5th & 6th columns of bim file)
    • For example, leave Illumina's I/D coded indels alleles as I and D
    • For example, if Affymetrix reports a particular probe's indel alleles using the -/InsSeq style, Allele1 should be "-" and Allele2 InsSeq (or vice versa)
  • Do this consistently for all probes on Chromosome 0 (i.e. regardless of whether the probe is there because it is unmapped or because it has a known quality issue)

Strand Column

The format of the string value in the Strand column will vary depending on the array manufacturer:


Illumina's GenomeStudio allows exporting genotypes onto 5 possible strands:

  1. Plus (reference) - the reference sequence's sense strand (Illumina reports the genome version in the GenomeBuild field of their annotations)
  2. TOP - specified in the TopGenomicSeq field of Illumina's annotations
  3. Design (probe) - the strand the probe assays
  4. Forward - This will vary based on how Illumina obtained the variant. Most often, it is dbSNP's forward designation (which is not the same as the reference sequence's sense strand). For novel variants/custom probes, the flank sequence used to design the probe (SourceSeq column of Illumina's annotation file) will come from someplace other than dbSNP and that source can assign Forward/Reverse designation any way they wish (i.e. not guaranteed to be reference sequence's sense strand). Note, the sequence as listed in the SourceSeq column is not necessarily what has been designated as Forward, it is the un-manipulated sequence as supplied by the source regardless of whether the source designated it Forward or Reverse. The Forward strand orientation is embedded in the value in the IlmnID field of Illumina's annotations.
  5. AB - this isn't a strand per se but the generic alleleA/alleleB encoding of the fluorescent signals

Genotypes are most often exported on the Plus (reference) or TOP strand. Occasionally, genotypes are exported on the Design strand. Very rarely are genotypes exported in their AB form. Sometimes a user misinterprets Forward to mean sense strand (assuming forward and plus are synonymous) and exports to this strand - NormalizeSnpStrand does not handle datasets on Forward strand (see FlipFromForward file).

To account for genotypes on either the Plus (reference), TOP, or Design (probe) strand, the Strand column for Illumina arrays takes the form I:PTR where the "I:" prefix indicates Illumina and P, T, and R are relative strand orientations reported as "+" for same and "-" for complementary (e.g. I:+-+  or  I:++-):

  • P indicates your determination of the probe's orientation to the ReferenceLibraryID's sense strand
    • This provides an opportunity to override Illumina's indication in R if you are confident they are wrong
    • Value may be "?" to indicate undetermined in which case R will be used instead
  • T indicates the orientation of the probe relative to Illumina's TOP sequence as indicated in Illumina's annotations (IlmnStrand field)
  • R indicates the orientation of the probe to the ReferenceLibraryID's sense strand as indicated in Illumina's annotations (RefStrand field)
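
To make the I:PTR layout concrete, here is a small parsing sketch. The names IlluminaStrand and parse_illumina_strand are hypothetical, and this is not how NormalizeSnpStrand itself reads the value. It assumes that only P may be "?" (as stated above) and derives the TOP-versus-reference orientation by combining T with the effective probe orientation, which is an inference from the definitions rather than documented behavior.

```python
import re
from dataclasses import dataclass

# "I:" prefix, then P (+, - or ?), then T and R (+ or -)
ILLUMINA_STRAND = re.compile(r"^I:([+\-?])([+\-])([+\-])$")


@dataclass(frozen=True)
class IlluminaStrand:
    p: str  # your determination of probe vs reference sense strand, or "?"
    t: str  # probe vs Illumina's TOP sequence (IlmnStrand field)
    r: str  # probe vs reference sense strand per Illumina (RefStrand field)

    def probe_vs_reference(self):
        """P overrides R unless P is undetermined ('?')."""
        return self.r if self.p == "?" else self.p

    def top_vs_reference(self):
        """'+' when TOP and the reference sense strand agree, else '-'."""
        return "+" if self.t == self.probe_vs_reference() else "-"


def parse_illumina_strand(value):
    match = ILLUMINA_STRAND.match(value)
    if match is None:
        raise ValueError(f"not an Illumina Strand value: {value!r}")
    return IlluminaStrand(*match.groups())


if __name__ == "__main__":
    for text in ["I:+-+", "I:++-", "I:?+-"]:
        strand = parse_illumina_strand(text)
        print(text, strand.probe_vs_reference(), strand.top_vs_reference())
```

For I:++-, P overrides Illumina's R, so the probe is treated as being on the reference sense strand.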

Affymetrix's software typically exports genotypes onto the reference sequence's sense strand. There is no "TOP" or "Forward" strand in the Affymetrix world so the only other possibility would be the strand the probe assays (design/probe). So, the Strand column for Affymetrix arrays takes the form A:R where the "A:" prefix indicates Affymetrix and R is the orientation (+/-) of the probe to the ReferenceLibraryID's sense strand as indicated by Affy's annotations.

Because Affymetrix genotypes MNVs and complex variants and doesn't use the I/D coded values for indel genotypes like Illumina does, the "native" alleles (as exported from the Affymetrix software and expected to be observed in the genotype dataset that will be processed with the .osp2 file) cannot be easily mapped to the corresponding normalized alleles in the .osp2 file (Allele1 and Allele2 columns). So, when the "native" alleles don't match the normalized .osp2 alleles, they should be listed, in order, after the A:R in the Strand column like:

ProbeID        Strand    Allele1  Allele2
ExampleProbe1  A:+-/AA   G        GAA

where the first native allele (-) corresponds to the normalized allele in the Allele1 field (G) and the second native allele (AA) corresponds to the normalized allele in the Allele2 field (GAA). This is an example of an indel which Affymetrix sometimes reports using the old -/InsSeq convention where the "-" represents the deletion allele and the InsSeq (AA) is the sequence of base(s) which are inserted or deleted. Specifically, it is an example of an insertion since the REF value in Allele1 is the deletion and the ALT value in Allele2 is the insertion. Extending this example further, if it was a multi-allelic locus and another probe (ExampleProbe2) assayed 2 ALT alleles:

  1. the same insertion of 2 As with native value AA
  2. an insertion of just a single A with native value A

then the allele mapping would be like:

ProbeID        Strand    Allele1  Allele2
ExampleProbe1  A:+-/AA   G        GAA
ExampleProbe2  A:+AA/A   G        GAA/GA

where the first native allele (AA) corresponds to the first normalized allele in the Allele2 field (GAA) and the second native allele (A) corresponds to the 2nd normalized allele in the Allele2 field (GA). There won't be a native allele corresponding to the normalized allele in the Allele1 field because the probe doesn't assay the REF allele but we are still including the REF allele in Allele1 for context.
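
The native-to-normalized mapping described for the two example probes can be sketched as follows. The function name parse_affymetrix_strand is hypothetical and the parsing rules are inferred only from the examples above (orientation character directly after "A:", then optional native alleles separated by "/"). It is an illustration of the convention, not the Suite's own parser.

```python
def parse_affymetrix_strand(strand, allele1, allele2):
    """Return (orientation, {native allele: normalized allele})."""
    if not strand.startswith("A:") or len(strand) < 3 or strand[2] not in "+-":
        raise ValueError(f"not an Affymetrix Strand value: {strand!r}")
    orientation = strand[2]
    native_part = strand[3:]
    if not native_part:
        return orientation, {}  # native alleles already match Allele1/Allele2
    native = native_part.split("/")
    if "/" in allele2:
        # Probe assays 2 ALT alleles: natives map to the two values in Allele2
        normalized = allele2.split("/")
    else:
        normalized = [allele1, allele2]
    if len(native) != len(normalized):
        raise ValueError("native and normalized allele counts differ")
    return orientation, dict(zip(native, normalized))


if __name__ == "__main__":
    print(parse_affymetrix_strand("A:+-/AA", "G", "GAA"))
    print(parse_affymetrix_strand("A:+AA/A", "G", "GAA/GA"))
```

For ExampleProbe2 the REF allele in Allele1 (G) has no native counterpart, so it does not appear in the mapping.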

Script to prepare

There is a docker image (qdidiscoveryservices/prepare-osp) which contains a script that can be run to assemble this table from the manufacturer's annotation file. It is a BASH script and is maintained as a docker image because of its use of several tools like plink, BCFtools, SAMtools, and blast.

You can run a container to execute the script as follows:

docker run --rm --name optional_label --volume /HostInOut:/ContainerInOut --user $(id -u ${USER}):$(id -g ${USER}) qdidiscoveryservices/prepare-osp config_options &>/path/to/log
  • Where /path/to/log should be a path on the host but config_options should use /ContainerInOut paths
    • You can mount multiple /HostInOut directories by repeating the --volume option.
  • Suggest an initial run using --help as the config_options which will print the details of each option.
  • This will pull the default latest instance of the docker image. Older versions will be maintained in the Docker Hub repo for access as necessary.

You can also run a container interactively (e.g. to troubleshoot) like:

docker run -it --user $(id -u ${USER}):$(id -g ${USER}) --volume /HostInOut:/ContainerInOut qdidiscoveryservices/prepare-osp bash


The primary output table from the script (suitable as input as File) will be named ReferenceLibraryID_Manufacturer.ProductVersion.txt (e.g. Human.B37.3_Illumina.InfiniumQCArray-24v1-0_A3.txt) where ProductVersion comes from the header of the manufacturer's annotation file. In addition to this output, these supplemental outputs will also be written with the same filename root (e.g. Human.B37.3_Illumina.InfiniumQCArray-24v1-0_A3):

  • .summary.txt - Details of the run including tool versions and summary counts of different probe types/groups. Tab-delimited so it can be opened in Excel.
  • .FlipFromForward - List of probes to complement from Forward strand to get to TOP strand:
    • Since NormalizeSnpStrand does not support the Forward strand, if a genotype set was exported from GenomeStudio to Forward strand, this file can be used in a plink --flip command to get to the TOP strand which NormalizeSnpStrand accepts.
    • Note, this file is only relevant for SNVs since Illumina encodes indel alleles as I/D which are the same regardless of strand state.
  • If >0 indel probes are successfully resolved: .indels.update-map & .indels.update-alleles
    • These files can be used by plink1.9 to standardize the I/D coded indel genotypes to the expected VCF conventions:
      plink --bfile source_data_w_ID_alleles --update-alleles indels.update-alleles --update-chr indels.update-map 2 --update-map indels.update-map 3 --a1-allele indels.update-map 4 --make-bed --out resolved_data_w_VCF_alleles
    • Primary use of these files is testing - after recent improvements, NormalizeSnpStrand can standardize the I/D genotypes (in the past, these required pre/post-processing)

Strongly suggest reviewing both the .summary.txt file and the log captured from the docker container run.

This script relies on the mapping information provided by Illumina in the annotation file. As such, the P value in Strand column will be ? (it does not attempt to map the probes and detect errors in Illumina's RefStrand annotation) with the exception of indels where it does map the probe in order to infer the REF and ALT values.


This is how the docker container was run to generate the File for Illumina's InfiniumQCArray-24v1-0_A3 array:

docker run --rm --name "InfiniumQCArray-24v1-0_A3" --volume /panel_building:/data --user $(id -u ${USER}):$(id -g ${USER}) qdidiscoveryservices/prepare-osp \
   --annotation-file "/data/Illumina/csv/InfiniumQCArray-24v1-0_A3.csv.gz" --fasta "/data/ReferenceSequences/Homo_sapiens.GRCh37.75.dna.primary_assembly.fa" \
   --manufacturer illumina --out-directory "/data/Illumina/out" \

where /panel_building is the directory on the host containing both the input files and the --out-directory.


PanelID should be of the form Manufacturer.ProductVersion:

  • Manufacturer is either Affymetrix or Illumina
  • ProductVersion should sufficiently indicate both the physical array version and the version of the source data used to compile the File (i.e. the annotations from the array manufacturer). For example, HumanOmni2.5-4v1_H indicates the H annotation version of the first iteration of the HumanOmni2.5 quad array (the -4 indicates 4 samples can be run on each BeadChip hence quad as compared to the HumanOmni25-8v1-2_A1 where 8 samples can be run on each BeadChip). This value is often defined in the header of the manufacturer's annotation file like the Descriptor File Name line in this example (the .bpm extension refers to the corresponding binary form of the annotation file used by the GenomeStudio software):
    Illumina, Inc.
    Descriptor File Name,HumanOmni2.5-4v1_H.bpm
    Assay Format,Infinium HD Super
    Date Manufactured,4/21/2011
    Loci Count ,2443177

If a .osp2 file already exists for this PanelID (e.g. you found a mistake in the existing .osp2 file and are creating a new .osp2 file with the correction), add an additional suffix to the PanelID to maintain proper version control (e.g. PanelID=Illumina.HumanOmni2.5-4v1_H.v2).
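
As an illustration of the naming rules above, the sketch below pulls ProductVersion from the Descriptor File Name header line and assembles a PanelID. The function names are hypothetical and the header handling is an assumption based only on the example header shown above. Other manufacturers or annotation versions may lay out the header differently.

```python
def product_version_from_header(lines):
    """Return ProductVersion from the 'Descriptor File Name' line, or None."""
    for line in lines:
        key, _, value = line.partition(",")
        if key.strip() == "Descriptor File Name":
            name = value.strip()
            # Drop the .bpm extension of the binary annotation file
            return name[:-4] if name.endswith(".bpm") else name
    return None


def build_panel_id(manufacturer, product_version, revision=None):
    """Assemble Manufacturer.ProductVersion, with an optional suffix such as 'v2'."""
    if manufacturer not in ("Affymetrix", "Illumina"):
        raise ValueError("Manufacturer is either Affymetrix or Illumina")
    panel_id = f"{manufacturer}.{product_version}"
    if revision:
        panel_id += f".{revision}"
    return panel_id


HEADER = [
    "Illumina, Inc.",
    "Descriptor File Name,HumanOmni2.5-4v1_H.bpm",
    "Assay Format,Infinium HD Super",
    "Date Manufactured,4/21/2011",
    "Loci Count ,2443177",
]

if __name__ == "__main__":
    version = product_version_from_header(HEADER)
    print(build_panel_id("Illumina", version))
    print(build_panel_id("Illumina", version, revision="v2"))
```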


The .osp2 file written to OutputFolder will be named as ReferenceLibraryID_PanelID.osp2.

A text-dump of the .osp2 file will also be written to the OutputFolder named as ReferenceLibraryID_PanelID.osp2.txt which can be useful to confirm the procedure executed as expected. This file will also have 9 columns but will be slightly different from the content of File:

  1. SnpID - should be same as ProbeID column of File
  2. Chromosome - decoded value from Uid
  3. Position - decoded value from Uid
  4. Reference - decoded value from Uid (indels in particular probably won't match REF value from File)
  5. Alternative - decoded value from Uid (indels in particular probably won't match ALT value from File)
  6. Strand - should be same as Strand column of File with one exception:
    • For probes which assay 2 ALT alleles, the REF value in the Allele1 field of File will be appended to the end of the Strand value in the .osp2 file and the 2 assayed ALT alleles in the Allele2 field of File will be split with the first being placed in Allele1 and the second in Allele2 of the .osp2 file.
  7. Allele1 - should be same as Allele1 column of File with the exception of situation described in Strand
  8. Allele2 - should be same as Allele2 column of File with the exception of situation described in Strand
  9. Uid - OmicSoft's encoded form of variant definition
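
One quick way to use the text-dump for confirmation is to check that every ProbeID in File appears as a SnpID in the dump. The sketch below is only an illustration with hypothetical function names. It assumes the dump is tab-delimited with a single header line like File, which is not stated above and should be verified against an actual .osp2.txt file.

```python
import io


def first_column_ids(handle, has_header=True):
    """Collect the first tab-delimited field of each non-empty record."""
    ids = []
    for index, line in enumerate(handle):
        if index == 0 and has_header:
            continue
        line = line.rstrip("\r\n")
        if line:
            ids.append(line.split("\t")[0])
    return ids


def compare_probe_ids(file_handle, dump_handle):
    """Return (IDs only in File, IDs only in the .osp2.txt dump)."""
    file_ids = set(first_column_ids(file_handle))
    dump_ids = set(first_column_ids(dump_handle))
    return sorted(file_ids - dump_ids), sorted(dump_ids - file_ids)


if __name__ == "__main__":
    # Hypothetical two-record inputs, trimmed to the first columns
    source = io.StringIO("ProbeID\tChromosome\nprobeA\t1\nprobeB\t0\n")
    dump = io.StringIO("SnpID\tChromosome\nprobeA\t1\n")
    print(compare_probe_ids(source, dump))
```

Both returned lists being empty suggests that no probes were dropped or added. It does not confirm that the alleles were encoded as intended.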

To use the .osp2 file in NormalizeSnpStrand or PublishToGxl, place it into the Variant/panel subdirectory of the OmicsoftDirectory.

WARNING: Unlike the standard .osp2 files maintained by OmicSoft as part of the Suite, you are accountable for record management (data trail, versioning, archiving, etc) of any custom .osp2 files you create. Please refer to your institution's policies.