Omicsoft Gene Model

From Array Suite Wiki

Revision as of 17:52, 14 December 2016 by Joseph (Talk | contribs)

OmicsoftGene20141031 is an improved version of OmicsoftGene20140822 that restores 13 miRNA genes that had previously been removed.

Mouse Gene model OmicsoftGene20141031

Replaces OmicsoftGene20140822 (designed for Mouse.B38).

Consists of three components:

  • UCSC gene model (downloaded on 8/22/2014);
  • Ensembl gene model (R76) for MT only;
  • miRBase gene model (R21) for microRNA genes. All miRNA gene annotations from the UCSC gene model are removed.

Human Gene model OmicsoftGene20141031

Replaces OmicsoftGene20140822 (designed for Human.hg38).

Consists of three components:

  • UCSC gene model (downloaded on 08/22/2014);
  • Ensembl gene model (R76) for MT only;
  • miRBase gene model (R21) for microRNA genes. All miRNA gene annotations from the UCSC gene model are removed.


Human Gene model OmicsoftGene20130723

(designed for Human.B37.3, Human.hg19)

Consists of three components:

  • UCSC gene model (downloaded on 7/23/2013);
  • Ensembl gene model (R75) for MT only;
  • miRBase gene model (R20) for microRNA genes. All miRNA gene annotations from the UCSC gene model are removed.
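
Each gene model above combines the same three sources. The following is a minimal sketch of that composition, not the actual Omicsoft build code. It assumes each annotation record is a dict with hypothetical `chrom` and `gene_type` keys, that miRNA records can be recognised by `gene_type == "miRNA"`, and that the mitochondrial chromosome is named `MT` or `chrM`. The page does not say whether UCSC MT records are dropped, so the sketch leaves them in place.

```python
def combine_gene_models(ucsc, ensembl, mirbase, mt_names=("MT", "chrM")):
    """Combine three annotation sources into one list of records.

    All field names and the miRNA test are illustrative assumptions.
    """
    # UCSC: keep everything except miRNA gene annotations.
    kept_ucsc = [r for r in ucsc if r.get("gene_type") != "miRNA"]
    # Ensembl: keep only records on the mitochondrial chromosome.
    kept_ensembl = [r for r in ensembl if r.get("chrom") in mt_names]
    # miRBase: supplies all microRNA genes.
    return kept_ucsc + kept_ensembl + list(mirbase)


ucsc = [
    {"gene_name": "Actb", "chrom": "5", "gene_type": "protein_coding"},
    {"gene_name": "Mir21", "chrom": "11", "gene_type": "miRNA"},
]
ensembl = [
    {"gene_name": "mt-Nd1", "chrom": "MT", "gene_type": "protein_coding"},
    {"gene_name": "Actb", "chrom": "5", "gene_type": "protein_coding"},
]
mirbase = [{"gene_name": "mmu-mir-21a", "chrom": "11", "gene_type": "miRNA"}]

combined = combine_gene_models(ucsc, ensembl, mirbase)
combined_names = [r["gene_name"] for r in combined]
```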


Steps used to build UCSC-based gene models:

  • Eliminate any records that point to a chromosome that is not included in the associated ReferenceLibrary FASTA;
  • Group transcripts by ‘gene_name’ and assign the ‘gene_name’ to the transcript ‘gene_id’ field. Where the original ‘gene_id’ differs from the new name, create an ‘orig_id’ field that preserves the old ID;
  • Identify cases where ‘gene_id’ values differ only by letter case (this happens in both UCSC and Ensembl);
  • Sort the names in each such case and rename all affected transcripts to the first name in the sorted list (usually the all-caps version);
  • For each transcript bundle (i.e. the set of transcripts that share a ‘gene_id’ at this stage), split the bundle up until the following conditions are met for all sub-bundles:
    • All on same chromosome
    • All on same strand
    • No transcript in the sub-bundle is more than 10 kb away from all other transcripts in the sub-bundle (including introns)
  • If sub-bundles have been created, add suffixes to the transcript ‘gene_id’s to differentiate them: <gene_id>_<chromosome>_<strand>_<bundle_num> (e.g. snoU13_1_+_3). The bundle_num is incremented for each bundle that is on the same chromosome and strand;
  • Splitting bundles in this way makes the gene model look cleaner in the genome browser (i.e. it prevents the mess that occurs when a gene spans a large region of the chromosome because its transcripts are scattered widely);
  • A LocusLink file is used to look up the Entrez gene ID.
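
The case-merging and bundle-splitting steps above can be sketched as follows. This is an illustrative reading of the rules, not the actual build code: the `Transcript` type and all function names are hypothetical, and the 10 kb rule is interpreted as single-linkage grouping on full transcript spans (start to end, introns included), so a new sub-bundle starts whenever the next transcript begins more than 10 kb past everything seen so far.

```python
from collections import defaultdict
from typing import NamedTuple

MAX_GAP = 10_000  # 10 kb threshold from the rule above


class Transcript(NamedTuple):
    transcript_id: str
    gene_id: str
    chrom: str
    strand: str
    start: int  # start of the transcript span (introns included)
    end: int    # end of the transcript span


def merge_case_variants(transcripts):
    """Rename gene_ids that differ only by case to the first sorted name."""
    variants = defaultdict(set)
    for t in transcripts:
        variants[t.gene_id.lower()].add(t.gene_id)
    canonical = {key: sorted(names)[0] for key, names in variants.items()}
    return [t._replace(gene_id=canonical[t.gene_id.lower()]) for t in transcripts]


def split_bundle(bundle):
    """Split one gene's transcripts by chromosome, strand and distance."""
    by_locus = defaultdict(list)
    for t in bundle:
        by_locus[(t.chrom, t.strand)].append(t)
    sub_bundles = []
    for (chrom, strand), members in sorted(by_locus.items()):
        members.sort(key=lambda t: t.start)
        groups = [[members[0]]]
        reach = members[0].end  # furthest end coordinate in the current group
        for t in members[1:]:
            if t.start - reach > MAX_GAP:
                groups.append([t])  # too far away: open a new sub-bundle
                reach = t.end
            else:
                groups[-1].append(t)
                reach = max(reach, t.end)
        # bundle_num restarts at 1 for each chromosome/strand pair
        for num, group in enumerate(groups, start=1):
            sub_bundles.append((chrom, strand, num, group))
    return sub_bundles


def assign_gene_ids(transcripts):
    """Apply case merging, then suffix gene_ids only where a split occurred."""
    bundles = defaultdict(list)
    for t in merge_case_variants(transcripts):
        bundles[t.gene_id].append(t)
    result = []
    for gene_id, bundle in bundles.items():
        subs = split_bundle(bundle)
        for chrom, strand, num, group in subs:
            new_id = gene_id if len(subs) == 1 else f"{gene_id}_{chrom}_{strand}_{num}"
            result.extend(t._replace(gene_id=new_id) for t in group)
    return result


example = [
    Transcript("t1", "snoU13", "1", "+", 100, 200),
    Transcript("t2", "snoU13", "1", "+", 5000, 5100),
    Transcript("t3", "snoU13", "1", "+", 50000, 50100),
    Transcript("t4", "SNOU13", "2", "-", 100, 200),
    Transcript("t5", "Gapdh", "6", "-", 100, 900),
]
new_ids = {t.transcript_id: t.gene_id for t in assign_gene_ids(example)}
```

In this made-up example, t1 and t2 stay together, t3 is split off because it lies more than 10 kb away, and t4 lands in its own sub-bundle on another chromosome and strand. Gapdh is not split, so it keeps its name without a suffix.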