From Array Suite Wiki


Report SingleCell Counts


The Barcoded BAM Based Counting module reports single-cell gene counts for an NGS dataset. Currently, this module generates count data only at the gene level, not at the transcript level. The module counts unique UMIs, so each RNA molecule is counted once regardless of how many reads it produced.

To access this module, please go to Analysis | NGS | Single Cell RNA-Seq | Barcoded BAM Based Counting

Ngs SCGeneCount04.png

Input Data Requirements

This module works on NGS data objects, including RNA-seq data that were mapped to a genome or transcriptome.

General Options

Ngs SCGeneCount05.png


  • Project & Data: The window includes a dropdown box to select the Project and Data object to be counted.
  • References: This module will quantify gene-level counts for all genes in the reference used to map the RNA-seq data.
  • Observations: Selections can be made on which observations should be included in the counting. Options include All observations, Selected observations, Visible observations, and Customized observations (select any pre-generated Lists).
  • Output type: Zero Inflated MicroArray data will be generated for memory efficiency.
  • Output name: The user can choose to name the output data object.


  • Gene Model: Allows users to specify the gene model ID to use.
  • Cell barcode tag: The default name is "CB".
  • Cell count safe harbor: The user can set a cell count threshold here. The count refers to the number of mapped genes. Cells with a count greater than the safe harbor are always returned; cells with a count below the safe harbor are evaluated based on the count distribution. A value of 200 or 250 should be safe to use.
  • Count by UMI barcode (generate UMI counts): UMI barcode tag: the default name for the UMI tag is "UB".
  • Count multi-reads: Multi-reads are non-unique reads (i.e. reads that align to multiple genomic locations with equal or similar numbers of mismatches). Selecting this option includes unique reads, plus those multi-reads that map to the same gene, in the UMI counting for that gene. If this option is not selected, only reads that map uniquely to each gene are used for UMI counting.
  • Count sense strand reads: When this option is checked, only reads mapped to the sense strand are counted, as generally generated by 3' chemistry.
  • Count antisense strand reads: When this option is checked, only reads mapped to the antisense strand are counted, as generally generated by 5' chemistry.
  • Job number: The number of threads allocated to the process. The more threads that are allocated, the faster the algorithm will run. By default, this is set to the number of CPUs on the user's computer. Users should not set it higher than the number of CPUs available, but may reduce it at their discretion.
  • Output folder: Specify the output folder to store output results.
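
As a rough illustration of what counting unique UMIs per cell and gene means, here is a minimal Python sketch. It is not the module's implementation. The read layout (cell barcode from the "CB" tag, UMI from the "UB" tag, and a list with the gene hit for each alignment location) and the function name count_unique_umis are assumptions made for this example, as is the interpretation that a multi-read is counted only when all of its alignments fall in one gene.

```python
from collections import defaultdict


def count_unique_umis(reads, count_multi_reads=False):
    """Count unique UMIs per (cell barcode, gene).

    reads: iterable of (cell_barcode, umi, gene_hits), where gene_hits lists
    the gene for each alignment location of the read (hypothetical layout).
    """
    umis = defaultdict(set)
    for cell, umi, gene_hits in reads:
        # A read with more than one alignment location is a multi-read.
        if len(gene_hits) > 1 and not count_multi_reads:
            continue
        genes = set(gene_hits)
        # Skip reads with no gene or with alignments spread over several genes.
        if len(genes) != 1:
            continue
        umis[(cell, genes.pop())].add(umi)
    return {key: len(seen) for key, seen in umis.items()}


# Illustrative reads only; barcodes and gene names are made up.
reads = [
    ("AAAC", "CGGT", ["GeneA"]),
    ("AAAC", "CGGT", ["GeneA"]),           # same UMI again: counted once
    ("AAAC", "TTAG", ["GeneA", "GeneA"]),  # multi-read within one gene
    ("AAAC", "GGCA", ["GeneA", "GeneB"]),  # ambiguous between two genes
    ("TTTG", "ACGT", ["GeneB"]),
]
print(count_unique_umis(reads))
print(count_unique_umis(reads, count_multi_reads=True))
```

With Count multi-reads enabled, the third read adds a second UMI for GeneA in cell AAAC; the read that hits two different genes is skipped in both modes.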

Advanced Options

Ngs SCGeneCount06.png

Cluster UMI

PCR amplification errors and sequencing errors can generate novel UMI sequences, which erroneously inflate the number of unique UMIs. This option merges UMIs of a given transcript that differ by just one base, treating them as the same UMI during counting, and so eliminates UMIs generated by substitution errors during PCR or sequencing.

For instance, in the following example the correct UMI is "CGGT", but PCR or sequencing errors have produced three variants: CGGG, CGCT, and CAGT. Without clustering, the UMI count is 4; if the user checks the Cluster UMI option, the UMI count is 1:

Ngs SCGeneCount09.png

Applied to a real example:

Ngs SCGeneCount11.png

If the user checks this option, errors introduced by base calling or amplification are corrected by clustering different UMIs (mapped to the same gene) that differ by 1 mismatch into the same premier UMI. The premier UMI is chosen by read count: the UMI with the higher read count is treated as the correct UMI.

The Cluster UMI procedure works as follows:

  1. All the UMIs of reads mapped to this gene are ranked by their naive read count.
  2. Take the first UMI as the premier UMI and scan the following UMIs for any that differ from it by only 1 mismatch.
  3. If there are such UMIs, add their counts to the premier UMI.
  4. Preserve only the premier UMI after clustering.
  5. Repeat for the remaining UMIs.
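
The clustering steps can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the module's actual code: the function names hamming and cluster_umis are hypothetical, ties in read count are broken alphabetically because the module's tie rule is not documented, and the read counts in the example are invented.

```python
def hamming(a, b):
    """Number of mismatched positions between two equal-length strings."""
    return sum(x != y for x, y in zip(a, b))


def cluster_umis(umi_reads):
    """Merge UMIs of one gene that differ by one base into the premier UMI.

    umi_reads: dict of UMI sequence -> naive read count for a single gene.
    Returns a dict of premier UMI -> summed read count.
    """
    # Step 1: rank UMIs by read count, highest first (assumed tie-break: A-Z).
    ranked = sorted(umi_reads.items(), key=lambda kv: (-kv[1], kv[0]))
    merged = set()   # UMIs already absorbed into a premier UMI
    clustered = {}
    for i, (premier, count) in enumerate(ranked):
        if premier in merged:
            continue
        total = count
        # Steps 2-3: scan lower-ranked UMIs and absorb those 1 mismatch away.
        for other, other_count in ranked[i + 1:]:
            if other in merged or len(other) != len(premier):
                continue
            if hamming(premier, other) == 1:
                total += other_count
                merged.add(other)
        # Step 4: only the premier UMI is kept; step 5 is the outer loop.
        clustered[premier] = total
    return clustered


# The CGGT example from above, with made-up read counts.
reads_per_umi = {"CGGT": 120, "CGGG": 3, "CGCT": 2, "CAGT": 1}
clustered = cluster_umis(reads_per_umi)
print(len(reads_per_umi), "UMIs before clustering,", len(clustered), "after")
```

In this sketch each variant is compared only with the current premier UMI, so variants are not chained through one another.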

Convert UMI count to transcript number

WARNING: We recommend checking this option only if the UMI is short, for instance only 4 nt or 5 nt. If the UMI is 7 nt or longer, this option should not be checked.

Here we provide another "correction" for UMI counting, intended for short UMIs (4 or 5 nucleotides). With so few possible UMI sequences, it is easy for the transcripts of one gene to use up nearly all of them, so that distinct transcripts share a UMI; this is called "UMI usage saturation". The module therefore calculates the theoretical number of transcripts from the number of observed UMIs. Note that the equations used contain logarithms that tend to infinity, and they greatly overestimate the number of molecules when the number of detected UMIs approaches the maximal complexity of the UMI.

if count < total:
    theoretical_value = -total * ln(1 - count/total)
if count == total:
    theoretical_value = -total * ln(1 - (count-1)/total)    // prevent overflow

# total = 4*4*4....*4 (n times, where n is the number of nucleotides in the UMI)
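
The conversion can also be written as a small runnable Python function. This is a sketch of the formula as stated here, not the module's code; the name umi_to_transcripts and the input validation are additions for illustration.

```python
import math


def umi_to_transcripts(count, umi_length):
    """Estimate the transcript number from the observed unique UMI count.

    count: number of unique UMIs observed for a gene.
    umi_length: number of nucleotides in the UMI (n), so total = 4 ** n.
    """
    total = 4 ** umi_length
    if not 0 <= count <= total:
        raise ValueError("count must be between 0 and 4 ** umi_length")
    if count == total:
        # Use count - 1 so that ln(0) is never evaluated.
        count = total - 1
    return -total * math.log(1 - count / total)


# With a 4 nt UMI there are 256 possible UMIs. The correction is small at
# low counts and grows quickly as the count approaches 256.
for observed in (10, 100, 200, 256):
    print(observed, round(umi_to_transcripts(observed, 4), 1))
```

The estimate is always at least as large as the observed count, and the gap widens as saturation approaches, which is why the option is intended for short UMIs only.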

If the user is interested in the underlying mathematics, please refer to this paper for more information: Counting individual DNA molecules by the stochastic attachment of diverse labels

If the user checks this option, a text file is generated in the output folder for each sample, with content like this:

Ngs SCGeneCount08.png

The UniqueCount column contains the number of unique UMIs, the Count column contains the naive count of reads mapped to each gene, and the TranscriptNumber column contains the calculated theoretical value, which is a "correction" for UMI usage saturation.

Output Results

This module will create one or more ZeroInflated MicroArray Data objects:

Ngs SCGeneCount03.png

If the user has checked the Convert UMI count to transcript number option in the advanced options for Report Single Cell Counts:

Ngs SCGeneCount12.png

Then two Omic data objects will show up in the project in the GUI:

Ngs SCGeneCount13.png

The SingleCellCounts table shows the unique UMI count before conversion to the theoretical count:

Ngs SCGeneCount14.png

The SingleCellConvertedCounts table shows the theoretical unique UMI count:

Ngs SCGeneCount15.png

Also, the output folder will contain a text file that stores the converted counts and a text file that stores the naive counts (in case the user wants to load these tables into ArrayStudio to check them):

Ngs SCGeneCount16.png
