Report SingleCell Counts
The Barcoded BAM Based Counting module reports SingleCell gene counts for an NGS dataset. Currently, this module only generates count data at the gene level, not at the transcript level. This module counts the unique UMIs for each RNA molecule.
To access this module, please go to Analysis | NGS | Single Cell RNA-Seq | Barcoded BAM Based Counting
Input Data Requirements
This module works on NGS data objects, including RNA-seq data that were mapped to a genome or transcriptome.
- Project & Data: The window includes a dropdown box to select the Project and Data object to be filtered.
- References: This module will quantify gene/transcript levels for all genes in the reference used to map the RNA-seq data.
- Observations: Selections can be made on which observations should be included in the filtering. Options include All observations, Selected observations, Visible observations, and Customized observations (select any pre-generated Lists).
- Output type: Zero Inflated MicroArray data will be generated for memory efficiency.
- Output name: The user can choose to name the output data object.
- Gene Model: Specify the gene model ID to use.
- Cell barcode tag: The default name will be "CB".
- Cell count safe harbor: The user can set a cell count threshold here. The count refers to the number of mapped genes. Cells with a count greater than the safe harbor will always be returned, and cells with a count less than the safe harbor will be evaluated based on the count distribution. A value of 200 or 250 should be safe to use for this option.
- Count by UMI barcode (generate UMI counts): UMI barcode tag: the default name for the UMI tag is "UB".
- Count multi-reads: Multi reads are considered non-unique (i.e. reads that align to multiple genomic locations with equal or similar numbers of mismatches). Selecting this option will include unique reads, and those multi-reads which can be mapped to the same gene, for the UMI counting for this gene. If this option is not selected, only those reads that can be uniquely mapped to each gene will be used for UMI counting.
- Stranded sequencing: When this option is checked, only reads mapped to the forward strand will be counted; when this option is not checked, both forward and reverse strand mapped reads will be counted.
- Job number: Here, the job number is the number of threads, i.e. the total number of threads to be allocated to the process. The more threads that are allocated, the faster the algorithm will run. By default, this is set to the number of CPUs on the user’s computer. Users should not set this to a value greater than the number of CPUs available, but it can be reduced at the user’s discretion.
- Output folder: Specify the output folder to store output results.
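The interaction between the cell barcode tag, the UMI barcode tag, and the Count multi-reads option can be illustrated with a minimal Python sketch. This is not the module's implementation: the record layout (a dict with hypothetical keys "CB", "UB", "genes", and "unique") and the function name count_umis are assumptions made only for illustration.

```python
from collections import defaultdict

def count_umis(records, count_multi_reads=False):
    """Count unique UMIs per (cell barcode, gene).

    Each record is a dict with hypothetical keys (assumed for this sketch):
      "CB"     - cell barcode tag value
      "UB"     - UMI barcode tag value
      "genes"  - set of gene IDs hit by the read's alignments
      "unique" - True if the read aligns to a single genomic location
    """
    umis = defaultdict(set)
    for rec in records:
        cb, ub = rec.get("CB"), rec.get("UB")
        if cb is None or ub is None:
            continue  # read carries no cell or UMI barcode
        if len(rec["genes"]) != 1:
            continue  # no gene, or alignments fall in different genes
        if not rec["unique"] and not count_multi_reads:
            continue  # multi-reads are only used when the option is selected
        (gene,) = rec["genes"]
        umis[(cb, gene)].add(ub)  # a set keeps each UMI once
    return {key: len(value) for key, value in umis.items()}
```

With the option off, only uniquely mapped reads contribute; with it on, a multi-read whose alignments all fall in the same gene also contributes its UMI to that gene.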
PCR amplification errors and sequencing errors can generate novel UMI sequences, which erroneously increase the number of unique UMIs. The Cluster UMI option enables users to merge UMIs of a given transcript that differ by just one base, treating them as the same UMI during counting, which eliminates UMIs generated by substitution errors during PCR or sequencing.
For instance, in the following example, the correct UMI is "CGGT", while due to PCR/sequencing errors, the resulting UMI has 3 additional variants: CGGG, CGCT, and CAGT. In this case, the UMI count is 4 without clustering UMIs, while if the user checks the Cluster UMI option, the UMI count will be 1:
applied in a real example:
If the user checks this option, errors introduced by base-calling or amplification can be avoided by clustering different UMIs (mapped to the same gene) that differ by 1 mismatch into the same premier UMI. The premier UMI is chosen based on read count: the UMI with the higher read count is treated as the correct UMI.
Here is an example to show how cluster UMI works:
- All the UMIs of reads mapped to this gene are ranked by their naive read count
- Take the first UMI, set it as the premier UMI, and scan the following UMIs for any that differ from it by only 1 mismatch
- If there are such UMIs, add all of their counts to the premier UMI
- Preserve only the premier UMI after clustering
- Do the same for the remaining UMIs
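The steps above can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the module's actual code: the function names hamming and cluster_umis, the alphabetical tie-break between UMIs with equal read counts, and the read counts used in the usage note are all hypothetical.

```python
def hamming(a, b):
    """Number of mismatching positions between two equal-length UMIs."""
    return sum(x != y for x, y in zip(a, b))

def cluster_umis(read_counts):
    """Cluster UMIs of one gene that differ by exactly one base.

    read_counts maps UMI sequence -> naive read count.
    Returns a dict of premier UMI -> summed read count.
    """
    # Rank by read count, highest first (ties broken alphabetically: an assumption).
    ranked = sorted(read_counts.items(), key=lambda kv: (-kv[1], kv[0]))
    clustered = {}
    absorbed = set()  # UMIs already merged into a premier UMI
    for i, (umi, count) in enumerate(ranked):
        if umi in absorbed:
            continue
        total = count
        # Scan the lower-ranked UMIs for single-mismatch neighbours.
        for other, other_count in ranked[i + 1:]:
            if other in absorbed or len(other) != len(umi):
                continue
            if hamming(umi, other) == 1:
                total += other_count
                absorbed.add(other)
        clustered[umi] = total  # only the premier UMI is preserved
    return clustered
```

For the example above, with hypothetical read counts, cluster_umis({"CGGT": 10, "CGGG": 1, "CGCT": 1, "CAGT": 1}) keeps only "CGGT", so the UMI count drops from 4 to 1.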
Convert UMI count to transcript number
Here we provide another "correction" for UMI counting, which is especially useful when the UMI is short (4 or 5 nucleotides). With such short UMIs, the number of distinct UMI sequences is small, so it is easy to reach the maximum number of UMIs available to label all of the transcripts of one gene, which is called "UMI usage saturation". Here we provide a way to calculate the theoretical number of transcripts from the number of UMIs. Note that the equations used contain logarithms that tend to infinity and greatly overestimate the number of molecules when the number of detected UMIs approaches the maximal complexity of the UMI.
if count < total:
    theoretical_value = -total * ln(1 - count/total)
if count == total:
    theoretical_value = -total * ln(1 - (count-1)/total)   // prevent overflow
# total = 4*4*4....*4 (n times, n is the number of nucleotides in the UMI)
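The conversion above can be written as a short, runnable Python sketch. The function name umi_to_transcripts and its parameter names are assumptions for illustration; the arithmetic follows the pseudocode, including the use of count - 1 at saturation so that ln(0) is never evaluated.

```python
import math

def umi_to_transcripts(count, umi_length):
    """Theoretical transcript number for a given unique UMI count.

    count      - number of unique UMIs observed for a gene
    umi_length - number of nucleotides in the UMI (n)
    """
    total = 4 ** umi_length  # number of possible UMI sequences
    if not 0 <= count <= total:
        raise ValueError("count must be between 0 and 4**umi_length")
    if count == total:
        count -= 1  # avoid ln(0) when every possible UMI was observed
    return -total * math.log(1 - count / total)
```

For a 4-nucleotide UMI (total = 256), 10 unique UMIs convert to about 10.2 transcripts, while 128 unique UMIs convert to about 177.4, which shows how the correction grows as the UMI space fills up.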
Users interested in the underlying mathematics can refer to this paper for more information: Counting individual DNA molecules by the stochastic attachment of diverse labels
If the user checks this option, a text file will be generated in the output folder for each sample, with content like this:
The UniqueCount column is the number of unique UMIs, the Count column contains the naive count of the reads mapped to each gene, and the TranscriptNumber column contains the calculated theoretical value, which is a "correction" for UMI usage saturation.
This module will create one or more ZeroInflated MicroArray Data objects.
If the user has checked the option Convert UMI count to transcript number in the advanced options for Report Single Cell Counts:
Then two omic type data objects will show up in the project GUI:
The SingleCellCounts table shows the unique UMI count before being converted to the theoretical UMI count:
The SingleCellConvertedCounts table shows the theoretical unique UMI count:
Also, in the output folder, there will be a text file storing the converted counts and a text file storing the naive counts (in case the user wants to load these tables into ArrayStudio to check them):
- SCRNA-Seq Analysis
- Alignment of SingleCell RNA-Seq Data
- Counting individual DNA molecules by the stochastic attachment of diverse labels