Ngs 10X V1 Preprocess.pdf

From Array Suite Wiki



10X V1 SingleCell fastq file Preprocess


WARNING:

SingleCell fastq files generated by the 10X Genomics platform come in two versions. Version 1 has been deprecated, and version 2 is now the more common data format for 10X Single Cell data.

This module is designed specifically to preprocess version 1 SingleCell fastq files.

For V1 data, the sample description should include: Chemistry -- Single Cell 3' V1; or Cell Ranger -- 1.*.*

If the 10X sample description instead includes:

Chemistry -- Single Cell 3' v2; or Cell Ranger -- Version 2.0.0

then the data is 10X V2, and the user should go to this wiki page: Preprocess of SingleCell RNASeq data from 10X(V2)

The 10X V1 fastq files are organized as follows:

  • i5 index file: Sample Index
  • i7 index file: Cell Barcode (14nt)
  • R1 file: sequence that can be aligned to the genome (98nt)
  • R2 file: UMI (10nt)


10X Genomics is a popular platform for generating single cell RNASeq data. OmicSoft has designed this module specifically for preprocessing 10X V1 data.

This module takes fastq files as input and then:

  • Filters the cells based on total read count
  • Corrects cell barcodes with logic similar to that applied by CellRanger
  • Ranks the cells by read count and extracts the top N cells
  • Filters the reads by cell barcode and UMI quality
  • Generates a knee plot to show the read count distribution across all cells
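The filtering and ranking steps in this list can be illustrated with a minimal Python sketch. It is not OmicSoft's implementation: the function name `select_top_cells` and its parameters are hypothetical, and the sketch assumes that per-barcode read counts have already been collected into a dictionary.

```python
def select_top_cells(read_counts, top_n, min_reads):
    """Drop low-count cells, rank the rest by read count, keep the top N.

    read_counts: dict mapping cell barcode -> total read count (assumed input).
    top_n:       number of top-ranked cells to keep.
    min_reads:   cells with fewer reads than this are discarded.
    """
    # Cells with fewer reads than min_reads are treated as poor quality.
    passing = {bc: n for bc, n in read_counts.items() if n >= min_reads}
    # Rank by read count, highest first; ties are broken by barcode for determinism.
    ranked = sorted(passing.items(), key=lambda kv: (-kv[1], kv[0]))
    return [bc for bc, _ in ranked[:top_n]]


counts = {"AAAA": 100, "CCCC": 5, "GGGG": 50, "TTTT": 70}
top_cells = select_top_cells(counts, top_n=2, min_reads=10)
```

Because the ranking uses read counts rather than UMI counts, choosing `top_n` somewhat above the expected cell number leaves room for cells whose read depth is lower than their UMI content would suggest.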

This module can be accessed by going to NGS | SingleCell RNA-Seq | 10X Preprocessing | 10X(V1) Preprocessing.


Input Data Requirements

This module requires FASTQ/FASTQ.GZ files as input (3 fastq files per sample), and a mapping file that shows which group each fastq file belongs to.

For instance, given 6 fastq files like these as input:

SC 10XpreprocessGUI3.png

a mapping.txt file like this can be used to group them:

SC 10XpreprocessGUI4.png

The mapping file can also combine the lane-001 and lane-002 reads into the same sample, like this:

SC 10XpreprocessGUI5.png

WARNING:

1. The mapping file must contain these 3 column names: CellBCFileName, CellSeqFileName, SampleBCFileName.

2. If multiple fastq files are grouped into one sample, the order of these fastq files must be consistent across the columns.

3. Users should not provide full paths to the files, only the names of the files in the experiment.
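Rules 1 and 3 can be checked before a job is submitted. The following Python sketch is only an illustration: it assumes the mapping file is tab-delimited with a header row, which this page does not state explicitly, and the function name `check_mapping` is hypothetical. Rule 2 depends on how grouped files are laid out in the mapping file and is not checked here.

```python
import csv
import io

REQUIRED_COLUMNS = ("CellBCFileName", "CellSeqFileName", "SampleBCFileName")


def check_mapping(text):
    """Return a list of problems found in mapping-file text (empty if none).

    Assumes a tab-delimited file with a header row (an assumption).
    """
    reader = csv.DictReader(io.StringIO(text), delimiter="\t")
    header = reader.fieldnames or []
    missing = [c for c in REQUIRED_COLUMNS if c not in header]
    if missing:
        return ["missing column(s): " + ", ".join(missing)]

    problems = []
    for line_no, row in enumerate(reader, start=2):  # line 1 is the header
        for col in REQUIRED_COLUMNS:
            name = (row[col] or "").strip()
            if not name:
                problems.append(f"line {line_no}: empty value in {col}")
            elif "/" in name or "\\" in name:
                # Rule 3: file names only, no directory components.
                problems.append(f"line {line_no}: {col} contains a path: {name}")
    return problems


# Hypothetical file names, for illustration only.
good = ("CellBCFileName\tCellSeqFileName\tSampleBCFileName\n"
        "s1_I1.fastq.gz\ts1_R1.fastq.gz\ts1_I2.fastq.gz\n")
bad = good.replace("s1_R1.fastq.gz", "/data/run1/s1_R1.fastq.gz")
```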

General Options


  • Quality encoding: Illumina quality scores, Sanger quality scores, or Automatic (detects the quality encoding on its own).
    • Since 2011, Illumina's CASAVA pipeline (v1.8+) has used Sanger quality encoding, not Illumina.
  • Job number: the number of parallel jobs.
  • Thread number: the number of threads used per parallel job.
  • Zip format: the format used to compress the files.
  • Output name: the name for the newly generated files.
  • Output folder: the location in which to store the output files.
  • Mapping file: see the Input Data Requirements section above for details about mapping.txt.

  • Parse sample barcode file: logical; check this option to parse the sample barcode information into the tag file.
  • Correct cell barcode: logical; check this option to perform cell barcode correction in this step. The default is True.
  • Export sample barcode tag: logical; check this option to export the sample barcode in the tag file. The default is True.
  • Keep empty cell barcode: logical; check this option to keep cell barcodes that exist in the white list but have no data in the fastq file. The default is False.

  • Barcode white list file: a file containing the list of valid cell barcodes, provided by 10X Genomics CellRanger. If CellRanger is installed, this file can be found at cellranger-2.1.0/cellranger-cs/2.1.0/tenkit/lib/python/tenkit/barcodes/737K-april-2014_rc.txt. The file can also be downloaded from our server (WhiteList for V1) or from GitHub (Github link).
  • Barcode confidence threshold[0-1]: the confidence cutoff (applied to the maximum posterior probability) used in the barcode correction process. The default is 0.975.
  • Top rich cell count: if the number of cells expected from this sample is known, it can be specified here. The module will rank the cells by read count and keep the top N cells for downstream analysis. Because cells are ranked by read count rather than UMI count, we suggest using a more generous number than the expected number. For instance, for the PBMC 4K dataset, 4300 or 4500 can be used here.
  • Minimal cell read count: a cutoff for filtering out low quality cells. Cells with fewer reads than the number specified here are considered poor quality and are discarded during preprocessing.
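To make the barcode confidence threshold more concrete, here is a minimal Python sketch of a posterior-based correction in the general style of CellRanger. It is an illustration under stated assumptions, not the code used by this module: the function name `correct_barcode`, the use of observed white-list barcode counts as the prior, the pseudocount of 1, and the restriction to single-base substitutions are all assumptions.

```python
def correct_barcode(barcode, quals, whitelist_counts, threshold=0.975):
    """Return a white-list barcode for `barcode`, or None if it cannot be rescued.

    barcode:          observed cell barcode sequence.
    quals:            Phred quality score (int) for each base of the barcode.
    whitelist_counts: dict mapping every white-list barcode to the number of
                      reads observed with exactly that barcode (assumed prior).
    threshold:        minimal posterior probability needed to accept a correction.
    """
    if barcode in whitelist_counts:
        return barcode  # already valid, nothing to correct

    total = sum(whitelist_counts.values())
    n_barcodes = len(whitelist_counts)
    likelihoods = {}
    for i, (base, q) in enumerate(zip(barcode, quals)):
        p_error = 10 ** (-q / 10)  # probability that this base call is wrong
        for alt in "ACGT":
            if alt == base:
                continue
            candidate = barcode[:i] + alt + barcode[i + 1:]
            if candidate in whitelist_counts:
                # Prior with a pseudocount so unseen barcodes are not impossible.
                prior = (whitelist_counts[candidate] + 1) / (total + n_barcodes)
                likelihoods[candidate] = prior * p_error

    if not likelihoods:
        return None
    norm = sum(likelihoods.values())
    best, score = max(likelihoods.items(), key=lambda kv: kv[1])
    # Accept only if the best candidate dominates the posterior.
    return best if score / norm >= threshold else None


fixed = correct_barcode("AAAT", [30, 30, 30, 10], {"AAAA": 1000, "CCCC": 10})
ambiguous = correct_barcode("AACA", [30, 30, 30, 30], {"AAAA": 100, "AACC": 100})
```

In the second call, two white-list barcodes are equally likely sources of the observed sequence, so neither reaches the threshold and the read would be left uncorrected.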

Advanced Options


  • CellBarcode:
    • Filter by barcode quality: whether to filter barcodes by quality. The default is true.
    • Minimal barcode quality: the lower bound of quality used to filter barcodes.
    • Maximal nucleotides in low quality: how many low quality (<10) nucleotides are allowed. The default is 0.
    • Minimal read count for knee plot: controls whether a cell appears in the knee plot. Cells with fewer reads than the number specified here will not show up in the knee plot. The default value is 500. With this option, the knee plot focuses better on higher quality cells.
  • UMI:
    • Filter by UMI quality: whether to filter UMIs by quality. The default is true.
    • Minimal UMI quality: the lower bound of quality used to filter UMIs.
    • Maximal nucleotides in low quality: how many low quality (<10) nucleotides are allowed. The default is 0.
  • Export Buffer Size: the number of reads each thread processes at a time at runtime. The default is 40,000. Larger values might impact performance in cloud jobs.
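The barcode and UMI quality filters above can be pictured with a short Python sketch. It assumes Sanger (Phred+33) encoded quality strings and a reading of the options in which a base counts as low quality when its score falls below the minimal quality. The function name `passes_quality` and its parameter names are hypothetical.

```python
def passes_quality(qual_string, min_quality=10, max_low_quality=0, offset=33):
    """Return True if a barcode or UMI quality string passes the filter.

    qual_string:     FASTQ quality characters for the barcode or UMI.
    min_quality:     bases scoring below this are counted as low quality.
    max_low_quality: how many low quality bases are tolerated.
    offset:          33 for Sanger encoding (assumed), 64 for old Illumina.
    """
    low = sum(1 for ch in qual_string if ord(ch) - offset < min_quality)
    return low <= max_low_quality


# 'I' encodes Q40 and '#' encodes Q2 under Phred+33.
strict = passes_quality("II#I")
lenient = passes_quality("II#I", max_low_quality=1)
```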


Because the results of preprocessing 10X V1 and 10X V2 data are the same, the results shown here are taken from 10X V2 data, for demonstration only.

Result in GUI

In the ArrayStudio GUI, two table objects are generated under the 10X Preprocess QC folder:

  • A preprocess report showing the total read count, kept read count, skipped read count, and kept read rate:


  • A knee plot to show the read count distribution across all cells, ranked by total read count number from left to right, and colored by "kepted" or "skipped":


  • There will also be a table associated with the knee plot view:


Result in output folder

To reduce I/O load and improve write efficiency, this module splits the large fastq file into chunks of smaller fastq files, each containing around 5 million reads (about 300 MB).
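The chunking idea can be sketched in a few lines of Python. This is a generic illustration, not the module's code: the generator name `chunked` is hypothetical, and a real writer would stream each chunk to its own compressed file rather than hold it in a list.

```python
from itertools import islice


def chunked(records, chunk_size=5_000_000):
    """Yield successive lists of at most chunk_size records."""
    it = iter(records)
    while True:
        chunk = list(islice(it, chunk_size))
        if not chunk:
            return
        yield chunk


# Seven toy records split into chunks of three.
chunks = list(chunked(range(7), chunk_size=3))
```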

Mapping file

The *ChunkFiles.mapping.txt file is a mapping file with two columns: the first column lists all fastq files with their full paths, and the second column shows the sample name.


Fastq Reads

The *prepReads_chunkN.fastq.gz files contain the read sequences and qualities, as in standard fastq files.

Barcode and UMI (Tag file)

The *_prepReads_chunkN.tag.gz files contain the extracted barcode and UMI information for each read in the corresponding fastq file.


