From Array Suite Wiki
Preprocess of SingleCell RNASeq data from 10X(V2)
SingleCell fastq files generated by the 10X Genomics platform come in two versions. Version 1 has been deprecated, and version 2 is now the more common data format for 10X Single Cell data.
This module is specifically designed to preprocess version 2 SingleCell fastq files.
For V1 data, see this wiki page: Preprocess of SingleCell RNASeq data from 10X(V1)
If the description of the 10X sample includes either of the following:
Chemistry -- Single Cell 3' v2; or Cell Ranger -- Version 2.0.0
then the data are 10X V2 and this module applies.
For 10X V2 fastq file:
- I1 file: Sample index read (optional)
- R1 file: Read 1 sequence for CellBarcode and UMI (16-nt CellBarcode + 10-nt UMI + TR; the TR is discarded, is not used for alignment, and might not be included)
- R2 file: Read 2 sequence which can be aligned to genome
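The R1 layout described above can be illustrated with a short Python sketch. It is not the module's implementation; the names `R1Tag` and `split_r1` are hypothetical, and only the 16-nt and 10-nt lengths come from this page.

```python
from typing import NamedTuple

BARCODE_LEN = 16  # 10X V2 cell barcode length, as stated above
UMI_LEN = 10      # 10X V2 UMI length, as stated above


class R1Tag(NamedTuple):
    """Barcode and UMI (with their quality strings) taken from one R1 read."""
    barcode: str
    barcode_qual: str
    umi: str
    umi_qual: str


def split_r1(seq: str, qual: str) -> R1Tag:
    """Split one R1 read into cell barcode and UMI.

    Any bases after the first 26 nt (the optional TR part) are dropped.
    """
    needed = BARCODE_LEN + UMI_LEN
    if len(seq) < needed or len(qual) < needed:
        raise ValueError(f"R1 read is shorter than {needed} nt")
    return R1Tag(
        barcode=seq[:BARCODE_LEN],
        barcode_qual=qual[:BARCODE_LEN],
        umi=seq[BARCODE_LEN:needed],
        umi_qual=qual[BARCODE_LEN:needed],
    )
```

Reads shorter than 26 nt raise an error here; a real pipeline might instead count and skip them.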
10X Genomics is a popular platform for generating single cell RNASeq data. OmicSoft designed this module specifically for preprocessing 10X V2 data.
This module takes fastq files as input and then performs the following steps:
- Filter the cells based on the total read count
- Correct cell barcodes with logic similar to that applied by CellRanger
- Rank the cells by read count and extract the top N cells
- Filter the reads by cell barcode and UMI quality
- Generate a knee plot to show the read count distribution across all cells
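The barcode correction step can be sketched as follows. This is only an approximation of the general idea (candidate whitelist barcodes one mismatch away, weighted by abundance and base quality, accepted when the best posterior passes a threshold), not the module's or CellRanger's actual code. The function name, the pseudo-count of 1, and the Phred+33 offset are assumptions made for illustration.

```python
from typing import Dict, Optional


def correct_barcode(
    barcode: str,
    qual: str,
    whitelist_counts: Dict[str, int],
    threshold: float = 0.975,
    phred_offset: int = 33,
) -> Optional[str]:
    """Return a whitelist barcode for `barcode`, or None if no confident match.

    `whitelist_counts` maps every whitelist barcode to the number of reads
    that matched it exactly (0 if none were seen).
    """
    if barcode in whitelist_counts:
        return barcode  # already valid, nothing to correct

    weights: Dict[str, float] = {}
    for i, base in enumerate(barcode):
        # Probability that this base call is wrong, from its Phred score.
        p_error = 10 ** (-(ord(qual[i]) - phred_offset) / 10)
        for alt in "ACGT":
            if alt == base:
                continue
            candidate = barcode[:i] + alt + barcode[i + 1:]
            if candidate in whitelist_counts:
                # Prior: observed abundance plus a pseudo-count of 1 (assumption).
                weights[candidate] = (whitelist_counts[candidate] + 1) * p_error

    total = sum(weights.values())
    if total == 0:
        return None  # no whitelist barcode within one mismatch
    best = max(weights, key=weights.get)
    # Accept only if the best candidate's posterior reaches the threshold.
    return best if weights[best] / total >= threshold else None
```

With two equally likely candidates the posterior is 0.5 for each, so the read is left uncorrected; a low-quality base at the mismatching position shifts the posterior toward the candidate that differs there.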
This module can be accessed by going to NGS | SingleCell RNA-Seq | 10X Preprocessing | 10X(V2) Preprocessing:
Input Data Requirements
This module requires FASTQ files as input.
A 10X dataset usually has multiple fastq files for one sample, so we suggest using the Add List option with a mapping file to input these fastq files. The purpose of the mapping file is to group different fastq files into one sample.
An example mapping.txt file:
/FastqFilePath/pbmc4k_S1_L001_R1_001.fastq.gz	PBMC4K
/FastqFilePath/pbmc4k_S1_L001_R2_001.fastq.gz	PBMC4K
/FastqFilePath/pbmc4k_S1_L002_R1_001.fastq.gz	PBMC4K
/FastqFilePath/pbmc4k_S1_L002_R2_001.fastq.gz	PBMC4K
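To check a mapping file before loading it, a small Python sketch such as the one below can group the listed files by sample. It assumes one file path and one sample name per line, separated by whitespace, with the sample name as the last field; `read_mapping` is a hypothetical helper, not part of the software.

```python
from collections import defaultdict
from typing import Dict, Iterable, List


def read_mapping(lines: Iterable[str]) -> Dict[str, List[str]]:
    """Group fastq paths by sample name from mapping-file lines."""
    samples: Dict[str, List[str]] = defaultdict(list)
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        # The sample name is taken as the last whitespace-separated field.
        path, sample = line.rsplit(None, 1)
        samples[sample].append(path)
    return dict(samples)
```

For example, `read_mapping(open("mapping.txt"))` would return one entry per sample, each holding the list of fastq paths assigned to it.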
- Quality encoding: Illumina quality scores, Sanger quality scores, or Automatic (figures out the quality scoring on its own).
- Since 2011, Illumina's CASAVA pipeline (v1.8+) has used Sanger quality encoding, not Illumina.
- Job number: Parallel job number
- Thread number: The number of threads used per parallel job.
- Zip format: Select which format is used in compressing the files.
- Output name: the name to use for the newly generated files.
- Output folder: the location in which to store the output files.
- Minimal cell read count: a cutoff for filtering out low-quality cells. Cells with fewer reads than this number are considered poor quality and are disregarded in this preprocessing step.
- Cell Barcode Correction:
- Correct cell barcode: check this option to perform cell barcode correction in this step; the default is true
- Barcode white list file: a file containing the list of valid cell barcodes, provided by 10X Genomics CellRanger. If CellRanger is installed, this file can be found at cellranger-2.0.0/cellranger-cs/2.0.0/tenkit/lib/python/tenkit/barcodes/737K-august-2016.txt. The file can also be downloaded from our server: WhiteList for V2, or from Github: Github link for white list file
- Barcode confidence threshold[0-1]: the cutoff applied to the maximal posterior probability in the barcode correction process (default is 0.975)
- Top rich cell count: if the expected number of cells in the sample is known, specify it here. The module ranks the cells by read count and keeps the top N cells for downstream analysis. Because cells are ranked by read count rather than UMI count, we suggest a number somewhat larger than the expected cell count. For instance, for the PBMC 4K dataset, 4300 or 4500 can be used here.
- Filter by barcode quality: whether to filter barcode by quality, default is true
- Minimal barcode quality: the lower bound of base quality used when filtering barcodes
- Barcode length: the length of the cell barcode; for 10X V2 data, the CellBarcode is 16 nt long
- Maximal nucleotides in low quality: the number of low-quality nucleotides (quality < 10) allowed in the barcode; the default is 0
- Minimal read count for knee plot: controls whether a cell appears in the knee plot. Cells with fewer reads than the number specified here are not shown in the knee plot. The default value is 500. With this option, the knee plot focuses better on higher-quality cells.
- Filter by UMI quality: whether to filter UMI by quality, default is true
- UMI length: the length of the UMI; for 10X V2 data, the UMI is 10 nt long
- Minimal UMI quality: the lower bound of base quality used when filtering UMIs
- Maximal nucleotides in low quality: the number of low-quality nucleotides (quality < 10) allowed in the UMI; the default is 0
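The barcode and UMI quality filters above amount to counting bases below a quality bound and comparing that count with the allowed maximum. A minimal Python sketch, assuming Sanger (Phred+33) encoded quality strings and using hypothetical function names:

```python
def count_low_quality(qual: str, min_quality: int = 10, phred_offset: int = 33) -> int:
    """Count bases whose Phred quality is below `min_quality`."""
    return sum(1 for ch in qual if ord(ch) - phred_offset < min_quality)


def passes_quality(
    qual: str,
    min_quality: int = 10,
    max_low_quality: int = 0,
    phred_offset: int = 33,
) -> bool:
    """Keep a barcode or UMI if it has at most `max_low_quality` low-quality bases."""
    return count_low_quality(qual, min_quality, phred_offset) <= max_low_quality
```

With the default of 0 allowed low-quality nucleotides, a single base below the bound is enough to discard the read; the same check would be applied separately to the 16-nt barcode qualities and the 10-nt UMI qualities.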
Result in GUI
In the ArrayStudio GUI, two table objects are generated under the 10X Preprocess QC folder:
- A preprocess report showing the total read count, kept read count, skipped read count, and kept read rate:
- A knee plot showing the read count distribution across all cells, ranked by total read count from left to right, and colored by "kept" or "skipped":
- There will also be a table associated with the knee plot view:
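As a rough illustration of what the knee plot table holds, the sketch below ranks cells by read count, labels each as kept or skipped, and hides cells under the plotting cutoff. The kept/skipped rule used here (within the top N and at or above the minimal cell read count) and the name `knee_table` are assumptions for illustration, not a description of the module's internals.

```python
from typing import Dict, List, Tuple


def knee_table(
    read_counts: Dict[str, int],
    top_n: int,
    min_cell_reads: int,
    min_plot_reads: int = 500,
) -> List[Tuple[int, str, int, str]]:
    """Return (rank, barcode, read_count, status) rows for the knee plot."""
    # Rank by read count, highest first; ties are broken by barcode.
    ranked = sorted(read_counts.items(), key=lambda kv: (-kv[1], kv[0]))
    rows = []
    for rank, (barcode, count) in enumerate(ranked, start=1):
        if count < min_plot_reads:
            break  # all remaining cells are below the plotting cutoff
        kept = rank <= top_n and count >= min_cell_reads
        rows.append((rank, barcode, count, "kept" if kept else "skipped"))
    return rows
```

Plotting rank against read count on log-log axes from these rows gives the usual knee shape.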
Result in output folder
For I/O and write efficiency, this module splits the very large fastq input into chunks of smaller fastq files, each containing around 5 million reads (about 300 MB in size).
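The chunking idea can be sketched in a few lines of Python. This is a simplified illustration, assuming standard four-line FASTQ records; the generator name is hypothetical and the module's actual splitting logic may differ.

```python
from itertools import islice
from typing import Iterable, Iterator, List


def chunk_fastq_lines(
    lines: Iterable[str], reads_per_chunk: int = 5_000_000
) -> Iterator[List[str]]:
    """Yield lists of FASTQ lines, each holding up to `reads_per_chunk` reads."""
    it = iter(lines)
    while True:
        # Each FASTQ record occupies four lines.
        chunk = list(islice(it, 4 * reads_per_chunk))
        if not chunk:
            return
        yield chunk
```

Each yielded chunk could then be written to its own compressed file; the last chunk is simply smaller when the read count is not a multiple of the chunk size.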
The *ChunkFiles.mapping.txt file is a mapping file with two columns: the first column lists every fastq file with its full path, and the second column gives the sample name:
In the subsequent Alignment step, use the Add List function to input the chunked fastq files together with their sample names. If there is only one sample, use *ChunkFiles.mapping.txt; if there are multiple samples, use *ChunkFiles.MergedMapping.txt:
The *prepReads_chunkN.fastq.gz files contain the read sequences and qualities in the same format as normal fastq files.
Barcode and UMI (Tag file)
Each *_prepReads_chunkN.tag.gz file contains the extracted barcode and UMI information for every read in the corresponding fastq file.