Ngs 10X V1 Preprocess.pdf
From Array Suite Wiki
10X SingleCell fastq file Preprocess
For 10X V2 fastq files:
- I1 file: Sample index read (optional)
- R1 file: Read 1 sequence for CellBarcode and UMI (16-nt CellBarcode + 10-nt UMI + TR; the TR portion is discarded, is not used for alignment, and might not be included)
- R2 file: Read 2 sequence which can be aligned to genome
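The R1 layout listed above can be illustrated with a minimal Python sketch. It assumes only what the list states: the cell barcode is the first 16 bases and the UMI is the next 10, with anything after that discarded. The function name `split_r1` and the two constants are hypothetical and are not part of ArrayStudio.

```python
CELL_BARCODE_LEN = 16  # 16-nt cell barcode at the start of R1
UMI_LEN = 10           # 10-nt UMI that follows the cell barcode


def split_r1(sequence, quality):
    """Split one R1 read into (barcode, barcode_qual, umi, umi_qual).

    Bases after the UMI (the trailing portion, if present) are dropped.
    """
    needed = CELL_BARCODE_LEN + UMI_LEN
    if len(sequence) < needed or len(quality) < needed:
        raise ValueError("R1 read is shorter than cell barcode + UMI")
    barcode = sequence[:CELL_BARCODE_LEN]
    barcode_qual = quality[:CELL_BARCODE_LEN]
    umi = sequence[CELL_BARCODE_LEN:needed]
    umi_qual = quality[CELL_BARCODE_LEN:needed]
    return barcode, barcode_qual, umi, umi_qual


# Made-up read: 16-nt barcode, 10-nt UMI, 4 trailing bases.
seq = "AAACCTGAGAAACCAT" + "GGTTCAAGTC" + "TTTT"
qual = "I" * len(seq)
barcode, barcode_qual, umi, umi_qual = split_r1(seq, qual)
print(barcode, umi)
```

The quality string is sliced with the same boundaries as the sequence so that each extracted tag keeps its own per-base qualities.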
Unlike other platforms, the 10X Genomics platform normally generates three fastq files for each sample. The Unique Molecular Identifier (UMI) and Cell Barcode sequences are distributed across these files as follows:
- I1 file: contains the Cell Barcode sequence
- I2 file: contains the Sample Barcode sequence
- RA file: contains the UMI sequence and the actual sequence that can be aligned to the genome
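As a rough illustration of how the per-read barcodes in the I1 and I2 files could be brought together, here is a minimal Python sketch. It assumes that both files list reads in the same order and that the first whitespace-delimited token of each header is the shared read name; these are common FASTQ conventions, not something this page confirms. The helper names `read_fastq` and `pair_barcodes` are hypothetical, and the RA file is left out because its internal layout is not described here.

```python
import gzip
from itertools import zip_longest


def read_fastq(path):
    """Yield (name, sequence, quality) from a plain or gzipped FASTQ file."""
    opener = gzip.open if path.endswith(".gz") else open
    with opener(path, "rt") as handle:
        while True:
            header = handle.readline().rstrip("\n")
            if not header:
                return  # end of file
            sequence = handle.readline().rstrip("\n")
            handle.readline()  # '+' separator line, ignored
            quality = handle.readline().rstrip("\n")
            # Drop the leading '@' and keep the first token as the read name.
            yield header[1:].split()[0], sequence, quality


def pair_barcodes(i1_path, i2_path):
    """Yield (read_name, cell_barcode, sample_barcode), one tuple per read."""
    for rec1, rec2 in zip_longest(read_fastq(i1_path), read_fastq(i2_path)):
        if rec1 is None or rec2 is None:
            raise ValueError("I1 and I2 files have different numbers of reads")
        if rec1[0] != rec2[0]:
            raise ValueError(f"read names differ: {rec1[0]} vs {rec2[0]}")
        yield rec1[0], rec1[1], rec2[1]
```

Checking the read names on every pair is a deliberate choice: if the files are out of sync, the sketch fails immediately instead of silently assigning barcodes to the wrong reads.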
This 10XPreprocess module lets the user extract the Cell Barcode, Sample Barcode, and UMI sequences from these files and store them in a tag file, while the actual read sequence that can be aligned to the genome is written to a fastq file, as this figure demonstrates:
This module can be accessed by going to NGS | SingleCell RNA-Seq | 10X Preprocessing:
Input Data Requirements
This module requires FASTQ/FASTQ.GZ files as input (3 fastq files per sample), and a mapping file that shows which group each fastq file belongs to.
For instance, if I have 6 fastq files like these as input files:
I can use a mapping.txt file like this to group these fastq files:
I can also use the mapping file to combine the lane-001 and lane-002 reads into the same sample, like this:
3. Users should not provide full paths to the files, only the names of the files as they appear in the experiment.
- Quality encoding: Illumina quality scores, Sanger quality scores, or Automatic (the module detects the quality encoding on its own).
- Since 2011, Illumina's CASAVA pipeline (v1.8+) has used Sanger quality encoding, not Illumina.
- Job number: The number of jobs to run in parallel.
- Zip format: Select which format is used to compress the files.
- Output name: The name to use for the newly generated files.
- Output folder: The location in which to store the output files.
- Mapping file: Specifies the group for each fastq file, so that ArrayStudio knows which file contains the cell barcode, which file contains the sample barcode, and which file contains the UMI and actual read sequence.
- Parse sample barcode file: Specifies whether to parse the information in the sample barcode file. The default option is Yes.
- Export sample barcode tag: Specifies whether to export the sample barcode information into the tag file in this step. The default option is Yes. If the user leaves it unchecked, the generated tag file will not contain the columns for sample barcode sequence and quality.
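For the Quality encoding option above, the difference between the Sanger and Illumina choices is the ASCII offset used to store scores (33 versus 64). The Python sketch below shows one common heuristic for guessing that offset from the quality strings. It illustrates the general idea only and is not a description of how ArrayStudio's Automatic setting works; the function names and the thresholds (59 and 74) are assumptions of this sketch.

```python
def guess_quality_offset(quality_strings):
    """Guess the Phred ASCII offset (33 or 64) from FASTQ quality strings.

    Returns 33 if any character falls below ASCII 59 (';'), which offset-64
    encodings do not use; returns 64 if any character exceeds ASCII 74 ('J'),
    which is above the usual offset-33 range; returns None if ambiguous.
    """
    lowest, highest = 255, 0
    for quality in quality_strings:
        for char in quality:
            code = ord(char)
            lowest = min(lowest, code)
            highest = max(highest, code)
    if lowest < 59:
        return 33
    if highest > 74:
        return 64
    return None  # not enough evidence (or no data) to decide


def decode_quality(quality, offset):
    """Convert a quality string to a list of integer Phred scores."""
    return [ord(char) - offset for char in quality]


print(guess_quality_offset(["IIII#"]))  # '#' is ASCII 35, so offset 33
print(decode_quality("II#", 33))
```

Returning `None` for the ambiguous case, instead of picking a default, leaves the decision to the caller when the sampled reads contain only characters that are valid under both offsets.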
Normally, the output folder will contain the resulting files for each raw fastq file (one fastq file, one tag file, and one report file), for instance:
The resulting fastq.gz file contains the read sequence and quality, as in a normal fastq file:
Barcode and UMI (Tag file)
The resulting tag.gz file contains the extracted cell barcode, sample barcode, and UMI information for each read in the corresponding fastq file: