Ngs 10XPreprocess.pdf

From Array Suite Wiki


10X SingleCell fastq file Preprocess

Overview

This module is designed to preprocess single-cell fastq files generated by the 10X Genomics platform. Unlike other platforms, the 10X Genomics platform normally produces three fastq files for each sample, with the Cell Barcode, Sample Barcode, and Unique Molecular Identifier (UMI) sequences distributed across these files as follows:

  • I1 file: contains the Cell Barcode sequence
  • I2 file: contains the Sample Barcode sequence
  • RA file: contains the UMI sequence and the actual sequence that can be aligned to the genome

The 10XPreprocess module extracts the Cell Barcode, Sample Barcode, and UMI sequences from these files and stores them in a tag file. The actual read sequence, which can be aligned to the genome, is stored in a fastq file, as this figure demonstrates:

SC 10XpreprocessGUI1.png
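
The split described above can be sketched in a few lines of Python. This is only an illustration of the idea, not ArrayStudio's implementation. The function names, the tag field names, and the `umi_len` parameter are hypothetical, and the sketch assumes that the UMI occupies the first `umi_len` bases of each RA record, which may not match the real file layout.

```python
from typing import Iterable, Iterator, NamedTuple


class FastqRecord(NamedTuple):
    name: str
    seq: str
    qual: str


def parse_fastq(lines: Iterable[str]) -> Iterator[FastqRecord]:
    """Yield records from an iterable of FASTQ lines (4 lines per record)."""
    it = iter(lines)
    for header in it:
        if not header.strip():
            continue  # tolerate blank lines between or after records
        seq = next(it, "").strip()
        next(it, "")  # skip the '+' separator line
        qual = next(it, "").strip()
        yield FastqRecord(header.strip()[1:], seq, qual)  # drop leading '@'


def split_reads(i1, i2, ra, umi_len):
    """Walk the I1, I2 and RA records in lockstep.

    Returns (tags, reads): one tag row per read holding the barcodes and UMI,
    and the RA records with the assumed UMI prefix removed.
    """
    tags, reads = [], []
    for cell, sample, read in zip(parse_fastq(i1), parse_fastq(i2), parse_fastq(ra)):
        tags.append({
            "read": read.name,
            "cell_barcode": cell.seq,
            "sample_barcode": sample.seq,
            "umi": read.seq[:umi_len],  # ASSUMPTION: UMI is a fixed-length prefix
        })
        reads.append(FastqRecord(read.name, read.seq[umi_len:], read.qual[umi_len:]))
    return tags, reads
```

Each argument can be any iterable of text lines, for example a file opened with `gzip.open(path, "rt")`.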

This module can be accessed by going to NGS | SingleCell RNA-Seq | 10X Preprocessing:

SC 10XpreprocessGUI2.png

Input Data Requirements

This module requires FASTQ/FASTQ.GZ files as input (3 fastq files per sample), plus a mapping file that shows which group each fastq file belongs to.

For instance, if I have 6 fastq files like these as input:

SC 10XpreprocessGUI3.png

I can have a mapping.txt file like this to group these fastq files:

SC 10XpreprocessGUI4.png

I can also use the mapping file to combine the lane-001 and lane-002 reads into the same sample, like this:

SC 10XpreprocessGUI5.png

WARNING:

1. The mapping file must contain these 3 column names: CellBCFileName, CellSeqFileName, SampleBCFileName.

2. If multiple fastq files are grouped into one sample, the order of these fastq files must be consistent across the columns.
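
As a rough pre-flight check, the two rules above can be tested before a job is submitted. The sketch below is not part of ArrayStudio. The function name is hypothetical, and it assumes a tab-delimited mapping file in which several files for one sample share a cell separated by `file_sep`. The real layout appears only in the screenshots, so adjust both separators to match your file. It can confirm that each column lists the same number of files, but it cannot verify that the files are listed in matching order.

```python
import csv
import io

REQUIRED_COLUMNS = ("CellBCFileName", "CellSeqFileName", "SampleBCFileName")


def check_mapping(text, delimiter="\t", file_sep=","):
    """Return a list of problems found in mapping-file text (empty if none).

    ASSUMPTIONS: `delimiter` separates columns and `file_sep` separates
    multiple file names inside one cell; both are illustrative guesses.
    """
    reader = csv.DictReader(io.StringIO(text), delimiter=delimiter)
    header = reader.fieldnames or []
    missing = [c for c in REQUIRED_COLUMNS if c not in header]
    if missing:
        return ["missing column(s): " + ", ".join(missing)]

    problems = []
    for line_number, row in enumerate(reader, start=2):  # line 1 is the header
        counts = {
            c: len([f for f in (row[c] or "").split(file_sep) if f.strip()])
            for c in REQUIRED_COLUMNS
        }
        # Every column must list the same number of files for this sample.
        if len(set(counts.values())) != 1:
            problems.append(f"line {line_number}: unequal file counts {counts}")
    return problems
```

For example, `check_mapping(open("mapping.txt").read())` returns an empty list when the header is complete and every row lists the same number of files in each required column.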


General Options

SC 10XpreprocessGUI6.png


Options

  • Quality encoding: Illumina quality scores, Sanger quality scores, or Automatic (detects the quality encoding from the data).
    • Since 2011, Illumina's CASAVA pipeline (v1.8+) has used Sanger quality encoding, not Illumina.
  • Job number: Number of jobs to run in parallel.
  • Zip format: The compression format used for the output files.
  • Output name: Name for the newly generated files.
  • Output folder: Location in which to store the output files.
  • Mapping file: Specify the group for each fastq file, so ArrayStudio will know which file contains the cell barcode, which file contains sample barcode, and which file contains UMI and actual read sequence.
    • Parse sample barcode file: Whether to parse the information in the sample barcode file. The default is Yes.
    • Export sample barcode tag: Whether to export the sample barcode information into the tag file in this step. The default is Yes. If this option is left unchecked, the generated tag file will not contain the columns for sample barcode sequence and quality.
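
To make the Quality encoding option more concrete, here is a minimal sketch of one common heuristic for guessing the encoding from the quality strings themselves. It is not ArrayStudio's detection logic, which is not documented here. The function name and return labels are hypothetical. The heuristic relies on the fact that Sanger encoding uses an ASCII offset of 33, while the older Illumina encoding uses an offset of 64.

```python
def guess_quality_encoding(quality_strings):
    """Guess the quality encoding from the lowest quality character seen.

    Returns "sanger", "illumina", "ambiguous", or "unknown" (no data).
    This is a heuristic: a file with only high scores can be misjudged.
    """
    lowest = min((min(q) for q in quality_strings if q), default=None)
    if lowest is None:
        return "unknown"
    code = ord(lowest)
    if code < 59:
        # Characters below ';' (ASCII 59) cannot occur with an offset of 64.
        return "sanger"
    if code >= 64:
        # Consistent with an offset of 64, although Sanger data that contains
        # only scores of 31 or higher would look the same.
        return "illumina"
    # ASCII 59-63 fits Sanger and the old Solexa scale, so do not guess.
    return "ambiguous"
```

Passing the quality lines of the first few thousand reads is usually enough for this kind of check. When the result is ambiguous, choosing the encoding explicitly is the safer option.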

Output

Normally, the output folder will contain the resulting files for each raw fastq file (one fastq file, one tag file, and one report file), for instance:

SC 10XpreprocessGUI7.png

Fastq Reads

The resulting fastq.gz file contains the read sequences and their qualities, in the same layout as a normal fastq file:

SC 10XpreprocessGUI8.png

Barcode and UMI (Tag file)

The resulting tag.gz file contains the extracted cell barcode, sample barcode, and UMI information for each read in the corresponding fastq file:

SC 10XpreprocessGUI9.png

Oscript

Ngs_10XSCPreprocess.oscript
