Preprocessing of Single-Cell RNA-Seq Data

From Array Suite Wiki

General

Compared with a standard fastq file, a single-cell RNA-Seq fastq file carries extra information: the cell barcode and the Unique Molecular Identifier (UMI). Preprocessing extracts this information from the fastq files and stores it in a tag file. Single-cell fastq files from different sources carry this information in different formats. Based on these formats, we divide the fastq files into four types and explain how to extract the barcode and UMI information from each type.

Case 1: Parse Header

The header of each read in the fastq file contains the cell barcode and UMI, for instance:

SC preprocess1.png

Oscript used for this case:

Begin SCLandTools /Namespace=NgsLib;
Files "G:\test\20170414_SCTest\PreprocessFastq\parseHeader\SRX425372_sub.fastq.gz";
Options 
/Action=PreprocessFastq   
/Demultiplex=False 
/IsMultiplexed = True 
/ParseHeader= True
/HeaderReg= "^@(?<readname>\S+) (?<barcode>\S+)-(?<umi>\w{4})" 
/HasUmi = True
/FilterBarcodeQuality=False
/ValidBarcodeFile="G:\test\20170414_SCTest\PreprocessFastq\parseHeader\SRX425372_Barcode.txt" 
/OutputFolder="G:\test\20170414_SCTest\PreprocessFastq\parseHeader\output"; 
End;
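
To make the /HeaderReg pattern more concrete, here is a minimal Python sketch of what its named groups would capture. It is not part of SCLandTools. The header line is invented for illustration (the real header appears only in the screenshot above), and the pattern is transliterated because Python writes named groups as (?P<name>...) whereas the Oscript option uses the (?<name>...) form.

```python
import re

# Same pattern as /HeaderReg, with (?<name>...) rewritten as (?P<name>...)
# because Python's re module requires the P.
HEADER_REG = re.compile(r"^@(?P<readname>\S+) (?P<barcode>\S+)-(?P<umi>\w{4})")


def parse_header(header_line):
    """Return (readname, barcode, umi), or None if the header does not match."""
    match = HEADER_REG.match(header_line)
    if match is None:
        return None
    return match.group("readname"), match.group("barcode"), match.group("umi")


# Hypothetical header line, made up for this example.
print(parse_header("@READ.1 ACGTAC-TTGA"))
```

In this sketch the text after the space is split at the hyphen: everything before it is taken as the barcode, and the four word characters after it as the UMI.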

Case 2: Single-end-derived

The barcode and UMI information is contained in the read sequence rather than in the read header.

For instance, this fastq read contains only UMI information in the read sequence:

SC preprocess2.png

Oscript used for this case (set /FilterBarcodeQuality=False, as there is no barcode information):

Begin SCLandTools /Namespace=NgsLib;
Files "G:\test\20170414_SCTest\PreprocessFastq\SE\SRX387272_sub.fastq.gz";
Options 
/Action=PreprocessFastq   
/Demultiplex=False 
/PairedEnd = False 
/ParseSeq=True
/IsMultiplexed = False 
/SeqReg="^(?<umi>\w{5})(?<polyG>G{1,9})" 
/HasUmi = True
/FilterBarcodeQuality=False 
/FilterUmiQuality = False 	
/OutputFolder="G:\test\20170414_SCTest\PreprocessFastq\SE\SRX387272_output"; 
End;
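
The /SeqReg pattern above can be tried outside the tool with a short Python sketch. The read sequence is invented, the named groups are rewritten in Python's (?P<name>...) spelling, and returning the bases after the match as the "remaining read" is an assumption made for illustration; this page does not state how SCLandTools treats the matched prefix.

```python
import re

# Same pattern as /SeqReg: a 5-base UMI followed by 1 to 9 G bases.
SEQ_REG = re.compile(r"^(?P<umi>\w{5})(?P<polyG>G{1,9})")


def split_umi(sequence):
    """Return (umi, remainder), or None if the read lacks the UMI + G prefix."""
    match = SEQ_REG.match(sequence)
    if match is None:
        return None
    # Assumption: everything after the matched prefix is the usable read.
    return match.group("umi"), sequence[match.end():]


# Hypothetical read: UMI "ACGTA", three G bases, then the transcript bases.
print(split_umi("ACGTAGGGTTCAGCAT"))
```

A read whose sixth base is not G does not match at all, so `split_umi` returns None for it.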

Case 3: Single-end read paired with Barcode/UMI file

In this case, the fastq files look like paired-end read files, but one read of each pair contains the cell barcode and UMI, and the other contains the actual read sequence. For instance, here is what this kind of file looks like:

SC preprocess3.png

Oscript used for this case:

Begin SCLandTools /Namespace=NgsLib;
Files "
G:\test\20170414_SCTest\PreprocessFastq\PE\SRX907220_sub_1.fastq.gz
G:\test\20170414_SCTest\PreprocessFastq\PE\SRX907220_sub_2.fastq.gz
";
Options 
/Action=PreprocessFastq   
/Demultiplex=False
/PairedEnd = True 
/IsMultiplexed = True 
/SeqReg1="^(?<barcode>\w{12})(?<umi>\w{8})" 
/SeqReads=2
/HasUmi=True
/ValidBarcodeFile="G:\test\20170414_SCTest\PreprocessFastq\PE\GSM1626794_barcode.txt"
/FilterBarcodeQuality=False 
/FilterUmiQuality = False 
/OutputFolder="G:\test\20170414_SCTest\PreprocessFastq\PE\SRX907220_output_demultiplex"; 
End;

Reads from 10X_V2 data have this format. An option has been added to ArrayStudio to preprocess these reads:

10X V2.png

Note: For 10X_V2 datasets stored in S3 cloud locations, ArrayStudio cannot read into the file, so users cannot select the RegEx pattern as shown above. These files can still be preprocessed by pasting the pattern ^(?<barcode>\w{16})(?<umi>\w{10}) into the R1 pattern field.
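
As a rough check of what the 10X_V2 pattern captures, the following Python sketch applies it to a made-up 26-base R1 sequence. The sequence is hypothetical, and the pattern is written with Python's (?P<name>...) named-group spelling instead of the (?<name>...) form pasted into ArrayStudio.

```python
import re

# 10X_V2 R1 layout according to the pattern: 16-base barcode, then 10-base UMI.
R1_REG = re.compile(r"^(?P<barcode>\w{16})(?P<umi>\w{10})")


def split_r1(r1_sequence):
    """Return (barcode, umi), or None if R1 is shorter than 26 bases."""
    match = R1_REG.match(r1_sequence)
    if match is None:
        return None
    return match.group("barcode"), match.group("umi")


# Hypothetical R1 read, invented for this example.
r1 = "AAACCTGAGAAACCAT" + "GTCTACGGTA"
print(split_r1(r1))
```

Because both groups have fixed lengths, an R1 read with fewer than 26 bases yields no match in this sketch.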

Case 4: Paired-end read

In this case, the reads are true paired-end reads, and the barcode and UMI information exists in only one read of each pair, as in this file:

SC preprocess4.png

Oscript used for this case:

Begin SCLandTools /Namespace=NgsLib;
Files "
G:\test\20170414_SCTest\PreprocessFastq\TruePE\SRX731093_sub_1.fastq.gz
G:\test\20170414_SCTest\PreprocessFastq\TruePE\SRX731093_sub_2.fastq.gz
";
Options 
/Action=PreprocessFastq   
/Demultiplex = False 
/PairedEnd = True 
/IsMultiplexed = True 
/SeqReg1="^(?<barcode>\w{8})(?<umi>\w{4})T{1,25}" 
/SeqReads=0
/HasUmi = True 
/FilterBarcodeQuality=False 
/FilterUmiQuality = False  
/OutputFolder="G:\test\20170414_SCTest\PreprocessFastq\TruePE\SRX731093_output_demultiplex"; 
End;
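
Before running the full job, it can help to test a /SeqReg1 pattern on a handful of reads and see how many of them match. The Python sketch below does this for the pattern above. The reads are invented, the helper name `match_rate` is hypothetical, and the named groups use Python's (?P<name>...) spelling. How SCLandTools itself handles non-matching reads is not covered here.

```python
import re

# Same pattern as /SeqReg1: 8-base barcode, 4-base UMI, then 1 to 25 T bases.
SEQ_REG1 = re.compile(r"^(?P<barcode>\w{8})(?P<umi>\w{4})T{1,25}")


def match_rate(sequences):
    """Fraction of sequences from which a barcode and UMI can be extracted."""
    if not sequences:
        return 0.0
    hits = sum(1 for seq in sequences if SEQ_REG1.match(seq))
    return hits / len(sequences)


# Hypothetical reads, made up for this example.
reads = [
    "ACGTACGTTTGATTTTTTTTGCA",  # barcode + UMI + poly-T run: matches
    "ACGTACGTTTGAGCATTTCCA",    # base after the UMI is G, not T: no match
]
print(match_rate(reads))
```

A low match rate on a sample of real reads would suggest that the barcode or UMI lengths in the pattern do not fit the library layout.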

Output

Normally, the output folder contains five output files for each raw fastq file, for instance:

SC preprocess5.png

Fastq Reads

The *filterReads.fastq.gz file contains the read sequences and qualities, as in a normal fastq file:

SC preprocess6.png

Barcode and UMI (Tag file)

The *filterReads.tag.gz file contains the extracted barcode and UMI information for each read in the corresponding fastq file:

SC preprocess7.png

Filtered reads

Sequences are automatically trimmed by quality. To skip this step, specify the following option in the preprocessing Oscript:

Trimming  /Mode=Composite /LeftTrimming=0 /RightTrimming=0 /ReadTrimQuality=0 /ReadTrimSize=65536;

The *skipreads.fastq.gz file contains the read sequences and qualities of the filtered-out reads:

SC preprocess8.png

Filtered reads Barcode/UMI

The *skipreads.tag.gz file contains the barcode and UMI information for the skipped reads:

SC preprocess9.png
