How to use multiple sequence files for one sample?

From Array Suite Wiki

General

In some cases, a user may have multiple files from multiple lanes for one NGS sample. These can be merged into a single file before any NGS analysis. However, merging is not ideal: deleting the original files is risky, so most users end up keeping two copies of the data.

Grouping File

Omicsoft provides a way to supply multiple files for one sample: the GroupingFile parameter. It should work with any NGS module that takes raw read files as input. A grouping file has two columns, giving the sample file and the sample name (see the grouping file example for details). Based on the grouping file, read 1 and read 2 files are matched by file name and passed to the analysis sequentially. Outputs are named after the sample, for example "SampleName.bam" for the output file and "SampleName" in OmicSoft reports.
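
As a minimal sketch of the two-column layout described above, the following Python writes a tab-delimited grouping file from a mapping of sample names to file paths. The function name write_grouping_file and the paths are hypothetical and are not part of Omicsoft; only the column order (file path, then sample name) comes from this article.

```python
import csv
import io


def write_grouping_file(samples, handle):
    """Write one 'file path<TAB>sample name' row per input file.

    `samples` maps a sample name to an ordered list of its read files.
    """
    writer = csv.writer(handle, delimiter="\t", lineterminator="\n")
    for sample_name, paths in samples.items():
        for path in paths:
            writer.writerow([path, sample_name])


# Hypothetical paths, for illustration only.
samples = {
    "TestDataA": [
        "/filepath/MyData_1.fastq.gz",
        "/filepath/MyData_2.fastq.gz",
    ],
    "TestDataB": ["/filepath/SRR065521.1.fastq.gz"],
}

buffer = io.StringIO()
write_grouping_file(samples, buffer)
grouping_text = buffer.getvalue()
print(grouping_text, end="")
```

Writing to a file handle rather than a fixed path keeps the sketch easy to test; in practice the handle would come from open(..., "w", newline="").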

Rename NGS output with Sample ID

GroupingFile has a useful side effect: it lets you rename BAM files. Previously, BAM file names were determined by the input file names and a simple pairing rule. With GroupingFile, outputs are named using the SampleName column (the second column). GroupingFile is therefore also useful when a sample has only a single file (or a single pair of files for paired-end data); see the TestDataB sample in the grouping file example for details.

How to do this in sample registration

In sample registration, multiple files can be specified in the FilePath column, separated by |. In the sample registration example below, two pairs of FASTQ1 and FASTQ2 files are specified for TestDataA and one pair is specified for TestDataB.

[Samples]			
SampleID	FilePath	Species
TestDataA	FASTQ1=/filepath1/TestA1_1.fastq.gz|FASTQ2=/filepath1/TestA1_2.fastq.gz|FASTQ1=/filepath1/TestA2_1.fastq.gz|FASTQ2=/filepath1/TestA2_2.fastq.gz	Human
TestDataB	FASTQ1=/filepath2/TestB_1.fastq.gz|FASTQ2=/filepath2/TestB_2.fastq.gz	Human
...
[SampleSet]			
ID=TestGroupData			
Title=TestGrouping			
Description=test
...
[Pipeline]								
ScriptID=xxxx.pscript
... 

Once the sample set is created (during sample registration or later), the user can run a pipeline on it. If any sample contains multiple sequence files, a grouping file is generated automatically (usually saved in /Users/userid/PScriptLog with the name GroupingFile_xxxxxx.txt). The program then inserts /GroupingFile="/Users/userid/PScriptLog/GroupingFile_xxxxxx.txt" into the Options sections of the final pipeline script.
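
The relationship between a FilePath entry and the rows of a grouping file can be sketched in Python as follows. This is an illustration only, not the code Omicsoft runs: the helper names are hypothetical, and it assumes that only FASTQ1/FASTQ2 entries become grouping rows while other keys (such as BAM) are skipped.

```python
def parse_file_path(file_path):
    """Split 'KEY=path|KEY=path|...' into a list of (key, path) tuples."""
    entries = []
    for item in file_path.split("|"):
        key, _, path = item.partition("=")
        entries.append((key.strip(), path.strip()))
    return entries


def grouping_rows(sample_id, file_path):
    """Return (path, sample_id) rows for the read files of one sample.

    Assumption: only FASTQ1/FASTQ2 entries are read inputs.
    """
    return [
        (path, sample_id)
        for key, path in parse_file_path(file_path)
        if key in ("FASTQ1", "FASTQ2")
    ]


# FilePath value of TestDataA from the registration example above.
file_path = (
    "FASTQ1=/filepath1/TestA1_1.fastq.gz|FASTQ2=/filepath1/TestA1_2.fastq.gz|"
    "FASTQ1=/filepath1/TestA2_1.fastq.gz|FASTQ2=/filepath1/TestA2_2.fastq.gz"
)
rows = grouping_rows("TestDataA", file_path)
for path, sample_id in rows:
    print(f"{path}\t{sample_id}")
```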

NOTE: Since a change made on 06/27/2013, outputs of analyses run through sample registration are named using the SampleID column, even if no sample in the sample set contains multiple sequence files. Take this into account when choosing SampleIDs and when pre-registering the output BAM files.

Here is an example that pre-registers the output BAM files:

[Samples]			
SampleID	FilePath	Species
TestDataA	FASTQ1=/F1/TestA1_1.fastq.gz|FASTQ2=/F1/TestA1_2.fastq.gz|FASTQ1=/F1/TestA2_1.fastq.gz|FASTQ2=/F1/TestA2_2.fastq.gz|BAM=/output/TestDataA.bam	Human
TestDataB	FASTQ1=/filepath2/TestB_1.fastq.gz|FASTQ2=/filepath2/TestB_2.fastq.gz|BAM=/output/TestDataB.bam	Human

Please read the Sample registration articles if you are not familiar with the sample registration process.

How to do this from GUI

In the GUI, a grouping file can be specified through the "Add List" link.

[Figure: GroupFile.png]

Grouping file example (tab delimited):

/filepath/MyData_1.fastq.gz	TestDataA
/filepath/MyData_2.fastq.gz	TestDataA
/filepath/MyTest_1.fastq.gz	TestDataA
/filepath/MyTest_2.fastq.gz	TestDataA
/filepath/SRR065521.1.fastq.gz	TestDataB
/filepath/SRR065521.2.fastq.gz	TestDataB

Note:

  • When a list is added from a file containing only one column (a list of file paths), the GroupingFile option is not enabled. The file simply helps users load a list of files into the GUI, and output names still use the input file names.
  • When a list is added from a file containing two columns (a list of file paths and a list of sample IDs), the GroupingFile option is enabled. The file both loads the list of files into the GUI and generates the GroupingFile option for the analysis. Output names use the sample IDs (the second column).
  • In NGS raw data QC, the grouping file is not used because a hidden parameter, /UseFileGrouping=False, is specified by default. Each raw data file is scanned for QC purposes.
  • Sample registration automatically appends /GroupingFile="/xxx" to the end of each pipeline script block. Setting /DisableAutoGroupingFile=True disables this.
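
The one-column versus two-column rule in the notes above can be sketched in Python. This only illustrates the rule as described here and is not Omicsoft's implementation; read_list_file is a hypothetical helper, and rejecting files with mixed or other column counts is an assumption.

```python
def read_list_file(text):
    """Parse tab-delimited list text.

    Returns (grouping_enabled, files). With one column, grouping is off and
    files is a flat list of paths. With two columns, grouping is on and
    files maps each sample ID to its paths in input order.
    """
    rows = [line.split("\t") for line in text.splitlines() if line.strip()]
    widths = {len(row) for row in rows}
    if widths == {1}:
        return False, [row[0] for row in rows]
    if widths == {2}:
        groups = {}
        for path, sample_id in rows:
            groups.setdefault(sample_id, []).append(path)
        return True, groups
    raise ValueError("expected one or two tab-separated columns on every line")


# The grouping file example from this article.
two_column = (
    "/filepath/MyData_1.fastq.gz\tTestDataA\n"
    "/filepath/MyData_2.fastq.gz\tTestDataA\n"
    "/filepath/MyTest_1.fastq.gz\tTestDataA\n"
    "/filepath/MyTest_2.fastq.gz\tTestDataA\n"
    "/filepath/SRR065521.1.fastq.gz\tTestDataB\n"
    "/filepath/SRR065521.2.fastq.gz\tTestDataB\n"
)
enabled, groups = read_list_file(two_column)
for sample_id, paths in groups.items():
    print(sample_id, len(paths))
```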