How to use multiple sequence files for one sample?
From Array Suite Wiki
In some cases, a user may have multiple files from multiple lanes for one NGS sample. These can be merged into a single file before any NGS analysis, but this is not ideal: deleting the original files is risky, so most users end up keeping two copies of the data.
At OmicSoft, we provide a way to input multiple files for one sample through the GroupingFile parameter. The parameter should work for any NGS module that takes raw read files as input. A group file has two columns defining the sample files and the sample name (see the Group File example for details). Based on the group file, read 1 and read 2 files are matched by file name and passed to the analysis sequentially. Output files are named after the sample name, such as "SampleName.bam", and the sample appears as "SampleName" in OmicSoft reports.
Rename NGS output with Sample ID
Using GroupingFile has a useful side effect: BAM files can be given other names. Previously, BAM file names were determined by the input file names and a simple pairing rule. With GroupingFile, outputs are named using the SampleName (second) column. GroupingFile is therefore also useful when a sample has only a single file (or a single pair of files for paired-end data); see the TestDataB sample in the Group File example for details.
How to do this in sample registration
In sample registration, multiple files can be specified in FilePath, separated by |. In the sample registration example below, two pairs of FASTQ1 and FASTQ2 files are specified for TestDataA and one pair is specified for TestDataB.
[Samples]
SampleID FilePath Species
TestDataA FASTQ1=/filepath1/TestA1_1.fastq.gz|FASTQ2=/filepath1/TestA1_2.fastq.gz|FASTQ1=/filepath1/TestA2_1.fastq.gz|FASTQ2=/filepath1/TestA2_2.fastq.gz Human
TestDataB FASTQ1=/filepath2/TestB_1.fastq.gz|FASTQ2=/filepath2/TestB_2.fastq.gz Human
...
[SampleSet]
ID=TestGroupData
Title=TestGrouping
Description=test
...
[Pipeline]
ScriptID=xxxx.pscript
...
Once the sampleset is created (during sample registration or later), the user can run a pipeline on it. If any sample contains multiple sequence files, a grouping file is generated automatically (usually saved in /Users/userid/PScriptLog with the name GroupingFile_xxxxxx.txt). The program then inserts /GroupingFile="/Users/userid/PScriptLog/GroupingFile_xxxxxx.txt" into the Options sections of the final pipeline script.
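The relationship between a FilePath value and the grouping file can be illustrated with a small Python sketch. This is not OmicSoft's implementation: the function names are hypothetical, and it assumes that the generated file uses the same two-column layout (file path, sample name) as the Group File example and that only FASTQ entries are listed in it.

```python
def parse_file_path(field):
    """Split a FilePath value such as 'FASTQ1=a|FASTQ2=b' into (key, path) pairs."""
    pairs = []
    for item in field.split("|"):
        key, _, path = item.partition("=")
        pairs.append((key.strip(), path.strip()))
    return pairs


def grouping_rows(samples):
    """Build (file path, sample ID) rows from a {sample ID: FilePath value} mapping."""
    rows = []
    for sample_id, field in samples.items():
        for key, path in parse_file_path(field):
            # Assumption: only raw read entries belong in the grouping file;
            # other entries (for example BAM=...) are skipped in this sketch.
            if key.upper().startswith("FASTQ"):
                rows.append((path, sample_id))
    return rows


# FilePath values copied from the sample registration example.
samples = {
    "TestDataA": (
        "FASTQ1=/filepath1/TestA1_1.fastq.gz|FASTQ2=/filepath1/TestA1_2.fastq.gz|"
        "FASTQ1=/filepath1/TestA2_1.fastq.gz|FASTQ2=/filepath1/TestA2_2.fastq.gz"
    ),
    "TestDataB": "FASTQ1=/filepath2/TestB_1.fastq.gz|FASTQ2=/filepath2/TestB_2.fastq.gz",
}

rows = grouping_rows(samples)
grouping_text = "\n".join(f"{path}\t{sample_id}" for path, sample_id in rows)
print(grouping_text)
```

Each printed line holds one file path and its sample ID separated by a tab, so the four TestDataA files share one sample name and the two TestDataB files share another.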
NOTE: Due to a change made on 06/27/2013, results analyzed through sample registration are named using the SampleID column, even if no sample in the sampleset contains multiple sequence files. Take this into account when choosing SampleIDs and when pre-registering the output BAM files.
Here is an example that pre-registers the output BAM files:
[Samples]
SampleID FilePath Species
TestDataA FASTQ1=/F1/TestA1_1.fastq.gz|FASTQ2=/F1/TestA1_2.fastq.gz|FASTQ1=/F1/TestA2_1.fastq.gz|FASTQ2=/F1/TestA2_2.fastq.gz|BAM=/output/TestDataA.bam Human
TestDataB FASTQ1=/filepath2/TestB_1.fastq.gz|FASTQ2=/filepath2/TestB_2.fastq.gz|BAM=/output/TestDataB.bam Human
Please read the Sample registration articles if you are not familiar with the sample registration process.
How to do this from GUI
In the GUI, a group file can be specified through the "Add List" link.
Grouping file example (tab delimited):
/filepath/MyData_1.fastq.gz	TestDataA
/filepath/MyData_2.fastq.gz	TestDataA
/filepath/MyTest_1.fastq.gz	TestDataA
/filepath/MyTest_2.fastq.gz	TestDataA
/filepath/SRR065521.1.fastq.gz	TestDataB
/filepath/SRR065521.2.fastq.gz	TestDataB
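To make the grouping concrete, here is a rough Python sketch that reads the example above, collects the files of each sample, and pairs read 1 with read 2. The exact matching rule used by the software is not described here, so the sketch assumes that mates differ only by "_1"/"_2" or ".1"/".2" immediately before the file extension. The function names and the pattern are illustrative, not part of any OmicSoft API.

```python
import re
from collections import defaultdict

# Assumed mate-naming convention: "_1"/"_2" or ".1"/".2" right before the extension.
MATE_PATTERN = re.compile(
    r"^(?P<stem>.+)[_.](?P<mate>[12])(?P<ext>\.(?:fastq|fq)(?:\.gz)?)$"
)


def read_grouping(text):
    """Return {sample name: [file paths]} from tab-delimited grouping text."""
    groups = defaultdict(list)
    for line in text.splitlines():
        if not line.strip():
            continue  # ignore blank lines
        path, sample = line.split("\t")
        groups[sample.strip()].append(path.strip())
    return dict(groups)


def pair_reads(paths):
    """Pair read 1 and read 2 files that share the same stem."""
    mates = defaultdict(dict)
    for path in paths:
        match = MATE_PATTERN.match(path)
        if match is None:
            raise ValueError(f"cannot infer mate number from {path!r}")
        mates[match["stem"]][match["mate"]] = path
    # A missing mate shows up as None in the resulting tuple.
    return [(m.get("1"), m.get("2")) for m in mates.values()]


grouping_text = (
    "/filepath/MyData_1.fastq.gz\tTestDataA\n"
    "/filepath/MyData_2.fastq.gz\tTestDataA\n"
    "/filepath/MyTest_1.fastq.gz\tTestDataA\n"
    "/filepath/MyTest_2.fastq.gz\tTestDataA\n"
    "/filepath/SRR065521.1.fastq.gz\tTestDataB\n"
    "/filepath/SRR065521.2.fastq.gz\tTestDataB\n"
)

groups = read_grouping(grouping_text)
pairs = {sample: pair_reads(paths) for sample, paths in groups.items()}
for sample, sample_pairs in pairs.items():
    # Output is named after the sample, e.g. "TestDataA.bam".
    print(f"{sample}.bam", sample_pairs)
```

In this sketch TestDataA ends up with two read pairs feeding one output name, while TestDataB has a single pair and is simply renamed.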
- When a list is added from a file containing only one column (a list of file paths), the GroupingFile option is not enabled. The file simply helps users load a list of files into the GUI. Output names still use the input file names.
- When a list is added from a file containing two columns (a list of file paths plus a list of sample IDs), the GroupingFile option is enabled. The file helps users load a list of files into the GUI and generates the GroupingFile option for the analysis. Output names use the sample IDs (the second column).
- In NGS raw data QC, the grouping file is not used, because a hidden parameter, /UseFileGrouping=False, is specified by default. Each raw data file is scanned separately for QC purposes.
- Sample registration automatically appends /GroupingFile="/xxx" to the end of each pipeline script block. Setting /DisableAutoGroupingFile=True disables this.
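The one-column versus two-column behavior described in the list above can be mimicked with a short check before loading a list file. The sketch below is only an illustration of that rule in Python; `classify_list_file` is a hypothetical helper, and it assumes the list file is tab delimited like the grouping file example.

```python
def classify_list_file(text):
    """Report whether a list file would enable grouping, based on its column count."""
    rows = [line.split("\t") for line in text.splitlines() if line.strip()]
    widths = {len(row) for row in rows}
    if widths == {1}:
        # File paths only: files are loaded, grouping stays off.
        return {"grouping": False, "files": [r[0] for r in rows], "sample_ids": None}
    if widths == {2}:
        # File paths plus sample IDs: grouping is on, outputs use the IDs.
        return {
            "grouping": True,
            "files": [r[0] for r in rows],
            "sample_ids": [r[1] for r in rows],
        }
    raise ValueError("expected every line to have one column, or every line to have two")


one_column = "/filepath/MyData_1.fastq.gz\n/filepath/MyData_2.fastq.gz\n"
two_column = "/filepath/MyData_1.fastq.gz\tTestDataA\n/filepath/MyData_2.fastq.gz\tTestDataA\n"

plain = classify_list_file(one_column)
grouped = classify_list_file(two_column)
print(plain["grouping"], grouped["grouping"], grouped["sample_ids"])
```

Rejecting files with mixed column counts is a design choice of this sketch; how the GUI handles such files is not covered here.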