Building Land From Raw Data

From Array Suite Wiki

Overview

ArrayLand is OmicSoft's solution for managing and delivering large Omics datasets. It is built on the OmicSoft File System (OFS), which stores data in database files with several layers of indexes for genes/markers and samples.

OmicSoft uses the ArrayLand framework to deliver large data service results, such as TCGA, in Land format. Once Land data has been configured on an internal ArrayServer, all ArrayStudio/ArrayLand users can instantly search all types of genomic profiles for a single gene or a set of genes, with rich visualization.

Users in each company can build their own internal land using the analysis functions and LandTools. Overview of building an internal land:

BuildAnInternalLand.png

Here, we briefly cover the steps and scripts used to build a land from raw data using ArrayServer.

Land components

Each land has its own configuration file: LandName.cfg. Examples and all options are listed on the ArrayLand Configuration Options page.

Genome and Gene Model

Users have to decide which genome and gene model to use for the analysis. To allow cross-land search, choose the same genome and gene model used for the other lands. Currently, we are using

Reference Human.B37.3;
GeneModel OmicsoftGene20130723;

Data type

Currently, we support array expression, DNA-Seq mutation, CNV, RNA-Seq quantification, fusion, and mutation data. The complete list of supported types can be found in LandTools. LandTools is used to generate Land vector (.alv) files for each sample in each data type mode.

Meta data

Land visualization is a view of the data based on its sample meta data. A SampleID column must be present for each data type. If the same sample ID is not shared between different data types, the user has to specify the IntegrationLevel column in the land cfg file. At least two primary columns are required, such as the tumor type and sample type columns in the TCGA land.
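
The column requirements above can be checked before a meta data table is uploaded. The following Python sketch is a hypothetical helper, not part of ArrayStudio or LandTools; the function name and the idea of passing the grouping and integration column names as arguments are assumptions made for illustration.

```python
def check_metadata_columns(header, grouping_columns, integration_column=None):
    """Return a list of problems found in a meta data header row.

    header: column names of the meta data table.
    grouping_columns: the primary columns, e.g. ["Tumor Type", "Sample Type"].
    integration_column: optional name of the column used as integration level.
    """
    problems = []
    if "SampleID" not in header:
        problems.append("missing SampleID column")
    if len(grouping_columns) < 2:
        problems.append("at least two primary columns are required")
    for column in grouping_columns:
        if column not in header:
            problems.append(f"missing primary column: {column}")
    if integration_column is not None and integration_column not in header:
        problems.append(f"missing integration column: {integration_column}")
    return problems
```

An empty list means the header satisfies the checks encoded here; it says nothing about other restrictions the server may apply.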

Examples to generate land vector files

All examples here are run using OmicScript. Users have to create a new server project, create a *.oscript file, and then send the script to the server queue.

20140812 SendScriptToQueue.png

The *.oscript file is a plain text file with the file extension "oscript" that contains the scripts described below.

RNA-Seq analysis pipeline (older two-step approach)

Step 1: alignment using MapRnaSeqReadsToGenome. Users can also run this step in the NGS Add RNA-Seq Data GUI.

Begin MapRnaSeqReadsToGenome /Namespace=NgsLib /RunOnServer=True;
Files
"
/YourFiles/OnArrayServer/RNASeqRawData/TestSampleA_1.fastq.gz
/YourFiles/OnArrayServer/RNASeqRawData/TestSampleA_2.fastq.gz
/YourFiles/OnArrayServer/RNASeqRawData/TestSampleB_1.fastq.gz
/YourFiles/OnArrayServer/RNASeqRawData/TestSampleB_2.fastq.gz
";
Reference Human.B37.3;
GeneModel OmicsoftGene20130723;
Trimming /Mode=TrimByQuality /ReadTrimQuality=2;
Options /ParallelJobNumber=99 /PairedEnd=True /FileFormat=FASTQ /AutoPenalty=True /FixedPenalty=2 
/Greedy=false /IndelPenalty=2 /DetectIndels=False /MaxMiddleInsertionSize=10 /MaxMiddleDeletionSize=10 
/MaxEndInsertionSize=10 /MaxEndDeletionSize=10 /MinDistalEndSize=3 /ExcludeNonUniqueMapping=False 
/ReportCutoff=10 /WriteReadsInSeparateFiles=True /OutputFolder="/YourOutputFiles/OnArrayServer/RNASeqBAM" 
/GenerateSamFiles=False /ThreadNumber=1 /InsertSizeStandardDeviation=40 /ExpectedInsertSize=300 
/MatePair=False /InsertOnSameStrand=False /InsertOnDifferentStrand=True /QualityEncoding=Automatic 
/CompressionMethod=Gzip /Gzip=True /SearchNovelExonJunction=True /ExcludeUnmappedInBam=False /KeepFullRead=False 
/Replace=False /Platform=ILLUMINA /CompressBam=False;
Output RNASeqAlignment;
End;

We usually use /ThreadNumber=4 for each alignment job. All alignment BAM files will be written to /YourOutputFiles/OnArrayServer/RNASeqBAM. You can also run this alignment tool from the GUI. It is not part of LandTools.
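
Because the script above sets /PairedEnd=True, every sample in the Files list needs both mates. The Python sketch below is a hypothetical pre-flight check, not part of OmicScript or LandTools; it assumes the _1/_2 file naming used in the example and would need a different pattern for other naming schemes.

```python
import re
from collections import defaultdict
from pathlib import PurePosixPath

# Assumed naming convention: <sample>_1.fastq(.gz) and <sample>_2.fastq(.gz)
MATE_PATTERN = re.compile(r"^(?P<sample>.+)_(?P<mate>[12])\.fastq(?:\.gz)?$")


def incomplete_pairs(fastq_paths):
    """Return the sorted sample names that do not have both mate files."""
    mates = defaultdict(set)
    for path in fastq_paths:
        match = MATE_PATTERN.match(PurePosixPath(path).name)
        if match is None:
            raise ValueError(f"unexpected FASTQ file name: {path}")
        mates[match["sample"]].add(match["mate"])
    return sorted(s for s, found in mates.items() if found != {"1", "2"})
```

An empty result means each sample name was seen with both a _1 and a _2 file.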

Step 2: run downstream analysis and convert to alv using LandTools ConvertRnaSeqBamToAlv

Begin LandTools /Namespace=NgsLib /RunOnServer=True;
Files 
"
/YourOutputFiles/OnArrayServer/RNASeqBAM/TestSampleA.bam
/YourOutputFiles/OnArrayServer/RNASeqBAM/TestSampleB.bam
";
Reference Human.B37.3;
GeneModel OmicsoftGene20130723;
Options /Action=ConvertRnaSeqBamToAlv
/BamFileMappingFileName="/YourOutputFiles/OnArrayServer/ALV/RNASeq.Design.txt"
/SampleIDColumn="SampleID"
/BamFileNameColumn="BamFileName"
/CopyToLocal=False
/ConvertExonJunction=True
/ConvertMutation=True
/ConvertCount=True
/ConvertFusion=True
/ConvertPairedEndFusion=True
/ConvertBas=True
/MinimalTotalHit=10
/MinimalMutationHit=5
/MinimalMutationFrequency=0.20
/MinimalFusionAlignmentLength=0
/TargetThirdQuantile=10
/ThreadNumber=1
/ParallelJobNumber=99
/OutputFolder="/YourOutputFiles/OnArrayServer/ALV";
End;

All results will be written to /YourOutputFiles/OnArrayServer/ALV. This step runs the downstream analysis: quantification, mutation detection, fusion detection, etc. LandTools runs on cluster nodes if ArrayServer is backed by a cluster. The BamFileMappingFileName file is an important mapping file that assigns a SampleID to each alv file. An example of the mapping file, RNASeq.Design.txt:

BamFileName	SampleID
TestSampleA.bam	SampleA
TestSampleB.bam	SampleB
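
For more than a handful of samples, the mapping file is easier to generate than to type. The Python sketch below is a hypothetical helper, not an OmicSoft tool; it assumes the two-column, tab-delimited layout shown above and rejects duplicate BAM file names, which would otherwise assign two sample IDs to the same file.

```python
import csv
import io
from pathlib import PurePosixPath


def build_bam_mapping(bam_paths, sample_ids):
    """Return tab-delimited mapping text with BamFileName and SampleID columns."""
    if len(bam_paths) != len(sample_ids):
        raise ValueError("each BAM file needs exactly one sample ID")
    # Only the file name (without the folder) goes into the BamFileName column.
    names = [PurePosixPath(p).name for p in bam_paths]
    if len(set(names)) != len(names):
        raise ValueError("duplicate BAM file names in mapping")
    buffer = io.StringIO()
    writer = csv.writer(buffer, delimiter="\t", lineterminator="\n")
    writer.writerow(["BamFileName", "SampleID"])
    writer.writerows(zip(names, sample_ids))
    return buffer.getvalue()
```

The returned text can be saved as the file named in /BamFileMappingFileName.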

RNA-Seq analysis pipeline (since V9)

Since V9, the one-step RNA-Seq Pipeline function includes a Generate Land ALV option. It runs the full RNA-Seq analysis in the project and also generates ALV files for each sample, so that the user can publish them to a land:

RNASeqPP2.png


DNA-Seq analysis pipeline

Step 1: alignment using MapDnaSeqReads. Users can also run this step in the NGS Add DNA-Seq Data GUI.

Begin MapDnaSeqReads /Namespace=NgsLib /RunOnServer=True;
Files
"
/YourFiles/OnArrayServer/DNASeqRawData/Test6356.Normal_1.fastq.gz
/YourFiles/OnArrayServer/DNASeqRawData/Test6356.Normal_2.fastq.gz
/YourFiles/OnArrayServer/DNASeqRawData/Test6356.Tumor_1.fastq.gz
/YourFiles/OnArrayServer/DNASeqRawData/Test6356.Tumor_2.fastq.gz
";
Reference Human.B37.3;
Trimming /Mode=TrimByQuality /ReadTrimQuality=2;
Options /ParallelJobNumber=99 /PairedEnd=True /FileFormat=FASTQ /AutoPenalty=True /FixedPenalty=2 
/IndelPenalty=2 /DetectIndels=True /Greedy=False /ExcludeNonUniqueMapping=False /ReportCutoff=10 
/WriteReadsInSeparateFiles=True /OutputFolder="/YourOutputFiles/OnArrayServer/DNASeqBAM" /MaxMiddleInsertionSize=10 
/MaxMiddleDeletionSize=50000 /MaxEndInsertionSize=10 /MaxEndDeletionSize=10 /MinDistalEndSize=3 /GenerateSamFiles=False 
/ThreadNumber=2 /ExpectedInsertSize=300 /InsertSizeStandardDeviation=40 /MatePair=False /QualityEncoding=Automatic 
/CompressionMethod=Gzip /Gzip=True /ExcludeUnmappedInBam=False /KeepFullRead=False 
/MapRead=True /MapReverseComplement=True /Replace=False /Platform=ILLUMINA;
Output DNASeqAlignment;
End;

All alignment BAM files will be written to /YourOutputFiles/OnArrayServer/DNASeqBAM. You can also run this function from the GUI. It is not part of LandTools.

Step 2: run mutation analysis and convert to alv using LandTools ConvertBamToMutation

Begin LandTools /Namespace=NgsLib /RunOnServer=True;
Files 
"
/YourOutputFiles/OnArrayServer/DNASeqBAM/Test6356.Normal.bam
/YourOutputFiles/OnArrayServer/DNASeqBAM/Test6356.Tumor.bam
";
Reference Human.B37.3;
GeneModel OmicsoftGene20130723;
Options
/Action=ConvertBamToMutation
/BamFileMappingFileName="/YourOutputFiles/OnArrayServer/ALV/DNASeq.Design.txt"
/SampleIDColumn=SampleID
/BamFileNameColumn=BamFileName
/MinimalTotalHit=10
/MinimalMutationHit=5
/MinimalMutationFrequency=0.20
/DataMode=DnaSeq_Mutation
/ThreadNumber=100
/OutputFolder="/YourOutputFiles/OnArrayServer/ALV/DnaSeq_Mutation";
End;

All results will be written to /YourOutputFiles/OnArrayServer/ALV/DnaSeq_Mutation. This step runs the mutation detection analysis. LandTools runs on cluster nodes if ArrayServer is backed by a cluster. The BamFileMappingFileName file is an important mapping file that assigns a SampleID to each alv file. An example of the mapping file, DNASeq.Design.txt:

BamFileName	SampleID
Test6356.Normal.bam	SampleA
Test6356.Tumor.bam	SampleB

In the example above, the same sample IDs have been assigned to the DnaSeq_Mutation data, so these data are linked to the RNA-Seq results with the same ID. This allows, for example, comparing mutations detected in RNA-Seq and DNA-Seq. If there is no 1-1 mapping between DNA-Seq and RNA-Seq samples, the user can assign different IDs such as "SampleA1" and "SampleA2" to the DNA-Seq samples and give them an IntegrationLevel column value of "SampleA" to match the RNA-Seq IntegrationLevel column value in the MetaData table.
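
The linking rule described above can be pictured as grouping sample IDs by their IntegrationLevel value. The Python sketch below only illustrates the concept and does not describe how ArrayLand resolves links internally. The row layout (dicts with "SampleID" and "IntegrationLevel" keys) and the fallback to SampleID when IntegrationLevel is empty are assumptions.

```python
from collections import defaultdict


def group_by_integration_level(rows):
    """Group sample IDs by IntegrationLevel value.

    rows: iterable of dicts with a "SampleID" key and an optional
    "IntegrationLevel" key. Rows with an empty or missing IntegrationLevel
    are grouped under their own SampleID (an assumption for illustration).
    """
    groups = defaultdict(list)
    for row in rows:
        key = row.get("IntegrationLevel") or row["SampleID"]
        groups[key].append(row["SampleID"])
    return dict(groups)


# Illustrative meta data rows: one RNA-Seq sample and two DNA-Seq samples.
example_rows = [
    {"SampleID": "SampleA", "IntegrationLevel": "SampleA"},
    {"SampleID": "SampleA1", "IntegrationLevel": "SampleA"},
    {"SampleID": "SampleA2", "IntegrationLevel": "SampleA"},
]
```

With these rows, all three sample IDs end up in the group keyed by "SampleA".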

Step 3: (optional, only if the dataset has a tumor/normal pair design) run somatic mutation analysis using Summarize Matched Pair Variation Data. This function is based on VarScan2.

Users have to import a design table for the NgsData object DNASeqAlignment generated from step 1. The design table needs to contain two columns: one for patient ID and another for sample type (tumor or normal), such as the one below:

MPVDesign.png
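
Before running the matched pair analysis, it can be useful to confirm that every patient in the design table has both a normal and a tumor sample. The Python sketch below is a hypothetical check, not part of ArrayStudio. It assumes the design is available as (patient ID, sample type) pairs and treats every sample type other than the normal label as tumor.

```python
from collections import defaultdict


def unmatched_patients(design_rows, normal_label="Normal"):
    """Return sorted patient IDs that lack a normal or a tumor sample.

    design_rows: iterable of (patient_id, sample_type) tuples.
    normal_label: the sample type value that marks a normal sample;
    any other value is counted as tumor (an assumption for illustration).
    """
    counts = defaultdict(lambda: {"normal": 0, "tumor": 0})
    for patient_id, sample_type in design_rows:
        kind = "normal" if sample_type == normal_label else "tumor"
        counts[patient_id][kind] += 1
    return sorted(
        patient
        for patient, c in counts.items()
        if c["normal"] == 0 or c["tumor"] == 0
    )
```

Patients returned by this function cannot form a tumor/normal pair under the assumptions above.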

Users can also send the script to the server queue:

Begin SummarizeMatchedPairVariation /Namespace=NgsLib /RunOnServer=True;
Project test_server;
Data test_server\\NgsData;
Pair PatientID;
TumorStatus SampleType /Normal=Normal;
Options /BaseQualityCutoff=13 /MapQualityCutoff=0 /MinimalIndelSize=1 /ExcludeSingletons=False 
/ExcludeMultiReads=False /ExcludeDuplicates=True /LeftExclusion=0 /RightExclusion=0 
/ThreadNumber=10 /MinimalNormalHit=8 /MinimalTumorHit=6 /MinimalMutationHit=2 
/HeterozygosityFrequencyCutoff=0.10 /HomozygosityFrequencyCutoff=0.75 /FilteringReadPositionCutoff=0.10 
/FilteringStrandnessCutoff=0.9 /FilteringHomopolymerCutoff=5 /FilteringMappingQualityDifferenceCutoff=30 
/FilteringReadLengthDifferenceCutoff=25 /FilteringMmqsDifferenceCutoff=100 /FrequencyDifferenceCutoff=0.20 
/FilteringSignificanceLevel=0.05 /MaxFrequencyCutoff=0 /GenerateTableland=True /DbsnpVersion=(none) 
/OutputFolder="/YourOutputFiles/OnArrayServer/DNASeqMPV";
Output MPV;
End;

All results (*.mpv matched pair variation files) will be written to the /YourOutputFiles/OnArrayServer/DNASeqMPV output folder. For this example, one file is generated, using the patient ID as the file name: /YourOutputFiles/OnArrayServer/DNASeqMPV/Test6356.mpv

Step 4: (optional, run only after step 3 finishes) convert the somatic mutation mpv files to alv files using LandTools ConvertNgsmpv.

Begin LandTools /Namespace=NgsLib /RunOnServer=True;
Files
"
/YourOutputFiles/OnArrayServer/DNASeqMPV/Test6356.mpv
";
Reference Human.B37.3;
GeneModel OmicsoftGene20130723;
Options /Action=ConvertNgsmpv /ThreadNumber=12 /BamFileMappingFileName="/YourOutputFiles/OnArrayServer/ALV/MPV.Design.txt" 
/SampleIDColumn="SampleID"  /OutputFolder="/YourOutputFiles/OnArrayServer/ALV/DnaSeq_SomaticMutation" /IsRnaSeq=False;
End;

Although the design parameter is named "BamFileMappingFileName", here it is a mapping file for MPV files. It assigns a SampleID to each DnaSeq_SomaticMutation result. Somatic mutations are mutations detected in the tumor sample but not in the normal sample. We usually assign the tumor sample ID to the final MPV results, as in the example below:

BamFileName	SampleID
Test6356	SampleA

Note: use the file name without the extension or full path. For example, if your ngsmpv file is "abcde.bam.ngsmpv", you should use "abcde.bam" in the BamFileName column.
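
The naming rule in the note amounts to dropping the folder and the final extension only. A minimal Python sketch follows; the function name mapping_key is hypothetical.

```python
from pathlib import PurePosixPath


def mapping_key(file_path):
    """Strip the folder and the final extension from a file path."""
    return PurePosixPath(file_path).stem
```

For example, mapping_key("abcde.bam.ngsmpv") gives "abcde.bam", and mapping_key("/YourOutputFiles/OnArrayServer/DNASeqMPV/Test6356.mpv") gives "Test6356".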

Steps 3-4 of the DNA-Seq analysis can also be run for RNA-Seq samples if there are matched samples.


Affy Expression array analysis

Single Step: convert CEL files to alv using LandTools ConvertAffymetrixCelToGeneralExpression. General_Expression is our preferred storage mode for expression values, because it allows cross-platform search and comparison.

Oscript:

Begin LandTools /Namespace=NgsLib /RunOnServer=True;
SearchFiles "/YourFiles/OnArrayServer/HG133Plus2RawData" /Pattern=*.CEL /Recursive=True;
Reference Human.B37.3;
GeneModel OmicsoftGene20130723;
Options /Action=ConvertAffymetrixCelToGeneralExpression 
/SampleIDColumn=SampleID /CelFileMappingFileName="/YourOutputFiles/OnArrayServer/ALV/Expr.Design.txt"
/OutputFolder="/YourOutputFiles/OnArrayServer/ALV/General_Expression";
End;

Here, SearchFiles is used instead of the Files statement, so the script will search for all .cel files in the folder /YourFiles/OnArrayServer/HG133Plus2RawData. To run only a subset of the files in the folder, use the Files statement instead:

Begin LandTools /Namespace=NgsLib /RunOnServer=True;
Files
"
/YourFiles/OnArrayServer/HG133Plus2RawData/SampleAhg133plus2.cel
/YourFiles/OnArrayServer/HG133Plus2RawData/SampleBhg133plus2.cel
";
Reference Human.B37.3;
GeneModel OmicsoftGene20130723;
Options /Action=ConvertAffymetrixCelToGeneralExpression 
/SampleIDColumn=SampleID /CelFileMappingFileName="/YourOutputFiles/OnArrayServer/ALV/Expr.Design.txt"
/OutputFolder="/YourOutputFiles/OnArrayServer/ALV/General_Expression";
End;

CelFileMappingFileName is a design file that assigns a sample ID to each expression analysis result. The first column must be the chip ID or CEL file name without the .cel suffix:

Example:

ChipID	SampleID
SampleAhg133plus2	SampleA
SampleBhg133plus2	SampleB
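
When SearchFiles picks up a whole folder, it is easy to miss a CEL file in the design file. The Python sketch below is a hypothetical check, not an OmicSoft tool. It lists the CEL files whose chip ID has no row in the design text, assuming a tab-delimited design with a header row, the chip ID in the first column, and case-sensitive matching of chip IDs.

```python
import csv
import io
from pathlib import PurePosixPath


def chip_id(cel_path):
    """Return the CEL file name without folder and without the .cel suffix."""
    name = PurePosixPath(cel_path).name
    return name[:-4] if name.lower().endswith(".cel") else name


def unmapped_cel_files(cel_paths, design_text):
    """Return the CEL paths whose chip ID is absent from the design text."""
    reader = csv.reader(io.StringIO(design_text), delimiter="\t")
    next(reader, None)  # skip the header row (e.g. ChipID, SampleID)
    mapped = {row[0] for row in reader if row}
    return [path for path in cel_paths if chip_id(path) not in mapped]
```

An empty list means every CEL file has a matching first-column entry in the design.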

All alv files will be generated to /YourOutputFiles/OnArrayServer/ALV/General_Expression output folder.

Affy SNP array analysis

Single Step: convert CEL file to alv using LandTools ConvertAffymetrixCnvCel

Begin LandTools /Namespace=NgsLib /RunOnServer=True;
SearchFiles "/YourFiles/OnArrayServer/SNP6RawData" /Pattern=*.CEL /Recursive=True;
Reference Human.B37.3;
GeneModel OmicsoftGene20130723;
Options /Action=ConvertAffymetrixCnvCel /SampleIDColumn="SampleID" /CelFileMappingFileName="/YourOutputFiles/OnArrayServer/ALV/SNP6.Design.txt"
/OutputFolder="/YourOutputFiles/OnArrayServer/ALV/SNP_CNV";
End;

Here, SearchFiles is used instead of the Files statement, so the script will search for all .cel files in the folder /YourFiles/OnArrayServer/SNP6RawData. To run only a subset of the files in the folder, use the Files statement instead:

Begin LandTools /Namespace=NgsLib /RunOnServer=True;
Files
"
/YourFiles/OnArrayServer/SNP6RawData/SampleASNP6.cel
/YourFiles/OnArrayServer/SNP6RawData/SampleBSNP6.cel
";
Reference Human.B37.3;
GeneModel OmicsoftGene20130723;
Options /Action=ConvertAffymetrixCnvCel /SampleIDColumn="SampleID" /CelFileMappingFileName="/YourOutputFiles/OnArrayServer/ALV/SNP6.Design.txt"
/OutputFolder="/YourOutputFiles/OnArrayServer/ALV/SNP_CNV";
End;

CelFileMappingFileName is a design file that assigns a sample ID to each CNV calling result. The first column must be the chip ID or CEL file name without the .cel suffix:

Example:

ChipID	SampleID
SampleASNP6	SampleA
SampleBSNP6	SampleB

All alv files will be generated to /YourOutputFiles/OnArrayServer/ALV/SNP_CNV output folder.

Create and publish land

Once all alv files are ready, we can build the land. By default, only an admin can create a new land. Once the land is created, the admin can manage user access and control who can publish or edit land contents.

Create a new land

Create land in Land Tab | Tools | Create Land

CreateLand.png

Usually, the following land options are sufficient. If necessary, users can add more options later directly in the LandName.cfg file. Click each option to see more details, or read ArrayLand Configuration Options.

Name=TestLand
ReferenceLibraryID=Human.B37.3
GeneModelID=OmicsoftGene20130723
PrimaryGrouping=Tumor Type
SecondaryGrouping=Sample Type
FunctionalAnnotationFiles=Human.B37.3_FunctionalMutation20131003.gbt,Human.B37.3_CosmicMutation_V68.gbt,Human.B37.3_1000Genome_2011_0521.compact.gbt
Description=My first land which is built based on test dataset
MutationGeneModelID=Uniprot.Ensembl75

By default, only an admin can create an empty land and allow a user (or a user group) to publish to or edit the land. The admin can control access for each land as shown below:

LandControl.png

Publish alv to the new land

Publish to land in Land Tab | Tools | Publish to Land

PublishLand.png

Users can also publish to the land using Oscript:

Begin PublishAlv /Namespace=Land;
Land TestLand;
InputFolder "/YourOutputFiles/OnArrayServer/ALV";
Exclude "";
Options /UseSgen=True /Memory=8GB /Recursive=True /PublishMode=Auto /ParallelJobNumber=4 /DataTypes=(all);
End;

This script runs 4 parallel jobs for the land publish, and each job uses 8 GB of memory. Users should lower these values accordingly if the machine has less than 4*8=32 GB of memory. Land publish will find all alv files in the input folder and publish them for each data type.
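
The memory arithmetic above can be turned into a quick sizing check. The Python sketch below is a hypothetical helper, not part of ArrayServer. It only divides the available memory by the per-job memory and ignores memory needed by the operating system or other services, so the result is an upper bound.

```python
def max_parallel_jobs(total_memory_gb, memory_per_job_gb, requested_jobs):
    """Return how many of the requested publish jobs fit into memory."""
    if memory_per_job_gb <= 0:
        raise ValueError("memory per job must be positive")
    fitting = int(total_memory_gb // memory_per_job_gb)
    return max(0, min(requested_jobs, fitting))
```

For example, with 8 GB per job and 4 requested jobs, a machine with 32 GB fits all 4 jobs, while a machine with 16 GB fits only 2, which suggests lowering /ParallelJobNumber or /Memory.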

Add sample metadata

Once the alv files are published, the user has to add sample meta data for visualization. Add sample metadata in Land Tab | Manage | Samples | Manage Sample Meta Data. If the menu item is greyed out, ask an admin to change the land permission for you.

AddMetaData.png

The meta data table should have one column for "Tumor Type" and another for "Sample Type", since both were specified in the land configuration when the land was set up. If the user wants to link RNA-Seq BAM files to data points in the land view, the meta data table needs a "BamFileName" column and the land configuration needs a few BAM/BAS configuration options.

WARNING: Please see this page for restrictions on metadata column names.


The example above uses a single sample ID for all data types. If the data do not have a 1-1 mapping between data types, the IntegrationLevel option can link them based on one meta data table column, as in the example below:

AddMetaData2.png

Here, the CNV SNP6 files have three replicates each. They are linked to the other data types using the IntegrationLevel=IntegrationID option in the land configuration. Integration views and analyses follow the rule defined there.


Enjoy your land

You can now search genes in your land. If you do not see your land data, try logging off the server and logging back in to ArrayServer to refresh the cache.

If you change the land configuration directly in the cfg file, you have to ask the server admin to refresh the land by choosing refresh (all).

Also try building a virtual land to run cross-land searches between your land and public lands (such as TCGA and GTEx).
