Building Land From Raw Data

From Array Suite Wiki

Overview

OmicSoft Lands are a solution for managing and delivering large omics datasets. Lands are built on the OmicSoft File System (OFS), which stores data in database files with several layers of indexes for genes/variants and samples. OmicSoft uses the Land framework to deliver data in Land format to customers. If you are interested in OmicSoft data services, contact support@omicsoft.com for more information.

Advanced users can build their own data into a land database using LandTools. This wiki page is a user's guide for building a land on the user's internal ArrayServer (i.e. an Internal Land) and links out to useful documentation. A graphical overview of building an internal land is shown below:

BuildAnInternalLand.png

Land components

Each land has its own configuration file: LandName.cfg. An example of the file and all configurable parameters are listed on the ArrayLand Configuration Options page.

Genome and Gene Model

Users must choose the genome and gene model used to run the analysis in ArrayStudio. Before performing your analysis, it is best practice to choose the same genome and gene model as previously built Lands in order to facilitate cross-land comparisons. The currently supported genome references and gene models are listed on this wiki page: Genome Reference and Gene Models supported in Land Platform.

Data type

ArrayLands support a variety of omic data, including array expression, DNASeq mutation, CNV, RNA-Seq quantification, Fusion, and Mutation. LandTools is used to generate Land vector (.alv) files for each sample in each data type mode. The complete list of land tools can be found here: LandTools.

ALV files are not used to publish data into GeneticsLand. Supported data types can be directly published into the Land using Land PublishToGxl.pdf. Please see this wiki page and our GeneticsLand tutorial for a complete description of how to publish data into GeneticsLand. A list of supported source files and optional compression is also provided.

Meta data

A Land visualization is a view of the data based on its sample metadata. A SampleID column is required for each data type. If the same sample ID is not shared between different data types, the user has to specify the IntegrationLevel column in the Land.cfg file. At least two grouping columns are also required: the Primary and Secondary Grouping columns that were defined in the Land.cfg file when the land was created.

Examples to generate land vector files

All examples here are run using OmicScript. Users have to create a new server project, create a *.oscript file, and then send the script to the server queue.

20140812 SendScriptToQueue.png

The *.oscript file is a plain text file with the extension "oscript" that contains the scripts described below.

RNA-Seq analysis pipeline (legacy two-step approach)

Step 1: alignment using MapRnaSeqReadsToGenome. Users can also run this step in the NGS Add RNA-Seq Data GUI.

Begin MapRnaSeqReadsToGenome /Namespace=NgsLib /RunOnServer=True;
Files
"
/YourFiles/OnArrayServer/RNASeqRawData/TestSampleA_1.fastq.gz
/YourFiles/OnArrayServer/RNASeqRawData/TestSampleA_2.fastq.gz
/YourFiles/OnArrayServer/RNASeqRawData/TestSampleB_1.fastq.gz
/YourFiles/OnArrayServer/RNASeqRawData/TestSampleB_2.fastq.gz
";
Reference Human.B37.3;
GeneModel OmicsoftGene20130723;
Trimming /Mode=TrimByQuality /ReadTrimQuality=2;
Options /ParallelJobNumber=99 /PairedEnd=True /FileFormat=FASTQ /AutoPenalty=True /FixedPenalty=2 
/Greedy=false /IndelPenalty=2 /DetectIndels=False /MaxMiddleInsertionSize=10 /MaxMiddleDeletionSize=10 
/MaxEndInsertionSize=10 /MaxEndDeletionSize=10 /MinDistalEndSize=3 /ExcludeNonUniqueMapping=False 
/ReportCutoff=10 /WriteReadsInSeparateFiles=True /OutputFolder="/YourOutputFiles/OnArrayServer/RNASeqBAM" 
/GenerateSamFiles=False /ThreadNumber=1 /InsertSizeStandardDeviation=40 /ExpectedInsertSize=300 
/MatePair=False /InsertOnSameStrand=False /InsertOnDifferentStrand=True /QualityEncoding=Automatic 
/CompressionMethod=Gzip /Gzip=True /SearchNovelExonJunction=True /ExcludeUnmappedInBam=False /KeepFullRead=False 
/Replace=False /Platform=ILLUMINA /CompressBam=False;
Output RNASeqAlignment;
End;

We usually use /ThreadNumber=4 for each alignment job. All alignment BAM files will be written to /YourOutputFiles/OnArrayServer/RNASeqBAM. You can also run this alignment tool from the GUI. It is not part of LandTools.

Step 2: run downstream analysis and convert to alv using LandTools ConvertRnaSeqBamToAlv

Begin LandTools /Namespace=NgsLib /RunOnServer=True;
Files 
"
/YourOutputFiles/OnArrayServer/RNASeqBAM/TestSampleA.bam
/YourOutputFiles/OnArrayServer/RNASeqBAM/TestSampleB.bam
";
Reference Human.B37.3;
GeneModel OmicsoftGene20130723;
Options /Action=ConvertRnaSeqBamToAlv
/BamFileMappingFileName="/YourOutputFiles/OnArrayServer/ALV/RNASeq.Design.txt"
/SampleIDColumn="SampleID"
/BamFileNameColumn="BamFileName"
/CopyToLocal=False
/ConvertExonJunction=True
/ConvertMutation=True
/ConvertCount=True
/ConvertFusion=True
/ConvertPairedEndFusion=True
/ConvertBas=True
/MinimalTotalHit=10
/MinimalMutationHit=5
/MinimalMutationFrequency=0.20
/MinimalFusionAlignmentLength=0
/TargetThirdQuantile=10
/ThreadNumber=1
/ParallelJobNumber=99
/OutputFolder="/YourOutputFiles/OnArrayServer/ALV";
End;

All results will be written to /YourOutputFiles/OnArrayServer/ALV. This step runs the downstream analyses: quantification, mutation detection, fusion detection, etc. LandTools runs on cluster nodes if ArrayServer is backed by a cluster. The BamFileMappingFileName file is an important mapping file that assigns a SampleID to each alv file. An example of the mapping file, RNASeq.Design.txt:

BamFileName	SampleID
TestSampleA.bam	SampleA
TestSampleB.bam	SampleB
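
Because the mapping file is a plain tab-delimited table, it can be written by hand or generated. The following is a minimal Python sketch, not part of LandTools: the helper name build_bam_mapping and the dictionary of sample IDs are assumptions made for illustration, and the column names are the ones passed to /BamFileNameColumn and /SampleIDColumn in the script above.

```python
import csv
import io
import posixpath


def build_bam_mapping(bam_paths, sample_ids):
    """Return tab-delimited text with BamFileName and SampleID columns.

    bam_paths: BAM file paths on ArrayServer (forward-slash paths).
    sample_ids: dict mapping a BAM base name to the SampleID to assign.
    """
    buffer = io.StringIO()
    writer = csv.writer(buffer, delimiter="\t", lineterminator="\n")
    writer.writerow(["BamFileName", "SampleID"])
    for path in bam_paths:
        name = posixpath.basename(path)  # the mapping uses file names, not full paths
        if name not in sample_ids:
            raise KeyError(f"No SampleID given for {name}")
        writer.writerow([name, sample_ids[name]])
    return buffer.getvalue()


mapping_text = build_bam_mapping(
    [
        "/YourOutputFiles/OnArrayServer/RNASeqBAM/TestSampleA.bam",
        "/YourOutputFiles/OnArrayServer/RNASeqBAM/TestSampleB.bam",
    ],
    {"TestSampleA.bam": "SampleA", "TestSampleB.bam": "SampleB"},
)
print(mapping_text, end="")
```

Raising an error for a BAM file without a sample ID is a design choice of this sketch: it surfaces a missing row before the LandTools job is queued.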

RNA-Seq analysis pipeline (since V9)

Since V9, the one-step RNA-Seq Pipeline function includes a Generate Land ALV option. It runs the full RNA-Seq analysis in the project and also generates ALV files for each sample so the user can publish them to a Land:

RNASeqPP2.png

DNA-Seq analysis pipeline

Step 1: alignment using MapDnaSeqReads. Users can also run this step in the NGS Add DNA-Seq Data GUI.

Begin MapDnaSeqReads /Namespace=NgsLib /RunOnServer=True;
Files
"
/YourFiles/OnArrayServer/DNASeqRawData/Test6356.Normal_1.fastq.gz
/YourFiles/OnArrayServer/DNASeqRawData/Test6356.Normal_2.fastq.gz
/YourFiles/OnArrayServer/DNASeqRawData/Test6356.Tumor_1.fastq.gz
/YourFiles/OnArrayServer/DNASeqRawData/Test6356.Tumor_2.fastq.gz
";
Reference Human.B37.3;
Trimming /Mode=TrimByQuality /ReadTrimQuality=2;
Options /ParallelJobNumber=99 /PairedEnd=True /FileFormat=FASTQ /AutoPenalty=True /FixedPenalty=2 
/IndelPenalty=2 /DetectIndels=True /Greedy=False /ExcludeNonUniqueMapping=False /ReportCutoff=10 
/WriteReadsInSeparateFiles=True /OutputFolder="/YourOutputFiles/OnArrayServer/DNASeqBAM" /MaxMiddleInsertionSize=10 
/MaxMiddleDeletionSize=50000 /MaxEndInsertionSize=10 /MaxEndDeletionSize=10 /MinDistalEndSize=3 /GenerateSamFiles=False 
/ThreadNumber=2 /ExpectedInsertSize=300 /InsertSizeStandardDeviation=40 /MatePair=False /QualityEncoding=Automatic 
/CompressionMethod=Gzip /Gzip=True /ExcludeUnmappedInBam=False /KeepFullRead=False 
/MapRead=True /MapReverseComplement=True /Replace=False /Platform=ILLUMINA;
Output DNASeqAlignment;
End;

All alignment BAM files will be generated to /YourOutputFiles/OnArrayServer/DNASeqBAM. You can run this function from GUI too. It is not a part of LandTools.

Step 2: run mutation analysis and convert to alv using LandTools ConvertBamToMutation

Begin LandTools /Namespace=NgsLib /RunOnServer=True;
Files 
"
/YourOutputFiles/OnArrayServer/DNASeqBAM/Test6356.Normal.bam
/YourOutputFiles/OnArrayServer/DNASeqBAM/Test6356.Tumor.bam
";
Reference Human.B37.3;
GeneModel OmicsoftGene20130723;
Options
/Action=ConvertBamToMutation
/BamFileMappingFileName="/YourOutputFiles/OnArrayServer/ALV/DNASeq.Design.txt"
/SampleIDColumn=SampleID
/BamFileNameColumn=BamFileName
/MinimalTotalHit=10
/MinimalMutationHit=5
/MinimalMutationFrequency=0.20
/DataMode=DnaSeq_Mutation
/ThreadNumber=100
/OutputFolder="/YourOutputFiles/OnArrayServer/ALV/DnaSeq_Mutation";
End;

All results will be written to /YourOutputFiles/OnArrayServer/ALV/DnaSeq_Mutation. This step runs the mutation detection analysis. LandTools runs on cluster nodes if ArrayServer is backed by a cluster. The BamFileMappingFileName file is an important mapping file that assigns a SampleID to each alv file. An example of the mapping file, DNASeq.Design.txt:

BamFileName	SampleID
Test6356.Normal.bam	SampleA
Test6356.Tumor.bam	SampleB

In the example above, the DnaSeq_Mutation data are assigned the same sample IDs as the RNA-Seq data, so the two are linked by ID. This makes it possible, for example, to compare mutations detected in RNA-Seq and DNA-Seq. If there is no 1-1 mapping between DNA-Seq and RNA-Seq samples, the user can assign different IDs such as "SampleA1" and "SampleA2" to the DNA-Seq samples and give them an IntegrationLevel column value of "SampleA" that matches the RNA-Seq sample's IntegrationLevel column value in the MetaData table.

Step 3: (optional, only if tumor/normal pairs can be identified in the design of the dataset) run somatic mutation analysis using Summarize Matched Pair Variation Data. This function is based on VarScan2.

Users have to import a design table for the NgsData object DNASeqAlignment generated from step 1. The design table needs to contain two columns: one for patient ID and another for sample type (tumor or normal), such as the one below:

MPVDesign.png
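
Before running the matched pair analysis, it can help to confirm that every patient in the design table has both a tumor and a normal sample. The following is a minimal Python sketch under stated assumptions: the design table has been loaded as a list of dictionaries, its columns are named PatientID and SampleType, and normal samples are labeled "Normal". The helper name unpaired_patients and the patient Test9999 are hypothetical and not part of ArrayStudio.

```python
def unpaired_patients(design_rows, normal_label="Normal"):
    """Return patient IDs that lack a normal sample or lack a non-normal (tumor) sample."""
    types_by_patient = {}
    for row in design_rows:
        types_by_patient.setdefault(row["PatientID"], set()).add(row["SampleType"])

    unpaired = []
    for patient, sample_types in types_by_patient.items():
        has_normal = normal_label in sample_types
        has_tumor = any(t != normal_label for t in sample_types)
        if not (has_normal and has_tumor):
            unpaired.append(patient)
    return sorted(unpaired)


design = [
    {"PatientID": "Test6356", "SampleType": "Normal"},
    {"PatientID": "Test6356", "SampleType": "Tumor"},
    {"PatientID": "Test9999", "SampleType": "Tumor"},  # hypothetical patient without a normal sample
]
print(unpaired_patients(design))
```

Patients reported by this check would not form a tumor/normal pair and can be reviewed before the job is submitted.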

Users can also send the script to the server queue:

Begin SummarizeMatchedPairVariation /Namespace=NgsLib /RunOnServer=True;
Project test_server;
Data test_server\\NgsData;
Pair PatientID;
TumorStatus SampleType /Normal=Normal;
Options /BaseQualityCutoff=13 /MapQualityCutoff=0 /MinimalIndelSize=1 /ExcludeSingletons=False 
/ExcludeMultiReads=False /ExcludeDuplicates=True /LeftExclusion=0 /RightExclusion=0 
/ThreadNumber=10 /MinimalNormalHit=8 /MinimalTumorHit=6 /MinimalMutationHit=2 
/HeterozygosityFrequencyCutoff=0.10 /HomozygosityFrequencyCutoff=0.75 /FilteringReadPositionCutoff=0.10 
/FilteringStrandnessCutoff=0.9 /FilteringHomopolymerCutoff=5 /FilteringMappingQualityDifferenceCutoff=30 
/FilteringReadLengthDifferenceCutoff=25 /FilteringMmqsDifferenceCutoff=100 /FrequencyDifferenceCutoff=0.20 
/FilteringSignificanceLevel=0.05 /MaxFrequencyCutoff=0 /GenerateTableland=True /DbsnpVersion=(none) 
/OutputFolder="/YourOutputFiles/OnArrayServer/DNASeqMPV";
Output MPV;
End;

All results (*.mpv matched pair variation files) will be generated to /YourOutputFiles/OnArrayServer/DNASeqMPV output folder. For this example, it will generate one file using patient ID as file name: /YourOutputFiles/OnArrayServer/DNASeqMPV/Test6356.mpv

Step 4: (optional, run only after Step 3 finishes) convert the somatic mutation mpv files to alv files using LandTools ConvertNgsmpv.

Begin LandTools /Namespace=NgsLib /RunOnServer=True;
Files
"
/YourOutputFiles/OnArrayServer/DNASeqMPV/Test6356.mpv
";
Reference Human.B37.3;
GeneModel OmicsoftGene20130723;
Options /Action=ConvertNgsmpv /ThreadNumber=12 /BamFileMappingFileName="/YourOutputFiles/OnArrayServer/ALV/MPV.Design.txt" 
/SampleIDColumn="SampleID"  /OutputFolder="/YourOutputFiles/OnArrayServer/ALV/DnaSeq_SomaticMutation" /IsRnaSeq=False;
End;

Although the parameter is named "BamFileMappingFileName", here it is a mapping file for MPV files: it assigns a SampleID to each DnaSeq SomaticMutation result. Somatic mutations are mutations detected in the tumor sample but not in the normal sample. We usually assign the tumor sample ID to the final MPV results, as in the example below:

BamFileName	SampleID
Test6356	SampleA

Note: use the file name without the extension or full path. For example, if your ngsmpv file is "abcde.bam.ngsmpv", you should use "abcde.bam" in the BamFileName column.
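
The naming rule in the note can be sketched in Python as follows. This is only an illustration of the rule (drop the directory, then drop the final extension); the helper name mapping_name is hypothetical and not part of LandTools.

```python
import posixpath


def mapping_name(file_path):
    """Return the file name without its directory and without its final extension."""
    base = posixpath.basename(file_path)      # drop the directory part
    stem, _extension = posixpath.splitext(base)  # drop only the last extension
    return stem


print(mapping_name("abcde.bam.ngsmpv"))
print(mapping_name("/YourOutputFiles/OnArrayServer/DNASeqMPV/Test6356.mpv"))
```

Only the last extension is removed, which is why "abcde.bam.ngsmpv" keeps its ".bam" part.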

Steps 3-4 of the DNA-Seq analysis can also be run for RNA-Seq samples if there are matched samples.

Affy Expression array analysis

Single step: convert CEL files to alv using LandTools ConvertAffymetrixCelToGeneralExpression. General_Expression is our preferred storage mode for expression values because it allows cross-platform search and comparison.

Oscript:

Begin LandTools /Namespace=NgsLib /RunOnServer=True;
SearchFiles "/YourFiles/OnArrayServer/HG133Plus2RawData" /Pattern=*.CEL /Recursive=True;
Reference Human.B37.3;
GeneModel OmicsoftGene20130723;
Options /Action=ConvertAffymetrixCelToGeneralExpression 
/SampleIDColumn=SampleID /CelFileMappingFileName="/YourOutputFiles/OnArrayServer/ALV/Expr.Design.txt"
/OutputFolder="/YourOutputFiles/OnArrayServer/ALV/General_Expression";
End;

This example uses the SearchFiles statement instead of the Files statement, so it searches for all .cel files in the folder /YourFiles/OnArrayServer/HG133Plus2RawData. To run only a subset of the files in the folder, use the Files statement instead:

Begin LandTools /Namespace=NgsLib /RunOnServer=True;
Files
"
/YourFiles/OnArrayServer/HG133Plus2RawData/SampleAhg133plus2.cel
/YourFiles/OnArrayServer/HG133Plus2RawData/SampleBhg133plus2.cel
";
Reference Human.B37.3;
GeneModel OmicsoftGene20130723;
Options /Action=ConvertAffymetrixCelToGeneralExpression 
/SampleIDColumn=SampleID /CelFileMappingFileName="/YourOutputFiles/OnArrayServer/ALV/Expr.Design.txt"
/OutputFolder="/YourOutputFiles/OnArrayServer/ALV/General_Expression";
End;

CelFileMappingFileName is a design file (the first column needs to be the chipId or celname without .cel suffix) to assign sample ID to each expression analysis result:

Example:

ChipID	SampleID
SampleAhg133plus2	SampleA
SampleBhg133plus2	SampleB

All alv files will be generated to /YourOutputFiles/OnArrayServer/ALV/General_Expression output folder.

Affy SNP array analysis

Single Step: convert CEL file to alv using LandTools ConvertAffymetrixCnvCel

Begin LandTools /Namespace=NgsLib /RunOnServer=True;
SearchFiles "/YourFiles/OnArrayServer/SNP6RawData" /Pattern=*.CEL /Recursive=True;
Reference Human.B37.3;
GeneModel OmicsoftGene20130723;
Options /Action=ConvertAffymetrixCnvCel /SampleIDColumn="SampleID" /CelFileMappingFileName="/YourOutputFiles/OnArrayServer/ALV/SNP6.Design.txt"
/OutputFolder="/YourOutputFiles/OnArrayServer/ALV/SNP_CNV";
End;

This example uses the SearchFiles statement instead of the Files statement, so it searches for all .cel files in the folder /YourFiles/OnArrayServer/SNP6RawData. To run only a subset of the files in the folder, use the Files statement instead:

Begin LandTools /Namespace=NgsLib /RunOnServer=True;
Files
"
/YourFiles/OnArrayServer/SNP6RawData/SampleASNP6.cel
/YourFiles/OnArrayServer/SNP6RawData/SampleBSNP6.cel
";
Reference Human.B37.3;
GeneModel OmicsoftGene20130723;
Options /Action=ConvertAffymetrixCnvCel /SampleIDColumn="SampleID" /CelFileMappingFileName="/YourOutputFiles/OnArrayServer/ALV/SNP6.Design.txt"
/OutputFolder="/YourOutputFiles/OnArrayServer/ALV/SNP_CNV";
End;

CelFileMappingFileName is a design file (the first column needs to be the chipId or celname without .cel suffix) to assign sample ID to each CNV calling result:

Example:

ChipID	SampleID
SampleASNP6	SampleA
SampleBSNP6	SampleB

All alv files will be generated to /YourOutputFiles/OnArrayServer/ALV/SNP_CNV output folder.

Create and publish land

Once all alv files are ready, we can build the land. By default, only an admin can create a new land. Once the land is created, the admin can manage user access and control who can publish or edit land contents.

Create a new land

Create land in Land Tab | Tools | Create Land

CreateLandinGUI.png

Usually, the following land options are sufficient to mimic either an OncoLand or a DiseaseLand type Land environment. If necessary, users can add more options after the land has been created by modifying the LandName.cfg file directly. Name, ReferenceLibrary, and Gene Model cannot be changed once the land has been created. More information about how to create a land in the GUI can be found here: Create Land.

Name=TestOncoLand
ReferenceLibraryID=Human.B37.3
GeneModelID=OmicsoftGene20130723
PrimaryGrouping=Tumor Type
SecondaryGrouping=Sample Type
VariantClassifiers=ClinVar_20160815,FunctionalMutation_20160815,1000GenomesSimple_20160815,ExAC_20160815,ESP6500_20160815,UK10K_20160815,RegulomeDB_20160815
Description=My first land which is built to mimic OncoLands
MutationGeneModelID=Uniprot.Ensembl75

Name=TestDiseaseLand
ReferenceLibraryID=Human.B37.3
GeneModelID=OmicsoftGene20130723
PrimaryGrouping=DiseaseCategory
SecondaryGrouping=TissueCategory
VariantClassifiers=ClinVar_20160815,FunctionalMutation_20160815,1000GenomesSimple_20160815,ExAC_20160815,ESP6500_20160815,UK10K_20160815,RegulomeDB_20160815
Description=My first land which is built to mimic DiseaseLands
MutationGeneModelID=Uniprot.Ensembl75


By default, only an admin can create an empty land and allow a user (or a user group) to publish to or edit the land. The admin can control access for each land in the dialog shown below:

LandControl.png

Publish alv to the new land

Publish to land in Land Tab | Tools | Publish to Land

PublishLand.png

Users can also publish to the land using Oscript:

Begin PublishAlv /Namespace=Land;
Land TestLand;
InputFolder "/YourOutputFiles/OnArrayServer/ALV";
Exclude "";
Options /UseSgen=True /Memory=8GB /Recursive=True /PublishMode=Auto /ParallelJobNumber=4 /DataTypes=(all);
End;

This script runs 4 parallel publish jobs, each using 8 GB of memory. Reduce these values if the machine has less than 4*8=32 GB of memory. Land publishing finds all alv files in the input folder and publishes them for each data type.
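
The memory arithmetic above can be turned into a small sizing helper. The following Python sketch is illustrative only: the function name parallel_jobs_that_fit is hypothetical, and it assumes all of the stated memory is available to the publish jobs, ignoring what the operating system and ArrayServer itself need.

```python
def parallel_jobs_that_fit(total_memory_gb, memory_per_job_gb=8, requested_jobs=4):
    """Return how many of the requested jobs fit in memory at the given per-job size.

    A result of 0 means not even one job fits, so the per-job memory
    (the /Memory option) would have to be lowered as well.
    """
    if memory_per_job_gb <= 0:
        raise ValueError("memory_per_job_gb must be positive")
    affordable = int(total_memory_gb // memory_per_job_gb)
    return min(requested_jobs, affordable)


print(parallel_jobs_that_fit(32))  # the 4*8=32 GB case from the text
print(parallel_jobs_that_fit(16))
```

The result would be used as the /ParallelJobNumber value, keeping /Memory at the per-job size that was passed in.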

Add sample metadata

WARNING: Please see this page for restrictions on metadata column names.

Once alv files are published, users need to Manage Land Sample Metadata for visualization. Data cannot be viewed in the Sample Distribution view if metadata has not been registered in the Land. Add sample metadata in Land Tab | Manage | Samples | Manage Sample Meta Data. If Manage Sample Meta Data is greyed out (as in the screenshot below), you need to ask the admin to change the land permissions for you.

Needadminnew.png

The metadata table must contain the Sample ID, Subject ID, Primary Grouping, and Secondary Grouping columns, named exactly as they were defined in the land.cfg file. Users can view the information that was provided in the land.cfg file by going to Manage | Show Land Statistics:

Showlandstatistics.png

The genome, gene model, Primary Grouping, and Secondary Grouping will be displayed. Remember that the capitalization and spacing of these names in the Sample Metadata file must match exactly.
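
Because the match is sensitive to capitalization and spacing, a quick check of the metadata header before uploading can catch near misses. The following Python sketch is an illustration, not an ArrayStudio feature: the helper names are hypothetical, and the required names in the example ("Sample ID", "Subject ID", "Tumor Type", "Sample Type") are assumed from the text and the TestOncoLand options, so substitute the names your land actually uses.

```python
def missing_columns(header, required):
    """Return the required column names that are not present verbatim in the header."""
    present = set(header)
    return [name for name in required if name not in present]


def near_misses(header, required):
    """Map each missing required name to header columns that differ only in case or whitespace."""
    def normalize(name):
        return "".join(name.split()).lower()

    result = {}
    for name in missing_columns(header, required):
        candidates = [col for col in header if normalize(col) == normalize(name)]
        if candidates:
            result[name] = candidates
    return result


required = ["Sample ID", "Subject ID", "Tumor Type", "Sample Type"]  # assumed names
header = ["Sample ID", "Subject ID", "tumor type", "SampleType"]     # hypothetical file header
print(missing_columns(header, required))
print(near_misses(header, required))
```

A non-empty near-miss report usually means a column only needs to be renamed rather than added.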

TestOncolandStatistics.png

Below is an example of the Sample Metadata file that would match the Test Oncoland from above.

TestOncolandMetadata.png

If a user wants to link RNA-Seq BAM files to data points in the land view, the metadata table needs a "BamFileName" column and the Land configuration needs a few BAM/BAS configuration options.

WARNING: A special note about building internal lands that mimic OmicSoft DiseaseLand. DiseaseCategory and TissueCategory are hard-coded using OmicSoft controlled vocabulary terms. If these columns are defined in the land configuration, they will be auto-filled from the DiseaseState and Tissue columns, respectively. It is best practice NOT to specify these columns manually.


Integrating/combining datatypes for a sample

The example above uses a single sample ID for all data types. If the data do not have a 1-1 mapping between data types, the IntegrationLevel option can link them based on one metadata table column, such as the one below:

AddMetaData2.png

Here, the CNV SNP6 files have three replicates each. They are linked to the other data types using the IntegrationLevel=IntegrationID option in the land. Integration views and analyses will follow the rule defined.
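
To see which samples an integration column would tie together, the metadata rows can be grouped by that column. The following Python sketch only illustrates the grouping idea and is not how the Land implements integration internally; the function name and the sample rows are hypothetical, and the column names SampleID and IntegrationID are assumed from the text.

```python
from collections import defaultdict


def group_by_integration(metadata_rows, integration_column="IntegrationID"):
    """Group SampleID values by the value of the integration column."""
    groups = defaultdict(list)
    for row in metadata_rows:
        groups[row[integration_column]].append(row["SampleID"])
    return dict(groups)


rows = [  # hypothetical metadata: three CNV replicates and one RNA-Seq sample
    {"SampleID": "SampleA_CNV1", "IntegrationID": "SampleA"},
    {"SampleID": "SampleA_CNV2", "IntegrationID": "SampleA"},
    {"SampleID": "SampleA_CNV3", "IntegrationID": "SampleA"},
    {"SampleID": "SampleA_RNA", "IntegrationID": "SampleA"},
]
print(group_by_integration(rows))
```

Each key of the result is one integration unit, and its list holds every sample ID that would be linked under it.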

Enjoy your land

You can now search genes in your land. If you do not see your land data, try logging off the server and logging back in to ArrayServer to refresh the cache.

If you change the land configuration in the cfg file directly, you have to ask the server admin to refresh the land and choose refresh (all).

Also try building a virtual land to do cross-land searches between your land and public lands (such as TCGA and GTEx).
