Build ComparisonLand From Raw Data

From Array Suite Wiki

Revision as of 16:21, 1 November 2018 by Joseph (Talk | contribs)


Create ComparisonLand from ALV file


This wiki page introduces the steps required to build a ComparisonLand from ArrayLand Vector (.alv) files, which contain expression data for samples in an ArrayLand-compatible format.

If you are starting from raw data such as CEL or FASTQ files, you can first use our LandTools to create Land-compatible ALV files from the raw data, and then continue the work here.


There are four steps to building a ComparisonLand from ALV files:

  1. Run statistical analysis with ALV files, and store the result as an ArrayStudio project file (.osprj)
  2. Generate TLV files from the project file (.osprj)
  3. Extract sample metadata, project metadata, and clinical data from the information sheet
  4. Create a new Land, then publish ALV and TLV files to the Land, and load all of the metadata

The example files and oscripts used for this wiki page can be downloaded from our website: ComparisonLand Example Data.

After unzipping, there will be four folders: the input folders for MicroArray data and RNAseq data, and their oscripts in separate folders. The two Excel files outside the subfolders are the contrast mapping files used for Step 2: Convert .osprj file to .tlv files.



Input Data Requirements

Before starting, the user will need to generate the ALV files and, for each project, an information file: an Excel file containing at least three sheets:

  1. project
  2. sample
  3. analysis

(If you have more than one analysis, add sheets named analysis2, analysis3, ...)
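The sheet-naming rule above can be checked before running any oscript. The following is a minimal Python sketch and is not part of Array Studio or LandTools: the function name `check_sheet_names` is hypothetical, and it assumes that sheet names are matched exactly, in lowercase, as listed above.

```python
import re

# Sheets that every information file needs, as listed above.
REQUIRED_SHEETS = ("project", "sample", "analysis")

# Additional analysis sheets are named analysis2, analysis3, ...
EXTRA_ANALYSIS = re.compile(r"analysis([2-9]|[1-9]\d+)")


def check_sheet_names(sheet_names):
    """Return (missing, extra_analyses, other) for a list of sheet names."""
    names = list(sheet_names)
    missing = [s for s in REQUIRED_SHEETS if s not in names]
    extra_analyses = [n for n in names if EXTRA_ANALYSIS.fullmatch(n)]
    other = [
        n for n in names
        if n not in REQUIRED_SHEETS and n not in extra_analyses
    ]
    return missing, extra_analyses, other
```

The sheet names themselves can be read with any Excel library, for example `openpyxl.load_workbook(path).sheetnames`. Sheets such as estimate are reported under `other` and are not treated as errors.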

The information file requires careful curation using controlled vocabulary terms appropriate to the project. We strongly recommend that, where possible, users restrict metadata terms in required columns to the terms already found in large Lands, such as TCGALand and DiseaseLand.



The project sheet requires two columns: Key and Value. Each Key is a project metadata item, and the Value is the project's entry for that item. It is recommended that the user start with the tutorial example file and modify it as appropriate. In this sheet, the user should enter a description of the project, including Project.ID, Project.Accession, Project.Platform, etc. We encourage users to include as much additional information as they can, as this information will be stored with the project file and project metadata in the Land later.


WARNING: The Excel file name must follow a specific format: it must start with the Project.ID, followed by a single space " ", followed by the Project.Platform, and both must match the "Project.ID" and "Project.Platform" fields in the project sheet. For example, the Excel file name for the RNA-seq example project must be: GSE47718 GPL11154.xlsx. We strongly recommend using the proper Platform GPL for RNA-seq data, instead of a generic "RNASEQ" label.
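As an illustration of the naming rule, the following Python sketch splits an information file name into its Project.ID and Project.Platform parts. It is not part of LandTools; the function names are hypothetical, and it assumes that neither the Project.ID nor the Project.Platform contains a space.

```python
import re
from pathlib import PurePath


def expected_file_name(project_id, platform):
    """Build the information file name required for a project."""
    return f"{project_id} {platform}.xlsx"


def split_design_file_name(path):
    """Split '<Project.ID> <Project.Platform>.xlsx' into (id, platform).

    Returns None when the name does not follow the pattern.
    """
    p = PurePath(path)
    if p.suffix.lower() != ".xlsx":
        return None
    # Exactly two space-free tokens separated by one space (assumption).
    match = re.fullmatch(r"(\S+) (\S+)", p.stem)
    return (match.group(1), match.group(2)) if match else None
```

The two values returned can then be compared by hand against the Project.ID and Project.Platform entries in the project sheet.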



The sample sheet serves as the sample design metadata file and should contain useful information about each sample in the expression data set. ComparisonLand requires at least two sample metadata columns, Tissue and OlandDiseaseState (displayed in your Land metadata as DiseaseState), and their values should conform to existing terms in ArrayLands. DiseaseCategory and TissueCategory are mapped automatically from OlandDiseaseState and Tissue, so users do not need to enter these two columns manually. Besides these columns, you can include any additional sample metadata, such as OlandCellType (displayed in your Land metadata as CellType), organism, molecule, SubjectID, etc.
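The column requirements for the sample sheet can be expressed as a small check. This is a sketch only, with a hypothetical function name; it assumes column names are compared exactly as spelled above.

```python
# Columns that ComparisonLand requires in the sample sheet.
REQUIRED_SAMPLE_COLUMNS = ("Tissue", "OlandDiseaseState")

# Columns that are mapped automatically and need not be entered by hand.
AUTO_MAPPED_COLUMNS = ("DiseaseCategory", "TissueCategory")


def review_sample_columns(columns):
    """Report missing required columns and columns that are auto-mapped."""
    present = set(columns)
    return {
        "missing_required": [
            c for c in REQUIRED_SAMPLE_COLUMNS if c not in present
        ],
        "auto_mapped_present": [
            c for c in AUTO_MAPPED_COLUMNS if c in present
        ],
    }
```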

WARNING: Please see this page for restrictions on metadata column names.




The analysis sheet defines the statistical analysis: the statistical model and the comparisons to make. It does so by naming the columns in the sample sheet that define the sample groupings for the model.

Screenshots from the GUI are also included to show how the same analysis would be done in the ArrayStudio GUI, so that users who are familiar with processing data in the GUI can better understand these settings.


To run the same test in the GUI, specify the linear model and the test like this:


WARNING: Consult your company’s biostatistician to ensure that the proper statistical model is constructed.

As mentioned above, if there are additional comparisons, add further sheets named analysis2, analysis3, etc.

[Figures: ComparisonLand7.png, ComparisonLand8.png]

In the GUI, the model for these two tests can be set like this:


The test can be set like this (shown only for the comparison to Control1):


If the user wants to run pairwise comparisons for a certain column in the sample metadata, the user can include an estimate sheet to specify the additional comparisons. For example, for the columns extragroup1 (g1 vs g3, g2 vs g3, g1 vs g2) and extragroup2 (A1 vs A4, A2 vs A4, A3 vs A4, A1 vs A3, A2 vs A3, A1 vs A2), the user can set the analysis sheet and the estimate sheet as the following figure shows:


In the Compare section, fill in the column name; leave the CompareTo section blank or fill it with ".". The estimate sheet lists all of the comparisons for the selected column, for instance g1 vs g3, g2 vs g3, and g1 vs g2. Similarly, the analysis2 sheet has its CompareTo section filled with ".", and the estimate sheet contains all of the pairwise comparisons for "extragroup2".
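To make sure no pairwise comparison is missed when filling in the estimate sheet, the full set of pairs for a column can be enumerated. The Python sketch below only lists the pairs; it does not reproduce the layout of the estimate sheet, and the function name is hypothetical. A column with n levels yields n(n-1)/2 pairs.

```python
from itertools import combinations


def pairwise_comparisons(levels):
    """Return every unordered pair of levels as a tuple."""
    return list(combinations(levels, 2))


# Levels of the two example columns described above.
extragroup1 = pairwise_comparisons(["g1", "g2", "g3"])
extragroup2 = pairwise_comparisons(["A1", "A2", "A3", "A4"])

for first, second in extragroup1:
    print(f"{first} vs {second}")
```

This gives three pairs for extragroup1 and six for extragroup2, the same sets as listed above, although in a different order.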

The pairwise comparison is equivalent to this setting in the GUI (only the comparison for extragroup1 is shown here):


The analysis sheet also contains rows with the IDs "Series" and "DataPlatform". DataPlatform is a historical term and is no longer used. "Series" can be used to filter for the samples to include in the analysis. For instance, if we have a design table like this:


The Stage column has four values (S1, S2, S3, S4). If we only want to run the statistical analysis on S1 and S3, we can set the analysis sheet like this:
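The effect of such a filter on the sample list can be sketched in Python. This illustrates only which samples remain in the analysis; it does not show the syntax of the Series row, and the sample IDs and function name are hypothetical.

```python
def filter_samples(samples, column, keep):
    """Keep only samples whose value in `column` is one of `keep`."""
    wanted = set(keep)
    return [s for s in samples if s.get(column) in wanted]


# Hypothetical design table with a Stage column.
samples = [
    {"ID": "sample1", "Stage": "S1"},
    {"ID": "sample2", "Stage": "S2"},
    {"ID": "sample3", "Stage": "S3"},
    {"ID": "sample4", "Stage": "S4"},
]

selected = filter_samples(samples, "Stage", ["S1", "S3"])
print([s["ID"] for s in selected])
```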



Step 1: Convert .alv files to .osprj

Once the user has generated the ALV files and curated the information Excel file as described above, the user can convert the .alv files to an .osprj file by running the following oscripts in Array Studio.

This conversion step includes performing the statistical comparisons specified in the analysis sheet, depending on the type of data being converted.

If the ALV contains MicroArray data, use this oscript to generate .osprj file:

Begin ComparisonLandTools /Namespace=NgsLib;
Files "
Reference Human.B37.3;
GeneModel OmicsoftGene20130723;
    /ProjectName="GSE1786 GPL96" 
    /ProjectDesignFileName="/Inputfolder/GSE1786 GPL96.xlsx" 

If the ALV contains RNASeq data, then change the Options to:

Begin ComparisonLandTools /Namespace=NgsLib;
Files "   
    Reference Human.B37.3;
    GeneModel OmicsoftGene20130723;	
        /ProjectName="GSE47718 GPL11154"
        /ProjectDesignFileName="/Inputfolder/ALV/Rnaseq/GSE47718 GPL11154.xlsx" 
  • Input: ALV files and the metadata Excel file
  • Action defines whether the data are MicroArray or RnaSeq, and ProjectDesignFileName defines the location and name of the information file that contains the three sheets described above.
  • The ProjectName field in the oscript must follow the pattern Project.ID, followed by a space " ", followed by Project.Platform, e.g. GSE47718 GPL11154.

Statistical analysis will be carried out according to the model and comparisons set in the metadata Excel file. Microarray data will be analyzed with a General Linear Model, while RNA-seq data will be analyzed using DESeq, with default module parameters.

Step 1 Output

Once this step is completed, an .osprj file containing a "simple project" will be generated. It can be opened in ArrayStudio after downloading it to a local drive. The user can check the inference table, which contains fold-change, p-value and adjusted p-value calculations. The user can also use Microarray and RNA-seq analysis functions, such as those described in the relevant tutorials, to analyze the data.



The error message "The requested feature is not implemented" likely means that your input ALV files are not supported. Most often, this means that you are using the GeneBas ALV files instead of the RnaSeq_Transcript ALV files. Despite the name, the RnaSeq_Transcript files contain both transcript-level and gene-level quantification, as used for differential expression analysis.

Step 2: Convert .osprj file to .tlv files

Once step 1 has finished, there will be an .osprj file in the output folder defined in step 1. Now the user can convert the .osprj file to TLV files. This step needs another Excel file, which serves as the mapping file for the TLV conversion. This mapping file must contain the columns Index, ProjectDataName, AnalysisName, Contrast and ContrastType.


For each analysis specified in the Project Design File, you will need to specify the analysis to export to TLV. These will usually be listed as analysis, analysis2, analysis3, etc.

Open the ProjectName PlatformName.osprj file in Array Studio to inspect the contrasts you generated. You should see at least one analysis in the Inference tab. If you see more than one, you will need to list these additional analyses in the contrast mapping Excel file.

  • Index: 1, 2, 3, 4, ......
  • ProjectDataName: must be the same as the Project Name of the .osprj file
  • AnalysisName: analysis, analysis2, analysis3, analysis4, ......
  • Contrast: Should always be set to all
  • TestLandID: default
  • SampleSetProperties: The user can leave this empty or enter the names of columns of interest that are present in the sample metadata. Any column name entered here must follow the rules explained in the warning below.
TIP: If this column is left empty, the qualifying column information will be parsed into the TLV metadata automatically.

WARNING: If any columns are specified in SampleSetProperties, you must also include Tissue, and every column included here MUST be consistent within the sample metadata for each side of each comparison. For instance, if the user is comparing samples in groupA and groupB, the Tissue term must be the same for all groupA samples, and the Tissue term must be the same for all groupB samples.

  • ContrastType: The user should fill in this column using the controlled vocabulary recognized by our Land tools. The recognized terms can be found on the ContrastType page.
TIP: The controlled vocabulary for ContrastType is case sensitive.
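The consistency rule for SampleSetProperties can be checked on the sample metadata before the TLV conversion. The Python sketch below is not part of LandTools; the function name, the grouping column name and the sample values are hypothetical. It reports every group in which a listed property takes more than one value.

```python
from collections import defaultdict


def inconsistent_properties(samples, group_column, properties):
    """Return {(group, property): values} where a group has mixed values."""
    seen = defaultdict(set)
    for sample in samples:
        for prop in properties:
            seen[(sample[group_column], prop)].add(sample[prop])
    return {key: sorted(vals) for key, vals in seen.items() if len(vals) > 1}


# Hypothetical sample metadata: groupB mixes two Tissue terms.
samples = [
    {"Group": "groupA", "Tissue": "liver"},
    {"Group": "groupA", "Tissue": "liver"},
    {"Group": "groupB", "Tissue": "liver"},
    {"Group": "groupB", "Tissue": "kidney"},
]

problems = inconsistent_properties(samples, "Group", ["Tissue"])
print(problems)
```

An empty result means every listed property is constant within each group.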

Extra columns can be added to the mapping file to hold additional information about the comparisons.
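The rows of the mapping file follow directly from the column rules above, so they can be drafted programmatically and pasted into Excel. The sketch below is an illustration under several assumptions: the function names are hypothetical, the tab-separated output is only a convenience for pasting (the mapping file itself is an Excel file), "default" is taken to be the literal TestLandID value, and the ContrastType values must be replaced with terms from the ContrastType page.

```python
import csv
import io

COLUMNS = [
    "Index", "ProjectDataName", "AnalysisName", "Contrast",
    "TestLandID", "SampleSetProperties", "ContrastType",
]


def mapping_rows(project_name, contrast_types):
    """Build one row per analysis: analysis, analysis2, analysis3, ..."""
    rows = []
    for index, contrast_type in enumerate(contrast_types, start=1):
        rows.append({
            "Index": index,
            "ProjectDataName": project_name,
            "AnalysisName": "analysis" if index == 1 else f"analysis{index}",
            "Contrast": "all",           # should always be set to all
            "TestLandID": "default",     # assumed literal value
            "SampleSetProperties": "",   # left empty; parsed automatically
            "ContrastType": contrast_type,
        })
    return rows


def to_tsv(rows):
    """Render the rows as tab-separated text with a header line."""
    buffer = io.StringIO()
    writer = csv.DictWriter(
        buffer, fieldnames=COLUMNS, delimiter="\t", lineterminator="\n"
    )
    writer.writeheader()
    writer.writerows(rows)
    return buffer.getvalue()
```

For example, `to_tsv(mapping_rows("GSE1786 GPL96", ["<term 1>", "<term 2>"]))` produces a header line and two rows, for analysis and analysis2, where the placeholders stand for real ContrastType terms.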

This oscript can be used for the TLV conversion from .osprj file:

Begin ComparisonLandTools /Namespace=NgsLib;
Files "Outputfolder/GSE1786 GPL96.osprj";
Reference Human.B37.3;
GeneModel OmicsoftGene20130723;     
  • Input: The .osprj file that contains all of the inference reports

Step 2 Output

Once this step is complete, TLV files will be created, one per comparison.


Step 3: Extract MetaData from information file

After the TLV files are generated, they are ready to be published to Land. But first, in this step we will extract the sample metadata, project metadata, and the clinical data from the information file we used in step 1.

This oscript can be used to extract the metadata:

Begin ComparisonLandTools /Namespace=NgsLib;
Files "

The BamFileName.txt file applies only to RNASeq data, not to MicroArray data, and even for RNASeq data it is optional. (To use it, remove the "#" in the script.) The file can be very simple and contain just two columns, "ID" and "BamFileName":

ID  BamFileName
sample1  sample1.bam
sample2  sample2.bam

In the downloaded example data, the user can find an example BamFileName.txt in the main folder.
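A file with this layout can be generated from a list of sample IDs. The Python sketch below assumes that the file is tab-delimited and that each BAM file is named after its sample ID, as in the example above; neither is confirmed on this page, and the function name is hypothetical.

```python
import csv
import io


def bam_table(sample_ids):
    """Return the text of a two-column ID / BamFileName table."""
    buffer = io.StringIO()
    # Tab delimiter is an assumption about the expected file format.
    writer = csv.writer(buffer, delimiter="\t", lineterminator="\n")
    writer.writerow(["ID", "BamFileName"])
    for sample_id in sample_ids:
        writer.writerow([sample_id, f"{sample_id}.bam"])
    return buffer.getvalue()


print(bam_table(["sample1", "sample2"]), end="")
```

The returned text can be written to BamFileName.txt with `open("BamFileName.txt", "w").write(...)`. Compare the result with the example file from the download before using it.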

Step 3 Output

This step will result in several files in the outputfolder\MicroArray_metaData:


These files will be used in the next step:

  • MetaData.osobj: contains the sample metadata and can be loaded to the Land later. By default, only the following sample-level metadata columns are included in the Land Sample metadata columns:
"Land Sample Type", "LandTissue", "CellLine", "DesignFeatures", "PatientID", "SubjectID", "Organism", "SampleSource", "Tissue", "CellType", "DiseaseState", "DiseaseStage", "Symptom", "SamplePathology", "Treatment", "Response", "Transfection", "Infection", "SamplingTime", "Ethnicity", "Gender", "Title", "Description", "BamFileName". Any other columns will be put in the “Clinical metadata”.
  • .cli file: Any columns from the sample design file that are not included in MetaData.osobj are parsed into this clinical metadata file.
  • ProjectMetaData.osobj: contains the project metadata and can be loaded to Land later
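To anticipate where each column of the sample sheet will end up, the default split described above can be sketched in Python. This is an approximation with a hypothetical function name: it matches column names exactly against the default list and does not model renaming such as OlandDiseaseState to DiseaseState.

```python
# Default Land Sample metadata columns, as listed above.
LAND_SAMPLE_COLUMNS = {
    "Land Sample Type", "LandTissue", "CellLine", "DesignFeatures",
    "PatientID", "SubjectID", "Organism", "SampleSource", "Tissue",
    "CellType", "DiseaseState", "DiseaseStage", "Symptom",
    "SamplePathology", "Treatment", "Response", "Transfection",
    "Infection", "SamplingTime", "Ethnicity", "Gender", "Title",
    "Description", "BamFileName",
}


def split_columns(columns):
    """Split column names into (Land sample metadata, clinical metadata)."""
    land = [c for c in columns if c in LAND_SAMPLE_COLUMNS]
    clinical = [c for c in columns if c not in LAND_SAMPLE_COLUMNS]
    return land, clinical
```

For example, `split_columns(["Tissue", "Gender", "Age", "Stage"])` places Tissue and Gender in the sample metadata and Age and Stage in the clinical metadata.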

Step 4: Create Land, publish .alv and .tlv, and load metadata

The user can refer to our Landtools page to create and publish Land, where the user will find the steps for creating a Land, publishing .alv or .tlv files to the Land, and adding sample metadata.

As we have extracted the sample metadata and project metadata from the information file and saved them as .osobj files in step 3, we can simply choose to load the metadata from the .osobj files.


After this, the user can refresh the Land like this:

[Figures: ComparisonLand13.png, ComparisonLand14.png]

Then log off and log in to the server again. The ComparisonLand is now built and ready to use.


Output and explore with the new Land

The output of these steps will be a new Land on ArrayServer, and the user can explore the new Land similarly to DiseaseLand.

Normally, the user should see a sample distribution view once the Land is opened. If there is no view, the cause may be a mismatch between the metadata files and the PrimaryGrouping and SecondaryGrouping settings chosen when the new Land was created. “PrimaryGrouping” and “SecondaryGrouping” should be set to two columns that exist in the metadata. For detailed settings, please refer to create and publish Land.

In normal use, the user can search for a gene and see the comparisons related to it. For instance, this is the view shown when searching for the gene "ERG" in a small ComparisonLand:


The relative expression of case vs control samples is displayed in the main window, comparison metadata are displayed in the details window, and raw expression values for each sample in the comparison are displayed as boxplots.

