Build ComparisonLand From Raw Data
From Array Suite Wiki
Create ComparisonLand from ALV file
If you are starting from raw data such as CEL or FASTQ files, you can use our LandTools to create Land-compatible ALV files from:
- RNA-Seq files (two-step process) or Single-step RNA-seq pipeline with .Alv output
- Affy Expression files
And then continue the work here.
There are four steps to building a ComparisonLand from ALV files:
- Run statistical analysis with ALV files, and store the result as an ArrayStudio project file (.osprj)
- Generate TLV files from the project file (.osprj)
- Extract sample metadata, project metadata, and clinical data from the information sheet
- Create a new Land, then publish ALV and TLV files to the Land, and load all of the metadata
The example files and oscripts used for this wiki page can be downloaded from our website: ComparisonLand Example Data.
After unzipping, there will be four folders: the input folders for MicroArray data and RNAseq data, and their oscripts in separate folders. The two Excel files outside the subfolders are the contrast mapping files used for Step 2: Convert .osprj file to .tlv files.
Input Data Requirements
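The information file is an Excel workbook with three sheets: a project sheet, a sample sheet, and an analysis sheet (if you have more than one analysis, add sheets named analysis2, analysis3, ...).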
The information file requires careful curation using controlled vocabulary terms appropriate to the project. We strongly recommend that, where possible, users should restrict metadata terms in required columns to the terms already found in large Lands, such as TCGALand and DiseaseLand.
The project sheet requires two columns: Key and Value. Each Key is a project metadata item, and the Value is the project's entry for that item. We recommend starting with the tutorial example file and modifying it as appropriate. In this sheet, enter a description of the project, including Project.ID, Project.Accession, Project.Platform, etc. We encourage users to include as much additional information as they can, as this information will be stored with the project file and, later, with the project metadata in the Land.
The Sample sheet serves as the sample design metadata file and should contain useful information about each sample in the expression data set. ComparisonLand requires at least two sample metadata columns: Tissue and OlandDiseaseState (displayed in your Land metadata as DiseaseState). Values in both columns should conform to existing terms in ArrayLands. DiseaseCategory and TissueCategory are mapped automatically from OlandDiseaseState and Tissue, so users do not need to enter these two columns manually. Besides these columns, you can include any additional sample metadata, such as OlandCellType (displayed in your Land metadata as CellType), organism, molecule, SubjectID, etc.
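Before running any oscript, it can help to check the curated sheets for the items named above. The following is a minimal Python sketch, not part of LandTools: the function names are hypothetical, the sheets are represented as plain Python lists rather than read from Excel, and treating the three project keys as mandatory is an assumption made only for this illustration.

```python
# Keys and column names come from the text above; treating the three
# project keys as mandatory is an assumption of this sketch.
ASSUMED_PROJECT_KEYS = ("Project.ID", "Project.Accession", "Project.Platform")
REQUIRED_SAMPLE_COLUMNS = ("Tissue", "OlandDiseaseState")


def missing_project_keys(project_rows):
    """project_rows: iterable of (Key, Value) pairs from the project sheet."""
    present = {key for key, value in project_rows if str(value).strip()}
    return [key for key in ASSUMED_PROJECT_KEYS if key not in present]


def missing_sample_columns(sample_header):
    """sample_header: list of column names from the sample sheet."""
    return [col for col in REQUIRED_SAMPLE_COLUMNS if col not in sample_header]


# Hypothetical sheet contents, for illustration only.
project_rows = [
    ("Project.ID", "GSE1786"),
    ("Project.Platform", "GPL96"),
]
sample_header = ["ID", "Tissue", "OlandDiseaseState", "SubjectID"]

print(missing_project_keys(project_rows))     # ['Project.Accession']
print(missing_sample_columns(sample_header))  # []
```

A check like this only confirms that the columns and keys exist; it does not verify that the values use the controlled vocabulary.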
The analysis sheet provides the information for the statistical analysis: the statistical model and the detailed comparison, both defined by naming the columns of the Samples sheet that hold the sample groupings.
We have also attached screenshots from the GUI to show how the same analysis is done in the ArrayStudio GUI, so that users who are familiar with processing data in our GUI can better understand these settings.
The same test can be done in the GUI by specifying the linear model and the test like this:
As mentioned above, if there are additional comparisons, just add another sheet named analysis2, analysis3, etc.
In the GUI, the model can be set like this for these two tests:
And the test can be set like this (only for comparing to Control1):
If the user wants to do pairwise comparisons for a certain column in the sample metadata, the user can include an estimate tab to specify the additional comparisons. For example, with the columns extragroup1 (g1 vs g3, g2 vs g3, g1 vs g2) and extragroup2 (A1 vs A4, A2 vs A4, A3 vs A4, A1 vs A3, A2 vs A3, A1 vs A2), the user can set the analysis sheet and the estimate sheet as the following figure shows:
In the Compare section, fill in the column name; in the CompareTo section, leave it blank or fill it with ".". The estimate sheet specifies all of the comparisons for the selected column, for instance g1 vs g3, g2 vs g3, and g1 vs g2. Similarly, the analysis2 sheet will have the CompareTo section filled with ".", and the estimate sheet contains all of the pairwise comparisons for "extragroup2".
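To double-check that an estimate sheet lists every pairwise comparison for a column, the full set can be enumerated programmatically. This is a small Python sketch using the group labels from the example above; the function name is hypothetical, and the "X vs Y" string is just a display format, not necessarily the syntax the estimate sheet expects.

```python
from itertools import combinations


def pairwise_contrasts(levels):
    """Return every unordered pair of levels as an 'X vs Y' label."""
    return [f"{a} vs {b}" for a, b in combinations(levels, 2)]


extragroup1 = ["g1", "g2", "g3"]
extragroup2 = ["A1", "A2", "A3", "A4"]

print(pairwise_contrasts(extragroup1))       # ['g1 vs g2', 'g1 vs g3', 'g2 vs g3']
print(len(pairwise_contrasts(extragroup2)))  # 6
```

The pairs are the same as those listed in the text (three for extragroup1, six for extragroup2), although the order differs. A column with n levels yields n(n-1)/2 comparisons.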
The pairwise comparison is equivalent to this setting in the GUI (only showing the comparison for extragroup1 here):
The analysis sheet also contains rows with the row IDs "Series" and "DataPlatform". DataPlatform is a historical term and is no longer used. "Series" can be used to filter for the samples to include in the analysis. For instance, if we have a design table like this:
The Stage column has four values: S1, S2, S3, and S4. If we only want to run the statistical analysis on S1 and S3, we can set the analysis sheet like this:
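Independently of how the sheet itself is filled in, the effect of such a Series filter can be pictured in a few lines of Python. This sketch only illustrates the intended outcome (samples with Stage S1 or S3 are kept, the rest are excluded); it is not how ComparisonLandTools implements the filter, and the sample IDs and function name are made up for the example.

```python
# Hypothetical design table with one row per sample.
samples = [
    {"ID": "sample1", "Stage": "S1"},
    {"ID": "sample2", "Stage": "S2"},
    {"ID": "sample3", "Stage": "S3"},
    {"ID": "sample4", "Stage": "S4"},
]


def filter_samples(rows, column, keep_values):
    """Keep only the rows whose value in `column` is one of keep_values."""
    keep = set(keep_values)
    return [row for row in rows if row[column] in keep]


selected = filter_samples(samples, "Stage", ["S1", "S3"])
print([row["ID"] for row in selected])  # ['sample1', 'sample3']
```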
Step 1: Convert .alv files to .osprj
Once the user has generated the ALV files and curated the information Excel file as described above, the user can convert the .alv files to an .osprj file by running the following Oscripts in Array Studio.
This conversion step includes performing the statistical comparisons specified in the analysis sheet, depending on the type of data being converted.
If the ALV contains MicroArray data, use this oscript to generate .osprj file:
Begin ComparisonLandTools /Namespace=NgsLib;
Files "
/Inputfolder/ALV/Expression_Intensity_Probes.GSM30836.alv
/Inputfolder/ALV/Expression_Intensity_Probes.GSM48118.alv
";
Reference Human.B37.3;
GeneModel OmicsoftGene20130723;
Options /Action=AnalyzeMicroArrayData /ProjectName="GSE1786 GPL96" /ProjectDesignFileName="/Inputfolder/GSE1786 GPL96.xlsx" /MappingID=Affymetrix.HG-U133A_Human.B37.3 /ParallelJobNumber=1 /ThreadNumber=1 /OutputFolder="/OutputFolder/TLV/MicroArray";
End;
If the ALV contains RNASeq data, then change the Options to:
Begin ComparisonLandTools /Namespace=NgsLib;
Files "
/Inputfolder/ALV/Rnaseq/RnaSeq_Transcript.GSM1155370.alv
/Inputfolder/ALV/Rnaseq/RnaSeq_Transcript.GSM1155371.alv
";
Reference Human.B37.3;
GeneModel OmicsoftGene20130723;
Options /Action=AnalyzeRnaSeqData /ProjectName="GSE47718 GPL11154" /ProjectDesignFileName="/Inputfolder/ALV/Rnaseq/GSE47718 GPL11154.xlsx" /ParallelJobNumber=1 /ThreadNumber=1 /OutputFolder="/Outputfolder/TLV/Rnaseq";
End;
- Input: ALV files and metadata Excel file
- Action defines whether it's MicroArray data or RnaSeq data, and the ProjectDesignFileName defines the location and name of the information file that contains the three sheets as mentioned above.
- The ProjectName field in the oscript must follow the pattern Project.ID, followed by a space " ", followed by Project.Platform.
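The ProjectName rule is easy to get wrong (for example, by using an underscore instead of a space), so it may be worth checking before submitting the job. The Python sketch below simply joins the two project sheet values with a single space and compares the result with the name used in the oscript; both function names are hypothetical helpers, not part of any Omicsoft tool.

```python
def expected_project_name(project_id, project_platform):
    """Join Project.ID and Project.Platform with exactly one space."""
    return f"{project_id.strip()} {project_platform.strip()}"


def project_name_matches(project_name, project_id, project_platform):
    """True when the oscript ProjectName follows the 'ID Platform' pattern."""
    return project_name == expected_project_name(project_id, project_platform)


print(expected_project_name("GSE1786", "GPL96"))                  # GSE1786 GPL96
print(project_name_matches("GSE1786_GPL96", "GSE1786", "GPL96"))  # False
```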
Statistical analysis will be carried out according to the model and comparisons set in the metadata Excel file. Microarray data will be analyzed by General Linear Model, while RNA-seq data will be analyzed using DESeq, using default module parameters.
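For projects with many ALV files, typing the Files block by hand is error-prone, and the Step 1 oscript text can instead be generated. The Python sketch below assembles the script from a list of ALV paths using only the options that appear in the two examples above. The function name is hypothetical, and the layout (one file path per line inside the quotes) is an assumption about how the oscript is normally written, so the output should be reviewed before it is run in Array Studio.

```python
def render_step1_oscript(alv_paths, action, project_name, design_file,
                         output_folder, mapping_id=None):
    """Return Step 1 oscript text built from the options shown in the examples."""
    options = [
        f"/Action={action}",
        f'/ProjectName="{project_name}"',
        f'/ProjectDesignFileName="{design_file}"',
    ]
    if mapping_id is not None:
        # Only the MicroArray example above includes /MappingID.
        options.append(f"/MappingID={mapping_id}")
    options += [
        "/ParallelJobNumber=1",
        "/ThreadNumber=1",
        f'/OutputFolder="{output_folder}"',
    ]
    lines = ["Begin ComparisonLandTools /Namespace=NgsLib;", 'Files "']
    lines += list(alv_paths)  # assumed layout: one ALV path per line
    lines += [
        '";',
        "Reference Human.B37.3;",
        "GeneModel OmicsoftGene20130723;",
        "Options " + " ".join(options) + ";",
        "End;",
    ]
    return "\n".join(lines)


script = render_step1_oscript(
    alv_paths=[
        "/Inputfolder/ALV/Rnaseq/RnaSeq_Transcript.GSM1155370.alv",
        "/Inputfolder/ALV/Rnaseq/RnaSeq_Transcript.GSM1155371.alv",
    ],
    action="AnalyzeRnaSeqData",
    project_name="GSE47718 GPL11154",
    design_file="/Inputfolder/ALV/Rnaseq/GSE47718 GPL11154.xlsx",
    output_folder="/Outputfolder/TLV/Rnaseq",
)
print(script)
```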
Step 1 Output
Once this step is completed, an .osprj file will be generated. It contains a "simple project" that can be opened in ArrayStudio after downloading it to a local drive. The user can check the inference table, which contains fold-change, p-value, and adjusted p-value calculations. The user can also use Microarray and RNA-seq analysis functions, such as those described in the relevant tutorials, to analyze the data.
The error message "The requested feature is not implemented" likely means that your input ALV files are not supported. Most often, this means that you are using the GeneBas ALV files, not the RnaSeq_Transcript ALV files. Despite the name, the RnaSeq_Transcript files contain both transcript-level and gene-level quantification, as used for differential expression analysis.
Step 2: Convert .osprj file to .tlv files
Once Step 1 has finished, there will be an .osprj file in the output folder defined in Step 1. Now the user can convert the .osprj file to TLV files. This step needs another Excel file, which serves as the mapping file for the TLV conversion. This mapping file must contain the columns Index, ProjectDataName, AnalysisName, Contrast, and ContrastType.
- Index: 1, 2, 3, 4, ......
- ProjectDataName: must be the same as the Project Name of the .osprj file
- AnalysisName: analysis, analysis2, analysis3, analysis4, ......
- Contrast: Should always be set to all
- TestLandID: default
- SampleSetProperties: can be left empty, or can list the names of columns of interest that are present in the sample metadata. Any column name entered here must follow the rules explained in the following warnings.
- ContrastType: the values in this column must use the controlled vocabulary recognized by our Land tools. The recognized terms can be found on the ContrastType page.
Extra columns can be added to the mapping file to hold additional information about the comparisons.
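The column rules above can be turned into a quick generator for a draft mapping table. The Python sketch below assumes one row per analysis sheet, which is how the Index and AnalysisName sequences above read but is not stated explicitly. It prints tab-delimited text that can be pasted into Excel. The function names are hypothetical, and the ContrastType value is a placeholder to be replaced with a term from the ContrastType page.

```python
import csv
import io

MAPPING_COLUMNS = [
    "Index", "ProjectDataName", "AnalysisName", "Contrast",
    "TestLandID", "SampleSetProperties", "ContrastType",
]


def analysis_sheet_name(position):
    """1 -> 'analysis', 2 -> 'analysis2', 3 -> 'analysis3', ..."""
    return "analysis" if position == 1 else f"analysis{position}"


def build_mapping_rows(project_name, contrast_types):
    """One row per analysis sheet (an assumption of this sketch)."""
    rows = []
    for position, contrast_type in enumerate(contrast_types, start=1):
        rows.append({
            "Index": position,
            "ProjectDataName": project_name,
            "AnalysisName": analysis_sheet_name(position),
            "Contrast": "all",
            "TestLandID": "default",
            "SampleSetProperties": "",
            "ContrastType": contrast_type,
        })
    return rows


def to_tab_delimited(rows):
    """Serialize the rows as tab-delimited text with a header line."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=MAPPING_COLUMNS,
                            delimiter="\t", lineterminator="\n")
    writer.writeheader()
    writer.writerows(rows)
    return buffer.getvalue()


# Placeholder only: replace with a term from the ContrastType page.
PLACEHOLDER = "<ContrastType term>"
rows = build_mapping_rows("GSE1786 GPL96", [PLACEHOLDER, PLACEHOLDER])
print(to_tab_delimited(rows), end="")
```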
This oscript can be used for the TLV conversion from .osprj file:
Begin ComparisonLandTools /Namespace=NgsLib;
Files "Outputfolder/GSE1786 GPL96.osprj";
Reference Human.B37.3;
GeneModel OmicsoftGene20130723;
Options /Action=ConvertInferenceReportToTlv /ComparisonMetaDataFileName="/Inputfolder/MicroArray_LandTestContrastMapping.xlsx" /DataFormat="xls" /OutputFolder="/Outputfolder/TLV";
End;
- Input: The .osprj file that contains all of the inference reports
Step 2 Output
Once this step is complete, TLV files will be created, one per comparison.
Step 3: Extract MetaData from information file
After the TLV files are generated, they are ready to be published to Land. But first, in this step we will extract the sample metadata, project metadata, and the clinical data from the information file we used in step 1.
This oscript can be used to extract the metadata:
Begin ComparisonLandTools /Namespace=NgsLib;
Files "
/Inputfolder/GSE1786_GPL96.xlsx
";
Options /Action=ExtractMetaData
#/BamPropertyFileName="/filepath/BamFilename.txt"
/ColumnNameMappingFileName="" /SkipSampleList="" /IncludeSampleList="" /FileNamePrefix="GSE1786_GPL96" /OutputFolder="/Outputfolder/MicroArray_metaData";
End;
The BamFilename.txt file only applies to RNASeq data, not to MicroArray data, and even for RNASeq data it is not essential. (To use it, just remove the "#" in the script.) It can be very simple and contain just two columns, one for "ID" and one for "BamFileName":
ID         BamFileName
sample1    sample1.bam
sample2    sample2.bam
In the downloaded archive, the user can find an example BamFileName.txt in the main folder.
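A two-column file of this kind can also be produced from a list of sample IDs. The Python sketch below is one way to do so, under two assumptions that are not confirmed by the text: that the file is tab-delimited, and that each BAM file is named after its sample ID with a ".bam" suffix. The function name is hypothetical. Compare the output with the example file in the download before using it.

```python
import csv
import io


def bam_property_table(sample_ids, suffix=".bam"):
    """Return tab-delimited text with an ID column and a BamFileName column."""
    buffer = io.StringIO()
    writer = csv.writer(buffer, delimiter="\t", lineterminator="\n")
    writer.writerow(["ID", "BamFileName"])
    for sample_id in sample_ids:
        # Assumed naming scheme: <sample ID> + ".bam"
        writer.writerow([sample_id, f"{sample_id}{suffix}"])
    return buffer.getvalue()


text = bam_property_table(["sample1", "sample2"])
print(text, end="")

# To save the table, write `text` to a file, for example:
# with open("BamFilename.txt", "w", newline="") as handle:
#     handle.write(text)
```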
Step 3 Output
This step will result in several files in the outputfolder\MicroArray_metaData:
These files will be used in the next step:
- MetaData.osobj: contains the sample metadata and can be loaded to Land later. By default, only some sample-level metadata columns will be included in the Land Sample metadata columns. These column names are:
"Land Sample Type", "LandTissue", "CellLine", "DesignFeatures", "PatientID", "SubjectID", "Organism", "SampleSource", "Tissue", "CellType", "DiseaseState", "DiseaseStage", "Symptom", "SamplePathology", "Treatment", "Response", "Transfection", "Infection", "SamplingTime", "Ethnicity", "Gender", "Title", "Description", "BamFileName". Any other columns will be put in the “Clinical metadata”.
- .cli file: Any other columns from the sample design file that are not included in MetaData.osobj will be parsed into this clinical metadata file.
- ProjectMetaData.osobj: contains the project metadata and can be loaded to Land later
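To anticipate where each column of a sample sheet will end up, the default column list above can be applied to a header row. The Python sketch below mirrors the rule as described (listed columns go to the Land sample metadata, everything else to the clinical metadata). Exact, case-sensitive name matching is an assumption, the function name is hypothetical, and "Age" and "SmokingStatus" are made-up example columns.

```python
# Default Land sample metadata columns, copied from the list above.
DEFAULT_LAND_SAMPLE_COLUMNS = [
    "Land Sample Type", "LandTissue", "CellLine", "DesignFeatures",
    "PatientID", "SubjectID", "Organism", "SampleSource", "Tissue",
    "CellType", "DiseaseState", "DiseaseStage", "Symptom",
    "SamplePathology", "Treatment", "Response", "Transfection",
    "Infection", "SamplingTime", "Ethnicity", "Gender", "Title",
    "Description", "BamFileName",
]


def split_columns(columns):
    """Split column names into (Land sample metadata, clinical metadata)."""
    defaults = set(DEFAULT_LAND_SAMPLE_COLUMNS)
    land = [name for name in columns if name in defaults]
    clinical = [name for name in columns if name not in defaults]
    return land, clinical


# "Age" and "SmokingStatus" are hypothetical extra columns.
land, clinical = split_columns(
    ["Tissue", "DiseaseState", "Gender", "Age", "SmokingStatus"]
)
print(land)      # ['Tissue', 'DiseaseState', 'Gender']
print(clinical)  # ['Age', 'SmokingStatus']
```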
Step 4: Create Land, publish .alv and .tlv, and load metadata
Because we extracted the sample metadata and project metadata from the information file and saved them as .osobj files in Step 3, we can simply choose to load the metadata as .osobj files.
After this, the user can refresh the Land like this:
And log off and log in to the server again. Now the ComparisonLand building is finished and is ready to be used.
Output and explore with the new Land
The output of these steps will be a new Land on ArrayServer, and the user can explore the new Land similarly to DiseaseLand.
Normally, the user should see a sample distribution view once the Land is opened. If there is no view, it might be caused by a mismatch between the metadata files and the PrimaryGrouping and SecondaryGrouping settings chosen when the new Land was created. Basically, the user should set “PrimaryGrouping” and “SecondaryGrouping” to two columns that exist in the metadata. For detailed settings, please refer to create and publish Land.
In normal usage, the user can search for a gene and should see the comparisons related to this gene. For instance, this is the view shown when searching for the gene "ERG" in a small ComparisonLand.
The relative expression of case vs control samples is displayed in the main window, comparison metadata is displayed in the details window, and the raw expression values of each sample in the comparison are displayed as boxplots.