SolidTumors Virtual Land

From Array Suite Wiki


How to build a "Solid Tumors" Virtual Land

In this short tutorial, we will build a Virtual Land from the following Lands in the OncoLand collection:

  • GTEx_B37: Thousands of normal tissue samples from the GTEx project
  • OncoGEO_B37: A collection of thousands of studies with tumor and normal samples, with expression and differential expression data
  • MetastaticCancer_B37: A collection of studies focusing specifically on metastasis
  • Pediatrics_B37: A collection of studies focusing on pediatric cancers
  • TCGA_B37: A massive collection of multi-omic data from dozens of cancers, with hundreds of clinical covariate columns

The goal is to build a "Virtual Land" that allows users to look for patterns of expression across these diverse but related datasets. In-depth explorations can then be made within the individual Lands to take full advantage of the metadata and visualizations in each Land.


Land compatibility

The Lands to include in a Virtual Land must all be built on the same Genome and Gene Model. In this case, all selected Lands are built on Human.B37.3 and OmicsoftGene20130723, so they can be combined.
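The compatibility rule above can be illustrated with a small Python sketch. It is not part of Array Studio or OmicSoft Server; the LandInfo class and check_compatible function are hypothetical names used only for illustration, and the Genome and Gene Model values would have to be looked up for each Land by hand.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class LandInfo:
    """Hypothetical record of a Land's build information (illustration only)."""
    name: str
    genome: str
    gene_model: str


def check_compatible(lands):
    """Return the (genome, gene_model) pair shared by all Lands, or raise ValueError."""
    if not lands:
        raise ValueError("no Lands selected")
    builds = {(land.genome, land.gene_model) for land in lands}
    if len(builds) != 1:
        details = ", ".join(
            f"{land.name}: {land.genome}/{land.gene_model}" for land in lands
        )
        raise ValueError(f"incompatible Lands: {details}")
    return builds.pop()


# The five source Lands used in this tutorial, all on the same build.
lands = [
    LandInfo(name, "Human.B37.3", "OmicsoftGene20130723")
    for name in (
        "GTEx_B37",
        "OncoGEO_B37",
        "MetastaticCancer_B37",
        "Pediatrics_B37",
        "TCGA_B37",
    )
]
shared = check_compatible(lands)
```

A Land built on a different Genome or Gene Model would make check_compatible raise an error, which mirrors the reason such a Land should not be selected for the Virtual Land.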

Metadata columns

At minimum, the "PrimaryGroupingColumn" (Y-axis grouping) and "SecondaryGroupingColumn" (X-axis grouping and coloring) must be specified, or else the source Land's columns will be used. TissueCategory and DiseaseCategory tend to work well here, but this can be tweaked depending on the purpose. For example, you might want to use "SourceLand" as the secondary grouping to highlight similarities and differences between Land databases.

Generally, tissue and disease columns are useful to include for solid tissue samples. In OmicSoft Lands, these are stored as "Tissue" and "DiseaseState", respectively, and trigger automatic generation of the "TissueCategory" and "DiseaseCategory" columns.

When building your Virtual Lands, make sure that these columns exist in the source Lands. Tip: If a source Land contains a useful column under a different header, e.g. "TissueType" in "MyLand_B37", you can rename the column in the Virtual Land with "MyLand_B37.VirtualColumns=Tissue<-TissueType".
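To make the VirtualName<-SourceName pattern concrete, here is a minimal Python sketch of how such a specification can be read as a rename mapping. The parse_virtual_columns function is a hypothetical helper, not OmicSoft code, and it assumes that column names are separated by commas (as described in the parameters section of this tutorial) and that a name without "<-" keeps its source name.

```python
def parse_virtual_columns(spec):
    """Map each virtual column name to the source column it is read from."""
    mapping = {}
    for item in spec.split(","):
        item = item.strip()
        if not item:
            continue  # ignore empty entries such as a trailing comma
        virtual, arrow, source = item.partition("<-")
        virtual = virtual.strip()
        # Assumption: without "<-", the column keeps its source name.
        mapping[virtual] = source.strip() if arrow else virtual
    return mapping


# "Tissue" is read from "TissueType"; "DiseaseState" is passed through unchanged.
mapping = parse_virtual_columns("Tissue<-TissueType,DiseaseState")
```

Reading the arrow from right to left (source column on the right, virtual column on the left) helps avoid reversing the two names when writing the setting.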

Virtual Lands will also automatically generate a "SourceLand" column.

Additional columns can be specified with VirtualColumns, and you can always update the configuration in the Virtual Land.cfg file.

Land Size

When combining Lands into a Virtual Land, keep in mind the total number of samples across the source Lands. The data are not duplicated, so storage space will not be affected, but fetching samples will take longer as the Virtual Land gets larger. A larger Virtual Land will also lengthen the Server restart time, as more databases need to be refreshed.

Building the SolidTumor_B37 Land

An OmicSoft Server administrator can switch to the Land tab, select Tools|Create Virtual Land, then use the checkboxes to indicate the Lands to include. Remember to only select Lands with compatible Genomes and Gene Models.

In the parameters section, input the following parameters (adjusted to your preferences):

//Besides Primary and Secondary Grouping, what other columns from source Lands should be included? Separate column names by commas, convert column names with VirtualName<-SourceName pattern

//How should the primary and secondary columns be mapped to each other? Choose groupings that most appropriately merge data across Lands


//What Land View makes most sense for cross-Land searches? In this case, maybe RNA-seq data would be most interesting by default.
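The parameter values that belong under these comments are not reproduced on this page. As a rough sketch only, the settings named elsewhere in this tutorial (the grouping columns, a VirtualColumns list, and DefaultViewID.Gene) could be assembled into Key=Value lines as shown below. The build_parameters helper is hypothetical, and the exact key spellings, the value syntax, and the choice of "Tissue,DiseaseState" for VirtualColumns are assumptions that should be checked against your own Virtual Land configuration.

```python
def build_parameters(settings):
    """Render a dict of settings as Key=Value lines, one per line."""
    return "\n".join(f"{key}={value}" for key, value in settings.items())


# Assumed values, taken from the choices described in this tutorial.
settings = {
    "VirtualColumns": "Tissue,DiseaseState",
    "PrimaryGroupingColumn": "TissueCategory",
    "SecondaryGroupingColumn": "DiseaseCategory",
    "DefaultViewID.Gene": "RnaSeq_Transcript.GeneVariable",
}
parameter_text = build_parameters(settings)
print(parameter_text)
```

Keeping the settings in one place like this makes it easier to compare them with the comments above before pasting anything into the parameters section.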

Exploring the Virtual Land

When you open <SolidTumor_B37>, notice that the samples are grouped by "TissueCategory" (PrimaryGrouping), and colored by "DiseaseCategory" (SecondaryGrouping).

SolidTumor B37Overview.png

Search for a gene like TP53. Although many of the Source Lands will show a variation frequency overview,
the DefaultViewID.Gene was set to RnaSeq_Transcript.GeneVariable, so Gene FPKM is shown instead.
In this example, at least 36,000 samples are displayed together, from five Lands.

SolidTumor B37 FPKM.png

Now try Trellising by TissueCategory, then change the Profile grouping ("Specify Multiple Profile Columns") to DiseaseState (one of the Virtual Columns)
and SourceLand (automatically generated) to see the similarities and differences in the data between the Lands.

Within a TissueCategory (like male reproductive system), notice the pattern of expression.
Change the Symbol color to ProjectID. Overall, prostate cancers tend to express TP53 at a higher level than normal samples from OncoGEO or GTEx_B37,
but there are discrete sub-populations. This might be due to experimental conditions, treatment differences, batch effects, etc.
But these differences can be explored in detail in each Source Land.

SolidTumor B37 FPKM ByTissueCat SourceLand.png