Introduction to TCGA Land Content
From Array Suite Wiki
TCGA_B37 and TCGA_B38
The Cancer Genome Atlas (TCGA) is a comprehensive and coordinated effort to accelerate our understanding of the molecular basis of cancer through the application of genome analysis technologies, including large-scale genome sequencing. TCGA Land includes data from 33 cancer types, with RNA-Seq, DNA-Seq, Copy Number, Methylation, Expression array (Agilent), and protein array (RPPA) data.
The original TCGA Data Portal is no longer operational. TCGA data is now hosted on the Genomic Data Commons (GDC) data portal.
|Land Version||Genome Build||Gene Model|
- CNV Calling: Gistic2 Call and TCGA Land CNV Call (Segment Data)
- DNA-Seq Somatic Mutation
- Expression Ratio (Agilent)
- Methylation450 BeadChip
- Mass Spectrometry (MS)
- RPPA (protein array)
- RPPA_RBN (protein array): replicate-based normalization (RBN) for cross-tumor comparisons (M.D. Anderson)
- RNA-Seq, including:
- Single-end and Paired-end fusion calling
- RNA-Seq somatic mutation, from matched tumor/normal pairs
- Exon Junction and Exon Usage
- Expression (gene- and transcript-level quantification)
- Metadata, including TCGA Marker Paper information
Agilent Expression Array (Agilent G4502A)
Illumina HiSeq sequencing (GAII, GAIIx, HiSeq 2000, HiSeq 2500)
Illumina DNA Sequencing
Note: For expression arrays, there may be some discrepancies between published data and the values in TCGA Land. Please see the accompanying wiki page for an explanation of where these differences arise.
- Virus data: View viral sequence counts in Land RNA-seq Data
- 16S Microbial data: Bacterial counts from 16S rRNA
HLA (Class I) types are identified from the aligned RNA-Seq reads. The OptiType program aligns RNA-Seq reads to the HLA reference genome and then performs an optimization to determine the most likely HLA Class I allele. See "OptiType - precision HLA typing from next-generation sequencing data.pdf" for a description of the algorithm. TCGA has classified this information as restricted access.
OmicSoft does not reprocess other genomic data, but extracts data directly from the original datasets.
- TCGA_B37: Collating the TCGA data, especially the DNA somatic mutation data, has been complex. TCGA data has historically been housed in several repositories (Broad Firehose, UCSC Cancer Genomics Hub (CGHub), TCGA Data Portal, cBioPortal). With data generation for TCGA wrapping up, the NCI is working to store all of the generated data, both raw and processed, on the Genomic Data Commons (GDC). However, because GDC and cBioPortal appear to update their databases at different times, some discrepancies remain between the two portals. To provide our users with the most comprehensive TCGA dataset, OmicSoft merges data from these repositories into one unified dataset. When we first began curating TCGA Land, our starting files were the MAF files downloaded from Broad Firehose and from the TCGA Data Portal (now deprecated and replaced by the NCI GDC). We have been updating TCGA Land as new annotated somatic mutations are released on the GDC. cBioPortal uses its own curation and analysis pipeline, which differs from the GDC's. More recently, we have merged mutation calls from cBioPortal to address this discrepancy.
- TCGA_B38: Mutation calls provided in B38 are public somatic mutation data derived by merging the mutations in all MAF files downloaded from the GDC (v6.0). For more information on the mutation calling pipelines used by the GDC, please visit MAF_source.
- In addition to the different pipelines used to generate DNA somatic mutation data, users may notice a discrepancy in the number of samples with data between the B37 and B38 Lands. GDC Legacy data for B37 included mutation calls from arrays, while data in B38 comes from whole exome sequencing (WXS). Some WXS cases fail the QC and harmonization process at GDC and are therefore excluded from the pipeline.
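The merging described above can be pictured as taking the union of mutation calls across MAF files and collapsing calls that describe the same variant in the same sample. The following is a minimal sketch of that idea, not OmicSoft's actual pipeline: the `merge_maf_calls` helper, the choice of key columns, the source labels, and the example rows (including the positions) are assumptions made for illustration. The column names themselves follow the MAF format.

```python
import csv
import io

# MAF columns that, taken together, identify one variant call in one sample.
KEY_COLUMNS = (
    "Tumor_Sample_Barcode",
    "Chromosome",
    "Start_Position",
    "Reference_Allele",
    "Tumor_Seq_Allele2",
)


def read_maf(handle):
    """Yield one dict per mutation row, skipping '#' comment lines."""
    lines = (line for line in handle if not line.startswith("#"))
    yield from csv.DictReader(lines, delimiter="\t")


def merge_maf_calls(sources):
    """Union calls from several sources, keeping one record per variant.

    `sources` maps a source label (hypothetical, e.g. "GDC") to an open
    MAF file handle. Each merged record lists every source reporting it.
    """
    merged = {}
    for label, handle in sources.items():
        for row in read_maf(handle):
            key = tuple(row[col] for col in KEY_COLUMNS)
            record = merged.setdefault(key, {**row, "Sources": []})
            if label not in record["Sources"]:
                record["Sources"].append(label)
    return list(merged.values())


# Tiny made-up inputs: one call shared by both sources, one unique to each.
HEADER = "Hugo_Symbol\t" + "\t".join(KEY_COLUMNS) + "\n"
gdc = io.StringIO(
    "#version 2.4\n" + HEADER
    + "BRAF\tSAMPLE-A\tchr7\t100\tA\tT\n"
    + "TP53\tSAMPLE-A\tchr17\t200\tC\tT\n"
)
other = io.StringIO(
    HEADER
    + "BRAF\tSAMPLE-A\tchr7\t100\tA\tT\n"
    + "KRAS\tSAMPLE-B\tchr12\t300\tC\tA\n"
)

calls = merge_maf_calls({"GDC": gdc, "other": other})
for call in calls:
    print(call["Hugo_Symbol"], call["Tumor_Sample_Barcode"], call["Sources"])
```

A key built from genomic coordinates is only meaningful within one genome build, so a sketch like this would apply to files that share a build (all B37 or all B38), and sources that name chromosomes differently (for example `7` versus `chr7`) would need to be normalized first.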
Mass Spectrometry (MS)
MS raw data is downloaded from the Clinical Proteomic Tumor Analysis Consortium (CPTAC). This data is typically obtained from four centers: the Broad Institute, Pacific Northwest National Laboratory (PNNL), Johns Hopkins University, and Vanderbilt. Currently, the data from these projects available in the Land are log2 ratios (iTRAQ) taken exclusively from the Broad Institute and PNNL. Two types of protein levels are reported: 1) overall protein levels and 2) variant levels (e.g., phosphorylation). For example, note the entries for BRAF:
Key Meta Data Columns
- Tumor Type: The type of tumor. See Primary Grouping for details.
- Sample Type: The origin of the sample, including whether it comes from normal or tumor tissue and whether it is a primary tumor, a recurrent tumor, a cell line, etc.
- Land Tissue: The tissue from which the cell line was derived, using OmicSoft's curation Controlled Vocabulary
- Land Sample Type: A detailed description of the cell type from which the cell line was derived, using OmicSoft's curation Controlled Vocabulary
- Tumor or Normal: Indicates whether a sample is from tumor or normal tissue.
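These columns are typically used together to narrow a sample set before analysis. As a rough illustration only, the sketch below filters and counts samples from an exported metadata table. The column names come from the list above; the `select_samples` and `count_by` helpers, the `Sample ID` column, and all row values are hypothetical and do not reflect actual Land contents.

```python
from collections import Counter

# Invented metadata rows for illustration; real Land exports will differ.
samples = [
    {"Sample ID": "S1", "Tumor Type": "BRCA",
     "Sample Type": "Primary Tumor", "Tumor or Normal": "Tumor"},
    {"Sample ID": "S2", "Tumor Type": "BRCA",
     "Sample Type": "Solid Tissue Normal", "Tumor or Normal": "Normal"},
    {"Sample ID": "S3", "Tumor Type": "LUAD",
     "Sample Type": "Primary Tumor", "Tumor or Normal": "Tumor"},
    {"Sample ID": "S4", "Tumor Type": "LUAD",
     "Sample Type": "Recurrent Tumor", "Tumor or Normal": "Tumor"},
]


def select_samples(rows, criteria):
    """Return rows whose metadata equals every column/value pair in criteria."""
    return [
        row for row in rows
        if all(row.get(column) == value for column, value in criteria.items())
    ]


def count_by(rows, column):
    """Count how many rows carry each distinct value of one metadata column."""
    return Counter(row[column] for row in rows)


# Tumor samples of a single tumor type.
luad_tumors = select_samples(
    samples, {"Tumor Type": "LUAD", "Tumor or Normal": "Tumor"}
)
print([row["Sample ID"] for row in luad_tumors])

# Tumor versus normal sample counts across the whole table.
print(count_by(samples, "Tumor or Normal"))
```

Matching on exact strings keeps the sketch simple; it relies on the metadata values being drawn from a controlled vocabulary, as the Land Tissue and Land Sample Type columns are.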
Tumor Types (All 33 tumor types from TCGA)
- 2. RNA-Seq Quantification
- 3. RNA-Seq Fusion
- Fusion Site, RPKM, Frequency and Browser
- Paired End Fusion
- 4. Expression
- Summary of up- /down-regulation
- Expression Ratio
- 6. Copy Number
- Copy Number Log2 Ratio
- Copy Number Browser
- 7. Methylation
- 8. Integration Analysis
- 9. Survival Data