Introduction to SCHuman Land Content
From Array Suite Wiki
In addition to curated DiseaseLand studies with standard RNA-seq and microarray experiments, DiseaseLand subscribers have access to Lands with single-cell RNA-seq data. SCHuman_B37 Land is part of this collection; it focuses on data from studies that examine single-cell populations across a number of the categories found in our DiseaseLand Collection. Single-cell RNA-Seq experiments are performed with many different technologies, so our Lands separate data from non-UMI studies and UMI studies:
|Species||non-UMI Land||UMI Land||Reference||GeneModel|
SCHuman_B37 focuses heavily on publicly available RNA-Seq expression data. Because all data are processed through the same pipeline, gene expression can be compared across many different projects, with the additional value of built-in visualizations and functions. At Omicsoft, thanks to our experienced data curation and processing team, we have a systematic method for data curation. Refer to our Curation Pipeline for details.
Samples in single-cell Lands are split between the UMI and non-UMI Lands based on project information and data processing. Generally, any single-cell RNA-Seq project in which individual cells have been barcoded and contain Unique Molecular Identifiers (UMI) will be found in the UMI Lands. This includes data from platforms such as DropSeq and 10X Genomics. These samples have an inherent 3' bias and are therefore processed and analyzed differently from samples in the non-UMI Lands, which hold projects in which RNA-Seq was performed on single-cell populations with other protocols (e.g. SMART-Seq).
Most of the samples in these Lands are single cells. However, some samples contain 10, 100, or 1000 cells, or are bulk samples (annotated as “population” in the CellNumber property); these are mostly used for benchmarking or comparison purposes in selected projects. You can use the “CellNumber” property to filter the samples if you would like to single them out. Samples with a CellNumber annotation of “0” have been filtered out as low-quality samples.
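The CellNumber filtering described above can be sketched in a few lines of Python. This is only an illustration on a hypothetical export of sample metadata: the record layout (a list of dicts), the sample IDs, and the exact string values stored in CellNumber are assumptions, not the Land's actual export format.

```python
# Hypothetical sample metadata rows. Only the "CellNumber" property name
# comes from the Land; the IDs and the value encoding are assumed.
samples = [
    {"SampleID": "S1", "CellNumber": "1"},
    {"SampleID": "S2", "CellNumber": "10"},
    {"SampleID": "S3", "CellNumber": "population"},
    {"SampleID": "S4", "CellNumber": "1"},
]


def is_single_cell(sample):
    """Return True when the sample is annotated as exactly one cell."""
    return str(sample["CellNumber"]).strip() == "1"


# Split true single cells from multi-cell and bulk ("population") samples.
single_cells = [s for s in samples if is_single_cell(s)]
controls = [s for s in samples if not is_single_cell(s)]

print([s["SampleID"] for s in single_cells])  # ['S1', 'S4']
print([s["SampleID"] for s in controls])      # ['S2', 'S3']
```

Comparing the value as a string keeps the check valid for non-numeric annotations such as “population”.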
- RNA-Seq data
Refer to individual projects' clinical metadata for details of how data were generated.
Expression is normalized as Transcripts per million (TPM) for non-UMI Lands and Reads per million (RPM) for UMI lands. To ensure only high quality data is incorporated into Single Cell Lands, we use the following criteria:
- Mitochondrial rate <= 20%
- Alignment rate and mapped reads rate >= 0.4
- Minimal mapped reads >= 50K
- Minimal gene coverage >= 1000
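As a rough sketch, the criteria above can be expressed as a single pass/fail check per sample. The thresholds are taken from the list; the metric field names (`mito_rate`, `alignment_rate`, `mapped_reads_rate`, `mapped_reads`, `gene_coverage`) and the example values are hypothetical, and gene coverage is assumed here to be a simple count. This is not the Omicsoft pipeline itself.

```python
# Thresholds taken from the criteria listed above.
MAX_MITO_RATE = 0.20       # mitochondrial rate <= 20%
MIN_RATE = 0.4             # alignment rate and mapped reads rate >= 0.4
MIN_MAPPED_READS = 50_000  # minimal mapped reads >= 50K
MIN_GENE_COVERAGE = 1000   # minimal gene coverage >= 1000


def passes_qc(metrics):
    """Return True if one sample meets every criterion.

    `metrics` is a dict with hypothetical field names; rates are fractions.
    """
    return (
        metrics["mito_rate"] <= MAX_MITO_RATE
        and metrics["alignment_rate"] >= MIN_RATE
        and metrics["mapped_reads_rate"] >= MIN_RATE
        and metrics["mapped_reads"] >= MIN_MAPPED_READS
        and metrics["gene_coverage"] >= MIN_GENE_COVERAGE
    )


# Made-up example samples for illustration only.
good = {
    "mito_rate": 0.05,
    "alignment_rate": 0.85,
    "mapped_reads_rate": 0.80,
    "mapped_reads": 250_000,
    "gene_coverage": 3200,
}
high_mito = dict(good, mito_rate=0.35)     # fails the mitochondrial cutoff
shallow = dict(good, mapped_reads=12_000)  # fails the mapped reads cutoff

print(passes_qc(good), passes_qc(high_mito), passes_qc(shallow))
```

A sample is kept only when all criteria hold, so a single failing metric is enough to exclude it.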
Key Meta Data Columns
SCHuman_B37 is curated at the sample and project level, with hundreds of meta data columns available.
- DiseaseCategory (controlled vocabulary) : Disease category of the sample, based on the detailed disease state. (Primary Grouping column)
- TissueCategory (controlled vocabulary) : Tissue category such as skin, muscle, heart, kidney, etc. (Secondary Grouping column)
- DiseaseState (controlled vocabulary) : Curated at sample level from each project.
- SampleSource (controlled vocabulary) : Either cell type or tissue information. When a sample has cell type information, cell type is used. Otherwise, tissue category is used.
- CellNumber : Indicates the number of cells per sample. The value is 1 for most samples, but the column can be used to filter poor quality samples (with a value of zero) or controls with more than 1 cell.
- LibraryStrategy: Indicates the strategy used to obtain single cells for the project.
Sample Distribution by DiseaseCategory:
- ProjectName: The name of individual projects where the data is from.
- TherapeuticArea: Specific clinical focus of individual project (can be multiple areas depending on project)
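To illustrate how these columns relate, the sketch below derives SampleSource with the fallback rule described above (cell type when available, otherwise tissue category) and tallies samples by DiseaseCategory. The `CellType` field name and all example values are hypothetical; only DiseaseCategory, TissueCategory, and SampleSource are column names from the Land.

```python
from collections import Counter

# Made-up metadata rows; "CellType" is an assumed field name.
samples = [
    {"DiseaseCategory": "Cancer", "TissueCategory": "skin", "CellType": "melanocyte"},
    {"DiseaseCategory": "Cancer", "TissueCategory": "skin", "CellType": ""},
    {"DiseaseCategory": "Normal control", "TissueCategory": "kidney", "CellType": None},
]


def sample_source(sample):
    """Use the cell type when present, otherwise fall back to the tissue category."""
    cell_type = sample.get("CellType")
    return cell_type if cell_type else sample["TissueCategory"]


sources = [sample_source(s) for s in samples]
by_disease = Counter(s["DiseaseCategory"] for s in samples)

print(sources)           # ['melanocyte', 'skin', 'kidney']
print(dict(by_disease))  # {'Cancer': 2, 'Normal control': 1}
```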
Experimental designs can differ considerably between projects within DiseaseLand, and some users may want to quickly identify the expression of a gene in the context of a specific study. Omicsoft therefore created project-specific views that display expression values based on the experimental design within each project.