The Rtsne module in Array Studio lets the user cluster cells based on UMI counts, using the Rtsne package in R: T-Distributed Stochastic Neighbor Embedding using a Barnes-Hut Implementation. t-SNE is a method for constructing a low-dimensional embedding of high-dimensional data, distances, or similarities. t-SNE has become a standard method for separating subgroups of cells when analyzing single-cell sequencing data. This function is intended for single-cell UMI count data and runs Rtsne directly in the R engine integrated with Array Studio.
If the user has not run Rtsne in Array Studio before, it needs to be set up first; please follow this wiki page: Setup tSNE in R engine.
To open this module, go to Analysis | NGS | Single Cell RNA-Seq | t-SNE Clustering.
Input Data Requirements
This module works on -Omic data objects and Zero inflated binary matrix (ZIM) data.
The user can choose to perform this analysis locally:
Or to perform this analysis on the server:
Note: The Perplexity value should be less than (observations - 1) / 3.
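The note above can be checked before a run is submitted. The following is a minimal Python sketch, not part of Array Studio; the function names `max_perplexity` and `is_valid_perplexity` are hypothetical, and the rule is applied exactly as stated on this page (strictly less than (observations - 1) / 3).

```python
def max_perplexity(n_observations: int) -> float:
    """Upper limit for Perplexity given the number of observations (cells)."""
    if n_observations < 2:
        raise ValueError("at least two observations are required")
    return (n_observations - 1) / 3


def is_valid_perplexity(perplexity: float, n_observations: int) -> bool:
    """True if the perplexity is positive and below the limit stated in the note."""
    return 0 < perplexity < max_perplexity(n_observations)


if __name__ == "__main__":
    # With 100 cells the limit is (100 - 1) / 3, so 30 passes and 40 does not.
    print(max_perplexity(100))
    print(is_valid_perplexity(30, 100))
    print(is_valid_perplexity(40, 100))
```

For small datasets this limit is easy to exceed, so it may be worth computing it from the number of selected observations rather than from the full data object.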
- Project & Data: The window includes a dropdown box to select the Project and Data object to be analyzed.
- Variables: Selections can be made on which variables should be included in the analysis (options include All variables, Selected variables, Visible variables, and Customized variables (select any pre-generated Lists)).
- Observations: Selections can be made on which observations should be included in the analysis (options include All observations, Selected observations, Visible observations, and Customized observations (select any pre-generated Lists)).
- Output name: The user can choose to name the output data object.
- Dimension: integer; Output dimensionality (default: 2)
- Perplexity: numeric; Perplexity parameter
- Theta: numeric; Speed/accuracy trade-off (increase for less accuracy); set to 0.0 for exact t-SNE (default: 0.5)
- Max iteration: integer; Number of iterations (default: 1000)
- Cluster color: The user can choose to color the resulting scatter plot by any pre-defined group (such as tissue or treatment). This option can be left empty.
- Run initial PCA: logical; Whether an initial PCA step should be performed (default: TRUE)
- Check duplicates: logical; Checks whether duplicates are present. It is generally assumed that there are no duplicates. The user can verify beforehand that no duplicates are present and set this option to FALSE, especially for large datasets (default: FALSE)
- Cluster output with Kmeans: With this option, the user can define the number of clusters (i.e., cell types/subtypes) expected in the population of cells. The Advanced tab allows the user to fine-tune this setting by providing a lower and upper bound for this number.
- PCA settings:
- Initial PCA dimensions: integer; The number of dimensions that should be retained in the initial PCA step (default: 50)
- Center data before PCA: logical; Should data be centered before PCA is applied? (default: TRUE)
- Scale data before PCA: logical; Should data be scaled before PCA is applied? (default: FALSE)
- Kmean cluster number lower/upper bound: Indicates the minimum and maximum number of cell clusters. Once clustering is performed, cells are automatically assigned to clusters according to their k-means identity.
- Stop lying iteration number: integer; Iteration after which the perplexities are no longer exaggerated (default: 250, except when Y_init is used, then 0)
- Moment switch iteration number: integer; Iteration after which the final momentum is used (default: 250, except when Y_init is used, then 0)
- Momentum: numeric; Momentum used in the first part of the optimization (default: 0.5)
- Final Momentum: numeric; Momentum used in the final part of the optimization (default: 0.8)
- Eta: numeric; Learning rate (default: 200.0)
- Exaggeration factor: numeric; Exaggeration factor used to multiply the P matrix in the first part of the optimization (default: 12.0)
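The options above can be summarized as a single settings record. The following Python sketch is illustrative only and is not how Array Studio stores its settings: the class `TsneSettings` and its field names are hypothetical, and the defaults are copied from the list above. The keys returned by `to_rtsne_args` are the argument names of the `Rtsne()` function in the R package; the assumption that the module passes the options through one-to-one is not confirmed by this page. Apart from the perplexity rule, the sanity checks are assumptions as well.

```python
from dataclasses import dataclass


@dataclass
class TsneSettings:
    """Hypothetical container for the documented options and their defaults."""
    perplexity: float                  # no default is documented, so it is required
    dimension: int = 2
    theta: float = 0.5
    max_iteration: int = 1000
    run_initial_pca: bool = True
    check_duplicates: bool = False
    initial_pca_dimensions: int = 50
    center_before_pca: bool = True
    scale_before_pca: bool = False
    stop_lying_iteration: int = 250
    momentum_switch_iteration: int = 250
    momentum: float = 0.5
    final_momentum: float = 0.8
    eta: float = 200.0
    exaggeration_factor: float = 12.0

    def validate(self, n_observations: int) -> None:
        """Raise ValueError on implausible values (checks other than perplexity are assumptions)."""
        if self.dimension < 1:
            raise ValueError("Dimension must be a positive integer")
        if self.theta < 0:
            raise ValueError("Theta must be >= 0 (0.0 means exact t-SNE)")
        if self.max_iteration < 1:
            raise ValueError("Max iteration must be a positive integer")
        if not 0 < self.perplexity < (n_observations - 1) / 3:
            raise ValueError("Perplexity must be less than (observations - 1) / 3")

    def to_rtsne_args(self) -> dict:
        """Map the options to the argument names used by Rtsne() in R."""
        return {
            "dims": self.dimension,
            "perplexity": self.perplexity,
            "theta": self.theta,
            "max_iter": self.max_iteration,
            "pca": self.run_initial_pca,
            "check_duplicates": self.check_duplicates,
            "initial_dims": self.initial_pca_dimensions,
            "pca_center": self.center_before_pca,
            "pca_scale": self.scale_before_pca,
            "stop_lying_iter": self.stop_lying_iteration,
            "mom_switch_iter": self.momentum_switch_iteration,
            "momentum": self.momentum,
            "final_momentum": self.final_momentum,
            "eta": self.eta,
            "exaggeration_factor": self.exaggeration_factor,
        }


if __name__ == "__main__":
    settings = TsneSettings(perplexity=30)
    settings.validate(n_observations=500)   # passes: 30 < (500 - 1) / 3
    print(settings.to_rtsne_args())
```

Keeping the defaults in one place makes it easier to see which values were changed for a given run when results are compared.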
The Rtsne module will generate a table and a scatter plot view for this table in ArrayStudio:
An example of TsneScoreTable is shown below:
An example of the scatter plot of the two dimensions computed by Rtsne is shown below. Each data point represents a cell:
Once the scatter plot is generated, the user can manually select cells that belong to the same cluster and assign a list name to each cluster:
Once all of the cells have been assigned a list name based on their distribution in the scatter plot, the user can select all the lists defined from this scatter plot, right-click, and choose to add the list membership to the original TsneScoreTable:
The user can then go to the scatter plot, choose Change Symbol Properties, color the plot by Categorical value, and select the newly added ListMembership:
After this operation, each cluster is shown in a different color:
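The effect of adding list membership to the score table and coloring by it can be sketched outside Array Studio. The Python example below is a conceptual illustration only: the functions `add_list_membership` and `assign_colors`, the `ID` column name, the sample rows, and the color palette are all hypothetical. It also assumes each cell belongs to at most one list; how Array Studio treats a cell that is in several lists is not described on this page.

```python
def add_list_membership(score_rows, lists, column="ListMembership"):
    """Return copies of the score-table rows with a list-membership column added.

    score_rows: list of dicts, each with an "ID" key (hypothetical column name)
    lists: dict mapping list name -> iterable of cell IDs
    """
    cell_to_list = {}
    for list_name, cell_ids in lists.items():
        for cell_id in cell_ids:
            if cell_id in cell_to_list:
                # assumption of this sketch: one list per cell
                raise ValueError(f"{cell_id} belongs to more than one list")
            cell_to_list[cell_id] = list_name
    # cells that are in no list receive an empty label
    return [{**row, column: cell_to_list.get(row["ID"], "")} for row in score_rows]


def assign_colors(rows, column="ListMembership",
                  palette=("red", "blue", "green", "orange")):
    """Map each distinct non-empty category to a color, in order of first appearance."""
    colors = {}
    for row in rows:
        label = row[column]
        if label and label not in colors:
            colors[label] = palette[len(colors) % len(palette)]
    return colors


if __name__ == "__main__":
    # Illustrative rows only; real score tables come from the Rtsne module.
    table = [
        {"ID": "cell1", "Dim1": -3.2, "Dim2": 1.1},
        {"ID": "cell2", "Dim1": -2.9, "Dim2": 0.8},
        {"ID": "cell3", "Dim1": 4.5, "Dim2": -2.0},
    ]
    lists = {"ClusterA": ["cell1", "cell2"], "ClusterB": ["cell3"]}
    annotated = add_list_membership(table, lists)
    print(annotated)
    print(assign_colors(annotated))
```

The original rows are left unchanged, so the annotated copy can be discarded and rebuilt if the manual cluster selection is revised.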