Run KMeans RScript

From Array Suite Wiki


Overview

R provides a large collection of algorithms for table manipulation and for MicroArray and SNP analysis. In Array Studio, it is possible to send full or partial Table/MicroArray/SNP data to R and retrieve the result (data.frame or vectors) back from R. The R script can use any packages and functions to process and analyze the data provided by Array Studio. The user can then use Array Studio for visualization and project management, and Array Server to store and share the R results.

Introduction

In this module, the user takes a table object in Array Studio as input data, sends it to R to run KMeans clustering, generates new columns that store the cluster assignments, and imports the result table back into Array Studio. The user can then add a scatter plot view to the result table and change the symbol properties to color the dots by the new cluster columns, treated as categorical values.

Input Dataset

This R script takes an Array Studio Table object as input data.

Save the Rscript

To use R integration in Array Studio, the user needs R (version 2.7 or above) installed on the client machine for local projects, and R (version 2.7 or above) installed on the server machine for server projects. On a Linux server, the R executable must be in one of the searchable paths (i.e. typing “R” on the command line from any folder should launch the R environment).
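A quick way to confirm that an installation meets the version requirement is to compare versions from inside an R session. The sketch below uses only base R; the variable names are illustrative and not part of Array Studio.

```r
# Minimum version stated in this article
required <- "2.7.0"

# getRversion() returns the running R version as a comparable object
installed <- getRversion()
meets_requirement <- installed >= required

cat("Installed R:", as.character(installed), "\n")
cat("Meets requirement:", meets_requirement, "\n")
```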

Users can put R script files (file extension .rscript) into the appropriate folders, so that Array Studio or Array Server can automatically load those modules when the user chooses

For an Array Studio project, this R script needs to be saved in:

  • For local analysis, the .rscript files need to be put in OmicsoftHomeDirectory\RScripts\Table by default.
  • For server analysis, the .rscript files need to be put in BaseDirectory\Pipeline\RScripts\Table by default.

Run the Rscript

Kmeans02.png

Parameters setup

There are five parameters the user can modify in this R script:

  • Kmeanslow: integer, the lower bound of the range of "K" to use for the KMeans analysis
  • Kmeanshigh: integer, the upper bound of the range of "K" to use for the KMeans analysis
  • RandomSeed: integer, the random seed for the run; setting the same number again reproduces a previous result. Default is 10
  • Column1: a column name from the table; default is "V1", chosen for convenience because this module is mainly designed for KMeans analysis of tSNE results
  • Column2: a column name from the table; default is "V2", chosen for the same reason
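The effect of RandomSeed can be illustrated outside Array Studio. The sketch below runs kmeans twice on a small made-up table (a stand-in for a tSNE result, not real data) with the same seed; under that assumption the two runs should give identical cluster assignments.

```r
# Toy two-column table standing in for a tSNE result (made-up data)
set.seed(1)
toy <- data.frame(V1 = rnorm(60), V2 = rnorm(60))

# Two runs that start from the same seed
set.seed(10)
run_a <- kmeans(toy[, c("V1", "V2")], 3)$cluster
set.seed(10)
run_b <- kmeans(toy[, c("V1", "V2")], 3)$cluster

# Compare the cluster assignments of the two runs
same_result <- identical(run_a, run_b)
print(same_result)
```

Without a fixed seed, kmeans picks random starting centers, so cluster labels can differ from run to run.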


The user can run the KMeans analysis over a range of K values. For instance, to run it for every k from 3 to 8, set Kmeanslow to 3 and Kmeanshigh to 8.
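As a small illustration of what that range produces, the sketch below lists the column names one would expect for K from 3 to 8. It reuses the paste("KMeans", i, sep = "_") naming pattern from the script in this article; the variable new_columns is only for illustration.

```r
Kmeanslow <- 3
Kmeanshigh <- 8

# One cluster column per K in the range, named KMeans_<K>
new_columns <- paste("KMeans", Kmeanslow:Kmeanshigh, sep = "_")
print(new_columns)
```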

Output

Once the analysis is done, a new table is generated in Array Studio, in which new columns holding the cluster information have been added:

Kmeans01.png

With this table, the user can

  • add a scatter view to the new table
  • change the symbol properties to color the dots by the newly generated cluster information

Kmeans03.png

R Script

Here is the content for KMeans.rscript:

 <Info>
Description    K means for tsne table from ArrayStudio
Author         xxx xx
Created        06/01/2018
Requirement

<Input>
Kmeanslow =
Kmeanshigh = 
RandomSeed = 10
Column1 = V1
Column2 = V2

<Output>
TSNE_Kmeans

<Script>

dat <- input.data

## read the parameters supplied by Array Studio
Kmeanslow = input.parameters[["Kmeanslow"]]
Kmeanshigh = input.parameters[["Kmeanshigh"]]
RandomSeed = input.parameters[["RandomSeed"]]
col1 = input.parameters[["Column1"]]
col2 = input.parameters[["Column2"]]

## check that both column names exist in the table
if(!(col1%in%colnames(dat))) stop("The column1 (case-sensitive) is NOT found in the Table! ","Column names available: ", paste(colnames(dat), collapse = ", "))
if(!(col2%in%colnames(dat))) stop("The column2 (case-sensitive) is NOT found in the Table! ","Column names available: ", paste(colnames(dat), collapse = ", "))


Kmeanslow <- as.numeric(Kmeanslow)
Kmeanshigh <- as.numeric(Kmeanshigh)

# fix the random seed so the user can replicate their results
set.seed(as.integer(RandomSeed))

i <- Kmeanslow

while (i <= Kmeanshigh) {
  # Cluster on the two selected columns with K = i
  clusters <- kmeans(dat[, c(col1, col2)], i)

  # Store the cluster assignment as a factor column named KMeans_i
  newclass <- paste("KMeans", i, sep = "_")
  dat[[newclass]] <- as.factor(clusters$cluster)

  i <- i + 1
}

TSNE_Kmeans <- dat
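
To try the script logic outside Array Studio, one option is to mock the two objects it reads. The sketch below is only an approximation: it assumes input.parameters behaves like a named list of character values and input.data like a data.frame, and the table contents are made up. The steps are a condensed version of the <Script> section.

```r
# Mock the objects Array Studio would supply (assumed shapes, made-up data)
set.seed(1)
input.data <- data.frame(V1 = rnorm(100), V2 = rnorm(100))
input.parameters <- list(Kmeanslow = "3", Kmeanshigh = "5",
                         RandomSeed = "10", Column1 = "V1", Column2 = "V2")

# Condensed version of the script steps
dat <- input.data
cols <- c(input.parameters[["Column1"]], input.parameters[["Column2"]])
stopifnot(all(cols %in% colnames(dat)))

k_low <- as.integer(input.parameters[["Kmeanslow"]])
k_high <- as.integer(input.parameters[["Kmeanshigh"]])
set.seed(as.integer(input.parameters[["RandomSeed"]]))

for (k in k_low:k_high) {
  # One factor column per K, named KMeans_<K>
  dat[[paste("KMeans", k, sep = "_")]] <- as.factor(kmeans(dat[, cols], k)$cluster)
}
TSNE_Kmeans <- dat

# Inspect the structure of the result table
str(TSNE_Kmeans)
```

Running this in a plain R session gives a quick check of column names and parameter values before the .rscript file is copied into the RScripts\Table folder.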