Cloud data flow

From Array Suite Wiki

Flow of data for cloud-based analyses

OmicSoft Studio on the Cloud integration allows users to seamlessly analyze NGS data using on-demand Amazon Cloud resources.

Basic workflow

Nearly all computationally intensive analyses, especially those processing large NGS files (NGS QC/alignment/summarization, variant calling, etc.), can be run on the cloud. Smaller summarization and analysis jobs are carried out directly on the ArrayServer machine.

The general "rule" is that if both input and output folders are cloud-based (i.e. mapped S3 bucket folders), then the analysis will be performed on a cloud EC2 virtual machine.
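The rule above can be sketched as a small check. This is only an illustration, not ArrayServer's actual logic: the names `CLOUD_ROOTS`, `is_cloud_folder`, and `runs_on_cloud` are hypothetical, and the list of mapped folders is assumed to contain just the `/CloudFolder` example used on this page.

```python
from pathlib import PurePosixPath

# Hypothetical: server folders that are mapped to S3 buckets.
CLOUD_ROOTS = ("/CloudFolder",)


def is_cloud_folder(path, cloud_roots=CLOUD_ROOTS):
    """Return True if path is a mapped cloud root or lies inside one."""
    p = PurePosixPath(path)
    for root in cloud_roots:
        r = PurePosixPath(root)
        if p == r or r in p.parents:
            return True
    return False


def runs_on_cloud(input_folder, output_folder):
    """Apply the general rule: cloud execution needs BOTH folders on the cloud."""
    return is_cloud_folder(input_folder) and is_cloud_folder(output_folder)


print(runs_on_cloud("/CloudFolder/CCLE", "/CloudFolder/TestOutput"))
print(runs_on_cloud("/CloudFolder/CCLE", "/LocalFolder/TestOutput"))
```

The sketch returns `False` for the mixed case (cloud input, local output) only because the rule requires both folders; how ArrayServer actually handles that case is not described here.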

  1. Job submission
    1. User selects input data from a cloud folder in a mapped S3 bucket, such as /CloudFolder/CCLE
      1. If the input data is in a cloud folder, the output folder should also be set to a cloud folder, such as /CloudFolder/TestOutput
  2. Job Launching
    1. ArrayServer will transfer necessary reference files (genomes, gene models, etc.) from ArrayServer's OmicsoftDirectory to the OmicsoftCloudDirectory
      1. If the reference files are not available, ArrayServer will first download the reference data to your ArrayServer OmicsoftDirectory
    2. When the analysis is submitted, ArrayServer will launch one EC2 instance per sample; alignment-related jobs use OAlignInstanceType, and all other jobs use OSummaryInstanceType.
      1. Cloud instances are launched with OmicSoft software pre-installed, which will be updated to the latest version for analysis
      2. OmicSoft includes several pre-built AMIs to launch instances; custom instances can be built if needed.
      3. If the administrator sets MaxInstanceCount=20, at most 20 EC2 machines will be started. If there are more than 20 samples, the extra samples will be queued.
    3. Input files in S3 are copied to the EC2 machines, each of which has EBS storage attached (the EBS size is calculated based on the input file size)
  3. Job Completion
    1. ArrayServer will monitor job progress through SQS
    2. When a job is finished, all results are uploaded to the S3 output folder
      1. Large data (filtered FASTQ files, BAM files, VCF files) will remain on S3 and is not downloaded to ArrayServer, but can be streamed on demand in Array Studio
      2. Small files (NgsData: links to S3 BAM files; OmicData: expression with design/annotation; Table reports) are copied to ArrayServer, summarized, and saved on the ArrayServer machine
    3. When a job is finished, the machine will wait up to 30 min for new samples in the queue to analyze
    4. An EC2 machine is terminated when there are no jobs in the queue and it has been idle for more than 30 min. No EC2 machines remain running once all samples are finished.
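
The launching and completion steps above can be sketched as a small simulation. This is a sketch under stated assumptions, not OmicSoft's implementation: it makes no AWS calls, the function names (`pick_instance_type`, `plan_launch`, `next_action`) are hypothetical, and the instance type values are placeholders. Only the setting names OAlignInstanceType, OSummaryInstanceType, and MaxInstanceCount and the 30-minute idle period come from the description above.

```python
from collections import deque

IDLE_TIMEOUT_MIN = 30  # idle period before a machine is terminated


def pick_instance_type(job_kind, settings):
    """Alignment-related jobs use OAlignInstanceType; other jobs use OSummaryInstanceType."""
    key = "OAlignInstanceType" if job_kind == "alignment" else "OSummaryInstanceType"
    return settings[key]


def plan_launch(samples, max_instance_count):
    """One instance per sample, capped at MaxInstanceCount; the rest are queued."""
    running = list(samples[:max_instance_count])
    queued = deque(samples[max_instance_count:])
    return running, queued


def next_action(queue, idle_minutes):
    """Decide what a machine does once it has finished its current sample."""
    if queue:
        return ("run", queue.popleft())   # pick up the next queued sample
    if idle_minutes > IDLE_TIMEOUT_MIN:
        return ("terminate", None)        # idle for too long with an empty queue
    return ("wait", None)                 # keep waiting for new samples


# Placeholder values; real deployments would hold EC2 instance type names here.
settings = {"OAlignInstanceType": "align-type", "OSummaryInstanceType": "summary-type"}
samples = [f"S{i}" for i in range(25)]

running, queued = plan_launch(samples, max_instance_count=20)
print(len(running), len(queued))
print(pick_instance_type("alignment", settings))
print(next_action(queued, idle_minutes=0))
```

With 25 samples and MaxInstanceCount=20, the first 20 samples each get a machine and the remaining 5 wait in the queue until a machine finishes its current sample.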