Map reads to both human and virus genome

From Array Suite Wiki


Overview

To determine viral load in a sample, Array Studio users can map reads to viral reference genomes. There are three ways to do this: 1) Create a combined genome in which viral sequences are added to the reference genome of the organism the sample came from (e.g. Human). Users can then map RNA-Seq reads to this human+virus combined genome and quantify gene counts for human genes as well as individual viral counts. While this may be a useful option, reads that map to viral sequences often also map within the human genome. 2) In that case, it may be preferable to map reads to the human genome first and then align the unmapped reads to the virus sequences. 3) Map raw reads directly to a public virus reference genome and quantify virus genome counts. This wiki page describes how users can take unmapped reads and quantify viral expression (option 2), and how to map raw reads to a public virus reference genome (option 3).

Mapping of Reads To Human Genome plus Customized Virus Reference Genome and Gene Model

In this step, users can perform mapping in two stages: 1) to the human genome and 2) to the virus genome. For all OmicSoft provided genome references, please see: References.

Map Reads to Human Genome

Raw fastq reads from a bulk RNA-seq sample can be aligned first to the Human Genome.

Map reads module.png

In the Map RNA-Seq Reads To Genome window, specify the fastq files and choose the human reference genome and gene model:

Load fastq files.png

In the Advanced tab, be sure to uncheck Exclude unmapped reads in BAM file. Otherwise the unmapped reads will not be available in the bam files.

Include unmapped reads.png

Mapped and unmapped reads in the bam files

When the job is done, one NGS Data object will show up in the solution together with an Alignment Report. The specified output folder contains one bam file per sample, and each bam file holds both the mapped and the unmapped reads.

Bams mapped2human with unmapped.png
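Outside Array Studio, the mix of mapped and unmapped records in such a file can be checked from the SAM FLAG field, where bit 0x4 marks an unmapped read. The following is a minimal sketch, assuming the bam file has already been converted to SAM text by an external tool (for example `samtools view -h`); the function name and the example records are hypothetical and are not part of Array Studio.

```python
from collections import Counter

UNMAPPED_FLAG = 0x4  # SAM FLAG bit meaning "segment unmapped"


def tally_mapping_status(sam_lines):
    """Count mapped and unmapped records in an iterable of SAM text lines."""
    counts = Counter(mapped=0, unmapped=0)
    for line in sam_lines:
        if not line.strip() or line.startswith("@"):
            continue  # skip blank lines and header lines
        fields = line.rstrip("\n").split("\t")
        flag = int(fields[1])  # column 2 of a SAM record is the FLAG
        if flag & UNMAPPED_FLAG:
            counts["unmapped"] += 1
        else:
            counts["mapped"] += 1
    return counts


# Made-up records for illustration only.
example = [
    "@HD\tVN:1.6\tSO:unsorted",
    "read1\t0\tchr1\t100\t60\t5M\t*\t0\t0\tACGTA\tIIIII",
    "read2\t4\t*\t0\t0\t*\t*\t0\t0\tTTGCA\tIIIII",
    "read3\t16\tchr2\t200\t60\t5M\t*\t0\t0\tGGCCA\tIIIII",
]
print(tally_mapping_status(example))
```

Note that this counts alignment records rather than unique reads, so a read with secondary or supplementary alignments would be counted more than once.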

Map Unmapped Reads to Virus Genome

To further map to the Virus Genome, the unmapped reads will need to be extracted as fastq files from the previous step and then used for subsequent mapping.

Extract unmapped reads

Under the NGS menu, extract the unmapped reads using NGS -> Manipulation -> Export:

Ngs Manipulation Export.png

In the Export window, choose the NgsData object that contains both mapped and unmapped reads, and set the output format to UNPAIRED+UNMAPPED_FASTQ_GZ, as shown below.

Export unmapped reads fastq.png
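Conceptually, this export selects the records flagged as unmapped and writes their names, sequences and qualities back out in FASTQ form. The sketch below illustrates that idea on SAM text lines. It is an assumption-laden illustration, not the Array Studio implementation: it handles only the unmapped flag (0x4), does not attempt to reproduce the UNPAIRED part of the export format, and the function name and example records are hypothetical.

```python
UNMAPPED_FLAG = 0x4  # SAM FLAG bit meaning "segment unmapped"


def unmapped_to_fastq(sam_lines):
    """Return FASTQ text for every unmapped record in an iterable of SAM lines."""
    records = []
    for line in sam_lines:
        if not line.strip() or line.startswith("@"):
            continue  # skip blank lines and header lines
        fields = line.rstrip("\n").split("\t")
        name, flag = fields[0], int(fields[1])
        seq, qual = fields[9], fields[10]  # SAM columns 10 (SEQ) and 11 (QUAL)
        if flag & UNMAPPED_FLAG:
            records.append(f"@{name}\n{seq}\n+\n{qual}\n")
    return "".join(records)


# Made-up records for illustration only.
example = [
    "@HD\tVN:1.6\tSO:unsorted",
    "read1\t0\tchr1\t100\t60\t5M\t*\t0\t0\tACGTA\tIIIII",
    "read2\t4\t*\t0\t0\t*\t*\t0\t0\tTTGCA\tIIIII",
]
print(unmapped_to_fastq(example), end="")
```

To produce a compressed file, the returned text could be written through Python's standard `gzip.open(path, "wt")`.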

Output Fastq Files

The extracted fastq files will be found in the output folder:

Generated fastq unmapped reads.png

Map reads to Virus Genome

To map the fastq files extracted in the previous step, run Map RNA-Seq Reads To Genome again, this time choosing the virus reference genome and gene model:

Map unmapped reads2virus.png


Quantification of Counts data for reads aligned to Human and Virus genomes

After mapping the original fastq files to the human genome and the unmapped-read fastq files to the virus genome, two NGS data objects will be available. Run the Summarize Gene/Transcript Count module on both NGS data objects to get counts data for human genes and virus genes.

Quantification counts 4virusandhuman.png

The final output will look similar to the snapshot below:

Counts fpkm virusandhuman.png
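Once both count tables are available, one simple way to summarize viral load is the fraction of counted reads per sample that were assigned to virus genes. The sketch below assumes the two tables have been exported and loaded as nested dictionaries (sample -> gene -> count); this layout, the function name, and all sample names, gene names and values are hypothetical and chosen only for illustration.

```python
def viral_fraction(human_counts, virus_counts):
    """Return, per sample, virus counts divided by (human + virus) counts."""
    fractions = {}
    for sample, genes in human_counts.items():
        human_total = sum(genes.values())
        # A sample with no virus counts is treated as having zero viral reads.
        virus_total = sum(virus_counts.get(sample, {}).values())
        total = human_total + virus_total
        fractions[sample] = virus_total / total if total else 0.0
    return fractions


# Made-up counts for illustration only.
human = {
    "S1": {"GAPDH": 900, "ACTB": 1000},
    "S2": {"GAPDH": 500, "ACTB": 500},
}
virus = {
    "S1": {"viral_gene_A": 60, "viral_gene_B": 40},
    "S2": {},
}
for sample, fraction in viral_fraction(human, virus).items():
    print(sample, fraction)
```

This ratio ignores reads that were not assigned to any gene, so it is a rough indicator rather than an absolute measure of viral load.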


Mapping of Reads to Public Virus Reference Genome

The OmicSoft platform provides several publicly available virus reference genomes, such as Virus.RefSeq20170418 and Virus.RefSeq2014.**. Array Studio users can map RNA-Seq reads to these virus reference genomes to analyze virus gene expression in a sample.

Map raw reads to Virus reference genome without customized virus gene model

Similar to the mapping steps demonstrated above, go to Add Data -> Add NGS Data -> Add RNA-Seq Data -> Map Reads To Genome (Illumina). In the Map RNA-Seq Reads To Genome window, choose the raw fastq files and one of the available virus reference genomes. In this demo, Virus.RefSeq20170418 was chosen.

Map reads2public VirusRefGenome.png

When the alignment job is done, one NGS data object containing the alignment information will appear in the solution, as shown below.

Ngs mapped2publicVirus.png


Quantification of Virus genome mapping

Once the NGS data object for the mapped reads is available, users can go to NGS -> Quantification -> Report Gene / Transcript Counts to get the Counts data for the virus genome.

Counts VirusGeneReads.png

When the job is done, the -OmicData expression table for the virus genome will show up in the solution.

Virus counts omicdata.png