Generate copy number baseline for NGS data

From Array Suite Wiki


In many NGS studies, researchers want to know the copy number of each gene in different tissues (for example, tumor tissue vs. normal tissue). If samples are paired, i.e., DNA-seq data is collected from both the normal tissue and the tumor tissue of each patient, copy number computation is straightforward and accurate. However, some studies collect the tumor and normal DNA-seq data from different patients, so a tumor sample has no matched normal sample from the same patient. In such cases, we may want to generate a "fake" normal DNA-seq sample to serve as the baseline for copy number computation.

Generate the Fastq Files from all available normal DNA seq

The basic idea is to build the DNA-seq data for a fake normal tissue from all available normal tissues. Suppose we have 50 normal samples; we would then first sample 100%/50 = 2% of the sequence data from each sample, so that the pooled data is roughly the size of one typical sample. The sample function can be used for this step.
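As an illustration of the subsampling step outside Array Studio, the following is a minimal Python sketch, not the wiki's own sample function. The function name `subsample_fastq` and its parameters are hypothetical, and it assumes uncompressed FASTQ files with exactly four lines per read.

```python
import random


def subsample_fastq(in_path, out_path, fraction, seed=0):
    """Copy each 4-line FASTQ record to out_path with probability `fraction`.

    Hypothetical helper for illustration; assumes plain-text FASTQ with
    exactly four lines per record. Returns the number of records kept.
    """
    rng = random.Random(seed)  # fixed seed makes the selection reproducible
    kept = 0
    with open(in_path) as src, open(out_path, "w") as dst:
        while True:
            record = [src.readline() for _ in range(4)]
            if not record[0]:  # empty string means end of file
                break
            if rng.random() < fraction:
                dst.writelines(record)
                kept += 1
    return kept


# With 50 normal samples, each one contributes 1/50 = 2% of its reads.
n_normals = 50
fraction = 1.0 / n_normals
```

For paired-end data, calling the function on both mate files with the same `seed` selects the same record positions in each file, provided the two files list the reads in the same order.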


Align all small fastq files to one BAM file

The key step is to map all 50 small fastq files to one reference as a single sample, so that they produce one BAM file. We can do this by following this wiki page: [How to use multiple sequence files for one sample?]
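If the aligner in use does not accept several input files for one sample, one possible alternative (an assumption, not the method described on the linked wiki page) is to concatenate the small fastq files into one pooled file before alignment. The helper name `pool_fastq` below is hypothetical, and the sketch assumes uncompressed single-end FASTQ files.

```python
def pool_fastq(small_paths, pooled_path):
    """Concatenate several FASTQ files into one pooled file.

    Hypothetical helper for illustration. A newline is added when a file
    does not end with one, so records from adjacent files never merge.
    Returns the total number of lines written.
    """
    n_lines = 0
    with open(pooled_path, "w") as dst:
        for path in small_paths:
            with open(path) as src:
                for line in src:
                    dst.write(line if line.endswith("\n") else line + "\n")
                    n_lines += 1
    return n_lines
```

For paired-end data, the two mate files would have to be pooled separately, listing the samples in the same order for both, so that the mates stay in sync.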

Compute the copy number for each tumor sample

Once we have the BAM file for the baseline, we can compare each tumor sample to the baseline to get its copy number.
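The wiki does not spell out how the comparison is computed. As a generic sketch of a depth-ratio approach, and not necessarily what Array Studio does, the example below takes per-gene read counts from a tumor BAM and from the baseline BAM, normalizes each by its total count, and converts the log2 ratio to a copy number estimate. The function name, the pseudocount, and the diploid baseline are all assumptions made for illustration.

```python
import math


def copy_number(tumor_counts, baseline_counts, ploidy=2, pseudocount=0.5):
    """Estimate per-gene copy number from read counts (illustrative sketch).

    tumor_counts / baseline_counts: dicts mapping gene name -> read count.
    Counts are normalized by library size, so the estimate assumes that
    most of the genome is copy-neutral in the tumor sample.
    Returns a dict mapping gene -> (log2_ratio, estimated_copy_number).
    """
    tumor_total = sum(tumor_counts.values())
    baseline_total = sum(baseline_counts.values())
    result = {}
    for gene, base in baseline_counts.items():
        # The pseudocount avoids division by zero and log of zero.
        t = (tumor_counts.get(gene, 0) + pseudocount) / tumor_total
        b = (base + pseudocount) / baseline_total
        log2_ratio = math.log2(t / b)
        result[gene] = (log2_ratio, ploidy * 2 ** log2_ratio)
    return result
```

A log2 ratio near 0 corresponds to the baseline copy number (2 for a diploid region), positive values suggest a gain, and negative values suggest a loss.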