Recover From Analysis With Error

From Array Suite Wiki


Can I rescue data from a pipeline analysis that had an error?

In Array Studio, users can submit dozens, or even hundreds, of NGS samples for automatic processing. In addition, pipeline scripts can string together multiple analysis steps without user intervention.

But what about when something goes wrong? Sometimes, network issues, computer hardware instability, or data formatting problems can cause an error partway through the analysis.

Fortunately, the data generated prior to the error are usually intact, so the full analysis does not need to be re-run.

Instead, the user simply needs to resolve the issue (e.g., fix or remove problematic samples, improve network connectivity, or move data to more stable disk storage), then use a couple of quick tricks to continue the pipeline.


How do I know there was a problem with my analysis?

Most of the time, if a Server-based analysis (e.g. a pipeline script run) encounters a problem, the Job Status window will say "RunningWithErrors", "CompletedWithErrors", or "ErrorOccurred".

Image412.png

"RunningWithErrors"/"CompletedWithErrors" indicates that at least one sample had an issue, but the analysis was able to continue with the remaining samples (and sometimes even the problematic sample).

In contrast, "ErrorOccurred" indicates that a critical problem was identified during an analysis step and the analysis halted without saving; the issue must be resolved before the job is submitted again.

Sometimes, an analysis will apparently run without errors, but inspection of the output data indicates that something went wrong (e.g. few or no reads were aligned to the genome).

In any case, crucial information about the problem will be found in the Job Log, which can be opened by right-clicking the job in the Job Status window and selecting "View Log".

When reading the log, the following steps can help identify the problem(s) and solutions:

  1. Look at the error message in the Server Browser. When an error occurs, this will generally display a message about the nature of the error.
  2. Find the error message(s) in the log file. Searching for "error" will find the step(s) that had problems, and additional diagnostic/status information is often output immediately before and after the error.
  3. Look for successful analysis samples/steps. If the analysis halted with ErrorOccurred, any data generated in steps prior to the error may still be trustworthy, especially if the problem was simply a mis-specified parameter. Similarly, RunningWithErrors often occurs when one sample had a problem but the analysis could complete for the remaining samples. For example, if mapping DNA-seq data was successful for 99 of 100 samples, there is no need to re-run those 99.
  4. Save the oscript that was generated. The oscript at the top of the log can be copied, edited, excerpted, etc. to re-run part of an analysis, continue with an analysis, change input files, and so on, with minimal effort.

Save early, Save often

Pipeline scripts should include a SaveProject step (the "For Server Pipeline" form) after every major analysis step:

Begin SaveProject;
End;

This "commits" successful output data to the project, even in the event of an ErrorOccurred message. This simple addition can save hours of computation time!
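As a sketch, such a pipeline alternates analysis and save steps; the outline below elides the step bodies, and the Files, Options, and other clauses would be filled in as in a normal script:

```
Begin NgsQCWizard /Namespace=NgsLib /RunOnServer=True;
// Files, Options, Output clauses as usual
End;

Begin SaveProject;
End;

Begin MapRnaSeqReadsToGenome /Namespace=NgsLib /RunOnServer=True;
// Files, Reference, GeneModel, Options, Output clauses as usual
End;

Begin SaveProject;
End;
```

If the alignment step then fails, the QC results have already been committed to the project and survive the ErrorOccurred halt.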

Example: Corrupted file in RNA-seq pipeline

For this example, a few characters were introduced at the start of one .fastq file from the RNA-seq tutorial (renamed SRR521524b_1.bad.fastq.gz), which should cause this file to fail in all analyses. This file was submitted with the others to an RNA-seq pipeline script, resulting in a "RunningWithErrors"/"CompletedWithErrors" status.

RecoverFromErrors ServerJobs FinishedWithErrors.png

By looking at the job log for this analysis, first notice that the full oscript is generated at the top, with pipeline script parameters replaced by actual file and object names. This script can be saved, edited, and directly re-submitted once the error is fixed. Furthermore, steps/samples that completed successfully can simply be removed from this script, so the full analysis does not need to be re-run.

Also, the log indicates each completed step, such as

[00:29:07] ----------Finished ProcMapRnaSeqReadsToGenome----------
[00:29:07] ----------Started ProcSaveProject----------

So the user can be confident that the data were properly saved (excepting failed samples).

Searching for the word "Error" identifies points where sample SRR521524b_1.bad had an issue, starting with the Raw data QC step.

[00:02:24] Error occurred for server job (SRR521524b_1.bad). Error=Error occurred for file SRR521524b_1.bad
[00:02:24] . Error = Unsupported NGS raw file format: Unknown@@@
...
[00:02:24] Performing summarization (Mode=NgsQCWizard) for observation SRR521524b_1.bad...Retry#=2
..
Tip: ArrayServer will try three times to process a file before "giving up", in case there was a transient network transfer issue or other temporary problem.


In contrast, aligning to the genome did not throw an error per se, but the output summary indicates a problem:

[00:29:03] BamFile=/mnt/Scratch/ArrayServer/BaseDir/Test-08_Joseph/FtpRoot/Users/joseph/RecoverFromError/BAM/SRR521524b_bad.bam
0 reads uniquely paired, 0 reads non-uniquely paired, 0 reads not mapped.

Similarly, quantification by the RSEM algorithm indicated a problem for sample SRR521524b_bad.bam, but did not throw an Error:

[00:37:37] RSEM algorithm requires at least 100 aligned reads. Zero vector will be returned.

Finally, a quick glance at the quantification -Omic data will make it clear that sample SRR521524b_bad is not to be trusted.

RecoverFromErrors Quantification.png



Re-run the failed samples

Once the problem has been identified and resolved (in this case, removing extra characters in a fastq file), the user can choose to either

  1. re-run the problem sample through the full analysis, then merge all output data with the successful samples
  2. re-run the problem sample through early time-consuming steps (e.g. alignment), merge the data, then re-run all samples through downstream steps

It is generally more convenient to perform option 2, rather than edit and merge many -Omic and Table data types, so this will be demonstrated.

Step 1: Re-run Raw QC and Alignment for failed sample

Either run the Raw QC and Alignment steps through the GUI, or copy and edit the relevant sections of the oscript from the original run log, to process only the fixed samples:

Begin NgsQCWizard /Namespace=NgsLib /RunOnServer=True;
Files "/TestDataSets/HumanRNASeqPaired/Tutorial2013_5p/SRR521524b_1.fixed.fastq.gz
/TestDataSets/HumanRNASeqPaired/Tutorial2013_5p/SRR521524b_2.fixed.fastq.gz";

Options  /FileFormat=AUTO /QualityEncoding=Automatic /CompressionMethod=Gzip /PreviewMode=True /ParallelJobNumber=1 /GenerateTableland=True /BasicStatistics=True /BaseDistribution=True /QualityBoxPlot=True /KMerAnalysis=True /SequenceDuplication=True /OutputFolder="/Users/joseph/RecoverFromError/QC/Raw";
Output RecoverFromError\\RawQC_FixedSample;
End;

Begin SaveProject;
End;

Begin MapRnaSeqReadsToGenome /Namespace=NgsLib /RunOnServer=True;
Files "/TestDataSets/HumanRNASeqPaired/Tutorial2013_5p/SRR521524b_1.fixed.fastq.gz
/TestDataSets/HumanRNASeqPaired/Tutorial2013_5p/SRR521524b_2.fixed.fastq.gz";
Reference Human.B37.3;
GeneModel OmicsoftGene20130723;
Trimming  /Mode=TrimByQuality /ReadTrimQuality=2;
Options  /ParallelJobNumber=1 /PairedEnd=True /FileFormat=FASTQ /AutoPenalty=True /FixedPenalty=2 /Greedy=false /IndelPenalty=2 /DetectIndels=True /MaxMiddleInsertionSize=10 /MaxMiddleDeletionSize=10 /MaxEndInsertionSize=10 /MaxEndDeletionSize=10 /MinDistalEndSize=3 /ExcludeNonUniqueMapping=False /ReportCutoff=10 /OutputFolder="/Users/joseph/RecoverFromError/BAM" /ThreadNumber=4 /InsertSizeStandardDeviation=40 /ExpectedInsertSize=300 /MatePair=False /InsertOnSameStrand=False /InsertOnDifferentStrand=True /QualityEncoding=Automatic /CompressionMethod=Gzip /SearchNovelExonJunction=True /ExcludeUnmappedInBam=False /KeepFullRead=False /Replace=False /Platform=Illumina /CompressBam=False;
Output BAM_FixedSample;
End; 

Begin SaveProject;
End;

By using the original script (from the log) as a template, it is convenient to make small changes to only process the samples you want, without having to remember all of the custom parameters used for the original run.

Save this script as a text file, then submit it via Tools | Run Script (Send to Queue).


Step 2: Import all good BAM files into new NgsData

From the original run, seven .bam files are collected as a single NgsData object, but one of those seven contains 0 reads.

The .bam file from the fixed sample is in a separate NgsData object:

RecoverFromErrors SeparateNgsData.png

It will be simplest to import all of the good .bam files into a new NgsData object, with Add Data | Add NGS Data | Add RNA-Seq Data | Add Genome-Mapped Reads:

RecoverFromErrors AddGenomeMappedReads Menu.png

RecoverFromErrors AddGenomeMappedReads Window.png

Because this new NgsData object contains all of the .bam files that should have been originally created, it will be more convenient to rename the new NgsData to match the original output:

RecoverFromErrors RenameNgsData Objects.png

This way, the new NgsData object will automatically be compatible with the original oscript, and can be used for all downstream processing, as though there hadn't been an error in the first place.



Step 3: Continue with the analysis

Now go back to the oscript from the run log, copy the steps that hadn't been completed (in this case, all steps after aligning to the genome), and save them into a new text file.

Simply submit this new script to the queue, and the analysis will continue.
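As a sketch, the continuation script is nothing more than the unfinished Begin...End blocks copied verbatim from the log's oscript; the block name below is a placeholder, not a real command, so substitute the actual blocks from your own log:

```
// Placeholder only: copy the real Begin ... End blocks that follow
// MapRnaSeqReadsToGenome in the oscript from the original run log.
Begin <NextUnfinishedStep>;
End;

Begin SaveProject;
End;
```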

Output results

After completing this analysis, you should find all output objects that would have been created in the original run. If some data were generated in the original run and saved in the project, you may find duplicated objects like AlignedQC and AlignedQC_2; the latter was generated by the second script run (to avoid over-writing data) and contains the updated results. If you choose, you can remove redundant data objects by right-clicking and selecting Delete.
