High Performance Secondary Analysis of Sequencing Data

Published on Tuesday 13 November 2018

Genomic analysis is on the cusp of revolutionizing the understanding of diseases and the methods for their treatment and prevention. With the advancements in Next Generation Sequencing (NGS) technologies, the number of human genomes sequenced is predicted to double every year. This market growth is further fueled by the ongoing transition of NGS into the clinical market, where it is enabling personalized medicine, which promises to transform the diagnosis and treatment of diseases and bring a disruptive change to modern medicine.

However, current DNA analysis is restricted to limited data because of the time and cost of Whole Genome Sequencing (WGS). As biochemical sequencing gets faster and cheaper, the bottleneck becomes the analysis of the large volumes of data these technologies generate. Faster and cheaper computational processing is required to make genomic analysis widely available. Furthermore, pharmaceutical companies, consumer genomics companies, and research centers currently process hundreds of thousands of genomes at great cost and stand to benefit from this improvement as well.

Parabricks brings high performance computing technologies tailored for NGS analysis and reduces the run time of standard NGS software from several days to approximately one hour. The accelerated software is a drop-in replacement for existing tools that does not sacrifice output accuracy or configurability. On POWER9 servers, Parabricks provides 30-36 times faster secondary analysis, from the FASTQ files coming out of the sequencer to the variant call files (VCFs) used for tertiary analysis. The standard pipeline shown below consists of three steps, as defined by the Genome Analysis Toolkit (GATK) best practices. Parabricks accelerates the existing GATK 4 best practices and generates results equivalent to the baseline. The image below shows the pipeline currently supported by Parabricks.

Figure 1 - Parabricks GPU accelerated pipeline
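For orientation, the CPU-only baseline that this pipeline mirrors can be sketched as a sequence of open-source tool invocations. The Python sketch below only builds the command lines and does not run them. All file names, the read group string, and the thread count are hypothetical, and index and dictionary files that the tools expect are omitted. The commands are for the standard BWA and GATK 4 tools, not for Parabricks' own command line, which is not described here, and the flags should be checked against the installed tool versions.

```python
# Hypothetical input names, for illustration only.
REF = "reference.fasta"
FASTQ_1, FASTQ_2 = "sample_1.fastq.gz", "sample_2.fastq.gz"
KNOWN_SITES = "known_sites.vcf.gz"
READ_GROUP = r"@RG\tID:sample\tSM:sample\tPL:ILLUMINA"


def baseline_pipeline(threads=40):
    """Return (stage name, argv list) pairs for the CPU-only baseline, in order."""
    return [
        ("BWA-Mem", ["bwa", "mem", "-t", str(threads), "-R", READ_GROUP,
                     "-o", "aligned.sam", REF, FASTQ_1, FASTQ_2]),
        ("Coordinate sorting", ["gatk", "SortSam", "-I", "aligned.sam",
                                "-O", "sorted.bam", "--SORT_ORDER", "coordinate"]),
        ("Marking duplicates", ["gatk", "MarkDuplicates", "-I", "sorted.bam",
                                "-O", "dedup.bam", "-M", "dup_metrics.txt"]),
        ("BQSR", ["gatk", "BaseRecalibrator", "-R", REF, "-I", "dedup.bam",
                  "--known-sites", KNOWN_SITES, "-O", "recal.table"]),
        ("ApplyBQSR", ["gatk", "ApplyBQSR", "-R", REF, "-I", "dedup.bam",
                       "--bqsr-recal-file", "recal.table", "-O", "recal.bam"]),
        ("HaplotypeCaller", ["gatk", "HaplotypeCaller", "-R", REF,
                             "-I", "recal.bam", "-O", "sample.vcf.gz"]),
    ]


if __name__ == "__main__":
    # Print each stage with its command line; nothing is executed.
    for stage, argv in baseline_pipeline():
        print(f"{stage}: {' '.join(argv)}")
```

Each stage consumes the file written by the previous one, which is why the steps run strictly in sequence on the CPU baseline.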

Power Hardware Configuration

The Power System AC922 server is co-designed with OpenPOWER Foundation ecosystem members for the demanding needs of deep learning and AI, high-performance analytics, and high-performance computing users. It is deployed in the most powerful supercomputers on the planet through a partnership between IBM, NVIDIA, and Mellanox, among others.

The IBM AC922 server is an accelerator-optimized server with support for four NVIDIA Tesla V100 GPUs, each connected via NVLink 2.0 to the POWER9 CPUs at 150 GB/s. The hardware and system software configurations are summarized below.

Server	IBM AC922 (8335-GTH)
Processor	40-core at 2.4 GHz (3.0 GHz turbo) IBM POWER9 NVLink 2.0 technology, 4x SMT
Memory	512 GB DDR4 (8 Channels) - supporting up to 2 TB of memory
GPU	4x NVIDIA V100-16GB HBM2, SXM2

Table 1 - Hardware configuration

Performance Evaluation

Secondary analysis of genomic data on CPUs is known to take a long time. For 30x WGS data, running the pipeline shown above with HaplotypeCaller for variant calling can take up to 30-40 hours. The table below lists the raw run times in minutes for the Parabricks software on a POWER9 server for three DNA samples with different coverages, including NA12878.

Benchmark	Coverage	CPU only (minutes)	BWA-Mem (minutes)	Others* (minutes)	HaplotypeCaller (minutes)	Total Time (minutes)	Speedup
S2	25x	2746	56.8	14.65	13.2	84.5	32.4
NA12878	43x	3125	62.7	14.1	11.5	88.3	35.39
NIST 12878	41x	2993	61.05	14.95	13.71	89.71	33.96

Table 2 - Run times in minutes on the POWER9 server. *Others includes coordinate sorting, marking duplicates, BQSR, and ApplyBQSR.
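The Speedup column in Table 2 can be read as the CPU-only run time divided by the total accelerated run time. The short Python sketch below recomputes that ratio from the minutes listed in the table. The formula is an assumption about how the column was derived, so the recomputed values need not match the published column exactly.

```python
# Run times in minutes, copied from Table 2.
RUNS = {
    "S2": {"cpu_only": 2746.0, "total": 84.5},
    "NA12878": {"cpu_only": 3125.0, "total": 88.3},
    "NIST 12878": {"cpu_only": 2993.0, "total": 89.71},
}


def speedup(cpu_only_minutes, accelerated_minutes):
    """Ratio of CPU-only run time to accelerated run time."""
    return cpu_only_minutes / accelerated_minutes


# Recomputed speedups, rounded to two decimals for display.
SPEEDUPS = {
    name: round(speedup(run["cpu_only"], run["total"]), 2)
    for name, run in RUNS.items()
}

if __name__ == "__main__":
    for name, value in SPEEDUPS.items():
        hours = RUNS[name]["total"] / 60.0
        print(f"{name}: {value}x faster, {hours:.2f} hours accelerated")
```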

Accuracy Evaluation

The accuracy of the Parabricks solution is compared against the GATK 4 solution at two steps:

i) BAM after Marking Duplicates

ii) VCF after calling variants

Parabricks generates a BAM that is 100% equivalent to the one from the CPU-only solution, and its VCF has over 99.99% concordance with the CPU VCF.

Benchmark	Coverage	BAM	VCF
S2	25x	100%	99.998%
NA12878	43x	100%	99.996%
NIST 12878	41x	100%	99.996%

Table 3 - Equivalence of Parabricks BAM and VCF output with the CPU-only output
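The text does not say how VCF concordance is computed. As a rough illustration only, the Python sketch below assumes one simple definition: variants are keyed by chromosome, position, reference allele, and alternate allele, and concordance is the number of shared keys divided by the number of keys seen in either file. The toy records are made up for the example and are not benchmark data.

```python
def variant_keys(vcf_lines):
    """Collect (CHROM, POS, REF, ALT) keys from VCF text lines, skipping headers."""
    keys = set()
    for line in vcf_lines:
        if not line.strip() or line.startswith("#"):
            continue
        chrom, pos, _id, ref, alt = line.rstrip("\n").split("\t")[:5]
        keys.add((chrom, int(pos), ref, alt))
    return keys


def concordance(baseline_lines, candidate_lines):
    """Shared variant keys divided by all variant keys seen in either input."""
    baseline = variant_keys(baseline_lines)
    candidate = variant_keys(candidate_lines)
    union = baseline | candidate
    if not union:
        return 1.0  # Two empty call sets agree trivially.
    return len(baseline & candidate) / len(union)


# Toy records (hypothetical), tab-separated like the first VCF columns.
TOY_BASELINE = [
    "#CHROM\tPOS\tID\tREF\tALT",
    "chr1\t100\t.\tA\tG",
    "chr1\t200\t.\tC\tT",
    "chr2\t300\t.\tG\tA",
    "chr2\t400\t.\tT\tC",
]
TOY_CANDIDATE = [
    "#CHROM\tPOS\tID\tREF\tALT",
    "chr1\t100\t.\tA\tG",
    "chr1\t200\t.\tC\tT",
    "chr2\t300\t.\tG\tA",
    "chr2\t401\t.\tT\tC",
]

if __name__ == "__main__":
    print(f"{concordance(TOY_BASELINE, TOY_CANDIDATE):.3%}")
```

A stricter comparison could also take genotypes into account; this sketch compares variant sites and alleles only.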

Features of Parabricks software

30-35 times faster analysis: Compared to a CPU-only solution, Parabricks accelerates secondary analysis by more than an order of magnitude.
100% Deterministic and Reproducible: Regardless of the platform and the number or type of resources, Parabricks software generates exactly the same results on every execution.
Equivalent Results: Parabricks’ pipeline generates results equivalent to those of the reference Broad Institute GATK 4 best practices pipeline, because the same algorithm is used.
Up to Date Support of All Tool Versions: Parabricks’ accelerated software supports multiple versions of BWA-Mem, Picard, and GATK and will support all future versions of these tools.
Visualization: Parabricks generates several key visualizations in real time while performing secondary analysis, which can improve the user’s understanding of the data.
Single Node Execution: The entire pipeline is run using one computing node and does not incur any overhead of distributing data and work across multiple servers.
Turnkey Solution: Parabricks software runs on standard CPU and GPU nodes available on the cloud or on-premise, and requires no additional setup steps by the user.
On-Premise and Cloud: Parabricks software can run on local servers, AWS, Google Cloud, and Azure.
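One way a user might check the determinism claim above is to run the pipeline twice on the same input and compare checksums of the outputs. The Python sketch below is a minimal illustration of that idea using SHA-256 from the standard library. The file names and contents are toy stand-ins. How Parabricks itself verifies determinism is not described here, and a byte-level comparison is an assumption that can be stricter than needed if output headers embed run-specific details such as timestamps or command lines.

```python
import hashlib
import os
import tempfile


def sha256_of_file(path, chunk_size=1 << 20):
    """Hash a file in chunks so large outputs do not need to fit in memory."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def identical_outputs(path_a, path_b):
    """True when the two files have the same SHA-256 digest."""
    return sha256_of_file(path_a) == sha256_of_file(path_b)


def demo():
    """Write toy stand-in files and return (same pair result, different pair result)."""
    payloads = {
        "run1.vcf": b"chr1\t100\t.\tA\tG\n",
        "run2.vcf": b"chr1\t100\t.\tA\tG\n",
        "run3.vcf": b"chr1\t101\t.\tA\tG\n",
    }
    with tempfile.TemporaryDirectory() as tmp:
        paths = {}
        for name, payload in payloads.items():
            paths[name] = os.path.join(tmp, name)
            with open(paths[name], "wb") as handle:
                handle.write(payload)
        return (
            identical_outputs(paths["run1.vcf"], paths["run2.vcf"]),
            identical_outputs(paths["run1.vcf"], paths["run3.vcf"]),
        )


if __name__ == "__main__":
    print(demo())
```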

Please contact info@parabricks.com for further inquiries.