The Challenge of Scaling Genome Big Data Analysis Software on TH-2 Supercomputer

The Challenge of Scaling Genome Big Data Analysis Software on TH-2 Supercomputer Whole genome re-sequencing plays a crucial role in biomedical studies. The emergence of genomic bigdata calls for an enormous amount of computing power. However, current computational methods are inefficient in utilizing available computational resources. In this paper, we address this challenge by optimizing the utilization of the fastest supercomputer in the world – TH-2 supercomputer. TH-2 is featured by its neo-heterogeneous architecture, in which each compute node is equipped with 2 Intel Xeon CPUs and 3 Intel Xeon Phi coprocessors. The heterogeneity and the massive amount of data to be processed pose great challenges for the deployment of the genome analysis software pipeline on TH-2. Runtime profiling shows that SOAP3-dp and SOAPsnp are the most time-consuming components (up to 70% of total runtime) in a typical genome-analyzing pipeline. To optimize the whole pipeline, we first devise a number of parallel and optimization strategies for SOAP3-dp and SOAPsnp, respectively targeting each node to fully utilize all sorts of hardware resources provided both by CPU and MIC. We also employ a few scaling methods to reduce communication between different nodes. We then scaled up our method on TH-2. With 8192 nodes, the whole analyzing procedure took 8.37 hours to finish the analysis of a 300 TB dataset of whole genome sequences from 2,000 human beings, which can take as long as 8 months on a commodity server. The speedup is about 700x.