Title :
Exploiting the forgiving nature of applications for scalable parallel execution
Author :
Meng, Jiayuan ; Raghunathan, Anand ; Chakradhar, Srimat ; Byna, Surendra
Author_Institution :
NEC Laboratories America, Princeton, NJ, USA
Abstract :
It is widely believed that most Recognition and Mining (RM) workloads can easily take advantage of parallel computing platforms because these workloads are data-parallel. Contrary to this popular belief, we present RM workloads for which conventional parallel implementations scale poorly on multi-core platforms. We identify off-chip memory transfers and overheads in the parallel runtime library as the primary bottlenecks that limit speedups to well below the ideal linear speedup expected for data-parallel workloads. To achieve improved parallel scalability, we identify and exploit several interesting properties of RM workloads: sparsity of model updates, low spatial locality among model updates, presence of insignificant computations, and the inherently self-healing nature of these algorithms in the presence of errors. We leverage these domain-specific characteristics to improve parallel scalability in two major ways. First, we utilize data dependency relaxation to simultaneously execute multiple training iterations in parallel, thereby increasing the granularity of the parallel tasks and significantly lowering the run-time overheads of fine-grained threading. Second, we strategically drop selected computations that are insignificant to the accuracy of the final result, but account for a disproportionately large amount of off-chip (memory and coherence) traffic. Through the application of the proposed techniques, we show that much higher speedups are possible on multi-core platforms for two important RM applications: document search using semantic indexing, and eye detection in images using generalized learning vector quantization. On an 8-core platform, we achieve application speedups of 5.5X and 7.3X compared to sequential implementations. Compared to conventional parallel implementations of these applications using Intel's TBB, the proposed techniques result in 4.3X and 4.9X improvements. Although the optimized parallel implementations are not numerically equivalent to the sequential implementations, the output quality is shown to be comparable (and within the margin of variation produced by processing the input data in a different order). We also explore error mitigation techniques that can be used to ensure that the accuracy of results is not compromised.
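The abstract describes two techniques in combination: dependency relaxation (training iterations that nominally depend on one another are run concurrently against a stale model snapshot) and best-effort dropping of insignificant updates (to avoid off-chip memory and coherence traffic). The following is a minimal, hypothetical C++/TBB sketch of how these two ideas might compose; it is not the paper's implementation, and names such as compute_update, relaxed_epoch, and DROP_THRESHOLD are illustrative stand-ins for the application-specific kernels (e.g., a GLVQ prototype update).

```cpp
// Illustrative sketch: relaxed-dependency, best-effort parallel training epoch.
// All names (compute_update, relaxed_epoch, DROP_THRESHOLD) are hypothetical.
#include <tbb/parallel_for.h>
#include <tbb/blocked_range.h>
#include <cmath>
#include <cstddef>
#include <cstdio>
#include <vector>

constexpr float DROP_THRESHOLD = 1e-4f;  // updates below this are treated as insignificant

// Toy stand-in for the application-specific kernel (e.g., a GLVQ prototype
// update). Most elements yield near-zero deltas, mirroring the sparsity of
// model updates that the abstract says these workloads exhibit.
static float compute_update(const std::vector<float>& model, std::size_t i, std::size_t j) {
    float x = std::sin(0.001f * static_cast<float>(i * 31 + j)) + 1e-3f * model[j];
    return (j % 64 == 0) ? 0.01f * x : 1e-6f * x;  // sparse significant updates
}

static void relaxed_epoch(std::vector<float>& model, std::size_t num_samples) {
    // Dependency relaxation: a conventional implementation makes sample i+1
    // observe the model as updated by sample i. Here, all samples in a batch
    // read a stale snapshot and write updates concurrently; sparse updates and
    // the self-healing nature of the algorithm keep the output comparable.
    tbb::parallel_for(tbb::blocked_range<std::size_t>(0, num_samples),
        [&](const tbb::blocked_range<std::size_t>& r) {
            for (std::size_t i = r.begin(); i != r.end(); ++i) {
                for (std::size_t j = 0; j < model.size(); ++j) {
                    float delta = compute_update(model, i, j);
                    // Best-effort computation: dropping insignificant updates
                    // avoids dirtying the cache line holding model[j], which is
                    // what generates off-chip memory and coherence traffic.
                    if (std::fabs(delta) < DROP_THRESHOLD) continue;
                    model[j] += delta;  // benign data race, tolerated by design
                }
            }
        });
}

int main() {
    std::vector<float> model(1 << 16, 0.0f);
    relaxed_epoch(model, 10000);
    std::printf("model[0] = %f\n", model[0]);
    return 0;
}
```

Note the deliberate unsynchronized writes to model[j]: the sketch accepts the resulting non-determinism, which is why, as the abstract states, the output is not numerically equivalent to the sequential version but remains within the variation produced by reordering the input data.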
Keywords :
parallel processing; parallel programming; data dependency relaxation; data-parallel workload; document search; eye detection; generalized learning vector quantization; mining workload; multi-core platform; off-chip memory transfer; optimized parallel implementation; parallel computing platform; parallel runtime library; parallel scalability; recognition workload; scalable parallel execution; semantic indexing; Application software; Computer science; Concurrent computing; Iterative algorithms; Laboratories; Parallel processing; Scalability; Testing; Vector quantization; Best-effort computing; Dependency relaxation; Mining; Multi-core; Parallel computing; Parallel programming; Recognition;
Conference_Titel :
2010 IEEE International Symposium on Parallel & Distributed Processing (IPDPS)
Conference_Location :
Atlanta, GA, USA
Print_ISBN :
978-1-4244-6442-5
DOI :
10.1109/IPDPS.2010.5470469