DocumentCode :
3321040
Title :
Benchmarking technology infrastructures for embarrassingly and non-embarrassingly parallel problems in biomedical domain
Author :
Kazmi, Saira A. ; Kane, Michael J. ; Krauthammer, Michael O.
Author_Institution :
Sch. of Med., Yale Center for Med. Inf., Yale Univ., New Haven, CT, USA
fYear :
2013
fDate :
21-23 May 2013
Firstpage :
1
Lastpage :
4
Abstract :
With large-scale open source data available in multiple forms, our ultimate goal is to integrate these resources with gene sequence data to enhance our understanding and make viable inferences about the true nature of the processes that generate these data. We are investigating the use of an open source subset of the National Institutes of Health's National Library of Medicine (NIH/NLM) data for our analysis, including text as well as image features, to semantically link similar publications. Due to the sheer volume of data and the complexity of the inference tasks, the initial problem lies not in the analysis itself but in deciding which computational infrastructure to deploy and which data representation will help accomplish our goals. As with any other business process, reducing processing cost and time is of the essence. This work benchmarks two open source platforms, (A) Apache Hadoop with Apache Mahout and (B) open source R with the bigmemory package, for performing non-embarrassingly parallel and embarrassingly parallel machine learning tasks. Singular Value Decomposition (SVD) and k-means are used to represent these two problem classes, respectively, and average task time is evaluated on both architectures across a range of input data sizes. In addition, the performance of these algorithms with sparse and dense matrix representations is evaluated for clustering and feature extraction tasks. Our analysis shows that base R is not able to process data larger than 2 gigabytes, with exponential performance degradation for data larger than 226 megabytes. The bigmemory package in R allowed processing of larger data, but with similar degradation beyond 226 megabytes. As expected, Hadoop/Mahout did not perform as well for SVD as for k-means, due to the tightly coupled nature of the data needed at each step, and is only justified for processing very large data sets.
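As a rough illustration of the embarrassingly parallel arm of the benchmark (not the authors' actual harness), the R sketch below times bigkmeans() from the biganalytics package on a dense big.matrix created with the bigmemory package. The matrix dimensions, cluster count, and repetition count are illustrative assumptions, not values taken from the paper.

library(bigmemory)     # out-of-core / shared-memory matrix support
library(biganalytics)  # bigkmeans() operates directly on big.matrix objects

set.seed(42)
n <- 100000; p <- 50                          # illustrative problem size
x <- as.big.matrix(matrix(rnorm(n * p), n, p),
                   type = "double")           # dense big.matrix backing

# Average elapsed time over repeated runs, mirroring the "average task time"
# measure described in the abstract
times <- replicate(3, system.time(
  bigkmeans(x, centers = 10, iter.max = 25, nstart = 1)
)["elapsed"])
cat("mean k-means elapsed time (s):", mean(times), "\n")

A corresponding SVD run (the non-embarrassingly parallel case) would be timed the same way, which is where the degradation beyond 226 megabytes reported above becomes visible.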
Keywords :
computational complexity; inference mechanisms; learning (artificial intelligence); medical information systems; parallel processing; pattern clustering; public domain software; singular value decomposition; sparse matrices; Apache Hadoop; Apache Mahout; NIH-NLM; SVD; bigmemory package; biomedical domain; business process; clustering tasks; data representation; dense matrix representation; embarrassingly parallel problems; feature extraction tasks; gene sequence data; inference task complexity; large scale open source data; machine learning tasks; national institute of health national library of medicine; nonembarrassingly parallel problems; open source R; open source subset; singular value decomposition; sparse matrix representation; technology infrastructure benchmarking; Benchmark testing; Databases; Educational institutions; Hardware; Software; Sparse matrices; Vectors;
fLanguage :
English
Publisher :
ieee
Conference_Titel :
Biomedical Sciences and Engineering Conference (BSEC), 2013
Conference_Location :
Oak Ridge, TN
Print_ISBN :
978-1-4799-2118-8
Type :
conf
DOI :
10.1109/BSEC.2013.6618496
Filename :
6618496