Scalable and Distributed Processing of Scientific XML Data

Author

Dede, Elif ; Fadika, Zacharia ; Gupta, Chaitali ; Govindaraju, Madhusudhan

Author_Institution

Dept. of Comput. Sci., SUNY Binghamton, Binghamton, NY, USA

fYear

2011

fDate

21-23 Sept. 2011

Firstpage

121

Lastpage

128

Abstract

A seamless and intuitive search capability for the vast number of datasets generated by scientific experiments is critical to ensure effective use of such data by domain-specific scientists. Currently, searches on enormous XML datasets are done manually via custom scripts or by using hard-to-customize queries developed by experts in complex and disparate XML query languages. Such approaches, however, do not provide acceptable performance for large-scale data since they are not based on a scalable distributed solution. Furthermore, it has been shown that databases are not optimized for queries on XML data generated by scientific experiments, as term kinship, range-based queries, and constraints such as conjunction and negation need to be taken into account. There exists a critical need for an easy-to-use and scalable framework, specialized for scientific data, that provides natural-language-like syntax along with accurate results. As most existing search tools are designed for exact string matching, which is not adequate for scientific needs, we believe that such a framework will enhance the productivity and quality of scientific research through the data reduction capabilities it can provide. This paper presents how the MapReduce model should be used in XML metadata indexing for scientific datasets, specifically TeraGrid Information Services and the NeXus datasets generated by the Spallation Neutron Source (SNS) scientists. We present an indexing structure that scales well for large-scale MapReduce processing. We present performance results using two MapReduce implementations, Apache Hadoop and LEMO-MR, to emphasize the flexibility and adaptability of our framework in different MapReduce environments.

Keywords

XML; data reduction; distributed processing; grid computing; information services; meta data; natural language processing; query languages; search problems; set theory; Apache Hadoop; LEMO-MR; MapReduce model; NeXus data set; TeraGrid information service; XML metadata indexing structure; XML query language; domain specific scientists; hard-to-customize query; large scale MapReduce processing; large scale data reduction capability; natural language; scalable distributed solution; scientific XML data sets; search capability; search tool; spallation neutron source; Complexity theory; Indexing; Search engines; Semantics; Web search

fLanguage

English

Publisher

IEEE

Conference_Titel

Grid Computing (GRID), 2011 12th IEEE/ACM International Conference on

Conference_Location

Lyon

ISSN

1550-5510

Print_ISBN

978-1-4577-1904-2

Type

conf

DOI

10.1109/Grid.2011.24

Filename

6076507