• DocumentCode
    2297838
  • Title

    Scalable and Distributed Processing of Scientific XML Data

  • Author

    Dede, Elif ; Fadika, Zacharia ; Gupta, Chaitali ; Govindaraju, Madhusudhan

  • Author_Institution
    Dept. of Comput. Sci., SUNY Binghamton, Binghamton, NY, USA
  • fYear
    2011
  • fDate
    21-23 Sept. 2011
  • Firstpage
    121
  • Lastpage
    128
  • Abstract
    A seamless and intuitive search capability for the vast number of datasets generated by scientific experiments is critical to ensure effective use of such data by domain-specific scientists. Currently, searches on enormous XML datasets are done manually via custom scripts or by using hard-to-customize queries developed by experts in complex and disparate XML query languages. Such approaches, however, do not provide acceptable performance for large-scale data since they are not based on a scalable distributed solution. Furthermore, it has been shown that databases are not optimized for queries on XML data generated by scientific experiments, as term kinship, range-based queries, and constraints such as conjunction and negation need to be taken into account. There exists a critical need for an easy-to-use and scalable framework, specialized for scientific data, that provides natural-language-like syntax along with accurate results. As most existing search tools are designed for exact string matching, which is not adequate for scientific needs, we believe that such a framework will enhance the productivity and quality of scientific research by the data reduction capabilities it can provide. This paper presents how the MapReduce model should be used in XML metadata indexing for scientific datasets, specifically TeraGrid Information Services and the NeXus datasets generated by the Spallation Neutron Source (SNS) scientists. We present an indexing structure that scales well for large-scale MapReduce processing. We present performance results using two MapReduce implementations, Apache Hadoop and LEMO-MR, to emphasize the flexibility and adaptability of our framework in different MapReduce environments.
  • Keywords
    XML; data reduction; distributed processing; grid computing; information services; meta data; natural language processing; query languages; search problems; set theory; Apache hadoop; LEMO-MR; MapReduce model; NeXus data set; TeraGrid information service; XML metadata indexing structure; XML query language; distributed processing; domain specific scientists; hard-to-customize query; large scale MapReduce processing; large scale data reduction capability; natural language; scalable distributed solution; scientific XML data sets; search capability; search tool; spallation neutron source; Complexity theory; Indexing; Search engines; Semantics; Web search; XML;
  • fLanguage
    English
  • Publisher
    IEEE
  • Conference_Titel
    Grid Computing (GRID), 2011 12th IEEE/ACM International Conference on
  • Conference_Location
    Lyon
  • ISSN
    1550-5510
  • Print_ISBN
    978-1-4577-1904-2
  • Type

    conf

  • DOI
    10.1109/Grid.2011.24
  • Filename
    6076507