• DocumentCode
    2194276
  • Title

    Parallel Processing of Large-Scale XML-Based Application Documents on Multi-core Architectures with PiXiMaL

  • Author

    Head, Michael R. ; Govindaraju, Madhusudhan

  • Author_Institution
    Grid Comput. Res. Lab., SUNY Binghamton, Binghamton, NY, USA
  • fYear
    2008
  • fDate
    7-12 Dec. 2008
  • Firstpage
    261
  • Lastpage
    268
  • Abstract
    Very large scientific datasets are becoming increasingly available in XML formats. Our earlier benchmarking results show that parsing XML is a time-consuming process when compared with binary formats optimized for large-scale documents. This performance bottleneck will be exacerbated as the size of XML data increases in e-science applications. Our focus in this paper is on addressing this performance bottleneck. In recent times, the microprocessor industry has made rapid strides towards chip multiprocessors (CMPs). Widely available XML parsers have been unable to take advantage of the opportunities presented by CMPs, instead passing the task of parallelization to the application programmer. The paradigms used thus far to process large XML documents on uniprocessors are not applicable to CMPs. We present the design, implementation, and performance analysis of PiXiMaL, a parallel processing library for large-scale XML data files. In particular, we discuss an effective scheme to parallelize the tokenization process to achieve an overall performance increase when parsing the large-scale XML documents that are increasingly in use today. Our approach is to build a DFA-based parser that recognizes a useful subset of the XML specification and to convert the DFA into an NFA that can be applied to any subset of the input.
  • Keywords
    XML; deterministic automata; finite automata; grammars; multiprocessing systems; natural sciences computing; parallel programming; parallelising compilers; CMP; DFA-based XML parser; NFA-based XML parser; PiXiMaL parallel processing library; XML specification; chip multiprocessor; deterministic finite automata; e-science application; large-scale XML-based application document; microprocessor industry; multicore architecture; parallel processing; performance bottleneck; tokenization process; very large scientific dataset; Algorithm design and analysis; Delay; Humans; Large-scale systems; Microprocessors; Middleware; Multicore processing; Parallel processing; Web services; XML; automata; parallel; parsing; xml;
  • fLanguage
    English
  • Publisher
    IEEE
  • Conference_Titel
    eScience, 2008. eScience '08. IEEE Fourth International Conference on
  • Conference_Location
    Indianapolis, IN
  • Print_ISBN
    978-1-4244-3380-3
  • Electronic_ISBN
    978-0-7695-3535-7
  • Type

    conf

  • DOI
    10.1109/eScience.2008.77
  • Filename
    4736766
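
The abstract's closing idea, applying an automaton to an arbitrary slice of the input, can be illustrated with a minimal sketch. This is not PiXiMaL's code or API: the two-state automaton, the function names, and the chunking scheme below are all hypothetical and chosen only for illustration. The assumption is that a chunk in the middle of a document does not know which DFA state it starts in, so each chunk is run from every possible start state (the NFA-style step) and the true start states are resolved afterwards in a cheap sequential pass.

```python
from concurrent.futures import ThreadPoolExecutor

# Hypothetical two-state DFA for a tiny XML-like subset:
# CONTENT = outside a tag, TAG = between '<' and '>'.
CONTENT, TAG = "CONTENT", "TAG"
STATES = (CONTENT, TAG)


def step(state, ch):
    """One DFA transition."""
    if state == CONTENT and ch == "<":
        return TAG
    if state == TAG and ch == ">":
        return CONTENT
    return state


def run_dfa(text, start=CONTENT):
    """Sequential reference run over the whole text."""
    state = start
    for ch in text:
        state = step(state, ch)
    return state


def chunk_summary(chunk):
    """Run the chunk from every possible start state (the NFA-style step).

    Returns a mapping: start state -> end state for this chunk.
    """
    return {s: run_dfa(chunk, s) for s in STATES}


def split(text, n):
    """Split text into at most n chunks of roughly equal size."""
    size = max(1, -(-len(text) // n))  # ceiling division
    return [text[i:i + size] for i in range(0, len(text), size)]


def parallel_final_state(text, n_chunks=4):
    """Summarize chunks independently, then stitch the summaries together."""
    chunks = split(text, n_chunks)
    with ThreadPoolExecutor() as pool:
        summaries = list(pool.map(chunk_summary, chunks))
    state = CONTENT
    starts = []  # resolved start state of each chunk
    for summary in summaries:
        starts.append(state)
        state = summary[state]
    return state, starts


doc = "<root><item>42</item></root>"
final, starts = parallel_final_state(doc, 4)
print(final, starts)
```

The thread pool here only demonstrates that the per-chunk work is independent; in CPython, threads would not be expected to speed up this pure-Python loop, and a real implementation would need a far richer automaton than the two states assumed above.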