• DocumentCode
    3323161
  • Title
    An Algebraic Approach to Rule-Based Information Extraction
  • Author
    Reiss, Frederick ; Raghavan, Sriram ; Krishnamurthy, Rajasekar ; Zhu, Huaiyu ; Vaithyanathan, Shivakumar
  • Author_Institution
    Almaden Res. Center, IBM, San Jose, CA
  • fYear
    2008
  • fDate
    7-12 April 2008
  • Firstpage
    933
  • Lastpage
    942
  • Abstract
    Traditional approaches to rule-based information extraction (IE) have primarily been based on regular expression grammars. However, these grammar-based systems have difficulty scaling to large data sets and large numbers of rules. Inspired by traditional database research, we propose an algebraic approach to rule-based IE that addresses these scalability issues through query optimization. The operators of our algebra are motivated by our experience in building several rule-based extraction programs over diverse data sets. We present the operators of our algebra and propose several optimization strategies motivated by the text-specific characteristics of our operators. Finally, we validate the potential benefits of our approach by extensive experiments over real-world blog data.
  • Keywords
    algebra; grammars; knowledge engineering; query processing; algebraic approach; grammar-based systems; large data sets; query optimization; real-world blog data; regular expression grammars; rule-based extraction programs; rule-based information extraction; text-specific characteristics; traditional database research; Algebra; Data mining; Databases; Information services; Instruments; Intelligent structures; Internet; Query processing; Scalability; Web sites
  • fLanguage
    English
  • Publisher
    IEEE
  • Conference_Titel
    Data Engineering, 2008. ICDE 2008. IEEE 24th International Conference on
  • Conference_Location
    Cancun
  • Print_ISBN
    978-1-4244-1836-7
  • Electronic_ISBN
    978-1-4244-1837-4
  • Type
    conf
  • DOI
    10.1109/ICDE.2008.4497502
  • Filename
    4497502