• DocumentCode
    3121634
  • Title
    Join Optimization of Information Extraction Output: Quality Matters!
  • Author
    Jain, Alpa ; Ipeirotis, Panagiotis G. ; Doan, AnHai ; Gravano, Luis
  • fYear
    2009
  • fDate
    March 29, 2009 to April 2, 2009
  • Firstpage
    186
  • Lastpage
    197
  • Abstract
    Information extraction (IE) systems are trained to extract specific relations from text databases. Real-world applications often require that the output of multiple IE systems be joined to produce the data of interest. To optimize the execution of a join of multiple extracted relations, it is not sufficient to consider only execution time. In fact, the quality of the join output is of critical importance: unlike in the relational world, different join execution plans can produce join results of widely different quality whenever IE systems are involved. In this paper, we develop a principled approach to understand, estimate, and incorporate output quality into the join optimization process over extracted relations. We argue that the output quality is affected by (a) the configuration of the IE systems used to process documents, (b) the document retrieval strategies used to retrieve documents, and (c) the actual join algorithm used. Our analysis considers several alternatives for these factors and predicts the output quality (and, of course, the execution time) of the alternative execution plans. We establish the accuracy of our analytical models and study the effectiveness of a quality-aware join optimizer with a large-scale experimental evaluation over real-world text collections and state-of-the-art IE systems.
  • Keywords
    information retrieval; information retrieval systems; optimisation; document retrieval strategies; information extraction output; information extraction systems; join optimization process; Corporate acquisitions; Data engineering; Data mining; Heart; Information analysis; Information services; Internet; Relational databases; Text processing; Web sites; Information extraction; text databases
  • fLanguage
    English
  • Publisher
    IEEE
  • Conference_Titel
    Data Engineering, 2009. ICDE '09. IEEE 25th International Conference on
  • Conference_Location
    Shanghai
  • ISSN
    1084-4627
  • Print_ISBN
    978-1-4244-3422-0
  • Electronic_ISBN
    1084-4627
  • Type
    conf
  • DOI
    10.1109/ICDE.2009.138
  • Filename
    4812402