• DocumentCode
    3231568
  • Title
    Interactive Tuples Extraction from Semi-Structured Data
  • Author
    Gilleron, Remi ; Marty, Patrick ; Tommasi, Marc ; Torre, Fabien
  • Author_Institution
    Lille Univ.
  • fYear
    2006
  • fDate
    18-22 Dec. 2006
  • Firstpage
    997
  • Lastpage
    1004
  • Abstract
    This paper studies from a machine learning viewpoint the problem of extracting tuples of a target n-ary relation from tree-structured data like XML or XHTML documents. Our system can extract, without any post-processing, tuples for all data structures including nested, rotated and cross tables. The wrapper induction algorithm we propose is based on two main ideas. It is incremental: partial tuples are extracted by increasing length. It is based on a representation-enrichment procedure: partial tuples of length i are encoded with the knowledge of extracted tuples of length i-1. The algorithm is then set in a friendly interactive wrapper induction system for Web documents. We evaluate our system on several information extraction tasks over corporate Web sites. It achieves state-of-the-art results on simple data structures and succeeds on complex data structures where previous approaches fail. Experiments also show that our interactive framework significantly reduces the number of user interactions needed to build a wrapper.
  • Keywords
    XML; information retrieval; interactive systems; learning (artificial intelligence); relational databases; tree data structures; Web documents; XHTML documents; XML documents; corporate Web sites; data structures; friendly interactive wrapper induction system; information extraction tasks; interactive tuple extraction; machine learning viewpoint; partial tuples; representation-enrichment procedure; semistructured data; target n-ary relation; tree structured data; Data mining; Data structures; Decision support systems; HTML; Information resources; Internet; Machine learning; Machine learning algorithms; Statistics; XML
  • fLanguage
    English
  • Publisher
    IEEE
  • Conference_Titel
    Web Intelligence, 2006. WI 2006. IEEE/WIC/ACM International Conference on
  • Conference_Location
    Hong Kong
  • Print_ISBN
    0-7695-2747-7
  • Type
    conf
  • DOI
    10.1109/WI.2006.102
  • Filename
    4061511