• DocumentCode
    108920
  • Title
    Semisupervised Wrapper Choice and Generation for Print-Oriented Documents
  • Author
    Bartoli, Alberto ; Davanzo, Giorgio ; Medvet, Eric ; Sorio, Enrico
  • Author_Institution
    Dept. of Eng. & Archit. (DIA), Univ. of Trieste, Trieste, Italy
  • Volume
    26
  • Issue
    1
  • fYear
    2014
  • fDate
    Jan. 2014
  • Firstpage
    208
  • Lastpage
    220
  • Abstract
    Information extraction from printed documents is still a crucial problem in many interorganizational workflows. Solutions for other application domains, for example, the web, do not fit this peculiar scenario well, as printed documents do not carry any explicit structural or syntactical description. Moreover, printed documents usually lack any explicit indication about their source. We present a system, which we call PATO, for extracting predefined items from printed documents in a dynamic multisource scenario. PATO selects the source-specific wrapper required by each document, determines whether no suitable wrapper exists, and generates one when necessary. PATO assumes that the need for new source-specific wrappers is a part of normal system operation: new wrappers are generated online based on a few point-and-click operations performed by a human operator on a GUI. The role of operators is an integral part of the design and PATO may be configured to accommodate a broad range of automation levels. We show that PATO exhibits very good performance on a challenging data set composed of more than 600 printed documents drawn from three different application domains: invoices, datasheets of electronic components, and patents. We also perform an extensive analysis of the crucial tradeoff between accuracy and automation level.
  • Keywords
    document handling; information retrieval; GUI; PATO; automation levels; dynamic multisource scenario; electronic component datasheets; interorganizational workflows; invoices; normal system operation; patents; point-and-click operations; predefined item extraction; print-oriented documents; printed documents; semisupervised wrapper choice; semisupervised wrapper generation; source-specific wrappers; Accuracy; Automation; Data mining; Graphical user interfaces; Humans; Information retrieval; Patents; Document management; administrative data processing; business process automation; data entry; human-computer interaction; retrieval models
  • fLanguage
    English
  • Journal_Title
    Knowledge and Data Engineering, IEEE Transactions on
  • Publisher
    IEEE
  • ISSN
    1041-4347
  • Type
    jour
  • DOI
    10.1109/TKDE.2012.254
  • Filename
    6399473