  • DocumentCode
    1634849
  • Title
    User-Guided Wrapping of PDF Documents Using Graph Matching Techniques
  • Author
    Hassan, Tamir
  • Author_Institution
    Database & Artificial Intelligence Group, Vienna University of Technology, Vienna, Austria
  • fYear
    2009
  • Firstpage
    631
  • Lastpage
    635
  • Abstract
    There are a number of established products on the market for wrapping - semi-automatic navigation and extraction of data - from Web pages. These solutions make use of the inherent structure of HTML to locate instances of data to be wrapped. As PDF documents do not have such a structure, wrapping PDF documents has long been recognized as a challenging problem. We have developed a novel system for wrapping PDF documents, which is currently at a prototype stage. A PDF document is represented as an attributed relational graph, in which nodes represent physical items on the page and edges represent spatial and logical relationships. A wrapper is defined as a subgraph of the document with additional conditions, and can quickly and intuitively be created by a non-expert using the GUI. An algorithm based on subgraph isomorphism is then used to find the data instances and extract the required data. Experiments show that our approach achieves good results with good execution time.
  • Keywords
    Internet; data structures; graph theory; graphical user interfaces; hypermedia markup languages; information retrieval; pattern matching; GUI; HTML; PDF document; Web page; attributed relational graph matching technique; data extraction; data instance; semiautomatic navigation; subgraph isomorphism; user-guided wrapping; Data mining; Databases; Graphical user interfaces; HTML; Information analysis; Navigation; Prototypes; Text analysis; Web pages; Wrapping; PDF; document analysis; document understanding; graph matching; wrapping;
  • fLanguage
    English
  • Publisher
    IEEE
  • Conference_Titel
    10th International Conference on Document Analysis and Recognition (ICDAR '09), 2009
  • Conference_Location
    Barcelona
  • ISSN
    1520-5363
  • Print_ISBN
    978-1-4244-4500-4
  • Electronic_ISBN
    1520-5363
  • Type
    conf
  • DOI
    10.1109/ICDAR.2009.238
  • Filename
    5277569
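
The Abstract describes a page as an attributed relational graph and a wrapper as a subgraph with additional conditions, matched against the document to find data instances. The sketch below illustrates that idea only; it is not the paper's algorithm. It uses a naive backtracking search for (non-induced) subgraph matches, and every name in it (`Graph`, `node_ok`, `find_matches`, the `text` attribute, the `left-of` relation, the sample page content) is a hypothetical choice made for illustration.

```python
from typing import Dict, Iterator, Optional, Tuple

Attrs = Dict[str, Optional[str]]


class Graph:
    """Tiny attributed relational graph: attributed nodes, labelled directed edges."""

    def __init__(self) -> None:
        self.nodes: Dict[str, Attrs] = {}
        self.edges: Dict[Tuple[str, str], str] = {}

    def add_node(self, node_id: str, **attrs: Optional[str]) -> None:
        self.nodes[node_id] = attrs

    def add_edge(self, src: str, dst: str, relation: str) -> None:
        self.edges[(src, dst)] = relation


def node_ok(pattern: Attrs, candidate: Attrs) -> bool:
    """A pattern attribute of None is a wildcard; any other value must match exactly."""
    return all(v is None or candidate.get(k) == v for k, v in pattern.items())


def find_matches(wrapper: Graph, document: Graph) -> Iterator[Dict[str, str]]:
    """Yield every injective mapping of wrapper nodes onto document nodes
    that satisfies the node conditions and preserves all wrapper edges."""
    order = list(wrapper.nodes)

    def edges_ok(mapping: Dict[str, str]) -> bool:
        # Check only wrapper edges whose two endpoints are already mapped.
        for (a, b), relation in wrapper.edges.items():
            if a in mapping and b in mapping:
                if document.edges.get((mapping[a], mapping[b])) != relation:
                    return False
        return True

    def extend(mapping: Dict[str, str]) -> Iterator[Dict[str, str]]:
        if len(mapping) == len(order):
            yield dict(mapping)  # copy, because mapping is mutated while backtracking
            return
        w = order[len(mapping)]
        for d, attrs in document.nodes.items():
            if d in mapping.values() or not node_ok(wrapper.nodes[w], attrs):
                continue
            mapping[w] = d
            if edges_ok(mapping):
                yield from extend(mapping)
            del mapping[w]

    yield from extend({})


# Hypothetical page: three label/value pairs, each label to the left of its value.
doc = Graph()
for node_id, text in [("t1", "Price:"), ("t2", "42.00"), ("t3", "Name:"),
                      ("t4", "Widget"), ("t5", "Price:"), ("t6", "7.50")]:
    doc.add_node(node_id, text=text)
for src, dst in [("t1", "t2"), ("t3", "t4"), ("t5", "t6")]:
    doc.add_edge(src, dst, "left-of")

# Hypothetical wrapper: a "Price:" label with any text block to its right.
wrapper = Graph()
wrapper.add_node("label", text="Price:")
wrapper.add_node("value", text=None)
wrapper.add_edge("label", "value", "left-of")

extracted = [doc.nodes[m["value"]]["text"] for m in find_matches(wrapper, doc)]
print(extracted)  # prints ['42.00', '7.50']
```

The exhaustive search is exponential in the worst case; a practical system would presumably prune candidates much more aggressively, which this sketch does not attempt.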