• DocumentCode
    2014710
  • Title
    A Data Mining Approach to Reading Order Detection
  • Author
    Ceci, Michelangelo; Berardi, Margherita; Porcelli, Giuseppe A.; Malerba, Donato
  • Author_Institution
    Univ. of Bari, Bari
  • Volume
    2
  • fYear
    2007
  • fDate
    23-26 Sept. 2007
  • Firstpage
    924
  • Lastpage
    928
  • Abstract
    Determining the reading order for layout components extracted from a document image can be a crucial problem for several applications. It enables the reconstruction of a single textual element from texts associated with multiple layout components and makes both information extraction and content-based retrieval of documents more effective. A common aspect of all methods reported in the literature is that they strongly depend on the specific domain and are scarcely reusable when the classes of documents or the task at hand change. In this paper, we investigate the problem of detecting the reading order of layout components by resorting to a data mining approach which acquires the domain-specific knowledge from a set of training examples. The input of the learning method is the description of the "chains" of layout components defined by the user. Only spatial information is exploited to describe a chain, thus making the proposed approach also applicable to the cases in which no text can be associated with a layout component. The method induces a probabilistic classifier based on the Bayesian framework which is used for reconstructing either single or multiple chains of layout components. It has been evaluated on a set of document images.
  • Keywords
    Bayes methods; content-based retrieval; data mining; document image processing; learning (artificial intelligence); pattern classification; probability; Bayesian framework; content-based retrieval; data mining approach; document image; domain specific knowledge; information extraction; layout components; learning method; probabilistic classifier; reading order detection; Bayesian methods; Content based retrieval; Data mining; Encoding; Image recognition; Image reconstruction; Information retrieval; Labeling; Learning systems; Predictive models;
  • fLanguage
    English
  • Publisher
    IEEE
  • Conference_Titel
    Document Analysis and Recognition, 2007. ICDAR 2007. Ninth International Conference on
  • Conference_Location
    Parana
  • ISSN
    1520-5363
  • Print_ISBN
    978-0-7695-2822-9
  • Type
    conf
  • DOI
    10.1109/ICDAR.2007.4377050
  • Filename
    4377050