DocumentCode
2014710
Title
A Data Mining Approach to Reading Order Detection
Author
Ceci, Michelangelo ; Berardi, Margherita ; Porcelli, Giuseppe A. ; Malerba, Donato
Author_Institution
Univ. of Bari, Bari
Volume
2
fYear
2007
fDate
23-26 Sept. 2007
Firstpage
924
Lastpage
928
Abstract
Determining the reading order of layout components extracted from a document image can be a crucial problem for several applications. It enables the reconstruction of a single textual element from texts associated with multiple layout components and makes both information extraction and content-based retrieval of documents more effective. A common aspect of all methods reported in the literature is that they depend strongly on the specific domain and are scarcely reusable when the classes of documents or the task at hand change. In this paper, we investigate the problem of detecting the reading order of layout components by resorting to a data mining approach that acquires the domain-specific knowledge from a set of training examples. The input to the learning method is the description of the "chains" of layout components defined by the user. Only spatial information is exploited to describe a chain, which makes the proposed approach applicable also to cases in which no text can be associated with a layout component. The method induces a probabilistic classifier based on the Bayesian framework, which is used for reconstructing either single or multiple chains of layout components. It has been evaluated on a set of document images.
Keywords
Bayes methods; content-based retrieval; data mining; document image processing; learning (artificial intelligence); pattern classification; probability; Bayesian framework; content-based retrieval; data mining approach; document image; domain specific knowledge; information extraction; layout components; learning method; probabilistic classifier; reading order detection; Bayesian methods; Content based retrieval; Data mining; Encoding; Image recognition; Image reconstruction; Information retrieval; Labeling; Learning systems; Predictive models
fLanguage
English
Publisher
IEEE
Conference_Titel
Document Analysis and Recognition, 2007. ICDAR 2007. Ninth International Conference on
Conference_Location
Parana
ISSN
1520-5363
Print_ISBN
978-0-7695-2822-9
Type
conf
DOI
10.1109/ICDAR.2007.4377050
Filename
4377050