DocumentCode
2021012
Title
Document Content Inventory and Retrieval
Author
Baird, Henry S. ; Moll, Michael A.
Author_Institution
Lehigh Univ., Bethlehem
Volume
1
fYear
2007
fDate
23-26 Sept. 2007
Firstpage
93
Lastpage
97
Abstract
We give an analysis of relationships between expected retrieval performance and classification recognition accuracy in the context of document image content extraction and inventory. By content extraction we mean location and measurement of regions containing handwriting, machine-printed text, photographs, blank space, etc., in documents represented as bilevel, grey-level, or color images. Recent experiments have shown that even modest per-pixel content classification accuracies can support usefully high recall and precision rates (e.g., 80-90%) for retrieval queries within document collections seeking pages that contain a given minimum fraction of a certain type of content. In an effort to elucidate this interesting empirical result, we have analyzed the interdependency of classification and retrieval under a variety of assumptions about the distribution of content types in document image collections. We show that under general conditions we cannot derive precision and recall measures from per-pixel classification measures alone, but we can estimate the expected values of these measures. If, however, the distribution of content and the error rates are uniform across the entire collection, our results suggest that it is possible to predict precision and recall measures from classification accuracy, and vice versa.
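The retrieval query described in the abstract (pages containing at least a given minimum fraction of one content type) can be illustrated with a minimal sketch. It is not the authors' code: the function names, the flat list-of-labels page representation, and the toy data are all assumptions made for illustration. It computes precision and recall by comparing the pages retrieved using predicted per-pixel labels with the pages that are relevant according to ground-truth labels.

```python
from typing import List, Set, Tuple

# Hypothetical representation: a page is a flat list of per-pixel labels.
Page = List[str]


def content_fraction(page: Page, content_type: str) -> float:
    """Fraction of the page's pixels that carry the given content label."""
    if not page:
        return 0.0
    return sum(1 for label in page if label == content_type) / len(page)


def pages_meeting_query(pages: List[Page], content_type: str,
                        min_fraction: float) -> Set[int]:
    """Indices of pages whose content fraction is at least min_fraction."""
    return {i for i, page in enumerate(pages)
            if content_fraction(page, content_type) >= min_fraction}


def precision_recall(truth: List[Page], predicted: List[Page],
                     content_type: str,
                     min_fraction: float) -> Tuple[float, float]:
    """Precision and recall of the query answered from predicted labels."""
    relevant = pages_meeting_query(truth, content_type, min_fraction)
    retrieved = pages_meeting_query(predicted, content_type, min_fraction)
    hits = len(relevant & retrieved)
    # Convention assumed here: an empty set yields a score of 1.0.
    precision = hits / len(retrieved) if retrieved else 1.0
    recall = hits / len(relevant) if relevant else 1.0
    return precision, recall


def make_page(n_handwriting: int, n_pixels: int = 10) -> Page:
    """Toy page with n_handwriting 'handwriting' pixels, the rest 'blank'."""
    return ["handwriting"] * n_handwriting + ["blank"] * (n_pixels - n_handwriting)


# Toy collection: ground-truth and (imperfect) predicted per-pixel labels.
truth = [make_page(5), make_page(1), make_page(3)]
predicted = [make_page(4), make_page(3), make_page(1)]

precision, recall = precision_recall(truth, predicted, "handwriting", 0.25)
print(precision, recall)
```

In this toy collection the per-pixel errors are small on every page, yet one page crosses the query threshold in each direction, so the same classifier can give different retrieval scores depending on how content fractions are distributed around the threshold.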
Keywords
content-based retrieval; document image processing; information retrieval; pattern classification; classification recognition accuracy; content classification accuracy; document content inventory; document image collections; document image content extraction; document retrieval; retrieval performance; retrieval query; Content based retrieval; Image analysis; Image color analysis; Image recognition; Image retrieval; Information retrieval; Performance analysis; Pixel; Testing; Text analysis
fLanguage
English
Publisher
IEEE
Conference_Titel
Document Analysis and Recognition, 2007. ICDAR 2007. Ninth International Conference on
Conference_Location
Parana
ISSN
1520-5363
Print_ISBN
978-0-7695-2822-9
Type
conf
DOI
10.1109/ICDAR.2007.4378682
Filename
4378682