DocumentCode
2016549
Title
Unsupervised Newspaper Segmentation Using Language Context
Author
Furmaniak, Ralph
Author_Institution
Univ. of Waterloo, Waterloo
Volume
2
fYear
2007
fDate
23-26 Sept. 2007
Firstpage
1263
Lastpage
1267
Abstract
There has been increased interest in the digitization of newspaper archives. A major problem that must be solved is the high-accuracy decomposition of the page into its logical structure. In this paper I present an approach that uses a language similarity measure based on OCR results to train geometric layout rules tailored to an arbitrary title. Experiments have shown this approach to be very effective.
Keywords
image segmentation; information retrieval systems; natural languages; optical character recognition; publishing; unsupervised learning; OCR result; arbitrary title; geometric layout rule training; language similarity measure; newspaper archive digitization; unsupervised newspaper segmentation; Books; Engines; Image processing; Image segmentation; Merging; Optical character recognition software; Testing; Text recognition;
fLanguage
English
Publisher
IEEE
Conference_Titel
Document Analysis and Recognition, 2007. ICDAR 2007. Ninth International Conference on
Conference_Location
Parana
ISSN
1520-5363
Print_ISBN
978-0-7695-2822-9
Type
conf
DOI
10.1109/ICDAR.2007.4377118
Filename
4377118