DocumentCode :
48820
Title :
Autonomous Document Cleaning—A Generative Approach to Reconstruct Strongly Corrupted Scanned Texts
Author :
Zhenwen Dai ; Jörg Lücke
Author_Institution :
Dept. of Comput. Sci., Univ. of Sheffield, Sheffield, UK
Volume :
36
Issue :
10
fYear :
2014
fDate :
Oct. 1 2014
Firstpage :
1950
Lastpage :
1962
Abstract :
We study the task of cleaning scanned text documents that are strongly corrupted by dirt such as manual line strokes, spilled ink, etc. We aim at autonomously removing such corruptions from a single letter-size page based only on the information the page contains. Our approach first learns character representations from document patches without supervision. For learning, we use a probabilistic generative model parameterizing pattern features, their planar arrangements and their variances. The model's latent variables describe pattern position and class, and feature occurrences. Model parameters are efficiently inferred using a truncated variational EM approach. Based on the learned representation, a clean document can be recovered by identifying, for each patch, pattern class and position while a quality measure allows for discrimination between character and non-character patterns. For a full Latin alphabet we found that a single page does not contain sufficiently many character examples. However, even if heavily corrupted by dirt, we show that a page containing a lower number of character types can efficiently and autonomously be cleaned solely based on the structural regularity of the characters it contains. In different example applications with different alphabets, we demonstrate and discuss the effectiveness, efficiency and generality of the approach.
Keywords :
document image processing; expectation-maximisation algorithm; feature extraction; image reconstruction; image representation; learning (artificial intelligence); natural language processing; probability; text analysis; variational techniques; autonomous document cleaning; character discrimination; character representations; character types; document patches; feature occurrences; full Latin alphabet; latent variables; learning; manual line strokes; model parameters; noncharacter patterns; pattern class; pattern features; pattern position; planar arrangements; probabilistic generative model; quality measure; scanned text documents cleaning; single letter-size page; spilled ink; strongly corrupted scanned texts reconstruction; structural regularity; truncated variational EM approach; Approximation methods; Computational modeling; Data models; Histograms; Probabilistic logic; Vectors; Visualization; Probabilistic generative models; document cleaning; expectation maximization; expectation truncation; scanned text; unsupervised learning; variational approximation;
fLanguage :
English
Journal_Title :
Pattern Analysis and Machine Intelligence, IEEE Transactions on
Publisher :
IEEE
ISSN :
0162-8828
Type :
jour
DOI :
10.1109/TPAMI.2014.2313126
Filename :
6777544