Title :
Clustering of Symbols Using Minimal Description Length
Author :
Tataw, Oben M. ; Rakthanmanon, Thanawin ; Keogh, Eamonn J.
Author_Institution :
Univ. of California, Riverside, Riverside, CA, USA
Abstract :
The clustering of glyphs (individual letters/characters/symbols) is typically the first step in document processing algorithms and a critical enabling technology for most historical document indexing techniques. In this work, we take a step back from current domain/language specialized research efforts to consider the problem from an agnostic perspective. In particular, we claim that, independent of the distance measure used, any method that attempts to cluster all the data is almost certainly doomed to failure. We explain this observation, and introduce a clustering method based on Minimum Description Length (MDL) that can overcome it.
Keywords :
document image processing; image classification; pattern clustering; MDL; document processing algorithms; glyphs clustering; minimal description length; symbol clustering; Accuracy; Algorithm design and analysis; Approximation algorithms; Character recognition; Clustering algorithms; Clustering methods; Encoding; Clustering; Image Similarity
Conference_Title :
Document Analysis and Recognition (ICDAR), 2013 12th International Conference on
Conference_Location :
Washington, DC
DOI :
10.1109/ICDAR.2013.43