Regional Information Center for Science and Technology (RICeST) - Large Scale Page-Based Book Similarity Clustering

DocumentCode :

3644314

Title :

Large Scale Page-Based Book Similarity Clustering

Author :

Nemanja Spasojevic;Guillaume Poncin

Author_Institution :

Google Inc., Mountain View, CA, USA

Year :

2011

Firstpage :

119

Lastpage :

125

Abstract :

The Google Books corpus now counts over 15M books spanning 7 centuries and countless languages. Traditional cataloguing at that scale is imprecise, and often fails to identify more complex book-to-book relationships, such as 'same text, different pagination' or 'partial overlap'. Our contribution is a two-step technique for clustering books based on content similarity (at both book and page level) and classifying their relationships. We run this on our corpora consisting of more than 15M books (5B pages). We first detect similar books and similar pages within matching books, using hashing techniques and judicious thresholds. We then combine those features to identify the exact relationship between matching books. In this paper, we describe the basic approach to making the problem tractable, as well as the features and classifiers that we used. We enumerate a small number of relationships to qualify the link between scanned real-world books. Finally, we provide precision and recall measurements of the classifier.

Keywords :

"Feature extraction","Books","Error analysis","Optical character recognition software","Support vector machines","Correlation","Manuals"

Publisher :

IEEE

Conference_Title :

Document Analysis and Recognition (ICDAR), 2011 International Conference on

ISSN :

1520-5363

Print_ISBN :

978-1-4577-1350-7

Type :

conf

DOI :

10.1109/ICDAR.2011.33

Filename :

6065288

Link To Document :

https://search.ricest.ac.ir/dl/search/defaultta.aspx?DTC=49&DC=3644314