Facilitating Understanding of Large Document Collections

Author

Bae, Jae Hyeon ; Xu, Weijia ; Esteva, Maria

Author_Institution

Adv. Comput. Center, Univ. of Texas at Austin, Austin, TX, USA

fYear

2011

fDate

18-21 Sept. 2011

Firstpage

1334

Lastpage

1338

Abstract

Large document collections containing multiple topics can be overwhelming to understand, and developing access points for them requires significant time and effort from librarians and archivists. Efficient computational methods can aid this process by uncovering groups of documents that can be described for access. We investigate the use of density-based clustering with document segmentation to identify points of access as dense clusters of information. The method returns stories and classes of cohesive clusters that can be described as precise points of access. We found that our method performs more efficiently than K-means clustering and topic modeling with Latent Dirichlet Allocation (LDA). We use Hadoop to process a large document collection.

Keywords

information retrieval; pattern clustering; text analysis; Hadoop; density based clustering; document segmentation; large document collection; Clustering algorithms; Clustering methods; Educational institutions; Electronic mail; Noise; Resource management; Vectors; Hadoop/MapReduce; digital archives; distributed processing

fLanguage

English

Publisher

IEEE

Conference_Titel

Document Analysis and Recognition (ICDAR), 2011 International Conference on

Conference_Location

Beijing

ISSN

1520-5363

Print_ISBN

978-1-4577-1350-7

Electronic_ISBN

1520-5363

Type

conf

DOI

10.1109/ICDAR.2011.268

Filename

6065527