DocumentCode :
2149126
Title :
Facilitating Understanding of Large Document Collections
Author :
Bae, Jae Hyeon ; Xu, Weijia ; Esteva, Maria
Author_Institution :
Adv. Comput. Center, Univ. of Texas at Austin, Austin, TX, USA
fYear :
2011
fDate :
18-21 Sept. 2011
Firstpage :
1334
Lastpage :
1338
Abstract :
Large document collections containing multiple topics can be overwhelming to understand, requiring significant time and effort from librarians and archivists to develop access points. Efficient computational methods can aid this process by uncovering groups of documents that can be described for access. We investigate the use of density-based clustering with document segmentation to identify points of access as dense clusters of information. The method returns stories and classes of cohesive clusters that can be described as precise points of access. We found that our method performs more efficiently than K-means clustering and topic modeling using Latent Dirichlet Allocation (LDA). We use Hadoop to process a large document collection.
Keywords :
information retrieval; pattern clustering; text analysis; Hadoop; density based clustering; document segmentation; large document collection; Clustering algorithms; Clustering methods; Educational institutions; Electronic mail; Noise; Resource management; Vectors; Hadoop/MapReduce; digital archives; distributed processing
fLanguage :
English
Publisher :
IEEE
Conference_Titel :
Document Analysis and Recognition (ICDAR), 2011 International Conference on
Conference_Location :
Beijing
ISSN :
1520-5363
Print_ISBN :
978-1-4577-1350-7
Electronic_ISBN :
1520-5363
Type :
conf
DOI :
10.1109/ICDAR.2011.268
Filename :
6065527