• DocumentCode
    2149126
  • Title
    Facilitating Understanding of Large Document Collections
  • Author
    Bae, Jae Hyeon ; Xu, Weijia ; Esteva, Maria
  • Author_Institution
    Adv. Comput. Center, Univ. of Texas at Austin, Austin, TX, USA
  • fYear
    2011
  • fDate
    18-21 Sept. 2011
  • Firstpage
    1334
  • Lastpage
    1338
  • Abstract
    Large document collections containing multiple topics can be overwhelming to understand, requiring significant time and effort from librarians and archivists to develop access points. Efficient computational methods can aid this process by uncovering groups of documents that can be described for access. We investigate the use of density-based clustering with document segmentation to identify points of access as dense clusters of information. The method returns stories and classes of cohesive clusters that can be described as precise points of access. We found that our method performs more efficiently than K-means clustering and topic modeling using Latent Dirichlet Allocation (LDA). We use Hadoop to process a large document collection.
  • Keywords
    information retrieval; pattern clustering; text analysis; Hadoop; density based clustering; document segmentation; large document collection; Clustering algorithms; Clustering methods; Educational institutions; Electronic mail; Noise; Resource management; Vectors; Hadoop/MapReduce; digital archives; distributed processing
  • fLanguage
    English
  • Publisher
    IEEE
  • Conference_Titel
    Document Analysis and Recognition (ICDAR), 2011 International Conference on
  • Conference_Location
    Beijing
  • ISSN
    1520-5363
  • Print_ISBN
    978-1-4577-1350-7
  • Electronic_ISBN
    1520-5363
  • Type
    conf
  • DOI
    10.1109/ICDAR.2011.268
  • Filename
    6065527