Author_Institution :
Coll. of Inf. Sci. & Technol., Drexel Univ., Philadelphia, PA
Abstract :
Document clustering has been used to improve document retrieval, document browsing, and text mining in digital libraries. In this paper, we present a comprehensive comparison of document clustering approaches, namely three hierarchical methods (single-link, complete-link, and average-link), Bisecting K-means, K-means, and suffix tree clustering, in terms of efficiency, effectiveness, and scalability. In addition, we apply a domain ontology to document clustering to investigate whether an ontology such as MeSH improves clustering quality for MEDLINE articles. Because an ontology is a formal, explicit specification of a shared conceptualization for a domain of interest, the use of ontologies is a natural way to address traditional information retrieval problems such as the synonym/hypernym/hyponym problems. We conducted extensive experiments, using evaluation metrics such as misclassification index, F-measure, cluster purity, and entropy, on very large article sets from MEDLINE, the largest digital library in biomedicine.
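The abstract names cluster purity and entropy as evaluation metrics but does not define them. As a rough illustration only, the sketch below uses the common textbook definitions (purity as the fraction of documents that fall in the majority class of their cluster, entropy as the size-weighted average of per-cluster class entropy), which may differ in detail from the formulas used in the paper. The function names and the toy labels are hypothetical and do not come from the paper.

```python
import math
from collections import Counter, defaultdict


def _group_by_cluster(cluster_ids, class_labels):
    """Collect the true class labels of the documents in each cluster."""
    if len(cluster_ids) != len(class_labels) or not class_labels:
        raise ValueError("inputs must be non-empty and of equal length")
    clusters = defaultdict(list)
    for cid, label in zip(cluster_ids, class_labels):
        clusters[cid].append(label)
    return clusters


def purity(cluster_ids, class_labels):
    """Fraction of documents that belong to their cluster's majority class."""
    clusters = _group_by_cluster(cluster_ids, class_labels)
    majority_total = sum(
        Counter(members).most_common(1)[0][1] for members in clusters.values()
    )
    return majority_total / len(class_labels)


def entropy(cluster_ids, class_labels):
    """Size-weighted average of the class entropy (base 2) of each cluster."""
    clusters = _group_by_cluster(cluster_ids, class_labels)
    n = len(class_labels)
    total = 0.0
    for members in clusters.values():
        size = len(members)
        h = -sum(
            (count / size) * math.log2(count / size)
            for count in Counter(members).values()
        )
        total += (size / n) * h
    return total


if __name__ == "__main__":
    # Toy data (assumed): six documents, two clusters, two true classes.
    assigned = [0, 0, 0, 1, 1, 1]
    truth = ["a", "a", "b", "b", "b", "b"]
    print(purity(assigned, truth))
    print(entropy(assigned, truth))
```

Under these definitions, higher purity and lower entropy both indicate clusters that agree more closely with the reference classes, for example MeSH-derived categories if those are used as ground truth.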
Keywords :
bibliographic systems; data mining; document handling; information retrieval; medical information systems; ontologies (artificial intelligence); pattern clustering; text analysis; K-means; biomedical digital library MEDLINE; biomedicine; bisecting K-means; document browsing; document clustering; document retrieval; domain ontology; formal explicit specification; hierarchical methods; shared conceptualization; suffix tree clustering; text mining; Biomedical measurements; Clustering algorithms; Educational institutions; Information science; Iterative algorithms; Ontologies; Partitioning algorithms; Software libraries; comparison study; ontology