• DocumentCode
    84267
  • Title

    Efficient Semisupervised MEDLINE Document Clustering With MeSH-Semantic and Global-Content Constraints

  • Author

    Jun Gu ; Wei Feng ; Jia Zeng ; Hiroshi Mamitsuka ; Shanfeng Zhu

  • Author_Institution
    Shanghai Key Lab. of Intell. Inf. Process. & the Sch. of Comput. Sci., Fudan Univ., Shanghai, China
  • Volume
    43
  • Issue
    4
  • fYear
    2013
  • fDate
    Aug. 2013
  • Firstpage
    1265
  • Lastpage
    1276
  • Abstract
    For clustering biomedical documents, we can consider three different types of information: the local-content (LC) information from the documents themselves, the global-content (GC) information from the whole MEDLINE collection, and the medical subject heading (MeSH)-semantic (MS) information. Previous methods for clustering biomedical documents do not necessarily integrate these types of information effectively, since they use only one or two of them. Recently, the performance of MEDLINE document clustering has been enhanced by linearly combining the LC and MS information. However, a simple linear combination can be ineffective because of the limitation of the representation space for combining different types of information (similarities) with different reliability. To overcome this limitation, we propose a new semisupervised spectral clustering method, SSNCut, which clusters over the LC similarities under two types of constraints: must-link (ML) constraints on document pairs with high MS (or GC) similarities and cannot-link (CL) constraints on those with low similarities. We empirically evaluate SSNCut on MEDLINE document clustering using 100 data sets of MEDLINE records. Experimental results show that SSNCut outperformed a linear combination method and several well-known semisupervised clustering methods, with statistically significant differences. Furthermore, SSNCut with constraints from both MS and GC similarities outperformed SSNCut with constraints from only one type of similarity. Another interesting finding was that ML constraints worked more effectively than CL constraints, since around 10% of the CL constraints were incorrect, whereas only 1% of the ML constraints were.
  • Keywords
    document handling; learning (artificial intelligence); medical information systems; pattern clustering; CL constraints; GC similarities; LC information; LC similarities; MEDLINE collections; MEDLINE records; MS information; MS similarities; MeSH-semantic constraints; SSNCut; biomedical document clustering; cannot-link constraints; document pairs; global-content constraints; global-content information; local-content information; medical subject heading-semantic information; must-link constraints; semisupervised MEDLINE spectral document clustering method; Bioinformatics; Clustering algorithms; Educational institutions; Genomics; Indexing; Thesauri; Vectors; Biomedical text mining; document clustering; semisupervised clustering; spectral clustering;
  • fLanguage
    English
  • Journal_Title
    Cybernetics, IEEE Transactions on
  • Publisher
    IEEE
  • ISSN
    2168-2267
  • Type

    jour

  • DOI
    10.1109/TSMCB.2012.2227998
  • Filename
    6374265