• DocumentCode
    2613897
  • Title
    Web documents categorization using fuzzy representation and HAC
  • Author
    Deng, Jiawei ; Chen, Lihui
  • Author_Institution
    Sch. of Electr. & Electron. Eng., Nanyang Tech. Univ., Singapore
  • Volume
    2
  • fYear
    2000
  • fDate
    2000
  • Firstpage
    24
  • Abstract
    Most of the existing techniques for the characterization of Web documents are based on term-frequency analysis. In such models, given a set of documents, the characterization of each document is represented by a feature vector in a vector space. However, as Web documents written in HTML are semi-structured by means of tags, the traditional techniques that assign term weights only by the frequency of occurrence may not be able to provide satisfactory results in representing the content of such documents. Some recent studies have shown that the fuzzy representation (FR) of WWW information based on the significance of HTML tags is an effective alternative for characterizing Web documents. In this paper, the FR is used to generate the feature vector for each Web document and the hierarchical agglomerative clustering (HAC) algorithm is applied to investigate its efficiency and effectiveness for the automatic categorization of Web documents with similar contents. Experiments that have been conducted suggest several benefits of using such an approach.
  • Keywords
    classification; fuzzy set theory; hypermedia markup languages; information resources; pattern clustering; vectors; HAC algorithm; HTML tags; World Wide Web document categorization; document characterization; document content representation; feature vector; fuzzy representation; hierarchical agglomerative clustering; occurrence frequency; semi-structured documents; term weights; term-frequency analysis; vector space; Clustering algorithms; Frequency; HTML; Information retrieval; Internet; Natural languages; Navigation; Probes; Web pages; World Wide Web
  • fLanguage
    English
  • Publisher
    IEEE
  • Conference_Titel
    Web Information Systems Engineering, 2000. Proceedings of the First International Conference on
  • Conference_Location
    Hong Kong
  • Print_ISBN
    0-7695-0577-5
  • Type
    conf
  • DOI
    10.1109/WISE.2000.882848
  • Filename
    882848
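
The approach summarized in the abstract (tag-aware term weighting followed by HAC) might be sketched roughly as follows. This is only an illustrative sketch under stated assumptions, not the authors' implementation: the tag weights in `TAG_WEIGHTS`, the normalization of weights into [0, 1], the cosine similarity measure, and the average-linkage merge rule are all hypothetical choices that the abstract does not specify. The toy parser also ignores nested tags.

```python
import math
import re
from collections import defaultdict

# Hypothetical tag significance values (assumed for illustration only;
# the abstract does not give the weights used in the paper).
TAG_WEIGHTS = {"title": 3.0, "h1": 2.0, "b": 1.5, "p": 1.0}


def tag_weighted_vector(html):
    """Build a {term: weight} vector where each term occurrence counts
    by the weight of its enclosing tag, then scale into [0, 1]."""
    vec = defaultdict(float)
    # Simplification: matches only flat <tag>text</tag> pairs, no nesting.
    for tag, text in re.findall(r"<(\w+)>(.*?)</\1>", html, flags=re.S | re.I):
        weight = TAG_WEIGHTS.get(tag.lower(), 1.0)
        for term in re.findall(r"[a-z]+", text.lower()):
            vec[term] += weight
    if not vec:
        return {}
    top = max(vec.values())
    # Assumed normalization so each weight reads like a membership degree.
    return {term: w / top for term, w in vec.items()}


def cosine(a, b):
    """Cosine similarity between two sparse term vectors."""
    dot = sum(w * b.get(t, 0.0) for t, w in a.items())
    na = math.sqrt(sum(w * w for w in a.values()))
    nb = math.sqrt(sum(w * w for w in b.values()))
    return dot / (na * nb) if na and nb else 0.0


def hac(vectors, num_clusters):
    """Hierarchical agglomerative clustering with average linkage
    (the linkage rule is an assumption). Returns lists of document indices."""
    clusters = [[i] for i in range(len(vectors))]
    while len(clusters) > num_clusters:
        best = None
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                sims = [cosine(vectors[a], vectors[b])
                        for a in clusters[i] for b in clusters[j]]
                score = sum(sims) / len(sims)
                if best is None or score > best[0]:
                    best = (score, i, j)
        _, i, j = best
        # Merge the most similar pair; j > i, so deleting j keeps i valid.
        clusters[i] = clusters[i] + clusters[j]
        del clusters[j]
    return clusters


# Made-up toy documents, not data from the paper.
docs = [
    "<title>fuzzy clustering</title><p>clustering of web documents</p>",
    "<title>clustering algorithms</title><p>hierarchical clustering methods</p>",
    "<title>football scores</title><p>latest football results</p>",
    "<title>football league</title><p>league results and scores</p>",
]
vectors = [tag_weighted_vector(d) for d in docs]
clusters = hac(vectors, 2)
print(clusters)
```

In this toy run the two clustering-related pages and the two football-related pages share no terms across topics, so they end up in separate clusters. A term in a `<title>` contributes more than the same term in a `<p>`, which is the point of weighting by tag rather than by raw frequency.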