• DocumentCode
    2325421
  • Title

    Web document clustering based on a new niching Memetic Algorithm, Term-Document Matrix and Bayesian Information Criterion

  • Author

    Cobos, Carlos ; Montealegre, Claudia ; Mejía, María-Fernanda ; Mendoza, Martha ; León, Elizabeth

  • Author_Institution
    Univ. of Cauca, Popayan, Colombia
  • fYear
    2010
  • fDate
    18-23 July 2010
  • Firstpage
    1
  • Lastpage
    8
  • Abstract
    This paper introduces a new description-centric algorithm for web document clustering based on Memetic Algorithms with Niching Methods, Term-Document Matrix and Bayesian Information Criterion. The algorithm defines the number of clusters automatically. The Memetic Algorithm provides a combined global and local search strategy over the solution space, while the Niching Methods (based on restricted competition replacement and restrictive mating) promote diversity in the population and prevent it from converging too quickly. The Memetic Algorithm uses the K-means algorithm to find the optimum value in a local search space. Bayesian Information Criterion is used as a fitness function, while FP-Growth is used to reduce the high dimensionality of the vocabulary. The resulting algorithm, called WDC-NMA, was tested with data sets based on Reuters-21578 and DMOZ, obtaining promising results (better precision than a Singular Value Decomposition algorithm). It was also given an initial evaluation by a group of users.
  • Keywords
    Bayes methods; Internet; document handling; genetic algorithms; information retrieval; matrix algebra; singular value decomposition; vocabulary; Bayesian information criterion; DMOZ; FP-growth; Reuters-21578; WDC-NMA; Web document clustering; description-centric algorithm; fitness function; k-means algorithm; niching memetic algorithm; niching methods; search space; singular value decomposition algorithm; term-document matrix; Algorithm design and analysis; Classification algorithms; Clustering algorithms; Complexity theory; Heuristic algorithms; Memetics; Partitioning algorithms
  • fLanguage
    English
  • Publisher
    IEEE
  • Conference_Titel
    Evolutionary Computation (CEC), 2010 IEEE Congress on
  • Conference_Location
    Barcelona
  • Print_ISBN
    978-1-4244-6909-3
  • Type

    conf

  • DOI
    10.1109/CEC.2010.5586016
  • Filename
    5586016