Title :
Web document clustering based on a new niching Memetic Algorithm, Term-Document Matrix and Bayesian Information Criterion
Author :
Cobos, Carlos ; Montealegre, Claudia ; Mejía, María-Fernanda ; Mendoza, Martha ; León, Elizabeth
Author_Institution :
Univ. of Cauca, Popayan, Colombia
Abstract :
This paper introduces a new description-centric algorithm for web document clustering based on Memetic Algorithms with Niching Methods, a Term-Document Matrix, and the Bayesian Information Criterion. The algorithm determines the number of clusters automatically. The Memetic Algorithm combines global and local search strategies over the solution space, while the niching methods (restricted competition replacement and restrictive mating) promote diversity in the population and prevent it from converging too quickly. The Memetic Algorithm uses the K-means algorithm to find the optimum within a local search space. The Bayesian Information Criterion is used as the fitness function, and FP-Growth is used to reduce the high dimensionality of the vocabulary. The resulting algorithm, called WDC-NMA, was tested with data sets based on Reuters-21578 and DMOZ, obtaining promising results (better precision than a Singular Value Decomposition algorithm). It was also given an initial evaluation by a group of users.
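The role of K-means as local search and of the Bayesian Information Criterion as the score that selects the number of clusters can be illustrated with a minimal sketch. It is not the WDC-NMA implementation: the memetic, niching, and FP-Growth parts are omitted, and the BIC form used here, n * ln(SSE / n) + k * d * ln(n) with lower values being better, is one common approximation for K-means and is an assumption, since the abstract does not give the exact formula. The names `kmeans`, `bic`, `choose_k`, and `POINTS` are hypothetical, and the toy 2-D points stand in for rows of a term-document matrix.

```python
import math


def sq_dist(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def mean(vectors):
    """Component-wise mean of a non-empty list of vectors."""
    n = len(vectors)
    return tuple(sum(v[i] for v in vectors) / n for i in range(len(vectors[0])))


def kmeans(points, k, iters=20):
    """Plain Lloyd's K-means with a deterministic, evenly spaced initialization."""
    step = len(points) / k
    centroids = [points[int(i * step)] for i in range(k)]
    for _ in range(iters):
        clusters = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k), key=lambda c: sq_dist(p, centroids[c]))
            clusters[nearest].append(p)
        # An empty cluster keeps its previous centroid.
        centroids = [mean(c) if c else centroids[i] for i, c in enumerate(clusters)]
    return centroids


def bic(points, centroids):
    """Assumed BIC approximation: n*ln(SSE/n) + k*d*ln(n); lower is better."""
    n, d, k = len(points), len(points[0]), len(centroids)
    sse = sum(min(sq_dist(p, c) for c in centroids) for p in points)
    sse = max(sse, 1e-12)  # guard against log(0) on perfectly tight clusters
    return n * math.log(sse / n) + k * d * math.log(n)


def choose_k(points, max_k):
    """Return the k in 1..max_k whose K-means solution has the lowest BIC."""
    scores = {k: bic(points, kmeans(points, k)) for k in range(1, max_k + 1)}
    return min(scores, key=scores.get)


# Toy data: two well-separated groups of four points each.
POINTS = [(0, 0), (0, 1), (1, 0), (1, 1),
          (10, 10), (10, 11), (11, 10), (11, 11)]

best_k = choose_k(POINTS, 4)
print("selected number of clusters:", best_k)
```

In this sketch every candidate k is tried exhaustively. In the algorithm described in the abstract, the candidate solutions would instead come from the evolving population, with the BIC value serving as each individual's fitness.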
Keywords :
Bayes methods; Internet; document handling; genetic algorithms; information retrieval; matrix algebra; singular value decomposition; vocabulary; Bayesian information criterion; DMOZ; FP-growth; Reuters-21578; WDC-NMA; Web document clustering; description-centric algorithm; fitness function; k-means algorithm; niching memetic algorithm; niching methods; search space; singular value decomposition algorithm; term-document matrix; Algorithm design and analysis; Classification algorithms; Clustering algorithms; Complexity theory; Heuristic algorithms; Memetics; Partitioning algorithms
Conference_Title :
Evolutionary Computation (CEC), 2010 IEEE Congress on
Conference_Location :
Barcelona
Print_ISBN :
978-1-4244-6909-3
DOI :
10.1109/CEC.2010.5586016