DocumentCode :
2734378
Title :
Improving Web Page Clustering Through Selecting Appropriate Term Weighting Functions
Author :
Fresno, Víctor ; Martínez, Raquel ; Montalvo, Soto
Author_Institution :
ESCET, Univ. Rey Juan Carlos, Mostoles
fYear :
2006
fDate :
6-6 Dec. 2006
Firstpage :
511
Lastpage :
518
Abstract :
Web page clustering is useful for taxonomy design, information extraction, and similarity search, and it can assist in evaluating and visualizing search engine results. Accurate clustering is therefore a goal in Web mining and Web information extraction. Besides the particular clustering algorithm, the term weighting function applied to the features selected to represent Web pages is a key aspect of the clustering task. This paper evaluates the performance of six different term weighting functions for Web pages by means of the results of a partitioning clustering algorithm. In addition, two reduction methods have been applied: (1) the weighting function itself, and (2) removing all features that occur more often than upper thresholds in the page and in the collection, and all features that occur less often than lower thresholds in the page and in the collection. Through experiments with a collection of Web documents used in clustering research, we have determined that the best results are obtained with the term weighting function based on a fuzzy combination of criteria.
Keywords :
Internet; Web sites; data mining; feature extraction; information retrieval; pattern clustering; search engines; text analysis; Web information extraction; Web mining; Web page clustering; Web textual document; feature selection; information extraction; search engine; similarity search; taxonomy design; term weighting function selection; Clustering algorithms; Data mining; Frequency; HTML; Partitioning algorithms; Search engines; Taxonomy; Visualization; Web mining; Web pages;
fLanguage :
English
Publisher :
IEEE
Conference_Titel :
Digital Information Management, 2006 1st International Conference on
Conference_Location :
Bangalore
Print_ISBN :
1-4244-0682-X
Type :
conf
DOI :
10.1109/ICDIM.2007.369244
Filename :
4221936