Title :
Effect of multi-word features on the hierarchical clustering of web documents
Author :
Karthick, S. ; Shalinie, S. Mercy ; Eswarimeena, Ar ; Madhumitha, P. ; Abhinaya, T. Naga
Author_Institution :
Dept. of Comput. Sci. & Eng., Thiagarajar Coll. of Eng., Madurai, India
Abstract :
Contemporary search engines and other automated web tools must extract relevant information from huge web archives. This task is difficult because web documents are semi-structured or unstructured. Users need automated ways of organizing and cataloging web documents so that they can be queried efficiently. Clustering is typically employed to organize web archives and subsequently to handle user queries. This paper analyzes the effect of including multi-word features on the performance of a hierarchical clustering algorithm. Noun sequences are the predominant features considered in our work, whereas most previous research uses n-grams as features. The paper also analyzes how combining link-based and content-based representations of web documents and their inter-relationships affects clustering performance. Empirical evaluation of the hierarchical clustering engine suggests that including multi-word features improves the precision of the hierarchical clustering algorithm.
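To illustrate what a noun-sequence feature might look like, the following is a minimal sketch and not the authors' implementation. It assumes the text has already been part-of-speech tagged with Penn Treebank tags, and it treats any maximal run of two or more consecutive nouns as one multi-word feature. The function name `noun_sequences`, the `min_len` parameter, and the sample sentence are hypothetical; the abstract does not state which tagger, tag set, or minimum length the paper uses.

```python
from typing import List, Tuple

# Assumption: Penn Treebank noun tags; the paper's actual tag set is not stated.
NOUN_TAGS = {"NN", "NNS", "NNP", "NNPS"}


def noun_sequences(tagged: List[Tuple[str, str]], min_len: int = 2) -> List[str]:
    """Return maximal runs of consecutive nouns that span at least min_len tokens."""
    features = []
    run = []
    # A sentinel non-noun token at the end flushes a run that closes the input.
    for word, tag in tagged + [("", "END")]:
        if tag in NOUN_TAGS:
            run.append(word.lower())
        else:
            if len(run) >= min_len:
                features.append(" ".join(run))
            run = []
    return features


# Hypothetical pre-tagged sentence, used only for illustration.
tagged = [
    ("Contemporary", "JJ"), ("search", "NN"), ("engines", "NNS"),
    ("organize", "VBP"), ("web", "NN"), ("archives", "NNS"),
    ("efficiently", "RB"),
]
print(noun_sequences(tagged))  # ['search engines', 'web archives']
```

Features of this kind could then be weighted and passed to a hierarchical clustering routine alongside single-word terms; that step is omitted here because the abstract gives no details of the weighting or linkage scheme.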
Keywords :
Internet; cataloguing; pattern clustering; query processing; search engines; text analysis; Web archives; Web documents cataloging; Web documents organization; automated Web tools; clustering performance; contemporary search engines; content based representations; hierarchical clustering algorithm; hierarchical clustering engine; information extraction; link based representations; multiword features; noun sequences; user queries; Algorithm design and analysis; Clustering algorithms; Equations; Mathematical model; Measurement; Speech; Web pages; Clustering; Feature Extraction; Hierarchical Clustering; Information retrieval; Multi-words; Part of Speech Tagger; Web Mining;
Conference_Title :
Recent Trends in Information Technology (ICRTIT), 2014 International Conference on
Conference_Location :
Chennai
DOI :
10.1109/ICRTIT.2014.6996185