DocumentCode
2058919
Title
Clustering Algorithms and Latent Semantic Indexing to Identify Similar Pages in Web Applications
Author
De Lucia, Andrea ; Risi, Michele ; Tortora, Genoveffa ; Scanniello, Giuseppe
Author_Institution
Univ. of Salerno, Fisciano
fYear
2007
fDate
5-6 Oct. 2007
Firstpage
65
Lastpage
72
Abstract
In this paper, we analyze several clustering algorithms that have been widely employed to support the comprehension of Web applications. To this end, we define an approach to identify static pages that are duplicated or cloned at the content level. The approach is based on a process that first computes the dissimilarity between Web pages using latent semantic indexing, a well-known information retrieval technique, and then groups similar pages using clustering algorithms. We consider five instances of this process, each based on three variants of the agglomerative hierarchical clustering algorithm, a divisive clustering algorithm, the k-means partitional clustering algorithm, and a widely employed partitional competitive clustering algorithm, namely Winner Takes All. To assess the proposed approach, we used the static pages of three Web applications and one static Web site.
Keywords
Web sites; indexing; information retrieval; semantic Web; Web pages; agglomerative hierarchical clustering algorithm; divisive clustering algorithm; information retrieval technique; k-means partitional clustering algorithm; latent semantic indexing; partitional competitive clustering algorithm; static Web site; Algorithm design and analysis; Application software; Clustering algorithms; Indexing; Information retrieval; Navigation; Partitioning algorithms; Software systems; Time to market; Web pages
fLanguage
English
Publisher
IEEE
Conference_Titel
Web Site Evolution, 2007. WSE 2007. 9th IEEE International Workshop on
Conference_Location
Paris
Print_ISBN
978-1-4244-1450-5
Type
conf
DOI
10.1109/WSE.2007.4380246
Filename
4380246