مرکز منطقه ای اطلاع رساني علوم و فناوري - Comparative study of clustering algorithms and abstract representations for software remodularisation

Abstract :

As valuable software systems become older, reverse engineering becomes increasingly important to companies that have to maintain the code. Clustering is a key activity in reverse engineering that is used to discover improved designs of systems or to extract significant concepts from code. Clustering is an old, highly sophisticated, activity which offers many methods to meet different needs. The various methods have been well documented in the past; however, conclusions from general clustering literature may not apply entirely to the reverse engineering domain. In the paper, the authors study three decisions that need to be made when clustering: the choice of (i) abstract descriptions of the entities to be clustered, (ii) metrics to compute coupling between the entities, and (iii) clustering algorithms. For each decision, our objective is to understand which choices are best when performing software remodularisation. The experiments were conducted on three public domain systems (gcc, Linux and Mosaic) and a real world legacy system (2 million LOC). Among other things, the authors confirm the importance of a proper description scheme for the entities being clustered, list a few effective coupling metrics and characterise the quality of different clustering algorithms. They also propose description schemes not directly based on the source code, and advocate better formal evaluation methods for the clustering results.