Persian Web Pages Clustering Improvement: Customizing the STC Algorithm

Author

Azadnia, Mohammad ; Rezagholizadeh, Sina ; Yari, Alireza

Author_Institution

Inf. Technol. Dept., Iran Telecommun. Res. Center, Tehran, Iran

fYear

2009

fDate

24-26 Nov. 2009

Firstpage

717

Lastpage

722

Abstract

Today the Internet reaches almost all ethnic groups and cultures, and Web pages are growing rapidly in most countries and in many languages. The size and incoherence of the information available on the Internet have made search engines a necessity. Because search engines pay little attention to the linguistic and content features of documents in different languages and cultures, relying only on the raw content similarity of pages does not serve users' needs well. For more effective retrieval and clustering of Web pages, search engines should therefore consider the linguistics, contents, characteristics, and properties of languages. Moreover, they should develop ways to reduce the complexity of languages and use linguistic features to cluster Web pages more effectively. In this paper, a method is developed for clustering and ranking Persian-language Web pages that takes their contents and linguistic properties into account. The clustering scheme is based on the STC algorithm, one of the best algorithms for clustering text documents. The main idea of the method is to add pre-processing phases that overcome the complexity of the linguistic features of Persian. Open-source tools are available for these pre-processing steps, so there is no need to implement them from scratch; only some changes to their modules may be needed. The pre-processing steps include extracting phrases, parsing sentences, removing stop words, and adding the terms pointed to from neighbor pages to the collection of phrases. All steps of the method run in linear time and can be applied to large data sets, so the proposed method scales to mass document sources such as the Web.
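The pipeline named in the abstract (stop-word removal, phrase extraction, phrase-based clustering) can be illustrated with a minimal Python sketch. It is not the authors' implementation. It assumes that STC refers to Suffix Tree Clustering, replaces the suffix tree with a plain enumeration of bounded-length word n-grams, and uses a hypothetical five-word stop list. The function names, the `max_len` and `min_docs` parameters, and the 0.5 overlap threshold are all assumptions made for illustration. The phrase-indexing step is linear in the text length for a fixed `max_len`, while the simple pairwise merge shown here is quadratic in the number of base clusters.

```python
from collections import defaultdict

# Hypothetical, tiny stop-word list; a real system would load a full Persian list.
STOP_WORDS = {"و", "در", "به", "از", "که"}


def phrases(text, max_len=3):
    """Yield contiguous word n-grams (as tuples) after stop-word removal.

    This n-gram enumeration stands in for the suffix tree used by STC.
    """
    words = [w for w in text.split() if w not in STOP_WORDS]
    for i in range(len(words)):
        for j in range(i + 1, min(i + max_len, len(words)) + 1):
            yield tuple(words[i:j])


def base_clusters(docs, min_docs=2, max_len=3):
    """Map each phrase shared by at least min_docs documents to its document ids."""
    index = defaultdict(set)
    for doc_id, text in enumerate(docs):
        for phrase in phrases(text, max_len):
            index[phrase].add(doc_id)
    return {p: ids for p, ids in index.items() if len(ids) >= min_docs}


def merge_clusters(base, threshold=0.5):
    """Merge base clusters whose document sets overlap strongly in both directions.

    Two base clusters are linked when their shared documents exceed
    `threshold` of each cluster's size; linked clusters are unioned
    (connected components via union-find).
    """
    keys = list(base)
    parent = list(range(len(keys)))

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    for a in range(len(keys)):
        for b in range(a + 1, len(keys)):
            docs_a, docs_b = base[keys[a]], base[keys[b]]
            common = len(docs_a & docs_b)
            if common > threshold * len(docs_a) and common > threshold * len(docs_b):
                parent[find(a)] = find(b)

    merged = defaultdict(set)
    for i, key in enumerate(keys):
        merged[find(i)] |= base[key]
    return sorted(sorted(ids) for ids in merged.values())
```

Calling `merge_clusters(base_clusters(docs))` on a list of page texts returns lists of document indices, one list per merged cluster. Terms taken from neighbor pages could be appended to each document's text before indexing, although how the paper does this is not specified in the abstract.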

Keywords

natural language processing; pattern clustering; search engines; text analysis; Persian language; STC algorithm; Web pages clustering; mass document source; open source tools; search engines; text document clustering; Clustering algorithms; Clustering methods; Content based retrieval; Data mining; Information retrieval; Information technology; Internet; Search engines; Telecommunication computing; Web pages; Clustering; Persian Web; STC Algorithm;

fLanguage

English

Publisher

IEEE

Conference_Title

Computer Sciences and Convergence Information Technology, 2009. ICCIT '09. Fourth International Conference on

Conference_Location

Seoul

Print_ISBN

978-1-4244-5244-6

Electronic_ISBN

978-0-7695-3896-9

Type

conf

DOI

10.1109/ICCIT.2009.295

Filename

5368875