DocumentCode :
2907661
Title :
Persian Web Pages Clustering Improvement: Customizing the STC Algorithm
Author :
Azadnia, Mohammad ; Rezagholizadeh, Sina ; Yari, Alireza
Author_Institution :
Inf. Technol. Dept., Iran Telecommun. Res. Center, Tehran, Iran
fYear :
2009
fDate :
24-26 Nov. 2009
Firstpage :
717
Lastpage :
722
Abstract :
Today the Internet is found in almost all ethnic groups and cultures, and Web pages are growing very quickly in most countries and in many languages. The size and incoherence of the information available on the Internet have made search engines a necessity. Because search engines pay little attention to the linguistic and content features of documents in different languages and cultures, relying only on the content similarity of pages to meet users' needs is not very successful. For more effective retrieval and clustering of Web pages, search engines should therefore consider the linguistics, content, characteristics, and properties of each language. Moreover, they should develop ways to reduce language complexity and use linguistic features to cluster Web pages more effectively. This paper develops a method for clustering and ranking Persian-language Web pages that takes their content and linguistic properties into account. The clustering scheme is based on the STC algorithm, one of the best algorithms for clustering text documents. The main idea of the method is a set of pre-processing phases that overcome the complexity of linguistic features in Persian. Open source tools are available for these pre-processing steps, so there is no need to implement them from scratch; only some changes to their modules may be needed. The pre-processing steps include extracting phrases, parsing sentences, removing stop words, and adding terms from neighboring linked pages to the collection of phrases. All steps of the method run in linear time and can be applied to large data sets. The proposed method is therefore scalable to mass document sources such as the Web.
Keywords :
natural language processing; pattern clustering; search engines; text analysis; Persian language; STC algorithm; Web pages clustering; mass document source; open source tools; search engines; text document clustering; Clustering algorithms; Clustering methods; Content based retrieval; Data mining; Information retrieval; Information technology; Internet; Search engines; Telecommunication computing; Web pages; Clustering; Persian Web; STC Algorithm;
fLanguage :
English
Publisher :
IEEE
Conference_Titel :
Computer Sciences and Convergence Information Technology, 2009. ICCIT '09. Fourth International Conference on
Conference_Location :
Seoul
Print_ISBN :
978-1-4244-5244-6
Electronic_ISBN :
978-0-7695-3896-9
Type :
conf
DOI :
10.1109/ICCIT.2009.295
Filename :
5368875