• DocumentCode
    2907661
  • Title

    Persian Web Pages Clustering Improvement: Customizing the STC Algorithm

  • Author

    Azadnia, Mohammad ; Rezagholizadeh, Sina ; Yari, Alireza

  • Author_Institution
    Inf. Technol. Dept., Iran Telecommun. Res. Center, Tehran, Iran
  • fYear
    2009
  • fDate
    24-26 Nov. 2009
  • Firstpage
    717
  • Lastpage
    722
  • Abstract
    Today the Internet is found in almost all ethnic groups and cultures, and Web pages are developing very quickly in most countries and in different languages. The size and incoherence of the information available on the Internet have made the use of search engines obvious and necessary. Because search engines pay little attention to the linguistic and content features of documents in different languages and cultures, and use only the similarity of the pages' actual content, they are not very successful at meeting users' needs. For more effective retrieval and clustering of Web pages, search engines should therefore consider the linguistics, contents, characteristics and properties of languages. Moreover, they should develop ways to reduce the complexity of languages and use linguistic features to cluster Web pages more effectively. In this paper, a method for clustering and ranking Persian-language Web pages that takes their contents and linguistic properties into account has been developed. The clustering scheme is based on the STC algorithm, one of the best algorithms for clustering text documents. The main idea of the method is to add pre-processing phases that overcome the complexity of linguistic features in Persian. Open source tools are available for these pre-processing steps, so there is no need to implement them; only some changes to their modules may be needed. These pre-processing steps include extracting phrases, parsing sentences, removing stop words, and adding the terms pointed to by neighbor pages to the collection of phrases. All steps in the method run in linear time and can be applied to large data sets. This means the proposed method is scalable to mass document sources such as the Web.
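    The pre-processing and clustering steps named in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation: the stop-word list, the function names (`preprocess`, `base_clusters`, `merge_clusters`) and the parameters `max_len`, `min_docs` and `threshold` are all assumptions made for illustration. Phrases are enumerated as word n-grams instead of being read from a suffix tree, so the sketch does not have the linear-time behavior the abstract claims, and the step that adds terms from neighbor pages is omitted. Base clusters are merged with the 50% mutual-overlap rule commonly described for Suffix Tree Clustering.

```python
from collections import defaultdict

# Hypothetical, tiny stop-word list; a real Persian list would be much larger.
STOP_WORDS = {"و", "در", "به", "از", "که"}

def preprocess(text, stop_words=STOP_WORDS):
    """Split text into sentences, tokenize on whitespace, drop stop words."""
    sentences = []
    for raw in text.replace("!", ".").replace("؟", ".").split("."):
        tokens = [t for t in raw.split() if t not in stop_words]
        if tokens:
            sentences.append(tokens)
    return sentences

def base_clusters(docs, stop_words=STOP_WORDS, max_len=3, min_docs=2):
    """Map each phrase (word n-gram inside one sentence) to the documents
    that contain it, keeping phrases shared by at least min_docs documents."""
    phrase_docs = defaultdict(set)
    for doc_id, text in docs.items():
        for tokens in preprocess(text, stop_words):
            for i in range(len(tokens)):
                for n in range(1, max_len + 1):
                    if i + n <= len(tokens):
                        phrase_docs[tuple(tokens[i:i + n])].add(doc_id)
    return {p: d for p, d in phrase_docs.items() if len(d) >= min_docs}

def merge_clusters(base, threshold=0.5):
    """Merge base clusters whose document sets overlap by more than
    `threshold` in both directions (connected components via union-find)."""
    phrases = list(base)
    parent = list(range(len(phrases)))

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    for i in range(len(phrases)):
        for j in range(i + 1, len(phrases)):
            a, b = base[phrases[i]], base[phrases[j]]
            common = len(a & b)
            if common / len(a) > threshold and common / len(b) > threshold:
                parent[find(i)] = find(j)

    groups = defaultdict(set)
    for i, phrase in enumerate(phrases):
        groups[find(i)] |= base[phrase]
    return sorted(sorted(g) for g in groups.values())
```

    For example, calling `merge_clusters(base_clusters(docs))` on a dictionary that maps page identifiers to page text returns a sorted list of clusters, each a sorted list of page identifiers.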
  • Keywords
    natural language processing; pattern clustering; search engines; text analysis; Persian language; STC algorithm; Web pages clustering; mass document source; open source tools; search engines; text document clustering; Clustering algorithms; Clustering methods; Content based retrieval; Data mining; Information retrieval; Information technology; Internet; Search engines; Telecommunication computing; Web pages; Clustering; Persian Web; STC Algorithm;
  • fLanguage
    English
  • Publisher
    ieee
  • Conference_Titel
    Computer Sciences and Convergence Information Technology, 2009. ICCIT '09. Fourth International Conference on
  • Conference_Location
    Seoul
  • Print_ISBN
    978-1-4244-5244-6
  • Electronic_ISBN
    978-0-7695-3896-9
  • Type

    conf

  • DOI
    10.1109/ICCIT.2009.295
  • Filename
    5368875