• DocumentCode
    607664
  • Title

    Text similarity analysis using IR lists

  • Author

    Metin, S.K. ; Kisla, T. ; Karaoglan, Bahar

  • Author_Institution
    Yazilim Muhendisligi Bolumu, Izmir Ekonomi Univ., Izmir, Turkey
  • fYear
    2013
  • fDate
    24-26 April 2013
  • Firstpage
    1
  • Lastpage
    4
  • Abstract
    Natural language processing can be seen as a signal processing problem when the characters, syllables, words, and punctuation marks in a text are considered as signals. In this article, we present a novel approach that detects text similarity in Turkish, based on the similarity of the lists of documents retrieved when the texts are given as queries to web search engines. The similarities between the URLs contained in the items of the returned lists are measured using statistical methods such as Euclidean, city-block, Chebyshev, cosine, correlation, Spearman, and Hamming distances. For the experiments, a corpus of 150 news articles was developed by gathering news on 50 different topics from 3 Turkish newspapers published during a certain time slot. News articles on the same topic published in different newspapers are considered similar texts. The statistical methods are applied to the resulting news × terms matrix, and for each news article the other articles are ranked from most similar to least similar. If at least one of the top two matches an article manually marked as similar, it is counted as a success. Experimental results show that the cosine and correlation distances give the best performance, with 84% precision.
  • Keywords
    information retrieval; natural language processing; search engines; signal processing; statistical analysis; text analysis; IR lists; Turkish newspapers; URL; Web search engines; natural language processing; retrieved documents; signal processing problem; similar texts; statistical methods; text similarity analysis; Computational linguistics; Correlation; Natural language processing; Semantics; Signal processing; Statistical analysis; Uniform resource locators; signal information; similarity methods; statistical signal processing; web based text similarity;
  • fLanguage
    English
  • Publisher
    IEEE
  • Conference_Titel
    Signal Processing and Communications Applications Conference (SIU), 2013 21st
  • Conference_Location
    Haspolat
  • Print_ISBN
    978-1-4673-5562-9
  • Electronic_ISBN
    978-1-4673-5561-2
  • Type

    conf

  • DOI
    10.1109/SIU.2013.6531310
  • Filename
    6531310
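
The ranking and success criterion described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: it assumes each text is represented as a binary vector over the union of URLs returned for all queries (the abstract does not specify how the vectors are built), it uses only the cosine distance, and the function names, document ids, and URLs are all hypothetical.

```python
import math


def url_vectors(result_lists):
    """Map each text id to a binary vector over the union of returned URLs.

    Assumption: binary URL incidence; the abstract does not state the weighting.
    """
    vocab = sorted({u for urls in result_lists.values() for u in urls})
    return {
        doc: [1.0 if u in set(urls) else 0.0 for u in vocab]
        for doc, urls in result_lists.items()
    }


def cosine_distance(a, b):
    """Return 1 - cosine similarity; 1.0 if either vector is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 1.0
    return 1.0 - dot / (norm_a * norm_b)


def rank_similar(doc, vectors):
    """Rank all other texts from most to least similar (ties broken by id)."""
    others = [d for d in vectors if d != doc]
    return sorted(
        others, key=lambda d: (cosine_distance(vectors[doc], vectors[d]), d)
    )


def top2_success(doc, vectors, gold):
    """True if at least one of the two top-ranked texts is marked similar."""
    return any(d in gold[doc] for d in rank_similar(doc, vectors)[:2])


# Hypothetical search results: text id -> URLs returned for that text as a query.
results = {
    "a1": ["u1", "u2", "u3"],
    "a2": ["u2", "u3", "u4"],
    "b1": ["u5", "u6"],
    "b2": ["u6", "u7", "u1"],
}
# Hypothetical manual annotation: texts on the same topic.
gold = {"a1": {"a2"}, "a2": {"a1"}, "b1": {"b2"}, "b2": {"b1"}}

vectors = url_vectors(results)
hits = sum(top2_success(doc, vectors, gold) for doc in results)
print(f"top-2 success rate: {hits / len(results):.2f}")
```

Swapping `cosine_distance` for another measure named in the abstract (for example city-block or Chebyshev) only requires changing the key function in `rank_similar`.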