DocumentCode :
2302114
Title :
Measurement of Turkish word semantic similarity and text categorization application
Author :
Amasyali, M.F. ; Beken, Aytunç
Author_Institution :
Bilgisayar Muhendisligi Bolumu, Yildiz Teknik Univ., Istanbul, Turkey
fYear :
2009
fDate :
9-11 April 2009
Firstpage :
1
Lastpage :
4
Abstract :
In the literature, texts to be classified are generally represented in a high-dimensional bag-of-words space in which every dimension corresponds to a word or n-gram. In this study, the words are first placed in a semantic space. Computing a word's coordinates in the semantic space requires a measure of how similar words are in meaning. Harris states that the semantic similarity of two words is related to the number of documents in which both words occur. We applied this hypothesis to Turkish words. First, we obtained a word co-occurrence matrix from a Web corpus. Then, the numerical coordinates of the words were calculated using multidimensional scaling. Text coordinates are obtained from the coordinates of the words that occur in the texts. In our experiments, Turkish news texts are classified into 5 classes. We obtain more successful results than with the traditional bag-of-words space. Our approach is not limited to Turkish words and texts; it can also be applied to other languages.
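The pipeline described in the abstract (document co-occurrence counts, multidimensional scaling, then text coordinates derived from word coordinates) can be illustrated with a minimal Python sketch. It is not the authors' implementation, and several parts are assumptions: the abstract does not give the exact similarity formula, so a Jaccard-style ratio is used here; it does not say how word coordinates are combined, so a plain average is assumed; and classical (Torgerson) MDS with power iteration stands in for whatever MDS variant the paper used. All function names and the toy corpus are hypothetical.

```python
from collections import Counter
from itertools import combinations
from math import sqrt


def cooccurrence_counts(documents):
    """Count documents per word and documents shared by each word pair."""
    doc_freq, pair_freq = Counter(), Counter()
    for text in documents:
        words = sorted(set(text.lower().split()))
        doc_freq.update(words)
        pair_freq.update(combinations(words, 2))  # keys are alphabetically ordered pairs
    return doc_freq, pair_freq


def dissimilarity_matrix(vocab, doc_freq, pair_freq):
    """Assumed measure: 1 - (shared documents / documents containing either word)."""
    n = len(vocab)
    dist = [[0.0] * n for _ in range(n)]
    for i, j in combinations(range(n), 2):
        a, b = sorted((vocab[i], vocab[j]))
        both = pair_freq[(a, b)]
        either = doc_freq[a] + doc_freq[b] - both
        dist[i][j] = dist[j][i] = 1.0 - (both / either if either else 0.0)
    return dist


def classical_mds(dist, dims=2, iters=500):
    """Classical MDS: double-center squared distances, keep top eigenvectors.

    Eigenvectors are found by power iteration with deflation, a simplification
    that assumes the leading eigenvalues are positive and well separated.
    Axes that cannot be extracted are left at zero.
    """
    n = len(dist)
    sq = [[d * d for d in row] for row in dist]
    row_mean = [sum(row) / n for row in sq]
    grand_mean = sum(row_mean) / n
    b = [[-0.5 * (sq[i][j] - row_mean[i] - row_mean[j] + grand_mean)
          for j in range(n)] for i in range(n)]
    coords = [[0.0] * dims for _ in range(n)]
    for k in range(dims):
        v = [float(i + 1) for i in range(n)]  # deterministic start vector
        for _ in range(iters):
            w = [sum(b[i][j] * v[j] for j in range(n)) for i in range(n)]
            norm = sqrt(sum(x * x for x in w))
            if norm < 1e-12:
                break
            v = [x / norm for x in w]
        vv = sum(x * x for x in v)
        bv = [sum(b[i][j] * v[j] for j in range(n)) for i in range(n)]
        lam = sum(v[i] * bv[i] for i in range(n)) / vv  # Rayleigh quotient
        if lam <= 1e-12:
            break  # no further positive eigenvalue found
        scale = sqrt(lam / vv)
        for i in range(n):
            coords[i][k] = scale * v[i]
        # Deflate so the next pass finds the next eigenvector.
        b = [[b[i][j] - lam * v[i] * v[j] / vv for j in range(n)] for i in range(n)]
    return coords


def text_coordinates(text, word_coords):
    """Assumed aggregation: average the coordinates of the known words in a text."""
    known = [word_coords[w] for w in text.lower().split() if w in word_coords]
    if not known:
        return None
    dims = len(known[0])
    return [sum(c[d] for c in known) / len(known) for d in range(dims)]


# Hypothetical toy corpus standing in for the Web corpus.
corpus = ["futbol mac gol", "mac gol takim", "borsa dolar faiz", "dolar faiz banka"]
doc_freq, pair_freq = cooccurrence_counts(corpus)
vocab = sorted(doc_freq)
word_xy = dict(zip(vocab, classical_mds(dissimilarity_matrix(vocab, doc_freq, pair_freq))))
print(text_coordinates("gol takim mac", word_xy))
```

The resulting low-dimensional text vectors would then be passed to any standard classifier in place of bag-of-words vectors; that step is omitted here because the abstract does not name the classifier.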
Keywords :
Internet; natural language processing; pattern classification; text analysis; Turkish news text; Turkish word semantic similarity measurement; Web corpus; bag-of-words space; co-occurrence matrix; multidimensional scaling; text classification; text document categorization; Testing; Text categorization
fLanguage :
English
Publisher :
IEEE
Conference_Titel :
Signal Processing and Communications Applications Conference, 2009. SIU 2009. IEEE 17th
Conference_Location :
Antalya
Print_ISBN :
978-1-4244-4435-9
Electronic_ISBN :
978-1-4244-4436-6
Type :
conf
DOI :
10.1109/SIU.2009.5136317
Filename :
5136317