DocumentCode :
3647687
Title :
Turkish document semantic categorization using web-based encyclopedia article association
Author :
Savaş Özkan
Author_Institution :
Electrical - Electronics Engineering, Middle East Technical University, Ankara, Turkey
fYear :
2012
fDate :
7/1/2012 12:00:00 AM
Firstpage :
1
Lastpage :
5
Abstract :
The rapid growth of text, video, and image document sharing on the internet has created a new research field: categorizing these documents automatically. Text document categorization in particular is one of the hardest processing problems in any language. It is even more difficult for Turkish than for any other language, since Turkish is an agglutinative language and has additional spelling rules. In this paper, we develop a Turkish document semantic categorization system that uses web-based encyclopedia article association to enlarge the scope of the words in a text and thereby improve comparison performance against predefined keywords. For word comparison, we define several string matching algorithms used in the literature: naïve matching, Levenshtein distance, Smith-Waterman distance, Jaro-Winkler distance, and the Jaccard index. We also compare the performance of these string matching algorithms on a dataset obtained from three major Turkish news websites, Milliyet, Hurriyet, and Ntvmsnbc, under five semantic categories: politics, sports, economy, magazine, and technology.
Keywords :
"Encyclopedias","Semantics","Internet","Algorithm design and analysis","Electronic publishing","Text categorization"
Publisher :
IEEE
Conference_Titel :
Innovations in Intelligent Systems and Applications (INISTA), 2012 International Symposium on
Print_ISBN :
978-1-4673-1446-6
Type :
conf
DOI :
10.1109/INISTA.2012.6247023
Filename :
6247023