Title :
Turkish document semantic categorization using web-based encyclopedia article association
Author_Institution :
Electrical - Electronics Engineering, Middle East Technical University, Ankara, Turkey
fDate :
7/1/2012
Abstract :
The rapid growth of text, video and image document sharing on the internet has created a new research field: categorizing these documents automatically. Text document categorization in particular is one of the hardest processing problems in any language. It is more difficult for Turkish text documents than for any other language, since Turkish is an agglutinative language and includes additional spelling rules. In this paper, we develop a Turkish document semantic categorization system that uses web-based encyclopedia article association to enlarge the scope of the words in a text, in order to get better comparison performance against predefined keywords. For word comparison, we describe several string matching algorithms used in the literature: naïve matching, Levenshtein distance, Smith-Waterman distance, Jaro-Winkler distance and the Jaccard index. We also compare the performance of these string matching algorithms on a dataset obtained from three important Turkish newspaper websites, Milliyet, Hurriyet and Ntvmsnbc, under five semantic categories: politics, sports, economy, magazine and technology.
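As a rough illustration of the keyword comparison step, the sketch below implements two of the measures named above, Levenshtein distance and the Jaccard index, and uses the former to score a document against per-category keyword lists. It is not the paper's implementation. The function names, the character-bigram variant of the Jaccard index, the `max_distance` threshold and the sample keyword lists are all assumptions made for this example, and the encyclopedia article association step is not covered. Inputs are assumed to be lower-cased already, since Python's default `str.lower()` does not follow the Turkish dotted/dotless i rules.

```python
def levenshtein(a: str, b: str) -> int:
    """Minimum number of single-character insertions, deletions and substitutions."""
    if len(a) < len(b):
        a, b = b, a  # keep the DP row as short as possible
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def jaccard(a: str, b: str, n: int = 2) -> float:
    """Jaccard index over character n-gram sets (the bigram default is an assumption)."""
    # Strings shorter than n fall back to a single-element set holding the string itself.
    grams_a = {a[i:i + n] for i in range(len(a) - n + 1)} or {a}
    grams_b = {b[i:i + n] for i in range(len(b) - n + 1)} or {b}
    return len(grams_a & grams_b) / len(grams_a | grams_b)


def categorize(words, keywords_by_category, max_distance=1):
    """Return the category whose keywords approximately match the most words.

    A word counts as a match when its Levenshtein distance to any keyword of
    the category is at most max_distance (a hypothetical threshold).
    """
    scores = {category: 0 for category in keywords_by_category}
    for word in words:
        for category, keywords in keywords_by_category.items():
            if any(levenshtein(word, k) <= max_distance for k in keywords):
                scores[category] += 1
    return max(scores, key=scores.get)


# Hypothetical keyword lists, for illustration only.
keywords = {
    "sport": ["maç", "gol"],
    "economy": ["borsa", "faiz"],
}
print(categorize(["maç", "gol", "takım"], keywords))
```

A small edit-distance tolerance lets a suffixed form such as "maçı" still match the keyword "maç", which is one simple way to cope with agglutination. Longer suffix chains would need a larger threshold or a stemming step.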
Keywords :
"Encyclopedias","Semantics","Internet","Algorithm design and analysis","Electronic publishing","Text categorization"
Conference_Title :
Innovations in Intelligent Systems and Applications (INISTA), 2012 International Symposium on
Print_ISBN :
978-1-4673-1446-6
DOI :
10.1109/INISTA.2012.6247023