DocumentCode :
1674256
Title :
Keywords Similarity Based Topic Identification for Indonesian News Documents
Author :
Fuddoly, Aini ; Jaafar, Jafreezal ; Zamin, Norshuhani
Author_Institution :
Dept. of Comput. & Inf. Sci., Univ. Teknol. PETRONAS, Tronoh, Malaysia
fYear :
2013
Firstpage :
14
Lastpage :
20
Abstract :
Topic identification (TID) is a technique for labelling a set of textual documents with a meaningful label that represents their content. TID for online news presents different problems from TID for other corpora, such as large data volumes and frequently updated topics. Moreover, the number of methods developed for Indonesian corpora is rather small. Bracewell's algorithm has been proven effective in identifying topics in English and Japanese corpora with high accuracy. This paper implements a method for TID based on Bracewell's keywords similarity algorithm and top-n keywords selection for Indonesian news documents. The top-n method is used to improve the performance of Bracewell's algorithm on an Indonesian corpus and to reduce the dimensionality of the dataset during training. The combination aims to reduce the heavy computational load and to allow for newly emerging topics that may need to be created. The method consists of two stages: training and classification. It learns the keywords of the training dataset, then calculates the similarity between the keywords of the testing and training articles. The algorithm achieved accuracy as high as 95.22% in the online environment and 95.26% in the offline environment, 84% against human evaluation, and an average computation time of 2.96 seconds.
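The two stages named in the abstract (training on keywords, then classifying by keyword similarity) can be illustrated with a minimal Python sketch. It is not the authors' implementation and not Bracewell's actual formula: the abstract gives neither, so term-frequency ranking for the top-n selection, cosine similarity as the similarity measure, the `threshold` for flagging a possible new topic, and all function names and sample sentences are assumptions made for illustration only. No stop-word removal or stemming is applied.

```python
from collections import Counter
import math
import re


def top_n_keywords(text, n=5):
    """Return the n most frequent word tokens and their counts (assumed selection rule)."""
    tokens = re.findall(r"[a-z]+", text.lower())
    return dict(Counter(tokens).most_common(n))


def cosine(a, b):
    """Cosine similarity between two keyword-weight mappings (assumed similarity measure)."""
    dot = sum(weight * b.get(word, 0) for word, weight in a.items())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def train(labelled_articles, n=5):
    """Training stage: merge the top-n keywords of each article into one profile per topic."""
    profiles = {}
    for topic, text in labelled_articles:
        profiles.setdefault(topic, Counter()).update(top_n_keywords(text, n))
    return profiles


def classify(text, profiles, n=5, threshold=0.1):
    """Classification stage: pick the most similar topic profile.

    Returns (None, score) when no profile reaches the hypothetical threshold,
    which a caller could treat as a candidate for a new topic.
    """
    keywords = top_n_keywords(text, n)
    best_topic, best_score = None, 0.0
    for topic, profile in profiles.items():
        score = cosine(keywords, profile)
        if score > best_score:
            best_topic, best_score = topic, score
    if best_score < threshold:
        return None, best_score
    return best_topic, best_score


# Tiny made-up training set; the sentences are illustrative, not from the paper's corpus.
profiles = train([
    ("olahraga", "tim sepak bola menang pertandingan liga sepak bola"),
    ("ekonomi", "harga saham naik pasar saham rupiah menguat"),
])
topic, score = classify("pertandingan sepak bola liga malam ini", profiles)
unknown_topic, unknown_score = classify("cuaca hujan badai", profiles)
```

Keeping only the top-n keywords per article bounds the size of every vector being compared, which is the dimensionality reduction the abstract attributes to the top-n step.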
Keywords :
electronic publishing; information retrieval; natural language processing; pattern classification; text analysis; Bracewell's keywords similarity algorithm; Bracewell's performance; English corpora; Indonesian corpus; Indonesian news documents; Japanese corpora; TID; article keyword testing; classification; keyword similarity based topic identification; online news; textual documents; top-n keywords selection method; training dataset; Accuracy; Clustering algorithms; Databases; Equations; Mathematical model; Training; Vectors; Bracewell Algorithm; Indonesian text documents; information retrieval; news domain; topic identification;
fLanguage :
English
Publisher :
IEEE
Conference_Titel :
Modelling Symposium (EMS), 2013 European
Conference_Location :
Manchester
Print_ISBN :
978-1-4799-2577-3
Type :
conf
DOI :
10.1109/EMS.2013.3
Filename :
6779815