DocumentCode
270122
Title
Turkish labeled text corpus
Author
Öztürk, Seçil ; Sankur, B. ; Güngör, Tunga ; Yilmaz, Mustafa Berkay ; Köroğlu, Bilge ; Ağin, Onur ; İşbilen, Mustafa ; Ulaş, Çağdaş ; Ahat, Mehmet
Author_Institution
Elektr. Elektron., Muhendisligi Bolumleri, Bogazici Univ., Istanbul, Turkey
fYear
2014
fDate
23-25 April 2014
Firstpage
1395
Lastpage
1398
Abstract
A labeled text corpus made up of the titles, abstracts, and keywords of Turkish papers is collected. The corpus covers 35 different disciplines, with 200 documents per discipline. This study presents the collection and content of the corpus. The classification performance of Term Frequency - Inverse Document Frequency (TF-IDF) features and of Latent Dirichlet Allocation (LDA) topic-probability features is compared on the corpus. The corpus is shared as open source so that it can be used for natural language processing applications with academic purposes.
Keywords
natural language processing; pattern classification; probability; text analysis; LDA features; TF-IDF; Turkish labeled text corpus; Turkish paper abstracts; Turkish paper keywords; Turkish paper titles; academic purposes; classification performance; latent Dirichlet allocation features; natural language processing applications; term frequency-inverse document frequency; text corpus collection; text corpus content; topic probabilities; Abstracts; Conferences; Natural language processing; Resource management; Signal processing; Support vector machines; XML; Classification; Corpus; Inverse Document Frequency; Latent Dirichlet Allocation; NLP; Natural Language Processing; Paper; TF-IDF; Term Frequency; Turkish
fLanguage
English
Publisher
IEEE
Conference_Titel
Signal Processing and Communications Applications Conference (SIU), 2014 22nd
Conference_Location
Trabzon
Type
conf
DOI
10.1109/SIU.2014.6830499
Filename
6830499