Regional Information Center for Science and Technology (RICEST) - Turkish word n-gram analyzing algorithms for a large scale Turkish corpus

DocumentCode :

2815805

Title :

Turkish word n-gram analyzing algorithms for a large scale Turkish corpus - TurCo

Author :

Çebi, Yalçin ; Dalkiliç, Gökhan

Author_Institution :

Dept. of Comput. Eng., Dokuz Eylul Univ., Izmir, Turkey

Volume :

fYear :

2004

fDate :

5-7 April 2004

Firstpage :

236

Abstract :

To calculate statistical properties of a language, one first needs a sample of that language; such a sample is called a corpus. An unbalanced, large-scale Turkish text corpus (TurCo) of ∼362 MB and more than 50 million words was prepared from 12 different sources, including Web sites and novels in Turkish. Different algorithms were tested to obtain the n-gram (1≤n≤5) values. The efficiency of each algorithm was examined by applying it to each piece of the corpus, one piece at a time. Detailed results are given only for the two algorithms that do not use database tables, because all the other algorithms needed more than one day to run, which made those tests impractical.
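
As an illustration of what obtaining word n-gram values for 1≤n≤5 involves, the following is a minimal Python sketch that counts n-grams in memory, one corpus piece at a time, without database tables. It is not the authors' algorithm, which this record does not describe. The names `tokenize` and `count_ngrams`, the punctuation set, and the choice not to let n-grams span piece boundaries are all assumptions made for the example.

```python
from collections import Counter
from typing import Dict, Iterable, List, Tuple

# Assumption: a small, illustrative punctuation set; a real corpus needs more care.
PUNCTUATION = '.,;:!?"()[]'


def tokenize(text: str) -> List[str]:
    """Split on whitespace and strip surrounding punctuation.

    Case is left untouched on purpose: Turkish dotted/dotless i needs
    locale-aware case folding, which this sketch does not attempt.
    """
    words = []
    for raw in text.split():
        word = raw.strip(PUNCTUATION)
        if word:
            words.append(word)
    return words


def count_ngrams(pieces: Iterable[str], max_n: int = 5) -> Dict[int, Counter]:
    """Count word n-grams for every n in 1..max_n, one piece at a time.

    Assumption: n-grams never span two pieces of the corpus.
    """
    counts: Dict[int, Counter] = {n: Counter() for n in range(1, max_n + 1)}
    for piece in pieces:
        words = tokenize(piece)
        for n in range(1, max_n + 1):
            # A piece with fewer than n words contributes no n-grams.
            for i in range(len(words) - n + 1):
                ngram: Tuple[str, ...] = tuple(words[i:i + n])
                counts[n][ngram] += 1
    return counts


if __name__ == "__main__":
    # Hypothetical toy input standing in for corpus pieces.
    sample_pieces = ["bir iki bir iki üç", "bir iki"]
    result = count_ngrams(sample_pieces)
    for n in sorted(result):
        print(n, result[n].most_common(3))
```

Holding every count in memory keeps the sketch short, but at the scale described above (more than 50 million words) the higher-order tables would grow large, so a practical run would likely need to write partial counts out per piece and merge them afterwards.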

Keywords :

dictionaries; linguistics; natural languages; text analysis; Turkish text corpus; language statistical properties; programming language; word n-gram analyzing algorithm; Algorithm design and analysis; Assembly; Books; Computer science; Databases; Error correction; Large-scale systems; Natural languages; Telephony; Testing

fLanguage :

English

Publisher :

IEEE

Conference_Title :

Information Technology: Coding and Computing, 2004. Proceedings. ITCC 2004. International Conference on

Print_ISBN :

0-7695-2108-8

Type :

conf

DOI :

10.1109/ITCC.2004.1286638

Filename :

1286638

Link To Document :

https://search.ricest.ac.ir/dl/search/defaultta.aspx?DTC=49&DC=2815805