Title of article :
SPOKEN TURKISH CORPUS IN ITS PRESENT FORM: A TECHNICAL AND STATISTICAL ANALYSIS
Author/Authors :
Acar, Güneş, Katholieke Universiteit Leuven - Department of Electronics Engineering, Belgium
Abstract :
The primary goal of this article is to explain the technologies and workflows used to build the METU Spoken Turkish Corpus (STC), which was pioneered by the late Prof. Dr. Şükriye Ruhi. The web-based Corpus Management System, which is crucial to building the STC, provides a set of workflows, data formats and export options that make it easy to transcribe, check and publish corpus data. The Corpus Management System was developed by the STC project members in the Python programming language, and it enables remote project members with different roles to collaborate through an online interface. Within the STC, recordings amounting to 286,391 words have been transcribed and checked; in addition, recordings amounting to 79,189 words have been prepared for publication. The article presents general statistics about the recordings in the STC and discusses what remains to be done to publish a large-scale version of the STC.
Keywords :
Spoken corpus, corpus management system, EXMARaLDA
Journal title :
Journal Of Linguistics and Literature