DocumentCode :
151846
Title :
A focused crawler for Romanian words discovery
Author :
Radu, Ionut-Gabriel ; Rebedea, Traian
Author_Institution :
Fac. of Autom. Control & Comput., Univ. Politeh. of Bucharest, Bucharest, Romania
fYear :
2014
fDate :
11-13 Sept. 2014
Firstpage :
1
Lastpage :
6
Abstract :
As all natural languages change over time and the Web becomes more prevalent, the Web also constitutes a major source for identifying language evolution. Although these changes affect all linguistic branches, ranging from phonetics, lexicon and grammar to semantics and pragmatics, we focus only on discovering potential new words that have entered the Romanian lexicon, or alternative forms (lexicalizations) that are frequently used. In this paper we describe the architecture of a system that models the rate of Romanian vocabulary growth based on statistics gathered by a focused web crawler. To validate the proposed system, the paper also presents the main new words it identified in online texts written in Romanian.
Keywords :
Internet; grammars; natural language processing; text analysis; vocabulary; Romanian lexicon; Romanian vocabulary growth; Romanian word discovery; Web crawler; grammar; language evolution; linguistic branches; natural languages; online texts; phonetics; pragmatics; semantics; Context; Crawlers; Databases; Markov processes; Pipelines; Text processing; Web pages; Focused Crawling; Language Identification; Neologisms Discovery; Text Processing
fLanguage :
English
Publisher :
IEEE
Conference_Titel :
RoEduNet Conference 13th Edition: Networking in Education and Research Joint Event RENAM 8th Conference, 2014
Conference_Location :
Chisinau
ISSN :
2068-1038
Print_ISBN :
978-1-4799-6860-2
Type :
conf
DOI :
10.1109/RoEduNet-RENAM.2014.6955323
Filename :
6955323