DocumentCode
151846
Title
A focused crawler for Romanian words discovery
Author
Radu, Ionut-Gabriel ; Rebedea, Traian
Author_Institution
Fac. of Autom. Control & Comput., Univ. Politeh. of Bucharest, Bucharest, Romania
fYear
2014
fDate
11-13 Sept. 2014
Firstpage
1
Lastpage
6
Abstract
All natural languages change over time, and as the Web becomes more prevalent it constitutes a major source for identifying language evolution. Although these changes affect all linguistic branches, from phonetics, lexicon and grammar to semantics and pragmatics, we focus only on discovering potential new words that have entered the Romanian lexicon or alternative forms (lexicalizations) that are frequently used. In this paper we describe the architecture of a system that models the rate of Romanian vocabulary growth based on statistics gathered by a focused web crawler. To validate the proposed system, the paper also presents the main new words it identified in online texts written in Romanian.
Keywords
Internet; grammars; natural language processing; text analysis; vocabulary; Romanian lexicon; Romanian vocabulary growth; Romanian word discovery; Web crawler; grammar; language evolution; linguistic branches; natural languages; online texts; phonetics; pragmatics; semantics; Context; Crawlers; Databases; Markov processes; Pipelines; Text processing; Web pages; Focused Crawling; Language Identification; Neologisms Discovery; Text Processing;
fLanguage
English
Publisher
IEEE
Conference_Titel
RoEduNet Conference 13th Edition: Networking in Education and Research Joint Event RENAM 8th Conference, 2014
Conference_Location
Chisinau
ISSN
2068-1038
Print_ISBN
978-1-4799-6860-2
Type
conf
DOI
10.1109/RoEduNet-RENAM.2014.6955323
Filename
6955323