  • DocumentCode
    151846
  • Title
    A focused crawler for Romanian words discovery
  • Author
    Radu, Ionut-Gabriel ; Rebedea, Traian
  • Author_Institution
    Fac. of Autom. Control & Comput., Univ. Politeh. of Bucharest, Bucharest, Romania
  • fYear
    2014
  • fDate
    11-13 Sept. 2014
  • Firstpage
    1
  • Lastpage
    6
  • Abstract
    All natural languages change over time, and the increasingly prevalent Web constitutes a major source for identifying language evolution. Although these changes affect all linguistic branches, from phonetics, lexicon and grammar to semantics and pragmatics, we focus only on discovering potential new words that have entered the Romanian lexicon, or alternative forms (lexicalizations) that are frequently used. In this paper we describe the architecture of a system that models the rate of Romanian vocabulary growth based on statistics gathered by a focused web crawler. To validate the proposed system, the paper also presents the main new words it identified in online texts written in Romanian.
  • Keywords
    Internet; grammars; natural language processing; text analysis; vocabulary; Romanian lexicon; Romanian vocabulary growth; Romanian word discovery; Web crawler; grammar; language evolution; linguistic branches; natural languages; online texts; phonetics; pragmatics; semantics; Context; Crawlers; Databases; Markov processes; Pipelines; Text processing; Web pages; Focused Crawling; Language Identification; Neologisms Discovery; Text Processing
  • fLanguage
    English
  • Publisher
    IEEE
  • Conference_Titel
    RoEduNet Conference 13th Edition: Networking in Education and Research Joint Event RENAM 8th Conference, 2014
  • Conference_Location
    Chisinau
  • ISSN
    2068-1038
  • Print_ISBN
    978-1-4799-6860-2
  • Type
    conf
  • DOI
    10.1109/RoEduNet-RENAM.2014.6955323
  • Filename
    6955323