• DocumentCode
    2280479
  • Title
    Internet evolution and progress in full automatic French language modelling
  • Author
    Vaufreydaz, Dominique ; Géry, Mathias
  • Author_Institution
    Lab. CLIPS-IMAG, equipe GEOD et MRIM, Grenoble, France
  • fYear
    2001
  • fDate
    2001
  • Firstpage
    363
  • Lastpage
    366
  • Abstract
    The World Wide Web is the largest information space ever seen, distributed all over the world, in many languages and on a wide variety of topics. We first describe the evolution of a French subset of this space over the last 3 years. During this time, the amount of automatically extracted text for language modelling has been multiplied by 6.5. Moreover, French coverage has grown from 140,000 to 200,000 lexical forms. We thus show that we can obtain more and more reliable data to train our trigram models. Recognition experiments on a French "state of the art" evaluation set show that word accuracy increased from 51% to 62.30% using two different models automatically computed on Web corpora. The first corpus was gathered at the beginning of 1999 and the last one at the end of 2000.
  • Keywords
    Internet; learning (artificial intelligence); linguistics; natural languages; speech recognition; World Wide Web; automatic French language modelling; automatically extracted text; spoken language modelling; training sets; trigram models; Crawlers; Data mining; HTML; Robots; Stochastic processes; Web server; Web sites
  • fLanguage
    English
  • Publisher
    IEEE
  • Conference_Titel
    Automatic Speech Recognition and Understanding, 2001. ASRU '01. IEEE Workshop on
  • Print_ISBN
    0-7803-7343-X
  • Type
    conf
  • DOI
    10.1109/ASRU.2001.1034662
  • Filename
    1034662