Title :
Internet evolution and progress in full automatic French language modelling
Author :
Vaufreydaz, Dominique ; Géry, Mathias
Author_Institution :
Lab. CLIPS-IMAG, GEOD and MRIM teams, Grenoble, France
Abstract :
The World Wide Web is the greatest information space ever seen, distributed all over the world, in many languages and on a wide variety of topics. We first describe the evolution of a French subset of this space over the last 3 years. During this time, the amount of text automatically extracted for language modelling grew by a factor of 6.5, and French coverage grew from 140,000 to 200,000 lexical forms. We thus show that more and more reliable data can be obtained to train our trigram models. Recognition experiments on a French "state of the art" evaluation set show that word accuracy increased from 51% to 62.30% between two models computed automatically on Web corpora: the first corpus was gathered at the beginning of 1999 and the last one at the end of 2000.
Keywords :
Internet; learning (artificial intelligence); linguistics; natural languages; speech recognition; World Wide Web; automatic French language modelling; automatically extracted text; spoken language modelling; training sets; trigram models; Crawlers; Data mining; HTML; Robots; Stochastic processes; Web server; Web sites
Conference_Title :
Automatic Speech Recognition and Understanding, 2001. ASRU '01. IEEE Workshop on
Print_ISBN :
0-7803-7343-X
DOI :
10.1109/ASRU.2001.1034662