DocumentCode
2280479
Title
Internet evolution and progress in full automatic French language modelling
Author
Vaufreydaz, Dominique ; Géry, Mathias
Author_Institution
Lab. CLIPS-IMAG, equipe GEOD et MRIM, Grenoble, France
fYear
2001
fDate
2001
Firstpage
363
Lastpage
366
Abstract
The World Wide Web is the largest information space ever seen: it is distributed all over the world, in many languages, and covers a wide variety of topics. We first describe the evolution of a French subset of this space over the last 3 years. During this time, the amount of text automatically extracted for language modelling grew by a factor of 6.5, and coverage of French grew from 140,000 to 200,000 lexical forms. We thus show that increasingly reliable data can be obtained to train our trigram models. Recognition experiments on a French "state of the art" evaluation set show that word accuracy increased from 51% to 62.30% using two different models automatically computed on Web corpora, the first gathered at the beginning of 1999 and the last at the end of 2000.
Keywords
Internet; learning (artificial intelligence); linguistics; natural languages; speech recognition; World Wide Web; automatic French language modelling; automatically extracted text; spoken language modelling; training sets; trigram models; Crawlers; Data mining; HTML; Robots; Stochastic processes; Web server; Web sites
fLanguage
English
Publisher
IEEE
Conference_Titel
Automatic Speech Recognition and Understanding, 2001. ASRU '01. IEEE Workshop on
Print_ISBN
0-7803-7343-X
Type
conf
DOI
10.1109/ASRU.2001.1034662
Filename
1034662