Internet evolution and progress in full automatic French language modelling

Author

Vaufreydaz, Dominique ; Géry, Mathias

Author_Institution

Lab. CLIPS-IMAG, equipe GEOD et MRIM, Grenoble, France

fYear

2001

fDate

2001

Firstpage

363

Lastpage

366

Abstract

The World Wide Web is the greatest information space ever seen: it is distributed all over the world, in many languages, and covers a wide variety of topics. We first describe the evolution of a French subset of this space over the last 3 years. During this time, the amount of text automatically extracted for language modelling grew by a factor of 6.5, and French lexical coverage grew from 140,000 to 200,000 lexical forms. We thus show that we can obtain more and more reliable data to train our trigram models. Recognition experiments on a French "state of the art" evaluation set show that word accuracy increased from 51% to 62.30% using two different models automatically computed from Web corpora. The first corpus was gathered at the beginning of 1999 and the last one at the end of 2000.

Keywords

Internet; learning (artificial intelligence); linguistics; natural languages; speech recognition; World Wide Web; automatic French language modelling; automatically extracted text; spoken language modelling; training sets; trigram models; Crawlers; Data mining; HTML; Robots; Stochastic processes; Web server; Web sites

fLanguage

English

Publisher

IEEE

Conference_Titel

Automatic Speech Recognition and Understanding, 2001. ASRU '01. IEEE Workshop on

Print_ISBN

0-7803-7343-X

Type

conf

DOI

10.1109/ASRU.2001.1034662

Filename

1034662