• DocumentCode
    1749707
  • Title
    Improving trigram language modeling with the World Wide Web
  • Author
    Zhu, Xiaojin; Rosenfeld, Ronald
  • Author_Institution
    Sch. of Comput. Sci., Carnegie Mellon Univ., Pittsburgh, PA, USA
  • Volume
    1
  • fYear
    2001
  • fDate
    2001
  • Firstpage
    533
  • Abstract
    We propose a method for using the World Wide Web to acquire trigram estimates for statistical language modeling. We submit an N-gram as a phrase query to Web search engines. The search engines return the number of Web pages containing the phrase, from which the N-gram count is estimated. The N-gram counts are then used to form Web-based trigram probability estimates. We discuss the properties of such estimates, and methods to interpolate them with traditional corpus-based trigram estimates. We show that the interpolated models improve speech recognition word error rate significantly over a small test set.
  • Keywords
    information resources; information retrieval; linguistics; natural languages; probability; search engines; speech recognition; N-gram; Web pages; Web search engines; World Wide Web; interpolated models; phrase query; probability estimates; statistical language modeling; traditional corpus based trigram estimates; trigram estimates; trigram language modeling; word error rate; Computer science; Testing; Training data; Web search; Web sites
  • fLanguage
    English
  • Publisher
    IEEE
  • Conference_Titel
    Acoustics, Speech, and Signal Processing, 2001. Proceedings. (ICASSP '01). 2001 IEEE International Conference on
  • Conference_Location
    Salt Lake City, UT
  • ISSN
    1520-6149
  • Print_ISBN
    0-7803-7041-4
  • Type
    conf
  • DOI
    10.1109/ICASSP.2001.940885
  • Filename
    940885
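
The abstract describes turning phrase-query page counts into trigram probability estimates and interpolating them with corpus-based estimates. A minimal sketch of that idea follows. It rests on assumptions not stated in this record: the estimate is taken to be a plain relative-frequency ratio of counts, the interpolation is taken to be a simple linear mixture with a fixed weight, and the counts are made-up numbers passed in as a dictionary rather than fetched from any real search engine. The names `web_trigram_estimate`, `interpolate`, `lam`, and `hypothetical_counts` are illustrative only.

```python
from typing import Mapping, Tuple

NGram = Tuple[str, ...]


def web_trigram_estimate(counts: Mapping[NGram, int],
                         w1: str, w2: str, w3: str) -> float:
    """Assumed relative-frequency estimate of P(w3 | w1, w2).

    `counts` maps an N-gram to the number of pages reported for it as a
    phrase query (hypothetical input, not a real search engine API).
    """
    history = counts.get((w1, w2), 0)
    if history == 0:
        # No evidence for the history; the caller can rely on the corpus model.
        return 0.0
    ratio = counts.get((w1, w2, w3), 0) / history
    # Reported counts can be noisy, so clamp the ratio to a valid probability.
    return min(1.0, ratio)


def interpolate(p_web: float, p_corpus: float, lam: float) -> float:
    """Linear mixture of the two estimates (an assumed interpolation scheme)."""
    if not 0.0 <= lam <= 1.0:
        raise ValueError("lam must lie in [0, 1]")
    return lam * p_web + (1.0 - lam) * p_corpus


# Made-up counts, for illustration only.
hypothetical_counts = {
    ("world", "wide"): 1000,
    ("world", "wide", "web"): 800,
}

p_web = web_trigram_estimate(hypothetical_counts, "world", "wide", "web")
p_mixed = interpolate(p_web, p_corpus=0.5, lam=0.25)
print(p_web, p_mixed)
```

In a sketch like this the weight `lam` would normally be tuned on held-out data; the paper itself should be consulted for the estimates and interpolation methods the authors actually used.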