• DocumentCode
    2816050
  • Title
    Internet archive as a source of bilingual dictionary
  • Author
    Fattah, M.A. ; Ren, Fuji ; Shingo, Kuroiwa
  • Author_Institution
    Fac. of Eng., Tokushima Univ., Japan
  • Volume
    2
  • fYear
    2004
  • fDate
    5-7 April 2004
  • Firstpage
    298
  • Abstract
    A parallel corpus is a very important tool for building a good machine translation system or for conducting natural language processing research on cross-language information retrieval. The Internet archive is a good source of parallel documents in different languages. To construct a good parallel corpus from the Internet archive, a bilingual dictionary containing word pairs that may not exist in commercial dictionaries is a must. Extracting a bilingual dictionary from Internet parallel documents is therefore important for adding words that are absent from traditional dictionaries. This paper describes two algorithms that automatically extract an English/Arabic bilingual dictionary from parallel texts in the Internet archive. The system should preferably be useful for many different language pairs. As with most such systems, the accuracy of our system is directly proportional to the number of sentence pairs used. By controlling the system parameters, we could achieve 100% precision for the output bilingual dictionary, at the cost of a smaller dictionary.
  • Keywords
    Internet; dictionaries; language translation; linguistics; natural languages; thesauri; Internet archive; bilingual dictionary; cross language information retrieval; machine translation system; natural language processing; parallel corpus; parallel documents; Automatic control; Control systems; Data mining; Dictionaries; Information retrieval; Internet; Natural language processing; Natural languages; Size control; Thesauri;
  • fLanguage
    English
  • Publisher
    IEEE
  • Conference_Titel
    Information Technology: Coding and Computing, 2004. Proceedings. ITCC 2004. International Conference on
  • Print_ISBN
    0-7695-2108-8
  • Type
    conf
  • DOI
    10.1109/ITCC.2004.1286650
  • Filename
    1286650