• DocumentCode
    3060424
  • Title

    Web-based parallel corpora for statistical machine translation

  • Author

    Li, Bo ; Liu, Juan ; Shi, Wenjuan

  • Author_Institution
    Wuhan Univ., Wuhan
  • fYear
    2007
  • fDate
    13-15 Dec. 2007
  • Firstpage
    444
  • Lastpage
    449
  • Abstract
    Statistical machine translation is the state-of-the-art technique based on sentence-level aligned parallel corpora. The improvement of this kind of technique is constrained by the lack of parallel corpora publicly available. The booming of the World Wide Web stands a fair chance that we can construct parallel corpora in a big scale more easily. In this paper, we summarize the current strategies fetching parallel corpora from the Web and classify them into three classes: the structure-based, the content-based and the hybrid. We compare these approaches and bring out some ideas that may be useful for improving the performance of the algorithms. In the discussion section, we put forward some problems that should be considered in future research.
  • Keywords
    Internet; language translation; statistical analysis; Web-based parallel corpora; World Wide Web; content-based strategy; statistical machine translation; structure-based strategy; Application software; Computer science; Feeds; Law; Machine learning; Surface-mount technology; Uniform resource locators; Web pages; Web sites
  • fLanguage
    English
  • Publisher
    IEEE
  • Conference_Titel
    Machine Learning and Applications, 2007. ICMLA 2007. Sixth International Conference on
  • Conference_Location
    Cincinnati, OH
  • Print_ISBN
    978-0-7695-3069-7
  • Type

    conf

  • DOI
    10.1109/ICMLA.2007.24
  • Filename
    4457270