• DocumentCode
    2112027
  • Title
    Measuring domain similarity for statistical machine translation
  • Author
    Lin Liu ; Hailong Cao ; Tiejun Zhao
  • Author_Institution
    MOE-MS Key Lab. of Natural Language Process. & Speech, Harbin Inst. of Technol., Harbin, China
  • fYear
    2013
  • fDate
    23-25 July 2013
  • Firstpage
    611
  • Lastpage
    615
  • Abstract
    It is well known that statistical machine translation (SMT) performance suffers when a model is applied to out-of-domain data. It is also known that the more similar the test domain is to the training domain, the more effective the training data are for SMT performance. Hence, measuring domain similarity is an important step in selecting appropriate training data. The most widely used method combines the cosine similarity function with word frequency. The lack of exploration of other approaches motivates us to propose and compare several similarity measures. Aiming for better SMT performance, we compared 10 similarity measures, each a combination of one of 2 feature representations and one of 5 similarity functions. The results show that using relative word frequency as the feature representation and skew divergence as the similarity function performs best among the 10 measures and outperforms random data selection.
  • Keywords
    language translation; cosine similarity function; domain similarity measurement; feature representations; relative word frequency; similarity functions; skew divergence; statistical machine translation; test domain; training domain; Adaptation models; Business; Data models; Frequency measurement; Training; Training data; Transportation; domain adaptation; domain similarity; statistical machine translation (SMT)
  • fLanguage
    English
  • Publisher
    IEEE
  • Conference_Titel
    Fuzzy Systems and Knowledge Discovery (FSKD), 2013 10th International Conference on
  • Conference_Location
    Shenyang
  • Type
    conf
  • DOI
    10.1109/FSKD.2013.6816269
  • Filename
    6816269
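
As a rough illustration of the measures named in the abstract, the sketch below computes relative word frequencies for two token lists and compares them with cosine similarity and with skew divergence. This is not the authors' implementation: the function names, the whitespace tokenization in the usage note, the value alpha = 0.99, and the particular form of skew divergence (the KL divergence of p from the mixture alpha*q + (1 - alpha)*p) are all assumptions, since the record does not specify them.

```python
import math
from collections import Counter


def relative_frequencies(tokens):
    """Map each word to its count divided by the total token count."""
    counts = Counter(tokens)
    total = sum(counts.values())
    return {word: count / total for word, count in counts.items()}


def cosine_similarity(p, q):
    """Cosine of the angle between two sparse frequency vectors."""
    dot = sum(value * q.get(word, 0.0) for word, value in p.items())
    norm_p = math.sqrt(sum(value * value for value in p.values()))
    norm_q = math.sqrt(sum(value * value for value in q.values()))
    if norm_p == 0.0 or norm_q == 0.0:
        return 0.0
    return dot / (norm_p * norm_q)


def skew_divergence(p, q, alpha=0.99):
    """KL(p || alpha*q + (1 - alpha)*p); lower means more similar.

    Assumed definition and assumed alpha. Mixing a little of p into q
    keeps the divergence finite when q lacks a word that p contains.
    Requires 0 <= alpha < 1.
    """
    total = 0.0
    for word, p_w in p.items():
        mixture = alpha * q.get(word, 0.0) + (1.0 - alpha) * p_w
        total += p_w * math.log(p_w / mixture)
    return total
```

For example, `skew_divergence(relative_frequencies(test_text.split()), relative_frequencies(train_text.split()))` would score a hypothetical training corpus `train_text` against a hypothetical test set `test_text`. Note that cosine similarity grows with similarity while skew divergence shrinks, so candidate corpora would be ranked in opposite directions under the two measures.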