• DocumentCode
    145316
  • Title
    Combining Modern Machine Translation Software with LSI for Cross-Lingual Information Processing
  • Author
    Bradford, Russell ; Pozniak, John
  • Author_Institution
    Agilex Technol. Inc., Chantilly, VA, USA
  • fYear
    2014
  • fDate
    7-9 April 2014
  • Firstpage
    65
  • Lastpage
    72
  • Abstract
    The growing internationalization of business and social interactions poses significant challenges in implementing multilingual information systems. For applications requiring retrieval, clustering, and categorization of multilingual document collections, cross-lingual application of latent semantic indexing (LSI) has a number of characteristics that make it potentially attractive. However, this technique is dependent upon the availability of applicable parallel corpora. Historically, such corpora have been quite limited in size and scope. In this paper, we provide new results regarding implementation of cross-lingual LSI text processing systems employing parallel corpora produced using modern machine translation (MT) products. We present measurements using the Reuters 21578 test set to demonstrate three key points regarding this combined LSI/modern MT approach: (1) for some languages, this approach can create parallel corpora of sufficient fidelity to support effective multilingual and cross-lingual LSI applications, (2) the technique is not particularly sensitive to details of LSI parameters, and (3) multiple languages can be represented in a single LSI space with little degradation in performance.
  • Keywords
    indexing; information systems; language translation; text analysis; MT products; Reuters 21578 test; business internationalization; cross-lingual LSI applications; cross-lingual LSI text processing systems; cross-lingual information processing; latent semantic indexing; machine translation software; multilingual document collection categorization; multilingual document collection clustering; multilingual document collection retrieval; multilingual information systems; parallel corpora; social interactions; Abstracts; Large scale integration; Matrix decomposition; Semantics; Standards; Training; Vectors; cross-lingual; latent semantic indexing; machine translation; multilingual; parallel corpora;
  • fLanguage
    English
  • Publisher
    IEEE
  • Conference_Titel
    Information Technology: New Generations (ITNG), 2014 11th International Conference on
  • Conference_Location
    Las Vegas, NV
  • Print_ISBN
    978-1-4799-3187-3
  • Type
    conf
  • DOI
    10.1109/ITNG.2014.52
  • Filename
    6822177