DocumentCode
145316
Title
Combining Modern Machine Translation Software with LSI for Cross-Lingual Information Processing
Author
Bradford, Russell ; Pozniak, John
Author_Institution
Agilex Technol. Inc., Chantilly, VA, USA
fYear
2014
fDate
7-9 April 2014
Firstpage
65
Lastpage
72
Abstract
The growing internationalization of business and social interactions poses significant challenges in implementing multilingual information systems. For applications requiring retrieval, clustering, and categorization of multilingual document collections, cross-lingual application of latent semantic indexing (LSI) has a number of characteristics that make it potentially attractive. However, this technique is dependent upon the availability of applicable parallel corpora. Historically, such corpora have been quite limited in size and scope. In this paper, we provide new results regarding implementation of cross-lingual LSI text processing systems employing parallel corpora produced using modern machine translation (MT) products. We present measurements using the Reuters 21578 test set to demonstrate three key points regarding this combined LSI/modern MT approach: (1) for some languages, this approach can create parallel corpora of sufficient fidelity to support effective multilingual and cross-lingual LSI applications, (2) the technique is not particularly sensitive to details of LSI parameters, and (3) multiple languages can be represented in a single LSI space with little degradation in performance.
Keywords
indexing; information systems; language translation; text analysis; MT products; Reuters 21578 test; business internationalization; cross-lingual LSI applications; cross-lingual LSI text processing systems; cross-lingual information processing; latent semantic indexing; machine translation software; multilingual document collection categorization; multilingual document collection clustering; multilingual document collection retrieval; multilingual information systems; parallel corpora; social interactions; Abstracts; Large scale integration; Matrix decomposition; Semantics; Standards; Training; Vectors; cross-lingual; machine translation; multilingual
fLanguage
English
Publisher
IEEE
Conference_Titel
Information Technology: New Generations (ITNG), 2014 11th International Conference on
Conference_Location
Las Vegas, NV
Print_ISBN
978-1-4799-3187-3
Type
conf
DOI
10.1109/ITNG.2014.52
Filename
6822177