• DocumentCode
    1296218
  • Title

    Cross-Lingual Document Representation and Semantic Similarity Measure: A Fuzzy Set and Rough Set Based Approach

  • Author

    Huang, Hsun-Hui ; Kuo, Yau-Hwang

  • Author_Institution
    Dept. of Comput. Sci. & Inf. Eng., Nat. Cheng Kung Univ., Tainan, Taiwan
  • Volume
    18
  • Issue
    6
  • fYear
    2010
  • Firstpage
    1098
  • Lastpage
    1111
  • Abstract
    As cross-lingual information retrieval is attracting increasing attention, tools that measure cross-lingual semantic similarity between documents are becoming desirable. In this paper, two aspects of cross-lingual semantic document similarity measures are investigated: one is document representation, and the other is the formulation of similarity measures. Fuzzy set and rough set theories are applied to capture the inherently fuzzy relationships among concepts expressed by natural languages. Our approach first develops a language-independent sense-level document representation based on the fuzzy set model to reduce the barrier between different languages and further explores the fuzzy-rough hybrid approach to obtain a more robust macrosense-level document representation through the partitioning of the integrated sense association network of the document collection into macrosenses. Then, Tversky's notion of similarity and the F1 measure in information retrieval are adopted to formulate, respectively, two document similarity measures with fuzzy set operations on the two proposed document representations. The effectiveness of our approach is demonstrated by its success rate in identifying the English translations of their corresponding Chinese documents in a collection of Chinese-English parallel documents. Moreover, the proposed approach can be easily extended to process documents in other languages. It is believed that the proposed representations, along with the similarity measures, will enable more effective text mining processes.
  • Keywords
    document handling; fuzzy set theory; information retrieval; natural language processing; rough set theory; Chinese document; Chinese-English parallel document; cross-lingual document representation; cross-lingual information retrieval; cross-lingual semantic document similarity measure; fuzzy set model; fuzzy set operation; integrated sense association network; language-independent sense-level document representation; natural language; text mining process; Approximation methods; Computational modeling; Correlation; Fuzzy sets; Hidden Markov models; Information retrieval; Intelligent systems; Internet; Natural languages; Pragmatics; Robustness; Search engines; Semantics; Set theory; Text mining; Web pages; Cross-lingual; document representation; fuzzy–rough hybrid; sense association network; similarity measure
  • fLanguage
    English
  • Journal_Title
    Fuzzy Systems, IEEE Transactions on
  • Publisher
    IEEE
  • ISSN
    1063-6706
  • Type

    jour

  • DOI
    10.1109/TFUZZ.2010.2065811
  • Filename
    5549886
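
The abstract mentions two similarity measures built from fuzzy set operations, one following Tversky's notion of similarity and one following the F1 measure. The record does not give the paper's formulas, so the following is only a minimal sketch of the generic textbook forms, not the authors' formulation. It assumes a document is a fuzzy set mapping sense identifiers to membership degrees in [0, 1], with min as intersection, bounded difference as set difference, and the sum of memberships as cardinality. The names FuzzySet, tversky_similarity, fuzzy_f1, and the sense identifiers are hypothetical.

```python
from typing import Dict

# Hypothetical representation: sense identifier -> membership degree in [0, 1].
FuzzySet = Dict[str, float]


def cardinality(a: FuzzySet) -> float:
    """Sigma-count: the sum of all membership degrees."""
    return sum(a.values())


def intersection(a: FuzzySet, b: FuzzySet) -> FuzzySet:
    """Standard fuzzy intersection: element-wise minimum over shared keys."""
    return {k: min(a[k], b[k]) for k in a.keys() & b.keys()}


def difference(a: FuzzySet, b: FuzzySet) -> FuzzySet:
    """Bounded difference (an assumed choice): max(a - b, 0) per element."""
    return {k: max(v - b.get(k, 0.0), 0.0) for k, v in a.items()}


def tversky_similarity(a: FuzzySet, b: FuzzySet,
                       alpha: float = 0.5, beta: float = 0.5) -> float:
    """Tversky ratio model: common / (common + alpha*|A-B| + beta*|B-A|)."""
    common = cardinality(intersection(a, b))
    denom = (common
             + alpha * cardinality(difference(a, b))
             + beta * cardinality(difference(b, a)))
    return common / denom if denom > 0 else 0.0


def fuzzy_f1(a: FuzzySet, b: FuzzySet) -> float:
    """F1-style measure: harmonic mean of |A∩B|/|A| and |A∩B|/|B|."""
    common = cardinality(intersection(a, b))
    total = cardinality(a) + cardinality(b)
    return 2.0 * common / total if total > 0 else 0.0


# Made-up sense memberships for an English document and a Chinese document.
doc_en = {"sense_1": 0.9, "sense_2": 0.4, "sense_3": 0.2}
doc_zh = {"sense_1": 0.8, "sense_2": 0.5, "sense_4": 0.3}

print(round(tversky_similarity(doc_en, doc_zh), 4))
print(round(fuzzy_f1(doc_en, doc_zh), 4))
```

Under these particular assumptions, the Tversky ratio with alpha = beta = 0.5 coincides with the F1-style measure, because the bounded difference satisfies |A-B| = |A| - |A∩B|. Other weights make the measure asymmetric, which can matter when one document is much longer than the other.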