• DocumentCode
    480700
  • Title
    Enriching Multilingual Language Resources by Discovering Missing Cross-Language Links in Wikipedia
  • Author
    Oh, Jong-Hoon ; Kawahara, Daisuke ; Uchimoto, Kiyotaka ; Kazama, Jun Ichi ; Torisawa, Kentaro
  • Author_Institution
    National Institute of Information and Communications Technology (NICT), Seika
  • Volume
    1
  • fYear
    2008
  • fDate
    9-12 Dec. 2008
  • Firstpage
    322
  • Lastpage
    328
  • Abstract
    We present a novel method for discovering missing cross-language links between English and Japanese Wikipedia articles. We first collect candidate missing cross-language links, that is, pairs of English and Japanese Wikipedia articles that could be connected by a cross-language link. We then select the correct cross-language links among the candidates by using a classifier trained with various types of features. Our method has three desirable characteristics for discovering missing links. First, it discovers cross-language links with high accuracy (92% precision at 78% recall). Second, the features used in the classifier are language-independent. Third, the features are generated from resources automatically obtained from Wikipedia, without relying on any external knowledge. In this work, we discover approximately $10^5$ missing cross-language links in Wikipedia, almost two-thirds as many as the cross-language links that already exist in Wikipedia.
  • Keywords
    Web sites; classification; natural language processing; English Wikipedia articles; Japanese Wikipedia articles; cross-language links; multilingual language resources; Communications technology; Dictionaries; Encyclopedias; Information retrieval; Intelligent agent; Natural languages; Statistics; Wikipedia; Language Resource; Web mining
  • fLanguage
    English
  • Publisher
    IEEE
  • Conference_Titel
    IEEE/WIC/ACM International Conference on Web Intelligence and Intelligent Agent Technology, 2008 (WI-IAT '08)
  • Conference_Location
    Sydney, NSW
  • Print_ISBN
    978-0-7695-3496-1
  • Type
    conf
  • DOI
    10.1109/WIIAT.2008.317
  • Filename
    4740467