• DocumentCode
    2666126
  • Title

    Unsupervised incremental acquisition of a thematic corpus from the Web

  • Author

    Duclaye, Florence ; Yvon, F. ; Collin, Olivier

  • Author_Institution
    France Telecom R&D, Lannion, France
  • fYear
    2003
  • fDate
    26-29 Oct. 2003
  • Firstpage
    752
  • Lastpage
    757
  • Abstract
    We present a nearly unsupervised learning methodology for automatically acquiring a thematic corpus from the Web. Relying on a bootstrapping mechanism, our system starts with a single linguistic expression of a given target semantic relationship. It then samples the Web so as to progressively accumulate a corpus of potential examples of the same relationship. Sampling steps alternate with filtering steps, making it possible to keep the corpus thematically focused. The corpus is finally analysed to search for potential paraphrases of the initial expression of the semantic relationship. These paraphrases will eventually be used to improve our question-answering system. We focus on the learning aspect of the system and report experimental results regarding the effectiveness of our filtering strategy.
  • Keywords
    Internet; linguistics; unsupervised learning; Web; automatic classification; bootstrap mechanism; linguistic expression; machine learning; machine-aided translation; paraphrase acquisition; question-answering system; thematic corpus; Automata; Conferences; Filtering; Inference algorithms; Research and development; Telecommunications; Thesauri; Training data
  • fLanguage
    English
  • Publisher
    IEEE
  • Conference_Titel
    Natural Language Processing and Knowledge Engineering, 2003. Proceedings. 2003 International Conference on
  • Conference_Location
    Beijing, China
  • Print_ISBN
    0-7803-7902-0
  • Type

    conf

  • DOI
    10.1109/NLPKE.2003.1276006
  • Filename
    1276006