DocumentCode
2666126
Title
Unsupervised incremental acquisition of a thematic corpus from the Web
Author
Duclaye, Florence ; Yvon, F. ; Collin, Olivier
Author_Institution
France Telecom R&D, Lannion, France
fYear
2003
fDate
26-29 Oct. 2003
Firstpage
752
Lastpage
757
Abstract
We present a nearly unsupervised learning methodology for automatically acquiring a thematic corpus from the Web. Relying on a bootstrapping mechanism, our system starts with a single linguistic expression of a given target semantic relationship. It then samples the Web so as to progressively accumulate a corpus of potential examples of the same relationship. Sampling steps alternate with filtering steps, making it possible to keep the corpus thematically focused. The corpus is finally analysed to search for potential paraphrases of the initial expression of the semantic relationship. These paraphrases will eventually be used to improve our question-answering system. We focus on the learning aspect of the system and report experimental results regarding the effectiveness of our filtering strategy.
Keywords
Internet; linguistics; unsupervised learning; Web; automatic classification; bootstrap mechanism; linguistic expression; machine learning; machine-aided translation; paraphrase acquisition; question-answering system; thematic corpus; Automata; Conferences; Filtering; Inference algorithms; Research and development; Telecommunications; Thesauri; Training data
fLanguage
English
Publisher
IEEE
Conference_Titel
Natural Language Processing and Knowledge Engineering, 2003. Proceedings. 2003 International Conference on
Conference_Location
Beijing, China
Print_ISBN
0-7803-7902-0
Type
conf
DOI
10.1109/NLPKE.2003.1276006
Filename
1276006