Title :
Mining the Web to discover the meanings of an ambiguous word
Author :
Tamir, Raz ; Rapp, Reinhard
Author_Institution :
Hebrew Univ., Jerusalem, Israel
Abstract :
In information retrieval and text mining, information on word senses is usually taken from dictionaries or lexical databases that have been prepared by lexicographers. We propose an automatic method for word sense induction, i.e., for the discovery of a set of sense descriptors for a given ambiguous word. The approach is based on the statistics of word co-occurrence as derived from Web pages. The underlying assumption is that the senses of an ambiguous word are best described by terms that, although strongly associated with this word, are mutually exclusive, i.e., whose association strength with each other within the retrieved Web pages is as weak as possible. Association strength is measured with a novel confidence gain approach that relates the observed co-occurrence frequency of two sense descriptor candidates to an average co-occurrence frequency for pairs of arbitrary words. The proposed approach is fully unsupervised and takes into account the contemporary meanings of words, as reflected in texts from the Internet. Our results are evaluated using a list of ambiguous words commonly referred to in the literature.
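The selection idea described in the abstract can be illustrated with a minimal sketch. The abstract does not give the confidence gain formula, so the ratio used here (observed co-occurrence divided by the average co-occurrence over all candidate pairs) is an assumption made for illustration only, as are the function names, the toy pages, and the use of candidate pairs as a stand-in for "pairs of arbitrary words". It is not the authors' implementation.

```python
from itertools import combinations


def cooccurrence(pages, a, b):
    """Count the pages that contain both term a and term b."""
    return sum(1 for page in pages if a in page and b in page)


def pick_sense_descriptors(pages, candidates):
    """Return the candidate pair with the lowest observed/average ratio.

    ASSUMPTION: this ratio is a simplified stand-in for the paper's
    confidence gain measure, whose exact definition is not in the abstract.
    """
    pairs = list(combinations(sorted(candidates), 2))
    if not pairs:
        raise ValueError("need at least two candidate terms")

    counts = {pair: cooccurrence(pages, *pair) for pair in pairs}
    average = sum(counts.values()) / len(pairs)
    if average == 0:
        # No pair ever co-occurs, so every pair is equally exclusive.
        return pairs[0], 0.0

    gains = {pair: counts[pair] / average for pair in pairs}
    best = min(pairs, key=lambda pair: gains[pair])
    return best, gains[best]


# Hypothetical pages retrieved for the ambiguous word "palm",
# each reduced to the set of candidate terms it contains.
pages = [
    {"tree", "leaf"},
    {"tree", "leaf", "beach"},
    {"hand", "finger"},
    {"hand", "finger", "tree"},
    {"tree", "beach"},
]

pair, gain = pick_sense_descriptors(pages, {"tree", "hand", "leaf"})
print(pair, gain)
```

In this toy data, "hand" and "leaf" never appear on the same page, so they form the most mutually exclusive pair, which matches the intuition that they describe different senses of "palm". A fuller version would also filter candidates by how strongly they are associated with the ambiguous word itself.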
Keywords :
Internet; computational linguistics; data mining; information retrieval; natural languages; text analysis; Web pages; ambiguous word meanings; confidence gain approach; contemporary word meanings; text mining; word co-occurrence; word sense induction; Databases; Dictionaries; Frequency; Gain measurement; Particle measurements; Statistics
Conference_Title :
Data Mining, 2003. ICDM 2003. Third IEEE International Conference on
Print_ISBN :
0-7695-1978-4
DOI :
10.1109/ICDM.2003.1250998