Title :
Novelty detection for text documents using named entity recognition
Author :
Ng, Kok Wah ; Tsai, Flora S. ; Chen, Lihui ; Goh, Kiat Chong
Author_Institution :
Nanyang Technol. Univ., Singapore
Abstract :
To identify novel information in raw text documents, a novelty detection recommender system was developed to explore the method of comparing various types of entities within sentences. We first detected novel sentences using named entity recognition to extract the entity types of person, place, time, and organization. In addition, part-of-speech tagging was performed to tag each word in the documents, allowing the syntactic categories of noun, verb, and adjective to be used for comparisons. WordNet, an English lexical database of concepts and relations, was also incorporated to generate synonyms for the entities and parts of speech as well as to determine the similarity of sentences. The novelty score of each sentence was determined using two different metrics, UniqueComparison and ImportanceValue. UniqueComparison calculated the number of matched entities, whereas ImportanceValue took into account the total weight of matched words that coexisted in both the test and history sentences. The results are promising when compared with the benchmark scores for the Text Retrieval Conference's (TREC) Novelty Track 2004. This demonstrates that the combination of named entity recognition and part-of-speech tagging is capable of detecting novelty with good results.
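The two metrics named in the abstract can be illustrated with a minimal sketch. It assumes that each sentence has already been reduced to a set of terms (entities, tagged words, and any WordNet synonyms), which the abstract describes but does not specify in detail. The function names, the `weights` table, the default weight, and the thresholded novelty rule in `is_novel` are all hypothetical and are not taken from the paper.

```python
from typing import Dict, List, Set


def unique_comparison(test_terms: Set[str], history_terms: Set[str]) -> int:
    """Count the distinct terms shared by the test and history sentences."""
    return len(test_terms & history_terms)


def importance_value(test_terms: Set[str], history_terms: Set[str],
                     weights: Dict[str, float],
                     default_weight: float = 1.0) -> float:
    """Sum the weights of the terms that occur in both sentences.

    The weighting scheme is an assumption; terms missing from `weights`
    fall back to `default_weight`.
    """
    shared = test_terms & history_terms
    return sum(weights.get(term, default_weight) for term in shared)


def is_novel(test_terms: Set[str], history: List[Set[str]],
             weights: Dict[str, float], threshold: float) -> bool:
    """Assumed decision rule: a sentence is novel when its overlap with
    every history sentence stays below `threshold`."""
    return all(importance_value(test_terms, past, weights) < threshold
               for past in history)


# Hypothetical pre-extracted term sets and weights, for illustration only.
test_sentence = {"singapore", "ntu", "2007"}
history_sentences = [{"singapore", "conference"},
                     {"ntu", "singapore", "paper"}]
term_weights = {"singapore": 2.0, "ntu": 3.0}

matches = unique_comparison(test_sentence, history_sentences[1])
overlap = importance_value(test_sentence, history_sentences[1], term_weights)
print(matches, overlap)
```

In this sketch a higher overlap with the history means a less novel sentence, so the threshold acts on matched weight rather than on a novelty score directly; the paper may define the score and its cutoff differently.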
Keywords :
computational linguistics; information filtering; knowledge acquisition; text analysis; English lexical database; WordNet; detection recommender system; named entity recognition; part-of-speech tagging; text document detection; Data mining; Databases; Equations; Frequency; Information processing; Recommender systems; Speech; Tagging; Testing; Text recognition; data mining; novelty detection; recommender system; text mining
Conference_Title :
Information, Communications & Signal Processing, 2007 6th International Conference on
Conference_Location :
Singapore
Print_ISBN :
978-1-4244-0982-2
Electronic_ISBN :
978-1-4244-0983-9
DOI :
10.1109/ICICS.2007.4449883