DocumentCode :
630108
Title :
Use of latent semantic indexing to identify name variants in large data collections
Author :
Bradford, R.B.
Author_Institution :
Agilex Technol., Chantilly, VA, USA
fYear :
2013
fDate :
4-7 June 2013
Firstpage :
27
Lastpage :
32
Abstract :
In many intelligence and security informatics applications, named entities constitute a particularly important element of queries and analytic operations. In such applications, variations in the rendering of entity names present a pervasive problem. The problem is most frequently encountered when dealing with names of persons. For person names, a wide variety of factors may lead to variations: use of nicknames, differences in given name / surname order, misspellings, phonetic renderings, use of different transliteration systems, etc. Historically, a number of methods have been developed for generating possible name variants. Most of these have been based on phonetic similarities, edit distance, or longest common substrings. However, in general, the larger the data collection, the less effective these techniques are. This paper presents an approach to attaining both high precision and high recall for name variant identification in large text collections. The approach exploits the technique of latent semantic indexing (LSI). In this approach, the contextual information provided by LSI allows likely true variants to be selected from multiple candidate variants generated by other techniques. This significantly improves the precision of candidate name variant results. This paper describes a basic LSI-augmented approach to name variant identification, as well as a new approach that yields additional precision improvements.
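The two-stage idea in the abstract (generate candidate variants with a string-based technique, then keep only those whose context is similar in the LSI space) can be illustrated with a minimal sketch. This is not the paper's implementation. It assumes that LSI vectors for each name have already been computed elsewhere; the vectors in `toy_vectors`, the function names, and the thresholds `max_edits` and `min_cosine` are all hypothetical values chosen for illustration.

```python
import math


def edit_distance(a, b):
    """Levenshtein distance between two strings (row-by-row dynamic programming)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                # deletion
                           cur[j - 1] + 1,             # insertion
                           prev[j - 1] + (ca != cb)))  # substitution or match
        prev = cur
    return prev[-1]


def cosine(u, v):
    """Cosine similarity between two equal-length vectors; 0.0 if either is all zeros."""
    dot = sum(x * y for x, y in zip(u, v))
    norm_u = math.sqrt(sum(x * x for x in u))
    norm_v = math.sqrt(sum(y * y for y in v))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def name_variants(query, vectors, max_edits=2, min_cosine=0.8):
    """Return (name, similarity) pairs, most similar first.

    Stage 1: candidates are names within `max_edits` edits of `query`.
    Stage 2: candidates are kept only if their (assumed, precomputed) LSI
    vector is close to the query's vector, i.e. the names occur in similar
    contexts.
    """
    query_vec = vectors[query]
    candidates = [name for name in vectors
                  if name != query and edit_distance(query, name) <= max_edits]
    scored = [(name, cosine(query_vec, vectors[name])) for name in candidates]
    kept = [(name, score) for name, score in scored if score >= min_cosine]
    return sorted(kept, key=lambda pair: pair[1], reverse=True)


# Hypothetical 3-dimensional "LSI" vectors, made up for this example only.
toy_vectors = {
    "mohammed": [0.90, 0.10, 0.00],
    "mohamed":  [0.85, 0.15, 0.05],
    "mohammad": [0.88, 0.12, 0.02],
    "mohaned":  [0.05, 0.10, 0.95],  # close in spelling, different context
}

if __name__ == "__main__":
    for name, score in name_variants("mohammed", toy_vectors):
        print(name, round(score, 3))
```

In this toy data, "mohaned" passes the edit-distance stage but is rejected by the context stage, which is the precision gain the abstract attributes to LSI. In a real system the vectors would come from a truncated SVD of a term-document matrix built over the collection.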
Keywords :
indexing; pattern classification; query processing; security; text analysis; LSI-augmented approach; analytic operations; contextual information; edit distance; entity name rendering; intelligence informatics applications; large data collections; latent semantic indexing; longest common substrings; name variant identification; pervasive problem; phonetic similarities; query element; security informatics applications; Data collection; Indexing; Information management; Large scale integration; Rendering (computer graphics); Semantics; Vectors; Entity matching; LSI; Latent Semantic Indexing; entity resolution; name variants;
fLanguage :
English
Publisher :
IEEE
Conference_Title :
Intelligence and Security Informatics (ISI), 2013 IEEE International Conference on
Conference_Location :
Seattle, WA
Print_ISBN :
978-1-4673-6214-6
Type :
conf
DOI :
10.1109/ISI.2013.6578781
Filename :
6578781