DocumentCode
630108
Title
Use of latent semantic indexing to identify name variants in large data collections
Author
Bradford, R.B.
Author_Institution
Agilex Technol., Chantilly, VA, USA
fYear
2013
fDate
4-7 June 2013
Firstpage
27
Lastpage
32
Abstract
In many intelligence and security informatics applications, named entities constitute a particularly important element of queries and analytic operations. In such applications, variations in the rendering of entity names present a pervasive problem. The problem is most frequently encountered when dealing with names of persons. For person names, a wide variety of factors may lead to variations: use of nicknames, differences in given name / surname order, misspellings, phonetic renderings, use of different transliteration systems, etc. Historically, a number of methods have been developed for generating possible name variants. Most of these have been based on phonetic similarities, edit distance, or longest common substrings. However, in general, the larger the data collection, the less effective these techniques are. This paper presents an approach to attaining both high precision and high recall for name variant identification in large text collections. The approach exploits the technique of latent semantic indexing (LSI). In this approach, the contextual information provided by LSI allows likely true variants to be selected from multiple candidate variants generated by other techniques. This significantly improves the precision of candidate name variant results. This paper describes a basic LSI-augmented approach to name variant identification, as well as a new approach that yields additional precision improvements.
Keywords
indexing; pattern classification; query processing; security; text analysis; LSI-augmented approach; analytic operations; contextual information; edit distance; entity name rendering; intelligence informatics applications; large data collections; latent semantic indexing; longest common substrings; name variant identification; pervasive problem; phonetic similarities; query element; security informatics applications; Data collection; Indexing; Information management; Large scale integration; Rendering (computer graphics); Semantics; Vectors; Entity matching; LSI; Latent Semantic Indexing; entity resolution; name variants;
fLanguage
English
Publisher
IEEE
Conference_Titel
Intelligence and Security Informatics (ISI), 2013 IEEE International Conference on
Conference_Location
Seattle, WA
Print_ISBN
978-1-4673-6214-6
Type
conf
DOI
10.1109/ISI.2013.6578781
Filename
6578781