Author_Institution :
Agilex Technol., Chantilly, VA, USA
Abstract :
In many intelligence and security informatics applications, named entities constitute a particularly important element of queries and analytic operations. In such applications, variations in the rendering of entity names present a pervasive problem. The problem is most frequently encountered when dealing with names of persons. For person names, a wide variety of factors may lead to variations: use of nicknames, differences in given name / surname order, misspellings, phonetic renderings, use of different transliteration systems, etc. Historically, a number of methods have been developed for generating possible name variants. Most of these have been based on phonetic similarities, edit distance, or longest common substrings. However, in general, the larger the data collection, the less effective these techniques are. This paper presents an approach to attaining both high precision and high recall for name variant identification in large text collections. The approach exploits the technique of latent semantic indexing (LSI). In this approach, the contextual information provided by LSI allows likely true variants to be selected from multiple candidate variants generated by other techniques. This significantly improves the precision of candidate name variant results. This paper describes a basic LSI-augmented approach to name variant identification, as well as a new approach that yields additional precision improvements.
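The two-stage idea in the abstract (generate candidate variants with a string-similarity technique, then keep only those whose textual context resembles that of the query name) can be sketched as follows. This is a minimal illustration under stated assumptions, not the paper's implementation: the function names (`edit_distance`, `context_vector`, `name_variants`), the thresholds, and the three-sentence toy corpus are all hypothetical, and raw co-occurrence counts stand in for LSI vectors. A real LSI space would be built by applying a truncated singular value decomposition to a term-document matrix.

```python
from collections import Counter
from math import sqrt

# Toy corpus, invented purely for illustration.
DOCS = [
    "qaddafi addressed the libyan parliament in tripoli",
    "gaddafi met libyan officials in tripoli",
    "qadafi opened a bakery near ohio",
]


def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance via the standard two-row dynamic program."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution
        prev = curr
    return prev[-1]


def context_vector(name: str, docs: list[str]) -> Counter:
    """Counts of words co-occurring with `name`.

    Assumption: this bag-of-words vector is a stand-in for the
    name's vector in an LSI space.
    """
    vec = Counter()
    for doc in docs:
        tokens = doc.lower().split()
        if name in tokens:
            vec.update(t for t in tokens if t != name)
    return vec


def cosine(u: Counter, v: Counter) -> float:
    """Cosine similarity of two sparse count vectors."""
    dot = sum(u[t] * v[t] for t in u if t in v)
    norm = (sqrt(sum(x * x for x in u.values()))
            * sqrt(sum(x * x for x in v.values())))
    return dot / norm if norm else 0.0


def name_variants(query, docs, max_edits=2, min_sim=0.3):
    """Return (candidate, similarity) pairs that pass both stages."""
    query = query.lower()
    vocab = {t for doc in docs for t in doc.lower().split()}
    # Stage 1 (recall): any token within `max_edits` of the query.
    candidates = [t for t in vocab
                  if t != query and edit_distance(query, t) <= max_edits]
    # Stage 2 (precision): keep candidates used in similar contexts.
    q_vec = context_vector(query, docs)
    scored = [(c, cosine(q_vec, context_vector(c, docs))) for c in candidates]
    kept = [(c, s) for c, s in scored if s >= min_sim]
    return sorted(kept, key=lambda pair: -pair[1])


if __name__ == "__main__":
    print(name_variants("qaddafi", DOCS))
```

In this toy corpus both "gaddafi" and "qadafi" are within one edit of the query, so string similarity alone would return both. Only "gaddafi" shares context words with the query ("libyan", "in", "tripoli"), so the context filter keeps it and drops "qadafi". The thresholds `max_edits` and `min_sim` are arbitrary here and would need tuning on real data.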
Keywords :
indexing; pattern classification; query processing; security; text analysis; LSI-augmented approach; analytic operations; contextual information; edit distance; entity name rendering; intelligence informatics applications; large data collections; latent semantic indexing; longest common substrings; name variant identification; pervasive problem; phonetic similarities; query element; security informatics applications; Data collection; Indexing; Information management; Large scale integration; Rendering (computer graphics); Semantics; Vectors; Entity matching; LSI; Latent Semantic Indexing; entity resolution; name variants;