• DocumentCode
    630108
  • Title
    Use of latent semantic indexing to identify name variants in large data collections
  • Author
    Bradford, R.B.
  • Author_Institution
    Agilex Technol., Chantilly, VA, USA
  • fYear
    2013
  • fDate
    4-7 June 2013
  • Firstpage
    27
  • Lastpage
    32
  • Abstract
    In many intelligence and security informatics applications, named entities constitute a particularly important element of queries and analytic operations. In such applications, variations in the rendering of entity names present a pervasive problem. The problem is most frequently encountered when dealing with names of persons. For person names, a wide variety of factors may lead to variations: use of nicknames, differences in given name / surname order, misspellings, phonetic renderings, use of different transliteration systems, etc. Historically, a number of methods have been developed for generating possible name variants. Most of these have been based on phonetic similarities, edit distance, or longest common substrings. However, in general, the larger the data collection, the less effective these techniques are. This paper presents an approach to attaining both high precision and high recall for name variant identification in large text collections. The approach exploits the technique of latent semantic indexing (LSI). In this approach, the contextual information provided by LSI allows likely true variants to be selected from multiple candidate variants generated by other techniques. This significantly improves the precision of candidate name variant results. This paper describes a basic LSI-augmented approach to name variant identification, as well as a new approach that yields additional precision improvements.
  • Keywords
    indexing; pattern classification; query processing; security; text analysis; LSI-augmented approach; analytic operations; contextual information; edit distance; entity name rendering; intelligence informatics applications; large data collections; latent semantic indexing; longest common substrings; name variant identification; pervasive problem; phonetic similarities; query element; security informatics applications; Data collection; Indexing; Information management; Large scale integration; Rendering (computer graphics); Semantics; Vectors; Entity matching; LSI; Latent Semantic Indexing; entity resolution; name variants;
  • fLanguage
    English
  • Publisher
    IEEE
  • Conference_Title
    Intelligence and Security Informatics (ISI), 2013 IEEE International Conference on
  • Conference_Location
    Seattle, WA
  • Print_ISBN
    978-1-4673-6214-6
  • Type
    conf
  • DOI
    10.1109/ISI.2013.6578781
  • Filename
    6578781
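
The two-stage idea described in the abstract (generate candidate variants with a string-similarity technique, then use LSI context to keep the likely true ones) can be illustrated with a minimal sketch. This is not the paper's implementation: the toy corpus, the raw term-count weighting, the rank `k=2`, the thresholds `max_edits` and `min_cosine`, and all function names are assumptions chosen only for illustration.

```python
import numpy as np

def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance via the standard dynamic-programming recurrence."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            curr.append(min(prev[j] + 1,                # deletion
                            curr[j - 1] + 1,            # insertion
                            prev[j - 1] + (ca != cb)))  # substitution
        prev = curr
    return prev[-1]

def lsi_term_vectors(docs, k):
    """Build a raw-count term-document matrix and return rank-k LSI term vectors."""
    vocab = sorted({t for d in docs for t in d.split()})
    index = {t: i for i, t in enumerate(vocab)}
    A = np.zeros((len(vocab), len(docs)))
    for j, d in enumerate(docs):
        for t in d.split():
            A[index[t], j] += 1.0
    U, s, _ = np.linalg.svd(A, full_matrices=False)
    # Each term is represented by its row of U_k scaled by the singular values.
    return {t: U[i, :k] * s[:k] for t, i in index.items()}

def name_variants(query, names, vectors, max_edits=2, min_cosine=0.5):
    """Keep edit-distance candidates whose LSI context resembles the query's."""
    q = vectors[query]
    kept = []
    for name in names:
        if name == query or edit_distance(query, name) > max_edits:
            continue  # not a candidate according to the string-based stage
        v = vectors[name]
        cos = float(q @ v / (np.linalg.norm(q) * np.linalg.norm(v)))
        if cos >= min_cosine:
            kept.append(name)
    return sorted(kept)

# Hypothetical toy corpus: three spellings of one name share a context,
# while a similarly spelled but different name appears in another context.
DOCS = [
    "gaddafi libya tripoli regime",
    "qaddafi libya tripoli regime",
    "kaddafi libya tripoli colonel",
    "qaddumi plo palestine tunis",
    "qaddumi plo palestine statement",
]
NAMES = ["gaddafi", "qaddafi", "kaddafi", "qaddumi"]

vectors = lsi_term_vectors(DOCS, k=2)
candidates = sorted(n for n in NAMES
                    if n != "qaddafi" and edit_distance("qaddafi", n) <= 2)
variants = name_variants("qaddafi", NAMES, vectors)
print("edit-distance candidates:", candidates)
print("after LSI filtering:", variants)
```

In this toy setting the string-based stage alone admits a similarly spelled but unrelated name, and the context comparison in the LSI space is what removes it. A realistic collection would likely need a weighting scheme, a much higher rank, and tuned thresholds.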