• DocumentCode
    470027
  • Title
    Discovering interchangeable words from string databases
  • Author
    Alvarez, Marco A.; Lim, SeungJin
  • Author_Institution
    Dept. of Comput. Sci., Utah State Univ., Logan, UT
  • Volume
    1
  • fYear
    2007
  • fDate
    28-31 Oct. 2007
  • Firstpage
    25
  • Lastpage
    30
  • Abstract
    This paper presents a solution to the problem of finding interchangeable words in the context of an input collection of strings. Interchangeable words are words that can be substituted for one another in phrases or free text without changing the meaning. Under restricted conditions, pairs of interchangeable words can be useful for data deduplication, copy detection, and software localization, among other tasks. Computing the degree of interchangeability requires an accurate calculation of the semantic similarity between pairs of words and a search for candidate pairs in the search space imposed by the input collection. The solution presented in this paper consists of a search method for candidate pairs based on the Levenshtein distance algorithm and a novel algorithm, SSA, for calculating the semantic similarity between words. The proposed solution was implemented and tested in a real-world application involving a string message database from a software development company. The system was used to build an ontology with clusters of interchangeable words.
  • Keywords
    database management systems; word processing; Levenshtein distance algorithm; copy detection; data deduplication; interchangeable words; semantic similarity; software localization; string databases; string message database; Application software; Clustering algorithms; Computer science; Databases; Educational institutions; Marine animals; Ontologies; Programming; Search methods; Software testing
  • fLanguage
    English
  • Publisher
    IEEE
  • Conference_Titel
    Digital Information Management, 2007. ICDIM '07. 2nd International Conference on
  • Conference_Location
    Lyon
  • Print_ISBN
    978-1-4244-1475-8
  • Electronic_ISBN
    978-1-4244-1476-5
  • Type
    conf
  • DOI
    10.1109/ICDIM.2007.4444195
  • Filename
    4444195