• DocumentCode
    2478776
  • Title

    Normalized Sum-over-Paths Edit Distances

  • Author

    Diez, Silvia Garcia ; Fouss, Francois ; Shimbo, Masashi ; Saerens, Marco

  • Author_Institution
    Unite de Syst. d´´Inf. (ISYS), Univ. Catholique de Louvain, Louvain-la-Neuve, Belgium
  • fYear
    2010
  • fDate
    23-26 Aug. 2010
  • Firstpage
    1044
  • Lastpage
    1047
  • Abstract
    In this paper, normalized SoP string-edit distances, taking into account all possible alignments between two sequences, are investigated. These normalized distances are variants of the Sum-over-Paths (SoP) distances which compute the expected cost on all sequence alignments by favoring low-cost ones - therefore favoring good alignment. Such distances consider two sequences tied by many optimal or nearly-optimal alignments as more similar than two sequences sharing only one, optimal, alignment. They depend on a parameter, θ, and reduce to the standard distances - the edit-distance or the longest common subsequence - when θ → 0, while having the same time complexity. This paper puts the emphasis on applying some type of normalization of the expectation of the cost. Experimental results for clustering and classification tasks performed on four OCR data sets show that (i) the applied normalization generally improves the existing results, and (ii) as for the SoP edit-distances, the normalized SoP edit-distances clearly outperform the non-randomized measures, i.e. the standard edit distance and longest common subsequence.
  • Keywords
    computational complexity; optical character recognition; pattern classification; pattern clustering; string matching; OCR data sets; classification tasks; clustering task; longest common subsequence; nearly-optimal alignments; normalization; normalized SoP string-edit distances; normalized distances; normalized sum-over-paths edit distances; sequence alignments; standard edit distance; time complexity; Electronic mail; Indexes; Kernel; Light emitting diodes; Pattern recognition; Presses; Support vector machines; edit distance; longest common subsequence; normalization; randomized-shortest paths;
  • fLanguage
    English
  • Publisher
    ieee
  • Conference_Titel
    Pattern Recognition (ICPR), 2010 20th International Conference on
  • Conference_Location
    Istanbul
  • ISSN
    1051-4651
  • Print_ISBN
    978-1-4244-7542-1
  • Type

    conf

  • DOI
    10.1109/ICPR.2010.261
  • Filename
    5595855