• DocumentCode
    3106899
  • Title
    Improving Grouped-Entity Resolution Using Quasi-Cliques
  • Author
    On, Byung-Won ; Elmacioglu, Ergin ; Lee, Dongwon ; Kang, Jaewoo ; Pei, Jian
  • Author_Institution
    Pennsylvania State Univ., University Park, PA
  • fYear
    2006
  • fDate
    18-22 Dec. 2006
  • Firstpage
    1008
  • Lastpage
    1015
  • Abstract
    The entity resolution (ER) problem, that of identifying duplicate entities that refer to the same real-world entity, is essential in many applications. In this paper, we focus on resolving entities that contain a group of related elements (e.g., an author entity with a list of citations, a singer entity with a list of songs, or an intermediate result of a SQL GROUP BY query). Such entities, called grouped-entities, occur frequently in many applications. Previous approaches to grouped-entity resolution often rely on textual similarity and produce a large number of false positives. As a complementary technique, we present our experience of applying a recently proposed graph mining technique, Quasi-Clique, atop conventional ER solutions. Our approach exploits contextual information mined from the group of elements per entity in addition to syntactic similarity. Extensive experiments verify that our proposal improves precision and recall by up to 83% when used together with a variety of existing ER solutions, and never worsens them.
  • Keywords
    data mining; text analysis; SQL query; graph mining technique; grouped-entity resolution; quasi-cliques; textual similarity; Computer errors; Data mining; Data structures; Degradation; Erbium; Large-scale systems; Motion pictures; Proposals; Software libraries; Technological innovation
  • fLanguage
    English
  • Publisher
    IEEE
  • Conference_Titel
    Data Mining, 2006. ICDM '06. Sixth International Conference on
  • Conference_Location
    Hong Kong
  • ISSN
    1550-4786
  • Print_ISBN
    0-7695-2701-7
  • Type
    conf
  • DOI
    10.1109/ICDM.2006.85
  • Filename
    4053144