• DocumentCode
    3739161
  • Title
    Methodology for Large-Scale Entity Resolution without Pairwise Matching
  • Author
    Cheng Chen; Daniel Pullen; Reed H. Petty; John R. Talburt
  • Author_Institution
    Black Oak Analytics, Inc., Little Rock, AR, USA
  • fYear
    2015
  • Firstpage
    204
  • Lastpage
    210
  • Abstract
    Entity Resolution (ER) is the process of determining whether two information system records refer to the same entity, and it is a crucial part of Information Quality research. The ER process becomes exponentially more complex and time consuming as datasets approach Big Data volumes. Because of the special characteristics of transitive closure in Entity Resolution and the high volume of input data, traditional pairwise-matching ER algorithms cannot solve the problem efficiently. This paper presents a methodology to perform Entity Resolution without pairwise matching by using match keys. The need for transitive closure arises when each input reference can potentially create more than one match key. This paper also introduces a novel distributed parallel transitive closure algorithm in the Entity Resolution context and an optimized version, which applies the method to multiple match keys. The implementation of the methodology is built upon Hadoop MapReduce for distributed computation.
  • Keywords
    "Erbium","Algorithm design and analysis","Standards","Metadata","Rocks","XML"
  • Publisher
    IEEE
  • Conference_Titel
    Data Mining Workshop (ICDMW), 2015 IEEE International Conference on
  • Electronic_ISBN
    2375-9259
  • Type
    conf
  • DOI
    10.1109/ICDMW.2015.197
  • Filename
    7395672
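  • Illustration
    The abstract describes resolving entities through match keys rather than pairwise comparison, with transitive closure needed once a reference can produce more than one key. The following is a minimal single-machine sketch of that idea, not the paper's distributed Hadoop MapReduce algorithm. The field names (`name`, `phone`, `email`), the key rules in `match_keys`, and the union-find closure in `resolve` are all assumptions made for illustration; the abstract does not specify them.

```python
from collections import defaultdict


def match_keys(record):
    """Hypothetical match-key rules (assumed, not taken from the paper)."""
    name = record["name"].strip().lower()
    keys = []
    if record.get("phone"):
        keys.append(("name_phone", name, record["phone"]))
    if record.get("email"):
        keys.append(("email", record["email"].strip().lower()))
    return keys


def resolve(records):
    """Cluster record ids that share a match key, directly or transitively."""
    parent = {rid: rid for rid in records}

    def find(x):
        # Follow parent links to the cluster root, halving the path as we go.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    def union(a, b):
        ra, rb = find(a), find(b)
        if ra != rb:
            parent[max(ra, rb)] = min(ra, rb)

    # Each key remembers the first record that produced it; later records
    # with the same key are merged into that record's cluster. Records are
    # never compared with each other pair by pair.
    first_seen = {}
    for rid, rec in records.items():
        for key in match_keys(rec):
            if key in first_seen:
                union(rid, first_seen[key])
            else:
                first_seen[key] = rid

    clusters = defaultdict(set)
    for rid in records:
        clusters[find(rid)].add(rid)
    return sorted(sorted(c) for c in clusters.values())


demo = {
    1: {"name": "Ann Lee", "phone": "555-0100", "email": "ann@example.com"},
    2: {"name": "ann lee", "phone": "555-0100", "email": None},
    3: {"name": "A. Lee", "phone": None, "email": "ANN@example.com"},
    4: {"name": "Bob Ray", "phone": "555-0199", "email": "bob@example.com"},
}
print(resolve(demo))  # [[1, 2, 3], [4]]
```

    Records 2 and 3 share no key with each other, yet both share a key with record 1, so the closure step places all three in one cluster. In a distributed setting the same merging would have to be carried out across partitions, which is the part the paper's algorithm addresses and this sketch does not.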