• DocumentCode
    2731779
  • Title
    Disambiguation Algorithm for People Search on the Web
  • Author
    Kalashnikov, Dmitri V. ; Mehrotra, Sharad ; Chen, Zhaoqi ; Nuray-Turan, Rabia ; Ashish, Naveen
  • Author_Institution
    Dept. of Comput. Sci., California Univ., Irvine, CA
  • fYear
    2007
  • fDate
    15-20 April 2007
  • Firstpage
    1258
  • Lastpage
    1260
  • Abstract
    In this paper we develop a disambiguation algorithm and then study its impact on People Search. The proposed algorithm first uses extraction techniques to automatically extract 'significant' entities, such as the names of other persons, organizations, and locations, on each Web page. In addition, it extracts and parses HTML and Web-related data on each Web page, such as hyperlinks and email addresses. The algorithm then views all this information in a unified way: as an entity-relationship (ER) graph where entities (e.g., people, organizations, locations, Web pages) are interconnected via relationships (e.g., 'Web page-mentions-person', relationships derived from hyperlinks, etc.). The algorithm gains its power by being able to analyze several types of information: attributes associated with the entities (e.g., TF/IDF for Web pages) and, most importantly, direct and indirect interconnections that exist among entities in the ER graph. We next outline our approach in Section 2 and then compare it with the state-of-the-art solutions in Section 3.
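The abstract does not specify how the graph is built or analyzed, so the following is only a minimal sketch of the general idea, not the authors' algorithm. It assumes entities have already been extracted per page, stores 'Web page-mentions-entity' relationships in an undirected graph, and groups pages that are connected through shared entities. The function names `build_er_graph` and `cluster_pages`, the node encoding, and the sample data are hypothetical, and plain connected components stand in for the connection analysis the paper describes.

```python
from collections import defaultdict


def build_er_graph(pages):
    """Build an undirected graph from pre-extracted entities.

    pages: dict mapping page_id -> iterable of (entity_type, entity_name).
    Assumption: no extracted entity uses the type "page".
    """
    graph = defaultdict(set)
    for page_id, entities in pages.items():
        page_node = ("page", page_id)
        graph[page_node]  # register the page even if it mentions nothing
        for entity in entities:
            # One 'Web page-mentions-entity' relationship, stored both ways.
            graph[page_node].add(entity)
            graph[entity].add(page_node)
    return graph


def cluster_pages(graph):
    """Group pages linked through at least one shared entity (union-find)."""
    parent = {node: node for node in graph if node[0] == "page"}

    def find(node):
        # Follow parent pointers to the root, halving the path on the way.
        while parent[node] != node:
            parent[node] = parent[parent[node]]
            node = parent[node]
        return node

    for node, neighbours in graph.items():
        if node[0] == "page":
            continue
        # Every page that mentions this entity joins the same group.
        linked = sorted(neighbours)
        for other in linked[1:]:
            parent[find(other)] = find(linked[0])

    clusters = defaultdict(set)
    for page_node in parent:
        clusters[find(page_node)].add(page_node[1])
    return sorted(sorted(group) for group in clusters.values())


# Hypothetical sample input: three pages returned for one person-name query.
sample_pages = {
    "p1": [("person", "Sharad Mehrotra"), ("organization", "UC Irvine")],
    "p2": [("organization", "UC Irvine")],
    "p3": [("location", "Istanbul")],
}
sample_clusters = cluster_pages(build_er_graph(sample_pages))
```

Here `p1` and `p2` end up in one group because both mention the same organization, while `p3` stays alone. A realistic system would weight connections and combine them with attribute similarity such as TF/IDF rather than merging on any single shared entity.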
  • Keywords
    Web sites; information retrieval; HTML; People Search; Web page; World Wide Web; disambiguation algorithm; entity-relationship graph; extraction techniques; Clustering algorithms; Computer science; Data mining; Information analysis; Internet; Machine learning; Middleware; Search engines; Web pages; Web search;
  • fLanguage
    English
  • Publisher
    IEEE
  • Conference_Titel
    Data Engineering, 2007. ICDE 2007. IEEE 23rd International Conference on
  • Conference_Location
    Istanbul
  • Print_ISBN
    1-4244-0802-4
  • Electronic_ISBN
    1-4244-0803-2
  • Type
    conf
  • DOI
    10.1109/ICDE.2007.368987
  • Filename
    4221777