  • DocumentCode
    1791666
  • Title
    Why name ambiguity resolution matters for scholarly big data research
  • Author
    Kim, Jinseok; Diesner, Jana; Aleyasen, Amirhossein; Kim, Heejun; Kim, Hwan-Min
  • Author_Institution
    Grad. Sch. of Libr. & Inf. Sci., Univ. of Illinois at Urbana-Champaign, Urbana, IL, USA
  • fYear
    2014
  • fDate
    27-30 Oct. 2014
  • Firstpage
    1
  • Lastpage
    6
  • Abstract
    This paper illustrates how data pre-processing choices about author name disambiguation can affect research findings about scholarly networks and hypotheses about underlying social mechanisms. We analyzed three big scholarly datasets that were disambiguated algorithmically and via two common initial-based disambiguation methods, namely first-initial and all-initials disambiguation. The comparison of the resulting bibliometric and network properties revealed that initial-based disambiguation carries the prevalent risks of incorrectly merging author identities, underestimating the number of unique authors, and inflating the average productivity and number of collaborators per author. Compared to algorithmic disambiguation, the initial-based methods differ per dataset by -4.23% to -87.36% in the number of unique authors, by 3.75% to 691.20% in average productivity, and by 5.06% to 285.28% in degree centrality. This calls for special attention to data pre-processing choices in scholarly big data research.
  • Keywords
    Big Data; information analysis; all-initials disambiguation; bibliometric properties; first-initial disambiguation; initial-based disambiguation methods; name ambiguity resolution methods; network properties; scholarly big data research; Bibliometrics; Big data; Collaboration; Educational institutions; Heuristic algorithms; Merging; Productivity; bibliometrics; collaboration; disambiguation; network analysis
  • fLanguage
    English
  • Publisher
    IEEE
  • Conference_Titel
    Big Data (Big Data), 2014 IEEE International Conference on
  • Conference_Location
    Washington, DC
  • Type
    conf
  • DOI
    10.1109/BigData.2014.7004345
  • Filename
    7004345
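
The two initial-based methods named in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation. It assumes names are written as "Surname, Given names", uses the full name string as a stand-in for the algorithmically disambiguated baseline, and runs on a made-up three-paper sample. The function names (`first_initial_key`, `all_initials_key`, `full_name_key`, `summarize`) and the sample data are hypothetical.

```python
from collections import defaultdict


def _split(name):
    """Split 'Surname, Given names' into lowercase (surname, given) parts."""
    surname, _, given = name.partition(",")
    return surname.strip().lower(), given.strip().lower()


def full_name_key(name):
    """Stand-in baseline (assumption): the full name identifies the author."""
    surname, given = _split(name)
    return f"{surname}, {given}"


def first_initial_key(name):
    """First-initial method: surname plus the first given-name initial."""
    surname, given = _split(name)
    return f"{surname}, {given[:1]}"


def all_initials_key(name):
    """All-initials method: surname plus the initial of every given name."""
    surname, given = _split(name)
    initials = "".join(part[0] for part in given.replace("-", " ").split())
    return f"{surname}, {initials}"


def summarize(papers, key_fn):
    """Return (unique authors, mean papers per author, mean co-author degree)."""
    paper_count = defaultdict(int)
    neighbors = defaultdict(set)
    for byline in papers:
        ids = {key_fn(name) for name in byline}
        for author in ids:
            paper_count[author] += 1
            neighbors[author] |= ids - {author}
    n = len(paper_count)
    mean_productivity = sum(paper_count.values()) / n
    mean_degree = sum(len(v) for v in neighbors.values()) / n
    return n, mean_productivity, mean_degree


# Hypothetical bylines, one list of author names per paper.
papers = [
    ["Kim, Jae Won", "Lee, Min"],
    ["Kim, Jin Soo", "Park, Hana"],
    ["Kim, Jung Woo", "Lee, Min"],
]

for label, key_fn in [
    ("full name", full_name_key),
    ("all initials", all_initials_key),
    ("first initial", first_initial_key),
]:
    print(label, summarize(papers, key_fn))
```

On this toy sample the three distinct "Kim" authors collapse into two identities under all-initials keys and into one under first-initial keys. The unique-author count falls and average productivity rises, which is the direction of bias the abstract reports.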