• DocumentCode
    659529
  • Title

    A case study on entity resolution for distant processing of big humanities data

  • Author

    Xu, Weijia ; Esteva, Maria ; Trelogan, Jessica ; Swinson, Todd

  • Author_Institution
    Texas Adv. Comput. Center, Univ. of Texas at Austin, Austin, TX, USA
  • fYear
    2013
  • fDate
    6-9 Oct. 2013
  • Firstpage
    113
  • Lastpage
    120
  • Abstract
    At the forefront of big data in the Humanities, collections management can directly impact collections access and reuse. However, for curators who use traditional data management methods for tasks such as distinguishing redundant records from relevant and related ones, a small increase in data volume can significantly increase the workload. In this paper, we present preliminary work aimed at assisting curators in making important data management decisions for organizing and improving the overall quality of large unstructured Humanities data collections. Using Entity Resolution as a conceptual framework, we created a similarity model that compares directories and files based on their implicit metadata and clusters pairs of closely related directories. Useful relationships between data are identified and presented through a graphical user interface that allows qualitative evaluation of the clusters and provides a guide for deciding on data management actions. To evaluate the model's performance, we experimented with a test collection and asked the curator to classify the clusters according to four model cluster configurations that consider the presence of related and duplicate information. Evaluation results suggest that the model is useful for making data management action decisions.
  • Keywords
    Big Data; graphical user interfaces; humanities; meta data; pattern classification; pattern clustering; big humanities data processing; cluster configurations; clusters classification; clusters qualitative evaluation; collections access; collections management; collections reuse; curators; data management action decisions; data management methods; data volume; directories; distant processing; entity resolution; files; graphical user interface; metadata; similarity model; unstructured humanities data collections; Data handling; Data models; Data storage systems; Erbium; Information management; Organizations; Vectors; Collections Management; Digital Humanities; Distant Processing; Entity Resolution;
  • fLanguage
    English
  • Publisher
    IEEE
  • Conference_Titel
    Big Data, 2013 IEEE International Conference on
  • Conference_Location
    Silicon Valley, CA
  • Type

    conf

  • DOI
    10.1109/BigData.2013.6691678
  • Filename
    6691678