Title :
A Case Study on Entity Resolution for Distant Processing of Big Humanities Data
Author :
Weijia Xu ; Esteva, Maria ; Trelogan, Jessica ; Swinson, Todd
Author_Institution :
Texas Adv. Comput. Center, Univ. of Texas at Austin, Austin, TX, USA
Abstract :
At the forefront of big data in the Humanities, collections management can directly impact collections access and reuse. However, for curators who use traditional data management methods for tasks such as distinguishing redundant records from relevant and related ones, a small increase in data volume can significantly increase their workload. In this paper, we present preliminary work aimed at assisting curators in making important data management decisions for organizing and improving the overall quality of large unstructured Humanities data collections. Using Entity Resolution as a conceptual framework, we created a similarity model that compares directories and files based on their implicit metadata and clusters pairs of closely related directories. Useful relationships between data are identified and presented through a graphical user interface that allows qualitative evaluation of the clusters and provides a guide for deciding on data management actions. To evaluate the model's performance, we experimented with a test collection and asked the curator to classify the clusters according to four model cluster configurations that consider the presence of related and duplicate information. Evaluation results suggest that the model is useful for making data management action decisions.
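The abstract does not specify the similarity model, so the following is only a minimal sketch of the general idea of comparing directories by implicit metadata and keeping closely related pairs. It assumes, purely for illustration, that each file is described by its name and byte size and that directories are compared with Jaccard similarity; the function names, the metadata chosen, and the 0.5 threshold are all hypothetical and are not taken from the paper.

```python
from itertools import combinations


def file_signature(name, size):
    """Implicit metadata for one file: lowercased name plus byte size (an assumed choice)."""
    return (name.lower(), size)


def directory_similarity(files_a, files_b):
    """Jaccard similarity between the file-signature sets of two directories."""
    sig_a = {file_signature(n, s) for n, s in files_a}
    sig_b = {file_signature(n, s) for n, s in files_b}
    if not sig_a and not sig_b:
        return 0.0  # two empty directories carry no evidence of a relationship
    return len(sig_a & sig_b) / len(sig_a | sig_b)


def related_pairs(directories, threshold=0.5):
    """Return (dir_a, dir_b, score) for pairs scoring at or above the threshold.

    `directories` maps a directory name to a list of (file name, size) tuples.
    The threshold is a hypothetical cut-off, not a value from the paper.
    """
    pairs = []
    for (name_a, files_a), (name_b, files_b) in combinations(sorted(directories.items()), 2):
        score = directory_similarity(files_a, files_b)
        if score >= threshold:
            pairs.append((name_a, name_b, score))
    # Highest-scoring pairs first, so likely duplicates surface at the top.
    return sorted(pairs, key=lambda p: p[2], reverse=True)
```

Under these assumptions, a score near 1.0 would point to a probable duplicate directory, while intermediate scores would point to related but distinct content; a curator would still need to review each pair before acting on it.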
Keywords :
Big Data; graphical user interfaces; humanities; meta data; pattern classification; pattern clustering; big humanities data processing; cluster configurations; clusters classification; clusters qualitative evaluation; collections access; collections management; collections reuse; curators; data management action decisions; data management methods; data volume; directories; distant processing; entity resolution; files; graphical user interface; metadata; similarity model; unstructured humanities data collections; Data handling; Data models; Data storage systems; Erbium; Information management; Organizations; Vectors; Collections Management; Digital Humanities; Distant Processing; Entity Resolution;
Conference_Title :
Big Data, 2013 IEEE International Conference on
Conference_Location :
Silicon Valley, CA
DOI :
10.1109/BigData.2013.6691678