DocumentCode :
1791781
Title :
Integrating Data Mining and Data Management Technologies for Scholarly Inquiry
Author :
Larson, Ray R. ; Marciano, Richard ; Hou, Chien-Yi ; Shreyas ; Watry, Paul ; Harrison, Jonathan ; Aguilar, Luis ; Fuselier, Jerome
Author_Institution :
Sch. of Inf., Univ. of California, Berkeley, Berkeley, CA, USA
fYear :
2014
fDate :
27-30 Oct. 2014
Firstpage :
67
Lastpage :
71
Abstract :
This short paper discusses the “Integrating Data Mining and Data Management Technologies for Scholarly Inquiry” project. In this “Round Two” Digging Into Data Challenge award, we explored uses and approaches for large-scale data analysis and processing for the Humanities and Social Sciences through the integration of several infrastructure frameworks: Cheshire, iRODS, and Amazon Web Services (EC2 computing and S3 storage). Our “big data” consisted of the entire texts collection of the Internet Archive (approximately 3.6 million volumes) and the entire JSTOR database. We performed surface-level natural language processing on this data to identify noun phrases and further refinements to identify personal, corporate, and geographic names. We then used resources including library and archival authority records to identify variants and merge names. The goal is to create an integrated index of persons, places, and organizations referenced in our collections.
Keywords :
Big Data; Web services; data mining; merging; natural language processing; text analysis; Amazon Web Services; Cheshire; Data Challenge award; EC2 computing; Internet Archive; JSTOR database; S3 storage; archival authority records; corporate name identification; data management technology integration; data mining technology integration; geographic name identification; humanities; iRODS; integrated person-place-organization index; large-scale data analysis; large-scale data processing; library records; name merging; noun phrases; personal name identification; scholarly inquiry; social sciences; surface-level natural language processing; text collection; Educational institutions; Indexing; Internet; Libraries; Prototypes; XML; Cheshire3; JSTOR; data management
fLanguage :
English
Publisher :
IEEE
Conference_Title :
Big Data (Big Data), 2014 IEEE International Conference on
Conference_Location :
Washington, DC
Type :
conf
DOI :
10.1109/BigData.2014.7004455
Filename :
7004455