DocumentCode :
1868189
Title :
Implementing MapReduce over language and literature data over the UK National Grid Service
Author :
Sarwar, Muhammad S. ; Alexander, M. ; Anderson, J. ; Green, J. ; Sinnott, Richard O.
Author_Institution :
Nat. e-Sci. Centre, Univ. of Glasgow, Glasgow, UK
fYear :
2011
fDate :
5-6 Sept. 2011
Firstpage :
1
Lastpage :
6
Abstract :
Humanities researchers are producing large volumes and heterogeneous varieties of language and literature data collections in digital format. These collections include dictionaries, thesauri, corpora, images, audio and video resources. The increased availability of these datasets, brought about by advances and adaptations of the Internet and increased digitisation of humanities data resources, poses new challenges for humanities researchers. Many of these challenges are related to data access and usage and include security, integrity, interoperability, information retrieval, sharing, licensing and copyright. The JISC-funded project Enhancing Repositories for Language and Literature Research (ENROLLER; https://www.enroller.org.uk) is addressing these issues through the development of a targeted e-Research environment. A key component of this effort is supporting large-scale analysis of diverse language and literature datasets. To this end, this paper presents the application of the MapReduce algorithm, which supports information retrieval and linguistic analysis on those datasets. In particular, we describe how MapReduce is used to provide advanced bulk search capabilities, exploiting a range of high performance computing resources including the UK National Grid Service (www.ngs.ac.uk) and ScotGrid (www.scotgrid.ac.uk), to offer a step change in the kinds of research that can be undertaken by this community. We also present performance analysis results based on the application of these systems.
Keywords :
Internet; copyright; data integrity; dictionaries; grid computing; humanities; information retrieval; natural languages; open systems; scientific information systems; security of data; thesauri; ENROLLER; Internet; JISC-funded project; MapReduce algorithm; ScotGrid; UK National Grid Service; audio resource; bulk search capability; copyright; data access; data integrity; data security; data usage; dictionary; e-Research environment; enhancing repositories for language and literature research project; high performance computing resource; humanities data resource digitisation; humanities researcher; information retrieval; information sharing; interoperability; language data collection; linguistic analysis; literature data collection; thesaurus; video resource; Data mining; Dictionaries; Indexing; Instruction sets; Pragmatics; Thesauri; ENROLLER; Grid information retrieval; MapReduce; NGS; Scotgrid; eHumanities;
fLanguage :
English
Publisher :
IEEE
Conference_Titel :
Emerging Technologies (ICET), 2011 7th International Conference on
Conference_Location :
Islamabad
Print_ISBN :
978-1-4577-0769-8
Type :
conf
DOI :
10.1109/ICET.2011.6048475
Filename :
6048475