Title :
Analysis of Big Data Technologies and Method - Query Large Web Public RDF Datasets on Amazon Cloud Using Hadoop and Open Source Parsers
Author :
Garcia, Ted; Wang, Taehyung
Author_Institution :
Dept. of Comput. Sci., California State Univ., Northridge, Northridge, CA, USA
Abstract :
Extremely large datasets found in Big Data projects are difficult to work with using conventional databases, statistical software, and visualization tools. Massively parallel software, such as Hadoop, running on tens, hundreds, or even thousands of servers is more suitable for Big Data challenges. Additionally, to achieve the highest performance when querying large datasets, it is necessary to work with these datasets at rest, without preprocessing them or moving them into a repository. Therefore, this work will analyze tools and techniques for overcoming the difficulties of working with large datasets at rest. Parsing and querying will be done on the raw dataset: the untouched Web Data Commons RDF files. Web Data Commons comprises five billion web pages crawled from the Internet. This work will analyze available tools and appropriate methods to assist the Big Data developer in working with these extremely large, semantic RDF datasets. Hadoop, open source parsers, and Amazon Cloud services will be used to mine these files. To assist in further discovery, recommendations for future research will be included.
Keywords :
Big Data; cloud computing; data mining; grammars; parallel processing; public domain software; query processing; very large databases; Amazon Cloud services; Big Data development; Big Data method; Big Data project; Big Data technologies; Hadoop; Internet; Web Data Commons RDF files; Web pages; data mining; databases; extremely large datasets; extremely large semantic RDF dataset; large Web public RDF dataset querying; massively parallel software; open source parsers; statistical software; visualization tools; Cloud computing; Data handling; Data mining; Data storage systems; Information management; Java; Resource description framework; Amazon cloud computing; Any23; Hadoop; Jena; Map/Reduce; NXParser; RDF; Semantic Web; big data; open source software
Conference_Title :
Semantic Computing (ICSC), 2013 IEEE Seventh International Conference on
Conference_Location :
Irvine, CA
DOI :
10.1109/ICSC.2013.49