Title :
RSenter: Tool for Topics and Terms Extraction from Unstructured Data Debris
Author :
Lomotey, Richard K. ; Deters, Ralph
Author_Institution :
Dept. of Comput. Sci., Univ. of Saskatchewan, Saskatoon, SK, Canada
Date :
June 27 2013-July 2 2013
Abstract :
There is an enormous volume of user-generated content (data) today in open source repositories, online social networks, and similar sources that enterprises can draw on to enhance product and service delivery. Apart from open source data, enterprises are also generating a lot of data in-house, since modern business requirements are shifting from paper-based to digital records. The major setback, however, is that the data is unstructured: it is in heterogeneous formats (different file types, including multimedia files), it is schemaless, and it is scattered across multiple sources. This condition makes knowledge discovery (a.k.a. data mining) very challenging. Previous studies have proposed the hierarchical clustering methodology because it enhances human readability and provides a clear dependency structure through topic, term, and document organization. However, the methodology can be resource intensive and time consuming. Our work investigates the methodology and proposes a tool called RSenter that searches using parallelization, random walk (or linear search), pessimistic search, and optimistic search in order to generate the hierarchical structure in real time within a search space. Currently, RSenter can search through NoSQL databases and HTML documents, and can traverse all the links connected to an HTML document to the nth depth, extracting all the user-specified elements (topics and terms). Further, the tool can search through an entire repository and organize the files in a hierarchical structure regardless of their file formats.
Keywords :
business data processing; content management; data mining; hypermedia markup languages; parallel processing; social networking (online); HTML documents; NoSQL databases; RSenter; business requirements; knowledge discovery; online social networks; open source repositories; optimistic search; parallelization; pessimistic search; random walk; term extraction; topic extraction; unstructured data debris; user generated content; Communities; Data mining; Databases; Dictionaries; Organizations; Thesauri; big data; data mining; hierarchical clustering; information extraction; terms; topics; unstructured data;
Conference_Title :
Big Data (BigData Congress), 2013 IEEE International Congress on
Conference_Location :
Santa Clara, CA
Print_ISBN :
978-0-7695-5006-0
DOI :
10.1109/BigData.Congress.2013.59