RSenter: Tool for Topics and Terms Extraction from Unstructured Data Debris

Author

Lomotey, Richard K. ; Deters, Ralph

Author_Institution

Dept. of Comput. Sci., Univ. of Saskatchewan, Saskatoon, SK, Canada

fYear

2013

fDate

June 27 2013-July 2 2013

Firstpage

395

Lastpage

402

Abstract

There is an enormous volume of user-generated content (data) today in open source repositories, online social networks, and similar sources that enterprises can draw on to enhance product and service delivery. Apart from open source data, enterprises are also generating a lot of data in-house, since modern business requirements are shifting from paper-based to digital records. The major setback, however, is that the data is unstructured: it comes in heterogeneous formats (different file types, including multimedia files), it is schema-less, and it is scattered across multiple sources. This condition makes knowledge discovery (a.k.a. data mining) very challenging. Previous studies have proposed the hierarchical clustering methodology, since it enhances human readability and provides a clear dependency structure through topic, term, and document organization. However, the methodology can be resource-intensive and time-consuming. Our work investigates the methodology and proposes a tool called RSenter that searches based on parallelization, random walk (or linear search), pessimistic search, and optimistic search in order to generate the hierarchical structure in real time within a search space. Currently, RSenter can search through NoSQL databases and HTML documents, traversing all the links connected to an HTML document to the nth depth and extracting all the user-specified elements (topics and terms). Further, the tool can search through an entire repository and organize the files in a hierarchical structure regardless of file format.
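The link traversal "to the nth depth" mentioned in the abstract can be illustrated with a minimal sketch. This is not RSenter's implementation, which the abstract does not describe in detail; it assumes a plain breadth-first walk with a depth limit, and it reads pages from a hypothetical in-memory dictionary (`PAGES`) rather than from the network. The names `ElementCollector` and `crawl` are illustrative only.

```python
from collections import deque
from html.parser import HTMLParser


class ElementCollector(HTMLParser):
    """Collects text found inside user-specified tags, plus every href on the page."""

    def __init__(self, wanted_tags):
        super().__init__()
        self.wanted = set(wanted_tags)
        self.open_tags = []   # stack of wanted tags that are currently open
        self.found = []       # (tag, text) pairs, in document order
        self.links = []       # href values of all <a> tags

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href")
            if href:
                self.links.append(href)
        if tag in self.wanted:
            self.open_tags.append(tag)

    def handle_endtag(self, tag):
        if self.open_tags and self.open_tags[-1] == tag:
            self.open_tags.pop()

    def handle_data(self, data):
        text = data.strip()
        if self.open_tags and text:
            self.found.append((self.open_tags[-1], text))


def crawl(pages, start, wanted_tags, max_depth):
    """Breadth-first walk over `pages` (URL -> HTML), at most max_depth links from start."""
    seen = {start}
    queue = deque([(start, 0)])
    results = {}
    while queue:
        url, depth = queue.popleft()
        html = pages.get(url)
        if html is None:
            continue  # link points outside the known pages
        parser = ElementCollector(wanted_tags)
        parser.feed(html)
        parser.close()
        results[url] = parser.found
        if depth < max_depth:
            for link in parser.links:
                if link not in seen:
                    seen.add(link)
                    queue.append((link, depth + 1))
    return results


# Hypothetical three-page site used only for illustration.
PAGES = {
    "index.html": "<h1>Big Data</h1><p>See <a href='a.html'>A</a></p>",
    "a.html": "<h1>Clustering</h1><a href='b.html'>B</a>",
    "b.html": "<h1>NoSQL</h1>",
}

if __name__ == "__main__":
    print(crawl(PAGES, "index.html", ["h1"], max_depth=1))
```

With `max_depth=1` the walk stops after `a.html`; raising the limit to 2 also reaches `b.html`. A real crawler would additionally need URL resolution, fetching, and error handling, none of which is shown here.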

Keywords

business data processing; content management; data mining; hypermedia markup languages; parallel processing; social networking (online); HTML documents; NoSQL databases; RSenter; business requirements; knowledge discovery; online social networks; open source repositories; optimistic search; parallelization; pessimistic search; random walk; term extraction; topic extraction; unstructured data debris; user generated content; Communities; Data mining; Databases; Dictionaries; Organizations; Thesauri; big data; data mining; hierarchical clustering; information extraction; terms; topics; unstructured data;

fLanguage

English

Publisher

IEEE

Conference_Title

Big Data (BigData Congress), 2013 IEEE International Congress on

Conference_Location

Santa Clara, CA

Print_ISBN

978-0-7695-5006-0

Type

conf

DOI

10.1109/BigData.Congress.2013.59

Filename

6597163