• DocumentCode
    168724
  • Title
    Supporting Queries and Analyses of Large-Scale Social Media Data with Customizable and Scalable Indexing Techniques over NoSQL Databases
  • Author
    Gao, Xiaoming ; Qiu, Jian
  • Author_Institution
    Sch. of Inf. & Comput., Indiana Univ., Bloomington, IN, USA
  • fYear
    2014
  • fDate
    26-29 May 2014
  • Firstpage
    587
  • Lastpage
    590
  • Abstract
    Social media data analysis demonstrates two special characteristics in Big Data processing. First, most analyses focus on data subsets related to specific social events or activities instead of the whole dataset. Second, analysis workflows consist of multiple stages, and algorithms applied in each stage may use different computation and communication patterns depending on processing frameworks. This paper presents our efforts in supporting the data storage and processing requirements for such characteristics. To achieve efficient queries about target data subsets, we propose a general customizable and scalable indexing framework that can be built over distributed NoSQL databases. This framework allows users to define suitable customized index structures for their query patterns against social media data, and supports scalable indexing of both historical and streaming data. We implement this framework on HBase, and name it IndexedHBase. Starting from IndexedHBase, we build a distributed analysis stack based on YARN to support analysis algorithms using different processing frameworks, such as Hadoop MapReduce, Harp, and Giraph. This analysis stack is used to host the Truthy social media data observatory, and we have applied the customized index structures in supporting both query evaluation and sophisticated analysis algorithms. Performance tests show that our solutions outperform implementations using both direct raw data scans and current indexing mechanisms in existing NoSQL databases.
  • Keywords
    Big Data; SQL; data analysis; indexing; query processing; social networking (online); storage management; Big Data processing; Giraph; Hadoop MapReduce; Harp; Truthy social media data observatory; YARN; analysis workflows; communication patterns; computation patterns; customizable indexing framework; customizable indexing techniques; customized index structures; data storage; data subsets; distributed NoSQL databases; distributed analysis stack; historical data indexing; IndexedHBase; large-scale social media data; processing requirements; query evaluation; query patterns; scalable indexing framework; scalable indexing techniques; social activities; social events; social media data analysis; streaming data indexing; Algorithm design and analysis; Data analysis; Distributed databases; Indexing; Media; NoSQL databases; YARN; customizable and scalable indexing; social media data analysis
  • fLanguage
    English
  • Publisher
    IEEE
  • Conference_Titel
    Cluster, Cloud and Grid Computing (CCGrid), 2014 14th IEEE/ACM International Symposium on
  • Conference_Location
    Chicago, IL
  • Type
    conf
  • DOI
    10.1109/CCGrid.2014.57
  • Filename
    6846507