• DocumentCode
    1791536
  • Title
    PGMHD: A scalable probabilistic graphical model for massive hierarchical data problems
  • Author
    AlJadda, Khalifeh; Korayem, Mohammed; Ortiz, Camilo; Grainger, Trey; Miller, John A.; York, William S.
  • Author_Institution
    Dept. of Comput. Sci., Univ. of Georgia, Athens, GA, USA
  • fYear
    2014
  • fDate
    27-30 Oct. 2014
  • Firstpage
    55
  • Lastpage
    60
  • Abstract
    In the big data era, scalability has become a crucial requirement for any useful computational model. Probabilistic graphical models are very useful for mining and discovering data insights, but they are not scalable enough to be suitable for big data problems. Bayesian networks particularly demonstrate this limitation when their data is represented using few random variables with a massive set of outcome values for each of them. With hierarchical data (data that is arranged in a treelike structure with several levels), one would expect to see hundreds of thousands or millions of values distributed over even just a small number of levels. When modeling this kind of hierarchical data across large data sets, Bayesian networks become unsuitable for representing the probability distributions for the following reasons: i) each level represents a single random variable with hundreds of thousands of values, ii) the number of levels is usually small, so there are also few random variables, and iii) the structure of the network is predefined since the dependency is modeled top-down from each parent to each of its child nodes. In this paper we propose a scalable probabilistic graphical model to overcome these limitations for massive hierarchical data. We believe the proposed model will lead to an easily scalable, more readable, and expressive implementation for problems that require probabilistic solutions for massive amounts of hierarchical data. We successfully applied this model to solve two challenging probabilistic problems on massive hierarchical data sets from different domains, namely bioinformatics and latent semantic discovery over search logs.
  • Keywords
    Bayes methods; Big Data; data mining; trees (mathematics); Bayesian networks; PGMHD; bioinformatics; data insight discovery; data insight mining; latent semantic discovery; massive hierarchical data problems; probabilistic graphical models; probability distributions; scalable probabilistic graphical model; treelike structure; Data models; Graphical models; Probabilistic logic; Probability distribution; Random variables; Semantics
  • fLanguage
    English
  • Publisher
    IEEE
  • Conference_Title
    Big Data (Big Data), 2014 IEEE International Conference on
  • Conference_Location
    Washington, DC
  • Type
    conf
  • DOI
    10.1109/BigData.2014.7004213
  • Filename
    7004213