• DocumentCode
    2209200
  • Title

    Block-GP: Scalable Gaussian Process Regression for Multimodal Data

  • Author

    Das, Kamalika ; Srivastava, Ashok N.

  • fYear
    2010
  • fDate
    13-17 Dec. 2010
  • Firstpage
    791
  • Lastpage
    796
  • Abstract
    Regression problems on massive data sets are ubiquitous in many application domains including the Internet, earth and space sciences, and finance. In many cases, regression algorithms such as linear regression or neural networks attempt to fit the target variable as a function of the input variables without regard to the underlying joint distribution of the variables. As a result, these global models are not sensitive to variations in the local structure of the input space. Several algorithms, including the mixture of experts model, classification and regression trees (CART), and others, have been developed, motivated by the fact that variability in the local distribution of inputs may reflect a significant change in the target variable. While these methods can handle the non-stationarity in the relationships to varying degrees, they are often not scalable and, therefore, not used in large-scale data mining applications. In this paper we develop Block-GP, a Gaussian Process regression framework for multimodal data that can be an order of magnitude more scalable than existing state-of-the-art nonlinear regression algorithms. The framework builds local Gaussian Processes on semantically meaningful partitions of the data and provides higher prediction accuracy than a single global model with very high confidence. The method relies on approximating the covariance matrix of the entire input space by smaller covariance matrices that can be modeled independently, and can therefore be parallelized for faster execution. Theoretical analysis and empirical studies on various synthetic and real data sets show high accuracy and scalability of Block-GP compared to existing nonlinear regression techniques.
  • Keywords
    Gaussian processes; pattern clustering; regression analysis; very large databases; Block-GP; Internet; classification and regression trees; covariance matrix; earth sciences; finances; mixture of experts model; multimodal data; scalable Gaussian process regression; semantically meaningful partitions; space sciences; state-of-the-art nonlinear regression algorithms; Gaussian Process; parallel computation; regression
  • fLanguage
    English
  • Publisher
    IEEE
  • Conference_Titel
    Data Mining (ICDM), 2010 IEEE 10th International Conference on
  • Conference_Location
    Sydney, NSW
  • ISSN
    1550-4786
  • Print_ISBN
    978-1-4244-9131-5
  • Electronic_ISBN
    1550-4786
  • Type

    conf

  • DOI
    10.1109/ICDM.2010.38
  • Filename
    5694040
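
The abstract describes replacing the covariance matrix of the whole input space with smaller covariance matrices that are modeled independently. The sketch below illustrates that general idea only; it is not the authors' Block-GP algorithm. It assumes a plain k-means partition, an RBF kernel with fixed hyperparameters, and routing of each test point to the block with the nearest mean. All function names (rbf_kernel, kmeans, fit_block_gp, predict_block_gp) and parameter values are hypothetical. An exact GP on n points needs an O(n^3) factorization; with independent blocks of sizes n_j the cost becomes the sum of O(n_j^3) terms, and each block can be fitted in parallel.

```python
import numpy as np

def rbf_kernel(A, B, length_scale=0.5):
    # Squared-exponential covariance between the rows of A and the rows of B.
    sq = (A**2).sum(1)[:, None] + (B**2).sum(1)[None, :] - 2.0 * A @ B.T
    return np.exp(-0.5 * np.maximum(sq, 0.0) / length_scale**2)

def kmeans(X, k, n_iter=20, seed=0):
    # Minimal Lloyd's algorithm; an assumed stand-in for the paper's partitioning.
    rng = np.random.default_rng(seed)
    centers = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(n_iter):
        d = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(-1)
        labels = d.argmin(axis=1)
        for j in range(k):
            if np.any(labels == j):
                centers[j] = X[labels == j].mean(axis=0)
    d = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(-1)
    return d.argmin(axis=1)

def fit_block_gp(X, y, k, noise=1e-2, length_scale=0.5):
    # Fit one independent zero-mean GP per non-empty block of the partition.
    labels = kmeans(X, k)
    models = []
    for j in range(k):
        mask = labels == j
        if not mask.any():
            continue
        Xj, yj = X[mask], y[mask]
        K = rbf_kernel(Xj, Xj, length_scale) + noise * np.eye(len(Xj))
        L = np.linalg.cholesky(K)
        alpha = np.linalg.solve(L.T, np.linalg.solve(L, yj))  # K^{-1} y_j
        models.append((Xj.mean(axis=0), Xj, alpha))
    return models

def predict_block_gp(models, Xs, length_scale=0.5):
    # Route each test point to the block with the nearest mean, then use
    # that block's GP posterior mean as the prediction.
    means = np.stack([m[0] for m in models])
    nearest = ((Xs[:, None, :] - means[None, :, :]) ** 2).sum(-1).argmin(axis=1)
    out = np.empty(len(Xs))
    for j, (_, Xj, alpha) in enumerate(models):
        mask = nearest == j
        if mask.any():
            out[mask] = rbf_kernel(Xs[mask], Xj, length_scale) @ alpha
    return out

# Synthetic bimodal data (made up for illustration): two well-separated
# input modes with different input-output relationships.
rng = np.random.default_rng(1)
x_left = rng.uniform(-6.0, -4.0, size=40)
x_right = rng.uniform(4.0, 6.0, size=40)
X = np.concatenate([x_left, x_right])[:, None]
y = np.concatenate([np.sin(x_left), 0.5 * x_right])

models = fit_block_gp(X, y, k=2)
pred = predict_block_gp(models, X)
mean_abs_err = float(np.abs(pred - y).mean())
```

Because the blocks share no covariance terms in this sketch, the loop in fit_block_gp could be distributed across workers without changing the result. A practical version would also learn kernel hyperparameters per block, which is omitted here.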