• DocumentCode
    860687
  • Title

    Compression and Aggregation for Logistic Regression Analysis in Data Cubes

  • Author

    Xi, Ruibin ; Lin, Nan ; Chen, Yixin

  • Author_Institution
    Dept. of Math., Washington Univ., St. Louis, MO
  • Volume
    21
  • Issue
    4
  • fYear
    2009
  • fDate
    4/1/2009 12:00:00 AM
  • Firstpage
    479
  • Lastpage
    492
  • Abstract
    Logistic regression is an important technique for analyzing and predicting data with categorical attributes. In this paper, we consider supporting online analytical processing (OLAP) of logistic regression analysis for multidimensional data in a data cube, where it is expensive in time and space to build logistic regression models for each cell from the raw data. We propose a novel scheme to compress the data in such a way that we can reconstruct logistic regression models to answer any OLAP query without accessing the raw data. Based on a first-order approximation to the maximum likelihood estimating equations, we develop a compression scheme that compresses each base cell into a small compressed data block containing the essential information needed to support the aggregation of logistic regression models. Aggregation formulae for deriving high-level logistic regression models from lower-level component cells are given. We prove that the compression is nearly lossless in the sense that the aggregated estimator deviates from the true model by an error that is bounded and approaches zero as the data size increases. The results show that the proposed compression and aggregation scheme can make OLAP of logistic regression in a data cube feasible. Further, it supports real-time logistic regression analysis of stream data, which can only be scanned once and cannot be permanently retained. Experimental results validate our theoretical analysis and demonstrate that our method can dramatically reduce time and space costs with almost no degradation of the modeling accuracy.
  • Keywords
    data compression; data mining; regression analysis; data cubes; first-order approximation; logistic regression analysis; multidimensional data; online analytical processing; statistical databases
  • fLanguage
    English
  • Journal_Title
    Knowledge and Data Engineering, IEEE Transactions on
  • Publisher
    IEEE
  • ISSN
    1041-4347
  • Type

    jour

  • DOI
    10.1109/TKDE.2008.186
  • Filename
    4624260