Author_Institution :
Dept. of Comput. Sci., Stevens Inst. of Technol., Hoboken, NJ, USA
Abstract :
Large-scale data analysis and mining activities, such as identifying interesting trends, making unusual patterns stand out and verifying hypotheses, require sophisticated information extraction queries. Being able to express these data mining queries concisely is of major importance not only from the user's, but also from the system's point of view. Recent research in OLAP has focused on data cubes and their applications; however, the expression and processing of ad-hoc decision support queries has been given very little attention. In this paper, we present an appropriate framework for these queries and introduce a syntactic construct to support it. This SQL extension allows most OLAP queries, such as complex intra- and inter-group comparisons, trends and hierarchical comparisons, to be expressed in a compact, intuitive and simple manner. However, this syntactic extension is not the focus of this paper. This succinct representation of a complex OLAP query translates immediately to a novel, simple and efficient evaluation algorithm. We show how to optimize, analyze and parallelize this algorithm and discuss issues such as multiple query analysis and scaling. This algorithm constitutes the main contribution of this paper. Finally, we introduce our implementation on top of a commercial system and present several experimental results of real-life queries that show orders of magnitude of performance improvement in certain cases. We argue that this tight coupling between representation and algorithm is essential to efficient processing of ad-hoc OLAP queries.
Keywords :
SQL; data analysis; data mining; query processing; SQL extension; ad-hoc OLAP query evaluation; algorithm analysis; algorithm optimization; algorithm parallelization; commercial system; data cubes; decision support queries; evaluation algorithm; hierarchical comparisons; hypothesis verification; in-place computation; information extraction queries; inter-group comparisons; interesting trends identification; intra-group comparisons; large-scale data analysis; multiple query analysis; performance improvement; scaling; syntactic construct; trends; unusual patterns; Computer science; Data analysis; Data mining; Pattern analysis; Performance analysis; Query processing; Read only memory