Title :
Query result size estimation using a novel histogram-like technique: the rectangular attribute cardinality map
Author :
Oommen, B. John ; Thiyagarajah, M.
Author_Institution :
Sch. of Comput. Sci., Carleton Univ., Ottawa, Ont., Canada
Abstract :
Current database systems utilize histograms to approximate frequency distributions of attribute values of relations. These are used to efficiently estimate query result sizes and access plan costs. Even though they have been in use for nearly two decades, there have been no significant mathematical techniques (other than those used in statistics for traditional histogram approximations) to study them. We introduce a new histogram-like approximation strategy called the Rectangular Attribute Cardinality Map (R-ACM), which aims to approximate the density of the underlying attribute values using the philosophies of numerical integration. In this method, the density function within a given sector is approximated by a rectangular cell, where the height of the cell is chosen so as to guarantee that the actual probability density differs from the approximated one by at most a user-specified tolerance, τ. Furthermore, unlike the two traditional histogram types, the R-ACM is neither equi-width nor equi-depth. Analytically, we show that for the R-ACM, the frequency of an attribute value within a sector is binomially distributed. This permits us to derive worst-case and average-case results for the estimation errors of the probability mass itself. Our theoretical results, which include rigorous maximum likelihood and expected-case analyses, and an extensive set of experiments demonstrate that the R-ACM scheme is much more accurate than traditional histograms for query result size estimation. Due to its high accuracy and low construction costs, we hope that it could become an invaluable tool for query optimization in future database systems.
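The sector construction described in the abstract can be illustrated with a minimal Python sketch. The abstract does not give the partitioning rule, so the one used here is an assumption: a value joins the current sector only while its frequency stays within τ of the sector's running mean. The function names `build_racm` and `estimate_range`, the tuple layout, and the uniform-within-sector estimate are all hypothetical choices made for illustration, not the authors' algorithm.

```python
def build_racm(freqs, tau):
    """Partition per-value frequencies into variable-width sectors.

    Assumed rule (not stated in the abstract): extend the current sector
    while the next frequency is within tau of the sector's running mean;
    otherwise close the sector and start a new one.
    Returns a list of (start_index, width, total_count) tuples.
    """
    if not freqs:
        return []
    sectors = []
    start, total = 0, freqs[0]
    for i in range(1, len(freqs)):
        width = i - start
        running_mean = total / width
        if abs(freqs[i] - running_mean) <= tau:
            total += freqs[i]          # value fits: grow the sector
        else:
            sectors.append((start, width, total))
            start, total = i, freqs[i]  # value deviates: new sector
    sectors.append((start, len(freqs) - start, total))
    return sectors


def estimate_range(sectors, lo, hi):
    """Estimate the tuple count for value indices lo..hi (inclusive).

    Each sector is treated as a rectangle: every value in it is assumed
    to have the sector's mean frequency, total / width.
    """
    est = 0.0
    for start, width, total in sectors:
        end = start + width - 1
        overlap = min(hi, end) - max(lo, start) + 1
        if overlap > 0:
            est += total * overlap / width
    return est


freqs = [10, 11, 9, 30, 31, 2]
sectors = build_racm(freqs, tau=2)
print(sectors)
print(estimate_range(sectors, 1, 3), sum(freqs[1:4]))
```

Because sector boundaries follow the data, the resulting cells have neither equal widths nor equal counts, which matches the abstract's remark that the R-ACM is neither equi-width nor equi-depth. A smaller τ produces more sectors and a tighter approximation at the cost of more storage.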
Keywords :
binomial distribution; integration; maximum likelihood estimation; query processing; statistical databases; R-ACM; access plan costs; approximate frequency distributions; approximation strategy; attribute value; attribute values; average case results; construction costs; database systems; density function; estimation errors; expected case analyses; future database systems; histogram types; histogram-like technique; mathematical techniques; maximum likelihood; numerical integration; probability density; probability mass; query optimization; query result size estimation; query result sizes; rectangular attribute cardinality map; user specified tolerance; Approximation methods; Cost function; Database systems; Density functional theory; Estimation error; Frequency; Histograms; Maximum likelihood estimation; Query processing; Statistical distributions;
Conference_Title :
Database Engineering and Applications, 1999. IDEAS '99. International Symposium Proceedings
Conference_Location :
Montreal, Que.
Print_ISBN :
0-7695-0265-2
DOI :
10.1109/IDEAS.1999.787246