• DocumentCode
    888529
  • Title

    Sample-based quality estimation of query results in relational database environments

  • Author

    Ballou, Donald P. ; Chengalur-Smith, InduShobha N. ; Wang, Richard Y.

  • Author_Institution
    Manage. Sci. & Inf. Syst., Albany Univ., NY, USA
  • Volume
    18
  • Issue
    5
  • fYear
    2006
  • fDate
    5/1/2006
  • Firstpage
    639
  • Lastpage
    650
  • Abstract
    The quality of data in relational databases is often uncertain, and the relationship between the quality of the underlying base tables and the set of potential query results, a type of information product (IP), that could be produced from them has not been fully investigated. This paper provides a basis for the systematic analysis of the quality of such IPs. This research uses the relational algebra framework to develop estimates for the quality of query results based on the quality estimates of samples taken from the base tables. Our procedure requires an initial sample from the base tables; these samples are then used for all possible IPs. Each specific query governs the quality assessment of the relevant samples. By using the same sample repeatedly, our approach is relatively cost-effective. We introduce the reference-table procedure, which can be used for quality estimation in general. In addition, for each of the basic algebraic operators, we discuss simpler procedures that may be applicable. Special attention is devoted to the join operation. We examine various relevant statistical issues, including how to deal with the impact on quality of missing rows in base tables. Finally, we address several implementation issues related to sampling.
  • Keywords
    estimation theory; query processing; relational algebra; relational databases; sampling methods; base table; data quality estimation; database sampling; information product; join operation; reference-table procedure; relational database; statistical issue; Algebra; Costs; Information systems; Quality assessment; Quality control; Quality management; Warehousing; Yield estimation; Data quality
  • fLanguage
    English
  • Journal_Title
    Knowledge and Data Engineering, IEEE Transactions on
  • Publisher
    IEEE
  • ISSN
    1041-4347
  • Type

    jour

  • DOI
    10.1109/TKDE.2006.83
  • Filename
    1613867