• DocumentCode
    2858867
  • Title

    Using Shannon Entropy in ETL Processes

  • Author

    Balta, Marian ; Felea, Victor

  • Author_Institution
    Al. I. Cuza Univ., Iasi
  • fYear
    2007
  • fDate
    26-29 Sept. 2007
  • Firstpage
    151
  • Lastpage
    156
  • Abstract
    ETL (extract, transform, and load) processes extract data from external sources, transform it to meet integration and cleanliness requirements, and load it into the data warehouse. In data mining, there is particular interest in metrics that support efficient classification algorithms. One such approach uses metrics on partitions, based on the Shannon entropy, to study the degree of concentration of values. In this paper we show how this idea can be used to verify the consistency of data loaded into the data warehouse by ETL processes. We calculate the Shannon entropy and the Gini index on partitions induced by attribute sets and show that these values can signal a possible problem in the data extraction process. We also show that the choice of the attribute set determining the partition can have a significant impact on the effectiveness of the method.
  • Keywords
    data analysis; entropy; ETL process; Gini index; Shannon entropy; classification algorithm; data consistency verification; data extraction; data mining; data warehouse; Classification algorithms; Computer science; Data analysis; Data mining; Data warehouses; Entropy; Load management; Partitioning algorithms; Scientific computing; Signal processing;
  • fLanguage
    English
  • Publisher
    IEEE
  • Conference_Titel
    Symbolic and Numeric Algorithms for Scientific Computing, 2007. SYNASC. International Symposium on
  • Conference_Location
    Timisoara
  • Print_ISBN
    978-0-7695-3078-8
  • Type

    conf

  • DOI
    10.1109/SYNASC.2007.41
  • Filename
    4438093
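
The two measures named in the abstract can be illustrated with a minimal sketch. It assumes the standard definitions: for a partition whose blocks hold fractions p_i of the rows, the Shannon entropy is -sum(p_i * log2(p_i)) and the Gini index is 1 - sum(p_i^2). The function names, the dict-per-row layout, and the sample rows are hypothetical and are not taken from the paper; the paper's actual verification procedure may differ.

```python
import math
from collections import Counter


def partition_sizes(rows, attrs):
    """Sizes of the blocks of the partition induced by `attrs`.

    Two rows fall in the same block when they agree on every attribute
    in `attrs`. `rows` is assumed to be a list of dicts (hypothetical layout).
    """
    keys = (tuple(row[a] for a in attrs) for row in rows)
    return list(Counter(keys).values())


def shannon_entropy(sizes):
    """Shannon entropy (base 2) of a partition given its block sizes."""
    n = sum(sizes)
    return -sum((s / n) * math.log2(s / n) for s in sizes)


def gini_index(sizes):
    """Gini index of a partition given its block sizes."""
    n = sum(sizes)
    return 1.0 - sum((s / n) ** 2 for s in sizes)


# Hypothetical sample: a load with evenly spread values ...
good_load = [
    {"country": "RO", "year": 2006},
    {"country": "RO", "year": 2007},
    {"country": "DE", "year": 2006},
    {"country": "DE", "year": 2007},
]
# ... and one where extraction collapsed every row to a single country.
bad_load = [dict(row, country="RO") for row in good_load]

for name, load in (("good", good_load), ("bad", bad_load)):
    sizes = partition_sizes(load, ["country"])
    print(name, shannon_entropy(sizes), gini_index(sizes))
```

In this sketch, a sharp drop in either value between loads of the same source would be one possible signal that values have become unexpectedly concentrated. Partitioning by a different attribute set, for example ["year"] alone, would not reveal the collapse in the hypothetical bad load, which mirrors the abstract's point that the choice of attributes matters.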