DocumentCode
2858867
Title
Using Shannon Entropy in ETL Processes
Author
Balta, Marian ; Felea, Victor
Author_Institution
Al. I. Cuza Univ., Iasi
fYear
2007
fDate
26-29 Sept. 2007
Firstpage
151
Lastpage
156
Abstract
ETL (extract, transform and load) processes are responsible for extracting data from external sources, transforming it to satisfy integration and cleanliness requirements, and loading it into the data warehouse. In the data mining field, there is particular interest in using metrics to build efficient classification algorithms. One such approach uses metrics on partitions, based on the Shannon entropy, to study the degree of concentration of values. In this paper we show how this idea can be used to verify the consistency of data loaded into the data warehouse by ETL processes. We calculate the Shannon entropy and Gini index on partitions induced by attribute sets and show that these values can signal a possible problem in the data extraction process. We also show how the choice of the set of attributes determining the partition can significantly affect the effectiveness of the method.
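The two measures named in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the function names, the sample rows, and the choice of base-2 logarithm are assumptions made for illustration. A set of attributes partitions the loaded rows into blocks of rows that agree on those attributes; the entropy and Gini index are then computed from the relative block sizes.

```python
import math
from collections import Counter


def partition_sizes(rows, attributes):
    """Block sizes of the partition induced by `attributes`:
    rows land in the same block when they agree on every listed attribute."""
    return Counter(tuple(row[a] for a in attributes) for row in rows)


def shannon_entropy(sizes):
    """Shannon entropy (base 2, an assumption) of the block-size distribution."""
    total = sum(sizes.values())
    return sum((n / total) * math.log2(total / n) for n in sizes.values())


def gini_index(sizes):
    """Gini index: 1 minus the sum of squared relative block sizes."""
    total = sum(sizes.values())
    return 1.0 - sum((n / total) ** 2 for n in sizes.values())


# Hypothetical loads: the second one has collapsed onto a single value,
# as might happen if an extraction step silently dropped a source.
reference_load = [
    {"region": "N", "channel": "web"},
    {"region": "N", "channel": "shop"},
    {"region": "S", "channel": "web"},
    {"region": "S", "channel": "shop"},
]
suspect_load = [
    {"region": "N", "channel": "web"},
    {"region": "N", "channel": "web"},
    {"region": "N", "channel": "web"},
    {"region": "N", "channel": "web"},
]

ref_by_region = partition_sizes(reference_load, ["region"])
sus_by_region = partition_sizes(suspect_load, ["region"])
ref_by_both = partition_sizes(reference_load, ["region", "channel"])

print(shannon_entropy(ref_by_region), gini_index(ref_by_region))
print(shannon_entropy(sus_by_region), gini_index(sus_by_region))
print(shannon_entropy(ref_by_both), gini_index(ref_by_both))
```

In this toy data, the drop in both measures between the reference load and the suspect load is the kind of change that could prompt a closer look at the extraction step. Partitioning by both attributes instead of one gives different values for the same rows, which loosely illustrates why the choice of attribute set matters.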
Keywords
data analysis; entropy; ETL process; Gini index; Shannon entropy; classification algorithm; data consistency verification; data extraction; data mining; data warehouse; Classification algorithms; Computer science; Data analysis; Data mining; Data warehouses; Entropy; Load management; Partitioning algorithms; Scientific computing; Signal processing
fLanguage
English
Publisher
IEEE
Conference_Titel
Symbolic and Numeric Algorithms for Scientific Computing, 2007. SYNASC. International Symposium on
Conference_Location
Timisoara
Print_ISBN
978-0-7695-3078-8
Type
conf
DOI
10.1109/SYNASC.2007.41
Filename
4438093