Title :
Impact of data distribution, level of parallelism, and communication frequency on parallel data cube construction
Author :
Yang, Ge ; Jin, Ruoming ; Agrawal, Gagan
Author_Institution :
Dept. of Comput. & Inf. Sci., Ohio State Univ., Columbus, OH, USA
Abstract :
Data cube construction is a commonly used operation in data warehouses. Because of the volume of data that is stored and analyzed in a data warehouse and the amount of computation involved in data cube construction, it is natural to consider parallel machines for this operation. We have developed a set of parallel algorithms for data cube construction using a new data structure called aggregation tree. Our experience has shown that a number of performance trade-offs arise in developing a parallel data cube implementation. We focus on three important issues, which are: (1) data distribution, i.e., how the original array is distributed among the processors; (2) level of parallelism, i.e., what parts of the computation are parallelized and sequentialized; and (3) frequency of communication, i.e., does the implementation require frequent interprocessor communication (and less memory) or less frequent communication (and more memory). We present a detailed experimental study evaluating the above trade-offs. We consider parallel data cube construction with different cube sizes and sparsity levels. Our experimental results show the following: (1) In all cases, reducing the frequency of communication and using higher memory gave better performance, though the difference was relatively small. (2) Choosing data distribution to minimize communication volume made a substantial difference in the performance in most of the cases. (3) Finally, using parallelism at all levels gave better performance, even though it increases the total communication volume.
Keywords :
data warehouses; parallel algorithms; software performance evaluation; tree data structures; aggregation tree; communication frequency; communication volume; data distribution; data structure; data warehouses; interprocessor communication; parallel algorithms; parallel data cube construction; parallel machines; parallelism level; performance trade-offs; Aggregates; Companies; Concurrent computing; Data analysis; Data warehouses; Distributed computing; Frequency; Parallel algorithms; Parallel processing; Performance analysis;
Conference_Titel :
Parallel and Distributed Processing Symposium, 2003. Proceedings. International
Print_ISBN :
0-7695-1926-1
DOI :
10.1109/IPDPS.2003.1213162