Author_Institution:
Clayton Sch. of Inf. Technol., Monash Univ., Clayton, VIC, Australia
Abstract:
Database sizes have grown exponentially in the past, and such growth is expected to accelerate in the future, with the steady drop in storage cost accompanied by a rapid increase in storage capacity. Many years ago a terabyte database was considered large; nowadays such databases are sometimes regarded as small, and the daily volumes of data added to some databases are measured in terabytes. In the future, petabyte and exabyte databases will be common. With such volumes of data, it is evident that the sequential processing paradigm cannot cope: even assuming a data rate of 1 gigabyte per second, reading through a petabyte database takes over 10 days (a back-of-the-envelope calculation is sketched below). To manage such volumes effectively, multiple resources must be allocated, often massively so. Processing databases of such astronomical proportions requires an understanding of how high-performance systems and parallelism work.

Beyond the massive volume of data in the database to be processed, some data is now distributed across the globe in a Grid environment. These massive data centres are also part of the emergence of Cloud computing, where data access has shifted from local machines to powerful servers hosting web applications and services, making data access across the Internet using standard web browsers pervasive. This adds another dimension to such systems.

This talk, based on our recently published book [1], discusses the fundamentals of parallelism in data-intensive applications and demonstrates how to develop faster capabilities to support them. Topics include the importance of indexing in parallel systems [2-4], specialized algorithms to support various forms of query processing [5-9], and object-oriented schemes [10-12].

Parallelism in databases has been around since the early 1980s, when many researchers in this area aspired to build large special-purpose database machines: databases employing dedicated, specialized parallel hardware. Several projects were born, including Bubba and Gamma; these came and went. However, commercial DBMS vendors quickly realized the importance of supporting high performance for large databases, and many of them have incorporated parallelism and grid features into their products. Their commitment to high-performance systems, parallelism, and grid configurations shows the importance and inevitability of parallelism.

Research in high-performance parallel database processing has grown over the last five years (2008-2012). Data partitioning is still the fundamental issue in high-performance database processing [13, 14]; a minimal partitioning sketch follows below. The data itself is becoming more complex, including XML-based data [15, 16], bioinformatics data [17, 18], and data streams [19, 20]; these new data types require new approaches to parallel processing. In addition, database transactions [21, 22] remain a major focus in many high-performance database systems, such as grid transactions. We also see increasing growth of new application domains, broadly categorized as data-intensive applications, including data warehousing and online analytic processing (OLAP) [23-25]. It is therefore critical to understand the underlying principles of data parallelism before specialized and new application domains can be properly addressed.
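
As a back-of-the-envelope illustration of the scan-time argument above, the short Python sketch below computes how long a full sequential read of a petabyte takes at 1 gigabyte per second, and how the time shrinks when the scan is spread across many nodes. The decimal units (1 PB = 10^15 bytes) and the assumption of linear scaling across nodes are ours, chosen purely for illustration.

```python
# Scan-time arithmetic for the claim above. Assumptions (not from the talk):
# decimal units (1 PB = 10**15 bytes) and throughput that scales linearly
# with the number of nodes.

PETABYTE = 10**15          # bytes
RATE = 10**9               # bytes/second: 1 GB/s sequential read rate per node
SECONDS_PER_DAY = 86_400

def scan_days(size_bytes, nodes=1):
    """Days needed to read size_bytes when `nodes` nodes scan in parallel."""
    return size_bytes / (RATE * nodes) / SECONDS_PER_DAY

print(f"1 node:    {scan_days(PETABYTE):6.2f} days")       # ~11.57 days
print(f"100 nodes: {scan_days(PETABYTE, 100):6.2f} days")  # ~0.12 days (~2.8 h)
```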
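
Data partitioning divides a table across nodes so that each node scans only its own fragment, which is what makes the parallel speedup above attainable. The sketch below shows hash partitioning, one classic scheme alongside round-robin and range partitioning; it is an illustrative example, not the specific method of the book. The example table, partitioning key, and node count are hypothetical, and zlib.crc32 is our arbitrary choice of a deterministic hash function (Python's built-in hash() is salted per process).

```python
# A minimal hash-partitioning sketch. The `orders` table, the key "customer",
# and the node count are hypothetical; zlib.crc32 is an arbitrary
# deterministic hash chosen so the assignment is stable across runs.
import zlib
from collections import defaultdict

def hash_partition(rows, key, num_nodes):
    """Assign each row to a node by hashing its partitioning-key value."""
    partitions = defaultdict(list)
    for row in rows:
        node = zlib.crc32(str(row[key]).encode()) % num_nodes
        partitions[node].append(row)   # rows with equal key values co-locate
    return partitions

orders = [
    {"order_id": 1, "customer": "alice", "amount": 30},
    {"order_id": 2, "customer": "bob",   "amount": 75},
    {"order_id": 3, "customer": "alice", "amount": 12},
]
for node, rows in sorted(hash_partition(orders, "customer", 4).items()):
    print(f"node {node}: {rows}")
```

Hashing on the partitioning key co-locates rows that share a key value, which parallel equi-joins exploit; its classic drawback is load skew when a few key values dominate.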