Title :
A selective checkpointing mechanism for query plans in a parallel database system
Author :
Ting Chen ; Taura, Koichi
Author_Institution :
Univ. of Tokyo, Tokyo, Japan
Abstract :
Most existing parallel database systems achieve fault tolerance by aborting unfinished queries upon a failure and restarting the entire query from the beginning. This is inefficient for the long-running queries of OLAP workloads. To solve this problem, this paper presents a selective checkpointing mechanism that materializes the outputs of selected operators, so that a failed query can resume from the middle of its execution. Each query is represented by a DAG of relational operators in which data are typically pipelined between operators. The goal of the mechanism is to find a set of operators whose outputs are worth checkpointing, so as to minimize the expected runtime of the whole query. It first provides a cost model to estimate the expected runtime of a whole query plan under a given failure probability for each operator. A divide-and-conquer algorithm is then proposed to find a close-to-optimal solution to the problem. The algorithm divides the query plan into subplans with smaller search spaces. For a given query plan with n operators, the algorithm runs in O(n) time. The mechanism is implemented in a shared-nothing parallel database system called ParaLite, which provides a coordination layer to glue many SQLite instances together and parallelizes SQL queries across them. The experimental results indicate that different fault-tolerant strategies affect the overall runtimes of queries. Our selective checkpointing mechanism chooses reasonable operators to checkpoint and outperforms other fault-tolerant strategies. In addition, the divide-and-conquer algorithm used by our mechanism has a smaller overhead than a brute-force approach while remaining similarly effective.
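The trade-off described in the abstract can be illustrated with a small sketch. This is not the paper's cost model or its divide-and-conquer algorithm; it is a deliberately simplified, hypothetical model that assumes a linear chain of operators (not a general DAG), that a failure anywhere in a pipelined segment costs one full attempt of that segment, and that each operator's failure probability is below 1. The names `expected_runtime` and `best_checkpoints` are invented for illustration. The search shown is the exhaustive (brute-force) baseline, which is exponential in the number of operators.

```python
from itertools import combinations


def expected_runtime(ops, checkpoints):
    """Expected runtime of a linear operator chain under a toy failure model.

    ops: list of (runtime, checkpoint_cost, fail_prob) tuples, in plan order.
    checkpoints: set of operator indices whose output is materialized.

    Assumption (not from the paper): operators between two checkpoints form
    one pipelined segment; if any operator in it fails, the whole segment is
    retried, so a segment costs segment_time / segment_success_probability.
    """
    total = 0.0
    seg_time = 0.0
    seg_success = 1.0
    for i, (runtime, ckpt_cost, fail_prob) in enumerate(ops):
        seg_time += runtime
        seg_success *= 1.0 - fail_prob
        is_last = i == len(ops) - 1
        if i in checkpoints or is_last:
            if i in checkpoints:
                seg_time += ckpt_cost  # pay for materializing this output
            total += seg_time / seg_success  # geometric retry expectation
            seg_time, seg_success = 0.0, 1.0
    return total


def best_checkpoints(ops):
    """Brute-force search over all checkpoint sets (exponential in len(ops)).

    The final operator is never a candidate: its output is the query result.
    Returns (expected_runtime, set_of_checkpointed_indices).
    """
    candidates = range(len(ops) - 1)
    best = (expected_runtime(ops, set()), set())
    for k in range(1, len(ops)):
        for combo in combinations(candidates, k):
            cost = expected_runtime(ops, set(combo))
            if cost < best[0]:
                best = (cost, set(combo))
    return best


if __name__ == "__main__":
    # Two operators, each taking 10 time units and failing half the time.
    flaky_plan = [(10, 1, 0.5), (10, 1, 0.5)]
    print(best_checkpoints(flaky_plan))
```

Under these assumptions, checkpointing pays off only when the retry time it saves exceeds the materialization cost, which is the selection problem the abstract describes. With failure probabilities of zero, the sketch selects no checkpoints at all.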
Keywords :
SQL; checkpointing; data mining; directed graphs; divide and conquer methods; fault tolerance; parallel databases; probability; query processing; DAG; OLAP workloads; ParaLite; SQL query parallelization; SQLite; brute-force approach; coordination layer; cost model; divide-and-conquer algorithm; failure probability; fault-tolerant strategies; query plan; relational operators; selective checkpointing mechanism; shared-nothing parallel database system; Checkpointing; Database systems; Fault tolerance; Fault tolerant systems; Program processors; Runtime;
Conference_Title :
Big Data, 2013 IEEE International Conference on
Conference_Location :
Silicon Valley, CA
DOI :
10.1109/BigData.2013.6691580