مرکز منطقه ای اطلاع رساني علوم و فناوري - Backtrack-Based Failure Recovery in Distributed Stream Processing

Abstract :

Since stream analytics is treated as a kind of cloud service, there exists a pressing need for its reliability and fault-tolerance. In a streaming process, the parallel and distributed tasks are chained in a graph-structure with each task transforming a stream to a new stream, the transaction property guarantees the streaming data, called tuples, to be processed in the order of their generation in every dataflow path, with each tuple processed once and only once. The failure recovery of a task allows the previously produced results to be corrected for eventual consistency, which is different from the instant consistency of global state enforced by the failure recovery of general distributed systems, and therefore presents new technical challenges. Transactional stream processing typically requires every task to checkpoint its execution state, and when it is restored from a failure, to have the last state recovered from the checkpoint and missing tuple re-acquired and processed. Currently there exist two kind approaches: one treats the whole process as a single transaction, and therefore suffers from the loss of intermediate results during failures, the other relies on the receipt of acknowledgement (ACK) to decide whether moving forward to emit the next resulting tuple or resending the current one after timeout, on the per-tuple basis, thus incurs extremely high latency penalty. In contradistinction to the above, we propose the backtrack mechanism for failure recovery, which allows a task to process tuples continuously without waiting for ACKs and without resending tuples in the failure-free case, but to request (ASK) the source tasks to resend the missing tuples only when it is restored from a failure which is a rare case thus has limited impact on the overall performance. We have implemented the proposed mechanisms on Fontainebleau, the distributed stream analytics infrastructure we developed on top of Storm. As a principle, we ensure all the transactional proper- ies to be system supported and transparent to users. Our experience shows that the ASK-based recovery mechanism significantly outperforms the ACK-based one.