• DocumentCode
    235141
  • Title
    Design and analysis of fault tolerance mechanism for Sparrow
  • Author
    Wenzhuo Li ; Chuang Lin
  • Author_Institution
    Dept. of Comput. Sci. & Technol., Tsinghua Univ., Beijing, China
  • fYear
    2014
  • fDate
    5-7 Dec. 2014
  • Firstpage
    1
  • Lastpage
    7
  • Abstract
    Big data processing frameworks are moving towards higher degrees of parallelism and shorter task durations in order to achieve lower response times. Scheduling highly parallel tasks that complete in roughly 100 milliseconds poses a major challenge for task schedulers. To meet this challenge, researchers have turned to decentralized frameworks, which relieve the pressure on task schedulers; among these, Sparrow is a good choice. However, little effort has been devoted to fault tolerance in Sparrow, which does not handle worker failures and can therefore leave tasks incomplete. We present a fault tolerance mechanism named Heartbeat on Sparrow to handle failures of worker machines. Through simulation, we compare it with a simple mechanism. The results show that Heartbeat on Sparrow detects worker failures faster and reschedules all failed tasks more efficiently, recovering tasks and state in sub-second time. We hope this mechanism will contribute to the fault tolerance of Sparrow and other decentralized designs.
  • Keywords
    Big Data; fault tolerant computing; parallel processing; scheduling; Big Data processing framework; Heartbeat mechanism; Sparrow; decentralized design; decentralized framework; fault tolerance mechanism; parallelism degree; task schedulers; Detectors; Fault tolerance; Fault tolerant systems; Heart beat; Heart rate variability; Monitoring; Probes; Sparrow; decentralized task scheduling; failure detector; failure recovery; fault tolerance
  • fLanguage
    English
  • Publisher
    IEEE
  • Conference_Titel
    Performance Computing and Communications Conference (IPCCC), 2014 IEEE International
  • Conference_Location
    Austin, TX
  • Type
    conf
  • DOI
    10.1109/PCCC.2014.7017054
  • Filename
    7017054