• DocumentCode
    2693907
  • Title

    Checkpoint-based fault-tolerant infrastructure for virtualized service providers

  • Author

    Goiri, Íñigo ; Julià, Ferran ; Guitart, Jordi ; Torres, Jordi

  • Author_Institution
    Barcelona Supercomput. Center, Tech. Univ. of Catalonia, Barcelona, Spain
  • fYear
    2010
  • fDate
    19-23 April 2010
  • Firstpage
    455
  • Lastpage
    462
  • Abstract
    Crash and omission failures are common in service providers: a disk can break down or a link can fail at any time. In addition, the probability of a node failure increases with the number of nodes. Apart from reducing the provider's computation power and jeopardizing the fulfillment of its contracts, such failures also waste computation time when a crash occurs before a task finishes executing. To avoid this problem, efficient checkpoint infrastructures are required, especially in virtualized environments, where these infrastructures must deal with huge virtual machine images. This paper proposes a smart checkpoint infrastructure for virtualized service providers. It uses Another Union File System (AUFS) to separate the read-only parts of the virtual machine image from the read-write parts. In this way, read-only parts need to be checkpointed only once, while subsequent checkpoints save only the modifications to the read-write parts, thus reducing the time needed to make a checkpoint. The checkpoints are stored in a Hadoop Distributed File System (HDFS). This allows a task to resume faster after a node crash and increases the fault tolerance of the system, since checkpoints are distributed and replicated across all the nodes of the provider. This paper presents a running implementation of this infrastructure and its evaluation, demonstrating that it is an effective way to make faster checkpoints with low interference on task execution and efficient task recovery after a node failure.
  • Keywords
    checkpointing; distributed databases; fault tolerant computing; task analysis; virtual machines; Hadoop distributed file system; another union file system; checkpoint based fault tolerant infrastructure; checkpoint infrastructure; computation power; computation time wasting; crash failure; fault tolerance; node failure; omission failure; read only part; read write part; task recovery; union file system; virtual machine image; virtualized service provider; Computer crashes; Contracts; Fault tolerance; File systems; Finishing; Image restoration; Interference; Lead time reduction; Quality of service; Virtual machining;
  • fLanguage
    English
  • Publisher
    IEEE
  • Conference_Titel
    Network Operations and Management Symposium (NOMS), 2010 IEEE
  • Conference_Location
    Osaka
  • ISSN
    1542-1201
  • Print_ISBN
    978-1-4244-5366-5
  • Electronic_ISBN
    1542-1201
  • Type

    conf

  • DOI
    10.1109/NOMS.2010.5488493
  • Filename
    5488493