• DocumentCode
    3137666
  • Title

    Proactive process-level live migration in HPC environments

  • Author

    Wang, Chao ; Mueller, Frank ; Engelmann, Christian ; Scott, Stephen L.

  • Author_Institution
    Dept. of Comput. Sci., North Carolina State Univ., Raleigh, NC, USA
  • fYear
    2008
  • fDate
    15-21 Nov. 2008
  • Firstpage
    1
  • Lastpage
    12
  • Abstract
    As the number of nodes in high-performance computing environments keeps increasing, faults are becoming commonplace. Reactive fault tolerance (FT) often does not scale due to massive I/O requirements and relies on manual job resubmission. This work complements reactive with proactive FT at the process level. Through health monitoring, a subset of node failures can be anticipated when a node's health deteriorates. A novel process-level live migration mechanism supports continued execution of applications during much of the process migration. This scheme is integrated into an MPI execution environment to transparently sustain health-inflicted node failures, which eradicates the need to restart and requeue MPI jobs. Experiments indicate that 1-6.5 seconds of prior warning are required to successfully trigger live process migration while similar operating system virtualization mechanisms require 13-24 seconds. This self-healing approach complements reactive FT by nearly cutting the number of checkpoints in half when 70% of the faults are handled proactively.
  • Keywords
    fault tolerant computing; message passing; parallel processing; system monitoring; HPC environment; I/O requirements; MPI execution environment; health monitoring; health-inflicted node failure; high-performance computing environment; manual job resubmission; node failures; proactive process-level live migration; reactive fault tolerance; Chaos; Computer science; Condition monitoring; Fault tolerance; Fault tolerant systems; Mathematics; Middleware; Operating systems; Temperature sensors; US Department of Energy;
  • fLanguage
    English
  • Publisher
    IEEE
  • Conference_Titel
    High Performance Computing, Networking, Storage and Analysis, 2008. SC 2008. International Conference for
  • Conference_Location
    Austin, TX
  • Print_ISBN
    978-1-4244-2834-2
  • Electronic_ISBN
    978-1-4244-2835-9
  • Type

    conf

  • DOI
    10.1109/SC.2008.5222634
  • Filename
    5222634