  • DocumentCode
    108109
  • Title
    Scalable Relative Debugging
  • Author
    Minh Ngoc Dinh; David Abramson; Chao Jin
  • Author_Institution
    Fac. of Inf. Technol., Monash Univ., Mulgrave, VIC, Australia
  • Volume
    25
  • Issue
    3
  • fYear
    2014
  • fDate
    March 2014
  • Firstpage
    740
  • Lastpage
    749
  • Abstract
    Detecting and isolating bugs that arise only at high processor counts is a challenging task. Over a number of years, we have implemented a special debugging method, called "relative debugging," that supports debugging applications as they evolve or are ported to larger machines. It allows a user to compare the state of a suspect program against another reference version even as the number of processors is increased. The innovative idea is the comparison of runtime data to reason about the state of the suspect program. While powerful, a naïve implementation of the comparison phase does not scale to large problems running on large machines. In this paper, we propose two different solutions including a hash-based scheme and a direct point-to-point scheme. We demonstrate the implementation, a case study, as well as the performance, of our techniques on 20K cores of a Cray XE6 system.
  • Keywords
    parallel processing; program debugging; Cray XE6 system; direct point-to-point scheme; hash-based scheme; parallel applications; scalable relative debugging; special debugging method; suspect program; Arrays; Computer bugs; Debugging; Magnetic heads; Runtime; Servers; Parallelism and concurrency; assertion checkers; distributed debugging
  • fLanguage
    English
  • Journal_Title
    Parallel and Distributed Systems, IEEE Transactions on
  • Publisher
    IEEE
  • ISSN
    1045-9219
  • Type
    jour
  • DOI
    10.1109/TPDS.2013.86
  • Filename
    6487495