DocumentCode
904640
Title
A new approach to fault-tolerant wormhole routing for mesh-connected parallel computers
Author
Ho, Ching-Tien ; Stockmeyer, Larry
Author_Institution
IBM Almaden Res. Center, San Jose, CA, USA
Volume
53
Issue
4
fYear
2004
fDate
4/1/2004
Firstpage
427
Lastpage
438
Abstract
A new method for fault-tolerant wormhole routing in arbitrary-dimensional meshes is introduced. The method was motivated by certain routing requirements of an initial design of the Blue Gene supercomputer at IBM Research. The machine is organized as a three-dimensional mesh containing many thousands of nodes, and the routing method should tolerate a few percent of the nodes being faulty. There has been much work on routing methods for meshes that route messages around faults or regions of faults. The new method is to declare certain nonfaulty nodes to be "lambs." A lamb is used for routing but not processing, so a lamb is neither the source nor the destination of a message. The lambs are chosen so that every "survivor node," a node that is neither faulty nor a lamb, can reach every survivor node by at most two rounds of dimension-ordered (such as e-cube) routing. An algorithm for finding a set of lambs is presented. The results of simulations on 2D and 3D meshes of various sizes with various numbers of random node faults are given. For example, on a 32 × 32 × 32 3D mesh with 3 percent random faults and using at most two rounds of e-cube routing for each message, the average number of lambs is less than 68, which is less than 7 percent of the number 983 of faults and less than 0.21 percent of the number 32,768 of nodes.
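The reachability condition stated in the abstract can be illustrated with a small sketch. This is not the paper's lamb-finding algorithm; it only checks, by brute force, whether a given lamb set satisfies the "at most two rounds of dimension-ordered routing" property on a small mesh. The function names (ecube_path, one_round_ok, two_rounds_ok, survivors_connected) are hypothetical, and the sketch assumes that a round fails if its path touches any faulty node and that the intermediate node between the two rounds may be any nonfaulty node, lamb or survivor.

```python
from itertools import product

def ecube_path(src, dst):
    """Nodes visited (inclusive) when correcting coordinates in dimension order."""
    path = [tuple(src)]
    cur = list(src)
    for dim in range(len(src)):
        step = 1 if dst[dim] > cur[dim] else -1
        while cur[dim] != dst[dim]:
            cur[dim] += step
            path.append(tuple(cur))
    return path

def one_round_ok(src, dst, faulty):
    """True if the single dimension-ordered path avoids every faulty node."""
    return all(node not in faulty for node in ecube_path(src, dst))

def two_rounds_ok(src, dst, shape, faulty):
    """True if dst is reachable from src in at most two rounds.

    Assumption: the intermediate node may be any nonfaulty node.
    """
    if one_round_ok(src, dst, faulty):
        return True
    for mid in product(*(range(n) for n in shape)):
        if (mid not in faulty
                and one_round_ok(src, mid, faulty)
                and one_round_ok(mid, dst, faulty)):
            return True
    return False

def survivors_connected(shape, faulty, lambs):
    """Check the abstract's condition for every ordered pair of survivors."""
    nodes = product(*(range(n) for n in shape))
    survivors = [n for n in nodes if n not in faulty and n not in lambs]
    return all(two_rounds_ok(s, d, shape, faulty)
               for s in survivors for d in survivors)

# Toy 3 x 3 mesh: faults at (0,1) and (1,0) cut off the corner (0,0).
faults = {(0, 1), (1, 0)}
print(survivors_connected((3, 3), faults, set()))     # corner is still a survivor
print(survivors_connected((3, 3), faults, {(0, 0)}))  # corner declared a lamb
```

In this toy case the corner node is nonfaulty but cannot send to or receive from any other node, so the condition fails until that node is declared a lamb. The brute-force check is only practical for small meshes.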
Keywords
fault tolerant computing; mesh generation; multiprocessor interconnection networks; network routing; parallel machines; performance evaluation; Blue Gene supercomputer; IBM; e-cube routing; fault-tolerant wormhole routing; mesh-connected parallel computers; message routing; parallel computing; Concurrent computing; Fault tolerance; Hardware; Magnetic heads; Mesh networks; Parallel processing; Routing; Software performance; Supercomputers; System recovery
fLanguage
English
Journal_Title
Computers, IEEE Transactions on
Publisher
IEEE
ISSN
0018-9340
Type
jour
DOI
10.1109/TC.2004.1268400
Filename
1268400