Title :
Mass Storage on the Terascale Computing System
Author :
Stone, Nathan T B ; Scott, J. Ray ; Kochmar, John ; Sommerfield, Jason ; Subramanya, Ravi ; Reddy, R. ; Vargo, Katie
Author_Institution :
Pittsburgh Supercomputing Center, Pittsburgh, PA 15213
Abstract :
On August 3rd, 2000, the National Science Foundation announced the award of $45 million to the Pittsburgh Supercomputing Center to provide "terascale" computing capability for U.S. researchers in all science and engineering disciplines. This Terascale Computing System (TCS) will be built using commodity components. The computational engines will be Compaq Alpha CPUs in a four-processor-per-node configuration. The final system will have over 682 quad-processor nodes, for a total of more than 2700 Alpha CPUs. Each node will have 4 gigabytes of memory, for a total system memory of over 2.5 terabytes. All the nodes will be interconnected by a high-speed, very low-latency Quadrics switch fabric, constructed in a full "fat-tree" topology. Given the very high component count in this system, it is important to architect the solution to be tolerant of failures. Part of this architecture is the efficient saving of a program's memory state to disk. This checkpointing of the program memory should be easy for the programmer to invoke and sufficiently fast to allow for frequent checkpoints, yet it should not severely impact the performance of the compute nodes or file servers. There will be flexibility in the recovery so that spare nodes can be automatically swapped in for failed nodes and the job restarted from the most recent checkpoint without significant user or system management intervention. It is estimated that the file servers required to collect and store these checkpoints and other temporary storage for executing jobs will collectively have approximately 27 terabytes of disk storage. The file servers will maintain a disk cache that will be migrated to a Hierarchical Storage Manager serving over 300 terabytes. This paper will discuss the hardware and software architectural design of the TCS machine. The software architecture of this system rests upon Compaq's "AlphaServer SC" software, which includes administration, control, accounting, and scheduling software.
We will describe our accounting, scheduling, and monitoring systems and their relation to the software included in AlphaServer SC.
Keywords :
Cache storage; Checkpointing; Computer architecture; Delay; Engines; Fabrics; File servers; Programming profession; Switches; Topology;
Conference_Title :
Mass Storage Systems and Technologies, 2001. MSS '01. Eighteenth IEEE Symposium on
Conference_Location :
San Diego, CA, USA
Print_ISBN :
0-7695-0849-9
DOI :
10.1109/MSS.2001.10016