• DocumentCode
    1459627
  • Title
    Coding for high availability of a distributed-parallel storage system
  • Author
    Malluhi, Qutaibah M.; Johnston, William E.
  • Author_Institution
    Dept. of Comput. Sci., Jacksonville State Univ., AL, USA
  • Volume
    9
  • Issue
    12
  • fYear
    1998
  • fDate
    12/1/1998
  • Firstpage
    1237
  • Lastpage
    1252
  • Abstract
    We have developed a distributed parallel storage system that employs the aggregate bandwidth of multiple data servers connected by a high-speed wide-area network to achieve scalability and high data throughput. This paper studies different schemes to enhance the reliability and availability of such network-based distributed storage systems. The general approach of this paper employs “erasure” error-correcting codes that can be used to reconstruct missing information caused by hardware, software, or human faults. The paper describes the approach and develops optimized algorithms for the encoding and decoding operations. Moreover, the paper presents techniques for reducing the communication and computation overhead incurred while reconstructing missing data from the redundant information. These techniques include clustering, multidimensional coding, and the full two-dimensional parity schemes. The paper considers trade-offs between redundancy, fault tolerance, and complexity of error recovery.
  • Keywords
    error correction codes; fault tolerant computing; network servers; parallel memories; redundancy; wide area networks; aggregate bandwidth; clustering; coding; communication overhead reduction; computation overhead reduction; decoding operation; distributed-parallel storage system; encoding operations; erasure error-correcting codes; error recovery complexity; fault tolerance; full 2D parity schemes; hardware faults; high availability; high data throughput; high-speed wide area network; human faults; missing information reconstruction; multidimensional coding; multiple data servers; network-based distributed storage systems; optimized algorithms; redundancy; redundant information; reliability; scalability; software faults; Aggregates; Availability; Bandwidth; Clustering algorithms; Error correction codes; Hardware; Humans; Network servers; Scalability; Throughput;
  • fLanguage
    English
  • Journal_Title
    Parallel and Distributed Systems, IEEE Transactions on
  • Publisher
    IEEE
  • ISSN
    1045-9219
  • Type
    jour
  • DOI
    10.1109/71.737699
  • Filename
    737699