Abstract:
Shared temporary storage space is often the constraining resource for clusters that serve as execution nodes in wide-area distributed systems. At least one large national-scale computing grid has reported a failure rate as high as thirty percent of submitted jobs, often due to accidentally filled shared storage spaces. Previous systems have attacked this problem by adding space allocation to the distributed system interface. However, these allocations are not enforced at the filesystem level, and thus unexpected or unaccounted uses of storage may cause the system to fail. By adding an inexpensive allocation mechanism to the operating system, we may improve the robustness of such systems at minimal cost. In this paper, we describe an abstract model of space allocation in the file system and explore three implementations of the model: a user-level library, a recursive loopback filesystem, and a modified kernel filesystem. We evaluate the performance and completeness of these implementations and demonstrate that kernel support is essential to keeping the overhead low. Finally, we demonstrate empirically that a cluster under heavy filesystem load can be made more robust by adding allocations to the filesystem.
Keywords:
file organisation; grid computing; operating system kernels; storage allocation; distributed system interface; file system; filesystem level; grid storage systems; kernel filesystem; kernel support; national-scale computing grid; operating system support; recursive loopback filesystem; shared storage spaces; shared temporary storage space; space allocation; user-level library; wide-area distributed systems; Computer crashes; Costs; File servers; File systems; Grid computing; Kernel; Libraries; Operating systems; Resource management; Robustness