Title :
Backfilling Using System-Generated Predictions Rather than User Runtime Estimates
Author :
Tsafrir, Dan ; Etsion, Yoav ; Feitelson, Dror G.
Author_Institution :
Sch. of Comput. Sci. & Eng., Hebrew Univ., Jerusalem
fDate :
6/1/2007
Abstract :
The most commonly used scheduling algorithm for parallel supercomputers is FCFS with backfilling, as originally introduced in the EASY scheduler. Backfilling means that short jobs are allowed to run ahead of their time provided they do not delay previously queued jobs (or at least the first queued job). However, system-generated runtime predictions have not been incorporated into production schedulers, partly because of a misconception (that we resolve) claiming that inaccuracy actually improves performance, but mainly because underprediction is technically unacceptable: users will not tolerate jobs being killed just because system predictions were too short. We solve this problem by divorcing kill-time from the runtime prediction and correcting predictions adaptively as needed when they are proved wrong. The end result is a surprisingly simple scheduler, which requires minimal deviations from current practices (e.g., using FCFS as the basis) and behaves exactly like EASY as far as users are concerned; nevertheless, it achieves significant improvements in performance, predictability, and accuracy. Notably, this is based on a very simple runtime predictor that just averages the runtimes of the last two jobs by the same user; counterintuitively, our results indicate that using recent data is more important than mining the history for similar jobs. All the techniques suggested in this paper can be used to enhance any backfilling algorithm and are not limited to EASY.
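The predictor and the correction step described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the class and function names, the fallback to the user estimate when a user has no history, the use of a single runtime when only one is available, and the fixed 60-second correction increment are all assumptions made for illustration. The only parts taken from the abstract are averaging the runtimes of the last two jobs by the same user, keeping the kill-time tied to the user estimate rather than to the prediction, and raising a prediction once a job outlives it.

```python
from collections import defaultdict, deque


class RecentRuntimePredictor:
    """Predict a job's runtime from the same user's most recent jobs."""

    def __init__(self, window=2):
        # Keep only the last `window` runtimes per user (two, per the abstract).
        self.history = defaultdict(lambda: deque(maxlen=window))

    def predict(self, user, user_estimate):
        recent = self.history[user]
        if not recent:
            # Assumption: with no history, fall back to the user's estimate.
            return user_estimate
        prediction = sum(recent) / len(recent)
        # The user estimate remains the kill-time, so a prediction
        # above it would be meaningless; cap the prediction there.
        return min(prediction, user_estimate)

    def record(self, user, runtime):
        """Store the actual runtime of a job that has terminated."""
        self.history[user].append(runtime)


def correct_prediction(elapsed, user_estimate, increment=60.0):
    """Raise a prediction that a still-running job has outlived.

    Assumption: a fixed increment is added to the elapsed time. The job
    is not killed here; only the scheduler's prediction changes, and it
    never exceeds the user estimate (the kill-time).
    """
    return min(elapsed + increment, user_estimate)
```

For example, after a hypothetical user's jobs finish in 100 and 300 seconds, the next job submitted with a one-hour estimate is predicted to run for 200 seconds. If it is still running at 200 seconds, the scheduler raises the prediction instead of killing the job, and the kill-time stays at one hour.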
Keywords :
parallel machines; processor scheduling; EASY scheduler; backfilling algorithm; first come first serve order; parallel job scheduling algorithm; supercomputers; system-generated prediction; user runtime estimates; Accuracy; Delay effects; Dynamic scheduling; History; Job production systems; Measurement; Processor scheduling; Runtime; Scheduling algorithm; Supercomputers; EASY; EASY++; Parallel job scheduling; SJBF; backfilling; dynamic prediction correction; history-based predictions; performance metrics; runtime estimates; system-generated predictions
Journal_Title :
Parallel and Distributed Systems, IEEE Transactions on
DOI :
10.1109/TPDS.2007.70606