• DocumentCode
    168668
  • Title

    Achieving Efficient Distributed Scheduling with Message Queues in the Cloud for Many-Task Computing and High-Performance Computing

  • Author

    Sadooghi, Iman ; Palur, Sandeep ; Anthony, Ajay ; Kapur, Isha ; Belagodu, Karthik ; Purandare, Pankaj ; Ramamurty, Kiran ; Wang, Ke ; Raicu, Ioan

  • Author_Institution
    Dept. of Comput. Sci., Illinois Inst. of Technol., Chicago, IL, USA
  • fYear
    2014
  • fDate
    26-29 May 2014
  • Firstpage
    404
  • Lastpage
    413
  • Abstract
    Task scheduling and execution over large-scale distributed systems play an important role in achieving good performance and high system utilization. Due to the explosion of parallelism found in today's hardware, applications need to perform over-decomposition to deliver good performance. This over-decomposition is driving job management systems to support applications with a growing number of finer-grained tasks. Our goal in this work is to provide a compact, lightweight, scalable, and distributed task execution framework (CloudKon) that builds upon cloud computing building blocks (Amazon EC2, SQS, and DynamoDB). Most of today's state-of-the-art job execution systems have predominantly Master/Slaves architectures, which have inherent limitations, such as scalability issues at extreme scales and single points of failure. Distributed job management systems, on the other hand, are complex and employ non-trivial load balancing algorithms to maintain good utilization. CloudKon is a distributed job management system that can support both HPC and MTC workloads with millions of tasks/jobs. We compare our work with other state-of-the-art job management systems, including Sparrow and MATRIX. The results show that CloudKon delivers better scalability than other state-of-the-art systems for some metrics, all with a significantly smaller code base (5%).
  • Keywords
    cloud computing; parallel processing; resource allocation; scheduling; Amazon EC2; CloudKon; DynamoDB; HPC workloads; MTC workloads; SQS; cloud computing building blocks; distributed job management systems; distributed scheduling; distributed systems; distributed task execution framework; high system utilization; high-performance computing; many-task computing; master/slaves architectures; message queues; nontrivial load balancing algorithms; task scheduling; Cloud computing; Computer architecture; Load management; Message systems; Processor scheduling; Scalability; Throughput; Many-Task Computing; distributed HPC scheduling
  • fLanguage
    English
  • Publisher
    IEEE
  • Conference_Titel
    Cluster, Cloud and Grid Computing (CCGrid), 2014 14th IEEE/ACM International Symposium on
  • Conference_Location
    Chicago, IL
  • Type

    conf

  • DOI
    10.1109/CCGrid.2014.30
  • Filename
    6846476