• DocumentCode
    262116
  • Title
    Optimization Techniques within the Hadoop Eco-system: A Survey
  • Author
    Rumi, Giulia ; Colella, Claudia ; Ardagna, Danilo
  • Author_Institution
    Dipt. di Elettron., Inf. e Bioingegneria, Politec. di Milano, Milan, Italy
  • fYear
    2014
  • fDate
    22-25 Sept. 2014
  • Firstpage
    437
  • Lastpage
    444
  • Abstract
    Nowadays, we live in a digital world producing data at an impressive speed: data are large, change quickly, and are often too complex to be processed by existing tools. The problem is to extract knowledge from all these data in an efficient way. MapReduce is a data-parallel programming model for clusters of commodity machines that was created to address this problem. In this paper we provide an overview of the Hadoop ecosystem. We introduce the most significant approaches supporting automatic, on-line resource provisioning. Moreover, we analyse optimization approaches proposed in frameworks built on top of MapReduce, such as Pig and Hive, which point out the importance of scheduling techniques in MapReduce when multiple workflows are executed concurrently. Therefore, the default Hadoop schedulers are discussed along with some enhancements proposed by the research community. The analysis is performed to highlight how research contributions try to address common Hadoop points of weakness. As our comparison shows, none of the frameworks surpasses the others, and a fair evaluation is difficult to perform. The choice of framework must be related to the specific application goal, but there is no single solution that addresses all the issues typical of MapReduce.
  • Keywords
    data handling; knowledge acquisition; optimisation; parallel programming; pattern clustering; resource allocation; scheduling; Hadoop ecosystem; MapReduce; commodity machines; data parallel programming model; knowledge extraction; on-line resource provisioning; optimization techniques; research community; scheduling techniques; Optimization; Programming; Resource management; Scalability; Scheduling; Time factors; Yarn; Clouds; Design; Performance analysis; Resource management; Scheduling algorithms;
  • fLanguage
    English
  • Publisher
    IEEE
  • Conference_Titel
    Symbolic and Numeric Algorithms for Scientific Computing (SYNASC), 2014 16th International Symposium on
  • Conference_Location
    Timisoara
  • Print_ISBN
    978-1-4799-8447-3
  • Type
    conf
  • DOI
    10.1109/SYNASC.2014.65
  • Filename
    7034715