Title :
hatS: A Heterogeneity-Aware Tiered Storage for Hadoop
Author :
Krish, K.R. ; Anwar, Ayesha ; Butt, Ali R.
Author_Institution :
Dept. of Comput. Sci., Virginia Tech, Blacksburg, VA, USA
Abstract :
Hadoop has become the de facto large-scale data processing framework for modern analytics applications. A major obstacle to sustaining high performance and scalability in Hadoop is managing data growth while meeting ever-higher I/O demand. To this end, a promising trend in storage systems is to utilize hybrid and heterogeneous devices, such as Solid State Disks (SSDs), RAM disks, and Network Attached Storage (NAS), which can help achieve very high I/O rates at acceptable cost. However, the Hadoop Distributed File System (HDFS) is unable to exploit such heterogeneous storage. This is because HDFS assumes that the underlying devices are homogeneous storage blocks and disregards their individual I/O characteristics, which leads to performance degradation. In this paper, we present hatS, a Heterogeneity-Aware Tiered Storage, which is a novel redesign of HDFS into a multi-tiered storage system that seamlessly integrates heterogeneous storage technologies into the Hadoop ecosystem. hatS also proposes data placement and retrieval policies, which improve the utilization of the storage devices based on characteristics such as I/O throughput and capacity. We evaluate hatS using an actual implementation on a medium-sized cluster consisting of HDDs and two types of SSDs (i.e., SATA SSD and PCIe SSD). Experiments show that hatS achieves 32.6% higher read bandwidth, on average, than HDFS for the test Hadoop jobs (such as Grep and TestDFSIO) by directing 64% of the I/O accesses to the SSD tiers. We also evaluate our approach with trace-driven simulations using synthetic Facebook workloads, and show that, compared to the standard setup, hatS improves the average I/O rate by 36%, which results in a 26% improvement in job completion time.
Keywords :
data handling; distributed processing; storage management; Facebook workloads; HDD; HDFS; Hadoop distributed file system; Hadoop ecosystem; NAS; PCIe SSD; RAM disks; SATA SSD; SSD; data placement policy; data processing framework; data retrieval policy; hard-disc drives; hatS framework; heterogeneity-aware tiered storage; heterogeneous storage; input-output throughput; job completion time; medium-sized cluster; network attached storage; random access memory; solid state disks; solid-state disks; trace-driven simulations; Bandwidth; Distributed databases; File systems; Market research; Performance evaluation; Standards; Throughput; Hadoop Distributed File System (HDFS); Tiered storage; data placement and retrieval policy;
Conference_Title :
Cluster, Cloud and Grid Computing (CCGrid), 2014 14th IEEE/ACM International Symposium on
Conference_Location :
Chicago, IL
DOI :
10.1109/CCGrid.2014.51