Building Wrangler: A transformational data intensive resource for the open science community

Author

Gaffney, Niall ; Jordan, Christopher ; Minyard, Tommy ; Stanzione, Dan

Author_Institution

Texas Adv. Comput. Center, Univ. of Texas at Austin, Austin, TX, USA

fYear

2014

fDate

27-30 Oct. 2014

Firstpage

20

Lastpage

22

Abstract

With the growth of data in science and engineering fields and the I/O-intensive technologies used to carry out research with these massive datasets, it has become clear that new solutions to support data research are required. In support of this, the Texas Advanced Computing Center presents Wrangler, the first open science research platform built from the ground up in support of data. Wrangler features a replicated 10 PB Lustre-based parallel file system, 120 Intel Haswell compute nodes, and 15 TB of RAM. In addition to the base system, Wrangler features a unique NAND flash-based storage system from DSSD, providing users with 0.5 PB of storage, 1 TB/s of bandwidth, and 250 million IOPS across the cluster. Supporting Hadoop, but not just Hadoop, Wrangler will provide current and future researchers with an environment supporting the most I/O-intensive workflows in fields from astronomy to paleontology. With data at the forefront of Wrangler's mission, support for ETL workflows, data curation, and data publication will assist users as they both discover new results and publish their own research. Support for both SQL and NoSQL databases and GIS-based extensions will also be provided, allowing users to leverage these tools for both data cataloging and cross-study integration. Wrangler will allow users to focus more on what is most important to them (the data and the knowledge gained from its analysis) and less on the details of curation and I/O optimization.

Keywords

NAND circuits; SQL; data handling; file organisation; flash memories; parallel processing; random-access storage; DSSD; ETL workflow; GIS based extension; Hadoop; I/O intense technology; I/O intensive workflow; I/O optimization; Intel Haswell nodes; NAND flash-based storage system; PB Lustre based parallel file system; RAM; Texas Advanced Computing Center; Wrangler; cross-study integration; data cataloging; data curation; data publication; data research; noSQL database; open science community; open science research platform; transformational data intensive resource; Bandwidth; Communities; Decision support systems; Distributed databases; File systems; Servers; Data Analysis; Data Systems; Data storage systems; Data transfer; Database machines

fLanguage

English

Publisher

IEEE

Conference_Titel

Big Data (Big Data), 2014 IEEE International Conference on

Conference_Location

Washington, DC

Type

conf

DOI

10.1109/BigData.2014.7004480

Filename

7004480