Abstract:
We live in an era when compute is cheap, data is plentiful, and system software is being given away for free. Today, the critical bottlenecks in data-driven organizations are human bottlenecks, measured in the costs of software developers, IT professionals, and data analysts. How can computer science remain relevant in this context? The Big Data ecosystem presents two archetypal settings for answering this question: NoSQL distributed databases, and analytics on Hadoop.

In the case of NoSQL, developers are being asked to build parallel programs for global-scale systems that cannot even guarantee the consistency of a single register of memory. How can this possibly be made to work? I'll talk about what we have seen in the wild in user deployments, and what we've learned from developers and their design patterns. Then I'll present theoretical results - the CALM Theorem - that shed light on what's possible here, and what requires more expensive tools for coordination on top of the typical NoSQL offerings. Finally, I will highlight some new approaches to writing and testing software - exemplified by the Bloom language - that can help developers of distributed software avoid expensive coordination when possible, and have the coordination logic synthesized for them automatically when necessary.

In the Hadoop context, the key bottlenecks lie with data analysts and data engineers, who are routinely asked to work with data that cannot possibly be loaded into tools for statistical analysis or visualization. Instead, they have to engage in time-consuming data “wrangling” - trying to figure out what's in their data, whipping it into a rectangular shape for analysis, and working out how to clean and integrate it for use. I'll discuss what we heard talking with data analysts in both academic interviews and commercial engagements. Then I'll talk about how techniques from human-computer interaction, machine learning, and database systems can be brought together to address this human bottleneck, as exemplified by our work on various systems including the Data Wrangler project and Trifacta's platform for data transformation.
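To make the CALM intuition above concrete, here is a minimal sketch in Python (not Bloom syntax; the names and data are illustrative assumptions, not drawn from the talk). A monotone computation such as set union only grows, so every message delivery order converges to the same answer and no coordination is needed; a non-monotone question such as "is this item absent?" can retract an earlier answer as data arrives, which is exactly where coordination becomes necessary.

import itertools

def monotone_union(messages):
    """Set union only grows: delivery order cannot change the final result."""
    state = set()
    for m in messages:
        state |= {m}
    return state

messages = ["a", "b", "c"]
# Every permutation of delivery order yields the same final state.
results = {frozenset(monotone_union(p)) for p in itertools.permutations(messages)}
assert len(results) == 1  # all orders converge to {a, b, c}

def non_monotone_absent(messages, item):
    """'Is item absent?' may flip from True to False as messages arrive,
    so a replica cannot safely answer early without coordinating."""
    return item not in set(messages)

print(non_monotone_absent(["a"], "b"))        # True - but only provisionally
print(non_monotone_absent(["a", "b"], "b"))   # False - the earlier answer is retracted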
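In the same spirit, here is a hypothetical sketch of the kind of "wrangling" step described above, using pandas: raw log lines are parsed, typed, and pivoted into the rectangular table that statistical and visualization tools expect. The input format, column names, and values are invented for illustration; they are not taken from Data Wrangler or Trifacta.

import pandas as pd

raw = [
    "2015-03-01|us-east|latency= 41ms",
    "2015-03-01|eu-west|latency=107ms",
    "2015-03-02|us-east|latency=  38ms",
]

rows = []
for line in raw:
    # Split each log line into its fields and coerce them to proper types.
    date, region, metric = line.split("|")
    rows.append({
        "date": pd.to_datetime(date),
        "region": region,
        "latency_ms": int(metric.split("=")[1].strip().rstrip("ms")),
    })

df = pd.DataFrame(rows)
# Pivot into one row per date and one column per region: the
# "rectangular shape" that downstream analysis tools require.
print(df.pivot(index="date", columns="region", values="latency_ms"))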
Keywords:
Big Data; data analysis; distributed databases; human computer interaction; learning (artificial intelligence); program testing; Big Data ecosystem; Bloom language; CALM theorem; Data Wrangler project; Hadoop; NoSQL distributed databases; Trifacta platform; data analysts; data engineers; data transformation; data wrangling; database systems; distributed software developers; global-scale systems; human-computer interaction; machine learning; parallel programs; software testing