• DocumentCode
    3145339
  • Title
    Scientific workflow design 2.0: Demonstrating streaming data collections in Kepler
  • Author
    Dou, Lei ; Zinn, Daniel ; McPhillips, Timothy ; Köhler, Sven ; Riddle, Sean ; Bowers, S. ; Ludäscher, Bertram
  • Author_Institution
    UC Davis Genome Center, Univ. of California, Davis, CA, USA
  • fYear
    2011
  • fDate
    11-16 April 2011
  • Firstpage
    1296
  • Lastpage
    1299
  • Abstract
    Scientific workflow systems are used to integrate existing software components (actors) into larger analysis pipelines to perform in silico experiments. Current approaches for handling data in nested-collection structures, as required in many scientific domains, lead to many record-management actors (shims) that make the workflow structure overly complex, and as a consequence hard to construct, evolve and maintain. By constructing and executing workflows from bioinformatics and geosciences in the Kepler system, we will demonstrate how COMAD (Collection-Oriented Modeling and Design), an extension of conventional workflow design, addresses these shortcomings. In particular, COMAD provides a hierarchical data stream model (as in XML) and a novel declarative configuration language for actors that functions as a middleware layer between the workflow's data model (streaming nested collections) and the actor's data model (base data and lists thereof). Our approach allows actor developers to focus on the internal actor processing logic oblivious to the workflow structure. Actors can then be re-used in various workflows simply by adapting actor configurations. Due to streaming nested collections and declarative configurations, COMAD workflows can usually be realized as linear data processing pipelines, which often reflect the scientific data analysis intention better than conventional designs. This linear structure not only simplifies actor insertions and deletions (workflow evolution), but also decreases the overall complexity of the workflow, reducing future effort in maintenance.
  • Keywords
    data analysis; middleware; records management; COMAD; Collection-Oriented Modeling and Design; XML; data collection; hierarchical data stream model; internal actor processing; record management; scientific workflow design; software component; Assembly; Bioinformatics; Data models; Humidity; Phylogeny
  • fLanguage
    English
  • Publisher
    IEEE
  • Conference_Titel
    Data Engineering (ICDE), 2011 IEEE 27th International Conference on
  • Conference_Location
    Hannover
  • ISSN
    1063-6382
  • Print_ISBN
    978-1-4244-8959-6
  • Electronic_ISBN
    1063-6382
  • Type
    conf
  • DOI
    10.1109/ICDE.2011.5767938
  • Filename
    5767938