• DocumentCode
    3426965
  • Title

    On the storage, management and analysis of (multi) similarity for large scale protein structure datasets in the grid

  • Author

    Folino, Gianluigi ; Shah, Azhar Ali ; Krasnogor, Natalio

  • Author_Institution
    CNR-ICAR, Univ. of Calabria, Rende, Italy
  • fYear
    2009
  • fDate
    2-5 Aug. 2009
  • Firstpage
    1
  • Lastpage
    8
  • Abstract
    The (multi) similarity among a set of protein structures is assessed through an ensemble of protein structure comparison methods/algorithms. This generates a multitude of data that varies in both type and size. After standardization and normalization, this data is used to build a consensus, providing a domain-independent and highly reliable view of the (dis)similarities. This paper briefly describes some of the techniques used to estimate the missing/invalid values that result from the multi-comparison of very large scale datasets in a distributed/grid environment. This is followed by an empirical study of the storage capacity and query processing time required to cope with the results of such comparisons. In particular, we investigate and compare the storage/query overhead of two commonly used database technologies, the Hierarchical Data Format (HDF5) and a Relational Database Management System (RDBMS) (Oracle/SQL), in the context of our application deployed on the National Grid Service (NGS), UK. As the technologies explored in this investigation are quite generic in the science and engineering domain, our findings should also benefit other scientific applications with a similar magnitude of data and functionality.
  • Keywords
    SQL; bioinformatics; grid computing; molecular biophysics; proteins; query processing; relational databases; HDF5; Hierarchical Data Format; National Grid Service; Oracle; RDBMS; Relational Database Management System; SQL; large scale protein structure datasets; multisimilarity; normalization; query processing; standardization; storage capacity; Algorithm design and analysis; Application software; Data engineering; Large-scale systems; Performance analysis; Protein engineering; Relational databases; Robustness; Technology management; Testing;
  • fLanguage
    English
  • Publisher
    IEEE
  • Conference_Titel
    Computer-Based Medical Systems, 2009. CBMS 2009. 22nd IEEE International Symposium on
  • Conference_Location
    Albuquerque, NM
  • ISSN
    1063-7125
  • Print_ISBN
    978-1-4244-4879-1
  • Electronic_ISBN
    1063-7125
  • Type

    conf

  • DOI
    10.1109/CBMS.2009.5255328
  • Filename
    5255328