• DocumentCode
    3570882
  • Title
    Mapping specifications for ranked hierarchical trees in data integration systems
  • Author
    Soomro, Sarfaraz ; Matsunaga, Andrea ; Fortes, Jose A. B.
  • Author_Institution
    Adv. Comput. & Inf. Syst. Lab., Univ. of Florida, Gainesville, FL, USA
  • fYear
    2014
  • Firstpage
    269
  • Lastpage
    276
  • Abstract
    A popular approach to deal with data integration of heterogeneous data sources is to Extract, Transform and Load (ETL) data from disparate sources into a consolidated data store while addressing integration challenges including, but not limited to: structural differences in the source and target schemas, semantic differences in their vocabularies, and data encoding. This work focuses on the integration of tree-like hierarchical data or information that, when modeled as a relational schema, can take the shape of a flat schema, a self-referential schema or a hybrid schema. Examples include evolutionary taxonomies, geological time scales, and organizational charts. Given the observed complexity in developing ETL processes for this particular but common type of data, our work focuses on reducing the time and effort required to map and transform this data. Our research automates and simplifies the transformation from ranked self-referential to flat representations (and vice versa) by: (a) proposing MSL+, an extension to IBM's Mapping Specification Language (MSL), to succinctly express the mapping between schemas while hiding the actual transformation implementation complexity from the user, and (b) implementing a transformation component for the Talend open-source ETL platform, called Tree Transformer (TT). We evaluated MSL+ and TT in the context of biodiversity data integration, where this class of transformations is a recurring pattern. We demonstrate the effectiveness of MSL+ with respect to development time savings as well as a 2- to 25-fold performance improvement in transformation time achieved by TT when compared to existing implementations and to Talend built-in components.
  • Keywords
    data integration; formal specification; ETL data; IBM mapping specification language; MSL; biodiversity data integration; data integration systems; evolutionary taxonomies; extract transform and load; flat schema; geological time scales; heterogeneous data sources; mapping specifications; observed complexity; organizational charts; ranked hierarchical trees; relational schema; structural differences; transformation component; transformation implementation complexity; tree like hierarchical data; tree transformer; Cities and towns; Complexity theory; Continents; Data integration; Data models; Transforms; XML; ETL; data integration; data transformation; mapping language; self-referential schema;
  • fLanguage
    English
  • Publisher
    IEEE
  • Conference_Titel
    Information Reuse and Integration (IRI), 2014 IEEE 15th International Conference on
  • Type
    conf
  • DOI
    10.1109/IRI.2014.7051899
  • Filename
    7051899