• DocumentCode
    1801558
  • Title
    Efficiently Computing Inclusion Dependencies for Schema Discovery
  • Author
    Bauckmann, Jana ; Leser, Ulf ; Naumann, Felix
  • Author_Institution
    Humboldt-Universität zu Berlin, Germany
  • fYear
    2006
  • fDate
    2006
  • Firstpage
    2
  • Lastpage
    2
  • Abstract
    Large data integration projects must often cope with undocumented data sources. Schema discovery aims at automatically finding structures in such cases. An important class of relationships between attributes that can be detected automatically are inclusion dependencies (INDs), which provide an excellent basis for guessing foreign key constraints. INDs can be discovered by comparing the sets of distinct values of pairs of attributes. In this paper we present efficient algorithms for finding unary INDs. We first show that (and why) SQL is not suitable for this task. We then develop two algorithms that compute inclusion dependencies outside of the database. Both are much faster than the SQL-based methods; in fact, for larger schemas they are the only feasible solution. Our experiments show that we can compute all unary INDs in a schema of 1,680 attributes with a total database size of 3.2 GB in approximately 2.5 hours.
  • Keywords
    Computer science; Conferences; Data analysis; Data engineering; Relational databases; Writing;
  • fLanguage
    English
  • Publisher
    IEEE
  • Conference_Titel
    Data Engineering Workshops, 2006. Proceedings. 22nd International Conference on
  • Conference_Location
    Atlanta, GA, USA
  • Print_ISBN
    0-7695-2571-7
  • Type
    conf
  • DOI
    10.1109/ICDEW.2006.54
  • Filename
    1623797