  • DocumentCode
    3165169
  • Title
    Efficient Data Sampling in Heterogeneous Peer-to-Peer Networks
  • Author
    Arai, Benjamin ; Lin, Song ; Gunopulos, Dimitrios
  • Author_Institution
    Univ. of California, Riverside
  • fYear
    2007
  • fDate
    28-31 Oct. 2007
  • Firstpage
    23
  • Lastpage
    32
  • Abstract
    Performing data-mining tasks such as clustering, classification, and prediction on large datasets is arduous and often infeasible given current hardware limitations. The distributed nature of peer-to-peer databases further complicates this issue by introducing an access overhead cost in addition to the cost of sending individual tuples over the network. We propose a two-level sampling approach for peer-to-peer databases that maximizes sample quality given a user-defined communication budget. Because individual peers may have varying cardinality, we propose an algorithm for determining the optimal sample rate (the percentage of tuples to sample from a peer) for each peer. We do this by analyzing the variance of individual peers, ultimately minimizing the total variance of the entire sample. By locally optimizing individual peer sample rates, we maximize the approximation accuracy of the samples. We also offer several techniques for sampling in peer-to-peer databases given various amounts of known and unknown information about the network and its peers.
  • Keywords
    approximation theory; data mining; database management systems; peer-to-peer computing; data sampling; data-mining tasks; heterogeneous peer-to-peer networks; local optimization; overhead cost; peer-to-peer databases; sample quality; user-defined communication budget; Aggregates; Computer science; Costs; Data engineering; Data mining; Distributed databases; Histograms; Network topology; Peer to peer computing; Sampling methods
  • fLanguage
    English
  • Publisher
    IEEE
  • Conference_Titel
    Data Mining, 2007. ICDM 2007. Seventh IEEE International Conference on
  • Conference_Location
    Omaha, NE
  • ISSN
    1550-4786
  • Print_ISBN
    978-0-7695-3018-5
  • Type
    conf
  • DOI
    10.1109/ICDM.2007.71
  • Filename
    4470226
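
The abstract describes choosing a per-peer sample rate that minimizes total sample variance under a communication budget that includes a per-peer access overhead. The record does not give the paper's algorithm, so the sketch below is only an illustration using the classical cost-aware Neyman allocation from stratified sampling, which may differ from the authors' method. The function name, its parameters, and the example numbers are all hypothetical, and the per-peer standard deviations are assumed to be known in advance.

```python
import math


def allocate_sample_sizes(sizes, std_devs, tuple_costs, access_costs, budget):
    """Illustrative cost-aware Neyman allocation (NOT the paper's algorithm).

    sizes        -- number of tuples held by each peer (hypothetical input)
    std_devs     -- assumed-known standard deviation of the attribute per peer
    tuple_costs  -- cost of shipping one tuple from each peer (must be > 0)
    access_costs -- fixed overhead paid once for contacting each peer
    budget       -- total communication budget
    Returns the number of tuples to sample from each peer.
    """
    # Every peer is contacted, so the fixed overhead comes off the top.
    remaining = budget - sum(access_costs)
    if remaining <= 0:
        raise ValueError("budget does not cover the per-peer access overhead")

    # Classical result: n_h is proportional to N_h * S_h / sqrt(c_h).
    weights = [n * s / math.sqrt(c)
               for n, s, c in zip(sizes, std_devs, tuple_costs)]
    # Normalizer chosen so that sum(c_h * n_h) equals the remaining budget.
    denom = sum(n * s * math.sqrt(c)
                for n, s, c in zip(sizes, std_devs, tuple_costs))
    if denom == 0:
        return [0.0 for _ in sizes]

    raw = [remaining * w / denom for w in weights]
    # A peer cannot supply more tuples than it holds; capping leaves
    # some budget unspent rather than redistributing it.
    return [min(r, float(n)) for r, n in zip(raw, sizes)]


# Hypothetical two-peer example.
sizes = [1000, 4000]
allocation = allocate_sample_sizes(
    sizes=sizes,
    std_devs=[2.0, 1.0],
    tuple_costs=[1.0, 1.0],
    access_costs=[10.0, 10.0],
    budget=320.0,
)
# Sample rate = fraction of a peer's tuples that is sampled.
sample_rates = [a / n for a, n in zip(allocation, sizes)]
```

In this example the smaller but more variable peer ends up with the higher sample rate, which matches the intuition in the abstract that per-peer variance, not only cardinality, should drive the rate.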