  • DocumentCode
    2848433
  • Title
    Finding (recently) frequent items in distributed data streams
  • Author
    Manjhi, Amit ; Shkapenyuk, Vladislav ; Dhamdhere, Kedar ; Olston, Christopher
  • Author_Institution
    Carnegie Mellon Univ., Pittsburgh, PA, USA
  • fYear
    2005
  • fDate
    5-8 April 2005
  • Firstpage
    767
  • Lastpage
    778
  • Abstract
    We consider the problem of maintaining frequency counts for items occurring frequently in the union of multiple distributed data streams. Naive methods of combining approximate frequency counts from multiple nodes tend to result in excessively large data structures that are costly to transfer among nodes. To minimize communication requirements, the degree of precision maintained by each node while counting item frequencies must be managed carefully. We introduce the concept of a precision gradient for managing precision when nodes are arranged in a hierarchical communication structure. We then study the optimization problem of how to set the precision gradient so as to minimize communication, and provide optimal solutions that minimize worst-case communication load over all possible inputs. We then introduce a variant designed to perform well in practice, with input data that does not conform to worst-case characteristics. We verify the effectiveness of our approach empirically using real-world data, and show that our methods incur substantially less communication than naive approaches while providing the same error guarantees on answers.
  • Keywords
    data mining; distributed databases; minimisation; data structures; hierarchical communication structure; multiple distributed data streams; optimization problem; precision gradient; recently frequent items finding; worst-case communication load minimization; Association rules; Computer crime; Computer networks; Data structures; Delay; Frequency estimation; HTML; Itemsets; Large-scale systems; Monitoring
  • fLanguage
    English
  • Publisher
    IEEE
  • Conference_Title
    Proceedings of the 21st International Conference on Data Engineering (ICDE 2005)
  • ISSN
    1084-4627
  • Print_ISBN
    0-7695-2285-8
  • Type
    conf
  • DOI
    10.1109/ICDE.2005.68
  • Filename
    1410191