  • DocumentCode
    652853
  • Title
    Diagnosing Data Center Behavior Flow by Flow
  • Author
    Arefin, Ahsan ; Singh, V.K. ; Jiang, Guofei ; Zhang, Yueping ; Lumezanu, Cristian
  • Author_Institution
    Univ. of Illinois at Urbana-Champaign, Champaign, IL, USA
  • fYear
    2013
  • fDate
    8-11 July 2013
  • Firstpage
    11
  • Lastpage
    20
  • Abstract
    Multi-tenant data centers are complex environments, running thousands of applications that compete for the same infrastructure resources and whose behavior is guided by (sometimes) divergent configurations. Small workload changes or simple operator tasks may yield unpredictable results and lead to expensive failures and performance degradation. In this paper, we propose a holistic approach for detecting operational problems in data centers. Our framework, FlowDiff, collects information from all entities involved in the operation of a data center -- applications, operators, and infrastructure -- and continually builds behavioral models for the operation. By comparing current models with pre-computed, known-to-be-stable models, FlowDiff is able to detect many operational problems, ranging from host and network failures to unauthorized access. FlowDiff also identifies common system operations (e.g., VM migration, software upgrades) to validate the behavior changes against planned operator tasks. We show that using passive measurements on control traffic from programmable switches to a centralized controller is sufficient to build strong behavior models; FlowDiff does not require active measurements or expensive server instrumentation. Our experimental results using an NEC data center testbed, Amazon EC2, and simulations demonstrate that FlowDiff is effective and robust in detecting anomalous behavior. FlowDiff scales well with the number of applications running in the data center and their traffic volume.
  • Keywords
    computer centres; performance evaluation; Amazon EC2; FlowDiff; NEC data center testbed; anomalous behavior; data center behavior diagnosis; expensive failures; host failures; multitenant data centers; network failures; operational problems; performance degradation; planned operator tasks; unauthorized access; Automata; Control systems; Data models; Delays; Learning automata; Servers; Time factors; Application Signature; Data Center; Diagnosis; EC2; Infrastructure Signature; OpenFlow; Passive Monitoring; Task Signature
  • fLanguage
    English
  • Publisher
    IEEE
  • Conference_Titel
    Distributed Computing Systems (ICDCS), 2013 IEEE 33rd International Conference on
  • Conference_Location
    Philadelphia, PA
  • ISSN
    1063-6927
  • Type
    conf
  • DOI
    10.1109/ICDCS.2013.18
  • Filename
    6681571