RINC: Real-Time Inference-based Network Diagnosis in the Cloud

Report ID: TR-975-14
Author: Rexford, Jennifer / Ghasemi, Mojgan / Benson, Theophilus
Date: 2015-01-05
Pages: 15
Abstract:

Cloud tenants experience performance problems due to issues within their virtual machines (VMs) or within the cloud infrastructure. To offer good and predictable performance, cloud providers must be able to detect and diagnose performance problems in real time. However, existing cloud diagnosis techniques are either unable to detect problems in the tenant’s VMs or are too costly. We argue that rather than collecting all statistics, cloud diagnosis should proceed in phases, with each phase selectively collecting heavier-weight measurements. To this end, we introduce a set of novel techniques for inferring the internal state of a VM’s midstream network connections, which allows us to accurately collect measurements at any point during a connection. Our framework, RINC, runs within the hypervisor, using these techniques to selectively monitor a tenant’s connections. RINC provides a simple query interface to its cloud-wide platform that allows cloud operators to easily write diagnosis applications. We evaluate RINC on a testbed and with a simulator using a combination of real data center traces and synthetic workloads. Our evaluations validate RINC’s accuracy and show that, by being selective, RINC is able to scale to a cloud with 100K physical servers or 1 million VMs. Moreover, we demonstrate RINC’s flexibility and expressibility by implementing five diagnosis applications.