Minimizing Wide-Area Performance Disruptions in Inter-Domain Routing

Report ID: TR-909-11
Author: Zhu, Yaping
Date: 2011-08-00
Pages: 113
Abstract:

The Internet is the platform for most of our communications needs today. The networks underlying the Internet undergo continual change -- both planned changes (e.g., adding a new router) and unplanned failures. Unfortunately, these changes can lead to performance disruptions, which affect the user experience. Because of this, network operators have to quickly diagnose and fix any problems that arise. Diagnosing wide-area performance disruptions is challenging: first, each network has limited visibility into other networks, so network operators must collect and analyze measurements of routing and traffic data in order to infer the root cause of a disruption; second, many potential factors can lead to performance disruptions, and these factors are usually interdependent; third, there are no formalized ways to define metrics and classify performance disruptions according to their causes, so network diagnosis is usually done in an ad hoc manner.

The thesis conducts two case studies to diagnose wide-area performance disruptions from the perspectives of a large tier-1 Internet Service Provider (ISP) and a large content distribution network (CDN): i) From the ISP's perspective, we designed and implemented a system that tracks inter-domain route changes at scale and in real time. Our system can be used as a building block for many of the ISP's diagnosis tools. ii) From the CDN's perspective, we focus on diagnosing wide-area network changes that resulted in increased latency for accessing the CDN's services. We designed a method for automatically classifying large increases in latency, and evaluated our techniques on one month of measurement data to identify the major sources of high latency for the CDN.

Stepping back, the difficulties in network diagnosis can be traced back to the inter-domain routing protocol itself. Based on the lessons learned from the case studies, we refactor the Border Gateway Protocol (BGP), the main inter-domain routing protocol, in two ways. First, since the network operator has visibility into its own network and some limited visibility into the neighboring networks, we propose to select a route based only on the next-hop AS (instead of the networks further away). Second, BGP was designed as a way to exchange path availability information between independent networks, not to address the operational challenges of performance, security, and traffic engineering. This has led many to propose additional BGP attributes that satisfy these operational needs. Such proposals make the protocol and its configuration more complicated, and thus more error-prone and more difficult for network operators to diagnose. Instead, we propose simplifying the protocol, in effect allowing the operational challenges to be addressed outside the protocol. Our proposal of next-hop BGP not only simplifies the protocol, but also has the benefits of fast convergence, incentive compatibility, and easier support for multi-path routing.
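
As an illustration of the idea of selecting a route based only on the next-hop AS, the following is a minimal Python sketch and not the thesis's implementation. The `Route` class, the `neighbor_rank` table, and the tie-breaking rule are all hypothetical names and assumptions introduced here; the abstract does not specify how neighbors are ranked.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional, Tuple


@dataclass(frozen=True)
class Route:
    """Hypothetical route record; as_path is assumed non-empty."""
    prefix: str
    as_path: Tuple[int, ...]  # first element is the next-hop (neighbor) AS

    @property
    def next_hop_as(self) -> int:
        return self.as_path[0]


def select_route(candidates: List[Route],
                 neighbor_rank: Dict[int, int]) -> Optional[Route]:
    """Choose a route using only the next-hop AS.

    neighbor_rank is an assumed local preference table (lower is better).
    Nothing beyond the first hop of the AS path is inspected, so path
    length and the identity of distant ASes do not affect the choice.
    """
    # Ignore routes learned from neighbors that have no assigned rank.
    usable = [r for r in candidates if r.next_hop_as in neighbor_rank]
    if not usable:
        return None
    # Assumed tie-break: the lower neighbor AS number wins.
    return min(usable, key=lambda r: (neighbor_rank[r.next_hop_as], r.next_hop_as))


# Example with private-use AS numbers: the route via AS 65001 is chosen
# even though its AS path is longer, because only the neighbor is ranked.
routes = [
    Route("10.0.0.0/8", (65001, 65010, 65020)),
    Route("10.0.0.0/8", (65002, 65020)),
]
rank = {65001: 0, 65002: 1}
best = select_route(routes, rank)
```

In this sketch the decision depends on a single per-neighbor table, which is one way to picture why such a scheme could be simpler to configure and reason about than one that ranks entire AS paths.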