Towards Highly Reliable and Scalable Distributed Systems (thesis)

Report ID: TR-772-07
Author: Park, KyoungSoo
Date: 2007-01-00
Pages: 148
Download Formats: |PDF|
Abstract:

Over the past decades, we have observed the development of a number of Internet-scale distributed systems such as the Domain Name System (DNS) and Content Distribution Networks (CDNs), which support the Internet access of millions of users. Though they have become indispensable infrastructure in the Internet, their operating dynamics have not been well studied so far. This dissertation focuses on the reliability and scalability of such large-scale distributed systems and explores the design principles for highly available and scalable systems.

One key principle for highly reliable services is distributed management of available resources based on autonomous monitoring. We implement CoDeeN, a latency-sensitive public CDN service with purely decentralized control, and demonstrate that its service reliability is greatly improved using careful resource monitoring, even in a highly unreliable environment. CoDeeN has been operating over three years, handling 5.8+ billion successful HTTP requests and serving over 14 million users, and has been one of the most stable long-running services on PlanetLab.

The second principle is that intelligent composition of temporarily unreliable resources can provide better reliability than any of the underlying resources. Using this principle, we build CoDNS, which has failure rates that range from one-tenth to one-hundredth of that of the existing DNS services. By aggregating the unreliable services, CoDNS improves availability by an extra '9', from 99\% to over 99.9\%, and in some cases achieves over 99.99\% or less than 8 seconds of downtime per day. The utility of this service has also been proved in practice by providing more predictable and reliable name lookup service to the CoDeeN CDN service.

Finally, we find that the scalability of a large-scale distributed system can be immensely improved by independent and asynchronous node peering strategy and effective request distribution. CoBlitz, a scalable large-file transfer CDN atop CoDeeN and CoDNS, achieves downloading performance 27-48\% higher than BitTorrent, while reducing the origin load by a factor of 7 more than the previously best known research system.