Understanding and Improving Modern Web Traffic Caching

Report ID: TR-908-11
Author: Ihm, Sunghwan
Date: 2011-08-00
Pages: 127
Abstract:

The World Wide Web is one of the most popular and important Internet applications, and our daily lives rely heavily on it. Despite its importance, Web access today is still limited for two reasons: (1) the Web has changed and grown significantly as social networking, video streaming, and file hosting sites have become popular, requiring more and more bandwidth, and (2) the need for Web access has also grown, and many users in bandwidth-limited environments, such as people in the developing world or mobile device users, still suffer from poor Web access.

A decade ago, there was a burst of research aimed at understanding the nature of Web traffic and thereby improving Web access, but unfortunately that work dropped off just as the Web was changing significantly. As a result, we have little understanding of the underlying nature of today’s Web traffic, and thus miss traffic optimization opportunities for improving Web access. To help improve Web access, this dissertation attempts to fill the gap between previous research and today’s Web.

For a better understanding of today’s Web traffic, we first analyze five years (2006-2010) of real Web traffic from a globally distributed proxy system, which captures the browsing behavior of over 70,000 users from 187 countries. Using this data set, we examine major changes in Web traffic characteristics that occurred during this period. We also develop a new Web page analysis technique that is better suited to modern Web page interactions. Using this technique, we analyze various aspects of page-level changes and present a simple Web traffic model based on our findings. Finally, we investigate the redundancy of this traffic, using both traditional object-level caching and content-based approaches that apply caching at the sub-object or packet level. Among many findings, we observe a huge potential benefit from content-based caching: its byte hit rate is almost twice that of the traditional object-level caching approach.
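To make the comparison concrete, the following is a minimal sketch of how a byte hit rate could be computed for the two approaches over a request trace. It is not the dissertation's methodology: the function names, the unbounded cache, the trace format of (URL, body) pairs, and the fixed-size chunks are all simplifying assumptions made for illustration. Real content-based systems typically choose chunk boundaries from the content itself.

```python
import hashlib


def object_byte_hit_rate(requests):
    """Byte hit rate of an unbounded object-level cache keyed by URL.

    `requests` is an iterable of (url, body_bytes) pairs (hypothetical format).
    A request is a hit only if the same URL was cached with an identical body.
    """
    cache = {}
    hit_bytes = total_bytes = 0
    for url, body in requests:
        total_bytes += len(body)
        if cache.get(url) == body:
            hit_bytes += len(body)
        cache[url] = body
    return hit_bytes / total_bytes if total_bytes else 0.0


def chunk_byte_hit_rate(requests, chunk_size=4):
    """Byte hit rate of an unbounded content-based cache keyed by chunk hash.

    Assumption: fixed-size chunks stand in for content-defined chunks.
    """
    seen = set()
    hit_bytes = total_bytes = 0
    for _url, body in requests:
        total_bytes += len(body)
        for start in range(0, len(body), chunk_size):
            chunk = body[start:start + chunk_size]
            digest = hashlib.sha1(chunk).digest()
            if digest in seen:
                hit_bytes += len(chunk)  # bytes served from the chunk cache
            else:
                seen.add(digest)
    return hit_bytes / total_bytes if total_bytes else 0.0
```

In this sketch, two different URLs whose bodies share most of their bytes produce no object-level hits, while the chunk-level cache still counts the shared bytes as hits. That is the kind of redundancy an object-level cache cannot see.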

Motivated by the potential benefits of content-based caching, we also develop Wanax, a scalable and flexible wide-area network (WAN) accelerator designed for low-bandwidth, resource-limited environments in the developing world. It uses a novel multi-resolution chunking (MRC) scheme that provides high compression rates and high disk performance for a variety of content, while using much less memory than existing approaches. Wanax exploits the design of MRC to perform intelligent load shedding, maximizing throughput even when running on resource-limited shared platforms. Finally, Wanax exploits mesh network environments, instead of just the star topologies common in enterprise branch offices. Equally important, Wanax's design can be applied to enterprise environments, providing the same benefits.
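The abstract does not spell out how MRC works, so the sketch below shows only one plausible reading of "multi-resolution": chunking the same data at several granularities whose boundaries are nested. Everything in it is an assumption for illustration, not Wanax's implementation: the function names, the window size, the bit levels, and the use of SHA-1 over a small window as a stand-in for a rolling fingerprint.

```python
import hashlib


def cut_points(data, bits, window=8):
    """Content-defined cut positions: cut where the window hash has `bits` low zero bits.

    A larger `bits` value gives fewer cuts and therefore larger average chunks.
    """
    mask = (1 << bits) - 1
    cuts = []
    for i in range(window, len(data)):
        # Hash of the `window` bytes ending just before position i (illustrative choice).
        h = int.from_bytes(hashlib.sha1(data[i - window:i]).digest()[:4], "big")
        if h & mask == 0:
            cuts.append(i)
    cuts.append(len(data))  # the final chunk always ends at the end of the data
    return cuts


def split(data, cuts):
    """Split `data` at the given ascending cut positions."""
    chunks, start = [], 0
    for end in cuts:
        chunks.append(data[start:end])
        start = end
    return chunks


def multi_resolution_chunks(data, bit_levels=(2, 4, 6)):
    """Chunk the same data at several resolutions (hypothetical helper)."""
    return {bits: split(data, cut_points(data, bits)) for bits in bit_levels}
```

Because a hash with six low zero bits also has four and two low zero bits, every boundary at a coarse level is also a boundary at each finer level. Under this assumption, each large chunk is an exact concatenation of smaller ones, which lets a system fall back from large to small chunks without re-chunking the data.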