Measuring the Web Using a Versatile Meta Information Crawler

Report ID: TR-643-02
Author: Liu, Ting / Peterson, Larry / LaPaugh, Andrea S.
Date: 2002-02-00
Pages: 12
Download Formats: PDF, PostScript
Abstract:

In this paper, we present data that characterize three aspects of Web interactions: failures, timing performance, and protocol compliance. We collected the data using our Versatile Meta Information Crawler, which is designed to acquire a wide sample of the Web, accurately record its behavior and performance, and build a large repository of Web page meta information. We have crawled 300,000 Web pages under 130,000 domain names and 90,000 IP addresses dispersed throughout the Web. The major findings are as follows. For failures, the likelihood of encountering a Web failure is 12%. DNS failures account for 50% of all communication failures, and "URL Not Found" errors account for 90% of all transaction failures. For timing performance, no single communication phase dominates the entire Web transaction; we examine each phase in more detail to identify its empirical parameters. For protocol compliance, major servers do not indicate persistent connections properly, and conditional GET is not sufficiently supported. Based on the data, we suggest a number of system improvements.