Scheduling Web Crawl for Better Performance and Quality

Report ID: TR-682-03
Author: Cao, Fengyun / Jiang, Dongming / Singh, Jaswinder Pal
Date: 2003-10-00
Pages: 11
Download Formats: PDF
Abstract:

A Web crawler is an essential component of search engines, data mining, and other Internet applications. Scheduling which Web pages to download is an important aspect of crawling. Previous research on Web crawling has focused on optimizing either crawl speed or the quality of the Web pages downloaded. While both metrics are important, scheduling based on only one of them is insufficient and can bias or hurt the overall crawl process. This paper explores the design space of crawl scheduling to balance performance and quality factors and optimize global crawl efficiency. We design a network-efficient scheduling framework and use it to evaluate various scheduling strategies. We also define a new scheduling algorithm that factors both network performance and Web page quality into scheduling decisions. Real-world experiments demonstrate the effectiveness of the two-level scheduling scheme and the new algorithm in improving overall crawl efficiency. Experiments also show that crawl-scheduling design can always be optimized given a full understanding of application properties.