Query-Independent Ranking for Large-Scale Persistent Search Systems

Report ID: TR-837-08
Author: Schmidt, Erich R.
Date: 2008-10-00
Pages: 109
Download Formats: PDF
Abstract:

Existing search services rely heavily on citation-based authority (e.g., PageRank) to assess the quality of publications. The quality and relevance of results are particularly important in persistent search, but current rank computations are strongly biased against new pages. We propose SiteRank, a new ranking mechanism that handles new publications well and also dramatically reduces computation costs.

This performance improvement is especially valuable when authority is computed in a persistent search service. Current systems, whether small-scale notifiers (e.g., CNN Alerts) or persistent queries on traditional search engines (e.g., Google Alerts), suffer from limited coverage, low refresh rates, or both. We propose Distributed Persistent Search (DPS), a new architecture based on a publish-subscribe framework that achieves a linear improvement in publication processing and notification routing as a function of the number of servers used.

To fully utilize the distributed architecture of DPS and eliminate the rank server as a single point of failure, we also propose Distributed SiteRank, a fully distributed citation-based rank computation that scales well with the number of documents and can be used in both traditional and persistent search systems.