We describe the design, prototyping and evaluation of ARC, a system for automatically compiling a list of authoritative web resources on any (sufficiently broad) topic. Our system extracts a "global" notion of the importance of a page for a given topic. It performs an iterative local analysis of links pointing to and from the pages matching the topic in a term-based query as well as the associated text in the neighborhood of HTML link instantiations (<a href="http://....">....</a>). The goal of ARC is similar to that of resource lists such as those provided by Yahoo! and Infoseek, with the fundamental difference that these services construct lists either manually or through a combination of human and automated effort, while ARC operates fully automatically. As such, the construction of resource lists in ARC is considerably simpler and more efficient. We describe a study aimed at assessing the quality of the resource lists produced by ARC, through the evaluation of ARC, Yahoo!, and Infoseek resource lists by a panel of human users. This evaluation suggests that the resources found by ARC frequently fare almost as well as, and sometimes better than, lists of resources that are manually compiled or classified into a topic. We also provide examples of ARC resource lists for the reader to examine.
The subject of this paper is the design and evaluation of an automatic resource compiler: a system that, given a topic that is suitably broad and well-represented on the web, will seek out and return a list of web resources that it considers the most authoritative for that topic. Our system is built on an algorithm that performs a local analysis of both text and links to arrive at a "global consensus" of the best resources for the topic. To evaluate our system, we perform a user-study, comparing its results with those of commercial, human-compiled/assisted services. To our knowledge, this is one of the first systematic user-studies comparing the quality of multiple resource lists compiled using different methods. Our study suggests that, although our resource lists are compiled wholly automatically (and despite being presented to users without any embellishments in the "look and feel" or the presentation context), they fare relatively well compared to the commercial human-compiled lists.
When web users seek definitive information on a broad topic, they frequently go to a hierarchical, manually-compiled taxonomy such as Yahoo!, or a human-assisted compilation such as Infoseek. Often, these taxonomies will contain a broad topic as one or more nodes in the hierarchy, presumably containing the best sources of information on this topic on the web. Thus an important role of such a taxonomy is to provide, for any broad topic, a resource list of high-quality resources. Our interest is in studying the extent to which such authoritative resource compilation can be automated. Note that resource compilation differs from search engines (a second, somewhat distinct service provided by Yahoo! and Infoseek, as well as others such as AltaVista): search engines must be able to support rapid term-based search in volume on arbitrary text queries, and are typically based on inverted indices. Resource compilation, on the other hand, may be viewed as an "off-line process" that stresses high quality. It need not answer any text query; it only covers topics deemed to be of broad interest.
In this paper we describe ARC (for Automatic Resource Compiler), an algorithm and system for automatically compiling a resource list on any topic that is broad and well-represented on the web. Our technique is based on the combination of document text with the annotative power of links (href's) and the text in the vicinity of the href's. By using an automated system to compile resource lists, we obtain faster coverage of the available resources than a human can achieve (or, alternatively, the ability to update the resource lists more frequently). As our studies with human users show, however, the loss in quality does not seem to be significant compared to manually or semi-manually compiled lists.
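As a rough illustration of what "the text in the vicinity of the href's" can mean in practice, the following sketch collects, for each link in a page, the anchor text together with a few words on either side. It is not the ARC implementation: the names `LinkContextParser` and `link_contexts`, the word-based window, and the default window size are all assumptions made for this example.

```python
from html.parser import HTMLParser


class LinkContextParser(HTMLParser):
    """Record the page's words and the word span covered by each <a href=...>."""

    def __init__(self):
        super().__init__()
        self.words = []    # all text words of the page, in document order
        self.spans = []    # (href, start_word_index, end_word_index)
        self._open = None  # (href, start_word_index) of the anchor being read

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href")
            if href:
                self._open = (href, len(self.words))

    def handle_data(self, data):
        self.words.extend(data.split())

    def handle_endtag(self, tag):
        if tag == "a" and self._open is not None:
            href, start = self._open
            self.spans.append((href, start, len(self.words)))
            self._open = None


def link_contexts(html, window=3):
    """Return (href, text) pairs: anchor text plus `window` words on each side.

    The window size is a hypothetical parameter chosen for illustration.
    """
    parser = LinkContextParser()
    parser.feed(html)
    parser.close()
    words = parser.words
    return [
        (href, " ".join(words[max(0, start - window):end + window]))
        for href, start, end in parser.spans
    ]


page = ('<p>A very good list of <a href="http://example.org/x">jazz '
        'resources</a> for new listeners everywhere.</p>')
contexts = link_contexts(page, window=2)
```

Here `contexts` pairs the link target with the words "list of jazz resources for new", which a text-matching step could then compare against the topic terms.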
The use of links for ranking documents is similar to work on citation analysis in the field of bibliometrics (see e.g. [White and McCain]). In the context of the Web, links have been used for enhancing relevance judgments by [Rivlin, Botafogo, and Schneiderman] and [Weiss et al]. They have been incorporated into query-based frameworks for searching by [Arocena, Mendelzon, and Mihaila] and by [Spertus].
Our work is oriented in a different direction: namely, to use links as a means of harnessing the latent human annotation in hyperlinks so as to broaden a user search and focus on a type of "high-quality" page. Similar motivation arises in the work of [Pirolli, Pitkow, and Rao]; [Carriere and Kazman]; and Page[Page97]. Pirolli et al. discuss a method based on link and text-based information for grouping and categorizing WWW pages. Carriere and Kazman rank pages by the number of neighbors a page has in the link structure, without regard to the directions of the links. Page views web searches as random walks in order to assign a topic-independent "rank" to each page on the WWW, which can then be used to re-order the output of a search engine. For a more detailed review of search engines and their rank functions (including some based on the number of links pointing to a web page) see Search Engine Watch[SEW].
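The neighbor-count ranking attributed above to Carriere and Kazman can be sketched in a few lines. This is only an illustration of the idea as summarized here, not their implementation: the function name `rank_by_undirected_degree`, the edge-list input format, the exclusion of self-links, and the alphabetical tie-break are assumptions.

```python
from collections import defaultdict


def rank_by_undirected_degree(links):
    """Order pages by their number of distinct neighbors, ignoring link direction.

    `links` is an iterable of (source, destination) pairs.
    """
    neighbors = defaultdict(set)
    for src, dst in links:
        if src == dst:
            continue  # assumption: a page linking to itself gains no neighbor
        neighbors[src].add(dst)
        neighbors[dst].add(src)
    # Most neighbors first; ties broken by page name (assumption).
    return sorted(neighbors, key=lambda page: (-len(neighbors[page]), page))


links = [("a", "b"), ("c", "b"), ("b", "d"), ("d", "a")]
ranking = rank_by_undirected_degree(links)
# "b" has three neighbors, "a" and "d" have two each, and "c" has one
```

Because direction is discarded, a page that links out to many others scores as well as one that many others point to, which is one way this measure differs from counts of incoming links alone.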
Finally, the link-based algorithm of Kleinberg[Kleinberg97] serves as one of the building blocks of our method; Section 2 below describes this connection in more detail and explains how we enhance the algorithm with textual analysis.