Internet Searching
In 300 B.C., the Library of Alexandria held some half million scrolls and served as a
cultural center for the world, uniting the thoughts of Greek, Egyptian, Macedonian, and
Babylonian culture. According to the Letter of Aristeas, the library had “a large budget in order
to collect, if possible, all the books in the world.” Whole fields of study such as geometry and
grammar grew from the scholars who had the fortune to travel to and study at Alexandria.
In 2003 A.D., the Internet forms a distributed library of billions of pages, one that is
accessible to anyone, anywhere in the world, at the click of a mouse. Every day hundreds of
millions of “trips” to the library start with a query to an Internet search engine. If I want to
research a topic such as “Library of Alexandria,” I can type those words into a search engine and in
less than a second have access to a list of 14,000 pages, sorted by their usefulness to the query.
This unprecedented ease of access to information has revolutionized the way research is done by
students, scientists, journalists, shoppers, and others. It opens up an online marketplace of
products, services, and ideas that benefits both information providers and seekers; sellers and
buyers; consumers and advertisers.
How is it that an Internet search engine can find the answers to a query so quickly? It is
a four-step process:
Crawling is conceptually quite simple: starting at some well-known sites on the web,
recursively follow every hypertext link, recording the pages encountered along the way. In
computer science this is called the transitive closure of the link relation. However, the conceptual
simplicity hides a large number of practical complications: sites may be busy or down at one
point, and come back to life later; pages may be duplicated at multiple sites (or with different
URLs at the same site) and must be dealt with accordingly; many pages have text that does not
conform to the standards for HTML, HTTP redirection, robot exclusion, or other protocols; some
information is hard to access because it is hidden behind a form, Flash animation, or JavaScript
program. Finally, the necessity of crawling 100 million pages a day means that building a crawler
is an exercise in distributed computing, requiring many computers that must work together and
schedule their actions so as to get to all the pages without overwhelming any one site with too
many requests at once.
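
To make this concrete, the following is a minimal single-machine sketch of a crawler in Python, using only the standard library. It follows links breadth-first from a set of seed pages, recording each page it fetches; the page limit and one-second politeness delay are illustrative assumptions, and a real crawler would add robots.txt handling, duplicate detection, retry scheduling, and distribution across many machines, as noted above.

    import time
    import urllib.parse
    import urllib.request
    from collections import deque
    from html.parser import HTMLParser

    class LinkExtractor(HTMLParser):
        """Collects the href targets of <a> tags on a page."""
        def __init__(self):
            super().__init__()
            self.links = []

        def handle_starttag(self, tag, attrs):
            if tag == "a":
                for name, value in attrs:
                    if name == "href" and value:
                        self.links.append(value)

    def crawl(seeds, max_pages=100, delay=1.0):
        """Breadth-first crawl from seed URLs, recording each page seen."""
        frontier = deque(seeds)
        seen = set(seeds)
        pages = {}
        while frontier and len(pages) < max_pages:
            url = frontier.popleft()
            try:
                with urllib.request.urlopen(url, timeout=10) as response:
                    html = response.read().decode("utf-8", errors="replace")
            except (OSError, ValueError):
                continue  # site busy, down, or URL malformed; a real crawler retries later
            pages[url] = html
            parser = LinkExtractor()
            parser.feed(html)
            for href in parser.links:
                absolute = urllib.parse.urljoin(url, href)
                if absolute.startswith("http") and absolute not in seen:
                    seen.add(absolute)
                    frontier.append(absolute)
            time.sleep(delay)  # crude politeness: pause between requests
        return pages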
A search engine’s index is similar to the index in the back of a book: it is used to find the pages on which a word occurs. There are two main differences: the search engine’s index lists every occurrence of every word, not just the important concepts, and the number of pages is in the billions, not hundreds. Various techniques of compression and clever representation are used to keep the index “small,” but it is still measured in terabytes (millions of megabytes), which again means that distributed computing is required.
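
As an illustration, a tiny in-memory version of such an index can be built in a few lines of Python. The sample pages below are made up, and a real engine would compress the posting lists and shard them across many machines, as described above.

    import re
    from collections import defaultdict

    def build_index(pages):
        """Map each word to (url, position) pairs, one entry per occurrence,
        like a book index that lists every occurrence of every word."""
        index = defaultdict(list)
        for url, text in pages.items():
            for position, word in enumerate(re.findall(r"[a-z0-9]+", text.lower())):
                index[word].append((url, position))
        return index

    # Toy usage with two made-up pages.
    docs = {"http://example.com/a": "the Library of Alexandria",
            "http://example.com/b": "a distributed library of pages"}
    print(build_index(docs)["library"])
    # -> [('http://example.com/a', 1), ('http://example.com/b', 2)]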
Most modern search engines index link data as well as word data. It is useful to know how many pages link to a given page, and what the quality of those pages is. This kind of analysis is similar to citation analysis in bibliographic work, and helps establish which pages are authoritative. Algorithms such as PageRank and HITS are used to assign a numeric measure of authority to each page. For example, the PageRank algorithm says that the rank of a page is a function of the sum of the ranks of the pages that link to it. If we let PR(p) be the PageRank of page p, Out(p) be the number of outgoing links from page p, Links(p) be the set of pages that link to page p, and N be the total number of pages in the index, then we can define PageRank by

    PR(p) = (1 - d)/N + d × Σ_{q ∈ Links(p)} PR(q)/Out(q)

where d is a damping factor (a typical value is 0.85).
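
As a sketch of how this definition is applied, the snippet below computes PageRank by simple iteration: every page starts with rank 1/N, and the formula is reapplied until the values settle. The toy link graph, damping factor, and iteration count are illustrative assumptions.

    def pagerank(links, d=0.85, iterations=50):
        """Iterate PR(p) = (1 - d)/N + d * sum of PR(q)/Out(q) over the
        pages q in Links(p). `links` maps each page to the set of pages
        it links to."""
        pages = set(links) | {q for targets in links.values() for q in targets}
        n = len(pages)
        pr = {p: 1.0 / n for p in pages}  # start with a uniform ranking
        for _ in range(iterations):
            new_pr = {}
            for p in pages:
                incoming = sum(pr[q] / len(targets)
                               for q, targets in links.items() if p in targets)
                new_pr[p] = (1 - d) / n + d * incoming
            pr = new_pr
        return pr

    # Toy graph: A and B both link to C; C links back to A.
    print(pagerank({"A": {"C"}, "B": {"C"}, "C": {"A"}}))  # C ranks highest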
Recommended Reading
S. Chakrabarti, B. Dom, D. Gibson, J. Kleinberg, S. R. Kumar, P. Raghavan, S. Rajagopalan, and A. Tomkins. “Hypersearching the Web.” Scientific American, June 1999, pp. 54–60.