Web Crawlers
&
Hyperlink Analysis
Albert Sutojo
CS267, Fall 2005
Instructor: Dr. T.Y. Lin
Agenda
Web Crawlers
History & definitions
Algorithms
Architecture
Hyperlink Analysis
HITS
PageRank
Web Crawlers
Definitions
Also known as spiders, robots, bots, aggregators, agents and intelligent agents.
An internet-aware program that can retrieve information from a specific location on the internet.
A program that collects documents by recursively fetching links from a set of starting pages.
Web crawlers are programs that exploit the graph structure of the web to move from page to page.
Research on crawlers
There is very little published research about crawlers:
"... very little research has been done on crawlers."
Junghoo Cho, Hector Garcia-Molina, Lawrence Page. Efficient Crawling Through URL Ordering. Stanford University, 1998.
Research on crawlers
"Unfortunately, many of the techniques used by dot-coms, and especially the resulting performance, are private, behind company walls, or are disclosed in patents."
Arvind Arasu, et al. Searching the Web. Stanford University, 2001.
"Due to the competitive nature of the search engine business, the designs of these crawlers have not been publicly described. There are two notable exceptions: the Google crawler and the Internet Archive crawler. Unfortunately, the descriptions of these crawlers in the literature are too terse to enable reproducibility."
Allan Heydon and Marc Najork. Mercator: A Scalable, Extensible Web Crawler. Compaq Systems Research Center, 2001.
Research on crawlers
"Web crawling and indexing companies are rather protective about the engineering details of their software assets. Much of the discussion of the typical anatomy of large crawlers is guided by an early paper discussing the crawling system for Google [26], as well as a paper about the design of Mercator, a crawler written in Java at Compaq Research Center [108]."
Soumen Chakrabarti. Mining the Web: Discovering Knowledge from Hypertext Data. Morgan Kaufmann, 2003.
Research on crawlers
1993: First crawler, Matthew Gray's Wanderer
1994:
David Eichmann. The RBSE Spider: Balancing Effective Search Against Web Load. In Proceedings of the First International World Wide Web Conference, 1994.
Oliver A. McBryan. GENVL and WWWW: Tools for Taming the Web. In Proceedings of the First International World Wide Web Conference, 1994.
Brian Pinkerton. Finding What People Want: Experiences with the WebCrawler. In Proceedings of the Second International World Wide Web Conference, 1994.
Research on crawlers
1997: www.archive.org crawler
M. Burner. Crawling towards eternity: Building an archive of the world wide web. Web Techniques Magazine, 2(5), May 1997.
1998: Google crawler
S. Brin and L. Page. The anatomy of a large-scale hypertextual Web search engine. In Proceedings of the 7th World Wide Web Conference, pages 107-117, 1998.
1999: Mercator
A. Heydon and M. Najork. Mercator: A scalable, extensible web crawler. World Wide Web, 2(4):219-229, 1999.
Research on crawlers
2001: WebFountain crawler
J. Edwards, K. S. McCurley, and J. A. Tomlin. An adaptive model for optimizing performance of an incremental web crawler. In Proceedings of the 10th International World Wide Web Conference, pages 106-113, May 2001.
2002:
Cho and Garcia-Molina's crawler
J. Cho and H. Garcia-Molina. Parallel crawlers. In Proceedings of the 11th International World Wide Web Conference, 2002.
UbiCrawler
P. Boldi, B. Codenotti, M. Santini, and S. Vigna. UbiCrawler: A scalable fully distributed web crawler. In Proceedings of the 8th Australian World Wide Web Conference, July 2002.
Research on crawlers
2002: Shkapenyuk and Suel's crawler
V. Shkapenyuk and T. Suel. Design and implementation of a high-performance distributed web crawler. In IEEE International Conference on Data Engineering (ICDE), Feb. 2002.
2004: Carlos Castillo
Castillo, C. Effective Web Crawling. PhD thesis, University of Chile, November 2004.
2005: DynaBot
Daniel Rocco, James Caverlee, Ling Liu, Terence Critchlow. Exploiting the Deep Web with DynaBot: Matching, Probing, and Ranking. In Special interest tracks and posters of the 14th International Conference on World Wide Web, May 2005.
2006: ?
Crawler basic algorithm
1. Remove a URL from the unvisited URL list
2. Determine the IP address of its host name
3. Download the corresponding document
4. Extract any links contained in it
5. If a URL is new, add it to the list of unvisited URLs
6. Process the downloaded document
7. Go back to step 1
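Shown below is a minimal Python sketch of this loop (an illustration, not code from the slides): the unvisited URL list is a FIFO queue, so the crawl proceeds breadth-first, and process() is a hypothetical placeholder for step 6.

import socket
from collections import deque
from html.parser import HTMLParser
from urllib.parse import urljoin, urlparse
from urllib.request import urlopen

class LinkExtractor(HTMLParser):
    """Collects the href attribute of every <a> tag (step 4)."""
    def __init__(self):
        super().__init__()
        self.links = []
    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)

def process(url, page):
    # Placeholder for step 6: indexing / analysis of the downloaded page.
    print(url, len(page), "bytes")

def crawl(seed_urls, max_pages=100):
    frontier = deque(seed_urls)              # the unvisited URL list
    visited = set()
    while frontier and len(visited) < max_pages:
        url = frontier.popleft()             # 1. remove a URL from the list
        if url in visited:
            continue
        try:
            socket.gethostbyname(urlparse(url).hostname)    # 2. resolve the host's IP
            page = urlopen(url, timeout=5).read().decode("utf-8", "replace")   # 3. download
        except Exception:
            continue                         # skip unreachable or malformed URLs
        visited.add(url)
        extractor = LinkExtractor()
        extractor.feed(page)                 # 4. extract the links contained in it
        for link in extractor.links:
            absolute = urljoin(url, link)
            if absolute not in visited:
                frontier.append(absolute)    # 5. add new URLs to the unvisited list
        process(url, page)                   # 6. process the downloaded document
    return visited                           # 7. the while loop goes back to step 1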
Single Crawler
[Flowchart: the single crawler's crawling loop. Initialize the URL list with the starting URLs; until the termination condition is met (or no more URLs remain), pick a URL from the URL list, fetch and parse the page, and add newly discovered URLs to the URL list.]
Multithreaded Crawler
[Diagram: several identical threads share one URL list. Each thread locks the URL list, picks a URL, unlocks the list, fetches and parses the page, then locks the list again to add the newly found URLs and to check for termination before repeating.]
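Below is a hedged Python sketch of the locking pattern in this diagram: each thread locks the shared URL list to pick a URL, releases the lock while fetching and parsing, then re-locks the list to add new URLs and check for termination. fetch_page() and extract_links() are stubs standing in for the download and parse steps, and the termination test is deliberately simplified.

import threading
from collections import deque

def fetch_page(url):
    # Stub: a real thread would download the page here.
    return ""

def extract_links(base_url, page):
    # Stub: a real thread would parse <a href> links here.
    return []

url_list = deque(["http://example.com/"])    # shared unvisited URL list
seen = set(url_list)
url_lock = threading.Lock()
MAX_PAGES = 50
pages_done = 0

def worker():
    global pages_done
    while True:
        with url_lock:                       # lock URL list
            if pages_done >= MAX_PAGES or not url_list:
                return                       # check for termination (simplified)
            url = url_list.popleft()         # pick URL from list
            pages_done += 1
        # the URL list is unlocked while the slow network work happens
        page = fetch_page(url)               # fetch page
        links = extract_links(url, page)     # parse page
        with url_lock:                       # lock URL list again
            for link in links:
                if link not in seen:
                    seen.add(link)
                    url_list.append(link)    # add newly found URLs

threads = [threading.Thread(target=worker) for _ in range(4)]
for t in threads:
    t.start()
for t in threads:
    t.join()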
Parallel Crawler
[Diagram: several crawling processes (C-procs) connected to the Internet and to each other over a local connection, sharing the queues of URLs to visit and writing into a common store of collected pages.]
Crawl order: breadth-first search
Robot Protocol
Specifies the parts of a web site that a crawler should not visit.
Placed at the root of a web site as robots.txt.

# robots.txt for http://somehost.com/
User-agent: *
Disallow: /cgi-bin/
Disallow: /registration    # Disallow robots on registration page
Disallow: /login
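Python's standard library can parse and honor such a file; the sketch below feeds it the hypothetical somehost.com rules shown above (a real crawler would point set_url() at the live robots.txt and call read()).

from urllib import robotparser

rp = robotparser.RobotFileParser()
# Parse the example rules above; for a live site one would instead call
# rp.set_url("http://somehost.com/robots.txt") followed by rp.read().
rp.parse("""
User-agent: *
Disallow: /cgi-bin/
Disallow: /registration
Disallow: /login
""".splitlines())

print(rp.can_fetch("MyCrawler", "http://somehost.com/index.html"))   # True
print(rp.can_fetch("MyCrawler", "http://somehost.com/cgi-bin/foo"))  # False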
Search Engine: architecture
[Diagram: crawlers fetch pages from the WWW into a page repository; the indexer module and the collection analysis module build the indexes (text, structure, utility); the query engine answers client queries using the indexes and returns results ordered by the ranking module.]
Search Engine: major components
Crawlers
Collect documents by recursively fetching links from a set of starting pages.
Each crawler has different policies, so the pages indexed by various search engines differ.
The Indexer
Processes pages, decides which of them to index, and builds various data structures representing the pages (inverted index, web graph, etc.); the representations differ among search engines (a small inverted-index sketch follows below).
Might also build additional structures (e.g., LSI).
The Query Processor
Processes user queries and returns matching answers in an order determined by a ranking algorithm.
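As a small illustration of the indexer's central data structure, here is a sketch that builds an inverted index over a few toy documents (the documents and ids are invented for the example).

from collections import defaultdict

docs = {
    1: "web crawlers exploit the graph structure of the web",
    2: "pagerank measures the importance of a web page",
    3: "hits computes authority and hub scores",
}

inverted_index = defaultdict(set)            # term -> set of document ids
for doc_id, text in docs.items():
    for term in text.lower().split():
        inverted_index[term].add(doc_id)

print(sorted(inverted_index["web"]))   # [1, 2]
print(sorted(inverted_index["hub"]))   # [3]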
Issues on crawlers
1. General architecture
2. What pages should the crawler download?
3. How should the crawler refresh pages?
4. How should the load on the visited web sites be minimized?
5. How should the crawling process be parallelized?
Web Pages Analysis
Content-based analysis
Based on the words in documents.
Each document or query is represented as a term vector.
E.g., the vector space model, tf-idf (sketched below).
Connectivity-based analysis
Uses the hyperlink structure.
Used to measure the importance of web pages.
E.g., PageRank, HITS.
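A minimal sketch of the content-based approach: tf-idf weighted term vectors compared by cosine similarity (the vector space model); the documents are toy examples, not data from the slides.

import math
from collections import Counter

docs = [
    "web crawlers fetch web pages",
    "pagerank ranks web pages by importance",
    "hyperlink analysis uses the link structure",
]

def tf_idf_vectors(documents):
    tokenized = [d.lower().split() for d in documents]
    n = len(tokenized)
    df = Counter(term for doc in tokenized for term in set(doc))   # document frequency
    vectors = []
    for doc in tokenized:
        tf = Counter(doc)                                          # term frequency
        vectors.append({t: tf[t] * math.log(n / df[t]) for t in tf})
    return vectors

def cosine(u, v):
    dot = sum(u[t] * v.get(t, 0.0) for t in u)
    norm = (math.sqrt(sum(w * w for w in u.values()))
            * math.sqrt(sum(w * w for w in v.values())))
    return dot / norm if norm else 0.0

vecs = tf_idf_vectors(docs)
print(cosine(vecs[0], vecs[1]))   # > 0: the documents share "web" and "pages"
print(cosine(vecs[0], vecs[2]))   # 0.0: no terms in common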
Hyperlink Analysis
Exploits the hyperlink structure of web pages to find relevant and important pages for a user query.
Assumptions:
1. A hyperlink from page A to page B is a recommendation of page B by the author of page A.
2. If page A and page B are connected by a hyperlink, they might be on the same topic.
Used for crawling, ranking, computing the geographic scope of a web page, finding mirrored hosts, computing statistics of web pages and search engines, and web page categorization.
Hyperlink Analysis
Most popular methods:
HITS (1998)
(Hypertext Induced Topic Search)
By Jon Kleinberg
PageRank (1998)
By Lawrence Page & Sergey Brin,
Google's founders
HITS
Involves two steps:
1. Build a neighborhood graph N related to the query terms.
2. Compute authority and hub scores for each document in N, and present the two ranked lists of the most authoritative and the most hub-like documents.
HITS
freedom : term 1 → doc 3, doc 117, doc 3999
...
registration : term 10 → doc 15, doc 3, doc 101, doc 19, doc 1199, doc 280
faculty : term 11 → doc 56, doc 94, doc 31, doc 3
...
graduation : term m → doc 223
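A hedged sketch of step 1 of HITS, building the neighborhood graph N from posting lists like the ones above: start from the root set of documents containing the query terms, then add pages the root set links to and pages that link into it. The posting lists and link structure below are toy data shaped after this example, not the actual collection.

# Toy posting lists and link structure (assumed for illustration only).
inverted_index = {
    "registration": {15, 3, 101, 19, 1199, 280},
    "faculty": {56, 94, 31, 3},
}
out_links = {3: {15, 101}, 56: {3}, 94: {19}, 673: {3},
             15: set(), 101: set(), 19: set(), 31: set(),
             1199: set(), 280: set()}

def neighborhood(query_terms):
    root = set()
    for term in query_terms:
        root |= inverted_index.get(term, set())   # documents containing the terms
    graph = set(root)
    for doc, targets in out_links.items():
        if doc in root:
            graph |= targets                      # pages the root set points to
        if targets & root:
            graph.add(doc)                        # pages pointing into the root set
    return graph

print(sorted(neighborhood(["registration", "faculty"])))
# [3, 15, 19, 31, 56, 94, 101, 280, 673, 1199]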
HITS
[Diagram: the neighborhood graph built from these postings, containing documents 3, 15, 19, 31, 56, 94, 101, 673 and 1199; arrows into a document count toward its indegree, arrows out of it toward its outdegree.]
HITS
HITS defines authorities and hubs:
An authority is a document with several inlinks.
A hub is a document that has several outlinks.
[Diagram: an authority node with several incoming links, and a hub node with several outgoing links.]
HITS computation
Good authorities are pointed to by good hubs
Good hubs point to good authorities
Page i has both an authority score x_i and a hub score y_i:

$$x_i^{(k)} = \sum_{j\,:\,e_{ji} \in E} y_j^{(k-1)}, \qquad y_i^{(k)} = \sum_{j\,:\,e_{ij} \in E} x_j^{(k)}, \qquad k = 1, 2, 3, \ldots$$

E = the set of all directed edges in the web graph
e_ij = the directed edge from node i to node j
Given initial authority scores x_i^(0) and hub scores y_i^(0).
HITS computation
The updates

$$x_i^{(k)} = \sum_{j\,:\,e_{ji} \in E} y_j^{(k-1)}, \qquad y_i^{(k)} = \sum_{j\,:\,e_{ij} \in E} x_j^{(k)}, \qquad k = 1, 2, 3, \ldots$$

can be written using the adjacency matrix L of the directed web graph:
L_ij = 1 if there exists an edge from node i to node j, and 0 otherwise.
HITS computation
In matrix form:

$$x^{(k)} = L^T y^{(k-1)} \qquad \text{and} \qquad y^{(k)} = L\, x^{(k)}$$

[Example: a small directed web graph on documents d1, d2, d3 and d4 and its 4 x 4 adjacency matrix L.]
HITS computation
1. Initialize y^(0) = e, where e is a column vector of all ones
2. Until convergence, do:
   x^(k) = L^T y^(k-1)
   y^(k) = L x^(k)
   k = k + 1
   Normalize x^(k) and y^(k)

Substituting one update into the other gives

$$x^{(k)} = L^T L\, x^{(k-1)}, \qquad y^{(k)} = L L^T y^{(k-1)}$$

L^T L = authority matrix
L L^T = hub matrix

Computing the authority vector x and hub vector y can be viewed as finding the dominant right-hand eigenvectors of L^T L and L L^T.
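This iteration is a power method; a small NumPy sketch of it is shown below (an illustration, not the original formulation's code), using 1-norm normalization as in the example that follows.

import numpy as np

def hits(L, iterations=100):
    """Power iteration for HITS on adjacency matrix L (L[i, j] = 1 iff i links to j)."""
    n = L.shape[0]
    y = np.ones(n)              # y(0) = e, a vector of all ones
    x = np.zeros(n)
    for _ in range(iterations):
        x = L.T @ y             # authority update: x(k) = L^T y(k-1)
        y = L @ x               # hub update:       y(k) = L x(k)
        x = x / x.sum()         # normalize (1-norm)
        y = y / y.sum()
    return x, y                 # dominant eigenvectors of L^T L and L L^T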
HITS Example
[Figure: the neighborhood graph on documents 1, 2, 3, 5, 6 and 10, with edges 1→3, 1→6, 2→1, 3→6, 6→3, 6→5 and 10→6.]

L, rows and columns ordered 1, 2, 3, 5, 6, 10 (L_ij = 1 iff document i links to document j):

      1  2  3  5  6  10
 1  [ 0  0  1  0  1  0 ]
 2  [ 1  0  0  0  0  0 ]
 3  [ 0  0  0  0  1  0 ]
 5  [ 0  0  0  0  0  0 ]
 6  [ 0  0  1  1  0  0 ]
10  [ 0  0  0  0  1  0 ]
HITS Example
Authority and hub matrices:

L^T L (authority matrix), rows and columns ordered 1, 2, 3, 5, 6, 10:

      1  2  3  5  6  10
 1  [ 1  0  0  0  0  0 ]
 2  [ 0  0  0  0  0  0 ]
 3  [ 0  0  2  1  1  0 ]
 5  [ 0  0  1  1  0  0 ]
 6  [ 0  0  1  0  3  0 ]
10  [ 0  0  0  0  0  0 ]

L L^T (hub matrix):

      1  2  3  5  6  10
 1  [ 2  0  1  0  1  1 ]
 2  [ 0  1  0  0  0  0 ]
 3  [ 1  0  1  0  0  1 ]
 5  [ 0  0  0  0  0  0 ]
 6  [ 1  0  0  0  2  0 ]
10  [ 1  0  1  0  0  1 ]
HITS Example
The normalized principal eigenvectors with the authority scores x and hub scores y are:

x^T = ( 0   0   .3660   .1340   .5   0 )
y^T = ( .3660   0   .2113   0   .2113   .2113 )

Authority ranking = ( 6  3  5  1  2  10 )
Hub ranking = ( 1  3  6  10  2  5 )
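As a check, running the same power iteration on the adjacency matrix L reconstructed above reproduces these scores (the values in the comments are approximate).

import numpy as np

# Rows and columns ordered 1, 2, 3, 5, 6, 10, as in the example above.
L = np.array([[0, 0, 1, 0, 1, 0],    # 1 -> 3, 6
              [1, 0, 0, 0, 0, 0],    # 2 -> 1
              [0, 0, 0, 0, 1, 0],    # 3 -> 6
              [0, 0, 0, 0, 0, 0],    # 5
              [0, 0, 1, 1, 0, 0],    # 6 -> 3, 5
              [0, 0, 0, 0, 1, 0]])   # 10 -> 6

y = np.ones(6)                       # y(0) = e
for _ in range(100):
    x = L.T @ y
    y = L @ x
    x = x / x.sum()                  # 1-norm normalization
    y = y / y.sum()

print(np.round(x, 4))   # approx [0, 0, 0.366, 0.134, 0.5, 0]
print(np.round(y, 4))   # approx [0.366, 0, 0.2113, 0, 0.2113, 0.2113]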
Strengths and Weaknesses of HITS
Strengths
Dual rankings
Weaknesses
Query-dependence
Hub scores can be easily manipulated
It is possible that a very authoritative yet off-topic document is linked to a document containing the query terms (topic drift)
PageRank
Is a numerical value that represents how important a page is.
A page casts a vote for each page it links to; the more votes a page receives, the more important the page.
The importance of the page that casts a vote determines how important that vote is.
The importance score of a page is calculated from the votes cast for that page.
Used by Google.
PageRank
$$PR(A) = (1 - d) + d \left[ \frac{PR(t_1)}{C(t_1)} + \frac{PR(t_2)}{C(t_2)} + \cdots + \frac{PR(t_n)}{C(t_n)} \right]$$

Where:
PR(A) = PageRank of page A
d = damping factor, usually set to 0.85
t1, t2, ..., tn = pages that link to page A
C( ) = the number of outlinks of a page

In a simpler way:
PR(A) = 0.15 + 0.85 × [a share of the PageRank of every page that links to A]
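A minimal sketch of the iterative calculation behind this formula, in the non-normalized form used in the following examples (every page starts at PR = 1, and a page with no inlinks settles at 1 - d = 0.15); out_links maps each page to the set of pages it links to.

def pagerank(out_links, d=0.85, iterations=100):
    pages = list(out_links)
    pr = {p: 1.0 for p in pages}                     # each page starts with PR = 1
    for _ in range(iterations):
        new_pr = {}
        for p in pages:
            share = sum(pr[q] / len(out_links[q])    # PR(q) / C(q)
                        for q in pages if p in out_links[q])
            new_pr[p] = (1 - d) + d * share          # PR(p) = (1 - d) + d * [...]
        pr = new_pr
    return pr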
PageRank Example
[Figure: three pages A, B and C with no links between them, each starting with PR = 1.]
Each page is assigned an initial PageRank of 1.
The site's maximum PageRank is 3.
With no links between the pages:
PR(A) = 0.15
PR(B) = 0.15
PR(C) = 0.15
The total PageRank in the site is 0.45, seriously wasting most of its potential PageRank.
PageRank Example
[Figure: page A now links to page B; each page starts with PR = 1.]
After the first iteration:
PR(A) = 0.15
PR(B) = 1
PR(C) = 0.15
Page B's PageRank increases because page A has voted for page B.
PageRank Example
[Figure: page A links to page B.]
After 100 iterations:
PR(A) = 0.15
PR(B) = 0.2775
PR(C) = 0.15
The total PageRank in the site is 0.5775.
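Feeding this example (A links to B; nothing links to A or C) to the pagerank() sketch defined above reproduces the converged values.

links = {"A": {"B"}, "B": set(), "C": set()}
print(pagerank(links))
# approx {'A': 0.15, 'B': 0.2775, 'C': 0.15} -- total 0.5775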
PageRank Example
[Figure: pages A, B and C linked in a loop, each starting with PR = 1.]
No matter how many iterations are run, each page always ends up with PR = 1:
PR(A) = 1
PR(B) = 1
PR(C) = 1
This occurs when the pages link in a loop.
PageRank Example
[Figure: page A links to B and C, and B and C each link back to A; each page starts with PR = 1.]
After the first iteration:
PR(A) = 1.85
PR(B) = 0.575
PR(C) = 0.575
After 100 iterations:
PR(A) = 1.459459
PR(B) = 0.7702703
PR(C) = 0.7702703
The total PageRank is 3 (the maximum), so none is being wasted.
Page A has the higher PR.
PageRank Example
[Figure: as before, but page C now links to both A and B; each page starts with PR = 1.]
After the first iteration:
PR(A) = 1.425
PR(B) = 1
PR(C) = 0.575
After 100 iterations:
PR(A) = 1.298245
PR(B) = 0.999999
PR(C) = 0.7017543
Page C shares its vote between A and B.
Page A loses some value.
Dangling Links
[Figure: pages A, B and C illustrating a dangling link.]
A dangling link is a link to a page that has no outgoing links of its own, or a link to a page that has not been indexed.
Google removes such links shortly after the calculations start and reinstates them shortly before the calculations are finished.
PageRank Demo
http://homer.informatics.indiana.edu/cgi-bin/pagerank/cleanup.cgi
PageRank Implementation
freedom : term 1 → doc 3, doc 117, doc 3999
...
registration : term 10 → doc 101, doc 87, doc 1199
faculty : term 11 → doc 280, doc 85
...
graduation : term m → doc 223
PageRank Implementation
The query result on terms 10 and 11 is {101, 280, 85, 87, 1199}
PR(87) = 0.3751
PR(85) = 0.2862
PR(101) = 0.04151
PR(280) = 0.03721
PR(1199) = 0.0023
Document 87 is the most important of the relevant documents.
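Because the scores are query-independent, the query step only needs to sort the matching documents by their precomputed PageRank; a tiny sketch using the numbers above:

pagerank_scores = {87: 0.3751, 85: 0.2862, 101: 0.04151,
                   280: 0.03721, 1199: 0.0023}
result_set = {101, 280, 85, 87, 1199}          # documents matching terms 10 and 11

ranked = sorted(result_set, key=lambda doc: pagerank_scores[doc], reverse=True)
print(ranked)   # [87, 85, 101, 280, 1199]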
Strengths and Weaknesses of PageRank
Weaknesses
The topic drift problem, due to the importance of determining an accurate relevancy score.
Much work, thought and heuristics must be applied by Google engineers to determine the relevancy score; otherwise, the PageRank-ordered retrieval list might often be useless to a user.
Question: why does importance serve as such a good proxy for relevance?
Some of these questions might be ...
Strengths and Weaknesses of PageRank
Strengths
The use of importance rather than relevance.
By measuring importance, query-dependence is not an issue.
Query-independence.
Faster retrieval.
HITS vs PageRank
HITS                                   | PageRank
Connectivity-based analysis            | Connectivity-based analysis
Eigenvector & eigenvalue calculation   | Eigenvector & eigenvalue calculation
Query-dependence                       | Query-independence
Relevance                              | Importance
Q&A
Thank you