Web Crawlers
&
Hyperlink Analysis
Albert Sutojo
CS267, Fall 2005
Instructor: Dr. T.Y. Lin
Agenda
Web Crawlers
History & definitions
Algorithms
Architecture
Hyperlink Analysis
HITS
PageRank
Web Crawlers
Definitions
Also known as spiders, robots, bots, aggregators, agents and intelligent agents.
An internet-aware program that can retrieve information from a specific location on the internet.
A program that collects documents by recursively fetching links from a set of starting pages.
Web crawlers are programs that exploit the graph structure of the web to move from page to page.
Research on crawlers
There is very little published research about crawlers:
"... very little research has been done on crawlers."
Junghoo Cho, Hector Garcia-Molina, Lawrence Page. Efficient Crawling Through URL Ordering. Stanford University, 1998.
Research on crawlers
"Unfortunately, many of the techniques used by dot-coms, and especially the resulting performance, are private, behind company walls, or are disclosed in patents."
Arvind Arasu, et al. Searching the Web. Stanford University, 2001.
"Due to the competitive nature of the search engine business, the designs of these crawlers have not been publicly described. There are two notable exceptions: the Google crawler and the Internet Archive crawler. Unfortunately, the descriptions of these crawlers in the literature are too terse to enable reproducibility."
Allan Heydon and Marc Najork. Mercator: A Scalable, Extensible Web Crawler. Compaq Systems Research Center, 2001.
Research on crawlers
"Web crawling and indexing companies are rather protective about the engineering details of their software assets. Much of the discussion of the typical anatomy of large crawlers is guided by an early paper discussing the crawling system for Google [26], as well as a paper about the design of Mercator, a crawler written in Java at Compaq Research Center [108]."
Soumen Chakrabarti. Mining the Web: Discovering Knowledge from Hypertext Data. Morgan Kaufmann, 2003.
Research on crawlers
1993: First crawler, Matthew Gray's Wanderer
1994:
David Eichmann. The RBSE Spider: Balancing Effective Search Against Web Load. In Proceedings of the First International World Wide Web Conference, 1994.
Oliver A. McBryan. GENVL and WWWW: Tools for Taming the Web. In Proceedings of the First International World Wide Web Conference, 1994.
Brian Pinkerton. Finding What People Want: Experiences with the WebCrawler. In Proceedings of the Second International World Wide Web Conference, 1994.
Research on crawlers
1997: www.archive.org crawler
M. Burner. Crawling towards eternity: Building an archive of the world wide web. Web Techniques Magazine, 2(5), May 1997.
1998: Google crawler
S. Brin and L. Page. The anatomy of a large-scale hypertextual Web search engine. In Proceedings of the 7th World Wide Web Conference, pages 107-117, 1998.
1999: Mercator
A. Heydon and M. Najork. Mercator: A scalable, extensible web crawler. World Wide Web, 2(4):219-229, 1999.
Research on crawlers
2001: WebFountain crawler
J. Edwards, K. S. McCurley, and J. A. Tomlin. An adaptive model for optimizing performance of an incremental web crawler. In Proceedings of the 10th International World Wide Web Conference, pages 106-113, May 2001.
2002:
Cho and Garcia-Molina's crawler
J. Cho and H. Garcia-Molina. Parallel crawlers. In Proceedings of the 11th International World Wide Web Conference, 2002.
UbiCrawler
P. Boldi, B. Codenotti, M. Santini, and S. Vigna. UbiCrawler: A scalable fully distributed web crawler. In Proceedings of the 8th Australian World Wide Web Conference, July 2002.
Research on crawlers
2002: Shkapenyuk and Suel's crawler
V. Shkapenyuk and T. Suel. Design and implementation of a high-performance distributed web crawler. In IEEE International Conference on Data Engineering (ICDE), Feb. 2002.
2004: Carlos Castillo
Castillo, C. Effective Web Crawling. PhD thesis, University of Chile, November 2004.
2005: DynaBot
Daniel Rocco, James Caverlee, Ling Liu, Terence Critchlow. Exploiting the Deep Web with DynaBot: Matching, Probing, and Ranking. In Special interest tracks and posters of the 14th International Conference on World Wide Web, May 2005.
2006: ?
Crawler basic algorithm
1. Remove a URL from the unvisited URL list
2. Determine the IP address of its host name
3. Download the corresponding document
4. Extract any links contained in it
5. If a URL is new, add it to the list of unvisited URLs
6. Process the downloaded document
7. Go back to step 1
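Shown below is a minimal Python sketch of this loop (an illustration, not code from the slides): the unvisited URL list is a FIFO queue, so the crawl proceeds breadth-first, and process() is a hypothetical placeholder for step 6.

import socket
from collections import deque
from html.parser import HTMLParser
from urllib.parse import urljoin, urlparse
from urllib.request import urlopen

class LinkExtractor(HTMLParser):
    """Collects the href attribute of every <a> tag (step 4)."""
    def __init__(self):
        super().__init__()
        self.links = []
    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)

def process(url, page):
    # Placeholder for step 6: indexing / analysis of the downloaded page.
    print(url, len(page), "bytes")

def crawl(seed_urls, max_pages=100):
    frontier = deque(seed_urls)              # the unvisited URL list
    visited = set()
    while frontier and len(visited) < max_pages:
        url = frontier.popleft()             # 1. remove a URL from the list
        if url in visited:
            continue
        try:
            socket.gethostbyname(urlparse(url).hostname)    # 2. resolve the host's IP
            page = urlopen(url, timeout=5).read().decode("utf-8", "replace")   # 3. download
        except Exception:
            continue                         # skip unreachable or malformed URLs
        visited.add(url)
        extractor = LinkExtractor()
        extractor.feed(page)                 # 4. extract the links contained in it
        for link in extractor.links:
            absolute = urljoin(url, link)
            if absolute not in visited:
                frontier.append(absolute)    # 5. add new URLs to the unvisited list
        process(url, page)                   # 6. process the downloaded document
    return visited                           # 7. the while loop goes back to step 1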
Single Crawler
[Flowchart: the single crawler's crawling loop. Initialize the URL list with the starting URLs; until the termination condition is met (or no more URLs remain), pick a URL from the URL list, fetch and parse the page, and add newly discovered URLs to the URL list.]
Multithreaded Crawler
[Diagram: several identical threads share one URL list. Each thread locks the URL list, picks a URL, unlocks the list, fetches and parses the page, then locks the list again to add the newly found URLs and to check for termination before repeating.]
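Below is a hedged Python sketch of the locking pattern in this diagram: each thread locks the shared URL list to pick a URL, releases the lock while fetching and parsing, then re-locks the list to add new URLs and check for termination. fetch_page() and extract_links() are stubs standing in for the download and parse steps, and the termination test is deliberately simplified.

import threading
from collections import deque

def fetch_page(url):
    # Stub: a real thread would download the page here.
    return ""

def extract_links(base_url, page):
    # Stub: a real thread would parse <a href> links here.
    return []

url_list = deque(["http://example.com/"])    # shared unvisited URL list
seen = set(url_list)
url_lock = threading.Lock()
MAX_PAGES = 50
pages_done = 0

def worker():
    global pages_done
    while True:
        with url_lock:                       # lock URL list
            if pages_done >= MAX_PAGES or not url_list:
                return                       # check for termination (simplified)
            url = url_list.popleft()         # pick URL from list
            pages_done += 1
        # the URL list is unlocked while the slow network work happens
        page = fetch_page(url)               # fetch page
        links = extract_links(url, page)     # parse page
        with url_lock:                       # lock URL list again
            for link in links:
                if link not in seen:
                    seen.add(link)
                    url_list.append(link)    # add newly found URLs

threads = [threading.Thread(target=worker) for _ in range(4)]
for t in threads:
    t.start()
for t in threads:
    t.join()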
Parallel Crawler
[Diagram: several crawling processes (C-procs) connected to the Internet and to each other over a local connection, sharing the queues of URLs to visit and writing into a common store of collected pages.]
Crawl order: breadth-first search
Robot Protocol
Specifies the parts of a web site that a crawler should not visit.
Placed at the root of a web site as robots.txt.

# robots.txt for http://somehost.com/
User-agent: *
Disallow: /cgi-bin/
Disallow: /registration    # Disallow robots on registration page
Disallow: /login
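Python's standard library can parse and honor such a file; the sketch below feeds it the hypothetical somehost.com rules shown above (a real crawler would point set_url() at the live robots.txt and call read()).

from urllib import robotparser

rp = robotparser.RobotFileParser()
# Parse the example rules above; for a live site one would instead call
# rp.set_url("http://somehost.com/robots.txt") followed by rp.read().
rp.parse("""
User-agent: *
Disallow: /cgi-bin/
Disallow: /registration
Disallow: /login
""".splitlines())

print(rp.can_fetch("MyCrawler", "http://somehost.com/index.html"))   # True
print(rp.can_fetch("MyCrawler", "http://somehost.com/cgi-bin/foo"))  # False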
Search Engine: architecture
[Diagram: crawlers fetch pages from the WWW into a page repository; the indexer module and the collection analysis module build the indexes (text, structure, utility); the query engine answers client queries using the indexes and returns results ordered by the ranking module.]
Search Engine: major components
Crawlers
Collect documents by recursively fetching links from a set of starting pages.
Each crawler has different policies, so the pages indexed by various search engines differ.
The Indexer
Processes pages, decides which of them to index, and builds various data structures representing the pages (inverted index, web graph, etc.); the representations differ among search engines (a small inverted-index sketch follows below).
Might also build additional structures (e.g., LSI).
The Query Processor
Processes user queries and returns matching answers in an order determined by a ranking algorithm.
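As a small illustration of the indexer's central data structure, here is a sketch that builds an inverted index over a few toy documents (the documents and ids are invented for the example).

from collections import defaultdict

docs = {
    1: "web crawlers exploit the graph structure of the web",
    2: "pagerank measures the importance of a web page",
    3: "hits computes authority and hub scores",
}

inverted_index = defaultdict(set)            # term -> set of document ids
for doc_id, text in docs.items():
    for term in text.lower().split():
        inverted_index[term].add(doc_id)

print(sorted(inverted_index["web"]))   # [1, 2]
print(sorted(inverted_index["hub"]))   # [3]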
Issues on crawlers
1. General architecture
2. What pages should the crawler download?
3. How should the crawler refresh pages?
4. How should the load on the visited web sites be minimized?
5. How should the crawling process be parallelized?
Web Pages Analysis
Content-based analysis
Based on the words in documents.
Each document or query is represented as a term vector.
E.g., the vector space model, tf-idf (sketched below).
Connectivity-based analysis
Uses the hyperlink structure.
Used to measure the importance of web pages.
E.g., PageRank, HITS.
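A minimal sketch of the content-based approach: tf-idf weighted term vectors compared by cosine similarity (the vector space model); the documents are toy examples, not data from the slides.

import math
from collections import Counter

docs = [
    "web crawlers fetch web pages",
    "pagerank ranks web pages by importance",
    "hyperlink analysis uses the link structure",
]

def tf_idf_vectors(documents):
    tokenized = [d.lower().split() for d in documents]
    n = len(tokenized)
    df = Counter(term for doc in tokenized for term in set(doc))   # document frequency
    vectors = []
    for doc in tokenized:
        tf = Counter(doc)                                          # term frequency
        vectors.append({t: tf[t] * math.log(n / df[t]) for t in tf})
    return vectors

def cosine(u, v):
    dot = sum(u[t] * v.get(t, 0.0) for t in u)
    norm = (math.sqrt(sum(w * w for w in u.values()))
            * math.sqrt(sum(w * w for w in v.values())))
    return dot / norm if norm else 0.0

vecs = tf_idf_vectors(docs)
print(cosine(vecs[0], vecs[1]))   # > 0: the documents share "web" and "pages"
print(cosine(vecs[0], vecs[2]))   # 0.0: no terms in common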
Hyperlink Analysis
Exploits the hyperlink structure of web pages to find relevant and important pages for a user query.
Assumptions:
1. A hyperlink from page A to page B is a recommendation of page B by the author of page A.
2. If page A and page B are connected by a hyperlink, they might be on the same topic.
Used for crawling, ranking, computing the geographic scope of a web page, finding mirrored hosts, computing statistics of web pages and search engines, and web page categorization.
Hyperlink Analysis
Most popular methods:
HITS (1998)
(Hypertext Induced Topic Search)
By Jon Kleinberg
PageRank (1998)
By Lawrence Page & Sergey Brin,
Google's founders
HITS
Involves two steps:
1. Build a neighborhood graph N related to the query terms.
2. Compute authority and hub scores for each document in N, and present the two ranked lists of the most authoritative and the most hub-like documents.
HITS
freedom : term 1 → doc 3, doc 117, doc 3999
...
registration : term 10 → doc 15, doc 3, doc 101, doc 19, doc 1199, doc 280
faculty : term 11 → doc 56, doc 94, doc 31, doc 3
...
graduation : term m → doc 223
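A hedged sketch of step 1 of HITS, building the neighborhood graph N from posting lists like the ones above: start from the root set of documents containing the query terms, then add pages the root set links to and pages that link into it. The posting lists and link structure below are toy data shaped after this example, not the actual collection.

# Toy posting lists and link structure (assumed for illustration only).
inverted_index = {
    "registration": {15, 3, 101, 19, 1199, 280},
    "faculty": {56, 94, 31, 3},
}
out_links = {3: {15, 101}, 56: {3}, 94: {19}, 673: {3},
             15: set(), 101: set(), 19: set(), 31: set(),
             1199: set(), 280: set()}

def neighborhood(query_terms):
    root = set()
    for term in query_terms:
        root |= inverted_index.get(term, set())   # documents containing the terms
    graph = set(root)
    for doc, targets in out_links.items():
        if doc in root:
            graph |= targets                      # pages the root set points to
        if targets & root:
            graph.add(doc)                        # pages pointing into the root set
    return graph

print(sorted(neighborhood(["registration", "faculty"])))
# [3, 15, 19, 31, 56, 94, 101, 280, 673, 1199]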
HITS
[Diagram: the neighborhood graph built from these postings, containing documents 3, 15, 19, 31, 56, 94, 101, 673 and 1199; arrows into a document count toward its indegree, arrows out of it toward its outdegree.]
HITS
HITS defines authorities and hubs:
An authority is a document with several inlinks.
A hub is a document that has several outlinks.
[Diagram: an authority node with several incoming links, and a hub node with several outgoing links.]
HITS computation
Good authorities are pointed to by good hubs
Good hubs point to good authorities
Page i has both an authority score x_i and a hub score y_i:

$$x_i^{(k)} = \sum_{j\,:\,e_{ji} \in E} y_j^{(k-1)}, \qquad y_i^{(k)} = \sum_{j\,:\,e_{ij} \in E} x_j^{(k)}, \qquad k = 1, 2, 3, \ldots$$

E = the set of all directed edges in the web graph
e_ij = the directed edge from node i to node j
Given initial authority scores x_i^(0) and hub scores y_i^(0).
HITS computation
The updates

$$x_i^{(k)} = \sum_{j\,:\,e_{ji} \in E} y_j^{(k-1)}, \qquad y_i^{(k)} = \sum_{j\,:\,e_{ij} \in E} x_j^{(k)}, \qquad k = 1, 2, 3, \ldots$$

can be written using the adjacency matrix L of the directed web graph:
L_ij = 1 if there exists an edge from node i to node j, and 0 otherwise.
HITS computation
In matrix form:

$$x^{(k)} = L^T y^{(k-1)} \qquad \text{and} \qquad y^{(k)} = L\, x^{(k)}$$

[Example: a small directed web graph on documents d1, d2, d3 and d4 and its 4 x 4 adjacency matrix L.]
HITS computation
1. Initialize y^(0) = e, where e is a column vector of all ones
2. Until convergence, do:
   x^(k) = L^T y^(k-1)
   y^(k) = L x^(k)
   k = k + 1
   Normalize x^(k) and y^(k)

Substituting one update into the other gives

$$x^{(k)} = L^T L\, x^{(k-1)}, \qquad y^{(k)} = L L^T y^{(k-1)}$$

L^T L = authority matrix
L L^T = hub matrix

Computing the authority vector x and hub vector y can be viewed as finding the dominant right-hand eigenvectors of L^T L and L L^T.
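This iteration is a power method; a small NumPy sketch of it is shown below (an illustration, not the original formulation's code), using 1-norm normalization as in the example that follows.

import numpy as np

def hits(L, iterations=100):
    """Power iteration for HITS on adjacency matrix L (L[i, j] = 1 iff i links to j)."""
    n = L.shape[0]
    y = np.ones(n)              # y(0) = e, a vector of all ones
    x = np.zeros(n)
    for _ in range(iterations):
        x = L.T @ y             # authority update: x(k) = L^T y(k-1)
        y = L @ x               # hub update:       y(k) = L x(k)
        x = x / x.sum()         # normalize (1-norm)
        y = y / y.sum()
    return x, y                 # dominant eigenvectors of L^T L and L L^T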
HITS Example
[Figure: the neighborhood graph on documents 1, 2, 3, 5, 6 and 10, with edges 1→3, 1→6, 2→1, 3→6, 6→3, 6→5 and 10→6.]

L, rows and columns ordered 1, 2, 3, 5, 6, 10 (L_ij = 1 iff document i links to document j):

      1  2  3  5  6  10
 1  [ 0  0  1  0  1  0 ]
 2  [ 1  0  0  0  0  0 ]
 3  [ 0  0  0  0  1  0 ]
 5  [ 0  0  0  0  0  0 ]
 6  [ 0  0  1  1  0  0 ]
10  [ 0  0  0  0  1  0 ]
HITS Example
Authority and hub matrices:

L^T L (authority matrix), rows and columns ordered 1, 2, 3, 5, 6, 10:

      1  2  3  5  6  10
 1  [ 1  0  0  0  0  0 ]
 2  [ 0  0  0  0  0  0 ]
 3  [ 0  0  2  1  1  0 ]
 5  [ 0  0  1  1  0  0 ]
 6  [ 0  0  1  0  3  0 ]
10  [ 0  0  0  0  0  0 ]

L L^T (hub matrix):

      1  2  3  5  6  10
 1  [ 2  0  1  0  1  1 ]
 2  [ 0  1  0  0  0  0 ]
 3  [ 1  0  1  0  0  1 ]
 5  [ 0  0  0  0  0  0 ]
 6  [ 1  0  0  0  2  0 ]
10  [ 1  0  1  0  0  1 ]
HITS Example
The normalized principal eigenvectors with the authority scores x and hub scores y are:

x^T = ( 0   0   .3660   .1340   .5   0 )
y^T = ( .3660   0   .2113   0   .2113   .2113 )

Authority ranking = ( 6  3  5  1  2  10 )
Hub ranking = ( 1  3  6  10  2  5 )
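As a check, running the same power iteration on the adjacency matrix L reconstructed above reproduces these scores (the values in the comments are approximate).

import numpy as np

# Rows and columns ordered 1, 2, 3, 5, 6, 10, as in the example above.
L = np.array([[0, 0, 1, 0, 1, 0],    # 1 -> 3, 6
              [1, 0, 0, 0, 0, 0],    # 2 -> 1
              [0, 0, 0, 0, 1, 0],    # 3 -> 6
              [0, 0, 0, 0, 0, 0],    # 5
              [0, 0, 1, 1, 0, 0],    # 6 -> 3, 5
              [0, 0, 0, 0, 1, 0]])   # 10 -> 6

y = np.ones(6)                       # y(0) = e
for _ in range(100):
    x = L.T @ y
    y = L @ x
    x = x / x.sum()                  # 1-norm normalization
    y = y / y.sum()

print(np.round(x, 4))   # approx [0, 0, 0.366, 0.134, 0.5, 0]
print(np.round(y, 4))   # approx [0.366, 0, 0.2113, 0, 0.2113, 0.2113]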
Strengths and Weaknesses of HITS
Strengths
Dual rankings
Weaknesses
Query-dependence
Hub scores can be easily manipulated
It is possible that a very authoritative yet off-topic document is linked to a document containing the query terms (topic drift)
PageRank
Is a numerical value that represents how important a page is.
A page casts a vote for each page it links to; the more votes a page receives, the more important the page.
The importance of the page that casts a vote determines how important that vote is.
The importance score of a page is calculated from the votes cast for that page.
Used by Google.
PageRank
$$PR(A) = (1 - d) + d \left[ \frac{PR(t_1)}{C(t_1)} + \frac{PR(t_2)}{C(t_2)} + \cdots + \frac{PR(t_n)}{C(t_n)} \right]$$

Where:
PR(A) = PageRank of page A
d = damping factor, usually set to 0.85
t1, t2, ..., tn = pages that link to page A
C( ) = the number of outlinks of a page

In a simpler way:
PR(A) = 0.15 + 0.85 × [a share of the PageRank of every page that links to A]
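A minimal sketch of the iterative calculation behind this formula, in the non-normalized form used in the following examples (every page starts at PR = 1, and a page with no inlinks settles at 1 - d = 0.15); out_links maps each page to the set of pages it links to.

def pagerank(out_links, d=0.85, iterations=100):
    pages = list(out_links)
    pr = {p: 1.0 for p in pages}                     # each page starts with PR = 1
    for _ in range(iterations):
        new_pr = {}
        for p in pages:
            share = sum(pr[q] / len(out_links[q])    # PR(q) / C(q)
                        for q in pages if p in out_links[q])
            new_pr[p] = (1 - d) + d * share          # PR(p) = (1 - d) + d * [...]
        pr = new_pr
    return pr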
PageRank Example
[Figure: three pages A, B and C with no links between them, each starting with PR = 1.]
Each page is assigned an initial PageRank of 1.
The site's maximum PageRank is 3.
With no links between the pages:
PR(A) = 0.15
PR(B) = 0.15
PR(C) = 0.15
The total PageRank in the site is 0.45, seriously wasting most of its potential PageRank.
PageRank Example
[Figure: page A now links to page B; each page starts with PR = 1.]
After the first iteration:
PR(A) = 0.15
PR(B) = 1
PR(C) = 0.15
Page B's PageRank increases because page A has voted for page B.
PageRank Example
[Figure: page A links to page B.]
After 100 iterations:
PR(A) = 0.15
PR(B) = 0.2775
PR(C) = 0.15
The total PageRank in the site is 0.5775.
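Feeding this example (A links to B; nothing links to A or C) to the pagerank() sketch defined above reproduces the converged values.

links = {"A": {"B"}, "B": set(), "C": set()}
print(pagerank(links))
# approx {'A': 0.15, 'B': 0.2775, 'C': 0.15} -- total 0.5775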
PageRank Example
[Figure: pages A, B and C linked in a loop, each starting with PR = 1.]
No matter how many iterations are run, each page always ends up with PR = 1:
PR(A) = 1
PR(B) = 1
PR(C) = 1
This occurs when the pages link in a loop.
PageRank Example
[Figure: page A links to B and C, and B and C each link back to A; each page starts with PR = 1.]
After the first iteration:
PR(A) = 1.85
PR(B) = 0.575
PR(C) = 0.575
After 100 iterations:
PR(A) = 1.459459
PR(B) = 0.7702703
PR(C) = 0.7702703
The total PageRank is 3 (the maximum), so none is being wasted.
Page A has the higher PR.
PageRank Example
[Figure: as before, but page C now links to both A and B; each page starts with PR = 1.]
After the first iteration:
PR(A) = 1.425
PR(B) = 1
PR(C) = 0.575
After 100 iterations:
PR(A) = 1.298245
PR(B) = 0.999999
PR(C) = 0.7017543
Page C shares its vote between A and B.
Page A loses some value.
Dangling Links
[Figure: pages A, B and C illustrating a dangling link.]
A dangling link is a link to a page that has no outgoing links of its own, or a link to a page that has not been indexed.
Google removes such links shortly after the calculations start and reinstates them shortly before the calculations are finished.
PageRank Demo
http://homer.informatics.indiana.edu/cgi-bin/pagerank/cleanup.cgi
PageRank Implementation
freedom : term 1 → doc 3, doc 117, doc 3999
...
registration : term 10 → doc 101, doc 87, doc 1199
faculty : term 11 → doc 280, doc 85
...
graduation : term m → doc 223
PageRank Implementation
The query result on terms 10 and 11 is {101, 280, 85, 87, 1199}
PR(87) = 0.3751
PR(85) = 0.2862
PR(101) = 0.04151
PR(280) = 0.03721
PR(1199) = 0.0023
Document 87 is the most important of the relevant documents.
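Because the scores are query-independent, the query step only needs to sort the matching documents by their precomputed PageRank; a tiny sketch using the numbers above:

pagerank_scores = {87: 0.3751, 85: 0.2862, 101: 0.04151,
                   280: 0.03721, 1199: 0.0023}
result_set = {101, 280, 85, 87, 1199}          # documents matching terms 10 and 11

ranked = sorted(result_set, key=lambda doc: pagerank_scores[doc], reverse=True)
print(ranked)   # [87, 85, 101, 280, 1199]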
Strengths and Weaknesses of PageRank
Weaknesses
The topic drift problem, due to the importance of determining an accurate relevancy score.
Much work, thought and heuristics must be applied by Google engineers to determine the relevancy score; otherwise, the PageRank-ordered retrieval list might often be useless to a user.
Question: why does importance serve as such a good proxy for relevance?
Some of these questions might be ...
Strengths and Weaknesses of PageRank
Strengths
The use of importance rather than relevance.
By measuring importance, query-dependence is not an issue.
Query-independence.
Faster retrieval.
HITS vs PageRank
HITS                                   | PageRank
Connectivity-based analysis            | Connectivity-based analysis
Eigenvector & eigenvalue calculation   | Eigenvector & eigenvalue calculation
Query-dependence                       | Query-independence
Relevance                              | Importance
Q&A
Thank you