
CIS 555 FINAL PROJECT

POOGLE: PENN’S FAVORITE SEARCH ENGINE


Sanjay Paul, Levi Cai, Dan Kim, and Federico Nusymowicz
{sanjayp, cail, dki, fnusy}@seas.upenn.edu

Professor Andreas Haeberlen


ahae@seas.upenn.edu

ABSTRACT

Poogle is a search engine that runs on a distributed network of AWS machines. This paper
describes, analyzes and evaluates the Poogle system.

I. INTRODUCTION
A. Project Goals
We built Poogle with certain objectives in mind.
Our primary goals included:
• Efficiently crawling a large corpus
• Responding to several concurrent queries in a
timely fashion
• Providing a clean user interface
• Serving up high-quality search results
• Developing a highly scalable system able to
operate across a large network of peers
B. High-Level Approach
Our system operates in three distinct phases.
Phase 1 consists of a process that crawls the web, and then indexes pages as they become available. The second phase calculates PageRank for all indexed content and readies the system to answer queries efficiently. Finally, a third process instantiates a server that answers user queries by calculating document rankings and returning search results in order of relevance. Figure 1 depicts the high-level approach.

Figure 1: High-Level Approach

Each of the processes relies upon a distributed network of peer nodes. Individual nodes are responsible for both running the modules and for storing portions of the database. We explain our architecture in further detail in Section II.

C. Project Timeline
• 4/18/2012: crawler operational.
• 4/21/2012: indexer operational.
• 4/28/2012: crawler and indexer integrated; combined module operational across variable-sized node networks.
• 4/29/2012: user interface completed.
• 5/3/2012: PageRank operational.
• 5/5/2012: server module functional.
• 5/6/2012: tuned ranking function.
• 5/8/2012: finished integrating components.

D. Division of Labor
We worked together as much as we could and pair-programmed relatively often, which helped ease component integration towards the end of the project. Each team member contributed to most modules. We also assigned individual accountability for certain specific tasks:
Sanjay Paul: crawler functionality and Pastry ring management.
Levi Cai: indexer functionality and Pastry network testing.
Dan Kim: user interface and PageRank module.
Federico Nusymowicz: server module and BerkeleyDB/AMI management.

II. ARCHITECTURE
A. Database
We store our data persistently using BerkeleyDB.
Our implementation distributes BDB data
structures across several FreePastry peer nodes,
where each peer node takes responsibility for a
subset of the data. Our most relevant data
structures include:
• Pages, which contain the raw XML/HTML content downloaded from a given URL, PageRank information, and a list of the page’s outgoing links. Each peer node takes responsibility for a subset of URL hosts.
• HitLists, which mimic the Google data structure going by the same name [1]. Each HitList corresponds to a specific word within a Page. HitLists also count the number of times the term occurs within a document, maintain position information for each word occurrence, and hold additional term-ranking information, such as whether the term was a ‘fancy hit’ (i.e. a title, header, or meta information).
• HitBins, which aggregate all the HitLists for a specific word. Individual nodes in the network take responsibility for storing a subset of word HitBins.
As shown in Figure 1, BDB data structures serve as the main point of interface between our three component processes.
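To make the storage layout concrete, the following is a minimal, illustrative sketch of these three record types in Java. The class and field names (e.g. outgoingLinks, fancyHit) are shorthand for this paper, not the exact identifiers used in our code.

import java.util.List;

// Illustrative sketch of the three record types persisted in BDB.
class Page {
    String url;                  // key; each peer node owns a subset of URL hosts
    String rawContent;           // raw XML/HTML downloaded from the URL
    double pageRank;             // filled in during Phase 2
    List<String> outgoingLinks;  // links extracted by the crawler
}

class HitList {
    String word;                 // the term this HitList describes
    String pageUrl;              // the Page the term occurs in
    int frequency;               // number of occurrences within the document
    List<Integer> positions;     // position of each occurrence
    boolean fancyHit;            // term appeared in a title, header, or meta tag
    double tf;                   // TF factor, computed by the indexer
}

class HitBin {
    String word;                 // one bin per word; stored on the node owning the word's hash
    List<HitList> hitLists;      // all HitLists for the word, later sorted by TF factor
}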
B. Crawler / Indexer
Our crawler’s architecture borrows heavily from Mercator’s design [2], with some simplifications made in the interest of reducing development overhead. Each crawler node operates as follows (a sketch of the loop appears after the list):
1. Poll the local BDB’s URL queue.
2. Enforce politeness; if the URL’s host was recently pinged, move the URL to the back of the queue and repeat step 1.
3. Download the HTML/XML content and store it as a Page in the local BDB.
4. Extract all links and route each one to the node responsible for the URL’s host.
5. Go back to step 1.
Whenever a crawler node receives a link, it checks whether the URL is a duplicate; if so, the link gets discarded. Otherwise the link gets added to the back of the node’s URL queue.
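The following is a simplified, compilable sketch of that loop. The queue, politeness check, downloader, and routing helpers here are illustrative stand-ins for our BDB-backed frontier, per-host timers, HTTP client, and Pastry routing, not the actual class names in our code.

import java.util.*;

// Simplified, illustrative sketch of one crawler node's main loop.
class CrawlerNodeSketch {
    Deque<String> urlQueue = new ArrayDeque<>();       // local URL frontier (BDB-backed in Poogle)
    Set<String> seenUrls = new HashSet<>();            // duplicate filter for received links
    Map<String, String> localPages = new HashMap<>();  // stand-in for the local BDB Page store

    void crawlLoop() {
        while (true) {
            String url = urlQueue.poll();              // 1. poll the local URL queue
            if (url == null) break;
            if (!isPolite(url)) {                      // 2. host contacted too recently?
                urlQueue.addLast(url);                 //    move it to the back and repeat step 1
                continue;
            }
            String html = download(url);               // 3. download the HTML/XML content
            localPages.put(url, html);                 //    and store it as a Page locally
            for (String link : extractLinks(html)) {   // 4. route each extracted link to the
                routeToOwner(link);                    //    node responsible for its host
            }
        }                                              // 5. go back to step 1
    }

    // Owner-side handler: duplicates are discarded, new links join the back of the queue.
    void onLinkReceived(String link) {
        if (seenUrls.add(link)) {
            urlQueue.addLast(link);
        }
    }

    // Stand-in helpers (hypothetical; the real versions use HTTP, robots rules, and Pastry).
    boolean isPolite(String url) { return true; }
    String download(String url) { return ""; }
    List<String> extractLinks(String html) { return Collections.emptyList(); }
    void routeToOwner(String link) { onLinkReceived(link); }
}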
Figure 2: 2-Node Crawler/Indexer

As soon as pages become available, the indexer parses them one by one. Parsing entails forming HitLists for each word in the content body and then calculating the term’s TF factor. Figure 2 depicts a 2-node crawler & indexer process.
We designed our crawler / indexer with fault tolerance in mind. The process regularly checkpoints by syncing with disk, thereby allowing the crawler and indexer to pick up from where they stopped in the event of a crash. The process is also highly scalable and stable – we successfully ran it across 10 nodes and crawled continuously with no problems.
Our Phase 1 architecture’s main drawback regarding scalability was that new nodes could not dynamically join or leave the process without impacting the overall network’s execution. A possible future improvement could involve periodic checks to dynamically redistribute Pages.
Once the process receives its shutdown signal, the crawler stops running and the indexer sorts HitBins by TF factor, in order to later improve our servers’ query response speeds.
C. PageRank
After halting the crawler / indexer process, our system moves on to compute PageRank. The PageRank calculation begins by aggregating Page data at a single master node. This allowed for better ordering of our results, enabling us to return more authoritative, reliable sources. We implemented PageRank using an iterative, pseudo-distributed Hadoop job; the results from crawling were aggregated and fed into Hadoop.
Since calculating a URL’s PageRank relies on the rank of the pages that link into that URL, and since outgoing links in turn contribute to other pages’ ranks, we needed to iterate a large number of times in order to come to an acceptable result. Our map algorithm accepted URLs along with their associated page ranks and outgoing links. Then, in the combining stage, we aggregated an entry’s PageRank by adding the scaled ranks of all incoming URLs. We iteratively used these entries in the reducing stage to calculate page ranks until reaching some degree of convergence.
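The in-memory sketch below illustrates the per-iteration update this job effectively performs. It is a simplification under stated assumptions: the damping factor of 0.85 is the conventional PageRank constant rather than a value reported here, and the link data would really come from the aggregated Page records rather than an in-memory map.

import java.util.*;

// Illustrative, in-memory sketch of one PageRank iteration.
class PageRankSketch {
    static Map<String, Double> iterate(Map<String, List<String>> outLinks,
                                       Map<String, Double> ranks,
                                       double damping) {
        Map<String, Double> next = new HashMap<>();
        // Start every known page at the "random jump" baseline.
        for (String url : outLinks.keySet()) {
            next.put(url, 1.0 - damping);
        }
        // Each page distributes its current rank evenly over its out-links;
        // each target accumulates the scaled ranks of its in-links.
        for (Map.Entry<String, List<String>> e : outLinks.entrySet()) {
            List<String> targets = e.getValue();
            if (targets.isEmpty()) continue;
            double share = damping * ranks.get(e.getKey()) / targets.size();
            for (String t : targets) {
                next.merge(t, share, Double::sum);
            }
        }
        return next;
    }

    // Fixed-iteration driver, mirroring the "set number of iterations" choice described below.
    static Map<String, Double> run(Map<String, List<String>> outLinks, int iterations) {
        Map<String, Double> ranks = new HashMap<>();
        for (String url : outLinks.keySet()) ranks.put(url, 1.0);
        for (int i = 0; i < iterations; i++) {
            ranks = iterate(outLinks, ranks, 0.85);
        }
        return ranks;
    }
}

The map, combine, and reduce stages described above distribute this same accumulation over the crawled link data.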
Defining the convergence function proved to be a fairly non-trivial task. The iterative reduce step took a long time to run (hours), and as an additional challenge, determining the proper threshold for ‘convergence’ proved to be more of an art than a science. In the interest of time, we chose a less sophisticated approach: a fixed number of iterations. We found that although a fixed iteration count proved less accurate than a rigid ‘diff’-based definition of convergence, with a sufficient number of iterations the end values varied little enough to consider them fairly accurate PageRanks. As an added benefit, the fixed iteration count helped us predict the process’s runtime.
Once the PageRank algorithm completed, we distributed PageRank scores across all network nodes and appended them to HitLists in order to later improve server response time.
D. Server
After all HitLists were updated with their relevant PageRank scores, we switched each node to server mode. Servers then began actively answering queries. More specifically, servers:
1. Listened for query requests.
2. Split queries by term and requested the corresponding HitBins from other servers.
3. Waited for HitBins to return and cached the HitBins for the most popular terms. Servers only retrieved the top 10,000 entries (based on each entry’s TF factor) from a given HitBin in order to improve retrieval speed.
4. Calculated document rankings based on an augmented TF-IDF vector model.
5. Returned the query results.
We implemented a Tomcat servlet in order to route queries to our servers. The queries themselves were then hashed and routed through the Pastry ring, thus balancing computing load across all server nodes and improving mean response time.
Routing queries through Pastry provided the unexpected side effect of fault tolerance: even if a server node crashed, the search engine still remained operational.
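A condensed sketch of this per-query flow follows. The cache, the Pastry lookup, and the ranking call are hypothetical stand-ins (fetchTopEntries, rankByTfIdfAndPageRank) rather than the actual method names in our server code.

import java.util.*;

// Condensed, illustrative sketch of how a server node handles one query.
class QueryServerSketch {
    Map<String, HitBin> hitBinCache = new HashMap<>(); // HitBin cache (eviction policy omitted)

    List<Result> answer(String query) {
        List<HitBin> bins = new ArrayList<>();
        for (String term : query.toLowerCase().split("\\s+")) {   // 2. split the query by term
            if (term.isEmpty()) continue;
            HitBin bin = hitBinCache.get(term);
            if (bin == null) {
                bin = fetchTopEntries(term, 10_000);              // 3. request the bin from the
                hitBinCache.put(term, bin);                       //    owning server, top entries only
            }
            bins.add(bin);
        }
        return rankByTfIdfAndPageRank(query, bins);               // 4-5. rank and return results
    }

    // Stand-ins: route the term's hash through the Pastry ring to retrieve its bin,
    // then apply the augmented TF-IDF vector model described in the next subsection.
    HitBin fetchTopEntries(String term, int limit) { return new HitBin(); }
    List<Result> rankByTfIdfAndPageRank(String query, List<HitBin> bins) { return Collections.emptyList(); }

    static class HitBin { }
    static class Result { }
}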
E. Ranking
Every returned HitBin contains HitLists sorted by TF factor, which the indexer computes as:
TF = freq(word, doc) / max[freq(any word, doc)]
…where word counts are double-weighted for each of the ‘fancy hits’ described earlier. Servers also gather the HitBin’s total size (n) when they retrieve the bin’s top 10,000 entries. Additionally, servers communicate with each other at startup in order to calculate the total size of the corpus (N). To determine a HitList’s TF-IDF weight, servers calculate:
TF-IDF = TF * log(N/n)
Servers then weigh each query term according to the formula:
wquery(word) = 0.5 + [0.5 * freq(word, query) / max(freq(any word, query))] * log(N/n)
For each Page referred to in one of the retrieved HitLists, servers then calculate the cosine similarity between the query and the page, and scale it by the Page’s PageRank to obtain the final ranking score.
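As a worked illustration, the sketch below scores a single candidate page against a query using the formulas exactly as given above. The input maps (per-term TF factors for the page, per-term query counts, per-term HitBin sizes) are simplified stand-ins for what our servers actually read from the retrieved HitLists.

import java.util.*;

// Illustrative scoring sketch following the formulas above.
class RankingSketch {
    static double score(Map<String, Double> docTf, Map<String, Integer> queryFreq,
                        Map<String, Integer> n, long N, double pageRank) {
        int maxQueryFreq = Collections.max(queryFreq.values());
        double dot = 0, docNorm = 0, queryNorm = 0;
        for (String term : queryFreq.keySet()) {
            double idf = Math.log((double) N / Math.max(1, n.getOrDefault(term, 1)));
            double docWeight = docTf.getOrDefault(term, 0.0) * idf;        // TF-IDF = TF * log(N/n)
            double queryWeight = 0.5                                        // wquery(word) as given above
                    + (0.5 * queryFreq.get(term) / maxQueryFreq) * idf;
            dot += docWeight * queryWeight;
            docNorm += docWeight * docWeight;
            queryNorm += queryWeight * queryWeight;
        }
        if (dot == 0) return 0;
        double cosine = dot / (Math.sqrt(docNorm) * Math.sqrt(queryNorm));  // cosine similarity
        return cosine * pageRank;                                           // scale by PageRank
    }
}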
III. EVALUATION
A. Crawler / Indexer
The crawler and indexer were run as a coupled pair in a single JVM process on each of ten Amazon Machine Instances on the EC2 cloud. Each process was allocated a virtual memory size bounded between 1 and 2 MB (min/max). Overflow data was pushed to the key-value store provided by the open-source Berkeley DB distribution, and top-level caching (as well as the caching provided by Berkeley DB) was exploited to improve performance. Resource allocation was distributed across the many performance-throttling resource sinks present in each process, primarily the set of duplicate encountered URLs (cached), the URL expansion frontier (breadth-first), and the space occupied by indexer entries en route between machine nodes and from main memory to disk storage. Each component was allocated an explicitly bounded capacity that varied by priority and relative speed.
The internal construction of the crawler-indexer reflected its outward performance characteristics: the crawler thread pool was relatively small (nine threads was found to be an optimal value) and the indexer thread pool much larger (thirty threads was used, though less testing was done to pinpoint an optimum). This follows naturally from the fact that much of the crawler’s operation is network-I/O bound due to intra-node URL passing and page downloading, with some unavoidable disk overhead due to a non-negligible cache miss rate (often >20%, but still tolerable due to locality). On analysis, the average crawl spent less than 10% of its time performing computations – the rest was blocked on I/O. By contrast, the indexer had much more opportunity to exploit parallelization due to its inherently compute-bound primary operations. One caveat was that indexed buckets needed to be shuffled between machine nodes, resulting in large messages being routed directly to nodes (based on hash).
Aside from using an SLRU cache replacement policy on important top-level objects, we achieved high performance by adopting a “route-to-final” policy: any object (e.g. an indexer entry or URL) that was to be pushed from a source to a sink within the system was buffered and hastily evicted to its final destination. The objective was to clear space in memory by making some bandwidth concessions, and to minimize time wasted on objects pending a push to their endpoint machine instance. This was an exceedingly effective practice, though it led, not surprisingly, to high-frequency message passing and overflow within the system. The FreePastry distributed hashing scheme used to coordinate the machines proved less tolerant of high network traffic than we anticipated, and robustness was strained by message queue overflows and locking exceptions due to timeouts, so we had to implement buffering, throttling, and node-rotation schemes. While buffering (with direct message passing) and throttling were fairly intuitive countermeasures, we also chose to cycle the messaging allowance by triggering nodes to flush their buffers to the system in sequence.
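As an illustration of the route-to-final idea, the sketch below buffers outbound objects per destination node and evicts each small batch to its final destination as soon as it forms. The class, the threshold, and the send helper are hypothetical; they stand in for the direct Pastry message routing described above.

import java.util.*;

// Rough sketch of the "route-to-final" policy: objects bound for another node are
// buffered per destination and flushed to that final destination eagerly.
class RouteToFinalSketch<T> {
    private final Map<String, List<T>> buffers = new HashMap<>(); // destination node -> pending objects
    private final int flushThreshold;

    RouteToFinalSketch(int flushThreshold) { this.flushThreshold = flushThreshold; }

    void push(String destinationNode, T object) {
        List<T> buf = buffers.computeIfAbsent(destinationNode, k -> new ArrayList<>());
        buf.add(object);
        if (buf.size() >= flushThreshold) {       // hastily evict: flush as soon as a small batch forms
            sendBatch(destinationNode, buf);
            buf.clear();
        }
    }

    // Stand-in for routing one large message directly to the owning node.
    void sendBatch(String destinationNode, List<T> batch) { /* network send */ }
}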
The crawler and indexer performed well above expectation, with an average throughput of 160 pages crawled and 117 pages indexed per minute. In a crawling run of approximately 100 minutes, we achieved an indexed corpus of roughly 128K pages. The story, however, was markedly different on an individual machine-to-machine basis. In short, many machines went massively under-utilized due to discrepancies caused by inconsistent hashing: whereas the machine nodes attempt to enforce a uniform hash, the distribution of hashed content (in this case, URLs and keywords) showed a perceptible skew. Future use of the crawler-indexer might improve this behavior by virtualizing nodes (allowing dynamic reassignment of hashes) and/or leveraging machine learning techniques to improve the system’s anticipated distribution of hashed content.

B. Server
Our system was designed to return results quickly, with much of the work done while pre-processing results. Each page had, on average, about 80 B of associated data; after the nearly 1.5-hour crawl and indexing session, the fastest server had nearly 2.2 GB of data stored. Retrievals for single-word queries were near-instantaneous regardless of whether the word was cached, and multi-word searches took only slightly longer.
The query “This is the coolest place on earth” returned in approximately 2.7 seconds, and when issued again it returned in 1.2 seconds on average thanks to the cache. Several other queries of 7+ words returned in similar time intervals, each with several hundred results. Single-word queries were much faster.

IV. LESSONS LEARNED
Building a search engine was an incredibly rich experience – there were countless opportunities for optimizations, tweaks, and improvements. Some of our ideas proved so interesting that we often had trouble focusing on completing the project’s most basic features. We focused so much on optimizing our crawler, for example, that by the end of the project we barely had time to refine our PageRank algorithm. If we had the chance to start over, we would definitely get all the basic features working at a basic level before starting to polish any part of our code base.
We also learned about the importance of clear interfaces. After our first few nights of programming together, we all thought we had a relatively thorough understanding of each component’s architecture and required data structures. After splitting the work, however, we quickly realized how wrong we were.
Lack of clearly specified interfaces ended up costing us countless hours of integration effort.

V. CONCLUSION
In summary, we unanimously concur that architecting this system was by far the most challenging task any of us has ever undertaken, but also among the most rewarding. We gained an appreciation for the wide breadth of challenges inherent to massively scaled system design and the solutions they necessitated. On a fundamental level, a search engine handily incorporates almost every relevant facet of distributed system design, ranging from parallelization considerations to robustness, scalability, and high-performance operation. The lessons we take away from the experience will undoubtedly stay with us well into the future.

REFERENCES
[1] S. Brin and L. Page. “The Anatomy of a Large-Scale Hypertextual Search Engine”. Computer Science Department, Stanford University, 1998.
[2] M. Najork and A. Heydon. “High-Performance Web Crawling”. Compaq SRC, Kluwer Academic Publishers, September 2001.
