Information Storage
and Retrieval
CS418
Search Engine Architecture
Lecture 2
Dr. Ebtsam AbdelHakam
ebtsamabd@gmail.com
Computer Science Dept.
Minia University
Requirements of Designing a
Search Engine
The two primary requirements of a search engine are:
• Effectiveness (quality): We want to be able to retrieve
the most relevant set of documents possible for a query.
• Efficiency (speed): We want to process queries from
users as quickly as possible.
Designing a Search Engine
Search engine design balances two factors:
‣ Effectiveness – accuracy of results, presentation
of results, absence of spam, good ad selection
‣ Efficiency / Performance – response time,
concurrency, disaster mitigation, security
issues.
These factors deeply impact the architecture of these
systems. Often the engineering solutions feed back
into research (NoSQL, MapReduce, etc.).
Search Engine Basic Building Blocks
Search engine components support two major functions:
1- The indexing process: builds the
structures that enable searching.
The index (an inverted index) is an efficient data structure that
represents the documents of a corpus and allows fast searching of
those documents.
2- The query process: uses those
structures (the index) and a person's query to produce a
ranked list of documents.
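As a minimal sketch of the two functions above, the following Python builds a toy inverted index and then answers a query from it. The function names and the three-document corpus are hypothetical, and the query step uses plain Boolean AND matching without ranking, which is a simplification of what a real search engine does.

```python
from collections import defaultdict

def build_inverted_index(docs):
    """Indexing process: map each term to the sorted IDs of documents containing it."""
    postings = defaultdict(set)
    for doc_id, text in docs.items():
        for term in text.lower().split():
            postings[term].add(doc_id)
    return {term: sorted(ids) for term, ids in postings.items()}

def search(index, query):
    """Query process (unranked): IDs of documents containing every query term."""
    terms = query.lower().split()
    if not terms:
        return []
    result = set(index.get(terms[0], []))
    for term in terms[1:]:
        result &= set(index.get(term, []))
    return sorted(result)

# Hypothetical three-document corpus, for illustration only.
corpus = {
    1: "search engines build an index",
    2: "the index enables fast search",
    3: "ranking uses a retrieval model",
}
index = build_inverted_index(corpus)
print(search(index, "fast search"))
```

The lookup touches only the posting lists of the query terms instead of scanning every document, which is why the index makes searching fast.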
Query process
1. User interaction
It supports creation and refinement of the user query and
displays the results.
2. Ranking
It uses the query and the indexes to create a ranked list of
documents.
3. Evaluation
It monitors and measures effectiveness and
efficiency. It is done primarily offline.
Query Process
(User Interaction)
• The user interaction component provides the interface between
the person doing the searching and the search engine.
Its three tasks are:
1- Accept the user's query, expressed in a defined query language, and
transform it into index terms.
- Query transformation: the user interface parses the user's query and
converts the search terms into a form acceptable as input to the query
engine, i.e. into index terms that appear in the index vocabulary.
- Spell checking and query suggestion: suggest improvements to the
user, or run alternative queries in the background.
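One possible sketch of query transformation and spelling suggestion follows. The vocabulary, the function names, and the similarity cutoff of 0.75 are assumptions made for illustration; the suggestion step uses `difflib.get_close_matches` from the Python standard library as a stand-in for a real spell checker.

```python
import difflib
import re

# Hypothetical index vocabulary; a real engine would read this from its index.
VOCABULARY = {"search", "engine", "index", "ranking", "retrieval", "query"}

def transform_query(raw_query, vocabulary=VOCABULARY):
    """Lowercase and tokenize the query, then separate index terms from unknown tokens."""
    tokens = re.findall(r"[a-z0-9]+", raw_query.lower())
    known = [t for t in tokens if t in vocabulary]
    unknown = [t for t in tokens if t not in vocabulary]
    return known, unknown

def suggest(token, vocabulary=VOCABULARY):
    """Return the closest vocabulary term for an unknown token, or None."""
    matches = difflib.get_close_matches(token, sorted(vocabulary), n=1, cutoff=0.75)
    return matches[0] if matches else None

known, unknown = transform_query("Search Engien, ranking!")
suggestions = {token: suggest(token) for token in unknown}
```

Tokens that are not in the vocabulary cannot match anything in the index, so they are the natural candidates for a "did you mean" suggestion.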
User Interaction
[Figure: query suggestion example (a prank)]
User Interaction Component
Its three tasks are:
2- Take the ranked list of documents from the search engine and
organize it into the results shown to the user.
‣ Displays the top-ranked results
‣ Generates snippets to show how queries match documents
‣ Highlights important words and passages
‣ Retrieves query-relevant advertising.
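Snippet generation and highlighting can be sketched as below. This is a simplified illustration, not how any particular engine does it: the function name, the fixed word window, and the `*...*` highlight markers are all assumptions.

```python
import re

def make_snippet(text, query_terms, window=5):
    """Return a passage around the first query-term match, with matches wrapped in *...*."""
    words = text.split()
    terms = {t.lower() for t in query_terms}

    def normalize(word):
        # Strip punctuation so that "corpus." still matches the term "corpus".
        return re.sub(r"\W+", "", word).lower()

    hit = next((i for i, w in enumerate(words) if normalize(w) in terms), None)
    if hit is None:
        return " ".join(words[: 2 * window])  # no match: fall back to the opening words
    start = max(0, hit - window)
    end = min(len(words), hit + window + 1)
    shown = [f"*{w}*" if normalize(w) in terms else w for w in words[start:end]]
    prefix = "... " if start > 0 else ""
    suffix = " ..." if end < len(words) else ""
    return prefix + " ".join(shown) + suffix

text = "The inverted index is a data structure that allows fast searching of the corpus."
print(make_snippet(text, ["index"], window=2))
```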
User Interaction Component
3- Finally, this component also provides a range of techniques for
refining the query so that it better represents the information
need.
‣ Query expansion adds terms related to the query terms (e.g.
synonyms, related entities)
‣ Relevance feedback runs an initial query, then uses the top-ranked
documents to expand the query for a second run
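The two refinement techniques above can be sketched as follows. The synonym table and function names are hypothetical, and the feedback step simply adds the most frequent new words from the top-ranked documents; a practical system would at least filter stop words and weight the added terms.

```python
from collections import Counter

# Hypothetical synonym table; real systems derive related terms from a thesaurus or logs.
SYNONYMS = {"car": ["automobile"], "fast": ["quick"]}

def expand_with_synonyms(terms, synonyms=SYNONYMS):
    """Query expansion: append related terms, keeping order and skipping duplicates."""
    expanded = list(terms)
    for term in terms:
        for related in synonyms.get(term, []):
            if related not in expanded:
                expanded.append(related)
    return expanded

def expand_with_feedback(terms, top_docs, n_new_terms=2):
    """Relevance feedback: add the most frequent new terms found in the top-ranked documents."""
    counts = Counter(
        word
        for doc in top_docs
        for word in doc.lower().split()
        if word not in terms
    )
    new_terms = [word for word, _ in counts.most_common(n_new_terms)]
    return list(terms) + new_terms

# The top-ranked documents would come from an initial run of the original query.
top_ranked = ["car engine repair", "car engine oil"]
second_query = expand_with_feedback(["car"], top_ranked, n_new_terms=1)
```

The expanded query is then submitted for the second run in place of the original one.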
Query Process
(Ranking)
Ranking Component
The ranking component is the core of the search engine.
• It takes the transformed query from the user interaction component
and generates a ranked list of documents using scores based on a
retrieval model.
• Ranking must be both efficient, since many queries may need to be
processed in a short time, and effective, since the quality of the
ranking determines whether the search engine accomplishes the
goal of finding relevant information.
The efficiency of ranking depends on the indexes;
its effectiveness depends on the retrieval model.
Ranking
Document scoring
‣ Each candidate document is assigned a score based
on how well it matches the query.
‣ Scoring is the core component of a search engine, and often its most
closely guarded secret.
‣ Many approaches and variations have been
developed.
‣ The basic form is the dot product of query term weights and
corresponding document weights:
$\mathrm{score}(q, d) = \sum_i q_i \cdot d_i$
where $q_i$ and $d_i$ are the weights of term $i$ in the query and in the document.
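A minimal sketch of dot-product scoring is shown below. The term weights are made-up numbers chosen for illustration (how weights are computed depends on the retrieval model), and the function names are hypothetical.

```python
def score(query_weights, doc_weights):
    """Dot product of query and document term weights; absent terms contribute 0."""
    return sum(w * doc_weights.get(term, 0.0) for term, w in query_weights.items())

def rank(query_weights, docs):
    """Return (doc_id, score) pairs in descending score order, dropping zero scores."""
    scored = [(doc_id, score(query_weights, weights)) for doc_id, weights in docs.items()]
    return sorted((p for p in scored if p[1] > 0), key=lambda p: p[1], reverse=True)

# Hypothetical term weights; the values are invented for this example.
docs = {
    "d1": {"search": 2.0, "engine": 1.0},
    "d2": {"search": 1.0, "index": 3.0},
    "d3": {"ranking": 1.5},
}
query = {"search": 1.0, "index": 0.5}
print(rank(query, docs))
```

Only terms that occur in both the query and the document contribute to the sum, so a document sharing no terms with the query scores zero and is left out of the ranked list.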
Query Process
(Evaluation)
Evaluation component
The task of the evaluation component is to measure and monitor
effectiveness and efficiency.
• An important part of that is to record and analyze user behavior using
log data.
The results of evaluation are used to tune and improve the ranking
component.
• Most of the evaluation component is not part of the online search
engine, apart from logging user and system data.
Evaluation is primarily an offline activity, but it is a critical part of any
search application.
Evaluation component
• Logging
‣ Logging user interaction is an essential tool for
measuring performance
‣ Query logs and clickthrough data are used for query
suggestion, spell checking, query caching, ranking,
advertising search, …
• Query logs record the users' interactions with the search
engine and are of paramount importance.
• They can be used to improve the search experience, speed up results, store
the results of common queries, and identify sources of new revenue.
Evaluation component
Pages that are clicked or ignored might be logged to improve the overall
quality of the search engine and also to detect patterns in user activity (i.e.
data mining).
Query logs can be used for a variety of other purposes, including:
1. Keeping track of a history of user queries
2. Generating spell-checking logs (instead of running the
spellchecker every time)
3. Recording the time spent on a query or on a particular document
4. Supporting query suggestion, spell checking, query caching, ranking,
and advertising search, together with clickthrough data
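As a rough sketch of how a query log might feed these uses, the code below counts query frequencies (candidates for caching and suggestion) and per-query clicks (a possible ranking signal). The log format of (query, clicked document or None) and all names are assumptions; real logs carry far more detail, such as timestamps and user sessions.

```python
from collections import Counter, defaultdict

# Hypothetical log records: (query, clicked document ID or None if nothing was clicked).
LOG = [
    ("search engine", "d1"),
    ("search engine", "d2"),
    ("search engine", "d1"),
    ("inverted index", None),
    ("inverted index", "d7"),
]

def popular_queries(log, n=1):
    """Most frequent queries: candidates for result caching and query suggestion."""
    return [q for q, _ in Counter(q for q, _ in log).most_common(n)]

def click_counts(log):
    """Clicks per document for each query: a possible clickthrough signal for ranking."""
    clicks = defaultdict(Counter)
    for query, doc in log:
        if doc is not None:
            clicks[query][doc] += 1
    return {q: dict(c) for q, c in clicks.items()}

print(popular_queries(LOG))
print(click_counts(LOG))
```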