Lecture 5:
Further Enhancements to Scoring, Results Assembly, and Evaluation
Today’s Lecture
• Speeding up vector space ranking
• Putting together a complete search system
• This will require learning about a number of miscellaneous topics and heuristics
• Evaluation of the IR system
Efficient Cosine Ranking
• Find the K docs in the collection “nearest” to the query, i.e., the K largest query-doc cosines.
• Efficient ranking:
– Computing a single cosine efficiently.
– Choosing the K largest cosine values efficiently.
• Can we do this without computing all N cosines?
Computing Cosine Scores
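The slide’s pseudocode is not reproduced here; the following is a minimal Python sketch of term-at-a-time cosine scoring with one accumulator per document. The data structures (a `postings` map of (doc_id, weight) pairs and a `doc_length` map of precomputed vector lengths) are assumptions for illustration, not a prescribed implementation.

```python
from collections import defaultdict

def cosine_scores(query_weights, postings, doc_length):
    """Term-at-a-time cosine scoring with one accumulator per doc.

    query_weights: dict term -> query weight (e.g. tf-idf)
    postings:      dict term -> list of (doc_id, doc weight) pairs
    doc_length:    dict doc_id -> Euclidean length of the doc vector
    """
    scores = defaultdict(float)
    for term, w_q in query_weights.items():
        for doc_id, w_d in postings.get(term, []):
            scores[doc_id] += w_q * w_d        # accumulate dot products
    for doc_id in scores:
        scores[doc_id] /= doc_length[doc_id]   # length-normalize
    return scores                              # top-K selection comes next
```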
Computing the K Largest Cosines: Selection vs. Sorting
• Typically we want to retrieve the top K docs (in the cosine ranking for the query)
– not to totally order all docs in the collection
• Can we pick off the docs with the K highest cosines?
• Let J = number of docs with nonzero cosines
– We seek the K best of these J
Use a Heap for Selecting the Top K
• A binary tree in which each node’s value > the values of its children (max-heap)
• Takes 2J operations to construct; then each of the K “winners” is read off in 2 log J steps.
• For J = 1M and K = 100, this is about 10% of the cost of sorting
[Figure: example max-heap of cosine scores (root 1; children .9, .3; leaves .3, .8, .1, .1)]
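As a sketch, the top-K selection can be done with Python’s heapq. Since heapq is a min-heap, scores are negated to mimic the max-heap described above; the `scores` dict is assumed to come from a cosine-scoring pass like the earlier sketch.

```python
import heapq

def top_k(scores, k):
    """Select the K largest scores without fully sorting all J of them."""
    heap = [(-score, doc_id) for doc_id, score in scores.items()]
    heapq.heapify(heap)                  # O(J) heap construction
    winners = []
    for _ in range(min(k, len(heap))):   # each pop costs O(log J)
        neg_score, doc_id = heapq.heappop(heap)
        winners.append((doc_id, -neg_score))
    return winners
```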
Bottlenecks
• Primary computational bottleneck in scoring: cosine computation
• Can we avoid all this computation?
• Yes, but we may sometimes get it wrong
– a doc not in the top K may creep into the list of K output docs
– Is this such a bad thing?
Cosine Similarity Is Only a Proxy
• User has a task and a query formulation
• Cosine matches docs to the query
• Thus cosine is in any case only a proxy for user happiness
• If we get a list of K docs “close” to the top K by the cosine measure, that should be OK
Generic Approach
• Find a set A of contenders, with K < |A| << N
– A does not necessarily contain the top K, but has many docs from among the top K
– Return the top K docs in A
• Think of A as pruning non-contenders
• The same approach is also used for other (non-cosine) scoring functions
• We will look at several schemes following this approach
Pruning Non-Contenders
• Index elimination
• Champion lists
• Static quality scores
• Impact ordering
• Cluster pruning
Index Elimination
• Basic cosine computation algorithm only considers docs
containing at least one query term
• Take this further:
– Only consider high-idf query terms
– Only consider docs containing many query terms
High-idf Query Terms Only
• For a query such as catcher in the rye
• Only accumulate scores from catcher and rye
• Intuition: in and the contribute little to the scores and so don’t alter the rank-ordering much
• Benefit:
– Postings of low-idf terms have many docs, so these (many) docs get eliminated from the set A of contenders
Docs Containing Many Query Terms
• Any doc with at least one query term is a candidate for the top K output list
• For multi-term queries, only compute scores for docs containing several of the query terms
– Say, at least 3 out of 4
– Imposes a “soft conjunction” on queries, as seen on web search engines (early Google)
• Easy to implement in postings traversal
3 of 4 Query Terms
Antony: 3 → 4 → 8 → 16 → 32 → 64 → 128
Brutus: 2 → 4 → 8 → 16 → 32 → 64 → 128
Caesar: 1 → 2 → 3 → 5 → 8 → 13 → 21 → 34
Calpurnia: 13 → 16 → 32
Scores are only computed for docs 8, 16, and 32 (see the sketch below).
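A minimal sketch of the “at least 3 of 4 query terms” filter, run on the slide’s postings (doc IDs only); the function and variable names are illustrative.

```python
from collections import Counter

def candidates_with_min_terms(postings, query_terms, min_match=3):
    """Keep only docs that contain at least `min_match` of the query terms."""
    counts = Counter()
    for term in query_terms:
        for doc_id in postings.get(term, []):
            counts[doc_id] += 1
    return {doc_id for doc_id, n in counts.items() if n >= min_match}

# The slide's example postings (doc IDs only):
postings = {
    "Antony":    [3, 4, 8, 16, 32, 64, 128],
    "Brutus":    [2, 4, 8, 16, 32, 64, 128],
    "Caesar":    [1, 2, 3, 5, 8, 13, 21, 34],
    "Calpurnia": [13, 16, 32],
}
print(candidates_with_min_terms(postings, postings.keys()))  # {8, 16, 32}
```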
Champion Lists
• Precompute for each dictionary term t the r docs of highest weight in t’s postings
– Call this the champion list for t
– (aka fancy list or top docs for t)
• At query time, only compute scores for docs in the champion list of some query term
– Pick the K top-scoring docs from amongst these
• Note that r has to be chosen at index build time
– Thus, it’s possible that r < K
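A sketch of champion-list construction at index build time, assuming postings store (doc_id, weight) pairs; r is fixed here when the lists are built.

```python
import heapq

def build_champion_lists(postings, r):
    """For each term, keep only the r postings with highest weight.

    postings: dict term -> list of (doc_id, weight) pairs
    Returns a dict term -> champion list of length <= r.
    """
    return {
        term: heapq.nlargest(r, plist, key=lambda p: p[1])
        for term, plist in postings.items()
    }
```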
Static Quality Scores
• We want top-ranking documents to be both relevant and authoritative
• Relevance is being modeled by cosine scores
• Authority is typically a query-independent property of a document
• Examples of authority signals:
– Wikipedia among websites
– Articles in certain newspapers
– A research paper with many citations
– Many bitly links, diggs, or other social bookmarks
– (PageRank)
Modeling Authority
• Assign to each document d a query-independent quality score in [0,1]
– Denote this by g(d)
• Thus, a quantity like the number of citations is scaled into [0,1]
Net Score
• Consider a simple total score combining cosine relevance and authority
• net-score(q,d) = g(d) + cosine(q,d)
– Can use some other linear combination
– Indeed, any function of the two “signals” of user happiness – more later
• Now we seek the top K docs by net score
Top K by Net Score – Fast Methods
• First idea: order all postings by g(d)
• Key: this is a common ordering for all postings
• Thus, we can concurrently traverse query terms’ postings for
– Postings intersection
– Cosine score computation
Why Order Postings by g(d)?
• Under g(d)-ordering, top-scoring docs likely to appear early in
postings traversal
• In time-bound applications (say, we have to return whatever
search results we can in 50 ms), this allows us to stop postings
traversal early
– Short of computing scores for all docs in postings
Champion Lists in g(d)-Ordering
• Can combine champion lists with g(d)-ordering
• Maintain for each term a champion list of the r docs with highest g(d) + tf-idf(t,d)
• Seek top-K results from only the docs in these champion lists
High and low lists
• For each term,we maintain two postings lists called high and low
– Think of high as the champion list
• When traversing postings on a query,only traverse high lists first
– If we get more than K docs, select the top K and stop
– Else proceed to get docs from the low lists
• Can be used even for simple cosine scores, without global quality
g(d)
• A means for segmenting index into two tiers
• Document-at-a-time scoring
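A sketch of the high/low traversal, under the assumption that `high` and `low` map each term to doc IDs and that a separate (hypothetical) `score_docs` helper ranks a candidate set.

```python
def search_high_low(query_terms, high, low, score_docs, k):
    """Traverse the 'high' (champion) lists first; fall back to the
    'low' lists only if they yield fewer than K candidate docs.

    high, low:  dict term -> list of doc_ids
    score_docs: hypothetical helper that scores a candidate set and
                returns (doc_id, score) pairs, best first
    """
    candidates = {d for t in query_terms for d in high.get(t, [])}
    if len(candidates) < k:              # not enough from the high tier
        candidates |= {d for t in query_terms for d in low.get(t, [])}
    return score_docs(candidates)[:k]
```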
Cluster Pruning
Preprocessing:
• Pick √N docs at random: call these leaders
• For every other doc, pre-compute its nearest leader
– Docs attached to a leader: its followers
– Likely: each leader has ~√N followers
Cluster Pruning
Query processing:
• Process a query as follows:
– Given query Q, find its nearest leader L.
– Seek the K nearest docs from among L’s followers.
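A sketch of both phases of cluster pruning, under the assumptions that documents are held as vectors in a dict and that `similarity` is a user-supplied cosine similarity function; ~√N leaders are sampled at random.

```python
import math
import random

def preprocess(docs, similarity):
    """Pick ~sqrt(N) random leaders and attach every other doc
    to its nearest leader (its 'followers')."""
    n = len(docs)
    leaders = random.sample(list(docs), int(math.sqrt(n)))
    followers = {leader: [] for leader in leaders}
    for doc_id, vec in docs.items():
        if doc_id in leaders:
            continue
        nearest = max(leaders, key=lambda lead: similarity(vec, docs[lead]))
        followers[nearest].append(doc_id)
    return leaders, followers

def query(q_vec, docs, leaders, followers, similarity, k):
    """Find the nearest leader, then rank only that leader's followers."""
    L = max(leaders, key=lambda lead: similarity(q_vec, docs[lead]))
    candidates = [L] + followers[L]
    return sorted(candidates,
                  key=lambda d: similarity(q_vec, docs[d]),
                  reverse=True)[:k]
```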
Visualization
[Figure: leaders and their followers in the cluster pruning scheme – omitted]
Parametric and Zone Indexes
• Thus far,a doc has been a sequence of terms
• In fact documents have multiple parts,some with special semantics:
– Author
– Title
– Date of publication
– Language
– Format
– etc.
• These constitute the metadata about a document
Fields
• We sometimes wish to search by this metadata
– E.g., find docs authored by William Shakespeare in the year 1601, containing alas poor Yorick
• Year = 1601 is an example of a field
• Also, author last name = shakespeare, etc.
• Field or parametric index: postings for each field value
– Sometimes build range trees (e.g., for dates)
• Field query typically treated as a conjunction
– (doc must be authored by shakespeare)
Zones
• A zone is a region of the doc that can contain an arbitrary amount of text, e.g.,
– Title
– Abstract
– References …
• Build inverted indexes on zones as well to permit querying
• E.g., “find docs with merchant in the title zone and matching the query gentle rain”
Example Zone Indexes
Encode zones in the dictionary vs. in the postings (see the sketch below).
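The figure contrasting the two encodings is not reproduced; below is an illustrative sketch (the doc IDs are made up) of zones kept in the dictionary versus zones kept in the postings.

```python
# Zones encoded in the dictionary: one postings list per (term, zone) pair.
dictionary_zone_index = {
    ("merchant", "title"): [2, 4],
    ("merchant", "body"):  [1, 2],
}

# Zones encoded in the postings: one list per term, with each posting
# carrying the set of zones in which the term occurs in that document.
postings_zone_index = {
    "merchant": [(1, {"body"}), (2, {"title", "body"}), (4, {"title"})],
}
```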
Tiered indexes
• Break postings up into a hierarchy of lists
– Most important
–…
– Least important
• Can be done by g(d) or another measure
• Inverted index thus broken up into tiers of decreasing
importance
• At query time use top tier unless it fails to yield K docs
– If so drop to lower tiers
Example Tiered Index
Query Term Proximity
• Free text queries: just a set of terms typed into the query box – common on the web
• Users prefer docs in which query terms occur within close proximity of each other
• Let w be the smallest window in a doc containing all query terms, e.g.,
– For the query strained mercy, what is the smallest window (in words) in the doc The quality of mercy is not strained?
• Would like the scoring function to take this into account – how? (A window-finding sketch follows.)
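A simple sketch computing the smallest window (in words) containing all query terms; run on the slide’s example it reports a window of 4 words. The quadratic sweep is only for illustration, not an efficient algorithm.

```python
def smallest_window(doc_tokens, query_terms):
    """Length of the smallest window of doc_tokens containing every
    query term at least once (float('inf') if some term is missing)."""
    query_terms = set(query_terms)
    best = float("inf")
    for start, tok in enumerate(doc_tokens):
        if tok not in query_terms:
            continue
        seen = set()
        for end in range(start, len(doc_tokens)):
            if doc_tokens[end] in query_terms:
                seen.add(doc_tokens[end])
            if seen == query_terms:
                best = min(best, end - start + 1)
                break
    return best

doc = "the quality of mercy is not strained".split()
print(smallest_window(doc, {"strained", "mercy"}))  # 4
```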
Impact-Ordered Postings
• We only want to compute scores for docs for which wf(t,d) (the weighted term frequency) is high enough
• Term-at-a-time scoring
• We sort each postings list by wf(t,d)
• Now: not all postings are in a common order!
• How do we compute scores in order to pick off the top K?
– Two ideas follow
1. Early Termination
• When traversing t’s postings, stop early after either
– a fixed number r of docs, or
– wf(t,d) drops below some threshold
• Take the union of the resulting sets of docs
– One from the postings of each query term
• Compute scores only for docs in this union (see the sketch below)
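A sketch of early termination over impact-ordered postings, assuming each postings list is already sorted by decreasing wf(t,d); both stopping rules from the slide are shown.

```python
def early_terminated_candidates(postings, query_terms, r=None, wf_threshold=None):
    """postings: dict term -> list of (doc_id, wf) sorted by decreasing wf.
    Walk each term's postings, stop after r docs or when wf falls below
    the threshold, and return the union of surviving doc IDs."""
    union = set()
    for term in query_terms:
        for i, (doc_id, wf) in enumerate(postings.get(term, [])):
            if r is not None and i >= r:
                break                      # fixed number of docs reached
            if wf_threshold is not None and wf < wf_threshold:
                break                      # weight dropped below threshold
            union.add(doc_id)
    return union
```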
2.idf-ordered terms
• When considering the postings of query terms
• Look at them in order of decreasing idf
– High idf terms likely to contribute most to score
• As we update score contribution from each query term
– Stop if doc scores relatively unchanged
• Can apply to cosine or some other net scores
Query Parsers
• A free text query from the user may in fact spawn one or more queries to the indexes, e.g., for the query rising interest rates:
– Run the query as a phrase query
– If <K docs contain the phrase rising interest rates, run the two 2-word phrase queries rising interest and interest rates
– If we still have <K docs, run the vector space query rising interest rates
– Rank matching docs by vector space scoring
• This sequence is issued by a query parser (see the sketch below)
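A sketch of the cascade a query parser might issue for rising interest rates; `phrase_search` and `vector_search` are hypothetical interfaces to the underlying indexes, each returning a ranked list of doc IDs.

```python
def parse_and_run(query, phrase_search, vector_search, k):
    """Cascade: full phrase, then the 2-word phrases, then vector space,
    stopping once K docs have been collected."""
    terms = query.split()                        # e.g. ["rising", "interest", "rates"]
    results = list(phrase_search(terms))         # the full phrase query
    if len(results) < k:
        for i in range(len(terms) - 1):          # e.g. "rising interest", "interest rates"
            results += phrase_search(terms[i:i + 2])
    if len(results) < k:
        results += vector_search(terms)          # fall back to vector space scoring
    return list(dict.fromkeys(results))[:k]      # de-duplicate, keep order, cut to K
```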
Aggregate scores
• We’ve seen that score functions can combine cosine, static quality, proximity, etc.
• How do we know the best combination?
• Some applications – expert-tuned
• Increasingly common: machine-learned
Putting it all together
Evaluation of Information Retrieval Systems
• How do we evaluate?
• Try a few queries on the engine and say “Not bad”?
• Is that enough?
• How fast does it index?
– Number of documents/hour
– Incremental indexing
• How fast does it search?
– Latency and CPU needs
• Expressiveness of the query language
• Does it recommend related products/links?
Measures for a search engine
• All of the preceding criteria are measurable
◦ We can make expressiveness precise
• The key measurement:User happiness
◦ What is this?
◦ Speed of response/size of index are factors
◦ But blindingly fast, useless answers won’t make a user happy
◦ Need a way of quantifying user happiness
Measuring user happiness
• Issue: who is the user we are trying to make happy?
– Depends on the setting
• Web engine:
– User finds what he/she wants and returns to the engine
• eCommerce site:
– User finds what he/she wants and makes a purchase
Happiness: Elusive to Measure
• Most common proxy: relevance of search results
• But how do you measure relevance?
• Three elements:
1. A benchmark document collection
– e.g., Cranfield, Reuters, TREC
2. A benchmark suite of queries
3. An assessment of either Relevant or Nonrelevant for each query and each document
• Relevance is assessed relative to an information need, not a query
Evaluating an IR System
• Note: the user need is translated into a query
• Relevance is assessed relative to the user need, not the query
• E.g., information need: My swimming pool bottom is becoming black and needs to be cleaned
• Query: pool cleaner
• Assess whether the doc addresses the underlying need, not whether it has these words
Binary Assessments
F Measure: A single measure that trades off precision versus recall
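For reference, the standard definitions behind these binary assessments (the slide’s contingency table is not reproduced): precision P is the fraction of retrieved docs that are relevant, recall R is the fraction of relevant docs that are retrieved, and the F measure combines them.

```latex
P = \frac{\#(\text{relevant items retrieved})}{\#(\text{retrieved items})}
\qquad
R = \frac{\#(\text{relevant items retrieved})}{\#(\text{relevant items})}
\qquad
F_{\beta} = \frac{(\beta^{2}+1)\,P\,R}{\beta^{2}P + R},
\quad
F_{1} = \frac{2PR}{P+R}
```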
Precision-Recall Curve
[Figure: precision (y-axis, 0.0 to 1.0) plotted against recall (x-axis, 0.0 to 1.0)]
Rank-Based Measures
Binary relevance:
– Precision@K (P@K)
– Mean Average Precision (MAP)
Precision@K (P@K)
• Set a rank threshold K
• Compute the % of relevant docs in the top K
• Ignores documents ranked lower than K
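A short sketch of P@K; `ranking` is a ranked list of doc IDs and `relevant` a set of judged-relevant IDs (the names are illustrative).

```python
def precision_at_k(ranking, relevant, k):
    """Fraction of the top-K ranked docs that are relevant."""
    return sum(1 for doc_id in ranking[:k] if doc_id in relevant) / k
```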
Mean Average Precision
• Consider the rank position of each relevant doc (K1, K2, …, Kr)
• Compute Precision@K for each of K1, K2, …, Kr
• Average Precision = the average of these P@K values
• MAP is the average of Average Precision across multiple queries/rankings
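A sketch of Average Precision and MAP following the slide’s recipe; by the usual convention (an assumption here), relevant docs that are never retrieved contribute a precision of 0.

```python
def average_precision(ranking, relevant):
    """Average of P@K taken at the rank K of each retrieved relevant doc.
    Relevant docs missing from the ranking contribute 0."""
    hits, precisions = 0, []
    for rank, doc_id in enumerate(ranking, start=1):
        if doc_id in relevant:
            hits += 1
            precisions.append(hits / rank)       # P@K at this relevant doc
    return sum(precisions) / len(relevant) if relevant else 0.0

def mean_average_precision(rankings, relevant_sets):
    """MAP: the mean of Average Precision over a set of queries."""
    aps = [average_precision(r, rel) for r, rel in zip(rankings, relevant_sets)]
    return sum(aps) / len(aps)
```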