Course Title: Information Storage and Retrieval
Course Code: ITec-4081
Credits (CP): 5 [2 Lecture, 3 Lab]
Chapter - 2
Text/Document Operations
Contents
• Document Preprocessing
– Lexical Analysis
– Stopword elimination
– Stemming
• Index term selection
– Luhn’s selection
– Zipf’s law
• Term extraction
– Term weighting
– Similarity measures
Cont.…
• Text/document Preprocessing is the process of
– controlling the size of the vocabulary or the number of distinct
words used as index terms.
– It will lead to an improvement in the information retrieval
performance.
• However, some search engines on the Web omit preprocessing
– i.e. every word in the document will be used as an index term.
• Document preprocessing is a procedure which can be divided mainly
into five text operations (or transformations): Lexical Analysis of the
Text, Elimination of Stopwords, Stemming, Thesauri and Index Terms
Selection
Concepts in Document Processing
• Lexical analysis
– Convert a stream of chars into a set of words
• Index terms selection
– Full text or not
– Noun groups [Inquery system 1995]
• Most of the semantics is carried by the nouns
• Elimination of stop words
– Words with high frequency are bad discriminators
– E.g., articles, prepositions, conjunctions, etc.
• Stemming
– Reduce a word to its grammatical root by removing affixes (prefixes
and suffixes)
Generating Document Representatives
• Text Processing System
– Input text – full text, abstract or title.
– Output – a document representative adequate for use in an
automatic retrieval system.
• The document representative consists of a list of class names, each name
representing a class of words occurring in the total input text.
– A document will be indexed by a name if one of its
significant words occurs as a member of that class.
Pipeline: Documents → Tokenization → Stop words → Stemming → Thesaurus → Index terms
Text Operations
• Not all words in a document are equally significant to represent the
contents/meanings of a document.
– Some words carry more meaning than others.
– Noun words are the most representative of a document's content.
• Therefore, there is a need to preprocess the text of a document in a
collection to choose those used as index terms.
• Using the set of all words in a collection to index documents creates
too much noise for the retrieval task.
– Reducing noise means reducing the number of words which can be used to refer to the
document.
Cont.….
• Text operations are the process of transforming text into logical
representations.
• The main text operations for selecting index terms, i.e. for choosing the
words/stems (or groups of words) to be used as indexing terms, are:
– Lexical analysis/tokenization of the text - handling digits, hyphens, punctuation
marks, and the case of letters.
– Elimination of stop words - filter out words which are not useful in
the retrieval process.
– Stemming words - remove affixes (prefixes and suffixes).
– Construction of term categorization structures such as thesaurus, to
capture relationship for allowing the expansion of the original query
with related terms.
Lexical Analysis/Tokenization of Text
Lexical analysis is the process of converting a stream of characters (the text of the
documents) into a stream of words (the candidate words to be adopted as index
terms).
Thus, one of the major objectives of the lexical analysis phase is the identification of
the words in the text.
For instance, the following four particular cases have to be considered with care:
digits, hyphens, punctuation marks, and the case of the letters (lower and upper
case).
Numbers are usually not good index terms because, without a surrounding context,
they by themselves are inherently vague.
Normally, punctuation marks are removed entirely in the process of lexical analysis.
The case of letters is usually not important for the identification of index terms.
– As a result, the lexical analyzer normally converts all the text to either lower or
upper case.
Tokenization of Text
Change the text of the documents into words (tokens) to be adopted as index
terms.
Tokenization greatly depends on how the concept of a word is defined.
– A word is a sequence of letters terminated by a separator (period,
comma, space, etc).
– Definition of letter and separator is flexible; e.g., hyphen could be
defined as a letter or as a separator.
– Usually, common words (such as “a”, “the”, “of”, …) are ignored.
Cont.….
• Tokenization is one of the steps used to convert the text of the documents
into a sequence of words, w1, w2, … wn to be adopted as index terms.
• It is the process of demarcating and possibly classifying sections of a string
of input characters into words.
• It is the mechanism of analyzing text into a sequence of discrete tokens
(words).
• For instance:
– Input: “The quick brown fox jumps over the lazy dog”.
– Output: Tokens (an instance of a sequence of characters that are
grouped together as a useful semantic unit for processing).
✓ “The” , “quick”, “brown”, “fox”, “jumps”, “over”, “the”, “lazy”, “dog”, “.”
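• A minimal sketch of this step in Python (the regular expression is an illustrative assumption, not a prescribed tokenizer): it splits the sentence into word tokens and keeps the final period as its own token, matching the output above.

import re

def tokenize(text):
    # Split a character stream into word tokens and standalone punctuation marks.
    # \w+ matches runs of letters/digits; [^\w\s] matches single punctuation characters.
    return re.findall(r"\w+|[^\w\s]", text)

print(tokenize("The quick brown fox jumps over the lazy dog."))
# ['The', 'quick', 'brown', 'fox', 'jumps', 'over', 'the', 'lazy', 'dog', '.']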
Cont.….
• Each such token is now a candidate for an index entry, after further
processing
– But what are valid tokens to use?
• Tokenization Issues
– numbers, hyphens, punctuation marks, apostrophes …
Issues in Tokenization
• One word or multiple: How to handle special cases involving hyphens,
apostrophes, punctuation marks etc.?
– C++, C#, URL’s, e-mail, …
– Sometimes punctuations (e-mail), numbers (1999), & case (Republican vs.
republican) can be a meaningful part of a token. However, frequently they
are not.
• Simplest approach is to ignore all numbers and punctuation marks (period,
colon, comma, brackets, semi-colon, apostrophe, …) and use only case-insensitive,
unbroken strings of alphabetic characters as words (see the sketch below).
– Generally, systems do not index numbers as text, but they are often very useful for
search.
– “meta-data” is often indexed, including creation date, format, etc.
separately.
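• A minimal sketch of the "simplest approach" above (the pattern and function name are assumptions for illustration): lowercase the text and keep only unbroken strings of alphabetic characters, so numbers and punctuation are dropped.

import re

def simple_tokens(text):
    # Case-insensitive: fold to lowercase, then keep only alphabetic runs.
    return re.findall(r"[a-z]+", text.lower())

print(simple_tokens("Order #42 shipped on 12/05/2023 to Addis-Ababa!"))
# ['order', 'shipped', 'on', 'to', 'addis', 'ababa']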
Issues in Tokenization
• Two words may be connected by hyphens. Should two words connected
by hyphens be taken as one word or as two words?
– Break up the hyphenated sequence into two tokens.
– In most cases the hyphen is broken up (e.g. state-of-the-art →
state of the art), but some words, e.g. Gar-malee, MS-DOS, B-49, are
unique words which require hyphens.
▪ Two words (phrase) may be separated by space.
– E.g. Addis Ababa, San Francisco, Los Angeles
▪ Two words may be written in different ways
– For example lowercase, lower-case, lower case? data base, database,
data-base.
Issues in Tokenization
• Numbers: are numbers/digits words and used as index terms?
– dates (3/12/91 vs. Mar. 12, 1991);
– phone numbers (+251923415005)
– IP addresses (100.2.86.144)
– Numbers are not good index terms (like 1910, 1999); but 510 B.C. is
unique.
– Generally, don’t index numbers as text, though very useful.
• What about case of letters (e.g. Data or data or DATA):
– Cases are not important and there is a need to convert all to upper or
lower.
• Issues of tokenization are language specific.
– Requires the language to be known.
– What works for one language doesn’t work for the other.
Exercise: Tokenization
• The cat slept peacefully in the living room. It’s a very old cat.
• The instructor (Dr. O’Neill) thinks that the boys’ stories about
Chile’s capital aren’t amusing.
1.Write down the individual sentences after tokenization.
2.How many sentences are there?
Exercise: Tokenization
1. New York-based start-up's revenue increased by 20% in Q4-2023.
2. John was born on 12/05/1995 and his phone number is +251923415005.
3. Addis Ababa University and Stanford University are top
institutions.
Elimination of Stop word
• In fact, a word which occurs in 80% of the documents in the collection is
useless for purposes of retrieval.
– Such words are frequently referred to as stop words and are normally
filtered out as potential index terms.
– Articles, prepositions, and conjunctions are natural candidates for a list
of stop words.
• So, Stop words are extremely common words across document
collections that have no discriminatory power for text/document
representation.
• They would appear to be of little value in helping select documents
matching a user need.
Cont.…
• Examples of stop words are articles, prepositions, conjunctions, etc.:
– articles (a, an, the);
– pronouns: (I, he, she, it, their, his)
– Some prepositions (on, of, in, about, besides, against),
– conjunctions/ connectors (and, but, for, nor, or, so, yet),
– verbs (is, are, was, were),
– adverbs (here, there, out, because, soon, after) and
– adjectives (all, any, each, every, few, many, some) can also be
treated as stop words.
• But mostly, Stop words are language dependent.
Cont.…
• Intuition:
– Stop words have little semantic content; it is typical to remove such high-
frequency words.
– Stop words take up a large share of the text, so removing them reduces the
document representation size by 30–50%.
• Smaller indices for information retrieval.
– Good compression techniques for indices: The 30 most common words
account for 30% of the tokens in written text.
– Better approximation of importance for classification, summarization, etc.
• An important benefit is that it reduces the size of the indexing structure
considerably.
• In fact, it is typical to obtain a significant compression in the size of the
indexing structure.
Cont.…
• Stop word elimination used to be standard in older IR systems.
• But the trend is getting away from doing this.
• Most web search engines index stopwords:
– Good query optimization techniques mean you pay little at query time
for including stopwords.
– You need stop words for:
• Phrase queries: “King of Denmark”
• Various song titles, etc.: “Let it be”, “To be or not to be”
• “Relational” queries: “flights to London”
– Elimination of stop words might reduce recall (e.g. for “To be or not to be”,
all terms are eliminated except “be”, which will produce no or irrelevant retrieval).
How to determine a list of stop words?
• One method: Sort terms by collection frequency (in decreasing order) and
take the most frequent ones (see the sketch below).
• Another method: Build a stop word list that contains a set of articles,
pronouns, etc.
– Why do we need stop lists? With a stop list, we can compare and
exclude the most common words from index terms.
• With the removal of stop words,
» we can measure better approximation of importance
for classification, summarization, etc.
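• A minimal sketch of the first method (the toy corpus and the cutoff k are assumptions for illustration): count collection frequency across all documents, sort in decreasing order, and take the most frequent terms as the stop list.

from collections import Counter

docs = [
    "the cat sat on the mat",
    "the dog chased the cat",
    "a cat and a dog played in the garden",
]

# Collection frequency: total occurrences of each term across all documents.
freq = Counter(term for doc in docs for term in doc.split())

k = 3  # assumed cutoff for the stop list size
stop_words = {term for term, _ in freq.most_common(k)}
print(stop_words)  # the k most frequent terms in this tiny collection

# Exclude stop words when selecting index terms.
index_terms = [t for doc in docs for t in doc.split() if t not in stop_words]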
Normalization
• It is the process of canonicalizing tokens (changing them to their simplest form) so that
matches occur despite superficial differences in the character sequences of the tokens.
– Need to “normalize” terms in indexed text as well as query terms into
the same form.
– Example: We want to match U.S.A. and USA, by deleting periods in a
term.
• Case Folding or Case normalization: Often best to lower case everything,
since users will use lowercase regardless of ‘correct’ capitalization (see the sketch after the examples below).
– Republican vs. republican
– Fasil vs. fasil vs. FASIL
– Anti-discriminatory vs. antidiscriminatory
– A.A. AA Addis Ababa
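• A minimal sketch of these two normalization steps (the helper name is an assumption): delete periods inside a term and fold case, so U.S.A., USA and usa all map to the same normalized form.

def normalize(term):
    # Delete periods (U.S.A. -> USA) and case-fold (USA -> usa).
    return term.replace(".", "").lower()

terms = ["U.S.A.", "USA", "Republican", "republican", "Anti-discriminatory"]
print([normalize(t) for t in terms])
# ['usa', 'usa', 'republican', 'republican', 'anti-discriminatory']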
Normalization issues
• Case folding is good for
– Allowing instances of Automobile at the beginning of a sentence to
match with a query of automobile.
– Helping a search engine when most users type ferrari when they are
interested in a Ferrari car
• Bad for
– Proper names vs. common nouns
• E.g. General Motors, Associated Press, …
• Solution: lowercase only words at the beginning of the sentence.
• In IR, lowercasing is most practical because of the way users issue their
queries.
Stemming/Morphological analysis
• Stemming reduces tokens to their “root” or stem form to recognize morphological
variation.
– The process involves removal of affixes (i.e. prefixes and suffixes) with the aim of
reducing variants to the same stem.
• Often removes inflectional and derivational morphology of a word.
• Inflectional morphology varies the form of words in order to express grammatical
features, such as singular/plural or past/present tense.
– E.g. boy → boys, cut → cutting.
• Derivational morphology makes new words from old ones.
– E.g. creation is formed from create, but they are two separate words.
– And also, destruction → destroy.
• Stemming is language dependent
– Correct stemming is language specific and can be complex.
– For example, before stemming: “for example compressed and compression are both accepted”;
after stemming: “for example compress and compress are both accept”.
Cont.….
• The final output from a conflation (reducing to the same token)
algorithm is a set of classes, one for each stem detected.
• A Stem is the portion of a word which is left after the removal of its
affixes (i.e., prefixes and/or suffixes).
– For example: ‘connect’ is the stem for {connected, connecting
connection, connections}.
– Thus, [automate, automatic, automation]→ all reduce to → automat
• A class name is assigned to a document if and only if one of its members
occurs as a significant word in the text of the document.
– A document representative then becomes a list of class names, which
are often referred as the documents index terms/keywords.
• Queries are handled in the same way.
Ways to implement stemming
• There are basically two ways to implement stemming.
• The first approach is to create a big dictionary that maps words to their
stems.
– The advantage of this approach is that
» it works perfectly (in so far as the stem of a word can be
defined perfectly);
— the disadvantages are the space required by the dictionary and the
investment required to maintain the dictionary as new words appear.
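• A minimal sketch of the dictionary approach (the tiny mapping is an illustrative assumption; a real dictionary would be far larger): look each word up in a precompiled word-to-stem table and fall back to the word itself when it is missing.

# A tiny, illustrative precompiled dictionary mapping words to stems.
stem_dict = {
    "connected": "connect", "connecting": "connect",
    "connection": "connect", "connections": "connect",
    "automate": "automat", "automatic": "automat", "automation": "automat",
}

def dictionary_stem(word):
    # Perfect for listed words; unknown (new) words are left unchanged,
    # which reflects the maintenance problem mentioned above.
    return stem_dict.get(word, word)

print([dictionary_stem(w) for w in ["connections", "automation", "retrieval"]])
# ['connect', 'automat', 'retrieval']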
Ways to implement stemming
• The second approach is to use a set of rules that extract stems from
words.
– The advantages of this approach are that the code is typically small,
and it can gracefully handle new words;
– the disadvantage is that it occasionally makes mistakes.
• But, since stemming is imperfectly defined anyway, occasional mistakes
are tolerable, and
» the rule-based approach is the one that is generally chosen.
Porter Stemmer
• Stemming is the operation of stripping the suffixes from a word, leaving
its stem.
• Google, for instance, uses stemming to search for web pages containing
the words connected, connecting, connection and connections when
users ask for a web page that contains the word connect.
• In 1979, Martin Porter developed a stemming algorithm that
– uses a set of rules to extract stems from words, and
– though it makes some mistakes, most common words seem to work
out right.
– Porter describes his algorithm and provides a reference
implementation in C at
http://tartarus.org/~martin/PorterStemmer/index.html
Porter stemmer
• It is the most common algorithm for stemming English words to their
common grammatical root.
• It is a simple procedure for removing known affixes in English without using a
dictionary.
• To get rid of plurals, the following rules are used:
– SSES → SS caresses → caress
– IES → i ponies → poni
– SS → SS caress → caress
–S → cats → cat
–EMENT → (Delete final ement if what remains is longer than 1 character )
replacement → replac
cement → cement
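• A minimal sketch of just the plural and -EMENT rules above (not the full Porter algorithm; the function names are assumptions):

def strip_plural_suffix(word):
    # Porter-style plural rules, longest suffix first.
    if word.endswith("sses"):
        return word[:-2]          # caresses -> caress
    if word.endswith("ies"):
        return word[:-3] + "i"    # ponies -> poni
    if word.endswith("ss"):
        return word               # caress -> caress
    if word.endswith("s"):
        return word[:-1]          # cats -> cat
    return word

def strip_ement(word):
    # Delete a final "ement" only if what remains is longer than 1 character.
    if word.endswith("ement") and len(word) - 5 > 1:
        return word[:-5]          # replacement -> replac; cement stays cement
    return word

for w in ["caresses", "ponies", "caress", "cats", "replacement", "cement"]:
    print(w, "->", strip_ement(strip_plural_suffix(w)))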
Porter stemmer
• While the previous rules get rid of plurals, the following rules remove -ed
or -ing.
– e.g.
– agreed → agree, disabled → disable
– matting → mat, mating → mate
– meeting → meet, milling → mill
– messing → mess, meetings → meet
– feed → feed
Stemming: challenges
• May produce unusual stems that are not English words:
– Removing ‘UAL’ from FACTUAL and EQUAL
• May conflate words (reduce to the same token) that are actually
distinct.
– “computer”, “computational”, “computation” all reduced to same
token “comput”
• Not recognize all morphological derivations.
Thesaurus
• The word thesaurus has Greek and Latin origins and is used as a
reference to a treasury of words.
• This treasury consists of
– a precompiled list of important words in a given domain of
knowledge and for each word in this list, a set of related words.
• A Thesaurus, sometimes called dictionary of synonyms,
– It is a reference work which arranges words by their meanings,
» sometimes as a hierarchy of broader and narrower terms,
» sometimes simply as lists of synonyms and antonyms.
Cont.…
• Mostly full-text searching cannot be accurate,
– since different authors may select different words to represent the
same concept.
• Problems with this are:
– The same meaning can be expressed using different terms that are
synonyms (two words having similar meaning),
– homonyms (words pronounced or spelled the same way but have
different meanings), and related terms.
Cont.…
• Thesaurus: The vocabulary of a controlled indexing language, formally
organized
– so that a priori relationships between concepts (for example as
"broader" and “related") are made explicit.
• A thesaurus contains terms and relationships between terms.
– IR thesauri typically rely upon the use of symbols such as UF (used for),
BT (broader term), and RT (related term) to demonstrate inter-term
relationships.
– For instance:
– car = automobile, truck, bus, taxi, motor vehicle
– color = colour, paint
Aim of Thesaurus
• It tries to control the use of the vocabulary by showing a set of related words
to handle synonyms and homonyms.
• The aim of thesaurus is therefore:
– to provide a standard vocabulary for indexing and searching
• A thesaurus rewrites words into equivalence classes, and we index such
equivalence classes.
• When the document contains automobile, index it under car as well
(usually, also vice-versa)
– to assist users with locating terms for proper query formulation:
• When the query contains automobile, look under car as well for
expanding query.
– to provide classified hierarchies that allow the broadening and narrowing of
the current request according to user needs.
Thesaurus Construction
• For example:
– thesaurus built to assist IR for searching cars and vehicles:
• Term: Motor vehicles
UF : Automobiles
Cars
Trucks
BT: Vehicles
RT: Road Engineering
Road Transport
Where UF=used for, BT=broader term, and RT= related term
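• A minimal sketch of using such a thesaurus for indexing and query expansion (the dictionary layout and function names are illustrative assumptions): UF variants are rewritten to the preferred term for indexing, and a query can be broadened with BT/RT terms.

# Illustrative thesaurus entry mirroring the "Motor vehicles" example above.
thesaurus = {
    "motor vehicles": {
        "UF": ["automobiles", "cars", "trucks"],
        "BT": ["vehicles"],
        "RT": ["road engineering", "road transport"],
    }
}

# Map each "used for" variant to its preferred term.
preferred = {variant: term
             for term, rel in thesaurus.items()
             for variant in rel["UF"]}

def index_term(word):
    # Index documents under the preferred term (e.g. "cars" -> "motor vehicles").
    return preferred.get(word, word)

def expand_query(term):
    # Broaden a query with broader (BT) and related (RT) terms.
    rel = thesaurus.get(term, {})
    return [term] + rel.get("BT", []) + rel.get("RT", [])

print(index_term("cars"))              # motor vehicles
print(expand_query("motor vehicles"))  # ['motor vehicles', 'vehicles', 'road engineering', 'road transport']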
More Example
• For Example thesaurus built to assist IR in the fields of computer
science:
• TERM: natural languages
UF natural language processing (UF=used for NLP)
BT languages (BT=broader term is languages)
TT languages (TT = (top term) is languages)
RT artificial intelligence (RT=related term/s)
computational linguistics
formal languages
query languages
speech recognition
Language-specificity
• Many of the above features embody transformations that are
– Language-specific and
– Often, application-specific
• These are “plug-in” addenda to the indexing process
• Both open source and commercial plug-ins are available for handling
these issues.
Index term selection
• Index language is the language used to describe documents and requests.
• Elements of the index language are index terms which may be derived
from the text of the document to be described, or may be arrived at
independently.
– If a full text representation of the text is adopted, then all words in
the text are used as index terms = full text indexing.
– Otherwise, we need to select the words to be used as index terms in order to
reduce the size of the index file, which is essential for designing an efficient
searching IR system.
Statistical Properties of Text
• How is the frequency of different words distributed?
• How fast does vocabulary size grow with the size of a corpus?
• There are three well-known researchers who defined statistical
properties of words in a text:
– Zipf’s Law: models word distribution in text corpus
– Luhn’s idea: measures word significance
– Heap’s Law: shows how vocabulary size grows with corpus size
• Such properties of a text collection greatly affect the
performance of an IR system and can be used to select suitable term
weights and other aspects of the system.
Zipf’s law
• Zipf’s Law states that in a large corpus, the frequency of a
word is inversely proportional to its rank in the frequency
table.
Generalized Zipf’s Law: f(r) = C / r^s
– f(r) is the frequency of the word at rank r.
– C is a normalization constant.
– r is the rank of the word.
– s is an exponent (typically close to 1 in natural language data).
Example of Zipf’s Law
• Scenario: Suppose we analyze a news article corpus with
10,000 words. The word frequency distribution might
look like this:
Rank Word Frequency
1 the 1000
2 is 500
3 government 250
4 policy 125
5 renewable 50
10 microgrid 10
Zipf’s Law
• The most frequent words ("the," "is") appear often but are
not informative → Stopwords removed.
• The least frequent word ("microgrid") appears only 10
times → Might be too rare.
• Moderate-frequency words ("government," "policy,"
"renewable") contain useful semantic meaning for
retrieval.
• Zipf’s Law helps remove common stopwords and rare
words.
• It identifies mid-range terms as good index terms.
Zipf's Law, Luhn's Model
and Heap's Law
Law/Model | Key Idea | Application in IR
Zipf's Law | Word frequency is inversely proportional to rank | Identifies stopwords and mid-frequency keywords
Luhn's Model | Mid-frequency words are best for indexing | Helps select optimal index terms
Heap's Law | Vocabulary grows sub-linearly with corpus size | Predicts storage needs and search efficiency
Zipf's Law, Luhn's Model
and Heap's Law
• Zipf’s Law helps remove stopwords.
• Luhn’s Model selects the best indexing
terms.
• Heap’s Law helps estimate index size as a
collection grows.
Word Distribution
• A few words are very
common.
✓2 most frequent words
(e.g. “the”, “of”) can
account for about 10%
of word occurrences.
• Most words are very rare.
✓Half the words in a
corpus appear only once,
called “read only once”
More Example: Zipf’s Law
• Illustration of the Rank-Frequency Law. Let the total number of
word occurrences in the sample be N = 1,000,000.
Rank (R) | Term | Frequency (F) | R·(F/N)
1 | the | 69,971 | 0.070
2 | of | 36,411 | 0.073
3 | and | 28,852 | 0.086
4 | to | 26,149 | 0.104
5 | a | 23,237 | 0.116
6 | in | 21,341 | 0.128
7 | that | 10,595 | 0.074
8 | is | 10,099 | 0.081
9 | was | 9,816 | 0.088
10 | he | 9,543 | 0.095
More Example: Zipf’s Law
• This is a word frequency analysis that follows Zipf's Law, which
states that the frequency of a word is inversely proportional to its rank:
F ∝ 1/R, so R·(F/N) should be roughly constant, where N is the total
number of words in the dataset.
• The most frequent word ("the") appears 69,971 times, while the
second most frequent ("of") appears 36,411 times.
• As rank increases, frequency decreases (following Zipf’s Law).
• The R.(F/N) column should ideally be close to a constant value if
Zipf’s Law holds well.
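• A small sketch that recomputes the R·(F/N) column from the table above (the numbers are taken directly from the table; small rounding differences are possible). If Zipf's Law holds, the product stays roughly constant.

N = 1_000_000  # total word occurrences in the sample

table = [  # (rank, term, frequency) from the table above
    (1, "the", 69_971), (2, "of", 36_411), (3, "and", 28_852),
    (4, "to", 26_149), (5, "a", 23_237), (6, "in", 21_341),
    (7, "that", 10_595), (8, "is", 10_099), (9, "was", 9_816),
    (10, "he", 9_543),
]

for rank, term, freq in table:
    # Zipf's Law predicts rank * (freq / N) to be approximately constant.
    print(f"{rank:2d} {term:5s} {rank * freq / N:.3f}")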
Word distribution: Zipf's Law
The x-axis represents the rank (r) of a word in a corpus (e.g., the
most frequent word is rank 1, the second most frequent is rank 2,
and so on).
The y-axis represents the frequency (𝑓) of the word’s occurrence.
The curve follows a power-law distribution, meaning that a few
words are used very frequently, while most words are used rarely.
Word distribution: Zipf's Law
The highest-ranked word (rank = 1) is the most frequent. The second-ranked
word (rank = 2) has half the frequency of the first. The third-ranked word
(rank = 3) has one-third the frequency, and so on.
Find the most frequent word (f1): in our previous example, "the" is the most
frequent word with f1 = 4.
Compute expected frequencies using f ∝ 1/r: if "the" is rank r = 1, its
frequency is f1 = 4.
The expected frequency of the second most common word: f2 = f1/2 = 4/2 = 2.
Word distribution: Zipf's Law
The expected frequency of the third most common word: f3 = f1/3 = 4/3 ≈ 1.33.
The expected frequency of the fourth most common word: f4 = f1/4 = 4/4 = 1.
Word significance: Luhn’s Ideas
• Luhn's cutoff formula helps determine the upper and lower
frequency thresholds to filter out common and rare words in a
text.
• While Luhn did not specify a single mathematical formula, the
general approach is based on Zipf’s Law and frequency
distribution.
• Lower Cutoff (f_min): the minimum frequency needed
for a word to be informative.
• Formula: f_min = f_max / k1, where:
– f_max = maximum word frequency in the text.
– k1 = empirical constant (typically 10).
• Upper Cutoff (f_upper): the frequency above which words are too common.
• Formula: f_upper = k2 × f_min, where:
– k2 = empirical constant (typically 2).
Example
• Let’s assume the most frequent word ("AI") appears 100 times → f_max = 100.
• Using empirical constants k1 = 10 and k2 = 2:
– Words appearing < 10 times → too rare (ignored).
– Words appearing > 20 times → too common (ignored).
– Words appearing between 10 and 20 times → most informative.
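• A minimal sketch applying Luhn's cutoffs to this example (the term counts are assumed; the cutoff formulas are the ones given above, chosen to match this example):

from collections import Counter

# Assumed term frequencies for a small corpus; "AI" is the most frequent word.
freq = Counter({"AI": 100, "the": 95, "model": 18, "dataset": 15,
                "training": 12, "quantum": 4, "microgrid": 2})

k1, k2 = 10, 2
f_max = max(freq.values())   # 100
f_min = f_max / k1           # lower cutoff: 10
f_upper = k2 * f_min         # upper cutoff: 20

# Keep only mid-frequency terms: neither too rare nor too common.
index_terms = [t for t, f in freq.items() if f_min <= f <= f_upper]
print(index_terms)  # ['model', 'dataset', 'training']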
Heaps Law
• Heap's Law is an empirical law used in Natural
Language Processing (NLP) and information
retrieval (IR).
• It describes the relationship between the size of a
text corpus and the number of unique words
(vocabulary size) that appear in it.
Heaps Law
• Heap's Law Formula: Heap’s Law is mathematically represented as:
V(n) = K · n^β
where:
– V(n) = the number of unique words (vocabulary size) in a corpus of n words
(the total number of tokens).
– K = a constant, which depends on the language and corpus (often around 10).
– n = the total number of words (tokens) in the corpus.
– β = an exponent that typically ranges between 0.4 and 0.6, depending on the corpus.
Heaps Law
• where:
– V(n) = the number of unique words (vocabulary size) in a corpus of n words
(the total number of tokens).
– K = a constant, which depends on the language and corpus (often around 10).
– n = the total number of words (tokens) in the corpus.
– β = an exponent that typically ranges between 0.4 and 0.6, depending on the
corpus. For most natural language texts, this value is around 0.5.
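• A minimal sketch using Heap's Law to estimate vocabulary size (K = 10 and β = 0.5 are the typical values given above; the corpus sizes are assumed):

def heaps_vocabulary(n, K=10, beta=0.5):
    # V(n) = K * n^beta: expected number of unique words in n tokens.
    return K * n ** beta

for n in (10_000, 100_000, 1_000_000):
    print(f"{n:>9,} tokens -> about {heaps_vocabulary(n):,.0f} unique words")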
Term Weighting
Home work
Weight xx%
❖ Statistical Properties of Text
– Explain in detail Zipf's Law, Heap’s Law and Luhn’s Law, as well as their
contribution to information retrieval text operations.