Course Title: Information Storage and Retrieval
Course Code: ITec-4081
Credits (CP): 5 [2 Lecture, 3 Lab]
Chapter - 2
Text/Document Operations
Contents
• Document Preprocessing
– Lexical Analysis
– Stopword elimination
– Stemming
• Index term selection
– Luhn’s selection
– Zipf’s law
• Term extraction
– Term weighting
– Similarity measures
Cont.…
• Text/document Preprocessing is the process of
– controlling the size of the vocabulary or the number of distinct
words used as index terms.
– It will lead to an improvement in the information retrieval
performance.
• However, some search engines on the Web omit preprocessing
– i.e. every word in the document will be used as an index term.
• Document preprocessing is a procedure which can be divided mainly
into five text operations (or transformations): Lexical Analysis of the
Text, Elimination of Stopwords, Stemming, Thesauri and Index Terms
Selection
Concepts in Document Processing
• Lexical analysis
– Convert a stream of chars into a set of words
• Index terms selection
– Full text or not
– Noun groups [Inquery system 1995]
• Most of the semantics is carried by the nouns
• Elimination of stop words
– Words with high frequency are bad discriminators
– E.g., articles, prepositions, conjunctions, etc.
• Stemming
– Reduce a word to its grammatical root by removing affixes (prefixes
and suffixes)
Generating Document Representatives
• Text Processing System
– Input text – full text, abstract or title.
– Output – a document representative adequate for use in an
automatic retrieval system.
• The document representative consists of a list of class names, each name
representing a class of words occurring in the total input text.
– A document will be indexed by a name if one of its
significant words occurs as a member of that class.
Pipeline: Documents → Tokenization → Stop words → Stemming → Thesaurus → Index terms
Text Operations
• Not all words in a document are equally significant to represent the
contents/meanings of a document.
– Some words carry more meaning than others.
– Noun words are the most representative of a document's content.
• Therefore, there is a need to preprocess the text of a document in a
collection to choose those used as index terms.
• Using the set of all words in a collection to index documents creates
too much noise for the retrieval task.
– Reducing noise means reducing the number of words which can be used to refer to the
document.
Cont.….
• Text operations are the process of transforming text into logical
representations.
• The main text operations for selecting index terms, i.e. for choosing the
words/stems (or groups of words) to be used as indexing terms, are:
– Lexical analysis/tokenization of the text - handling digits, hyphens, punctuation
marks, and the case of letters.
– Elimination of stop words - filter out words which are not useful in
the retrieval process.
– Stemming words - remove affixes (prefixes and suffixes).
– Construction of term categorization structures such as thesaurus, to
capture relationship for allowing the expansion of the original query
with related terms.
Lexical Analysis/Tokenization of Text
Lexical analysis is the process of converting a stream of characters (the text of the
documents) into a stream of words (the candidate words to be adopted as index
terms).
Thus, one of the major objectives of the lexical analysis phase is the identification of
the words in the text.
For instance, the following four particular cases have to be considered with care:
digits, hyphens, punctuation marks, and the case of the letters (lower and upper
case).
Numbers are usually not good index terms because, without a surrounding context,
they by themselves are inherently vague.
Normally, punctuation marks are removed entirely in the process of lexical analysis.
The case of letters is usually not important for the identification of index terms.
– As a result, the lexical analyzer normally converts all the text to either lower or
upper case.
Tokenization of Text
Change the text of the documents into words (tokens) to be adopted as index
terms.
Tokenization greatly depends on how the concept of a word is defined.
– A word is a sequence of letters terminated by a separator (period,
comma, space, etc).
– Definition of letter and separator is flexible; e.g., hyphen could be
defined as a letter or as a separator.
– Usually, common words (such as “a”, “the”, “of”, …) are ignored.
Cont.….
• Tokenization is one of the steps used to convert the text of the documents
into a sequence of words, w1, w2, … wn to be adopted as index terms.
• It is the process of demarcating and possibly classifying sections of a string
of input characters into words.
• It is the mechanism of analyzing text into a sequence of discrete tokens
(words).
• For instance:
– Input: “The quick brown fox jumps over the lazy dog”.
– Output: Tokens (an instance of a sequence of characters that are
grouped together as a useful semantic unit for processing).
✓ “The” , “quick”, “brown”, “fox”, “jumps”, “over”, “the”, “lazy”, “dog”, “.”
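• A minimal sketch of this step in Python (the regular expression is an illustrative assumption, not a prescribed tokenizer): it splits the sentence into word tokens and keeps the final period as its own token, matching the output above.

import re

def tokenize(text):
    # Split a character stream into word tokens and standalone punctuation marks.
    # \w+ matches runs of letters/digits; [^\w\s] matches single punctuation characters.
    return re.findall(r"\w+|[^\w\s]", text)

print(tokenize("The quick brown fox jumps over the lazy dog."))
# ['The', 'quick', 'brown', 'fox', 'jumps', 'over', 'the', 'lazy', 'dog', '.']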
Cont.….
• Each such token is now a candidate for an index entry, after further
processing
– But what are valid tokens to use?
• Tokenization Issues
– numbers, hyphens, punctuation marks, apostrophes …
Issues in Tokenization
• One word or multiple: How to handle special cases involving hyphens,
apostrophes, punctuation marks etc.?
– C++, C#, URL’s, e-mail, …
– Sometimes punctuations (e-mail), numbers (1999), & case (Republican vs.
republican) can be a meaningful part of a token. However, frequently they
are not.
• Simplest approach is to ignore all numbers and punctuation marks (period,
colon, comma, brackets, semi-colon, apostrophe, …) and use only case-insensitive,
unbroken strings of alphabetic characters as words (see the sketch below).
– Generally, systems do not index numbers as text, but they are often very useful for
search.
– “meta-data” is often indexed, including creation date, format, etc.
separately.
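• A minimal sketch of the "simplest approach" above (the pattern and function name are assumptions for illustration): lowercase the text and keep only unbroken strings of alphabetic characters, so numbers and punctuation are dropped.

import re

def simple_tokens(text):
    # Case-insensitive: fold to lowercase, then keep only alphabetic runs.
    return re.findall(r"[a-z]+", text.lower())

print(simple_tokens("Order #42 shipped on 12/05/2023 to Addis-Ababa!"))
# ['order', 'shipped', 'on', 'to', 'addis', 'ababa']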
Issues in Tokenization
• Two words may be connected by hyphens. Should two words connected
by hyphens be taken as one word or as two words?
– Break up the hyphenated sequence into two tokens.
– In most cases the hyphen is broken up (e.g. state-of-the-art →
state of the art), but some words, e.g. Gar-malee, MS-DOS, B-49, are
unique words which require hyphens.
▪ Two words (phrase) may be separated by space.
– E.g. Addis Ababa, San Francisco, Los Angeles
▪ Two words may be written in different ways
– For example lowercase, lower-case, lower case? data base, database,
data-base.
Issues in Tokenization
• Numbers: are numbers/digits words and used as index terms?
– dates (3/12/91 vs. Mar. 12, 1991);
– phone numbers (+251923415005)
– IP addresses (100.2.86.144)
– Numbers are not good index terms (like 1910, 1999); but 510 B.C. is
unique.
– Generally, don’t index numbers as text, though very useful.
• What about case of letters (e.g. Data or data or DATA):
– Cases are not important and there is a need to convert all to upper or
lower.
• Issues of tokenization are language specific.
– Requires the language to be known.
– What works for one language doesn’t work for the other.
Exercise: Tokenization
• The cat slept peacefully in the living room. It’s a very old cat.
• The instructor (Dr. O’Neill) thinks that the boys’ stories about
Chile’s capital aren’t amusing.
1.Write down the individual sentences after tokenization.
2.How many sentences are there?
Exercise: Tokenization
1. New York-based start-up's revenue increased by 20% in Q4-2023.
2. John was born on 12/05/1995 and his phone number is +251923415005.
3. Addis Ababa University and Stanford University are top
institutions.
Elimination of Stop word
• In fact, a word which occurs in 80% of the documents in the collection is
useless for purposes of retrieval.
– Such words are frequently referred to as stop words and are normally
filtered out as potential index terms.
– Articles, prepositions, and conjunctions are natural candidates for a list
of stop words.
• So, Stop words are extremely common words across document
collections that have no discriminatory power for text/document
representation.
• They would appear to be of little value in helping select documents
matching a user need.
Cont.…
• Examples of stop words are articles, prepositions, conjunctions, etc.:
– articles (a, an, the);
– pronouns: (I, he, she, it, their, his)
– Some prepositions (on, of, in, about, besides, against),
– conjunctions/ connectors (and, but, for, nor, or, so, yet),
– verbs (is, are, was, were),
– adverbs (here, there, out, because, soon, after) and
– adjectives (all, any, each, every, few, many, some) can also be
treated as stop words.
• But mostly, Stop words are language dependent.
Cont.…
• Intuition:
– Stop words have little semantic content; it is typical to remove such high-
frequency words.
– Stop words take up a large share of the text, so removing them reduces the
document representation size by 30–50%.
• Smaller indices for information retrieval.
– Good compression techniques for indices: The 30 most common words
account for 30% of the tokens in written text.
– Better approximation of importance for classification, summarization, etc.
• An important benefit is that it reduces the size of the indexing structure
considerably.
• In fact, it is typical to obtain a significant compression in the size of the
indexing structure.
Cont.…
• Stop word elimination used to be standard in older IR systems.
• But the trend is getting away from doing this.
• Most web search engines index stopwords:
– Good query optimization techniques mean you pay little at query time
for including stopwords.
– You need stop words for:
• Phrase queries: “King of Denmark”
• Various song titles, etc.: “Let it be”, “To be or not to be”
• “Relational” queries: “flights to London”
– Elimination of stop words might reduce recall (e.g. for “To be or not to be”,
all terms are eliminated except “be”, which will produce no or irrelevant retrieval).
How to determine a list of stop words?
• One method: Sort terms by collection frequency (in decreasing order) and
take the most frequent ones (see the sketch below).
• Another method: Build a stop word list that contains a set of articles,
pronouns, etc.
– Why do we need stop lists? With a stop list, we can compare and
exclude the most common words from index terms.
• With the removal of stop words,
» we can measure better approximation of importance
for classification, summarization, etc.
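• A minimal sketch of the first method (the toy corpus and the cutoff k are assumptions for illustration): count collection frequency across all documents, sort in decreasing order, and take the most frequent terms as the stop list.

from collections import Counter

docs = [
    "the cat sat on the mat",
    "the dog chased the cat",
    "a cat and a dog played in the garden",
]

# Collection frequency: total occurrences of each term across all documents.
freq = Counter(term for doc in docs for term in doc.split())

k = 3  # assumed cutoff for the stop list size
stop_words = {term for term, _ in freq.most_common(k)}
print(stop_words)  # the k most frequent terms in this tiny collection

# Exclude stop words when selecting index terms.
index_terms = [t for doc in docs for t in doc.split() if t not in stop_words]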
Normalization
• It is the process of canonicalizing tokens (changing them to their simplest form) so that
matches occur despite superficial differences in the character sequences of the tokens.
– Need to “normalize” terms in indexed text as well as query terms into
the same form.
– Example: We want to match U.S.A. and USA, by deleting periods in a
term.
• Case Folding or Case normalization: Often best to lower case everything,
since users will use lowercase regardless of ‘correct’ capitalization (see the sketch after the examples below).
– Republican vs. republican
– Fasil vs. fasil vs. FASIL
– Anti-discriminatory vs. antidiscriminatory
– A.A. AA Addis Ababa
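• A minimal sketch of these two normalization steps (the helper name is an assumption): delete periods inside a term and fold case, so U.S.A., USA and usa all map to the same normalized form.

def normalize(term):
    # Delete periods (U.S.A. -> USA) and case-fold (USA -> usa).
    return term.replace(".", "").lower()

terms = ["U.S.A.", "USA", "Republican", "republican", "Anti-discriminatory"]
print([normalize(t) for t in terms])
# ['usa', 'usa', 'republican', 'republican', 'anti-discriminatory']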
Normalization issues
• Case folding is good for
– Allowing instances of Automobile at the beginning of a sentence to
match with a query of automobile.
– Helping a search engine when most users type ferrari when they are
interested in a Ferrari car
• Bad for
– Proper names vs. common nouns
• E.g. General Motors, Associated Press, …
• Solution: lowercase only words at the beginning of the sentence.
• In IR, lowercasing is most practical because of the way users issue their
queries.
Stemming/Morphological analysis
• Stemming reduces tokens to their “root” or stem form to recognize morphological
variation.
– The process involves removal of affixes (i.e. prefixes and suffixes) with the aim of
reducing variants to the same stem.
• Often removes inflectional and derivational morphology of a word.
• Inflectional morphology varies the form of words in order to express grammatical
features, such as singular/plural or past/present tense.
– E.g. boy → boys, cut → cutting.
• Derivational morphology makes new words from old ones.
– E.g. creation is formed from create, but they are two separate words.
– And also, destruction → destroy.
• Stemming is language dependent
– Correct stemming is language specific and can be complex.
– For example, before stemming: “for example compressed and compression are both accepted”;
after stemming: “for example compress and compress are both accept”.
Cont.….
• The final output from a conflation (reducing to the same token)
algorithm is a set of classes, one for each stem detected.
• A Stem is the portion of a word which is left after the removal of its
affixes (i.e., prefixes and/or suffixes).
– For example: ‘connect’ is the stem for {connected, connecting
connection, connections}.
– Thus, [automate, automatic, automation]→ all reduce to → automat
• A class name is assigned to a document if and only if one of its members
occurs as a significant word in the text of the document.
– A document representative then becomes a list of class names, which
are often referred as the documents index terms/keywords.
• Queries are handled in the same way.
Ways to implement stemming
• There are basically two ways to implement stemming.
• The first approach is to create a big dictionary that maps words to their
stems.
– The advantage of this approach is that
» it works perfectly (in so far as the stem of a word can be
defined perfectly);
— the disadvantages are the space required by the dictionary and the
investment required to maintain the dictionary as new words appear.
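• A minimal sketch of the dictionary approach (the tiny mapping is an illustrative assumption; a real dictionary would be far larger): look each word up in a precompiled word-to-stem table and fall back to the word itself when it is missing.

# A tiny, illustrative precompiled dictionary mapping words to stems.
stem_dict = {
    "connected": "connect", "connecting": "connect",
    "connection": "connect", "connections": "connect",
    "automate": "automat", "automatic": "automat", "automation": "automat",
}

def dictionary_stem(word):
    # Perfect for listed words; unknown (new) words are left unchanged,
    # which reflects the maintenance problem mentioned above.
    return stem_dict.get(word, word)

print([dictionary_stem(w) for w in ["connections", "automation", "retrieval"]])
# ['connect', 'automat', 'retrieval']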
Ways to implement stemming
• The second approach is to use a set of rules that extract stems from
words.
– The advantages of this approach are that the code is typically small,
and it can gracefully handle new words;
– the disadvantage is that it occasionally makes mistakes.
• But, since stemming is imperfectly defined anyway, occasional mistakes
are tolerable, and
» the rule-based approach is the one that is generally chosen.
Porter Stemmer
• Stemming is the operation of stripping the suffixes from a word, leaving
its stem.
• Google, for instance, uses stemming to search for web pages containing
the words connected, connecting, connection and connections when
users ask for a web page that contains the word connect.
• In 1979, Martin Porter developed a stemming algorithm that
– uses a set of rules to extract stems from words, and
– though it makes some mistakes, most common words seem to work
out right.
– Porter describes his algorithm and provides a reference
implementation in C at
http://tartarus.org/~martin/PorterStemmer/index.html
Porter stemmer
• It is the most common algorithm for stemming English words to their
common grammatical root.
• It is a simple procedure for removing known affixes in English without using a
dictionary.
• To get rid of plurals, the following rules are used:
– SSES → SS caresses → caress
– IES → i ponies → poni
– SS → SS caress → caress
–S → cats → cat
–EMENT → (Delete final ement if what remains is longer than 1 character )
replacement → replac
cement → cement
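• A minimal sketch of just the plural and -EMENT rules above (not the full Porter algorithm; the function names are assumptions):

def strip_plural_suffix(word):
    # Porter-style plural rules, longest suffix first.
    if word.endswith("sses"):
        return word[:-2]          # caresses -> caress
    if word.endswith("ies"):
        return word[:-3] + "i"    # ponies -> poni
    if word.endswith("ss"):
        return word               # caress -> caress
    if word.endswith("s"):
        return word[:-1]          # cats -> cat
    return word

def strip_ement(word):
    # Delete a final "ement" only if what remains is longer than 1 character.
    if word.endswith("ement") and len(word) - 5 > 1:
        return word[:-5]          # replacement -> replac; cement stays cement
    return word

for w in ["caresses", "ponies", "caress", "cats", "replacement", "cement"]:
    print(w, "->", strip_ement(strip_plural_suffix(w)))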
Porter stemmer
• While the previous rules get rid of plurals, the following rules remove -ed
or -ing.
– e.g.
– agreed → agree, disabled → disable
– matting → mat, mating → mate
– meeting → meet, milling → mill
– messing → mess, meetings → meet
– feed → feed
Stemming: challenges
• May produce unusual stems that are not English words:
– Removing ‘UAL’ from FACTUAL and EQUAL
• May conflate words (reduce to the same token) that are actually
distinct.
– “computer”, “computational”, “computation” all reduced to same
token “comput”
• Not recognize all morphological derivations.
Thesaurus
• The word thesaurus has Greek and Latin origins and is used as a
reference to a treasury of words.
• This treasury consists of
– a precompiled list of important words in a given domain of
knowledge and for each word in this list, a set of related words.
• A Thesaurus, sometimes called dictionary of synonyms,
– It is a reference work which arranges words by their meanings,
» sometimes as a hierarchy of broader and narrower terms,
» sometimes simply as lists of synonyms and antonyms.
Cont.…
• Mostly full-text searching cannot be accurate,
– since different authors may select different words to represent the
same concept.
• Problems with this are:
– The same meaning can be expressed using different terms that are
synonyms (two words having similar meaning),
– homonyms (words pronounced or spelled the same way but have
different meanings), and related terms.
Cont.…
• Thesaurus: The vocabulary of a controlled indexing language, formally
organized
– so that a priori relationships between concepts (for example as
"broader" and “related") are made explicit.
• A thesaurus contains terms and relationships between terms.
– IR thesauri typically rely upon the use of symbols such as UF (used for),
BT (broader term), and RT (related term) to demonstrate inter-term
relationships.
– For instance:
– car = automobile, truck, bus, taxi, motor vehicle
– color = colour, paint
Aim of Thesaurus
• It tries to control the use of the vocabulary by showing a set of related words
to handle synonyms and homonyms.
• The aim of thesaurus is therefore:
– to provide a standard vocabulary for indexing and searching
• A thesaurus rewrites words into equivalence classes, and we index such
equivalence classes.
• When the document contains automobile, index it under car as well
(usually, also vice-versa)
– to assist users with locating terms for proper query formulation:
• When the query contains automobile, look under car as well for
expanding query.
– to provide classified hierarchies that allow the broadening and narrowing of
the current request according to user needs.
Thesaurus Construction
• For example:
– thesaurus built to assist IR for searching cars and vehicles:
• Term: Motor vehicles
UF : Automobiles
Cars
Trucks
BT: Vehicles
RT: Road Engineering
Road Transport
Where UF=used for, BT=broader term, and RT= related term
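• A minimal sketch of using such a thesaurus for indexing and query expansion (the dictionary layout and function names are illustrative assumptions): UF variants are rewritten to the preferred term for indexing, and a query can be broadened with BT/RT terms.

# Illustrative thesaurus entry mirroring the "Motor vehicles" example above.
thesaurus = {
    "motor vehicles": {
        "UF": ["automobiles", "cars", "trucks"],
        "BT": ["vehicles"],
        "RT": ["road engineering", "road transport"],
    }
}

# Map each "used for" variant to its preferred term.
preferred = {variant: term
             for term, rel in thesaurus.items()
             for variant in rel["UF"]}

def index_term(word):
    # Index documents under the preferred term (e.g. "cars" -> "motor vehicles").
    return preferred.get(word, word)

def expand_query(term):
    # Broaden a query with broader (BT) and related (RT) terms.
    rel = thesaurus.get(term, {})
    return [term] + rel.get("BT", []) + rel.get("RT", [])

print(index_term("cars"))              # motor vehicles
print(expand_query("motor vehicles"))  # ['motor vehicles', 'vehicles', 'road engineering', 'road transport']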
More Example
• For Example thesaurus built to assist IR in the fields of computer
science:
• TERM: natural languages
UF natural language processing (UF=used for NLP)
BT languages (BT=broader term is languages)
TT languages (TT = (top term) is languages)
RT artificial intelligence (RT=related term/s)
computational linguistics
formal languages
query languages
speech recognition
Language-specificity
• Many of the above features embody transformations that are
– Language-specific and
– Often, application-specific
• These are “plug-in” addenda to the indexing process
• Both open source and commercial plug-ins are available for handling
these issues.
Index term selection
• Index language is the language used to describe documents and requests.
• Elements of the index language are index terms which may be derived
from the text of the document to be described, or may be arrived at
independently.
– If a full text representation of the text is adopted, then all words in
the text are used as index terms = full text indexing.
– Otherwise, we need to select the words to be used as index terms in order to
reduce the size of the index file, which is essential for designing an efficient
searching IR system.
Statistical Properties of Text
• How is the frequency of different words distributed?
• How fast does vocabulary size grow with the size of a corpus?
• There are three well-known researchers who defined statistical
properties of words in a text:
– Zipf’s Law: models word distribution in text corpus
– Luhn’s idea: measures word significance
– Heap’s Law: shows how vocabulary size grows with corpus size
• Such properties of a text collection greatly affect the
performance of an IR system and can be used to select suitable term
weights and other aspects of the system.
Zipf’s law
• Zipf’s Law states that in a large corpus, the frequency of a
word is inversely proportional to its rank in the frequency
table.
Generalized Zipf’s Law: f(r) = C / r^s
– f(r) is the frequency of the word at rank r.
– C is a normalization constant.
– r is the rank of the word.
– s is an exponent (typically close to 1 in natural language data).
Example of Zipf’s Law
• Scenario: Suppose we analyze a news article corpus with
10,000 words. The word frequency distribution might
look like this:
Rank Word Frequency
1 the 1000
2 is 500
3 government 250
4 policy 125
5 renewable 50
10 microgrid 10
Zipf’s Law
• The most frequent words ("the," "is") appear often but are
not informative → Stopwords removed.
• The least frequent word ("microgrid") appears only 10
times → Might be too rare.
• Moderate-frequency words ("government," "policy,"
"renewable") contain useful semantic meaning for
retrieval.
• Zipf’s Law helps remove common stopwords and rare
words.
• It identifies mid-range terms as good index terms.
Zipf's Law, Luhn's Model
and Heap's Law
Law/Model | Key Idea | Application in IR
Zipf's Law | Word frequency is inversely proportional to rank | Identifies stopwords and mid-frequency keywords
Luhn's Model | Mid-frequency words are best for indexing | Helps select optimal index terms
Heap's Law | Vocabulary grows sub-linearly with corpus size | Predicts storage needs and search efficiency
Zipf's Law, Luhn's Model
and Heap's Law
• Zipf’s Law helps remove stopwords.
• Luhn’s Model selects the best indexing
terms.
• Heap’s Law helps estimate index size as a
collection grows.
Word Distribution
• A few words are very
common.
✓2 most frequent words
(e.g. “the”, “of”) can
account for about 10%
of word occurrences.
• Most words are very rare.
✓Half the words in a
corpus appear only once,
called “read only once”
More Example: Zipf’s Law
• Illustration of the Rank-Frequency Law. Let the total number of
word occurrences in the sample be N = 1,000,000.
Rank (R) | Term | Frequency (F) | R·(F/N)
1 | the | 69,971 | 0.070
2 | of | 36,411 | 0.073
3 | and | 28,852 | 0.086
4 | to | 26,149 | 0.104
5 | a | 23,237 | 0.116
6 | in | 21,341 | 0.128
7 | that | 10,595 | 0.074
8 | is | 10,099 | 0.081
9 | was | 9,816 | 0.088
10 | he | 9,543 | 0.095
More Example: Zipf’s Law
• This is a word frequency analysis that follows Zipf's Law, which
states that the frequency of a word is inversely proportional to its rank:
F ∝ 1/R, so R·(F/N) should be roughly constant, where N is the total
number of words in the dataset.
• The most frequent word ("the") appears 69,971 times, while the
second most frequent ("of") appears 36,411 times.
• As rank increases, frequency decreases (following Zipf’s Law).
• The R.(F/N) column should ideally be close to a constant value if
Zipf’s Law holds well.
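• A small sketch that recomputes the R·(F/N) column from the table above (the numbers are taken directly from the table; small rounding differences are possible). If Zipf's Law holds, the product stays roughly constant.

N = 1_000_000  # total word occurrences in the sample

table = [  # (rank, term, frequency) from the table above
    (1, "the", 69_971), (2, "of", 36_411), (3, "and", 28_852),
    (4, "to", 26_149), (5, "a", 23_237), (6, "in", 21_341),
    (7, "that", 10_595), (8, "is", 10_099), (9, "was", 9_816),
    (10, "he", 9_543),
]

for rank, term, freq in table:
    # Zipf's Law predicts rank * (freq / N) to be approximately constant.
    print(f"{rank:2d} {term:5s} {rank * freq / N:.3f}")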
Word distribution: Zipf's Law
The x-axis represents the rank (r) of a word in a corpus (e.g., the
most frequent word is rank 1, the second most frequent is rank 2,
and so on).
The y-axis represents the frequency (𝑓) of the word’s occurrence.
The curve follows a power-law distribution, meaning that a few
words are used very frequently, while most words are used rarely.
Word distribution: Zipf's Law
The highest-ranked word (rank = 1) is the most frequent. The second-ranked
word (rank = 2) has half the frequency of the first. The third-ranked word
(rank = 3) has one-third the frequency, and so on.
Find the most frequent word (f1): in our previous example, "the" is the most
frequent word with f1 = 4.
Compute expected frequencies using f ∝ 1/r: if "the" is rank r = 1, its
frequency is f1 = 4.
The expected frequency of the second most common word: f2 = f1/2 = 4/2 = 2.
Word distribution: Zipf's Law
The expected frequency of the third most common word: f3 = f1/3 = 4/3 ≈ 1.33.
The expected frequency of the fourth most common word: f4 = f1/4 = 4/4 = 1.
Word significance: Luhn’s Ideas
• Luhn's cutoff formula helps determine the upper and lower
frequency thresholds to filter out common and rare words in a
text.
• While Luhn did not specify a single mathematical formula, the
general approach is based on Zipf’s Law and frequency
distribution.
• Lower Cutoff (f_min): the minimum frequency needed
for a word to be informative.
• Formula: f_min = f_max / k1, where:
– f_max = maximum word frequency in the text.
– k1 = empirical constant (typically 10).
• Upper Cutoff (f_upper): the frequency above which words are too common.
• Formula: f_upper = k2 × f_min, where:
– k2 = empirical constant (typically 2).
Example
• Let’s assume the most frequent word ("AI") appears 100 times → f_max = 100.
• Using empirical constants k1 = 10 and k2 = 2:
– Words appearing < 10 times → too rare (ignored).
– Words appearing > 20 times → too common (ignored).
– Words appearing between 10 and 20 times → most informative.
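• A minimal sketch applying Luhn's cutoffs to this example (the term counts are assumed; the cutoff formulas are the ones given above, chosen to match this example):

from collections import Counter

# Assumed term frequencies for a small corpus; "AI" is the most frequent word.
freq = Counter({"AI": 100, "the": 95, "model": 18, "dataset": 15,
                "training": 12, "quantum": 4, "microgrid": 2})

k1, k2 = 10, 2
f_max = max(freq.values())   # 100
f_min = f_max / k1           # lower cutoff: 10
f_upper = k2 * f_min         # upper cutoff: 20

# Keep only mid-frequency terms: neither too rare nor too common.
index_terms = [t for t, f in freq.items() if f_min <= f <= f_upper]
print(index_terms)  # ['model', 'dataset', 'training']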
Heaps Law
• Heap's Law is an empirical law used in Natural
Language Processing (NLP) and information
retrieval (IR).
• It describes the relationship between the size of a
text corpus and the number of unique words
(vocabulary size) that appear in it.
Heaps Law
• Heap's Law Formula: Heap’s Law is mathematically represented as:
V(n) = K · n^β
where:
– V(n) = the number of unique words (vocabulary size) in a corpus of n words
(the total number of tokens).
– K = a constant, which depends on the language and corpus (often around 10).
– n = the total number of words (tokens) in the corpus.
– β = an exponent that typically ranges between 0.4 and 0.6, depending on the corpus.
Heaps Law
• where:
– V(n) = the number of unique words (vocabulary size) in a corpus of n words
(the total number of tokens).
– K = a constant, which depends on the language and corpus (often around 10).
– n = the total number of words (tokens) in the corpus.
– β = an exponent that typically ranges between 0.4 and 0.6, depending on the
corpus. For most natural language texts, this value is around 0.5.
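• A minimal sketch using Heap's Law to estimate vocabulary size (K = 10 and β = 0.5 are the typical values given above; the corpus sizes are assumed):

def heaps_vocabulary(n, K=10, beta=0.5):
    # V(n) = K * n^beta: expected number of unique words in n tokens.
    return K * n ** beta

for n in (10_000, 100_000, 1_000_000):
    print(f"{n:>9,} tokens -> about {heaps_vocabulary(n):,.0f} unique words")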
Term Weighting
Home work
Weight xx%
❖ Statistical Properties of Text
– Explain in detail Zipf's Law, Heap’s Law and Luhn’s Law, as well as their
contribution to information retrieval text operations.