Introduction to
Information Storage and Retrieval
 Chapter Two: Text Operations
    Chapter Objectives
    At the end of this chapter students will be able to:
       Identify the major steps in text operations
       Understand challenges in text operations
       Differentiate content-bearing terms and function terms
       Tokenize text into index terms
       Eliminate stopwords from text
       Understand stemming and stemming algorithms
       Understand the need for thesaurus construction
       Compute the distribution of terms in text
       Relate term distribution and term significance
             Text Operations
     Five text operations
    1.       Lexical analysis
              Objective is to handle digits, hyphens, punctuation marks, and case of
               letters
    2.       Stop word elimination
          Objective is filtering out words with very low discriminating power for
             retrieval purpose
    3.       Stemming
          Removing affixes to allow retrieval of documents containing syntactic
             variations of query terms
    4.       Term selection
          Select index terms that carry more semantic content
    5.       Thesaurus construction
          Term categorization to allow expansion of query with related terms
     Text Operations …
 Not all words in a document are equally significant to represent the
     contents/meanings of a document
      Some words carry more meaning than others
      Noun words are the most representative of a document's content
 Therefore, need to preprocess the text of a document in a collection
  to be used as index terms
 Using the set of all words in a collection to index documents creates
  too much noise for the retrieval task
      Reducing noise means reducing the number of words that can be used to refer to the document
 Preprocessing is the process of controlling the size of the
     vocabulary or the number of distinct words used as index terms
      Preprocessing will lead to an improvement in the information retrieval
       performance
 However, some search engines on the Web omit preprocessing
  Every word in the document is an index term
Generating Document Representatives
Text Processing System
    Input text – full text, abstract or title
    Output – a document representative adequate for use in an automatic
      retrieval system
 The document representative consists of a list of class names, each
    name representing a class of words occurring in the total input text.
    A document will be indexed by a name if one of its significant words
    occurs as a member of that class.
    Documents → Tokenization → Stop word elimination → Stemming → Thesaurus → Index Terms
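
As a rough illustration only, the stages of this pipeline could be sketched in Python as follows (a minimal sketch with a toy stop list and a stand-in stemmer, not a real IR system):

    import re

    STOPWORDS = {"the", "a", "an", "of", "in", "and", "is"}   # toy stop list

    def tokenize(text):
        # lexical analysis: lowercase and keep alphabetic strings only
        return re.findall(r"[a-z]+", text.lower())

    def remove_stopwords(tokens):
        return [t for t in tokens if t not in STOPWORDS]

    def stem(token):
        # stand-in for a real stemmer: strip a final 's'
        return token[:-1] if token.endswith("s") and len(token) > 3 else token

    def document_representative(text):
        return [stem(t) for t in remove_stopwords(tokenize(text))]

    print(document_representative("The cats slept in the living rooms"))
    # -> ['cat', 'slept', 'living', 'room']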
    Tokenization/Lexical Analysis
Change text of the documents into words to be adopted as
 index terms
Objective - identify words in the text
     Digits, hyphens, punctuation marks, case of letters
     Numbers alone are usually not good index terms (e.g. 1910, 1999); but a
     combination such as 510 B.C. is unique and meaningful
     Hyphens – break up the words (e.g. state-of-the-art = state of the
     art); but some words, e.g. gilt-edged, B-49, are unique words which
     require hyphens
       gilt-edged and gilt edged do not mean the same; similarly, B-49 refers to a
        Soviet submarine while B49 refers to a city bus route in the US
     Punctuation marks – remove them totally unless significant, e.g. in
     program code, x.exe and xexe are different
     Case of letters – usually not important; all letters can be converted to
     upper or lower case
 Tokenization …
Analyze text into a sequence of discrete tokens (words).
Input: “Friends, Romans and Countrymen”
Output: Tokens (an instance of a sequence of characters that are
 grouped together as a useful semantic unit for processing)
 Friends
 Romans
 and
 Countrymen
Each such token is now a candidate for an index entry,
 after further processing
But what are valid tokens to produce?
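
A minimal tokenizer along these lines could be sketched as follows (illustration only; real tokenizers must also handle hyphens, apostrophes, numbers, etc., as discussed next):

    import re

    def tokenize(text):
        # split on runs of non-alphanumeric characters; punctuation is dropped
        return [t for t in re.split(r"[^A-Za-z0-9]+", text) if t]

    print(tokenize("Friends, Romans and Countrymen"))
    # -> ['Friends', 'Romans', 'and', 'Countrymen']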
    Issues in Tokenization
One word or multiple: How do you decide it is one token or
 two or more?
Splitting on white space can also split what should be regarded
 as a single token. This occurs most commonly with names (San
 Francisco, Los Angeles)
    Hewlett-Packard → Hewlett and Packard as two tokens?
        state-of-the-art: break up the hyphenated sequence?
        San Francisco, Los Angeles
    lowercase, lower-case, lower case ?
        data base, database, data-base
• cases with internal spaces that we might wish to regard as a single
  token include phone numbers ((800) 234-2333) and dates
  (Mar 11, 1983)
• Splitting tokens on spaces can cause bad retrieval results, for
  example, if a search for York University mainly returns documents
  containing New York University
Issues in Tokenization …
• The problems of hyphens and non-separating whitespace can even
  interact; Advertisements for air fares frequently contain items like San
  Francisco-Los Angeles, where simply doing whitespace splitting would
  give unfortunate results
• Elimination of periods can be harmful for some tokens, e.g.
      IP addresses (100.2.86.144)
How to handle special cases involving apostrophes, hyphens etc? C++,
    C#, URLs, emails, …
     Sometimes punctuation (e-mail), numbers (1999), and case (Republican vs.
       republican) can be a meaningful part of a token, although usually they
       are not
 Simplest approach is to ignore all numbers and punctuation and use only
    case-insensitive unbroken strings of alphabetic characters as tokens
    Generally, don’t index numbers as text,
      Will often index "meta-data", including creation date, format, etc., separately
Issues of tokenization are language specific
    Requires the language to be known
Elimination of Stopwords
Stopwords are extremely common words across document
     collections that have no discriminatory power
They may occur in 80% of the documents in a collection.
They would appear to be of little value in helping select documents
     matching a user need and need to be filtered out from the index list
Examples of stopwords:
     articles (a, an, the);
     pronouns: (I, he, she, it, their, his)
      prepositions (on, of, in, about, besides, against),
      conjunctions/ connectors (and, but, for, nor, or, so, yet),
      verbs (is, are, was, were),
      adverbs (here, there, out, because, soon, after) and
      adjectives (all, any, each, every, few, many, some)
Stopwords are language dependent
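
A simple dictionary-based stopword filter could be sketched as follows (the stop list here is an illustrative subset; real lists are much longer and language dependent):

    # illustrative subset of an English stopword list
    STOPWORDS = {"a", "an", "the", "i", "he", "she", "it", "on", "of", "in",
                 "and", "but", "or", "is", "are", "was", "were"}

    def remove_stopwords(tokens):
        return [t for t in tokens if t.lower() not in STOPWORDS]

    print(remove_stopwords(["The", "cat", "slept", "in", "the", "room"]))
    # -> ['cat', 'slept', 'room']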
     Elimination of Stopwords …
Stop word elimination used to be standard in older IR
 systems
But the trend is away from doing this. Most web search
 engines index stop words:
  Good query optimization techniques mean you pay little
   at query time for including stop words.
  You need stopwords for:
       Phrase queries: “King of Denmark”
       Various song titles, etc.: “Let it be”, “To be or not to be”
       “Relational” queries: “flights to London”
     Elimination of stopwords might reduce recall (e.g. “To be
      or not to be” – all eliminated except “be” – no or irrelevant
      retrieval)
How to determine a list of stopword?
 Intuition:
      Stopwords have little semantic content; it is typical to remove such
       high-frequency words
      Stopwords take up about 50% of the text. Hence, document size reduces by
       30-50% when stopwords are eliminated
 One method: Sort terms (in decreasing order) by collection
     frequency and take the most frequent ones
     In a collection about insurance practices, “insurance” would be a stop
      word
 Another method: Build a stop word list that contains a set of
     articles, pronouns, etc.
       Why do we need stop lists: with a stop list, we can compare and
        exclude from the index terms entirely the most common words
  With the removal of stopwords, we can measure a better approximation
   of importance for classification, summarization, etc.
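
The frequency-based method above could be sketched as follows (a sketch only; the resulting candidate list still needs manual review, as the insurance example shows):

    from collections import Counter

    def build_stoplist(tokens, top=25):
        # take the most frequent terms in the collection as stopword candidates;
        # a frequent domain term (e.g. "insurance") may still be a poor stopword
        return {term for term, _ in Counter(tokens).most_common(top)}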
     Normalization
• It is standardizing tokens so that matches occur
     despite superficial differences in the character
     sequences of the tokens
     Need to “normalize” terms in indexed text as well as query terms
      into the same form
     Example: We want to match U.S.A. and USA, by deleting periods
      in a term
Case Folding: Often best to lower case everything, since users
     will use lowercase regardless of ‘correct’ capitalization…
     Republican vs. republican
     Fasil vs. fasil vs. FASIL
     Anti-discriminatory vs. antidiscriminatory
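Case folding and period deletion could be sketched as follows (a minimal illustration of normalization):

    def normalize(token):
        # delete periods and fold case, so "U.S.A." and "USA" map to the same term
        return token.replace(".", "").lower()

    print(normalize("U.S.A."), normalize("USA"), normalize("Republican"))
    # -> usa usa republican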
     Normalization issues
Good for
 Allow instances of Automobile at the beginning of a sentence to
  match with a query of automobile
 Helps a search engine when most users type ferrari when they
  are interested in a Ferrari car
Bad for
 Proper names vs. common nouns
        E.g. General Motors, Associated Press, …
 Solution:
 lowercase only words at the beginning of the sentence
In IR, lowercasing is most practical because of the way
     users issue their queries
 Stemming
  Stemming is the conflation of variant forms of a word into a single
     representation, the stem, semantically related to the variants
     The words connect, connects, connected, connecting, connectivity, and
       connection can be stemmed into connect
  The stem does not need to be a valid word, but it needs to capture the
   meaning of the word
  Stemming aims to increase the effectiveness of information retrieval,
   especially recall, whereby more of the relevant documents in the collection
   are retrieved
  It is also used to reduce the size of index files: since a single stem
   typically corresponds to several full terms, by storing stems instead of
   terms a compression factor of about 50 percent can be achieved
  Conflation of words, or so-called stemming, can either be done manually
   by using some kind of regular expressions or automatically using
   stemmers
     Term conflation
 One of the problems involved in the use of free text for
  indexing and retrieval is the variation in word forms that is
  likely to be encountered
 The most common types of variations are
      spelling errors (father, fathor)
      alternative spellings, i.e. local or national usage (color vs
       colour, labor vs labour)
      multi-word concepts (database, data base)
      affixes (dependent, independent, dependently)
       abbreviations (i.e., that is).
     Conflation or stemming methods
     Automatic approaches:
        Affix Removal Method (longest match, simple removal)
        Table lookup Method
        Successor Variety Method
        n-gram Method
     Affix removal
     Affix removal method removes suffix or prefix from the
      words so as to convert them into a common stem form
     Most of the stemmers that are currently used use this type of
      approach for conflation
     Affix removal method is based on two principles: one is
      iteration and the other is longest match
     An iterative stemming algorithm is simply a recursive
      procedure, as its name implies, which removes strings in
      each order-class one at a time, starting at the end of a word
      and working toward its beginning
     No more than one match is allowed within a single order-
      class, by definition
     Affix removal
 Iteration is usually based on the fact that suffixes are attached
  to stems in a certain order, that is, there exist order-classes of
  suffixes
 The longest-match principle states that within any given class
  of endings, if more than one ending provides a match, the one
  which is longest should be removed
 The first stemmer based on this approach is the one developed
  by Lovins (1968); M.F. Porter (1980) also used this method
 However, Porter's stemmer is more compact and easier to use
  than Lovins's
 YASS is another stemmer based on the same approach; it is
  however language independent
     Table lookup approach
Store terms and their corresponding stems in a table
Stemming is then done via lookups in the table
One way to do stemming is to store a table of all index
 terms and their stems
Terms from queries and indexes could then be stemmed
 via table lookup
Problems with this approach
     To build these lookup tables, we need to work extensively on a
      language
     There is some probability that these tables may miss out
      some exceptional cases
     storage overhead for such a table
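A table-lookup stemmer is essentially a dictionary lookup; the sketch below (with an illustrative three-entry table) also shows the missed-case problem:

    # illustrative fragment of a term-to-stem table; a real table must cover the language
    STEM_TABLE = {
        "engineering": "engineer",
        "engineered": "engineer",
        "engineer": "engineer",
    }

    def stem(term):
        # fall back to the term itself when the table has no entry
        return STEM_TABLE.get(term, term)

    print(stem("engineered"), stem("engineers"))
    # -> engineer engineers   (the second lookup misses the table)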
      Successor Variety approach
 Determine word and morpheme boundaries based on the distribution of
  phonemes in a large body of utterances
 The successor variety of a string is the number of different characters that follow
  it in words in some body of text
 The successor variety of substrings of a term will decrease as more characters are
  added until a segment boundary is reached
     Test Word: READABLE
     Corpus: ABLE, APE, BEATABLE, FIXABLE, READ, READABLE,
            READING, READS, RED, ROPE, RIPE
       Prefix         Successor Variety      Letters
       R                     3               E, I, O
       RE                    2               A, D
       REA                   1               D
       READ                  3               A, I, S
       READA                 1               B
       READAB                1               L
       READABL               1               E
       READABLE              1               (blank)
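
The successor varieties in the table can be recomputed with a sketch like the following (note: the last row of the table counts the end-of-word blank as a follower, which this sketch does not):

    # successor variety of a prefix: how many distinct letters follow it in the corpus
    CORPUS = ["ABLE", "APE", "BEATABLE", "FIXABLE", "READ", "READABLE",
              "READING", "READS", "RED", "ROPE", "RIPE"]

    def successor_variety(prefix, corpus=CORPUS):
        followers = {w[len(prefix)] for w in corpus
                     if w.startswith(prefix) and len(w) > len(prefix)}
        return len(followers), sorted(followers)

    for i in range(1, len("READABLE") + 1):
        print("READABLE"[:i], successor_variety("READABLE"[:i]))
    # R -> (3, ['E', 'I', 'O']), RE -> (2, ['A', 'D']), REA -> (1, ['D']), ...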
        n-gram stemmers
  Another method of conflating terms is called the shared digram method
  A digram is a pair of consecutive letters
  Besides digrams we can also use trigrams and hence it is called n-gram method
   in general
  Association measures are calculated between pairs of terms based on shared
   unique digrams
     statistics => st ta at ti is st ti ic cs
     unique digrams = at cs ic is st ta ti
     statistical => st ta at ti is st ti ic ca al
     unique digrams = al at ca ic is st ta ti
  Dice’s coefficient (similarity)
                                   2C     2*6
                                               .80S
                                  A B 78
  A and B are the numbers of unique digrams in the first and the second words. C
   is the number of unique digrams shared by A and B
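
The digram extraction and Dice's coefficient can be sketched as follows, reproducing the 0.80 value for statistics/statistical:

    def digrams(word):
        # set of unique adjacent letter pairs
        return {word[i:i + 2] for i in range(len(word) - 1)}

    def dice(w1, w2):
        a, b = digrams(w1), digrams(w2)
        return 2 * len(a & b) / (len(a) + len(b))

    print(round(dice("statistics", "statistical"), 2))
    # -> 0.8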
     n-gram stemmers …
Similarity measures are determined for all pairs of terms in
 the database, forming a similarity matrix
Once such similarity is computed for all the word pairs, they
 are clustered into groups
The value of the Dice coefficient gives us the hint that the stem
 for this pair of words lies in the first 8 unique digrams out
 of 10.
Criteria for judging stemmers
Correctness
 Overstemming: too much of a term is removed.
       Can cause unrelated terms to be conflated → retrieval of non-relevant
        documents
       Over-stemming is when two words with different stems are stemmed to
        the same root. This is also known as a false positive
     Understemming: too little of a term is removed.
       Prevents related terms from being conflated → relevant documents may
        not be retrieved
       Under-stemming is when two words that should be stemmed to the
        same root are not. This is also known as a false negative
         Term                       Stem
         GASES                      GAS (correct)
         GAS                        GA (over)
         GASEOUS                    GASE (under)
       Assumptions in Stemming
     Words with the same stem are semantically
      related and have the same meaning to the
      user of the text
     The chance of matching increases when the
      index terms are reduced to their word stems,
      because it is more natural to search using
      "retrieve" than "retrieving"
Of the four types of stemming strategies (affix removal, table
lookup, successor variety, and n-grams) which is preferable?
 Table lookup consists simply of looking for the stem of a word
  in a table, a simple procedure but dependent on data on stems
  for the whole language. Since such data is not readily available
  and might require considerable storage space, this type of
  stemming algorithm might not be practical.
Successor variety stemming is based on the determination of
  morpheme boundaries, uses knowledge from structural
  linguistics, and is more complex than affix removal algorithms
 N-grams stemming is based on the identification of digrams
  and trigrams and is more a term clustering procedure than a
  stemming one
 Affix removal stemming is intuitive, simple, and can be
  implemented efficiently
   Porter Stemmer
 Porter stemmer is the most popular affix (suffix) removal algorithm, for its
 simplicity and elegance
 Despite being simpler, the Porter algorithm yields results comparable to those of
  the more sophisticated algorithms
 The Porter algorithm uses a suffix list for suffix stripping; the idea is to apply a
  series of rules to the suffixes of the words in the text. For instance, the suffix s is
  replaced by nil (s → ∅), converting plurals into singular
 Among the rules whose left-hand sides match, the one matching the longest
   sequence of letters is applied
 Applied to the word stresses, this yields the stem stress instead of the stem stresse.
 A detailed description of the Porter algorithm can be found in the appendix of the
   textbook, and its implementation at
http://tartarus.org/~martin/PorterStemmer/index.html
     Porter stemmer
Most common algorithm for stemming English words to
 their common grammatical root
It is a simple procedure for removing known affixes in
 English without using a dictionary. To get rid of plurals,
 the following rules are used:
 SSES → SS           caresses → caress
 IES  → Y            ponies → pony
 SS   → SS           caress → caress
 S    →              cats → cat
 ment →              (delete the final element if what remains is longer than 1
     character)
                                 replacement → replace
                                 cement → cement
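
A sketch of just these plural rules is shown below (longest matching suffix first); note that the published Porter algorithm differs in details, e.g. it actually maps IES to I (ponies → poni):

    def strip_plural(word):
        # rules are tried in order of longest matching suffix
        if word.endswith("sses"):
            return word[:-2]          # SSES -> SS   caresses -> caress
        if word.endswith("ies"):
            return word[:-3] + "y"    # IES  -> Y    ponies -> pony
        if word.endswith("ss"):
            return word               # SS   -> SS   caress -> caress
        if word.endswith("s"):
            return word[:-1]          # S    ->      cats -> cat
        return word

    for w in ["caresses", "ponies", "caress", "cats", "stresses"]:
        print(w, "->", strip_plural(w))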
  Thesauri
Often, full-text searching cannot be accurate, since different
  authors may select different words to represent the same
  concept
  Problem: the same meaning can be expressed using different terms
   that are synonyms, homonyms, or related terms
  How can we ensure that, for the same meaning, identical
   terms are used in the index and in the query?
Thesaurus: The vocabulary of a controlled indexing
 language, formally organized so that a priori relationships
 between concepts (for example as "broader" and “related")
 are made explicit.
A thesaurus contains terms and relationships between terms
  IR thesauri rely typically upon the use of symbols such as USE/UF
    (UF=used for), BT, and RT to demonstrate inter-term relationships.
  e.g., car = automobile, truck, bus, taxi, motor vehicle
        color = colour, paint
     Aim of Thesaurus
Thesaurus tries to control the use of the vocabulary by
 showing a set of related words to handle synonyms and
 homonyms
The aim of thesaurus is therefore:
 to provide a standard vocabulary for indexing and searching
      A thesaurus rewrites terms to form equivalence classes, and we index such
       equivalence classes
      When the document contains automobile, index it under car as well
       (usually, also vice-versa)
 to assist users with locating terms for proper query
  formulation: When the query contains automobile, look
  under car as well for expanding query
 to provide classified hierarchies that allow the broadening
  and narrowing of the current request according to user needs
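
Query expansion with a thesaurus could be sketched as follows (the thesaurus entries are illustrative only):

    # toy thesaurus mapping a term to its equivalence class / related terms
    THESAURUS = {
        "automobile": {"car", "motor vehicle"},
        "car": {"automobile", "motor vehicle"},
    }

    def expand_query(terms):
        expanded = set(terms)
        for t in terms:
            expanded |= THESAURUS.get(t, set())
        return expanded

    print(expand_query(["automobile", "insurance"]))
    # -> {'automobile', 'car', 'motor vehicle', 'insurance'}  (set order may vary)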
  Thesaurus Construction
Example: thesaurus built to assist IR for searching cars and
  vehicles :
Term: Motor vehicles
  UF :      Automobiles, Cars, Trucks
  BT:       Vehicles
  RT:       Road Engineering, Road Transport
Example: thesaurus built to assist IR in the fields of
 computer science:
TERM: natural languages
   UF natural language processing (UF=used for NLP)
   BT languages (BT=broader term is languages)
   TT languages (TT = top term is languages)
   RT artificial intelligence (RT=related term/s)
     computational linguistics, formal languages, query languages, speech
      recognition
       Language-specificity
     Many of the above features embody transformations
      that are
       Language-specific and
       Often, application-specific
     These are “plug-in” addenda to the indexing process
     Both open source and commercial plug-ins are
      available for handling these
 Statistical Properties of Text
How is the frequency of different words distributed?
 Refer to Zipf’s law
How fast does vocabulary size grow with the size of a
 corpus? Refer to Heaps' law
  Such factors affect the performance of IR system & can be used
    to select suitable term weights & other aspects of the system.
A few words are very common.
 2 most frequent words (e.g. “the”, “of”) can account for about
  10% of word occurrences in a document.
Most words are very rare.
 Half the words in a corpus appear only once, called “read only
  once”
Called a "heavy tailed" distribution, since most of the
 probability mass is in the "tail"
 Sample Word Frequency Data
      Zipf’s distributions
For all the words in a documents collection, for each word w
•    f : is the frequency that w appears
•    r : is rank of w in order of frequency.
        (The most commonly occurring word has rank 1, etc.)
     [Figure: plot of frequency f (vertical axis) against rank r (horizontal axis),
      showing the distribution of sorted word frequencies according to Zipf's law;
      a word w has rank r and frequency f]
     Word distribution: Zipf's Law
Zipf's Law, named after the Harvard linguistics professor
     George Kingsley Zipf (1902-1950),
     attempts to capture the distribution of the frequencies (i.e.,
      number of occurrences) of the words within a text
Zipf's Law states that when the distinct words in a text
 are arranged in decreasing order of their frequency of
 occurrence (most frequent words first), the occurrence
 characteristics of the vocabulary can be characterized by the
 constant rank-frequency law of Zipf:
               Frequency * Rank = constant
If the words, w, in a collection are ranked, r, by their
 frequency, f, they roughly fit the relation:    r*f=c
 Different collections have different constants c
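A quick empirical check of the rank * frequency product could be sketched as follows (assuming a tokenize() function like the one sketched earlier; corpus.txt is a placeholder file name):

    from collections import Counter

    def zipf_table(tokens, top=10):
        # rank * frequency should be roughly constant if the text follows Zipf's law
        ranked = Counter(tokens).most_common(top)
        return [(rank, term, freq, rank * freq)
                for rank, (term, freq) in enumerate(ranked, start=1)]

    # usage:
    # for row in zipf_table(tokenize(open("corpus.txt").read())):
    #     print(row)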
       Example: Zipf's Law
 The table shows the most frequently occurring words from a
     336,310-document collection containing 125,720,891 total
     words, of which 508,209 are unique words
Zipf’s law: modeling word distribution
The collection frequency of the ith most common term is
     proportional to 1/i
                                  fi ∝ 1/i
     If the most frequent term occurs f1 times, then the second most
       frequent term has half as many occurrences, the third most
       frequent term a third as many, etc.
Zipf's Law states that the frequency of the i-th most
     frequent word is 1/i^θ times that of the most frequent word
     The probability of occurrence of some event (P), as a function of its rank (i),
       when the rank is determined by the frequency of occurrence, is a
       power-law function Pi ~ 1/i^θ with the exponent θ close to unity.
     More Example: Zipf’s Law
 Illustration of Rank-Frequency Law. Let the total number of word
  occurrences in the sample N = 1,000,000
           Rank (R)         Term   Frequency (F)   R*(F/N)
                1           the       69,971        0.070
                2            of       36,411        0.073
                3           and       28,852        0.086
                4            to       26,149        0.104
                5            a        23,237        0.116
                6            in       21,341        0.128
                7           that      10,595        0.074
                8            is       10,099        0.081
                9           was        9,816        0.088
               10            he        9,543        0.095
     Methods that Build on Zipf's Law
 • Stop lists: Ignore the most frequent words (upper cut-
   off)
     • Used by almost all systems
 • Significant words: Take words in between the most
   frequent (upper cut-off) and least frequent words (lower
   cut-off)
     • Used in IR systems
 • Term weighting: Give differing weights to terms based
   on their frequency, with most frequent words weighted
   less
     • Used by almost all ranking methods
 Explanations for Zipf’s Law
The law has been explained by the "principle of least effort",
     which makes it easier for a speaker or writer of a language
     to repeat certain words instead of coining new words
     Zipf's explanation was his "principle of least effort", which balances the
  speaker's desire for a small vocabulary against the hearer's desire
  for a large one
Zipf’s Law Impact on IR
 Good News: Stopwords will account for a large fraction
      of text so eliminating them greatly reduces inverted-index
      storage costs
     Bad News: For most words, gathering sufficient data for
      meaningful statistical analysis (e.g. for correlation
      analysis for query expansion) is difficult since they are
      extremely rare
     Word significance: Luhn’s Ideas
     Luhn Idea (1958): the frequency of word occurrence in a
      text furnishes a useful measurement of word significance
     Luhn suggested that both extremely common and
      extremely uncommon words were not very useful for
      indexing
     Word significance: Luhn’s Ideas
For this, Luhn specifies two cutoff points: an upper and a
 lower cutoff, based on which non-significant words are
 excluded
     The words exceeding the upper cutoff are considered to be too
      common
     The words below the lower cutoff are considered to be too rare
     Hence they do not contribute significantly to the content of the
      text
     The ability of words to discriminate content reaches a peak at a
      rank-order position halfway between the two cutoffs
Let f be the frequency of occurrence of words in a text, and    r their
 rank in decreasing order of word frequency, then a plot relating f & r
 yields the following curve
 Luhn’s Ideas
 Luhn (1958) suggested that both extremely common and extremely
 uncommon words were not very useful for document representation
  & indexing
 Vocabulary Growth: Heaps’ Law
How does the size of the overall vocabulary (number of
 unique words) grow with the size of the corpus?
 This determines how the size of the inverted index will scale with
     the size of the corpus.
Heaps’ law: estimates the number of vocabularies in a
 given corpus
 For a text of n words, the vocabulary size grows by O(nβ), where β
  is a constant, 0 < β < 1
 If V is the size of the vocabulary and n is the length of the corpus
  in words, Heaps provides the following equation:
                                  V = K * n^β
 where K and β are constants; typically K ≈ 10-100 and
 β ≈ 0.4-0.6 (approximately the square root)
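A direct sketch of the formula, with illustrative values for the constants K and β:

    def heaps_vocabulary(n, K=50, beta=0.5):
        # V = K * n**beta; K and beta are empirical constants (K ~ 10-100, beta ~ 0.4-0.6)
        return K * n ** beta

    print(int(heaps_vocabulary(1_000_000)))
    # -> 50000 with these illustrative constants (1,000,000 ** 0.5 = 1,000)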
 Heaps’ distributions
• Distribution of size of the vocabulary: there is a linear
  relationship between vocabulary size and number of
  tokens
     • Example: from 1,000,000,000 words, there may be
       1,000,000 distinct words. Do you agree?
 Example
We want to estimate the size of the vocabulary for a corpus of
     1,000,000 words. However, we only know the statistics
     computed on smaller corpora sizes:
     For 100,000 words, there are 50,000 unique words
     For 500,000 words, there are 150,000 unique words
     Estimate the vocabulary size for the 1,000,000-word corpus
     How about for a corpus of 1,000,000,000 words? (One way to fit the
      constants is sketched below)
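One possible way to fit K and β from the two given measurements and then extrapolate (a sketch; with these two points β ≈ 0.68, and the estimate for the 1,000,000-word corpus comes out at roughly 240,000 distinct words):

    import math

    # fit V = K * n**beta from the two given measurements, then extrapolate
    n1, v1 = 100_000, 50_000
    n2, v2 = 500_000, 150_000

    beta = math.log(v2 / v1) / math.log(n2 / n1)   # ~0.68
    K = v1 / n1 ** beta                            # ~19

    for n in (1_000_000, 1_000_000_000):
        print(n, int(K * n ** beta))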
     Relevance Measure
                                    retrieved                    not retrieved
                   irrelevant       retrieved & irrelevant       not retrieved & irrelevant
                   relevant         retrieved & relevant         not retrieved but relevant
       Issues: recall and precision
     breaking up hyphenated terms increases recall
      but decreases precision
     preserving case distinctions enhances precision
      but decreases recall
     commercial information systems usually take a
      recall-enhancing approach (numbers and words
      containing digits are index terms, and all are
      case insensitive)
        Exercise: Tokenization
     The cat slept peacefully in the living room. It’s a very old
       cat.
     Mr. O’Neill thinks that the boys’ stories about Chile’s
       capital aren’t amusing.
     Index term selection
 Index language is the language used to describe
  documents and requests
 Elements of the index language are index terms which
  may be derived from the text of the document to be
  described, or may be arrived at independently.
      If a full text representation of the text is adopted, then
       all words in the text are used as index terms = full text
       indexing
      Otherwise, we need to select the words to be used as index
       terms to reduce the size of the index file, which is
       essential for designing an efficient searching IR system
          End of Chapter Two