Processing Text
● Corpus: a collection of documents of some type
○ Corpus of news/articles
○ Usually share some characteristics
● We can rely on the textual characteristics to build text-based engines
● The statistical laws of text (discussed below) apply to:
○ Different languages
○ Different text domains
1. Words to Terms
● Text transformation/processing: convert words into index terms
○ Index term: representation of the content of a document (used for searching)
● Tokenization: split words apart
● Stopping: ignore some words to make query processing more efficient and effective
● Stemming: allow similar words (like “run” and “running”) to match each other
⇒ Search engines are about deciphering the meaning of text from counts of word occurrences
and co-occurrences (the number of times groups of words occur together in documents)
2. Text Statistics
● Distribution of word frequencies is very skewed
○ Words like “the”, “of” account for 10% of all word occurrences
○ ½ of all unique words in a large sample of text occur only once
● Knowing (1) how fast the vocabulary size grows with the size of a corpus and (2) how
the words are distributed helps us determine:
○ Indexing technique
○ Storage: how to store data + storage availability
Probability
● Experiment: a specific set of actions whose results cannot be predicted with certainty
● Sample space: possible set of recorded data
● Event space: subsets of a sample space defined by a specific event/outcome
○ Example: the event that the sum of 2 dice is 4, i.e. {(1,3), (2,2), (3,1)}, which has probability 3/36 = 1/12
● Zipf’s law: The frequency of the rth most common word is inversely proportional to r
○ Graphing frequency vs. rank shows a steep power-law decay (the more frequent the word,
the lower its rank)
○ The rank of word r * its frequency f is approximately a constant k
■ 𝑟 × 𝑓 = 𝑘
● If the rank-1 term occurs cN times, then the rank-2 term occurs cN/2 times, …
○ The most frequent word (1/1) will occur twice as frequently as the
2nd most frequent (½).
○ Alternatively:
■ 𝑟 × 𝑃_𝑟 = 𝑐 where
● 𝑃_𝑟: probability of occurrence of the rth ranked word
● 𝑐: a constant (≈ 0.1)
○ Might be inaccurate for low and high ranks
■ Not accurate at tails
○ There may be several words with the same frequency n ⇒ the number of such words is 𝑟_𝑛 − 𝑟_(𝑛+1), where 𝑟_𝑛 is
the rank of the last word in the group with frequency n, and 𝑟_(𝑛+1) is the rank of the last word of the
previous group of words with the next higher frequency
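A minimal Python sketch of checking Zipf's law on a corpus; the file name corpus.txt and the crude regex tokenizer are assumptions, not from the notes:
```python
import re
from collections import Counter

# Crude tokenization of an assumed plain-text corpus file.
tokens = re.findall(r"[a-z0-9]+", open("corpus.txt", encoding="utf-8").read().lower())
freqs = Counter(tokens)
total = len(tokens)

# Zipf: rank r times frequency f should be roughly constant,
# and r * P_r should hover around c (~0.1 for English text).
for r, (word, f) in enumerate(freqs.most_common(10), start=1):
    print(f"rank {r:2d}  {word:10s}  f={f:6d}  r*f={r * f:8d}  r*P_r={r * f / total:.3f}")
```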
● Vocabulary Growth: as the size of the corpus grows, new words occur
○ Can be due to: typos, new words in the vocabulary, …
○ Heaps' law: 𝑣 = 𝑘 × 𝑛^β
■ 𝑣: vocab size for a corpus of size 𝑛 words
■ 𝑘, β: parameters (peculiar to each collection)
● 10 ≤ 𝑘 ≤ 100
● β ≈ 0. 5
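A minimal sketch of estimating the Heaps' law parameters; the sampling interval, the simple log-log least-squares fit, and the assumed corpus.txt file (with at least a few tens of thousands of tokens) are illustration choices, not from the notes:
```python
import math
import re

# Same crude tokenization as the Zipf sketch above (corpus.txt is an assumed file).
tokens = re.findall(r"[a-z0-9]+", open("corpus.txt", encoding="utf-8").read().lower())

# Record (corpus size n, vocabulary size v) while scanning the corpus.
seen, points = set(), []
for n, tok in enumerate(tokens, start=1):
    seen.add(tok)
    if n % 10_000 == 0:
        points.append((n, len(seen)))

# Heaps' law v = k * n^beta  =>  log v = log k + beta * log n (simple least-squares fit).
xs = [math.log(n) for n, v in points]
ys = [math.log(v) for n, v in points]
mx, my = sum(xs) / len(xs), sum(ys) / len(ys)
beta = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / sum((x - mx) ** 2 for x in xs)
k = math.exp(my - beta * mx)
print(f"k ≈ {k:.1f}, β ≈ {beta:.2f}")        # expect roughly 10 ≤ k ≤ 100 and β ≈ 0.5
```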
3. Document Parsing
Process:
1. Take a document
2. Turn it into a representation (like metadata)
a. Context-free: bag of words
b. word2vec: an ML algorithm that estimates the probability of seeing one word next to
another
For the query, on the other hand:
1. Take a query
2. Turn it into a representation (it has to be in the same form as the document's)
⇒ Compare them
⇒ Evaluate the comparisons (how related is this to the query)
⇒ Feedback (eg: based on how many times you clicked on something)
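A minimal sketch of this pipeline using a bag-of-words representation and a simple overlap score; the example documents and the scoring function are made up for illustration (real engines use better scoring functions):
```python
from collections import Counter

def bag_of_words(text):
    # Context-free representation: term -> count.
    return Counter(text.lower().split())

docs = {
    "d1": "tropical fish include fish found in tropical environments",
    "d2": "fishkeeping is a popular hobby",
}
query = bag_of_words("tropical fish")

# Query and documents are in the same form, so they can be compared directly.
for doc_id, text in docs.items():
    doc = bag_of_words(text)
    score = sum(min(query[t], doc[t]) for t in query)
    print(doc_id, score)            # d1 matches better than d2
```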
Basic inverted index construction:
● Token: a semantic unit
○ Instance of a word occurring in a doc. Example: Bush, bush, friends
○ Tokenizing: figuring out how to divide up the documents into fragments
● Text preprocessing:
○ Remove punctuations
○ Lowercase
● Inverted index: maps a word to the documents in which it appears (see the sketch below)
○ Term: entry in the dictionary (normalized). Example: bush, friend
○ Type: class of all tokens containing the same character sequence (Example: “to
be or not to be” has 6 tokens, 4 types)
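A minimal sketch of basic inverted index construction with the preprocessing above (lowercasing, stripping punctuation); the two example documents are made up:
```python
import re
from collections import defaultdict

docs = {
    1: "Bush visited his friends.",
    2: "The bush caught fire.",
}

index = defaultdict(set)
for doc_id, text in docs.items():
    for token in re.findall(r"[a-z0-9]+", text.lower()):   # tokenize + normalize
        index[token].add(doc_id)                            # term -> set of doc IDs

print(sorted(index["bush"]))    # [1, 2] -- "Bush" and "bush" map to the same term
```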
What’s in a document
● Multiple languages
● Multiple character sets/encodings: how do we go from bytes to characters?
● Different compression algorithms: documents may be zipped/compressed
● Recognition of the content and structure of text documents
○ Tokenizing/lexical analysis: recognizing each word occurrence in the sequence
of characters in a document
○ Metadata: document attributes (date/author); tags to identify document
components
Big challenge in querying: Vocabulary mismatch
● Word boundary variation
○ eusa is different from e-usa = usa = u.s.a = us of a
● Morphological variation
○ Policeman = policemen = police vs. policy = politics
● Spelling variation:
○ Gaddafi = Gadhafi = …
● Synonymy and Polysemy
○ apple vs. Apple vs. Big Apple
a) Tokenizing
Challenges:
● Small words that are important in queries may be disregarded
● Hyphenated/non-hyphenated forms of many words are common
● Special characters
● Capitalized words can carry meaning (e.g., Bush vs. bush), but tokenizing often lowercases everything
● Apostrophes (‘) that can be part of a word
● Periods
⇒ Can treat hyphens, apostrophes and periods as spaces
⇒ No single right way to do it
1. First pass: identifying markup or tags in the document (accommodate syntax errors in
the structure)
2. Second pass: identify structure in appropriate parts of the document
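A minimal sketch of one tokenizing policy from the notes (treat hyphens, apostrophes, and periods as spaces); this is just one of many reasonable choices:
```python
import re

def tokenize(text):
    text = text.lower()
    text = re.sub(r"[-'.]", " ", text)    # hyphens, apostrophes, periods -> spaces
    return text.split()

print(tokenize("U.S.A.'s on-line store"))   # ['u', 's', 'a', 's', 'on', 'line', 'store']
```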
b) Stopping
● Function words: words that have little meaning apart from other words
● Stopwords: function words in the context of information retrieval
○ We will throw them out → decrease index size, increase retrieval efficiency
○ Can be customized for specific parts of the document structure/fields (like
“click”, “here”, “privacy” are stopwords when processing anchor text)
⇒ Downside of just removing the stopwords: Cannot search on them!
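A minimal sketch of stopword removal and its downside; the tiny stopword list is made up (real lists are much longer and can be field-specific):
```python
STOPWORDS = {"the", "of", "to", "be", "or", "not", "a", "in"}

def remove_stopwords(tokens):
    return [t for t in tokens if t not in STOPWORDS]

print(remove_stopwords(["the", "tropical", "fish"]))             # ['tropical', 'fish']
print(remove_stopwords(["to", "be", "or", "not", "to", "be"]))   # [] -- the query is lost
```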
c) Stemming
Pros:
+ Smaller dictionary size
+ Increased recall (a larger share of the relevant documents is returned)
Cons:
- Exact quotes
- Decrease in specificity
- Decrease in precision (more *junk*)
● Component of text processing that captures the relationships between different
variations of a word
○ Reduce the different forms of a word to a common stem; the variations occur because of
■ Inflection (e.g., plurals, tenses) or
■ Derivation (making a verb into a noun by adding the suffix -ation)
→ For example, “swimming” and “swam” can be reduced to “swim”
● 2 types of stemmers:
○ Algorithmic: decide whether 2 words are related
■ Simplest: “s” signifies plurality
● False negatives: cannot detect a relationship (example: country
and countries)
● False positives: detect a relationship when there is none (example:
I and is)
⇒ Solution: consider more kinds of suffixes → fewer false
negatives, but more false positives (see the sketch after this list)
■ Porter stemmer:
● The most popular algorithmic stemmer
● Used in many information retrieval experiments and systems
since the 70s
○ Dictionary-based: store term relationships based on pre-created dictionaries of
related terms
■ Instead of computing the relationship between words, these just store
lists of related words in a large dictionary
■ Drawback: tend to be static + less flexible (while language evolves
constantly)
○ Combine the 2:
■ Use algorithmic to deal with new/unrecognized words
■ Dictionary is used to detect relationships between common words
■ Krovetz stemmer: use dictionary to check the validity of a word + if not
found, then check for suffixes → feed to the dictionary again
● Lower false-positive rate than Porter
● Higher false-negative rate
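A minimal sketch of the simplest algorithmic stemmer from the notes ("s" signifies plurality), kept deliberately naive to show the false negatives/positives; a real system would use a Porter or Krovetz stemmer instead:
```python
def naive_stem(word):
    # Strip a trailing "s" as a (very rough) plural marker.
    return word[:-1] if word.endswith("s") and len(word) > 1 else word

print(naive_stem("friends"))    # 'friend'   -- correct
print(naive_stem("countries"))  # 'countrie' -- misses 'country': false negative
print(naive_stem("is"))         # 'i'        -- collides with 'I': false positive
```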
d) Phrases and N-grams
● Phrase: a simple noun phrase (can be nouns, or adjectives followed by nouns)
○ POS tagging:
■ Too slow
■ Alternative: store word position information in the indexes → only identify
phrases when a query is processed
⇒ Testing word proximities at query time is likely to be too slow
⇒ Identify n words to be a unit during text processing
● n-gram: sequence of n words
○ Generated by choosing a particular value for n
○ Then move that window forward one unit (character/word) at a time
■ Example: “hello” contains the following bigrams: he, el, ll, lo
○ Indexes based on n-grams are larger than word indexes
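A minimal sketch of n-gram generation by sliding a window one unit at a time; it works for character n-grams (the "hello" example) and word n-grams alike:
```python
def ngrams(sequence, n):
    # Slide a window of length n forward one unit at a time.
    return [sequence[i:i + n] for i in range(len(sequence) - n + 1)]

print(ngrams("hello", 2))                          # ['he', 'el', 'll', 'lo']
print(ngrams("tropical fish are fun".split(), 2))  # [['tropical', 'fish'], ['fish', 'are'], ['are', 'fun']]
```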
4. Document Structure and Markup
● Document parsers use the structure of web pages (example, from HTML markup) for
ranking algorithm
○ Headings indicate that the phrase is particularly important
● XML: allows each application to define its own element types
○ Defined by a schema (like a database schema)
○ Therefore, its elements are more tied to the semantics of the data than HTML
elements
5. Link Analysis
● Links indicate relationships between pages and help with ranking them
● Anchor Text: 2-3 words describing the topic of the linked page
○ For example, links to www.ebay.com very often have “eBay” as their anchor text
○ This motivates a simple ranking algorithm: search through all links in the
collection, looking for anchor text that matches user’s query
● PageRank:
○ Naive approach: rank pages by popularity
■ e.g., count the number of inlinks (links pointing to a page) for each page
■ Susceptible to spam!
○ PageRank: Google search engine
■ A value associated with a web page
■ Help search engines sift through all pages containing the word “eBay” to
find the most popular one
■ The PageRank for a page u is the sum, over all pages v that link to u, of PR(v) divided by the
number of pages v links to: PR(u) = Σ PR(v) / L_v
○ If we take into account “distraction” (random) jumps on the site…
■ Random-surfer adjustment: with probability λ the surfer jumps to a random page, and with
probability 1 − λ follows a link on the current page → PR(u) = λ/N + (1 − λ) × Σ PR(v) / L_v (a small sketch follows this list)
○ Rank sinks: pages with no outbound links (accumulate PR but do not distribute it)
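A minimal sketch of iterative PageRank with the random-surfer adjustment; λ (here `lam`) is the probability of jumping to a random page, and the tiny link graph, including the rank sink D, is made up:
```python
def pagerank(links, lam=0.15, iterations=50):
    pages = list(links)
    n = len(pages)
    pr = {p: 1.0 / n for p in pages}
    for _ in range(iterations):
        new = {p: lam / n for p in pages}            # random jump ("teleport") share
        for p in pages:
            out = links[p]
            if out:                                   # distribute PR over outgoing links
                for q in out:
                    new[q] += (1 - lam) * pr[p] / len(out)
            else:                                     # rank sink: spread PR over all pages
                for q in pages:
                    new[q] += (1 - lam) * pr[p] / n
        pr = new
    return pr

links = {"A": ["B", "C"], "B": ["C"], "C": ["A"], "D": []}   # D has no outbound links
print(pagerank(links))
```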
● Link quality:
○ Hard to assess: web page designers may try to create useless links to improve
the PR → link spam
■ Solution: detect link spam in comment sections and ignore them during
indexing
● Easier solution: add rel="nofollow" to unimportant/untrusted links
6. Internationalization
● Monolingual search engine: search engine designed for a particular language
● Challenge:
○ Word segmentation: identifying word (index-term) boundaries in text, especially in languages written without spaces between words
■ Techniques: e.g., Hidden Markov Models
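The notes name Hidden Markov Models as the segmentation technique; as a much simpler illustration (not an HMM), this is greedy maximum matching against a made-up dictionary of known words:
```python
DICTIONARY = {"search", "engine", "engines"}

def max_match(text, dictionary, max_len=10):
    tokens, i = [], 0
    while i < len(text):
        for j in range(min(len(text), i + max_len), i, -1):   # try the longest match first
            if text[i:j] in dictionary:
                tokens.append(text[i:j])
                i = j
                break
        else:                                                  # no dictionary word: emit one character
            tokens.append(text[i])
            i += 1
    return tokens

print(max_match("searchengines", DICTIONARY))   # ['search', 'engines']
```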