Text Processing

Information Retrieval
Lecture 3

Lecture 3 Information Retrieval 1

Text Operations
Converting text to indexing terms
Goal: produce a set of indexing terms
that make the best use of resources

that will accurately match user query terms

Text Processing Steps
1. Lexical Analysis
2. Elimination of stopwords
3. Stemming
4. Selection of index terms
5. Building a thesaurus

Lexical Analysis
Converting byte stream to tokens
a.k.a. tokenization or lexing
Three ways to build your lexer
manually (in C or a scripting language)

use a generator such as lex or flex

use a special-purpose DFA generator

Handling of numbers and punctuation

should be tunable for the application

Lexing: Numbers and digits
Numbers need context
"deaths from car accidents in 1989"

{deaths, car, accidents, 1989}

{1989} could retrieve many irrelevant docs

However...
numbers do appear in user queries

rest of terms can give context

might be helped by using phrases
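A minimal Python sketch of a tunable tokenizer, assuming tokens are plain runs of ASCII letters or digits. The name `tokenize` and the `keep_numbers` and `lowercase` switches are illustrative and not from the lecture; they only show how number handling could be left to the application. Stopword removal is a later step, so "from" and "in" still appear here.

```python
import re

def tokenize(text, keep_numbers=True, lowercase=True):
    """Split text into word and number tokens; punctuation is dropped."""
    if lowercase:
        text = text.lower()
    # A token is a run of ASCII letters or a run of digits (an assumption).
    tokens = re.findall(r"[A-Za-z]+|[0-9]+", text)
    if not keep_numbers:
        tokens = [t for t in tokens if not t.isdigit()]
    return tokens

query = "deaths from car accidents in 1989"
print(tokenize(query))                      # numbers kept as index terms
print(tokenize(query, keep_numbers=False))  # numbers dropped
```

Whether numbers should be kept is exactly the application-level decision the slide describes, which is why it is a parameter rather than a fixed rule.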

Lexing: Hyphens
Keep them?
query might use a non-hyphenated variant

end-of-line hyphens are noise

Throw them out?

can’t recognize a hyphenated term in a query

Two advanced solutions

index as phrase but allow partial matches

use proximity information
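One possible way to handle both concerns can be sketched in Python. Both function names are hypothetical. The first rejoins words broken by an end-of-line hyphen; it will also wrongly join a genuine hyphen that happens to fall at a line end. The second approximates "index as phrase but allow partial matches" by emitting the parts plus the joined form, which is an assumption about how such an index might be fed, not the lecture's method.

```python
import re

def join_line_end_hyphens(text):
    """Rejoin words split by a hyphen at the end of a line."""
    return re.sub(r"(\w)-\s*\n\s*(\w)", r"\1\2", text)

def hyphen_variants(token):
    """Return the index terms emitted for one token.

    A hyphenated token yields its parts plus the joined form, so a query
    using either the parts or the non-hyphenated variant can still match.
    """
    if "-" not in token:
        return [token]
    parts = [p for p in token.split("-") if p]
    return parts + ["".join(parts)]

print(join_line_end_hyphens("infor-\nmation retrieval"))
print(hyphen_variants("on-line"))
```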

Lexing: Punctuation
Obvious: segment on punctuation
But (like hyphens) punctuation can appear inside a single term:
"B.C.", "B.S.": without periods, these are just single letters
URLs as index terms?

Idea: look at surrounding characters

whitespace? end of sentence

not whitespace? abbreviation
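The surrounding-character idea can be sketched in a few lines of Python. The function name and the two labels are illustrative assumptions. The example also shows the limit of the heuristic: the final period of "B.S." is followed by a space, so it is labeled as a sentence end.

```python
def classify_periods(text):
    """Label each '.' in text as 'sentence_end' or 'abbreviation'.

    A period followed by whitespace (or by the end of the text) is taken
    to end a sentence; a period followed by any other character is taken
    to be part of an abbreviation.
    """
    labels = []
    for i, ch in enumerate(text):
        if ch != ".":
            continue
        nxt = text[i + 1] if i + 1 < len(text) else " "
        labels.append((i, "sentence_end" if nxt.isspace() else "abbreviation"))
    return labels

for position, label in classify_periods("He has a B.S. degree."):
    print(position, label)
```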

Lexing: Markup
Nowadays, everything has markup
SGML, HTML, XML...

This information can be useful or not...

Some alternatives:
emit text appearing inside all or some tags

emit tags as tokens which can be interpreted by the indexer.
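As a sketch of the first alternative, the following uses Python's standard `html.parser` module to emit only the text found inside a chosen set of tags. The class name `TagTextExtractor`, the helper `text_in_tags`, and the choice of tags are assumptions made for illustration.

```python
from html.parser import HTMLParser

class TagTextExtractor(HTMLParser):
    """Collect text that appears inside a chosen set of tags."""

    def __init__(self, wanted_tags):
        super().__init__()
        self.wanted = set(wanted_tags)
        self.depth = 0      # number of wanted tags currently open
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in self.wanted:
            self.depth += 1

    def handle_endtag(self, tag):
        if tag in self.wanted and self.depth > 0:
            self.depth -= 1

    def handle_data(self, data):
        # Keep text only while at least one wanted tag is open.
        if self.depth > 0 and data.strip():
            self.chunks.append(data.strip())

def text_in_tags(html, wanted_tags):
    parser = TagTextExtractor(wanted_tags)
    parser.feed(html)
    parser.close()
    return parser.chunks

page = ("<html><head><title>IR Lecture</title><script>var x = 1;</script>"
        "</head><body><p>Text processing</p></body></html>")
print(text_in_tags(page, ["title", "p"]))
```

The script body is skipped because `script` is not among the wanted tags; the extracted chunks would then go to the tokenizer.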

Writing a lexer by hand
while ((c = getchar()) != EOF){
if (isalpha(c)) { …
Very fast! but
Error-prone

Hard to make it flexible or modular

Alternative: use a scripting language

Easier to describe text patterns

But can be hard to maintain
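The C fragment on this slide is elided in the source. As a rough illustration of the same character-at-a-time loop in a scripting language, here is a Python sketch; the function name `lex` and the decision to keep only alphabetic tokens are assumptions.

```python
def lex(chars):
    """Yield lowercase alphabetic tokens from an iterable of characters."""
    current = []
    for c in chars:
        if c.isalpha():
            current.append(c.lower())
        elif current:
            # A non-letter ends the token that was being built.
            yield "".join(current)
            current = []
    if current:  # flush the last token at end of input
        yield "".join(current)

print(list(lex("Text  processing, step 1!")))
```

Every rule about what counts as a token is buried in the control flow, which is the flexibility problem the slide points out.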

Using a DFA generator
Generalization of the hand-written lexer
Define a state machine
transitions occur on different character input

states define possible next steps

write a table, not a procedure

Program generates the lexer

Easier to maintain and debug!
(Frakes & Baeza-Yates ’92 have code)
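To make "write a table, not a procedure" concrete, here is a small table-driven lexer in Python. It is a hand-built illustration, not the Frakes & Baeza-Yates code and not the output of any generator; the state names, character classes, and transition choices are all assumptions.

```python
def char_class(c):
    """Map a character to the class used to index the table."""
    if c.isalpha():
        return "letter"
    if c.isdigit():
        return "digit"
    return "other"

# TRANSITIONS[state][character class] -> next state
TRANSITIONS = {
    "start":  {"letter": "word", "digit": "number", "other": "start"},
    "word":   {"letter": "word", "digit": "word",   "other": "start"},
    "number": {"letter": "word", "digit": "number", "other": "start"},
}

def run_dfa(text):
    """Return (token type, token text) pairs by walking the table."""
    state, buf, tokens = "start", [], []
    for c in text + " ":  # the trailing space flushes the last token
        nxt = TRANSITIONS[state][char_class(c)]
        if nxt != state and buf:
            # Leaving a state ends the token collected in that state.
            tokens.append((state, "".join(buf)))
            buf = []
        if nxt != "start":
            buf.append(c)
        state = nxt
    return tokens

print(run_dfa("In 1989, 42 cars"))
```

Changing how digits inside words are treated means editing one table entry rather than rewriting the loop.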

Stop Words
the, of, and, a, in, to, is, for, with, are
take up a lot of space

retrieve all documents

don’t relate to information need

It’s easy to index something that appears everywhere
Removing stopwords can cause problems:
"to be or not to be" → {be}

"C" as a stop word would be trouble for a computer programming index!

Removing Stop Words
Start with a list of stop words
Table lookup
Make a table out of a static stoplist
Match each token against the table
Hashes, perfect hashing, tries
Build into the lexical analyzer (see F&BY)
Or take a statistical approach
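Both options can be sketched briefly in Python. The stoplist reuses the ten words shown on the earlier slide; a Python `frozenset` stands in for the hash table. The statistical variant here simply picks terms that occur in a large fraction of documents, and the `min_fraction` threshold is an arbitrary assumption rather than a recommended value.

```python
from collections import Counter

STOPLIST = frozenset(
    ["the", "of", "and", "a", "in", "to", "is", "for", "with", "are"]
)

def remove_stopwords(tokens, stoplist=STOPLIST):
    """Table lookup: drop every token found in the hashed stoplist."""
    return [t for t in tokens if t not in stoplist]

def frequent_terms(docs, min_fraction=0.8):
    """Statistical approach: terms occurring in at least min_fraction of docs."""
    df = Counter()
    for doc in docs:
        df.update(set(doc))  # count each term once per document
    cutoff = min_fraction * len(docs)
    return {term for term, n in df.items() if n >= cutoff}

print(remove_stopwords("the cat is in the hat".split()))
docs = [["the", "cat"], ["the", "dog"], ["the", "cat", "sat"]]
print(frequent_terms(docs))
```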

Stemming
Reduce variant word forms to a single
"stem" form
-'s, -ing, -ed, -s; in-, ad-, pre-, sub-, ...
Four approaches
table lookup - use a dictionary
successor variety - fancy suffix removal
affix removal - cut prefixes and suffixes
character n-grams (not really stemming)

Porter’s algorithm (1980)
Removes suffixes in five stages
Only one rule in each stage fires
Each depends on a suffix and the stem measure m
[C](VC)^m[V]

Stage 1a and b
Rule               Example
SSES -> SS         caresses -> caress
IES -> I           ponies -> poni, ties -> ti
SS -> SS           caress -> caress
S -> ø             cats -> cat
(m>0) EED -> EE    feed -> feed, agreed -> agree
(*v*) ED ->        plastered -> plaster
(*v*) ING ->       motoring -> motor
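The measure m and the step 1a rules are small enough to sketch in Python. This is a partial illustration only: it covers step 1a and the single (m>0) EED rule from step 1b, and leaves out the ED and ING rules along with every later stage. The function names are assumptions.

```python
import re

def measure(stem):
    """Return m, the number of VC sequences in the form [C](VC)^m[V]."""
    pattern = []
    for i, ch in enumerate(stem):
        if ch in "aeiou":
            pattern.append("v")
        elif ch == "y" and i > 0 and pattern[i - 1] == "c":
            pattern.append("v")  # 'y' after a consonant acts as a vowel
        else:
            pattern.append("c")
    # Each run of vowels followed by a run of consonants is one VC.
    return len(re.findall(r"v+c+", "".join(pattern)))

def step_1a(word):
    """Apply the first matching step 1a rule, longest suffix first."""
    if word.endswith("sses"):
        return word[:-2]  # SSES -> SS
    if word.endswith("ies"):
        return word[:-2]  # IES -> I
    if word.endswith("ss"):
        return word       # SS -> SS
    if word.endswith("s"):
        return word[:-1]  # S -> (nothing)
    return word

def step_1b_eed(word):
    """Apply only the (m>0) EED -> EE rule of step 1b."""
    if word.endswith("eed") and measure(word[:-3]) > 0:
        return word[:-1]
    return word

for w in ["caresses", "ponies", "ties", "caress", "cats"]:
    print(w, "->", step_1a(w))
for w in ["feed", "agreed"]:
    print(w, "->", step_1b_eed(w))
```

"feed" is left alone because its stem "f" has m = 0, while "agreed" has the stem "agr" with m = 1.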

Porter Errors (Krovetz 93)
Too eager               Too cautious
organization/organ      european/europe
doing/doe               matrices/matrix
policy/police           create/creation
university/universe     machine/machinery
negligible/negligent    explain/explanation
arm/army                resolve/resolution
past/paste              triangle/triangular

Stems and roots
Stemmers are language specific
See the Snowball project
http://snowball.sourceforge.net/
for stemmers in other languages
Morphological analysis
reducing words to their linguistic roots
requires more sophisticated processing
Think about how this can affect the query

Character n-grams
Slide an n-character window through text
No stemming or stoplisting
May need to consider punctuation and
hyphens
Redundant tokens: good for noisy text
Less effective than word (stem) pairs in
clean text
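A sliding window is a one-liner in Python. The sketch below, with the assumed name `char_ngrams`, also shows why the redundancy helps with noisy text: a misspelled word still shares some of its n-grams with the correct form.

```python
def char_ngrams(text, n=3):
    """Return every n-character window of text, in order."""
    return [text[i:i + n] for i in range(len(text) - n + 1)]

print(char_ngrams("retrieval", 4))

# A transposition error still leaves some trigrams in common.
shared = set(char_ngrams("retrieval")) & set(char_ngrams("retreival"))
print(sorted(shared))
```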

Term Selection
Individual words
Adjacent word pairs (word n-grams)
Noun phrases
requires more sophisticated NLP
identify nouns along with adjectives and
adverbs in the same phrase
"computer science" and "world-wide web"

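The first two options need no linguistic processing and can be sketched directly in Python. The function name `index_terms` is an assumption; noun-phrase selection is not attempted here because it needs a part-of-speech tagger.

```python
def index_terms(tokens):
    """Emit individual words followed by adjacent word pairs."""
    pairs = [" ".join(pair) for pair in zip(tokens, tokens[1:])]
    return list(tokens) + pairs

print(index_terms(["computer", "science", "course"]))
```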

The Case for Complexity
User queries are typically only one or two words
The bag-of-words approach is too
simplistic given short queries
Using phrases, sophisticated handling for
numbers, etc. boosts the quality of that
first list of documents.

The Case for Simplicity
Query throughput is as important as (or more important than?)
quality responses
Disk is cheap
Complex processing takes too long
Easy to make a wrong decision
Feedback will improve the results

Simple or Complex?
Can look at it on two levels:
Does more sophisticated term
processing improve retrieval results?
... or ...
Does it enable a more sophisticated
interface for the user?

Designing with Filters
The UNIX philosophy: "do one thing and
do it well."
Filters read text input and produce text
output
can be linked together in pipes
can be simple (cut, nl) or complex (awk, perl)
Lexers are filters
You can have several in your toolbox
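The same design can be imitated inside one program with Python generators, each doing one small job and feeding the next, much as a shell pipe would. The filter names and the tiny list of dropped words are assumptions for illustration.

```python
def lowercase(lines):
    """Filter 1: lowercase every line."""
    for line in lines:
        yield line.lower()

def split_words(lines):
    """Filter 2: break lines into whitespace-separated words."""
    for line in lines:
        yield from line.split()

def drop(words, unwanted):
    """Filter 3: discard words found in the unwanted set."""
    for w in words:
        if w not in unwanted:
            yield w

def pipeline(lines):
    """Chain the filters: lowercase | split_words | drop."""
    return drop(split_words(lowercase(lines)), {"the", "of", "and"})

print(list(pipeline(["The UNIX philosophy\n", "of filters and pipes\n"])))
```

Passing `sys.stdin` to `pipeline` and printing each result would turn this into a conventional stdin-to-stdout filter.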
