Text-Based (Image) Retrieval: Henning Müller HES SO//Valais Sierre, Switzerland

This document discusses techniques for text-based retrieval of images and documents. It covers topics like stemming, stop word removal, term weighting using TF-IDF, and using tools like Lucene for text retrieval. Challenges discussed include multi-lingual retrieval, query expansion, relevance feedback, and ensuring diversity in results. The document provides an overview of fundamental concepts and techniques in text retrieval that are also applied to image retrieval.


Business Information

Systems

Text-based (image) retrieval

Henning Müller
HES SO//Valais
Sierre, Switzerland

Overview

•  Difference between words and features

–  Weightings instead of distance measures
•  Stemming and pre-treatment
•  Approaches for multilingual retrieval
•  Tools available on the web
–  Lucene, …

Text retrieval (of images)

•  Started in the early 1960s … for images in the 1970s

•  Not the main focus of this talk
•  Text retrieval is old!!
–  Many techniques in image retrieval are taken from
this domain (sometimes reinvented)
•  It becomes clear that the combination of visual
and textual retrieval has the biggest potential
–  Good text retrieval engines exist as open source

Problems with annotation (of images)

•  Many things are hard to express
–  Feelings, situations, … (what is scary?)
–  What is in the image, what is it about, what does
it evoke?
•  Annotation is never complete
–  Plus it depends on the goal of the annotation
•  Many ways to say the same thing …
–  Synonyms, hyponyms, hypernyms, …
•  Mistakes
–  Spelling errors, spelling differences (US vs. UK),
weird abbreviations (particularly medical …)

Basics in text retrieval

•  Started with boolean search of words in text
–  In combination with AND, OR, NOT
–  No ranking, rather finite list of corresponding
documents
•  Vector space model gives a distance between
search terms and documents
–  Each occurring word is a dimension, so differences
in word frequency can be measured
–  The overall frequency of a word sets the importance
of its axis
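
A minimal sketch of the vector space idea, assuming whitespace tokenization, raw term frequencies and cosine similarity as the measure. The slides do not name a specific measure, and the function names are illustrative only.

```python
import math
from collections import Counter


def term_vector(text):
    """Bag-of-words vector: one dimension per word, value = raw frequency."""
    return Counter(text.lower().split())


def cosine_similarity(vec_a, vec_b):
    """Cosine of the angle between two sparse term vectors."""
    dot = sum(count * vec_b[term] for term, count in vec_a.items())
    norm_a = math.sqrt(sum(c * c for c in vec_a.values()))
    norm_b = math.sqrt(sum(c * c for c in vec_b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def rank(query, documents):
    """Return (score, document) pairs sorted from most to least similar."""
    q = term_vector(query)
    scored = [(cosine_similarity(q, term_vector(d)), d) for d in documents]
    return sorted(scored, key=lambda pair: pair[0], reverse=True)
```

Unlike boolean search, `rank` returns every document with a score, so the result list is ordered rather than a flat set of matches.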

Zipf distribution (wikipedia example)

•  X axis: rank of the word
•  Y axis: number of occurrences of the word
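
The rank/occurrence axes described above can be reproduced for any text with a short sketch. It assumes whitespace tokenization; `zipf_expected` uses the common idealization that frequency is roughly proportional to 1/rank, which real collections follow only approximately.

```python
from collections import Counter


def rank_frequency(text):
    """Return (rank, word, occurrences) triples, most frequent word first."""
    counts = Counter(text.lower().split())
    return [
        (rank, word, n)
        for rank, (word, n) in enumerate(counts.most_common(), start=1)
    ]


def zipf_expected(top_count, rank):
    """Idealized Zipf prediction: the word at a given rank occurs top_count / rank times."""
    return top_count / rank
```

Plotting rank against occurrences on log-log axes should give a roughly straight line for a large enough collection.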

Principal ideas used in text IR

•  Word frequencies basically follow a Zipf distribution

•  Tf/idf weightings
–  A word frequent in a document describes it well
–  A word rare in a collection has a high
discriminative power
–  Many variations of tf/idf (see also Salton/Buckley
paper)
•  Use of inverted files for quick query responses
–  Relevance feedback, query expansion, …
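
A minimal sketch of tf/idf scoring on top of an inverted file. It assumes whitespace tokenization and the plain variant tf × log(N/df); the Salton/Buckley paper discusses many other variants. All function names are illustrative.

```python
import math
from collections import Counter, defaultdict


def build_inverted_index(documents):
    """Map each term to {doc_id: term frequency in that document}."""
    index = defaultdict(dict)
    for doc_id, text in enumerate(documents):
        for term, tf in Counter(text.lower().split()).items():
            index[term][doc_id] = tf
    return index


def idf(term, index, n_docs):
    """Inverse document frequency log(N / df); 0.0 for unseen terms."""
    df = len(index.get(term, {}))
    return math.log(n_docs / df) if df else 0.0


def search(query, index, n_docs):
    """Score only the documents listed in the postings of the query terms."""
    scores = defaultdict(float)
    for term in query.lower().split():
        weight = idf(term, index, n_docs)
        for doc_id, tf in index.get(term, {}).items():
            scores[doc_id] += tf * weight
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)
```

The inverted file is what makes the query quick: only documents that contain at least one query term are ever touched.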

Techniques used in text retrieval

•  Bag of words approach
–  Or N-grams can be used
•  Stop words can be removed
•  Stemming can improve results
•  Named entity recognition
•  Spelling correction (also umlauts, accents, …)
–  Google had a big success with this
•  Mapping of text to a controlled vocabulary/
ontology
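
A short sketch of the bag-of-words and N-gram representations listed above. It assumes word-level N-grams and whitespace tokenization; character N-grams would work the same way on the characters of the text.

```python
from collections import Counter


def word_ngrams(text, n=1):
    """Return all runs of n consecutive words (n=1 gives single words)."""
    tokens = text.lower().split()
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def bag(text, n=1):
    """Bag of N-grams: order is discarded, only the counts are kept."""
    return Counter(word_ngrams(text, n))
```

For example, `word_ngrams("chest x-ray image", 2)` keeps the pair "chest x-ray" together, which a plain bag of words would split.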

Stop word removal

•  Very frequent words contain little information and
can be removed
–  Automatically in Google et al.
•  These words depend on the language
–  Stop word lists exist in many languages
•  Stop words often make up 40-50% of texts
–  Lists also contain less frequent words that carry
no information
•  Or simply remove words above a certain
frequency
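
Both options above can be sketched in a few lines: a fixed list and a frequency cut-off. The stop word list is a tiny illustrative sample, not a real one, and the 0.8 document-ratio threshold is an arbitrary assumption.

```python
from collections import Counter

# Tiny illustrative list; real lists are language-specific and much longer.
STOP_WORDS = {"a", "an", "and", "in", "is", "of", "the", "to"}


def remove_stop_words(tokens, stop_words=STOP_WORDS):
    """Drop every token that appears in the stop word list."""
    return [t for t in tokens if t.lower() not in stop_words]


def frequent_terms(documents, max_doc_ratio=0.8):
    """Terms occurring in more than max_doc_ratio of the documents."""
    doc_freq = Counter()
    for text in documents:
        doc_freq.update(set(text.lower().split()))
    limit = max_doc_ratio * len(documents)
    return {term for term, df in doc_freq.items() if df > limit}
```

The output of `frequent_terms` can be passed as `stop_words` to `remove_stop_words`, which avoids needing a hand-made list per language.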

Stemming - conflation

•  Strongly dependent on the language

•  Basically suffix stripping based on a set of rules
–  cats, catty, catlike → cat as root or stem
•  Can also create errors or slightly change
meaning (errors often reported around ~5%)
•  Porter stemmer for English is one of the
best-known algorithms with a free
implementation
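
A toy suffix stripper for the cat example above. This is not the Porter algorithm: the three rules and the minimum stem length are assumptions chosen only to show the principle and the kind of errors it produces.

```python
# Toy rules (suffix, replacement), checked in order; NOT the Porter rule set.
SUFFIX_RULES = [("like", ""), ("ty", ""), ("s", "")]


def stem(word, rules=SUFFIX_RULES, min_stem_len=3):
    """Strip the first matching suffix if enough of the word remains."""
    word = word.lower()
    for suffix, replacement in rules:
        if word.endswith(suffix) and len(word) - len(suffix) >= min_stem_len:
            return word[: len(word) - len(suffix)] + replacement
    return word
```

With these rules "cats", "catty" and "catlike" all conflate to "cat", but "party" becomes "par", which illustrates how rule-based stemming can create errors.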

Synonymy, polysemy

•  Synonymy
–  Several words can say the same thing: car,
automobile
•  Polysemy
–  The same word can have several meanings
•  Latent Semantic Indexing (LSI)
–  Word co-occurrences in the entire collection
–  Can reduce effects of synonyms
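
The sketch below only counts word co-occurrences per document, which is the raw signal LSI builds on. It is not LSI itself, which additionally applies a singular value decomposition to the term-document matrix. Whitespace tokenization is assumed.

```python
from collections import Counter
from itertools import combinations


def cooccurrence_counts(documents):
    """Count, for each alphabetically ordered word pair, the documents containing both."""
    counts = Counter()
    for text in documents:
        terms = sorted(set(text.lower().split()))
        counts.update(combinations(terms, 2))
    return counts
```

In a collection where "car" and "automobile" never appear together but both co-occur with "engine", the shared neighbours hint that the two words are related.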

Query expansion vs. relevance feedback

•  Most queries contain only very few keywords

•  Add keywords to expand the original query
–  Can be automatic or manual
–  Semantically similar words, synonyms,
discriminative words
•  Often used in a similar way as relevance
feedback but not with entire documents
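
Automatic query expansion can be sketched with a lookup table. The `SYNONYMS` table is a hypothetical stand-in; a real system would draw the extra terms from a thesaurus, a terminology or collection statistics.

```python
# Hypothetical synonym table, for illustration only.
SYNONYMS = {"car": ["automobile", "auto"], "x-ray": ["radiograph"]}


def expand_query(query, synonyms=SYNONYMS):
    """Append known synonyms to the query terms, without duplicates."""
    terms = query.lower().split()
    expanded = list(terms)
    for term in terms:
        for extra in synonyms.get(term, []):
            if extra not in expanded:
                expanded.append(extra)
    return expanded
```

The original terms stay at the front of the list, so a retrieval engine could still weight them higher than the added ones.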

Medical terminologies

•  MeSH, UMLS are frequently used

–  Mapping of free text to terminologies
•  Quality of the first few mapped terms is very high
–  Links between items can be used
•  Hyponyms, hypernyms, …
–  Several axes exist (anatomy, pathology, …)
•  This can be used for making a query more
discriminative
•  This can also be used for multilingual retrieval

WordNet
•  Hierarchy, links, definitions for the English language
–  Maintained at Princeton
•  Car, auto, automobile, machine, motorcar
–  motor vehicle, automotive vehicle
•  vehicle
–  conveyance, transport
»  instrumentality, instrumentation
»  artifact, artefact
»  object, physical object
»  entity, something
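
The hierarchy above can be walked upwards to generalize a term. The sketch hard-codes the entries from the slide in a small dictionary and assumes that the flattened levels form a single chain; it does not use the real WordNet database or any WordNet API.

```python
# Hand-coded from the slide (first synonym of each level); not the real WordNet data.
HYPERNYM = {
    "car": "motor vehicle",
    "motor vehicle": "vehicle",
    "vehicle": "conveyance",
    "conveyance": "instrumentality",
    "instrumentality": "artifact",
    "artifact": "object",
    "object": "entity",
}


def hypernym_chain(term, hypernym=HYPERNYM):
    """Follow hypernym links from a term up to the root of the hierarchy."""
    chain = [term]
    while chain[-1] in hypernym:
        chain.append(hypernym[chain[-1]])
    return chain


def is_a(term, ancestor, hypernym=HYPERNYM):
    """True if ancestor appears above term in the hierarchy."""
    return ancestor in hypernym_chain(term, hypernym)[1:]
```

A query for "vehicle" could then also match documents annotated with "car", since `is_a("car", "vehicle")` holds.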

Apache Lucene

•  Open source text retrieval system

–  Written in Java
•  Several tools available
–  Easy to use
•  Used in many research projects and in industry
•  Image retrieval plugin exists
–  LIRE (Lucene Image REtrieval)
–  Using simple MPEG-7 visual features

Multilingual retrieval

•  Many collections are inherently multilingual

–  Web, Flickr, medical teaching files, …
•  Translation resources exist on the web
–  TrebleCLEF has a survey of such resources in
progress
–  Translate query into document language
–  Translate documents into query language
–  Map documents and queries onto a common
terminology of concepts
•  We understand documents in other languages
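
The third strategy above, mapping documents and queries onto a common terminology, can be sketched with a lookup table. The concept identifiers and the word list are hypothetical; a real system would map to a terminology such as MeSH or UMLS.

```python
# Hypothetical concept table: words from three languages mapped to made-up concept IDs.
CONCEPTS = {
    "lung": "C_LUNG", "lunge": "C_LUNG", "poumon": "C_LUNG",
    "heart": "C_HEART", "herz": "C_HEART", "coeur": "C_HEART",
}


def to_concepts(text, concepts=CONCEPTS):
    """Replace words by language-independent concept IDs; unknown words are dropped."""
    return {concepts[t] for t in text.lower().split() if t in concepts}


def matches(query, document):
    """True if query and document share at least one concept."""
    return bool(to_concepts(query) & to_concepts(document))
```

Here a French query "poumon" matches a German document mentioning "Lunge" without either text being translated.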

Cross Language Evaluation Forum (CLEF)

•  Forum to compare multilingual retrieval in a
variety of domains
–  GeoCLEF
–  QA CLEF
–  Domain-specific CLEF
–  …
•  Proceedings are a very good start for multilingual
techniques

Challenges in multi-linguality

•  Difficulty varies strongly between language pairs
–  Languages from the same family are easier for
multilingual retrieval
•  Resources available depend strongly on the
languages used
–  English has many resources; German, Spanish
and French have quite a few, but rare languages
have very few

Multilingual tools

•  Many translation tools are accessible on the web
–  Yahoo! Babel Fish
–  www.reverso.net
–  Google translate
•  Named entity recognition
•  Word-sense disambiguation

Current challenges in text retrieval

•  Many are taken from the WWW or linked to it
•  Analysis of link structures to obtain information
on potential relevance
–  Also in companies, social platforms, …
•  Question of diversity in results
–  You do not want to have the same result show
up ten times at the top
•  Retrieval in context (domain specific)
•  Question answering
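
One simple way to address the diversity question above is to skip results that are near-duplicates of ones already shown. The sketch assumes Jaccard overlap of word sets as the similarity and an arbitrary 0.5 threshold; the slides do not prescribe a method.

```python
def jaccard(a, b):
    """Word-set overlap between two texts, from 0.0 (disjoint) to 1.0 (identical)."""
    set_a, set_b = set(a.lower().split()), set(b.lower().split())
    union = set_a | set_b
    return len(set_a & set_b) / len(union) if union else 0.0


def diversify(ranked_results, max_similarity=0.5):
    """Keep a result only if it is not too similar to any result kept before it."""
    kept = []
    for result in ranked_results:
        if all(jaccard(result, other) <= max_similarity for other in kept):
            kept.append(result)
    return kept
```

Because the input is processed in rank order, the best-ranked member of each group of near-duplicates is the one that survives.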

Diversity

Conclusions
•  Text retrieval is the basis of image retrieval
–  Many techniques come from this domain
•  Text has more semantics than visual features
–  But other problems as well
•  Text and image features combined have the best
chances for success
–  Use text wherever available
•  Multilinguality is an important issue as most of
the web is very multilingual
–  And also a part of research

References
•  G. Salton and C. Buckley, Term weighting approaches in automatic text retrieval, Information Processing and
Management, 24(5):513-523, 1988.
•  K. Sparck Jones and C. J. Van Rijsbergen, Progress in documentation, Journal of Documentation, 32:59-75, 1976.
•  J. J. Rocchio, Relevance feedback in information retrieval, The SMART Retrieval System, Experiments in Automatic
Document Processing, pages 313-323.
•  M. Braschler and C. Peters, Cross-Language Evaluation Forum: Objectives, Results, Achievements, Information Retrieval,
2004.
•  J. Gobeill, H. Müller and P. Ruch, Translation by Text Categorization: Medical Image Retrieval in ImageCLEFmed 2006,
Springer Lecture Notes in Computer Science (LNCS 4730), pages 706-710, 2007.
