Chapter - 6 - Searching and Indexing

Chapter 5 discusses the concepts of searching and indexing, emphasizing the importance of indexing for efficient text search. It introduces Apache Lucene as a powerful, open-source search engine library that serves as the core for various applications, including Elasticsearch. The chapter also explains the structure and functionality of inverted indexes and full-text search techniques used in search applications.

Uploaded by

Tek singh Ayer

Chapter 5

Searching and Indexing

Bal Krishna Nyaupane

Assistant Professor
Department of Electronics and Computer Engineering
Institute of Engineering, Tribhuvan University
bkn@wrc.edu.np
2 Indexing
▪ To search large amounts of text quickly, one must
first index that text and convert it into a format that
will let one search it rapidly, eliminating the
slow sequential scanning process. This conversion
process is called indexing, and its output is called
an index.
▪ Indexing is the initial part of all search applications.
▪ The goal of indexing is to process the original data into
a highly efficient cross-reference lookup in order
to facilitate rapid searching.
▪ The job is simple when the content is already
textual in nature and its location is known.
3 Searching

▪ Searching is the process of looking up words in
an index to find the documents where they appear.
▪ A search consults the index instead of scanning the text.
4 What is Lucene?
▪ Apache Lucene is a free and open-source search engine
software library, originally written in Java by Doug Cutting.
▪ It is supported by the Apache Software Foundation and is
released under the Apache License. Lucene is widely
used as a standard foundation for non-research search applications.
▪ Lucene has been ported to other programming languages,
including Object Pascal, Perl, C#, C++, Python, Ruby and PHP.
▪ Lucene is the search core of both Apache Solr™ and
Elasticsearch™.
▪ Lucene Core is a Java library providing powerful indexing and
search features, as well as spellchecking, hit highlighting and
advanced analysis/tokenization capabilities.
5 What is Lucene?
▪ Lucene is a high-performance, scalable information retrieval
(IR) library.
▪ Lucene lets you add searching capabilities to your applications.
It’s a mature, free, open-source project implemented in Java, and a
project of the Apache Software Foundation.
▪ Lucene’s website, at http://lucene.apache.org/java, is a great
place to learn more about the current status of Lucene.
▪ There you’ll find the tutorial, Javadocs for Lucene’s API for all recent
releases, an issue-tracking system, links for downloading releases,
and Lucene’s wiki (http://wiki.apache.org/lucene-java), which
contains many community-created and -maintained pages.
6 What is Lucene?
7 Who uses Lucene?
8 Lucene Architecture
9 Typical Components of Search Application
10 Typical Components of Search Application
▪ A common misconception is that Lucene is an entire search
application, when in fact it’s simply the core indexing and
searching component.
▪ A search application starts with an indexing chain,
which in turn requires separate steps to retrieve the raw content;
create documents from the content, possibly extracting text
from binary documents; and index the documents.
▪ Once the index is built, the components required for searching
are equally diverse, including a user interface, a means for
building up a programmatic query, query execution (to
retrieve matching documents), and results rendering.
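The indexing chain and the search-side components described above can be sketched end to end in a few lines. This is only an illustrative toy in plain Python, not Lucene code: the function names, the two sample texts, and the "every term must match" rule are all assumptions made for the example.

```python
import re
from collections import defaultdict

def acquire_content():
    # Stand-in for a crawler or file reader; the texts are made-up samples.
    return {
        "a.txt": "Lucene is a search library",
        "b.txt": "Elasticsearch is built on Lucene",
    }

def build_document(name, raw):
    # A document here is just a dict of named fields.
    return {"id": name, "body": raw}

def analyze(text):
    # Lowercase the text and split it into alphanumeric tokens.
    return re.findall(r"[a-z0-9]+", text.lower())

def index_documents(docs):
    # token -> set of ids of the documents that contain it
    index = defaultdict(set)
    for doc in docs:
        for token in analyze(doc["body"]):
            index[token].add(doc["id"])
    return index

def build_query(user_input):
    # Analyze the query the same way as the documents.
    return analyze(user_input)

def run_query(index, terms):
    # Return the documents that contain every query term.
    if not terms:
        return set()
    return set.intersection(*(index.get(t, set()) for t in terms))

def render_results(hits):
    return [f"{i}. {doc_id}" for i, doc_id in enumerate(sorted(hits), start=1)]

docs = [build_document(name, raw) for name, raw in acquire_content().items()]
index = index_documents(docs)
hits = run_query(index, build_query("Lucene search"))
print("\n".join(render_results(hits)))  # prints "1. a.txt"
```

Only a.txt matches, because the analyzer keeps "elasticsearch" as a single token, so b.txt never contains the token "search". In a real application Lucene would take over the analyze, index, and run-query steps; acquisition, document building, the user interface, and rendering stay with the application.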
11 How does a Search Application work?
12 How does a Search Application work?
13 How does a Search Application work?
14 Basic Application
15 Inverted Index
▪ An inverted index is an index data structure storing a mapping
from content, such as words or numbers, to its locations in a
document or a set of documents.
▪ The purpose of an inverted index is to allow fast full-text
searches, at the cost of increased processing when a document is
added to the database.
▪ The inverted file may be the database file itself, rather than its
index. It is the most popular data structure used in document
retrieval systems and is used on a large scale, for example, in search
engines.
▪ Additionally, several significant general-purpose mainframe-based
database management systems have used inverted list architectures.
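As a minimal sketch of the mapping described above, the class below records, for each word, the documents it appears in and the positions inside each document. The class name, the tokenizer, and the sample sentences are assumptions for illustration; production indexes add compression, on-disk storage, and much more.

```python
import re
from collections import defaultdict

class InvertedIndex:
    def __init__(self):
        # term -> {doc_id -> [positions of the term in that document]}
        self.postings = defaultdict(dict)

    def add(self, doc_id, text):
        # The extra work at insert time: every token is recorded.
        tokens = re.findall(r"\w+", text.lower())
        for position, token in enumerate(tokens):
            self.postings[token].setdefault(doc_id, []).append(position)

    def lookup(self, term):
        # One dictionary lookup replaces a scan over all documents.
        return self.postings.get(term.lower(), {})

idx = InvertedIndex()
idx.add(1, "the quick brown fox")
idx.add(2, "the lazy dog and the fox")

print(idx.lookup("fox"))  # {1: [3], 2: [5]}
print(idx.lookup("the"))  # {1: [0], 2: [0, 4]}
```

The cost noted on the slide is visible in add(): each new document is tokenized and every token updates the postings, which is what keeps lookup() cheap.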
16 Inverted Index
17 Inverted Index
20 Full-text search
▪ A full-text search is a comprehensive search method that compares every
word of the search request against every word within the document or
database.
▪ Web search engines and document editing software make extensive use of
the full-text search technique in functions for searching a text database stored
on the Web or on the local drive of a computer; it lets the user find a word or
phrase anywhere within the database or document.
▪ Full-text search is the most common technique used in Web search engines
and Web pages.
▪ Each page is searched and indexed, and if any matches are found, they are
displayed via the indexes. Parts of the original text are displayed alongside the
user’s query, followed by the full text.
▪ Full-text search reduces the hassle of searching for a word in huge amounts
of metadata, such as the World Wide Web and commercial-scale databases.
21 Full-text search
▪ A common question from non-Full-Text users is, “If Full-Text search is
about looking for words inside text, then XQuery already does that with
the contains function. So what's missing?” The contains function does not
do a Full-Text search; it does a substring search.
▪ The main difference is that a Full-Text search will generally match only a
complete word, and not just part of a string. For example, a Full-Text search
for “dent” will not match a piece of text that contains the word “students,” but
a substring search will.
▪ Also, when running a Full-Text search, there is generally an assumption that
the match will be case-insensitive, so that “dent” will match “DENT” as well
as “dent” (and “Dent” and “dEnt” and “DEnt” and so on). With substring
queries, matching is usually case-sensitive (depending on the collation
used), so that the text being searched has to match the case of the search
term.
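The difference between the two kinds of match can be checked with a short sketch. It is written in Python rather than XQuery, and both helper functions are hypothetical names introduced for this example; the full-text version assumes a simple lowercase-and-split tokenizer.

```python
import re

def substring_match(needle, text):
    # Case-sensitive substring test, like a plain "contains".
    return needle in text

def full_text_match(word, text):
    # Tokenize and lowercase the text, then compare whole words only.
    tokens = re.findall(r"\w+", text.lower())
    return word.lower() in tokens

print(substring_match("dent", "All students passed"))  # True
print(full_text_match("dent", "All students passed"))  # False
print(substring_match("dent", "A DENT in the door"))   # False
print(full_text_match("dent", "A DENT in the door"))   # True
```

The first pair shows the whole-word rule ("dent" is inside "students" but is not a word of its own), and the second pair shows the case rule.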
22 Full-text search
23 Core indexing classes
24 Primary Analyzers available in Lucene
25 Analysis examples
26 Core searching classes
▪ TermQuery: TermQuery is the most commonly used query object and is
the foundation of many complex queries that Lucene can make use of.
▪ TopDocs: TopDocs points to the top N search results that match the
search criteria. It is a simple container of pointers to the documents
that are the output of a search.
▪ IndexSearcher: This class acts as a core component that
reads/searches the indexes created by the indexing process. It takes
a directory instance pointing to the location containing the indexes.
▪ Term: This class is the lowest unit of searching. It is similar to Field in
the indexing process.
▪ Query: Query is an abstract class that contains various utility methods and
is the parent of all types of queries that Lucene uses during the search
process.
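To make the roles of these classes concrete, here is a toy Python analogue. It is not Lucene's Java API: the classes below only borrow the names to show how the pieces fit together, build_index is a hypothetical helper, and ranking by raw term frequency is a deliberate simplification of real scoring.

```python
from collections import defaultdict, namedtuple

# Stand-ins named after the Lucene classes they imitate.
Term = namedtuple("Term", ["field", "text"])          # lowest unit of searching
ScoreDoc = namedtuple("ScoreDoc", ["doc_id", "score"])
TopDocs = namedtuple("TopDocs", ["total_hits", "score_docs"])

class TermQuery:
    # A query that matches documents containing one exact term.
    def __init__(self, term):
        self.term = term

def build_index(documents):
    # Term -> {doc_id -> number of occurrences in that field}
    index = defaultdict(dict)
    for doc_id, doc in enumerate(documents):
        for field, text in doc.items():
            for word in text.lower().split():
                term = Term(field, word)
                index[term][doc_id] = index[term].get(doc_id, 0) + 1
    return index

class IndexSearcher:
    # Reads an already-built index and answers queries against it.
    def __init__(self, index):
        self.index = index

    def search(self, query, n):
        postings = self.index.get(query.term, {})
        # Highest frequency first; ties broken by document id.
        ranked = sorted(postings.items(), key=lambda kv: (-kv[1], kv[0]))
        return TopDocs(len(ranked), [ScoreDoc(d, f) for d, f in ranked[:n]])

docs = [
    {"title": "lucene basics", "body": "lucene builds an index"},
    {"title": "search notes", "body": "lucene lucene everywhere"},
    {"title": "cooking", "body": "pasta and sauce"},
]
searcher = IndexSearcher(build_index(docs))
top = searcher.search(TermQuery(Term("body", "lucene")), n=10)
print(top.total_hits)                          # 2
print([sd.doc_id for sd in top.score_docs])    # [1, 0]
```

A Term pairs a field name with a word, which is why the title "lucene basics" does not count as a hit for a query on the body field. The TopDocs result holds only document ids and scores; the application fetches the documents themselves afterwards.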
27 Lucene Implementation
28 Lucene Indexing
29 Lucene Indexing Step 1 of 5
30 Lucene Indexing Step 2 of 5
31 Lucene Indexing Step 3 of 5
32 Lucene Indexing Step 4 of 5
33 Lucene Indexing Step 5 of 5
34 Searching
35 Searching: Step 1 of 6
36 Searching: Step 2 of 6
37 Searching: Step 3 of 6
38 Searching: Step 4 and 5 of 6
39 Searching: Step 6 of 6
40 Elasticsearch
▪ Elasticsearch is a distributed, open-source search and analytics engine
built on Apache Lucene and developed in Java.
▪ It started as a scalable version of the Lucene open-source search
framework, then added the ability to horizontally scale Lucene indices.
▪ Elasticsearch allows you to store, search, and analyze huge volumes
of data quickly and in near real time, and gives back answers in
milliseconds.
▪ It’s able to achieve fast search responses because instead of searching
the text directly, it searches an index. It uses a structure based on
documents instead of tables and schemas and comes with extensive
REST APIs for storing and searching the data.
▪ At its core, you can think of Elasticsearch as a server that can process
JSON requests and give you back JSON data.
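The JSON-in, JSON-out idea can be sketched without a running server. The request below uses the "match" query of the Elasticsearch Query DSL; the field name "title", the helper functions, and the sample response are assumptions made for illustration. The response is hand-written to mirror the usual hits.hits[]._source shape and is not real server output.

```python
import json

def build_match_request(field, text, size=10):
    # "match" runs an analyzed full-text query against one field.
    return {"size": size, "query": {"match": {field: text}}}

def titles_from_response(response):
    # Each hit carries the stored document under "_source".
    return [hit["_source"]["title"] for hit in response["hits"]["hits"]]

# The JSON body that would be sent to an index's _search endpoint.
request_body = json.dumps(build_match_request("title", "lucene"))
print(request_body)

# Hand-written sample in the shape of a search response.
sample_response = json.loads("""
{"hits": {"hits": [
  {"_id": "1", "_score": 1.0, "_source": {"title": "Lucene in practice"}}
]}}
""")
print(titles_from_response(sample_response))  # ['Lucene in practice']
```

In practice the body would be posted over HTTP to an endpoint such as /<index>/_search, where the index name depends on your deployment.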
41 Elasticsearch
▪ Elasticsearch is able to achieve fast search responses because, instead
of searching the text directly, it searches an index. This is like
retrieving pages in a book related to a keyword by scanning the index at
the back of the book, as opposed to searching every word of every page of
the book.
▪ This type of index is called an inverted index, because it inverts a
page-centric data structure (page->words) to a keyword-centric data structure
(word->pages).
▪ In Elasticsearch, a Document is the unit of search and index. An index
consists of one or more Documents, and a Document consists of one or
more Fields.
▪ In database terminology, a Document corresponds to a table row, and a
Field corresponds to a table column.
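The book analogy translates directly into code: start from a page-centric mapping and invert it into a keyword-centric one. This is a minimal sketch with made-up page contents, and the function name invert is an assumption for the example.

```python
from collections import defaultdict

def invert(pages):
    # pages: {page number -> list of words on that page}
    word_to_pages = defaultdict(list)
    for page, words in sorted(pages.items()):
        for word in sorted(set(words)):
            word_to_pages[word].append(page)
    return dict(word_to_pages)

# Page-centric structure (page -> words).
pages = {1: ["index", "search"], 2: ["search", "query"], 3: ["index"]}

# Keyword-centric structure (word -> pages), like the index at the back of a book.
back_of_book = invert(pages)
print(back_of_book["search"])  # [1, 2]
print(back_of_book["index"])   # [1, 3]
```

Answering "which pages mention search?" is now a single dictionary lookup instead of a pass over every page.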
42 Elasticsearch
A RESTful API is an interface that two computer systems use to
exchange information securely over the internet.
Features
▪ Real-time data,
▪ Real-time analytics,
▪ Distributed, high availability, multi-tenancy, full-text search,
▪ Document-oriented, conflict management, schema-free,
▪ RESTful API, per-operation persistence, Apache 2 open-source
license, built on top of Apache Lucene.
43 Why Elasticsearch?
▪ Easy to deploy (minimum configuration)
▪ Scales vertically and horizontally
▪ Easy-to-use API
▪ Modules for most programming/scripting languages
▪ Actively developed, with good online documentation
▪ It’s free.
44

Thank You
???
