
Chapter - 6 - Searching and Indexing

Chapter 5 discusses the concepts of searching and indexing, emphasizing the importance of indexing for efficient text search. It introduces Apache Lucene as a powerful, open-source search engine library that serves as the core for various applications, including Elasticsearch. The chapter also explains the structure and functionality of inverted indexes and full-text search techniques used in search applications.

Uploaded by

Tek singh Ayer

Chapter 5

Searching and Indexing

Bal Krishna Nyaupane


Assistant Professor
Department of Electronics and Computer Engineering
Institute of Engineering, Tribhuvan University
bkn@wrc.edu.np
2 Indexing
➢ To search large amounts of text quickly, one must
first index that text, converting it into a format that
supports rapid search and eliminates the
slow sequential scanning process. This conversion
process is called indexing, and its output is called
an index.
➢ Indexing is the first stage of every search application.
➢ The goal of indexing is to process the original data into
a highly efficient cross-reference lookup that
facilitates rapid searching.
➢ The job is simple when the content is already
textual in nature and its location is known.
3 Searching
➢ Searching is the process of looking up words in
an index to find the documents where they appear.
➢ A search consults the index instead of scanning the text.
4 What is Lucene?
➢ Apache Lucene is a free and open-source search engine
software library, originally written in Java by Doug Cutting.
➢ It is supported by the Apache Software Foundation and is
released under the Apache License. Lucene is widely
used as a standard foundation for non-research search applications.
➢ Lucene has been ported to other programming languages,
including Object Pascal, Perl, C#, C++, Python, Ruby and PHP.
➢ Lucene is the search core of both Apache Solr™ and
Elasticsearch™.
➢ Lucene Core is a Java library providing powerful indexing and
search features, as well as spellchecking, hit highlighting and
advanced analysis/tokenization capabilities.
5 What is Lucene?
➢ Lucene is a high-performance, scalable information retrieval
(IR) library.
➢ Lucene lets you add searching capabilities to your applications.
It’s a mature, free, open-source project implemented in Java, and a
project of the Apache Software Foundation.
➢ Lucene’s website, at http://lucene.apache.org/java, is a great
place to learn more about the current status of Lucene.
➢ There you’ll find the tutorial, Javadocs for Lucene’s API for all recent
releases, an issue-tracking system, links for downloading releases,
and Lucene’s wiki (http://wiki.apache.org/lucene-java), which
contains many community-created and -maintained pages.
6 What is Lucene?
7 Who uses Lucene?
8 Lucene Architecture
9 Typical Components of Search Application
10 Typical Components of Search Application
A common misconception is that Lucene is an entire search
application, when in fact it’s simply the core indexing and
searching component.
We’ll see that a search application starts with an indexing chain,
which in turn requires separate steps to retrieve the raw content;
create documents from the content, possibly extracting text
from binary documents; and index the documents.
Once the index is built, the components required for searching
are equally diverse, including a user interface, a means for
building up a programmatic query, query execution (to
retrieve matching documents), and results rendering.
11 How does a search application work?
12 How does a search application work?
13 How does a search application work?
14 Basic Application
15 Inverted Index
➢ An inverted index is an index data structure storing a mapping
from content, such as words or numbers, to its locations in a
document or a set of documents.
➢ The purpose of an inverted index is to allow fast full-text
searches, at the cost of increased processing when a document is
added to the database.
➢ The inverted file may be the database file itself, rather than its
index. It is the most popular data structure used in document
retrieval systems, used on a large scale, for example, in search
engines.
➢ Additionally, several significant general-purpose mainframe-based
database management systems have used inverted list architectures.
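
The mapping described above can be sketched in a few lines of Python. This is a minimal illustration, not how Lucene stores its index: the function names (`tokenize`, `build_inverted_index`, `lookup`), the sample documents, and the simple lowercase tokenizer are all assumptions made for the example.

```python
import re
from collections import defaultdict


def tokenize(text):
    """Lowercase the text and split it into alphanumeric word tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def build_inverted_index(docs):
    """Map each term to {doc_id: [positions]} for every document containing it."""
    index = defaultdict(dict)
    for doc_id, text in docs.items():
        for position, term in enumerate(tokenize(text)):
            index[term].setdefault(doc_id, []).append(position)
    return index


def lookup(index, term):
    """Return the sorted ids of the documents that contain the term."""
    return sorted(index.get(term.lower(), {}))


# Hypothetical sample documents, keyed by document id.
docs = {
    1: "Lucene is a search library",
    2: "Elasticsearch is built on Lucene",
    3: "An index makes search fast",
}
index = build_inverted_index(docs)
print(lookup(index, "lucene"))  # [1, 2]
print(index["search"])          # {1: [3], 3: [3]}
```

The extra work happens in `build_inverted_index`, when documents are added; a lookup is then a single dictionary access rather than a scan of every document.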
16 Inverted Index
17 Inverted Index
20 Full-text search
➢ A full-text search is a comprehensive search method that compares every
word of the search request against every word within the document or
database.
➢ Web search engines and document editing software make extensive use of
full-text search in functions for searching a text database stored
on the Web or on a computer’s local drive; it lets the user find a word or
phrase anywhere within the database or document.
➢ Full-text search is the most common technique used in Web search engines
and Web pages.
➢ Each page is searched and indexed, and any matches found are
displayed via the indexes. Parts of the original text that match the
user’s query are displayed first, and then the full text.
➢ Full-text search reduces the hassle of searching for a word in huge amounts
of data, such as the World Wide Web and commercial-scale databases.
21 Full-text search
➢ A common question from non-Full-Text users is, “If Full-Text search is
about looking for words inside text, then XQuery already does that with
the contains function. So what’s missing?” The contains function does not
do a Full-Text search; it does a substring search.
➢ The main difference is that a Full-Text search will generally match only a
complete word, and not just part of a string. For example, a Full-Text search
for “dent” will not match a piece of text that contains the word “students,” but
a substring search will.
➢ Also, when running a Full-Text search, there is generally an assumption that
the match will be case-insensitive, so that “dent” will match “DENT” as well
as “dent” (and “Dent” and “dEnt” and “DEnt” and so on). With substring
queries, matching is usually case-sensitive (depending on the collation
used), so that the text being searched has to match the case of the search
term.
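
The difference between the two kinds of match can be shown with a short Python sketch. The helper names `substring_match` and `full_text_match` are hypothetical, and the word-splitting rule (runs of word characters) is a simplifying assumption; real full-text engines apply richer analysis.

```python
import re


def substring_match(text, term):
    """Case-sensitive substring test, like a plain contains() call."""
    return term in text


def full_text_match(text, term):
    """Case-insensitive whole-word test: the term must equal a complete token."""
    tokens = re.findall(r"\w+", text.lower())
    return term.lower() in tokens


text = "The students visited the dentist"
print(substring_match(text, "dent"))                  # True: "dent" is inside "students"
print(full_text_match(text, "dent"))                  # False: no token equals "dent"
print(full_text_match("A DENT in the door", "dent"))  # True: case is ignored
print(substring_match("A DENT in the door", "dent"))  # False: case must match
```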
22 Full-text search
23 Core indexing classes
24 Primary Analyzers available in Lucene
25 Analysis examples
26 Core searching classes
➢ TermQuery: TermQuery is the most commonly used query object and is
the foundation of many of the complex queries that Lucene supports.
➢ TopDocs: TopDocs holds the top N search results that match the
search criteria. It is a simple container of pointers to the documents
returned by a search.
➢ IndexSearcher: This class is the core component that
reads and searches the indexes created by the indexing process. It takes a
Directory instance pointing to the location containing the indexes
(in recent Lucene versions, an IndexReader opened on that directory).
➢ Term: This class is the basic unit of searching. It is similar to Field in
the indexing process.
➢ Query: Query is an abstract class that contains various utility methods and
is the parent of all types of queries that Lucene uses during the search
process.
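
How these roles fit together can be sketched with plain Python stand-ins. Nothing below is Lucene's API: `ToySearcher` and the tuples named after `Term`, `ScoreDoc` and `TopDocs` are hypothetical, and scoring by raw term count is an assumption chosen for brevity (Lucene's actual scoring is more elaborate).

```python
from collections import Counter, defaultdict, namedtuple

# Hypothetical stand-ins that mirror the roles of the classes above.
Term = namedtuple("Term", ["field", "text"])
ScoreDoc = namedtuple("ScoreDoc", ["doc_id", "score"])
TopDocs = namedtuple("TopDocs", ["total_hits", "score_docs"])


class ToySearcher:
    """Indexes documents (dicts of field name -> text) and answers term queries."""

    def __init__(self, docs):
        # (field, word) -> Counter mapping doc_id to the number of occurrences
        self.postings = defaultdict(Counter)
        for doc_id, doc in enumerate(docs):
            for field, text in doc.items():
                for word in text.lower().split():
                    self.postings[(field, word)][doc_id] += 1

    def search(self, term, n):
        """Rank matching documents by term count and keep the top n."""
        counts = self.postings.get((term.field, term.text), {})
        ranked = sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))
        return TopDocs(len(ranked), [ScoreDoc(d, c) for d, c in ranked[:n]])


# Hypothetical sample documents.
docs = [
    {"title": "lucene basics", "body": "lucene is a search library"},
    {"title": "search basics", "body": "search the index to make search fast"},
    {"title": "cooking", "body": "recipes"},
]
searcher = ToySearcher(docs)
top = searcher.search(Term("body", "search"), 1)
print(top.total_hits)  # 2
print(top.score_docs)  # [ScoreDoc(doc_id=1, score=2)]
```

As in the description above, the result object does not hold the documents themselves, only document ids with scores; the caller fetches each document by id.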
27 Lucene Implementation
28 Lucene Indexing
29 Lucene Indexing Step 1 of 5
30 Lucene Indexing Step 2 of 5
31 Lucene Indexing Step 3 of 5
32 Lucene Indexing Step 4 of 5
33 Lucene Indexing Step 5 of 5
34 Searching
35 Searching: Step 1 of 6
36 Searching: Step 2 of 6
37 Searching: Step 3 of 6
38 Searching: Step 4 and 5 of 6
39 Searching: Step 6 of 6
40 Elasticsearch
➢ Elasticsearch is a distributed, open-source search and analytics engine
built on Apache Lucene and developed in Java.
➢ It started as a scalable version of the Lucene open-source search
framework, then added the ability to horizontally scale Lucene indices.
➢ Elasticsearch allows you to store, search, and analyze huge volumes
of data quickly and in near real time, and gives back answers in
milliseconds.
➢ It achieves fast search responses because, instead of searching
the text directly, it searches an index. It uses a structure based on
documents instead of tables and schemas, and comes with extensive
REST APIs for storing and searching the data.
➢ At its core, you can think of Elasticsearch as a server that can process
JSON requests and give you back JSON data.
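
To make the "JSON in, JSON out" idea concrete, the sketch below only builds the requests and prints them; it does not contact a server. The index name `books` and the sample document are hypothetical, and the paths assume the `_doc` and `_search` REST endpoints and the `match` query of recent Elasticsearch versions.

```python
import json

# Hypothetical index name and document; neither comes from the slides.
index_name = "books"
document = {"title": "Searching and Indexing", "chapter": 5}

# Indexing: the document itself is the JSON body sent to /<index>/_doc/<id>.
index_request = ("PUT", f"/{index_name}/_doc/1", json.dumps(document))

# Searching: a match query on one field, sent to /<index>/_search.
search_body = {"query": {"match": {"title": "indexing"}}}
search_request = ("POST", f"/{index_name}/_search", json.dumps(search_body))

for method, path, body in (index_request, search_request):
    print(method, path, body)
```

The stored unit is a JSON document with named fields rather than a row in a fixed table schema, and the query is itself a JSON document.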
41 Elasticsearch
➢ Elasticsearch achieves fast search responses because, instead
of searching the text directly, it searches an index. This is like
finding the pages of a book related to a keyword by scanning the index at
the back of the book, as opposed to reading every word of every page of
the book.
➢ This type of index is called an inverted index, because it inverts a
page-centric data structure (page->words) into a keyword-centric data structure
(word->pages).
➢ In Elasticsearch, a Document is the unit of search and index. An index
consists of one or more Documents, and a Document consists of one or
more Fields.
➢ In database terminology, a Document corresponds to a table row, and a
Field corresponds to a table column.
42 Elasticsearch
A RESTful API is an interface that two computer systems use to
exchange information securely over the internet.
Features
▪ Real-time data,
▪ Real-time analytics,
▪ Distributed, high availability, multi-tenancy, full-text search,
▪ Document-oriented, conflict management, schema-free,
▪ RESTful API, per-operation persistence, Apache 2 open-source
license (through version 7.10), built on top of Apache Lucene.
43 Why Elasticsearch?
➢ Easy to deploy (minimum configuration)
➢ Scales vertically and horizontally
➢ Easy-to-use API
➢ Modules for most programming/scripting languages
➢ Actively developed with good online documentation
➢ It’s free.
44

Thank You
???
