Unit 1

Uploaded by

nanipavan830

‭UNIT-1‬

‭FUNCTIONAL OVERVIEW OF IRS‬


‭1. Normalizing Incoming Items:‬

This step is about converting various types of incoming data into
a consistent, standard format so that they can be easily
processed and searched.

● Language Encoding: Ensure that text from different
languages is properly encoded, typically in Unicode, which
allows consistent display and search across languages.
‭●‬ ‭Different File Formats:‬‭Convert files from various formats‬
‭(like text, images, videos) into a standard format. For‬
‭example:‬
‭○‬ ‭Videos could be converted to formats like MPEG-2,‬
‭MPEG-1, AVI.‬
‭○‬ ‭Audio files to WAV, Real Audio.‬
‭○‬ ‭Images to GIF, JPEG, BMP.‬
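
As an illustration of the language-encoding step, here is a minimal sketch that decodes incoming bytes and normalizes them to one Unicode form. The function name `normalize_text` and the choice of NFC normalization are assumptions made for this example; the notes only say that text should end up in Unicode.

```python
import unicodedata

def normalize_text(raw: bytes, source_encoding: str) -> str:
    """Decode bytes from a known source encoding and return NFC-normalized Unicode."""
    text = raw.decode(source_encoding)
    # NFC composes base letters and combining marks into single code points,
    # so the same word compares equal regardless of how it was typed or stored.
    return unicodedata.normalize("NFC", text)

# The same word arriving in two different encodings and two different forms.
legacy_item = "caf\u00e9".encode("latin-1")      # precomposed e-acute, Latin-1 bytes
decomposed_item = "cafe\u0301".encode("utf-8")   # 'e' + combining accent, UTF-8 bytes

a = normalize_text(legacy_item, "latin-1")
b = normalize_text(decomposed_item, "utf-8")
print(a == b)
```

After normalization both items hold the same string, so a search for one form also finds the other.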

‭2. Logical Restructuring – Zoning:‬

Break down the content into meaningful sections. For example, if
‭you're processing an academic paper, divide it into sections like‬
‭Title, Author, Abstract, Main Text, Conclusion, References,‬
‭Keywords. This helps in more precise searching and better‬
‭display of search results.‬
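
A small sketch of zoning, assuming (for illustration only) that each zone in the incoming item starts with a label such as "Title:" at the beginning of a line. Real items rarely arrive this cleanly; the zone names come from the example above, and `split_into_zones` is a hypothetical helper.

```python
import re

ZONE_NAMES = ["Title", "Author", "Abstract", "Main Text",
              "Conclusion", "References", "Keywords"]
ZONE_PATTERN = re.compile(
    r"^(" + "|".join(re.escape(name) for name in ZONE_NAMES) + r"):\s*(.*)$"
)

def split_into_zones(item: str) -> dict:
    """Return {zone name: zone text}; unlabeled lines extend the current zone."""
    zones = {}
    current = None
    for line in item.splitlines():
        match = ZONE_PATTERN.match(line)
        if match:
            current = match.group(1)
            zones[current] = match.group(2)
        elif current is not None and line.strip():
            zones[current] = (zones[current] + " " + line.strip()).strip()
    return zones

paper = (
    "Title: Zoning in IRS\n"
    "Author: A. Writer\n"
    "Abstract: Items are split\n"
    "into zones.\n"
    "Main Text: Body text here."
)
zones = split_into_zones(paper)
print(zones["Abstract"])
```

With the zones stored separately, a search can be restricted to one of them, for example only the Abstract.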

‭3. Creating a Searchable Data Structure (Indexing):‬

‭This involves several steps:‬

1. Identification of Processing Tokens:
‭○‬ ‭Processing Tokens:‬‭These are the key pieces of‬
‭information used in searches, often better defined‬
‭than just words.‬
‭○‬ ‭Valid Word Symbols:‬‭Alphabetic characters and‬
‭numbers.‬
‭○‬ ‭Inter-Word Symbols:‬‭Blanks, periods, semicolons‬
‭(these don't affect the search).‬
‭○‬ ‭Special Processing Symbols:‬‭Hyphens.‬
2. Words are defined as continuous sequences of valid word
symbols separated by inter-word symbols.
3. Stop Algorithm:
○ Stop Words: Remove common words (like 'the', 'and')
that appear in almost every document, or words that
appear very infrequently, to save system resources.
○ Stop List: A predefined list of such stop words.
4. Characterize Tokens:
○ Word Characteristics: Identify specific features like
proper names, acronyms, numbers, dates.
○ Part of Speech Tagging: Determine if the word is a
noun, verb, etc.
○ Word Sense Disambiguation: Understand the
meaning of a word based on context.
‭5.‬‭Stemming Algorithm:‬
‭○‬ ‭Stemming:‬‭Reduce words to their base or root form.‬
‭For example, 'computing', 'computers', and‬
‭'computation' are all reduced to 'comput'. This reduces‬
‭the number of unique words and saves storage space,‬
‭while also improving search efficiency.‬
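
The token steps above can be sketched as a small pipeline. Everything here is an assumption for illustration: the stop list is a tiny sample, hyphens are simply treated as inter-word symbols, and the suffix rules are a toy stemmer, not the Porter algorithm or any other published stemmer.

```python
import re

STOP_LIST = {"the", "and", "of", "a", "in", "to", "is"}  # sample stop list only
SUFFIXES = ("ation", "ers", "ing", "er", "s")            # toy rules, not Porter

def tokenize(text):
    # Valid word symbols are letters and digits; anything else separates words.
    return re.findall(r"[a-z0-9]+", text.lower())

def remove_stop_words(tokens):
    return [t for t in tokens if t not in STOP_LIST]

def stem(token):
    # Strip the first matching suffix, keeping at least three characters.
    for suffix in SUFFIXES:
        if token.endswith(suffix) and len(token) - len(suffix) >= 3:
            return token[: -len(suffix)]
    return token

def processing_tokens(text):
    return [stem(t) for t in remove_stop_words(tokenize(text))]

print(processing_tokens("The computers and the computation of computing."))
```

All three variants collapse to the same stem, which is why stemming reduces the number of unique entries the index has to store.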

‭4. Creating the Searchable Data Structure:‬

After processing tokens through the stemming algorithm, they
are added to a searchable data structure. This structure
could be a signature file, inverted list, or PAT tree, and it
represents the semantic concepts of items in the database.
Because searches run against this structure rather than the
original text, it determines what a user can find, so it must
support efficient and accurate retrieval.
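
Of the structures named above, the inverted list is the simplest to show. The sketch below assumes each item has already been reduced to a list of processing tokens; the item ids and tokens are made up for the example.

```python
from collections import defaultdict

def build_inverted_list(items):
    """Map each processing token to the sorted list of item ids that contain it."""
    index = defaultdict(set)
    for item_id, tokens in items.items():
        for token in tokens:
            index[token].add(item_id)
    return {token: sorted(ids) for token, ids in index.items()}

items = {
    1: ["comput", "memory"],
    2: ["comput", "network"],
    3: ["network", "protocol"],
}
inverted = build_inverted_list(items)
print(inverted["comput"])
```

A token that never made it into the structure (for example a stop word) cannot be found by any search, which is the sense in which the structure limits what a user can retrieve.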

‭Summary:‬

● Normalization: Convert and standardize different formats
and languages.
‭●‬ ‭Zoning:‬‭Break down content into logical sections.‬
‭●‬ ‭Token Identification:‬‭Identify important searchable‬‭tokens‬
‭and remove unnecessary ones.‬
‭●‬ ‭Token Characterization:‬‭Determine the specific features‬
‭and context of tokens.‬
‭●‬ ‭Stemming:‬‭Reduce words to their base form to save‬
‭space and improve search efficiency.‬
‭●‬ ‭Indexing:‬‭Create an internal structure that represents the‬
‭data and enables efficient searching.‬

‭Selective Dissemination of Information (SDI):‬

SDI is a system that automatically matches new information
against users' interests and delivers relevant items to them.

● How it works:
○ Search Process: The system continuously searches
new items.
○ User Profiles: Each user has a profile that describes
their interests.
○ User Mail Files: Where the system stores items
matching user interests.
● User Profile:
○ A broad search statement that describes what the
user is interested in.
○ A list of mail files to receive documents that match the
search statement.
○ When a new item matches the profile, it is sent to the
associated mail files.
‭●‬ ‭Difference from Ad Hoc Queries:‬
‭○‬ ‭Profiles have many search terms and cover a wide‬
‭range of interests.‬
‭○‬ ‭Ad hoc queries are short and specific.‬
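
A rough sketch of the SDI flow, under simplifying assumptions: a profile's search statement is reduced to a set of terms, an item matches when it shares at least `min_matches` terms with the profile, and mail files are plain lists of item ids. The names `Profile` and `disseminate` are hypothetical.

```python
from dataclasses import dataclass

@dataclass
class Profile:
    user: str
    terms: set        # broad search statement, simplified to a set of terms
    mail_files: list  # names of the mail files that receive matching items

def disseminate(item_id, item_tokens, profiles, mail_store, min_matches=1):
    """Compare one new item against every profile and file it where it matches."""
    tokens = set(item_tokens)
    for profile in profiles:
        if len(profile.terms & tokens) >= min_matches:
            for mail_file in profile.mail_files:
                mail_store.setdefault(mail_file, []).append(item_id)
    return mail_store

profiles = [
    Profile("ana", {"solar", "wind"}, ["ana-energy"]),
    Profile("raj", {"protein"}, ["raj-bio", "raj-all"]),
]
mail_store = {}
disseminate(101, ["solar", "panel"], profiles, mail_store)
disseminate(102, ["protein", "fold"], profiles, mail_store)
print(mail_store)
```

Note the direction of the match: each new item is checked against the stored profiles, whereas an ad hoc query is checked against the stored items.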

‭Document Database Search:‬

This allows users to search all items that have been received
‭and stored in the system.‬

‭●‬ ‭Components:‬
‭○‬ ‭Search Process:‬‭The mechanism that handles‬
‭searches.‬
‭○‬ ‭User Queries:‬‭Specific search statements entered by‬
‭users.‬
‭○‬ ‭Document Database:‬‭The collection of all processed‬
‭and stored items.‬
‭●‬ ‭Characteristics of Document Database:‬
‭○‬ ‭Items usually do not change once stored.‬
‭○‬ ‭It can be partitioned by time and allow for archiving.‬
‭●‬ ‭Difference from Profiles:‬‭Queries are short and focused‬
‭on specific interests.‬

‭Index Database Search:‬

Users can save and organize items for future reference through
indexing.

● Index Process:
‭○‬ ‭Users can add items to an index with extra terms and‬
‭descriptions.‬
‭○‬ ‭The index can point to the original item or contain‬
‭detailed information about it.‬
‭●‬ ‭Components:‬
‭○‬ ‭Indexes:‬‭Like a library card catalog, they help‬
‭organize and find items.‬
‭○‬ ‭Index Database Search Process:‬‭Lets users create‬
‭and search indexes.‬
‭○‬ ‭Users can search the index and retrieve either the‬
‭index itself or the original item.‬
‭●‬ ‭Types of Index Files:‬
‭○‬ ‭Public Index Files:‬‭Managed by library staff and‬
‭include all items in the Document Database.‬
○ Private Index Files: Created by individual users;
each user can have multiple private indexes.

‭Combined File Search:‬


This process integrates searches across both the document and
index databases.

● Public vs. Private Index Files:
‭○‬ ‭Public index files cover all items and are accessible to‬
‭all users.‬
‭○‬ ‭Private index files are specific to individual users and‬
‭cover a smaller subset of items.‬
‭●‬ ‭Database Management System:‬
○ Index files are often managed with a structured
database management system, such as a relational DBMS (RDBMS).

‭Automatic File Build (Information Extraction):‬

‭This process helps create indexes automatically.‬

● How it works:
‭○‬ ‭Processes new documents and identifies key‬
‭information like authors, publication date, source, and‬
‭references.‬
‭○‬ ‭Rules for which documents to process and how to‬
‭extract index terms are stored in Automatic File Build‬
‭Profiles.‬
‭●‬ ‭Candidate Index Records:‬
‭○‬ ‭The result of processing new documents.‬
‭○‬ ‭Reviewed and edited by users before updating the‬
‭actual index file.‬
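
The following sketch shows one way an Automatic File Build Profile could drive extraction. It assumes, purely for illustration, that documents carry labeled lines such as "Author:" and that a profile holds one selection rule plus one regular expression per field. The profile layout and the `reviewed` flag are inventions of this example, not a described format.

```python
import re

FILE_BUILD_PROFILE = {
    # Rule for which documents to process (hypothetical: a required word).
    "select_if_contains": "retrieval",
    # Rules for how to extract index terms from a selected document.
    "extract": {
        "author": re.compile(r"^Author:\s*(.+)$", re.MULTILINE),
        "date": re.compile(r"^Date:\s*(\d{4}-\d{2}-\d{2})$", re.MULTILINE),
        "source": re.compile(r"^Source:\s*(.+)$", re.MULTILINE),
    },
}

def build_candidate_record(item_id, text, profile):
    """Return a candidate index record, or None if the profile skips the document."""
    if profile["select_if_contains"] not in text.lower():
        return None
    record = {"item_id": item_id, "reviewed": False}  # awaits user review
    for field_name, pattern in profile["extract"].items():
        match = pattern.search(text)
        record[field_name] = match.group(1).strip() if match else None
    return record

doc = (
    "Author: J. Doe\n"
    "Date: 2021-03-04\n"
    "Source: Example Journal\n"
    "Notes on information retrieval."
)
candidate = build_candidate_record(7, doc, FILE_BUILD_PROFILE)
print(candidate)
```

The record stays a candidate until a user has reviewed it; only then would it be written to the actual index file.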

‭Summary:‬

● SDI: Automatically matches new items to user interests and
delivers relevant information.
‭●‬ ‭Document Database Search:‬‭Allows users to search all‬
‭stored items.‬
‭●‬ ‭Index Database Search:‬‭Enables users to save, organize,‬
‭and search items using indexes.‬
‭●‬ ‭Combined File Search:‬‭Integrates document and index‬
‭searches.‬
● Automatic File Build: Automates the creation of index
records by extracting key information from new documents.

‭DIGITAL LIBRARY‬
‭DATA WAREHOUSE‬
‭IRS CAPABILITIES‬
‭Boolean Logic:‬

● Boolean logic allows users to combine search terms using
operators like AND, OR, and NOT. For instance, "cats AND
‭dogs" retrieves items containing both words, "cats OR‬
‭dogs" retrieves items containing either word, and "cats‬
‭NOT dogs" retrieves items containing "cats" but excluding‬
‭"dogs."‬
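
Over an inverted list, the three operators map directly onto set operations. The tiny index below is made up for the example; each term points to the set of item ids that contain it.

```python
index = {
    "cats": {1, 2, 4},
    "dogs": {2, 3, 4},
}

def postings(term):
    """Item ids containing the term; an unknown term matches nothing."""
    return index.get(term, set())

cats_and_dogs = postings("cats") & postings("dogs")   # both words
cats_or_dogs = postings("cats") | postings("dogs")    # either word
cats_not_dogs = postings("cats") - postings("dogs")   # "cats" without "dogs"

print(sorted(cats_and_dogs), sorted(cats_or_dogs), sorted(cats_not_dogs))
```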

‭Proximity:‬
‭●‬ ‭Proximity search looks for words that appear close to each‬
‭other within a specified distance. For example, searching‬
‭"bake NEAR/5 cake" finds instances where "bake" and‬
‭"cake" appear within five words of each other, which helps‬
‭in locating related terms in context.‬
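
A minimal sketch of a NEAR check, assuming the item is already tokenized and that "within five words" means the two word positions differ by at most five. Systems differ on how they count distance, so treat that definition as an assumption.

```python
def near(tokens, first, second, max_distance):
    """True if some occurrence of `first` lies within max_distance words of `second`."""
    first_positions = [i for i, t in enumerate(tokens) if t == first]
    second_positions = [i for i, t in enumerate(tokens) if t == second]
    return any(
        abs(i - j) <= max_distance
        for i in first_positions
        for j in second_positions
    )

close = "bake the chocolate cake".split()
far = "bake bread today and then tomorrow maybe eat some cake".split()
print(near(close, "bake", "cake", 5), near(far, "bake", "cake", 5))
```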

‭Contiguous Word Phrases:‬

● This capability searches for exact phrases where words
appear together in the same order. For example, searching
‭for "climate change" returns results where these two words‬
‭are next to each other, ensuring the phrase's specific‬
‭context is maintained in the search results.‬
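
Phrase matching can be sketched as a sliding window over the item's tokens. The helper name `contains_phrase` is hypothetical, and the sketch ignores stemming and stop words.

```python
def contains_phrase(tokens, phrase):
    """True if the words of `phrase` appear next to each other, in order."""
    n = len(phrase)
    return any(tokens[i:i + n] == phrase for i in range(len(tokens) - n + 1))

phrase = ["climate", "change"]
print(contains_phrase("climate change is real".split(), phrase))
print(contains_phrase("change the climate".split(), phrase))
```

Unlike "climate AND change", the phrase search rejects items where the two words are present but separated or reversed.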

‭Fuzzy Searches:‬

‭●‬ ‭Fuzzy searches find words that are similar to the search‬
‭term, accommodating spelling variations and typos. For‬
‭example, searching for "color" might also return "colour."‬
‭This is useful when dealing with documents containing‬
‭typographical errors or different spellings of the same word.‬
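
One common way to implement fuzzy matching is edit distance: the number of single-character insertions, deletions, or substitutions needed to turn one word into another. The notes do not say which similarity measure is meant, so the Levenshtein distance and the `max_edits` threshold below are assumptions.

```python
def edit_distance(a, b):
    """Levenshtein distance computed row by row."""
    previous = list(range(len(b) + 1))
    for i, ch_a in enumerate(a, start=1):
        current = [i]
        for j, ch_b in enumerate(b, start=1):
            cost = 0 if ch_a == ch_b else 1
            current.append(min(
                previous[j] + 1,         # delete ch_a
                current[j - 1] + 1,      # insert ch_b
                previous[j - 1] + cost,  # substitute or keep
            ))
        previous = current
    return previous[-1]

def fuzzy_matches(term, vocabulary, max_edits=1):
    return [word for word in vocabulary if edit_distance(term, word) <= max_edits]

vocabulary = ["colour", "collar", "colors", "table"]
print(fuzzy_matches("color", vocabulary))
```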

‭Term Masking:‬

● Term masking uses wildcards to replace characters in a
‭search term. For example, "comp*" can find "computer,"‬
‭"compete," and "compile." The asterisk (*) represents any‬
‭number of characters, while a question mark (?) can‬
‭replace a single character, broadening the search scope.‬
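
Python's standard `fnmatch` module happens to use the same two wildcards, so it gives a quick way to try term masking against a vocabulary. The vocabulary is made up; note that `fnmatch` also gives special meaning to square brackets, which goes beyond what is described above.

```python
from fnmatch import fnmatchcase

def mask_matches(mask, vocabulary):
    """Return the vocabulary words matched by a mask using * and ? wildcards."""
    return [word for word in vocabulary if fnmatchcase(word, mask)]

vocabulary = ["computer", "compete", "compile", "camp", "woman", "women"]
print(mask_matches("comp*", vocabulary))
print(mask_matches("wom?n", vocabulary))
```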

‭Numeric & Date Ranges:‬

● This capability allows searching within specific numeric or
date ranges. For example, a user can search for documents from
2010 to 2020 or find products priced between $50 and
$100. It helps in filtering search results based on
quantitative criteria, like dates or numbers.
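
A short sketch of range filtering. It assumes dates and prices have already been recognized as typed values during token characterization, and it treats both ends of the range as inclusive; the sample records are invented.

```python
from datetime import date

def in_range(value, low, high):
    """Inclusive range test; works for numbers and for dates."""
    return low <= value <= high

documents = [
    {"id": 1, "published": date(2008, 5, 1), "price": 40.0},
    {"id": 2, "published": date(2015, 9, 12), "price": 75.0},
    {"id": 3, "published": date(2020, 12, 31), "price": 120.0},
]

by_date = [d["id"] for d in documents
           if in_range(d["published"], date(2010, 1, 1), date(2020, 12, 31))]
by_price = [d["id"] for d in documents if in_range(d["price"], 50, 100)]
print(by_date, by_price)
```

The comparison only works because the values are stored as dates and numbers; compared as plain text, "100" would sort before "50".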

‭Concept & Thesaurus Expansions:‬

● This search capability includes related concepts or
synonyms to broaden search results. For example,
‭searching for "happy" might also retrieve "joyful" or‬
‭"content." Thesaurus expansions enhance search flexibility‬
‭by understanding and including variations in terminology,‬
‭ensuring comprehensive results.‬
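
Thesaurus expansion can be sketched as a lookup that adds synonyms to the query before it runs. The thesaurus here is a hand-made dictionary holding only the example above; a real system would load a much larger one.

```python
THESAURUS = {"happy": ["joyful", "content"]}  # sample entries only

def expand_query(terms, thesaurus):
    """Return the original terms plus their synonyms, without duplicates."""
    expanded = []
    for term in terms:
        for candidate in [term, *thesaurus.get(term, [])]:
            if candidate not in expanded:
                expanded.append(candidate)
    return expanded

print(expand_query(["happy", "children"], THESAURUS))
```

The expanded terms are typically combined with OR, so an item containing any one of them is retrieved.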

‭Natural Language Queries:‬

● Natural language queries allow users to search using
everyday language, mimicking human conversation. For
‭example, instead of using keywords, a user might ask,‬
‭"What is the capital of France?" The system interprets the‬
‭question and retrieves relevant information, making‬
‭searches more intuitive.‬

‭Multimedia Queries:‬

● Multimedia queries enable searching for various types of
content such as images, videos, and audio files. For
‭example, finding all videos related to "wildlife." This‬
‭capability is essential for databases that include diverse‬
‭media types, allowing users to locate non-textual‬
‭information easily.‬

‭Browse Capabilities‬

‭1.‬‭Ranking:‬
‭○‬ ‭Ranking orders search results by relevance or‬
‭importance. This helps users see the most relevant‬
‭items first, based on criteria like keyword matches,‬
‭document popularity, or date of publication. For‬
‭example, a search for "renewable energy" will show‬
‭the most relevant articles at the top.‬
‭2.‬‭Zoning:‬
‭○‬ ‭Zoning divides a document into logical sections such‬
‭as title, author, abstract, and main text. This helps in‬
‭targeted searching within specific sections. For‬
‭example, a user might search only within the‬
‭"abstract" zone to find articles with relevant‬
‭summaries.‬
‭3.‬‭Highlighting:‬
‭○‬ ‭Highlighting visually emphasizes search terms in the‬
‭results. When users search for a keyword, this‬
‭feature highlights occurrences of that keyword in the‬
‭displayed documents. This makes it easier for users‬
‭to spot the relevant information quickly.‬
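
Ranking and highlighting can be sketched together. The scoring below simply counts how often the query terms occur in each item, which is only one of the possible criteria mentioned above and is an assumption of this example; the `**` marker used for highlighting is also arbitrary.

```python
import re

def rank(items, query_terms):
    """Return ids of matching items, highest term count first."""
    def score(text):
        tokens = re.findall(r"[a-z0-9]+", text.lower())
        return sum(tokens.count(term) for term in query_terms)

    scored = [(score(text), item_id) for item_id, text in items.items()]
    ordered = sorted(scored, key=lambda pair: (-pair[0], pair[1]))
    return [item_id for s, item_id in ordered if s > 0]

def highlight(text, query_terms, marker="**"):
    """Wrap every whole-word occurrence of a query term in the marker."""
    pattern = re.compile(
        r"\b(" + "|".join(re.escape(t) for t in query_terms) + r")\b",
        re.IGNORECASE,
    )
    return pattern.sub(lambda m: marker + m.group(0) + marker, text)

items = {
    1: "Renewable energy and solar energy",
    2: "Energy policy",
    3: "Cooking tips",
}
query = ["renewable", "energy"]
print(rank(items, query))
print(highlight(items[2], query))
```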

‭Miscellaneous Capabilities‬

‭1.‬‭Vocabulary Browse:‬
‭○‬ ‭Vocabulary browsing allows users to explore terms‬
‭and their relationships within a specific domain or‬
‭subject. It often includes browsing through an index or‬
‭thesaurus to find related terms and expand searches‬
‭effectively. For example, exploring synonyms and‬
‭related terms for "biodiversity."‬
‭2.‬‭Iterative Search & Search History Log:‬
‭○‬ ‭Iterative search involves refining searches based on‬
‭previous results to narrow down to the most relevant‬
‭information. The search history log keeps track of all‬
search queries, allowing users to revisit and refine
‭past searches for improved results.‬
‭3.‬‭Canned Query:‬
‭○‬ ‭Canned queries are pre-defined searches created for‬
‭common queries. These saved searches can be‬
‭quickly executed without having to re-enter the search‬
‭criteria. For example, a canned query for "latest‬
‭technology news" would fetch up-to-date articles on‬
‭that topic.‬
‭4.‬‭Multimedia:‬
‭○‬ ‭Multimedia capabilities involve searching and‬
‭retrieving various types of content like images,‬
‭videos, and audio files. For instance, users can‬
‭search for educational videos, photographs, or music‬
‭files, enabling a richer and more diverse search‬
‭experience.‬
