Context Based Web Indexing For Semantic Web: Anchal Jain Nidhi Tyagi
Nidhi Tyagi 2
Asst. Professor(SHOBHIT UNIVERSITY)
Abstract: A context based focused crawler downloads the web pages that are most relevant to a user query with respect to its context. The downloaded web pages are then indexed to speed up the search engine. This paper proposes a new indexing technique based on a B+ tree that indexes the context along with the ontologies of keywords. These keywords are extracted from the web documents stored in the web repository. The proposed indexing technique increases the speed at which the search engine finds the more relevant documents on the semantic web.
Keywords - Architecture, B+ Tree, Context, Semantic web, Web repository
I. INTRODUCTION
With the rapid growth of the Internet, the World Wide Web (WWW) has become one of the most important resources for obtaining information and one of the most important media of communication. A huge number of documents currently exists on the World Wide Web, and finding information on the WWW according to the user's interest has become a critical task. Modern web search engines can cache, index and search several billion web pages, which still cover only a small part of all the documents on the Web. Even for this small part, the search quality fails to meet the user's requirements in many cases. Many ideas have been proposed to improve web search quality, which can be measured with the following two metrics:
(1) Precision rate: The ratio of the number of relevant documents retrieved to the total number of documents retrieved.
(2) Recall rate: The ratio of the number of relevant documents retrieved to the total number of relevant documents in the Web.
The purpose of storing an index is to optimize speed and performance in finding relevant documents for a search query. Without an index, the search engine would have to scan every document in the corpus, which would require considerable time and computing power. For example, while an index of 10,000 documents can be queried within milliseconds, a sequential scan of every word in those documents is a time consuming task. The additional storage required for the index, as well as the considerable increase in the time required for an update to take place, are traded off for the time saved during information retrieval [1].
In a B+ tree all paths from the root to the leaf nodes have equal length, so the tree is balanced. All data is stored at the leaf nodes (leaf pages), and the leaf pages are linked to each other. A B+ tree reduces the number of I/O operations required to find an element in the tree: finding a record requires O(log_b n) operations, where b is the branching factor of the tree and n is the number of indexed records. This makes the structure well suited to a search engine.
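To make the logarithmic lookup cost mentioned above concrete, the following minimal Python sketch counts how many levels a search has to descend for a given number of indexed keys. It is an illustration only: the function name `levels_needed` is hypothetical, and the model assumes every node is completely full with exactly b children, which a real B+ tree does not guarantee.

```python
def levels_needed(n_keys: int, branching: int) -> int:
    """Smallest h such that branching**h >= n_keys.

    Simplifying assumption (not from the paper): every node has exactly
    `branching` children, so h approximates ceil(log_b n).
    """
    levels, capacity = 0, 1
    while capacity < n_keys:
        capacity *= branching
        levels += 1
    return levels


if __name__ == "__main__":
    n = 1_000_000
    for b in (2, 10, 100):
        # A larger branching factor means fewer node visits per lookup.
        print(f"b={b:>3}: about {levels_needed(n, b)} levels for {n} keys")
```

Under this model a binary tree over one million keys is about 20 levels deep, while a tree with a branching factor of 100 is about 3 levels deep, which is why a high fan-out structure needs fewer I/O operations per lookup.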
II. RELATED WORK
Many algorithms and techniques have already been proposed for indexers to build indexes over documents for information retrieval, but they are not efficient enough for search.
Nidhi Tyagi and R. P. Agarwal [1] propose a technique for indexing the keywords extracted from web documents along with their contexts. It uses a height-balanced binary search (AVL) tree for indexing in order to enhance the performance of the retrieval system.
P. Gupta and A. K. Sharma [2] worked on context based indexing in search engines using ontology. The index is constructed on the basis of the context using ontology. The indexer uses the context repository, thesaurus and ontology repository to identify the context of a document.
C. Zhou, W. Ding and Na Yang [5] introduce a double indexing mechanism for search engines based on campus Net. The CNSE consists of a crawl machine, Chinese automatic segmentation, an index and a search machine. The proposed mechanism has a document index as well as a word index. The document index is based on clustering of the documents and is ordered by position within each document. During retrieval, the search engine first gets the document id of the word from the word index, and then goes to the position of
www.iosrjournals.org 89 | Page
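[Figure 1: diagram with the labels Pre-Processing of Documents, Extract Keywords, Thesaurus, Ontology, B+ Tree, Query Processor, User, and the fields keyword, Context, Ontology, Doc_id, Text]
[Figure 2: interface with an "Enter keyword" field (example value CROWN), a "Generate context" button and a "Show Documents" button]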
Fig 2 Query Retrieval Interface
In Figure 2 the user has entered the keyword "Crown". The contexts of the keyword are displayed through the Generate context button, and the URLs of the corresponding related web pages (available in the repository) are listed by pressing the Show Documents button. This helps the user to directly access more related and relevant information.
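The interaction described for the query retrieval interface can be sketched in Python as two lookups, one per button. This is only an illustration under stated assumptions: the names `CONTEXT_INDEX`, `generate_context` and `show_documents`, the context labels and the example.com URLs are all hypothetical and do not come from the paper, and a nested dictionary stands in for the actual index.

```python
# Hypothetical stand-in for the index: keyword -> context -> page URLs.
# All contexts and URLs below are invented for illustration.
CONTEXT_INDEX = {
    "crown": {
        "royalty": ["http://example.com/monarchy"],
        "dentistry": ["http://example.com/dental-crowns"],
    },
}


def generate_context(keyword):
    """'Generate context' button: list the known contexts of a keyword."""
    return sorted(CONTEXT_INDEX.get(keyword.lower(), {}))


def show_documents(keyword, context):
    """'Show Documents' button: list the URLs stored for keyword + context."""
    return CONTEXT_INDEX.get(keyword.lower(), {}).get(context, [])


if __name__ == "__main__":
    for ctx in generate_context("CROWN"):
        print(ctx, show_documents("CROWN", ctx))
```

Keywords are lower-cased before the lookup so that "CROWN" and "Crown" resolve to the same entry; whether the original system normalizes case is not stated in the paper.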
[Chart: Algorithm Comparison of Binary Tree, AVL Tree and B+ Tree]
The proposed algorithm for indexing provides fast access to document context and structure along with optimized searching.
3.3 Proposed algorithm for the indexing scheme
Step 1: Preprocess the crawled web documents and extract the keywords along with their frequency of occurrence.
Step 2: Input the keywords to the context generator, which extracts the multiple contextual senses of each word. The context is searched in the thesaurus (a dictionary of words available on the WWW from thesaurus.com, which contains the words as well as their multiple meanings).
Step 3: Index the keywords along with their contexts using the B+ tree.
Step 4: Compare the entered keyword with the keyword field of the tree nodes until a matching word is found. If the search succeeds, the corresponding document_id is stored.
Step 5: If the search is not a success, create a node containing the following fields: left child, keyword, right child and link. The link is a pointer variable that points to the database where the context of the keyword is stored along with its ontology based document_id.
Step 6: Arrange the node in the B+ tree according to the height BF.
Step 7: Repeat steps 4, 5 and 6 until all the extracted keywords are arranged.
Step 8: When the user fires a query with the context explicitly specified, the index is searched, reducing the search time to half of that of a linear search.
Step 9: Thus, the B+ tree indexing technique provides fast access to document context and structure.
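Steps 3 to 8 above can be illustrated with a small in-memory B+ tree in Python. This is a minimal sketch under several assumptions, not the authors' implementation: the input is taken to be (keyword, context, document_id) triples already produced by steps 1 and 2, the composite key (keyword, context) is an illustrative choice, nodes follow the usual B+ tree layout (sorted keys, child lists, linked leaves) rather than the left child / right child fields of step 5, and all class names, records and the `order` parameter are hypothetical.

```python
import bisect


class Node:
    def __init__(self, leaf=True):
        self.leaf = leaf
        self.keys = []       # sorted keys
        self.children = []   # internal nodes: child pointers
        self.values = []     # leaf nodes: list of document ids per key
        self.next = None     # leaf nodes: link to the right sibling


class BPlusTree:
    def __init__(self, order=4):
        self.order = order   # maximum number of keys per node
        self.root = Node()

    def search(self, key):
        """Descend from the root to a leaf, then look the key up there."""
        node = self.root
        while not node.leaf:
            node = node.children[bisect.bisect_right(node.keys, key)]
        i = bisect.bisect_left(node.keys, key)
        if i < len(node.keys) and node.keys[i] == key:
            return node.values[i]
        return None

    def insert(self, key, value):
        split = self._insert(self.root, key, value)
        if split:            # the root itself split: grow the tree by one level
            sep, right = split
            new_root = Node(leaf=False)
            new_root.keys, new_root.children = [sep], [self.root, right]
            self.root = new_root

    def _insert(self, node, key, value):
        if node.leaf:
            i = bisect.bisect_left(node.keys, key)
            if i < len(node.keys) and node.keys[i] == key:
                node.values[i].append(value)   # known key: add document id
                return None
            node.keys.insert(i, key)           # new key: create an entry
            node.values.insert(i, [value])
            if len(node.keys) <= self.order:
                return None
            mid = len(node.keys) // 2          # overflow: split the leaf
            right = Node()
            right.keys, right.values = node.keys[mid:], node.values[mid:]
            node.keys, node.values = node.keys[:mid], node.values[:mid]
            right.next, node.next = node.next, right
            return right.keys[0], right        # copy the separator upward
        i = bisect.bisect_right(node.keys, key)
        split = self._insert(node.children[i], key, value)
        if not split:
            return None
        sep, child = split
        node.keys.insert(i, sep)
        node.children.insert(i + 1, child)
        if len(node.keys) <= self.order:
            return None
        mid = len(node.keys) // 2              # overflow: split internal node
        right = Node(leaf=False)
        sep_up = node.keys[mid]
        right.keys, right.children = node.keys[mid + 1:], node.children[mid + 1:]
        node.keys, node.children = node.keys[:mid], node.children[:mid + 1]
        return sep_up, right                   # move the separator upward


# Hypothetical output of steps 1-2: (keyword, context, document_id) triples.
records = [
    ("crown", "royalty", "doc1"),
    ("crown", "dentistry", "doc2"),
    ("crown", "royalty", "doc3"),
    ("tree", "botany", "doc4"),
    ("tree", "data structure", "doc5"),
    ("index", "search engine", "doc5"),
]
index = BPlusTree(order=3)
for keyword, context, doc_id in records:
    index.insert((keyword, context), doc_id)

# Step 8: a query with the context explicitly specified.
print(index.search(("crown", "royalty")))  # ['doc1', 'doc3']
```

Keeping the balance by splitting full nodes, rather than by a height balance factor, is the standard B+ tree mechanism; it keeps all leaves at the same depth, so every query visits the same number of nodes.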
IV. CONCLUSION
This paper presents an indexing structure that can be constructed on the basis of the context of the document. The context of the document can be extracted by using the thesaurus and the ontology repository, so this paper uses ontology for context based index building. The context based index enables retrieval from the index on the basis of context rather than keywords, which aids in improving the quality of the retrieved results. A rough estimate of support values for the existing and the proposed system indicates the better performance of the proposed system.
Future Scope: The B+ tree based indexing technique is able to support dynamic indexing, and it improves performance in terms of accuracy and efficiency when retrieving more relevant documents as per the user's requirements, since the context of the various keywords is stored along with them. Thus, the indexing technique provides fast access to document context and structure along with optimized searching.