WO2007011140A1 - Method of extracting topics and issues and method and apparatus for providing search results based on topics and issues - Google Patents
- Publication number
- WO2007011140A1 (PCT/KR2006/002787)
- Authority
- WO
- WIPO (PCT)
- Prior art keywords
- candidate phrases
- phrases
- documents
- extracting
- secondary candidate
- Prior art date
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/30—Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
- G06F16/33—Querying
- G06F16/3331—Query processing
- G06F16/334—Query execution
Definitions
- the information search apparatus includes a web data storage unit 810, a searching unit 820, a primary candidate phrase extracting unit 830, a secondary candidate phrase extracting unit 840, a similar candidate phrase eliminating unit 850, and a topic output unit 860.
- the web data storage unit 810 collects and stores documents on the Internet.
- the searching unit 820 uses typical search methods to search for the documents.
- the primary candidate phrase extracting unit 830 sequentially assigns document IDs to the documents in appearance order of the documents, and extracts documents having document IDs less than a predetermined value. A method of extracting the primary candidate phrases is described above in detail with reference to Fig. 2.
- the secondary candidate phrase extracting unit 840 extracts words contained in titles or content of the documents and appearance frequencies of the words, extracts documents containing words of appearance frequencies greater than a predetermined value in the titles or content as primary candidate phrases, generates secondary candidate phrases composed of combinations of phrases obtained from the words constituting the primary candidate phrases, and calculates weight values of the secondary candidate phrases.
- the similar candidate phrase eliminating unit 850 uses vectors consisting of document IDs of documents belonging to secondary candidate phrases with weight values greater than a predetermined value to calculate similarities between the secondary candidate phrases.
- the similar candidate phrase eliminating unit 850 eliminates secondary candidate phrases with lower weight values among secondary candidate phrases with similarities greater than a predetermined value, and sets the remaining secondary candidate phrases as topics.
- the topic output unit 860 sets the topics as titles and outputs the topics and documents corresponding to the topics.
- the above-mentioned methods of extracting topics and issues may be written as computer programs. Codes and code segments constituting the programs can be easily deduced by computer programmers skilled in the art.
- the programs are stored in computer readable media, read and executed by computers, thereby implementing the methods of extracting topics and issues. Examples of the computer readable media include magnetic recording media, optical recording media, and carrier wave media.
- the present invention can be efficiently applied to industrial fields related to a method and apparatus for extracting topics from search results and providing the search results based on the topics, and a method and apparatus for selecting and providing frequently appearing search results as issues.
Landscapes
- Engineering & Computer Science (AREA)
- Theoretical Computer Science (AREA)
- Computational Linguistics (AREA)
- Data Mining & Analysis (AREA)
- Databases & Information Systems (AREA)
- Physics & Mathematics (AREA)
- General Engineering & Computer Science (AREA)
- General Physics & Mathematics (AREA)
- Information Retrieval, Db Structures And Fs Structures Therefor (AREA)
Abstract
Disclosed is a method of displaying search results with respect to a search word, including: (a) referring to words contained in titles or content of search results matching with the search word to calculate similarities between the search results according to a predetermined similarity calculation method, and extracting representative phrases among combinations of words repeatedly contained in similar search results; and (b) displaying the representative phrases and the search results that belong to each of the representative phrases.
Description
METHOD OF EXTRACTING TOPICS AND ISSUES AND
METHOD AND APPARATUS FOR PROVIDING SEARCH
RESULTS BASED ON TOPICS AND ISSUES
Technical Field
[1] The present invention relates to an information search technology and, more particularly, to a method and apparatus for extracting topics from search results and providing the search results based on the topics, and a method and apparatus for selecting and providing frequently appearing search results as issues.
Background Art
[2] A conventional search system groups search results into groups based on their types, sequentially provides the search results based on similarities with search words, or places search results that are most similar to the search words at the top of search pages.
[3] However, there is a problem in the conventional search system in that too many redundant search results appear and most of the search results are useless since users tend to view only a few search results appearing at the top of the search pages.
Disclosure of Invention
Technical Solution
[4] The present invention provides a method and apparatus for searching for information based on topics by extracting phrases constituting search results to select topics and outputting the search results topic-by-topic so that users can obtain desired information more easily.
[5] The present invention further provides a method of searching for information based on issues by outputting Internet search results as issues in order of appearance frequencies of the search results.
Advantageous Effects
[6] According to the present invention, users can use search results more efficiently since the users can easily grasp the search results and are not provided with repeated search results.
Brief Description of the Drawings
[7] The above and other features and advantages of the present invention will become more apparent by describing in detail exemplary embodiments thereof with reference to the attached drawings in which:
[8] Fig. 1 is a view for explaining a method of providing search results based on topics according to an embodiment of the present invention;
[9] Fig. 2 is a flow chart of a method of extracting topics according to an embodiment of the present invention;
[10] Figs. 3 to 9 are views for explaining a method of extracting topics according to an embodiment of the present invention;
[11] Fig. 10 is a flow chart of a method of extracting issues according to an embodiment of the present invention;
[12] Figs. 11 to 13 are views for explaining a method of extracting issues according to an embodiment of the present invention;
[13] Fig. 14 is an issue output result;
[14] Fig. 15 is another issue output result; and
[15] Fig. 16 is a block diagram of an information search apparatus according to an embodiment of the present invention.
Best Mode for Carrying Out the Invention
[16] According to an aspect of the present invention, there is provided a method of displaying search results with respect to a search word, including: (a) referring to words contained in titles or content of search results matching with the search word to calculate similarities between the search results according to a predetermined similarity calculation method, and extracting representative phrases among combinations of words repeatedly contained in similar search results; and (b) displaying the representative phrases and the search results that belong to each of the representative phrases.
[17] According to another aspect of the present invention, there is provided a method of extracting topics, including: (a) assigning document IDs to documents with respect to a search word based on appearance orders of the documents, and extracting documents with document IDs less than a predetermined value; (b) extracting words contained in titles or content of the extracted documents and appearance frequencies of the words; (c) extracting primary candidate phrases composed of words of appearance frequencies greater than a predetermined value appearing consecutively in the titles or content of the documents; (d) generating secondary candidate phrases from combinations of phrases composed of the words constituting the primary candidate phrases, and calculating weight values of the secondary candidate phrases; (e) calculating similarities between secondary candidate phrases with weight values greater than a predetermined value by use of vectors consisting of document IDs of documents belonging to the secondary candidate phrases; and (f) eliminating secondary candidate phrases with low weight values among the secondary candidate phrases with similarities greater than a predetermined value, and setting the remaining secondary candidate phrases as topics.
[18] According to another aspect of the present invention, there is provided a method of extracting issues, including: (a) extracting the same or similar data the number of which is greater than a predetermined threshold value among stored data; and (b) extracting as issue data a plurality of high-ranking data among the extracted data and displaying the issue data in order of writing time of the issue data or in order of a number of similar documents.
[19] According to another aspect of the present invention, there is provided an apparatus for providing search services based on extracted topics, including: a searching unit searching for stored documents; a primary candidate phrase extracting unit sequentially assigning document IDs to searched documents based on appearance orders of the searched documents, and extracting documents with document IDs less than a predetermined value; a secondary candidate phrase extracting unit extracting words contained in titles or content of the extracted documents and appearance frequencies of the words, extracting primary candidate phrases composed of words of appearance frequencies greater than a predetermined value appearing consecutively in the titles or content of the documents, generating secondary candidate phrases from combinations of phrases composed of the words constituting the primary candidate phrases, and calculating weight values of the secondary candidate phrases; and a similar candidate phrase eliminating unit calculating similarities between secondary candidate phrases with weight values greater than a predetermined value by use of vectors consisting of document IDs of documents belonging to the secondary candidate phrases, eliminating secondary candidate phrases with low weight values among the secondary candidate phrases with similarities greater than a predetermined value, and setting the remaining secondary candidate phrases as topics.
Mode for the Invention
[20] Exemplary embodiments in accordance with the present invention will now be described in detail with reference to the accompanying drawings.
[21] Fig. 1 is a view for explaining a method of providing search results based on topics according to an embodiment of the present invention.
[22] Referring to Fig. 1, search results are grouped into groups having similar phrases, and topics are extracted from the groups. For instance, it is assumed that 'A=Neowiz', 'B=Separated_search_service', 'C=Pmang', 'D=Special_force', 'E=Founding_anniversary', and 'F=Party'. When various search results including 'A', 'B', 'C', 'D', 'E', and 'F' are output, 'ABE', 'ABF', and 'ABD' may be grouped into a group, 'CDE', 'CDF', and 'CDG' may be grouped into a group, and 'AEFG', 'AEFH', and 'AEFI' may be grouped into a group. In this case, 'AB' becomes a topic 100, 'CD' becomes a topic, and 'AEF' becomes a topic. The term 'topic' implies an expression indicating a subject of search results.
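The grouping in Fig. 1 can be illustrated with a minimal sketch. It assumes, purely for illustration, that the topic of an already-formed group is the longest run of leading symbols shared by all of its members; the description does not state this rule, and the function name `common_prefix` is hypothetical.

```python
from typing import Sequence, Tuple


def common_prefix(phrases: Sequence[Sequence[str]]) -> Tuple[str, ...]:
    """Return the longest run of leading symbols shared by every phrase."""
    if not phrases:
        return ()
    prefix = []
    # zip(*phrases) walks all phrases position by position.
    for symbols in zip(*phrases):
        if len(set(symbols)) != 1:
            break
        prefix.append(symbols[0])
    return tuple(prefix)


# Groups taken from the example of Fig. 1 (each letter stands for a word).
groups = [
    ["ABE", "ABF", "ABD"],
    ["CDE", "CDF", "CDG"],
    ["AEFG", "AEFH", "AEFI"],
]
topics = ["".join(common_prefix(group)) for group in groups]
print(topics)  # ['AB', 'CD', 'AEF']
```

The sketch only derives a label for each group; how the groups themselves are formed is the subject of the method described with reference to Fig. 2.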
[23] Fig. 2 is a flow chart of a method of extracting topics according to an embodiment of the present invention.
[24] Referring to Fig. 2, when a search word is input, similarities between search results are calculated according to a similarity calculation method by referring to words that are included in titles or content of the search results and match the search word. Further, representative phrases are extracted from combinations of words repeatedly contained in similar search results, and the search results are displayed according to the extracted representative phrases.
[25] In more detail, document IDs are sequentially assigned to documents matching a search word based on appearance orders of the documents, and documents with document IDs less than a predetermined value are extracted (operation S210). The predetermined value may vary based on the number of search results, i.e., documents, or the like. Data composed of 'words' which are included in titles or content of the documents, and 'appearance frequencies of the words', are stored (operation S220). Next, primary candidate phrases composed of words of appearance frequencies greater than a predetermined value in the titles or content of the documents are extracted (operation S230). The predetermined value may vary according to the number of primary candidate phrases to be extracted.
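Operations S210 and S220 can be sketched as follows. This is a minimal illustration, not the claimed implementation: the function names, the whitespace tokenization, and the sample titles are all assumptions made for the example.

```python
from collections import Counter
from typing import Dict, List, Set


def select_top_documents(titles: List[str], id_limit: int) -> Dict[int, str]:
    """S210: assign IDs 1, 2, ... in appearance order; keep IDs below id_limit."""
    return {
        doc_id: title
        for doc_id, title in enumerate(titles, start=1)
        if doc_id < id_limit
    }


def count_words(docs: Dict[int, str]) -> Counter:
    """S220: store each word together with its appearance frequency."""
    counts: Counter = Counter()
    for title in docs.values():
        counts.update(title.split())  # assumed tokenization: split on whitespace
    return counts


def frequent_words(counts: Counter, min_count: int) -> Set[str]:
    """Keep only words whose frequency is greater than min_count."""
    return {word for word, n in counts.items() if n > min_count}


# Invented titles, used only to exercise the functions.
titles = [
    "Neowiz Yogurting RPG",
    "Neowiz Search_corporation",
    "Neowiz Yogurting Showdown",
    "Neowiz Music Service",
]
docs = select_top_documents(titles, id_limit=4)
counts = count_words(docs)
frequent = frequent_words(counts, min_count=1)
print(sorted(docs))      # [1, 2, 3]
print(sorted(frequent))  # ['Neowiz', 'Yogurting']
```

Both thresholds are parameters because the description says the predetermined values may vary with the number of search results and the number of phrases to be extracted.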
[26] Next, secondary candidate phrases are generated from combinations of phrases composed of the words constituting the primary candidate phrases, and weight values of the secondary candidate phrases are calculated (operation S240). The weight values of the secondary candidate phrases are calculated by referring to document IDs included in the secondary candidate phrases, appearance frequencies of words constituting the secondary candidate phrases, and the number of primary candidate phrases used in the secondary candidate phrases. For instance, since a document with a low document ID is important, a secondary candidate phrase that includes a low document ID is given a high weight value. In addition, if the appearance frequency of the words constituting a secondary candidate phrase is high, the secondary candidate phrase is regarded as important.
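The description names the three inputs to the weight value but does not give the formula. The sketch below is therefore a hypothetical scoring function, chosen only so that it moves in the directions stated above (lower document IDs, higher word frequencies, and more primary candidate phrases all raise the weight); it does not reproduce the weight values shown in Fig. 7.

```python
from typing import Iterable


def phrase_weight(
    doc_ids: Iterable[int],
    word_counts: Iterable[int],
    n_primary: int,
) -> float:
    """Hypothetical weight: product of three monotone factors.

    doc_ids     -- IDs of documents containing the phrase (IDs start at 1)
    word_counts -- appearance frequency of each word in the phrase
    n_primary   -- number of primary candidate phrases the phrase is used in
    """
    id_score = sum(1.0 / doc_id for doc_id in doc_ids)  # low IDs count more
    freq_score = sum(word_counts)                       # frequent words count more
    return id_score * freq_score * n_primary


print(phrase_weight([1, 2], [3, 2], 2))  # 15.0
```

Any other formula with the same monotonic behaviour would fit the description equally well; the product form is an arbitrary choice.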
[27] Next, similarities between secondary candidate phrases with weight values greater than a predetermined value are calculated by use of vectors consisting of document IDs of documents that belong to the secondary candidate phrases (operation S250). That is, when there are several document IDs, the similarities are calculated by referring to the number of the same document IDs. Among secondary candidate phrases having similarities greater than a predetermined value, secondary candidate phrases with low weight values are eliminated and the remaining secondary candidate phrases are determined as topics (operation S260).
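The elimination in operation S260 can be sketched as a greedy pass over the candidates in descending weight order. The greedy order, the function name `eliminate_similar`, and the toy phrases, weights, and similarity table are assumptions for illustration; the similarity measure is passed in as a parameter because the description does not fix it.

```python
from typing import Callable, Dict, List


def eliminate_similar(
    weights: Dict[str, float],
    similarity: Callable[[str, str], float],
    min_weight: float,
    max_similarity: float,
) -> List[str]:
    """Return the phrases kept as topics.

    Phrases at or below min_weight are dropped first. The rest are visited
    from highest to lowest weight, and a phrase is dropped when it is too
    similar to a heavier phrase that has already been kept.
    """
    candidates = sorted(
        (p for p, w in weights.items() if w > min_weight),
        key=lambda p: weights[p],
        reverse=True,
    )
    topics: List[str] = []
    for phrase in candidates:
        if all(similarity(phrase, kept) <= max_similarity for kept in topics):
            topics.append(phrase)
    return topics


# Toy data, invented for the example.
weights = {"alpha beta": 30.0, "alpha": 20.0, "gamma": 10.0, "delta": 1.0}
table = {frozenset(("alpha beta", "alpha")): 0.9}


def lookup(a: str, b: str) -> float:
    """Similarity from a precomputed table; unknown pairs count as 0."""
    return table.get(frozenset((a, b)), 0.0)


print(eliminate_similar(weights, lookup, min_weight=5.0, max_similarity=0.8))
# ['alpha beta', 'gamma']
```

Here 'delta' falls below the weight threshold and 'alpha' is removed because it is too similar to the heavier 'alpha beta'.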
[28] The topics and the documents belonging to individual topics are displayed.
[29] Figs. 3 to 9 are views for explaining a method of extracting topics according to an embodiment of the present invention.
[30] As shown in Fig. 3, when a search word 'Neowiz' is entered, titles 320 appear as search results and document IDs 310 are assigned to the titles based on appearance orders of the titles.
[31] As shown in Fig. 4, a database 330 is obtained from words constituting the titles 320 and appearance frequencies of the words. It can be seen from Fig. 4 that a word 'Neowiz' appears thirteen times and a word 'Yogurting' appears four times in the titles 320. The appearance frequencies of the other words are obtained in this manner. Words of appearance frequencies less than a predetermined value are eliminated. In Fig. 4, a word 'Showdown' appears once and is eliminated.
[32] Next, phrases composed of words of appearance frequencies greater than a predetermined value are extracted from the titles 320 to make primary candidate phrases 340. It can be seen from Fig. 5 that there are six titles each composed of a string of consecutive words 'Neowiz', 'Yogurting', 'RPG', 'Search_corporation', 'Jukeon', 'Popularized', 'Announces', 'Music', 'Service', and 'Mobile_carrier' among the fourteen titles 320 in Fig. 3.
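One way to read the extraction of primary candidate phrases is as a scan for runs of consecutive frequent words in each title. The sketch below follows that reading; the minimum run length `min_len`, the function name, and the sample titles are assumptions and do not come from the figures.

```python
from typing import Dict, List, Set, Tuple


def primary_candidates(
    docs: Dict[int, str],
    frequent: Set[str],
    min_len: int = 2,
) -> Dict[Tuple[str, ...], List[int]]:
    """Map each maximal run of consecutive frequent words to its document IDs."""
    result: Dict[Tuple[str, ...], List[int]] = {}
    for doc_id, title in docs.items():
        run: List[str] = []
        # The trailing None is a sentinel that flushes the last run.
        for word in title.split() + [None]:
            if word in frequent:
                run.append(word)
                continue
            if len(run) >= min_len:
                result.setdefault(tuple(run), []).append(doc_id)
            run = []
    return result


# Invented titles keyed by document ID.
docs = {
    1: "Neowiz Yogurting RPG launch",
    2: "New Neowiz Yogurting RPG",
    3: "Neowiz Showdown Music Service",
}
frequent = {"Neowiz", "Yogurting", "RPG", "Music", "Service"}
print(primary_candidates(docs, frequent))
# {('Neowiz', 'Yogurting', 'RPG'): [1, 2], ('Music', 'Service'): [3]}
```

Recording the document IDs alongside each phrase keeps the information needed later for the weight and similarity calculations.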
[33] Next, as shown in Fig. 6, secondary candidate phrases 350 are created with a combination of phrases composed of the words. Appearance frequencies 351 of phrases including the secondary candidate phrases 350 in the primary candidate phrases 340 are extracted. As described with reference to Fig. 2, weight values 352 of the secondary candidate phrases 350 are calculated by referring to document IDs included in the secondary candidate phrases 350, appearance frequencies of words constituting the secondary candidate phrases 350, and the number of primary candidate phrases 340 used in the secondary candidate phrases 350. It can be seen from Fig. 7 that the phrase 'Announces RPG yogurting popularized' has a weight value of 1732, the phrase 'Neowiz Jukeon' has a weight value of 1720, the phrase 'Neowiz search_corporation' has a weight value of 1710, and the phrase 'Neowiz Jukeon mobile_carrier' has a weight value of 1320. The phrase 'Jukeon mobile_carrier music' having a weight value of 1200 is discarded. Thus, a reference weight value to eliminate phrases is 1200.
[34] Referring to Fig. 8, strings 353 of document IDs of documents including the secondary candidate phrases 350 are extracted to calculate similarities between the secondary candidate phrases 350. For instance, it is assumed that documents containing the phrase 'Announces RPG yogurting popularized' are (7, 10), documents containing the phrase 'Neowiz yogurting' are (1, 5, 7, 10), documents containing the phrase 'Neowiz search_corporation' are (2, 4, 12), and documents containing the phrase 'Neowiz search' are (2, 4, 8, 12). In this case, since the similarity between the phrases 'Announces RPG yogurting popularized' and 'Neowiz yogurting' is 66%, the similarity is regarded as low. Since the similarity between the phrases 'Neowiz search_corporation' and 'Neowiz search' is 82%, the phrase 'Neowiz search' having a lower weight value is eliminated from the secondary candidate phrases. In this manner, topics 361 and search results topic-by-topic are obtained as shown in Fig. 9.
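The description does not name the similarity measure applied to the document-ID vectors. As a sketch, the Dice coefficient (twice the number of shared IDs divided by the total number of IDs) is assumed below. It gives about 0.67 for the first pair, in line with the 66% quoted above, but about 0.86 for the second pair rather than the quoted 82%, so the measure actually used may differ.

```python
from typing import Iterable


def dice_similarity(ids_a: Iterable[int], ids_b: Iterable[int]) -> float:
    """Dice coefficient of two document-ID collections (assumed measure)."""
    a, b = set(ids_a), set(ids_b)
    if not a and not b:
        return 0.0
    return 2 * len(a & b) / (len(a) + len(b))


# Document-ID strings from the example above.
first = dice_similarity((7, 10), (1, 5, 7, 10))
second = dice_similarity((2, 4, 12), (2, 4, 8, 12))
print(round(first, 2))   # 0.67
print(round(second, 2))  # 0.86
```

Whatever the exact measure, both pairs show the same ordering as in the description: the second pair shares proportionally more document IDs than the first.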
[35] Fig. 10 is a flow chart of a method of extracting issues according to an embodiment of the present invention.
[36] First, data for which the number of the same or similar data items is greater than a predetermined threshold value are extracted from stored data. A plurality of high-ranking data items are extracted as issue data from the extracted data. The issue data are displayed in order of writing time of the issue data or in order of the number of similar documents. The stored data may be all of the Internet documents, specific blogs, data on news sites, or data obtained from predetermined search methods.
[37] In more detail, target documents on the Internet or target documents matching with a search word are extracted (operation S410). The extracted documents may be the same or similar to one another. After the number of same or similar documents is calculated, documents having appearance frequencies greater than a predetermined value are extracted (operation S420).
[38] High-ranking documents having a number of the same or similar documents are extracted as issues (operation S430). The extracted issues are output in order of writing time of the documents or the number of same or similar documents (operation S440).
[39] Figs. 11 to 13 are views for explaining a method of extracting issues according to an embodiment of the present invention.
[40] When there are Internet data 510 as shown in Fig. 11, the data 510 are arranged by document title 520 and appearance frequency 521 as shown in Fig. 12. Documents with appearance frequencies less than a predetermined value are eliminated. In this case, documents with appearance frequencies of less than two hundred are eliminated. The remaining documents are selected as issues and output in order of most recent writing date as shown in Fig. 13.
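Operations S410 to S440 and the example of Figs. 11 to 13 can be sketched as follows. This is a minimal illustration, not the claimed implementation: "same or similar" is reduced to an exact title match, the function name `extract_issues` and its parameters are hypothetical, and the sample titles, dates and the small threshold of 2 (in place of two hundred) are invented for brevity.

```python
from collections import Counter
from datetime import date


def extract_issues(documents, min_count, top_n):
    """documents is a list of (title, writing_date) pairs.

    Counts documents per title, drops titles appearing fewer than
    `min_count` times (S420), keeps the `top_n` most frequent (S430),
    and orders them by most recent writing date (S440).
    """
    counts = Counter(title for title, _ in documents)
    latest = {}
    for title, written in documents:
        if title not in latest or written > latest[title]:
            latest[title] = written
    frequent = [t for t, c in counts.items() if c >= min_count]
    top = sorted(frequent, key=lambda t: counts[t], reverse=True)[:top_n]
    return sorted(top, key=lambda t: latest[t], reverse=True)


# Hypothetical sample data.
docs = (
    [("Title A", date(2005, 7, 1))] * 3
    + [("Title B", date(2005, 7, 10))] * 2
    + [("Title C", date(2005, 7, 12))]
)

issues = extract_issues(docs, min_count=2, top_n=5)
```

Here 'Title C' appears only once and is eliminated; the two remaining titles are output with the more recently written one first.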
[41] Fig. 14 is an issue output result.
[42] Issues may be extracted from all of the target documents on the Internet and displayed as described above. Alternatively, as described with reference to Figs. 2 to 9, topics may be extracted from the target documents, and issues may then be extracted from the topics and displayed.
[43] Fig. 15 is another issue output result.
[44] Issues and topics may be displayed as shown in Fig. 15. For instance, issues 720 and topics 730 corresponding to a search word 'Neowiz' 710 may be displayed at different positions.
[45] Fig. 16 is a block diagram of an information search apparatus according to an embodiment of the present invention.
[46] The information search apparatus includes a web data storage unit 810, a searching unit 820, a primary candidate phrase extracting unit 830, a secondary candidate phrase extracting unit 840, a similar candidate phrase eliminating unit 850, and a topic output unit 860.
[47] The web data storage unit 810 collects and stores documents on the Internet. The searching unit 820 uses typical search methods to search for the documents. The primary candidate phrase extracting unit 830 sequentially assigns document IDs to the documents in appearance order of the documents, and extracts documents having document IDs less than a predetermined value. A method of extracting the primary candidate phrases is described above in detail with reference to Fig. 2. The secondary candidate phrase extracting unit 840 extracts words contained in titles or content of the documents and appearance frequencies of the words, extracts documents containing words of appearance frequencies greater than a predetermined value in the titles or content as primary candidate phrases, generates secondary candidate phrases composed of combinations of phrases obtained from the words constituting the primary candidate phrases, and calculates weight values of the secondary candidate phrases.
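The extraction of primary candidate phrases can be sketched as follows. This is only an illustration under several assumptions: titles are split on whitespace, word frequency is counted as total occurrences over all titles, document IDs are assigned from 1 in list order, and a run consisting of a single frequent word is also treated as a phrase. The function name `primary_candidate_phrases` and the sample titles are hypothetical.

```python
from collections import Counter


def primary_candidate_phrases(titles, min_freq):
    """Return (document ID, phrase) pairs for maximal runs of frequent words.

    A word is frequent when it occurs more than `min_freq` times across
    all titles; consecutive frequent words in a title form one phrase.
    """
    tokenized = [title.split() for title in titles]
    freq = Counter(word for words in tokenized for word in words)
    phrases = []
    for doc_id, words in enumerate(tokenized, start=1):
        run = []
        # The trailing None is a sentinel that flushes the last run.
        for word in words + [None]:
            if word is not None and freq[word] > min_freq:
                run.append(word)
            else:
                if run:
                    phrases.append((doc_id, " ".join(run)))
                run = []
    return phrases


# Hypothetical titles built from words that appear in the examples above.
titles = [
    "Neowiz Jukeon mobile_carrier launch",
    "Neowiz Jukeon music",
    "Neowiz yogurting RPG",
]

primary = primary_candidate_phrases(titles, min_freq=1)
```

With these titles, only 'Neowiz' and 'Jukeon' occur more than once, so the phrases found are 'Neowiz Jukeon' in documents 1 and 2 and 'Neowiz' in document 3.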
[48] The similar candidate phrase eliminating unit 850 uses vectors consisting of document IDs of documents belonging to secondary candidate phrases with weight values greater than a predetermined value to calculate similarities between the secondary candidate phrases. The similar candidate phrase eliminating unit 850 eliminates secondary candidate phrases with lower weight values among secondary candidate phrases with similarities greater than a predetermined value, and sets the remaining secondary candidate phrases as topics. The topic output unit 860 sets the topics as titles and outputs the topics and documents corresponding to the topics.
[49] The above-mentioned methods of extracting topics and issues may be written as computer programs. Codes and code segments constituting the programs can be easily deduced by computer programmers skilled in the art. In addition, the programs are stored in computer readable media and are read and executed by computers, thereby implementing the methods of extracting topics and issues. Examples of the computer readable media include magnetic recording media, optical recording media, and carrier wave media.
[50] While the present invention has been described with reference to exemplary embodiments thereof, it will be understood by those skilled in the art that various changes in form and details may be made therein without departing from the scope of the present invention as defined by the following claims.
Industrial Applicability
[51] The present invention can be efficiently applied to industrial fields related to a method and apparatus for extracting topics from search results and providing the search results based on the topics, and a method and apparatus for selecting and providing frequently appearing search results as issues.
Claims
[1] A method of displaying search results with respect to a search word, comprising:
(a) referring to words contained in titles or content of search results matching with the search word to calculate similarities between the search results according to a predetermined similarity calculation method, and extracting representative phrases among combinations of words repeatedly contained in similar search results; and
(b) displaying the representative phrases and the search results that belong to each of the representative phrases.
[2] The method of claim 1, wherein the operation (a) comprises:
(a1) extracting words contained in titles or content of the search results matching with the search word, and extracting primary candidate phrases in which at least one of the words consecutively appears; and
(a2) generating secondary candidate phrases from words constituting the primary candidate phrases, calculating significance of the secondary candidate phrases based on appearance orders of the search results, appearance frequencies of the words, and the number of primary candidate phrases used in the secondary candidate phrases, and extracting representative phrases by eliminating similar candidate phrases from the secondary candidate phrases of higher significance.
[3] A method of extracting topics, comprising:
(a) assigning document IDs to documents with respect to a search word based on appearance orders of the documents, and extracting documents with document IDs less than a predetermined value;
(b) extracting words contained in titles or content of the extracted documents and appearance frequencies of the words;
(c) extracting primary candidate phrases composed of words of appearance frequencies greater than a predetermined value appearing consecutively in the titles or content of the documents;
(d) generating secondary candidate phrases from combinations of phrases composed of the words constituting the primary candidate phrases, and calculating weight values of the secondary candidate phrases;
(e) calculating similarities between secondary candidate phrases with weight values greater than a predetermined value by use of vectors consisting of document IDs of documents belonging to the secondary candidate phrases; and
(f) eliminating secondary candidate phrases with low weight values among the secondary candidate phrases with similarities greater than a predetermined value, and setting the remaining secondary candidate phrases as topics.
[4] The method of claim 3, further including (g) displaying the topics as titles and documents that belong to each of the topics.
[5] The method of claim 3, wherein the operation (d) comprises: generating secondary candidate phrases from combinations of phrases composed of the words constituting the primary candidate phrases; and calculating weight values of the secondary candidate phrases based on document IDs contained in the secondary candidate phrases, appearance frequencies of the words constituting the secondary candidate phrases, and the number of the primary candidate phrases used in the secondary candidate phrases.
[6] A method of extracting issues, comprising:
(a) extracting the same or similar data the number of which is greater than a predetermined threshold value among stored data; and
(b) extracting as issue data a plurality of high-ranking data among the extracted data and displaying the issue data in order of writing time of the issue data or in order of a number of similar documents.
[7] The method of claim 6, wherein the stored data is data obtained by a predetermined search method.
[8] The method of claim 6, wherein the operation (a) includes determining the same or similar data based on words contained in titles or content of stored data, and extracting the same or similar data the number of which is greater than a predetermined threshold value.
[9] An apparatus for providing search services based on extracted topics, comprising:
a searching unit searching for stored documents;
a primary candidate phrase extracting unit sequentially assigning document IDs to searched documents based on appearance orders of the searched documents, and extracting documents with document IDs less than a predetermined value;
a secondary candidate phrase extracting unit extracting words contained in titles or content of the extracted documents and appearance frequencies of the words, extracting primary candidate phrases composed of words of appearance frequencies greater than a predetermined value appearing consecutively in the titles or content of the documents, generating secondary candidate phrases from combinations of phrases composed of the words constituting the primary candidate phrases, and calculating weight values of the secondary candidate phrases; and
a similar candidate phrase eliminating unit calculating similarities between secondary candidate phrases with weight values greater than a predetermined value by use of vectors consisting of document IDs of documents belonging to the secondary candidate phrases, eliminating secondary candidate phrases with low weight values among the secondary candidate phrases with similarities greater than a predetermined value, and setting the remaining secondary candidate phrases as topics.
[10] The apparatus of claim 9, further including a topic output unit displaying the topics as titles and documents that belong to each of the topics.
[11] The apparatus of claim 9, wherein the secondary candidate phrase extracting unit generates secondary candidate phrases from combinations of phrases composed of the words constituting the primary candidate phrases, and calculates weight values of the secondary candidate phrases based on document IDs contained in the secondary candidate phrases, appearance frequencies of the words constituting the secondary candidate phrases, and the number of the primary candidate phrases used in the secondary candidate phrases.
[12] Computer readable media storing programs for executing on a computer the method of claim 1 or 2.
Applications Claiming Priority (2)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
KR20050064515 | 2005-07-15 | ||
KR10-2005-0064515 | 2005-07-15 |
Publications (1)
Publication Number | Publication Date |
---|---|
WO2007011140A1 (en) | 2007-01-25 |
Family
ID=37668993
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
PCT/KR2006/002787 WO2007011140A1 (en) | 2005-07-15 | 2006-07-14 | Method of extracting topics and issues and method and apparatus for providing search results based on topics and issues |
Country Status (1)
Country | Link |
---|---|
WO (1) | WO2007011140A1 (en) |
2006-07-14: WO application PCT/KR2006/002787 (WO2007011140A1), active Application Filing
Patent Citations (5)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US5924090A (en) * | 1997-05-01 | 1999-07-13 | Northern Light Technology Llc | Method and apparatus for searching a database of records |
US6012053A (en) * | 1997-06-23 | 2000-01-04 | Lycos, Inc. | Computer system with user-controlled relevance ranking of search results |
US6212517B1 (en) * | 1997-07-02 | 2001-04-03 | Matsushita Electric Industrial Co., Ltd. | Keyword extracting system and text retrieval system using the same |
KR20000050225A (en) * | 2000-05-29 | 2000-08-05 | 전상훈 | Internet information searching system and method by document auto summation |
KR20040029895A (en) * | 2002-10-02 | 2004-04-08 | 씨씨알 주식회사 | Search system |
Cited By (6)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
WO2008098282A1 (en) * | 2007-02-16 | 2008-08-21 | Funnelback Pty Ltd | Search result sub-topic identification system and method |
AU2008215153B2 (en) * | 2007-02-16 | 2012-02-16 | Squiz Pty Ltd | Search result sub-topic identification system and method |
AU2008215153B9 (en) * | 2007-02-16 | 2012-03-01 | Squiz Pty Ltd | Search result sub-topic identification system and method |
US8214347B2 (en) | 2007-02-16 | 2012-07-03 | Funnelback Pty Ltd. | Search result sub-topic identification system and method |
JP2014059865A (en) * | 2012-09-14 | 2014-04-03 | Hon Hai Precision Industry Co Ltd | Retrieval system and method thereof |
CN111666371A (en) * | 2020-04-21 | 2020-09-15 | 北京三快在线科技有限公司 | Theme-based matching degree determination method and device, electronic equipment and storage medium |
Legal Events
| Date | Code | Title | Description |
|---|---|---|---|
| | 121 | Ep: the epo has been informed by wipo that ep was designated in this application | |
| | NENP | Non-entry into the national phase | Ref country code: DE |
| | 32PN | Ep: public notification in the ep bulletin as address of the addressee cannot be established | Free format text: NOTING OF LOSS OF RIGHTS PURSUANT TO RULE 112(1) EPC |
| | 122 | Ep: pct application non-entry in european phase | Ref document number: 06783312; Country of ref document: EP; Kind code of ref document: A1 |