CN101751430A - Electronic dictionary fuzzy searching method - Google Patents
Electronic dictionary fuzzy searching method Download PDFInfo
- Publication number
- CN101751430A CN101751430A CN200810239543A CN200810239543A CN101751430A CN 101751430 A CN101751430 A CN 101751430A CN 200810239543 A CN200810239543 A CN 200810239543A CN 200810239543 A CN200810239543 A CN 200810239543A CN 101751430 A CN101751430 A CN 101751430A
- Authority
- CN
- China
- Prior art keywords
- keyword
- entry
- dictionary
- retrieval
- keywords
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Links
- 238000000034 method Methods 0.000 title claims abstract description 22
- 230000008878 coupling Effects 0.000 claims description 15
- 238000010168 coupling process Methods 0.000 claims description 15
- 238000005859 coupling reaction Methods 0.000 claims description 15
- 238000012217 deletion Methods 0.000 claims description 8
- 230000037430 deletion Effects 0.000 claims description 8
- 238000012163 sequencing technique Methods 0.000 abstract 1
- 238000012015 optical character recognition Methods 0.000 description 8
- 238000013519 translation Methods 0.000 description 4
- 230000011218 segmentation Effects 0.000 description 3
- 238000013461 design Methods 0.000 description 2
- 230000008676 import Effects 0.000 description 2
- 238000012546 transfer Methods 0.000 description 2
- 238000003491 array Methods 0.000 description 1
- 238000010276 construction Methods 0.000 description 1
- 238000000151 deposition Methods 0.000 description 1
- 238000005516 engineering process Methods 0.000 description 1
- 238000003780 insertion Methods 0.000 description 1
- 230000037431 insertion Effects 0.000 description 1
- 230000003287 optical effect Effects 0.000 description 1
- 238000003909 pattern recognition Methods 0.000 description 1
- 230000008092 positive effect Effects 0.000 description 1
- 238000012545 processing Methods 0.000 description 1
Images
Landscapes
- Information Retrieval, Db Structures And Fs Structures Therefor (AREA)
Abstract
The invention relates to an electronic dictionary fuzzy searching method, belonging to the field of fuzzy identification. The electronic dictionary comprises a lexical entry dictionary stored with a plurality of lexical entries, a keyword dictionary stored with a plurality of keywords and a keyword retrieval table; each lexical entry is composed of one or a plurality of keywords, the retrieval table records the corresponding relation between each keyword and the lexical entry containing the keyword. The method comprises the following steps: dividing the words input by a user into one or a plurality of keywords; retrieving one lexical entry or a plurality of lexical entries corresponding to each keyword, and respectively calculating edition distances between the input words and the lexical entries; sequencing the edition distances and selecting at least one lexical entry having the smallest edition distance as the retrieval result. Due to supporting fuzzy retrieval, when the lexical entries input by the user have a plurality of incorrect words, the retrieval result that can not be retrieved by traditional retrieving manner can be obtained quickly by using the inventive method.
Description
Technical field
The invention belongs to area of pattern recognition, particularly relate to a kind of electronic dictionary fuzzy searching method.
Background technology
Present electronic dictionary mainly is to support accurately coupling or prefix matching, and accurately coupling is complete when errorless at the input entry, and electronic dictionary just can provide correct result.With Sino-British electronic dictionary is example, import Chinese entry " People's Bank of China ", output result " People ' s Bank of China ", the entry of importing " Chinese people bank " or " Shen state the People's Bank " wrong like this individual character does not then have correct result's output.Prefix matching several words that are electronic dictionaries before the input entry only has or only have under the correct situation of preceding several words, electronic dictionary can provide series of results according to correct preceding several words.With Sino-British electronic dictionary is example, imports Chinese entry " Chinese people bank ", and output is the result of the entry of prefix with " Chinese ".If importing " Shen state the People's Bank " then exporting with " Shen " is the result of the entry of prefix.
Fuzzy search also do not supported in present electronic dictionary.Fuzzy search: during to be electronic dictionary have some wrong individual character, still can export the result of a series of correspondences in the input entry.With Sino-British electronic dictionary is example, and input " People's Bank of China ", " Chinese people bank " or " Shen state the People's Bank " all can retrieve correct result " People ' s Bank of China ".
At a OCR (OCR, Optical Character Recognition, optical character identification) (promptly at first obtains image in the dictionary for translation, carry out OCR identification then, use electronic dictionary to translate and export translation result at last), the character of the wrong knowledge of recognition result possibility of output is finished in OCR identification, if use traditional retrieval mode in dictionary, to can not find result for retrieval probably, therefore, if electronic dictionary can have the function of fuzzy search, will certainly improve retrieval precision.
Summary of the invention
Do not support the function of fuzzy search at present electronic dictionary, purpose of the present invention designs a kind of electronic dictionary fuzzy searching method exactly, to improve retrieval precision.
In order to realize the object of the invention, the present invention proposes a kind of electronic dictionary fuzzy searching method, described electronic dictionary comprises the entry dictionary that stores a plurality of entries, the keyword dictionary that stores a plurality of keywords and keyword index table; Wherein each entry is made up of one or more keywords, and described concordance list has write down that all have comprised the corresponding relation of the entry of this keyword in each keyword of described keyword dictionary and the described entry dictionary, said method comprising the steps of:
(a) participle: the word to user's input uses the keyword dictionary to carry out participle, and the word of importing is divided into one or more keywords;
(b) calculate editing distance: the keyword that obtains according to the participle step retrieves wherein one or more entries of each keyword correspondence from described keyword index table, calculate the word of described input and the editing distance between these entries respectively;
(c) choose result for retrieval: to editing distance sort and the entry of choosing at least one editing distance minimum as result for retrieval.
Preferably, wherein step (a) adopts reverse maximum syncopation to carry out participle, and the word of importing is divided into several keywords.
Preferably, wherein the editing distance described in the step (b) can refer to and allow an entry S1 become the character sum that another entry S2 need operate, and described operation comprises increase, deletion or substitute character.
Preferably, wherein step (c) back also comprises the step that shows result for retrieval.
Preferably, wherein said entry dictionary is for supporting the electronic dictionary of accurate coupling or prefix.
Preferably, wherein step (a) is preceding also comprises: at first the word to user's input accurately mates retrieval, is somebody's turn to do accurately coupling result for retrieval if find accurate coupling retrieval then directly show.
The present invention has positive effect: owing to can support fuzzy search, make the user when input has some wrong individual characters in the entry, the result that script can not retrieve by the accurate retrieval mode of tradition, obtain result for retrieval soon by method and apparatus of the present invention, both improve speed, also improved retrieval precision.The present invention can be used for the electronic dictionary of handwriting input; And in the OCR dictionary for translation.
Description of drawings
Fig. 1 is dictionary configuration figure of the present invention;
Fig. 2 is the schematic flow sheet of fuzzy search in the retrieval flow of the present invention.
Embodiment
Describe the electronic dictionary of support fuzzy search of the present invention in detail below in conjunction with accompanying drawing.
The data structure of the electronic dictionary of support fuzzy search of the present invention as shown in Figure 1, comprises an entry dictionary 10 and a keyword dictionary 20.Entry dictionary 10 refers to traditional electronic dictionary, is that example comprises a series of entries as " People's Bank of China " with Sino-British electronic dictionary.Keyword dictionary 20 has comprised the keyword that each entry can be divided in the entry dictionary, has comprised " China ", " people ", " bank " three keywords as " People's Bank of China ".And the electronic dictionary of support fuzzy search of the present invention also comprises a keyword index table (not shown), this concordance list has write down that all have comprised the corresponding relation of the entry of this keyword in each keyword of keyword dictionary 20 and the entry dictionary 10, generally speaking, a keyword is to there being several entries.As shown in Figure 1, keyword " people " has write down all and has comprised keyword " people's " the index of entry (as " Chinese people ", " people " etc.) in the entry dictionary in the keyword dictionary.Index relative please see Figure the arrow in 1.
Described entry dictionary 10 is promptly supported the realization of the electronic dictionary of accurate coupling or prefix, can use plurality of data structures (Hash table, search tree, Trie tree etc.).The present invention has adopted even numbers group Trie to set and has realized.The data structure of even numbers group Trie tree is two linear arrays, and one is the base array, and one is the check array.The base array is used for determining the transfer of state, and the check array is used to check the correctness of transfer.With the Chinese-English Dictionary is example, at first the basic Chinese characters among all GB2312 is changed into the sequence code of 1-6768, with the basic value as state exchange; Then the sequence code of all Chinese characters is put into the base array as original state; Next put the follow-up Chinese character sequence code of different entries into array, generate new state, and the base value of original state in the array is adjusted, can put into array to guarantee all follow-up Chinese characters; By that analogy, up to depositing all entry states in array; The final state of representing even numbers group Trie tree simultaneously with negative value.
The realization of described keyword dictionary 20 also can be used plurality of data structures (Hash table, search tree, Trie tree etc.).The same with entry dictionary 10, the present invention has adopted even numbers group Trie to set and has realized, and has added an index and write down the index of all entries that comprise this keyword in entry dictionary 10 in each keyword structure.
The construction method of dictionary configuration of the present invention as shown in Figure 1 is as follows:
1) each entry that will comprise in the entry storehouse of all entries is divided into keyword, and keeps keyword and its semantic identical in former entry.Form crucial dictionary." China ", " people ", " bank " three key words have been divided into as " People's Bank of China ".
2) so crucial dictionary is created as keyword dictionary 20.
3) use the entry dictionary of setting up according to the entry storehouse 10, when setting up these two dictionaries, corresponding to each keyword a keyword index table is arranged, the one or more entries of this index point will write down " People's Bank of China " index in entry dictionary 10 respectively as " People's Bank of China " in the concordance list of " China ", " people ", " bank " three keywords.
The search method of the electronic dictionary of support fuzzy search of the present invention, at first the word to user's input carries out traditional accurate coupling retrieval, if find accurate coupling retrieval then the direct accurately coupling result for retrieval that shows; If can not find, then carry out fuzzy search of the present invention.
Fig. 2 is the schematic flow sheet of fuzzy search in the retrieval flow in the one embodiment of the invention.The fuzzy matching retrieval may further comprise the steps:
In the present invention, weigh the similarity of two entries, refer to and allow the character sum of the operation that an entry S1 becomes another entry S2 to be needed (increase, deletion is replaced) with " editing distance ".In Chinese one and Chinese character calculate a character, a character calculated in a letter in English or other phonetic languages, for example S1=" the Shen state people " then " Shen " among the S1 can be replaced to S2=" Chinese people " " in " to obtain " Chinese people " consistent with S2, owing to carried out replacement operation one time, here editing distance is 1.
Can be transported to devices such as display by the result who obtains after the inventive method retrieval, also can deliver to other modules carries out further data processing, because of not being emphasis of the present invention, no longer describes in detail.
The present invention also provides the fuzzy search device based on the above-mentioned electronic dictionary fuzzy retrieval method, and electronic dictionary comprises the entry dictionary that stores a plurality of entries, the keyword dictionary that stores a plurality of keywords and keyword index table; Wherein each entry is made up of one or more keywords, and concordance list has write down that all have comprised the corresponding relation of the entry of this keyword in each keyword of keyword dictionary and the entry dictionary, and the fuzzy search device comprises with lower module:
(a) participle: the word to user's input uses the keyword dictionary to carry out participle, and the word of importing is divided into one or more keywords;
(b) calculate editing distance: the keyword that obtains according to word-dividing mode retrieves wherein one or more entries of each keyword correspondence from the keyword index table, calculates the word of input and the editing distance between these entries respectively;
(c) choose result for retrieval: to editing distance sort and the entry of choosing at least one editing distance minimum as result for retrieval.
Wherein module (a) adopts reverse maximum syncopation to carry out participle, and the word of importing is divided into several keywords.
Wherein the editing distance in the module (b) refers to and allows entry S1 become the character sum that entry S2 need operate, and operation comprises increase, deletion or substitute character.
Wherein module (c) back also comprises the module that shows result for retrieval.
Wherein the entry dictionary is for supporting the electronic dictionary of accurate coupling or prefix.
Wherein module (a) is preceding also comprises with lower module: at first the word to user's input accurately mates retrieval, is somebody's turn to do accurately coupling result for retrieval if find accurate coupling retrieval then directly show.
The present invention also provides the electronic dictionary corresponding to above-mentioned search method and indexing unit, comprises the entry dictionary that stores a plurality of entries, also comprises:
Store the keyword dictionary of a plurality of keywords;
The keyword index table;
And as the fuzzy search device in the previous technique scheme;
Wherein each entry is made up of one or more keywords, and concordance list has write down that all have comprised this keyword in each keyword of keyword dictionary and the entry dictionary.
The present invention can be used for a lot of occasions, can be used for the electronic dictionary of handwriting input; Also can be used in a as previously mentioned OCR dictionary for translation, the character of the wrong knowledge of recognition result possibility of output is finished in OCR identification, if use traditional retrieval mode in dictionary, to can not find result for retrieval probably, but be to use dictionary of the present invention to carry out fuzzy search and just can retrieve the result that the user needs, improved user's satisfaction.
It should be noted that the foregoing description is example and unrestricted the present invention, those skilled in the art can design a lot of alternate embodiments and not break away from the scope of appended claims.
Claims (6)
1. electronic dictionary fuzzy searching method, it is characterized in that: described electronic dictionary comprises the entry dictionary that stores a plurality of entries, the keyword dictionary that stores a plurality of keywords and keyword index table; Wherein each entry is made up of one or more keywords, and described concordance list has write down that all have comprised the corresponding relation of the entry of this keyword in each keyword of described keyword dictionary and the described entry dictionary, said method comprising the steps of:
(a) participle: the word to user's input uses the keyword dictionary to carry out participle, and the word of importing is divided into one or more keywords;
(b) calculate editing distance: the keyword that obtains according to the participle step retrieves wherein one or more entries of each keyword correspondence from described keyword index table, calculate the word of described input and the editing distance between these entries respectively;
(c) choose result for retrieval: to editing distance sort and the entry of choosing at least one editing distance minimum as result for retrieval.
2. the method for claim 1 is characterized in that: step (a) adopts reverse maximum syncopation to carry out participle, and the word of input is divided into several keywords.
3. method as claimed in claim 1 or 2 is characterized in that: the editing distance described in the step (b) refers to and allows an entry S1 become the character sum that another entry S2 need operate, and described operation comprises increase, deletion or substitute character.
4. the method for claim 1, it is characterized in that: step (c) back also comprises the step that shows result for retrieval.
5. as the described method of any one claim of front, it is characterized in that: described entry dictionary is for supporting the electronic dictionary of accurate coupling or prefix.
6. as the described method of any one claim of front, it is characterized in that: step (a) is preceding also to be comprised: at first the word to user's input accurately mates retrieval, is somebody's turn to do accurately coupling result for retrieval if find accurate coupling retrieval then directly show.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN200810239543A CN101751430A (en) | 2008-12-12 | 2008-12-12 | Electronic dictionary fuzzy searching method |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN200810239543A CN101751430A (en) | 2008-12-12 | 2008-12-12 | Electronic dictionary fuzzy searching method |
Publications (1)
Publication Number | Publication Date |
---|---|
CN101751430A true CN101751430A (en) | 2010-06-23 |
Family
ID=42478421
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN200810239543A Pending CN101751430A (en) | 2008-12-12 | 2008-12-12 | Electronic dictionary fuzzy searching method |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN101751430A (en) |
Cited By (14)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN101986308A (en) * | 2010-11-16 | 2011-03-16 | 传神联合(北京)信息技术有限公司 | Quick term marking method |
CN102682068A (en) * | 2012-03-01 | 2012-09-19 | 沈文策 | Method and system for searching user name |
CN102867049A (en) * | 2012-09-10 | 2013-01-09 | 山东康威通信技术股份有限公司 | Chinese PINYIN quick word segmentation method based on word search tree |
CN103186615A (en) * | 2011-12-30 | 2013-07-03 | 北大方正集团有限公司 | Search prompting method and system |
CN103631965A (en) * | 2013-12-10 | 2014-03-12 | 湖南农业大学 | Tokenizer-based agricultural knowledge entry handheld terminal and entry method of agricultural knowledge |
CN105912649A (en) * | 2016-04-08 | 2016-08-31 | 南京邮电大学 | Database fuzzy retrieval method and system |
CN103186615B (en) * | 2011-12-30 | 2016-11-30 | 北大方正集团有限公司 | A kind of Search Hints method and system |
CN106202127A (en) * | 2015-05-08 | 2016-12-07 | 深圳市腾讯计算机系统有限公司 | A kind of vertical search engine processing method and processing device to retrieval request |
WO2017092122A1 (en) * | 2015-12-03 | 2017-06-08 | 小米科技有限责任公司 | Similarity determination method, device, and terminal |
CN107729351A (en) * | 2017-08-29 | 2018-02-23 | 天翼爱音乐文化科技有限公司 | Multilayer inquiry correcting method and system based on music searching engine |
CN107832330A (en) * | 2017-09-27 | 2018-03-23 | 华为技术有限公司 | A kind of searching method and terminal device |
CN109684439A (en) * | 2018-12-28 | 2019-04-26 | 语联网(武汉)信息技术有限公司 | The method and device of prefix index is carried out during participle |
CN112580691A (en) * | 2020-11-25 | 2021-03-30 | 北京北大千方科技有限公司 | Term matching method, matching system and storage medium of metadata field |
CN113474767A (en) * | 2019-02-14 | 2021-10-01 | 昭和电工株式会社 | Document search device, document search system, document search program, and document search method |
-
2008
- 2008-12-12 CN CN200810239543A patent/CN101751430A/en active Pending
Cited By (21)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN101986308B (en) * | 2010-11-16 | 2013-07-31 | 传神联合(北京)信息技术有限公司 | Quick term marking method |
CN101986308A (en) * | 2010-11-16 | 2011-03-16 | 传神联合(北京)信息技术有限公司 | Quick term marking method |
CN103186615A (en) * | 2011-12-30 | 2013-07-03 | 北大方正集团有限公司 | Search prompting method and system |
CN103186615B (en) * | 2011-12-30 | 2016-11-30 | 北大方正集团有限公司 | A kind of Search Hints method and system |
CN102682068A (en) * | 2012-03-01 | 2012-09-19 | 沈文策 | Method and system for searching user name |
CN102867049A (en) * | 2012-09-10 | 2013-01-09 | 山东康威通信技术股份有限公司 | Chinese PINYIN quick word segmentation method based on word search tree |
CN103631965A (en) * | 2013-12-10 | 2014-03-12 | 湖南农业大学 | Tokenizer-based agricultural knowledge entry handheld terminal and entry method of agricultural knowledge |
CN106202127A (en) * | 2015-05-08 | 2016-12-07 | 深圳市腾讯计算机系统有限公司 | A kind of vertical search engine processing method and processing device to retrieval request |
CN106202127B (en) * | 2015-05-08 | 2020-02-11 | 深圳市腾讯计算机系统有限公司 | Method and device for processing retrieval request by vertical search engine |
US10089301B2 (en) | 2015-12-03 | 2018-10-02 | Xiaomi Inc. | Method and apparatus for determining semantic similarity of character strings |
WO2017092122A1 (en) * | 2015-12-03 | 2017-06-08 | 小米科技有限责任公司 | Similarity determination method, device, and terminal |
CN105912649A (en) * | 2016-04-08 | 2016-08-31 | 南京邮电大学 | Database fuzzy retrieval method and system |
CN107729351A (en) * | 2017-08-29 | 2018-02-23 | 天翼爱音乐文化科技有限公司 | Multilayer inquiry correcting method and system based on music searching engine |
CN107832330A (en) * | 2017-09-27 | 2018-03-23 | 华为技术有限公司 | A kind of searching method and terminal device |
CN107832330B (en) * | 2017-09-27 | 2021-06-15 | 华为技术有限公司 | Searching method and terminal equipment |
CN109684439A (en) * | 2018-12-28 | 2019-04-26 | 语联网(武汉)信息技术有限公司 | The method and device of prefix index is carried out during participle |
CN109684439B (en) * | 2018-12-28 | 2020-10-30 | 语联网(武汉)信息技术有限公司 | Method and device for indexing prefix in word segmentation process |
CN113474767A (en) * | 2019-02-14 | 2021-10-01 | 昭和电工株式会社 | Document search device, document search system, document search program, and document search method |
CN113474767B (en) * | 2019-02-14 | 2023-09-01 | 株式会社力森诺科 | File retrieval device, file retrieval system, file retrieval program, and file retrieval method |
CN112580691A (en) * | 2020-11-25 | 2021-03-30 | 北京北大千方科技有限公司 | Term matching method, matching system and storage medium of metadata field |
CN112580691B (en) * | 2020-11-25 | 2024-05-14 | 北京北大千方科技有限公司 | Term matching method, matching system and storage medium for metadata field |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN101751430A (en) | Electronic dictionary fuzzy searching method | |
Agichtein et al. | Mining reference tables for automatic text segmentation | |
US8745077B2 (en) | Searching and matching of data | |
CN103294776B (en) | Smartphone address book fuzzy search method | |
CN103198079B (en) | The implementation method of relevant search and device | |
US9110980B2 (en) | Searching and matching of data | |
CN101916263B (en) | Fuzzy keyword query method and system based on weighted edit distance | |
CN105528411B (en) | Device and method for full-text retrieval of ship equipment interactive electronic technical manual | |
CN102834802A (en) | Enabling faster full-text searching using a structured data store | |
CN106326303A (en) | Spoken language semantic analysis system and method | |
CN102063482B (en) | High-efficiency contact searching method of handheld device | |
CN1193779A (en) | Chinese Sentence Segmentation Method and Its Application in Chinese Error Checking System | |
CN102789464A (en) | Natural language processing method, device and system based on semanteme recognition | |
CN105760359B (en) | Question processing system and method thereof | |
AU2018102145A4 (en) | Method of establishing English geographical name index and querying method and apparatus thereof | |
CN102339294A (en) | Searching method and system for preprocessing keywords | |
CN109885641B (en) | Method and system for searching Chinese full text in database | |
CN107526721B (en) | Ambiguity elimination method and device for comment vocabularies of e-commerce products | |
CN105426490A (en) | Tree structure based indexing method | |
CN101436205A (en) | Method and apparatus for enquiring unique word by explanation | |
CN112949287B (en) | Hot word mining method, system, computer equipment and storage medium | |
CN110532538A (en) | Property dispute judgement document's critical entities extraction algorithm | |
CN110457695B (en) | Online text error correction method and system | |
Priya et al. | Hybrid optimization algorithm using N gram based edit distance | |
CN108595584B (en) | Chinese character output method and system based on digital marks |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
C06 | Publication | ||
PB01 | Publication | ||
C10 | Entry into substantive examination | ||
SE01 | Entry into force of request for substantive examination | ||
C12 | Rejection of a patent application after its publication | ||
RJ01 | Rejection of invention patent application after publication |
Application publication date: 20100623 |