[go: up one dir, main page]

CN101751430A - Electronic dictionary fuzzy searching method - Google Patents

Electronic dictionary fuzzy searching method Download PDF

Info

Publication number
CN101751430A
CN101751430A CN200810239543A CN200810239543A CN101751430A CN 101751430 A CN101751430 A CN 101751430A CN 200810239543 A CN200810239543 A CN 200810239543A CN 200810239543 A CN200810239543 A CN 200810239543A CN 101751430 A CN101751430 A CN 101751430A
Authority
CN
China
Prior art keywords
keyword
entry
dictionary
retrieval
keywords
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN200810239543A
Other languages
Chinese (zh)
Inventor
王琛
朱军民
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Hanwang Technology Co Ltd
Original Assignee
Hanwang Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Hanwang Technology Co Ltd filed Critical Hanwang Technology Co Ltd
Priority to CN200810239543A priority Critical patent/CN101751430A/en
Publication of CN101751430A publication Critical patent/CN101751430A/en
Pending legal-status Critical Current

Links

Images

Landscapes

  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The invention relates to an electronic dictionary fuzzy searching method, belonging to the field of fuzzy identification. The electronic dictionary comprises a lexical entry dictionary stored with a plurality of lexical entries, a keyword dictionary stored with a plurality of keywords and a keyword retrieval table; each lexical entry is composed of one or a plurality of keywords, the retrieval table records the corresponding relation between each keyword and the lexical entry containing the keyword. The method comprises the following steps: dividing the words input by a user into one or a plurality of keywords; retrieving one lexical entry or a plurality of lexical entries corresponding to each keyword, and respectively calculating edition distances between the input words and the lexical entries; sequencing the edition distances and selecting at least one lexical entry having the smallest edition distance as the retrieval result. Due to supporting fuzzy retrieval, when the lexical entries input by the user have a plurality of incorrect words, the retrieval result that can not be retrieved by traditional retrieving manner can be obtained quickly by using the inventive method.

Description

Electronic dictionary fuzzy searching method
Technical field
The invention belongs to area of pattern recognition, particularly relate to a kind of electronic dictionary fuzzy searching method.
Background technology
Present electronic dictionary mainly is to support accurately coupling or prefix matching, and accurately coupling is complete when errorless at the input entry, and electronic dictionary just can provide correct result.With Sino-British electronic dictionary is example, import Chinese entry " People's Bank of China ", output result " People ' s Bank of China ", the entry of importing " Chinese people bank " or " Shen state the People's Bank " wrong like this individual character does not then have correct result's output.Prefix matching several words that are electronic dictionaries before the input entry only has or only have under the correct situation of preceding several words, electronic dictionary can provide series of results according to correct preceding several words.With Sino-British electronic dictionary is example, imports Chinese entry " Chinese people bank ", and output is the result of the entry of prefix with " Chinese ".If importing " Shen state the People's Bank " then exporting with " Shen " is the result of the entry of prefix.
Fuzzy search also do not supported in present electronic dictionary.Fuzzy search: during to be electronic dictionary have some wrong individual character, still can export the result of a series of correspondences in the input entry.With Sino-British electronic dictionary is example, and input " People's Bank of China ", " Chinese people bank " or " Shen state the People's Bank " all can retrieve correct result " People ' s Bank of China ".
At a OCR (OCR, Optical Character Recognition, optical character identification) (promptly at first obtains image in the dictionary for translation, carry out OCR identification then, use electronic dictionary to translate and export translation result at last), the character of the wrong knowledge of recognition result possibility of output is finished in OCR identification, if use traditional retrieval mode in dictionary, to can not find result for retrieval probably, therefore, if electronic dictionary can have the function of fuzzy search, will certainly improve retrieval precision.
Summary of the invention
Do not support the function of fuzzy search at present electronic dictionary, purpose of the present invention designs a kind of electronic dictionary fuzzy searching method exactly, to improve retrieval precision.
In order to realize the object of the invention, the present invention proposes a kind of electronic dictionary fuzzy searching method, described electronic dictionary comprises the entry dictionary that stores a plurality of entries, the keyword dictionary that stores a plurality of keywords and keyword index table; Wherein each entry is made up of one or more keywords, and described concordance list has write down that all have comprised the corresponding relation of the entry of this keyword in each keyword of described keyword dictionary and the described entry dictionary, said method comprising the steps of:
(a) participle: the word to user's input uses the keyword dictionary to carry out participle, and the word of importing is divided into one or more keywords;
(b) calculate editing distance: the keyword that obtains according to the participle step retrieves wherein one or more entries of each keyword correspondence from described keyword index table, calculate the word of described input and the editing distance between these entries respectively;
(c) choose result for retrieval: to editing distance sort and the entry of choosing at least one editing distance minimum as result for retrieval.
Preferably, wherein step (a) adopts reverse maximum syncopation to carry out participle, and the word of importing is divided into several keywords.
Preferably, wherein the editing distance described in the step (b) can refer to and allow an entry S1 become the character sum that another entry S2 need operate, and described operation comprises increase, deletion or substitute character.
Preferably, wherein step (c) back also comprises the step that shows result for retrieval.
Preferably, wherein said entry dictionary is for supporting the electronic dictionary of accurate coupling or prefix.
Preferably, wherein step (a) is preceding also comprises: at first the word to user's input accurately mates retrieval, is somebody's turn to do accurately coupling result for retrieval if find accurate coupling retrieval then directly show.
The present invention has positive effect: owing to can support fuzzy search, make the user when input has some wrong individual characters in the entry, the result that script can not retrieve by the accurate retrieval mode of tradition, obtain result for retrieval soon by method and apparatus of the present invention, both improve speed, also improved retrieval precision.The present invention can be used for the electronic dictionary of handwriting input; And in the OCR dictionary for translation.
Description of drawings
Fig. 1 is dictionary configuration figure of the present invention;
Fig. 2 is the schematic flow sheet of fuzzy search in the retrieval flow of the present invention.
Embodiment
Describe the electronic dictionary of support fuzzy search of the present invention in detail below in conjunction with accompanying drawing.
The data structure of the electronic dictionary of support fuzzy search of the present invention as shown in Figure 1, comprises an entry dictionary 10 and a keyword dictionary 20.Entry dictionary 10 refers to traditional electronic dictionary, is that example comprises a series of entries as " People's Bank of China " with Sino-British electronic dictionary.Keyword dictionary 20 has comprised the keyword that each entry can be divided in the entry dictionary, has comprised " China ", " people ", " bank " three keywords as " People's Bank of China ".And the electronic dictionary of support fuzzy search of the present invention also comprises a keyword index table (not shown), this concordance list has write down that all have comprised the corresponding relation of the entry of this keyword in each keyword of keyword dictionary 20 and the entry dictionary 10, generally speaking, a keyword is to there being several entries.As shown in Figure 1, keyword " people " has write down all and has comprised keyword " people's " the index of entry (as " Chinese people ", " people " etc.) in the entry dictionary in the keyword dictionary.Index relative please see Figure the arrow in 1.
Described entry dictionary 10 is promptly supported the realization of the electronic dictionary of accurate coupling or prefix, can use plurality of data structures (Hash table, search tree, Trie tree etc.).The present invention has adopted even numbers group Trie to set and has realized.The data structure of even numbers group Trie tree is two linear arrays, and one is the base array, and one is the check array.The base array is used for determining the transfer of state, and the check array is used to check the correctness of transfer.With the Chinese-English Dictionary is example, at first the basic Chinese characters among all GB2312 is changed into the sequence code of 1-6768, with the basic value as state exchange; Then the sequence code of all Chinese characters is put into the base array as original state; Next put the follow-up Chinese character sequence code of different entries into array, generate new state, and the base value of original state in the array is adjusted, can put into array to guarantee all follow-up Chinese characters; By that analogy, up to depositing all entry states in array; The final state of representing even numbers group Trie tree simultaneously with negative value.
The realization of described keyword dictionary 20 also can be used plurality of data structures (Hash table, search tree, Trie tree etc.).The same with entry dictionary 10, the present invention has adopted even numbers group Trie to set and has realized, and has added an index and write down the index of all entries that comprise this keyword in entry dictionary 10 in each keyword structure.
The construction method of dictionary configuration of the present invention as shown in Figure 1 is as follows:
1) each entry that will comprise in the entry storehouse of all entries is divided into keyword, and keeps keyword and its semantic identical in former entry.Form crucial dictionary." China ", " people ", " bank " three key words have been divided into as " People's Bank of China ".
2) so crucial dictionary is created as keyword dictionary 20.
3) use the entry dictionary of setting up according to the entry storehouse 10, when setting up these two dictionaries, corresponding to each keyword a keyword index table is arranged, the one or more entries of this index point will write down " People's Bank of China " index in entry dictionary 10 respectively as " People's Bank of China " in the concordance list of " China ", " people ", " bank " three keywords.
The search method of the electronic dictionary of support fuzzy search of the present invention, at first the word to user's input carries out traditional accurate coupling retrieval, if find accurate coupling retrieval then the direct accurately coupling result for retrieval that shows; If can not find, then carry out fuzzy search of the present invention.
Fig. 2 is the schematic flow sheet of fuzzy search in the retrieval flow in the one embodiment of the invention.The fuzzy matching retrieval may further comprise the steps:
Step 210, at first the word to input uses keyword dictionary 20 to adopt reverse maximum syncopation to carry out participle, the word of input is divided into several keywords, as " People's Bank of China " being divided into " China ", " people ", " bank " three keywords; If word segmentation result is arranged then enter next step, otherwise show and do not retrieve; The word segmentation result of this step has obtained one or more keywords.
Step 220, the keyword that obtains according to word segmentation result retrieves wherein each keyword corresponding entries from concordance list, calculate the word of input and the editing distance between these entries respectively, during as retrieval " the Shen state people ", the keyword that entry is divided into is " Shen ", " state ", " people ", wherein choose the longest keyword " people " corresponding entries in concordance list " People's Republic of China (PRC) " is arranged, " Chinese people ", " the People's Bank ", " PLA " etc., with the input entry " the Shen state people " editing distance be respectively " People's Republic of China (PRC) "-5 (twice replacement operation, " Shen state "=>China; Insert operation three times, insert " republic "), " Chinese people "-1 (replacement operation, " Shen "=>" in "), " the People's Bank "-4 (twice deletion action, deletion " Shen state "; " bank " inserted in twice insertion operation), " PLA "-5 (twice deletion action, deletion " Shen state "; Insert operation three times and insert " PLA ") etc.
In the present invention, weigh the similarity of two entries, refer to and allow the character sum of the operation that an entry S1 becomes another entry S2 to be needed (increase, deletion is replaced) with " editing distance ".In Chinese one and Chinese character calculate a character, a character calculated in a letter in English or other phonetic languages, for example S1=" the Shen state people " then " Shen " among the S1 can be replaced to S2=" Chinese people " " in " to obtain " Chinese people " consistent with S2, owing to carried out replacement operation one time, here editing distance is 1.
Step 230 sorts from small to large to editing distance, and tabulation is sorted from small to large according to editing distance as the entry in the previous step, obtains " Chinese people "-1, " the People's Bank "-4, " People's Republic of China (PRC) "-5, " PLA "-5;
Step 240, return the result of the less several entries of the entry result of editing distance minimum or editing distance, during as retrieval " the Shen state people " in previous step, because the editing distance minimum of " Chinese people " is 1, so return entry " Chinese people " and corresponding explanation the thereof.
Can be transported to devices such as display by the result who obtains after the inventive method retrieval, also can deliver to other modules carries out further data processing, because of not being emphasis of the present invention, no longer describes in detail.
The present invention also provides the fuzzy search device based on the above-mentioned electronic dictionary fuzzy retrieval method, and electronic dictionary comprises the entry dictionary that stores a plurality of entries, the keyword dictionary that stores a plurality of keywords and keyword index table; Wherein each entry is made up of one or more keywords, and concordance list has write down that all have comprised the corresponding relation of the entry of this keyword in each keyword of keyword dictionary and the entry dictionary, and the fuzzy search device comprises with lower module:
(a) participle: the word to user's input uses the keyword dictionary to carry out participle, and the word of importing is divided into one or more keywords;
(b) calculate editing distance: the keyword that obtains according to word-dividing mode retrieves wherein one or more entries of each keyword correspondence from the keyword index table, calculates the word of input and the editing distance between these entries respectively;
(c) choose result for retrieval: to editing distance sort and the entry of choosing at least one editing distance minimum as result for retrieval.
Wherein module (a) adopts reverse maximum syncopation to carry out participle, and the word of importing is divided into several keywords.
Wherein the editing distance in the module (b) refers to and allows entry S1 become the character sum that entry S2 need operate, and operation comprises increase, deletion or substitute character.
Wherein module (c) back also comprises the module that shows result for retrieval.
Wherein the entry dictionary is for supporting the electronic dictionary of accurate coupling or prefix.
Wherein module (a) is preceding also comprises with lower module: at first the word to user's input accurately mates retrieval, is somebody's turn to do accurately coupling result for retrieval if find accurate coupling retrieval then directly show.
The present invention also provides the electronic dictionary corresponding to above-mentioned search method and indexing unit, comprises the entry dictionary that stores a plurality of entries, also comprises:
Store the keyword dictionary of a plurality of keywords;
The keyword index table;
And as the fuzzy search device in the previous technique scheme;
Wherein each entry is made up of one or more keywords, and concordance list has write down that all have comprised this keyword in each keyword of keyword dictionary and the entry dictionary.
The present invention can be used for a lot of occasions, can be used for the electronic dictionary of handwriting input; Also can be used in a as previously mentioned OCR dictionary for translation, the character of the wrong knowledge of recognition result possibility of output is finished in OCR identification, if use traditional retrieval mode in dictionary, to can not find result for retrieval probably, but be to use dictionary of the present invention to carry out fuzzy search and just can retrieve the result that the user needs, improved user's satisfaction.
It should be noted that the foregoing description is example and unrestricted the present invention, those skilled in the art can design a lot of alternate embodiments and not break away from the scope of appended claims.

Claims (6)

1. electronic dictionary fuzzy searching method, it is characterized in that: described electronic dictionary comprises the entry dictionary that stores a plurality of entries, the keyword dictionary that stores a plurality of keywords and keyword index table; Wherein each entry is made up of one or more keywords, and described concordance list has write down that all have comprised the corresponding relation of the entry of this keyword in each keyword of described keyword dictionary and the described entry dictionary, said method comprising the steps of:
(a) participle: the word to user's input uses the keyword dictionary to carry out participle, and the word of importing is divided into one or more keywords;
(b) calculate editing distance: the keyword that obtains according to the participle step retrieves wherein one or more entries of each keyword correspondence from described keyword index table, calculate the word of described input and the editing distance between these entries respectively;
(c) choose result for retrieval: to editing distance sort and the entry of choosing at least one editing distance minimum as result for retrieval.
2. the method for claim 1 is characterized in that: step (a) adopts reverse maximum syncopation to carry out participle, and the word of input is divided into several keywords.
3. method as claimed in claim 1 or 2 is characterized in that: the editing distance described in the step (b) refers to and allows an entry S1 become the character sum that another entry S2 need operate, and described operation comprises increase, deletion or substitute character.
4. the method for claim 1, it is characterized in that: step (c) back also comprises the step that shows result for retrieval.
5. as the described method of any one claim of front, it is characterized in that: described entry dictionary is for supporting the electronic dictionary of accurate coupling or prefix.
6. as the described method of any one claim of front, it is characterized in that: step (a) is preceding also to be comprised: at first the word to user's input accurately mates retrieval, is somebody's turn to do accurately coupling result for retrieval if find accurate coupling retrieval then directly show.
CN200810239543A 2008-12-12 2008-12-12 Electronic dictionary fuzzy searching method Pending CN101751430A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN200810239543A CN101751430A (en) 2008-12-12 2008-12-12 Electronic dictionary fuzzy searching method

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN200810239543A CN101751430A (en) 2008-12-12 2008-12-12 Electronic dictionary fuzzy searching method

Publications (1)

Publication Number Publication Date
CN101751430A true CN101751430A (en) 2010-06-23

Family

ID=42478421

Family Applications (1)

Application Number Title Priority Date Filing Date
CN200810239543A Pending CN101751430A (en) 2008-12-12 2008-12-12 Electronic dictionary fuzzy searching method

Country Status (1)

Country Link
CN (1) CN101751430A (en)

Cited By (14)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN101986308A (en) * 2010-11-16 2011-03-16 传神联合(北京)信息技术有限公司 Quick term marking method
CN102682068A (en) * 2012-03-01 2012-09-19 沈文策 Method and system for searching user name
CN102867049A (en) * 2012-09-10 2013-01-09 山东康威通信技术股份有限公司 Chinese PINYIN quick word segmentation method based on word search tree
CN103186615A (en) * 2011-12-30 2013-07-03 北大方正集团有限公司 Search prompting method and system
CN103631965A (en) * 2013-12-10 2014-03-12 湖南农业大学 Tokenizer-based agricultural knowledge entry handheld terminal and entry method of agricultural knowledge
CN105912649A (en) * 2016-04-08 2016-08-31 南京邮电大学 Database fuzzy retrieval method and system
CN103186615B (en) * 2011-12-30 2016-11-30 北大方正集团有限公司 A kind of Search Hints method and system
CN106202127A (en) * 2015-05-08 2016-12-07 深圳市腾讯计算机系统有限公司 A kind of vertical search engine processing method and processing device to retrieval request
WO2017092122A1 (en) * 2015-12-03 2017-06-08 小米科技有限责任公司 Similarity determination method, device, and terminal
CN107729351A (en) * 2017-08-29 2018-02-23 天翼爱音乐文化科技有限公司 Multilayer inquiry correcting method and system based on music searching engine
CN107832330A (en) * 2017-09-27 2018-03-23 华为技术有限公司 A kind of searching method and terminal device
CN109684439A (en) * 2018-12-28 2019-04-26 语联网(武汉)信息技术有限公司 The method and device of prefix index is carried out during participle
CN112580691A (en) * 2020-11-25 2021-03-30 北京北大千方科技有限公司 Term matching method, matching system and storage medium of metadata field
CN113474767A (en) * 2019-02-14 2021-10-01 昭和电工株式会社 Document search device, document search system, document search program, and document search method

Cited By (21)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN101986308B (en) * 2010-11-16 2013-07-31 传神联合(北京)信息技术有限公司 Quick term marking method
CN101986308A (en) * 2010-11-16 2011-03-16 传神联合(北京)信息技术有限公司 Quick term marking method
CN103186615A (en) * 2011-12-30 2013-07-03 北大方正集团有限公司 Search prompting method and system
CN103186615B (en) * 2011-12-30 2016-11-30 北大方正集团有限公司 A kind of Search Hints method and system
CN102682068A (en) * 2012-03-01 2012-09-19 沈文策 Method and system for searching user name
CN102867049A (en) * 2012-09-10 2013-01-09 山东康威通信技术股份有限公司 Chinese PINYIN quick word segmentation method based on word search tree
CN103631965A (en) * 2013-12-10 2014-03-12 湖南农业大学 Tokenizer-based agricultural knowledge entry handheld terminal and entry method of agricultural knowledge
CN106202127A (en) * 2015-05-08 2016-12-07 深圳市腾讯计算机系统有限公司 A kind of vertical search engine processing method and processing device to retrieval request
CN106202127B (en) * 2015-05-08 2020-02-11 深圳市腾讯计算机系统有限公司 Method and device for processing retrieval request by vertical search engine
US10089301B2 (en) 2015-12-03 2018-10-02 Xiaomi Inc. Method and apparatus for determining semantic similarity of character strings
WO2017092122A1 (en) * 2015-12-03 2017-06-08 小米科技有限责任公司 Similarity determination method, device, and terminal
CN105912649A (en) * 2016-04-08 2016-08-31 南京邮电大学 Database fuzzy retrieval method and system
CN107729351A (en) * 2017-08-29 2018-02-23 天翼爱音乐文化科技有限公司 Multilayer inquiry correcting method and system based on music searching engine
CN107832330A (en) * 2017-09-27 2018-03-23 华为技术有限公司 A kind of searching method and terminal device
CN107832330B (en) * 2017-09-27 2021-06-15 华为技术有限公司 Searching method and terminal equipment
CN109684439A (en) * 2018-12-28 2019-04-26 语联网(武汉)信息技术有限公司 The method and device of prefix index is carried out during participle
CN109684439B (en) * 2018-12-28 2020-10-30 语联网(武汉)信息技术有限公司 Method and device for indexing prefix in word segmentation process
CN113474767A (en) * 2019-02-14 2021-10-01 昭和电工株式会社 Document search device, document search system, document search program, and document search method
CN113474767B (en) * 2019-02-14 2023-09-01 株式会社力森诺科 File retrieval device, file retrieval system, file retrieval program, and file retrieval method
CN112580691A (en) * 2020-11-25 2021-03-30 北京北大千方科技有限公司 Term matching method, matching system and storage medium of metadata field
CN112580691B (en) * 2020-11-25 2024-05-14 北京北大千方科技有限公司 Term matching method, matching system and storage medium for metadata field

Similar Documents

Publication Publication Date Title
CN101751430A (en) Electronic dictionary fuzzy searching method
Agichtein et al. Mining reference tables for automatic text segmentation
US8745077B2 (en) Searching and matching of data
CN103294776B (en) Smartphone address book fuzzy search method
CN103198079B (en) The implementation method of relevant search and device
US9110980B2 (en) Searching and matching of data
CN101916263B (en) Fuzzy keyword query method and system based on weighted edit distance
CN105528411B (en) Device and method for full-text retrieval of ship equipment interactive electronic technical manual
CN102834802A (en) Enabling faster full-text searching using a structured data store
CN106326303A (en) Spoken language semantic analysis system and method
CN102063482B (en) High-efficiency contact searching method of handheld device
CN1193779A (en) Chinese Sentence Segmentation Method and Its Application in Chinese Error Checking System
CN102789464A (en) Natural language processing method, device and system based on semanteme recognition
CN105760359B (en) Question processing system and method thereof
AU2018102145A4 (en) Method of establishing English geographical name index and querying method and apparatus thereof
CN102339294A (en) Searching method and system for preprocessing keywords
CN109885641B (en) Method and system for searching Chinese full text in database
CN107526721B (en) Ambiguity elimination method and device for comment vocabularies of e-commerce products
CN105426490A (en) Tree structure based indexing method
CN101436205A (en) Method and apparatus for enquiring unique word by explanation
CN112949287B (en) Hot word mining method, system, computer equipment and storage medium
CN110532538A (en) Property dispute judgement document's critical entities extraction algorithm
CN110457695B (en) Online text error correction method and system
Priya et al. Hybrid optimization algorithm using N gram based edit distance
CN108595584B (en) Chinese character output method and system based on digital marks

Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
C10 Entry into substantive examination
SE01 Entry into force of request for substantive examination
C12 Rejection of a patent application after its publication
RJ01 Rejection of invention patent application after publication

Application publication date: 20100623