CN102867049A - Fast Chinese Pinyin word segmentation method based on a word search tree

Info

Publication number: CN102867049A
Application number: CN2012103320727A
Authority: CN
Inventors: 于少飞; 袁美英; 杨震威
Original assignee: Shandong Conwell Communication Technology Co Ltd
Current assignee: Conway Communication Technology Co Ltd
Priority date: 2012-09-10
Filing date: 2012-09-10
Publication date: 2013-01-09
Anticipated expiration: 2032-09-10
Also published as: CN102867049B

Abstract

The invention discloses a fast Chinese Pinyin word segmentation method based on a word search tree. The method is implemented on a computer or an embedded mobile device and comprises the following steps: (1) building a search tree of single-character Pinyin syllables from the table of all known Chinese character Pinyin syllables; (2) using the search tree, in which the tree is combined with hash tables, to segment a given Pinyin string; (3) outputting the segmentation result; and (4) destroying the search tree and releasing its resources. Sharing the common prefixes of the syllable strings saves storage space and greatly reduces unnecessary string comparisons; the indexed, redundant hash table improves search efficiency and keeps the time complexity of the algorithm to a minimum.

Description

A fast Chinese Pinyin word segmentation method based on a word search tree

Technical field

The invention belongs to the technical field of Chinese information processing on computers and handheld embedded mobile devices, and in particular relates to a fast Chinese Pinyin word segmentation method based on a word search tree.

Background technology

Automatically identifying each single-character Pinyin syllable in a continuous Pinyin string by means of a software algorithm is a technique required by both Pinyin input methods and search engines (which suggest Chinese sentences from keywords typed in Pinyin). One approach builds a hash table keyed by all existing single-character Pinyin syllables. By repeatedly searching and matching against this table, a continuous Pinyin string can be segmented, but the segmentation is inefficient.

To improve efficiency, the prior art modifies this hash table as follows: the table is keyed by the initial letter of each syllable, and each element of the table is a singly linked list storing all single-character syllables that begin with that letter. With this modification, each lookup first uses the initial letter to obtain the head pointer of a linked list from the hash table and then traverses the list to find a match. The modified hash table improves segmentation efficiency, but ambiguous strings such as "xian" and "piao" still need special handling. After "xi" has been retrieved, one option is to remove "xi" from the string immediately and continue by retrieving "an", but the syllable "xian" is then lost. A second option is to keep "xi" in the string and continue searching the syllables that begin with the letter x until the end of the linked list is reached, then remove "xi" and finally retrieve "an". This finds all possible syllables, "xi", "an" and "xian", but the search is inefficient.
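The trade-off between the two options can be seen in a small sketch. It only illustrates the prior-art scheme described above and is not code from any cited document: Python lists stand in for the singly linked lists, the three-syllable table is a toy subset, and the names `build_buckets`, `remove_first_match` and `scan_whole_bucket` are hypothetical.

```python
def build_buckets(syllables):
    """Hash table keyed by initial letter; each bucket lists the syllables with that initial."""
    buckets = {}
    for s in syllables:
        buckets.setdefault(s[0], []).append(s)
    return buckets


def remove_first_match(buckets, text):
    """Option 1: cut off the first syllable that matches, then continue with the rest."""
    found, pos = [], 0
    while pos < len(text):
        bucket = buckets.get(text[pos], [])
        match = next((s for s in bucket if text.startswith(s, pos)), None)
        if match is None:
            pos += 1  # skip a letter that starts no known syllable
        else:
            found.append(match)
            pos += len(match)
    return found


def scan_whole_bucket(buckets, text, pos=0):
    """Option 2: walk the entire bucket so that no syllable starting at pos is missed."""
    return [s for s in buckets.get(text[pos], []) if text.startswith(s, pos)]


buckets = build_buckets(["xi", "xian", "an"])  # toy table, assumed for illustration
print(remove_first_match(buckets, "xian"))     # "xian" itself never appears in the result
print(scan_whole_bucket(buckets, "xian"))      # every syllable in the "x" bucket is compared
```

Option 1 is cheap but drops the longer reading; option 2 keeps it at the cost of comparing against every entry in the bucket.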

Chinese patent No. 200710118921, "Memory processing method and device for a telephone number mapping domain name server", mentions search trees and hash tables, but uses only their basic storage and lookup functions, without any improvement or extension. Its purpose is also fundamentally different: that patent merely stores data in the nodes of a search tree and a hash table and retrieves the stored data through them, whereas the present invention performs fast Pinyin word segmentation mainly through a variant structure that combines a search tree with hash tables. That patent looks up stored data and the present invention segments Pinyin strings, so the two documents differ substantially in both manner of use and purpose.

Chinese patent No. 200810129141.8, "Method and apparatus for adjusting the order of candidate words", also mentions search trees and hash tables, but its trie uses only one of a search tree or a hash tree. It does not combine a search tree with hash tables as the present invention does, in which the two are tightly coupled and both indispensable. The purpose also differs: that patent uses a search tree or hash table only for storage and for judging whether a stored Pinyin string is a standard spelling, not for word segmentation, whereas the present invention uses a variant that combines a search tree with hash tables to segment Pinyin quickly and produce the resulting Pinyin strings.

Chinese patent No. 200910107961.1, "Implementation method for fast word segmentation", is also a word segmentation method, but its search tree is implemented with a first-level index table and a hash multiway tree. Its shortcoming is that characters are easily lost when ambiguous strings are processed. Avoiding lost characters requires a method that reduces search efficiency: the lookup result is then correct, but the direct consequence is lower search efficiency.

Although the singly linked lists in the above schemes are easy to maintain in memory and simple to insert into and delete from, they suffer from poor query performance, low search efficiency, or lost characters.

Summary of the invention

The purpose of the present invention is to solve the above problems by providing a fast Chinese Pinyin word segmentation method based on a word search tree. The search tree is combined with hash tables, and this hash-tree variant is used to segment Pinyin quickly. This segmentation approach avoids poor query performance, low efficiency and lost characters while improving search efficiency, and so achieves fast word segmentation.

To achieve this purpose, the present invention adopts the following technical scheme:

A fast Chinese Pinyin word segmentation method based on a word search tree, implemented on a computer or an embedded mobile device, with the following main steps:

Step 1: build a search tree of single-character Pinyin syllables from the table of all known Chinese character Pinyin syllables;

Step 2: using the word search tree that has been built, in which the search tree is combined with hash tables, segment a given Pinyin string;

Step 3: output the segmentation result;

Step 4: destroy the search tree and release its resources.

In step 1, the search tree of single-character Pinyin syllables is built from the table of all known Chinese character Pinyin syllables according to the following rules:

(1) the root node contains no character, and every node other than the root contains exactly one character;

(2) the characters contained in the child nodes of any one node are all different;

(3) every node other than a leaf node has a hash table of length 26, indexed by the 26 English letters in ascending order; each element stores one child node, and the actual number of child nodes is at most 26;

(4) every node contains an identification field whose value, 0 or 1, indicates whether the characters on the path from the root node to this node, concatenated in order, form a complete single-character Pinyin syllable.
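Rules (1) to (4) can be sketched as a small data structure. This is a minimal illustration under stated assumptions rather than the patented implementation: the names `Node`, `build_tree` and `flag` are hypothetical, the syllable list is a toy subset of the Pinyin table, and for simplicity every node (including leaves) gets a 26-slot table.

```python
class Node:
    """One search tree node: a character, a 26-slot child table and an identification field."""

    def __init__(self, char=""):
        self.char = char               # empty string for the root node (rule 1)
        self.children = [None] * 26    # slot 0 is "a", slot 25 is "z" (rule 3)
        self.flag = 0                  # 1 if root-to-here spells a complete syllable (rule 4)


def build_tree(syllables):
    """Insert each syllable letter by letter, reusing nodes for shared prefixes."""
    root = Node()
    for syllable in syllables:
        node = root
        for ch in syllable:
            slot = ord(ch) - ord("a")
            if node.children[slot] is None:   # one child per letter at most (rule 2)
                node.children[slot] = Node(ch)
            node = node.children[slot]
        node.flag = 1
    return root


root = build_tree(["a", "ai", "an", "ang", "pi"])   # toy subset, assumed for illustration
a_node = root.children[0]
p_node = root.children[ord("p") - ord("a")]
print(a_node.char, a_node.flag)   # "a" alone is a complete syllable
print(p_node.char, p_node.flag)   # "p" alone is not
```

Indexing the child table by `ord(ch) - ord("a")` is what makes each step of a lookup a direct array access rather than a list traversal.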

In step 2, the given Pinyin string is segmented with the word search tree that has been built, in which the search tree is combined with hash tables, as follows:

A) start a search from the root node;

B) take the first letter of the keyword to be looked up, use it to select the corresponding subtree from the hash table, and continue the retrieval in that subtree;

C) in that subtree, take the second letter of the keyword and select the corresponding subtree for further retrieval;

D) iterate: take each subsequent letter of the keyword in turn, up to the nth, and continue the lookup in the same way.
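Steps A) to D) amount to a letter-by-letter descent through the child tables. The following sketch assumes the node layout described in step 1; `Node`, `build_tree` and `find_node` are hypothetical names, and the syllable list is a toy subset.

```python
class Node:
    def __init__(self):
        self.children = [None] * 26   # child table indexed by letter
        self.flag = 0                 # identification field


def build_tree(syllables):
    root = Node()
    for syllable in syllables:
        node = root
        for ch in syllable:
            slot = ord(ch) - ord("a")
            if node.children[slot] is None:
                node.children[slot] = Node()
            node = node.children[slot]
        node.flag = 1
    return root


def find_node(root, keyword):
    """Follow the keyword one letter at a time; return the node reached, or None."""
    node = root                                   # A) start at the root
    for ch in keyword:                            # B), C), D) one letter per level
        slot = ord(ch) - ord("a")
        if not 0 <= slot < 26:
            return None                           # not a lowercase letter
        node = node.children[slot]
        if node is None:
            return None                           # no subtree for this letter
    return node


root = build_tree(["pi", "pian", "piao", "po"])   # toy subset, assumed for illustration
print(find_node(root, "pian").flag)   # path exists and is a complete syllable
print(find_node(root, "pia").flag)    # path exists only as a prefix
print(find_node(root, "pu"))          # no such path in this toy table
```

Each keyword letter costs one table access, so the work per lookup grows with the length of the keyword and not with the number of syllables in the table.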

In step 3, the specific steps are as follows:

1) at a given node, if all letters of the keyword have been taken or the identification field of the node is 1, output in order all characters on the path from the root node to the current node, together with the identification field value of the current node;

2) if all letters of the keyword have been taken, the lookup ends; otherwise take the next letter of the keyword, return to the root node of the search tree and continue the iterative lookup;

3) for a keyword with more than one reading, for example "piao", which can be read either as "piao" ("ticket") or as "pi" followed by "ao" ("fur-lined jacket"), the segmentation result outputs all possible values.
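One way to picture "all possible values" is to enumerate every way of splitting the string into complete syllables. The sketch below is an illustrative assumption, not the procedure of the patent's flowchart: `segment_all` is a hypothetical helper that restarts from the root after each complete syllable, and the six-syllable table is a toy subset. The number of results can grow quickly for long inputs.

```python
class Node:
    def __init__(self):
        self.children = [None] * 26   # child table indexed by letter
        self.flag = 0                 # identification field


def build_tree(syllables):
    root = Node()
    for syllable in syllables:
        node = root
        for ch in syllable:
            slot = ord(ch) - ord("a")
            if node.children[slot] is None:
                node.children[slot] = Node()
            node = node.children[slot]
        node.flag = 1
    return root


def segment_all(root, text, start=0):
    """Return every split of text[start:] into complete syllables from the tree."""
    if start == len(text):
        return [[]]                               # one way to segment nothing
    results = []
    node = root
    for end in range(start, len(text)):
        slot = ord(text[end]) - ord("a")
        if not 0 <= slot < 26:
            break
        node = node.children[slot]
        if node is None:
            break                                 # no syllable continues this way
        if node.flag == 1:                        # text[start:end + 1] is a syllable
            for rest in segment_all(root, text, end + 1):
                results.append([text[start:end + 1]] + rest)
    return results


root = build_tree(["pi", "ao", "piao", "xi", "an", "xian"])   # toy subset
print(segment_all(root, "piao"))
print(segment_all(root, "xian"))
```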

In step 4, after segmentation is finished, the search tree is destroyed, its resources are released and the memory it occupied is reclaimed.
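The release in step 4 is naturally a post-order walk: children are released before their parent. The sketch below only mirrors that order. In a garbage-collected language such as Python the explicit teardown is not required, and `Node`, `build_tree` and `destroy` are hypothetical names used for illustration.

```python
class Node:
    def __init__(self):
        self.children = [None] * 26   # child table indexed by letter
        self.flag = 0                 # identification field


def build_tree(syllables):
    root = Node()
    for syllable in syllables:
        node = root
        for ch in syllable:
            slot = ord(ch) - ord("a")
            if node.children[slot] is None:
                node.children[slot] = Node()
            node = node.children[slot]
        node.flag = 1
    return root


def destroy(node):
    """Detach every subtree bottom-up; return the number of nodes released."""
    released = 0
    for slot, child in enumerate(node.children):
        if child is not None:
            released += destroy(child)    # release the subtree first
            node.children[slot] = None    # then drop the reference to it
    return released + 1                   # count this node itself


tree = build_tree(["a", "ai", "an"])      # root plus nodes for "a", "i" and "n"
print(destroy(tree))
```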

The search tree is a data structure that supports dynamic-set operations, including insertion, deletion and lookup.

The hash table is a data structure that is accessed directly by key value.

Beneficial effects of the present invention:

The invention is a new advance in the field of Chinese information processing on computers and handheld embedded mobile devices. It solves the problems of lost characters and low search efficiency and offers a new approach for the field.

The invention uses the common prefixes of the syllable strings to save storage space and to minimize unnecessary string comparisons. This improves the efficiency of Pinyin lookup, makes more effective use of the memory of computers and handheld embedded mobile devices, and improves the running efficiency of those devices.
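The effect of prefix sharing can be checked by counting. In this sketch nested dictionaries stand in for tree nodes, `count_nodes` is a hypothetical helper, and the syllables are the ones named in the description of Fig. 2. The count covers characters only and ignores the per-node child tables.

```python
def count_nodes(syllables):
    """Number of non-root nodes needed when shared prefixes are stored once."""
    root = {}
    nodes = 0
    for syllable in syllables:
        node = root
        for ch in syllable:
            if ch not in node:        # a new node is needed only for an unseen prefix
                node[ch] = {}
                nodes += 1
            node = node[ch]
    return nodes


syllables = ["a", "ai", "an", "ao", "ang", "pi", "po", "pian"]
letters = sum(len(s) for s in syllables)   # characters if every syllable is stored whole
print(letters, count_nodes(syllables))
```

For these eight syllables, 18 stored characters shrink to 10 tree nodes, because "a", "an", "p" and "pi" are each stored once and shared.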

The invention uses an indexed, redundant hash table to improve search efficiency and to minimize the time complexity of the algorithm. This reduces the number of nodes searched, improves the real-time performance of the algorithm, ensures the accuracy of the lookup and shortens query time.

The segmentation approach of the invention avoids poor query performance, low efficiency and lost characters while improving search efficiency, and so achieves fast word segmentation. Because search and segmentation are more efficient, users who type Pinyin can also work more efficiently, which saves working time and reduces labor intensity.

Description of drawings

Fig. 1: flowchart of word segmentation with the search tree;

Fig. 2: structure of the search tree;

Fig. 3: detailed flowchart of the segmentation process.

Embodiment

The invention is further described below with reference to the drawings and an embodiment.

As shown in Fig. 1, a hash tree that combines a search tree with hash tables is first built from the existing table of Chinese character Pinyin syllables. A given continuous Pinyin string is then segmented and the result is output. Finally the search tree is destroyed, its resources are released and its memory is reclaimed.

Building the hash tree: a word search tree is built from the table of all known Chinese character Pinyin syllables. The root node of the search tree contains no character, and every node other than the root contains exactly one character. The characters contained in the child nodes of any one node are all different. Every node other than a leaf node has a hash table of length 26, indexed by the 26 English letters in ascending order; each element stores one child node, and the actual number of child nodes is at most 26. Every node other than the root contains an identification field whose value, 0 or 1, indicates whether the characters on the path from the root node to this node, concatenated in order, form a complete single-character Pinyin syllable.

Segmentation: the given Pinyin string is segmented with the word search tree that has been built. A search starts from the root node. The first letter of the keyword is taken and used to select the corresponding subtree from the hash table, and retrieval continues in that subtree. In that subtree, the second letter of the keyword is taken and the corresponding subtree is selected for further retrieval. In the same way, the 3rd, 4th, ..., nth letters of the keyword are taken and the corresponding subtrees are retrieved.

Outputting the segmentation result: at a given node, if all letters of the keyword have been taken or the identification field of the node is 1, all characters on the path from the root node to the current node are output in order, together with the identification field value of the current node. If all letters of the keyword have been taken, the lookup ends; otherwise the next letter of the keyword is taken and the iterative lookup continues from the root node of the search tree. For a keyword with more than one reading, for example "piao", which can be read either as "piao" ("ticket") or as "pi" followed by "ao" ("fur-lined jacket"), the segmentation result outputs all possible values.

Destroying the hash tree: after segmentation is finished, the hash tree is destroyed and the memory it occupied is reclaimed.

As shown in Fig. 2, owing to limited space only a representative subset of the Pinyin syllables is drawn in the figure.

"a": the root node contains an identification field (value 0) and a hash table. Following the first element of the root node's hash table (the child node whose character is "a") leads to a first-level child node (character "a"), which likewise contains an identification field. Its value is 1, because the letter "a" by itself is a complete Pinyin syllable.

"ai", "an", "ao": the node whose character is "a" also has a hash table, because the syllables that begin with the letter "a" include not only "a" itself but also "ai", "an" and "ao". The "a" node therefore has three child nodes with the characters "i", "n" and "o", and the identification field of each of these three child nodes is 1.

"ang": similarly, the hash table of the node whose character is "n" contains a child node whose character is "g", and the identification field of that child node is also 1. Concatenating the characters a, n and g on the path from the root node therefore gives the syllable "ang".

"pi", "po", "pian": these are derived in the same way.

As shown in Fig. 3, part of this flowchart uses C-like pseudocode to describe the flow.

The flowchart describes the complete segmentation process, divided into one main flow and three sub-flows. It includes the handling of ambiguous strings within the Pinyin string, which is done by recursive calls.

The flow begins by reading a Pinyin string and initializing the local variables: pc1 records the start of a single-character syllable; pc2 is a moving cursor that marks the current character; pc3 records the end of a single-character syllable and exists to handle ambiguous strings; pt is a moving cursor that marks the current search tree node.

The main flow is iterative: it makes a series of decisions based on the identification field of the pt node, the character that pc2 points to and the child nodes of the pt node, and then jumps to the appropriate sub-flow.

The steps of the main flow are:

Step 1: input the Pinyin string to be segmented;

Step 2: declare the following variables: (1) character pointers pc1, pc2 and pc3; (2) search tree node pointer pt. Set pc1 = pc2 = pointer to the first character of the Pinyin string, and pc3 = null;

Step 3: set pt = root node of the search tree;

Step 4: test whether the identification field of the pt node equals 1; if so, enter sub-flow 1, otherwise go to step 5;

Step 5: test whether the character that pc2 points to is empty; if so, enter sub-flow 2, otherwise go to step 6;

Step 6: test whether a child node can be found in the hash table of the pt node, using the character that pc2 points to as the index; if so, go to step 7, otherwise enter sub-flow 3;

Step 7: execute pc2++ and pt = child node; return to step 4.

Sub-flow 1: when the identification field of the pt node is 1, a complete single-character syllable has been found. The syllable is output, ambiguous strings are then handled recursively according to the value of pc3 (the marker variable for ambiguous strings), and control finally returns to the main flow.

The detailed steps of sub-flow 1 are:

Step (1-1): output all characters from pc1 to pc2-1 and report that this is a single-character syllable;

Step (1-2): test whether pc3 is null; if so, go to step (1-4), otherwise go to step (1-3);

Step (1-3): take all characters from pc3 to pc2-1 and call this flow recursively from its beginning on them; go to step (1-4);

Step (1-4): set pc3 = pc2; return to step 5 of the main flow.

Sub-flow 2: when the character that pc2 points to is empty, the end of the Pinyin string to be segmented has been reached. If pc1 (the start of a single-character syllable) is not equal to pc2, all characters from pc1 to pc2-1 are output and reported as an illegal character string, and the whole segmentation flow then exits.

The detailed steps of sub-flow 2 are:

Step (2-1): test whether pc1 equals pc2; if so, finish, otherwise go to step (2-2);

Step (2-2): output all characters from pc1 to pc2-1, report that this is an illegal character string, then finish.

Sub-flow 3: no matching node was found among the child nodes of pt (the current search tree node). A further decision is made on the value of pc3 (the marker variable for ambiguous strings). If pc3 is null, an illegal character has been encountered (for example the character "i", which does not begin any single-character Pinyin syllable); the illegal character is output, the variables are reset, and control jumps to the start of the main flow. Otherwise the flow must return to the position of pc3 (for example, for "pipi", pc1 is at the start of the string and pc2 and pc3 are at the second "p"); the variables are likewise reset and control jumps to the start of the main flow.

The detailed steps of sub-flow 3 are:

Step (3-1): test whether pc3 is null; if so, go to step (3-2), otherwise go to step (3-6);

Step (3-2): output all characters from pc1 to pc2 and report that this is an illegal character string; go to step (3-3);

Step (3-3): test whether pc1 == pc2; if so, go to step (3-4), otherwise go to step (3-5);

Step (3-4): execute pc2++; go to step (3-5);

Step (3-5): set pc1 = pc2; go to step 3 of the main flow;

Step (3-6): set pc1 = pc3, pc2 = pc3 and pc3 = null; go to step 3 of the main flow.
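The interplay of pc1, pc2, pc3 and pt can be followed in a loop-based sketch. It is an adaptation written for illustration, not a transcription of Fig. 3: the jumps between sub-flows become branches of one `while` loop, end-of-string handling is simplified so that a syllable ending exactly at the end of the input is not reported as illegal, and `Node`, `build_tree` and `scan` are hypothetical names. The syllable table is a toy subset.

```python
class Node:
    def __init__(self):
        self.children = [None] * 26   # child table indexed by letter
        self.flag = 0                 # identification field


def build_tree(syllables):
    root = Node()
    for syllable in syllables:
        node = root
        for ch in syllable:
            slot = ord(ch) - ord("a")
            if node.children[slot] is None:
                node.children[slot] = Node()
            node = node.children[slot]
        node.flag = 1
    return root


def scan(root, text):
    """Return ("syllable", s) and ("illegal", s) events in the order they are found."""
    events = []
    pc1 = pc2 = 0    # pc1: start of the current syllable; pc2: moving cursor
    pc3 = None       # end of the last complete syllable found from pc1
    pt = root        # current search tree node
    while pc1 < len(text):
        if pt.flag == 1:                                  # as in sub-flow 1
            events.append(("syllable", text[pc1:pc2]))
            if pc3 is not None:                           # longer reading found:
                events.extend(scan(root, text[pc3:pc2]))  # re-segment the extension
            pc3 = pc2
        child = None
        if pc2 < len(text):
            slot = ord(text[pc2]) - ord("a")
            if 0 <= slot < 26:
                child = pt.children[slot]
        if child is not None:                             # as in main-flow step 7
            pc2 += 1
            pt = child
            continue
        if pc3 is not None:                               # resume after the last syllable
            pc1 = pc2 = pc3
        else:                                             # nothing matched from pc1
            if pc1 == pc2:
                pc2 += 1
            events.append(("illegal", text[pc1:pc2]))
            pc1 = pc2
        pc3 = None
        pt = root
    return events


root = build_tree(["pi", "ao", "piao", "xi", "an", "xian"])   # toy subset
print(scan(root, "xian"))
print(scan(root, "ipi"))
```

For "xian" the scan reports "xi" first, then the longer "xian", and then "an" from the recursive pass over the extension, which corresponds to the three readings named in the background section.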

Although specific embodiments of the invention have been described above with reference to the drawings, they do not limit the scope of protection of the invention. Those skilled in the art should understand that modifications or variations made on the basis of the technical scheme of the invention without creative effort still fall within its scope of protection.

Claims

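1. A fast Chinese Pinyin word segmentation method based on a word search tree, implemented on a computer or an embedded mobile device, characterized in that the main steps are as follows:

Step 1: build a search tree of single-character Pinyin syllables from the table of all known Chinese character Pinyin syllables;

Step 2: using the word search tree that has been built, in which the search tree is combined with hash tables, segment a given Pinyin string;

Step 3: output the segmentation result;

Step 4: destroy the search tree and release its resources.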

2. The fast Chinese Pinyin word segmentation method based on a word search tree as claimed in claim 1, characterized in that, in step 1, the search tree of single-character Pinyin syllables is built from the table of all known Chinese character Pinyin syllables according to the following rules:

(1) the root node contains no character, and every node other than the root contains exactly one character;

(2) the characters contained in the child nodes of any one node are all different;

(3) every node other than a leaf node has a hash table of length 26, indexed by the 26 English letters in ascending order; each element stores one child node, and the actual number of child nodes is at most 26;

(4) every node contains an identification field whose value, 0 or 1, indicates whether the characters on the path from the root node to this node, concatenated in order, form a complete single-character Pinyin syllable.

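3. The fast Chinese Pinyin word segmentation method based on a word search tree as claimed in claim 1, characterized in that, in step 2, the given Pinyin string is segmented with the word search tree that has been built, in which the search tree is combined with hash tables, as follows:

A) start a search from the root node;

B) take the first letter of the keyword to be looked up, use it to select the corresponding subtree from the hash table, and continue the retrieval in that subtree;

C) in that subtree, take the second letter of the keyword and select the corresponding subtree for further retrieval;

D) iterate: take each subsequent letter of the keyword in turn, up to the nth, and continue the lookup in the same way.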

4. The fast Chinese Pinyin word segmentation method based on a word search tree as claimed in claim 1, characterized in that, in step 3, the specific steps are as follows:

1) at a given node, if all letters of the keyword have been taken or the identification field of the node is 1, output in order all characters on the path from the root node to the current node, together with the identification field value of the current node;

2) if all letters of the keyword have been taken, the lookup ends; otherwise take the next letter of the keyword, return to the root node of the search tree and continue the iterative lookup;

3) for a keyword with more than one reading, output the segmentation results.

5. The fast Chinese Pinyin word segmentation method based on a word search tree as claimed in claim 1, characterized in that, in step 4, after segmentation is finished, the search tree is destroyed, its resources are released and the memory it occupied is reclaimed.

6. The fast Chinese Pinyin word segmentation method based on a word search tree as claimed in claim 1, characterized in that the search tree is a data structure that supports dynamic-set operations, including insertion, deletion and lookup.