Detailed Description of the Embodiments
To make the purpose, technical solutions and advantages of the present invention clearer, the embodiments of the invention are described in further detail below with reference to the accompanying drawings.
Fig. 1 is a flowchart of a first embodiment of the compound-word-based search method of the present invention. As shown in Fig. 1, the method specifically comprises:
Step S101: performing frequency statistics on the word combinations of each web page stored in the web page library, where each word combination is formed from the unary words obtained by performing word segmentation on that web page;
Step S102: building a compound word index for the word combinations whose frequency is greater than a predetermined threshold, where a word combination whose frequency is greater than the predetermined threshold is a compound word;
Step S103: when an obtained search request contains a keyword matching a compound word, returning search results according to the compound word index.
In this embodiment of the invention, compound words are screened according to the counted frequency of word combinations, a compound word index is built for the compound words, and, when an obtained search request contains a keyword matching a compound word, search results are returned according to the compound word index. This reduces how finely the search phrase entered by the user must be split, saves processing resources and shortens computation time, so that the user's search request can be answered faster.
Fig. 2 is a flowchart of a second embodiment of the compound-word-based search method of the present invention. As shown in Fig. 2, the method specifically comprises:
Step S201: setting up a compound word candidate library for counting word combinations, where the compound word candidate library stores the word combinations and a counter corresponding to each word combination.
In this embodiment, the above setup is achieved by initializing an empty compound word candidate library. Specifically, a compound word candidate library M is defined as a set whose purpose is to store elements A of the following form:
A = struct {
    String            // the compound entry
    Int32 iCounter    // the counter of the compound entry
}
In a program, M is in practice implemented as an array of structures A.
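The candidate library described above can be sketched in Python as follows. This is only an illustrative sketch: the names `CandidateRecord`, `entry`, `i_counter`, `new_candidate_library` and `find_record` are assumptions introduced here, and only the pairing of a compound entry with a counter, held in an array, comes from the description.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class CandidateRecord:
    """One element A of the candidate library M (field names are illustrative)."""
    entry: str          # the compound entry, i.e. the text of the word combination
    i_counter: int = 0  # the counter of the compound entry


def new_candidate_library() -> List[CandidateRecord]:
    """Initialise an empty candidate library M, modelled as an array of A records."""
    return []


def find_record(library: List[CandidateRecord], entry: str) -> Optional[CandidateRecord]:
    """Scan the array for the record holding `entry`; return None if it is absent."""
    for record in library:
        if record.entry == entry:
            return record
    return None
```

A linear scan mirrors the "array of structures" wording; a hash map keyed by the entry would give faster lookups but is a substitution not stated in the description.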
Step S202: for each web page in the web page library, first reading the web page and performing word segmentation on it.
In this embodiment, word segmentation is the process of cutting a sentence into words. The main strategy for segmenting a web page is to remove stop words and perform synonym normalization, and then obtain the segmentation result. For example, suppose the web page content read is "the international strategy selection and domestic strategy arrangement of China's intellectual property". After stop-word removal, synonym normalization and segmentation, the unary words obtained are: "China", "knowledge", "property right", "international", "strategy", "selection", "domestic", "strategy", "arrangement".
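The stop-word removal and synonym normalization can be illustrated with a minimal sketch. It assumes the page has already been cut into tokens by some word segmenter, which is not shown; the function name `normalize_tokens`, the stop-word set and the synonym map are hypothetical examples and are not taken from the description.

```python
from typing import Dict, Iterable, List, Set


def normalize_tokens(tokens: Iterable[str],
                     stop_words: Set[str],
                     synonyms: Dict[str, str]) -> List[str]:
    """Drop stop words, then map each remaining token to its canonical synonym."""
    unary_words = []
    for token in tokens:
        if token in stop_words:
            continue  # stop words are removed before any combination is built
        # Tokens without a synonym entry are kept unchanged.
        unary_words.append(synonyms.get(token, token))
    return unary_words


# Hypothetical pre-segmented input; "of", "with" and the "tactic" synonym are made up.
raw_tokens = ["China", "knowledge", "property right", "of", "strategy",
              "selection", "with", "domestic", "tactic", "arrangement"]
unary = normalize_tokens(raw_tokens,
                         stop_words={"of", "with"},
                         synonyms={"tactic": "strategy"})
```

Here `unary` keeps the original word order, which matters because the later combination step works on neighbouring words.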
Step S203: permuting and combining the unary words obtained by the word segmentation to obtain word combinations.
In this embodiment, the unary words obtained by word segmentation are permuted and combined to obtain word combinations. A unary word is a basic word or term from which phrases or sentences are formed. Taking the unary words obtained in the example of step S202, permutation and combination yields binary word combinations such as "China knowledge", "intellectual property", "international strategy", "domestic strategy", "strategy arrangement", "selection domestic" and "strategy selection", which are not enumerated one by one here. Ternary word combinations such as "China intellectual property" and "selection domestic strategy" can also be obtained, and are likewise not enumerated one by one. Combinations of higher order can be obtained in the same way, and the maximum order of a combination can be set through a permutation rule. The permutation rule instructs that the unary words obtained by word segmentation be permuted and combined according to the configured rule; besides the maximum order of a combination, it may also specify settings such as whether words may be combined arbitrarily.
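One way to generate such combinations is sketched below. It assumes that only adjacent unary words are combined, which agrees with the examples given but is not stated as a requirement; the function name `word_combinations` and the parameter `max_order` (standing in for the maximum order set by the permutation rule) are illustrative.

```python
from typing import List


def word_combinations(unary_words: List[str], max_order: int = 3) -> List[str]:
    """Return every run of 2..max_order adjacent unary words, joined by a space."""
    combos = []
    for order in range(2, max_order + 1):
        # Slide a window of `order` words across the segmented text.
        for start in range(len(unary_words) - order + 1):
            combos.append(" ".join(unary_words[start:start + order]))
    return combos


words = ["China", "knowledge", "property right"]
combos = word_combinations(words, max_order=3)
```

With the three words above, `combos` holds two binary combinations followed by one ternary combination. A rule that allows arbitrary (non-adjacent) combination would need a different generator, for example one built on `itertools.permutations`.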
Step S204: judging whether the obtained word combination already exists in the compound word candidate library; if so, incrementing the counter of that word combination by one; otherwise, adding the word combination to the compound word candidate library and setting its counter to one.
In this embodiment, the frequency of a word combination is the number of times the word combination occurs in the web pages of the web page library. For example, if a word combination occurs 30 times in the web pages of the web page library, its frequency in that web page library is 30.
Each word combination obtained by permutation and combination is judged in turn: if it already exists in the compound word candidate library, its counter is incremented by one; if it does not, it is added to the compound word candidate library and its counter is set to one. Denoting each word combination by Xi, the specific algorithm is as follows:
IF (Xi is in candidate library M)
{
    access the record for Xi in M;
    increment the iCounter of the record for Xi in M by one;
}
ELSE
{
    add a record for Xi to M;
    set the iCounter of the record for Xi in M to one;
}
After the word combinations in one web page have been counted, it is checked whether all web pages in the web page library have been counted. If so, step S205 is executed; otherwise, a web page that has not yet been counted is read and segmented, and steps S203 to S204 are executed.
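The counting loop over all web pages can be sketched as follows. For brevity this sketch stores the candidate library M as a Python dictionary mapping each word combination to its counter rather than as an array of structures; that substitution, the function name `count_combinations`, and the assumption that each page has already been reduced to its list of word combinations are all illustrative.

```python
from typing import Dict, Iterable, List


def count_combinations(pages: Iterable[List[str]]) -> Dict[str, int]:
    """Count word combinations over all pages.

    Each element of `pages` is the list of word combinations of one web page.
    """
    library: Dict[str, int] = {}  # candidate library M: combination -> counter
    for combos in pages:
        for xi in combos:
            if xi in library:
                library[xi] += 1  # Xi already recorded: increment its counter
            else:
                library[xi] = 1   # first occurrence: add Xi with counter one
    return library


counts = count_combinations([
    ["knowledge property right", "property right strategy"],
    ["knowledge property right"],
])
```

The loop ends when the iterable of pages is exhausted, which corresponds to the check that all web pages in the web page library have been counted.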
Step S205: building a compound word index for the word combinations whose frequency is greater than a predetermined threshold, where a word combination whose frequency is greater than the predetermined threshold is a compound word.
In this embodiment, once the frequencies of the word combinations of all web pages in the web page library have been counted, a threshold is set, the whole compound word candidate library is traversed, and every word combination whose frequency is greater than the threshold is output; these word combinations are the compound words. The threshold may be set according to actual operational needs, taking the maximum frequency among the counted word combinations as a reference. Alternatively, the frequencies of all word combinations in the compound word candidate library are normalized to obtain the probability of occurrence of each word combination, and the word combinations whose probability is greater than a certain probability value, set according to actual operational needs, are selected as compound words. The selected compound words are assembled into a compound word dictionary.
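Both screening variants can be sketched briefly. The description does not say how the frequencies are normalized; dividing each counter by the sum of all counters is an assumption made here, and the function names and example counts are hypothetical.

```python
from typing import Dict, Set


def select_by_count(library: Dict[str, int], threshold: int) -> Set[str]:
    """Keep the word combinations whose frequency is greater than the threshold."""
    return {combo for combo, count in library.items() if count > threshold}


def select_by_probability(library: Dict[str, int], min_probability: float) -> Set[str]:
    """Normalise counters to probabilities (assumed: count / total) and screen on them."""
    total = sum(library.values())
    if total == 0:
        return set()  # empty library: nothing can be selected
    return {combo for combo, count in library.items()
            if count / total > min_probability}


# Hypothetical counters for three word combinations.
library = {"intellectual property": 30, "China knowledge": 2, "strategy arrangement": 8}
by_count = select_by_count(library, threshold=5)
by_probability = select_by_probability(library, min_probability=0.5)
```

In this example the counters sum to 40, so only "intellectual property" (30/40) passes a probability bound of 0.5, while the count threshold of 5 also admits "strategy arrangement".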
Step S206: matching the obtained search request against the compound words.
In this embodiment, after the user's search request is obtained, the search phrase in the request is matched against the compound words in the compound word dictionary described above. If the search phrase contains a compound word from the compound word dictionary, the match succeeds.
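One possible way to carry out this matching is a greedy longest-match scan over the segmented search phrase, sketched below. The matching strategy is not specified in the description, so greedy longest match, the function name `split_query` and the parameter `max_order` are assumptions. The match succeeds when the returned list of compound words is non-empty.

```python
from typing import List, Set, Tuple


def split_query(query_words: List[str],
                dictionary: Set[str],
                max_order: int = 3) -> Tuple[List[str], List[str]]:
    """Split a segmented search phrase into matched compound words and leftover unary words."""
    compounds: List[str] = []
    unary: List[str] = []
    i = 0
    while i < len(query_words):
        # Try the longest candidate first, down to two words.
        for order in range(min(max_order, len(query_words) - i), 1, -1):
            candidate = " ".join(query_words[i:i + order])
            if candidate in dictionary:
                compounds.append(candidate)
                i += order
                break
        else:
            # No compound word starts at position i: keep the word as a unary word.
            unary.append(query_words[i])
            i += 1
    return compounds, unary


compounds, unary = split_query(
    ["China", "knowledge", "property right", "strategy"],
    dictionary={"knowledge property right"},
)
```

For this hypothetical query, "knowledge property right" is matched as one compound word and "China" and "strategy" remain as unary words.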
Step S207: after the match succeeds, performing the search operation using the compound word index and returning the result of the search operation.
In this embodiment, after a compound word has been matched successfully, the search phrase in the search request is split:
If the search phrase splits into only one compound word, search results are returned according to the index of that compound word;
If the search phrase splits into only a plurality of compound words, intersection and union operations are performed on the results obtained from the index of each compound word, and the computed result is returned;
If the search phrase splits into compound words and unary words, intersection and union operations are performed on the result set obtained from the compound word index and the result set obtained from the unary word index, and the computed result is returned.
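The three cases above can be covered by one routine that gathers a result set per term and then applies set operations. The sketch assumes each index maps a term to a set of page identifiers; the index layout, the function name `search`, and the choice to return the intersection and the union side by side are assumptions, since the description does not say how the two operations are combined.

```python
from typing import Dict, List, Set


def search(compounds: List[str],
           unary_words: List[str],
           compound_index: Dict[str, Set[int]],
           unary_index: Dict[str, Set[int]]) -> Dict[str, Set[int]]:
    """Return the intersection and the union of the result sets of all query terms."""
    # One result set per term; unknown terms contribute an empty set.
    postings = [compound_index.get(c, set()) for c in compounds]
    postings += [unary_index.get(w, set()) for w in unary_words]
    if not postings:
        return {"intersection": set(), "union": set()}
    return {
        "intersection": set.intersection(*postings),
        "union": set.union(*postings),
    }


# Hypothetical indexes: term -> identifiers of the pages containing it.
compound_index = {"knowledge property right": {1, 2, 3}}
unary_index = {"strategy": {2, 3, 4}}
result = search(["knowledge property right"], ["strategy"], compound_index, unary_index)
```

When the search phrase splits into a single compound word, the intersection and the union are both simply that compound word's result set, which matches the first case.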
In this embodiment of the invention, compound words are screened according to the counted frequency of word combinations, a compound word index is built for the compound words, and, when an obtained search request contains a keyword matching a compound word, search results are returned according to the compound word index. This reduces how finely the search phrase entered by the user must be split, saves processing resources, shortens computation time and improves retrieval efficiency, so that the user's search request can be answered faster.
Fig. 3 is a schematic structural diagram of a first embodiment of the search engine server of the present invention. As shown in Fig. 3, the search engine server comprises: a frequency statistics unit 310, a screening unit 320, a generation unit 330 and a search processing unit 340.
The frequency statistics unit 310 is configured to perform frequency statistics on the word combinations of each web page stored in the web page library, where each word combination is formed from the unary words obtained by performing word segmentation on that web page.
The screening unit 320 is configured to screen out the word combinations whose frequency is greater than a predetermined threshold, where a word combination whose frequency is greater than the predetermined threshold is a compound word.
The generation unit 330 is configured to build a compound word index for the compound words.
The search processing unit 340 is configured to return search results according to the compound word index when an obtained search request contains a keyword matching a compound word.
In this embodiment of the invention, compound words are screened according to the counted frequency of word combinations, a compound word index is built for the compound words, and, when an obtained search request contains a keyword matching a compound word, search results are returned according to the compound word index. This reduces how finely the search phrase entered by the user must be split, saves processing resources, shortens computation time and improves retrieval efficiency, so that the user's search request can be answered faster.
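The cooperation of the four units can be sketched as four small classes. This is a simplified illustration only: it restricts word combinations to adjacent pairs, represents each page as an already segmented word list, and models the compound word index as a map from compound word to page identifiers. None of these details, nor the class and method names, are prescribed by the description.

```python
from collections import Counter
from typing import Dict, List, Set


def adjacent_pairs(words: List[str]) -> List[str]:
    """Binary word combinations of neighbouring words (simplification of this sketch)."""
    return [" ".join(pair) for pair in zip(words, words[1:])]


class FrequencyStatisticsUnit:  # corresponds to unit 310
    def count(self, pages: List[List[str]]) -> Counter:
        counts: Counter = Counter()
        for words in pages:
            counts.update(adjacent_pairs(words))
        return counts


class ScreeningUnit:  # corresponds to unit 320
    def __init__(self, threshold: int) -> None:
        self.threshold = threshold

    def screen(self, counts: Counter) -> Set[str]:
        return {combo for combo, n in counts.items() if n > self.threshold}


class GenerationUnit:  # corresponds to unit 330
    def build(self, pages: List[List[str]], compounds: Set[str]) -> Dict[str, Set[int]]:
        index: Dict[str, Set[int]] = {c: set() for c in compounds}
        for page_id, words in enumerate(pages):
            for combo in adjacent_pairs(words):
                if combo in index:
                    index[combo].add(page_id)  # record the page containing the compound word
        return index


class SearchProcessingUnit:  # corresponds to unit 340
    def __init__(self, index: Dict[str, Set[int]]) -> None:
        self.index = index

    def search(self, query_words: List[str]) -> Set[int]:
        for combo in adjacent_pairs(query_words):
            if combo in self.index:
                return self.index[combo]  # answer from the compound word index
        return set()


pages = [["knowledge", "property right", "strategy"],
         ["knowledge", "property right"],
         ["domestic", "strategy"]]
counts = FrequencyStatisticsUnit().count(pages)
compounds = ScreeningUnit(threshold=1).screen(counts)
index = GenerationUnit().build(pages, compounds)
hits = SearchProcessingUnit(index).search(["China", "knowledge", "property right"])
```

In this toy web page library only "knowledge property right" occurs more than once, so it is the only compound word indexed, and the query is answered with pages 0 and 1.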
Fig. 4 is a schematic structural diagram of a second embodiment of the search engine server of the present invention. As shown in Fig. 4, the search engine server comprises: a setting storage unit 410, a frequency statistics unit 420, a screening unit 430, a generation unit 440 and a search processing unit 450.
The setting storage unit 410 is configured to set up a compound word candidate library for counting word combinations, where the compound word candidate library stores the word combinations and a counter corresponding to each word combination.
In this embodiment, the setting storage unit 410 achieves the above setup by initializing an empty compound word candidate library. Specifically, the setting storage unit 410 defines a compound word candidate library M as a set whose purpose is to store elements A of the following form:
A = struct {
    String            // the compound entry
    Int32 iCounter    // the counter of the compound entry
}
In a program, M is in practice implemented as an array of structures A.
The frequency statistics unit 420 is configured to perform frequency statistics on the word combinations of each web page stored in the web page library, where each word combination is formed from the unary words obtained by performing word segmentation on that web page. The frequency statistics unit 420 comprises: a word segmentation module 421, a permutation module 422, a judging module 423 and a statistics module 424.
The word segmentation module 421 is configured to read each web page in the web page library and perform word segmentation on it.
In this embodiment, word segmentation is the process of cutting a sentence into words. The main strategy of the word segmentation module 421 for segmenting a web page is to remove stop words and perform synonym normalization, and then obtain the segmentation result. For example, suppose the web page content read is "the international strategy selection and domestic strategy arrangement of China's intellectual property". After stop-word removal, synonym normalization and segmentation, the unary words obtained are: "China", "knowledge", "property right", "international", "strategy", "selection", "domestic", "strategy", "arrangement".
The permutation module 422 is configured to permute and combine the unary words obtained by the word segmentation to obtain word combinations.
In this embodiment, the unary words obtained by word segmentation are permuted and combined to obtain word combinations. Taking the unary words obtained by the word segmentation module 421 above as an example, permutation and combination yields binary word combinations such as "China knowledge", "intellectual property", "international strategy", "domestic strategy", "strategy arrangement", "selection domestic" and "strategy selection", which are not enumerated one by one here. Ternary word combinations such as "China intellectual property" and "selection domestic strategy" can also be obtained, and are likewise not enumerated one by one. Combinations of higher order can be obtained in the same way. A rule setting module 425 may be provided for setting the permutation rule used in the permutation and combination, and the maximum order of a combination is set through this permutation rule. The permutation rule instructs that the unary words obtained by word segmentation be permuted and combined according to the configured rule; besides the maximum order of a combination, it may also specify settings such as whether words may be combined arbitrarily.
The judging module 423 is configured to judge, after the permutation module obtains a word combination, whether the obtained word combination already exists in the compound word candidate library. The statistics module 424 is configured to increment the counter of the word combination by one when the judging module determines that the obtained word combination exists in the compound word candidate library, and otherwise to add the word combination to the compound word candidate library and set its counter to one.
In this embodiment, the judging module 423 judges each word combination obtained by permutation and combination in turn: if the word combination already exists in the compound word candidate library, the statistics module 424 increments its counter by one; if it does not, the statistics module 424 adds it to the compound word candidate library and sets its counter to one. Denoting each word combination by Xi, the specific algorithm is as follows:
IF (Xi is in candidate library M)
{
    access the record for Xi in M;
    increment the iCounter of the record for Xi in M by one;
}
ELSE
{
    add a record for Xi to M;
    set the iCounter of the record for Xi in M to one;
}
After the word combinations in one web page have been counted, it is checked whether all web pages in the web page library have been counted. If so, processing continues in the screening unit 430; otherwise, the word segmentation module 421 continues to segment the web pages not yet counted, and the permutation module 422, the judging module 423 and the statistics module 424 respectively complete the frequency statistics of their word combinations.
The screening unit 430 is configured to screen out the word combinations whose frequency is greater than a predetermined threshold, where a word combination whose frequency is greater than the predetermined threshold is a compound word.
In this embodiment, once the frequencies of the word combinations of all web pages in the web page library have been counted, the screening unit 430 sets a threshold, traverses the whole compound word candidate library, and outputs every word combination whose frequency is greater than the threshold; these word combinations are the compound words. The threshold may be set according to actual operational needs, taking the maximum frequency among the counted word combinations as a reference. Alternatively, the frequencies of all word combinations in the compound word candidate library are normalized to obtain the probability of occurrence of each word combination, and the word combinations whose probability is greater than a certain probability value, set according to actual operational needs, are selected as compound words. The selected compound words are assembled into a compound word dictionary.
The generation unit 440 is configured to build a compound word index for the compound words.
The search processing unit 450 is configured to return search results according to the compound word index when an obtained search request contains a keyword matching a compound word. The search processing unit 450 specifically comprises: an acquisition module 451, a matching module 452 and a computing module 453.
The acquisition module 451 is configured to obtain the user's search request; the matching module 452 is configured to match the obtained search request against the compound words; the computing module 453 is configured to perform the search operation using the compound word index and return the result of the search operation.
In this embodiment, after the acquisition module 451 obtains the user's search request, the matching module 452 matches the search phrase in the request against the compound words in the compound word dictionary described above. If the search phrase contains a compound word from the compound word dictionary, the match succeeds. After a compound word has been matched successfully, the search phrase in the search request is split:
If the search phrase splits into only one compound word, the computing module 453 returns search results according to the index of that compound word;
If the search phrase splits into only a plurality of compound words, the computing module 453 performs intersection and union operations on the results obtained from the index of each compound word, and returns the computed result;
If the search phrase splits into compound words and unary words, the computing module 453 performs intersection and union operations on the result set obtained from the compound word index and the result set obtained from the unary word index, and returns the computed result.
In this embodiment of the invention, compound words are screened according to the counted frequency of word combinations, a compound word index is built for the compound words, and, when an obtained search request contains a keyword matching a compound word, search results are returned according to the compound word index. This reduces how finely the search phrase entered by the user must be split, saves processing resources, shortens computation time and improves retrieval efficiency, so that the user's search request can be answered faster.
The above are merely preferred embodiments of the present invention and certainly cannot be used to limit the scope of rights of the present invention; equivalent variations made in accordance with the claims of the present invention therefore still fall within the scope covered by the present invention.