Chinese word segmentation method, system, device and storage medium based on statistics
Technical Field
The invention relates to the technical field of information processing, in particular to a Chinese word segmentation method, a system, a device and a storage medium based on statistics.
Background
Chinese word segmentation refers to the process of recombining a sequence of consecutive Chinese characters into a word sequence according to a certain standard. It is the basis of Chinese information processing and is widely used in natural language processing and artificial intelligence; common application scenarios include search engines, speech synthesis and machine translation.
Existing Chinese word segmentation algorithms can be divided into three major categories: methods based on character string matching, methods based on understanding, and methods based on statistics. The method based on character string matching, also called the mechanical word segmentation method, matches the Chinese character string to be segmented against the entries of a sufficiently large machine dictionary according to a certain strategy; if a character string is found in the dictionary, the matching is considered successful. However, this method not only requires a dictionary but also has low segmentation accuracy, and it is particularly weak in two respects: ambiguity recognition and new word recognition. The method based on understanding has a computer simulate a person's understanding of a sentence in order to recognize words; its basic idea is to perform syntactic and semantic analysis at the same time as word segmentation and to use syntactic and semantic information to handle ambiguity. However, owing to the characteristics of Chinese language knowledge, it is difficult to organize the various kinds of information into a form that a machine can read directly, which conflicts with the large amount of linguistic knowledge and information that the understanding-based method requires; this method is therefore still at an experimental stage.
The word segmentation method based on statistics judges whether a character string constitutes a word according to how frequently the string appears in a corpus: a word is a combination of characters, and the more frequently adjacent characters occur together, the more likely they are to form a word. In view of the defects of the prior art, how to improve the accuracy of Chinese word segmentation has become a technical problem to be solved urgently in the industry.
Disclosure of Invention
In order to solve one of the above technical problems, the present invention provides a statistics-based Chinese word segmentation method, system, device and storage medium.
The first technical scheme adopted by the invention is as follows:
a Chinese word segmentation method based on statistics comprises the following steps:
acquiring a target text, wherein the target text comprises a plurality of words;
performing word segmentation processing on a target text according to a preset corpus, and identifying a first probability and a second probability, wherein the first probability refers to the word frequency of a single word, and the second probability is the word frequency of two adjacent words;
reversely matching words contained in the target text by combining the first probability and the second probability, and outputting a plurality of word segmentation paths, wherein each path contains a plurality of word segmentation nodes;
combining a Viterbi algorithm and a preset scale factor to carry out reverse recursive processing on words of each word segmentation node on each word segmentation path to obtain an optimal word segmentation sequence;
and generating a word segmentation result according to the optimal word segmentation sequence.
Further, the step of performing inverse matching on words contained in the target text by combining the first probability and the second probability and outputting a plurality of word segmentation paths, each path containing a plurality of word segmentation nodes specifically comprises the following steps:
acquiring the conditional probability of each word in the target text by combining the first probability, the second probability and a Bayesian formula, wherein the conditional probability refers to the probability of the second word appearing on the premise of the first word;
and reversely matching each word of the target text by combining a second-order hidden Markov algorithm and the obtained conditional probability of each word, and outputting a plurality of word segmentation paths, wherein each word segmentation path comprises a plurality of word segmentation nodes.
Further, the step of performing reverse matching on each word of the target text by combining a second-order hidden markov algorithm and the obtained conditional probability of each word, and outputting a plurality of word segmentation paths, wherein each word segmentation path comprises a plurality of word segmentation nodes specifically comprises the following steps:
performing reverse matching on each word of the target text by adopting a second-order hidden Markov algorithm to generate a plurality of pre-word segmentation paths;
acquiring a preset weight, wherein the preset weight refers to an influence value, based on Chinese grammar, of word length and word order;
and correcting the generated plurality of pre-word segmentation paths according to a preset weight value, and outputting a plurality of word segmentation paths, wherein each path comprises a plurality of word segmentation nodes.
Further, the step of carrying out reverse recursive processing on the words of each word segmentation node on each word segmentation path by combining a Viterbi algorithm and a preset scale factor to obtain an optimal word segmentation sequence specifically comprises the following steps:
generating transition probability matrixes with different node lengths according to the conditional probability of words on each word segmentation node on each word segmentation path;
performing reverse recursion processing on the transition probability matrix by adopting a Viterbi algorithm, and outputting a plurality of word segmentation sequences;
and carrying out logarithm extreme value processing on the probabilities of the word segmentation sequences by combining a preset scale factor and an argmax function to obtain an optimal word segmentation sequence, wherein the probability of the optimal word segmentation sequence is the maximum.
The second technical scheme adopted by the invention is as follows:
a statistics-based Chinese word segmentation system, comprising:
the acquisition module is used for acquiring a target text, and the target text comprises a plurality of words;
the recognition module is used for performing word segmentation processing on the target text according to a preset corpus and recognizing a first probability and a second probability, wherein the first probability refers to the word frequency of a single word, and the second probability is the word frequency of two adjacent words;
the output module is used for reversely matching the words contained in the target text by combining the first probability and the second probability and outputting a plurality of word segmentation paths, wherein each path contains a plurality of word segmentation nodes;
the recursion module is used for carrying out reverse recursion processing on the words of each word segmentation node on each word segmentation path by combining a Viterbi algorithm and a preset scale factor to obtain an optimal word segmentation sequence;
and the generating module is used for generating a word segmentation result according to the optimal word segmentation sequence.
Further, the output module includes:
the first obtaining unit is used for obtaining the conditional probability of each word in the target text by combining the first probability, the second probability and a Bayesian formula, wherein the conditional probability refers to the probability of the second word appearing on the premise of the first word;
and the reverse matching unit is used for performing reverse matching on each word of the target text by combining a second-order hidden Markov algorithm and the obtained conditional probability of each word, and outputting a plurality of word segmentation paths, wherein each word segmentation path comprises a plurality of word segmentation nodes.
Further, the inverse matching unit includes:
the generating subunit is used for performing reverse matching on each word of the target text by adopting a second-order hidden Markov algorithm to generate a plurality of pre-word segmentation paths;
the acquiring subunit is used for acquiring a preset weight, wherein the preset weight refers to an influence value, based on Chinese grammar, of word length and word order;
and the output subunit is used for correcting the generated plurality of pre-word segmentation paths according to the preset weight and outputting a plurality of word segmentation paths, wherein each path comprises a plurality of word segmentation nodes.
Further, the recursion module comprises:
the generating unit is used for generating transition probability matrixes with different node lengths according to the conditional probability of words on each word segmentation node on each word segmentation path;
the output unit is used for carrying out reverse recursion processing on the transition probability matrix by adopting a Viterbi algorithm and outputting a plurality of word segmentation sequences;
and the second obtaining unit is used for carrying out logarithm extreme value processing on the probabilities of the word segmentation sequences by combining a preset scale factor and an argmax function to obtain an optimal word segmentation sequence, wherein the probability of the optimal word segmentation sequence is the maximum.
The third technical scheme adopted by the invention is as follows:
an automatic computer code generation device, comprising a memory and a processor, wherein the memory is used for storing at least one program, and the processor is used for loading the at least one program to execute the method described above.
The fourth technical scheme adopted by the invention is as follows:
a storage medium having stored therein processor-executable instructions for performing the method as described above when executed by a processor.
The invention has the beneficial effects that: a first probability and a second probability of each word in the input target text are obtained through statistics on a preset corpus; the words contained in the target text are reversely matched by combining the first probability and the second probability, and a plurality of word segmentation paths are output, each containing a plurality of word segmentation nodes; the words of each word segmentation node on each word segmentation path are then reversely recursively processed according to a Viterbi algorithm and a preset scale factor to obtain an optimal word segmentation sequence; and finally a word segmentation result is generated based on the optimal word segmentation sequence. The accuracy of Chinese word segmentation is thereby improved, and the preset scale factor reduces the amount of calculation and hence the cost.
Drawings
FIG. 1 is a flow chart of the steps of a statistics-based Chinese word segmentation method provided by the present invention;
FIG. 2 is a block diagram of a statistics-based Chinese word segmentation system according to the present invention.
Detailed Description
Example one
Referring to fig. 1, the invention provides a statistics-based Chinese word segmentation method, which specifically includes the following steps:
S1, obtaining a target text, wherein the target text comprises a plurality of words;
S2, performing word segmentation processing on the target text according to a preset corpus, and identifying a first probability and a second probability, wherein the first probability refers to the word frequency of a single word, and the second probability refers to the word frequency of two adjacent words;
S3, reversely matching words contained in the target text by combining the first probability and the second probability, and outputting a plurality of word segmentation paths, wherein each path contains a plurality of word segmentation nodes;
S4, combining the Viterbi algorithm and the preset scale factor to carry out reverse recursive processing on the words of each word segmentation node on each word segmentation path to obtain an optimal word segmentation sequence;
and S5, generating a word segmentation result according to the optimal word segmentation sequence.
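As an illustration of step S2, the following minimal Python sketch counts the first and second probabilities from a corpus that has already been segmented. The function name count_probabilities, the toy corpus, and the choice to normalize both frequencies by the total number of word tokens are assumptions made for this example; the embodiment does not specify them.

```python
from collections import Counter


def count_probabilities(segmented_sentences):
    """Return (first_prob, second_prob) as relative frequencies.

    first_prob maps a word to its frequency; second_prob maps a pair of
    adjacent words to its frequency. Both are divided by the total number
    of word tokens (an assumption) so that their ratio equals
    count(A, B) / count(A).
    """
    unigrams = Counter()
    bigrams = Counter()
    for words in segmented_sentences:
        unigrams.update(words)
        bigrams.update(zip(words, words[1:]))  # adjacent word pairs
    total = sum(unigrams.values())
    first_prob = {w: c / total for w, c in unigrams.items()}
    second_prob = {pair: c / total for pair, c in bigrams.items()}
    return first_prob, second_prob


# Toy pre-segmented corpus (hypothetical data, not from People's Daily).
corpus = [["我", "喜欢", "北京"], ["我", "喜欢", "学习"]]
first_prob, second_prob = count_probabilities(corpus)
```

In this toy corpus the word "我" occurs twice among six tokens, and the pair ("我", "喜欢") also occurs twice, so both frequencies come out as 2/6.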
The preset corpus may be the People's Daily 2014 corpus, the Beijing Language and Culture University corpus, the Peking University corpus and the like; the People's Daily 2014 corpus is preferred in this embodiment. The corpus stores language material that has actually appeared in real use of the language; it is segmented according to a specification and annotated with parts of speech and word senses. The preset corpus is built into the system and used to segment the target text, so that the first probability of the target text, namely the probability that a word appears, and the second probability, namely the probability that two words appear adjacently, can be identified. The target text is then reversely matched by combining the first probability and the second probability. Reverse matching takes words from the target text from back to front, removing one character at a time until the candidate is hit in the preset corpus or only a single character is left; a plurality of word segmentation paths are output, each comprising a plurality of word segmentation nodes. The words of each word segmentation node on each word segmentation path are subjected to reverse recursive processing through the Viterbi algorithm and the preset scale factor to obtain the optimal word segmentation sequence. The preset scale factor means that, when the Viterbi formula is applied to a word segmentation path to obtain its probability value, the probability value is multiplied by 100 and logarithms are then taken on both sides of the Viterbi formula. This reduces the large number of floating-point operations the computer would otherwise perform, and avoids the loss of precision that occurs when the result value is too small in magnitude to be represented and can only be represented as 0. Finally, the word segmentation result is generated according to the optimal word segmentation sequence, so that the accuracy of Chinese word segmentation is improved.
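The back-to-front matching described above can be sketched in Python as follows. This is a simplified illustration that returns a single path rather than a plurality of paths; the function name reverse_match, the max_len window, the toy vocabulary, and the choice to drop the leftmost character of the candidate at each step are assumptions, since the embodiment does not fix these details.

```python
def reverse_match(text, vocabulary, max_len=4):
    """Backward maximum matching over `text`.

    Starting from the end of the text, take a candidate of up to `max_len`
    characters and drop one character from its front until the candidate is
    found in `vocabulary` or only a single character is left.
    """
    words = []
    end = len(text)
    while end > 0:
        start = max(0, end - max_len)
        while start < end - 1 and text[start:end] not in vocabulary:
            start += 1  # remove one character and try again
        words.append(text[start:end])
        end = start
    words.reverse()  # words were collected from back to front
    return words


# Hypothetical vocabulary standing in for the preset corpus.
vocab = {"研究", "研究生", "生命", "起源"}
print(reverse_match("研究生命起源", vocab))  # ['研究', '生命', '起源']
```

A forward scan with the same vocabulary would first match "研究生"; scanning from the back avoids that particular ambiguity in this example.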
Further, as a preferred embodiment, the step S3 specifically includes the following steps:
S31, combining the first probability, the second probability and a Bayes formula to obtain the conditional probability of each word in the target text, wherein the conditional probability refers to the probability of the second word appearing on the premise of the first word;
and S32, reversely matching each word of the target text by combining a second-order hidden Markov algorithm and the obtained conditional probability of each word, and outputting a plurality of word segmentation paths, wherein each word segmentation path comprises a plurality of word segmentation nodes.
In this embodiment, the Bayesian formula is used with the first probability, namely the word frequency of a single word, and the second probability, namely the word frequency of two adjacent words, to obtain the conditional probability of each word in the target text. If the probability of a single word B is P(B) and the probability of the two adjacent words A and B is P(A, B), then the conditional probability of word A appearing given word B is P(A | B) = P(A, B) / P(B). Similarly, from the probability P(A) of word A and P(A, B), the conditional probability of word B appearing on the premise that word A exists is P(B | A) = P(A, B) / P(A). Each word of the target text is then reversely matched by the second-order hidden Markov algorithm using the obtained conditional probabilities, and a plurality of word segmentation paths are output, wherein each word segmentation path includes a plurality of word segmentation nodes.
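The relation P(B | A) = P(A, B) / P(A) can be computed as in the short Python sketch below. The function name conditional_probability, the dictionary layout, the illustrative probability values, and the convention of returning 0.0 for an unseen first word are assumptions made for the example.

```python
def conditional_probability(first_prob, second_prob, a, b):
    """Return P(b | a) = P(a, b) / P(a).

    first_prob maps word -> P(word); second_prob maps (a, b) -> P(a, b).
    Returns 0.0 when `a` was never observed (an assumed convention).
    """
    p_a = first_prob.get(a, 0.0)
    if p_a == 0.0:
        return 0.0
    return second_prob.get((a, b), 0.0) / p_a


# Illustrative values only; they are not taken from any real corpus.
first_prob = {"我": 0.4, "喜欢": 0.2}
second_prob = {("我", "喜欢"): 0.1}
p = conditional_probability(first_prob, second_prob, "我", "喜欢")
```

With these values the call evaluates 0.1 / 0.4, so the word "喜欢" follows "我" with conditional probability 0.25.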
Further, as a preferred embodiment, the step S32 specifically includes the following steps:
S331, performing reverse matching on each word of the target text by adopting a second-order hidden Markov algorithm to generate a plurality of pre-word segmentation paths;
S332, acquiring a preset weight, wherein the preset weight refers to an influence value, based on Chinese grammar, of word length and word order;
S333, correcting the generated plurality of pre-word segmentation paths according to the preset weight, and outputting a plurality of word segmentation paths, wherein each path comprises a plurality of word segmentation nodes.
Specifically, after the conditional probabilities of the words in the target text are obtained, the second-order hidden Markov algorithm is used to reversely match the words in the target text and generate a plurality of pre-word segmentation paths. In this embodiment, the generated pre-word segmentation paths are corrected by the preset weight, and a plurality of word segmentation paths are output, wherein the preset weight is an influence value, based on Chinese grammar, of word length and word order. For example, for the text 'true ideal', the pre-word segmentation path matched by the second-order hidden Markov algorithm is displayed as 'true/rational', which is hard to understand; after this path is corrected by the preset weight based on Chinese grammar, the word segmentation path 'true/true ideal' is output. The corrected word segmentation path is clearly more consistent with how Chinese is understood.
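The embodiment does not state how the preset weight is applied. One possible reading, given purely as a hedged sketch, is to re-score each pre-word segmentation path by multiplying its score with a weight that depends on the length of each word. The function name correct_paths, the multiplicative form, and all numeric values below are hypothetical.

```python
def correct_paths(pre_paths, length_weight):
    """Re-score (words, score) pairs with a per-word-length weight.

    Each path score is multiplied by length_weight[len(word)] for every
    word on the path (default 1.0), and paths are returned best first.
    This multiplicative scheme is an assumption, not the patented formula.
    """
    corrected = []
    for words, score in pre_paths:
        for word in words:
            score *= length_weight.get(len(word), 1.0)
        corrected.append((words, score))
    corrected.sort(key=lambda item: item[1], reverse=True)
    return corrected


# Hypothetical pre-segmentation paths and weights.
pre_paths = [(["北京", "大学"], 0.02), (["北", "京", "大", "学"], 0.03)]
length_weight = {1: 0.5, 2: 1.2}  # penalize single characters, favor 2-char words
ranked = correct_paths(pre_paths, length_weight)
```

Here the path made of single characters starts with the higher score but is ranked second after correction, which mirrors the kind of adjustment described for the preset weight.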
Further, as a preferred embodiment, the step S4 specifically includes the following steps:
S41, generating transition probability matrixes with different node lengths according to the conditional probability of words on each word segmentation node on each word segmentation path;
S42, carrying out reverse recursion processing on the transition probability matrix by adopting a Viterbi algorithm, and outputting a plurality of word segmentation sequences;
S43, carrying out logarithm extreme value processing on the probabilities of the word segmentation sequences by combining a preset scale factor and an argmax function to obtain an optimal word segmentation sequence, wherein the probability of the optimal word segmentation sequence is the maximum.
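Steps S41 to S43 can be illustrated with a right-to-left Viterbi recursion over the word lattice of the text. The sketch below is a simplified first-order version working in the log domain; it stores the transition probabilities as a sparse dictionary rather than an explicit matrix and omits the scale factor. The function name viterbi_segment, the max_len and floor parameters, and the toy probabilities are assumptions.

```python
import math


def viterbi_segment(text, unigram, cond, max_len=4, floor=1e-8):
    """Return (words, log_score) of the best segmentation of `text`.

    unigram maps word -> P(word); cond maps (prev, nxt) -> P(nxt | prev).
    Unseen events get the small probability `floor` (an assumption).
    """
    n = len(text)
    if n == 0:
        return [], 0.0
    # Candidate spans: known words, plus single characters as a fallback.
    spans = {i: [j for j in range(i + 1, min(n, i + max_len) + 1)
                 if j == i + 1 or text[i:j] in unigram]
             for i in range(n)}
    best = {}  # (i, j) -> (best log score after word text[i:j], next end)
    for i in range(n - 1, -1, -1):  # reverse recursion from the text end
        for j in spans[i]:
            if j == n:
                best[(i, j)] = (0.0, None)
                continue
            word = text[i:j]
            best[(i, j)] = max(
                (math.log(cond.get((word, text[j:k]), floor)) + best[(j, k)][0], k)
                for k in spans[j])
    # Choose the first word, then follow the stored back-pointers.
    score, j = max((math.log(unigram.get(text[:j], floor)) + best[(0, j)][0], j)
                   for j in spans[0])
    words, i = [], 0
    while j is not None:
        words.append(text[i:j])
        i, j = j, best[(i, j)][1]
    return words, score


# Toy probabilities (hypothetical).
unigram = {"研究": 0.3, "研究生": 0.1, "生命": 0.3, "命": 0.05, "起源": 0.25}
cond = {("研究", "生命"): 0.6, ("生命", "起源"): 0.5,
        ("研究生", "命"): 0.1, ("命", "起源"): 0.1}
words, score = viterbi_segment("研究生命起源", unigram, cond)
```

With these toy values the path 研究/生命/起源 scores log(0.3 × 0.6 × 0.5), which is higher than the competing path 研究生/命/起源, so it is returned as the best sequence.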
In this embodiment, a transition probability matrix [a_ji] whose step length is the node length is generated according to the conditional probability of each word segmentation node on each word segmentation path. The Viterbi algorithm is applied to the transition probability matrix [a_ji] to perform reverse recursive processing, and a plurality of word segmentation sequences are output. Logarithm extremum processing is then performed on the probabilities of the word segmentation sequences through the preset scale factor (100) and the argmax function, and the word segmentation sequence with the maximum probability value is taken as the optimal word segmentation sequence, wherein the probabilities of the word segmentation sequences refer to the probabilities, computed by the Viterbi algorithm, of the word segmentation sequences corresponding to the transition probability matrices of different node lengths.
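The effect of the scale factor and the logarithm can be seen in the following Python sketch, which scores each candidate sequence as the sum of log(100 × p) over its transition probabilities and selects the maximum. The function names log_score and best_sequence and the candidate data are assumptions; only the factor 100, the logarithm and the argmax come from the embodiment.

```python
import math

SCALE = 100  # preset scale factor named in the embodiment


def log_score(probabilities, scale=SCALE):
    """Sum of log(scale * p) over the transition probabilities of a sequence."""
    return sum(math.log(scale * p) for p in probabilities)


def best_sequence(candidates, scale=SCALE):
    """argmax over (sequence, probabilities) pairs by scaled log score."""
    return max(candidates, key=lambda c: log_score(c[1], scale))[0]


# Multiplying many small probabilities underflows to 0.0 in double precision,
tiny = [1e-5] * 80
product = math.prod(tiny)
# while the log-domain score of the same sequence stays finite.
score = log_score(tiny)

# Hypothetical candidate sequences with their transition probabilities.
candidates = [
    (["研究", "生命", "起源"], [0.3, 0.6, 0.5]),
    (["研究生", "命", "起源"], [0.1, 0.1, 0.1]),
]
best = best_sequence(candidates)
```

Since log(100 × p) = log 100 + log p, the factor adds the same constant for every node. For candidates with the same number of nodes, as here, it does not change which sequence is selected; the text does not spell out how candidates with different numbers of nodes are compared.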
Example two
Referring to fig. 2, which is a structural block diagram of a statistics-based Chinese word segmentation system of the present invention, the system includes:
the acquisition module is used for acquiring a target text, and the target text comprises a plurality of words;
the recognition module is used for performing word segmentation processing on the target text according to a preset corpus and recognizing a first probability and a second probability, wherein the first probability refers to the word frequency of a single word, and the second probability is the word frequency of two adjacent words;
the output module is used for reversely matching the words contained in the target text by combining the first probability and the second probability and outputting a plurality of word segmentation paths, wherein each path contains a plurality of word segmentation nodes;
the recursion module is used for carrying out reverse recursion processing on the words of each word segmentation node on each word segmentation path by combining a Viterbi algorithm and a preset scale factor to obtain an optimal word segmentation sequence;
and the generating module is used for generating a word segmentation result according to the optimal word segmentation sequence.
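One possible way to wire the five modules together is sketched below in Python. The class name SegmentationSystem, the callable signatures, and the stub implementations used in the usage example are all hypothetical; they only show the order in which the modules hand data to one another.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Tuple


@dataclass
class SegmentationSystem:
    """Chains the acquisition, recognition, output, recursion and generating modules."""
    acquire: Callable[[], str]
    recognize: Callable[[str], Tuple[Dict, Dict]]
    output_paths: Callable[[str, Dict, Dict], List[List[str]]]
    recurse: Callable[[List[List[str]], Dict, Dict], List[str]]
    generate: Callable[[List[str]], str]

    def run(self) -> str:
        text = self.acquire()                              # acquisition module
        first_prob, second_prob = self.recognize(text)     # recognition module
        paths = self.output_paths(text, first_prob, second_prob)  # output module
        best = self.recurse(paths, first_prob, second_prob)       # recursion module
        return self.generate(best)                         # generating module


# Stub modules standing in for real implementations (illustrative only).
system = SegmentationSystem(
    acquire=lambda: "北京大学",
    recognize=lambda text: ({}, {}),
    output_paths=lambda text, p1, p2: [["北京", "大学"], ["北", "京", "大", "学"]],
    recurse=lambda paths, p1, p2: min(paths, key=len),
    generate=lambda words: "/".join(words),
)
result = system.run()
```

Passing the modules in as callables keeps each one replaceable, for example to swap the stub recursion for a Viterbi-based one.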
Further, the output module includes:
the first obtaining unit is used for obtaining the conditional probability of each word in the target text by combining the first probability, the second probability and a Bayesian formula, wherein the conditional probability refers to the probability of the second word appearing on the premise of the first word;
and the reverse matching unit is used for performing reverse matching on each word of the target text by combining a second-order hidden Markov algorithm and the obtained conditional probability of each word, and outputting a plurality of word segmentation paths, wherein each word segmentation path comprises a plurality of word segmentation nodes.
Further, the inverse matching unit includes:
the generating subunit is used for performing reverse matching on each word of the target text by adopting a second-order hidden Markov algorithm to generate a plurality of pre-word segmentation paths;
the acquiring subunit is used for acquiring a preset weight, wherein the preset weight refers to an influence value, based on Chinese grammar, of word length and word order;
and the output subunit is used for correcting the generated plurality of pre-word segmentation paths according to the preset weight and outputting a plurality of word segmentation paths, wherein each path comprises a plurality of word segmentation nodes.
Further, the recursion module comprises:
the generating unit is used for generating transition probability matrixes with different node lengths according to the conditional probability of words on each word segmentation node on each word segmentation path;
the output unit is used for carrying out reverse recursion processing on the transition probability matrix by adopting a Viterbi algorithm and outputting a plurality of word segmentation sequences;
and the second obtaining unit is used for carrying out logarithm extreme value processing on the probabilities of the word segmentation sequences by combining a preset scale factor and an argmax function to obtain an optimal word segmentation sequence, wherein the probability of the optimal word segmentation sequence is the maximum.
Example three
An automatic computer code generation device comprises a memory and a processor, wherein the memory is used for storing at least one program, and the processor is used for loading the at least one program to execute the method in the embodiment I.
The automatic computer code generation device of this embodiment can execute the statistics-based Chinese word segmentation method provided in the first embodiment of the present invention, can execute any combination of the implementation steps of the method embodiment, and has the corresponding functions and beneficial effects of the method.
Example four
A storage medium having stored therein processor-executable instructions for performing a method as in embodiment one when executed by a processor.
The storage medium of this embodiment can execute the statistics-based Chinese word segmentation method provided in the first embodiment of the present invention, can execute any combination of the implementation steps of the method embodiment, and has the corresponding functions and beneficial effects of the method.
While the preferred embodiments of the present invention have been illustrated and described, it will be understood by those skilled in the art that various changes in form and details may be made therein without departing from the spirit and scope of the invention as defined by the appended claims.