CN111160024B

CN111160024B - Chinese word segmentation method, system, device and storage medium based on statistics

Info

Publication number: CN111160024B
Application number: CN201911392455.1A
Authority: CN
Inventors: 寇永娴; 陈惠芳; 蓝飘; 胡志乐; 李娟
Original assignee: GRG Banking Equipment Co Ltd; GRG Banking IT Co Ltd
Current assignee: GRG Banking IT Co Ltd
Priority date: 2019-12-30
Filing date: 2019-12-30
Publication date: 2023-08-15
Anticipated expiration: 2039-12-30
Also published as: CN111160024A

Abstract

The application discloses a statistics-based Chinese word segmentation method, system, device and storage medium. The method comprises the following steps: acquiring a target text; performing word segmentation processing on the target text according to a preset corpus, and identifying a first probability and a second probability; reversely matching the words contained in the target text by combining the first probability and the second probability, and outputting a plurality of word segmentation paths, wherein each path contains a plurality of word segmentation nodes; performing reverse recursion processing on the words of each word segmentation node on each word segmentation path by combining a Viterbi algorithm and a preset scale factor to obtain an optimal word segmentation sequence; and generating a word segmentation result according to the optimal word segmentation sequence. The accuracy of Chinese word segmentation is thereby improved, and the preset scale factor reduces the amount of calculation and hence the cost.

Description

Chinese word segmentation method, system, device and storage medium based on statistics

Technical Field

The present application relates to the field of information processing technologies, and in particular to a statistics-based Chinese word segmentation method, system, device and storage medium.

Background

Chinese word segmentation is the process of recombining a run of consecutive Chinese characters into a sequence of words according to a given specification. It is the basis of Chinese information processing and is widely applied in natural language processing and artificial intelligence; common application scenarios include search engines, speech synthesis and machine translation.

Existing Chinese word segmentation algorithms fall into three main categories: methods based on character string matching, methods based on understanding, and methods based on statistics. A method based on character string matching matches the Chinese character string to be segmented against the entries of a "sufficiently large" machine dictionary according to a certain strategy; if a character string is found in the dictionary, the match is considered successful. Such a method not only requires a dictionary but also has low segmentation accuracy, particularly in ambiguity recognition and new word recognition. A method based on understanding has the computer simulate a person's understanding of a sentence in order to recognize words. Its basic idea is to perform syntactic and semantic analysis at the same time as word segmentation, and to use the syntactic and semantic information to resolve ambiguity. However, because of the nature of Chinese linguistic knowledge, it is difficult to organize the various kinds of information into a form that a machine can read directly, and this conflicts with the method's need for a large amount of linguistic knowledge and information, so understanding-based methods are still at the experimental stage. A method based on statistics judges whether a character string forms a word according to how often the string occurs in a corpus: a word is a combination of characters, and the more often adjacent characters occur together, the more likely they are to form a word.

Given these deficiencies of the prior art, how to improve the accuracy of Chinese word segmentation has become a technical problem to be solved in the industry.
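The ambiguity problem of dictionary matching described above can be illustrated with a minimal sketch of forward maximum matching. The function name, the toy dictionary and the sample sentence are assumptions made for illustration and do not come from the application.

```python
from typing import List, Set


def forward_max_match(text: str, dictionary: Set[str], max_len: int = 4) -> List[str]:
    """Greedy dictionary matching: always take the longest entry at the cursor."""
    words: List[str] = []
    i = 0
    while i < len(text):
        # Try the longest window first, shrinking down to a single character.
        for size in range(min(max_len, len(text) - i), 0, -1):
            piece = text[i:i + size]
            if size == 1 or piece in dictionary:
                words.append(piece)
                i += size
                break
    return words


# Toy dictionary (hypothetical). The greedy match picks "研究生" first and
# leaves "命" stranded, which is the kind of ambiguity error the text refers to.
toy_dictionary = {"研究", "研究生", "生命", "起源"}
greedy_result = forward_max_match("研究生命起源", toy_dictionary)
```

In this toy case the greedy result is 研究生 / 命 / 起源, whereas the intended reading is 研究 / 生命 / 起源; a purely dictionary-driven strategy has no frequency information with which to prefer the second reading.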

Disclosure of Invention

In order to solve one of the above technical problems, the present application aims to provide a statistics-based Chinese word segmentation method, system, device and storage medium.

The first technical scheme adopted by the application is as follows:

a Chinese word segmentation method based on statistics comprises the following steps:

obtaining a target text, wherein the target text contains a plurality of words;

performing word segmentation processing on the target text according to a preset corpus, and identifying a first probability and a second probability, wherein the first probability refers to single word frequency, and the second probability refers to word frequency of two adjacent words;

reversely matching words contained in the target text by combining the first probability and the second probability, and outputting a plurality of word segmentation paths, wherein each path contains a plurality of word segmentation nodes;

carrying out reverse recursion processing on words of each word segmentation node on each word segmentation path by combining a Viterbi algorithm and a preset scale factor to obtain an optimal word segmentation sequence;

and generating a word segmentation result according to the optimal word segmentation sequence.

Further, the step of reversely matching the words contained in the target text by combining the first probability and the second probability and outputting a plurality of word segmentation paths, wherein each path contains a plurality of word segmentation nodes, specifically comprises the following steps:

combining the first probability, the second probability and a Bayesian formula to obtain the conditional probability of each word in the target text, wherein the conditional probability refers to the probability that the second word occurs given that the first word is present;

and reversely matching each word of the target text by combining a second-order hidden Markov algorithm with the obtained conditional probability of each word, and outputting a plurality of word segmentation paths, wherein each word segmentation path comprises a plurality of word segmentation nodes.

Further, the step of reversely matching each word of the target text by combining the second-order hidden Markov algorithm with the obtained conditional probability of each word and outputting a plurality of word segmentation paths, wherein each word segmentation path comprises a plurality of word segmentation nodes, specifically comprises the following steps:

performing reverse matching on each word of the target text by adopting a second-order hidden Markov algorithm to generate a plurality of pre-word paths;

acquiring a preset weight, wherein the preset weight refers to an influence value of word length and word order based on Chinese grammar;

and correcting the generated plurality of pre-word paths according to preset weight values, and outputting a plurality of word segmentation paths, wherein each path comprises a plurality of word segmentation nodes.

Further, the step of carrying out reverse recursion processing on the words of each word segmentation node on each word segmentation path by combining a Viterbi algorithm and a preset scale factor to obtain an optimal word segmentation sequence specifically comprises the following steps:

generating transition probability matrixes with different node lengths according to the conditional probability of words on each word segmentation node on each word segmentation path;

carrying out reverse recursion on the transition probability matrix by adopting a Viterbi algorithm, and outputting a plurality of word segmentation sequences;

and combining a preset scale factor and an argmax function to take the logarithmic extremum of the probabilities of a plurality of word segmentation sequences, and obtaining the optimal word segmentation sequence, wherein the probability of the optimal word segmentation sequence is the largest.
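The role of the logarithm and the scale factor in the last step can be illustrated with a minimal sketch. It assumes that a path probability is a product of per-node probabilities and that the scale factor is applied to each factor before the logarithm is taken; the function names and the sample probabilities are hypothetical.

```python
import math
from typing import List


def product_score(probs: List[float]) -> float:
    """Multiply the per-node probabilities directly (prone to underflow)."""
    result = 1.0
    for p in probs:
        result *= p
    return result


def log_score(probs: List[float], scale: float = 100.0) -> float:
    """Sum log(scale * p) over the nodes instead of multiplying raw probabilities."""
    return sum(math.log(scale * p) for p in probs)


# Eighty small probabilities: the raw product is far below the smallest
# positive double, while the log-domain score stays an ordinary finite number.
sample_probs = [1e-5] * 80
raw = product_score(sample_probs)
scaled_log = log_score(sample_probs)
```

Because the logarithm is monotonic, taking the argmax over log-domain scores selects the same kind of extremum as comparing the products, without the result collapsing to 0.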

The second technical scheme adopted by the application is as follows:

a statistics-based chinese word segmentation system, comprising:

the acquisition module is used for acquiring a target text, wherein the target text contains a plurality of words;

the recognition module is used for carrying out word segmentation processing on the target text according to a preset corpus, and recognizing a first probability and a second probability, wherein the first probability refers to single word frequency, and the second probability refers to word frequency of two adjacent words;

the output module is used for reversely matching words contained in the target text by combining the first probability and the second probability and outputting a plurality of word segmentation paths, wherein the paths contain a plurality of nodes;

the recursion module is used for carrying out reverse recursion processing on words of each word segmentation node on each word segmentation path by combining a Viterbi algorithm and a preset scale factor to obtain an optimal word segmentation sequence;

and the generating module is used for generating word segmentation results according to the optimal word segmentation sequence.

Further, the output module includes:

the first acquisition unit is used for acquiring the conditional probability of each word in the target text by combining the first probability, the second probability and the Bayesian formula, wherein the conditional probability refers to the probability of the second word on the premise that the first word exists;

and the reverse matching unit is used for reversely matching each word of the target text by combining the second-order hidden Markov algorithm with the obtained conditional probability of each word, and outputting a plurality of word segmentation paths, wherein each word segmentation path comprises a plurality of word segmentation nodes.

Further, the reverse matching unit includes:

the generation subunit is used for reversely matching each word of the target text by adopting a second-order hidden Markov algorithm to generate a plurality of word segmentation sequences;

the acquisition subunit acquires preset weights, wherein the preset weights refer to influence values of word length and word sequence based on Chinese grammar;

the output subunit is used for correcting the generated word segmentation sequences according to preset weights and outputting a plurality of word segmentation paths, and each path comprises a plurality of word segmentation nodes.

Further, the recursion module includes:

the generating unit is used for generating transition probability matrixes with different node lengths according to the conditional probabilities of words on the word segmentation nodes on each word segmentation path;

the output unit is used for carrying out reverse recursion processing on the transition probability matrix by adopting a Viterbi algorithm and outputting a plurality of word segmentation sequences;

the second acquisition unit is used for combining a preset scale factor and an argmax function to perform logarithmic extremum processing on the probabilities of the word segmentation sequences, so as to acquire an optimal word segmentation sequence, wherein the probability of the optimal word segmentation sequence is maximum.

The third technical scheme adopted by the application is as follows:

an automatic generation device of computer code, comprising a memory and a processor, the memory being used for storing at least one program, and the processor being used for loading the at least one program to perform the method described above.

The fourth technical scheme adopted by the application is as follows:

a storage medium having stored therein processor executable instructions which when executed by a processor are for performing the method as described above.

The beneficial effects of the application are as follows: the first probability and the second probability of each word in the input target text are obtained through statistics on a preset corpus; the words contained in the target text are reversely matched using the first probability and the second probability, and a plurality of word segmentation paths are output, each containing a plurality of word segmentation nodes; reverse recursion processing is performed on the words of each word segmentation node on each word segmentation path according to the Viterbi algorithm and a preset scale factor to obtain the optimal word segmentation sequence; and the word segmentation result is finally generated based on the optimal word segmentation sequence. The accuracy of Chinese word segmentation is thereby improved, and the preset scale factor reduces the amount of calculation and hence the cost.

Drawings

FIG. 1 is a flow chart of steps of a Chinese word segmentation method based on statistics provided by the application;

FIG. 2 is a block diagram of a statistics-based Chinese word segmentation system.

Detailed Description

Example 1

Referring to FIG. 1, the present application provides a statistics-based Chinese word segmentation method, which specifically includes the following steps:

S1, acquiring a target text, wherein the target text contains a plurality of words;

S2, performing word segmentation processing on the target text according to a preset corpus, and identifying a first probability and a second probability, wherein the first probability refers to single word frequency, and the second probability refers to word frequency of two adjacent words;

S3, reversely matching words contained in the target text by combining the first probability and the second probability, and outputting a plurality of word segmentation paths, wherein each path contains a plurality of word segmentation nodes;

S4, carrying out reverse recursion processing on words of each word segmentation node on each word segmentation path by combining a Viterbi algorithm and a preset scale factor to obtain an optimal word segmentation sequence;

S5, generating word segmentation results according to the optimal word segmentation sequence.

The preset corpus may be the People's Daily 2014 corpus, the Peking University Chinese corpus, or the like; this embodiment preferably uses the People's Daily 2014 corpus. The corpus stores language material that has actually appeared in real use, held as a database that has been segmented according to a specification and annotated with part-of-speech and word-sense tags. By performing word segmentation processing on the target text according to the preset corpus, the first probability (the frequency with which a word occurs) and the second probability (the frequency with which two words occur adjacently) can be identified for the target text.

The words are then reversely matched by combining the first probability and the second probability. Reverse matching means that the input target text is scanned from back to front, with one character removed each time, until the preset corpus is hit or only a single character remains. A plurality of word segmentation paths are output, each containing a plurality of word segmentation nodes.

Reverse recursion processing is then performed on the words of each word segmentation node on each word segmentation path by means of the Viterbi algorithm and the preset scale factor, to obtain the optimal word segmentation sequence. The preset scale factor means that, when the Viterbi algorithm is used to compute the probability value of a word segmentation path, each probability is multiplied by 100 and logarithms are taken on both sides of the Viterbi formula. This spares the computer a large number of floating-point multiplications, and also avoids the situation in which the result is so close to 0 that the computer cannot represent it and can only store it as 0, causing a loss of precision. Finally, the word segmentation result is generated according to the optimal word segmentation sequence, which improves the accuracy of Chinese word segmentation and reduces the amount of computation.
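The back-to-front scan described above can be sketched as follows. This is only an illustration under stated assumptions: the corpus lookup is reduced to a set of known words, every candidate ending at a position is kept so that several paths result, and the names `backward_candidates`, `enumerate_paths` and the toy vocabulary are hypothetical.

```python
from typing import Dict, List, Set


def backward_candidates(text: str, vocab: Set[str], max_len: int = 4) -> Dict[int, List[str]]:
    """For each end position, scanning from the back, collect candidate words.

    A candidate is any substring ending at `end` that is found in `vocab`;
    a single character is always kept as the fallback.
    """
    candidates: Dict[int, List[str]] = {}
    for end in range(len(text), 0, -1):  # walk from the back of the text
        found: List[str] = []
        for start in range(max(0, end - max_len), end):  # longest window first
            piece = text[start:end]
            if piece in vocab or len(piece) == 1:
                found.append(piece)
        candidates[end] = found
    return candidates


def enumerate_paths(text: str, candidates: Dict[int, List[str]]) -> List[List[str]]:
    """Expand the per-position candidates into complete segmentation paths."""
    def expand(end: int) -> List[List[str]]:
        if end == 0:
            return [[]]
        paths: List[List[str]] = []
        for word in candidates[end]:
            for prefix in expand(end - len(word)):
                paths.append(prefix + [word])
        return paths
    return expand(len(text))


toy_vocab = {"研究", "研究生", "生命", "起源"}
sample_text = "研究生命起源"
sample_paths = enumerate_paths(sample_text, backward_candidates(sample_text, toy_vocab))
```

Each element of `sample_paths` is one candidate path whose entries play the role of word segmentation nodes. Exhaustive enumeration is used here only for clarity; it grows quickly with text length, which is one reason a dynamic-programming search is applied in the next step.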

Further as a preferred embodiment, the step S3 specifically includes the following steps:

S31, combining the first probability, the second probability and a Bayesian formula to obtain the conditional probability of each word in the target text, wherein the conditional probability refers to the probability that the second word occurs given that the first word is present;

S32, reversely matching each word of the target text by combining the second-order hidden Markov algorithm with the obtained conditional probability of each word, and outputting a plurality of word segmentation paths, wherein each word segmentation path contains a plurality of word segmentation nodes.

In this embodiment, the conditional probability of each word in the target text is obtained from the first probability (the frequency with which a word occurs) and the second probability (the frequency with which two words occur adjacently) by means of the Bayesian formula. If the probability that word B occurs alone is P(B), and the probability that the two words A and B occur adjacently is P(A, B), then the conditional probability that word A occurs given that word B is present is P(A|B) = P(A, B) / P(B). Similarly, the conditional probability that word B occurs given that word A is present, P(B|A) = P(A, B) / P(A), can be obtained with the Bayesian formula from the probability P(A) that word A occurs alone and the probability P(A, B). Each word of the target text is then reversely matched by a second-order hidden Markov algorithm using the obtained conditional probabilities, and a plurality of word segmentation paths are output, each containing a plurality of word segmentation nodes.
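The conditional probability described above can be sketched from raw counts. The sketch assumes the usual relative-frequency estimate, in which the ratio P(A, B) / P(A) is approximated by count(A B) / count(A); the tiny pre-segmented corpus and the function names are hypothetical and stand in for the preset corpus.

```python
from collections import Counter
from typing import List, Tuple


def count_ngrams(corpus: List[List[str]]) -> Tuple[Counter, Counter]:
    """Count single words and adjacent word pairs in a pre-segmented corpus."""
    unigrams: Counter = Counter()
    bigrams: Counter = Counter()
    for sentence in corpus:
        unigrams.update(sentence)
        bigrams.update(zip(sentence, sentence[1:]))
    return unigrams, bigrams


def conditional_probability(first: str, second: str,
                            unigrams: Counter, bigrams: Counter) -> float:
    """Estimate P(second | first) as count(first second) / count(first)."""
    if unigrams[first] == 0:
        return 0.0  # unseen first word: no evidence for the pair
    return bigrams[(first, second)] / unigrams[first]


# Toy corpus (hypothetical), already segmented into words.
toy_corpus = [
    ["我", "喜欢", "北京"],
    ["我", "喜欢", "上海"],
    ["我", "去", "北京"],
]
uni, bi = count_ngrams(toy_corpus)
p_like_given_i = conditional_probability("我", "喜欢", uni, bi)
p_beijing_given_like = conditional_probability("喜欢", "北京", uni, bi)
```

With these counts, "我" occurs three times and is followed by "喜欢" twice, so the estimate for P(喜欢 | 我) is 2/3. A real system would normally add smoothing for unseen pairs, which is omitted here.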

Further as a preferred embodiment, the step S32 specifically includes the following steps:

S331, performing reverse matching on each word of the target text by adopting a second-order hidden Markov algorithm to generate a plurality of pre-word paths;

S332, acquiring a preset weight, wherein the preset weight refers to an influence value of word length and word order based on Chinese grammar;

S333, correcting the generated plurality of pre-word paths according to preset weights, and outputting a plurality of word segmentation paths, wherein each path comprises a plurality of word segmentation nodes.

Specifically, after the conditional probability of each word in the target text is obtained, each word of the target text is reversely matched using the second-order hidden Markov algorithm to generate a plurality of pre-word paths. In this embodiment, the generated pre-word paths are corrected by the preset weight, and a plurality of word segmentation paths are output, wherein the preset weight is an influence value of word length and word order based on Chinese grammar. For example, for the texts 'true and true' and 'true and true', the pre-word path matched by the second-order hidden Markov algorithm is 'true/true', which is not easy to understand; after this pre-word path is corrected by the preset weight based on Chinese grammar, the word segmentation path 'true/true' is output. The word segmentation path obtained after correction by the preset weight clearly conforms better to the way Chinese is understood.
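The application does not state how the preset weight is applied, so the following is only one possible reading, sketched under loud assumptions: each pre-word path carries a score, the weight depends solely on word length, and the correction multiplies the score by the weight of every word on the path. The weight table, the scores and the function name are all hypothetical, and word order is not modelled.

```python
from typing import Dict, List, Tuple

# Hypothetical length weights: illustrative values only, not from the application.
LENGTH_WEIGHT: Dict[int, float] = {1: 0.8, 2: 1.2, 3: 1.0}


def correct_paths(pre_paths: List[Tuple[List[str], float]],
                  length_weight: Dict[int, float] = LENGTH_WEIGHT,
                  default: float = 1.0) -> List[Tuple[List[str], float]]:
    """Rescore each pre-word path by the length weight of its words, best first."""
    corrected: List[Tuple[List[str], float]] = []
    for words, score in pre_paths:
        for word in words:
            score *= length_weight.get(len(word), default)
        corrected.append((words, score))
    return sorted(corrected, key=lambda item: item[1], reverse=True)


# Two hypothetical pre-word paths with made-up scores.
sample_pre_paths = [
    (["研", "究", "生命"], 0.5),
    (["研究", "生命"], 0.4),
]
ranked = correct_paths(sample_pre_paths)
```

In this toy setting the path made of two-character words overtakes the path containing isolated single characters after correction, which is the general effect the paragraph above attributes to the preset weight.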

Further as a preferred embodiment, the step S4 specifically includes the following steps:

S41, generating transition probability matrixes with different node lengths according to the conditional probabilities of words on the word segmentation nodes on each word segmentation path;

S42, carrying out reverse recursion processing on the transition probability matrix by adopting a Viterbi algorithm, and outputting a plurality of word segmentation sequences;

S43, combining a preset scale factor and an argmax function to carry out logarithmic extremum processing on probabilities of a plurality of word segmentation sequences, and obtaining an optimal word segmentation sequence, wherein the probability of the optimal word segmentation sequence is maximum.

In this embodiment, transition probability matrices [a_ji] of different node lengths are generated according to the conditional probabilities of the words on the word segmentation nodes of each word segmentation path. The Viterbi algorithm is used to perform reverse recursion processing on the transition probability matrices [a_ji], and a plurality of word segmentation sequences are output. Logarithmic extremum processing is then performed on the probabilities of the word segmentation sequences by means of the preset scale factor (100) and the argmax function, and the word segmentation sequence with the maximum probability value is taken as the optimal word segmentation sequence. Here, the probabilities of the word segmentation sequences are the probabilities of the corresponding word segmentation sequences obtained by the Viterbi algorithm using the transition probabilities of different node lengths.
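A log-domain Viterbi search with a back-to-front recursion can be sketched as below. It is a simplified illustration rather than the claimed procedure: the transition probabilities are held in a dictionary keyed by (previous word, word) instead of matrices, unseen transitions fall back to a small floor value, and the names `viterbi_segment`, `toy_cond`, `toy_vocab` and the `<s>` start marker are assumptions.

```python
import math
from typing import Dict, List, Optional, Set, Tuple


def viterbi_segment(text: str,
                    cond_prob: Dict[Tuple[str, str], float],
                    vocab: Set[str],
                    scale: float = 100.0,
                    max_len: int = 4,
                    floor: float = 1e-6) -> Tuple[List[str], float]:
    """Return the best segmentation and its score, summing log(scale * P(word | prev))."""
    n = len(text)
    if n == 0:
        return [], 0.0

    def words_from(i: int) -> List[str]:
        # Candidate words starting at i: vocabulary hits plus the single character.
        return [text[i:j] for j in range(i + 1, min(n, i + max_len) + 1)
                if j == i + 1 or text[i:j] in vocab]

    def step(prev: str, word: str) -> float:
        return math.log(scale * cond_prob.get((prev, word), floor))

    # best[(i, w)] = (score of the best continuation after w, next word on that path)
    best: Dict[Tuple[int, str], Tuple[float, Optional[str]]] = {}
    for i in range(n - 1, -1, -1):  # reverse recursion: from the end of the text
        for w in words_from(i):
            j = i + len(w)
            if j == n:
                best[(i, w)] = (0.0, None)
            else:
                best[(i, w)] = max((step(w, v) + best[(j, v)][0], v)
                                   for v in words_from(j))

    # argmax over the possible first words, then follow the stored pointers.
    score, word = max((step("<s>", w) + best[(0, w)][0], w) for w in words_from(0))
    result: List[str] = []
    i = 0
    current: Optional[str] = word
    while current is not None:
        result.append(current)
        nxt = best[(i, current)][1]
        i += len(current)
        current = nxt
    return result, score


# Hypothetical vocabulary and transition probabilities for a toy sentence.
toy_vocab = {"研究", "研究生", "生命", "起源"}
toy_cond = {
    ("<s>", "研究"): 0.4, ("<s>", "研究生"): 0.3,
    ("研究", "生命"): 0.5, ("生命", "起源"): 0.6,
    ("研究生", "命"): 0.01, ("命", "起源"): 0.05,
}
best_words, best_score = viterbi_segment("研究生命起源", toy_cond, toy_vocab)
```

One property of this formulation is worth keeping in mind: each transition contributes log(scale) in addition to log P, so the scale factor shifts a path's score in proportion to its number of nodes. Setting `scale=1.0` reduces the score to the plain log-probability of the path.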

Example 2

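Referring to FIG. 2, the statistics-based Chinese word segmentation system of the present application includes the acquisition module, the recognition module, the output module, the recursion module and the generating module set out in the second technical scheme above.

Further, the output module includes the first acquisition unit and the reverse matching unit.

Further, the reverse matching unit includes the generation subunit, the acquisition subunit and the output subunit.

Further, the recursion module includes the generating unit, the output unit and the second acquisition unit.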

Example 3

An apparatus for automatically generating computer code, comprising a memory for storing at least one program and a processor for loading the at least one program to perform the method of embodiment one.

The automatic computer code generation device of this embodiment can execute the statistics-based Chinese word segmentation method provided by the first method embodiment of the application, can execute any combination of the implementation steps of the method embodiments, and has the corresponding functions and beneficial effects of the method.

Example 4

A storage medium having stored therein processor-executable instructions which, when executed by a processor, are adapted to carry out the method of embodiment one.

The storage medium of the embodiment can execute the Chinese word segmentation method based on statistics provided by the first embodiment of the method of the application, and can execute the implementation steps of any combination of the embodiments of the method, thereby having the corresponding functions and beneficial effects of the method.

While the preferred embodiment of the present application has been described in detail, the present application is not limited to the embodiments, and those skilled in the art can make various equivalent modifications or substitutions without departing from the spirit of the present application, and the equivalent modifications or substitutions are included in the scope of the present application as defined in the appended claims.

Claims

1. A Chinese word segmentation method based on statistics is characterized by comprising the following steps:

obtaining a target text, wherein the target text contains a plurality of words;

generating word segmentation results according to the optimal word segmentation sequences;

the step of reversely matching words contained in the target text by combining the first probability and the second probability and outputting a plurality of word segmentation paths, wherein each path contains a plurality of word segmentation nodes, specifically comprises the following steps:

2. The statistics-based Chinese word segmentation method according to claim 1, wherein the step of reversely matching each word of the target text by combining the second-order hidden Markov algorithm with the obtained conditional probability of each word, and outputting a plurality of word segmentation paths, wherein each word segmentation path includes a plurality of word segmentation nodes, comprises the steps of:

3. The method of claim 1, wherein the step of combining a Viterbi algorithm and a preset scale factor to perform reverse recursion processing on the words of each word segmentation node on each word segmentation path to obtain an optimal word segmentation sequence comprises the following steps:

4. A statistics-based Chinese word segmentation system, comprising:

the generating module is used for generating word segmentation results according to the optimal word segmentation sequence;

the output module includes:

5. The statistics-based Chinese word segmentation system as recited in claim 4, wherein the reverse matching unit comprises:

6. The statistics-based Chinese word segmentation system as recited in claim 4, wherein the recursion module comprises: the generating unit is used for generating transition probability matrixes with different node lengths according to the conditional probabilities of words on the word segmentation nodes on each word segmentation path;

7. An automatic computer code generating device comprising a memory for storing at least one program and a processor for loading the at least one program to perform the method of any of claims 1-3.

8. A storage medium having stored therein processor executable instructions which, when executed by a processor, are adapted to carry out the method of any one of claims 1-3.