
CN111160024A - Chinese word segmentation method, system, device and storage medium based on statistics - Google Patents

Publication number
CN111160024A
Authority
CN
China
Prior art keywords
word segmentation
word
probability
target text
paths
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN201911392455.1A
Other languages
Chinese (zh)
Other versions
CN111160024B (en)
Inventor
寇永娴
陈惠芳
蓝飘
胡志乐
李娟�
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
GRG Banking IT Co Ltd
Original Assignee
GRG Banking Equipment Co Ltd
GRG Banking IT Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by GRG Banking Equipment Co Ltd and GRG Banking IT Co Ltd
Priority to CN201911392455.1A
Publication of CN111160024A
Application granted
Publication of CN111160024B
Legal status: Active
Anticipated expiration

Classifications

    • Y: GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02: TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02D: CLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
    • Y02D10/00: Energy efficient computing, e.g. low power processors, power management or thermal management

Landscapes

  • Machine Translation (AREA)
  • Document Processing Apparatus (AREA)

Abstract

The invention discloses a statistics-based Chinese word segmentation method, system, device and storage medium. The method comprises the following steps: acquiring a target text; performing word segmentation processing on the target text according to a preset corpus to identify a first probability and a second probability; reversely matching the words contained in the target text by combining the first probability and the second probability, and outputting a plurality of word segmentation paths, wherein each path contains a plurality of word segmentation nodes; performing reverse recursive processing on the word of each word segmentation node on each word segmentation path by combining a Viterbi algorithm and a preset scale factor to obtain an optimal word segmentation sequence; and generating a word segmentation result according to the optimal word segmentation sequence. The method thereby improves the accuracy of Chinese word segmentation and, through the preset scale factor, reduces the amount of computation and the cost.

Description

Chinese word segmentation method, system, device and storage medium based on statistics
Technical Field
The invention relates to the technical field of information processing, in particular to a Chinese word segmentation method, a system, a device and a storage medium based on statistics.
Background
Chinese word segmentation refers to the process of recombining a sequence of continuous Chinese characters into a word sequence according to a certain standard. It is the basis of Chinese information processing and is widely applied in natural language processing and artificial intelligence; common application scenarios include search engines, speech synthesis and machine translation.
Existing Chinese word segmentation algorithms fall into three major categories: methods based on character string matching, methods based on understanding, and methods based on statistics. The method based on character string matching, also called mechanical word segmentation, matches the Chinese character string to be segmented against the entries of a sufficiently large machine dictionary according to a certain strategy; if a string is found in the dictionary, the match is considered successful. However, this method not only requires a dictionary but also has low segmentation accuracy, particularly in ambiguity recognition and new word recognition. The method based on understanding lets a computer simulate a person's understanding of sentences in order to recognize words. Its basic idea is to perform syntactic and semantic analysis while segmenting, and to use syntactic and semantic information to handle ambiguity. However, owing to the characteristics of Chinese language knowledge, it is difficult to organize the various kinds of information into a form that a machine can read directly, which conflicts with this method's need for a large amount of language knowledge and information; the understanding-based method is therefore still at an experimental stage.
The method based on statistics judges whether a character string constitutes a word according to how frequently the string appears in a corpus. A word is a combination of characters, and the more frequently adjacent characters co-occur, the more likely they are to constitute a word. In view of the above defects of the prior art, how to improve the accuracy of Chinese word segmentation has become a technical problem to be solved urgently in the industry.
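The statistical idea described above, that characters which frequently occur side by side are likely to form a word, can be illustrated with a minimal Python sketch. The function name `adjacent_pair_counts` and the three-line toy corpus are assumptions made for illustration only and do not come from the patent.

```python
from collections import Counter


def adjacent_pair_counts(corpus_lines):
    """Count how often each pair of adjacent characters occurs in the corpus."""
    counts = Counter()
    for line in corpus_lines:
        # zip(line, line[1:]) yields every pair of neighbouring characters.
        for left, right in zip(line, line[1:]):
            counts[left + right] += 1
    return counts


# Hypothetical toy corpus; a real system would use a large annotated corpus.
toy_corpus = ["北京大学", "北京天气", "大学生"]
pair_counts = adjacent_pair_counts(toy_corpus)

# Pairs that co-occur more often are stronger word candidates.
print(pair_counts.most_common(2))
```

In this toy data the pairs "北京" and "大学" each occur twice, while accidental neighbours such as "京大" occur only once.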
Disclosure of Invention
In order to solve one of the above technical problems, the present invention provides a statistics-based Chinese word segmentation method, system, device and storage medium.
The first technical scheme adopted by the invention is as follows:
a Chinese word segmentation method based on statistics comprises the following steps:
acquiring a target text, wherein the target text comprises a plurality of words;
performing word segmentation processing on a target text according to a preset corpus, and identifying a first probability and a second probability, wherein the first probability refers to the word frequency of a single word, and the second probability is the word frequency of two adjacent words;
reversely matching words contained in the target text by combining the first probability and the second probability, and outputting a plurality of word segmentation paths, wherein each path contains a plurality of word segmentation nodes;
combining a Viterbi algorithm and a preset scale factor to carry out reverse recursive processing on words of each word segmentation node on each word segmentation path to obtain an optimal word segmentation sequence;
and generating a word segmentation result according to the optimal word segmentation sequence.
Further, the step of performing inverse matching on words contained in the target text by combining the first probability and the second probability and outputting a plurality of word segmentation paths, each path containing a plurality of word segmentation nodes specifically comprises the following steps:
acquiring the conditional probability of each word in the target text by combining the first probability, the second probability and a Bayesian formula, wherein the conditional probability refers to the probability of the second word appearing on the premise of the first word;
and reversely matching each word of the target text by combining a second-order hidden Markov algorithm and the obtained conditional probability of each word, and outputting a plurality of word segmentation paths, wherein each word segmentation path comprises a plurality of word segmentation nodes.
Further, the step of performing reverse matching on each word of the target text by combining a second-order hidden markov algorithm and the obtained conditional probability of each word, and outputting a plurality of word segmentation paths, wherein each word segmentation path comprises a plurality of word segmentation nodes specifically comprises the following steps:
performing reverse matching on each word of the target text by adopting a second-order hidden Markov algorithm to generate a plurality of pre-segmentation paths;
acquiring a preset weight, wherein the preset weight refers to an influence value of word length and word order based on Chinese grammar;
and correcting the generated plurality of pre-segmentation paths according to the preset weight, and outputting a plurality of word segmentation paths, wherein each path comprises a plurality of word segmentation nodes.
Further, the step of performing reverse recursive processing on the word of each word segmentation node on each word segmentation path by combining the Viterbi algorithm and the preset scale factor to obtain an optimal word segmentation sequence specifically comprises the following steps:
generating transition probability matrixes with different node lengths according to the conditional probability of words on each word segmentation node on each word segmentation path;
performing reverse recursion processing on the transition probability matrix by adopting a Viterbi algorithm, and outputting a plurality of word segmentation sequences;
and carrying out logarithm extreme value processing on the probabilities of the word segmentation sequences by combining a preset scale factor and an argmax function to obtain an optimal word segmentation sequence, wherein the probability of the optimal word segmentation sequence is the maximum.
The second technical scheme adopted by the invention is as follows:
a statistics-based chinese word segmentation system, comprising:
the acquisition module is used for acquiring a target text, and the target text comprises a plurality of words;
the recognition module is used for performing word segmentation processing on the target text according to a preset corpus and recognizing a first probability and a second probability, wherein the first probability refers to the word frequency of a single word, and the second probability is the word frequency of two adjacent words;
the output module is used for reversely matching the words contained in the target text by combining the first probability and the second probability, and outputting a plurality of word segmentation paths, wherein each path contains a plurality of word segmentation nodes;
the recursion module is used for carrying out reverse recursion processing on the words of each word segmentation node on each word segmentation path by combining a Viterbi algorithm and a preset scale factor to obtain an optimal word segmentation sequence;
and the generating module is used for generating a word segmentation result according to the optimal word segmentation sequence.
Further, the output module includes:
the first obtaining unit is used for obtaining the conditional probability of each word in the target text by combining the first probability, the second probability and a Bayesian formula, wherein the conditional probability refers to the probability of the second word appearing on the premise of the first word;
and the reverse matching unit is used for performing reverse matching on each word of the target text by combining a second-order hidden Markov algorithm and the obtained conditional probability of each word, and outputting a plurality of word segmentation paths, wherein each word segmentation path comprises a plurality of word segmentation nodes.
Further, the inverse matching unit includes:
the generating subunit is used for performing reverse matching on each word of the target text by adopting a second-order hidden Markov algorithm to generate a plurality of word segmentation sequences;
the obtaining subunit is used for acquiring a preset weight, wherein the preset weight refers to an influence value of word length and word order based on Chinese grammar;
and the output subunit is used for correcting the generated plurality of word segmentation sequences according to the preset weight value and outputting a plurality of word segmentation paths, and each path comprises a plurality of word segmentation nodes.
Further, the recursion module comprises:
the generating unit is used for generating transition probability matrixes with different node lengths according to the conditional probability of words on each word segmentation node on each word segmentation path;
the output unit is used for carrying out reverse recursion processing on the transition probability matrix by adopting a Viterbi algorithm and outputting a plurality of word segmentation sequences;
and the second obtaining unit is used for carrying out logarithm extreme value processing on the probabilities of the word segmentation sequences by combining a preset scale factor and an argmax function to obtain an optimal word segmentation sequence, wherein the probability of the optimal word segmentation sequence is the maximum.
The third technical scheme adopted by the invention is as follows:
an automatic computer code generation device, comprising a memory and a processor, wherein the memory is used for storing at least one program, and the processor is used for loading the at least one program to execute the above method.
The fourth technical scheme adopted by the invention is as follows:
a storage medium having stored therein processor-executable instructions for performing the method as described above when executed by a processor.
The invention has the following beneficial effects: the first probability and the second probability of each word in the input target text are obtained through statistics on the preset corpus; the words contained in the target text are reversely matched by combining the first probability and the second probability, and a plurality of word segmentation paths are output, each containing a plurality of word segmentation nodes; the word of each word segmentation node on each word segmentation path is subjected to reverse recursive processing according to the Viterbi algorithm and the preset scale factor to obtain the optimal word segmentation sequence; and the word segmentation result is finally generated based on the optimal word segmentation sequence. The accuracy of Chinese word segmentation is thereby improved, and the preset scale factor reduces the amount of computation and the cost.
Drawings
FIG. 1 is a flow chart of the steps of a statistical-based Chinese word segmentation method provided by the present invention;
FIG. 2 is a block diagram of a statistical-based Chinese word segmentation system according to the present invention.
Detailed Description
Example one
Referring to FIG. 1, the invention provides a statistics-based Chinese word segmentation method, which specifically includes the following steps:
S1, obtaining a target text, wherein the target text comprises a plurality of words;
S2, performing word segmentation processing on the target text according to a preset corpus, and identifying a first probability and a second probability, wherein the first probability refers to the word frequency of a single word, and the second probability is the word frequency of two adjacent words;
S3, reversely matching words contained in the target text by combining the first probability and the second probability, and outputting a plurality of word segmentation paths, wherein each path contains a plurality of word segmentation nodes;
S4, combining the Viterbi algorithm and the preset scale factor to carry out reverse recursive processing on the words of each word segmentation node on each word segmentation path to obtain an optimal word segmentation sequence;
and S5, generating a word segmentation result according to the optimal word segmentation sequence.
The preset corpus may be the People's Daily 2014 corpus, the Beijing Language and Culture University corpus, the Peking University corpus or the like; the People's Daily 2014 corpus is preferred in this embodiment. This corpus stores language material that has actually appeared in real use of the language, segmented according to a specification and annotated with part-of-speech and word-sense labels.
The preset corpus is built into the system and used to segment the target text, so that the first probability, namely the probability that a word appears, and the second probability, namely the probability that two words appear adjacently, can be identified for the target text.
The target text is then reversely matched by combining the first probability and the second probability. Reverse matching takes words from the back of the text to the front, dropping one character each time until the candidate string is hit in the preset corpus or only a single character remains. A plurality of word segmentation paths are output, and each word segmentation path comprises a plurality of word segmentation nodes.
The word of each word segmentation node on each word segmentation path is then subjected to reverse recursive processing by the Viterbi algorithm and the preset scale factor to obtain the optimal word segmentation sequence. The preset scale factor means that, when the Viterbi formula is applied to a word segmentation path, the probability value is multiplied by 100 and logarithms are then taken on both sides of the Viterbi formula. This reduces the large number of floating-point operations the computer would otherwise perform, and avoids the loss of precision that occurs when the result value is too small to be represented and can only be stored as 0.
Finally, the word segmentation result is generated according to the optimal word segmentation sequence, so that the accuracy of Chinese word segmentation is improved.
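The back-to-front matching described in this embodiment can be sketched in Python as follows. This is a simplified illustration under stated assumptions: the function name `backward_match`, the `max_word_len` limit and the toy vocabulary are hypothetical, and the sketch returns a single greedy path, whereas the method outputs a plurality of word segmentation paths whose generation is not detailed here.

```python
def backward_match(text, vocabulary, max_word_len=4):
    """Greedy back-to-front matching of `text` against `vocabulary`."""
    words = []
    end = len(text)
    while end > 0:
        # Start with the longest candidate that ends at position `end`.
        start = max(0, end - max_word_len)
        # Drop one leading character at a time until the candidate is found
        # in the vocabulary or only a single character is left.
        while end - start > 1 and text[start:end] not in vocabulary:
            start += 1
        words.append(text[start:end])
        end = start
    words.reverse()  # words were collected from back to front
    return words


# Hypothetical vocabulary standing in for the words of the preset corpus.
vocab = {"研究", "研究生", "生命", "起源"}
print(backward_match("研究生命起源", vocab))  # ['研究', '生命', '起源']
```

Matching from the back avoids the forward-greedy split "研究生/命/起源" in this example, because "起源" and "生命" are consumed before "研究生" can be considered.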
Further, as a preferred embodiment, the step S3 specifically includes the following steps:
S31, combining the first probability, the second probability and a Bayes formula to obtain the conditional probability of each word in the target text, wherein the conditional probability refers to the probability of the second word appearing on the premise of the first word;
and S32, reversely matching each word of the target text by combining a second-order hidden Markov algorithm and the obtained conditional probability of each word, and outputting a plurality of word segmentation paths, wherein each word segmentation path comprises a plurality of word segmentation nodes.
In this embodiment, the conditional probability of each word in the target text is obtained from the Bayesian formula, the first probability (the word frequency with which a single word appears) and the second probability (the word frequency with which two words appear adjacently). If the probability of a single word B is P(B) and the probability of the two adjacent words A and B is P(A, B), then the conditional probability of word A on the premise of word B is P(A | B) = P(A, B) / P(B). Similarly, from the probability P(A) of word A and P(A, B), the conditional probability of word B on the premise of word A is P(B | A) = P(A, B) / P(A). The target text is then reversely matched by the second-order hidden Markov algorithm using the obtained conditional probability of each word, and a plurality of word segmentation paths are output, wherein each word segmentation path comprises a plurality of word segmentation nodes.
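As a worked illustration of the formula P(B | A) = P(A, B) / P(A), the following Python sketch estimates both probabilities from counts in a pre-segmented corpus. The helper names and the toy sentences are assumptions for illustration; the sketch also assumes that both probabilities are normalized by the same total, so that the ratio reduces to a ratio of counts.

```python
from collections import Counter


def count_words(segmented_sentences):
    """Count single words and pairs of adjacent words."""
    unigrams = Counter()
    bigrams = Counter()
    for words in segmented_sentences:
        unigrams.update(words)
        bigrams.update(zip(words, words[1:]))
    return unigrams, bigrams


def conditional_probability(first, second, unigrams, bigrams):
    """Estimate P(second | first) as count(first, second) / count(first)."""
    if unigrams[first] == 0:
        return 0.0  # unseen first word: no estimate available
    return bigrams[(first, second)] / unigrams[first]


# Hypothetical pre-segmented toy corpus.
corpus = [["我", "爱", "北京"], ["我", "爱", "你"], ["我", "在", "北京"]]
unigrams, bigrams = count_words(corpus)

p = conditional_probability("爱", "北京", unigrams, bigrams)
print(p)  # 0.5: "爱" occurs twice and is followed by "北京" once
```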
Further, as a preferred embodiment, the step S32 specifically includes the following steps:
S321, performing reverse matching on each word of the target text by adopting a second-order hidden Markov algorithm to generate a plurality of pre-segmentation paths;
S322, acquiring a preset weight, wherein the preset weight refers to an influence value of word length and word order based on Chinese grammar;
S323, correcting the generated plurality of pre-segmentation paths according to the preset weight, and outputting a plurality of word segmentation paths, wherein each path comprises a plurality of word segmentation nodes.
Specifically, after the conditional probability of each word in the target text is obtained, the second-order hidden Markov algorithm is used to reversely match the words in the target text and generate a plurality of pre-segmentation paths. In this embodiment, the generated pre-segmentation paths are corrected by the preset weight and a plurality of word segmentation paths are output, wherein the preset weight is an influence value of word length and word order based on Chinese grammar. For example, for the text 'true ideal', the pre-segmentation path matched by the second-order hidden Markov algorithm is 'true/rational', which is hard to understand; after correction by the preset weight based on Chinese grammar, the output word segmentation path is 'true/true ideal'. The path corrected by the preset weight is clearly more consistent with how the Chinese text is understood.
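The text does not give the form of the preset weight, so the following Python sketch shows only one possible reading: each word contributes a multiplicative factor that depends on its length, and the pre-segmentation paths are re-ranked by the corrected score. The function name `rerank_paths`, the weight table and the placeholder tokens are all hypothetical.

```python
def rerank_paths(pre_paths, length_weight):
    """Correct each (words, score) path by a per-word length factor and re-rank.

    `length_weight` maps a word length to a multiplicative factor; lengths
    missing from the table leave the score unchanged (factor 1.0).
    """
    corrected = []
    for words, score in pre_paths:
        factor = 1.0
        for word in words:
            factor *= length_weight.get(len(word), 1.0)
        corrected.append((words, score * factor))
    # Highest corrected score first.
    corrected.sort(key=lambda item: item[1], reverse=True)
    return corrected


# Hypothetical pre-segmentation paths with placeholder tokens and scores.
pre_paths = [(["a", "b", "c"], 0.4), (["ab", "c"], 0.3)]
# Hypothetical weights that penalize single-character words.
weights = {1: 0.5, 2: 1.0}

ranked = rerank_paths(pre_paths, weights)
print(ranked[0][0])  # ['ab', 'c']
```

With these assumed weights, the path made only of single-character words drops from first to second place even though its uncorrected score was higher.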
Further, as a preferred embodiment, the step S4 specifically includes the following steps:
S41, generating transition probability matrixes with different node lengths according to the conditional probability of words on each word segmentation node on each word segmentation path;
S42, carrying out reverse recursion processing on the transition probability matrix by adopting a Viterbi algorithm, and outputting a plurality of word segmentation sequences;
S43, carrying out logarithm extreme value processing on the probabilities of the word segmentation sequences by combining a preset scale factor and an argmax function to obtain an optimal word segmentation sequence, wherein the probability of the optimal word segmentation sequence is the maximum.
In this embodiment, a transition probability matrix [a_ji] with the node length as its step length is generated according to the conditional probability of the word on each word segmentation node on each word segmentation path. The Viterbi algorithm is applied to the transition probability matrix [a_ji] to perform reverse recursive processing and output a plurality of word segmentation sequences. Logarithmic extremum processing is then performed on the probabilities of the word segmentation sequences using the preset scale factor (100) and the argmax function, and the word segmentation sequence with the largest probability value is taken as the optimal word segmentation sequence. Here, the probability of a word segmentation sequence refers to the probability obtained by the Viterbi algorithm for the sequence corresponding to the transition probabilities of a given node length.
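The log-domain recursion with a scale factor can be sketched in Python as a Viterbi-style dynamic program over candidate words. This is a sketch under several assumptions rather than the patented procedure itself: the names `best_segmentation`, `FLOOR` and the `<s>` start marker are hypothetical, the factor 100 is assumed to apply to every transition probability, and unseen word pairs receive an assumed floor probability so that the logarithm is defined.

```python
import math
from functools import lru_cache

SCALE = 100.0  # preset scale factor stated in this embodiment
FLOOR = 1e-8   # assumed floor probability for unseen word pairs


def best_segmentation(text, vocabulary, cond_prob, max_word_len=4):
    """Return (words, log score) of the highest-scoring segmentation."""
    n = len(text)

    def candidates(start):
        # Every single character is a candidate; longer strings must be known.
        for end in range(start + 1, min(n, start + max_word_len) + 1):
            word = text[start:end]
            if end == start + 1 or word in vocabulary:
                yield word, end

    @lru_cache(maxsize=None)
    def best_from(prev_word, start):
        """Best (log score, words) for text[start:], given the previous word."""
        if start == n:
            return 0.0, ()
        options = []
        for word, end in candidates(start):
            p = max(cond_prob.get((prev_word, word), 0.0), FLOOR)
            tail_score, tail_words = best_from(word, end)
            # Sum of logarithms replaces the product of scaled probabilities.
            options.append((math.log(SCALE * p) + tail_score, (word,) + tail_words))
        # argmax over all continuations from this node
        return max(options, key=lambda option: option[0])

    score, words = best_from("<s>", 0)
    return list(words), score


# Hypothetical vocabulary and conditional probabilities P(word | previous word).
vocab = {"研究", "研究生", "生命", "起源"}
cond_prob = {("<s>", "研究"): 0.5, ("研究", "生命"): 0.4, ("生命", "起源"): 0.3}

words, score = best_segmentation("研究生命起源", vocab, cond_prob)
print(words)  # ['研究', '生命', '起源']
```

Working with sums of logarithms keeps the scores in a representable range even for long texts, where the raw product of many small probabilities would round to 0. Under the per-transition assumption used here, paths with different numbers of words receive different total offsets of log(100); if the factor were instead applied once to the whole path probability, it would add the same constant to every path and leave the argmax unchanged.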
Example two
Referring to FIG. 2, which is a structural block diagram of a statistics-based Chinese word segmentation system of the present invention, the system includes:
the acquisition module is used for acquiring a target text, and the target text comprises a plurality of words;
the recognition module is used for performing word segmentation processing on the target text according to a preset corpus and recognizing a first probability and a second probability, wherein the first probability refers to the word frequency of a single word, and the second probability is the word frequency of two adjacent words;
the output module is used for reversely matching the words contained in the target text by combining the first probability and the second probability, and outputting a plurality of word segmentation paths, wherein each path contains a plurality of word segmentation nodes;
the recursion module is used for carrying out reverse recursion processing on the words of each word segmentation node on each word segmentation path by combining a Viterbi algorithm and a preset scale factor to obtain an optimal word segmentation sequence;
and the generating module is used for generating a word segmentation result according to the optimal word segmentation sequence.
Further, the output module includes:
the first obtaining unit is used for obtaining the conditional probability of each word in the target text by combining the first probability, the second probability and a Bayesian formula, wherein the conditional probability refers to the probability of the second word appearing on the premise of the first word;
and the reverse matching unit is used for performing reverse matching on each word of the target text by combining a second-order hidden Markov algorithm and the obtained conditional probability of each word, and outputting a plurality of word segmentation paths, wherein each word segmentation path comprises a plurality of word segmentation nodes.
Further, the inverse matching unit includes:
the generating subunit is used for performing reverse matching on each word of the target text by adopting a second-order hidden Markov algorithm to generate a plurality of word segmentation sequences;
the obtaining subunit is used for acquiring a preset weight, wherein the preset weight refers to an influence value of word length and word order based on Chinese grammar;
and the output subunit is used for correcting the generated plurality of word segmentation sequences according to the preset weight value and outputting a plurality of word segmentation paths, and each path comprises a plurality of word segmentation nodes.
Further, the recursion module comprises:
the generating unit is used for generating transition probability matrixes with different node lengths according to the conditional probability of words on each word segmentation node on each word segmentation path;
the output unit is used for carrying out reverse recursion processing on the transition probability matrix by adopting a Viterbi algorithm and outputting a plurality of word segmentation sequences;
and the second obtaining unit is used for carrying out logarithm extreme value processing on the probabilities of the word segmentation sequences by combining a preset scale factor and an argmax function to obtain an optimal word segmentation sequence, wherein the probability of the optimal word segmentation sequence is the maximum.
Example three
An automatic computer code generation device comprises a memory and a processor, wherein the memory is used for storing at least one program, and the processor is used for loading the at least one program to execute the method in the embodiment I.
The computer code automatic generation device of the embodiment can execute the Chinese word segmentation method based on statistics provided by the first embodiment of the method of the invention, can execute any combination implementation steps of the embodiment of the method, and has corresponding functions and beneficial effects of the method.
Example four
A storage medium having stored therein processor-executable instructions for performing a method as in embodiment one when executed by a processor.
The storage medium of this embodiment can execute the statistics-based Chinese word segmentation method provided in the first embodiment of the method of the present invention, can execute any combination of the implementation steps of the method embodiments, and has the corresponding functions and advantages of the method.
While the preferred embodiments of the present invention have been illustrated and described, it will be understood by those skilled in the art that various changes in form and details may be made therein without departing from the spirit and scope of the invention as defined by the appended claims.

Claims (10)

1. A Chinese word segmentation method based on statistics is characterized by comprising the following steps:
acquiring a target text, wherein the target text comprises a plurality of words;
performing word segmentation processing on a target text according to a preset corpus, and identifying a first probability and a second probability, wherein the first probability refers to the word frequency of a single word, and the second probability is the word frequency of two adjacent words;
reversely matching words contained in the target text by combining the first probability and the second probability, and outputting a plurality of word segmentation paths, wherein each path contains a plurality of word segmentation nodes;
combining a Viterbi algorithm and a preset scale factor to carry out reverse recursive processing on words of each word segmentation node on each word segmentation path to obtain an optimal word segmentation sequence;
and generating a word segmentation result according to the optimal word segmentation sequence.
2. The statistical-based Chinese word segmentation method according to claim 1, wherein the step of reversely matching words contained in the target text by combining the first probability and the second probability and outputting a plurality of word segmentation paths, each path containing a plurality of word segmentation nodes, specifically comprises the steps of:
acquiring the conditional probability of each word in the target text by combining the first probability, the second probability and a Bayesian formula, wherein the conditional probability refers to the probability of the second word appearing on the premise of the first word;
and reversely matching each word of the target text by combining a second-order hidden Markov algorithm and the obtained conditional probability of each word, and outputting a plurality of word segmentation paths, wherein each word segmentation path comprises a plurality of word segmentation nodes.
3. The method of claim 2, wherein the step of reversely matching each word in the target text by using the second-order hidden markov algorithm and the obtained conditional probability of each word to output a plurality of word segmentation paths, each word segmentation path including a plurality of word segmentation nodes comprises the steps of:
performing reverse matching on each word of the target text by adopting a second-order hidden Markov algorithm to generate a plurality of pre-segmentation paths;
acquiring a preset weight, wherein the preset weight refers to an influence value of word length and word order based on Chinese grammar;
and correcting the generated plurality of pre-segmentation paths according to the preset weight, and outputting a plurality of word segmentation paths, wherein each path comprises a plurality of word segmentation nodes.
4. The statistical-based Chinese word segmentation method of claim 2, wherein the step of performing reverse recursive processing on the words of each word segmentation node on each word segmentation path in combination with a Viterbi algorithm and a preset scale factor to obtain an optimal word segmentation sequence specifically comprises the steps of:
generating transition probability matrixes with different node lengths according to the conditional probability of words on each word segmentation node on each word segmentation path;
performing reverse recursion processing on the transition probability matrix by adopting a Viterbi algorithm, and outputting a plurality of word segmentation sequences;
and carrying out logarithm extreme value processing on the probabilities of the word segmentation sequences by combining a preset scale factor and an argmax function to obtain an optimal word segmentation sequence, wherein the probability of the optimal word segmentation sequence is the maximum.
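The combination of Viterbi recursion, logarithms and an argmax in claim 4 can be illustrated with the sketch below. It is a simplified, first-order (bigram) variant written for this example, not the patented implementation: it runs the dynamic programme forward over end positions and then backtracks in reverse from the best final state. The names `viterbi_segment`, `floor`, `scale` and the start marker `"<s>"` are assumptions. The claim does not say how the preset scale factor is applied, so here `scale` simply multiplies each log-probability, which leaves the argmax unchanged for any positive value.

```python
import math


def viterbi_segment(text, cond_prob, vocabulary, max_word_len=4, floor=1e-8, scale=1.0):
    """Pick the segmentation with the highest summed log conditional probability."""
    start_marker = "<s>"  # assumed sentence-start symbol
    n = len(text)
    # best[end][word] = (score, start index of word, previous word)
    best = [dict() for _ in range(n + 1)]
    best[0][start_marker] = (0.0, None, None)
    for end in range(1, n + 1):
        for length in range(1, min(max_word_len, end) + 1):
            start = end - length
            word = text[start:end]
            if length > 1 and word not in vocabulary:
                continue  # single characters are always allowed as a fallback
            for prev, (prev_score, _, _) in best[start].items():
                # Unseen pairs receive a small floor probability (an assumption).
                p = cond_prob.get((prev, word), floor)
                score = prev_score + scale * math.log(p)
                if word not in best[end] or score > best[end][word][0]:
                    best[end][word] = (score, start, prev)
    # argmax over the states that end at the last character, then backtrack.
    word = max(best[n], key=lambda w: best[n][w][0])
    end, result = n, []
    while word != start_marker:
        _, start, prev = best[end][word]
        result.append(word)
        end, word = start, prev
    result.reverse()
    return result


# Hypothetical probabilities, for illustration only.
cond_prob = {
    ("<s>", "研究"): 0.6, ("研究", "生命"): 0.5,
    ("<s>", "研究生"): 0.3, ("研究生", "命"): 0.01,
}
segments = viterbi_segment("研究生命", cond_prob, {"研究", "研究生", "生命"})
```

Working in log space turns the product of conditional probabilities into a sum, which avoids numerical underflow on long texts. With the illustrative numbers above, the path "研究 / 生命" scores log(0.6) + log(0.5) and is preferred over "研究生 / 命".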
5. A Chinese word segmentation system based on statistics is characterized by comprising:
the acquisition module is used for acquiring a target text, and the target text comprises a plurality of words;
the recognition module is used for performing word segmentation processing on the target text according to a preset corpus and recognizing a first probability and a second probability, wherein the first probability refers to the word frequency of a single word, and the second probability is the word frequency of two adjacent words;
the output module is used for reversely matching the words contained in the target text by combining the first probability and the second probability, and outputting a plurality of word segmentation paths, wherein each path contains a plurality of word segmentation nodes;
the recursion module is used for carrying out reverse recursion processing on the words of each word segmentation node on each word segmentation path by combining a Viterbi algorithm and a preset scale factor to obtain an optimal word segmentation sequence;
and the generating module is used for generating a word segmentation result according to the optimal word segmentation sequence.
6. The statistics-based Chinese word segmentation system of claim 5, wherein the output module comprises:
the first obtaining unit is used for obtaining the conditional probability of each word in the target text by combining the first probability, the second probability and a Bayesian formula, wherein the conditional probability refers to the probability that a second word appears given that a first word has appeared;
and the reverse matching unit is used for performing reverse matching on each word of the target text by combining a second-order hidden Markov algorithm and the obtained conditional probability of each word, and outputting a plurality of word segmentation paths, wherein each word segmentation path comprises a plurality of word segmentation nodes.
7. The statistics-based Chinese word segmentation system of claim 5, wherein the reverse matching unit comprises:
the generating subunit is used for performing reverse matching on each word of the target text by adopting a second-order hidden Markov algorithm to generate a plurality of word segmentation sequences;
the obtaining subunit is used for acquiring a preset weight, wherein the preset weight refers to an influence value of Chinese grammar on word length and word sequence;
and the output subunit is used for correcting the generated plurality of word segmentation sequences according to the preset weight value and outputting a plurality of word segmentation paths, and each path comprises a plurality of word segmentation nodes.
8. The statistics-based Chinese word segmentation system of claim 6, wherein the recursion module comprises:
the generating unit is used for generating transition probability matrixes with different node lengths according to the conditional probability of words on each word segmentation node on each word segmentation path;
the output unit is used for carrying out reverse recursion processing on the transition probability matrix by adopting a Viterbi algorithm and outputting a plurality of word segmentation sequences;
and the second obtaining unit is used for carrying out logarithm extreme value processing on the probabilities of the word segmentation sequences by combining a preset scale factor and an argmax function to obtain an optimal word segmentation sequence, wherein the probability of the optimal word segmentation sequence is the maximum.
9. An apparatus for automatic generation of computer code, comprising a memory for storing at least one program and a processor for loading the at least one program to perform the method of any one of claims 1 to 4.
10. A storage medium having stored therein processor-executable instructions, which when executed by a processor, are configured to perform the method of any one of claims 1-4.
CN201911392455.1A 2019-12-30 2019-12-30 Chinese word segmentation method, system, device and storage medium based on statistics Active CN111160024B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201911392455.1A CN111160024B (en) 2019-12-30 2019-12-30 Chinese word segmentation method, system, device and storage medium based on statistics

Publications (2)

Publication Number Publication Date
CN111160024A true CN111160024A (en) 2020-05-15
CN111160024B CN111160024B (en) 2023-08-15

Family

ID=70558931

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201911392455.1A Active CN111160024B (en) 2019-12-30 2019-12-30 Chinese word segmentation method, system, device and storage medium based on statistics

Country Status (1)

Country Link
CN (1) CN111160024B (en)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113779062A (en) * 2021-02-23 2021-12-10 北京沃东天骏信息技术有限公司 SQL statement generation method and device, storage medium and electronic equipment

Citations (10)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US4819271A (en) * 1985-05-29 1989-04-04 International Business Machines Corporation Constructing Markov model word baseforms from multiple utterances by concatenating model sequences for word segments
US20090326916A1 (en) * 2008-06-27 2009-12-31 Microsoft Corporation Unsupervised chinese word segmentation for statistical machine translation
CN101819772A (en) * 2010-02-09 2010-09-01 中国船舶重工集团公司第七○九研究所 Phonetic segmentation-based isolate word recognition method
CN104408034A (en) * 2014-11-28 2015-03-11 武汉数为科技有限公司 Text big data-oriented Chinese word segmentation method
CN105718586A (en) * 2016-01-26 2016-06-29 中国人民解放军国防科学技术大学 Word division method and device
CN105975454A (en) * 2016-04-21 2016-09-28 广州精点计算机科技有限公司 Chinese word segmentation method and device of webpage text
CN106844350A (en) * 2017-02-15 2017-06-13 广州索答信息科技有限公司 A kind of computational methods of short text semantic similarity
CN108170680A (en) * 2017-12-29 2018-06-15 厦门市美亚柏科信息股份有限公司 Keyword recognition method, terminal device and storage medium based on Hidden Markov Model
CN108984159A (en) * 2018-06-15 2018-12-11 浙江网新恒天软件有限公司 A kind of breviary phrase extended method based on markov language model
CN109033085A (en) * 2018-08-02 2018-12-18 北京神州泰岳软件股份有限公司 The segmenting method of Chinese automatic word-cut and Chinese text

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
Peng Yu (彭瑜): "Design and Implementation of a Grammar-Based Word Segmentation System" (基于语法的分词系统的设计与实现), pages 138 - 906 *

Also Published As

Publication number Publication date
CN111160024B (en) 2023-08-15

Similar Documents

Publication Publication Date Title
CN108920460B (en) Training method of multi-task deep learning model for multi-type entity recognition
CN103678282B (en) A kind of segmenting method and device
Chen Bayesian grammar induction for language modeling
CN108287820B (en) Text representation generation method and device
US20230076658A1 (en) Method, apparatus, computer device and storage medium for decoding speech data
CN110705294A (en) Named entity recognition model training method, named entity recognition method and device
EP4131076A1 (en) Serialized data processing method and device, and text processing method and device
CN107480143A (en) Dialogue topic dividing method and system based on context dependence
CN111709243A (en) Knowledge extraction method and device based on deep learning
US20060253273A1 (en) Information extraction using a trainable grammar
CN113408272A (en) Method, device, equipment and storage medium for training abstract generation model
CN110210028A (en) For domain feature words extracting method, device, equipment and the medium of speech translation text
KR102550340B1 (en) Chapter-level text translation method and device
CN114120166A (en) Video question and answer method and device, electronic equipment and storage medium
CN114722833B (en) A semantic classification method and device
CN111813923A (en) Text summarization method, electronic device and storage medium
CN113887253A (en) Method, apparatus, and medium for machine translation
CN114048733A (en) Training method of text error correction model, and text error correction method and device
CN113378553B (en) Text processing method, device, electronic equipment and storage medium
CN113553833A (en) Text error correction method and device and electronic equipment
CN114239559A (en) Method, apparatus, device and medium for generating text error correction and text error correction model
CN117271558A (en) Language query model construction method, query language acquisition method and related devices
CN111160024A (en) Chinese word segmentation method, system, device and storage medium based on statistics
CN112926323B (en) Chinese named entity recognition method based on multistage residual convolution and attention mechanism
CN113987135B (en) A bank product problem retrieval method and device

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant
CP03 Change of name, title or address

Address after: 510663 research institute office building, No.9, Kelin Road, Science City, Guangzhou high tech Industrial Development Zone, Guangzhou City, Guangdong Province

Patentee after: GRG BANKING IT Co.,Ltd.

Country or region after: China

Patentee after: Guangdian Yuntong Group Co.,Ltd.

Address before: 510663 research institute office building, No.9, Kelin Road, Science City, Guangzhou high tech Industrial Development Zone, Guangzhou City, Guangdong Province

Patentee before: GRG BANKING IT Co.,Ltd.

Country or region before: China

Patentee before: GRG BANKING EQUIPMENT Co.,Ltd.

TR01 Transfer of patent right

Effective date of registration: 20240618

Address after: Room 701, No. 11, Kelin Road, Science City, Huangpu District, Guangzhou City, Guangdong Province, 510663

Patentee after: GRG BANKING IT Co.,Ltd.

Country or region after: China

Address before: 510663 research institute office building, No.9, Kelin Road, Science City, Guangzhou high tech Industrial Development Zone, Guangzhou City, Guangdong Province

Patentee before: GRG BANKING IT Co.,Ltd.

Country or region before: China

Patentee before: Guangdian Yuntong Group Co.,Ltd.