
CN111160024B - Chinese word segmentation method, system, device and storage medium based on statistics - Google Patents

Chinese word segmentation method, system, device and storage medium based on statistics

Info

Publication number
CN111160024B
Authority
CN
China
Prior art keywords
word segmentation
word
probability
target text
combining
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN201911392455.1A
Other languages
Chinese (zh)
Other versions
CN111160024A (en
Inventor
寇永娴
陈惠芳
蓝飘
胡志乐
李娟�
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
GRG Banking IT Co Ltd
Original Assignee
GRG Banking Equipment Co Ltd
GRG Banking IT Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by GRG Banking Equipment Co Ltd and GRG Banking IT Co Ltd
Priority to CN201911392455.1A
Publication of CN111160024A
Application granted
Publication of CN111160024B
Status: Active
Anticipated expiration

Classifications

    • Y: GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02: TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02D: CLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
    • Y02D10/00: Energy efficient computing, e.g. low power processors, power management or thermal management

Landscapes

  • Machine Translation (AREA)
  • Document Processing Apparatus (AREA)

Abstract

The application discloses a statistics-based Chinese word segmentation method, system, device and storage medium. The method comprises the following steps: acquiring a target text; performing word segmentation on the target text according to a preset corpus and identifying a first probability and a second probability; reverse-matching the words contained in the target text by combining the first probability and the second probability, and outputting a plurality of word segmentation paths, each of which contains a plurality of word segmentation nodes; performing reverse recursion on the word of each word segmentation node on each word segmentation path by combining the Viterbi algorithm with a preset scale factor to obtain an optimal word segmentation sequence; and generating a word segmentation result according to the optimal word segmentation sequence. The method thereby improves the accuracy of Chinese word segmentation, while the preset scale factor reduces the amount of computation and lowers cost.

Description

Chinese word segmentation method, system, device and storage medium based on statistics
Technical Field
The present application relates to the field of information processing technologies, and in particular to a statistics-based Chinese word segmentation method, system, device and storage medium.
Background
Chinese word segmentation is the process of recombining a sequence of consecutive Chinese characters into a sequence of words according to a given specification. It is the basis of Chinese information processing and is widely used in natural language processing and artificial intelligence, with common application scenarios including search engines, speech synthesis and machine translation.
Existing Chinese word segmentation algorithms fall into three main categories: methods based on character string matching, methods based on understanding, and methods based on statistics. A string-matching method matches the Chinese character string to be segmented against the entries of a 'sufficiently large' machine dictionary according to a certain strategy; if a string is found in the dictionary, the match succeeds. Such a method not only requires a dictionary but also has low segmentation accuracy, particularly in ambiguity resolution and new word recognition. An understanding-based method has the computer simulate human understanding of a sentence in order to recognize words. Its basic idea is to perform syntactic and semantic analysis while segmenting, and to use the syntactic and semantic information to resolve ambiguity. However, because of the nature of Chinese linguistic knowledge, it is difficult to organize the various kinds of information into a form a machine can read directly, which conflicts with the method's need for a large amount of linguistic knowledge and information; understanding-based methods are therefore still at the experimental stage. A statistics-based method judges whether a character string forms a word according to how often the string occurs in a corpus: a word is a combination of characters, and the more often adjacent characters occur together, the more likely they are to form a word.
Based on the deficiency of the prior art, how to improve the accuracy of Chinese word segmentation becomes a technical problem to be solved in the industry.
Disclosure of Invention
In order to solve one of the above technical problems, the present application aims to provide a statistics-based Chinese word segmentation method, system, device and storage medium.
The first technical scheme adopted by the application is as follows:
a Chinese word segmentation method based on statistics comprises the following steps:
obtaining a target text, wherein the target text contains a plurality of words;
performing word segmentation on the target text according to a preset corpus, and identifying a first probability and a second probability, wherein the first probability refers to the frequency of a single word, and the second probability refers to the frequency with which two words occur adjacently;
reversely matching words contained in the target text by combining the first probability and the second probability, and outputting a plurality of word segmentation paths, wherein each path contains a plurality of word segmentation nodes;
carrying out reverse recursion processing on words of each word segmentation node on each word segmentation path by combining a Viterbi algorithm and a preset scale factor to obtain an optimal word segmentation sequence;
and generating a word segmentation result according to the optimal word segmentation sequence.
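The identification of the first and second probabilities can be illustrated with a minimal Python sketch. It assumes the preset corpus is available as a list of already segmented sentences and that both probabilities are plain relative frequencies; the function name `estimate_probabilities` and the toy corpus are hypothetical and are not taken from the application.

```python
from collections import Counter


def estimate_probabilities(segmented_sentences):
    """Return (first, second) relative frequencies from pre-segmented sentences.

    first  maps word -> frequency of the single word
    second maps (word, next_word) -> frequency of the adjacent pair
    """
    unigram_counts = Counter()
    bigram_counts = Counter()
    for words in segmented_sentences:
        unigram_counts.update(words)
        # zip pairs every word with its right-hand neighbour
        bigram_counts.update(zip(words, words[1:]))

    total_unigrams = sum(unigram_counts.values())
    total_bigrams = sum(bigram_counts.values())
    first = {w: c / total_unigrams for w, c in unigram_counts.items()}
    second = {pair: c / total_bigrams for pair, c in bigram_counts.items()}
    return first, second


# Toy stand-in for a segmented corpus (assumption, for illustration only)
toy_corpus = [["我", "爱", "北京"], ["我", "爱", "中国"]]
first_prob, second_prob = estimate_probabilities(toy_corpus)
```

A real corpus would normally also need smoothing for unseen words and pairs, which this sketch leaves out.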
Further, the step of reversely matching the words contained in the target text by combining the first probability and the second probability and outputting a plurality of word segmentation paths, wherein each path contains a plurality of word segmentation nodes, specifically comprises the following steps:
combining the first probability, the second probability and the Bayesian formula to obtain the conditional probability of each word in the target text, wherein the conditional probability refers to the probability that the second word occurs given that the first word exists;
and reverse-matching each word of the target text by combining a second-order hidden Markov algorithm with the obtained conditional probability of each word, and outputting a plurality of word segmentation paths, wherein each word segmentation path comprises a plurality of word segmentation nodes.
Further, the step of reverse-matching each word of the target text by combining the second-order hidden Markov algorithm with the obtained conditional probability of each word and outputting a plurality of word segmentation paths, wherein each word segmentation path comprises a plurality of word segmentation nodes, specifically comprises the following steps:
performing reverse matching on each word of the target text by adopting a second-order hidden Markov algorithm to generate a plurality of pre-word paths;
acquiring a preset weight, wherein the preset weight refers to the influence value of word length under Chinese grammar;
and correcting the generated plurality of pre-word paths according to the preset weight, and outputting a plurality of word segmentation paths, wherein each path comprises a plurality of word segmentation nodes.
Further, the step of performing reverse recursion on the word of each word segmentation node on each word segmentation path by combining the Viterbi algorithm with a preset scale factor to obtain an optimal word segmentation sequence specifically comprises the following steps:
generating transition probability matrices of different node lengths according to the conditional probability of the word on each word segmentation node on each word segmentation path;
performing reverse recursion on the transition probability matrices by adopting the Viterbi algorithm, and outputting a plurality of word segmentation sequences;
and combining a preset scale factor and the argmax function to take the logarithmic extremum of the probabilities of the plurality of word segmentation sequences, thereby obtaining the optimal word segmentation sequence, which is the sequence with the largest probability.
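Why logarithms help here can be shown with a short Python sketch. It assumes the scale factor multiplies each probability by 100 before the logarithm is taken, which is one reading of the description; the helper names are hypothetical. Multiplying many small probabilities underflows to 0.0 in double precision, whereas the sum of scaled logarithms stays finite. Scaling adds the same constant log(100) per node, so it does not change the ranking among paths that have the same number of nodes.

```python
import math

SCALE = 100  # the preset scale factor named in the text


def path_probability(probs):
    """Naive product of the probabilities along one path."""
    product = 1.0
    for p in probs:
        product *= p
    return product


def scaled_log_score(probs, scale=SCALE):
    """Sum of log(scale * p): additions replace the chain of multiplications."""
    return sum(math.log(scale * p) for p in probs)


# 400 nodes with probability 0.001 each: the true product is 1e-1200,
# far below the smallest positive double, so the naive product collapses to 0.0
probs = [0.001] * 400
raw = path_probability(probs)
score = scaled_log_score(probs)
```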
The second technical scheme adopted by the application is as follows:
a statistics-based Chinese word segmentation system, comprising:
the acquisition module is used for acquiring a target text, wherein the target text contains a plurality of words;
the recognition module is used for carrying out word segmentation processing on the target text according to a preset corpus, and recognizing a first probability and a second probability, wherein the first probability refers to single word frequency, and the second probability refers to word frequency of two adjacent words;
the output module is used for reversely matching words contained in the target text by combining the first probability and the second probability and outputting a plurality of word segmentation paths, wherein the paths contain a plurality of nodes;
the recursion module is used for carrying out reverse recursion processing on words of each word segmentation node on each word segmentation path by combining a Viterbi algorithm and a preset scale factor to obtain an optimal word segmentation sequence;
and the generating module is used for generating word segmentation results according to the optimal word segmentation sequence.
Further, the output module includes:
the first acquisition unit is used for acquiring the conditional probability of each word in the target text by combining the first probability, the second probability and the Bayesian formula, wherein the conditional probability refers to the probability of the second word on the premise that the first word exists;
and the reverse matching unit is used for reversely matching each word of the target text by combining the second-order hidden Markov algorithm with the obtained word condition probability, outputting a plurality of word segmentation paths, and each word segmentation path comprises a plurality of word segmentation nodes.
Further, the reverse matching unit includes:
the generation subunit is used for reversely matching each word of the target text by adopting a second-order hidden Markov algorithm to generate a plurality of word segmentation sequences;
the acquisition subunit acquires preset weights, wherein the preset weights refer to influence values of word length and word sequence based on Chinese grammar;
the output subunit is used for correcting the generated word segmentation sequences according to preset weights and outputting a plurality of word segmentation paths, and each path comprises a plurality of word segmentation nodes.
Further, the recursion module includes:
the generating unit is used for generating transition probability matrixes with different node lengths according to the conditional probabilities of words on the word segmentation nodes on each word segmentation path;
the output unit is used for carrying out reverse recursion processing on the transition probability matrix by adopting a Viterbi algorithm and outputting a plurality of word segmentation sequences;
the second acquisition unit is used for combining a preset scale factor and an argmax function to perform logarithmic extremum processing on the probabilities of the word segmentation sequences, so as to acquire an optimal word segmentation sequence, wherein the probability of the optimal word segmentation sequence is maximum.
The third technical scheme adopted by the application is as follows:
a device for automatically generating computer code, comprising a memory and a processor, the memory being used for storing at least one program and the processor being used for loading the at least one program to perform the method described above.
The fourth technical scheme adopted by the application is as follows:
a storage medium having stored therein processor executable instructions which when executed by a processor are for performing the method as described above.
The beneficial effects of the application are as follows: the first probability and the second probability of each word in the input target text are obtained from statistics over a preset corpus; the words contained in the target text are reverse-matched using the first probability and the second probability, and a plurality of word segmentation paths are output, each containing a plurality of word segmentation nodes; reverse recursion is performed on the word of each word segmentation node on each word segmentation path according to the Viterbi algorithm and a preset scale factor to obtain the optimal word segmentation sequence; and the word segmentation result is finally generated from the optimal word segmentation sequence. The accuracy of Chinese word segmentation is thereby improved, while the preset scale factor reduces the amount of computation and lowers cost.
Drawings
FIG. 1 is a flow chart of the steps of the statistics-based Chinese word segmentation method provided by the application;
FIG. 2 is a structural block diagram of the statistics-based Chinese word segmentation system provided by the application.
Detailed Description
Example 1
Referring to FIG. 1, the present application provides a statistics-based Chinese word segmentation method, which specifically includes the following steps:
S1, acquiring a target text, wherein the target text contains a plurality of words;
S2, performing word segmentation on the target text according to a preset corpus, and identifying a first probability and a second probability, wherein the first probability refers to the frequency of a single word, and the second probability refers to the frequency with which two words occur adjacently;
S3, reverse-matching the words contained in the target text by combining the first probability and the second probability, and outputting a plurality of word segmentation paths, wherein each path contains a plurality of word segmentation nodes;
S4, performing reverse recursion on the word of each word segmentation node on each word segmentation path by combining the Viterbi algorithm with a preset scale factor to obtain an optimal word segmentation sequence;
S5, generating a word segmentation result according to the optimal word segmentation sequence.
The preset corpus may be the People's Daily 2014 corpus, the Peking University Chinese corpus, or the like; this embodiment preferably uses the People's Daily 2014 corpus. The corpus stores language material that has actually occurred in real use, and is a database that has been segmented according to a specification and annotated with part-of-speech and word-sense tags. By performing word segmentation on the target text in a preset system, the first probability (the frequency of each word) and the second probability (the frequency with which two words occur adjacently) of the target text can be identified. Reverse matching is then performed using the first probability and the second probability. In reverse matching, the input target text is scanned from back to front, and one character is removed at a time until the preset corpus is hit or only a single character remains; a plurality of word segmentation paths are output, each containing a plurality of word segmentation nodes. Reverse recursion is then performed on the word of each word segmentation node on each word segmentation path using the Viterbi algorithm and a preset scale factor to obtain the optimal word segmentation sequence. The preset scale factor works as follows: when the Viterbi algorithm computes the probability value of a word segmentation path, the probabilities are multiplied by 100 and logarithms are taken on both sides of the Viterbi formula. This spares the computer a large number of floating-point operations and also avoids the case in which the result is so close to zero that it cannot be represented and is stored as 0, with a loss of precision. Finally, the word segmentation result is generated according to the optimal word segmentation sequence, which improves the accuracy of Chinese word segmentation and reduces the amount of computation.
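The back-to-front scan can be sketched in Python as follows. This is only an illustration under stated assumptions: the corpus is reduced to a set of known words, the window is capped at a hypothetical `max_len` of four characters, and every vocabulary hit (not just the first) is followed so that several candidate paths come out. The function name `reverse_match_paths` and the sample vocabulary are made up for the example.

```python
def reverse_match_paths(text, vocabulary, max_len=4):
    """Enumerate segmentation paths by matching from the end of the text backwards.

    At each step the window ends at the current tail of the text and shrinks by
    one character at a time; a candidate is accepted when it is in the
    vocabulary or when only a single character remains.
    """
    if not text:
        return [[]]
    paths = []
    longest = min(max_len, len(text))
    for size in range(longest, 0, -1):
        candidate = text[-size:]
        if candidate in vocabulary or size == 1:
            # Segment whatever is left in front of the accepted candidate
            for prefix_path in reverse_match_paths(text[:-size], vocabulary, max_len):
                paths.append(prefix_path + [candidate])
    return paths


vocab = {"研究", "研究生", "生命", "命", "起源"}  # toy vocabulary (assumption)
paths = reverse_match_paths("研究生命起源", vocab)
```

Exhaustive enumeration grows quickly with text length, so a practical implementation would prune or score candidates as it goes rather than list every path.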
Further as a preferred embodiment, the step S3 specifically includes the following steps:
S31, combining the first probability, the second probability and the Bayesian formula to obtain the conditional probability of each word in the target text, wherein the conditional probability refers to the probability that the second word occurs given that the first word exists;
S32, reverse-matching each word of the target text by combining the second-order hidden Markov algorithm with the obtained conditional probability of each word, and outputting a plurality of word segmentation paths, each of which contains a plurality of word segmentation nodes.
In this embodiment, the conditional probability of each word in the target text is obtained from the first probability (the frequency with which a word occurs) and the second probability (the frequency with which two words occur adjacently) using the Bayesian formula. If the probability that word B occurs alone is P(B) and the probability that the two words A and B occur adjacently is P(A, B), then the conditional probability that word A occurs given that word B exists is P(A|B) = P(A, B) / P(B). Similarly, the conditional probability P(B|A) that word B occurs given that word A exists can be obtained with the Bayesian formula from the probability P(A) that word A occurs alone and the probability P(A, B). The second-order hidden Markov algorithm then reverse-matches each word of the target text using the obtained conditional probabilities and outputs a plurality of word segmentation paths, each containing a plurality of word segmentation nodes.
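The formula P(A|B) = P(A, B) / P(B) can be checked with a few lines of Python. The numeric values are invented for illustration, and returning 0.0 when the marginal is zero is an assumption of this sketch rather than something the description specifies.

```python
def conditional_probability(joint, marginal):
    """P(A|B) = P(A, B) / P(B); returns 0.0 when B was never observed."""
    if marginal <= 0.0:
        return 0.0
    return joint / marginal


p_b = 0.02    # hypothetical probability that word B occurs alone
p_ab = 0.005  # hypothetical probability that A and B occur adjacently
p_a_given_b = conditional_probability(p_ab, p_b)
```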
Further as a preferred embodiment, the step S33 specifically includes the following steps:
S331, performing reverse matching on each word of the target text by adopting a second-order hidden Markov algorithm to generate a plurality of pre-word paths;
S332, acquiring a preset weight, wherein the preset weight refers to the influence value of word length under Chinese grammar;
S333, correcting the generated plurality of pre-word paths according to the preset weight, and outputting a plurality of word segmentation paths, wherein each path comprises a plurality of word segmentation nodes.
Specifically, after the conditional probability of each word in the target text is obtained, the second-order hidden Markov algorithm reverse-matches each word in the target text and generates a plurality of pre-word paths. In this embodiment, the generated pre-word paths are corrected with the preset weight and a plurality of word segmentation paths are output, where the preset weight is an influence value of word length and word sequence based on Chinese grammar. For example, for the texts 'true and true' and 'true and true', the pre-word path matched by the second-order hidden Markov algorithm is 'true/true', which is hard to understand; after this pre-word path is corrected with the preset weight based on Chinese grammar, the word segmentation path 'true/true' is output. The word segmentation path corrected with the preset weight clearly conforms better to how Chinese is understood.
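The description does not say how the preset weight is applied, so the following Python sketch is only one possible reading: each candidate path carries a base log score, and a per-word weight that depends on word length is added in log space before the paths are re-ranked. The weight table, the default value, the base scores and the function names are all hypothetical.

```python
import math

# Hypothetical length weights: single characters are penalised,
# two-character words are left unchanged.
LENGTH_WEIGHT = {1: 0.5, 2: 1.0, 3: 0.9, 4: 0.8}


def corrected_score(path, base_log_score, length_weight=LENGTH_WEIGHT, default=0.7):
    """Add the log of each word's length weight to the path's base log score."""
    bonus = sum(math.log(length_weight.get(len(word), default)) for word in path)
    return base_log_score + bonus


def rank_paths(pre_paths):
    """Sort (path, base_log_score) pairs by corrected score, best first."""
    return sorted(pre_paths, key=lambda item: corrected_score(*item), reverse=True)


pre_paths = [
    (["研", "究", "生命", "起源"], -9.0),  # slightly better raw score
    (["研究", "生命", "起源"], -9.2),
]
ranked = rank_paths(pre_paths)
```

In this toy case the path with fewer single-character fragments moves to the front after correction, even though its uncorrected score was lower.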
Further as a preferred embodiment, the step S4 specifically includes the following steps:
S41, generating transition probability matrices of different node lengths according to the conditional probabilities of the words on the word segmentation nodes on each word segmentation path;
S42, performing reverse recursion on the transition probability matrices by adopting the Viterbi algorithm, and outputting a plurality of word segmentation sequences;
S43, combining a preset scale factor and the argmax function to perform logarithmic extremum processing on the probabilities of the plurality of word segmentation sequences, thereby obtaining the optimal word segmentation sequence, which is the sequence with the largest probability.
In this embodiment, a transition probability matrix [a_ji] of differing node lengths is generated from the conditional probabilities on the word segmentation nodes of each word segmentation path, and the Viterbi algorithm performs reverse recursion on the transition probability matrix [a_ji] and outputs a plurality of word segmentation sequences. Logarithmic extremum processing is then applied to the probabilities of these word segmentation sequences using the preset scale factor (100) and the argmax function, and the word segmentation sequence with the largest probability value is taken as the optimal word segmentation sequence. Here, the probability of a word segmentation sequence is the probability that the Viterbi algorithm computes for it from the transition probabilities of differing node lengths.
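A compact Python sketch of this step is given below. It is an interpretation, not the patented implementation: the transition matrix is represented as a dictionary of conditional probabilities keyed by (previous word, next word), the recursion is resolved from the end of the text backwards through memoisation, each probability is multiplied by the scale factor 100 before the logarithm is taken, and unseen words or pairs fall back to a hypothetical floor value `FLOOR`. All names and numbers are assumptions.

```python
import math
from functools import lru_cache

SCALE = 100   # preset scale factor from the text
FLOOR = 1e-8  # hypothetical probability for unseen words and pairs


def viterbi_segment(text, unigram, conditional, max_len=4):
    """Return (best_path, score) maximising the sum of log(SCALE * p)."""
    n = len(text)
    if n == 0:
        return [], 0.0

    def candidate_ends(start):
        # Known words starting at `start`, plus the single character as fallback
        ends = []
        for end in range(start + 1, min(n, start + max_len) + 1):
            if text[start:end] in unigram or end == start + 1:
                ends.append(end)
        return ends

    @lru_cache(maxsize=None)
    def best_from(start, end):
        """Best score and path for text[end:], given previous word text[start:end]."""
        if end == n:
            return 0.0, ()
        prev = text[start:end]
        options = []
        for nxt_end in candidate_ends(end):
            nxt = text[end:nxt_end]
            step = math.log(SCALE * conditional.get((prev, nxt), FLOOR))
            tail_score, tail_path = best_from(end, nxt_end)
            options.append((step + tail_score, (nxt,) + tail_path))
        return max(options, key=lambda option: option[0])  # argmax over successors

    starts = []
    for end in candidate_ends(0):
        first = text[:end]
        head = math.log(SCALE * unigram.get(first, FLOOR))
        tail_score, tail_path = best_from(0, end)
        starts.append((head + tail_score, (first,) + tail_path))
    score, path = max(starts, key=lambda option: option[0])
    return list(path), score


# Toy probabilities (assumptions, for illustration only)
unigram = {"研究": 0.02, "研究生": 0.005, "生命": 0.01, "命": 0.001, "起源": 0.004}
conditional = {
    ("研究", "生命"): 0.2,
    ("生命", "起源"): 0.3,
    ("研究生", "命"): 0.001,
    ("命", "起源"): 0.01,
}
best_path, best_score = viterbi_segment("研究生命起源", unigram, conditional)
```

Because the scale factor adds log(100) for every node, paths with more nodes receive a larger constant offset; whether the original method compensates for this is not stated in the description.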
Example two
Referring to FIG. 2, which is a structural block diagram, the statistics-based Chinese word segmentation system of the present application includes:
the acquisition module is used for acquiring a target text, wherein the target text contains a plurality of words;
the recognition module is used for carrying out word segmentation processing on the target text according to a preset corpus, and recognizing a first probability and a second probability, wherein the first probability refers to single word frequency, and the second probability refers to word frequency of two adjacent words;
the output module is used for reversely matching words contained in the target text by combining the first probability and the second probability and outputting a plurality of word segmentation paths, wherein the paths contain a plurality of nodes;
the recursion module is used for carrying out reverse recursion processing on words of each word segmentation node on each word segmentation path by combining a Viterbi algorithm and a preset scale factor to obtain an optimal word segmentation sequence;
and the generating module is used for generating word segmentation results according to the optimal word segmentation sequence.
Further, the output module includes:
the first acquisition unit is used for acquiring the conditional probability of each word in the target text by combining the first probability, the second probability and the Bayesian formula, wherein the conditional probability refers to the probability of the second word on the premise that the first word exists;
and the reverse matching unit is used for reversely matching each word of the target text by combining the second-order hidden Markov algorithm with the obtained word condition probability, outputting a plurality of word segmentation paths, and each word segmentation path comprises a plurality of word segmentation nodes.
Further, the reverse matching unit includes:
the generation subunit is used for reversely matching each word of the target text by adopting a second-order hidden Markov algorithm to generate a plurality of word segmentation sequences;
the acquisition subunit acquires preset weights, wherein the preset weights refer to influence values of word length and word sequence based on Chinese grammar;
the output subunit is used for correcting the generated word segmentation sequences according to preset weights and outputting a plurality of word segmentation paths, and each path comprises a plurality of word segmentation nodes.
Further, the recursion module includes:
the generating unit is used for generating transition probability matrixes with different node lengths according to the conditional probabilities of words on the word segmentation nodes on each word segmentation path;
the output unit is used for carrying out reverse recursion processing on the transition probability matrix by adopting a Viterbi algorithm and outputting a plurality of word segmentation sequences;
the second acquisition unit is used for combining a preset scale factor and an argmax function to perform logarithmic extremum processing on the probabilities of the word segmentation sequences, so as to acquire an optimal word segmentation sequence, wherein the probability of the optimal word segmentation sequence is maximum.
Example III
An apparatus for automatically generating computer code, comprising a memory for storing at least one program and a processor for loading the at least one program to perform the method of embodiment one.
The computer code automatic generation device of the embodiment can execute the Chinese word segmentation method based on statistics provided by the first embodiment of the method of the application, can execute any combination implementation steps of the method embodiment, and has corresponding functions and beneficial effects of the method.
Example IV
A storage medium having stored therein processor-executable instructions which, when executed by a processor, are adapted to carry out the method of embodiment one.
The storage medium of the embodiment can execute the Chinese word segmentation method based on statistics provided by the first embodiment of the method of the application, and can execute the implementation steps of any combination of the embodiments of the method, thereby having the corresponding functions and beneficial effects of the method.
While the preferred embodiment of the present application has been described in detail, the present application is not limited to the embodiments, and those skilled in the art can make various equivalent modifications or substitutions without departing from the spirit of the present application, and the equivalent modifications or substitutions are included in the scope of the present application as defined in the appended claims.

Claims (8)

1. A Chinese word segmentation method based on statistics is characterized by comprising the following steps:
obtaining a target text, wherein the target text contains a plurality of words;
performing word segmentation on the target text according to a preset corpus, and identifying a first probability and a second probability, wherein the first probability refers to the frequency of a single word, and the second probability refers to the frequency with which two words occur adjacently;
reversely matching words contained in the target text by combining the first probability and the second probability, and outputting a plurality of word segmentation paths, wherein each path contains a plurality of word segmentation nodes;
carrying out reverse recursion processing on words of each word segmentation node on each word segmentation path by combining a Viterbi algorithm and a preset scale factor to obtain an optimal word segmentation sequence;
generating word segmentation results according to the optimal word segmentation sequences;
the step of reversely matching words contained in the target text by combining the first probability and the second probability and outputting a plurality of word segmentation paths, wherein each path contains a plurality of word segmentation nodes, specifically comprises the following steps:
combining the first probability, the second probability and the Bayesian formula to obtain the conditional probability of each word in the target text, wherein the conditional probability refers to the probability that the second word occurs given that the first word exists;
and reverse-matching each word of the target text by combining a second-order hidden Markov algorithm with the obtained conditional probability of each word, and outputting a plurality of word segmentation paths, wherein each word segmentation path comprises a plurality of word segmentation nodes.
2. The statistics-based Chinese word segmentation method according to claim 1, wherein the step of reverse-matching each word of the target text by combining the second-order hidden Markov algorithm with the obtained conditional probability of each word and outputting a plurality of word segmentation paths, wherein each word segmentation path comprises a plurality of word segmentation nodes, comprises the following steps:
performing reverse matching on each word of the target text by adopting a second-order hidden Markov algorithm to generate a plurality of pre-word paths;
acquiring a preset weight, wherein the preset weight refers to the influence value of word length under Chinese grammar;
and correcting the generated plurality of pre-word paths according to the preset weight, and outputting a plurality of word segmentation paths, wherein each path comprises a plurality of word segmentation nodes.
3. The method of claim 1, wherein the step of performing reverse recursion processing on the word of each word segmentation node on each word segmentation path by combining the Viterbi algorithm and a preset scale factor to obtain an optimal word segmentation sequence comprises the following steps:
generating transition probability matrices with different node lengths according to the conditional probability of the word on each word segmentation node on each word segmentation path;
carrying out reverse recursion on the transition probability matrices by adopting the Viterbi algorithm, and outputting a plurality of word segmentation sequences;
and taking the logarithmic extremum of the probabilities of the plurality of word segmentation sequences by combining the preset scale factor and an argmax function, to obtain the optimal word segmentation sequence, wherein the probability of the optimal word segmentation sequence is the largest.
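As a rough illustration of the Viterbi step, the Python sketch below finds the highest-probability segmentation under a bigram model, working in log space and recovering the sequence by following back-pointers from the end of the text. It is a sketch under stated assumptions, not the patented procedure: the transition probabilities are held in a dictionary rather than matrices, unseen word pairs receive a hypothetical floor probability, and the preset scale factor is left out because the claim does not define how it is applied. The function name `viterbi_segment` and all probability values are made up for the example.

```python
import math


def viterbi_segment(text, vocabulary, cond_prob, floor=1e-6, max_word_len=4):
    """Return the most probable segmentation of `text` under a bigram model."""
    START = "<s>"
    # best[(end, word)] = (log score, back-pointer to the previous state)
    best = {(0, START): (0.0, None)}
    for end in range(1, len(text) + 1):
        for size in range(1, min(max_word_len, end) + 1):
            word = text[end - size:end]
            if word not in vocabulary:
                continue
            start = end - size
            # Extend every state that ends exactly where this word begins.
            candidates = [
                (score + math.log(cond_prob.get((prev, word), floor)), (pos, prev))
                for (pos, prev), (score, _) in best.items()
                if pos == start
            ]
            if candidates:
                best[(end, word)] = max(candidates, key=lambda c: c[0])
    # argmax over the log scores of all states that cover the whole text.
    finals = [(score, state) for state, (score, _) in best.items()
              if state[0] == len(text)]
    if not finals:
        return []
    _, state = max(finals, key=lambda f: f[0])
    words = []
    while state is not None and state[1] != START:  # backward trace
        words.append(state[1])
        state = best[state][1]
    return words[::-1]


# Hypothetical vocabulary and conditional probabilities.
vocabulary = {"研究", "研究生", "生命", "命", "生", "起源"}
cond_prob = {
    ("<s>", "研究"): 0.6, ("<s>", "研究生"): 0.4,
    ("研究", "生命"): 0.5, ("生命", "起源"): 0.5,
    ("研究生", "命"): 0.01, ("命", "起源"): 0.1,
}
result = viterbi_segment("研究生命起源", vocabulary, cond_prob)
```

Summing logarithms instead of multiplying probabilities avoids numerical underflow on long sentences and leaves the argmax unchanged, since the logarithm is monotonic.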
4. A statistics-based Chinese word segmentation system, comprising:
the acquisition module is used for acquiring a target text, wherein the target text contains a plurality of words;
the recognition module is used for carrying out word segmentation processing on the target text according to a preset corpus, and recognizing a first probability and a second probability, wherein the first probability refers to single word frequency, and the second probability refers to word frequency of two adjacent words;
the output module is used for reversely matching words contained in the target text by combining the first probability and the second probability and outputting a plurality of word segmentation paths, wherein the paths contain a plurality of nodes;
the recursion module is used for carrying out reverse recursion processing on words of each word segmentation node on each word segmentation path by combining a Viterbi algorithm and a preset scale factor to obtain an optimal word segmentation sequence;
the generating module is used for generating word segmentation results according to the optimal word segmentation sequence;
the output module includes:
the first acquisition unit is used for acquiring the conditional probability of each word in the target text by combining the first probability, the second probability and the Bayesian formula, wherein the conditional probability refers to the probability of the second word on the premise that the first word exists;
and the reverse matching unit is used for reversely matching each word of the target text by combining the second-order hidden Markov algorithm with the obtained conditional probability of each word, and outputting a plurality of word segmentation paths, wherein each word segmentation path comprises a plurality of word segmentation nodes.
5. The statistics-based Chinese word segmentation system as recited in claim 4, wherein the reverse matching unit comprises:
the generation subunit is used for reversely matching each word of the target text by adopting a second-order hidden Markov algorithm to generate a plurality of word segmentation sequences;
the acquisition subunit is used for acquiring preset weights, wherein the preset weights refer to influence values of word length and word sequence based on Chinese grammar;
the output subunit is used for correcting the generated word segmentation sequences according to preset weights and outputting a plurality of word segmentation paths, and each path comprises a plurality of word segmentation nodes.
6. The statistics-based Chinese word segmentation system as recited in claim 4, wherein the recursion module comprises:
the generating unit is used for generating transition probability matrices with different node lengths according to the conditional probabilities of words on the word segmentation nodes on each word segmentation path;
the output unit is used for carrying out reverse recursion processing on the transition probability matrix by adopting a Viterbi algorithm and outputting a plurality of word segmentation sequences;
the second acquisition unit is used for combining a preset scale factor and an argmax function to perform logarithmic extremum processing on the probabilities of the word segmentation sequences, so as to acquire an optimal word segmentation sequence, wherein the probability of the optimal word segmentation sequence is maximum.
7. An automatic computer code generating device comprising a memory for storing at least one program and a processor for loading the at least one program to perform the method of any of claims 1-3.
8. A storage medium having stored therein processor executable instructions which, when executed by a processor, are adapted to carry out the method of any one of claims 1-3.
CN201911392455.1A 2019-12-30 2019-12-30 Chinese word segmentation method, system, device and storage medium based on statistics Active CN111160024B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201911392455.1A CN111160024B (en) 2019-12-30 2019-12-30 Chinese word segmentation method, system, device and storage medium based on statistics

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201911392455.1A CN111160024B (en) 2019-12-30 2019-12-30 Chinese word segmentation method, system, device and storage medium based on statistics

Publications (2)

Publication Number Publication Date
CN111160024A CN111160024A (en) 2020-05-15
CN111160024B true CN111160024B (en) 2023-08-15

Family

ID=70558931

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201911392455.1A Active CN111160024B (en) 2019-12-30 2019-12-30 Chinese word segmentation method, system, device and storage medium based on statistics

Country Status (1)

Country Link
CN (1) CN111160024B (en)

Families Citing this family (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113779062B (en) * 2021-02-23 2025-02-21 北京沃东天骏信息技术有限公司 SQL statement generation method, device, storage medium and electronic device

Citations (9)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US4819271A (en) * 1985-05-29 1989-04-04 International Business Machines Corporation Constructing Markov model word baseforms from multiple utterances by concatenating model sequences for word segments
CN101819772A (en) * 2010-02-09 2010-09-01 中国船舶重工集团公司第七○九研究所 Phonetic segmentation-based isolate word recognition method
CN104408034A (en) * 2014-11-28 2015-03-11 武汉数为科技有限公司 Text big data-oriented Chinese word segmentation method
CN105718586A (en) * 2016-01-26 2016-06-29 中国人民解放军国防科学技术大学 Word division method and device
CN105975454A (en) * 2016-04-21 2016-09-28 广州精点计算机科技有限公司 Chinese word segmentation method and device of webpage text
CN106844350A (en) * 2017-02-15 2017-06-13 广州索答信息科技有限公司 A kind of computational methods of short text semantic similarity
CN108170680A (en) * 2017-12-29 2018-06-15 厦门市美亚柏科信息股份有限公司 Keyword recognition method, terminal device and storage medium based on Hidden Markov Model
CN108984159A (en) * 2018-06-15 2018-12-11 浙江网新恒天软件有限公司 A kind of breviary phrase extended method based on markov language model
CN109033085A (en) * 2018-08-02 2018-12-18 北京神州泰岳软件股份有限公司 The segmenting method of Chinese automatic word-cut and Chinese text

Family Cites Families (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20090326916A1 (en) * 2008-06-27 2009-12-31 Microsoft Corporation Unsupervised chinese word segmentation for statistical machine translation

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
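Peng Yu (彭瑜). "Design and Implementation of a Grammar-Based Word Segmentation System". China Master's Theses Full-text Database, Information Science and Technology. 2014, I138-906. *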

Also Published As

Publication number Publication date
CN111160024A (en) 2020-05-15

Similar Documents

Publication Publication Date Title
CN109857845B (en) Model training and data retrieval method, device, terminal and computer-readable storage medium
US20210193121A1 (en) Speech recognition method, apparatus, and device, and storage medium
US10073673B2 (en) Method and system for robust tagging of named entities in the presence of source or translation errors
Xu et al. Minimum bayes risk decoding and system combination based on a recursion for edit distance
EP4131255A1 (en) Method and apparatus for decoding voice data, computer device and storage medium
US20180276525A1 (en) Method and neural network system for human-computer interaction, and user equipment
KR101543992B1 (en) Intra-language statistical machine translation
CN108287820B (en) Text representation generation method and device
CN112100354B (en) Man-machine conversation method, device, equipment and storage medium
CN111859964B (en) Method and device for identifying named entities in sentences
CN111739514B (en) Voice recognition method, device, equipment and medium
CN106503231B (en) Search method and device based on artificial intelligence
CN110619043A (en) Automatic text abstract generation method based on dynamic word vector
US8356065B2 (en) Similar text search method, similar text search system, and similar text search program
CN102479191A (en) Method and device for providing multi-granularity word segmentation result
JPH0689302A (en) Dictionary memory
US11423237B2 (en) Sequence transduction neural networks
CN111666374A (en) Method for integrating additional knowledge information into deep language model
CN117648933A (en) Natural language ambiguity resolution method and system based on deep learning and knowledge base
CN111160024B (en) Chinese word segmentation method, system, device and storage medium based on statistics
Fusayasu et al. Word-Error Correction of Continuous Speech Recognition Based on Normalized Relevance Distance.
CN112447172B (en) Quality improvement method and device for voice recognition text
CN112766002B (en) Text alignment method and system based on dynamic programming
Besacier et al. Word confidence estimation for speech translation
CN113111651A (en) Chinese word segmentation method and device and search word bank reading method

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant
CP03 Change of name, title or address

Address after: 510663 research institute office building, No.9, Kelin Road, Science City, Guangzhou high tech Industrial Development Zone, Guangzhou City, Guangdong Province

Patentee after: GRG BANKING IT Co.,Ltd.

Country or region after: China

Patentee after: Guangdian Yuntong Group Co.,Ltd.

Address before: 510663 research institute office building, No.9, Kelin Road, Science City, Guangzhou high tech Industrial Development Zone, Guangzhou City, Guangdong Province

Patentee before: GRG BANKING IT Co.,Ltd.

Country or region before: China

Patentee before: GRG BANKING EQUIPMENT Co.,Ltd.

TR01 Transfer of patent right

Effective date of registration: 20240618

Address after: Room 701, No. 11, Kelin Road, Science City, Huangpu District, Guangzhou City, Guangdong Province, 510663

Patentee after: GRG BANKING IT Co.,Ltd.

Country or region after: China

Address before: 510663 research institute office building, No.9, Kelin Road, Science City, Guangzhou high tech Industrial Development Zone, Guangzhou City, Guangdong Province

Patentee before: GRG BANKING IT Co.,Ltd.

Country or region before: China

Patentee before: Guangdian Yuntong Group Co.,Ltd.