CN108595426B

CN108595426B - A word vector optimization method based on the structural information of Chinese characters

Info

Publication number: CN108595426B
Application number: CN201810368909.0A
Authority: CN
Inventors: 郭宇春; 潘常玮; 陈一帅
Original assignee: Beijing Jiaotong University
Current assignee: Beijing Jiaotong University
Priority date: 2018-04-23
Filing date: 2018-04-23
Publication date: 2021-07-20
Anticipated expiration: 2038-04-23
Also published as: CN108595426A

Abstract

The invention provides a word vector optimization method based on the structural information of Chinese characters. The method includes: obtaining a distributed word vector of a word to be processed; representing the morphological features of the word according to the Chinese characters it contains, thereby obtaining a morphological feature vector of the word; and combining the morphological feature vector with the distributed word vector to obtain an optimized feature vector of the word. The invention designs a scheme that optimizes word vector representation using Chinese glyph structure information: it builds on existing neural-network distributed word representation techniques, combines them with Chinese glyph structure features, and optimizes the word vector features for actual natural language processing tasks. This enhances the expressive power and the generalization and transfer ability of word vectors, and helps improve the feature representation of low-frequency words and unknown words.

Description

Word vector optimization method based on Chinese character font structural information

Technical Field

The invention relates to the technical field of word vector representation, in particular to a word vector optimization method based on Chinese character font structural information.

Background

In the traditional approach, words in a text are represented numerically by one-hot representation. This representation merely symbolizes the words: it carries no semantic information and yields a high-dimensional, sparse representation. The distributional hypothesis, which states that the semantics of a word are determined by its context, made it possible to further optimize word vectors by incorporating semantics into the word representation. The neural-network-based distributed representation is generally called word embedding or distributed representation: the original sparse, very high-dimensional representation is compressed and embedded into a space of lower dimensionality. Semantic representation in word vector form is the basis of neural translation models and has become the basis of various natural language processing tasks. Designing a better word vector model is therefore a common challenge for natural language processing tasks such as text classification, machine translation, and language modeling.
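The contrast between the two representations can be sketched as follows. This is only an illustration: the toy vocabulary and the 3-dimensional embedding values are invented for the example and are not learned from any corpus.

```python
# Toy vocabulary; words and vectors are hypothetical, chosen only for illustration.
vocab = ["sea", "ocean", "river", "car"]

def one_hot(word, vocab):
    """Return a sparse indicator vector with a single 1 at the word's index."""
    return [1 if w == word else 0 for w in vocab]

def dot(u, v):
    """Dot product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))

# Hypothetical dense embeddings (values invented, not trained).
embedding = {
    "sea":   [0.9, 0.1, 0.0],
    "ocean": [0.8, 0.2, 0.1],
    "river": [0.6, 0.3, 0.0],
    "car":   [0.0, 0.1, 0.9],
}

# Distinct one-hot vectors are always orthogonal, so they encode no similarity.
one_hot_similarity = dot(one_hot("sea", vocab), one_hot("ocean", vocab))

# Dense vectors can place related words close together.
dense_similarity = dot(embedding["sea"], embedding["ocean"])
dense_dissimilarity = dot(embedding["sea"], embedding["car"])
```

The one-hot vectors grow with the vocabulary size, whereas the dense vectors keep a fixed, small dimensionality.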

For low-frequency words and unknown words, the prior-art neural network distributed representation methods substitute a special word vector (such as "UNK"). The reason is that distributed semantic representation is itself a statistical learning method: the accuracy of the semantic representation depends on sufficient sample data, from which statistical commonalities are learned and encoded into a distributed, low-dimensional numerical representation. When a word occurs rarely, or has never been seen before, the confidence of its word vector representation is therefore low, and semantic deviation arises from the characteristics of individual samples.
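The prior-art substitution described above can be sketched as follows. The function names, the sample corpus, and the minimum count of 2 are assumptions made for this illustration; the source only states that a special vector such as "UNK" replaces rare and unknown words.

```python
from collections import Counter

UNK = "UNK"  # special token standing in for rare and unknown words

def build_vocab(tokens, min_count):
    """Keep only words seen at least min_count times (hypothetical helper)."""
    counts = Counter(tokens)
    return {word for word, count in counts.items() if count >= min_count}

def replace_rare(tokens, vocab):
    """Map every out-of-vocabulary token to the single UNK token."""
    return [token if token in vocab else UNK for token in tokens]

corpus = ["sea", "ocean", "sea", "river", "sea", "ocean", "lagoon"]
vocab = build_vocab(corpus, min_count=2)

# "river" and "lagoon" are rare, "fjord" is unseen: all three collapse to UNK.
encoded = replace_rare(corpus + ["fjord"], vocab)
```

After the substitution, all rare and unseen words share one vector, so any information that distinguishes them is lost.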

Disclosure of Invention

The embodiment of the invention provides a word vector optimization method based on Chinese character font structural information, which aims to overcome the problems in the prior art.

In order to achieve the purpose, the invention adopts the following technical scheme.

A word vector optimization method based on Chinese character font structural information comprises the following steps:

acquiring a distributed word vector of a word to be processed;

representing the morphological features of the word according to the Chinese characters contained in the word to be processed, and acquiring a morphological feature vector of the word to be processed;

and combining the morphological feature vector and the distributed word vector of the word to be processed to obtain an optimized feature vector of the word to be processed.

Further, the obtaining of the distributed word vector of the word to be processed includes:

first performing word segmentation preprocessing on the original text containing the word to be processed, then computing distributed word vector representations for the words in the preprocessed text, and obtaining the distributed word vector of the word to be processed.

Setting a word frequency threshold, counting the word frequency of the word to be processed by utilizing a preset word bank, and judging whether the word frequency of the word to be processed is lower than the set word frequency threshold.

Further, representing the morphological features of the word according to the Chinese characters contained in the word to be processed, and obtaining the morphological feature vector of the word to be processed, includes:

autonomously extracting and learning Chinese character structural information by means of deep learning, and storing all the Chinese character structural information in a Chinese character structure database;

decomposing and counting all characters in the original text of the word to be processed, respectively querying the Chinese character structure database according to each character, acquiring the structure information of each character, and expressing the structure information of each character as a low-dimensional feature vector by using an unsupervised feature extraction method;

and carrying out an averaging operation on the low-dimensional feature vectors corresponding to all the characters, and taking the obtained average as the morphological feature vector of the word to be processed.

Further, combining the morphological feature vector and the distributed word vector of the word to be processed to obtain the optimized feature vector of the word to be processed includes:

concatenating the morphological feature vector and the distributed word vector along the dimension axis to generate a fused word vector, and taking the fused word vector as the optimized feature vector of the word to be processed; or

using the morphological feature vector and a preset similarity measure to find one or more neighbor words of the word to be processed in a lexicon, then averaging the distributed word vectors of the one or more neighbor words and the distributed word vector of the word to be processed, taking the resulting average as the optimized feature vector of the word to be processed, and using the optimized feature vector as a semantic word vector shared by the one or more neighbor words and the word to be processed.

As can be seen from the technical solutions provided above, the embodiments of the present invention design a scheme for optimizing word vector representation using Chinese glyph structure information. The scheme builds on existing neural-network distributed word representation techniques, combines them with Chinese glyph structure features, and optimizes the word vector features for actual natural language processing tasks, so that the expressive power and the generalization and transfer ability of word vectors are enhanced. The method helps improve the feature representation of low-frequency words and unknown words, and thereby improves the performance of natural language processing tasks such as text classification.

Additional aspects and advantages of the invention will be set forth in part in the description which follows, and in part will be obvious from the description, or may be learned by practice of the invention.

Drawings

In order to illustrate the technical solutions of the embodiments of the present invention more clearly, the drawings used in the description of the embodiments are briefly introduced below. The drawings described below show only some embodiments of the present invention, and those skilled in the art can obtain other drawings from them without creative effort.

Fig. 1 is a schematic diagram of an implementation of a word vector optimization method based on structural information of Chinese character patterns according to an embodiment of the present invention;

Fig. 2 is a processing flow chart of a word vector optimization method based on structural information of Chinese character patterns according to an embodiment of the present invention.

Detailed Description

Reference will now be made in detail to embodiments of the present invention, examples of which are illustrated in the accompanying drawings, wherein like reference numerals refer to the same or similar elements or elements having the same or similar function throughout. The embodiments described below with reference to the accompanying drawings are illustrative only for the purpose of explaining the present invention, and are not to be construed as limiting the present invention.

As used herein, the singular forms "a", "an", and "the" are intended to include the plural forms as well, unless the context clearly indicates otherwise. It will be further understood that the terms "comprises" and/or "comprising," when used in this specification, specify the presence of stated features, integers, steps, operations, elements, and/or components, but do not preclude the presence or addition of one or more other features, integers, steps, operations, elements, components, and/or groups thereof. It will be understood that when an element is referred to as being "connected" or "coupled" to another element, it can be directly connected or coupled to the other element or intervening elements may also be present. Further, "connected" or "coupled" as used herein may include wirelessly connected or coupled. As used herein, the term "and/or" includes any and all combinations of one or more of the associated listed items.

It will be understood by those skilled in the art that, unless otherwise defined, all terms (including technical and scientific terms) used herein have the same meaning as commonly understood by one of ordinary skill in the art to which this invention belongs. It will be further understood that terms, such as those defined in commonly used dictionaries, should be interpreted as having a meaning that is consistent with their meaning in the context of the prior art and will not be interpreted in an idealized or overly formal sense unless expressly so defined herein.

For the convenience of understanding the embodiments of the present invention, the following description will be further explained by taking several specific embodiments as examples in conjunction with the drawings, and the embodiments are not to be construed as limiting the embodiments of the present invention.

The embodiment of the invention discloses a word vector optimization scheme based on Chinese character glyph structure information, which mainly involves: a method for extracting the glyph structure of Chinese characters; a method for representing word feature vectors from glyphs; strategies for combining the glyph-structure feature vectors of words with the statistics-based distributed word feature vectors; and the use of the improved word vectors in actual natural language processing tasks, including how to select among the combination schemes according to the scenario so as to achieve more accurate representation optimization.

Fig. 1 shows a schematic diagram of the implementation principle of the word vector optimization method based on Chinese character glyph structure information provided by the embodiment of the invention, and Fig. 2 shows the specific processing flow, which comprises the following processing steps:

Step 21: Extract the structure information of the Chinese characters, and store the structure information of all Chinese characters in a Chinese character structure database.

The Chinese character structural information is extracted and learned autonomously by means of deep learning, and all of it is stored in the Chinese character structure database.

Step 22: Obtain the distributed word vector of the word to be processed.

First, the original text containing the word to be processed is preprocessed, for example by word segmentation. Distributed word vector representations are then computed for the words in the text using methods such as CBOW (Continuous Bag-of-Words) and Skip-gram, yielding the distributed word vector of the word to be processed.
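As a rough illustration of how the two named methods consume segmented text, the sketch below builds their training examples only; it does not train a model. The function names, the window size of 1, and the sample sentence are assumptions made for this example.

```python
def skipgram_pairs(tokens, window):
    """Skip-gram: each center word is paired with every word in its window."""
    pairs = []
    for i, center in enumerate(tokens):
        start = max(0, i - window)
        end = min(len(tokens), i + window + 1)
        for j in range(start, end):
            if j != i:
                pairs.append((center, tokens[j]))
    return pairs

def cbow_examples(tokens, window):
    """CBOW: the surrounding context words jointly predict the center word."""
    examples = []
    for i, target in enumerate(tokens):
        context = tokens[max(0, i - window):i] + tokens[i + 1:i + window + 1]
        if context:
            examples.append((context, target))
    return examples

# Hypothetical output of the word segmentation step.
segmented = ["the", "ocean", "is", "deep"]

pairs = skipgram_pairs(segmented, window=1)
examples = cbow_examples(segmented, window=1)
```

In both cases a word that appears in few windows contributes few training examples, which is why its learned vector is less reliable.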

A word frequency threshold is set, the word frequency of the word to be processed is counted using a preset lexicon, and it is determined whether that frequency is below the threshold. Prior-art methods can only represent low-frequency words with a uniform "UNK" substitution. The method of the embodiment of the invention is mainly intended for low-frequency words whose frequency is below the threshold, but it can also be applied to high-frequency words above the threshold.
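The threshold check can be sketched as below. The helper name, the toy lexicon, and the threshold of 2 are hypothetical; the source does not specify a threshold value or how the lexicon is built.

```python
from collections import Counter

def split_by_frequency(words, word_counts, threshold):
    """Partition words into (low_frequency, high_frequency) by a count threshold."""
    low, high = [], []
    for word in words:
        # Words absent from the lexicon have frequency 0 (unknown words).
        if word_counts.get(word, 0) < threshold:
            low.append(word)
        else:
            high.append(word)
    return low, high

# Hypothetical preset lexicon, given here as a flat token list.
lexicon_tokens = ["sea", "ocean", "sea", "river", "sea", "ocean"]
word_counts = Counter(lexicon_tokens)

low, high = split_by_frequency(["sea", "river", "lagoon"], word_counts, threshold=2)
```

Here the words in `low` would be the primary candidates for the glyph-based optimization, while those in `high` could optionally be processed as well.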

Step 23: Represent the morphological features of the word according to the Chinese characters contained in the word to be processed, and obtain the morphological feature vector of the word to be processed.

A word is defined by its constituent characters and the way they are combined. All characters in the original text of the word to be processed are decomposed and counted, the Chinese character structure database is queried for each character, and the structure information of each character is obtained. The structural information of each character is then expressed as a low-dimensional feature vector using an unsupervised feature extraction method.

The low-dimensional feature vectors of all the characters are then averaged, and the resulting average is taken as the morphological feature vector of the word to be processed. The character structure features are thereby incorporated into the morphological feature vector of the word to be processed, which completes the optimization of the word's representation.

For example, for the two-character word "ocean", the 32-dimensional glyph feature vector of each of its two constituent characters is calculated using an unsupervised deep learning model, and the glyph structure feature vector of the word "ocean" is obtained by averaging the two vectors.
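The averaging step can be sketched as follows. Several things here are assumptions: the characters 海 and 洋 are used as a plausible Chinese form of the word rendered as "ocean", the database is modelled as a plain dictionary, and 4-dimensional invented values stand in for the 32-dimensional vectors that an unsupervised model would produce.

```python
def average_vectors(vectors):
    """Element-wise mean of a non-empty list of equal-length vectors."""
    count = len(vectors)
    return [sum(column) / count for column in zip(*vectors)]

# Hypothetical character structure database: character -> glyph feature vector.
# Values are invented; real vectors would come from the unsupervised extractor.
char_db = {
    "海": [0.25, 0.5, 0.75, 1.0],
    "洋": [0.75, 0.0, 0.25, 0.5],
}

def morphological_vector(word, char_db):
    """Average the glyph vectors of the characters that make up the word."""
    return average_vectors([char_db[ch] for ch in word])

ocean_morph = morphological_vector("海洋", char_db)
```

Because the result depends only on the characters, a vector can be computed even for a word that never appeared in the training text, provided its characters are in the database.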

Step 24: Combine the morphological feature vector and the distributed word vector of the word to be processed to obtain an optimized feature vector of the word to be processed.

The morphological feature vector and the distributed word vector of the word to be processed can be combined in two ways.

The first way: directly concatenate the morphological feature vector and the distributed word vector along the dimension axis to generate a 160-dimensional fused word vector, and take the fused word vector as the optimized feature vector of the word to be processed.

For example, for the word "ocean", a 128-dimensional distributed word vector is computed using a conventional context-based distributed semantic representation, and a 32-dimensional morphological feature vector was obtained in the previous step. Concatenating the 128-dimensional distributed word vector and the 32-dimensional morphological feature vector yields a 160-dimensional fused word vector.
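The dimension arithmetic of this first combination can be checked with a minimal sketch. The function name and the placeholder values are assumptions; only the dimensions 128, 32, and 160 come from the text.

```python
def fuse(distributed, morphological):
    """Concatenate the two vectors: distributed part first, glyph part second."""
    return list(distributed) + list(morphological)

# Placeholder values; only the lengths matter for this illustration.
distributed_vec = [0.0] * 128   # context-based distributed word vector
morph_vec = [1.0] * 32          # glyph-based morphological feature vector

fused_vec = fuse(distributed_vec, morph_vec)
fused_dim = len(fused_vec)
```

Concatenation keeps both sources of information intact, at the cost of a larger input dimension for any downstream model.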

The second way: using the morphological feature vector and a chosen similarity measure, find one or more neighbor words of the word to be processed in a lexicon; then average the distributed word vectors of the one or more neighbor words and the distributed word vector of the word to be processed, take the resulting average as the optimized feature vector of the word to be processed, and use the optimized feature vector as a semantic word vector shared by the one or more neighbor words and the word to be processed.

For example, assuming that "ocean" is most similar in glyph form to "Wang ocean" and "sea water", the 128-dimensional distributed word vectors of these two words are averaged together with the distributed word vector of "ocean" to obtain the optimized feature vector of "ocean". This optimized feature vector also serves as the semantic word vector shared by the three words.
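The second combination can be sketched as below. Cosine similarity is used as one possible similarity measure; the source does not name a specific one. All vector values, the extra word "car", and the choice of two neighbors are invented for the illustration, and 2-dimensional vectors stand in for the 32- and 128-dimensional ones.

```python
import math

def cosine(u, v):
    """Cosine similarity; returns 0.0 if either vector has zero length."""
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return sum(a * b for a, b in zip(u, v)) / (norm_u * norm_v)

def average_vectors(vectors):
    """Element-wise mean of a non-empty list of equal-length vectors."""
    count = len(vectors)
    return [sum(column) / count for column in zip(*vectors)]

def nearest_neighbors(word, morph, k):
    """Rank the other lexicon words by glyph similarity and keep the top k."""
    scored = [(cosine(morph[word], vec), other)
              for other, vec in morph.items() if other != word]
    scored.sort(reverse=True)
    return [other for _, other in scored[:k]]

def optimize_with_neighbors(word, morph, dist, k):
    """Average the distributed vectors of the word and its glyph neighbors."""
    neighbors = nearest_neighbors(word, morph, k)
    shared = average_vectors([dist[w] for w in [word] + neighbors])
    return neighbors, shared

# Hypothetical morphological (glyph) vectors and distributed vectors.
morph = {"ocean": [1.0, 1.0], "Wang ocean": [0.5, 1.0],
         "sea water": [1.0, 0.25], "car": [-1.0, 0.5]}
dist = {"ocean": [1.0, 2.0], "Wang ocean": [2.0, 2.0],
        "sea water": [3.0, 5.0], "car": [9.0, 9.0]}

neighbors, shared_vec = optimize_with_neighbors("ocean", morph, dist, k=2)
```

Unlike concatenation, this variant keeps the original dimensionality and lets a rare word borrow statistical strength from its glyph neighbors.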

The optimized feature vector of the word to be processed is then input into a sentence representation model framework to obtain a sentence representation, and a classifier outputs the final classification result for the word to be processed.

In summary, the embodiment of the present invention designs a scheme for optimizing word vector representation using Chinese glyph structure information. The scheme builds on existing neural-network distributed word representation techniques, combines them with Chinese glyph structure features, and optimizes the word vector features for actual natural language processing tasks, so that the expressive power and the generalization and transfer ability of word vectors are enhanced. The method helps improve the feature representation of low-frequency words and unknown words, and thereby improves the performance of natural language processing tasks such as text classification.

The embodiment of the invention realizes a special optimization strategy for the Chinese word vector on the basis of the original word vector distributed expression method and the two-dimensional graph structure information extraction method, so that the actual performance of the Chinese word vector on natural language processing tasks such as text classification and the like is enhanced.

Those of ordinary skill in the art will understand that: the figures are merely schematic representations of one embodiment, and the blocks or flow diagrams in the figures are not necessarily required to practice the present invention.

From the above description of the embodiments, it is clear to those skilled in the art that the present invention can be implemented by software plus necessary general hardware platform. Based on such understanding, the technical solutions of the present invention may be embodied in the form of a software product, which may be stored in a storage medium, such as ROM/RAM, magnetic disk, optical disk, etc., and includes instructions for causing a computer device (which may be a personal computer, a server, or a network device, etc.) to execute the method according to the embodiments or some parts of the embodiments.

The embodiments in the present specification are described in a progressive manner; for the same or similar parts, the embodiments may be referred to one another, and each embodiment focuses on its differences from the others. In particular, apparatus or system embodiments are described relatively briefly because they are substantially similar to the method embodiments, and the relevant parts of the method embodiments may be consulted for details. The above-described embodiments of the apparatus and system are merely illustrative: the units described as separate parts may or may not be physically separate, and the parts displayed as units may or may not be physical units; they may be located in one place or distributed over a plurality of network units. Some or all of the modules may be selected according to actual needs to achieve the purpose of the solution of the present embodiment. One of ordinary skill in the art can understand and implement it without inventive effort.

The above description is only for the preferred embodiment of the present invention, but the scope of the present invention is not limited thereto, and any changes or substitutions that can be easily conceived by those skilled in the art within the technical scope of the present invention are included in the scope of the present invention. Therefore, the protection scope of the present invention shall be subject to the protection scope of the claims.

Claims

1. A word vector optimization method based on Chinese character font structural information is characterized by comprising the following steps:

acquiring a distributed word vector of a word to be processed;

representing the morphological features of the word according to the Chinese characters contained in the word to be processed, and acquiring a morphological feature vector of the word to be processed, which specifically comprises:

carrying out an averaging operation on the low-dimensional feature vectors corresponding to all the characters, and taking the obtained average as the morphological feature vector of the word to be processed;

combining the morphological feature vector and the distributed word vector of the word to be processed to obtain an optimized feature vector of the word to be processed, which specifically comprises:

concatenating the morphological feature vector and the distributed word vector along the dimension axis to generate a fused word vector, and taking the fused word vector as the optimized feature vector of the word to be processed; or using the morphological feature vector and a preset similarity measure to find one or more neighbor words of the word to be processed in a lexicon, then averaging the distributed word vectors of the one or more neighbor words and the distributed word vector of the word to be processed, taking the resulting average as the optimized feature vector of the word to be processed, and using the optimized feature vector as a semantic word vector shared by the one or more neighbor words and the word to be processed.

2. The method of claim 1, wherein obtaining the distributed word vector of the word to be processed comprises:

first performing word segmentation preprocessing on the original text containing the word to be processed, then computing distributed word vector representations for the words in the preprocessed text, and obtaining the distributed word vector of the word to be processed;