
CN116992865A - Context-based lexical analysis method, device, and storage medium - Google Patents

Context-based lexical analysis method, device, and storage medium

Info

Publication number
CN116992865A
CN116992865A (application CN202310975048.3A)
Authority
CN
China
Prior art keywords
word
context
sentence
speech
dependency
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202310975048.3A
Other languages
Chinese (zh)
Inventor
廖志成
何春茂
黄松清
许志勇
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Gree Electric Appliances Inc of Zhuhai
Original Assignee
Gree Electric Appliances Inc of Zhuhai
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Gree Electric Appliances Inc of Zhuhai
Priority: CN202310975048.3A
Publication: CN116992865A
Legal status: Pending


Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/20 Natural language analysis
    • G06F40/253 Grammatical analysis; Style critique
    • G06F40/205 Parsing
    • G06F40/211 Syntactic parsing, e.g. based on context-free grammar [CFG] or unification grammars
    • G06F40/279 Recognition of textual entities
    • G06F40/289 Phrasal analysis, e.g. finite state techniques or chunking

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Health & Medical Sciences (AREA)
  • Artificial Intelligence (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Computational Linguistics (AREA)
  • General Health & Medical Sciences (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Machine Translation (AREA)

Abstract

The invention provides a context-based lexical analysis method, device, and storage medium. The method comprises: acquiring a labeled data set comprising a plurality of sentences and a context constructed for each word of each sentence; extracting feature information representing word features from the context of each word; training a model, by a machine learning algorithm, on the labeled data and the extracted feature information to obtain a model that predicts the part of speech of each word in a sentence and the dependency relations between words; and inputting a sentence to be analyzed into the model and outputting a prediction result for that sentence. By considering the surrounding words and sentence structure, the scheme can better capture the meaning and grammatical role of the current word.

Description

Context-based lexical analysis method, device, and storage medium
Technical Field
The present invention relates to the field of natural language processing, and in particular to a context-based lexical analysis method, device, and storage medium.
Background
In the field of natural language processing, lexical analysis is an important task: it decomposes a sentence into lexical units and determines the relations between the words. Traditional approaches fall into rule-based methods, statistical methods, and hybrids of the two. Rule-based methods parse text through manually defined rules; they require expert knowledge and considerable effort, and struggle to cover all linguistic phenomena. Statistical methods train models on large labeled corpora and predict the grammatical structure of sentences with statistical learning algorithms; they require large corpora and computing resources and handle unseen text types poorly. Hybrid methods combine the two to offset their respective weaknesses. All of these methods, however, still suffer from low accuracy and slow processing, and cannot meet the demands of practical applications.
Disclosure of Invention
The invention aims to overcome the defects of the related art by providing a context-based lexical analysis method, device, and storage medium, so as to solve the problems of low accuracy and slow processing in lexical and grammatical analysis in the related art.
The invention provides a context-based lexical analysis method, comprising the following steps:
acquiring a labeled data set, the data set comprising: a plurality of sentences and a context constructed for each word of each of the plurality of sentences, wherein each sentence consists of a word sequence and a corresponding part-of-speech tag sequence; extracting, from the context of each word, feature information representing word features, the feature information comprising at least one of: part-of-speech features, dependency features, context features, and lexical morphology; training a model, by a machine learning algorithm, on the labeled data in the data set and the feature information extracted from the context of each word, to obtain a model for predicting the part of speech of each word in a sentence and the dependency relations between words; and inputting a sentence to be analyzed into the model and outputting a prediction result for that sentence.
Optionally, the context comprises: the word to the left and the word to the right of each word, the part-of-speech tags of those words, and the dependency relations between each word and its left and right neighbors; wherein the process of constructing a context for each word of each sentence comprises: segmenting each sentence to obtain the two or more words that compose it; tagging the part of speech of each word in each sentence; performing dependency analysis between the words of each sentence; and building a context for each word based on the segmentation, part-of-speech tagging, and dependency analysis performed.
Optionally, extracting feature information representing word features from the context of each word comprises: extracting, from the context of the current word, the part-of-speech feature of the current word, the part-of-speech feature of the word to its left, the dependency feature between the current word and the word to its left, the part-of-speech feature of the word to its right, and the dependency feature between the current word and the word to its right.
Optionally, the method further comprises: parsing the sentence with grammar rules according to the prediction result, and generating a structured output result in a preset form, the structured output result comprising a syntax tree.
Optionally, the generated structured output result is post-processed to improve the accuracy and stability of the grammatical analysis; the post-processing comprises at least one of: error correction, disambiguation, context adjustment, normalization, and manual verification.
In another aspect, the invention provides a context-based lexical analysis device, comprising: a data acquisition unit configured to acquire a labeled data set comprising a plurality of sentences and a context constructed for each word of each sentence, wherein each sentence consists of a word sequence and a corresponding part-of-speech tag sequence; a feature extraction unit configured to extract, from the context of each word, feature information representing word features, the feature information comprising at least one of part-of-speech features, dependency features, context features, and lexical morphology; a model training unit configured to train a model, by a machine learning algorithm, on the labeled data and the feature information extracted from the context of each word, to obtain a model for predicting the part of speech of each word in a sentence and the dependency relations between words; and an analysis and prediction unit configured to input a sentence to be analyzed into the model and output a prediction result for that sentence.
Optionally, the context comprises: the word to the left and the word to the right of each word, the part-of-speech tags of those words, and the dependency relations between each word and its left and right neighbors; wherein the process of constructing a context for each word of each sentence comprises: segmenting each sentence to obtain the two or more words that compose it; tagging the part of speech of each word in each sentence; performing dependency analysis between the words of each sentence; and building a context for each word based on the segmentation, part-of-speech tagging, and dependency analysis performed.
Optionally, the feature extraction unit extracts feature information representing word features from the context of each word by: extracting, from the context of the current word, the part-of-speech feature of the current word, the part-of-speech feature of the word to its left, the dependency feature between the current word and the word to its left, the part-of-speech feature of the word to its right, and the dependency feature between the current word and the word to its right.
Optionally, the device further comprises: a parsing unit configured to parse the sentence with grammar rules according to the prediction result and generate a structured output result in a preset form, the structured output result comprising a syntax tree.
Optionally, the device further comprises: a post-processing unit configured to post-process the generated structured output result to improve the accuracy and stability of the grammatical analysis; the post-processing comprises at least one of: error correction, disambiguation, context adjustment, normalization, and manual verification.
In a further aspect, the invention provides a storage medium having stored thereon a computer program which, when executed by a processor, performs the steps of any of the methods described above.
In a further aspect, the invention provides a terminal comprising a processor, a memory, and a computer program stored on the memory and executable on the processor, the processor implementing the steps of any of the methods described above when the program is executed.
The invention further provides a terminal comprising any of the context-based lexical analysis devices described above.
According to the technical scheme of the invention, the meaning and grammatical role of the current word can be better understood by considering the surrounding words and sentence structure. Compared with purely rule-based or statistical machine-learning parsing methods, the scheme can more accurately resolve ambiguity and understand context, improving analysis accuracy and efficiency. The scheme considers not only context information but also other features such as part of speech and dependency relations; by combining these features, lexical analysis can be performed more accurately, improving both analysis quality and generalization.
Drawings
The accompanying drawings, which are included to provide a further understanding of the invention and are incorporated in and constitute a part of this specification, illustrate embodiments of the invention and together with the description serve to explain the invention and do not constitute a limitation on the invention. In the drawings:
FIG. 1 is a method diagram of one embodiment of the context-based lexical analysis method provided by the present invention;
FIG. 2 is a method diagram of another embodiment of the context-based lexical analysis method provided by the present invention;
FIG. 3 is a lexical and grammatical parsing flow in the related art;
FIG. 4 is a method diagram of one embodiment of a context-based lexical analysis system in accordance with aspects of the present technique;
FIG. 5 is a block diagram of an embodiment of the context-based lexical analysis device provided by the present invention;
FIG. 6 is a block diagram of another embodiment of the context-based lexical analysis device provided by the present invention;
FIG. 7 is an exemplary diagram of a syntax tree labeling the part of speech corresponding to each word;
FIG. 8 is an exemplary diagram of a model-predicted dependency tree;
FIG. 9 is an exemplary diagram of a syntax tree constructed by part-of-speech tagging and dependency analysis.
Detailed Description
In order to make the objects, technical solutions and advantages of the present invention more apparent, the technical solutions of the present invention will be clearly and completely described below with reference to specific embodiments of the present invention and corresponding drawings. It will be apparent that the described embodiments are only some, but not all, embodiments of the invention. All other embodiments, which can be made by those skilled in the art based on the embodiments of the invention without making any inventive effort, are intended to be within the scope of the invention.
It should be noted that the terms "first," "second," and the like in the description and the claims of the present invention and the above figures are used for distinguishing between similar objects and not necessarily for describing a particular sequential or chronological order. It is to be understood that the data so used may be interchanged where appropriate such that the embodiments of the invention described herein may be implemented in sequences other than those illustrated or otherwise described herein. Furthermore, the terms "comprises," "comprising," and "having," and any variations thereof, are intended to cover a non-exclusive inclusion, such that a process, method, system, article, or apparatus that comprises a list of steps or elements is not necessarily limited to those steps or elements expressly listed but may include other steps or elements not expressly listed or inherent to such process, method, article, or apparatus.
The lexical and grammatical analysis flow in the related art may be illustrated with reference to FIG. 3, which shows both the rule-based and the statistical-model-based flows. The flow mainly comprises input, preprocessing, word segmentation, part-of-speech tagging, syntactic analysis, semantic analysis, and output.
The invention provides a context-based lexical analysis method.
FIG. 1 is a method diagram of an embodiment of the context-based lexical analysis method provided by the present invention.
As shown in FIG. 1, the context-based lexical analysis method according to an embodiment of the present invention comprises at least steps S110, S120, S130, and S140.
Step S110: a labeled data set is acquired.
Specifically, in the data preparation stage, a labeled data set is prepared. The data set includes a plurality of sentences that have undergone word segmentation, part-of-speech tagging, and dependency analysis, together with a context constructed for each word of each sentence based on the results of those steps; each sentence consists of a word sequence and a corresponding part-of-speech tag sequence.
Before word segmentation, part-of-speech tagging, and dependency analysis, the sentences to be analyzed are preprocessed. First, the text is cleaned: punctuation marks and stop words are removed and operations such as case normalization are performed, which reduces noise and redundant information and improves analysis accuracy and efficiency. Second, low-frequency vocabulary is deleted from the text, which reduces the complexity and computational cost of model training.
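As a rough illustration of this preprocessing step, the following sketch lowercases the text, strips punctuation, and drops a tiny illustrative stop-word list (both the list and the regular expression are assumptions for this example, not part of the patent):

```python
import re

# Illustrative stop-word list -- an assumption for this sketch.
STOP_WORDS = {"a", "an", "the", "of"}

def preprocess(text: str) -> list:
    """Lowercase, strip punctuation, and drop stop words."""
    text = text.lower()                    # case normalization
    text = re.sub(r"[^\w\s]", " ", text)   # remove punctuation
    tokens = text.split()
    return [t for t in tokens if t not in STOP_WORDS]

print(preprocess("I saw a bat, and the bat flew."))
# -> ['i', 'saw', 'bat', 'and', 'bat', 'flew']
```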
After preprocessing, word segmentation, part-of-speech tagging, and dependency analysis are performed on the object to be analyzed, and a context is built for each word. The context may specifically include: the word to the left (preceding) and the word to the right (following) of each word, the part-of-speech tags of those neighbors, and the dependency of each word with its left and right neighbors. By considering surrounding words and sentence structure, the meaning and grammatical role of the current word can be better understood. For example, "bat" in "I saw a bat" may be the animal or a baseball bat, and must be disambiguated according to context.
Specifically, the process of building a context for each word can be implemented by:
1. Word segmentation: each sentence is segmented to obtain the two or more words that compose it, i.e., the text is divided into individual word units. This step determines the position of each word in the sentence.
2. Part-of-speech tagging: each word in each sentence is tagged with its part of speech, i.e., the grammatical role the word plays in the sentence is determined, for example noun, verb, or adjective.
3. Dependency analysis: the dependency between the words of each sentence is determined, i.e., the grammatical relations between words, including subject-predicate and verb-object relations; in other words, which words act as subject, predicate, object, and so on. Dependency analysis helps in understanding the structure and semantics of a sentence.
4. Building the context: a context is built for each word based on the segmentation, part-of-speech tagging, and dependency analysis performed. The context includes the word to the left and the word to the right of each word, their part-of-speech tags, and the dependency relations between each word and its neighbors. This provides more contextual information to help understand and process the words in a sentence.
The process of building a context is illustrated by one example as follows:
Original sentence: I like to eat apples.
Word segmentation: I | like | eat | apple.
Part-of-speech tagging: I (pronoun) | like (verb) | eat (verb) | apple (noun).
Dependency analysis: "I"-"like" is a subject-predicate relation; "eat"-"apple" is a verb-object relation.
Building the context:
For the word "I": there is no word to the left; the word to the right is "like", tagged as a verb, with a subject-predicate dependency. Context: [None, None, 'like', 'verb'].
For the word "like": the word to the left is "I", tagged as a pronoun, with a subject-predicate dependency; the word to the right is "eat", tagged as a verb, with an unknown dependency. Context: ['I', 'pronoun', 'eat', 'verb'].
For the word "eat": the word to the left is "like", tagged as a verb, with a verb-object dependency; the word to the right is "apple", tagged as a noun, with a verb-object dependency. Context: ['like', 'verb', 'apple', 'noun'].
For the word "apple": the word to the left is "eat", tagged as a verb, with a verb-object dependency; there is no word to the right. Context: ['eat', 'verb', None, None].
From the above example, each word is given a corresponding context, including its left and right words, part-of-speech tags, and dependencies. Such a context provides more comprehensive information and facilitates a more accurate understanding and processing of words.
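The context-building step above can be sketched as a small window function; the four-slot [left_word, left_pos, right_word, right_pos] layout mirrors the worked example (dependency labels are omitted here for brevity):

```python
def build_contexts(words, pos_tags):
    """Return a [left_word, left_pos, right_word, right_pos] context
    for each position; None marks a missing neighbour."""
    contexts = []
    for i in range(len(words)):
        left_word = words[i - 1] if i > 0 else None
        left_pos = pos_tags[i - 1] if i > 0 else None
        right_word = words[i + 1] if i + 1 < len(words) else None
        right_pos = pos_tags[i + 1] if i + 1 < len(words) else None
        contexts.append([left_word, left_pos, right_word, right_pos])
    return contexts

words = ["I", "like", "eat", "apple"]
tags = ["pronoun", "verb", "verb", "noun"]
for w, c in zip(words, build_contexts(words, tags)):
    print(w, c)
```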
Step S120: feature information representing word features is extracted from the context of each word.
The feature information includes, for example, at least one of part-of-speech features, dependency features, context features, and lexical morphology.
Extracting information that represents word features from the context may draw on the following aspects:
Part-of-speech features: part of speech describes the role a word plays in a sentence. By extracting part-of-speech information, the grammatical function of the word and the sentence structure can be determined. For example, in a named-entity recognition task, noun features may be extracted to decide whether a word is an entity name.
Dependency features: dependencies describe the grammatical relations between words, including subject-predicate and verb-object relations. By extracting dependency information, the syntactic dependency structure between words can be obtained. For example, in a relation extraction task, subject-predicate features may be extracted to determine whether a relation exists between two words.
Context features: the context contains the word to the left (preceding) and the word to the right (following) of each word, their part-of-speech tags, and the dependency of each word with its neighbors. By extracting context features, the changes in meaning of words across different contexts can be captured. For example, in a word-sense disambiguation task, the part-of-speech and dependency features of the left and right words may be extracted to determine the specific meaning of an ambiguous word.
Lexical morphology: the form and structure of words, including stems, affixes, and other morphological variations. Morphology provides information about a word such as part of speech, tense, number, and language. Some common morphological features, with examples:
1. Stem: the stem is the basic part of a word, usually the form with the ending removed. For example, the stem of "running" is "run", and the stem of "cats" is "cat". The stem gives the basic meaning of the word.
2. Affix: an affix is attached before or after the stem and may change the meaning or part of speech of the word. There are two types of affixes:
- Prefix: an affix placed before the stem. For example, "un-" is a negative prefix; it turns "happy" into "unhappy", meaning not happy.
- Suffix: an affix placed after the stem. For example, "-ness" is a noun suffix; it turns the adjective "happy" into the noun "happiness".
3. Capitalization: the case of a word can also carry information. A capitalized word is typically a proper noun, the first word of a sentence, or part of a title. For example, "John" is a personal name and "London" is a place name.
By extracting lexical morphological information, a corresponding feature representation may be generated for each word and used in natural language processing tasks such as part-of-speech tagging, syntactic analysis, and named-entity recognition, enhancing the understanding and analysis of context.
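A toy version of these morphology features can be sketched with a naive suffix list and a capitalization check (the suffix list is an assumption for this illustration, not a real stemming algorithm such as Porter's):

```python
# Naive suffix list -- an assumption for this sketch; "ning" is listed
# before "ing" so doubled consonants ("running") strip cleanly.
SUFFIXES = ["ning", "ness", "ing", "s"]

def naive_stem(word: str) -> str:
    """Strip the first matching suffix, keeping a minimal stem length."""
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) > len(suffix) + 2:
            return word[: -len(suffix)]
    return word

def morph_features(word: str) -> dict:
    """Morphology features: stem plus a capitalization flag."""
    return {"stem": naive_stem(word.lower()),
            "is_capitalized": word[0].isupper()}

print(morph_features("running"))  # stem 'run'
print(morph_features("London"))   # capitalized, likely a proper noun
```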
In one embodiment, the part-of-speech feature of the current word, the part-of-speech feature of the word to the left of the current word, the dependency feature of the current word with the word to the left of the current word, the part-of-speech feature of the word to the right of the current word, the dependency feature of the current word with the word to the right of the current word are extracted from the context of the current word.
For example, to illustrate how information representing word features can be extracted from a context, assume the current word is "eat" with context ['like', 'verb', 'apple', 'noun']. The features that can be extracted include:
Part-of-speech feature of the current word: verb.
Part-of-speech feature of the left word: "like" is a verb.
Dependency feature with the left word: verb-object relation.
Part-of-speech feature of the right word: "apple" is a noun.
Dependency feature with the right word: verb-object relation.
By extracting these features, the word "eat" can be expressed as a vector containing the features described above, which aids subsequent semantic understanding and processing. The choice and extraction of features can be adjusted flexibly according to the task and the model design.
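The features named above can be read off a tokenized sentence directly; this sketch assumes the sentence is given as (word, part-of-speech) pairs and leaves the two dependency features as "unknown" placeholders, since those would come from a parser:

```python
def word_features(tokens, i):
    """Extract context features for the i-th token of a sentence
    given as (word, pos) pairs."""
    n = len(tokens)
    return {
        "pos": tokens[i][1],
        "left_pos": tokens[i - 1][1] if i > 0 else "NONE",
        "right_pos": tokens[i + 1][1] if i + 1 < n else "NONE",
        # Dependency features would come from a parser; placeholders here.
        "left_dep": "unknown",
        "right_dep": "unknown",
    }

sentence = [("I", "pronoun"), ("like", "verb"), ("eat", "verb"), ("apple", "noun")]
print(word_features(sentence, 2))
```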
After the feature information representing word features has been extracted from the context of each word, it is converted into a numerical representation usable by a machine learning algorithm for model training.
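One simple way to perform that conversion is a one-hot encoding over a fixed feature vocabulary (the vocabulary below is an illustrative assumption; real systems typically build it from the training data):

```python
def encode(features: dict, vocab: list) -> list:
    """One-hot encode a symbolic feature dict against a fixed vocabulary
    of 'name=value' strings."""
    active = {f"{k}={v}" for k, v in features.items()}
    return [1 if name in active else 0 for name in vocab]

vocab = ["pos=verb", "pos=noun", "left_pos=verb", "right_pos=noun"]
feats = {"pos": "verb", "left_pos": "verb", "right_pos": "noun"}
print(encode(feats, vocab))  # -> [1, 0, 1, 1]
```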
Step S130: a model is trained, by a machine learning algorithm, on the labeled data in the data set and the feature information extracted from the context of each word, to obtain a model for predicting the part of speech of each word in a sentence and the dependency relations between words.
Specifically, a machine learning algorithm such as a conditional random field or a neural network is trained on the prepared feature information and labeled data. The goal of training is to learn the relations between words and their contexts, and to predict the part-of-speech tag and grammatical structure information of each word.
Optionally, after training, the model is evaluated on an evaluation data set, computing the accuracy and/or recall of its part-of-speech tagging and dependency analysis, so that it can be tuned and improved according to the results.
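The description suggests conditional random fields or neural networks; as a minimal stand-in that only illustrates the shape of the train-then-evaluate loop, here is a most-frequent-tag baseline built with the standard library (the baseline itself is an assumption for this sketch, not the patent's model):

```python
from collections import Counter, defaultdict

def train(tagged_sentences):
    """Learn the most frequent tag for each word in the training data."""
    counts = defaultdict(Counter)
    for sentence in tagged_sentences:
        for word, tag in sentence:
            counts[word][tag] += 1
    return {w: c.most_common(1)[0][0] for w, c in counts.items()}

def accuracy(model, tagged_sentences, default="noun"):
    """Tagging accuracy; unseen words fall back to a default tag."""
    correct = total = 0
    for sentence in tagged_sentences:
        for word, tag in sentence:
            correct += model.get(word, default) == tag
            total += 1
    return correct / total

train_data = [[("I", "pronoun"), ("like", "verb"), ("eat", "verb"), ("apple", "noun")]]
model = train(train_data)
print(accuracy(model, train_data))  # -> 1.0
```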
Step S140: the sentence to be analyzed is input into the model, and a prediction result for that sentence is output.
Specifically, when the trained model is used for prediction, an unlabeled sentence is input as its word sequence. The model performs part-of-speech tagging and dependency analysis on each word and outputs a prediction result, which may include a predicted part-of-speech tag sequence and a dependency tree.
For example, for the sentence "I like to eat apples", the trained model can produce the following output:
Part-of-speech tag sequence: "pronoun verb verb noun". Dependency tree: [('like', 'I', 'subject-predicate relation'), ('eat', 'like', 'verb-object relation'), ('apple', 'eat', 'verb-object relation')].
Such output may help understand the grammatical structure of the sentence, the dependencies between words, and the role played by each word. Such information is useful for subsequent natural language processing tasks such as question-answering systems, machine translation, etc.
FIG. 2 is a method diagram of another embodiment of the context-based lexical analysis method provided by the present invention.
As shown in FIG. 2, based on the above embodiment, the context-based lexical analysis method according to another embodiment of the present invention further includes step S150.
Step S150: the sentence is parsed with grammar rules according to the prediction result, and a structured output result in a preset form is generated.
The structured output result in the preset form comprises a syntax tree. Specifically, the syntax tree may be generated as follows:
1. Part-of-speech tagging: the model prediction may include a predicted part-of-speech tag sequence and a dependency tree. The predicted part-of-speech tag sequence is applied to the original sentence, associating each word with its corresponding part of speech.
Assume the original sentence: "I love cats and dogs."
The model-predicted part-of-speech tag sequence is: ["PRON", "VERB", "NOUN", "CONJ", "NOUN", "."], giving the part of speech of each word.
Each word is associated with its corresponding part of speech, i.e., the predicted tags are paired in order with the words of the original sentence. For example, "I" corresponds to "PRON", "love" to "VERB", and so on:
-"I"->"PRON"
-"love"->"VERB"
-"cats"->"NOUN"
-"and"->"CONJ"
-"dogs"->"NOUN"
-"."->"."
Thus, the association between each word in the original sentence and its part of speech is established. These associations may be used for further analysis and processing, such as building syntax trees or dependency analysis.
For example, from the above associations the following syntax tree can be constructed (refer to FIG. 7):
In this syntax tree, each parenthesized pair represents a word and its part of speech. In this way, a structured output result can be generated from the model-predicted part-of-speech tag sequence.
2. Dependency analysis: according to the model-predicted dependency tree, each word is connected to its head word. Dependency labels represent the different syntactic relations, such as subject-predicate and verb-object relations.
In dependency analysis, the head word is the word on which another word depends; it may also be called the "parent", "governor", or "center word". Each word is connected to its head word according to the dependency tree.
The following is the step of establishing a head word connection for each word and its dependencies, and one example:
1) Assume an original sentence: "I love cat and dog"
2) The dependency tree for model prediction is shown below (see FIG. 8):
3) Establishing a connection relation: according to the dependency tree, for each word, it is connected to its corresponding head word. For example:
- The head word of "I" is "love", with the dependency label "nsubj" (subject relation).
- "love" has no head word: it is the root node, so it carries no dependency label.
- The head word of "cats" is "love", with the dependency label "dobj" (direct object relation).
- The head word of "and" is "cats", with the dependency label "cc" (coordinating conjunction).
- The head word of "dogs" is "cats", with the dependency label "conj" (coordination relation).
In this way, a connection is established between each word and its dependency head word.
For example, in the above example, the following connection relationship may be obtained:
- "I" -> "love" (nsubj)
- "love" -> None (root)
- "cats" -> "love" (dobj)
- "and" -> "cats" (cc)
- "dogs" -> "cats" (conj)
In this way, we can extract the dependency head word of each word, thus completing the dependency analysis.
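The connection step can be sketched as follows (a minimal illustration; the head indices, labels, and helper name `head_connections` are assumptions for this sketch, following the common dependency-grammar convention that the main verb is the tree's root):

```python
# Minimal sketch: connect each word to its dependency head according
# to a predicted dependency tree, for "I love cats and dogs."

def head_connections(words, heads, labels):
    """Return (word, head_word_or_None, label) triples.

    heads[i] is the index of word i's head, or -1 for the root.
    """
    triples = []
    for i, word in enumerate(words):
        head = words[heads[i]] if heads[i] >= 0 else None
        triples.append((word, head, labels[i]))
    return triples

words = ["I", "love", "cats", "and", "dogs"]
heads = [1, -1, 1, 2, 2]          # the main verb "love" is the root
labels = ["nsubj", "root", "dobj", "cc", "conj"]

connections = head_connections(words, heads, labels)
for word, head, label in connections:
    print(f'"{word}" -> {head!r} ({label})')
```

Each triple mirrors one line of the connection list above, which is the input the tree-assembly step consumes.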
3. Parsing the syntax tree: based on the part-of-speech tags and the dependency relationships, the sentences are analyzed according to grammar rules, and grammar trees or other structured output results are generated. The process of parsing the output results of the generation of the grammar tree or other structure depends on the definition and application of the grammar rules. Grammar rules may define syntactic relationships and constraints between different components based on linguistic knowledge. By applying grammar rules, the predicted part-of-speech tags and dependencies are converted into output results with structured representations. A syntax tree is a way to present sentence components and syntactic relations in a tree structure, each node of the tree representing a word, the edges between the nodes representing syntactic relations.
Grammar rules are a set of conventions that specify the relations between sentence structures and components. They define the normal form of language expression, describing how words combine into phrases and how phrases combine into sentences. Generating the syntax tree may specifically include: representing each word as a node and each syntactic relation as an edge. The edges carry the relation information between words in the syntax tree, which helps us understand the structure and grammatical relations of the sentence. This edge-connection scheme makes the syntax tree a structure that is easy to visualize and parse.
For example, for the sentence "I love cats and dogs.", the following syntax tree (see fig. 9) may be constructed through part-of-speech tagging and dependency analysis:
In this syntax tree, each word becomes a node, and the syntactic relations are represented by edges. For example, "love" is a verb whose subject is "I" (subject-predicate relation), while it also has a direct object "cats" (verb-object relation). Similarly, "cats" and "dogs" are connected through the coordinating conjunction "and".
In this example, part-of-speech tagging, dependency analysis, and parsing the syntax tree are interrelated, interdependent steps. Part-of-speech tagging provides the basic grammatical attributes of words, while dependency analysis determines the syntactic relations between words. Parsing the syntax tree then uses the part-of-speech tags and the dependency relations to construct a structured representation of the sentence. Therefore, these three steps are typically performed in sequence, each building on the previous one, to produce the final output result. Other structured output results may be, for example, a dependency graph.
The resulting syntax tree or other structured output is an expression of the syntax structure of the sentence. It can be used for further tasks of semantic understanding, question-answering, text generation, etc. The output results provide structured information about sentence components and relationships, facilitating in-depth analysis and processing of sentences.
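The assembly of the tree from part-of-speech tags and head connections can be sketched as below (a minimal illustration; the bracketed output format and the helper name `build_tree` are choices made for this sketch, not the patent's prescribed representation):

```python
# Minimal sketch: assemble a nested syntax tree from head indices and
# part-of-speech tags, and render it in bracketed form, one node per
# word. Indices follow the "I love cats and dogs." example.

from collections import defaultdict

def build_tree(words, tags, heads):
    children = defaultdict(list)
    root = None
    for i, h in enumerate(heads):
        if h < 0:
            root = i          # the word with no head is the root
        else:
            children[h].append(i)

    def render(i):
        node = f"({words[i]}/{tags[i]}"
        for c in children[i]:
            node += " " + render(c)
        return node + ")"

    return render(root)

words = ["I", "love", "cats", "and", "dogs"]
tags = ["PRON", "VERB", "NOUN", "CONJ", "NOUN"]
heads = [1, -1, 1, 2, 2]

tree = build_tree(words, tags, heads)
print(tree)  # -> (love/VERB (I/PRON) (cats/NOUN (and/CONJ) (dogs/NOUN)))
```

The bracketed string is one concrete serialization of the structured output; a dependency graph could be emitted from the same `children` map.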
Optionally, based on the above embodiment, the method further includes: and carrying out post-processing on the generated structured output result in the preset form so as to improve the accuracy and stability of grammar analysis.
Wherein, the post-processing specifically may include: at least one of error correction, disambiguation, context adjustment, normalization, manual verification.
Error correction: output errors are detected and corrected by rules or machine learning methods. For example, by comparing the degree of matching between the output syntax tree or other structured result and the syntax rules, it is determined whether there is an error, and correction is attempted.
Disambiguation: when the output result is ambiguous, a preset heuristic algorithm, rule or statistical method can be applied to resolve the ambiguity. For example, the most reasonable parsing result is selected by considering context information, word order, syntactic constraints, etc.
Context adjustment: the output result may be adjusted using the context information. For example, the labels and the dependency relationships are finely adjusted according to the semantic information, the logic relationships and the like of the previous or the following, so that the generated grammar tree or other structured results are ensured to accord with the semantic consistency of the whole sentence.
Normalizing: the output results may be normalized to achieve consistency and standardization. For example, structures having the same meaning but using different forms may be converted into a unified form.
And (3) manual verification: in some cases, the output may need to be submitted to manual verification and correction. This may be through manual review, feedback from the annotators, or expert domain knowledge to make further adjustments and improvements to the results.
In general, the corresponding strategies and methods can be designed for post-processing according to specific application requirements and tasks. The post-processing aims to improve the accuracy, stability and reliability of the grammar analysis result, so that the grammar analysis result is more in line with the grammar rules and the context used by the actual language.
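One of the error-correction checks described above can be sketched concretely: before accepting a predicted dependency tree, verify it is structurally well-formed (exactly one root, no cycles). This is an illustrative check only; the function name `tree_errors` and the head-array encoding are assumptions of the sketch.

```python
# Minimal sketch of a post-processing error check: validate that a
# predicted dependency tree has exactly one root and no cycles.

def tree_errors(heads):
    """Detect structural errors in a head-index array.

    heads[i] is the index of word i's head, or -1 for the root.
    """
    errors = []
    roots = [i for i, h in enumerate(heads) if h < 0]
    if len(roots) != 1:
        errors.append(f"expected 1 root, found {len(roots)}")
    for start in range(len(heads)):
        seen, i = set(), start
        while i >= 0:
            if i in seen:
                errors.append(f"cycle reachable from word {start}")
                break
            seen.add(i)
            i = heads[i]
    return errors

print(tree_errors([1, -1, 1, 2, 2]))  # well-formed -> []
print(tree_errors([1, 0]))            # no root, plus a cycle
```

A real post-processor would go on to repair such errors (e.g. re-attaching cycle members to the best-scoring head) rather than merely reporting them.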
The post-processing is mainly used for processing the output result of the model so as to improve the accuracy and stability of the grammar analysis system. These post-processing steps are typically performed after the output of the model is generated. In training the model, a set of annotation data is typically used, which contains the correct parsing results. The goal of training the model is to make the output results of the model as consistent as possible with the results in the annotation dataset. Therefore, during the training phase, we will not actively perform post-processing operations such as error correction or disambiguation.
However, the idea of post-processing may have a certain impact during the training phase. For example, the model may be trained to better handle similar situations by adding a slight error or ambiguity in the annotation dataset. Such a training approach may improve the robustness and generalization ability of the model.
In addition, some skill may be employed during model training to improve the model's ability to correct errors and disambiguate. For example, a suitable loss function may be designed to make the model more concerned with error-prone situations during the training process. Other supervisory signals or a priori knowledge may also be introduced to help the model better handle errors and ambiguities.
In summary, although the post-processing steps are not directly applied to the training model itself, the model can be made more capable of handling errors and ambiguities by reasonable design and skill during the training phase, thereby improving the performance of the parsing system.
In order to clearly illustrate the technical scheme of the present invention, a specific embodiment is used below to describe the execution flow of the context-based lexical analysis method provided by the present invention.
FIG. 4 is a method diagram of one embodiment of a context-based lexical analysis system according to the present invention.
The lexical analysis system of the present invention comprises: an input module, a preprocessing module, a context construction module, a feature extraction module, a model training module, a decoding module, a post-processing module and an output module. As shown in fig. 4, the lexical analysis system of the present invention runs through the following specific implementation steps:
1. the input module receives sentences to be analyzed.
2. The preprocessing module preprocesses the input sentence. The text data is first cleaned, for example by removing punctuation marks, stop words, and the like, and performing operations such as case conversion. This reduces noise and redundant information and improves analysis accuracy and efficiency. Second, low-frequency vocabulary is deleted from the text, which reduces the complexity and computational cost of model training.
3. The context construction module constructs a context environment for each word, including information such as the left and right words, their part-of-speech tags, dependency relations, and the like. By considering the surrounding words and sentence structure, the meaning and grammatical role of the current word can be better understood. For example, "bat" in "I saw a bat" may refer to the animal or to a baseball bat, and must be disambiguated according to the context.
4. The feature extraction module extracts information, such as part of speech, lexical morphology, etc., from the context for representing the word features.
5. The model training module trains the model using a machine learning algorithm (e.g., conditional random field) with a labeled dataset to learn the relationships between words and contexts and predict the part-of-speech labeling and grammar structure information for each word.
6. The decoding module parses the sentence using grammar rules based on the model prediction results, generating a grammar tree or other structured output results.
The output of the model is decoded and interpreted to produce a result that is understandable to humans. The role of the decoding module in the lexical analysis is to decode the model output into a readable and structured form, such as a grammar tree or a dependency graph, to help us better understand the relationship between the grammar structure of the sentence and the words.
7. The post-processing module performs post-processing on the output result of the decoding module, such as error correction, ambiguity resolution and the like, so as to improve the accuracy and stability of grammar analysis.
8. The output module outputs the result.
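The preprocessing step (step 2 above) can be sketched as follows; the stop-word list here is a tiny stand-in, and the function name `preprocess` is an assumption of the sketch rather than the patent's module API:

```python
# Minimal sketch of the preprocessing module: strip punctuation,
# lower-case the text, and drop stop words. STOP_WORDS is a
# placeholder; a real system would use a curated list.

import re

STOP_WORDS = {"a", "an", "the"}  # illustrative only

def preprocess(sentence):
    cleaned = re.sub(r"[^\w\s]", "", sentence).lower()
    return [w for w in cleaned.split() if w not in STOP_WORDS]

print(preprocess("I saw a bat."))  # -> ['i', 'saw', 'bat']
```

The token list produced here is what the context construction module (step 3) would consume.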
FIG. 5 is a block diagram illustrating an embodiment of a context-based lexical analysis apparatus according to the present invention. As shown in fig. 5, the context-based lexical analysis apparatus 100 includes a data acquisition unit 110, a feature extraction unit 120, a model training unit 130, and an analysis prediction unit 140.
A data acquisition unit 110 for acquiring a data set with labels.
Specifically, in a data preparation stage, a data set with tags is prepared, the data set including a plurality of sentences that have been subjected to word segmentation processing, part-of-speech tagging, and dependency analysis, and a context constructed for each word of each of the plurality of sentences based on the results of the word segmentation processing, part-of-speech tagging, and dependency analysis, each sentence being composed of a word sequence and a corresponding part-of-speech tagging sequence. The data acquisition unit 110 acquires a data set with labels.
Before word segmentation, part-of-speech tagging and dependency analysis, the sentences to be analyzed are preprocessed. First, the text to be analyzed is cleaned, for example by removing punctuation marks, stop words, and the like, and performing operations such as case conversion, which reduces noise and redundant information and improves analysis accuracy and efficiency. Second, low-frequency vocabulary is deleted from the text, which reduces the complexity and computational cost of model training.
After preprocessing, sentence segmentation processing, part-of-speech tagging and dependency analysis are performed on the to-be-analyzed object, and a context environment is built for each word.
The context environment may specifically include: the left (preceding) and right (following) words of each word, the part-of-speech tags of the left and right words, and the dependency relation of each word with its left and right words. By considering the surrounding words and sentence structure, the meaning and grammatical role of the current word can be better understood. For example, "bat" in "I saw a bat" may refer to the animal or to a baseball bat, and must be disambiguated according to the context.
Specifically, the process of building a context for each word can be implemented by:
1. word segmentation: each sentence is segmented to obtain two or more words that make up each sentence, respectively, i.e., the text is divided into individual word units. This step is to determine the position of each word in the sentence.
2. Part of speech tagging: each word in each sentence is tagged with a part of speech, i.e. the part of speech role played by the word in the sentence is determined. For example, a noun, verb, adjective, etc.
3. Dependency analysis: dependency analysis is performed between the words of each sentence, i.e. the dependency relations between words (the grammatical relations between words, including subject-predicate relations, verb-object relations, etc.) are determined, analyzing which words are the subject, the predicate, the object, and so on. Dependency analysis helps in understanding the structure and semantic information of sentences.
4. Building a context environment: a context is built for each word based on the word segmentation, part-of-speech tagging, and dependency analysis performed. The contextual environment includes: the word to the left and the word to the right of each word, and the part-of-speech tags of the word to the left and the word to the right and the dependency relationship between each word and its word to the left and the word to the right. This can provide more context information to help understand and process words in sentences.
The process of building a context is described by way of one example:
original sentence: i like to eat apples.
Word segmentation: i likes |eating |apple.
Part of speech tagging: i (pronoun) | likes (verb) | eats (verb) | apples (noun).
Dependency analysis: "I" and "like" form a subject-predicate relation; "eat" and "apple" form a verb-object relation.
Building a context environment:
For the word "I", there is no word to the left; the word to the right is "like", tagged as a verb, and the dependency relation is a subject-predicate relation. Context environment: [ None, 'like', 'verb' ]
For the word "like", the word to the left is "I", tagged as a pronoun, with a subject-predicate dependency relation; the word to the right is "eat", tagged as a verb, with an unknown dependency relation. Context environment: [ 'I', 'pronoun', 'eat', 'verb' ]
For the word "eat", the word to the left is "like", tagged as a verb, with a verb-object dependency relation; the word to the right is "apple", tagged as a noun, with a verb-object dependency relation. Context environment: [ 'like', 'verb', 'apple', 'noun' ]
For the word "apple", the word to the left is "eat", tagged as a verb, with a verb-object dependency relation; there is no word to the right. Context environment: [ 'eat', 'verb', None ]
From the above example, it can be seen that each word is built up a corresponding context, including left and right words, part-of-speech tags, and dependencies. Such a context provides more comprehensive contextual information that facilitates more accurate understanding and processing of words.
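The context construction above can be sketched in Python. This is a minimal illustration using a fixed `[left, left_pos, right, right_pos]` encoding (one possible layout of the context lists shown above; the helper name `build_contexts` and the English glosses of the example words are assumptions):

```python
# Minimal sketch: build a context for each word from its left/right
# neighbours and their part-of-speech tags. Dependency fields are
# omitted here for brevity.

def build_contexts(words, pos_tags):
    contexts = []
    for i in range(len(words)):
        left = words[i - 1] if i > 0 else None
        left_pos = pos_tags[i - 1] if i > 0 else None
        right = words[i + 1] if i + 1 < len(words) else None
        right_pos = pos_tags[i + 1] if i + 1 < len(words) else None
        contexts.append([left, left_pos, right, right_pos])
    return contexts

words = ["I", "like", "eat", "apple"]          # glosses of the example
pos_tags = ["pronoun", "verb", "verb", "noun"]

contexts = build_contexts(words, pos_tags)
for w, ctx in zip(words, contexts):
    print(w, ctx)
```

A fuller version would also carry the dependency relation of each word with its neighbours, as the example contexts do.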
A feature extraction unit 120 for extracting feature information for representing the feature of the word from the context of each word.
The feature information representing word features includes, for example, at least one of part-of-speech features, dependency features, context features, and lexical morphology. Extracting information representing word features from the context may be based on several aspects:
Part of speech feature: part of speech is a description of the role a word plays in a sentence. By extracting part-of-speech information, the grammatical function and sentence structure of the word can be known. For example, in a named entity recognition task, noun part-of-speech features may be extracted to determine whether a word is an entity name.
Dependency features: dependency relations describe the grammatical relations between words, including subject-predicate relations, verb-object relations, and the like. By extracting dependency information, the syntactic dependency structure between words can be obtained. For example, in a relation extraction task, a subject-predicate relation feature may be extracted to determine whether a relation exists between two words.
Contextual characteristics: the context contains the word to the left (front) and the word to the right (rear) of each word as well as the part-of-speech tags of the words to the left and right and the dependency of each word with the words to the left and right. By extracting the context features, the changes and meaning of words in different contexts can be captured. For example, in a word sense disambiguation task, the part of speech and dependency characteristics of left and right words may be extracted to determine the specific meaning of an ambiguous word.
Vocabulary morphology: refers to the form and structure of words, including stems, affix, and other lexical variations. Lexical morphology may provide information about words, such as part of speech, tense, number, and language. The following are some common lexical morphologies and examples thereof:
1. stem (Stem): stem is the basic part of a word, usually in the form of a removed word tail. For example, the stem of the word "running" is "run", and the stem of "cats" is "cat". By means of the stem we can get the basic meaning of the word.
2. Affix (Affix): an affix is a part attached before or after the stem that may change the meaning or part of speech of the word. There are two types of affixes:
prefix (Prefix): an affix located in front of the stem. For example, "un-" is a negative prefix, and the word "happy" is changed to "unhappy", indicating that it is not happy.
-Suffix (Suffix): an affix located after the stem. For example, "ness" is a noun suffix, changing the adjective "happy" to the noun "happiness" indicates happiness.
3. Case form (Capitalization): the case form of the word may provide some information. The capitalized word is typically a proper noun, a word beginning with a sentence, or a title. For example, "John" is a personal name, and "London" is a place name.
By extracting lexical morphological information, a corresponding feature representation may be generated for each word, which may be used in natural language processing tasks such as part-of-speech tagging, syntactic analysis, named entity recognition, etc., to enhance the understanding and analysis capabilities of the context.
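The morphology features above can be sketched with a deliberately naive affix stripper. It only handles the illustrative cases from the text ("running", "happiness", "unhappy", capitalized proper nouns); a real system would use a proper stemmer or lemmatizer, and the function name `naive_morphology` is an assumption of the sketch:

```python
# Very naive stem/affix/capitalization features, covering only the
# examples given above. Not a general-purpose stemmer.

def naive_morphology(word):
    features = {"capitalized": word[:1].isupper()}
    lower = word.lower()
    if lower.endswith("ness"):
        stem = lower[:-4]
        if stem.endswith("i"):                      # happiness -> happy
            stem = stem[:-1] + "y"
        features.update(stem=stem, suffix="ness")
    elif lower.endswith("ing"):
        stem = lower[:-3]
        if len(stem) > 2 and stem[-1] == stem[-2]:  # running -> run
            stem = stem[:-1]
        features.update(stem=stem, suffix="ing")
    elif lower.startswith("un"):
        features.update(stem=lower[2:], prefix="un")  # unhappy -> happy
    else:
        features["stem"] = lower
    return features

print(naive_morphology("running"))
print(naive_morphology("happiness"))
print(naive_morphology("London"))
```

Each returned dictionary is one word's morphology feature set, ready to merge with the part-of-speech, dependency, and context features above.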
In one specific embodiment, the feature extraction unit 120 extracts feature information for representing the feature of the word from the context of each word, including: extracting part-of-speech features of the current word, part-of-speech features of the word to the left of the current word, dependency features of the current word with the word to the left thereof, part-of-speech features of the word to the right of the current word, and dependency features of the current word with the word to the right thereof from the context of the current word.
For example, to illustrate how information representing word features can be extracted from a context, assume the context: [ 'like', 'verb', 'eat', 'verb' ]. The features that can be extracted include:
part of speech feature of the current word: "verb".
Part of speech feature of left word: "like" corresponds to a verb.
Dependency feature of the left word: a subject-predicate relation.
Part of speech feature of right word: "eat" corresponds to the verb.
Dependency characteristics of right word: unknown.
By extracting these features, we can express the word "eat" as a vector containing the features described above, which aids in subsequent semantic understanding and processing tasks. The selection and extraction modes of the features can be flexibly adjusted according to the requirements of specific tasks and the design of the model.
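The five features listed above can be assembled into a feature dictionary, as sketched below (field names and the helper name `extract_features` are illustrative choices, not the patent's prescribed schema):

```python
# Minimal sketch: collect the current word's part of speech plus the
# left/right words' tags and dependency relations into one feature
# record, following the example features listed above.

def extract_features(cur_pos, left, left_pos, left_dep,
                     right, right_pos, right_dep):
    return {
        "pos": cur_pos,
        "left_word": left, "left_pos": left_pos, "left_dep": left_dep,
        "right_word": right, "right_pos": right_pos, "right_dep": right_dep,
    }

features = extract_features(
    cur_pos="verb",
    left="like", left_pos="verb", left_dep="subject-predicate",
    right="eat", right_pos="verb", right_dep=None,  # unknown
)
print(features)
```

Such a dictionary is the per-word input that would then be vectorized for the machine learning algorithm.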
After extracting the feature information representing the word features from the context of each word, the extracted feature information is converted into a numerical representation usable by a machine learning algorithm for model training.
A model training unit 130 for performing model training on the labeling data in the labeled dataset and the feature information extracted from the context of each word by a machine learning algorithm to obtain a model for predicting the part of speech of each word in the sentence and the dependency relationship between the words.
Specifically, the model training unit 130 performs model training, through a machine learning algorithm, on the labeled data in the labeled dataset and the feature information extracted from the context of each word. For example, a conditional random field, a neural network, or the like performs model training on the prepared feature information and labeled data. The goal of training is to learn the relationships between words and contexts, and to predict the part-of-speech tag and grammar structure information of each word.
An analysis prediction unit 140, configured to input a sentence to be analyzed into the model, and output a prediction result of performing analysis prediction on the sentence to be analyzed.
Specifically, when the trained model is used for prediction, an unlabeled sentence is input, and a word sequence of the sentence is included. The model performs part-of-speech tagging and dependency analysis on each word in the sentence and outputs a prediction result, which may include a predicted part-of-speech tagging sequence and dependency tree.
For example, for the sentence "I like to eat apples", the following outputs can be obtained after model training:
Part-of-speech tagging sequence: "pronoun verb verb noun". Dependency tree: [ ('like', 'I', subject-predicate relation), ('eat', 'like', verb-object relation), ('apple', 'eat', verb-object relation) ]. Such output helps in understanding the grammatical structure of the sentence, the dependency relations between words, and the role played by each word. Such information is useful for subsequent natural language processing tasks such as question-answering systems, machine translation, etc.
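The labelled-data-in, per-word-predictions-out flow can be illustrated with a toy stand-in for the trained model: a most-frequent-tag baseline. This is emphatically not the invention's CRF or neural model, only a sketch of the data flow, and the function names are assumptions:

```python
# Toy stand-in for the training/prediction flow: learn each word's
# most frequent tag from labelled sentences, then tag a new sentence.

from collections import Counter, defaultdict

def train_baseline(tagged_sentences):
    counts = defaultdict(Counter)
    for sentence in tagged_sentences:
        for word, tag in sentence:
            counts[word][tag] += 1
    return {w: c.most_common(1)[0][0] for w, c in counts.items()}

def predict(model, words, default="noun"):
    return [model.get(w, default) for w in words]

data = [[("I", "pronoun"), ("like", "verb"),
         ("eat", "verb"), ("apple", "noun")]]
model = train_baseline(data)
print(predict(model, ["I", "like", "apple"]))  # -> ['pronoun', 'verb', 'noun']
```

A real implementation would replace the frequency table with a CRF or neural model trained on the context features, and would also emit the dependency tree.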
FIG. 6 is a block diagram of another embodiment of a context-based lexical analysis apparatus provided by the present invention. As shown in fig. 6, the context-based lexical analysis apparatus 100 further includes:
The parsing unit 150 is configured to parse the sentence by using a grammar rule according to a prediction result of performing analysis prediction on the sentence to be analyzed, and generate a structured output result in a preset form.
The structured output result in the preset form comprises: syntax tree.
Specifically, the syntax tree may be generated by:
1. Part-of-speech tagging: the model prediction result may include a predicted part-of-speech tagging sequence and a dependency tree. The predicted part-of-speech tagging sequence is applied to the original sentence, associating each word with its corresponding part of speech.
Assume an original sentence: "I love cats and dogs."
The part-of-speech tagging sequence predicted by the model is: [ "PRON", "VERB", "NOUN", "CONJ", "NOUN", "." ], giving the part of speech of each word.
Each word is associated with its corresponding part of speech, i.e. the part of speech tagging sequence of each word predicted by the model is sequentially associated with the words in the original sentence. For example, the part of speech corresponding to "I" is "PRON", "love" is "VERB", and so on.
-"I"->"PRON"
-"love"->"VERB"
-"cats"->"NOUN"
-"and"->"CONJ"
-"dogs"->"NOUN"
-"."->"."
Thus, the association relation between each word in the original sentence and the corresponding part of speech is established. These associations may be used for further analysis and processing, such as building syntax trees, dependency analysis, etc.
For example, by the above-described association relationship, the following syntax tree (refer to fig. 7) can be constructed:
in this syntax tree, each parenthesis represents a word and its corresponding part of speech. In this way, a structured output result may be generated based on the part-of-speech tagging sequence of the model predictions.
2. Dependency analysis: according to the dependency tree information predicted by the model, connect each word with its dependency head word. Dependency labels may be used to represent different syntactic relations, such as subject-predicate relations, verb-object relations, and the like.
In dependency analysis, a head word is the word that another word depends on, and may also be called the "parent", "governor" or "center word". Dependency labels are used to represent syntactic relations, such as subject-predicate relations, verb-object relations, and the like. Each word is connected to its dependency head word, i.e. each word is connected to its head word according to the dependency tree.
The following are the steps for connecting each word to its dependency head word, together with an example:
1) Assume an original sentence: "I love cats and dogs."
2) The dependency tree for model prediction is shown below (see FIG. 8):
3) Establishing a connection relation: according to the dependency tree, for each word, it is connected to its corresponding head word. For example:
- The head word of "I" is "love", with the dependency label "nsubj" (subject relation).
- "love" has no head word: it is the root node, so it carries no dependency label.
- The head word of "cats" is "love", with the dependency label "dobj" (direct object relation).
- The head word of "and" is "cats", with the dependency label "cc" (coordinating conjunction).
- The head word of "dogs" is "cats", with the dependency label "conj" (coordination relation).
In this way, a connection is established between each word and its dependency head word.
For example, in the above example, the following connection relationship may be obtained:
- "I" -> "love" (nsubj)
- "love" -> None (root)
- "cats" -> "love" (dobj)
- "and" -> "cats" (cc)
- "dogs" -> "cats" (conj)
In this way, we can extract the dependency head word of each word, thus completing the dependency analysis.
3. Parsing the syntax tree: based on the part-of-speech tags and the dependency relationships, the sentences are analyzed according to grammar rules, and grammar trees or other structured output results are generated. The process of parsing the output results of the generation of the grammar tree or other structure depends on the definition and application of the grammar rules. Grammar rules may define syntactic relationships and constraints between different components based on linguistic knowledge. By applying grammar rules, the predicted part-of-speech tags and dependencies are converted into output results with structured representations. A syntax tree is a way to present sentence components and syntactic relations in a tree structure, each node of the tree representing a word, the edges between the nodes representing syntactic relations.
Grammar rules are a set of conventions that specify the relations between sentence structures and components. They define the normal form of language expression, describing how words combine into phrases and how phrases combine into sentences. Generating the syntax tree may specifically include: representing each word as a node and each syntactic relation as an edge. The edges carry the relation information between words in the syntax tree, which helps us understand the structure and grammatical relations of the sentence. This edge-connection scheme makes the syntax tree a structure that is easy to visualize and parse.
For example, for the sentence "I love cats and dogs.", the following syntax tree (see fig. 9) may be constructed through part-of-speech tagging and dependency analysis:
In this syntax tree, each word becomes a node, and the syntactic relations are represented by edges. For example, "love" is a verb whose subject is "I" (subject-predicate relation), while it also has a direct object "cats" (verb-object relation). Similarly, "cats" and "dogs" are connected through the coordinating conjunction "and".
In this example, part-of-speech tagging, dependency analysis, and parsing the syntax tree are interrelated, interdependent steps. Part-of-speech tagging provides the basic grammatical attributes of words, while dependency analysis determines the syntactic relations between words. Parsing the syntax tree then uses the part-of-speech tags and the dependency relations to construct a structured representation of the sentence. Therefore, these three steps are typically performed in sequence, each building on the previous one, to produce the final output result. Other structured output results may be, for example, a dependency graph.
The resulting syntax tree or other structured output is an expression of the syntax structure of the sentence. It can be used for further tasks of semantic understanding, question-answering, text generation, etc. The output results provide structured information about sentence components and relationships, facilitating in-depth analysis and processing of sentences.
Optionally, based on the above embodiment, the apparatus 100 further includes: a post-processing unit (not shown). And the post-processing unit is used for carrying out post-processing on the generated structured output result in the preset form so as to improve the accuracy and stability of the grammar analysis.
Wherein, the post-processing specifically may include: at least one of error correction, disambiguation, context adjustment, normalization, manual verification.
Error correction: output errors are detected and corrected by rules or machine learning methods. For example, the degree of match between the output syntax tree (or other structured result) and the grammar rules is checked to determine whether an error exists, and a correction is attempted.
Disambiguation: when the output result is ambiguous, a preset heuristic algorithm, rule, or statistical method can be applied to resolve the ambiguity, for example by considering context information, word order, and syntactic constraints to select the most plausible parse.
Context adjustment: the output result may be adjusted using context information. For example, the tags and dependency relationships are fine-tuned according to the semantic information and logical relationships of the preceding or following text, so that the generated syntax tree or other structured result is consistent with the semantics of the whole sentence.
Normalization: the output results may be normalized to achieve consistency and standardization. For example, structures that have the same meaning but take different forms may be converted into a unified form.
Manual verification: in some cases, the output may need to be submitted for manual verification and correction. Further adjustments and improvements may come from manual review, annotator feedback, or expert domain knowledge.
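Two of the post-processing steps listed above, normalization and disambiguation, can be sketched as follows; the label map and the scoring heuristic are illustrative assumptions, not the patent's actual rules:

```python
# Hedged sketch of two post-processing steps: normalization and
# disambiguation. LABEL_MAP and the scoring heuristic are
# illustrative assumptions, not rules from the patent.

# Normalization: map variant relation labels to one unified form.
LABEL_MAP = {"subj": "nsubj", "dobj": "obj", "coord": "conj"}

def normalize(triples):
    return [(h, LABEL_MAP.get(rel, rel), d) for h, rel, d in triples]

# Disambiguation: score candidate parses against simple syntactic
# constraints and keep the most plausible one.
def score(triples):
    rels = [rel for _, rel, _ in triples]
    s = int(rels.count("nsubj") == 1)   # reward exactly one subject
    s -= max(0, rels.count("obj") - 1)  # penalize multiple objects
    return s

def disambiguate(candidates):
    return max(candidates, key=lambda c: score(normalize(c)))

cand_a = [("love", "subj", "I"), ("love", "dobj", "cats")]
cand_b = [("love", "dobj", "I"), ("love", "dobj", "cats")]
best = normalize(disambiguate([cand_a, cand_b]))  # cand_a: one subject, one object
```

A real system would use richer constraints (word order, context), but the pattern of normalize-then-score-then-select carries over.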
In general, corresponding post-processing strategies and methods can be designed according to the specific application requirements and tasks. The goal of post-processing is to improve the accuracy, stability, and reliability of the parsing result, so that it better conforms to the grammar rules and context of actual language use.
Post-processing mainly operates on the output of the model in order to improve the accuracy and stability of the grammar analysis system, and these steps are typically performed after the model's output has been generated. When training the model, an annotated data set containing the correct parsing results is typically used, and the training objective is to make the model's output match the results in the annotated data set as closely as possible. Therefore, post-processing operations such as error correction or disambiguation are not actively performed during the training phase.
However, the idea behind post-processing can still influence the training phase. For example, slight errors or ambiguities can be deliberately added to the annotated data set so that the model is trained to handle similar situations better. Such a training approach may improve the robustness and generalization ability of the model.
In addition, certain techniques may be employed during model training to improve the model's ability to correct errors and resolve ambiguity. For example, a suitable loss function may be designed so that the model pays more attention to error-prone cases during training, and additional supervisory signals or prior knowledge may be introduced to help the model handle errors and ambiguities better.
In summary, although the post-processing steps are not applied directly to the training of the model itself, reasonable design and training techniques can give the model a stronger ability to handle errors and ambiguities, thereby improving the performance of the parsing system.
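The noise-injection idea mentioned above, deliberately perturbing annotations so the model learns to tolerate labelling errors, might look like the following sketch; the tag set and noise rate are assumptions for illustration:

```python
# Sketch of injecting annotation noise into training data so the
# model learns to tolerate labelling errors, as suggested above.
# The tag set and noise rate are illustrative assumptions.
import random

TAGS = ["NOUN", "VERB", "ADJ", "PRON", "CCONJ"]

def inject_label_noise(tagged_sentence, rate=0.1, rng=None):
    """Replace each part-of-speech tag with a wrong one with probability `rate`."""
    rng = rng or random.Random(0)  # seeded for reproducibility
    noisy = []
    for word, tag in tagged_sentence:
        if rng.random() < rate:
            tag = rng.choice([t for t in TAGS if t != tag])
        noisy.append((word, tag))
    return noisy

sent = [("I", "PRON"), ("love", "VERB"), ("cats", "NOUN")]
noisy = inject_label_noise(sent, rate=0.5)
# The words themselves are unchanged; only some tags may be perturbed.
```

Training on a mixture of clean and perturbed copies of the data is one plausible way to realize the robustness benefit the text describes.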
The invention also provides a storage medium corresponding to the context-based word method analysis method, having stored thereon a computer program which, when executed by a processor, performs the steps of any of the methods described above.
Accordingly, by considering the surrounding words and sentence structure, the scheme provided by the invention can better understand the meaning and grammatical role of the current word. Compared with traditional grammar analysis methods based on rules or statistical machine learning, it can more accurately eliminate ambiguity and understand context, improving analysis accuracy and efficiency. The scheme not only considers context information but also combines various other features, such as part of speech and dependency relationships. By comprehensively considering these different features, word method analysis can be performed more accurately, improving the analysis effect and generalization capability.
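The combination of features described here, the current word's part of speech plus the tags and dependency relations of its neighbours, could be sketched as a simple context-window extractor; the feature names and the `<BOS>`/`<EOS>` sentinels are assumptions:

```python
# Illustrative context-window feature extractor combining the
# features named above: the current word's part of speech plus the
# tags and dependency relations of its left and right neighbours.
# Feature names and the <BOS>/<EOS> sentinels are assumptions.
def context_features(sent, i):
    """sent: list of (word, pos_tag, dep_relation) triples."""
    _, pos, _ = sent[i]
    left = sent[i - 1] if i > 0 else ("<BOS>", "<BOS>", "<BOS>")
    right = sent[i + 1] if i < len(sent) - 1 else ("<EOS>", "<EOS>", "<EOS>")
    return {
        "pos": pos,
        "left_pos": left[1], "left_dep": left[2],
        "right_pos": right[1], "right_dep": right[2],
    }

sent = [("I", "PRON", "nsubj"), ("love", "VERB", "root"), ("cats", "NOUN", "obj")]
feats = context_features(sent, 1)  # features for "love"
```

A feature dictionary of this shape is the kind of input a standard machine learning classifier could be trained on to predict part of speech and dependency relations jointly.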
In natural language processing, the same word may have several different meanings. Through context-based word method analysis, the scheme provided by the invention can determine the specific meaning and grammatical role of each word according to the sentence context, thereby eliminating ambiguity.
By considering the surrounding words and sentence structure, the scheme provided by the invention can better understand the meaning and grammatical role of the current word, which is very important for natural language processing tasks such as machine translation and sentiment analysis.
According to the scheme provided by the invention, taking context information into account reduces errors and confusion and improves analysis accuracy.
The functions described herein may be implemented in hardware, software executed by a processor, firmware, or any combination thereof. If implemented in software that is executed by a processor, the functions may be stored on or transmitted over as one or more instructions or code on a computer-readable medium. Other examples and implementations are within the scope and spirit of the application and the appended claims. For example, due to the nature of software, the functions described above may be implemented using software executed by a processor, hardware, firmware, hardwired, or a combination of any of these. In addition, each functional unit may be integrated in one processing unit, each unit may exist alone physically, or two or more units may be integrated in one unit.
In the several embodiments provided in the present application, it should be understood that the disclosed technology may be implemented in other ways. The apparatus embodiments described above are merely illustrative; for example, the division of the units may be merely a logical function division, and other divisions are possible in actual implementation: a plurality of units or components may be combined or integrated into another system, or some features may be omitted or not performed. In addition, the couplings, direct couplings, or communication connections shown or discussed may be indirect couplings or communication connections through interfaces, units, or modules, and may be electrical or in other forms.
The units described as separate components may or may not be physically separate, and the components shown as units may or may not be physical units; they may be located in one place or distributed over a plurality of units. Some or all of the units may be selected according to actual needs to achieve the purpose of the solution of this embodiment.
The integrated units, if implemented in the form of software functional units and sold or used as stand-alone products, may be stored in a computer-readable storage medium. Based on such understanding, the technical solution of the present invention, in essence or as the part contributing to the related art, or all or part of the technical solution, may be embodied in the form of a software product stored in a storage medium, including several instructions for causing a computer device (which may be a personal computer, a server, a network device, etc.) to perform all or part of the steps of the method according to the embodiments of the present invention. The aforementioned storage medium includes: a USB flash drive, a read-only memory (ROM), a random access memory (RAM), a removable hard disk, a magnetic disk, an optical disk, or other media capable of storing program code.
The above description is only an example of the present invention and is not intended to limit it; those skilled in the art may make various modifications and variations. Any modification, equivalent replacement, or improvement made within the spirit and principle of the present invention shall be included in the scope of the claims of the present invention.

Claims (10)

1. A method of context-based word analysis, comprising:
acquiring a data set with labels, wherein the data set comprises: a plurality of sentences and a context constructed for each word of each of the plurality of sentences, wherein each sentence is comprised of a sequence of words and a corresponding sequence of part-of-speech tags;
extracting feature information for representing a feature of a word from the context of each word, the feature information comprising: at least one of a part-of-speech feature, a dependency feature, a context feature, and a lexical morphology feature;
model training is carried out on the marked data in the marked data set and the characteristic information extracted from the context environment of each word through a machine learning algorithm to obtain a model for predicting the part of speech of each word in a sentence and the dependency relationship between the words;
inputting sentences to be analyzed into the model, and outputting a prediction result of analyzing and predicting the sentences to be analyzed.
2. The method of claim 1, wherein the context comprises: the word to the left and the word to the right of each word, and the part-of-speech tags of the word to the left and the word to the right and the dependency relationship between each word and the word to the left and the word to the right thereof;
wherein the process of constructing a context for each word of each sentence of the plurality of sentences comprises:
word segmentation is carried out on each sentence so as to respectively obtain more than two words forming each sentence;
marking the parts of speech of each word in each sentence;
performing dependency analysis between each word in each sentence;
a context is built for each word based on the word segmentation, part-of-speech tagging, and dependency analysis performed.
3. The method of claim 2, wherein extracting feature information representing word features from the context of each word comprises:
extracting part-of-speech features of the current word, part-of-speech features of the word to the left of the current word, dependency features of the current word with the word to the left thereof, part-of-speech features of the word to the right of the current word, and dependency features of the current word with the word to the right thereof from the context of the current word.
4. A method according to any one of claims 1-3, further comprising:
analyzing the sentence by using grammar rules according to the prediction result of analyzing and predicting the sentence to be analyzed, and generating a structured output result in a preset form, wherein the structured output result in the preset form comprises the following steps: syntax tree.
5. The method of claim 4, wherein:
post-processing is carried out on the generated structured output result in the preset form so as to improve the accuracy and stability of grammar analysis;
the post-processing includes: at least one of error correction, disambiguation, context adjustment, normalization, manual verification.
6. A context-based word method analysis apparatus, comprising:
a data acquisition unit, configured to acquire a data set with a label, where the data set includes: a plurality of sentences and a context constructed for each word of each of the plurality of sentences, wherein each sentence is comprised of a sequence of words and a corresponding sequence of part-of-speech tags;
a feature extraction unit for extracting feature information for representing a feature of a word from the context of each word, the feature information comprising: at least one of part-of-speech characteristics, dependency characteristics, context characteristics, and lexical morphology;
a model training unit, configured to perform model training on the labeling data in the labeled dataset and the feature information extracted from the context of each word by using a machine learning algorithm, so as to obtain a model for predicting the part of speech of each word in a sentence and the dependency relationship between words;
the analysis and prediction unit is used for inputting sentences to be analyzed into the model and outputting prediction results of analysis and prediction of the sentences to be analyzed.
7. The apparatus of claim 6, wherein the contextual environment comprises: the word to the left and the word to the right of each word, and the part-of-speech tags of the word to the left and the word to the right and the dependency relationship between each word and the word to the left and the word to the right thereof;
wherein the process of constructing a context for each word of each sentence of the plurality of sentences comprises:
word segmentation is carried out on each sentence so as to respectively obtain more than two words forming each sentence;
marking the parts of speech of each word in each sentence;
performing dependency analysis between each word in each sentence;
a context is built for each word based on the word segmentation, part-of-speech tagging, and dependency analysis performed.
8. The apparatus according to claim 7, wherein the feature extraction unit extracts feature information representing a feature of a word from the context of the each word, comprising:
extracting part-of-speech features of the current word, part-of-speech features of the word to the left of the current word, dependency features of the current word with the word to the left thereof, part-of-speech features of the word to the right of the current word, and dependency features of the current word with the word to the right thereof from the context of the current word.
9. The apparatus according to any one of claims 6-8, further comprising:
the parsing unit is configured to parse the sentence by using a grammar rule according to a prediction result of performing analysis prediction on the sentence to be analyzed, and generate a structured output result in a preset form, where the structured output result in the preset form includes: syntax tree.
10. A storage medium having stored thereon a computer program which when executed by a processor performs the steps of the method of any of claims 1-5.
CN202310975048.3A 2023-08-03 2023-08-03 Word method analysis method and device based on context and storage medium Pending CN116992865A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202310975048.3A CN116992865A (en) 2023-08-03 2023-08-03 Word method analysis method and device based on context and storage medium


Publications (1)

Publication Number Publication Date
CN116992865A true CN116992865A (en) 2023-11-03

Family

ID=88527938

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202310975048.3A Pending CN116992865A (en) 2023-08-03 2023-08-03 Word method analysis method and device based on context and storage medium

Country Status (1)

Country Link
CN (1) CN116992865A (en)


Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination