CN118211593B - Session semantic recognition method and device - Google Patents
- Publication number
- CN118211593B CN118211593B CN202410315061.0A CN202410315061A CN118211593B CN 118211593 B CN118211593 B CN 118211593B CN 202410315061 A CN202410315061 A CN 202410315061A CN 118211593 B CN118211593 B CN 118211593B
- Authority
- CN
- China
- Prior art keywords
- network layer
- conversation
- transformer
- word
- network
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Active
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F40/00—Handling natural language data
- G06F40/30—Semantic analysis
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/30—Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
- G06F16/33—Querying
- G06F16/332—Query formulation
- G06F16/3329—Natural language query formulation
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/30—Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
- G06F16/35—Clustering; Classification
- G06F16/353—Clustering; Classification into predefined classes
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F40/00—Handling natural language data
- G06F40/20—Natural language analysis
- G06F40/205—Parsing
- G06F40/211—Syntactic parsing, e.g. based on context-free grammar [CFG] or unification grammars
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/04—Architecture, e.g. interconnection topology
- G06N3/044—Recurrent networks, e.g. Hopfield networks
- G06N3/0442—Recurrent networks, e.g. Hopfield networks characterised by memory or gating, e.g. long short-term memory [LSTM] or gated recurrent units [GRU]
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/04—Architecture, e.g. interconnection topology
- G06N3/045—Combinations of networks
- G06N3/0455—Auto-encoder networks; Encoder-decoder networks
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/08—Learning methods
Abstract
The invention discloses a session semantic recognition method and device, belonging to the technical field of semantic recognition. The method comprises: converting acquired session audio data into session text data and identifying session keywords from the session text data; inputting the session keywords and the session sequence into a BERT model to obtain word vectors of the session keywords and a session text vector; constructing a document from the keyword word vectors and the session text vector, and extracting effective information from the constructed document with a Doc2vec model to generate a document vector related to both the keywords and the session context; and inputting the document vector into an intention recognition model to obtain the output session recognition result. The method and device perceive text semantics more sharply, learn semantic features at depth, and improve the accuracy of session semantic understanding.
Description
Technical Field
The invention relates to the technical field of semantic recognition, in particular to a method and a device for recognizing session semantics.
Background
In medical conversational speech recognition scenarios, the user's spoken responses must be analyzed. The approach currently adopted is keyword matching, and existing keyword-matching techniques have the following defects:
(1) No keyword lexicon is established, and relying entirely on keyword matching loses the contextual relations of the conversation;
(2) No intention recognition model is established for the session, so considerable manpower is needed to listen to recordings repeatedly in order to judge customer intent, which is very inefficient.
Therefore, a set of models based on session context needs to be established to identify the intent of customer answers, solving the technical problem that sales personnel cannot accurately grasp customer intent and improving sales accuracy. In principle, prior-art machine learning methods, deep learning models, or attention-based models could serve as the intention recognition model, but these models generalize poorly and cannot adapt well to new situations, fail to accurately understand the complex sentences and fine distinctions of medical text, tend to lose information when processing long input sequences, or depend on large amounts of high-quality labeled data.
Disclosure of Invention
In view of the above drawbacks and shortcomings of the prior art, the present invention provides a session semantic recognition method and apparatus that solve the above technical problems in whole or in part. In one aspect, the invention realizes end-to-end semantic encoding through the combined BERT-Doc2Vec model, restoring context information losslessly to obtain a document vector related to both the keywords and the session context: the BERT model preserves word sense, the Doc2Vec model fuses the contextual relations between words, and the resulting document vector is both precise and abstract. In another aspect, the document vector is input into an original intention recognition model that introduces textCNN and LSTM networks, enabling multi-scale modeling of text semantic features and simultaneous extraction of local and global semantic information. In addition, an LSTM network introduced at the encoding-decoding end learns long-range dependency information in the text and strengthens context modeling, while a multi-layer, overlapping, progressive encoding-decoding structure forms an end-to-end semantic abstraction pipeline that deeply optimizes text semantics. By modeling text semantics from multiple views and propagating semantic abstraction and refinement information, the model ultimately perceives text semantics more sharply and accurately. Compared with a single structure, the progressive encoding-decoding network structure learns deeper semantic features and improves text expressiveness.
In one aspect of the present invention, a method for identifying session semantics is provided, including:
converting the acquired session audio data into session text data, and identifying session keywords from the session text data;
inputting the session keywords and the session sequence into a BERT model to obtain word vectors of the session keywords and a session text vector;
constructing a document from the word vectors of the session keywords and the session text vector, extracting effective information from the constructed document using a Doc2vec model, and generating a document vector related to both the keywords and the session context; and
inputting the document vector into an intention recognition model to obtain an output session recognition result.
Further, the intent recognition model includes an encoding network, a first textCNN-LSTM network layer, a decoding network, and a recognition network;
The encoding network comprises a first Transformer encoder network layer, a second textCNN-LSTM network layer, and a second Transformer encoder network layer connected in sequence, wherein the first Transformer encoder network layer receives the input document vector and is simultaneously skip-connected to the second Transformer encoder network layer;
the decoding network comprises a first Transformer decoder network layer, a third textCNN-LSTM network layer, and a second Transformer decoder network layer connected in sequence, wherein the first Transformer decoder network layer is also skip-connected to the second Transformer decoder network layer;
the second Transformer encoder network layer in the encoding network is directly skip-connected to the first Transformer decoder network layer in the decoding network, while also being connected to the first Transformer decoder network layer through the first textCNN-LSTM network layer;
wherein the first, second, and third textCNN-LSTM network layers each comprise a textCNN module and an LSTM module connected in sequence;
the recognition network comprises a Softmax classification layer; the output of the second Transformer decoder network layer in the decoding network is connected to the Softmax classification layer, and the Softmax classification layer outputs the final recognition result.
Further, the step of identifying the session keyword from the session text data further includes:
establishing lexicons from the session text data, the lexicons comprising a keyword lexicon, a stop-word lexicon, a proper-noun lexicon, and a forbidden-word lexicon; and
extracting effective information from the session text with a word segmentation tool according to the lexicons to obtain a word segmentation result.
Further, the first and second Transformer encoder network layers each comprise a first self-attention module, a first residual-and-normalization module, a first feed-forward network, and a second residual-and-normalization module connected in sequence;
the output of the second residual-and-normalization module of the first Transformer encoder network layer is skip-connected to the first self-attention module of the second Transformer encoder network layer.
Further, the first and second Transformer decoder network layers each comprise a second self-attention module, a third residual-and-normalization module, a third self-attention module, a fourth residual-and-normalization module, a second feed-forward network, and a fifth residual-and-normalization module connected in sequence;
the output of the second residual-and-normalization module of the second Transformer encoder network layer is skip-connected to the third self-attention module of the first Transformer decoder network layer.
Further, the output of the fifth residual-and-normalization module of the first Transformer decoder network layer is skip-connected to the third self-attention module of the second Transformer decoder network layer.
In another aspect of the present invention, there is provided a session semantic recognition apparatus, including:
the keyword acquisition module is configured to convert the acquired session audio data into session text data, and identify session keywords from the session text data;
The vectorization module is configured to input the conversation keywords and the conversation sequence into the BERT model to obtain word vectors and conversation text vectors of the conversation keywords;
The document vector construction module is configured to construct a document according to the word vector of the conversation keyword and the conversation text vector, extract effective information of the constructed document by using a Doc2vec model and generate a document vector related to both the keyword and the conversation context;
And the session identification module is configured to input the document vector into the intention identification model to obtain an output session identification result.
Further, the intent recognition model includes an encoding network, a first textCNN-LSTM network layer, a decoding network, and a recognition network;
The encoding network comprises a first Transformer encoder network layer, a second textCNN-LSTM network layer, and a second Transformer encoder network layer connected in sequence, wherein the first Transformer encoder network layer receives the input document vector and is simultaneously skip-connected to the second Transformer encoder network layer;
the decoding network comprises a first Transformer decoder network layer, a third textCNN-LSTM network layer, and a second Transformer decoder network layer connected in sequence, wherein the first Transformer decoder network layer is also skip-connected to the second Transformer decoder network layer;
the second Transformer encoder network layer in the encoding network is directly skip-connected to the first Transformer decoder network layer in the decoding network, while also being connected to the first Transformer decoder network layer through the first textCNN-LSTM network layer;
wherein the first, second, and third textCNN-LSTM network layers each comprise a textCNN module and an LSTM module connected in sequence;
the recognition network comprises a Softmax classification layer; the output of the second Transformer decoder network layer in the decoding network is connected to the Softmax classification layer, and the Softmax classification layer outputs the final recognition result.
Further, the keyword acquisition module is further configured to:
establishing lexicons from the session text data, the lexicons comprising a keyword lexicon, a stop-word lexicon, a proper-noun lexicon, and a forbidden-word lexicon; and
extracting effective information from the session text with a word segmentation tool according to the lexicons to obtain a word segmentation result.
Further, the first and second Transformer encoder network layers each comprise a first self-attention module, a first residual-and-normalization module, a first feed-forward network, and a second residual-and-normalization module connected in sequence;
the output of the second residual-and-normalization module of the first Transformer encoder network layer is skip-connected to the first self-attention module of the second Transformer encoder network layer;
the first and second Transformer decoder network layers each comprise a second self-attention module, a third residual-and-normalization module, a third self-attention module, a fourth residual-and-normalization module, a second feed-forward network, and a fifth residual-and-normalization module connected in sequence;
the output of the second residual-and-normalization module of the second Transformer encoder network layer is skip-connected to the third self-attention module of the first Transformer decoder network layer;
the output of the fifth residual-and-normalization module of the first Transformer decoder network layer is skip-connected to the third self-attention module of the second Transformer decoder network layer.
The method and the device for identifying the session semantics have the following beneficial effects:
(1) End-to-end semantic encoding is realized through the combined BERT-Doc2Vec model, losslessly restoring context information to obtain a document vector related to both the keywords and the session context. The BERT model preserves word sense, the Doc2Vec model fuses word context, and the document vector is both precise and abstract.
(2) The original intention recognition model introduces textCNN and LSTM networks, enabling multi-scale modeling of text semantic features and simultaneous extraction of local and global semantic information from the text. An LSTM network introduced at the encoding-decoding end learns long-range dependency information in the text and strengthens context modeling. A multi-layer, overlapping, progressive encoding-decoding structure forms an end-to-end semantic abstraction pipeline, so that text semantics are deeply optimized: modeling text semantics from multiple views and propagating semantic abstraction and refinement information yields sharper, more accurate perception of text semantics.
Drawings
Other features, objects and advantages of the present application will become more apparent upon reading of the detailed description of non-limiting embodiments, made with reference to the accompanying drawings in which:
FIG. 1 is a flow chart of a method for identifying session semantics provided by one embodiment of the present application;
FIG. 2 is a schematic diagram of the structure of an intent recognition model provided by an embodiment of the present application;
FIG. 3 is a schematic diagram of the structure of an encoding network and decoding network provided by one embodiment of the present application;
Fig. 4 is a schematic structural diagram of a session semantic recognition device according to an embodiment of the present application.
Detailed Description
For the purpose of making the objects, technical solutions and advantages of the embodiments of the present invention more apparent, the technical solutions of the embodiments of the present invention will be clearly and completely described below with reference to the accompanying drawings in the embodiments of the present invention, and it is apparent that the described embodiments are some embodiments of the present invention, but not all embodiments of the present invention. All other embodiments, which can be made by those skilled in the art based on the embodiments of the invention without making any inventive effort, are intended to be within the scope of the invention.
The terminology used in the embodiments of the invention is for the purpose of describing particular embodiments only and is not intended to be limiting of the invention. As used in this application and the appended claims, the singular forms "a," "an," and "the" are intended to include the plural forms as well, unless the context clearly indicates otherwise.
It should be understood that although the terms first, second, third, etc. may be used in embodiments of the present invention to describe the acquisition modules, these acquisition modules should not be limited to these terms. These terms are only used to distinguish the acquisition modules from each other.
The term "if" as used herein may be interpreted as "when", "upon", "in response to determining", or "in response to detecting", depending on the context. Similarly, the phrases "if it is determined" or "if (a stated condition or event) is detected" may be interpreted as "when it is determined", "in response to determining", "when (the stated condition or event) is detected", or "in response to detecting (the stated condition or event)", depending on the context.
It should be noted that, the terms "upper", "lower", "left", "right", and the like in the embodiments of the present invention are described in terms of the angles shown in the drawings, and should not be construed as limiting the embodiments of the present invention. In addition, in the context, it will also be understood that when an element is referred to as being formed "on" or "under" another element, it can be directly formed "on" or "under" the other element or be indirectly formed "on" or "under" the other element through intervening elements.
Referring to fig. 1, an embodiment of the present invention provides a session semantic recognition method, described below using an Internet-hospital telephone return-visit scenario as an example, comprising the following steps:
Step S101, converting the acquired session audio data into session text data, and identifying session keywords from the session text data.
First, Internet-hospital telephone return-visit data is acquired. The aim is to collect original voice-interaction text data as training and testing samples for the model, clean the text data, and construct the various lexicons required for model training.
Specifically, the telephone return-visit audio data is converted into session text data; a keyword lexicon, a stop-word lexicon, a proper-noun lexicon, and a forbidden-word lexicon are then established from the session text data; and effective text information is extracted with a word segmentation tool to obtain the session keywords.
Establishing the proper-noun lexicon. Medical proper nouns appear in Internet-hospital telephone return-visit sessions and may be segmented incorrectly by the basic lexicon, so a dedicated lexicon must be established to resolve them. Examples include "compensatory growth", "epiphyseal closure of the wrist", "follicle stimulating hormone", "luteinizing hormone", and the like.
Establishing the stop-word lexicon. On top of an existing stop-word lexicon, user-related and sensitive information that must be filtered for Internet-hospital telephone return visits is added, such as "phone", "age", "address", and the like.
Establishing the forbidden-word lexicon. On top of an existing forbidden-word lexicon, the special forbidden words required for Internet-hospital telephone return visits are added, such as "life", "lost", "gusty", "law", and the like.
The word segmentation tool can be based on the word frequencies of a keyword Trie-tree dictionary together with an HMM algorithm: a score is obtained between each word and its adjacent words, and text information is extracted along the DAG (directed acyclic graph) of candidate segmentations.
Keywords can be added according to the needs of the service scenario, but their word frequency must also be matched for segmentation to pick them up. For example:
the session is "so what we mainly look at now is really how many new patients we have";
the segmentation result is ['now', 'most', 'main', 'really', 'new', 'patient', ...];
the target result is "new patient" rather than "patient";
the analysis result for the token is "patient 3932 n", where 3932 is the word frequency and n is the part-of-speech attribute;
so the keyword "new patient" must be added and the word frequency of "new patient" adjusted.
The word segmentation model is expressed as follows: the input session sequence is X = {x_1, x_2, …, x_n}, and the segmentation result is y = {y_11, …, y_1m, y_21, …, y_2m, …, y_n1, …, y_nm}, where y_11 is the first word extracted from the first session. The HMM-based segmentation formula is:
p(y|x) = Π_i p(y_i | y_{i-1}, x_i)
The word frequency is set as follows: for a keyword w_i provided by the operator, i ∈ {1, 2, 3, …}, its word frequency in the keyword Trie-tree dictionary is assigned with a scaling factor b ∈ Z+, 35 by default, relative to the frequencies of the other words w_j, j ∈ {1, 2, …, i-1, i+1, …, n}; the segmentation result is then modified according to the HMM probability and the word frequency.
Word segmentation is applied to the text data of the call return-visit records:
k_i ∈ R
where k_i is the i-th segmentation result, i ∈ {1, 2, 3, …}, treated as a vector over the real domain R.
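The dictionary-plus-frequency mechanism described above (a DAG of candidate words scored by Trie-tree word frequencies, with the result steered by adjusting a keyword's frequency) can be sketched as a toy in pure Python. The dictionary, the frequencies, and the space-free input string are hypothetical stand-ins, not the patent's data; real segmenters such as jieba additionally run an HMM over out-of-vocabulary spans.

```python
import math

def segment(text, freq):
    """Pick the segmentation whose words maximize the sum of log
    frequencies (frequencies normalized by the dictionary total)."""
    total = sum(freq.values())
    n = len(text)
    maxlen = max(map(len, freq))
    # best[i]: (log-probability, words) of the best segmentation of text[:i]
    best = [(0.0, [])] + [(-math.inf, None)] * n
    for i in range(1, n + 1):
        for j in range(max(0, i - maxlen), i):
            word = text[j:i]
            # unknown single characters get a small floor frequency
            f = freq.get(word, 0.5 if len(word) == 1 else 0)
            if f:
                score = best[j][0] + math.log(f / total)
                if score > best[i][0]:
                    best[i] = (score, best[j][1] + [word])
    return best[n][1]

freq = {"new": 40, "patient": 3932, "how": 100, "many": 100}
print(segment("howmanynewpatient", freq))
# ['how', 'many', 'new', 'patient']

freq["newpatient"] = 4000  # add the keyword and raise its frequency
print(segment("howmanynewpatient", freq))
# ['how', 'many', 'newpatient']
```

Before "newpatient" is in the dictionary, the best-scoring path splits it into "new" + "patient"; adding the keyword with a sufficiently high frequency makes the combined token win, which is the frequency adjustment the example above calls for.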
Step S102, inputting the conversation keywords and the conversation sequence into the BERT model to obtain word vectors and conversation text vectors of the conversation keywords.
The word segmentation result and the conversation sequence are input into a BERT model, and the BERT model is used for extracting word vectors and conversation text vectors of conversation keywords.
First, the BERT model vectorizes the extracted keywords:
K = {k_1, k_2, …, k_m}
where k_i is a keyword extracted from the session, v_{k_i} ∈ R^d is its keyword word vector, and d is the vector dimension.
Next, session vectorization is also realized with the BERT model:
X = {x_1, x_2, …, x_n}
where x_i is the i-th session, v_{x_i} ∈ R^d is its session vector, and d is the vector dimension. BERT obtains the feature vector H of a text x using the Transformer encoding structure, H = BERT_Transformer(x); that is, the BERT model obtains a contextual representation of the text through the Transformer structure.
The Transformer encoding structure consists of a multi-head attention mechanism and a feed-forward fully connected network:
(1) Multi-head attention mechanism:
MultiHead(Q, K, V) = Concat(head_1, head_2, …, head_h) W^O
head_i = Attention(Q W_i^Q, K W_i^K, V W_i^V)
where W_i^Q, W_i^K, W_i^V are the mapping matrices of the i-th attention head, and Q, K, V are the Query, Key, and Value vectors.
(2) Feed-forward network:
FFN(x) = max(0, x W_1 + b_1) W_2 + b_2
where W^O, W_1, W_2, b_1, b_2 are the parameters of the corresponding linear layers.
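The two sub-layers quoted above can be sketched compactly in numpy: scaled dot-product multi-head attention followed by the position-wise feed-forward network FFN(x) = max(0, xW1 + b1)W2 + b2. The dimensions and random weights are illustrative stand-ins, not the configuration of the BERT model used in the invention.

```python
import numpy as np

rng = np.random.default_rng(0)
d_model, n_heads, d_ff, seq = 8, 2, 16, 5
d_k = d_model // n_heads  # per-head key/query dimension

def softmax(z):
    z = z - z.max(axis=-1, keepdims=True)  # numerically stable
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def multi_head_attention(x, Wq, Wk, Wv, Wo):
    heads = []
    for h in range(n_heads):
        Q, K, V = x @ Wq[h], x @ Wk[h], x @ Wv[h]
        A = softmax(Q @ K.T / np.sqrt(d_k))  # attention weights, rows sum to 1
        heads.append(A @ V)
    return np.concatenate(heads, axis=-1) @ Wo  # Concat(head_1..head_h) W^O

def ffn(x, W1, b1, W2, b2):
    return np.maximum(0, x @ W1 + b1) @ W2 + b2  # ReLU then linear

x = rng.normal(size=(seq, d_model))
Wq, Wk, Wv = (rng.normal(size=(n_heads, d_model, d_k)) for _ in range(3))
Wo = rng.normal(size=(d_model, d_model))
out = multi_head_attention(x, Wq, Wk, Wv, Wo)
out = ffn(out, rng.normal(size=(d_model, d_ff)), np.zeros(d_ff),
          rng.normal(size=(d_ff, d_model)), np.zeros(d_model))
print(out.shape)  # (5, 8): sequence length and model dimension are preserved
```

Both sub-layers map a (sequence length, d_model) input to an output of the same shape, which is what allows the residual-and-normalization modules of the encoder layers to add input and output together.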
And step S103, constructing a document according to the word vector of the conversation keyword and the conversation text vector, extracting effective information of the constructed document by using a Doc2vec model, and generating a document vector related to both the keyword and the conversation context.
Specifically, a document is constructed according to the keyword word vector and the session text vector obtained in step S102, and then effective information is extracted from the constructed document by using the Doc2vec model:
word_doc_ij = (v_{k_ij}, v_{x_i})
where word_doc_ij is the constructed document, v_{k_ij} is the j-th keyword vector extracted from the i-th session, and v_{x_i} is the i-th session vector;
d_ij = Doc2vec(word_doc_ij)
where i indexes the sessions, i ∈ {1, 2, 3, …, n}, j indexes the keywords, j ∈ {1, 2, 3, …, m}, and d_ij denotes the Doc2vec vector generated for the j-th keyword in the i-th session. Each word_doc_ij combination is input independently to the Doc2Vec model, generating a vector related to both the keyword and the session context.
The above-described method is particularly effective in extracting document-specific representations of terms, as it enables understanding the contextual importance of keywords in a document.
In summary, the combined BERT-Doc2Vec model realizes end-to-end semantic encoding, losslessly restoring context information. The BERT model preserves word sense with fidelity, the Doc2Vec model fuses word context relations, and the constructed word_doc document vector is both precise and abstract.
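The assembly of the word_doc documents can be sketched as follows. Only the pairing of keyword vectors with their session vector is taken from the text; the BERT outputs are random stand-ins, and the mean used as `doc_vector` is a hypothetical placeholder for the trained Doc2Vec model (in practice a library such as gensim would learn d_ij by gradient training over the token documents).

```python
import numpy as np

d = 4  # illustrative vector dimension
rng = np.random.default_rng(1)
# Hypothetical BERT outputs of step S102: per-session keyword word vectors
# and one session text vector per session.
sessions = [
    {"keywords": rng.normal(size=(2, d)), "text": rng.normal(size=d)},
    {"keywords": rng.normal(size=(3, d)), "text": rng.normal(size=d)},
]

def build_word_docs(sessions):
    """One word_doc per (session i, keyword j) pair, as in step S103:
    the keyword vector stacked with its session vector."""
    docs = []
    for i, s in enumerate(sessions):
        for j, kv in enumerate(s["keywords"]):
            docs.append(((i, j), np.stack([kv, s["text"]])))
    return docs

def doc_vector(doc):
    # Placeholder for Doc2vec(word_doc_ij): here simply the mean of the pair.
    return doc.mean(axis=0)

word_docs = build_word_docs(sessions)
vectors = {idx: doc_vector(doc) for idx, doc in word_docs}
print(len(vectors))  # 5 documents: 2 + 3 keyword/session pairs
```

Each resulting vector depends on both its keyword and its session context, mirroring the independence of the word_doc_ij inputs described above.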
Step S104, inputting the document vector into an intention recognition model to obtain an output session recognition result.
Specifically, the word_doc document vector constructed in step S103 is input to the intention recognition model 100.
See fig. 2. The intent recognition model 100 includes an encoding network 101, a first textCNN-LSTM network layer 102, a decoding network 103, and a recognition network 104, where the first textCNN-LSTM network layer 102 consists of a textCNN model and an LSTM model connected in sequence. The intent recognition model 100 of the invention introduces a textCNN text convolution network and an LSTM network, can model text semantic features at multiple scales, and can extract local and global semantic information from the text simultaneously. Meanwhile, an LSTM layer introduced at the encoding-decoding end can learn long-range dependency information in the text and strengthen context modeling.
Referring to fig. 3, the encoding network 101 includes a first Transformer encoder network layer 1011, a second textCNN-LSTM network layer 1012, and a second Transformer encoder network layer 1013 connected in sequence; the first Transformer encoder network layer 1011 receives the input word_doc document vector and is simultaneously skip-connected to the second Transformer encoder network layer 1013. The second textCNN-LSTM network layer 1012 consists of a textCNN model and an LSTM model connected in sequence. Through this multi-layer, overlapping, progressive encoding structure (i.e., an encoder stacked more than two layers deep), the encoding network of the invention progressively abstracts and optimizes text semantics, forming an end-to-end semantic abstraction pipeline that deeply optimizes text semantics.
Referring to fig. 3, the first and second Transformer encoder network layers 1011 and 1013 each include a first self-attention module 1014, a first residual-and-normalization module 1015, a first feed-forward network 1016, and a second residual-and-normalization module 1017, connected in sequence. The output of the second residual-and-normalization module 1017 of the first Transformer encoder network layer 1011 is skip-connected to the first self-attention module 1014 of the second Transformer encoder network layer 1013.
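The residual-and-normalization modules above follow the standard pattern output = LayerNorm(x + Sublayer(x)). A minimal numpy sketch, with a toy stand-in for the self-attention or feed-forward sublayer:

```python
import numpy as np

# Sketch of the residual-and-normalization pattern in each Transformer
# encoder layer: output = LayerNorm(x + Sublayer(x)). The sublayer here is
# a toy stand-in; real layers use self-attention or a feed-forward network.

def layer_norm(x, eps=1e-5):
    mu = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return (x - mu) / np.sqrt(var + eps)

def residual_norm(x, sublayer):
    return layer_norm(x + sublayer(x))

x = np.random.rand(4, 8)                   # (sequence length, model dim)
out = residual_norm(x, lambda t: 0.5 * t)  # toy sublayer
print(out.shape)  # (4, 8)
```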
Further, the second textCNN-LSTM network layer 1012 between the first and second Transformer encoder network layers 1011 and 1013 is used to extract semantic information and context dependencies.
(1) TextCNN network module in second textCNN-LSTM network layer 1012:
wherein E is the encoder output (EncoderOutput):
C=CNN(E)
Where C is the output of the CNN layer, and CNN denotes a set of convolution operations including convolution, a nonlinear activation function, and pooling.
(2) LSTM network module in second textCNN-LSTM network layer 1012:
L, (h_n, c_n) = LSTM(C)
where L is the output of the LSTM layer, (h_n, c_n) are the final hidden state and cell state of the LSTM layer, and LSTM denotes the recurrent operation of the LSTM network module.
The specific implementation is expressed by the following formula:
(1) Part textCNN:
C_k = pooling(ReLU(E * K_k + b_k))
where "*" is the convolution operation, K_k is the k-th convolution kernel, b_k is the bias, pooling is the pooling operation, and ReLU is the activation function.
C = [C_1; C_2; …; C_K]
where C represents the concatenation of the outputs of the multiple convolutions and ";" denotes the concatenation operation.
(2) LSTM section:
(h_t, c_t) = LSTM(C_t, h_{t-1}, c_{t-1})
where C_t is the input at time step t, and (h_t, c_t) are the hidden state and cell state of the LSTM layer at time step t.
The final LSTM output is:
L = [h_1, h_2, …, h_T]
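The textCNN and LSTM formulas above can be sketched in numpy. The patent leaves some details open (pooling window size, how the pooled features are re-sequenced for the LSTM), so this is one plausible reading under those assumptions: each kernel K_k slides over the encoder output E, ReLU and local max-pooling produce a feature sequence, and the concatenated sequence C is fed step by step to an LSTM. All shapes and the gate ordering are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)

def text_cnn(E, kernels, biases, pool=2):
    # C_k = pooling(ReLU(E * K_k + b_k)) per kernel, then concatenate.
    # Local max-pooling (window `pool`, stride 1) keeps a sequence for the LSTM.
    T, d = E.shape
    feats = []
    for K, b in zip(kernels, biases):
        w = K.shape[0]                                   # kernel width
        conv = np.array([np.sum(E[t:t + w] * K) + b for t in range(T - w + 1)])
        act = np.maximum(conv, 0.0)                      # ReLU
        pooled = np.array([act[i:i + pool].max() for i in range(len(act) - pool + 1)])
        feats.append(pooled)
    T2 = min(len(f) for f in feats)                      # align lengths across kernels
    return np.stack([f[:T2] for f in feats], axis=1)     # (T', num_kernels)

def lstm(C, Wx, Wh, b):
    # (h_t, c_t) = LSTM(C_t, h_{t-1}, c_{t-1}); gate order i, f, o, g (assumed)
    H = Wh.shape[0]
    h, c, outs = np.zeros(H), np.zeros(H), []
    for x in C:
        z = Wx.T @ x + Wh.T @ h + b
        i = 1.0 / (1.0 + np.exp(-z[:H]))
        f = 1.0 / (1.0 + np.exp(-z[H:2 * H]))
        o = 1.0 / (1.0 + np.exp(-z[2 * H:3 * H]))
        g = np.tanh(z[3 * H:])
        c = f * c + i * g
        h = o * np.tanh(c)
        outs.append(h)
    return np.stack(outs), (h, c)                        # L = [h_1, ..., h_T'], final state

E = rng.standard_normal((10, 16))                        # encoder output (toy shape)
kernels = [rng.standard_normal((w, 16)) for w in (2, 3)] # two kernel widths
C = text_cnn(E, kernels, [0.0, 0.0])
H = 8
Wx = rng.standard_normal((C.shape[1], 4 * H))
Wh = rng.standard_normal((H, 4 * H))
L_out, (h_n, c_n) = lstm(C, Wx, Wh, np.zeros(4 * H))
print(L_out.shape, h_n.shape)  # (7, 8) (8,)
```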
Referring to fig. 3, the decoding network 103 includes a first Transformer decoder network layer 1031, a third textCNN-LSTM network layer 1032, and a second Transformer decoder network layer 1033 connected in sequence, where the first Transformer decoder network layer 1031 is also skip-connected to the second Transformer decoder network layer 1033. The third textCNN-LSTM network layer 1032 is composed of a textCNN model and an LSTM model connected in sequence. Through this progressive, multi-level stacked decoding structure (the decoder is a stack of two or more layers), the decoding network of the invention progressively abstracts and optimizes the text semantics, forming an end-to-end semantic abstraction so that the text semantics can be deeply optimized.
Specifically, the first Transformer decoder network layer 1031 is connected to the third textCNN-LSTM network layer 1032 for semantic decoding, expressed by the following formulas:
(1) textCNN layer, where D is the decoder output:
C=CNN(D)
Where C is the output of the CNN layer, and CNN denotes a set of convolution operations including convolution, a nonlinear activation function, and pooling.
(2) LSTM layer:
L, (h_n, c_n) = LSTM(C)
where L is the output of the LSTM layer, (h_n, c_n) are the final hidden state and cell state of the LSTM layer, and LSTM denotes the recurrent operation of the LSTM layer.
Referring to fig. 3, the second Transformer encoder network layer 1013 in the encoding network 101 is directly skip-connected to the first Transformer decoder network layer 1031 in the decoding network 103, while the second Transformer encoder network layer 1013 is also connected to the first Transformer decoder network layer 1031 through the first textCNN-LSTM network layer 102.
Referring to fig. 3, the first and second Transformer decoder network layers 1031 and 1033 each include a second self-attention module 1034, a third residual-and-normalization module 1035, a third self-attention module 1036, a fourth residual-and-normalization module 1037, a second feed-forward network 1038, and a fifth residual-and-normalization module 1039, connected in sequence.
Referring to fig. 3, the output of the second residual-and-normalization module 1017 of the second Transformer encoder network layer 1013 is skip-connected to the third self-attention module 1036 of the first Transformer decoder network layer 1031.
Referring to fig. 3, the output of the fifth residual-and-normalization module 1039 of the first Transformer decoder network layer 1031 is skip-connected to the third self-attention module 1036 of the second Transformer decoder network layer 1033.
Further, referring to fig. 3, the recognition network 104 includes a Softmax classification layer 1041. The output of the second Transformer decoder network layer 1033 in the decoding network 103 is connected to the Softmax classification layer 1041, and the Softmax classification layer 1041 outputs the final recognition result.
See in particular the following formula:
I_H = softmax(D'_H)
where D'_H is the input to the classification layer and I_H is the predicted intent.
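A minimal sketch of the classification step I_H = softmax(D'_H): the decoder output is mapped to a probability distribution over intent classes. The three-class logit vector is hypothetical.

```python
import numpy as np

# Sketch of the recognition layer: I_H = softmax(D'_H).
def softmax(z):
    z = z - z.max()  # subtract max for numerical stability
    e = np.exp(z)
    return e / e.sum()

d_h = np.array([2.0, 0.5, -1.0])  # hypothetical decoder logits for 3 intents
i_h = softmax(d_h)
print(i_h.argmax())  # index of the predicted intent -> 0
```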
In conclusion, the intent recognition model 100 of the invention introduces a textCNN text convolution network and an LSTM network, so it can model text semantic features at multiple scales and simultaneously extract local and global semantic information from the text. LSTM layers introduced at the encoding and decoding ends learn long-distance dependency information of the text and strengthen the context modeling capability. Through the progressive, multi-layer stacked encoding-decoding structure (the encoder and decoder are each stacks of two or more layers), the text semantics are progressively abstracted and optimized, forming an end-to-end semantic extraction pipeline in which the text semantics are deeply optimized. By modeling text semantics from multiple "viewpoints", that is, by integrating several network structures such as textCNN and LSTM, the model transmits semantic abstraction and refinement information, and ultimately perceives text semantics more sharply and accurately. Compared with a single-structure model, the progressive network structure of the proposed intent recognition model can learn semantic features in depth and improve the text representation capability.
The model is trained as described above until it reaches the deployment standard, at which point training stops and the trained model is obtained. The trained model is then loaded and evaluated on the test set TestSet.
The invention takes the operation-labeled semantic test set TestSet as an example and uses classification accuracy as the evaluation index, defined as the ratio of the number of samples the classifier classifies correctly to the total number of samples in a given test data set. The larger the classification accuracy, the better the model; the lower the value, the worse the model. See the following table:
Sequence number | Method | Accuracy
---|---|---
1 | Semantic recognition based on a Transformer model | 84.91%
2 | Semantic recognition using the intent recognition model of the invention | 95.90%
From the comparison in the table, on the operation-labeled semantic dataset TestSet, the semantic recognition algorithm of the invention (sequence number 2) improves accuracy over the Transformer-based semantic recognition of sequence number 1 by 95.90% - 84.91% = 10.99 percentage points. Therefore, under the same conditions, the multi-feature-fusion semantic recognition algorithm proposed by the invention achieves higher precision and more accurate intent recognition.
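The accuracy metric used above is simply the ratio of correct classifications to total samples; a toy sketch (the labels are illustrative, not the patent's TestSet data):

```python
# Classification accuracy: correctly classified samples / total samples.
def accuracy(predicted, actual):
    correct = sum(p == a for p, a in zip(predicted, actual))
    return correct / len(actual)

# Hypothetical intent labels for four conversation samples.
preds = ["greet", "query", "query", "cancel"]
gold  = ["greet", "query", "cancel", "cancel"]
print(accuracy(preds, gold))  # 0.75
```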
Referring to fig. 4, another embodiment of the present invention further provides a session semantic recognition apparatus 200, including a keyword acquisition module 201, a vectorization module 202, a document vector construction module 203, and a session recognition module 204, where the session semantic recognition apparatus 200 is capable of performing the session semantic recognition method of the method embodiment.
Specifically, the session semantic recognition device 200 includes:
A keyword acquisition module 201 configured to convert the acquired session audio data into session text data from which a session keyword is identified;
A vectorization module 202 configured to input a conversation keyword and a conversation sequence to the BERT model, to obtain a word vector and a conversation text vector of the conversation keyword;
A document vector construction module 203 configured to construct a document according to the word vectors of the session keywords and the session text vector, extract the effective information of the constructed document by using the Doc2Vec model, and generate a document vector related to both the keywords and the session context;
The session recognition module 204 is configured to input the document vector into an intention recognition model to obtain an output session recognition result.
Further, the intent recognition model includes an encoding network, a first textCNN-LSTM network layer, a decoding network, and a recognition network;
The coding network comprises a first converter encoder network layer, a second textCNN-LSTM network layer and a second converter encoder network layer which are sequentially connected, wherein the first converter encoder network layer receives an input document vector and is connected to the second converter encoder network layer in a jumping mode at the same time;
The decoding network comprises a first converter decoder network layer, a third textCNN-LSTM network layer and a second converter decoder network layer which are sequentially connected, wherein the first converter decoder network layer is also connected to the second converter decoder network layer in a jumping manner;
A second Transformer encoder network layer in the encoding network is directly skip-connected to a first Transformer decoder network layer in the decoding network, while the second Transformer encoder network layer is also connected to the first Transformer decoder network layer through the first textCNN-LSTM network layer;
Wherein the first textCNN-LSTM network layer, the second textCNN-LSTM network layer and the third textCNN-LSTM network layer all comprise textCNN modules and LSTM modules which are connected in sequence;
The recognition network includes a Softmax classification layer; an output of the second Transformer decoder network layer in the decoding network is connected to the Softmax classification layer, and the Softmax classification layer outputs the final recognition result.
Further, the keyword obtaining module 201 is further configured to:
Establishing word libraries according to the session text data, where the word libraries include a keyword library, a stop-word library, a proper-noun library, and a forbidden-word library;
and extracting the effective information of the conversation text with a word segmentation tool according to the word libraries, to obtain a word segmentation result.
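A toy sketch of word-library-driven extraction under stated assumptions: a longest-match pass over a keyword and proper-noun lexicon, with stop-word and forbidden-word filtering. A real implementation would use a segmentation tool (for Chinese text, e.g. jieba with custom dictionaries); all lexicon contents here are hypothetical.

```python
# Hypothetical word libraries; real systems build these from the session data.
KEYWORDS = {"refund", "order", "delivery time"}   # keyword + proper-noun lexicon
STOP_WORDS = {"the", "a", "is", "my", "what", "i", "want", "for"}
FORBIDDEN = {"spamword"}

def extract_keywords(text):
    tokens = text.lower().replace("?", "").split()
    kept, i = [], 0
    while i < len(tokens):
        # Longest match first: try the two-token lexicon entry before the single token.
        pair = " ".join(tokens[i:i + 2])
        if pair in KEYWORDS:
            kept.append(pair)
            i += 2
            continue
        tok = tokens[i]
        if tok in KEYWORDS and tok not in STOP_WORDS and tok not in FORBIDDEN:
            kept.append(tok)
        i += 1
    return kept

print(extract_keywords("What is the delivery time for my order?"))
# ['delivery time', 'order']
```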
Further, the first Transformer encoder network layer and the second Transformer encoder network layer each include a first self-attention module, a first residual-and-normalization module, a first feed-forward network, and a second residual-and-normalization module, connected in sequence;
the output of the second residual-and-normalization module of the first Transformer encoder network layer is skip-connected to the first self-attention module of the second Transformer encoder network layer;
the first and second Transformer decoder network layers each include a second self-attention module, a third residual-and-normalization module, a third self-attention module, a fourth residual-and-normalization module, a second feed-forward network, and a fifth residual-and-normalization module, connected in sequence;
the output of the second residual-and-normalization module of the second Transformer encoder network layer is skip-connected to the third self-attention module of the first Transformer decoder network layer;
the output of the fifth residual-and-normalization module of the first Transformer decoder network layer is skip-connected to the third self-attention module of the second Transformer decoder network layer.
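The four device modules can be viewed as a linear pipeline from session audio to recognized intent. A minimal sketch with stub callables standing in for the real modules (the class name and all stub behaviors are hypothetical):

```python
# Illustrative wiring of the four device modules as a simple pipeline.
# Each stage is a stand-in callable; the real modules are the components
# described in the embodiment above.
class SessionSemanticDevice:
    def __init__(self, get_keywords, vectorize, build_doc_vec, recognize):
        self.stages = [get_keywords, vectorize, build_doc_vec, recognize]

    def run(self, session_audio):
        data = session_audio
        for stage in self.stages:      # 201 -> 202 -> 203 -> 204
            data = stage(data)
        return data

device = SessionSemanticDevice(
    get_keywords=lambda audio: ["refund"],            # module 201 (stub)
    vectorize=lambda kws: {"kw": kws, "vec": [0.1]},  # module 202 (stub)
    build_doc_vec=lambda v: v["vec"],                 # module 203 (stub)
    recognize=lambda doc: "refund_intent",            # module 204 (stub)
)
print(device.run(b"fake-audio"))  # refund_intent
```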
It should be noted that the session semantic recognition device 200 provided in this embodiment may be used to execute the technical solutions of the method embodiments; its implementation principle and technical effect are similar to those of the method and are not repeated here.
The foregoing description covers only the preferred embodiments of the invention. Persons skilled in the art will appreciate that the scope of the disclosure is not limited to the specific combinations of technical features described above, but also covers other technical solutions formed by any combination of the above technical features or their equivalents without departing from the spirit of the disclosure, for example, solutions in which the above features are replaced with technical features of similar function disclosed in (but not limited to) the present invention.
Claims (8)
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202410315061.0A CN118211593B (en) | 2024-03-19 | 2024-03-19 | Session semantic recognition method and device |
Publications (2)
Publication Number | Publication Date |
---|---|
CN118211593A CN118211593A (en) | 2024-06-18 |
CN118211593B true CN118211593B (en) | 2024-12-03 |
Family
ID=91453821
Legal Events
Date | Code | Title | Description |
---|---|---|---|
| PB01 | Publication | |
| SE01 | Entry into force of request for substantive examination | |
| GR01 | Patent grant | |