CN118211593B - Session semantic recognition method and device - Google Patents
- Publication number
- CN118211593B CN118211593B CN202410315061.0A CN202410315061A CN118211593B CN 118211593 B CN118211593 B CN 118211593B CN 202410315061 A CN202410315061 A CN 202410315061A CN 118211593 B CN118211593 B CN 118211593B
- Authority
- CN
- China
- Prior art keywords
- network layer
- conversation
- transformer
- word
- network
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Active
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F40/00—Handling natural language data
- G06F40/30—Semantic analysis
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/30—Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
- G06F16/33—Querying
- G06F16/332—Query formulation
- G06F16/3329—Natural language query formulation
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/30—Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
- G06F16/35—Clustering; Classification
- G06F16/353—Clustering; Classification into predefined classes
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F40/00—Handling natural language data
- G06F40/20—Natural language analysis
- G06F40/205—Parsing
- G06F40/211—Syntactic parsing, e.g. based on context-free grammar [CFG] or unification grammars
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/04—Architecture, e.g. interconnection topology
- G06N3/044—Recurrent networks, e.g. Hopfield networks
- G06N3/0442—Recurrent networks, e.g. Hopfield networks characterised by memory or gating, e.g. long short-term memory [LSTM] or gated recurrent units [GRU]
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/04—Architecture, e.g. interconnection topology
- G06N3/045—Combinations of networks
- G06N3/0455—Auto-encoder networks; Encoder-decoder networks
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/08—Learning methods
Abstract
The invention discloses a session semantic recognition method and device, belonging to the technical field of semantic recognition. The method comprises: converting acquired session audio data into session text data and identifying session keywords from the session text data; inputting the session keywords and the session sequence into a BERT model to obtain word vectors of the session keywords and a session text vector; constructing a document from the keyword word vectors and the session text vector, and extracting effective information from the constructed document with a Doc2vec model to generate a document vector related to both the keywords and the session context; and inputting the document vector into an intention recognition model to obtain the output session recognition result. The method and device perceive text semantics more sharply, learn semantic features at depth, and improve the accuracy of session semantic understanding.
Description
Technical Field
The invention relates to the technical field of semantic recognition, in particular to a method and a device for recognizing session semantics.
Background
In medical conversational speech recognition scenarios, the user's spoken responses must be analyzed. The approach currently adopted is keyword matching, and existing keyword-matching techniques have the following defects:
(1) No keyword lexicon is established, and relying entirely on keyword matching loses the contextual relations of the conversation;
(2) No intention recognition model is established for the session, so considerable manpower is needed to listen to recordings repeatedly in order to judge customer intent, which is very inefficient.
Therefore, a set of models based on session context needs to be established to identify the intent of customer answers, solving the technical problem that sales personnel cannot accurately grasp customer intent and improving sales accuracy. In principle, prior-art machine learning methods, deep learning models, or attention-based models could serve as the intention recognition model, but these models generalize poorly and cannot adapt well to new situations, fail to accurately understand the complex sentences and fine distinctions of medical text, tend to lose information when processing long input sequences, or depend on large amounts of high-quality labeled data.
Disclosure of Invention
In view of the above drawbacks and shortcomings of the prior art, the present invention provides a session semantic recognition method and apparatus that solve the above technical problems in whole or in part. In one aspect, the invention realizes end-to-end semantic encoding through the combined BERT-Doc2Vec model, restoring context information losslessly to obtain a document vector related to both the keywords and the session context: the BERT model preserves word sense, the Doc2Vec model fuses the contextual relations between words, and the resulting document vector is both precise and abstract. In another aspect, the document vector is input into an original intention recognition model that introduces textCNN and LSTM networks, enabling multi-scale modeling of text semantic features and simultaneous extraction of local and global semantic information. In addition, an LSTM network introduced at the encoding-decoding end learns long-range dependency information in the text and strengthens context modeling, while a multi-layer, overlapping, progressive encoding-decoding structure forms an end-to-end semantic abstraction pipeline that deeply optimizes text semantics. By modeling text semantics from multiple views and propagating semantic abstraction and refinement information, the model ultimately perceives text semantics more sharply and accurately. Compared with a single structure, the progressive encoding-decoding network structure learns deeper semantic features and improves text expressiveness.
In one aspect of the present invention, a method for identifying session semantics is provided, including:
converting the acquired session audio data into session text data, and identifying session keywords from the session text data;
inputting the session keywords and the session sequence into a BERT model to obtain word vectors of the session keywords and a session text vector;
constructing a document from the word vectors of the session keywords and the session text vector, extracting effective information from the constructed document using a Doc2vec model, and generating a document vector related to both the keywords and the session context; and
inputting the document vector into an intention recognition model to obtain an output session recognition result.
Further, the intent recognition model includes an encoding network, a first textCNN-LSTM network layer, a decoding network, and a recognition network;
The encoding network comprises a first Transformer encoder network layer, a second textCNN-LSTM network layer, and a second Transformer encoder network layer connected in sequence, wherein the first Transformer encoder network layer receives the input document vector and is simultaneously skip-connected to the second Transformer encoder network layer;
the decoding network comprises a first Transformer decoder network layer, a third textCNN-LSTM network layer, and a second Transformer decoder network layer connected in sequence, wherein the first Transformer decoder network layer is also skip-connected to the second Transformer decoder network layer;
the second Transformer encoder network layer in the encoding network is directly skip-connected to the first Transformer decoder network layer in the decoding network, while also being connected to the first Transformer decoder network layer through the first textCNN-LSTM network layer;
wherein the first, second, and third textCNN-LSTM network layers each comprise a textCNN module and an LSTM module connected in sequence;
the recognition network comprises a Softmax classification layer; the output of the second Transformer decoder network layer in the decoding network is connected to the Softmax classification layer, and the Softmax classification layer outputs the final recognition result.
Further, the step of identifying the session keyword from the session text data further includes:
establishing lexicons from the session text data, the lexicons comprising a keyword lexicon, a stop-word lexicon, a proper-noun lexicon, and a forbidden-word lexicon; and
extracting effective information from the session text with a word segmentation tool according to the lexicons to obtain a word segmentation result.
Further, the first and second Transformer encoder network layers each comprise a first self-attention module, a first residual-and-normalization module, a first feed-forward network, and a second residual-and-normalization module connected in sequence;
the output of the second residual-and-normalization module of the first Transformer encoder network layer is skip-connected to the first self-attention module of the second Transformer encoder network layer.
Further, the first and second Transformer decoder network layers each comprise a second self-attention module, a third residual-and-normalization module, a third self-attention module, a fourth residual-and-normalization module, a second feed-forward network, and a fifth residual-and-normalization module connected in sequence;
the output of the second residual-and-normalization module of the second Transformer encoder network layer is skip-connected to the third self-attention module of the first Transformer decoder network layer.
Further, the output of the fifth residual-and-normalization module of the first Transformer decoder network layer is skip-connected to the third self-attention module of the second Transformer decoder network layer.
In another aspect of the present invention, there is provided a session semantic recognition apparatus, including:
the keyword acquisition module is configured to convert the acquired session audio data into session text data, and identify session keywords from the session text data;
The vectorization module is configured to input the conversation keywords and the conversation sequence into the BERT model to obtain word vectors and conversation text vectors of the conversation keywords;
The document vector construction module is configured to construct a document according to the word vector of the conversation keyword and the conversation text vector, extract effective information of the constructed document by using a Doc2vec model and generate a document vector related to both the keyword and the conversation context;
And the session identification module is configured to input the document vector into the intention identification model to obtain an output session identification result.
Further, the intent recognition model includes an encoding network, a first textCNN-LSTM network layer, a decoding network, and a recognition network;
The encoding network comprises a first Transformer encoder network layer, a second textCNN-LSTM network layer, and a second Transformer encoder network layer connected in sequence, wherein the first Transformer encoder network layer receives the input document vector and is simultaneously skip-connected to the second Transformer encoder network layer;
the decoding network comprises a first Transformer decoder network layer, a third textCNN-LSTM network layer, and a second Transformer decoder network layer connected in sequence, wherein the first Transformer decoder network layer is also skip-connected to the second Transformer decoder network layer;
the second Transformer encoder network layer in the encoding network is directly skip-connected to the first Transformer decoder network layer in the decoding network, while also being connected to the first Transformer decoder network layer through the first textCNN-LSTM network layer;
wherein the first, second, and third textCNN-LSTM network layers each comprise a textCNN module and an LSTM module connected in sequence;
the recognition network comprises a Softmax classification layer; the output of the second Transformer decoder network layer in the decoding network is connected to the Softmax classification layer, and the Softmax classification layer outputs the final recognition result.
Further, the keyword acquisition module is further configured to:
establishing lexicons from the session text data, the lexicons comprising a keyword lexicon, a stop-word lexicon, a proper-noun lexicon, and a forbidden-word lexicon; and
extracting effective information from the session text with a word segmentation tool according to the lexicons to obtain a word segmentation result.
Further, the first and second Transformer encoder network layers each comprise a first self-attention module, a first residual-and-normalization module, a first feed-forward network, and a second residual-and-normalization module connected in sequence;
the output of the second residual-and-normalization module of the first Transformer encoder network layer is skip-connected to the first self-attention module of the second Transformer encoder network layer;
the first and second Transformer decoder network layers each comprise a second self-attention module, a third residual-and-normalization module, a third self-attention module, a fourth residual-and-normalization module, a second feed-forward network, and a fifth residual-and-normalization module connected in sequence;
the output of the second residual-and-normalization module of the second Transformer encoder network layer is skip-connected to the third self-attention module of the first Transformer decoder network layer;
the output of the fifth residual-and-normalization module of the first Transformer decoder network layer is skip-connected to the third self-attention module of the second Transformer decoder network layer.
The method and the device for identifying the session semantics have the following beneficial effects:
(1) End-to-end semantic encoding is realized through the combined BERT-Doc2Vec model, losslessly restoring context information to obtain a document vector related to both the keywords and the session context. The BERT model preserves word sense, the Doc2Vec model fuses word context, and the document vector is both precise and abstract.
(2) The original intention recognition model introduces textCNN and LSTM networks, enabling multi-scale modeling of text semantic features and simultaneous extraction of local and global semantic information from the text. An LSTM network introduced at the encoding-decoding end learns long-range dependency information in the text and strengthens context modeling. A multi-layer, overlapping, progressive encoding-decoding structure forms an end-to-end semantic abstraction pipeline, so that text semantics are deeply optimized: modeling text semantics from multiple views and propagating semantic abstraction and refinement information yields sharper, more accurate perception of text semantics.
Drawings
Other features, objects and advantages of the present application will become more apparent upon reading of the detailed description of non-limiting embodiments, made with reference to the accompanying drawings in which:
FIG. 1 is a flow chart of a method for identifying session semantics provided by one embodiment of the present application;
FIG. 2 is a schematic diagram of the structure of an intent recognition model provided by an embodiment of the present application;
FIG. 3 is a schematic diagram of the structure of an encoding network and decoding network provided by one embodiment of the present application;
Fig. 4 is a schematic structural diagram of a session semantic recognition device according to an embodiment of the present application.
Detailed Description
For the purpose of making the objects, technical solutions and advantages of the embodiments of the present invention more apparent, the technical solutions of the embodiments of the present invention will be clearly and completely described below with reference to the accompanying drawings in the embodiments of the present invention, and it is apparent that the described embodiments are some embodiments of the present invention, but not all embodiments of the present invention. All other embodiments, which can be made by those skilled in the art based on the embodiments of the invention without making any inventive effort, are intended to be within the scope of the invention.
The terminology used in the embodiments of the invention is for the purpose of describing particular embodiments only and is not intended to be limiting of the invention. As used in this application and the appended claims, the singular forms "a," "an," and "the" are intended to include the plural forms as well, unless the context clearly indicates otherwise.
It should be understood that although the terms first, second, third, etc. may be used in embodiments of the present invention to describe the acquisition modules, these acquisition modules should not be limited to these terms. These terms are only used to distinguish the acquisition modules from each other.
The term "if" as used herein may be interpreted as "when", "upon", "in response to determining", or "in response to detecting", depending on the context. Similarly, the phrases "if it is determined" or "if (a stated condition or event) is detected" may be interpreted as "when it is determined", "in response to determining", "when (the stated condition or event) is detected", or "in response to detecting (the stated condition or event)", depending on the context.
It should be noted that, the terms "upper", "lower", "left", "right", and the like in the embodiments of the present invention are described in terms of the angles shown in the drawings, and should not be construed as limiting the embodiments of the present invention. In addition, in the context, it will also be understood that when an element is referred to as being formed "on" or "under" another element, it can be directly formed "on" or "under" the other element or be indirectly formed "on" or "under" the other element through intervening elements.
Referring to fig. 1, an embodiment of the present invention provides a session semantic recognition method, described below using an Internet-hospital telephone return-visit scenario as an example, comprising the following steps:
Step S101, converting the acquired session audio data into session text data, and identifying session keywords from the session text data.
First, Internet-hospital telephone return-visit data is acquired. The aim is to collect original voice-interaction text data as training and testing samples for the model, clean the text data, and construct the various lexicons required for model training.
Specifically, the telephone return-visit audio data is converted into session text data; a keyword lexicon, a stop-word lexicon, a proper-noun lexicon, and a forbidden-word lexicon are then established from the session text data; and effective text information is extracted with a word segmentation tool to obtain the session keywords.
Establishing the proper-noun lexicon. Medical proper nouns appear in Internet-hospital telephone return-visit sessions and may be segmented incorrectly by the basic lexicon, so a dedicated lexicon must be established to resolve them. Examples include "compensatory growth", "epiphyseal closure of the wrist", "follicle stimulating hormone", "luteinizing hormone", and the like.
Establishing the stop-word lexicon. On top of an existing stop-word lexicon, user-related and sensitive information that must be filtered for Internet-hospital telephone return visits is added, such as "phone", "age", "address", and the like.
Establishing the forbidden-word lexicon. On top of an existing forbidden-word lexicon, the special forbidden words required for Internet-hospital telephone return visits are added, such as "life", "lost", "gusty", "law", and the like.
The word segmentation tool can be based on the word frequencies of a keyword Trie-tree dictionary together with an HMM algorithm: a score is obtained between each word and its adjacent words, and text information is extracted along the DAG (directed acyclic graph) of candidate segmentations.
Keywords can be added according to the needs of the service scenario, but their word frequency must also be matched for segmentation to pick them up. For example:
the session is "so what we mainly look at now is really how many new patients we have";
the segmentation result is ['now', 'most', 'main', 'really', 'new', 'patient', ...];
the target result is "new patient" rather than "patient";
the analysis result for the token is "patient 3932 n", where 3932 is the word frequency and n is the part-of-speech attribute;
so the keyword "new patient" must be added and the word frequency of "new patient" adjusted.
The word segmentation model is expressed as follows: the input session sequence is X = {x_1, x_2, …, x_n}, and the segmentation result is y = {y_11, …, y_1m, y_21, …, y_2m, …, y_n1, …, y_nm}, where y_11 is the first word extracted from the first session. The HMM-based segmentation formula is:
p(y|x) = Π_i p(y_i | y_{i-1}, x_i)
The word frequency is set as follows: for a keyword w_i provided by the operator, i ∈ {1, 2, 3, …}, its word frequency in the keyword Trie-tree dictionary is assigned with a scaling factor b ∈ Z+, 35 by default, relative to the frequencies of the other words w_j, j ∈ {1, 2, …, i-1, i+1, …, n}; the segmentation result is then modified according to the HMM probability and the word frequency.
Word segmentation is applied to the text data of the call return-visit records:
k_i ∈ R
where k_i is the i-th segmentation result, i ∈ {1, 2, 3, …}, treated as a vector over the real domain R.
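The dictionary-plus-frequency mechanism described above (a DAG of candidate words scored by Trie-tree word frequencies, with the result steered by adjusting a keyword's frequency) can be sketched as a toy in pure Python. The dictionary, the frequencies, and the space-free input string are hypothetical stand-ins, not the patent's data; real segmenters such as jieba additionally run an HMM over out-of-vocabulary spans.

```python
import math

def segment(text, freq):
    """Pick the segmentation whose words maximize the sum of log
    frequencies (frequencies normalized by the dictionary total)."""
    total = sum(freq.values())
    n = len(text)
    maxlen = max(map(len, freq))
    # best[i]: (log-probability, words) of the best segmentation of text[:i]
    best = [(0.0, [])] + [(-math.inf, None)] * n
    for i in range(1, n + 1):
        for j in range(max(0, i - maxlen), i):
            word = text[j:i]
            # unknown single characters get a small floor frequency
            f = freq.get(word, 0.5 if len(word) == 1 else 0)
            if f:
                score = best[j][0] + math.log(f / total)
                if score > best[i][0]:
                    best[i] = (score, best[j][1] + [word])
    return best[n][1]

freq = {"new": 40, "patient": 3932, "how": 100, "many": 100}
print(segment("howmanynewpatient", freq))
# ['how', 'many', 'new', 'patient']

freq["newpatient"] = 4000  # add the keyword and raise its frequency
print(segment("howmanynewpatient", freq))
# ['how', 'many', 'newpatient']
```

Before "newpatient" is in the dictionary, the best-scoring path splits it into "new" + "patient"; adding the keyword with a sufficiently high frequency makes the combined token win, which is the frequency adjustment the example above calls for.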
Step S102, inputting the conversation keywords and the conversation sequence into the BERT model to obtain word vectors and conversation text vectors of the conversation keywords.
The word segmentation result and the conversation sequence are input into a BERT model, and the BERT model is used for extracting word vectors and conversation text vectors of conversation keywords.
First, the BERT model vectorizes the extracted keywords:
K = {k_1, k_2, …, k_m}
where k_i is a keyword extracted from the session, v_{k_i} ∈ R^d is its keyword word vector, and d is the vector dimension.
Next, session vectorization is also realized with the BERT model:
X = {x_1, x_2, …, x_n}
where x_i is the i-th session, v_{x_i} ∈ R^d is its session vector, and d is the vector dimension. BERT obtains the feature vector H of a text x using the Transformer encoding structure, H = BERT_Transformer(x); that is, the BERT model obtains a contextual representation of the text through the Transformer structure.
The Transformer encoding structure consists of a multi-head attention mechanism and a feed-forward fully connected network:
(1) Multi-head attention mechanism:
MultiHead(Q, K, V) = Concat(head_1, head_2, …, head_h) W^O
head_i = Attention(Q W_i^Q, K W_i^K, V W_i^V)
where W_i^Q, W_i^K, W_i^V are the mapping matrices of the i-th attention head, and Q, K, V are the Query, Key, and Value vectors.
(2) Feed-forward network:
FFN(x) = max(0, x W_1 + b_1) W_2 + b_2
where W^O, W_1, W_2, b_1, b_2 are the parameters of the corresponding linear layers.
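The two sub-layers quoted above can be sketched compactly in numpy: scaled dot-product multi-head attention followed by the position-wise feed-forward network FFN(x) = max(0, xW1 + b1)W2 + b2. The dimensions and random weights are illustrative stand-ins, not the configuration of the BERT model used in the invention.

```python
import numpy as np

rng = np.random.default_rng(0)
d_model, n_heads, d_ff, seq = 8, 2, 16, 5
d_k = d_model // n_heads  # per-head key/query dimension

def softmax(z):
    z = z - z.max(axis=-1, keepdims=True)  # numerically stable
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def multi_head_attention(x, Wq, Wk, Wv, Wo):
    heads = []
    for h in range(n_heads):
        Q, K, V = x @ Wq[h], x @ Wk[h], x @ Wv[h]
        A = softmax(Q @ K.T / np.sqrt(d_k))  # attention weights, rows sum to 1
        heads.append(A @ V)
    return np.concatenate(heads, axis=-1) @ Wo  # Concat(head_1..head_h) W^O

def ffn(x, W1, b1, W2, b2):
    return np.maximum(0, x @ W1 + b1) @ W2 + b2  # ReLU then linear

x = rng.normal(size=(seq, d_model))
Wq, Wk, Wv = (rng.normal(size=(n_heads, d_model, d_k)) for _ in range(3))
Wo = rng.normal(size=(d_model, d_model))
out = multi_head_attention(x, Wq, Wk, Wv, Wo)
out = ffn(out, rng.normal(size=(d_model, d_ff)), np.zeros(d_ff),
          rng.normal(size=(d_ff, d_model)), np.zeros(d_model))
print(out.shape)  # (5, 8): sequence length and model dimension are preserved
```

Both sub-layers map a (sequence length, d_model) input to an output of the same shape, which is what allows the residual-and-normalization modules of the encoder layers to add input and output together.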
And step S103, constructing a document according to the word vector of the conversation keyword and the conversation text vector, extracting effective information of the constructed document by using a Doc2vec model, and generating a document vector related to both the keyword and the conversation context.
Specifically, a document is constructed according to the keyword word vector and the session text vector obtained in step S102, and then effective information is extracted from the constructed document by using the Doc2vec model:
word_doc_ij = (v_{k_ij}, v_{x_i})
where word_doc_ij is the constructed document, v_{k_ij} is the j-th keyword vector extracted from the i-th session, and v_{x_i} is the i-th session vector;
d_ij = Doc2vec(word_doc_ij)
where i indexes the sessions, i ∈ {1, 2, 3, …, n}, j indexes the keywords, j ∈ {1, 2, 3, …, m}, and d_ij denotes the Doc2vec vector generated for the j-th keyword in the i-th session. Each word_doc_ij combination is input independently to the Doc2Vec model, generating a vector related to both the keyword and the session context.
The above-described method is particularly effective in extracting document-specific representations of terms, as it enables understanding the contextual importance of keywords in a document.
In summary, the combined BERT-Doc2Vec model realizes end-to-end semantic encoding, losslessly restoring context information. The BERT model preserves word sense with fidelity, the Doc2Vec model fuses word context relations, and the constructed word_doc document vector is both precise and abstract.
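The assembly of the word_doc documents can be sketched as follows. Only the pairing of keyword vectors with their session vector is taken from the text; the BERT outputs are random stand-ins, and the mean used as `doc_vector` is a hypothetical placeholder for the trained Doc2Vec model (in practice a library such as gensim would learn d_ij by gradient training over the token documents).

```python
import numpy as np

d = 4  # illustrative vector dimension
rng = np.random.default_rng(1)
# Hypothetical BERT outputs of step S102: per-session keyword word vectors
# and one session text vector per session.
sessions = [
    {"keywords": rng.normal(size=(2, d)), "text": rng.normal(size=d)},
    {"keywords": rng.normal(size=(3, d)), "text": rng.normal(size=d)},
]

def build_word_docs(sessions):
    """One word_doc per (session i, keyword j) pair, as in step S103:
    the keyword vector stacked with its session vector."""
    docs = []
    for i, s in enumerate(sessions):
        for j, kv in enumerate(s["keywords"]):
            docs.append(((i, j), np.stack([kv, s["text"]])))
    return docs

def doc_vector(doc):
    # Placeholder for Doc2vec(word_doc_ij): here simply the mean of the pair.
    return doc.mean(axis=0)

word_docs = build_word_docs(sessions)
vectors = {idx: doc_vector(doc) for idx, doc in word_docs}
print(len(vectors))  # 5 documents: 2 + 3 keyword/session pairs
```

Each resulting vector depends on both its keyword and its session context, mirroring the independence of the word_doc_ij inputs described above.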
Step S104, inputting the document vector into an intention recognition model to obtain an output session recognition result.
Specifically, the word_doc document vector constructed in step S103 is input to the intention recognition model 100.
See fig. 2. The intent recognition model 100 includes an encoding network 101, a first textCNN-LSTM network layer 102, a decoding network 103, and a recognition network 104, where the first textCNN-LSTM network layer 102 consists of a textCNN model and an LSTM model connected in sequence. The intent recognition model 100 of the invention introduces a textCNN text convolution network and an LSTM network, can model text semantic features at multiple scales, and can extract local and global semantic information from the text simultaneously. Meanwhile, an LSTM layer introduced at the encoding-decoding end can learn long-range dependency information in the text and strengthen context modeling.
Referring to fig. 3, the encoding network 101 includes a first Transformer encoder network layer 1011, a second textCNN-LSTM network layer 1012, and a second Transformer encoder network layer 1013 connected in sequence; the first Transformer encoder network layer 1011 receives the input word_doc document vector and is simultaneously skip-connected to the second Transformer encoder network layer 1013. The second textCNN-LSTM network layer 1012 consists of a textCNN model and an LSTM model connected in sequence. Through this multi-layer, overlapping, progressive encoding structure (i.e., an encoder stacked more than two layers deep), the encoding network of the invention progressively abstracts and optimizes text semantics, forming an end-to-end semantic abstraction pipeline that deeply optimizes text semantics.
Referring to fig. 3, the first and second Transformer encoder network layers 1011 and 1013 each include a first self-attention module 1014, a first residual-and-normalization module 1015, a first feed-forward network 1016, and a second residual-and-normalization module 1017, connected in sequence. The output of the second residual-and-normalization module 1017 of the first Transformer encoder network layer 1011 is skip-connected to the first self-attention module 1014 of the second Transformer encoder network layer 1013.
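The residual-and-normalization modules above follow the standard pattern output = LayerNorm(x + Sublayer(x)). A minimal numpy sketch, with a toy stand-in for the self-attention or feed-forward sublayer:

```python
import numpy as np

# Sketch of the residual-and-normalization pattern in each Transformer
# encoder layer: output = LayerNorm(x + Sublayer(x)). The sublayer here is
# a toy stand-in; real layers use self-attention or a feed-forward network.

def layer_norm(x, eps=1e-5):
    mu = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return (x - mu) / np.sqrt(var + eps)

def residual_norm(x, sublayer):
    return layer_norm(x + sublayer(x))

x = np.random.rand(4, 8)                   # (sequence length, model dim)
out = residual_norm(x, lambda t: 0.5 * t)  # toy sublayer
print(out.shape)  # (4, 8)
```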
Further, the second textCNN-LSTM network layer 1012 between the first and second Transformer encoder network layers 1011 and 1013 is used to extract semantic information and context dependencies.
(1) TextCNN network module in second textCNN-LSTM network layer 1012:
wherein E is the encoder output (EncoderOutput):
C=CNN(E)
Where C is the output of the CNN layer, and CNN denotes a set of convolution operations including convolution, a nonlinear activation function, and pooling.
(2) LSTM network module in second textCNN-LSTM network layer 1012:
L, (h_n, c_n) = LSTM(C)
where L is the output of the LSTM layer, (h_n, c_n) are the final hidden state and cell state of the LSTM layer, and LSTM denotes the recurrent operation of the LSTM network module.
The specific implementation is expressed by the following formula:
(1) Part textCNN:
C_k = pooling(ReLU(E * K_k + b_k))
where "*" is the convolution operation, K_k is the k-th convolution kernel, b_k is the bias, pooling is the pooling operation, and ReLU is the activation function.
C = [C_1; C_2; …; C_K]
where C represents the concatenation of the outputs of the multiple convolutions and ";" denotes the concatenation operation.
(2) LSTM section:
(h_t, c_t) = LSTM(C_t, h_{t-1}, c_{t-1})
where C_t is the input at time step t, and (h_t, c_t) are the hidden state and cell state of the LSTM layer at time step t.
The final LSTM output is:
L = [h_1, h_2, …, h_T]
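The textCNN and LSTM formulas above can be sketched in numpy. The patent leaves some details open (pooling window size, how the pooled features are re-sequenced for the LSTM), so this is one plausible reading under those assumptions: each kernel K_k slides over the encoder output E, ReLU and local max-pooling produce a feature sequence, and the concatenated sequence C is fed step by step to an LSTM. All shapes and the gate ordering are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)

def text_cnn(E, kernels, biases, pool=2):
    # C_k = pooling(ReLU(E * K_k + b_k)) per kernel, then concatenate.
    # Local max-pooling (window `pool`, stride 1) keeps a sequence for the LSTM.
    T, d = E.shape
    feats = []
    for K, b in zip(kernels, biases):
        w = K.shape[0]                                   # kernel width
        conv = np.array([np.sum(E[t:t + w] * K) + b for t in range(T - w + 1)])
        act = np.maximum(conv, 0.0)                      # ReLU
        pooled = np.array([act[i:i + pool].max() for i in range(len(act) - pool + 1)])
        feats.append(pooled)
    T2 = min(len(f) for f in feats)                      # align lengths across kernels
    return np.stack([f[:T2] for f in feats], axis=1)     # (T', num_kernels)

def lstm(C, Wx, Wh, b):
    # (h_t, c_t) = LSTM(C_t, h_{t-1}, c_{t-1}); gate order i, f, o, g (assumed)
    H = Wh.shape[0]
    h, c, outs = np.zeros(H), np.zeros(H), []
    for x in C:
        z = Wx.T @ x + Wh.T @ h + b
        i = 1.0 / (1.0 + np.exp(-z[:H]))
        f = 1.0 / (1.0 + np.exp(-z[H:2 * H]))
        o = 1.0 / (1.0 + np.exp(-z[2 * H:3 * H]))
        g = np.tanh(z[3 * H:])
        c = f * c + i * g
        h = o * np.tanh(c)
        outs.append(h)
    return np.stack(outs), (h, c)                        # L = [h_1, ..., h_T'], final state

E = rng.standard_normal((10, 16))                        # encoder output (toy shape)
kernels = [rng.standard_normal((w, 16)) for w in (2, 3)] # two kernel widths
C = text_cnn(E, kernels, [0.0, 0.0])
H = 8
Wx = rng.standard_normal((C.shape[1], 4 * H))
Wh = rng.standard_normal((H, 4 * H))
L_out, (h_n, c_n) = lstm(C, Wx, Wh, np.zeros(4 * H))
print(L_out.shape, h_n.shape)  # (7, 8) (8,)
```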
Referring to fig. 3, the decoding network 103 includes a first Transformer decoder network layer 1031, a third textCNN-LSTM network layer 1032, and a second Transformer decoder network layer 1033 connected in sequence, where the first Transformer decoder network layer 1031 is also skip-connected to the second Transformer decoder network layer 1033. The third textCNN-LSTM network layer 1032 is composed of a textCNN model and an LSTM model connected in sequence. Through this progressive, multi-level stacked decoding structure (the decoder is a stack of two or more layers), the decoding network of the invention progressively abstracts and optimizes the text semantics, forming an end-to-end semantic abstraction so that the text semantics can be deeply optimized.
Specifically, the first Transformer decoder network layer 1031 is connected to the third textCNN-LSTM network layer 1032 for semantic decoding, expressed by the following formulas:
(1) textCNN layer, where D is the decoder output:
C=CNN(D)
Where C is the output of the CNN layer, and CNN denotes a set of convolution operations including convolution, a nonlinear activation function, and pooling.
(2) LSTM layer:
L, (h_n, c_n) = LSTM(C)
where L is the output of the LSTM layer, (h_n, c_n) are the final hidden state and cell state of the LSTM layer, and LSTM denotes the recurrent operation of the LSTM layer.
Referring to fig. 3, the second Transformer encoder network layer 1013 in the encoding network 101 is directly skip-connected to the first Transformer decoder network layer 1031 in the decoding network 103, while the second Transformer encoder network layer 1013 is also connected to the first Transformer decoder network layer 1031 through the first textCNN-LSTM network layer 102.
Referring to fig. 3, the first and second Transformer decoder network layers 1031 and 1033 each include a second self-attention module 1034, a third residual-and-normalization module 1035, a third self-attention module 1036, a fourth residual-and-normalization module 1037, a second feed-forward network 1038, and a fifth residual-and-normalization module 1039, connected in sequence.
Referring to fig. 3, the output of the second residual-and-normalization module 1017 of the second Transformer encoder network layer 1013 is skip-connected to the third self-attention module 1036 of the first Transformer decoder network layer 1031.
Referring to fig. 3, the output of the fifth residual-and-normalization module 1039 of the first Transformer decoder network layer 1031 is skip-connected to the third self-attention module 1036 of the second Transformer decoder network layer 1033.
Further, referring to fig. 3, the recognition network 104 includes a Softmax classification layer 1041. The output of the second Transformer decoder network layer 1033 in the decoding network 103 is connected to the Softmax classification layer 1041, and the Softmax classification layer 1041 outputs the final recognition result.
See in particular the following formula:
I_H = softmax(D'_H)
where D'_H is the input to the classification layer and I_H is the predicted intent.
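A minimal sketch of the classification step I_H = softmax(D'_H): the decoder output is mapped to a probability distribution over intent classes. The three-class logit vector is hypothetical.

```python
import numpy as np

# Sketch of the recognition layer: I_H = softmax(D'_H).
def softmax(z):
    z = z - z.max()  # subtract max for numerical stability
    e = np.exp(z)
    return e / e.sum()

d_h = np.array([2.0, 0.5, -1.0])  # hypothetical decoder logits for 3 intents
i_h = softmax(d_h)
print(i_h.argmax())  # index of the predicted intent -> 0
```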
In conclusion, the intent recognition model 100 of the invention introduces a textCNN text convolution network and an LSTM network, so it can model text semantic features at multiple scales and simultaneously extract local and global semantic information from the text. LSTM layers introduced at the encoding and decoding ends learn long-distance dependency information of the text and strengthen the context modeling capability. Through the progressive, multi-layer stacked encoding-decoding structure (the encoder and decoder are each stacks of two or more layers), the text semantics are progressively abstracted and optimized, forming an end-to-end semantic extraction pipeline in which the text semantics are deeply optimized. By modeling text semantics from multiple "viewpoints", that is, by integrating several network structures such as textCNN and LSTM, the model transmits semantic abstraction and refinement information, and ultimately perceives text semantics more sharply and accurately. Compared with a single-structure model, the progressive network structure of the proposed intent recognition model can learn semantic features in depth and improve the text representation capability.
The model is trained as described above until it reaches the deployment standard, at which point training stops and the trained model is obtained. The trained model is then loaded and evaluated on the test set TestSet.
The invention takes the operation-labeled semantic test set TestSet as an example and uses classification accuracy as the evaluation index, defined as the ratio of the number of samples the classifier classifies correctly to the total number of samples in a given test data set. The larger the classification accuracy, the better the model; the lower the value, the worse the model. See the following table:
Sequence number | Method | Accuracy
---|---|---
1 | Semantic recognition based on a Transformer model | 84.91%
2 | Semantic recognition using the intent recognition model of the invention | 95.90%
From the comparison in the table, on the operation-labeled semantic dataset TestSet, the semantic recognition algorithm of the invention (sequence number 2) improves accuracy over the Transformer-based semantic recognition of sequence number 1 by 95.90% - 84.91% = 10.99 percentage points. Therefore, under the same conditions, the multi-feature-fusion semantic recognition algorithm proposed by the invention achieves higher precision and more accurate intent recognition.
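The accuracy metric used above is simply the ratio of correct classifications to total samples; a toy sketch (the labels are illustrative, not the patent's TestSet data):

```python
# Classification accuracy: correctly classified samples / total samples.
def accuracy(predicted, actual):
    correct = sum(p == a for p, a in zip(predicted, actual))
    return correct / len(actual)

# Hypothetical intent labels for four conversation samples.
preds = ["greet", "query", "query", "cancel"]
gold  = ["greet", "query", "cancel", "cancel"]
print(accuracy(preds, gold))  # 0.75
```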
Referring to fig. 4, another embodiment of the present invention further provides a session semantic recognition apparatus 200, including a keyword acquisition module 201, a vectorization module 202, a document vector construction module 203, and a session recognition module 204, where the session semantic recognition apparatus 200 is capable of performing the session semantic recognition method of the method embodiment.
Specifically, the session semantic recognition device 200 includes:
A keyword acquisition module 201 configured to convert the acquired session audio data into session text data from which a session keyword is identified;
A vectorization module 202 configured to input a conversation keyword and a conversation sequence to the BERT model, to obtain a word vector and a conversation text vector of the conversation keyword;
A document vector construction module 203 configured to construct a document according to the word vectors of the session keywords and the session text vector, extract the effective information of the constructed document by using the Doc2Vec model, and generate a document vector related to both the keywords and the session context;
The session recognition module 204 is configured to input the document vector into an intention recognition model to obtain an output session recognition result.
Further, the intent recognition model includes an encoding network, a first textCNN-LSTM network layer, a decoding network, and a recognition network;
The coding network comprises a first converter encoder network layer, a second textCNN-LSTM network layer and a second converter encoder network layer which are sequentially connected, wherein the first converter encoder network layer receives an input document vector and is connected to the second converter encoder network layer in a jumping mode at the same time;
The decoding network comprises a first converter decoder network layer, a third textCNN-LSTM network layer and a second converter decoder network layer which are sequentially connected, wherein the first converter decoder network layer is also connected to the second converter decoder network layer in a jumping manner;
A second Transformer encoder network layer in the encoding network is directly skip-connected to a first Transformer decoder network layer in the decoding network, while the second Transformer encoder network layer is also connected to the first Transformer decoder network layer through the first textCNN-LSTM network layer;
Wherein the first textCNN-LSTM network layer, the second textCNN-LSTM network layer and the third textCNN-LSTM network layer all comprise textCNN modules and LSTM modules which are connected in sequence;
The recognition network includes a Softmax classification layer; an output of the second Transformer decoder network layer in the decoding network is connected to the Softmax classification layer, and the Softmax classification layer outputs the final recognition result.
Further, the keyword obtaining module 201 is further configured to:
Establishing word libraries according to the session text data, where the word libraries include a keyword library, a stop-word library, a proper-noun library, and a forbidden-word library;
and extracting the effective information of the conversation text with a word segmentation tool according to the word libraries, to obtain a word segmentation result.
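A toy sketch of word-library-driven extraction under stated assumptions: a longest-match pass over a keyword and proper-noun lexicon, with stop-word and forbidden-word filtering. A real implementation would use a segmentation tool (for Chinese text, e.g. jieba with custom dictionaries); all lexicon contents here are hypothetical.

```python
# Hypothetical word libraries; real systems build these from the session data.
KEYWORDS = {"refund", "order", "delivery time"}   # keyword + proper-noun lexicon
STOP_WORDS = {"the", "a", "is", "my", "what", "i", "want", "for"}
FORBIDDEN = {"spamword"}

def extract_keywords(text):
    tokens = text.lower().replace("?", "").split()
    kept, i = [], 0
    while i < len(tokens):
        # Longest match first: try the two-token lexicon entry before the single token.
        pair = " ".join(tokens[i:i + 2])
        if pair in KEYWORDS:
            kept.append(pair)
            i += 2
            continue
        tok = tokens[i]
        if tok in KEYWORDS and tok not in STOP_WORDS and tok not in FORBIDDEN:
            kept.append(tok)
        i += 1
    return kept

print(extract_keywords("What is the delivery time for my order?"))
# ['delivery time', 'order']
```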
Further, the first Transformer encoder network layer and the second Transformer encoder network layer each include a first self-attention module, a first residual-and-normalization module, a first feed-forward network, and a second residual-and-normalization module, connected in sequence;
the output of the second residual-and-normalization module of the first Transformer encoder network layer is skip-connected to the first self-attention module of the second Transformer encoder network layer;
the first and second Transformer decoder network layers each include a second self-attention module, a third residual-and-normalization module, a third self-attention module, a fourth residual-and-normalization module, a second feed-forward network, and a fifth residual-and-normalization module, connected in sequence;
the output of the second residual-and-normalization module of the second Transformer encoder network layer is skip-connected to the third self-attention module of the first Transformer decoder network layer;
the output of the fifth residual-and-normalization module of the first Transformer decoder network layer is skip-connected to the third self-attention module of the second Transformer decoder network layer.
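The four device modules can be viewed as a linear pipeline from session audio to recognized intent. A minimal sketch with stub callables standing in for the real modules (the class name and all stub behaviors are hypothetical):

```python
# Illustrative wiring of the four device modules as a simple pipeline.
# Each stage is a stand-in callable; the real modules are the components
# described in the embodiment above.
class SessionSemanticDevice:
    def __init__(self, get_keywords, vectorize, build_doc_vec, recognize):
        self.stages = [get_keywords, vectorize, build_doc_vec, recognize]

    def run(self, session_audio):
        data = session_audio
        for stage in self.stages:      # 201 -> 202 -> 203 -> 204
            data = stage(data)
        return data

device = SessionSemanticDevice(
    get_keywords=lambda audio: ["refund"],            # module 201 (stub)
    vectorize=lambda kws: {"kw": kws, "vec": [0.1]},  # module 202 (stub)
    build_doc_vec=lambda v: v["vec"],                 # module 203 (stub)
    recognize=lambda doc: "refund_intent",            # module 204 (stub)
)
print(device.run(b"fake-audio"))  # refund_intent
```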
It should be noted that the session semantic recognition device 200 provided in this embodiment may be used to execute the technical solutions of the method embodiments; its implementation principle and technical effect are similar to those of the method and are not repeated here.
The foregoing description covers only the preferred embodiments of the invention. Persons skilled in the art will appreciate that the scope of the disclosure is not limited to the specific combinations of technical features described above, but also covers other technical solutions formed by any combination of the above technical features or their equivalents without departing from the spirit of the disclosure, for example, solutions in which the above features are replaced with technical features of similar function disclosed in (but not limited to) the present invention.
Claims (8)
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202410315061.0A CN118211593B (en) | 2024-03-19 | 2024-03-19 | Session semantic recognition method and device |
Publications (2)
Publication Number | Publication Date |
---|---|
CN118211593A CN118211593A (en) | 2024-06-18 |
CN118211593B true CN118211593B (en) | 2024-12-03 |
Family
ID=91453821
Legal Events
Date | Code | Title | Description |
---|---|---|---|
| PB01 | Publication | |
| SE01 | Entry into force of request for substantive examination | |
| GR01 | Patent grant | |