
CN118865979B - Dialogue strategy method based on multi-modal deep learning language behaviors - Google Patents

Dialogue strategy method based on multi-modal deep learning language behaviors

Info

Publication number
CN118865979B
CN118865979B (application CN202411336660.7A)
Authority
CN
China
Prior art keywords
voice
recognition
voice data
data
data set
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202411336660.7A
Other languages
Chinese (zh)
Other versions
CN118865979A (en)
Inventor
邹伟建
刘胜坤
黄倩影
刘昌松
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Guangdong Shuye Intelligent Technology Co ltd
Original Assignee
Guangdong Shuye Intelligent Technology Co ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Guangdong Shuye Intelligent Technology Co ltd filed Critical Guangdong Shuye Intelligent Technology Co ltd
Priority to CN202411336660.7A priority Critical patent/CN118865979B/en
Publication of CN118865979A publication Critical patent/CN118865979A/en
Application granted granted Critical
Publication of CN118865979B publication Critical patent/CN118865979B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L 15/00 Speech recognition
    • G10L 15/26 Speech to text systems
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 18/00 Pattern recognition
    • G06F 18/20 Analysing
    • G06F 18/25 Fusion techniques
    • G06F 18/253 Fusion techniques of extracted features
    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L 15/00 Speech recognition
    • G10L 15/04 Segmentation; Word boundary detection
    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L 15/00 Speech recognition
    • G10L 15/06 Creation of reference templates; Training of speech recognition systems, e.g. adaptation to the characteristics of the speaker's voice
    • G10L 15/063 Training
    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L 15/00 Speech recognition
    • G10L 15/08 Speech classification or search
    • G10L 15/16 Speech classification or search using artificial neural networks

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Multimedia (AREA)
  • Acoustics & Sound (AREA)
  • Human Computer Interaction (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Health & Medical Sciences (AREA)
  • Computational Linguistics (AREA)
  • Artificial Intelligence (AREA)
  • Theoretical Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Evolutionary Computation (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • General Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • Evolutionary Biology (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Machine Translation (AREA)

Abstract

The invention relates to the technical field of voice recognition, and in particular to a language behavior dialogue strategy method based on multi-modal deep learning. The method comprises the following steps: acquiring one or more voice data sets input by a user; inputting the voice data sets into a pre-trained voice recognition network; carrying out intra-group attention mechanism analysis on each voice data set to obtain an intra-group context vector; obtaining a position code for each voice data set and carrying out inter-group attention mechanism analysis on the voice data sets based on the position codes and the intra-group context vectors to obtain a global context vector; obtaining recognition information from the global context vector and calculating a recognition confidence from it; and, based on the recognition confidence, carrying out multi-modal fusion of the recognition information with feature data corresponding to the text modality and the image modality to determine a target recognition result for the complex voice text input by the user, thereby improving the reliability and accuracy of the target recognition result.

Description

Dialogue strategy method based on multi-modal deep learning language behaviors
Technical Field
The invention relates to the technical field of data processing, in particular to a language behavior dialogue strategy method based on multi-mode deep learning.
Background
Bank card number voice recognition is a key function of intelligent customer service systems. Current voice recognition systems are mainly based on recurrent neural network models and perform well on general voice tasks, but when facing the 16-to-19-digit sequence of a bank card number it is difficult to achieve both accuracy of digit recognition and reliability of the result. Customers usually read the number in groups; for example, the read complex digital sequence may be "abcdefjhijklmnop", and different customers group the digits differently. A customer's accent, speech rate, irregular pauses, background noise and the like all affect voice recognition, so the accuracy of single-modality voice recognition through a neural network is currently low.
The patent with publication number CN111462733B discloses a multi-modal speech recognition model training method in which the accuracy of speech recognition is improved by training a speech recognition model on multi-modal data, such as image data and speech signals, as training samples. However, the speech recognition part of that method still cannot overcome the interference caused by customer accents and irregular pauses, and it only considers two modalities (image data and speech signals), so the accuracy of speech recognition still needs to be improved.
Disclosure of Invention
Therefore, the embodiment of the invention provides a language behavior dialogue strategy method based on multi-mode deep learning, so as to solve the problem of poor accuracy of the current single-mode voice recognition.
The embodiment of the invention provides a language behavior dialogue strategy method based on multi-mode deep learning, which comprises the following steps:
Acquiring a complex voice text input by a user, and grouping the complex voice text to obtain one or more voice data sets;
Inputting the voice data set into a pre-trained voice recognition network, and analyzing an intra-group attention mechanism of the voice data set according to the voice recognition network to determine an intra-group context vector of the voice data set;
Acquiring a position code of each voice data set, and analyzing an inter-group attention mechanism of the voice data sets based on the position codes and the intra-group context vectors to obtain a global context vector corresponding to the complex voice text;
Obtaining recognition information output by the voice recognition network according to the global context vector, and obtaining recognition confidence according to the recognition information, wherein the recognition information comprises verification probability of the voice data, one or more candidate recognition results and prediction probability of each candidate recognition result;
Acquiring characteristic data corresponding to a text mode and an image mode, carrying out multi-mode fusion on the identification information and the characteristic data corresponding to the text mode and the image mode based on the identification confidence, and determining a target identification result of the complex voice text input by the user, wherein the characteristic data of the text mode is extracted based on the text input by the user, and the characteristic data of the image mode is extracted based on the image provided by the user.
Preferably, the determining the intra-group context vector of the speech data set according to the intra-group attention mechanism analysis of the speech data set by the speech recognition network includes:
performing coding processing on the voice data group according to the voice recognition network to obtain a block code of the voice data group, wherein the block code comprises a hidden state of each data element in the voice data group;
determining an attention score of each data element in the voice data set according to the hidden state of the corresponding data element and a first learnable parameter set in the voice recognition network;
Normalizing the attention score of the data element to obtain the attention weight of the corresponding data element;
And carrying out weighted summation according to the attention weight and the hiding state of each data element in the voice data set to obtain the intra-set context vector of the voice data set.
Preferably, the performing, based on the position code and the intra-group context vector, an inter-group attention mechanism analysis on the speech data group to obtain a global context vector corresponding to the complex speech text includes:
determining an attention score corresponding to the speech data set based on the position code, the intra-group context vector, and a second set of learnable parameters in the speech recognition network;
Normalizing the attention score of the voice data set to obtain the attention weight of the voice data set;
weighting context vectors in the voice data set according to the attention weight of the voice data set to obtain a weighted vector of the voice data set;
and calculating the summation of the weighted vectors of all the voice data sets to be the global context vector corresponding to the complex voice text.
Preferably, the performing, based on the position code and the intra-group context vector, an inter-group attention mechanism analysis on the speech data group to obtain a global context vector corresponding to the complex speech text includes:
Obtaining a structural code of the voice data set, and determining an attention score corresponding to the voice data set according to the position code, the structural code, the intra-set context vector and a third learnable parameter set in the voice recognition network;
Normalizing the attention score of the voice data set to obtain the attention weight of the voice data set;
weighting context vectors in the voice data set according to the attention weight of the voice data set to obtain a weighted vector of the voice data set;
and calculating the summation of the weighted vectors of all the voice data sets to be the global context vector corresponding to the complex voice text.
Preferably, the obtaining the identification information output by the voice recognition network according to the global context vector includes:
Acquiring an original hidden state of the complex voice text according to the voice recognition network;
Splicing the original hidden state and the global context vector to obtain a spliced vector;
Determining one or more candidate recognition results and a prediction probability of each candidate recognition result according to the splice vector and a fourth set of learnable parameters in the speech recognition network;
The verification probability is determined from the global context vector and a fifth set of learnable parameters in the speech recognition network.
Preferably, said determining one or more candidate recognition results, and a predictive probability for each of said candidate recognition results, based on said concatenation vector and a fourth set of learnable parameters in said speech recognition network comprises:
Determining one or more predictive values and corresponding predictive probability distributions for each data element based on the splice vector and a fourth set of learnable parameters in the speech recognition network;
Randomly combining one or more predicted values based on all data elements to obtain one or more candidate recognition results;
And determining the prediction probability corresponding to the candidate recognition result according to the prediction probability distribution of all the data elements in the candidate recognition result.
Preferably, the method for acquiring the recognition confidence comprises the following steps:
Determining the maximum prediction probability in the candidate recognition results, and taking the candidate recognition result corresponding to the maximum prediction probability as a prediction recognition result;
and carrying out weighted summation on the maximum prediction probability and the check probability to obtain the identification confidence.
Preferably, the determining, based on the recognition confidence, the target recognition result of the complex voice text input by the user by performing multi-modal fusion on the recognition information and the feature data corresponding to the text mode and the image mode, includes:
and responding to the recognition confidence coefficient being smaller than a confidence coefficient threshold value, carrying out feature fusion on the predicted recognition result in the recognition information and the feature data of the text mode and the feature data of the image mode to obtain a target recognition result.
Preferably, the training process of the voice recognition network includes:
obtaining training samples, each training sample comprising one or more data packets;
Inputting the training samples into a voice recognition network to be trained, and extracting context vectors of each data packet in the training samples and global context vectors corresponding to the training samples by the voice recognition network to be trained;
Outputting prediction identification information according to the global context vector corresponding to the training sample, wherein the prediction identification information comprises a prediction result and a prediction verification result of the training sample;
And determining the current optimizing loss of the voice recognition network to be trained according to the predicting result and the predicting verification result, and adjusting and training the voice recognition network to be trained according to the optimizing loss to obtain the pre-trained voice recognition network.
Preferably, the determining the current optimization loss of the to-be-trained voice recognition network according to the prediction result and the prediction verification result includes:
determining a difference value of each data element according to the prediction result, wherein when the prediction value of the data element is the same as the true value, the difference value is corresponding to 0, and when the prediction value of the data element is different from the true value, the difference value is corresponding to 1;
Obtaining data loss according to summation of difference values of all the data elements;
calculating binary cross entropy loss between the predicted check bit probability and the real check bit probability in the predicted check result to obtain check bit loss;
determining a length loss according to the number of data elements in the prediction result, the predicted data length and the real data length;
determining a context vector loss according to cosine similarity of a context vector of a first data packet and a context vector of a last data packet in the prediction result;
determining a structural loss from a weighted sum of the length loss and the context vector loss;
And carrying out weighted summation on the data loss, the check bit loss and the structural loss to obtain the optimization loss.
Compared with the prior art, the embodiment of the invention has the following advantages. The complex voice text of a user is grouped, so that the complex digital sequence is divided into several voice data sets for analysis, reducing the computation required to analyse each digit independently. The data sets are input into a pre-trained voice recognition network, which performs intra-group attention mechanism analysis on each voice data set to obtain an intra-group context vector; different digits within a group are assigned different attention scores, which improves recognition accuracy. The position code of each voice data set is then determined; the position codes distinguish digits at different positions and allow the relationship between digits at different positions to be extracted. Inter-group attention mechanism analysis is performed from the position codes and intra-group context vectors to obtain a global context vector that more comprehensively covers the digit information and position information of the complex digital sequence. Recognition information is determined from the global context vector, and the verification probability obtained from the global context analysis is combined with the prediction probability to compute a recognition confidence, which reflects the reliability of the recognized bank card number. Based on the recognition confidence, the recognition information is fused with the feature data of the text modality and the image modality, so that the complex digital sequence can be accurately identified and processed, and the recognition effect is good.
Drawings
In order to more clearly illustrate the technical solutions of the embodiments of the present invention, the drawings that are needed in the embodiments or the description of the prior art will be briefly described below, it being obvious that the drawings in the following description are only some embodiments of the present invention, and that other drawings may be obtained according to these drawings without inventive effort for a person skilled in the art.
Fig. 1 is a flowchart of a method for a language behavior dialogue strategy based on multi-mode deep learning according to an embodiment of the invention.
Detailed Description
Embodiments of the present disclosure, examples of which are illustrated in the accompanying drawings, are described in detail below. The embodiments described below by referring to the drawings are exemplary and intended for the purpose of explaining the present disclosure and are not to be construed as limiting the present disclosure.
It should be noted that the terms "first," "second," and the like in the description of the present disclosure and the above-described figures are used for distinguishing between similar objects and not necessarily for describing a particular sequential or chronological order. It is to be understood that the data so used may be interchanged where appropriate such that the embodiments of the disclosure described herein may be capable of operation in sequences other than those illustrated or described herein. The implementations described in the following exemplary examples are not representative of all implementations consistent with the present disclosure. Rather, they are merely examples of apparatus and methods consistent with aspects of the present disclosure.
In order to illustrate the technical scheme of the invention, the following description is made by specific examples.
Referring to fig. 1, a flowchart of a method for a language behavior dialogue strategy based on multi-mode deep learning according to an embodiment of the invention is shown in fig. 1, where the method for a language behavior dialogue strategy based on multi-mode deep learning may include:
s101, acquiring a complex voice text input by a user, and grouping the complex voice text to obtain one or more voice data sets.
The complex voice text input by the user is the voice data in which the user reads out information related to the bank card number, and the bank card number currently being read by the user is recognized from the complex voice text to ensure the accuracy of the subsequent intelligent service. It should be noted that after the bank signs an agreement with the user, the user becomes an authorized user of the bank and is entitled to transact telephone business with the bank; the identity information, voiceprint information and telephone number recorded on the agreement are stored in a database. Therefore, before acquiring the voice data in which the user reads the bank card number, the identity of the user is verified against the bank's database, and only after the verification passes can the subsequent intelligent service proceed, which protects the privacy of the user.
In consideration of users' reading habits, and in order to improve the recognition accuracy for complex voice text, this embodiment groups the complex voice text into several voice data sets. For example, a 16-digit bank card number "abcdefjhijklmnop" can be divided into four voice data sets, namely "abcd", "efjh", "ijkl" and "mnop". For the complex voice text actually input by a user, this embodiment can divide the voice data equally into 4 voice data sets based on the pause positions in the voice data, so that the features within each voice data set and between voice data sets can be analysed to obtain a more accurate voice recognition result.
Optionally, before grouping the complex voice text, feature extraction may be performed on it to obtain a feature sequence, and the feature sequence may be grouped to obtain the voice data sets; for example, Mel-frequency cepstral coefficients (Mel-scale Frequency Cepstral Coefficients, MFCC) are computed for the voice data, which removes noise interference from the complex voice sample and improves the accuracy of recognition of the complex voice text.
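As an illustration only, the following Python sketch shows this preprocessing under simple assumptions: MFCC features are extracted with the librosa library and the feature sequence is split into four equal parts. The function names and the equal-split heuristic (standing in for pause-based grouping) are not from the patent.

```python
import numpy as np
import librosa  # assumed available for MFCC extraction


def extract_mfcc(wav_path: str, n_mfcc: int = 13) -> np.ndarray:
    """Compute the MFCC feature sequence X = {x_1, ..., x_T} for one utterance."""
    signal, sr = librosa.load(wav_path, sr=16000)
    # Result shape: (T, n_mfcc), one feature vector x_t per frame.
    return librosa.feature.mfcc(y=signal, sr=sr, n_mfcc=n_mfcc).T


def split_into_groups(feature_seq: np.ndarray, num_groups: int = 4) -> list[np.ndarray]:
    """Divide the feature sequence evenly into `num_groups` voice data sets.

    In the patent the split follows the user's pause positions; splitting the
    sequence into equal parts is a simplifying assumption used here.
    """
    return [np.asarray(g) for g in np.array_split(feature_seq, num_groups)]


if __name__ == "__main__":
    # Toy example: a random "feature sequence" standing in for real MFCCs.
    fake_features = np.random.randn(64, 13)
    groups = split_into_groups(fake_features)
    print([g.shape for g in groups])  # four voice data sets of frames
```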
S102, inputting the voice data set into a pre-trained voice recognition network, and analyzing the intra-set attention mechanism of the voice data set according to the voice recognition network to determine an intra-set context vector of the voice data set.
Optionally, the pre-trained voice recognition network may be a recurrent neural network (RNN). A standard RNN can suffer from vanishing or exploding gradients when processing a long digit sequence, and by the time a digit in the middle of the sequence is being recognized, the beginning of the sequence may already have been "forgotten", leading to errors in the overall result. In the bank card number scenario every digit matters, so a conventional RNN performs poorly for voice recognition of bank card numbers.
To address the low recognition accuracy of a conventional RNN, a grouped attention mechanism is added to the RNN to refine its recognition result and improve its accuracy. The training process of the RNN comprises: obtaining training samples; inputting the training samples into the voice recognition network to be trained; extracting, in the network, the context vector of each data packet in a training sample and the global context vector corresponding to the training sample; outputting prediction recognition information from the global context vector, the prediction recognition information comprising a prediction result and a prediction verification result for the training sample; determining the current optimization loss of the network from the prediction result and the prediction verification result; and adjusting and training the network according to the optimization loss to obtain the pre-trained voice recognition network.
In some implementations, multiple sets of historical recognition voice training samples can be obtained in a voice database, the voice training samples are voice data corresponding to bank card numbers with complex digital sequences, and the voice training samples can be voice data read out by users with different ages, sexes, accents and speech speeds in a natural grouping mode so as to improve the diversity of the samples.
To improve recognition accuracy, feature extraction is performed on the speech training samples; for example, the Mel-frequency cepstral coefficients (MFCC) of each speech training sample are computed to obtain a speech feature sequence $X=\{x_1,x_2,\dots,x_T\}$, where $x_t$ is the feature vector at time $t$ and $T$ is the total length of the sequence. Extracting the MFCC features and carrying out the subsequent analysis on these features removes noise interference in the speech training samples. The speech training samples are further annotated with text to determine the accurate transcription corresponding to each sample, including the digit sequence, pauses and grouping marks. The text labels can take the form $Y=\{y_1,y_2,\dots,y_N\}$, where $y_n\in\{0,1,\dots,9,\langle\text{pause}\rangle\}$ and $\langle\text{pause}\rangle$ represents a stop mark. Optionally, bank card type information $\tau$ is also recorded, since different types of bank cards, such as deposit, debit or credit cards, may differ in their corresponding data structures.
Further, a check bit label can be obtained for judging the validity of the recognition result. The check bit label $z$ of a speech training sample is obtained by verifying the validity of the card number with a check-digit algorithm: if the check bit calculated from the preceding digits is the same as the actual last digit of the card number, the check bit label value is $z=1$; if the check bits differ, the check bit label value is $z=0$. Finally, the processed speech feature sequence $X$, the text label sequence $Y$, the card type information $\tau$ and the check bit label $z$ together form one training sample, and each training sample is used to train the RNN network.
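The text refers only to "a check-digit algorithm". Purely as an assumption, the sketch below uses the Luhn algorithm (the standard check for payment card numbers) to show how the check bit label z could be produced; the example card number is synthetic.

```python
def luhn_check_digit(digits_without_check: list[int]) -> int:
    """Compute the Luhn check digit for the leading digits of a card number."""
    total = 0
    # Walk from the rightmost of the leading digits, doubling every second one.
    for idx, d in enumerate(reversed(digits_without_check)):
        if idx % 2 == 0:
            d *= 2
            if d > 9:
                d -= 9
        total += d
    return (10 - total % 10) % 10


def check_bit_label(card_number: str) -> int:
    """Return z = 1 if the last digit equals the Luhn check digit, otherwise 0."""
    digits = [int(ch) for ch in card_number if ch.isdigit()]
    return int(luhn_check_digit(digits[:-1]) == digits[-1])


if __name__ == "__main__":
    print(check_bit_label("4539578763621486"))  # synthetic Luhn-valid number -> 1
    print(check_bit_label("4539578763621487"))  # last digit altered -> 0
```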
In some implementations, before the training samples are input into the RNN network, the training samples may also be grouped, and each group is then encoded by the RNN network to obtain the group coding corresponding to the training sample: $G=\{G_1,G_2,\dots,G_K\}$ with $G_i=\{h_{i,1},h_{i,2},h_{i,3},h_{i,4}\}$, where $G_i$ is the $i$-th group containing 4 consecutive hidden states, $h_{i,1}$ is the first hidden state of the group and $h_{i,2}$, $h_{i,3}$ and $h_{i,4}$ are its second, third and fourth hidden states, $G$ is the grouped input sequence, $h_t$ is the hidden state of the $t$-th time step (obtained with the existing RNN processing method, which is not described in detail in this embodiment; each time step corresponds to one data element), and $K$ is the total number of groups, which is 4 in this embodiment. Inputting the group-coded training samples into the RNN network for processing helps the network better understand the structure of the digit sequence and reduces the recognition difficulty caused by a long sequence: the longer sequence is processed as 4 groups rather than as many independent digits, which reduces the processing complexity.
Further, in order to solve the problem of forgetting when the conventional RNN processes long sequences, in this embodiment, an attention mechanism is added to the RNN network to perform feature extraction, and the attention mechanism is applied to each data packet divided by four digits, so that the model can accurately capture the features of each data packet. Alternatively, the processing of the intra-group attention mechanism may be:
$$e_{i,j} = v_a^{\top}\tanh(W_a h_{i,j} + b_a),\qquad \alpha_{i,j} = \frac{\exp(e_{i,j})}{\sum_{k}\exp(e_{i,k})},\qquad c_i = \sum_{j}\alpha_{i,j}\,h_{i,j}$$
where $e_{i,j}$ is the attention score of the $j$-th element of the $i$-th group, representing how much attention the RNN network pays to that element (the higher the attention score, the more important the RNN network considers the element to be in the current task); $v_a$ is a learnable vector parameter, a column vector, and $v_a^{\top}$ denotes its transpose into a row vector; $W_a$ is a learnable weight matrix that maps the hidden state $h_{i,j}$ to a new feature space; $b_a$ is a learnable bias vector; $h_{i,j}$ is the hidden state of the $j$-th element of the $i$-th group; $\alpha_{i,j}$ is the attention weight of the $j$-th element of the $i$-th group; $c_i$ is the context vector of the $i$-th group, a weighted sum of the hidden states of all data elements that synthesizes the information of the whole data packet; $\tanh$ is the hyperbolic tangent activation function, compressing its input to the range $[-1,1]$; and the softmax normalization produces the attention weights.
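A NumPy sketch of this intra-group attention step follows; the additive (tanh) form matches the reconstruction above, while the hidden-state dimension and the random parameter values are illustrative placeholders.

```python
import numpy as np


def softmax(x: np.ndarray) -> np.ndarray:
    e = np.exp(x - x.max())
    return e / e.sum()


def intra_group_attention(group_hidden: np.ndarray,
                          W_a: np.ndarray, b_a: np.ndarray, v_a: np.ndarray) -> np.ndarray:
    """Compute the intra-group context vector c_i for one voice data group.

    group_hidden: (4, H) hidden states h_{i,1..4} of the group.
    Returns c_i = sum_j alpha_{i,j} * h_{i,j}.
    """
    # e_{i,j} = v_a^T tanh(W_a h_{i,j} + b_a)  -- additive attention score per element
    scores = np.tanh(group_hidden @ W_a.T + b_a) @ v_a
    weights = softmax(scores)                      # alpha_{i,j}
    return weights @ group_hidden                  # weighted sum over the 4 elements


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    H = 16
    h_group = rng.standard_normal((4, H))          # hidden states of one 4-digit group
    W_a = rng.standard_normal((H, H))
    b_a = rng.standard_normal(H)
    v_a = rng.standard_normal(H)
    c_i = intra_group_attention(h_group, W_a, b_a, v_a)
    print(c_i.shape)                               # (16,)
```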
It will be appreciated that the RNN network includes a number of learnable parameters, such as $v_a$, $W_a$ and $b_a$ above. During training of the RNN network, these learnable parameters are continuously adjusted, which changes the output of the network; when training finishes, all learnable parameters have been adjusted and the final model parameters are obtained.
Through the processing of the attention mechanism in the group, the attention score is dynamically allocated to each data element in the data packet so as to determine the importance of each data element in the data packet, so that the RNN network can be more suitable for the characteristics of different bank card numbers, and the important figures in part of the bank card numbers are focused on to improve the identification accuracy.
Further, based on the context vector $c_i$ obtained for each data packet, the inter-group attention mechanism is applied over all data packets to capture the relationship between different data packets, which helps the RNN network understand the global structure of the whole bank card number and thereby improves recognition accuracy. For example, different account types (such as debit cards and credit cards) often have different digit patterns; when the model recognizes that a certain digit packet may be a bank identification code, it needs to pay more attention to the specific pattern of the subsequent digit packets in order to improve recognition accuracy. The processing of the inter-group attention mechanism may be:
$$e_i = u_a^{\top}\tanh(W_c c_i + W_p p_i + b_g),\qquad \beta_i = \frac{\exp(e_i)}{\sum_{k}\exp(e_k)},\qquad g = \sum_{i}\beta_i\,c_i$$
where $e_i$ is the attention score of the $i$-th group, representing how much attention the RNN network pays to the whole $i$-th group (the higher the attention score, the more important the RNN network considers the data packet to be in the overall sequence); $u_a$ is a learnable vector parameter that can be understood as a "query" vector used to evaluate the importance of the transformed group feature; $W_c$ is a learnable weight matrix for processing the context vectors; $W_p$ is a learnable weight matrix for processing the position codes, since the position of a digit packet in the whole sequence is particularly important when recognizing a bank card number (digit packets at different positions may have different meanings, such as bank codes or account numbers); $b_g$ is a learnable bias vector; $c_i$ is the context vector of the $i$-th group computed by the intra-group attention mechanism; $p_i$ is the position code of the $i$-th group, providing information about where the digit packet sits in the whole sequence; $\beta_i$ is the attention weight of the $i$-th group, representing the importance that the model finally assigns to that data packet, obtained by normalizing the attention score with the softmax function; and $g$ is the global context vector, which integrates the information of all data packets with weights determined by the inter-group attention mechanism and represents a global representation of the whole bank card number sequence.
The position code of each data packet is represented in a fixed coded form; for example, $p_1$ and $p_2$ denote the position codes of the 1st and 2nd data packets respectively. In this embodiment the position codes are generated with sine and cosine functions. The sine coding function is $\mathrm{PE}(pos,2k)=\sin\!\big(pos/10000^{2k/d}\big)$ and the cosine coding function is $\mathrm{PE}(pos,2k+1)=\cos\!\big(pos/10000^{2k/d}\big)$, where $pos$ is the position of the data packet (taking the index values 1, 2, 3 and 4 in this embodiment) and $d$ is the total dimension of the encoding. When encoding with these functions, the sine and cosine coding functions are used alternately: the sine coding function for even index positions and the cosine coding function for odd index positions.
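A small sketch of these fixed sinusoidal position codes follows; the encoding dimension chosen here is an illustrative assumption.

```python
import numpy as np


def position_encoding(pos: int, d: int) -> np.ndarray:
    """Fixed sinusoidal position code p_i for group position `pos` (1..4)."""
    pe = np.zeros(d)
    for k in range(d):
        angle = pos / (10000 ** (2 * (k // 2) / d))
        pe[k] = np.sin(angle) if k % 2 == 0 else np.cos(angle)  # sine even, cosine odd
    return pe


if __name__ == "__main__":
    codes = np.stack([position_encoding(i, d=16) for i in (1, 2, 3, 4)])
    print(codes.shape)  # (4, 16): one position code per data packet
```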
It will be appreciated that the processing of the inter-group attention mechanism enables the RNN network to capture the relationships between different data packets. Digit elements at different positions in a bank card number often have different meanings, such as bank codes or account numbers. By incorporating the position codes $p_i$, the RNN network model can distinguish digit elements at different positions and learn the interrelations between them, rather than recognizing each digit in isolation, which improves the accuracy of bank card number recognition.
In some implementations, when the bank card number in the voice data input by the user is recognized, the correctness of the bank card number also needs to be verified, to avoid degrading the customer experience with a wrongly recognized card number. Structural coding information can therefore be added to the RNN network model so that it can learn the structural characteristics of the bank card number. Optionally, a structure encoder can be added to the RNN network; the structure encoder encodes each data packet according to the card type information of the bank card, and its processing can be expressed as:
$$s_i = D[i][\tau]$$
where $s_i$ is the structure code of the $i$-th group, containing information about the position and role of the data packet within the bank card number structure; $D$ is a two-dimensional dictionary whose first-level key is the group index and whose second-level key is the card type, each value being a coded vector representing the combination of position and card type; $i$ is the index of the data packet, indicating which data packet is currently being processed; and $\tau$ is the bank card type, an input parameter, since different types of bank cards, such as deposit, debit or credit cards, may have different structural information.
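A minimal sketch of this structure encoder as a two-level lookup table follows; the card-type vocabulary, the vector dimension and the random placeholder values are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(1)
CARD_TYPES = ("debit", "credit", "deposit")          # assumed card-type vocabulary
STRUCT_DIM = 8

# Two-level dictionary D: first key = group index, second key = card type,
# value = learned (here: random placeholder) structure-code vector s_i.
D = {i: {t: rng.standard_normal(STRUCT_DIM) for t in CARD_TYPES} for i in (1, 2, 3, 4)}


def structure_code(group_index: int, card_type: str) -> np.ndarray:
    """s_i = D[group_index][card_type]."""
    return D[group_index][card_type]


if __name__ == "__main__":
    print(structure_code(2, "credit").shape)  # (8,)
```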
Adding the structural information of the bank card number to the RNN allows the network to learn the different structural characteristics within the card number and thus to recognize the bank card number more accurately from the context vectors during voice recognition. The inter-group attention mechanism fused with the structural information can therefore be expressed as:
$$e_i = u_s^{\top}\tanh(W_c c_i + W_p p_i + W_s s_i + b_s)$$
where $e_i$ is the attention score of the $i$-th group, representing how much attention the RNN network pays to the whole $i$-th group; the optimized attention score takes into account not only the context and position information but also the structural information, and the higher the attention score, the more important the RNN network considers the data packet to be within the whole bank card number sequence; $u_s^{\top}$ is the transpose of a learnable vector parameter, a row vector used to take the dot product with the output of the $\tanh$ function, compressing the multidimensional features into a scalar attention score; $W_c$ is a learnable weight matrix for processing the context vectors; $c_i$ is the context vector of the $i$-th group; $W_p$ is a learnable weight matrix for processing the position codes; $p_i$ is the position code of the $i$-th group, reflecting where the digit group sits in the whole sequence so that the RNN network can distinguish digits at different positions; $W_s$ is a learnable weight matrix for processing the structure codes; $s_i$ is the structure code of the $i$-th group generated by the structure encoder, a vector containing specific information about the data packet within the bank card number structure, such as whether the packet may be part of a bank identification code, an account number or a check bit; and $b_s$ is a learnable bias vector that increases the flexibility of the model.
Further, the updated attention score $e_i$ is normalized with the softmax function to obtain the updated attention weight $\beta_i$ of each data packet, and the global context vector is obtained from the attention weights $\beta_i$ and the intra-group context vectors $c_i$ as $g=\sum_i\beta_i c_i$. It should be noted that either of the two forms of the inter-group attention mechanism described above may be used in the RNN network; that is, two RNN networks may be trained for the analysis. The RNN network with structure codes requires more computation than the one without, but the added structure codes increase the accuracy of bank card number recognition, so in the long-sequence scenario of bank card number recognition the inter-group attention mechanism fused with the structure codes can be used to obtain the global context vector and improve the accuracy of the final recognition.
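A NumPy sketch of this structure-aware inter-group attention producing the global context vector g follows; the parameter values are random placeholders, and giving the context, position and structure codes the same dimensionality is a simplifying assumption.

```python
import numpy as np


def softmax(x: np.ndarray) -> np.ndarray:
    e = np.exp(x - x.max())
    return e / e.sum()


def inter_group_attention(C: np.ndarray, P: np.ndarray, S: np.ndarray,
                          W_c: np.ndarray, W_p: np.ndarray, W_s: np.ndarray,
                          b_s: np.ndarray, u_s: np.ndarray) -> np.ndarray:
    """Global context vector g from K groups.

    C: (K, H) intra-group context vectors c_i
    P: (K, H) position codes p_i
    S: (K, H) structure codes s_i
    """
    # e_i = u_s^T tanh(W_c c_i + W_p p_i + W_s s_i + b_s)
    scores = np.tanh(C @ W_c.T + P @ W_p.T + S @ W_s.T + b_s) @ u_s
    beta = softmax(scores)          # attention weight of each data packet
    return beta @ C                 # g = sum_i beta_i c_i


if __name__ == "__main__":
    rng = np.random.default_rng(2)
    K, H = 4, 16
    C, P, S = (rng.standard_normal((K, H)) for _ in range(3))
    W_c, W_p, W_s = (rng.standard_normal((H, H)) for _ in range(3))
    b_s, u_s = rng.standard_normal(H), rng.standard_normal(H)
    g = inter_group_attention(C, P, S, W_c, W_p, W_s, b_s, u_s)
    print(g.shape)  # (16,)
```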
Further, the output layer of the RNN network outputs prediction identification information according to the global context vector, where the prediction identification information may include a prediction result and a prediction check result, and the prediction probability of each data element in the prediction result may be expressed as:
$$P(y_t \mid X) = \mathrm{softmax}\big(W_o\,[h_t; g] + b_o\big)$$
where $P(y_t \mid X)$ is, for the given observation $X$, the probability of each possible state; in this embodiment the states are the possible values of the data element, and the dimension of the vector equals the number of possible states, e.g. the predicted value of a recognized data element may be a digit 0-9 or a pause mark, 11 dimensions in total; $\mathrm{softmax}$ is the activation function that converts any real vector into a probability distribution, ensuring that all output elements sum to 1 and that each element lies in $[0,1]$; $W_o$ is the learnable weight matrix of the output layer; $b_o$ is the learnable bias vector of the output layer; $h_t$ is the original hidden state, containing the local information accumulated while processing the sequence; $g$ is the global context vector obtained through the attention mechanism, synthesizing the information of the whole input sequence; and $[h_t; g]$ denotes the concatenation of $h_t$ and $g$, joining the original hidden state and the global context vector along the vertical direction. By combining the original hidden state $h_t$ with the context vector $g$ derived through the attention mechanism, the RNN network can use both local information (the original hidden state) and global information (the context captured by the attention mechanism) when making the final prediction, which alleviates the problem of information being forgotten when a conventional RNN network processes long sequences.
It will be appreciated that the prediction probability distribution of each time step, that is, of each data element, can be obtained from the above prediction identification information and may be expressed as $P(y_t = k \mid X) = \big[P(y_t \mid X)\big]_k$, where $k$ denotes a value class, including the digits 0-9 and the special pause mark, and $[\cdot]_k$ denotes the element of the vector corresponding to value class $k$.
In some implementations, after the global context vector has been determined with the updated inter-group attention mechanism, check bit prediction may also be performed from the global context vector, so that when judging the last digit (the check bit) the RNN network pays more attention to the preceding digit elements, since the check bit is usually calculated from them, thereby improving the accuracy of the prediction. The check bit prediction may be expressed as:
$$p_{\text{check}} = \sigma\big(W_v\,g + b_v\big)$$
where $p_{\text{check}}$ is, for the given observation $X$, the probability that the check bit is correct, a scalar value between 0 and 1; in the context of bank card number recognition, the predicted check bit probability can represent the confidence of the RNN network in the whole card number sequence it has recognized, and the higher this probability, the more likely the current recognition result is a valid bank card number; $\sigma$ is the sigmoid activation function, compressing its input into the interval $[0,1]$; $W_v$ is the learnable weight matrix for check bit prediction, mapping the global context vector into the check bit prediction space; $g$ is the global context vector, containing the information of the whole sequence; and $b_v$ is the learnable bias vector for check bit prediction.
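The following NumPy sketch puts the output layer and the check-bit head together as reconstructed above; all shapes and parameter values are illustrative placeholders.

```python
import numpy as np


def softmax(x: np.ndarray) -> np.ndarray:
    e = np.exp(x - x.max())
    return e / e.sum()


def sigmoid(x: float) -> float:
    return 1.0 / (1.0 + np.exp(-x))


def output_layer(h_t: np.ndarray, g: np.ndarray,
                 W_o: np.ndarray, b_o: np.ndarray,
                 W_v: np.ndarray, b_v: float):
    """Return P(y_t | X) over the 11 value classes and the check-bit probability."""
    z_t = np.concatenate([h_t, g])                # splice vector [h_t; g]
    probs = softmax(W_o @ z_t + b_o)              # per-step class distribution
    p_check = sigmoid(float(W_v @ g + b_v))       # confidence the card number is valid
    return probs, p_check


if __name__ == "__main__":
    rng = np.random.default_rng(3)
    H, NUM_CLASSES = 16, 11                       # digits 0-9 plus a pause mark
    h_t, g = rng.standard_normal(H), rng.standard_normal(H)
    W_o, b_o = rng.standard_normal((NUM_CLASSES, 2 * H)), rng.standard_normal(NUM_CLASSES)
    W_v, b_v = rng.standard_normal(H), 0.0
    probs, p_check = output_layer(h_t, g, W_o, b_o, W_v, b_v)
    print(probs.sum(), p_check)                   # ~1.0 and a value in (0, 1)
```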
The check bit of a bank card number is calculated from the preceding digits by a check-digit algorithm, so obtaining the predicted check bit through the RNN network in this embodiment improves the recognition of the bank card number and helps ensure that the recognized number is valid.
Based on the above description of the RNN network training process, in this embodiment the intra-group attention of the input data packets can be extracted in the RNN network to obtain the context vector of each data packet; the inter-group attention is then extracted from those context vectors to obtain the global context vector, and structural information of the bank card number is added when obtaining the global context vector to improve the accuracy of bank card number recognition. The output layer of the RNN network then outputs a prediction result and a prediction verification result from the obtained global context vector, which improves the reliability of the current prediction result.
It can be understood that during training of the RNN network, the learnable weight parameters in the network are continuously adjusted so that the network outputs different prediction results. The accuracy of the current prediction is then determined from the difference between the prediction results and the real label results, that is, the current loss function of the RNN network is calculated; the loss function reflects the accuracy of the currently output prediction results, and the learnable parameters in the RNN network are adjusted based on feedback from the loss function so that the output of the network gradually approaches the real results, until the loss function converges and training of the RNN network is complete.
In some implementations, a difference value is determined for each data element from the prediction result: when the predicted value of a data element is the same as the true value, the corresponding difference value is 0, and when they differ, the corresponding difference value is 1. The data loss is obtained by summing the difference values of all data elements. The binary cross-entropy loss between the predicted check bit probability and the real check bit probability in the prediction verification result is computed to obtain the check bit loss. A length loss is determined from the number of data elements in the prediction result, the predicted data length and the real data length. A context vector loss is determined from the cosine similarity of the context vector of the first data packet and the context vector of the last data packet in the prediction result. A structural loss is determined as a weighted sum of the length loss and the context vector loss, and the data loss, the check bit loss and the structural loss are weighted and summed to obtain the optimization loss.
Alternatively, the optimization loss may be expressed as:
$$L = L_{\text{data}} + \lambda_1 L_{\text{check}} + \lambda_2 L_{\text{struct}}$$
where $L$ denotes the optimization loss, $L_{\text{data}}$ the data loss, $L_{\text{check}}$ the check bit loss and $L_{\text{struct}}$ the structural loss; $\lambda_1$ and $\lambda_2$ are hyper-parameters weighting the check bit loss and the structural loss respectively, set to fixed values in this embodiment; in other embodiments they may be adjusted according to the actual situation, which is not described further here.
For the data loss $L_{\text{data}}$ in the optimization loss, the real data in the labelled speech samples can be obtained, and the data loss is determined from the difference between the prediction at each time step and the real label data. For example, with a 16-digit bank card number as in this embodiment, it is judged at each position whether the predicted data and the real data are the same: if they are the same, the local loss for that position is 0, and if they differ, the local loss for that position is 1. The local losses of all time steps in the recognition result are summed to obtain the data loss $L_{\text{data}}$ of the current prediction result.
For the check bit loss $L_{\text{check}}$ in the optimization loss, the real check bit data in the sample label can be obtained, and the check bit loss is determined from the difference between the real check bit data and the predicted check bit probability. In this embodiment the check bit loss is computed as a binary cross-entropy loss, expressed as $L_{\text{check}} = -\big[z\log p_{\text{check}} + (1-z)\log(1-p_{\text{check}})\big]$, where $z$ is the real check bit data.
The structural loss $L_{\text{struct}}$ in the optimization loss is obtained by combining the length loss and the context vector loss of the prediction result and is expressed as $L_{\text{struct}} = \gamma_1 L_{\text{len}} + \gamma_2 L_{\text{ctx}}$, where $L_{\text{len}}$ is the length loss, $L_{\text{ctx}}$ is the context vector loss, and $\gamma_1$ and $\gamma_2$ are preset parameters, set to fixed values in this embodiment. The length loss is $L_{\text{len}} = \lvert N_{\text{pred}} - N_{\text{true}}\rvert / N_{\max}$, where $N_{\text{pred}}$ is the number of non-pause marks predicted by the model, $N_{\text{true}}$ is the actual card number length and $N_{\max}$ is the maximum possible predicted card number length. The context vector loss is $L_{\text{ctx}} = 1 - \cos(c_1, c_K)$, where $\cos(\cdot,\cdot)$ denotes the cosine similarity and $c_1$ and $c_K$ are the context vectors of the first and last digit packets respectively, which can be extracted by the RNN network. If $c_1$ and $c_K$ are highly similar, $\cos(c_1,c_K)$ is close to 1 and $L_{\text{ctx}}$ is close to 0; correspondingly, if their similarity is low, $L_{\text{ctx}}$ is close to 1.
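A small Python sketch of this loss computation follows; the values used for λ1, λ2, γ1 and γ2 are placeholders, since the embodiment's actual settings are not legible in this text.

```python
import numpy as np


def optimization_loss(pred_digits, true_digits,
                      p_check, z_true,
                      n_pred, n_true, n_max,
                      c_first, c_last,
                      lam1=0.5, lam2=0.5, gamma1=0.5, gamma2=0.5) -> float:
    """L = L_data + lam1 * L_check + lam2 * (gamma1 * L_len + gamma2 * L_ctx)."""
    # Data loss: 1 per position where prediction and label differ.
    l_data = float(sum(int(p != t) for p, t in zip(pred_digits, true_digits)))
    # Check-bit loss: binary cross entropy between predicted and true check label.
    eps = 1e-9
    l_check = -(z_true * np.log(p_check + eps) + (1 - z_true) * np.log(1 - p_check + eps))
    # Length loss: normalized difference between predicted and true card length.
    l_len = abs(n_pred - n_true) / n_max
    # Context-vector loss: 1 - cosine similarity of first and last group vectors.
    cos = float(c_first @ c_last / (np.linalg.norm(c_first) * np.linalg.norm(c_last)))
    l_ctx = 1.0 - cos
    l_struct = gamma1 * l_len + gamma2 * l_ctx
    return l_data + lam1 * float(l_check) + lam2 * l_struct


if __name__ == "__main__":
    rng = np.random.default_rng(4)
    pred = [1, 2, 3, 4]
    true = [1, 2, 9, 4]
    print(optimization_loss(pred, true, p_check=0.9, z_true=1,
                            n_pred=16, n_true=16, n_max=19,
                            c_first=rng.standard_normal(8),
                            c_last=rng.standard_normal(8)))
```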
The prediction result of the RNN network is evaluated with this loss function: the closer the prediction result is to the real result, the smaller the optimization loss $L$. The learnable parameters in the RNN network, such as the weight matrices and bias vectors, are adjusted in the reverse direction according to the optimization loss $L$; a new prediction result is obtained with the updated learnable parameters, the corresponding optimization loss $L$ is determined by comparing the new prediction with the real sample result, and so on, until the optimization loss $L$ converges and training of the RNN network is complete. The pre-trained RNN network obtained in this way is the pre-trained speech recognition network of this embodiment, and the values of its learnable parameters are known.
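For illustration, here is a minimal PyTorch training loop in the spirit of this procedure, assuming a heavily simplified stand-in model (a plain GRU with a classification head and a check-bit head, without the grouped attention); cross-entropy is used as a differentiable surrogate for the 0/1 data loss and the structural loss is omitted. None of the names or hyper-parameter values come from the patent.

```python
import torch
from torch import nn


class TinyCardNumRNN(nn.Module):
    """Heavily simplified stand-in for the grouped-attention RNN of this embodiment."""
    def __init__(self, feat_dim=13, hidden=64, num_classes=11):
        super().__init__()
        self.rnn = nn.GRU(feat_dim, hidden, batch_first=True)
        self.out = nn.Linear(hidden, num_classes)   # digit / pause classes
        self.check = nn.Linear(hidden, 1)           # check-bit head

    def forward(self, x):
        states, _ = self.rnn(x)                     # (B, T, H)
        logits = self.out(states)                   # per-step class logits
        p_check = torch.sigmoid(self.check(states[:, -1]))  # one value per sample
        return logits, p_check.squeeze(-1)


model = TinyCardNumRNN()
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
ce = nn.CrossEntropyLoss()
bce = nn.BCELoss()

# Toy batch standing in for MFCC features, digit labels and check-bit labels.
x = torch.randn(8, 16, 13)
y = torch.randint(0, 11, (8, 16))
z = torch.randint(0, 2, (8,)).float()

for step in range(100):
    logits, p_check = model(x)
    # L = L_data + lambda1 * L_check  (structural loss omitted in this sketch)
    loss = ce(logits.reshape(-1, 11), y.reshape(-1)) + 0.5 * bce(p_check, z)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
```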
On the basis of the training process above, after a voice data set is input into the pre-trained voice recognition network, the voice data set can be encoded by the network to obtain the block coding of the voice data set, the block coding comprising the hidden state of each data element in the set. The attention score of each data element is determined from its hidden state and the first learnable parameter set in the voice recognition network, the attention score is normalized to obtain the attention weight of the corresponding data element, and a weighted sum over the attention weights and hidden states of all data elements in the set yields the intra-group context vector of the voice data set. In this embodiment the first learnable parameter set comprises $v_a$, $W_a$ and $b_a$.
That is, the voice recognition network performs intra-group attention mechanism analysis on each voice data group, the intra-group attention mechanism analysis refers to the voice recognition network training process of the step, so as to obtain the attention score of each data element in the data group, normalize the attention score of each data element to obtain the attention weight corresponding to each data element, and then perform weighted summation according to the attention score and the attention weight of the data element to obtain the intra-group context vector corresponding to the voice data group.
S103, obtaining the position code of each voice data group, and carrying out inter-group attention mechanism analysis on the voice data group based on the position code and the context vector in the group to obtain a global context vector corresponding to the complex voice text.
Optionally, the position code of each data packet may be obtained, and the attention score of the corresponding voice data set is determined from the position code, the intra-group context vector and the second learnable parameter set in the voice recognition network; the attention score of the voice data set is normalized to obtain its attention weight, the intra-group context vector of the voice data set is weighted with this attention weight to obtain the weighted vector of the voice data set, and the sum of the weighted vectors of all voice data sets is the global context vector corresponding to the voice data. In this embodiment the second learnable parameter set comprises $u_a$, $W_c$, $W_p$ and $b_g$.
It will be appreciated that, following the training process of the RNN network described above, the position code $p_i$ of each data packet is obtained; the attention score $e_i$ of each voice data set is obtained from the position code $p_i$, the intra-group context vector $c_i$ of the voice data set and the second learnable parameter set; each attention score $e_i$ is then normalized to obtain the corresponding attention weight $\beta_i$; and the global context vector $g$ for all voice data sets is obtained from the attention weights $\beta_i$ and the intra-group context vectors $c_i$, that is, $g=\sum_i\beta_i c_i$. The global context vector contains the global information of the voice data, and predicting the result of the voice data from the global context vector therefore gives higher accuracy.
In some implementations, the structure code of the voice data set and the position code of the voice data set may also be obtained, and the attention score of the corresponding voice data set is determined from the position code, the structure code, the intra-group context vector and the third learnable parameter set in the voice recognition network; the attention score of the voice data set is normalized to obtain its attention weight, the intra-group context vector is weighted with this attention weight to obtain the weighted vector of the voice data set, and the sum of the weighted vectors of all voice data sets is the global context vector corresponding to the voice data. In this embodiment the third learnable parameter set comprises $u_s$, $W_c$, $W_p$, $W_s$ and $b_s$.
Based on the above description, the pre-trained RNN network may be the RNN network with structure codes added, so that when performing inter-group feature extraction on the voice data sets with this network, the structure code $s_i$ and position code $p_i$ of the current voice data set are obtained; the attention score of the voice data set is obtained from the structure code $s_i$, the position code $p_i$, the intra-group context vector $c_i$ and the third learnable parameter set; the attention score is normalized to obtain the corresponding attention weight $\beta_i$; and the global context vector $g$ for all voice data sets is obtained from the attention weights $\beta_i$ and the intra-group context vectors $c_i$, that is, $g=\sum_i\beta_i c_i$. The global context vector employed later in this embodiment is the vector obtained from the attention scores computed with the structure codes and position codes.
S104, according to the global context vector, obtaining the identification information output by the voice identification network, and obtaining the identification confidence according to the identification information.
Wherein the recognition information includes a verification probability of the complex speech text, one or more candidate recognition results, and a prediction probability of each candidate recognition result.
It will be appreciated that the process of obtaining the recognition information output by the speech recognition network according to the global context vector may refer to step S102, where the predicted probability distribution of each data element and the predicted check probability of the check bit are output by the output layer of the RNN network according to the global context vector, and the check probability is obtained based on the global context vector.
In some implementations, owing to the customer's accent or environmental noise, the pre-trained speech recognition network may output more than one value for a data element; that is, each data element may have one or more predicted values, each with a corresponding prediction probability. For example, the 3rd data element might be 4 or 7, with a prediction probability of 60% for 4 and 75% for 7. Multiple candidate recognition results can then be obtained by combining the multiple predicted values of each data element, and the prediction probabilities of the candidate recognition results differ.
That is, the original hidden state of the complex voice text may be obtained from the voice recognition network, the original hidden state is concatenated with the global context vector to obtain a splice vector, one or more candidate recognition results and the prediction probability of each candidate recognition result are determined from the splice vector and the fourth learnable parameter set in the voice recognition network, and the verification probability is determined from the global context vector and the fifth learnable parameter set. In this embodiment the fourth learnable parameter set comprises $W_o$ and $b_o$, and the fifth learnable parameter set comprises $W_v$ and $b_v$.
In some implementations, one or more predicted values and their corresponding prediction probability distributions are determined for each data element from the splice vector $[h_t; g]$ and the fourth learnable parameter set in the voice recognition network; one or more candidate recognition results are obtained by randomly combining the predicted values of all data elements; and the prediction probability of a candidate recognition result is determined from the prediction probability distributions of all the data elements it contains. That is, the data elements predicted at each position are combined to obtain one or more candidate recognition results, and the prediction probability of each candidate recognition result may be taken as the mean of the prediction probability distributions of all its data elements. A sketch of this candidate-combination step is given below.
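The following Python sketch illustrates the combination step, assuming a probability threshold for keeping alternative values and exhaustive enumeration in place of random combination; both are simplifying assumptions.

```python
import itertools
import numpy as np


def candidate_results(step_probs: np.ndarray, keep_threshold: float = 0.3):
    """Enumerate candidate digit sequences from per-step probability distributions.

    step_probs: (T, 11) distribution over classes 0-9 and a pause mark per element.
    Returns (digit_tuple, prediction_probability) pairs, where the prediction
    probability is the mean probability of the chosen values.
    """
    per_step_options = []
    for dist in step_probs:
        keep = [(k, float(p)) for k, p in enumerate(dist[:10]) if p >= keep_threshold]
        if not keep:                       # always keep at least the best digit
            keep = [(int(np.argmax(dist[:10])), float(dist[:10].max()))]
        per_step_options.append(keep)

    candidates = []
    for combo in itertools.product(*per_step_options):
        digits = tuple(k for k, _ in combo)
        prob = float(np.mean([p for _, p in combo]))
        candidates.append((digits, prob))
    return sorted(candidates, key=lambda c: c[1], reverse=True)


if __name__ == "__main__":
    rng = np.random.default_rng(5)
    probs = rng.dirichlet(np.ones(11), size=4)   # toy distributions for 4 elements
    for digits, p in candidate_results(probs)[:3]:
        print(digits, round(p, 3))
```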
In some implementations, the maximum prediction probability in the candidate recognition results can be determined, the candidate recognition result corresponding to the maximum prediction probability is used as the prediction recognition result, and the maximum prediction probability and the check probability are weighted and summed to obtain the recognition confidence.
Since there may be multiple candidate recognition results in this embodiment, each with a different prediction probability, and a higher prediction probability indicates a more likely result, this embodiment may select, for each data element, the value with the highest predicted probability to form the predicted recognition result. The prediction probability of this predicted recognition result may be obtained as the average of the per-element predicted probability distributions, and the confidence is then calculated from this prediction probability to judge the reliability of the recognized result.
Alternatively, the recognition confidence may be calculated as:

Conf = α · Pmax + β · Pcheck

wherein Conf is the recognition confidence; Pmax is the maximum prediction probability, i.e. the prediction probability of the recognition result composed of the values with the maximum predicted probability distribution at each position; Pcheck is the verification probability; and α and β are the corresponding weight values set in this embodiment.
The recognition confidence of the current predicted recognition result is determined from both the prediction probability and the verification probability, so that it more fully reflects the accuracy and reliability of the current predicted recognition result: the higher the recognition confidence, the more reliable the result. This embodiment may set confidence thresholds and judge whether the current predicted recognition result has reference value by comparing the recognition confidence with them; for example, the thresholds are divided into a high confidence threshold of 0.9 and a low confidence threshold of 0.6. When the recognition confidence is greater than or equal to the high confidence threshold, the current predicted recognition result is highly reliable. When the recognition confidence is smaller than the high confidence threshold but greater than or equal to the low confidence threshold, the reliability is moderate and may have been affected by the client's accent, so recognition of the complex voice text is assisted by combining results from other modalities, and the analysis weight of the predicted recognition result is reduced in subsequent analysis. When the recognition confidence is smaller than the low confidence threshold, the reliability is poor, so the user may be reminded to re-input the complex voice text, which is then analyzed and recognized again to obtain a more reliable predicted recognition result.
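A brief sketch of the confidence computation and the two-threshold decision described above: the weights alpha and beta are left as placeholder parameters since this section does not reproduce the embodiment's exact values, while the 0.9 and 0.6 thresholds follow the text.

def recognition_confidence(max_pred_prob, check_prob, alpha=0.5, beta=0.5):
    # weighted sum of the maximum prediction probability and the check probability
    return alpha * max_pred_prob + beta * check_prob

def decide(confidence, high=0.9, low=0.6):
    if confidence >= high:
        return "accept"                        # prediction is reliable enough to use directly
    if confidence >= low:
        return "fuse_with_other_modalities"    # assist recognition with text/image features
    return "ask_user_to_re_enter"              # reliability too poor, request the voice input again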
S105, acquiring the characteristic data corresponding to the text mode and the image mode, carrying out multi-mode fusion on the identification information and the characteristic data corresponding to the text mode and the image mode based on the identification confidence, and determining the target identification result of the complex voice text input by the user.
It can be understood that this embodiment can integrate the voice modality with other modalities, such as a text modality and an image modality, to realize a more comprehensive multi-modal dialogue system. Using existing multi-modal deep learning methods, the multi-modal dialogue system mainly comprises feature extraction, feature alignment, multi-modal fusion, dialogue state tracking, dialogue strategy generation and response generation. The feature data of the text modality is extracted from text input by the user, and the feature data of the image modality is extracted from an image provided by the user.
In some implementations, in response to the recognition confidence being less than the confidence threshold, the predicted recognition result in the recognition information is feature-fused with the feature data of the text modality and the feature data of the image modality to obtain the target recognition result. That is, when the recognition confidence is low (in this embodiment, smaller than the high confidence threshold of 0.9 but greater than or equal to the low confidence threshold of 0.6), the predicted recognition result is fused with the features of the other modalities to obtain a more accurate target recognition result.
The above embodiment adds the text modality and the image modality for fusion judgment when the recognition confidence of the predicted recognition result is low. In other embodiments, other combinations of modalities may be used; for example, on the basis of the text modality, the voice modality may be combined for fusion judgment when the input text is highly complex. The modality types are not specifically limited: voice interaction may be preferred for visually impaired users, while hearing-impaired users may choose text or image interaction. In this way, cross-verification information from different modalities can be provided for different users, improving the overall recognition accuracy.
The multi-modal dialogue system acquires the corresponding feature data from each modality. In this embodiment, the text modality performs feature extraction on a bank card number text input by the user, the image modality performs feature extraction on a bank card image provided by the user, and the voice modality uses the RNN of this embodiment to extract the corresponding features, that is, the predicted recognition result. The text modality may use word embedding techniques to extract feature data, and the image modality uses a convolutional neural network; the feature data of each modality captures the essential characteristics of the information in that modality.
Furthermore, the multi-modal dialogue system can map the features of different modalities into the same semantic space to ensure that they can be effectively fused; feature alignment establishes the semantic association between the information of different modalities and lays a foundation for the subsequent fusion. On the basis of feature alignment, the features of each modality are integrated using multi-modal fusion techniques to generate a unified multi-modal representation. This embodiment may adopt an attention mechanism, a gating mechanism, or a more complex fusion network; for example, a self-attention mechanism captures feature relations within the same modality, and a cross-attention mechanism models the interaction between different modalities. This fusion approach can fully exploit the complementary information of each modality and improve the understanding capability of the system.
In this embodiment, the features of different modalities are extracted, aligned and fused in stages. Illustratively, in the feature extraction stage, the voice modality is processed with the optimized RNN network, including grouping the complex voice text, performing intra-group and inter-group attention analysis with structure encoding, and finally obtaining the global context vector as its feature vector; the text modality extracts its feature vector with a word embedding technique; and the image modality extracts its feature vector with a convolutional neural network. In the feature alignment stage, the features of each modality are transformed by a linear projection layer that maps them into a common feature space, which may specifically be:

h_m = W_m · f_m + b_m

wherein f_m is the feature vector of modality m; W_m is a learnable weight matrix for mapping each modality's features into the common feature space; and b_m is a learnable bias vector.
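A minimal PyTorch-style sketch of this alignment step, giving each modality its own learnable projection into a shared feature space; the module name and the dimensions are illustrative assumptions.

import torch
import torch.nn as nn

class ModalityAligner(nn.Module):
    def __init__(self, dims, common_dim=256):
        super().__init__()
        # one learnable projection (weight matrix + bias vector) per modality
        self.proj = nn.ModuleDict({name: nn.Linear(d, common_dim) for name, d in dims.items()})

    def forward(self, feats):
        # feats: {"speech": tensor, "text": tensor, "image": tensor}
        return {name: self.proj[name](x) for name, x in feats.items()}

aligner = ModalityAligner({"speech": 512, "text": 300, "image": 2048})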
After the features are mapped to the common feature space, multi-modal fusion is performed; this embodiment adopts a combination of a self-attention mechanism and a cross-attention mechanism:
Self-attention mechanism (taking the voice modality as an example):

Attention(Q, K, V) = softmax(Q K^T / sqrt(d_k)) V, with Q = h_v W_Q, K = h_v W_K, V = h_v W_V

wherein W_Q, W_K and W_V are learnable weight matrices for generating the query (Q), the key (K) and the value (V); sqrt(d_k) is the scaling factor of the attention mechanism, equal to the square root of the feature dimension; and the self-attention output captures the relationships inside the voice features.
Cross-attention mechanism (taking speech-text interaction as an example):

CrossAttention(Q_v, K_t, V_t) = softmax(Q_v K_t^T / sqrt(d_k)) V_t, with Q_v = h_v W_Q, K_t = h_t W_K, V_t = h_t W_V

wherein the query comes from the voice features and the key and value come from the text features; the cross-attention output captures the relationship between the voice and text features.
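A compact sketch of the scaled dot-product attention assumed above: for self-attention the query, key and value all come from the voice features, while for cross-attention the query comes from the voice features and the key and value from the text features. Tensor shapes and function names are assumptions for illustration.

import torch
import torch.nn.functional as F

def scaled_dot_product_attention(q, k, v):
    d_k = q.size(-1)
    scores = q @ k.transpose(-2, -1) / d_k ** 0.5   # scale by sqrt of the feature dimension
    return F.softmax(scores, dim=-1) @ v

def attend(x_query, x_context, w_q, w_k, w_v):
    # self-attention:  attend(speech, speech, ...) captures relations inside the voice features
    # cross-attention: attend(speech, text, ...) captures speech-text interaction
    return scaled_dot_product_attention(x_query @ w_q, x_context @ w_k, x_context @ w_v)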
After the attention outputs of the multiple modalities are obtained, feature fusion is carried out. In this embodiment, the outputs of the self-attention mechanism and of the cross-attention mechanism for each modality are spliced in sequence to obtain a feature splice vector, and the feature splice vector is then linearly transformed to obtain the final fusion feature.
Alternatively, the linear transformation of the feature splice vector may be:

F_fused = W_f · F_concat + b_f

wherein W_f is a learnable weight matrix used to fuse the different attention outputs; b_f is a learnable bias vector; F_concat is the feature splice vector; and F_fused is the final fusion feature, which contains the information from all modalities together with the interaction information between modalities. The final bank card number recognition result is obtained from this final fusion feature.
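The splice-and-project step above might look as follows in a PyTorch-style sketch; the four attention streams, their common dimension and the module name are illustrative assumptions.

import torch
import torch.nn as nn

class AttentionFusion(nn.Module):
    def __init__(self, dim=256, n_streams=4):
        super().__init__()
        # learnable weight matrix + bias that fuse the different attention outputs
        self.fuse = nn.Linear(dim * n_streams, dim)

    def forward(self, self_speech, self_text, cross_speech_text, cross_text_speech):
        # splice the self-attention and cross-attention outputs in sequence
        spliced = torch.cat([self_speech, self_text, cross_speech_text, cross_text_speech], dim=-1)
        return self.fuse(spliced)   # final fusion feature used for card-number recognition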
Further, the dialogue state can be updated and maintained based on the fused multi-modal representation. The dialogue state tracking module records and updates key information such as the user intention and the dialogue history, providing a basis for subsequent decisions; this embodiment may use a recurrent neural network or an attention-based model to capture the dynamic characteristics of the dialogue. An optimal dialogue strategy is then generated from the dialogue state by reinforcement learning, for example with a deep Q-network or a policy-gradient algorithm, which selects the optimal system behavior according to the current dialogue state; the goal of the dialogue strategy is to maximize the long-term reward (such as user satisfaction or task completion rate). Finally, the response is generated according to the dialogue strategy and the multi-modal information, either with a template-based method or with a more advanced neural generation model (such as a Transformer or GPT), so that the generated response is natural and consistent with both the guidance of the dialogue strategy and the multi-modal information.
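As an illustration of the deep Q-network option mentioned above, the sketch below selects a system action greedily from the tracked dialogue state; the action set, the network shape and the omission of the training loop (where the reward would reflect user satisfaction or task completion) are assumptions made only for illustration.

import torch
import torch.nn as nn

ACTIONS = ["ask_card_image", "confirm_by_text", "answer_query", "handoff_to_agent"]

class DialoguePolicy(nn.Module):
    def __init__(self, state_dim=256, n_actions=len(ACTIONS)):
        super().__init__()
        # small Q-network mapping the dialogue state to one Q-value per action
        self.q_net = nn.Sequential(nn.Linear(state_dim, 128), nn.ReLU(), nn.Linear(128, n_actions))

    def select_action(self, dialogue_state):
        # greedy action with respect to the predicted Q-values of the current dialogue state
        with torch.no_grad():
            return ACTIONS[self.q_net(dialogue_state).argmax().item()]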
Through the above multi-modal fusion and intelligent dialogue strategies, this embodiment can also understand broader user intentions and context information while accurately recognizing the bank card number. For example, the system can comprehensively judge the user's identity-verification state from the user's voice input, identity card image and text messages; when recognizing the bank card number, if the confidence of the voice recognition is not high, the system can request the user to show a bank card image or confirm through text input. On the basis of the multi-modal information and the dialogue history, the system can further understand the user's query intention more accurately; for example, when the user asks for the account balance, the system can not only accurately recognize the bank card number, but also actively provide more comprehensive account information (such as recent transaction records or reminders of possible abnormal transactions) according to the user's historical query habits, the current time and other factors. The dialogue strategy can also be flexibly adjusted when handling complex queries: a more detailed guiding strategy may be adopted for users using the service for the first time, while a more direct dialogue mode may be chosen for old users familiar with the service. Such personalized dialogue strategies can greatly improve the user experience.
In summary, in this embodiment, the complex voice text read by the user is divided into a plurality of voice data sets and input into the pre-trained voice recognition network for processing. The voice recognition network performs intra-group feature analysis on each voice data set to obtain the intra-group context vectors, and then performs inter-group feature analysis on the voice data sets based on the intra-group context vectors to obtain the global context vector, so that both the local information and the global information in the voice data are fully extracted; the prediction result obtained from the global context vector therefore improves the accuracy of complex voice text recognition. After the recognition information of the voice data is determined, the recognition confidence is calculated from it in order to judge the reliability of the current predicted recognition result, and when the recognition confidence indicates that the predicted recognition result is not reliable enough, it is fused with the features of the other modalities to obtain an accurate target recognition result, improving the accuracy of card number recognition and the user experience. During the training of the pre-trained voice recognition network in this embodiment, the features in the training samples are fully extracted through the structure encoding and the intra-group and inter-group attention mechanisms, and the network is optimized with a loss function built from the prediction result and the prediction verification result, so that a more reliable recognition network is obtained. On this basis, the multi-modal fusion and the intelligent dialogue strategy enable the method and the device to accurately identify the bank card number, understand complex user intentions, provide personalized services and improve user satisfaction.
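A sketch, assembled from the loss terms enumerated in claim 7 below, of how the training objective might be computed; the weighting coefficients and the simplified length term are placeholders rather than the embodiment's exact definitions.

import torch
import torch.nn.functional as F

def optimization_loss(pred_ids, true_ids, check_logit, true_check,
                      pred_len, true_len, first_ctx, last_ctx,
                      w_len=0.5, w_ctx=0.5, w_data=1.0, w_check=1.0, w_struct=1.0):
    # data loss: counts 1 for every data element whose predicted value differs from the true value
    data_loss = (pred_ids != true_ids).float().sum()
    # check-bit loss: binary cross entropy between predicted and true check-bit probabilities
    check_loss = F.binary_cross_entropy_with_logits(check_logit, true_check)
    # length loss: here simplified to the gap between predicted and true data length
    length_loss = torch.as_tensor(abs(pred_len - true_len), dtype=torch.float32)
    # context-vector loss: derived from cosine similarity of the first and last group vectors
    ctx_loss = 1.0 - F.cosine_similarity(first_ctx, last_ctx, dim=-1).mean()
    # structural loss: weighted sum of the length loss and the context-vector loss
    structural_loss = w_len * length_loss + w_ctx * ctx_loss
    return w_data * data_loss + w_check * check_loss + w_struct * structural_loss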
The foregoing embodiments are merely for illustrating the technical solution of the present invention, but not for limiting the same, and although the present invention has been described in detail with reference to the foregoing embodiments, it should be understood by those skilled in the art that the technical solution described in the foregoing embodiments may be modified or substituted for some of the technical features thereof, and that these modifications or substitutions should not depart from the spirit and scope of the technical solution of the embodiments of the present invention and should be included in the protection scope of the present invention.

Claims (7)

1. The language behavior dialogue strategy method based on multi-modal deep learning is characterized by comprising the following steps of:
Acquiring a complex voice text input by a user, and grouping the complex voice text to obtain one or more voice data sets;
Inputting the voice data set into a pre-trained voice recognition network, and analyzing an intra-group attention mechanism of the voice data set according to the voice recognition network to determine an intra-group context vector of the voice data set;
Acquiring a position code of each voice data set, and analyzing an inter-group attention mechanism of the voice data set based on the position code and the intra-group context vector to obtain a global context vector corresponding to the complex voice text;
Obtaining recognition information output by the voice recognition network according to the global context vector, and obtaining recognition confidence according to the recognition information, wherein the recognition information comprises verification probability of the voice data, one or more candidate recognition results and prediction probability of each candidate recognition result;
Acquiring characteristic data corresponding to a text mode and an image mode, and carrying out multi-mode fusion on the identification information and the characteristic data corresponding to the text mode and the image mode based on the identification confidence, so as to determine a target identification result of the complex voice text input by the user;
The performing intra-group attention mechanism analysis on the voice data set according to the voice recognition network, determining an intra-group context vector of the voice data set, includes:
performing coding processing on the voice data group according to the voice recognition network to obtain a block code of the voice data group, wherein the block code comprises a hidden state of each data element in the voice data group;
determining an attention score of each data element in the voice data set according to the hidden state of the corresponding data element and a first learnable parameter set in the voice recognition network;
Normalizing the attention score of the data element to obtain the attention weight of the corresponding data element;
carrying out weighted summation according to the attention weight and the hiding state of each data element in the voice data set to obtain an intra-set context vector of the voice data set;
The performing an inter-group attention mechanism analysis on the voice data set based on the position code and the intra-group context vector to obtain a global context vector corresponding to the complex voice text, including:
determining an attention score corresponding to the speech data set based on the position code, the intra-group context vector, and a second set of learnable parameters in the speech recognition network;
Normalizing the attention score of the voice data set to obtain the attention weight of the voice data set;
weighting context vectors in the voice data set according to the attention weight of the voice data set to obtain a weighted vector of the voice data set;
calculating the summation of the weighted vectors of all the voice data sets to be the global context vector corresponding to the complex voice text;
The performing an inter-group attention mechanism analysis on the voice data set based on the position code and the intra-group context vector to obtain a global context vector corresponding to the complex voice text, including:
Obtaining a structural code of the voice data set, and determining an attention score corresponding to the voice data set according to the position code, the structural code, the intra-set context vector and a third learnable parameter set in the voice recognition network;
Normalizing the attention score of the voice data set to obtain the attention weight of the voice data set;
weighting context vectors in the voice data set according to the attention weight of the voice data set to obtain a weighted vector of the voice data set;
and calculating the summation of the weighted vectors of all the voice data sets to be the global context vector corresponding to the complex voice text.
2. The method for dialogue strategy based on multi-modal deep learning language behavior according to claim 1, wherein the obtaining the recognition information output by the speech recognition network according to the global context vector comprises:
Acquiring an original hidden state of the complex voice text according to the voice recognition network;
Splicing the original hidden state and the global context vector to obtain a spliced vector;
Determining one or more candidate recognition results and a prediction probability of each candidate recognition result according to the splice vector and a fourth set of learnable parameters in the speech recognition network;
The verification probability is determined from the global context vector and a fifth set of learnable parameters in the speech recognition network.
3. The method for dialogue strategy based on multi-modal deep learning language behavior according to claim 2, wherein the determining one or more candidate recognition results and the prediction probability of each candidate recognition result according to the splice vector and the fourth learnable parameter set in the speech recognition network comprises:
Determining one or more predictive values and corresponding predictive probability distributions for each data element based on the splice vector and a fourth set of learnable parameters in the speech recognition network;
Randomly combining one or more predicted values based on all data elements to obtain one or more candidate recognition results;
And determining the prediction probability corresponding to the candidate recognition result according to the prediction probability distribution of all the data elements in the candidate recognition result.
4. The method for dialogue strategy based on multi-modal deep learning language behavior according to claim 3, wherein the obtaining of the recognition confidence comprises the following steps:
Determining the maximum prediction probability in the candidate recognition results, and taking the candidate recognition result corresponding to the maximum prediction probability as a prediction recognition result;
and carrying out weighted summation on the maximum prediction probability and the check probability to obtain the identification confidence.
5. The method of claim 4, wherein the determining the target recognition result of the complex voice text input by the user by multimodal fusion of the recognition information with the feature data corresponding to the text modality and the image modality based on the recognition confidence comprises:
and responding to the recognition confidence coefficient being smaller than a confidence coefficient threshold value, carrying out feature fusion on the predicted recognition result in the recognition information and the feature data of the text mode and the feature data of the image mode to obtain a target recognition result.
6. The method of claim 1, wherein the training process of the speech recognition network comprises:
obtaining training samples, each training sample comprising one or more data packets;
Inputting the training samples into a voice recognition network to be trained, and extracting context vectors of each data packet in the training samples and global context vectors corresponding to the training samples by the voice recognition network to be trained;
Outputting prediction identification information according to the global context vector corresponding to the training sample, wherein the prediction identification information comprises a prediction result and a prediction verification result of the training sample;
And determining the current optimizing loss of the voice recognition network to be trained according to the predicting result and the predicting verification result, and adjusting and training the voice recognition network to be trained according to the optimizing loss to obtain the pre-trained voice recognition network.
7. The method of claim 6, wherein determining the current optimization penalty of the speech recognition network to be trained based on the prediction result and the prediction verification result comprises:
determining a difference value of each data element according to the prediction result, wherein when the prediction value of the data element is the same as the true value, the difference value is corresponding to 0, and when the prediction value of the data element is different from the true value, the difference value is corresponding to 1;
Obtaining data loss according to summation of difference values of all the data elements;
calculating binary cross entropy loss between the predicted check bit probability and the real check bit probability in the predicted check result to obtain check bit loss;
determining a length loss according to the number of data elements in the prediction result, the predicted data length and the real data length;
determining a context vector loss according to cosine similarity of a context vector of a first data packet and a context vector of a last data packet in the prediction result;
determining a structural loss from a weighted sum of the length loss and the context vector loss;
And carrying out weighted summation on the data loss, the check bit loss and the structural loss to obtain the optimization loss.
CN202411336660.7A 2024-09-25 2024-09-25 Dialogue strategy method based on multi-modal deep learning language behaviors Active CN118865979B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202411336660.7A CN118865979B (en) 2024-09-25 2024-09-25 Dialogue strategy method based on multi-modal deep learning language behaviors

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202411336660.7A CN118865979B (en) 2024-09-25 2024-09-25 Dialogue strategy method based on multi-modal deep learning language behaviors

Publications (2)

Publication Number Publication Date
CN118865979A (en) 2024-10-29
CN118865979B (en) 2024-12-06

Family

ID=93181163

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202411336660.7A Active CN118865979B (en) 2024-09-25 2024-09-25 Dialogue strategy method based on multi-modal deep learning language behaviors

Country Status (1)

Country Link
CN (1) CN118865979B (en)

Citations (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
GB2612624A (en) * 2021-11-05 2023-05-10 Spotify Ab Methods and systems for synthesising speech from text

Family Cites Families (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US11335333B2 (en) * 2018-07-20 2022-05-17 Google Llc Speech recognition with sequence-to-sequence models
US11610586B2 (en) * 2021-02-23 2023-03-21 Google Llc Learning word-level confidence for subword end-to-end automatic speech recognition

Patent Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
GB2612624A (en) * 2021-11-05 2023-05-10 Spotify Ab Methods and systems for synthesising speech from text
EP4177882A1 (en) * 2021-11-05 2023-05-10 Spotify AB Methods and systems for synthesising speech from text

Also Published As

Publication number Publication date
CN118865979A (en) 2024-10-29


Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant