
WO2024140434A1 - Text classification method based on multi-modal knowledge graph, and device and storage medium

Info

Publication number: WO2024140434A1
Application number: PCT/CN2023/140835
Authority: WO (WIPO PCT)
Prior art keywords: text, data, real-time, features
Other languages: French (fr), Chinese (zh)
Inventors: 曾谁飞, 孔令磊, 张景瑞, 李敏, 刘卫强
Applicants: 青岛海尔电冰箱有限公司, 海尔智家股份有限公司
Publication of WO2024140434A1 (PCT publication, English)

Classifications

    • G06F16/35 Information retrieval of unstructured textual data: Clustering; Classification
    • G06F16/367 Creation of semantic tools: Ontology
    • G06F16/685 Retrieval of audio data using metadata automatically derived from the content, e.g. an automatically derived transcript of audio data such as lyrics
    • G06F16/7834 Retrieval of video data using metadata automatically derived from the content, using audio features
    • G06F16/7844 Retrieval of video data using metadata automatically derived from the content, using original textual content or text extracted from visual content or a transcript of audio data
    • G06F40/284 Natural language analysis: Lexical analysis, e.g. tokenisation or collocates
    • G06F40/30 Handling natural language data: Semantic analysis

Definitions

  • In one embodiment, the multi-channel multi-size deep convolutional neural network used is composed of a 3*3 convolutional layer with 32 channels and a max-pooling layer.
  • In step S33, the alignment relationship between the input speech feature sequence and the output speech text feature sequence is obtained using the connectionist temporal classification (CTC) method.
  • This temporal classification method is generally applied after a convolutional network model. It enables fully end-to-end acoustic model training that does not require pre-alignment of the data: only an input sequence and an output sequence are needed, without one-to-one labeling of the data, and the model directly outputs the probability of the predicted sequence. Based on this predicted probability, the most likely text output is taken as the second voice text data (see the CTC sketch after this list).
  • The attention mechanism can guide the deep convolutional neural network to focus on the more critical feature information and suppress non-critical feature information. Therefore, by introducing the attention mechanism, the local key features of the second voice text data, or their weight information, can be obtained, further reducing irregular misalignment of the sequence during model training.
  • In step S35, according to the second voice text data and its key features or the weight information of those key features, the second voice text data is assigned its own weight information through a model that integrates the self-attention mechanism and the fully connected layer, so as to better capture the internal weight information of the text semantic features of the voice text data and to strengthen the importance of different parts of the semantic feature information; finally, a classification function such as Softmax computes the score to obtain the voice text data.
  • Step S4 specifically includes:
  • S41: Input the real-time video data into a 3D deep convolutional neural network for calculation to obtain image features.
  • S42: Input the image features into a multi-channel multi-size temporal convolutional network for transcription to obtain first image text data.
  • S43: Output the alignment relationship between the image features and the first image text data based on the connectionist temporal classification method to obtain second image text data.
  • In steps S41 and S42, considering that the sentences recognized from image text are relatively complex (for example, different sentence lengths, different pause positions and word compositions, and correlations between image features), video processing operations such as cropping and framing are performed on the valid video data to obtain the video image of the facial area, which is then cropped and segmented into multiple continuous facial picture frames.
  • The multiple continuous facial picture frames are input into the 3D convolutional neural network model; by adding information from the time dimension, more expressive features can be extracted.
  • The 3D convolutional neural network model can capture the correlation information between multiple pictures: it takes multiple continuous frames as input and captures the motion information in the input frames by adding a new dimension of information, so as to better obtain the image features.
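By way of illustration of the CTC fragments above, the following is a minimal PyTorch sketch of CTC-loss training; the tensor shapes, class count and blank index are illustrative assumptions, not values from the patent.

```python
import torch
import torch.nn as nn

T, N, C = 50, 2, 30  # input sequence length, batch size, classes (index 0 = CTC blank)
# Per-time-step log-probabilities, as a model such as the speech CNN would emit.
log_probs = torch.randn(T, N, C, requires_grad=True).log_softmax(dim=2)
targets = torch.randint(1, C, (N, 10), dtype=torch.long)   # unaligned target label sequences
input_lengths = torch.full((N,), T, dtype=torch.long)
target_lengths = torch.full((N,), 10, dtype=torch.long)

ctc_loss = nn.CTCLoss(blank=0)
loss = ctc_loss(log_probs, targets, input_lengths, target_lengths)
loss.backward()  # end-to-end training without pre-aligned data
```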


Abstract

Disclosed in the present invention is a text classification method based on a multi-modal knowledge graph. The method comprises: acquiring text data and extracting its text features; extracting, according to the text features, context information of the text data and weight information of text semantic features; combining the context information and the weight information by means of a fully connected layer, then outputting the result to a classifier to calculate a score and obtain classification result information; and outputting the classification result information. The method improves the accuracy and generalization capability of text classification, thereby improving the user experience.

Description

Text classification method, device and storage medium based on multimodal knowledge graph

Technical Field
The present invention relates to the field of computer technology, and in particular to a text classification method, device and storage medium based on a multimodal knowledge graph.
Background Art
At present, text classification algorithms do not fully exploit the semantic representation capability of multimodal data such as voice, video, and the user's preference, favorite and comment data on ingredients, resulting in poor text classification performance. Moreover, such text data are processed with traditional machine learning methods, or with methods combining machine learning and shallow neural network features; these approaches tend to generalize poorly, understand the data insufficiently, and produce models of weak robustness, which in turn limits text classification capability.
Therefore, how to build a multimodal text classification method with the help of knowledge graphs has become a key technique for improving text classification accuracy. Smart refrigerator interaction is inseparable from multi-source heterogeneous data such as real-time voice, video, real-time text and historical text. The problem is therefore how to achieve optimal feature information extraction and text classification for such multi-source heterogeneous data on the basis of multimodal or cross-modal data, so as to optimize the text classification accuracy of the smart refrigerator and thereby improve the experience of using the refrigerator.
Summary of the Invention
The purpose of the present invention is to provide a text classification method, device and storage medium based on a multimodal knowledge graph.
The present invention provides a text classification method based on a multimodal knowledge graph, comprising the steps of:
acquiring real-time audio and video data, and acquiring real-time and historical text data; preprocessing the real-time audio and video data to obtain real-time voice data and real-time video data; transcribing the real-time voice data into voice text data and extracting text features of the voice text data; transcribing the real-time video data into image text data and extracting text features of the image text data; extracting entity features of the real-time and historical text data; acquiring context information of the text data and weight information of text semantic features according to the voice text features, the image text features and the entity features; combining the context information and the weight information through a fully connected layer and outputting the result to a classifier to calculate a score and obtain classification result information; and outputting the classification result information.
As a further improvement of the present invention, "preprocessing the real-time audio and video data to obtain real-time voice data and video data" specifically includes: performing data cleaning, format parsing, format conversion and data storage on the real-time audio and video data to obtain valid audio and video data; separating the valid audio and video data into voice and video using a script or a third-party tool, to obtain the real-time voice data and real-time video data; preprocessing the real-time voice data and video data, including framing and windowing the real-time voice data, and cropping and framing the real-time video data; and preprocessing the real-time and historical text data, including word segmentation, removal of stop words and removal of duplicate words.
As a further improvement of the present invention, "transcribing the real-time voice data into voice text data" specifically includes: extracting features of the real-time voice data to obtain voice features; inputting the voice features into a multi-channel multi-size deep convolutional neural network model for speech recognition and transcribing them to obtain first voice text data; outputting the alignment relationship between the voice features and the first voice text data based on the connectionist temporal classification method, to obtain second voice text data; obtaining, based on an attention mechanism, key features of the second voice text data or weight information of the key features; and combining the second voice text data and its key features or the weight information of the key features through a fully connected layer, then calculating a score through a classification function to obtain the voice text data.
As a further improvement of the present invention, "extracting the valid voice data features" specifically includes: extracting the valid voice data features and obtaining their Mel-frequency cepstral coefficient features.
As a further improvement of the present invention, "transcribing the real-time video data into image text data" specifically includes: inputting the real-time video data into a 3D deep convolutional neural network for calculation to obtain image features; inputting the image features into a multi-channel multi-size temporal convolutional network for transcription to obtain first image text data; outputting the alignment relationship between the image features and the first image text data based on the connectionist temporal classification method, to obtain second image text data; and combining the second image text data through a fully connected layer, then calculating a score through a classification function to obtain the image text data.
As a further improvement of the present invention, "extracting entity features of the real-time and historical text data" specifically includes: performing entity extraction on the text data using an entity linking method to obtain multiple ingredient entities; querying the ingredient knowledge graph based on each ingredient entity to obtain the corresponding entity vector representation; and inputting the entity vector representations into a multi-head attention mechanism for calculation to obtain entity feature vectors.
As a further improvement of the present invention, "querying the ingredient knowledge graph based on each ingredient entity to obtain the corresponding entity vector representation" specifically includes: converting the entity into the corresponding entity vector representation in the form of an entity triple; and realizing the entity vector representation using the distributed vector representation method of a neural network, as in the sketch below.
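As an illustration of the two preceding paragraphs, the following is a minimal PyTorch sketch of looking up distributed entity vectors for linked ingredient entities and passing them through a multi-head attention mechanism; the toy entity vocabulary, embedding size and head count are assumptions for illustration only.

```python
import torch
import torch.nn as nn

entity_ids = {"宫保鸡丁": 0, "牛肉": 1, "蔬菜": 2}             # toy entity vocabulary (assumption)
embed = nn.Embedding(len(entity_ids), 64)                      # distributed (neural) entity vectors
attn = nn.MultiheadAttention(embed_dim=64, num_heads=4, batch_first=True)

ids = torch.tensor([[entity_ids["牛肉"], entity_ids["蔬菜"]]])  # entities linked from the text
vecs = embed(ids)                                               # (1, 2, 64) entity vector representations
entity_features, _ = attn(vecs, vecs, vecs)                     # (1, 2, 64) entity feature vectors
```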
As a further improvement of the present invention, "acquiring context information of the text data and weight information of text semantic features according to the real-time voice text features, real-time video text features and entity features" specifically includes: converting the real-time voice text features and real-time video text features into voice text word vectors and image text word vectors; and inputting the voice text word vectors, image text word vectors and entity features into a bidirectional long short-term memory network model, to obtain a context feature vector containing the voice text, image text, and real-time and historical text feature information.
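A minimal sketch of the bidirectional long short-term memory encoder described above, assuming the word vectors and entity features have already been fused into one sequence; all dimensions are illustrative assumptions.

```python
import torch
import torch.nn as nn

bilstm = nn.LSTM(input_size=64, hidden_size=128, batch_first=True, bidirectional=True)

fused = torch.randn(1, 20, 64)   # (batch, seq_len, dim): fused word/entity vectors
context, _ = bilstm(fused)       # (1, 20, 256): forward + backward context features per position
```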
As a further improvement of the present invention, based on an attention mechanism, the self-weight information and/or associated weight information of words and phrases in the text features of the voice text data, image text data, and real-time and historical text data are distinguished, to obtain the weight information of the text semantic features.
As a further improvement of the present invention, "distinguishing, based on the attention mechanism, the self-weight information and/or associated weight information of words and phrases in the text features of the voice text data, image text data, and real-time and historical text data" specifically includes: inputting the voice text context feature vector, the image text context feature vector, and the real-time and historical text entity feature vectors into a multi-head attention mechanism; obtaining a self-weight text attention feature vector containing the self-weight information of the voice text, image text, and real-time and historical text semantic features; and obtaining an associated-weight text attention feature vector containing the associated weight information of those semantic features.
As a further improvement of the present invention, "combining the context information and the weight information through a fully connected layer and outputting the result to a classifier to calculate a score and obtain classification result information" specifically includes: combining the context feature vector and the weighted text attention feature vector through a fully connected layer, outputting the result to a classification function, and calculating the text semantic scores of the voice text data, image text data, and real-time and historical text data together with their normalized score results, to obtain the text classification result information, as in the sketch below.
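A minimal sketch of this final step: concatenating a pooled context feature vector and a pooled attention feature vector, passing them through a fully connected layer, and scoring the classes with Softmax. The vector sizes and class count are illustrative assumptions.

```python
import torch
import torch.nn as nn

n_classes = 8
fc = nn.Linear(256 + 256, n_classes)   # fully connected combination layer

context_vec = torch.randn(1, 256)      # pooled context feature vector
attn_vec = torch.randn(1, 256)         # pooled weighted text attention feature vector
logits = fc(torch.cat([context_vec, attn_vec], dim=1))
scores = logits.softmax(dim=1)         # normalized score per class = classification result
print(scores.argmax(dim=1))            # index of the predicted text category
```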
As a further improvement of the present invention, "transcribing the voice data into voice text data and extracting text features of the voice text data" further includes: acquiring configuration data stored in an external cache, and executing the multi-channel multi-size deep convolutional neural network model calculation on the voice data based on the configuration data, to perform text transcription and extract text features.
The present invention also provides an electrical appliance, comprising: a memory for storing executable instructions; and a processor which, when running the executable instructions stored in the memory, implements the above text classification method based on a multimodal knowledge graph.
The present invention also provides a refrigerator, comprising: a memory for storing executable instructions; and a processor which, when running the executable instructions stored in the memory, implements the above text classification method based on a multimodal knowledge graph.
The present invention also provides a computer-readable storage medium storing executable instructions which, when executed by a processor, implement the above text classification method based on a multimodal knowledge graph.
The beneficial effects of the present invention are as follows. The method provided by the present invention accomplishes the task of recognizing and classifying the acquired text data. First, by introducing multimodal data such as real-time voice, real-time video, real-time text, and real-time and historical data on the user's ingredient preferences, interests and historical comments, it solves the problems that single-modal data carries only limited text semantic information and is insufficiently understood. Second, introducing a deep convolutional neural network model compensates for the insufficient feature representation capability of traditional machine learning methods; it can capture the relevance and complementarity of semantic feature information at a deeper level, strengthen the semantic features, and effectively improve text classification accuracy. Finally, adding the entity-linking representation of the multimodal knowledge graph improves the generalization capability of the text semantic feature information and enhances the user experience.
Brief Description of the Drawings
FIG. 1 is a structural block diagram of the model involved in a text classification method based on a multimodal knowledge graph in one embodiment of the present invention.
FIG. 2 is a schematic diagram of the steps of a text classification method based on a multimodal knowledge graph in one embodiment of the present invention.
FIG. 3 is a schematic diagram of the steps of acquiring real-time audio and video data and real-time and historical text data in one embodiment of the present invention.
FIG. 4 is a schematic diagram of the steps of preprocessing the real-time audio and video data and the real-time and historical text data in one embodiment of the present invention.
FIG. 5 is a schematic diagram of the steps of transcribing the real-time voice data into voice text data in one embodiment of the present invention.
FIG. 6 is a schematic diagram of the steps of transcribing the real-time video data into image text data in one embodiment of the present invention.
FIG. 7 is a schematic diagram of the steps of acquiring the context information and weight information of the text data according to the real-time voice text features, real-time video text features and entity features in one embodiment of the present invention.
FIG. 8 is a schematic diagram of the steps of acquiring the context information of the text data and the weight information of the text semantic features according to the real-time voice text features, real-time video text features and entity features in one embodiment of the present invention.
Detailed Description of the Embodiments
The present invention will be described in detail below with reference to the specific embodiments shown in the accompanying drawings. These embodiments do not limit the present invention, however, and any structural, methodological or functional changes made by a person of ordinary skill in the art on the basis of these embodiments fall within the scope of protection of the present invention.
It should be noted that the term "comprise", or any variant thereof, is intended to cover non-exclusive inclusion, so that a process, method, article or device that includes a series of elements includes not only those elements but also other elements not explicitly listed, or elements inherent to such a process, method, article or device. In addition, terms such as "first" and "second" are used for descriptive purposes only and shall not be understood as indicating or implying relative importance.
An embodiment of the present invention is a text classification method based on a multimodal knowledge graph. Although the present application presents the method's operation steps in the following embodiments or in the flowchart of FIG. 1, for steps with no logically necessary causal relationship the execution order is not limited, on a routine basis or without creative labor, to the order given in the embodiments of the present application.
FIG. 1 is a structural block diagram of the model involved in the text classification method based on a multimodal knowledge graph provided by the present invention, and FIG. 2 is a schematic diagram of the steps of that method, which include:
S1: Acquire real-time audio and video data, and acquire real-time and historical text data.
S2: Preprocess the real-time audio and video data to obtain real-time voice data and real-time video data.
S3: Transcribe the real-time voice data into voice text data, and extract text features of the voice text data.
S4: Transcribe the real-time video data into image text data, and extract text features of the image text data.
S5: Extract entity features of the real-time and historical text data.
S6: Acquire context information of the text data and weight information of text semantic features according to the real-time voice text features, real-time video text features and entity features.
S7: Combine the context information and weight information through a fully connected layer, then output the result to a classifier to calculate a score and obtain classification result information.
S8: Output the classification result information.
The method provided by the present invention enables a smart electronic device to implement functions such as real-time interaction with the user or message pushing, based on the user's real-time audio and video input. Illustratively, in this embodiment a smart refrigerator is taken as an example, and the method is described in combination with a pre-trained deep learning model. Based on the user's audio and video input, the smart refrigerator classifies the corresponding text content generated from the user's audio and video data, and computes the text content classification result information to be output based on the classification results.
As shown in FIG. 3, step S1 specifically includes:
S11: Acquire the real-time audio and video data collected by a collection device, and/or acquire the real-time audio and video data transmitted from a client terminal.
S12: Acquire the real-time text data collected by a collection device, and/or acquire the real-time text data transmitted from a client terminal.
S13: Acquire internally stored historical text data, and/or acquire externally stored historical text data, and/or acquire historical text data transmitted from a client terminal.
The real-time audio and video data described here include real-time voice data and real-time video data. The real-time voice refers to the interrogative or imperative statements the user currently speaks to the smart electronic device or to a client terminal device communicatively connected to it; likewise, it may be voice information from the user collected by a voice collection device. In this embodiment, the user may ask questions such as "What vegetables are in the refrigerator today?" or "What beef ingredients are in the refrigerator today?", or issue commands such as "Delete all ingredients". The real-time video data are real-time video images captured by the smart electronic device or by a client terminal device communicatively connected to it. In this embodiment, a camera built into the smart refrigerator captures the user's facial image, and the lip-region feature image is extracted from the facial image to recognize the text content corresponding to the image, for example recognizing the image text data "What vegetables are in the refrigerator today?".
The real-time text data described here are text data collected by a text collection device, while the historical text data refer to the user's real-time text data from previous use; further, they may also include historical text data entered by the user. Specifically, in this embodiment, the real-time and historical text data include the user's preferences for and interest in ingredients as well as comment data posted by the user, such as "I used to like Kung Pao Chicken", which captures how the user's ingredient preferences relate to the current real-time text data. The acquired real-time and historical text data can serve as part of the data set for pre-training and prediction models, effectively supplementing the single voice representation of the real-time audio and video data and enriching the semantic features.
As described in steps S11 and S12, in this embodiment the user's real-time audio and video can be collected by audio and video collection devices such as a camera installed in the smart refrigerator; during use, when the user needs to interact with the smart refrigerator, speaking to it directly is sufficient. The user's real-time audio and video data can also be obtained from a client terminal connected to the smart refrigerator over a wireless communication protocol. The client terminal is an electronic device with an information-sending function, such as a mobile phone, tablet computer, smart camera or smart watch, or an APP or Bluetooth-enabled smart electronic device; during use, the user speaks to the client terminal or shoots directly with the camera built into the refrigerator, and after the client terminal collects the audio and video, they are transmitted to the smart refrigerator over a wireless communication method such as WiFi or Bluetooth. This realizes a multi-channel real-time audio and video acquisition scheme that is not limited to speaking directly to the smart refrigerator. When the user needs to interact, real-time text data are collected directly through a text collection device or the text input device of the client terminal. In other embodiments of the present invention, one or any combination of the above acquisition methods for real-time audio and video data or real-time text data may be used, or the data may be obtained through other channels based on the prior art; the present invention imposes no specific restriction on this.
As described in step S13, in this embodiment the historical text data stored in the internal memory of the smart refrigerator can be read. The historical text data stored in an external storage device configured for the smart refrigerator can also be read; the external storage device is a removable storage device such as a USB flash drive or SD card, and configuring one can further expand the storage space of the smart refrigerator. The historical text data stored on a client terminal such as a mobile phone or tablet computer, or on an application server, can also be obtained. Implementing multi-channel acquisition of historical text data can greatly increase the amount of historical text, thereby improving the accuracy of subsequent speech recognition and video image recognition. In other embodiments of the present invention, one or any combination of the above acquisition methods for historical text data may be used, or the historical text data may be obtained through other channels based on the prior art; the present invention imposes no specific restriction on this.
Further, in this embodiment the smart refrigerator is provided with an external cache, and at least part of the historical text data is stored in it. As usage time increases, the historical text data grow; storing part of the data in the external cache saves internal storage space in the smart refrigerator, and when performing neural network calculations, the historical text data stored in the external cache can be read directly, which improves the efficiency of the algorithm.
Specifically, in this embodiment a Redis component is used as the external cache. Redis is a widely used distributed cache system with a key/value storage structure that can serve as a database, cache and message queue broker. In other embodiments of the present invention, other external caches such as Memcached may also be used; the present invention imposes no specific restriction on this.
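By way of illustration, a minimal sketch of caching historical text data in Redis with the redis-py client; the connection parameters and key naming are assumptions for illustration, not details from the patent.

```python
# Minimal sketch: caching historical text data in Redis via redis-py.
import json

import redis

cache = redis.Redis(host="localhost", port=6379, decode_responses=True)

def save_history(user_id: str, texts: list[str]) -> None:
    # Store the user's historical text data as JSON under a per-user key (naming is an assumption).
    cache.set(f"history:{user_id}", json.dumps(texts, ensure_ascii=False))

def load_history(user_id: str) -> list[str]:
    # Read the cached history back before running the neural network,
    # avoiding a slower read from the refrigerator's internal storage.
    raw = cache.get(f"history:{user_id}")
    return json.loads(raw) if raw else []

save_history("user_001", ["我以前喜欢宫保鸡丁", "今天冰箱里有啥蔬菜"])
print(load_history("user_001"))
```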
In summary, in steps S11 to S13, real-time audio and video data and real-time and historical text data can be acquired flexibly through multiple channels, which improves the user experience while guaranteeing the data volume and effectively improving algorithm efficiency.
As shown in FIG. 4, step S2 specifically includes the steps of:
S21: Perform data cleaning on the real-time audio and video data to obtain valid audio and video data.
S22: Separate the valid audio and video data into voice and video to obtain real-time voice data and video data.
S23: Preprocess the real-time voice data and video data, including framing and windowing the real-time voice data, and cropping and framing the real-time video data.
S24: Preprocess the real-time and historical comment text data, including word segmentation, removal of stop words and removal of duplicate words.
In step S21, data cleaning of the real-time audio and video data specifically includes:
A certain number of real-time audio and video data sets are obtained; illustratively, they can be imported into the data cleaning model in the form of files for processing. To prevent import failures, data that do not meet the file import format undergo format parsing and format conversion; irrelevant data and duplicate data in the data set are then deleted, and outliers and missing values are handled, preliminarily screening out information irrelevant to classification. The audio and video data are thus cleaned, and the cleaned data are output and saved in a specified format, yielding valid audio and video data.
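A minimal sketch of such a cleaning pass, assuming the audio and video data set has been imported as a table with pandas; the file and column names are illustrative assumptions.

```python
import pandas as pd

df = pd.read_csv("av_dataset.csv")              # import the data set from a file
df = df.drop_duplicates()                       # delete duplicate records
df = df.dropna(subset=["file_path"])            # drop rows with missing values in key fields
df = df[df["duration_sec"] > 0]                 # filter out abnormal values
df = df[df["label"] != "irrelevant"]            # screen out information irrelevant to classification
df.to_csv("av_dataset_clean.csv", index=False)  # save cleaned data in a specified format
```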
In step S22, a script or a third-party audio and video separation tool is used to separate the valid audio and video data into voice and video, thereby obtaining real-time voice data and real-time video data.
In an embodiment of the present invention, an audio and video separation script can be written in the Python language, or a third-party audio and video separation tool can be used, to perform the separation operation on the input audio and video data, achieving voice/video separation and obtaining the separated real-time voice and video data.
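A minimal sketch of such a separation using the third-party moviepy library (one possible tool among many; the import path assumes moviepy 1.x); the file names are illustrative assumptions.

```python
from moviepy.editor import VideoFileClip

clip = VideoFileClip("realtime_input.mp4")
clip.audio.write_audiofile("realtime_voice.wav")             # extract the voice track
clip.without_audio().write_videofile("realtime_video.mp4")   # keep the silent video track
```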
In step S23, the separated voice is segmented according to a specified time period or number of samples, completing the framing of the voice to obtain voice signal data; then, through the effect of a window function, the originally noisy voice signal exhibits signal enhancement and periodicity, completing the windowing and facilitating better extraction of the voice feature parameters later. Illustratively, step S23 also includes cropping the valid video data to produce multiple frames of pictures. Specifically, a script can first load the video data and read the video information, then decode the video according to that information to determine how many pictures the video shows per second, thereby obtaining single-frame image information, which includes the width and height of each frame; finally the video is saved as multiple pictures. Thus, after the processing of step S23, valid real-time voice data and image data are obtained. In other embodiments of the present invention, other video framing methods such as third-party video cropping tools may also be used; the present invention imposes no specific restriction on this.
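A minimal sketch of both preprocessing operations: framing and windowing the speech signal with NumPy, and saving the video as single-frame pictures with OpenCV. The frame length, hop size and paths are illustrative assumptions.

```python
import cv2
import numpy as np

def frame_and_window(signal: np.ndarray, frame_len: int = 400, hop: int = 160) -> np.ndarray:
    # Split the 1-D speech signal into overlapping frames, then apply a
    # Hamming window to each frame to reduce spectral leakage.
    n_frames = 1 + (len(signal) - frame_len) // hop
    frames = np.stack([signal[i * hop: i * hop + frame_len] for i in range(n_frames)])
    return frames * np.hamming(frame_len)

def video_to_frames(path: str, out_prefix: str) -> None:
    cap = cv2.VideoCapture(path)
    fps = cap.get(cv2.CAP_PROP_FPS)   # how many pictures the video shows per second
    idx = 0
    while True:
        ok, frame = cap.read()        # each frame's height/width are available via frame.shape
        if not ok:
            break
        cv2.imwrite(f"{out_prefix}_{idx:05d}.png", frame)
        idx += 1
    cap.release()
```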
In step S24, text preprocessing is performed on the collected real-time and historical text data, for example deleting irrelevant data and duplicate data and handling outliers and missing values, preliminarily screening out information irrelevant to classification. Next, the real-time and historical text data are annotated with category labels based on rule-based statistical methods, and the text data are segmented using word segmentation methods based on string matching, understanding, statistics and rules. After that, stop words and duplicate words are removed so that the text data meet the input requirements of the neural network model.
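A minimal sketch of the segmentation, stop-word removal and deduplication steps using the jieba segmenter (a string-matching based tool); the stop-word list here is an illustrative assumption, and a full list would be loaded from file.

```python
import jieba

STOP_WORDS = {"的", "了", "呢", "啥"}  # illustrative stop words only

def preprocess(text: str) -> list[str]:
    tokens = jieba.lcut(text)                      # dictionary / string-matching segmentation
    tokens = [t for t in tokens if t not in STOP_WORDS and not t.isspace()]
    seen, deduped = set(), []
    for t in tokens:                               # remove duplicate words, keep first occurrence
        if t not in seen:
            seen.add(t)
            deduped.append(t)
    return deduped

print(preprocess("今天冰箱里有啥蔬菜"))
```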
As shown in FIG. 5, step S3 specifically includes:
S31: extracting features of the valid speech data to obtain speech features.
S32: inputting the speech features into a multi-channel multi-size deep convolutional neural network model for speech recognition, which transcribes them into first speech text data.
S33: outputting the alignment relationship between the speech features and the first speech text data based on the connectionist temporal classification method, to obtain second speech text data.
S34: based on an attention mechanism, obtaining key features of the second speech text data or weight information of those key features.
S35: combining the second speech text data and its key features or their weight information through a fully connected layer, and then calculating scores through a classification function to obtain the speech text data.
In step S31, extracting the features of the valid speech data specifically includes:
extracting features of the speech data to obtain its Mel-scale Frequency Cepstral Coefficients (MFCC). MFCC are discriminative components of a speech signal, namely cepstral parameters extracted in the Mel-scale frequency domain. The Mel scale describes the nonlinear frequency response of the human ear, and the MFCC parameters account for the ear's varying sensitivity to different frequencies, which makes them particularly suitable for speech recognition and speaker identification.
In an embodiment of the present invention, feature parameters such as Perceptual Linear Predictive (PLP) features or Linear Predictive Coding (LPC) features of the speech data may instead be obtained through different algorithmic steps to replace the MFCC features. The choice may be adjusted according to the actual application scenario and the model parameters used; the present invention imposes no specific limitation on this.
For the specific algorithmic steps involved above, reference may be made to the existing technology in this field; they are not described in detail here.
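For orientation only, MFCC extraction is available off the shelf; the sketch below assumes librosa, with 13 coefficients and the same 25 ms/10 ms framing used earlier.

import librosa

y, sr = librosa.load("sample.wav", sr=16000)
mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=13, n_fft=400, hop_length=160)
print(mfcc.shape)                          # (13, n_frames)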
In step S32, the valid speech data are transcribed into text content through a network model of automatic speech recognition technology, obtaining the first speech text data.
In this embodiment, the speech-to-text task is accomplished by constructing a multi-channel multi-size deep convolutional neural network model. The deep network is built from multiple layers of deep convolution; a deep convolutional neural network generally consists of several convolutional layers followed by several fully connected layers, with various nonlinear and pooling operations in between, and is mainly used to process grid-structured data, so the model can use filters to pick out contours between adjacent pixels. Moreover, this model first extracts speech feature values and computes on those features rather than on the raw speech data. Compared with a traditional recurrent neural network, a deep convolutional neural network therefore has the advantages of lower computational cost and easier characterization of local features; shared weights and pooling layers give the model better invariance in the time or frequency domain, and the deeper nonlinear structure gives it strong representational capacity. In addition, multiple channels and multiple kernel sizes extract speech features from different perspectives, capturing more speech feature information and yielding better speech recognition accuracy.
Specifically, in this embodiment, the multi-channel multi-size deep convolutional neural network used in step S32 is composed of 3*3 convolutional layers, 32 channels, and one max-pooling layer.
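A minimal PyTorch sketch of one such network: each branch uses 32 channels and one max-pooling layer as stated, while the 3*3, 5*5, and 7*7 kernel mix and the output size are assumptions added only to show the multi-size idea.

import torch
import torch.nn as nn

class SpeechCNN(nn.Module):
    def __init__(self, n_classes=100):
        super().__init__()
        self.branches = nn.ModuleList([
            nn.Sequential(nn.Conv2d(1, 32, k, padding=k // 2),
                          nn.ReLU(),
                          nn.MaxPool2d(2))
            for k in (3, 5, 7)             # one kernel size per branch
        ])
        self.fc = nn.Linear(32 * 3, n_classes)

    def forward(self, x):                  # x: (batch, 1, n_mfcc, n_frames)
        feats = [b(x).mean(dim=(2, 3)) for b in self.branches]  # global pooling
        return self.fc(torch.cat(feats, dim=1))

out = SpeechCNN()(torch.randn(2, 1, 13, 100))
print(out.shape)                           # (2, 100)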
In step S33, the connectionist temporal classification (CTC) method is used to obtain the alignment relationship between the input speech feature sequence and the output speech text feature sequence.
In this embodiment, it is difficult to construct a precise mapping between the valid speech data and the characters of the first speech text data, which increases the difficulty of subsequent speech recognition. To solve this problem, the temporal classification method is adopted. This method is generally applied after the convolutional network model and provides fully end-to-end acoustic model training: it requires no prior alignment of the data, only an input sequence and an output sequence, with no alignment or one-to-one labeling, and it can directly output the probability of the predicted sequence. From this predicted probability, the most likely text output can be obtained as the second speech text data.
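The end-to-end training described here can be expressed with the standard CTC loss; the sketch below assumes PyTorch's nn.CTCLoss with blank index 0 and arbitrary toy dimensions.

import torch
import torch.nn as nn

T, B, C = 50, 2, 28                        # time steps, batch, token classes
log_probs = torch.randn(T, B, C, requires_grad=True).log_softmax(dim=-1)
targets = torch.randint(1, C, (B, 10))     # unaligned label sequences
input_lengths = torch.full((B,), T, dtype=torch.long)
target_lengths = torch.full((B,), 10, dtype=torch.long)

loss = nn.CTCLoss(blank=0)(log_probs, targets, input_lengths, target_lengths)
loss.backward()                            # trains with no pre-alignment

# Greedy decoding: take the best token per step, then collapse repeats
# and strip blanks to recover the most likely text output.
best = log_probs.argmax(dim=-1).T          # (batch, T)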
Further, in step S34, the attention mechanism can guide the deep convolutional neural network to focus on the more critical feature information while suppressing non-critical feature information. By introducing the attention mechanism, the local key features or weight information of the second speech text data can therefore be obtained, further reducing irregular misalignment of sequences during model training.
Here, in step S35, according to the second speech text data and its key features or their weight information, a model fusing a self-attention mechanism with a fully connected layer assigns the second speech text data its own weight information, so as to better obtain the internal weight information of the text semantic features of the speech text data and to strengthen the importance of different parts of that semantic information; finally, a classification function, such as the Softmax function, calculates scores to obtain the speech text data.
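A sketch of this self-attention-plus-fully-connected fusion with a Softmax score, under assumed illustrative dimensions; it is not the patented model itself.

import torch
import torch.nn as nn

class AttnScorer(nn.Module):
    def __init__(self, dim=256, n_classes=100):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, num_heads=4, batch_first=True)
        self.fc = nn.Linear(dim, n_classes)

    def forward(self, x):                  # x: (batch, seq, dim)
        weighted, _ = self.attn(x, x, x)   # positions re-weighted by themselves
        fused = weighted.mean(dim=1)       # combine over the sequence
        return torch.softmax(self.fc(fused), dim=-1)  # score via Softmax

scores = AttnScorer()(torch.randn(2, 20, 256))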
As shown in FIG. 6, step S4 specifically includes:
S41: inputting the real-time video data into a 3D deep convolutional neural network for computation, to obtain image features.
S42: inputting the image features into a multi-channel multi-size temporal convolutional network for transcription, to obtain first image text data.
S43: outputting the alignment relationship between the image features and the first image text data based on the connectionist temporal classification method, to obtain second image text data.
S44: combining the second image text data through a fully connected layer and then calculating scores through a classification function to obtain the image text data.
In steps S41 and S42, considering that sentences recognized from image text are relatively complex (sentence lengths differ, pause positions and word composition vary, and the image features are correlated), the valid video data may be subjected to video processing operations such as cropping and framing to obtain video images of the facial region, which are then cropped and segmented into multiple consecutive facial picture frames. In this embodiment, these consecutive facial frames are input into a 3D convolutional neural network model; by adding information along the time dimension, more expressive features can be extracted. The 3D convolutional neural network model handles the correlation among multiple pictures: it takes consecutive frames as input and, through the added dimension, captures the motion information in the input frames, thereby better obtaining the image features.
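The added time dimension is what distinguishes the 3D convolution; a minimal PyTorch sketch with an assumed 16-frame facial clip follows.

import torch
import torch.nn as nn

video_clip = torch.randn(2, 3, 16, 112, 112)   # (batch, rgb, frames, H, W)
conv3d = nn.Sequential(
    nn.Conv3d(3, 32, kernel_size=(3, 3, 3), padding=1),  # convolves across time too
    nn.ReLU(),
    nn.MaxPool3d(kernel_size=(1, 2, 2)),       # pool space, keep time resolution
)
features = conv3d(video_clip)
print(features.shape)                          # (2, 32, 16, 56, 56)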
In steps S43 and S44, as with the speech data processing described above, the connectionist temporal classification method is likewise used to establish the mapping between the valid video data and the characters of the first image text data, obtaining the second image text data. A model fusing a self-attention mechanism with a fully connected layer then assigns the second image text data its own weight information and/or associated weight information, so as to better obtain the internal and/or associated weight information of the text semantic features of the image text data and to strengthen the importance of different parts of that semantic information; finally, scores are calculated through a classification function to obtain the image text data. The specific processing is the same as the speech data processing steps above and is not repeated here.
As shown in FIG. 7, step S5 specifically includes:
S51: performing entity extraction on the text data using an entity linking method, to obtain multiple ingredient entities.
S52: querying the ingredient knowledge graph based on each ingredient entity, to obtain the corresponding entity vector representation.
S53: inputting the entity vector representations into a multi-head attention mechanism for computation, to obtain the entity feature vectors.
In steps S51 and S52, entity linking is the process of unambiguously and correctly pointing entity objects recognized in text (such as person names and place names) to the target entities in a knowledge base. That is, the knowledge base is searched for the target item that best matches the entity object; entity linking thus assigns a unique identifier to each entity mentioned in the text and generally serves as a task downstream of entity extraction and recognition. In this embodiment, all entities related to ingredients are first extracted from the real-time and historical texts and mapped to candidate entity items; for each entity mention, the set of possibly corresponding candidate entities is found in the given knowledge graph, and irrelevant entities in the graph are filtered out to generate the candidates. The extracted entities then undergo disambiguation and entity alignment: the multiple candidates in each mention's candidate set are scored and ranked, and the highest-scoring candidate is output as the entity linking result.
The entity information contains semantic feature information such as the user's preferences for and liking of ingredients, topics of interest to the user, and comment data about ingredients; these real-time and historical text data enrich the semantic content of the text. In addition, the entities found in the knowledge graph are converted, in triple form, into entity vector representations that take into account both the entity relation structure information and the entity description information in the knowledge graph, yielding a corresponding vector representation for each. Specifically, the distributed vector representation method of neural networks is usually adopted, which converts a word into a distributed representation, i.e., a fixed-length continuous vector, and can thereby reflect the contribution of different word segments to the result.
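One common way to realize such a triple-based distributed representation is a TransE-style embedding, sketched below; the entity and relation ids are hypothetical, and the scoring rule is an assumption standing in for whichever representation method is actually used.

import torch

n_entities, n_relations, dim = 1000, 20, 100
ent = torch.nn.Embedding(n_entities, dim)
rel = torch.nn.Embedding(n_relations, dim)

# Hypothetical triple, e.g. (apple, rich_in, vitamin_C).
h, r, t = torch.tensor([3]), torch.tensor([1]), torch.tensor([7])
score = -(ent(h) + rel(r) - ent(t)).norm(p=2, dim=-1)  # higher = more plausible
entity_vector = ent(h)        # fixed-length continuous entity representation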
In step S53, the attention mechanism directly computes an attention weight for each position of the text data during encoding and then computes an implicit vector representation of the whole text as a weighted sum. Often we want the attention model to learn different behaviors from the same attention mechanism and then combine those behaviors as knowledge; to this end, multiple different kinds of text data can be learned independently and concatenated after attention pooling to produce the final feature vector. In this embodiment, the entity vector representations are input into a multi-head attention mechanism to obtain entity feature vectors that carry the user's ingredient preferences and interests and the relevance of ingredient comment data, which to a certain extent extends the complementarity of semantic features across different types of data and their multi-faceted, multi-angle semantic associations.
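A sketch of the multi-head step, assuming PyTorch's nn.MultiheadAttention and a batch of eight entity vectors per text; the mean pooling at the end is an illustrative choice.

import torch
import torch.nn as nn

entity_vectors = torch.randn(2, 8, 100)        # (batch, n_entities, dim)
mha = nn.MultiheadAttention(embed_dim=100, num_heads=4, batch_first=True)
attended, weights = mha(entity_vectors, entity_vectors, entity_vectors)
entity_feature = attended.mean(dim=1)          # pooled entity feature vector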
As shown in FIG. 8, step S6 specifically includes:
S61: converting the speech text data and the image text data into speech text word vectors and image text word vectors.
S62: inputting the speech text word vectors, image text word vectors, and entity feature vectors into a bidirectional long short-term memory network model, to obtain context feature vectors containing the speech text features, image text features, and real-time and historical text feature information.
S63: based on an attention mechanism, distinguishing the self-weight information and/or associated weight information of words and phrases in the text features of the speech text data, image text data, and real-time and historical text data, to obtain the weight information of the text semantic features.
In step S61, in order to convert the text data into a vectorized form that a computer can recognize and process, the speech text data and image text data may be converted into the speech text word vectors and image text word vectors through the Word2Vec algorithm, or the word vectors may be obtained through other existing algorithms in this field such as the GloVe algorithm; the present invention imposes no specific limitation on this.
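A minimal gensim sketch of the Word2Vec conversion; the two toy sentences and all parameters are illustrative.

from gensim.models import Word2Vec

sentences = [["apple", "is", "sweet"], ["keep", "milk", "chilled"]]
model = Word2Vec(sentences, vector_size=100, window=5, min_count=1, sg=1)
vec = model.wv["apple"]       # fixed-length continuous word vector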
In step S62, the bidirectional long short-term memory network (BiLSTM) is composed of a forward long short-term memory network (LSTM) and a backward LSTM. The LSTM model captures long-distance semantic dependencies in text well, and building on it, the BiLSTM model better captures bidirectional text semantics. The speech text word vectors, image text word vectors, and entity feature vectors are input into the BiLSTM model and processed by the forward and backward LSTMs; each direction produces its result vector only after all time steps have been computed, and the two result vectors are then concatenated to output the context feature vector carrying contextual information.
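The concatenation of forward and backward outputs can be seen directly in the output width of a PyTorch BiLSTM; the dimensions below are illustrative.

import torch
import torch.nn as nn

word_vectors = torch.randn(2, 30, 100)         # (batch, seq_len, dim)
bilstm = nn.LSTM(input_size=100, hidden_size=128,
                 batch_first=True, bidirectional=True)
context, _ = bilstm(word_vectors)
print(context.shape)          # (2, 30, 256): 128 forward + 128 backward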
In embodiments of the present invention, neural network models of other structures may also be constructed to transcribe the speech data and video data into the speech text data and video text data; the specific method is not limited.
In step S63, in order to distinguish the self-weight information of different words or phrases in the speech text data, image text data, and real-time and historical text data, as well as the associated weight information between different text data, the speech text context feature vectors, image text context feature vectors, and real-time and historical text context entity feature vectors are respectively input into a multi-head attention mechanism. This yields self-weight feature vectors containing the self-weight information of the speech text, image text, and real-time and historical text semantic features, as well as associated-weight feature vectors containing their semantic association weight information. The contextual information of these texts is thereby fully exploited, compensating for the insufficiency of any single feature in the speech and video data, enriching the semantic representation capacity of the text data, and improving subsequent text classification.
Step S7 specifically includes:
combining the context feature vectors and the weighted text attention feature vectors (including the self-weight text attention feature vectors and the associated-weight text attention feature vectors) through a fully connected layer, then outputting them to a classification function to calculate the text semantic scores of the speech text data and the image text data and their normalized score results, obtaining the classification result information.
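A sketch of this final fusion and scoring step under assumed 256-dimensional features and ten hypothetical classes:

import torch
import torch.nn as nn

context_vec = torch.randn(2, 256)              # context feature vector
self_weight_vec = torch.randn(2, 256)          # self-weight attention vector
assoc_weight_vec = torch.randn(2, 256)         # associated-weight attention vector

fused = torch.cat([context_vec, self_weight_vec, assoc_weight_vec], dim=1)
logits = nn.Linear(3 * 256, 10)(fused)         # fully connected combination
probs = torch.softmax(logits, dim=-1)          # normalized score results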
In summary, by performing the above steps in sequence, the text classification method based on a multi-modal knowledge graph provided by the present invention is obtained. The real-time audio and video data and the real-time and historical text data are acquired and cleaned, and speech and video are separated to produce valid speech data and video data respectively, all of which serve as part of the data set for the pre-training and prediction models, so that text semantic features are captured more comprehensively. In addition, by constructing a multi-channel multi-size deep convolutional network model fusing the connectionist temporal classification method with an attention mechanism, together with a sentence-level video image recognition method based on a temporal deep convolutional neural network model, richer high-level semantic feature information is mined and obtained. Finally, by constructing a context information mechanism and multi-head attention mechanism that fuse speech text data, video text data, and real-time and historical text data, the semantic representation capacity is exploited more fully, the insufficiency of single features in speech and video data is compensated, and the accuracy of text classification is improved. Furthermore, performing computation with configuration data obtained from external storage improves the computational efficiency of the model. The overall model structure has strong semantic representation capacity for text data and, in terms of semantic features, exhibits good complementarity and association, improving the accuracy of text classification.
Step S8 specifically includes:
converting the classification result information into speech for output, and/or converting it into speech and transmitting it to a client terminal for output, and/or converting it into text for output, and/or converting it into text and transmitting it to a client terminal for output, and/or converting it into an image for output, and/or converting it into an image and transmitting it to a client terminal for output.
As described in step S8, in this embodiment, after the classification result information is obtained through the above steps, it may be converted into speech and broadcast through the sound playback device built into the smart refrigerator; or converted into text and displayed directly on the display device of the smart refrigerator; or converted into an image and displayed directly on the refrigerator's large screen. The result information may also be transmitted by voice communication to a client terminal for output, where the client terminal is an electronic device with an information receiving function: for example, the speech may be sent to a mobile phone, smart speaker, or Bluetooth headset for broadcast, or the classification result information may be transmitted as text or images via SMS, e-mail, or similar channels to a client terminal such as a mobile phone or tablet computer, or to application software installed on it, for the user to consult. Multi-channel, multi-type output of classification results is thereby achieved: the user is not limited to obtaining the information near the smart refrigerator and, combined with the multi-channel real-time speech acquisition provided by the present invention, can interact with the smart refrigerator directly and remotely, which is highly convenient and greatly improves the user experience. In other embodiments of the present invention, only one or several of the above output modes may be used, or the classification result information may be output through other channels based on the existing technology; the present invention imposes no specific limitation on this.
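As one hedged example of the speech output path only, the pyttsx3 library can voice a result string on a local audio device; transmitting text or images to a client terminal would instead go through messaging or push APIs, which are not shown, and the result text is hypothetical.

import pyttsx3

engine = pyttsx3.init()
engine.say("Classification result: fresh fruit, subcategory apple")  # hypothetical result text
engine.runAndWait()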
In summary, the present invention provides a text classification method based on a multi-modal knowledge graph. It obtains real-time audio and video data and real-time and historical text data through multiple channels, extracts the corresponding features and fuses them, and makes full use of a neural network model combining deep convolution and recurrence with a multi-head attention mechanism to extract semantic text features and generate text classification results, which are then output through multiple channels. The method not only significantly improves the accuracy of the generated text classification but also makes the interaction between the user and the smart refrigerator more convenient and diversified, greatly improving the user experience.
Based on the same inventive concept, the present invention further provides an electrical device, comprising:
a memory for storing executable instructions;
a processor configured, when running the executable instructions stored in the memory, to implement the above text classification method based on a multi-modal knowledge graph.
Based on the same inventive concept, the present invention further provides a refrigerator, comprising:
a memory for storing executable instructions;
a processor configured, when running the executable instructions stored in the memory, to implement the above text classification method based on a multi-modal knowledge graph.
Based on the same inventive concept, the present invention further provides a computer-readable storage medium storing executable instructions which, when executed by a processor, implement the above text classification method based on a multi-modal knowledge graph.
It should be understood that although this specification is described in terms of embodiments, not every embodiment contains only one independent technical solution; this manner of description is adopted merely for clarity. Those skilled in the art should take the specification as a whole, and the technical solutions in the embodiments may also be suitably combined to form other embodiments understandable to those skilled in the art.
The series of detailed descriptions listed above are merely specific descriptions of feasible embodiments of the present invention and are not intended to limit its scope of protection; all equivalent embodiments or modifications that do not depart from the technical spirit of the present invention shall fall within its scope of protection.

Claims (15)

  1. A text classification method based on a multi-modal knowledge graph, characterized by comprising the steps of:
    obtaining real-time audio and video data, and obtaining real-time and historical text data;
    preprocessing the real-time audio and video data to obtain real-time speech data and real-time video data;
    transcribing the real-time speech data into speech text data, and extracting text features of the speech text data;
    transcribing the real-time video data into image text data, and extracting text features of the image text data;
    extracting entity features of the real-time and historical text data;
    obtaining, according to the real-time speech data text features, real-time video data text features, and entity features, context information of the text data and weight information of the text semantic features;
    combining the context information and the weight information through a fully connected layer and outputting them to a classifier to calculate scores and obtain classification result information;
    outputting the classification result information.
  2. The text classification method based on a multi-modal knowledge graph according to claim 1, characterized in that said "preprocessing the real-time audio and video data to obtain real-time speech data and video data" specifically comprises:
    performing data cleaning, format parsing, format conversion, and data storage on the real-time audio and video data to obtain valid audio and video data;
    separating the valid audio and video data into speech and video using a script or third-party tool, to obtain the real-time speech data and real-time video data;
    preprocessing the real-time speech data and video data, including: framing and windowing the real-time speech data, and cropping and framing the real-time video data;
    preprocessing the real-time and historical text data, including: word segmentation, stop-word removal, and duplicate-word removal.
  3. The text classification method based on a multi-modal knowledge graph according to claim 1, characterized in that said "transcribing the real-time speech data into speech text data" specifically comprises:
    extracting features of the real-time speech data to obtain speech features;
    inputting the speech features into a multi-channel multi-size deep convolutional neural network model for speech recognition, which transcribes them into first speech text data;
    outputting the alignment relationship between the speech features and the first speech text data based on the connectionist temporal classification method, to obtain second speech text data;
    based on an attention mechanism, obtaining key features of the second speech text data or weight information of the key features;
    combining the second speech text data and its key features or their weight information through a fully connected layer, and then calculating scores through a classification function to obtain the speech text data.
  4. The text classification method based on a multi-modal knowledge graph according to claim 3, characterized in that said "extracting features of the real-time speech data" specifically comprises:
    extracting features of the real-time speech data to obtain its Mel-frequency cepstral coefficient features.
  5. The text classification method based on a multi-modal knowledge graph according to claim 1, characterized in that said "transcribing the real-time video data into image text data" specifically comprises:
    inputting the real-time video data into a 3D deep convolutional neural network for computation, to obtain image features;
    inputting the image features into a multi-channel multi-size temporal convolutional network for transcription, to obtain first image text data;
    outputting the alignment relationship between the image features and the first image text data based on the connectionist temporal classification method, to obtain second image text data;
    combining the second image text data through a fully connected layer and then calculating scores through a classification function to obtain the image text data.
  6. The text classification method based on a multi-modal knowledge graph according to claim 1, characterized in that said "extracting entity features of the real-time and historical text data" specifically comprises:
    performing entity extraction on the text data using an entity linking method, to obtain multiple ingredient entities;
    querying the ingredient knowledge graph based on each ingredient entity, to obtain the corresponding entity vector representation;
    inputting the entity vector representations into a multi-head attention mechanism for computation, to obtain entity feature vectors.
  7. The text classification method based on a multi-modal knowledge graph according to claim 6, characterized in that said "querying the ingredient knowledge graph based on each ingredient entity to obtain the corresponding entity vector representation" specifically comprises:
    converting the entities into corresponding entity vector representations in the form of entity triples;
    implementing the entity vector representations using the distributed vector representation method of neural networks.
  8. The text classification method based on a multi-modal knowledge graph according to claim 1, characterized in that said "obtaining, according to the real-time speech data text features, real-time video data text features, and entity features, context information of the text data and weight information of the text semantic features" specifically comprises:
    converting the real-time speech text features and real-time video text features into speech text word vectors and image text word vectors;
    inputting the speech text word vectors, image text word vectors, and entity features into a bidirectional long short-term memory network model, to obtain context feature vectors containing the speech text features, image text features, and real-time and historical text feature information.
  9. The text classification method based on a multi-modal knowledge graph according to claim 8, characterized in that the method further comprises:
    based on an attention mechanism, distinguishing the self-weight information and/or associated weight information of words and phrases in the text features of the speech text data, image text data, and real-time and historical text data, to obtain the weight information of the text semantic features.
  10. The text classification method based on a multi-modal knowledge graph according to claim 9, characterized in that said "based on an attention mechanism, distinguishing the self-weight information and/or associated weight information of words and phrases in the text features of the speech text data, image text data, and real-time and historical text data" specifically comprises:
    inputting the speech text context feature vectors, image text context feature vectors, and real-time and historical text entity feature vectors into a multi-head attention mechanism respectively;
    obtaining self-weight text attention feature vectors containing the self-weight information of the speech text semantic features, image text semantic features, and real-time and historical text semantic features;
    obtaining associated-weight text attention feature vectors containing the associated weight information of the speech text semantic features, image text semantic features, and real-time and historical text semantic features.
  11. The text classification method based on a multi-modal knowledge graph according to claim 10, characterized in that said "combining the context information and the weight information through a fully connected layer and outputting them to a classifier to calculate scores and obtain classification result information" specifically comprises:
    combining the context feature vectors and the weighted text attention feature vectors through a fully connected layer, then outputting them to a classification function to calculate the text semantic scores of the speech text data, image text data, and real-time and historical text data and their normalized score results, obtaining the classification result information of the text.
  12. The text classification method based on a multi-modal knowledge graph according to claim 1, characterized in that said "transcribing the speech data into speech text data, and extracting text features of the speech text data" further comprises:
    obtaining configuration data stored in an external cache, and performing the multi-channel multi-size deep convolutional neural network model computation on the speech data based on the configuration data, to carry out text transcription and extract text features.
  13. An electrical device, characterized by comprising:
    a memory for storing executable instructions;
    a processor configured, when running the executable instructions stored in the memory, to implement the text classification method based on a multi-modal knowledge graph according to any one of claims 1 to 12.
  14. A refrigerator, characterized by comprising:
    a memory for storing executable instructions;
    a processor configured, when running the executable instructions stored in the memory, to implement the text classification method based on a multi-modal knowledge graph according to any one of claims 1 to 12.
  15. A computer-readable storage medium storing executable instructions, characterized in that the executable instructions, when executed by a processor, implement the text classification method based on a multi-modal knowledge graph according to any one of claims 1 to 12.
PCT/CN2023/140835 2022-12-31 2023-12-22 Text classification method based on multi-modal knowledge graph, and device and storage medium WO2024140434A1 (en)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN202211736562.3 2022-12-31
CN202211736562.3A CN116186258A (en) 2022-12-31 2022-12-31 Text classification method, device and storage medium based on multimodal knowledge graph

Publications (1)

Publication Number Publication Date
WO2024140434A1 true WO2024140434A1 (en) 2024-07-04

Family

ID=86441578

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2023/140835 WO2024140434A1 (en) 2022-12-31 2023-12-22 Text classification method based on multi-modal knowledge graph, and device and storage medium

Country Status (2)

Country Link
CN (1) CN116186258A (en)
WO (1) WO2024140434A1 (en)

Cited By (12)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN118710993A (en) * 2024-08-27 2024-09-27 南京航空航天大学 A tree species classification method and system based on Sentinel-2 data and CKAN-LSTM-MHA network for spatiotemporal and spectral integration
CN118734947A (en) * 2024-09-04 2024-10-01 湖北大学 Knowledge graph completion method and device based on attention penalty and noise sampling
CN118865977A (en) * 2024-09-26 2024-10-29 深圳市江智工业技术有限公司 Multimodal human-computer collaborative interaction system and method
CN118916499A (en) * 2024-10-09 2024-11-08 新瑞数城技术有限公司 Query method integrating AI large model and knowledge graph
CN118966226A (en) * 2024-10-17 2024-11-15 时代新媒体出版社有限责任公司 A method and system for extracting entity relations from ancient book texts based on deep learning
CN119167937A (en) * 2024-10-08 2024-12-20 广西警察学院 A multimodal entity recognition method based on large language model
CN119170182A (en) * 2024-11-19 2024-12-20 吉林大学第一医院 Full-course follow-up management system and method for obstetrics and gynecology patients
CN119227794A (en) * 2024-09-29 2024-12-31 广西警察学院 Multimodal document structured processing and knowledge extraction method based on large language model
CN119918016A (en) * 2025-04-03 2025-05-02 广东科学技术职业学院 Method, device, terminal equipment and storage medium for supervising student learning status based on artificial intelligence
CN120050004A (en) * 2025-04-25 2025-05-27 深圳大学 A semantic communication retransmission method, system, terminal and storage medium based on long short-term memory network
CN120144810A (en) * 2025-05-15 2025-06-13 北京邮电大学 Dish analysis method and device based on multi-mode knowledge graph, equipment and medium
CN120145605A (en) * 2025-05-15 2025-06-13 建设综合勘察研究设计院有限公司 Intelligent construction method, system and device for underground pipe network dynamic information model

Families Citing this family (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN116186258A (en) * 2022-12-31 2023-05-30 青岛海尔电冰箱有限公司 Text classification method, device and storage medium based on multimodal knowledge graph
CN116910244A (en) * 2023-06-12 2023-10-20 青岛海尔电冰箱有限公司 Text classification method and device for multi-mode data, refrigeration equipment and medium
CN117725995B (en) * 2024-02-18 2024-05-24 青岛海尔科技有限公司 A method, device and medium for constructing knowledge graph based on large model
CN118940761B (en) * 2024-07-23 2025-01-28 上海烜翊科技有限公司 A Model-Based Document Generation Method
CN119622039A (en) * 2024-11-27 2025-03-14 中国数字文化集团有限公司 A deployment method and application terminal based on big data model in digital culture field

Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111259215A (en) * 2020-02-14 2020-06-09 北京百度网讯科技有限公司 Multi-modal-based topic classification method, device, equipment and storage medium
CN113094509A (en) * 2021-06-08 2021-07-09 明品云(北京)数据科技有限公司 Text information extraction method, system, device and medium
CN114944156A (en) * 2022-05-20 2022-08-26 青岛海尔电冰箱有限公司 Item classification method, device, equipment and storage medium based on deep learning
CN115062143A (en) * 2022-05-20 2022-09-16 青岛海尔电冰箱有限公司 Speech recognition and classification method, device, equipment, refrigerator and storage medium
CN115098765A (en) * 2022-05-20 2022-09-23 青岛海尔电冰箱有限公司 Information pushing method, device and equipment based on deep learning and storage medium
CN116186258A (en) * 2022-12-31 2023-05-30 青岛海尔电冰箱有限公司 Text classification method, device and storage medium based on multimodal knowledge graph

Family Cites Families (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110287389A (en) * 2019-05-31 2019-09-27 南京理工大学 A Multimodal Sentiment Classification Method Based on Fusion of Text, Speech and Video
CN112200317B (en) * 2020-09-28 2024-05-07 西南电子技术研究所(中国电子科技集团公司第十研究所) Multi-mode knowledge graph construction method
CN113936637B (en) * 2021-10-18 2025-04-18 上海交通大学 Speech adaptive completion system based on multimodal knowledge graph


Also Published As

Publication number Publication date
CN116186258A (en) 2023-05-30

Similar Documents

Publication Publication Date Title
WO2024140434A1 (en) Text classification method based on multi-modal knowledge graph, and device and storage medium
WO2024140432A1 (en) Ingredient recommendation method based on knowledge graph, and device and storage medium
CN110909613B (en) Video character recognition method and device, storage medium and electronic equipment
CN112052333B (en) Text classification method and device, storage medium and electronic equipment
CN113408385A (en) Audio and video multi-mode emotion classification method and system
WO2024140430A9 (en) Text classification method based on multimodal deep learning, device, and storage medium
WO2023222088A1 (en) Voice recognition and classification method and apparatus
CN105512348A (en) Method and device for processing videos and related audios and retrieving method and device
WO2023222089A1 (en) Item classification method and apparatus based on deep learning
CN116955699A (en) A video cross-modal search model training method, search method and device
WO2023222090A1 (en) Information pushing method and apparatus based on deep learning
CN115798459B (en) Audio processing method and device, storage medium and electronic equipment
CN118155623B (en) Speech recognition method based on artificial intelligence
CN117077787A (en) Text generation method and device, refrigerator and storage medium
WO2025001000A1 (en) Cognitive test method, cognitive test apparatus, electronic device and storage medium
CN118916443A (en) Information retrieval method and device and electronic equipment
CN119004381A (en) Multi-mode large model synchronous training and semantic association construction system and training method thereof
CN118520091A (en) Multi-mode intelligent question-answering robot and construction method thereof
CN118802398A (en) Meeting minutes generation method, device, storage medium and electronic device
US20240244290A1 (en) Video processing method and apparatus, device and storage medium
WO2024114303A1 (en) Phoneme recognition method and apparatus, electronic device and storage medium
WO2024093578A1 (en) Voice recognition method and apparatus, and electronic device, storage medium and computer program product
CN116719931A (en) Text classification methods and refrigerators
CN114155454B (en) Video processing method, device and storage medium
CN118608012B (en) A quantitative method for online user interaction experience quality integrating large models and knowledge graphs

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 23910360

Country of ref document: EP

Kind code of ref document: A1