
WO2024140434A1 - Text classification method based on multi-modal knowledge graph, and device and storage medium

Info

Publication number: WO2024140434A1
Application number: PCT/CN2023/140835
Authority: WO (WIPO PCT)
Prior art keywords: text, data, real-time, features
Other languages: French (fr), Chinese (zh)
Inventors: 曾谁飞, 孔令磊, 张景瑞, 李敏, 刘卫强
Applicants: 青岛海尔电冰箱有限公司, 海尔智家股份有限公司
Publication of WO2024140434A1 (PCT publication, English)

Classifications

    • G06F16/35 Information retrieval of unstructured textual data: Clustering; Classification
    • G06F16/367 Creation of semantic tools: Ontology
    • G06F16/685 Retrieval of audio data using metadata automatically derived from the content, e.g. an automatically derived transcript of audio data such as lyrics
    • G06F16/7834 Retrieval of video data using metadata automatically derived from the content, using audio features
    • G06F16/7844 Retrieval of video data using metadata automatically derived from the content, using original textual content or text extracted from visual content or a transcript of audio data
    • G06F40/284 Natural language analysis: Lexical analysis, e.g. tokenisation or collocates
    • G06F40/30 Handling natural language data: Semantic analysis

Definitions

  • In one embodiment, the multi-channel multi-size deep convolutional neural network used is composed of a 3*3 convolutional layer with 32 channels and a max-pooling layer.
  • In step S33, the alignment relationship between the input speech feature sequence and the output speech text feature sequence is obtained using the connectionist temporal classification (CTC) method.
  • This temporal classification method is generally applied after a convolutional network model. It enables fully end-to-end acoustic model training that does not require pre-alignment of the data: only an input sequence and an output sequence are needed, without one-to-one labeling of the data, and the model directly outputs the probability of the predicted sequence. Based on this predicted probability, the most likely text output is taken as the second voice text data (see the CTC sketch after this list).
  • The attention mechanism can guide the deep convolutional neural network to focus on the more critical feature information and suppress non-critical feature information. Therefore, by introducing the attention mechanism, the local key features of the second voice text data, or their weight information, can be obtained, further reducing irregular misalignment of the sequence during model training.
  • In step S35, according to the second voice text data and its key features or the weight information of those key features, the second voice text data is assigned its own weight information through a model that integrates the self-attention mechanism and the fully connected layer, so as to better capture the internal weight information of the text semantic features of the voice text data and to strengthen the importance of different parts of the semantic feature information; finally, a classification function such as Softmax computes the score to obtain the voice text data.
  • Step S4 specifically includes:
  • S41: Input the real-time video data into a 3D deep convolutional neural network for calculation to obtain image features.
  • S42: Input the image features into a multi-channel multi-size temporal convolutional network for transcription to obtain first image text data.
  • S43: Output the alignment relationship between the image features and the first image text data based on the connectionist temporal classification method to obtain second image text data.
  • In steps S41 and S42, considering that the sentences recognized from image text are relatively complex (for example, different sentence lengths, different pause positions and word compositions, and correlations between image features), video processing operations such as cropping and framing are performed on the valid video data to obtain the video image of the facial area, which is then cropped and segmented into multiple continuous facial picture frames.
  • The multiple continuous facial picture frames are input into the 3D convolutional neural network model; by adding information from the time dimension, more expressive features can be extracted.
  • The 3D convolutional neural network model can capture the correlation information between multiple pictures: it takes multiple continuous frames as input and captures the motion information in the input frames by adding a new dimension of information, so as to better obtain the image features.
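By way of illustration of the CTC fragments above, the following is a minimal PyTorch sketch of CTC-loss training; the tensor shapes, class count and blank index are illustrative assumptions, not values from the patent.

```python
import torch
import torch.nn as nn

T, N, C = 50, 2, 30  # input sequence length, batch size, classes (index 0 = CTC blank)
# Per-time-step log-probabilities, as a model such as the speech CNN would emit.
log_probs = torch.randn(T, N, C, requires_grad=True).log_softmax(dim=2)
targets = torch.randint(1, C, (N, 10), dtype=torch.long)   # unaligned target label sequences
input_lengths = torch.full((N,), T, dtype=torch.long)
target_lengths = torch.full((N,), 10, dtype=torch.long)

ctc_loss = nn.CTCLoss(blank=0)
loss = ctc_loss(log_probs, targets, input_lengths, target_lengths)
loss.backward()  # end-to-end training without pre-aligned data
```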


Abstract

Disclosed in the present invention is a text classification method based on a multi-modal knowledge graph. The method comprises: acquiring text data and extracting its text features; extracting, according to the text features, context information of the text data and weight information of text semantic features; combining the context information and the weight information by means of a fully connected layer, then outputting the result to a classifier to calculate a score and obtain classification result information; and outputting the classification result information. The method improves the accuracy and generalization capability of text classification, thereby improving the user experience.

Description

Text classification method, device and storage medium based on multimodal knowledge graph

Technical Field
The present invention relates to the field of computer technology, and in particular to a text classification method, device and storage medium based on a multimodal knowledge graph.
Background Art
At present, text classification algorithms do not fully exploit the semantic representation capability of multimodal data such as voice, video, and the user's preference, favorite and comment data on ingredients, resulting in poor text classification performance. Moreover, such text data are processed with traditional machine learning methods, or with methods combining machine learning and shallow neural network features; these approaches tend to generalize poorly, understand the data insufficiently, and produce models of weak robustness, which in turn limits text classification capability.
Therefore, how to build a multimodal text classification method with the help of knowledge graphs has become a key technique for improving text classification accuracy. Smart refrigerator interaction is inseparable from multi-source heterogeneous data such as real-time voice, video, real-time text and historical text. The problem is therefore how to achieve optimal feature information extraction and text classification for such multi-source heterogeneous data on the basis of multimodal or cross-modal data, so as to optimize the text classification accuracy of the smart refrigerator and thereby improve the experience of using the refrigerator.
Summary of the Invention
The purpose of the present invention is to provide a text classification method, device and storage medium based on a multimodal knowledge graph.
The present invention provides a text classification method based on a multimodal knowledge graph, comprising the steps of:
acquiring real-time audio and video data, and acquiring real-time and historical text data; preprocessing the real-time audio and video data to obtain real-time voice data and real-time video data; transcribing the real-time voice data into voice text data and extracting text features of the voice text data; transcribing the real-time video data into image text data and extracting text features of the image text data; extracting entity features of the real-time and historical text data; acquiring context information of the text data and weight information of text semantic features according to the voice text features, the image text features and the entity features; combining the context information and the weight information through a fully connected layer and outputting the result to a classifier to calculate a score and obtain classification result information; and outputting the classification result information.
As a further improvement of the present invention, "preprocessing the real-time audio and video data to obtain real-time voice data and video data" specifically includes: performing data cleaning, format parsing, format conversion and data storage on the real-time audio and video data to obtain valid audio and video data; separating the valid audio and video data into voice and video using a script or a third-party tool, to obtain the real-time voice data and real-time video data; preprocessing the real-time voice data and video data, including framing and windowing the real-time voice data, and cropping and framing the real-time video data; and preprocessing the real-time and historical text data, including word segmentation, removal of stop words and removal of duplicate words.
As a further improvement of the present invention, "transcribing the real-time voice data into voice text data" specifically includes: extracting features of the real-time voice data to obtain voice features; inputting the voice features into a multi-channel multi-size deep convolutional neural network model for speech recognition and transcribing them to obtain first voice text data; outputting the alignment relationship between the voice features and the first voice text data based on the connectionist temporal classification method, to obtain second voice text data; obtaining, based on an attention mechanism, key features of the second voice text data or weight information of the key features; and combining the second voice text data and its key features or the weight information of the key features through a fully connected layer, then calculating a score through a classification function to obtain the voice text data.
As a further improvement of the present invention, "extracting the valid voice data features" specifically includes: extracting the valid voice data features and obtaining their Mel-frequency cepstral coefficient features.
As a further improvement of the present invention, "transcribing the real-time video data into image text data" specifically includes: inputting the real-time video data into a 3D deep convolutional neural network for calculation to obtain image features; inputting the image features into a multi-channel multi-size temporal convolutional network for transcription to obtain first image text data; outputting the alignment relationship between the image features and the first image text data based on the connectionist temporal classification method, to obtain second image text data; and combining the second image text data through a fully connected layer, then calculating a score through a classification function to obtain the image text data.
As a further improvement of the present invention, "extracting entity features of the real-time and historical text data" specifically includes: performing entity extraction on the text data using an entity linking method to obtain multiple ingredient entities; querying the ingredient knowledge graph based on each ingredient entity to obtain the corresponding entity vector representation; and inputting the entity vector representations into a multi-head attention mechanism for calculation to obtain entity feature vectors.
As a further improvement of the present invention, "querying the ingredient knowledge graph based on each ingredient entity to obtain the corresponding entity vector representation" specifically includes: converting the entity into the corresponding entity vector representation in the form of an entity triple; and realizing the entity vector representation using the distributed vector representation method of a neural network, as in the sketch below.
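As an illustration of the two preceding paragraphs, the following is a minimal PyTorch sketch of looking up distributed entity vectors for linked ingredient entities and passing them through a multi-head attention mechanism; the toy entity vocabulary, embedding size and head count are assumptions for illustration only.

```python
import torch
import torch.nn as nn

entity_ids = {"宫保鸡丁": 0, "牛肉": 1, "蔬菜": 2}             # toy entity vocabulary (assumption)
embed = nn.Embedding(len(entity_ids), 64)                      # distributed (neural) entity vectors
attn = nn.MultiheadAttention(embed_dim=64, num_heads=4, batch_first=True)

ids = torch.tensor([[entity_ids["牛肉"], entity_ids["蔬菜"]]])  # entities linked from the text
vecs = embed(ids)                                               # (1, 2, 64) entity vector representations
entity_features, _ = attn(vecs, vecs, vecs)                     # (1, 2, 64) entity feature vectors
```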
As a further improvement of the present invention, "acquiring context information of the text data and weight information of text semantic features according to the real-time voice text features, real-time video text features and entity features" specifically includes: converting the real-time voice text features and real-time video text features into voice text word vectors and image text word vectors; and inputting the voice text word vectors, image text word vectors and entity features into a bidirectional long short-term memory network model, to obtain a context feature vector containing the voice text, image text, and real-time and historical text feature information.
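A minimal sketch of the bidirectional long short-term memory encoder described above, assuming the word vectors and entity features have already been fused into one sequence; all dimensions are illustrative assumptions.

```python
import torch
import torch.nn as nn

bilstm = nn.LSTM(input_size=64, hidden_size=128, batch_first=True, bidirectional=True)

fused = torch.randn(1, 20, 64)   # (batch, seq_len, dim): fused word/entity vectors
context, _ = bilstm(fused)       # (1, 20, 256): forward + backward context features per position
```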
As a further improvement of the present invention, based on an attention mechanism, the self-weight information and/or associated weight information of words and phrases in the text features of the voice text data, image text data, and real-time and historical text data are distinguished, to obtain the weight information of the text semantic features.
As a further improvement of the present invention, "distinguishing, based on the attention mechanism, the self-weight information and/or associated weight information of words and phrases in the text features of the voice text data, image text data, and real-time and historical text data" specifically includes: inputting the voice text context feature vector, the image text context feature vector, and the real-time and historical text entity feature vectors into a multi-head attention mechanism; obtaining a self-weight text attention feature vector containing the self-weight information of the voice text, image text, and real-time and historical text semantic features; and obtaining an associated-weight text attention feature vector containing the associated weight information of those semantic features.
As a further improvement of the present invention, "combining the context information and the weight information through a fully connected layer and outputting the result to a classifier to calculate a score and obtain classification result information" specifically includes: combining the context feature vector and the weighted text attention feature vector through a fully connected layer, outputting the result to a classification function, and calculating the text semantic scores of the voice text data, image text data, and real-time and historical text data together with their normalized score results, to obtain the text classification result information, as in the sketch below.
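A minimal sketch of this final step: concatenating a pooled context feature vector and a pooled attention feature vector, passing them through a fully connected layer, and scoring the classes with Softmax. The vector sizes and class count are illustrative assumptions.

```python
import torch
import torch.nn as nn

n_classes = 8
fc = nn.Linear(256 + 256, n_classes)   # fully connected combination layer

context_vec = torch.randn(1, 256)      # pooled context feature vector
attn_vec = torch.randn(1, 256)         # pooled weighted text attention feature vector
logits = fc(torch.cat([context_vec, attn_vec], dim=1))
scores = logits.softmax(dim=1)         # normalized score per class = classification result
print(scores.argmax(dim=1))            # index of the predicted text category
```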
As a further improvement of the present invention, "transcribing the voice data into voice text data and extracting text features of the voice text data" further includes: acquiring configuration data stored in an external cache, and executing the multi-channel multi-size deep convolutional neural network model calculation on the voice data based on the configuration data, to perform text transcription and extract text features.
The present invention also provides an electrical appliance, comprising: a memory for storing executable instructions; and a processor which, when running the executable instructions stored in the memory, implements the above text classification method based on a multimodal knowledge graph.
The present invention also provides a refrigerator, comprising: a memory for storing executable instructions; and a processor which, when running the executable instructions stored in the memory, implements the above text classification method based on a multimodal knowledge graph.
The present invention also provides a computer-readable storage medium storing executable instructions which, when executed by a processor, implement the above text classification method based on a multimodal knowledge graph.
The beneficial effects of the present invention are as follows. The method provided by the present invention accomplishes the task of recognizing and classifying the acquired text data. First, by introducing multimodal data such as real-time voice, real-time video, real-time text, and real-time and historical data on the user's ingredient preferences, interests and historical comments, it solves the problems that single-modal data carries only limited text semantic information and is insufficiently understood. Second, introducing a deep convolutional neural network model compensates for the insufficient feature representation capability of traditional machine learning methods; it can capture the relevance and complementarity of semantic feature information at a deeper level, strengthen the semantic features, and effectively improve text classification accuracy. Finally, adding the entity-linking representation of the multimodal knowledge graph improves the generalization capability of the text semantic feature information and enhances the user experience.
Brief Description of the Drawings
FIG. 1 is a structural block diagram of the model involved in a text classification method based on a multimodal knowledge graph in one embodiment of the present invention.
FIG. 2 is a schematic diagram of the steps of a text classification method based on a multimodal knowledge graph in one embodiment of the present invention.
FIG. 3 is a schematic diagram of the steps of acquiring real-time audio and video data and real-time and historical text data in one embodiment of the present invention.
FIG. 4 is a schematic diagram of the steps of preprocessing the real-time audio and video data and the real-time and historical text data in one embodiment of the present invention.
FIG. 5 is a schematic diagram of the steps of transcribing the real-time voice data into voice text data in one embodiment of the present invention.
FIG. 6 is a schematic diagram of the steps of transcribing the real-time video data into image text data in one embodiment of the present invention.
FIG. 7 is a schematic diagram of the steps of acquiring the context information and weight information of the text data according to the real-time voice text features, real-time video text features and entity features in one embodiment of the present invention.
FIG. 8 is a schematic diagram of the steps of acquiring the context information of the text data and the weight information of the text semantic features according to the real-time voice text features, real-time video text features and entity features in one embodiment of the present invention.
Detailed Description of the Embodiments
The present invention will be described in detail below with reference to the specific embodiments shown in the accompanying drawings. These embodiments do not limit the present invention, however, and any structural, methodological or functional changes made by a person of ordinary skill in the art on the basis of these embodiments fall within the scope of protection of the present invention.
It should be noted that the term "comprise", or any variant thereof, is intended to cover non-exclusive inclusion, so that a process, method, article or device that includes a series of elements includes not only those elements but also other elements not explicitly listed, or elements inherent to such a process, method, article or device. In addition, terms such as "first" and "second" are used for descriptive purposes only and shall not be understood as indicating or implying relative importance.
An embodiment of the present invention is a text classification method based on a multimodal knowledge graph. Although the present application presents the method's operation steps in the following embodiments or in the flowchart of FIG. 1, for steps with no logically necessary causal relationship the execution order is not limited, on a routine basis or without creative labor, to the order given in the embodiments of the present application.
FIG. 1 is a structural block diagram of the model involved in the text classification method based on a multimodal knowledge graph provided by the present invention, and FIG. 2 is a schematic diagram of the steps of that method, which include:
S1: Acquire real-time audio and video data, and acquire real-time and historical text data.
S2: Preprocess the real-time audio and video data to obtain real-time voice data and real-time video data.
S3: Transcribe the real-time voice data into voice text data, and extract text features of the voice text data.
S4: Transcribe the real-time video data into image text data, and extract text features of the image text data.
S5: Extract entity features of the real-time and historical text data.
S6: Acquire context information of the text data and weight information of text semantic features according to the real-time voice text features, real-time video text features and entity features.
S7: Combine the context information and weight information through a fully connected layer, then output the result to a classifier to calculate a score and obtain classification result information.
S8: Output the classification result information.
The method provided by the present invention enables a smart electronic device to implement functions such as real-time interaction with the user or message pushing, based on the user's real-time audio and video input. Illustratively, in this embodiment a smart refrigerator is taken as an example, and the method is described in combination with a pre-trained deep learning model. Based on the user's audio and video input, the smart refrigerator classifies the corresponding text content generated from the user's audio and video data, and computes the text content classification result information to be output based on the classification results.
As shown in FIG. 3, step S1 specifically includes:
S11: Acquire the real-time audio and video data collected by a collection device, and/or acquire the real-time audio and video data transmitted from a client terminal.
S12: Acquire the real-time text data collected by a collection device, and/or acquire the real-time text data transmitted from a client terminal.
S13: Acquire internally stored historical text data, and/or acquire externally stored historical text data, and/or acquire historical text data transmitted from a client terminal.
The real-time audio and video data described here include real-time voice data and real-time video data. The real-time voice refers to the interrogative or imperative statements the user currently speaks to the smart electronic device or to a client terminal device communicatively connected to it; likewise, it may be voice information from the user collected by a voice collection device. In this embodiment, the user may ask questions such as "What vegetables are in the refrigerator today?" or "What beef ingredients are in the refrigerator today?", or issue commands such as "Delete all ingredients". The real-time video data are real-time video images captured by the smart electronic device or by a client terminal device communicatively connected to it. In this embodiment, a camera built into the smart refrigerator captures the user's facial image, and the lip-region feature image is extracted from the facial image to recognize the text content corresponding to the image, for example recognizing the image text data "What vegetables are in the refrigerator today?".
The real-time text data described here are text data collected by a text collection device, while the historical text data refer to the user's real-time text data from previous use; further, they may also include historical text data entered by the user. Specifically, in this embodiment, the real-time and historical text data include the user's preferences for and interest in ingredients as well as comment data posted by the user, such as "I used to like Kung Pao Chicken", which captures how the user's ingredient preferences relate to the current real-time text data. The acquired real-time and historical text data can serve as part of the data set for pre-training and prediction models, effectively supplementing the single voice representation of the real-time audio and video data and enriching the semantic features.
As described in steps S11 and S12, in this embodiment the user's real-time audio and video can be collected by audio and video collection devices such as a camera installed in the smart refrigerator; during use, when the user needs to interact with the smart refrigerator, speaking to it directly is sufficient. The user's real-time audio and video data can also be obtained from a client terminal connected to the smart refrigerator over a wireless communication protocol. The client terminal is an electronic device with an information-sending function, such as a mobile phone, tablet computer, smart camera or smart watch, or an APP or Bluetooth-enabled smart electronic device; during use, the user speaks to the client terminal or shoots directly with the camera built into the refrigerator, and after the client terminal collects the audio and video, they are transmitted to the smart refrigerator over a wireless communication method such as WiFi or Bluetooth. This realizes a multi-channel real-time audio and video acquisition scheme that is not limited to speaking directly to the smart refrigerator. When the user needs to interact, real-time text data are collected directly through a text collection device or the text input device of the client terminal. In other embodiments of the present invention, one or any combination of the above acquisition methods for real-time audio and video data or real-time text data may be used, or the data may be obtained through other channels based on the prior art; the present invention imposes no specific restriction on this.
As described in step S13, in this embodiment the historical text data stored in the internal memory of the smart refrigerator can be read. The historical text data stored in an external storage device configured for the smart refrigerator can also be read; the external storage device is a removable storage device such as a USB flash drive or SD card, and configuring one can further expand the storage space of the smart refrigerator. The historical text data stored on a client terminal such as a mobile phone or tablet computer, or on an application server, can also be obtained. Implementing multi-channel acquisition of historical text data can greatly increase the amount of historical text, thereby improving the accuracy of subsequent speech recognition and video image recognition. In other embodiments of the present invention, one or any combination of the above acquisition methods for historical text data may be used, or the historical text data may be obtained through other channels based on the prior art; the present invention imposes no specific restriction on this.
Further, in this embodiment the smart refrigerator is provided with an external cache, and at least part of the historical text data is stored in it. As usage time increases, the historical text data grow; storing part of the data in the external cache saves internal storage space in the smart refrigerator, and when performing neural network calculations, the historical text data stored in the external cache can be read directly, which improves the efficiency of the algorithm.
Specifically, in this embodiment a Redis component is used as the external cache. Redis is a widely used distributed cache system with a key/value storage structure that can serve as a database, cache and message queue broker. In other embodiments of the present invention, other external caches such as Memcached may also be used; the present invention imposes no specific restriction on this.
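By way of illustration, a minimal sketch of caching historical text data in Redis with the redis-py client; the connection parameters and key naming are assumptions for illustration, not details from the patent.

```python
# Minimal sketch: caching historical text data in Redis via redis-py.
import json

import redis

cache = redis.Redis(host="localhost", port=6379, decode_responses=True)

def save_history(user_id: str, texts: list[str]) -> None:
    # Store the user's historical text data as JSON under a per-user key (naming is an assumption).
    cache.set(f"history:{user_id}", json.dumps(texts, ensure_ascii=False))

def load_history(user_id: str) -> list[str]:
    # Read the cached history back before running the neural network,
    # avoiding a slower read from the refrigerator's internal storage.
    raw = cache.get(f"history:{user_id}")
    return json.loads(raw) if raw else []

save_history("user_001", ["我以前喜欢宫保鸡丁", "今天冰箱里有啥蔬菜"])
print(load_history("user_001"))
```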
In summary, in steps S11 to S13, real-time audio and video data and real-time and historical text data can be acquired flexibly through multiple channels, which improves the user experience while guaranteeing the data volume and effectively improving algorithm efficiency.
As shown in FIG. 4, step S2 specifically includes the steps of:
S21: Perform data cleaning on the real-time audio and video data to obtain valid audio and video data.
S22: Separate the valid audio and video data into voice and video to obtain real-time voice data and video data.
S23: Preprocess the real-time voice data and video data, including framing and windowing the real-time voice data, and cropping and framing the real-time video data.
S24: Preprocess the real-time and historical comment text data, including word segmentation, removal of stop words and removal of duplicate words.
In step S21, data cleaning of the real-time audio and video data specifically includes:
A certain number of real-time audio and video data sets are obtained; illustratively, they can be imported into the data cleaning model in the form of files for processing. To prevent import failures, data that do not meet the file import format undergo format parsing and format conversion; irrelevant data and duplicate data in the data set are then deleted, and outliers and missing values are handled, preliminarily screening out information irrelevant to classification. The audio and video data are thus cleaned, and the cleaned data are output and saved in a specified format, yielding valid audio and video data.
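A minimal sketch of such a cleaning pass, assuming the audio and video data set has been imported as a table with pandas; the file and column names are illustrative assumptions.

```python
import pandas as pd

df = pd.read_csv("av_dataset.csv")              # import the data set from a file
df = df.drop_duplicates()                       # delete duplicate records
df = df.dropna(subset=["file_path"])            # drop rows with missing values in key fields
df = df[df["duration_sec"] > 0]                 # filter out abnormal values
df = df[df["label"] != "irrelevant"]            # screen out information irrelevant to classification
df.to_csv("av_dataset_clean.csv", index=False)  # save cleaned data in a specified format
```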
In step S22, a script or a third-party audio and video separation tool is used to separate the valid audio and video data into voice and video, thereby obtaining real-time voice data and real-time video data.
In an embodiment of the present invention, an audio and video separation script can be written in the Python language, or a third-party audio and video separation tool can be used, to perform the separation operation on the input audio and video data, achieving voice/video separation and obtaining the separated real-time voice and video data.
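A minimal sketch of such a separation using the third-party moviepy library (one possible tool among many; the import path assumes moviepy 1.x); the file names are illustrative assumptions.

```python
from moviepy.editor import VideoFileClip

clip = VideoFileClip("realtime_input.mp4")
clip.audio.write_audiofile("realtime_voice.wav")             # extract the voice track
clip.without_audio().write_videofile("realtime_video.mp4")   # keep the silent video track
```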
In step S23, the separated voice is segmented according to a specified time period or number of samples, completing the framing of the voice to obtain voice signal data; then, through the effect of a window function, the originally noisy voice signal exhibits signal enhancement and periodicity, completing the windowing and facilitating better extraction of the voice feature parameters later. Illustratively, step S23 also includes cropping the valid video data to produce multiple frames of pictures. Specifically, a script can first load the video data and read the video information, then decode the video according to that information to determine how many pictures the video shows per second, thereby obtaining single-frame image information, which includes the width and height of each frame; finally the video is saved as multiple pictures. Thus, after the processing of step S23, valid real-time voice data and image data are obtained. In other embodiments of the present invention, other video framing methods such as third-party video cropping tools may also be used; the present invention imposes no specific restriction on this.
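A minimal sketch of both preprocessing operations: framing and windowing the speech signal with NumPy, and saving the video as single-frame pictures with OpenCV. The frame length, hop size and paths are illustrative assumptions.

```python
import cv2
import numpy as np

def frame_and_window(signal: np.ndarray, frame_len: int = 400, hop: int = 160) -> np.ndarray:
    # Split the 1-D speech signal into overlapping frames, then apply a
    # Hamming window to each frame to reduce spectral leakage.
    n_frames = 1 + (len(signal) - frame_len) // hop
    frames = np.stack([signal[i * hop: i * hop + frame_len] for i in range(n_frames)])
    return frames * np.hamming(frame_len)

def video_to_frames(path: str, out_prefix: str) -> None:
    cap = cv2.VideoCapture(path)
    fps = cap.get(cv2.CAP_PROP_FPS)   # how many pictures the video shows per second
    idx = 0
    while True:
        ok, frame = cap.read()        # each frame's height/width are available via frame.shape
        if not ok:
            break
        cv2.imwrite(f"{out_prefix}_{idx:05d}.png", frame)
        idx += 1
    cap.release()
```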
In step S24, text preprocessing is performed on the collected real-time and historical text data, for example deleting irrelevant data and duplicate data and handling outliers and missing values, preliminarily screening out information irrelevant to classification. Next, the real-time and historical text data are annotated with category labels based on rule-based statistical methods, and the text data are segmented using word segmentation methods based on string matching, understanding, statistics and rules. After that, stop words and duplicate words are removed so that the text data meet the input requirements of the neural network model.
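A minimal sketch of the segmentation, stop-word removal and deduplication steps using the jieba segmenter (a string-matching based tool); the stop-word list here is an illustrative assumption, and a full list would be loaded from file.

```python
import jieba

STOP_WORDS = {"的", "了", "呢", "啥"}  # illustrative stop words only

def preprocess(text: str) -> list[str]:
    tokens = jieba.lcut(text)                      # dictionary / string-matching segmentation
    tokens = [t for t in tokens if t not in STOP_WORDS and not t.isspace()]
    seen, deduped = set(), []
    for t in tokens:                               # remove duplicate words, keep first occurrence
        if t not in seen:
            seen.add(t)
            deduped.append(t)
    return deduped

print(preprocess("今天冰箱里有啥蔬菜"))
```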
As shown in FIG. 5, step S3 specifically includes:
S31: extracting features of the valid speech data to obtain speech features.
S32: inputting the speech features into a multi-channel multi-size deep convolutional neural network model for speech recognition, which transcribes them into first speech text data.
S33: outputting the alignment relationship between the speech features and the first speech text data based on the connectionist temporal classification method, to obtain second speech text data.
S34: based on an attention mechanism, obtaining key features of the second speech text data or weight information of those key features.
S35: combining the second speech text data and its key features or their weight information through a fully connected layer, and then calculating scores through a classification function to obtain the speech text data.
In step S31, extracting the features of the valid speech data specifically includes:
extracting features of the speech data to obtain its Mel-scale Frequency Cepstral Coefficients (MFCC). MFCC are discriminative components of a speech signal, namely cepstral parameters extracted in the Mel-scale frequency domain. The Mel scale describes the nonlinear frequency response of the human ear, and the MFCC parameters account for the ear's varying sensitivity to different frequencies, which makes them particularly suitable for speech recognition and speaker identification.
In an embodiment of the present invention, feature parameters such as Perceptual Linear Predictive (PLP) features or Linear Predictive Coding (LPC) features of the speech data may instead be obtained through different algorithmic steps to replace the MFCC features. The choice may be adjusted according to the actual application scenario and the model parameters used; the present invention imposes no specific limitation on this.
For the specific algorithmic steps involved above, reference may be made to the existing technology in this field; they are not described in detail here.
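For orientation only, MFCC extraction is available off the shelf; the sketch below assumes librosa, with 13 coefficients and the same 25 ms/10 ms framing used earlier.

import librosa

y, sr = librosa.load("sample.wav", sr=16000)
mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=13, n_fft=400, hop_length=160)
print(mfcc.shape)                          # (13, n_frames)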
In step S32, the valid speech data are transcribed into text content through a network model of automatic speech recognition technology, obtaining the first speech text data.
In this embodiment, the speech-to-text task is accomplished by constructing a multi-channel multi-size deep convolutional neural network model. The deep network is built from multiple layers of deep convolution; a deep convolutional neural network generally consists of several convolutional layers followed by several fully connected layers, with various nonlinear and pooling operations in between, and is mainly used to process grid-structured data, so the model can use filters to pick out contours between adjacent pixels. Moreover, this model first extracts speech feature values and computes on those features rather than on the raw speech data. Compared with a traditional recurrent neural network, a deep convolutional neural network therefore has the advantages of lower computational cost and easier characterization of local features; shared weights and pooling layers give the model better invariance in the time or frequency domain, and the deeper nonlinear structure gives it strong representational capacity. In addition, multiple channels and multiple kernel sizes extract speech features from different perspectives, capturing more speech feature information and yielding better speech recognition accuracy.
Specifically, in this embodiment, the multi-channel multi-size deep convolutional neural network used in step S32 is composed of 3*3 convolutional layers, 32 channels, and one max-pooling layer.
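A minimal PyTorch sketch of one such network: each branch uses 32 channels and one max-pooling layer as stated, while the 3*3, 5*5, and 7*7 kernel mix and the output size are assumptions added only to show the multi-size idea.

import torch
import torch.nn as nn

class SpeechCNN(nn.Module):
    def __init__(self, n_classes=100):
        super().__init__()
        self.branches = nn.ModuleList([
            nn.Sequential(nn.Conv2d(1, 32, k, padding=k // 2),
                          nn.ReLU(),
                          nn.MaxPool2d(2))
            for k in (3, 5, 7)             # one kernel size per branch
        ])
        self.fc = nn.Linear(32 * 3, n_classes)

    def forward(self, x):                  # x: (batch, 1, n_mfcc, n_frames)
        feats = [b(x).mean(dim=(2, 3)) for b in self.branches]  # global pooling
        return self.fc(torch.cat(feats, dim=1))

out = SpeechCNN()(torch.randn(2, 1, 13, 100))
print(out.shape)                           # (2, 100)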
In step S33, the connectionist temporal classification (CTC) method is used to obtain the alignment relationship between the input speech feature sequence and the output speech text feature sequence.
In this embodiment, it is difficult to construct a precise mapping between the valid speech data and the characters of the first speech text data, which increases the difficulty of subsequent speech recognition. To solve this problem, the temporal classification method is adopted. This method is generally applied after the convolutional network model and provides fully end-to-end acoustic model training: it requires no prior alignment of the data, only an input sequence and an output sequence, with no alignment or one-to-one labeling, and it can directly output the probability of the predicted sequence. From this predicted probability, the most likely text output can be obtained as the second speech text data.
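The end-to-end training described here can be expressed with the standard CTC loss; the sketch below assumes PyTorch's nn.CTCLoss with blank index 0 and arbitrary toy dimensions.

import torch
import torch.nn as nn

T, B, C = 50, 2, 28                        # time steps, batch, token classes
log_probs = torch.randn(T, B, C, requires_grad=True).log_softmax(dim=-1)
targets = torch.randint(1, C, (B, 10))     # unaligned label sequences
input_lengths = torch.full((B,), T, dtype=torch.long)
target_lengths = torch.full((B,), 10, dtype=torch.long)

loss = nn.CTCLoss(blank=0)(log_probs, targets, input_lengths, target_lengths)
loss.backward()                            # trains with no pre-alignment

# Greedy decoding: take the best token per step, then collapse repeats
# and strip blanks to recover the most likely text output.
best = log_probs.argmax(dim=-1).T          # (batch, T)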
Further, in step S34, the attention mechanism can guide the deep convolutional neural network to focus on the more critical feature information while suppressing non-critical feature information. By introducing the attention mechanism, the local key features or weight information of the second speech text data can therefore be obtained, further reducing irregular misalignment of sequences during model training.
Here, in step S35, according to the second speech text data and its key features or their weight information, a model fusing a self-attention mechanism with a fully connected layer assigns the second speech text data its own weight information, so as to better obtain the internal weight information of the text semantic features of the speech text data and to strengthen the importance of different parts of that semantic information; finally, a classification function, such as the Softmax function, calculates scores to obtain the speech text data.
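A sketch of this self-attention-plus-fully-connected fusion with a Softmax score, under assumed illustrative dimensions; it is not the patented model itself.

import torch
import torch.nn as nn

class AttnScorer(nn.Module):
    def __init__(self, dim=256, n_classes=100):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, num_heads=4, batch_first=True)
        self.fc = nn.Linear(dim, n_classes)

    def forward(self, x):                  # x: (batch, seq, dim)
        weighted, _ = self.attn(x, x, x)   # positions re-weighted by themselves
        fused = weighted.mean(dim=1)       # combine over the sequence
        return torch.softmax(self.fc(fused), dim=-1)  # score via Softmax

scores = AttnScorer()(torch.randn(2, 20, 256))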
As shown in FIG. 6, step S4 specifically includes:
S41: inputting the real-time video data into a 3D deep convolutional neural network for computation, to obtain image features.
S42: inputting the image features into a multi-channel multi-size temporal convolutional network for transcription, to obtain first image text data.
S43: outputting the alignment relationship between the image features and the first image text data based on the connectionist temporal classification method, to obtain second image text data.
S44: combining the second image text data through a fully connected layer and then calculating scores through a classification function to obtain the image text data.
In steps S41 and S42, considering that sentences recognized from image text are relatively complex (sentence lengths differ, pause positions and word composition vary, and the image features are correlated), the valid video data may be subjected to video processing operations such as cropping and framing to obtain video images of the facial region, which are then cropped and segmented into multiple consecutive facial picture frames. In this embodiment, these consecutive facial frames are input into a 3D convolutional neural network model; by adding information along the time dimension, more expressive features can be extracted. The 3D convolutional neural network model handles the correlation among multiple pictures: it takes consecutive frames as input and, through the added dimension, captures the motion information in the input frames, thereby better obtaining the image features.
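The added time dimension is what distinguishes the 3D convolution; a minimal PyTorch sketch with an assumed 16-frame facial clip follows.

import torch
import torch.nn as nn

video_clip = torch.randn(2, 3, 16, 112, 112)   # (batch, rgb, frames, H, W)
conv3d = nn.Sequential(
    nn.Conv3d(3, 32, kernel_size=(3, 3, 3), padding=1),  # convolves across time too
    nn.ReLU(),
    nn.MaxPool3d(kernel_size=(1, 2, 2)),       # pool space, keep time resolution
)
features = conv3d(video_clip)
print(features.shape)                          # (2, 32, 16, 56, 56)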
In steps S43 and S44, as with the speech data processing described above, the connectionist temporal classification method is likewise used to establish the mapping between the valid video data and the characters of the first image text data, obtaining the second image text data. A model fusing a self-attention mechanism with a fully connected layer then assigns the second image text data its own weight information and/or associated weight information, so as to better obtain the internal and/or associated weight information of the text semantic features of the image text data and to strengthen the importance of different parts of that semantic information; finally, scores are calculated through a classification function to obtain the image text data. The specific processing is the same as the speech data processing steps above and is not repeated here.
As shown in FIG. 7, step S5 specifically includes:
S51: performing entity extraction on the text data using an entity linking method, to obtain multiple ingredient entities.
S52: querying the ingredient knowledge graph based on each ingredient entity, to obtain the corresponding entity vector representation.
S53: inputting the entity vector representations into a multi-head attention mechanism for computation, to obtain the entity feature vectors.
In steps S51 and S52, entity linking is the process of unambiguously and correctly pointing entity objects recognized in text (such as person names and place names) to the target entities in a knowledge base. That is, the knowledge base is searched for the target item that best matches the entity object; entity linking thus assigns a unique identifier to each entity mentioned in the text and generally serves as a task downstream of entity extraction and recognition. In this embodiment, all entities related to ingredients are first extracted from the real-time and historical texts and mapped to candidate entity items; for each entity mention, the set of possibly corresponding candidate entities is found in the given knowledge graph, and irrelevant entities in the graph are filtered out to generate the candidates. The extracted entities then undergo disambiguation and entity alignment: the multiple candidates in each mention's candidate set are scored and ranked, and the highest-scoring candidate is output as the entity linking result.
The entity information contains semantic feature information such as the user's preferences for and liking of ingredients, topics of interest to the user, and comment data about ingredients; these real-time and historical text data enrich the semantic content of the text. In addition, the entities found in the knowledge graph are converted, in triple form, into entity vector representations that take into account both the entity relation structure information and the entity description information in the knowledge graph, yielding a corresponding vector representation for each. Specifically, the distributed vector representation method of neural networks is usually adopted, which converts a word into a distributed representation, i.e., a fixed-length continuous vector, and can thereby reflect the contribution of different word segments to the result.
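One common way to realize such a triple-based distributed representation is a TransE-style embedding, sketched below; the entity and relation ids are hypothetical, and the scoring rule is an assumption standing in for whichever representation method is actually used.

import torch

n_entities, n_relations, dim = 1000, 20, 100
ent = torch.nn.Embedding(n_entities, dim)
rel = torch.nn.Embedding(n_relations, dim)

# Hypothetical triple, e.g. (apple, rich_in, vitamin_C).
h, r, t = torch.tensor([3]), torch.tensor([1]), torch.tensor([7])
score = -(ent(h) + rel(r) - ent(t)).norm(p=2, dim=-1)  # higher = more plausible
entity_vector = ent(h)        # fixed-length continuous entity representation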
In step S53, the attention mechanism directly computes an attention weight for each position of the text data during encoding and then computes an implicit vector representation of the whole text as a weighted sum. Often we want the attention model to learn different behaviors from the same attention mechanism and then combine those behaviors as knowledge; to this end, multiple different kinds of text data can be learned independently and concatenated after attention pooling to produce the final feature vector. In this embodiment, the entity vector representations are input into a multi-head attention mechanism to obtain entity feature vectors that carry the user's ingredient preferences and interests and the relevance of ingredient comment data, which to a certain extent extends the complementarity of semantic features across different types of data and their multi-faceted, multi-angle semantic associations.
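A sketch of the multi-head step, assuming PyTorch's nn.MultiheadAttention and a batch of eight entity vectors per text; the mean pooling at the end is an illustrative choice.

import torch
import torch.nn as nn

entity_vectors = torch.randn(2, 8, 100)        # (batch, n_entities, dim)
mha = nn.MultiheadAttention(embed_dim=100, num_heads=4, batch_first=True)
attended, weights = mha(entity_vectors, entity_vectors, entity_vectors)
entity_feature = attended.mean(dim=1)          # pooled entity feature vector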
As shown in FIG. 8, step S6 specifically includes:
S61: converting the speech text data and the image text data into speech text word vectors and image text word vectors.
S62: inputting the speech text word vectors, image text word vectors, and entity feature vectors into a bidirectional long short-term memory network model, to obtain context feature vectors containing the speech text features, image text features, and real-time and historical text feature information.
S63: based on an attention mechanism, distinguishing the self-weight information and/or associated weight information of words and phrases in the text features of the speech text data, image text data, and real-time and historical text data, to obtain the weight information of the text semantic features.
In step S61, in order to convert the text data into a vectorized form that a computer can recognize and process, the speech text data and image text data may be converted into the speech text word vectors and image text word vectors through the Word2Vec algorithm, or the word vectors may be obtained through other existing algorithms in this field such as the GloVe algorithm; the present invention imposes no specific limitation on this.
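A minimal gensim sketch of the Word2Vec conversion; the two toy sentences and all parameters are illustrative.

from gensim.models import Word2Vec

sentences = [["apple", "is", "sweet"], ["keep", "milk", "chilled"]]
model = Word2Vec(sentences, vector_size=100, window=5, min_count=1, sg=1)
vec = model.wv["apple"]       # fixed-length continuous word vector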
In step S62, the bidirectional long short-term memory network (BiLSTM) is composed of a forward long short-term memory network (LSTM) and a backward LSTM. The LSTM model captures long-distance semantic dependencies in text well, and building on it, the BiLSTM model better captures bidirectional text semantics. The speech text word vectors, image text word vectors, and entity feature vectors are input into the BiLSTM model and processed by the forward and backward LSTMs; each direction produces its result vector only after all time steps have been computed, and the two result vectors are then concatenated to output the context feature vector carrying contextual information.
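The concatenation of forward and backward outputs can be seen directly in the output width of a PyTorch BiLSTM; the dimensions below are illustrative.

import torch
import torch.nn as nn

word_vectors = torch.randn(2, 30, 100)         # (batch, seq_len, dim)
bilstm = nn.LSTM(input_size=100, hidden_size=128,
                 batch_first=True, bidirectional=True)
context, _ = bilstm(word_vectors)
print(context.shape)          # (2, 30, 256): 128 forward + 128 backward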
In embodiments of the present invention, neural network models of other structures may also be constructed to transcribe the speech data and video data into the speech text data and video text data; the specific method is not limited.
In step S63, in order to distinguish the self-weight information of different words or phrases in the speech text data, image text data, and real-time and historical text data, as well as the associated weight information between different text data, the speech text context feature vectors, image text context feature vectors, and real-time and historical text context entity feature vectors are respectively input into a multi-head attention mechanism. This yields self-weight feature vectors containing the self-weight information of the speech text, image text, and real-time and historical text semantic features, as well as associated-weight feature vectors containing their semantic association weight information. The contextual information of these texts is thereby fully exploited, compensating for the insufficiency of any single feature in the speech and video data, enriching the semantic representation capacity of the text data, and improving subsequent text classification.
Step S7 specifically includes:
combining the context feature vectors and the weighted text attention feature vectors (including the self-weight text attention feature vectors and the associated-weight text attention feature vectors) through a fully connected layer, then outputting them to a classification function to calculate the text semantic scores of the speech text data and the image text data and their normalized score results, obtaining the classification result information.
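A sketch of this final fusion and scoring step under assumed 256-dimensional features and ten hypothetical classes:

import torch
import torch.nn as nn

context_vec = torch.randn(2, 256)              # context feature vector
self_weight_vec = torch.randn(2, 256)          # self-weight attention vector
assoc_weight_vec = torch.randn(2, 256)         # associated-weight attention vector

fused = torch.cat([context_vec, self_weight_vec, assoc_weight_vec], dim=1)
logits = nn.Linear(3 * 256, 10)(fused)         # fully connected combination
probs = torch.softmax(logits, dim=-1)          # normalized score results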
In summary, by performing the above steps in sequence, the text classification method based on a multi-modal knowledge graph provided by the present invention is obtained. The real-time audio and video data and the real-time and historical text data are acquired and cleaned, and speech and video are separated to produce valid speech data and video data respectively, all of which serve as part of the data set for the pre-training and prediction models, so that text semantic features are captured more comprehensively. In addition, by constructing a multi-channel multi-size deep convolutional network model fusing the connectionist temporal classification method with an attention mechanism, together with a sentence-level video image recognition method based on a temporal deep convolutional neural network model, richer high-level semantic feature information is mined and obtained. Finally, by constructing a context information mechanism and multi-head attention mechanism that fuse speech text data, video text data, and real-time and historical text data, the semantic representation capacity is exploited more fully, the insufficiency of single features in speech and video data is compensated, and the accuracy of text classification is improved. Furthermore, performing computation with configuration data obtained from external storage improves the computational efficiency of the model. The overall model structure has strong semantic representation capacity for text data and, in terms of semantic features, exhibits good complementarity and association, improving the accuracy of text classification.
Step S8 specifically includes:
converting the classification result information into speech for output, and/or converting it into speech and transmitting it to a client terminal for output, and/or converting it into text for output, and/or converting it into text and transmitting it to a client terminal for output, and/or converting it into an image for output, and/or converting it into an image and transmitting it to a client terminal for output.
As described in step S8, in this embodiment, after the classification result information is obtained through the above steps, it may be converted into speech and broadcast through the sound playback device built into the smart refrigerator; or converted into text and displayed directly on the display device of the smart refrigerator; or converted into an image and displayed directly on the refrigerator's large screen. The result information may also be transmitted by voice communication to a client terminal for output, where the client terminal is an electronic device with an information receiving function: for example, the speech may be sent to a mobile phone, smart speaker, or Bluetooth headset for broadcast, or the classification result information may be transmitted as text or images via SMS, e-mail, or similar channels to a client terminal such as a mobile phone or tablet computer, or to application software installed on it, for the user to consult. Multi-channel, multi-type output of classification results is thereby achieved: the user is not limited to obtaining the information near the smart refrigerator and, combined with the multi-channel real-time speech acquisition provided by the present invention, can interact with the smart refrigerator directly and remotely, which is highly convenient and greatly improves the user experience. In other embodiments of the present invention, only one or several of the above output modes may be used, or the classification result information may be output through other channels based on the existing technology; the present invention imposes no specific limitation on this.
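As one hedged example of the speech output path only, the pyttsx3 library can voice a result string on a local audio device; transmitting text or images to a client terminal would instead go through messaging or push APIs, which are not shown, and the result text is hypothetical.

import pyttsx3

engine = pyttsx3.init()
engine.say("Classification result: fresh fruit, subcategory apple")  # hypothetical result text
engine.runAndWait()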
In summary, the present invention provides a text classification method based on a multi-modal knowledge graph. It obtains real-time audio and video data and real-time and historical text data through multiple channels, extracts the corresponding features and fuses them, and makes full use of a neural network model combining deep convolution and recurrence with a multi-head attention mechanism to extract semantic text features and generate text classification results, which are then output through multiple channels. The method not only significantly improves the accuracy of the generated text classification but also makes the interaction between the user and the smart refrigerator more convenient and diversified, greatly improving the user experience.
Based on the same inventive concept, the present invention further provides an electrical device, comprising:
a memory for storing executable instructions;
a processor configured, when running the executable instructions stored in the memory, to implement the above text classification method based on a multi-modal knowledge graph.
Based on the same inventive concept, the present invention further provides a refrigerator, comprising:
a memory for storing executable instructions;
a processor configured, when running the executable instructions stored in the memory, to implement the above text classification method based on a multi-modal knowledge graph.
Based on the same inventive concept, the present invention further provides a computer-readable storage medium storing executable instructions which, when executed by a processor, implement the above text classification method based on a multi-modal knowledge graph.
It should be understood that although this specification is described in terms of embodiments, not every embodiment contains only one independent technical solution; this manner of description is adopted merely for clarity. Those skilled in the art should take the specification as a whole, and the technical solutions in the embodiments may also be suitably combined to form other embodiments understandable to those skilled in the art.
The series of detailed descriptions listed above are merely specific descriptions of feasible embodiments of the present invention and are not intended to limit its scope of protection; all equivalent embodiments or modifications that do not depart from the technical spirit of the present invention shall fall within its scope of protection.

Claims (15)

  1. A text classification method based on a multi-modal knowledge graph, characterized by comprising the steps of:
    obtaining real-time audio and video data, and obtaining real-time and historical text data;
    preprocessing the real-time audio and video data to obtain real-time speech data and real-time video data;
    transcribing the real-time speech data into speech text data, and extracting text features of the speech text data;
    transcribing the real-time video data into image text data, and extracting text features of the image text data;
    extracting entity features of the real-time and historical text data;
    obtaining, according to the real-time speech data text features, real-time video data text features, and entity features, context information of the text data and weight information of the text semantic features;
    combining the context information and the weight information through a fully connected layer and outputting them to a classifier to calculate scores and obtain classification result information;
    outputting the classification result information.
  2. The text classification method based on a multi-modal knowledge graph according to claim 1, characterized in that said "preprocessing the real-time audio and video data to obtain real-time speech data and video data" specifically comprises:
    performing data cleaning, format parsing, format conversion, and data storage on the real-time audio and video data to obtain valid audio and video data;
    separating the valid audio and video data into speech and video using a script or third-party tool, to obtain the real-time speech data and real-time video data;
    preprocessing the real-time speech data and video data, including: framing and windowing the real-time speech data, and cropping and framing the real-time video data;
    preprocessing the real-time and historical text data, including: word segmentation, stop-word removal, and duplicate-word removal.
  3. The text classification method based on a multi-modal knowledge graph according to claim 1, characterized in that said "transcribing the real-time speech data into speech text data" specifically comprises:
    extracting features of the real-time speech data to obtain speech features;
    inputting the speech features into a multi-channel multi-size deep convolutional neural network model for speech recognition, which transcribes them into first speech text data;
    outputting the alignment relationship between the speech features and the first speech text data based on the connectionist temporal classification method, to obtain second speech text data;
    based on an attention mechanism, obtaining key features of the second speech text data or weight information of the key features;
    combining the second speech text data and its key features or their weight information through a fully connected layer, and then calculating scores through a classification function to obtain the speech text data.
  4. The text classification method based on a multi-modal knowledge graph according to claim 3, characterized in that said "extracting features of the real-time speech data" specifically comprises:
    extracting features of the real-time speech data to obtain its Mel-frequency cepstral coefficient features.
  5. The text classification method based on a multi-modal knowledge graph according to claim 1, characterized in that said "transcribing the real-time video data into image text data" specifically comprises:
    inputting the real-time video data into a 3D deep convolutional neural network for computation, to obtain image features;
    inputting the image features into a multi-channel multi-size temporal convolutional network for transcription, to obtain first image text data;
    outputting the alignment relationship between the image features and the first image text data based on the connectionist temporal classification method, to obtain second image text data;
    combining the second image text data through a fully connected layer and then calculating scores through a classification function to obtain the image text data.
  6. The text classification method based on a multi-modal knowledge graph according to claim 1, characterized in that said "extracting entity features of the real-time and historical text data" specifically comprises:
    performing entity extraction on the text data using an entity linking method, to obtain multiple ingredient entities;
    querying the ingredient knowledge graph based on each ingredient entity, to obtain the corresponding entity vector representation;
    inputting the entity vector representations into a multi-head attention mechanism for computation, to obtain entity feature vectors.
  7. The text classification method based on a multi-modal knowledge graph according to claim 6, characterized in that said "querying the ingredient knowledge graph based on each ingredient entity to obtain the corresponding entity vector representation" specifically comprises:
    converting the entities into corresponding entity vector representations in the form of entity triples;
    implementing the entity vector representations using the distributed vector representation method of neural networks.
  8. The text classification method based on a multi-modal knowledge graph according to claim 1, characterized in that said "obtaining, according to the real-time speech data text features, real-time video data text features, and entity features, context information of the text data and weight information of the text semantic features" specifically comprises:
    converting the real-time speech text features and real-time video text features into speech text word vectors and image text word vectors;
    inputting the speech text word vectors, image text word vectors, and entity features into a bidirectional long short-term memory network model, to obtain context feature vectors containing the speech text features, image text features, and real-time and historical text feature information.
  9. The text classification method based on a multi-modal knowledge graph according to claim 8, characterized in that the method further comprises:
    based on an attention mechanism, distinguishing the self-weight information and/or associated weight information of words and phrases in the text features of the speech text data, image text data, and real-time and historical text data, to obtain the weight information of the text semantic features.
  10. The text classification method based on a multi-modal knowledge graph according to claim 9, characterized in that said "based on an attention mechanism, distinguishing the self-weight information and/or associated weight information of words and phrases in the text features of the speech text data, image text data, and real-time and historical text data" specifically comprises:
    inputting the speech text context feature vectors, image text context feature vectors, and real-time and historical text entity feature vectors into a multi-head attention mechanism respectively;
    obtaining self-weight text attention feature vectors containing the self-weight information of the speech text semantic features, image text semantic features, and real-time and historical text semantic features;
    obtaining associated-weight text attention feature vectors containing the associated weight information of the speech text semantic features, image text semantic features, and real-time and historical text semantic features.
  11. The text classification method based on a multi-modal knowledge graph according to claim 10, characterized in that said "combining the context information and the weight information through a fully connected layer and outputting them to a classifier to calculate scores and obtain classification result information" specifically comprises:
    combining the context feature vectors and the weighted text attention feature vectors through a fully connected layer, then outputting them to a classification function to calculate the text semantic scores of the speech text data, image text data, and real-time and historical text data and their normalized score results, obtaining the classification result information of the text.
  12. The text classification method based on a multi-modal knowledge graph according to claim 1, characterized in that said "transcribing the speech data into speech text data, and extracting text features of the speech text data" further comprises:
    obtaining configuration data stored in an external cache, and performing the multi-channel multi-size deep convolutional neural network model computation on the speech data based on the configuration data, to carry out text transcription and extract text features.
  13. An electrical device, characterized by comprising:
    a memory for storing executable instructions;
    a processor configured, when running the executable instructions stored in the memory, to implement the text classification method based on a multi-modal knowledge graph according to any one of claims 1 to 12.
  14. A refrigerator, characterized by comprising:
    a memory for storing executable instructions;
    a processor configured, when running the executable instructions stored in the memory, to implement the text classification method based on a multi-modal knowledge graph according to any one of claims 1 to 12.
  15. A computer-readable storage medium storing executable instructions, characterized in that the executable instructions, when executed by a processor, implement the text classification method based on a multi-modal knowledge graph according to any one of claims 1 to 12.
PCT/CN2023/140835 2022-12-31 2023-12-22 Text classification method based on multi-modal knowledge graph, and device and storage medium WO2024140434A1 (en)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN202211736562.3 2022-12-31
CN202211736562.3A CN116186258A (en) 2022-12-31 2022-12-31 Text classification method, device and storage medium based on multimodal knowledge graph

Publications (1)

Publication Number Publication Date
WO2024140434A1 true WO2024140434A1 (en) 2024-07-04

Family

ID=86441578

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2023/140835 WO2024140434A1 (en) 2022-12-31 2023-12-22 Text classification method based on multi-modal knowledge graph, and device and storage medium

Country Status (2)

Country Link
CN (1) CN116186258A (en)
WO (1) WO2024140434A1 (en)

Cited By (12)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN118710993A (en) * 2024-08-27 2024-09-27 南京航空航天大学 A tree species classification method and system based on Sentinel-2 data and CKAN-LSTM-MHA network for spatiotemporal and spectral integration
CN118734947A (en) * 2024-09-04 2024-10-01 湖北大学 Knowledge graph completion method and device based on attention penalty and noise sampling
CN118865977A (en) * 2024-09-26 2024-10-29 深圳市江智工业技术有限公司 Multimodal human-computer collaborative interaction system and method
CN118916499A (en) * 2024-10-09 2024-11-08 新瑞数城技术有限公司 Query method integrating AI large model and knowledge graph
CN118966226A (en) * 2024-10-17 2024-11-15 时代新媒体出版社有限责任公司 A method and system for extracting entity relations from ancient book texts based on deep learning
CN119167937A (en) * 2024-10-08 2024-12-20 广西警察学院 A multimodal entity recognition method based on large language model
CN119170182A (en) * 2024-11-19 2024-12-20 吉林大学第一医院 Full-course follow-up management system and method for obstetrics and gynecology patients
CN119227794A (en) * 2024-09-29 2024-12-31 广西警察学院 Multimodal document structured processing and knowledge extraction method based on large language model
CN119918016A (en) * 2025-04-03 2025-05-02 广东科学技术职业学院 Method, device, terminal equipment and storage medium for supervising student learning status based on artificial intelligence
CN120050004A (en) * 2025-04-25 2025-05-27 深圳大学 A semantic communication retransmission method, system, terminal and storage medium based on long short-term memory network
CN120144810A (en) * 2025-05-15 2025-06-13 北京邮电大学 Dish analysis method and device based on multi-mode knowledge graph, equipment and medium
CN120145605A (en) * 2025-05-15 2025-06-13 建设综合勘察研究设计院有限公司 Intelligent construction method, system and device for underground pipe network dynamic information model

Families Citing this family (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN116186258A (en) * 2022-12-31 2023-05-30 青岛海尔电冰箱有限公司 Text classification method, device and storage medium based on multimodal knowledge graph
CN116910244A (en) * 2023-06-12 2023-10-20 青岛海尔电冰箱有限公司 Text classification method and device for multi-mode data, refrigeration equipment and medium
CN117725995B (en) * 2024-02-18 2024-05-24 青岛海尔科技有限公司 A method, device and medium for constructing knowledge graph based on large model
CN118940761B (en) * 2024-07-23 2025-01-28 上海烜翊科技有限公司 A Model-Based Document Generation Method
CN119622039A (en) * 2024-11-27 2025-03-14 中国数字文化集团有限公司 A deployment method and application terminal based on big data model in digital culture field

Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111259215A (en) * 2020-02-14 2020-06-09 北京百度网讯科技有限公司 Multi-modal-based topic classification method, device, equipment and storage medium
CN113094509A (en) * 2021-06-08 2021-07-09 明品云(北京)数据科技有限公司 Text information extraction method, system, device and medium
CN114944156A (en) * 2022-05-20 2022-08-26 青岛海尔电冰箱有限公司 Item classification method, device, equipment and storage medium based on deep learning
CN115062143A (en) * 2022-05-20 2022-09-16 青岛海尔电冰箱有限公司 Speech recognition and classification method, device, equipment, refrigerator and storage medium
CN115098765A (en) * 2022-05-20 2022-09-23 青岛海尔电冰箱有限公司 Information pushing method, device and equipment based on deep learning and storage medium
CN116186258A (en) * 2022-12-31 2023-05-30 青岛海尔电冰箱有限公司 Text classification method, device and storage medium based on multimodal knowledge graph

Family Cites Families (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110287389A (en) * 2019-05-31 2019-09-27 南京理工大学 A Multimodal Sentiment Classification Method Based on Fusion of Text, Speech and Video
CN112200317B (en) * 2020-09-28 2024-05-07 西南电子技术研究所(中国电子科技集团公司第十研究所) Multi-mode knowledge graph construction method
CN113936637B (en) * 2021-10-18 2025-04-18 上海交通大学 Speech adaptive completion system based on multimodal knowledge graph


Also Published As

Publication number Publication date
CN116186258A (en) 2023-05-30

Similar Documents

Publication Publication Date Title
WO2024140434A1 (en) Text classification method based on multi-modal knowledge graph, and device and storage medium
WO2024140432A1 (en) Ingredient recommendation method based on knowledge graph, and device and storage medium
CN110909613B (en) Video character recognition method and device, storage medium and electronic equipment
CN112052333B (en) Text classification method and device, storage medium and electronic equipment
CN113408385A (en) Audio and video multi-mode emotion classification method and system
WO2024140430A9 (en) Text classification method based on multimodal deep learning, device, and storage medium
WO2023222088A1 (en) Voice recognition and classification method and apparatus
CN105512348A (en) Method and device for processing videos and related audios and retrieving method and device
WO2023222089A1 (en) Item classification method and apparatus based on deep learning
CN116955699A (en) A video cross-modal search model training method, search method and device
WO2023222090A1 (en) Information pushing method and apparatus based on deep learning
CN115798459B (en) Audio processing method and device, storage medium and electronic equipment
CN118155623B (en) Speech recognition method based on artificial intelligence
CN117077787A (en) Text generation method and device, refrigerator and storage medium
WO2025001000A1 (en) Cognitive test method, cognitive test apparatus, electronic device and storage medium
CN118916443A (en) Information retrieval method and device and electronic equipment
CN119004381A (en) Multi-mode large model synchronous training and semantic association construction system and training method thereof
CN118520091A (en) Multi-mode intelligent question-answering robot and construction method thereof
CN118802398A (en) Meeting minutes generation method, device, storage medium and electronic device
US20240244290A1 (en) Video processing method and apparatus, device and storage medium
WO2024114303A1 (en) Phoneme recognition method and apparatus, electronic device and storage medium
WO2024093578A1 (en) Voice recognition method and apparatus, and electronic device, storage medium and computer program product
CN116719931A (en) Text classification methods and refrigerators
CN114155454B (en) Video processing method, device and storage medium
CN118608012B (en) A quantitative method for online user interaction experience quality integrating large models and knowledge graphs

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 23910360

Country of ref document: EP

Kind code of ref document: A1