
CN115604475A - Multi-mode information source joint coding method - Google Patents


Info

Publication number
CN115604475A
CN115604475A
Authority
CN
China
Prior art keywords
modal
knowledge base
sources
information sources
different
Prior art date
Legal status
Granted
Application number
CN202210969884.6A
Other languages
Chinese (zh)
Other versions
CN115604475B (en)
Inventor
宋晓丹
李甫
高大化
谢雪梅
石光明
Current Assignee
Xidian University
Original Assignee
Xidian University
Priority date
Filing date
Publication date
Application filed by Xidian University filed Critical Xidian University
Priority to CN202210969884.6A (granted as CN115604475B)
Publication of CN115604475A
Priority to PCT/CN2023/098536 (published as WO2024032119A1)
Application granted
Publication of CN115604475B
Legal status: Active

Classifications

    • HELECTRICITY
    • H04ELECTRIC COMMUNICATION TECHNIQUE
    • H04NPICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N19/00Methods or arrangements for coding, decoding, compressing or decompressing digital video signals
    • H04N19/90Methods or arrangements for coding, decoding, compressing or decompressing digital video signals using coding techniques not provided for in groups H04N19/10-H04N19/85, e.g. fractals
    • H04N19/91Entropy coding, e.g. variable length coding [VLC] or arithmetic coding
    • HELECTRICITY
    • H04ELECTRIC COMMUNICATION TECHNIQUE
    • H04NPICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N19/00Methods or arrangements for coding, decoding, compressing or decompressing digital video signals
    • H04N19/10Methods or arrangements for coding, decoding, compressing or decompressing digital video signals using adaptive coding
    • H04N19/134Methods or arrangements for coding, decoding, compressing or decompressing digital video signals using adaptive coding characterised by the element, parameter or criterion affecting or controlling the adaptive coding
    • H04N19/136Incoming video signal characteristics or properties
    • HELECTRICITY
    • H04ELECTRIC COMMUNICATION TECHNIQUE
    • H04NPICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N19/00Methods or arrangements for coding, decoding, compressing or decompressing digital video signals
    • H04N19/40Methods or arrangements for coding, decoding, compressing or decompressing digital video signals using video transcoding, i.e. partial or full decoding of a coded input stream followed by re-encoding of the decoded output stream

Landscapes

  • Engineering & Computer Science (AREA)
  • Multimedia (AREA)
  • Signal Processing (AREA)
  • Compression Or Coding Systems Of Tv Signals (AREA)

Abstract

A multimodal information source joint encoding method. First, multiple modal sources are passed through corresponding first encoders, which extract features and remove the redundancy internal to each modal signal, yielding corresponding feature maps. The feature maps are then concatenated and fed into a second encoder, which decouples them into a common feature map and individual feature maps: the common feature map represents what the different modal sources share, while each individual feature map represents the features unique to one modal source. Finally, the individual feature maps and the common feature map of the multiple modal sources are entropy-coded and converted into binary bitstreams for storage or transmission; at the decoding end, the bitstreams are entropy-decoded and passed through the corresponding decoders to reconstruct the corresponding modal sources. By exploiting the correlation between different sources, the invention avoids repeated transmission of shared information, reducing transmission bandwidth and storage space; the decoding end can reconstruct the different modal sources as needed, i.e., the method is modality-scalable.

Figure 202210969884

Description

A Joint Coding Method for Multimodal Information Sources

Technical Field

The present invention relates to the technical field of information source coding, and in particular to a multimodal information source joint coding method.

Background

As a foundational technology, source coding is widely used across many fields. Source coding is a product of the combination of multimedia and Internet technology in the information age; its goal is to represent a source with the fewest possible bits, either losslessly or under an allowed amount of distortion. Efficient source coding can greatly improve the quality of the decoded source under limited bandwidth and reduce storage requirements. Depending on the input, there currently exist text compression, image compression (standards such as PNG, BMP, JPEG, BPG, and WEBP), video compression (H.264/AVC, H.265/HEVC, H.266/VVC, VP9, AV1, AVS1, AVS2, AVS3, etc.), audio coding (such as AAC), and so on. These standards share a common limitation: each targets a single type of input. Text compression handles only text, image compression only images, video compression only images or video, and audio coding only audio; other input forms cannot be processed, or require pre-processing and are handled inefficiently. For example, a video coding standard cannot directly compress text. Although text can be organized into a video-like form through preprocessing, the result differs greatly from a normal video and has no physical meaning; the tools in video coding standards were not designed for such abnormal signals, so even forced encoding would be inefficient.

In practice, data from several modalities are often combined in a single presentation. For example, TV series and films typically contain three modalities: video, audio, and subtitles. Under the standards above, current schemes encode the three modalities separately. In reality, however, the three modal signals are correlated, i.e., there is a certain degree of redundancy between them, and existing independent coding methods cannot eliminate this redundancy, which wastes bandwidth and storage space. A method that jointly encodes signals of multiple modalities is therefore needed, to remove the correlation between different modal signals and reduce redundancy, thereby saving bandwidth and storage space.

Summary of the Invention

To overcome the above shortcomings of the prior art, the purpose of the present invention is to provide a multimodal information source joint coding method that exploits the correlation between different sources during encoding and compression, reducing the repeated transmission of shared information and thereby lowering transmission bandwidth and storage space. The decoding end can reconstruct different modal sources as needed, i.e., the method is modality-scalable.

To achieve the above purpose, the present invention adopts the following technical scheme.

A multimodal information source joint coding method, comprising the following steps:

1) Pass multiple modal sources through their corresponding first encoders to extract features and remove the redundancy internal to each modal signal, obtaining corresponding feature maps.

2) To remove the correlation between different modal signals, concatenate the feature maps and feed them into a second encoder, which decouples them into a common feature map and individual feature maps. The common feature map represents what the different modal sources share; each individual feature map represents the features unique to one modal source.

3) Entropy-code the individual feature maps and the common feature map of the multiple modal sources and convert them into binary bitstreams for storage or transmission. At the decoding end, the bitstreams are entropy-decoded and then passed through the corresponding decoders to reconstruct the corresponding modal sources.
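The three steps above can be sketched end to end. The sketch below is purely illustrative: toy random linear maps in NumPy stand in for the unspecified neural encoders/decoders, and all names (`enc_a`, `enc_c`, `dec_a`) and shapes are assumptions, not the patent's implementation:

```python
import numpy as np

rng = np.random.default_rng(0)

# Step 1: per-modality first encoders remove intra-modal redundancy.
# Toy stand-ins: random linear maps from raw signals to feature maps.
enc_a = rng.standard_normal((16, 8))   # "first encoder A" for modality 1
enc_b = rng.standard_normal((20, 8))   # "first encoder B" for modality 2

src1 = rng.standard_normal(16)         # modality-1 source
src2 = rng.standard_normal(20)         # modality-2 source
feat1, feat2 = src1 @ enc_a, src2 @ enc_b

# Step 2: concatenate the feature maps and "decouple" them with a
# second encoder C into a common part and two individual parts.
joint = np.concatenate([feat1, feat2])           # shape (16,)
enc_c = rng.standard_normal((16, 12))
code = joint @ enc_c
featc, feati1, feati2 = code[:4], code[4:8], code[8:]

# Step 3: each feature group would be entropy-coded into a bitstream;
# rounding to integers here is a stand-in for quantization + entropy coding.
bitstream = {k: np.round(v).astype(int)
             for k, v in [("featc", featc), ("feati1", feati1), ("feati2", feati2)]}

# Decoder A reconstructs modality 1 from (feati1, featc); decoder B
# would likewise reconstruct modality 2 from (feati2, featc).
dec_a = rng.standard_normal((8, 16))
src1_hat = np.concatenate([bitstream["feati1"], bitstream["featc"]]) @ dec_a
print(src1_hat.shape)  # → (16,)
```

Because both decoders share `featc`, the common information is stored once rather than once per modality, which is the source of the bandwidth saving claimed above.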

A knowledge base is introduced to jointly encode the multimodal sources. The knowledge base may be multimodal or unimodal; a multimodal knowledge base stores information in multiple different forms originating from different modal sources. One or more modal sources undergo "modality parsing" to obtain an index into the knowledge base; the purpose of modality parsing is to obtain knowledge-base node entities for query and reasoning.

In one form, the multimodal knowledge base contains text and images and is represented by nodes and edges: each node represents an entity, a piece of text, or an image, and each edge represents a relationship between nodes.

The beneficial effects of the present invention are as follows. The invention proposes a multimodal information source joint coding method that represents each modal source as common features and individual features; the common features are shared across the different modal sources, which enables joint coding of multiple modal sources. Compared with encoding each modal source independently, the invention exploits the correlation between sources during encoding and compression, reducing the repeated transmission of shared information and thereby lowering transmission bandwidth and storage space. In addition, the decoding end can reconstruct different modal sources as needed, i.e., the method has the advantage of modality scalability.

On the basis of the above multimodal joint coding method, the present invention introduces a knowledge base containing known information strongly correlated with the sources to be encoded. This adds prior knowledge and explicitly links sources of different modalities, and during encoding the prior knowledge in the knowledge base guides the multimodal encoding process. Compared with multimodal joint coding without a knowledge base, this further saves storage space and reduces bandwidth.

Brief Description of the Drawings

Figure 1 is a flowchart of a multimodal information source joint coding method according to Embodiment 1 of the present invention.

Figure 2 is a flowchart of a knowledge-base-assisted multimodal information source joint coding method according to Embodiment 2 of the present invention.

Figure 3 shows the image-and-text multimodal knowledge base of Embodiment 2 of the present invention.

Figure 4 is a flowchart of a knowledge-base-assisted multimodal information source joint coding method according to Embodiment 3 of the present invention.

Detailed Description

The present invention is described in detail below with reference to the accompanying drawings and embodiments.

Embodiment 1 gives an example with two sources as input. The multimodal information source joint coding method comprises the following steps.

1) Given two modal sources, "modality 1" and "modality 2", denoted src1 and src2, the two modal signals pass through first encoder A and first encoder B respectively to extract features and remove the redundancy internal to each modal signal, yielding feature maps feat1 and feat2. First encoder A and first encoder B are not restricted to any particular architecture: they may be convolutional neural networks (CNNs) or recurrent neural networks (RNNs). The feature maps feat1 and feat2 may be one-dimensional vectors, two-dimensional matrices, or even higher-dimensional tensors.

2) To remove the correlation between the two modal signals, the two feature maps are concatenated and fed into second encoder C, which decouples them into a common feature map and individual feature maps. The common feature map represents what the different modal sources share, typically at the semantic level; each individual feature map represents the features unique to one modal source. Taking video and audio sources as an example, a common feature might be the words spoken by a person in the video, which are usually also present in the audio; the video's individual features might be the person's appearance or background content such as flowers and plants, while the audio's individual features might include unrelated sounds or qualities such as tone of voice that video usually cannot convey.

This embodiment decouples common and individual features, outputting modality 1's individual features feati1, the common features featc of the two modalities, and modality 2's individual features feati2. Second encoder C may include a quantization step to realize lossy coding; its structure is not restricted and may be a CNN or an RNN, and may also include a hyper-prior model. Note also that the internal structure of the three feature groups feati1, featc, and feati2 need not be the same: for example, feati1 may contain side information featis1 as well as features featii1, where the side information featis1 is used to assist the generation of featii1; the same holds for featc and feati2.
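The side-information structure described here (featis1 assisting the generation of featii1, as in a hyper-prior model) can be sketched numerically. The Gaussian parameterisation below and all shapes are assumptions for illustration only:

```python
import numpy as np

rng = np.random.default_rng(1)

# feati1 is split into side information (featis1) and main features
# (featii1); the side information parameterises a Gaussian model of
# the main features, in the spirit of hyper-prior entropy models.
featis1 = rng.standard_normal((2, 4))    # row 0 -> means, row 1 -> log-scales
featii1 = rng.standard_normal(4)

mu = featis1[0]
sigma = np.exp(featis1[1])               # strictly positive std deviations

# Shannon-entropy estimate of the bits needed to code featii1 under the
# side-information-conditioned Gaussian (continuous surrogate: the
# negative log-likelihood in nats, converted to bits).
nll_nats = 0.5 * ((featii1 - mu) / sigma) ** 2 + np.log(sigma * np.sqrt(2 * np.pi))
rate_bits = float(nll_nats.sum() / np.log(2))
print(rate_bits)
```

The point of the split is that coding the small side information featis1 buys a sharper probability model for featii1, lowering its entropy-coded bit cost.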

3) The three feature groups featc, feati1, and feati2 are each entropy-coded and converted into binary bitstreams for storage or transmission. At the decoding end, the bitstreams are entropy-decoded to recover feati1, featc, and feati2; feati1 and featc are then jointly input to decoder A to reconstruct modality 1, denoted ŝrc1, and feati2 and featc are jointly input to decoder B to reconstruct modality 2, denoted ŝrc2.

The above is the flow at test time. During training, only paired multimodal data is required, and the encoders and decoders of the multiple modalities are trained together end to end. The loss function (given in the original as an equation figure, not reproduced here) combines the reconstruction-quality terms of the two modalities with the code-rate terms of the transmitted features, weighted by hyperparameters λ1, λ3, and λ as described next.

Here quality1(·,·) and quality2(·,·) measure the quality loss of modality 1 and modality 2 caused by encoding; for video or images, for example, PSNR (peak signal-to-noise ratio), MS-SSIM (multi-scale structural similarity), or a perceptual loss can be used. The rate terms measure the number of bits consumed when converting the features of modality 1 and modality 2 into binary bitstreams, and can usually be obtained by estimation. For example, in the description above, the three feature groups featc, feati1, and feati2 can be assumed to follow Gaussian distributions, with part of featis1 representing the means of the Gaussians and another part representing the variances, i.e., the encoder adopts a variational autoencoder (VAE) structure; the rates can then be estimated by Shannon entropy. λ1, λ3, and λ in the formula are hyperparameters. λ1 controls the trade-off between the reconstruction quality of modality 1 and modality 2: when smaller distortion is desired for the modality-1 source, λ1 can be set relatively small, and vice versa. λ3 allocates rate between modality 1 and modality 2 under a fixed total bandwidth or storage budget: a larger λ3 favors a higher rate for modality 1 and a lower rate for modality 2, and vice versa. λ controls the trade-off between quality and rate: higher quality usually consumes more bits and lower quality fewer, so λ selects the final rate point. A larger λ selects a lower rate point, suited to lower-bandwidth scenarios, with correspondingly lower reconstruction quality, and vice versa.

Embodiment 2, referring to Figure 2, introduces a knowledge base on the basis of Embodiment 1, enabling more efficient joint coding of the multimodal sources.

The knowledge base in Figure 2 may be either multimodal or unimodal; a multimodal knowledge base stores information in different forms (usually from different modal sources). Figure 3 gives an example of a multimodal knowledge base using text and images, represented by nodes and edges: each node represents an entity, a piece of text, or an image, and each edge represents a relationship between nodes. For example, Claude Shannon was a guest of the World Computer Chess Championship: "Claude Shannon" and "World Computer Chess Championship" are nodes, and the edge "guestOf" expresses their relationship. The lower-right corner of Figure 3 shows an image of Claude Shannon; "Claude Shannon" and his image are connected by the directed edge "imageOf". "Deep Thought" participated in the "World Computer Chess Championship"; the two nodes "Deep Thought" and "World Computer Chess Championship" are linked by the edge "attend" to express this relationship.
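The toy knowledge base of Figure 3 can be written down directly as labelled edges. The sketch below is a minimal illustration; the triple layout, the image asset name, and the edge direction for "imageOf" are assumptions, not the patent's storage format:

```python
# Multimodal knowledge base as (head, relation, tail) triples; a tail
# may be a text entity or a reference to an image asset.
triples = [
    ("Claude Shannon", "guestOf", "World Computer Chess Championship"),
    ("Claude Shannon", "imageOf", "img:claude_shannon.jpg"),  # hypothetical asset id
    ("Deep Thought", "attend", "World Computer Chess Championship"),
]

def query(head, relation):
    """Return all tails linked from `head` by `relation`."""
    return [t for h, r, t in triples if h == head and r == relation]

def neighbors(entity):
    """All facts touching an entity -- a simple stand-in for KB reasoning."""
    return [(h, r, t) for h, r, t in triples if entity in (h, t)]

print(query("Claude Shannon", "guestOf"))
# → ['World Computer Chess Championship']
```

"Modality parsing" in Figure 2 produces entity names like "Claude Shannon"; those names are the indices passed to `query`/`neighbors`, and whatever is retrieved feeds encoder D as knowledge-base features.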

On the basis of Embodiment 1, the modality-1 source undergoes "modality-1 parsing" to obtain an index into the knowledge base, and the modality-2 source likewise undergoes "modality-2 parsing"; either one alone suffices, but using both parsers retrieves more relevant information from the knowledge base and improves robustness, giving a greater boost to the coding efficiency of the multimodal sources. "Modality-1 parsing" and "modality-2 parsing" mainly serve to obtain knowledge-base node entities for query and reasoning. After knowledge-base reasoning and querying, the relevant information is embedded by a third encoder D to obtain knowledge-base features, which are jointly encoded with the source features by second encoder C, removing the redundancy between the source code and the knowledge base and thereby improving coding efficiency. Correspondingly, during decoding, decoder A and decoder B also take the knowledge-base features as input to decode the modality-1 and modality-2 sources.

The purpose of the knowledge base introduced in Embodiment 2 is to add prior knowledge and to explicitly link sources of different modalities.

The specific flow of Embodiment 2 is a multimodal information source joint coding method comprising the following steps.

1) Given two modal sources, "modality 1" and "modality 2", denoted src1 and src2, the two modal signals pass through first encoder A and first encoder B respectively to extract features and remove the redundancy internal to each modal signal, yielding feature maps feat1 and feat2.

The modality-1 source undergoes "modality-1 parsing" and the modality-2 source undergoes "modality-2 parsing" to obtain indexes into the knowledge base; the parsing mainly serves to obtain knowledge-base node entities for query and reasoning. After knowledge-base reasoning and querying, the relevant information is embedded by encoder D to obtain the knowledge-base features.

2) To remove the correlation between the two modal signals, the two feature maps are concatenated and fed into second encoder C, which decouples them into a common feature map and individual feature maps. The common feature map represents what the different modal sources share, typically at the semantic level; each individual feature map represents the features unique to one modal source. Taking video and audio sources as an example, a common feature might be the words spoken by a person in the video, which are usually also present in the audio; the video's individual features might be the person's appearance or background content such as flowers and plants, while the audio's individual features might include unrelated sounds or qualities such as tone of voice that video usually cannot convey.

This embodiment decouples common and individual features, outputting modality 1's individual features feati1, the common features featc of the two modalities, and modality 2's individual features feati2; second encoder C may include a quantization step to realize lossy coding.

The knowledge-base features and the source features are jointly encoded by second encoder C, removing the redundancy between the source code and the knowledge base and thereby improving coding efficiency.

3) The three feature groups featc, feati1, and feati2 are each entropy-coded and converted into binary bitstreams for storage or transmission. At the decoding end, the bitstreams are entropy-decoded to recover feati1, featc, and feati2; feati1 and featc are then jointly input to decoder A to reconstruct modality 1, denoted ŝrc1, and feati2 and featc are jointly input to decoder B to reconstruct modality 2, denoted ŝrc2.

During decoding, decoder A and decoder B also take the knowledge-base features as input to decode the modality-1 and modality-2 sources.

Embodiment 3, referring to Figure 4, gives an embodiment that introduces a knowledge base. The role of the knowledge base is that, given the keyword "Claude Shannon" in the "text" source, querying the knowledge base yields his image, so the image region corresponding to Claude Shannon in the "image" source need not be encoded; image and text can therefore be encoded more efficiently.

Referring to Figure 4, the inputs of this embodiment are the two modal sources "text" and "image", corresponding respectively to "modality 1" and "modality 2" in Figure 2 of Embodiment 2. "Named entity recognition: BERT" on the text source corresponds to "modality-1 parsing": BERT, a technique from natural language processing, can be used to resolve the named entities in the text into entity names such as "Claude Shannon" and "Deep Thought", which are fed into the knowledge base for query and reasoning; after encoding, this yields the knowledge-base features, usually embedded feature vectors. In Figure 4, modality 2 is not parsed, i.e., the "modality-2 parsing" of Figure 2 is not used. On the main branch, the "text" modality passes through a text encoder such as a GRU to produce text features; the "image" modality uses scene-graph generation to detect the objects in the image and establish the relationships between them, and the scene graph passes through a convolutional network to generate an image feature map, labeled the image features. The text and image features are then concatenated and, together with the knowledge-base features, fed into second encoder C, which produces the text individual features, the image individual features, and the common features of text and image. Figure 4 does not show the process of losslessly coding the features into binary bitstreams, nor the decoding of the bitstreams back into the corresponding features. In addition, the parsed "entity names" must also be encoded and transmitted to the decoding end.

At the decoding end, the text individual features, the common features, and the knowledge-base features are fed together into the text decoder, which outputs the text; the image individual features, the common features, and the knowledge-base features are fed together into the image decoder, which outputs the image. As Fig. 4 shows, by introducing the knowledge base, the encoding end need not transmit the part of the image corresponding to Claude Shannon; it only transmits the parsed "Claude Shannon" entity, and the decoding end retrieves the corresponding image from the knowledge base. Likewise, "Edmonton" and "1989" need not be encoded and transmitted, since they can be inferred through the knowledge base. At the encoding end, the individual features of the image mainly cover the clothing, posture, and position of "Feng-hsiung Hsu"; the individual features of the text mainly cover "Feng-hsiung Hsu" and "first prize"; the common features cover information such as "Claude Shannon" and "Deep Thought". Introducing the knowledge base therefore makes the coding more efficient. The training process of this embodiment is similar to that of Embodiment 1, as is the design of the loss function.
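The decoder-side data flow can be sketched in the same illustrative style. Again the random projections are hypothetical stand-ins for the learned text and image decoders; only the inputs (individual + common + knowledge-base features per modality) follow the embodiment.

```python
import numpy as np

def modal_decoder(individual, common, kb_feat, out_dim, seed):
    # Stand-in for a text or image decoder: reconstructs one modality from
    # its individual features, the shared common features, and the
    # knowledge-base features, all concatenated as in Fig. 4.
    z = np.concatenate([individual, common, kb_feat])
    w = np.random.default_rng(seed).standard_normal((z.shape[-1], out_dim))
    return np.tanh(z @ w)

rng = np.random.default_rng(0)
t_ind, i_ind, common, kb = (rng.standard_normal(8) for _ in range(4))

text_out  = modal_decoder(t_ind, common, kb, out_dim=16, seed=1)
image_out = modal_decoder(i_ind, common, kb, out_dim=32, seed=2)
print(text_out.shape, image_out.shape)  # (16,) (32,)
```

Note that the common features and knowledge-base features are shared by both decoders, while each decoder additionally receives only its own modality's individual features.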

Claims (3)

1. A multi-modal joint source coding method, comprising the steps of:
1) passing a plurality of modal sources through their corresponding first encoders to extract features, removing the internal redundancy of each modal source and obtaining a corresponding feature map for each;
2) to remove the correlation among the different modal sources, concatenating the feature maps and feeding them into a second encoder, which decouples them into a common feature map and individual feature maps; the common feature map represents the part shared by the different modal sources, and each individual feature map represents the features unique to one modal source;
3) decoding the individual feature maps and the common feature map with the corresponding decoders to reconstruct the corresponding modal sources; that is, entropy-coding the individual and common feature maps into binary code streams for storage or transmission, entropy-decoding the binary code streams at the decoding end, and recovering the corresponding modal sources through the respective decoders.
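Step 3)'s conversion of feature maps into binary code streams can be illustrated with a toy lossless stage. This is an assumption-laden sketch: zlib stands in for the entropy coder of the claim, and the fixed quantization scale is hypothetical.

```python
import zlib
import numpy as np

SCALE = 64  # hypothetical quantization step: 1/64

def encode_feature_map(feat):
    # Quantize a float feature map and losslessly compress it into a
    # binary code stream (zlib here stands in for the entropy coder).
    q = np.round(feat * SCALE).astype(np.int16)
    return zlib.compress(q.tobytes()), q.shape

def decode_feature_map(stream, shape):
    # Entropy-decode the binary stream and dequantize back to floats.
    q = np.frombuffer(zlib.decompress(stream), dtype=np.int16).reshape(shape)
    return q.astype(np.float64) / SCALE

feat = np.random.default_rng(0).standard_normal((4, 8))
stream, shape = encode_feature_map(feat)
rec = decode_feature_map(stream, shape)
print(np.max(np.abs(rec - feat)) < 1 / SCALE)  # error bounded by the quantization step
```

The compression of the quantized symbols is lossless, as the claim requires; the only distortion comes from the quantization step preceding it.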
2. The method of claim 1, wherein: a knowledge base is introduced and the multi-modal sources are jointly coded with it; the knowledge base is multi-modal or single-modal, a multi-modal knowledge base storing information in a plurality of different forms from different modal sources; one or more of the modal sources obtain, through modal parsing, the indices used to search the knowledge base, the modal parsing yielding knowledge-base node entities for query and reasoning.
3. The method of claim 2, wherein: the multi-modal knowledge base represents text and images with nodes and edges, each node representing an entity, a text, or an image, and each edge representing the relationship between different nodes.
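The node-and-edge knowledge base of claim 3, and the one-hop reasoning that lets the decoder recover "Edmonton" and "1989" without transmitting them, can be sketched as a toy graph. The node payloads, file names, and relation labels here are hypothetical illustrations, not part of the patent.

```python
# Toy multi-modal knowledge base: each node carries an entity with textual
# and/or image payloads; each edge carries a typed relation between nodes.
kb_nodes = {
    "Claude Shannon": {"text": "information theorist", "image": "shannon.png"},
    "Deep Thought":   {"text": "chess computer",       "image": "deep_thought.png"},
    "Edmonton":       {"text": "city in Canada",       "image": None},
}
kb_edges = [
    ("Deep Thought", "won_title_in",   "Edmonton"),
    ("Deep Thought", "won_title_year", "1989"),
]

def query(entity, relation):
    # One-hop reasoning: follow a typed edge out of the queried node.
    for src, rel, dst in kb_edges:
        if src == entity and rel == relation:
            return dst
    return None

print(query("Deep Thought", "won_title_in"))  # Edmonton
```

Given only the parsed entity name "Deep Thought", a decoder holding this graph can infer the related facts by following edges, which is what allows the encoder to omit them from the code stream.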
CN202210969884.6A 2022-08-12 2022-08-12 Multi-mode information source joint coding method Active CN115604475B (en)

Priority Applications (2)

Application Number Priority Date Filing Date Title
CN202210969884.6A CN115604475B (en) 2022-08-12 2022-08-12 Multi-mode information source joint coding method
PCT/CN2023/098536 WO2024032119A1 (en) 2022-08-12 2023-06-06 Joint encoding method for multiple modality information sources

Publications (2)

Publication Number Publication Date
CN115604475A true CN115604475A (en) 2023-01-13
CN115604475B CN115604475B (en) 2025-06-10

Family

ID=84843969

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202210969884.6A Active CN115604475B (en) 2022-08-12 2022-08-12 Multi-mode information source joint coding method

Country Status (2)

Country Link
CN (1) CN115604475B (en)
WO (1) WO2024032119A1 (en)

Cited By (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2024032119A1 (en) * 2022-08-12 2024-02-15 西安电子科技大学 Joint encoding method for multiple modality information sources
WO2024183179A1 (en) * 2023-03-04 2024-09-12 北京邮电大学 Model-based information traffic service providing method and system, and device and medium
WO2024221743A1 (en) * 2023-04-24 2024-10-31 北京邮电大学 Service information transmission method for intelligent network air interface, and related device
WO2025026013A1 (en) * 2023-08-02 2025-02-06 阿里巴巴(中国)有限公司 Multi-modal task processing method and system, multi-modal dialogue task processing method and system, and device

Citations (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN101141644A (en) * 2007-10-17 2008-03-12 清华大学 Coding integration system and method and decoding integration system and method
CN110928961A (en) * 2019-11-14 2020-03-27 出门问问(苏州)信息科技有限公司 Multi-mode entity linking method, equipment and computer readable storage medium
CN112151030A (en) * 2020-09-07 2020-12-29 中国人民解放军军事科学院国防科技创新研究院 Multi-mode-based complex scene voice recognition method and device
US20210281867A1 (en) * 2020-03-03 2021-09-09 Qualcomm Incorporated Video compression using recurrent-based machine learning systems
CN114520915A (en) * 2017-01-12 2022-05-20 索尼公司 Image processing apparatus, image processing method, and computer-readable recording medium
CN114567781A (en) * 2020-11-27 2022-05-31 安徽寒武纪信息科技有限公司 Method, device, electronic equipment and storage medium for coding and decoding video image
US20220210111A1 (en) * 2020-12-29 2022-06-30 Meta Platforms, Inc. Generating Context-Aware Rendering of Media Contents for Assistant Systems

Family Cites Families (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112119411A (en) * 2018-05-14 2020-12-22 宽腾矽公司 System and method for integrating statistical models of different data modalities
JP6879344B2 (en) * 2019-08-22 2021-06-02 三菱電機株式会社 Decryptor
CN110807122B (en) * 2019-10-18 2022-07-08 浙江大学 Image-text cross-modal feature disentanglement method based on depth mutual information constraint
CN112800292B (en) * 2021-01-15 2022-10-11 南京邮电大学 Cross-modal retrieval method based on modal specific and shared feature learning
CN113591902B (en) * 2021-06-11 2022-09-27 中国科学院自动化研究所 Cross-modal understanding and generation method and device based on multimodal pre-training model
CN114202024A (en) * 2021-12-06 2022-03-18 深圳市安软科技股份有限公司 A training method, system and related equipment for a multimodal decoupling generative model
CN115604475B (en) * 2022-08-12 2025-06-10 西安电子科技大学 Multi-mode information source joint coding method



Also Published As

Publication number Publication date
WO2024032119A1 (en) 2024-02-15
CN115604475B (en) 2025-06-10

Similar Documents

Publication Publication Date Title
CN115604475A (en) Multi-mode information source joint coding method
US11115663B2 (en) Codebook generation for cloud-based video applications
CN113747163B (en) Image coding and decoding method and compression method based on context recombination modeling
CN110290387A (en) A Generative Model-Based Image Compression Method
CN100403801C (en) A context-based adaptive entropy encoding/decoding method
CN114461816B (en) Knowledge-graph-based implementation method of information supplement semantic communication system
CN116233445B (en) Video encoding and decoding processing method and device, computer equipment and storage medium
CN113132735A (en) Video coding method based on video frame generation
CN116437089B (en) A Deep Video Compression Method Based on Key Object
Zhang et al. Learned scalable image compression with bidirectional context disentanglement network
CN117692662A (en) A point cloud encoding and decoding method based on double octree structure
CN113641846A (en) A Cross-modal Retrieval Model Based on Strong Representation Deep Hashing
Saravanan et al. Enhancing efficiency of huffman coding using Lempel Ziv coding for image compression
Gao et al. Cross modal compression with variable rate prompt
CN112866715B (en) Universal video compression coding system supporting man-machine hybrid intelligence
CN103533353A (en) Approximate video encoding system
CN118101961A (en) Quality scalable video coding method based on implicit neural representation
CN117197262B (en) Semantic scalable image coding method, system, device and storage medium
KR20240137005A (en) Data processing methods, devices and media
CN115866266A (en) A multi-bit-rate deep image compression system and method applied in mixed context
CN115379291A (en) Code table updating method, device, equipment and storage medium
Sangeeta et al. Comprehensive Analysis of Flow Incorporated Neural Network-based Lightweight Video Compression Architecture
CN103200408B (en) A kind of video coding-decoding method
KR102072576B1 (en) Apparatus and method for encoding and decoding of data
Luo Compressible and searchable: Ai-native multi-modal retrieval system with learned image compression

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant