CN116582686A - Video coding method, device, equipment and storage medium based on AI
- Publication number
- CN116582686A (application number CN202310396495.3A)
- Authority
- CN
- China
- Prior art keywords
- key point
- frame
- driving
- point information
- reference frame
- Prior art date
- Legal status
- Pending
Classifications
-
- H—ELECTRICITY
- H04—ELECTRIC COMMUNICATION TECHNIQUE
- H04N—PICTORIAL COMMUNICATION, e.g. TELEVISION
- H04N19/00—Methods or arrangements for coding, decoding, compressing or decompressing digital video signals
- H04N19/85—Methods or arrangements for coding, decoding, compressing or decompressing digital video signals using pre-processing or post-processing specially adapted for video compression
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/04—Architecture, e.g. interconnection topology
- G06N3/0475—Generative networks
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/08—Learning methods
- G06N3/094—Adversarial learning
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06T—IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
- G06T9/00—Image coding
- G06T9/002—Image coding using neural networks
-
- H—ELECTRICITY
- H04—ELECTRIC COMMUNICATION TECHNIQUE
- H04N—PICTORIAL COMMUNICATION, e.g. TELEVISION
- H04N19/00—Methods or arrangements for coding, decoding, compressing or decompressing digital video signals
- H04N19/10—Methods or arrangements for coding, decoding, compressing or decompressing digital video signals using adaptive coding
- H04N19/169—Methods or arrangements for coding, decoding, compressing or decompressing digital video signals using adaptive coding characterised by the coding unit, i.e. the structural portion or semantic portion of the video signal being the object or the subject of the adaptive coding
- H04N19/17—Methods or arrangements for coding, decoding, compressing or decompressing digital video signals using adaptive coding characterised by the coding unit, i.e. the structural portion or semantic portion of the video signal being the object or the subject of the adaptive coding the unit being an image region, e.g. an object
- H04N19/172—Methods or arrangements for coding, decoding, compressing or decompressing digital video signals using adaptive coding characterised by the coding unit, i.e. the structural portion or semantic portion of the video signal being the object or the subject of the adaptive coding the unit being an image region, e.g. an object the region being a picture, frame or field
-
- H—ELECTRICITY
- H04—ELECTRIC COMMUNICATION TECHNIQUE
- H04N—PICTORIAL COMMUNICATION, e.g. TELEVISION
- H04N19/00—Methods or arrangements for coding, decoding, compressing or decompressing digital video signals
- H04N19/42—Methods or arrangements for coding, decoding, compressing or decompressing digital video signals characterised by implementation details or hardware specially adapted for video compression or decompression, e.g. dedicated software implementation
Abstract
The embodiments of the present application disclose an AI-based video encoding method, apparatus, device and storage medium. The method comprises: acquiring a video to be encoded; outputting key point information for the source reference frame and the driving frames of the video to be encoded by means of a key point detection network; determining a key point information compression result for each driving frame according to the key point information of the source reference frame and a preset compression rule; and generating code stream data for transmission based on the source reference frame, the key point information of the source reference frame, and the key point information compression results of the driving frames. The scheme achieves ultra-low bit rate video encoding: by deeply compressing the video data and combining the encoding process with an AI model, a better video compression effect can be obtained; at the same time, the scheme widens the application range of AI video encoding models and reduces the difficulty and cost of model development and training.
Description
Technical Field
The embodiments of the present application relate to the technical field of video processing, and in particular to an AI-based video encoding method, apparatus, device and storage medium.
Background
With the continued development of the internet industry, video is used ever more widely, for example in intelligent security, autonomous driving, smart cities and the industrial internet. More and more fields use video as an auxiliary means in their work, and more and more people treat video sharing and viewing as entertainment. Because the data volume of raw video is very large, the video must be encoded before it can be successfully transmitted and presented on terminal devices. Research into video encoding techniques, in particular AI-based video encoding techniques, has therefore become important.
Existing AI video encoding schemes mainly comprise deep video coding (DVC) and machine vision coding. DVC is an end-to-end video encoding model: the entire video compression framework is implemented by neural networks and can be trained as a whole, replacing each module of the traditional encoding framework with a deep neural network to achieve video encoding and image frame reconstruction. Machine vision coding is a video encoding technique aimed at intelligent applications; it combines video encoding and decoding with machine vision analysis, and uses an end-to-end network system to perform the encoding and decoding tasks so that a machine can complete visual tasks.
Because the DVC framework is implemented entirely by neural networks and the deep learning methods it adopts are mainly optimized offline, the model adapts poorly, is complex to deploy and implement, and requires a relatively high transmission bit rate. Meanwhile, a machine vision coding model not only encodes and decodes video but also uses the decoded video to complete machine vision tasks, so the model is highly task-specific and difficult to train. Moreover, existing AI video encoding techniques typically require a large floating-point data representation, which offers no clear advantage over P-frame compression in High Efficiency Video Coding (HEVC). Therefore, when video is encoded with existing AI video encoding technology, the encoding model adapts poorly, model training is difficult, and the transmission bit rate is high.
Disclosure of Invention
The embodiments of the present application provide an AI-based video encoding method, apparatus, device and storage medium, which solve the problems of poor encoding model adaptability, difficult model training and high transmission bit rate when video is encoded with existing AI video encoding technology.
In a first aspect, an embodiment of the present application provides an AI-based video encoding method, including:
acquiring a video to be encoded;
outputting key point information for the source reference frame and the driving frames of the video to be encoded by means of a key point detection network;
determining a key point information compression result for each driving frame according to the key point information of the source reference frame and a preset compression rule; and
generating code stream data for transmission based on the source reference frame, the key point information of the source reference frame, and the key point information compression results of the driving frames.
In a second aspect, an embodiment of the present application further provides an AI-based video encoding apparatus, including:
a video acquisition module, configured to acquire a video to be encoded;
a key point information output module, configured to output key point information for the source reference frame and the driving frames of the video to be encoded by means of a key point detection network;
a key point information compression module, configured to determine a key point information compression result for each driving frame according to the key point information of the source reference frame and a preset compression rule; and
a code stream data transmission module, configured to generate code stream data for transmission based on the source reference frame, the key point information of the source reference frame, and the key point information compression results of the driving frames.
In a third aspect, an embodiment of the present application further provides an AI-based video encoding device, including:
one or more processors;
a storage device for storing one or more programs, wherein
the one or more programs, when executed by the one or more processors, cause the one or more processors to implement the AI-based video encoding method described in the embodiments of the present application.
In a fourth aspect, the embodiments of the present application further provide a storage medium storing computer-executable instructions which, when executed by a computer processor, perform the AI-based video encoding method of the embodiments of the present application.
In a fifth aspect, the embodiment of the present application further provides a computer program product, where the computer program product includes a computer program, where the computer program is stored in a computer readable storage medium, and where at least one processor of the device reads and executes the computer program from the computer readable storage medium, so that the device performs the AI-based video encoding method according to the embodiment of the present application.
In the embodiments of the present application, a video to be encoded is acquired; key point information is output for the source reference frame and the driving frames of the video to be encoded by means of a key point detection network; a key point information compression result is determined for each driving frame according to the key point information of the source reference frame and a preset compression rule; and code stream data is generated for transmission based on the source reference frame, the key point information of the source reference frame, and the key point information compression results of the driving frames. This AI-based video encoding method addresses the poor encoding model adaptability, difficult model training and high transmission bit rate of existing AI video encoding technology. By detecting key points in the source reference frame and the driving frames, compressing the key point information of each driving frame according to the key point information of the source reference frame and a preset compression rule, and transmitting the result as code stream data, the scheme achieves ultra-low bit rate video encoding. Deeply compressing the video data and combining the encoding process with an AI model yields a better video compression effect, while also widening the application range of AI video encoding models and reducing the difficulty and cost of model development and training.
Drawings
Fig. 1 is a flowchart of an AI-based video encoding method according to an embodiment of the present application;
Fig. 2 is a schematic diagram of an AI-based video encoding system according to an embodiment of the present application;
Fig. 3 is a flowchart of an AI-based video encoding method according to an embodiment of the present application;
Fig. 4 is a flowchart of an AI-based video encoding method according to an embodiment of the present application;
Fig. 5 is a flowchart of key point coordinate compression according to an embodiment of the present application;
Fig. 6 is a flowchart of a method for evaluating reconstructed images according to an embodiment of the present application;
Fig. 7 is a block diagram of an AI-based video encoding apparatus according to an embodiment of the present application;
Fig. 8 is a schematic structural diagram of an AI-based video encoding device according to an embodiment of the present application.
Detailed Description
Embodiments of the present application will be described in further detail below with reference to the drawings and examples. It should be understood that the specific embodiments described herein are illustrative only and do not limit the embodiments of the present application. It should further be noted that, for convenience of description, the drawings show only some, not all, of the structures related to the embodiments of the present application.
The terms "first", "second" and the like in the description and in the claims are used to distinguish between similar objects, and not necessarily to describe a particular order or sequence. It is to be understood that the data so used may be interchanged where appropriate, so that the embodiments of the present application can be implemented in orders other than those illustrated or described herein. The objects distinguished by "first", "second", etc. are generally of one type, and the number of objects is not limited; for example, the first object may be one or more. Furthermore, in the description and claims, "and/or" denotes at least one of the connected objects, and the character "/" generally indicates an "or" relationship between the associated objects.
Some compression schemes for generated face video are based on the First Order Motion Model (FOM) framework: the first frame (the reference frame) is encoded by a traditional encoder and transmitted; the key points of subsequent frames are then extracted at the encoding end and transmitted to the decoding end; and the decoding end warps the face of the reference frame and fills in the background according to the relative displacement between the key points of the current frame and those of the reference frame, thereby generating the current frame.
However, this method is only suitable for talking-head videos with simple content and little motion; the quality of the generated current frame drops sharply under large head rotations, face occlusion, newly appearing background objects and the like. Deformation and jitter of the face and missing background objects severely affect the viewing experience. The reference frame then has to be updated frequently to compensate, which clearly increases the bit rate burden.
To explore the compression limits and application scenarios of generative compression schemes, the present application provides an ultra-low bit rate generative encoding method that compresses the original key point data and multiplexes key point data, thereby further reducing the key point information that the encoding end has to transmit. It targets a bit rate reduced to at least 1/10 of that of HEVC while ensuring that subjective quality does not degrade noticeably.
Existing FOM-based generative compression schemes are still at an exploratory stage, and there is considerable room for improvement in bit rate compression and driving strategies. First, the key point information typically requires 240 bytes of floating-point data per frame, which holds no advantage over P-frame compression in HEVC. Therefore, in the face of the stringent requirements of real application scenarios on transmission bit rate and subjective quality, the compression limits attainable by such schemes and their effective application scenarios still need further exploration.
Aiming at the problems of the FOM generative compression scheme, the present application constructs an end-to-end generative encoding system framework driven by mobile live-streaming service requirements. Through encoding preprocessing and FOM generative compression stages, it offers new approaches, namely lost-frame regeneration on the transcoding engine side and frame generation on the viewer-side decoder, as alternatives to the traditional encoding framework.
Fig. 1 is a flowchart of an AI-based video encoding method according to an embodiment of the present application. The method may be used in video image transmission scenarios, in particular scenarios in which video is transmitted after encoding and compression, and may be performed by a server, an intelligent terminal or another device with encoding and computing capability. The method specifically comprises the following steps.
S101, acquiring a video to be encoded.
In an embodiment, the video to be encoded may be video data awaiting transmission, which may consist of multiple frames of identical or different images. Encoding the video achieves the purpose of compressing it, so that the video can be successfully transmitted to the user terminal device.
In this scheme, the video to be encoded may be captured by a video recording device, such as a camera, at the recording end. After the video to be encoded is obtained, data acquisition and image preprocessing may be performed on its images, for example classifying, judging and enhancing the video to be encoded, which can improve the robustness of video encoding model training and the efficiency of extracting video information.
S102, outputting key point information for the source reference frame and the driving frames of the video to be encoded by means of a key point detection network.
The source reference frame may be the first frame image of the video to be encoded. The driving frames may be the frame images that follow the source reference frame in the video to be encoded. The image content and image data of the driving frames may be the same as or different from one another, and may be the same as or different from those of the source reference frame.
In an embodiment, the key points may be points capable of representing the image characteristics of each frame in the video to be encoded, including points representing the overall morphology or content distribution of the source reference frame and driving frame images, and points that change in the driving frame images of one or more subsequent frames. For example, in a lecture video dominated by the teacher on camera, the facial features of the teacher's head can serve as key points of the video to be encoded. The key point information may be the information used for encoding the video, including the content information, position information and data information of the key points, for example the pixel data of a key point, its position coordinate data, and the data within a certain range around those coordinates.
The key point detection network may be a network for acquiring the key point information, such as a key point detector (Kp-detector). The sparse motion field network within the key point detection network acts on the source reference frame and the driving frames of the video to be encoded, so that the key point information of the source reference frame and the driving frames can be acquired.
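As a minimal sketch of this step (PyTorch; the kp_detector interface, returning coordinates in [-1, 1] and one 2×2 Jacobian per key point, mirrors public first-order-motion-model code and is an assumption here, not an API defined by this application):

```python
import torch

def extract_keypoint_info(kp_detector, source_frame, driving_frames):
    """Run a key point detection network over the source reference frame and
    each driving frame of the video to be encoded.

    Frames are (1, 3, H, W) tensors; the detector is assumed to return a dict
    with 'value' (1, N, 2) key point coordinates and 'jacobian' (1, N, 2, 2).
    """
    with torch.no_grad():
        kp_source = kp_detector(source_frame)
        kp_driving = [kp_detector(frame) for frame in driving_frames]
    return kp_source, kp_driving
```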
S103, determining a key point information compression result for each driving frame according to the key point information of the source reference frame and a preset compression rule.
The preset compression rule may be a rule for compressing each driving frame of the video to be encoded according to video compression requirements so as to reduce the volume of the video, and includes intra-frame compression rules, inter-frame compression rules and the like. The key point information compression result of each driving frame may be the result of compressing the key point data of the driving frame into encoded data, including intra-frame compression results for different key point information within one driving frame and inter-frame compression results for the same key point data across driving frames.
In an embodiment, an encoding system model may be built on an end-to-end First Order Motion Model (FOM), adopting key point data compression: the key point information of each driving frame is intra-frame compressed according to the preset compression rule, and inter-frame compressed according to the relation between the key point information of the source reference frame and that of each driving frame together with the preset compression rule. The key point data compression may include compression of pixel data, compression of position coordinate data, conversion of data types and the like, without limitation here. The key point information compression result of each driving frame can be determined by compressing the driving frames both intra-frame and inter-frame.
In this scheme, the key point information is further compressed using the preset compression rule, which greatly improves compression efficiency on top of the key point extraction itself, achieves deep compression of the data, and enables encoding at an ultra-low bit rate.
S104, generating code stream data for transmission based on the source reference frame, the key point information of the source reference frame, and the key point information compression results of the driving frames.
The code stream data may be the data obtained by encoding and compressing the source reference frame and the driving frames of the video to be encoded. The video to be encoded can be transmitted to the user terminal device in the form of code stream data. The code stream data is generated based on the image encoding result of the source reference frame, the key point information encoding result of the source reference frame, and the key point information compression results of the driving frames.
In an embodiment, taking face-centric video as an example, when the user terminal device receives the code stream, then for video without face occlusion, with little head motion and with a simple background, the content information, position information and data information of the key points change little between the source reference frame and each driving frame; that is, the key point information of the driving frames is highly correlated with that of the source reference frame. Each driving frame can therefore be generated directly by unidirectional reference to the source reference frame so as to restore the video to be encoded, for example by computing each driving frame image, whether adjacent to the source reference frame or not, through the dense motion field according to the key point information of the source reference frame and the preset compression rule.
In another embodiment, when the user terminal device receives the code stream, then for video with occasional face occlusion, moderate motion and a simple background, the content information, position information and data information of the key points change considerably between the source reference frame and the driving frames; that is, the correlation between their key point information is low, and after compression the driving frame images cannot be accurately restored by unidirectional reference to the source reference frame alone. A backward reference can therefore be added on top of FOM, and a bidirectional reference driving strategy used to generate the final image of each driving frame. For example: the key point information of the current driving frame d and of the driving frames n frame positions before and after it is obtained through the sparse motion field; the driving frames n positions before and after the current frame serve as the forward reference frame d_{-n} and the backward reference frame d_{+n}; after the dense motion field acts on d_{-n} and d_{+n}, reference images are generated in the respective reference directions; and the reference images of the forward and backward reference frames are finally fused, using a fusion mask M, into a bidirectionally fused reference image according to the following formula:
output = d_{-n} × M + d_{+n} × (1 - M);
which guides the final generation of the current driving frame image. Compared with using the unidirectional reference alone, the bidirectional reference significantly improves subjective quality and reduces facial jitter. The update frequency of the reference source is adjusted to achieve better subjective quality, and a preprocessing module is added on this basis to detect scene changes. Scene change detection comprises a scene detection network and computation of the mutual information between adjacent frames. By detecting scene changes, it is judged whether the estimated update probability of the current frame exceeds the average update probability predicted for the frames encoded before it, and whether the mutual information between the current frame and its adjacent frame exceeds the average mutual information computed over the encoded frames, so as to locate source reference updates more accurately; this ultimately yields a certain improvement in subjective quality and improves the worst generation results. For video without face occlusion, with little head motion and with a simple background, the bidirectional reference driving strategy may likewise be adopted according to the user's requirements on video restoration quality, without undue limitation here.
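A minimal NumPy sketch of the fusion formula above, where ref_fwd and ref_bwd are the reference images generated from the forward reference frame d_{-n} and the backward reference frame d_{+n}, and mask is the fusion mask M (array names are illustrative):

```python
import numpy as np

def fuse_bidirectional(ref_fwd, ref_bwd, mask):
    """output = d_{-n} * M + d_{+n} * (1 - M): blend the forward and backward
    reference images with the per-pixel fusion mask M in [0, 1]."""
    return ref_fwd * mask + ref_bwd * (1.0 - mask)
```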
In an embodiment, optionally, after the code stream data is generated for transmission based on the source reference frame, the key point information of the source reference frame and the key point information compression results of the driving frames, the method further includes:
identifying, through a transcoding engine, whether there is a terminal that cannot recognize the code stream data;
if so, converting the code stream data based on the target terminal parameters, and transmitting the converted code stream data to the target terminal;
if not, transmitting the code stream data directly to the corresponding terminal.
The transcoding engine can receive the code stream data generated by this scheme and, when needed, apply transcoding according to processing capability, resolution requirements and the like. The transcoding engine may provide multiple output ports for forwarding the code stream data to different user terminal devices. By means of the transcoding engine, it is possible to identify whether there is a user terminal device inconsistent with the code stream transmission conditions, i.e. a user terminal device that cannot recognize or transmit the code stream data.
In an embodiment, the target terminal may be a user terminal device to which the code stream cannot be transmitted as-is. The target terminal parameters may be the code stream transmission parameters of the target terminal. If the transcoding engine recognizes that such a target terminal exists, the code stream data is decoded, re-encoded according to the target terminal parameters, and the converted code stream data is forwarded to the corresponding target terminal. Since the transcoding engine can be connected to several different user terminal devices at the same time, its re-encoding mode, encoding standard and the like may be the same or different for different target terminals. If the transcoding engine does not detect such a target terminal, the code stream data is transmitted directly to the corresponding user terminal devices for video regeneration there.
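As a minimal illustrative sketch of this dispatch logic (Python; the terminal objects, their supports/send methods and the transcode callable are hypothetical placeholders, not APIs defined by this application):

```python
def dispatch_code_stream(bitstream, terminals, transcode):
    """Forward code stream data to each terminal, transcoding only for
    terminals that cannot recognize the stream (the flow described above)."""
    for terminal in terminals:
        if terminal.supports(bitstream):
            # Terminal can decode the stream: transmit it directly.
            terminal.send(bitstream)
        else:
            # Target terminal: decode and re-encode per its own parameters,
            # then forward the converted stream.
            terminal.send(transcode(bitstream, terminal.params))
```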
In another embodiment, the transcoding engine and the user terminal device may decode the source reference frame code stream to regenerate the source reference frame image, and regenerate each driving frame image in driving frame order according to the regenerated source reference frame and the code stream data, so as to restore the video. Meanwhile, the transcoding engine and the user terminal device can judge from the driving frame sequence numbers in the code stream whether frames have been lost and, if so, regenerate the lost frames from the adjacent driving frames.
In this scheme, the transcoding engine identifies whether there is a terminal that cannot recognize the code stream data, converts the code stream data based on the target terminal parameters, and then transmits the converted code stream data to the target terminal. This avoids code stream data failing to be received because user terminal devices are configured differently, and improves the accuracy and efficiency with which user terminal devices receive and restore the transmitted video.
In the technical solution provided by this embodiment of the present application, a video to be encoded is acquired; key point information is output for the source reference frame and the driving frames of the video to be encoded by means of a key point detection network; a key point information compression result is determined for each driving frame according to the key point information of the source reference frame and a preset compression rule; and code stream data is generated for transmission based on the source reference frame, the key point information of the source reference frame, and the key point information compression results of the driving frames. This AI-based video encoding method addresses the poor encoding model adaptability, difficult model training and high transmission bit rate of existing AI video encoding technology, and achieves ultra-low bit rate video encoding. By deeply compressing the video data and combining the encoding process with an AI model, it obtains a better video compression effect, while also widening the application range of AI video encoding models and reducing the difficulty and cost of model development and training.
Fig. 2 is a schematic diagram of an AI-based video encoding system framework according to an embodiment of the present application. As shown in Fig. 2, the video encoding system specifically includes: an acquisition/preprocessing module, an AI Enc module, a transcoding engine module and an AI Dec module.
The acquisition/preprocessing module performs preprocessing operations such as classification, enhancement and highlighting on the video to be encoded, and transmits the processed video to the AI Enc module.
The AI Enc module performs encoding and compression on the video to be encoded and comprises one traditional encoding link and two AI encoding links. The traditional encoding link runs from the Codec Engine module to the mixed stream module. The Codec Engine module encodes the source reference frame image of the video to be encoded so that it can enter the mixed stream module and be transmitted to the user side. The mixed stream module mixes the image encoding result of the source reference frame with the encoding results for the driving frames and generates the code stream used to transmit the encoded video.
One of the AI encoding links connects to the Codec Engine module to encode and compress the source reference frame image, and comprises a DPB (Decoded Picture Buffer) module, the key point detection Net and a generated-information compression module. The DPB module decodes and buffers the source reference frame image encoded by the Codec Engine module, and passes the decoded source reference frame image to the key point detection Net so that the latter can detect the key point information of the source reference frame image. It will be appreciated that the source reference frame image can also be input directly to the key point detection Net here. The generated-information compression module encodes and compresses the key point information detected by the key point detection Net and passes the compressed key point information to the mixed stream module of the traditional encoding link.
The other AI encoding link comprises the key point detection Net and the AI Dec, where the AI Dec comprises the sparse motion field Net and the dense motion field Net. The key point detection Net detects the key point information of each driving frame image and passes the detected key point information to the generated-information compression module. The AI Dec regenerates each frame image from the detected key point information and passes the generated frame image to the discriminator Net, with which it forms a generative adversarial network used to assess the reliability and accuracy of the generated frames. The sparse motion field Net in the AI Dec obtains the concrete key point data in the key point information of each frame image, and the dense motion field Net regenerates each frame image from these data.
The transcoding engine module transcodes the code stream data and regenerates lost-frame data, and comprises a Codec Dec and an AI Dec. The Codec Dec generates the source reference frame image from the code stream data and the encoded source reference frame image data. The AI Dec performs lost-frame regeneration from the code stream data and the already generated adjacent reference frame or driving frame images.
The AI Dec module receives the code stream data transmitted directly by the AI Enc module, or the code stream data forwarded by the transcoding engine module, and likewise comprises a Codec Dec and an AI Dec. The Codec Dec generates the source reference frame image from the code stream data and the encoded source reference frame image data. The AI Dec performs lost-frame regeneration from the code stream data and the already generated adjacent reference frame or driving frame images.
Fig. 3 is a flowchart of an AI-based video encoding method according to an embodiment of the present application. As shown in Fig. 3, the method specifically includes the following steps.
S201, acquiring a video to be encoded.
S202, outputting key point coordinates and the Jacobian matrices corresponding to those coordinates for the source reference frame and the driving frames of the video to be encoded by means of a key point detection network.
The key point coordinates may be the coordinates of the key points of the source reference frame and each driving frame in a common coordinate system, representing the positions of the key points; the change of each frame image can be determined from the change of the key point coordinates.
The Jacobian matrix here is the Jacobian matrix corresponding to the key point coordinates: a matrix formed by arranging the first-order partial derivatives at the key point coordinates in a certain way. Its determinant is called the Jacobian determinant, and the matrix represents the best linear approximation of a differentiable function near a given point. The sparse motion field in the key point detection network acts on the source reference frame and the driving frames of the video to be encoded to obtain and output the key point coordinates, from which the Jacobian matrices corresponding to the coordinates are computed.
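For reference, in first-order motion models the motion field T around the k-th key point p_k is commonly approximated to first order as below (a standard formulation consistent with the description above; the exact parameterization used in this application is not mandated here):

```latex
\mathcal{T}(z) \;\approx\; \mathcal{T}(p_k) + J_k\,(z - p_k),
\qquad
J_k \;=\;
\begin{pmatrix}
\dfrac{\partial \mathcal{T}_x}{\partial x} & \dfrac{\partial \mathcal{T}_x}{\partial y}\\[6pt]
\dfrac{\partial \mathcal{T}_y}{\partial x} & \dfrac{\partial \mathcal{T}_y}{\partial y}
\end{pmatrix}\Bigg|_{z = p_k}
```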
S203, reading the source Jacobian matrix corresponding to each source key point coordinate of the source reference frame, and reading the driving Jacobian matrix corresponding to each driving key point coordinate of each driving frame.
The source Jacobian matrices may be the Jacobian matrices corresponding to the source key point coordinates of the source reference frame, and the driving Jacobian matrices may be the Jacobian matrices corresponding to the driving key point coordinates of each driving frame; the driving key point coordinates may be the same or different across frames, and accordingly the driving Jacobian matrices may also be the same or different. After the key point detection network outputs the key point coordinates and the Jacobian matrices corresponding to those coordinates, the source Jacobian matrix corresponding to each source key point coordinate of the source reference frame and the driving Jacobian matrix corresponding to each driving key point coordinate of each driving frame are read.
S204, compressing the driving Jacobian matrices of each driving frame according to the source Jacobian matrices and a first preset compression rule to obtain a driving Jacobian matrix compression result.
The first preset compression rule may be a rule for compressing the driving Jacobian matrices of each driving frame; it may compute residuals between the driving Jacobian matrices of adjacent driving frames, or compute residuals between each driving Jacobian matrix and the source Jacobian matrix, and so on, without undue limitation here. The driving Jacobian matrices of each driving frame are compressed according to the source Jacobian matrices and the first preset compression rule to obtain the driving Jacobian matrix compression result.
In one embodiment, optionally, before the driving Jacobian matrices of each driving frame are compressed according to the source Jacobian matrices and the first preset compression rule to obtain the driving Jacobian matrix compression result, the method further includes:
performing format conversion on the source Jacobian matrices and the driving Jacobian matrices, converting data of type Float32 into data of type Float16.
In one embodiment, in order to compress the driving Jacobian matrices as much as possible, the precision of the source and driving Jacobian matrices may be reduced before the driving Jacobian matrices of each driving frame are compressed according to the source Jacobian matrices and the first preset compression rule. Specifically, the matrix precision may be reduced by format-converting the source and driving Jacobian matrices from data of type Float32 to data of type Float16. The data amount then needed for each driving frame is 100 bytes, a compression to roughly 1/5.09 of the source video.
In an embodiment, format-converting the source and driving Jacobian matrices reduces the precision of the Jacobian matrices and thereby achieves a preliminary compression of the video.
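A minimal NumPy sketch of this precision reduction (assuming, as in common FOM implementations, 10 key points per frame, each with one 2×2 Jacobian; the exact counts are an assumption, used here only to reproduce the byte accounting):

```python
import numpy as np

# Per driving frame: assumed 10 key points, each with a 2x2 Jacobian (Float32).
jacobians = np.random.uniform(-1.0, 1.0, size=(10, 2, 2)).astype(np.float32)

jacobians_f16 = jacobians.astype(np.float16)  # halve the Jacobian payload
print(jacobians.nbytes, "->", jacobians_f16.nbytes)  # 160 -> 80 bytes

# With the 10 x 2 coordinates carried as Uint8 (20 bytes, see the coordinate
# compression below), each driving frame then needs about 100 bytes.
```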
S205, generating code stream data for transmission based on the source reference frame, the key point information of the source reference frame, and the key point information compression results of the driving frames.
In the technical solution provided by this embodiment of the present application, the key point detection network outputs, for the source reference frame and the driving frames of the video to be encoded, the key point coordinates and the Jacobian matrices corresponding to those coordinates; the source Jacobian matrix corresponding to each source key point coordinate of the source reference frame and the driving Jacobian matrix corresponding to each driving key point coordinate of each driving frame are read; and the driving Jacobian matrices of each driving frame are compressed according to the source Jacobian matrices and the first preset compression rule to obtain the driving Jacobian matrix compression result. This improves the simplicity and efficiency of compressing the driving frame Jacobian matrices and reduces the difficulty and cost of developing and training the video encoding model.
Fig. 4 is a flowchart of an AI-based video encoding method according to an embodiment of the present application. As shown in Fig. 4, the method specifically includes the following steps.
S301, acquiring a video to be encoded.
S302, outputting key point coordinates and the Jacobian matrices corresponding to those coordinates for the source reference frame and the driving frames of the video to be encoded by means of a key point detection network.
S303, reading the source Jacobian matrix corresponding to each source key point coordinate of the source reference frame, and reading the driving Jacobian matrix corresponding to each driving key point coordinate of each driving frame.
S304, based on the frame sequence of the source reference frame and the driving frames, differencing the driving Jacobian matrices of each driving frame frame by frame against the Jacobian matrices of the source reference frame in frame-sequence order to obtain the Jacobian residual matrices of each driving frame as the driving Jacobian matrix compression result.
In an embodiment, the frame sequence may be the ordering of the source reference frame and the driving frames determined by the playback order of the video image frames. The frame sequence may be stored in each frame's encoding as numbers, symbols, character strings and the like. Since the overall fluctuation of the Jacobian matrices at corresponding key point positions between the source reference frame and each driving frame does not exceed [-1, 1], the Jacobian matrices of the driving frames can be compressed further: each driving Jacobian matrix of each driving frame is differenced frame by frame against the Jacobian matrix of the source reference frame in frame-sequence order, yielding the Jacobian residual matrices of each driving frame as the driving Jacobian compression result. In this way, a Jacobian matrix of type Float16 can be compressed into a Jacobian residual matrix of type Uint8; each driving frame then occupies 60 bytes, a compression to roughly 1/8.28 of the source video.
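A hedged sketch of this residual compression (NumPy; the linear [-1, 1] to [0, 255] quantizer is one plausible Uint8 mapping consistent with the stated fluctuation range, not necessarily the exact rule used):

```python
import numpy as np

def jacobian_residual_to_uint8(j_driving, j_source):
    """Difference a driving frame's Jacobians against the source reference
    frame's Jacobians, then map the residual from [-1, 1] to Uint8 [0, 255]."""
    residual = j_driving.astype(np.float32) - j_source.astype(np.float32)
    residual = np.clip(residual, -1.0, 1.0)   # stated fluctuation range
    return np.round((residual + 1.0) * 127.5).astype(np.uint8)

def uint8_to_jacobian_residual(q):
    """Inverse mapping applied at the decoding side before adding back
    the source Jacobians."""
    return q.astype(np.float32) / 127.5 - 1.0
```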
S305, generating code stream data for transmission based on the source reference frame, the key point information of the source reference frame, and the key point information compression results of the driving frames.
In the technical solution provided by this embodiment of the present application, based on the frame sequence of the source reference frame and the driving frames, the Jacobian matrices of each driving frame are differenced frame by frame against the Jacobian matrices of the source reference frame to obtain the Jacobian residual matrices of each driving frame as the Jacobian compression result. This further compresses the Jacobian matrices of the driving frames and improves the degree of video compression and the video encoding effect.
Fig. 5 is a flowchart of key point coordinate compression according to an embodiment of the present application. As shown in Fig. 5, the process specifically includes the following steps.
S401, reading the source key point coordinates of the source reference frame and the driving key point coordinates of each driving frame.
In one embodiment, the source key point coordinates may be the coordinates of the key points in the source reference frame, and the driving key point coordinates may be the coordinates of the key points in each driving frame. The coordinates of each key point can be obtained through the sparse motion field in the key point detection network.
S402, performing format conversion on the source key point coordinates and the driving key point coordinates using a second preset compression rule, converting data of type Float32 into data of type Uint8, to obtain a compression result for the source key point coordinates and a compression result for each driving key point coordinate.
In an embodiment, the second preset compression rule may be a rule for compressing the source and driving key point coordinates, and may exploit intra-frame or inter-frame redundancy in the encoded key point coordinates, without limitation here. Since the key point coordinates are normalized to [-1, 1], they can be mapped from [-1, 1] to [0, 255] so as to compress them. Specifically, the second preset compression rule may format-convert the source and driving key point coordinates so that their data type changes from Float32 to Uint8, thereby obtaining the compression result of the source key point coordinates and of each driving key point coordinate. Furthermore, the key point data can be compressed to the greatest extent by multiplexing it: the key points may be sampled at a fixed interval of 4 frames, in which case the compressed video is roughly 1/27.67 of the source video, achieving a greater degree of compression and improving the video encoding effect.
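A minimal sketch of the coordinate mapping and the fixed 4-frame key point multiplexing (NumPy; helper names are illustrative):

```python
import numpy as np

def quantize_coords(coords):
    """Map normalized key point coordinates from [-1, 1] to Uint8 [0, 255]."""
    return np.round((np.clip(coords, -1.0, 1.0) + 1.0) * 127.5).astype(np.uint8)

def dequantize_coords(q):
    """Approximate inverse mapping used at the decoding side."""
    return q.astype(np.float32) / 127.5 - 1.0

def keypoint_frames(num_frames, interval=4):
    """Key point multiplexing: transmit key points only every `interval`
    frames and reuse them for the frames in between."""
    return list(range(0, num_frames, interval))
```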
In the technical solution provided by this embodiment of the present application, the source key point coordinates of the source reference frame and the driving key point coordinates of each driving frame are read, and the second preset compression rule format-converts them from data of type Float32 to data of type Uint8 to obtain the compression results of the source and driving key point coordinates. This increases the degree of video compression, lowers the compressed bit rate, improves the video encoding effect, and at the same time reduces the difficulty of developing and training the video encoding model.
Fig. 6 is a flowchart of a method for evaluating reconstructed images according to an embodiment of the present application. As shown in Fig. 6, the method specifically includes the following steps.
S501, inputting the source reference frame and each driving frame into a generator, so that the key point information of the source reference frame and of the driving frame is generated through the sparse motion field network of the generator, an intermediate product is output through the dense motion field network of the generator, and a reconstructed image of the current frame is generated through the image generating unit.
In an embodiment, the generator may be a model that maps a noise signal (typically random numbers) to samples resembling real data, specifically a model that outputs pictures, for example a fully connected neural network or a deconvolution network. The generator is used to generate the reconstructed image of the current frame and comprises a sparse motion field network and a dense motion field network. The sparse motion field network acts on the source reference frame image and each driving frame image to obtain the key point information of each frame; the dense motion field acts on the key point information received by the generator and the source reference frame to obtain the intermediate product. The intermediate product may be, for example, the changes in content or data between the key point information of the source reference frame and the corresponding key point information of the driving frame, and is used to guide the generator in producing the reconstructed image of the current frame. The image generating unit may be a unit for generating the reconstructed image of the current frame, with data receiving, data processing and image processing functions. The reconstructed image of the current frame may be the image the image generating unit produces from the source reference frame image, or from a driving frame image already generated from a previous frame, together with the key point information of the current frame. The source reference frame and each driving frame are input into the generator; the sparse motion field network of the generator generates the key point information of the source reference frame and the driving frame, the dense motion field network outputs the intermediate product, and the image generating unit generates the reconstructed image of the current frame.
In one embodiment, optionally, the intermediate product comprises: a motion optical flow field and an occlusion map.
Correspondingly, generating the reconstructed image of the current frame through the image generating unit includes:
inputting the motion optical flow field, the occlusion map and the source image into the image generating unit to output the reconstructed image of the current frame.
In one embodiment, the motion optical flow field describes the movement of objects between successive frames, caused by the relative motion between the objects and the camera. It contains the key point information and key point motion trends of the observed objects in the video, and is used to compute the pixel motion of each key point between adjacent frames from the temporal changes of pixels in the source reference frame and driving frame image sequence and the correlation between adjacent frames. The occlusion map indicates, when the image generating unit generates the reconstructed image of the current frame, which parts can be obtained by displacing source image pixels and which parts need to be filled in from context. The motion optical flow field, the occlusion map and the source reference frame image are input into the image generating unit of the generator to output the reconstructed image of the current frame.
In an embodiment, inputting the motion optical flow field, the occlusion map and the source image into the image generating unit to output the reconstructed image of the current frame improves the efficiency and accuracy of regenerating the driving frame images.
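As a hedged sketch of how the occlusion map can gate motion-compensated source content against context filling in the image generating unit (PyTorch; inpaint_net is an assumed network, and real FOM-style generators typically apply the occlusion mask to warped features rather than final pixels):

```python
import torch
import torch.nn.functional as F

def reconstruct_current_frame(source, flow_grid, occlusion, inpaint_net):
    """source: (1, 3, H, W); flow_grid: (1, H, W, 2) sampling grid in [-1, 1];
    occlusion: (1, 1, H, W) map in [0, 1]. Visible regions come from warping
    the source by the motion optical flow field; occluded regions are filled
    from context by the assumed inpainting network."""
    warped = F.grid_sample(source, flow_grid, align_corners=True)
    return occlusion * warped + (1.0 - occlusion) * inpaint_net(warped)
```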
S502, inputting the reconstructed image into a discriminator to obtain an evaluation result for the reconstructed image, wherein the generator and the discriminator constitute an adversarial neural network.
In an embodiment, the discriminator may be used to evaluate the quality of the current frame reconstructed image produced by the generator, and may discriminate via the similarity between the reconstructed image and the actual current frame image. The discriminator takes the reconstructed image of the current frame as input and outputs an authenticity label for the image, i.e. a similarity, so that inputting the reconstructed image into the discriminator yields the discriminator's evaluation of it. An adversarial neural network is a generative model consisting of a generator and a discriminator that play against each other; high-quality data is generated through this adversarial game. During training of the adversarial neural network, continually training and updating the discriminator and the generator gradually makes the discriminator network more accurate, while the data finally produced by the generator comes ever closer to real data.
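A minimal sketch of obtaining the evaluation result from the discriminator (PyTorch; a single-logit discriminator head is an assumption):

```python
import torch

def evaluate_reconstruction(discriminator, reconstructed):
    """Score a reconstructed frame with the discriminator of the adversarial
    pair; values near 1 mean the frame looks close to real data."""
    with torch.no_grad():
        logit = discriminator(reconstructed)   # assumed shape (1, 1)
    return torch.sigmoid(logit).item()         # similarity-like evaluation result
```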
In the technical solution provided by this embodiment of the present application, the key point information of the source reference frame and the driving frame is generated through the sparse motion field network of the generator, the intermediate product is output through the dense motion field network of the generator, the reconstructed image of the current frame is generated through the image generating unit, and the reconstructed image is input into the discriminator to obtain its evaluation result. This improves the accuracy of reconstructing the current frame image and thus the reliability of restoring the encoded video.
Fig. 7 is a block diagram of a video encoding device based on AI according to an embodiment of the present application, where the device is configured to execute the video encoding method based on AI according to the foregoing embodiment, and has functional modules and beneficial effects corresponding to the execution method. As shown in fig. 7, the apparatus specifically includes: a video acquisition module 601, a key point information output module 602, a key point information compression module 603, a code stream data transmission module 604, wherein,
the video acquisition module 601 is configured to acquire a video to be encoded;
the key point information output module 602 is configured to output key point information by applying a key point detection network to the source reference frame and the driving frames of the video to be encoded;
the key point information compression module 603 is configured to determine the key point information compression result of each driving frame according to the key point information of the source reference frame and a preset compression rule;
and the code stream data transmission module 604 is configured to generate code stream data for transmission based on the source reference frame, the key point information of the source reference frame, and the key point information compression result of each driving frame.
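Purely as a hedged illustration of what assembling such code stream data could involve, the sketch below length-prefixes the three payloads into one byte stream; the container layout and the name build_code_stream are hypothetical and not prescribed by this embodiment.

```python
import struct

def build_code_stream(source_frame_bytes, source_kp_bytes, driving_kp_results):
    """driving_kp_results: per-driving-frame compressed key point payloads
    (hypothetical byte blobs; the patent does not fix this container format)."""
    payloads = [source_frame_bytes, source_kp_bytes] + list(driving_kp_results)
    stream = struct.pack("<I", len(payloads))        # number of payload blocks
    for payload in payloads:
        stream += struct.pack("<I", len(payload))    # length-prefix each block
        stream += payload
    return stream
```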
In one possible embodiment, the key point information comprises: the key point coordinates and the jacobian matrix corresponding to the coordinates;
correspondingly, the keypoint information compression module 603 includes:
the jacobian matrix reading unit is used for reading a source jacobian matrix corresponding to each source key point coordinate of the source reference frame and reading a driving jacobian matrix corresponding to each driving key point coordinate of each driving frame;
and the jacobian matrix compression unit is used for compressing the driving jacobian matrix of each driving frame according to the source jacobian matrix and a first preset compression rule to obtain a driving jacobian matrix compression result.
In a possible embodiment, the jacobian matrix compression unit is specifically configured to:
difference, based on the frame order of the source reference frame and the driving frames, the driving jacobian matrix of each driving frame against the jacobian matrix of the source reference frame, frame by frame and in frame order, to obtain a jacobian residual matrix for each driving frame as the driving jacobian matrix compression result.
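A minimal sketch of this frame-by-frame differencing, assuming each key point carries a 2x2 Jacobian and the matrices are stacked into NumPy arrays (the array shapes are assumptions):

```python
import numpy as np

def jacobian_residuals(source_jac, driving_jacs):
    """source_jac:   (K, 2, 2) one 2x2 Jacobian per key point of the source frame
    driving_jacs: (T, K, 2, 2) Jacobians of T driving frames, in frame order

    Returns the (T, K, 2, 2) residual matrices, i.e. each driving Jacobian
    minus the corresponding source Jacobian, used as the compression result.
    """
    return driving_jacs - source_jac[np.newaxis]
```

Because driving frames tend to differ only slightly from the source reference frame, these residuals are small-valued and compress better than the raw matrices.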
In a possible embodiment, the jacobian matrix compression unit is further configured to:
perform format conversion on the source jacobian matrix and the driving jacobian matrices, converting data of type Float32 into data of type Float16.
In one possible embodiment, the keypoint information compression module 603 further includes:
the key point coordinate reading unit is used for reading each source key point coordinate of the source reference frame and each driving key point coordinate of each driving frame;
and the key point coordinate compression unit is used for performing format conversion on the source key point coordinates and the driving key point coordinates according to a second preset compression rule, converting data of type Float32 into data of type Uint8, to obtain a compression result of the source key point coordinates and a compression result of each driving key point coordinate.
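The two format conversions might be sketched as follows; the exact Uint8 mapping is an assumption (coordinates normalized to [-1, 1] and a linear scaling are assumed), since the embodiments above name only the target data types.

```python
import numpy as np

def compress_keypoint_payload(coords_f32, jacobians_f32):
    """coords_f32:    (K, 2) key point coordinates in Float32
    jacobians_f32: (K, 2, 2) Jacobian matrices in Float32"""
    # Second preset rule (assumed mapping): coordinates in [-1, 1] are
    # scaled linearly onto 0..255 and stored as Uint8 (4x smaller).
    clipped = np.clip(coords_f32, -1.0, 1.0)
    coords_u8 = np.round((clipped + 1.0) * 127.5).astype(np.uint8)
    # First preset rule's format conversion: halve the Jacobian payload
    # by casting Float32 down to Float16.
    jacobians_f16 = jacobians_f32.astype(np.float16)
    return coords_u8, jacobians_f16
```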
In one possible embodiment, the video acquisition module 601 includes:
a reconstructed image generating unit, configured to input the source reference frame and each driving frame into a generator, so as to generate key point information of the source reference frame and key point information of the driving frame through a sparse motion field network of the generator, output an intermediate product through a dense motion field network of the generator, and generate a reconstructed image of the current frame through the image generating unit;
and the reconstructed image evaluation unit is used for inputting the reconstructed image into the discriminator to obtain an evaluation result of the reconstructed image; wherein the generator and the discriminator constitute a generative adversarial network.
In one possible embodiment, the intermediate product comprises: a motion optical flow field and an occlusion map;
correspondingly, the reconstructed image generating unit is specifically configured to:
input the motion optical flow field, the occlusion map and the source image into the image generation unit to output the reconstructed image of the current frame.
In a possible embodiment, the code stream data transmission module 604 is further configured to:
identify, through a transcoding engine, whether any terminal is unable to recognize the code stream data;
if such a terminal exists, convert the code stream data based on the target terminal's parameters and transmit the converted code stream data to the target terminal;
and if no such terminal exists, transmit the code stream data directly to the corresponding terminal.
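A hedged sketch of this dispatch logic follows; the terminal and transcoding-engine interfaces (can_recognize, convert, params, send) are hypothetical names introduced for illustration.

```python
def dispatch_code_stream(stream, terminals, transcode_engine):
    """Send the AI code stream to every terminal, transcoding it first for
    terminals that cannot recognize the code stream data."""
    for terminal in terminals:
        if transcode_engine.can_recognize(terminal, stream):
            # The terminal recognizes the code stream: transmit directly.
            terminal.send(stream)
        else:
            # Convert based on the target terminal's parameters, then transmit.
            converted = transcode_engine.convert(stream, terminal.params)
            terminal.send(converted)
```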
According to the technical scheme provided by this embodiment of the application, the video acquisition module acquires the video to be encoded; the key point information output module outputs key point information by applying a key point detection network to the source reference frame and the driving frames of the video to be encoded; the key point information compression module determines the key point information compression result of each driving frame according to the key point information of the source reference frame and a preset compression rule; and the code stream data transmission module generates code stream data for transmission based on the source reference frame, the key point information of the source reference frame, and the key point information compression result of each driving frame. This AI-based video encoding device addresses the poor adaptability of the encoding model, the high difficulty of model training and the high transmission bit rate encountered when video is encoded with existing AI video coding technology. By outputting key point information through a key point detection network applied to the source reference frame and the driving frames, determining the key point information compression result of each driving frame according to the key point information of the source reference frame and preset compression rules, and generating code stream data for transmission from the source reference frame, its key point information and the per-frame compression results, the device achieves ultra-low-bit-rate video encoding. Combining deep compression of the video data with an AI-model-based encoding process yields a better video compression effect, while the scheme also broadens the application range of the AI video coding model and reduces the difficulty and cost of model development and training.
Fig. 8 is a schematic structural diagram of an AI-based video encoding apparatus according to an embodiment of the present application. As shown in Fig. 8, the apparatus includes a processor 701, a memory 702, an input device 703 and an output device 704. The number of processors 701 in the apparatus may be one or more; one processor 701 is taken as an example in Fig. 8. The processor 701, the memory 702, the input device 703 and the output device 704 in the apparatus may be connected by a bus or in other ways; connection by a bus is taken as an example in Fig. 8. The memory 702, as a computer-readable storage medium, stores software programs, computer-executable programs and modules, such as the program instructions/modules corresponding to the AI-based video encoding method in the embodiments of the application. The processor 701 executes the various functional applications and data processing of the apparatus by running the software programs, instructions and modules stored in the memory 702, i.e., implements the AI-based video encoding method described above. The input device 703 may be used to receive input numeric or character information and to generate key signal inputs related to user settings and function control of the apparatus. The output device 704 may include a display device such as a display screen.
The present application also provides a storage medium containing computer-executable instructions which, when executed by a computer processor, perform the AI-based video encoding method described in the above embodiments, the method comprising:
acquiring a video to be encoded;
outputting key point information by applying a key point detection network to the source reference frame and the driving frames of the video to be encoded;
determining the key point information compression result of each driving frame according to the key point information of the source reference frame and a preset compression rule;
and generating code stream data based on the source reference frame, the key point information of the source reference frame and the key point information compression result of each driving frame for transmission.
It should be noted that, in the above embodiment of the AI-based video encoding device, the units and modules included are divided only according to functional logic, but the division is not limited to the above, as long as the corresponding functions can be implemented; in addition, the specific names of the functional units are only intended to distinguish them from one another and do not limit the protection scope of the embodiments of the present application.
In some possible embodiments, aspects of the method provided by the present application may also be implemented as a program product that includes program code; when the program product runs on a computer device, the program code causes the computer device to perform the steps of the methods according to the various exemplary embodiments of the application described in this specification; for example, the computer device may perform the AI-based video encoding method described in the examples of the application. The program product may be implemented using any combination of one or more readable media.
Claims (12)
1. An AI-based video encoding method, comprising:
acquiring a video to be encoded;
outputting key point information by applying a key point detection network to the source reference frame and the driving frames of the video to be encoded;
determining the key point information compression result of each driving frame according to the key point information of the source reference frame and a preset compression rule;
and generating code stream data based on the source reference frame, the key point information of the source reference frame and the key point information compression result of each driving frame for transmission.
2. The AI-based video encoding method of claim 1, wherein the key point information comprises: the key point coordinates and the jacobian matrix corresponding to the coordinates;
correspondingly, determining the key point information compression result of each driving frame according to the key point information of the source reference frame and a preset compression rule, wherein the key point information compression result comprises the following steps:
reading a source jacobian matrix corresponding to each source key point coordinate of a source reference frame, and reading a driving jacobian matrix corresponding to each driving key point coordinate of each driving frame;
and compressing the driving jacobian matrix of each driving frame according to the source jacobian matrix and a first preset compression rule to obtain a driving jacobian matrix compression result.
3. The AI-based video encoding method of claim 2, wherein compressing the driving jacobian of each driving frame according to the source jacobian and a first preset compression rule to obtain a driving jacobian compression result includes:
based on the frame order of the source reference frame and the driving frames, differencing the driving jacobian matrix of each driving frame, frame by frame and in frame order, against the jacobian matrix of the source reference frame, to obtain a jacobian residual matrix for each driving frame as the driving jacobian matrix compression result.
4. The AI-based video encoding method of claim 2, wherein before compressing the driving jacobian matrix of each driving frame according to the source jacobian matrix and a first preset compression rule to obtain a driving jacobian matrix compression result, the method further comprises:
performing format conversion on the source jacobian matrix and the driving jacobian matrices, converting data of type Float32 into data of type Float16.
5. The AI-based video encoding method of claim 2, wherein determining the key point information compression result of each driving frame according to the key point information of the source reference frame and a preset compression rule further comprises:
reading each source key point coordinate of the source reference frame and each driving key point coordinate of each driving frame;
and performing format conversion on the source key point coordinates and the driving key point coordinates according to a second preset compression rule, converting data of type Float32 into data of type Uint8, to obtain a compression result of the source key point coordinates and a compression result of each driving key point coordinate.
6. The AI-based video encoding method of claim 1, wherein after acquiring the video to be encoded, the method further comprises:
inputting the source reference frame and each driving frame into a generator to generate key point information of the source reference frame and key point information of the driving frame through a sparse motion field network of the generator, outputting an intermediate product through a dense motion field network of the generator, and generating a reconstructed image of the current frame through an image generating unit;
inputting the reconstructed image into the discriminator to obtain an evaluation result of the reconstructed image; wherein the generator and the discriminator constitute a generative adversarial network.
7. The AI-based video encoding method of claim 6, wherein the intermediate product comprises: a motion optical flow field and an occlusion map;
correspondingly, generating, by the image generation unit, a reconstructed image of the current frame comprises:
inputting the motion optical flow field, the occlusion map and the source image into the image generation unit to output the reconstructed image of the current frame.
8. The AI-based video encoding method of claim 1, wherein after generating code stream data for transmission based on the source reference frame, the key point information of the source reference frame, and the key point information compression result of each drive frame, the method further comprises:
identifying, through a transcoding engine, whether any terminal is unable to recognize the code stream data;
if such a terminal exists, converting the code stream data based on the target terminal's parameters and transmitting the converted code stream data to the target terminal;
and if no such terminal exists, transmitting the code stream data directly to the corresponding terminal.
9. An AI-based video encoding apparatus, comprising:
the video acquisition module is used for acquiring a video to be encoded;
the key point information output module is used for outputting key point information by applying a key point detection network to the source reference frame and the driving frames of the video to be encoded;
the key point information compression module is used for determining the key point information compression result of each driving frame according to the key point information of the source reference frame and a preset compression rule;
And the code stream data transmission module is used for generating code stream data based on the source reference frame, the key point information of the source reference frame and the key point information compression result of each driving frame for transmission.
10. An AI-based video encoding apparatus, the apparatus comprising: one or more processors; a storage means for storing one or more programs that when executed by the one or more processors cause the one or more processors to implement the AI-based video encoding method of any of claims 1-8.
11. A storage medium storing computer executable instructions for performing the AI-based video encoding method of any of claims 1-8 when executed by a computer processor.
12. A computer program product comprising a computer program which, when executed by a processor, implements the AI-based video encoding method of any of claims 1-8.
Priority Applications (2)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202310396495.3A CN116582686A (en) | 2023-04-12 | 2023-04-12 | Video coding method, device, equipment and storage medium based on AI |
PCT/CN2024/084506 WO2024212822A1 (en) | 2023-04-12 | 2024-03-28 | Ai-based video coding method and apparatus, device, and storage medium |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202310396495.3A CN116582686A (en) | 2023-04-12 | 2023-04-12 | Video coding method, device, equipment and storage medium based on AI |
Publications (1)
Publication Number | Publication Date |
---|---|
CN116582686A (en) | 2023-08-11 |
Family
ID=87544358
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202310396495.3A Pending CN116582686A (en) | 2023-04-12 | 2023-04-12 | Video coding method, device, equipment and storage medium based on AI |
Country Status (2)
Country | Link |
---|---|
CN (1) | CN116582686A (en) |
WO (1) | WO2024212822A1 (en) |
Cited By (1)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
WO2024212822A1 (en) * | 2023-04-12 | 2024-10-17 | 百果园技术(新加坡)有限公司 | Ai-based video coding method and apparatus, device, and storage medium |
Family Cites Families (6)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN113132732B (en) * | 2019-12-31 | 2022-07-29 | 北京大学 | Man-machine cooperative video coding method and video coding system |
CN114257818B (en) * | 2020-09-22 | 2024-09-24 | 阿里巴巴达摩院(杭州)科技有限公司 | Video encoding and decoding methods, devices, equipment and storage medium |
CN114363623A (en) * | 2021-08-12 | 2022-04-15 | 财付通支付科技有限公司 | Image processing method, image processing apparatus, image processing medium, and electronic device |
US12058312B2 (en) * | 2021-10-06 | 2024-08-06 | Kwai Inc. | Generative adversarial network for video compression |
CN115941966B (en) * | 2022-12-30 | 2023-08-22 | 深圳大学 | A video compression method and electronic device |
CN116582686A (en) * | 2023-04-12 | 2023-08-11 | 百果园技术(新加坡)有限公司 | Video coding method, device, equipment and storage medium based on AI |
2023
- 2023-04-12: CN CN202310396495.3A, published as CN116582686A (en), active, Pending
2024
- 2024-03-28: WO PCT/CN2024/084506, published as WO2024212822A1 (en), status unknown
Also Published As
Publication number | Publication date |
---|---|
WO2024212822A1 (en) | 2024-10-17 |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||