Disclosure of Invention
The invention provides a method for processing metadata in video transcoding, a video transcoding device, and an electronic device, and aims to solve the technical problem of how to automatically process metadata during video transcoding.
The method for processing the metadata in the video transcoding, provided by the embodiment of the invention, comprises the following steps:
decoding the source video stream to obtain a decoded video stream and corresponding metadata;
adjusting the corresponding metadata based on processing operations on the decoded video stream;
encoding the decoded video stream, and embedding the metadata into the encoded video stream.
According to some embodiments of the invention, the processing operation on the decoded video stream comprises changing the size or position of pixels in the picture, for example by scaling, cropping, rotating, or mirroring the image; the corresponding metadata is adjusted accordingly based on the operation performed on the video stream.
In some embodiments of the present invention, the metadata is directly embedded in the encoded video stream if no processing operation is performed on the decoded video stream.
According to some embodiments of the invention, the metadata comprises static metadata and dynamic metadata.
In some embodiments of the present invention, said embedding said metadata in the encoded video stream comprises:
for Dolby Vision, setting and modifying a Dolby Vision descriptor according to the picture size and the bit rate of an encoded video stream, writing the encoded video stream into a container, inserting a Dolby configuration record into the container, and selecting a corresponding Dolby codec type for the Dolby configuration in the container;
for static metadata, first caching all static metadata in an encoder, and inserting the static metadata into a video frame of the encoded video stream that is encoded as an I-frame;
for dynamic metadata, the dynamic metadata is added to the corresponding encoded frame.
According to some embodiments of the invention, the static metadata comprises: color description metadata, CLL metadata, MDCV metadata, and the Dolby Vision configuration;
the dynamic metadata comprises: HDR10+ dynamic metadata and Dolby Vision dynamic metadata.
The video transcoding device according to the embodiment of the invention comprises:
the decoding module is used for decoding the source video stream to obtain a decoded video stream and corresponding metadata;
a metadata processing module for adjusting the corresponding metadata based on processing operations on the decoded video stream;
and the encoding module is used for encoding the decoded video stream and embedding the metadata into the encoded video stream.
According to some embodiments of the invention, the processing operation on the decoded video stream comprises changing the size or position of pixels in the picture, such as scaling, cropping, rotating, or mirroring the image; the metadata processing module adjusts the corresponding metadata accordingly based on the operation performed on the video stream.
According to an embodiment of the present invention, an electronic device includes: a memory, a processor, and a computer program stored on the memory and executable on the processor; when executed by the processor, the computer program implements the steps of the following method:
receiving a decoded video stream and corresponding metadata acquired by a decoder decoding a source video stream;
adjusting the corresponding metadata based on processing operations on the decoded video stream;
sending the metadata to an encoder to encode the decoded video stream, the metadata being embedded in the encoded video stream.
According to some embodiments of the invention, the processing operation on the decoded video stream comprises changing the size or position of pixels in the picture, for example by scaling, cropping, rotating, or mirroring the image; the corresponding metadata is adjusted accordingly based on the operation performed on the video stream.
The metadata processing method in video transcoding, the video transcoding device, and the electronic device provided by the invention have the following beneficial effects:
the present invention automatically transfers any HDR-related metadata from the decoded input to the encoded output during transcoding. If the video stream is transformed during transcoding, the metadata is transformed identically, so that any regions acted upon by the metadata remain the same in the transcoded image. Transcoding of HDR video is therefore automated: the metadata that HDR media must contain is preserved, and the metadata is automatically adjusted to match any modification of the image during transcoding. The method allows streaming services and other media users to transcode HDR media in a simple way without accessing the master file multiple times, simplifying the process and improving efficiency.
Detailed Description
To further explain the technical means and effects of the present invention adopted to achieve the intended purpose, the present invention will be described in detail with reference to the accompanying drawings and preferred embodiments.
The description of the method flow in the present specification and the steps of the flow chart in the drawings of the present specification are not necessarily strictly performed by the step numbers, and the execution order of the method steps may be changed. Moreover, certain steps may be omitted, multiple steps may be combined into one step execution, and/or a step may be broken down into multiple step executions.
The invention discloses a method for preserving metadata during HDR video transcoding. The core problem it solves is automating the transfer of HDR metadata from the input bitstream to the transcoded bitstream, and automatically modifying the metadata as required by any video transformation (e.g., scaling, cropping, mirroring, or rotation) applied in the process.
For example, the HDR10+ dynamic metadata specified in SMPTE ST 2094-40 and the Dolby Vision dynamic metadata specified in SMPTE ST 2094-10 may specify a processing window or active region. If these metadata properties are used in the input bitstream, the HDR10+ processing windows or Dolby Vision active regions must be transformed in the same way during transcoding so that they map to the correct region in the output picture.
HDR metadata contains tone mapping information for mapping a set of colors in a higher dynamic range to a lower dynamic range while ensuring that their visual perception remains similar. This tone mapping may be static, meaning that the mapping is optimized once, for the brightest scene in the entire video. The mapping may also be dynamic and updated on a frame-by-frame basis; dynamic mapping can thus provide better visual results for certain special scenes (e.g., snow, low light).
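As a rough illustration of the idea (not any standard's actual curve), a simple Reinhard-style compression can map luminance from a higher peak to a lower one. With static metadata the source peak is fixed for the whole video; with dynamic metadata it can track the scene per frame:

```python
def reinhard_tone_map(l_in: float, l_max_src: float, l_max_dst: float) -> float:
    # Normalize to the source peak, compress highlights with a Reinhard curve,
    # then rescale so the source peak maps exactly to the target peak.
    x = l_in / l_max_src
    y = x / (1.0 + x)
    return (y / 0.5) * l_max_dst  # the curve's value at x = 1 is 0.5

# Static mapping: one source peak (e.g., the brightest scene) for the whole video.
static_out = reinhard_tone_map(2000.0, l_max_src=4000.0, l_max_dst=1000.0)

# Dynamic mapping: the per-frame peak tracks the scene, so a dimmer scene is
# not over-compressed for the sake of a worst-case bright scene elsewhere.
dynamic_out = reinhard_tone_map(2000.0, l_max_src=2000.0, l_max_dst=1000.0)
```

In this sketch the static mapping renders a 2000-nit highlight darker than the dynamic mapping does, which is exactly the quality gap that per-frame metadata addresses.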
For the convenience of understanding the present invention, the HDR metadata related to the present invention is explained as follows:
the HDR metadata includes: static metadata and dynamic metadata.
Wherein the static metadata is typically kept constant for the entire video data stream. In some special cases, it may also be updated when a scene changes. To ensure that random access to the video file is possible, or to allow the terminal to dynamically join the video stream being played, static metadata is typically sent on each I-frame, regardless of whether the metadata has changed.
The static metadata contains the following: HDR color description metadata, content light level (CLL) information, and the mastering display color volume (MDCV). Dolby Vision has its own complete set of static metadata, called dvcC or dvvC, i.e., the Dolby Vision configuration record.
Several types of static metadata are explained below:
1. color description metadata.
For H.264 and H.265, which comply with the ITU codec standards, HDR color metadata is stored in the VUI. For AV1 and VP9, HDR color metadata is stored in color_config of the sequence header OBU (open bitstream unit), in accordance with the specifications of those codecs. In either case, the VUI and color_config carry the same information regardless of the codec. Fig. 2 shows an example of decoding HDR10 color information from an H.265 VUI. It is worth noting that different HDR standards use different color settings.
2. CLL and MDCV metadata.
CLL (content light level) information contains the maximum and the maximum frame-average luminance of the content; the MDCV (mastering display color volume) contains the calibration information of the mastering display. By interpreting both, playback can reproduce the intended viewing experience as closely as possible. CLL is defined in CTA-861-G; MDCV is defined in SMPTE ST 2086.
For H.264 and H.265, CLL and MDCV metadata are stored in SEI (supplemental enhancement information) messages. For AV1, they are stored in a metadata_obu. VP9 does not support storing such metadata in the bitstream and instead relies on an appropriate container format, such as MKV or WebM. Fig. 3 shows an example of decoding the CLL and MDCV SEI messages for H.265; the formats of the other codecs are similar.
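The two static records just described can be pictured as follows. The field names below are illustrative, not the exact syntax-element names from CTA-861-G or SMPTE ST 2086:

```python
from dataclasses import dataclass
from typing import Tuple

@dataclass
class ContentLightLevel:
    # CLL per CTA-861-G; both values in cd/m^2 (nits)
    max_cll: int    # peak luminance of the brightest pixel in the content
    max_fall: int   # maximum frame-average light level

@dataclass
class MasteringDisplayColorVolume:
    # MDCV per SMPTE ST 2086: calibration of the mastering display
    primaries: Tuple[Tuple[float, float], ...]  # (x, y) chromaticity of R, G, B
    white_point: Tuple[float, float]            # (x, y) chromaticity
    max_luminance: float                        # cd/m^2
    min_luminance: float                        # cd/m^2

# Values typical of an HDR10 master (illustrative only)
cll = ContentLightLevel(max_cll=1000, max_fall=400)
mdcv = MasteringDisplayColorVolume(
    primaries=((0.708, 0.292), (0.170, 0.797), (0.131, 0.046)),  # BT.2020 R, G, B
    white_point=(0.3127, 0.3290),                                # D65
    max_luminance=1000.0,
    min_luminance=0.0001,
)
```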
3. Dolby Vision configuration records.
Dolby defines a configuration record, named dvcC or dvvC, which stores information including the Dolby Vision profile and level, the presence of dynamic metadata (which Dolby calls the RPU: reference processing unit), and the presence of up to two coding layers: a base layer (BL) and an enhancement layer (EL). The dvcC is stored in the container, typically an ISO base media file format container such as MP4 or QuickTime, or a transport stream. These records must be passed from the input container to the output container, and the Dolby level must be updated according to the encoder settings, while all other settings should remain unchanged.
The role of the Dolby level is similar to that of the H.265 or H.264 level: it increases with picture size and bit rate. Depending on the profile, Dolby also uses its own codec identifiers; for example, dvh1 or dvhe is used for profile 5. Profile 8 contains several backward-compatible modes, such as HDR10, HLG, and SDR, so the hev1 or hvc1 names of the standard ISO-format H.265 codec are used.
Fig. 4 shows an MP4 file containing Dolby Vision profile 5; the dvcC indicates profile 5, level 9, and shows that the bitstream contains dynamic metadata (RPU) and only a base layer. Some Dolby Vision profiles, such as profile 4 and profile 7, support enhancement-layer coding, which can further increase the dynamic range.
Complete information about Dolby Vision profiles and levels, about carrying Dolby Vision configuration records in the ISO base media file format, and about the related transport streams can be found in the relevant prior-art documents and is not repeated here.
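The codec-identifier choice described above can be sketched as follows. This is a deliberate simplification: real muxers follow the Dolby Vision ISO base media file format specification, and the actual rules depend on more than the profile number alone:

```python
def dolby_codec_type(profile: int, cross_compatible: bool = False) -> str:
    # Illustrative mapping of Dolby Vision profile to an ISO sample-entry code.
    # Profile 5 carries Dolby Vision only, so a Dolby-specific code (dvh1/dvhe)
    # is used; profile 8's backward-compatible modes (HDR10, HLG, SDR) use the
    # standard H.265 codes (hev1/hvc1) so legacy players can decode the base layer.
    if profile == 8 and cross_compatible:
        return "hvc1"
    return "dvhe"

# A profile 5 stream gets a Dolby-specific identifier:
p5_code = dolby_codec_type(5)
# A backward-compatible profile 8 stream keeps the standard H.265 identifier:
p8_code = dolby_codec_type(8, cross_compatible=True)
```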
Dynamic HDR metadata is typically updated on a frame-by-frame basis. As previously described, the two HDR formats that use dynamic metadata are HDR10+ and Dolby Vision, but SMPTE has standardized four schemes:
SMPTE ST 2094-10 – Dolby Vision;
SMPTE ST 2094-20 – Philips SL-HDR1;
SMPTE ST 2094-30 – Technicolor SL-HDR1;
SMPTE ST 2094-40 – HDR10+.
two types of standard dynamic metadata are introduced below:
1. HDR10+ dynamic metadata.
Fig. 5 is an example of HDR10+ dynamic metadata. For H.264 and H.265, it is stored in T.35 SEI messages. AV1 stores it in a T.35 metadata_obu, and VP9 again relies on the container to store this metadata.
ST 2094-40 contains processing window parameters that specify pixel coordinates; these need to be updated if any image transformation that changes pixel positions is performed (e.g., scaling, rotation, mirroring, or cropping), as shown in fig. 6. The current version of the standard limits the value of num_windows to 1, so these parameters are not yet enabled for HDR10+.
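Were such window parameters in use, adjusting them for a resize would amount to scaling each corner coordinate by the resize ratio. A hypothetical helper, assuming a window stored as corner coordinates in source-picture pixels:

```python
def scale_processing_window(window, src_w, src_h, dst_w, dst_h):
    # window = (x0, y0, x1, y1) in pixel coordinates of the source picture.
    # Scale each coordinate by the per-axis resize ratio and round to pixels.
    sx, sy = dst_w / src_w, dst_h / src_h
    x0, y0, x1, y1 = window
    return (round(x0 * sx), round(y0 * sy), round(x1 * sx), round(y1 * sy))

# A window in a 3840x2160 picture, after downscaling to 1920x1080:
scaled = scale_processing_window((200, 100, 1000, 700), 3840, 2160, 1920, 1080)
```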
2. Dolby Vision dynamic metadata (RPU).
Dolby Vision dynamic metadata is defined in SMPTE ST 2094-10. For H.265, this metadata is stored in NAL (network abstraction layer) units of reserved type 62. Currently, Dolby Vision only supports H.265.
The ST 2094-10 level 5 (L5) metadata specifies the pixel coordinates of the picture's active area. When any picture transformation (e.g., scaling, cropping, rotating, or mirroring) is applied, the corresponding metadata must be modified and corrected as well. Since the active area is by definition limited to a rectangle, only transformations that preserve a rectangular active area can be supported; for example, only rotations by 90 degrees and multiples thereof are supported. Fig. 7 shows the active area as described in the standard.
The prior art typically uses FFmpeg or similar tools for transcoding. Taking x265 as an example encoder, x265 requires all HDR metadata to be supplied as parameters on the FFmpeg command line. The user must extract this information using their own software or common tools (e.g., MediaInfo, ffprobe, or bitstream analyzers).
Fig. 8 shows an example FFmpeg command that transcodes an HDR10 bitstream (HDR-input.mp4) to HDR-output.mp4 using the x265 encoder. The HDR metadata is highlighted in light gray. The first three parameters are the HDR color information; the fourth and fifth parameters are CLL and MDCV.
The above example specifies only static HDR metadata. x265 also has two other parameters for dynamic HDR metadata, shown below. The first specifies the file containing HDR10+ dynamic metadata, and the second specifies the file containing Dolby Vision dynamic metadata. Each file must contain one metadata entry for every frame in the input bitstream.
HDR10+ dynamic metadata: dhdr10-info=<filename>;
Dolby Vision dynamic metadata: dolby-vision-rpu=<filename>.
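The manual burden this workflow imposes can be sketched as follows: the user must assemble an x265 parameter string like the one below by hand. The option names follow x265 conventions; the metadata values and filenames are illustrative:

```python
def build_x265_hdr_params(max_cll, max_fall, hdr10plus_file=None, rpu_file=None):
    # Static HDR metadata the user must first extract with separate tools,
    # then type into the command line at transcoding time.
    params = [
        "colorprim=bt2020",        # HDR color description
        "transfer=smpte2084",      # PQ transfer function
        "colormatrix=bt2020nc",
        f"max-cll={max_cll},{max_fall}",
    ]
    if hdr10plus_file:
        params.append(f"dhdr10-info={hdr10plus_file}")   # HDR10+ per-frame metadata
    if rpu_file:
        params.append(f"dolby-vision-rpu={rpu_file}")    # Dolby Vision RPU file
    return ":".join(params)

cmdline = build_x265_hdr_params(1000, 400, hdr10plus_file="meta.json")
```

Every value in this string must be re-derived whenever the source changes, which is exactly the step the invention automates.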
In the prior art, HDR transcoding is implemented with FFmpeg + x265. This implementation requires the user to manually extract the HDR metadata and specify it on the command line at transcoding time. Each user must therefore develop their own method for extracting HDR metadata, and if the video is scaled or rotated during transcoding, the metadata must be transformed manually as needed. In the case of Dolby Vision, manually adjusting the Dolby level based on the new picture size and bit rate is also cumbersome and inconvenient.
As shown in fig. 1, a method for processing metadata in video transcoding according to an embodiment of the present invention includes:
S110, decoding the source video stream to obtain a decoded video stream and corresponding metadata;
It should be noted that the metadata includes static metadata and dynamic metadata. The static metadata includes: color description metadata, CLL metadata, MDCV metadata, and the Dolby Vision configuration; the dynamic metadata includes: HDR10+ dynamic metadata and Dolby Vision dynamic metadata.
S120, adjusting corresponding metadata based on the processing operation of the decoded video stream;
For example, the processing operations on the decoded video stream may include changing the size or position of pixels in the picture, such as scaling, cropping, rotating, or mirroring the image; the corresponding metadata is adjusted accordingly based on the operation performed on the video stream. It will be appreciated that the metadata may be embedded directly into the encoded video stream if no processing operation is performed on the decoded video stream.
S130, the decoded video stream is encoded, and the metadata is embedded in the encoded video stream.
In some embodiments of the present invention, embedding metadata in an encoded video stream includes:
for Dolby Vision, setting and modifying a Dolby Vision descriptor according to the picture size and the bit rate of an encoded video stream, writing the encoded video stream into a container, inserting a Dolby configuration record into the container, and selecting a corresponding Dolby codec type for the Dolby configuration in the container;
for static metadata, first caching all static metadata in the encoder, and inserting the static metadata into a video frame of the encoded video stream that is encoded as an I-frame;
for dynamic metadata, the dynamic metadata is added to the corresponding encoded frame.
According to the method for processing metadata in video transcoding described above, any HDR-related metadata is automatically transferred from the decoded input to the encoded output during transcoding. If the video stream is transformed during transcoding, the metadata is transformed identically, so that any regions acted upon by the metadata remain the same in the transcoded image. Transcoding of HDR video is therefore automated: the metadata that HDR media must contain is preserved, and the metadata is automatically adjusted to match any modification of the image during transcoding. The method allows streaming services and other media users to transcode HDR media in a simple way without accessing the master file multiple times, simplifying the process and improving efficiency.
The video transcoding device according to the embodiment of the invention comprises: the device comprises a decoding module, a metadata processing module and an encoding module.
The decoding module is used for decoding the source video stream to obtain a decoded video stream and corresponding metadata;
the metadata processing module is used for adjusting corresponding metadata based on the processing operation of the decoded video stream;
For example, the processing operations on the decoded video stream include changing the size or position of pixels in the picture, such as scaling, cropping, rotating, or mirroring the image; the metadata processing module adjusts the corresponding metadata accordingly based on the operation performed on the video stream.
The encoding module is used for encoding the decoded video stream and embedding the metadata into the encoded video stream.
The video transcoding device described above integrates the decoding, metadata processing, and encoding functions, automatically realizes transcoding of HDR video, preserves the metadata information that HDR media must contain, and automatically adjusts the metadata to match modifications of the image during transcoding. Streaming services and other media users can thus transcode HDR media in a simple way without accessing the master file multiple times, simplifying the process and improving efficiency.
According to an embodiment of the present invention, an electronic device includes a memory, a processor, and a computer program stored on the memory and executable on the processor. The electronic device may be, for example, a computer, and the following method steps are implemented by connecting the computer to a net transcoder.
The computer program realizes the following method steps when executed by a processor:
S110, receiving a decoded video stream and corresponding metadata obtained by a decoder decoding a source video stream;
S120, adjusting the corresponding metadata based on the processing operation on the decoded video stream;
For example, the processing operations on the decoded video stream include changing the size or position of pixels in the picture, such as scaling, cropping, rotating, or mirroring the image; the corresponding metadata is adjusted accordingly based on the operation performed on the video stream.
S130, the metadata is sent to the encoder to encode the decoded video stream, and the metadata is embedded in the encoded video stream.
The following describes a metadata processing method in video transcoding according to the present invention with reference to the accompanying drawings. It is to be understood that the following description is only exemplary in nature and should not be taken as a specific limitation on the invention.
When the present invention is used with a net transcoder, there is no need to manually extract or modify the HDR metadata. All HDR parameters are processed automatically; only the encoding parameters need to be specified, and the same command applies to all HDR formats. For example, if the input is profile 5 Dolby Vision, the output is profile 5 Dolby Vision; the same holds for HDR10, HDR10+, HLG, etc.
Fig. 9 shows an example of the scaling required in HDR transcoding. The input bitstream enters from the left and is decoded. The decoded image is output, then scaled, and finally passed to the video encoder as usual. The HDR metadata is extracted from the decoded bitstream and modified by operations consistent with the image scaling, and the final modified metadata is input to the video encoder, where it is inserted into the newly encoded bitstream.
The process is not limited to picture scaling; it may be used for other operations that change the size or position of pixels in a picture, including scaling, cropping, rotating (by 90, 180, or 270 degrees), mirroring, and the like. If no operation that changes the image is performed, the metadata can be embedded directly into the new bitstream without any modification.
Fig. 10 shows how the active region in Dolby Vision HDR metadata can be modified to support scaling, mirroring, rotation by multiples of 90 degrees, or picture cropping. Any picture operation that changes the X or Y coordinates of the active region requires a corresponding modification of the active region in the metadata. Note that since the active region in Dolby Vision must be a rectangle with sides parallel to the image, only picture operations that preserve such a rectangle are supported; a 45-degree rotation, for example, cannot be supported without loss (in practice, such operations are rare).
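The geometry in fig. 10 reduces to coordinate arithmetic on the rectangle. The sketch below is a simplified model with the rectangle stored as corner coordinates; the real ST 2094-10 L5 syntax expresses the area as offsets from the picture edges, and the operation names here are hypothetical:

```python
def transform_active_area(area, pic_w, pic_h, op, **kw):
    # area = (x0, y0, x1, y1): a rectangle inside a pic_w x pic_h picture.
    x0, y0, x1, y1 = area
    if op == "scale":       # kw: new_w, new_h - scale coordinates by the resize ratio
        sx, sy = kw["new_w"] / pic_w, kw["new_h"] / pic_h
        return (round(x0 * sx), round(y0 * sy), round(x1 * sx), round(y1 * sy))
    if op == "crop":        # kw: left, top, new_w, new_h - shift and clamp to the crop window
        l, t = kw["left"], kw["top"]
        return (max(x0 - l, 0), max(y0 - t, 0),
                min(x1 - l, kw["new_w"]), min(y1 - t, kw["new_h"]))
    if op == "hflip":       # mirror horizontally: x coordinates flip around the picture width
        return (pic_w - x1, y0, pic_w - x0, y1)
    if op == "rot90":       # rotate 90 degrees clockwise: picture becomes pic_h x pic_w
        return (pic_h - y1, x0, pic_h - y0, x1)
    # Unsupported transforms (e.g., 45-degree rotation) cannot keep the area
    # rectangular; the caller should drop the metadata and warn the user.
    raise ValueError("unsupported transform: drop dynamic metadata with a warning")

area = (100, 100, 500, 400)  # in a 1920x1080 picture
half = transform_active_area(area, 1920, 1080, "scale", new_w=960, new_h=540)
flip = transform_active_area(area, 1920, 1080, "hflip")
turn = transform_active_area(area, 1920, 1080, "rot90")
```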
The HDR transcoding method comprises the following steps:
A100, decode the image and find all static (color description, CLL and MDCV, Dolby Vision configuration, etc.) and dynamic (HDR10+ and Dolby Vision) HDR metadata, which can be attached to the decoded image. (A search for the corresponding I-frame may be required.)
A200, when any supported transformation is performed on the entire image or a selected region, perform an equivalent transformation on the dynamic metadata. If the dynamic metadata contains a selected region and an unsupported image transformation must be executed, the dynamic metadata is deleted, to avoid image distortion caused by applying it to the wrong region, and a warning is output. From the warning, the user learns of the change in image display quality during HDR transcoding.
And A300, encoding.
A310, for Dolby Vision, the Dolby Vision descriptor is modified according to the encoder picture size and bit rate settings. If the encoded bitstream is written to a container, a Dolby configuration record is inserted in the container and the appropriate Dolby codec type is used for the Dolby configuration in the container.
A320, for static metadata, all static metadata is first buffered in the encoder and inserted into any video frame encoded as an I-frame. If the static metadata is found to be updated on any incoming frame, the updated metadata is used from the next I-frame onward.
A330, for dynamic metadata, any dynamic metadata is added to the encoded frame that carries it. Note that the frames may be reordered during encoding, but the dynamic metadata must stay with the frame to which it belongs.
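Steps A100 through A330 can be sketched as a single loop over decoded frames. The stub encoder and the decision that every fourth frame is an I-frame are illustrative assumptions, not part of the invention:

```python
class StubEncoder:
    # Minimal stand-in for a video encoder, for illustration only.
    def __init__(self):
        self.log = []
        self._count = 0
    def encode(self, frame):
        self._count += 1
        self.log.append(("frame", frame))
        return self._count % 4 == 1          # pretend every 4th frame is an I-frame
    def insert_static(self, md):
        self.log.append(("static", dict(md)))
    def attach_dynamic(self, frame, md):
        self.log.append(("dynamic", frame, md))

def transcode_with_hdr_metadata(decoded_stream, transform, encoder):
    # decoded_stream yields (frame, static_md or None, dynamic_md or None) - step A100.
    cached_static = {}
    for frame, static_md, dynamic_md in decoded_stream:
        if static_md:
            cached_static.update(static_md)              # buffer/update static metadata
        frame, dynamic_md = transform(frame, dynamic_md)  # A200: equivalent transform
        if encoder.encode(frame):                         # True when an I-frame was produced
            encoder.insert_static(cached_static)          # A320: re-send on every I-frame
        if dynamic_md is not None:
            encoder.attach_dynamic(frame, dynamic_md)     # A330: metadata stays with its frame

# Five frames, static metadata on the first, dynamic metadata on all:
enc = StubEncoder()
stream = [(i, {"cll": (1000, 400)} if i == 0 else None, {"peak": 100 + i})
          for i in range(5)]
transcode_with_hdr_metadata(stream, lambda f, d: (f, d), enc)
```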
In summary, the present invention automatically transfers any HDR-related metadata from the decoded input to the encoded output during transcoding. If the video stream is transformed during transcoding, the metadata is transformed identically, so that any regions acted upon by the metadata remain the same in the transcoded image.
In the prior art, the original video file is accessed, the metadata is manually extracted and modified using multiple tools, and then inserted into the new video stream. This approach requires access rights, is very complex, and cannot be applied at scale. Moreover, in many cases the transcoding process has no access rights to the original video file (only to the video stream).
The invention automatically realizes transcoding of HDR video, preserves the metadata information that HDR media must contain, and automatically adjusts the metadata to match modifications of the image during transcoding. This allows streaming services and other media users to transcode HDR media in a simple way without having to access the master file multiple times, simplifying the process and increasing efficiency.
While the invention has been described in connection with specific embodiments thereof and with reference to the appended drawings and description, it is to be understood that the invention may be embodied in other specific forms without departing from its spirit or scope.