Disclosure of Invention
The invention provides a method for processing metadata in video transcoding, a video transcoding device, and an electronic device, and aims to solve the technical problem of how to automatically process metadata during video transcoding.
The method for processing the metadata in the video transcoding, provided by the embodiment of the invention, comprises the following steps:
decoding the source video stream to obtain a decoded video stream and corresponding metadata;
adjusting the corresponding metadata based on processing operations on the decoded video stream;
encoding the decoded video stream, and embedding the metadata into the encoded video stream.
According to some embodiments of the invention, the processing operation on the decoded video stream comprises changing the size or position of pixels in the picture, for example by scaling, cropping, rotating, or mirroring the image; the corresponding metadata is adjusted accordingly based on the operation performed on the video stream.
In some embodiments of the present invention, the metadata is directly embedded in the encoded video stream if no processing operation is performed on the decoded video stream.
According to some embodiments of the invention, the metadata comprises static metadata and dynamic metadata.
In some embodiments of the present invention, said embedding said metadata in the encoded video stream comprises:
for Dolby Vision, setting and modifying a Dolby Vision descriptor according to the picture size and the bit rate of an encoded video stream, writing the encoded video stream into a container, inserting a Dolby configuration record into the container, and selecting a corresponding Dolby codec type for the Dolby configuration in the container;
for static metadata, first caching all static metadata in an encoder, and inserting the static metadata into a video frame of the encoded video stream that is encoded as an I-frame;
for dynamic metadata, the dynamic metadata is added to the corresponding encoded frame.
According to some embodiments of the invention, the static metadata comprises: color description metadata, CLL metadata, MDCV metadata, and the Dolby Vision configuration;
the dynamic metadata comprises: HDR10+ dynamic metadata and Dolby Vision dynamic metadata.
The video transcoding device according to the embodiment of the invention comprises:
the decoding module is used for decoding the source video stream to obtain a decoded video stream and corresponding metadata;
a metadata processing module for adjusting the corresponding metadata based on processing operations on the decoded video stream;
and the encoding module is used for encoding the decoded video stream and embedding the metadata into the encoded video stream.
According to some embodiments of the invention, the processing operation on the decoded video stream comprises changing the size or position of pixels in the picture, such as scaling, cropping, rotating, or mirroring the image; the metadata processing module adjusts the corresponding metadata accordingly based on the operation performed on the video stream.
According to an embodiment of the present invention, an electronic device includes: a memory, a processor, and a computer program stored on the memory and executable on the processor; when executed by the processor, the computer program implements the steps of the following method:
receiving a decoded video stream and corresponding metadata acquired by a decoder decoding a source video stream;
adjusting the corresponding metadata based on processing operations on the decoded video stream;
sending the metadata to an encoder to encode the decoded video stream, the metadata being embedded in the encoded video stream.
According to some embodiments of the invention, the processing operation on the decoded video stream comprises changing the size or position of pixels in the picture, for example by scaling, cropping, rotating, or mirroring the image; the corresponding metadata is adjusted accordingly based on the operation performed on the video stream.
The metadata processing method in video transcoding, the video transcoding device, and the electronic device provided by the invention have the following beneficial effects:
the present invention automatically transfers any HDR-related metadata from the decoded input to the encoded output during transcoding. If the video stream is transformed during transcoding, the metadata is transformed identically, so that any regions acted upon by the metadata remain the same in the transcoded image. Transcoding of HDR video is therefore automated: the metadata that HDR media must contain is preserved, and the metadata is automatically adjusted to match any modification of the image during transcoding. The method allows streaming services and other media users to transcode HDR media in a simple way without accessing the master file multiple times, simplifying the process and improving efficiency.
Detailed Description
To further explain the technical means and effects of the present invention adopted to achieve the intended purpose, the present invention will be described in detail with reference to the accompanying drawings and preferred embodiments.
The description of the method flow in the present specification and the steps of the flow chart in the drawings of the present specification are not necessarily strictly performed by the step numbers, and the execution order of the method steps may be changed. Moreover, certain steps may be omitted, multiple steps may be combined into one step execution, and/or a step may be broken down into multiple step executions.
The invention discloses a method for preserving metadata during HDR video transcoding. The core problem it solves is automating the transfer of HDR metadata from the input bitstream to the transcoded bitstream, and automatically modifying the metadata as required by any video transformation (e.g., scaling, cropping, mirroring, or rotation) applied in the process.
For example, the HDR10+ dynamic metadata specified in SMPTE ST 2094-40 and the Dolby Vision dynamic metadata specified in SMPTE ST 2094-10 may specify a processing window or active region. If these metadata properties are used in the input bitstream, the HDR10+ processing windows or Dolby Vision active regions must be transformed in the same way during transcoding so that they map to the correct region in the output picture.
HDR metadata contains tone mapping information for mapping a set of colors in a higher dynamic range to a lower dynamic range while ensuring that their visual perception remains similar. This tone mapping may be static, meaning that the mapping is optimized once, for the brightest scene in the entire video. The mapping may also be dynamic and updated on a frame-by-frame basis; dynamic mapping can thus provide better visual results for certain special scenes (e.g., snow, low light).
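As a rough illustration of the idea (not any standard's actual curve), a simple Reinhard-style compression can map luminance from a higher peak to a lower one. With static metadata the source peak is fixed for the whole video; with dynamic metadata it can track the scene per frame:

```python
def reinhard_tone_map(l_in: float, l_max_src: float, l_max_dst: float) -> float:
    # Normalize to the source peak, compress highlights with a Reinhard curve,
    # then rescale so the source peak maps exactly to the target peak.
    x = l_in / l_max_src
    y = x / (1.0 + x)
    return (y / 0.5) * l_max_dst  # the curve's value at x = 1 is 0.5

# Static mapping: one source peak (e.g., the brightest scene) for the whole video.
static_out = reinhard_tone_map(2000.0, l_max_src=4000.0, l_max_dst=1000.0)

# Dynamic mapping: the per-frame peak tracks the scene, so a dimmer scene is
# not over-compressed for the sake of a worst-case bright scene elsewhere.
dynamic_out = reinhard_tone_map(2000.0, l_max_src=2000.0, l_max_dst=1000.0)
```

In this sketch the static mapping renders a 2000-nit highlight darker than the dynamic mapping does, which is exactly the quality gap that per-frame metadata addresses.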
For the convenience of understanding the present invention, the HDR metadata related to the present invention is explained as follows:
the HDR metadata includes: static metadata and dynamic metadata.
Wherein the static metadata is typically kept constant for the entire video data stream. In some special cases, it may also be updated when a scene changes. To ensure that random access to the video file is possible, or to allow the terminal to dynamically join the video stream being played, static metadata is typically sent on each I-frame, regardless of whether the metadata has changed.
The static metadata contains the following: HDR color description metadata, content light level (CLL) information, and the mastering display color volume (MDCV). Dolby Vision has its own complete set of static metadata, called dvcC or dvvC, i.e., the Dolby Vision configuration record.
Several types of static metadata are explained below:
1. color description metadata.
For H.264 and H.265, which comply with the ITU codec standards, HDR color metadata is stored in the VUI. For AV1 and VP9, HDR color metadata is stored in color_config of the sequence header OBU (open bitstream unit), in accordance with the specifications of those codecs. In either case, the VUI and color_config carry the same information regardless of the codec. Fig. 2 shows an example of decoding HDR10 color information from an H.265 VUI. It is worth noting that different HDR standards use different color settings.
2. CLL and MDCV metadata.
CLL (content light level) information contains the maximum and the maximum frame-average luminance of the content; the MDCV (mastering display color volume) contains the calibration information of the mastering display. By interpreting both, playback can reproduce the intended viewing experience as closely as possible. CLL is defined in CTA-861-G; MDCV is defined in SMPTE ST 2086.
For H.264 and H.265, CLL and MDCV metadata are stored in SEI (supplemental enhancement information) messages. For AV1, they are stored in a metadata_obu. VP9 does not support storing such metadata in the bitstream and instead relies on an appropriate container format, such as MKV or WebM. Fig. 3 shows an example of decoding the CLL and MDCV SEI messages for H.265; the formats of the other codecs are similar.
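The two static records just described can be pictured as follows. The field names below are illustrative, not the exact syntax-element names from CTA-861-G or SMPTE ST 2086:

```python
from dataclasses import dataclass
from typing import Tuple

@dataclass
class ContentLightLevel:
    # CLL per CTA-861-G; both values in cd/m^2 (nits)
    max_cll: int    # peak luminance of the brightest pixel in the content
    max_fall: int   # maximum frame-average light level

@dataclass
class MasteringDisplayColorVolume:
    # MDCV per SMPTE ST 2086: calibration of the mastering display
    primaries: Tuple[Tuple[float, float], ...]  # (x, y) chromaticity of R, G, B
    white_point: Tuple[float, float]            # (x, y) chromaticity
    max_luminance: float                        # cd/m^2
    min_luminance: float                        # cd/m^2

# Values typical of an HDR10 master (illustrative only)
cll = ContentLightLevel(max_cll=1000, max_fall=400)
mdcv = MasteringDisplayColorVolume(
    primaries=((0.708, 0.292), (0.170, 0.797), (0.131, 0.046)),  # BT.2020 R, G, B
    white_point=(0.3127, 0.3290),                                # D65
    max_luminance=1000.0,
    min_luminance=0.0001,
)
```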
3. Dolby Vision configuration records.
Dolby defines a configuration record, named dvcC or dvvC, which stores information including the Dolby Vision profile and level, the presence of dynamic metadata (which Dolby calls the RPU: reference processing unit), and the presence of up to two coding layers: a base layer (BL) and an enhancement layer (EL). The dvcC is stored in the container, typically an ISO base media file format container such as MP4 or QuickTime, or a transport stream. These records must be passed from the input container to the output container, and the Dolby level must be updated according to the encoder settings, while all other settings should remain unchanged.
The role of the Dolby level is similar to that of the H.265 or H.264 level: it increases with picture size and bit rate. Depending on the profile, Dolby also uses its own codec identifiers; for example, dvh1 or dvhe is used for profile 5. Profile 8 contains several backward-compatible modes, such as HDR10, HLG, and SDR, so the hev1 or hvc1 names of the standard ISO-format H.265 codec are used.
Fig. 4 shows an MP4 file containing Dolby Vision profile 5; the dvcC indicates profile 5, level 9, and shows that the bitstream contains dynamic metadata (RPU) and only a base layer. Some Dolby Vision profiles, such as profile 4 and profile 7, support enhancement-layer coding, which can further increase the dynamic range.
Complete information about Dolby Vision profiles and levels, about carrying Dolby Vision configuration records in the ISO base media file format, and about the related transport streams can be found in the relevant prior-art documents and is not repeated here.
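The codec-identifier choice described above can be sketched as follows. This is a deliberate simplification: real muxers follow the Dolby Vision ISO base media file format specification, and the actual rules depend on more than the profile number alone:

```python
def dolby_codec_type(profile: int, cross_compatible: bool = False) -> str:
    # Illustrative mapping of Dolby Vision profile to an ISO sample-entry code.
    # Profile 5 carries Dolby Vision only, so a Dolby-specific code (dvh1/dvhe)
    # is used; profile 8's backward-compatible modes (HDR10, HLG, SDR) use the
    # standard H.265 codes (hev1/hvc1) so legacy players can decode the base layer.
    if profile == 8 and cross_compatible:
        return "hvc1"
    return "dvhe"

# A profile 5 stream gets a Dolby-specific identifier:
p5_code = dolby_codec_type(5)
# A backward-compatible profile 8 stream keeps the standard H.265 identifier:
p8_code = dolby_codec_type(8, cross_compatible=True)
```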
Dynamic HDR metadata is typically updated on a frame-by-frame basis. As previously described, the two HDR formats that use dynamic metadata are HDR10+ and Dolby Vision, but SMPTE has standardized four schemes:
SMPTE ST 2094-10 – Dolby Vision;
SMPTE ST 2094-20 – Philips SL-HDR1;
SMPTE ST 2094-30 – Technicolor SL-HDR1;
SMPTE ST 2094-40 – HDR10+.
two types of standard dynamic metadata are introduced below:
1. HDR10+ dynamic metadata.
Fig. 5 is an example of HDR10+ dynamic metadata. For H.264 and H.265, it is stored in T.35 SEI messages. AV1 stores it in a T.35 metadata_obu, and VP9 again relies on the container to store this metadata.
ST 2094-40 contains processing window parameters that specify pixel coordinates; these need to be updated if any image transformation that changes pixel positions is performed (e.g., scaling, rotation, mirroring, or cropping), as shown in fig. 6. The current version of the standard limits the value of num_windows to 1, so these parameters are not yet enabled for HDR10+.
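Were such window parameters in use, adjusting them for a resize would amount to scaling each corner coordinate by the resize ratio. A hypothetical helper, assuming a window stored as corner coordinates in source-picture pixels:

```python
def scale_processing_window(window, src_w, src_h, dst_w, dst_h):
    # window = (x0, y0, x1, y1) in pixel coordinates of the source picture.
    # Scale each coordinate by the per-axis resize ratio and round to pixels.
    sx, sy = dst_w / src_w, dst_h / src_h
    x0, y0, x1, y1 = window
    return (round(x0 * sx), round(y0 * sy), round(x1 * sx), round(y1 * sy))

# A window in a 3840x2160 picture, after downscaling to 1920x1080:
scaled = scale_processing_window((200, 100, 1000, 700), 3840, 2160, 1920, 1080)
```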
2. Dolby Vision dynamic metadata (RPU).
Dolby Vision dynamic metadata is defined in SMPTE ST 2094-10. For H.265, this metadata is stored in NAL (network abstraction layer) units of reserved type 62. Currently, Dolby Vision only supports H.265.
The ST 2094-10 level 5 (L5) metadata specifies the pixel coordinates of the picture's active area. When any picture transformation (e.g., scaling, cropping, rotating, or mirroring) is applied, the corresponding metadata must be modified and corrected as well. Since the active area is by definition limited to a rectangle, only transformations that preserve a rectangular active area can be supported; for example, only rotations by 90 degrees and multiples thereof are supported. Fig. 7 shows the active area as described in the standard.
The prior art typically uses FFmpeg or similar tools for transcoding. Taking x265 as an example encoder, x265 requires all HDR metadata to be supplied as parameters on the FFmpeg command line. The user must extract this information using their own software or common tools (e.g., MediaInfo, ffprobe, or bitstream analyzers).
Fig. 8 shows an example FFmpeg command that transcodes an HDR10 bitstream (HDR-input.mp4) to HDR-output.mp4 using the x265 encoder. The HDR metadata is highlighted in light gray. The first three parameters are the HDR color information; the fourth and fifth parameters are CLL and MDCV.
The above example specifies only static HDR metadata. x265 also has two other parameters for dynamic HDR metadata, shown below. The first specifies the file containing HDR10+ dynamic metadata, and the second specifies the file containing Dolby Vision dynamic metadata. Each file must contain one metadata entry for every frame in the input bitstream.
HDR10+ dynamic metadata: dhdr10-info=<filename>;
Dolby Vision dynamic metadata: dolby-vision-rpu=<filename>.
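The manual burden this workflow imposes can be sketched as follows: the user must assemble an x265 parameter string like the one below by hand. The option names follow x265 conventions; the metadata values and filenames are illustrative:

```python
def build_x265_hdr_params(max_cll, max_fall, hdr10plus_file=None, rpu_file=None):
    # Static HDR metadata the user must first extract with separate tools,
    # then type into the command line at transcoding time.
    params = [
        "colorprim=bt2020",        # HDR color description
        "transfer=smpte2084",      # PQ transfer function
        "colormatrix=bt2020nc",
        f"max-cll={max_cll},{max_fall}",
    ]
    if hdr10plus_file:
        params.append(f"dhdr10-info={hdr10plus_file}")   # HDR10+ per-frame metadata
    if rpu_file:
        params.append(f"dolby-vision-rpu={rpu_file}")    # Dolby Vision RPU file
    return ":".join(params)

cmdline = build_x265_hdr_params(1000, 400, hdr10plus_file="meta.json")
```

Every value in this string must be re-derived whenever the source changes, which is exactly the step the invention automates.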
In the prior art, HDR transcoding is implemented with FFmpeg + x265. This implementation requires the user to manually extract the HDR metadata and specify it on the command line at transcoding time. Each user must therefore develop their own method for extracting HDR metadata, and if the video is scaled or rotated during transcoding, the metadata must be transformed manually as needed. In the case of Dolby Vision, manually adjusting the Dolby level based on the new picture size and bit rate is also cumbersome and inconvenient.
As shown in fig. 1, a method for processing metadata in video transcoding according to an embodiment of the present invention includes:
S110, decoding the source video stream to obtain a decoded video stream and corresponding metadata;
It should be noted that the metadata includes static metadata and dynamic metadata. The static metadata includes: color description metadata, CLL metadata, MDCV metadata, and the Dolby Vision configuration; the dynamic metadata includes: HDR10+ dynamic metadata and Dolby Vision dynamic metadata.
S120, adjusting corresponding metadata based on the processing operation of the decoded video stream;
For example, the processing operations on the decoded video stream may include changing the size or position of pixels in the picture, such as scaling, cropping, rotating, or mirroring the image; the corresponding metadata is adjusted accordingly based on the operation performed on the video stream. It will be appreciated that the metadata may be embedded directly into the encoded video stream if no processing operation is performed on the decoded video stream.
S130, the decoded video stream is encoded, and the metadata is embedded in the encoded video stream.
In some embodiments of the present invention, embedding metadata in an encoded video stream includes:
for Dolby Vision, setting and modifying a Dolby Vision descriptor according to the picture size and the bit rate of an encoded video stream, writing the encoded video stream into a container, inserting a Dolby configuration record into the container, and selecting a corresponding Dolby codec type for the Dolby configuration in the container;
for static metadata, first caching all static metadata in the encoder, and inserting the static metadata into a video frame of the encoded video stream that is encoded as an I-frame;
for dynamic metadata, the dynamic metadata is added to the corresponding encoded frame.
According to the method for processing metadata in video transcoding described above, any HDR-related metadata is automatically transferred from the decoded input to the encoded output during transcoding. If the video stream is transformed during transcoding, the metadata is transformed identically, so that any regions acted upon by the metadata remain the same in the transcoded image. Transcoding of HDR video is therefore automated: the metadata that HDR media must contain is preserved, and the metadata is automatically adjusted to match any modification of the image during transcoding. The method allows streaming services and other media users to transcode HDR media in a simple way without accessing the master file multiple times, simplifying the process and improving efficiency.
The video transcoding device according to the embodiment of the invention comprises: the device comprises a decoding module, a metadata processing module and an encoding module.
The decoding module is used for decoding the source video stream to obtain a decoded video stream and corresponding metadata;
the metadata processing module is used for adjusting corresponding metadata based on the processing operation of the decoded video stream;
For example, the processing operations on the decoded video stream include changing the size or position of pixels in the picture, such as scaling, cropping, rotating, or mirroring the image; the metadata processing module adjusts the corresponding metadata accordingly based on the operation performed on the video stream.
The encoding module is used for encoding the decoded video stream and embedding the metadata into the encoded video stream.
The video transcoding device described above integrates the decoding, metadata processing, and encoding functions, automatically realizes transcoding of HDR video, preserves the metadata information that HDR media must contain, and automatically adjusts the metadata to match modifications of the image during transcoding. Streaming services and other media users can thus transcode HDR media in a simple way without accessing the master file multiple times, simplifying the process and improving efficiency.
According to an embodiment of the present invention, an electronic device includes a memory, a processor, and a computer program stored on the memory and executable on the processor. The electronic device may be, for example, a computer, and the following method steps are implemented by connecting the computer to a net transcoder.
The computer program realizes the following method steps when executed by a processor:
S110, receiving a decoded video stream and corresponding metadata obtained by a decoder decoding a source video stream;
S120, adjusting the corresponding metadata based on the processing operation on the decoded video stream;
For example, the processing operations on the decoded video stream include changing the size or position of pixels in the picture, such as scaling, cropping, rotating, or mirroring the image; the corresponding metadata is adjusted accordingly based on the operation performed on the video stream.
S130, the metadata is sent to the encoder to encode the decoded video stream, and the metadata is embedded in the encoded video stream.
The following describes a metadata processing method in video transcoding according to the present invention with reference to the accompanying drawings. It is to be understood that the following description is only exemplary in nature and should not be taken as a specific limitation on the invention.
When the present invention is used with a net transcoder, there is no need to manually extract or modify the HDR metadata. All HDR parameters are processed automatically; only the encoding parameters need to be specified, and the same command applies to all HDR formats. For example, if the input is profile 5 Dolby Vision, the output is profile 5 Dolby Vision; the same holds for HDR10, HDR10+, HLG, etc.
Fig. 9 shows an example of the scaling required in HDR transcoding. The input bitstream enters from the left and is decoded. The decoded image is output, then scaled, and finally passed to the video encoder as usual. The HDR metadata is extracted from the decoded bitstream and modified by operations consistent with the image scaling, and the final modified metadata is input to the video encoder, where it is inserted into the newly encoded bitstream.
The process is not limited to picture scaling; it may be used for other operations that change the size or position of pixels in a picture, including scaling, cropping, rotating (by 90, 180, or 270 degrees), mirroring, and the like. If no operation that changes the image is performed, the metadata can be embedded directly into the new bitstream without any modification.
Fig. 10 shows how the active region in Dolby Vision HDR metadata can be modified to support scaling, mirroring, rotation by multiples of 90 degrees, or picture cropping. Any picture operation that changes the X or Y coordinates of the active region requires a corresponding modification of the active region in the metadata. Note that since the active region in Dolby Vision must be a rectangle with sides parallel to the image, only picture operations that preserve such a rectangle are supported; a 45-degree rotation, for example, cannot be supported without loss (in practice, such operations are rare).
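The geometry in fig. 10 reduces to coordinate arithmetic on the rectangle. The sketch below is a simplified model with the rectangle stored as corner coordinates; the real ST 2094-10 L5 syntax expresses the area as offsets from the picture edges, and the operation names here are hypothetical:

```python
def transform_active_area(area, pic_w, pic_h, op, **kw):
    # area = (x0, y0, x1, y1): a rectangle inside a pic_w x pic_h picture.
    x0, y0, x1, y1 = area
    if op == "scale":       # kw: new_w, new_h - scale coordinates by the resize ratio
        sx, sy = kw["new_w"] / pic_w, kw["new_h"] / pic_h
        return (round(x0 * sx), round(y0 * sy), round(x1 * sx), round(y1 * sy))
    if op == "crop":        # kw: left, top, new_w, new_h - shift and clamp to the crop window
        l, t = kw["left"], kw["top"]
        return (max(x0 - l, 0), max(y0 - t, 0),
                min(x1 - l, kw["new_w"]), min(y1 - t, kw["new_h"]))
    if op == "hflip":       # mirror horizontally: x coordinates flip around the picture width
        return (pic_w - x1, y0, pic_w - x0, y1)
    if op == "rot90":       # rotate 90 degrees clockwise: picture becomes pic_h x pic_w
        return (pic_h - y1, x0, pic_h - y0, x1)
    # Unsupported transforms (e.g., 45-degree rotation) cannot keep the area
    # rectangular; the caller should drop the metadata and warn the user.
    raise ValueError("unsupported transform: drop dynamic metadata with a warning")

area = (100, 100, 500, 400)  # in a 1920x1080 picture
half = transform_active_area(area, 1920, 1080, "scale", new_w=960, new_h=540)
flip = transform_active_area(area, 1920, 1080, "hflip")
turn = transform_active_area(area, 1920, 1080, "rot90")
```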
The HDR transcoding method comprises the following steps:
A100, decode the image and find all static (color description, CLL and MDCV, Dolby Vision configuration, etc.) and dynamic (HDR10+ and Dolby Vision) HDR metadata, which can be attached to the decoded image. (A search for the corresponding I-frame may be required.)
A200, when any supported transformation is performed on the entire image or a selected region, perform an equivalent transformation on the dynamic metadata. If the dynamic metadata contains a selected region and an unsupported image transformation must be executed, the dynamic metadata is deleted, to avoid image distortion caused by applying it to the wrong region, and a warning is output. From the warning, the user learns of the change in image display quality during HDR transcoding.
And A300, encoding.
A310, for Dolby Vision, the Dolby Vision descriptor is modified according to the encoder picture size and bit rate settings. If the encoded bitstream is written to a container, a Dolby configuration record is inserted in the container and the appropriate Dolby codec type is used for the Dolby configuration in the container.
A320, for static metadata, all static metadata is first buffered in the encoder and inserted into any video frame encoded as an I-frame. If the static metadata is found to be updated on any incoming frame, the updated metadata is used from the next I-frame onward.
A330, for dynamic metadata, any dynamic metadata is added to the encoded frame that carries it. Note that the frames may be reordered during encoding, but the dynamic metadata must stay with the frame to which it belongs.
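Steps A100 through A330 can be sketched as a single loop over decoded frames. The stub encoder and the decision that every fourth frame is an I-frame are illustrative assumptions, not part of the invention:

```python
class StubEncoder:
    # Minimal stand-in for a video encoder, for illustration only.
    def __init__(self):
        self.log = []
        self._count = 0
    def encode(self, frame):
        self._count += 1
        self.log.append(("frame", frame))
        return self._count % 4 == 1          # pretend every 4th frame is an I-frame
    def insert_static(self, md):
        self.log.append(("static", dict(md)))
    def attach_dynamic(self, frame, md):
        self.log.append(("dynamic", frame, md))

def transcode_with_hdr_metadata(decoded_stream, transform, encoder):
    # decoded_stream yields (frame, static_md or None, dynamic_md or None) - step A100.
    cached_static = {}
    for frame, static_md, dynamic_md in decoded_stream:
        if static_md:
            cached_static.update(static_md)              # buffer/update static metadata
        frame, dynamic_md = transform(frame, dynamic_md)  # A200: equivalent transform
        if encoder.encode(frame):                         # True when an I-frame was produced
            encoder.insert_static(cached_static)          # A320: re-send on every I-frame
        if dynamic_md is not None:
            encoder.attach_dynamic(frame, dynamic_md)     # A330: metadata stays with its frame

# Five frames, static metadata on the first, dynamic metadata on all:
enc = StubEncoder()
stream = [(i, {"cll": (1000, 400)} if i == 0 else None, {"peak": 100 + i})
          for i in range(5)]
transcode_with_hdr_metadata(stream, lambda f, d: (f, d), enc)
```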
In summary, the present invention automatically transfers any HDR-related metadata from the decoded input to the encoded output during transcoding. If the video stream is transformed during transcoding, the metadata is transformed identically, so that any regions acted upon by the metadata remain the same in the transcoded image.
In the prior art, the original video file is accessed, the metadata is manually extracted and modified using multiple tools, and then inserted into the new video stream. This approach requires access rights, is very complex, and cannot be applied at scale. Moreover, in many cases the transcoding process has no access rights to the original video file (only to the video stream).
The invention automatically realizes transcoding of HDR video, preserves the metadata information that HDR media must contain, and automatically adjusts the metadata to match modifications of the image during transcoding. This allows streaming services and other media users to transcode HDR media in a simple way without having to access the master file multiple times, simplifying the process and increasing efficiency.
While the invention has been described in connection with specific embodiments thereof and with reference to the appended drawings and description, it is to be understood that the invention may be embodied in other specific forms without departing from its spirit or scope.