CN118338036A - Scene description file generation method and device - Google Patents
Scene description file generation method and device
- Publication number
- CN118338036A CN118338036A CN202311054420.3A CN202311054420A CN118338036A CN 118338036 A CN118338036 A CN 118338036A CN 202311054420 A CN202311054420 A CN 202311054420A CN 118338036 A CN118338036 A CN 118338036A
- Authority
- CN
- China
- Prior art keywords
- target
- digital person
- description
- node
- buffer
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06T—IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
- G06T15/00—3D [Three Dimensional] image rendering
- G06T15/005—General purpose rendering architectures
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06T—IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
- G06T1/00—General purpose image data processing
- G06T1/60—Memory management
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06T—IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
- G06T13/00—Animation
- G06T13/20—3D [Three Dimensional] animation
- G06T13/40—3D [Three Dimensional] animation of characters, e.g. humans, animals or virtual beings
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06T—IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
- G06T15/00—3D [Three Dimensional] image rendering
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06T—IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
- G06T15/00—3D [Three Dimensional] image rendering
- G06T15/04—Texture mapping
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06T—IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
- G06T17/00—Three dimensional [3D] modelling, e.g. data description of 3D objects
- G06T17/20—Finite element generation, e.g. wire-frame surface description, tesselation
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06T—IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
- G06T9/00—Image coding
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06T—IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
- G06T9/00—Image coding
- G06T9/40—Tree coding, e.g. quadtree, octree
-
- H—ELECTRICITY
- H04—ELECTRIC COMMUNICATION TECHNIQUE
- H04N—PICTORIAL COMMUNICATION, e.g. TELEVISION
- H04N21/00—Selective content distribution, e.g. interactive television or video on demand [VOD]
- H04N21/20—Servers specifically adapted for the distribution of content, e.g. VOD servers; Operations thereof
- H04N21/23—Processing of content or additional data; Elementary server operations; Server middleware
- H04N21/234—Processing of video elementary streams, e.g. splicing of video streams or manipulating encoded video stream scene graphs
-
- H—ELECTRICITY
- H04—ELECTRIC COMMUNICATION TECHNIQUE
- H04N—PICTORIAL COMMUNICATION, e.g. TELEVISION
- H04N21/00—Selective content distribution, e.g. interactive television or video on demand [VOD]
- H04N21/40—Client devices specifically adapted for the reception of or interaction with content, e.g. set-top-box [STB]; Operations thereof
- H04N21/43—Processing of content or additional data, e.g. demultiplexing additional data from a digital video stream; Elementary client operations, e.g. monitoring of home network or synchronising decoder's clock; Client middleware
- H04N21/44—Processing of video elementary streams, e.g. splicing a video clip retrieved from local storage with an incoming video stream or rendering scenes according to encoded video stream scene graphs
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06T—IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
- G06T2210/00—Indexing scheme for image generation or computer graphics
- G06T2210/61—Scene description
Landscapes
- Engineering & Computer Science (AREA)
- Physics & Mathematics (AREA)
- General Physics & Mathematics (AREA)
- Theoretical Computer Science (AREA)
- Computer Graphics (AREA)
- Multimedia (AREA)
- Signal Processing (AREA)
- Geometry (AREA)
- Software Systems (AREA)
- Processing Or Creating Images (AREA)
Abstract
Some embodiments of the application provide a method and a device for generating a scene description file, relating to the technical field of video processing. The method comprises: when a target digital person is included in a three-dimensional scene to be rendered, obtaining a representation scheme of each digital person component of the target digital person; generating a digital person node array according to the representation scheme of each digital person component of the target digital person; and adding the digital person node array to a digital person node description module in a scene description file of the three-dimensional scene to be rendered, the digital person node description module being the node description module corresponding to the node representing the target digital person in the node list of the scene description file. Some embodiments of the application thereby enable the scene description framework to support flexible combination of digital person components with different representation schemes.
Description
The present application claims priority to Chinese patent application No. 202310036790.8, filed with the China National Intellectual Property Administration on January 10, 2023, and to Chinese patent application No. 202310672397.8, filed on June 7, 2023, both of which are incorporated herein by reference in their entirety.
Technical Field
Some embodiments of the application relate to the field of video processing technology, and more particularly, to a method and apparatus for generating a scene description file.
Background
With the rapid development of immersive media technologies, the work mode of online and remote office work has been accepted by more and more people, and immersive media technologies are becoming increasingly common in people's work and life. Compared with improving immersion by means of a large display screen, Virtual Reality (VR) / Augmented Reality (AR) devices have the design advantage of fitting the physiological structure of the human body and can present vivid, realistic scenes through binocular stereoscopic vision, so VR/AR devices are very likely to serve as the hardware basis of immersive media in the future. A very important part of VR/AR technology is the user representation, or simply the digital person (avatar). When wearing a VR/AR device, the user expects his or her real image or virtual avatar to appear in the virtual scene presented by the head-mounted device and to move in that virtual scene in synchronization with the user's movement in the real world. This is an important function that the digital person should possess, and is also an important factor in improving the sense of immersion.
In the scene description framework of immersive media, a digital person in a three-dimensional scene can be rendered correctly only if the representation scheme of each component of the digital person is accurately obtained. However, the existing scene description framework only supports describing one overall representation scheme for a digital person in the scene description file; it therefore only supports digital persons whose components all use the same representation scheme, and does not support flexibly combining digital person components that use different representation schemes.
Disclosure of Invention
The exemplary embodiment of the application provides a method and a device for generating a scene description file, which are used for solving the problem that the existing scene description framework does not support flexible combination of digital human components with different representation schemes.
Some embodiments of the present application provide the following technical solutions:
in a first aspect, some embodiments of the present application provide a method for generating a scene description file, including:
under the condition that a target digital person is included in a three-dimensional scene to be rendered, obtaining a representation scheme of each digital person component of the target digital person;
generating a digital person node array according to the representation scheme of each digital person component of the target digital person;
adding the digital person node array to a digital person node description module in a scene description file of the three-dimensional scene to be rendered;
wherein the digital person node description module is the node description module corresponding to the node representing the target digital person in the node list of the scene description file.
In a second aspect, some embodiments of the present application provide a generating apparatus for a scene description file, including:
A memory configured to store a computer program;
And a processor configured to cause the scene description file generating device to implement the scene description file generating method according to the first aspect when the computer program is invoked.
In a third aspect, some embodiments of the present application provide a computer readable storage medium, where a computer program is stored, where the computer program, when executed by a computing device, causes the computing device to implement the method for generating a scene description file according to the first aspect.
In a fourth aspect, some embodiments of the present application provide a computer program product, which when run on a computer causes the computer to implement the method of generating a scene description file.
As can be seen from the above technical solutions, in the method for generating a scene description file according to the embodiments of the present application, when the three-dimensional scene to be rendered includes a target digital person, the representation scheme of each digital person component of the target digital person is first obtained; a digital person node array is then generated according to the representation scheme of each digital person component, and the digital person node array is added to the digital person node description module corresponding to the node representing the target digital person in the node list of the scene description file of the three-dimensional scene to be rendered. Because the digital person node array is generated from the representation schemes of the individual digital person components and written into the scene description file, the representation scheme of each digital person component of the target digital person can be obtained by parsing the scene description file of the three-dimensional scene to be rendered. Even if different digital person components of the target digital person adopt different representation schemes, the method can still generate a scene description file that includes description information of the representation scheme of each digital person component, thereby enabling the scene description framework to support flexible combination of digital person components with different representation schemes.
Drawings
In order to more clearly illustrate some embodiments of the present application or the implementations in the related art, the drawings required for describing the embodiments or the related art are briefly introduced below. It is apparent that the drawings in the following description show only some embodiments of the present application, and that other drawings may be obtained from these drawings by those skilled in the art.
FIG. 1 illustrates a schematic structural diagram of an immersive media description framework in some embodiments of the application;
FIG. 2 illustrates a schematic diagram of a scene description file in some embodiments of the application;
FIG. 3 is a schematic diagram of a scene description file in other embodiments of the application;
FIG. 4 is a schematic diagram of a scene description file in other embodiments of the application;
FIG. 5 is a schematic diagram of a scene description file in other embodiments of the application;
FIG. 6 is a schematic diagram of an input data description scheme of a media access function in some embodiments of the application;
FIG. 7 illustrates a flow chart of steps of a scene description file generation method in some embodiments of the application;
FIG. 8 is a flow chart illustrating steps of a scene description file parsing method in some embodiments of the application;
FIG. 9 illustrates a flow chart of steps of a method of rendering a three-dimensional scene in some embodiments of the application;
FIG. 10 is a flow chart illustrating steps of a method of processing scene data of a three-dimensional scene in some embodiments of the application;
FIG. 11 is a flow chart illustrating steps of a cache management method in some embodiments of the application.
Detailed Description
For the purposes of making the objects and embodiments of the present application more apparent, an exemplary embodiment of the present application will be described in detail below with reference to the accompanying drawings in which exemplary embodiments of the present application are illustrated, it being apparent that the exemplary embodiments described are only some, but not all, of the embodiments of the present application.
It should be noted that the brief description of the terminology in the present application is for the purpose of facilitating understanding of the embodiments described below only and is not intended to limit the embodiments of the present application. Unless otherwise indicated, these terms should be construed in their ordinary and customary meaning.
The terms "comprises," "comprising," and "having," and any variations thereof, are intended to cover a non-exclusive inclusion, such that a product or apparatus that comprises a list of elements is not necessarily limited to all elements explicitly listed, but may include other elements not expressly listed or inherent to such product or apparatus.
References in the specification to "some implementations", "some embodiments", etc., indicate that the implementations or embodiments described may include a particular feature, structure, or characteristic, but every embodiment may not necessarily include the particular feature, structure, or characteristic. Moreover, such phrases are not necessarily referring to the same implementation. Further, when a particular feature, structure, or characteristic is described in connection with an embodiment, it is submitted that it is within the knowledge of one skilled in the art to effect such feature, structure, or characteristic in connection with other implementations (whether or not explicitly described herein).
Some embodiments of the application relate to a scene description framework for immersive media. Referring to the scene description framework of immersive media shown in fig. 1, in order to enable the display engine 11 to concentrate on media rendering, the scene description framework of immersive media decouples the access and processing of media files from the rendering of media files, and a media access function (Media Access Function, MAF) 12 is designed to be responsible for the access and processing of media files. At the same time, a media access function application programming interface (Application Programming Interface, API) is designed, and instruction interaction between the display engine 11 and the media access function 12 is performed through the media access function API. The display engine 11 may issue instructions to the media access function 12 via the media access function API, and the media access function 12 may also request instructions from the display engine 11 via the media access function API.
The general workflow of the scene description framework of immersive media includes: 1) The display engine 11 obtains a scene description file (Scene Description Document) provided by the immersive media service. 2) The display engine 11 parses the scene description file, obtains parameters or information such as the access address of the media file, the attribute information of the media file (media type, codec parameters, etc.), and the format requirements for the processed media file, and calls the media access function API to pass all or part of the information obtained by parsing the scene description file to the media access function 12. 3) According to the information delivered by the display engine 11, the media access function 12 requests download of the specified media file from the media resource server or obtains the specified media file locally, establishes a corresponding pipeline (pipeline) for the media file, and then performs decapsulation, decryption, decoding, post-processing and other processing on the media file in the pipeline, so as to convert the media file from its encapsulated format into the format specified by the display engine 11. 4) The pipeline stores the fully processed output data in a designated buffer. 5) Finally, the display engine 11 reads the fully processed data from the specified buffer and renders the media file based on the data read from the buffer.
The files and functional modules involved in the scene description framework of immersive media are further described below.
1. Scene description file
In the workflow of the scene description framework of the immersive media, the scene description file is used to describe the structure of a three-dimensional scene (the characteristics of which can be described by a three-dimensional mesh), textures (such as texture maps, etc.), animations (rotation, translation), camera viewpoint positions (rendering perspectives), etc.
In the related art, GL Transmission Format 2.0 (glTF 2.0) has been determined as a candidate format for the scene description file, which can meet the requirements of Moving Picture Experts Group Immersive (MPEG-I) and six degrees of freedom (6 Degrees of Freedom, 6DoF) applications. glTF 2.0 is described, for example, in the Khronos Group specification "GL Transmission Format (glTF) version 2.0" available from GitHub. Referring to fig. 2, fig. 2 is a schematic diagram of a scene description file in the glTF 2.0 scene description standard (ISO/IEC 12113). As shown in fig. 2, scene description files in the glTF 2.0 scene description standard include, but are not limited to: a scene description module (scene) 201, a node description module (node) 202, a mesh description module (mesh) 203, an accessor description module (accessor) 204, a cache slice description module (bufferView) 205, a buffer description module (buffer) 206, a camera description module (camera) 207, a light description module (light) 208, a material description module (material) 209, a texture description module (texture) 210, a sampler description module (sampler) 211, a texture map description module (image) 212, an animation description module (animation) 213, and a skin description module (skin) 214.
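Illustratively, a minimal glTF 2.0 JSON document organized by these description modules may take roughly the following form (the file name, counts, and byte lengths below are placeholders used only for illustration):

```json
{
  "asset": { "version": "2.0" },
  "scenes": [ { "nodes": [0] } ],
  "nodes": [ { "mesh": 0 } ],
  "meshes": [ { "primitives": [ { "attributes": { "POSITION": 0 } } ] } ],
  "accessors": [ { "bufferView": 0, "componentType": 5126, "count": 3, "type": "VEC3" } ],
  "bufferViews": [ { "buffer": 0, "byteOffset": 0, "byteLength": 36 } ],
  "buffers": [ { "uri": "example.bin", "byteLength": 36 } ]
}
```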
A scene description module (scene) 201 in the scene description file shown in fig. 2 is used to describe a three-dimensional scene contained in the scene description file. Any number of three-dimensional scenes may be contained in a scene description file, each three-dimensional scene being represented by one scene description module 201. Different scene description modules 201 are in a parallel relationship with one another; that is, the three-dimensional scenes they describe are parallel to one another.
The node description module (node) 202 in the scene description file shown in fig. 2 is a next-level description module of the scene description module 201, and is used for describing objects contained in the three-dimensional scene described by the scene description module 201. There may be many specific objects in each three-dimensional scene, such as a digital person (avatar), a nearby three-dimensional object, or a distant background picture, and these objects are described in the scene description file by node description modules 202. Each node description module 202 may represent an object or a set of objects. The relationships between the node description modules 202 reflect the relationships between the components of the three-dimensional scene described by the scene description module 201, and one or more nodes may be included in the scene described by the scene description module 201. The nodes may be in a parallel relationship or a hierarchical relationship; that is, there are relationships between the node description modules 202, which makes it possible to describe a plurality of specific objects either jointly or separately. If one node is contained by another node, the contained node is referred to as a child node, and the syntax element "children" is used for it instead of "node". By flexibly combining nodes and child nodes, a hierarchical node structure can be formed to express rich scene content, as sketched in the example below.
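Illustratively, such a hierarchical node structure may be written as follows, where the parent node references its child nodes by index (the names and indices are illustrative only):

```json
{
  "nodes": [
    { "name": "object_group", "children": [1, 2] },
    { "name": "child_object_A", "mesh": 0 },
    { "name": "child_object_B", "mesh": 1 }
  ]
}
```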
The mesh description module (mesh) 203 in the scene description file shown in fig. 2 is a next-level description module of the node description module 202, and is used for describing the characteristics of the object represented by the node description module 202. The mesh description module 203 is a set of one or more primitives (primitives), each of which may include attributes (attributes) defining the attributes needed by the graphics processor (Graphics Processing Unit, GPU) during rendering. The attributes may include: position (three-dimensional coordinates), normal (normal vector), texcoord_n (texture coordinates), color_n (color: RGB or RGBA), joints_n (attributes associated with the skin description module 214), weights_n (attributes associated with the skin description module 214), and the like. Since the number of vertices included in the mesh description module 203 is very large, and each vertex further carries various attribute information, it is inconvenient to store the large amount of media data contained in the media file directly in the mesh description module 203 of the scene description file. Instead, the access address (URI) of the media file is indicated in the scene description file, and the data in the media file is downloaded only when it is needed, thereby achieving separation of the scene description file and the media file. The mesh description module 203 therefore typically does not store media data, but instead stores the index value of the accessor description module (accessor) 204 corresponding to each attribute, and points to the corresponding data in the buffer slice (bufferView) of the buffer through the accessor description module 204.
In some embodiments, the scene description file and the media file may be fused together to form a binary file, thereby reducing the variety and number of files.
In addition, a mode syntax element may be present in the primitives of the mesh description module 203. The mode syntax element is used to describe the topology used by the graphics processor (GPU) when rendering the three-dimensional mesh, for example mode=0 for points, mode=1 for lines, mode=4 for triangles, and so on.
The value of "position" in the above-mentioned mesh description module 203 is 1, points to the accessor description module with index 1, and finally points to the vertex coordinate data stored in the buffer; the value of "color_0" is 2, pointing to the accessor description module with index 2, and finally to the color data stored in the buffer.
The definition of the syntax elements in the attributes (attributes) of the primitives of the mesh description module 203 is shown in table 1 below:
TABLE 1
Illustratively, the following is an example JSON of the mesh description module 203:
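A representative form of such an example, consistent with the accessor index values discussed in the surrounding text (mode=4 indicating triangle topology), is:

```json
{
  "meshes": [
    {
      "primitives": [
        {
          "attributes": {
            "POSITION": 1,
            "COLOR_0": 2
          },
          "mode": 4
        }
      ]
    }
  ]
}
```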
The above example illustrates the case where the attributes of the primitive of the mesh description module 203 (mesh.primitives.attributes) include the position syntax element (position) and the color syntax element (color_n) from Table 1 above, but exclude the normal vector syntax element (normal), the tangent vector syntax element (tangent), the texture coordinate syntax element (texcoord), the joint syntax element (joints_n), and the weight syntax element (weights_n).
The definition of the accessor types indexed in the attributes of the primitives of the mesh description module 203 (mesh.primitives) is shown in Table 2 below:
TABLE 2
| Accessor type | Number of component channels | Meaning |
| --- | --- | --- |
| SCALAR | 1 | Scalar |
| VEC2 | 2 | Two-dimensional vector |
| VEC3 | 3 | Three-dimensional vector |
| VEC4 | 4 | Four-dimensional vector |
| MAT2 | 4 | Two-dimensional matrix |
| MAT3 | 9 | Three-dimensional matrix |
| MAT4 | 16 | Four-dimensional matrix |
The definition of the component data types in the attributes (attributes) of the primitives of the mesh description module 203 is shown in Table 3 below:
TABLE 3
| Type code | Data type | Signed or unsigned | Number of bits |
| --- | --- | --- | --- |
| 5120 | Signed byte (signed byte) | Signed | 8 |
| 5121 | Unsigned byte (unsigned byte) | Unsigned | 8 |
| 5122 | Signed short integer (signed short) | Signed | 16 |
| 5123 | Unsigned short integer (unsigned short) | Unsigned | 16 |
| 5125 | Unsigned integer (unsigned int) | Unsigned | 32 |
| 5126 | Floating point number (float) | Signed | 32 |
The accessor description module (accessor) 204, the buffer slice description module (bufferView) 205, and the buffer description module (buffer) 206 in the scene description file shown in fig. 2 are used to implement the layer-by-layer refined indexing of the data of the media file by the mesh description module 203. As described above, the mesh description module 203 stores not the specific media data but the index value of the corresponding accessor description module 204, and the specific media data is accessed through the accessor described by the accessor description module 204 indexed by that value. The indexing of media data by the mesh description module 203 proceeds as follows: first, the index value declared by a syntax element in the mesh description module 203 points to the corresponding accessor description module 204; the accessor description module 204 then points to the corresponding buffer slice description module 205; finally, the buffer slice description module 205 points to the corresponding buffer description module 206. The buffer description module 206 is mainly responsible for pointing to the corresponding media file and contains information such as the URI of the media file and its byte length; it describes the buffer used for buffering the media data of the media file. A buffer may be divided into one or more buffer slices, and the buffer slice description module 205 is mainly responsible for partial access to the media data in the buffer, including the start byte offset and the byte length of the accessed data; through the buffer slice description module 205 and the buffer description module 206, partial access to the data of the media file can be achieved. The accessor description module 204 is mainly responsible for adding additional information to the portion of data delimited by the buffer slice description module 205, such as the data type, the number of data items of that type, and the numerical range of data of that type. This three-layer structure makes it possible to take only part of the data from a media file, which is beneficial to accurate data retrieval and also helps reduce the number of media files. An example is sketched below.
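Illustratively, this three-layer index structure may be sketched as follows (the URI, byte lengths, counts, and index values are illustrative only):

```json
{
  "accessors": [
    {
      "bufferView": 0,
      "byteOffset": 0,
      "componentType": 5126,
      "count": 3,
      "type": "VEC3"
    }
  ],
  "bufferViews": [
    { "buffer": 0, "byteOffset": 0, "byteLength": 36 }
  ],
  "buffers": [
    { "uri": "mesh_data.bin", "byteLength": 36 }
  ]
}
```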
A camera description module (camera) 207 in the scene description file shown in fig. 2 is a next-level description module of the node description module 202, and is used for describing information related to visual viewing, such as a viewpoint, a view angle, and the like when a user views an object described by the node description module 202. In order to enable the user to be in the three-dimensional scene and to view the three-dimensional scene, the node description module 202 may also point to the camera description module 207, and describe, through the camera description module 207, information related to visual viewing, such as a viewpoint, a viewing angle, and the like, when the user views the object described by the node description module 202.
An illumination description module (light) 208 in the scene description file shown in fig. 2 is a description module of the next level of the node description module 202, and is used for describing information related to illumination, such as illumination intensity, ambient light color, illumination direction, light source position, and the like, of the object described by the node description module 202.
A material description module (material) 209 in the scene description file shown in fig. 2 is a next-level description module of the mesh description module 203, and is used for describing the material information of the three-dimensional object described by the mesh description module 203. When describing a three-dimensional object, describing only its geometric information through the mesh description module 203, or monotonously defining its colors and/or positions, cannot improve the realism of the three-dimensional object; more information needs to be added to the surface of the three-dimensional object. For conventional three-dimensional model techniques, this process may simply be referred to as texture mapping or adding material. The scene description file in the glTF 2.0 scene description standard also adopts this description module. The material description module 209 defines materials using a set of generic parameters to describe the material information of geometric objects that appear in the three-dimensional scene. The material description module 209 generally uses a metal-roughness model to describe the material of a virtual object, and the material characteristic parameters based on the metal-roughness model are expressed in the widely used physically based rendering (Physically Based Rendering, PBR) form. On this basis, the material description module 209 specifies the metal-roughness material properties of the object, and the definition of the syntax elements in the material description module 209 is shown in Table 4:
TABLE 4
In some embodiments, the definition of the syntax elements in the metal-roughness part (material.pbrMetallicRoughness) of the material description module 209 is shown in Table 5 below:
TABLE 5
The value of each attribute in the metal-roughness part of the material description module 209 may be defined using factors and/or textures (e.g., baseColorTexture and baseColorFactor). If no texture is given, all corresponding texture components in this material model must be assumed to have the value 1.0. If both a factor and a texture are present, the factor value acts as a linear multiplier on the corresponding texture values. Texture bindings are defined by the index of the texture object and, optionally, a texture coordinate index.
Illustratively, the following is an example JSON of the material description module 209:
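A representative form consistent with the parsing description below is:

```json
{
  "materials": [
    {
      "name": "gold",
      "pbrMetallicRoughness": {
        "baseColorFactor": [1.000, 0.766, 0.336, 1.0],
        "metallicFactor": 1.0,
        "roughnessFactor": 0.0
      }
    }
  ]
}
```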
Parsing the material description module 209: the current material is named "gold" by the material name syntax element and its value ("name": "gold"); the base color of the current material is determined by the base color syntax element and its value ("baseColorFactor": [1.000, 0.766, 0.336, 1.0]) under the pbrMetallicRoughness array; the metallic value of the current material is determined to be 1.0 by the metallic syntax element and its value ("metallicFactor": 1.0) under the pbrMetallicRoughness array; and the roughness value of the current material is determined to be 0.0 by the roughness syntax element and its value ("roughnessFactor": 0.0) under the pbrMetallicRoughness array.
A texture description module (texture) 210 in the scene description file shown in fig. 2 is a next-level description module of the material description module 209, and is used for describing the color of the three-dimensional object described by the material description module 209 and other characteristics used in the material definition. Texture is an important aspect of giving an object a realistic appearance. The main color of the object, as well as other characteristics used in the material definition, may be defined by textures in order to accurately describe the appearance of the rendered object. The material itself may define several texture objects, which may be used as textures of virtual objects during rendering and may be used to encode different material properties. The texture description module 210 references a sampler description module (sampler) 211 and a texture map description module (image) 212 by index through its sampler syntax element and texture map syntax element. The texture map description module 212 contains a uniform resource identifier (Uniform Resource Identifier, URI) that links to the texture map or binary package actually used by the texture description module 210. The sampler description module 211 is used to describe the filtering and wrapping modes of the texture. The responsibilities and cooperative relationships of the material description module 209, the texture description module 210, the sampler description module 211, and the texture map description module 212 are as follows: the material description module 209, together with the texture description module 210, defines the color and physical information of the object surface; the sampler description module 211 defines how the texture map is attached to the object surface; the texture description module 210 designates a sampler description module 211 and a texture map description module 212, wherein the texture map description module 212 implements the addition of the texture, using a URI for identification and indexing and using the accessor description module 204 to access the data; the sampler description module 211 then implements the specific adjustment and wrapping of the texture. The definition of the syntax elements in the texture description module 210 is shown in Table 6 below:
TABLE 6
In some embodiments, the definition of the syntax elements in the sampler part (texture.sampler) of the texture description module 210 is shown in Table 7 below:
TABLE 7
Illustratively, the following is an example JSON of the material description module 209, the texture description module 210, the sampler description module 211, and the texture map description module 212:
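A representative sketch is given below; the sampler constants follow the common glTF conventions (9729 = LINEAR, 9987 = LINEAR_MIPMAP_LINEAR, 10497 = REPEAT), and the texture map file name is a placeholder:

```json
{
  "materials": [
    {
      "pbrMetallicRoughness": {
        "baseColorTexture": { "index": 0 }
      }
    }
  ],
  "textures": [
    { "sampler": 0, "source": 0 }
  ],
  "samplers": [
    { "magFilter": 9729, "minFilter": 9987, "wrapS": 10497, "wrapT": 10497 }
  ],
  "images": [
    { "uri": "texture_map.png" }
  ]
}
```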
An animation description module (animation) 213 in the scene description file shown in fig. 2 is a next-level description module of the node description module 202, and is used to describe animation information added to the object described by the node description module 202. In order that the object represented by the node description module 202 is not limited to a stationary state, animation may be added to the object described by the node description module 202, so that the description level of the animation description module 213 in the scene description file is specified by the node description module 202, that is, the animation description module 213 is the next-level description module of the node description module 202, and the animation description module 213 also has a corresponding relationship with the grid description module 203. The animation description module 213 may describe the animation by means of position movement, angle rotation, size scaling, etc., and may specify the start and end times of the animation and the implementation of the animation. For example, an animation is added to a grid description module 203 representing a three-dimensional object, so that the three-dimensional object represented by the grid description module 203 can complete a specified animation process through fusion of position movement, angle rotation and size scaling within a specified time window.
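Illustratively, an animation description module that rotates the object represented by a node over a specified time window may be sketched as follows (the accessor and node indices are illustrative); the input accessor would hold the keyframe timestamps and the output accessor the corresponding rotation values:

```json
{
  "animations": [
    {
      "channels": [
        { "sampler": 0, "target": { "node": 0, "path": "rotation" } }
      ],
      "samplers": [
        { "input": 3, "interpolation": "LINEAR", "output": 4 }
      ]
    }
  ]
}
```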
The skin description module (skin) 214 in the scene description file shown in fig. 2 is a next-level description module of the node description module 202, and is used for describing the kinematic cooperation between the skeleton added to the node described by the node description module 202 and the mesh carrying the surface information of the object. When the node described by the node description module 202 represents an object with a large freedom of movement, such as a person, an animal, or a machine, a skeleton can be filled into the interior of the object to improve its motion performance, and the three-dimensional mesh representing the surface information of the object then conceptually becomes a skin. This description level of the skin description module 214 is specified by the node description module 202, that is, the skin description module 214 is a next-level description module of the node description module 202, and the skin description module 214 has a correspondence with the mesh description module 203. The skeleton moves to drive the mesh on the object surface to move; combined with a bionic design, a more realistic motion effect can be achieved. For example, when a person makes a fist with a hand, the skin on the surface changes with the stretching and covering of the internal skeleton; by pre-filling the skeleton into the hand model and defining the cooperative relationship between the skeleton and the skin, a realistic simulation of the action can be achieved. An example is sketched below.
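Illustratively, a skin description module that binds a two-joint skeleton to a skinned mesh may be sketched as follows (node and accessor indices are illustrative); the vertices of the corresponding mesh would additionally carry joints_n and weights_n attributes, as noted above:

```json
{
  "nodes": [
    { "mesh": 0, "skin": 0 },
    { "name": "hand_root_joint", "children": [2] },
    { "name": "finger_joint" }
  ],
  "skins": [
    { "inverseBindMatrices": 5, "skeleton": 1, "joints": [1, 2] }
  ]
}
```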
The description modules of the scene description file in the glTF 2.0 scene description standard only have the most basic capability of describing three-dimensional objects; they cannot support dynamic three-dimensional immersive media, do not support audio files, and do not support scene updates, among other limitations. glTF also specifies that an optional extension object property (extensions) exists under each object property, allowing extensions to be used in any part of the file to achieve more sophisticated functionality. The scene description module (scene), node description module (node), mesh description module (mesh), accessor description module (accessor), buffer description module (buffer), animation description module (animation), and the like, and the syntax elements defined within them, all have the extension object property, so that certain functional extensions are supported on the basis of glTF 2.0.
The current mainstream immersive media mainly include point cloud (Point Cloud), three-dimensional mesh (Mesh), 6DoF panoramic video, MIV, and the like. In a three-dimensional scene, multiple types of immersive media often exist simultaneously. This requires the rendering engine to support the codecs of many different types of immersive media, resulting in different types of rendering engines depending on the types and number of codecs supported. In order to realize cross-platform description of three-dimensional scenes composed of different types of media, the Moving Picture Experts Group (MPEG) initiated the formulation of the MPEG scene description standard, with standard number ISO/IEC 23090-14. The standard mainly solves the problem of cross-platform description of MPEG media (including MPEG codecs, MPEG file formats, and MPEG transport mechanisms) in three-dimensional scenes.
The MPEG #128 meeting resolved to formulate the MPEG-I Scene Description standard on the basis of glTF 2.0 (ISO/IEC 12113). The first version of the MPEG scene description standard has now been formulated and is in the FDIS voting stage. On the basis of the first edition, corresponding extensions are being added to the MPEG scene description standard to address requirements not yet realized in the cross-platform description of three-dimensional scenes, including interactivity, AR anchoring, user representation / digital person (avatar), haptic support, and support for immersive media codecs through extensions.
The first version of the MPEG scene description standard has been made mainly by:
1) The MPEG scene description standard defines a scene description file format for describing an immersive three-dimensional scene that incorporates and extends on the basis of the original gltf2.0 (ISO/IEC 12113) content.
2) The MPEG scene description defines a scene description framework and application programming interfaces (Application Programming Interface, API) for cooperation among the modules, realizes the decoupling of the acquisition and processing of immersive media from the media rendering process, and is beneficial to optimizations such as adapting immersive media to different network conditions, partially acquiring immersive media files, accessing different levels of detail of immersive media, and adjusting content quality. Decoupling the acquisition and processing of immersive media from the immersive media rendering process is the key to achieving cross-platform description of three-dimensional scenes.
3) The MPEG scene description proposes a series of extensions based on the ISO Base Media File Format (ISOBMFF) (ISO/IEC 14496-12) for transmitting immersive media content.
Referring to fig. 3, on the basis of the scene description file in the gltf2.0 scene description standard shown in fig. 2, the extensions of the scene description file in the MPEG scene description standard can be divided into two groups:
The first set of extensions includes: MPEG media (MPEG_media) 301, MPEG time-varying accessor (MPEG_accessor_timed) 302, and MPEG circular buffer (MPEG_buffer_circular) 303. The MPEG media 301 is an independent extension for referencing external media sources; the MPEG time-varying accessor 302 is an accessor-level extension for accessing time-varying media; and the MPEG circular buffer 303 is a buffer-level extension for supporting circular buffers. The first set of extensions provides the basic description and format of the media in a scene, meeting the basic requirements of describing time-varying immersive media in the scene description framework; a brief sketch of how these extensions may appear in a scene description file is given after this list.
The second set of extensions includes: MPEG dynamic scene (MPEG_scene_dynamic) 304, MPEG video texture (MPEG_texture_video) 305, MPEG spatial audio (MPEG_audio_spatial) 306, MPEG recommended viewport (MPEG_viewport_recommended) 307, MPEG mesh linking (MPEG_mesh_linking) 308, and MPEG animation timing (MPEG_animation_timing) 309. MPEG_scene_dynamic 304 is a scene-level extension for supporting dynamic scene updates; MPEG_texture_video 305 is a texture-level extension for supporting textures in video form; MPEG_audio_spatial 306 is a node-level and camera-level extension for supporting spatial 3D audio; MPEG_viewport_recommended 307 is a scene-level extension for supporting the description of recommended viewports in two-dimensional display; MPEG_mesh_linking 308 is a mesh-level extension for supporting linking two meshes and providing mapping information; and MPEG_animation_timing 309 is a scene-level extension for supporting control of the animation timeline.
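Illustratively, the first set of extensions may appear in a scene description file roughly as follows. The field names shown inside the extensions (count, media, tracks, immutable, suggestedUpdateRate) are assumed here for illustration only; the authoritative syntax is defined in ISO/IEC 23090-14:

```json
{
  "buffers": [
    {
      "byteLength": 1048576,
      "extensions": {
        "MPEG_buffer_circular": { "count": 5, "media": 0, "tracks": [0] }
      }
    }
  ],
  "accessors": [
    {
      "componentType": 5126,
      "type": "VEC3",
      "count": 1024,
      "extensions": {
        "MPEG_accessor_timed": {
          "bufferView": 0,
          "immutable": false,
          "suggestedUpdateRate": 30.0
        }
      }
    }
  ]
}
```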
The respective expansion portions described above will be described in detail below:
The MPEG media extension in the MPEG scene description file is used to describe the types of the media files and to give the necessary description of MPEG-type media files so that these media files can be retrieved later. The definition of the first-level syntax elements of the MPEG media extension is shown in Table 8 below:
TABLE 8
The definition of the syntax elements in the media list (MPEG_media.media) of the MPEG media extension is shown in Table 9 below:
TABLE 9
The definition of the syntax elements in the alternatives (MPEG_media.alternatives) in the media list of the MPEG media extension is shown in Table 10 below:
TABLE 10
The definition of the syntax elements in the track array (MPEG_media.alternatives.tracks) in the alternatives of the media list of the MPEG media extension is shown in Table 11 below:
TABLE 11
For example, based on the definitions in Tables 8 to 11 above, the MPEG media extension in a scene description file may be as follows:
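A sketch of such an MPEG media extension, with a placeholder name and URI, may look roughly as follows (the exact field set is the one defined by Tables 8 to 11):

```json
{
  "extensions": {
    "MPEG_media": {
      "media": [
        {
          "name": "example_dynamic_mesh",
          "alternatives": [
            {
              "mimeType": "video/mp4",
              "uri": "https://example.com/media/example.mp4",
              "tracks": [ { "track": "#track_ID=1" } ]
            }
          ]
        }
      ]
    }
  }
}
```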
Furthermore, based on ISOBMFF (ISO/IEC 14496-12), ISO/IEC 23090-14 also defines transport formats for the delivery of scene description files and for data delivery related to the glTF 2.0 extensions. To facilitate delivery of scene description files to clients, ISO/IEC 23090-14 defines how glTF files and related data are packaged in ISOBMFF files as non-time-varying and time-varying data (e.g., as track samples). MPEG_scene_dynamic, MPEG_mesh_linking, and MPEG_animation_timing provide the display engine with specific forms of time-varying data, and the display engine 11 should perform corresponding operations according to this change information. The format of each extension's time-varying data, as well as the manner in which it is packaged in the ISOBMFF file, is also defined by ISO/IEC 23090-14. The MPEG media extension (MPEG_media) allows referencing external media streams delivered via protocols such as RTP/SRTP and MPEG-DASH. To allow addressing of media streams without knowledge of the actual protocol scheme, hostname, or port values, a new URL scheme is defined by ISO/IEC 23090-14. This scheme requires that there be a stream identifier in the query part, but it does not mandate a specific type of identifier, allowing the use of the media stream identification scheme (RFC 5888), the tagging scheme (RFC 4575), or 0-based index schemes.
2. Display engine
Referring to fig. 1, in the workflow of the scene description framework of immersive media, the main functions of the display engine 11 are to obtain the scene description file, parse the obtained scene description file to obtain the composition structure of the three-dimensional scene to be rendered and the detailed information in the three-dimensional scene to be rendered, and render and display the three-dimensional scene to be rendered according to the information obtained by parsing the scene description file. The embodiments of the application do not limit the specific workflow and principle of the display engine 11, as long as the display engine 11 can parse the scene description file, issue instructions to the media access function 12 through the media access function API, issue instructions to the cache management module 13 through the cache API, and take the processed data from the cache to complete the rendering and display of the three-dimensional scene and the objects therein.
3. Media access function API and media access function
In the workflow of the scene description framework of immersive media, the display engine 11 obtains the method for rendering the three-dimensional scene by parsing the scene description file. The display engine 11 then needs to pass this rendering method to the media access function 12, or to send instructions to the media access function 12 based on this rendering method; both passing the rendering method and sending the instructions based on it are carried out through the media access function API.
In some embodiments, display engine 11 may send media access instructions or media data processing instructions to media access function 12 through a media access function API. The instructions issued by the display engine 11 are derived from the parsing result of the scene description file, and may include the address of the media file, the attribute information of the media file (media type, codec used, etc.), the format requirements for the processed media data and other media data, etc.
In some embodiments, media access function 12 may also request media access instructions or media data processing instructions from display engine 11 through a media access function API.
In the workflow of the scene description framework of immersive media, the media access function 12 may receive instructions from the display engine 11 and perform the access and processing of media files according to the instructions sent by the display engine 11. Specifically, after a media file is acquired, the media file is processed. Since the processing of different types of media files differs greatly, in order to support a wide range of media types while also taking the working efficiency of the media access function into account, various pipelines are designed in the media access function, and only the pipelines matching the media type are started during processing.
The input of a pipeline is the media file downloaded from the server or read from local storage. Such media files often have a relatively complex structure and cannot be used directly by the display engine 11, so the main function of the pipeline is to process the data of the media file so that it meets the requirements of the display engine 11.
After the media access function completes the processing of the data of the media file, the processed data needs to be delivered to the display engine in a standard arrangement structure, which needs to correctly store the processed data in the buffer, and the buffer management module 13 does this, but the buffer management module needs to acquire the buffer management instruction from the media access function or the display engine through the buffer API.
In some embodiments, the media access function may send cache management instructions to the cache management module through a cache API. In this case, the cache management instruction is one that the display engine 11 first sends to the media access function 12 through the media access function API.
4. Cache API and cache management module
In the workflow of the scene description framework of immersive media, the media data processed by a pipeline needs to be delivered to the display engine 11 in a standard arrangement structure, which requires the participation of the buffer API and the buffer management module 13: they create the corresponding buffers according to the format of the processed media data and are responsible for the subsequent management of the buffers, such as updating and releasing. The cache management module 13 may communicate with the media access function 12 through the cache API, and may also communicate with the display engine 11 through the cache API; the purpose of communicating with the display engine 11 and/or the media access function 12 is to implement cache management. When the cache management module 13 communicates with the media access function 12, the display engine 11 first sends the relevant cache management instruction to the media access function 12 through the media access function API, and the media access function 12 then forwards it to the cache management module 13 through the cache API. When the cache management module 13 communicates with the display engine 11, the display engine 11 only needs to send the cache management description information parsed from the scene description file directly to the cache management module 13 through the cache API.
With the rapid development of immersive media technologies, working modes that combine office work with online collaboration are accepted by more and more people, and immersive media technologies are further popularized in people's work and life. Compared with improving immersion by means of a large display screen, Virtual Reality (VR)/Augmented Reality (AR) devices have the design advantage of fitting the physiological structure of the human body and can present vivid, realistic scenes through binocular stereoscopic vision, so it is highly probable that VR/AR devices will serve as the hardware basis of immersive media in the future. A very important part of VR/AR technology is the user representation, or simply the digital person (avatar). When wearing a VR/AR device, the user hopes that his or her real image or virtual avatar can appear in the virtual scene presented by the head-mounted device and move in that scene synchronously with the user's movement in the real world; this is an important capability of the digital person and an important link in improving the sense of immersion. The movements of a digital person following a real person can be divided into two types: limb movements such as walking, turning and picking up, and facial expression movements such as smiling, frowning and staring. Limb movement can be realized by methods such as motion capture and inverse kinematics, and its requirement on fineness is relatively low. Facial expression movement can be achieved by means of computer vision, machine learning techniques and related algorithms, and its requirements on precision and complexity are far greater than those of limb movement, so digital human facial expression movement has become a research hotspot of immersive media.
Based on the above technical development trend and market demand, importing/creating and reconstructing/driving digital persons in a virtual three-dimensional scene is an important immersive media technology. However, digital person technology is still at a stage where each technology company adheres to its own technical route, making interworking difficult, and international standards related to digital person technology are still lacking. In response to this problem, many standards development organizations and forums are carrying out related work; the third working group (WG 03) of the Moving Picture Experts Group (MPEG) of the international organization for standardization (ISO/IEC) is formulating the scene description international standard (ISO/IEC 23090-14), and how to support digital persons in a scene is one of the sub-directions of that standard. At present, the digital person research direction of the standard has produced some basic draft text, but for the support of dynamic digital human facial expressions only a target technical route has been selected, and a method for integrating that technical route into the scene description standard is still lacking. This need can be met by incorporating techniques for reconstructing dynamic digital human facial expressions into the scene description standard architecture. It is therefore worthwhile to study how the scene description standard can support dynamic facial expressions of digital persons.
A digital person (avatar) is a realistic three-dimensional human model: a virtual humanoid figure produced by computer graphics techniques that has no real-world body. Facial expression mapping of a digital person is a technique for matching the facial expression of the digital person with that of a human; by recognizing the facial expression of a human and mapping it to the face of the digital person, it can make the digital person more realistic and vivid. In its early stage, facial expression mapping could only produce static digital human facial expressions, but as the technology has advanced it has gradually become able to produce dynamic digital human facial expressions. A mainstream method is to design a static three-dimensional model of the digital human face in advance, capture and recognize the facial expression of a real human in real time, and map the motion information of that facial expression onto the static three-dimensional model of the digital human face so that it becomes a dynamic three-dimensional model. This technology can be applied to fields such as games, movies and virtual reality, and can improve users' quality of experience and sense of immersion.
At present, in the related technical field, methods that reconstruct the dynamic facial expression of a digital person using facial semantic landmarks are relatively prominent. The general flow of the facial semantic landmark method comprises the following steps (1) to (4):
(1) A neutral digital human face three-dimensional model is predefined and a set of facial semantic landmark points are predefined on the model.
The neutral digital human face three-dimensional model can be a static three-dimensional model, and its topological structure can be a three-dimensional mesh, a three-dimensional point cloud, or the like. The facial semantic landmark points generally cover spatial positions, such as the eyes, eyebrows, mouth, nose and lower jaw, that change abundantly and have a large influence on facial expression. Currently, the total number of facial semantic landmark points is not uniformly specified, but 68 points are commonly taken; 21, 29, 98, 106 or 186 points may also be taken. The larger the total number of facial semantic landmark points, the more accurate the facial semantic marking, but the larger the corresponding data volume and computation; therefore, in actual use the number of facial semantic landmark points can be set according to the precision requirement on the facial three-dimensional model and the performance of the related devices.
(2) Input data of a target moment is obtained, and a group of motion vectors of facial semantic mark points are obtained according to the input data.
The types of input data and the ways of processing them are quite varied. For example, when the input data is a depth video shot by an RGBD camera, the motion vectors of the facial semantic landmark points may be obtained as follows: the facial semantic landmark points are located by computer vision and neural network methods, for example a facial semantic landmark extraction network is used to obtain the spatial coordinates of the landmark points, and the motion vectors of the facial semantic landmark points at the target moment are then obtained by differencing these coordinates with the spatial coordinates of the landmark points at the previous moment. When the input data is text representing the emotion of a digital person, the motion vectors of the facial semantic landmark points may be obtained as follows: using a facial expression blending model, a vivid complex expression is first obtained by weighted superposition of basic expressions, and facial semantic landmark points are then located and their motion vectors computed on the resulting complex face.
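Written explicitly for the depth-video case, with s_i(t) denoting the spatial coordinates of the i-th facial semantic landmark point extracted at moment t (notation introduced here only for illustration), the motion vector at the target moment is obtained by differencing with the previous moment:

    v_i(t) = s_i(t) - s_i(t-1)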
(3) And obtaining the motion vectors of all vertexes in the digital human face three-dimensional model by using the motion vectors of the group of facial semantic mark points.
In terms of data volume and number of points, the facial semantic landmark points correspond to sparse data, while all vertices of the static three-dimensional model of the digital human face correspond to dense data, so the mapping of motion vectors from the facial semantic landmark points to all vertices of the face model is a mapping from sparse data to dense data. This step may use computer vision and neural network methods, for example a motion diffusion network, to map the sparse motion vectors to dense motion vectors.
(4) And superposing the motion vectors of all vertexes on a neutral digital human face three-dimensional model to obtain the data of the digital human dynamic facial expression at the target moment.
The process of superposing the motion vectors of all vertices on the neutral three-dimensional model of the digital human face differs according to the topological structure of the static model. When the topological structure is a three-dimensional point cloud, the motion vector of each point is summed directly with the spatial coordinates of the corresponding point in the point cloud, and the resulting new spatial coordinates are the superposition result; in addition, the attribute information attached to the point cloud must be carried over point by point, that is, the spatial coordinates of the points change while their attribute information remains unchanged. When the topological structure is a three-dimensional mesh, the motion vector of each point is summed directly with the spatial coordinates of the corresponding vertex in the mesh, and the resulting new spatial coordinates are the superposition result; in addition, the attribute information attached to the mesh must be carried over vertex by vertex, the attribute information remains unchanged, and the connection relationships between vertices remain unchanged, i.e. vertices that were connected still have the same connection relationship after the coordinate update.
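Written compactly, with x_i denoting the spatial coordinates of the i-th vertex (or point) of the neutral digital human face three-dimensional model and v_i(t) its motion vector at the target moment t (notation introduced here only for illustration), the superposition in step (4) is:

    x_i(t) = x_i + v_i(t),   i = 1, ..., N

with the attached attribute information carried over unchanged and, in the mesh case, the connection relationships between vertices preserved.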
The data of the dynamic facial expression of the digital person at the target moment can be obtained through steps (1) to (4); by repeating this process continuously, the dynamic facial expression data of the digital person at every moment can be obtained, and the neutral three-dimensional model of the digital human face, driven by the data at different moments, reconstructs and maps the dynamic facial expression of the digital person.
In some embodiments, after step (4) above, the dynamic facial expression of the neutral digital person may also be mapped to a stylized digital person's face, making the visual look more realistic.
In the scene description framework of the embodiment of the application, a three-dimensional scene service side needs to provide a scene description file to a display engine, and the display engine analyzes the scene description file. The scene description file includes, but is not limited to: the method comprises the following steps of describing information of the whole three-dimensional scene, describing information of digital people in the three-dimensional scene, describing information of digital people component combinations used for supporting different representation schemes in the three-dimensional scene and the like.
Some embodiments of the present application further provide a method for supporting digital person component combinations of different representation schemes in a scene description framework, and the following describes a scene description file, a media access function API, a media access function, a cache API, and cache management of the digital person component combinations supporting the different representation schemes respectively.
1. Scene description file supporting digital person component combinations of different representation schemes
Referring to fig. 4, fig. 4 is a schematic structural diagram of a scene description file supporting a digital person component combination with different representation schemes according to an embodiment of the present application. As shown in fig. 4, the scene description file supporting the digital person component combination of different representation schemes includes, but is not limited to, the following modules: MPEG media (mpeg_media) 401, scene description module (scene) 402, node description module (node) 403, mesh description module (mesh) 404, accessor description module (accessor) 405, cache slice description module (bufferView) 406, buffer description module (buffer) 407, camera description module (camera) 408, material description module (material) 409, texture description module (texture) 410, sampler description module (sampler) 411, texture map description module (image) 412, skin description module (skin) 413, and animation description module (animation) 414. Wherein the node description module (node) 403 includes: an array of digital human nodes (mpeg_node_avatar) 4031, the array of digital human nodes comprising: an active identification syntax element (isAvatar) 40311, a digital person type syntax element (type) 40312, and a mapping list (mappings) 40313, the mapping list (mappings) 40313 comprising: a part name syntax element (path) 403131 and a node index syntax element (node) 403132. The following describes each part in the scene description file shown in fig. 4.
MPEG media (MPEG media) 401 in a scene description file supporting a combination of digital person components of different presentation schemes is used to describe the types of media files and to make necessary description of the MPEG type media files for subsequent retrieval of those MPEG type media files.
A scene description module (scene) 402 in the scene description file supporting a combination of digital person components of different representation schemes is used to describe the three-dimensional scene contained in the scene description file. Any number of three-dimensional scenes may be contained in a scene description file, each three-dimensional scene being represented using a scene description module.
According to a second standard revision of the scene description standard, a digital person in a three-dimensional scene will be represented by a node, described by a corresponding node description module (node), and a plurality of hierarchical child nodes (child) are mounted under the node, and specific parts of the digital person's body are described layer by using the node description module corresponding to the child nodes, and the deepest unit of the upper body can be each finger knuckle by a hierarchical structure. For example: the hip is used as a node to represent a digital person (layer 1), and the node representing the digital person comprises three child nodes (layer 2) of a spine, a left thigh and a right thigh; selecting the subnodes representing the spine to extend deeply, wherein the subnodes representing the spine comprise subnodes representing the chest (layer 3); selecting a child node representing the chest to extend deeply, wherein the child node representing the chest comprises a child node (layer 4) representing the upper chest; selecting a child node representing the upper chest to extend deeply, wherein the child node representing the upper chest comprises three child nodes (layer 5) representing left shoulders, right shoulders and neck; the child node representing the neck is selected to continue to extend deep, which includes the child node representing the head (layer 6).
In some embodiments, the digital human body part hierarchy standard names may be as follows:
in a hierarchy of nodes, sub-nodes describing specific parts of the digital human body, one or more three-dimensional grids may be mounted under each node and sub-node as a three-dimensional model of the body parts. In addition, by means of the displacement, rotation, expansion and contraction parameters of the nodes and the child nodes, the three-dimensional models of the body parts can be truly spliced together to form a complete digital human three-dimensional model.
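As an illustration of the hierarchy described above, the node list for the upper layers of a digital person might be sketched in glTF style as follows; the names and index values are assumptions for the example only, not standard part names:

    {
      "nodes": [
        { "name": "hips", "children": [1, 2, 3] },
        { "name": "spine", "children": [4] },
        { "name": "left_thigh" },
        { "name": "right_thigh" },
        { "name": "chest", "children": [5] },
        { "name": "upper_chest", "children": [6, 7, 8] },
        { "name": "left_shoulder" },
        { "name": "right_shoulder" },
        { "name": "neck", "children": [9] },
        { "name": "head", "mesh": 0 }
      ]
    }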
The digital person node array (mpeg_node_avatar) in the node description module (node) is used to indicate whether the node corresponding to the node description module (node) is used to represent a digital person.
In some embodiments, indicating whether the node corresponding to the node description module (node) is used to represent a digital person by the digital person node array (mpeg_node_avatar) includes: when a certain node description module of the scene description file includes the digital person node array (mpeg_node_avatar), it is determined that the node corresponding to the node description module represents a digital person.
It should be noted that, in the scene description file, a digital person uses a node to represent, multiple levels of subnodes can be mounted under the node, and specific parts of the body of the digital person are described layer by using node description modules corresponding to the subnodes.
In some embodiments, when a node represents a digital person, an array of digital person nodes (mpeg_node_avatar) may be added to an extended list (extensions) of the node description module to which the node corresponds.
An active identification syntax element (isAvatar) in a digital person node array (mpeg_node_avatar) of a node description module (node) is used to indicate whether a digital person represented by a node corresponding to the node description module (node) is active (whether the digital person needs to be rendered in a rendering process, and whether a display engine, a media access function, etc. in a scene description box need to process related data and information from the digital person).
In some embodiments, indicating, by the active identification syntax element (isAvatar), whether the digital person represented by the node corresponding to the node description module is active includes: when the value of the active identification syntax element (isAvatar) is a first value, determining that the digital person represented by the node corresponding to the node description module is active, and when the value of the active identification syntax element (isAvatar) is a second value, determining that the digital person represented by the node corresponding to the node description module is inactive.
In some embodiments, the first value and the second value may be true and false, respectively.
In other embodiments, the first and second values may be 1 and 0, respectively.
In some embodiments, the data type of the value of the active identification syntax element (isAvatar) may be boolean (boolean).
A digital person type syntax element (type) in a digital person node array (mpeg_node_avatar) of the node description module (node) is used to indicate a scheme for reconstructing and driving digital persons represented by the nodes corresponding to the node description module (node).
In some embodiments, the digital person type syntax element (type) describes the reconstruction and driving scheme of the digital person using a Uniform Resource Name (URN). The reconstruction and driving scheme of the digital person can be understood as the technical scheme/technical architecture to which the digital person belongs, and it defines the media formats/types of all body components of the digital person; with this information, the media access function can establish a corresponding pipeline for the reconstruction/driving of the corresponding body components of the digital person, or establish a corresponding pipeline for the reconstruction/driving of the whole body of the digital person. For example, when the digital person represented by the node corresponding to the node description module (node) is reconstructed and driven in the MPEG reference digital person format, the value of the digital person type syntax element (type) may be set to the URN of the MPEG reference digital person. The URN of the MPEG reference digital person is urn:mpeg:sd:2023:avatar, so in this case the digital person type syntax element and its value are: "type": "urn:mpeg:sd:2023:avatar".
In some embodiments, the data type of the value of the digital human type syntax element (type) may be a string (string).
The mapping list (mappings) in the digital person node array (mpeg_node_avatar) of the node description module (node) describes the mapping between digital person child nodes and the hierarchical standard names of digital human body parts. The data type of the mapping list (mappings) is an array whose entries contain two syntax elements: a part name syntax element (path) and a node index syntax element (node). The part name syntax element (path) describes the standard name of the digital human body part represented by the node/child node, such as "/full_body/upper_body/head/face/eye_right"; the node index syntax element (node) indexes the node describing the body part identified by the part name syntax element (path), so that the body part name corresponds to that node.
In summary, the syntax elements in the digital human node array (mpeg_node_avatar) of the node description module (node) are shown in table 12 below:
Table 12
Syntax elements in the mapping list (mappings) in the digital person node array (mpeg_node_avatar) of the node description module (node) are shown in table 13 below:
TABLE 13
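A minimal sketch of a node description module carrying the digital person node array might look as follows; the index values and the part path are illustrative assumptions, and the extension identifier is written here in its glTF form MPEG_node_avatar:

    {
      "nodes": [
        {
          "name": "avatar_root",
          "children": [1],
          "extensions": {
            "MPEG_node_avatar": {
              "isAvatar": true,
              "type": "urn:mpeg:sd:2023:avatar",
              "mappings": [
                { "path": "/full_body/upper_body/head/face", "node": 1 }
              ]
            }
          }
        },
        { "name": "face", "mesh": 0 }
      ]
    }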
In a scenario where all components of a digital person adopt the same representation scheme, that representation scheme can provide comprehensive description information. The disadvantage of adopting a single digital person representation scheme, however, is that all components of the digital person must follow the technical architecture of that scheme; when different components of the digital person are implemented with different technical types/media formats, such a representation scheme cannot describe the digital person accurately, i.e. it does not support flexible combination and compatibility of digital person technologies. In view of this, the embodiment of the application further provides the following two technical schemes:
Scheme one,
The values of the digital person type syntax element (type) in the digital person node array (mpeg_node_avatar) of the node description module (node) are extended so that some components of the digital person need not follow the overall digital person representation scheme; instead, the representation scheme of those components can be additionally defined. For example, when the digital person type syntax element (type) and its value are "type": "urn:mpeg:sd:2023:avatar:face-landMark", the representation of the facial component of the digital person should employ the facial semantic landmark method named "landMark", while the representation of the components of the digital person other than the face is the MPEG reference digital person. That is, by modifying the definition of the digital person type syntax element (type) in the digital person node array (mpeg_node_avatar) of the node description module (node), flexible combination and compatibility of digital person technologies can be supported.
Based on the above scheme one, syntax elements in the digital human node array (mpeg_node_avatar) of the node description module (node) are shown in the following table 14:
TABLE 14
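Under scheme one, only the value of the type syntax element changes relative to the sketch given above; the digital person node array might then look as follows (illustrative, with the same assumed mapping):

    "MPEG_node_avatar": {
      "isAvatar": true,
      "type": "urn:mpeg:sd:2023:avatar:face-landMark",
      "mappings": [
        { "path": "/full_body/upper_body/head/face", "node": 1 }
      ]
    }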
Scheme two,
Referring to fig. 5, a schematic diagram of another scenario description file of a digital person component combination supporting different representation schemes according to an embodiment of the present application is provided. On the basis of the scene description file shown in fig. 4, the scene description file shown in fig. 5 further includes: component type syntax element 403133. The function of the other modules in the scene description file shown in fig. 5 is similar to that in fig. 4, and a newly extended component type syntax element 403133 in the mapping list (mappings) 40313 of the digital person node array (mpeg_node_avatar) 4031 of the node description module (node) 403 is explained below.
The syntax elements in the mapping list (mappings) of the digital person node array (mpeg_node_avatar) of the node description module (node) are extended: a new syntax element is added to the mapping list (mappings), and when this extended syntax element is included in the mapping array corresponding to a certain component of the digital person in the mapping list (mappings), the representation scheme of that component should follow the value of the extended syntax element.
In some embodiments, the extended syntax element may be a component type syntax element. For example: when the digital person type syntax element (type) takes the value "urn:mpeg:sd:2023:avatar" and the component type syntax element in the mapping array describing the digital person's facial component in the mapping list (mappings) takes the value "urn:landMark", the representation scheme of the digital person's facial component should follow the latter, i.e. the representation scheme using the facial semantic landmark method named "landMark".
In some embodiments, the component type syntax element may be subType or localType or partType. The embodiment of the application is not limited to this, and the value of the component type syntax element can represent the representation scheme of the corresponding digital human component. The component type syntax element subType is described below as an example.
In some embodiments, the data type of the value of the component type syntax element (subType) is a string (string).
Based on scheme two above, the syntax elements in the mapping list (mappings) of the digital person node array (mpeg_node_avatar) of the node description module (node) are shown in table 15 below:
TABLE 15
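Under scheme two, the overall type remains the MPEG reference digital person and the face entry in the mapping list carries the extended component type syntax element; a sketch with illustrative values, using subType as the component type syntax element, might be:

    "MPEG_node_avatar": {
      "isAvatar": true,
      "type": "urn:mpeg:sd:2023:avatar",
      "mappings": [
        {
          "path": "/full_body/upper_body/head/face",
          "node": 1,
          "subType": "urn:landMark"
        }
      ]
    }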
A method of describing the mesh description module (mesh) 404 in a scene description file supporting a combination of digital person components of different representation schemes includes: for nodes representing three-dimensional objects, the scene description file describes them using the data format of a three-dimensional mesh, i.e. using the mesh description module (mesh) 404. The mesh description module carries various specific information, such as three-dimensional coordinates and colors, all of which are subordinate to syntax elements in the attributes (attributes) of the primitives of the mesh description module 404. The most basic mesh description module comprises the three-dimensional coordinates of the vertices, the connection relationships between the vertices, the color information of the vertices, the three-dimensional coordinates requiring texture values, and the like. The texture coordinates (TEXCOORD) for the coordinates requiring texture values need to be enabled at the mesh description module level, so that the texture values subsequently obtained from the texture are attached to the corresponding three-dimensional coordinates of the three-dimensional object. The texture coordinates (TEXCOORD) and the color data are all subordinate to the attributes (attributes) in the primitives of the mesh description module.
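A minimal sketch of such a mesh description module is given below; the accessor indices are assumptions for the example, and mode 4 is the glTF value denoting a triangle mesh:

    {
      "meshes": [
        {
          "primitives": [
            {
              "attributes": {
                "POSITION": 0,
                "COLOR_0": 1,
                "TEXCOORD_0": 2
              },
              "mode": 4
            }
          ]
        }
      ]
    }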
In some embodiments, a description method of the accessor description module (accessor) 405, the cache slice description module (bufferView) 406, and the cache description module (buffer) 407 in a scene description file supporting a combination of digital person components of different representation schemes includes: some of these modules describe the output data of the media access function and some describe the input data of the media access function (e.g. media files); the content describing the output data of the media access function and the content indexing the input data of the media access function are interleaved within these modules.
In some embodiments, a method of describing the output data of the media access function through the accessor description module (accessor) 405, the cache slice description module (bufferView) 406, and the cache description module (buffer) 407 includes: the mesh description module (mesh) 404, which describes the dynamic three-dimensional model of the digital human face as a three-dimensional mesh, points to a specific accessor description module (accessor) 405; the accessor description module (accessor) 405 points to the corresponding cache slice description module (bufferView) 406; and the cache slice description module (bufferView) 406 points to the corresponding buffer description module (buffer) 407. This description method realizes the storage and indexing of the digital human facial expression model data. Specifically, the buffer described by the buffer description module (buffer) 407 is to be read directly by the display engine, and the data stored in it can be used directly for rendering; in the present application, the data to be stored in the buffer is the dynamic three-dimensional model of the digital human face. The cache slice described by the cache slice description module (bufferView) 406 is responsible for slicing the data of the dynamic three-dimensional model of the digital human face in the buffer, and such slicing can be implemented by two parameters, the start byte offset (byteOffset) and the byte length (byteLength). The accessor described by the accessor description module (accessor) is responsible for adding additional information to the slices of the dynamic three-dimensional model of the digital human face in the cache slices, such as the data type, the amount of data of a certain type, and the numerical range of data of a certain type. The mesh description module (mesh) 404 describing the three-dimensional mesh representing the digital human face points to the accessor description module (accessor) 405 to fetch the dynamic three-dimensional model of the digital human face to be rendered.
Since the dynamic three-dimensional model of the digital human face is a time-varying medium, the accessor used to access it needs to be able to access time-varying media. Therefore, the accessor description module (accessor) 405 describing this accessor should include the MPEG time-varying accessor (MPEG_accessor_timed) extension, which converts the described accessor into a time-varying accessor. When the accessor description module (accessor) 405 includes the MPEG time-varying accessor (MPEG_accessor_timed) extension, it contains two cache slice index syntax elements (bufferView): one outside the extension and the other inside it. The cache slice index syntax element (bufferView) outside the extension points to the dynamic three-dimensional model of the digital human face, and the cache slice index syntax element (bufferView) inside the extension points to the information header of the time-varying accessor. The information header of the time-varying accessor may also be called the parameters of the time-varying accessor; it describes the indexing method, data type, data amount, etc. of the time-varying data stored for the time-varying accessor. Because these parameters may change over time, they are attached to the time-varying data as an information header. The information header may include the values taken at different moments by contents such as the cache slice index syntax element (bufferView) outside the extension, the data type syntax element (componentType), and the accessor type syntax element (type).
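A sketch of an accessor description module converted into a time-varying accessor in this way is given below; the numeric values mirror the example parsed later in this description and are otherwise illustrative, and componentType 5126 is the glTF code for floating point data:

    {
      "accessors": [
        {
          "bufferView": 0,
          "componentType": 5126,
          "type": "VEC3",
          "count": 6873,
          "extensions": {
            "MPEG_accessor_timed": {
              "bufferView": 2
            }
          }
        }
      ]
    }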
Since the dynamic three-dimensional model of the digital human face is a time-varying medium, the buffer used to cache it needs to be able to hold time-varying media. Therefore, the buffer description module (buffer) 407 describing this buffer should contain the MPEG ring buffer (MPEG_buffer_circle) extension, which converts the buffer into a ring buffer. The MPEG ring buffer (MPEG_buffer_circle) may include syntax elements such as a media index syntax element (media), a track index syntax element (tracks), and a number-of-links syntax element (count). The media index syntax element (media) and its value point to a media file described in MPEG media. The track index syntax element (tracks) and its value describe the track information of the source data of the data cached in the buffer. The number-of-links syntax element (count) describes the number of storage links of the ring buffer.
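A sketch of a buffer description module converted into a ring buffer in this way is given below; the byte length is an illustrative assumption, a tracks array may additionally be present as described above, and the extension is written here in its glTF form MPEG_buffer_circular (referred to in this description as the MPEG ring buffer):

    {
      "buffers": [
        {
          "byteLength": 165052,
          "extensions": {
            "MPEG_buffer_circular": {
              "count": 3,
              "media": 0
            }
          }
        }
      ]
    }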
Through the two extensions, different links in the ring buffer store data of time-varying media at different moments, and different storage links are distinguished in a slicing manner through the cache slices, so that a time-varying accessor can access the data from different cache slices (storage links) at different moments.
In some embodiments, a method of describing input data of a media access function by an accessor description module (accessor) 405, a cache slice description module (bufferView) 406, and a cache description module (buffer) 407 includes: the input data of the media access function may be described in MPEG media (mpeg_media) as a media file, and this media file declared in MPEG media (mpeg_media) is indexed by an MPEG ring buffer (mpeg_buffer_circle) in a buffer description module (buffer) 407, in such a way that the input data of the media access function is delivered to the media access function.
In some embodiments, a method of describing input data of a media access function by an accessor description module (accessor) 405, a cache slice description module (bufferView) 406, and a cache description module (buffer) 407 includes: instead of using the media file to carry the input data of the media access function, the scene description file is used to carry the input data of the media access function, which is then delivered to the media access function by the display engine through the media access function API after the display engine completes parsing of the scene description file. One implementation of directly describing input data of a media access function using a scene description file includes: the input data of the media access function is described in a buffer description module (buffer) 407 of the scene description file.
In some embodiments, a method of describing input data of a media access function by an accessor description module (accessor) 405, a cache slice description module (bufferView) 406, and a cache description module (buffer) 407 includes: a part of the input data of the media access function is described in MPEG media (mpeg_media) as a media file, and the media file declared in MPEG media (mpeg_media) is indexed by an MPEG ring buffer (mpeg_buffer_circle) in a buffer description module (buffer) 407, and another part of the input data of the media access function is directly described using a scene description file.
In some embodiments, a method of describing a camera description module (camera) 408 in a scene description file supporting a combination of digital person components for different presentation schemes, includes: visual information related to viewing, such as a viewpoint, a view angle, and the like of the node description module (node) 403, is defined by the camera description module (camera) 408.
In some embodiments, a description method of the texture description module (texture) 410 and the material description module (material) 409 in a scene description file supporting a combination of digital person components of different representation schemes includes: additional information on the surface of a three-dimensional object is described by the material description module (material) 409, the texture description module (texture) 410, the sampler description module (sampler) 411, and the texture map description module (image) 412. The collaboration between these modules is as follows: the material description module (material) 409 and the texture description module (texture) 410 together define the color and physical information of the object surface; the sampler description module (sampler) 411 and the texture map description module (image) 412 are specified under the texture description module (texture) 410, with the sampler description module (sampler) 411 defining how the texture is mapped onto the object surface, enabling specific adjustment and wrapping of the texture; and the texture map description module (image) 412 uses a URL to identify and index the texture map.
In some embodiments, a method of describing the skin description module (skin) 413 in a scene description file supporting a combination of digital person components of different representation schemes comprises: the motion and deformation relationship between the three-dimensional mesh (mesh) mounted on the node description module (node) 403 and the corresponding bones is defined by the skin description module (skin) 413.
In some embodiments, a method of describing an animation description module (animation) 414 in a scene description file supporting a combination of digital human components of different representation schemes, includes: animation added to the node description module (node) 403 is defined by an animation description module (animation) 414.
In some embodiments, the animation description module (animation) 414 may describe the animation added to the node description module (node) 403 through one or more of position movement, angle rotation, and size scaling.
In some embodiments, animation description module (animation) 414 may also indicate at least one of a start time, an end time, and an implementation of the animation added for node description module (node) 403.
That is, in a scene description file supporting a combination of digital person components of different representation schemes, animation may likewise be added to nodes representing objects in the three-dimensional scene. The animation description module (animation) 414 describes the animation added to a node through three modes, namely position movement, angle rotation and size scaling, and can also specify the start and end time of the animation and the implementation mode of the animation.
The scene description file supporting digital person component combinations of different representation schemes provided by scheme one above is exemplarily described below in conjunction with a specific scene description file.
The main content of the scene description file supporting the digital person component combination of different representation schemes provided by the scheme one is included between a pair of brackets of the 1 st row and the 108 th row in the above example, and the scene description file supporting the digital person component combination of different representation schemes includes: digital asset description module (asset), use extension list (extensionUsed), MPEG media (mpeg_media), scene declaration (scene), scene list (scenes), node list (nodes), grid list (meshes), accessor list (accessors), cache slice list (bufferViews), cache list (buffers). The contents of each section and the information included in each section at the analysis angle are described below.
1. Digital asset description module (asset): the digital asset description module is on lines 2-4. From "version": "2.0" on line 3 of the digital asset description module, it can be determined that the scene description file was written based on the glTF 2.0 version, which is also the reference version of the scene description standard. From a parsing perspective, the display engine may determine from the digital asset description module which parser should be selected to parse the scene description file.
2. Use extension list (extensionUsed): the use extension list is on lines 6-11. Since the use extension list includes the four extensions MPEG media (mpeg_media), MPEG ring buffer (mpeg_buffer_circle), MPEG time-varying accessor (mpeg_accessor_timed), and MPEG digital person node (mpeg_node_avatar), it can be determined that these four MPEG extensions are used in the scene description file. From a parsing perspective, the display engine may learn in advance from the content of the use extension list that the extension items involved in subsequent parsing include: MPEG media, MPEG ring buffer, MPEG time-varying accessor, and MPEG digital person node.
3. Scene declaration (scene): the scene declaration is on line 13. Since a scene description file can theoretically include a plurality of three-dimensional scenes, the scene description file indicates through "scene": 0 on line 13 that the scene it describes is the first three-dimensional scene in the scene list, namely the scene described by the scene description module enclosed by the brackets on lines 16-20.
4. Scene list (scenes): the scene list is on lines 15-21. The scene list contains only one pair of brackets, which indicates that the scene description file contains one scene description module, i.e. only one three-dimensional scene; within that pair of brackets (the scene description module), "nodes" on lines 17-19 indicates that only the node with index 0 is included in the three-dimensional scene. From a parsing perspective, the scene list specifies that the whole scene description framework should select the first three-dimensional scene in the scene list for subsequent processing and rendering, defines the overall structure of the three-dimensional scene, and points to the more detailed node description modules of the next layer.
5. Node list (nodes): the node list is on lines 23-39. The node list contains only one pair of brackets, i.e. only one node description module (node), so the three-dimensional scene includes only one node; this node and the node with index value 0 referenced in the scene description module are the same node, associated by index. Within the brackets (the node description module), "mesh": 0 on line 25 indicates that the three-dimensional mesh mounted on the node is the one described by the first mesh description module in the mesh list, corresponding to the mesh description module of the next layer; "MPEG_node_avatar" on line 27 indicates that the node described by the node description module represents a digital person; "isAvatar" on line 28 indicates that the digital person represented by this node is active; and "urn:mpeg:sd:2023:avatar:face-landMark" on line 29 indicates that the representation of the digital person represented by this node is the MPEG reference digital person denoted by urn:mpeg:sd:2023:avatar, while the representation of the digital person's facial component is the facial semantic landmark method corresponding to landMark. "path": "full_body/upper_body/head/face" on line 32 indicates that the hierarchical standard name of the digital human body part in the mapping relation is the face, and "node": 0 on line 33 indicates that the index value of the node description module corresponding to the node of the mapping relation is 0. From a parsing perspective, the node list indicates that the content mounted on the unique node of the three-dimensional scene is a three-dimensional mesh, namely the one described by the first mesh description module in the mesh list; that the node represents a digital person; that the representation of the digital person's components other than the face is the MPEG reference digital person; and that the representation of the facial component is the facial semantic landmark method.
6. Mesh list (meshes): the mesh list is on lines 41-52 and contains only one pair of brackets, which indicates that the scene description file (mesh list) includes only one mesh description module, i.e. the three-dimensional scene contains only one three-dimensional mesh; this mesh and the three-dimensional mesh with index value 0 in the node description module are the same mesh. Within the brackets (the mesh description module) describing the three-dimensional mesh, "primitives" on line 43 indicates that the three-dimensional mesh has primitives; "attributes" on line 45 and "mode" on line 48 indicate that the primitives carry the two types of information attribute (attributes) and mode (mode); and "POSITION": 0 on line 46 indicates that the three-dimensional mesh has geometric coordinate data and that the accessor description module corresponding to the accessor accessing the geometric coordinates is the first accessor in the accessor list (accessors). In addition, "mode": 4 on line 48 determines that the topology of the three-dimensional mesh is a triangle mesh. From a parsing perspective, this description module determines the actual data types that the three-dimensional mesh has and the topology type of the three-dimensional mesh.
7. Buffer list (buffers): the buffer list is on lines 85-95. The buffer list contains only one pair of brackets, which indicates that the scene description file includes only one buffer description module (buffer), i.e. only one buffer is used for displaying the three-dimensional scene. Within the brackets (the buffer description module) representing the buffer, an extension of the MPEG ring buffer (MPEG_buffer_circle) is used, which indicates that the buffer is a ring buffer obtained using the MPEG extension. According to "media": 0 on line 91 inside the MPEG ring buffer (MPEG_buffer_circle), the data source of the ring buffer is the first media file declared in the MPEG media; according to "count": 3 on line 90 inside the MPEG ring buffer (MPEG_buffer_circle), it can also be determined that the MPEG ring buffer has three storage links. From a parsing perspective, the buffer list makes it possible to associate with the buffer the media declared in the MPEG media (MPEG_media), or to reference a media file declared earlier but not yet used. It should be noted that the media file referred to here is an unprocessed encapsulated file; it needs to be processed by the media access function to extract data that can be used directly for rendering, such as the three-dimensional coordinates mentioned in the mesh description module (mesh).
8. Cache slice list (bufferViews): the cache slice list is on lines 68-83. The cache slice list includes three parallel pairs of brackets; combined with the buffer list, which defines only one buffer, this indicates that the data of the media file declared in the MPEG media is divided into three cache slices. In the first pair of brackets (the first cache slice description module), the cache description module with index 0 (the only one in the buffer list) is pointed to first, and the corresponding cache slice is then defined by a byte length (byteLength) to have a capacity of 82476 bytes. In the second pair of brackets (the second cache slice description module), the cache description module with index 0 is also pointed to, and the data slice range of the corresponding cache slice is then defined to be bytes 82476-164952 by the two parameters byte length (82476) and byte offset (82476). In the third pair of brackets (the third cache slice description module), the cache description module with index 0 is also pointed to, and the data slice range of the corresponding cache slice is then defined to be bytes 164953-165052 by the two parameters byte length (100) and byte offset (164952).
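Put together, the three cache slice description modules described above would correspond to a bufferViews list of roughly the following form (a sketch reflecting the byte offsets and lengths just listed):

    {
      "bufferViews": [
        { "buffer": 0, "byteOffset": 0, "byteLength": 82476 },
        { "buffer": 0, "byteOffset": 82476, "byteLength": 82476 },
        { "buffer": 0, "byteOffset": 164952, "byteLength": 100 }
      ]
    }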
9. Accessor list (accessors): the accessor list is on lines 54-66. The accessor list contains only one pair of brackets, which means that the scene description file needs to include only one accessor description module, i.e. display of the three-dimensional scene requires access by one accessor. In addition, the brackets (the accessor description module) contain an MPEG time-varying accessor (MPEG_accessor_timed) extension, which indicates that the accessor is directed at MPEG-defined time-varying media. Within the brackets (the accessor description module), "bufferView" on line 56 indicates that the accessor corresponding to the accessor description module should acquire data from the cache slice corresponding to the first cache slice description module in the cache slice list (bufferViews); "componentType" on line 57 indicates that the data type of the accessor is floating point; "type": "VEC3" on line 58 indicates that the accessor type is a three-dimensional vector; "count": 6873 on line 59 indicates that the number of data items the accessor should access is 6873; and "bufferView" on line 62 indicates that the information header of the accessor should be acquired from the cache slice corresponding to the third cache slice description module in the cache slice list (bufferViews). From a parsing perspective, the accessor list completes the definition of the data needed for rendering; for example, the data types lacking in the buffer description module and the cache slice description module are defined in the corresponding accessor description module of the accessor list.
10. MPEG media (MPEG_media): the MPEG media is on lines 97-106. The MPEG media implements the declaration of the media file contained in the three-dimensional scene, indicates the file type of the encapsulated file corresponding to the media file through the "mimeType" field on line 102, and indicates the access address of the media file through "uri": "https://www.example.com/avatarface". From a parsing perspective, the display engine may determine, by parsing the MPEG media, that a media file exists in the three-dimensional scene to be rendered, and learn the method of accessing and parsing the media file.
The scene description file supporting digital person component combinations of different representation schemes provided by scheme two above is exemplarily described below in conjunction with a specific scene description file.
The main content of the scene description file supporting digital person component combinations of different representation schemes provided by scheme two is included between the pair of brackets on line 1 and line 109 in the above example, and this scene description file includes: digital asset description module (asset), use extension list (extensionUsed), MPEG media (MPEG_media), scene declaration (scene), scene list (scenes), node list (nodes), mesh list (meshes), accessor list (accessors), cache slice list (bufferViews), and buffer list (buffers). It differs from the scene description file provided by scheme one mainly in the following: the representation of the digital person is denoted the MPEG reference digital person by urn:mpeg:sd:2023:avatar on line 29; lines 30-36 are the mapping list (mappings); and lines 31-35 in the mapping list (mappings) are the first group of mapping relations, in which "path" on line 32 indicates that the hierarchical standard name of the digital human body part in the mapping relation is the face, "node" on line 33 indicates that the index value of the node description module corresponding to the node of the mapping relation for the face is 0, and "subType": "urn:landMark" on line 34 indicates that the representation scheme of the facial component of the digital person is the facial semantic landmark method corresponding to urn:landMark.
In some embodiments, data (media access function input data) for reconstructing a dynamic three-dimensional model corresponding to a digital person component may be carried by a media file. That is, data for reconstructing a dynamic three-dimensional model corresponding to a digital person component is described in an MPEG media (MPEG_media) as an MPEG type media file, and the input data of a media access function is delivered to the media access function by indexing the media file declared in the MPEG media (MPEG_media) through an MPEG ring buffer (MPEG_buffer_circle) in a buffer description module (buffer).
In some embodiments, data for reconstructing a dynamic three-dimensional model corresponding to a digital human component may be carried by a scene description file. That is, the data for reconstructing the dynamic three-dimensional model corresponding to the digital person component is directly added into the scene description file, and the display engine analyzes the scene description file to obtain the data for reconstructing the dynamic three-dimensional model corresponding to the digital person component and then sends the data to the media access function through the media access function API.
In some embodiments, data for reconstructing a dynamic three-dimensional model corresponding to a digital human component may be carried in part using a scene description file and in part by a media file.
When the representation scheme of the digital person's facial component is the facial semantic landmark method, the input data for reconstructing the dynamic facial expression of the digital person generally comprises two types of data: the static three-dimensional model of the digital human face, and the related information of the digital human facial semantic landmark points (such as the spatial coordinates and motion vectors of the landmark points). By combining whether the facial component of the digital person adopts the facial semantic landmark method with the way the input data of the media access function is carried, several description methods for the input data of the media access function can be derived, as shown in fig. 6:
Method a-1: as shown by 601 in fig. 6, when the representation scheme of the digital person's facial component differs from that of the digital person's other components, the input data of the media access function (i.e. the media file used to reconstruct the digital person's dynamic facial expression) may be described in the MPEG media (MPEG_media) 401 as one MPEG-type media file, and this media file declared in the MPEG media (MPEG_media) 401 is indexed by the MPEG ring buffer (MPEG_buffer_circle) in the buffer description module (buffer) 407; in this way the input data of the media access function is delivered to the media access function.
When method a-1 is used to describe the input data of the media access function, the MPEG media (MPEG_media) 401, the buffer slice description module (bufferView) 406, and the buffer description module (buffer) 407 in the scene description file may be the same as those in the scene description file corresponding to scheme one or scheme two, and are not illustrated here to avoid redundancy.
Method a-2: as shown at 602 in fig. 6, when the representation scheme of the digital person's facial component differs from the representation scheme of the digital person's other components, the static three-dimensional model of the digital person's face in the media access function input data is carried by the scene description file, while the related information of the digital person's facial semantic mark points is carried by a media file and described in the MPEG media (MPEG_media) 401, and this media file declared in the MPEG media (MPEG_media) 401 is indexed by an MPEG circular buffer (MPEG_buffer_circular) in the buffer description module (buffer) 407.
When method a-2 is used to describe the input data of the media access function, the MPEG media (MPEG_media) 401, the buffer slice description module (bufferView) 406, and the buffer description module (buffer) 407 in the scene description file corresponding to scheme one or scheme two can be replaced by the following content:
Here, rows n-01 to n+28 are the buffer slice list (bufferViews). The buffer slice list contains six parallel bracket pairs, indicating that the media data is divided into six buffer slices. Combined with the related information in the accessor list (accessors) and the buffer list (buffers): the buffer slice corresponding to the first buffer slice description module buffers the first 82476 bytes of the static three-dimensional model of the digital person's face; the buffer slice corresponding to the second buffer slice description module buffers the last 82476 bytes of the static three-dimensional model of the digital person's face; the buffer slice corresponding to the third buffer slice description module buffers the accessor header data; the buffer slice corresponding to the fourth buffer slice description module buffers the first 816 bytes of the related information of the digital person's facial semantic mark points; the buffer slice corresponding to the fifth buffer slice description module buffers bytes 817 to 1632 of the related information of the digital person's facial semantic mark points; and the buffer slice corresponding to the sixth buffer slice description module buffers metadata related to the digital person's facial semantic mark points.
Rows n+30 to n+44 are the buffer list (buffers). The buffer list contains two parallel bracket pairs, indicating that the media data is stored in two buffers. The static three-dimensional model of the digital person's face is carried directly by the scene description file and stored in the buffer corresponding to the first buffer description module; the related information of the digital person's facial semantic mark points is carried by a media file, and that media file is indexed by the second buffer description module.
Rows n+46 to n+55 are the MPEG media (MPEG_media). The media file carrying the related information of the digital person's facial semantic mark points is declared in the MPEG media.
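A minimal JSON sketch consistent with this description is given below for readability. It is not the original numbered listing: the byte offsets and lengths for the header and metadata slices, the buffer assignments, the media name, and the data URI payload are assumptions, and the placeholder values "TypeValue", "trackIndex=1", and "CodecsValue" stand for format-dependent values. In an actual glTF-based scene description file, MPEG_media is typically declared under the top-level extensions object; it is shown at the top level here only for compactness.

```json
{
  "bufferViews": [
    { "buffer": 0, "byteOffset": 0,     "byteLength": 82476 },
    { "buffer": 0, "byteOffset": 82476, "byteLength": 82476 },
    { "buffer": 1, "byteOffset": 0,     "byteLength": 48 },
    { "buffer": 1, "byteOffset": 48,    "byteLength": 816 },
    { "buffer": 1, "byteOffset": 864,   "byteLength": 816 },
    { "buffer": 1, "byteOffset": 1680,  "byteLength": 128 }
  ],
  "buffers": [
    {
      "byteLength": 164952,
      "uri": "data:application/octet-stream;base64,<static face model bytes>"
    },
    {
      "byteLength": 15000,
      "extensions": { "MPEG_buffer_circular": { "count": 5, "media": 0 } }
    }
  ],
  "MPEG_media": {
    "media": [
      {
        "name": "avatarFaceLandmarks",
        "autoplay": true,
        "loop": true,
        "alternatives": [
          {
            "mimeType": "TypeValue",
            "uri": "www.example.com/avatarface/index",
            "tracks": [ { "track": "trackIndex=1", "codecs": "CodecsValue" } ]
          }
        ]
      }
    ]
  }
}
```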
Method a-3: as shown at 603 in fig. 6, when the representation scheme of the digital person's facial component differs from the representation scheme of the digital person as a whole, the related information of the digital person's facial semantic mark points in the input data is carried directly by the scene description file, while the static three-dimensional model of the digital person's face is carried by a media file and described in the MPEG media (MPEG_media) 401, and this media file declared in the MPEG media (MPEG_media) 401 is indexed by an MPEG circular buffer (MPEG_buffer_circular) in the buffer description module (buffer) 407.
When method a-3 is used to describe the input data of the media access function, the MPEG media (MPEG_media) 401, the buffer slice description module (bufferView) 406, and the buffer description module (buffer) 407 in the scene description file corresponding to scheme one or scheme two can be replaced by the following content:
Here, rows n+00 to n+29 are the buffer slice list (bufferViews). The buffer slice list contains six parallel bracket pairs, indicating that the media data is divided into six buffer slices. Combined with the related information in the accessor list (accessors) and the buffer list (buffers): the buffer slice corresponding to the first buffer slice description module buffers the first 82476 bytes of the static three-dimensional model of the digital person's face; the buffer slice corresponding to the second buffer slice description module buffers the last 82476 bytes of the static three-dimensional model of the digital person's face; the buffer slice corresponding to the third buffer slice description module buffers the accessor header data; the buffer slice corresponding to the fourth buffer slice description module buffers the related information of the digital person's facial semantic mark points at time t; the buffer slice corresponding to the fifth buffer slice description module buffers the related information of the digital person's facial semantic mark points at time t+1; and the buffer slice corresponding to the sixth buffer slice description module buffers metadata related to the digital person's facial semantic mark points.
Rows n+31 to n+45 are the buffer list (buffers). The buffer list contains two parallel bracket pairs, indicating that the media data is stored in two buffers. The related information of the digital person's facial semantic mark points is carried directly by the scene description file and stored in the buffer corresponding to the first buffer description module; the static three-dimensional model of the digital person's face is carried by a media file, and that media file is indexed by the second buffer description module.
Rows n+46 to n+55 are the MPEG media (MPEG_media). The media file carrying the static three-dimensional model of the digital person's face is declared in the MPEG media.
Method a-4: as shown at 604 in fig. 6, when the representation scheme of the digital person's facial component differs from the representation scheme of the digital person as a whole, the input data of the media access function (including the static three-dimensional model of the digital person's face and the related information of the digital person's facial semantic mark points) is carried directly by the scene description file itself.
When method a-4 is used to describe the input data of the media access function, the MPEG media (MPEG_media) 401, the buffer slice description module (bufferView) 406, and the buffer description module (buffer) 407 in the scene description file corresponding to scheme one or scheme two can be replaced by the following content:
Here, rows n+00 to n+29 are the buffer slice list (bufferViews). The buffer slice list contains six parallel bracket pairs, indicating that the media data is divided into six buffer slices. Combined with the related information in the accessor list (accessors) and the buffer list (buffers): the buffer slice corresponding to the first buffer slice description module buffers the first 82476 bytes of the static three-dimensional model of the digital person's face; the buffer slice corresponding to the second buffer slice description module buffers the last 82476 bytes of the static three-dimensional model of the digital person's face; the buffer slice corresponding to the third buffer slice description module buffers the accessor header data; the buffer slice corresponding to the fourth buffer slice description module buffers the related information of the digital person's facial semantic mark points at time t; the buffer slice corresponding to the fifth buffer slice description module buffers the related information of the digital person's facial semantic mark points at time t+1; and the buffer slice corresponding to the sixth buffer slice description module buffers metadata related to the digital person's facial semantic mark points.
Rows n+31 to n+40 are the buffer list (buffers). The buffer list contains two parallel bracket pairs, indicating that the media data is stored in two buffers. The static three-dimensional model of the digital person's face is carried directly by the scene description file and stored in the buffer corresponding to the first buffer description module, and the related information of the digital person's facial semantic mark points is also carried directly by the scene description file and stored in the buffer corresponding to the second buffer description module.
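A minimal sketch of such a buffer list, assuming both data sets are embedded in the scene description file as data URIs; the byte lengths and URI payloads are placeholders, not values from the original listing.

```json
"buffers": [
  {
    "byteLength": 164952,
    "uri": "data:application/octet-stream;base64,<static face model bytes>"
  },
  {
    "byteLength": 1760,
    "uri": "data:application/octet-stream;base64,<face semantic mark point data>"
  }
]
```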
Method b-1: as shown at 605 in fig. 6, when the representation scheme of the digital person's facial component is the same as the representation scheme of the digital person, the input data of the media access function may use other file formats that do not involve the facial semantic mark method, and may use other description methods.
2. Display engine supporting digital person component combinations of different representation schemes
In the workflow of the scene description framework for immersive media, the main functions of the display engine include: parsing the scene description file to obtain the method of rendering the three-dimensional scene; transmitting media access instructions or media data processing instructions to the media access function through the media access function API; sending cache management instructions to the cache management module through the cache API; and fetching the processed data from the cache and completing the rendering and display of the three-dimensional scene and the objects in it according to the read data. Accordingly, the functions of a display engine supporting digital person component combinations of different representation schemes include: 1. parsing a scene description file that contains description information for reconstructing the digital person's dynamic facial expression; 2. transmitting media access instructions or media data processing instructions to the media access function through the media access function API, where those instructions are derived from the parsing result of the scene description file containing the description information for reconstructing the digital person's dynamic facial expression; 3. sending cache management instructions to the cache management module through the cache API; 4. fetching from the cache the processed data used to reconstruct the digital person's dynamic facial expression (including the static three-dimensional model of the digital person's face and the related information of the digital person's facial semantic mark points), and completing the rendering and display of the three-dimensional scene and the digital person in it according to the read data.
3. Media access function API supporting digital person component combinations of different representation schemes
In the workflow of the scene description framework for immersive media, the display engine obtains the method of rendering the three-dimensional scene by parsing the scene description file; this method must then be transmitted to the media access function, or instructions must be sent to the media access function based on it, and both are done through the media access function API.
In some embodiments, the display engine may send media access instructions or media data processing instructions to the media access function through the media access function API. Wherein the media access instructions or media data processing instructions sent by the display engine to the media access function through the media access function API are derived from parsing results of a scene description file containing description information to reconstruct dynamic facial expressions of the digital person, the media access instructions or media data processing instructions may include: index of media files, URLs of media files, types of media files, codecs used by media files, format requirements for processed data to reconstruct digital human dynamic facial expressions and other media data, and the like.
In some embodiments, the media access function may also request media access instructions or media data processing instructions from the display engine through the media access function API.
4. Media access function supporting digital person component combinations of different representation schemes
In the workflow of the scene description framework for immersive media, after the media access function receives, through the media access function API, a media access instruction or media data processing instruction issued by the display engine, it executes that instruction. For example: acquiring the media file carrying the data of the digital person's dynamic facial expression, establishing a suitable pipeline for the media file used to reconstruct the dynamic three-dimensional model of the digital person's face, and writing the processed dynamic three-dimensional model of the digital person's face into a suitable cache in the format prescribed by the scene description file.
The media access function may generate a dynamic three-dimensional model of the digital human face by different methods, as follows:
When describing input data of a media access function by adopting the method a-1, an implementation manner of the media access function to obtain a dynamic three-dimensional model of a digital human face comprises the following steps: firstly, a media access function acquires a media file carrying a dynamic three-dimensional model for reconstructing a digital human face, then the media access function establishes a corresponding pipeline, the dynamic three-dimensional model of the digital human face is reconstructed in the pipeline by utilizing the media file, and output data of the pipeline is the dynamic three-dimensional model of the digital human face.
In some embodiments, the implementation of the media access function to obtain a media file carrying a dynamic three-dimensional model for reconstructing a digital human face comprises: media files bearing a dynamic three-dimensional model for reconstructing a digital human face are downloaded from a server using a network transport service.
In some embodiments, the implementation of the media access function to obtain a media file carrying a dynamic three-dimensional model for reconstructing a digital human face comprises: media files bearing a dynamic three-dimensional model for reconstructing the digital human face are read from the local storage space.
When describing input data of a media access function using method a-2 or a-3, an implementation of the media access function to obtain a dynamic three-dimensional model of a digital human face includes: on one hand, the media access function obtains a media file carrying a dynamic three-dimensional model for reconstructing the digital human face as part of input data, on the other hand, the media access function receives an analysis result of the scene description file, obtains data carrying the dynamic three-dimensional model for reconstructing the digital human face in the scene description file as another part of input data according to the analysis result of the scene description file, then establishes a corresponding pipeline in the media access function, reconstructs the dynamic three-dimensional model of the digital human face in the pipeline by utilizing the input data, and the output data of the pipeline is the dynamic three-dimensional model of the digital human face.
When method a-4 is used to describe the input data of the media access function, an implementation in which the media access function obtains the dynamic three-dimensional model of the digital person's face includes: the media access function receives the parsing result of the scene description file, obtains the data required to reconstruct the dynamic three-dimensional model of the digital person's face according to that parsing result, establishes a corresponding pipeline, and reconstructs the dynamic three-dimensional model of the digital person's face in the pipeline using the input data; the output data of the pipeline is the dynamic three-dimensional model of the digital person's face. In addition, in this scheme the media access function does not need to obtain a media file from a server or from local storage.
5. Cache API supporting digital person component combinations of different representation schemes
After the media access function generates the dynamic three-dimensional model of the digital person's face through the pipeline, it also needs to deliver this model to the display engine in a standard arrangement structure, which requires that the model be correctly stored in the cache. This work is completed by the cache management module, but the cache management module needs to obtain cache management instructions from the media access function or the display engine through the cache API.
In some embodiments, the media access function may send cache management instructions to the cache management module through a cache API. The cache management instruction is sent to the media access function by the display engine through the media access function API.
In some embodiments, the display engine may send cache management instructions to the cache management module through a cache API.
That is, the cache management module may communicate with the media access function through the cache API, and may also communicate with the display engine through the cache API, where the purpose of communicating with the media access function or the display engine is to implement cache management. When the cache management module communicates with the media access function through the cache API, the display engine is required to send the cache management instruction to the media access function through the media access function API, and then the media access function sends the cache management instruction to the cache management module through the cache API; when the cache management module communicates with the display engine through the cache API, the display engine only needs to generate a cache management instruction according to the cache management information analyzed in the scene description file, and the cache management instruction is sent to the cache management module through the cache API.
In some embodiments, the cache management instructions may include: one or more of creating a cached instruction, updating a cached instruction, releasing a cached instruction.
6. Cache management module supporting digital person component combinations of different representation schemes
In the workflow of the scene description framework for immersive media, after the media access function completes the reconstruction of the dynamic three-dimensional model of the digital person's face through the pipeline, the model needs to be delivered to the display engine in a standard arrangement structure, which requires that the model be correctly stored in the cache; the cache management module is responsible for this work.
The cache management module implements management operations on the cache such as creation, updating, and release, and receives the instructions for these operations through the cache API. The rules of cache management are recorded in the scene description file, parsed by the display engine, and finally delivered to the cache management module by the display engine or the media access function. After a media file has been processed by the media access function, it needs to be stored in a suitable cache and then fetched by the display engine; the role of cache management is to keep the caches matched to the format of the processed media data without disturbing that data. The specific design of the cache management module should refer to the design of the display engine and the media access function.
The embodiment of the application provides a method for generating a scene description file, which is shown by referring to fig. 7, and comprises the following steps:
And S71, under the condition that the target digital person is included in the three-dimensional scene to be rendered, acquiring a representation scheme of each digital person component of the target digital person.
In the embodiment of the present application, the representation schemes of the digital person components of the target digital person may all be the same (i.e., the digital person components of the target digital person use only one representation scheme), or they may differ (i.e., the digital person components of the target digital person use multiple representation schemes).
S72, generating a digital person node array ("MPEG_node_avatar": { }) according to the representation scheme of each digital person component of the target digital person.
In some embodiments, step S72 (generating a digital person node array according to the representation of each digital person component of the target digital person) includes:
the value of a digital person type syntax element (type) in the digital person node array is set according to the representation scheme of each digital person component of the target digital person.
In some embodiments, setting the values of the digital human-type syntax elements in the digital human node array according to the representation scheme of the individual digital human components of the target digital human comprises the following steps ① to ③:
And ①, determining the representation scheme of the target digital person according to the representation scheme of each digital person component of the target digital person.
In some embodiments, determining the representation of the target digital person from the representations of individual digital person components of the target digital person comprises: and determining the representation scheme of the digital person component corresponding to the node representing the target digital person as the representation scheme of the target digital person.
For example: if the representation scheme of the digital person component corresponding to the node representing the target digital person is the MPEG reference digital person, the representation scheme of the target digital person is determined to be the MPEG reference digital person.
In some embodiments, determining the representation of the target digital person from the representations of individual digital person components of the target digital person comprises:
and determining the representation scheme with the most use times in the representation schemes of the digital person components of the target digital person as the representation scheme of the target digital person.
For example: the target digital person comprises five digital person components whose representation schemes are, respectively: representation scheme 1, representation scheme 2, representation scheme 3, representation scheme 2, and representation scheme 2. Since representation scheme 2 is the most frequently used among the representation schemes of the digital person components of the target digital person, representation scheme 2 is determined as the representation scheme of the target digital person.
Determining the most frequently used representation scheme as the representation scheme of the target digital person allows the representation schemes of the digital person components of the target digital person to be represented more clearly, reduces the number of digital person component representation schemes that need to be declared in the scene description file, and thereby reduces the data volume of the scene description file.
Step ②, determining the different scheme components according to the representation scheme of the target digital person.
A different scheme component is a digital person component, among the digital person components of the target digital person, whose representation scheme differs from the representation scheme of the target digital person.
For example: the target digital person includes digital person component A, digital person component B, and digital person component C, whose representation schemes are representation scheme 1, representation scheme 2, and representation scheme 2, respectively. If representation scheme 2 is determined as the representation scheme of the target digital person, then, since the representation scheme of digital person component A (representation scheme 1) differs from the representation scheme of the target digital person (representation scheme 2), digital person component A can be determined to be a different scheme component.
Step ③, setting the value of the digital person type syntax element according to the URN of the representation scheme of the target digital person, the names of the digital person body parts represented by the different scheme components and the URN of the representation scheme of the different scheme components.
In some embodiments, step ③ (setting the value of the digital person type syntax element according to the URN of the representation scheme of the target digital person, the names of the digital person body parts represented by the different scheme components, and the URNs of the representation schemes of the different scheme components) includes: appending the name of the digital person body part represented by each different scheme component and the URN of its representation scheme to the tail of the URN of the representation scheme of the target digital person, with the name of the digital person body part and the URN of its representation scheme separated by a separator (for example, '-').
For example: when the representation scheme of the target digital person is the MPEG reference digital person and the representation scheme of the target digital person's facial component is the facial semantic mark method, the digital person type syntax element (type) and its value may be set to "type": "urn:mpeg:sd:2023:avatar:face-landMark". From the digital person type syntax element (type) and its value it can be determined that the representation scheme of the target digital person's facial component is the facial semantic mark method named "landMark", and that the representation scheme of the digital person components other than the facial component is the MPEG reference digital person.
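A minimal sketch of a digital person node description module carrying this combined type value might look as follows; only the extension is shown and other node properties are omitted.

```json
{
  "extensions": {
    "MPEG_node_avatar": {
      "type": "urn:mpeg:sd:2023:avatar:face-landMark"
    }
  }
}
```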
In some embodiments, the step S72 (generating the digital person node array according to the representation of each digital person component of the target digital person) includes:
the values of the digital person type syntax element (type) and the values of the component type syntax element (subType) in the digital person node array are set according to the representation scheme of the individual digital person components of the target digital person.
In some embodiments, setting the value of the digital person type syntax element (type) and the value of the component type syntax element (subType) in the digital person node array according to the representation scheme of the individual digital person components of the target digital person includes steps 1 to 4 as follows:
And step 1, determining the representation scheme of the target digital person according to the representation scheme of each digital person component of the target digital person.
The implementation manner of determining the representation scheme of the target digital person according to the representation scheme of each digital person component of the target digital person may refer to the implementation manner of step ①, which is not described in detail herein for avoiding redundancy.
And 2, setting the value of the digital person type grammar element (type) as the URN of the representation scheme of the target digital person.
For example: if the representation scheme of the target digital person is determined to be the MPEG reference digital person, the digital person type syntax element (type) and its value are set to "type": "urn:mpeg:sd:2023:avatar".
Step 3, determining the different scheme components according to the representation scheme of the target digital person.
The different scheme component is a digital person component with a different representation scheme from the representation scheme of the target digital person in the digital person component of the target digital person.
The implementation of determining the different scheme components according to the representation scheme of the target digital person may refer to the implementation of step ②, and is not described again here to avoid redundancy.
And 4, adding a component type grammar element (subType) into a mapping array corresponding to each different-scheme component, and setting the value of the component type grammar element as the URN of the representation scheme of the corresponding different-scheme component.
For example: when the representation scheme of the target digital person is the MPEG reference digital person and the representation scheme of the digital person's facial component is the facial semantic mark method, the digital person type syntax element (type) and its value may be set to "type": "urn:mpeg:sd:2023:avatar", a component type syntax element (subType) may be added to the mapping array corresponding to the facial component in the mapping list (mappings), and the component type syntax element (subType) and its value may be set to "subType": "urn:landMark".
S73, adding the digital human node array into a digital human node description module in the scene description file of the three-dimensional scene to be rendered.
The digital person node description module is the node description module corresponding to the node representing the target digital person in the node list of the scene description file.
According to a second standard revision of the scene description standard, a digital person in a three-dimensional scene is represented by a node and is described by a corresponding node description module (node), so that the three-dimensional scene to be rendered comprises a node representing the target digital person, and the scene description file of the three-dimensional scene to be rendered comprises a node description module corresponding to the node representing the target digital person.
As can be seen from the above technical solution, when the three-dimensional scene to be rendered includes a target digital person, the method for generating a scene description file provided by the embodiment of the present application first obtains the representation scheme of each digital person component of the target digital person, then generates a digital person node array according to those representation schemes, and adds the digital person node array to the digital person node description module corresponding to the node representing the target digital person in the node list of the scene description file of the three-dimensional scene to be rendered. Because the method can generate the digital person node array according to the representation schemes of the digital person components of the target digital person and add it to the scene description file, the representation scheme of each digital person component of the target digital person can be obtained by parsing the scene description file of the three-dimensional scene to be rendered. Even when different digital person components of the target digital person adopt different representation schemes, the method can generate a scene description file that includes description information of the representation scheme of each digital person component, so that the scene description framework supports flexible combination of digital person components with different representation schemes.
In some embodiments, the method for generating a scene description file provided by the embodiment of the present application further includes:
And adding a part name grammar element (path) to the mapping array corresponding to each digital person component of the target digital person, and setting the value of the part name grammar element as a hierarchical standard name of the digital person body part represented by the corresponding digital person component.
For example: if the hierarchical standard name of a certain component of the target digital person is "full_body/upper_body/thorax", a part name syntax element (path) is added to the mapping array corresponding to that component, and the part name syntax element and its value added to that mapping array are set to "path": "full_body/upper_body/thorax".
In some embodiments, the method for generating a scene description file provided by the embodiment of the present application further includes:
and adding a node index syntax element (node) in a mapping array corresponding to each digital person component of the target digital person, and setting the value of the node index syntax element as the index value of a corresponding node description module.
For example: if the index value of the node description module corresponding to the node representing a certain digital person component is 1, the node index syntax element (node) added to the mapping array corresponding to that digital person component and its value are set to "node": 1.
For example, if the representation scheme of the target digital person is the MPEG reference digital person, the representation scheme of the target digital person's facial component is the facial semantic mark method, and the index value of the node description module corresponding to the facial component is 2, then the mapping array corresponding to the facial component can be as follows:
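A sketch of such a mapping array, reconstructed from the description above rather than taken from the original listing; the "face" path value follows the earlier description of the facial mapping relation.

```json
"mappings": [
  {
    "path": "face",
    "node": 2,
    "subType": "urn:landMark"
  }
]
```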
In some embodiments, the method for generating a scene description file provided by the embodiment of the present application further includes:
an active identification syntax element (isAvatar) is added to the digital person node array ("MPEG_node_avatar": { }), and the value of the active identification syntax element is set according to whether the target digital person is an active digital person.
In the embodiment of the application, whether the digital person is an active digital person refers to whether the digital person needs to be rendered in the three-dimensional scene rendering process, and whether a display engine, a media access function and the like in a scene description framework need to process related data and information of the digital person.
In some embodiments, setting the value of the active identification syntax element (isAvatar) according to whether the target digital person is an active digital person comprises: if the target digital person is an active digital person, the active identification syntax element and its value are set to "isAvatar": true or "isAvatar": 1; if the target digital person is an inactive digital person, the active identification syntax element and its value are set to "isAvatar": false or "isAvatar": 0.
Illustratively, a target digital person in the three-dimensional scene to be rendered is an active digital person, and the target digital person comprises digital person component A, digital person component B, and digital person component C; the representation schemes used by these digital person components are the MPEG reference digital person and the facial semantic mark method. A digital person node array generated according to the representation scheme of each digital person component of the target digital person may be as follows:
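A sketch of one possible form of the digital person node array, using the combined-URN type value described in step ③; the node index values and the second mapping path are illustrative assumptions, not values from the original listing.

```json
"MPEG_node_avatar": {
  "isAvatar": true,
  "type": "urn:mpeg:sd:2023:avatar:face-landMark",
  "mappings": [
    { "path": "face", "node": 2 },
    { "path": "full_body/upper_body/thorax", "node": 1 }
  ]
}
```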
In the foregoing embodiment, the generation of the digital person node array according to the representation scheme of each digital person component of the target digital person may also be as follows:
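A sketch of the alternative form described in steps 1 to 4, using a plain type value plus a component type syntax element (subType) in the facial mapping; again the node index values and the second mapping path are illustrative assumptions.

```json
"MPEG_node_avatar": {
  "isAvatar": true,
  "type": "urn:mpeg:sd:2023:avatar",
  "mappings": [
    { "path": "face", "node": 2, "subType": "urn:landMark" },
    { "path": "full_body/upper_body/thorax", "node": 1 }
  ]
}
```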
in some embodiments, the method for generating a scene description file according to the embodiments of the present application further includes the following steps a and b:
and a step a of generating a target media description module corresponding to the target media file according to the description information of the target media file.
The target media file is any media file in the three-dimensional scene to be rendered.
Step b, adding the target media description module to the media list ("media": [ ]) of the Moving Picture Experts Group media ("MPEG_media": { }) of the scene description file.
In some embodiments, the step a (generating the target media description module corresponding to the target media file according to the description information of the target media file) includes at least one of the following steps a1 to a 5:
Step a1, adding a media name syntax element (name) in the target media description module, and setting a value of the media name syntax element according to the name of the target media file.
For example: the name of the target media file is "avatarThorax", a media name syntax element (name) is added in the target media description module, and the media name syntax element and its value are set to "name": "avatarThorax".
Step a2, adding an automatic play syntax element (autoplay) in the target media description module, and setting a value of the automatic play syntax element according to whether the target media file needs automatic play.
For example: if the target media file needs to be played automatically, an autoplay syntax element (autoplay) is added to the target media description module, and the autoplay syntax element and its value are set to "autoplay": true or "autoplay": 1.
For another example: if the target media file does not need to be played automatically, an autoplay syntax element (autoplay) is added to the target media description module, and the autoplay syntax element and its value are set to "autoplay": false or "autoplay": 0.
Step a3, adding a loop playback syntax element (loop) to the target media description module, and setting the value of the loop playback syntax element according to whether the target media file needs to be played in a loop.
For example: if the target media file needs to be played in a loop, a loop playback syntax element (loop) is added to the target media description module, and the loop playback syntax element and its value are set to "loop": true or "loop": 1.
For another example: if the target media file does not need to be played in a loop, a loop playback syntax element (loop) is added to the target media description module, and the loop playback syntax element and its value are set to "loop": false or "loop": 0.
Step a4, adding an alternatives list ("alternatives": [ ]) to the target media description module.
Step a5, generating, according to the description information of each alternative version of the target media file, the alternative description module corresponding to that version, and adding the alternative description module corresponding to each alternative version of the target media file to the alternatives list ("alternatives": [ ]).
In some embodiments, the generating the selectable item description module corresponding to each selectable version of the target media file according to the description information of each selectable version of the target media file in the step a5 includes at least one of the following steps a51 to a 55:
Step a51, adding a media type syntax element (mimeType) to a first alternative description module corresponding to a first alternative version, and setting the value of the media type syntax element according to the encapsulation format of the first alternative version.
Wherein the first alternative version is any alternative version of the target media file. That is, the generation of the selectable description module may be performed by the embodiment of the present application for each selectable version of the target media file.
Step a52, adding a uniform resource identifier syntax element (URI) in the first alternative description module, and setting a value of the uniform resource identifier syntax element according to a Uniform Resource Identifier (URI) of the first alternative version.
For example: the uniform resource identifier of the first alternative version of the target media file is "www.example.com/avatarface/index", a uniform resource identifier syntax element (uri) is added to the first alternative description module, and the uniform resource identifier syntax element and its value are set to "uri": "www.example.com/avatarface/index".
Step a53, adding a track array ("tracks": [ ]) to the first alternative description module.
Step a54, adding a track index syntax element (track) to the track array ("tracks": [ ]), and setting the value of the track index syntax element according to the track information of the first alternative version.
Step a55, adding a codec parameter syntax element (codecs) to the track array ("tracks": [ ]), and setting the value of the codec parameter syntax element according to the codec type of the bitstream of the first alternative version.
For example, when the name of the target media file is "avatarThorax", the target media file needs to be played automatically and in a loop, and it includes two alternative versions, one with the uniform resource identifier "www.example.com/avatarface/index=0" and the other with the uniform resource identifier "www.example.com/avatarface/index=1", the target media description module generated for the target media file according to the above embodiment may be as follows:
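A sketch of such a target media description module, reconstructed from the description above rather than reproduced from the original listing; the placeholder values are explained in the note that follows.

```json
{
  "name": "avatarThorax",
  "autoplay": true,
  "loop": true,
  "alternatives": [
    {
      "mimeType": "TypeValue",
      "uri": "www.example.com/avatarface/index=0",
      "tracks": [ { "track": "trackIndex=1", "codecs": "CodecsValue" } ]
    },
    {
      "mimeType": "TypeValue2",
      "uri": "www.example.com/avatarface/index=1",
      "tracks": [ { "track": "trackIndex=2", "codecs": "CodecsValue2" } ]
    }
  ]
}
```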
It should be noted that "= TypeValue", "trackIndex =1", "CodecsValue", "= TypeValue2", "trackIndex =2", "CodecsValue2" in the above examples are representative rather than specific values, and the values of "mimeType", "track", "codes" need to be set according to the alternative versions of the corresponding media files and related standards, and specific values of "mimeType", "track", "codes" are not limited in the embodiments of the present application because the types of the respective alternative versions of the target media files are not limited in the embodiments of the present application.
In some embodiments, the method for generating a scene description file according to the embodiment of the present application further includes the following steps c and d:
And c, generating a target scene description module corresponding to the three-dimensional scene to be rendered according to the description information of the three-dimensional scene to be rendered.
And d, adding the target scene description module into a scene list ("scenes": [ ]) of the scene description file.
The generating a target scene description module corresponding to the three-dimensional scene to be rendered according to the description information of the three-dimensional scene to be rendered includes: and adding a node index list ('nodes': [ ]) in the target scene description module, and adding index values of the node description modules corresponding to each top-level node in the three-dimensional scene to be rendered in the node index list.
It should be noted that, in the embodiment of the present application, only the index values of the node description modules corresponding to the root (top-level) nodes of the three-dimensional scene to be rendered are added to the node index list, rather than the index values of the node description modules corresponding to all nodes; in particular, the index values of the node description modules corresponding to child nodes are not included.
For example: the three-dimensional scene to be rendered includes two nodes, one of which is a child node of the other; if the index value of the node description module corresponding to the parent node of the two is 0, the target scene description module corresponding to the three-dimensional scene to be rendered added to the scene description file may be as follows:
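A sketch of such a target scene description module, assuming only the node index list described above:

```json
"scenes": [
  {
    "nodes": [ 0 ]
  }
]
```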
In the above example, the three-dimensional scene to be rendered includes two nodes, and the index value of the node description module corresponding to the top-level node of the two is 0, so the index value 0 is added to the node index list ("nodes") of the target scene description module corresponding to the three-dimensional scene to be rendered.
In some embodiments, the method for generating a scene description file according to the embodiment of the present application further includes the following steps e and f:
and e, generating a target node description module corresponding to the target node according to the description information of the target node.
The target node is any node in the three-dimensional scene to be rendered.
And f, adding the target node description module into a node list ("nodes": [ ]) of the scene description file.
In some embodiments, step e (generating the target node description module corresponding to the target node according to the description information of the target node) includes at least one of the following steps e1 to e4:
And e1, adding a node name grammar element (name) in the target node description module, and setting the value of the node name grammar element according to the name of the target node.
For example: if the name of the target node is "avatarThorax", a node name syntax element (name) is added to the target node description module, and the node name syntax element and its value in the target node description module are set to "name": "avatarThorax".
Step e2, adding a child node index list ("children": [ ]) to the target node description module, and adding, to the child node index list, the index values of the node description modules corresponding to the child nodes mounted on the target node.
For example: if the target node has two child nodes mounted on it, the index value of the node description module corresponding to one child node is 1, and the index value of the node description module corresponding to the other child node is 2, the child node index list portion of the target node description module can be as follows:
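A sketch of the child node index list portion under these assumptions:

```json
"children": [ 1, 2 ]
```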
Step e3, adding a mesh index syntax element (mesh) to the target node description module, and setting the value of the mesh index syntax element (mesh) to the index value of the mesh description module corresponding to the three-dimensional mesh mounted on the target node.
For example: if the target node has a three-dimensional mesh mounted on it and the index value of the mesh description module corresponding to that three-dimensional mesh is 0, a mesh index syntax element (mesh) is added to the target node description module, and the mesh index syntax element and its value in the target node description module are set to "mesh": 0.
And e4, adding a position offset syntax element (translation) in the target node description module, and setting the value of the position offset syntax element according to the spatial position offset of the target node relative to the parent node.
For example, a node in the three-dimensional scene to be rendered is named as "avatarThorax", the node includes two child nodes, the index values of the node description modules corresponding to the two child nodes are 1 and 2, the index value of the grid description module corresponding to the three-dimensional grid mounted on the node is 0, and the spatial position offset of the node relative to the parent node is 0, the node description module generated according to the description information of the node may be as follows:
For example, a node in the three-dimensional scene to be rendered is named as "AVATARNECK", the node includes a child node, the index value of the node description module corresponding to the child node is 1, the index value of the grid description module corresponding to the three-dimensional grid mounted by the child node is 1, the offset of the child node relative to the parent node is (0.0,10.0,20.0), and the node description module generated according to the description information of the child node may be as follows:
in some embodiments, the method for generating a scene description file according to the embodiment of the present application further includes the following steps g and h:
And g, generating a target grid description module corresponding to the target three-dimensional grid according to the description information of the target three-dimensional grid.
The target three-dimensional grid is any three-dimensional grid in the three-dimensional scene to be rendered.
Because the three-dimensional grid in the scene description file is the next level of the node, the three-dimensional grid can be mounted under the node, and therefore the target three-dimensional grid can also be described as the three-dimensional grid mounted by any node in the three-dimensional scene to be rendered.
Step h adds the target mesh description module to the mesh list ("meshes": [ ]) of the scene description file.
In some embodiments, the step g (generating a target mesh description module corresponding to the target three-dimensional mesh according to the description information of the target three-dimensional mesh) includes the following steps g1 to g3:
and g1, adding a grid name grammar element (name) into the target grid description module, and setting the value of the grid name grammar element according to the name of the target three-dimensional grid.
For example: if the name of the target three-dimensional mesh is "AVATARFACE", a mesh name syntax element (name) is added to the target mesh description module, and the mesh name syntax element and its value in the target mesh description module are set to "name": "AVATARFACE".
Step g2, adding a position syntax element (position) to the attributes (attributes) of the primitives (primitives) of the target mesh description module, and setting the value of the position syntax element to the index value of the accessor description module corresponding to the accessor used to access the dynamic three-dimensional model of the digital person component corresponding to the target three-dimensional mesh.
Step g3, adding a mode syntax element (mode) to the primitives (primitives) of the target mesh description module, and setting the value of the mode syntax element according to the topology type of the target three-dimensional mesh.
For example, if the name of a certain three-dimensional mesh is "AVATARFACE", the index value of the accessor description module corresponding to the accessor used to access the dynamic three-dimensional model of the digital person component corresponding to the three-dimensional mesh is 0, and the topology type of the three-dimensional mesh is triangle patches, then the mesh description module corresponding to the three-dimensional mesh may be as follows:
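A sketch of such a mesh description module; writing the attribute as POSITION and encoding the triangle-patch topology as mode 4 follows common glTF practice and is an assumption here.

```json
{
  "name": "AVATARFACE",
  "primitives": [
    {
      "attributes": { "POSITION": 0 },
      "mode": 4
    }
  ]
}
```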
In some embodiments, the method for generating a scene description file according to the embodiment of the present application further includes the following step i and step j:
and step i, generating a target accessor description module corresponding to the target accessor according to the description information of the target accessor.
The target accessor is any accessor for realizing the rendering of the three-dimensional scene to be rendered.
Step j, adding the target accessor description module to an accessor list ("accessors": [ ]) of the scene description file.
In some embodiments, the step i (generating the target accessor description module corresponding to the target accessor according to the description information of the target accessor) includes at least one of the following steps i1 to i 8:
And step i1, adding a data type syntax element (componentType) in the target accessor description module, and setting the value of the data type syntax element according to the data type of the data accessed by the target accessor.
For example: if the data type of the data accessed by a certain accessor is 5126, the data type syntax element and the value thereof in the accessor description module corresponding to the accessor are set as follows: "componentType":5126.
And step i2, adding an accessor type syntax element (type) in the target accessor description module, and setting the value of the accessor type syntax element according to the type of the target accessor.
For example: if the type of the data accessed by a certain accessor is a two-dimensional vector, the accessor type syntax element (type) and its value in the accessor description module corresponding to the accessor are set as: "type": "VEC2".
And step i3, adding a data quantity syntax element (count) in the target accessor description module, and setting the value of the data quantity syntax element according to the quantity of data accessed by the target accessor.
For example: if the number of data items accessed by a certain accessor is 1000, the data quantity syntax element (count) and its value in the accessor description module corresponding to the accessor are set as: "count": 1000.
And i4, adding a target cache slice index syntax element (bufferView) into the target accessor description module, and setting the value of the target cache slice index syntax element according to the index value of the cache slice description module corresponding to the cache slice for caching the data accessed by the target accessor.
For example: if the buffer slice description module corresponding to the buffer slice used to buffer the data accessed by a certain accessor is the fourth buffer slice description module (index value 3) in the buffer slice list (bufferViews) of the scene description file, the target buffer slice index syntax element and its value in the accessor description module corresponding to the accessor are set as: "bufferView": 3.
Step i5, adding an MPEG time-varying accessor ("MPEG_accessor_timed": { }) to the extensions list ("extensions": { }) of the target accessor description module.
Step i6, adding a second buffer slice index syntax element (bufferView) to the MPEG time-varying accessor ("MPEG_accessor_timed": { }), and setting the value of the second buffer slice index syntax element according to the index value of the buffer slice description module corresponding to the buffer slice used to buffer the time-varying parameters of the target accessor.
For example: a buffer slice description module corresponding to a buffer slice for buffering a time-varying parameter of a certain accessor is a second buffer slice description module in a buffer slice list (bufferViews) of a scene description file, and a second buffer slice index syntax element and a value thereof in the accessor description module corresponding to the accessor are set as follows: "bufferView":1.
Step i7, adding a time-varying syntax element (immutable) to the MPEG time-varying accessor ("MPEG_accessor_timed": { }), and setting the value of the time-varying syntax element according to whether the values of the syntax elements within the target accessor change over time.
In some embodiments, setting the value of the time-varying syntax element according to whether the values of the syntax elements within the target accessor change over time comprises: when the values of the syntax elements within the target accessor do not change over time, the time-varying syntax element and its value in the MPEG time-varying accessor of the target accessor description module are set to: "immutable": true or "immutable": 1; when the values of the syntax elements within the target accessor change over time, the time-varying syntax element and its value in the MPEG time-varying accessor of the target accessor description module are set to: "immutable": false or "immutable": 0.
And step i8, adding an accessor name syntax element (name) in the target accessor description module, and setting the value of the accessor name syntax element according to the name of the target accessor.
For example, the data type of the data accessed by a certain accessor is 5126, the name of the accessor is "avatarThorax", the type of the accessor is VEC3, the number of data accessed by the accessor is 1828, the index value of the cache slice description module corresponding to the cache slice for caching the data accessed by the accessor is 0, the values of the syntax elements in the accessor may change with time, and the index value of the cache slice description module corresponding to the cache slice for caching the time-varying parameter of the accessor is 2; the accessor description module corresponding to the accessor may then be as follows:
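A sketch of such an accessor description module, assuming glTF 2.0 JSON syntax with the MPEG_accessor_timed extension and using the illustrative values above, may be:
    {
        "name": "avatarThorax",
        "componentType": 5126,
        "type": "VEC3",
        "count": 1828,
        "bufferView": 0,
        "extensions": {
            "MPEG_accessor_timed": {
                "bufferView": 2,
                "immutable": false
            }
        }
    }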
In some embodiments, the method for generating a scene description file according to the embodiment of the present application further includes the following steps k and l:
And step k, generating a target buffer description module corresponding to the target buffer according to the description information of the target buffer.
The target buffer is any buffer used for realizing the rendering of the three-dimensional scene to be rendered.
And step l, adding the target buffer description module into a buffer list ("buffers": [ ]) of the scene description file.
In some embodiments, step k (generating a target buffer description module corresponding to the target buffer according to the description information of the target buffer) includes at least one of the following steps k1 to k7:
And step k1, adding a first byte length syntax element (byteLength) in the target buffer description module, and setting the value of the first byte length syntax element according to the capacity of the target buffer.
For example, when the capacity of a certain buffer is 15000 bytes, the first byte length syntax element and its value in the buffer description module corresponding to the buffer are set as follows: "byteLength": 15000.
And step k2, adding an MPEG ring buffer ("MPEG_buffer_circular": { }) in the target buffer description module.
And step k3, adding a link quantity syntax element (count) into the MPEG ring buffer ("MPEG_buffer_circular": { }), and setting the value of the link quantity syntax element according to the number of storage links of the target buffer.
For example: the number of storage links of a certain buffer is 5, and the link quantity syntax element and its value in the MPEG ring buffer of the buffer description module corresponding to the buffer are set as follows: "count": 5.
And step k4, adding a media index syntax element (media) into the MPEG ring buffer, and setting the value of the media index syntax element according to the index value of the media description module corresponding to the media file to which the source data of the data buffered by the target buffer belongs.
For example: the index value of the media description module corresponding to the media file to which the source data of the data buffered in a certain buffer belongs is 0, and the media index syntax element and its value in the MPEG ring buffer of the buffer description module corresponding to the buffer are set as follows: "media": 0.
And step k5, adding a track index syntax element (tracks) into the MPEG ring buffer, and setting the value of the track index syntax element according to the track index value of the source data of the data buffered by the target buffer.
And step k6, adding a buffer name syntax element (name) into the buffer description module corresponding to the target buffer, and setting the value of the buffer name syntax element according to the name of the target buffer.
And k7, adding a uniform resource identifier syntax element (uri) into the target buffer description module, and setting the value of the uniform resource identifier syntax element according to at least one part of data used for reconstructing the dynamic three-dimensional model of the corresponding digital human component.
That is, data for reconstructing a dynamic three-dimensional model of a corresponding digital person component is directly added to the scene description file.
For example, if a certain buffer named "avatarFace" has a capacity of 108940 bytes, the number of storage links of the buffer is 3, and the index value of the media description module corresponding to the media file to which the source data of the data buffered in the buffer belongs is 1, the buffer description module corresponding to the buffer added in the buffer list of the scene description file may be as follows:
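A sketch of such a buffer description module, assuming glTF 2.0 JSON syntax with the MPEG_buffer_circular extension and using the illustrative values above, may be:
    {
        "name": "avatarFace",
        "byteLength": 108940,
        "extensions": {
            "MPEG_buffer_circular": {
                "count": 3,
                "media": 1
            }
        }
    }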
in some embodiments, the method for generating a scene description file further includes the following steps m and n:
And step m, generating a target cache slice description module corresponding to the target cache slice according to the description information of the target cache slice.
The target cache slice is any cache slice of a cache for realizing the rendering of the three-dimensional scene to be rendered.
And step n, adding the target cache slice description module into a cache slice list ("bufferViews": [ ]) of the scene description file.
In some embodiments, the step m (generating the target cache slice description module corresponding to the target cache slice according to the description information of the target cache slice) includes at least one of the following steps m1 to m4:
And m1, adding a buffer index syntax element (buffer) into the target buffer slice description module, and setting the value of the buffer index syntax element according to the index value of the buffer description module corresponding to the buffer to which the target buffer slice belongs.
For example: if the index value of the buffer description module corresponding to the buffer to which a certain cache slice belongs is 2, the buffer index syntax element and its value in the cache slice description module corresponding to the cache slice are set as follows: "buffer": 2.
And m2, adding a second byte length syntax element (byteLength) into a cache slice description module corresponding to the target cache slice, and setting the value of the second byte length syntax element (byteLength) according to the capacity of the target cache slice.
And m3, adding an offset syntax element (byteOffset) into the target cache slice description module, and setting the value of the offset syntax element according to the offset of the data cached by the target cache slice.
And m4, adding a cache slice name syntax element (name) in the target cache slice description module, and setting the value of the cache slice name syntax element according to the name of the target cache slice.
For example, the index value of the buffer description module corresponding to a certain buffer is 0, the capacity of the buffer is 43972 bytes, and the buffer includes three cache slices: the first cache slice is named "avatarThorax1" with a capacity of 21936 and an offset of 0; the second cache slice is named "avatarThorax2" with a capacity of 21936 and an offset of 21936; and the third cache slice is named "avatarThoraxHead" with a capacity of 100 and an offset of 43872. The portion of the cache slice list of the scene description file corresponding to the cache slices of the buffer may then be as follows:
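Assuming glTF 2.0 JSON syntax, the corresponding portion of the cache slice list may be sketched as:
    "bufferViews": [
        { "name": "avatarThorax1", "buffer": 0, "byteLength": 21936, "byteOffset": 0 },
        { "name": "avatarThorax2", "buffer": 0, "byteLength": 21936, "byteOffset": 21936 },
        { "name": "avatarThoraxHead", "buffer": 0, "byteLength": 100, "byteOffset": 43872 }
    ]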
In some embodiments, the method for generating a scene description file further includes:
A digital asset description module (asset) is added in the scene description file, a version syntax element (version) is added in the digital asset description module, and a value of the version syntax element is set according to version information of the scene description file.
For example: when the scene description file is written based on glTF2.0 version, the value of the version syntax element is set to 2.0.
By way of example, the digital asset description module added to the scene description file may be as follows:
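A minimal sketch, assuming the scene description file is written based on the glTF 2.0 version:
    "asset": {
        "version": "2.0"
    }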
In some embodiments, the method for generating a scene description file further includes:
An extension usage list (extensionsUsed) is added to the scene description file, and the MPEG top-level extensions to the glTF 2.0 scene description file that are used by the scene description file are added to the extension usage list.
Illustratively, the MPEG extensions used in the scene description file include: MPEG media (MPEG_media), MPEG ring buffer (MPEG_buffer_circular), MPEG time-varying accessor (MPEG_accessor_timed), and MPEG digital person component (MPEG_node_avatar); the extension usage list added in the scene description file may then be as follows:
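A sketch of such an extension usage list (the extension names follow the conventions used in this document):
    "extensionsUsed": [
        "MPEG_media",
        "MPEG_buffer_circular",
        "MPEG_accessor_timed",
        "MPEG_node_avatar"
    ]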
In some embodiments, the method for generating a scene description file further includes:
Adding a scene statement (scene) into the scene description file, and setting the value of the scene statement as an index value of a scene description module corresponding to the scene to be rendered.
For example, if the index value of the scene description module corresponding to the scene to be rendered is 0, the scene declaration added to the scene description file may be as follows:
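A minimal sketch of the added scene declaration:
    "scene": 0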
Some embodiments of the present application further provide a method for parsing a scene description file, as shown in fig. 8, where the method for parsing a scene description file includes steps S81 to S83 as follows:
S81, acquiring a digital person node description module corresponding to a node representing a target digital person in the three-dimensional scene to be rendered from a scene description file of the three-dimensional scene to be rendered.
For example, the digital person node description module obtained from the scene description file of the three-dimensional scene to be rendered carries the digital person node array in its extension list. The digital person node description module obtained from the scene description file may take either of two forms: in the first form, the digital person type syntax element carries the representation scheme of a digital person component together with the representation scheme of the target digital person; in the second form, the digital person type syntax element carries only the representation scheme of the target digital person, and the representation scheme of an individual digital person component is carried by a component type syntax element in the corresponding mapping array. A sketch of the second form is given below.
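A sketch of the second form, assuming glTF 2.0 JSON syntax with an MPEG_node_avatar extension whose field names follow the syntax elements described herein (type, isAvatar, mappings, path, node, subType); the node name and the node index values are illustrative only:
    {
        "name": "avatar",
        "children": [1, 2],
        "extensions": {
            "MPEG_node_avatar": {
                "type": "urn:mpeg:sd:2023:avatar",
                "isAvatar": true,
                "mappings": [
                    {
                        "path": "full_body/upper_body/thorax",
                        "node": 1
                    },
                    {
                        "path": "full_body/upper_body/head/face",
                        "node": 2,
                        "subType": "landMark"
                    }
                ]
            }
        }
    }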
S82, acquiring a digital person node array ("MPEG_node_avatar") in the digital person node description module.
As described in the above example, the digital person node array ("MPEG_node_avatar": { }) is obtained from the extension list of the digital person node description module; depending on the form of the module, the digital person node array carries the representation schemes either in the digital person type syntax element or in the mapping arrays of its mapping list.
S83, obtaining the representation scheme of each digital person component of the target digital person according to the digital person node array ("MPEG_node_avatar").
In some embodiments, the step S83 (obtaining the representation scheme of each digital person component of the target digital person according to the digital person node array) includes: obtaining the representation scheme of each digital person component of the target digital person according to the value of the digital person type syntax element in the digital person node array.
In some embodiments, the obtaining the representation of each digital person component of the target digital person from the value of the digital person type syntax element (type) in the digital person node array includes steps ⑴ and ⑵ as follows:
Step ⑴, determining a representation scheme of the target digital person and a representation scheme of at least one digital person component according to a value of a digital person type syntax element (type) in the digital person node array.
In the first form described above, the digital person type syntax element and its value are "type": "urn:mpeg:sd:2023:avatar:face-landmark"; according to the value of the digital person type syntax element (type) in the digital person node array, it may be determined that the representation scheme of the target digital person is an MPEG reference digital person and that the representation scheme of the face component of the target digital person is a facial semantic mark method.
Step ⑵, determining the representation scheme of the other digital person components except the at least one digital person component in the digital person components of the target digital person as the representation scheme of the target digital person.
As described in the above example, the representation scheme of the target digital person is an MPEG reference digital person, and therefore the representation scheme of the other digital person components of the target digital person except for the face component (the chest component, with the standardized hierarchy name full_body/upper_body/thorax) is determined as an MPEG reference digital person. It is thus determined that the target digital person includes two digital person components, the face component and the chest component, where the representation scheme of the face component is a facial semantic mark method and the representation scheme of the chest component is an MPEG reference digital person.
In some embodiments, the step S83 (obtaining the representation scheme of each digital person component of the target digital person according to the digital person node array) includes: obtaining the representation scheme of each digital person component of the target digital person according to the value of the digital person type syntax element (type) and the value of the component type syntax element (subType) in the digital person node array.
In some embodiments, obtaining the representation scheme of each digital person component of the target digital person according to the value of the digital person type syntax element and the value of the component type syntax element in the digital person node array comprises the following steps 1) to 5):
Step 1), determining the representation scheme of the target digital person according to the value of the digital person type syntax element in the digital person node array.
In the second form described above, the digital person type syntax element and its value are "type": "urn:mpeg:sd:2023:avatar"; according to the value of the digital person type syntax element (type) in the digital person node array, it may be determined that the representation scheme of the target digital person is an MPEG reference digital person.
Step 2), obtaining the mapping array corresponding to each digital person component of the target digital person from the mapping list ("mappings": [ ]) of the digital person node array.
In the above embodiment, the mapping arrays corresponding to the individual digital person components of the target digital person are obtained from the mapping list ("mappings": [ ]) of the digital person node array (see the mapping arrays in the sketch of the digital person node description module above).
Step 3), determining whether the mapping array corresponding to each digital person component of the target digital person contains the component type syntax element (subType).
As described in the above example, the target digital person includes two digital person components, the mapping array corresponding to the first digital person component does not include a component type syntax element, and the mapping array corresponding to the second digital person component includes a component type syntax element.
And 4) if the mapping array corresponding to the first digital person component of the target digital person contains the component type grammar element, acquiring a representation scheme of the first digital person component according to the value of the component type grammar element.
As described in the previous example, the mapping array corresponding to the second digital person component includes a component type syntax element, and the component type syntax element and its value are "subType": "landMark", so that the representation scheme of the face component of the target digital person is determined to be a facial semantic mark method according to the value of the component type syntax element.
And 5) if the mapping array corresponding to the second digital person component of the target digital person does not contain the component type grammar element, determining the representation scheme of the second digital person component as the representation scheme of the target digital person.
In the above example, the mapping array corresponding to the first digital person component does not include a component type syntax element, so that the representation scheme of the chest component of the target digital person is determined to be the representation scheme of the target digital person, namely an MPEG reference digital person. It is thus determined that the target digital person includes two digital person components, namely a face component and a chest component, where the representation scheme of the face component is a facial semantic mark method and the representation scheme of the chest component is an MPEG reference digital person.
According to the method for parsing the scene description file provided by the embodiment of the present application, a digital person node description module corresponding to a node representing a target digital person in a three-dimensional scene to be rendered is first obtained from the scene description file of the three-dimensional scene to be rendered; then a digital person node array in the digital person node description module is obtained, and the representation scheme of each digital person component of the target digital person is obtained according to the digital person node array. The method for parsing the scene description file provided by the embodiment of the present application can obtain the representation schemes of all the digital person components of the target digital person from the digital person node array of the scene description file. Therefore, even if different digital person components of the target digital person adopt different representation schemes, the method can accurately obtain the representation scheme of each component of the target digital person through the scene description file, thereby enabling the scene description framework to support flexible combination of digital person components with different representation schemes.
In some embodiments, the method for parsing a scene description file provided by the embodiment of the present application further includes:
And acquiring the hierarchical standard name of the digital person body part represented by each digital person component of the target digital person according to the value of the part name syntax element (path) in the mapping array corresponding to each digital person component of the target digital person.
For example: the part name syntax element and its value in the mapping array corresponding to a certain digital person component of the target digital person are "path": "full_body/upper_body/thorax", so it may be determined that the hierarchical standard name of the digital person body part represented by the chest component of the target digital person is "full_body/upper_body/thorax".
For example: the part name syntax element and its value in the mapping array corresponding to a certain digital person component of the target digital person are "path": "full_body/upper_body/head/face", so it may be determined that the hierarchical standard name of the digital person body part represented by the face component of the target digital person is "full_body/upper_body/head/face".
In some embodiments, the method for parsing a scene description file provided by the embodiment of the present application further includes:
And acquiring the index value of the node description module corresponding to each digital person component of the target digital person according to the value of the node index syntax element (node) in the mapping array corresponding to each digital person component of the target digital person.
For example: if the node index syntax element and its value in the mapping array corresponding to a certain digital person component of the target digital person are "node": 2, it may be determined that the index value of the node description module corresponding to the digital person component is 2 (the third node description module in the node list of the scene description file).
In some embodiments, the method for parsing a scene description file provided by the embodiment of the present application further includes:
And determining whether the target digital person is an active digital person according to the value of the active identification syntax element (isAvatar) in the digital person node array.
In some embodiments, determining whether the target digital person is an active digital person according to the value of the active identification syntax element in the digital person node array of the target node description module includes: if the value of the active identification syntax element in the digital person node array of the target node description module is 1 or true, determining that the target digital person is an active digital person; and if the value of the active identification syntax element in the digital person node array of the target node description module is 0 or false, determining that the target digital person is an inactive digital person.
In some embodiments, the method for parsing a scene description file provided by the embodiment of the present application further includes the following steps 1 and 2:
Step 1, a target media description module is obtained from the media list ("media": [ ]) of the MPEG media ("MPEG_media": { }) of the scene description file.
The target media description module is any media description module in a media list of MPEG media of the scene description file.
And step 2, acquiring description information of a target media file corresponding to the target media description module according to the target media description module.
In some embodiments, the obtaining, according to the target media description module, the description information of the target media file may include at least one of the following steps 21 to 24:
Step 21, obtaining the name of the target media file according to the value of the media name syntax element (name) in the target media description module.
For example: the media name syntax element and its value in the target media description module are "name": "avatarThorax", and the name of the target media file may then be determined as: avatarThorax.
Step 22, determining whether the target media file needs to be automatically played according to the value of an automatic play syntax element (autoplay) in the target media description module.
In some embodiments, determining whether the target media file needs to be automatically played according to the value of the automatic play syntax element (autoplay) in the target media description module includes: if the automatic play syntax element (autoplay) in the target media description module and its value are "autoplay": true or "autoplay": 1, determining that the target media file needs to be automatically played; if the automatic play syntax element (autoplay) in the target media description module and its value are "autoplay": false or "autoplay": 0, determining that the target media file does not need to be automatically played.
Step 23, determining whether the target media file needs to be played circularly according to the value of the loop play syntax element (loop) in the target media description module.
In some embodiments, determining whether the target media file needs to be played circularly according to the value of the loop play syntax element (loop) in the target media description module includes: if the loop play syntax element (loop) in the target media description module and its value are "loop": true or "loop": 1, determining that the target media file needs to be played circularly; and if the loop play syntax element (loop) in the target media description module and its value are "loop": false or "loop": 0, determining that the target media file does not need to be played circularly.
Step 24, obtaining the description information of each alternative version of the target media file according to each alternative description module in the alternative list ("alternatives": [ ]) of the target media description module.
In some embodiments, the step 24 (obtaining the description information of each alternative version of the target media file according to each alternative description module in the alternative list of the target media description module) may include at least one of the following steps 241 to 244:
Step 241, obtaining a package format of a first alternative version corresponding to the first alternative description module according to the value of the media type syntax element (mimeType) in the first alternative description module.
Wherein the first alternative description module is any alternative description module in the alternative list. Because the first alternative description module can be any alternative description module in the alternative list, the description information of each alternative version of the target media file can be obtained through the embodiment of the present application.
For example: the media type syntax element (mimeType) in the alternative description module corresponding to a certain alternative version of the target media file has the value "mimeType": "application/MP4", and the encapsulation format of that alternative version of the target media file may be determined to be MP4.
Step 242, obtaining the uniform resource identifier (URI) of the first alternative version according to the value of the uniform resource identifier syntax element (uri) in the first alternative description module.
For example: the uniform resource identifier syntax element (uri) and its value in a certain alternative description module in the alternative list ("alternatives": [ ]) of the target media description module are "uri": "https://www.example.com/avatarthorax", and the uniform resource identifier (access address) of the alternative version of the target media file corresponding to the alternative description module may be determined as "https://www.example.com/avatarthorax".
Step 243, obtaining the track information of the first alternative version according to the value of the first track index syntax element (track) in the track array ("tracks": [ ]) of the first alternative description module.
Step 244, obtaining the decoder type of the code stream of the first alternative version according to the value of the codec parameter syntax element (codecs) in the track array ("tracks": [ ]) of the first alternative description module.
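A sketch of such a target media description module within the MPEG media extension, assuming the syntax elements described above; the name and uri follow the earlier examples, while the mimeType, track and codecs values are purely illustrative:
    {
        "name": "avatarThorax",
        "autoplay": true,
        "loop": true,
        "alternatives": [
            {
                "mimeType": "application/mp4",
                "uri": "https://www.example.com/avatarthorax",
                "tracks": [
                    {
                        "track": "1",
                        "codecs": "avc1.42E01E"
                    }
                ]
            }
        ]
    }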
In some embodiments, the method for parsing a scene description file provided by the embodiment of the present application includes the following steps 3 and 4:
And step 3, acquiring a target scene description module corresponding to the three-dimensional scene to be rendered from a scene list ('scenes': [ ]) of the scene description file.
And step 4, acquiring the description information of the three-dimensional scene to be rendered according to the target scene description module.
In some embodiments, the step 4 (obtaining the description information of the three-dimensional scene to be rendered according to the target scene description module) includes: determining the index value of the node description module corresponding to each top-level node in the three-dimensional scene to be rendered according to the index values declared in the node index list (nodes) of the target scene description module.
For example: the node index list of the target scene description module and the index value declared by it are "nodes": [0]; it may be determined that the three-dimensional scene to be rendered includes only one top-level node, and the node description module corresponding to the top-level node is the first node description module in the node list of the scene description file.
For another example: the node index list of the target scene description module and the index values declared by it are "nodes": [0, 2]; it may be determined that the three-dimensional scene to be rendered includes two top-level nodes, and the node description modules corresponding to the two top-level nodes are respectively the first node description module and the third node description module in the node list of the scene description file.
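For the second example, a sketch of the corresponding entry in the scene list, assuming glTF 2.0 JSON syntax, may be:
    "scenes": [
        {
            "nodes": [0, 2]
        }
    ]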
In some embodiments, the method for parsing a scene description file provided by the embodiment of the present application includes the following steps 5 and 6:
And step 5, acquiring a target node description module from a node list ("nodes": [ ]) of the scene description file.
The target node description module is any node description module in the node list.
And step 6, acquiring description information of the target node corresponding to the target node description module according to the target node description module.
In some embodiments, the step 6 (obtaining, according to the target node description module, the description information of the target node corresponding to the target node description module) may include at least one of the following steps 61 to 64:
Step 61, obtaining the name of the target node according to the value of the node name syntax element (name) in the target node description module.
For example: the node name syntax element and its value in the target node description module are "name": "avatarThorax", and the name of the target node may be obtained according to the value of the node name syntax element in the target node description module as: avatarThorax.
And step 62, obtaining the index values of the node description modules corresponding to all the child nodes mounted on the target node according to the index values declared in the child node index list ("children": [ ]) of the target node description module.
For example: the child node index list and the index values declared therein in the target node description module are "children": [1, 2]; according to the index values declared in the child node index list of the target node description module, it may be determined that two child nodes are mounted on the target node, and the two child nodes mounted on the target node are the nodes corresponding to the second node description module and the third node description module in the node list of the scene description file respectively.
And step 63, acquiring the index value of the grid description module corresponding to the three-dimensional grid mounted by the target node according to the index value declared in the grid index syntax element (mesh) of the target node description module.
For example: if the grid index syntax element and its value in the target node description module are "mesh": 0, then according to the index value declared in the grid index syntax element of the target node description module, the three-dimensional grid mounted by the target node is obtained as the three-dimensional grid corresponding to the first grid description module in the grid list ("meshes": [ ]) of the scene description file.
Step 64, according to the value of the position offset syntax element (translation) in the target node description module, the spatial position offset of the target node relative to the parent node is obtained.
For example, if the position offset syntax element and its value in a node description module are "translation": [0.0, 0.0, 20.0], the spatial position offset of the node relative to its parent node, [0.0, 0.0, 20.0], may be obtained according to the value of the position offset syntax element in the node description module.
It should be noted that what is directly obtained from the value of the position offset syntax element in the node description module is the spatial position offset of the node relative to its parent node, which is not necessarily the spatial position offset of the node relative to the node representing the target digital person; the latter is the superposition of the spatial position offsets of the nodes on the path from the node to the node representing the target digital person.
For example: node A is a child node of the node representing the target digital person, and node B is a child node of node A. The position offset syntax element and its value in the node description module corresponding to node A are "translation": [0.0, 10.0, 25.0], and the position offset syntax element and its value in the node description module corresponding to node B are "translation": [0.0, 0.0, 20.0]. The offset of node A relative to the node representing the target digital person is [0.0, 10.0, 25.0], the offset of node B relative to node A is [0.0, 0.0, 20.0], and the position offsets are superimposed to obtain the offset of node B relative to the node representing the target digital person, [0.0, 10.0, 45.0].
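A sketch of such a node hierarchy, assuming glTF 2.0 JSON syntax; the node names, the mesh index and the child node index values are illustrative only, and the MPEG_node_avatar extension of the node representing the target digital person is omitted for brevity:
    "nodes": [
        {
            "name": "avatar",
            "children": [1]
        },
        {
            "name": "nodeA",
            "children": [2],
            "translation": [0.0, 10.0, 25.0]
        },
        {
            "name": "nodeB",
            "mesh": 0,
            "translation": [0.0, 0.0, 20.0]
        }
    ]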
In some embodiments, the method for parsing a scene description file provided by the embodiment of the present application further includes:
And step 7, acquiring a target grid description module from the grid list ("meshes": [ ]) of the scene description file.
Wherein the target grid description module is any one of the grid description modules in the grid list ("meshes": [ ]).
And step 8, acquiring description information of the target three-dimensional grid corresponding to the target grid description module according to the target grid description module.
In some embodiments, step 8 (obtaining, according to the target mesh description module, description information of the target three-dimensional mesh corresponding to the target mesh description module) includes at least one of the following steps 81 to 83:
Step 81, acquiring the name of the target three-dimensional grid according to the value of the grid name syntax element (name) in the target grid description module.
For example: the grid name syntax element and its value in a certain grid description module of the scene description file are "name": "avatarNeck"; the name of the three-dimensional grid corresponding to the grid description module may therefore be obtained as: avatarNeck.
Step 82, obtaining the index value of the accessor description module corresponding to the accessor for accessing the dynamic three-dimensional model of the digital person component corresponding to the target three-dimensional grid according to the value of the position syntax element (position) in the attributes (attributes) of the primitive (primitives) of the target grid description module.
For example: if the value of the position syntax element in the attributes of the primitive of a certain grid description module is 2, it is determined, according to that value, that the accessor for accessing the dynamic three-dimensional model of the digital person component corresponding to the three-dimensional grid corresponding to the grid description module is the accessor corresponding to the third accessor description module in the scene description file.
Step 83, obtaining the type of the topology structure of the target three-dimensional grid according to the value of the mode syntax element (mode) in the primitive (primitives) of the target grid description module.
For example: if the mode syntax element and its value in the primitive of a certain grid description module of the scene description file are "mode": 0, the type of the topology structure of the three-dimensional grid corresponding to the grid description module is obtained, according to the value of the mode syntax element in the primitive of the grid description module, as scattered points.
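A sketch of such a grid description module, assuming glTF 2.0 JSON syntax (in glTF the position attribute is conventionally written as POSITION); the accessor index value 2 corresponds to the third accessor description module, and the mode value 0 corresponds to scattered points:
    {
        "name": "avatarNeck",
        "primitives": [
            {
                "attributes": {
                    "POSITION": 2
                },
                "mode": 0
            }
        ]
    }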
In some embodiments, the method for parsing a scene description file provided by the embodiment of the present application further includes the following steps 9 and 10:
And 9, acquiring a target accessor description module from an accessor list ("accessors": [ ]) of the scene description file.
Wherein the target accessor description module is any accessor description module in an accessor list ("accessors": [ ]) of the scene description file.
And step 10, acquiring description information of the target accessor corresponding to the target accessor description module according to the target accessor description module.
In some embodiments, step 10 (obtaining, according to the target accessor description module, description information of the target accessor corresponding to the target accessor description module) includes at least one of the following steps 101 to 108:
And step 101, acquiring the data type of the data accessed by the target accessor according to the value of the data type syntax element (componentType) in the target accessor description module.
For example: the data type syntax element in a certain accessor description module and its value are: "componentType":5126, it may be determined that the data type of the data accessed by the accessor corresponding to the accessor description module is a 32-bit floating point number (float).
Step 102, determining the type of the target accessor according to the value of an accessor type syntax element (type) in the target accessor description module.
For example: the accessor type syntax element and its value in a certain accessor description module are "type": "VEC2"; the type of the accessor corresponding to the accessor description module may then be determined to be a two-dimensional vector.
Step 103, determining the amount of data accessed by the target accessor according to the value of the data amount syntax element (count) in the target accessor description module.
For example: the data quantity syntax element in a certain accessor description module and its value are: "count":1828, it may be determined that the number of data accessed by the accessor corresponding to the accessor description module is 1828.
And 104, determining an index value of a cache slice description module corresponding to a cache slice for caching the data accessed by the target accessor according to the value of a target cache slice index syntax element (bufferView) in the target accessor description module.
For example: the target cache slice index syntax element of a certain accessor description module has a value of "bufferView":1, it may be determined that the data accessed by the accessor corresponding to the accessor description module is cached in the cache slice corresponding to the second cache slice description module in the cache slice list ("bufferViews": [ ]).
Step 105, determining whether the target accessor is a time-varying accessor based on the MPEG extension according to whether the target accessor description module contains an MPEG time-varying accessor ("MPEG_accessor_timed": { }).
In some embodiments, determining whether the target accessor is a time-varying accessor based on the MPEG extension according to whether an MPEG time-varying accessor ("MPEG_accessor_timed": { }) is contained in the target accessor description module includes: if the target accessor description module contains the MPEG time-varying accessor, determining that the target accessor is a time-varying accessor based on the MPEG extension; and if the target accessor description module does not contain the MPEG time-varying accessor, determining that the target accessor is not a time-varying accessor based on the MPEG extension.
Step 106, determining the cache slice description module corresponding to the cache slice for caching the time-varying parameter of the target accessor according to the value of the second cache slice index syntax element (bufferView) in the MPEG time-varying accessor ("MPEG_accessor_timed": { }) of the target accessor description module.
For example: the second cache slice index syntax element and its value in the MPEG time-varying accessor of a certain accessor description module are "bufferView": 3; it may be determined that the time-varying parameter of the accessor corresponding to the accessor description module is cached in the cache slice corresponding to the fourth cache slice description module in the cache slice list ("bufferViews": [ ]).
Step 107, determining whether the values of the syntax elements in the target accessor change with time according to the value of the time-varying syntax element (immutable) in the MPEG time-varying accessor ("MPEG_accessor_timed": { }) of the target accessor description module.
In some embodiments, determining whether the values of the syntax elements in the target accessor change with time according to the value of the time-varying syntax element (immutable) in the MPEG time-varying accessor of the target accessor description module includes: if the time-varying syntax element and its value in the MPEG time-varying accessor of the target accessor description module are "immutable": true or "immutable": 1, determining that the values of the syntax elements in the target accessor do not change with time; and if the time-varying syntax element and its value in the MPEG time-varying accessor of the target accessor description module are "immutable": false or "immutable": 0, determining that the values of the syntax elements in the target accessor change with time.
Step 108, determining the name of the target accessor according to the value of the accessor name syntax element (name) in the target accessor description module.
In some embodiments, the method for parsing a scene description file provided by the embodiment of the present application further includes the following steps 11 and 12:
and step 11, acquiring a target buffer description module from a buffer list ("buffers": [ ]) of the scene description file.
Wherein the target buffer description module is any buffer description module in the buffer list ("buffers": [ ]).
And step 12, acquiring description information of a target buffer corresponding to the target buffer description module according to the target buffer description module.
In some embodiments, the step 12 (obtaining, according to the target buffer description module, the description information of the target buffer corresponding to the target buffer description module) includes at least one of the following steps 121 to 127:
step 121, obtaining the name of the target buffer according to the value of the buffer name syntax element (name) in the target buffer description module.
Step 122, obtaining the capacity of the target buffer corresponding to the target buffer description module according to the value of the first byte length syntax element (byteLength) in the target buffer description module.
The target buffer description module is any buffer description module obtained from a buffer list of the scene description file.
For example: the first byte length syntax element and its value in a certain buffer description module are "byteLength": 43972; it may be determined that the capacity of the buffer corresponding to the buffer description module is 43972 bytes.
Step 123, determining whether the target buffer is a ring buffer based on the MPEG extension according to whether the target buffer description module contains an MPEG ring buffer ("MPEG_buffer_circular": { }).
In some embodiments, determining whether the target buffer is a ring buffer based on the MPEG extension according to whether an MPEG ring buffer is contained in the target buffer description module includes: if the target buffer description module contains the MPEG ring buffer, determining that the target buffer is a ring buffer based on the MPEG extension; and if the target buffer description module does not contain the MPEG ring buffer, determining that the target buffer is not a ring buffer based on the MPEG extension.
Step 124, obtaining the number of storage links of the target buffer according to the value of the link quantity syntax element (count) in the MPEG ring buffer ("MPEG_buffer_circular": { }) of the target buffer description module.
For example: the link quantity syntax element and its value in the MPEG ring buffer of a certain buffer description module are "count": 3; it may be determined that the buffer corresponding to the buffer description module includes 3 storage links.
Step 125, obtaining the index value of the media description module corresponding to the media file to which the source data of the data cached in the target buffer belongs according to the value of the media index syntax element (media) in the MPEG ring buffer of the target buffer description module.
Step 126, obtaining the track index value of the source data of the data buffered in the target buffer according to the value of the second track index syntax element (tracks) in the MPEG ring buffer ("MPEG_buffer_circular": { }) of the target buffer description module.
And step 127, acquiring data for reconstructing a dynamic three-dimensional model of the corresponding digital person component according to the value of the uniform resource identifier syntax element (uri) in the target buffer description module.
In some embodiments, the method for parsing a scene description file according to the foregoing embodiments further includes the following steps 13 and 14:
And step 13, acquiring a target cache slice description module from a cache slice list ("bufferViews": [ ]) of the scene description file.
The target cache slice description module is any cache slice description module in the cache slice list.
And step 14, acquiring description information of the target cache slice corresponding to the target cache slice description module according to the target cache slice description module.
In some embodiments, the step 14 (obtaining, according to the target cache slice description module, the description information of the target cache slice corresponding to the target cache slice description module) includes at least one of the following steps 141 to 143:
Step 141, obtaining the capacity of the target cache slice corresponding to the target cache slice description module according to the value of the second byte length syntax element (byteLength) in the target cache slice description module.
For example: the second byte length syntax element and its value in a certain cache slice description module are "byteLength": 21936; the capacity of the cache slice corresponding to the cache slice description module may be determined to be 21936 bytes.
And 142, acquiring the offset of the data cached by the target cache slice according to the value of the offset syntax element (byteOffset) in the target cache slice description module.
For example: the offset syntax element and its value in a certain cache slice description module are "byteOffset": 0; the offset of the data cached by the cache slice corresponding to the cache slice description module may be determined to be 0.
Step 143, obtaining the name of the target cache slice according to the value of the cache slice name syntax element (name) in the target cache slice description module.
In some embodiments, the method for parsing a scene description file further includes: and determining the version number of the scene description file according to a version syntax element (version) and a value thereof in a digital asset description module (asset) in the scene description file.
For example: the version syntax element and its value in the digital asset description module (asset) of the scene description file are "version": "2.0"; it may then be determined that the scene description file is written based on the glTF 2.0 version, i.e., the version of the scene description file is the reference version of the scene description standard.
In some embodiments, the method for parsing a scene description file further includes: acquiring the extension items used by the scene description file according to the extension usage list (extensionsUsed) in the scene description file.
For example: according to the extension usage list (extensionsUsed) of the scene description file, it may be obtained that the extension items used by the scene description file include: MPEG media (MPEG_media), MPEG ring buffer (MPEG_buffer_circular), MPEG time-varying accessor (MPEG_accessor_timed), and digital person node array (MPEG_node_avatar).
Some embodiments of the present application further provide a three-dimensional scene rendering method, where the execution body of the three-dimensional scene rendering method is a display engine in an immersive media description framework; referring to fig. 9, the three-dimensional scene rendering method includes the following steps S91 to S94:
s91, acquiring a scene description file of the three-dimensional scene to be rendered.
In some embodiments, an implementation of obtaining a scene description file of a three-dimensional scene to be rendered includes: and sending request information for requesting the scene description file of the three-dimensional scene to be rendered to a media resource server, and receiving a request response carrying the scene description file of the three-dimensional scene to be rendered, which is sent by the media resource server.
In other embodiments, the implementation manner of obtaining the scene description file of the three-dimensional scene to be rendered includes: and reading the scene description file of the three-dimensional scene to be rendered from the appointed storage space.
S92, obtaining the representation scheme of each digital person component of the target digital person in the three-dimensional scene to be rendered according to the scene description file.
In some embodiments, the step S92 (obtaining, according to the scene description file, a representation scheme of each digital person component of the target digital person in the three-dimensional scene to be rendered) includes the following steps 921 to 923:
and step 921, acquiring a digital person node description module corresponding to the node representing the target digital person from the scene description file.
Step 922, obtaining the digital person node array ("MPEG_node_avatar": { }) in the digital person node description module.
Step 923, obtaining the representation scheme of each digital person component of the target digital person according to the digital person node array ("MPEG_node_avatar": { }).
The implementation manner of step 923 (obtaining the representation scheme of each digital person component of the target digital person according to the digital person node array) may refer to the implementation manner of step S83 in the embodiment shown in fig. 8, and is not described in detail here to avoid redundancy.
S93, sending the representation scheme of each digital person component of the target digital person to a media access function.
After the display engine sends the representation scheme of each digital person component of the target digital person to the media access function, the media access function can reconstruct a dynamic three-dimensional model corresponding to each digital person component of the target digital person according to the representation scheme of each digital person component of the target digital person, and write the dynamic three-dimensional model corresponding to each digital person component of the target digital person into a buffer corresponding to each digital person component of the target digital person.
In some embodiments, sending the representation of the individual digital person components of the target digital person to a media access function includes: and sending the representation scheme of each digital person component of the target digital person to the media access function through the media access function API.
S94, reading dynamic three-dimensional models corresponding to all the digital person components of the target digital person from the buffer corresponding to all the digital person components of the target digital person, and rendering the three-dimensional scene to be rendered based on the dynamic three-dimensional models corresponding to all the digital person components of the target digital person.
In some embodiments, the step S94 (reading the dynamic three-dimensional model corresponding to each digital person component of the target digital person from the buffer corresponding to each digital person component of the target digital person) includes the following steps 941 and 942:
And 941, acquiring description information of accessors for accessing the dynamic three-dimensional model corresponding to each digital person component of the target digital person according to the scene description file.
In some embodiments, step 941 (obtaining, from the scene description file, description information of an accessor for accessing the dynamic three-dimensional model corresponding to each digital person component of the target digital person) includes:
step 9411, obtaining the grid description module corresponding to each digital person component of the target digital person from the grid list ("meshes": [ ]) of the scene description file.
Step 9412, obtaining an index value of a position coordinate syntax element (position) statement in a grid description module corresponding to each digital person component of the target digital person;
Step 9413, according to the index value stated by the position coordinate grammar element in the grid description module corresponding to each digital person component of the target digital person, obtaining the accessor description module corresponding to each digital person component of the target digital person from the accessor list ("accessors": [ ]) of the scene description file.
Step 9414, according to the accessor description module corresponding to each digital person component of the target digital person, the description information of the accessor for accessing the dynamic three-dimensional model corresponding to each digital person component of the target digital person is obtained.
In some embodiments, the description information of the accessor may include at least one of:
the name of the accessor, the data type of the data accessed by the accessor, the type of the accessor, the number of the data accessed by the accessor, whether the accessor is a time-varying accessor based on MPEG extension, the index value of a cache slice description module corresponding to a cache slice storing the data accessed by the accessor, the index value of a cache slice description module corresponding to a cache slice storing a time-varying parameter of the accessor, and whether the accessor parameter changes with time.
Step 942, reading the dynamic three-dimensional model corresponding to each digital person component of the target digital person from the buffer corresponding to each digital person component of the target digital person according to the description information of the accessor for accessing the dynamic three-dimensional model corresponding to each digital person component of the target digital person.
According to the three-dimensional scene rendering method provided by the embodiment of the present application, a scene description file of a three-dimensional scene to be rendered is first obtained; then, the representation scheme of each digital person component of the target digital person in the three-dimensional scene to be rendered is obtained according to the scene description file, and the representation scheme of each digital person component of the target digital person is sent to the media access function; finally, the dynamic three-dimensional model corresponding to each digital person component of the target digital person is read from the buffer corresponding to each digital person component of the target digital person, and the three-dimensional scene to be rendered is rendered based on the dynamic three-dimensional model corresponding to each digital person component of the target digital person. The display engine can obtain the representation scheme of each digital person component of the target digital person in the three-dimensional scene to be rendered according to the scene description file and send it to the media access function; the media access function can reconstruct the dynamic three-dimensional model corresponding to each digital person component according to its representation scheme and write the reconstructed model into the buffer corresponding to that digital person component; and the display engine can then read the dynamic three-dimensional model corresponding to each digital person component from the corresponding buffer and render the three-dimensional scene to be rendered based on these models.
Because the display engine needs to read the dynamic three-dimensional model corresponding to each digital person component of the target digital person from the buffer corresponding to that component, before doing so the display engine further obtains, according to the scene description file, the description information of the accessor, the description information of the buffer and the description information of the buffer slice corresponding to each digital person component of the target digital person, and sends this description information to the media access function, so that the media access function can write the dynamic three-dimensional model corresponding to each digital person component into the corresponding buffer.
In some embodiments, the description information of the buffer includes at least one of:
the name of the buffer, the capacity of the buffer, whether the buffer is a circular buffer based on the MPEG extension, the number of storage links of the buffer, the index value of the media description module corresponding to the media file to which the source data of the data buffered by the buffer belongs, and the track index value of the source data of the data buffered by the buffer.
In some embodiments, the description information of the buffer slice may include at least one of:
the name of the buffer slice, the index value of the buffer description module corresponding to the buffer to which the buffer slice belongs, the capacity of the buffer slice, and the offset of the data buffered by the buffer slice.
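For illustration, the fragment below sketches, as Python dictionaries mirroring the JSON of the scene description file, a buffer description module and a buffer slice description module carrying the description information listed above. The extension name MPEG_buffer_circular and the field names count, media and tracks follow the usual MPEG scene description conventions; the concrete values are assumptions.

```python
import json

# Buffer description module: capacity, circular-buffer extension, media/track indices.
buffer_desc = {
    "name": "face_model_buffer",       # name of the buffer
    "byteLength": 65536,               # capacity of the buffer
    "extensions": {
        "MPEG_buffer_circular": {      # marks an MPEG-extension-based circular buffer
            "count": 4,                # number of storage links (frames) of the buffer
            "media": 0,                # index of the media description module of the source data
            "tracks": [0],             # track index of the source data
        }
    },
}

# Buffer slice (buffer view) description module for data stored in that buffer.
buffer_view_desc = {
    "name": "face_positions_slice",    # name of the buffer slice
    "buffer": 0,                       # index of the buffer description module it belongs to
    "byteOffset": 0,                   # offset of the data buffered by the buffer slice
    "byteLength": 49152,               # capacity of the buffer slice
}

print(json.dumps({"buffers": [buffer_desc], "bufferViews": [buffer_view_desc]}, indent=2))
```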
In some embodiments, sending, to the media access function, the description information of the accessor corresponding to each digital person component of the target digital person, the description information of the buffer corresponding to each digital person component of the target digital person, and the description information of the buffer slice of the buffer corresponding to each digital person component of the target digital person includes:
sending, through a media access function API, the description information of the accessors corresponding to the digital person components of the target digital person, the description information of the buffers corresponding to the digital person components of the target digital person, and the description information of the buffer slices of those buffers to the media access function.
Because the display engine needs to read the dynamic three-dimensional model corresponding to each digital person component of the target digital person from the buffer corresponding to that component, in some embodiments, before doing so the display engine further obtains, according to the scene description file, the description information of the buffer and the description information of the buffer slice of the buffer corresponding to each digital person component of the target digital person, and sends this description information to the buffer management module, so that the buffer management module creates the buffer corresponding to each digital person component of the target digital person according to the description information of that buffer and of its buffer slices.
In some embodiments, the sending, to the cache management module, the description information of the cache corresponding to each digital person component of the target digital person and the description information of the cache slice of the cache corresponding to each digital person component of the target digital person includes: and sending the description information of the buffer corresponding to each digital person component of the target digital person and the description information of the buffer slice of the buffer corresponding to each digital person component of the target digital person to a buffer management module through a buffer API.
Some embodiments of the present application further provide a method for processing scene data of a three-dimensional scene, which is performed by the media access function in the immersive media description framework. Referring to fig. 10, the method for processing scene data of a three-dimensional scene includes the following steps:
S101, receiving a representation scheme of each digital person component of the target digital person sent by the display engine.
Wherein the target digital person is any digital person in the three-dimensional scene to be rendered.
In some embodiments, receiving a representation of individual digital person components of a target digital person sent by a display engine includes: the representation of individual digital person components of the target digital person sent by the display engine is received through the media access function API.
S102, constructing a dynamic three-dimensional model of each digital person component of the target digital person according to the representation scheme of each digital person component of the target digital person and the reconstruction data of each digital person component of the target digital person.
The reconstruction data of any digital person component is the data used to reconstruct the dynamic three-dimensional model corresponding to that digital person component.
In some embodiments, the step S102 (constructing a dynamic three-dimensional model of each digital person component of the target digital person according to the representation scheme of each digital person component of the target digital person and the reconstruction data of each digital person component of the target digital person) includes: creating a pipeline corresponding to a third digital person component of the target digital person according to the representation scheme of the third digital person component; inputting the reconstruction data of the third digital person component into the pipeline corresponding to the third digital person component; and acquiring the dynamic three-dimensional model of the third digital person component output by the pipeline corresponding to the third digital person component.
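A minimal sketch, under assumed names, of how the media access function might create and run the pipeline corresponding to the third digital person component according to its representation scheme. The URNs and the reconstruction stages are hypothetical placeholders and are not defined by the present application.

```python
from typing import Callable, Dict, List

def decode_blendshapes(data): return data    # placeholder reconstruction stages
def apply_blendshapes(data): return data
def decode_skeleton(data): return data
def skin_mesh(data): return data

# Hypothetical mapping from a representation-scheme URN to the stages of its pipeline.
PIPELINE_FACTORIES: Dict[str, List[Callable]] = {
    "urn:example:rep:mesh-blendshape": [decode_blendshapes, apply_blendshapes],
    "urn:example:rep:skeleton-driven": [decode_skeleton, skin_mesh],
}

def create_pipeline(representation_urn: str) -> List[Callable]:
    """Create the pipeline matching the representation scheme of the component."""
    return list(PIPELINE_FACTORIES[representation_urn])

def run_pipeline(pipeline: List[Callable], reconstruction_data):
    """Feed the reconstruction data through the pipeline; the output is the dynamic 3D model."""
    model = reconstruction_data
    for stage in pipeline:
        model = stage(model)
    return model

# The media access function creates the pipeline for the third digital person component
# according to its representation scheme and feeds its reconstruction data through it.
pipeline = create_pipeline("urn:example:rep:mesh-blendshape")
dynamic_model = run_pipeline(pipeline, {"vertices": []})
```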
In some embodiments, before the step S102 (the dynamic three-dimensional model of each digital person component of the target digital person is constructed according to the representation scheme of each digital person component of the target digital person and the reconstruction data of each digital person component of the target digital person), the method for processing scene data of a three-dimensional scene provided by the embodiment of the present application further includes: and obtaining reconstruction data of each digital person component of the target digital person. The implementation of obtaining the reconstructed data of each digital person component of the target digital person may include the following:
A first implementation:
In this implementation, obtaining the reconstruction data of each digital person component of the target digital person includes: receiving the description information of a first media file sent by the display engine; and acquiring the reconstruction data of the first digital person component according to the description information of the first media file. The first media file is the media file corresponding to a first digital person component of the target digital person.
That is, the data used to reconstruct the dynamic three-dimensional model corresponding to the first digital person component of the target digital person may be described as a media file of the MPEG type in the MPEG media (MPEG_media) of the scene description file, and the input data of the media access function may be delivered to the media access function by indexing, through the MPEG circular buffer (MPEG_buffer_circular) declared in a buffer description module (buffer), the media file declared in the MPEG media (MPEG_media).
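The following sketch shows, as Python dictionaries mirroring the JSON of the scene description file, how a media file declared in the MPEG media list can be indexed from a circular buffer so that the media access function knows where its input data comes from. The extension and field names follow the usual MPEG scene description conventions; the URI, mime type and track values are illustrative assumptions.

```python
import json

scene_fragment = {
    "extensions": {
        "MPEG_media": {
            "media": [
                {
                    "name": "face_driving_data",   # media name
                    "autoplay": True,              # play automatically
                    "loop": False,                 # no cyclic play
                    "alternatives": [
                        {   # one selectable version of the media file (values assumed)
                            "uri": "https://example.com/face_driving.mp4",
                            "mimeType": "video/mp4",
                            "tracks": [{"track": "track1"}],
                        }
                    ],
                }
            ]
        }
    },
    "buffers": [
        {
            "byteLength": 65536,
            "extensions": {
                # The circular buffer indexes the media entry above (media index 0, track 0),
                # which tells the media access function where its input data comes from.
                "MPEG_buffer_circular": {"count": 4, "media": 0, "tracks": [0]},
            },
        }
    ],
}

print(json.dumps(scene_fragment, indent=2))
```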
In some embodiments, the description information of the first media file includes: a URI of the first media file. Acquiring the reconstruction data of the first digital person component according to the description information of the first media file includes the following steps:
Step a, acquiring the first media file according to the URI of the first media file.
Step b, decapsulating the first media file to obtain the code stream of each track of the first media file.
Step c, decoding the code stream of each track of the first media file to obtain the reconstruction data of the first digital person component.
In some embodiments, step a (obtaining the first media file from the URI of the first media file) comprises: sending a media resource request to a media resource server according to the URI of the first media file; and receiving a media resource response carrying the first media file sent by the media resource server.
In some embodiments, step a (obtaining the first media file from the URI of the first media file) comprises: and reading the first media file from a preset storage space according to the URI of the first media file.
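A minimal sketch of step a and of its two implementations (requesting the media file from a media resource server, or reading it from a preset storage space), followed by placeholder decapsulation and decoding steps. The URI, the file layout and the helper functions are hypothetical and shown only to make the flow concrete.

```python
from pathlib import Path
from urllib.request import urlopen

def fetch_media_file(uri: str) -> bytes:
    """Step a: obtain the first media file according to its URI (remote or local)."""
    if uri.startswith(("http://", "https://")):
        # Implementation 1: request the media file from the media resource server.
        with urlopen(uri) as response:
            return response.read()
    # Implementation 2: read the media file from a preset storage space.
    return Path(uri).read_bytes()

def demux(media_file: bytes) -> list:
    """Step b (placeholder): decapsulate the file into the code stream of each track."""
    return [media_file]

def decode(track_streams: list) -> dict:
    """Step c (placeholder): decode the code streams into reconstruction data."""
    return {"reconstruction_data": track_streams}

media = fetch_media_file("https://example.com/first_component.mp4")  # hypothetical URI
reconstruction_data = decode(demux(media))
```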
A second implementation:
In this implementation, obtaining the reconstruction data of each digital person component of the target digital person includes: receiving the reconstruction data of a second digital person component of the target digital person sent by the display engine.
That is, the data used to reconstruct the dynamic three-dimensional model corresponding to a digital person component may be carried in the scene description file; the display engine parses the scene description file to obtain this data, and then sends it to the media access function through the media access function API.
A third implementation:
In this implementation, obtaining the reconstruction data of each digital person component of the target digital person includes: carrying one portion of the data used to reconstruct the dynamic three-dimensional model corresponding to the first digital person component in the scene description file, and carrying another portion of that data in a media file. For example, the scene description file carries the static three-dimensional model used to reconstruct the dynamic three-dimensional model corresponding to the first digital person component, while a media file carries the driving data used to reconstruct that dynamic three-dimensional model.
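A sketch of the third implementation: the static three-dimensional model is carried by the scene description file itself (here assumed to be embedded as a base64 data URI in an ordinary buffer), while the driving data is carried by a media file referenced through the MPEG media list. Extension names follow common glTF/MPEG scene description conventions and all concrete values are illustrative.

```python
import base64, json, struct

# Static model data (e.g. neutral mesh positions) embedded in the scene description file.
static_positions = struct.pack("<3f", 0.0, 1.0, 0.0)
static_buffer = {
    "byteLength": len(static_positions),
    "uri": "data:application/octet-stream;base64,"
           + base64.b64encode(static_positions).decode("ascii"),
}

# Driving data (e.g. facial expression weights over time) carried by a media file.
driving_buffer = {
    "byteLength": 4096,
    "extensions": {"MPEG_buffer_circular": {"count": 4, "media": 0, "tracks": [0]}},
}

scene_fragment = {
    "extensions": {"MPEG_media": {"media": [{"name": "face_driving_data",
                                             "alternatives": [{"uri": "face_driving.mp4",
                                                               "mimeType": "video/mp4"}]}]}},
    "buffers": [static_buffer, driving_buffer],
}
print(json.dumps(scene_fragment, indent=2))
```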
S103, writing the dynamic three-dimensional model of each digital person component of the target digital person into a buffer corresponding to each digital person component of the target digital person.
After the media access function writes the dynamic three-dimensional model of each digital person component of the target digital person into the buffer corresponding to each digital person component of the target digital person, the display engine can read the dynamic three-dimensional model corresponding to each digital person component of the target digital person from the buffer corresponding to each digital person component of the target digital person, and render the three-dimensional scene to be rendered according to the dynamic three-dimensional model corresponding to each digital person component of the target digital person.
In some embodiments, the step S103 (writing the dynamic three-dimensional model of each digital person component of the target digital person into the buffer corresponding to each digital person component of the target digital person) includes: receiving the description information of the accessor, the description information of the buffer and the description information of the buffer slice of the buffer corresponding to each digital person component of the target digital person sent by the display engine; and writing the dynamic three-dimensional model of each digital person component of the target digital person into the corresponding buffer according to the received description information of the accessor, of the buffer and of the buffer slices.
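The following sketch illustrates step S103: the media access function uses the received buffer and buffer slice description information to write one frame of the reconstructed dynamic three-dimensional model into a simplified circular buffer, from which the display engine later reads. The class, the frame layout and the concrete sizes are assumptions made for illustration.

```python
import struct
from collections import deque

class CircularBuffer:
    """Simplified circular buffer with a fixed number of storage links (frames)."""
    def __init__(self, frame_count: int, frame_size: int):
        self.frames = deque(maxlen=frame_count)   # the oldest frame is dropped when full
        self.frame_size = frame_size

    def write_frame(self, payload: bytes, byte_offset: int):
        """Write one frame, placing the payload at the buffer-slice offset."""
        frame = bytearray(self.frame_size)
        frame[byte_offset:byte_offset + len(payload)] = payload
        self.frames.append(bytes(frame))

    def read_latest(self) -> bytes:
        return self.frames[-1]

# Buffer / buffer-slice description information received from the display engine (values assumed).
buffer_desc = {"byteLength": 1024, "count": 4}
view_desc = {"byteOffset": 0, "byteLength": 36}

buf = CircularBuffer(buffer_desc["count"], buffer_desc["byteLength"])
positions = struct.pack("<9f", *([0.0] * 9))          # one frame of the dynamic 3D model
buf.write_frame(positions, view_desc["byteOffset"])   # the media access function writes ...
latest = buf.read_latest()                            # ... and the display engine reads.
```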
The accessor corresponding to any digital person component is an accessor for accessing the dynamic three-dimensional model of the digital person component, and the buffer corresponding to any digital person component is a buffer for buffering the dynamic three-dimensional model of the digital person component.
In some embodiments, the description information of the accessor may include at least one of:
the name of the accessor, the data type of the data accessed by the accessor, the type of the accessor, the quantity of data accessed by the accessor, whether the accessor is a time-varying accessor based on the MPEG extension, the index value of the buffer slice description module corresponding to the buffer slice storing the data accessed by the accessor, the index value of the buffer slice description module corresponding to the buffer slice storing the time-varying parameters of the accessor, and whether the accessor parameters change with time.
In some embodiments, the description information of the buffer includes at least one of:
the name of the buffer, the capacity of the buffer, whether the buffer is a circular buffer based on the MPEG extension, the number of storage links of the buffer, the index value of the media description module corresponding to the media file to which the source data of the data buffered by the buffer belongs, and the track index value of the source data of the data buffered by the buffer.
In some embodiments, the description information of the buffer slice may include at least one of:
the name of the buffer slice, the index value of the buffer description module corresponding to the buffer to which the buffer slice belongs, the capacity of the buffer slice, and the offset of the data buffered by the buffer slice.
In some embodiments, receiving the description information of the accessor corresponding to each digital person component of the target digital person, the description information of the buffer corresponding to each digital person component of the target digital person, and the description information of the buffer slice of the buffer corresponding to each digital person component of the target digital person sent by the display engine includes:
and receiving the description information of the accessors corresponding to the digital person components of the target digital person, the description information of the buffers corresponding to the digital person components of the target digital person and the description information of the buffer slices of the buffers corresponding to the digital person components of the target digital person, which are sent by the display engine, through a media access function API.
In some embodiments, the method for processing scene data of a three-dimensional scene provided by the embodiment of the present application further includes: and sending the description information of the buffer corresponding to each digital person component of the target digital person and the description information of the buffer slice of the buffer corresponding to each digital person component of the target digital person to a buffer management module, so that the buffer management module creates the buffer corresponding to each digital person component of the target digital person according to the description information of the buffer corresponding to each digital person component of the target digital person and the description information of the buffer slice of the buffer corresponding to each digital person component of the target digital person.
In some embodiments, the sending, to the cache management module, the description information of the cache corresponding to each digital person component of the target digital person and the description information of the cache slice of the cache corresponding to each digital person component of the target digital person includes: and sending the description information of the buffer corresponding to each digital person component of the target digital person and the description information of the buffer slice of the buffer corresponding to each digital person component of the target digital person to the buffer management module through a buffer API.
In the method for processing scene data of a three-dimensional scene provided by the embodiments of the application, after the representation scheme of each digital person component of the target digital person sent by the display engine is received, a dynamic three-dimensional model of each digital person component of the target digital person is constructed according to the representation scheme and the reconstruction data of that component, and the dynamic three-dimensional model of each digital person component is written into the buffer corresponding to that component. The display engine can therefore read the dynamic three-dimensional model corresponding to each digital person component of the target digital person from the corresponding buffer and render the three-dimensional scene to be rendered according to these dynamic three-dimensional models. In this way, digital persons composed of digital person components with different representation schemes can be rendered, which solves the problem that the existing scene description framework does not support flexible combination of digital person components with different representation schemes.
Some embodiments of the present application further provide a cache management method, which is performed by the cache management module in the immersive media description framework. Referring to fig. 11, the cache management method includes the following steps S111 and S112:
S111, receiving the description information of the buffer corresponding to each digital person component of the target digital person and the description information of the buffer slice of the buffer corresponding to each digital person component of the target digital person.
Wherein the target digital person is any digital person in the three-dimensional scene to be rendered.
In some embodiments, the description information of the buffer may include at least one of:
the name of the buffer, the capacity of the buffer, whether the buffer is a circular buffer based on the MPEG extension, the number of storage links of the buffer, the index value of the media description module corresponding to the media file to which the source data of the data buffered by the buffer belongs, and the track index value of the source data of the data buffered by the buffer.
In some embodiments, the description information of the buffer slice may include at least one of:
the name of the buffer slice, the index value of the buffer description module corresponding to the buffer to which the buffer slice belongs, the capacity of the buffer slice, and the offset of the data buffered by the buffer slice.
In some embodiments, the step S111 (receiving the description information of the buffer corresponding to each digital person component of the target digital person and the description information of the buffer slice of the buffer corresponding to each digital person component of the target digital person) includes: and receiving the description information of the buffer corresponding to each digital person component of the target digital person and the description information of the buffer slice of the buffer corresponding to each digital person component of the target digital person, which are sent by a display engine.
In some embodiments, receiving the description information of the buffer corresponding to each digital person component of the target digital person and the description information of the buffer slice of the buffer corresponding to each digital person component of the target digital person, which are sent by the display engine, includes: and receiving the description information of the buffer corresponding to each digital person component of the target digital person and the description information of the buffer slice of the buffer corresponding to each digital person component of the target digital person, which are sent by a display engine, through a buffer management API.
In some embodiments, the step S111 (receiving the description information of the buffer corresponding to each digital person component of the target digital person and the description information of the buffer slice of the buffer corresponding to each digital person component of the target digital person) includes: receiving the description information of the buffer corresponding to each digital person component of the target digital person and the description information of the buffer slice of the buffer corresponding to each digital person component of the target digital person sent by the media access function.
In some embodiments, receiving the description information of the buffer corresponding to each digital person component of the target digital person and the description information of the buffer slice of the buffer corresponding to each digital person component of the target digital person sent by the media access function includes: and receiving the description information of the buffer corresponding to each digital person component of the target digital person and the description information of the buffer slice of the buffer corresponding to each digital person component of the target digital person, which are sent by the media access function, through a buffer management API.
S112, creating the buffer corresponding to each digital person component of the target digital person according to the description information of the buffer corresponding to each digital person component of the target digital person and the description information of the buffer slice of that buffer, so that the media access function obtains the representation scheme of each digital person component of the target digital person from the display engine, reconstructs the dynamic three-dimensional model of each digital person component according to its representation scheme, and writes the dynamic three-dimensional model of each digital person component into the corresponding buffer.
In some embodiments, the cache management method provided by the embodiment of the present application further includes: receiving the description information of accessors corresponding to all digital person components of the target digital person, which is sent by the display engine and/or the media access function; and carrying out buffer management on the buffers corresponding to the digital person components of the target digital person according to the description information of the accessors corresponding to the digital person components of the target digital person. The description information of the accessor corresponding to any digital person component of the target digital person is the description information of the accessor for accessing the dynamic three-dimensional model corresponding to the digital person component.
In some embodiments, the description information of the target accessor in the accessors corresponding to the respective digital person components of the target digital person includes at least one of the following:
the index value of the buffer slice description module corresponding to the buffer slice storing the data accessed by the target accessor, the data type of the data accessed by the target accessor, the type of the target accessor, the quantity of data accessed by the target accessor, whether the target accessor is a time-varying accessor based on the MPEG extension, the index value of the buffer slice description module corresponding to the buffer slice storing the time-varying parameters of the target accessor, and whether the parameters of the target accessor change with time.
In some embodiments, the cache management method provided by the embodiment of the present application further includes: receiving a cache management instruction, wherein the cache management instruction is used for indicating to release a designated cache or update data cached in the designated cache; and responding to the cache management instruction to release the designated cache or update the data cached in the designated cache.
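A minimal sketch of the cache management module side: creating, from the received description information, the buffer corresponding to a digital person component (step S112), and responding to a cache management instruction that releases a designated cache or updates the data cached in it. The class and method names are hypothetical.

```python
class BufferManager:
    """Simplified cache management module: creates, updates and releases buffers."""

    def __init__(self):
        self.buffers = {}   # buffer name -> bytearray acting as the buffer
        self.views = {}     # buffer name -> list of buffer slice descriptions

    def create_buffer(self, name: str, buffer_desc: dict, view_descs: list):
        """S112: create the buffer described by the received description information."""
        count = buffer_desc.get("count", 1)               # storage links of a circular buffer
        self.buffers[name] = bytearray(count * buffer_desc["byteLength"])
        self.views[name] = view_descs                     # offsets/lengths kept for later access

    def handle_instruction(self, instruction: dict):
        """Release a designated cache, or update the data cached in it."""
        name = instruction["buffer"]
        if instruction["op"] == "release":
            self.buffers.pop(name, None)
            self.views.pop(name, None)
        elif instruction["op"] == "update":
            data = instruction["data"]
            self.buffers[name][:len(data)] = data

manager = BufferManager()
manager.create_buffer("face_model_buffer", {"byteLength": 1024, "count": 4},
                      [{"byteOffset": 0, "byteLength": 512}])
manager.handle_instruction({"op": "update", "buffer": "face_model_buffer", "data": b"\x00" * 16})
manager.handle_instruction({"op": "release", "buffer": "face_model_buffer"})
```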
According to the cache management method provided by the embodiments of the application, after the description information of the buffer corresponding to each digital person component of the target digital person and the description information of the buffer slice of that buffer are received, the corresponding buffer can be created according to this description information, so that the media access function obtains the representation scheme of each digital person component of the target digital person from the display engine, reconstructs the dynamic three-dimensional model of each digital person component according to its representation scheme, and writes the dynamic three-dimensional model of each digital person component into the corresponding buffer.
Some embodiments of the present application provide a scene description file generating device, where the device includes:
A memory configured to store a computer program;
and a processor configured to cause the scene description file generating device to implement the scene description file generating method according to any one of the above embodiments when the computer program is invoked.
Some embodiments of the present application further provide a computer-readable storage medium having a computer program stored thereon; when the computer program is executed by a computing device, the computing device is caused to implement the method for generating a scene description file according to any of the above embodiments.
Some embodiments of the present application further provide a computer program product which, when run on a computer, causes the computer to implement the method for generating a scene description file according to any of the above embodiments.
Finally, it should be noted that: the above embodiments are only for illustrating the technical solution of the present application, and not for limiting the same; although the application has been described in detail with reference to the foregoing embodiments, it will be understood by those of ordinary skill in the art that: the technical scheme described in the foregoing embodiments can be modified or some or all of the technical features thereof can be replaced by equivalents; such modifications and substitutions do not depart from the spirit of the application.
The foregoing description, for purposes of explanation, has been presented in conjunction with specific embodiments. The illustrative discussions above are not intended to be exhaustive or to limit the embodiments to the precise forms disclosed above. Many modifications and variations are possible in light of the above teaching. The embodiments were chosen and described in order to best explain the principles and the practical application, to thereby enable others skilled in the art to best utilize the embodiments and various embodiments with various modifications as are suited to the particular use contemplated.
Claims (16)
1. A method for generating a scene description file, comprising:
under the condition that a target digital person is included in a three-dimensional scene to be rendered, obtaining a representation scheme of each digital person component of the target digital person;
generating a digital person node array according to the representation scheme of each digital person component of the target digital person;
adding the digital person node array into a digital person node description module in a scene description file of the three-dimensional scene to be rendered;
the digital person node description module is a node description module corresponding to a node representing the target digital person in a node list of the scene description file.
2. The method of claim 1, wherein the generating a digital person node array according to the representation scheme of each digital person component of the target digital person comprises:
setting the value of a digital person type syntax element in the digital person node array according to the representation scheme of each digital person component of the target digital person.
3. The method of claim 2, wherein the setting the value of the digital person type syntax element in the digital person node array according to the representation scheme of each digital person component of the target digital person comprises:
determining a representation scheme of the target digital person according to the representation scheme of each digital person component of the target digital person;
determining different-scheme components according to the representation scheme of the target digital person, a different-scheme component being a digital person component, among the digital person components of the target digital person, whose representation scheme differs from the representation scheme of the target digital person; and
setting the value of the digital person type syntax element according to the uniform resource name (URN) of the representation scheme of the target digital person, the names of the digital person body parts represented by the different-scheme components, and the URNs of the representation schemes of the different-scheme components.
4. The method of claim 1, wherein the generating a digital person node array according to the representation scheme of each digital person component of the target digital person comprises:
setting a value of a digital person type syntax element and values of component type syntax elements in the digital person node array according to the representation scheme of each digital person component of the target digital person.
5. The method of claim 4, wherein the setting a value of a digital person type syntax element and values of component type syntax elements in the digital person node array according to the representation scheme of each digital person component of the target digital person comprises:
determining a representation scheme of the target digital person according to the representation scheme of each digital person component of the target digital person;
setting the value of the digital person type syntax element to the URN of the representation scheme of the target digital person;
determining different-scheme components according to the representation scheme of the target digital person, a different-scheme component being a digital person component, among the digital person components of the target digital person, whose representation scheme differs from the representation scheme of the target digital person; and
adding a component type syntax element in the mapping array corresponding to each different-scheme component, and setting the value of the component type syntax element to the URN of the representation scheme of the corresponding different-scheme component.
6. The method according to claim 1, wherein the method further comprises:
Adding a part name syntax element to the mapping array corresponding to each digital person component of the target digital person, and setting the value of the part name syntax element to the hierarchical standard name of the digital person body part represented by the corresponding digital person component.
7. The method according to claim 1, wherein the method further comprises:
Adding a node index syntax element to the mapping array corresponding to each digital person component of the target digital person, and setting the value of the node index syntax element to the index value of the corresponding node description module.
8. The method according to claim 1, wherein the method further comprises:
adding an active identification syntax element in the digital person node array, and setting the value of the active identification syntax element according to whether the target digital person is an active digital person.
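For illustration only, the fragment below gives one possible JSON shape, expressed as a Python dictionary, for the digital person node array described in claims 1 to 8. The extension name MPEG_digital_person, the element names type, active, mappings, path and node, and the URNs are hypothetical placeholders chosen to make the claims concrete; they are not names defined by the present application.

```python
import json

digital_person_node = {
    "name": "digital_person_0",
    "children": [1, 2],
    "extensions": {
        "MPEG_digital_person": {                              # hypothetical extension carrying the digital person node array
            "type": "urn:example:avatar:skeleton-driven",     # representation scheme of the target digital person (claim 2)
            "active": True,                                   # active identification syntax element (claim 8)
            "mappings": [
                {   # mapping array for a component using the digital person's own scheme
                    "path": "body/torso",                     # part name: hierarchical standard name (claim 6)
                    "node": 1,                                # node index of the component's node (claim 7)
                },
                {   # mapping array for a different-scheme component (claims 3 and 5)
                    "path": "body/head/face",
                    "node": 2,
                    "type": "urn:example:avatar:blendshape-face",   # component type syntax element
                },
            ],
        }
    },
}

print(json.dumps({"nodes": [digital_person_node]}, indent=2))
```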
9. The method according to any one of claims 1-8, further comprising:
generating a target media description module corresponding to a target media file according to description information of the target media file, and adding the target media description module into a media list of a Moving Picture Experts Group (MPEG) media of the scene description file;
wherein the target media file is any media file in the three-dimensional scene to be rendered, and the generating a target media description module corresponding to the target media file according to the description information of the target media file comprises at least one of the following:
Adding a media name syntax element in the target media description module, and setting a value of the media name syntax element according to the name of the target media file;
Adding an automatic play syntax element in the target media description module, and setting the value of the automatic play syntax element according to whether the target media file needs to be played automatically;
adding a cyclic play syntax element in the target media description module, and setting a value of the cyclic play syntax element according to whether the target media file needs cyclic play or not;
Adding a selectable item list in the target media description module;
generating a selectable item description module corresponding to each selectable version of the target media file according to the description information of that selectable version, and adding the selectable item description module corresponding to each selectable version of the target media file into the selectable item list.
10. The method according to any one of claims 1-8, further comprising:
Generating a target scene description module corresponding to the three-dimensional scene to be rendered according to the description information of the three-dimensional scene to be rendered, and adding the target scene description module into a scene list of the scene description file;
The generating a target scene description module corresponding to the three-dimensional scene to be rendered according to the description information of the three-dimensional scene to be rendered includes: and adding a node index list into the target scene description module, and adding index values of the node description modules corresponding to each top-level node in the three-dimensional scene to be rendered into the node index list.
11. The method according to any one of claims 1-8, further comprising:
Generating a target node description module corresponding to the target node according to the description information of the target node, and adding the target node description module into a node list of the scene description file;
wherein the target node is any node in the three-dimensional scene to be rendered, and the generating a target node description module corresponding to the target node according to the description information of the target node comprises at least one of the following:
Adding a node name syntax element in the target node description module, and setting a value of the node name syntax element according to the name of the target node;
adding a child node index list into the target node description module, and adding index values of node description modules corresponding to all child nodes mounted on the target node into the child node index list;
Adding a grid index syntax element into the target node description module, and setting the value of the grid index syntax element as the index value of the grid description module corresponding to the three-dimensional grid mounted by the target node;
And adding a position offset syntax element in the target node description module, and setting the value of the position offset syntax element according to the spatial position offset of the target node relative to the parent node.
12. The method according to any one of claims 1-8, further comprising:
Generating a target grid description module corresponding to the target three-dimensional grid according to the description information of the target three-dimensional grid, and adding the target grid description module into a grid list of the scene description file;
wherein the target three-dimensional grid is any three-dimensional grid in the three-dimensional scene to be rendered, and the generating a target grid description module corresponding to the target three-dimensional grid according to the description information of the target three-dimensional grid comprises at least one of the following:
Adding a grid name syntax element into the target grid description module, and setting the value of the grid name syntax element according to the name of the target three-dimensional grid;
Adding a position syntax element in the attributes of the primitive of the target grid description module, and setting the value of the position syntax element according to the index value of the accessor description module corresponding to the accessor used to access the dynamic three-dimensional model of the digital person component corresponding to the target three-dimensional grid;
And adding a mode syntax element in the primitive of the target grid description module, and setting the value of the mode syntax element according to the topology type of the target three-dimensional grid.
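As an illustration of claim 12, the fragment below sketches a grid description module whose primitive carries a position syntax element pointing at the accessor description module of the dynamic three-dimensional model and a mode syntax element giving the topology type. The glTF-style field names and the index values are illustrative assumptions.

```python
import json

mesh_desc = {
    "name": "face_mesh",                     # grid name syntax element
    "primitives": [
        {
            "attributes": {"POSITION": 0},   # position syntax element: index of the accessor description module
            "mode": 4,                       # mode syntax element: 4 denotes a triangle topology
        }
    ],
}

print(json.dumps({"meshes": [mesh_desc]}, indent=2))
```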
13. The method according to any one of claims 1-8, further comprising:
generating a target accessor description module corresponding to the target accessor according to the description information of the target accessor, and adding the target accessor description module into an accessor list of the scene description file;
wherein the target accessor is any accessor used for rendering the three-dimensional scene to be rendered, and the generating a target accessor description module corresponding to the target accessor according to the description information of the target accessor comprises at least one of the following:
adding a data type syntax element in the target accessor description module, and setting a value of the data type syntax element according to the data type of the data accessed by the target accessor;
Adding an accessor type syntax element in the target accessor description module, and setting a value of the accessor type syntax element according to the type of the target accessor;
Adding a data quantity syntax element in the target accessor description module, and setting a value of the data quantity syntax element according to the quantity of data accessed by the target accessor;
Adding a first cache slice index syntax element in the target accessor description module, and setting the value of the first cache slice index syntax element according to the index value of the cache slice description module corresponding to the cache slice caching the data accessed by the target accessor;
adding an MPEG time-varying accessor in the extension list of the target accessor description module;
Adding a second cache slice index syntax element in the MPEG time-varying accessor, and setting the value of the second cache slice index syntax element according to the index value of a cache slice description module corresponding to the cache slice for caching the time-varying parameter of the target accessor;
Adding a time-varying syntax element in the MPEG time-varying accessor, and setting the value of the time-varying syntax element according to whether the value of the syntax element in the target accessor changes with time;
and adding an accessor name syntax element in the target accessor description module, and setting the value of the accessor name syntax element according to the name of the target accessor.
14. The method according to any one of claims 1-8, further comprising:
Generating a target buffer description module corresponding to the target buffer according to the description information of the target buffer, and adding the target buffer description module into a buffer list of the scene description file;
wherein the target buffer is any buffer used for rendering the three-dimensional scene to be rendered, and the generating a target buffer description module corresponding to the target buffer according to the description information of the target buffer comprises at least one of the following:
Adding a first byte length syntax element in the target buffer description module, and setting a value of the first byte length syntax element according to the capacity of the target buffer;
Adding an MPEG circular buffer in the buffer description module corresponding to the target buffer;
Adding a link number syntax element into the MPEG circular buffer, and setting the value of the link number syntax element according to the number of storage links of the target buffer;
Adding a media index syntax element into the MPEG circular buffer, and setting the value of the media index syntax element according to the index value of the media description module corresponding to the media file to which the source data of the data buffered by the target buffer belongs;
Adding a track index syntax element in the MPEG circular buffer, and setting the value of the track index syntax element according to the track index value of the source data of the data buffered by the target buffer;
Adding a buffer name syntax element in the buffer description module corresponding to the target buffer, and setting the value of the buffer name syntax element according to the name of the target buffer;
and adding a uniform resource identifier syntax element in the target buffer description module, and setting the value of the uniform resource identifier syntax element according to at least one part of data used for reconstructing a dynamic three-dimensional model of the corresponding digital person component.
15. The method according to any one of claims 1-8, further comprising:
Generating a target cache slice description module corresponding to the target cache slice according to the description information of the target cache slice, and adding the target cache slice description module into a cache slice list of the scene description file;
wherein the target cache slice is any cache slice of a buffer used for rendering the three-dimensional scene to be rendered, and the generating a target cache slice description module corresponding to the target cache slice according to the description information of the target cache slice comprises at least one of the following:
Adding a buffer index syntax element into the target cache slice description module, and setting the value of the buffer index syntax element according to the index value of the buffer description module corresponding to the buffer to which the target cache slice belongs;
Adding a second byte length syntax element in the target cache slice description module, and setting a value of the second byte length syntax element according to the capacity of the target cache slice;
adding an offset syntax element in the target cache slice description module, and setting a value of the offset syntax element according to the offset of the data cached by the target cache slice;
and adding a cache slice name syntax element in the target cache slice description module, and setting the value of the cache slice name syntax element according to the name of the target cache slice.
16. A scene description file generating apparatus, comprising:
A memory configured to store a computer program;
A processor configured to cause the generating means of the scene description file to implement the generating method of the scene description file according to any of claims 1-15 when the computer program is invoked.
Priority Applications (2)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
PCT/CN2024/070189 WO2024149120A1 (en) | 2023-01-10 | 2024-01-02 | Method and apparatus for generating scenario description file |
PCT/CN2024/070184 WO2024149117A1 (en) | 2023-01-10 | 2024-01-02 | Method and device for generating scene description file |
Applications Claiming Priority (4)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN2023100367908 | 2023-01-10 | ||
CN202310036790 | 2023-01-10 | ||
CN202310672397 | 2023-06-07 | ||
CN2023106723978 | 2023-06-07 |
Publications (1)
Publication Number | Publication Date |
---|---|
CN118338036A true CN118338036A (en) | 2024-07-12 |
Family
ID=91765071
Family Applications (10)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202310849756.2A Pending CN118334205A (en) | 2023-01-10 | 2023-07-11 | Scene description file analysis method and device |
CN202310847396.2A Pending CN118334203A (en) | 2023-01-10 | 2023-07-11 | Cache management method and device |
CN202310847420.2A Pending CN118334141A (en) | 2023-01-10 | 2023-07-11 | Media file processing method and device |
CN202310850008.6A Pending CN118338035A (en) | 2023-01-10 | 2023-07-11 | Scene description file generation method and device |
CN202310847404.3A Pending CN118334204A (en) | 2023-01-10 | 2023-07-11 | Rendering method and device of three-dimensional scene |
CN202311056510.6A Pending CN118334207A (en) | 2023-01-10 | 2023-08-21 | Method and device for processing scene data of three-dimensional scene |
CN202311054420.3A Pending CN118338036A (en) | 2023-01-10 | 2023-08-21 | Scene description file generation method and device |
CN202311056516.3A Pending CN118334208A (en) | 2023-01-10 | 2023-08-21 | Rendering method and device of three-dimensional scene |
CN202311056521.4A Pending CN118333839A (en) | 2023-01-10 | 2023-08-21 | Cache management method and device |
CN202311054412.9A Pending CN118334206A (en) | 2023-01-10 | 2023-08-21 | Scene description file analysis method and device |
Family Applications Before (6)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202310849756.2A Pending CN118334205A (en) | 2023-01-10 | 2023-07-11 | Scene description file analysis method and device |
CN202310847396.2A Pending CN118334203A (en) | 2023-01-10 | 2023-07-11 | Cache management method and device |
CN202310847420.2A Pending CN118334141A (en) | 2023-01-10 | 2023-07-11 | Media file processing method and device |
CN202310850008.6A Pending CN118338035A (en) | 2023-01-10 | 2023-07-11 | Scene description file generation method and device |
CN202310847404.3A Pending CN118334204A (en) | 2023-01-10 | 2023-07-11 | Rendering method and device of three-dimensional scene |
CN202311056510.6A Pending CN118334207A (en) | 2023-01-10 | 2023-08-21 | Method and device for processing scene data of three-dimensional scene |
Family Applications After (3)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202311056516.3A Pending CN118334208A (en) | 2023-01-10 | 2023-08-21 | Rendering method and device of three-dimensional scene |
CN202311056521.4A Pending CN118333839A (en) | 2023-01-10 | 2023-08-21 | Cache management method and device |
CN202311054412.9A Pending CN118334206A (en) | 2023-01-10 | 2023-08-21 | Scene description file analysis method and device |
Country Status (2)
Country | Link |
---|---|
CN (10) | CN118334205A (en) |
WO (1) | WO2024149120A1 (en) |
Family Cites Families (4)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US11405699B2 (en) * | 2019-10-01 | 2022-08-02 | Qualcomm Incorporated | Using GLTF2 extensions to support video and audio data |
US11544886B2 (en) * | 2019-12-17 | 2023-01-03 | Samsung Electronics Co., Ltd. | Generating digital avatar |
WO2021251185A1 (en) * | 2020-06-11 | 2021-12-16 | ソニーグループ株式会社 | Information processing device and method |
CN115170708B (en) * | 2022-07-11 | 2023-05-05 | 上海哔哩哔哩科技有限公司 | 3D image realization method and system |
2023
- 2023-07-11 CN CN202310849756.2A patent/CN118334205A/en active Pending
- 2023-07-11 CN CN202310847396.2A patent/CN118334203A/en active Pending
- 2023-07-11 CN CN202310847420.2A patent/CN118334141A/en active Pending
- 2023-07-11 CN CN202310850008.6A patent/CN118338035A/en active Pending
- 2023-07-11 CN CN202310847404.3A patent/CN118334204A/en active Pending
- 2023-08-21 CN CN202311056510.6A patent/CN118334207A/en active Pending
- 2023-08-21 CN CN202311054420.3A patent/CN118338036A/en active Pending
- 2023-08-21 CN CN202311056516.3A patent/CN118334208A/en active Pending
- 2023-08-21 CN CN202311056521.4A patent/CN118333839A/en active Pending
- 2023-08-21 CN CN202311054412.9A patent/CN118334206A/en active Pending
2024
- 2024-01-02 WO PCT/CN2024/070189 patent/WO2024149120A1/en unknown
Also Published As
Publication number | Publication date |
---|---|
WO2024149120A1 (en) | 2024-07-18 |
CN118334206A (en) | 2024-07-12 |
CN118334205A (en) | 2024-07-12 |
CN118334204A (en) | 2024-07-12 |
CN118334208A (en) | 2024-07-12 |
CN118333839A (en) | 2024-07-12 |
CN118334207A (en) | 2024-07-12 |
CN118334141A (en) | 2024-07-12 |
CN118338035A (en) | 2024-07-12 |
CN118334203A (en) | 2024-07-12 |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication ||
SE01 | Entry into force of request for substantive examination ||