WO2026002363A1 - Methods and apparatus for supporting generation of a virtual 3d object - Google Patents
Methods and apparatus for supporting generation of a virtual 3D object
- Publication number
- WO2026002363A1 (PCT/EP2024/067654)
- Authority
- WO
- WIPO (PCT)
- Prior art keywords
- virtual
- contextual
- prompt
- node
- processing node
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Classifications
-
- G—PHYSICS
- G06—COMPUTING OR CALCULATING; COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/004—Artificial life, i.e. computing arrangements simulating life
- G06N3/006—Artificial life, i.e. computing arrangements simulating life based on simulated virtual individual or collective life forms, e.g. social simulations or particle swarm optimisation [PSO]
-
- G—PHYSICS
- G06—COMPUTING OR CALCULATING; COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F3/00—Input arrangements for transferring data to be processed into a form capable of being handled by the computer; Output arrangements for transferring data from processing unit to output unit, e.g. interface arrangements
- G06F3/01—Input arrangements or combined input and output arrangements for interaction between user and computer
- G06F3/011—Arrangements for interaction with the human body, e.g. for user immersion in virtual reality
-
- G—PHYSICS
- G06—COMPUTING OR CALCULATING; COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F40/00—Handling natural language data
- G06F40/30—Semantic analysis
Landscapes
- Engineering & Computer Science (AREA)
- Theoretical Computer Science (AREA)
- General Engineering & Computer Science (AREA)
- Physics & Mathematics (AREA)
- General Physics & Mathematics (AREA)
- Biophysics (AREA)
- Data Mining & Analysis (AREA)
- Life Sciences & Earth Sciences (AREA)
- Artificial Intelligence (AREA)
- Biomedical Technology (AREA)
- Human Computer Interaction (AREA)
- Computational Linguistics (AREA)
- Health & Medical Sciences (AREA)
- Evolutionary Computation (AREA)
- General Health & Medical Sciences (AREA)
- Molecular Biology (AREA)
- Computing Systems (AREA)
- Mathematical Physics (AREA)
- Software Systems (AREA)
- User Interface Of Digital Computer (AREA)
Abstract
Methods and apparatus for supporting generation of a virtual 3D object Methods (100, 300) are disclosed for supporting generation of a virtual 3D object. The methods include, at a preprocessing node, obtaining heterogeneous input data for the virtual 3D object (110), aggregating the heterogeneous input data (120), and generating a contextual prompt from the aggregated heterogeneous data (130), the contextual prompt comprising a text string conveying semantic information with respect to the virtual 3D object, which semantic information is derived from the aggregated heterogeneous input data. On detection of a trigger event, the preprocessing node provides the contextual prompt to a processing node (140). The methods further include, at a processing node, inputting the contextual prompt to an ML model (220) operable to generate a prompt for a Text-to-XR function within the processing node, and contextual metadata for the virtual 3D object. The methods further include, at the processing node, using the generated prompt to generate a representation of the virtual 3D object (230), and providing the representation and the contextual metadata for the virtual 3D object to an XR client (240).
Description
Methods and apparatus for supporting generation of a virtual 3D object
Technical Field
The present disclosure relates to methods for supporting generation of a virtual 3D object, and for generating a virtual 3D object. The methods may be performed by a preprocessing node, a processing node, and an Extended Reality (XR) client, respectively. The present disclosure also relates to a preprocessing node, a processing node, and an XR client, and to a computer program product configured, when run on a computer, to carry out methods for supporting generation of a virtual 3D object, and for generating a virtual 3D object.
Background
3D content creation is a significant part of the technological landscape, spanning industries from gaming to manufacturing to medicine. It is used to make products in these fields more effective, efficient, or engaging. 3D renders can be generated in several ways, including textured meshes, neural radiance fields, and latent diffusion models.
Textured meshes are a popular method of representing 3D objects digitally. A 3D mesh is a collection of vertices, edges, and faces that define the shape of a 3D object. Meshes can range from simple shapes, such as cubes and spheres, to complex structures, such as characters or buildings. Textured meshes add detailed visual data to these structures by mapping 2D images onto the surface of the mesh. In effect, textures allow a simple 3D model to look more complex than it is. These models are represented as data structures that store the vertices, edges, and faces, along with the associated texture coordinates. Several textured mesh formats are widely used, depending on user preference and context. For example, a different format may be preferred for game development as opposed to 3D rendering. Among the most used formats are Wavefront OBJ files, the PLY Polygon File Format, and the GL Transmission Format (glTF).
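As an illustrative sketch only (not part of the disclosed methods), the following Python function emits a minimal textured mesh in the Wavefront OBJ format mentioned above: vertex positions (`v` lines), texture coordinates (`vt` lines), and a face (`f` line) whose `v/vt` index pairs map each vertex to a texel. The function name and geometry are assumptions for illustration.

```python
# Sketch: build a minimal textured quad as a Wavefront OBJ string.
# Vertex positions ("v"), texture coordinates ("vt"), and faces ("f")
# indexing both are the core elements of the format.

def make_textured_quad_obj() -> str:
    positions = [(0, 0, 0), (1, 0, 0), (1, 1, 0), (0, 1, 0)]
    texcoords = [(0, 0), (1, 0), (1, 1), (0, 1)]
    lines = ["# minimal textured quad"]
    lines += [f"v {x} {y} {z}" for x, y, z in positions]
    lines += [f"vt {u} {v}" for u, v in texcoords]
    # OBJ indices are 1-based; "v/vt" pairs bind each vertex to a texel.
    lines.append("f 1/1 2/2 3/3 4/4")
    return "\n".join(lines)
```

The resulting string can be saved with an `.obj` extension and opened in most 3D tools.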
Neural Radiance Fields (NeRFs) are a recently introduced method for generating 3D content. These are fully connected neural networks used for learning a continuous volumetric scene function from a sparse set of 2D images. A NeRF takes 5D inputs (spatial location (3D) and viewing direction (2D)) and outputs the volume density and emitted radiance at the input spatial location. NeRFs generate realistic 3D scenes and
provide an effective way to reconstruct complex scenes from only a handful of 2D images.
Latent Diffusion Models (LDMs) are generative models that create samples by simulating a random Markov chain that mixes to the data distribution by converting a simple noise distribution to a complex data distribution in a controllable manner. By controlling the “temperature” variable of the stochastic Markov process (the volatility or noise level at each step), it is possible to draw samples with fine-grained control over variety versus fidelity. LDMs provide a way for transforming a simpler noise structure into something much more complicated, akin to generating a finished piece of music from a simple melody. A sample of the advanced state of the art can be seen in Jiaxiang Tang et al.: LGM: Large Multi-View Gaussian Model for High-Resolution 3D Content Creation, https://arxiv.org/abs/2402.05054.
Any of the above AI-driven 3D rendering techniques may be used in Extended Reality (XR), which is an umbrella term encapsulating augmented reality (AR), virtual reality (VR), and mixed reality (MR). XR technologies create immersive environments that can be manipulated and interacted with in ways that were not previously possible.
The existing technology for 3D content creation, though advanced, comes with some limitations given its relative novelty as a field. A principal limitation of current techniques is their capacity to adapt to and reflect an intent of a user. Regardless of the input method used, current models tend to simplify these inputs, losing valuable information in the process, and consequently widening the gap between the user’s intent and the final output. Additionally, an inability to account for real-time changes or variables can result in 3D images that lack relevancy or accuracy in certain contexts.
Summary
It is an aim of the present disclosure to provide methods, a preprocessing node, a processing node, an XR client, and a computer program product which at least partially address one or more of the challenges mentioned above. It is a further aim of the present disclosure to provide methods, a preprocessing node, a processing node, an XR client, and a computer program product which cooperate to enhance the quality of 3D content creation outputs, for example by using more detailed and varied input data, and
extracting a greater quantity of relevant information from that input data to assist content creation.
According to a first aspect of the present disclosure, there is provided a computer implemented method for supporting generation of a virtual 3D object. The method, performed by a preprocessing node, comprises obtaining heterogeneous input data for the virtual 3D object, aggregating the heterogeneous input data, and generating a contextual prompt from the aggregated heterogeneous data, wherein the contextual prompt comprises a text string conveying semantic information with respect to the virtual 3D object, which semantic information is derived from the aggregated heterogeneous input data. The method further comprises, on detection of a trigger event, providing the contextual prompt to a processing node operable to facilitate generation of the virtual 3D object.
According to another aspect of the present disclosure, there is provided a computer implemented method for supporting generation of a virtual 3D object. The method, performed by a processing node, comprises receiving a contextual prompt from a preprocessing node, the contextual prompt comprising a text string conveying semantic information with respect to the virtual 3D object. The method further comprises inputting the contextual prompt to an ML model operable to generate a prompt for a Text-to-XR function within the processing node, and contextual metadata for the virtual 3D object. The method further comprises using the generated prompt and the Text-to-XR function within the processing node to generate a representation of the virtual 3D object, and providing the representation of the virtual 3D object and the contextual metadata for the virtual 3D object to an XR client operable to generate the virtual 3D object.
According to another aspect of the present disclosure, there is provided a computer implemented method for generating a virtual 3D object. The method, performed by an XR client, comprises receiving from a processing node a representation of the virtual 3D object and contextual metadata for the virtual 3D object, and rendering the virtual 3D object for display to a user of the XR client, in accordance with the received representation. The method further comprises mapping application specific functions within the XR client to the contextual metadata, and executing the mapped application specific functions according to the contextual metadata.
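The XR-client aspect above can be sketched as a simple dispatch table: application specific functions are mapped to fields of the received contextual metadata and executed according to the metadata values. All function and field names below (`apply_color`, `apply_position`, the `color` and `position` keys) are assumptions for illustration, not part of the disclosure.

```python
# Hypothetical sketch of mapping application-specific functions within
# an XR client to contextual metadata, then executing the mapped
# functions according to the metadata values.

def apply_color(obj: dict, value: str) -> None:
    obj["color"] = value

def apply_position(obj: dict, value: list) -> None:
    obj["position"] = value

# Dispatch table: metadata key -> application-specific function.
HANDLERS = {"color": apply_color, "position": apply_position}

def apply_metadata(obj: dict, metadata: dict) -> None:
    # Execute each mapped function for which the contextual metadata
    # supplies a value; metadata keys with no mapping are ignored.
    for key, value in metadata.items():
        handler = HANDLERS.get(key)
        if handler:
            handler(obj, value)
```

A real client would register renderer callbacks rather than mutate a dictionary, but the mapping-then-executing structure is the same.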
According to another aspect of the present disclosure, there is provided a computer implemented method for supporting generation of a virtual 3D object by an XR client. The method, performed by a system comprising a preprocessing node and a processing node, comprises, at the preprocessing node, obtaining heterogeneous input data for the virtual 3D object, aggregating the heterogeneous input data, and generating a contextual prompt from the aggregated heterogeneous data, wherein the contextual prompt comprises a text string conveying semantic information with respect to the virtual 3D object, which semantic information is derived from the aggregated heterogeneous input data. The method further comprises, at the preprocessing node, on detection of a trigger event, providing the contextual prompt to the processing node. The method further comprises, at the processing node, receiving the contextual prompt from the preprocessing node, and inputting the contextual prompt to an ML model operable to generate a prompt for a Text-to-XR function within the processing node, and contextual metadata for the virtual 3D object. The method further comprises, at the processing node, using the generated prompt and the Text-to-XR function within the processing node to generate a representation of the virtual 3D object, and providing the representation of the virtual 3D object and the contextual metadata for the virtual 3D object to the XR client.
According to another aspect of the present disclosure, there is provided a computer program product comprising a computer readable non-transitory medium, the computer readable medium having computer readable code embodied therein, the computer readable code being configured such that, on execution by a suitable computer or processor, the computer or processor is caused to perform a method according to any one of the aspects or examples of the present disclosure.
According to another aspect of the present disclosure, there is provided a preprocessing node for supporting generation of a virtual 3D object. The preprocessing node comprises processing circuitry configured to cause the preprocessing node to obtain heterogeneous input data for the virtual 3D object, aggregate the heterogeneous input data, and generate a contextual prompt from the aggregated heterogeneous data, wherein the contextual prompt comprises a text string conveying semantic information with respect to the virtual 3D object, which semantic information is derived from the aggregated heterogeneous input data. The processing circuitry is further configured to cause the preprocessing node to, on detection of a trigger event, provide the contextual prompt to a processing node operable to facilitate generation of the virtual 3D object.
According to another aspect of the present disclosure, there is provided a processing node for supporting generation of a virtual 3D object. The processing node comprises processing circuitry configured to cause the processing node to receive a contextual prompt from a preprocessing node, the contextual prompt comprising a text string conveying semantic information with respect to the virtual 3D object, and input the contextual prompt to an ML model operable to generate a prompt for a Text-to-XR function within the processing node, and contextual metadata for the virtual 3D object. The processing circuitry is further configured to cause the processing node to use the generated prompt and the Text-to-XR function within the processing node to generate a representation of the virtual 3D object and provide the representation of the virtual 3D object and the contextual metadata for the virtual 3D object to an XR client.
According to another aspect of the present disclosure, there is provided an XR client for generating a virtual 3D object. The XR client comprises processing circuitry configured to cause the XR client to receive from a processing node a representation of the virtual 3D object and contextual metadata for the virtual 3D object, and render the virtual 3D object for display to a user of the XR client, in accordance with the received representation. The processing circuitry is further configured to cause the XR client to map application specific functions within the XR client to the contextual metadata, and execute the mapped application specific functions according to the contextual metadata.
According to another aspect of the present disclosure, there is provided a system for supporting generation of a virtual 3D object by an XR client. The system comprises a preprocessing node and a processing node. The preprocessing node comprises processing circuitry configured to cause the preprocessing node to obtain heterogeneous input data for the virtual 3D object, aggregate the heterogeneous input data, and generate a contextual prompt from the aggregated heterogeneous data, wherein the contextual prompt comprises a text string conveying semantic information with respect to the virtual 3D object, which semantic information is derived from the aggregated heterogeneous input data. The processing circuitry is further configured to cause the preprocessing node to, on detection of a trigger event, provide the contextual prompt to the processing node. The processing node comprises processing circuitry configured to cause the processing node to receive the contextual prompt from the preprocessing node, input the contextual prompt to an ML model operable to generate a prompt for a Text-to-XR function within the processing node, and contextual metadata
for the virtual 3D object. The processing circuitry is further configured to cause the processing node to use the generated prompt and the Text-to-XR function within the processing node to generate a representation of the virtual 3D object, and provide the representation of the virtual 3D object and the contextual metadata for the virtual 3D object to the XR client.
Aspects of the present disclosure thus provide methods and nodes that enhance the quality of 3D content creation outputs by using more detailed and varied input information. Heterogeneous input data are passed through a preprocessing node which aggregates them and prepares a contextual prompt. The contextual prompt contains semantic information about the virtual 3D object, derived from the aggregated heterogeneous input data. The contextual prompt is then used by a processing node to generate a prompt for the Text-to-XR function within the processing node, and contextual metadata for the virtual 3D object. The contextual metadata can then be used by an XR client to render a 3D object that is more consistent with the provided input data. Examples of the methods and nodes presented herein allow for a more consistent and detailed generation of 3D objects contextual to the immersive XR environment. In some examples aspects of the present disclosure may facilitate tailored object attributes including colour or form changes based, for example, on temperature or spatial positioning.
Brief Description of the Drawings
For a better understanding of the present disclosure, and to show more clearly how it may be carried into effect, reference will now be made, by way of example, to the following drawings in which:
Figure 1 is a flow chart illustrating process steps in a computer implemented method for supporting generation of one or more virtual 3D objects;
Figures 2a and 2b show flow charts illustrating another example of a computer implemented method for supporting generation of one or more virtual 3D objects;
Figure 3 is a flow chart illustrating process steps in a computer implemented method for supporting generation of a virtual 3D object;
Figures 4a and 4b show flow charts illustrating process steps in another example of a computer implemented method for supporting generation of a virtual 3D object;
Figure 5 is a flow chart illustrating process steps in a computer implemented method for generating a virtual 3D object;
Figure 6 is a block diagram illustrating functional modules in an example preprocessing node;
Figure 7 is a block diagram illustrating functional modules in an example processing node;
Figure 8 is a block diagram illustrating functional modules in an example XR client;
Figure 9 is a block diagram illustrating functional components in a system for supporting generation of a virtual 3D object by an XR client;
Figure 10 illustrates interaction between components according to examples of the present disclosure; and
Figure 11 is a sequence diagram illustrating an example implementation of the methods of the present disclosure.
Detailed Description
Examples of the present disclosure propose methods and nodes for supporting generation of a virtual 3D object, and for generating a virtual 3D object. Examples of the proposed methods and nodes address the above discussed problems of using available input information to its full capacity, and drawing appropriate context and insight from possible inputs. In this manner, examples of the present disclosure facilitate generation of one or more 3D objects that are detailed and consistent with the relevant inputs, and with any user intent reflected in them. Some methods allow for adaptation to real-time sensor data or other variables, ensuring accuracy and relevancy of the generated 3D object (or objects). The output of the methods disclosed herein comprises high-resolution and contextually relevant 3D images.
To provide additional context for the present disclosure, there now follows a brief discussion of technologies which may be relevant to the methods and nodes presented herein.
Natural language processing (NLP) allows machines to understand and interpret human language in its written or spoken form. NLP uses two primary techniques for understanding language: syntactic analysis, which looks at the arrangement of words, and semantic analysis, which interprets the meaning of words and sentences.
Computer Vision is a multidisciplinary field that allows computers to gain a high-level understanding of digital images or videos. This field aims to automate tasks that humans can do with ease, such as identifying objects or understanding visual scenes. There are different types of learning algorithms involved in computer vision, including Convolutional Neural Networks (CNN), used for object detection and image classification.
Sensor technologies include a wide array of tools, such as thermometers for temperature detection or accelerometers for detecting physical motion. These input devices convert various types of real-world data into digital signals that computers can understand.
Large Language Models (LLMs), including GPT-4, Llama 3, Phi 3, Mistral, Gemma, etc., are machine learning models trained on large amounts of text data. They aim to predict the next word in a sequence, given the previous words. This training allows them to learn grammar, facts about the world, and some degree of reasoning abilities. LLMs can generate coherent and contextually relevant text based on given prompts.
Figure 1 is a flow chart illustrating process steps in a computer implemented method 100 for supporting generation of one or more virtual 3D objects. In some examples, the one or more 3D objects may be rendered by an Extended Reality (XR) client. The method 100 is described below with reference to generation of a single virtual 3D object. However, it will be appreciated that the method 100 may be used to generate a plurality of virtual 3D objects, following the steps outlined below. The method 100 is performed by a preprocessing node, which may comprise a physical or virtual node, and may be implemented in a computer system, computing device or server apparatus, a user device, such as a wired or wireless device, an XR headset, etc., and/or in a virtualized environment, for example in a cloud, edge cloud, or fog deployment. Examples of a virtual node may include a piece of software or computer program, a code fragment
operable to implement a computer program, a virtualised function, or any other logical entity. The preprocessing node may encompass multiple logical entities, as discussed in greater detail below.
Referring to Figure 1 , the method 100 comprises, in a first step 110, obtaining heterogeneous input data for the virtual 3D object to be generated. The method 100 then comprises aggregating the heterogeneous input data in step 120, and generating a contextual prompt from the aggregated heterogeneous data in step 130. As illustrated at step 130a, the contextual prompt comprises a text string conveying semantic information with respect to the virtual 3D object. The semantic information conveyed by the text string comprises information that has meaning with respect to the virtual 3D object. The semantic information is derived from the aggregated heterogeneous input data. Consequently, the meaning conveyed by the text string with respect to the virtual 3D object was present in the input data, and derived from that data during generation of the contextual prompt. In some examples of the present disclosure, the semantic information may describe any one or more of characteristics of the object, behavior of the object, for example in response to external stimuli, the passage of time, etc., a position of the object within an environment in which it will be generated, etc. According to examples of the present disclosure, the contextual prompt places meaning present in the heterogeneous input data with respect to the virtual 3D object into a format that can be processed by a processing node. Finally, on detection of a trigger event in step 140, the method 100 comprises providing the contextual prompt to a processing node operable to facilitate generation of the virtual 3D object.
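The four steps of the method 100 can be sketched as follows. This is an illustrative outline only, under assumed input sources and field names (none of the function names or data fields below are specified by the disclosure): heterogeneous inputs are obtained (step 110), aggregated into a single structure (step 120), rendered into a contextual prompt conveying their semantic content (step 130), and provided onward when a trigger fires (step 140).

```python
import json

def obtain_inputs() -> list:
    # Step 110: heterogeneous input data from several assumed sources.
    return [
        {"source": "thermometer", "celsius": 45},
        {"source": "vision", "object": "car"},
        {"source": "speech", "intent": "place a red car on the table"},
    ]

def aggregate(inputs: list) -> str:
    # Step 120: organize the data into a single format, e.g. JSON.
    return json.dumps({"inputs": inputs})

def generate_contextual_prompt(aggregated: str) -> str:
    # Step 130: a text string conveying semantic information derived
    # from the aggregated data (here, a trivial prose rendering).
    data = json.loads(aggregated)
    parts = [
        d["source"] + ": " + ", ".join(
            f"{k}={v}" for k, v in d.items() if k != "source")
        for d in data["inputs"]
    ]
    return "Generate a virtual 3D object given: " + "; ".join(parts)

def on_trigger(send) -> None:
    # Step 140: on a trigger event, provide the contextual prompt to
    # the processing node (here abstracted as a `send` callable).
    send(generate_contextual_prompt(aggregate(obtain_inputs())))
```

In a deployment, `send` would be a transport to the processing node and `obtain_inputs` would read live sensors, computer vision outputs, and user intent capture.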
Figures 2a and 2b show flow charts illustrating another example of a computer implemented method 200 for supporting generation of one or more virtual 3D objects, which, in some examples, may be rendered by an Extended Reality (XR) client. As for the method 100 above, the method 200 is described below with reference to generation of a single virtual 3D object, although it will be appreciated that the method 200 may be used to generate a plurality of virtual 3D objects, following the steps outlined below. The method 200 is performed by a preprocessing node, which may comprise a physical or virtual node, and may be implemented in a computer system, computing device or server apparatus, a user device, such as a wired or wireless device, an XR headset, etc., and/or in a virtualized environment, for example in a cloud, edge cloud, or fog deployment. Examples of a virtual node may include a piece of software or computer program, a code fragment operable to implement a computer program, a virtualised
function, or any other logical entity. The preprocessing node may encompass multiple logical entities, as discussed in greater detail below. The method 200 illustrates examples of how the steps of the method 100 may be implemented and supplemented to provide the above discussed and additional functionality.
Referring initially to Figure 2a, in a first step 210, the preprocessing node obtains heterogeneous input data for the virtual 3D object. As illustrated at steps 210a to 210d, the heterogeneous input data may take different forms. In some examples, the heterogeneous input data comprises at least two of the different types of data set out at steps 210a to 210d.
In a first example 210a, the heterogeneous input data may comprise sensor data comprising measurements obtained in a physical environment within which the virtual 3D object will be generated. It will be appreciated that the 3D object may be rendered in a fully virtual environment, or a hybrid environment, for example being caused to appear on a table or other object that is physically present in a room. Regardless of whether the object is rendered in a fully virtual or hybrid environment, a user of the XR client rendering the virtual 3D object will be present within a physical environment, and so the virtual 3D object is generated within this physical environment. Sensor data obtained from this physical environment may consequently form part of the heterogeneous input data for the invention. Such sensor data may include temperature data, pressure data, humidity data, light levels, motion, etc. In some examples, the measurements obtained in the physical environment may comprise measurements representing a state of at least one of the physical environment itself and/or one or more elements present in the physical environment. Such elements may include a user of an XR client, and/or a user device such as a Head Mounted Display (HMD) that interacts with the XR client to present the rendered virtual 3D object to the user.
In some examples, the sensor data can consequently include motion and/or positional data relating to the movement, position, orientation etc. of the user or any component part of the user (head, limb, eye gaze, etc. of a human user). In this manner, properties and/or behavior of the virtual 3D object can be adapted to respond to user movement including user attention being directed to the object (identified via eye gaze), and/or user movement such as placing a hand or handheld tool in the vicinity of the object. For example, the virtual 3D object may be caused to deform or to move in response to “contact” with the user or with another object.
In another example 210b, the heterogeneous input data may comprise object data comprising objects detected in the physical environment discussed above by a computer vision process such as object recognition. Such objects may include furniture, equipment, other users, etc. In another example 210c, the heterogeneous input data may comprise data observed or measured in the physical environment during a historical time period. Such data may include past environmental data like historical weather conditions or previously observed user interactions within an environment. The historical observed data may be obtained from an external data source, for example via a web Application Programming Interface (API).
In another example 210d, the heterogeneous input data may include user intent data comprising a representation of an intent of a user of the XR client with respect to the virtual 3D object. User intent data may include for example spoken instructions from a user, and/or instructions conveyed in any other manner, including gestures, interaction with a user device such as an HMD etc. Speech recognition processing may be used to capture spoken instructions from a user. It will be appreciated that while in some examples the user intent data may relate explicitly to the virtual 3D object, in other examples, the user intent data may also or alternatively be relevant to the virtual 3D object while not being explicitly about the virtual 3D object. For example, the user intent data may be implicit based on the scene and/or context, and may in some examples relate to a physical environment within which the virtual 3D object will be generated.
Referring still to Figure 2a, in step 220, the preprocessing node aggregates the heterogeneous input data. This may comprise concatenating the data, and/or organizing the data into a single format, such as a JSON format. The preprocessing node may extract semantic information from the data itself, for example via mapping to semantic categories through use of thresholds or other mapping tools. In one example, if a temperature sensor returns 45 degrees Celsius, the preprocessing node may aggregate this with other heterogeneous input data for use in generating the contextual prompt as “hot temperature”. Similarly, a relative humidity of 98% may be mapped in step 220 to “high humidity”, and concatenated or otherwise combined with other sensor readings, and inputs from other sources. In another example, the aggregation step 220 may comprise the preprocessing node converting measurement units to align with a user context. For example, angles may be converted to radians, degrees Celsius to degrees Fahrenheit for temperature, etc. Considering sensor data relating to a user of the XR HMD, the
preprocessing node may, in step 220, map a combination of sensor measurements to extract meaning. This meaning can be assembled with the remaining heterogeneous input data for generation of the contextual prompt. In one example, sensor data relating to user eye gaze (camera) and motion (accelerometer) may be used to detect user attention and gait, and eye gaze patterns and gait may be used to infer a user state, for example via mapping, Machine Learning processes, or other conversion or heuristic based methods. In the present example, user eye gaze and gait may lead to an inference that “the user is tired”, and this inference may be assembled with other input data for use in generating the contextual prompt.
Considering other types of input data, computer vision module information may be translated to map an object identification into a text description such as “a car”, “a tree”, etc. In addition, the relative position of an identified object may be generated, such as “the car is between two trees”. Other examples of assembling and mapping or conversion of individual input data items can be envisaged with respect to the different types of input data discussed above. The aggregation step 220 may thus comprise a wide range of different possibilities, according to which raw input data obtained from a range of different sources may be processed via mapping, conversion, interpretation, or simple assembly, to form an aggregated input data set. Following the different processing options, the data may be concatenated or otherwise assembled for further processing, for example as a JSON object as discussed above.
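The threshold-based mapping and JSON assembly described for step 220 can be sketched as follows. The specific thresholds, labels, and function names are assumptions chosen to mirror the examples in the text (45 °C becoming “hot temperature”, 98% relative humidity becoming “high humidity”), not values specified by the disclosure.

```python
import json

def label_temperature(celsius: float) -> str:
    # Threshold-based semantic mapping, e.g. 45 C -> "hot temperature".
    if celsius >= 35:
        return "hot temperature"
    if celsius <= 5:
        return "cold temperature"
    return "mild temperature"

def label_humidity(relative_pct: float) -> str:
    return "high humidity" if relative_pct >= 80 else "moderate humidity"

def aggregate_inputs(celsius: float, humidity_pct: float,
                     detected_objects: list) -> str:
    # Step 220: combine semantic labels with computer-vision detections
    # into one JSON structure for contextual prompt generation.
    return json.dumps({
        "environment": [label_temperature(celsius),
                        label_humidity(humidity_pct)],
        "objects": detected_objects,
    })
```

For instance, `aggregate_inputs(45, 98, ["a car", "a tree"])` yields a JSON object whose environment field contains “hot temperature” and “high humidity”.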
In step 230, the preprocessing node then generates a contextual prompt from the aggregated heterogeneous data. As illustrated at 230a, the contextual prompt comprises a text string conveying semantic information with respect to the virtual 3D object, which semantic information is derived from the aggregated heterogeneous input data. In some examples, the contextual prompt may express the information contained in the aggregated heterogeneous input data, simply reorganized into a text string expressing the information as prose. In some examples, the generation of the contextual prompt may be carried out by inputting the aggregated heterogeneous input data to a Machine Learning (ML) model such as a Large Language Model (LLM). Considering the example of a LLM, this may comprise inputting to the LLM an LLM prompt comprising (i) the aggregated input data from step 220 (for example in the form of a JSON object), and (ii) a request that the LLM generate a contextual prompt reflecting the aggregated input data. In some examples, the input to the LLM may additionally specify the format for the contextual prompt, for example “generate a contextual prompt for a virtual 3D object
based on the following input [aggregated input data], the contextual prompt should be expressed in prose and include instructions relating to all elements of the input”. The exact wording for the input to the LLM in addition to the aggregated input data may be refined over a training period in order to arrive at a contextual prompt that achieves the required performance in later method steps. The LLM may already be trained, and may in some examples undergo additional fine-tuning using examples to achieve a contextual prompt that more closely matches the specific context of the user. Use of an LLM to generate the contextual prompt offers a degree of flexibility and adaptability in terms of the input structure of the aggregated input data, and the output structure for the contextual prompt.
In further examples, the generation of the contextual prompt may be performed using software components or ML models other than an LLM. In some examples, a process or software component may be used to convert the aggregated input data directly to a contextual prompt in a specific structure or format for subsequent processing. Use of a dedicated process for generation of the contextual prompt offers specificity for subsequent processing components.
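The assembly of the input to the LLM in step 230 can be sketched as follows; the wording follows the example prompt given above, while the function name is an assumption for illustration.

```python
import json

# Hypothetical sketch of assembling the LLM input for step 230.
def build_llm_request(aggregated_input):
    """Combine the aggregated input data (as a JSON object) with a request
    that the LLM generate a contextual prompt reflecting that input."""
    return (
        "Generate a contextual prompt for a virtual 3D object based on the "
        "following input: " + json.dumps(aggregated_input) + ". The "
        "contextual prompt should be expressed in prose and include "
        "instructions relating to all elements of the input."
    )
```

The returned string would then be submitted to the (already trained, and optionally fine-tuned) LLM.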
Referring now to Figure 2b, the preprocessing node performs subsequent method steps on detection of a trigger event at step 240. As illustrated at 240a, the trigger event may comprise any one or more of an action of a user of an XR client and/or XR user device such as an HMD, a voice command of a user of an XR client and/or XR user device, and/or an event detected in a physical environment within which the virtual 3D object will be generated. The detected event could for example be a threshold value of a measured parameter being exceeded, or any other kind of event able to be detected in the environment.
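The trigger event detection at step 240 can be sketched as below; the event shapes and threshold values are assumptions for the example only.

```python
# Illustrative sketch of trigger event detection (step 240).
def is_trigger_event(event, thresholds):
    """Return True for a user action, a voice command, or a measured
    parameter exceeding its configured threshold."""
    if event.get("type") in ("user_action", "voice_command"):
        return True
    if event.get("type") == "measurement":
        limit = thresholds.get(event.get("parameter"))
        return limit is not None and event.get("value", 0) > limit
    return False
```

Any other detectable environmental event could be accommodated by extending the set of recognised event types.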
On detection of the trigger event at step 240, the preprocessing node then, at step 250, provides the contextual prompt to a processing node operable to facilitate generation of the virtual 3D object. As illustrated at 250a, the processing node may be operable to facilitate generation of the virtual 3D object by using an ML model to generate a representation of the virtual 3D object, and contextual metadata for the virtual 3D object.
In some examples the preprocessing node may, at step 260, receive a request from the processing node to provide updated input data, and, at step 270, provide updated input data to the processing node. In some examples, the preprocessing node may for
example expose one or more functions relating to the heterogeneous input data to the processing node, such that the processing node may obtain, via the preprocessing node, updated values for input data provided by the exposed functions. This may be particularly appropriate for sensor data, which may be refreshed to ensure the processing node has values that are current for generation of the virtual 3D object. Such updated data may be provided directly to the processing node without aggregation and inclusion in a generated contextual prompt. For example, a contextual prompt may specify behaviour of the virtual 3D object under different temperature conditions. The preprocessing node may in such an example expose suitable temperature sensors to the processing node, enabling the processing node to request and receive up-to-date temperature sensor readings via a suitable function call. It will be appreciated that a wide range of input data sources may be exposed in this manner, including for example motion sensor readings to obtain up-to-date positional information on the user or other objects in a physical environment, light readings for the environment, object detection information, etc. Any one or more of the types of input data discussed above may be refreshed in this manner.
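The exposure of input data sources as callable functions (steps 260 and 270) can be sketched as follows; the class and method names are illustrative assumptions.

```python
# Sketch of exposing input data sources for refresh by the processing node.
class PreprocessingNode:
    def __init__(self):
        self._exposed = {}

    def expose(self, name, read_fn):
        """Register a data source the processing node may later refresh."""
        self._exposed[name] = read_fn

    def get_updated_value(self, name):
        """Serve an up-to-date reading directly, bypassing aggregation
        and contextual prompt generation."""
        return self._exposed[name]()
```

For instance, exposing a temperature sensor via `node.expose("temperature", read_temperature_sensor)` would let the processing node call `node.get_updated_value("temperature")` to refresh the reading on demand.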
It will be appreciated that, as discussed above with particular reference to the generation of the contextual prompt at step 230, at least some of the steps of the method 200 may be performed by an ML model, and may in some examples be performed by a Large Language Model (LLM). For the purposes of the present specification, an LLM is considered to comprise a transformer based Neural Network that is operable to learn context and meaning by tracking relationships in sequential data, and may comprise a minimum of one billion parameters.
The methods 100 and 200 may be complemented by a method 300 performed by a processing node, which receives the contextual prompt from the preprocessing node, and inputs the contextual prompt to an ML model to generate a representation of the virtual 3D object, and contextual metadata for the virtual 3D object.
Figure 3 is a flow chart illustrating process steps in a computer implemented method 300 for supporting generation of a virtual 3D object, which, in some examples, may be rendered by an XR client. The method 300 is described below with reference to generation of a single virtual 3D object. However, it will be appreciated that the method 300 may be used to generate a plurality of virtual 3D objects, following the steps outlined below. The method 300 is performed by a processing node, which may comprise a
physical or virtual node, and may be implemented in a computer system, computing device or server apparatus, a user device, such as a wired or wireless device, an XR headset, etc., and/or in a virtualized environment, for example in a cloud, edge cloud, or fog deployment. Examples of a virtual node may include a piece of software or computer program, a code fragment operable to implement a computer program, a virtualised function, or any other logical entity. The processing node may encompass multiple logical entities, as discussed in greater detail below.
Referring to Figure 3, in a first step 310, the method 300 comprises receiving a contextual prompt from a preprocessing node, the contextual prompt comprising a text string conveying semantic information with respect to the virtual 3D object. The text string may in some examples comprise structured text such as a JSON object. The method then comprises, in step 320, inputting the contextual prompt to an ML model operable to generate a prompt for a Text-to-XR function within the processing node, and contextual metadata for the virtual 3D object. The ML model may for example comprise an LLM, which may be requested to generate the Text-to-XR prompt and contextual metadata as discussed in further detail below. In some examples, the Text-to-XR function prompt, which is to be input to the Text-to-XR function, may formally comprise a string, as in UTF8 or ASCII characters. In other examples, the Text-to-XR function prompt may comprise structured text such as a JSON object. In step 330, the method comprises using the generated prompt and the Text-to-XR function within the processing node to generate a representation of the virtual 3D object. Finally in step 340, the method comprises providing the representation of the virtual 3D object and the contextual metadata for the virtual 3D object to the XR client. It will be appreciated that the representation of the virtual 3D object is generated based on the contextual prompt, via the dedicated Text-to-XR prompt generated in step 320. The representation is therefore contextually relevant, i.e. reflects the heterogeneous input data provided to the method 100 or 200, which may include user intent, and/or other data relating to the object or the physical environment in which it will be generated. The contextual metadata for the virtual 3D object comprises information derived from the contextual prompt, and which complements information contained in the representation of the virtual 3D object. 
For example, the information conveyed by the contextual metadata may be information that is not contained in the generated representation of the virtual 3D object. In some examples, the contextual metadata may comprise information concerning one or more of behavioral characteristics of the virtual 3D object, response of the virtual 3D object to external stimuli, characteristics of the virtual 3D object that cannot be conveyed via the
generated representation, evolution of the virtual 3D object in time, movement of the virtual 3D object in space, etc. In some examples, the contextual metadata may be provided in a format that can be parsed or interpreted by a standard software library. Such a format will allow for mapping of the contextual metadata to specific application functions within the XR client, as discussed in greater detail below. A JSON object is an example of a format that can be interpreted in this manner.
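A hypothetical example of contextual metadata in a JSON format that a standard library can parse is given below; all field names and values are assumptions chosen for illustration.

```python
import json

# Hypothetical contextual metadata for a virtual 3D object, expressed in a
# format that a standard JSON library can parse (field names are assumptions).
metadata = json.loads("""
{
  "behaviour": "the object drifts slowly between the two trees",
  "response_to_stimuli": {"on_user_collision": "move away"},
  "evolution_in_time": "surface darkens as ambient light dims"
}
""")
```

An XR client could then map a key such as `"on_user_collision"` to an application-specific collision handler, as discussed below.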
The method 300 uses an ML model, and the methods 100 and 200 may use ML models. For the purposes of the present disclosure, the term “ML model” encompasses within its scope the following concepts: Machine Learning algorithms, comprising processes or instructions through which data may be used in a training process to generate a model artefact for performing a given task, or for representing a real-world process or system; and the model artefact that is created by such a training process, and which comprises the computational architecture that performs the task.
Generally, an ML model, or a representation of an ML model, can be transmitted or transferred between nodes using any existing model format such as Open Neural Network Exchange, ONNX (https://onnx.ai), or formats used in commonly used toolboxes such as Keras or PyTorch.
Figures 4a and 4b show a flow chart illustrating process steps in a computer implemented method 400 for supporting generation of a virtual 3D object e.g., by an XR client. The method 400 is described below with reference to generation of a single virtual 3D object. However, it will be appreciated that the method 400 may be used to generate a plurality of virtual 3D objects, following the steps outlined below. The method 400 is performed by a processing node, which may comprise a physical or virtual node, and may be implemented in a computer system, computing device or server apparatus, a user device, such as a wired or wireless device, an XR headset, etc., and/or in a virtualized environment, for example in a cloud, edge cloud, or fog deployment. Examples of a virtual node may include a piece of software or computer program, a code fragment operable to implement a computer program, a virtualised function, or any other logical entity. The processing node may encompass multiple logical entities, as discussed in greater detail below. The method 400 illustrates examples of how the steps of the method 300 may be implemented and supplemented to provide the above discussed and additional functionality.
Referring initially to Figure 4a, in a first step 410, the processing node receives a contextual prompt from a preprocessing node, the contextual prompt comprising a text string conveying semantic information with respect to the virtual 3D object. The contextual prompt may be substantially as described above with respect to the methods 100 and 200.
In step 412, the processing node may send a request to the preprocessing node to provide updated input data. The sending of such a request may be dependent upon the contextual prompt, and whether the contextual prompt references any information relevant to the virtual 3D object that is liable to change over time. Such information might include environmental data concerning a state of the physical environment within which the virtual 3D object will be generated, and/or information regarding the state and/or position of one or more objects within the physical environment. The sending of such a request may also or alternatively be dependent upon the preprocessing node exposing functionality and/or resources of one or more sources of input data to the processing node. These sources of input data might include sensors, computer processes such as object recognition, computer vision, natural language processing, etc. Following the sending of the request in step 412, the processing node may receive some or all of the requested updated input data from the preprocessing node at step 414.
Referring now to Figure 4b, in step 420 the processing node inputs the contextual prompt, together in some examples with some or all of the received updated input data, to an ML model operable to generate (i) a prompt for a Text-to-XR function within the processing node, and (ii) contextual metadata for the virtual 3D object. The ML model may for example comprise an LLM, as discussed above. In such examples, the contextual prompt may be input to the LLM together with a request to provide the prompt for the Text-to-XR function and the contextual metadata. As discussed above with reference to generation of the contextual prompt by the preprocessing node, the output from the LLM may be fine-tuned through adjusting of the input to the LLM, and/or additional training to ensure the outputs from the LLM are consistent with required levels of performance.
In step 430, the processing node uses the generated prompt and the Text-to-XR function within the processing node to generate a representation of the virtual 3D object. In some examples, step 430 may also use the updated input data together with the generated prompt and the Text-to-XR function within the processing node to generate the
representation of the virtual 3D object. In some examples, the Text-to-XR function may comprise a neural network. In some examples, the representation of the 3D object may comprise at least one of a 3D mesh model, a Neural Radiance Field (NeRF), and/or a Latent Diffusion Model (LDM).
In step 440, the processing node then provides the representation of the virtual 3D object and the contextual metadata for the virtual 3D object to the XR client.
The methods 100, 200, 300 and/or 400 may be complemented by a method 500 performed by an XR client, in which the representation of the virtual 3D object and the contextual metadata are used to generate the virtual 3D object.
Figure 5 is a flow chart illustrating process steps in a method 500 for generating a virtual 3D object. The method 500 is described below with reference to generation of a single virtual 3D object. However, it will be appreciated that the method 500 may be used to generate a plurality of virtual 3D objects, following the steps outlined below. The method 500 is performed by an XR client, which may comprise a physical or virtual node, and may be implemented in a computer system, computing device or server apparatus, a user device, such as a wired or wireless device, an XR headset, etc., and/or in a virtualized environment, for example in a cloud, edge cloud, or fog deployment. Examples of a virtual node may include a piece of software or computer program, a code fragment operable to implement a computer program, a virtualised function, or any other logical entity. The XR client may encompass multiple logical entities, as discussed in greater detail below.
Referring to Figure 5, in a first step 510, the XR client receives from a processing node a representation of the virtual 3D object and contextual metadata for the virtual 3D object. In step 520, the XR client renders the virtual 3D object for display to a user of the XR client, in accordance with the received representation. In step 530, the XR client maps application specific functions within the XR client to the contextual metadata. Finally, in step 540, the XR client executes the mapped application specific functions according to the contextual metadata. This execution may take place over an extended period of time during which the virtual 3D object is presented to a user, for example causing the virtual 3D object to respond to external stimuli or otherwise behave according to a contextual prompt derived from heterogeneous input data relating to the virtual 3D object, as discussed above with reference to methods 100 to 400.
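Steps 530 and 540 can be sketched as below: the XR client maps application-specific functions to keys in the contextual metadata and executes them. All function and key names are illustrative assumptions.

```python
# Sketch of steps 530 and 540: mapping application-specific functions to
# contextual metadata keys and executing them (names are assumptions).
def move_away(instruction):
    return "executed: " + instruction

FUNCTION_MAP = {"on_user_collision": move_away}

def execute_mapped_functions(metadata):
    """Execute each handler mapped to a response key in the metadata."""
    results = []
    for key, instruction in metadata.get("response_to_stimuli", {}).items():
        handler = FUNCTION_MAP.get(key)
        if handler is not None:
            results.append(handler(instruction))
    return results
```

In a real XR client such handlers would be invoked repeatedly over the lifetime of the virtual 3D object, for example from a game engine update loop, rather than once.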
As discussed above, the methods 100 and 200 are performed by a preprocessing node, and the present disclosure provides a preprocessing node that is adapted to perform any or all of the steps of the above discussed methods. The preprocessing node may comprise a physical or virtual node, and may be implemented in a computer system, computing device or server apparatus, a user device, such as a wired or wireless device, an XR headset, etc., and/or in a virtualized environment, for example in a cloud, edge cloud, or fog deployment. Examples of a virtual node may include a piece of software or computer program, a code fragment operable to implement a computer program, a virtualised function, or any other logical entity.
Figure 6 is a block diagram illustrating an example preprocessing node 600 which may implement the method 100 and/or 200, as illustrated in Figures 1 and 2a and 2b, according to examples of the present disclosure, for example on receipt of suitable instructions from a computer program 650. Referring to Figure 6 the preprocessing node 600 comprises a processor or processing circuitry 602, and may comprise a memory 604 and interfaces 606. The processing circuitry 602 is operable to perform some or all of the steps of the method 100 and/or 200 as discussed above with reference to Figures 1 and 2a and 2b. The memory 604 may contain instructions executable by the processing circuitry 602 such that the preprocessing node 600 is operable to perform some or all of the steps of the method 100 and/or 200, as illustrated in Figures 1 and 2a and 2b. The instructions may also include instructions for executing one or more telecommunications and/or data communications protocols. The instructions may be stored in the form of the computer program 650. In some examples, the processor or processing circuitry 602 may include one or more microprocessors or microcontrollers, as well as other digital hardware, which may include digital signal processors (DSPs), special-purpose digital logic, etc. The processor or processing circuitry 602 may be implemented by any type of integrated circuit, such as an Application Specific Integrated Circuit (ASIC), Field Programmable Gate Array (FPGA) etc. The memory 604 may include one or several types of memory suitable for the processor, such as read-only memory (ROM), random-access memory, cache memory, flash memory devices, optical storage devices, solid state disk, hard disk drive, etc.
As discussed above, the methods 300 and 400 are performed by a processing node, and the present disclosure provides a processing node that is adapted to perform any or all of the steps of the above discussed methods. The processing node may comprise a
physical or virtual node, and may be implemented in a computer system, computing device or server apparatus, a user device, such as a wired or wireless device, an XR headset, etc., and/or in a virtualized environment, for example in a cloud, edge cloud, or fog deployment. Examples of a virtual node may include a piece of software or computer program, a code fragment operable to implement a computer program, a virtualised function, or any other logical entity.
Figure 7 is a block diagram illustrating an example processing node 700 which may implement the method 300 and/or 400, as illustrated in Figures 3 and 4a and 4b, according to examples of the present disclosure, for example on receipt of suitable instructions from a computer program 750. Referring to Figure 7, the processing node 700 comprises a processor or processing circuitry 702, and may comprise a memory 704 and interfaces 706. The processing circuitry 702 is operable to perform some or all of the steps of the method 300 and/or 400 as discussed above with reference to Figures 3 and 4a and 4b. The memory 704 may contain instructions executable by the processing circuitry 702 such that the processing node 700 is operable to perform some or all of the steps of the method 300 and/or 400, as illustrated in Figures 3 and 4a and 4b. The instructions may also include instructions for executing one or more telecommunications and/or data communications protocols. The instructions may be stored in the form of the computer program 750. In some examples, the processor or processing circuitry 702 may include one or more microprocessors or microcontrollers, as well as other digital hardware, which may include digital signal processors (DSPs), special-purpose digital logic, etc. The processor or processing circuitry 702 may be implemented by any type of integrated circuit, such as an Application Specific Integrated Circuit (ASIC), Field Programmable Gate Array (FPGA) etc. The memory 704 may include one or several types of memory suitable for the processor, such as read-only memory (ROM), random-access memory, cache memory, flash memory devices, optical storage devices, solid state disk, hard disk drive, etc.
As discussed above, the method 500 is performed by an XR client, and the present disclosure provides an XR client that is adapted to perform any or all of the steps of the above discussed methods. The XR client may comprise a physical or virtual node, and may be implemented in a computer system, computing device or server apparatus, a user device, such as a wired or wireless device, an XR headset, etc., and/or in a virtualized environment, for example in a cloud, edge cloud, or fog deployment. Examples of a virtual node may include a piece of software or computer
program, a code fragment operable to implement a computer program, a virtualised function, or any other logical entity.
Figure 8 is a block diagram illustrating an example XR client 800 which may implement the method 500, as illustrated in Figure 5, according to examples of the present disclosure, for example on receipt of suitable instructions from a computer program 850. Referring to Figure 8, the XR client 800 comprises a processor or processing circuitry 802, and may comprise a memory 804 and interfaces 806. The processing circuitry 802 is operable to perform some or all of the steps of the method 500 as discussed above with reference to Figure 5. The memory 804 may contain instructions executable by the processing circuitry 802 such that the XR client 800 is operable to perform some or all of the steps of the method 500, as illustrated in Figure 5. The instructions may also include instructions for executing one or more telecommunications and/or data communications protocols. The instructions may be stored in the form of the computer program 850. In some examples, the processor or processing circuitry 802 may include one or more microprocessors or microcontrollers, as well as other digital hardware, which may include digital signal processors (DSPs), special-purpose digital logic, etc. The processor or processing circuitry 802 may be implemented by any type of integrated circuit, such as an Application Specific Integrated Circuit (ASIC), Field Programmable Gate Array (FPGA) etc. The memory 804 may include one or several types of memory suitable for the processor, such as read-only memory (ROM), random-access memory, cache memory, flash memory devices, optical storage devices, solid state disk, hard disk drive, etc.
In some examples of the present disclosure, the processing node and preprocessing node may be considered as a system, and examples of the present disclosure provide a system for supporting generation of a virtual 3D object by an XR client. An example of such a system is illustrated in Figure 9. Referring to Figure 9, the system 900 comprises a preprocessing node 910, which may comprise a preprocessing node 600 as described above, and a processing node 920, which may comprise a processing node 700 as described above. In some examples the system may further comprise an XR client 930, which may comprise an XR client 800 as described above. As summarised in Figure 9, the system may obtain heterogeneous input data which is passed to the preprocessing node 910. The preprocessing node 910 uses the heterogeneous input data to generate a contextual prompt, which is provided to the processing node 920. The processing node 920 then uses the contextual prompt to generate a representation of the virtual 3D object
and contextual metadata for the 3D object. The representation and contextual metadata are then provided to an XR client, which may or may not form part of the system, and which then provides the rendered virtual 3D object and causes it to adopt behaviour in accordance with the representation and contextual metadata.
The above functionality may be achieved by causing the preprocessing node 910 to perform examples of the method 100 and/or 200, causing the processing node to perform examples of the method 300 and/or 400, and causing the XR client to perform examples of the method 500.
In some examples, the system 900 may be implemented on a user device which may be the same user device on which the XR client is running, i.e. all three of the preprocessing node 910, processing node 920 and XR client 930 may be running on an HMD. In other examples, one or more of preprocessing node 910 or processing node 920 could be running in a virtualised environment such as cloud, edge cloud, etc.
Figures 1 to 5 discussed above provide an overview of methods which may be performed according to different examples of the present disclosure. These methods may be performed by a preprocessing node, processing node and XR client respectively, as illustrated in Figures 6 to 8. The methods enable the generation of a virtual 3D object which is more accurate and representative of the context in which it is generated. There now follows a detailed discussion of how different process steps illustrated in Figures 1 to 5 and discussed above may be implemented. The functionality and implementation detail described below is discussed with reference to the modules of Figures 6 to 8, and the system of Figure 9, performing examples of the methods 100, 200, 300, 400, and/or 500, substantially as described above.
In some examples, the methods disclosed herein may be implemented, executed, and facilitated via a range of software components, as set out below. These components form an implementation framework for the methods, and include input data components, operable to provide the heterogeneous input data on which generation of the virtual 3D object is based, processing components operable to generate the contextual prompt, representation and contextual metadata, and an XR component, operable to render the virtual 3D object and to interpret and execute the contextual metadata.
Input components:
Computer Vision Component (CVC). The CVC is responsible for identifying objects in the surrounding environment of the user, and consequently may provide the object data input 210b. The input of the CVC may come directly from the XR HMD of the user, or from external sources such as nearby video cameras. The CVC may run algorithmic computer vision processes, or use a specialized neural network, such as semantic segmentation, to recognize objects. The output of the CVC process may be a list of identified objects.
Sensor Data Component (SDC). The SDC is responsible for gathering sensor data from one or more data sources, and consequently may provide the sensor data input 210a. For example, the SDC may provide sensor data from the XR HMD of the user, such as accelerometer data used to estimate orientation and pose. Alternatively or in addition, the SDC may discover nearby sensors and fetch data, such as temperature or light sensor readings. The output of the SDC may be, for example, a list of discovered sensors, their values, and a timestamp.
User Interaction Component (UIC). The UIC is responsible for interaction with the user and is used to grasp the user’s intent. Consequently, the UIC may provide the user intent data input 210d. For example, the UIC may use speech recognition to capture the goal of the user. Alternatively or in addition, an algorithm may determine what the user is trying to achieve using data from the HMD. The user intent may also provide information about contextual metadata to be generated, expressing for example desired characteristics, behavior, response to external stimuli etc. of the object. The output of the UIC may be a string that describes the detected intent of the user.
Large Language Model Preprocessor (LLMPP). The LLMPP is an implementation of the preprocessing node 600, and is responsible for aggregating the data outputted by the CVC, SDC, and UIC, into a single and coherent data structure, and for generating the contextual prompt (via methods 100, 200). The LLMPP may apply some data processing methods, such as reducing verbosity of the data, aggregation such as averaging, or even augment the data, e.g., adding information about the range of the values in a timeseries. The output of the LLMPP is a contextual prompt that forms a well-structured dataset which is tailored for easy consumption by the LLM. The LLMPP may itself in some examples comprise an LLM.
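The data processing applied by the LLMPP can be sketched as below: a verbose timeseries is reduced to an average and augmented with information about its value range. The output shape is an assumption for illustration.

```python
# Sketch of LLMPP timeseries processing: averaging plus range augmentation
# (the output field names are illustrative assumptions).
def summarize_timeseries(name, values):
    """Reduce a timeseries to an average and augment it with its range."""
    return {
        "n": name,
        "avg": sum(values) / len(values),
        "range": [min(values), max(values)],
    }
```

For example, `summarize_timeseries("temperature", [19, 20, 21])` yields `{"n": "temperature", "avg": 20.0, "range": [19, 21]}`, a far more compact input for the contextual prompt than the raw series.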
Text-to-3D model, including a Large Language Model (LLM). The Text-to-3D model is an implementation of the processing node 700, and is responsible for the generation of a representation of the virtual 3D object, for example a 3D mesh model such as a glTF file, and the contextual metadata which is consumable by the XR application, e.g., links to external data sources, simple actions to take if a collision with the user is detected, or if other conditions such as changes in the environment are fulfilled.
Extended Reality Environment Component (XREC). The XREC is an implementation of the XR client 800, and is the interface between the LLM and the XR application. Its main responsibilities are the rendering of the representation, which may be a 3D mesh such as a glTF file, and the parsing of the contextual metadata. The XREC may be implemented directly into the XR application, or as a library for the game engine, such as Unity, used to produce the XR application.
Figure 10 illustrates the interaction between the above discussed components according to examples of the present disclosure. Referring to Figure 10, the CVC, SDC, and UIC provide heterogeneous input data to the LLMPP. The LLMPP aggregates the received input data, and generates a contextual prompt for the Text-to-3D model, denoted LLM in Fig. 10 but comprising both LLM and Text-to-XR functionality. The Text-to-3D model, on receipt of the contextual prompt, may make function calls to the LLMPP to obtain updated values of certain input data, and generates a representation of the virtual 3D object for rendering, as well as contextual metadata for the virtual 3D object. This representation and contextual metadata allow for the rendering of a detailed and contextually relevant 3D object in the XR environment.
Figure 11 is a sequence diagram illustrating an example implementation of the methods disclosed herein. Figure 11 illustrates message exchange and functionality between a CVC, SDC and UIC as discussed above, an LLMPP, an LLM (representing the Text-to-3D model and so comprising both LLM and Text-to-XR functionality), and an XREC.
Referring to Figure 11, and to the methods 100 to 500 discussed above, an example implementation of the methods disclosed herein may be executed as follows:
Data Collection (steps 110, 210, 210a-210d of methods 100 and 200, steps 1, 2, 3 of Figure 11): The implementation framework collects data from various sources. The CVC detects objects in the user's environment and provides this information to
the LLMPP. For instance, the CVC may deliver such information encoded in JSON, although other formats may also be used, as shown below:
{"objects_detected": ["table", "chair", "person"]}
The UIC may also provide intent data to the LLMPP, such as via speech recognition, for example:
{"intent": "Show me a 3D model of the solar system."}
Substantially simultaneously, the SDC collects real-time measurements from local sensors and actuators or via other APIs, feeding this data into the LLMPP. An example of delivered sensor data is provided below:

{"payload": [
{"n": "temperature", "u": "Cel", "v": 20},
{"n": "light_level", "vs": "dim"},
{"n": "user_heart_rate", "u": "bpm", "v": 70}]
}
The above example is expressed in JSON/SenML as set out in RFC 8428 (https://datatracker.ietf.org/doc/html/rfc8428), although other formats may be used.
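For illustration only, such a SenML-style payload could be normalized into a flat name-to-value mapping before aggregation. The sketch below handles only the fields used in the example above; a full implementation would follow RFC 8428, including base fields, units, and resolved names.

```python
import json

def flatten_senml(payload_json):
    """Collapse a SenML-style record list into a flat {name: value} dict.

    Minimal sketch: handles only the 'n' (name), 'v' (numeric value) and
    'vs' (string value) fields used in the example payload, not the full
    RFC 8428 feature set (base names, base units, relative times, etc.)."""
    records = json.loads(payload_json)["payload"]
    flat = {}
    for record in records:
        # Numeric values use "v"; string values use "vs".
        flat[record["n"]] = record["v"] if "v" in record else record["vs"]
    return flat
```

Normalizing heterogeneous sensor formats into one mapping at this stage simplifies the aggregation step that follows.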
The SDC may additionally interact with other APIs to obtain for example historical environmental data for the physical environment concerned. In other examples, the LLMPP may interact with such APIs to obtain the historical data.
Data Preprocessing (steps 120, 130, 130a, 220, 230, 230a of methods 100 and 200, step 4 of Figure 11): The LLMPP processes all the incoming data, including object detection information from the CVC, sensor data from the SDC, environmental data, and user intent data from the UIC. The preprocessing stage helps to create a comprehensive understanding of the environment that the user with the XR HMD is experiencing. The aggregate information collected by the above three components may be delivered, in JSON format, as shown below:

{"context": {
"room_temperature": 20,
"light_level": "dim",
"objects_detected": ["table", "chair", "person"],
"user_intent": "Show me a 3D model of the solar system where the sun becomes red when I look at it; it is yellow otherwise."}}
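The aggregation performed by the LLMPP at this step could be sketched as follows. This is a simplified illustration; the function name and context keys are assumptions derived from the example JSON objects in this disclosure.

```python
def aggregate_context(cv_data, sensor_data, intent_data):
    """Merge heterogeneous inputs from the CVC, SDC, and UIC into a single
    context object for prompt generation. Illustrative sketch only.

    cv_data:     dict from the computer vision component,
                 e.g. {"objects_detected": [...]}
    sensor_data: flat dict of sensor readings,
                 e.g. {"room_temperature": 20, "light_level": "dim"}
    intent_data: dict from the user intent component,
                 e.g. {"intent": "..."}"""
    context = {}
    context.update(sensor_data)  # e.g. room temperature, light level
    context["objects_detected"] = cv_data.get("objects_detected", [])
    context["user_intent"] = intent_data.get("intent", "")
    return {"context": context}
```

The single aggregated object means later stages need not know which component originally produced each field.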
Trigger Event and delivery of contextual prompt (steps 140, 150, 240, 240a, 250, 250a of methods 100, 200, steps 5 and 7 of Figure 11): At a certain point, a trigger event occurs. This could be a specific user action, a voice command, or another type of event that initiates the generation of an XR object. When such a trigger event occurs, the LLMPP delivers to the LLM a contextual prompt, derived from the aggregated input information. An example of such a prompt is shown below:
{"prompt": "Generate a 3D model of the solar system that fits on the table in a dimly lit room. When the user looks at the sun, its color transitions from yellow to red."}
The example above provides information regarding the virtual 3D object to be generated and the contextual metadata that will be generated to enable the object to have characteristics and behavior that reflect the intent of the user and the context in which it is rendered.
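One possible way for the LLMPP to derive such a contextual prompt from the aggregated context is simple template filling, sketched below. The template and field names are hypothetical; in practice the phrasing step may itself be delegated to an LLM.

```python
def build_contextual_prompt(aggregated):
    """Turn the aggregated context object into a single text prompt.

    Illustrative template only; a deployed LLMPP might instead ask an LLM
    to phrase the prompt, or use a richer templating scheme."""
    ctx = aggregated["context"]
    scene = ", ".join(ctx.get("objects_detected", []))
    return {
        "prompt": (
            f"{ctx.get('user_intent', '')} "
            f"Scene contains: {scene}. "
            f"Lighting is {ctx.get('light_level', 'unknown')}; "
            f"room temperature is {ctx.get('room_temperature', 'unknown')} C."
        )
    }
```

Even this naive template already folds the user intent, detected objects, and sensor state into one text string, which is the essential property of the contextual prompt.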
Contextual Information Processing, model and metadata generation (steps 310, 320, 330, 410, 412, 414, 420, 430, 260, 270 of methods 300, 400, and 200, steps 6 and 8 of Figure 11): Upon the trigger event, the contextual prompt is provided and input to the LLM. The LLM generates a contextually relevant prompt for the Text-to-XR function of the Text-to-3D model. At this point the implementation of the LLM can also call one or more functions exposed at the LLMPP (e.g., get_room_temperature) when it seeks to refresh certain input data. The Text-to-XR function takes the prompt and generates the requested 3D model. This model is contextually relevant, meaning it is based on the user's intent and the environmental context, as represented by the heterogeneous input data on which the contextual prompt was based. Simultaneously, the LLM can also output metadata that enhances the 3D model. An example of a 3D model representation is a glTF document, while the contextual metadata may be a generic JSON object.
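The function-calling step can be sketched as a small registry of refresh functions exposed by the LLMPP. The name get_room_temperature follows the example in the text; everything else in this sketch is a hypothetical assumption, and a real deployment would wire the functions to live sensors and to the LLM provider's function-calling interface.

```python
class LLMPPFunctions:
    """Functions the LLMPP exposes so the LLM can refresh stale input data.

    Sketch only: a single registry mapping function names (as the LLM would
    emit them in a function call) to local callables."""

    def __init__(self, temperature_source):
        # temperature_source: zero-argument callable returning the latest
        # reading, e.g. backed by the SDC (hypothetical wiring).
        self._temperature_source = temperature_source
        self._registry = {"get_room_temperature": self.get_room_temperature}

    def get_room_temperature(self):
        return self._temperature_source()

    def call(self, function_name):
        """Entry point for LLM-issued function calls by name."""
        if function_name not in self._registry:
            raise KeyError(f"unknown function: {function_name}")
        return self._registry[function_name]()
```

Because the LLM addresses functions by name, new refresh functions can be added to the registry without changing the calling convention.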
Taking the example contextual prompt above, the sentence "When the user looks at the sun, its color transitions from yellow to red" may be used by the LLM to produce platform-agnostic, well-formatted instructions for a client application, e.g., the XREC. The generated contextual metadata may resemble:

{"metadata": {"events": [{
"gaze_in": {
"element": "sun",
"color": "red"
},
"gaze_out": {
"element": "sun",
"color": "yellow"
}
}]
}}
The above snippet provides generic instructions in a well-formatted JSON object, and can be interpreted by any software library on the XR application. Such a library may map application specific functions to the LLM-generated metadata. This solution adds a layer of indirection that relieves the application developer of writing hard-coded logic for handling dynamically generated information.
3D Model Rendering (steps 340, 440, 510 to 540 of methods 300, 400, 500, step 9 of Figure 11): The Text-to-XR function sends the generated 3D model and contextual metadata to the XR Environment, which renders the model for the user to see, and executes application specific functions in accordance with the contextual metadata. For example, the XR client may parse the contextual metadata, and then map the contextual metadata to individual functionality exposed by the XR runtime. After the mapping, the instructions may be placed into an execution queue and the XR application may execute them according to an execution policy. An execution policy may be periodic execution, or as fast as possible in a loop fashion, etc. In the case of the above example contextual metadata, the "gaze in/out" provided in the metadata maps to a collision detector that checks whether the eyes enter the collider associated with the 3D model of the sun. When this happens, the XR application fires the queued instructions, e.g., the sun changes color.
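The queue-and-policy behaviour described above might look like the following sketch, where the condition standing in for gaze collision detection is a hypothetical predicate supplied by the XR runtime; all names are illustrative.

```python
from collections import deque

class InstructionQueue:
    """Sketch of the XR client's execution path: metadata-derived
    instructions wait in a queue and fire when their runtime condition
    (e.g. a gaze collider check) is satisfied."""

    def __init__(self):
        self._queue = deque()

    def enqueue(self, condition, action):
        """condition: zero-argument predicate (e.g. 'is the user gazing at
        the sun?'); action: the mapped application-specific function."""
        self._queue.append((condition, action))

    def tick(self):
        """One pass of the execution policy, e.g. called once per frame.

        Fires and removes every instruction whose condition holds; the rest
        are re-queued. Returns the number of instructions fired."""
        fired = 0
        for _ in range(len(self._queue)):
            condition, action = self._queue.popleft()
            if condition():
                action()
                fired += 1
            else:
                self._queue.append((condition, action))
        return fired
```

Whether tick() runs periodically or in a tight loop is exactly the execution-policy choice mentioned above; the queue itself is policy-agnostic.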
User Experience: The user, wearing the XR HMD, sees the rendered 3D model in their environment, providing an immersive and contextually relevant XR experience.
The above example implementation may make use of commercially available LLMs including, but not limited to, GPT-4, Llama 3, Phi 3, Mistral, Gemma, etc. Additional training of such commercially available LLMs is not required for implementation of examples of the methods proposed herein.
Examples of the methods presented herein offer improved performance in the rendering of virtual 3D objects, enhancing the quality of 3D content creation outputs by using more detailed and varied input information. The existing technology for 3D content creation, though advanced, comes with some limitations given its recent development. In the context of text-to-3D image conversion, existing solutions are limited to plain text commands and lack the ability to incorporate additional context specific to the current scene. Such inputs fail to account for real-time changes or variables and can result in 3D images that lack relevancy or accuracy in certain contexts. In addition, inputs given by the user are often underutilized, widening the gap between the user's intent and the final output. Whether the input is a simple command, an object identified by an AI system, or sensor data, current models tend to simplify these inputs, losing valuable information in the process. Examples of the present disclosure provide a solution that can smartly interpret and incorporate these inputs to produce high resolution and contextually relevant 3D images.
According to examples of the present disclosure, all available inputs, including text or voice commands from a user, data from computer vision recognizing an object within images or video, historical environmental data, and/or sensor data indicating current sensor measurements, location or other real-time data are used to generate a contextual prompt which is passed through an ML model such as an LLM to refine prompts for 3D rendering. Function calling may be used to ensure homogenized and up to date input data, and the LLM returns a structured output. In this manner, example methods proposed herein allow for a more consistent and detailed generation of 3D objects contextual to the immersive XR environment, offering tailored attributes like color changes based, for example, on temperature or spatial positioning. The generation of contextual metadata for parsing by the XR client enables XR applications to perform instructions without additional specific programming by the application developer.
Advantages of the methods proposed herein include the following:
Improved Accuracy: The introduction of distinct sources of input, including command specifications and sensor data, is operable to improve significantly the accuracy of the
generated 3D content. This is particularly beneficial in applications requiring precision, such as medical imaging, engineering design, and archaeological reconstruction.
Enhanced Detailing: The use of an ML model such as an LLM to generate detailed contextual prompts may lead to the creation of more in-depth and contextually relevant images. This may significantly improve the visual experience in fields such as animation and gaming, in which nuanced detailing enhances the overall immersive experience.
Customization: The ability to generate 3D objects based on user-specific commands and real-time sensor data can cater to the needs of individual users. This could be of great use in custom manufacturing, prototyping, and personalized product design.
Real-time Adaptability: Example methods proposed herein can use real-time sensor data to adapt the generated 3D images. For instance, in an XR environment, the appearance and position of 3D objects could change based on the current temperature or spatial location or orientation of the user’s HMD. This adaptability could significantly improve the realism, immersiveness, and interactive capabilities of virtual and augmented reality systems.
The methods of the present disclosure may be implemented in hardware, or as software modules running on one or more processors. The methods may also be carried out according to the instructions of a computer program, and the present disclosure also provides a computer readable medium having stored thereon a program for carrying out any of the methods described herein. A computer program embodying the disclosure may be stored on a computer readable medium, or it could, for example, be in the form of a signal such as a downloadable data signal provided from an Internet website, or it could be in any other form.
It should be noted that the above-mentioned examples illustrate rather than limit the disclosure, and that those skilled in the art will be able to design many alternative embodiments without departing from the scope of the appended claims or numbered embodiments. The word “comprising” does not exclude the presence of elements or steps other than those listed in a claim or embodiment, “a” or “an” does not exclude a plurality, and a single processor or other unit may fulfil the functions of several units recited in the claims or numbered embodiments. Any reference signs in the claims or numbered embodiments shall not be construed so as to limit their scope.
Claims
1. A computer implemented method (100) for supporting generation of a virtual 3D object, the method, performed by a preprocessing node, comprising: obtaining heterogeneous input data for the virtual 3D object (110); aggregating the heterogeneous input data (120); generating a contextual prompt from the aggregated heterogeneous data, wherein the contextual prompt comprises a text string conveying semantic information with respect to the virtual 3D object (130), which semantic information is derived from the aggregated heterogeneous input data (130a); and on detection of a trigger event, providing the contextual prompt to a processing node operable to facilitate generation of the virtual 3D object (140).
2. A method as claimed in claim 1, wherein the processing node is operable to facilitate generation of the virtual 3D object by using a Machine Learning, ML, model to generate (320, 330): a representation of the virtual 3D object, and contextual metadata for the virtual 3D object.
3. A method as claimed in claim 1 or 2, wherein the heterogeneous input data comprises at least two of: sensor data comprising measurements obtained in a physical environment within which the virtual 3D object will be generated (210a); object data comprising objects detected in the physical environment by a computer vision process (210b); historical data comprising data observed or measured in the physical environment during a historical time period (210c); user intent data comprising a representation of an intent of a user of the XR client with respect to the virtual 3D object (210d).
4. A method as claimed in claim 3, wherein measurements obtained in a physical environment within which the virtual 3D object will be generated comprise measurements representing a state of at least one of (210a): the physical environment; one or more elements present in the physical environment.
5. A method as claimed in claim 4, wherein the one or more elements present in the physical environment comprise a user of an XR client rendering the virtual 3D object.
6. A method as claimed in any one of the preceding claims, wherein the trigger event comprises at least one of (240a): an action of a user of an XR client; a voice command of a user of an XR client; an event detected in a physical environment within which the virtual 3D object will be generated.
7. A method as claimed in any one of the preceding claims, further comprising: receiving a request from the processing node to provide updated input data (260); and providing updated input data to the processing node (270).
8. A method as claimed in claim 7, wherein the updated input data comprises any one or more of the input data listed in claim 3.
9. A method as claimed in any one of the preceding claims wherein at least some of the steps of the method are performed by an ML model.
10. A method as claimed in any one of the preceding claims wherein at least some of the steps of the method are performed by a Large Language Model.
11. A computer implemented method (300) for supporting generation of a virtual 3D object, the method, performed by a processing node, comprising: receiving a contextual prompt from a preprocessing node, the contextual prompt comprising a text string conveying semantic information with respect to the virtual 3D object (310); inputting the contextual prompt to an ML model operable to generate (320): a prompt for a Text-to-XR function within the processing node; and contextual metadata for the virtual 3D object; using the generated prompt and the Text-to-XR function within the processing node to generate a representation of the virtual 3D object (330); and providing the representation of the virtual 3D object and the contextual metadata for the virtual 3D object to an XR client operable to generate the virtual 3D object (340).
12. A method as claimed in any one of the preceding claims, wherein the representation of the 3D object comprises at least one of a 3D mesh model, a Neural Radiance Field, NeRF, and/or a Latent Diffusion Model, LDM.
13. A method as claimed in claim 11 or 12, further comprising: sending a request to the preprocessing node to provide updated input data; receiving updated input data to the processing node; and using the updated input data to generate at least one of the prompt for the Text- to-XR function within the processing node and the contextual metadata for the virtual 3D object.
14. A method as claimed in any one of claims 11 to 13, wherein the ML model comprises a Large Language Model, LLM.
15. A computer implemented method (500) for generating a virtual 3D object, the method, performed by an Extended Reality, XR, client, comprising: receiving from a processing node a representation of the virtual 3D object and contextual metadata for the virtual 3D object (510); rendering the virtual 3D object for display to a user of the XR client, in accordance with the received representation (520); mapping application specific functions within the XR client to the contextual metadata (530); and executing the mapped application specific functions according to the contextual metadata (540).
16. A computer implemented method for supporting generation of a virtual 3D object by an Extended Reality, XR, client, the method, performed by a system comprising a preprocessing node and a processing node, comprising: at the preprocessing node: obtaining heterogeneous input data for the virtual 3D object; aggregating the heterogeneous input data; generating a contextual prompt from the aggregated heterogeneous data, wherein the contextual prompt comprises a text string conveying semantic information with respect to the virtual 3D object, which semantic information is derived from the aggregated heterogeneous input data; and
on detection of a trigger event, providing the contextual prompt to the processing node; the method further comprising: at the processing node: receiving the contextual prompt from the preprocessing node; inputting the contextual prompt to an ML model operable to generate: a prompt for a Text-to-XR function within the processing node; and contextual metadata for the virtual 3D object; using the generated prompt and the Text-to-XR function within the processing node to generate a representation of the virtual 3D object; and providing the representation of the virtual 3D object and the contextual metadata for the virtual 3D object to the XR client.
17. A computer program product comprising a computer readable medium, the computer readable medium having computer readable code embodied therein, the computer readable code being configured such that, on execution by a suitable computer or processor, the computer or processor is caused to perform a method as claimed in any one of the preceding claims.
18. A preprocessing node (600) for supporting generation of a virtual 3D object, the preprocessing node (600) comprising processing circuitry (602) configured to cause the preprocessing node (600) to: obtain heterogeneous input data for the virtual 3D object; aggregate the heterogeneous input data; generate a contextual prompt from the aggregated heterogeneous data, wherein the contextual prompt comprises a text string conveying semantic information with respect to the virtual 3D object, which semantic information is derived from the aggregated heterogeneous input data; and on detection of a trigger event, provide the contextual prompt to a processing node operable to facilitate generation of the virtual 3D object.
19. A preprocessing node (600) as claimed in claim 18, wherein the processing circuitry (602) is further configured to cause the preprocessing node (600) to carry out a method according to any one of claims 2 to 10.
20. A processing node (700) for supporting generation of a virtual 3D object, the processing node (700) comprising processing circuitry (702) configured to cause the processing node (700) to: receive a contextual prompt from a preprocessing node, the contextual prompt comprising a text string conveying semantic information with respect to the virtual 3D object; input the contextual prompt to an ML model operable to generate: a prompt for a Text-to-XR function within the processing node (700); and contextual metadata for the virtual 3D object; use the generated prompt and the Text-to-XR function within the processing node (700) to generate a representation of the virtual 3D object; and provide the representation of the virtual 3D object and the contextual metadata for the virtual 3D object to an XR client operable to generate the virtual 3D object.
21. A processing node (700) as claimed in claim 20, wherein the processing circuitry (702) is further configured to cause the processing node (700) to carry out a method according to any one of claims 12 to 14.
22. An Extended Reality, XR, client (800) for generating a virtual 3D object, the XR client (800) comprising processing circuitry (802) configured to cause the XR client (800) to: receive from a processing node a representation of the virtual 3D object and contextual metadata for the virtual 3D object; render the virtual 3D object for display to a user of the XR client, in accordance with the received representation; map application specific functions within the XR client to the contextual metadata; and execute the mapped application specific functions according to the contextual metadata.
23. A system (900) for supporting generation of a virtual 3D object by an Extended Reality, XR, client, the system comprising a preprocessing node (910) and a processing node (920), wherein: the preprocessing node (910) comprises processing circuitry configured to cause the preprocessing node (910) to: obtain heterogeneous input data for the virtual 3D object;
aggregate the heterogeneous input data; generate a contextual prompt from the aggregated heterogeneous data, wherein the contextual prompt comprises a text string conveying semantic information with respect to the virtual 3D object, which semantic information is derived from the aggregated heterogeneous input data; and on detection of a trigger event, provide the contextual prompt to the processing node (920); and wherein the processing node (920) comprises processing circuitry configured to cause the processing node (920) to: receive the contextual prompt from the preprocessing node (910); input the contextual prompt to an ML model operable to generate: a prompt for a Text-to-XR function within the processing node (920); and contextual metadata for the virtual 3D object; use the generated prompt and the Text-to-XR function within the processing node (920) to generate a representation of the virtual 3D object; and provide the representation of the virtual 3D object and the contextual metadata for the virtual 3D object to the XR client.
24. A system as claimed in claim 23, wherein in the preprocessing node (910), the processing circuitry is further configured to cause the preprocessing node (910) to carry out a method according to any one of claims 2 to 10, and in the processing node (920), the processing circuitry is further configured to cause the processing node (920) to carry out a method according to any one of claims 12 or 13.
25. A system as claimed in claim 23 or 24, further comprising an XR client (930) for generating a virtual 3D object, the XR client (930) comprising processing circuitry configured to cause the XR client (930) to: receive from a processing node (920) a representation of the virtual 3D object and contextual metadata for the virtual 3D object; render the virtual 3D object for display to a user of the XR client (930), in accordance with the received representation; map application specific functions within the XR client (930) to the contextual metadata; and execute the mapped application specific functions according to the contextual metadata.
Priority Applications (1)
| Application Number | Priority Date | Filing Date | Title |
|---|---|---|---|
| PCT/EP2024/067654 WO2026002363A1 (en) | 2024-06-24 | 2024-06-24 | Methods and apparatus for supporting generation of a virtual 3d object |
Applications Claiming Priority (1)
| Application Number | Priority Date | Filing Date | Title |
|---|---|---|---|
| PCT/EP2024/067654 WO2026002363A1 (en) | 2024-06-24 | 2024-06-24 | Methods and apparatus for supporting generation of a virtual 3d object |
Publications (1)
| Publication Number | Publication Date |
|---|---|
| WO2026002363A1 true WO2026002363A1 (en) | 2026-01-02 |
Family
ID=91737627
Family Applications (1)
| Application Number | Title | Priority Date | Filing Date |
|---|---|---|---|
| PCT/EP2024/067654 Pending WO2026002363A1 (en) | 2024-06-24 | 2024-06-24 | Methods and apparatus for supporting generation of a virtual 3d object |
Country Status (1)
| Country | Link |
|---|---|
| WO (1) | WO2026002363A1 (en) |
Citations (3)
| Publication number | Priority date | Publication date | Assignee | Title |
|---|---|---|---|---|
| US20200349763A1 (en) * | 2019-05-03 | 2020-11-05 | Facebook Technologies, Llc | Semantic Fusion |
| US20200379787A1 (en) * | 2018-04-20 | 2020-12-03 | Facebook, Inc. | Assisting Users with Personalized and Contextual Communication Content |
| US20210073429A1 (en) * | 2019-09-10 | 2021-03-11 | Apple Inc. | Object Relationship Estimation From A 3D Semantic Mesh |
Patent Citations (3)
| Publication number | Priority date | Publication date | Assignee | Title |
|---|---|---|---|---|
| US20200379787A1 (en) * | 2018-04-20 | 2020-12-03 | Facebook, Inc. | Assisting Users with Personalized and Contextual Communication Content |
| US20200349763A1 (en) * | 2019-05-03 | 2020-11-05 | Facebook Technologies, Llc | Semantic Fusion |
| US20210073429A1 (en) * | 2019-09-10 | 2021-03-11 | Apple Inc. | Object Relationship Estimation From A 3D Semantic Mesh |
Non-Patent Citations (2)
| Title |
|---|
| RUNZ MARTIN ET AL: "MaskFusion: Real-Time Recognition, Tracking and Reconstruction of Multiple Moving Objects", 2018 IEEE INTERNATIONAL SYMPOSIUM ON MIXED AND AUGMENTED REALITY (ISMAR), IEEE, 16 October 2018 (2018-10-16), pages 10 - 20, XP033502065, DOI: 10.1109/ISMAR.2018.00024 * |
| ZHENG CHUANXIA ET AL: "Multi-class indoor semantic segmentation with deep structured model", VISUAL COMPUTER, SPRINGER, BERLIN, DE, vol. 34, no. 5, 8 June 2017 (2017-06-08), pages 735 - 747, XP036476657, ISSN: 0178-2789, [retrieved on 20170608], DOI: 10.1007/S00371-017-1411-8 * |
Similar Documents
| Publication | Publication Date | Title |
|---|---|---|
| CN111445561B (en) | Virtual object processing methods, devices, equipment and storage media | |
| CN116611496A (en) | Text-to-image generation model optimization method, device, equipment and storage medium | |
| US20240404225A1 (en) | Avatar generation from digital media content items | |
| US20250182366A1 (en) | Interactive bot animations for interactive systems and applications | |
| US20250061634A1 (en) | Audio-driven facial animation using machine learning | |
| US12437160B2 (en) | Image-to-text large language models (LLM) | |
| US20250184291A1 (en) | Interaction modeling language and categorization schema for interactive systems and applications | |
| US20250181847A1 (en) | Deployment of interactive systems and applications using language models | |
| US20240412440A1 (en) | Facial animation using emotions for conversational ai systems and applications | |
| US12511810B2 (en) | Backchanneling for interactive systems and applications | |
| US20250181138A1 (en) | Multimodal human-machine interactions for interactive systems and applications | |
| US20250181424A1 (en) | Event-driven architecture for interactive systems and applications | |
| US20250184292A1 (en) | Managing interaction flows for interactive systems and applications | |
| WO2025259492A1 (en) | Texture generation using prompts | |
| US20250181207A1 (en) | Interactive visual content for interactive systems and applications | |
| US20250184293A1 (en) | Sensory processing and action execution for interactive systems and applications | |
| CN120408125A (en) | AI agent digital human interaction system and method based on multimodal perception | |
| US20250005836A1 (en) | Dynamic real time avatar-based ai communication system | |
| WO2026002363A1 (en) | Methods and apparatus for supporting generation of a virtual 3d object | |
| US20250349040A1 (en) | Personalized image generation using combined image features | |
| CN119169161A (en) | 3D interactive digital human generation method, device and customer service project system | |
| WO2024066549A1 (en) | Data processing method and related device | |
| CN116958334A (en) | Method and device for generating expression data of virtual intelligent agent | |
| US20250378643A1 (en) | Mesh generation using prompts | |
| US20250373878A1 (en) | Real-time streaming and playback of synchronized audio and animation data |