CN120339558A - Electronic device, method and storage medium - Google Patents

Electronic device, method and storage medium

Info

Publication number
CN120339558A
CN120339558A (application CN202410076665.4A)
Authority
CN
China
Prior art keywords
user
gaze
electronic device
gaze point
head rotation
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202410076665.4A
Other languages
Chinese (zh)
Inventor
沈凌浩
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Sony China Ltd
Sony Group Corp
Original Assignee
Sony China Ltd
Sony Group Corp
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Sony China Ltd, Sony Group Corp filed Critical Sony China Ltd
Priority to CN202410076665.4A priority Critical patent/CN120339558A/en
Priority to PCT/CN2025/072190 priority patent/WO2025152914A1/en
Publication of CN120339558A publication Critical patent/CN120339558A/en
Pending legal-status Critical Current

Classifications

    • G PHYSICS
    • G06 COMPUTING OR CALCULATING; COUNTING
    • G06T IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T19/00 Manipulating 3D models or images for computer graphics
    • G06T19/20 Editing of 3D images, e.g. changing shapes or colours, aligning objects or positioning parts
    • G06T15/00 3D [Three Dimensional] image rendering
    • G06T15/10 Geometric effects
    • G06T15/20 Perspective computation
    • G06T15/205 Image-based rendering

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Theoretical Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Computer Graphics (AREA)
  • Geometry (AREA)
  • Computing Systems (AREA)
  • Architecture (AREA)
  • Computer Hardware Design (AREA)
  • General Engineering & Computer Science (AREA)
  • Software Systems (AREA)
  • Processing Or Creating Images (AREA)
  • User Interface Of Digital Computer (AREA)

Abstract

The present disclosure relates to electronic devices, methods, and storage media. An electronic device includes a processor and a memory storing computer program code that, when executed by the processor, causes the electronic device to perform operations including receiving gaze point location and head rotation information of a user, obtaining interactive content information, predicting a gaze range of the user using an Artificial Intelligence (AI) model based on the gaze point location and head rotation information of the user and the interactive content information, and rendering an interactive screen for display based on the predicted gaze range.

Description

Electronic device, method, and storage medium
Technical Field
The present disclosure relates generally to the field of image processing, and more particularly, to an electronic device, method, and storage medium for improving gaze point rendering (foveated rendering).
Background
In recent years, application services such as extended reality (XR) have been receiving increasing attention. XR is an umbrella term for all real-and-virtual human-machine interactions generated by computer technology and presentation devices such as wearables, and includes, for example, Virtual Reality (VR), Augmented Reality (AR), and Mixed Reality (MR). Related new services also include, for example, the metaverse and cloud gaming. These services greatly enrich the entertainment experience of the user.
Compared to traditional multimedia services, these new services have special service characteristics and design requirements. For example, VR services place very high demands on image rendering: advanced rendering techniques, high frame rates, high resolution, and low latency are all required to improve the user experience. This poses challenges for the presentation devices used, especially wearable devices with limited processing power and energy.
Disclosure of Invention
The present disclosure provides various aspects. By applying one or more aspects of the present disclosure, performance of image rendering of, for example, a wearable device may be improved.
The following presents a simplified summary of the disclosure in order to provide a basic understanding of some aspects of the disclosure. It should be understood that this summary is not an exhaustive overview of the disclosure. It is not intended to identify key or critical elements of the disclosure or to delineate the scope of the disclosure. Its purpose is to present some concepts related to the disclosure in a simplified form as a prelude to the more detailed description that is presented later.
According to one aspect of the disclosure, there is provided an electronic device comprising a processor and a memory storing computer program code, wherein the computer program code, when executed by the processor, causes the electronic device to perform operations comprising receiving gaze point location and head rotation information of a user, obtaining interactive content information, predicting a gaze range of the user using an Artificial Intelligence (AI) model based on the gaze point location and head rotation information of the user and the interactive content information, and rendering an interactive screen for display based on the predicted gaze range.
According to another aspect of the disclosure, there is provided an electronic device comprising a processor, and a memory storing computer program code, wherein the computer program code, when executed by the processor, causes the electronic device to perform operations comprising preparing a training set comprising input data and output data, wherein the input data comprises gaze point position and head rotation information of a user and interactive content information corresponding to a series of moments, the output data comprises gaze ranges of the user, and training an Artificial Intelligence (AI) model on the training set to determine parameters of the AI model.
According to another aspect of the present disclosure, there is provided a method including receiving gaze point position and head rotation information of a user, obtaining interactive content information, predicting a gaze range of the user using an Artificial Intelligence (AI) model based on the gaze point position and head rotation information of the user and the interactive content information, and rendering an interactive picture for display based on the predicted gaze range.
According to another aspect of the disclosure, a method is provided that includes preparing a training set including input data including gaze point location and head rotation information of a user and interactive content information corresponding to a series of moments, and output data including a gaze range of the user, and training an Artificial Intelligence (AI) model on the training set to determine parameters of the AI model.
According to another aspect of the present disclosure, there is provided a non-transitory computer-readable storage medium storing executable instructions which, when executed, implement any of the methods described above.
Drawings
The disclosure may be better understood by referring to the following detailed description in conjunction with the accompanying drawings, in which the same or similar reference numerals are used throughout to indicate the same or similar elements. The accompanying drawings, which are incorporated in and form a part of this specification, illustrate embodiments of the present disclosure and, together with the detailed description, serve to explain the principles and advantages of the present disclosure. In the drawings:
FIG. 1 is a schematic diagram of gaze point rendering according to an exemplary embodiment of the present disclosure;
FIG. 2 is a schematic diagram showing head rotation detected with a sensor;
FIG. 3 is a training flowchart of a time-series prediction model according to an exemplary embodiment;
FIG. 4 illustrates a modification of gaze point rendering according to an exemplary embodiment;
FIG. 5 illustrates an optimization process for personality characteristics according to an exemplary embodiment;
FIG. 6 is a block diagram of an apparatus according to an exemplary embodiment;
FIG. 7 illustrates an example block diagram of a computer that may be implemented as a user device or a control device in accordance with this disclosure.
Features and aspects of the present disclosure will be clearly understood from a reading of the following detailed description with reference to the accompanying drawings.
Detailed Description
Various exemplary embodiments of the present disclosure will be described in detail below with reference to the accompanying drawings. The following description of the exemplary embodiments is merely illustrative and is in no way intended to limit the present disclosure or its applications. In the interest of clarity and conciseness, not all features of an embodiment are described in this specification. It should be noted, however, that many implementation-specific arrangements may be made when implementing embodiments of the present disclosure, according to specific needs.
Furthermore, to avoid obscuring the present disclosure with unnecessary detail, some drawings show only those processing steps and/or apparatus structures that are closely related to the technical content of the present disclosure, while other drawings additionally show existing processing steps and/or apparatus structures to facilitate a better understanding of the present disclosure.
Hereinafter, for convenience of explanation, one or more aspects of the present disclosure may be described taking the application scenario of a VR service as an example. It should be noted, however, that this does not limit the scope of application of the present disclosure; one or more aspects of the present disclosure may also be applied to other scenarios involving image rendering, such as AR, MR, and the metaverse.
As a representative example of emerging services, VR services aim to provide users with a highly immersive audiovisual experience, which places very high demands on image rendering, mainly in the following respects:
Frame rate: to provide a smooth, stable virtual-reality experience, VR services require image frame rates of at least 75 frames per second (fps), and preferably 90 fps or more. A frame rate that is too low can cause dizziness and degrade the user experience;
Resolution: VR services require high-resolution images to provide better immersion. In general, the higher the resolution of an image, the better the display effect. However, high resolution increases the difficulty and computational cost of image rendering, so a trade-off must be made according to the actual situation;
Latency: VR devices typically require real-time rendering of images and therefore place high demands on rendering speed. Excessive latency causes the user to perceive obvious stuttering, which harms immersion and the interactive experience;
Interactivity: VR services need to provide rich interactive functions, such as motion capture and real-time response to the user's head and hands. These interactive functions require high-precision image rendering to ensure accurate and timely responses.
To achieve high quality image rendering, VR services need to employ advanced rendering techniques such as ray tracing, global illumination, dynamic shading, and the like. These techniques may improve the realism and fidelity of the image, providing a more immersive experience for the user.
However, VR devices such as VR glasses are limited in hardware and software, so efficient algorithms are needed to reduce the resources required for rendering. To this end, gaze point rendering (foveated rendering) techniques have been proposed. Gaze point rendering is a graphics computing technique that exploits the physiological characteristic that human visual perception gradually blurs from the center of the visual field toward the periphery, greatly reducing computational complexity by lowering the resolution of the image around the gaze point. When people look at something, the entire field of view is not uniformly sharp: the center is sharp, and sharpness decreases toward the edges. Therefore, when rendering a display image on a VR device, the entire picture need not have the same resolution; instead, the gazed-at center of the picture has the highest resolution and the surrounding resolution decreases.
Techniques that apply eye tracking to gaze point rendering, such as eye-tracked foveated rendering (ETFR), have also been proposed. ETFR tracks the user's current actual gaze point with a sensor and uses high-resolution rendering only within a certain range around the gaze point, while reducing the rendering resolution elsewhere.
However, current gaze point rendering methods (including ETFR) only detect the user's gaze point location and then use high resolution in a fixed area around it. While this helps applications reduce Graphics Processing Unit (GPU) consumption, such a fixed size and shape is not optimal in all cases.
In view of this, the present disclosure provides an improved gaze point rendering method to further improve the performance of image rendering.
Exemplary embodiments according to the present disclosure are described below with reference to the accompanying drawings.
Fig. 1 is a schematic diagram of gaze point rendering according to an exemplary embodiment of the present disclosure. As shown, gaze point rendering according to the exemplary embodiment takes into account the user's head rotation information and the interactive content in addition to the user's gaze point location, and weighs these factors together using a time-series prediction model, thereby predicting the user's gaze range more accurately for image rendering.
The rendering method according to the exemplary embodiments is applicable to various user devices, such as VR glasses and VR headsets. However, the user device used in the present disclosure is not limited thereto and may include any display device requiring high-performance image rendering, such as an AR device, a high-end television, a projector, or a gaming display.
Various sensors may be provided in the user device or attached using an interface to capture various user status information, including gaze point location and head rotation information of the user shown in fig. 1.
In some embodiments, the sensor capturing gaze point position may be implemented as an eye tracker. Eye trackers can be used to quickly track a user's eye movements in two-or three-dimensional space to determine where the user looks in their field of view. In some examples, the eye tracker may include or be coupled to a display lens positioned with respect to each eye of the user to project a picture of the VR space into the user's eye. In other examples, the eye tracker may include or be coupled to a viewing lens positioned relative to each eye of the user so that the user can view the surrounding real world environment, similar to a pair of glasses. In some other examples, the eye tracker may include or be coupled to a specially configured lens positioned with respect to each eye of the user in order to project a simulated or composite view of the real world environment and the overlaying real world environment simultaneously, thereby providing the AR environment to the user.
The eye tracker according to an exemplary embodiment may employ various eye-tracking techniques. One example is infrared-based video-oculography (VOG), which essentially aims a beam of infrared light and a camera at the user's eyes, captures real-time video of the eyes, and extracts eyeball position and movement information using image processing techniques. This method is simple to implement and low in cost, but requires high-resolution, high-frame-rate video capture equipment. Another example is the pupil-corneal reflection method, which calculates eye movements by measuring reflections from the pupil and cornea: when light strikes the eye, a glint forms on the cornea relative to the pupil, and the rotation angle of the eye can be calculated by monitoring changes in the glint's position. Yet another example is the sclera-iris limbus method, which illuminates the eye with infrared light and calculates eye movement by monitoring the infrared signals reflected from the sclera and the edge of the iris. Alternatively, electrooculography, a non-invasive eye-tracking technique that calculates eye movements by measuring changes in the potential difference across the surface of the eye, may be employed; this method has high precision but requires special electrodes and signal processing circuits. As a result, the eye tracker may output the focal position on the screen at which the user is gazing, i.e., the gaze point location referred to in this disclosure. The gaze point location may be represented, for example, as two-dimensional spatial coordinates (x, y).
In addition, various techniques may be employed to capture the user's head rotation information. In one example, the sensor capturing head rotation information may be implemented as an Inertial Measurement Unit (IMU). Typically, an IMU includes three single-axis accelerometers, which detect acceleration signals of an object along independent axes of the carrier coordinate system, and three single-axis gyroscopes, which detect angular velocity signals of the carrier. By processing these signals, the pose of the object can be calculated. In another example, an infrared or visible-light camera may be used to capture image data of head movements, from which information such as head rotation direction, speed, and angle is analyzed. Alternatively, machine vision and artificial intelligence techniques may be used to infer a person's head orientation in three-dimensional space from a two-dimensional digital image, by analyzing feature points, lines, and shapes in the image in combination with a deep learning algorithm.
Fig. 2 is a schematic diagram showing head rotation detected with a sensor. For example, the IMU may output a head pose (α, β, γ) characterized by roll, pitch, and yaw angles as a measurement. In scenarios such as AR or VR, as the user's head turns, the visual content of the virtual or real space that needs to be presented changes. The head rotation angle is generally limited, and the range of high-definition rendering clearly should not exceed this upper limit, or it would cause discomfort to the user. In addition, when the user's head rotation angle differs, the size and shape of the area of the screen at which the eyes gaze may also differ. Taking the user's head rotation into account therefore improves the accuracy of gaze point rendering.
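As an illustration of how the measured head pose (α, β, γ) could feed into rendering decisions, the following is a minimal sketch that converts IMU roll, pitch, and yaw angles into the head's forward-direction vector. The axis convention (yaw about the vertical axis, pitch about the lateral axis, +z straight ahead) is an assumption for illustration; real devices and engines differ.

```python
import math

def head_forward_vector(roll: float, pitch: float, yaw: float):
    """Convert Euler angles (radians) to a unit forward-direction vector.

    Under the assumed convention, roll (rotation about the view axis)
    does not change where the head is pointing, so it is unused here.
    """
    x = math.cos(pitch) * math.sin(yaw)
    y = math.sin(pitch)
    z = math.cos(pitch) * math.cos(yaw)
    return (x, y, z)

# Head level and facing straight ahead: forward is the +z axis.
fwd = head_forward_vector(0.0, 0.0, 0.0)
```

A renderer could compare this vector between frames to estimate how fast the view direction is changing, one of the cues the time-series prediction model consumes.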
Gaze point rendering according to an exemplary embodiment is content-aware. As shown in fig. 1, in addition to the head rotation information and gaze point location information captured by the sensors, the interactive content may also be input to a deep-learning feature extraction model. In other words, the interactive content can also influence rendering decisions. As used in this disclosure, "interactive content" refers to any form of content with which a user may interact. Unlike traditional media content, the presentation of interactive content can vary with user interaction, so the demand for real-time rendering is higher. One typical example is a game, including but not limited to VR/AR games. The interactive content may include anything that affects the user's gaze, including, for example, the state of the scene, the state of objects in the scene, the state of characters, and the like.
According to an example embodiment of the present disclosure, the feature extraction model may receive the interactive content at the current time, such as the current state (e.g., location, attributes, type) of the whole scene or of objects and characters in the scene. In one example, the interactive content may include a rendered scene graph. In addition, the feature extraction model may also receive subsequent interactive content, such as the interactive content at the time following the current time in the rendering sequence. The subsequent interactive content may be obtained, for example, from game configuration data, such as unrendered scene graphs or configuration scripts.
The feature extraction model is configured to extract feature data from the input gaze point location, head rotation information, and interactive content. In one example, the feature extraction model may be a neural network including one or more feature layers, but its implementation is not limited thereto, and any other suitable model architecture may be employed. The feature extraction model is pre-trained. Although the feature extraction model and the time-series prediction model are shown as two independent models in fig. 1, they are not limited thereto and may alternatively be implemented as parts of a single neural network. Thus, the feature extraction model and the time-series prediction model may be separate, or may be trained or used together, so long as the feature extraction model is capable of extracting feature data (e.g., feature vectors) suitable for input to the time-series prediction model.
In one example, the feature extraction model may concatenate the gaze point location, the head rotation information, and the interactive content information at a given time and/or at the next time into a feature sequence corresponding to that time. In the case of using a scene graph as the interactive content, the feature extraction model may process the gaze point location and the scene graph in different ways: for example, appending the coordinates of the gaze point location after the pixel values (e.g., RGB values) of the scene graph, encoding the coordinates of the gaze point location into a feature vector and appending it after the pixel values of the scene graph, or characterizing the gaze point location as a Gaussian-distribution heat map used as an additional input channel.
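The Gaussian heat-map encoding of the gaze point can be sketched as follows. The map size, the sigma value, and the peak normalization to 1.0 are illustrative choices for this sketch, not values specified by the disclosure.

```python
import math

def gaze_heatmap(width, height, gx, gy, sigma=5.0):
    """Encode a gaze point (gx, gy) as a 2D Gaussian heat map that can be
    stacked as an extra input channel alongside the RGB channels of the
    scene graph. The value peaks at 1.0 at the gaze point and decays
    with squared distance from it."""
    return [
        [math.exp(-((x - gx) ** 2 + (y - gy) ** 2) / (2 * sigma ** 2))
         for x in range(width)]
        for y in range(height)
    ]

heat = gaze_heatmap(16, 16, gx=8, gy=4)
```

Compared with appending raw (x, y) coordinates after the pixel values, a heat-map channel keeps the gaze information spatially aligned with the image, which convolutional feature extractors handle naturally.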
The time-series prediction model is configured to predict the user's gaze range based on the feature data extracted by the feature extraction model. Specifically, the time-series prediction model may analyze the gaze point, the interactive content, and the head rotation represented by the feature data, construct the relationship among them, and predict the user's gaze range based on that relationship. The gaze range may be defined by gaze point location, size, and shape. Optionally, the time-series prediction model may also predict the head rotation and gaze range at the next moment, which can be beneficial in certain situations where subsequent interactive frames need to be pre-rendered to reduce rendering latency.
According to an exemplary embodiment of the present disclosure, the time-series prediction model may be a pre-trained Artificial Intelligence (AI) model. As examples of AI models that may be used as the time-series prediction model, various deep-learning neural networks may be used, including, but not limited to, Convolutional Neural Networks (CNNs), Transformers, or Mamba models.
A CNN is a widely used deep learning model comprising convolution computations and a deep structure, with very strong nonlinear fitting capability. By means of this nonlinear fitting capability, a convolutional neural network can mine the hidden associations between gaze point locations, head rotations, interactive content, and gaze ranges. The key part of a CNN is its multiple convolution layers, in which learned filters (i.e., convolution kernels) are convolved with the data matrix to extract hidden features in the input data. A CNN may also include one or more of a batch normalization layer, an activation function, a pooling layer, and a fully connected layer.
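For illustration, the core operation of such a convolution layer can be sketched in a few lines of plain Python. Deep-learning frameworks implement this far more efficiently (and, strictly speaking, as cross-correlation, without flipping the kernel); the image and kernel values below are arbitrary.

```python
def conv2d(image, kernel):
    """Minimal 'valid' 2D convolution: slide the kernel over the image
    and sum element-wise products to produce one feature map."""
    kh, kw = len(kernel), len(kernel[0])
    out_h = len(image) - kh + 1
    out_w = len(image[0]) - kw + 1
    return [
        [sum(image[i + di][j + dj] * kernel[di][dj]
             for di in range(kh) for dj in range(kw))
         for j in range(out_w)]
        for i in range(out_h)
    ]

# A horizontal edge-detecting kernel responds where intensity jumps
# between rows, and stays at zero over uniform regions.
img = [[0, 0, 0, 0],
       [0, 0, 0, 0],
       [9, 9, 9, 9],
       [9, 9, 9, 9]]
edge = conv2d(img, [[-1, -1], [1, 1]])
```

In a trained CNN the kernel values are not hand-picked like this edge detector; they are the learned parameters that extract whatever features best predict the gaze range.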
The Transformer is another deep learning model, mainly used for sequence-to-sequence tasks. It consists of an input encoder and an output decoder, built from several self-attention layers. The self-attention layers use the attention mechanism to compute the relationships between inputs and outputs, allowing the Transformer to process a sequence in parallel. In addition, the Transformer may also include input/output embeddings, positional encoding, residual connections, layer normalization, and the like.
The Mamba model is an innovative deep learning architecture designed specifically for processing sequence data. It is based on the structured State Space Model (SSM) framework and incorporates a hardware-aware computation scheme, thereby achieving high efficiency in sequence-processing tasks. The Mamba model consists of an input encoder and an output decoder, interconnected through a series of SSM modules together with gated MLP layers. These modules effectively capture complex relationships between inputs and outputs by encoding the inputs into more compact state information. The Mamba model also introduces normalization and residual connections to enhance the model's expressive power and training stability. This connection pattern helps the model better learn long-range dependencies in the sequence and improves its convergence rate.
In some embodiments, only the feature data corresponding to the current moment may be input to the time-series prediction model to obtain the gaze range for the current moment, e.g., the user's gaze point location and the range's size and shape. In other embodiments, in addition to the feature data for the current moment, stored feature data from a preceding period may be input together to the time-series prediction model, which can help uncover temporal correlations in the user's gaze behavior. For example, the time-series prediction model may receive the feature vector at the current time t0 and the feature vectors at one or more times prior to t0, and output the user's current gaze range.
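One simple way to supply such a history is a fixed-length sliding window over per-frame feature vectors. The window size and the warm-up padding strategy (repeating the oldest frame until the window fills) are illustrative assumptions for this sketch, not details from the disclosure.

```python
from collections import deque

class FeatureWindow:
    """Keep the last `size` per-frame feature vectors so the prediction
    model sees a short history, not just the current frame."""

    def __init__(self, size):
        self.buf = deque(maxlen=size)  # old frames fall off automatically

    def push(self, features):
        self.buf.append(features)

    def as_input(self):
        # Oldest-to-newest list of frames; pad by repeating the oldest
        # frame until the window is full (a simple warm-up strategy).
        frames = list(self.buf)
        while len(frames) < self.buf.maxlen:
            frames.insert(0, frames[0])
        return frames

win = FeatureWindow(size=4)
for t in range(6):
    win.push([float(t)])  # stand-in for a real feature vector
seq = win.as_input()      # frames for t = 2, 3, 4, 5
```

This corresponds to feeding the model the feature vector at t0 plus those at the preceding moments, as described above.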
Based on the predicted gaze range, the rendering processor may render the interactive frame (e.g., a game frame) that currently needs to be presented. In particular, the processor may render the picture within the gaze range at a higher resolution (full resolution, e.g., 4K or 8K), while rendering the picture outside the gaze range at a lower resolution (e.g., 1/2 or 1/4 resolution). Alternatively, the gaze range predicted by the time-series prediction model may include a first region at the center and a second region surrounding the first region, and the rendering processor may render the first region at the highest, first resolution (e.g., full resolution), the second region at a slightly lower, second resolution (e.g., 1/2 resolution), and the region outside the gaze range at a still lower, third resolution (e.g., 1/4 resolution). In this way, the processor's load and latency can be effectively reduced.
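The tiered scheme can be expressed as a per-pixel (or per-tile) resolution-scale lookup. The circular region shape, the radii, and the 1.0/0.5/0.25 scale factors below are illustrative; the disclosure's predicted gaze range may have an arbitrary size and shape.

```python
import math

def resolution_scale(px, py, gaze_x, gaze_y, r_inner, r_outer):
    """Map a pixel to a render-resolution scale based on its distance
    from the predicted gaze centre: full resolution in the first
    region, half in the surrounding second region, quarter outside."""
    d = math.hypot(px - gaze_x, py - gaze_y)
    if d <= r_inner:
        return 1.0   # first region: full resolution
    if d <= r_outer:
        return 0.5   # second region: 1/2 resolution
    return 0.25      # periphery: 1/4 resolution

s_center = resolution_scale(100, 100, 100, 100, r_inner=50, r_outer=120)
s_ring = resolution_scale(200, 100, 100, 100, r_inner=50, r_outer=120)
s_edge = resolution_scale(400, 100, 100, 100, r_inner=50, r_outer=120)
```

In practice a GPU would apply such scales per tile or via variable-rate shading rather than per pixel, but the region logic is the same.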
The training of the time-series prediction model is described below. As shown in fig. 3, the model training process according to an exemplary embodiment can be summarized as a preparation step 101 and a training step 102.
In the preparation step 101, a dataset on which the model is trained needs to be prepared. Typically, neural networks are trained in a supervised fashion, so the training dataset includes a set of input features and the associated ground-truth outputs. According to an exemplary embodiment, the training dataset may include, as input data, the user's gaze point location and head rotation information and the interactive content information corresponding to a series of moments, and, as output data, the user's actual gaze range (size, shape) at each moment and/or the next moment. The preparation step 101 may further comprise pairing the gaze point location, head rotation information, and interactive content information with the gaze ranges by moment. In one example, the interactive content information includes a scene graph, and the gaze point location may be linked to the scene graph by concatenation or embedding. In addition, the preparation step 101 may further include data cleaning, normalization, standardization, and the like, as needed.
The training step 102 trains the model on the prepared training dataset, a process sometimes also referred to as deep learning. To keep the computational complexity practically feasible, the training step is based on an iterative process, for example the stochastic gradient descent (SGD) algorithm. To this end, the weights of the neural network are initialized (e.g., randomly) at the beginning. Input data from the training dataset are fed to the neural network to obtain the corresponding output, e.g., a predicted gaze range, and the value of a loss function is calculated from the difference between the predicted gaze range and the actual gaze range. The weights and biases of the neural network are then updated according to the gradient of the loss function. Specifically, the gradient of the loss function with respect to the weights is propagated to each layer of the model by a backpropagation algorithm, and the weights and biases are updated (e.g., by gradient descent). These steps are repeated until a preset number of iterations is reached or the loss value falls below a preset threshold. It should be noted that the above explanation of the training method is merely exemplary, and the actual procedure may vary depending on the type of model employed.
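The loop just described (forward pass, loss, gradient, weight update, repeat) can be illustrated with a deliberately tiny model. A real gaze-range predictor has many layers and vector outputs, but the iteration structure is the same; the one-parameter linear model and learning rate here are purely illustrative.

```python
def train_step(w, b, x, y_true, lr=0.1):
    """One gradient-descent step for a toy model y = w*x + b with
    squared-error loss: forward pass, loss, gradients via the chain
    rule, then the weight/bias update."""
    y_pred = w * x + b
    loss = (y_pred - y_true) ** 2
    grad = 2 * (y_pred - y_true)  # dL/dy_pred
    w -= lr * grad * x            # dL/dw = dL/dy_pred * x
    b -= lr * grad                # dL/db = dL/dy_pred
    return w, b, loss

# Fit the toy model to a single target; the loss shrinks each iteration.
w, b = 0.0, 0.0
losses = []
for _ in range(200):
    w, b, loss = train_step(w, b, x=1.0, y_true=3.0)
    losses.append(loss)
```

Stopping once the loss drops below a preset threshold, or after a fixed number of iterations, mirrors the termination condition described above.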
Different users can react differently to the same interactive content, so some additional customization can be performed based on collected data for the current user to achieve a better effect. This customization need not retrain the entire model; instead, the already-trained model is fine-tuned. In order not to unduly affect the user experience, the amount of data to be collected should be kept small; therefore, a method such as prompt tuning is used.
Fig. 4 illustrates a modification of gaze point rendering according to an exemplary embodiment. Fig. 4 is substantially the same as fig. 1, except that the time-series prediction model receives the user's personality characteristics as input, in addition to the current feature data and/or previous feature data. Here, the user's personality characteristics are characteristics related to the user's individuality that affect rendering decisions. Depending on the optimization task, the personality characteristics used may vary. For example, in a racing game, different users have different skill levels and thus may exhibit different degrees of correlation between gaze point and track variation, so the gaze range for high-definition rendering may also differ; the time-series prediction model can therefore be customized with personality characteristics representing a user's skill level to improve prediction accuracy for that user.
The optimization process of the personality characteristics is described below in connection with fig. 5. In step 201, the personality characteristics of a particular user may be initialized. The type of personality characteristic may be determined according to the purpose of the optimization. Alternatively, an average vector over a plurality of users may be used as the initial personality characteristic vector at initialization. This initialization step may occur, for example, when the user plays the game for the first time, or when the optimization function is enabled.
Then, in step 202, as shown in fig. 4, the initialized personality characteristics may be input to the time series prediction model together with the other feature data to predict the user's gaze range. In step 203, the predicted gaze range may be compared with the actual gaze range to calculate a prediction accuracy.
In step 204, the personality characteristics of the user may be further adjusted. For example, the calculated prediction accuracy may be compared with a preset threshold; if it is below the threshold, the personality characteristics are adjusted, and otherwise they are left unchanged. Steps 202, 203 and 204 may be performed iteratively until the prediction accuracy exceeds the preset threshold.
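The loop of steps 201-204 can be sketched as follows, under the assumption that the trained model is frozen and only the user's personality vector is adjusted (here with a simple gradient step; any adjustment rule could be substituted). All shapes and values are illustrative:

```python
import numpy as np

rng = np.random.default_rng(1)

# Frozen, "already trained" model parameters (toy stand-ins): the prediction
# depends on both the feature vector and the user's personality vector.
W_f = rng.normal(size=(8, 2))
W_p = rng.normal(size=(4, 2))

def predict(x, p):
    # Toy output standing in for the predicted gaze range.
    return x @ W_f + p @ W_p

p_true = rng.normal(size=4)              # this user's "real" traits (unknown to us)
x = rng.normal(size=(64, 8))
actual = predict(x, p_true)              # observed actual gaze ranges for this user

p = np.zeros(4)                          # step 201: initialize (e.g. average vector)
lr, threshold = 0.1, 1e-4
for _ in range(1000):                    # iterate steps 202-204
    residual = predict(x, p) - actual    # 202: predict; 203: compare with actual
    err = float(np.mean(residual ** 2))
    if err < threshold:                  # stop once accuracy passes the threshold
        break
    p -= lr * W_p @ residual.mean(axis=0)  # 204: adjust ONLY the personality vector
```

Freezing the model and tuning only the small per-user vector mirrors the prompt-tuning idea mentioned above: little user data is needed because few parameters are adjusted.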
Fig. 6 shows a device block diagram according to an exemplary embodiment. As shown in the figure, a user device for implementing gaze point rendering according to an exemplary embodiment may include a user state sensing module, a feature extraction module, a feature caching module, a timing prediction module, and a gaze point rendering module. Optionally, the apparatus may further comprise a personality characteristics module.
The user state awareness module is configured to obtain state information of the user, such as gaze point location and head rotation information of the user. The user state awareness module may be implemented as a sensor that captures corresponding information.
The feature extraction module is configured to extract feature data based on the head rotation information, gaze point location, and interactive content (e.g., game content). As described above, the feature extraction module may be implemented as a neural network model. The feature data extracted by the feature extraction module may be stored in the feature cache module.
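As a hypothetical sketch (not the disclosed implementation), the feature caching module can be as simple as a fixed-length buffer that keeps the feature vectors of the most recent moments for the timing prediction module to consume:

```python
from collections import deque

class FeatureCache:
    """Illustrative feature caching module: retains the feature data of the
    last `maxlen` moments so that the prediction model can consume the
    current feature data together with previous feature data."""

    def __init__(self, maxlen: int = 16):
        # A deque with maxlen silently drops the oldest entry when full.
        self._buf = deque(maxlen=maxlen)

    def push(self, features) -> None:
        """Store the feature data extracted for the current moment."""
        self._buf.append(features)

    def window(self) -> list:
        """Return cached features, oldest to newest, as the model's input window."""
        return list(self._buf)
```

With `maxlen=16`, pushing 20 per-moment feature vectors leaves the 16 most recent ones available as the model's input window.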
The timing prediction module is configured to predict a gaze range of the user based on the current feature data and/or previous feature data stored by the feature cache module. Optionally, the timing prediction module may also predict the head rotation and gaze range at the next moment.
The gaze point rendering module is configured to perform rendering based on the prediction result of the timing prediction module. Optionally, the timing prediction module may also receive the user's personality characteristics from the personality characteristics module for prediction, so as to improve the prediction accuracy for that user.
It should be understood that the above-described modules are merely logic modules divided according to the specific functions they implement, and are not intended to limit the specific implementation. In actual implementation, the modules may be implemented as separate physical entities, or may be implemented by a single entity (e.g., a processor (CPU or DSP, etc.), an integrated circuit, etc.).
The following describes the application of the exemplary embodiments of the present disclosure using a VR racing game as an example. It should be noted that this is merely exemplary and is not intended to limit the scope of application of the present disclosure. In a VR racing game, the user's attention is focused on the road ahead while driving the vehicle. When the vehicle speed is low, the user's attention is relatively dispersed and the gaze point area is large; when the vehicle speed is high, the gaze point area is small. Meanwhile, the scene changes slowly at low speed and rapidly at high speed, where more rendering resources are needed. Vehicle speed can therefore be regarded as game content that affects rendering. When the vehicle speed is high, the high-resolution rendering area around the gaze point can be reduced, improving rendering speed. By collecting the relationship between VR racing driving speed and gaze point size for multiple testers, the size of the high-resolution rendering area can be adjusted according to the racing speed based on that relationship.
Because a racing game requires the user to look ahead at curves, the road surface, and the like, this information can be obtained by analyzing the current track and the current course, and the user's head rotation and gaze point position change can be predicted on that basis. In particular, the user's head rotation direction is correlated with the gaze point position change and the direction of the course change, and the gaze shape is an ellipse pointing in the direction of the course change, so the high-resolution rendering area can be optimized according to this feature. When the vehicle moves faster, the turning angle is larger, the user's head rotation angle is also larger, and the gaze shape becomes more elongated. High-resolution rendering can therefore be confined to a smaller, elongated area, reducing the resources required for rendering and improving rendering speed.
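The speed-dependent elliptical high-resolution region described above can be sketched as follows; the normalization constant, area scaling, and elongation factor are invented for illustration and would in practice be fitted from tester data:

```python
import math

def gaze_ellipse(speed: float, heading_change_deg: float,
                 base_radius: float = 200.0,
                 min_scale: float = 0.4, elongation: float = 0.6):
    """Hypothetical sizing of the high-resolution gaze region.

    Faster vehicles get a smaller, more elongated ellipse whose major axis
    points in the direction of the heading change. Every constant here is an
    illustrative assumption, not a value from the disclosure.
    """
    speed_norm = min(speed / 300.0, 1.0)             # assumed speed cap for normalization
    scale = 1.0 - (1.0 - min_scale) * speed_norm     # overall size shrinks with speed
    ratio = 1.0 + elongation * speed_norm            # major/minor ratio grows with speed
    minor = base_radius * scale / math.sqrt(ratio)   # keeps area proportional to scale**2
    major = minor * ratio
    angle = math.radians(heading_change_deg)         # major axis along the heading change
    return major, minor, angle
```

For example, at standstill the region is a circle of radius `base_radius`; at the assumed maximum speed it becomes a 1.6:1 ellipse with well under half the original area.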
In addition, testers of different skill levels exhibit different degrees of correlation between gaze point and track change. Higher-level testers pay closer attention to track change information, can reach faster speeds, place higher demands on rendering resources, and can be predicted more accurately. Accordingly, the major-axis radius and the area of the ellipse can be adjusted according to the racing skill level the user has previously exhibited.
Fig. 7 illustrates an example block diagram of a computer that may be implemented as a user device in accordance with an example embodiment.
In fig. 7, a Central Processing Unit (CPU) 1301 executes various processes according to a program stored in a Read Only Memory (ROM) 1302 or a program loaded from a storage section 1308 to a Random Access Memory (RAM) 1303. In the RAM 1303, data necessary when the CPU 1301 executes various processes and the like is also stored as needed.
The CPU 1301, ROM 1302, and RAM 1303 are connected to each other via a bus 1304. An input/output interface 1305 is also connected to the bus 1304.
Connected to the input/output interface 1305 are: an input section 1306 including a keyboard, a mouse, and the like; an output section 1307 including a display such as a cathode ray tube (CRT) or a liquid crystal display (LCD), a speaker, and the like; a storage section 1308 including a hard disk and the like; and a communication section 1309 including a network interface card such as a LAN card, a modem, and the like. The communication section 1309 performs communication processing via a network such as the Internet.
The drive 1310 is also connected to the input/output interface 1305 as needed. The removable medium 1311 such as a magnetic disk, an optical disk, a magneto-optical disk, a semiconductor memory, or the like is installed as needed on the drive 1310, so that a computer program read out therefrom is installed into the storage section 1308 as needed.
In the case of implementing the above-described series of processes by software, a program constituting the software is installed from a network such as the internet or a storage medium such as the removable medium 1311.
It will be appreciated by those skilled in the art that such a storage medium is not limited to the removable medium 1311 shown in fig. 7, in which the program is stored and which is distributed separately from the apparatus to provide the program to the user. Examples of the removable medium 1311 include a magnetic disk (including a floppy disk (registered trademark)), an optical disk (including a compact disk read only memory (CD-ROM) and a digital versatile disk (DVD)), a magneto-optical disk (including a MiniDisc (MD) (registered trademark)), and a semiconductor memory. Alternatively, the storage medium may be the ROM 1302, a hard disk contained in the storage section 1308, or the like, in which the program is stored and which is distributed to users together with the device containing it.
Exemplary embodiments of the present disclosure provide a computer program configured to cause a computing device to perform the above-described method when the computer program is executed on the computing device.
Exemplary embodiments of the present disclosure provide a computer program product comprising one or more computer-readable storage media having program instructions collectively stored on the readable storage media that are loadable by a computing device to cause the computing device to perform the same method. However, the computer program may be implemented as a stand-alone module, as a plug-in for a pre-existing software program, or even directly in the latter. In any case, similar considerations apply if the computer program is structured in a different manner, or if additional modules or functions are provided, and likewise the memory structure may be of other types, or may be replaced by an equivalent entity (not necessarily including a physical storage medium). A computer program may take any form suitable for use by any computing device (see below) to configure the computing device to perform the desired operations, and in particular, may be in the form of external or resident software, firmware, or microcode (in object code or source code, for example, for compilation or interpretation). Furthermore, the computer program may be provided on any computer readable storage medium. A storage medium is any tangible medium (other than the transitory signal itself) that can hold and store instructions for use by a computing device. For example, the storage medium may be of an electrical, magnetic, optical, electromagnetic, infrared or semiconductor type, examples of such storage medium being a fixed disk (in which the program may be preloaded), a removable disk, a memory key (e.g., of a USB type), etc. 
The computer program may be downloaded to the computing device from a storage medium or via a network (e.g., the internet, a wide area network, and/or a local area network, including transmission cables, fiber optic, wireless connections, network devices), and one or more network adapters in the computing device receive the computer program from the network and forward it for storage in one or more storage devices of the computing device.
In any event, the technical solution according to the exemplary embodiments of the present disclosure is itself implemented even with a hardware structure (e.g., by electronic circuitry integrated in one or more chips of semiconductor material, such as a Field Programmable Gate Array (FPGA) or application specific integrated circuit), or with a combination of software and hardware suitably programmed or otherwise configured.
Exemplary embodiments of the present disclosure provide a computing device comprising components configured to perform the steps of the above-described methods. Exemplary embodiments of the present disclosure provide a computing device that includes circuitry (i.e., any hardware suitably configured by software, for example) for performing each step of the same method. However, the computing device may be of any type (e.g., a central unit of an imaging system, a separate computer, etc.).
Exemplary embodiments of the present disclosure are described above with reference to the drawings, but the present disclosure is of course not limited to the above examples. Various changes and modifications may be made by those skilled in the art within the scope of the appended claims, and it is understood that such changes and modifications will naturally fall within the technical scope of the present disclosure.
In this specification, the steps described in the flowcharts include not only processes performed in time series in the order described, but also processes performed in parallel or individually, not necessarily in time series. Further, even in the steps of time-series processing, needless to say, the order may be appropriately changed.
Although the present disclosure and its advantages have been described in detail, it should be understood that various changes, substitutions and alterations can be made herein without departing from the spirit and scope of the disclosure as defined by the appended claims. Moreover, the terms "comprises," "comprising," or any other variation thereof, are intended to cover a non-exclusive inclusion, such that a process, method, article, or apparatus that comprises a list of elements does not include only those elements but may include other elements not expressly listed or inherent to such process, method, article, or apparatus. Without further limitation, an element defined by the phrase "comprising a" does not exclude the presence of other like elements in the process, method, article, or apparatus that comprises the element.
[ Exemplary implementations of the present disclosure ]
Various implementations are conceivable to implement the concepts of the present disclosure, including but not limited to the following illustrative examples (EEs):
EE1, an electronic device comprising:
Processor, and
A memory storing computer program code, wherein the computer program code, when executed by the processor, causes the electronic device to perform operations comprising:
Receiving the gaze point position and head rotation information of a user;
Acquiring interactive content information;
Predicting a gaze range of a user using an Artificial Intelligence (AI) model based on gaze point location and head rotation information of the user and interactive content information, and
Based on the predicted gaze range, an interactive picture is rendered for display.
EE2, the electronic device of EE1, wherein the operations further comprise:
For a moment, extracting feature data corresponding to that moment from the gaze point position, head rotation information and interactive content information at that moment and the interactive content information at subsequent moments;
inputting the feature data corresponding to the current moment and the feature data corresponding to moments within a previous time period into the AI model to predict the user's gaze range at the current moment, and
Based on the predicted gaze range, an interactive picture is rendered for display.
EE3, the electronic device of EE2, wherein the operations further comprise:
Predicting at least one of the user's head rotation, gaze point position, and gaze range at the next moment, for pre-rendering of the interactive picture at the next moment.
EE4, the electronic device according to EE2, wherein the memory further stores feature data corresponding to a plurality of moments.
EE5, the electronic device according to EE1, wherein the user's gaze range is defined by the position, size, and shape of the gaze point at which the user gazes at the screen.
EE6, the electronic device according to EE1, further comprising:
An eye tracker for acquiring a gaze point position of a user, and
A sensor for acquiring head rotation information of the user.
EE7, the electronic device according to EE6, wherein the eye tracker uses infrared-based video oculography (VOG) technology.
EE8, the electronic device according to EE1, wherein the interactive content information comprises at least one of: a state of a scene, a state of an object in the scene, a state of a character, a scene graph, and a configuration script.
EE9, the electronic device of EE1, wherein the AI model comprises at least one of a convolutional neural network (CNN), a transformer, or a Mamba model.
EE10, the electronic device of EE1, wherein the operations further comprise:
rendering the interactive picture within the gaze range at a first resolution and rendering the interactive picture surrounding the gaze range at a second resolution lower than the first resolution.
EE11, the electronic device of EE1, wherein the operations further comprise:
Initializing personality traits of the user, and
Iteratively performing:
Inputting the personality characteristics, the gaze point position and the head rotation information of the user and the interactive content information into the AI model together so as to predict the gaze range of the user;
determining a prediction accuracy by comparing the predicted gaze range with the actual gaze range, and
And adjusting the personality characteristics of the user until the prediction accuracy exceeds a preset threshold.
EE12, the electronic device of EE11, wherein the operations further comprise:
Determining personality characteristics of the user that cause the prediction accuracy to exceed a preset threshold, and
The determined personality characteristics of the user are input to the AI model along with gaze point location, head rotation information, and interactive content information to predict a gaze range of the user.
EE13, an electronic device comprising:
Processor, and
A memory storing computer program code, wherein the computer program code, when executed by the processor, causes the electronic device to perform operations comprising:
Preparing a training set comprising input data and output data, wherein the input data comprises gaze point position and head rotation information of a user and interactive content information corresponding to a series of moments, and the output data comprises gaze ranges of the user, and
An Artificial Intelligence (AI) model is trained on the training set to determine parameters of the AI model.
EE14, the electronic device according to EE13, wherein the user's gaze range is defined by the position, size, and shape of the gaze point at which the user gazes at the screen.
EE15, the electronic device according to EE13, wherein the interactive content information comprises at least one of: a state of a scene, a state of an object in the scene, a state of a character, a scene graph, and a configuration script.
EE16, the electronic device of EE13, wherein the AI model comprises at least one of a convolutional neural network (CNN), a transformer, or a Mamba model.
EE17, the electronic device of EE13, wherein preparing the training set further comprises, for a moment in the series of moments:
Preparing input data including the gaze point position, head rotation information and interactive content information at that moment and the interactive content information at a subsequent moment, and corresponding output data including the gaze range at that moment, and
The input data and the output data corresponding to the time are paired.
EE18, the electronic device of EE13, wherein the interactive content information comprises a scene graph, and wherein preparing the training set further comprises at least one of:
Concatenating the coordinates of the gaze point position after the pixel values of the scene graph;
encoding the coordinates of the gaze point position into a feature vector and concatenating it after the pixel values of the scene graph, or
characterizing the gaze point location using a Gaussian-distribution heat map and providing it as an input channel separate from the scene graph.
EE19, a method comprising:
Receiving the gaze point position and head rotation information of a user;
Acquiring interactive content information;
Predicting a gaze range of a user using an Artificial Intelligence (AI) model based on gaze point location and head rotation information of the user and interactive content information, and
Based on the predicted gaze range, an interactive picture is rendered for display.
EE20, a method comprising:
Preparing a training set comprising input data and output data, wherein the input data comprises gaze point position and head rotation information of a user and interactive content information corresponding to a series of moments, and the output data comprises gaze ranges of the user, and
An Artificial Intelligence (AI) model is trained on the training set to determine parameters of the AI model.
EE21, a computer readable storage medium containing executable instructions that when executed cause an electronic device to perform a method as described by EE19 or EE 20.

Claims (10)

1. An electronic device, comprising: a processor; and a memory storing computer program code, wherein the computer program code, when executed by the processor, causes the electronic device to perform operations comprising: receiving gaze point position and head rotation information of a user; acquiring interactive content information; predicting a gaze range of the user using an artificial intelligence (AI) model based on the gaze point position and head rotation information of the user and the interactive content information; and rendering an interactive picture for display based on the predicted gaze range.
2. The electronic device according to claim 1, wherein the operations further comprise: for a moment, extracting feature data corresponding to that moment from the gaze point position, head rotation information and interactive content information at that moment and the interactive content information at subsequent moments; inputting the feature data corresponding to the current moment and the feature data corresponding to moments within a previous time period into the AI model to predict the user's gaze range at the current moment; and rendering an interactive picture for display based on the predicted gaze range.
3. The electronic device according to claim 2, wherein the operations further comprise: predicting at least one of the user's head rotation, gaze point position, and gaze range at the next moment, for pre-rendering of the interactive picture at the next moment.
4. The electronic device according to claim 2, wherein the memory further stores feature data corresponding to a plurality of moments.
5. The electronic device according to claim 1, wherein the user's gaze range is defined by the position, size, and shape of the gaze point at which the user gazes at the screen.
6. The electronic device according to claim 1, further comprising: an eye tracker for acquiring the user's gaze point position; and a sensor for acquiring the user's head rotation information.
7. The electronic device according to claim 6, wherein the eye tracker uses infrared-based video oculography (VOG) technology.
8. The electronic device according to claim 1, wherein the interactive content information comprises at least one of: a state of a scene, a state of an object in the scene, a state of a character, a scene graph, and a configuration script.
9. The electronic device according to claim 1, wherein the AI model comprises at least one of: a convolutional neural network (CNN), a transformer, or a Mamba model.
10. The electronic device according to claim 1, wherein the operations further comprise: rendering the interactive picture within the gaze range at a first resolution, and rendering the interactive picture surrounding the gaze range at a second resolution lower than the first resolution.
CN202410076665.4A 2024-01-18 2024-01-18 Electronic device, method and storage medium Pending CN120339558A (en)

Priority Applications (2)

Application Number Priority Date Filing Date Title
CN202410076665.4A CN120339558A (en) 2024-01-18 2024-01-18 Electronic device, method and storage medium
PCT/CN2025/072190 WO2025152914A1 (en) 2024-01-18 2025-01-14 Electronic device, method, and storage medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202410076665.4A CN120339558A (en) 2024-01-18 2024-01-18 Electronic device, method and storage medium

Publications (1)

Publication Number Publication Date
CN120339558A true CN120339558A (en) 2025-07-18

Family

ID=96361359

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202410076665.4A Pending CN120339558A (en) 2024-01-18 2024-01-18 Electronic device, method and storage medium

Country Status (2)

Country Link
CN (1) CN120339558A (en)
WO (1) WO2025152914A1 (en)

Family Cites Families (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US10890968B2 (en) * 2018-05-07 2021-01-12 Apple Inc. Electronic device with foveated display and gaze prediction
US11886634B2 (en) * 2021-03-19 2024-01-30 Nvidia Corporation Personalized calibration functions for user gaze detection in autonomous driving applications
CN113419623A (en) * 2021-05-27 2021-09-21 中国人民解放军军事科学院国防科技创新研究院 Non-calibration eye movement interaction method and device
CN114170537B (en) * 2021-12-03 2025-05-06 浙江大学 A multimodal three-dimensional visual attention prediction method and its application
CN115061576B (en) * 2022-08-10 2023-04-07 北京微视威信息科技有限公司 Method for predicting fixation position of virtual reality scene and virtual reality equipment

Also Published As

Publication number Publication date
WO2025152914A1 (en) 2025-07-24

Similar Documents

Publication Publication Date Title
US11836289B2 (en) Use of eye tracking to adjust region-of-interest (ROI) for compressing images for transmission
US11217021B2 (en) Display system having sensors
US10739849B2 (en) Selective peripheral vision filtering in a foveated rendering system
US10775886B2 (en) Reducing rendering computation and power consumption by detecting saccades and blinks
US10720128B2 (en) Real-time user adaptive foveated rendering
CN108170279B (en) Eye movement and head movement interaction method of head display equipment
KR102281026B1 (en) Hologram anchoring and dynamic positioning
US9348141B2 (en) Low-latency fusing of virtual and real content
CN114647318A (en) Ways to track the location of a device
JP2019502223A (en) Collection, selection and combination of eye images
US20220319041A1 (en) Egocentric pose estimation from human vision span
CN114026603B (en) Rendering computer-generated real text
EP4315248A1 (en) Egocentric pose estimation from human vision span
CN107065164B (en) Image presentation method and device
CN115509345B (en) Virtual reality scene display processing method and virtual reality device
CN120339558A (en) Electronic device, method and storage medium
US12536737B1 (en) Dynamic frame selection for scene understanding
KR20250154931A (en) Head-mounted display device and its operating method
CN119893066A (en) Anti-dazzling wide dynamic technology application method based on virtual reality
CN118343924A (en) Virtual object motion processing method, device, equipment and medium

Legal Events

Date Code Title Description
PB01 Publication
PB01 Publication