CN112114659B - Method and system for determining a fine gaze point of a user - Google Patents
- Publication number
- CN112114659B CN112114659B CN202010500200.9A CN202010500200A CN112114659B CN 112114659 B CN112114659 B CN 112114659B CN 202010500200 A CN202010500200 A CN 202010500200A CN 112114659 B CN112114659 B CN 112114659B
- Authority
- CN
- China
- Prior art keywords
- user
- data
- spatial representation
- determined
- determining
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Active
Classifications
-
- G—PHYSICS
- G02—OPTICS
- G02B—OPTICAL ELEMENTS, SYSTEMS OR APPARATUS
- G02B27/00—Optical systems or apparatus not provided for by any of the groups G02B1/00 - G02B26/00, G02B30/00
- G02B27/0093—Optical systems or apparatus not provided for by any of the groups G02B1/00 - G02B26/00, G02B30/00 with means for monitoring data relating to the user, e.g. head-tracking, eye-tracking
-
- A—HUMAN NECESSITIES
- A61—MEDICAL OR VETERINARY SCIENCE; HYGIENE
- A61F—FILTERS IMPLANTABLE INTO BLOOD VESSELS; PROSTHESES; DEVICES PROVIDING PATENCY TO, OR PREVENTING COLLAPSING OF, TUBULAR STRUCTURES OF THE BODY, e.g. STENTS; ORTHOPAEDIC, NURSING OR CONTRACEPTIVE DEVICES; FOMENTATION; TREATMENT OR PROTECTION OF EYES OR EARS; BANDAGES, DRESSINGS OR ABSORBENT PADS; FIRST-AID KITS
- A61F4/00—Methods or devices enabling patients or disabled persons to operate an apparatus or a device not forming part of the body
-
- G—PHYSICS
- G02—OPTICS
- G02B—OPTICAL ELEMENTS, SYSTEMS OR APPARATUS
- G02B27/00—Optical systems or apparatus not provided for by any of the groups G02B1/00 - G02B26/00, G02B30/00
- G02B27/01—Head-up displays
-
- G—PHYSICS
- G02—OPTICS
- G02B—OPTICAL ELEMENTS, SYSTEMS OR APPARATUS
- G02B27/00—Optical systems or apparatus not provided for by any of the groups G02B1/00 - G02B26/00, G02B30/00
- G02B27/01—Head-up displays
- G02B27/017—Head mounted
- G02B27/0172—Head mounted characterised by optical features
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F3/00—Input arrangements for transferring data to be processed into a form capable of being handled by the computer; Output arrangements for transferring data from processing unit to output unit, e.g. interface arrangements
- G06F3/01—Input arrangements or combined input and output arrangements for interaction between user and computer
- G06F3/011—Arrangements for interaction with the human body, e.g. for user immersion in virtual reality
- G06F3/013—Eye tracking input arrangements
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F3/00—Input arrangements for transferring data to be processed into a form capable of being handled by the computer; Output arrangements for transferring data from processing unit to output unit, e.g. interface arrangements
- G06F3/01—Input arrangements or combined input and output arrangements for interaction between user and computer
- G06F3/048—Interaction techniques based on graphical user interfaces [GUI]
- G06F3/0484—Interaction techniques based on graphical user interfaces [GUI] for the control of specific functions or operations, e.g. selecting or manipulating an object, an image or a displayed text element, setting a parameter value or selecting a range
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06T—IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
- G06T7/00—Image analysis
- G06T7/50—Depth or shape recovery
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06T—IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
- G06T7/00—Image analysis
- G06T7/70—Determining position or orientation of objects or cameras
- G06T7/73—Determining position or orientation of objects or cameras using feature-based methods
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06V—IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
- G06V40/00—Recognition of biometric, human-related or animal-related patterns in image or video data
- G06V40/10—Human or animal bodies, e.g. vehicle occupants or pedestrians; Body parts, e.g. hands
- G06V40/18—Eye characteristics, e.g. of the iris
- G06V40/19—Sensors therefor
-
- H—ELECTRICITY
- H04—ELECTRIC COMMUNICATION TECHNIQUE
- H04N—PICTORIAL COMMUNICATION, e.g. TELEVISION
- H04N13/00—Stereoscopic video systems; Multi-view video systems; Details thereof
- H04N13/30—Image reproducers
- H04N13/366—Image reproducers using viewer tracking
- H04N13/383—Image reproducers using viewer tracking for tracking with gaze detection, i.e. detecting the lines of sight of the viewer's eyes
-
- G—PHYSICS
- G02—OPTICS
- G02B—OPTICAL ELEMENTS, SYSTEMS OR APPARATUS
- G02B27/00—Optical systems or apparatus not provided for by any of the groups G02B1/00 - G02B26/00, G02B30/00
- G02B27/01—Head-up displays
- G02B27/0101—Head-up displays characterised by optical features
- G02B2027/0138—Head-up displays characterised by optical features comprising image capture systems, e.g. camera
-
- G—PHYSICS
- G02—OPTICS
- G02B—OPTICAL ELEMENTS, SYSTEMS OR APPARATUS
- G02B27/00—Optical systems or apparatus not provided for by any of the groups G02B1/00 - G02B26/00, G02B30/00
- G02B27/01—Head-up displays
- G02B27/0101—Head-up displays characterised by optical features
- G02B2027/014—Head-up displays characterised by optical features comprising information/image processing systems
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06T—IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
- G06T2207/00—Indexing scheme for image analysis or image enhancement
- G06T2207/10—Image acquisition modality
- G06T2207/10028—Range image; Depth image; 3D point clouds
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06T—IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
- G06T2207/00—Indexing scheme for image analysis or image enhancement
- G06T2207/30—Subject of image; Context of image processing
- G06T2207/30196—Human being; Person
- G06T2207/30201—Face
Landscapes
- Engineering & Computer Science (AREA)
- Physics & Mathematics (AREA)
- General Physics & Mathematics (AREA)
- Theoretical Computer Science (AREA)
- General Engineering & Computer Science (AREA)
- Human Computer Interaction (AREA)
- Health & Medical Sciences (AREA)
- Optics & Photonics (AREA)
- Multimedia (AREA)
- Signal Processing (AREA)
- Computer Vision & Pattern Recognition (AREA)
- General Health & Medical Sciences (AREA)
- Ophthalmology & Optometry (AREA)
- Biomedical Technology (AREA)
- Heart & Thoracic Surgery (AREA)
- Vascular Medicine (AREA)
- Life Sciences & Earth Sciences (AREA)
- Animal Behavior & Ethology (AREA)
- Public Health (AREA)
- Veterinary Medicine (AREA)
- Image Analysis (AREA)
- User Interface Of Digital Computer (AREA)
Abstract
An eye tracking system, a head mounted device, a computer program, a carrier, and a method in an eye tracking system for determining a fine gaze point of a user are disclosed. In the method, a gaze convergence distance of the user is determined. Further, a spatial representation of at least a portion of the user's field of view is obtained, and depth data for at least a portion of the spatial representation is obtained. Saliency data for the spatial representation is determined based on the determined gaze convergence distance and the obtained depth data, and a fine gaze point of the user is determined based on the determined saliency data.
Description
Technical Field
The present disclosure relates to the field of eye tracking. In particular, the present disclosure relates to a method and system for determining a fine gaze point of a user.
Background
Eye/gaze tracking functionality is introduced in an increasing number of applications, such as Virtual Reality (VR) applications and Augmented Reality (AR) applications. By introducing such an eye tracking function, an estimated gaze point of the user may be determined, which in turn may be used as input for other functions.
When determining an estimated gaze point of a user in an eye tracking system, the signal representing the estimated gaze point may deviate, for example due to measurement errors of the eye tracking system. Even if the user actually keeps gazing at the same point during a certain period of time, different gaze points may be determined in different measurement cycles during that period. In US 2016/0291690 A1, saliency data of a user's field of view is used together with the gaze directions of the user's eyes in order to determine the point of interest at which the user is gazing more reliably. However, determining saliency data for the user's field of view requires processing, and even when saliency data is used, the determined point of interest may differ from the actual point of interest.
It is desirable to provide an eye tracking technique that provides a more robust and accurate gaze point than known methods.
Disclosure of Invention
It is an object of the present disclosure to provide a method and system that seeks to mitigate, alleviate or eliminate one or more of the above-mentioned disadvantages of the prior art.
This object is achieved by a method, an eye tracking system, a head mounted device, a computer program and a carrier according to the appended claims.
According to one aspect, a method in an eye tracking system for determining a fine gaze point of a user is provided. In the method, a gaze convergence distance of the user is determined, a spatial representation of at least a portion of a field of view of the user is obtained, and depth data of at least a portion of the spatial representation is obtained. Significance data of the spatial representation is determined based on the determined gaze convergence distance and the obtained depth data, and then a fine gaze point of the user is determined based on the determined significance data.
The saliency data provides a measure, for attributes in the user's field of view as represented in the spatial representation, of the likelihood that these attributes attract visual attention. Determining saliency data for a spatial representation means determining saliency data associated with at least a portion of the spatial representation.
The depth data of at least a portion of the spatial representation indicates a distance from the user's eyes to an object or feature in the user's field of view corresponding to the at least a portion of the spatial representation. These distances are real or virtual, depending on the application (e.g., AR or VR).
The gaze convergence distance indicates the distance from the user's eyes at which the user's gaze is focused. Any method of determining the convergence distance may be used, such as a method based on the gaze directions of the user's eyes and the intersection between those directions, or a method based on the interpupillary distance.
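For illustration only, the following Python sketch shows one way the intersection-based variant could be implemented: the convergence distance is taken as the distance from the midpoint between the eyes to the point where the two gaze rays pass closest to each other. The function name, the coordinate convention, and the per-eye origin and direction inputs are assumptions made for this sketch, not taken from the disclosure.

```python
import numpy as np

def convergence_distance(left_origin, left_dir, right_origin, right_dir):
    """Estimate the gaze convergence distance as the distance from the
    midpoint between the eye positions to the point where the two gaze
    rays pass closest to each other (they rarely intersect exactly)."""
    d1 = left_dir / np.linalg.norm(left_dir)
    d2 = right_dir / np.linalg.norm(right_dir)
    w0 = left_origin - right_origin
    a, b, c = d1 @ d1, d1 @ d2, d2 @ d2
    d, e = d1 @ w0, d2 @ w0
    denom = a * c - b * b
    if abs(denom) < 1e-9:              # near-parallel rays: gaze at "infinity"
        return float("inf")
    t1 = (b * e - c * d) / denom       # parameter along the left gaze ray
    t2 = (a * e - b * d) / denom       # parameter along the right gaze ray
    p1 = left_origin + t1 * d1
    p2 = right_origin + t2 * d2
    eye_midpoint = (left_origin + right_origin) / 2
    vergence_point = (p1 + p2) / 2
    return float(np.linalg.norm(vergence_point - eye_midpoint))

# Example: eyes 64 mm apart, both gazing at a point 2 m straight ahead
left_eye = np.array([-0.032, 0.0, 0.0])
right_eye = np.array([0.032, 0.0, 0.0])
target = np.array([0.0, 0.0, 2.0])
print(round(convergence_distance(left_eye, target - left_eye,
                                 right_eye, target - right_eye), 3))  # ~2.0
```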
Determining saliency data based on the determined gaze convergence distance and the depth data obtained for at least a portion of the spatial representation enables the saliency data to be determined faster and with less processing. It further enables the determination of a fine gaze point of the user, which is a more accurate estimate of the user's point of interest.
In some embodiments, determining the saliency data of the spatial representation includes identifying a first depth region in the spatial representation that corresponds to depth data obtained within a predetermined range including the determined gaze convergence distance. Saliency data for a first depth region of the spatial representation is then determined.
The identified first depth region of the spatial representation corresponds to objects or features in at least a portion of the user's field of view that are within the predetermined range including the determined gaze convergence distance. A user is typically more likely to be looking at one of the objects or features within the predetermined range than at objects or features corresponding to regions of the spatial representation having depth data outside the predetermined range. It is therefore beneficial to determine saliency data for the first depth region and to determine a fine gaze point based on the determined saliency data.
In some embodiments, determining the saliency data for the spatial representation includes identifying a second depth region of the spatial representation corresponding to depth data obtained outside the predetermined range including the gaze convergence distance, and suppressing the determination of saliency data for the second depth region of the spatial representation.
The identified second depth region of the spatial representation corresponds to objects or features in at least a portion of the user's field of view that are outside the predetermined range including the determined gaze convergence distance. A user is generally less likely to be looking at one of the objects or features outside the predetermined range than at objects or features corresponding to regions of the spatial representation having depth data within the predetermined range. It is therefore beneficial to suppress the determination of saliency data for the second depth region, to avoid processing that may be unnecessary or may even give misleading results. This reduces the processing required for determining the saliency data compared with a method that determines saliency data without using the determined gaze convergence distance and the depth data of at least a portion of the spatial representation.
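A minimal sketch of how the first and second depth regions could be handled in practice, assuming the spatial representation is a 2D image with an aligned per-pixel depth map; the tolerance value, array shapes, and function names are illustrative assumptions rather than anything specified in the disclosure.

```python
import numpy as np

def first_depth_region_mask(depth_map, convergence_distance, tolerance=0.5):
    """Boolean mask of the 'first depth region': pixels whose depth lies
    within a predetermined range around the determined gaze convergence
    distance. Everything else belongs to the 'second depth region'."""
    return (depth_map >= convergence_distance - tolerance) & \
           (depth_map <= convergence_distance + tolerance)

def masked_saliency(image, depth_map, convergence_distance, saliency_fn):
    """Keep saliency values only inside the first depth region; the second
    depth region is left at zero. A real system could additionally restrict
    the saliency model to the mask's bounding box to save processing."""
    mask = first_depth_region_mask(depth_map, convergence_distance)
    saliency = np.zeros(depth_map.shape, dtype=float)
    if mask.any():
        saliency[mask] = saliency_fn(image)[mask]
    return saliency
```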
In some embodiments, the fine gaze point of the user is determined, based on the determined saliency data, as the point corresponding to the highest saliency. The determined fine gaze point will thus be the point that is in some respect most likely to attract visual attention. When combined with determining saliency data only for the identified first depth region, i.e. the region of the spatial representation corresponding to depth data obtained within the predetermined range including the determined gaze convergence distance, the determined fine gaze point will be the point within the first depth region that is most likely to attract visual attention in some respect.
In some embodiments, determining saliency data for the spatial representation includes determining first saliency data for the spatial representation based on visual saliency, determining second saliency data for the spatial representation based on the determined gaze convergence distance and the obtained depth data, and determining the saliency data based on the first saliency data and the second saliency data. The first saliency data may be based on, for example, high contrast, vivid colors, size, motion, etc. After optional normalization and weighting, the different types of saliency data are combined.
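A short sketch of one plausible way to normalize and combine the first (visual) and second (depth-based) saliency data; the equal default weights and function names are assumptions for illustration.

```python
import numpy as np

def combine_saliency(first_saliency, second_saliency, w_first=0.5, w_second=0.5):
    """Normalize both saliency maps to [0, 1] and blend them with weights."""
    def normalize(s):
        s = s.astype(float)
        rng = s.max() - s.min()
        return (s - s.min()) / rng if rng > 0 else np.zeros_like(s)
    return w_first * normalize(first_saliency) + w_second * normalize(second_saliency)
```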
In some embodiments, the method further comprises determining a new gaze convergence distance for the user, determining new saliency data for the spatial representation based on the new gaze convergence distance, and determining a new fine gaze point of the user based on the new saliency data. Thus, new fine gaze points may be determined dynamically based on gaze convergence distances determined over time. Several alternatives are envisaged, for example using only the most recently determined gaze convergence distance, or the mean of the gaze convergence distances determined within a predetermined period of time.
In some embodiments, the method further comprises determining a plurality of gaze points of the user and identifying a cropped region of the spatial representation based on the determined plurality of gaze points. Preferably, determining the saliency data then comprises determining saliency data for the identified cropped region of the spatial representation.
The user is generally more likely to be looking at points corresponding to the cropped region than at points corresponding to regions outside the cropped region. It is therefore advantageous to determine saliency data for the cropped region and to determine a fine gaze point based on the determined saliency data.
In some embodiments, the method further includes suppressing the determination of saliency data for regions of the spatial representation that are outside the identified cropped region of the spatial representation.
A user is generally less likely to be looking at points corresponding to regions outside the cropped region than at points within the cropped region. It is therefore beneficial to suppress the determination of saliency data for regions outside the cropped region, to avoid processing that may be unnecessary or may even give misleading results. This reduces the processing required for determining the saliency data relative to a method in which saliency data is determined without cropping based on the determined gaze points of the user.
In some embodiments, obtaining depth data includes obtaining depth data for the identified cropped region of the spatial representation. By obtaining depth data for the identified cropped region, and not necessarily for regions outside it, saliency data within the cropped region may be determined based only on the depth data obtained for that region. The amount of processing required to determine the saliency data can thus be further reduced.
In some embodiments, the method further comprises determining a respective gaze convergence distance for each of the plurality of determined gaze points of the user.
In some embodiments, the method further comprises determining a new gaze point of the user. If the determined new gaze point is within the identified cropped region, the new cropped region is identified as being the same as the identified cropped region. Alternatively, if the determined new gaze point is outside the identified cropped region, a new cropped region is identified that includes the determined new gaze point and differs from the identified cropped region.
If the new gaze point determined for the user is within the identified cropped region, the user is likely still looking at a point within that region. By keeping the same cropped region in this case, any saliency data already determined for the identified cropped region may be reused, so no further processing is required to determine saliency for it.
In some embodiments, successive gaze points of the user are determined in successive time intervals. Further, for each time interval, it is determined whether the user is fixating or saccading. If the user is fixating, a fine gaze point is determined. If the user is saccading, the determination of a fine gaze point is suppressed. When the user is fixating, the user is likely looking at a particular point at that time, so a fine gaze point can be determined correctly. When the user is saccading, on the other hand, the user is unlikely to be looking at a particular point at that time, so a fine gaze point is unlikely to be determined correctly. These embodiments thus reduce processing by determining a fine gaze point only when it can be determined correctly.
In some embodiments, successive gaze points of the user are determined in successive time intervals. Further, for each time interval, it is determined whether the user is in smooth pursuit. If the user is in smooth pursuit, successive cropped regions comprising the successive gaze points are identified such that the successive cropped regions follow the smooth pursuit. Once smooth pursuit is detected, letting the cropped region follow the pursuit means that little additional processing is required to determine the successive cropped regions.
In some embodiments, the spatial representation is an image, such as a 2D image of the real world, a 3D image of the real world, a 2D image of the virtual environment, or a 3D image of the virtual environment. The data may come from a photo sensor, a virtual 3D scene, or from another type of image sensor or a spatial sensor.
According to a second aspect, an eye tracking system for determining a gaze point of a user is provided. The eye tracking system includes a processor and a memory containing instructions executable by the processor. The eye tracking system is operable to determine a gaze convergence distance of the user and obtain a spatial representation of at least a portion of the user's field of view. The eye tracking system is further operable to obtain depth data for at least a portion of the spatial representation, and determine saliency data for the spatial representation based on the determined gaze convergence distance and the obtained depth data. The eye tracking system is further operable to determine a fine gaze point of the user based on the determined saliency data.
Embodiments of the eye tracking system according to the second aspect may for example comprise features corresponding to features of any embodiment of the method according to the first aspect.
According to a third aspect, a head mounted device for determining a gaze point of a user is provided. The head mounted device includes a processor and a memory containing instructions executable by the processor. The head mounted device is operable to determine a gaze convergence distance of the user and obtain a spatial representation of at least a portion of the user's field of view. The head mounted device is further operable to obtain depth data for at least a portion of the spatial representation, and to determine saliency data for the spatial representation based on the determined gaze convergence distance and the obtained depth data. The head mounted device is further operable to determine a fine gaze point of the user based on the determined saliency data.
In some embodiments, the head mounted device further comprises one of a transparent display and a non-transparent display.
Embodiments of the head mounted device according to the third aspect may for example comprise features corresponding to features of any embodiment of the method according to the first aspect.
According to a fourth aspect, a computer program is provided. The computer program comprises instructions that, when executed by at least one processor, cause the at least one processor to determine a gaze convergence distance of the user and obtain a spatial representation of a field of view of the user. Further, the at least one processor is caused to obtain depth data for at least a portion of the spatial representation, and determine saliency data for the spatial representation based on the determined line-of-sight convergence distance and the obtained depth data. Further, the at least one processor is caused to determine a fine gaze point of the user based on the determined saliency data.
Embodiments of the computer program according to the fourth aspect may for example comprise features corresponding to features of any embodiment of the method according to the first aspect.
According to a fifth aspect, there is provided a carrier comprising a computer program according to the fourth aspect. The carrier is one of an electronic signal, an optical signal, a radio signal, and a computer readable storage medium.
Embodiments of the carrier according to the fifth aspect may for example comprise features corresponding to features of any embodiment of the method according to the first aspect.
Drawings
These and other aspects will now be described in the following illustrative and non-limiting detailed description with reference to the accompanying drawings.
Fig. 1 is a flow chart illustrating an embodiment of a method according to the present disclosure.
Fig. 2 includes images showing the results of steps of an embodiment of a method according to the present disclosure.
Fig. 3 is a flow chart illustrating steps of a method according to the present disclosure.
Fig. 4 is a flow chart illustrating further steps of a method according to the present disclosure.
Fig. 5 is a flow chart illustrating yet further steps of a method according to the present disclosure.
Fig. 6 is a block diagram illustrating an embodiment of an eye tracking system according to the present disclosure.
All the figures are schematic and not necessarily to scale, and generally only show parts which are necessary in order to elucidate the respective examples, whereas other parts may be omitted or merely suggested.
Detailed Description
Aspects of the present disclosure will be described more fully hereinafter with reference to the accompanying drawings. However, the methods, eye tracking systems, head mounted devices, computer programs, and carriers disclosed herein may be embodied in many different forms and should not be construed as limited to the aspects set forth herein. Throughout the drawings, like reference numerals refer to like elements throughout.
The saliency data provides a measure, for attributes in the user's field of view as represented in the spatial representation, of the likelihood that these attributes are noticeable to human vision. Properties that are most likely to attract human visual attention include, for example, color, motion, orientation, and scale. Such saliency data may be determined using a saliency model. A saliency model generally predicts what will attract human visual attention. Many saliency models are based on a set of biologically plausible features that simulate early visual processes, with the saliency of a region determined based on, for example, the degree to which the region differs from its surroundings.
In a spatial representation of a user's field of view, a saliency model may be used to identify different visual features that contribute to different degrees of attention selection of a stimulus, and generate saliency data indicative of saliency of different points in the spatial representation. A fine gaze point that is more likely to correspond to the point of interest at which the user is gazing may then be determined based on the determined saliency data.
When saliency data is determined by a saliency model for a spatial representation, for example in the form of a 2D image, the saliency of each pixel of the image may be analyzed according to some visual attribute and a saliency value assigned to each pixel for that attribute. Once the saliency has been calculated for each pixel, the difference in saliency between pixels is known. Optionally, salient pixels may then be grouped into salient regions to simplify the feature results.
Prior art saliency models that take an image as input typically use a bottom-up approach to compute saliency. The inventors have realized that additional top-down information about the user, determined by the eye tracking system, may be used to estimate the point of interest at which the user is gazing more accurately and/or to make the saliency model run faster. The top-down information provided by the eye tracker may be one or more determined gaze convergence distances of the user. Further top-down information provided by the eye tracker may be one or more determined gaze points of the user. Saliency data for the spatial representation is then determined based on this top-down information.
Fig. 1 is a flow chart illustrating an embodiment of a method 100 for determining a fine gaze point of a user in an eye tracking system. In the method, a gaze convergence distance of the user is determined 110. The gaze convergence distance indicates the distance from the user's eyes at which the user's gaze is focused. Any method of determining the convergence distance may be used, such as a method based on the gaze directions of the user's eyes and the intersection between those directions, a method based on time-of-flight measurements, or a method based on the interpupillary distance. The eye tracking system in which the method 100 is performed may be, for example, a head-mounted system such as Augmented Reality (AR) glasses or Virtual Reality (VR) glasses, but may also be a remote eye tracking system that is not head-mounted and is located at a distance from the user. Further, the method comprises the step of obtaining 120 a spatial representation of at least a portion of the field of view of the user. The spatial representation may be, for example, a digital image of at least a portion of the user's field of view captured by one or more cameras in, or remote from, the eye tracking system. Further, depth data of at least a portion of the spatial representation is obtained 130. The depth data of the spatial representation of the user's field of view indicates the real or virtual distance from the user's eyes to a point or portion of an object or feature in the user's field of view. The depth data is associated with points or portions of the spatial representation corresponding to points or portions of objects or features in the user's field of view, respectively. Thus, a point or region in the spatial representation that represents a point on, or a portion of, an object or feature in the user's field of view will have depth data indicating the distance from the user's eyes to that point or portion. For example, the spatial representation may be two images (a stereoscopic pair) taken by two outward-facing cameras in the head mounted device separated by a lateral distance. The distance from the user's eyes to a point or portion of an object or feature in the user's field of view can then be determined by analyzing the two images. The depth data thus determined may be linked to the points or portions of the two images corresponding to the respective points or portions of objects or features in the user's field of view. Other spatial representations are also possible, such as 3D meshes based on time-of-flight measurements or simultaneous localization and mapping (SLAM). Based on the determined gaze convergence distance and the obtained depth data, saliency data of the spatial representation is determined 140. Finally, a fine gaze point of the user is determined 150 based on the determined saliency data.
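For the stereoscopic example above, depth is commonly recovered from pixel disparity between rectified left and right camera images. The sketch below shows only the disparity-to-depth conversion (Z = f * B / d); the disparity map itself, the camera calibration values, and the function name are assumptions for illustration, and in a head-mounted device the camera baseline only approximates the distance from the user's eyes.

```python
import numpy as np

def disparity_to_depth(disparity_px, focal_length_px, baseline_m):
    """Convert a disparity map (pixels) from a rectified stereo pair into
    metric depth using Z = f * B / d; zero disparity maps to infinity."""
    disparity = np.asarray(disparity_px, dtype=float)
    depth = np.full_like(disparity, np.inf)
    valid = disparity > 0
    depth[valid] = focal_length_px * baseline_m / disparity[valid]
    return depth

# Example: a 10 px disparity with a 600 px focal length and 6 cm baseline
print(disparity_to_depth([[10.0]], 600.0, 0.06))  # [[3.6]] metres
```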
Depending on the application, the depth data of the spatial representation of the user's field of view indicates the real or virtual distance from the user's eyes to a point or portion of an object or feature in the field of view. Where the spatial representation includes a representation of a real world object or feature of at least a portion of the user's field of view, the distances indicated by the depth data are typically real, i.e., the distances indicate the real distances from the user's eyes to the real world object or feature represented in the spatial representation. Where the spatial representation includes a representation of a virtual object or feature of at least a portion of the user's field of view, the distances indicated by the depth data are typically virtual when viewed by the user, i.e., the distances indicate virtual distances from the user's eyes to the virtual object or feature represented in the spatial representation.
The determined gaze convergence distance and the obtained depth data may be used to improve the determination of the saliency data, so that the saliency data provides refined information from which a fine gaze point may be determined. For example, one or more regions in the spatial representation may be identified that correspond to portions of objects or features in the field of view located at a distance from the user's eyes that coincides with the determined gaze convergence distance. The identified one or more regions may be used to refine the saliency data by adding information indicating which regions of the spatial representation are more likely to correspond to the point of interest at which the user is gazing. Furthermore, the identified one or more regions of the spatial representation may be used as a form of filter before the saliency data of the spatial representation is determined. In this way, saliency data is determined only for those regions of the spatial representation that correspond to portions of objects or features in the field of view located at a distance from the user's eyes that coincides with the determined gaze convergence distance.
In particular, determining 140 the saliency data of the spatial representation may include identifying 142 a first depth region of the spatial representation, the first depth region corresponding to depth data obtained within a predetermined range including the determined line-of-sight convergence distance. The range may be set wider or narrower depending on, for example, the accuracy of the determined line-of-sight convergence distance, the accuracy of the obtained depth data, and other factors. Saliency data for a first depth region of the spatial representation is then determined 144.
The identified first depth region of the spatial representation corresponds to objects or features in at least a portion of the user's field of view that are within a predetermined range that includes the determined line-of-sight convergence distance. A user is generally more likely to be looking at one of the objects or features within a predetermined range than an object or feature corresponding to an area of the spatial representation having depth data outside the predetermined range. Thus, the identification of the first depth region provides further information that may be used to identify the point of interest at which the user is looking.
In addition to identifying the first depth region, determining the saliency data of the spatial representation preferably further comprises identifying a second depth region of the spatial representation, the second depth region corresponding to depth data obtained outside the predetermined range including the gaze convergence distance. In contrast to the first depth region, no saliency data is determined for the second depth region of the spatial representation. Instead, after identifying the second depth region, the method explicitly suppresses the determination of saliency data for the second depth region.
The identified second depth region of the spatial representation corresponds to objects or features in at least a portion of the user's field of view that are outside of a predetermined range that includes the determined line-of-sight convergence distance. It is generally less likely that a user is looking at one of the objects or features outside of the predetermined range than the objects or features corresponding to regions of the spatial representation having depth data within the predetermined range. Thus, it is beneficial to suppress the determination of the saliency data for the second depth region to avoid processing that may not be necessary or may even provide misleading results, as it is less likely that a user looks at objects and/or features corresponding to regions of the spatial representation having depth data outside a predetermined range.
In general, since the point of interest at which the user is looking will normally change over time, the method 100 is repeatedly performed to determine new fine gaze points over time. Thus, the method 100 generally further comprises determining a new gaze convergence distance for the user, determining new saliency data for the spatial representation based on the new gaze convergence distance, and determining a new fine gaze point of the user based on the new saliency data. In this way, new fine gaze points are determined dynamically based on gaze convergence distances determined over time. Several alternatives are envisaged, for example using only the most recently determined gaze convergence distance, or the mean of the gaze convergence distances determined within a predetermined period of time. Furthermore, if the field of view of the user also changes over time, a new spatial representation is obtained and new depth data for at least a portion of the new spatial representation is obtained.
The additional top-down information provided by the eye tracker may be one or more determined gaze points of the user. The method 100 may further include determining 132 a plurality of gaze points of the user, and identifying 134 a cropped region of the spatial representation based on the determined plurality of gaze points. Typically, the plurality of gaze points are determined over a period of time. The individual determined gaze points of the plurality may differ from each other. This may be because the user looks at different points during the time period, or because of errors in the determined gaze points, i.e. the user may actually be looking at the same point during the time period while the determined gaze points still differ from each other. The cropped region preferably comprises all of the determined plurality of gaze points. The size of the cropped region may depend on, for example, the accuracy of the determined gaze points, such that higher accuracy results in a smaller cropped region.
The user is generally more likely to be looking at points corresponding to the cropped region than at points corresponding to regions outside it. It is therefore advantageous to determine saliency data for the cropped region and to determine a fine gaze point based on the determined saliency data. Further, since the user is more likely to be looking at points corresponding to the cropped region, determination of saliency data for regions of the spatial representation outside the identified cropped region can be suppressed. The amount of processing required is reduced by not determining saliency data for regions of the spatial representation outside the identified cropped region. In general, the cropped region can be made significantly smaller than the entire spatial representation while keeping the probability that the user is looking at a point within the cropped region high. Suppressing the determination of saliency data for regions outside the cropped region can therefore significantly reduce the processing.
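A minimal sketch of how a cropped region could be identified from a handful of recent gaze points, assuming gaze points expressed in image pixel coordinates; the margin value (standing in for the accuracy-dependent size mentioned above) and the function name are illustrative assumptions.

```python
import numpy as np

def identify_crop_region(gaze_points_px, image_shape, margin_px=40):
    """Axis-aligned crop that contains all recently determined gaze points,
    padded by a margin chosen from the gaze-estimation accuracy (higher
    accuracy -> smaller margin). Saliency and depth data then only need to
    be computed inside this box."""
    pts = np.asarray(gaze_points_px, dtype=float)
    h, w = image_shape[:2]
    x0 = max(int(pts[:, 0].min()) - margin_px, 0)
    y0 = max(int(pts[:, 1].min()) - margin_px, 0)
    x1 = min(int(pts[:, 0].max()) + margin_px, w)
    y1 = min(int(pts[:, 1].max()) + margin_px, h)
    return x0, y0, x1, y1

# Example: three noisy gaze samples in a 720 x 1280 image
print(identify_crop_region([(600, 300), (615, 310), (590, 305)], (720, 1280)))
# -> (550, 260, 655, 350)
```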
In addition to, or instead of, using the identified cropped region when determining the saliency data, the cropped region may be used when obtaining the depth data. For example, since the user is more likely to be looking at points corresponding to the cropped region than at points outside it, depth data can be obtained for the identified cropped region only, without obtaining depth data for regions outside it. Saliency data within the cropped region may then be determined based only on the depth data obtained for the identified cropped region. The amount of processing required for obtaining depth data and determining saliency data can thus be reduced.
The method 100 may further include determining at least a second line-of-sight convergence distance of the user. Then, a first depth region of the spatial representation is identified, the first depth region corresponding to depth data within a range determined based on the determined line-of-sight convergence distance and the determined at least second line-of-sight convergence distance. Saliency data for a first depth region of the spatial representation is then determined.
The identified first depth region of the spatial representation corresponds to objects or features in at least a portion of the user's field of view that are within a range determined based on the determined line-of-sight convergence distance and the determined at least second line-of-sight convergence distance. A user is typically more likely to be looking at one of these objects or features within the aforementioned range than an object or feature corresponding to a region of the spatial representation having depth data outside the range. Thus, the identification of the first depth region provides further information that may be used to identify the point of interest at which the user is looking.
There are several alternatives for determining the range based on the determined line of sight convergence distance and the determined at least second line of sight convergence distance. In a first example, a maximum line of sight convergence distance and a minimum line of sight convergence distance of the determined line of sight convergence distance and the determined at least second line of sight convergence distance may be determined. The maximum line-of-sight convergence distance and the minimum line-of-sight convergence distance may then be used to identify a first depth region of the spatial representation corresponding to the obtained depth data within a range including the determined maximum line-of-sight convergence distance and minimum line-of-sight convergence distance. The range may be set wider or narrower depending on, for example, the accuracy of the determined line-of-sight convergence distance, the accuracy of the obtained depth data, and other factors. As an example, the range may be set from the determined minimum line-of-sight convergence distance to the maximum line-of-sight convergence distance. Saliency data for a first depth region of the spatial representation is then determined.
In a first example, the identified first depth region of the spatial representation corresponds to objects or features in at least a portion of the user's field of view that are within a range that includes the determined maximum line-of-sight convergence distance and minimum line-of-sight convergence distance. A user is typically more likely to be looking at one of these objects or features within the aforementioned range than an object or feature corresponding to a region of the spatial representation having depth data outside the range. Thus, the identification of the first depth region according to the first example provides further information that may be used to identify the point of interest at which the user gazes.
In a second example, a mean gaze convergence distance of the determined gaze convergence distance of the user and the determined at least second gaze convergence distance may be determined. The mean gaze convergence distance may then be used to identify a first depth region of the spatial representation corresponding to the obtained depth data within a range including the determined mean gaze convergence distance. The range may be set wider or narrower depending on, for example, the accuracy of the determined line-of-sight convergence distance, the accuracy of the obtained depth data, and other factors. Saliency data for a first depth region of the spatial representation may then be determined.
In a second example, the identified first depth region of the spatial representation corresponds to objects or features in at least a portion of the user's field of view that are within a range that includes the determined mean line-of-sight convergence distance. A user is typically more likely to be looking at one of these objects or features within the aforementioned range than an object or feature corresponding to a region of the spatial representation having depth data outside the range. Thus, the identification of the first depth region according to the second example provides further information that may be used to identify the point of interest at which the user gazes.
From the determined saliency data, the user's fine gaze point may be determined 150 as the point corresponding to the highest saliency. The determined fine gaze point will thus be the point that is in some respect most likely to attract visual attention. When combined with determining 144 saliency data only for the identified first depth region of the spatial representation, i.e. the region corresponding to depth data obtained within a predetermined range including the determined gaze convergence distance, the determined fine gaze point will be the point within the first depth region that is most likely to attract visual attention in some respect. This may be further combined with determining 132 a plurality of gaze points, identifying 134 a cropped region comprising the determined plurality of gaze points, and obtaining 130 depth data for the cropped region only. Further, saliency data may be determined 146 only for the identified cropped region, optionally combined with the identified first depth region such that saliency data is generated only for the part of the first depth region that lies within the cropped region. The determined fine gaze point will then be the point, within the first depth region and within the cropped region, that is in some respect most likely to attract visual attention.
Determining saliency data for the spatial representation may include determining first saliency data for the spatial representation based on visual saliency, determining second saliency data for the spatial representation based on the determined gaze convergence distance and the obtained depth data, and determining the saliency data based on the first saliency data and the second saliency data. Visual saliency is the ability of an item, or of an item in an image, to draw visual attention (bottom-up, i.e. the value is not known in advance but is computed from the image by the model). In more detail, visual saliency is the distinct subjective perceptual quality that makes some items in the world stand out from their surroundings and immediately grab our attention. Visual saliency may be based on color, contrast, shape, orientation, motion, or any other perceptual characteristic.
Once the saliency data for the different saliency features (such as visual saliency, and depth saliency calculated based on the determined gaze convergence distance and the obtained depth data) have been computed, they can be normalized and combined to form a primary saliency result. Depth saliency is related to the depth at which the user is looking (i.e. the value is known top-down). Depths that coincide with the determined convergence distance are considered more salient. When combining saliency features, each feature may be weighted equally, or weighted differently depending on which features are estimated to have the greatest impact on visual attention and/or which features have the highest maximum saliency value compared to the average or expected value. The combination of saliency features may be resolved by a winner-take-all mechanism. Optionally, the primary saliency result may be converted into a primary saliency map: a topographical representation of overall saliency. This is a useful step for a human observer, but is not necessary when the saliency result is used as input to a computer program. In the primary saliency result, a single spatial location should stand out as the most salient.
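As an illustration of the winner-take-all step, a single argmax over the primary saliency result suffices; the (x, y) return convention is an assumption of this sketch.

```python
import numpy as np

def winner_take_all(primary_saliency):
    """The fine gaze point is the single location with the highest value in
    the primary saliency result."""
    row, col = np.unravel_index(np.argmax(primary_saliency), primary_saliency.shape)
    return int(col), int(row)  # (x, y) pixel coordinates

# Example: the centre cell of a small 3 x 3 saliency result wins
print(winner_take_all(np.array([[0.1, 0.2, 0.1],
                                [0.3, 0.9, 0.2],
                                [0.1, 0.2, 0.1]])))  # (1, 1)
```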
Fig. 2 includes images showing the results of steps of an embodiment of a method according to the present disclosure. A spatial representation of at least a portion of the user's field of view, in the form of image 210, is an input to the method for determining a fine gaze point. A plurality of gaze points are determined in image 210 and a cropped region is identified that includes the plurality of determined gaze points, as illustrated by image 215. Further, a stereoscopic image 220 of at least a portion of the user's field of view is obtained, the corresponding cropped region is identified as shown in image 225, and depth data is obtained for the cropped region of the stereoscopic image 220, as shown in image 230. A gaze convergence distance of the user (3.5 m in this example) is then received, and the first depth region is determined as the region of the cropped region corresponding to depth data within a range around the gaze convergence distance. In this example, the range is 3 m < x < 4 m, and the resulting first depth region is shown in image 235. The visual saliency of the cropped region shown in image 240 is determined, producing saliency data shown in the form of a saliency map 245 of the cropped region. The saliency map 245 is combined with the first depth region shown in image 235 into a saliency map 250 of the first depth region within the cropped region. The fine gaze point is the point identified as having the highest saliency in the first depth region within the cropped region. This point is shown as a black dot in image 255.
Fig. 3 is a flow chart illustrating steps of a method according to the present disclosure. In general, the flowchart illustrates steps related to identifying cropped regions over time based on newly determined gaze points (e.g., in relation to an embodiment of the method as illustrated in Fig. 1). The identified cropped region is a cropped region that has previously been identified based on a plurality of previously determined gaze points. A new gaze point is then determined 310. If the determined new gaze point is within the identified cropped region 320, the identified cropped region is left unchanged and continues to be used, and a further new gaze point is determined 310. Another way of viewing this is that the new cropped region is identified as being the same as the identified cropped region. If the determined new gaze point is not within the identified cropped region (i.e., it is outside the identified cropped region) 320, a new cropped region is determined 330 that includes the determined new gaze point. In this case, the new cropped region will differ from the identified cropped region.
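A compact sketch of the Fig. 3 logic, assuming rectangular crop regions in pixel coordinates; the make_new_crop callback is a hypothetical placeholder for however a new cropped region would be identified.

```python
def update_crop_region(crop, new_gaze_point, make_new_crop):
    """Keep the identified crop region (and any saliency data already
    computed for it) while new gaze points fall inside it; otherwise
    identify a new crop region that contains the new gaze point."""
    x0, y0, x1, y1 = crop
    gx, gy = new_gaze_point
    if x0 <= gx <= x1 and y0 <= gy <= y1:
        return crop                       # reuse crop and its saliency data
    return make_new_crop(new_gaze_point)  # e.g. a box around the new point
```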
Fig. 4 is a flow chart illustrating further steps of a method according to the present disclosure. In general, the flowchart illustrates steps related to determining a fine gaze point over time based on newly determined gaze points (e.g., in relation to an embodiment of the method as illustrated in Fig. 1). Successive gaze points of the user are determined 410 in successive time intervals. Further, for each time interval, it is determined 420 whether the user is fixating or saccading. If the user is fixating 420, a fine gaze point is determined 430. If the user is saccading 420, the determination of a fine gaze point is suppressed. When the user is fixating, the user is likely looking at a particular point at that time, so a fine gaze point can be determined correctly. When the user is saccading, on the other hand, the user is unlikely to be looking at a particular point at that time, so a fine gaze point is unlikely to be determined correctly. With reference to Fig. 1, this may for example mean that the method 100 is only performed when it is determined that the user is fixating.
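The disclosure does not specify how fixations and saccades are distinguished; as one common possibility, a simple velocity-threshold (I-VT style) gate could be used. Gaze samples in degrees of visual angle, the 30 deg/s threshold, and the function name are assumptions of this sketch.

```python
import numpy as np

def is_fixating(gaze_points_deg, timestamps_s, velocity_threshold_deg_s=30.0):
    """Treat the interval as a fixation if the median angular gaze velocity
    stays below the threshold, and as a saccade otherwise. The fine gaze
    point would only be determined for intervals classified as fixations."""
    pts = np.asarray(gaze_points_deg, dtype=float)
    t = np.asarray(timestamps_s, dtype=float)
    speeds = np.linalg.norm(np.diff(pts, axis=0), axis=1) / np.diff(t)
    return bool(np.median(speeds) < velocity_threshold_deg_s)
```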
Fig. 5 is a flow chart illustrating yet further steps of a method according to the present disclosure. In general, the flowchart illustrates steps related to identifying cropped regions over time based on determined gaze points (e.g., in relation to an embodiment of the method as illustrated in Fig. 1). The identified cropped region is a cropped region that has previously been identified based on a plurality of previously determined gaze points. Successive gaze points of the user are determined 510 in successive time intervals. Further, for each time interval, it is determined 520 whether the user is in smooth pursuit. If the user is in smooth pursuit 520, a new cropped region is determined based on the smooth pursuit 530. Once smooth pursuit is detected, letting the cropped region follow the pursuit means that little additional processing is required to determine the successive cropped regions. For example, successive cropped regions may have the same shape and may simply be translated relative to each other in the same direction and at the same speed as the user's smooth pursuit. If the user is not in smooth pursuit 520, a new cropped region is determined comprising a plurality of gaze points, including the determined new gaze point.
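A sketch of the smooth-pursuit case, assuming rectangular crop regions and gaze points in pixel coordinates: the crop keeps its shape and is translated along the estimated pursuit velocity. The velocity estimate and the function name are illustrative assumptions.

```python
import numpy as np

def follow_pursuit(crop, gaze_points_px, timestamps_s, dt_next_s):
    """Translate the crop region along the mean gaze velocity so successive
    crop regions follow the smooth pursuit without being re-identified."""
    pts = np.asarray(gaze_points_px, dtype=float)
    t = np.asarray(timestamps_s, dtype=float)
    vx, vy = (pts[-1] - pts[0]) / (t[-1] - t[0])   # mean gaze velocity, px/s
    x0, y0, x1, y1 = crop
    dx, dy = vx * dt_next_s, vy * dt_next_s
    return x0 + dx, y0 + dy, x1 + dx, y1 + dy
```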
In some embodiments, the spatial representation is an image, such as a 2D image of the real world, a 3D image of the real world, a 2D image of the virtual environment, or a 3D image of the virtual environment. The data may come from a photo sensor, a virtual 3D scene, or from another type of image sensor or a spatial sensor.
Fig. 1 includes some steps shown in boxes with a solid border and some steps shown in boxes with a dashed border. The steps included in boxes with a solid border are the operations included in the broadest example embodiment. The steps included in boxes with a dashed border are further operations that may be included in, may be part of, or may be taken in addition to, the operations of the broadest example embodiment. The steps do not all need to be performed in the order shown, and not all of the operations need to be performed. Furthermore, at least some of the steps may be performed in parallel.
The method for determining a fine gaze point of a user, and the steps thereof as disclosed herein, e.g. with respect to Figs. 1-5, may be implemented in an eye tracking system 600, e.g. in the head mounted device of Fig. 6. The eye tracking system 600 comprises a processor 610 and a carrier 620 comprising computer-executable instructions 630, for example in the form of a computer program, which when executed by the processor 610 cause the eye tracking system 600 to perform the method. The carrier 620 may be, for example, an electronic signal, an optical signal, a radio signal, a transitory computer-readable storage medium, or a non-transitory computer-readable storage medium.
The person skilled in the art realizes that the present invention by no means is limited to the embodiments described above. On the contrary, many modifications and variations are possible within the scope of the appended claims.
Additionally, variations to the disclosed embodiments can be understood and effected by those skilled in the art in practicing the claimed invention, from a study of the drawings, the disclosure, and the appended claims. In the claims, the word "comprising" does not exclude other elements or steps, and the indefinite article "a" or "an" does not exclude a plurality. The terminology used herein is for the purpose of describing particular aspects of the disclosure only and is not intended to be limiting of the invention. The division of tasks among functional units referred to in this disclosure does not necessarily correspond to division into a plurality of physical units; rather, one physical component may have multiple functions, and one task may be performed in a distributed fashion by several physical components in concert. A computer program may be stored/distributed on a suitable non-transitory medium, such as an optical storage medium or a solid-state medium supplied together with or as part of other hardware, but may also be distributed in other forms, such as via the internet or other wired or wireless telecommunication systems. The mere fact that certain measures/features are recited in mutually different dependent claims does not indicate that a combination of these measures/features cannot be used to advantage. Method steps do not necessarily have to be performed in the order in which they appear in the claims or in the embodiments described herein, unless a certain order is explicitly described as being required. Any reference signs in the claims shall not be construed as limiting the scope.
Claims (17)
1. A method in an eye tracking system for determining a fine gaze point of a user, the method comprising:
determining a gaze convergence distance of the user;
obtaining a spatial representation of at least a portion of the user's field of view;
obtaining depth data for at least a portion of the spatial representation;
determining first saliency data for the spatial representation based on visual saliency;
determining second saliency data for the spatial representation based on the determined gaze convergence distance and the obtained depth data;
determining saliency data based on the first saliency data and the second saliency data; and
determining, based on the determined saliency data, a fine gaze point of the user as a point corresponding to the highest saliency in the spatial representation.
2. The method of claim 1, wherein determining the second saliency data for the spatial representation comprises:
identifying a first depth region of the spatial representation, the first depth region corresponding to obtained depth data within a predetermined range that includes the determined gaze convergence distance; and
determining saliency data for the first depth region of the spatial representation.
3. The method of claim 1, wherein determining the second saliency data for the spatial representation comprises:
identifying a second depth region of the spatial representation, the second depth region corresponding to obtained depth data outside a predetermined range that includes the gaze convergence distance; and
suppressing determination of saliency data for the second depth region of the spatial representation.
4. The method of claim 1, further comprising:
determining a new gaze convergence distance of the user;
determining new saliency data for the spatial representation based on the new gaze convergence distance; and
determining a new fine gaze point of the user based on the new saliency data.
5. The method of claim 1, further comprising:
determining a plurality of gaze points of the user; and
identifying a cropped region of the spatial representation based on the determined plurality of gaze points of the user.
6. The method of claim 5, wherein determining saliency data comprises:
determining saliency data for the identified cropped region of the spatial representation.
7. The method of claim 5, further comprising:
suppressing determination of saliency data for regions of the spatial representation that are outside the identified cropped region of the spatial representation.
8. The method of claim 5, wherein obtaining depth data comprises:
obtaining depth data for the identified cropped region of the spatial representation.
9. The method of claim 2, further comprising:
determining at least a second gaze convergence distance of the user,
wherein the first depth region of the spatial representation is identified as corresponding to obtained depth data within a range that is based on the determined gaze convergence distance and the determined at least second gaze convergence distance of the user.
10. The method of claim 5, further comprising:
determining a new gaze point of the user;
identifying, in the case that the determined new gaze point is within the identified cropped region, a new cropped region that is the same as the identified cropped region; or
identifying, in the case that the determined new gaze point is outside the identified cropped region, a new cropped region that includes the determined new gaze point and is different from the identified cropped region.
11. The method of claim 5, further comprising determining successive gaze points of the user in successive time intervals, respectively, and, for each time interval:
determining whether the user is fixating or saccading;
determining a fine gaze point in the case that the user is fixating; and
suppressing determination of a fine gaze point in the case that the user is saccading.
12. The method of claim 5, further comprising determining successive gaze points of the user in successive time intervals, respectively, and, for each time interval:
determining whether the user is performing smooth pursuit; and
identifying, in the case that the user is performing smooth pursuit, successive cropped regions respectively comprising the successive gaze points, such that the identified successive cropped regions follow the smooth pursuit.
13. The method of claim 1, wherein the spatial representation is an image.
14. An eye tracking system for determining a fine gaze point of a user, the eye tracking system comprising a processor and a memory containing instructions executable by the processor, the eye tracking system being operable, by execution of the instructions, to:
determine a gaze convergence distance of the user;
obtain a spatial representation of at least a portion of the user's field of view;
obtain depth data for at least a portion of the spatial representation;
determine first saliency data for the spatial representation based on visual saliency;
determine second saliency data for the spatial representation based on the determined gaze convergence distance and the obtained depth data;
determine saliency data based on the first saliency data and the second saliency data; and
determine, based on the determined saliency data, a fine gaze point of the user as a point corresponding to the highest saliency in the spatial representation.
15. A head-mounted device for determining a fine gaze point of a user, the head-mounted device comprising a processor and a memory containing instructions executable by the processor, the head-mounted device being operable, by execution of the instructions, to:
determine a gaze convergence distance of the user;
obtain a spatial representation of at least a portion of the user's field of view;
obtain depth data for at least a portion of the spatial representation;
determine first saliency data for the spatial representation based on visual saliency;
determine second saliency data for the spatial representation based on the determined gaze convergence distance and the obtained depth data;
determine saliency data based on the first saliency data and the second saliency data; and
determine, based on the determined saliency data, a fine gaze point of the user as a point corresponding to the highest saliency in the spatial representation.
16. The head-mounted device of claim 15, further comprising one of a transparent display and a non-transparent display.
17. A computer-readable storage medium comprising instructions that, when executed by at least one processor, cause the at least one processor to:
determine a gaze convergence distance of a user;
obtain a spatial representation of the user's field of view;
obtain depth data for at least a portion of the spatial representation;
determine first saliency data for the spatial representation based on visual saliency;
determine second saliency data for the spatial representation based on the determined gaze convergence distance and the obtained depth data;
determine saliency data based on the first saliency data and the second saliency data; and
determine, based on the determined saliency data, a fine gaze point of the user as a point corresponding to the highest saliency in the spatial representation.
Applications Claiming Priority (2)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
SE1950758A SE543229C2 (en) | 2019-06-19 | 2019-06-19 | Method and system for determining a refined gaze point of a user |
SE1950758-1 | 2019-06-19 |
Publications (2)
Publication Number | Publication Date |
---|---|
CN112114659A (en) | 2020-12-22 |
CN112114659B (en) | 2024-08-06 |
Family
ID=72916461
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202010500200.9A Active CN112114659B (en) | 2019-06-19 | 2020-06-04 | Method and system for determining a fine gaze point of a user |
Country Status (2)
Country | Link |
---|---|
CN (1) | CN112114659B (en) |
SE (1) | SE543229C2 (en) |
Families Citing this family (3)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN112967299B (en) * | 2021-05-18 | 2021-08-31 | 北京每日优鲜电子商务有限公司 | Image cropping method and device, electronic equipment and computer readable medium |
CN115525139A (en) * | 2021-06-24 | 2022-12-27 | 北京有竹居网络技术有限公司 | Method and device for acquiring gazing target in head-mounted display equipment |
SE546818C2 (en) * | 2023-07-05 | 2025-02-25 | Tobii Ab | A method and a system for determining gaze convergence distance for a user of an eye tracking system |
Citations (2)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN104937519A (en) * | 2013-01-13 | 2015-09-23 | 高通股份有限公司 | Apparatus and method for controlling an augmented reality device |
CN109491508A (en) * | 2018-11-27 | 2019-03-19 | 北京七鑫易维信息技术有限公司 | The method and apparatus that object is watched in a kind of determination attentively |
Family Cites Families (2)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN103761519B (en) * | 2013-12-20 | 2017-05-17 | 哈尔滨工业大学深圳研究生院 | Non-contact sight-line tracking method based on self-adaptive calibration |
US20160224106A1 (en) * | 2015-02-03 | 2016-08-04 | Kobo Incorporated | Method and system for transitioning to private e-reading mode |
- 2019-06-19: SE application SE1950758A (SE543229C2), status unknown
- 2020-06-04: CN application CN202010500200.9A (CN112114659B), status active
Also Published As
Publication number | Publication date |
---|---|
SE1950758A1 (en) | 2020-10-27 |
CN112114659A (en) | 2020-12-22 |
SE543229C2 (en) | 2020-10-27 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
US20210041945A1 (en) | Machine learning based gaze estimation with confidence | |
EP3195595B1 (en) | Technologies for adjusting a perspective of a captured image for display | |
US11244496B2 (en) | Information processing device and information processing method | |
CN111986328B (en) | Information processing device and method, and non-volatile computer-readable storage medium | |
US9269002B2 (en) | Image processing apparatus, display control method and program | |
CN112114659B (en) | Method and system for determining a fine gaze point of a user | |
KR20180136445A (en) | Information processing apparatus, information processing method, and program | |
US9355436B2 (en) | Method, system and computer program product for enhancing a depth map | |
WO2017169273A1 (en) | Information processing device, information processing method, and program | |
US11004273B2 (en) | Information processing device and information processing method | |
JP6221292B2 (en) | Concentration determination program, concentration determination device, and concentration determination method | |
CN108885497B (en) | Information processing apparatus, information processing method, and computer readable medium | |
EP3038061A1 (en) | Apparatus and method to display augmented reality data | |
US11726320B2 (en) | Information processing apparatus, information processing method, and program | |
WO2015198592A1 (en) | Information processing device, information processing method, and information processing program | |
US20200211275A1 (en) | Information processing device, information processing method, and recording medium | |
CN108885802B (en) | Information processing apparatus, information processing method, and storage medium | |
KR20180000417A (en) | See-through type head mounted display apparatus and method of controlling display depth thereof | |
US11361511B2 (en) | Method, mixed reality system and recording medium for detecting real-world light source in mixed reality | |
US20250238078A1 (en) | Method and system of gaze-mapping in real-world environment | |
JP2001175860A (en) | Device, method, and recording medium for three- dimensional body recognition including feedback process | |
CN119052451A (en) | Image partition blurring method, device, equipment and medium | |
KR20130015617A (en) | Pose estimation method and apparatus, and augmented reality system |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant | ||