CN119213464A - Gesture detection method and system with hand prediction - Google Patents
- Publication number
- CN119213464A (application number CN202280095936.XA)
- Authority
- CN
- China
- Prior art keywords
- keypoints
- hand
- predicted
- image
- previous
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Classifications
-
- G—PHYSICS
- G06—COMPUTING OR CALCULATING; COUNTING
- G06T—IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
- G06T7/00—Image analysis
- G06T7/70—Determining position or orientation of objects or cameras
- G06T7/73—Determining position or orientation of objects or cameras using feature-based methods
-
- G—PHYSICS
- G06—COMPUTING OR CALCULATING; COUNTING
- G06T—IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
- G06T2207/00—Indexing scheme for image analysis or image enhancement
- G06T2207/20—Special algorithmic details
- G06T2207/20084—Artificial neural networks [ANN]
-
- G—PHYSICS
- G06—COMPUTING OR CALCULATING; COUNTING
- G06T—IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
- G06T2207/00—Indexing scheme for image analysis or image enhancement
- G06T2207/30—Subject of image; Context of image processing
- G06T2207/30196—Human being; Person
Landscapes
- Engineering & Computer Science (AREA)
- Computer Vision & Pattern Recognition (AREA)
- Physics & Mathematics (AREA)
- General Physics & Mathematics (AREA)
- Theoretical Computer Science (AREA)
- Image Analysis (AREA)
Abstract
The invention relates to extended reality systems and methods. In an exemplary embodiment, a two-dimensional hand image is captured. Two-dimensional keypoints are identified using the two-dimensional hand image. The two-dimensional keypoints are mapped to three-dimensional keypoints. Hand prediction is performed using the three-dimensional keypoints. Other embodiments also exist.
Description
Background
The invention relates to extended reality systems and methods.
Over the past decade, extended reality (XR) devices, including augmented reality (AR) devices and virtual reality (VR) devices, have become increasingly popular. Important design considerations and challenges for XR devices include performance, cost, and power consumption. Due to various limitations, existing XR devices have been inadequate, for the reasons set forth below.
New and improved XR systems and methods therefor are desired.
Disclosure of Invention
The invention relates to extended reality systems and methods. In an exemplary embodiment, a two-dimensional hand image is captured. Two-dimensional keypoints are identified using the two-dimensional hand image. The two-dimensional keypoints are mapped to three-dimensional keypoints. Hand prediction is performed using the three-dimensional keypoints. Other embodiments also exist.
A system of one or more computers may be configured to perform particular operations or actions by installing software, firmware, hardware, or a combination thereof on the system that, when executed, cause the system to perform the actions. One or more computer programs may be configured to perform particular operations or acts by including instructions that, when executed by data processing apparatus, cause the apparatus to perform the acts. One general aspect includes a hand prediction method that includes capturing a plurality of images including at least a first hand, the plurality of images being in a two-dimensional (2D) space, the plurality of images including a current image and a previous image. The method also includes identifying a plurality of previous 2D keypoints using the previous image. The method also includes identifying a plurality of current 2D keypoints using the current image. The method also includes mapping a plurality of previous 2D keypoints to a plurality of previous three-dimensional (3D) keypoints. The method also includes mapping the plurality of current 2D keypoints to the plurality of current 3D keypoints. The method also includes generating a plurality of 3D predicted keypoints in the 3D space using the plurality of previous 3D keypoints and the plurality of current 3D keypoints. The method further includes mapping the plurality of 3D predicted keypoints to a plurality of predicted 2D keypoints. The method also includes identifying erroneous hand detection using the plurality of predicted 2D keypoints. Other embodiments of this aspect include corresponding computer systems, apparatus, and computer programs recorded on one or more computer storage devices, each configured to perform the actions of the methods.
Implementations may include one or more of the following features. The method may include projecting a plurality of previous 2D keypoints into a 3D space. The plurality of images also includes at least a second hand, and the plurality of 3D predicted keypoints are associated with the first hand and the second hand. The method may include defining a bounding box in the current image surrounding the first hand and tracking the bounding box using the plurality of predicted 2D keypoints. A bounding box is defined with an upper left corner position and a lower right corner position, the bounding box including at least ten percent of an edge area surrounding the first hand. The plurality of 3D predicted keypoints are assigned confidence values, and the method may include detecting a non-hand object using the confidence values. The method may include tracking the first hand using the plurality of predicted 2D keypoints. The method may include initiating a hand tracking process upon detection of a first hand. The method may include calculating a change in coordinates between a plurality of previous 3D keypoints and a plurality of current 3D keypoints, each 3D keypoint may include three coordinates. The method may include using a plurality of 3D vectors of a plurality of previous 3D keypoints and a plurality of current 3D keypoints. Implementations of the described technology may include hardware, methods or processes, or computer software on a computer-accessible medium.
One general aspect relates to an extended reality device that includes a housing having a front face and a back face. The apparatus also includes a first camera configured at the front face, the first camera configured to capture a plurality of two-dimensional (2D) images at a predefined frame rate, the plurality of 2D images including a current image and a previous image. The device also includes a display disposed on the back of the housing. The apparatus also includes a memory coupled to the first camera and configured to store the plurality of 2D images. The apparatus also includes a processor coupled to the memory. In the apparatus, the processor is configured to identify a plurality of 2D keypoints associated with the hand using at least the current image and the previous image, map the plurality of 2D keypoints to a plurality of three-dimensional (3D) keypoints, and provide hand prediction using at least the plurality of 3D keypoints. Other embodiments of this aspect include corresponding computer systems, apparatus, and computer programs recorded on one or more computer storage devices, each configured to perform the actions of the methods.
Implementations may include one or more of the following features. The processor may include a neural processing unit configured to detect the hand using the first image captured by the first camera. The device may include a second camera, the first camera being located on the left side of the housing and the second camera being located on the right side of the housing. The processor is also configured to track the hand. The processor is further configured to generate a plurality of predicted 3D keypoints, map the plurality of predicted 3D keypoints to a plurality of predicted 2D keypoints. Implementations of the described technology may include hardware, methods or processes, or computer software on a computer-accessible medium.
One general aspect relates to a hand tracking method that includes capturing a first image. The method further includes detecting at least a first hand in the first image. The method also includes capturing a plurality of images including at least the first hand, the plurality of images being in a two-dimensional (2D) space, the plurality of images including a current image and a previous image. The method also includes identifying a plurality of 2D keypoints using the plurality of images. The method also includes mapping the plurality of 2D keypoints to a plurality of three-dimensional (3D) keypoints. The method also includes generating a plurality of 3D predicted keypoints in the 3D space using the plurality of 3D keypoints. The method further includes mapping the plurality of 3D predicted keypoints to a plurality of predicted 2D keypoints. The method also includes tracking the first hand using the plurality of 3D predicted keypoints. Other embodiments of this aspect include corresponding computer systems, apparatus, and computer programs recorded on one or more computer storage devices, each configured to perform the actions of the methods.
Implementations may include one or more of the following features. The method may include calculating confidence values for a plurality of 3D predicted keypoints, and identifying erroneous hand detection using at least the confidence values. The method may include identifying a change between the first image and the second image. The method may include performing a deep learning process for hand detection using the first image. Implementations of the described technology may include hardware, methods or processes, or computer software on a computer-accessible medium.
It will be appreciated that embodiments of the present invention provide a number of advantages over conventional techniques. In particular, hand shape prediction techniques may enable more accurate and efficient hand tracking and bounding boxes. In addition, hand shape prediction techniques according to embodiments of the present invention may be performed in conjunction with gesture recognition techniques.
Embodiments of the present invention may be implemented in conjunction with existing systems and methods. For example, hand calibration techniques according to the present invention may be used with a variety of XR systems, including XR devices equipped with a ranging assembly. Furthermore, various techniques according to the present invention may be applied to existing XR systems through software or firmware updates. There are other benefits.
The present invention achieves these and other benefits in the context of known technology. A further understanding of the nature and advantages of the present invention may be realized by reference to the remaining portions of the specification and the attached drawings.
Drawings
Fig. 1A is a simplified diagram illustrating an extended reality (XR) device 115n, according to an embodiment of the invention.
Fig. 1B is a simplified block diagram illustrating components of an extended reality device 115n according to an embodiment of the invention.
Fig. 2 is a simplified diagram illustrating the field of view of a camera on an extended reality device 210 according to an embodiment of the invention.
Fig. 3A is a simplified diagram illustrating keypoints defined on the right hand according to an embodiment of the invention.
FIG. 3B is a simplified diagram illustrating an exemplary gesture according to an embodiment of the present invention.
Fig. 4 is a simplified block diagram illustrating functional blocks of an extended reality device according to an embodiment of the invention.
FIG. 5 is a simplified block diagram illustrating functional blocks in a gesture detection algorithm according to an embodiment of the present invention.
Fig. 6 is a simplified flowchart illustrating a procedure of a hand prediction method according to an embodiment of the present invention.
Fig. 7 is a simplified diagram illustrating hand prediction using 3D hand keypoints according to an embodiment of the invention.
Fig. 8 is a simplified diagram illustrating an exemplary state machine for hand tracking according to an embodiment of the present invention.
Detailed Description
The invention relates to extended reality systems and methods. In an exemplary embodiment, a two-dimensional hand image is captured. Two-dimensional keypoints are identified using the two-dimensional hand image. The two-dimensional keypoints are mapped to three-dimensional keypoints. Hand prediction is performed using the three-dimensional keypoints. Other embodiments also exist.
With the advent of virtual reality and augmented reality applications, gesture-based control schemes are becoming more popular. In recent years, commercial depth-camera-based 3D hand tracking technology on AR glasses, which makes 3D measurements of the hand directly, has become prevalent. Conventional research has generally focused on RGB-camera-based hand tracking algorithms, and research efforts on actual hand tracking systems are very limited compared with work on algorithms.
In particular, the ability to accurately and efficiently reconstruct human hand motion from images is expected to enable exciting new applications in immersive virtual and augmented reality, robotic control, and sign language recognition. Great progress has been made in recent years, particularly with the advent of deep learning techniques. However, it remains a challenging task due to unconstrained global and local pose changes, frequent occlusion, local self-similarity, and a high degree of articulation. In various hand detection processes according to the present invention, the output includes a bounding box and a confidence value. During hand tracking, hand detection may be unstable, thereby degrading overall performance and accuracy. For example, existing hand detection methods typically employ convolutional-network-based approaches that use small models to interpret the wide variety of hand data encountered in the real world, and the end result is often unsatisfactory. Note that a hand prediction method is used in addition to or instead of hand detection to reduce the missing bounding box problem and the false positive bounding box problem. The "missing bounding box" problem refers to failing to detect a true hand image, and the "false positive bounding box" problem refers to erroneously detecting a non-hand object as a hand. During hand tracking, the limiting threshold used to define the bounding box may lead to different results: too many restrictions may result in missing bounding boxes, while too few restrictions may result in erroneous bounding boxes. To properly select the threshold, a machine learning algorithm may be configured into the system. After the present method and device are applied, missing bounding boxes and false positive bounding boxes in real-time 3D hand tracking results can be reduced to nearly zero.
According to embodiments, the hand prediction mechanism uses the 3D hand keypoints of the previous two frames (e.g., images captured by a camera at the two most recent timestamps) and outputs a 2D bounding box surrounding the hand along with corresponding confidence values. For example, the hand prediction is used as an input for bounding box tracking. In some implementations, a state machine is used to perform bounding box tracking and determine whether hand prediction should be used.
The following description is presented to enable one of ordinary skill in the art to make and use the invention and is provided in the context of a particular application. Various modifications and various uses in different applications will be apparent to those skilled in the art, and the general principles defined herein may be applied to various embodiments. Thus, the present invention is not intended to be limited to the embodiments shown, but is to be accorded the widest scope consistent with the principles and novel features disclosed herein.
In the following detailed description, numerous specific details are set forth in order to provide a more thorough understanding of the invention. It will be apparent, however, to one skilled in the art that the invention may be practiced without limitation to these specific details. In other instances, well-known structures and devices are shown in block diagram form, rather than in detail, in order to avoid obscuring the present invention.
The reader is noted that for all documents and materials filed concurrently with this specification and open to the public concurrently with this specification, all contents of such documents and materials are incorporated herein by reference. All the features disclosed in this specification (including any accompanying claims, abstract and drawings) may be replaced by alternative features serving the same, equivalent or similar purpose, unless expressly stated otherwise. Thus, unless expressly stated otherwise, each feature disclosed is one example only of a generic series of equivalent or similar features.
Furthermore, any element of a claim that does not explicitly state a "means for" performing a specified function, or a "step for" performing a specified function, should not be construed as a means or step clause under 35 U.S.C. § 112, paragraph 6. In particular, the use of "step of" or "act of" in the claims herein is not intended to invoke the provisions of 35 U.S.C. § 112, paragraph 6.
Note that left, right, front, back, top, bottom, forward, backward, clockwise, counterclockwise, etc. labels, if used, are for convenience only and do not represent any particular fixed orientation. Rather, they are used to reflect the relative position and/or orientation between portions of the object.
Fig. 1A is a simplified diagram (top view) showing an extended reality device 115n according to an embodiment of the invention. The diagram is merely an example, which should not unduly limit the scope of the claims. Those of ordinary skill in the art will recognize many variations, alternatives, and modifications. It is to be appreciated that the term "extended reality" (ER) is broadly defined and includes virtual reality (VR), augmented reality (AR), and/or other similar technologies. For example, the ER device 115 shown may be configured as a VR device, an AR device, or another type of device. According to particular embodiments, ER device 115 may include a small housing for AR applications or a relatively large housing for VR applications. Cameras 180A and 180B are disposed on the front of device 115. For example, cameras 180A and 180B are mounted on the left and right sides of ER device 115, respectively. In various applications, additional cameras may be configured below cameras 180A and 180B to provide additional fields of view and improve range estimation accuracy. For example, cameras 180A and 180B each include an ultra-wide-angle or fisheye lens that provides a large field of view, and the two cameras share a common field of view. Because of the arrangement of the two cameras, the parallax between them (a known factor) can be used to estimate object distance. The display 185 is disposed on the back of the ER device 115. For example, in AR applications, the display 185 may be a translucent display that superimposes information on the optical lens. In VR implementations, the display 185 may include a non-transparent display.
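By way of illustration, a minimal sketch of how the known camera parallax might be converted into a distance estimate is provided below. It assumes a rectified stereo pair with a known focal length (in pixels) and baseline; the function name and example values are illustrative assumptions rather than part of the invention.

```python
def estimate_depth_from_disparity(u_left: float, u_right: float,
                                  focal_px: float, baseline_m: float) -> float:
    """Estimate object distance from the horizontal pixel offset (disparity)
    between matching points in a rectified stereo pair, using the classical
    pinhole relation Z = f * B / d (f: focal length in pixels, B: baseline in
    meters, d: disparity in pixels)."""
    disparity = u_left - u_right
    if disparity <= 0:
        raise ValueError("disparity must be positive for a point in front of the cameras")
    return focal_px * baseline_m / disparity

# Example (illustrative numbers): a keypoint at column 640 in the left image and
# column 610 in the right image, with a 500-pixel focal length and a 6.5 cm
# baseline, lies roughly 1.08 m from the cameras.
print(estimate_depth_from_disparity(640.0, 610.0, focal_px=500.0, baseline_m=0.065))
```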
Fig. 1B is a simplified block diagram illustrating components of an extended reality device 115 according to an embodiment of the invention. The diagram is merely an example, which should not unduly limit the scope of the claims. Those of ordinary skill in the art will recognize many variations, alternatives, and modifications. In some embodiments, an XR headset (e.g., the AR headset 115n shown, etc.) may include, but is not limited to, at least one of a processor 150, a data storage 155, a speaker or earpiece 160, an eye-tracking sensor 165, a light source 170, an audio sensor or microphone 175, a front or front-facing camera 180, a display 185, and/or a communication interface 190, and/or the like.
In some cases, the processor 150 may be communicatively coupled (e.g., via a bus, via wired connectors, or via electrical pathways (e.g., traces and/or pads, etc.) of a Printed Circuit Board (PCB) or Integrated Circuit (IC) and/or the like) to each of one or more of a data memory 155, a speaker or headset 160, an eye tracking sensor 165, a light source 170, an audio sensor or microphone 175, a front facing camera 180, a display 185, and/or a communication interface 190 and/or the like. In various embodiments, data memory 155 may include Dynamic Random Access Memory (DRAM) and/or nonvolatile memory. For example, images captured by camera 180 may be temporarily stored in DRAM for processing, and executable instructions (e.g., hand calibration and gesture recognition algorithms) may be stored in non-volatile memory. In various embodiments, the data memory 155 may be implemented as part of the processor 150 in a system-on-chip (SoC) arrangement.
An eye tracking sensor 165, which may include, but is not limited to, at least one of one or more cameras, one or more motion sensors, or one or more tracking sensors, etc., tracks the gaze point of the user's eye and is combined with the computing processing of the processor 150 for comparison with images or video captured in front of the ER device 115. The audio sensor 175 may include, but is not limited to, a microphone, a sound sensor, a noise sensor, etc., and may be used to receive or capture voice signals, sound signals, and/or noise signals, etc.
The front-facing camera 180 includes respective lenses and sensors for capturing images or video of the area in front of the ER apparatus 115. For example, the front camera 180 includes cameras 180A and 180B respectively arranged on the left and right sides of the housing as shown in fig. 1B. In different implementations, the sensor of the front-facing camera may be a low-resolution monochrome sensor, which is not only energy efficient (no color filters or color processing are needed) but also has advantages in both device size and cost.
Fig. 2 is a simplified diagram illustrating the field of view of a camera on an extended reality device 210 according to an embodiment of the invention. The diagram is merely an example, which should not unduly limit the scope of the claims. Those of ordinary skill in the art will recognize many variations, alternatives, and modifications. The left camera 180A is mounted on the left side of the ER device housing 210, and the right camera 180B is mounted on the right side of the ER device housing 210. Each camera has an ultra-wide-angle or fisheye lens capable of capturing a wide field of view. For example, camera 180A has a field of view on the left spanning an angle θ_L, and camera 180B has a field of view on the right spanning an angle θ_R. Either camera can detect hands or other objects.
Hand detection is a prerequisite for gesture recognition. When at least one camera in the XR device captures an image of the hand, the device detects the hand. For example, hand 221 can only be detected in an image captured by camera 180A, while hand 223 can only be detected in an image captured by camera 180B. When hand 222 is in the region 220, it is within the common field of view of the two cameras 180A and 180B; the image from either camera can be used for hand detection, and depth calculations and other calculations can also be performed. In use, a hand may move into and out of the fields of view of the two cameras. Hand detection, hand tracking, and hand prediction processes may be performed. For example, when hand 221 moves left and out of the FOV of camera 180A, the hand prediction mechanism according to embodiments of the present invention may reduce "false positive" hand recognition. Furthermore, as the hand is tracked, the bounding box surrounding the hand (i.e., the area within the image captured by camera 180A or camera 180B) is updated using, inter alia, hand prediction techniques.
Reference is now back made to fig. 1B. In embodiments, the processor 150 is configured to perform hand detection and hand prediction processes. In various embodiments, processor 150 includes a Central Processing Unit (CPU), a Graphics Processing Unit (GPU), and a Neural Processing Unit (NPU). For example, the hand detection process may be performed by the NPU, while the hand prediction may be performed by the CPU and/or NPU.
In AR applications, the field of view of each front-facing camera 180 overlaps with the field of view of the eyes of the user 120 and captures images or video. The display screen and/or projector 185 may be used to display or project the generated image overlays (and/or to display a composite image or video incorporating the generated image overlays superimposed over the image or video of the actual area). The communication interface 190 provides wired or wireless communication with other devices and/or networks. For example, the communication interface 190 may be connected to a computer for tethered operation, wherein the computer provides the processing power required for graphics-intensive applications.
Fig. 3A is a simplified diagram illustrating keypoints defined on the right hand according to an embodiment of the invention. The diagram is merely an example, which should not unduly limit the scope of the claims. Those of ordinary skill in the art will recognize many variations, alternatives, and modifications. For example, keypoints 0-19 are assigned to different areas of the user's right hand. From the locations of these keypoints, gestures may be determined. For example, by identifying the relative positions of the keypoints 0-19, different gestures may be determined. In embodiments, the relative positions of the keypoints (e.g., measured in pixel distances) are calibrated during an initial hand shape calibration process, which makes the gesture recognition process more accurate.
FIG. 3B is a simplified diagram illustrating an exemplary gesture according to an embodiment of the present invention. The diagram is merely an example, which should not unduly limit the scope of the claims. Those of ordinary skill in the art will recognize many variations, alternatives, and modifications. For example, hand images captured as two-dimensional (2D) images by a device camera may be mapped into 3D space for processing. In embodiments, the 2D image may be mapped to the 3D space using depth information and calibration parameters (e.g., hand shape). Since the hand moves in 3D space, the hand prediction process is performed using 3D vectors based on 3D coordinates. As shown in fig. 3B, 21 (i.e., 0 to 20) keypoints are obtained for the left hand gesture. For example, a left hand gesture is converted into 3D keypoints through which the augmented reality device may recognize the gesture as an "OK" gesture. Additional processes may also be performed.
Fig. 4 is a simplified block diagram illustrating functional blocks of an extended reality device according to an embodiment of the invention. The diagram is merely an example, which should not unduly limit the scope of the claims. Those of ordinary skill in the art will recognize many variations, alternatives, and modifications. The system pipeline of the extended reality device 400 in fig. 4 may include functional components, which may correspond to the various portions of the device 115 shown in fig. 1B. At the front end, sensors such as the right-side fisheye camera 401, the left-side fisheye camera 402, and the inertial measurement unit (IMU) 403 capture images and other information and send the captured data to the sensor processor 411 (e.g., a lightweight embedded CPU, such as processor 150 in fig. 1B). Sensor processor 411 performs various simple image processing (e.g., denoising, exposure control, etc.) and then packages the processed data for XR server 421. For example, XR server 421 is implemented to act as a data consumer and to transfer data to various algorithms, such as 3D hand tracking 431, 6DoF 441, and others 451. As shown, the 3D hand tracking algorithm 431 is configured after XR server 421 and is followed by the APP module 432. In embodiments, the 3D hand tracking algorithm 431 utilizes hand detection and hand prediction techniques.
The unified APP 432 receives hand tracking results for different purposes, such as games, manipulation of virtual objects, and the like. Additional functions such as object compositor 433, system rendering 434, asynchronous Time Warping (ATW) 435, and display 436 may be configured as shown. Other functional blocks are also possible depending on the implementation.
FIG. 5 is a simplified block diagram illustrating functional blocks in a gesture detection and prediction algorithm according to an embodiment of the present invention. The diagram is merely an example, which should not unduly limit the scope of the claims. Those of ordinary skill in the art will recognize many variations, alternatives, and modifications. According to various embodiments, hand tracking system 500, which may be implemented with apparatus 150 shown in fig. 1B, uses a two-hand tracking process for left hand (l) and right hand (r). For example, system 500 provides real-time (i.e., 30 frames per second) hand tracking on edge devices and operates as a 3D hand tracking system. The stereoscopic fisheye camera is used to acquire left and right images with known parallax calibration. The system includes various sets of algorithms, including hand acquisition 501, hand detection 502, hand predictions 503r and 503l, bounding box tracking 504r and 504l, 2D hand keypoint detection 505r (506 r) and 505l (506 l), 3D hand keypoint detection 507r and 507l, gesture recognition 508r and 508l, and hand shape calibration 570. In embodiments, the hand prediction algorithm is part of the bounding box tracking process.
Fig. 6 is a simplified flowchart illustrating a procedure of a hand prediction method according to an embodiment of the present invention. The diagram is merely an example, which should not unduly limit the scope of the claims. Those of ordinary skill in the art will recognize many variations, alternatives, and modifications. For example, one or more steps may be added, deleted, repeated, modified, replaced, overlapped, and rearranged without limiting the scope of the claims.
At step 610, two-dimensional (2D) hand images are captured, the captured images including at least a previous image and a current image. According to embodiments, the images are captured by a 2D camera, and distance information may or may not be available. For example, the terms "previous image" and "current image" refer to two images captured at successive time intervals, where the current image is the one currently being processed and the previous image is the most recent image captured before it. For example, the captured images are stored in memory in time order and thus can be easily retrieved for processing.
At step 620, previous 2D keypoints and current 2D keypoints are identified using the previous image and the current image, respectively. In embodiments, the 2D keypoints are first used in the hand detection process, and the hand prediction process is performed only after one or more hands are detected. Various types of image recognition algorithms may be used, depending on the particular implementation. For example, a machine learning algorithm may be employed to perform the image recognition process. The bounding box surrounding the first hand in the current image is defined by an upper left corner position and a lower right corner position. According to embodiments, the bounding box includes at least ten percent of an edge area surrounding the first hand.
In step 630, the previous 2D keypoints are mapped to previous 3D keypoints, and the current 2D keypoints are mapped to current 3D keypoints. The hand tracking process is initiated when a hand is detected. Because the hand moves in 3D space, hand tracking is performed in 3D space even though only 2D images of the hand are captured. In embodiments, the 2D keypoints are projected into 3D space using information such as hand shape, hand distance, and hand size. For example, fig. 7 illustrates the use of 3D keypoints in hand prediction.
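By way of illustration, the 2D-to-3D projection of step 630 might be sketched as follows, assuming a pinhole camera model with known intrinsic parameters and a per-keypoint depth estimate (for example, from stereo parallax or hand-shape calibration); the function and parameter names are illustrative assumptions, not part of the invention.

```python
import numpy as np

def lift_keypoints_to_3d(kps_2d: np.ndarray, depths: np.ndarray,
                         fx: float, fy: float, cx: float, cy: float) -> np.ndarray:
    """Back-project Nx2 pixel keypoints (u, v), each with an estimated depth Z,
    into Nx3 camera-space points (X, Y, Z) with the inverse pinhole model:
        X = (u - cx) * Z / fx,   Y = (v - cy) * Z / fy.
    """
    u, v = kps_2d[:, 0], kps_2d[:, 1]
    x = (u - cx) * depths / fx
    y = (v - cy) * depths / fy
    return np.stack([x, y, depths], axis=1)

# Example: lift 21 hand keypoints, all assumed to lie roughly 0.5 m away.
kps_2d = np.random.rand(21, 2) * np.array([640.0, 480.0])
kps_3d = lift_keypoints_to_3d(kps_2d, np.full(21, 0.5), fx=500.0, fy=500.0, cx=320.0, cy=240.0)
```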
Fig. 7 is a simplified diagram illustrating hand prediction using 3D hand keypoints according to an embodiment of the invention. The diagram is merely an example, which should not unduly limit the scope of the claims. Those of ordinary skill in the art will recognize many variations, alternatives, and modifications. As shown, 3D hand keypoints for the previous image P_{t-1} and the current image P_t are obtained by converting 2D hand keypoints from the camera-captured images. As shown in fig. 7, the previous 3D hand keypoints are in frame P_{t-1} and the current 3D keypoints are in frame P_t. Both the previous 3D hand keypoints and the current 3D keypoints are generated from the 2D keypoints at step 630.
At step 640, a set of predicted 3D keypoints corresponding to predicted hand positions is generated using the previous 3D keypoints and the current 3D keypoints generated at step 630. In embodiments, a vector between each pair of corresponding previous and current 3D hand keypoints is calculated and used to generate a predicted 3D keypoint. For example, each vector includes the changes in the coordinate values along the x-axis, y-axis, and z-axis. In a particular embodiment, assuming that the direction of hand motion is substantially linear and that the velocity of hand motion is substantially constant, the predicted 3D keypoints can be easily inferred (e.g., by applying the same difference in coordinates to each keypoint). For example, in the case where the previous 3D keypoint is (1, 1, 1) and the current 3D keypoint is (4, 5, 6), the predicted 3D keypoint will be (7, 9, 11). As will be appreciated, linear extrapolation is a relatively simple calculation that can be performed by various types of processors. In embodiments, other types of extrapolation mechanisms may be used, and predictions may be made using more than two sets of 3D keypoints. The computation of the 3D keypoint prediction may be performed in real time and meet predetermined performance requirements (e.g., 30 frames per second or faster). In embodiments, a convolutional neural network is used to calculate the confidence values, which may be performed by one or more NPUs.
As an example, fig. 7 shows 3D keypoint prediction. As shown, the inputs to the prediction process are the 3D hand keypoints at two timestamps, the previous frame (t-1) and the current frame (t). For example, image P_{t-1} is the image captured by the camera at the previous timestamp t-1, and image P_t is the image captured by the camera at the current timestamp t. Extrapolation is used to predict the hand 3D keypoints at the next timestamp t+1 (e.g., P_{t+1} = 2*P_t - P_{t-1}). For example, the 3D hand keypoints take the form of 21 keypoints in 3D Cartesian space (i.e., x, y, z), where each point (x, y, z) is the respective 3D position of the keypoint. As described above, when the extrapolation formula P_{t+1} = 2*P_t - P_{t-1} is used, it is assumed that the hand moves at a constant velocity. More complex formulas may be used to account for acceleration and direction changes in the predictions.
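By way of illustration, a minimal sketch of this constant-velocity extrapolation is provided below, assuming the 21 keypoints are stored as 21x3 arrays; the function name is an illustrative assumption.

```python
import numpy as np

def predict_keypoints(prev_kps_3d: np.ndarray, curr_kps_3d: np.ndarray) -> np.ndarray:
    """Constant-velocity prediction P_{t+1} = 2*P_t - P_{t-1}, applied per keypoint.

    prev_kps_3d, curr_kps_3d: arrays of shape (21, 3) holding the (x, y, z)
    position of each hand keypoint at timestamps t-1 and t, respectively.
    """
    return 2.0 * curr_kps_3d - prev_kps_3d

# Worked example from the text: previous keypoint (1, 1, 1) and current keypoint
# (4, 5, 6) give the predicted keypoint (7, 9, 11).
prev = np.array([[1.0, 1.0, 1.0]])
curr = np.array([[4.0, 5.0, 6.0]])
print(predict_keypoints(prev, curr))  # [[ 7.  9. 11.]]
```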
In addition to generating predicted 3D keypoints, a confidence value for each predicted keypoint may also be calculated. For example, each predicted 3D keypoint is assigned a confidence value between 0 and 1. The assigned confidence values may be used in a variety of ways. For example, predicted keypoints with low confidence values may be discarded. Depending on the implementation, the confidence value of the predicted 3D keypoints may be calculated in a variety of ways. The total confidence value for the 21 keypoints is between 0 and 21, and if the total confidence value is below a predetermined threshold, the predicted frame may be discarded.
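By way of illustration, a minimal sketch of aggregating and thresholding the per-keypoint confidence values is provided below; the threshold value (11 out of 21, taken from the example given in connection with step 660) and the function name are illustrative assumptions.

```python
def should_discard_prediction(confidences, total_threshold: float = 11.0) -> bool:
    """Sum the per-keypoint confidences (each in [0, 1], 21 keypoints in total)
    and discard the predicted frame when the sum falls below the threshold."""
    return sum(confidences) < total_threshold

# Example: a prediction whose 21 keypoints each score 0.4 (total 8.4) is discarded.
print(should_discard_prediction([0.4] * 21))  # True
```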
Turning now to fig. 6. At step 650, the predicted 3D keypoints are mapped to predicted 2D keypoints. In addition, confidence values are assigned to the predicted 3D keypoints. It is to be understood that 3D or 2D keypoints may be used, depending on the application. For example, for gesture recognition, 3D keypoints are used (see, e.g., fig. 5, blocks 507 and 508). Hand prediction may be used for bounding box tracking (see, e.g., fig. 5, blocks 503 and 504), and 2D keypoints are more useful for that application. For example, the predicted 3D keypoints are converted into 2D keypoints that can be used for bounding box tracking. For example, fig. 7 shows that the predicted 3D keypoints of frame P_{t+1} are mapped to 2D keypoints. In embodiments, the predicted 3D keypoints are projected into 2D space. As shown in fig. 7, the predicted image P_{t+1} (e.g., 21 keypoints in (x, y, z) space) is projected to 2D hand keypoints (21 × (u, v)) as output.
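By way of illustration, the 3D-to-2D projection of step 650 might be sketched as follows, assuming an ideal pinhole model and ignoring fisheye lens distortion; the function and parameter names are illustrative assumptions.

```python
import numpy as np

def project_keypoints_to_2d(kps_3d: np.ndarray,
                            fx: float, fy: float, cx: float, cy: float) -> np.ndarray:
    """Project Nx3 camera-space keypoints (X, Y, Z) to Nx2 pixel keypoints (u, v)
    with the pinhole model: u = fx * X / Z + cx, v = fy * Y / Z + cy.
    Lens distortion (e.g., of a fisheye lens) is ignored for simplicity."""
    x, y, z = kps_3d[:, 0], kps_3d[:, 1], kps_3d[:, 2]
    u = fx * x / z + cx
    v = fy * y / z + cy
    return np.stack([u, v], axis=1)
```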
At step 660, the bounding box is tracked using the predicted 2D keypoints. In embodiments, the predicted total confidence value of the 2D or 3D keypoints may be used to identify erroneous hand detection. For example, low confidence values (e.g., less than 11 out of 21) may indicate that the predicted 3D keypoints (and corresponding 2D keypoints) may be incorrect (possibly caused by false hand detection), and they are not applicable to bounding box tracking and gesture detection applications.
Bounding box tracking may be facilitated in various ways using predicted 2D keypoints. As described above, hand detection may be unreliable for various reasons, and the predicted keypoints and their confidence values may be used to identify "false positive" hand detections. In various implementations, the hand prediction mechanism according to embodiments of the present invention may be both accurate and efficient. For example, a hand prediction method performed in 3D space (in addition to or instead of hand detection) may reduce the missing bounding box problem and the false positive bounding box problem. In various embodiments, the limiting threshold used to define the bounding box (e.g., using a confidence value) may lead to different results and may be calibrated according to the use case (e.g., a dark or bright environment). The threshold for identifying false hand detection may be selected using a machine learning algorithm. It will be appreciated that using a hand prediction process in combination with bounding box tracking may greatly improve performance, and missing bounding boxes and false positive bounding boxes in real-time 3D hand tracking results may be reduced to near zero.
In various embodiments, the size and location of the bounding box may be defined and updated using the predicted 2D keypoints. For example, bounding boxes are defined around the predicted 2D keypoints with a predetermined margin (e.g., 10% to 20% beyond the outermost keypoints). The predicted bounding box may also change in size and shape (e.g., when a fist opens into a palm, or vice versa).
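By way of illustration, a bounding box with such a margin might be derived from the predicted 2D keypoints as follows; the 15% default margin is an illustrative value within the 10% to 20% range mentioned above, and the function name is an assumption.

```python
import numpy as np

def bounding_box_from_keypoints(kps_2d: np.ndarray, margin: float = 0.15):
    """Return (left, top, right, bottom) of a box around Nx2 keypoints (u, v),
    expanded on each side by `margin` times the box width/height."""
    left, top = kps_2d.min(axis=0)
    right, bottom = kps_2d.max(axis=0)
    pad_u = margin * (right - left)
    pad_v = margin * (bottom - top)
    return (left - pad_u, top - pad_v, right + pad_u, bottom + pad_v)
```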
By way of example, the steps shown in fig. 6 may be performed by XR device 115 shown in fig. 1B. The camera module 180 may be used to capture an image as described in step 610. The hand detection and hand prediction processes may be performed by the processor 150. In a specific embodiment, FIG. 8 illustrates an exemplary hand prediction process. Fig. 8 is a simplified diagram illustrating an exemplary state machine for hand tracking according to an embodiment of the present invention. Those of ordinary skill in the art will recognize many variations, alternatives, and modifications. For example, one or more steps shown in FIG. 8 may be added, deleted, repeated, modified, overlapped, rearranged, and replaced without limiting the scope of the claims.
At block 810, hand tracking is in a "dead" state, in which hand detection and hand tracking are not performed. For example, the XR device may be in state 810 when idle (e.g., the image remains unchanged, with no movement or other types of change). The XR device remains in this "dead" state until it is activated (e.g., movement or a change of image is detected in the left or right image).
At block 820, the XR device is activated and ready to perform various tasks, such as hand detection and hand tracking. As part of the initialization process at block 820, the camera is active and captures images, which are stored for processing.
At block 830, hand tracking is performed, which includes hand detection (block 831) and hand prediction (block 832). The hand detection 831 may be repeated until a hand is detected in one of the left and right images. As described above, the hand detection process 831 may erroneously identify a hand. Once a hand is detected, the hand prediction process 832 is performed using at least the previous and current frames. As part of the hand tracking process, the hand prediction 832 may be repeated until the hand is lost or is no longer within the bounding box. For example, if the hand being tracked is no longer located in the bounding box, hand detection 831 may be performed to define a new bounding box; the hand detection 831 may also determine that the hand is no longer present, and the process then proceeds to block 840. In the present invention, hand detection is mainly used for the first two timestamps of hand tracking, after which hand prediction is largely used.
At block 840, the XR device is in a "lost" state, in which no hand is detected any more. For example, the hand prediction process 832 may recognize "false positive" hand detection and determine that no hand is present. In the lost state, each XR component and process may still be active to detect hand motion, and if motion (e.g., the difference between two consecutive timestamps) is detected in the image, then a return may be made to block 830 to perform hand detection. For example, block 840 runs a loop (as shown) for a predetermined time before moving to the "dead" state in block 810. In some embodiments, blocks 810 and 840 may be implemented (or programmed) to the same state.
By way of example, pseudocode for a hand prediction processing mechanism in accordance with the present invention is provided below:
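(The original pseudocode is not reproduced in this text; the following is a minimal illustrative sketch of the tracking state machine of fig. 8, in which the state names follow the figure and the function and parameter names are assumptions.)

```python
from enum import Enum, auto

class TrackState(Enum):
    DEAD = auto()      # block 810: idle, no hand detection or tracking
    INIT = auto()      # block 820: device activated, cameras capturing
    TRACKING = auto()  # block 830: hand detection (831) and hand prediction (832)
    LOST = auto()      # block 840: hand no longer detected

def step(state: TrackState, image_changed: bool, hand_found: bool,
         lost_timeout_expired: bool) -> TrackState:
    """Advance the hand tracking state machine of fig. 8 by one camera frame.

    image_changed        -- motion detected between consecutive timestamps
    hand_found           -- hand detection 831 or hand prediction 832 succeeded
    lost_timeout_expired -- the "lost" loop has run for the predetermined time
    """
    if state == TrackState.DEAD:
        return TrackState.INIT if image_changed else TrackState.DEAD
    if state == TrackState.INIT:
        return TrackState.TRACKING
    if state == TrackState.TRACKING:
        return TrackState.TRACKING if hand_found else TrackState.LOST
    # LOST: resume tracking on motion, fall back to the "dead" state on timeout.
    if image_changed:
        return TrackState.TRACKING
    return TrackState.DEAD if lost_timeout_expired else TrackState.LOST
```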
As an example, state machine 800 may be stored as instructions that are executed by a processor, which may include different computing cores (e.g., NPU and GPU). For example, the hand detection process 831 and the hand prediction process 832 may be performed by an NPU.
Returning now to fig. 5. The system 500 produces a set of outputs including the 3D hand keypoints in blocks 507r and 507l. For example, the hand keypoints are shown in fig. 3A and 3B. Note that although the captured images are two-dimensional, gesture detection is performed using 3D hand keypoints. For example, the results and/or intermediate calculations obtained in blocks 503l and 503r may be used in the gesture recognition process. For example, a 2D-to-3D mapping may be performed between blocks 505l and 507l or obtained from blocks 503l and 503r. Calibration parameters may be used in the mapping process.
As shown, the system 500 includes five components: a main thread 501, a hand detection thread 502, a right hand thread 502r, a left hand thread 502l, and a hand calibration thread 570. These components interact with each other.
As an example, the main thread 501 is used to copy images captured by the right and left fisheye cameras 501r and 501l into the local memory of the system.
The hand detection thread 502 waits for a right fisheye image and a left fisheye image. Upon receiving the images, the hand detection thread 502 may use a hand detection convolutional network for the right and left fisheye images. For example, the hand detection thread 502 outputs confidence values and bounding boxes for the right and left hands.
The right hand thread 502r and the left hand thread 502l may be implemented symmetrically; they receive the right and left fisheye images as inputs, respectively. They also rely on the respective bounding box tracking (i.e., blocks 504r and 504l). For example, the confidence values and bounding box tracking may be used to generate 3D hand keypoints that allow gesture types to be identified.
The hand bounding box threads 504r and 504l provide tracking, and their inputs are the bounding box size (and shape), confidence values, and bounding box prediction values from the hand prediction blocks 503r and 503l. The hand bounding box threads 504r and 504l output, among other things, the hand status (e.g., present or not), bounding box data, and the like.
As shown in fig. 5, if a hand is present (as determined in blocks 504r and 504l), 2D hand keypoint detection (e.g., blocks 505r and/or 505l) crops out the hand from the captured image using the bounding box from hand bounding box tracking. For example, the captured image is cropped and resized to a predetermined size (e.g., 96 pixels × 96 pixels, a small size that allows efficient processing). 2D hand keypoint detection (e.g., blocks 505r and 505l) applies a 2D keypoint detection convolutional network to the resized image and outputs 2D hand keypoints. As described above, if a 2D keypoint is present, it is mapped to a 3D keypoint for gesture detection.
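By way of illustration, the crop-and-resize step might be sketched as follows, using OpenCV for the resize; the clamping of the box to the image borders is an assumption added for robustness, and the function name is illustrative.

```python
import cv2
import numpy as np

def crop_hand(image: np.ndarray, box, size: int = 96) -> np.ndarray:
    """Crop the tracked bounding box (left, top, right, bottom) out of the
    captured image, clamp it to the image borders, and resize it to
    size x size pixels for the 2D keypoint detection network."""
    h, w = image.shape[:2]
    left, top, right, bottom = (int(round(v)) for v in box)
    left, top = max(0, left), max(0, top)
    right, bottom = min(w, right), min(h, bottom)
    if right <= left or bottom <= top:
        raise ValueError("bounding box lies outside the image")
    return cv2.resize(image[top:bottom, left:right], (size, size))
```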
While specific embodiments have been fully described above, various modifications, alternative constructions, and equivalents may be used. Accordingly, the above description and illustrations should not be taken as limiting the scope of the invention, which is defined by the appended claims.
Claims (19)
1. A method of hand prediction, the method comprising:
capturing a plurality of images including at least a first hand, the plurality of images being in a two-dimensional (2D) space, the plurality of images including a current image and a previous image;
identifying a plurality of previous 2D keypoints using the previous image;
identifying a plurality of current 2D keypoints using the current image;
mapping the plurality of previous 2D keypoints to a plurality of previous three-dimensional (3D) keypoints;
mapping the plurality of current 2D keypoints to a plurality of current 3D keypoints;
generating a plurality of 3D predicted keypoints in 3D space using the plurality of previous 3D keypoints and the plurality of current 3D keypoints;
mapping the plurality of 3D predicted keypoints to a plurality of predicted 2D keypoints; and
determining potential false hand detections using the plurality of predicted 2D keypoints.
2. The method of claim 1, further comprising projecting the plurality of previous 2D keypoints into 3D space.
3. The method of claim 2, wherein the plurality of images further includes at least a second hand, the plurality of 3D prediction keypoints being associated with the first and second hands.
4. The method of claim 1, further comprising:
defining a bounding box surrounding the first hand in the current image; and
tracking the bounding box using the plurality of predicted 2D keypoints.
5. The method of claim 4, wherein the bounding box is defined by an upper left corner position and a lower right corner position, the bounding box comprising at least ten percent of an edge area surrounding the first hand.
6. The method of claim 1, wherein the plurality of 3D predicted keypoints are assigned confidence values, the method further comprising detecting a non-hand object using the confidence values.
7. The method of claim 1, further comprising tracking the first hand using the plurality of predicted 2D keypoints.
8. The method of claim 1, further comprising initiating a hand tracking process upon detection of the first hand.
9. The method of claim 1, further comprising calculating a change in coordinates between the plurality of previous 3D keypoints and the plurality of current 3D keypoints, each 3D keypoint comprising three coordinates.
10. The method of claim 1, further comprising using a plurality of 3D vectors for the plurality of previous 3D keypoints and the plurality of current 3D keypoints.
11. An extended reality device, comprising:
a housing having a front face and a back face;
a first camera configured at the front face and configured to capture a plurality of two-dimensional (2D) images at a predefined frame rate, the plurality of 2D images including a current image and a previous image;
a display disposed on a back surface of the housing;
a memory coupled to the first camera and configured to store the plurality of 2D images, and
A processor coupled to the memory;
wherein the processor is configured to:
identify a plurality of 2D keypoints associated with a hand using at least the current image and the previous image;
map the plurality of 2D keypoints to a plurality of three-dimensional (3D) keypoints; and
provide hand prediction using at least the plurality of 3D keypoints.
12. The apparatus of claim 11, wherein the processor comprises a neural processing unit configured to detect a hand using a first image captured by the first camera.
13. The apparatus of claim 11, further comprising a second camera, the first camera being located on a left side of the housing and the second camera being located on a right side of the housing.
14. The apparatus of claim 11, wherein the processor is further configured to track the hand.
15. The apparatus of claim 11, wherein the processor is further configured to:
generate a plurality of predicted 3D keypoints; and
map the plurality of predicted 3D keypoints to a plurality of predicted 2D keypoints.
16. A method of hand tracking, the method comprising:
capturing a first image;
detecting at least a first hand in the first image;
capturing a plurality of images including at least the first hand, the plurality of images being in a two-dimensional (2D) space, the plurality of images including a current image and a previous image;
identifying a plurality of 2D keypoints using the plurality of images;
mapping the plurality of 2D keypoints to a plurality of three-dimensional (3D) keypoints;
generating a plurality of 3D predicted keypoints in 3D space using the plurality of 3D keypoints;
mapping the plurality of 3D predicted keypoints to a plurality of predicted 2D keypoints; and
tracking the first hand using the plurality of 3D predicted keypoints.
17. The method of claim 16, further comprising:
calculating confidence values for the plurality of 3D predicted keypoints; and
identifying erroneous hand detection using at least the confidence values.
18. The method of claim 16, further comprising identifying a change between the first image and the second image.
19. The method of claim 16, further comprising performing a deep learning process using the first image for hand detection.
Applications Claiming Priority (1)
| Application Number | Priority Date | Filing Date | Title |
|---|---|---|---|
| PCT/US2022/030356 (WO2022226432A1) | 2022-05-20 | 2022-05-20 | Hand gesture detection methods and systems with hand prediction |
Publications (1)
| Publication Number | Publication Date |
|---|---|
| CN119213464A (en) | 2024-12-27 |
Family
ID=83723222
Family Applications (1)
| Application Number | Title | Priority Date | Filing Date |
|---|---|---|---|
| CN202280095936.XA (CN119213464A, pending) (en) | 2022-05-20 | Gesture detection method and system with hand prediction |
Country Status (2)
| Country | Link |
|---|---|
| CN (1) | CN119213464A (en) |
| WO (1) | WO2022226432A1 (en) |
Families Citing this family (2)
| Publication number | Priority date | Publication date | Assignee | Title |
|---|---|---|---|---|
| CN115601793B (en) * | 2022-12-14 | 2023-04-07 | 北京健康有益科技有限公司 | Human body bone point detection method and device, electronic equipment and storage medium |
| CN117133016A (en) * | 2023-08-23 | 2023-11-28 | 中移(杭州)信息技术有限公司 | Gesture recognition methods, devices, equipment and computer program products |
Family Cites Families (3)
| Publication number | Priority date | Publication date | Assignee | Title |
|---|---|---|---|---|
| EP2680228B1 (en) * | 2012-06-25 | 2014-11-26 | Softkinetic Software | Improvements in or relating to three dimensional close interactions. |
| US10295338B2 (en) * | 2013-07-12 | 2019-05-21 | Magic Leap, Inc. | Method and system for generating map data from an image |
| DE112020006410T5 (en) * | 2019-12-31 | 2022-10-27 | Nvidia Corporation | THREE-DIMENSIONAL INTERSECTION STRUCTURE PREDICTION FOR AUTONOMOUS DRIVING APPLICATIONS |
- 2022-05-20: CN application CN202280095936.XA, publication CN119213464A (active, Pending)
- 2022-05-20: WO application PCT/US2022/030356, publication WO2022226432A1 (not active, Ceased)
Also Published As
| Publication number | Publication date |
|---|---|
| WO2022226432A1 (en) | 2022-10-27 |
Similar Documents
| Publication | Publication Date | Title |
|---|---|---|
| US12504286B2 (en) | Simultaneous location and mapping (SLAM) using dual event cameras | |
| CN110799991B (en) | Method and system for performing simultaneous localization and mapping using convolutional image transformations | |
| Memo et al. | Head-mounted gesture controlled interface for human-computer interaction | |
| US11748913B2 (en) | Modeling objects from monocular camera outputs | |
| US9911395B1 (en) | Glare correction via pixel processing | |
| US8660362B2 (en) | Combined depth filtering and super resolution | |
| US12039749B2 (en) | Low power visual tracking systems | |
| US12430778B2 (en) | Depth estimation using a neural network | |
| WO2023173668A1 (en) | Input recognition method in virtual scene, device and storage medium | |
| KR20230128284A (en) | 3D scan registration by deformable models | |
| CN119213464A (en) | Gesture detection method and system with hand prediction | |
| CN111275734A (en) | Object identification and tracking system and method thereof | |
| CN115035456B (en) | Video denoising method, device, electronic device and readable storage medium | |
| KR20100006736A (en) | System and apparatus for implementing augmented reality, and method of implementing augmented reality using the said system or the said apparatus | |
| CN119234196A (en) | Gesture detection method and system with hand shape calibration | |
| JP2016513842A (en) | Image processor with evaluation layer implementing software and hardware algorithms of different accuracy | |
| CN110009683B (en) | Real-time object detection method on plane based on MaskRCNN | |
| CN119173832A (en) | Gesture detection method and system with optimized hand detection | |
| US12437438B2 (en) | Estimating body poses based on images and head-pose information | |
| CN118629093A (en) | Robot interaction method and robot device | |
| CN120163866A (en) | Positioning method, positioning device, head display device and medium | |
| CN120976307A (en) | Devices and methods for inferring line of sight direction |
Legal Events
| Date | Code | Title | Description |
|---|---|---|---|
| | PB01 | Publication | |
| | SE01 | Entry into force of request for substantive examination | |