Detailed Description
The following description of the embodiments of the present application will be made clearly and fully with reference to the accompanying drawings, in which it is evident that the embodiments described are some, but not all embodiments of the application. All other embodiments, which can be made by those skilled in the art based on the embodiments of the application without making any inventive effort, are intended to be within the scope of the application.
First, the shortcomings of the palm information detection methods commonly used in the prior art are analyzed.
In the prior art, when palm information is detected in a non-deep-learning manner, positioning the finger gaps based on contour curvature is easily affected by changes in palm posture (such as the palm opening, closing, curling, rotating or tilting), so the user is greatly restricted in use.
Schemes employing deep learning, by contrast, typically use two deep neural networks. The first neural network performs palm positioning in a manner similar to other deep learning object positioning models: the picture to be detected is input into the model, and the model outputs a positioning result comprising the detected object classification and positioning frame information. The palm portion of the image is then cropped out using the classification and positioning frame information, and the cropped image is sent to a second neural network for more accurate palm key point positioning. ROI (Region of Interest) coordinates are calculated from the refined key point information, the palm picture is cropped, and an ROI image is obtained for subsequent feature extraction and identification. Performing palm information detection with two deep neural networks occupies more memory and requires more run time; if multiple palms need to be positioned, the positioning time is multiplied accordingly.
In the embodiment of the application, in order to improve the accuracy of palm key point detection and the robustness to palm posture, while shortening the time needed to position the palm key points and reducing the storage and computing resources consumed in positioning them, a scheme is provided in which palm positioning and key point detection are completed by a single deep network.
The following describes specific embodiments of a palm information detection method disclosed in the present application by way of example with reference to the accompanying drawings.
The embodiment of the application discloses a palm information detection method, which is shown in fig. 1, and comprises steps 110 to 140.
Step 110, inputting a palm image into a pre-trained palm and key point detection model, detecting the palm in the palm image through the palm and key point detection model, and obtaining a candidate positioning result of the detected palm, wherein the candidate positioning result comprises a positioning frame confidence score, positioning frame position information and key point information;
Step 120, screening the candidate positioning results according to the positioning frame confidence score to obtain a first positioning result;
Step 130, performing positioning frame superposition filtering on the first positioning result according to the positioning frame position information to obtain a second positioning result;
And 140, performing reliability screening on the key point information in the second positioning result, and determining a detection result of the key point in the palm image.
The palm and key point detection model adopted in the embodiment of the application can be constructed based on the positioning model of the lightweight backbone network. As shown in fig. 2. The palm and keypoint detection model includes a backbone network 210, a first branch network 220, a second branch network 230, and a third branch network 240.
Alternatively, the backbone network 210 may employ a deep neural network such as ResNet or MobileNet. The backbone network 210 may include a number of convolution blocks and a connection layer, each convolution block consisting of one or more convolution layers, a normalization layer (e.g., a BN layer), and an activation function. The backbone network 210 may also include a downsampling layer for downsampling the hidden layer vectors output by one or more convolution blocks. The connection layer is used to connect the hidden layer vectors output by the convolution blocks and the downsampling layer to obtain a connection vector.
Optionally, the first branch network 220, the second branch network 230, and the third branch network 240 may each include a fully connected layer for performing classification mapping on the connection vector to obtain, respectively, the positioning frame confidence score, the positioning frame position information, and the key point information of the positioning frame of the palm detected in the palm image.
The training method of the palm and key point detection model is described below.
In palm positioning and key point detection applications, a single image acquired from an image acquisition device, or a frame extracted from a video stream, is first scaled to a specified size, for example an image of height H and width W. The scaled image is then normalized to obtain a palm image suitable as the input of the palm and key point detection model. For example, based on values m and v calculated in advance, the scaled image may be normalized as Img_norm = (Img − m) / v, so that the image pixel values Img are normalized into a distribution ranging from −1 to 1 or from 0 to 1. The values of m and v can be obtained by computing statistics over the training data of the palm and key point detection model.
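As an illustration of this preprocessing step, the following sketch resizes and normalizes an image; the input size, and the statistics m and v, are placeholder values assumed for the sketch, not values prescribed by this application:

```python
import cv2          # assumes OpenCV is available for resizing
import numpy as np

def preprocess(image, height=192, width=192, m=127.5, v=128.0):
    """Scale an input image to the specified size and normalize its pixel values.

    height, width, m and v are illustrative placeholders; in practice m and v
    would be statistics computed from the training data of the detection model.
    """
    resized = cv2.resize(image, (width, height)).astype(np.float32)
    # Normalize pixel values roughly into [-1, 1]: Img_norm = (Img - m) / v
    return (resized - m) / v
```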
Then, the palm image obtained after normalization is input into the pre-trained palm and key point detection model, which performs feature extraction and mapping on the input palm image and outputs, through its three branch networks, the following positioning results for the detected palm: the positioning frame confidence score, the positioning frame position information and the key point information. The positioning frames and key points detected by the palm and key point detection model are shown in fig. 3. In the embodiment of the present application, the number of key points output by the palm and key point detection model may be configured to be 3 to 14.
In the embodiment of the application, when the palm image includes a plurality of palms, the palm and key point detection model may output a positioning result of each palm.
Optionally, the output of the palm and key point detection model includes the positioning frame confidence score (hereinafter denoted "T_cls") as a vector of dimension N×C, the positioning frame position information (hereinafter denoted "T_box") as a vector of dimension N×D, and the key point information (hereinafter denoted "T_kp") as a vector of dimension N×K×P. Here, N represents the number of groups of positioning results output by the palm and key point detection model, each group of positioning results being understood as corresponding to one possible palm; C represents the positioning frame confidence score corresponding to each group of positioning results; D represents the dimension of the positioning frame position information and takes the value 4; K represents the number of key points, and in the embodiment of the present application K is an integer in [3, 14]; P represents the length of the key point coordinate information and is an integer multiple of 2. In the embodiment of the application, the key point coordinate information is an offset probability distribution of the key point relative to the designated center position; for example, the probability that the key point is offset by each preset offset relative to the designated center position can be obtained.
In some embodiments of the present application, when the positioning frame is rectangular, D takes the value 4, and the positioning frame position information T_box represents the offsets of the four edges of the palm positioning frame from the designated center position. Taking the positioning schematic diagram shown in fig. 4 as an example, the designated center position is (xc, yc), the distance between the upper edge of the positioning frame and the designated center position is T, the distance between the left edge and the designated center position is L, the distance between the right edge and the designated center position is R, and the distance between the lower edge and the designated center position is B; the positioning frame position information includes the offsets T, L, B and R. Further, the position coordinates of the positioning frame can be calculated from the coordinates of the designated center position and the offsets T, L, B and R of the edges of the positioning frame from the designated center position.
In the embodiment of the application, the key point information is a probability distribution of the offset of each key point relative to the designated center position. As shown in fig. 5, the abscissa offset of a key point with respect to the designated center position may be denoted dx, and the ordinate offset dy. The value of P is an integer multiple of 2. In the embodiment of the present application, two z-dimensional vectors may be used to represent the coordinate information of a key point, that is, the probability distributions of the offsets of the abscissa and the ordinate of the key point with respect to the designated center position are each represented by a vector of length z, so that the length P is 2×z (where z is an integer greater than 1). The corresponding offset value can be obtained by calculating the expectation of the probability distribution, so as to determine the coordinates of the key point. Compared with the prior method of directly predicting the coordinate values of the key points, predicting an offset probability distribution and then calculating the offset of the key point from the designated center position according to that distribution makes the predicted key point position more accurate.
In the embodiment of the application, each set of positioning results corresponds to a designated center position (xc, yc), wherein the designated center position is predetermined according to a mapping relationship between a hidden layer feature corresponding to the positioning frame in the set of positioning results (i.e., a hidden layer feature corresponding to the set of positioning results) and an image area of the palm image.
Take as an example a backbone network of the palm and key point detection model that comprises two convolution block branches which converge at a connection layer, where the connection layer splices hidden layer feature maps of different scales output by the respective convolution block branches. The scale of the hidden layer feature map output by each convolution block branch corresponds to the downsampling scale applied by that branch to the palm image input to the palm and key point detection model. Taking the model structure shown in fig. 2 as an example, the convolution block ConvBlock branches converge at the connection layer cat, and the two ConvBlock branches output tensors of different sizes. When the second branch network 230 performs feature mapping based on the hidden layer vectors converged at cat to obtain the coordinates of the positioning frame, the position of a target object on the output tensor is related to the coordinates of that object on the image input to the model, which is determined by the properties of convolutional neural networks. In the embodiment of the application, the designated center position, in the palm image input to the model, corresponding to each element of the tensors output by the model can be predetermined according to the coordinate correspondence between the output tensors and the input image.
The specific manner of determining the designated center position corresponding to each tensor is prior art and is not repeated in the embodiment of the present application. Further, the designated center position corresponding to each tensor can be used as the designated center position corresponding to the positioning result obtained from that tensor.
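For illustration only, the following sketch precomputes such designated center positions under the assumption that, as in common anchor-free detectors, each element of an output tensor maps back to the center of its receptive-field cell via the downsampling stride; the strides and feature map sizes used here are hypothetical:

```python
import numpy as np

def grid_centers(feat_h, feat_w, stride):
    """Map every cell of a feat_h x feat_w feature map back to a designated
    center position (xc, yc) on the input palm image (stride-based convention)."""
    ys, xs = np.meshgrid(np.arange(feat_h), np.arange(feat_w), indexing="ij")
    xc = (xs + 0.5) * stride   # abscissa of each cell center on the input image
    yc = (ys + 0.5) * stride   # ordinate of each cell center on the input image
    return np.stack([xc, yc], axis=-1).reshape(-1, 2)

# e.g. two convolution block branches with strides 8 and 16 together contribute
# N = 24*24 + 12*12 groups of positioning results for a 192x192 input image
centers = np.concatenate([grid_centers(24, 24, 8), grid_centers(12, 12, 16)])
```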
As described above, when a plurality of palms are included in the palm image, the palm and key point detection model may output a set of positioning results for each palm. In the embodiment of the application, the ith set of positioning results can be expressed as result[i]; the ith set of positioning results result[i] includes the positioning frame confidence score of the palm positioning frame, e.g. T_cls[i], the positioning frame position information of the palm positioning frame, e.g. T_box[i], and the key point information, e.g. T_kp[i].
In order to provide more accurate key point information for subsequent palm recognition, in the embodiment of the application, detection results output by the palm and key point detection models are required to be further filtered and screened, and only the detection results with high confidence and high quality are used as final detection results.
Firstly, the positioning result is screened according to the confidence level of the positioning frame.
Optionally, in step 120, screening the candidate positioning results according to the positioning frame confidence score to obtain a first positioning result includes taking the candidate positioning results whose positioning frame confidence score is greater than a preset score threshold as the first positioning result. The preset score threshold is determined according to test data. In the embodiment of the application, the integrated confidence score can be greater than the preset score threshold only when both the IoU score of the candidate positioning result and the positioning frame confidence score are relatively high.
In some embodiments of the present application, when the integrated confidence score of a group of candidate positioning results is greater than the preset score threshold, the confidence of that group of candidate positioning results can be considered to meet the requirement and the group is retained; when the integrated confidence score of a group of candidate positioning results is less than or equal to the preset score threshold, the confidence of that group can be considered not to meet the requirement and the group is discarded. The retained candidate positioning results are then used as the first positioning results, and the filtering of the subsequent steps continues.
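A minimal sketch of this confidence screening step follows; the threshold value 0.5 and the array names are illustrative assumptions:

```python
import numpy as np

def filter_by_confidence(t_cls, t_box, t_kp, centers, score_threshold=0.5):
    """Keep only the candidate groups whose positioning frame confidence score
    exceeds the preset score threshold (0.5 is an illustrative value; in practice
    it would be determined from test data)."""
    keep = t_cls.reshape(-1) > score_threshold        # boolean mask over the N groups
    return t_cls[keep], t_box[keep], t_kp[keep], centers[keep]
```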
Optionally, the positioning frame position information includes an offset of each edge of the positioning frame from a designated center position. Correspondingly, in step 130, performing positioning frame superposition filtering on the first positioning result according to the positioning frame position information to obtain a second positioning result includes: calculating the position coordinates of the positioning frames in the first positioning result according to the designated center position and the offsets, and performing superposition filtering on the positioning frames according to the position coordinates of each positioning frame to obtain the second positioning result. The designated center position is preset according to the mapping relationship between the hidden layer features corresponding to the positioning frame and the image area of the palm image.
As described above, in the embodiment of the present application each set of candidate positioning results has fixed center information, that is, a designated center position, such as the position represented by the coordinates (xc, yc) in fig. 4. In the embodiment of the application, the positioning frame position information output by the palm and key point detection model is not the position coordinates of the positioning frame, but the offsets of the edges of the positioning frame from the designated center position. Therefore, before superposition filtering is performed on the positioning frames, the position coordinates of the positioning frame in each first positioning result, denoted for example (X1, Y1, X2, Y2), are first calculated from the positioning frame position information T_box and the designated center position (xc, yc).
In some embodiments of the present application, the position coordinates of the positioning frame may be calculated from the positioning frame position information and the designated center position by the following formulas:

X1 = xc − T_box[i][1]; Y1 = yc − T_box[i][0]; X2 = xc + T_box[i][3]; Y2 = yc + T_box[i][2];

wherein T_box[i][1] represents the offset corresponding to the abscissa of the upper-left and lower-left corners of the ith positioning frame (the left edge), T_box[i][0] represents the offset corresponding to the ordinate of the upper-left and upper-right corners (the upper edge), T_box[i][3] represents the offset corresponding to the abscissa of the upper-right and lower-right corners (the right edge), and T_box[i][2] represents the offset corresponding to the ordinate of the lower-left and lower-right corners (the lower edge). By predicting the offsets of the edges of the positioning frame from the designated center position and calculating the position coordinates of the positioning frame from the predicted offsets and the designated center position, anchors are eliminated, so multiple hyperparameters related to anchor sizes do not need to be introduced in the model training stage, which reduces the difficulty of model parameter tuning during training.
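A sketch of this box decoding step is given below; the array layout T_box[i] = [T, L, B, R] follows the index correspondence described above and should be treated as an assumption of this illustration:

```python
import numpy as np

def decode_boxes(t_box, centers):
    """Convert edge offsets from the designated center (xc, yc) into corner
    coordinates (X1, Y1, X2, Y2), assuming T_box[i] = [top, left, bottom, right]."""
    xc, yc = centers[:, 0], centers[:, 1]
    x1 = xc - t_box[:, 1]   # left edge:  xc minus offset L (T_box[i][1])
    y1 = yc - t_box[:, 0]   # upper edge: yc minus offset T (T_box[i][0])
    x2 = xc + t_box[:, 3]   # right edge: xc plus offset R (T_box[i][3])
    y2 = yc + t_box[:, 2]   # lower edge: yc plus offset B (T_box[i][2])
    return np.stack([x1, y1, x2, y2], axis=-1)
```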
Then, according to the position coordinates of the positioning frames in the first positioning results, superposition filtering is performed on the positioning frames in the first positioning results.
Optionally, performing superposition filtering on the positioning frames according to the position coordinates of the positioning frames to obtain a second positioning result includes: calculating the degree of overlap between the positioning frames according to their position coordinates, determining overlapping positioning frames according to the degree of overlap, and generating the second positioning result according to the first positioning result corresponding to one of the overlapping positioning frames.
In some embodiments of the application, NMS (non-maximum suppression) may be used to filter out overlapping positioning frames. NMS filtering is a conventional process: a positioning detection model outputs a very large number of positioning results, many of which are very likely to describe the same object. It is therefore necessary to keep only one of multiple coincident positioning results by calculating the degree of coincidence (IoU, Intersection over Union) of the different positioning results.
For example, during model training and actual computation, the NMS overlap threshold may be set to 0.3 or 0.5, i.e., positioning frames whose overlap with a retained frame is greater than 0.3 or 0.5 are filtered out.
After the positioning frames whose overlap meets the preset filtering condition are filtered out, the retained positioning frames form the second positioning results. Thus, the positioning frames of the second positioning results are non-overlapping positioning frames; for example, a highly reliable positioning result is retained for each individual palm.
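The following is a generic NMS sketch of the kind of overlap filtering described above (a standard formulation, not code from this application):

```python
import numpy as np

def nms(boxes, scores, iou_threshold=0.5):
    """Keep the highest-scoring positioning frame and suppress any remaining
    frame whose IoU with it exceeds the threshold; repeat until none are left."""
    order = scores.argsort()[::-1]          # indices sorted by descending confidence
    keep = []
    while order.size > 0:
        i = order[0]
        keep.append(i)
        rest = order[1:]
        if rest.size == 0:
            break
        # intersection of frame i with all remaining frames
        x1 = np.maximum(boxes[i, 0], boxes[rest, 0])
        y1 = np.maximum(boxes[i, 1], boxes[rest, 1])
        x2 = np.minimum(boxes[i, 2], boxes[rest, 2])
        y2 = np.minimum(boxes[i, 3], boxes[rest, 3])
        inter = np.maximum(0.0, x2 - x1) * np.maximum(0.0, y2 - y1)
        area_i = (boxes[i, 2] - boxes[i, 0]) * (boxes[i, 3] - boxes[i, 1])
        area_r = (boxes[rest, 2] - boxes[rest, 0]) * (boxes[rest, 3] - boxes[rest, 1])
        iou = inter / (area_i + area_r - inter)
        order = rest[iou <= iou_threshold]  # drop frames overlapping frame i too much
    return keep
```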
Through the above filtering steps, the positioning information of one or more independent palms whose confidence is higher than the preset threshold is obtained, namely one or more groups of second positioning results. Then, the reliability of the key points of each group of second positioning results is further evaluated, and the positioning results are filtered once more in the key point dimension.
In step 140, performing reliability screening on the key point information in the second positioning results to determine a detection result of the key points in the palm image includes: for each key point in each second positioning result, calculating the coordinates of the key point in the palm image according to the designated center position and the offset probability distribution information; performing quality evaluation on the coordinates of the key points in the palm image to obtain a key point information evaluation result; and determining the detection result of the key points in the palm image according to the key point information evaluation result.
As previously described, in embodiments of the present application each set of positioning results has fixed center information, i.e., a designated center position, such as the position represented by the coordinates (xc, yc) in fig. 5. In the embodiment of the application, the key point information output by the palm and key point detection model is not the position coordinates of the key points; it comprises offset probability distribution information of the key points relative to the designated center position. The offset step size may preferably be 0.5, 1 or 2. As described above, the key point coordinate information of each key point is represented by a vector of length P; for example, it may be represented as [dy, dx], where dy and dx are z-dimensional vectors that are concatenated to obtain the key point coordinate information.
In the embodiment of the application, when filtering the positioning results based on the key points, the coordinates of the key points in the palm image are calculated from the designated center positions and the offset probability distribution information in the key point information. The designated center position here is the designated center position corresponding to the group of second positioning results. It is obtained by pre-calculation; its calculation method is described above and is not repeated here.
In some embodiments of the application, calculating the coordinates of a key point in the palm image according to the designated center position and the offset probability distribution information includes calculating the expected value of the offset probability distribution according to the offset probability distribution information, and calculating the coordinates of the key point in the palm image according to the designated center position and the expected value. For example, the position coordinates of each key point can be calculated from the key point information and the designated center position by the following formulas:

x[n][k] = xc + Σ_j Δx_j × T_kp[n][k][j]; y[n][k] = yc + Σ_j Δy_j × T_kp[n][k][j];

wherein Δx_j and Δy_j respectively represent the abscissa and ordinate offsets relative to the designated center position, n represents the positioning result number, k represents the key point number, P represents the length of the key point coordinate information, j represents the index within the offset probability distribution, and T_kp[n][k][j] represents the probability that the kth key point in the nth group of positioning results is offset by Δy_j or Δx_j (i.e., the jth offset) relative to the designated center position (xc, yc); the sums run over the probability distribution entries of the abscissa and of the ordinate, respectively. When the offset step is 1 and the offsets are non-zero-symmetric, the offsets may take values Δx = 1, 2, 3, ... and Δy = 1, 2, 3, .... In other embodiments of the application, the offsets may also be set to be zero-symmetric; for example, with zero-symmetric offsets and a step of 0.5, the offsets may take values Δx = ..., −1, −0.5, 0, 0.5, 1, ....
In the embodiment of the present application, the length of the offset probability distribution may preferably be set to 9, 11, 13, 15, 17 or 19 according to different scenarios, and the probability distribution step size may preferably be set to 0.5, 1 or 2 as the case may be. The offsets may be zero-symmetric or non-zero-symmetric.
By calculating the expected value of the coordinate offset probability distribution to obtain the corresponding coordinate position, a more stable coordinate value can be obtained than with the traditional method of directly predicting the distance value.
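As an illustration, the following sketch decodes key point coordinates as the expectation of the predicted offset distributions; the [dy, dx] layout and the example offset grid are assumptions made for this sketch:

```python
import numpy as np

def decode_keypoints(t_kp, centers, offsets):
    """Compute key point coordinates as the expectation of the offset probability
    distribution around the designated center position (xc, yc).

    t_kp    : array of shape (N, K, 2*z); per key point, the first z entries are
              the probabilities for dy and the last z entries those for dx.
    offsets : the z discrete offset values, e.g. np.arange(-4, 5) * 0.5 for a
              zero-symmetric distribution of length 9 with step 0.5.
    """
    z = offsets.size
    dy = (t_kp[:, :, :z] * offsets).sum(axis=-1)   # expected ordinate offset per key point
    dx = (t_kp[:, :, z:] * offsets).sum(axis=-1)   # expected abscissa offset per key point
    x = centers[:, None, 0] + dx                   # key point abscissa in the palm image
    y = centers[:, None, 1] + dy                   # key point ordinate in the palm image
    return np.stack([x, y], axis=-1)               # shape (N, K, 2)
```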
By predicting the offset probability distribution of a key point relative to the designated center position and calculating the position coordinates of the key point from the predicted offset probability distribution and the designated center position, the problem that anchor-free object detection predictions tend to deviate greatly is alleviated; at the same time, anchors are eliminated, so multiple hyperparameters related to anchor sizes do not need to be introduced in the model training stage, which reduces the difficulty of model parameter tuning during training.
According to this method, the positioning frame and key point coordinates of each palm retained by the screening can each be calculated; compared with the prior art, in which two neural networks are needed to determine the positioning frame and the key points respectively, the palm detection efficiency is improved remarkably.
Then, quality evaluation is performed on the coordinates of the key points in the palm image to obtain a key point information evaluation result. This serves to filter out positioning results of poor quality according to the key point information and to prevent inaccurate positioning results from affecting subsequent palm recognition applications.
Optionally, performing quality evaluation on the coordinates of the key points in the palm image to obtain a key point information evaluation result includes: performing quality evaluation on the key points in each second positioning result according to the positional relationship of the key points in the palm image to obtain a key point information evaluation result, and/or performing quality evaluation on the key points in each second positioning result according to the relative positions of the key points and the positioning frames corresponding to the key points to obtain a key point information evaluation result. The positioning frame corresponding to a key point is the positioning frame in the same group of second positioning results as that key point.
For example, two adjacent key points may first be connected into vectors in clockwise or counterclockwise order. The included angle between adjacent vectors can then be calculated by the formula cos θ = (v1 · v2) / (|v1| × |v2|), where v1 and v2 respectively represent vectors formed by successive pairs of adjacent key points. Then, it is further judged from the calculated included angles whether any key point in each group of second positioning results has a serious error; if so, an evaluation result indicating that the key point information has low reliability is obtained, and if not, an evaluation result indicating that the key point information is reliable is obtained.
In some embodiments of the present application, key points with serious errors may also be filtered out, so as to obtain trusted key point information.
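A rough sketch of such an angle-based plausibility check follows; the angle limits are hypothetical values that would in practice be calibrated from labeled palm data:

```python
import numpy as np

def keypoints_plausible(points, min_angle_deg=20.0, max_angle_deg=170.0):
    """Form vectors between successive key points (taken in clockwise or
    counter-clockwise order) and flag the result as unreliable if any included
    angle falls outside the expected range."""
    pts = np.asarray(points, dtype=np.float64)
    vecs = np.diff(pts, axis=0)                        # vectors between adjacent key points
    for v1, v2 in zip(vecs[:-1], vecs[1:]):
        cos_theta = np.dot(v1, v2) / (np.linalg.norm(v1) * np.linalg.norm(v2) + 1e-9)
        angle = np.degrees(np.arccos(np.clip(cos_theta, -1.0, 1.0)))
        if not (min_angle_deg <= angle <= max_angle_deg):
            return False                               # a severely erroneous key point exists
    return True
```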
For another example, the confidence score of the key point can be calculated by calculating the relative position of the key point and the positioning frame, and the confidence evaluation result of the key point can be obtained according to the comparison result of the confidence score and the preset score threshold.
Optionally, performing quality evaluation on the key points in each second positioning result according to the relative positions of the key points and the positioning frames corresponding to the key points to obtain a key point information evaluation result includes: performing quality evaluation on the key points in a second positioning result according to the intersection-over-union of the areas of the first image area surrounded by the key points in that second positioning result and the second image area surrounded by the positioning frame corresponding to those key points, so as to obtain a key point information evaluation result.
In some embodiments of the present application, all the key points in the second positioning result may be sequentially connected to obtain a polygon, and the area surrounded by the polygon is used as the first image area.
In other embodiments of the present application, in order to reduce the computation load and the computation complexity, a portion of the key points in the second positioning result may be selected to be connected to obtain a polygon, and an area surrounded by the polygon is used as the first image area.
For example, at least four points, which are the most up, down, left, and right, among the key points in the second positioning result may be selected and sequentially connected to form a convex polygon such as the convex polygon in fig. 6, and the region surrounded by the convex polygon may be regarded as the first image region 610. In fig. 6, a rectangular region 620 represents a second image region surrounded by the positioning frame.
Then it is judged whether the edges of the convex polygon intersect the positioning frame; if there are intersection points, the polygon partially overlaps the rectangular frame. The intersection point coordinates are calculated and the area of the overlapping part is computed. For example, the center point of the overlapping portion is found, each vertex is connected with the center point to form small triangles, and the areas of the small triangles are calculated and summed to obtain the area S1 of the convex polygon of the overlapping portion (such as the diagonally filled area in fig. 6) and the area S2 of the large convex polygon (i.e. the first image area 610). Then, the area S3 of the second image area 620 surrounded by the positioning frame is calculated. Finally, the area intersection ratio P-IOU of the first image area surrounded by the key points and the second image area surrounded by the positioning frame corresponding to the key points may be calculated according to the formula P-IOU = S1 / (S2 + S3 − S1).
If the edges of the convex polygon have no intersection points with the positioning frame, it is further judged whether the convex polygon contains the positioning frame, or whether the positioning frame contains the convex polygon; if so, the area intersection ratio P-IOU of the first image area surrounded by the key points and the second image area surrounded by the positioning frame corresponding to the key points is calculated according to the formula P-IOU = (convex polygon area ∩ positioning frame area) / (convex polygon area ∪ positioning frame area).
If there is no intersection point between the edge of the convex polygon and the positioning frame, and there is no mutual inclusion, the area intersection ratio P-iou=0.
Then, the area intersection ratio P-IOU of the first image area surrounded by the key points and the second image area surrounded by the positioning frame corresponding to the key points is used as an evaluation score for the key points to evaluate their quality. If the area intersection ratio P-IOU is zero, the quality of the key points is poor; the closer the area intersection ratio P-IOU is to the quality threshold, the better the key point quality. The quality threshold may be 1, or an area intersection ratio P-IOU value calculated from calibrated key points and positioning frames.
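For illustration, the following sketch computes such an area intersection ratio P-IOU between the convex polygon enclosed by the key points and the positioning frame; it relies on the shapely geometry library as a stand-in for the triangle-summation procedure described above:

```python
from shapely.geometry import Polygon, box  # assumes the shapely library is installed

def p_iou(keypoints, bbox):
    """Area intersection-over-union of the first image area (convex hull of the
    key points) and the second image area (the positioning frame)."""
    hull = Polygon(keypoints).convex_hull        # first image area enclosed by the key points
    frame = box(*bbox)                           # bbox = (X1, Y1, X2, Y2)
    s1 = hull.intersection(frame).area           # S1: overlapping area
    union = hull.area + frame.area - s1          # S2 + S3 - S1
    return s1 / union if union > 0 else 0.0

# Example: evaluate decoded key points against their positioning frame
score = p_iou([(30, 40), (90, 35), (95, 100), (25, 105)], (20, 30, 100, 110))
```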
Through the quality evaluation of the key points, the second positioning results to which reliable key point information belongs are retained as the final detection results of the palm image.
In order to make the palm information detection method disclosed by the embodiment of the application clearer, a training method of the palm and key point detection model is exemplified below.
The palm and key point detection model is used for detecting the palm and key points of an input palm image and outputting, for the detected palm, a predicted value of the positioning frame confidence score, a predicted value of the positioning frame position information and a predicted value of the key point information. The loss function of the palm and key point detection model is configured to calculate the loss value of the palm and key point detection model according to the positioning frame confidence prediction loss, the positioning frame position prediction loss and the key point position prediction loss, wherein the positioning frame confidence prediction loss is obtained by calculating the cross entropy of the product of the predicted positioning frame confidence score and the IoU score, the positioning frame position prediction loss is calculated by a regression loss function for target detection, and the key point position prediction loss is obtained by weighted summation of the expectation loss of the key point offsets and the offset probability distribution loss.
As previously described, the palm and keypoint detection model includes a backbone network 210, a first branch network 220, a second branch network 230, and a third branch network 240, wherein,
The backbone network is used for extracting features of the palm images input to the palm and key point detection model to obtain hidden layer vectors;
The first branch network is used for carrying out feature mapping on the hidden layer vector to obtain a positioning frame confidence score predicted value corresponding to the palm detected in the palm image;
The second branch network is used for carrying out feature mapping on the hidden layer vector to obtain a predicted value of the position information of the positioning frame of the palm detected in the palm image;
and the third branch network is used for carrying out feature mapping on the hidden layer vector to obtain the key point information predicted value of the palm detected in the palm image.
During the training process, the loss function of the palm and key point detection model can be expressed as:
Loss = α·Loss_cls + β·Loss_box + δ·Loss_keypoint;
Wherein α, β and δ are hyperparameters whose values can be set as required, for example α, β and δ may take the value 2; Loss represents the model loss, Loss_cls represents the positioning frame confidence prediction loss, Loss_box represents the positioning frame position prediction loss, and Loss_keypoint represents the key point position prediction loss.
Optionally, the positioning frame confidence prediction loss Loss_cls may be calculated according to the following formula:
Loss_cls = −(1 − y)·log(1 − P_score) − y·log(P_score);
Where P_score = T_cls × IoU; T_cls represents the predicted positioning frame confidence score, IoU represents the metric score for the accuracy of the palm detected in the palm image, i.e., the IoU score, and y represents the classification label, i.e., whether the target is an object of the corresponding class; for example, y = 1 indicates that the target is an object of the class, and y = 0 indicates that it is not.
IoU (Intersection over Union) is a criterion for measuring the accuracy of detecting a corresponding object in a particular dataset. IoU is a simple measurement standard, and any task whose output yields prediction boxes can be measured with the IoU index. The calculation of the IoU score of a positioning frame follows the prior art and is not repeated in the embodiment of the application.
In the training process of the palm and key point detection model, cross entropy is used as the loss function to calculate the positioning frame confidence prediction loss, so that the score of palm samples in the image approaches 1 and that of the background approaches 0. The difference from the traditional scheme is that, when the positioning frame confidence score is finally computed during training, it is multiplied by a quality score of the positioning result; the quality score used here is the IoU score. That is, the score fed into the cross entropy loss function during training is P_score = positioning frame confidence score × IoU. As can be seen from the positioning frame confidence loss function, the resulting score P_score is close to 1 only when both the positioning frame confidence score and the IoU score are high, and only then is the positioning frame confidence prediction loss small, which improves the detection accuracy of the trained palm and key point detection model. In other words, for the positioning result of an object, P_score is close to 1 only when both T_cls and IoU are close to 1, thereby constraining both the confidence and the positioning frame result.
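A short sketch of this IoU-aware confidence loss is given below, assuming PyTorch; it is a direct transcription of the formula above rather than training code from this application:

```python
import torch

def bbox_confidence_loss(t_cls, iou, y):
    """Cross entropy on P_score = T_cls * IoU, so that a high confidence is only
    rewarded when the classification score and the box quality are both high.

    t_cls : predicted positioning frame confidence scores in (0, 1)
    iou   : IoU between each predicted frame and its matched ground-truth frame
    y     : classification labels (1 for palm, 0 for background)
    """
    eps = 1e-7                                     # numerical stability for the logarithms
    p_score = (t_cls * iou).clamp(eps, 1.0 - eps)
    return (-(1 - y) * torch.log(1 - p_score) - y * torch.log(p_score)).mean()
```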
Alternatively, the positioning frame position prediction loss Loss_box may be calculated using GIoU Loss (Generalized Intersection over Union Loss, a regression loss function for target detection). For example, the positioning frame position prediction loss Loss_box is calculated using the following formula:
Loss_box = GIoU(box, box′);
wherein box represents the predicted value of the positioning frame position, box′ represents the labeled true value of the positioning frame position, and box = [X1, Y1, X2, Y2].
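A sketch of a GIoU-based box loss follows, assuming PyTorch; the commonly used loss form is 1 − GIoU, which is what this sketch returns (the document denotes the loss simply as GIoU(box, box′)):

```python
import torch

def giou_loss(pred, target):
    """GIoU-based regression loss for boxes given as [X1, Y1, X2, Y2]."""
    # intersection of predicted and ground-truth boxes
    x1 = torch.max(pred[..., 0], target[..., 0])
    y1 = torch.max(pred[..., 1], target[..., 1])
    x2 = torch.min(pred[..., 2], target[..., 2])
    y2 = torch.min(pred[..., 3], target[..., 3])
    inter = (x2 - x1).clamp(min=0) * (y2 - y1).clamp(min=0)
    area_p = (pred[..., 2] - pred[..., 0]) * (pred[..., 3] - pred[..., 1])
    area_t = (target[..., 2] - target[..., 0]) * (target[..., 3] - target[..., 1])
    union = (area_p + area_t - inter).clamp(min=1e-7)
    iou = inter / union
    # smallest box enclosing both boxes
    ex1 = torch.min(pred[..., 0], target[..., 0])
    ey1 = torch.min(pred[..., 1], target[..., 1])
    ex2 = torch.max(pred[..., 2], target[..., 2])
    ey2 = torch.max(pred[..., 3], target[..., 3])
    enclose = ((ex2 - ex1) * (ey2 - ey1)).clamp(min=1e-7)
    giou = iou - (enclose - union) / enclose
    return (1.0 - giou).mean()
```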
Alternatively, the key point position prediction loss Loss_keypoint may be calculated by the following formula:

Loss_keypoint = Loss_mse(kp, kp′) + τ·Loss_d;

wherein Loss_mse(kp, kp′) represents the mean square error loss of the key point information, kp represents the predicted value of the key point information (including the predicted key point positions), kp′ represents the labeled true value of the key point information, Loss_d represents the loss of the probability distribution of the offset of the key point position relative to the designated center position, and τ is a weight value smaller than 1, for example τ = 0.3.
Alternatively, the offset probability distribution loss Loss_d of a key point position relative to the designated center position can be calculated by the following formula:
Loss_d = −((Δy_(j+1) − y)·log S_j + (y − Δy_j)·log S_(j+1));
Wherein Δy_(j+1) and Δy_j represent the target offset values, for example the labeled discrete offset values of the probability distribution, y represents the offset value output by the model, and S_j and S_(j+1) represent the probabilities, in the offset distribution, of the two adjacent offsets for the coordinate of the same key point. As can be seen from the calculation formula of the offset probability distribution loss Loss_d, under the constraint of Loss_d the predicted value is pulled toward values near the target offsets, so that the output offset probability distribution for the coordinate of a key point with the same offset is more stable, which improves the accuracy of key point detection.
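The following sketch illustrates such a distribution loss for a single key point coordinate, assuming PyTorch and assuming that the network produces unnormalized scores turned into probabilities by a softmax; the variable names and the bracketing scheme are assumptions of this illustration:

```python
import torch
import torch.nn.functional as F

def offset_distribution_loss(pred_logits, target, offsets):
    """Loss on the offset probability distribution of one key point coordinate,
    following the formula above: the continuous target offset y is bracketed by
    the two adjacent discrete offsets, whose probabilities are constrained.

    pred_logits : tensor of shape (z,), scores over the z discrete offsets
    target      : 0-dim tensor, labeled continuous offset of the coordinate
    offsets     : tensor of the z discrete offset values, sorted ascending
    """
    probs = F.softmax(pred_logits, dim=-1)             # S_j for every discrete offset
    j = torch.searchsorted(offsets, target).clamp(1, offsets.numel() - 1)
    left, right = offsets[j - 1], offsets[j]           # the offsets bracketing the target
    eps = 1e-7
    return -((right - target) * torch.log(probs[j - 1] + eps)
             + (target - left) * torch.log(probs[j] + eps))
```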
In the training process of the palm and key point detection model, model parameters are continuously optimized with the goal of minimizing the overall loss value, so that the positioning frame confidence prediction loss, the positioning frame position prediction loss and the key point position prediction loss are all reduced, and the training of the palm and key point detection model is completed. The backbone network, the first branch network, the second branch network and the third branch network of the trained palm and key point detection model are then migrated and deployed for palm and palm key point detection.
According to the palm information detection method disclosed by the embodiment of the application, a palm image is input into a pre-trained palm and key point detection model, the palm in the palm image is detected through the palm and key point detection model, and a candidate positioning result of the detected palm is obtained, the candidate positioning result comprising a positioning frame confidence score, positioning frame position information and key point information; the candidate positioning result is screened according to the positioning frame confidence score to obtain a first positioning result; positioning frame superposition filtering is carried out on the first positioning result according to the positioning frame position information to obtain a second positioning result; and finally, reliability screening is carried out on the key point information in the second positioning result, and the detection result of the key points in the palm image is determined. According to the method, a deep neural network is adopted and the palm key points are detected based on learning of positioning frame information and key point information, so the method has stronger robustness to palm postures. Moreover, only one neural network model needs to be trained and used, so fewer computing and storage resources are occupied and the palm and key point detection efficiency is higher.
The embodiment of the application also discloses a palm information detection device, as shown in fig. 7, comprising:
The candidate positioning result obtaining module 710 is configured to input a palm image to a pre-trained palm and key point detection model, detect a palm in the palm image through the palm and key point detection model, and obtain a candidate positioning result of the detected palm, where the candidate positioning result includes a positioning frame confidence score, positioning frame position information, and key point information;
The first positioning result screening module 720 is configured to screen the candidate positioning results according to the positioning frame confidence score to obtain a first positioning result;
The second positioning result screening module 730 is configured to perform positioning frame overlapping filtering on the first positioning result according to the positioning frame position information, so as to obtain a second positioning result;
And a third positioning result screening module 740, configured to perform reliability screening on the key point information in the second positioning result, and determine a detection result of the key point in the palm image.
Optionally, the first positioning result screening module 720 is further configured to:
And taking the candidate positioning result with the positioning frame confidence score larger than a preset score threshold as a first positioning result.
Optionally, the positioning frame position information includes an offset of each edge of the positioning frame from a designated center position, where the designated center position is predetermined according to a mapping relationship between hidden layer features corresponding to the positioning frame and an image area of the palm image, and the performing positioning frame overlapping filtering on the first positioning result according to the positioning frame position information to obtain a second positioning result includes:
calculating the position coordinates of a positioning frame in the first positioning result according to the specified center position and the offset;
And carrying out superposition filtering on the positioning frames according to the position coordinates of the positioning frames to obtain a second positioning result.
Optionally, the keypoint information includes offset probability distribution information of the keypoint with respect to the designated center position, and the third positioning result filtering module 740 includes:
For each key point in each second positioning result, calculating coordinates of the key point in the palm image according to the designated center position and the offset probability distribution information;
Performing quality evaluation on coordinates of the key points in the palm image to obtain a key point information evaluation result;
And determining a detection result of the key points in the palm image according to the key point information evaluation result.
Optionally, the performing quality evaluation on coordinates of the key points in the palm image to obtain a key point information evaluation result includes:
According to the position relation of the key points in the palm image, respectively carrying out quality evaluation on the key points in each second positioning result to obtain a key point information evaluation result and/or,
And respectively carrying out quality evaluation on the key points in each second positioning result according to the relative positions of the key points and the positioning frames corresponding to the key points, and obtaining a key point information evaluation result.
Optionally, the quality evaluation is performed on the key points in each second positioning result according to the relative positions of the key points and the positioning frames corresponding to the key points, so as to obtain a key point information evaluation result, which includes:
and performing quality evaluation on the key points in the second positioning result according to the intersection-over-union of the areas of the first image area surrounded by the key points in the second positioning result and the second image area surrounded by the positioning frame corresponding to the key points, so as to obtain a key point information evaluation result.
The palm and key point detection model is used for detecting the palm and key points of an input palm image and outputting, for the detected palm, a predicted value of the positioning frame confidence score, a predicted value of the positioning frame position information and a predicted value of the key point information. The loss function of the palm and key point detection model is configured to calculate the loss value of the palm and key point detection model according to the positioning frame confidence prediction loss, the positioning frame position prediction loss and the key point position prediction loss, wherein the positioning frame confidence prediction loss is calculated by cross entropy on the product of the predicted positioning frame confidence score and the IoU score, the positioning frame position prediction loss is calculated by a regression loss function for target detection, and the key point position prediction loss is obtained by weighted summation of the expectation loss of the key point offsets and the offset probability distribution loss.
The palm information detection device disclosed in the embodiment of the present application is used to implement the palm information detection method described in the embodiment of the present application, and specific implementation manners of each module of the device are not repeated, and reference may be made to specific implementation manners of corresponding steps in the method embodiment.
The palm information detection device disclosed by the embodiment of the application inputs a palm image into a pre-trained palm and key point detection model, detects the palm in the palm image through the palm and key point detection model, and obtains a candidate positioning result of the detected palm, the candidate positioning result comprising a positioning frame confidence score, positioning frame position information and key point information; it screens the candidate positioning result according to the positioning frame confidence score to obtain a first positioning result; performs positioning frame superposition filtering on the first positioning result according to the positioning frame position information to obtain a second positioning result; and finally performs reliability screening on the key point information in the second positioning result to determine the detection result of the key points in the palm image. In this way, a deep neural network is adopted and the palm key points are detected based on learning of positioning frame information and key point information, so the device has stronger robustness to palm postures. Moreover, only one neural network model needs to be trained and used, so fewer computing and storage resources are occupied and the palm and key point detection efficiency is higher.
In this specification, each embodiment is described in a progressive manner, and each embodiment is mainly described by differences from other embodiments, and identical and similar parts between the embodiments are all enough to be referred to each other. For the device embodiments, since they are substantially similar to the method embodiments, the description is relatively simple, and reference is made to the description of the method embodiments for relevant points.
The foregoing describes the method and apparatus for palm information detection provided by the present application in detail, and specific examples are used herein to describe the principles and embodiments of the present application, and the description of the above examples is only for helping to understand the method and a core idea of the present application, and meanwhile, for those skilled in the art, according to the idea of the present application, there are changes in the specific embodiments and application ranges, so the disclosure should not be construed as limiting the present application.
The apparatus embodiments described above are merely illustrative; the units described as separate components may or may not be physically separate, and the components shown as units may or may not be physical units, i.e., they may be located in one place or distributed over a plurality of network units. Some or all of the modules may be selected according to actual needs to achieve the purpose of the solution of this embodiment. Those of ordinary skill in the art can understand and implement this without undue burden.
Various component embodiments of the application may be implemented in hardware, or in software modules running on one or more processors, or in a combination thereof. Those skilled in the art will appreciate that some or all of the functions of some or all of the components in an electronic device according to embodiments of the present application may be implemented in practice using a microprocessor or Digital Signal Processor (DSP). The present application can also be implemented as an apparatus or device program (e.g., a computer program and a computer program product) for performing a portion or all of the methods described herein. Such a program embodying the present application may be stored on a computer readable medium, or may have the form of one or more signals. Such signals may be downloaded from an internet website, provided on a carrier signal, or provided in any other form.
For example, fig. 8 shows an electronic device in which the method according to the application may be implemented. The electronic device may be a PC, a mobile terminal, a personal digital assistant, a tablet computer, etc. The electronic device conventionally comprises a processor 810 and a memory 820 and a program code 830 stored on said memory 820 and executable on the processor 810, said processor 810 implementing the method described in the above embodiments when said program code 830 is executed. The memory 820 may be a computer program product or a computer readable medium. The memory 820 may be an electronic memory such as a flash memory, an EEPROM (electrically erasable programmable read only memory), an EPROM, a hard disk, or a ROM. The memory 820 has a storage space 8201 for program code 830 of a computer program for performing any of the method steps described above. For example, the memory space 8201 for the program code 830 may include individual computer programs that are each used to implement various steps in the above methods. The program code 830 is computer readable code. These computer programs may be read from or written to one or more computer program products. These computer program products comprise a program code carrier such as a hard disk, a Compact Disc (CD), a memory card or a floppy disk. The computer program comprises computer readable code which, when run on an electronic device, causes the electronic device to perform a method according to the above-described embodiments.
The embodiment of the application also discloses a computer readable storage medium, on which a computer program is stored, which when being executed by a processor, realizes the steps of the palm information detection method according to the embodiment of the application.
Such a computer program product may be a computer readable storage medium, which may have memory segments, memory spaces, etc. arranged similarly to the memory 820 in the electronic device shown in fig. 8. The program code may be stored in the computer readable storage medium, for example, in a suitable form. The computer readable storage medium is typically a portable or fixed storage unit as described with reference to fig. 9. In general, the memory unit comprises computer readable code 830', which computer readable code 830' is code that is read by a processor, which code, when executed by the processor, implements the steps of the method described above.
Reference herein to "one embodiment," "an embodiment," or "one or more embodiments" means that a particular feature, structure, or characteristic described in connection with the embodiment is included in at least one embodiment of the application. Furthermore, it is noted that instances of the phrase "in one embodiment" herein do not necessarily all refer to the same embodiment.
In the description provided herein, numerous specific details are set forth. However, it is understood that embodiments of the application may be practiced without these specific details. In some instances, well-known methods, structures and techniques have not been shown in detail in order not to obscure an understanding of this description.
In the claims, any reference signs placed between parentheses shall not be construed as limiting the claim. The word "comprising" does not exclude the presence of elements or steps not listed in a claim. The word "a" or "an" preceding an element does not exclude the presence of a plurality of such elements. The application may be implemented by means of hardware comprising several distinct elements, and by means of a suitably programmed computer. In the unit claims enumerating several means, several of these means may be embodied by one and the same item of hardware. The use of the words first, second, third, etc. do not denote any order. These words may be interpreted as names.
It should be noted that the above-mentioned embodiments are merely for illustrating the technical solution of the present application, and not for limiting the same, and although the present application has been described in detail with reference to the above-mentioned embodiments, it should be understood by those skilled in the art that the technical solution described in the above-mentioned embodiments may be modified or some technical features may be equivalently replaced, and these modifications or substitutions do not make the essence of the corresponding technical solution deviate from the spirit and scope of the technical solution of the embodiments of the present application.