CN113892113A - Human body posture estimation method and device

Info

Publication number: CN113892113A
Application number: CN201980096776.9A
Authority: CN
Inventors: 谭文伟
Original assignee: Huawei Technologies Co Ltd
Current assignee: Huawei Technologies Co Ltd
Priority date: 2019-07-18
Filing date: 2019-07-18
Publication date: 2022-01-04
Also published as: WO2021007859A1

Abstract

A human body posture estimation method and device are provided. The method comprises: inputting an image to be processed into a neural network, and detecting K types of key points in the image in parallel to obtain a detection heat map and a label pool for each type of key point (S301); acquiring the peak values in the detection heat map of each type of key point (S302); acquiring, from the label pool of each type of key point, the label value of the key point corresponding to each peak value in the detection heat map of that type (S303); and clustering key points with similar label values as connected key points of the same human body (S304). The method and device can improve the efficiency and accuracy of multi-person human body posture estimation.

Description

Human body posture estimation method and device

Technical Field

The embodiment of the application relates to the field of image processing, in particular to a human body posture estimation method and device.

Background

Human body posture estimation is receiving increasing attention because of its application value and theoretical significance. Single-person human body posture estimation has already reached high precision, and industry research now focuses on multi-person human body posture estimation.

One multi-person human body posture estimation method is the top-down method, which mainly comprises the following steps: a human body detector detects each human body, determines its position, and frames it; after a human body is determined, the key points of that single body are predicted independently, and posture prediction is finally achieved. When a human body is occluded, or the background is complex and easily confused with the body, the detection result of the top-down method is often unreliable. The top-down method also performs poorly on complex poses. In addition, its time cost grows in proportion to the number of people.

The other multi-person human body posture estimation method is the bottom-up method, which mainly comprises key point detection and clustering-based grouping. All key points of all categories in the picture are detected first, and the key points are then clustered: the different key points belonging to each person are connected together, so that the clustering yields the different individuals. The bottom-up method is less accurate than the top-down method, but has a great advantage in time efficiency.

The many existing multi-person human body posture estimation methods each have advantages and disadvantages, and how to balance efficiency and accuracy is an urgent problem to be solved in multi-person human body posture estimation.

Disclosure of Invention

The embodiment of the application provides a human body posture estimation method and device, which are used for improving the efficiency and accuracy of multi-person human body posture estimation.

In order to achieve the above purpose, the embodiment of the present application adopts the following technical solutions:

In a first aspect, a human body posture estimation method is provided. The method may include: inputting an image to be processed into a neural network, and detecting K types of key points in the image to be processed in parallel to obtain a detection heat map and a label pool for each type of key point, where the detection heat map of a type of key point represents the possibility that key points of that type appear at different positions in the image to be processed, the label pool of a type of key point comprises a label value for each key point of that type in the image to be processed, and the label value of a key point indicates the human body group to which the key point belongs; acquiring the peak values in the detection heat map of each type of key point; acquiring, from the label pool of each type of key point, the label value of the key point corresponding to each peak value in the detection heat map of that type; and clustering key points with similar label values as connected key points of the same human body.

With this human body posture estimation method, in multi-person human body posture estimation, parallel detection reduces the amount of computation of the model and increases its speed; assigning a label value to each key point as prior knowledge for grouping improves the accuracy of clustering-based grouping, and because the grouping process can be carried out in parallel, grouping efficiency is also improved. The efficiency and accuracy of multi-person human body posture estimation are therefore improved.

With reference to the first aspect, in one possible implementation, the neural network is configured with a grouping loss function used to assign the label values of the key points. In this way, a label value is assigned to each key point as prior knowledge for grouping, which improves both the accuracy and the efficiency of clustering-based grouping.

With reference to the first aspect or any one of the foregoing possible implementations, in another possible implementation, a specific implementation of the grouping loss function is provided, in which N is the number of human bodies in the image to be processed; $w(y)$ represents the label value of the key point at coordinate $y$; and $\bar{w}_n$ represents the mean of the label values at all real key point positions of the nth person.

With reference to the first aspect or any one of the foregoing possible implementations, in another possible implementation, the neural network is configured with a detection loss function, which is used to calculate the mean square error between the predicted detection heat map and the real key point heat map so that the detection heat map is output.

With reference to the first aspect or any one of the foregoing possible implementations, in another possible implementation, the label values of the key points are assigned according to a spatial constraint relationship, and key points that belong to the same human body under the spatial constraint relationship are assigned similar label values.

With reference to the first aspect or any one of the foregoing possible implementations, in another possible implementation, the detection heat map of a type of key point representing the possibility that key points of that type appear at different positions in the image to be processed includes: the detection heat map represents the Gaussian distribution of key points of that type over different positions in the image to be processed; or the detection heat map represents the probability that key points of that type appear at different positions in the image to be processed. For example, the detection heat map may be a confidence map.

In a second aspect, a human body posture estimation device is provided, which may include a detection unit, an acquisition unit, and a clustering unit. The detection unit is used for inputting the image to be processed into the neural network and detecting K types of key points in the image to be processed in parallel to obtain a detection heat map and a label pool for each type of key point, where the detection heat map of a type of key point represents the possibility that key points of that type appear at different positions in the image to be processed, the label pool of a type of key point comprises a label value for each key point of that type in the image to be processed, and the label value of a key point indicates the human body group to which the key point belongs. The acquisition unit is used for acquiring the peak values in the detection heat map of each type of key point obtained by the detection unit, and is further used for acquiring, from the label pool of each type of key point obtained by the detection unit, the label value of the key point corresponding to each peak value. The clustering unit is used for clustering the key points with similar label values acquired by the acquisition unit as connected key points of the same human body.

With this human body posture estimation device, in multi-person human body posture estimation, parallel detection reduces the amount of computation of the model and increases its speed; assigning a label value to each key point as prior knowledge for grouping improves the accuracy of clustering-based grouping, and because the grouping process can be carried out in parallel, grouping efficiency is also improved. The efficiency and accuracy of multi-person human body posture estimation are therefore improved.

It should be noted that the human body posture estimation device provided in the second aspect is used to perform the human body posture estimation method provided in the first aspect; for its specific implementation, refer to the specific implementation of the first aspect.

In a third aspect, an embodiment of the present application provides a human body posture estimation apparatus, where the apparatus includes a processor, and is configured to implement the human body posture estimation method described in the first aspect. The apparatus may further include a memory coupled to the processor, and the processor may implement the human body posture estimation method described in the first aspect above when executing the instructions stored in the memory. The apparatus may also include a communication interface for the apparatus to communicate with other devices, which may be, for example, a transceiver, circuit, bus, module, or other type of communication interface.

In the present application, the instructions in the memory may be stored in advance, or may be downloaded from the internet and stored when the apparatus is used. The coupling in the embodiments of the present application is an indirect coupling or connection between devices, units or modules, which may be in an electrical, mechanical or other form, and is used for information interaction between the devices, units or modules.

In a fourth aspect, an embodiment of the present application further provides a computer-readable storage medium, which includes instructions that, when executed on a computer, cause the computer to perform the human body posture estimation method described in any one of the above aspects or any one of the possible implementations.

In a fifth aspect, an embodiment of the present application further provides a computer program product, which when run on a computer, causes the computer to execute the human body posture estimation method according to any one of the above aspects or any one of the possible implementations.

In a sixth aspect, an embodiment of the present application provides a chip system, where the chip system includes a processor and may further include a memory, and is configured to implement the functions in the foregoing method. The chip system may be formed by a chip, and may also include a chip and other discrete devices.

The solutions provided in the third aspect to the sixth aspect are used for implementing the human body posture estimation method provided in the first aspect, and therefore the same beneficial effects as those of the first aspect can be achieved, and are not described herein again.

Drawings

Fig. 1 is a schematic view of an application scenario provided in an embodiment of the present application;

Fig. 2 is a schematic structural diagram of a human body posture estimation device according to an embodiment of the present application;

Fig. 3 is a schematic flow chart of a human body posture estimation method according to an embodiment of the present application;

Fig. 4 is a schematic view of another application scenario provided in an embodiment of the present application;

Fig. 4a is a schematic diagram of a detection heat map provided in an embodiment of the present application;

Fig. 5 is a comparison diagram before and after image processing according to an embodiment of the present application;

Fig. 6 is a schematic diagram of label value clustering provided in an embodiment of the present application;

Fig. 7 is a comparison diagram, provided in an embodiment of the present application, of human body posture estimation performed by the OpenPose algorithm and by the algorithm of the present application;

Fig. 8 is another comparison diagram, provided in an embodiment of the present application, of human body posture estimation performed by the OpenPose algorithm and by the algorithm of the present application;

Fig. 9 is yet another comparison diagram, provided in an embodiment of the present application, of human body posture estimation performed by the OpenPose algorithm and by the algorithm of the present application;

Fig. 10 is a schematic structural diagram of another human body posture estimation device provided in an embodiment of the present application;

Fig. 11 is a schematic structural diagram of yet another human body posture estimation device according to an embodiment of the present application.

Detailed Description

In the embodiments of the present application, to describe the technical solutions clearly, terms such as "first" and "second" are used to distinguish identical or similar items that have substantially the same functions and actions. Those skilled in the art will appreciate that the terms "first", "second", etc. do not denote any quantity, order, or importance. The technical features described by "first" and "second" have no sequential order or order of magnitude.

In the embodiments of the present application, words such as "exemplary" or "for example" are used to mean serving as an example, instance, or illustration. Any embodiment or design described herein as "exemplary" or "e.g.," is not necessarily to be construed as preferred or advantageous over other embodiments or designs. Rather, use of the word "exemplary" or "such as" is intended to present relevant concepts in a concrete fashion for ease of understanding.

In the description of the present application, "/" indicates an "or" relationship between the associated objects; for example, A/B may indicate A or B. "And/or" merely describes an association relationship between associated objects and means that three relationships may exist; for example, A and/or B may mean: A exists alone, A and B exist simultaneously, or B exists alone, where A and B may each be singular or plural. Also, in the description of the present application, "a plurality" means two or more unless otherwise specified. "At least one of the following" or similar expressions refers to any combination of the listed items, including any combination of single or plural items. For example, at least one of a, b, or c may represent: a, b, c, a-b, a-c, b-c, or a-b-c, where a, b, and c may each be single or multiple.

In the embodiments of the present application, at least one may also be described as one or more, and a plurality may be two, three, four or more, which is not limited in the present application.

In addition, the network architecture and the scenario described in the embodiment of the present application are for more clearly illustrating the technical solution of the embodiment of the present application, and do not constitute a limitation to the technical solution provided in the embodiment of the present application, and it is known by a person of ordinary skill in the art that the technical solution provided in the embodiment of the present application is also applicable to similar technical problems along with the evolution of the network architecture and the appearance of new service scenarios.

The method provided by the embodiment of the present application may be used in a neural network for estimating human body posture, where the neural network may be a stacked hourglass network or another structure, which is not specifically limited in the embodiment of the present application. Fig. 1 illustrates an application scenario of the present application: as shown in fig. 1, an input image is input to a neural network, and the neural network performs human body posture estimation to obtain an output image.

It should be noted that the neural network is illustrated as a stacked hourglass network in fig. 1, but is not specifically limited thereto.

At present, although single-person human body posture estimation by neural networks has already reached high precision, multi-person human body posture estimation methods still have room for improvement.

Based on this, the application provides a human body posture estimation method whose basic principle is as follows: in bottom-up multi-person human body posture estimation, key points are detected in parallel, and a detection heat map and a label pool of each type of key point are obtained while the key points are detected. Parallel detection reduces the amount of computation of the model and increases its speed; assigning a label value to each key point as prior knowledge for grouping improves the accuracy of clustering-based grouping, and because the grouping process can be carried out in parallel, grouping efficiency is also improved. The efficiency and accuracy of multi-person human body posture estimation are therefore improved.

Embodiments of the present application will be described in detail below with reference to the accompanying drawings.

Fig. 2 is a schematic composition diagram of a body posture estimation device 20 according to an embodiment of the present disclosure, and as shown in fig. 2, the body posture estimation device 20 may include at least one processor 21, a memory 22, a communication interface 23, and a communication bus 24. The following specifically describes each constituent component of the human body posture estimation device 20 with reference to fig. 2:

The processor 21 may be a single processor or a general term for a plurality of processing elements. For example, the processor 21 may be a central processing unit (CPU), an application-specific integrated circuit (ASIC), or one or more integrated circuits configured to implement the embodiments of the present application, such as one or more digital signal processors (DSPs) or one or more field-programmable gate arrays (FPGAs).

The processor 21 may perform the various functions of the human body posture estimation device 20 by running or executing software programs stored in the memory 22 and calling data stored in the memory 22. In a specific implementation, the processor 21 may include one or more CPUs, such as CPU0 and CPU1 shown in fig. 2, as one example.

In a specific implementation, the human body posture estimation device 20 may include a plurality of processors, such as the processor 21 and the processor 25 shown in fig. 2, as one example. Each of these processors may be a single-core processor (single-CPU) or a multi-core processor (multi-CPU). A processor herein may refer to one or more devices, circuits, and/or processing cores for processing data (e.g., computer program instructions).

The memory 22 may be, but is not limited to, a read-only memory (ROM) or other type of static storage device that can store static information and instructions, a Random Access Memory (RAM) or other type of dynamic storage device that can store information and instructions, an electrically erasable programmable read-only memory (EEPROM), a compact disk read-only memory (CD-ROM) or other optical disk storage, optical disk storage (including compact disk, laser disk, optical disk, digital versatile disk, blu-ray disk, etc.), magnetic disk storage media or other magnetic storage devices, or any other medium that can be used to carry or store desired program code in the form of instructions or data structures and that can be accessed by a computer. The memory 22 may be self-contained and coupled to the processor 21 via a communication bus 24. The memory 22 may also be integrated with the processor 21. The memory 22 is used for storing software programs for executing the scheme of the application, and is controlled by the processor 21 to execute.

The communication interface 23 uses any transceiver-type device to communicate with other devices or communication networks, such as Ethernet, a radio access network (RAN), or a wireless local area network (WLAN). The communication interface 23 may include a receiving unit and a transmitting unit.

The communication bus 24 may be an Industry Standard Architecture (ISA) bus, a Peripheral Component Interconnect (PCI) bus, an Extended ISA (EISA) bus, or the like. The bus may be divided into an address bus, a data bus, a control bus, etc. For ease of illustration, only one thick line is shown in FIG. 2, but it is not intended that there be only one bus or one type of bus.

It is noted that the components shown in fig. 2 do not constitute a limitation on the human body posture estimation device 20, which may comprise more or fewer components than shown, combine some components, or use a different arrangement of components from that shown in fig. 2.

Specifically, the processor 21 executes the following functions by running or executing software programs and/or modules stored in the memory 22 and calling data stored in the memory 22:

inputting an image to be processed into a neural network, and detecting K types of key points in the image to be processed in parallel to obtain a detection heat map and a label pool for each type of key point, where the detection heat map of a type of key point represents the possibility that key points of that type appear at different positions in the image to be processed, the label pool of a type of key point comprises a label value for each key point of that type in the image to be processed, and the label value of a key point indicates the human body group to which the key point belongs; acquiring the peak values in the detection heat map of each type of key point; acquiring, from the label pool of each type of key point, the label value of the key point corresponding to each peak value in the detection heat map of that type; and clustering key points with similar label values as connected key points of the same human body.

In one aspect, an embodiment of the present application provides a human body posture estimation method. As shown in fig. 3, the method may include:

S301, inputting the image to be processed into a neural network, and detecting K types of key points in the image to be processed in parallel to obtain a detection heat map and a label pool for each type of key point.

Specifically, S301 may be performed on a pixel-by-pixel basis in the image to be processed, or S301 may be performed on a region-by-region basis by cutting the image to be processed into a plurality of regions, which is not specifically limited in the embodiment of the present application. Parallel, as used herein, refers to non-serial.

The key points may be joint points on the human body or others, and this is not particularly limited in the embodiments of the present application. The number K of the categories of the key points may be configured according to actual requirements, and the embodiment of the present application is not specifically limited.

For example, 17 key points are defined in the COCO dataset, which are: 0-nose, 1-left eye, 2-right eye, 3-left ear, 4-right ear, 5-left shoulder, 6-right shoulder, 7-left elbow, 8-right elbow, 9-left wrist, 10-right wrist, 11-left hip, 12-right hip, 13-left knee, 14-right knee, 15-left ankle, 16-right ankle.
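The category list above maps directly onto a small lookup structure. The sketch below is illustrative only; the names `COCO_KEYPOINTS`, `K`, and `KEYPOINT_INDEX` are hypothetical and are not taken from the application.

```python
# COCO keypoint categories in dataset order, as listed in the text.
# The identifiers below are illustrative names, not part of the application.
COCO_KEYPOINTS = [
    "nose", "left_eye", "right_eye", "left_ear", "right_ear",
    "left_shoulder", "right_shoulder", "left_elbow", "right_elbow",
    "left_wrist", "right_wrist", "left_hip", "right_hip",
    "left_knee", "right_knee", "left_ankle", "right_ankle",
]

K = len(COCO_KEYPOINTS)  # number of keypoint categories

# Reverse lookup from category name to channel index.
KEYPOINT_INDEX = {name: i for i, name in enumerate(COCO_KEYPOINTS)}
```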

The detection heat map of one type of key points represents the possibility that the key points appear at different positions in the image to be processed; the label pool of the key points comprises the label value of each key point in the image to be processed, and the label value of one key point is used for indicating the human body group to which the key point belongs.

In one possible implementation, the detection heat map of a type of key point representing the possibility that key points of that type appear at different positions in the image to be processed may be specifically implemented as follows: the detection heat map of a type of key point represents the Gaussian distribution of key points of that type over different positions in the image to be processed. In multi-person human body posture estimation, a heat map contains multiple peaks because multiple human bodies are present in one image. For example, the detection heat map of a left-side key point category contains multiple peaks for that key point.
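A Gaussian heat map with one peak per person can be sketched as follows. This is a minimal illustration under assumptions: the function name `gaussian_heatmap`, the `sigma` value, and the use of an element-wise maximum to merge several people are all choices made here, not details stated in the application.

```python
import numpy as np

def gaussian_heatmap(height, width, centers, sigma=2.0):
    """Render one keypoint category as 2-D Gaussians, one per person.

    `centers` is a list of (row, col) keypoint positions. Overlapping
    Gaussians are merged with an element-wise maximum (an assumption).
    """
    ys, xs = np.mgrid[0:height, 0:width]
    heatmap = np.zeros((height, width), dtype=np.float64)
    for cy, cx in centers:
        g = np.exp(-((ys - cy) ** 2 + (xs - cx) ** 2) / (2.0 * sigma ** 2))
        heatmap = np.maximum(heatmap, g)
    return heatmap

# Two people in the image give two peaks in the same category's heat map.
example_centers = [(10, 12), (40, 50)]  # hypothetical positions
hm = gaussian_heatmap(64, 64, example_centers)
```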

In another possible implementation, the detection heatmap of a class of key points represents the possibility of the class of key points appearing at different positions in the image to be processed, which may be specifically implemented as follows: the detection heatmap of a class of keypoints represents the probability of the keypoints appearing at different positions in the image to be processed. For example, the detection heat map may be a two-dimensional confidence map with the probabilities of keypoints represented by shades of color.
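Acquiring the peaks of a confidence map can be illustrated with a simple local-maximum search. The sketch below assumes a 3x3 neighbourhood and a fixed confidence threshold; both, along with the name `find_peaks`, are assumptions for illustration, since the application does not specify how peaks are extracted.

```python
import numpy as np

def find_peaks(heatmap, threshold=0.5):
    """Return (row, col) positions that exceed `threshold` and are the
    maximum of their 3x3 neighbourhood."""
    heatmap = np.asarray(heatmap, dtype=np.float64)
    h, w = heatmap.shape
    # Pad with -inf so border pixels can still be local maxima.
    padded = np.pad(heatmap, 1, mode="constant", constant_values=-np.inf)
    # Stack the 9 shifted views and take the per-pixel neighbourhood maximum.
    neighbourhood = np.stack([
        padded[dy:dy + h, dx:dx + w]
        for dy in range(3) for dx in range(3)
    ])
    is_peak = (heatmap >= neighbourhood.max(axis=0)) & (heatmap > threshold)
    return [tuple(int(v) for v in p) for p in np.argwhere(is_peak)]

heat = np.zeros((8, 8))
heat[2, 3] = 0.9   # peak of person A
heat[2, 4] = 0.6   # shoulder of the same blob, suppressed by its neighbour
heat[5, 6] = 0.8   # peak of person B
peaks = find_peaks(heat)
```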

In one specific implementation, the neural network may be configured with a detection loss function, which is used to calculate the mean square error between the predicted detection heat map and the real key point heat map so that the detection heat map is output. The content of the detection loss function may be configured according to actual requirements, which is not specifically limited in the embodiment of the present application.

For example, the detection loss function takes the form used in the paper "Stacked Hourglass Networks for Human Pose Estimation", which is not described in detail herein.
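The mean square error between predicted and real heat maps can be sketched as below. The function name `detection_loss` and the averaging over every channel and pixel are assumptions; the application leaves the exact form configurable.

```python
import numpy as np

def detection_loss(predicted, target):
    """Mean squared error between predicted and ground-truth heat maps.

    Both inputs are arrays of shape (K, H, W); the error is averaged over
    all channels and pixels (an assumed normalisation).
    """
    predicted = np.asarray(predicted, dtype=np.float64)
    target = np.asarray(target, dtype=np.float64)
    if predicted.shape != target.shape:
        raise ValueError("heat maps must have the same shape")
    return float(np.mean((predicted - target) ** 2))

target_maps = np.zeros((2, 4, 4))
predicted_maps = np.full((2, 4, 4), 0.5)
loss_value = detection_loss(predicted_maps, target_maps)
```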

Specifically, the label values of the key points may be assigned according to a spatial constraint relationship, and key points that belong to the same human body under the spatial constraint relationship are assigned similar label values.

The spatial constraint relationship can be used to determine which regions in the image belong to the same human body and which regions belong to different human bodies. The specific content of the spatial constraint relationship is not limited and can be configured according to actual requirements.

For example, in one possible implementation, the spatial constraint relationship may be: pixels of the same human body lie in adjacent regions of the image and are close to each other, whereas different human bodies are far apart in image space. Therefore, to assign the joint points to the individuals they belong to, the individual regions in the image are determined through non-maximum suppression, which in turn yields which regions of the image belong to the same human body and which belong to different human bodies.

Specifically, different label value intervals can be configured in advance for different human bodies, and label values are obtained from the human bodies determined according to the spatial constraint relationship and the label value intervals configured for those human bodies.

Label values are similar when the absolute value of their difference is smaller than a preset threshold. The specific value of the preset threshold may be configured according to actual requirements, which is not specifically limited in the embodiment of the present application.
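The threshold test on label values can be turned into a simple grouping routine. The sketch below uses a greedy assignment against each person's running mean label value; the greedy strategy, the name `group_by_tag`, and the example numbers are assumptions made for illustration, not the clustering procedure specified by the application.

```python
def group_by_tag(detections, threshold=1.0):
    """Group detected keypoints into persons by label-value similarity.

    detections: list of (keypoint_class, (row, col), tag_value).
    Returns a list of persons, each a dict {keypoint_class: (row, col)}.
    A detection joins the person whose mean label value is closest, provided
    the absolute difference is below `threshold` and that person does not
    already have a keypoint of this class; otherwise it starts a new person.
    """
    persons = []  # keypoints collected per person
    tags = []     # label values collected per person
    for cls, pos, tag in detections:
        best, best_dist = None, threshold
        for i, person_tags in enumerate(tags):
            mean_tag = sum(person_tags) / len(person_tags)
            dist = abs(tag - mean_tag)
            if dist < best_dist and cls not in persons[i]:
                best, best_dist = i, dist
        if best is None:
            persons.append({cls: pos})
            tags.append([tag])
        else:
            persons[best][cls] = pos
            tags[best].append(tag)
    return persons

# Hypothetical detections: class 0 and class 5 for two people.
example = [
    (0, (10, 10), 1.02), (0, (10, 40), 5.01),
    (5, (20, 8), 0.97), (5, (20, 42), 4.95),
]
people = group_by_tag(example)
```

Because each detection is compared only against per-person summary values, the comparisons for different detections are independent of image size, which is one reason label-based grouping is cheap compared with matching limbs pairwise.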

Specifically, similar label values may be randomly generated for different key points located in the same person in the spatial constraint relationship according to the spatial constraint relationship, or similar label values may be generated for different key points located in the same person in the spatial constraint relationship according to a preset algorithm, which is not specifically limited in this embodiment of the present application. It should be noted that the content of the preset algorithm can be configured according to actual requirements.

For example, in the estimation of human body postures of multiple persons, a real number is generated every time a key point is detected, and the real number is used as a tag value to represent a group to which the detected pixel point belongs. Of course, the type of the tag value may be other than a real number, and this is not particularly limited in the embodiment of the present application.

In one specific implementation, the neural network may be configured with a grouping loss function that is used to assign the label values of the key points. The content of the grouping loss function may be configured according to actual requirements, which is not specifically limited in the embodiment of the present application.

Illustratively, the embodiment of the present application provides a specific implementation of the grouping loss function, given as formula (1).

Here, N is the number of human bodies in the image to be processed; $w \in \mathbb{R}^{W \times H}$ represents the label pool corresponding to the image to be processed; $w(y)$ represents the label value of the key point at coordinate $y$; and $\bar{w}_n$ represents the mean of the label values at all real key point positions of the nth person. For a given image, assume that there are N persons; the real position coordinates of the human body joint points of the N persons are $T = \{(y_{nk})\}$, $n = 1, \dots, N$, $k = 1, \dots, K$.

The design concept of the grouping loss function is briefly described below.

The grouping of key points is equivalent to a clustering problem. To obtain a better grouping result, the generated predicted label pool needs to make the joint points of the same person aggregate as closely as possible and the joint points of different persons separate as far as possible, so as to reduce the error of grouping key points with a clustering algorithm during testing. A loss is established for each of the two cases. For the same person, the k-means clustering approach is applied first: a reference value $\bar{w}_n$ is introduced to represent the average of the label values at all real joint point positions of the nth person, expressed mathematically as formula (2).

According to the definition of k-means clustering, a square loss function is established as formula (3).

For different persons, let the mean of the label values at all real joint point positions of the nth person be $\bar{w}_n$, let the mean of the label values at all real joint point positions of the n'th person be $\bar{w}_{n'}$, and let $\bar{w}_n - \bar{w}_{n'}$ denote the error between the former and the latter. To measure the distance between the two, a square loss function is likewise introduced, as shown in formula (4).

As described above, formula (3) should be as small as possible, while formula (4) should be as large as possible. To unify (3) and (4) into one loss function that is minimized, formula (4) must be converted into a quantity to be minimized; therefore, a negative exponential function is introduced and formula (4) is rewritten as formula (5).

The loss functions obtained in (3) and (5) are combined and expressed as formula (6).
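From the verbal description alone, formulas (2) to (6) can be sketched as follows. This is a reconstruction under assumptions, not the application's exact notation: the symbols $\bar{w}_n$, $L_{\text{same}}$, $d_{nn'}$ and $L_{\text{diff}}$ are chosen here for illustration, and any normalisation constants or scale factor inside the exponential are omitted because the text does not state them.

```latex
\begin{align}
\bar{w}_n &= \frac{1}{K}\sum_{k=1}^{K} w(y_{nk}) \tag{2}\\
L_{\text{same}} &= \sum_{n=1}^{N}\sum_{k=1}^{K}\bigl(w(y_{nk}) - \bar{w}_n\bigr)^{2} \tag{3}\\
d_{nn'} &= \bigl(\bar{w}_n - \bar{w}_{n'}\bigr)^{2} \tag{4}\\
L_{\text{diff}} &= \sum_{n=1}^{N}\sum_{n' \neq n}\exp\bigl(-d_{nn'}\bigr) \tag{5}\\
L_g &= L_{\text{same}} + L_{\text{diff}} \tag{6}
\end{align}
```

Minimising $L_{\text{same}}$ pulls the label values of one person toward that person's mean, while minimising $L_{\text{diff}}$ pushes the means of different persons apart, since $\exp(-d_{nn'})$ decreases as $d_{nn'}$ grows.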

The total loss is the weighted sum of the detection loss and the grouping loss: $\mathrm{Loss} = m L_g + n L_d$, where $L_g$ is the grouping loss, $L_d$ is the detection loss, and m and n are hyper-parameters.
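The weighted sum can be sketched as follows. The detection loss is written as a mean square error between the predicted and the real key point heat maps, as stated in the claims; the function names and the hyper-parameter values in the usage lines are hypothetical.

```python
import numpy as np

def detection_loss(pred_heatmaps, true_heatmaps):
    """Mean square error between predicted and real key point heat maps."""
    pred = np.asarray(pred_heatmaps, dtype=float)
    true = np.asarray(true_heatmaps, dtype=float)
    return float(np.mean((pred - true) ** 2))

def total_loss(l_g, l_d, m, n):
    """Weighted sum Loss = m * L_g + n * L_d (m, n are hyper-parameters)."""
    return m * l_g + n * l_d

# Illustrative values only; the source does not give values for m and n.
l_d = detection_loss([[0.0, 1.0]], [[0.0, 0.0]])
loss = total_loss(l_g=2.0, l_d=l_d, m=1.0, n=4.0)
```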

It should be noted that the above design concept of the grouping loss function is only an example, and any grouping loss function designed using this concept belongs to the grouping loss functions described in this application.

Fig. 4 is a scene schematic diagram of a human body posture estimation method described in the embodiment of the present application. As shown in fig. 4, the input image is input into a neural network (schematically illustrated as a cascaded hourglass network), and the process of S301 is executed to obtain the detection heat map and the tag pool shown in fig. 4. For example, the detection heat map obtained from the input image in fig. 4 is shown in fig. 4a, each of the small maps in fig. 4a is a detection heat map of a type of key points, and the highlight dots in each of the small maps indicate peak positions of the key points.

For example, the scheme may adopt a 4-level hourglass structure, in which the input size of the network is 256 × 256 and the output size is 128 × 128. If K human body key points need to be predicted, the number of output channels of the network is 2K: K channels are used for detection, and the other K channels are used for grouping. For the COCO dataset, since each image has 17 key point labels, the final network output has 34 channels, where the first 17 channels output the detection heat maps of the joint points and the last 17 channels output the grouping label information of the joint points.
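The channel layout described above can be sketched with a placeholder array. The function name `split_output` and the channel-first layout `(2K, H, W)` are assumptions; the sizes follow the COCO example in the text.

```python
import numpy as np

def split_output(network_output, num_keypoints):
    """Split a (2K, H, W) output into K detection heat maps and K label maps."""
    assert network_output.shape[0] == 2 * num_keypoints
    heatmaps = network_output[:num_keypoints]   # first K channels: detection
    tag_maps = network_output[num_keypoints:]   # last K channels: grouping labels
    return heatmaps, tag_maps

K = 17                                           # COCO key point types
output = np.zeros((2 * K, 128, 128), dtype=np.float32)  # placeholder network output
heatmaps, tag_maps = split_output(output, K)
```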

The following describes an implementation of the present application with reference to specific examples.

As shown in fig. 5, for the original image on the left, using the COCO dataset and the model of the present application yields a detection heat map and a label pool for each type of key point. The detection heat map may be a Gaussian distribution for each type of key point. The label values included in the 17 label pools are as follows (arranged in the order of the 17 key points in the COCO dataset):

[ -1.8509495,1.0292919,4.163754],[ -1.8537576,1.0384212,4.1641073],[ -1.8466513,1.0294132,4.1671414],[ -1.8384541,1.0231285,0],[ -1.8542455,0,4.150399],[ -1.8703872,1.0921177,4.120116],[ -1.8843985,1.0381733,4.1086555],[ -1.8894546,1.1279857,4.1338553],[ -1.8111589,1.057544,4.1566596],[ -1.9049348,0,4.1225743],[ -1.8559527,1.0369155,4.124661],[ -1.9162827,1.0587022,4.111921],[ -1.9004283,1.1545397,4.1475873],[ -1.9555404,0,4.185656],[ -1.8526844,0,4.170965],[0,0,0],[0,0,0]. Where 0 indicates that the keypoint information is not available.

S302, acquiring peak values in the detection heat map of each type of key points according to the detection heat map of each type of key points.

Specifically, the peaks in the detection heat map of each type of key point are the most likely positions of that type of key point.

For example, based on the example in S301, in the detection heatmap of nose key points, 3 peaks may be acquired.
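A minimal sketch of peak extraction for S302 is given below. It treats a pixel as a peak when it exceeds a threshold and is not smaller than any of its 8 neighbours; the function name `find_peaks`, the 3 × 3 neighbourhood, and the threshold value are assumptions, and the small heat map is synthetic.

```python
import numpy as np

def find_peaks(heatmap, threshold=0.1):
    """Return (row, col) coordinates of local maxima above the threshold."""
    heatmap = np.asarray(heatmap, dtype=float)
    h, w = heatmap.shape
    # Pad with -inf so that border pixels can be compared with "neighbours".
    padded = np.pad(heatmap, 1, mode="constant", constant_values=-np.inf)
    is_peak = heatmap > threshold
    for dy in (-1, 0, 1):
        for dx in (-1, 0, 1):
            if dy == 0 and dx == 0:
                continue
            neighbour = padded[1 + dy:1 + dy + h, 1 + dx:1 + dx + w]
            is_peak &= heatmap >= neighbour
    return np.argwhere(is_peak)

# Synthetic "nose" heat map with three peaks and one weaker shoulder pixel.
nose_heatmap = np.zeros((8, 8))
nose_heatmap[1, 1] = 0.9
nose_heatmap[1, 2] = 0.4   # adjacent to a stronger pixel, so not a peak
nose_heatmap[4, 5] = 0.8
nose_heatmap[6, 2] = 0.7
peaks = find_peaks(nose_heatmap)
```

Flat plateaus would yield several peaks under this rule; a production implementation would typically add non-maximum suppression or smoothing.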

S303, acquiring the label value of the key point corresponding to each peak value according to the label pool of each type of key point.

Specifically, in S303, the label value of the key point corresponding to each peak value is obtained according to the coordinate position in the image to be processed.

For example, based on the example in S301, each of the 3 peaks in the detection heat map of the nose key points acquired in S302 has its own coordinates. According to these coordinates, the label values at the corresponding positions are read from the label pool of the nose key points, giving [ -1.8509495,1.0292919,4.163754 ].
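The lookup in S303 amounts to indexing the label map at the peak coordinates. In the sketch below, the function name `tags_at_peaks`, the 8 × 8 map size, and the peak coordinates are hypothetical; only the three label values are taken from the example.

```python
import numpy as np

def tags_at_peaks(tag_map, peaks):
    """Read the label value stored at each (row, col) peak position."""
    peaks = np.asarray(peaks)
    return tag_map[peaks[:, 0], peaks[:, 1]]

# Hypothetical label map for the nose key points, filled only at the peaks.
nose_tag_map = np.zeros((8, 8))
nose_peaks = [(1, 1), (4, 5), (6, 2)]
for (row, col), value in zip(nose_peaks, [-1.8509495, 1.0292919, 4.163754]):
    nose_tag_map[row, col] = value

nose_tags = tags_at_peaks(nose_tag_map, nose_peaks)
```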

S304, clustering key points with similar label values as key points to be connected for the same human body.

For example, in S304, Lloyd's method for the k-means algorithm may be used to match similar key points. Other clustering methods may also be used, which is not specifically limited in this embodiment of the present application.

It should be noted that when Lloyd's method is used to match similar key points, the k value of the algorithm corresponds to the number of people in the picture, and may be determined as the number of peak points on the detection heat map that has the most peak points in S302. The distance in Lloyd's method may be the Euclidean distance; in actual operation, other distance measures may be selected according to the actual situation so as to obtain the best key point grouping.

For example, based on the example in S301, clustering the label values in S304 using the Lloyd method can obtain the grouped label values as follows:

[-1.8509495, -1.8537576, -1.8466513, -1.8384541, -1.8542455, -1.8703872, -1.8843985, -1.8894546, -1.8111589, -1.9049348, -1.8559527, -1.9162827, -1.9004283, -1.9555404, -1.8526844, 0, 0];

[1.0292919, 1.0384212, 1.0294132, 1.0231285, 0, 1.0921177, 1.0381733, 1.1279857, 1.057544, 0, 1.0369155, 1.0587022, 1.1545397, 0, 0, 0, 0];

[4.163754, 4.1641073, 4.1671414, 0, 4.150399, 4.120116, 4.1086555, 4.1338553, 4.1566596, 4.1225743, 4.124661, 4.111921, 4.1475873, 4.185656, 4.170965, 0, 0].

Fig. 6 illustrates the clustered label values; it can be seen that the label values are well separated.
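The grouping in this example can be reproduced with a small one-dimensional version of Lloyd's method. The sketch below uses the label values listed for the 17 label pools, drops the entries equal to 0 (unavailable key points), sets k = 3, and initializes the centers with the three nose label values; the function name `lloyd_1d` and the initialization choice are assumptions.

```python
import numpy as np

# Label values of the 17 label pools from the example (0 = not available).
TAGS = np.array([
    [-1.8509495, 1.0292919, 4.163754], [-1.8537576, 1.0384212, 4.1641073],
    [-1.8466513, 1.0294132, 4.1671414], [-1.8384541, 1.0231285, 0],
    [-1.8542455, 0, 4.150399], [-1.8703872, 1.0921177, 4.120116],
    [-1.8843985, 1.0381733, 4.1086555], [-1.8894546, 1.1279857, 4.1338553],
    [-1.8111589, 1.057544, 4.1566596], [-1.9049348, 0, 4.1225743],
    [-1.8559527, 1.0369155, 4.124661], [-1.9162827, 1.0587022, 4.111921],
    [-1.9004283, 1.1545397, 4.1475873], [-1.9555404, 0, 4.185656],
    [-1.8526844, 0, 4.170965], [0, 0, 0], [0, 0, 0],
])

def lloyd_1d(values, init_centers, max_iter=100):
    """Lloyd's method on scalar label values with Euclidean distance."""
    values = np.asarray(values, dtype=float)
    centers = np.asarray(init_centers, dtype=float).copy()
    for _ in range(max_iter):
        # Assignment step: nearest center for every label value.
        labels = np.argmin(np.abs(values[:, None] - centers[None, :]), axis=1)
        # Update step: each center moves to the mean of its members.
        new_centers = centers.copy()
        for j in range(len(centers)):
            if np.any(labels == j):
                new_centers[j] = values[labels == j].mean()
        if np.allclose(new_centers, centers):
            break
        centers = new_centers
    return labels, centers

values = TAGS[TAGS != 0]                       # keep available key points only
labels, centers = lloyd_1d(values, init_centers=TAGS[0])
group_sizes = np.bincount(labels).tolist()     # key points assigned to each person
```

The three groups contain 15, 11 and 14 key points, matching the non-zero entries of the three grouped lists in the example.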

Connecting the key points whose label values are divided into the same group yields the output image on the right of fig. 5, in which the human body postures in the original image are well identified.

Compared with the existing top-down method, the method greatly improves the running speed of the model while the accuracy remains comparable. Because serial key point detection is changed into parallel detection, the amount of calculation of the model can be reduced, the complexity of the model is greatly reduced, and the processing speed of the model is improved.

In the scenario shown in fig. 7, OpenPose (the left image) may miss key points and connect key points incorrectly, for example, the right knee is detected as the left knee and the left hip is connected to the right knee, whereas the method of the present application (the right image in fig. 7) maintains relatively stable performance in this scenario.

Tests were performed on the MS COCO test-dev dataset and compared with the OpenPose method in terms of accuracy; the comparison results are shown in Table 1.

TABLE 1

Method	AP	AP^50	AP^75	AP^M	AP^L	AR	AR^50	AR^75	AR^M	AR^L
OpenPose	0.611	0.844	0.667	0.558	0.684	0.665	0.872	0.718	0.602	0.749
This application	0.665	0.849	0.726	0.612	0.744	0.701	0.867	0.755	0.640	0.789

In addition, the time efficiency of the method was compared with that of OpenPose, a real-time human body posture estimation algorithm. The same group of pictures was used to test OpenPose and the method of the present application: OpenPose took 8.492 seconds on average to process one picture, whereas the method of the present application needed only 3.409 seconds, that is, it was about 2.5 times faster. Considering both accuracy and time efficiency, the overall performance of the method is considerably improved.

Figs. 7, 8 and 9 show the detection results of the method and of OpenPose; in each figure, the left image is the result of OpenPose and the right image is the result of the method.

In the embodiments provided in the present application, the method provided in the embodiments of the present application is introduced from the perspective of the working principle of the human body posture estimation device. In order to implement the functions in the method provided by the embodiment of the present application, the human body posture estimation device may include a hardware structure and/or a software module, and implement the functions in the form of a hardware structure, a software module, or a hardware structure and a software module. Whether any of the above-described functions is implemented as a hardware structure, a software module, or a hardware structure plus a software module depends upon the particular application and design constraints imposed on the technical solution.

The division of the modules in the embodiments of the present application is schematic, and only one logical function division is provided, and in actual implementation, there may be another division manner, and in addition, each functional module in each embodiment of the present application may be integrated in one processor, may also exist alone physically, or may also be integrated in one module by two or more modules. The integrated module can be realized in a hardware mode, and can also be realized in a software functional module mode.

In the case where each functional module is divided corresponding to each function, as shown in fig. 10, the human body posture estimation device 100 provided in the embodiment of the present application is used for implementing the functions in the above method. As shown in fig. 10, the human body posture estimation device 100 may include: a detection unit 1001, an acquisition unit 1002, and a clustering unit 1003. The detection unit 1001 is configured to perform S301 in fig. 3; the acquisition unit 1002 is configured to perform S302 and S303 in fig. 3; the clustering unit 1003 is configured to perform S304 in fig. 3. For all relevant contents of each step in the above method embodiment, reference may be made to the functional description of the corresponding functional module, and details are not described herein again.
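The division into three units can be pictured as a simple composition of callables. The class name `PoseEstimationDevice`, its field names, and the stand-in functions in the usage lines are hypothetical and only illustrate how the units hand results to one another.

```python
from dataclasses import dataclass
from typing import Any, Callable

@dataclass
class PoseEstimationDevice:
    detect: Callable[[Any], Any]    # detection unit 1001: performs S301
    acquire: Callable[[Any], Any]   # acquisition unit 1002: performs S302 and S303
    cluster: Callable[[Any], Any]   # clustering unit 1003: performs S304

    def estimate(self, image):
        network_output = self.detect(image)       # heat maps and label pools
        peak_tags = self.acquire(network_output)  # peaks and their label values
        return self.cluster(peak_tags)            # key points grouped per person

# Stand-in units that only demonstrate the order of the calls.
device = PoseEstimationDevice(
    detect=lambda x: x + 1,
    acquire=lambda x: x * 2,
    cluster=lambda x: x - 3,
)
result = device.estimate(1)
```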

In the case of adopting the integrated division of the functional modules, as shown in fig. 11, a human body posture estimation device 110 provided in the embodiment of the present application is used for implementing the functions in the above method. The human body posture estimation device 110 comprises at least one processing module 1101 for implementing the functions in the method provided by the embodiment of the present application. Illustratively, the processing module 1101 may be used to perform the processes S301 to S304 in fig. 3. For details, reference is made to the detailed description in the method example, which is not repeated herein.

The human body posture estimation device 110 may also include at least one storage module 1102 for storing program instructions and/or data. The storage module 1102 and the processing module 1101 are coupled. The coupling in the embodiments of the present application is an indirect coupling or a communication connection between devices, units or modules, and may be in an electrical, mechanical or other form for information interaction between the devices, units or modules. The processing module 1101 may cooperate with the storage module 1102. The processing module 1101 may execute program instructions stored in the storage module 1102. At least one of the at least one storage module may be included in the processing module.

The human body posture estimation device 110 may further include a communication module 1103 for communicating with other devices through a transmission medium, so that the human body posture estimation device 110 can communicate with other devices.

When the processing module 1101 is a processor, the storage module 1102 is a memory, and the communication module 1103 is a communication interface, the human body posture estimation device 110 in fig. 11 according to the embodiment of the present application may be the human body posture estimation device 20 shown in fig. 2.

As described above, the human body posture estimation device 100 or the human body posture estimation device 110 provided in the embodiments of the present application can be used to implement the functions in the methods of the embodiments of the present application. For convenience of description, only the portions relevant to the embodiments of the present application are shown; for specific technical details that are not disclosed, please refer to the method embodiments of the present application.

As another form of the present embodiment, there is provided a computer-readable storage medium having stored thereon instructions that, when executed, perform the method of the above-described method embodiments.

As another form of the present embodiment, there is provided a computer program product containing instructions that, when executed, perform the method of the above-described method embodiments.

The embodiment of the present invention further provides a chip system, which includes a processor and is used for implementing the technical method of the embodiment of the present invention. In one possible design, the system-on-chip further includes a memory for storing program instructions and/or data necessary for a communication device of an embodiment of the present invention. In one possible design, the system-on-chip further includes a memory for the processor to call application code stored in the memory. The chip system may be composed of one or more chips, and may also include a chip and other discrete devices, which is not specifically limited in this embodiment of the present application.

The steps of a method or algorithm described in connection with the disclosure herein may be embodied in hardware or in software instructions executed by a processor. The software instructions may consist of corresponding software modules that may be stored in RAM, flash memory, ROM, Erasable Programmable Read Only Memory (EPROM), Electrically Erasable Programmable Read Only Memory (EEPROM), registers, a hard disk, a removable hard disk, a compact disc read only memory (CD-ROM), or any other form of storage medium known in the art. An exemplary storage medium is coupled to the processor such that the processor can read information from, and write information to, the storage medium. Of course, the storage medium may also be integral to the processor. The processor and the storage medium may reside in an ASIC. Additionally, the ASIC may reside in a core network interface device. Of course, the processor and the storage medium may reside as discrete components in a core network interface device. Alternatively, the memory may be coupled to the processor, for example, the memory may be separate and coupled to the processor via a bus. The memory may also be integral to the processor. The memory can be used for storing application program code for executing the technical solution provided by the embodiments of the present application, and execution is controlled by the processor. The processor is used for executing the application program code stored in the memory, so as to implement the technical solution provided by the embodiments of the present application.

Through the above description of the embodiments, it is clear to those skilled in the art that, for convenience and simplicity of description, the foregoing division of the functional modules is merely used as an example, and in practical applications, the above function distribution may be completed by different functional modules according to needs, that is, the internal structure of the device may be divided into different functional modules to complete all or part of the above described functions.

In the several embodiments provided in the present application, it should be understood that the disclosed apparatus and method may be implemented in other ways. For example, the above-described device embodiments are merely illustrative, and for example, the division of the modules or units is only one logical functional division, and there may be other divisions when actually implemented, for example, a plurality of units or components may be combined or may be integrated into another device, or some features may be omitted, or not executed. In addition, the shown or discussed mutual coupling or direct coupling or communication connection may be an indirect coupling or communication connection through some interfaces, devices or units, and may be in an electrical, mechanical or other form.

The units described as separate parts may or may not be physically separate, and parts displayed as units may be one physical unit or a plurality of physical units, that is, may be located in one place, or may be distributed in a plurality of different places. Some or all of the units can be selected according to actual needs to achieve the purpose of the solution of the embodiment.

In addition, functional units in the embodiments of the present application may be integrated into one processing unit, or each unit may exist alone physically, or two or more units are integrated into one unit. The integrated unit can be realized in a form of hardware, and can also be realized in a form of a software functional unit.

The integrated unit, if implemented in the form of a software functional unit and sold or used as a stand-alone product, may be stored in a readable storage medium. Based on such understanding, the technical solutions of the embodiments of the present application, in essence, or the part thereof that contributes to the prior art, or all or part of the technical solutions, may be embodied in the form of a software product. The software product is stored in a storage medium and includes several instructions to enable a device (which may be a single-chip microcomputer, a chip, or the like) or a processor to execute all or part of the steps of the methods described in the embodiments of the present application. The aforementioned storage medium includes various media capable of storing program code, such as a USB flash drive, a removable hard disk, a ROM, a RAM, a magnetic disk, or an optical disk.

The above description is only an embodiment of the present application, but the scope of the present application is not limited thereto, and any changes or substitutions within the technical scope of the present disclosure should be covered by the scope of the present application. Therefore, the protection scope of the present application shall be subject to the protection scope of the claims.

Claims

1. A human body posture estimation method, characterized by comprising the following steps:

inputting an image to be processed into a neural network, and detecting K key points in the image to be processed in parallel to obtain a detection heat map and a label pool of each key point; the detection heat map of one type of key points represents the possibility that the key points appear at different positions in the image to be processed; the label pool of the key points comprises a label value of each key point in the image to be processed, and the label value of one key point is used for indicating the human body group to which the key point belongs;

acquiring a peak value in the detection heat map of each type of key points according to the detection heat map of each type of key points;

obtaining a label value of a key point corresponding to each peak value according to the label pool of each type of key point;

and clustering key points with similar label values to serve as the same human body connection key point.
2. The method of claim 1, wherein the neural network is configured with a grouping loss function that assigns a label value to each key point.
3. The method of claim 2, wherein

said grouping loss function
wherein N is the number of human bodies in the image to be processed; w(y) represents the label value of the key point at coordinate y;
the mean of the label values representing all real key point locations of the nth person.
4. The method of any one of claims 1-3, wherein the neural network is configured with a detection loss function for computing a mean square error of the predicted detection heat map and the real key point heat map to output the detection heat map.
5. The method according to any one of claims 1 to 4, wherein the label values are assigned according to a spatial constraint relationship in which key points of the same human body are assigned similar label values.
6. The method of any one of claims 1 to 5, wherein the detection heat map of a class of key points representing the likelihood that the class of key points is present at different positions in the image to be processed comprises:

the detection heat map of the key points represents the Gaussian distribution of the key points at different positions in the image to be processed;

or,

the detection heatmap of a class of keypoints represents the probability of the keypoints appearing at different positions in the image to be processed.
7. A human body posture estimation device, characterized by comprising:

the detection unit is used for inputting the image to be processed into a neural network, detecting K key points in the image to be processed in parallel and obtaining a detection heat map and a label pool of each key point; the detection heat map of one type of key points represents the possibility that the key points appear at different positions in the image to be processed; the label pool of the key points comprises a label value of each key point in the image to be processed, and the label value of one key point is used for indicating the human body group to which the key point belongs;

the acquisition unit is used for acquiring a peak value in the detection heat map of each type of key points according to the detection heat map of each type of key points acquired by the detection unit;

the acquisition unit is further configured to acquire a label value of the key point corresponding to each peak value according to the label pool of each type of key point obtained by the detection unit;

and the clustering unit is used for clustering the key points with similar label values acquired by the acquisition unit as the same human body connection key point.
8. The apparatus of claim 7, wherein the neural network is configured with a grouping loss function that assigns a label value to each key point.
9. The apparatus of claim 8, wherein

said grouping loss function
wherein N is the number of human bodies in the image to be processed; w(y) represents the label value of the key point at coordinate y;
the mean of the label values representing all real key point locations of the nth person.
10. The apparatus of any one of claims 7-9, wherein the neural network is configured with a detection loss function for computing a mean square error of the predicted detection heat map and the real key point heat map to output the detection heat map.
11. The apparatus according to any one of claims 7-10, wherein the label values are assigned according to a spatial constraint relationship in which key points of the same human body are assigned similar label values.
12. The apparatus according to any one of claims 7-11, wherein the detection heat map of a class of key points representing the likelihood of the class of key points appearing at different positions in the image to be processed comprises:

the detection heat map of the key points represents the Gaussian distribution of the key points at different positions in the image to be processed;

or,

the detection heatmap of a class of keypoints represents the probability of the keypoints appearing at different positions in the image to be processed.
13. A human body posture estimation device, comprising a processor and a memory, the memory being coupled with the processor, and the processor being configured to perform the human body posture estimation method of any one of claims 1 to 6.
14. A computer-readable storage medium comprising instructions that, when executed on a computer, cause the computer to perform the human body posture estimation method of any one of claims 1 to 6.
15. A computer program product which, when run on a computer, causes the computer to perform the human body posture estimation method of any one of claims 1 to 6.