CN110427806A

CN110427806A - Video frequency identifying method, device and computer readable storage medium

Info

Publication number: CN110427806A
Application number: CN201910538695.1A
Authority: CN
Inventors: 杨涛涛; 高新
Original assignee: Beijing QIYI Century Science and Technology Co Ltd
Current assignee: Beijing QIYI Century Science and Technology Co Ltd
Priority date: 2019-06-20
Filing date: 2019-06-20
Publication date: 2019-11-08

Abstract

The present invention provides a kind of video frequency identifying method, device and computer readable storage mediums, belong to field of computer technology.This method can be according to personage's key point in video to be processed in start frame image, obtain personage's key point in video to be processed in multiple image, then, it can be based on personage's key point in frame image every in multiple image, obtain the motion information of personage's posture in video to be processed, finally, motion information based on personage's posture, identify whether video to be processed is the video comprising specified type behavior, in this way, by being identified with embodying personage's key point of personage in video, it can be interfered to avoid other pixel brings, and then improve video identification precision.

Description

Video identification method and device and computer readable storage medium

Technical Field

The invention belongs to the technical field of computers, and particularly relates to a video identification method and device and a computer readable storage medium.

Background

With the continuous development of the internet, a user can simply and quickly upload videos to a video platform anytime and anywhere. Further, since the content of the video uploaded by the user may include a specific type of behavior that may cause adverse effects, for example, a high-risk behavior, the video platform side often needs to perform auditing and identification on the video uploaded by the user, and if an identification result indicates that the video does not include the specific type of behavior, an online right is set for the video uploaded by the user, so that the video uploaded by the user is displayed on the video platform.

In the prior art, identification is usually performed based on information of all pixel points in all images in a video to be processed, but since all pixel points contain more interference information, the problem that the information based on all the pixel points cannot be accurately identified is caused, and the identification effect is poor.

Disclosure of Invention

In view of this, the present invention provides a video identification method, apparatus and computer readable storage medium, which solve the problem of poor identification effect to a certain extent.

According to a first aspect of the present invention, there is provided a video recognition method, which may include:

acquiring figure key points in a plurality of frames of images in a video to be processed according to the figure key points in the initial frame of image in the video to be processed;

acquiring motion information of the human posture in the video to be processed based on the human key points in each frame of image in the multi-frame image;

and identifying whether the video to be processed is a video containing a specified type of behavior or not based on the motion information of the character gesture.

Optionally, the obtaining of the person key points in the multiple frame images in the video to be processed according to the key points in the start frame image in the video to be processed includes:

detecting character key points in the initial frame image in the video to be processed by using a preset key point detection algorithm;

and for a plurality of residual frame images with a time sequence later than that of the initial frame image in the video to be processed, sequentially determining the character key points in each residual frame image according to the time sequence of each residual frame image based on the character key points in the initial frame image to obtain the character key points in the multi-frame images in the video to be processed.

Optionally, the sequentially determining the person key points in each remaining frame image according to the time sequence of each remaining frame image based on the person key points in the start frame image includes:

for each residual frame image, determining corresponding pixel points of key points of the human body in the previous frame image of the residual frame image in the residual frame image by using a preset tracking algorithm;

taking the corresponding pixel points as character key points in the residual frame images;

the time sequence of the previous frame image of the residual frame images is earlier than that of the residual frame images, and the starting frame image is the previous frame image of the residual frame image with the earliest time sequence.

Optionally, before the corresponding pixel point is used as a person key point in the remaining frame image, the method further includes:

for each residual frame image, determining a first difference value between the number of the corresponding pixel points and the number of the human key points in the previous frame image of the residual frame image;

if the first difference is not smaller than a first preset difference threshold value, re-determining the character key points in the residual frame image by using the preset key point detection algorithm to obtain the corresponding pixel points; or,

and if the first difference value is smaller than the first preset difference value threshold value, determining missing character key points based on the relative position relation of the character key points in the previous frame of image, and adding the missing character key points to the corresponding pixel points.

Optionally, after the corresponding pixel point is taken as a person key point in the remaining frame image, the method further includes:

selecting a residual frame image from the residual frame images every n residual frame images to obtain a plurality of target residual frame images;

for each target residual frame image, re-determining the character key points in the target residual frame image by using the preset key point detection algorithm;

calculating a second difference value between the number of the re-determined character key points and the number of the corresponding pixel points;

and if the second difference is larger than a second preset difference threshold value, taking the re-determined person key points as the person key points of the target residual frame image.

Optionally, before the identifying whether the video to be processed is a video containing a specified type of behavior based on the motion information of the person gesture, the method further includes:

determining the number of people and the posture of the people corresponding to each frame of image based on the relative position relationship of the key points of the people in each frame of image in the multiple frames of images to obtain the posture information of the people corresponding to each frame of image;

correspondingly, the identifying whether the video to be processed is a video containing a specified type of behavior based on the motion information of the character gesture comprises the following steps:

the method comprises the steps that character posture information corresponding to each frame of image and motion information of character postures are used as input of a preset classification model, and the preset classification model is used for determining the category of a video to be processed;

and if the category to which the video to be processed belongs is matched with the specified type, determining that the video to be processed is a video containing the specified type behavior.

Optionally, the obtaining motion information of the person pose in the video to be processed based on the person key point in each frame of image in the multiple frames of images includes:

determining the motion trail of each type of human key points based on the positions of the human key points in each frame of image and the time sequence of each frame of image;

and determining the motion information of the character posture based on the motion trail of the key points of each type of characters and the time interval between each frame of image.

Optionally, the method further includes:

and if the video to be processed is the video containing the specified type of behaviors, determining that the video to be processed is the abnormal video.

According to a second aspect of the present invention, there is provided a video recognition apparatus, which may include:

the first acquisition module is used for acquiring character key points in a plurality of frames of images in a video to be processed according to the character key points in an initial frame of image in the video to be processed;

the second acquisition module is used for acquiring the motion information of the human posture in the video to be processed based on the human key points in each frame of image in the multi-frame image;

and the identification module is used for identifying whether the video to be processed is a video containing a specified type of behavior or not based on the motion information of the character gesture.

Optionally, the first obtaining module includes:

the detection submodule is used for detecting the character key points in the initial frame image in the video to be processed by utilizing a preset key point detection algorithm;

and the determining submodule is used for sequentially determining the character key points in each residual frame image according to the time sequence of each residual frame image based on the character key points in the initial frame image and the time sequence of each residual frame image for a plurality of residual frame images with the time sequence later than that of the initial frame image in the video to be processed so as to obtain the character key points in the plurality of frame images in the video to be processed.

Optionally, the determining sub-module is configured to:

Optionally, the determining sub-module is further configured to:

Optionally, the apparatus further comprises:

the first determining module is used for determining the number of the people and the postures of the people corresponding to each frame of image based on the relative position relation of the key points of the people in each frame of image in the multiple frames of images to obtain the posture information of the people corresponding to each frame of image;

accordingly, an identification module to:

taking the figure posture information corresponding to each frame of image and the motion information of the figure posture as the input of a preset classification model, and acquiring the category of the video to be processed by using the preset classification model;

Optionally, the second obtaining module is configured to:

And (4) optional. The device further comprises:

and the second determining module is used for determining that the video to be processed is an abnormal video if the video to be processed contains the specified type of behavior.

In a third aspect, an embodiment of the present invention provides a computer-readable storage medium, where a computer program is stored on the computer-readable storage medium, and when executed by a processor, the computer program implements the steps of the video identification method according to the first aspect.

Aiming at the prior art, the invention has the following advantages:

the method comprises the steps of obtaining character key points in multi-frame images in a video to be processed according to character key points in initial frame images in the video to be processed, obtaining motion information of character gestures in the video to be processed based on the character key points in the multi-frame images, and finally identifying whether the video to be processed is a video containing behaviors of a specified type or not based on the motion information of the character gestures.

The foregoing description is only an overview of the technical solutions of the present invention, and the embodiments of the present invention are described below in order to make the technical means of the present invention more clearly understood and to make the above and other objects, features, and advantages of the present invention more clearly understandable.

Drawings

Various other advantages and benefits will become apparent to those of ordinary skill in the art upon reading the following detailed description of the preferred embodiments. The drawings are only for purposes of illustrating the preferred embodiments and are not to be construed as limiting the invention. Also, like reference numerals are used to refer to like parts throughout the drawings. In the drawings:

fig. 1 is a flowchart illustrating steps of a video recognition method according to an embodiment of the present invention;

FIG. 2-1 is a flow chart illustrating steps of another video recognition method according to an embodiment of the present invention;

FIG. 2-2 is a schematic diagram of a character pose provided by an embodiment of the invention;

fig. 3 is a block diagram of a video recognition apparatus according to an embodiment of the present invention;

fig. 4-1 is a block diagram of another video identification apparatus provided in an embodiment of the present invention;

fig. 4-2 is a block diagram of another video identification device according to an embodiment of the present invention.

Detailed Description

Exemplary embodiments of the present invention will be described in more detail below with reference to the accompanying drawings. While exemplary embodiments of the invention are shown in the drawings, it should be understood that the invention can be embodied in various forms and should not be limited to the embodiments set forth herein. Rather, these embodiments are provided so that this disclosure will be thorough and complete, and will fully convey the scope of the invention to those skilled in the art.

Fig. 1 is a flowchart illustrating steps of a video identification method according to an embodiment of the present invention, where as shown in fig. 1, the method may include:

step 101, obtaining the key points of people in the multi-frame images in the video to be processed according to the key points of people in the initial frame images in the video to be processed.

In the embodiment of the invention, the video to be processed may be a video uploaded by a user and needing to be audited, generally, behavior actions of people may be recorded in the video content uploaded by the user, and accordingly, the content recorded by the frame image in the video to be processed often contains people, so in this step, key points of people in the initial frame image may be obtained first, specifically, key points of people in the initial frame image may be obtained based on a key point detection algorithm, and then key points of people in a plurality of frames of images contained in the video to be processed are obtained based on the key points of people in the initial frame image, and are used as a basis for identifying people behaviors in subsequent steps, so that compared with each frame of image contained in the video to be processed, the same operation of obtaining key points of people is executed to realize a mode of obtaining key points of people in each frame of image, in the embodiment of the invention, the method for acquiring the key points of the persons in the multi-frame images in the video to be processed is based on the key points of the persons in the initial frame images in the video to be processed, so that the number of times of executing acquisition operation can be reduced to a certain extent, and the operation cost is further reduced. Further, the key point of the person in the frame image may be a designated position of the person included in the frame image, and the designated position of the person may be a predetermined position capable of representing the posture characteristics of the person in the frame image, for example, the designated position may be a top of the head, a wrist, a shoulder, a knee, an ankle, and the like of the person.

And 102, acquiring motion information of the character posture in the video to be processed based on the character key points in each frame of image in the multi-frame image.

In the embodiment of the invention, because the key points of the person in each frame of image are the corresponding pixel points of the appointed position capable of showing the posture characteristics of the person in the image, therefore, the key points of the person in each frame of image can represent the posture of the person in the frame of image, furthermore, the posture of the person at a single time point is recorded in each frame of image in the video to be processed, and each frame of image in the whole video to be processed is continuous in video time sequence, therefore, in the step, the motion information of the character pose in the video to be processed can be obtained based on the character key points in each frame of image, the motion information of the character pose may be data capable of representing specific changes of the character pose in the video to be processed, and the motion information of the character pose may be a vector having a direction and a magnitude.

And 103, identifying whether the video to be processed is a video containing a specified type of behavior or not based on the motion information of the character gesture.

In the embodiment of the present invention, the specified type of behavior may be preset, for example, the specified type of behavior may be a behavior with a higher risk coefficient, and since a person makes a behavior, the person often changes the posture of the person, and the motion information of the posture of the person may represent a specific change of the posture of the person in the video to be processed, in this step, it may be identified whether the video to be processed includes the specified type of behavior based on the motion information of the posture of the person. Furthermore, in a mode of performing identification based on motion information of all pixel points, when a person occupies a small area, motion information of all pixel points often reflects motion changes of other pixel points, so that identification effect is poor.

In summary, the video identification method provided in the embodiments of the present invention can obtain the key points of the person in the multi-frame image of the video to be processed according to the key points of the person in the initial frame image of the video to be processed, then obtain the motion information of the gesture of the person in the video to be processed based on the key points of the person in each frame image of the multi-frame image, and finally identify whether the video to be processed is a video including the behavior of the specified type based on the motion information of the gesture of the person.

Fig. 2-1 is a flow chart illustrating steps of another video identification method according to an embodiment of the present invention, as shown in fig. 2-1, the method may include:

step 201, obtaining the key points of the people in the multi-frame images in the video to be processed according to the key points of the people in the initial frame images in the video to be processed.

Specifically, the step can be implemented through the following steps 2011 to 2012:

and 2011, detecting the key points of the person in the initial frame image in the video to be processed by using a preset key point detection algorithm.

In this step, the start frame image may be an nth frame image in a preset video to be processed, for example, N may be 1, so that it may be ensured that the person information in the video to be processed is not missed by starting the detection from the first frame image, and further the comprehensiveness of the information for identification is ensured to the maximum extent, and further the identification effect is improved.

Further, the preset key point detection algorithm may be a partial Affinity field (Part Affinity Fields) algorithm, specifically, the key point detection algorithm may generate a thermodynamic diagram of the person key points based on the image information of the start frame image, each pixel point in the thermodynamic diagram has a probability value, the probability value may represent a corresponding pixel point of the pixel point in the thermodynamic diagram in the start frame image and a probability of belonging to the person key point, correspondingly, the thermodynamic diagram may be divided into a plurality of regions according to a preset size, and the pixel point with the maximum probability value in the region is determined as the person key point.

Step 2012, determining the key points of the persons in each remaining frame image in sequence according to the time sequence of each remaining frame image based on the key points of the persons in the starting frame image for the remaining frame images with the time sequence later than that of the starting frame image in the video to be processed, so as to obtain the key points of the persons in the multi-frame image in the video to be processed.

In this step, the time sequence of the frame image indicates a corresponding time sequence of the frame image in the video to be processed, for example, the time sequence of the frame image corresponding to the 300 th millisecond in the video to be processed is earlier than the time sequence of the frame image corresponding to the 800 th millisecond in the video to be processed, and further, assuming that the starting frame image is the 2000 th millisecond corresponding frame image, the frame images after the 2000 th millisecond in the video to be processed are the remaining frame images. Further, the determination of the person key points in each remaining frame image in sequence according to the time sequence of each remaining frame image based on the person key points in the starting frame image may be realized by the following substeps (1) to (2):

substep (1): and for each residual frame image, determining corresponding pixel points of the key points of the human body in the previous frame image of the residual frame image in the residual frame image by using a preset tracking algorithm.

In this step, the time sequence of the frame preceding the residual frame image is earlier than the time sequence of the residual frame image, and accordingly, for the residual frame image with the earliest time sequence among the residual frame images, the frame preceding the residual frame image with the earlier time sequence than the residual frame image is the starting frame image. Therefore, for the remaining frame images, a preset tracking algorithm may be used to determine corresponding pixel points of the person key points in the initial frame image in the remaining frame images, and then, for the other remaining frame images, the determination may be performed sequentially according to the sequence from early to late in time sequence.

Specifically, the preset tracking algorithm may be an optical flow tracking algorithm, and the optical flow tracking algorithm may be based on a principle of constant brightness, that is, the brightness of the same point does not change with time, and a spatial consistency principle, that is, a pixel point adjacent to one pixel point is projected onto the next frame image and is also an adjacent point, and the speed is consistent, and based on the brightness characteristic of the character key point and the speed characteristic of the adjacent point, the corresponding pixel point of the character key point in the remaining frame image is predicted. Of course, in another alternative embodiment of the present invention, other algorithms may also be used to determine corresponding pixel points of the key points of the human body in the previous frame image of the remaining frame image in the remaining frame image, for example, a tracking algorithm based on the active contour, which is not limited in this embodiment of the present invention.

Substep (2): and taking the corresponding pixel points as character key points in the residual frame images.

Since the corresponding pixel points of the character key points in the previous frame image of the residual frame image are obtained by tracking the character key points detected in the previous frame image, the corresponding pixel points can be considered as the character key points in the residual frame image, and accordingly, the corresponding pixel points can be used as the character key points in the residual frame image. In the embodiment of the invention, the key points in the rest frame images are determined by tracking the key points of the people in the previous frame image, so that the key points of the people in all the frame images can be determined only by detecting the key points of the people in the initial frame image once, the operation process of detecting the key points of the people can be simplified, and the processing efficiency is improved.

Further, due to the influence of the precision of the tracking algorithm, in an actual application scene, the number of corresponding pixel points of the human key points in the previous frame image of the residual frame image determined based on the tracking algorithm in the residual frame image may be smaller than the number of the human key points in the previous frame image, that is, some human key points are omitted, so in the embodiment of the present invention, the following sub-steps (3) to (5) may be performed before the sub-step (2):

substep (3): and for each residual frame image, determining a first difference value between the number of the corresponding pixel points and the number of the human key points in the previous frame image of the residual frame image.

For example, assuming that the number of corresponding pixel points determined based on the tracking algorithm is X and the number of key points of the human body in the previous frame of the remaining frame of images is Y, the value of Y-X may be calculated to obtain the first difference.

Substep (4): and if the first difference is not smaller than a first preset difference threshold value, re-determining the character key points in the residual frame image by using the preset key point detection algorithm to obtain the corresponding pixel points.

In this step, the first preset difference threshold may be preset according to an actual situation, for example, the first preset difference threshold may be 50% of the number of preset designated positions of the person, for example, the number of the designated positions of the person is 10, then the first preset difference threshold may be 5, and further, if the first difference is not less than the first preset difference threshold, it may be considered that there are more missing person key points, therefore, the preset key point detection algorithm may be directly used to re-determine the person key points in the remaining frame image, specifically, the manner of determining the person key points based on the key point detection algorithm may refer to the above steps, which is not described herein again in this embodiment of the present invention.

Substep (5): and if the first difference value is smaller than the first preset difference value threshold value, determining missing character key points based on the relative position relation of the character key points in the previous frame of image, and adding the missing character key points to the corresponding pixel points.

In this step, if the first difference is smaller than the first preset difference threshold, it may be determined that there are fewer missing person key points, and at this time, the missing person key points may be determined based on the relative position relationship of the person key points in the previous frame of image, so as to save the processing cost. Specifically, because the time difference between two adjacent images is small, the pose change of the people in the two adjacent images is often small, and correspondingly, the degree of the relative position change of each key point of the people is also small, therefore, in this step, it can be determined whether the corresponding pixel points in the remaining frame image have pixel points corresponding to the person key points around each person key point in the previous frame image, if not, it may be determined that the human key point is missing, and further, the image processing apparatus may determine that the human key point is missing based on the relative positions and distances of the human key point corresponding to the missing human key point in the previous image, and the other human key points, and finally, adding the missing character key points into corresponding pixel points to ensure that the character key points in the residual frame image are more comprehensive and complete.

In the embodiment of the invention, the first difference is calculated, the key point detection algorithm is reused for detection when the first difference is not smaller than the first preset difference threshold, namely more person key points are lost, and completion is performed based on the relative position relation when the first difference is smaller than the first preset difference threshold, namely less person key points are lost, so that the integrity of the person key points in the residual frame image can be improved, and the cost for improving the integrity is saved as much as possible.

Further, in the shooting process of the video to be processed, persons may be added in the shooting process, and therefore, tracking is performed only based on a tracking algorithm, and newly added persons may be omitted, and therefore, in the embodiment of the present invention, after the sub-step (2), the following sub-steps (6) to (9) may be performed:

substep (6): and selecting a residual frame image from the residual frame images every n residual frame images to obtain a plurality of target residual frame images.

In this step, n is a preset value, for example, n may be 4, further, for example, n is 4, starting from a first frame image of the plurality of residual frame images, a first frame image of the plurality of residual frame images may be used as a target residual frame image, a 6 th frame image may be used as a target residual frame image, an 11 th frame image may be used as a target residual frame image, and so on, so as to obtain a plurality of target residual frame images. In the embodiment of the invention, the number of the target residual frame images can be flexibly controlled by selecting the target residual frame images from the residual frame images according to the fixed interval number, so that the realizability of the method is improved. For example, when the number of the remaining frame images is small, the number n may be small, and the number of the target remaining frame images may be increased, thereby improving the effect of complementing the key points of the person.

Substep (7): and for each target residual frame image, re-determining the character key points in the target residual frame image by using the preset key point detection algorithm.

Specifically, the method for determining the person key points based on the key point detection algorithm may refer to the above steps, and the embodiment of the present invention is not described herein again.

Substep (8): and calculating a second difference value between the number of the re-determined character key points and the number of the corresponding pixel points.

For example, assuming that the number of the newly determined human key points is P and the number of the corresponding pixel points is X, the value of P-X may be calculated to obtain a second difference value.

Substep (9): and if the second difference is larger than a second preset difference threshold value, taking the re-determined person key points as the person key points of the target residual frame image.

In this step, the second preset difference threshold may be set in advance according to actual conditions, for example, the second preset difference threshold may be 80% of the number of the designated positions of the preset person, for example, the number of the designated positions of the person is 10, then the second preset difference threshold may be 8, further, if the second difference is greater than the second preset difference threshold, it may be considered that a new person appears in the target remaining frame image, and therefore, the newly determined person key point may be used as the person key point of the target remaining frame image, so as to ensure that the person key points of the target remaining frame image can be more comprehensive. In the embodiment of the invention, part of the residual frame image is extracted as the target residual frame image, the second difference value is calculated to determine whether a new person appears in the target residual frame image, and when the new person appears, the person key points are set for the target residual frame image again, so that the comprehensiveness of the person key points in the target residual frame image can be improved.

Step 202, obtaining the motion information of the character posture in the video to be processed based on the character key points in each frame of image in the multi-frame image.

Specifically, this step can be realized by the following steps 2021 to 2022:

step 2021, determining the motion trajectory of the key points of each type of human being based on the positions of the key points of the human being in each frame of image and the time sequence of each frame of image.

In this step, each frame of image may include a plurality of person key points, where each person key point in each frame of image may correspond to a part of a body of a person, for example, the person key point may correspond to a left knee of the person, and the person key point may correspond to a right wrist of the person, where the person key points corresponding to the same part in each frame of image belong to the same type of person key point, for example, the person key point corresponding to the left knee in each frame of image belongs to a type of person key point.

Further, because each frame image in the video to be processed is generated sequentially along with time, the motion trail of the key point of each type of human can be determined sequentially according to the time sequence of each frame image. Specifically, the key points of the same type of human objects in each frame of image can be sequentially connected according to the sequence of the time sequence from early to late, so as to obtain the motion trajectory of the key points of each type of human objects.

Step 2022, determining the motion information of the character pose based on the motion tracks of the key points of each type of characters and the time interval between each frame of image.

In this step, each type of key points in each frame of image can constitute the pose of the person in the frame of image, therefore, the connection relationship of each type of character key points in each frame of image can be determined based on the confidence degree of the connection of each type of character key points in each frame of image, then, the points corresponding to the same frame of image in the motion trail of the key points of each type of human can be connected to construct the motion trail of the human posture, wherein, the motion trail of the character gesture can embody the motion direction of the character, furthermore, the change amplitude of the preset paragraph in the motion trail of the character gesture can be determined, and then determining the time length corresponding to the preset section based on the time interval between each frame of image and the number of the frame images corresponding to the preset section, and finally estimating the movement speed of the person based on the variation amplitude and the corresponding time length so as to obtain the movement information of the posture of the person.

And 203, determining the number of the people and the postures of the people corresponding to each frame of image based on the relative position relationship of the key points of the people in each frame of image in the plurality of frames of images, and obtaining the posture information of the people corresponding to each frame of image.

In this step, for each person key point in the frame image, the person key point may be connected to other person key points, then confidence levels of the connection between the person key point and the other person key points are measured along line segments of the connection between the person key point and the other person key points based on a direction of each person key point, where the direction of each person key point may be generated when the person key point is detected by using a key point detection algorithm, and finally, the person pose corresponding to the frame image is obtained based on the connection between the other person key points corresponding to the maximum confidence levels and the person key points, for example, fig. 2-2 is a schematic diagram of the person pose provided by the embodiment of the present invention, and as can be seen from fig. 2-2, the person pose is obtained by connecting 8 person key points. Accordingly, the number of persons in the frame image can be determined based on the number of independent person poses in the frame image, and then the person pose information can be obtained.

And 204, taking the figure posture information corresponding to each frame of image and the motion information of the figure posture as the input of a preset classification model, and acquiring the category of the video to be processed by using the preset classification model.

In this step, the preset classification model may be constructed based on a Convolutional Neural Network (CNN), specifically, the person posture information and the movement information of the person posture corresponding to each frame of image may be used as the input of the classification model, then the classification feature vectors are extracted based on the person posture information and the movement information of the person posture through the Convolutional layer of the classification model, then the full connection layer of the classification model is used to perform full connection processing on the classification feature vectors to obtain target vectors, then the sofimax layer of the classification model is used to determine the probability that the target vectors belong to each preset class, and finally, the class with the largest corresponding probability may be used as the class to which the video to be processed belongs. In this embodiment of the present invention, the previous layer of the sofimax layer may include neurons corresponding to a plurality of preset categories one to one, and accordingly, a component element included in a target vector output by the previous layer may be an element corresponding to each preset category, and for each element in the target vector, the sofimax layer may map a vector value of the element to (0, 1) by using a sofimax function, so as to obtain a probability value corresponding to the element, that is, a probability value corresponding to the preset category.

Step 205, if the category to which the video to be processed belongs is matched with the specified type, determining that the video to be processed is a video containing a behavior of the specified type.

Further, each preset category corresponds to a type, and if the type of the behavior corresponding to the category to which the video to be processed belongs is the same as the type of the specified type, the category to which the video to be processed belongs can be considered to be matched with the specified type. Further, compared with a mode of only recognizing through motion information, the embodiment of the invention can provide more abundant information capable of reflecting the posture of the person for recognition operation through a mode of acquiring the posture information of the person and recognizing through combining the posture information of the person and the motion information of the posture of the person, thereby improving the recognition effect.

In this step, if the two are matched, it may be considered that the content of the video to be processed includes the specified type of behavior, and therefore, it may be determined that the video to be processed is a video including the specified type of behavior, and further, it may be determined that the video to be processed is an abnormal video. It should be noted that, in the embodiment of the present invention, it may also be determined that the video to be processed is not a video containing a specified type of behavior under the condition of mismatch, and accordingly, an online authority may be set for the target video, so that the video may be displayed on the video platform. Further, after the video to be processed is determined to be the video containing the specified type of behavior, online failure reminding information can be sent to a terminal which uploads the video to be processed, so that a user can timely know an audit result and modify the content of the video to be processed.

In summary, the video identification method provided in the embodiments of the present invention can obtain key points of a person in each frame of image in a video to be processed according to key points of the person in an initial frame of image in the video to be processed, then obtain motion information of a person posture in the video to be processed based on the key points of the person in each frame of image, then determine the number of the persons and the postures of the persons corresponding to each frame of image based on a relative position relationship between the key points of the person in each frame of image in the plurality of frames of image, obtain posture information of the person corresponding to each frame of image, and finally perform identification by combining the posture information of the person and the motion information of the posture of the person.

Fig. 3 is a block diagram of a video recognition apparatus according to an embodiment of the present invention, and as shown in fig. 3, the apparatus 30 may include:

the first obtaining module 301 is configured to obtain a person key point in a multi-frame image in a video to be processed according to a person key point in an initial frame image in the video to be processed.

A second obtaining module 302, configured to obtain motion information of a person pose in the video to be processed based on a person key point in each frame of image in the multiple frames of images.

The identification module 303 is configured to identify whether the video to be processed is a video including a specified type of behavior based on the motion information of the person gesture.

In summary, in the video identification apparatus provided in the embodiment of the present invention, the first obtaining module may obtain the key points of the person in each frame of image in the video to be processed according to the key points of the person in the initial frame of image in the video to be processed, then the second obtaining module may obtain the motion information of the posture of the person in the video to be processed based on the key points of the person in each frame of image, and finally, the identifying module may identify whether the video to be processed is a video including the behavior of the specified type based on the motion information of the posture of the person, so that the key points of the person in the video can be used for identifying to avoid interference caused by other pixel points, thereby improving the video identification accuracy.

Fig. 4-1 is a block diagram of another video identification apparatus according to an embodiment of the present invention, and as shown in fig. 4-1, the apparatus 40 may include:

the first obtaining module 401 is configured to obtain key points of people in multiple frames of images in a video to be processed; the character key points are designated positions of characters contained in the frame images and corresponding pixel points in the frame images.

A second obtaining module 402, configured to obtain motion information of the person pose in the video to be processed based on the person key point in each frame of image in the multiple frames of images.

And the identifying module 403 is configured to identify whether the video to be processed is a video containing a specified type of behavior based on the motion information of the person gesture.

The first obtaining module 401 includes:

the detection submodule 4011 is configured to detect a person key point in an initial frame image in the video to be processed by using a preset key point detection algorithm;

the determining submodule 4012 is configured to, for a plurality of remaining frame images in the to-be-processed video, whose time sequences are later than the start frame image, sequentially determine, based on the person key points in the start frame image and according to the time sequence of each remaining frame image, the person key points in each remaining frame image, so as to obtain the person key points in the multi-frame images in the to-be-processed video.

Optionally, the determining sub-module 4012 is configured to:

Optionally, the determining sub-module 4012 is further configured to:

Optionally, fig. 4-2 is a block diagram of another video identification apparatus provided in an embodiment of the present invention, and as shown in fig. 4-2, the apparatus 40 further includes:

a first determining module 404, configured to determine, based on a relative position relationship between key points of people in each frame of image in the multiple frames of images, the number of people and the posture of people corresponding to each frame of image, so as to obtain posture information of people corresponding to each frame of image;

accordingly, the identifying module 403 is configured to:

Optionally, the second obtaining module 402 is configured to:

Optionally, as shown in fig. 4-2, the apparatus 40 further includes:

a second determining module 405, configured to determine that the video to be processed is an abnormal video if the video to be processed includes a behavior of a specified type.

For the above device embodiment, since it is basically similar to the method embodiment, the description is relatively simple, and for the relevant points, refer to the partial description of the method embodiment.

Preferably, an embodiment of the present invention further provides a terminal, which includes a processor, a memory, and a computer program stored in the memory and capable of running on the processor, and when the computer program is executed by the processor, the computer program implements each process of the video identification method embodiment, and can achieve the same technical effect, and in order to avoid repetition, details are not repeated here.

The embodiment of the present invention further provides a computer-readable storage medium, where a computer program is stored on the computer-readable storage medium, and when the computer program is executed by a processor, the computer program implements each process of the video identification method embodiment, and can achieve the same technical effect, and in order to avoid repetition, details are not repeated here. The computer-readable storage medium may be a Read-Only Memory (ROM), a Random Access Memory (RAM), a magnetic disk or an optical disk.

The embodiments in the present specification are described in a progressive manner, each embodiment focuses on differences from other embodiments, and the same and similar parts among the embodiments are referred to each other.

As is readily imaginable to the person skilled in the art: any combination of the above embodiments is possible, and thus any combination between the above embodiments is an embodiment of the present invention, but the present disclosure is not necessarily detailed herein for reasons of space.

The video recognition methods provided herein are not inherently related to any particular computer, virtual system, or other apparatus. Various general purpose systems may also be used with the teachings herein. The structure required to construct a system incorporating aspects of the present invention will be apparent from the description above. Moreover, the present invention is not directed to any particular programming language. It is appreciated that a variety of programming languages may be used to implement the teachings of the present invention as described herein, and any descriptions of specific languages are provided above to disclose the best mode of the invention.

In the description provided herein, numerous specific details are set forth. It is understood, however, that embodiments of the invention may be practiced without these specific details. In some instances, well-known methods, structures and techniques have not been shown in detail in order not to obscure an understanding of this description.

Similarly, it should be appreciated that in the foregoing description of exemplary embodiments of the invention, various features of the invention are sometimes grouped together in a single embodiment, figure, or description thereof for the purpose of streamlining the invention and aiding in the understanding of one or more of the various inventive aspects. However, the disclosed method should not be interpreted as reflecting an intention that: that the invention as claimed requires more features than are expressly recited in each claim. Rather, as the following claims reflect, inventive aspects lie in less than all features of a single foregoing disclosed embodiment. Thus, the claims following the detailed description are hereby expressly incorporated into this detailed description, with each claim standing on its own as a separate embodiment of this invention.

Those skilled in the art will appreciate that the modules in the device in an embodiment may be adaptively changed and disposed in one or more devices different from the embodiment. The modules or units or components of the embodiments may be combined into one module or unit or component, and furthermore they may be divided into a plurality of sub-modules or sub-units or sub-components. All of the features disclosed in this specification (including any accompanying claims, abstract and drawings), and all of the processes or elements of any method or apparatus so disclosed, may be combined in any combination, except combinations where at least some of such features and/or processes or elements are mutually exclusive. Each feature disclosed in this specification (including any accompanying claims, abstract and drawings) may be replaced by alternative features serving the same, equivalent or similar purpose, unless expressly stated otherwise.

Furthermore, those skilled in the art will appreciate that while some embodiments described herein include some features included in other embodiments, rather than other features, combinations of features of different embodiments are meant to be within the scope of the invention and form different embodiments. For example, in the claims, any of the claimed embodiments may be used in any combination.

The various component embodiments of the invention may be implemented in hardware, or in software modules running on one or more processors, or in a combination thereof. Those skilled in the art will appreciate that a microprocessor or Digital Signal Processor (DSP) may be used in practice to implement some or all of the functions of some or all of the components of a video recognition method according to embodiments of the present invention. The present invention may also be embodied as apparatus or device programs (e.g., computer programs and computer program products) for performing a portion or all of the methods described herein. Such programs implementing the present invention may be stored on computer-readable media or may be in the form of one or more signals. Such a signal may be downloaded from an internet website or provided on a carrier signal or in any other form.

It should be noted that the above-mentioned embodiments illustrate rather than limit the invention, and that those skilled in the art will be able to design alternative embodiments without departing from the scope of the appended claims. In the claims, any reference signs placed between parentheses shall not be construed as limiting the claim. The word "comprising" does not exclude the presence of elements or steps not listed in a claim. The word "a" or "an" preceding an element does not exclude the presence of a plurality of such elements. The invention may be implemented by means of hardware comprising several distinct elements, and by means of a suitably programmed computer. In the unit claims enumerating several means, several of these means may be embodied by one and the same item of hardware. The usage of the words first, second and third, etcetera do not indicate any ordering. These words may be interpreted as names.

Claims

1. A method for video recognition, the method comprising:

2. The method of claim 1, wherein the obtaining key points of people in the multi-frame images of the video to be processed according to the key points in the initial frame image of the video to be processed comprises:

3. The method of claim 2, wherein the sequentially determining the key points of the person in each of the remaining frame images according to the time sequence of each of the remaining frame images based on the key points of the person in the starting frame image comprises:

4. The method of claim 3, wherein before the corresponding pixel points are used as key points of the human figure in the remaining frame images, the method further comprises:

5. The method of claim 3, wherein after the corresponding pixel points are used as key points of the human figure in the remaining frame images, the method further comprises:

6. The method of claim 1, wherein before identifying whether the video to be processed is a video containing a specified type of behavior based on the motion information of the character pose, the method further comprises:

7. The method of claim 1, wherein the obtaining motion information of the person pose in the video to be processed based on the key points of the person in each frame of the images of the plurality of frames comprises:

8. The method of claim 1, further comprising:

9. A video recognition apparatus, the apparatus comprising:

10. The apparatus of claim 9, wherein the first obtaining module comprises:

11. The apparatus of claim 10, wherein the determination submodule is configured to:

12. The apparatus of claim 11, wherein the determination submodule is further configured to:

13. The apparatus of claim 11, wherein the determination submodule is further configured to:

14. The apparatus of claim 9, further comprising:

accordingly, an identification module to:

15. The apparatus of claim 9, wherein the second obtaining module is configured to:

16. The apparatus of claim 9, further comprising:

17. A computer-readable storage medium, on which a computer program is stored which, when being executed by a processor, carries out a video recognition method according to any one of claims 1 to 8.