
CN109697434A - Behavior recognition method, apparatus and storage medium - Google Patents

Behavior recognition method, apparatus and storage medium

Info

Publication number
CN109697434A
CN109697434A (application CN201910012006.3A)
Authority
CN
China
Prior art keywords
video
feature
time domain
activity recognition
segment
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN201910012006.3A
Other languages
Chinese (zh)
Other versions
CN109697434B (en)
Inventor
王吉
陈志博
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Tencent Technology Shenzhen Co Ltd
Original Assignee
Tencent Technology Shenzhen Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Tencent Technology Shenzhen Co Ltd
Priority to CN201910012006.3A
Publication of CN109697434A
Application granted
Publication of CN109697434B
Legal status: Active


Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V40/00 Recognition of biometric, human-related or animal-related patterns in image or video data
    • G06V40/20 Movements or behaviour, e.g. gesture recognition
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/21 Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F18/214 Generating training patterns; Bootstrap methods, e.g. bagging or boosting
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/24 Classification techniques
    • G06F18/241 Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • General Physics & Mathematics (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • Evolutionary Biology (AREA)
  • Evolutionary Computation (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Artificial Intelligence (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Health & Medical Sciences (AREA)
  • General Health & Medical Sciences (AREA)
  • Psychiatry (AREA)
  • Social Psychology (AREA)
  • Human Computer Interaction (AREA)
  • Multimedia (AREA)
  • Image Analysis (AREA)

Abstract

The invention discloses a behavior recognition method, apparatus and storage medium. The scheme obtains a video to be detected and adds multiple candidate windows to it; based on a feature extraction network, it generates three-dimensional feature maps of the video, containing the multiple candidate windows, at multiple temporal scales; it determines the temporal scale matching the video clip in each candidate window, obtains the three-dimensional feature map corresponding to the determined scale, and extracts from it the local feature map corresponding to the clip; behavior recognition is then performed on the local feature map with a preset behavior recognition network, determining the behavior category corresponding to the behavior features in the clip. Because the feature extraction network obtains three-dimensional feature maps at multiple temporal scales from the video, the receptive field of the classifier can adapt to behavior features of different durations, improving the accuracy of recognizing behaviors with a variety of time spans.

Description

Behavior recognition method, apparatus and storage medium
Technical field
The present invention relates to the technical field of data processing, and in particular to a behavior recognition method, apparatus and storage medium.
Background technique
With the continuing growth of demand for computer intelligence and the rapid development of pattern recognition, image processing and artificial intelligence technology, analyzing video content with computer vision has enormous practical demand, for example detecting human behavior in video. Most prior-art approaches use multi-layer neural networks to learn complex and diverse feature patterns from training data, so as to efficiently extract features from an input video and recognize specific behaviors.
In practical applications, most surveillance videos and online videos are long, unsegmented videos that may contain multiple behavior instances, each of possibly different duration. Existing behavior recognition schemes, however, generally compress or extend the video into a clip with a fixed number of frames and use a neural network to extract features at a single temporal scale. As a result, the receptive field of the classifier can only match behaviors of one particular duration, and recognition accuracy is poor for behaviors that last too long or too short.
Summary of the invention
The embodiments of the present invention provide a behavior recognition method, apparatus and storage medium, intended to improve the accuracy of recognizing behaviors with a variety of time spans.
An embodiment of the present invention provides a behavior recognition method, comprising:
obtaining a video to be detected and adding multiple candidate windows to it, wherein each candidate window corresponds to one video clip of the video to be detected;
generating, based on a feature extraction network, three-dimensional feature maps of the video to be detected, containing the multiple candidate windows, at multiple temporal scales;
determining the temporal scale matching the video clip in each candidate window, and obtaining the three-dimensional feature map corresponding to the determined scale;
obtaining, from the obtained three-dimensional feature map, the local feature map corresponding to the video clip;
performing behavior recognition on the video clips in the multiple candidate windows according to the local feature maps and a preset behavior recognition network, and determining the behavior category corresponding to the behavior features in each video clip.
An embodiment of the present invention also provides a behavior recognition apparatus, comprising:
a video acquisition unit for obtaining the video to be detected;
a windowing unit for adding multiple candidate windows to the video to be detected, wherein each candidate window corresponds to one video clip of the video;
a feature acquisition unit for generating, based on a feature extraction network, three-dimensional feature maps of the video to be detected, containing the multiple candidate windows, at multiple temporal scales;
a scale matching unit for determining the temporal scale matching the video clip in each candidate window and obtaining the three-dimensional feature map corresponding to the determined scale;
a feature selection unit for obtaining, from the obtained three-dimensional feature map, the local feature map corresponding to the video clip;
a behavior recognition unit for performing behavior recognition on the video clips in the multiple candidate windows according to the local feature maps and a preset behavior recognition network, and determining the behavior category corresponding to the behavior features in each video clip.
An embodiment of the present invention also provides a storage medium storing a plurality of instructions suitable for loading by a processor to execute the steps of any behavior recognition method provided by the embodiments of the present invention.
The embodiment of the present invention obtains a video to be detected and adds multiple candidate windows to it, each corresponding to one video clip of the video. Based on a feature extraction network, it then generates three-dimensional feature maps of the video, containing the multiple candidate windows, at multiple temporal scales, where the larger the temporal scale, the longer the duration covered by a single feature in the feature map. It determines the temporal scale matching the video clip in each candidate window, obtains the three-dimensional feature map corresponding to the determined scale, and extracts the clip's local feature map from it: if the clip is short, the local feature map can be taken from a feature map with a small temporal scale; if the clip is long, from one with a large temporal scale. After the local feature map of the clip in each candidate window is extracted, behavior recognition is performed on the clips according to the local feature maps and a preset behavior recognition network, determining the behavior category corresponding to the behavior features in each clip. The scheme can therefore recognize behaviors of various durations in the video to be detected: even if one video contains multiple behaviors of different durations, the feature extraction network obtains three-dimensional feature maps at multiple temporal scales from it, so the receptive field of the classifier can adapt to behavior features of different durations, improving the accuracy of recognizing behaviors with a variety of time spans.
Detailed description of the invention
To describe the technical solutions in the embodiments of the present invention more clearly, the accompanying drawings required for describing the embodiments are briefly introduced below. Obviously, the drawings described below show only some embodiments of the present invention, and those skilled in the art can obtain other drawings from them without creative effort.
Fig. 1a is a scenario diagram of the information interaction system provided by an embodiment of the present invention;
Fig. 1b is a first flow diagram of the behavior recognition method provided by an embodiment of the present invention;
Fig. 1c is a schematic structural diagram of the feature extraction network provided by an embodiment of the present invention;
Fig. 1d is a schematic diagram of the convolution kernels of the spatially and temporally separated feature extraction network provided by an embodiment of the present invention;
Fig. 1e is a first network structure diagram of the behavior recognition method provided by an embodiment of the present invention;
Fig. 1f is a second network structure diagram of the behavior recognition method provided by an embodiment of the present invention;
Fig. 1g is a schematic diagram of the interpolation operation provided by an embodiment of the present invention;
Fig. 1h is a third network structure diagram of the behavior recognition method provided by an embodiment of the present invention;
Fig. 1i is a fourth network structure diagram of the behavior recognition method provided by an embodiment of the present invention;
Fig. 1j is a network training flow diagram provided by an embodiment of the present invention;
Fig. 1k is another network training flow diagram provided by an embodiment of the present invention;
Fig. 2a is a flow chart of a behavior recognition application scenario provided by an embodiment of the present invention;
Fig. 2b is a schematic diagram of a behavior recognition application scenario provided by an embodiment of the present invention;
Fig. 3a is a first structural diagram of the behavior recognition apparatus provided by an embodiment of the present invention;
Fig. 3b is a second structural diagram of the behavior recognition apparatus provided by an embodiment of the present invention;
Fig. 3c is a third structural diagram of the behavior recognition apparatus provided by an embodiment of the present invention;
Fig. 3d is a fourth structural diagram of the behavior recognition apparatus provided by an embodiment of the present invention;
Fig. 4 is a structural diagram of the server provided by an embodiment of the present invention.
Specific embodiment
The technical solutions in the embodiments of the present invention will be described clearly and completely below with reference to the accompanying drawings. Obviously, the described embodiments are only a part of the embodiments of the present invention, not all of them. Based on the embodiments of the present invention, all other embodiments obtained by those skilled in the art without creative effort fall within the protection scope of the present invention.
The embodiments of the present invention provide a behavior recognition method, apparatus and storage medium.
The embodiments of the present invention also provide an information interaction system, which includes any behavior recognition apparatus provided by the embodiments of the present invention. The behavior recognition apparatus can be integrated in a network device, such as a terminal or a server. In addition, the system may include other equipment, for example a video capture device or a terminal (a mobile phone, tablet computer, personal computer, etc.) used to upload the video to be detected to the network device.
Referring to Fig. 1a, an embodiment of the present invention provides an information interaction system comprising a video capture device and a behavior recognition apparatus, connected over a wireless or wired network. The behavior recognition apparatus receives the video to be detected sent by the video capture device and adds multiple candidate windows to it, each corresponding to one video clip of the video; based on a feature extraction network, it generates three-dimensional feature maps of the video, containing the multiple candidate windows, at multiple temporal scales; it then determines the temporal scale matching each video clip, obtains the three-dimensional feature map corresponding to the determined scale, and crops the clip's local feature map from it; finally, it performs behavior recognition on the clips in the candidate windows according to the local feature maps and a preset behavior recognition network, determining the behavior category corresponding to the behavior features in each clip.
In this way, three-dimensional feature maps at multiple temporal scales can be extracted from one long video by the feature extraction network; a matching temporal scale and its three-dimensional feature map are selected for the video clip in each candidate window, the local feature map is extracted from that feature map, and behavior recognition is carried out on the local feature map with the behavior recognition network to determine the behavior category of the clip. Since the feature extraction network obtains feature maps at multiple temporal scales from the video, a local feature map with a small temporal scale can be used for short clips and one with a large temporal scale for long clips, so the receptive field of the classifier in the behavior recognition network can adapt to behavior features of different durations, improving the accuracy of recognizing behaviors with a variety of time spans.
The system architecture in Fig. 1a is only one example for implementing the embodiments of the present invention; the embodiments are not limited to the system structure shown in Fig. 1a, and each embodiment below is proposed on the basis of this architecture.
In this embodiment, the description is given from the perspective of the behavior recognition apparatus, which may be integrated in a device such as a server or a personal computer.
As shown in Fig. 1b, the flow of the behavior recognition method can be as follows:
101. Obtain a video to be detected and add multiple candidate windows to it, wherein each candidate window corresponds to one video clip of the video to be detected.
The video to be detected consists of a series of consecutive video frames; in this embodiment the number of frames contained in a video measures the length of the video to be detected and, below, of long videos and video clips. After the video to be detected is obtained, candidate windows can be added along its time dimension using multi-scale sliding windows. A candidate window is a selection box added on the time dimension of the video, containing one or more frames, and each candidate window corresponds to one video clip of the video. For example, the lengths of the candidate windows may take values such as 8, 16, 32, 64, 128, or 7, 10, 15, 18, 25, where the set of scales can be preset as needed and the unit is frames; a candidate window of length 8 thus encloses a clip of 8 frames. To improve recognition accuracy, adjacent candidate windows can overlap, with the degree of overlap set, for example, to 25%-75% (again measured in frames). A minimal sketch of this windowing step is given after this paragraph.
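As a concrete illustration, here is a sketch of the multi-scale sliding-window step in Python; the window lengths and the 50% overlap are example values chosen within the ranges mentioned above, not values prescribed by the patent.

```python
def generate_candidate_windows(num_frames, window_lengths=(8, 16, 32, 64, 128),
                               overlap=0.5):
    """Return (start, end) frame ranges covering the video at several scales.

    Adjacent windows of each length overlap by `overlap`
    (the text above suggests 25%-75%).
    """
    windows = []
    for length in window_lengths:
        stride = max(1, int(length * (1 - overlap)))
        for start in range(0, max(1, num_frames - length + 1), stride):
            windows.append((start, start + length))  # [start, end) in frames
    return windows

# Example: candidate windows for a 256-frame video
windows = generate_candidate_windows(256)
```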
102. Based on the feature extraction network, generate three-dimensional feature maps of the video to be detected, containing the multiple candidate windows, at multiple temporal scales.
The temporal scale in the embodiments of the present application refers to the number of frames of the video to be detected that correspond to a single feature in a three-dimensional feature map; it measures the size of the feature map along the time dimension. The larger the temporal scale, the smaller the feature map itself is along the time dimension, and the more frames of the video correspond to one of its features; in other words, the larger the temporal scale, the longer the duration of the video covered by a single feature.
For example, from a video to be detected of size 800*800*256, three-dimensional feature maps at three temporal scales can be extracted with sizes 195*195*128, 195*195*64 and 195*195*32, where size is expressed as width * height * number of frames. The values 128, 64 and 32 are the sizes of the three feature maps along the time dimension. The feature map of size 195*195*128 has the smallest temporal scale: one of its features corresponds to 2 frames of the original video, i.e. its temporal scale is 2. The feature map of size 195*195*32 has the largest temporal scale: one of its features corresponds to 8 frames of the original video, i.e. its temporal scale is 8.
In the feature extraction stage, the candidate windows are not used to split the video to be detected: the input to the feature extraction network remains the complete long video, and the video containing the multiple candidate windows is fed into the pre-trained feature extraction network for feature extraction.
The feature extraction network is a three-dimensional convolutional neural network (3D-Convolutional Neural Network, 3D-CNN). A general 3D-CNN mainly comprises convolutional layers, pooling layers and fully connected layers; the feature extraction network in this embodiment contains multiple convolutional layers and no fully connected layer, and a pooling layer may or may not be placed after each convolutional layer. A pooling layer compresses the input feature map, reducing the computational complexity of the network while extracting the main features, for example with max pooling. If the video to be detected has high resolution but is short, pooling layers can compress the video along the spatial dimensions to reduce the number of weight parameters and the computational cost, while no pooling is applied along the time dimension so as to retain as many temporal features as possible. Hyperparameters of the convolutional layers, such as kernel size, stride, amount of zero padding and number of kernels, as well as the number of convolutional layers, can be set empirically.
To mine the temporal features between consecutive video frames, the convolutional layers apply three-dimensional convolution kernels to the three-dimensional feature maps they receive, i.e. the kernel size along the time dimension is greater than or equal to 2. A neuron of a convolutional layer is connected only to part of the neurons of the previous layer; the convolutions are performed layer by layer, and each layer outputs a three-dimensional feature map. For example, for a video of 800*800*256 frames, a convolution with an 8*8*2 kernel and stride 2*2*2 (stride 2 in all three dimensions) outputs a feature map of 397*397*128, i.e. a stack of 128 consecutive 397*397 maps. Note that because this scheme applies three-dimensional convolution to the video, each convolutional layer outputs a three-dimensional feature map, formed by stacking consecutive two-dimensional feature maps. This size arithmetic can be checked with the sketch below.
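The example's dimensions can be verified with a single 3D convolution; the sketch below uses PyTorch and the (batch, channels, frames, height, width) tensor layout, both assumptions of this illustration rather than requirements of the patent.

```python
import torch
import torch.nn as nn

# Kernel 2 (time) x 8 x 8 (space), stride 2 in all three dimensions,
# matching the 8*8*2 kernel / 2*2*2 stride example above.
conv = nn.Conv3d(in_channels=3, out_channels=1, kernel_size=(2, 8, 8), stride=2)

video = torch.randn(1, 3, 256, 800, 800)  # 256 frames of 800x800 RGB (~0.6 GB)
print(conv(video).shape)                  # torch.Size([1, 1, 128, 397, 397])
```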
Since the multi-scale aspect of the three-dimensional feature maps in this scheme lies in the time domain, i.e. along the time dimension, the following mainly details the behavior of the convolutional layers' output feature maps along the time dimension; the size changes of the video frames under spatial convolution are not described further. Along the spatial dimensions, kernels of standard size can be used, for example 3*3 or 2*2 with stride 1, with the number of kernels set empirically. The kernel size along the time dimension can be set according to the extent of the local features to be observed. To keep the size of the data volume unchanged along the time dimension after convolution, the temporal stride can be set to 1, with the corresponding zero padding chosen according to the temporal kernel size. To reduce the output size along the time dimension, a stride greater than 1 can be used; alternatively, the temporal stride is kept at 1 and a pooling layer is added after the convolutional layer to downsample the output.
For example, referring to Fig. 1c, from top to bottom are the video to be detected, the first convolutional layer and the second convolutional layer. In the video to be detected, each cell is one video frame; in the two convolutional layers, each cell is one convolution kernel (i.e. one neuron). Note that the diagram only shows the convolution along the time dimension. The input video is 256 frames long, and the kernel of the first convolutional layer has size 2 and stride 2 along the time dimension, so in the first layer a neuron corresponds to 2 frames of the original input video and observes the features between those 2 frames; after convolution, the output feature map has length 128 along the time dimension. The kernel of the second convolutional layer also has temporal size 2 and stride 2, so a neuron of the second layer is connected to two neurons of the first layer (assuming no pooling layer between the two convolutional layers) and corresponds to 4 frames of the original input video, observing the features between those 4 frames; after convolution, the output feature map has temporal length 64. As the depth of the convolutional layers increases, the temporal size of the output feature maps gradually decreases while the number of original video frames connected to one neuron gradually grows, i.e. the temporal receptive field of a neuron in the original video grows, so the temporal scale of the feature maps output by successive convolutional layers also grows.
Specifically, in some embodiments, the feature extraction network is a three-dimensional convolutional neural network containing multiple convolutional layers, and the step of "generating, based on the feature extraction network, three-dimensional feature maps of the video to be detected, containing the multiple candidate windows, at multiple temporal scales" comprises:
inputting the video to be detected, containing the multiple candidate windows, into the feature extraction network and performing the convolution operations layer by layer through the multiple convolutional layers;
obtaining the three-dimensional feature maps output by the last several consecutive convolutional layers as the feature maps of the video at multiple temporal scales, where the deeper the convolutional layer, the larger the temporal scale.
For example, if the feature extraction network contains 7 convolutional layers, the outputs of the last 5 convolutional layers can be taken as the three-dimensional feature maps of the video at five temporal scales; see the sketch below.
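A hedged sketch of this "last few layers as a temporal feature pyramid" idea follows; the layer count, channel width and kernel shape are assumptions made for illustration.

```python
import torch
import torch.nn as nn

class MultiScaleExtractor(nn.Module):
    """Stack of 3D conv layers; the outputs of the last `num_scales` layers
    serve as the feature maps at increasingly large temporal scales."""
    def __init__(self, num_layers=7, num_scales=5, channels=32):
        super().__init__()
        layers, in_ch = [], 3
        for _ in range(num_layers):
            # temporal kernel 2, stride 2: each layer halves the temporal length
            layers.append(nn.Conv3d(in_ch, channels, kernel_size=(2, 3, 3),
                                    stride=(2, 1, 1), padding=(0, 1, 1)))
            in_ch = channels
        self.layers = nn.ModuleList(layers)
        self.num_scales = num_scales

    def forward(self, x):                  # x: (N, C, T, H, W)
        outputs = []
        for layer in self.layers:
            x = torch.relu(layer(x))
            outputs.append(x)
        return outputs[-self.num_scales:]  # deepest layers = largest scales
```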
Optionally, in some embodiments, the convolutional neural network is a three-dimensional convolutional neural network with separated spatial and temporal convolutions. Fig. 1d is a schematic diagram of the convolution kernel structure in the feature extraction network provided by an embodiment of the present invention. A kernel of size 1*d*d is used in the spatial domain (assuming the video frames are as wide as they are tall) and a kernel of size t*1*1 in the time domain, i.e. one convolutional layer contains a two-dimensional spatial kernel and a one-dimensional temporal kernel. Note that when a convolutional layer operates on its input, the spatial and temporal convolutions are carried out separately; either the temporal or the spatial convolution may be performed first. For example, when such a convolutional layer performs its convolution, it applies the two-dimensional spatial kernel and the one-dimensional temporal kernel in turn to the input three-dimensional feature map. Specifically, if the spatial convolution is performed first, the layer applies the 1*d*d spatial kernel to each frame of the input sequence, yielding a series of consecutive two-dimensional feature maps; the t*1*1 temporal kernel then convolves along the time dimension, i.e. over the pixel data at the same position across the consecutive two-dimensional maps.
Because this scheme convolves three-dimensional video data and uses the unsegmented long video, the feature extraction network has a very large number of weight parameters to learn, possibly with redundancy, which easily causes overfitting. Using a three-dimensional convolutional neural network with separated spatial and temporal kernels reduces the number of parameters and speeds up computation while reducing the degree of overfitting, improving the accuracy of behavior detection. A sketch of such a factorized layer follows.
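A minimal sketch of one such factorized layer, under the kernel shapes just described (channel counts and padding choices are assumptions):

```python
import torch.nn as nn

class SeparableConv3d(nn.Module):
    """Factorized 3D convolution: a 1 x d x d spatial kernel followed by a
    t x 1 x 1 temporal kernel, replacing a full t x d x d kernel."""
    def __init__(self, in_ch, out_ch, d=3, t=2):
        super().__init__()
        self.spatial = nn.Conv3d(in_ch, out_ch, kernel_size=(1, d, d),
                                 padding=(0, d // 2, d // 2))  # 'same' spatially
        self.temporal = nn.Conv3d(out_ch, out_ch, kernel_size=(t, 1, 1),
                                  stride=(t, 1, 1))            # shrinks time

    def forward(self, x):                  # x: (N, C, T, H, W)
        return self.temporal(self.spatial(x))
```

Roughly speaking, a full kernel has t*d*d weights per position while the factorized pair has d*d + t, which is the parameter reduction the paragraph above appeals to.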
In some embodiments, the three-dimensional feature maps at multiple temporal scales can also be obtained using dilated convolutions. Specifically, the feature extraction network is a three-dimensional convolutional neural network containing multiple dilated convolutional layers, and the step of "generating, based on the feature extraction network, three-dimensional feature maps of the video to be detected, containing the multiple candidate windows, at multiple temporal scales" comprises:
inputting the video to be detected, containing the multiple candidate windows, into the feature extraction network and performing convolution layer by layer through the multiple dilated convolutional layers, each with its corresponding dilation factor;
obtaining the three-dimensional feature maps output by the last several consecutive dilated convolutional layers as the feature maps of the video at multiple temporal scales.
Here the feature extraction network is a three-dimensional convolutional neural network containing multiple dilated convolutional layers. Each dilated convolutional layer performs its convolution according to the dilation factor configured for that layer. Dilated convolution enlarges the receptive field of the neurons without losing information through pooling, so each convolution output covers a larger range of the input. The dilation factor of each layer can be set according to the temporal scale needed, so that the output feature maps have the required temporal scales. Note that the dilation can be applied only along the time dimension, or along both the spatial and time dimensions. A sketch of a temporally dilated layer follows.
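A sketch of a 3D convolution dilated along the time axis only (the dilation factors and kernel size are illustrative):

```python
import torch.nn as nn

def dilated_temporal_conv(in_ch, out_ch, dilation):
    """3D conv dilated on the time axis: the temporal receptive field grows
    with `dilation` while no pooling discards information."""
    return nn.Conv3d(in_ch, out_ch, kernel_size=(3, 3, 3),
                     dilation=(dilation, 1, 1),
                     padding=(dilation, 1, 1))  # keeps T, H, W unchanged

# Stacking layers with dilations 1, 2, 4, ... yields outputs whose temporal
# receptive fields, and hence temporal scales, grow layer by layer.
```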
103. Determine the temporal scale matching the video clip in each candidate window, and obtain the three-dimensional feature map corresponding to the determined scale.
After feature extraction, three-dimensional feature maps of the video to be detected, containing the multiple candidate windows, are obtained at multiple temporal scales. Intuitively, the video clip in each candidate window corresponds to feature maps at all of these scales. Next, according to the length of each candidate window, a matching temporal scale and its three-dimensional feature map are selected for the clip in that window. Specifically, the step of "determining the temporal scale matching the video clip in the candidate window and obtaining the three-dimensional feature map corresponding to the determined scale" comprises: determining the number of video frames contained in the clip in the candidate window; and, according to that number, determining the temporal scale matching the clip and obtaining the three-dimensional feature map corresponding to the determined scale.
Because this embodiment adds candidate windows with multi-scale sliding windows, the candidate windows have different lengths. The selection of the matched temporal scale for a candidate window mainly involves the following parameters: the length of the window, the kernel size along the time dimension, and the stride of the convolution. For any convolutional layer, the temporal scale of a neuron of that layer can be computed from the temporal kernel size F, the convolution stride S, and the temporal scale of a neuron in the previous layer. The length of the clip equals the length of its candidate window; the temporal scale closest to the window length is selected as the scale corresponding to the clip, and the three-dimensional feature map at that scale is taken as the feature map corresponding to the clip. If the window is too long, e.g. longer than the number of original frames covered by a feature at the largest temporal scale, the largest temporal scale is selected for the clip. A sketch of this matching follows.
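Under these parameters, a hedged sketch of the matching rule; the doubling recurrence below holds for the uniform case F=2, S=2 used in the worked example at the end of this document, while other configurations need the general receptive-field recurrence.

```python
def layer_temporal_scales(num_layers, stride_t=2):
    """Temporal scale (original frames seen by one feature) per layer,
    for a uniform stack with temporal kernel 2 and stride 2."""
    scales, scale = [], 1
    for _ in range(num_layers):
        scale *= stride_t            # doubles at every layer
        scales.append(scale)
    return scales                    # e.g. [2, 4, 8, 16, 32, 64]

def match_scale(window_len, scales):
    """Pick the available temporal scale closest to the window length;
    windows longer than the largest scale fall back to the largest."""
    return min(scales, key=lambda s: abs(s - window_len))

assert match_scale(23, layer_temporal_scales(6)) == 16
```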
104. Obtain, from the obtained three-dimensional feature map, the local feature map corresponding to the video clip.
After the temporal scale is determined, the three-dimensional feature map corresponding to the determined scale is obtained from the multiple feature maps, and the local feature map corresponding to the video clip is then extracted from it. For example, if the clip in some candidate window spans frames 80 to 102 of the video to be detected, then after its feature map is determined, the partial three-dimensional feature map corresponding to frames 80 to 102 is cropped out of that feature map as the local feature map of the clip. Following this operation, the local feature map corresponding to the clip in each candidate window can be extracted, as in the sketch below.
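A sketch of the cropping step, assuming the simple frame-to-feature index mapping the example implies (each temporal index of the feature map covers `temporal_scale` input frames):

```python
def crop_local_features(feature_map, start_frame, end_frame, temporal_scale):
    """feature_map: (C, T', H, W) for the whole video at one temporal scale.
    Returns the temporal slice covering frames [start_frame, end_frame)."""
    t0 = start_frame // temporal_scale
    t1 = max(t0 + 1, -(-end_frame // temporal_scale))  # ceiling division
    return feature_map[:, t0:t1]

# Example: frames 80..102 at temporal scale 8 -> feature indices 10..12.
```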
105. Perform behavior recognition on the video clips in the multiple candidate windows according to the local feature maps and the preset behavior recognition network, and determine the behavior category corresponding to the behavior features in each clip.
After the local feature map of each clip is obtained, the behavior features in the clips are recognized according to the local feature maps and the behavior recognition network, determining the behavior category corresponding to the behavior features in each clip. Specifically, the step of "performing behavior recognition on the video clips in the candidate windows according to the local feature maps and the preset behavior recognition network, and determining the behavior category corresponding to the behavior features in each clip" comprises:
selecting, according to the local feature maps and a preset temporal proposal network, the clips containing behavior features from the clips in the multiple candidate windows, as proposal segments;
determining, according to the local feature maps of the proposal segments and the behavior recognition network, the behavior category corresponding to each proposal segment.
The local feature maps of the clips in the candidate windows are input into the preset temporal proposal network for preliminary behavior detection, which judges whether the clip in a candidate window is a behavior segment or a background segment; the clips containing specific behavior features are filtered out of all clips as proposal segments. A clip that contains no behavior feature at all is a background segment.
After the proposal segments are selected, their local feature maps serve as the input data for behavior recognition based on the behavior recognition network, which determines the behavior category corresponding to each proposal segment.
Since the local feature maps have already passed through multiple convolutions, the behavior recognition network may contain no convolutional layer and at least one fully connected layer. In practice, the scheme can use several fully connected layers, the last of which is the classifier containing M+1 nodes: M behavior categories (M a positive integer greater than 1) plus one background category. The local feature map of a proposal segment is input into the network, the confidence of each node is output, and the category of the node with the highest confidence is the behavior category of the clip, as sketched below.
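A hedged sketch of this M+1-way classifier head (the hidden width and the use of PyTorch are assumptions):

```python
import torch.nn as nn

class ClassifierHead(nn.Module):
    """Fully connected classifier: M behavior classes plus 1 background class."""
    def __init__(self, feat_dim, num_behaviors):
        super().__init__()
        self.fc = nn.Sequential(
            nn.Linear(feat_dim, 512), nn.ReLU(),
            nn.Linear(512, num_behaviors + 1),  # the M + 1 output nodes
        )

    def forward(self, x):   # x: flattened local feature map of a proposal
        return self.fc(x)   # per-node confidences
```

In use, `head(x).argmax(dim=-1)` returns the index of the highest-confidence node, with index 0 reserved for the background class (a convention of this sketch).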
Referring to Fig. 1e, the scheme combines the following three networks into one complete behavior recognition model: the feature extraction network, the temporal proposal network and the behavior recognition network. The output of each network serves as the input of the next, ultimately identifying the multiple behaviors contained in one long video and determining their behavior categories.
Referring to Fig. 1f, to improve the recognition accuracy of the behavior recognition network, it is additionally provided with an interpolation layer, which adjusts the local feature map of each proposal segment to a preset length along the time dimension. Specifically, the behavior recognition network comprises an interpolation layer and fully connected layers, and the step of "determining, according to the local feature map of the proposal segment and the behavior recognition network, the behavior category corresponding to the proposal segment" comprises:
inputting the local feature map of the proposal segment into the behavior recognition network, and adjusting the local feature map to the preset length along the time dimension in the interpolation layer;
inputting the local feature map output by the interpolation layer into the fully connected layers for behavior recognition, and determining the behavior category corresponding to the proposal segment.
The role of the interpolation layer is to resize the temporal lengths of all proposal segments to the preset length, so that the inputs to the fully connected layers have the same size along the time dimension while fine positional information is retained, improving the classification accuracy of the fully connected layers. For example, to resize a clip from 8 frames to 12 frames, one frame can be interpolated after every two frames. Referring to Fig. 1g, the frames drawn with dotted lines are the image frames added by interpolation; each interpolated image can be computed from its two adjacent frames, for example using a linear or bilinear interpolation algorithm to adjust the length of the proposal segment. A sketch of the temporal resize follows.
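The temporal resize can be written directly with trilinear interpolation; using PyTorch's `F.interpolate` here is an assumption of this sketch, since the text only requires some linear or bilinear scheme.

```python
import torch
import torch.nn.functional as F

def resize_temporal(local_feat, target_len):
    """local_feat: (N, C, T, H, W); interpolate T to `target_len`
    while leaving the spatial dimensions unchanged."""
    _, _, _, h, w = local_feat.shape
    return F.interpolate(local_feat, size=(target_len, h, w),
                         mode='trilinear', align_corners=False)

# e.g. the 8-frame example above stretched to 12 temporal steps
feat = resize_temporal(torch.randn(1, 16, 8, 12, 12), target_len=12)
```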
Referring to Fig. 1h, in some embodiments the temporal proposal network comprises a first fully connected layer and a second fully connected layer, and the step of "selecting, according to the local feature maps and the preset temporal proposal network, the clips containing behavior features from the clips in the multiple candidate windows, as proposal segments" comprises:
detecting, according to the local feature maps and the first fully connected layer, whether the clips in the multiple candidate windows contain behavior features;
taking the clips containing behavior features as the proposal segments.
After the clips containing behavior features are selected from the clips in the multiple candidate windows as proposal segments, the method further comprises:
performing boundary regression on the proposal segments in the second fully connected layer to obtain the first time boundary of each proposal segment.
The temporal proposal network contains at least one first fully connected layer and at least one second fully connected layer, where the last first fully connected layer is used for behavior detection, selecting the proposal segments from all clips, and the last second fully connected layer is used for boundary regression, preliminarily determining the first time boundary at which the behavior occurs in each proposal segment. The local feature maps of all clips are input into the temporal proposal network; the first fully connected layer performs behavior detection, detecting whether a clip contains behavior features and taking the clips that do as proposal segments; the second fully connected layer performs boundary regression on the proposal segments, determining the first time boundary of each, i.e. the start frame and end frame at which the behavior occurs. The scheme can thus not only identify which categories of behavior a long video contains, but also determine the time periods in which they occur. A sketch of this two-branch head follows.
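A sketch of the two-branch proposal head (the feature dimension and the (start, end) parameterization of the regressed boundary are assumptions):

```python
import torch.nn as nn

class TemporalProposalHead(nn.Module):
    """First FC branch: behavior-vs-background score for a candidate window;
    second FC branch: coarse (start, end) regression, the first time boundary."""
    def __init__(self, feat_dim):
        super().__init__()
        self.cls = nn.Linear(feat_dim, 2)   # behavior / background
        self.reg = nn.Linear(feat_dim, 2)   # (start, end) offsets

    def forward(self, x):                   # x: flattened local feature map
        return self.cls(x), self.reg(x)
```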
Referring to Fig. 1i, in some embodiments the fully connected layers of the behavior recognition network comprise a third fully connected layer and a fourth fully connected layer, and the step of "inputting the local feature map output by the interpolation layer into the fully connected layers, performing behavior recognition on the proposal segment and determining its behavior category" may comprise:
inputting the local feature map output by the interpolation layer into the third fully connected layer, performing behavior recognition on the proposal segment and determining its behavior category.
After the behavior category corresponding to the proposal segment is determined, the method further comprises:
inputting the local feature map of the proposal segment into the fourth fully connected layer, performing boundary regression on the proposal segment and obtaining its second time boundary.
The fully connected layers of the behavior recognition network may thus comprise a third and a fourth fully connected layer: the third determines the behavior category of the proposal segment, and the fourth performs boundary regression once more on the classified proposal segment to obtain the second time boundary. In this way, the period in which the behavior occurs is determined more precisely on the basis of the first time boundary, i.e. the behavior start frame and end frame are located accurately. A sketch of this recognition stage follows.
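Putting the last few components together, a hedged sketch of the recognition stage at inference time, reusing the `resize_temporal` sketch above; the fixed length 12 and the helper names are assumptions of this illustration.

```python
import torch

def recognize_proposal(proposal_feat, fc3, fc4, target_len=12):
    """proposal_feat: (N, C, T, H, W) local features of a batch of proposals.
    Interpolate to a fixed temporal length, classify (third FC layer),
    then regress the refined second time boundary (fourth FC layer)."""
    x = resize_temporal(proposal_feat, target_len)  # sketch above
    x = x.flatten(start_dim=1)                      # one vector per proposal
    category = fc3(x).argmax(dim=-1)                # M+1-way class, 0 = background
    second_boundary = fc4(x)                        # refined (start, end)
    return category, second_boundary
```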
Optionally, the embodiments of the present application also include the training process of the networks. In this embodiment the three networks form one complete behavior recognition model, so they are trained as a whole. The method further comprises:
collecting sample videos and adding multiple candidate windows to each sample video, wherein each candidate window corresponds to one video clip of the sample video;
adding a two-class label and a multi-class label to the clip in each candidate window, wherein the two-class labels comprise a behavior label and a background label;
generating, according to the sample video with the added two-class and multi-class labels and the feature extraction network with initialized weights, three-dimensional feature maps of the sample video, containing the multiple candidate windows, at multiple temporal scales;
determining the temporal scale matching each clip and obtaining the three-dimensional feature map corresponding to the determined scale;
obtaining, from the obtained three-dimensional feature map, the local feature map corresponding to each clip;
taking the local feature maps with behavior labels as positive samples and the local feature maps with background labels as negative samples, and inputting the positive and negative samples into the temporal proposal network for training;
inputting the local feature maps with multi-class labels into the behavior recognition network for training;
repeating the above steps iteratively until the loss functions of the temporal proposal network and the behavior recognition network fall below a preset threshold, and determining the parameters of the feature extraction network, the temporal proposal network and the behavior recognition network.
Referring to Fig. 1j, the sample videos with two-class and multi-class labels serve as the training samples. The two-class labels are used in the training stage of the temporal proposal network and the multi-class labels in the training stage of the behavior recognition network. The structure of the feature extraction network and the necessary hyperparameters are preset, and the weights are initialized according to a preset algorithm, for example Xavier initialization or Gaussian-distribution initialization.
In this embodiment, the temporal proposal network solves a binary classification problem, namely judging whether a clip contains a specific behavior feature. It comprises convolutional layers and at least one fully connected layer with two nodes, each node fully connected to the neurons of the previous layer; the number of convolutional layers can be set as needed and is not specifically limited. The temporal proposal network is trained on the local feature maps of the sample videos with two-class labels. The behavior recognition network solves a multi-class classification problem; it is a convolutional neural network trained on the local feature maps of the sample videos with multi-class labels.
A large number of long videos containing one or more specific actions are collected as sample videos; candidate windows are added to each sample video following the method of step 101, and a two-class label and a multi-class label are added to the clip in each window. The two-class labels comprise a behavior label and a background label, marking whether a behavior occurs in the clip of the candidate window: for example, label=1 is the behavior label, indicating that the clip in the window contains a behavior feature, and label=0 is the background label, indicating that the clip contains no behavior feature and is a background segment. The sample labels used in training the behavior recognition network are the multi-class labels, i.e. M+1 classes: when label=1, 2, 3 ... M, a specific behavior of the corresponding category occurs in the clip, and when label=0 the clip is a background segment.
The sample videos with both kinds of labels are input into the feature extraction network with initialized weights; the three-dimensional feature maps at multiple temporal scales are extracted, the matching temporal scale is selected according to the length of each candidate window, and the local feature map matching the clip is cropped from the full feature map. Since each candidate window carries either a behavior label or a background label, each local feature map carries the corresponding label. The local feature maps with behavior labels are taken as positive samples and those with background labels as negative samples, and both are input into the pre-built convolutional neural network for training. The training of the behavior recognition network is similar, except that the clips in the candidate windows use the multi-class labels: the local feature maps of the behavior segments with multi-class labels are input into the behavior recognition network for training. Training proceeds iteratively as described above until the loss functions of the temporal proposal network and the behavior recognition network fall below the preset threshold. Since training a convolutional neural network is precisely a process of continually minimizing the loss function, once the loss reaches the target, i.e. falls below the preset threshold, training is complete, and the weight parameters of the three networks can be determined. A compressed sketch of such a joint training step follows.
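A compressed sketch of one joint training step under the two-label / multi-label scheme just described; the losses, the optimizer and the `crop_for_window` helper are assumptions of this illustration (the text only requires iterating until the losses fall below a preset threshold).

```python
import torch
import torch.nn as nn

loss_2way = nn.CrossEntropyLoss()  # proposal network: behavior vs. background
loss_mway = nn.CrossEntropyLoss()  # recognition network: M + 1 classes

def train_step(backbone, proposal_head, recog_head, optimizer,
               video, windows, binary_labels, class_labels):
    feats = backbone(video)                               # multi-scale feature maps
    local = [crop_for_window(feats, w) for w in windows]  # hypothetical crop helper
    x = torch.stack(local).flatten(start_dim=1)
    p_logits, _ = proposal_head(x)                        # see sketch above
    r_logits = recog_head(x)
    loss = loss_2way(p_logits, binary_labels) + loss_mway(r_logits, class_labels)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()        # iterate until below the preset threshold
```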
It can be understood that in other embodiments a complete behavior recognition model may also be formed by the feature extraction network and the behavior recognition network alone, with the two networks trained as a whole; in that case only multi-class labels need to be added to the sample videos. Specifically, the training process comprises:
collecting sample videos and adding multiple candidate windows to each sample video, wherein each candidate window corresponds to one video clip of the sample video;
adding a class label to the clip in each candidate window;
generating, according to the labeled sample video and the feature extraction network with initialized weights, three-dimensional feature maps of the sample video, containing the candidate windows, at multiple temporal scales;
determining the temporal scale matching each clip and obtaining the three-dimensional feature map corresponding to the determined scale;
obtaining, from the obtained three-dimensional feature map, the local feature map corresponding to each clip;
inputting the local feature maps with class labels into the behavior recognition network for training;
repeating the above steps iteratively until the loss function of the behavior recognition network falls below a preset threshold, and determining the parameters of the feature extraction network and the behavior recognition network.
In some embodiments, a multi-task training scheme that adds a segmentation task to detection is used in the training stage. Referring to Fig. 1k, a video segmentation network is added after the temporal proposal network; it divides the video into individual frames and labels the resulting frames according to the labels of their clips. Introducing this precise per-frame segmentation task makes the model learn to classify the behavior of every frame, which helps optimize the weight parameters of the feature extraction network, the temporal proposal network and the behavior recognition network, and can thus significantly improve the accuracy of behavior detection. Moreover, the segmentation task of each category can be trained with a sigmoid function and a cross-entropy loss, so that different categories do not compete with each other; a sketch follows.
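The per-class sigmoid + cross-entropy training mentioned above corresponds to an independent binary loss for every class at every frame; a sketch with assumed shape conventions:

```python
import torch.nn as nn

# frame_logits: (N, T, M) per-frame scores for M behavior classes;
# frame_labels: (N, T, M) 0/1 targets derived from the clip labels.
bce = nn.BCEWithLogitsLoss()  # sigmoid + cross-entropy, applied per class

def segmentation_loss(frame_logits, frame_labels):
    """Independent binary loss per class and per frame, so that, as stated
    above, different categories do not compete with one another."""
    return bce(frame_logits, frame_labels.float())
```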
In summary, the embodiment of the present invention obtains a video to be detected and adds multiple candidate windows to it, each corresponding to one video clip of the video; based on the feature extraction network, it generates three-dimensional feature maps of the video, containing the multiple candidate windows, at multiple temporal scales; it then determines the temporal scale matching each clip, obtains the three-dimensional feature map corresponding to the determined scale, and extracts the clip's local feature map from it, selecting a feature map with a small temporal scale for short clips and one with a large temporal scale for long clips; after the local feature map of the clip in each candidate window is extracted, behavior recognition is carried out according to the local feature maps and the preset behavior recognition network, determining the behavior category corresponding to the behavior features in each clip. The scheme can recognize behaviors of various durations in the video to be detected: even if one video contains multiple behaviors of different durations, the feature extraction network obtains feature maps at multiple temporal scales from it, so the receptive field of the classifier can adapt to behavior features of different durations, improving the accuracy of recognizing behaviors with a variety of time spans.
The method described in the foregoing embodiments is further illustrated below by way of example.
For example, referring to Fig. 2a and Fig. 2b, in this embodiment the video behavior detection apparatus is described as being integrated in a network device such as a server.
(1) Training the convolutional neural networks.
This stage mainly includes training the feature extraction network, the temporal proposal network and the behavior recognition network. In this scheme the three networks are trained as a whole; for the specific training method, refer to the above embodiments, which will not be repeated here.
(2) Obtaining the video to be detected.
The server receives the video to be detected sent by a video capture device.
(3) Adding candidate windows to the video to be detected.
Candidate windows of multiple different sizes are added to the video to be detected based on multi-scale sliding windows, and candidate windows may overlap. Referring to Fig. 2b, a first candidate window to an n-th candidate window, n candidate windows in total, are added to the video to be detected.
(4) Obtaining the local feature map matching the video clip in each candidate window.
The obtained video to be detected is input into the pre-trained feature extraction network, which outputs three-dimensional feature maps of the video on multiple time domain scales. Referring to Fig. 2b, the feature extraction network extracts three-dimensional feature maps on i time domain scales: a first time domain scale, a second time domain scale, and so on up to an i-th time domain scale, where the specific value of i can be set by the user as needed, for example i = 3 to 7. Then, the local feature maps corresponding to the video clips in the first to n-th candidate windows are extracted from these three-dimensional feature maps. At this point, the local feature map matching the video clip in each candidate window has been obtained.
A specific feature extraction network is given here as an illustration. Assume that in one embodiment the feature extraction network includes the following six convolutional layers: convolutional layer 1, convolutional layer 2, convolutional layer 3, convolutional layer 4, convolutional layer 5 and convolutional layer 6, where the convolution kernels of all six layers have size F = 2 and stride S = 2 on the time dimension. The time domain scale of the three-dimensional feature map output by a convolutional layer is the number of video frame images of the original video connected to one neuron of that layer. Therefore, in convolutional layer 1, one neuron (i.e., one convolution kernel) is connected to two video frame images of the video to be detected, so the time domain scale of this layer's output is 2; in convolutional layer 2, one neuron is connected to two neurons of convolutional layer 1, and hence to four consecutive frames of the video to be detected, so after the convolution operation the time domain scale of this layer's output is 4; by analogy, the time domain scale of the output of convolutional layer 3 is 8, that of convolutional layer 4 is 16, that of convolutional layer 5 is 32, and that of convolutional layer 6 is 64. The maximum time domain scale of the three-dimensional feature maps this network can output is therefore 64.
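As a hedged illustration of this arithmetic (not part of the patent), the temporal receptive field, i.e. the time domain scale, of each layer in such a stack can be computed with the standard receptive-field recurrence for kernel size F = 2 and stride S = 2:

```python
# Sketch: time domain scale per layer for a stack of temporal convolutions with F=2, S=2.
# Each layer doubles the number of input frames one output neuron is connected to.
def temporal_scales(num_layers: int, F: int = 2, S: int = 2):
    scale, jump, scales = 1, 1, []
    for _ in range(num_layers):
        scale = scale + (F - 1) * jump   # receptive field grows by (F-1) * jump
        jump = jump * S                  # spacing of adjacent outputs, in input frames
        scales.append(scale)
    return scales

print(temporal_scales(6))  # [2, 4, 8, 16, 32, 64], matching the six-layer example above
```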
In addition, for this feature extraction network, the time domain scale L matching a candidate window can be calculated according to the following formula (reconstructed here from the surrounding definitions):

k = k0 + log2(ω / 64), L = 2^k

where k0 is a base value set to 6, i.e., the index of the last convolutional layer, 64 is the time domain scale of that layer's output, and ω is the length of the candidate window. The formula applies to convolution operations with stride 2, and the constant 64 can be replaced by the actual time domain scale of the feature map output by the last convolutional layer. Assuming the candidate window length is 32, it can be calculated that k = 5 and the time domain scale L = 32, so the three-dimensional feature map output by convolutional layer 5 is selected as the feature map matching the video clip in this candidate window, and the local feature map corresponding to that clip can then be extracted from the output of the fifth convolutional layer. For example, if the video to be detected is 512 frames long and one candidate window covers the 33rd frame to the 64th frame, i.e., 32 frames, then the local feature map corresponding to those 32 video frame images is extracted from the complete three-dimensional feature map.
It can be understood that the sliding window scales in the above example happen to match the time domain scales exactly. For other sliding window scales, for example window sizes of 10, 20, 30 or 40, the value of k calculated by the above formula may be fractional, in which case a floor operation is applied.
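The following short sketch implements the scale-matching rule as reconstructed above; it is illustrative only, and the base values k0 = 6 and base_scale = 64 come from the six-layer example:

```python
# Sketch: map a candidate window of length w frames to the convolutional layer whose
# output time domain scale best fits it, flooring fractional values of k.
import math

def match_layer(window_len: int, k0: int = 6, base_scale: int = 64) -> int:
    k = k0 + math.log2(window_len / base_scale)
    return int(math.floor(k))   # floor for window lengths that are not powers of two

print(match_layer(32))  # 5 -> use the feature map of convolutional layer 5 (scale 32)
print(match_layer(20))  # 4 -> floored, layer 4 (scale 16)
```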
(5) Screening proposal segments from the video to be detected, and determining the first time boundary at which a behavior occurs.
The pre-trained temporal proposal network screens, from the above candidate windows, the segments that may contain a specific behavior as proposal segments, and preliminarily determines the period in which the behavior occurs in each segment, i.e., determines the first time boundary. For the specific implementation, refer to the description of step 104 in the above behavior recognition method embodiment, which will not be repeated here. Referring to Fig. 2b, proposal segment 1, proposal segment 2, and so on up to proposal segment k, k proposal segments in total, are screened from the video to be detected.
(6) Determining the behavior category of each proposal segment, and determining the second time boundary of the behavior.
The behavior recognition network includes a classification sub-network and a regression sub-network. The last fully connected layer of the classification sub-network acts as a classifier that can recognize M+1 categories, including a background category, so as to accurately identify the behavior in each screened proposal segment; if it is detected that a proposal segment contains no behavior of any category, the segment is determined to be a background segment. After the behavior category of each proposal segment is determined, the regression sub-network performs boundary regression on the proposal segments with a determined behavior category (excluding background segments) to determine the second time boundary of the behavior, and the specific period in which the behavior occurs is then determined according to the second time boundary.
Through the above scheme, the server can perform behavior recognition on a long video sent by the video capture device, determine the categories of one or more specific behaviors occurring in the long video and the period in which each behavior occurs, and send the behavior categories and occurrence periods to the video capture device.
To implement the above method, an embodiment of the present invention further provides a behavior recognition apparatus, which may be integrated in a terminal device such as a server or a personal computer.
For example, as shown in Fig. 3a, the behavior recognition apparatus may include a video acquisition unit 301, a video windowing unit 302, a feature acquisition unit 303, a scale matching unit 304, a feature selection unit 305 and a behavior recognition unit 306, as follows:
(1) Video acquisition unit 301;
The video acquisition unit 301 is configured to obtain the video to be detected, which may be captured in real time by a video capture device or uploaded by a user through a terminal.
(2) Video windowing unit 302;
The video windowing unit 302 is configured to add multiple candidate windows to the video to be detected, each candidate window corresponding to one video clip of the video.
The video to be detected consists of a series of consecutive video frame images; in this embodiment, the length of a long video or a video clip is measured by the number of video frames it contains. After the video acquisition unit 301 obtains the video to be detected, the video windowing unit 302 uses multi-scale sliding windows to add candidate windows on the time dimension, for example windows of the following sizes: 8, 16, 32, 64 and 128. To improve the accuracy of behavior recognition, two adjacent candidate windows may overlap, and the degree of overlap can be set to 25%-75%.
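As a hedged illustration (not the patent's code), multi-scale sliding windows with a fixed overlap can be generated as follows; the 50% overlap used here is one value within the 25%-75% range mentioned above:

```python
# Sketch: multi-scale sliding windows on the time axis, returned as (start, end) frame pairs.
def candidate_windows(num_frames: int, sizes=(8, 16, 32, 64, 128), overlap=0.5):
    windows = []
    for size in sizes:
        stride = max(1, int(size * (1 - overlap)))   # adjacent windows of one size overlap by `overlap`
        for start in range(0, num_frames - size + 1, stride):
            windows.append((start, start + size))
    return windows

print(len(candidate_windows(512)))  # number of candidate windows for a 512-frame video
```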
(3) Feature acquisition unit 303;
The feature acquisition unit 303 is configured to generate, based on the feature extraction network, three-dimensional feature maps of the video to be detected containing the multiple candidate windows on multiple time domain scales.
The feature extraction network is a three-dimensional convolutional neural network; in this embodiment it includes multiple convolutional layers and no fully connected layer. The feature acquisition unit 303 performs feature extraction by applying three-dimensional convolution kernels to the three-dimensional feature map input to each layer, i.e., the size of each kernel on the time dimension is greater than or equal to 2. A neuron of one convolutional layer is connected only to some neurons of the previous layer; convolution operations are performed layer by layer, and three-dimensional feature maps are output. For the specific implementation, refer to the description of step 102 in the above behavior recognition method embodiment, which will not be repeated here.
Referring to Fig. 3 b, in some embodiments, feature extraction network is the Three dimensional convolution nerve net comprising multiple convolutional layers Network, feature acquiring unit 303 may include that convolution algorithm subelement 3031 and feature obtain subelement 3032, in which:
Convolution algorithm subelement 3031, for will include that feature described in the video input to be detected of multiple candidate windows mentions Network is taken, successively carries out convolution algorithm in the multiple convolutional layer;
Feature obtains subelement 3032, for obtaining the three-dimensional feature figure of last continuous multiple convolutional layer outputs, as institute State three-dimensional feature figure of the video to be detected on multiple time domain scales, wherein the number of plies of convolutional layer is deeper, and time domain scale is bigger.
In some embodiments, the feature extraction network uses a three-dimensional convolutional neural network in which the spatial domain and the time domain are separated. Fig. 1d is a schematic diagram of the convolution kernel structure in the feature extraction network provided by an embodiment of the present invention: a kernel of size 1*d*d is used in the spatial domain, and a kernel of size 1*1*t is used in the time domain. That is, one convolutional layer includes a two-dimensional spatial convolution kernel and a one-dimensional temporal convolution kernel. It should be noted that when a convolutional layer performs a convolution operation on the input data, the spatial convolution and the temporal convolution are performed separately, and either may be performed first. The convolution operation subunit is further configured to apply the two-dimensional spatial convolution kernel and the one-dimensional temporal convolution kernel in sequence to the input three-dimensional feature map when a convolutional layer of the feature extraction network performs convolution. Specifically, if the spatial convolution is performed first, the convolutional layer applies the 1*d*d two-dimensional spatial kernel to each of the consecutive input video frame images to obtain multiple consecutive two-dimensional feature maps; then the 1*1*t one-dimensional kernel performs convolution on the time dimension, i.e., along the depth direction, on the pixel data at the same position of the multiple consecutive two-dimensional feature maps.
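The sketch below shows one such space/time-separated layer, in the spirit of a (2+1)D factorization; it is an illustrative assumption, not the patent's implementation, and the channel counts, temporal stride and kernel sizes are arbitrary. Note that PyTorch orders 3D kernels as (time, height, width), so the 1*d*d spatial kernel becomes (1, d, d) and the 1*1*t temporal kernel becomes (t, 1, 1):

```python
import torch
import torch.nn as nn

class SeparatedConv3d(nn.Module):
    """One convolutional layer split into a spatial kernel and a temporal kernel."""
    def __init__(self, in_ch, out_ch, d=3, t=3):
        super().__init__()
        self.spatial = nn.Conv3d(in_ch, out_ch, kernel_size=(1, d, d),
                                 padding=(0, d // 2, d // 2))          # 1*d*d spatial kernel
        self.temporal = nn.Conv3d(out_ch, out_ch, kernel_size=(t, 1, 1),
                                  stride=(2, 1, 1), padding=(t // 2, 0, 0))  # 1*1*t temporal kernel

    def forward(self, x):                      # x: (batch, channels, frames, height, width)
        return self.temporal(self.spatial(x))  # spatial first here; either order is possible

x = torch.randn(1, 3, 64, 112, 112)            # a 64-frame RGB clip
print(SeparatedConv3d(3, 16)(x).shape)         # temporal stride 2 halves the frame axis
```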
Since this scheme performs convolution on three-dimensional video data, and uses the long video without segmentation, the feature extraction network would otherwise need to learn a very large, even redundant, number of weight parameters, which easily causes overfitting. Using a three-dimensional convolutional neural network with separated spatial and temporal domains reduces the number of parameters and increases computation speed while reducing the degree of overfitting, so the detection effect is also improved compared with ordinary three-dimensional convolution.
In some embodiments, the three-dimensional feature maps on multiple time domain scales can also be obtained using dilated convolution operations. Specifically, the feature extraction network is a three-dimensional convolutional neural network containing multiple dilated convolutional layers, and the feature acquisition unit 303 may be further configured to:
input the video to be detected containing the multiple candidate windows into the feature extraction network, and perform convolution operations in the multiple dilated convolutional layers in sequence according to their respective dilation coefficients;
obtain the three-dimensional feature maps output by the last several consecutive dilated convolutional layers as the three-dimensional feature maps of the video to be detected on multiple time domain scales.
Here the feature extraction network is a three-dimensional convolutional neural network containing multiple dilated convolutional layers. Each dilated convolutional layer performs dilated convolution according to the dilation coefficient set for it. Dilated convolution enlarges the receptive field of the neurons without losing information through pooling, so that each convolution output covers a larger range of information. The dilation coefficient of each dilated convolutional layer can therefore be set according to the required time domain scales, so that the output three-dimensional feature maps cover all the needed time domain scales.
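A minimal sketch of such a stack follows; it is illustrative only, and the per-layer dilation coefficients (1, 2, 4, 8) and channel counts are assumptions chosen to show the receptive field doubling without pooling:

```python
import torch.nn as nn

# Sketch: 3D convolutions dilated on the time axis; padding is chosen so the
# frame count is preserved while the temporal receptive field grows layer by layer.
layers, in_ch = [], 3
for dilation in (1, 2, 4, 8):                     # per-layer dilation coefficients
    layers.append(nn.Conv3d(in_ch, 16, kernel_size=(3, 3, 3),
                            dilation=(dilation, 1, 1),
                            padding=(dilation, 1, 1)))
    in_ch = 16
dilated_extractor = nn.Sequential(*layers)
```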
(4) Scale matching unit 304;
The scale matching unit 304 is configured to determine the time domain scale matching the video clip in each candidate window, and obtain the three-dimensional feature map corresponding to the determined time domain scale.
After the three-dimensional feature maps of the long video on multiple time domain scales are obtained, intuitively, the video clip in each candidate window can correspond to feature maps of several time domain scales. Next, the scale matching unit 304 selects, according to the length of each candidate window, the three-dimensional feature map of the matching time domain scale for the video clip in that window.
Referring to Fig. 3 c, in some embodiments, feature extraction network is the Three dimensional convolution nerve net comprising multiple convolutional layers Network, scale matching unit 304 may include that quantity determines that subelement 3041 and scale determine subelement 3042, in which:
Quantity determines subelement 3041, for determining the number of video frame images that the video clip in the candidate window includes Amount;
Scale determines subelement 3042, is used for according to the quantity, the determining and matched time domain scale of the video clip, And obtain three-dimensional feature figure corresponding with the time domain scale determined.
(5) Feature selection unit 305;
The feature selection unit 305 is configured to obtain, according to the obtained three-dimensional feature map, the local feature map corresponding to the video clip.
After the time domain scale and the corresponding three-dimensional feature map are determined, the feature selection unit 305 crops the local feature map corresponding to the video clip from the determined three-dimensional feature map.
For the specific implementation, refer to the descriptions of step 103 and step 104 in the above behavior recognition method embodiment, which will not be repeated here.
(6) Behavior recognition unit 306;
The behavior recognition unit 306 is configured to perform behavior recognition on the video clips in the candidate windows according to the local feature maps and the preset behavior recognition network, and determine the behavior category corresponding to the behavioral features in each video clip.
After the local feature map of each video clip is obtained, the behavioral features in the clip are identified according to the local feature map and the behavior recognition network, and the corresponding behavior category is determined. Referring to Fig. 3d, in some embodiments the behavior recognition unit 306 includes a segment screening subunit 3061 and a behavior recognition subunit 3062, wherein:
the segment screening subunit 3061 is configured to select, according to the local feature maps and a preset temporal proposal network, the video clips containing behavioral features from the video clips in the multiple candidate windows as proposal segments;
the behavior recognition subunit 3062 is configured to determine the behavior category corresponding to each proposal segment according to its local feature map and the behavior recognition network.
The segment screening subunit 3061 inputs the obtained local feature maps of the video clips in the candidate windows into the preset temporal proposal network for preliminary behavior detection, i.e., judging whether the video clip of each candidate window is a behavior segment or a background segment, and screens the video clips containing specific behavioral features from all the clips as proposal segments. A video clip that contains no behavioral feature of any kind is a background segment.
After the segment screening subunit 3061 screens out the proposal segments containing behavioral features, the behavior recognition subunit 3062 takes the local feature maps of the proposal segments as input data and performs behavior recognition based on the behavior recognition network to determine the behavior category corresponding to each proposal segment. Since the local feature maps are obtained through multiple convolution operations, the behavior recognition network may contain no convolutional layer and at least one fully connected layer. In practical applications, multiple fully connected layers may be provided, the last of which is a classifier with M+1 nodes, where M is a positive integer greater than 1, covering M behavior categories and one background category. The local feature map corresponding to a proposal segment is input into the network, which outputs the confidence of each node; the category corresponding to the node with the highest confidence is the behavior category of the video clip.
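The following hedged sketch shows such a recognition head; the feature size 256*4 and hidden width 512 are assumptions introduced only for illustration, and the (M+1)-th node plays the role of the background category:

```python
import torch
import torch.nn as nn

M = 20                                          # hypothetical number of behavior categories
head = nn.Sequential(
    nn.Flatten(),                               # flatten the proposal's local feature map
    nn.Linear(256 * 4, 512), nn.ReLU(),         # feature size 256*4 is an assumption
    nn.Linear(512, M + 1),                      # M behavior categories plus one background node
)
local_feature = torch.randn(1, 256, 4)          # (batch, channels, temporal length)
confidences = head(local_feature).softmax(dim=1)
print(confidences.argmax(dim=1))                # node with the highest confidence = predicted category
```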
From the above, in the embodiment of the present invention, the video acquisition unit 301 obtains the video to be detected; the video windowing unit 302 then adds multiple candidate windows to it, each corresponding to one video clip of the video; the feature acquisition unit 303 generates, based on the feature extraction network, three-dimensional feature maps of the video containing the candidate windows on multiple time domain scales; the scale matching unit 304 determines the time domain scale matching each video clip and obtains the three-dimensional feature map corresponding to the determined scale; and the feature selection unit 305 obtains the local feature map corresponding to the clip according to the determined feature map. If the clip in a candidate window is short, a feature map with a small time domain scale is chosen for extracting the local feature map; conversely, a feature map with a large time domain scale is chosen. After the local feature maps of the clips in all candidate windows are extracted, the behavior recognition unit 306 performs behavior recognition according to the local feature maps and the preset behavior recognition network, and determines the behavior category corresponding to the behavioral features in each clip. This scheme can recognize behaviors of various durations in the video to be detected: even if one video contains multiple behaviors of different durations, the feature extraction network can derive feature maps on multiple time domain scales from the video, so that the receptive field of the classifier adapts to behavioral features of different temporal lengths, improving the accuracy of recognizing behaviors of various durations.
In some embodiments, the temporal proposal network includes a first fully connected layer and a second fully connected layer, and the segment screening subunit 3061 may be further configured to:
detect, according to the local feature map and the first fully connected layer, whether the video clip contains behavioral features;
take the video clips containing behavioral features as the proposal segments.
The behavior recognition apparatus further includes a first regression unit, configured to perform boundary regression on the proposal segments in the second fully connected layer to obtain the first time boundary of each proposal segment; a sketch of this two-branch structure follows.
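This hedged sketch is one way such a two-branch proposal network could look; the feature dimension and the (center offset, length offset) parameterization of the boundary are assumptions, not details taken from the patent:

```python
import torch
import torch.nn as nn

class TemporalProposal(nn.Module):
    def __init__(self, feat_dim=1024):
        super().__init__()
        self.fc1 = nn.Linear(feat_dim, 2)   # behavior vs. background score
        self.fc2 = nn.Linear(feat_dim, 2)   # (center offset, length offset) for the first time boundary

    def forward(self, x):                   # x: (num_windows, feat_dim) pooled local features
        return self.fc1(x), self.fc2(x)

scores, boundaries = TemporalProposal()(torch.randn(5, 1024))
```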
In some embodiments, the behavior recognition network includes an interpolation layer and fully connected layers, and the behavior recognition subunit 3062 may be further configured to:
input the local feature map of the proposal segment into the behavior recognition network, where the interpolation layer adjusts the local feature map to a preset length on the time dimension;
input the local feature map output by the interpolation layer into the fully connected layers, perform behavior recognition on the proposal segment, and determine its corresponding behavior category.
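As an illustrative sketch (not the patent's code), the interpolation layer can be realized with linear interpolation on the time dimension so that proposals of different lengths feed a fixed-size fully connected layer; the channel count and preset length here are arbitrary assumptions:

```python
import torch
import torch.nn.functional as F

local_feature = torch.randn(1, 256, 7)     # (batch, channels, variable temporal length)
fixed = F.interpolate(local_feature, size=4, mode="linear", align_corners=False)
print(fixed.shape)                         # torch.Size([1, 256, 4]): preset length on the time dimension
```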
In some embodiments, the fully connected layers of the behavior recognition network include a third fully connected layer and a fourth fully connected layer, and the behavior recognition subunit 3062 may be further configured to input the local feature map output by the interpolation layer into the third fully connected layer, perform behavior recognition on the proposal segment, and determine its corresponding behavior category. The behavior recognition apparatus further includes a second regression unit, configured to input the local feature map of the proposal segment into the fourth fully connected layer and perform boundary regression on the proposal segment to obtain its second time boundary.
In some embodiments, the behavior recognition apparatus may further include a network training unit, configured to: collect sample videos and add multiple candidate windows to each sample video, each candidate window corresponding to one video clip of the sample video; add a binary classification label and a multi-class label to the video clip in each candidate window, where the binary label distinguishes a behavior label from a background label; according to the sample videos with the added labels and the feature extraction network with initialized weights, generate three-dimensional feature maps of the sample videos containing the candidate windows on multiple time domain scales; determine the time domain scale matching each video clip and obtain the three-dimensional feature map corresponding to the determined scale; obtain, according to the obtained feature map, the local feature map corresponding to the clip; take the local feature maps with behavior labels as positive samples and the local feature maps with background labels as negative samples, and input the positive and negative samples into the temporal proposal network for training; input the local feature maps with multi-class labels into the behavior recognition network for training; and repeat the above steps for iterative training until the loss functions of the temporal proposal network and the behavior recognition network fall below a preset threshold, thereby determining the parameters of the feature extraction network, the temporal proposal network and the behavior recognition network. For the specific implementation, refer to the description in the above behavior recognition method embodiment, which will not be repeated here.
An embodiment of the present invention further provides a server. Fig. 4 illustrates the structure of the server involved in the embodiment of the present invention, specifically:
The server may include a processor 401 with one or more processing cores, a memory 402 of one or more computer-readable storage media, a power supply 403, an input unit 404 and other components. Those skilled in the art will understand that the server structure shown in Fig. 4 does not limit the server, which may include more or fewer components than illustrated, combine certain components, or arrange the components differently.
Wherein:
The processor 401 is the control center of the server; it connects all parts of the entire server through various interfaces and lines, and performs the various functions of the server and processes data by running or executing the software programs and/or modules stored in the memory 402 and calling the data stored in the memory 402, thereby monitoring the server as a whole. Optionally, the processor 401 may include one or more processing cores; preferably, the processor 401 may integrate an application processor and a modem processor, where the application processor mainly handles the operating system, user interfaces, application programs and the like, and the modem processor mainly handles wireless communication. It can be understood that the modem processor may alternatively not be integrated into the processor 401.
The memory 402 may be used to store software programs and modules, and the processor 401 executes various functional applications and performs data processing by running the software programs and modules stored in the memory 402. The memory 402 may mainly include a program storage area and a data storage area, where the program storage area may store the operating system, application programs required for at least one function (such as a sound playing function or an image playing function) and the like, and the data storage area may store data created according to the use of the server. In addition, the memory 402 may include a high-speed random access memory and may also include a non-volatile memory, such as at least one magnetic disk storage device, a flash memory device, or another volatile solid-state storage device. Correspondingly, the memory 402 may also include a memory controller to provide the processor 401 with access to the memory 402.
The server further includes the power supply 403 that supplies power to all components. Preferably, the power supply 403 may be logically connected to the processor 401 through a power management system, so that functions such as charge management, discharge management and power consumption management are implemented through the power management system. The power supply 403 may also include one or more direct-current or alternating-current power sources, a recharging system, a power failure detection circuit, a power converter or inverter, a power status indicator, and other arbitrary components.
The server may further include the input unit 404, which may be used to receive input digit or character information and generate keyboard, mouse, joystick, optical or trackball signal input related to user settings and function control.
Although not shown, the server may further include a display unit and the like, which will not be described here. Specifically, in this embodiment, the processor 401 in the server loads the executable files corresponding to the processes of one or more application programs into the memory 402 according to the following instructions, and runs the application programs stored in the memory 402, thereby implementing various functions, as follows:
obtaining a video to be detected, and adding multiple candidate windows to the video to be detected, wherein each candidate window corresponds to one video clip of the video to be detected;
generating, based on a feature extraction network, three-dimensional feature maps of the video to be detected containing the multiple candidate windows on multiple time domain scales;
determining the time domain scale matching the video clip in the candidate window, and obtaining the three-dimensional feature map corresponding to the determined time domain scale;
obtaining, according to the obtained three-dimensional feature map, the local feature map corresponding to the video clip;
performing behavior recognition on the video clips in the multiple candidate windows according to the local feature maps and a preset behavior recognition network, and determining the behavior category corresponding to the behavioral features in each video clip.
In some embodiments, the feature extraction network is a three-dimensional convolutional neural network containing multiple convolutional layers, and the processor 401, by running the application programs stored in the memory 402, may further implement the following functions:
inputting the video to be detected containing the multiple candidate windows into the feature extraction network, and performing convolution operations in the multiple convolutional layers in sequence;
obtaining the three-dimensional feature maps output by the last several consecutive convolutional layers as the three-dimensional feature maps of the video to be detected on multiple time domain scales, where the deeper the convolutional layer, the larger the time domain scale.
In some embodiments, a convolutional layer in the feature extraction network includes a two-dimensional spatial convolution kernel and a one-dimensional temporal convolution kernel, and the processor 401, by running the application programs stored in the memory 402, may further implement the following function:
when a convolutional layer of the feature extraction network performs a convolution operation, applying the two-dimensional spatial convolution kernel and the one-dimensional temporal convolution kernel in sequence to the input three-dimensional feature map.
In some embodiments, the feature extraction network is a three-dimensional convolutional neural network containing multiple dilated convolutional layers, and the processor 401, by running the application programs stored in the memory 402, may further implement the following functions:
inputting the video to be detected containing the multiple candidate windows into the feature extraction network, and performing convolution operations in the multiple dilated convolutional layers in sequence according to their respective dilation coefficients;
obtaining the three-dimensional feature maps output by the last several consecutive dilated convolutional layers as the three-dimensional feature maps of the video to be detected on multiple time domain scales.
In some embodiments, the processor 401, by running the application programs stored in the memory 402, may further implement the following functions:
determining the number of video frame images contained in the video clip in the candidate window;
determining, according to that number, the time domain scale matching the video clip, and obtaining the three-dimensional feature map corresponding to the determined time domain scale.
In some embodiments, the processor 401, by running the application programs stored in the memory 402, may further implement the following functions:
selecting, according to the local feature maps and a preset temporal proposal network, the video clips containing behavioral features from the video clips in the multiple candidate windows as proposal segments;
determining the behavior category corresponding to each proposal segment according to its local feature map and the behavior recognition network.
In some embodiments, the temporal proposal network includes a first fully connected layer and a second fully connected layer, and the processor 401, by running the application programs stored in the memory 402, may further implement the following functions:
detecting, according to the local feature maps and the first fully connected layer, whether the video clips in the multiple candidate windows contain behavioral features;
taking the video clips containing behavioral features as the proposal segments;
and, after the video clips containing behavioral features are selected from the video clips in the multiple candidate windows as proposal segments, performing boundary regression on the proposal segments in the second fully connected layer to obtain the first time boundary of each proposal segment.
In some embodiments, the behavior recognition network includes an interpolation layer and fully connected layers, and the processor 401, by running the application programs stored in the memory 402, may further implement the following functions:
inputting the local feature map of the proposal segment into the behavior recognition network, where the interpolation layer adjusts the local feature map to a preset length on the time dimension;
inputting the local feature map output by the interpolation layer into the fully connected layers, performing behavior recognition on the proposal segment, and determining its corresponding behavior category.
In some embodiments, the fully connected layers of the behavior recognition network include a third fully connected layer and a fourth fully connected layer, and the processor 401, by running the application programs stored in the memory 402, may further implement the following functions:
inputting the local feature map output by the interpolation layer into the third fully connected layer, performing behavior recognition on the proposal segment, and determining its corresponding behavior category;
and, after the behavior category of the proposal segment is determined, inputting the local feature map of the proposal segment into the fourth fully connected layer, and performing boundary regression on the proposal segment based on the first time boundary to obtain the second time boundary of the proposal segment.
For the specific implementation of each of the above operations, refer to the foregoing embodiments, which will not be repeated here.
Those of ordinary skill in the art will understand that all or some of the steps of the various methods of the above embodiments can be completed by instructions, or by instructions controlling relevant hardware; the instructions may be stored in a computer-readable storage medium and loaded and executed by a processor.
To this end, an embodiment of the present invention provides a storage medium storing multiple instructions that can be loaded by a processor to execute the steps of any behavior recognition method provided by the embodiments of the present invention. For example, the instructions may execute the following steps:
obtaining a video to be detected, and adding multiple candidate windows to the video to be detected, wherein each candidate window corresponds to one video clip of the video to be detected;
generating, based on a feature extraction network, three-dimensional feature maps of the video to be detected containing the multiple candidate windows on multiple time domain scales;
determining the time domain scale matching the video clip in the candidate window, and obtaining the three-dimensional feature map corresponding to the determined time domain scale;
obtaining, according to the obtained three-dimensional feature map, the local feature map corresponding to the video clip;
performing behavior recognition on the video clips in the multiple candidate windows according to the local feature maps and a preset behavior recognition network, and determining the behavior category corresponding to the behavioral features in each video clip.
For the specific implementation of the above operations, refer to the foregoing embodiments, which will not be repeated here.
The storage medium may include a read-only memory (ROM), a random access memory (RAM), a magnetic disk, an optical disc, or the like.
Since the instructions stored in the storage medium can execute the steps of any behavior recognition method provided by the embodiments of the present invention, they can achieve the beneficial effects achievable by any behavior recognition method provided by the embodiments of the present invention; for details, refer to the foregoing embodiments, which will not be repeated here. The behavior recognition method, apparatus and storage medium provided by the embodiments of the present invention have been described in detail above. Specific examples are used herein to explain the principles and implementations of the present invention, and the description of the above embodiments is only intended to help understand the method of the present invention and its core idea. Meanwhile, for those skilled in the art, there will be changes in the specific implementation and application scope according to the idea of the present invention. In summary, the contents of this specification should not be construed as limiting the present invention.

Claims (14)

1. A behavior recognition method, characterized by comprising:
obtaining a video to be detected, and adding multiple candidate windows to the video to be detected, wherein each candidate window corresponds to one video clip of the video to be detected;
generating, based on a feature extraction network, three-dimensional feature maps of the video to be detected containing the multiple candidate windows on multiple time domain scales;
determining the time domain scale matching the video clip in the candidate window, and obtaining the three-dimensional feature map corresponding to the determined time domain scale;
obtaining, according to the obtained three-dimensional feature map, a local feature map corresponding to the video clip;
performing behavior recognition on the video clips in the multiple candidate windows according to the local feature maps and a preset behavior recognition network, and determining the behavior category corresponding to the behavioral features in the video clips.
2. The behavior recognition method according to claim 1, wherein the feature extraction network is a three-dimensional convolutional neural network containing multiple convolutional layers, and generating, based on the feature extraction network, the three-dimensional feature maps of the video to be detected containing the multiple candidate windows on multiple time domain scales comprises:
inputting the video to be detected containing the multiple candidate windows into the feature extraction network, and performing convolution operations in the multiple convolutional layers in sequence;
obtaining the three-dimensional feature maps output by the last several consecutive convolutional layers as the three-dimensional feature maps of the video to be detected on multiple time domain scales, wherein the deeper the convolutional layer, the larger the time domain scale.
3. The behavior recognition method according to claim 2, wherein a convolutional layer in the feature extraction network comprises a two-dimensional spatial convolution kernel and a one-dimensional temporal convolution kernel, and the method further comprises:
when a convolutional layer of the feature extraction network performs a convolution operation, applying the two-dimensional spatial convolution kernel and the one-dimensional temporal convolution kernel in sequence to the input three-dimensional feature map.
4. The behavior recognition method according to claim 1, wherein the feature extraction network is a three-dimensional convolutional neural network containing multiple dilated convolutional layers, and generating, based on the feature extraction network, the three-dimensional feature maps of the video to be detected containing the multiple candidate windows on multiple time domain scales comprises:
inputting the video to be detected containing the multiple candidate windows into the feature extraction network, and performing convolution operations in the multiple dilated convolutional layers in sequence according to their respective dilation coefficients;
obtaining the three-dimensional feature maps output by the last several consecutive dilated convolutional layers as the three-dimensional feature maps of the video to be detected on multiple time domain scales.
5. The behavior recognition method according to claim 1, wherein determining the time domain scale matching the video clip in the candidate window, and obtaining the three-dimensional feature map corresponding to the determined time domain scale, comprises:
determining the number of video frame images contained in the video clip in the candidate window;
determining, according to the number, the time domain scale matching the video clip, and obtaining the three-dimensional feature map corresponding to the determined time domain scale.
6. The behavior recognition method according to any one of claims 1 to 5, wherein performing behavior recognition on the video clips in the candidate windows according to the local feature maps and the preset behavior recognition network, and determining the behavior category corresponding to the behavioral features in the video clips, comprises:
selecting, according to the local feature maps and a preset temporal proposal network, the video clips containing behavioral features from the video clips in the multiple candidate windows as proposal segments;
determining the behavior category corresponding to the proposal segment according to the local feature map of the proposal segment and the behavior recognition network.
7. The behavior recognition method according to claim 6, wherein the temporal proposal network comprises a first fully connected layer and a second fully connected layer, and selecting, according to the local feature maps and the preset temporal proposal network, the video clips containing behavioral features from the video clips in the multiple candidate windows as proposal segments comprises:
detecting, according to the local feature maps and the first fully connected layer, whether the video clips in the multiple candidate windows contain behavioral features;
taking the video clips containing behavioral features as the proposal segments;
and after the video clips containing behavioral features are selected from the video clips in the multiple candidate windows as proposal segments, the method further comprises:
performing boundary regression on the proposal segments in the second fully connected layer to obtain a first time boundary of each proposal segment.
8. The behavior recognition method according to claim 7, wherein the behavior recognition network comprises an interpolation layer and a fully connected layer, and determining the behavior category corresponding to the proposal segment according to the local feature map of the proposal segment and the behavior recognition network comprises:
inputting the local feature map of the proposal segment into the behavior recognition network, wherein the interpolation layer adjusts the local feature map to a preset length on the time dimension;
inputting the local feature map output by the interpolation layer into the fully connected layer, performing behavior recognition on the proposal segment, and determining the behavior category corresponding to the proposal segment.
9. The behavior recognition method according to claim 8, wherein the fully connected layer of the behavior recognition network comprises a third fully connected layer and a fourth fully connected layer, and inputting the local feature map output by the interpolation layer into the fully connected layer, performing behavior recognition on the proposal segment, and determining the behavior category corresponding to the proposal segment comprises:
inputting the local feature map output by the interpolation layer into the third fully connected layer, performing behavior recognition on the proposal segment, and determining the behavior category corresponding to the proposal segment;
and after the behavior category corresponding to the proposal segment is determined, the method further comprises:
inputting the local feature map of the proposal segment into the fourth fully connected layer, and performing boundary regression on the proposal segment to obtain a second time boundary of the proposal segment.
10. A behavior recognition apparatus, characterized by comprising:
a video acquisition unit, configured to obtain a video to be detected;
a video windowing unit, configured to add multiple candidate windows to the video to be detected, wherein each candidate window corresponds to one video clip of the video to be detected;
a feature acquisition unit, configured to generate, based on a feature extraction network, three-dimensional feature maps of the video to be detected containing the multiple candidate windows on multiple time domain scales;
a scale matching unit, configured to determine the time domain scale matching the video clip in the candidate window, and obtain the three-dimensional feature map corresponding to the determined time domain scale;
a feature selection unit, configured to obtain, according to the obtained three-dimensional feature map, a local feature map corresponding to the video clip;
a behavior recognition unit, configured to perform behavior recognition on the video clips in the multiple candidate windows according to the local feature maps and a preset behavior recognition network, and determine the behavior category corresponding to the behavioral features in the video clips.
11. The behavior recognition apparatus according to claim 10, wherein the feature extraction network is a three-dimensional convolutional neural network containing multiple convolutional layers, and the feature acquisition unit comprises:
a convolution operation subunit, configured to input the video to be detected containing the multiple candidate windows into the feature extraction network, and perform convolution operations in the multiple convolutional layers in sequence;
a feature obtaining subunit, configured to obtain the three-dimensional feature maps output by the last several consecutive convolutional layers as the three-dimensional feature maps of the video to be detected on multiple time domain scales, wherein the deeper the convolutional layer, the larger the time domain scale.
12. The behavior recognition apparatus according to claim 10, wherein the scale matching unit comprises:
a quantity determining subunit, configured to determine the number of video frame images contained in the video clip in the candidate window;
a scale determining subunit, configured to determine, according to the number, the time domain scale matching the video clip, and obtain the three-dimensional feature map corresponding to the determined time domain scale.
13. The behavior recognition apparatus according to any one of claims 10 to 12, wherein the behavior recognition unit comprises:
a segment screening subunit, configured to select, according to the local feature maps and a preset temporal proposal network, the video clips containing behavioral features from the video clips in the multiple candidate windows as proposal segments;
a behavior recognition subunit, configured to determine the behavior category corresponding to the proposal segment according to the local feature map of the proposal segment and the behavior recognition network.
14. A storage medium, characterized in that the storage medium stores multiple instructions, the instructions being adapted to be loaded by a processor to execute the steps of the behavior recognition method according to any one of claims 1 to 9.
CN201910012006.3A 2019-01-07 2019-01-07 Behavior recognition method and device and storage medium Active CN109697434B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201910012006.3A CN109697434B (en) 2019-01-07 2019-01-07 Behavior recognition method and device and storage medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201910012006.3A CN109697434B (en) 2019-01-07 2019-01-07 Behavior recognition method and device and storage medium

Publications (2)

Publication Number Publication Date
CN109697434A true CN109697434A (en) 2019-04-30
CN109697434B CN109697434B (en) 2021-01-08

Family

ID=66233158

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201910012006.3A Active CN109697434B (en) 2019-01-07 2019-01-07 Behavior recognition method and device and storage medium

Country Status (1)

Country Link
CN (1) CN109697434B (en)



Patent Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN105160310A (en) * 2015-08-25 2015-12-16 西安电子科技大学 3D (three-dimensional) convolutional neural network based human body behavior recognition method
CN106897714A (en) * 2017-03-23 2017-06-27 北京大学深圳研究生院 A kind of video actions detection method based on convolutional neural networks
CN108399380A (en) * 2018-02-12 2018-08-14 北京工业大学 A kind of video actions detection method based on Three dimensional convolution and Faster RCNN
CN108805083A (en) * 2018-06-13 2018-11-13 中国科学技术大学 The video behavior detection method of single phase

Non-Patent Citations (3)

* Cited by examiner, † Cited by third party
Title
TSUNG-YI LIN et al.: "Feature Pyramid Networks for Object Detection", 《HTTPS://ARXIV.ORG/ABS/1612.03144》 *
YU-WEI CHAO et al.: "Rethinking the Faster R-CNN Architecture for Temporal Action Localization", 《HTTPS://ARXIV.ORG/ABS/1804.07667》 *
ZHAOFAN QIU et al.: "Learning Spatio-Temporal Representation with Pseudo-3D Residual Networks", 《HTTPS://ARXIV.ORG/ABS/1711.10305》 *

Cited By (30)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110210750A (en) * 2019-05-29 2019-09-06 北京天正聚合科技有限公司 A kind of method, apparatus, electronic equipment and storage medium identifying Shopping Guide's business
CN110222780B (en) * 2019-06-12 2021-06-11 北京百度网讯科技有限公司 Object detection method, device, equipment and storage medium
CN110222780A (en) * 2019-06-12 2019-09-10 北京百度网讯科技有限公司 Object detecting method, device, equipment and storage medium
CN110276332A (en) * 2019-06-28 2019-09-24 北京奇艺世纪科技有限公司 A kind of video features processing method, device and Three dimensional convolution neural network model
CN110276332B (en) * 2019-06-28 2021-12-24 北京奇艺世纪科技有限公司 Video feature processing method and device
CN110349373A (en) * 2019-07-15 2019-10-18 滁州学院 Activity recognition method, apparatus and storage medium based on binary sensors
CN110502995A (en) * 2019-07-19 2019-11-26 南昌大学 Driver yawning detection method based on fine facial action recognition
CN110502995B (en) * 2019-07-19 2023-03-14 南昌大学 Driver yawning detection method based on fine facial action recognition
CN110602526A (en) * 2019-09-11 2019-12-20 腾讯科技(深圳)有限公司 Video processing method, video processing device, computer equipment and storage medium
CN110602526B (en) * 2019-09-11 2021-09-21 腾讯科技(深圳)有限公司 Video processing method, video processing device, computer equipment and storage medium
CN110598669A (en) * 2019-09-20 2019-12-20 郑州大学 Method and system for detecting crowd density in complex scenes
CN110796069A (en) * 2019-10-28 2020-02-14 广州博衍智能科技有限公司 Behavior detection method, system, equipment and machine-readable medium
CN111353428A (en) * 2020-02-28 2020-06-30 北京市商汤科技开发有限公司 Action information identification method and device, electronic equipment and storage medium
CN111368143A (en) * 2020-03-13 2020-07-03 北京奇艺世纪科技有限公司 Video similarity retrieval method and device, electronic equipment and storage medium
CN113453040A (en) * 2020-03-26 2021-09-28 华为技术有限公司 Short video generation method and device, related equipment and medium
CN113453040B (en) * 2020-03-26 2023-03-10 华为技术有限公司 Short video generation method, device, related equipment and medium
WO2021190078A1 (en) * 2020-03-26 2021-09-30 华为技术有限公司 Method and apparatus for generating short video, and related device and medium
CN111178344A (en) * 2020-04-15 2020-05-19 中国人民解放军国防科技大学 Multi-scale time-series behavior recognition method
CN111881818A (en) * 2020-07-27 2020-11-03 复旦大学 Medical action fine-grained recognition device and computer-readable storage medium
CN111881818B (en) * 2020-07-27 2022-07-22 复旦大学 Medical action fine-grained recognition device and computer-readable storage medium
WO2022166258A1 (en) * 2021-02-05 2022-08-11 深圳市优必选科技股份有限公司 Behavior recognition method and apparatus, terminal device, and computer-readable storage medium
CN113015022A (en) * 2021-02-05 2021-06-22 深圳市优必选科技股份有限公司 Behavior recognition method and device, terminal equipment and computer readable storage medium
CN113177442A (en) * 2021-04-12 2021-07-27 广东省科学院智能制造研究所 Human behavior detection method and device based on edge computing
CN113177442B (en) * 2021-04-12 2024-01-30 广东省科学院智能制造研究所 Human behavior detection method and device based on edge computing
CN113033500A (en) * 2021-05-06 2021-06-25 成都考拉悠然科技有限公司 Motion segment detection method, model training method and device
CN113033500B (en) * 2021-05-06 2021-12-03 成都考拉悠然科技有限公司 Motion segment detection method, model training method and device
CN114155884A (en) * 2021-11-25 2022-03-08 成都爱奇艺智能创新科技有限公司 Audio highlight determination method and device, electronic equipment and storage medium
CN114663980A (en) * 2022-04-01 2022-06-24 北京百度网讯科技有限公司 Behavior recognition method, and deep learning model training method and device
CN117158904A (en) * 2023-09-08 2023-12-05 上海市第四人民医院 Old people cognitive disorder detection system and method based on behavior analysis
CN117158904B (en) * 2023-09-08 2024-05-24 上海市第四人民医院 Old people cognitive disorder detection system and method based on behavior analysis

Also Published As

Publication number Publication date
CN109697434B (en) 2021-01-08

Similar Documents

Publication Publication Date Title
CN109697434A (en) Activity recognition method, apparatus and storage medium
CN109614985B (en) Target detection method based on densely connected feature pyramid network
CN114220035A (en) Rapid pest detection method based on improved YOLO V4
CN107169463B (en) Face detection method, device, computer equipment and storage medium
CN105574550B (en) Vehicle identification method and device
CN109919097A (en) Joint detection system and method of face and key points based on multi-task learning
Chirgaiya et al. Tiny object detection model based on competitive multi-layer neural network (TOD-CMLNN)
CN108520229A (en) Image detection method, device, electronic equipment and computer-readable medium
CN107871102A (en) Face detection method and device
CN108875522A (en) Face clustering method, device, system and storage medium
CN105096300B (en) Object detection method and device
CN114821102B (en) Dense citrus quantity detection method, equipment, storage medium and device
CN105590099B (en) Multi-person activity recognition method based on an improved convolutional neural network
CN105654066A (en) Vehicle identification method and device
CN111079658A (en) Video-based multi-target continuous behavior analysis method, system and device
CN111160111B (en) Human keypoint detection method based on deep learning
CN114419732B (en) HRNet human posture recognition method based on attention mechanism optimization
CN109670517A (en) Object detection method, device, electronic equipment and object detection model
CN103336967B (en) Hand motion trajectory detection method and device
CN113159200A (en) Object analysis method, device and storage medium
CN105303163B (en) Object detection method and detection device
CN108875456A (en) Object detection method, object detection device and computer-readable storage medium
CN110008953A (en) Method for generating potential target regions based on multi-layer feature fusion of convolutional neural networks
WO2020237185A1 (en) Systems and methods to train a cell object detector
CN112734747A (en) Target detection method and device, electronic equipment and storage medium

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant