
CN106599789A - Video class identification method and device, data processing device and electronic device - Google Patents


Info

Publication number
CN106599789A
CN106599789A (application CN201611030170.XA)
Authority
CN
China
Prior art keywords
video
segmenting
time domain
classification
spatial domain
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN201611030170.XA
Other languages
Chinese (zh)
Other versions
CN106599789B (en)
Inventor
Tang Xiaoou (汤晓鸥)
Wang Limin (王利民)
Xiong Yuanjun (熊元骏)
Wang Zhe (王喆)
Qiao Yu (乔宇)
Lin Dahua (林达华)
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing Sensetime Technology Development Co Ltd
Original Assignee
Beijing Sensetime Technology Development Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beijing Sensetime Technology Development Co Ltd
Publication of CN106599789A
Application granted
Publication of CN106599789B
Legal status: Active (current)
Anticipated expiration


Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V 20/00 Scenes; Scene-specific elements
    • G06V 20/40 Scenes; Scene-specific elements in video content
    • G06V 20/41 Higher-level, semantic clustering, classification or understanding of video scenes, e.g. detection, labelling or Markovian modelling of sport events or news items
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 18/00 Pattern recognition
    • G06F 18/20 Analysing
    • G06F 18/22 Matching criteria, e.g. proximity measures
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 18/00 Pattern recognition
    • G06F 18/20 Analysing
    • G06F 18/24 Classification techniques
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V 20/00 Scenes; Scene-specific elements
    • G06V 20/40 Scenes; Scene-specific elements in video content
    • G06V 20/41 Higher-level, semantic clustering, classification or understanding of video scenes, e.g. detection, labelling or Markovian modelling of sport events or news items
    • G06V 20/42 Higher-level, semantic clustering, classification or understanding of video scenes, e.g. detection, labelling or Markovian modelling of sport events or news items of sport video content

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • Multimedia (AREA)
  • Software Systems (AREA)
  • Computational Linguistics (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Evolutionary Computation (AREA)
  • General Engineering & Computer Science (AREA)
  • Evolutionary Biology (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Artificial Intelligence (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Image Analysis (AREA)

Abstract

An embodiment of the invention discloses a video category recognition method and apparatus, a data processing apparatus, and an electronic device. The method comprises the following steps: segmenting a video to obtain a plurality of video segments; sampling each of the plurality of video segments to obtain an original image and an optical-flow image of each segment; processing the original image of each segment with a spatial-domain convolutional neural network to obtain a spatial-domain classification result of the video; processing the optical-flow image of each segment with a time-domain convolutional neural network to obtain a time-domain classification result of the video; and fusing the spatial-domain classification result with the time-domain classification result to obtain the classification result of the video. The method and apparatus can improve the accuracy of video category recognition.

Description

Video category recognition method and apparatus, data processing apparatus, and electronic device
Technical field
The invention belongs to the technical field of computer vision, and in particular relates to a video category recognition method and apparatus, a data processing apparatus, and an electronic device.
Background art
Action recognition is a popular direction in computer vision research. Action recognition technology processes videos composed of sequences of color images in order to identify the actions they contain. Its difficulty lies in processing dynamically changing video content so as to correctly identify the actions despite changes in distance, viewing angle, camera movement, and scene.
At present, conventional action recognition techniques mainly use hand-designed feature descriptors together with classifiers such as support vector machines. The most representative method uses improved dense trajectory descriptors as features and a support vector machine classifier for action recognition. Because hand-designed feature descriptors cannot automatically improve their feature representation during training, such methods usually fail to achieve a satisfactory recognition accuracy.
In recent years, with the rapid development of deep learning, and in particular its application in the field of computer vision, action recognition based on deep learning has gradually become mainstream. These deep-learning methods mainly use convolutional neural networks to process the video and thereby identify the actions in it.
Summary of the invention
Embodiments of the present invention provide a video category recognition scheme.
According to one aspect of the embodiments of the present invention, a video category recognition method is provided, including:
segmenting a video to obtain a plurality of video segments;
sampling each of the plurality of video segments to obtain an original image and an optical-flow image of each segment;
processing the original image of each segment with a spatial-domain convolutional neural network to obtain a spatial-domain classification result of the video, and processing the optical-flow image of each segment with a time-domain convolutional neural network to obtain a time-domain classification result of the video; and
fusing the spatial-domain classification result with the time-domain classification result to obtain the classification result of the video.
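Before the individual embodiments are enumerated, the claimed flow can be summarized in code. The following is a minimal Python/NumPy sketch under stated assumptions: `spatial_net` and `temporal_net` are stand-in callables returning per-category score vectors, `flow_fn` is a hypothetical helper turning consecutive frames into the temporal network's input (one possible version is sketched under the optical-flow sampling embodiment below), and the average consensus and 1:1.5 fusion weights are just one of the claimed choices, not the only ones.

```python
import numpy as np

def recognize_video(frames, spatial_net, temporal_net, flow_fn,
                    num_segments=3, weights=(1.0, 1.5)):
    """Sketch of the claimed pipeline: segment, sample, classify per segment, fuse."""
    # 1. Even segmentation into equal-length segments
    #    (assumes len(frames) >= 6 * num_segments).
    segments = np.array_split(np.arange(len(frames)), num_segments)

    spatial_scores, temporal_scores = [], []
    for seg in segments:
        # 2. Sample one random RGB frame plus 6 consecutive frames for optical flow.
        rgb = frames[np.random.choice(seg)]
        start = int(np.random.randint(seg[0], seg[-1] - 4))
        flow_input = flow_fn(frames[start:start + 6])

        # 3. Preliminary per-segment classification by each stream.
        spatial_scores.append(spatial_net(rgb))
        temporal_scores.append(temporal_net(flow_input))

    # 4. Consensus across segments (average shown; max and weighted
    #    average are also claimed variants).
    spatial_result = np.mean(spatial_scores, axis=0)
    temporal_result = np.mean(temporal_scores, axis=0)

    # 5. Weighted fusion of the two streams (1:1.5 in one claimed embodiment).
    return weights[0] * spatial_result + weights[1] * temporal_result
```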
In another embodiment based on the above method, segmenting the video includes:
segmenting the video evenly to obtain a plurality of video segments of equal length.
In another embodiment based on the above method, obtaining the original image of each video segment includes:
randomly selecting one frame from each video segment as the original image of that segment.
In another embodiment based on the above method, obtaining the optical-flow image of each video segment includes:
randomly selecting a plurality of consecutive frames from each video segment to obtain the optical-flow images of that segment.
In another embodiment based on the above method, the optical-flow image is a grayscale image based on an 8-bit bitmap with 256 discrete levels, the middle value of the grayscale image being 128.
In another embodiment based on the above method, randomly selecting consecutive frames from each video segment to obtain its optical-flow images includes:
for each video segment: randomly selecting N consecutive frames from the segment, where N is an integer greater than 1; and
computing one group of optical-flow images from each pair of adjacent frames among the N frames, yielding N-1 groups of optical-flow images, where each group includes one horizontal optical-flow image and one vertical optical-flow image.
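As one concrete reading of this sampling scheme, the sketch below computes the N-1 horizontal/vertical flow pairs from N consecutive frames with OpenCV's Farneback algorithm (an assumed choice; the patent does not name a flow method) and discretizes them into the 8-bit, 256-level grayscale encoding centered at 128 described in the preceding embodiment. The clipping bound of ±20 pixels is likewise an assumption.

```python
import cv2
import numpy as np

def compute_flow_stack(frames, bound=20.0):
    """From N consecutive BGR frames, build N-1 (horizontal, vertical) flow
    pairs encoded as 8-bit grayscale images with 128 as the zero-motion value."""
    gray = [cv2.cvtColor(f, cv2.COLOR_BGR2GRAY) for f in frames]
    channels = []
    for prev, nxt in zip(gray[:-1], gray[1:]):
        flow = cv2.calcOpticalFlowFarneback(prev, nxt, None,
                                            0.5, 3, 15, 3, 5, 1.2, 0)  # (H, W, 2)
        for axis in range(2):  # 0: horizontal (x), 1: vertical (y)
            comp = np.clip(flow[..., axis], -bound, bound)
            # Map [-bound, bound] onto [1, 255] so zero motion encodes as 128.
            channels.append(np.round(comp / bound * 127.0 + 128.0).astype(np.uint8))
    return np.stack(channels)  # e.g. 6 frames -> 10 channels
```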
In another embodiment based on the above method, processing the original image of each video segment with the spatial-domain convolutional neural network to obtain the spatial-domain classification result of the video includes:
processing the original image of each segment with the spatial-domain convolutional neural network to obtain a preliminary spatial-domain classification result of each segment; and
aggregating the preliminary spatial-domain classification results of the plurality of segments with a spatial-domain consensus function to obtain the spatial-domain classification result of the video;
and/or
processing the optical-flow image of each video segment with the time-domain convolutional neural network to obtain the time-domain classification result of the video includes:
processing the optical-flow image of each segment with the time-domain convolutional neural network to obtain a preliminary time-domain classification result of each segment; and
aggregating the preliminary time-domain classification results of the plurality of segments with a time-domain consensus function to obtain the time-domain classification result of the video.
In another embodiment based on the above method, the spatial-domain consensus function and/or the time-domain consensus function includes: an average function, a maximum function, or a weighted-average function.
In another embodiment based on the above method, the method further includes:
selecting, as the spatial-domain consensus function, whichever of the average, maximum, and weighted-average functions achieves the highest classification accuracy on a validation set; and/or
selecting, as the time-domain consensus function, whichever of the average, maximum, and weighted-average functions achieves the highest classification accuracy on a validation set.
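A sketch of the three candidate consensus functions, assuming the per-segment preliminary results are NumPy score vectors and that the weighted-average weights were learned as model parameters during training (the function and variable names are illustrative):

```python
import numpy as np

def consensus(segment_scores, kind="average", weights=None):
    """Aggregate per-segment score vectors (shape: segments x categories)
    into one video-level score vector, category by category."""
    s = np.asarray(segment_scores)
    if kind == "average":            # mean of each category's scores
        return s.mean(axis=0)
    if kind == "max":                # maximum of each category's scores
        return s.max(axis=0)
    if kind == "weighted_average":   # one shared weight per segment, learned in training
        w = np.asarray(weights, dtype=float)
        return (w[:, None] * s).sum(axis=0) / w.sum()
    raise ValueError(kind)

# Selection rule from the embodiment above: keep whichever variant scores
# highest on a held-out validation set.
```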
In another embodiment based on the above method, the preliminary spatial-domain classification result and the preliminary time-domain classification result are each a classification score vector whose dimension equals the number of classification categories;
the spatial-domain classification result and the time-domain classification result of the video are each a classification score vector whose dimension equals the number of classification categories; and
the classification result of the video is a classification score vector whose dimension equals the number of classification categories.
In another embodiment based on the above method, fusing the spatial-domain classification result with the time-domain classification result includes:
multiplying the spatial-domain classification result and the time-domain classification result by their respective preset weight coefficients and summing them to obtain the classification result of the video.
In another embodiment based on the above method, the ratio of the weight coefficients between the spatial-domain classification result and the time-domain classification result is 1:1.5.
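A one-line sketch of this weighted fusion; the same helper covers the 1:1.5 spatial/temporal ratio here and the 1:1:0.5 three-stream ratio of the warped-flow embodiment described later (the names are illustrative):

```python
import numpy as np

def fuse(results, weights):
    """Weighted sum of stream score vectors, e.g. weights (1.0, 1.5) for
    spatial + temporal, or (1.0, 1.0, 0.5) when a warped-flow stream is added."""
    return sum(w * np.asarray(r) for w, r in zip(weights, results))
```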
In another embodiment based on the above method, the optical-flow image is specifically an original optical-flow image, and the time-domain convolutional neural network is specifically a first time-domain convolutional neural network;
the first time-domain convolutional neural network processes the original optical-flow image of each video segment to obtain a first preliminary time-domain classification result of each segment; and
a first time-domain consensus function aggregates the first preliminary time-domain classification results of the plurality of segments to obtain a first time-domain classification result of the video.
In another embodiment based on the above method, the method further includes:
obtaining a warped optical-flow image by warping the original optical-flow image;
processing the warped optical-flow image of each video segment with a second time-domain convolutional neural network to obtain a second preliminary time-domain classification result of each segment; and
aggregating the second preliminary time-domain classification results of the plurality of segments with a second time-domain consensus function to obtain a second time-domain classification result of the video;
fusing the spatial-domain classification result with the time-domain classification results then includes: fusing the spatial-domain classification result, the first time-domain classification result, and the second time-domain classification result to obtain the classification result of the video.
In another embodiment based on the above method, obtaining the warped optical-flow image includes:
computing, for each pair of adjacent frames, the homography matrix between the two frames;
applying, for each pair of adjacent frames, an affine transformation to the latter frame according to the homography matrix between the two frames; and
computing the optical flow between the former frame and the transformed latter frame of each pair to obtain the warped optical-flow image.
In another embodiment based on the above method, the computation over each pair of adjacent frames includes matching feature points between the frames using speeded-up robust features (SURF) feature-point descriptors.
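A sketch of this warped-flow computation with OpenCV, under stated assumptions: SURF lives in the opencv-contrib `xfeatures2d` module and may be unavailable in some builds; Farneback flow stands in for whichever optical-flow algorithm an implementation actually uses; and a full perspective warp is applied from the homography, where the patent's wording speaks of an affine transformation derived from it.

```python
import cv2
import numpy as np

def warped_flow(prev_bgr, next_bgr):
    """Estimate camera motion between two frames via SURF matching and a
    homography, warp the latter frame to cancel it, then compute optical flow."""
    prev = cv2.cvtColor(prev_bgr, cv2.COLOR_BGR2GRAY)
    nxt = cv2.cvtColor(next_bgr, cv2.COLOR_BGR2GRAY)

    surf = cv2.xfeatures2d.SURF_create()       # opencv-contrib only
    kp1, des1 = surf.detectAndCompute(prev, None)
    kp2, des2 = surf.detectAndCompute(nxt, None)

    matches = cv2.BFMatcher(cv2.NORM_L2).match(des1, des2)
    src = np.float32([kp2[m.trainIdx].pt for m in matches]).reshape(-1, 1, 2)
    dst = np.float32([kp1[m.queryIdx].pt for m in matches]).reshape(-1, 1, 2)

    # Homography mapping the latter frame onto the former frame's viewpoint
    # (needs at least 4 matches).
    H, _ = cv2.findHomography(src, dst, cv2.RANSAC)
    h, w = prev.shape
    nxt_warped = cv2.warpPerspective(nxt, H, (w, h))

    # Flow between the former frame and the warped latter frame.
    return cv2.calcOpticalFlowFarneback(prev, nxt_warped, None,
                                        0.5, 3, 15, 3, 5, 1.2, 0)
```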
In another embodiment based on the above method, fusing the spatial-domain classification result, the first time-domain classification result, and the second time-domain classification result includes:
multiplying the spatial-domain classification result, the first time-domain classification result, and the second time-domain classification result by their respective preset weight coefficients and summing them to obtain the classification result of the video.
In another embodiment based on the above method, the ratio of the weight coefficients among the spatial-domain classification result, the first time-domain classification result, and the second time-domain classification result is 1:1:0.5.
In another embodiment based on the above method, the classification result of the video is a classification score vector whose dimension equals the number of classification categories;
the method further includes:
normalizing the classification score vector of the video with a Softmax function to obtain the probability vector of the video belonging to each category.
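A numerically stable sketch of this Softmax normalization (a standard formulation, not specific to the patent):

```python
import numpy as np

def softmax(scores):
    """Turn a per-category score vector into a probability vector."""
    z = np.asarray(scores, dtype=float)
    z = z - z.max()          # subtract the max for numerical stability
    e = np.exp(z)
    return e / e.sum()
```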
In another embodiment based on the above method, the method further includes:
presetting an initial spatial-domain convolutional neural network and an initial time-domain convolutional neural network; and
training, based on each video serving as a sample, the initial spatial-domain convolutional neural network with stochastic gradient descent to obtain the spatial-domain convolutional neural network, and training the initial time-domain convolutional neural network with stochastic gradient descent to obtain the time-domain convolutional neural network.
In another embodiment based on the above method, training the initial spatial-domain convolutional neural network with stochastic gradient descent to obtain the spatial-domain convolutional neural network includes:
for one video serving as a sample, performing the operations from segmenting the video onward until the spatial-domain classification result of the video is obtained;
comparing whether the deviation of the spatial-domain classification result of the video from the preset standard spatial-domain classification result of the video is below a preset range;
if it is not below the preset range, adjusting the network parameters of the initial spatial-domain convolutional neural network, taking the network with adjusted parameters as the initial spatial-domain convolutional neural network, and performing the operations from segmenting the video onward for the next video serving as a sample; and
if it is below the preset range, taking the current initial spatial-domain convolutional neural network as the spatial-domain convolutional neural network.
In another embodiment based on the above method, training the initial time-domain convolutional neural network with stochastic gradient descent to obtain the time-domain convolutional neural network includes:
for one video serving as a sample, performing the operations from segmenting the video onward until the time-domain classification result of the video is obtained;
comparing whether the deviation of the time-domain classification result of the video from the preset standard time-domain classification result of the video is below a preset range;
if it is not below the preset range, adjusting the network parameters of the initial time-domain convolutional neural network, taking the network with adjusted parameters as the initial time-domain convolutional neural network, and performing the operations from segmenting the video onward for the next video serving as a sample; and
if it is below the preset range, taking the current initial time-domain convolutional neural network as the time-domain convolutional neural network;
where the initial time-domain convolutional neural network includes the first or the second initial time-domain convolutional neural network, the time-domain classification result correspondingly includes the first or the second time-domain classification result, and the time-domain convolutional neural network correspondingly includes the first and the second time-domain convolutional neural networks.
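Read in standard deep-learning terms, the loop described above is SGD with a loss threshold as the stopping criterion. A minimal PyTorch-flavored sketch under assumed choices (the cross-entropy loss, learning rate, momentum, and threshold are illustrative; the patent does not specify them):

```python
import torch

def train_stream(net, sample_loader, epsilon=1e-2, lr=1e-3):
    """Sketch of the described SGD training for either stream: run the
    segmented pipeline on a sample video, compare against its standard
    (ground-truth) result, and adjust parameters until the deviation
    falls below a preset range."""
    opt = torch.optim.SGD(net.parameters(), lr=lr, momentum=0.9)
    criterion = torch.nn.CrossEntropyLoss()
    for inputs, label in sample_loader:   # one sample video per step
        scores = net(inputs)              # segment -> preliminary results -> consensus
        loss = criterion(scores, label)
        if loss.item() < epsilon:         # deviation below the preset range: done
            return net
        opt.zero_grad()
        loss.backward()                   # adjust the network parameters
        opt.step()
    return net
```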
In another embodiment based on the above method, the method further includes:
normalizing the spatial-domain classification result of the video with a Softmax function to obtain the spatial-domain probability vector of the video belonging to each category, and normalizing the time-domain classification result of the video with a Softmax function to obtain the time-domain probability vector of the video belonging to each category.
According to another aspect of the embodiments of the present invention, a video category recognition apparatus is provided, including:
a segmentation unit, configured to segment a video to obtain a plurality of video segments;
a sampling unit, configured to sample each of the plurality of video segments to obtain an original image and an optical-flow image of each segment;
a spatial-domain classification unit, configured to process the original image of each segment with a spatial-domain convolutional neural network to obtain a spatial-domain classification result of the video;
a time-domain classification unit, configured to process the optical-flow image of each segment with a time-domain convolutional neural network to obtain a time-domain classification result of the video; and
a fusion unit, configured to fuse the spatial-domain classification result with the time-domain classification result to obtain the classification result of the video.
In another embodiment based on the above apparatus, the segmentation unit is specifically configured to segment the video evenly to obtain a plurality of video segments of equal length.
In another embodiment based on the above apparatus, the sampling unit includes:
an image sampling module, configured to randomly select one frame from each video segment as the original image of that segment; and
an optical-flow sampling module, configured to randomly select a plurality of consecutive frames from each video segment to obtain the optical-flow images of that segment.
In another embodiment based on the above apparatus, the optical-flow image is a grayscale image based on an 8-bit bitmap with 256 discrete levels, the middle value of the grayscale image being 128.
In another embodiment based on the above apparatus, the optical-flow sampling module is specifically configured to:
for each video segment, randomly select N consecutive frames, where N is an integer greater than 1, and compute one group of optical-flow images from each pair of adjacent frames among the N frames, yielding N-1 groups of optical-flow images, where each group includes one horizontal optical-flow image and one vertical optical-flow image.
In another embodiment based on the above apparatus, the spatial-domain classification unit includes:
a spatial-domain classification module, configured to process the original image of each segment with the spatial-domain convolutional neural network to obtain a preliminary spatial-domain classification result of each segment; and
a first aggregation module, configured to aggregate the preliminary spatial-domain classification results of the plurality of segments with a spatial-domain consensus function to obtain the spatial-domain classification result of the video;
and the time-domain classification unit includes:
a first time-domain classification module, configured to process the optical-flow image of each segment with the time-domain convolutional neural network to obtain a preliminary time-domain classification result of each segment; and
a second aggregation module, configured to aggregate the preliminary time-domain classification results of the plurality of segments with a time-domain consensus function to obtain the time-domain classification result of the video.
In another embodiment based on the above apparatus, the spatial-domain consensus function and/or the time-domain consensus function includes: an average function, a maximum function, or a weighted-average function.
In another embodiment based on the above apparatus, the spatial-domain consensus function is specifically whichever of the average, maximum, and weighted-average functions achieves the highest classification accuracy on a validation set; and
the time-domain consensus function is specifically whichever of the average, maximum, and weighted-average functions achieves the highest classification accuracy on a validation set.
In another embodiment based on the above apparatus, the preliminary spatial-domain classification result and the preliminary time-domain classification result are each a classification score vector whose dimension equals the number of classification categories;
the spatial-domain classification result and the time-domain classification result of the video are each a classification score vector whose dimension equals the number of classification categories; and
the classification result of the video is a classification score vector whose dimension equals the number of classification categories.
In another embodiment based on the above apparatus, the fusion unit is specifically configured to multiply the spatial-domain classification result and the time-domain classification result by their respective preset weight coefficients and sum them to obtain the classification result of the video.
In another embodiment based on the above apparatus, the ratio of the weight coefficients between the spatial-domain classification result and the time-domain classification result is 1:1.5.
In another embodiment based on the above apparatus, the optical-flow image is specifically an original optical-flow image, and the time-domain convolutional neural network is specifically a first time-domain convolutional neural network;
the first time-domain classification module is specifically configured to process the original optical-flow image of each segment with the first time-domain convolutional neural network to obtain a first preliminary time-domain classification result of each segment; and
the second aggregation module is specifically configured to aggregate the first preliminary time-domain classification results of the plurality of segments with a first time-domain consensus function to obtain a first time-domain classification result of the video.
In another embodiment based on the above apparatus, the apparatus further includes:
an optical-flow processing unit, configured to obtain a warped optical-flow image by warping the original optical-flow image;
and the time-domain classification unit further includes:
a second time-domain classification module, configured to process the warped optical-flow image of each segment with a second time-domain convolutional neural network to obtain a second preliminary time-domain classification result of each segment; and
a third aggregation module, configured to aggregate the second preliminary time-domain classification results of the plurality of segments to obtain a second time-domain classification result of the video;
and the fusion unit is specifically configured to fuse the spatial-domain classification result, the first time-domain classification result, and the second time-domain classification result to obtain the classification result of the video.
In another embodiment based on the above apparatus, the optical-flow processing unit is specifically configured to:
compute, for each pair of adjacent frames, the homography matrix between the two frames;
apply, for each pair of adjacent frames, an affine transformation to the latter frame according to the homography matrix between the two frames; and
compute the optical flow between the former frame and the transformed latter frame of each pair to obtain the warped optical-flow image.
In another embodiment based on the above apparatus, when computing over each pair of adjacent frames, the optical-flow processing unit is specifically configured to match feature points between the frames using speeded-up robust features (SURF) feature-point descriptors.
In another embodiment based on the above apparatus, the fusion unit is specifically configured to multiply the spatial-domain classification result, the first time-domain classification result, and the second time-domain classification result by their respective preset weight coefficients and sum them to obtain the classification result of the video.
In another embodiment based on the above apparatus, the ratio of the weight coefficients among the spatial-domain classification result, the first time-domain classification result, and the second time-domain classification result is 1:1:0.5.
In another embodiment based on the above apparatus, the apparatus further includes:
a first normalization unit, configured to normalize the classification score vector of the video with a Softmax function to obtain the probability vector of the video belonging to each category.
In another embodiment based on the above apparatus, the apparatus further includes:
a network training unit, configured to store the preset initial spatial-domain convolutional neural network and the preset initial time-domain convolutional neural network, and, based on each video serving as a sample, to train the initial spatial-domain convolutional neural network with stochastic gradient descent to obtain the spatial-domain convolutional neural network, and to train the initial time-domain convolutional neural network with stochastic gradient descent to obtain the time-domain convolutional neural network.
In another embodiment based on the above apparatus, when training the initial spatial-domain convolutional neural network with stochastic gradient descent, the network training unit is specifically configured to:
for one video serving as a sample, compare whether the spatial-domain classification result of the video obtained by the spatial-domain classification unit matches the preset standard spatial-domain classification result of the video;
if they differ, adjust the network parameters of the initial spatial-domain convolutional neural network, take the network with adjusted parameters as the initial spatial-domain convolutional neural network, and repeat the comparison for the next video serving as a sample; and
if they match, take the current initial spatial-domain convolutional neural network as the spatial-domain convolutional neural network.
In another embodiment based on the above apparatus, when training the initial time-domain convolutional neural network with stochastic gradient descent, the network training unit is specifically configured to:
for one video serving as a sample, compare whether the time-domain classification result of the video obtained by the time-domain classification unit matches the preset standard time-domain classification result of the video;
if they differ, adjust the network parameters of the initial time-domain convolutional neural network, take the network with adjusted parameters as the initial time-domain convolutional neural network, and repeat the comparison for the next video serving as a sample; and
if they match, take the current initial time-domain convolutional neural network as the time-domain convolutional neural network;
where the initial time-domain convolutional neural network includes the first or the second initial time-domain convolutional neural network, the time-domain classification result correspondingly includes the first or the second time-domain classification result, and the time-domain convolutional neural network correspondingly includes the first and the second time-domain convolutional neural networks.
In another embodiment based on the above apparatus, the apparatus further includes:
a second normalization unit, configured to normalize the spatial-domain classification result of the video with a Softmax function to obtain the spatial-domain probability vector of the video belonging to each category, and to normalize the time-domain classification result of the video with a Softmax function to obtain the time-domain probability vector of the video belonging to each category.
According to yet another aspect of the embodiments of the present invention, a data processing apparatus is provided, including the video category recognition apparatus of any of the above embodiments.
In another embodiment based on the above data processing apparatus, the data processing apparatus includes an advanced RISC machine (ARM), a central processing unit (CPU), or a graphics processing unit (GPU).
According to yet another aspect of the embodiments of the present invention, an electronic device is provided, equipped with the data processing apparatus of any of the above embodiments.
According to yet another aspect of the embodiments of the present invention, a computer storage medium is provided for storing computer-readable instructions, the instructions including: instructions for segmenting a video to obtain a plurality of video segments; instructions for sampling each of the plurality of video segments to obtain an original image and an optical-flow image of each segment; instructions for processing the original image of each segment with a spatial-domain convolutional neural network to obtain a spatial-domain classification result of the video, and for processing the optical-flow image of each segment with a time-domain convolutional neural network to obtain a time-domain classification result of the video; and instructions for fusing the spatial-domain classification result with the time-domain classification result to obtain the classification result of the video.
According to yet another aspect of the embodiments of the present invention, a computer device is provided, including:
a memory storing executable instructions; and
one or more processors communicating with the memory to execute the executable instructions so as to perform the operations corresponding to the video category recognition method of any of the above embodiments of the invention.
The video category recognition method and apparatus, data processing apparatus, and electronic device provided by the above embodiments of the invention segment a video to obtain a plurality of video segments; sample each segment to obtain its original image and optical-flow image; process the original image of each segment with a spatial-domain convolutional neural network to obtain the spatial-domain classification result of the video; process the optical-flow image of each segment with a time-domain convolutional neural network to obtain the time-domain classification result of the video; and finally fuse the two results to obtain the classification result of the video. By dividing the video into multiple segments and sampling frame images and inter-frame optical flow from each segment, the embodiments of the invention can model long-duration actions when training the convolutional neural networks, so that when the trained network models are later used for video classification, the accuracy of video category recognition is improved over the prior art, the recognition effect is improved, and the computational cost is low.
Description of the drawings
The accompanying drawings, which constitute a part of the specification, illustrate embodiments of the invention and, together with the description, serve to explain the principles of the invention.
The invention can be understood more clearly from the following detailed description with reference to the accompanying drawings, in which:
Fig. 1 is a flowchart of one embodiment of the video category recognition method of the invention.
Fig. 2 is a flowchart of another embodiment of the video category recognition method of the invention.
Fig. 3 is a flowchart of yet another embodiment of the video category recognition method of the invention.
Fig. 4 is a flowchart of one embodiment of training the initial spatial-domain convolutional neural network in an embodiment of the invention.
Fig. 5 is a flowchart of one embodiment of training the initial time-domain convolutional neural network in an embodiment of the invention.
Fig. 6 is a schematic structural diagram of one embodiment of the video category recognition apparatus of the invention.
Fig. 7 is a schematic structural diagram of another embodiment of the video category recognition apparatus of the invention.
Fig. 8 is a schematic structural diagram of yet another embodiment of the video category recognition apparatus of the invention.
Fig. 9 is a schematic structural diagram of yet another embodiment of the video category recognition apparatus of the invention.
Fig. 10 is a schematic structural diagram of a further embodiment of the video category recognition apparatus of the invention.
Fig. 11 is a schematic diagram of an application example of the video category recognition apparatus of the invention.
Fig. 12 is a schematic structural diagram of one embodiment of the electronic device of the invention.
Detailed description
Various exemplary embodiments of the invention will now be described in detail with reference to the accompanying drawings. It should be noted that, unless otherwise specified, the relative arrangement of the components, the numerical expressions, and the values set forth in these embodiments do not limit the scope of the invention.
It should also be understood that, for ease of description, the sizes of the various parts shown in the drawings are not drawn according to their actual proportional relationships.
The following description of at least one exemplary embodiment is merely illustrative and in no way limits the invention or its application or use.
Techniques, methods, and devices known to those of ordinary skill in the relevant art may not be discussed in detail, but where appropriate, such techniques, methods, and devices should be considered part of the specification.
It should be noted that similar reference numerals and letters denote similar items in the following drawings; therefore, once an item is defined in one drawing, it need not be further discussed in subsequent drawings.
Embodiments of the invention may be applied to computer systems/servers, which can operate together with numerous other general-purpose or special-purpose computing system environments or configurations. Examples of well-known computing systems, environments, and/or configurations suitable for use with computer systems/servers include, but are not limited to: personal computer systems, server computer systems, thin clients, thick clients, handheld or laptop devices, microprocessor-based systems, set-top boxes, programmable consumer electronics, network PCs, minicomputer systems, mainframe computer systems, distributed cloud computing environments including any of the above systems, and the like.
Computer systems/servers may be described in the general context of computer-system-executable instructions (such as program modules) executed by a computer system. Generally, program modules may include routines, programs, target programs, components, logic, data structures, and so on, which perform specific tasks or implement specific abstract data types. Computer systems/servers may also be implemented in distributed cloud computing environments, where tasks are performed by remote processing devices linked through a communication network. In distributed cloud computing environments, program modules may be located on local or remote computing system storage media including storage devices.
Among action recognition techniques based on deep learning, the two-stream convolutional neural network (Two-Stream Convolutional Neural Network) is a representative network model. It uses two convolutional neural networks, a spatial-domain network and a time-domain network, to model frame images and inter-frame optical flow respectively, and identifies the action in the video by fusing the classification results of the two networks.
However, in the course of implementation the inventors found that, although two-stream convolutional neural networks can model frame images and inter-frame optical flow, i.e., short-term motion information, they lack the ability to model long-duration actions, so the accuracy of action recognition cannot be guaranteed.
Fig. 1 is a flowchart of one embodiment of the video category recognition method of the invention. As shown in Fig. 1, the method of this embodiment includes:
102: segmenting the video to obtain a plurality of video segments.
As a specific example, the video may be segmented evenly into a plurality of segments of equal length, for example into 3 or 5 equal-length segments, the specific number of segments being determined according to the actual effect. Alternatively, the video may be segmented randomly, or several sections may be extracted from the video to serve as the plurality of segments.
In a specific implementation, after a video is received, its length can be obtained, the length of each segment can be determined from the video length and a preset number of segments, and the video can be divided accordingly into equal-length segments.
When the video is segmented evenly, every segment has the same length. This simplifies the training process when training convolutional neural network models on long videos, and, because the recognition time needed for each segment is similar, improves the overall efficiency of video category recognition when the trained networks are used.
104: sampling each of the plurality of video segments to obtain the original image and the optical-flow images of each segment.
Exemplarily, the original image of each segment may be obtained by randomly selecting one frame from the segment.
Exemplarily, the optical-flow images of each segment may be obtained by randomly selecting a plurality of consecutive frames from the segment.
In one specific example of the embodiments of the invention, the optical-flow image may be a grayscale image based on an 8-bit bitmap with 256 discrete levels, the middle value of the grayscale image being 128.
Because an optical-flow field is a vector field, representing an optical-flow image as grayscale images requires two scalar-field images, corresponding respectively to the magnitudes of the flow along the X-axis and the Y-axis of the image coordinate system.
Specifically, randomly selecting consecutive frames from each segment and obtaining its optical-flow images can be implemented as follows, for each segment:
randomly selecting N consecutive frames from the segment, where N is an integer greater than 1; and
computing one group of optical-flow images from each pair of adjacent frames among the N frames, yielding N-1 groups of optical-flow images, where each group includes one horizontal optical-flow image and one vertical optical-flow image.
For example, for each segment, 6 consecutive frames may be randomly selected, and the optical flow between each pair of adjacent frames among the 6 frames computed, yielding 5 groups of grayscale optical-flow images. Each group includes one horizontal and one vertical grayscale optical-flow image, giving 10 grayscale optical-flow frames in total, which can serve as a 10-channel image.
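Reusing the hypothetical `compute_flow_stack` helper sketched earlier, the 6-frame example above can be shape-checked as follows (the 224x224 frame size is an assumption for illustration):

```python
import numpy as np

# 6 consecutive frames -> 5 adjacent pairs -> 5 x (horizontal + vertical) = 10 channels.
frames = [np.zeros((224, 224, 3), np.uint8) for _ in range(6)]  # dummy BGR frames
stack = compute_flow_stack(frames)  # hypothetical helper from the earlier sketch
assert stack.shape == (10, 224, 224)  # 10-channel input to the time-domain network
```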
106: processing the original image of each segment with the spatial-domain convolutional neural network to obtain the spatial-domain classification result of the video, and processing the optical-flow images of each segment with the time-domain convolutional neural network to obtain the time-domain classification result of the video.
Here, the spatial-domain classification result and the time-domain classification result of the video are each a classification score vector whose dimension equals the number of classification categories. For example, if the categories are running, high jump, race walking, pole vault, long jump, and triple jump, 6 categories in total, then the spatial-domain and time-domain classification results are each 6-dimensional classification score vectors.
108: fusing the spatial-domain classification result with the time-domain classification result to obtain the classification result of the video.
Here, the classification result of the video is a classification score vector whose dimension equals the number of classification categories; with the 6 categories above, it is a 6-dimensional classification score vector.
As a specific example, the fusion may consist of multiplying the spatial-domain and time-domain classification results by their respective preset weight coefficients and summing them to obtain the classification result of the video. The weight coefficients are determined according to the classification accuracy of the corresponding network models on a validation set, the model with the higher accuracy receiving the higher weight. The validation set consists of videos labeled with their true categories that did not participate in network training, and can be obtained in any feasible way, for example by searching for videos of the respective categories with a search engine.
For example, in one concrete application, the ratio of the weight coefficients between the spatial-domain classification result and the time-domain classification result may be 1:1.5.
With the video category recognition method provided by the above embodiment of the invention, the video is segmented into a plurality of segments; each segment is sampled to obtain its original image and optical-flow image; the original image of each segment is processed with a spatial-domain convolutional neural network to obtain the spatial-domain classification result of the video; the optical-flow image of each segment is processed with a time-domain convolutional neural network to obtain the time-domain classification result of the video; and finally the two results are fused to obtain the classification result of the video. By dividing the video into multiple segments and sampling frame images and inter-frame optical flow from each segment, this embodiment can model long-duration actions when training the convolutional neural networks, so that when the trained network models are later used for classification, the accuracy and effect of video category recognition are improved over the prior art at a low computational cost.
Fig. 2 is a flowchart of another embodiment of the video category recognition method of the invention. As shown in Fig. 2, the method of this embodiment includes:
202: segmenting the video to obtain a plurality of video segments.
As a specific example, the video may be segmented evenly into a plurality of equal-length segments, which simplifies the training of the convolutional neural network models and improves the overall efficiency of video category recognition; for example into 3 or 5 equal-length segments, the specific number being determined according to the actual effect. Alternatively, the video may be segmented randomly, or several sections may be extracted from it as the plurality of segments. As shown in Fig. 11, in one application embodiment of the video category recognition method of the invention, the video is divided into 3 segments.
204: sampling each of the plurality of video segments to obtain the original image and the optical-flow images of each segment.
For example, one frame may be randomly selected from each segment as its original image, and a plurality of consecutive frames may be randomly selected from each segment to obtain its optical-flow images.
As shown in Fig. 11, in one application embodiment of the method, the 3 segments are each sampled to obtain one original frame image and the inter-frame optical-flow images of each segment, where the original image is an RGB color image and the optical-flow images are grayscale images.
206: processing the original image of each segment with the spatial-domain convolutional neural network to obtain a preliminary spatial-domain classification result of each segment, and processing the optical-flow images of each segment with the time-domain convolutional neural network to obtain a preliminary time-domain classification result of each segment.
Here, the preliminary spatial-domain and time-domain classification results are each a classification score vector whose dimension equals the number of classification categories. For example, with the 6 categories of running, high jump, race walking, pole vault, long jump, and triple jump, the preliminary spatial-domain and time-domain classification results are each 6-dimensional classification score vectors.
As shown in Fig. 11, in one application embodiment, the spatial-domain convolutional neural network processes the original images of the 3 segments to obtain 3 preliminary spatial-domain classification results, and the time-domain convolutional neural network processes the optical-flow images of the 3 segments to obtain 3 preliminary time-domain classification results. In a specific implementation, the spatial-domain and/or time-domain convolutional neural network may first pass the input through a combination of convolutional layers, nonlinear layers, pooling layers, and so on, to obtain a feature representation of the image, and then pass it through a linear classification layer to obtain the score of each category, i.e., the preliminary classification result of each segment. For example, with the 6 categories above, the preliminary spatial-domain and time-domain classification results of each segment are each 6-dimensional vectors containing the video's classification scores for those 6 categories.
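A toy PyTorch sketch of the structure just described (the layer sizes and depths are illustrative assumptions; the patent does not fix an architecture): convolutional, nonlinear, and pooling layers produce a feature representation, and a linear classification layer outputs one score per category.

```python
import torch
import torch.nn as nn

class StreamNet(nn.Module):
    """Toy two-stream-style backbone: conv + nonlinearity + pooling -> features,
    then a linear classification layer giving one score per category."""
    def __init__(self, in_channels=3, num_categories=6):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(in_channels, 32, kernel_size=3, padding=1), nn.ReLU(),
            nn.MaxPool2d(2),
            nn.Conv2d(32, 64, kernel_size=3, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1),
        )
        self.classifier = nn.Linear(64, num_categories)

    def forward(self, x):
        f = self.features(x).flatten(1)   # feature representation of the image
        return self.classifier(f)         # per-category classification scores

spatial_net = StreamNet(in_channels=3)    # RGB frame input
temporal_net = StreamNet(in_channels=10)  # stacked optical flow (5 pairs x 2 channels)
```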
208: aggregating the preliminary spatial-domain classification results of the plurality of segments with the spatial-domain consensus function to obtain the spatial-domain classification result of the video, and aggregating the preliminary time-domain classification results of the plurality of segments with the time-domain consensus function to obtain the time-domain classification result of the video.
Here, the spatial-domain classification result and the time-domain classification result of the video are each a classification score vector whose dimension equals the number of classification categories.
In a specific implementation, the spatial-domain consensus function and/or the time-domain consensus function includes an average function, a maximum function, or a weighted-average function; specifically, whichever of the three achieves the highest classification accuracy on a validation set is chosen as the spatial-domain consensus function, and likewise as the time-domain consensus function.
Specifically, the average function takes the average of the scores of a given category across the different segments as the output score of that category; the maximum function takes the maximum of the scores of a given category across the different segments as the output score of that category; and the weighted-average function takes a weighted average of the scores of a given category across the different segments as the output score of that category, where all categories share one set of weights, which are obtained by optimization as network model parameters during training.
For example, in the application embodiment shown in Fig. 11, the average function may be chosen as both the spatial-domain and the time-domain consensus function. The spatial-domain consensus function computes, for each category, the average of the 3 scores of that category in the 3 preliminary spatial-domain classification results, yielding one set of category scores as the spatial-domain classification result of the video; the time-domain consensus function computes, for each category, the average of the 3 scores of that category in the 3 preliminary time-domain classification results, yielding one set of category scores as the time-domain classification result of the video. For example, with the 6 categories above, the spatial-domain and time-domain classification results of the video are each 6-dimensional vectors containing the video's category scores for those 6 categories.
210, spatial domain classification results and time domain result are carried out with fusion treatment, the classification results of video are obtained.
Wherein, the classification results of video are the classification results vector that dimension is equal to class categories quantity.
As shown in Fig. 11, in an application example of the video category recognition method of the present invention, the spatial classification result and the temporal classification result of the video are multiplied by weight coefficients in a 1:1.5 ratio and then summed to obtain the classification result of the video. For example, with the 6 categories running, high jump, footrace, pole vault, long jump, and triple jump, the classification result of the video is a 6-dimensional vector containing the scores of the video for these 6 categories. The category with the highest score is the category of the video; in this example the highest-scoring category is high jump, so the video is recognized as belonging to the high jump category.
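For illustration, a minimal sketch of the weighted fusion and final category selection (PyTorch assumed; the score values are made up):

```python
import torch

CATEGORIES = ["running", "high jump", "footrace", "pole vault", "long jump", "triple jump"]

def fuse(spatial: torch.Tensor, temporal: torch.Tensor,
         w_spatial: float = 1.0, w_temporal: float = 1.5) -> torch.Tensor:
    # Multiply each stream's result by its preset weight coefficient, then sum.
    return w_spatial * spatial + w_temporal * temporal

spatial_result = torch.tensor([0.1, 2.0, 0.3, 0.5, 0.2, 0.1])
temporal_result = torch.tensor([0.2, 1.8, 0.1, 0.4, 0.3, 0.2])
video_result = fuse(spatial_result, temporal_result)
print(CATEGORIES[video_result.argmax().item()])  # -> "high jump"

# With a warped-flow stream (the Fig. 3 embodiment below), the same idea extends
# to three results with a weight ratio of 1 : 1 : 0.5.
```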
According to the video category recognition method provided by the above embodiments of the present invention, a consensus function is applied across the video segments: the preliminary classification results of the segments are combined by the consensus function to obtain the classification result of the video. Because the consensus function places no restriction on the convolutional neural network model applied to each segment, the multiple video segments can share the parameters of one network model, keeping the number of model parameters small, so that a network model with fewer parameters can recognize the category of a video of arbitrary length. During training, a video of arbitrary length is segmented and the segment-based network is trained; supervised learning is performed by comparing the classification result of the whole video with its ground-truth label, achieving video-level training supervision that is not constrained by video length.
Fig. 3 is a flowchart of another embodiment of the video category recognition method of the present invention. As shown in Fig. 3, the method of this embodiment includes:

At 302, the video is segmented to obtain multiple video segments.

At 304, each of the multiple video segments is sampled to obtain the original image and the original optical flow image of each segment.

At 306, warped optical flow images corresponding to the original optical flow images are obtained.
In a specific implementation, obtaining the warped optical flow images includes: computing, for each pair of adjacent frames, the homography transformation matrix between the two frames; applying, for each pair of adjacent frames, an affine transformation to the latter frame according to the corresponding homography transformation matrix; and computing, for each pair of adjacent frames, the optical flow between the former frame and the transformed latter frame to obtain the warped optical flow image.

Because there is no longer a homography transformation between the feature points of the latter frame after the above affine transformation and the corresponding feature points of the former frame serving as the reference, using the warped optical flow computed between the former frame and the transformed latter frame as input for video category recognition reduces the influence of camera motion on the recognition result.

Specifically, the computation on each pair of adjacent frames includes: performing inter-frame feature point matching according to Speeded-Up Robust Features (SURF) feature point descriptors.
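The warping step might be sketched as follows with OpenCV. This is a sketch under stated assumptions: ORB stands in for SURF (whose implementation sits in the non-free opencv-contrib module), and Farneback flow stands in for whichever optical flow algorithm the embodiment actually uses.

```python
import cv2
import numpy as np

def warped_flow(prev_gray: np.ndarray, next_gray: np.ndarray) -> np.ndarray:
    # 1) Inter-frame feature point matching (ORB standing in for SURF).
    detector = cv2.ORB_create()
    kp1, des1 = detector.detectAndCompute(prev_gray, None)
    kp2, des2 = detector.detectAndCompute(next_gray, None)
    matches = cv2.BFMatcher(cv2.NORM_HAMMING, crossCheck=True).match(des1, des2)
    src = np.float32([kp2[m.trainIdx].pt for m in matches]).reshape(-1, 1, 2)
    dst = np.float32([kp1[m.queryIdx].pt for m in matches]).reshape(-1, 1, 2)
    # 2) Homography transformation matrix between the adjacent frames.
    H, _ = cv2.findHomography(src, dst, cv2.RANSAC, 5.0)
    # 3) Transform the latter frame onto the former frame's viewpoint.
    h, w = prev_gray.shape
    warped_next = cv2.warpPerspective(next_gray, H, (w, h))
    # 4) Optical flow between the former frame and the transformed latter frame.
    return cv2.calcOpticalFlowFarneback(prev_gray, warped_next, None,
                                        0.5, 3, 15, 3, 5, 1.2, 0)
```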
At 308, a spatial convolutional neural network is used to process the original image of each video segment, obtaining the spatial preliminary classification result of each segment; a first temporal convolutional neural network is used to process the original optical flow image of each segment, obtaining the first temporal preliminary classification result of each segment; and a second temporal convolutional neural network is used to process the warped optical flow image of each segment, obtaining the second temporal preliminary classification result of each segment.

At 310, the spatial preliminary classification results of the multiple video segments are aggregated using a spatial consensus function to obtain the spatial classification result of the video; the first temporal preliminary classification results of the multiple video segments are aggregated using a first temporal consensus function to obtain the first temporal classification result of the video; and the second temporal preliminary classification results of the multiple video segments are aggregated using a second temporal consensus function to obtain the second temporal classification result of the video.

At 312, the spatial classification result, the first temporal classification result, and the second temporal classification result are fused to obtain the classification result of the video.

As a specific example, fusing the spatial classification result, the first temporal classification result, and the second temporal classification result includes: multiplying the spatial classification result, the first temporal classification result, and the second temporal classification result by preset weight coefficients, respectively, and summing the products to obtain the classification result of the video. The weight coefficients are determined according to the classification accuracy of the corresponding network models on a validation dataset; a network model with higher classification accuracy receives a higher weight.

For example, in a specific application, the weight coefficient ratio between the spatial classification result, the first temporal classification result, and the second temporal classification result may be 1:1:0.5.
Widely used two-stream convolutional neural networks use optical flow images as the short-term motion representation but do not account for camera motion when extracting the optical flow; when camera motion is large, this can make actions in the video unrecognizable and degrade recognition performance.

According to the video category recognition method provided by the above embodiments of the present invention, in addition to frame images and inter-frame optical flow, warped optical flow is used as an additional short-term motion representation, expanding the input of video category recognition to three kinds of information: frame images, inter-frame optical flow, and warped optical flow. Because warped optical flow eliminates the influence of camera motion, the impact of camera motion on video category recognition performance is reduced. During training, the same three kinds of input information, i.e., frame images, inter-frame optical flow, and warped optical flow, are used to train the network models, which reduces the impact of camera motion on the models and makes the video category recognition system more robust to camera motion.

The video category recognition methods of the above embodiments of the present invention are applicable to the training stage of the convolutional neural network models, as well as to the testing stage and subsequent application stages of the models.
In another embodiment of the video category recognition method of the present invention, when the methods of the above embodiments are applied in the testing stage or subsequent application stages of the convolutional neural network models, after the classification result of the video is obtained in operation 108, 210, or 312, the classification result vector obtained by the fusion may be normalized with a Softmax function to obtain the vector of probabilities that the video belongs to each category.
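As a sketch (PyTorch assumed), this normalization is a single softmax over the fused score vector:

```python
import torch

def to_class_probabilities(fused_scores: torch.Tensor) -> torch.Tensor:
    # Normalize the fused classification result vector into per-category
    # probabilities that sum to 1.
    return torch.softmax(fused_scores, dim=0)

probs = to_class_probabilities(torch.tensor([0.4, 4.7, 0.55, 1.1, 0.65, 0.4]))
```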
In yet another embodiment of the video category recognition method of the present invention, when the methods of the above embodiments are applied in the training stage of the convolutional neural network models, the following operations may also be included:

presetting an initial spatial convolutional neural network and an initial temporal convolutional neural network;

based on each video serving as a sample, training the initial spatial convolutional neural network using stochastic gradient descent (SGD) to obtain the spatial convolutional neural network of the above embodiments; and training the initial temporal convolutional neural network using stochastic gradient descent to obtain the temporal convolutional neural network of the above embodiments.

Here, each video serving as a sample is labeled in advance with standard spatial classification result information.

Stochastic gradient descent updates the initial network model iteratively with each individual sample; training the initial spatial convolutional neural network and the initial temporal convolutional neural network with stochastic gradient descent is fast and improves network training efficiency.
Fig. 4 is a flowchart of one embodiment of training the initial spatial convolutional neural network in an embodiment of the present invention. As shown in Fig. 4, the embodiment includes:

At 402, for one video serving as a sample, the operations of the flows shown in the above embodiments of the present invention are performed until the spatial classification result of the video is obtained.

For example, the spatially related operations among operations 102-106, 202-208, or 302-310 are performed to obtain the spatial classification result of the video.

At 404, it is compared whether the deviation of the spatial classification result of the video from the preset standard spatial classification result of the video is less than a preset range.

If the deviation is not less than the preset range, operation 406 is performed. If the deviation is less than the preset range, the training flow for the initial spatial convolutional neural network ends, the current initial spatial convolutional neural network is taken as the final spatial convolutional neural network, and the subsequent operations of this embodiment are not performed.

At 406, the network parameters of the initial spatial convolutional neural network are adjusted.

At 408, with the spatial convolutional neural network whose parameters have been adjusted as the new initial spatial convolutional neural network, operation 402 is performed for the next video serving as a sample.
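A minimal sketch of this per-sample training loop (PyTorch assumed; `spatial_forward`, the loss threshold, and the learning rate are hypothetical stand-ins for the pipeline of operation 402 and the preset range of operation 404):

```python
import torch
import torch.nn as nn

def train_spatial(net: nn.Module, samples, spatial_forward, threshold: float = 1e-2):
    optimizer = torch.optim.SGD(net.parameters(), lr=0.01)  # per-sample SGD updates
    criterion = nn.CrossEntropyLoss()
    for video, label in samples:                 # each video is one labeled sample
        scores = spatial_forward(net, video)     # video-level spatial result (402)
        loss = criterion(scores.unsqueeze(0), torch.tensor([label]))
        if loss.item() < threshold:              # deviation within preset range (404)
            break                                # current net is the final network
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()                         # adjust the network parameters (406)
    return net
```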
Fig. 5 is a flowchart of one embodiment of training the initial temporal convolutional neural network in an embodiment of the present invention. As shown in Fig. 5, the embodiment includes:

At 502, for one video serving as a sample, the operations beginning with segmenting the video are performed until the temporal classification result of the video is obtained.

For example, the temporally related operations among operations 102-106, 202-208, or 302-310 are performed to obtain the temporal classification result of the video.

At 504, it is compared whether the deviation of the temporal classification result of the video from the preset standard temporal classification result of the video is less than a preset range.

If the deviation is not less than the preset range, operation 506 is performed. If the deviation is less than the preset range, the training flow for the initial temporal convolutional neural network ends, the current initial temporal convolutional neural network is taken as the final temporal convolutional neural network, and the subsequent operations of this embodiment are not performed.

At 506, the network parameters of the initial temporal convolutional neural network are adjusted.

At 508, with the temporal convolutional neural network whose parameters have been adjusted as the new initial temporal convolutional neural network, operation 502 is performed for the next video serving as a sample.

Specifically, in the embodiment shown in Fig. 5, the initial temporal convolutional neural network may be the first initial temporal convolutional neural network or the second initial temporal convolutional neural network; the temporal classification result accordingly includes the first temporal classification result or the second temporal classification result, and the temporal convolutional neural network accordingly includes the first temporal convolutional neural network and the second temporal convolutional neural network. That is, the embodiment shown in Fig. 5 can be used to train the first initial temporal convolutional neural network and the second initial temporal convolutional neural network separately or simultaneously.
Further, when training the initial spatial convolutional neural network and the initial temporal convolutional neural network via the embodiments shown in Fig. 4 and Fig. 5, the following operations may also be included:

normalizing the spatial classification result of the video with a Softmax function to obtain the spatial probability vector of the video belonging to each category; and normalizing the temporal classification result of the video with a Softmax function to obtain the temporal probability vector of the video belonging to each category.

Correspondingly, the spatial classification results and temporal classification results shown in Fig. 4 and Fig. 5 may specifically be unnormalized classification results or normalized class probability vectors.
Fig. 6 is a schematic structural diagram of one embodiment of the video category recognition apparatus of the present invention. The apparatus of this embodiment can be used to implement the video category recognition methods of the above embodiments of the present invention. As shown in Fig. 6, the video category recognition apparatus of this embodiment includes: a segmentation unit, a sampling unit, a spatial classification processing unit, a temporal classification processing unit, and a fusion unit. Wherein:

The segmentation unit is configured to segment a video to obtain multiple video segments.

As a specific example, the segmentation unit may be configured to segment the video evenly to obtain multiple video segments of equal length, for example dividing the video into 3 or 5 equal-length segments, the specific number being determined by actual performance. Alternatively, the video may be segmented randomly, or several sections may be extracted from the video as the multiple video segments.

In a specific implementation, after a video is received, its length is obtained; the length of each segment is determined from the video length and a preset number of segments, and the received video is divided accordingly into multiple video segments of equal length, as sketched below.
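A minimal sketch of the even segmentation, together with the random frame sampling performed by the image sampling module described below (plain Python; the function names are illustrative and `frames` is assumed to be the decoded frame list of the video):

```python
import random

def segment(frames, num_segments=3):
    # Split the video into equal-length segments based on its total length
    # and the preset number of segments.
    seg_len = len(frames) // num_segments
    return [frames[i * seg_len:(i + 1) * seg_len] for i in range(num_segments)]

def sample_original_image(segment_frames):
    # Randomly pick one frame from the segment as its original image.
    return random.choice(segment_frames)
```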
The sampling unit is configured to sample each of the multiple video segments to obtain the original image and the optical flow image of each segment.

Exemplarily, the sampling unit may specifically include:

an image sampling module configured to randomly select one frame from each video segment as the original image of that segment; and

an optical flow sampling module configured to randomly select consecutive frames from each video segment to obtain the optical flow image of that segment.

In one specific example of the embodiments of the present invention, the optical flow image may be, for example, a grayscale image based on an 8-bit bitmap with 256 discrete levels, the mid-value of the grayscale image being 128.

Specifically, the optical flow sampling module may be configured, for each video segment, to:

randomly select N consecutive frames from the segment, where N is an integer greater than 1; and

compute on each pair of adjacent frames among the N frames to obtain N-1 groups of optical flow images, each group containing one horizontal optical flow image and one vertical optical flow image.

For example, for each video segment: 6 consecutive frames may be randomly selected from the segment, and optical flow is computed on each pair of adjacent frames among the 6, yielding 5 groups of optical flow grayscale images, each group containing one horizontal and one vertical optical flow grayscale image, for 10 optical flow grayscale images in total, which can serve as a 10-channel input image.
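A minimal sketch of the flow sampling and 8-bit quantization (OpenCV/NumPy assumed; the clipping bound and the use of Farneback flow are illustrative assumptions): 6 frames yield 5 flow pairs, stacked as 10 channels.

```python
import cv2
import numpy as np

def flow_stack(gray_frames, bound=20.0):
    channels = []
    for prev, nxt in zip(gray_frames[:-1], gray_frames[1:]):
        flow = cv2.calcOpticalFlowFarneback(prev, nxt, None,
                                            0.5, 3, 15, 3, 5, 1.2, 0)
        for c in range(2):  # horizontal (x) then vertical (y) flow component
            # Map [-bound, bound] to [0, 255] so zero motion sits near mid-value 128.
            q = np.clip(flow[..., c], -bound, bound)
            channels.append(((q + bound) * (255.0 / (2 * bound))).astype(np.uint8))
    return np.stack(channels, axis=0)  # shape: (2*(N-1), H, W), e.g. (10, H, W)
```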
The spatial classification processing unit is configured to process the original image of each video segment with a spatial convolutional neural network to obtain the spatial classification result of each segment.

Here, the spatial classification result of the video is a classification result vector whose dimension equals the number of classification categories. For example, with the 6 categories running, high jump, footrace, pole vault, long jump, and triple jump, the spatial classification result is a classification result vector of dimension 6.

The temporal classification processing unit is configured to process the optical flow image of each video segment with a temporal convolutional neural network to obtain the temporal classification result of each segment.

Here, the temporal classification result of the video is a classification result vector whose dimension equals the number of classification categories. For example, with the 6 categories above, the temporal classification result is a classification result vector of dimension 6.

The fusion unit is configured to fuse the spatial classification result and the temporal classification result to obtain the classification result of the video.

Here, the classification result of the video is a classification result vector whose dimension equals the number of classification categories. For example, with the 6 categories above, the classification result of the video is a classification result vector of dimension 6.

As a specific example, fusing the spatial classification result and the temporal classification result includes: multiplying the spatial classification result and the temporal classification result by preset weight coefficients, respectively, and summing the products to obtain the classification result of the video. The weight coefficients are determined according to the classification accuracy of the corresponding network models on a validation dataset; a network model with higher classification accuracy receives a higher weight.

For example, in a specific application, the weight coefficient ratio between the spatial classification result and the temporal classification result may be 1:1.5.

According to the video category recognition apparatus provided by the above embodiments of the present invention, a video is segmented into multiple video segments; each segment is sampled to obtain its original image and optical flow image; a spatial convolutional neural network and a temporal convolutional neural network are then used to process the original images and the optical flow images of the segments, respectively, to obtain the spatial classification result and the temporal classification result of each segment; finally, the spatial classification results and the temporal classification results are fused to obtain the classification result of the video. By dividing the video into multiple segments and separately sampling frame images and inter-frame optical flow for each segment, the embodiments of the present invention enable modeling of long-duration actions when training the convolutional neural networks, so that when the trained network models are later used for video classification, the accuracy of video category recognition is improved over the prior art, the recognition performance is better, and the computational cost is lower.
Fig. 7 is a schematic structural diagram of another embodiment of the video category recognition apparatus of the present invention. As shown in Fig. 7, compared with the embodiment shown in Fig. 6, in this embodiment the spatial classification processing unit specifically includes a spatial classification processing module and a first aggregation module. Wherein:

The spatial classification processing module is configured to process the original image of each video segment with a spatial convolutional neural network to obtain the spatial preliminary classification result of each segment.

Here, the spatial preliminary classification result is a classification result vector whose dimension equals the number of classification categories. For example, with the 6 categories running, high jump, footrace, pole vault, long jump, and triple jump, the spatial preliminary classification result of the video is a classification result vector of dimension 6.

The first aggregation module is configured to aggregate the spatial preliminary classification results of the multiple video segments using a spatial consensus function to obtain the spatial classification result of the video.

In a specific implementation, the spatial consensus function includes: an average function, a max function, or a weighted average function, specifically whichever of these achieves the highest classification accuracy on a validation dataset.

Specifically, the average function takes, for each category, the mean of that category's scores across the different segments as the output score for that category; the max function takes, for each category, the maximum of that category's scores across the different segments as the output score; the weighted average function takes, for each category, a weighted mean of that category's scores across the different segments as the output score, where all categories share the same set of weights, which are obtained through optimization as network model parameters during training.

For example, in a specific application, the average function may be chosen as the spatial consensus function, with the video divided into 3 segments. The spatial convolutional neural network then produces 3 groups of category scores, so each category has 3 scores, one from each of the 3 segments; the average function takes the mean of each category's 3 scores as the score of that category, yielding one group of category scores covering all categories.
Referring again to Fig. 7, in yet another embodiment, the temporal classification processing unit specifically includes a first temporal classification processing module and a second aggregation module. Wherein:

The first temporal classification processing module is configured to process the optical flow image of each video segment with a temporal convolutional neural network to obtain the temporal preliminary classification result of each segment.

Here, the temporal preliminary classification result is a classification result vector whose dimension equals the number of classification categories. For example, with the 6 categories above, the temporal preliminary classification result of the video is a classification result vector of dimension 6.

The second aggregation module is configured to aggregate the temporal preliminary classification results of the multiple video segments using a temporal consensus function to obtain the temporal classification result of the video.

In a specific implementation, the temporal consensus function includes: an average function, a max function, or a weighted average function, specifically whichever of these achieves the highest classification accuracy on a validation dataset.

According to the video category recognition apparatus provided by the above embodiments of the present invention, a consensus function is applied across the video segments: the preliminary classification results of the segments are combined by the consensus function to obtain the classification result of the video. Because the consensus function places no restriction on the convolutional neural network model applied to each segment, the multiple video segments can share the parameters of one network model, keeping the number of model parameters small, so that a network model with fewer parameters can recognize the category of a video of arbitrary length. During training, a video of arbitrary length is segmented and the segment-based network is trained; supervised learning is performed by comparing the classification result of the whole video with its ground-truth label, achieving video-level training supervision that is not constrained by video length.
Fig. 8 is a schematic structural diagram of yet another embodiment of the video category recognition apparatus of the present invention. As shown in Fig. 8, compared with the embodiments shown in Fig. 6 and Fig. 7, in this embodiment of the present invention the optical flow image is the original optical flow image, the temporal convolutional neural network is the first temporal convolutional neural network, and the video category recognition apparatus of this embodiment further includes:

an optical flow processing unit configured to obtain the warped optical flow images corresponding to the original optical flow images.

In a specific implementation, the optical flow processing unit is specifically configured to: compute on each pair of adjacent frames to obtain the homography transformation matrix between the two frames; apply, according to the homography transformation matrix of each pair of adjacent frames, an affine transformation to the latter frame of the pair; and compute on the former frame and the transformed latter frame of each pair to obtain the warped optical flow image.

Specifically, when computing on each pair of adjacent frames, the optical flow processing unit performs inter-frame feature point matching according to Speeded-Up Robust Features (SURF) feature point descriptors.

The temporal classification processing unit of this embodiment includes: a first temporal classification processing module, a second aggregation module, a second temporal classification processing module, and a third aggregation module. Wherein:

The first temporal classification processing module is specifically configured to process the original optical flow image of each video segment with the first temporal convolutional neural network to obtain the first temporal preliminary classification result of each segment;

The second aggregation module is specifically configured to aggregate the first temporal preliminary classification results of the multiple video segments using a first temporal consensus function to obtain the first temporal classification result of the video.

The second temporal classification processing module is configured to process the warped optical flow image of each video segment with a second temporal convolutional neural network to obtain the second temporal preliminary classification result of each segment.

The third aggregation module is configured to aggregate the second temporal preliminary classification results of the multiple video segments using a second temporal consensus function to obtain the second temporal classification result of the video.

The fusion unit is specifically configured to fuse the spatial classification result, the first temporal classification result, and the second temporal classification result to obtain the classification result of the video.

As a specific example, the fusion unit is specifically configured to multiply the spatial classification result, the first temporal classification result, and the second temporal classification result by preset weight coefficients, respectively, and sum the products to obtain the classification result of the video. The weight coefficients are determined according to the classification accuracy of the corresponding networks on a validation dataset; a network model with higher classification accuracy receives a higher weight.

For example, in a specific application, the weight coefficient ratio between the spatial classification result, the first temporal classification result, and the second temporal classification result may be 1:1:0.5.

According to the video category recognition apparatus provided by the above embodiments of the present invention, in addition to frame images and inter-frame optical flow, warped optical flow is used as an additional short-term motion representation, expanding the input of video category recognition to three kinds of information: frame images, inter-frame optical flow, and warped optical flow. Because warped optical flow eliminates the influence of camera motion, the impact of camera motion on video category recognition performance is reduced. During training, the same three kinds of input information, i.e., frame images, inter-frame optical flow, and warped optical flow, are used to train the network models, which reduces the impact of camera motion on the models and makes the video category recognition system more robust to camera motion.
The video category recognition apparatuses of the above embodiments of the present invention are applicable to the training stage of the convolutional neural network models, as well as to the testing stage and subsequent application stages of the models.

Fig. 9 is a schematic structural diagram of yet another embodiment of the video category recognition apparatus of the present invention. As shown in Fig. 9, when the apparatus of the above embodiments is applied in the testing stage or subsequent application stages of the convolutional neural network models, it may further include: a first normalization unit configured to normalize the classification result vector obtained by fusion with a Softmax function, obtaining the vector of probabilities that the video belongs to each category.

Fig. 10 is a schematic structural diagram of a further embodiment of the video category recognition apparatus of the present invention. When the apparatus of the above embodiments is applied in the training stage of the convolutional neural network models, it may further include: a network training unit configured to store a preset initial spatial convolutional neural network and a preset initial temporal convolutional neural network; and, based on each video serving as a sample, to train the initial spatial convolutional neural network using stochastic gradient descent to obtain the final spatial convolutional neural network, and to train the initial temporal convolutional neural network using stochastic gradient descent to obtain the final temporal convolutional neural network.
In one specific example based on the embodiment shown in Fig. 10, when training the initial spatial convolutional neural network using stochastic gradient descent, the network training unit is specifically configured to:

for one video serving as a sample, compare whether the deviation of the spatial classification result of the video obtained by the spatial classification processing unit from the preset standard spatial classification result of the video is less than a preset range;

if the deviation is not less than the preset range, adjust the network parameters of the initial spatial convolutional neural network; with the spatial convolutional neural network whose parameters have been adjusted as the new initial spatial convolutional neural network, perform, for the next video serving as a sample, the operation of comparing the spatial classification result of the video obtained by the spatial classification processing unit with the preset standard spatial classification result of the video;

if the deviation is less than the preset range, take the current initial spatial convolutional neural network as the final spatial convolutional neural network.

In another specific example based on the embodiment shown in Fig. 10, when training the initial temporal convolutional neural network using stochastic gradient descent, the network training unit is specifically configured to:

for one video serving as a sample, compare whether the deviation of the temporal classification result of the video obtained by the temporal classification processing unit from the preset standard temporal classification result of the video is less than a preset range;

if the deviation is not less than the preset range, adjust the network parameters of the initial temporal convolutional neural network; with the temporal convolutional neural network whose parameters have been adjusted as the new initial temporal convolutional neural network, perform, for the next video serving as a sample, the operation of comparing the temporal classification result of the video obtained by the temporal classification processing unit with the preset standard temporal classification result of the video;

if the deviation is less than the preset range, take the current initial temporal convolutional neural network as the final temporal convolutional neural network.

Here, the initial temporal convolutional neural network may include the first initial temporal convolutional neural network or the second initial temporal convolutional neural network; the temporal classification result accordingly includes the first temporal classification result or the second temporal classification result, and the temporal convolutional neural network accordingly includes the first temporal convolutional neural network and the second temporal convolutional neural network.

Further, referring again to Fig. 10, for training the initial spatial convolutional neural network and the initial temporal convolutional neural network, the video category recognition apparatus of the above embodiments may further include: a second normalization unit configured to normalize the spatial classification result of the video with a Softmax function, obtaining the spatial probability vector of the video belonging to each category; and to normalize the temporal classification result of the video with a Softmax function, obtaining the temporal probability vector of the video belonging to each category.

As shown in Fig. 11, which is a concrete application example of the video category recognition apparatus of the present invention, the temporal convolutional neural network therein may specifically be the first temporal convolutional neural network, or may include both the first temporal convolutional neural network and the second temporal convolutional neural network.
In addition, an embodiment of the present invention further provides a data processing apparatus that includes the video category recognition apparatus of any of the above embodiments of the present invention.

The data processing apparatus provided by the above embodiment of the present invention is provided with the video category recognition apparatus of the above embodiments. By dividing a video into multiple segments and separately sampling frame images and inter-frame optical flow for each segment, modeling of long-duration actions can be achieved when training the convolutional neural network, so that when the trained network model is used for video classification, the accuracy of video category recognition is improved over the prior art, the recognition performance is better, and the computational cost is lower.

Specifically, the data processing apparatus of the embodiment of the present invention may be any device with a data processing function, including but not limited to: an Advanced RISC Machine (ARM) processor, a central processing unit (CPU), or a graphics processing unit (GPU).

In addition, an embodiment of the present invention further provides an electronic device, which may be, for example, a mobile terminal, a personal computer (PC), a tablet computer, or a server, and which is provided with the data processing apparatus of any of the above embodiments of the present invention.

The electronic device provided by the above embodiment of the present invention is provided with the data processing apparatus of the above embodiment. By dividing a video into multiple segments and separately sampling frame images and inter-frame optical flow for each segment, modeling of long-duration actions can be achieved when training the convolutional neural network, so that when the trained network model is used for video classification, the accuracy of video category recognition is improved over the prior art, the recognition performance is better, and the computational cost is lower.
Fig. 12 is a schematic structural diagram of one embodiment of the electronic device of the present invention. As shown in Fig. 12, the electronic device for implementing an embodiment of the present invention includes a central processing unit (CPU), which can perform various appropriate actions and processes according to executable instructions stored in a read-only memory (ROM) or executable instructions loaded from a storage section into a random access memory (RAM). The CPU can communicate with the read-only memory and/or the random access memory to execute the executable instructions so as to perform the operations corresponding to the video category recognition method provided by the embodiments of the present invention, for example: segmenting a video to obtain multiple video segments; sampling each of the multiple video segments to obtain the original image and the optical flow image of each segment; processing the original image of each segment with a spatial convolutional neural network to obtain the spatial classification result of each segment; processing the optical flow image of each segment with a temporal convolutional neural network to obtain the temporal classification result of each segment; and fusing the spatial classification results and the temporal classification results to obtain the classification result of the video.

In addition, the RAM may also store various programs and data required for system operation. The CPU, ROM, and RAM are connected to one another through a bus, and an input/output (I/O) interface is also connected to the bus.

The following components are connected to the I/O interface: an input section including a keyboard, a mouse, and the like; an output section including a cathode ray tube (CRT), a liquid crystal display (LCD), a speaker, and the like; a storage section including a hard disk and the like; and a communication section including a network interface card such as a LAN card or a modem. The communication section performs communication processing via a network such as the Internet. A drive is also connected to the I/O interface as needed. A removable medium, such as a magnetic disk, an optical disk, a magneto-optical disk, or a semiconductor memory, is mounted on the drive as needed, so that a computer program read therefrom can be installed into the storage section as needed.

In particular, according to an embodiment of the present disclosure, the process described above with reference to the flowchart may be implemented as a computer software program. For example, an embodiment of the present disclosure includes a computer program product comprising a computer program tangibly embodied on a machine-readable medium; the computer program includes program code for performing the method shown in the flowchart. The program code may include instructions corresponding to the steps of any video classification method provided by the embodiments of the present invention, for example: instructions for segmenting a video to obtain multiple video segments; instructions for sampling each of the multiple video segments to obtain the original image and the optical flow image of each segment; instructions for processing the original image of each segment with a spatial convolutional neural network to obtain the spatial preliminary classification result of each segment, and for processing the optical flow image of each segment with a temporal convolutional neural network to obtain the temporal preliminary classification result of each segment; instructions for aggregating the spatial preliminary classification results of the multiple segments to obtain the spatial classification result of the video, and for aggregating the temporal preliminary classification results of the multiple segments to obtain the temporal classification result of the video; and instructions for fusing the spatial classification result and the temporal classification result to obtain the classification result of the video. The computer program may be downloaded and installed from a network through the communication section, and/or installed from the removable medium. When the computer program is executed by the central processing unit (CPU), the above-described functions defined in the method of the present invention are performed.

An embodiment of the present invention further provides a computer storage medium for storing computer-readable instructions, the instructions including: instructions for segmenting a video to obtain multiple video segments; instructions for sampling each of the multiple video segments to obtain the original image and the optical flow image of each segment; instructions for processing the original image of each segment with a spatial convolutional neural network to obtain the spatial preliminary classification result of each segment, and for processing the optical flow image of each segment with a temporal convolutional neural network to obtain the temporal preliminary classification result of each segment; instructions for aggregating the spatial preliminary classification results of the multiple segments to obtain the spatial classification result of the video, and for aggregating the temporal preliminary classification results of the multiple segments to obtain the temporal classification result of the video; and instructions for fusing the spatial classification result and the temporal classification result to obtain the classification result of the video.
In addition, an embodiment of the present invention further provides a computer device, including:

a memory storing executable instructions; and

one or more processors in communication with the memory to execute the executable instructions so as to perform the operations corresponding to the video category recognition method of any of the above embodiments of the present invention.

The embodiments in this specification are described in a progressive manner; each embodiment focuses on its differences from the other embodiments, and the same or similar parts of the embodiments may be cross-referenced. Since the system embodiments substantially correspond to the method embodiments, their description is relatively brief; for relevant parts, refer to the description of the method embodiments.

The methods, apparatuses, and devices of the present invention may be implemented in many ways, for example by software, hardware, firmware, or any combination of software, hardware, and firmware. The above order of the steps of the methods is for illustration only; the steps of the methods of the present invention are not limited to the order described above unless otherwise specified. Furthermore, in some embodiments, the present invention may also be embodied as programs recorded in a recording medium, the programs including machine-readable instructions for implementing the methods according to the present invention. Thus, the present invention also covers recording media storing programs for performing the methods according to the present invention.

The description of the present invention is given for the sake of example and description and is not exhaustive or intended to limit the present invention to the disclosed forms. Many modifications and variations will be apparent to those of ordinary skill in the art. The embodiments were chosen and described in order to better illustrate the principles and practical applications of the present invention, and to enable those of ordinary skill in the art to understand the present invention so as to design various embodiments with various modifications suited to particular uses.

Claims (10)

1. A video category recognition method, characterized by comprising:

segmenting a video to obtain multiple video segments;

sampling each of the multiple video segments to obtain the original image and the optical flow image of each video segment;

processing the original image of each video segment with a spatial convolutional neural network to obtain the spatial classification result of the video; and processing the optical flow image of each video segment with a temporal convolutional neural network to obtain the temporal classification result of the video;

fusing the spatial classification result and the temporal classification result to obtain the classification result of the video.
2. The method according to claim 1, characterized in that the segmenting the video comprises:

segmenting the video evenly to obtain multiple video segments of equal length.

3. The method according to claim 1 or 2, characterized in that the obtaining the original image of each video segment comprises:

randomly selecting one frame from each video segment as the original image of that video segment.

4. The method according to claim 1 or 2, characterized in that the obtaining the optical flow image of each video segment comprises:

randomly selecting consecutive frames from each video segment to obtain the optical flow image of each video segment.

5. The method according to claim 4, characterized in that the optical flow image is a grayscale image based on an 8-bit bitmap with 256 discrete levels, the mid-value of the grayscale image being 128.

6. The method according to claim 4 or 5, characterized in that the randomly selecting consecutive frames from each video segment to obtain the optical flow image of each video segment comprises:

for each video segment: randomly selecting N consecutive frames from the video segment, wherein N is an integer greater than 1; and

computing on each pair of adjacent frames among the N frames to obtain N-1 groups of optical flow images, each group of the N-1 groups of optical flow images comprising one horizontal optical flow image and one vertical optical flow image.

7. The method according to any one of claims 1 to 6, characterized in that the processing the original image of each video segment with the spatial convolutional neural network to obtain the spatial classification result of the video comprises:

processing the original image of each video segment with the spatial convolutional neural network to obtain the spatial preliminary classification result of each video segment;

aggregating the spatial preliminary classification results of the multiple video segments with a spatial consensus function to obtain the spatial classification result of the video;

and/or

the processing the optical flow image of each video segment with the temporal convolutional neural network to obtain the temporal classification result of the video comprises:

processing the optical flow image of each video segment with the temporal convolutional neural network to obtain the temporal preliminary classification result of each video segment;

aggregating the temporal preliminary classification results of the multiple video segments with a temporal consensus function to obtain the temporal classification result of the video.
8. A video category recognition apparatus, characterized by comprising:

a segmentation unit configured to segment a video to obtain multiple video segments;

a sampling unit configured to sample each of the multiple video segments to obtain the original image and the optical flow image of each video segment;

a spatial classification processing unit configured to process the original image of each video segment with a spatial convolutional neural network to obtain the spatial classification result of the video;

a temporal classification processing unit configured to process the optical flow image of each video segment with a temporal convolutional neural network to obtain the temporal classification result of each video segment;

a fusion unit configured to fuse the spatial classification result and the temporal classification result to obtain the classification result of the video.

9. A data processing apparatus, characterized by comprising the video category recognition apparatus according to claim 8.

10. An electronic device, characterized by being provided with the data processing apparatus according to claim 9.
CN201611030170.XA 2016-07-29 2016-11-15 The recognition methods of video classification and device, data processing equipment and electronic equipment Active CN106599789B (en)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN2016106196541 2016-07-29
CN201610619654 2016-07-29

Publications (2)

Publication Number Publication Date
CN106599789A true CN106599789A (en) 2017-04-26
CN106599789B CN106599789B (en) 2019-10-11

Family

ID=58592577

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201611030170.XA Active CN106599789B (en) 2016-07-29 2016-11-15 The recognition methods of video classification and device, data processing equipment and electronic equipment

Country Status (2)

Country Link
CN (1) CN106599789B (en)
WO (1) WO2018019126A1 (en)

Families Citing this family (19)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109120932B (en) * 2018-07-12 2021-10-26 东华大学 Video saliency prediction method based on dual-SVM model in HEVC compressed domain
US11200424B2 (en) * 2018-10-12 2021-12-14 Adobe Inc. Space-time memory network for locating target object in video content
CN111753574A (en) * 2019-03-26 2020-10-09 顺丰科技有限公司 Throwing area positioning method, device, equipment and storage medium
CN112307821A (en) * 2019-07-29 2021-02-02 顺丰科技有限公司 Video stream processing method, device, equipment and storage medium
US11138441B2 (en) * 2019-12-06 2021-10-05 Baidu Usa Llc Video action segmentation by mixed temporal domain adaption
CN111027482B (en) * 2019-12-10 2023-04-14 浩云科技股份有限公司 Behavior analysis method and device based on motion vector segmentation analysis
CN111104553B (en) * 2020-01-07 2023-12-12 中国科学院自动化研究所 Efficient motion-complementary neural network system
CN111783713B (en) * 2020-07-09 2022-12-02 中国科学院自动化研究所 Weakly-supervised temporal action localization method and device based on relational prototype network
CN111951276B (en) * 2020-07-28 2025-03-28 上海联影智能医疗科技有限公司 Image segmentation method, device, computer equipment and storage medium
CN113395542B (en) * 2020-10-26 2022-11-08 腾讯科技(深圳)有限公司 Video generation method and device based on artificial intelligence, computer equipment and medium
CN114756115A (en) * 2020-12-28 2022-07-15 阿里巴巴集团控股有限公司 Interactive control method, apparatus and device
CN112580589A (en) * 2020-12-28 2021-03-30 国网上海市电力公司 Behavior recognition method, medium and equipment based on two-stream method considering unbalanced data
CN112731359B (en) * 2020-12-31 2024-04-09 无锡祥生医疗科技股份有限公司 Method and device for determining speed of ultrasonic probe and storage medium
CN113128354B (en) * 2021-03-26 2022-07-19 中山大学中山眼科中心 Hand washing quality detection method and device
CN112926549B (en) * 2021-04-15 2022-06-24 华中科技大学 Gait recognition method and system based on time domain-space domain feature joint enhancement
CN114373194B (en) * 2022-01-14 2024-11-12 南京邮电大学 Human action recognition method based on keyframe and attention mechanism
CN114861530A (en) * 2022-04-21 2022-08-05 同济大学 ENSO intelligent prediction method, device, equipment and storage medium
CN115830698A (en) * 2022-04-28 2023-03-21 西安理工大学 Target detection and positioning method based on depth optical flow and YOLOv3 space-time fusion
CN118214922B (en) * 2024-05-17 2024-08-30 环球数科集团有限公司 System for capturing video spatial and temporal features using CNN filters

Family Cites Families (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US8345984B2 (en) * 2010-01-28 2013-01-01 Nec Laboratories America, Inc. 3D convolutional neural networks for automatic human action recognition
CN103218831B (en) * 2013-04-21 2015-11-18 北京航空航天大学 Video moving object classification and recognition method based on contour constraint
CN104966104B (en) * 2015-06-30 2018-05-11 山东管理学院 Video classification method based on three-dimensional convolutional neural network
CN105740773B (en) * 2016-01-25 2019-02-01 重庆理工大学 Activity recognition method based on deep learning and multi-scale information
CN106599789B (en) * 2016-07-29 2019-10-11 北京市商汤科技开发有限公司 Video category identification method and device, data processing device and electronic device

Patent Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN102129691A (en) * 2011-03-22 2011-07-20 北京航空航天大学 Video moving object tracking and segmentation method using Snake contour model
CN102289795A (en) * 2011-07-29 2011-12-21 上海交通大学 Method for enhancing video in spatio-temporal mode based on fusion idea
US20130071041A1 (en) * 2011-09-16 2013-03-21 Hailin Jin High-Quality Denoising of an Image Sequence
CN104217214A (en) * 2014-08-21 2014-12-17 广东顺德中山大学卡内基梅隆大学国际联合研究院 RGB-D human behavior recognition method based on configurable convolutional neural network
CN105550699A (en) * 2015-12-08 2016-05-04 北京工业大学 CNN-based video recognition and classification method through spatio-temporal salient information fusion

Cited By (56)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2018019126A1 (en) * 2016-07-29 2018-02-01 北京市商汤科技开发有限公司 Video category identification method and device, data processing device and electronic apparatus
CN107330362A (en) * 2017-05-25 2017-11-07 北京大学 Video classification method based on space-time attention
CN107330362B (en) * 2017-05-25 2020-10-09 北京大学 Video classification method based on space-time attention
CN107463949A (en) * 2017-07-14 2017-12-12 北京协同创新研究院 Processing method and device for video action classification
CN107463949B (en) * 2017-07-14 2020-02-21 北京协同创新研究院 A processing method and device for video action classification
CN108229290A (en) * 2017-07-26 2018-06-29 北京市商汤科技开发有限公司 Video object segmentation method and device, electronic equipment, storage medium and program
US11222211B2 (en) 2017-07-26 2022-01-11 Beijing Sensetime Technology Development Co., Ltd Method and apparatus for segmenting video object, electronic device, and storage medium
CN107943849A (en) * 2017-11-03 2018-04-20 小草数语(北京)科技有限公司 Video file retrieval method and device
CN107943849B (en) * 2017-11-03 2020-05-08 绿湾网络科技有限公司 Video file retrieval method and device
CN108010538B (en) * 2017-12-22 2021-08-24 北京奇虎科技有限公司 Audio data processing method and device, and computing device
CN108010538A (en) * 2017-12-22 2018-05-08 北京奇虎科技有限公司 Audio data processing method and device, computing device
CN108230413A (en) * 2018-01-23 2018-06-29 北京市商汤科技开发有限公司 Image description method and device, electronic equipment, computer storage medium and program
CN108230413B (en) * 2018-01-23 2021-07-06 北京市商汤科技开发有限公司 Image description method and device, electronic equipment and computer storage medium
CN108171222B (en) * 2018-02-11 2020-08-25 清华大学 A real-time video classification method and device based on multi-stream neural network
CN108171222A (en) * 2018-02-11 2018-06-15 清华大学 Real-time video classification method and device based on multi-stream neural network
CN110321761B (en) * 2018-03-29 2022-02-11 中国科学院深圳先进技术研究院 Behavior recognition method, terminal device and computer-readable storage medium
CN110321761A (en) * 2018-03-29 2019-10-11 中国科学院深圳先进技术研究院 Behavior recognition method, terminal device and computer-readable storage medium
CN108764084A (en) * 2018-05-17 2018-11-06 西安电子科技大学 Video classification method based on fusion of spatial-domain and temporal-domain classification networks
CN108764084B (en) * 2018-05-17 2021-07-27 西安电子科技大学 Video classification method based on fusion of spatial classification network and temporal classification network
CN110598504A (en) * 2018-06-12 2019-12-20 北京市商汤科技开发有限公司 Image recognition method and device, electronic equipment and storage medium
CN110598504B (en) * 2018-06-12 2023-07-21 北京市商汤科技开发有限公司 Image recognition method and device, electronic equipment and storage medium
CN109271840A (en) * 2018-07-25 2019-01-25 西安电子科技大学 Video gesture classification method
CN109325430B (en) * 2018-09-11 2021-08-20 苏州飞搜科技有限公司 Real-time behavior identification method and system
CN109325430A (en) * 2018-09-11 2019-02-12 北京飞搜科技有限公司 Real-time behavior recognition method and system
CN109325435A (en) * 2018-09-15 2019-02-12 天津大学 Video Action Recognition and Localization Algorithm Based on Cascaded Neural Network
CN109325435B (en) * 2018-09-15 2022-04-19 天津大学 Video action recognition and localization method based on cascaded neural network
CN109376603A (en) * 2018-09-25 2019-02-22 北京周同科技有限公司 Video identification method, device, computer equipment and storage medium
CN109657546A (en) * 2018-11-12 2019-04-19 平安科技(深圳)有限公司 Neural-network-based video behavior recognition method and terminal device
WO2020108023A1 (en) * 2018-11-28 2020-06-04 北京达佳互联信息技术有限公司 Video motion classification method, apparatus, computer device, and storage medium
CN109726765A (en) * 2019-01-02 2019-05-07 京东方科技集团股份有限公司 Sample extraction method and device for video classification problems
CN109740670A (en) * 2019-01-02 2019-05-10 京东方科技集团股份有限公司 Video classification method and device
US11055535B2 (en) 2019-01-02 2021-07-06 Boe Technology Group Co., Ltd. Method and device for video classification
US11210522B2 (en) 2019-01-02 2021-12-28 Boe Technology Group Co., Ltd. Sample extraction method and device targeting video classification problem
CN109886165A (en) * 2019-01-23 2019-06-14 中国科学院重庆绿色智能技术研究院 An Action Video Extraction and Classification Method Based on Moving Object Detection
WO2020155713A1 (en) * 2019-01-29 2020-08-06 北京市商汤科技开发有限公司 Image processing method and device, and network training method and device
US11113536B2 (en) 2019-03-15 2021-09-07 Boe Technology Group Co., Ltd. Video identification method, video identification device, and storage medium
CN110020639B (en) * 2019-04-18 2021-07-23 北京奇艺世纪科技有限公司 Video feature extraction method and related equipment
CN110020639A (en) * 2019-04-18 2019-07-16 北京奇艺世纪科技有限公司 Video feature extraction method and related equipment
CN111820947A (en) * 2019-04-19 2020-10-27 无锡祥生医疗科技股份有限公司 Automatic ultrasonic cardiac regurgitation capture method and system, and ultrasonic imaging equipment
CN111820947B (en) * 2019-04-19 2023-08-29 无锡祥生医疗科技股份有限公司 Automatic ultrasonic cardiac regurgitation capture method and system, and ultrasonic imaging equipment
CN110062248B (en) * 2019-04-30 2021-09-28 广州酷狗计算机科技有限公司 Method and device for recommending live broadcast room
CN110062248A (en) * 2019-04-30 2019-07-26 广州酷狗计算机科技有限公司 Method and apparatus for recommending live broadcast rooms
CN112288345A (en) * 2019-07-25 2021-01-29 顺丰科技有限公司 Method and device for detecting loading and unloading port state, server and storage medium
CN110602527A (en) * 2019-09-12 2019-12-20 北京小米移动软件有限公司 Video processing method, device and storage medium
US11288514B2 (en) 2019-09-12 2022-03-29 Beijing Xiaomi Mobile Software Co., Ltd. Video processing method and device, and storage medium
CN111125405A (en) * 2019-12-19 2020-05-08 国网冀北电力有限公司信息通信分公司 Power monitoring image abnormality detection method and device, electronic device and storage medium
CN111898458A (en) * 2020-07-07 2020-11-06 中国传媒大学 Violent video recognition method based on bimodal task learning with attention mechanism
CN111898458B (en) * 2020-07-07 2024-07-12 中国传媒大学 Violent video identification method for bimodal task learning based on attention mechanism
CN111860353A (en) * 2020-07-23 2020-10-30 北京以萨技术股份有限公司 Video behavior prediction method, device and medium based on dual-stream neural network
CN113139467A (en) * 2021-04-23 2021-07-20 西安交通大学 Hierarchical structure-based fine-grained video action identification method
CN113395537A (en) * 2021-06-16 2021-09-14 北京百度网讯科技有限公司 Method and device for recommending live broadcast room
CN113395537B (en) * 2021-06-16 2023-05-16 北京百度网讯科技有限公司 Method and device for recommending live broadcasting room
CN113870040B (en) * 2021-09-07 2024-05-21 天津大学 Dual-stream graph convolutional network microblog topic detection method fusing different propagation modes
CN113870040A (en) * 2021-09-07 2021-12-31 天津大学 Dual-stream graph convolutional network microblog topic detection method fusing different propagation modes
CN114987551A (en) * 2022-06-27 2022-09-02 吉林大学 Lane departure early warning method based on dual-stream convolutional neural network
CN116645917A (en) * 2023-06-09 2023-08-25 浙江技加智能科技有限公司 LED display brightness adjustment system and method thereof

Also Published As

Publication number Publication date
WO2018019126A1 (en) 2018-02-01
CN106599789B (en) 2019-10-11

Similar Documents

Publication Publication Date Title
CN106599789A (en) Video class identification method and device, data processing device and electronic device
Baldassarre et al. Deep koalarization: Image colorization using CNNs and Inception-ResNet-v2
US10929649B2 (en) Multi-pose face feature point detection method based on cascade regression
CN111445488B (en) A Weakly Supervised Learning Approach to Automatically Identify and Segment Salt Bodies
CN105701508B (en) Global-local optimal model and saliency detection algorithm based on multistage convolutional neural networks
CN106548192B (en) Neural-network-based image processing method, device and electronic equipment
CN103984959B (en) Data- and task-driven image classification method
CN108108751B (en) Scene recognition method based on convolution multi-feature and deep random forest
CN109858466A (en) Face keypoint detection method and device based on convolutional neural networks
CN112101344B (en) Video text tracking method and device
CN108681695A (en) Video action recognition method and device, electronic equipment and storage medium
CN106504233A (en) Faster R-CNN-based power component recognition method and system for UAV inspection images
CN111126115B (en) Violent sorting behavior identification method and device
CN109657612B (en) Quality sorting system based on facial image features and application method thereof
CN107683469A (en) Product classification method and device based on deep learning
WO2022152009A1 (en) Target detection method and apparatus, and device and storage medium
CN111368660A (en) A single-stage semi-supervised image human object detection method
CN104866868A (en) Metal coin identification method based on deep neural network and apparatus thereof
CN109918971A (en) Method and device for detecting the number of people in surveillance video
CN110543848B (en) Driver action recognition method and device based on three-dimensional convolutional neural network
CN109472193A (en) Face detection method and device
CN112418032A (en) Human behavior recognition method and device, electronic equipment and storage medium
CN112364791B (en) Pedestrian re-identification method and system based on generation of confrontation network
CN112613579A (en) Model training method and evaluation method for human face or human head image quality and selection method for high-quality image
CN108961358A (en) Method, apparatus and electronic equipment for obtaining sample pictures

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant