
CN106599789A - Video class identification method and device, data processing device and electronic device - Google Patents


Info

Publication number
CN106599789A
CN106599789A (application CN201611030170.XA)
Authority
CN
China
Prior art keywords
video
segmenting
time domain
classification
spatial domain
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN201611030170.XA
Other languages
Chinese (zh)
Other versions
CN106599789B (en)
Inventor
Tang Xiaoou (汤晓鸥)
Wang Limin (王利民)
Xiong Yuanjun (熊元骏)
Wang Zhe (王喆)
Qiao Yu (乔宇)
Lin Dahua (林达华)
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing Sensetime Technology Development Co Ltd
Original Assignee
Beijing Sensetime Technology Development Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beijing Sensetime Technology Development Co Ltd
Publication of CN106599789A
Application granted
Publication of CN106599789B
Legal status: Active (current)
Anticipated expiration


Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V 20/00 Scenes; Scene-specific elements
    • G06V 20/40 Scenes; Scene-specific elements in video content
    • G06V 20/41 Higher-level, semantic clustering, classification or understanding of video scenes, e.g. detection, labelling or Markovian modelling of sport events or news items
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 18/00 Pattern recognition
    • G06F 18/20 Analysing
    • G06F 18/22 Matching criteria, e.g. proximity measures
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 18/00 Pattern recognition
    • G06F 18/20 Analysing
    • G06F 18/24 Classification techniques
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V 20/00 Scenes; Scene-specific elements
    • G06V 20/40 Scenes; Scene-specific elements in video content
    • G06V 20/41 Higher-level, semantic clustering, classification or understanding of video scenes, e.g. detection, labelling or Markovian modelling of sport events or news items
    • G06V 20/42 Higher-level, semantic clustering, classification or understanding of video scenes, e.g. detection, labelling or Markovian modelling of sport events or news items of sport video content

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • Multimedia (AREA)
  • Software Systems (AREA)
  • Computational Linguistics (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Evolutionary Computation (AREA)
  • General Engineering & Computer Science (AREA)
  • Evolutionary Biology (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Artificial Intelligence (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Image Analysis (AREA)

Abstract

An embodiment of the invention discloses a video category recognition method and apparatus, a data processing apparatus, and an electronic device. The method comprises the following steps: segmenting a video to obtain a plurality of video segments; sampling each of the plurality of video segments to obtain an original image and an optical-flow image of each segment; processing the original image of each segment with a spatial-domain convolutional neural network to obtain a spatial-domain classification result of the video; processing the optical-flow image of each segment with a time-domain convolutional neural network to obtain a time-domain classification result of the video; and fusing the spatial-domain classification result with the time-domain classification result to obtain the classification result of the video. The method and apparatus can improve the accuracy of video category recognition.

Description

Video category recognition method and apparatus, data processing apparatus, and electronic device
Technical field
The invention belongs to the technical field of computer vision, and in particular relates to a video category recognition method and apparatus, a data processing apparatus, and an electronic device.
Background art
Action recognition is a popular direction in computer vision research. Action recognition technology processes videos composed of sequences of color images in order to identify the actions they contain. Its difficulty lies in processing dynamically changing video content so as to correctly identify the actions despite changes in distance, viewing angle, camera movement, and scene.
At present, conventional action recognition techniques mainly use hand-designed feature descriptors together with classifiers such as support vector machines. The most representative method uses improved dense trajectory descriptors as features and a support vector machine classifier for action recognition. Because hand-designed feature descriptors cannot automatically improve their feature representation during training, such methods usually fail to achieve a satisfactory recognition accuracy.
In recent years, with the rapid development of deep learning, and in particular its application in the field of computer vision, action recognition based on deep learning has gradually become mainstream. These deep-learning methods mainly use convolutional neural networks to process the video and thereby identify the actions in it.
Summary of the invention
Embodiments of the present invention provide a video category recognition scheme.
According to one aspect of the embodiments of the present invention, a video category recognition method is provided, including:
segmenting a video to obtain a plurality of video segments;
sampling each of the plurality of video segments to obtain an original image and an optical-flow image of each segment;
processing the original image of each segment with a spatial-domain convolutional neural network to obtain a spatial-domain classification result of the video, and processing the optical-flow image of each segment with a time-domain convolutional neural network to obtain a time-domain classification result of the video; and
fusing the spatial-domain classification result with the time-domain classification result to obtain the classification result of the video.
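Before the individual embodiments are enumerated, the claimed flow can be summarized in code. The following is a minimal Python/NumPy sketch under stated assumptions: `spatial_net` and `temporal_net` are stand-in callables returning per-category score vectors, `flow_fn` is a hypothetical helper turning consecutive frames into the temporal network's input (one possible version is sketched under the optical-flow sampling embodiment below), and the average consensus and 1:1.5 fusion weights are just one of the claimed choices, not the only ones.

```python
import numpy as np

def recognize_video(frames, spatial_net, temporal_net, flow_fn,
                    num_segments=3, weights=(1.0, 1.5)):
    """Sketch of the claimed pipeline: segment, sample, classify per segment, fuse."""
    # 1. Even segmentation into equal-length segments
    #    (assumes len(frames) >= 6 * num_segments).
    segments = np.array_split(np.arange(len(frames)), num_segments)

    spatial_scores, temporal_scores = [], []
    for seg in segments:
        # 2. Sample one random RGB frame plus 6 consecutive frames for optical flow.
        rgb = frames[np.random.choice(seg)]
        start = int(np.random.randint(seg[0], seg[-1] - 4))
        flow_input = flow_fn(frames[start:start + 6])

        # 3. Preliminary per-segment classification by each stream.
        spatial_scores.append(spatial_net(rgb))
        temporal_scores.append(temporal_net(flow_input))

    # 4. Consensus across segments (average shown; max and weighted
    #    average are also claimed variants).
    spatial_result = np.mean(spatial_scores, axis=0)
    temporal_result = np.mean(temporal_scores, axis=0)

    # 5. Weighted fusion of the two streams (1:1.5 in one claimed embodiment).
    return weights[0] * spatial_result + weights[1] * temporal_result
```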
In another embodiment based on the above method, segmenting the video includes:
segmenting the video evenly to obtain a plurality of video segments of equal length.
In another embodiment based on the above method, obtaining the original image of each video segment includes:
randomly selecting one frame from each video segment as the original image of that segment.
In another embodiment based on the above method, obtaining the optical-flow image of each video segment includes:
randomly selecting a plurality of consecutive frames from each video segment to obtain the optical-flow images of that segment.
In another embodiment based on the above method, the optical-flow image is a grayscale image based on an 8-bit bitmap with 256 discrete levels, the middle value of the grayscale image being 128.
In another embodiment based on the above method, randomly selecting consecutive frames from each video segment to obtain its optical-flow images includes:
for each video segment: randomly selecting N consecutive frames from the segment, where N is an integer greater than 1; and
computing one group of optical-flow images from each pair of adjacent frames among the N frames, yielding N-1 groups of optical-flow images, where each group includes one horizontal optical-flow image and one vertical optical-flow image.
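As one concrete reading of this sampling scheme, the sketch below computes the N-1 horizontal/vertical flow pairs from N consecutive frames with OpenCV's Farneback algorithm (an assumed choice; the patent does not name a flow method) and discretizes them into the 8-bit, 256-level grayscale encoding centered at 128 described in the preceding embodiment. The clipping bound of ±20 pixels is likewise an assumption.

```python
import cv2
import numpy as np

def compute_flow_stack(frames, bound=20.0):
    """From N consecutive BGR frames, build N-1 (horizontal, vertical) flow
    pairs encoded as 8-bit grayscale images with 128 as the zero-motion value."""
    gray = [cv2.cvtColor(f, cv2.COLOR_BGR2GRAY) for f in frames]
    channels = []
    for prev, nxt in zip(gray[:-1], gray[1:]):
        flow = cv2.calcOpticalFlowFarneback(prev, nxt, None,
                                            0.5, 3, 15, 3, 5, 1.2, 0)  # (H, W, 2)
        for axis in range(2):  # 0: horizontal (x), 1: vertical (y)
            comp = np.clip(flow[..., axis], -bound, bound)
            # Map [-bound, bound] onto [1, 255] so zero motion encodes as 128.
            channels.append(np.round(comp / bound * 127.0 + 128.0).astype(np.uint8))
    return np.stack(channels)  # e.g. 6 frames -> 10 channels
```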
In another embodiment based on the above method, processing the original image of each video segment with the spatial-domain convolutional neural network to obtain the spatial-domain classification result of the video includes:
processing the original image of each segment with the spatial-domain convolutional neural network to obtain a preliminary spatial-domain classification result of each segment; and
aggregating the preliminary spatial-domain classification results of the plurality of segments with a spatial-domain consensus function to obtain the spatial-domain classification result of the video;
and/or
processing the optical-flow image of each video segment with the time-domain convolutional neural network to obtain the time-domain classification result of the video includes:
processing the optical-flow image of each segment with the time-domain convolutional neural network to obtain a preliminary time-domain classification result of each segment; and
aggregating the preliminary time-domain classification results of the plurality of segments with a time-domain consensus function to obtain the time-domain classification result of the video.
In another embodiment based on the above method, the spatial-domain consensus function and/or the time-domain consensus function includes: an average function, a maximum function, or a weighted-average function.
In another embodiment based on the above method, the method further includes:
selecting, as the spatial-domain consensus function, whichever of the average, maximum, and weighted-average functions achieves the highest classification accuracy on a validation set; and/or
selecting, as the time-domain consensus function, whichever of the average, maximum, and weighted-average functions achieves the highest classification accuracy on a validation set.
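A sketch of the three candidate consensus functions, assuming the per-segment preliminary results are NumPy score vectors and that the weighted-average weights were learned as model parameters during training (the function and variable names are illustrative):

```python
import numpy as np

def consensus(segment_scores, kind="average", weights=None):
    """Aggregate per-segment score vectors (shape: segments x categories)
    into one video-level score vector, category by category."""
    s = np.asarray(segment_scores)
    if kind == "average":            # mean of each category's scores
        return s.mean(axis=0)
    if kind == "max":                # maximum of each category's scores
        return s.max(axis=0)
    if kind == "weighted_average":   # one shared weight per segment, learned in training
        w = np.asarray(weights, dtype=float)
        return (w[:, None] * s).sum(axis=0) / w.sum()
    raise ValueError(kind)

# Selection rule from the embodiment above: keep whichever variant scores
# highest on a held-out validation set.
```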
In another embodiment based on the above method, the preliminary spatial-domain classification result and the preliminary time-domain classification result are each a classification score vector whose dimension equals the number of classification categories;
the spatial-domain classification result and the time-domain classification result of the video are each a classification score vector whose dimension equals the number of classification categories; and
the classification result of the video is a classification score vector whose dimension equals the number of classification categories.
In another embodiment based on the above method, fusing the spatial-domain classification result with the time-domain classification result includes:
multiplying the spatial-domain classification result and the time-domain classification result by their respective preset weight coefficients and summing them to obtain the classification result of the video.
In another embodiment based on the above method, the ratio of the weight coefficients between the spatial-domain classification result and the time-domain classification result is 1:1.5.
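A one-line sketch of this weighted fusion; the same helper covers the 1:1.5 spatial/temporal ratio here and the 1:1:0.5 three-stream ratio of the warped-flow embodiment described later (the names are illustrative):

```python
import numpy as np

def fuse(results, weights):
    """Weighted sum of stream score vectors, e.g. weights (1.0, 1.5) for
    spatial + temporal, or (1.0, 1.0, 0.5) when a warped-flow stream is added."""
    return sum(w * np.asarray(r) for w, r in zip(weights, results))
```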
In another embodiment based on the above method, the optical-flow image is specifically an original optical-flow image, and the time-domain convolutional neural network is specifically a first time-domain convolutional neural network;
the first time-domain convolutional neural network processes the original optical-flow image of each video segment to obtain a first preliminary time-domain classification result of each segment; and
a first time-domain consensus function aggregates the first preliminary time-domain classification results of the plurality of segments to obtain a first time-domain classification result of the video.
In another embodiment based on the above method, the method further includes:
obtaining a warped optical-flow image by warping the original optical-flow image;
processing the warped optical-flow image of each video segment with a second time-domain convolutional neural network to obtain a second preliminary time-domain classification result of each segment; and
aggregating the second preliminary time-domain classification results of the plurality of segments with a second time-domain consensus function to obtain a second time-domain classification result of the video;
fusing the spatial-domain classification result with the time-domain classification results then includes: fusing the spatial-domain classification result, the first time-domain classification result, and the second time-domain classification result to obtain the classification result of the video.
In another embodiment based on the above method, obtaining the warped optical-flow image includes:
computing, for each pair of adjacent frames, the homography matrix between the two frames;
applying, for each pair of adjacent frames, an affine transformation to the latter frame according to the homography matrix between the two frames; and
computing the optical flow between the former frame and the transformed latter frame of each pair to obtain the warped optical-flow image.
In another embodiment based on the above method, the computation over each pair of adjacent frames includes matching feature points between the frames using speeded-up robust features (SURF) feature-point descriptors.
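A sketch of this warped-flow computation with OpenCV, under stated assumptions: SURF lives in the opencv-contrib `xfeatures2d` module and may be unavailable in some builds; Farneback flow stands in for whichever optical-flow algorithm an implementation actually uses; and a full perspective warp is applied from the homography, where the patent's wording speaks of an affine transformation derived from it.

```python
import cv2
import numpy as np

def warped_flow(prev_bgr, next_bgr):
    """Estimate camera motion between two frames via SURF matching and a
    homography, warp the latter frame to cancel it, then compute optical flow."""
    prev = cv2.cvtColor(prev_bgr, cv2.COLOR_BGR2GRAY)
    nxt = cv2.cvtColor(next_bgr, cv2.COLOR_BGR2GRAY)

    surf = cv2.xfeatures2d.SURF_create()       # opencv-contrib only
    kp1, des1 = surf.detectAndCompute(prev, None)
    kp2, des2 = surf.detectAndCompute(nxt, None)

    matches = cv2.BFMatcher(cv2.NORM_L2).match(des1, des2)
    src = np.float32([kp2[m.trainIdx].pt for m in matches]).reshape(-1, 1, 2)
    dst = np.float32([kp1[m.queryIdx].pt for m in matches]).reshape(-1, 1, 2)

    # Homography mapping the latter frame onto the former frame's viewpoint
    # (needs at least 4 matches).
    H, _ = cv2.findHomography(src, dst, cv2.RANSAC)
    h, w = prev.shape
    nxt_warped = cv2.warpPerspective(nxt, H, (w, h))

    # Flow between the former frame and the warped latter frame.
    return cv2.calcOpticalFlowFarneback(prev, nxt_warped, None,
                                        0.5, 3, 15, 3, 5, 1.2, 0)
```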
In another embodiment based on the above method, fusing the spatial-domain classification result, the first time-domain classification result, and the second time-domain classification result includes:
multiplying the spatial-domain classification result, the first time-domain classification result, and the second time-domain classification result by their respective preset weight coefficients and summing them to obtain the classification result of the video.
In another embodiment based on the above method, the ratio of the weight coefficients among the spatial-domain classification result, the first time-domain classification result, and the second time-domain classification result is 1:1:0.5.
In another embodiment based on the above method, the classification result of the video is a classification score vector whose dimension equals the number of classification categories;
the method further includes:
normalizing the classification score vector of the video with a Softmax function to obtain the probability vector of the video belonging to each category.
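A numerically stable sketch of this Softmax normalization (a standard formulation, not specific to the patent):

```python
import numpy as np

def softmax(scores):
    """Turn a per-category score vector into a probability vector."""
    z = np.asarray(scores, dtype=float)
    z = z - z.max()          # subtract the max for numerical stability
    e = np.exp(z)
    return e / e.sum()
```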
In another embodiment based on the above method, the method further includes:
presetting an initial spatial-domain convolutional neural network and an initial time-domain convolutional neural network; and
training, based on each video serving as a sample, the initial spatial-domain convolutional neural network with stochastic gradient descent to obtain the spatial-domain convolutional neural network, and training the initial time-domain convolutional neural network with stochastic gradient descent to obtain the time-domain convolutional neural network.
In another embodiment based on the above method, training the initial spatial-domain convolutional neural network with stochastic gradient descent to obtain the spatial-domain convolutional neural network includes:
for one video serving as a sample, performing the operations from segmenting the video onward until the spatial-domain classification result of the video is obtained;
comparing whether the deviation of the spatial-domain classification result of the video from the preset standard spatial-domain classification result of the video is below a preset range;
if it is not below the preset range, adjusting the network parameters of the initial spatial-domain convolutional neural network, taking the network with adjusted parameters as the initial spatial-domain convolutional neural network, and performing the operations from segmenting the video onward for the next video serving as a sample; and
if it is below the preset range, taking the current initial spatial-domain convolutional neural network as the spatial-domain convolutional neural network.
In another embodiment based on the above method, training the initial time-domain convolutional neural network with stochastic gradient descent to obtain the time-domain convolutional neural network includes:
for one video serving as a sample, performing the operations from segmenting the video onward until the time-domain classification result of the video is obtained;
comparing whether the deviation of the time-domain classification result of the video from the preset standard time-domain classification result of the video is below a preset range;
if it is not below the preset range, adjusting the network parameters of the initial time-domain convolutional neural network, taking the network with adjusted parameters as the initial time-domain convolutional neural network, and performing the operations from segmenting the video onward for the next video serving as a sample; and
if it is below the preset range, taking the current initial time-domain convolutional neural network as the time-domain convolutional neural network;
where the initial time-domain convolutional neural network includes the first or the second initial time-domain convolutional neural network, the time-domain classification result correspondingly includes the first or the second time-domain classification result, and the time-domain convolutional neural network correspondingly includes the first and the second time-domain convolutional neural networks.
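Read in standard deep-learning terms, the loop described above is SGD with a loss threshold as the stopping criterion. A minimal PyTorch-flavored sketch under assumed choices (the cross-entropy loss, learning rate, momentum, and threshold are illustrative; the patent does not specify them):

```python
import torch

def train_stream(net, sample_loader, epsilon=1e-2, lr=1e-3):
    """Sketch of the described SGD training for either stream: run the
    segmented pipeline on a sample video, compare against its standard
    (ground-truth) result, and adjust parameters until the deviation
    falls below a preset range."""
    opt = torch.optim.SGD(net.parameters(), lr=lr, momentum=0.9)
    criterion = torch.nn.CrossEntropyLoss()
    for inputs, label in sample_loader:   # one sample video per step
        scores = net(inputs)              # segment -> preliminary results -> consensus
        loss = criterion(scores, label)
        if loss.item() < epsilon:         # deviation below the preset range: done
            return net
        opt.zero_grad()
        loss.backward()                   # adjust the network parameters
        opt.step()
    return net
```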
In another embodiment based on the above method, the method further includes:
normalizing the spatial-domain classification result of the video with a Softmax function to obtain the spatial-domain probability vector of the video belonging to each category, and normalizing the time-domain classification result of the video with a Softmax function to obtain the time-domain probability vector of the video belonging to each category.
According to another aspect of the embodiments of the present invention, a video category recognition apparatus is provided, including:
a segmentation unit, configured to segment a video to obtain a plurality of video segments;
a sampling unit, configured to sample each of the plurality of video segments to obtain an original image and an optical-flow image of each segment;
a spatial-domain classification unit, configured to process the original image of each segment with a spatial-domain convolutional neural network to obtain a spatial-domain classification result of the video;
a time-domain classification unit, configured to process the optical-flow image of each segment with a time-domain convolutional neural network to obtain a time-domain classification result of the video; and
a fusion unit, configured to fuse the spatial-domain classification result with the time-domain classification result to obtain the classification result of the video.
In another embodiment based on the above apparatus, the segmentation unit is specifically configured to segment the video evenly to obtain a plurality of video segments of equal length.
In another embodiment based on the above apparatus, the sampling unit includes:
an image sampling module, configured to randomly select one frame from each video segment as the original image of that segment; and
an optical-flow sampling module, configured to randomly select a plurality of consecutive frames from each video segment to obtain the optical-flow images of that segment.
In another embodiment based on the above apparatus, the optical-flow image is a grayscale image based on an 8-bit bitmap with 256 discrete levels, the middle value of the grayscale image being 128.
In another embodiment based on the above apparatus, the optical-flow sampling module is specifically configured to:
for each video segment, randomly select N consecutive frames, where N is an integer greater than 1, and compute one group of optical-flow images from each pair of adjacent frames among the N frames, yielding N-1 groups of optical-flow images, where each group includes one horizontal optical-flow image and one vertical optical-flow image.
In another embodiment based on the above apparatus, the spatial-domain classification unit includes:
a spatial-domain classification module, configured to process the original image of each segment with the spatial-domain convolutional neural network to obtain a preliminary spatial-domain classification result of each segment; and
a first aggregation module, configured to aggregate the preliminary spatial-domain classification results of the plurality of segments with a spatial-domain consensus function to obtain the spatial-domain classification result of the video;
and the time-domain classification unit includes:
a first time-domain classification module, configured to process the optical-flow image of each segment with the time-domain convolutional neural network to obtain a preliminary time-domain classification result of each segment; and
a second aggregation module, configured to aggregate the preliminary time-domain classification results of the plurality of segments with a time-domain consensus function to obtain the time-domain classification result of the video.
In another embodiment based on the above apparatus, the spatial-domain consensus function and/or the time-domain consensus function includes: an average function, a maximum function, or a weighted-average function.
In another embodiment based on the above apparatus, the spatial-domain consensus function is specifically whichever of the average, maximum, and weighted-average functions achieves the highest classification accuracy on a validation set; and
the time-domain consensus function is specifically whichever of the average, maximum, and weighted-average functions achieves the highest classification accuracy on a validation set.
In another embodiment based on the above apparatus, the preliminary spatial-domain classification result and the preliminary time-domain classification result are each a classification score vector whose dimension equals the number of classification categories;
the spatial-domain classification result and the time-domain classification result of the video are each a classification score vector whose dimension equals the number of classification categories; and
the classification result of the video is a classification score vector whose dimension equals the number of classification categories.
In another embodiment based on the above apparatus, the fusion unit is specifically configured to multiply the spatial-domain classification result and the time-domain classification result by their respective preset weight coefficients and sum them to obtain the classification result of the video.
In another embodiment based on the above apparatus, the ratio of the weight coefficients between the spatial-domain classification result and the time-domain classification result is 1:1.5.
In another embodiment based on the above apparatus, the optical-flow image is specifically an original optical-flow image, and the time-domain convolutional neural network is specifically a first time-domain convolutional neural network;
the first time-domain classification module is specifically configured to process the original optical-flow image of each segment with the first time-domain convolutional neural network to obtain a first preliminary time-domain classification result of each segment; and
the second aggregation module is specifically configured to aggregate the first preliminary time-domain classification results of the plurality of segments with a first time-domain consensus function to obtain a first time-domain classification result of the video.
In another embodiment based on the above apparatus, the apparatus further includes:
an optical-flow processing unit, configured to obtain a warped optical-flow image by warping the original optical-flow image;
and the time-domain classification unit further includes:
a second time-domain classification module, configured to process the warped optical-flow image of each segment with a second time-domain convolutional neural network to obtain a second preliminary time-domain classification result of each segment; and
a third aggregation module, configured to aggregate the second preliminary time-domain classification results of the plurality of segments to obtain a second time-domain classification result of the video;
and the fusion unit is specifically configured to fuse the spatial-domain classification result, the first time-domain classification result, and the second time-domain classification result to obtain the classification result of the video.
In another embodiment based on the above apparatus, the optical-flow processing unit is specifically configured to:
compute, for each pair of adjacent frames, the homography matrix between the two frames;
apply, for each pair of adjacent frames, an affine transformation to the latter frame according to the homography matrix between the two frames; and
compute the optical flow between the former frame and the transformed latter frame of each pair to obtain the warped optical-flow image.
In another embodiment based on the above apparatus, when computing over each pair of adjacent frames, the optical-flow processing unit is specifically configured to match feature points between the frames using speeded-up robust features (SURF) feature-point descriptors.
In another embodiment based on the above apparatus, the fusion unit is specifically configured to multiply the spatial-domain classification result, the first time-domain classification result, and the second time-domain classification result by their respective preset weight coefficients and sum them to obtain the classification result of the video.
In another embodiment based on the above apparatus, the ratio of the weight coefficients among the spatial-domain classification result, the first time-domain classification result, and the second time-domain classification result is 1:1:0.5.
In another embodiment based on the above apparatus, the apparatus further includes:
a first normalization unit, configured to normalize the classification score vector of the video with a Softmax function to obtain the probability vector of the video belonging to each category.
In another embodiment based on the above apparatus, the apparatus further includes:
a network training unit, configured to store the preset initial spatial-domain convolutional neural network and the preset initial time-domain convolutional neural network, and, based on each video serving as a sample, to train the initial spatial-domain convolutional neural network with stochastic gradient descent to obtain the spatial-domain convolutional neural network, and to train the initial time-domain convolutional neural network with stochastic gradient descent to obtain the time-domain convolutional neural network.
In another embodiment based on the above apparatus, when training the initial spatial-domain convolutional neural network with stochastic gradient descent, the network training unit is specifically configured to:
for one video serving as a sample, compare whether the spatial-domain classification result of the video obtained by the spatial-domain classification unit matches the preset standard spatial-domain classification result of the video;
if they differ, adjust the network parameters of the initial spatial-domain convolutional neural network, take the network with adjusted parameters as the initial spatial-domain convolutional neural network, and repeat the comparison for the next video serving as a sample; and
if they match, take the current initial spatial-domain convolutional neural network as the spatial-domain convolutional neural network.
In another embodiment based on the above apparatus, when training the initial time-domain convolutional neural network with stochastic gradient descent, the network training unit is specifically configured to:
for one video serving as a sample, compare whether the time-domain classification result of the video obtained by the time-domain classification unit matches the preset standard time-domain classification result of the video;
if they differ, adjust the network parameters of the initial time-domain convolutional neural network, take the network with adjusted parameters as the initial time-domain convolutional neural network, and repeat the comparison for the next video serving as a sample; and
if they match, take the current initial time-domain convolutional neural network as the time-domain convolutional neural network;
where the initial time-domain convolutional neural network includes the first or the second initial time-domain convolutional neural network, the time-domain classification result correspondingly includes the first or the second time-domain classification result, and the time-domain convolutional neural network correspondingly includes the first and the second time-domain convolutional neural networks.
In another embodiment based on the above apparatus, the apparatus further includes:
a second normalization unit, configured to normalize the spatial-domain classification result of the video with a Softmax function to obtain the spatial-domain probability vector of the video belonging to each category, and to normalize the time-domain classification result of the video with a Softmax function to obtain the time-domain probability vector of the video belonging to each category.
According to yet another aspect of the embodiments of the present invention, a data processing apparatus is provided, including the video category recognition apparatus of any of the above embodiments.
In another embodiment based on the above data processing apparatus, the data processing apparatus includes an advanced RISC machine (ARM), a central processing unit (CPU), or a graphics processing unit (GPU).
According to yet another aspect of the embodiments of the present invention, an electronic device is provided, equipped with the data processing apparatus of any of the above embodiments.
According to yet another aspect of the embodiments of the present invention, a computer storage medium is provided for storing computer-readable instructions, the instructions including: instructions for segmenting a video to obtain a plurality of video segments; instructions for sampling each of the plurality of video segments to obtain an original image and an optical-flow image of each segment; instructions for processing the original image of each segment with a spatial-domain convolutional neural network to obtain a spatial-domain classification result of the video, and for processing the optical-flow image of each segment with a time-domain convolutional neural network to obtain a time-domain classification result of the video; and instructions for fusing the spatial-domain classification result with the time-domain classification result to obtain the classification result of the video.
According to yet another aspect of the embodiments of the present invention, a computer device is provided, including:
a memory storing executable instructions; and
one or more processors communicating with the memory to execute the executable instructions so as to perform the operations corresponding to the video category recognition method of any of the above embodiments of the invention.
The video category recognition method and apparatus, data processing apparatus, and electronic device provided by the above embodiments of the invention segment a video to obtain a plurality of video segments; sample each segment to obtain its original image and optical-flow image; process the original image of each segment with a spatial-domain convolutional neural network to obtain the spatial-domain classification result of the video; process the optical-flow image of each segment with a time-domain convolutional neural network to obtain the time-domain classification result of the video; and finally fuse the two results to obtain the classification result of the video. By dividing the video into multiple segments and sampling frame images and inter-frame optical flow from each segment, the embodiments of the invention can model long-duration actions when training the convolutional neural networks, so that when the trained network models are later used for video classification, the accuracy of video category recognition is improved over the prior art, the recognition effect is improved, and the computational cost is low.
Description of the drawings
The accompanying drawings, which constitute a part of the specification, illustrate embodiments of the invention and, together with the description, serve to explain the principles of the invention.
The invention can be understood more clearly from the following detailed description with reference to the accompanying drawings, in which:
Fig. 1 is a flowchart of one embodiment of the video category recognition method of the invention.
Fig. 2 is a flowchart of another embodiment of the video category recognition method of the invention.
Fig. 3 is a flowchart of yet another embodiment of the video category recognition method of the invention.
Fig. 4 is a flowchart of one embodiment of training the initial spatial-domain convolutional neural network in an embodiment of the invention.
Fig. 5 is a flowchart of one embodiment of training the initial time-domain convolutional neural network in an embodiment of the invention.
Fig. 6 is a schematic structural diagram of one embodiment of the video category recognition apparatus of the invention.
Fig. 7 is a schematic structural diagram of another embodiment of the video category recognition apparatus of the invention.
Fig. 8 is a schematic structural diagram of yet another embodiment of the video category recognition apparatus of the invention.
Fig. 9 is a schematic structural diagram of yet another embodiment of the video category recognition apparatus of the invention.
Fig. 10 is a schematic structural diagram of a further embodiment of the video category recognition apparatus of the invention.
Fig. 11 is a schematic diagram of an application example of the video category recognition apparatus of the invention.
Fig. 12 is a schematic structural diagram of one embodiment of the electronic device of the invention.
Detailed description
Various exemplary embodiments of the invention will now be described in detail with reference to the accompanying drawings. It should be noted that, unless otherwise specified, the relative arrangement of the components, the numerical expressions, and the values set forth in these embodiments do not limit the scope of the invention.
It should also be understood that, for ease of description, the sizes of the various parts shown in the drawings are not drawn according to their actual proportional relationships.
The following description of at least one exemplary embodiment is merely illustrative and in no way limits the invention or its application or use.
Techniques, methods, and devices known to those of ordinary skill in the relevant art may not be discussed in detail, but where appropriate, such techniques, methods, and devices should be considered part of the specification.
It should be noted that similar reference numerals and letters denote similar items in the following drawings; therefore, once an item is defined in one drawing, it need not be further discussed in subsequent drawings.
Embodiments of the invention may be applied to computer systems/servers, which can operate together with numerous other general-purpose or special-purpose computing system environments or configurations. Examples of well-known computing systems, environments, and/or configurations suitable for use with computer systems/servers include, but are not limited to: personal computer systems, server computer systems, thin clients, thick clients, handheld or laptop devices, microprocessor-based systems, set-top boxes, programmable consumer electronics, network PCs, minicomputer systems, mainframe computer systems, distributed cloud computing environments including any of the above systems, and the like.
Computer systems/servers may be described in the general context of computer-system-executable instructions (such as program modules) executed by a computer system. Generally, program modules may include routines, programs, target programs, components, logic, data structures, and so on, which perform specific tasks or implement specific abstract data types. Computer systems/servers may also be implemented in distributed cloud computing environments, where tasks are performed by remote processing devices linked through a communication network. In distributed cloud computing environments, program modules may be located on local or remote computing system storage media including storage devices.
Among action recognition techniques based on deep learning, the two-stream convolutional neural network (Two-Stream Convolutional Neural Network) is a representative network model. It uses two convolutional neural networks, a spatial-domain network and a time-domain network, to model frame images and inter-frame optical flow respectively, and identifies the action in the video by fusing the classification results of the two networks.
However, in the course of implementation the inventors found that, although two-stream convolutional neural networks can model frame images and inter-frame optical flow, i.e., short-term motion information, they lack the ability to model long-duration actions, so the accuracy of action recognition cannot be guaranteed.
Fig. 1 is a flowchart of one embodiment of the video category recognition method of the invention. As shown in Fig. 1, the method of this embodiment includes:
102: segmenting the video to obtain a plurality of video segments.
As a specific example, the video may be segmented evenly into a plurality of segments of equal length, for example into 3 or 5 equal-length segments, the specific number of segments being determined according to the actual effect. Alternatively, the video may be segmented randomly, or several sections may be extracted from the video to serve as the plurality of segments.
In a specific implementation, after a video is received, its length can be obtained, the length of each segment can be determined from the video length and a preset number of segments, and the video can be divided accordingly into equal-length segments.
When the video is segmented evenly, every segment has the same length. This simplifies the training process when training convolutional neural network models on long videos, and, because the recognition time needed for each segment is similar, improves the overall efficiency of video category recognition when the trained networks are used.
104: sampling each of the plurality of video segments to obtain the original image and the optical-flow images of each segment.
Exemplarily, the original image of each segment may be obtained by randomly selecting one frame from the segment.
Exemplarily, the optical-flow images of each segment may be obtained by randomly selecting a plurality of consecutive frames from the segment.
In one specific example of the embodiments of the invention, the optical-flow image may be a grayscale image based on an 8-bit bitmap with 256 discrete levels, the middle value of the grayscale image being 128.
Because an optical-flow field is a vector field, representing an optical-flow image as grayscale images requires two scalar-field images, corresponding respectively to the magnitudes of the flow along the X-axis and the Y-axis of the image coordinate system.
Specifically, randomly selecting consecutive frames from each segment and obtaining its optical-flow images can be implemented as follows, for each segment:
randomly selecting N consecutive frames from the segment, where N is an integer greater than 1; and
computing one group of optical-flow images from each pair of adjacent frames among the N frames, yielding N-1 groups of optical-flow images, where each group includes one horizontal optical-flow image and one vertical optical-flow image.
For example, for each segment, 6 consecutive frames may be randomly selected, and the optical flow between each pair of adjacent frames among the 6 frames computed, yielding 5 groups of grayscale optical-flow images. Each group includes one horizontal and one vertical grayscale optical-flow image, giving 10 grayscale optical-flow frames in total, which can serve as a 10-channel image.
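Reusing the hypothetical `compute_flow_stack` helper sketched earlier, the 6-frame example above can be shape-checked as follows (the 224x224 frame size is an assumption for illustration):

```python
import numpy as np

# 6 consecutive frames -> 5 adjacent pairs -> 5 x (horizontal + vertical) = 10 channels.
frames = [np.zeros((224, 224, 3), np.uint8) for _ in range(6)]  # dummy BGR frames
stack = compute_flow_stack(frames)  # hypothetical helper from the earlier sketch
assert stack.shape == (10, 224, 224)  # 10-channel input to the time-domain network
```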
106: processing the original image of each segment with the spatial-domain convolutional neural network to obtain the spatial-domain classification result of the video, and processing the optical-flow images of each segment with the time-domain convolutional neural network to obtain the time-domain classification result of the video.
Here, the spatial-domain classification result and the time-domain classification result of the video are each a classification score vector whose dimension equals the number of classification categories. For example, if the categories are running, high jump, race walking, pole vault, long jump, and triple jump, 6 categories in total, then the spatial-domain and time-domain classification results are each 6-dimensional classification score vectors.
108: fusing the spatial-domain classification result with the time-domain classification result to obtain the classification result of the video.
Here, the classification result of the video is a classification score vector whose dimension equals the number of classification categories; with the 6 categories above, it is a 6-dimensional classification score vector.
As a specific example, the fusion may consist of multiplying the spatial-domain and time-domain classification results by their respective preset weight coefficients and summing them to obtain the classification result of the video. The weight coefficients are determined according to the classification accuracy of the corresponding network models on a validation set, the model with the higher accuracy receiving the higher weight. The validation set consists of videos labeled with their true categories that did not participate in network training, and can be obtained in any feasible way, for example by searching for videos of the respective categories with a search engine.
For example, in one concrete application, the ratio of the weight coefficients between the spatial-domain classification result and the time-domain classification result may be 1:1.5.
With the video category recognition method provided by the above embodiment of the invention, the video is segmented into a plurality of segments; each segment is sampled to obtain its original image and optical-flow image; the original image of each segment is processed with a spatial-domain convolutional neural network to obtain the spatial-domain classification result of the video; the optical-flow image of each segment is processed with a time-domain convolutional neural network to obtain the time-domain classification result of the video; and finally the two results are fused to obtain the classification result of the video. By dividing the video into multiple segments and sampling frame images and inter-frame optical flow from each segment, this embodiment can model long-duration actions when training the convolutional neural networks, so that when the trained network models are later used for classification, the accuracy and effect of video category recognition are improved over the prior art at a low computational cost.
Fig. 2 is a flowchart of another embodiment of the video category recognition method of the invention. As shown in Fig. 2, the method of this embodiment includes:
202: segmenting the video to obtain a plurality of video segments.
As a specific example, the video may be segmented evenly into a plurality of equal-length segments, which simplifies the training of the convolutional neural network models and improves the overall efficiency of video category recognition; for example into 3 or 5 equal-length segments, the specific number being determined according to the actual effect. Alternatively, the video may be segmented randomly, or several sections may be extracted from it as the plurality of segments. As shown in Fig. 11, in one application embodiment of the video category recognition method of the invention, the video is divided into 3 segments.
204: sampling each of the plurality of video segments to obtain the original image and the optical-flow images of each segment.
For example, one frame may be randomly selected from each segment as its original image, and a plurality of consecutive frames may be randomly selected from each segment to obtain its optical-flow images.
As shown in Fig. 11, in one application embodiment of the method, the 3 segments are each sampled to obtain one original frame image and the inter-frame optical-flow images of each segment, where the original image is an RGB color image and the optical-flow images are grayscale images.
206: processing the original image of each segment with the spatial-domain convolutional neural network to obtain a preliminary spatial-domain classification result of each segment, and processing the optical-flow images of each segment with the time-domain convolutional neural network to obtain a preliminary time-domain classification result of each segment.
Here, the preliminary spatial-domain and time-domain classification results are each a classification score vector whose dimension equals the number of classification categories. For example, with the 6 categories of running, high jump, race walking, pole vault, long jump, and triple jump, the preliminary spatial-domain and time-domain classification results are each 6-dimensional classification score vectors.
As shown in Fig. 11, in one application embodiment, the spatial-domain convolutional neural network processes the original images of the 3 segments to obtain 3 preliminary spatial-domain classification results, and the time-domain convolutional neural network processes the optical-flow images of the 3 segments to obtain 3 preliminary time-domain classification results. In a specific implementation, the spatial-domain and/or time-domain convolutional neural network may first pass the input through a combination of convolutional layers, nonlinear layers, pooling layers, and so on, to obtain a feature representation of the image, and then pass it through a linear classification layer to obtain the score of each category, i.e., the preliminary classification result of each segment. For example, with the 6 categories above, the preliminary spatial-domain and time-domain classification results of each segment are each 6-dimensional vectors containing the video's classification scores for those 6 categories.
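A toy PyTorch sketch of the structure just described (the layer sizes and depths are illustrative assumptions; the patent does not fix an architecture): convolutional, nonlinear, and pooling layers produce a feature representation, and a linear classification layer outputs one score per category.

```python
import torch
import torch.nn as nn

class StreamNet(nn.Module):
    """Toy two-stream-style backbone: conv + nonlinearity + pooling -> features,
    then a linear classification layer giving one score per category."""
    def __init__(self, in_channels=3, num_categories=6):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(in_channels, 32, kernel_size=3, padding=1), nn.ReLU(),
            nn.MaxPool2d(2),
            nn.Conv2d(32, 64, kernel_size=3, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1),
        )
        self.classifier = nn.Linear(64, num_categories)

    def forward(self, x):
        f = self.features(x).flatten(1)   # feature representation of the image
        return self.classifier(f)         # per-category classification scores

spatial_net = StreamNet(in_channels=3)    # RGB frame input
temporal_net = StreamNet(in_channels=10)  # stacked optical flow (5 pairs x 2 channels)
```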
208: aggregating the preliminary spatial-domain classification results of the plurality of segments with the spatial-domain consensus function to obtain the spatial-domain classification result of the video, and aggregating the preliminary time-domain classification results of the plurality of segments with the time-domain consensus function to obtain the time-domain classification result of the video.
Here, the spatial-domain classification result and the time-domain classification result of the video are each a classification score vector whose dimension equals the number of classification categories.
In a specific implementation, the spatial-domain consensus function and/or the time-domain consensus function includes an average function, a maximum function, or a weighted-average function; specifically, whichever of the three achieves the highest classification accuracy on a validation set is chosen as the spatial-domain consensus function, and likewise as the time-domain consensus function.
Specifically, the average function takes the average of the scores of a given category across the different segments as the output score of that category; the maximum function takes the maximum of the scores of a given category across the different segments as the output score of that category; and the weighted-average function takes a weighted average of the scores of a given category across the different segments as the output score of that category, where all categories share one set of weights, which are obtained by optimization as network model parameters during training.
For example, in the application embodiment shown in Fig. 11, the average function may be chosen as both the spatial-domain and the time-domain consensus function. The spatial-domain consensus function computes, for each category, the average of the 3 scores of that category in the 3 preliminary spatial-domain classification results, yielding one set of category scores as the spatial-domain classification result of the video; the time-domain consensus function computes, for each category, the average of the 3 scores of that category in the 3 preliminary time-domain classification results, yielding one set of category scores as the time-domain classification result of the video. For example, with the 6 categories above, the spatial-domain and time-domain classification results of the video are each 6-dimensional vectors containing the video's category scores for those 6 categories.
210, spatial domain classification results and time domain result are carried out with fusion treatment, the classification results of video are obtained.
Wherein, the classification results of video are the classification results vector that dimension is equal to class categories quantity.
As shown in Fig. 11, in an application example of the video category recognition method of the present invention, the spatial classification result and the temporal classification result of the video are multiplied by weight coefficients in a 1:1.5 ratio and then summed to obtain the classification result of the video. For example, with the 6 categories running, high jump, footrace, pole vault, long jump, and triple jump, the classification result of the video is a 6-dimensional vector containing the scores of the video for these 6 categories. The category with the highest score is the category of the video; in this example the highest-scoring category is high jump, so the video is recognized as belonging to the high jump category.
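For illustration, a minimal sketch of the weighted fusion and final category selection (PyTorch assumed; the score values are made up):

```python
import torch

CATEGORIES = ["running", "high jump", "footrace", "pole vault", "long jump", "triple jump"]

def fuse(spatial: torch.Tensor, temporal: torch.Tensor,
         w_spatial: float = 1.0, w_temporal: float = 1.5) -> torch.Tensor:
    # Multiply each stream's result by its preset weight coefficient, then sum.
    return w_spatial * spatial + w_temporal * temporal

spatial_result = torch.tensor([0.1, 2.0, 0.3, 0.5, 0.2, 0.1])
temporal_result = torch.tensor([0.2, 1.8, 0.1, 0.4, 0.3, 0.2])
video_result = fuse(spatial_result, temporal_result)
print(CATEGORIES[video_result.argmax().item()])  # -> "high jump"

# With a warped-flow stream (the Fig. 3 embodiment below), the same idea extends
# to three results with a weight ratio of 1 : 1 : 0.5.
```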
According to the video category recognition method provided by the above embodiments of the present invention, a consensus function is applied across the video segments: the preliminary classification results of the segments are combined by the consensus function to obtain the classification result of the video. Because the consensus function places no restriction on the convolutional neural network model applied to each segment, the multiple video segments can share the parameters of one network model, keeping the number of model parameters small, so that a network model with fewer parameters can recognize the category of a video of arbitrary length. During training, a video of arbitrary length is segmented and the segment-based network is trained; supervised learning is performed by comparing the classification result of the whole video with its ground-truth label, achieving video-level training supervision that is not constrained by video length.
Fig. 3 is a flowchart of another embodiment of the video category recognition method of the present invention. As shown in Fig. 3, the method of this embodiment includes:

At 302, the video is segmented to obtain multiple video segments.

At 304, each of the multiple video segments is sampled to obtain the original image and the original optical flow image of each segment.

At 306, warped optical flow images corresponding to the original optical flow images are obtained.
In a specific implementation, obtaining the warped optical flow images includes: computing, for each pair of adjacent frames, the homography transformation matrix between the two frames; applying, for each pair of adjacent frames, an affine transformation to the latter frame according to the corresponding homography transformation matrix; and computing, for each pair of adjacent frames, the optical flow between the former frame and the transformed latter frame to obtain the warped optical flow image.

Because there is no longer a homography transformation between the feature points of the latter frame after the above affine transformation and the corresponding feature points of the former frame serving as the reference, using the warped optical flow computed between the former frame and the transformed latter frame as input for video category recognition reduces the influence of camera motion on the recognition result.

Specifically, the computation on each pair of adjacent frames includes: performing inter-frame feature point matching according to Speeded-Up Robust Features (SURF) feature point descriptors.
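The warping step might be sketched as follows with OpenCV. This is a sketch under stated assumptions: ORB stands in for SURF (whose implementation sits in the non-free opencv-contrib module), and Farneback flow stands in for whichever optical flow algorithm the embodiment actually uses.

```python
import cv2
import numpy as np

def warped_flow(prev_gray: np.ndarray, next_gray: np.ndarray) -> np.ndarray:
    # 1) Inter-frame feature point matching (ORB standing in for SURF).
    detector = cv2.ORB_create()
    kp1, des1 = detector.detectAndCompute(prev_gray, None)
    kp2, des2 = detector.detectAndCompute(next_gray, None)
    matches = cv2.BFMatcher(cv2.NORM_HAMMING, crossCheck=True).match(des1, des2)
    src = np.float32([kp2[m.trainIdx].pt for m in matches]).reshape(-1, 1, 2)
    dst = np.float32([kp1[m.queryIdx].pt for m in matches]).reshape(-1, 1, 2)
    # 2) Homography transformation matrix between the adjacent frames.
    H, _ = cv2.findHomography(src, dst, cv2.RANSAC, 5.0)
    # 3) Transform the latter frame onto the former frame's viewpoint.
    h, w = prev_gray.shape
    warped_next = cv2.warpPerspective(next_gray, H, (w, h))
    # 4) Optical flow between the former frame and the transformed latter frame.
    return cv2.calcOpticalFlowFarneback(prev_gray, warped_next, None,
                                        0.5, 3, 15, 3, 5, 1.2, 0)
```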
At 308, a spatial convolutional neural network is used to process the original image of each video segment, obtaining the spatial preliminary classification result of each segment; a first temporal convolutional neural network is used to process the original optical flow image of each segment, obtaining the first temporal preliminary classification result of each segment; and a second temporal convolutional neural network is used to process the warped optical flow image of each segment, obtaining the second temporal preliminary classification result of each segment.

At 310, the spatial preliminary classification results of the multiple video segments are aggregated using a spatial consensus function to obtain the spatial classification result of the video; the first temporal preliminary classification results of the multiple video segments are aggregated using a first temporal consensus function to obtain the first temporal classification result of the video; and the second temporal preliminary classification results of the multiple video segments are aggregated using a second temporal consensus function to obtain the second temporal classification result of the video.

At 312, the spatial classification result, the first temporal classification result, and the second temporal classification result are fused to obtain the classification result of the video.

As a specific example, fusing the spatial classification result, the first temporal classification result, and the second temporal classification result includes: multiplying the spatial classification result, the first temporal classification result, and the second temporal classification result by preset weight coefficients, respectively, and summing the products to obtain the classification result of the video. The weight coefficients are determined according to the classification accuracy of the corresponding network models on a validation dataset; a network model with higher classification accuracy receives a higher weight.

For example, in a specific application, the weight coefficient ratio between the spatial classification result, the first temporal classification result, and the second temporal classification result may be 1:1:0.5.
Widely used two-stream convolutional neural networks use optical flow images as the short-term motion representation but do not account for camera motion when extracting the optical flow; when camera motion is large, this can make actions in the video unrecognizable and degrade recognition performance.

According to the video category recognition method provided by the above embodiments of the present invention, in addition to frame images and inter-frame optical flow, warped optical flow is used as an additional short-term motion representation, expanding the input of video category recognition to three kinds of information: frame images, inter-frame optical flow, and warped optical flow. Because warped optical flow eliminates the influence of camera motion, the impact of camera motion on video category recognition performance is reduced. During training, the same three kinds of input information, i.e., frame images, inter-frame optical flow, and warped optical flow, are used to train the network models, which reduces the impact of camera motion on the models and makes the video category recognition system more robust to camera motion.

The video category recognition methods of the above embodiments of the present invention are applicable to the training stage of the convolutional neural network models, as well as to the testing stage and subsequent application stages of the models.
In another embodiment of the video category recognition method of the present invention, when the methods of the above embodiments are applied in the testing stage or subsequent application stages of the convolutional neural network models, after the classification result of the video is obtained in operation 108, 210, or 312, the classification result vector obtained by the fusion may be normalized with a Softmax function to obtain the vector of probabilities that the video belongs to each category.
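As a sketch (PyTorch assumed), this normalization is a single softmax over the fused score vector:

```python
import torch

def to_class_probabilities(fused_scores: torch.Tensor) -> torch.Tensor:
    # Normalize the fused classification result vector into per-category
    # probabilities that sum to 1.
    return torch.softmax(fused_scores, dim=0)

probs = to_class_probabilities(torch.tensor([0.4, 4.7, 0.55, 1.1, 0.65, 0.4]))
```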
In yet another embodiment of the video category recognition method of the present invention, when the methods of the above embodiments are applied in the training stage of the convolutional neural network models, the following operations may also be included:

presetting an initial spatial convolutional neural network and an initial temporal convolutional neural network;

based on each video serving as a sample, training the initial spatial convolutional neural network using stochastic gradient descent (SGD) to obtain the spatial convolutional neural network of the above embodiments; and training the initial temporal convolutional neural network using stochastic gradient descent to obtain the temporal convolutional neural network of the above embodiments.

Here, each video serving as a sample is labeled in advance with standard spatial classification result information.

Stochastic gradient descent updates the initial network model iteratively with each individual sample; training the initial spatial convolutional neural network and the initial temporal convolutional neural network with stochastic gradient descent is fast and improves network training efficiency.
Fig. 4 is a flowchart of one embodiment of training the initial spatial convolutional neural network in an embodiment of the present invention. As shown in Fig. 4, the embodiment includes:

At 402, for one video serving as a sample, the operations of the flows shown in the above embodiments of the present invention are performed until the spatial classification result of the video is obtained.

For example, the spatially related operations among operations 102-106, 202-208, or 302-310 are performed to obtain the spatial classification result of the video.

At 404, it is compared whether the deviation of the spatial classification result of the video from the preset standard spatial classification result of the video is less than a preset range.

If the deviation is not less than the preset range, operation 406 is performed. If the deviation is less than the preset range, the training flow for the initial spatial convolutional neural network ends, the current initial spatial convolutional neural network is taken as the final spatial convolutional neural network, and the subsequent operations of this embodiment are not performed.

At 406, the network parameters of the initial spatial convolutional neural network are adjusted.

At 408, with the spatial convolutional neural network whose parameters have been adjusted as the new initial spatial convolutional neural network, operation 402 is performed for the next video serving as a sample.
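A minimal sketch of this per-sample training loop (PyTorch assumed; `spatial_forward`, the loss threshold, and the learning rate are hypothetical stand-ins for the pipeline of operation 402 and the preset range of operation 404):

```python
import torch
import torch.nn as nn

def train_spatial(net: nn.Module, samples, spatial_forward, threshold: float = 1e-2):
    optimizer = torch.optim.SGD(net.parameters(), lr=0.01)  # per-sample SGD updates
    criterion = nn.CrossEntropyLoss()
    for video, label in samples:                 # each video is one labeled sample
        scores = spatial_forward(net, video)     # video-level spatial result (402)
        loss = criterion(scores.unsqueeze(0), torch.tensor([label]))
        if loss.item() < threshold:              # deviation within preset range (404)
            break                                # current net is the final network
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()                         # adjust the network parameters (406)
    return net
```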
Fig. 5 is a flowchart of one embodiment of training the initial temporal convolutional neural network in an embodiment of the present invention. As shown in Fig. 5, the embodiment includes:

At 502, for one video serving as a sample, the operations beginning with segmenting the video are performed until the temporal classification result of the video is obtained.

For example, the temporally related operations among operations 102-106, 202-208, or 302-310 are performed to obtain the temporal classification result of the video.

At 504, it is compared whether the deviation of the temporal classification result of the video from the preset standard temporal classification result of the video is less than a preset range.

If the deviation is not less than the preset range, operation 506 is performed. If the deviation is less than the preset range, the training flow for the initial temporal convolutional neural network ends, the current initial temporal convolutional neural network is taken as the final temporal convolutional neural network, and the subsequent operations of this embodiment are not performed.

At 506, the network parameters of the initial temporal convolutional neural network are adjusted.

At 508, with the temporal convolutional neural network whose parameters have been adjusted as the new initial temporal convolutional neural network, operation 502 is performed for the next video serving as a sample.

Specifically, in the embodiment shown in Fig. 5, the initial temporal convolutional neural network may be the first initial temporal convolutional neural network or the second initial temporal convolutional neural network; the temporal classification result accordingly includes the first temporal classification result or the second temporal classification result, and the temporal convolutional neural network accordingly includes the first temporal convolutional neural network and the second temporal convolutional neural network. That is, the embodiment shown in Fig. 5 can be used to train the first initial temporal convolutional neural network and the second initial temporal convolutional neural network separately or simultaneously.
Further, when training the initial spatial convolutional neural network and the initial temporal convolutional neural network via the embodiments shown in Fig. 4 and Fig. 5, the following operations may also be included:

normalizing the spatial classification result of the video with a Softmax function to obtain the spatial probability vector of the video belonging to each category; and normalizing the temporal classification result of the video with a Softmax function to obtain the temporal probability vector of the video belonging to each category.

Correspondingly, the spatial classification results and temporal classification results shown in Fig. 4 and Fig. 5 may specifically be unnormalized classification results or normalized class probability vectors.
Fig. 6 is a schematic structural diagram of one embodiment of the video category recognition apparatus of the present invention. The apparatus of this embodiment can be used to implement the video category recognition methods of the above embodiments of the present invention. As shown in Fig. 6, the video category recognition apparatus of this embodiment includes: a segmentation unit, a sampling unit, a spatial classification processing unit, a temporal classification processing unit, and a fusion unit. Wherein:

The segmentation unit is configured to segment a video to obtain multiple video segments.

As a specific example, the segmentation unit may be configured to segment the video evenly to obtain multiple video segments of equal length, for example dividing the video into 3 or 5 equal-length segments, the specific number being determined by actual performance. Alternatively, the video may be segmented randomly, or several sections may be extracted from the video as the multiple video segments.

In a specific implementation, after a video is received, its length is obtained; the length of each segment is determined from the video length and a preset number of segments, and the received video is divided accordingly into multiple video segments of equal length, as sketched below.
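A minimal sketch of the even segmentation, together with the random frame sampling performed by the image sampling module described below (plain Python; the function names are illustrative and `frames` is assumed to be the decoded frame list of the video):

```python
import random

def segment(frames, num_segments=3):
    # Split the video into equal-length segments based on its total length
    # and the preset number of segments.
    seg_len = len(frames) // num_segments
    return [frames[i * seg_len:(i + 1) * seg_len] for i in range(num_segments)]

def sample_original_image(segment_frames):
    # Randomly pick one frame from the segment as its original image.
    return random.choice(segment_frames)
```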
The sampling unit is configured to sample each of the multiple video segments to obtain the original image and the optical flow image of each segment.

Exemplarily, the sampling unit may specifically include:

an image sampling module configured to randomly select one frame from each video segment as the original image of that segment; and

an optical flow sampling module configured to randomly select consecutive frames from each video segment to obtain the optical flow image of that segment.

In one specific example of the embodiments of the present invention, the optical flow image may be, for example, a grayscale image based on an 8-bit bitmap with 256 discrete levels, the mid-value of the grayscale image being 128.

Specifically, the optical flow sampling module may be configured, for each video segment, to:

randomly select N consecutive frames from the segment, where N is an integer greater than 1; and

compute on each pair of adjacent frames among the N frames to obtain N-1 groups of optical flow images, each group containing one horizontal optical flow image and one vertical optical flow image.

For example, for each video segment: 6 consecutive frames may be randomly selected from the segment, and optical flow is computed on each pair of adjacent frames among the 6, yielding 5 groups of optical flow grayscale images, each group containing one horizontal and one vertical optical flow grayscale image, for 10 optical flow grayscale images in total, which can serve as a 10-channel input image.
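A minimal sketch of the flow sampling and 8-bit quantization (OpenCV/NumPy assumed; the clipping bound and the use of Farneback flow are illustrative assumptions): 6 frames yield 5 flow pairs, stacked as 10 channels.

```python
import cv2
import numpy as np

def flow_stack(gray_frames, bound=20.0):
    channels = []
    for prev, nxt in zip(gray_frames[:-1], gray_frames[1:]):
        flow = cv2.calcOpticalFlowFarneback(prev, nxt, None,
                                            0.5, 3, 15, 3, 5, 1.2, 0)
        for c in range(2):  # horizontal (x) then vertical (y) flow component
            # Map [-bound, bound] to [0, 255] so zero motion sits near mid-value 128.
            q = np.clip(flow[..., c], -bound, bound)
            channels.append(((q + bound) * (255.0 / (2 * bound))).astype(np.uint8))
    return np.stack(channels, axis=0)  # shape: (2*(N-1), H, W), e.g. (10, H, W)
```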
The spatial classification processing unit is configured to process the original image of each video segment with a spatial convolutional neural network to obtain the spatial classification result of each segment.

Here, the spatial classification result of the video is a classification result vector whose dimension equals the number of classification categories. For example, with the 6 categories running, high jump, footrace, pole vault, long jump, and triple jump, the spatial classification result is a classification result vector of dimension 6.

The temporal classification processing unit is configured to process the optical flow image of each video segment with a temporal convolutional neural network to obtain the temporal classification result of each segment.

Here, the temporal classification result of the video is a classification result vector whose dimension equals the number of classification categories. For example, with the 6 categories above, the temporal classification result is a classification result vector of dimension 6.

The fusion unit is configured to fuse the spatial classification result and the temporal classification result to obtain the classification result of the video.

Here, the classification result of the video is a classification result vector whose dimension equals the number of classification categories. For example, with the 6 categories above, the classification result of the video is a classification result vector of dimension 6.

As a specific example, fusing the spatial classification result and the temporal classification result includes: multiplying the spatial classification result and the temporal classification result by preset weight coefficients, respectively, and summing the products to obtain the classification result of the video. The weight coefficients are determined according to the classification accuracy of the corresponding network models on a validation dataset; a network model with higher classification accuracy receives a higher weight.

For example, in a specific application, the weight coefficient ratio between the spatial classification result and the temporal classification result may be 1:1.5.

According to the video category recognition apparatus provided by the above embodiments of the present invention, a video is segmented into multiple video segments; each segment is sampled to obtain its original image and optical flow image; a spatial convolutional neural network and a temporal convolutional neural network are then used to process the original images and the optical flow images of the segments, respectively, to obtain the spatial classification result and the temporal classification result of each segment; finally, the spatial classification results and the temporal classification results are fused to obtain the classification result of the video. By dividing the video into multiple segments and separately sampling frame images and inter-frame optical flow for each segment, the embodiments of the present invention enable modeling of long-duration actions when training the convolutional neural networks, so that when the trained network models are later used for video classification, the accuracy of video category recognition is improved over the prior art, the recognition performance is better, and the computational cost is lower.
Fig. 7 is a schematic structural diagram of another embodiment of the video category recognition apparatus of the present invention. As shown in Fig. 7, compared with the embodiment shown in Fig. 6, in this embodiment the spatial classification processing unit specifically includes a spatial classification processing module and a first aggregation module. Wherein:

The spatial classification processing module is configured to process the original image of each video segment with a spatial convolutional neural network to obtain the spatial preliminary classification result of each segment.

Here, the spatial preliminary classification result is a classification result vector whose dimension equals the number of classification categories. For example, with the 6 categories running, high jump, footrace, pole vault, long jump, and triple jump, the spatial preliminary classification result of the video is a classification result vector of dimension 6.

The first aggregation module is configured to aggregate the spatial preliminary classification results of the multiple video segments using a spatial consensus function to obtain the spatial classification result of the video.

In a specific implementation, the spatial consensus function includes: an average function, a max function, or a weighted average function, specifically whichever of these achieves the highest classification accuracy on a validation dataset.

Specifically, the average function takes, for each category, the mean of that category's scores across the different segments as the output score for that category; the max function takes, for each category, the maximum of that category's scores across the different segments as the output score; the weighted average function takes, for each category, a weighted mean of that category's scores across the different segments as the output score, where all categories share the same set of weights, which are obtained through optimization as network model parameters during training.

For example, in a specific application, the average function may be chosen as the spatial consensus function, with the video divided into 3 segments. The spatial convolutional neural network then produces 3 groups of category scores, so each category has 3 scores, one from each of the 3 segments; the average function takes the mean of each category's 3 scores as the score of that category, yielding one group of category scores covering all categories.
Referring again to Fig. 7, in yet another embodiment, the temporal classification processing unit specifically includes a first temporal classification processing module and a second aggregation module. Wherein:

The first temporal classification processing module is configured to process the optical flow image of each video segment with a temporal convolutional neural network to obtain the temporal preliminary classification result of each segment.

Here, the temporal preliminary classification result is a classification result vector whose dimension equals the number of classification categories. For example, with the 6 categories above, the temporal preliminary classification result of the video is a classification result vector of dimension 6.

The second aggregation module is configured to aggregate the temporal preliminary classification results of the multiple video segments using a temporal consensus function to obtain the temporal classification result of the video.

In a specific implementation, the temporal consensus function includes: an average function, a max function, or a weighted average function, specifically whichever of these achieves the highest classification accuracy on a validation dataset.

According to the video category recognition apparatus provided by the above embodiments of the present invention, a consensus function is applied across the video segments: the preliminary classification results of the segments are combined by the consensus function to obtain the classification result of the video. Because the consensus function places no restriction on the convolutional neural network model applied to each segment, the multiple video segments can share the parameters of one network model, keeping the number of model parameters small, so that a network model with fewer parameters can recognize the category of a video of arbitrary length. During training, a video of arbitrary length is segmented and the segment-based network is trained; supervised learning is performed by comparing the classification result of the whole video with its ground-truth label, achieving video-level training supervision that is not constrained by video length.
Fig. 8 is a schematic structural diagram of yet another embodiment of the video category recognition apparatus of the present invention. As shown in Fig. 8, compared with the embodiments shown in Fig. 6 and Fig. 7, in this embodiment of the present invention the optical flow image is the original optical flow image, the temporal convolutional neural network is the first temporal convolutional neural network, and the video category recognition apparatus of this embodiment further includes:

an optical flow processing unit configured to obtain the warped optical flow images corresponding to the original optical flow images.

In a specific implementation, the optical flow processing unit is specifically configured to: compute on each pair of adjacent frames to obtain the homography transformation matrix between the two frames; apply, according to the homography transformation matrix of each pair of adjacent frames, an affine transformation to the latter frame of the pair; and compute on the former frame and the transformed latter frame of each pair to obtain the warped optical flow image.

Specifically, when computing on each pair of adjacent frames, the optical flow processing unit performs inter-frame feature point matching according to Speeded-Up Robust Features (SURF) feature point descriptors.

The temporal classification processing unit of this embodiment includes: a first temporal classification processing module, a second aggregation module, a second temporal classification processing module, and a third aggregation module. Wherein:

The first temporal classification processing module is specifically configured to process the original optical flow image of each video segment with the first temporal convolutional neural network to obtain the first temporal preliminary classification result of each segment;

The second aggregation module is specifically configured to aggregate the first temporal preliminary classification results of the multiple video segments using a first temporal consensus function to obtain the first temporal classification result of the video.

The second temporal classification processing module is configured to process the warped optical flow image of each video segment with a second temporal convolutional neural network to obtain the second temporal preliminary classification result of each segment.

The third aggregation module is configured to aggregate the second temporal preliminary classification results of the multiple video segments using a second temporal consensus function to obtain the second temporal classification result of the video.

The fusion unit is specifically configured to fuse the spatial classification result, the first temporal classification result, and the second temporal classification result to obtain the classification result of the video.

As a specific example, the fusion unit is specifically configured to multiply the spatial classification result, the first temporal classification result, and the second temporal classification result by preset weight coefficients, respectively, and sum the products to obtain the classification result of the video. The weight coefficients are determined according to the classification accuracy of the corresponding networks on a validation dataset; a network model with higher classification accuracy receives a higher weight.

For example, in a specific application, the weight coefficient ratio between the spatial classification result, the first temporal classification result, and the second temporal classification result may be 1:1:0.5.

According to the video category recognition apparatus provided by the above embodiments of the present invention, in addition to frame images and inter-frame optical flow, warped optical flow is used as an additional short-term motion representation, expanding the input of video category recognition to three kinds of information: frame images, inter-frame optical flow, and warped optical flow. Because warped optical flow eliminates the influence of camera motion, the impact of camera motion on video category recognition performance is reduced. During training, the same three kinds of input information, i.e., frame images, inter-frame optical flow, and warped optical flow, are used to train the network models, which reduces the impact of camera motion on the models and makes the video category recognition system more robust to camera motion.
The video category recognition apparatuses of the above embodiments of the present invention are applicable to the training stage of the convolutional neural network models, as well as to the testing stage and subsequent application stages of the models.

Fig. 9 is a schematic structural diagram of yet another embodiment of the video category recognition apparatus of the present invention. As shown in Fig. 9, when the apparatus of the above embodiments is applied in the testing stage or subsequent application stages of the convolutional neural network models, it may further include: a first normalization unit configured to normalize the classification result vector obtained by fusion with a Softmax function, obtaining the vector of probabilities that the video belongs to each category.

Fig. 10 is a schematic structural diagram of a further embodiment of the video category recognition apparatus of the present invention. When the apparatus of the above embodiments is applied in the training stage of the convolutional neural network models, it may further include: a network training unit configured to store a preset initial spatial convolutional neural network and a preset initial temporal convolutional neural network; and, based on each video serving as a sample, to train the initial spatial convolutional neural network using stochastic gradient descent to obtain the final spatial convolutional neural network, and to train the initial temporal convolutional neural network using stochastic gradient descent to obtain the final temporal convolutional neural network.
In one specific example based on the embodiment shown in Fig. 10, when training the initial spatial convolutional neural network using stochastic gradient descent, the network training unit is specifically configured to:

for one video serving as a sample, compare whether the deviation of the spatial classification result of the video obtained by the spatial classification processing unit from the preset standard spatial classification result of the video is less than a preset range;

if the deviation is not less than the preset range, adjust the network parameters of the initial spatial convolutional neural network; with the spatial convolutional neural network whose parameters have been adjusted as the new initial spatial convolutional neural network, perform, for the next video serving as a sample, the operation of comparing the spatial classification result of the video obtained by the spatial classification processing unit with the preset standard spatial classification result of the video;

if the deviation is less than the preset range, take the current initial spatial convolutional neural network as the final spatial convolutional neural network.

In another specific example based on the embodiment shown in Fig. 10, when training the initial temporal convolutional neural network using stochastic gradient descent, the network training unit is specifically configured to:

for one video serving as a sample, compare whether the deviation of the temporal classification result of the video obtained by the temporal classification processing unit from the preset standard temporal classification result of the video is less than a preset range;

if the deviation is not less than the preset range, adjust the network parameters of the initial temporal convolutional neural network; with the temporal convolutional neural network whose parameters have been adjusted as the new initial temporal convolutional neural network, perform, for the next video serving as a sample, the operation of comparing the temporal classification result of the video obtained by the temporal classification processing unit with the preset standard temporal classification result of the video;

if the deviation is less than the preset range, take the current initial temporal convolutional neural network as the final temporal convolutional neural network.

Here, the initial temporal convolutional neural network may include the first initial temporal convolutional neural network or the second initial temporal convolutional neural network; the temporal classification result accordingly includes the first temporal classification result or the second temporal classification result, and the temporal convolutional neural network accordingly includes the first temporal convolutional neural network and the second temporal convolutional neural network.

Further, referring again to Fig. 10, for training the initial spatial convolutional neural network and the initial temporal convolutional neural network, the video category recognition apparatus of the above embodiments may further include: a second normalization unit configured to normalize the spatial classification result of the video with a Softmax function, obtaining the spatial probability vector of the video belonging to each category; and to normalize the temporal classification result of the video with a Softmax function, obtaining the temporal probability vector of the video belonging to each category.

As shown in Fig. 11, which is a concrete application example of the video category recognition apparatus of the present invention, the temporal convolutional neural network therein may specifically be the first temporal convolutional neural network, or may include both the first temporal convolutional neural network and the second temporal convolutional neural network.
In addition, an embodiment of the present invention further provides a data processing apparatus that includes the video category recognition apparatus of any of the above embodiments of the present invention.

The data processing apparatus provided by the above embodiment of the present invention is provided with the video category recognition apparatus of the above embodiments. By dividing a video into multiple segments and separately sampling frame images and inter-frame optical flow for each segment, modeling of long-duration actions can be achieved when training the convolutional neural network, so that when the trained network model is used for video classification, the accuracy of video category recognition is improved over the prior art, the recognition performance is better, and the computational cost is lower.

Specifically, the data processing apparatus of the embodiment of the present invention may be any device with a data processing function, including but not limited to: an Advanced RISC Machine (ARM) processor, a central processing unit (CPU), or a graphics processing unit (GPU).

In addition, an embodiment of the present invention further provides an electronic device, which may be, for example, a mobile terminal, a personal computer (PC), a tablet computer, or a server, and which is provided with the data processing apparatus of any of the above embodiments of the present invention.

The electronic device provided by the above embodiment of the present invention is provided with the data processing apparatus of the above embodiment. By dividing a video into multiple segments and separately sampling frame images and inter-frame optical flow for each segment, modeling of long-duration actions can be achieved when training the convolutional neural network, so that when the trained network model is used for video classification, the accuracy of video category recognition is improved over the prior art, the recognition performance is better, and the computational cost is lower.
Fig. 12 is a schematic structural diagram of one embodiment of the electronic device of the present invention. As shown in Fig. 12, the electronic device for implementing an embodiment of the present invention includes a central processing unit (CPU), which can perform various appropriate actions and processes according to executable instructions stored in a read-only memory (ROM) or executable instructions loaded from a storage section into a random access memory (RAM). The CPU can communicate with the read-only memory and/or the random access memory to execute the executable instructions so as to perform the operations corresponding to the video category recognition method provided by the embodiments of the present invention, for example: segmenting a video to obtain multiple video segments; sampling each of the multiple video segments to obtain the original image and the optical flow image of each segment; processing the original image of each segment with a spatial convolutional neural network to obtain the spatial classification result of each segment; processing the optical flow image of each segment with a temporal convolutional neural network to obtain the temporal classification result of each segment; and fusing the spatial classification results and the temporal classification results to obtain the classification result of the video.

In addition, the RAM may also store various programs and data required for system operation. The CPU, ROM, and RAM are connected to one another through a bus, and an input/output (I/O) interface is also connected to the bus.

The following components are connected to the I/O interface: an input section including a keyboard, a mouse, and the like; an output section including a cathode ray tube (CRT), a liquid crystal display (LCD), a speaker, and the like; a storage section including a hard disk and the like; and a communication section including a network interface card such as a LAN card or a modem. The communication section performs communication processing via a network such as the Internet. A drive is also connected to the I/O interface as needed. A removable medium, such as a magnetic disk, an optical disk, a magneto-optical disk, or a semiconductor memory, is mounted on the drive as needed, so that a computer program read therefrom can be installed into the storage section as needed.

In particular, according to an embodiment of the present disclosure, the process described above with reference to the flowchart may be implemented as a computer software program. For example, an embodiment of the present disclosure includes a computer program product comprising a computer program tangibly embodied on a machine-readable medium; the computer program includes program code for performing the method shown in the flowchart. The program code may include instructions corresponding to the steps of any video classification method provided by the embodiments of the present invention, for example: instructions for segmenting a video to obtain multiple video segments; instructions for sampling each of the multiple video segments to obtain the original image and the optical flow image of each segment; instructions for processing the original image of each segment with a spatial convolutional neural network to obtain the spatial preliminary classification result of each segment, and for processing the optical flow image of each segment with a temporal convolutional neural network to obtain the temporal preliminary classification result of each segment; instructions for aggregating the spatial preliminary classification results of the multiple segments to obtain the spatial classification result of the video, and for aggregating the temporal preliminary classification results of the multiple segments to obtain the temporal classification result of the video; and instructions for fusing the spatial classification result and the temporal classification result to obtain the classification result of the video. The computer program may be downloaded and installed from a network through the communication section, and/or installed from the removable medium. When the computer program is executed by the central processing unit (CPU), the above-described functions defined in the method of the present invention are performed.

An embodiment of the present invention further provides a computer storage medium for storing computer-readable instructions, the instructions including: instructions for segmenting a video to obtain multiple video segments; instructions for sampling each of the multiple video segments to obtain the original image and the optical flow image of each segment; instructions for processing the original image of each segment with a spatial convolutional neural network to obtain the spatial preliminary classification result of each segment, and for processing the optical flow image of each segment with a temporal convolutional neural network to obtain the temporal preliminary classification result of each segment; instructions for aggregating the spatial preliminary classification results of the multiple segments to obtain the spatial classification result of the video, and for aggregating the temporal preliminary classification results of the multiple segments to obtain the temporal classification result of the video; and instructions for fusing the spatial classification result and the temporal classification result to obtain the classification result of the video.
In addition, an embodiment of the present invention further provides a computer device, including:

a memory storing executable instructions; and

one or more processors in communication with the memory to execute the executable instructions so as to perform the operations corresponding to the video category recognition method of any of the above embodiments of the present invention.

The embodiments in this specification are described in a progressive manner; each embodiment focuses on its differences from the other embodiments, and the same or similar parts of the embodiments may be cross-referenced. Since the system embodiments substantially correspond to the method embodiments, their description is relatively brief; for relevant parts, refer to the description of the method embodiments.

The methods, apparatuses, and devices of the present invention may be implemented in many ways, for example by software, hardware, firmware, or any combination of software, hardware, and firmware. The above order of the steps of the methods is for illustration only; the steps of the methods of the present invention are not limited to the order described above unless otherwise specified. Furthermore, in some embodiments, the present invention may also be embodied as programs recorded in a recording medium, the programs including machine-readable instructions for implementing the methods according to the present invention. Thus, the present invention also covers recording media storing programs for performing the methods according to the present invention.

The description of the present invention is given for the sake of example and description and is not exhaustive or intended to limit the present invention to the disclosed forms. Many modifications and variations will be apparent to those of ordinary skill in the art. The embodiments were chosen and described in order to better illustrate the principles and practical applications of the present invention, and to enable those of ordinary skill in the art to understand the present invention so as to design various embodiments with various modifications suited to particular uses.

Claims (10)

1. A video category recognition method, characterized by comprising:

segmenting a video to obtain multiple video segments;

sampling each of the multiple video segments to obtain the original image and the optical flow image of each video segment;

processing the original image of each video segment with a spatial convolutional neural network to obtain the spatial classification result of the video; and processing the optical flow image of each video segment with a temporal convolutional neural network to obtain the temporal classification result of the video;

fusing the spatial classification result and the temporal classification result to obtain the classification result of the video.
2. The method according to claim 1, characterized in that the segmenting the video comprises:

segmenting the video evenly to obtain multiple video segments of equal length.

3. The method according to claim 1 or 2, characterized in that the obtaining the original image of each video segment comprises:

randomly selecting one frame from each video segment as the original image of that video segment.

4. The method according to claim 1 or 2, characterized in that the obtaining the optical flow image of each video segment comprises:

randomly selecting consecutive frames from each video segment to obtain the optical flow image of each video segment.

5. The method according to claim 4, characterized in that the optical flow image is a grayscale image based on an 8-bit bitmap with 256 discrete levels, the mid-value of the grayscale image being 128.

6. The method according to claim 4 or 5, characterized in that the randomly selecting consecutive frames from each video segment to obtain the optical flow image of each video segment comprises:

for each video segment: randomly selecting N consecutive frames from the video segment, wherein N is an integer greater than 1; and

computing on each pair of adjacent frames among the N frames to obtain N-1 groups of optical flow images, each group of the N-1 groups of optical flow images comprising one horizontal optical flow image and one vertical optical flow image.

7. The method according to any one of claims 1 to 6, characterized in that the processing the original image of each video segment with the spatial convolutional neural network to obtain the spatial classification result of the video comprises:

processing the original image of each video segment with the spatial convolutional neural network to obtain the spatial preliminary classification result of each video segment;

aggregating the spatial preliminary classification results of the multiple video segments with a spatial consensus function to obtain the spatial classification result of the video;

and/or

the processing the optical flow image of each video segment with the temporal convolutional neural network to obtain the temporal classification result of the video comprises:

processing the optical flow image of each video segment with the temporal convolutional neural network to obtain the temporal preliminary classification result of each video segment;

aggregating the temporal preliminary classification results of the multiple video segments with a temporal consensus function to obtain the temporal classification result of the video.
8. A video category recognition apparatus, characterized by comprising:

a segmentation unit configured to segment a video to obtain multiple video segments;

a sampling unit configured to sample each of the multiple video segments to obtain the original image and the optical flow image of each video segment;

a spatial classification processing unit configured to process the original image of each video segment with a spatial convolutional neural network to obtain the spatial classification result of the video;

a temporal classification processing unit configured to process the optical flow image of each video segment with a temporal convolutional neural network to obtain the temporal classification result of each video segment;

a fusion unit configured to fuse the spatial classification result and the temporal classification result to obtain the classification result of the video.

9. A data processing apparatus, characterized by comprising the video category recognition apparatus according to claim 8.

10. An electronic device, characterized by being provided with the data processing apparatus according to claim 9.
CN201611030170.XA 2016-07-29 2016-11-15 The recognition methods of video classification and device, data processing equipment and electronic equipment Active CN106599789B (en)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN2016106196541 2016-07-29
CN201610619654 2016-07-29

Publications (2)

Publication Number Publication Date
CN106599789A true CN106599789A (en) 2017-04-26
CN106599789B CN106599789B (en) 2019-10-11

Family

ID=58592577

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201611030170.XA Active CN106599789B (en) 2016-07-29 2016-11-15 The recognition methods of video classification and device, data processing equipment and electronic equipment

Country Status (2)

Country Link
CN (1) CN106599789B (en)
WO (1) WO2018019126A1 (en)

Families Citing this family (19)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109120932B (en) * 2018-07-12 2021-10-26 东华大学 Video saliency prediction method based on dual-SVM model in HEVC compressed domain
US11200424B2 (en) * 2018-10-12 2021-12-14 Adobe Inc. Space-time memory network for locating target object in video content
CN111753574A (en) * 2019-03-26 2020-10-09 顺丰科技有限公司 Throwing area positioning method, device, equipment and storage medium
CN112307821A (en) * 2019-07-29 2021-02-02 顺丰科技有限公司 Video stream processing method, device, equipment and storage medium
US11138441B2 (en) * 2019-12-06 2021-10-05 Baidu Usa Llc Video action segmentation by mixed temporal domain adaption
CN111027482B (en) * 2019-12-10 2023-04-14 浩云科技股份有限公司 Behavior analysis method and device based on motion vector segmentation analysis
CN111104553B (en) * 2020-01-07 2023-12-12 中国科学院自动化研究所 Efficient motion-complementary neural network system
CN111783713B (en) * 2020-07-09 2022-12-02 中国科学院自动化研究所 Weakly-supervised temporal action localization method and device based on relational prototype network
CN111951276B (en) * 2020-07-28 2025-03-28 上海联影智能医疗科技有限公司 Image segmentation method, device, computer equipment and storage medium
CN113395542B (en) * 2020-10-26 2022-11-08 腾讯科技(深圳)有限公司 Video generation method and device based on artificial intelligence, computer equipment and medium
CN114756115A (en) * 2020-12-28 2022-07-15 阿里巴巴集团控股有限公司 Interactive control method, apparatus and device
CN112580589A (en) * 2020-12-28 2021-03-30 国网上海市电力公司 Behavior recognition method, medium and equipment based on two-stream method considering unbalanced data
CN112731359B (en) * 2020-12-31 2024-04-09 无锡祥生医疗科技股份有限公司 Method and device for determining speed of ultrasonic probe and storage medium
CN113128354B (en) * 2021-03-26 2022-07-19 中山大学中山眼科中心 Hand washing quality detection method and device
CN112926549B (en) * 2021-04-15 2022-06-24 华中科技大学 Gait recognition method and system based on time domain-space domain feature joint enhancement
CN114373194B (en) * 2022-01-14 2024-11-12 南京邮电大学 Human action recognition method based on keyframe and attention mechanism
CN114861530A (en) * 2022-04-21 2022-08-05 同济大学 ENSO intelligent prediction method, device, equipment and storage medium
CN115830698A (en) * 2022-04-28 2023-03-21 西安理工大学 Target detection and positioning method based on depth optical flow and YOLOv3 space-time fusion
CN118214922B (en) * 2024-05-17 2024-08-30 环球数科集团有限公司 System for capturing video spatial and temporal features using CNN filters

Family Cites Families (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US8345984B2 (en) * 2010-01-28 2013-01-01 Nec Laboratories America, Inc. 3D convolutional neural networks for automatic human action recognition
CN103218831B (en) * 2013-04-21 2015-11-18 北京航空航天大学 Video moving object classification and recognition method based on contour constraint
CN104966104B (en) * 2015-06-30 2018-05-11 山东管理学院 Video classification method based on three-dimensional convolutional neural network
CN105740773B (en) * 2016-01-25 2019-02-01 重庆理工大学 Activity recognition method based on deep learning and multi-scale information
CN106599789B (en) * 2016-07-29 2019-10-11 北京市商汤科技开发有限公司 Video category identification method and device, data processing device and electronic device

Patent Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN102129691A (en) * 2011-03-22 2011-07-20 北京航空航天大学 Video moving object tracking and segmentation method using Snake contour model
CN102289795A (en) * 2011-07-29 2011-12-21 上海交通大学 Method for enhancing video in spatio-temporal mode based on fusion idea
US20130071041A1 (en) * 2011-09-16 2013-03-21 Hailin Jin High-Quality Denoising of an Image Sequence
CN104217214A (en) * 2014-08-21 2014-12-17 广东顺德中山大学卡内基梅隆大学国际联合研究院 RGB-D human behavior recognition method based on configurable convolutional neural network
CN105550699A (en) * 2015-12-08 2016-05-04 北京工业大学 CNN-based video recognition and classification method through spatio-temporal salient information fusion

Cited By (56)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2018019126A1 (en) * 2016-07-29 2018-02-01 北京市商汤科技开发有限公司 Video category identification method and device, data processing device and electronic apparatus
CN107330362A (en) * 2017-05-25 2017-11-07 北京大学 Video classification method based on space-time attention
CN107330362B (en) * 2017-05-25 2020-10-09 北京大学 Video classification method based on space-time attention
CN107463949A (en) * 2017-07-14 2017-12-12 北京协同创新研究院 Processing method and device for video action classification
CN107463949B (en) * 2017-07-14 2020-02-21 北京协同创新研究院 A processing method and device for video action classification
CN108229290A (en) * 2017-07-26 2018-06-29 北京市商汤科技开发有限公司 Video object segmentation method and device, electronic equipment, storage medium and program
US11222211B2 (en) 2017-07-26 2022-01-11 Beijing Sensetime Technology Development Co., Ltd Method and apparatus for segmenting video object, electronic device, and storage medium
CN107943849A (en) * 2017-11-03 2018-04-20 小草数语(北京)科技有限公司 Video file retrieval method and device
CN107943849B (en) * 2017-11-03 2020-05-08 绿湾网络科技有限公司 Video file retrieval method and device
CN108010538B (en) * 2017-12-22 2021-08-24 北京奇虎科技有限公司 Audio data processing method and device, and computing device
CN108010538A (en) * 2017-12-22 2018-05-08 北京奇虎科技有限公司 Audio data processing method and device, computing device
CN108230413A (en) * 2018-01-23 2018-06-29 北京市商汤科技开发有限公司 Image description method and device, electronic equipment, computer storage medium and program
CN108230413B (en) * 2018-01-23 2021-07-06 北京市商汤科技开发有限公司 Image description method and device, electronic equipment and computer storage medium
CN108171222B (en) * 2018-02-11 2020-08-25 清华大学 A real-time video classification method and device based on multi-stream neural network
CN108171222A (en) * 2018-02-11 2018-06-15 清华大学 Real-time video classification method and device based on multi-stream neural network
CN110321761B (en) * 2018-03-29 2022-02-11 中国科学院深圳先进技术研究院 Behavior recognition method, terminal device and computer-readable storage medium
CN110321761A (en) * 2018-03-29 2019-10-11 中国科学院深圳先进技术研究院 Behavior recognition method, terminal device and computer-readable storage medium
CN108764084A (en) * 2018-05-17 2018-11-06 西安电子科技大学 Video classification method based on fusion of spatial-domain and temporal-domain classification networks
CN108764084B (en) * 2018-05-17 2021-07-27 西安电子科技大学 Video classification method based on fusion of spatial classification network and temporal classification network
CN110598504A (en) * 2018-06-12 2019-12-20 北京市商汤科技开发有限公司 Image recognition method and device, electronic equipment and storage medium
CN110598504B (en) * 2018-06-12 2023-07-21 北京市商汤科技开发有限公司 Image recognition method and device, electronic equipment and storage medium
CN109271840A (en) * 2018-07-25 2019-01-25 西安电子科技大学 Video gesture classification method
CN109325430B (en) * 2018-09-11 2021-08-20 苏州飞搜科技有限公司 Real-time behavior identification method and system
CN109325430A (en) * 2018-09-11 2019-02-12 北京飞搜科技有限公司 Real-time behavior recognition method and system
CN109325435A (en) * 2018-09-15 2019-02-12 天津大学 Video Action Recognition and Localization Algorithm Based on Cascaded Neural Network
CN109325435B (en) * 2018-09-15 2022-04-19 天津大学 Video action recognition and localization method based on cascaded neural network
CN109376603A (en) * 2018-09-25 2019-02-22 北京周同科技有限公司 Video identification method, device, computer equipment and storage medium
CN109657546A (en) * 2018-11-12 2019-04-19 平安科技(深圳)有限公司 Neural-network-based video behavior recognition method and terminal device
WO2020108023A1 (en) * 2018-11-28 2020-06-04 北京达佳互联信息技术有限公司 Video motion classification method, apparatus, computer device, and storage medium
CN109726765A (en) * 2019-01-02 2019-05-07 京东方科技集团股份有限公司 Sample extraction method and device for video classification problems
CN109740670A (en) * 2019-01-02 2019-05-10 京东方科技集团股份有限公司 Video classification method and device
US11055535B2 (en) 2019-01-02 2021-07-06 Boe Technology Group Co., Ltd. Method and device for video classification
US11210522B2 (en) 2019-01-02 2021-12-28 Boe Technology Group Co., Ltd. Sample extraction method and device targeting video classification problem
CN109886165A (en) * 2019-01-23 2019-06-14 中国科学院重庆绿色智能技术研究院 An Action Video Extraction and Classification Method Based on Moving Object Detection
WO2020155713A1 (en) * 2019-01-29 2020-08-06 北京市商汤科技开发有限公司 Image processing method and device, and network training method and device
US11113536B2 (en) 2019-03-15 2021-09-07 Boe Technology Group Co., Ltd. Video identification method, video identification device, and storage medium
CN110020639B (en) * 2019-04-18 2021-07-23 北京奇艺世纪科技有限公司 Video feature extraction method and related equipment
CN110020639A (en) * 2019-04-18 2019-07-16 北京奇艺世纪科技有限公司 Video feature extraction method and related equipment
CN111820947A (en) * 2019-04-19 2020-10-27 无锡祥生医疗科技股份有限公司 Automatic ultrasonic cardiac regurgitation capture method and system, and ultrasonic imaging equipment
CN111820947B (en) * 2019-04-19 2023-08-29 无锡祥生医疗科技股份有限公司 Automatic ultrasonic cardiac regurgitation capture method and system, and ultrasonic imaging equipment
CN110062248B (en) * 2019-04-30 2021-09-28 广州酷狗计算机科技有限公司 Method and device for recommending live broadcast room
CN110062248A (en) * 2019-04-30 2019-07-26 广州酷狗计算机科技有限公司 Method and apparatus for recommending live broadcast rooms
CN112288345A (en) * 2019-07-25 2021-01-29 顺丰科技有限公司 Method and device for detecting loading and unloading port state, server and storage medium
CN110602527A (en) * 2019-09-12 2019-12-20 北京小米移动软件有限公司 Video processing method, device and storage medium
US11288514B2 (en) 2019-09-12 2022-03-29 Beijing Xiaomi Mobile Software Co., Ltd. Video processing method and device, and storage medium
CN111125405A (en) * 2019-12-19 2020-05-08 国网冀北电力有限公司信息通信分公司 Power monitoring image abnormality detection method and device, electronic device and storage medium
CN111898458A (en) * 2020-07-07 2020-11-06 中国传媒大学 Violent video recognition method based on bimodal task learning with attention mechanism
CN111898458B (en) * 2020-07-07 2024-07-12 中国传媒大学 Violent video identification method for bimodal task learning based on attention mechanism
CN111860353A (en) * 2020-07-23 2020-10-30 北京以萨技术股份有限公司 Video behavior prediction method, device and medium based on dual-stream neural network
CN113139467A (en) * 2021-04-23 2021-07-20 西安交通大学 Hierarchical structure-based fine-grained video action identification method
CN113395537A (en) * 2021-06-16 2021-09-14 北京百度网讯科技有限公司 Method and device for recommending live broadcast room
CN113395537B (en) * 2021-06-16 2023-05-16 北京百度网讯科技有限公司 Method and device for recommending live broadcasting room
CN113870040B (en) * 2021-09-07 2024-05-21 天津大学 Dual-stream graph convolutional network microblog topic detection method fusing different propagation modes
CN113870040A (en) * 2021-09-07 2021-12-31 天津大学 Dual-stream graph convolutional network microblog topic detection method fusing different propagation modes
CN114987551A (en) * 2022-06-27 2022-09-02 吉林大学 Lane departure early warning method based on dual-stream convolutional neural network
CN116645917A (en) * 2023-06-09 2023-08-25 浙江技加智能科技有限公司 LED display brightness adjustment system and method thereof

Also Published As

Publication number Publication date
WO2018019126A1 (en) 2018-02-01
CN106599789B (en) 2019-10-11

Similar Documents

Publication Publication Date Title
CN106599789A (en) Video class identification method and device, data processing device and electronic device
Baldassarre et al. Deep koalarization: Image colorization using CNNs and Inception-ResNet-v2
US10929649B2 (en) Multi-pose face feature point detection method based on cascade regression
CN111445488B (en) A Weakly Supervised Learning Approach to Automatically Identify and Segment Salt Bodies
CN105701508B (en) Global-local optimal model and saliency detection algorithm based on multistage convolutional neural networks
CN106548192B (en) Neural-network-based image processing method, device and electronic equipment
CN103984959B (en) Data- and task-driven image classification method
CN108108751B (en) Scene recognition method based on convolution multi-feature and deep random forest
CN109858466A (en) Face keypoint detection method and device based on convolutional neural networks
CN112101344B (en) Video text tracking method and device
CN108681695A (en) Video action recognition method and device, electronic equipment and storage medium
CN106504233A (en) Faster R-CNN-based power component recognition method and system for UAV inspection images
CN111126115B (en) Violent sorting behavior identification method and device
CN109657612B (en) Quality sorting system based on facial image features and application method thereof
CN107683469A (en) Product classification method and device based on deep learning
WO2022152009A1 (en) Target detection method and apparatus, and device and storage medium
CN111368660A (en) A single-stage semi-supervised image human object detection method
CN104866868A (en) Metal coin identification method based on deep neural network and apparatus thereof
CN109918971A (en) Method and device for detecting the number of people in surveillance video
CN110543848B (en) Driver action recognition method and device based on three-dimensional convolutional neural network
CN109472193A (en) Face detection method and device
CN112418032A (en) Human behavior recognition method and device, electronic equipment and storage medium
CN112364791B (en) Pedestrian re-identification method and system based on generation of confrontation network
CN112613579A (en) Model training method and evaluation method for human face or human head image quality and selection method for high-quality image
CN108961358A (en) Method, apparatus and electronic equipment for obtaining sample pictures

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant