CN112906649B - Video segmentation method, device, computer device and medium - Google Patents
Video segmentation method, device, computer device and medium
- Publication number
- CN112906649B CN202110314575.0A
- Authority
- CN
- China
- Prior art keywords
- behavior
- video
- category
- feature vector
- segment
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Active
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06V—IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
- G06V20/00—Scenes; Scene-specific elements
- G06V20/40—Scenes; Scene-specific elements in video content
- G06V20/49—Segmenting video sequences, i.e. computational techniques such as parsing or cutting the sequence, low-level clustering or determining units such as shots or scenes
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F18/00—Pattern recognition
- G06F18/20—Analysing
- G06F18/24—Classification techniques
- G06F18/241—Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches
- G06F18/2411—Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches based on the proximity to a decision surface, e.g. support vector machines
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F18/00—Pattern recognition
- G06F18/20—Analysing
- G06F18/25—Fusion techniques
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06V—IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
- G06V20/00—Scenes; Scene-specific elements
- G06V20/40—Scenes; Scene-specific elements in video content
- G06V20/46—Extracting features or characteristics from the video content, e.g. video fingerprints, representative shots or key frames
Landscapes
- Engineering & Computer Science (AREA)
- Theoretical Computer Science (AREA)
- Physics & Mathematics (AREA)
- Data Mining & Analysis (AREA)
- General Physics & Mathematics (AREA)
- Bioinformatics & Cheminformatics (AREA)
- Computer Vision & Pattern Recognition (AREA)
- Evolutionary Biology (AREA)
- Evolutionary Computation (AREA)
- Bioinformatics & Computational Biology (AREA)
- General Engineering & Computer Science (AREA)
- Artificial Intelligence (AREA)
- Life Sciences & Earth Sciences (AREA)
- Multimedia (AREA)
- Computing Systems (AREA)
- Image Analysis (AREA)
Abstract
The application discloses a video segmentation method, a video segmentation device, a computer device and a medium. The method comprises the following steps: dividing the video into segments based on correlation coefficients between adjacent video frames in the video; for the video frames in the segments, identifying the scene of each video frame to obtain a scene feature vector; for the video frames in the segments, identifying local behavior features of each video frame to obtain a local behavior feature vector; identifying a behavior category of the video frame and a confidence corresponding to the behavior category based on the scene feature vector and the local behavior feature vector; determining the behavior category of the segment based on the behavior categories and confidences of the video frames of the segment; and merging adjacent segments with the same behavior category to obtain the segmentation result of the video. The method fuses two model branches, comprehensively utilizing the two dimensions of scene and local behavior to extract overall behavior information, thereby segmenting the video rapidly.
Description
Technical Field
The present application relates to the field of automatic image processing, and in particular, to a video segmentation method, device, computer device, and medium.
Background
The rapid development of video compression algorithms and applications has produced massive amounts of video data. Video contains rich information; however, because video data is huge and does not directly represent abstract concepts the way text does, extracting and structuring video information is relatively complex. At present, video information is typically extracted by first segmenting the video and then labeling each segment with a category, which is one approach to video information extraction and structuring. Segmenting video with traditional computer vision generally requires manually designed image features, and such hand-crafted features cannot flexibly adapt to changes across various scenes. Most currently available video segmentation methods only use the color information of each frame: they detect changes between two adjacent frames through various traditional computer vision transformations to determine segmentation points, and then use a clustering algorithm from machine learning to aggregate the resulting adjacent video segments, grouping similar segments into one category. However, these methods can only accomplish rough and shallow segmentation and cannot recognize the semantics of each segment in the video.
Disclosure of Invention
The present application aims to overcome or at least partially solve or alleviate the above-mentioned problems.
According to an aspect of the present application, there is provided a video segmentation method including:
a segment segmentation step: dividing the video into segments based on correlation coefficients between adjacent video frames in the video;
a scene recognition step: for the video frames in the segments, identifying the scene of each video frame to obtain a scene feature vector;
a local behavior feature recognition step: for the video frames in the segments, identifying local behavior features of each video frame to obtain a local behavior feature vector;
a video frame behavior category determination step: identifying a behavior category of the video frame and a confidence corresponding to the behavior category based on the scene feature vector and the local behavior feature vector;
a segment behavior category determination step: determining the behavior category of the segment based on the behavior categories and confidences of the video frames of the segment;
a segment merging step: merging adjacent segments with the same behavior category to obtain the segmentation result of the video.
The method fuses two model branches, comprehensively utilizing the two dimensions of scene and local behavior to extract overall behavior information, thereby segmenting the video rapidly.
Optionally, the segment segmentation step includes:
a histogram calculation step: calculating YCbCr histograms for each video frame of the video;
a correlation coefficient calculation step: calculating the correlation coefficient between the YCbCr histogram of the video frame and the YCbCr histogram of the previous video frame; and
a threshold comparison step: when the correlation coefficient is smaller than a predetermined first threshold, taking the video frame as the start frame of a new segment.
Optionally, the scene recognition step includes:
a resolution conversion step: converting the RGB channels of the video frame to a fixed-size resolution; and
a scene feature vector generation step: inputting the resolution-converted video frame into a first network model to obtain the scene feature vector of the video frame, wherein the first network model is a VGG16 network model with the last fully connected layer and the Softmax classifier removed.
Optionally, the local behavior feature identifying step includes:
a shortest side length fixing step: converting the RGB channels of the video frame to a resolution with a fixed shortest side length; and
a local behavior feature vector generation step: inputting the video frame with the fixed shortest side length into the first network model, inputting the output result of the first network model into a region-based convolutional neural network (Faster R-CNN) model, calculating an optimal detection category result using the output of the region-based convolutional neural network, and passing the optimal detection category result through a region-of-interest pooling layer to obtain the local behavior feature vector.
Optionally, the step of determining the behavior category of the video frame includes:
a video frame feature vector merging step: combining the scene feature vector and the local behavior feature vector into a video frame feature vector; and
a behavior category and confidence calculation step: inputting the video frame feature vector into a third network to obtain the behavior category of the video frame and the confidence corresponding to the behavior category, wherein the third network is formed by 4 fully connected layers connected in sequence, followed by a Softmax classifier.
Optionally, the segment behavior category determination step includes: taking a behavior category as the behavior category of the segment when the ratio of the number of video frames with that behavior category to the total number of video frames of the segment is larger than a predetermined second threshold.
According to another aspect of the present application, there is also provided a video segmentation apparatus including:
A segment segmentation module configured to segment a video into segments based on correlation coefficients between adjacent video frames in the video;
a scene recognition module configured to identify, for the video frames in the segments, the scene of each video frame to obtain a scene feature vector;
a local behavior feature recognition module configured to identify, for the video frames in the segments, local behavior features of each video frame to obtain a local behavior feature vector;
a video frame behavior category determination module configured to identify a behavior category of the video frame and a confidence level corresponding to the behavior category based on the scene feature vector and the local behavior feature vector;
A segment behavior category determination module configured to determine a behavior category of the segment based on a behavior category and a confidence level of a video frame of the segment; and
a segment merging module configured to merge adjacent segments with the same behavior category to obtain the segmentation result of the video.
The device fuses two model branches, comprehensively utilizing the two dimensions of scene and local behavior to extract overall behavior information, thereby segmenting the video rapidly.
According to another aspect of the present application there is also provided a computer device comprising a memory, a processor and a computer program stored in said memory and executable by said processor, wherein said processor implements the method as described above when executing said computer program.
According to another aspect of the application there is also provided a computer readable storage medium, preferably a non-volatile readable storage medium, having stored therein a computer program which when executed by a processor implements a method as described above.
According to another aspect of the application there is also provided a computer program product comprising computer readable code which, when executed by a computer device, causes the computer device to perform the method as described above.
The above, as well as additional objectives, advantages, and features of the present application will become apparent to those skilled in the art from the following detailed description of a specific embodiment of the present application when read in conjunction with the accompanying drawings.
Drawings
Some specific embodiments of the application will be described in detail hereinafter by way of example and not by way of limitation with reference to the accompanying drawings. The same reference numbers will be used throughout the drawings to refer to the same or like parts or portions. It will be appreciated by those skilled in the art that the drawings are not necessarily drawn to scale. In the accompanying drawings:
FIG. 1 is a schematic flow chart diagram illustrating one embodiment of a video segmentation method in accordance with the present application;
FIG. 2 is a schematic block diagram of a behavior prediction network of the present application;
FIG. 3 is a schematic block diagram of training the behavior prediction network of the present application;
FIG. 4 is a schematic block diagram of one embodiment of a video segmentation device in accordance with the present application;
FIG. 5 is a block diagram of one embodiment of a computing device of the present application;
FIG. 6 is a block diagram of one embodiment of a computer-readable storage medium of the present application.
Detailed Description
An embodiment of the present application provides a video segmentation method, and fig. 1 is a schematic flow chart of one example of a video segmentation method according to the present application. The method may include:
S100, a segment segmentation step: dividing the video into segments based on correlation coefficients between adjacent video frames in the video;
S200, a scene recognition step: for the video frames in the segments, identifying the scene of each video frame to obtain a scene feature vector;
S300, a local behavior feature recognition step: for the video frames in the segments, identifying local behavior features of each video frame to obtain a local behavior feature vector;
S400, a video frame behavior category determination step: identifying a behavior category of the video frame and a confidence corresponding to the behavior category based on the scene feature vector and the local behavior feature vector;
S500, a segment behavior category determination step: determining the behavior category of the segment based on the behavior categories and confidences of the video frames of the segment;
S600, a segment merging step: merging adjacent segments with the same behavior category to obtain the segmentation result of the video.
The method provided by the application fuses two model branches, comprehensively utilizing the two dimensions of scene and local behavior to extract overall behavior information, thereby segmenting the video rapidly. The application uses deep learning to segment the video along the dimension of human behavior categories. On the one hand, deep learning can extract more abstract and general features; on the other hand, dynamic information and causal events in a video are mainly defined by human behavior, so segmenting the video according to human behavior categories is the most reasonable choice.
Optionally, the step of S100 segment segmentation may include:
S101, a histogram calculation step: calculating a YCbCr histogram for each video frame of the video;
S102, a correlation coefficient calculation step: calculating the correlation coefficient between the YCbCr histogram of the video frame and the YCbCr histogram of the previous video frame; and
S103, a threshold comparison step: when the correlation coefficient is smaller than a predetermined first threshold, taking the video frame as the start frame of a new segment.
The color space may include: RGB, CMY (the three primary colors), HSV (hue, saturation, value), HSI (hue, saturation, intensity), and YCbCr. In YCbCr, Y refers to the luminance component, Cb to the blue chrominance component, and Cr to the red chrominance component. Taking YCbCr as an example, in an alternative embodiment the video is segmented as follows:
The YCbCr data of the frame is normalized based on the YCbCr color space, and a normalized YCbCr histogram is constructed, where the horizontal axis of the histogram represents the normalized levels and the vertical axis represents the number of pixels corresponding to each level. During normalization, Y, Cb, and Cr may optionally be divided into 16, 9, and 9 parts, respectively (a 16-9-9 mode), so that the number of normalized levels is 16+9+9=34. The number of levels is chosen, and the normalization (that is, quantization at unequal intervals) is performed, according to the different ranges of the colors and subjective color perception, taking into account human visual resolving power and the processing speed of computers.
The correlation coefficient d(H_fi, H_fj) between the frame and the preceding frame is calculated using the following formula:
d(H_fi, H_fj) = sum_l (H_fi(l) - mean(H_fi)) * (H_fj(l) - mean(H_fj)) / sqrt( sum_l (H_fi(l) - mean(H_fi))^2 * sum_l (H_fj(l) - mean(H_fj))^2 )
where l denotes the normalized level and each sum runs over l = 1, ..., bin_1; bin_1 denotes the total number of normalized levels; H_fi(l) and H_fj(l) are the numbers of pixels at level l of the frame and of the preceding frame, respectively; and mean(H_fi) and mean(H_fj) are the average pixel counts per level of the frame and of the preceding frame, respectively. bin_1 is the number of bins of the histogram and, in the YCbCr histogram, equals the total number of normalized levels: the Y channel is quantized into 16 levels and the Cb and Cr channels into 9 levels each, so bin_1 takes the value 16+9+9=34. Preferably, bin_1 is 34. Compared with chrominance information, the human eye is more sensitive to luminance information, so luminance and chrominance information are better handled by adopting the YCbCr color space model.
The correlation coefficient is then compared with the first threshold. If it is smaller than the first threshold, the frame is likely the start of a new shot, and it is taken as the start frame of a new segment (clip). The first threshold may be determined experimentally and in practice; optionally, the first threshold is set to 0.85.
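As a concrete illustration of steps S101 to S103, the rough segmentation can be sketched in Python with OpenCV; this is a minimal sketch, not part of the original disclosure, and the helper names and the use of cv2.compareHist with the correlation method (which computes the normalized cross-correlation given above) are assumptions made for the example.

```python
import cv2
import numpy as np

FIRST_THRESHOLD = 0.85  # the predetermined first threshold mentioned above

def ycbcr_histogram(frame_bgr):
    """Build the 34-bin YCbCr histogram with the 16-9-9 quantization."""
    ycrcb = cv2.cvtColor(frame_bgr, cv2.COLOR_BGR2YCrCb)
    y, cr, cb = cv2.split(ycrcb)
    h_y = cv2.calcHist([y], [0], None, [16], [0, 256])   # 16 luminance levels
    h_cb = cv2.calcHist([cb], [0], None, [9], [0, 256])  # 9 blue-chrominance levels
    h_cr = cv2.calcHist([cr], [0], None, [9], [0, 256])  # 9 red-chrominance levels
    return np.concatenate([h_y, h_cb, h_cr]).astype(np.float32)

def rough_segment_starts(video_path):
    """Return the frame indices that start a new rough segment (clip)."""
    cap = cv2.VideoCapture(video_path)
    starts, prev_hist, idx = [0], None, 0
    while True:
        ok, frame = cap.read()
        if not ok:
            break
        hist = ycbcr_histogram(frame)
        if prev_hist is not None:
            # correlation coefficient between adjacent-frame histograms
            d = cv2.compareHist(prev_hist, hist, cv2.HISTCMP_CORREL)
            if d < FIRST_THRESHOLD:
                starts.append(idx)  # this frame starts a new segment
        prev_hist, idx = hist, idx + 1
    cap.release()
    return starts
```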
For each video clip(i) roughly cut in step S103, where i denotes the sequence number of the clip, one image frame is captured per second and sent to the behavior prediction network; the network outputs a behavior identifier (id), denoted clip(i)_frame(j)_id, and a corresponding confidence, denoted clip(i)_frame(j)_confidence. The behavior prediction network is a network dedicated to behavior prediction, and each behavior corresponds one-to-one with an id. The behavior prediction network may include a first network model, a second network model, and a third network model. The flow by which a single frame finally obtains its behavior category through the behavior prediction network is described below.
Optionally, the S200 scene recognition step may include:
S201, a resolution conversion step: converting the RGB channels of the video frame to a fixed-size resolution; and
S202, a scene feature vector generation step: inputting the resolution-converted video frame into the first network model to obtain the scene feature vector of the video frame, wherein the first network model is a VGG16 network model with the last fully connected layer and the Softmax classifier removed.
Fig. 2 is a schematic block diagram of the behavior prediction network of the present application. The RGB channels of the image are each converted to a fixed-size resolution, for example 224x224, and the converted video frame is input into the first network model, also referred to as the scene recognition sub-network. The first network model is a modified VGG16 network trained for scene recognition over several predefined scenes, with the last fully connected layer and the Softmax classifier removed. The output of the scene recognition sub-network is a 1x1x25088-dimensional vector, denoted as the scene feature vector place_feature_vector.
It should be noted that the Visual Geometry Group (VGG) is a research group in the Department of Engineering Science at the University of Oxford; the model it established through deep learning is the VGG model, which is characterized by VGG features. The VGG features may include FC6-layer features. VGG16 Net is a deep neural network architecture.
The VGG16 Net network structure contains 5 stacked convolutional blocks (ConvNets); each ConvNet consists of multiple convolutional layers (Conv), each followed by a nonlinear mapping layer (ReLU), and each ConvNet is followed by a pooling layer (Pooling). These are followed by 3 fully connected layers (with 4096, 4096, and 1000 channels) and a soft-max layer with 1000 channels; different output sizes can be chosen depending on the specific task. The network uses small convolution kernels (3x3) and adds ReLU layers, with the outputs of the convolutional and fully connected layers connected directly to ReLU layers, and applies a regularization method (Dropout) at the fully connected layers fc6 and fc7, which greatly shortens training time, improves the flexibility of the network, and prevents over-fitting. Considering factors such as the learning and representation capability of the network model, its structural flexibility, and training time, the application selects VGG16 Net as its feature extractor. The matrix adjustment functions (Reshape functions) in the model readjust the number of rows, columns, and dimensions of a matrix.
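For illustration, a scene recognition sub-network of this shape can be sketched in PyTorch; it is a minimal sketch in which torchvision's VGG16 (with random weights here) stands in for the scene-trained VGG16 described above, and the class name SceneFeatureExtractor is invented for the example.

```python
import torch
import torch.nn as nn
from torchvision import models

class SceneFeatureExtractor(nn.Module):
    """VGG16 with the last fully connected layer and Softmax removed;
    a 224x224 RGB frame yields a 1x1x25088 scene feature vector."""
    def __init__(self):
        super().__init__()
        vgg = models.vgg16(weights=None)  # in practice, weights trained on the N predefined scenes
        self.features = vgg.features      # the 5 stacked convolutional blocks with pooling
        # the classifier (fully connected layers + Softmax) is intentionally not kept

    def forward(self, x):                 # x: (batch, 3, 224, 224)
        x = self.features(x)              # (batch, 512, 7, 7)
        return torch.flatten(x, 1)        # (batch, 25088) scene feature vector

frame = torch.rand(1, 3, 224, 224)                      # a frame resized to the fixed 224x224
place_feature_vector = SceneFeatureExtractor()(frame)   # shape (1, 25088)
```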
Optionally, the S300 local behavior feature recognition step may include:
S301, a shortest side length fixing step: converting the RGB channels of the video frame to a resolution with a fixed shortest side length; and
S302, a local behavior feature vector generation step: inputting the video frame with the fixed shortest side length into the first network model, inputting the output result of the first network model into a region-based convolutional neural network (Faster R-CNN) model, calculating an optimal detection category result using the output of the region-based convolutional neural network, and passing the optimal detection category result through a region-of-interest pooling layer to obtain the local behavior feature vector.
Referring to fig. 2, the RGB channels of the video frame are converted so that the shortest side has a fixed length, for example a resolution whose shortest side is 600 pixels, and the video frame is input into the second network model, also called the local behavior detection sub-network. The second network model is a local behavior detection network trained for predefined local behaviors. The second network model may include: the first network model, Faster R-CNN, an optimal detection module, and a pooling layer. The data flow of the second network model is as follows: the output of the first network model is input into a Faster R-CNN model, the optimal detection module calculates an optimal detection category result using the output of the region-based convolutional neural network, and the optimal detection category result is passed through a region of interest (ROI) pooling layer to obtain the local behavior feature vector. The second network model is based on Faster R-CNN but uses only the optimal detection category.
The optimal detection category is determined based on the following quantitative formula: for each detection target and rectangular box output by Faster R-CNN, take the maximum probability value softmax_max output by the softmax, denote the area of the rectangular box as S, and calculate the optimal detection category result opt_detection:
opt_detection = SCALE * softmax_max + WEIGHT * S
where SCALE is a coefficient that prevents softmax_max from being submerged by the value range of S, and WEIGHT is the weight given to the area. Optionally, SCALE=1000 and WEIGHT=0.7, giving the local behavior a slightly higher weight than the area.
The optimal detection category result is converted by the region-of-interest pooling layer from a 7x7x512-dimensional output into a 1x1x25088 vector, recorded as the local behavior feature vector local_action_feature_vector. In fig. 2, after the local behavior feature vector is obtained, the prediction results obtained through FC1, FC2, FC M and Softmax M, with the result of FC2 also fed into FC M×4, can be used together with the window regression function bbox_pred to evaluate the recognition effect of the local behavior feature vector, where M is the number of local behavior categories.
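The quantized selection and the subsequent pooling can be illustrated as follows; SCALE and WEIGHT follow the values given above, while the helper name optimal_detection, the example detections, the feature map size and the spatial_scale value are assumptions made for this sketch.

```python
import torch
from torchvision.ops import roi_pool

SCALE = 1000   # keeps softmax_max from being swamped by the value range of the area S
WEIGHT = 0.7   # weight given to the rectangle area

def optimal_detection(detections):
    """detections: list of (softmax_max, (x1, y1, x2, y2)) pairs from the detector.
    Returns the detection maximizing opt_detection = SCALE*softmax_max + WEIGHT*S."""
    def score(det):
        softmax_max, (x1, y1, x2, y2) = det
        area = max(0.0, x2 - x1) * max(0.0, y2 - y1)   # rectangle area S
        return SCALE * softmax_max + WEIGHT * area
    return max(detections, key=score)

# Pool the winning box from the backbone feature map into 7x7x512,
# then flatten it into the 1x1x25088 local behavior feature vector.
feature_map = torch.rand(1, 512, 38, 38)               # illustrative backbone output for a ~600px frame
_, best_box = optimal_detection([(0.91, (10.0, 20.0, 200.0, 300.0)),
                                 (0.85, (50.0, 60.0, 120.0, 160.0))])
boxes = [torch.tensor([best_box], dtype=torch.float32)]  # box in image coordinates
roi = roi_pool(feature_map, boxes, output_size=(7, 7), spatial_scale=1.0 / 16)
local_action_feature_vector = roi.flatten(1)             # shape (1, 25088)
```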
Optionally, the S400 video frame behavior category determination step may include:
S401, a video frame feature vector merging step: combining the scene feature vector and the local behavior feature vector into a video frame feature vector; and
S402, a behavior category and confidence calculation step: inputting the video frame feature vector into a third network to obtain the behavior category of the video frame and the confidence corresponding to the behavior category, wherein the third network is formed by 4 fully connected layers connected in sequence, followed by a Softmax classifier.
In S401, the scene feature vector place_feature_vector and the local behavior feature vector local_action_feature_vector are combined into one video frame feature vector of size 1x1x(25088+25088) = 50176 dimensions, denoted feature_vector, see fig. 2.
Optionally, the S500 segment behavior category determination step may include: taking a behavior category as the behavior category of the segment when the ratio of the number of video frames with that behavior category to the total number of video frames of the segment is larger than a predetermined second threshold.
In S402, the video frame feature vector feature_vector passes through 4 fully connected layers FC1 to FC4. FC1 outputs 4096 channels, FC2 outputs 4096 channels, FC3 outputs 1000 channels, and FC4 outputs the scores of C categories, see fig. 2. C can be chosen according to the number of behavior categories actually required, and is preferably between 15 and 30. The output of FC4 is fed into a Softmax classifier, which finally outputs the prediction confidence of each behavior category. The behavior category with the highest confidence is selected and output as the frame's behavior category, recorded as clip(i)_frame(j)_id and clip(i)_frame(j)_confidence.
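A minimal sketch of the feature concatenation and the third network might look as follows; the layer widths follow the description above, while the class name, the choice C = 20 and the ReLU activations between the fully connected layers are illustrative assumptions.

```python
import torch
import torch.nn as nn

C = 20  # number of overall behavior categories, chosen between 15 and 30 in practice

class BehaviorClassifier(nn.Module):
    """Four fully connected layers followed by a Softmax classifier."""
    def __init__(self, num_classes=C):
        super().__init__()
        self.fc = nn.Sequential(
            nn.Linear(50176, 4096), nn.ReLU(),   # FC1: 4096 channels
            nn.Linear(4096, 4096), nn.ReLU(),    # FC2: 4096 channels
            nn.Linear(4096, 1000), nn.ReLU(),    # FC3: 1000 channels
            nn.Linear(1000, num_classes),        # FC4: one score per behavior category
        )

    def forward(self, feature_vector):           # (batch, 50176)
        return torch.softmax(self.fc(feature_vector), dim=1)

place_feature_vector = torch.rand(1, 25088)          # from the scene sub-network
local_action_feature_vector = torch.rand(1, 25088)   # from the local behavior sub-network
feature_vector = torch.cat([place_feature_vector, local_action_feature_vector], dim=1)
confidences = BehaviorClassifier()(feature_vector)
frame_id = int(confidences.argmax(dim=1))            # clip(i)_frame(j)_id
frame_confidence = float(confidences.max())          # clip(i)_frame(j)_confidence
```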
In the segment behavior category determination step S500, the processing of steps S200 to S400 is performed for each frame captured every second in the clip clip(i), and the behavior category of each frame is predicted. The percentage of frames of clip(i) with the same id among all predicted frames is denoted same_id_percentage. If there exists an id for which same_id_percentage > same_id_percentage_thres, where same_id_percentage_thres is a set threshold, for example requiring that more than 80% of the frames share the id and that the confidence of those frames exceeds 65%, then that id is output as the behavior category of clip(i).
In the segment merging step S600, the above processing is performed for each segment roughly obtained in step S100, giving the behavior category of each segment. If the behavior categories of adjacent segments are the same, the two segments are merged into one segment. Finally, the video is divided into short videos segmented according to behavior category.
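The clip-level decision of S500 and the merge of S600 can be sketched as follows; the sketch simplifies the confidence condition described above to the majority-ratio test, and the threshold value and function names are illustrative.

```python
from collections import Counter

SAME_ID_PERCENTAGE_THRES = 0.8   # the set threshold (illustrative value)

def clip_behavior(frame_ids):
    """frame_ids: predicted behavior id of each sampled frame of one clip.
    Returns the clip behavior id, or None if no id is dominant enough."""
    most_common_id, count = Counter(frame_ids).most_common(1)[0]
    if count / len(frame_ids) > SAME_ID_PERCENTAGE_THRES:
        return most_common_id
    return None

def merge_adjacent(segments):
    """segments: list of (start_frame, end_frame, behavior_id); merges adjacent equal ids."""
    merged = []
    for start, end, behavior_id in segments:
        if merged and merged[-1][2] == behavior_id:
            merged[-1] = (merged[-1][0], end, behavior_id)   # extend the previous segment
        else:
            merged.append((start, end, behavior_id))
    return merged
```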
It should be understood that the S200 scene recognition step and the S300 local behavior feature recognition step need not be performed one after the other; they may be performed simultaneously or sequentially.
Fig. 3 is a schematic block diagram of training the behavior prediction network of the present application. Optionally, the method may further comprise a step of training the behavior prediction network.
For the first network model, i.e., the scene prediction network, VGG16 is used to classify N predefined scenes. The number of output scene categories N is selected according to actual requirements, generally 30 to 40. For example, the scene categories may be restaurant, basketball court, concert hall, and so forth. The training strategy is as follows: the weights w are initialized by the following formula:
w=np.random.randn(n)*sqrt(2.0/n)
where np.random.randn(n) generates random numbers, i.e., the n weights of the filter for each channel of each convolutional layer are initialized to a Gaussian distribution, which can be generated using numpy. The factor sqrt(2.0/n) is used to keep the variance of the distribution of the inputs to the neurons of each layer consistent. Regularization is performed using the dropout technique to prevent over-fitting; dropout means that, during training of the deep learning network, neural network units are temporarily dropped from the network with a certain probability. The activation probability of each neuron is the hyper-parameter p. After passing through two FC 4096 layers, FC N, and Softmax N, the pooled result is fed into the cost function. The cost function is the cross-entropy loss over the Softmax output. The weight update strategy uses SGD + Momentum (stochastic gradient descent with momentum). The learning rate decreases with training time according to a step decay schedule.
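The training strategy just described can be expressed as a brief PyTorch sketch; the tiny stand-in model, the dropout probability, the learning rate and the step-decay parameters are illustrative assumptions, while the weight initialization follows w = np.random.randn(n)*sqrt(2.0/n).

```python
import math
import torch
import torch.nn as nn

def init_conv_weights(module):
    """Gaussian initialization scaled by sqrt(2.0/n) so that the variance of the
    inputs to each layer's neurons stays consistent."""
    if isinstance(module, nn.Conv2d):
        n = module.in_channels * module.kernel_size[0] * module.kernel_size[1]
        module.weight.data.normal_(0.0, math.sqrt(2.0 / n))
        if module.bias is not None:
            module.bias.data.zero_()

model = nn.Sequential(                      # tiny stand-in for the VGG16 scene network
    nn.Conv2d(3, 64, 3, padding=1), nn.ReLU(),
    nn.AdaptiveAvgPool2d(1), nn.Flatten(),
    nn.Dropout(p=0.5),                      # dropout regularization (p is the hyper-parameter)
    nn.Linear(64, 30),                      # N predefined scene categories
)
model.apply(init_conv_weights)

criterion = nn.CrossEntropyLoss()           # cross-entropy loss over the class scores
optimizer = torch.optim.SGD(model.parameters(), lr=0.01, momentum=0.9)            # SGD + Momentum
scheduler = torch.optim.lr_scheduler.StepLR(optimizer, step_size=10, gamma=0.1)   # step decay
```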
For the second network model, i.e., the local behavior prediction network, Faster R-CNN is used, and training follows the standard Faster R-CNN training method. The number of output local behavior categories M is selected according to actual requirements, generally 15 to 30. For example, the local behaviors may be eating, playing basketball, dating, and the like. After the local behavior feature vector is obtained, the prediction result obtained through the two FC 4096 layers, FC M, and Softmax M, with the result of the second FC 4096 fed into FC M×4, can be used together with the window regression function bbox_pred to evaluate the recognition effect of the local behavior feature vector, where M is the number of local behavior categories. The outputs of Softmax M and FC M×4 are fed into the cross-entropy loss defined by Faster R-CNN.
After the first network model and the second network model have been trained, the third network is trained. For the scene network, the Softmax classifier and the last few fully connected layers are removed, the parameters of the remaining layers are kept unchanged, and the output of the last pooling layer is reshaped into 1x1x25088 dimensions and recorded as the scene feature vector. The local behavior recognition network is handled similarly: when the third network model is trained, each image is passed through the local behavior recognition network to predict a number of local behaviors and their bounding rectangles, the optimal detection category is selected according to the optimal detection criterion, and the corresponding 7x7x512-dimensional output of the region-of-interest pooling layer is obtained and further converted into a 1x1x25088-dimensional local behavior feature vector. The scene feature vector and the local behavior feature vector are concatenated into 1x1x(25088+25088) = 50176 dimensions, denoted the video frame feature vector. The video frame feature vector passes through 4 fully connected layers FC1 to FC4. The output of FC4 is connected in sequence to Softmax C and the cross-entropy loss. For the third network model, all other parameters remain unchanged and only the parameters of the 4 FC layers are trained. The parameter training strategy follows the training strategy of the first network model.
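Keeping the pretrained parts unchanged and training only the 4-layer FC head might be sketched as follows; the module shapes, the value C = 20 and the optimizer hyper-parameters are illustrative.

```python
import torch
import torch.nn as nn
from torchvision import models

scene_net = models.vgg16(weights=None).features   # stand-in for the trained, frozen sub-networks
fc_head = nn.Sequential(                           # the 4-layer FC head that is actually trained
    nn.Linear(50176, 4096), nn.ReLU(),
    nn.Linear(4096, 4096), nn.ReLU(),
    nn.Linear(4096, 1000), nn.ReLU(),
    nn.Linear(1000, 20),                           # C overall behavior categories
)

for p in scene_net.parameters():
    p.requires_grad = False                        # the other parameters remain unchanged

optimizer = torch.optim.SGD(fc_head.parameters(), lr=0.01, momentum=0.9)
```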
The C behavior categories predicted by the third network model, the M local behavior categories predicted by the second network model, and the N scene categories predicted by the first network model can be selected as follows. First, the overall C behavior categories, such as eating, playing basketball, and dating, are defined according to business requirements. Then, the possible local behavior categories are defined based on these C overall behaviors; they can generally be kept consistent with the overall behaviors, such as eating, playing basketball, and dating. Finally, according to the overall behavior categories, N possible scenes are defined; for example, for eating, scenes such as a restaurant or a coffee shop may be defined.
There is also provided, in accordance with another embodiment of the present application, a video segmentation apparatus, fig. 4 is a schematic block diagram of one example of a video segmentation apparatus in accordance with the present application. The apparatus may include:
a segment segmentation module 100 configured to segment a video into segments based on correlation coefficients between adjacent video frames in the video;
A scene recognition module 200 configured to recognize, for a video frame in the segment, a scene of the video frame, resulting in a scene feature vector;
a local behavior feature identification module 300 configured to identify, for a video frame in the segment, a local behavior feature of the video frame, resulting in a local behavior feature vector;
A video frame behavior category determination module 400 configured to identify a behavior category of the video frame and a confidence level corresponding to the behavior category based on the scene feature vector and the local behavior feature vector;
A segment behavior category determination module 500 configured to determine a behavior category of the segment based on the behavior category and the confidence level of the video frame of the segment; and
a segment merging module 600 configured to merge adjacent segments with the same behavior category to obtain the segmentation result of the video.
The device provided by the application fuses two model branches, comprehensively utilizing the two dimensions of scene and local behavior to extract overall behavior information, thereby segmenting the video rapidly.
Alternatively, the segment segmentation module 100 may include:
A histogram calculation module configured to calculate YCbCr histograms for each video frame of the video;
a correlation coefficient calculation module configured to calculate a correlation coefficient of a YCbCr histogram of the video frame with a YCbCr histogram of a previous video frame; and
A threshold comparison module configured to treat the video frame as a start frame of a new segment when the correlation coefficient is less than a predetermined first threshold.
Alternatively, the scene recognition module 200 may include:
A resolution conversion module configured to convert RGB channels of the video frame into fixed-size resolutions, respectively; and
a scene feature vector generation module configured to input the resolution-converted video frame into a first network model to obtain the scene feature vector of the video frame, wherein the first network model is a VGG16 network model with the last fully connected layer and the Softmax classifier removed.
Optionally, the local behavior feature recognition module 300 may include:
a shortest side length fixing module configured to convert the RGB channels of the video frame to a resolution with a fixed shortest side length; and
a local behavior feature vector generation module configured to input the video frame with the fixed shortest side length into the first network model, input the output result of the first network model into a region-based convolutional neural network (Faster R-CNN) model, calculate an optimal detection category result using the output of the region-based convolutional neural network, and pass the optimal detection category result through a region-of-interest pooling layer to obtain the local behavior feature vector.
Optionally, the video frame behavior category determination module 400 may include:
A video frame feature vector merging module configured to merge the scene feature vector and the local behavior feature vector into a video frame feature vector; and
a behavior category and confidence calculation module configured to input the video frame feature vector into a third network to obtain the behavior category of the video frame and the confidence corresponding to the behavior category, wherein the third network is formed by 4 fully connected layers connected in sequence, followed by a Softmax classifier.
FIG. 5 is a block diagram of one embodiment of a computing device of the present application. Another embodiment of the application also provides a computing device comprising a memory 1120, a processor 1110 and a computer program stored in the memory 1120 and executable by the processor 1110; the computer program is stored in a space 1130 for program code in the memory 1120 and, when executed by the processor 1110, implements the method steps 1131 for performing any one of the methods according to the application.
Another embodiment of the present application also provides a computer-readable storage medium. Fig. 6 is a block diagram of one embodiment of a computer-readable storage medium of the application, comprising a storage unit for program code; the storage unit is provided with a program 1131' for performing the method steps according to the application, and the program is executed by a processor.
Embodiments of the present application also provide a computer program product comprising instructions. When the computer program product runs on a computer device, the computer device is caused to perform the method as described above.
In the above embodiments, the implementation may be realized in whole or in part by software, hardware, firmware, or any combination thereof. When implemented in software, it may be realized in whole or in part in the form of a computer program product. The computer program product includes one or more computer instructions. When loaded and executed by a computer, the computer instructions produce, in whole or in part, the flows or functions according to the embodiments of the present application. The computer may be a general purpose computer, a special purpose computer, a computer network, or another programmable apparatus. The computer instructions may be stored in a computer-readable storage medium or transmitted from one computer-readable storage medium to another, for example by wire (e.g., coaxial cable, optical fiber, digital subscriber line (DSL)) or wirelessly (e.g., infrared, radio, microwave, etc.). The computer-readable storage medium may be any available medium that can be accessed by a computer, or a data storage device such as a server or data center that integrates one or more available media. The available medium may be a magnetic medium (e.g., floppy disk, hard disk, tape), an optical medium (e.g., DVD), or a semiconductor medium (e.g., a solid state disk (SSD)), etc.
Those of skill would further appreciate that the various illustrative elements and algorithm steps described in connection with the embodiments disclosed herein may be implemented as electronic hardware, computer software, or combinations of both, and that the various illustrative elements and steps are described above generally in terms of function in order to clearly illustrate the interchangeability of hardware and software. Whether such functionality is implemented as hardware or software depends upon the particular application and design constraints imposed on the solution. Skilled artisans may implement the described functionality in varying ways for each particular application, but such implementation decisions should not be interpreted as causing a departure from the scope of the present application.
Those of ordinary skill in the art will appreciate that all or part of the steps in implementing the methods of the above embodiments may be implemented by a program instructing a processor, where the program may be stored in a computer-readable storage medium and the storage medium is a non-transitory medium, such as a random access memory, a read-only memory, a flash memory, a hard disk, a solid state disk, a magnetic tape, a floppy disk, an optical disk, or any combination thereof.
The present application is not limited to the above-mentioned embodiments, and any changes or substitutions that can be easily understood by those skilled in the art within the technical scope of the present application are intended to be included in the scope of the present application. Therefore, the protection scope of the present application should be subject to the protection scope of the claims.
Claims (4)
1. A video segmentation method, comprising:
A segment segmentation step: dividing the video into segments based on correlation coefficients between adjacent video frames in the video;
a scene recognition step: for the video frames in the segments, converting the RGB channels of each video frame to a fixed-size resolution, and inputting the resolution-converted video frame into a first network model to obtain a scene feature vector of the video frame, wherein the first network model is a VGG16 network model with the last fully connected layer and the Softmax classifier removed;
a local behavior feature recognition step: converting the RGB channels of the video frame to a resolution with a fixed shortest side length, inputting the video frame with the fixed shortest side length into the first network model, inputting the output result of the first network model into a region-based convolutional neural network model, calculating an optimal detection category result using the output of the region-based convolutional neural network, and passing the optimal detection category result through a region-of-interest pooling layer to obtain a local behavior feature vector;
a video frame behavior category determination step: identifying a behavior category of the video frame and a confidence corresponding to the behavior category based on the scene feature vector and the local behavior feature vector, wherein the video frame behavior category determination step comprises:
a video frame feature vector merging step: merging the scene feature vector and the local behavior feature vector into a video frame feature vector, and
a behavior category and confidence calculation step: inputting the video frame feature vector into a third network to obtain the behavior category of the video frame and the confidence corresponding to the behavior category, wherein the third network is formed by 4 fully connected layers connected in sequence, followed by a Softmax classifier;
a segment behavior category determination step: determining the behavior category of the segment based on the behavior categories and confidences of the video frames of the segment, wherein the segment behavior category determination step comprises: taking a behavior category as the behavior category of the segment when the ratio of the number of video frames with that behavior category to the total number of video frames of the segment is larger than a predetermined second threshold; and
a segment merging step: merging adjacent segments with the same behavior category to obtain the segmentation result of the video.
2. A video segmentation apparatus, comprising:
A segment segmentation module configured to segment a video into segments based on correlation coefficients between adjacent video frames in the video;
a scene recognition module configured to, for the video frames in the segments, convert the RGB channels of each video frame to a fixed-size resolution and input the resolution-converted video frame into a first network model to obtain a scene feature vector of the video frame, wherein the first network model is a VGG16 network model with the last fully connected layer and the Softmax classifier removed;
a local behavior feature recognition module configured to convert the RGB channels of the video frame to a resolution with a fixed shortest side length, input the video frame with the fixed shortest side length into the first network model, input the output result of the first network model into a region-based convolutional neural network model, calculate an optimal detection category result using the output of the region-based convolutional neural network, and pass the optimal detection category result through a region-of-interest pooling layer to obtain a local behavior feature vector;
a video frame behavior category determination module configured to identify a behavior category of the video frame and a confidence corresponding to the behavior category based on the scene feature vector and the local behavior feature vector, the video frame behavior category determination module comprising:
a video frame feature vector merging module configured to merge the scene feature vector and the local behavior feature vector into a video frame feature vector, and
a behavior category and confidence calculation module configured to input the video frame feature vector into a third network to obtain the behavior category of the video frame and the confidence corresponding to the behavior category, wherein the third network is formed by 4 fully connected layers connected in sequence, followed by a Softmax classifier;
a segment behavior category determination module configured to determine the behavior category of the segment based on the behavior categories and confidences of the video frames of the segment, the segment behavior category determination module taking a behavior category as the behavior category of the segment if the ratio of the number of video frames with that behavior category to the total number of video frames of the segment is greater than a predetermined second threshold; and
a segment merging module configured to merge adjacent segments with the same behavior category to obtain the segmentation result of the video.
3. A computer device comprising a memory, a processor and a computer program stored in the memory and executable by the processor, wherein the processor implements the method of claim 1 when executing the computer program.
4. A computer readable storage medium having stored therein a computer program which, when executed by a processor, implements the method of claim 1.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202110314575.0A CN112906649B (en) | 2018-05-10 | 2018-05-10 | Video segmentation method, device, computer device and medium |
Applications Claiming Priority (2)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202110314575.0A CN112906649B (en) | 2018-05-10 | 2018-05-10 | Video segmentation method, device, computer device and medium |
CN201810443505.3A CN108647641B (en) | 2018-05-10 | 2018-05-10 | Video behavior segmentation method and device based on two-way model fusion |
Related Parent Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201810443505.3A Division CN108647641B (en) | 2018-05-10 | 2018-05-10 | Video behavior segmentation method and device based on two-way model fusion |
Publications (2)
Publication Number | Publication Date |
---|---|
CN112906649A CN112906649A (en) | 2021-06-04 |
CN112906649B true CN112906649B (en) | 2024-05-14 |
Family
ID=63754392
Family Applications (4)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201810443505.3A Active CN108647641B (en) | 2018-05-10 | 2018-05-10 | Video behavior segmentation method and device based on two-way model fusion |
CN202110314575.0A Active CN112906649B (en) | 2018-05-10 | 2018-05-10 | Video segmentation method, device, computer device and medium |
CN202110313073.6A Active CN112836687B (en) | 2018-05-10 | 2018-05-10 | Video behavior segmentation method, device, computer equipment and medium |
CN202110314627.4A Active CN112966646B (en) | 2018-05-10 | 2018-05-10 | Video segmentation method, device, equipment and medium based on two-way model fusion |
Family Applications Before (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201810443505.3A Active CN108647641B (en) | 2018-05-10 | 2018-05-10 | Video behavior segmentation method and device based on two-way model fusion |
Family Applications After (2)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202110313073.6A Active CN112836687B (en) | 2018-05-10 | 2018-05-10 | Video behavior segmentation method, device, computer equipment and medium |
CN202110314627.4A Active CN112966646B (en) | 2018-05-10 | 2018-05-10 | Video segmentation method, device, equipment and medium based on two-way model fusion |
Country Status (1)
Country | Link |
---|---|
CN (4) | CN108647641B (en) |
Families Citing this family (14)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN109543590B (en) * | 2018-11-16 | 2023-04-18 | 中山大学 | Video human behavior recognition algorithm based on behavior association degree fusion characteristics |
CN111327945B (en) * | 2018-12-14 | 2021-03-30 | 北京沃东天骏信息技术有限公司 | Method and apparatus for segmenting video |
CN110516540B (en) * | 2019-07-17 | 2022-04-29 | 青岛科技大学 | Group behavior identification method based on multi-stream architecture and long-term and short-term memory network |
CN110602546A (en) * | 2019-09-06 | 2019-12-20 | Oppo广东移动通信有限公司 | Video generation method, terminal and computer-readable storage medium |
CN110751218B (en) * | 2019-10-22 | 2023-01-06 | Oppo广东移动通信有限公司 | Image classification method, image classification device and terminal equipment |
CN111541912B (en) * | 2020-04-30 | 2022-04-22 | 北京奇艺世纪科技有限公司 | Video splitting method and device, electronic equipment and storage medium |
CN113784226A (en) * | 2020-06-10 | 2021-12-10 | 北京金山云网络技术有限公司 | Video slicing method and device, electronic equipment and storage medium |
CN113784227A (en) * | 2020-06-10 | 2021-12-10 | 北京金山云网络技术有限公司 | Video slicing method and device, electronic equipment and storage medium |
CN111881818B (en) * | 2020-07-27 | 2022-07-22 | 复旦大学 | Medical action fine-grained recognition device and computer-readable storage medium |
CN113569703B (en) * | 2021-07-23 | 2024-04-16 | 上海明略人工智能(集团)有限公司 | Real division point judging method, system, storage medium and electronic equipment |
CN113301430B (en) * | 2021-07-27 | 2021-12-07 | 腾讯科技(深圳)有限公司 | Video clipping method, video clipping device, electronic equipment and storage medium |
CN114998994A (en) * | 2022-06-08 | 2022-09-02 | 南京跑码地计算技术有限公司 | Fine-grained action segmentation method for video |
CN118509668A (en) * | 2023-02-08 | 2024-08-16 | 华为云计算技术有限公司 | Video segmentation method and device |
CN117610105B (en) * | 2023-12-07 | 2024-06-07 | 上海烜翊科技有限公司 | Model view structure design method for automatically generating system design result |
Citations (2)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN102426705A (en) * | 2011-09-30 | 2012-04-25 | 北京航空航天大学 | Behavior splicing method of video scene |
CN106529467A (en) * | 2016-11-07 | 2017-03-22 | 南京邮电大学 | Group behavior identification method based on multi-feature fusion |
Family Cites Families (14)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US7296231B2 (en) * | 2001-08-09 | 2007-11-13 | Eastman Kodak Company | Video structuring by probabilistic merging of video segments |
US20140328570A1 (en) * | 2013-01-09 | 2014-11-06 | Sri International | Identifying, describing, and sharing salient events in images and videos |
US9244924B2 (en) * | 2012-04-23 | 2016-01-26 | Sri International | Classification, search, and retrieval of complex video events |
CN102833492B (en) * | 2012-08-01 | 2016-12-21 | 天津大学 | A kind of video scene dividing method based on color similarity |
CN103366181A (en) * | 2013-06-28 | 2013-10-23 | 安科智慧城市技术(中国)有限公司 | Method and device for identifying scene integrated by multi-feature vision codebook |
EP3007082A1 (en) * | 2014-10-07 | 2016-04-13 | Thomson Licensing | Method for computing a similarity measure for video segments |
CN104331442A (en) * | 2014-10-24 | 2015-02-04 | 华为技术有限公司 | Video classification method and device |
AU2014271236A1 (en) * | 2014-12-02 | 2016-06-16 | Canon Kabushiki Kaisha | Video segmentation method |
CN105989358A (en) * | 2016-01-21 | 2016-10-05 | 中山大学 | Natural scene video identification method |
CN105893936B (en) * | 2016-03-28 | 2019-02-12 | 浙江工业大学 | A Behavior Recognition Method Based on HOIRM and Local Feature Fusion |
CN107590420A (en) * | 2016-07-07 | 2018-01-16 | 北京新岸线网络技术有限公司 | Scene extraction method of key frame and device in video analysis |
CN107027051B (en) * | 2016-07-26 | 2019-11-08 | 中国科学院自动化研究所 | A Video Key Frame Extraction Method Based on Linear Dynamic System |
CN107590442A (en) * | 2017-08-22 | 2018-01-16 | 华中科技大学 | A kind of video semanteme Scene Segmentation based on convolutional neural networks |
CN107992836A (en) * | 2017-12-12 | 2018-05-04 | 中国矿业大学(北京) | A kind of recognition methods of miner's unsafe acts and system |
2018
- 2018-05-10 CN CN201810443505.3A patent/CN108647641B/en active Active
- 2018-05-10 CN CN202110314575.0A patent/CN112906649B/en active Active
- 2018-05-10 CN CN202110313073.6A patent/CN112836687B/en active Active
- 2018-05-10 CN CN202110314627.4A patent/CN112966646B/en active Active
Patent Citations (2)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN102426705A (en) * | 2011-09-30 | 2012-04-25 | 北京航空航天大学 | Behavior splicing method of video scene |
CN106529467A (en) * | 2016-11-07 | 2017-03-22 | 南京邮电大学 | Group behavior identification method based on multi-feature fusion |
Non-Patent Citations (5)
Title |
---|
Probability-based method for boosting human action recognition using scene context; Hong-Bo Zhang et al.; 《IET Computer Vision》; Vol. 10, No. 6; 528-536 *
A method for classifying human behavior in images combining pose and scene; 雷庆; 李绍滋; 陈锻生; 《小型微型计算机系统》; Vol. 36, No. 5; 1098-1103 *
SAR target discrimination algorithm based on a multi-feature fusion bag-of-words model; 宋文青; 王英华; 时荔蕙; 刘宏伟; 保铮; 《电子与信息学报》; Vol. 39, No. 11; 2705-2715 *
Behavior sequence segmentation based on intrinsic dimension and confidence; 熊心雨; 潘伟; 唐超; 《厦门大学学报(自然科学版)》; Vol. 52, No. 4; 479-485 *
Behavior sequence segmentation and recognition in video surveillance; 钱惠敏; 茅耀斌; 王执铨; 叶曙光; 《中国图象图形学报》; Vol. 14, No. 11; 2416-2420 *
Also Published As
Publication number | Publication date |
---|---|
CN112966646A (en) | 2021-06-15 |
CN112906649A (en) | 2021-06-04 |
CN108647641A (en) | 2018-10-12 |
CN112966646B (en) | 2024-01-09 |
CN112836687B (en) | 2024-05-10 |
CN108647641B (en) | 2021-04-27 |
CN112836687A (en) | 2021-05-25 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN112906649B (en) | Video segmentation method, device, computer device and medium | |
KR102462572B1 (en) | Systems and methods for training object classifiers by machine learning | |
JP6192271B2 (en) | Image processing apparatus, image processing method, and program | |
US9418440B2 (en) | Image segmenting apparatus and method | |
JP4746050B2 (en) | Method and system for processing video data | |
CN109684922B (en) | A multi-model recognition method for finished dishes based on convolutional neural network | |
EP3261017A1 (en) | Image processing system to detect objects of interest | |
WO2021135500A1 (en) | Vehicle loss detection model training method and apparatus, vehicle loss detection method and apparatus, and device and medium | |
KR20180065889A (en) | Method and apparatus for detecting target | |
JP4098021B2 (en) | Scene identification method, apparatus, and program | |
CN111539265A (en) | A kind of abnormal behavior detection method in elevator car | |
US20120224789A1 (en) | Noise suppression in low light images | |
CN109934216B (en) | Image processing method, device and computer readable storage medium | |
Haque et al. | A hybrid object detection technique from dynamic background using Gaussian mixture models | |
CN110349119B (en) | Pavement disease detection method and device based on edge detection neural network | |
Sundaram et al. | Object detection and estimation: A hybrid image segmentation technique using convolutional neural network model | |
Vila et al. | Analysis of image informativeness measures | |
Hernandez et al. | Classification of color textures with random field models and neural networks | |
US9367923B2 (en) | Image processing apparatus with improved compression of image data of character images and background images using respective different compressing methods | |
CN115294162A (en) | Target identification method, device, equipment and storage medium | |
Broetto et al. | Heterogeneous feature models and feature selection applied to detection of street lighting lamps types and wattages | |
CN113298102A (en) | Training method and device for target classification model | |
CN115474084B (en) | Method, device, equipment and storage medium for generating video cover image | |
CN116665112B (en) | Tunnel inspection method and device, electronic equipment and storage medium | |
Tichonov et al. | Quality prediction of compressed images via classification |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant | ||
GR01 | Patent grant |