
CN110490060B - Security protection front-end video equipment based on machine learning hardware architecture - Google Patents


Info

Publication number
CN110490060B
CN110490060B (application CN201910621068.4A)
Authority
CN
China
Prior art keywords
video
target
feature
local
depth
Prior art date
Legal status
Active
Application number
CN201910621068.4A
Other languages
Chinese (zh)
Other versions
CN110490060A (en)
Inventor
寇京珅
Current Assignee
Terminus Beijing Technology Co Ltd
Original Assignee
Terminus Beijing Technology Co Ltd
Priority date
Filing date
Publication date
Application filed by Terminus Beijing Technology Co Ltd filed Critical Terminus Beijing Technology Co Ltd
Priority to CN201910621068.4A priority Critical patent/CN110490060B/en
Publication of CN110490060A publication Critical patent/CN110490060A/en
Application granted granted Critical
Publication of CN110490060B publication Critical patent/CN110490060B/en


Classifications

    • G06N 3/045: Combinations of networks (computing arrangements based on biological models; neural networks; architecture)
    • G06N 3/08: Learning methods
    • G06V 20/48: Matching video sequences
    • G06V 20/52: Surveillance or monitoring of activities, e.g. for recognising suspicious objects
    • H04N 7/18: Closed-circuit television [CCTV] systems, i.e. systems in which the video signal is not broadcast
    • G06F 2218/08: Feature extraction (aspects of pattern recognition specially adapted for signal processing)
    • G06F 2218/12: Classification; Matching


Abstract

The invention discloses a security front-end video device based on a machine learning hardware architecture. Built on a machine learning framework, the device can automatically extract the person target that matches a designated target from the video frame and track it, thereby improving the imaging quality of that target. Because it applies machine learning under a neural network architecture, the device adapts to the time-varying appearance of a person target across different video frames; the machine learning recognition result directly drives the front-end camera equipment, which improves response speed and reduces delay.

Description

Security protection front-end video equipment based on machine learning hardware architecture
Technical Field
The invention relates to the technical field of intelligent security, and in particular to a security front-end video device based on a machine learning hardware architecture.
Background
In application scenarios such as smart cities, smart buildings and smart communities, security video surveillance serves as an infrastructure component and plays an increasingly important role.
A security video surveillance system is generally divided into a front end and a back end, with uplink and downlink data carried between them over a cellular network, a wired network, coaxial cable or various Internet of Things links. The front-end equipment, a video camera and a pan-tilt head, is responsible for capturing and uploading surveillance video for the back end's servers, television walls and other components to archive, analyse and display; the front-end equipment can also adjust its shooting direction according to instructions issued by the back end, changing the framing range of the surveillance picture.
With the development of software and hardware technologies, and especially as intelligent techniques such as image analysis, target extraction and scene recognition have matured, the functions that security video surveillance can provide have diversified, expanding from pure monitoring into tracking, alarming and other capabilities. Full automation has become the development trend, markedly reducing dependence on human observation and remote control.
However, research, development and innovation in intelligent security video surveillance have concentrated on back-end frameworks and algorithms, while the front end has retained the traditional capture-and-upload role; this is far from sufficient. First, the more powerful the intelligent analysis at the back end, the higher the demand on video imaging quality, which requires real-time optimisation of factors such as focus sharpness and imaging brightness. Giving the front-end equipment autonomous adjustment capability is essential to achieving this: relying on remote instructions from the back end not only increases the communication load but clearly cannot keep up in real time. Moreover, the number of front-end surveillance video devices is growing rapidly and deployment points are ever denser, so having the back end process all of this massive data is very difficult; the operations related to autonomous adjustment are therefore best completed by the front-end equipment itself. Designing security front-end video equipment with an intelligent architecture has thus become an urgent task.
Disclosure of Invention
(I) Objects of the invention
In view of the needs of the prior art, the present invention provides a security front-end video device based on a machine learning hardware architecture. Built on a machine learning framework, the device can automatically extract the person target that matches a designated target from the video frame and track it, improving the imaging quality of that target and forming a virtuous circle.
(II) Technical scheme
The security front-end video device of the invention comprises: a video camera device, a networked communication device, a video analysis device, a driving interface device and a three-axis rotating pan-tilt head. The video camera device captures video frames within its field of view. The networked communication device obtains the video frames from the video camera device, uploads them to a back-end control centre, and receives designated-target feature information from the control centre. The video analysis device obtains the video frames from the video camera device and the designated-target feature information from the networked communication device; according to that information, it judges whether a person target satisfying the designated-target feature information exists in the current frame and, if so, determines the target's position in the frame relative to the frame centre. The driving interface device calculates, from the relative position determined by the video analysis device, the displacement required to image the person target at the frame centre, and outputs a driving signal accordingly. The three-axis rotating pan-tilt head rotates according to the driving signal to adjust the field of view of the video camera device.
Preferably, the video analysis device specifically comprises: a video frame acquisition and enhancement module, a target local search module, a depth feature extraction neural network module and a fusion classification module. The video frame acquisition and enhancement module acquires frames from the video camera device frame by frame and applies filtering and colour-enhancement preprocessing. The target local search module traverses the whole frame with a template of preset scale at a fixed step, extracts a feature vector within the template's coverage using a local maximum pooling algorithm, and then reduces the dimensionality of the whole feature vector; it matches the feature vector contained in the designated-target feature information against the extracted local feature vectors, and when the two vectors match it takes the local video region covered by the template as a candidate person target region. The depth feature extraction neural network module extracts depth features from the candidate person target region. The fusion classification module obtains the local feature vector and the depth feature information of the candidate person target region, fuses them, and judges, using the depth features under a supervised training and learning mechanism, whether the candidate person target region belongs to the designated person target.
Preferably, the dimensionality reduction in the target local search module comprises: slicing the feature vector data into blocks of a preset size and then computing the origin moment and central moment of each data block, thereby reducing the dimensionality of the whole feature vector and obtaining the local feature vector of the video frame.
Preferably, the target local search module matches the feature vector contained in the designated-target feature information against an extracted local feature vector as follows: compute the Hamming distance between the two feature vectors, convert it into a matching degree, and preset a confidence threshold for the matching degree; the two feature vectors are considered matched when their matching degree exceeds the confidence threshold.
Preferably, the depth feature extraction neural network module performs depth feature extraction and pooling on the image with a deep residual convolutional neural network: the input candidate person target region is processed in turn by each convolutional layer and maximum pooling layer, each convolutional layer convolving the image region to produce a feature map and each maximum pooling layer pooling the corresponding convolutional layer's output by the maximum-value principle to generate a pooled feature map; the feature map produced by the last pooling layer is taken as the depth feature.
Preferably, the fusion classification module obtains the local feature vector and the depth feature information of the candidate person target region, fuses them, applies fully-connected nonlinear activation through several linearly weighted fully-connected layers, and finally lets the classifier decide whether the candidate person target region belongs to the designated person target.
Preferably, the designated-target feature information received by the networked communication device from the control centre is generated as follows: traverse the person target picture region designated by security staff in a video frame with a template of preset scale at a fixed step, and extract a feature vector from that region within the template's coverage using a local maximum pooling algorithm; then reduce the dimensionality of the whole feature vector. The dimensionality reduction slices the feature vector data into blocks of a preset size, computes the origin moment and central moment of each block, and combines the origin moments and central moments of all blocks to form the designated-target feature information, thereby reducing the dimensionality of the whole feature vector.
Preferably, the fusion classification module represents the total feature information obtained by fusing the local feature vector and the depth feature information as

T_R = <T_L, T_D>

where T_R is the fused total feature information, T_L is the local feature vector and T_D is the depth feature information; the fused total feature information is input into multi-layer fully-connected layers, which apply nonlinear activation and normalization, and the generated feature vector is substituted into a classifier.

Preferably, the nonlinear activation function of the multi-layer fully-connected layers is expressed as

z = h(W_f · T_R + b_f)

where W_f denotes the weight of each fully-connected layer and b_f denotes the bias vector.

Preferably, the multi-layer fully-connected layers normalise the feature vector z produced by the fully-connected nonlinear activation as

z' = (z - μ_z) / σ_z

where μ_z and σ_z denote the mean and variance of the feature vector z, respectively.
(III) Advantageous effects
The invention has the following beneficial effects. Based on the machine learning principle under a neural network framework, the invention adapts to the time-varying appearance of a person target across different video frames. By fusing the designated target's feature information with depth feature information extracted by a deep neural network, it recognises person targets with a high success rate and accuracy: verified on more than 38,000 video frames containing over 1,000 pedestrians, experiments show a success rate above 97.5%. The invention further provides a hardware architecture in which the machine learning recognition result directly drives the front-end camera equipment, improving response speed and reducing delay.
Drawings
The embodiments described below with reference to the drawings are exemplary, intended to explain and illustrate the present invention, and should not be construed as limiting its scope.
FIG. 1 is a schematic structural diagram of a security front-end video device disclosed by the present invention;
FIG. 2 is a schematic structural diagram of a video analysis apparatus in the security front-end video device disclosed in the present invention;
FIG. 3 is a schematic diagram of the depth feature extraction neural network module in the security front-end video device disclosed by the present invention.
Detailed Description
To make the objects, technical solutions and advantages of the present invention clearer, the technical solutions in the embodiments of the present invention are described in more detail below with reference to the accompanying drawings.
It should be noted that the embodiments described are only some, not all, embodiments of the present invention, and that features of the embodiments in the present application may be combined with one another where no conflict arises. All other embodiments obtained by a person skilled in the art from the given embodiments without creative effort fall within the protection scope of the present invention.
The invention provides security front-end video equipment that can be deployed in indoor and outdoor security monitoring spaces. The equipment has a hardware architecture based on machine learning: it can accurately identify and lock onto a person target in the captured video, then automatically adjust the shooting angle to track that specific target continuously and keep it in the central region of the frame, so that photometry and focusing parameters can be set accurately and the imaging of the person is guaranteed.
Specifically, as shown in FIG. 1, the security front-end video device of the present invention comprises: a video camera device, a networked communication device, a video analysis device, a driving interface device and a three-axis rotating pan-tilt head.
The video camera device captures video frames within its field of view. To guarantee the visual quality of the surveillance video and supply good image quality to downstream applications such as person identification and evidence collection, the video frames, and in particular the image of the locked person target within them, should be sharp. Choosing correct exposure and focusing parameters with respect to the person target is therefore critical: with wrong exposure the person appears too dark or too bright and key information such as facial features and clothing cannot be recognised, while with wrong focus the person image blurs and recognisability suffers just as badly. From the standpoint of setting proper exposure and focus parameters, it is advantageous to keep the person target imaged in the central region of the frame as much as possible. First, the exposure parameters of a video camera are determined by metering the frame brightness during framing, and a typical camera supports metering either the average brightness of the whole frame or the average brightness of its central region. Seen against the whole frame, the person target occupies only a local area; if exposure is set from the whole-frame average, it is influenced by the many regions outside the target and easily becomes wrong for the person. For example, when the person target area is dark while other areas of the frame are bright and the contrast is large, exposure parameters determined from the whole-frame average lead to incorrect exposure of the person. Conversely, once the person target is imaged in the central region, the camera can meter the centre instead and set exposure to match the actual light level of the person. Second, imaging the person target in the central region also supports fast, accurate focusing and avoids imaging distortion. Finally, from the tracking standpoint, keeping the person at the frame centre best prevents the person from moving out of the shooting field of view.
The networked communication device obtains video frames from the video camera device, uploads them to the back-end control centre, and receives designated-target feature information from the control centre. Based on 4G, 5G or other Internet of Things communication technologies, it can upload the frames obtained from the video camera device to the control centre in real time. The control centre can display the video on PCs, television walls and similar equipment for security staff to review. If a security worker decides from the video that a certain person target is worth locking onto and tracking continuously, the worker can designate that target by clicking, with a mouse or similar tool, the picture area where the person appears. The control centre then derives the designated-target feature information and sends it to the corresponding security front-end video device, which receives it through the networked communication device.
The control centre obtains the designated-target feature information as follows: traverse the person target picture region designated by the security worker in the video frame with a template of preset scale at a fixed step, and extract a feature vector from that region within the template's coverage using a local maximum pooling algorithm; then reduce the dimensionality of the whole feature vector. The dimensionality reduction slices the feature vector data into blocks of a preset size, computes the origin moment and central moment of each block, and combines the origin moments and central moments of all blocks to form the designated-target feature information, thereby reducing the dimensionality of the whole feature vector.
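As a concrete illustration, the following NumPy sketch implements one plausible reading of this step. The template, step and block sizes, and the choice of first-order origin moment (the mean) and second-order central moment (the variance) per block, are assumptions, since the patent fixes neither the sizes nor the moment orders.

```python
import numpy as np

def local_max_pool_features(region, template=16, step=8):
    """Slide a template-sized window across the designated person region at a
    fixed step and keep the local maximum of each window (local maximum
    pooling), concatenated into one feature vector."""
    h, w = region.shape[:2]
    feats = [region[y:y + template, x:x + template].max()
             for y in range(0, h - template + 1, step)
             for x in range(0, w - template + 1, step)]
    return np.asarray(feats, dtype=np.float32)

def moment_reduce(vec, block=8):
    """Slice the feature vector into blocks of a preset size and keep, per
    block, an origin moment and a central moment (assumed here to be the
    mean and the variance), combined into the designated-target feature
    information."""
    n = (len(vec) // block) * block
    blocks = vec[:n].reshape(-1, block)
    origin = blocks.mean(axis=1)                              # 1st-order origin moment
    central = ((blocks - origin[:, None]) ** 2).mean(axis=1)  # 2nd-order central moment
    return np.concatenate([origin, central])

# e.g.: target_info = moment_reduce(local_max_pool_features(person_region))
```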
The video analysis device obtains the video frames from the video camera device and the designated-target feature information from the networked communication device; according to that information, it judges whether a person target satisfying the designated-target feature information exists in the current frame and, if so, determines the person target's position in the frame relative to the frame centre. How the video analysis device makes this judgement is described in detail below.
The driving interface device calculates, from the relative position determined by the video analysis device, the displacement required to image the person target at the frame centre, and outputs a driving signal according to that displacement.
The three-axis rotating pan-tilt head rotates according to the driving signal and adjusts the field of view of the video camera device, so that the person target stays in the central region of the frame.
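As a simple illustration of the displacement computation, the sketch below converts the target's pixel offset from the frame centre into pan and tilt correction angles. The field-of-view values and the small-angle linear mapping are assumptions; the patent does not specify how the displacement maps to drive signals.

```python
def pan_tilt_correction(target_xy, frame_wh, hfov_deg=60.0, vfov_deg=34.0):
    """Map the person target's pixel offset from the frame centre to the pan
    and tilt angles (degrees) the pan-tilt head should rotate through to
    re-centre the target. The FOV values here are illustrative assumptions."""
    cx, cy = frame_wh[0] / 2.0, frame_wh[1] / 2.0
    dx, dy = target_xy[0] - cx, target_xy[1] - cy
    pan = dx / frame_wh[0] * hfov_deg     # small-angle linear approximation
    tilt = dy / frame_wh[1] * vfov_deg
    return pan, tilt

# e.g. a target at (1200, 400) in a 1920x1080 frame:
# pan, tilt = pan_tilt_correction((1200, 400), (1920, 1080))
```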
Using machine learning with a trained neural network, the video analysis device judges, on the basis of the designated-target feature information, whether a matching person target exists in the video frame and, if so, determines that target's position relative to the frame centre. The structure and operation of the video analysis device are described in detail below with reference to FIG. 2.
As shown in FIG. 2, the video analysis device specifically comprises: a video frame acquisition and enhancement module, a target local search module, a depth feature extraction neural network module and a fusion classification module.
The video frame acquisition and enhancement module acquires frames from the video camera device frame by frame and applies filtering and colour-enhancement preprocessing.
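The patent does not name the concrete filtering or colour-enhancement operators; the sketch below shows one plausible combination (Gaussian denoising plus CLAHE contrast enhancement on the luminance channel) using OpenCV.

```python
import cv2

def preprocess(frame):
    """Denoise the frame, then enhance contrast/colour by applying CLAHE to
    the luminance channel of its YCrCb representation. The operator choice
    is an assumption; the patent only says 'filtering and colour enhancement'."""
    smoothed = cv2.GaussianBlur(frame, (5, 5), 0)
    ycrcb = cv2.cvtColor(smoothed, cv2.COLOR_BGR2YCrCb)
    clahe = cv2.createCLAHE(clipLimit=2.0, tileGridSize=(8, 8))
    ycrcb[:, :, 0] = clahe.apply(ycrcb[:, :, 0])
    return cv2.cvtColor(ycrcb, cv2.COLOR_YCrCb2BGR)
```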
The target local search module traverses the whole video frame with a template of preset scale at a fixed step, extracts a feature vector within the template's coverage using the local maximum pooling algorithm, and then reduces the dimensionality of the whole feature vector; the dimensionality reduction slices the feature vector data into blocks of a preset size and computes the origin moment and central moment of each block, yielding the local feature vector of the video frame. The module then matches the feature vector contained in the designated-target feature information against each extracted local feature vector: it computes the Hamming distance between the two vectors, converts that distance into a matching degree, and compares it against a preset confidence threshold; when the matching degree exceeds the threshold, the two vectors are considered matched and the part of the frame covered by the template is taken as a candidate person target region.
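The sketch below shows one way this Hamming-distance matching could look. The median-based binarisation (the patent does not say how real-valued vectors become bit strings) and the 0.8 threshold are assumptions.

```python
import numpy as np

def matching_degree(a, b):
    """Binarise both feature vectors at their own medians (an assumed
    quantisation step), compute the Hamming distance, and convert it to a
    matching degree in [0, 1], where 1 means the bit strings are identical."""
    bits_a, bits_b = a > np.median(a), b > np.median(b)
    hamming = np.count_nonzero(bits_a != bits_b)
    return 1.0 - hamming / len(a)

def is_match(target_vec, local_vec, confidence=0.8):
    """Matched when the matching degree exceeds the preset confidence threshold."""
    return matching_degree(target_vec, local_vec) > confidence
```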
Through this feature vector matching, the target local search module determines one or more candidate person target regions in the current frame according to the designated-target feature information. However, even for the same person, image features change considerably between frames because of differences in shooting angle and imaging conditions. To keep recognition successful and reliable, on the one hand the confidence threshold must be set sensibly: the matching-degree requirement should not be too strict, so the matching stays robust enough to still yield candidate regions from the video frames. On the other hand, depth features are then extracted from the candidate person target regions by the depth feature extraction neural network module, and a supervised training and learning mechanism uses those depth features to judge whether each candidate region belongs to the designated person target.
The depth feature extraction neural network module performs depth feature extraction and pooling on the image with a deep residual convolutional neural network. Specifically, each candidate person target region is resized to a preset size and fed into the network; the network adopts a ResNet5 model, and the input candidate region is processed in turn by each convolutional layer and maximum pooling layer, with the dimensionality reduced step by step, producing the reduced-dimension depth feature of the candidate region. As shown in FIG. 3, the module consists of alternating convolutional and maximum pooling layers: the first convolutional layer convolves the image region to produce feature map F1; the first maximum pooling layer pools F1 by the maximum-value principle to produce pooled feature map C1; the second convolutional layer then yields F2 and the second pooling layer C2; and so on, reducing dimensionality stage by stage, until the feature map Cn after the last pooling layer is taken as the depth feature.
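A minimal PyTorch sketch of this alternating convolution/pooling pipeline follows. The channel widths and number of stages are illustrative, since the patent names a "ResNet5" residual model without giving its layer configuration, and the residual shortcuts are omitted here for brevity.

```python
import torch
import torch.nn as nn

class DepthFeatureNet(nn.Module):
    """Alternating convolution and max-pooling stages as in FIG. 3: each
    convolution yields a feature map Fi, each max-pooling layer reduces it
    to Ci, and the last pooled map Cn is taken as the depth feature."""
    def __init__(self, in_ch=3, widths=(32, 64, 128)):
        super().__init__()
        layers, ch = [], in_ch
        for w in widths:
            layers += [nn.Conv2d(ch, w, kernel_size=3, padding=1),  # Fi
                       nn.ReLU(inplace=True),
                       nn.MaxPool2d(kernel_size=2)]                 # Ci
            ch = w
        self.features = nn.Sequential(*layers)

    def forward(self, x):          # x: candidate region resized to a preset size
        return self.features(x)    # last pooled feature map = depth feature

# e.g.: depth_feat = DepthFeatureNet()(torch.randn(1, 3, 64, 64))
```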
The fusion classification module obtains the local feature vector and the depth feature information of the candidate person target region, fuses them, applies fully-connected nonlinear activation through several linearly weighted fully-connected layers, and finally lets the classifier decide whether the candidate person target region belongs to the designated person target. Specifically, the total feature information obtained by fusing the local feature vector and the depth feature information is represented as

T_R = <T_L, T_D>

where T_R is the fused total feature information, T_L is the local feature vector and T_D is the depth feature information. The fused total feature information is input into multi-layer fully-connected layers, whose nonlinear activation function is expressed as

z = h(W_f · T_R + b_f)

where W_f denotes the weight of each fully-connected layer and b_f denotes the bias vector. The feature vector z produced by the fully-connected nonlinear activation is then normalised as

z' = (z - μ_z) / σ_z

where μ_z and σ_z denote the mean and variance of the feature vector z, respectively. The normalised feature vector z' is substituted into the classifier, which may be an SVM classifier; the trained classifier judges whether the candidate person target region belongs to the designated person target. The judgement of the fusion classification module serves as the video analysis device's decision on whether the candidate person target in the video frame belongs to the person target of the designated-target feature information; when the judgement is positive, the video analysis device can determine the relative position of the person target in the frame with respect to the frame centre.
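Putting these formulas together, the NumPy sketch below traces the fusion classification pass. The tanh activation and the trained `weights`, `biases` and `svm` (e.g. an sklearn SVC fitted during the supervised training stage) are assumed inputs, not specified by the patent.

```python
import numpy as np

def fuse_and_classify(t_l, t_d, weights, biases, svm, h=np.tanh):
    """Sketch of the fusion classification pass: T_R = <T_L, T_D> (simple
    concatenation), z = h(W_f . T_R + b_f) through each fully-connected
    layer, the (z - mu_z) / sigma_z normalisation, and a final decision by
    a trained classifier."""
    z = np.concatenate([t_l, t_d])          # fused total feature information T_R
    for W_f, b_f in zip(weights, biases):   # multi-layer fully-connected layers
        z = h(W_f @ z + b_f)                # fully-connected nonlinear activation
    z_norm = (z - z.mean()) / z.std()       # z' = (z - mu_z) / sigma_z
    return svm.predict(z_norm.reshape(1, -1))[0]  # 1: designated person target
```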
Based on the machine learning principle under a neural network framework, the invention adapts to the time-varying appearance of a person target across different video frames. By fusing the designated target's feature information with depth feature information extracted by a deep neural network, it recognises person targets with a high success rate and accuracy: verified on more than 38,000 video frames containing over 1,000 pedestrians, experiments show a success rate above 97.5%. The invention further provides a hardware architecture in which the machine learning recognition result directly drives the front-end camera equipment, improving response speed and reducing delay.
The above description covers only specific embodiments of the present invention, but the protection scope of the present invention is not limited thereto; any change or substitution that a person skilled in the art could readily conceive within the technical scope of the present invention falls within that scope. The protection scope of the present invention is therefore defined by the appended claims.

Claims (6)

1. A security front-end video device based on a machine learning hardware architecture, characterised by comprising: a video camera device, a networked communication device, a video analysis device, a driving interface device and a three-axis rotating pan-tilt head; the video camera device is used for capturing video frames within its field of view; the networked communication device is used for obtaining the video frames from the video camera device, uploading them to a back-end control centre, and receiving designated-target feature information from the control centre; the video analysis device is used for obtaining the video frames from the video camera device and the designated-target feature information from the networked communication device, judging according to that information whether a person target satisfying the designated-target feature information exists in the current frame and, where it exists, determining the position of the person target in the frame relative to the frame centre; the driving interface device calculates, from the relative position determined by the video analysis device, the displacement required to image the person target at the frame centre and outputs a driving signal according to that displacement; the three-axis rotating pan-tilt head rotates according to the driving signal to adjust the field of view of the video camera device;
the designated-target feature information received by the networked communication device from the control centre is generated as follows: traversing the person target picture region designated by security staff in a video frame with a template of preset scale at a fixed step, and extracting a feature vector from that region within the template's coverage using a local maximum pooling algorithm; then reducing the dimensionality of the whole feature vector, the dimensionality reduction slicing the feature vector data into blocks of a preset size, computing the origin moment and central moment of each block, and combining the origin moments and central moments of all blocks to form the designated-target feature information, thereby reducing the dimensionality of the whole feature vector;
the video analysis device specifically comprises: a video frame acquisition and enhancement module, a target local search module, a depth feature extraction neural network module and a fusion classification module; the video frame acquisition and enhancement module acquires frames from the video camera device frame by frame and applies filtering and colour-enhancement preprocessing; the target local search module traverses the whole frame with a template of preset scale at a fixed step, extracts a feature vector within the template's coverage using a local maximum pooling algorithm, then reduces the dimensionality of the whole feature vector, matches the feature vector contained in the designated-target feature information against the extracted local feature vectors and, when the two feature vectors match, takes the local video region covered by the template as a candidate person target region; the depth feature extraction neural network module extracts depth features from the candidate person target region; the fusion classification module obtains the local feature vector and the depth feature information of the candidate person target region, fuses them, and judges, using the depth features under a supervised training and learning mechanism, whether the candidate person target region belongs to the designated person target;
the depth feature extraction neural network module performs depth feature extraction and pooling on the image with a deep residual convolutional neural network: the input candidate person target region is processed in turn by each convolutional layer and maximum pooling layer, each convolutional layer convolving the image region to produce a feature map and each maximum pooling layer pooling the corresponding convolutional layer's output by the maximum-value principle to generate a pooled feature map, the feature map produced by the last pooling layer being taken as the depth feature;
the fusion classification module obtains the local feature vector and the depth feature information of the candidate person target region, fuses them, applies fully-connected nonlinear activation through several linearly weighted fully-connected layers, and finally lets a classifier judge whether the candidate person target region belongs to the designated person target.
2. The security front-end video device based on a machine learning hardware architecture of claim 1, wherein the dimensionality reduction of the target local search module comprises: slicing the feature vector data into blocks of a preset size and then computing the origin moment and central moment of each data block, thereby reducing the dimensionality of the whole feature vector and obtaining the local feature vector of the video frame.
3. The security front-end video device based on a machine learning hardware architecture of claim 2, wherein the target local search module matches the feature vector contained in the designated-target feature information against an extracted local feature vector by: computing the Hamming distance between the two feature vectors, converting it into a matching degree, and presetting a confidence threshold for the matching degree, the two feature vectors being considered matched when their matching degree exceeds the confidence threshold.
4. The security front-end video device based on a machine learning hardware architecture of claim 1, wherein the fusion classification module represents the total feature information obtained by fusing the local feature vector and the depth feature information as

T_R = <T_L, T_D>

where T_R is the fused total feature information, T_L is the local feature vector and T_D is the depth feature information; the fused total feature information is input into multi-layer fully-connected layers, which apply nonlinear activation and normalisation, and the generated feature vector is substituted into a classifier.

5. The security front-end video device based on a machine learning hardware architecture of claim 4, wherein the nonlinear activation function of the multi-layer fully-connected layers is expressed as

z = h(W_f · T_R + b_f)

where W_f denotes the weight of each fully-connected layer and b_f denotes the bias vector.

6. The security front-end video device based on a machine learning hardware architecture of claim 5, wherein the multi-layer fully-connected layers normalise the feature vector z produced by the fully-connected nonlinear activation as

z' = (z - μ_z) / σ_z

where μ_z and σ_z denote the mean and variance of the feature vector z, respectively.
CN201910621068.4A 2019-07-10 2019-07-10 Security protection front-end video equipment based on machine learning hardware architecture Active CN110490060B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201910621068.4A CN110490060B (en) 2019-07-10 2019-07-10 Security protection front-end video equipment based on machine learning hardware architecture

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201910621068.4A CN110490060B (en) 2019-07-10 2019-07-10 Security protection front-end video equipment based on machine learning hardware architecture

Publications (2)

Publication Number Publication Date
CN110490060A CN110490060A (en) 2019-11-22
CN110490060B true CN110490060B (en) 2020-09-11

Family

ID=68545944

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201910621068.4A Active CN110490060B (en) 2019-07-10 2019-07-10 Security protection front-end video equipment based on machine learning hardware architecture

Country Status (1)

Country Link
CN (1) CN110490060B (en)


Family Cites Families (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20170213080A1 (en) * 2015-11-19 2017-07-27 Intelli-Vision Methods and systems for automatically and accurately detecting human bodies in videos and/or images
US20190130189A1 (en) * 2017-10-30 2019-05-02 Qualcomm Incorporated Suppressing duplicated bounding boxes from object detection in a video analytics system

Patent Citations (13)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN105118082A (en) * 2015-07-30 2015-12-02 科大讯飞股份有限公司 Personalized video generation method and system
CN105718887A (en) * 2016-01-21 2016-06-29 惠州Tcl移动通信有限公司 Shooting method and shooting system capable of realizing dynamic capturing of human faces based on mobile terminal
WO2018215861A1 (en) * 2017-05-24 2018-11-29 Kpit Technologies Limited System and method for pedestrian detection
CN108280418A (en) * 2017-12-12 2018-07-13 北京深醒科技有限公司 The deception recognition methods of face image and device
CN108415937A (en) * 2018-01-24 2018-08-17 博云视觉(北京)科技有限公司 A kind of method and apparatus of image retrieval
CN108229444A (en) * 2018-02-09 2018-06-29 天津师范大学 A kind of pedestrian's recognition methods again based on whole and local depth characteristic fusion
CN109101865A (en) * 2018-05-31 2018-12-28 湖北工业大学 A kind of recognition methods again of the pedestrian based on deep learning
CN109034044A (en) * 2018-06-14 2018-12-18 天津师范大学 A kind of pedestrian's recognition methods again based on fusion convolutional neural networks
CN109063559A (en) * 2018-06-28 2018-12-21 东南大学 A kind of pedestrian detection method returned based on improvement region
CN108960140A (en) * 2018-07-04 2018-12-07 国家新闻出版广电总局广播科学研究院 The pedestrian's recognition methods again extracted and merged based on multi-region feature
CN109325967A (en) * 2018-09-14 2019-02-12 腾讯科技(深圳)有限公司 Method for tracking target, device, medium and equipment
CN109816012A (en) * 2019-01-22 2019-05-28 南京邮电大学 A multi-scale object detection method fused with context information
CN109934177A (en) * 2019-03-15 2019-06-25 艾特城信息科技有限公司 Pedestrian recognition methods, system and computer readable storage medium again

Non-Patent Citations (7)

* Cited by examiner, † Cited by third party
Title
Integration of Deep Features and Hand-Crafted Features for Person Re-identification; Sutong Zheng et al.; Proceedings of the IEEE International Conference on Multimedia and Expo Workshops (ICMEW) 2017; 20171231; 674-679 *
Pedestrian Detection with Deep Convolutional Neural Network; Xiaogang Chen et al.; ACCV 2014 Workshops; 20151231; Section 1, Section 2, FIG. 1, FIG. 3 *
Person Re-identification in the Wild; Liang Zheng et al.; 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR); 20171109; 1367-1376 *
Refine Pedestrian Detections by Referring to Features in Different Ways; Jaemyung Lee et al.; 2017 IEEE Intelligent Vehicles Symposium (IV); 20171231; 418-423 *
Person re-identification research based on multi-layer deep feature fusion; 张丽红 et al.; Journal of Test and Measurement Technology (《测试技术学报》); 20181231; Vol. 32, No. 4; 318-322 *
Person re-identification based on a feature fusion network; 种衍杰 et al.; Computer Systems & Applications (《计算机系统应用》); 20181226; Vol. 28, No. 1; 127-133 *
Person re-identification algorithm with multi-scale local feature selection; 徐家臻 et al.; Computer Engineering and Applications (《计算机工程与应用》); 20190321; Vol. 56, No. 2; 141-145 *

Also Published As

Publication number Publication date
CN110490060A (en) 2019-11-22

Similar Documents

Publication Publication Date Title
CN109145759B (en) Vehicle attribute identification method, device, server and storage medium
CN110084165B (en) Intelligent identification and early warning method for abnormal events in open scene of power field based on edge calculation
CN108615226B (en) An Image Dehazing Method Based on Generative Adversarial Networks
CN109887040A (en) Active sensing method and system of moving target for video surveillance
US20220148292A1 (en) Method for glass detection in real scenes
CN110490907A (en) Motion target tracking method based on multiple target feature and improvement correlation filter
CN118570312B (en) A multi-camera collaborative calibration method and application for dynamic vision sensors
CN109934108A (en) A multi-target and multi-type vehicle detection and ranging system and implementation method
CN111931654A (en) Intelligent monitoring method, system and device for personnel tracking
CN105184229A (en) Online learning based real-time pedestrian detection method in dynamic scene
CN110796580B (en) Intelligent traffic system management method and related products
CN103888731A (en) Structured description device and system for mixed video monitoring by means of gun-type camera and dome camera
CN117237844A (en) Firework detection method based on YOLOV8 and fusing global information
CN119672613B (en) A surveillance video information intelligent processing system based on cloud computing
Babu et al. Development and performance evaluation of enhanced image dehazing method using deep learning networks
CN113628251B (en) Smart hotel terminal monitoring method
CN119723421A (en) A method for low-altitude target recognition and real-time tracking in AI video based on deep learning
CN114067273A (en) Night airport terminal thermal imaging remarkable human body segmentation detection method
CN110490060B (en) Security protection front-end video equipment based on machine learning hardware architecture
CN116824641B (en) Gesture classification method, device, equipment and computer storage medium
CN116862832A (en) A method for positioning workers based on three-dimensional real-life models
CN114708544A (en) Intelligent violation monitoring helmet based on edge calculation and monitoring method thereof
CN120068002B (en) Adaptive image processing method and system based on annihilation neural network
CN119579456B (en) An automatic image defogging method based on artificial intelligence
CN116503406B (en) Water conservancy project information management system based on big data

Legal Events

PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant