Disclosure of Invention
In view of the above, and to solve the foregoing problems in the prior art, the present invention provides a law enforcement instant evidence fixing method and a law enforcement instrument, which combine picture and voiceprint analysis to ensure the reliability and uniqueness of the evidence produced by the law enforcement instrument.
The invention solves the problems through the following technical means:
In one aspect, the invention provides a law enforcement instant evidence fixing method, which comprises the following steps:
collecting images or audio/video files of a law enforcement scene in the law enforcement process;
adopting a face recognition technology of a VGG-16 convolutional neural network to recognize the face in the image or the audio/video file and generate an image hash value;
recognizing the voiceprint in the image or audio/video file by adopting an FBN-AlexNet small-sample voiceprint recognition technology to generate a voiceprint hash value;
automatically uploading, through a network, five elements, namely the file hash value automatically calculated from the image or audio/video file, the trusted time from a time service center, geographical position information, law enforcement equipment information and law enforcement personnel information, to an evidence storage cloud on which a blockchain is deployed, and simultaneously storing a two-dimensional code that links to the evidence fixing report on the evidence storage cloud; the file hash value comprises the image hash value and the voiceprint hash value.
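As an illustration of this evidence fixing step, the following is a minimal Python sketch of how the five-element record might be assembled. It assumes SHA-256 as the file hash (the patent does not name a specific hash algorithm), and the field names, the trusted-time format and the upload step are hypothetical; the third-party qrcode package is one possible way to encode the returned report link.

    import hashlib
    import json

    def sha256_of_file(path: str) -> str:
        # Stream the recorded file so large audio/video files fit in memory.
        h = hashlib.sha256()
        with open(path, "rb") as f:
            for chunk in iter(lambda: f.read(8192), b""):
                h.update(chunk)
        return h.hexdigest()

    def build_evidence_record(path, trusted_time, gps, device_info, officer_info):
        # Assemble the five elements described above (field names are illustrative).
        return json.dumps({
            "file_hash": sha256_of_file(path),
            "trusted_time": trusted_time,   # from the time service center
            "location": gps,
            "device_info": device_info,
            "officer_info": officer_info,
        })

    # The record would then be uploaded to the blockchain evidence storage cloud,
    # and the returned report URL stored as a QR code, e.g. with the `qrcode` package:
    #   import qrcode
    #   qrcode.make(report_url).save("evidence_report_qr.png")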
Preferably, the recognizing the face in the image or the audio/video file by using the face recognition technology of the VGG-16 convolutional neural network specifically comprises:
constructing a VGG-16 convolutional neural network model, and training the VGG-16 convolutional neural network model;
acquiring a face picture in an image or audio/video file;
preprocessing the face picture, including face detection and face alignment correction;
extracting face features with the trained VGG-16 convolutional neural network model: after feature extraction, nonlinear mapping and feature dimension reduction through the five convolutional units of the VGG-16 model, the features are screened by three fully connected layers to further reduce the feature dimension; finally, the face is recognized by a classifier.
Preferably, the AdaBoost + Haar feature method is adopted for face detection, and affine transformation is adopted for face alignment correction.
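A minimal sketch of this preferred detection and alignment scheme follows, using the boosted Haar-cascade detector shipped with OpenCV (a detector in the AdaBoost family referenced above) and a rotation-based affine correction; the eye coordinates are assumed to come from a separate landmark step not shown here.

    import cv2
    import numpy as np

    # OpenCV's pretrained Haar cascade (boosted Haar features).
    face_cascade = cv2.CascadeClassifier(
        cv2.data.haarcascades + "haarcascade_frontalface_default.xml")

    def detect_faces(bgr_image):
        gray = cv2.cvtColor(bgr_image, cv2.COLOR_BGR2GRAY)
        # Returns an (x, y, w, h) box for each detected face.
        return face_cascade.detectMultiScale(gray, scaleFactor=1.1, minNeighbors=5)

    def align_face(bgr_image, left_eye, right_eye):
        # Affine alignment: rotate the image so the eye line becomes horizontal.
        dx = right_eye[0] - left_eye[0]
        dy = right_eye[1] - left_eye[1]
        angle = np.degrees(np.arctan2(dy, dx))
        center = ((left_eye[0] + right_eye[0]) / 2.0,
                  (left_eye[1] + right_eye[1]) / 2.0)
        rot = cv2.getRotationMatrix2D(center, angle, 1.0)
        h, w = bgr_image.shape[:2]
        return cv2.warpAffine(bgr_image, rot, (w, h))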
Preferably, the building of the VGG-16 convolutional neural network model and the training of the VGG-16 convolutional neural network model comprise:
acquiring a certain number of face pictures and preprocessing them for use as the training data set of the VGG-16 convolutional neural network model; the preprocessing comprises data augmentation, face detection, alignment and cropping, data format conversion and picture mean calculation;
and constructing the VGG-16 convolutional neural network model and training it with the training data set, wherein the training comprises network layer modification, network parameter modification and network model training.
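A hedged PyTorch sketch of what network layer modification, network parameter modification and network model training could look like for VGG-16: the last fully connected layer is swapped for the number of enrolled identities and the convolutional stages are frozen. NUM_IDENTITIES and the hyperparameters are placeholders, not values from the patent.

    import torch
    import torch.nn as nn
    from torchvision import models

    NUM_IDENTITIES = 1000  # placeholder: number of face identities in the training set

    # Network layer modification: replace the final fully connected layer of a
    # standard VGG-16 so its outputs match the enrolled identities.
    model = models.vgg16(weights=models.VGG16_Weights.IMAGENET1K_V1)
    model.classifier[6] = nn.Linear(4096, NUM_IDENTITIES)

    # Network parameter modification: freeze the convolutional feature extractor
    # and fine-tune only the fully connected head.
    for p in model.features.parameters():
        p.requires_grad = False

    criterion = nn.CrossEntropyLoss()
    optimizer = torch.optim.SGD(
        (p for p in model.parameters() if p.requires_grad), lr=1e-3, momentum=0.9)

    def train_step(images, labels):
        # One step of network model training on a mini-batch of
        # preprocessed, mean-subtracted face crops.
        optimizer.zero_grad()
        loss = criterion(model(images), labels)
        loss.backward()
        optimizer.step()
        return loss.item()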
Preferably, the recognizing of the voiceprint in the image or audio/video file by using the FBN-AlexNet small-sample voiceprint recognition technology specifically includes:
inputting an original voice signal of a small sample to obtain a spectrogram;
adopting an image augmentation algorithm based on the convex lens imaging principle, and obtaining more training data by changing the size of the spectrogram;
training an FBN-AlexNet network model with the voiceprint data, wherein the training comprises extracting voiceprint features with the convolutional layers, accelerating network convergence with the FBN, reducing computational complexity with the pooling layers, and performing voiceprint classification with the fully connected layers;
and acquiring voiceprint data in the image or audio/video file, and recognizing the voiceprint data with the trained FBN-AlexNet network model.
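The following Python sketch illustrates the spectrogram input and the size-changing augmentation. The patent does not specify the convex-lens-imaging transform, so plain rescaling to several sizes stands in for it here; the window lengths and scale factors are illustrative.

    import numpy as np
    import cv2
    from scipy import signal
    from scipy.io import wavfile

    def wav_to_spectrogram(path):
        # Log-magnitude short-time spectrogram of the raw speech signal.
        rate, audio = wavfile.read(path)
        if audio.ndim > 1:
            audio = audio.mean(axis=1)          # mix stereo down to mono
        _, _, sxx = signal.spectrogram(audio, fs=rate, nperseg=512, noverlap=256)
        return np.log(sxx + 1e-10).astype(np.float32)

    def rescale_augment(spec, scales=(0.8, 0.9, 1.1, 1.25)):
        # Stand-in for the convex-lens-imaging augmentation: enlarge the small
        # training set by resampling the spectrogram to several sizes.
        h, w = spec.shape
        return [cv2.resize(spec, (int(w * s), int(h * s))) for s in scales]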
In another aspect, the present invention provides a law enforcement instrument comprising:
the image audio/video acquisition module is used for acquiring images or audio/video files of a law enforcement scene in the law enforcement process;
the image hash value generation module is used for recognizing the face in the image or audio/video file by adopting the face recognition technology of the VGG-16 convolutional neural network, so as to generate an image hash value;
the voiceprint hash value generation module is used for recognizing the voiceprint in the image or audio/video file by adopting the FBN-AlexNet small-sample voiceprint recognition technology, so as to generate a voiceprint hash value;
the two-dimensional code link generation module is used for automatically uploading, through a network, the five elements, namely the file hash value automatically calculated from the image or audio/video file, the trusted time from a time service center, geographical position information, law enforcement equipment information and law enforcement personnel information, to an evidence storage cloud on which a blockchain is deployed, and for storing a two-dimensional code that links to the evidence fixing report on the evidence storage cloud; the file hash value comprises the image hash value and the voiceprint hash value.
Preferably, the image hash value generation module includes:
the neural network model training unit is used for constructing the VGG-16 convolutional neural network model and training the VGG-16 convolutional neural network model;
the face picture acquisition unit is used for acquiring a face picture in an image or audio/video file;
the image preprocessing unit is used for preprocessing the face picture, the preprocessing including face detection and face alignment correction;
the face recognition unit is used for extracting face features with the trained VGG-16 convolutional neural network model: after feature extraction, nonlinear mapping and feature dimension reduction through the five convolutional units of the VGG-16 model, the features are screened by three fully connected layers to further reduce the feature dimension; finally, the face is recognized by a classifier.
Preferably, the AdaBoost + Haar feature method is adopted for face detection, and affine transformation is adopted for face alignment correction.
Preferably, the neural network model training unit includes:
the training data set acquisition subunit is used for acquiring a certain number of face pictures and preprocessing them for use as the training data set of the VGG-16 convolutional neural network model; the preprocessing comprises data augmentation, face detection, alignment and cropping, data format conversion and picture mean calculation;
and the neural network model training subunit is used for constructing the VGG-16 convolutional neural network model and training it with the training data set, wherein the training comprises network layer modification, network parameter modification and network model training.
Preferably, the voiceprint hash value generation module includes:
the small sample input unit is used for inputting an original voice signal of a small sample to obtain a spectrogram;
the training data acquisition module is used for obtaining more training data by changing the size of the spectrogram, adopting an image augmentation algorithm based on the convex lens imaging principle;
the FBN network model training module is used for training an FBN-AlexNet network model with the voiceprint data, wherein the training comprises extracting voiceprint features with the convolutional layers, accelerating network convergence with the FBN, reducing computational complexity with the pooling layers, and performing voiceprint classification with the fully connected layers;
and the voiceprint recognition module is used for acquiring voiceprint data in the image or audio/video file and recognizing the voiceprint data with the trained FBN-AlexNet network model.
Compared with the prior art, the invention has at least the following beneficial effects:
1) When the law enforcement instrument records, it can intelligently obtain various information closely related to the law enforcement scene, such as the hash value of the recorded electronic file, the trusted time from a time service center, geographical location information, law enforcement equipment information and law enforcement personnel information. Using this information, evidence is fixed immediately after the recording of the law enforcement instrument is finished, completing the leap from an ordinary electronic file to electronic evidence that meets legal requirements.
2) The reliability and uniqueness of the evidence collected by the law enforcement instrument in the law enforcement process are ensured by the combined analysis of pictures and voiceprints.
Detailed Description
In order to make the aforementioned objects, features and advantages of the present invention comprehensible, embodiments are described in detail below with reference to the accompanying figures. It should be noted that the described embodiments are only some of the embodiments of the present invention, not all of them; all other embodiments obtained by those skilled in the art, without inventive work, based on the embodiments of the present invention fall within the protection scope of the present invention.
Example 1
As shown in fig. 1, the present invention provides a law enforcement instant evidence fixing method, comprising the following steps:
S1, collecting images or audio/video files of the law enforcement scene in the law enforcement process. The invention adopts a high-definition video acquisition and processing system based on the Zynq-7000 platform; by deeply analyzing the basic characteristics of the Zynq-7000 platform and the overall framework of high-definition video acquisition and processing, it realizes high-definition CMOS image acquisition, image preprocessing on the FPGA, and video image caching.
S2, recognizing the face in the image or audio/video file by adopting a face recognition technology of a VGG-16 convolutional neural network to generate an image hash value;
S3, recognizing the voiceprint in the image or audio/video file by adopting the FBN-AlexNet small-sample voiceprint recognition technology to generate a voiceprint hash value;
S4, automatically uploading, through a network, the five elements, namely the file hash value automatically calculated from the image or audio/video file, the trusted time from a time service center, geographical location information, law enforcement equipment information and law enforcement personnel information, to an evidence storage cloud on which a blockchain is deployed, and simultaneously storing a two-dimensional code that links to the evidence fixing report on the evidence storage cloud; the file hash value comprises the image hash value and the voiceprint hash value.
As shown in fig. 2, in step S2, the recognizing the face in the image or the audio/video file by using the face recognition technology of the VGG-16 convolutional neural network specifically includes:
S21, constructing a VGG-16 convolutional neural network model, and training the VGG-16 convolutional neural network model;
S22, acquiring a face picture in the image or audio/video file;
S23, preprocessing the face picture, including face detection and face alignment correction; preferably, the AdaBoost + Haar feature method is adopted for face detection, and affine transformation is adopted for face alignment correction;
S24, extracting face features with the trained VGG-16 convolutional neural network model: after feature extraction, nonlinear mapping and feature dimension reduction through the five convolutional units of the VGG-16 model, the features are screened by three fully connected layers to further reduce the feature dimension; finally, the face is recognized by a Softmax classifier.
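A short PyTorch sketch of step S24's final classification, assuming `model` is the fine-tuned VGG-16 produced in step S21; returning the Softmax confidence alongside the identity is an illustrative choice, not a detail from the patent.

    import torch
    import torch.nn.functional as F

    @torch.no_grad()
    def identify(model, face_batch):
        # Forward preprocessed face crops through the trained VGG-16 and pick
        # the identity with the highest Softmax probability.
        model.eval()
        probs = F.softmax(model(face_batch), dim=1)
        conf, ident = probs.max(dim=1)
        return ident, conf  # predicted identity indices and their confidences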
In step S21, constructing and training the VGG-16 convolutional neural network model includes:
S211, acquiring a certain number of face pictures and preprocessing them for use as the training data set of the VGG-16 convolutional neural network model; the preprocessing comprises data augmentation, face detection, alignment and cropping, data format conversion and picture mean calculation;
S212, constructing the VGG-16 convolutional neural network model and training it with the training data set, wherein the training comprises network layer modification, network parameter modification and network model training.
Voiceprint recognition is divided into two techniques: speaker identification and speaker verification. In either technique, the voiceprint of the speaker is collected, digitized and modeled. Once voiceprints of the general public have been collected as a full reference set, a voiceprint sample from a case-related sound source can be automatically compared against that set, instantly locking onto the true identity of a suspect. Because voiceprints can be sampled and recognized remotely, the technique has unrivaled natural advantages in investigating non-contact cases. Voiceprint recognition currently faces the following challenges. First, the time-varying nature of speech affects recognition: speech is less stable than biometric characteristics such as faces and fingerprints, and a person's voice can change with factors such as voice-change periods, pathological changes, trauma, recording conditions and speech environments, reducing its stability. Second, cross-channel collection affects recognition: sound sources and channels are diverse (recording pens, telephones, VoIP, microphone pickups, and so on), different collection channels use different audio codecs, and the analog-to-digital conversion process more or less degrades the sound. Third, technologies such as replay attacks and TTS affect recognition. When a deep-learning voiceprint recognition model is trained on a large amount of speech data, it can automatically learn rich acoustic features (spectrum, pitch, formants and the like), overcoming these challenges to a certain extent.
In addition, because the network has many layers and a huge number of parameters, training is time-consuming and prone to overfitting; a Fast Batch Normalization (FBN) method is therefore proposed to accelerate network convergence when training the FBN-AlexNet network.
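The FBN method is the patent's own variant and its internals are not described here; the sketch below uses PyTorch's standard BatchNorm2d as a stand-in to show where normalization sits in an AlexNet-style convolutional stage and why it speeds convergence.

    import torch.nn as nn

    # First AlexNet-style stage for single-channel spectrogram input.
    conv_stage = nn.Sequential(
        nn.Conv2d(1, 96, kernel_size=11, stride=4),  # convolution extracts voiceprint features
        nn.BatchNorm2d(96),   # normalizes activations, stabilizing and speeding training
        nn.ReLU(inplace=True),
        nn.MaxPool2d(kernel_size=3, stride=2),       # pooling reduces computation
    )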
As shown in fig. 3, in step S3, the recognizing of the voiceprint in the image or audio/video file by using the FBN-AlexNet small-sample voiceprint recognition technology specifically includes:
S31, inputting the original voice signal of the small sample to obtain a spectrogram; before the audio is used to train or test the model, the voice signal is first divided into frames, exploiting the short-time stationarity of speech (see the framing sketch after step S34);
S32, obtaining more training data by changing the size of the spectrogram, adopting an image augmentation algorithm based on the convex lens imaging principle;
S33, training an FBN-AlexNet network model with the voiceprint data, wherein the training comprises extracting voiceprint features with the convolutional layers, accelerating network convergence with the FBN, reducing computational complexity with the pooling layers, and performing voiceprint classification with the fully connected layers;
and S34, acquiring voiceprint data in the image or audio/video file, and recognizing the voiceprint data with the trained FBN-AlexNet network model.
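As mentioned in step S31, the voice signal is divided into short frames before spectral analysis. A minimal sketch follows, assuming 16 kHz audio; the 25 ms window and 10 ms hop are common defaults, not values from the patent.

    import numpy as np

    def frame_signal(x, frame_len=400, hop=160):
        # Split speech into overlapping short-time frames (25 ms windows with a
        # 10 ms hop at 16 kHz), exploiting short-time stationarity; assumes
        # len(x) >= frame_len. A Hamming window tapers each frame.
        n = 1 + (len(x) - frame_len) // hop
        frames = np.stack([x[i * hop: i * hop + frame_len] for i in range(n)])
        return frames * np.hamming(frame_len)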
Example 2
As shown in fig. 4, the present invention provides a law enforcement instrument, which includes an image audio/video acquisition module, an image hash value generation module, a voiceprint hash value generation module, and a two-dimensional code link generation module;
the image audio/video acquisition module is used for acquiring images or audio/video files of the law enforcement scene in the law enforcement process; the invention adopts a high-definition video acquisition and processing system based on the Zynq-7000 platform, and, by deeply analyzing the basic characteristics of the Zynq-7000 platform and the overall framework of high-definition video acquisition and processing, realizes high-definition CMOS image acquisition, image preprocessing on the FPGA, and video image caching;
the image hash value generation module is used for recognizing the face in the image or audio/video file by adopting the face recognition technology of the VGG-16 convolutional neural network, so as to generate an image hash value;
the voiceprint hash value generation module is used for recognizing the voiceprint in the image or audio/video file by adopting the FBN-AlexNet small-sample voiceprint recognition technology, so as to generate a voiceprint hash value;
the two-dimensional code link generation module is used for automatically uploading, through a network, the five elements, namely the file hash value automatically calculated from the image or audio/video file, the trusted time from a time service center, geographical location information, law enforcement equipment information and law enforcement personnel information, to an evidence storage cloud on which a blockchain is deployed, and for storing a two-dimensional code that links to the evidence fixing report on the evidence storage cloud; the file hash value comprises the image hash value and the voiceprint hash value.
As shown in fig. 5, the image hash value generation module includes a neural network model training unit, a face image obtaining unit, an image preprocessing unit, and a face recognition unit;
the neural network model training unit is used for constructing a VGG-16 convolutional neural network model and training the VGG-16 convolutional neural network model;
the face picture acquisition unit is used for acquiring a face picture in an image or audio/video file;
the image preprocessing unit is used for preprocessing the face picture, the preprocessing including face detection and face alignment correction; preferably, the AdaBoost + Haar feature method is adopted for face detection, and affine transformation is adopted for face alignment correction.
The face recognition unit is used for extracting face features with the trained VGG-16 convolutional neural network model: after feature extraction, nonlinear mapping and feature dimension reduction through the five convolutional units of the VGG-16 model, the features are screened by three fully connected layers to further reduce the feature dimension; finally, the face is recognized by a classifier.
Specifically, the neural network model training unit comprises a training data set acquisition subunit and a neural network model training subunit;
the training data set acquisition subunit is used for acquiring a certain number of face pictures and preprocessing them for use as the training data set of the VGG-16 convolutional neural network model; the preprocessing comprises data augmentation, face detection, alignment and cropping, data format conversion and picture mean calculation;
the neural network model training subunit is used for constructing the VGG-16 convolutional neural network model and training it with the training data set, wherein the training comprises network layer modification, network parameter modification and network model training.
Voiceprint recognition is divided into two techniques: speaker identification and speaker verification. In either technique, the voiceprint of the speaker is collected, digitized and modeled. Once voiceprints of the general public have been collected as a full reference set, a voiceprint sample from a case-related sound source can be automatically compared against that set, instantly locking onto the true identity of a suspect. Because voiceprints can be sampled and recognized remotely, the technique has unrivaled natural advantages in investigating non-contact cases. Voiceprint recognition currently faces the following challenges. First, the time-varying nature of speech affects recognition: speech is less stable than biometric characteristics such as faces and fingerprints, and a person's voice can change with factors such as voice-change periods, pathological changes, trauma, recording conditions and speech environments, reducing its stability. Second, cross-channel collection affects recognition: sound sources and channels are diverse (recording pens, telephones, VoIP, microphone pickups, and so on), different collection channels use different audio codecs, and the analog-to-digital conversion process more or less degrades the sound. Third, technologies such as replay attacks and TTS affect recognition. When a deep-learning voiceprint recognition model is trained on a large amount of speech data, it can automatically learn rich acoustic features (spectrum, pitch, formants and the like), overcoming these challenges to a certain extent.
In addition, because the network has many layers and a huge number of parameters, training is time-consuming and prone to overfitting; a Fast Batch Normalization (FBN) method is therefore proposed to accelerate network convergence when training the FBN-AlexNet network, as sketched in Example 1.
As shown in fig. 6, the voiceprint hash value generation module includes a small sample input unit, a training data acquisition module, an FBN network model training module, and a voiceprint recognition module;
the small sample input unit is used for inputting the original voice signal of the small sample to obtain a spectrogram; before the audio is used to train or test the model, the voice signal is first divided into frames, exploiting the short-time stationarity of speech;
the training data acquisition module is used for obtaining more training data by changing the size of the spectrogram, adopting an image augmentation algorithm based on the convex lens imaging principle;
the FBN network model training module is used for training an FBN-AlexNet network model with the voiceprint data, wherein the training comprises extracting voiceprint features with the convolutional layers, accelerating network convergence with the FBN, reducing computational complexity with the pooling layers, and performing voiceprint classification with the fully connected layers;
the voiceprint recognition module is used for acquiring voiceprint data in the image or audio/video file and recognizing the voiceprint data with the trained FBN-AlexNet network model.
The above embodiments express only several implementations of the present invention, and their description is relatively specific and detailed, but they should not be construed as limiting the scope of the invention. It should be noted that a person skilled in the art can make several variations and modifications without departing from the inventive concept, and these fall within the protection scope of the present invention. Therefore, the protection scope of this patent shall be subject to the appended claims.