CN116524560A - Speaker recognition method, apparatus, device, storage medium and program product
- Publication number: CN116524560A
- Application number: CN202310374045.4A
- Authority: CN (China)
- Legal status: Pending
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06V—IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
- G06V40/00—Recognition of biometric, human-related or animal-related patterns in image or video data
- G06V40/10—Human or animal bodies, e.g. vehicle occupants or pedestrians; Body parts, e.g. hands
- G06V40/16—Human faces, e.g. facial parts, sketches or expressions
- G06V40/161—Detection; Localisation; Normalisation
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06V—IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
- G06V10/00—Arrangements for image or video recognition or understanding
- G06V10/70—Arrangements for image or video recognition or understanding using pattern recognition or machine learning
- G06V10/74—Image or video pattern matching; Proximity measures in feature spaces
- G06V10/761—Proximity, similarity or dissimilarity measures
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06V—IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
- G06V10/00—Arrangements for image or video recognition or understanding
- G06V10/70—Arrangements for image or video recognition or understanding using pattern recognition or machine learning
- G06V10/84—Arrangements for image or video recognition or understanding using pattern recognition or machine learning using probabilistic graphical models from image or video features, e.g. Markov models or Bayesian networks
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06V—IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
- G06V20/00—Scenes; Scene-specific elements
- G06V20/40—Scenes; Scene-specific elements in video content
- G06V20/41—Higher-level, semantic clustering, classification or understanding of video scenes, e.g. detection, labelling or Markovian modelling of sport events or news items
Abstract
The present application relates to a speaker recognition method, apparatus, device, storage medium and program product. The method comprises: obtaining the face image track of each person in a target scene according to the current video frame of the target scene and a preset number of historical video frames before the current video frame; obtaining the speaking probability of each person according to that person's face image track; and determining the speaker in the current video frame of the target scene according to the speaking probabilities. The method improves recognition efficiency when identifying the speaker in a video scene.
Description
Technical Field
The present invention relates to the field of image recognition technology, and in particular, to a speaker recognition method, apparatus, device, storage medium, and program product.
Background
With the development of information technology, human-machine interaction, teleconferencing, voiceprint recognition and related technologies have become active areas of research.
Taking a multi-person video scene as an example, identifying the current speaker is an important technique for ensuring a good video experience. In the related art, the mouth region of each person in the video scene is typically analyzed to identify who is speaking.
However, the related art suffers from low recognition efficiency when identifying the speaker in a video scene.
Disclosure of Invention
In view of the foregoing, it is desirable to provide a speaker recognition method, apparatus, device, storage medium, and program product that improve recognition efficiency when recognizing a speaker in a video scene.
In a first aspect, the present application provides a speaker recognition method, the method comprising:
acquiring a face image track of each person in the target scene according to the current video frame of the target scene and a preset number of historical video frames before the current video frame;
acquiring the speaking probability of each person according to the face image track of each person;
and determining the speaker in the current video frame of the target scene according to the speaking probability of each person.
In one embodiment, acquiring a face image track of each person in the target scene according to a current video frame of the target scene and a preset number of historical video frames before the current video frame includes:
face detection is respectively carried out on the current video frame and the historical video frame, so that a plurality of face images of each person in the target scene are obtained;
And acquiring the human face image track of each person based on the multiple human face images of each person in the target scene.
In one embodiment, face detection is performed on a current video frame and a historical video frame respectively to obtain a plurality of face images of each person in a target scene, including:
respectively inputting the current video frame and the historical video frames into a preset face detection model, and obtaining a plurality of candidate face images through the face detection model;
and determining a plurality of face images of each person in the target scene according to the plurality of candidate face images.
In one embodiment, determining a plurality of face images for each person in the target scene from the plurality of candidate face images includes:
determining a plurality of adjacent video frames in the current video frame and the historical video frame according to the time sequence of the video frames;
according to the similarity between candidate face images in each pair of adjacent video frames, determining the same face image in each pair of adjacent video frames; each identical face image represents the same person;
and determining a plurality of face images of each person in the target scene according to the same face image in each adjacent video frame.
In one embodiment, determining the same face image in each adjacent video frame based on the similarity between candidate face images in each adjacent video frame comprises:
for any pair of adjacent video frames, acquiring the region overlapping ratio and the image similarity between each first candidate face image in the first video frame of the adjacent video frames and each second candidate face image in the second video frame;
and determining the same face image in the adjacent video frames according to the region overlapping ratio and the image similarity between each first candidate face image in the first video frames and each second candidate face image in the second video frames.
In one embodiment, determining the same face image in adjacent video frames according to the region overlap ratio and the image similarity between each first candidate face image in the first video frame and each second candidate face image in the second video frame includes:
for any group of the first candidate face image and the second candidate face image, if the region overlapping ratio between the first candidate face image and the second candidate face image is larger than a preset overlapping ratio threshold value and the image similarity between the first candidate face image and the second candidate face image is larger than a preset similarity threshold value, determining that the first candidate face image and the second candidate face image are the same face image.
In one embodiment, the acquiring the speaking probability of each person according to the face image track of each person includes:
And respectively inputting the face image track of each person to a preset speaking action recognition model, and obtaining the speaking probability of each person through the speaking action recognition model.
In one embodiment, determining a speaker in a current video frame of a target scene according to a probability of speaking for each person includes:
according to the speaking probability of each person, determining candidate speaking persons in the target scene;
and for any candidate speaker, if the person has been a candidate speaker in each of a first preset number of consecutive identifications before the current moment, determining the candidate speaker as the speaker in the current video frame of the target scene.
In one embodiment, the method further comprises:
taking the face position of the speaker in the current video frame as the center, cropping the region surrounding the speaker's face image to obtain a cropped region;
highlighting the cropped region.
In one embodiment, the method further comprises:
and if the speaker does not speak for a second preset number of times, ending the highlighting of the speaker.
In a second aspect, the present application also provides a speaker recognition apparatus, the apparatus comprising:
The first acquisition module is used for acquiring the face image track of each person in the target scene according to the current video frame of the target scene and the historical video frames of the preset number before the current video frame;
the second acquisition module is used for acquiring the speaking probability of each person according to the face image track of each person;
and the determining module is used for determining the speaking person in the current video frame of the target scene according to the speaking probability of each person.
In a third aspect, embodiments of the present application provide a computer device comprising a memory and a processor, the memory storing a computer program, the processor implementing the steps of the method provided by any of the embodiments of the first aspect described above when the computer program is executed.
In a fourth aspect, embodiments of the present application provide a computer-readable storage medium having stored thereon a computer program which, when executed by a processor, implements the steps of the method provided by any of the embodiments of the first aspect described above.
In a fifth aspect, embodiments of the present application also provide a computer program product comprising a computer program which, when executed by a processor, implements the steps of the method provided by any of the embodiments of the first aspect described above.
According to the speaker recognition method, apparatus, device, storage medium and program product, the face image track of each person in the target scene is obtained from the current video frame of the target scene and a preset number of historical video frames before it, the speaking probability of each person is obtained from that person's face image track, and the speaker in the current video frame of the target scene is then determined from the speaking probabilities. Because the speaking probability is determined directly from each person's face image track, without separately analyzing each person's mouth shape, analysis time is shortened and recognition efficiency when identifying the speaker in a video scene is improved.
Drawings
FIG. 1 is a diagram of an application environment for a speaker recognition method in one embodiment;
FIG. 2 is a flow chart of a speaker recognition method in one embodiment;
FIG. 3a is a schematic diagram of a face in a speaking state in one embodiment;
FIG. 3b is a schematic diagram of a face in a non-speaking state in one embodiment;
FIG. 4 is a flow chart of a speaker recognition method according to another embodiment;
FIG. 5 is a flow chart of a speaker recognition method according to another embodiment;
FIG. 6 is a schematic representation of a face image of a speaker recognition method in one embodiment;
FIG. 7 is a flow chart of a speaker recognition method according to another embodiment;
FIG. 8 is a flow chart of a speaker recognition method according to another embodiment;
FIG. 9 is a flow chart of a speaker recognition method according to another embodiment;
FIG. 10 is a flow chart of a speaker recognition method according to another embodiment;
FIG. 11 is a schematic illustration of a highlighted image of a speaker in one embodiment;
FIG. 12 is a flow chart of a speaker recognition method according to another embodiment;
FIG. 13 is a block diagram of a speaker recognition device in one embodiment;
fig. 14 is an internal structural diagram of a computer device in one embodiment.
Detailed Description
In order to make the objects, technical solutions and advantages of the present application more apparent, the present application will be further described in detail with reference to the accompanying drawings and examples. It should be understood that the specific embodiments described herein are for purposes of illustration only and are not intended to limit the present application.
The speaker recognition method provided by the embodiments of the application can be applied to the application environment shown in fig. 1, in which the acquisition device 102 communicates with the processing device 104 over a network. The acquisition device 102 may be, but is not limited to, a device with a camera function.
It should be noted that the acquisition device 102 and the processing device 104 may be integrated in the same device; a device having both the image capturing function and the processing function may then implement the method directly.
The following describes the speaker recognition method with the processing device 104 as the execution subject.
In one embodiment, as shown in fig. 2, a speaker recognition method is provided, comprising the steps of:
S201, acquiring the face image track of each person in the target scene according to the current video frame of the target scene and the historical video frames of the preset number before the current video frame.
The target scene is a scene in which at least one person is currently present, for example a conference room or a classroom in which a plurality of persons attend a conference; at any moment, one person may be speaking, another listening, another taking notes with head lowered, and so on.
In the target scene, the voice of a speaker can only be used to roughly localize the direction from which the sound comes; it cannot unambiguously identify the specific speaker. Therefore, the face image track of each person in the target scene can be acquired, and the speaker in the target scene can be determined directly from each person's face image track.
When the talker in the target scene at the current moment is determined, the current video frame in the target scene and the historical video frames of the preset number before the current video frame are acquired, and the face image track of each person in the target scene is determined according to the current video frame in the target scene and the historical video frames of the preset number before the current video frame.
The current video frame is an image in a target scene acquired at the current moment, and the historical video frame is an image acquired at a historical moment before the current moment.
Optionally, acquiring all images in a preset historical time before the current moment, acquiring a preset number of images from all images according to a preset sampling frequency, and taking the preset number of images as historical video frames; the continuous preset number of video frames collected before the current moment can be directly obtained as historical video frames.
For example, the total number of current and historical video frames may be 8, in which case the image tensor formed from these frames may be 8×3×160×160.
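By way of illustration only (this sketch is not part of the patent disclosure), the following Python code maintains such a rolling window of frames and stacks the current frame together with the preceding historical frames into an 8×3×160×160 tensor; the frame size and window length are the example values above.

```python
from collections import deque
import numpy as np

class FrameBuffer:
    """Rolling buffer holding the current frame plus recent history."""

    def __init__(self, num_frames: int = 8, size: int = 160):
        self.num_frames = num_frames
        self.size = size
        self.frames = deque(maxlen=num_frames)

    def push(self, frame_hwc: np.ndarray) -> None:
        # frame_hwc: H x W x 3 image, already resized to size x size.
        assert frame_hwc.shape == (self.size, self.size, 3)
        self.frames.append(frame_hwc)

    def tensor(self) -> np.ndarray:
        """Return a (num_frames, 3, size, size) tensor, oldest frame first."""
        if len(self.frames) < self.num_frames:
            raise ValueError("not enough history yet")
        stacked = np.stack(list(self.frames))    # (8, 160, 160, 3)
        return stacked.transpose(0, 3, 1, 2)     # (8, 3, 160, 160)
```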
After the current video frame of the target scene and the preset number of historical video frames before it are obtained, the face image track of each person in the target scene can be determined through a preset track model: the current video frame and the preset number of historical video frames are input into the track model, which analyzes them to obtain the face image track of each person.
S202, according to the face image track of each person, the speaking probability of each person is obtained.
The speaking probability is the probability of whether each person is speaking; in one embodiment, the face image track of each person is input to a preset speaking motion recognition model, and the speaking probability of each person is obtained through the speaking motion recognition model.
The speaking action recognition model may be obtained by training a deep convolutional neural network, or by training an open-source video action model, for example the X3D action recognition model or the Temporal Shift Module (TSM); different network structures can be adopted for platforms with different computational power.
The speaking action recognition model is trained with a cross entropy loss function; the cross entropy loss between the predicted values and the labeled values is calculated, and the parameters of the speaking action recognition model are updated accordingly:

$$\mathrm{Logloss} = -\frac{1}{N}\sum_{i=1}^{N}\left[\, y_i \log p_i + (1 - y_i)\log(1 - p_i) \,\right]$$

where Logloss is the cross entropy loss value, $y_i$ is the labeled value, $p_i$ is the prediction of the speaking action recognition model, and $N$ is the number of images.
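As a minimal numeric illustration of this loss (a sketch, not the patent's training code; the clipping epsilon is an added numerical-stability detail):

```python
import numpy as np

def cross_entropy_loss(y_true: np.ndarray, p_pred: np.ndarray,
                       eps: float = 1e-7) -> float:
    """Binary cross entropy between labels y_i and predictions p_i."""
    p = np.clip(p_pred, eps, 1.0 - eps)  # avoid log(0); stability detail
    return float(-np.mean(y_true * np.log(p) + (1 - y_true) * np.log(1 - p)))

# Example: 4 labeled face-track samples (1 = speaking, 0 = not speaking).
y = np.array([1.0, 0.0, 1.0, 0.0])
p = np.array([0.9, 0.2, 0.7, 0.1])
print(cross_entropy_loss(y, p))  # ~0.198
```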
Optionally, before training the speaking action recognition model, a large number of face sequences in the speaking state and in the non-speaking state may be collected and labeled, and the model is then trained on these labeled face sequences. As shown in fig. 3a and fig. 3b, fig. 3a is a typical face sequence in the speaking state, and fig. 3b is a typical face sequence in the non-speaking state.
S203, determining the talker in the current video frame of the target scene according to the talking probability of each person.
In one embodiment, for any person, if that person's speaking probability is greater than a preset speaking threshold, the person is determined to be the speaker in the current video frame of the target scene. For example, if the speaking threshold is 0.6 and a person's speaking probability is 0.8, that person is determined to be the speaker in the current video frame of the target scene.
Optionally, there may be one or more speakers in the current video frame of the target scene. For example, if the speaking threshold is 0.6, the target scene includes 5 persons, and their speaking probabilities are 0.7, 0.2, 0.5, 0.8 and 0.3 respectively, then the persons with speaking probabilities 0.7 and 0.8 are determined to be the speakers in the current video frame of the target scene.
In another embodiment, the person with the largest speaking probability among all persons in the target scene may be determined as the speaker in the current video frame of the target scene; for example, if the target scene includes 5 persons with speaking probabilities 0.7, 0.2, 0.5, 0.8 and 0.3, the person with speaking probability 0.8 is determined to be the speaker in the current video frame of the target scene.
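Both selection rules just described can be sketched as follows (the function names are illustrative, not from the patent; the probabilities are the example values above):

```python
def speakers_by_threshold(probs: dict[str, float], thr: float = 0.6) -> list[str]:
    """All persons whose speaking probability exceeds the threshold."""
    return [pid for pid, p in probs.items() if p > thr]

def speaker_by_max(probs: dict[str, float]) -> str:
    """The single person with the highest speaking probability."""
    return max(probs, key=probs.get)

probs = {"p1": 0.7, "p2": 0.2, "p3": 0.5, "p4": 0.8, "p5": 0.3}
print(speakers_by_threshold(probs))  # ['p1', 'p4']
print(speaker_by_max(probs))         # 'p4'
```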
According to the above speaker recognition method, the face image track of each person in the target scene is obtained from the current video frame and a preset number of historical video frames before it, the speaking probability of each person is obtained from that person's face image track, and the speaker in the current video frame is then determined from the speaking probabilities. Because the speaking probability is determined directly from each person's face image track, without analyzing each person's mouth shape, analysis time is shortened and recognition efficiency when identifying the speaker in a video scene is improved.
In one embodiment, as shown in fig. 4, the method for obtaining the face image track of each person in the target scene according to the current video frame of the target scene and the historical video frames of the preset number before the current video frame includes the following steps:
S401, face detection is carried out on the current video frame and the historical video frames respectively, and a plurality of face images of each person in the target scene are obtained.
And respectively carrying out face detection on the current video frame and the historical video frame according to a preset face detection algorithm to obtain a face image of each video frame.
At this point a face image is obtained for each video frame, but it is not yet known which face images across the video frames belong to the same person; therefore, the face images of each person in the target scene must still be determined from the per-frame face images.
Optionally, the face images of all the video frames are input into a face recognition model, the same face image in the video frames is determined through analysis of the face recognition model, and the same face image is determined to be the face image of the same person, so that a plurality of face images of each person in the target scene are determined.
S402, acquiring a face image track of each person based on a plurality of face images of each person in the target scene.
Each face image carries shooting time, so that the face image track of each person can be determined according to the shooting time corresponding to the face images of each person in the target scene.
According to the speaker recognition method, face detection is conducted on the current video frame and the historical video frame respectively, a plurality of face images of each person in a target scene are obtained, and face image tracks of each person are obtained based on the plurality of face images of each person in the target scene. In the method, the face image track of each person in the current video frame is determined by a face detection method, and whether the person speaks or not is identified by the face image track of the person, so that the identification efficiency is improved.
In the following, an embodiment of how to obtain a plurality of face images of each person in a target scene is described, in one embodiment, as shown in fig. 5, face detection is performed on a current video frame and a historical video frame, so as to obtain a plurality of face images of each person in the target scene, including the following steps:
S501, respectively inputting the current video frame and the historical video frames into a preset face detection model, and obtaining a plurality of candidate face images through the face detection model.
Wherein the face detection model is preset.
Alternatively, face detection may be implemented based on an open-source deep convolutional neural network, such as the Multi-task Cascaded Convolutional Networks (MTCNN) or the RetinaFace face detector. Optionally, an initial face detection model is pre-trained on a large-scale public face dataset and then fine-tuned on a private face dataset for a specific region, resulting in a face detection model that performs better in that region while retaining good generalization.
First, a public face dataset, for example the WIDER FACE dataset, is obtained; blurred, occluded and large-angle faces are removed through manual annotation according to the characteristics of the scene; training is then carried out based on the RetinaFace model, yielding pre-trained weights, i.e. the initial face detection model.
Then, on the basis of the pre-trained weights, the model is fine-tuned and optimized for the facial characteristics of each region. For example, for faces of a given region, a face dataset of that region in the corresponding scene is specially collected, and region-specific network weights are trained from each region's dataset, yielding a face detection model for each region. As shown in fig. 6, fig. 6 (a) shows the persons in the current video frame of the target scene, and fig. 6 (b) shows their face images obtained by running the face detection model corresponding to Asian regions.
When performing face detection on the current video frame and the historical video frames, the region of origin of the persons in the frames can first be determined, the face detection model is selected according to that region, and face detection is then performed on the frames based on the selected model.
Specifically, the current video frame and the historical video frame are respectively input into a face detection model, the current video frame and the historical video frame are analyzed through the face detection model, a plurality of face images are output, and the face images are used as a plurality of candidate face images.
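As an illustrative sketch only — assuming the third-party facenet-pytorch implementation of MTCNN as one possible open-source detector (RetinaFace could be substituted); this is not an API defined by the patent:

```python
from facenet_pytorch import MTCNN  # assumed third-party MTCNN implementation
from PIL import Image

mtcnn = MTCNN(keep_all=True)  # keep all faces in the frame, not just the best

def candidate_faces(frame: Image.Image):
    """Return (box, prob) pairs for all candidate faces in one video frame (S501)."""
    boxes, probs = mtcnn.detect(frame)  # boxes: Nx4 array, or None if no face
    return [] if boxes is None else list(zip(boxes, probs))
```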
S502, determining a plurality of face images of each person in the target scene according to the plurality of candidate face images.
The plurality of candidate face images are face images extracted from the current video frame and the historical video frame, namely the plurality of candidate face images corresponding to the current video frame and the plurality of candidate face images corresponding to the historical video frame.
Thus, in one embodiment, as shown in fig. 7, determining a plurality of face images of each person in the target scene from the plurality of candidate face images includes the following steps:
S701, determining a plurality of adjacent video frames in the current video frame and the historical video frames according to the time sequence of the video frames.
The video frames comprise a current video frame and a historical video frame, and the time sequence represents the time sequence of shooting the video frames, so that a plurality of adjacent video frames in the current video frame and the historical video frame can be determined according to the time sequence of the current video frame and the historical video frame.
S702, determining the same face image in each pair of adjacent video frames according to the similarity between the candidate face images in those frames; each identical face image represents the same person.
Based on the determined multiple adjacent video frames, the similarity between the candidate face images in each adjacent video frame is obtained, specifically, the similarity between every two candidate face images of two video frames in each adjacent video frame is obtained for any adjacent video frame, and then the same face image in the adjacent video frame is determined according to the similarity between the candidate face images in the adjacent video frames.
For example, if the adjacent video frames include an a video frame and a B video frame, the a video frame includes three candidate face images a1, a2, and a3, and the B video frame includes three candidate face images B1, B2, and B3, then the a1 candidate face image and B1 candidate face image, the a1 candidate face image and B2 candidate face image, the a1 candidate face image and B3 candidate face image, the a2 candidate face image and B1 candidate face image, the a2 candidate face image and B2 candidate face image, the a2 candidate face image and B3 candidate face image, the a3 candidate face image and B1 candidate face image, the a3 candidate face image and B2 candidate face image, and the similarity between the a3 candidate face image and B3 candidate face image are acquired respectively.
According to the similarity between face images in the adjacent video frames, the method for determining the same face image in the adjacent video frames may be: and determining the candidate face image with the highest similarity with the candidate face image in the other video frame as the same face image aiming at any candidate face image in one video frame in the adjacent video frames.
Continuing with the example in which the adjacent video frames comprise video frame A and video frame B: for candidate face image a1 in video frame A, the pair among (a1, b1), (a1, b2) and (a1, b3) with the highest similarity is determined to be the same face. That is, if the similarity between a1 and b1 is 0.8, between a1 and b2 is 0.4, and between a1 and b3 is 0.6, then candidate face images a1 and b1 are determined to be the same face image.
Based on the above manner, all the identical face images in the adjacent video frames are determined, thereby determining the identical face images in each adjacent video frame.
S703, determining a plurality of face images of each person in the target scene according to the same face image in each adjacent video frame.
Because each identical face image represents the same person, and each pair of adjacent video frames is drawn from the current and historical video frames of the target scene, a plurality of face images of each person in the target scene can be determined from the same face images across the adjacent video frames.
For example, the target scene includes an a video frame, a B video frame and a C video frame, if the a video frame and the B video frame are adjacent video frames, the B video frame and the C video frame are adjacent video frames, the a1 face image of the a video frame and the B1 face image of the B video frame are the same face, and the B1 face image of the B video frame and the C2 face image of the C video frame are the same face, it may be determined that the a1 face image, the B1 face image and the C2 face image are multiple face images of the same person.
Based on the above mode, a plurality of face images of each person in the target scene can be obtained.
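A compact sketch of this transitive linking — chaining same-face matches through consecutive frames into one track per person; the data layout is an assumption for illustration:

```python
def build_tracks(matches: list[list[tuple[int, int]]]) -> list[list[int]]:
    """matches[k] holds (i, j) pairs: face i in frame k equals face j in
    frame k+1. Chains the pairs into per-person index tracks (S703)."""
    # Start a track for every face matched in frame 0.
    first_faces = {i for i, _ in matches[0]} if matches else set()
    tracks = [[i] for i in sorted(first_faces)]
    for frame_matches in matches:
        lookup = dict(frame_matches)  # face index in frame k -> frame k+1
        for track in tracks:
            nxt = lookup.get(track[-1])
            if nxt is not None:
                track.append(nxt)     # extend this person's track
    return tracks

# a1->b1 and b1->c2 chain into one person's track of per-frame indices.
print(build_tracks([[(0, 0)], [(0, 1)]]))  # [[0, 0, 1]]
```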
According to the above method, the current video frame and the historical video frames are respectively input into a preset face detection model, a plurality of candidate face images are obtained through the face detection model, and a plurality of face images of each person in the target scene are determined from the candidate face images. Because the candidate face images in the target scene are determined directly by the face detection model, they can be obtained quickly, which improves recognition efficiency when identifying the speaker in the target scene.
The way the same face image is determined across adjacent video frames from the similarity between candidate face images is described in detail below. In one embodiment, as shown in fig. 8, determining the same face image in each pair of adjacent video frames according to the similarity between the candidate face images includes the following steps:
S801, for any pair of adjacent video frames, acquiring the region overlap ratio and the image similarity between each first candidate face image in the first video frame and each second candidate face image in the second video frame.
The first video frame and the second video frame are adjacent video frames, the candidate face image in the first video frame is a first candidate face image, and the candidate face image in the second video frame is a second candidate face image.
Thus, the similarity between candidate face images in adjacent video frames may be obtained by acquiring the region overlap ratio and the image similarity between each first candidate face image and each second candidate face image.

In order to reduce the time spent tracking multiple candidate faces and to improve system efficiency during real-time speaker recognition, the similarity between candidate face images can be determined from the region overlap ratio together with the image similarity.

The region overlap ratio can be calculated as an intersection over union (IoU); the image similarity can be calculated as a cosine similarity.

On this basis, the region overlap ratio and the image similarity between each first candidate face image in the first video frame and each second candidate face image in the second video frame can be obtained as the IoU and the cosine similarity between each first candidate face image and each second candidate face image.
The intersection over union between any pair of first and second candidate face images can be calculated as follows: obtain the intersection and the union of the positions (bounding boxes) of the two images, and take the ratio of the intersection area to the union area as their IoU, i.e. the region overlap ratio in this embodiment.
The cosine similarity between any pair of first and second candidate face images can be calculated as follows: obtain the color histogram vector of each image, and compute the cosine similarity between the two histogram vectors, i.e. the image similarity in this embodiment.
In this way, the region overlap ratio and the image similarity between each first candidate face image in the first video frame and each second candidate face image in the second video frame are obtained.
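A minimal sketch of these two measures (boxes as (x1, y1, x2, y2); the 8-bin-per-channel histogram is an illustrative choice, not specified by the patent):

```python
import numpy as np

def iou(box_a, box_b) -> float:
    """Intersection over union of two (x1, y1, x2, y2) boxes."""
    x1, y1 = max(box_a[0], box_b[0]), max(box_a[1], box_b[1])
    x2, y2 = min(box_a[2], box_b[2]), min(box_a[3], box_b[3])
    inter = max(0.0, x2 - x1) * max(0.0, y2 - y1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    return inter / (area_a + area_b - inter)

def color_histogram(face: np.ndarray, bins: int = 8) -> np.ndarray:
    """Concatenated per-channel color histogram vector of an HxWx3 face crop."""
    hists = [np.histogram(face[..., c], bins=bins, range=(0, 256))[0]
             for c in range(3)]
    return np.concatenate(hists).astype(float)

def cosine_similarity(u: np.ndarray, v: np.ndarray) -> float:
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v) + 1e-12))
```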
Note that the region overlap ratio and the image similarity are not limited to these calculation methods; the embodiments of the present application do not restrict them.
S802, determining the same face image in the adjacent video frames according to the region overlap ratio and the image similarity between each first candidate face image in the first video frame and each second candidate face image in the second video frame.

The same face image in the adjacent video frames is determined based on these region overlap ratios and image similarities.
In one embodiment, for any group of the first candidate face image and the second candidate face image, if the region overlap ratio between the first candidate face image and the second candidate face image is greater than a preset overlap ratio threshold value, and the image similarity between the first candidate face image and the second candidate face image is greater than a preset similarity threshold value, determining that the first candidate face image and the second candidate face image are the same face image.
If the region overlapping ratio between the first candidate face image and the second candidate face image is smaller than or equal to a preset overlapping ratio threshold value, or the image similarity between the first candidate face image and the second candidate face image is smaller than or equal to a preset similarity threshold value, determining that the first candidate face image and the second candidate face image are not the same face image.
For example, if the overlap ratio threshold a is 0.8 and the similarity threshold b is 0.6, and if the region overlap ratio between the first candidate face image and the second candidate face image is 0.9 and the image similarity between the first candidate face image and the second candidate face image is 0.8, determining that the first candidate face image and the second candidate face image are the same face image, that is, the track of the same person; if the region overlapping ratio between the first candidate face image and the second candidate face image is 0.9 and the image similarity between the first candidate face image and the second candidate face image is 0.5, determining that the first candidate face image and the second candidate face image are not the same face image, namely the tracks of different people.
If the region overlap ratio between a first candidate face image and at least two second candidate face images is greater than the overlap ratio threshold and the image similarity is greater than the similarity threshold, the sum of the region overlap ratio and the image similarity is further computed for each such pair, and the first candidate face image and the second candidate face image with the largest sum are determined to be the same face image.
For example, the overlap ratio threshold a is 0.8, the similarity threshold b is 0.6, and if the region overlap ratio between the first candidate face image a1 and the second candidate face image b1 is 0.9, the image similarity between the first candidate face image a1 and the second candidate face image b1 is 0.8, the region overlap ratio between the first candidate face image a1 and the second candidate face image b2 is 0.9, and the image similarity between the first candidate face image a1 and the second candidate face image b2 is 0.7, the first candidate face image a1 and the second candidate face image b1 are determined to be the same face image.
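A sketch of this decision rule, combining the two thresholds with the sum-based tie-break (the threshold values are the example values used above):

```python
def match_faces(pairs, iou_thr: float = 0.8, sim_thr: float = 0.6):
    """pairs: list of (first_idx, second_idx, iou, sim) for one adjacent
    frame pair. Returns the matched (first_idx, second_idx) pairs."""
    matches = {}
    for i, j, ov, sim in pairs:
        if ov > iou_thr and sim > sim_thr:
            score = ov + sim                       # tie-break by the sum
            if i not in matches or score > matches[i][1]:
                matches[i] = (j, score)
    return [(i, j) for i, (j, _) in matches.items()]

# a1 matches both b1 (0.9, 0.8) and b2 (0.9, 0.7): b1 wins on the sum.
print(match_faces([(0, 0, 0.9, 0.8), (0, 1, 0.9, 0.7)]))  # [(0, 0)]
```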
According to the above method, for any pair of adjacent video frames, the region overlap ratio and the image similarity between each first candidate face image in the first video frame and each second candidate face image in the second video frame are obtained, and the same face image in the adjacent frames is determined from these two measures. Because the same face is judged along two dimensions, whether a first candidate face image and a second candidate face image match is determined more reliably, improving the accuracy of same-face judgment.
How the speaker in the current video frame of the target scene is determined from the speaking probability of each person is explained below. In one embodiment, as shown in fig. 9, determining the speaker in the current video frame of the target scene according to the speaking probability of each person includes:
S901, determining candidate speakers of the target scene according to the speaking probability of each person.
For any person in the target scene, if the speaking probability of the person is greater than the preset probability threshold, determining that the person is a candidate speaking person in the target scene, wherein the number of the candidate speaking persons can be multiple.
If the speaking probability of the person is smaller than or equal to the preset probability threshold, determining that the person is a non-candidate speaking person in the target scene, namely that the person does not speak in the current video frame of the target scene.
S902, for any candidate speaker, if the person has been a candidate speaker in each of a first preset number of consecutive identifications before the current time, determining the candidate speaker as the speaker in the current video frame of the target scene.
In order to avoid misrecognitions by the speaking action recognition model, the candidate speakers can be filtered based on the most recent historical recognition results, removing misrecognitions; preset judgment logic then determines whether a candidate speaker is the speaker of the current video frame of the target scene.
The historical recognition results of the candidate speaker are acquired; each result records whether the person was a candidate speaker at that time. If the person has been a candidate speaker in each of the first preset number of identifications before the current time, the person is determined to be the speaker in the current video frame of the target scene.

Otherwise, if the person was a non-candidate speaker at least once within the first preset number of identifications before the current time, the person is not determined to be the speaker in the current video frame of the target scene.
Alternatively, let P(speaking|x_t) denote whether the person is a candidate speaker at time t: P(speaking|x_t) is true if the person is a candidate speaker at time t, and false otherwise. If AND(P(speaking|x_t), P(speaking|x_{t-1}), ..., P(speaking|x_{t-n})) is true, the candidate speaker is determined to be speaking at time t, since the person has been a candidate speaker at all n consecutive times before time t, where the first preset number of times is n.

If AND(P(speaking|x_t), P(speaking|x_{t-1}), ..., P(speaking|x_{t-n})) is false, the speaking is determined to be invalid, and the candidate speaker is a non-speaker in the current video frame of the target scene.
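A sketch of this temporal filter over the recent candidate-speaker flags (n is the first preset number of times; the deque-based history is an illustrative choice):

```python
from collections import deque

class SpeakerFilter:
    """Confirms a speaker only after n+1 consecutive candidate detections."""

    def __init__(self, n: int = 4):
        # Holds P(speaking|x_t) ... P(speaking|x_{t-n}).
        self.history = deque(maxlen=n + 1)

    def update(self, is_candidate: bool) -> bool:
        self.history.append(is_candidate)
        # AND over the window; false until the window is full.
        return len(self.history) == self.history.maxlen and all(self.history)

f = SpeakerFilter(n=2)
print([f.update(v) for v in [True, True, True, False, True]])
# [False, False, True, False, False]
```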
According to the speaker recognition method provided by this embodiment, candidate speakers in the target scene are determined from each person's speaking probability, and for any candidate speaker, the person is determined to be the speaker in the current video frame only if the person has been a candidate speaker in each of the first preset number of identifications before the current time. Filtering the candidates in this way avoids misrecognitions by the speaking action recognition model and improves the accuracy of speaker recognition.
For a speaker in the current video frame of the target scene, the speaker may be highlighted, and in one embodiment, as shown in fig. 10, the embodiment includes the steps of:
S1001, taking the face position of the speaker in the current video frame as the center, cropping the region surrounding the speaker's face image to obtain a cropped region.
The region surrounding the speaker's face image is cropped by taking the face position in the current video frame as the center and expanding the region by a preset expansion size or a preset expansion factor to obtain the cropped region; the cropped region may be rectangular.
Taking a preset expansion size as an example: with the face position as the center, the region is expanded upward by h1, downward by h2, leftward by L1 and rightward by L2.
S1002, highlighting the cropped region.

The cropped region is highlighted by marking its boundary; as shown in fig. 11, the cropped region is highlighted with a black line. To display the speaker more clearly, the cropped region may instead be marked with a colored line.
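A sketch of the crop-and-highlight step (OpenCV is assumed for drawing; the expansion sizes h1, h2, l1, l2 follow the example above and are illustrative defaults):

```python
import cv2
import numpy as np

def highlight_speaker(frame: np.ndarray, face_box, h1=40, h2=40, l1=30, l2=30):
    """Expand the speaker's face box by the preset sizes and draw its border."""
    x1, y1, x2, y2 = map(int, face_box)
    h, w = frame.shape[:2]
    # Clamp the expanded rectangle to the frame bounds.
    top, bottom = max(0, y1 - h1), min(h, y2 + h2)
    left, right = max(0, x1 - l1), min(w, x2 + l2)
    cv2.rectangle(frame, (left, top), (right, bottom), (0, 0, 0), 2)
    return frame[top:bottom, left:right]  # the cropped region
```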
According to this method, the region surrounding the speaker's face image is cropped with the speaker's face position in the current video frame as the center, and the cropped region is highlighted. Highlighting the speaker makes it faster to locate the speaker and helps others focus on what and how the speaker is speaking.
Disappearance determination logic for the highlight window is also provided. In one embodiment, this includes: if the speaker does not speak for a second preset number of times, ending the highlighting of the speaker.
In this embodiment, the speaker refers to the currently highlighted person; if this person does not speak within the second preset number of identifications, i.e. it is determined that the person is no longer speaking, the highlighting of the speaker is ended.
Alternatively, if OR(P(speaking|x_t), P(speaking|x_{t-1}), ..., P(speaking|x_{t-n})) is false, the corresponding person has been a non-candidate speaker throughout the second preset number of times; the person is then determined not to be speaking, i.e. the speaker is no longer speaking, and the highlighting of the speaker is ended, where the second preset number of times is n+1.
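The complementary disappearance check, sketched in the same style (the window length n+1 is the second preset number of times):

```python
from collections import deque

def should_stop_highlight(recent_flags: deque) -> bool:
    """End highlighting when OR over the last n+1 candidate flags is false,
    i.e. the highlighted person was never a candidate speaker in the window."""
    return len(recent_flags) == recent_flags.maxlen and not any(recent_flags)

flags = deque([False, False, False], maxlen=3)
print(should_stop_highlight(flags))  # True: end the highlight
```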
In one embodiment, as shown in fig. 12, taking a target scene as a conference room as an example, the method further provides a speaker recognition method, and the method includes the following steps:
S1201, acquiring a current video frame and historical video frames in a conference room.
S1202, face detection is carried out on a current video frame and a historical video frame respectively, and a plurality of face images are obtained.
S1203, performing target tracking on the plurality of face images to obtain the face image track of each person in the conference room.
S1204, inputting the face image track of each person to a speaking motion recognition model, and outputting the speaking probability of each person through the speaking motion recognition model.
S1205, determining the persons with speaking probability greater than a preset probability threshold as candidate speakers.
S1206, if the candidate speaker has been a candidate speaker in each of the most recent preset number of historical identifications, determining that the candidate speaker is the current speaker in the conference room.
S1207, expanding a region centered on the position of the speaker's face image to obtain an expanded region, and highlighting the expanded region.
If the highlighted speaker has been a non-candidate speaker throughout the most recent preset number of identifications, the highlighting of the speaker is ended.
It should be understood that, although the steps in the flowcharts related to the embodiments described above are sequentially shown as indicated by arrows, these steps are not necessarily sequentially performed in the order indicated by the arrows. The steps are not strictly limited to the order of execution unless explicitly recited herein, and the steps may be executed in other orders. Moreover, at least some of the steps in the flowcharts described in the above embodiments may include a plurality of steps or a plurality of stages, which are not necessarily performed at the same time, but may be performed at different times, and the order of the steps or stages is not necessarily performed sequentially, but may be performed alternately or alternately with at least some of the other steps or stages.
Based on the same inventive concept, the embodiment of the application also provides a speaker recognition device for realizing the speaker recognition method. The implementation of the solution provided by the device is similar to the implementation described in the above method, so the specific limitation in the embodiment of the speaker recognition device or devices provided below may be referred to the limitation of the speaker recognition method hereinabove, and will not be repeated here.
In one embodiment, as shown in fig. 13, there is provided a speaker recognition apparatus including: a first acquisition module 1301, a second acquisition module 1302, and a determination module 1303, wherein:
the first obtaining module 1301 is configured to obtain a face image track of each person in the target scene according to the current video frame of the target scene and a preset number of historical video frames before the current video frame;
a second obtaining module 1302, configured to obtain a speaking probability of each person according to a face image track of each person;
the determining module 1303 is configured to determine a speaker in a current video frame of the target scene according to the speaking probability of each person.
In one embodiment, the first acquisition module 1301 includes:
The detection unit is used for carrying out face detection on the current video frame and the historical video frame respectively to obtain a plurality of face images of each person in the target scene;
and the acquisition unit is used for acquiring the human face image track of each person based on the multiple human face images of each person in the target scene.
In one embodiment, the detection unit comprises:
the detection subunit is used for respectively inputting the current video frame and the historical video frames into a preset face detection model, and obtaining a plurality of candidate face images through the face detection model;
a first determining subunit, configured to determine a plurality of face images of each person in the target scene according to the plurality of candidate face images.
In one embodiment, the first determination subunit comprises:
a second determining subunit, configured to determine, according to the time sequence of the video frames, a plurality of adjacent video frames in the current video frame and the historical video frame;
a third determining subunit, configured to determine, according to the similarity between candidate face images in each pair of adjacent video frames, the same face image in each pair of adjacent video frames; each identical face image represents the same person;
and the fourth determination subunit is used for determining a plurality of face images of each person in the target scene according to the same face image in each adjacent video frame.
In one embodiment, the third determination subunit comprises:
the acquisition subunit is used for acquiring, for any pair of adjacent video frames, the region overlap ratio and the image similarity between each first candidate face image in the first video frame and each second candidate face image in the second video frame;
and the fifth determining subunit is used for determining the same face image in the adjacent video frames according to the region overlap ratio and the image similarity between each first candidate face image in the first video frame and each second candidate face image in the second video frame.
In one embodiment, the fifth determination subunit comprises:
a sixth determining subunit, configured to determine, for any group of the first candidate face image and the second candidate face image, that the first candidate face image and the second candidate face image are the same face image if the region overlapping ratio between the first candidate face image and the second candidate face image is greater than a preset overlapping ratio threshold, and the image similarity between the first candidate face image and the second candidate face image is greater than a preset similarity threshold.
In one embodiment, the second acquisition module 1302 includes:
the obtaining unit is used for respectively inputting the face image track of each person to a preset speaking action recognition model, and obtaining the speaking probability of each person through the speaking action recognition model.
In one embodiment, the determining module 1303 includes:
a first determining unit, configured to determine candidate speakers of the target scene according to the speaking probability of each person;
and the second determining unit is used for determining, for any candidate speaker, the candidate speaker as the speaker in the current video frame of the target scene if the person has been a candidate speaker continuously for the first preset number of times before the current moment.
In one embodiment, the apparatus 1300 further comprises:
the intercepting module is used for intercepting the area surrounding the speaker's face image, with the face position of the speaker in the current video frame as the center, to obtain an intercepted region;
and the display module is used for highlighting the intercepted region.
In one embodiment, the apparatus 1300 further comprises:
and the ending module is used for ending the highlighting of the speaker if the speaker has not spoken for a second preset number of consecutive times.
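A sketch of the display step, using OpenCV as an assumed rendering backend (the publication does not name one) and an illustrative expansion factor for the surrounding area:

```python
import cv2  # assumed backend; any drawing/cropping library would do


def highlight_speaker(frame, face_box, margin=0.5):
    """Take the speaker's face position as the centre, intercept the
    surrounding area (the face box expanded by `margin` on each side)
    and highlight it in the frame. Returns the intercepted region."""
    x1, y1, x2, y2 = face_box
    cx, cy = (x1 + x2) / 2.0, (y1 + y2) / 2.0
    half_w = (x2 - x1) * (1.0 + margin) / 2.0
    half_h = (y2 - y1) * (1.0 + margin) / 2.0
    left = max(0, int(cx - half_w))
    top = max(0, int(cy - half_h))
    right = min(frame.shape[1], int(cx + half_w))
    bottom = min(frame.shape[0], int(cy + half_h))
    cv2.rectangle(frame, (left, top), (right, bottom), (0, 255, 0), 2)
    return frame[top:bottom, left:right]
```

Ending the highlight after a second preset number of non-speaking frames can reuse the streak bookkeeping of the decider sketch above, counting consecutive misses instead of hits.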
The individual modules in the speaker recognition apparatus described above may be implemented in whole or in part by software, hardware, or a combination thereof. The above modules may be embedded in hardware in, or independent of, a processor in the computer device, or may be stored as software in a memory of the computer device, so that the processor can call and execute the operations corresponding to each module.
In one embodiment, a computer device is provided, which may be a server whose internal structure may be as shown in fig. 14. The computer device includes a processor, a memory, an input/output (I/O) interface, and a communication interface. The processor, the memory, and the input/output interface are connected through a system bus, and the communication interface is connected to the system bus through the input/output interface. The processor of the computer device provides computing and control capabilities. The memory of the computer device includes a non-volatile storage medium and an internal memory. The non-volatile storage medium stores an operating system, a computer program, and a database. The internal memory provides an environment for the operation of the operating system and the computer program in the non-volatile storage medium. The database of the computer device is used to store speaker recognition data. The input/output interface of the computer device is used to exchange information between the processor and external devices. The communication interface of the computer device is used to communicate with an external terminal through a network connection. The computer program, when executed by the processor, implements a speaker recognition method.
It will be appreciated by those skilled in the art that the structure shown in fig. 14 is merely a block diagram of a portion of the structure associated with the present application and does not limit the computer device to which the present application is applied; a particular computer device may include more or fewer components than shown, combine certain components, or arrange components differently.
In an embodiment, there is also provided a computer device comprising a memory and a processor, the memory having stored therein a computer program, the processor implementing the steps of the method embodiments described above when the computer program is executed.
The implementation principle and technical effect of each step implemented by the processor in the embodiment of the present application are similar to those of the speaker recognition method described above, and are not described herein again.
In one embodiment, a computer-readable storage medium is provided, on which a computer program is stored which, when executed by a processor, carries out the steps of the method embodiments described above.
The principles and technical effects of the steps implemented when the computer program is executed by the processor in this embodiment are similar to those of the speaker recognition method described above and are not repeated here.
In an embodiment, a computer program product is provided, comprising a computer program which, when executed by a processor, implements the steps of the method embodiments described above.
The principles and technical effects of the steps implemented when the computer program is executed by the processor in this embodiment are similar to those of the speaker recognition method described above and are not repeated here.
It should be noted that the personnel information (including but not limited to face information) and data (including but not limited to data for analysis, stored data, and displayed data) involved in the present application are information and data authorized by the user or fully authorized by all parties, and the collection, use, and processing of such data must comply with the relevant laws, regulations, and standards of the relevant countries and regions.
Those skilled in the art will appreciate that all or part of the above-described methods may be implemented by a computer program stored on a non-transitory computer-readable storage medium which, when executed, may include the steps of the method embodiments described above. Any reference to memory, database, or other medium used in the various embodiments provided herein may include at least one of non-volatile and volatile memory. Non-volatile memory may include read-only memory (ROM), magnetic tape, floppy disk, flash memory, optical memory, high-density embedded non-volatile memory, resistive random access memory (ReRAM), magnetoresistive random access memory (MRAM), ferroelectric random access memory (FRAM), phase change memory (PCM), graphene memory, and the like. Volatile memory may include random access memory (RAM), external cache memory, and the like. By way of illustration and not limitation, RAM can take various forms, such as static random access memory (SRAM) or dynamic random access memory (DRAM). The databases referred to in the various embodiments provided herein may include at least one of relational and non-relational databases. Non-relational databases may include, but are not limited to, blockchain-based distributed databases and the like. The processors referred to in the embodiments provided herein may be, without limitation, general-purpose processors, central processing units, graphics processors, digital signal processors, programmable logic units, or data processing logic units based on quantum computing.
The technical features of the above embodiments may be combined arbitrarily. For brevity, not all possible combinations of these technical features are described; nevertheless, any combination that contains no contradiction should be considered to fall within the scope of this description.
The above examples represent only a few embodiments of the present application and, although described in detail, are not to be construed as limiting the scope of the patent. It should be noted that those skilled in the art may make various modifications and improvements without departing from the spirit of the present application, all of which fall within the protection scope of the present application. Accordingly, the protection scope of the present application shall be subject to the appended claims.
Claims (14)
1. A method of speaker recognition, the method comprising:
acquiring a face image track of each person in a target scene according to a current video frame of the target scene and a preset number of historical video frames before the current video frame;
acquiring the speaking probability of each person according to the face image track of each person;
and determining the speaking person in the current video frame of the target scene according to the speaking probability of each person.
2. The method according to claim 1, wherein the acquiring the face image track of each person in the target scene according to the current video frame of the target scene and a preset number of historical video frames before the current video frame comprises:
carrying out face detection on the current video frame and on each historical video frame to obtain a plurality of face images of each person in the target scene;
and acquiring the face image track of each person based on the plurality of face images of each person in the target scene.
3. The method according to claim 2, wherein the performing face detection on the current video frame and the historical video frame respectively to obtain a plurality of face images of each person in the target scene includes:
respectively inputting the current video frame and the historical video frames into a preset face detection model, and obtaining a plurality of candidate face images through the face detection model;
and determining a plurality of face images of each person in the target scene according to the plurality of candidate face images.
4. The method according to claim 3, wherein the determining a plurality of face images of each person in the target scene according to the plurality of candidate face images comprises:
determining, according to the time sequence of the video frames, pairs of adjacent video frames among the current video frame and the historical video frames;
determining, according to the similarity between the candidate face images in each pair of adjacent video frames, the same face image in each pair of adjacent video frames, wherein each identical face image represents the same person;
and determining a plurality of face images of each person in the target scene according to the same face image in each pair of adjacent video frames.
5. The method of claim 4, wherein the determining the same face image in each pair of adjacent video frames according to the similarity between the candidate face images comprises:
for any pair of adjacent video frames, acquiring the region coincidence degree and the image similarity between each first candidate face image in the first video frame and each second candidate face image in the second video frame of the pair;
and determining the same face image in the pair of adjacent video frames according to the region coincidence degree and the image similarity between each first candidate face image in the first video frame and each second candidate face image in the second video frame.
6. The method of claim 5, wherein the determining the same face image in the pair of adjacent video frames according to the region coincidence degree and the image similarity between each first candidate face image in the first video frame and each second candidate face image in the second video frame comprises:
for any pair of a first candidate face image and a second candidate face image, determining that the two are the same face image if the region coincidence degree between them is greater than a preset coincidence degree threshold and the image similarity between them is greater than a preset similarity threshold.
7. The method according to any one of claims 1-6, wherein the acquiring the speaking probability of each person according to the face image track of each person comprises:
respectively inputting the face image track of each person into a preset speaking action recognition model, and obtaining the speaking probability of each person through the speaking action recognition model.
8. The method of any of claims 1-6, wherein the determining the speaker in the current video frame of the target scene according to the speaking probability of each person comprises:
determining candidate speakers in the target scene according to the speaking probability of each person;
and for any candidate speaker, determining the candidate speaker as the speaker in the current video frame of the target scene if that person has been a candidate speaker for a first preset number of consecutive times before the current moment.
9. The method according to any one of claims 1-6, further comprising:
taking the face position of the speaker in the current video frame as the center, and intercepting the area surrounding the speaker's face image to obtain an intercepted region;
and highlighting the intercepted region.
10. The method according to claim 9, wherein the method further comprises:
and if the speaker has not spoken for a second preset number of consecutive times, ending the highlighting of the speaker.
11. A speaker recognition apparatus, the apparatus comprising:
a first acquisition module, configured to acquire the face image track of each person in the target scene according to a current video frame of the target scene and a preset number of historical video frames before the current video frame;
a second acquisition module, configured to acquire the speaking probability of each person according to the face image track of each person;
and a determining module, configured to determine the speaker in the current video frame of the target scene according to the speaking probability of each person.
12. A computer device comprising a memory and a processor, the memory storing a computer program, characterized in that the processor implements the steps of the method of any one of claims 1 to 10 when the computer program is executed.
13. A computer readable storage medium, on which a computer program is stored, characterized in that the computer program, when being executed by a processor, implements the steps of the method of any of claims 1 to 10.
14. A computer program product comprising a computer program, characterized in that the computer program, when being executed by a processor, implements the steps of the method of any one of claims 1 to 10.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202310374045.4A | 2023-03-29 | 2023-03-29 | Speaker recognition method, apparatus, device, storage medium and program product
Publications (1)
Publication Number | Publication Date |
---|---|
CN116524560A | 2023-08-01
Family
ID=87405623
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202310374045.4A (pending) | Speaker recognition method, apparatus, device, storage medium and program product | 2023-03-29 | 2023-03-29
Country Status (1)
Country | Link |
---|---|
CN (1) | CN116524560A (en) |
Cited By (1)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN117579769A | 2023-11-30 | 2024-02-20 | 郑州迪维勒普科技有限公司 | Lightweight emergency communication method and system based on 5G
Similar Documents
Publication | Title
---|---
US10503978B2 | Spatio-temporal interaction network for learning object interactions
CN109255352B | Target detection method, device and system
CN112889108B | Speech classification using audiovisual data
EP3617946B1 | Context acquisition method and device based on voice interaction
CA2653278C | Identification of people using multiple types of input
JP7709552B2 | End-to-end speaker diarization via iterative speaker embedding
WO2022048239A1 | Audio processing method and device
WO2003049035A2 | Method and apparatus for automatic face blurring
US11113838B2 | Deep learning based tattoo detection system with optimized data labeling for offline and real-time processing
CN111091845A | Audio processing method, device, terminal device and computer storage medium
CN111860407A | A method, device, device and storage medium for facial expression recognition of characters in video
Alamri | Monitoring system for patients using multimedia for smart healthcare
CN114898737A | Acoustic event detection method, apparatus, electronic device and storage medium
CN117037772A | Voice audio segmentation method, device, computer equipment and storage medium
Rothkrantz | Lip-reading by surveillance cameras
CN116524560A | Speaker recognition method, apparatus, device, storage medium and program product
Deshmukh et al. | Vision based lip reading system using deep learning
CN112614492A | Voiceprint recognition method, system and storage medium based on time-space information fusion
CN110393539A | Psychological abnormality detection method, device, storage medium and electronic equipment
CN112036350B | User investigation method and system based on government affair cloud
CN110223700A | Talker estimates method and talker's estimating device
CN115909144A | A surveillance video anomaly detection method and system based on adversarial learning
CN115761842A | A method and device for automatically updating face database
CN111783507B | Target search method, device and computer readable storage medium
Zhang et al. | Audio-visual speech separation with visual features enhanced by adversarial training
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | |
SE01 | Entry into force of request for substantive examination | |