
CN119360887A - A voice authentication method and related equipment - Google Patents

A voice authentication method and related equipment

Info

Publication number
CN119360887A
CN119360887A (application number CN202411361419.XA)
Authority
CN
China
Prior art keywords
audio
vocoder
target audio
target
processing
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202411361419.XA
Other languages
Chinese (zh)
Inventor
郑哲
马骏
王少军
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Ping An Technology Shenzhen Co Ltd
Original Assignee
Ping An Technology Shenzhen Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Ping An Technology Shenzhen Co Ltd filed Critical Ping An Technology Shenzhen Co Ltd
Priority to CN202411361419.XA priority Critical patent/CN119360887A/en
Publication of CN119360887A publication Critical patent/CN119360887A/en
Pending legal-status Critical Current

Classifications

    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L25/00 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
    • G10L25/48 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 specially adapted for particular use
    • G10L25/51 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 specially adapted for particular use for comparison or discrimination
    • G PHYSICS
    • G06 COMPUTING OR CALCULATING; COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/22 Matching criteria, e.g. proximity measures
    • G PHYSICS
    • G06 COMPUTING OR CALCULATING; COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/04 Architecture, e.g. interconnection topology
    • G06N3/044 Recurrent networks, e.g. Hopfield networks
    • G06N3/0442 Recurrent networks, e.g. Hopfield networks characterised by memory or gating, e.g. long short-term memory [LSTM] or gated recurrent units [GRU]
    • G PHYSICS
    • G06 COMPUTING OR CALCULATING; COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/04 Architecture, e.g. interconnection topology
    • G06N3/048 Activation functions
    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00 Speech recognition
    • G10L15/02 Feature extraction for speech recognition; Selection of recognition unit
    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00 Speech recognition
    • G10L15/08 Speech classification or search
    • G10L15/18 Speech classification or search using natural language modelling
    • G10L15/1815 Semantic context, e.g. disambiguation of the recognition hypotheses based on word meaning
    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00 Speech recognition
    • G10L15/08 Speech classification or search
    • G10L15/18 Speech classification or search using natural language modelling
    • G10L15/183 Speech classification or search using natural language modelling using context dependencies, e.g. language models
    • G10L15/19 Grammatical context, e.g. disambiguation of the recognition hypotheses based on word sequence rules
    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L25/00 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
    • G10L25/03 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the type of extracted parameters
    • G10L25/06 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the type of extracted parameters the extracted parameters being correlation coefficients
    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L25/00 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
    • G10L25/27 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the analysis technique
    • G10L25/30 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the analysis technique using neural networks

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Health & Medical Sciences (AREA)
  • Computational Linguistics (AREA)
  • Theoretical Computer Science (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Artificial Intelligence (AREA)
  • Multimedia (AREA)
  • Acoustics & Sound (AREA)
  • Human Computer Interaction (AREA)
  • Evolutionary Computation (AREA)
  • Data Mining & Analysis (AREA)
  • General Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • Signal Processing (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Software Systems (AREA)
  • General Health & Medical Sciences (AREA)
  • Molecular Biology (AREA)
  • Computing Systems (AREA)
  • Mathematical Physics (AREA)
  • Biophysics (AREA)
  • Biomedical Technology (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Evolutionary Biology (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The present application belongs to the field of artificial intelligence and big data, and is applied in the field of smart cities. It relates to a method for speech authentication, including: obtaining a speech data set, training a pre-trained speech authentication model through the speech data set, and obtaining a trained speech authentication model; performing vocoder classification processing on the target audio through the trained speech authentication model to determine the corresponding vocoder to which the target audio belongs; performing feature extraction processing on the target audio in the corresponding vocoder to which the target audio belongs to obtain target audio features; performing authentication processing on the target audio features to determine the authentication processing results of the target audio. The present application also provides a speech authentication device, a computer device, and a storage medium. The present application solves the problem that the existing speech authentication methods have low accuracy and poor generalization in real environments.

Description

Voice authentication method and related equipment
Technical Field
The application relates to the technical field of artificial intelligence and big data, and in particular to a voice authentication method and related equipment.
Background
With the development of deepfake technology, it has become easy to generate highly realistic fake audio. Mainstream voice generation methods fall into two forms: one synthesizes the speech of a target speaker from a piece of text, i.e., text-to-speech (TTS); the other converts the voice of a source speaker into the voice of a target speaker while keeping the audio content unchanged, i.e., voice conversion (VC). The development of convolutional neural networks has made generated speech increasingly lifelike, so a voice authentication method is urgently needed to solve the problems that existing voice authentication methods have low accuracy and poor generalization in real environments.
Disclosure of Invention
The embodiments of the present application aim to provide a voice authentication method and related equipment to solve the problems of low accuracy and poor generalization of existing voice authentication methods in real environments. The method classifies input audio by introducing vocoder features into the model, uses vocoder feature recognition to improve the accuracy of voice authentication, and effectively improves detection accuracy without increasing inference complexity.
In order to solve the above technical problems, the embodiment of the present application provides a voice authentication method, which adopts the following technical scheme:
acquiring a voice data set, and training a pre-trained voice authentication model with the voice data set to obtain a trained voice authentication model;
performing vocoder classification processing on the target audio through the trained voice authentication model, and determining the vocoder to which the target audio belongs;
performing feature extraction processing on the target audio in the vocoder to which the target audio belongs, to obtain target audio features;
and performing authentication processing on the target audio features, and determining the authentication processing result of the target audio.
Further, the voice dataset includes target generated audio and annotations of the target generated audio, and the acquiring the voice dataset includes:
acquiring a sample real audio and vocoders of different categories;
Performing audio generation processing on the sample real audio in the vocoders of different categories to obtain generated audio of different categories;
calculating the correlation between the generated audio and the sample real audio, determining a correlation threshold corresponding to the category of the vocoder, and determining the generated audio with the correlation larger than the correlation threshold as a target generated audio;
And determining the annotation of the target generated audio, wherein the annotation consists of the type of the vocoder that generated the audio and a "fake" label.
Further, the calculating the correlation between the generated audio and the sample real audio includes:
calculating the treble similarity between the generated audio and the sample real audio;
calculating the bass similarity between the generated audio and the sample real audio;
calculating the lexical similarity between the generated audio and the sample real audio;
calculating the global semantic similarity between the generated audio and the sample real audio;
and carrying out weighted fusion processing on the treble similarity, the bass similarity, the lexical similarity, and the global semantic similarity to obtain the correlation between the generated audio and the sample real audio.
Further, the determining the correlation threshold corresponding to the category of the vocoder includes:
determining weights corresponding to the categories of the vocoders based on the generation accuracy of the vocoders of different categories;
Determining the effective duration and the number of effective phonemes in the sample real audio, wherein the effective duration is the sum of the durations of the effective phonemes, and the effective phonemes are the phonemes of a person's speech;
And carrying out weighted summation processing on the weight, the effective duration and the number of effective phonemes to obtain the correlation threshold corresponding to the category of the vocoder.
Further, before the pre-trained voice authentication model is trained with the voice data set to obtain a trained voice authentication model, the method further includes:
obtaining a pre-trained voice authentication model, adding a vocoder feature extraction module to it, and performing pre-emphasis processing on vocoder features of different categories in the vocoder feature extraction module to obtain pre-emphasized vocoder features;
sending the pre-emphasized vocoder features into an instance normalization layer and a parameterized analysis filter bank to process them into a time-frequency representation, obtaining a vocoder feature time-frequency representation;
passing the vocoder feature time-frequency representation through three backbone blocks with residual connections, which are connected as follows: the output of the first backbone block serves as the input of the second backbone block; the sum of the outputs of the first and second backbone blocks serves as the input of the third backbone block; the output of the first backbone block undergoes a pooling operation; and the pooled output of the first backbone block, the output of the second backbone block, and the output of the third backbone block are simultaneously fed as input into a convolution layer for feature extraction;
and adding the inputs of the three backbone blocks with residual connections, followed by a one-dimensional convolution layer and two fully connected layers, to obtain the output result of the vocoder.
Further, performing feature extraction processing on the target audio in the vocoder to which the target audio belongs to obtain target audio features includes:
Performing time-frequency analysis processing on the target audio in a vocoder corresponding to the target audio to obtain the target audio after the time-frequency analysis processing;
and inputting the target audio subjected to the time-frequency analysis processing into three backbone blocks with residual connection for feature extraction processing, so as to obtain target audio features.
Further, inputting the target audio after the time-frequency analysis processing into three backbone blocks with residual connection for feature extraction processing to obtain target audio features, including:
inputting the target audio subjected to the time-frequency analysis processing into a first backbone block for first feature extraction processing, and outputting to obtain a first target audio feature to be processed;
Inputting the first target audio feature to be processed into a second backbone block for second feature extraction, and outputting to obtain a second target audio feature to be processed;
Inputting the first target audio feature to be processed and the second target audio feature to be processed into a third backbone block for third feature extraction processing to obtain a third target audio feature to be processed;
And inputting the first target audio feature to be processed, the second target audio feature to be processed and the third target audio feature to be processed into a convolution layer to perform fourth feature extraction processing, so as to obtain target audio features.
In order to solve the technical problems, the application also provides a voice authentication device, which adopts the following technical scheme:
the training module is used for acquiring a voice data set and training the pre-trained voice authentication model with the voice data set to obtain a trained voice authentication model;
the classification processing module is used for performing vocoder classification processing on the target audio through the trained voice authentication model and determining the vocoder to which the target audio belongs;
the extraction processing module is used for performing feature extraction processing on the target audio in the vocoder to which the target audio belongs to obtain target audio features;
and the authentication processing module is used for performing authentication processing on the target audio features and determining the authentication processing result of the target audio.
In order to solve the above technical problems, the embodiment of the present application further provides a computer device, which adopts the following technical schemes:
the computer device comprises a memory and a processor, wherein the memory stores computer readable instructions, and the processor executes the computer readable instructions to implement the steps of the voice authentication method according to any one of the embodiments of the present application.
In order to solve the above technical problems, an embodiment of the present application further provides a computer readable storage medium, which adopts the following technical schemes:
The computer readable storage medium has stored thereon computer readable instructions which, when executed by a processor, implement the steps of the voice authentication method according to any of the embodiments of the present application.
The beneficial effects are as follows: a voice data set is acquired, and a pre-trained voice authentication model is trained with the voice data set to obtain a trained voice authentication model; vocoder classification processing is performed on the target audio through the trained model to determine the vocoder to which the target audio belongs; feature extraction processing is performed on the target audio in that vocoder to obtain target audio features; and authentication processing is performed on the target audio features to determine the authentication result of the target audio. The target audio input to the model is processed into a time-frequency representation by a normalization layer and a parameterized analysis filter bank, then sent into three backbone blocks with residual connections for feature extraction; the features are then sent into a GRU layer for information aggregation, and finally the authentication result of the target audio is obtained through a fully connected layer and a softmax function. This solves the problems that existing voice authentication methods have low accuracy and poor generalization in real environments.
Drawings
In order to more clearly illustrate the solution of the present application, the drawings required for the description of its embodiments are briefly introduced below. It is apparent that the drawings in the following description show only some embodiments of the present application, and that a person of ordinary skill in the art may obtain other drawings from them without inventive effort.
FIG. 1 is an exemplary system architecture diagram in which the present application may be applied;
FIG. 2 is a flow chart of one embodiment of a method of voice authentication according to the present application;
FIG. 3 is a flow chart of one embodiment of step S201 in FIG. 2;
FIG. 4 is a flowchart of one embodiment of the correlation between the calculated generated audio and the sample real audio of step S2014 in FIG. 3;
FIG. 5 is a flow chart of one embodiment of determining a correlation threshold corresponding to the class of the vocoder of step S2014 of FIG. 3;
FIG. 6 is a flow chart of another embodiment of the voice authentication method of FIG. 2;
FIG. 7 is a flow chart of a specific embodiment of step S203 in FIG. 2;
FIG. 8 is a flow chart of a specific embodiment of step S2032 in FIG. 7;
FIG. 9 is a block diagram of a connection of three backbone blocks according to the present application;
FIG. 10 is a schematic diagram of a voice authentication device according to one embodiment of the present application;
FIG. 11 is a schematic diagram illustrating the configuration of one embodiment of the acquisition module 1001 of FIG. 10;
FIG. 12 is a schematic diagram illustrating an embodiment of the first determination submodule 10013 of FIG. 11;
fig. 13 is a schematic structural diagram of another embodiment of the first determining submodule 10013 in fig. 11;
FIG. 14 is a schematic diagram illustrating a structure of an embodiment of the voice authentication device 1000 in FIG. 10;
FIG. 15 is a schematic diagram illustrating a structure of one embodiment of the extraction processing module 1003 in FIG. 10;
Fig. 16 is a schematic diagram illustrating a structure of an embodiment of the extraction processing sub-module 10032 in fig. 15;
FIG. 17 is a schematic diagram of an embodiment of a computer device in accordance with the application.
Detailed Description
Unless defined otherwise, all technical and scientific terms used herein have the same meaning as commonly understood by one of ordinary skill in the art to which this application belongs. The terms used in the description are for the purpose of describing particular embodiments only and are not intended to limit the application. The terms "comprising" and "having", and any variations thereof, in the description, the claims, and the above description of the drawings are intended to cover non-exclusive inclusions. The terms "first", "second", and the like in the description, the claims, and the figures are used to distinguish between different objects and not necessarily to describe a sequential or chronological order.
Reference herein to "an embodiment" means that a particular feature, structure, or characteristic described in connection with the embodiment may be included in at least one embodiment of the application. The appearances of such phrases in various places in the specification are not necessarily all referring to the same embodiment, nor are separate or alternative embodiments mutually exclusive of other embodiments. Those of skill in the art will explicitly and implicitly appreciate that the embodiments described herein may be combined with other embodiments.
In order to make the person skilled in the art better understand the solution of the present application, the technical solution of the embodiment of the present application will be clearly and completely described below with reference to the accompanying drawings.
As shown in fig. 1, a system architecture 100 may include terminal devices 101, 102, 103, a network 104, and a server 105. The network 104 is used as a medium to provide communication links between the terminal devices 101, 102, 103 and the server 105. The network 104 may include various connection types, such as wired, wireless communication links, or fiber optic cables, among others.
The user may interact with the server 105 via the network 104 using the terminal devices 101, 102, 103 to receive or send messages or the like. Various communication client applications, such as a web browser application, a shopping class application, a search class application, an instant messaging tool, a mailbox client, social platform software, etc., may be installed on the terminal devices 101, 102, 103.
The terminal devices 101, 102, 103 may be various electronic devices having a display screen and supporting web browsing, including but not limited to smartphones, tablet computers, e-book readers, MP3 (Moving Picture Experts Group Audio Layer III) players, MP4 (Moving Picture Experts Group Audio Layer IV) players, laptop computers, desktop computers, and the like.
The server 105 may be a server providing various services, such as a background server providing support for pages displayed on the terminal devices 101, 102, 103.
It should be noted that, the voice authentication method provided by the embodiment of the present application is generally executed by a server/terminal device, and accordingly, the voice authentication device is generally set in the server/terminal device.
It should be understood that the number of terminal devices, networks and servers in fig. 1 is merely illustrative. There may be any number of terminal devices, networks, and servers, as desired for implementation.
With continued reference to fig. 2, a flow chart of one embodiment of a voice authentication method according to the present application is shown. The voice authentication method comprises the following steps:
Step S201, a voice data set is obtained, and a pre-trained voice authentication model is trained with the voice data set to obtain a trained voice authentication model.
In this embodiment, the electronic device (e.g., the server/terminal device shown in fig. 1) on which the voice authentication method operates may receive the voice authentication request of the terminal device through a wired connection manner or a wireless connection manner. It should be noted that the wireless connection may include, but is not limited to, 3G/4G/5G connection, wiFi connection, bluetooth connection, wiMAX connection, zigbee connection, UWB (ultra wideband) connection, and other now known or later developed wireless connection.
The terminal device may be a terminal device equipped with a remote interaction function such as a personal assistant or an intelligent customer service.
The voice authentication method can be applied to services such as online file authentication, online file management and the like.
Specifically, the electronic equipment can be used for different industry types such as medical voice authentication, property insurance voice authentication, financial voice authentication, store voice authentication and the like.
The voice data set comprises target generated audio and labels of the target generated audio, the target generated audio comprises vocoder generated audio of different types, the vocoder generated audio is obtained through sample real audio and vocoder generation of different types, the correlation degree between the target generated audio and the sample real audio is larger than a preset correlation degree threshold, and the correlation degree threshold is determined according to the content length of the sample real audio and the type of the vocoder.
The annotation of the target generated audio consists of two parts: the type of the vocoder that generated it, and a "fake" label. Different vocoders have different characteristics such as timbre and speaking rate. The "fake" label can be understood as marking a generated sound signal, produced by a vocoder algorithm rather than a sound that actually exists.
The vocoder is a device for converting sound signals into digital signals for processing and transmission, and functions to convert natural speech into a machine-readable form. The vocoder may convert the sound signal into a digital signal by sampling, quantizing, encoding, etc., so that it can be analyzed and processed by the computer system. The vocoder can be applied to voice synthesis, voice communication, voice recognition and the like, in the voice synthesis, the vocoder can convert text data into machine-readable voice signals so as to generate natural and smooth voice output, in the voice communication, the vocoder can convert voice signals of two parties of a call into digital signals for transmission and processing so as to realize high-quality voice communication, and in the voice recognition, the vocoder can convert natural voice into digital signals so as to be convenient for a computer system to recognize and understand.
The pre-trained voice authentication model is an as-yet-untrained voice authentication model, which may be built on deep learning or machine learning techniques, for example on SignatureNet models, convolutional neural networks (CNN), or long short-term memory networks (LSTM).
The training may be supervised training, a machine learning training process based on annotated data whose core is to build a voice authentication model by learning the mapping between inputs and outputs.
The trained voice authentication model is obtained by learning from a large number of voice data sets and can judge the authenticity of new voice data.
Further, the spectrum of audio generated by a vocoder is not identical to the spectrum of real audio; the spectral difference between generated and original audio varies with the vocoder type, and the audio generated by each type of vocoder carries unique implicit characteristics.
In one embodiment, the different vocoders described above may be used to generate about 120,000 audio clips, with the amount of audio generated by each type of vocoder kept roughly balanced, and about 20,000 real audio clips are also selected, forming a speech data set of about 140,000 clips. The label of each generated audio clip consists of the type of vocoder that generated it and "fake", while the label of real audio consists of "real" and "real". The training and test data are divided in a 6:1 ratio.
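As an illustrative sketch (not part of the patent text), the labeling scheme and the 6:1 train/test split described above could be assembled as follows; the file names and vocoder names are hypothetical placeholders:

```python
import random

# Hypothetical file lists standing in for the ~120,000 generated and
# ~20,000 real clips described above.
generated = [("gen_0001.wav", "MelGAN"), ("gen_0002.wav", "HiFi-GAN")]
real = ["real_0001.wav", "real_0002.wav"]

# Each generated clip is labeled (vocoder type, "fake");
# each real clip is labeled ("real", "real").
dataset = [(path, (voc, "fake")) for path, voc in generated]
dataset += [(path, ("real", "real")) for path in real]

# 6:1 train/test split, i.e. the test set is 1/7 of the data.
random.shuffle(dataset)
cut = len(dataset) * 6 // 7
train, test = dataset[:cut], dataset[cut:]
```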
It should be noted that, in the training process, parameters of the model are adjusted according to the supervisory signals on the voice data set, so that the model can better identify the authenticity of different voices.
Step S202, vocoder classification processing is performed on the target audio through the trained voice authentication model, and the vocoder to which the target audio belongs is determined.
In this embodiment, the target audio refers to audio that needs to be subjected to voice authentication.
Specifically, the target audio can be input into the trained voice authentication model to determine the vocoder to which it belongs.
The above-described vocoder classification process may be understood as a process of performing further classification processing on the target audio to determine the type of vocoder to which it belongs.
Furthermore, a vocoder feature extraction module is added to the voice authentication model. It assists the model in classifying audio, ensures that the extracted features accurately capture the different statistical characteristics of vocoders, and makes the model more sensitive to different vocoder features.
The vocoder features include fundamental frequency features, harmonic features, phase features, and the like. The fundamental frequency feature characterizes the sound source, i.e., the position of each pulse in the pulse sequence produced by vocal cord vibration; the spectral parameters reflect the joint modulation of the vocal tract (laryngeal cavity, oral cavity, lips, teeth, etc.), which can produce different vowels and consonants; the phase feature is the phase information of the audio signal in the frequency domain and describes the rotation direction and amplitude of the audio waveform.
Specifically, the process of training the vocoder feature extraction module is as follows. The original waveforms of the speech data set are pre-emphasized and then sent into an instance normalization layer and a parameterized analysis filter bank, which process them into a time-frequency representation. The representation is then passed through three backbone blocks with residual connections, connected as follows: the output of the first block serves as the input of the second block, and the sum of the outputs of the first and second blocks serves as the input of the third block. To reduce overfitting, the output of the first block undergoes a max-pooling operation, after which the pooled output of the first block and the outputs of the other two blocks are simultaneously fed into a convolution layer with batch normalization. Each backbone block has a structure similar to that of ECAPA-TDNN, but uses an AFMS module instead of squeeze-and-excitation and performs a max-pooling operation before the AFMS. The inputs of the three backbone blocks are then added, followed by a one-dimensional convolution layer and two fully connected layers, to obtain the output result of the vocoder. It should be noted that this convolution layer and these fully connected layers are used only during training and not in the inference (authentication) stage, mainly to keep the voice authentication model efficient and speed up inference: for network structures with large parameter counts, using them at inference time would increase computation and memory consumption and slow down the whole model. In addition, their parameters are learned in the training stage and can be used directly for inference, avoiding repeated computation.
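As a hedged sketch (not from the patent), the connection pattern of the three backbone blocks just described could look as follows in PyTorch; the BackboneBlock here is a single-convolution placeholder for the ECAPA-TDNN-style block with AFMS, and all layer sizes are assumptions:

```python
import torch
import torch.nn as nn

class BackboneBlock(nn.Module):
    # Placeholder for the ECAPA-TDNN-style block with an AFMS module.
    def __init__(self, channels):
        super().__init__()
        self.conv = nn.Conv1d(channels, channels, kernel_size=3, padding=1)
        self.act = nn.ReLU()

    def forward(self, x):
        return self.act(self.conv(x))

class ThreeBackboneBlocks(nn.Module):
    def __init__(self, channels=64):
        super().__init__()
        self.block1 = BackboneBlock(channels)
        self.block2 = BackboneBlock(channels)
        self.block3 = BackboneBlock(channels)
        # Length-preserving max pooling applied to block 1's output.
        self.pool = nn.MaxPool1d(kernel_size=3, stride=1, padding=1)
        self.conv = nn.Sequential(
            nn.Conv1d(channels * 3, channels, kernel_size=1),
            nn.BatchNorm1d(channels),  # "convolution layer with batch normalization"
        )

    def forward(self, x):                  # x: (batch, channels, time)
        out1 = self.block1(x)              # output of block 1 -> input of block 2
        out2 = self.block2(out1)
        out3 = self.block3(out1 + out2)    # sum of blocks 1 and 2 -> input of block 3
        feats = torch.cat([self.pool(out1), out2, out3], dim=1)
        return self.conv(feats)

blocks = ThreeBackboneBlocks()
features = blocks(torch.randn(2, 64, 200))  # two clips, 200 time-frequency frames
```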
The three backbone blocks with residual connections described above can be understood as three neural network backbone structures with residual connections (analogous to Inception modules in convolutional neural networks) used to extract audio features. A residual connection is a skip connection that allows the gradient to propagate directly to deeper layers, alleviating the vanishing-gradient problem. The Inception module mainly addresses CNN's handling of inputs of different sizes and improves model efficiency and performance.
ECAPA-TDNN is a time delay neural network (Time Delay Neural Network) characterized by the introduction of time delays and gating units into a conventional recurrent neural network (RNN) architecture.
The AFMS (Auxiliary Feature Matrix Sum) module introduces an attention mechanism into the feature interaction process to learn the degree to which different features influence the prediction result.
Further, to more accurately capture different vocoder features and identify the vocoder type, a multi-class loss function is used to constrain the vocoder feature extraction module: the vocoder features are recognized and the vocoder class is output (the total number of classes is the number of vocoder classes + 1). Optimizing the feature extraction module through vocoder recognition makes the network more sensitive to vocoder features, and the probability that the input audio belongs to a given vocoder class is obtained through a softmax function. The loss function used is:

Loss1 = -Σ_{i=0}^{C-1} y_i · log(p_i)

where y_i indicates whether the audio belongs to the i-th vocoder class, p_i denotes the predicted probability that the audio belongs to the i-th vocoder class, and C - 1 is the index of the last class (C classes in total).
The outputs of the three backbone networks are sent to a GRU layer (gated recurrent unit layer), which aggregates the frame-level representation into an utterance-level representation; a fully connected layer with a softmax function then produces the prediction. Using the vocoder features to assist the classification task further improves the accuracy of voice authentication.
The classification loss function is:

Loss2 = -[y_n · log(σ(x_n)) + (1 - y_n) · log(1 - σ(x_n))]

where y_n is the true label of audio n, x_n is the model output, and σ(·) is the sigmoid function, which maps x_n to the interval (0, 1) as the predicted probability.

The loss function of the trained voice authentication model is Loss = λ · Loss1 + (1 - λ) · Loss2, where λ is a tunable hyperparameter that controls the relative weights of the binary classification branch and the multi-class branch.
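As a hedged sketch (not from the patent), the combined loss could be computed in PyTorch as follows; the tensor shapes, class count, and value of λ are assumptions:

```python
import torch
import torch.nn.functional as F

def combined_loss(multi_logits, multi_target, bin_logit, bin_target, lam=0.5):
    # Loss = lam * Loss1 + (1 - lam) * Loss2, as described above.
    loss1 = F.cross_entropy(multi_logits, multi_target)  # multi-class: -sum y_i log p_i
    loss2 = F.binary_cross_entropy_with_logits(bin_logit, bin_target)  # binary branch
    return lam * loss1 + (1 - lam) * loss2

# Example with random tensors (assume C = 8 vocoder classes, batch of 4):
logits = torch.randn(4, 8)
classes = torch.randint(0, 8, (4,))
real_fake_logit = torch.randn(4)
real_fake_label = torch.randint(0, 2, (4,)).float()
print(combined_loss(logits, classes, real_fake_logit, real_fake_label, lam=0.7))
```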
And step S203, performing feature extraction processing on the target audio in the corresponding vocoder to which the target audio belongs to obtain the target audio features.
In the present embodiment, the above-described feature extraction process can be understood as a process of extracting a representative feature vector representation from the target audio.
Specifically, the target audio may be preprocessed, for example to remove silent portions and reduce noise. Depending on the vocoder type to which the target audio belongs, a corresponding feature extraction algorithm is selected, such as short-time energy. The feature extraction algorithm is applied to the target audio to obtain its features.
The target audio feature is a set of values extracted from the original target audio signal that are used to describe the inherent characteristics of the audio. The target audio features may be derived from a time-domain or frequency-domain analysis of the audio and may reflect key information of the audio data.
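As a hedged illustration of one such feature, a minimal short-time energy computation in Python; the frame length and hop size are assumptions:

```python
import numpy as np

def short_time_energy(signal, frame_len=512, hop=256):
    # Frame the signal and compute per-frame energy.
    frames = [signal[i:i + frame_len]
              for i in range(0, len(signal) - frame_len + 1, hop)]
    return np.array([np.sum(f.astype(np.float64) ** 2) for f in frames])

audio = np.random.randn(16000)  # 1 s of hypothetical 16 kHz audio
energy = short_time_energy(audio)
```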
Step S204, authentication processing is performed on the target audio features, and the authentication processing result of the target audio is determined.
In this embodiment, the authentication process can be understood as a process of recognizing and confirming fake audio.
In this embodiment, the target audio is input into the trained voice authentication model, processed into a time-frequency representation by a normalization layer and a parameterized analysis filter bank (paramFbank), and sent into three backbone blocks with residual connections for feature extraction; the features are then sent into a GRU layer (gated recurrent unit layer) for information aggregation. Finally, the predicted authentication result of the target audio is obtained through the fully connected layer and the softmax function.
The GRU (gated recurrent unit) is a recurrent neural network structure that can model and predict sequence data. The GRU layer integrates the features extracted from the three backbone blocks with residual connections into more global contextual information.
The Softmax function is a commonly used activation function for normalizing a numerical vector into a probability distribution vector, and the sum of the probabilities is 1.
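A minimal sketch (not from the patent) of the GRU aggregation and softmax head just described; the feature and hidden dimensions are assumptions:

```python
import torch
import torch.nn as nn

class AuthHead(nn.Module):
    # Aggregates frame-level features with a GRU, then a fully connected
    # layer with softmax produces the real/fake probabilities.
    def __init__(self, feat_dim=128, hidden=64, n_classes=2):
        super().__init__()
        self.gru = nn.GRU(feat_dim, hidden, batch_first=True)
        self.fc = nn.Linear(hidden, n_classes)

    def forward(self, frame_feats):           # (batch, time, feat_dim)
        _, h = self.gru(frame_feats)          # final hidden state: (1, batch, hidden)
        return torch.softmax(self.fc(h.squeeze(0)), dim=-1)  # probabilities sum to 1

head = AuthHead()
probs = head(torch.randn(2, 100, 128))        # two clips, 100 frames each
```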
In this embodiment, a vocoder recognition module is added on top of the voice authentication model, so that the model can more accurately capture vocoder features while recognizing real and fake audio; the feature extractor becomes sensitive to vocoder features, which simplifies the identification of real and fake audio. The application uses the idea of multi-task training, integrating the binary classification task and the multi-class classification task into one model with shared weights. It improves the accuracy of voice authentication by using vocoder feature recognition, can effectively improve detection accuracy without increasing inference complexity, and can be applied to business scenarios such as financial anti-fraud.
According to this method, a voice data set is acquired and a pre-trained voice authentication model is trained with it to obtain a trained voice authentication model; vocoder classification processing is performed on the target audio through the trained model to determine the vocoder to which the target audio belongs; feature extraction processing is performed on the target audio in that vocoder to obtain target audio features; and authentication processing is performed on the target audio features to determine the authentication result. This solves the problems that existing methods have low accuracy and poor generalization in real environments.
With continued reference to fig. 3, a flow chart of one embodiment of step S201 in fig. 2 is shown. The step S201 specifically includes the following steps:
Step S2011, obtaining sample real audio and different kinds of vocoders.
In this embodiment, the sample real audio may be understood as real audio that is not modified or edited.
The different types of vocoders include three families: GAN-based vocoders (MelGAN, Full-band MelGAN, Multi-band MelGAN, HiFi-GAN, and Parallel WaveGAN), the autoregressive-model-based WaveRNN, and the diffusion-model-based DiffWave.
The vocoder is a device for converting sound signals into digital signals for processing and transmission, and functions to convert natural speech into a machine-readable form. The vocoder may convert the sound signal into a digital signal by sampling, quantizing, encoding, etc., so that it can be analyzed and processed by the computer system.
Step S2012, performing audio generation processing on the sample real audio in the vocoders of different categories to obtain generated audio of different categories.
In this embodiment, the above-described audio generation process may be understood as a process of performing audio processing on a real audio sample to generate different kinds of audio.
And step S2013, calculating the correlation between the generated audio and the sample real audio, determining the correlation threshold corresponding to the category of the vocoder, and determining the generated audio whose correlation is greater than the correlation threshold as the target generated audio.
In this embodiment, the correlation between the generated audio and the sample real audio may be calculated using a voice analysis technique. The similarity of the two audios is calculated using, for example, a cepstrum similarity (CCS) or a sound event detection technique.
And step S2014, determining the annotation of the target generated audio.
In this embodiment, the annotation of the target generated audio consists of two parts: the type of the vocoder that generated it, and a "fake" label. Different vocoders have different characteristics such as timbre and speaking rate. The "fake" label can be understood as marking a generated sound signal, produced by a vocoder algorithm rather than a sound that actually exists.
According to the application, the sample real audio and vocoders of different categories are obtained; the sample real audio undergoes audio generation processing in the vocoders of different categories to obtain generated audio of different categories; the correlation between the generated audio and the sample real audio is then calculated, the correlation threshold corresponding to the vocoder category is determined, the generated audio whose correlation is greater than the threshold is determined as the target generated audio, and its annotation is determined. This allows the generated audio data to be better managed and used, improving the quality and acceptability of the generated audio.
With continued reference to FIG. 4, a flow chart of one embodiment of step S2014 in FIG. 3 is shown. Step S2014 specifically includes the following steps:
Step S20141, calculating the high-pitch similarity between the generated audio and the sample real audio.
In the present embodiment, the above-described similarity of treble can be understood as the similarity of treble portions in two audio signals. The similarity between the generated audio and the sample real audio can be measured by comparing the frequency, amplitude, etc. characteristics of the treble part in the two audios.
Further, the audio signal may be converted into a spectrogram by a spectral analysis algorithm to look at the relative intensities of the different frequency components. By comparing the spectrograms of the generated audio and the real audio, their similarity in the treble portion can be determined. Spectral analysis algorithms include Fast Fourier Transforms (FFTs) and short-time fourier transforms (STFTs).
The audio signal may be converted into a cepstrum by a cepstrum analysis algorithm to view phase information of the audio signal. By comparing the cepstral patterns of the generated audio and the real audio, their similarity in the treble portion can be determined. Cepstral analysis algorithms include mel-frequency cepstral coefficients (MFCCs), linear Predictive Cepstral Coefficients (LPCCs), and the like.
Step S20142, calculating the bass similarity between the generated audio and the sample real audio.
In the present embodiment, the above-described bass similarity can be understood as the degree of similarity of bass portions in two audio signals. The similarity between the generated audio and the sample real audio can be measured by comparing the characteristics of the frequency, amplitude, etc. of the bass portions of the two audio.
Further, the audio signal may be converted into a spectrogram by a spectral analysis algorithm to look at the relative intensities of the different frequency components. By comparing the spectrograms of the generated audio and the real audio, their similarity in the bass portion can be determined. Spectral analysis algorithms include Fast Fourier Transforms (FFTs) and short-time fourier transforms (STFTs).
The audio signal may be converted into a cepstrum by a cepstrum analysis algorithm to view phase information of the audio signal. By comparing the cepstral patterns of the generated audio and the real audio, their similarity in the bass portion can be determined. Cepstral analysis algorithms include mel-frequency cepstral coefficients (MFCCs), linear Predictive Cepstral Coefficients (LPCCs), and the like.
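A hedged sketch (not from the patent) of one way to score band-limited spectral similarity, covering both the treble and bass comparisons described above; the band edges, FFT size, and cosine-similarity choice are assumptions:

```python
import numpy as np

def band_similarity(x, y, sr=16000, lo=0.0, hi=2000.0, n_fft=1024):
    # Cosine similarity of the magnitude spectra (of the first n_fft
    # samples) of two signals restricted to one frequency band.
    def band_spectrum(sig):
        spec = np.abs(np.fft.rfft(sig, n=n_fft))
        freqs = np.fft.rfftfreq(n_fft, d=1.0 / sr)
        return spec[(freqs >= lo) & (freqs < hi)]
    a, b = band_spectrum(x), band_spectrum(y)
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-9))

gen, real = np.random.randn(16000), np.random.randn(16000)
bass_sim = band_similarity(gen, real, lo=0.0, hi=500.0)        # bass band
treble_sim = band_similarity(gen, real, lo=4000.0, hi=8000.0)  # treble band
```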
And step S20143, calculating the lexical similarity between the generated audio and the sample real audio.
In this embodiment, lexical similarity can be understood as the degree of similarity of words or phrases between two sentences or texts in terms of grammatical structure, word choice, and semantic content. In natural language processing, lexical similarity is typically used to measure the similarity of two sentences at the lexical level. Its computation is generally based on information such as word co-occurrence, part-of-speech tagging, and syntactic structure; cosine similarity, Jaccard similarity, edit distance, and the like can be used to compute the similarity of the generated audio and the sample real audio at the lexical level.
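As a minimal hedged example of one of these measures, Jaccard similarity over word sets; in practice the transcripts would come from a speech recognizer, and the sentences here are hypothetical:

```python
def jaccard_similarity(text_a, text_b):
    # Jaccard similarity over word sets: |A ∩ B| / |A ∪ B|.
    a, b = set(text_a.lower().split()), set(text_b.lower().split())
    return len(a & b) / len(a | b) if (a | b) else 1.0

print(jaccard_similarity("open the door please", "please open the door"))  # 1.0
```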
Step S20144, calculating global semantic similarity between the generated audio and the sample real audio.
In this embodiment, the above-mentioned global semantic similarity can be understood as evaluating how similarly two passages of speech express the content as a whole. Pre-trained language models, such as BERT and GPT-2, may be used to convert text into semantic representations and calculate similarity scores between them.
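A hedged sketch of scoring global semantic similarity via sentence embeddings; the sentence-transformers package and the "all-MiniLM-L6-v2" checkpoint are assumptions, since the patent only names pre-trained language models such as BERT and GPT-2:

```python
from sentence_transformers import SentenceTransformer
import numpy as np

model = SentenceTransformer("all-MiniLM-L6-v2")  # assumed model choice
emb = model.encode(["transcript of the generated audio",
                    "transcript of the sample real audio"])
cos = float(np.dot(emb[0], emb[1]) /
            (np.linalg.norm(emb[0]) * np.linalg.norm(emb[1])))
print(cos)  # global semantic similarity score
```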
And step S20145, weighted fusion processing is performed on the treble similarity, the bass similarity, the lexical similarity, and the global semantic similarity to obtain the correlation between the generated audio and the sample real audio.
In this embodiment, once the electronic device has obtained the treble similarity, the bass similarity, the lexical similarity, and the global semantic similarity, it performs weighted fusion processing on them to obtain the correlation between the generated audio and the sample real audio.
Specifically, each of the four similarities is assigned a weight value representing its relative importance in the final correlation score; the weight coefficients can be determined by techniques such as cross-validation. The value of each similarity is multiplied by its weight coefficient, and the products are summed to give a composite score that reflects the correlation between the generated audio and the sample real audio. The higher the composite score, the higher the correlation.
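A minimal sketch of the weighted fusion just described; the weight values are hypothetical and would be tuned, e.g., by cross-validation:

```python
def fuse_similarities(treble, bass, lexical, semantic,
                      weights=(0.25, 0.25, 0.25, 0.25)):
    # Weighted average of the four similarity scores.
    scores = (treble, bass, lexical, semantic)
    return sum(w * s for w, s in zip(weights, scores)) / sum(weights)

correlation = fuse_similarities(0.8, 0.7, 0.9, 0.85, weights=(0.2, 0.2, 0.3, 0.3))
```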
According to the application, the treble similarity, the bass similarity, the lexical similarity, and the global semantic similarity between the generated audio and the sample real audio are calculated, and the correlation between them is determined from these four similarities, which helps users better evaluate the quality of the generated audio.
With continued reference to fig. 5, a flow chart of one embodiment of step S2014 in fig. 3 is shown. Step S2014 specifically includes the following steps:
step S20146, determining the weight corresponding to the type of the vocoder based on the generation accuracy of the vocoders of different types.
In this embodiment, the higher the generation accuracy of a vocoder, the lower its weight. This is because the greater the role a vocoder plays in speech synthesis, the higher the quality of the speech it generates, so its weight needs to be lowered to balance the weight allocation of the overall speech synthesis system.
The above-mentioned generation accuracy can be understood as a correlation between the generated audio generated by the vocoder and the sample real audio.
The weights are used to measure the performance of different classes of vocoders in generating audio. The weight value may represent the importance of the vocoder in the overall evaluation.
Step S20147, determining the effective duration and the effective phoneme number in the sample real audio.
In this embodiment, the effective duration is the sum of the durations of the effective phonemes, which are the phonemes of the person, that is, the sound features of the person.
Effective phonemes are those that represent a person's speech and are meaningful for communicating and understanding the speech content, rather than meaningless noise or mispronunciations.
Further, the effective phonemes in the sample real audio can be determined by preprocessing the sample real audio, extracting effective features from the preprocessed sample real audio, and finally analyzing and counting the extracted features. The preprocessing includes removing noise, mute delete, volume balance, etc. to ensure the quality and accuracy of the data. The extracted features may be extracted using mel-frequency cepstrum coefficients (MFCCs), sonograms (spectrogram), and the like. Analysis and statistics can be performed by cluster analysis or factor analysis, etc.
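As a hedged illustration, non-silent intervals detected with librosa can stand in loosely for the effective regions described above; the top_db threshold and the bundled example clip are assumptions:

```python
import librosa

y, sr = librosa.load(librosa.ex("trumpet"))   # bundled example clip
intervals = librosa.effects.split(y, top_db=30)  # non-silent intervals
effective_duration = sum(end - start for start, end in intervals) / sr
print(f"effective duration ~ {effective_duration:.2f} s over {len(intervals)} segments")
```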
And step S20148, carrying out weighted summation processing on the weight, the effective duration and the number of effective phonemes to obtain a correlation threshold corresponding to the category of the vocoder.
In this embodiment, the correlation threshold corresponding to the class of the vocoder is obtained by performing weighted summation processing according to the weight, the effective duration and the number of effective phonemes.
The above weighted summation process can be understood as a process of multiplying each data by a weight and then adding them together when calculating the sum of a set of data.
Specifically, a weight value is allocated to the weight, the effective duration and the number of effective phonemes corresponding to the category of the vocoder. For example, the weight value corresponding to the class of the vocoder may be assigned to 0.5, the effective duration value to 0.3, and the effective phoneme number value to 0.2. These weight values may be adjusted according to the importance and application scenario of the vocoder. Then, the weight, the effective duration and the number of effective phonemes corresponding to the category of the vocoder are multiplied by the corresponding weight value to obtain a weighted numerical value. Next, the weighted values are compared with a default correlation threshold based on the default correlation threshold. If the weighted value is greater than the default correlation threshold, the vocoder is considered to have a correlation with the target audio, otherwise, no correlation is considered between them.
Further, the weighting, the effective duration and the number of the effective phonemes corresponding to the category of the vocoder can be weighted and summed to determine a correlation threshold corresponding to the category of the vocoder. The correlation threshold represents the performance criteria that the vocoder should achieve under that class. If the actual correlation of the vocoder exceeds this correlation threshold, its performance under that class is considered acceptable. The performance of the vocoder under different categories can be accurately estimated through the correlation threshold, and the accuracy of vocoder estimation is improved.
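A minimal sketch of the weighted summation just described, using the illustrative 0.5/0.3/0.2 coefficients from the example above; normalizing the duration and phoneme count to [0, 1] is an assumed scheme, not specified in the text:

```python
def correlation_threshold(vocoder_weight, effective_duration, n_phonemes,
                          coeffs=(0.5, 0.3, 0.2)):
    # Weighted sum of the vocoder weight, effective duration, and
    # effective phoneme count.
    w1, w2, w3 = coeffs
    dur_norm = min(effective_duration / 10.0, 1.0)  # assume a 10 s cap
    ph_norm = min(n_phonemes / 100.0, 1.0)          # assume a 100-phoneme cap
    return w1 * vocoder_weight + w2 * dur_norm + w3 * ph_norm

threshold = correlation_threshold(0.8, effective_duration=6.2, n_phonemes=45)
```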
The application determines the weight corresponding to the type of the vocoder based on the generation accuracy of the vocoders of different types, determines the effective duration and the effective phoneme number in the sample real audio, and carries out weighted summation processing on the weight, the effective duration and the effective phoneme number to obtain the correlation threshold corresponding to the type of the vocoder, thereby more accurately determining the correlation threshold corresponding to the type of the vocoder and further improving the working efficiency.
With continued reference to fig. 6, a flow chart of one embodiment of the voice authentication method of fig. 2 is shown. The voice authentication method specifically further comprises the following steps:
Step S601, a pre-trained voice fake-identification model is obtained, a vocoder feature extraction module is added into the pre-trained voice fake-identification model, and pre-emphasis processing is performed on vocoder features of different categories in the vocoder feature extraction module to obtain pre-emphasized vocoder features.
In this embodiment, the pre-trained voice fake-identification model is a model for voice authentication constructed based on deep learning or machine learning technology; it may be built on SignatureNet models, Convolutional Neural Networks (CNNs), Long Short-Term Memory networks (LSTMs), and the like.
The vocoder feature extraction module assists the voice fake-identification model in classifying speech, ensuring that the extracted features accurately capture the differing statistical characteristics of vocoders and making the model more sensitive to different vocoder features.
Further, the vocoder feature extraction module needs to be trained. In training, the original waveforms of the original vocoder voice data set are first pre-emphasized, then fed into an instance normalization layer and a parameterized analysis filter bank to be converted into a time-frequency representation, which is then passed through three backbone blocks with residual connections. The three backbone blocks are connected as follows: the output of the first block serves as the input of the second block, and the sum of the outputs of the first and second blocks serves as the input of the third block. To reduce overfitting of the model, the output of the first block undergoes a max-pooling operation, and the pooled output of the first block together with the outputs of the other two blocks are fed simultaneously into a convolutional layer with batch normalization. Each backbone block has a structure similar to that of ECAPA-TDNN, but uses an AFMS module instead of squeeze-and-excitation and performs a max-pooling operation before the AFMS. The inputs of the three backbone blocks are then added, followed by a one-dimensional convolutional layer and two fully connected layers, to obtain the output result of the vocoder. It should be noted that this convolutional layer and these fully connected layers are used only during training and not in the inference (fake-identification) stage, mainly to improve the efficiency of the voice fake-identification model and accelerate inference. For network structures with large parameter counts such as the convolutional and fully connected layers, use in the inference stage would increase computation and memory consumption, slowing down the inference speed of the whole voice fake-identification model. In addition, the parameters of these layers have already been learned in the training stage, avoiding repeated calculation during inference.
ECAPA-TDNN is a Time Delay Neural Network (TDNN)-based architecture; its characteristic is that it extends the conventional TDNN with channel attention, propagation, and aggregation mechanisms.
The AFMS (Auxiliary Feature Matrix Sum) module introduces an attention mechanism into the feature-interaction process to learn the degree to which different features influence the prediction result.
The vocoder features include fundamental frequency features, harmonic features, phase features, and the like. The fundamental frequency feature characterizes the sound source, i.e., the position of each pulse in the pulse train generated when the vocal folds vibrate; the harmonic features reflect the joint modulation of the vocal tract (laryngeal cavity, oral cavity, lips, teeth, and the like), which produces different vowels and consonants; the phase features are the phase information of the audio signal in the frequency domain and describe the rotation direction and amplitude of the audio signal's waveform.
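For illustration, two of the features named above can be extracted with a few librosa calls; the file path, sampling rate, and pitch range below are placeholder assumptions:

```python
import librosa
import numpy as np

y, sr = librosa.load("speech.wav", sr=16000)  # hypothetical input file

# Fundamental frequency (F0) track via the pYIN estimator.
f0, voiced_flag, voiced_prob = librosa.pyin(
    y, fmin=librosa.note_to_hz("C2"), fmax=librosa.note_to_hz("C7"), sr=sr)

# Phase feature: phase of the audio signal in the frequency domain.
phase = np.angle(librosa.stft(y))
```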
Pre-emphasis processing can be understood as a method for preprocessing the original signal: by boosting the amplitude of the high-frequency components, the frequency response of the signal is made flatter when it is processed at the receiving end. Pre-emphasis enhances the energy of the high-frequency band, improving the quality of the audio signal and reducing distortion and noise.
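Pre-emphasis is typically the one-line first-order filter y[n] = x[n] - a*x[n-1]; the coefficient a = 0.97 below is a common default we assume here, not a value specified by this application:

```python
import numpy as np

def pre_emphasis(x, a=0.97):
    # y[0] = x[0]; y[n] = x[n] - a * x[n-1] for n >= 1
    return np.append(x[0], x[1:] - a * x[:-1])
```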
Step S602, the pre-emphasized vocoder features are fed into an instance normalization layer and a parameterized analysis filter bank to be processed into a time-frequency representation, obtaining the vocoder feature time-frequency representation.
In this embodiment, the instance normalization layer normalizes the vocoder features, which can accelerate the training process and improve the performance of the model.
The parameterized analysis filter bank determines the spectral characteristics of the vocoder features for subsequent processing.
The time-frequency representation is a way of describing a signal in both the time and frequency dimensions simultaneously and can be used to characterize the energy distribution and variation of the signal. By feeding the vocoder features into the parameterized analysis filter bank, a time-frequency representation of the vocoder features is obtained.
Step S603, the vocoder feature time-frequency representation is passed through three backbone blocks with residual connections.
In this embodiment, the three backbone blocks with residual connections can be understood as three neural-network backbone structures with residual connections (for example, Inception-style modules in a convolutional neural network, CNN) used to extract audio features. A residual connection is a skip connection that allows the gradient to propagate directly to deeper layers, thereby alleviating the vanishing-gradient problem. The Inception module mainly addresses the difficulty CNNs have in handling inputs at different scales and improves the efficiency and performance of the model.
The three backbone blocks are connected as follows: the output of the first block is used as the input of the second block, and the sum of the outputs of the first and second blocks is used as the input of the third block. To reduce overfitting of the model, the output of the first block undergoes a max-pooling operation, and the pooled output of the first block together with the outputs of the other two blocks are fed simultaneously as input into a convolutional layer with batch normalization. Each backbone block has a structure similar to that of ECAPA-TDNN, but uses an AFMS module instead of squeeze-and-excitation and performs a max-pooling operation before the AFMS.
Step S604, the inputs of the three residually connected backbone blocks are added, followed by a one-dimensional convolutional layer and two fully connected layers, to obtain the output result of the vocoder.
In this embodiment, the one-dimensional convolution layer performs convolution operation on the input vector using a one-dimensional filter, so as to extract feature information of the vocoder feature time-frequency representation.
The fully connected layer uses fully connected neurons to carry out nonlinear transformation on the input feature vectors, so that the output result of the vocoder is obtained.
The convolutional layer and the fully connected layers used in training are not used in the inference (fake-identification) stage; this is mainly to improve the efficiency of the voice fake-identification model and accelerate inference. For network structures with large parameter counts such as these layers, use in the inference stage would increase computation and memory consumption, slowing down the inference speed of the whole voice fake-identification model. In addition, the parameters of the convolutional layer and the fully connected layers have already been learned in the training stage, avoiding repeated calculation during inference.
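The following PyTorch sketch shows one plausible reading of the pipeline in steps S601 to S604. The plain Conv1d residual blocks stand in for the ECAPA-TDNN-style backbone blocks with AFMS, and the channel count, pooling kernel, fusion wiring, and class count are all assumptions, not a definitive implementation of this application:

```python
import torch
import torch.nn as nn

class BackboneBlock(nn.Module):
    """Simplified stand-in for an ECAPA-TDNN-style block with a residual path."""
    def __init__(self, ch):
        super().__init__()
        self.body = nn.Sequential(
            nn.Conv1d(ch, ch, kernel_size=3, padding=1),
            nn.BatchNorm1d(ch), nn.ReLU(),
            nn.Conv1d(ch, ch, kernel_size=3, padding=1),
            nn.BatchNorm1d(ch),
        )
        self.act = nn.ReLU()

    def forward(self, x):
        return self.act(x + self.body(x))  # residual (skip) connection

class VocoderFeatureExtractor(nn.Module):
    def __init__(self, ch=64, n_classes=8):
        super().__init__()
        self.block1 = BackboneBlock(ch)
        self.block2 = BackboneBlock(ch)
        self.block3 = BackboneBlock(ch)
        # Length-preserving max pooling applied to the first block's output.
        self.pool = nn.MaxPool1d(kernel_size=3, stride=1, padding=1)
        # Convolutional layer with batch normalization fusing the three outputs.
        self.fuse = nn.Sequential(
            nn.Conv1d(3 * ch, ch, kernel_size=1),
            nn.BatchNorm1d(ch), nn.ReLU(),
        )
        self.conv1d = nn.Conv1d(ch, ch, kernel_size=3, padding=1)
        self.fc1 = nn.Linear(ch, ch)
        self.fc2 = nn.Linear(ch, n_classes)

    def forward(self, x):               # x: (batch, ch, frames) time-frequency input
        o1 = self.block1(x)             # first backbone block
        o2 = self.block2(o1)            # output of block 1 feeds block 2
        o3 = self.block3(o1 + o2)       # sum of outputs 1 and 2 feeds block 3
        fused = torch.cat([self.pool(o1), o2, o3], dim=1)
        h = self.conv1d(self.fuse(fused))
        h = h.mean(dim=-1)              # temporal pooling before the FC layers
        return self.fc2(torch.relu(self.fc1(h)))

# Usage with a dummy time-frequency input (batch 4, 64 bands, 200 frames):
logits = VocoderFeatureExtractor()(torch.randn(4, 64, 200))
```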
According to the application, a pre-trained voice fake-identification model is obtained; a vocoder feature extraction module is added into it; vocoder features of different categories are pre-emphasized in that module to obtain pre-emphasized vocoder features; these are fed into an instance normalization layer and a parameterized analysis filter bank to be processed into a time-frequency representation; the vocoder feature time-frequency representation is passed through three backbone blocks with residual connections; and the inputs of the three backbone blocks are added, followed by a one-dimensional convolutional layer and two fully connected layers, to obtain the output result of the vocoder. By training the vocoder feature extraction module, the extracted features are ensured to accurately capture the differing statistical characteristics of vocoders, improving the accuracy and efficiency of speech processing tasks.
With continued reference to fig. 7, a flow chart of one specific embodiment of step S203 of fig. 2 is shown. Step S203 specifically includes the following steps:
Step S2031, performing time-frequency analysis processing on the target audio in the vocoder corresponding to the target audio, to obtain the target audio after the time-frequency analysis processing.
In this embodiment, the time-frequency analysis processing is used to convert a non-stationary signal (such as sound or vibration) from the time domain to the frequency domain for analysis and processing. Time-frequency analysis extracts the features of a signal in both the time and frequency domains by applying a time-frequency transform, so as to better understand the nature and characteristics of the signal.
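A common realization of this step is the short-time Fourier transform; the snippet below is one such sketch, with the file path and STFT parameters as illustrative assumptions (the application does not mandate a specific transform):

```python
import numpy as np
import librosa

y, sr = librosa.load("target_audio.wav", sr=16000)  # hypothetical target audio
spec = np.abs(librosa.stft(y, n_fft=512, hop_length=160))  # magnitude spectrogram
log_spec = librosa.amplitude_to_db(spec)  # log scale, convenient for analysis
```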
Step S2032, inputting the target audio after time-frequency analysis processing into three backbone blocks with residual connection for feature extraction processing, and obtaining the target audio features.
In this embodiment, the three backbone blocks with residual connections can be understood as three neural-network backbone structures with residual connections (for example, Inception-style modules in a CNN) used to extract audio features. A residual connection is a skip connection that allows the gradient to propagate directly to deeper layers, alleviating the vanishing-gradient problem. The three residually connected backbone blocks help to improve the performance and generalization ability of the model.
According to the application, the target audio undergoes time-frequency analysis processing in the vocoder to which it belongs, yielding the time-frequency-analyzed target audio, which is then input into three backbone blocks with residual connections for feature extraction to obtain the target audio features. This effectively improves the accuracy and reliability of feature extraction, thereby improving the accuracy and efficiency of speech processing tasks.
With continued reference to fig. 8, a flow chart of one specific embodiment of step S2032 in fig. 7 is shown. Step S2032 specifically includes the steps of:
step S20321, inputting the target audio after the time-frequency analysis processing into the first backbone block for performing the first feature extraction processing, and outputting to obtain the first target audio feature to be processed.
Step S20322, inputting the first to-be-processed target audio feature into the second backbone block for second feature extraction processing, and outputting to obtain the second to-be-processed target audio feature.
Step S20323, inputting the first to-be-processed target audio feature and the second to-be-processed target audio feature into a third backbone block for performing a third feature extraction process, to obtain a third to-be-processed target audio feature.
And step S20324, inputting the first target audio feature to be processed, the second target audio feature to be processed and the third target audio feature to be processed into a convolution layer to perform fourth feature extraction processing, so as to obtain the target audio feature.
In this embodiment, the target audio after time-frequency analysis processing is input into the three backbone blocks with residual connections for feature extraction, obtaining the target audio features. The three backbone blocks with residual connections can be understood as three neural-network backbone structures with residual connections (for example, Inception-style modules in a CNN) used to extract audio features. A residual connection is a skip connection that allows the gradient to propagate directly to deeper layers, alleviating the vanishing-gradient problem. The three residually connected backbone blocks help to improve the performance and generalization ability of the model.
The output of the first backbone block is taken as the input of the second backbone block, and the sum of the outputs of the first and second backbone blocks is taken as the input of the third backbone block. This connection scheme allows the three backbone blocks to interconnect and enhances the propagation and transfer of features.
The target audio after time-frequency analysis processing is input into the first backbone block for the first feature extraction, outputting the first target audio feature to be processed; the first target audio feature to be processed is input into the second backbone block for the second feature extraction, outputting the second target audio feature to be processed; the first and second target audio features to be processed are input into the third backbone block for the third feature extraction, obtaining the third target audio feature to be processed; and the first, second, and third target audio features to be processed are input into a convolutional layer for the fourth feature extraction, obtaining the target audio features. This can effectively improve the accuracy and reliability of feature extraction.
As shown in fig. 9, a block diagram illustrates the connection manner of the three backbone blocks according to the present application. Specifically, the output of the first block is taken as the input of the second block, and the sum of the outputs of the first and second blocks is taken as the input of the third block. To reduce overfitting of the model, the output of the first block undergoes a max-pooling operation, and the pooled output of the first block together with the outputs of the other two blocks are fed simultaneously as input into a convolutional layer with batch normalization. Each backbone block has a structure similar to that of ECAPA-TDNN, but uses an AFMS module instead of squeeze-and-excitation and performs a max-pooling operation before the AFMS.
In this embodiment, the output of the first backbone block is taken as the input of the second backbone block, and the sum of the outputs of the first and second backbone blocks is taken as the input of the third backbone block. This connection scheme allows the three backbone blocks to interconnect and enhances the propagation and transfer of features.
ECAPA-TDNN is a Time Delay Neural Network (TDNN)-based architecture; its characteristic is that it extends the conventional TDNN with channel attention, propagation, and aggregation mechanisms.
The AFMS (Auxiliary Feature Matrix Sum) module introduces an attention mechanism into the feature-interaction process to learn the degree to which different features influence the prediction result.
To reduce the risk of model overfitting, the output of the first backbone block is max-pooled and then fed as input, simultaneously with the outputs of the other two backbone blocks, into a convolutional layer with batch normalization. This reduces the complexity of the model and improves its generalization ability.
The three backbone blocks with residual connection can effectively extract the characteristic representation of the voice signal and enhance the robustness and reliability of the model.
This embodiment can acquire and process related data based on artificial intelligence technology. Artificial Intelligence (AI) is the theory, method, technique, and application system that uses a digital computer or a digital-computer-controlled machine to simulate, extend, and expand human intelligence, perceive the environment, acquire knowledge, and use knowledge to obtain optimal results.
Artificial intelligence infrastructure technologies generally include technologies such as sensors, dedicated artificial intelligence chips, cloud computing, distributed storage, big data processing technologies, operation/interaction systems, mechatronics, and the like. The artificial intelligence software technology mainly comprises a computer vision technology, a robot technology, a biological recognition technology, a voice processing technology, a natural language processing technology, machine learning/deep learning and other directions. The application can be applied to the fields of artificial intelligence and big data, thereby promoting the construction of smart cities.
The application belongs to the field of smart cities, and can promote the construction of the smart city through the scheme.
Those skilled in the art will appreciate that implementing all or part of the above-described methods may be accomplished by computer-readable instructions stored in a computer-readable storage medium which, when executed, may comprise the steps of the method embodiments described above. The storage medium may be a non-volatile storage medium such as a magnetic disk, an optical disc, or a Read-Only Memory (ROM), or a volatile storage medium such as a Random Access Memory (RAM).
It should be understood that, although the steps in the flowcharts of the figures are shown in the order indicated by the arrows, these steps are not necessarily performed in that order. Unless explicitly stated herein, the order of execution is not strictly limited, and the steps may be performed in other orders. Moreover, at least some of the steps in the flowcharts may include multiple sub-steps or stages that are not necessarily completed at the same moment but may be executed at different times; their order of execution is not necessarily sequential, and they may be performed in turn or in alternation with other steps or with at least part of the sub-steps or stages of other steps.
With further reference to fig. 10, as an implementation of the method shown in fig. 2, the present application provides an embodiment of a voice authentication device, which corresponds to the embodiment of the voice authentication method shown in fig. 2, and the voice authentication device may be applied to various electronic devices.
As shown in fig. 10, the voice authentication device 1000 in this embodiment includes a training module 1001, a classification processing module 1002, an extraction processing module 1003, and an authentication processing module 1004. Wherein:
The training module 1001 is configured to obtain a voice data set, train the pre-trained voice authentication model through the voice data set, and obtain a trained voice authentication model;
the classification processing module 1002 is configured to perform vocoder classification processing on the target audio through the trained voice fake-identification model, and determine the vocoder to which the target audio belongs;
An extraction processing module 1003, configured to perform feature extraction processing on the target audio in a vocoder to which the target audio belongs, so as to obtain a target audio feature;
And the authentication processing module 1004 is configured to perform authentication processing on the target audio feature, and determine an authentication processing result of the target audio.
In this embodiment, a voice data set is acquired and used to train the pre-trained voice fake-identification model, obtaining a trained voice fake-identification model; the target audio is subjected to vocoder classification by the trained model to determine the vocoder to which it belongs; feature extraction is performed on the target audio in that vocoder to obtain the target audio features; and fake-identification processing is performed on the target audio features to determine the fake-identification result of the target audio. This addresses the problems of low accuracy and poor generalization of existing voice fake-identification methods in real environments.
Referring to fig. 11, which is a schematic structural diagram of an embodiment of the training module 1001 in fig. 10, the training module 1001 includes an acquisition sub-module 10011, a generation processing sub-module 10012, a first determination sub-module 10013, and a second determination sub-module 10014. Wherein:
an acquisition submodule 10011, configured to acquire sample real audio and different types of vocoders;
The generation processing submodule 10012 is configured to perform audio generation processing on the sample real audio in the vocoders of different categories to obtain generated audio of different categories;
A first determining submodule 10013, configured to calculate a correlation between the generated audio and the sample real audio, and determine the correlation threshold corresponding to the category of the vocoder, and determine the generated audio with a correlation greater than the correlation threshold as a target generated audio;
a second determining submodule 10014 is configured to determine the annotation of the target generated audio, where the annotation of the target generated audio is formed according to the vocoder class and the fakeness of the target generated audio.
In this embodiment, sample real audio and vocoders of different classes are obtained; audio generation processing is performed on the sample real audio in the vocoders of different classes to obtain generated audio of different classes; the correlation between the generated audio and the sample real audio is then calculated and the correlation threshold corresponding to the class of the vocoder is determined; if the correlation is greater than that threshold, the generated audio is determined to be target generated audio, and its annotation is determined. This allows the generated audio data to be managed and used better and improves the quality and acceptability of the generated audio.
Referring to fig. 12, which is a schematic structural diagram of an embodiment of the first determining submodule 10013 in fig. 11, the first determining submodule 10013 includes a first calculating unit 100131, a second calculating unit 100132, a third calculating unit 100133, a fourth calculating unit 100134, and a first determining unit 100135. Wherein:
A first calculating unit 100131, configured to calculate a high-pitch similarity between the generated audio and the sample real audio;
a second calculating unit 100132 for calculating a bass similarity between the generated audio and the sample real audio;
A third calculation unit 100133, configured to calculate a lexical similarity between the generated audio and the sample real audio;
A fourth calculating unit 100134, configured to calculate a global semantic similarity between the generated audio and the sample real audio;
The first determining unit 100135 is configured to perform weighted fusion processing on the high-pitch similarity, the bass similarity, the lexical similarity, and the global semantic similarity, and determine the correlation between the generated audio and the sample real audio.
In this embodiment, the high-pitch similarity, the bass similarity, the lexical similarity, and the global semantic similarity between the generated audio and the sample real audio are each calculated, and weighted fusion of the four yields the correlation between the generated audio and the sample real audio, helping the user better evaluate the quality of the generated audio.
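The weighted fusion itself can be written as a one-line combination; the equal weights below are an illustrative assumption, since the application leaves the fusion weights open:

```python
def audio_correlation(high_pitch, bass, lexical, semantic,
                      w=(0.25, 0.25, 0.25, 0.25)):
    """Weighted fusion of the four similarity scores into one correlation value."""
    return w[0] * high_pitch + w[1] * bass + w[2] * lexical + w[3] * semantic
```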
Referring to fig. 13, which is a schematic structural diagram of an embodiment of the first determining submodule 10013 in fig. 11, the first determining submodule 10013 further includes a second determining unit 100136, a third determining unit 100137, and a fourth determining unit 100138. Wherein:
a second determining unit 100136, configured to determine, based on the generation accuracy rates of the vocoders of different types, weights corresponding to the types of the vocoders, where the higher the generation accuracy rate is, the lower the weights are;
The third determining unit 100137 is configured to determine the effective duration and the number of effective phonemes in the sample real audio, where the effective duration is the sum of the durations of the effective phonemes, and the effective phonemes are phonemes produced by a human speaker;
And a fourth determining unit 100138, configured to perform weighted summation processing on the weight, the effective duration, and the number of effective phonemes, to obtain the correlation threshold corresponding to the class of the vocoder.
In this embodiment, based on the generating accuracy of the vocoders of different categories, the weight corresponding to the category of the vocoder is determined, the effective duration and the number of effective phonemes in the sample real audio are determined, and the weighted summation processing is performed on the weight, the effective duration and the number of effective phonemes to obtain the correlation threshold corresponding to the category of the vocoder, so that the correlation threshold corresponding to the category of the vocoder can be determined more accurately, thereby improving the working efficiency.
Referring to fig. 14, which is a schematic structural diagram of an embodiment of the voice authentication device 1000 in fig. 10, the voice authentication device 1000 further includes a pre-emphasis processing module 1005, a time-frequency processing module 1006, a processing module 1007, and an output module 1008. Wherein:
the pre-emphasis processing module 1005 is configured to obtain a pre-trained voice fake-identification model, add a vocoder feature extraction module into the pre-trained voice fake-identification model, and perform pre-emphasis processing on the vocoder features of different categories in the vocoder feature extraction module to obtain pre-emphasized vocoder features;
The time-frequency processing module 1006 is configured to send the pre-emphasis processed vocoder feature to an instance normalization layer and a parameterized analysis filter bank to process the pre-emphasis processed vocoder feature into a time-frequency representation, so as to obtain a vocoder feature time-frequency representation;
a processing module 1007, configured to process the vocoder feature time-frequency representation through three backbone blocks with residual connections, connected such that the output of the first backbone block is used as the input of the second backbone block, and the sum of the outputs of the first and second backbone blocks is used as the input of the third backbone block, where the output of the first backbone block undergoes a pooling operation, and the pooled output of the first backbone block together with the outputs of the second and third backbone blocks are fed simultaneously as input into a convolutional layer for feature extraction;
and the output module 1008 is configured to add the inputs of the three residually connected backbone blocks, followed by a one-dimensional convolutional layer and two fully connected layers, to obtain the output result of the vocoder.
In this embodiment, a pre-trained voice fake-identification model is obtained; a vocoder feature extraction module is added into it; vocoder features of different categories are pre-emphasized in that module to obtain pre-emphasized vocoder features; these are fed into an instance normalization layer and a parameterized analysis filter bank to be processed into a time-frequency representation; the vocoder feature time-frequency representation is processed through three backbone blocks with residual connections; and the inputs of the three backbone blocks are added and followed by a one-dimensional convolutional layer and two fully connected layers to obtain the output result of the vocoder.
Referring to fig. 15, which is a schematic structural diagram of an embodiment of the extraction processing module 1003 in fig. 10, the extraction processing module 1003 includes an analysis processing sub-module 10031 and an extraction processing sub-module 10032. Wherein:
the analysis processing sub-module 10031 is configured to perform time-frequency analysis processing on the target audio in the vocoder corresponding to the target audio, so as to obtain a target audio after the time-frequency analysis processing;
and the extraction processing submodule 10032 is used for inputting the target audio after the time-frequency analysis processing into three backbone blocks with residual connection for feature extraction processing to obtain target audio features.
In this embodiment, the time-frequency analysis processing is performed on the target audio in the vocoder corresponding to the target audio to obtain the target audio after the time-frequency analysis processing, and the target audio after the time-frequency analysis processing is input into three backbone blocks with residual connection to perform feature extraction processing, so as to obtain the target audio feature, which can effectively improve the accuracy and reliability of feature extraction, thereby improving the accuracy and efficiency of the speech processing task.
Referring to fig. 16, which is a schematic structural diagram of an embodiment of the extraction processing sub-module 10032 in fig. 15, the extraction processing sub-module 10032 includes a first feature extraction unit 100321, a second feature extraction unit 100322, a third feature extraction unit 100323, and a fourth feature extraction unit 100324. Wherein:
The first feature extraction unit 100321 is configured to input the target audio after the time-frequency analysis processing into a first backbone block to perform a first feature extraction process, and output a first target audio feature to be processed;
the second feature extraction unit 100322 is configured to input the first target audio feature to be processed into a second backbone block to perform second feature extraction processing, and output the second target audio feature to be processed;
The third feature extraction unit 100323 is configured to input the first to-be-processed target audio feature and the second to-be-processed target audio feature into a third backbone block to perform third feature extraction processing, so as to obtain a third to-be-processed target audio feature;
And a fourth feature extraction unit 100324, configured to input the first target audio feature to be processed, the second target audio feature to be processed, and the third target audio feature to be processed into a convolution layer to perform fourth feature extraction processing, so as to obtain a target audio feature.
In this embodiment, the target audio after time-frequency analysis processing is input into the first backbone block for the first feature extraction, outputting the first target audio feature to be processed; the first target audio feature to be processed is input into the second backbone block for the second feature extraction, outputting the second target audio feature to be processed; the first and second target audio features to be processed are input into the third backbone block for the third feature extraction, obtaining the third target audio feature to be processed; and the first, second, and third target audio features to be processed are input into a convolutional layer for the fourth feature extraction, obtaining the target audio features. This can effectively improve the accuracy and reliability of feature extraction.
In order to solve the technical problems, the embodiment of the application also provides computer equipment. Referring specifically to fig. 17, fig. 17 is a basic structural block diagram of a computer device according to the present embodiment.
The computer device 17 includes a memory 171, a processor 172, and a network interface 173 communicatively coupled to each other via a system bus. It should be noted that only the computer device 17 with components 171-173 is shown in the figure, but it should be understood that not all of the illustrated components must be implemented; more or fewer components may be implemented instead. Those skilled in the art will appreciate that the computer device here is a device capable of automatically performing numerical calculation and/or information processing according to preset or stored instructions; its hardware includes, but is not limited to, a microprocessor, an Application-Specific Integrated Circuit (ASIC), a Field-Programmable Gate Array (FPGA), a Digital Signal Processor (DSP), an embedded device, and the like.
The computer equipment can be a desktop computer, a notebook computer, a palm computer, a cloud server and other computing equipment. The computer equipment can perform man-machine interaction with a user through a keyboard, a mouse, a remote controller, a touch pad or voice control equipment and the like.
The memory 171 includes at least one type of readable storage medium, including flash memory, hard disk, multimedia card, card memory (e.g., SD or DX memory), Random Access Memory (RAM), Static Random Access Memory (SRAM), Read-Only Memory (ROM), Electrically Erasable Programmable Read-Only Memory (EEPROM), Programmable Read-Only Memory (PROM), magnetic memory, magnetic disk, optical disc, and the like.
In some embodiments, the memory 171 may be an internal storage unit of the computer device 17, such as a hard disk or memory of the computer device 17. In other embodiments, the memory 171 may also be an external storage device of the computer device 17, such as a plug-in hard disk, a Smart Media Card (SMC), a Secure Digital (SD) card, or a flash card provided on the computer device 17. Of course, the memory 171 may also include both an internal storage unit of the computer device 17 and an external storage device. In this embodiment, the memory 171 is typically used for storing the operating system and various application software installed on the computer device 17, such as the computer-readable instructions of the voice authentication method. Further, the memory 171 may also be used to temporarily store various types of data that have been output or are to be output. The processor 172 may in some embodiments be a Central Processing Unit (CPU), controller, microcontroller, microprocessor, or other data processing chip. The processor 172 is typically used to control the overall operation of the computer device 17. In this embodiment, the processor 172 is configured to execute the computer-readable instructions stored in the memory 171 or to process data, for example to execute the computer-readable instructions of the voice authentication method.
The network interface 173 may comprise a wireless network interface or a wired network interface, the network interface 173 typically being used to establish a communication connection between the computer device 17 and other electronic devices.
In this embodiment, the computer device, when executing the computer-readable instructions of the voice authentication method, implements the steps of the voice authentication method described above, thereby addressing the problems of low accuracy and poor generalization of existing voice fake-identification methods in real environments.
The present application also provides another embodiment, namely, a computer-readable storage medium storing computer-readable instructions executable by at least one processor to cause the at least one processor to perform the steps of the voice authentication method as described above.
In this embodiment, the computer-readable instructions, when executed by the at least one processor, implement the steps of the voice authentication method described above, thereby addressing the problems of low accuracy and poor generalization of existing voice fake-identification methods in real environments.
From the above description of the embodiments, it will be clear to those skilled in the art that the above-described method embodiments may be implemented by software plus a necessary general-purpose hardware platform, or by hardware, though in many cases the former is preferred. Based on such understanding, the technical solution of the present application, in essence or in the part contributing to the prior art, may be embodied in the form of a software product stored in a storage medium (e.g., ROM/RAM, magnetic disk, optical disc) comprising instructions for causing a terminal device (which may be a mobile phone, a computer, a server, an air conditioner, a network device, or the like) to perform the methods of the embodiments of the present application.
It is apparent that the above-described embodiments are only some embodiments of the present application, not all of them; the preferred embodiments are shown in the drawings, which do not limit the scope of the claims. This application may be embodied in many different forms; the embodiments are provided so that this disclosure will be thorough and complete. Although the application has been described in detail with reference to the foregoing embodiments, those skilled in the art may still modify the technical solutions described in the foregoing embodiments or substitute equivalents for some of their features. All equivalent structures made using the content of the specification and the drawings of the application, whether applied directly or indirectly in other related technical fields, likewise fall within the scope of protection of the application.

Claims (10)

1. A voice authentication method, comprising the steps of:
acquiring a voice data set, and training a pre-training voice fake identifying model through the voice data set to obtain a trained voice fake identifying model;
Performing vocoder classification processing on the target audio through the trained voice fake identifying model, and determining a vocoder corresponding to the target audio;
performing feature extraction processing on the target audio in a vocoder corresponding to the target audio to obtain target audio features;
and performing fake identification processing on the target audio characteristics, and determining a fake identification processing result of the target audio.
2. The method of claim 1, wherein the speech dataset includes target-generated audio and annotations of the target-generated audio, the obtaining the speech dataset comprising:
acquiring a sample real audio and vocoders of different categories;
Performing audio generation processing on the sample real audio in the vocoders of different categories to obtain generated audio of different categories;
calculating the correlation between the generated audio and the sample real audio, determining a correlation threshold corresponding to the category of the vocoder, and determining the generated audio with the correlation larger than the correlation threshold as a target generated audio;
And determining the annotation of the target generated audio, wherein the annotation of the target generated audio is formed according to the vocoder class and the fakeness of the target generated audio.
3. The method of claim 2, wherein said calculating a correlation between the generated audio and the sample real audio comprises:
calculating the high-pitch similarity between the generated audio and the sample real audio;
calculating the bass similarity between the generated audio and the sample real audio;
Calculating the lexical similarity between the generated audio and the sample real audio;
Calculating global semantic similarity between the generated audio and the sample real audio;
And carrying out weighted fusion processing on the high-pitch similarity, the low-pitch similarity, the lexical similarity and the global semantic similarity to obtain the correlation degree between the generated audio and the sample real audio.
4. The voice authentication method of claim 3, wherein the determining the correlation threshold corresponding to the class of the vocoder comprises:
determining weights corresponding to the categories of the vocoders based on the generation accuracy of the vocoders of different categories;
Determining the effective duration and the number of effective phonemes in the sample real audio, wherein the effective duration is the sum of the durations of the effective phonemes, and the effective phonemes are phonemes produced by a human speaker;
And carrying out weighted summation processing on the weight, the effective duration and the number of effective phonemes to obtain the correlation threshold corresponding to the category of the vocoder.
5. The method of claim 4, wherein prior to training the pre-trained speech discrimination model with the speech dataset to obtain a trained speech discrimination model, the method further comprises:
Obtaining a pre-training voice false identification model, adding a vocoder characteristic extraction module into the pre-training voice false identification model, and performing pre-emphasis processing on the different types of vocoder characteristics in the vocoder characteristic extraction module to obtain pre-emphasis processed vocoder characteristics;
Sending the pre-emphasis processed vocoder characteristics into an example standardization layer and a parameterized analysis filter bank to process the vocoder characteristics into time-frequency representation, so as to obtain vocoder characteristic time-frequency representation;
Passing the vocoder feature time-frequency representation through three backbone blocks with residual connections, wherein the three backbone blocks are connected in such a way that the output of a first backbone block is used as the input of a second backbone block, and the sum of the outputs of the first and second backbone blocks is used as the input of a third backbone block; the output of the first backbone block undergoes a pooling operation, and the pooled output of the first backbone block together with the outputs of the second and third backbone blocks are fed simultaneously as input into a convolutional layer for feature extraction;
And adding the inputs of the three backbone blocks with residual connection, connecting one-dimensional convolution layer and two full-connection layers to obtain the output result of the vocoder.
6. The voice authentication method according to any one of claims 1 to 5, wherein performing feature extraction processing on the target audio in the vocoder to which the target audio belongs to obtain a target audio feature comprises:
Performing time-frequency analysis processing on the target audio in a vocoder corresponding to the target audio to obtain the target audio after the time-frequency analysis processing;
and inputting the target audio subjected to the time-frequency analysis processing into three backbone blocks with residual connection for feature extraction processing, so as to obtain target audio features.
7. The voice authentication method according to claim 6, wherein inputting the target audio after the time-frequency analysis processing into three backbone blocks with residual connection for feature extraction processing, obtaining target audio features, comprises:
inputting the target audio subjected to the time-frequency analysis processing into a first backbone block for first feature extraction processing, and outputting to obtain a first target audio feature to be processed;
inputting the first target audio feature to be processed into a second backbone block for second feature extraction processing, and outputting to obtain a second target audio feature to be processed;
Inputting the first target audio feature to be processed and the second target audio feature to be processed into a third backbone block for third feature extraction processing to obtain a third target audio feature to be processed;
And inputting the first target audio feature to be processed, the second target audio feature to be processed and the third target audio feature to be processed into a convolution layer to perform fourth feature extraction processing, so as to obtain target audio features.
8. A voice authentication apparatus, comprising:
the training module is used for acquiring a voice data set, and training the pre-training voice fake identifying model through the voice data set to obtain a trained voice fake identifying model;
the classification processing module is used for performing vocoder classification processing on the target audio through the trained voice fake identifying model and determining a corresponding vocoder to which the target audio belongs;
the extraction processing module is used for carrying out feature extraction processing on the target audio in the vocoder corresponding to the target audio to obtain target audio features;
And the fake identifying processing module is used for carrying out fake identifying processing on the target audio characteristics and determining a fake identifying processing result of the target audio.
9. A computer device comprising a memory and a processor, the memory having stored therein computer readable instructions which when executed by the processor implement the steps of the voice authentication method of any of claims 1 to 7.
10. A computer readable storage medium having stored thereon computer readable instructions which when executed by a processor implement the steps of the voice authentication method according to any of claims 1 to 7.
CN202411361419.XA 2024-09-27 2024-09-27 A voice authentication method and related equipment Pending CN119360887A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202411361419.XA CN119360887A (en) 2024-09-27 2024-09-27 A voice authentication method and related equipment

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202411361419.XA CN119360887A (en) 2024-09-27 2024-09-27 A voice authentication method and related equipment

Publications (1)

Publication Number Publication Date
CN119360887A true CN119360887A (en) 2025-01-24

Family

ID=94299792

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202411361419.XA Pending CN119360887A (en) 2024-09-27 2024-09-27 A voice authentication method and related equipment

Country Status (1)

Country Link
CN (1) CN119360887A (en)

Cited By (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN120148521A (en) * 2025-03-21 2025-06-13 思必驰科技股份有限公司 Training method and electronic device for multimodal speaker authentication model
CN120340501A (en) * 2025-04-28 2025-07-18 北京瑞莱智慧科技有限公司 Audio authentication method, related device and storage medium
CN120431939A (en) * 2025-04-29 2025-08-05 浙江大学 A voiceprint representation method and system based on deep learning that is resistant to voice conversion and speech synthesis
CN120581038A (en) * 2025-06-24 2025-09-02 平安创科科技(北京)有限公司 Audio authentication method, device, equipment and medium

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination