
CN112397057B - Voice processing method, apparatus, device, and medium based on a generative adversarial network - Google Patents


Info

Publication number
CN112397057B
CN112397057B (application CN202011387380.0A)
Authority
CN
China
Prior art keywords
voice
noise
adversarial network
target
voice signal
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202011387380.0A
Other languages
Chinese (zh)
Other versions
CN112397057A (en)
Inventor
郑振鹏 (Zheng Zhenpeng)
王健宗 (Wang Jianzong)
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Ping An Technology Shenzhen Co Ltd
Original Assignee
Ping An Technology Shenzhen Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Ping An Technology Shenzhen Co Ltd
Priority to CN202011387380.0A
Publication of CN112397057A
Priority to PCT/CN2021/096660 (WO2022116487A1)
Application granted
Publication of CN112397057B
Legal status: Active
Anticipated expiration


Classifications

    • G: PHYSICS
    • G10: MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L: SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00: Speech recognition
    • G10L15/04: Segmentation; Word boundary detection
    • G10L25/00: Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
    • G10L25/27: Speech or voice analysis techniques characterised by the analysis technique
    • G10L25/30: Speech or voice analysis techniques using neural networks
    • G10L25/78: Detection of presence or absence of voice signals
    • G10L25/87: Detection of discrete points within a voice signal
    • Y: GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02: TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02T: CLIMATE CHANGE MITIGATION TECHNOLOGIES RELATED TO TRANSPORTATION
    • Y02T10/00: Road transport of goods or passengers
    • Y02T10/10: Internal combustion engine [ICE] based vehicles
    • Y02T10/40: Engine management systems

Landscapes

  • Engineering & Computer Science (AREA)
  • Computational Linguistics (AREA)
  • Health & Medical Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Human Computer Interaction (AREA)
  • Physics & Mathematics (AREA)
  • Acoustics & Sound (AREA)
  • Multimedia (AREA)
  • Signal Processing (AREA)
  • Artificial Intelligence (AREA)
  • Evolutionary Computation (AREA)
  • Telephonic Communication Services (AREA)

Abstract

The present application relates to the technical field of voice processing and discloses a voice processing method, apparatus, device, and medium based on a generative adversarial network. The method includes: obtaining a voice segment to be processed, cutting it according to a preset length, and marking the cutting order to obtain cut voice segments and cutting-order marks; inputting the cut voice segments into a trained generative adversarial network to obtain noise-reduced voice signals and the voice endpoint information corresponding to each noise-reduced voice signal; combining each noise-reduced voice signal with its corresponding voice endpoint information to form the voice signals to be spliced; and splicing the voice signals to be spliced according to the cutting-order marks to obtain a reshaped voice signal. The application also relates to blockchain technology: the voice segment to be processed is stored in a blockchain. By combining the noise-reduced voice signal with the voice endpoint information, the application effectively improves the accuracy of voice processing.

Description

Voice processing method, apparatus, device, and medium based on a generative adversarial network
Technical Field
The present application relates to the field of voice processing technologies, and in particular to a voice processing method, apparatus, device, and medium based on a generative adversarial network.
Background
Voice processing includes voice enhancement (Speech Enhancement) and voice endpoint detection (Voice Activity Detection). Voice enhancement aims to remove the background noise mixed into a voice signal; removing it yields a clearer voice signal and helps subsequent tasks achieve better performance. Voice endpoint detection aims to locate the start and end points of speech; eliminating the non-speech portions reduces subsequent computation and improves the robustness and accuracy of downstream voice systems. Excessive background noise in real environments, however, poses a significant challenge for voice processing.
To address excessive background noise in real environments, the existing approach inputs the noisy voice to be processed into a generative adversarial network, has the network's discriminator judge it, and trains on the discrimination result so as to remove the background noise. Because this method judges the voice to be processed directly, the discrimination result tends to have a large error, the final noise-removal effect is not obvious enough, and voice processing accuracy is low. A method that can improve the accuracy of voice processing is therefore needed.
Disclosure of Invention
Embodiments of the present application aim to provide a voice processing method, apparatus, device, and medium based on a generative adversarial network, so as to improve the accuracy of voice processing.
To solve the above technical problem, an embodiment of the present application provides a voice processing method based on a generative adversarial network, including:
obtaining a voice segment to be processed, cutting it according to a preset length, and marking the cutting order to obtain cut voice segments and cutting-order marks;
inputting the cut voice segments into a trained generative adversarial network to obtain noise-reduced voice signals and the voice endpoint information corresponding to each noise-reduced voice signal;
combining each noise-reduced voice signal with its corresponding voice endpoint information to form the voice signals to be spliced;
and splicing the voice signals to be spliced according to the cutting-order marks to obtain a reshaped voice signal.
To solve the above technical problem, an embodiment of the present application provides a voice processing apparatus based on a generative adversarial network, including:
a to-be-processed voice segment acquisition module, configured to obtain the voice segment to be processed, cut it according to a preset length, and mark the cutting order to obtain cut voice segments and cutting-order marks;
a cut voice segment input module, configured to input the cut voice segments into a trained generative adversarial network to obtain noise-reduced voice signals and the voice endpoint information corresponding to each noise-reduced voice signal;
a to-be-spliced voice signal module, configured to combine each noise-reduced voice signal with its corresponding voice endpoint information to form the voice signals to be spliced;
and a reshaped voice signal acquisition module, configured to splice the voice signals to be spliced according to the cutting-order marks to obtain a reshaped voice signal.
To solve the above technical problem, the invention further adopts the following technical solution: a computer device is provided, comprising one or more processors and a memory storing one or more programs that, when executed, cause the one or more processors to implement any of the above voice processing methods based on a generative adversarial network.
To solve the above technical problem, the invention further adopts the following technical solution: a computer-readable storage medium is provided, having stored thereon a computer program which, when executed by a processor, implements any of the above voice processing methods based on a generative adversarial network.
Embodiments of the invention provide a voice processing method, apparatus, device, and medium based on a generative adversarial network. The method comprises: obtaining a voice segment to be processed, cutting it according to a preset length, and marking the cutting order to obtain cut voice segments and cutting-order marks; inputting the cut voice segments into a trained generative adversarial network to obtain noise-reduced voice signals and the voice endpoint information corresponding to each noise-reduced voice signal; combining each noise-reduced voice signal with its corresponding voice endpoint information to form the voice signals to be spliced; and splicing the voice signals to be spliced according to the cutting-order marks to obtain a reshaped voice signal. By combining the voice-enhanced, noise-reduced voice signal with the voice endpoint information obtained by voice detection, the embodiments obtain a reshaped voice signal that is both enhanced and endpoint-detected, which facilitates voice judgment on the reshaped signal and effectively improves the accuracy of voice processing.
Drawings
To illustrate the solution of the present application more clearly, the drawings required for describing the embodiments are briefly introduced below. The drawings described below show only some embodiments of the present application; a person of ordinary skill in the art may obtain other drawings from them without inventive effort.
FIG. 1 is a schematic view of an application environment of the voice processing method based on a generative adversarial network according to an embodiment of the present application;
FIG. 2 is a flowchart of an implementation of the voice processing method based on a generative adversarial network according to an embodiment of the present application;
FIG. 3 is a flowchart of an implementation of a sub-process in the voice processing method based on a generative adversarial network according to an embodiment of the present application;
FIG. 4 is a flowchart of another implementation of a sub-process in the voice processing method based on a generative adversarial network according to an embodiment of the present application;
FIG. 5 is a flowchart of another implementation of a sub-process in the voice processing method based on a generative adversarial network according to an embodiment of the present application;
FIG. 6 is a flowchart of another implementation of a sub-process in the voice processing method based on a generative adversarial network according to an embodiment of the present application;
FIG. 7 is a flowchart of another implementation of a sub-process in the voice processing method based on a generative adversarial network according to an embodiment of the present application;
FIG. 8 is a flowchart of another implementation of a sub-process in the voice processing method based on a generative adversarial network according to an embodiment of the present application;
FIG. 9 is a schematic diagram of the voice processing apparatus based on a generative adversarial network according to an embodiment of the present application;
FIG. 10 is a schematic diagram of a computer device according to an embodiment of the present application.
Detailed Description
Unless defined otherwise, all technical and scientific terms used herein have the same meaning as commonly understood by one of ordinary skill in the art to which this application belongs. The terminology used in the description is for the purpose of describing particular embodiments only and is not intended to limit the application. The terms "comprising" and "having", and any variations thereof, in the description, the claims, and the drawings are intended to cover a non-exclusive inclusion. The terms "first", "second", and the like in the description, the claims, and the drawings are used to distinguish different objects, not necessarily to describe a sequential or chronological order.
Reference herein to "an embodiment" means that a particular feature, structure, or characteristic described in connection with the embodiment may be included in at least one embodiment of the application. The appearances of such phrases in various places in the specification are not necessarily all referring to the same embodiment, nor are separate or alternative embodiments mutually exclusive of other embodiments. Those of skill in the art will explicitly and implicitly appreciate that the embodiments described herein may be combined with other embodiments.
In order to make the person skilled in the art better understand the solution of the present application, the technical solution of the embodiment of the present application will be clearly and completely described below with reference to the accompanying drawings.
The present invention will be described in detail with reference to the drawings and embodiments.
Referring to fig. 1, a system architecture 100 may include terminal devices 101, 102, 103, a network 104, and a server 105. The network 104 is used as a medium to provide communication links between the terminal devices 101, 102, 103 and the server 105. The network 104 may include various connection types, such as wired, wireless communication links, or fiber optic cables, among others.
The user may interact with the server 105 via the network 104 using the terminal devices 101, 102, 103 to receive or send messages or the like. Various communication client applications, such as a web browser application, a search class application, an instant messaging tool, etc., may be installed on the terminal devices 101, 102, 103.
The terminal devices 101, 102, 103 may be a variety of electronic devices having a display screen and supporting web browsing, including but not limited to smartphones, tablets, laptop and desktop computers, and the like.
The server 105 may be a server providing various services, such as a background server providing support for pages displayed on the terminal devices 101, 102, 103.
It should be noted that the voice processing method based on a generative adversarial network provided by the embodiments of the present application is generally executed by a server; accordingly, the voice processing apparatus based on a generative adversarial network is generally arranged in the server.
It should be understood that the number of terminal devices, networks and servers in fig. 1 is merely illustrative. There may be any number of terminal devices, networks, and servers, as desired for implementation.
Referring to FIG. 2, FIG. 2 illustrates one embodiment of the voice processing method based on a generative adversarial network.
It should be noted that, provided substantially the same result is obtained, the method of the present invention is not limited to the flow sequence shown in FIG. 2. The method includes the following steps:
S1: obtain the voice segment to be processed, cut it according to a preset length, and mark the cutting order to obtain cut voice segments and cutting-order marks.
Specifically, when a voice segment needs processing, the server first obtains it and cuts it according to the preset length from the start of the voice to its end, marking the cutting order while cutting, thereby obtaining the cut voice segments and the cutting-order marks.
A cutting-order mark is the mark assigned to each cut voice segment when the voice segment to be processed is cut.
For example, if the voice segment to be processed lasts 500 seconds, the server cuts it into 2-second pieces from beginning to end and marks the cutting order, obtaining 250 cut voice segments and their corresponding cutting-order marks. The cut segment covering seconds 0 to 2 of the voice segment to be processed, for instance, carries cutting-order mark 1.
The preset length is set according to the actual situation and is not limited here. In one specific embodiment, the preset length is 2 seconds.
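The cutting-and-marking of step S1 can be sketched as follows. This is an illustrative sketch, not code from the patent; the function name and the toy 1 Hz sample rate are assumptions made for the example.

```python
from typing import List, Tuple

# Hypothetical sketch of step S1: cut a signal into fixed-length pieces and
# attach a cutting-order mark (starting at 1) to each piece.
def cut_speech(samples: List[int], sample_rate: int,
               seg_seconds: float = 2.0) -> List[Tuple[int, List[int]]]:
    seg_len = int(sample_rate * seg_seconds)
    return [(i + 1, samples[start:start + seg_len])
            for i, start in enumerate(range(0, len(samples), seg_len))]

# A 500-second signal cut into 2-second pieces yields 250 marked segments,
# matching the worked example above (toy sample rate of 1 Hz).
pieces = cut_speech(list(range(500)), sample_rate=1, seg_seconds=2.0)
```

The mark travels with its segment, so the original order can be recovered after per-segment processing.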
S2: inputting the cut voice segment into a trained generation countermeasure network to obtain a noise-reduced voice signal and voice endpoint information corresponding to the noise-reduced voice signal.
Specifically, the server inputs the cut voice segments into a trained generation countermeasure network, carries out voice enhancement processing on the cut voice segments through a trained generator in the generation countermeasure network, and generates enhanced voice signals, namely noise-reduced voice signals, wherein each noise-reduced voice signal is actually a sampling point corresponding to each cut voice segment; and inputting the enhanced voice signal into a trained discriminator for generating the countermeasure network, judging the noise-reduced voice signal through the trained discriminator for generating the countermeasure network, and outputting voice endpoint information corresponding to the noise-reduced voice signal, namely outputting whether the noise-reduced voice signal is a probability value of a real voice signal or not.
The noise-reduced voice signal is a voice sampling signal which can be enhanced after voice enhancement is carried out on the cut voice segment; the voice endpoint information is a probability value of whether the voice signal corresponding to noise reduction is a real voice signal, and whether the voice signal is the real voice is judged through the probability value, namely, the obtained judgment result is the real voice or the non-real voice.
Wherein the generative antagonism network (GAN, generative Adversarial Networks) is a deep learning model. The model is built up of (at least) two modules in a frame: the mutual game learning of the generative model (GENERATIVE MODEL) and the discriminant model (DISCRIMINATIVE MODEL) produces a fairly good output. The generation model corresponds to a generator in the generation countermeasure network and is used for outputting a noise-reduced voice signal obtained after the cutting voice signal is enhanced by voice; the discrimination model here corresponds to a discriminator in the generation countermeasure network in the present application for outputting a discrimination result.
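The generator/discriminator split of step S2 can be sketched with toy stand-ins. Everything here is an assumption made for illustration: the moving-average "generator" and energy-based "discriminator" are placeholders for the trained networks, not the patent's models.

```python
import math
from typing import List, Tuple

def generator(segment: List[float]) -> List[float]:
    # Toy stand-in for the trained generator: a 3-point moving average
    # as a placeholder for learned voice enhancement (denoising).
    out = []
    for i in range(len(segment)):
        window = segment[max(0, i - 1):i + 2]
        out.append(sum(window) / len(window))
    return out

def discriminator(segment: List[float]) -> float:
    # Toy stand-in for the trained discriminator: squash mean energy into
    # (0, 1) as a placeholder for the real-voice probability, i.e. the
    # "voice endpoint information".
    energy = sum(x * x for x in segment) / max(len(segment), 1)
    return 1.0 / (1.0 + math.exp(-energy))

def process_segment(segment: List[float]) -> Tuple[List[float], float]:
    denoised = generator(segment)    # noise-reduced voice signal
    prob = discriminator(denoised)   # probability it is real voice
    return denoised, prob

denoised, prob = process_segment([0.1, 0.5, -0.2, 0.3])
```

The point of the sketch is the data flow: each cut segment yields a denoised signal plus one probability, which together form one voice signal to be spliced.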
S3: combine each noise-reduced voice signal with its corresponding voice endpoint information to form the voice signals to be spliced.
Specifically, the noise-reduced voice signal is the enhanced signal obtained by performing voice enhancement on a cut voice segment, and the voice endpoint information is the probability that the noise-reduced voice signal is real voice, from which the real-voice or non-voice judgment is made. Since each noise-reduced voice signal consists of the sampling points of one cut voice segment, and the voice endpoint information is the probability corresponding to those sampling points, combining the two yields a voice signal to be spliced. Each voice signal to be spliced is thus both voice-enhanced and endpoint-detected: it carries the enhanced samples together with the probability that they are real voice.
S4: splice the voice signals to be spliced according to the cutting-order marks to obtain a reshaped voice signal.
Specifically, the voice signals to be spliced are joined, from the beginning of the voice to its end, according to the cutting-order marks, yielding the reshaped voice signal. The reshaped voice signal has undergone both voice enhancement and voice endpoint detection; combining the two removes noise and improves the accuracy of voice processing.
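Steps S3 and S4 can be sketched as pairing each denoised segment with its probability and splicing in cutting order. The threshold and the choice to drop segments judged non-voice are assumptions (one plausible use of the endpoint information, per the "eliminating non-voice" aim in the Background), not something the patent specifies.

```python
from typing import List, Tuple

# Sketch of steps S3-S4 under stated assumptions.
# marked: (cutting_order_mark, denoised_segment, real_voice_probability)
def splice(marked: List[Tuple[int, List[float], float]],
           voice_threshold: float = 0.5) -> List[float]:
    out: List[float] = []
    # Sort by cutting-order mark so segments are spliced from the
    # beginning of the voice to its end, regardless of arrival order.
    for _, seg, prob in sorted(marked, key=lambda t: t[0]):
        if prob >= voice_threshold:   # keep only segments judged as voice
            out.extend(seg)
    return out

# Out-of-order input is restored by the marks; segment 2 is discarded
# as non-voice (probability below the threshold).
reshaped = splice([(3, [5.0, 6.0], 0.9),
                   (1, [1.0, 2.0], 0.8),
                   (2, [3.0, 4.0], 0.1)])
```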
In this embodiment, a voice segment to be processed is obtained, cut according to a preset length, and marked with its cutting order to obtain cut voice segments and cutting-order marks; the cut voice segments are input into a trained generative adversarial network to obtain noise-reduced voice signals and their corresponding voice endpoint information; each noise-reduced voice signal is combined with its voice endpoint information to form the voice signals to be spliced; and the voice signals to be spliced are spliced according to the cutting-order marks to obtain a reshaped voice signal. Combining the voice-enhanced, noise-reduced signal with the voice endpoint information obtained by voice detection produces a reshaped voice signal that is both enhanced and endpoint-detected, which facilitates voice judgment on the reshaped signal and effectively improves the accuracy of voice processing.

Referring to FIG. 3, FIG. 3 shows a specific implementation performed before step S1. This embodiment includes:
S2A: obtain a preset noise voice signal and a target voice signal, and cut both according to a preset length to obtain noise voice segments and target voice segments.
Specifically, during the training of the generative adversarial network, the noise voice signal and the target voice signal are first obtained and then input into the network for training.
It should be noted that the preset length used to cut the noise and target voice signals may differ from the preset length used to cut the voice segment to be processed in step S1, although the best results are obtained when the two are the same. In addition, when cutting the noise and target voice signals, overlapping portions are set between adjacent noise voice segments and between adjacent target voice segments, whereas no overlap is needed when cutting the voice segment to be processed. The reason is that overlap during model training enlarges the training data and lets the model learn better network parameters, while during processing of the voice to be processed, each sampling point only needs to be handled once to complete the task.
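The overlapping cut used for training data in step S2A can be sketched with a sliding window whose hop is smaller than its length. The window and hop sizes (4 and 2 samples) are illustrative assumptions, not values from the patent.

```python
from typing import List

# Sketch of step S2A's overlapping cut: full windows of length `win`,
# advancing by `hop`; hop < win produces the overlap that enlarges
# the training set.
def cut_with_overlap(samples: List[int], win: int, hop: int) -> List[List[int]]:
    return [samples[i:i + win]
            for i in range(0, len(samples) - win + 1, hop)]

# A 10-sample signal with window 4 and hop 2 gives 4 overlapping training
# segments, versus only 2 non-overlapping ones.
overlapped = cut_with_overlap(list(range(10)), win=4, hop=2)
```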
S2B: extract noise voice segments and target voice segments as training data by random sampling without replacement.
Specifically, random sampling without replacement is used to extract the noise voice segments and target voice segments, ensuring that no extracted segment is repeated, which benefits the training of the generative adversarial network's model.
S2C: input the training data into the generative adversarial network to generate observed voice segments and discrimination results, and compute loss function values from them to obtain the target losses.
Specifically, the training data comprise the randomly extracted noise voice segments and target voice segments. The noise voice segments are input into the generator of the network to generate observed voice segments; the noise voice segments and the target voice segments are then input into the discriminator to obtain their respective discrimination results. Loss function values are computed from the observed voice segments and the discrimination results to obtain the target losses.
A discrimination result is produced by feeding training data into the discriminator, which judges whether each item of training data is real voice or noise: real voice yields a result of 1, noise a result of 0. Since the training data are not a single item, the discrimination results contain many 1s and 0s, which makes it possible to compute a loss function value over them.
S2D: update the parameters of the generative adversarial network according to the target losses to obtain the trained generative adversarial network.
Specifically, the generator parameters and discriminator parameters of the network are updated according to the target losses obtained in step S2C, finally yielding the trained generative adversarial network.
In this embodiment, a preset noise voice signal and a target voice signal are obtained and cut according to a preset length into noise voice segments and target voice segments; the segments are extracted by random sampling without replacement as training data; the training data are input into the generative adversarial network to generate observed voice segments and discrimination results, from which loss function values are computed to obtain the target losses; and the network's parameters are updated according to the target losses to obtain the trained network. Training the generative adversarial network on the noise and target voice segments in this way facilitates the subsequent output of noise-reduced voice signals and their corresponding voice endpoint information, thereby improving the accuracy of voice processing.
Referring to FIG. 4, FIG. 4 shows a specific implementation of step S2C, namely inputting the training data into the generative adversarial network to generate observed voice segments and discrimination results and computing loss function values from them to obtain the target losses, detailed as follows:
S2C1: input the noise voice segments in the training data into the generator of the generative adversarial network to generate observed voice segments, and compute the loss function value between the observed voice segments and the target voice segments in the training data to obtain the first loss value.
Specifically, inputting a noise voice segment into the generator yields a detected, voice-enhanced voice signal, i.e., the observed voice segment. The loss function value between the observed voice segment and the target voice segment in the training data is computed to obtain the first loss value, which measures how far the observed segment deviates from the target segment: the larger the first loss value, the less similar the two are, i.e., the greater their deviation. In the present application, the generative adversarial network is trained so that noise and real voice can be distinguished to the greatest extent; a larger first loss value therefore indicates that training of the network is closer to completion. The first loss value is used in a subsequent step to update the generator parameters.
S2C2: input the noise voice segments in the training data into the discriminator of the generative adversarial network to obtain the first discrimination result, and compute its loss function value to obtain the second loss value.
Specifically, the discriminator judges each noise voice segment in the training data as real voice (first discrimination result 1) or noise (first discrimination result 0). Since the noise voice segments are not a single item, the first discrimination result contains many 1s and 0s; its loss function value is computed to obtain the second loss value.
S2C3: input the target voice segments in the training data into the discriminator of the generative adversarial network to obtain the second discrimination result, and compute its loss function value to obtain the third loss value.
Specifically, the discriminator judges each target voice segment in the training data as real voice (second discrimination result 1) or noise (second discrimination result 0). Since the target voice segments are not a single item, the second discrimination result contains many 1s and 0s; its loss function value is computed to obtain the third loss value.
And S2C4, taking the first loss value, the second loss value and the third loss value as target losses.
Specifically, the first loss value, the second loss value, and the third loss value are taken together as the target loss and are used in the subsequent steps to update the parameters of the generation countermeasure network.
In this embodiment, the noise speech segment in the training data is input into the generator of the generation countermeasure network to generate an observed speech segment, and the loss function value between the observed speech segment and the target speech segment in the training data is calculated to obtain a first loss value. The noise speech segment is also input into the discriminator of the generation countermeasure network to obtain a first discrimination result, whose loss function value gives a second loss value; likewise, the target speech segment is input into the discriminator to obtain a second discrimination result, whose loss function value gives a third loss value. The first, second, and third loss values are taken as the target loss. Calculating the loss function values from these different data makes it convenient to update the parameters of the generation countermeasure network subsequently and improves the accuracy of speech processing.
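The patent does not fix the concrete loss functions used in steps S2C1 to S2C4, so the sketch below is only illustrative: it assumes an L1 distance between the observed and target speech segments for the first loss value and binary cross-entropy for the two discriminator losses. The helper names (`l1_loss`, `bce_loss`) and the toy segments are hypothetical.

```python
import numpy as np

def l1_loss(observed, target):
    # First loss (S2C1): deviation between observed and target speech segments.
    return float(np.mean(np.abs(observed - target)))

def bce_loss(predicted, label):
    # Binary cross-entropy between discriminator outputs and the 0/1 labels.
    eps = 1e-7
    p = np.clip(predicted, eps, 1 - eps)
    return float(np.mean(-(label * np.log(p) + (1 - label) * np.log(1 - p))))

# Toy data: one clean target segment and its noisy counterpart.
rng = np.random.default_rng(0)
target_seg = np.sin(np.linspace(0, 2 * np.pi, 160))
noise_seg = target_seg + 0.3 * rng.standard_normal(160)

observed_seg = noise_seg * 0.9                        # stand-in for the generator output
first_loss = l1_loss(observed_seg, target_seg)        # S2C1
second_loss = bce_loss(np.array([0.2, 0.1]), 0.0)     # S2C2: noise segments, label 0
third_loss = bce_loss(np.array([0.8, 0.9]), 1.0)      # S2C3: real speech, label 1
target_losses = (first_loss, second_loss, third_loss) # S2C4
```

The first loss drives the generator update, while the second and third drive the discriminator update, as step S2D describes.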
Referring to fig. 5, fig. 5 shows a specific implementation of step S2D, in which the parameters of the generation countermeasure network are updated according to the target loss to obtain the trained generation countermeasure network; the details are as follows:
And S2D1, updating and generating generator parameters of the countermeasure network according to the first loss value.
Specifically, since the first loss value is obtained by having the generator produce the observed speech segment and calculating the loss function value between the observed speech segment and the target speech segment in the training data, the generator parameters of the generation countermeasure network are updated based on the first loss value. This facilitates updating the parameters of the generation countermeasure network.
And S2D2, updating and generating the discriminator parameters of the countermeasure network according to the second loss value and the third loss value.
Specifically, since the second loss value and the third loss value are both calculated by the determination result generated by the arbiter, updating the arbiter parameters generating the countermeasure network by using the second loss value and the third loss value is beneficial to updating the parameters generating the countermeasure network.
And S2D3, stopping updating parameters for generating the countermeasure network when the first loss value reaches a preset threshold value, and obtaining the trained generated countermeasure network.
Specifically, the network parameters of the generation countermeasure network are updated using the first, second, and third loss values. If the first loss value has not reached the preset threshold, the three loss values are regenerated according to steps S2C1 to S2C3 and the network parameters are updated again. When the first loss value reaches the preset threshold, the generation countermeasure network is sufficiently trained to recognize noise speech signals and target speech signals; updating of the parameters is then stopped, and the trained generation countermeasure network is obtained.
The preset threshold is set according to the actual situation, and is not limited herein. In one embodiment, the predetermined threshold is 0.95.
In this embodiment, according to the first loss value, the generator parameters of the countermeasure network are updated, and according to the second loss value and the third loss value, the arbiter parameters of the countermeasure network are updated, and when the first loss value reaches a preset threshold, the parameters of the countermeasure network are stopped being updated, so as to obtain the trained countermeasure network. The updating of the generation countermeasure network is realized, which is beneficial to improving the accuracy of voice processing.
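Steps S2D1 to S2D3 describe an alternating update that stops once the first loss value reaches the preset threshold. A minimal sketch of that control flow follows, with a hypothetical `step_fn` standing in for one round of loss computation (S2C1 to S2C3) and the actual gradient updates elided to comments:

```python
def train_gan(step_fn, threshold=0.95, max_rounds=100):
    """Alternate generator/discriminator updates until the first loss
    reaches the preset threshold (S2D3). `step_fn` returns the three
    loss values for one training round."""
    for round_idx in range(max_rounds):
        first, second, third = step_fn(round_idx)
        # S2D1: update the generator parameters from `first`.
        # S2D2: update the discriminator parameters from `second` and `third`.
        if first >= threshold:
            return round_idx, first  # S2D3: stop updating the parameters
    return max_rounds, first

# Toy schedule: the first loss value grows as training proceeds,
# matching the description that a larger first loss means training
# is closer to completion.
rounds, final_first = train_gan(lambda i: (0.5 + 0.1 * i, 0.3, 0.3))
```

With the toy schedule above, training stops on the round where the first loss value first reaches 0.95, the preset threshold given in the embodiment.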
Referring to fig. 6, fig. 6 shows a specific implementation of step S2, in which the cut speech segment is input into the trained generation countermeasure network to obtain a noise-reduced speech signal and the speech endpoint information corresponding to the noise-reduced speech signal; the details are as follows:
S21, inputting the cut voice segments into a trained generation countermeasure network, and generating sequence matrix characteristics for the cut voice segments through a coding-decoding model of the generator.
Specifically, the encoding-decoding model (Encoder-Decoder) provides both encoding and decoding. The encoder converts the input sequence into a dense vector of fixed dimension, and the decoder then generates the target sequence from this vector. In this embodiment, a sequence of dense vectors is first generated for the cut speech segment by the encoding-decoding model of the generator, and this sequence of dense vectors is then converted into a sequence matrix feature in matrix form.
The sequence matrix features are generated by encoding and decoding the cut speech segment through the encoding-decoding model of the generator and represent the feature information of the cut speech segment. For example, if a feature sequence Y includes feature information y1, y2, y3, and y4, the sequence feature is Y = {y1, y2, y3, y4}.
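As a rough illustration of the matrix-form feature described above, the sketch below simply frames a cut speech segment into rows, one vector per frame. A real encoder-decoder would learn these dense vectors rather than copy raw samples, so the `sequence_matrix_features` helper is purely a stand-in:

```python
import numpy as np

def sequence_matrix_features(segment, frame_len=4):
    """Turn a cut speech segment into a matrix-form feature: each row
    plays the role of one dense vector in the sequence Y = {y1, y2, ...}.
    Framing stands in here for the learned encoding step."""
    n_frames = len(segment) // frame_len
    return np.reshape(segment[:n_frames * frame_len], (n_frames, frame_len))

seg = np.arange(16, dtype=float)        # toy cut speech segment
features = sequence_matrix_features(seg)  # shape (4, 4): four feature vectors
```

Each row of `features` corresponds to one item of feature information (y1 through y4 in the example above).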
S22, combining the sequence matrix features with the same size according to a jump connection mode to obtain target features.
Specifically, as the depth of the generation countermeasure network grows during training, gradient explosion and gradient vanishing become likely, which hinders training of the generation countermeasure network model. A jump connection mode is therefore introduced to establish a transmission channel between shallow-layer and deep-layer network information, and the sequence matrix features of the same size are combined to obtain the target feature, thereby alleviating gradient explosion and gradient vanishing.
The sequence matrix features with the same size refer to sequence matrix features with the same width and height.
Further, the whole network of the generator is constructed by a convolutional neural network. The convolutional neural network is a feedforward neural network which comprises convolutional calculation and has a depth structure, and is one of representative algorithms of deep learning. The convolutional neural network has characteristic learning capability and can carry out translation invariant classification on input information according to a hierarchical structure of the convolutional neural network.
The jump Connection (Skip Connection) is to combine sequence matrix features with the same size by establishing a transmission channel of shallow network information and deep network information so as to solve the problems of gradient explosion and gradient disappearance in the generation of the antagonistic network model training.
S23, inputting the target characteristics into a full-connection layer network of the generator to obtain a noise-reduced voice signal.
Specifically, since the whole network of the generator is constructed from a convolutional neural network, it further includes a 2-layer fully connected network whose input is the hidden vector of the last layer of the encoding-decoding model, i.e., the target feature. The fully connected layer generates the speech endpoint result; that is, in this embodiment, the target feature is input into the fully connected layer network of the generator to obtain the noise-reduced speech signal, i.e., the enhanced speech signal.
S24, inputting the noise-reduced voice signals into a trained discriminator for generating an countermeasure network to obtain voice endpoint information corresponding to the noise-reduced voice signals.
Specifically, the trained discriminator of the generation countermeasure network judges each noise-reduced speech signal and outputs the speech endpoint information corresponding to it, from which the probability that each noise-reduced speech signal is real speech can be determined.
In this embodiment, the cut speech segments are input into the trained generating countermeasure network, the sequence matrix features are generated for the cut speech segments by the encoding-decoding model of the generator, the sequence matrix features with the same size are combined according to the jump connection mode to obtain the target features, the target features are input into the full-connection layer network of the generator to obtain the noise-reduced speech signals, the noise-reduced speech signals are input into the trained discriminator of the generating countermeasure network to obtain the speech endpoint information corresponding to the noise-reduced speech signals, the acquisition of the noise-reduced speech signals and the speech endpoint information corresponding to the noise-reduced speech signals is realized, the follow-up remodeling speech signals are facilitated, and the accuracy of speech processing is further improved.
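A minimal sketch of how the discriminator's output could be read as speech endpoint information, assuming (as is common, though not stated in the text) that a raw score is squashed by a sigmoid into the probability that a segment is real speech. The 0.5 cut-off and the helper names are assumptions:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def endpoint_info(scores):
    """Map raw discriminator scores to speech endpoint information:
    the probability that each noise-reduced segment is real speech,
    plus a boolean decision at a 0.5 threshold."""
    probs = sigmoid(np.asarray(scores, dtype=float))
    return probs, probs > 0.5

# Three noise-reduced segments: confident speech, confident noise, borderline.
probs, is_speech = endpoint_info([2.0, -1.5, 0.1])
```

Pairing each probability with its noise-reduced segment yields the speech signal to be spliced described in the next step.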
Referring to fig. 7, fig. 7 shows a specific implementation manner of step S22, and a specific implementation process of combining sequence matrix features with the same size according to a jump connection manner in step S22 to obtain a target feature is described in detail as follows:
S221, traversing the sequence matrix characteristics to obtain sequence matrix characteristics with the same size as a target matrix, wherein the width and the height of the target matrix are consistent.
Specifically, in order to solve the problems of gradient explosion and gradient disappearance in the process of generating an antagonistic network model training, sequence matrix characteristics with the same size are obtained as a target matrix by traversing the sequence matrix characteristics of a shallow network layer and a deep network layer. The width and height of the target matrix are uniform.
S222, combining the target matrixes in a jump connection mode to obtain target characteristics.
Specifically, a transmission channel of a shallow network layer and a deep network layer of a fully-connected network is established in a jump connection mode, and a target matrix is combined to obtain target characteristics.
In the embodiment, the sequence matrix features with the same size are obtained by traversing the sequence matrix features and used as the target matrix, and the target matrix is combined in a jump connection mode to obtain the target features, so that the problems of gradient explosion and gradient disappearance in the process of generating the countermeasure network model training are solved, the generation of the countermeasure network training is facilitated, and further the accuracy of voice processing is facilitated.
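Steps S221 and S222 can be sketched as follows. The text does not say how equal-size matrices are merged, so element-wise addition, one common choice for skip connections, is assumed here:

```python
import numpy as np

def combine_same_size(features):
    """Traverse the sequence matrix features, gather those whose width and
    height match (the target matrices, S221), and merge each group by
    element-wise addition, standing in for the jump-connection combination
    of shallow- and deep-layer features (S222)."""
    by_shape = {}
    for f in features:
        by_shape.setdefault(f.shape, []).append(f)
    return {shape: np.sum(group, axis=0) for shape, group in by_shape.items()}

shallow = np.ones((2, 3))        # shallow-layer feature
deep = 2 * np.ones((2, 3))       # deep-layer feature of the same size
other = np.zeros((4, 4))         # different size: kept in its own group
target_features = combine_same_size([shallow, deep, other])
```

Only the two (2, 3) matrices are merged; the (4, 4) matrix has no same-size partner and passes through unchanged.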
Referring to fig. 8, fig. 8 shows a specific implementation manner of step S4, in which the speech signals to be spliced are spliced according to the cutting sequence mark in step S4, so as to obtain a specific implementation process of the reshaped speech signals, which is described in detail as follows:
S41, arranging the speech signals to be spliced in ascending order of the cutting sequence marks to obtain a speech sequence.
Specifically, since the cutting sequence marks run from the beginning to the end of the speech to be processed, the speech signals to be spliced are arranged in ascending order of their cutting sequence marks to obtain the speech sequence.
And S42, splicing the head and the tail of the voice signals to be spliced according to the voice sequence to obtain a remolded voice signal.
Specifically, the head and tail of the speech signals to be spliced are joined to form a complete reshaped speech signal. Because the reshaped speech signal has undergone both speech enhancement and speech endpoint detection, the combination of the two removes noise and improves the accuracy of speech processing.
In this embodiment, the voice signals to be spliced are arranged according to the order of the cutting order from small to large to obtain a voice sequence, and the head and the tail of the voice signals to be spliced are spliced according to the voice sequence to obtain a remolded voice signal, so that the purpose of voice processing is achieved, and the remolded voice signal has the characteristics of voice enhancement and voice endpoint detection, which is beneficial to the accuracy of voice processing.
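Steps S41 and S42 amount to a sort-then-concatenate, which can be sketched as below; the `(mark, segment)` pairing is an assumed representation of the speech signals to be spliced:

```python
import numpy as np

def reshape_speech(signals_with_marks):
    """Arrange the speech signals to be spliced in ascending order of
    their cutting sequence marks (S41), then join them head to tail
    (S42) into the reshaped speech signal."""
    ordered = sorted(signals_with_marks, key=lambda item: item[0])
    return np.concatenate([segment for _, segment in ordered])

# Segments arrive out of order; the marks record the original cut order.
pieces = [(2, np.array([4.0, 5.0])),
          (0, np.array([0.0, 1.0])),
          (1, np.array([2.0, 3.0]))]
reshaped = reshape_speech(pieces)
```

Sorting by the marks restores the original order before concatenation, so the reshaped signal reads continuously from the beginning to the end of the speech to be processed.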
It should be emphasized that, to further ensure the privacy and security of the above-mentioned speech segments to be processed, the above-mentioned speech segments to be processed may also be stored in a node of a blockchain.
Those skilled in the art will appreciate that implementing all or part of the above-described methods in accordance with the embodiments may be accomplished by way of a computer program stored in a computer-readable storage medium, which when executed, may comprise the steps of the embodiments of the methods described above. The storage medium may be a nonvolatile storage medium such as a magnetic disk, an optical disk, a Read-Only Memory (ROM), or a random access Memory (Random Access Memory, RAM).
Referring to fig. 9, as an implementation of the method shown in fig. 2, the present application provides an embodiment of a voice processing apparatus based on generating an countermeasure network, where the embodiment of the apparatus corresponds to the embodiment of the method shown in fig. 2, and the apparatus may be specifically applied to various electronic devices.
As shown in fig. 9, the voice processing apparatus based on the generation countermeasure network of the present embodiment includes: a to-be-processed speech segment acquisition module 51, a cut speech segment input module 52, a speech signal module 53 to be spliced, and a remolded speech signal acquisition module 54, wherein:
The to-be-processed voice segment obtaining module 51 is configured to obtain a to-be-processed voice segment, cut the to-be-processed voice segment according to a preset length, and mark a cutting order to obtain a cutting voice segment and a cutting order mark;
The cut speech segment input module 52 is configured to input the cut speech segment into a trained generating countermeasure network, so as to obtain a noise-reduced speech signal and speech endpoint information corresponding to the noise-reduced speech signal;
The voice signal module 53 to be spliced is configured to combine the noise-reduced voice signal with the corresponding voice endpoint information to form a voice signal to be spliced;
The remolded voice signal obtaining module 54 is configured to splice the voice signals to be spliced according to the cutting sequence marks, so as to obtain remolded voice signals.
Further, before the cut speech segment input module 52, the voice processing apparatus based on the generation countermeasure network further includes:
the voice cutting module is used for acquiring a preset noise voice signal and a target voice signal, and cutting the noise voice signal and the target voice signal according to a preset length to obtain a noise voice section and a target voice section;
The training data acquisition module is used for extracting noise speech segments and target speech segments as training data by random extraction without replacement;
the target loss acquisition module is used for inputting training data into a generated countermeasure network, generating an observation voice segment and a discrimination result, and calculating a loss function value according to the observation voice segment and the discrimination result to obtain target loss;
and the parameter updating module is used for updating the parameters of the generated countermeasure network according to the target loss to obtain the trained generated countermeasure network.
Further, the target loss acquisition module includes:
The first loss value calculation unit is used for inputting the noise voice segment in the training data into a generator for generating an countermeasure network, generating an observation voice segment, and calculating loss function values of the observation voice segment and a target voice segment in the training data to obtain a first loss value;
The second loss value calculation unit is used for inputting the noise voice segment in the training data into a discriminator for generating an countermeasure network to obtain a first discrimination result, and calculating a loss function value of the first discrimination result to obtain a second loss value;
the third loss value calculation unit is used for inputting the target voice segment in the training data into a discriminator for generating an countermeasure network to obtain a second discrimination result, and calculating a loss function value of the second discrimination result to obtain a third loss value;
and a target loss definition unit configured to take the first loss value, the second loss value, and the third loss value as target losses.
Further, the parameter updating module includes:
a generator parameter updating unit for updating generator parameters for generating an countermeasure network according to the first loss value;
The discriminator parameter updating unit is used for updating and generating discriminator parameters of the countermeasure network according to the second loss value and the third loss value;
And the updating stopping unit is used for stopping updating the parameters for generating the countermeasure network when the first loss value reaches a preset threshold value, so that the trained generation countermeasure network is obtained.
Further, the cut speech segment input module 52 includes:
the sequence matrix characteristic unit is used for inputting the cut voice segment into a trained generation countermeasure network, and generating sequence matrix characteristics for the cut voice segment through a coding-decoding model of the generator;
The target feature acquisition unit is used for combining the sequence matrix features with the same size according to the jump connection mode to obtain target features;
The noise-reducing voice signal unit is used for inputting the target characteristics into the full-connection layer network of the generator to obtain a noise-reducing voice signal;
the voice endpoint information unit is used for inputting the noise-reduced voice signal into the trained discriminator for generating the countermeasure network to obtain the voice endpoint information corresponding to the noise-reduced voice signal.
Further, the target feature acquisition unit includes:
The target matrix acquisition subunit is used for traversing the sequence matrix characteristics and acquiring sequence matrix characteristics with the same size as a target matrix, wherein the width and the height of the target matrix are consistent;
and the target feature acquisition subunit is used for combining the target matrixes in a jump connection mode to obtain target features.
Further, the remodelled voice signal acquisition module 54 includes:
The voice sequence acquisition unit is used for marking the sequence from small to large according to the cutting sequence, and arranging the voice signals to be spliced to obtain a voice sequence;
And the voice signal remolding unit is used for splicing the head and the tail of the voice signals to be spliced according to the voice sequence to obtain remolded voice signals.
It should be emphasized that, to further ensure the privacy and security of the above-mentioned speech segments to be processed, the above-mentioned speech segments to be processed may also be stored in a node of a blockchain.
In order to solve the technical problems, the embodiment of the application also provides computer equipment. Referring specifically to fig. 10, fig. 10 is a basic structural block diagram of a computer device according to the present embodiment.
The computer device 6 comprises a memory 61, a processor 62, and a network interface 63 communicatively connected to each other via a system bus. It should be noted that only a computer device 6 having three components, a memory 61, a processor 62, and a network interface 63, is shown in the figures, but it should be understood that not all of the illustrated components are required to be implemented and that more or fewer components may be implemented instead. It will be appreciated by those skilled in the art that the computer device herein is a device capable of automatically performing numerical calculation and/or information processing according to a preset or stored instruction, and its hardware includes, but is not limited to, a microprocessor, an Application-Specific Integrated Circuit (ASIC), a Field-Programmable Gate Array (FPGA), a Digital Signal Processor (DSP), an embedded device, and the like.
The computer device may be a desktop computer, a notebook computer, a palm computer, a cloud server, or the like. The computer device can perform man-machine interaction with a user through a keyboard, a mouse, a remote controller, a touch pad or voice control equipment and the like.
The memory 61 includes at least one type of readable storage medium, including flash memory, hard disk, multimedia card, card memory (e.g., SD or DX memory), Random Access Memory (RAM), Static Random Access Memory (SRAM), Read-Only Memory (ROM), Electrically Erasable Programmable Read-Only Memory (EEPROM), Programmable Read-Only Memory (PROM), magnetic memory, magnetic disk, optical disk, etc. In some embodiments, the memory 61 may be an internal storage unit of the computer device 6, such as a hard disk or memory of the computer device 6. In other embodiments, the memory 61 may also be an external storage device of the computer device 6, such as a plug-in hard disk provided on the computer device 6, a smart media card (SMC), a Secure Digital (SD) card, a flash card, or the like. Of course, the memory 61 may also include both internal storage units of the computer device 6 and external storage devices. In the present embodiment, the memory 61 is typically used for storing an operating system and various types of application software installed on the computer device 6, such as program codes of the voice processing method based on a generation countermeasure network. Further, the memory 61 may also be used to temporarily store various types of data that have been output or are to be output.
Processor 62 may be a central processing unit (Central Processing Unit, CPU), controller, microcontroller, microprocessor, or other data processing chip in some embodiments. The processor 62 is typically used to control the overall operation of the computer device 6. In this embodiment, the processor 62 is configured to execute a program code stored in the memory 61 or process data, for example, a program code based on a voice processing method for generating an countermeasure network.
The network interface 63 may comprise a wireless network interface or a wired network interface, which network interface 63 is typically used for establishing a communication connection between the computer device 6 and other electronic devices.
The present application provides still another embodiment, namely, a computer-readable storage medium storing a server maintenance program executable by at least one processor to cause the at least one processor to perform the steps of a voice processing method based on generating an countermeasure network as described above.
From the above description of the embodiments, it will be clear to those skilled in the art that the above-described embodiment method may be implemented by means of software plus a necessary general hardware platform, but of course may also be implemented by means of hardware, but in many cases the former is a preferred embodiment. Based on such understanding, the technical solution of the present application may be embodied essentially or in a part contributing to the prior art in the form of a software product stored in a storage medium (e.g. ROM/RAM, magnetic disk, optical disk) comprising several instructions for causing a terminal device (which may be a mobile phone, a computer, a server, an air conditioner, or a network device, etc.) to perform the method of the embodiments of the present application.
The blockchain is a novel application mode of computer technologies such as distributed data storage, point-to-point transmission, consensus mechanism, encryption algorithm and the like. The blockchain (Blockchain), essentially a de-centralized database, is a string of data blocks that are generated in association using cryptographic methods, each of which contains information from a batch of network transactions for verifying the validity (anti-counterfeit) of its information and generating the next block. The blockchain may include a blockchain underlying platform, a platform product services layer, an application services layer, and the like.
It is apparent that the above-described embodiments are only some embodiments of the present application, not all of them, and the preferred embodiments of the present application are shown in the drawings, which do not limit the scope of the claims. This application may be embodied in many different forms; the embodiments are provided so that this disclosure will be thorough and complete. Although the application has been described in detail with reference to the foregoing embodiments, it will be apparent to those skilled in the art that modifications may be made to the embodiments described above, or equivalents may be substituted for some of their elements. All equivalent structures made using the content of the specification and drawings of the application, whether applied directly or indirectly in other related technical fields, likewise fall within the scope of the application.

Claims (6)

1. A voice processing method based on a generation countermeasure network, comprising:
Acquiring a preset noise voice signal and a target voice signal, and cutting the noise voice signal and the target voice signal according to a preset length to obtain a noise voice section and a target voice section;
extracting a noise voice section and a target voice section as training data in a random extraction and unreplacing mode;
Inputting the noise voice segment in the training data into a generator for generating an countermeasure network, generating an observation voice segment, and calculating a loss function value of the observation voice segment and a target voice segment in the training data to obtain a first loss value;
inputting the noise voice segment in the training data into a discriminator for generating an countermeasure network to obtain a first discrimination result, and calculating a loss function value of the first discrimination result to obtain a second loss value;
inputting the target voice segment in the training data into a discriminator for generating an countermeasure network to obtain a second discrimination result, and calculating a loss function value of the second discrimination result to obtain a third loss value;
taking the first loss value, the second loss value and the third loss value as target losses;
updating the generator parameters of the generated countermeasure network according to the first loss value;
updating the discriminator parameters of the generated countermeasure network according to the second loss value and the third loss value;
when the first loss value reaches a preset threshold value, stopping updating the parameters of the generated countermeasure network to obtain a trained generated countermeasure network;
Obtaining a voice segment to be processed, cutting the voice segment to be processed according to the preset length, and marking a cutting sequence to obtain a cutting voice segment and a cutting sequence mark;
inputting the cut voice segments into the trained generation countermeasure network, and generating sequence matrix characteristics for the cut voice segments through a coding-decoding model of a generator;
Combining sequence matrix features with the same size according to a jump connection mode to obtain target features;
inputting the target characteristics into a full-connection layer network of the generator to obtain a noise-reduced voice signal;
Inputting the noise-reduced voice signal into the trained discriminator for generating the countermeasure network to obtain voice endpoint information corresponding to the noise-reduced voice signal, wherein the voice endpoint information is a probability value of whether the corresponding noise-reduced voice signal is a real voice signal or not, and whether the voice endpoint information is the real voice can be judged through the probability value;
combining the noise-reduced voice signal with the corresponding voice endpoint information to form a voice signal to be spliced;
and splicing the voice signals to be spliced according to the cutting sequence marks to obtain a remolded voice signal.
2. The voice processing method based on a generation countermeasure network according to claim 1, wherein the step of combining the sequence matrix features of the same size in a jump connection manner to obtain a target feature includes:
Traversing the sequence matrix characteristics to obtain the sequence matrix characteristics with the same size as a target matrix, wherein the width and the height of the target matrix are consistent;
And combining the target matrixes in a jump connection mode to obtain the target characteristics.
3. The voice processing method based on a generation countermeasure network according to any one of claims 1 to 2, wherein the splicing of the voice signals to be spliced according to the cutting sequence marks to obtain a reshaped voice signal includes:
the voice signals to be spliced are arranged according to the sequence from small to large of the cutting sequence marks, and a voice sequence is obtained;
and splicing the head and the tail of the voice signals to be spliced according to the voice sequence to obtain the remolded voice signals.
4. A voice processing apparatus based on a generative adversarial network, comprising:
a voice cutting module, configured to acquire a preset noise voice signal and a target voice signal, and cut the noise voice signal and the target voice signal according to a preset length to obtain noise voice segments and target voice segments;
a training data acquisition module, configured to extract a noise voice segment and a target voice segment as training data by random sampling without replacement;
a first loss value calculation unit, configured to input the noise voice segment in the training data into a generator of the generative adversarial network to generate an observed voice segment, and calculate a loss function value between the observed voice segment and the target voice segment in the training data to obtain a first loss value;
a second loss value calculation unit, configured to input the noise voice segment in the training data into a discriminator of the generative adversarial network to obtain a first discrimination result, and calculate a loss function value of the first discrimination result to obtain a second loss value;
a third loss value calculation unit, configured to input the target voice segment in the training data into the discriminator of the generative adversarial network to obtain a second discrimination result, and calculate a loss function value of the second discrimination result to obtain a third loss value;
a target loss definition unit, configured to take the first loss value, the second loss value, and the third loss value as target losses;
a generator parameter updating unit, configured to update generator parameters of the generative adversarial network according to the first loss value;
a discriminator parameter updating unit, configured to update discriminator parameters of the generative adversarial network according to the second loss value and the third loss value;
an update stopping unit, configured to stop updating the parameters of the generative adversarial network when the first loss value reaches a preset threshold, thereby obtaining a trained generative adversarial network;
a to-be-processed voice segment acquisition module, configured to acquire a voice segment to be processed, cut the voice segment to be processed according to the preset length, and mark the cutting order to obtain cut voice segments and cut-order marks;
a sequence matrix feature unit, configured to input the cut voice segments into the trained generative adversarial network, and generate sequence matrix features for the cut voice segments through an encoder-decoder model of the generator;
a target feature acquisition unit, configured to combine sequence matrix features of the same size in a skip-connection manner to obtain a target feature;
a noise-reduced voice signal unit, configured to input the target feature into a fully connected layer network of the generator to obtain a noise-reduced voice signal;
a voice endpoint information unit, configured to input the noise-reduced voice signal into the trained discriminator of the generative adversarial network to obtain voice endpoint information corresponding to the noise-reduced voice signal, wherein the voice endpoint information is a probability value indicating whether the corresponding noise-reduced voice signal is a real voice signal, and whether the signal is real voice can be determined from the probability value;
a to-be-spliced voice signal module, configured to combine the noise-reduced voice signal with the corresponding voice endpoint information to form voice signals to be spliced; and
a reshaped voice signal acquisition module, configured to splice the voice signals to be spliced according to the cut-order marks to obtain a reshaped voice signal.
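One training step of the scheme in claim 4 can be sketched as below, with plain callables standing in for the real networks. The specific loss functions (L1 for the generator, squared error for the discriminator) and labels are assumptions for illustration; the claim only fixes that the first loss updates the generator and the second plus third losses update the discriminator.

```python
# Hedged sketch of the claim-4 training step. Real optimizers and
# backpropagation are omitted; only the loss bookkeeping is shown.
import numpy as np


def training_step(noise_seg, target_seg, generator, discriminator):
    observed = generator(noise_seg)  # observed voice segment
    # First loss: generator output vs. the clean target (L1 assumed).
    first_loss = float(np.mean(np.abs(observed - target_seg)))
    # Second loss: discriminator on the noise voice segment, scored
    # against the "fake" label 0 (squared error assumed).
    second_loss = float((discriminator(noise_seg) - 0.0) ** 2)
    # Third loss: discriminator on the real target segment, scored
    # against the "real" label 1.
    third_loss = float((discriminator(target_seg) - 1.0) ** 2)
    # Per the claim: the first loss drives the generator update; the
    # second and third losses drive the discriminator update.
    generator_loss = first_loss
    discriminator_loss = second_loss + third_loss
    return first_loss, second_loss, third_loss, generator_loss, discriminator_loss
```

Training would repeat this step on randomly sampled (without replacement) segment pairs and stop once the first loss reaches the preset threshold, as the update stopping unit describes.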
5. A computer device comprising a memory and a processor, the memory storing a computer program, wherein the processor, when executing the computer program, implements the voice processing method based on a generative adversarial network according to any one of claims 1 to 3.
6. A computer-readable storage medium on which a computer program is stored, wherein the computer program, when executed by a processor, implements the voice processing method based on a generative adversarial network according to any one of claims 1 to 3.
CN202011387380.0A 2020-12-01 2020-12-01 Voice processing method, device, equipment and medium based on generation countermeasure network Active CN112397057B (en)

Priority Applications (2)

Application Number Priority Date Filing Date Title
CN202011387380.0A CN112397057B (en) 2020-12-01 2020-12-01 Voice processing method, device, equipment and medium based on generation countermeasure network
PCT/CN2021/096660 WO2022116487A1 (en) 2020-12-01 2021-05-28 Voice processing method and apparatus based on generative adversarial network, device, and medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202011387380.0A CN112397057B (en) 2020-12-01 2020-12-01 Voice processing method, device, equipment and medium based on generation countermeasure network

Publications (2)

Publication Number Publication Date
CN112397057A CN112397057A (en) 2021-02-23
CN112397057B true CN112397057B (en) 2024-07-02

Family

ID=74604174

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202011387380.0A Active CN112397057B (en) 2020-12-01 2020-12-01 Voice processing method, device, equipment and medium based on generation countermeasure network

Country Status (2)

Country Link
CN (1) CN112397057B (en)
WO (1) WO2022116487A1 (en)

Families Citing this family (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112397057B (en) * 2020-12-01 2024-07-02 平安科技(深圳)有限公司 Voice processing method, device, equipment and medium based on generation countermeasure network
CN112992168B (en) * 2021-02-26 2024-04-19 平安科技(深圳)有限公司 Speech noise reducer training method, device, computer equipment and storage medium
CN113096673B (en) * 2021-03-30 2022-09-30 山东省计算中心(国家超级计算济南中心) Speech processing method and system based on generative adversarial network
CN114648998B (en) * 2022-03-10 2025-09-16 北京探境科技有限公司 Voice noise reduction model construction method, voice noise reduction method, device and electronic equipment
CN114842824B (en) * 2022-05-26 2025-04-11 广东华冠智联科技有限公司 Method, device, equipment and medium for silencing indoor environmental noise
CN115171710B (en) * 2022-07-08 2024-10-29 山东省计算中心(国家超级计算济南中心) Speech enhancement method and system for generating countermeasure network based on multi-angle discrimination
CN115346519B (en) * 2022-08-17 2025-04-15 贝壳找房(北京)科技有限公司 Method for constructing silence detection model, electronic device, storage medium and program product
CN116631427B (en) * 2023-07-24 2023-09-29 美智纵横科技有限责任公司 Training method of noise reduction model, noise reduction processing method, device and chip

Citations (2)

Publication number Priority date Publication date Assignee Title
CN107799126A (en) * 2017-10-16 2018-03-13 深圳狗尾草智能科技有限公司 Sound end detecting method and device based on Supervised machine learning
CN111383651A (en) * 2018-12-29 2020-07-07 Tcl集团股份有限公司 Voice noise reduction method and device and terminal equipment

Family Cites Families (18)

Publication number Priority date Publication date Assignee Title
CN106782507B (en) * 2016-12-19 2018-03-06 平安科技(深圳)有限公司 The method and device of voice segmentation
CN106611598B (en) * 2016-12-28 2019-08-02 上海智臻智能网络科技股份有限公司 A VAD dynamic parameter adjustment method and device
CN107293289B (en) * 2017-06-13 2020-05-29 南京医科大学 Speech generation method for generating confrontation network based on deep convolution
US11062179B2 (en) * 2017-11-02 2021-07-13 Royal Bank Of Canada Method and device for generative adversarial network training
CN109887494B (en) * 2017-12-01 2022-08-16 腾讯科技(深圳)有限公司 Method and apparatus for reconstructing a speech signal
CN108257592A (en) * 2018-01-11 2018-07-06 广州势必可赢网络科技有限公司 Human voice segmentation method and system based on long-term and short-term memory model
CN110875060A (en) * 2018-08-31 2020-03-10 阿里巴巴集团控股有限公司 Voice signal processing method, device, system, equipment and storage medium
CN109218629B (en) * 2018-09-14 2021-02-05 三星电子(中国)研发中心 Video generation method, storage medium and device
CN109147810B (en) * 2018-09-30 2019-11-26 百度在线网络技术(北京)有限公司 Establish the method, apparatus, equipment and computer storage medium of speech enhan-cement network
CN110136727B (en) * 2019-04-16 2024-04-16 平安科技(深圳)有限公司 Speaker identification method, device and storage medium based on speaking content
CN110619885B (en) * 2019-08-15 2022-02-11 西北工业大学 Generative Adversarial Network Speech Enhancement Method Based on Deep Fully Convolutional Neural Network
CN110600017B (en) * 2019-09-12 2022-03-04 腾讯科技(深圳)有限公司 Training method of voice processing model, voice recognition method, system and device
CN110689879B (en) * 2019-10-10 2022-02-25 中国科学院自动化研究所 Training method, system and device for end-to-end speech transcription model
CN111261147B (en) * 2020-01-20 2022-10-11 浙江工业大学 A Defense Method for Music Embedding Attacks for Speech Recognition Systems
CN111341294B (en) * 2020-02-28 2023-04-18 电子科技大学 Method for converting text into voice with specified style
CN111445900A (en) * 2020-03-11 2020-07-24 平安科技(深圳)有限公司 A front-end processing method, device and terminal equipment for speech recognition
CN111986659B (en) * 2020-07-16 2024-08-06 百度在线网络技术(北京)有限公司 Method and device for establishing audio generation model
CN112397057B (en) * 2020-12-01 2024-07-02 平安科技(深圳)有限公司 Voice processing method, device, equipment and medium based on generation countermeasure network

Also Published As

Publication number Publication date
WO2022116487A1 (en) 2022-06-09
CN112397057A (en) 2021-02-23

Similar Documents

Publication Publication Date Title
CN112397057B (en) Voice processing method, device, equipment and medium based on generation countermeasure network
CN112633003B (en) Address recognition method and device, computer equipment and storage medium
CN109299458B (en) Entity identification method, device, equipment and storage medium
CN111914076B (en) User image construction method, system, terminal and storage medium based on man-machine conversation
CN112052321A (en) Man-machine conversation method, device, computer equipment and storage medium
CN111883140A (en) Authentication method, device, equipment and medium based on knowledge graph and voiceprint recognition
CN112836521A (en) Question-answer matching method, device, computer equipment and storage medium
US11036996B2 (en) Method and apparatus for determining (raw) video materials for news
CN112699213A (en) Speech intention recognition method and device, computer equipment and storage medium
CN114783423B (en) Speech segmentation method, device, computer equipment and medium based on speech speed adjustment
CN111191207A (en) Electronic file control method and device, computer equipment and storage medium
CN112765981A (en) Text information generation method and device
CN115376496B (en) A speech recognition method, device, computer equipment and storage medium
CN113421554A (en) Voice keyword detection model processing method and device and computer equipment
CN108257604B (en) Speech recognition method, terminal device and computer-readable storage medium
CN112289324A (en) Voiceprint identity recognition method and device and electronic equipment
CN113220828B (en) Method, device, computer equipment and storage medium for processing intention recognition model
CN117174177B (en) Training method and device for protein sequence generation model and electronic equipment
CN111639164B (en) Question and answer matching method, device, computer equipment and storage medium for question and answer system
CN118015639A (en) Table relation analysis method, apparatus, computer device and storage medium
CN115862159A (en) Liveness detection method, device, equipment and storage medium
CN114020628B (en) Code vulnerability detection method and device
CN115687934A (en) Intention recognition method and device, computer equipment and storage medium
WO2022161025A1 (en) Voiceprint recognition method and apparatus, electronic device, and readable storage medium
CN119806937A (en) Anomaly detection method and related equipment for application service reliability

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant