CN112216293B - Tone color conversion method and device

Info

Publication number: CN112216293B
Authority: CN (China)
Prior art keywords: target, parameter, voice, vector, preset
Application number: CN202010889099.0A
Other languages: Chinese (zh)
Other versions: CN112216293A (en)
Inventors: 王愈, 李健, 陈明, 武卫东
Current and original assignee: Beijing Sinovoice Technology Co Ltd
Application filed by Beijing Sinovoice Technology Co Ltd
Priority: CN202010889099.0A
Publication of application: CN112216293A
Publication of grant: CN112216293B
Legal status: Active (granted)

Classifications

    • G: PHYSICS
    • G10: MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L: SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L21/00: Speech or voice signal processing techniques to produce another audible or non-audible signal, e.g. visual or tactile, in order to modify its quality or its intelligibility
    • G10L21/003: Changing voice quality, e.g. pitch or formants
    • G10L21/007: Changing voice quality, e.g. pitch or formants, characterised by the process used
    • G10L25/00: Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
    • G10L25/03: ...characterised by the type of extracted parameters
    • G10L25/24: ...the extracted parameters being the cepstrum
    • G10L25/27: ...characterised by the analysis technique
    • G10L25/30: ...using neural networks

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Signal Processing (AREA)
  • Health & Medical Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Human Computer Interaction (AREA)
  • Computational Linguistics (AREA)
  • Acoustics & Sound (AREA)
  • Multimedia (AREA)
  • Quality & Reliability (AREA)
  • Artificial Intelligence (AREA)
  • Evolutionary Computation (AREA)
  • Electrophonic Musical Instruments (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

An embodiment of the invention provides a tone color conversion method and device. The method comprises the following steps: acquiring the voice to be converted; extracting multiple characteristic parameters of the voice to be converted; combining the characteristic parameters to obtain a feature vector; performing tone color conversion on the feature vector to obtain target characteristic parameters; and performing sounding processing with the target characteristic parameters to obtain the target voice. The method applies tone conversion to all of the characteristic parameters of the voice to be converted, thoroughly converting them into the characteristic parameters of the target person, which improves the naturalness and stability of the conversion result while letting the converted voice retain characteristics of the original speaker such as intonation.

Description

Tone color conversion method and device
Technical Field
The invention relates to the technical field of data processing, in particular to a tone color conversion method and a tone color conversion device.
Background
VC (Voice Conversion) is the process of converting the vocal timbre (tone color) of one person's voice into that of another person while leaving the content of the speech unchanged. Voice conversion differs from speech synthesis: speech synthesis goes from text to speech, requiring NLP (Natural Language Processing) analysis to reproduce speech that conveys pronunciation, meaning and expression, and focuses on generating the speech; voice conversion goes from speech to speech, involves no NLP, works directly at the acoustic level, and re-maps one voice onto another.
Tone color conversion has wide applications ranging from everyday entertainment to pronunciation correction and identity attack and defense. Its development history is not short: early approaches could only train a one-to-one conversion model between two specific people from recordings of them reading the same content (i.e., parallel corpora), so the demand on total data volume was high and conversion stability was poor.
Currently, tone conversion mainly converts anyone's voice into the vocal timbre of a specific target person with the content unchanged, and the converted voice approaches the target person in every aspect, including intonation. In some demanding scenarios, however, it is desirable that the converted speech retain the original speaker's intonation, e.g., speech that sounds angry should still sound angry after conversion.
Disclosure of Invention
In view of the foregoing, embodiments of the present invention have been made to provide a tone color conversion method and a corresponding tone color conversion apparatus that overcome or at least partially solve the foregoing problems.
The embodiment of the invention discloses a tone color conversion method, which comprises the following steps:
Acquiring voice to be converted;
Extracting various characteristic parameters of the voice to be converted;
Combining the plurality of characteristic parameters to obtain a characteristic vector;
Performing tone color conversion on the feature vector to obtain a target feature parameter;
and carrying out sounding processing by adopting the target characteristic parameters to obtain target voice.
Optionally, the characteristic parameters include a first spectral parameter, a fundamental frequency parameter and an aperiodic component parameter; the combining the plurality of feature parameters to obtain a feature vector includes:
Extracting acoustic features of the first spectrum parameters to obtain second spectrum parameters, wherein the second spectrum parameters correspond to sounding contents of the voice to be converted;
and combining the second spectrum parameter, the fundamental frequency parameter and the aperiodic component parameter to obtain a feature vector.
Optionally, the performing timbre conversion on the feature vector to obtain a target feature parameter includes:
and performing timbre conversion on the characteristic vector to obtain a target spectrum parameter, a target fundamental frequency parameter and a target aperiodic component parameter.
Optionally, the performing the sounding processing by using the target feature parameter to obtain a target voice includes:
and inputting the target spectrum parameter, the target fundamental frequency parameter and the target aperiodic component parameter into a preset vocoder for sounding processing to obtain the target voice.
Optionally, the performing timbre conversion on the feature vector to obtain a target feature parameter includes:
and performing timbre conversion on the feature vector by adopting a preset U-shaped timbre conversion model to obtain a target feature parameter.
Optionally, the timbre conversion model with the preset U-shaped structure comprises a pooling layer and a deconvolution layer, wherein an operation core of the pooling layer comprises a binary context prediction model, and an operation core of the deconvolution layer comprises a binary context prediction model.
Optionally, a tone color conversion model with a preset U-shaped structure is adopted to perform tone color conversion on the feature vector to obtain a target feature parameter, which includes:
In a pooling layer of a tone color conversion model of the preset U-shaped structure, performing downsampling processing on the feature vector by adopting a binary context prediction model to obtain a first intermediate vector;
in a deconvolution layer of the tone color conversion model with the preset U-shaped structure, the binary context prediction model is adopted to carry out up-sampling processing on the first intermediate vector to obtain a second intermediate vector;
And converting the second intermediate vector to obtain target characteristic parameters.
Optionally, in the pooling layer of the preset U-shaped timbre conversion model, the downsampling processing is performed on the feature vector by using a binary context prediction model to obtain a first intermediate vector, where the downsampling processing includes:
And in a pooling layer of the tone color conversion model of the preset U-shaped structure, a binary context prediction model is adopted, and a vector at one moment is predicted according to the feature vectors at two adjacent moments, so that a first intermediate vector is obtained.
Optionally, in the deconvolution layer of the preset U-shaped timbre conversion model, the upsampling processing is performed on the first intermediate vector by using the binary context prediction model to obtain a second intermediate vector, including:
and in the deconvolution layer of the tone color conversion model with the preset U-shaped structure, the binary context prediction model is adopted to predict a vector at one moment according to the first intermediate vectors at two adjacent moments, so as to obtain a second intermediate vector.
Optionally, the weights of the pooling layer and the deconvolution layer are shared.
The embodiment of the invention also discloses a tone color conversion device, which comprises:
The voice acquisition module is used for acquiring the voice to be converted;
the characteristic parameter extraction module is used for extracting various characteristic parameters of the voice to be converted;
the characteristic parameter combination module is used for combining the plurality of characteristic parameters to obtain a characteristic vector;
The tone color conversion module is used for performing tone color conversion on the feature vector to obtain a target feature parameter;
And the sounding processing module is used for performing sounding processing by adopting the target characteristic parameters to obtain target voice.
Optionally, the characteristic parameters include a first spectral parameter, a fundamental frequency parameter and an aperiodic component parameter; the characteristic parameter combination module comprises:
The spectrum parameter extraction sub-module is used for extracting acoustic characteristics of the first spectrum parameters to obtain second spectrum parameters, and the second spectrum parameters correspond to sounding contents of the voice to be converted;
And the characteristic parameter combination sub-module is used for combining the second spectrum parameter, the fundamental frequency parameter and the aperiodic component parameter to obtain a characteristic vector.
Optionally, the timbre conversion module includes:
And the first tone conversion sub-module is used for tone conversion of the feature vector to obtain a target spectrum parameter, a target fundamental frequency parameter and a target aperiodic component parameter.
Optionally, the sound production processing module includes:
and the sounding processing sub-module is used for inputting the target spectrum parameter, the target fundamental frequency parameter and the target aperiodic component parameter into a preset vocoder to perform sounding processing to obtain target voice.
Optionally, the timbre conversion module includes:
And the second tone conversion sub-module is used for performing tone conversion on the feature vector by adopting a preset U-shaped tone conversion model to obtain a target feature parameter.
Optionally, the timbre conversion model with the preset U-shaped structure comprises a pooling layer and a deconvolution layer, wherein an operation core of the pooling layer comprises a binary context prediction model, and an operation core of the deconvolution layer comprises a binary context prediction model.
Optionally, the second timbre conversion submodule includes:
the downsampling processing unit is used for performing downsampling processing on the feature vector by adopting a binary context prediction model in a pooling layer of the tone conversion model of the preset U-shaped structure to obtain a first intermediate vector;
The up-sampling processing unit is used for up-sampling the first intermediate vector by adopting the binary context prediction model in a deconvolution layer of the tone conversion model with the preset U-shaped structure to obtain a second intermediate vector;
And the conversion unit is used for converting the second intermediate vector to obtain target characteristic parameters.
Optionally, the downsampling processing unit includes:
and the downsampling processing subunit is used for predicting a vector at one moment according to the feature vectors at two adjacent moments by adopting a binary context prediction model in a pooling layer of the tone conversion model with the preset U-shaped structure to obtain a first intermediate vector.
Optionally, the upsampling processing unit includes:
and the up-sampling processing subunit is used for predicting a vector at one moment according to the first intermediate vectors at two adjacent moments by adopting the binary context prediction model in the deconvolution layer of the tone conversion model with the preset U-shaped structure to obtain a second intermediate vector.
Optionally, the weights of the pooling layer and the deconvolution layer are shared.
The embodiment of the invention also discloses an electronic device, which comprises:
One or more processors; and
One or more machine readable media having instructions stored thereon, which when executed by the one or more processors, cause the electronic device to perform the method of any of the embodiments of the present invention.
Embodiments of the present invention also disclose a computer-readable storage medium having instructions stored thereon, which when executed by one or more processors, cause the processors to perform a method according to any of the embodiments of the present invention.
The embodiment of the invention has the following advantages:
In the embodiment of the invention, the voice to be converted is acquired, multiple characteristic parameters of the voice to be converted are extracted, the characteristic parameters are combined into a feature vector, tone color conversion is performed on the feature vector to obtain the target characteristic parameters, and sounding processing is performed with the target characteristic parameters to obtain the target voice. Tone conversion is thus applied to all of the characteristic parameters of the voice to be converted, thoroughly converting them into the characteristic parameters of the target person, which improves the naturalness and stability of the conversion result while letting the converted voice retain characteristics of the original speaker such as intonation.
Drawings
FIG. 1 is a block diagram of a tone color conversion system of the present invention;
FIG. 2 is a flow chart of steps of an embodiment of a tone color conversion method of the present invention;
FIG. 3 is a schematic diagram of a binary context prediction model of the present invention;
FIG. 4 is a block diagram of an embodiment of a tone color conversion apparatus of the present invention.
Detailed Description
In order that the above-recited objects, features and advantages of the present invention will become more readily apparent, a more particular description of the invention will be rendered by reference to the appended drawings and appended detailed description.
Tone conversion method based on PPGs (Phonetic Posteriorgrams): speech recognition is introduced to extract basic pronunciation features stripped of personal characteristics, which are then converted toward a specific target person; this realizes any-to-one conversion with good naturalness. As shown in FIG. 1, the system comprises three parts, ASR (Automatic Speech Recognition), a conversion model and a vocoder, separated by dotted lines in the figure. The first two parts are model-training steps that respectively train the acoustic model for speech recognition and the timbre conversion model acting on spectral parameters; the third part is the actual tone conversion flow after the models are trained. The ASR is responsible for extracting from speech an acoustic feature that is independent of the speaker and reflects only the pronunciation content, called PPGs; the conversion model is responsible for converting from PPGs to the spectral parameters of a specific person, and the generated spectral parameters are then sent to the vocoder for sounding together with the Log F0 and AP of the input speech. The tone color conversion flow is as follows:
1) The voice to be converted is input into a speech signal parameter extraction algorithm, and two sets of parameters are extracted. The first set comes from the feature pre-extraction module of the speech recognition system: the spectral parameter MFCC (Mel Frequency Cepstrum Coefficient), extracted for the subsequent speech recognition. The second set comprises the spectral parameters MCEPs, the fundamental frequency F0 (as Log F0) and the aperiodic component AP; after the subsequent tone conversion step, the MCEPs are sent back, together with the Log F0 and AP, to the vocoder's reconstruction and synthesis algorithm to obtain speech, which at that point sounds like the converted timbre.
2) And sending the MFCC into an acoustic model of the voice recognition system to obtain PPGs.
3) And sending the PPGs into a tone color conversion model to obtain MCEPs of the target person.
4) A simple linear transformation is applied to the Log F0 obtained in step 1): for example, the difference between the global means of the two speakers' Log F0 is pre-computed, and this difference is then added to the Log F0 obtained in 1).
5) And delivering MCEPs of the target person together with the Log F0 obtained in 4) and the AP obtained in 1) to a vocoder to obtain the final converted voice.
The ASR part trains a DNN recognition model with the Kaldi toolkit; the vocoder uses the traditional signal-processing STRAIGHT toolkit to extract MCEPs, Log F0 and AP; the conversion model uses a simple bidirectional LSTM structure to model the mapping from PPGs to MCEPs.
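To make the flow above concrete, a minimal sketch follows. Every callable passed in (the MFCC front end, the vocoder parameter extractor, the ASR acoustic model, the conversion model, the synthesizer) is a hypothetical placeholder for the components named above, not an API of Kaldi or STRAIGHT; only the Log F0 shift of step 4) is computed literally.

```python
def convert_timbre(wav, sr, extract_mfcc, extract_vocoder_params,
                   asr_acoustic_model, conversion_model, vocoder_synthesize,
                   src_logf0_mean, tgt_logf0_mean):
    """Hypothetical end-to-end sketch of steps 1) to 5) above."""
    # 1) Extract the two parameter sets from the input speech.
    mfcc = extract_mfcc(wav, sr)                         # for recognition
    mceps, log_f0, ap = extract_vocoder_params(wav, sr)  # for reconstruction
    # 2) MFCC -> speaker-independent PPGs via the ASR acoustic model.
    ppgs = asr_acoustic_model(mfcc)
    # 3) PPGs -> the target person's MCEPs via the conversion model.
    tgt_mceps = conversion_model(ppgs)
    # 4) Simple linear transform of Log F0: shift by the pre-computed
    #    difference of the two speakers' global Log F0 means.
    tgt_log_f0 = log_f0 + (tgt_logf0_mean - src_logf0_mean)
    # 5) Vocoder reconstruction from the converted parameters.
    return vocoder_synthesize(tgt_mceps, tgt_log_f0, ap)
```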
This scheme can convert anyone's voice into the vocal timbre of a specific target person with the content unchanged, and the converted voice approaches the target person in every aspect, including intonation. In some demanding scenarios, however, it is desirable that the converted speech retain the original speaker's intonation, e.g., speech that sounds angry should still sound angry after conversion. In speech, expressiveness is carried primarily by intonation, including the overall pitch level and its variation pattern; by extension, if intonation could be controlled at fine granularity, one could even produce singing.
Therefore, in the embodiment of the invention, the three parameter sets MCEPs, AP and Log F0 can be fed into the tone color conversion model together to obtain all three parameters of the target person, converting the whole thoroughly into the target person. Compared with a scheme that applies tone conversion only to the MCEPs, converting the three sets synchronously as a whole can significantly improve the naturalness and stability of the conversion result.
UFANS (U-shaped Fully-parallel Acoustic Neural Structure) is a deep neural network structure for one-dimensional sequence modeling tasks (such as speech and natural language processing). The structure has two main characteristics. First, it is a U-shaped structure modeled on the U-Net that is very popular in the image field: the input size is halved recursively by round-after-round downsampling, then the result of each round is doubled in size by deconvolution and added as a residual to the input of that round. Each round can therefore be regarded as the sum of two information paths: a basic convolution path, and a second path that descends one level, travels at half the size, and comes back up; with this second path available, a wider field of view is covered. Second, it is a fully convolutional network containing only convolution, deconvolution and pooling operations, with no RNN (Recurrent Neural Network) infrastructure, which achieves fully parallelized computation and greatly improves computation speed.
The forward computation of UFANS is illustrated below. Assume the model input is a matrix of size [T', D], where T' is the time length (e.g., the number of speech frames) and D is the feature dimension of each frame. The calculation flow is as follows:
1) The end of the input is zero-padded along the time axis, yielding a matrix IN of size [T, D] such that the padded length T is exactly an integer power of 2 (e.g., 4, 8, 16, 32, 64, 128, ...).
2) IN passes through convolution layer A1 (kernel size 3, output feature dimension F) and its matched set of excitation functions to obtain matrix O_A1 of size [T, F].
3) O_A1 passes through average pooling layer B1 (cell size 2, stride 2) to obtain matrix O_B1 of size [T/2, F]. The average pooling layer B1 computes: the mean of frames 1 and 2 as output instant 1; the mean of frames 3 and 4 as output instant 2; the mean of frames 5 and 6 as output instant 3; and so on. Each pair of consecutive input instants yields one output instant, so the final time length is halved.
4) O_B1 passes through convolution layer A2 (kernel size 3, output feature dimension F) and its matched excitation functions to obtain matrix O_A2 of size [T/2, F].
5) O_A2 passes through average pooling layer B2 (cell size 2, stride 2) to obtain matrix O_B2 of size [T/4, F].
6) O_B2 passes through deconvolution layer C2 (kernel size 2, stride 2, output feature dimension F) to obtain matrix O_C2 of size [T/2, F]. The deconvolution layer C2 computes as follows: first an all-zero vector is inserted between every two input instants, giving a temporary matrix of doubled size, and then an ordinary convolution is applied, so the result is twice the input size.
7) O_A2 from 4) and O_C2 from 6) (both of length T/2) are added, and the sum passes through convolution layer D2 (kernel size 3, output feature dimension F) and its matched excitation functions to obtain matrix O_D2 of size [T/2, F].
8) O_D2 passes through deconvolution layer C1 (kernel size 2, stride 2, output feature dimension F) to obtain matrix O_C1 of size [T, F].
9) O_A1 from 2) and O_C1 from 8) (both of length T) are added, and the sum passes through convolution layer D1 (kernel size 3, output feature dimension F) and its matched excitation functions to obtain matrix O_D1 of size [T, F].
10) O_D1 passes through the final convolution layer E (kernel size 3, output feature dimension 2F) and a matched excitation function (e.g., tanh) to obtain matrix OUT of size [T, 2F].
It should be noted that the above illustrates a structure with only 2 levels; in practice more levels are generally designed, simply iterating the single-level procedure of 4) to 7) for more rounds. A code sketch of this two-level forward pass is given below.
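The following is a compact PyTorch rendering of the two-level walkthrough, offered as an illustration under assumptions rather than the patent's reference implementation: layer names mirror A1 through E, while ReLU as the "matched excitation function set" and padding=1 on the kernel-3 convolutions (to keep time lengths unchanged) are assumptions the text does not pin down.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class TwoLevelUFANS(nn.Module):
    """Sketch of the 2-level forward pass of steps 1)-10) above."""

    def __init__(self, d_in, d_feat):
        super().__init__()
        self.a1 = nn.Conv1d(d_in, d_feat, 3, padding=1)            # step 2
        self.b1 = nn.AvgPool1d(2, stride=2)                        # step 3
        self.a2 = nn.Conv1d(d_feat, d_feat, 3, padding=1)          # step 4
        self.b2 = nn.AvgPool1d(2, stride=2)                        # step 5
        self.c2 = nn.ConvTranspose1d(d_feat, d_feat, 2, stride=2)  # step 6
        self.d2 = nn.Conv1d(d_feat, d_feat, 3, padding=1)          # step 7
        self.c1 = nn.ConvTranspose1d(d_feat, d_feat, 2, stride=2)  # step 8
        self.d1 = nn.Conv1d(d_feat, d_feat, 3, padding=1)          # step 9
        self.e = nn.Conv1d(d_feat, 2 * d_feat, 3, padding=1)       # step 10

    def forward(self, x):                   # x: [batch, d_in, T'], T' >= 3
        t0 = x.size(-1)
        t = 1 << (t0 - 1).bit_length()      # next integer power of 2
        x = F.pad(x, (0, t - t0))           # step 1: zero-pad the time axis
        o_a1 = torch.relu(self.a1(x))       # [batch, d_feat, T]
        o_b1 = self.b1(o_a1)                # [batch, d_feat, T/2]
        o_a2 = torch.relu(self.a2(o_b1))
        o_b2 = self.b2(o_a2)                # [batch, d_feat, T/4]
        o_c2 = self.c2(o_b2)                # back to length T/2
        o_d2 = torch.relu(self.d2(o_a2 + o_c2))  # step 7: residual add
        o_c1 = self.c1(o_d2)                # back to length T
        o_d1 = torch.relu(self.d1(o_a1 + o_c1))  # step 9: residual add
        return torch.tanh(self.e(o_d1))     # [batch, 2*d_feat, T]


# Usage: TwoLevelUFANS(40, 64)(torch.randn(1, 40, 100)) -> shape [1, 128, 128]
```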
In the embodiment of the invention, the timbre conversion model in the PPGs-based scheme can be replaced, from the original DBLSTM structure, by a UFANS structure with better effect and performance. LSTM (Long Short-Term Memory) is an improved RNN structure that carries the state of the recent past in an internal state unit, absorbing new information as it steps through time while gradually forgetting the oldest; the model's coverage of the context field of view depends on the memory span of the state unit, so at any instant only a limited stretch of preceding information is visible. DBLSTM simply adds together two LSTMs running in opposite directions, front-to-back and back-to-front, each direction essentially keeping its own separate memory. The principle of UFANS is superior. First, the basic operations inside the U-shaped structure are essentially all convolutions, and a convolution merges forward and backward information jointly, which is better than DBLSTM's two separately memorized directions. Second, the U-shaped structure lets every instant simultaneously see information from contexts of different spans, [back 1, ahead 1], [back 2, ahead 2], [back 4, ahead 4], [back 8, ahead 8], [back 16, ahead 16], and so on, with the fusion weights of the different groups of information learned automatically; the deeper the U-shaped structure, the wider the covered context span, without limit. Finally, an RNN-structured network must be computed recursively instant by instant, whereas a convolutional network can fold the work of a convolution kernel into one matrix operation completed at once, i.e., in parallel, so its performance is very fast.
Referring to FIG. 2, a flowchart illustrating the steps of an embodiment of a tone color conversion method according to the present invention, the method may specifically include the following steps:
step 201, obtaining voice to be converted;
The speech to be converted may be audio data for which a timbre conversion is required. In the embodiment of the invention, the voice to be converted can be obtained, so that the voice to be converted is input into a pre-trained voice recognition system, and the voice recognition system is adopted to recognize the voice to be converted and convert the tone.
Step 202, extracting various characteristic parameters of the voice to be converted;
A characteristic parameter is a parameter that captures a key feature of the voice to be converted. For example, a characteristic parameter may be the MFCC (Mel Frequency Cepstrum Coefficient): a person produces sound through the vocal tract, whose shape determines what sound comes out; that shape shows up in the envelope of the short-time power spectrum of the speech, and the MFCC is a feature that accurately describes this envelope. A characteristic parameter may also be the fundamental frequency F0, which represents the vibration frequency of the fundamental tone and determines the pitch of the voice, or the aperiodic component AP.
Specifically, the voice to be converted may be input into a speech signal parameter extraction algorithm, which extracts the various characteristic parameters. As an example, the extracted parameters may include two sets: the first set uses the feature pre-extraction module of the speech recognition system to extract the first spectral parameter MFCC for subsequent speech recognition; the second set uses a vocoder parameter extraction algorithm to extract the spectral parameters MCEPs (Mel-cepstral Coefficients), the fundamental frequency parameter F0 (Log F0 is obtained by taking the logarithm of F0 after extraction) and the aperiodic component parameter AP, as sketched below.
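The description names STRAIGHT for this step; since STRAIGHT is not freely redistributable, the sketch below substitutes the open-source WORLD vocoder (pyworld) together with librosa and pysptk, plainly as stand-ins. The file name, MFCC order and mel-cepstral settings are illustrative assumptions.

```python
import librosa
import numpy as np
import pysptk
import pyworld

# Load the speech to be converted (file name is illustrative).
wav, sr = librosa.load("to_convert.wav", sr=16000)

# First set: spectral parameter MFCC for the recognition front end.
mfcc = librosa.feature.mfcc(y=wav, sr=sr, n_mfcc=13).T           # [frames, 13]

# Second set: vocoder parameters F0, spectral envelope, aperiodic component.
f0, sp, ap = pyworld.wav2world(wav.astype(np.float64), sr)
mceps = pysptk.sp2mc(sp, order=24, alpha=0.42)  # MCEPs; alpha 0.42 suits 16 kHz
log_f0 = np.log(np.maximum(f0, 1e-10))  # Log F0; floor avoids log(0) on unvoiced frames
```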
Step 203, combining the various characteristic parameters to obtain a characteristic vector;
Multiple characteristic parameters may be extracted for each frame of the voice to be converted, and the parameters corresponding to each frame can be concatenated into one long feature vector, which is then used for the subsequent tone color conversion.
Step 204, performing timbre conversion on the feature vector to obtain a target feature parameter;
the target feature parameter refers to a feature parameter corresponding to the voice of the target person, and in the embodiment of the invention, the feature vector can be converted into the target feature parameter through tone conversion.
Specifically, the speech recognition system may include a timbre conversion model, and this model may be used to perform timbre conversion on the feature vector to obtain the target characteristic parameters.
And 205, performing sounding processing by adopting the target characteristic parameters to obtain target voice.
The speech recognition system may be coupled to a vocoder that synthesizes the received characteristic parameters to generate speech. In the embodiment of the invention, the target characteristic parameters can be input into the vocoder, and the target characteristic parameters are subjected to sounding processing by the vocoder to obtain target voice, wherein the target voice can be voice conforming to the tone of a target person.
In the embodiment of the invention, the voice to be converted is acquired, multiple characteristic parameters of the voice to be converted are extracted, the characteristic parameters are combined into a feature vector, tone color conversion is performed on the feature vector to obtain the target characteristic parameters, and sounding processing is performed with the target characteristic parameters to obtain the target voice. Tone conversion is thus applied to all of the characteristic parameters of the voice to be converted, thoroughly converting them into the characteristic parameters of the target person, which improves the naturalness and stability of the conversion result while letting the converted voice retain characteristics of the original speaker such as intonation.
In a preferred embodiment of the invention, the characteristic parameters include a first spectral parameter, a fundamental frequency parameter and an aperiodic component parameter; the step 203 may comprise the sub-steps of:
extracting acoustic features of the first spectrum parameters to obtain second spectrum parameters, wherein the second spectrum parameters correspond to sounding contents of the voice to be converted; and combining the second spectrum parameter, the fundamental frequency parameter and the aperiodic component parameter to obtain a feature vector.
The first spectral parameter may be an MFCC mel frequency cepstrum parameter.
In the embodiment of the invention, the first spectral parameter MFCC can be further processed to obtain the second spectral parameters, which may be PPGs (Phonetic Posteriorgrams): acoustic features that are independent of the speaker and reflect only the utterance content, corresponding to the sounding content of the voice to be converted.
After the second spectral parameters PPGs are extracted, the second spectral parameters, the fundamental frequency parameter and the aperiodic component parameter may be concatenated into one long feature vector. In a specific implementation, Log F0 may be obtained by taking the logarithm of the fundamental frequency parameter F0, and the PPGs, Log F0 and AP of each frame are then concatenated into a long feature vector, as in the sketch below.
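The concatenation itself is a one-liner. In this sketch the shapes (218 PPG dimensions, 5 aperiodicity bands, 320 frames) are illustrative assumptions, and the random arrays stand in for frame-aligned features produced upstream.

```python
import numpy as np

n_frames = 320                                       # illustrative utterance length
ppgs = np.random.rand(n_frames, 218)                 # speaker-independent content features
f0 = np.random.uniform(80.0, 280.0, (n_frames, 1))   # fundamental frequency in Hz
ap = np.random.rand(n_frames, 5)                     # aperiodic component per band

log_f0 = np.log(f0)                                  # take the logarithm of F0
feature_vector = np.concatenate([ppgs, log_f0, ap], axis=1)  # one long vector per frame
assert feature_vector.shape == (n_frames, 218 + 1 + 5)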
In a preferred embodiment of the present invention, the step 204 may comprise the following sub-steps:
and performing timbre conversion on the characteristic vector to obtain a target spectrum parameter, a target fundamental frequency parameter and a target aperiodic component parameter.
The feature vector contains the PPGs, Log F0 and AP of the voice to be converted, and the timbre conversion module in the speech recognition system can perform timbre conversion on the PPGs, Log F0 and AP in the feature vector to obtain the target spectral parameters, the target fundamental frequency parameter and the target aperiodic component parameter. The target spectral parameters are the target person's MCEPs, the target fundamental frequency parameter is the target person's Log F0, and the target aperiodic component parameter is the target person's AP.
In other words, the tone color conversion module can convert the second spectral parameters PPGs of the voice to be converted into the target person's spectral parameters MCEPs, the Log F0 of the voice to be converted into the target person's Log F0, and the AP of the voice to be converted into the target person's AP.
In a preferred embodiment of the present invention, the step 205 may comprise the following sub-steps:
and inputting the target spectrum parameter, the target fundamental frequency parameter and the target aperiodic component parameter into a preset vocoder for sounding processing to obtain the target voice.
The preset vocoder may be a preset module for synthesizing voice. In the embodiment of the invention, the target spectrum parameter, the target fundamental frequency parameter and the target aperiodic component parameter can be input into a preset vocoder, and the vocoder can synthesize the received target spectrum parameter, the target fundamental frequency parameter and the target aperiodic component parameter to generate target voice, wherein the target voice can be voice conforming to the tone of a target person.
In a preferred embodiment of the present invention, the step 204 may comprise the following sub-steps:
and performing timbre conversion on the feature vector by adopting a preset U-shaped timbre conversion model to obtain a target feature parameter.
The preset U-shaped tone color conversion model may be a preset UFANS-structured tone color conversion model used for performing tone color conversion on input data.
In the embodiment of the invention, a preset U-shaped tone color conversion model can be used to perform tone color conversion on the feature vector to obtain the target characteristic parameters. Compared with a DBLSTM-structured tone color conversion model, a UFANS-structured model has better effect and performance: the advantage in effect derives from a sufficiently wide context view, and the advantage in performance derives from the fully parallel network structure, further improving the naturalness and stability of the conversion result.
In a preferred embodiment of the present invention, the preset U-shaped timbre conversion model includes a pooling layer and a deconvolution layer, wherein an operation core of the pooling layer includes a binary context prediction model, and an operation core of the deconvolution layer includes a binary context prediction model.
The binary context prediction model may be a 2-gram prediction model, which in the pooling layer takes the feature vectors of two instants as input and outputs the feature vector of one instant. FIG. 3 shows the structure of a binary context prediction model according to an embodiment of the present invention: Input1 and Input2 are the feature vectors of two instants and Output is the output feature vector, so the model can be regarded as downsampling the feature vectors of two instants to the feature vector of one instant.
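One plausible reading of the 2-gram predictor of FIG. 3 is a small feed-forward module mapping two adjacent frame vectors to one vector; the single hidden layer and its width are assumptions, since the text does not specify the module's internals.

```python
import torch
import torch.nn as nn


class BinaryContextPredictor(nn.Module):
    """Predicts the vector of one instant from the vectors of two adjacent
    instants (Input1, Input2 -> Output in FIG. 3)."""

    def __init__(self, dim, hidden=256):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(2 * dim, hidden),
            nn.ReLU(),
            nn.Linear(hidden, dim),
        )

    def forward(self, x1, x2):
        # x1, x2: [..., dim] feature vectors of two adjacent instants
        return self.net(torch.cat([x1, x2], dim=-1))
```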
In the embodiment of the invention, the operation kernel of the pooling layer contains a binary context prediction model, and so does the operation kernel of the deconvolution layer. In the original UFANS structure, the pooling layer downsamples by taking means (Average Pooling) and the deconvolution layer upsamples by zero-filling, so the resulting vectors are biased. The embodiment of the invention further remedies these inherent defects of UFANS by introducing a binary context prediction model shared between downsampling and upsampling: on the one hand this relieves the dilemma that upsampling has no real information to add, and on the other hand sharing the model has an adversarial-generation effect, making training converge more accurately.
In a preferred embodiment of the invention, the weights of the pooling layer and the deconvolution layer are shared.
Because the pooling layer and the deconvolution layer attend to the same context information, they can learn jointly with the same weights, which reduces the demand on training data and lets the common information be mined jointly during downsampling and upsampling.
In a preferred embodiment of the present invention, a preset U-shaped tone conversion model is used to perform tone conversion on the feature vector to obtain a target feature parameter, where the method includes:
In a pooling layer of the tone color conversion model of the preset U-shaped structure, performing downsampling processing on the feature vector using a binary context prediction model to obtain a first intermediate vector; in a deconvolution layer of the tone color conversion model of the preset U-shaped structure, performing upsampling processing on the first intermediate vector using the binary context prediction model to obtain a second intermediate vector; and converting the second intermediate vector to obtain the target characteristic parameters.
In the embodiment of the invention, in a pooling layer of a tone conversion model with a preset U-shaped structure, a binary context prediction model is adopted to carry out downsampling treatment on a feature vector so as to obtain a first intermediate vector; in a deconvolution layer of a tone color conversion model with a preset U-shaped structure, performing up-sampling processing on the first intermediate vector by adopting a binary context prediction model to obtain a second intermediate vector; and converting the second intermediate vector to obtain the target characteristic parameters.
In a preferred embodiment of the present invention, in the pooling layer of the timbre conversion model of the preset U-shaped structure, the downsampling process is performed on the feature vector by using a binary context prediction model to obtain a first intermediate vector, where the downsampling process includes:
And in a pooling layer of the tone color conversion model of the preset U-shaped structure, a binary context prediction model is adopted, and a vector at one moment is predicted according to the feature vectors at two adjacent moments, so that a first intermediate vector is obtained.
Specifically, in the pooling layer of the preset U-shaped timbre conversion model, a binary context prediction model predicts the vector of one instant from the feature vectors of two adjacent instants to obtain the first intermediate vector. By contrast, in the original UFANS structure the operation kernel of the average pooling layer contains no binary context prediction model and merely computes a mean of the feature vectors of two adjacent instants. Once the binary context prediction model is added to the kernel of the average pooling layer, the feature vectors of two adjacent instants instead pass through the prediction model to produce the downsampling result. The purpose of the average pooling layer is to halve the length by downsampling; letting a prediction model produce the result can learn the relation between adjacent instants more flexibly and accurately than simple crude averaging.
In a preferred embodiment of the present invention, in the deconvolution layer of the preset U-shaped timbre conversion model, the upsampling processing is performed on the first intermediate vector by using the binary context prediction model to obtain a second intermediate vector, including:
and in the deconvolution layer of the tone color conversion model with the preset U-shaped structure, the binary context prediction model is adopted to predict a vector at one moment according to the first intermediate vectors at two adjacent moments, so as to obtain a second intermediate vector.
Specifically, in the deconvolution layer of the preset U-shaped timbre conversion model, the binary context prediction model predicts the vector of an inserted instant from the first intermediate vectors at the two adjacent instants, yielding the second intermediate vector. The deconvolution layer is designed to double the length (upsampling), but there is no information for the positions to be added; the crudest remedy is to fill them with zeros and then smooth appropriately by convolution. After the binary context prediction model is added to the operation kernel of the deconvolution layer, each inserted position is instead filled with a genuine context-based prediction whose information content has real meaning and can be traced back to the true statistical distribution seen by the average pooling layer during downsampling. Downsampling and upsampling therefore share one set of binary context prediction models and are trained jointly to complement each other, which has an adversarial-generation effect and makes the preset U-shaped timbre conversion model converge more accurately. A sketch of the shared down/up sampling follows.
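Building on the BinaryContextPredictor sketched earlier, shared-weight down- and up-sampling might look like the following; the interleaving order and the boundary handling on the last frame are assumptions not fixed by the text.

```python
import torch


def downsample(pred, x):
    # x: [batch, T, dim], T even. The pooling layer predicts one vector
    # from each non-overlapping pair of adjacent frames, halving the length.
    return pred(x[:, 0::2, :], x[:, 1::2, :])       # [batch, T/2, dim]


def upsample(pred, h):
    # h: [batch, T/2, dim]. The deconvolution layer predicts the vector to
    # insert after each frame from that frame and its right neighbour
    # (the last frame is paired with itself, a boundary assumption).
    right = torch.cat([h[:, 1:, :], h[:, -1:, :]], dim=1)
    inserted = pred(h, right)                       # [batch, T/2, dim]
    out = torch.stack([h, inserted], dim=2)         # interleave originals, predictions
    return out.reshape(h.size(0), -1, h.size(-1))   # [batch, T, dim]


# Weight sharing: the SAME predictor instance serves both directions, e.g.
# pred = BinaryContextPredictor(dim=64)
# x = torch.randn(8, 128, 64)
# assert upsample(pred, downsample(pred, x)).shape == x.shape
```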
It should be noted that, for simplicity of description, the method embodiments are shown as a series of acts, but it should be understood by those skilled in the art that the embodiments are not limited by the order of acts, as some steps may occur in other orders or concurrently in accordance with the embodiments. Further, those skilled in the art will appreciate that the embodiments described in the specification are presently preferred embodiments, and that the acts are not necessarily required by the embodiments of the invention.
Referring to FIG. 4, a block diagram of an embodiment of a timbre conversion apparatus according to the present invention is shown, which may specifically include the following modules:
a voice acquisition module 401, configured to acquire the voice to be converted;
a feature parameter extracting module 402, configured to extract a plurality of feature parameters of the speech to be converted;
a feature parameter combination module 403, configured to combine the multiple feature parameters to obtain a feature vector;
A timbre conversion module 404, configured to perform timbre conversion on the feature vector to obtain a target feature parameter;
and the sounding processing module 405 is configured to perform sounding processing by using the target feature parameters to obtain target voice.
In a preferred embodiment of the invention, the characteristic parameters include a first spectral parameter, a fundamental frequency parameter and an aperiodic component parameter; the feature parameter combination module 403 includes:
The spectrum parameter extraction sub-module is used for extracting acoustic characteristics of the first spectrum parameters to obtain second spectrum parameters, and the second spectrum parameters correspond to sounding contents of the voice to be converted;
And the characteristic parameter combination sub-module is used for combining the second spectrum parameter, the fundamental frequency parameter and the aperiodic component parameter to obtain a characteristic vector.
In a preferred embodiment of the present invention, the timbre conversion module 404 includes:
And the first tone conversion sub-module is used for tone conversion of the feature vector to obtain a target spectrum parameter, a target fundamental frequency parameter and a target aperiodic component parameter.
In a preferred embodiment of the present invention, the sound processing module 405 includes:
and the sounding processing sub-module is used for inputting the target spectrum parameter, the target fundamental frequency parameter and the target aperiodic component parameter into a preset vocoder to perform sounding processing to obtain target voice.
In a preferred embodiment of the present invention, the timbre conversion module 404 includes:
And the second tone conversion sub-module is used for performing tone conversion on the feature vector by adopting a preset U-shaped tone conversion model to obtain a target feature parameter.
In a preferred embodiment of the present invention, the timbre conversion model of the preset U-shaped structure includes a pooling layer and a deconvolution layer, wherein an operation core of the pooling layer includes a binary context prediction model, and an operation core of the deconvolution layer includes a binary context prediction model.
In a preferred embodiment of the present invention, the second timbre conversion sub-module includes:
the downsampling processing unit is used for performing downsampling processing on the feature vector by adopting a binary context prediction model in a pooling layer of the tone conversion model of the preset U-shaped structure to obtain a first intermediate vector;
The up-sampling processing unit is used for up-sampling the first intermediate vector by adopting the binary context prediction model in a deconvolution layer of the tone conversion model with the preset U-shaped structure to obtain a second intermediate vector;
And the conversion unit is used for converting the second intermediate vector to obtain target characteristic parameters.
In a preferred embodiment of the present invention, the downsampling processing unit includes:
and the downsampling processing subunit is used for predicting a vector at one moment according to the feature vectors at two adjacent moments by adopting a binary context prediction model in a pooling layer of the tone conversion model with the preset U-shaped structure to obtain a first intermediate vector.
In a preferred embodiment of the present invention, the upsampling processing unit comprises:
and the up-sampling processing subunit is used for predicting a vector at one moment according to the first intermediate vectors at two adjacent moments by adopting the binary context prediction model in the deconvolution layer of the tone conversion model with the preset U-shaped structure to obtain a second intermediate vector.
In a preferred embodiment of the invention, the weights of the pooling layer and the deconvolution layer are shared.
For the device embodiments, since they are substantially similar to the method embodiments, the description is relatively simple, and reference is made to the description of the method embodiments for relevant points.
The embodiment of the invention provides electronic equipment, which comprises:
One or more processors; and one or more machine readable media having instructions stored thereon, which when executed by the one or more processors, cause the electronic device to perform the method of any of the embodiments of the present invention.
Embodiments of the present invention disclose a computer-readable storage medium having instructions stored thereon, which when executed by one or more processors, cause the processors to perform a method according to any of the embodiments of the present invention.
In this specification, each embodiment is described in a progressive manner, and each embodiment is mainly described by differences from other embodiments, and identical and similar parts between the embodiments are all enough to be referred to each other.
It will be apparent to those skilled in the art that embodiments of the present invention may be provided as a method, apparatus, or computer program product. Accordingly, embodiments of the present invention may take the form of an entirely hardware embodiment, an entirely software embodiment or an embodiment combining software and hardware aspects. Furthermore, embodiments of the invention may take the form of a computer program product on one or more computer-usable storage media (including, but not limited to, disk storage, CD-ROM, optical storage, etc.) having computer-usable program code embodied therein.
Embodiments of the present invention are described with reference to flowchart illustrations and/or block diagrams of methods, terminal devices (systems), and computer program products according to embodiments of the invention. It will be understood that each flow and/or block of the flowchart illustrations and/or block diagrams, and combinations of flows and/or blocks in the flowchart illustrations and/or block diagrams, can be implemented by computer program instructions. These computer program instructions may be provided to a processor of a general purpose computer, special purpose computer, embedded processor, or other programmable data processing terminal device to produce a machine, such that the instructions, which execute via the processor of the computer or other programmable data processing terminal device, create means for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.
These computer program instructions may also be stored in a computer-readable memory that can direct a computer or other programmable data processing apparatus to function in a particular manner, such that the instructions stored in the computer-readable memory produce an article of manufacture including instruction means which implement the function specified in the flowchart flow or flows and/or block diagram block or blocks.
These computer program instructions may also be loaded onto a computer or other programmable data processing apparatus to cause a series of operational steps to be performed on the computer or other programmable apparatus to produce a computer implemented process such that the instructions which execute on the computer or other programmable apparatus provide steps for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.
While preferred embodiments of the present invention have been described, additional variations and modifications in those embodiments may occur to those skilled in the art once they learn of the basic inventive concepts. It is therefore intended that the following claims be interpreted as including the preferred embodiment and all such alterations and modifications as fall within the scope of the embodiments of the invention.
Finally, it is further noted that relational terms such as first and second are used solely to distinguish one entity or action from another and do not necessarily require or imply any actual such relationship or order between such entities or actions. Moreover, the terms "comprises," "comprising," and any other variation thereof are intended to cover a non-exclusive inclusion, such that a process, method, article, or terminal device that comprises a list of elements includes not only those elements but may include other elements not expressly listed or inherent to such process, method, article, or terminal device. Without further limitation, an element preceded by the phrase "comprising a ..." does not exclude the presence of other like elements in the process, method, article, or terminal device that comprises it.
The foregoing has described in detail a tone color conversion method and a tone color conversion apparatus according to the present invention, using specific examples to illustrate the principles and embodiments of the invention; the above examples are provided only to aid understanding of the method and its core idea. Meanwhile, those skilled in the art may make changes to the specific embodiments and the application scope in accordance with the ideas of the present invention; in summary, the contents of this description should not be construed as limiting the present invention.

Claims (10)

1. A tone color conversion method, comprising:
Acquiring voice to be converted;
Extracting various characteristic parameters of the voice to be converted;
Combining the plurality of characteristic parameters to obtain a characteristic vector;
Performing tone color conversion on the feature vector to obtain a target feature parameter;
Performing sounding processing by adopting the target characteristic parameters to obtain target voice;
The method further comprises the steps of:
Performing tone color conversion on the feature vector by adopting a preset U-shaped tone color conversion model to obtain a target feature parameter; the preset U-shaped tone color conversion model is of a preset UFANS structure; the UFANS structure is a deep neural network structure for one-dimensional sequence modeling tasks, in which the input size is halved recursively by round-by-round downsampling, the result of each round is doubled in size by deconvolution, and that result is added as a residual to the input of the round;
The tone color conversion model of the preset U-shaped structure comprises a pooling layer and a deconvolution layer, wherein an operation core of the pooling layer comprises a binary context prediction model, and an operation core of the deconvolution layer comprises a binary context prediction model; the binary context prediction model is a 2-gram prediction model;
In a pooling layer of a tone color conversion model of the preset U-shaped structure, performing downsampling processing on the feature vector by adopting a binary context prediction model to obtain a first intermediate vector;
in a deconvolution layer of the tone color conversion model with the preset U-shaped structure, the binary context prediction model is adopted to carry out up-sampling processing on the first intermediate vector to obtain a second intermediate vector;
And converting the second intermediate vector to obtain target characteristic parameters.
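Illustrative sketch (not part of the claims): the recursive halving, deconvolution-based doubling and residual addition recited above can be outlined in a few lines of Python/NumPy. The names downsample, upsample and ufans_like are hypothetical stand-ins, and the trained pooling and deconvolution kernels of the actual model are replaced here by fixed averaging and frame repetition so that the sketch stays self-contained and runnable; it demonstrates only the U-shaped data flow, not the patented network.

import numpy as np

def downsample(x):
    # halve the time axis (T, D) -> (T // 2, D) by averaging adjacent frames
    T = x.shape[0] - (x.shape[0] % 2)        # drop a trailing odd frame, if any
    return x[:T].reshape(T // 2, 2, -1).mean(axis=1)

def upsample(x):
    # double the time axis (T, D) -> (2 * T, D); a stand-in for deconvolution
    return np.repeat(x, 2, axis=0)

def ufans_like(x, depth=3):
    # recursively halve the input, then double each round's result and add it
    # back to that round's input as a residual (the U-shaped scheme above)
    if depth == 0 or x.shape[0] < 2:
        return x
    inner = ufans_like(downsample(x), depth - 1)
    up = upsample(inner)
    up = np.vstack([up, x[up.shape[0]:]])    # restore any frame dropped above
    return x + up

features = np.random.randn(64, 40)           # 64 frames, 40-dimensional vectors
print(ufans_like(features).shape)            # (64, 40): sequence length preserved

The downsampling path gives the inner rounds a wide receptive field at low cost, which is why such a structure suits one-dimensional sequence modeling tasks like frame-by-frame feature conversion.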
2. The method according to claim 1, wherein the feature parameters comprise a first spectrum parameter, a fundamental frequency parameter and an aperiodic component parameter, and wherein combining the plurality of feature parameters to obtain the feature vector comprises:
performing acoustic feature extraction on the first spectrum parameter to obtain a second spectrum parameter, the second spectrum parameter corresponding to the spoken content of the voice to be converted; and
combining the second spectrum parameter, the fundamental frequency parameter and the aperiodic component parameter to obtain the feature vector.
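Illustrative sketch (not part of the claims): the claim does not name a specific analyzer, but the spectrum / fundamental frequency / aperiodic component triplet matches the outputs of the WORLD vocoder, so the following Python sketch assumes WORLD through the pyworld bindings; the file name source.wav and the 40-dimensional coding size are hypothetical choices.

import numpy as np
import pyworld
import soundfile as sf

x, fs = sf.read("source.wav")               # hypothetical mono input file
x = np.ascontiguousarray(x, dtype=np.float64)

f0, t = pyworld.harvest(x, fs)              # fundamental frequency parameter
sp = pyworld.cheaptrick(x, f0, t, fs)       # first spectrum parameter (envelope)
ap = pyworld.d4c(x, f0, t, fs)              # aperiodic component parameter

# second spectrum parameter: a compact acoustic feature derived from the raw
# envelope; coding it down to 40 dimensions is one possible choice
sp_coded = pyworld.code_spectral_envelope(sp, fs, 40)

# combine the three parameter streams frame by frame into one feature vector
feature_vector = np.hstack([sp_coded, f0[:, None], ap])
print(feature_vector.shape)                 # (n_frames, 40 + 1 + ap_bins)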
3. The method according to claim 2, wherein performing tone color conversion on the feature vector to obtain the target feature parameters comprises:
performing tone color conversion on the feature vector to obtain a target spectrum parameter, a target fundamental frequency parameter and a target aperiodic component parameter.
4. The method according to claim 3, wherein performing voicing processing using the target feature parameters to obtain the target voice comprises:
inputting the target spectrum parameter, the target fundamental frequency parameter and the target aperiodic component parameter into a preset vocoder for voicing processing, so as to obtain the target voice.
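Illustrative sketch (not part of the claims): continuing the WORLD assumption above, the "preset vocoder" step can be outlined as follows; the helper name synthesize_target and the fft_size value (1024, WORLD's default for 16 kHz audio) are hypothetical choices.

import numpy as np
import pyworld

def synthesize_target(sp_coded, f0, ap, fs, fft_size=1024):
    # decode the compact spectrum parameter back to a full spectral envelope,
    # then feed all three target parameters to the vocoder for voicing
    sp = pyworld.decode_spectral_envelope(
        np.ascontiguousarray(sp_coded), fs, fft_size)
    return pyworld.synthesize(np.ascontiguousarray(f0), sp,
                              np.ascontiguousarray(ap), fs)

# usage, with the variables from the previous sketch:
#   y = synthesize_target(sp_coded, f0, ap, fs)
#   soundfile.write("target.wav", y, fs)

Note that fft_size must match the one used at analysis time, since the aperiodicity array is expected to have fft_size / 2 + 1 bins per frame.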
5. The method according to claim 1, wherein performing downsampling processing on the feature vector using the binary context prediction model in the pooling layer of the tone color conversion model of the preset U-shaped structure to obtain the first intermediate vector comprises:
in the pooling layer of the tone color conversion model of the preset U-shaped structure, using the binary context prediction model to predict the vector at one moment from the feature vectors at two adjacent moments, so as to obtain the first intermediate vector.
6. The method according to claim 1, wherein performing upsampling processing on the first intermediate vector using the binary context prediction model in the deconvolution layer of the tone color conversion model of the preset U-shaped structure to obtain the second intermediate vector comprises:
in the deconvolution layer of the tone color conversion model of the preset U-shaped structure, using the binary context prediction model to predict the vector at one moment from the first intermediate vectors at two adjacent moments, so as to obtain the second intermediate vector.
7. The method according to claim 1, wherein the weights of the pooling layer and the deconvolution layer are shared.
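Illustrative sketch (not part of the claims): one plausible reading of the binary (2-gram) context prediction of claims 5 to 7, in Python/NumPy. The matrices Wl and Wr are random stand-ins for trained weights; reusing the same pair in both the pooling and the deconvolution direction illustrates the weight sharing of claim 7. This is an interpretation for illustration, not the patented kernel itself.

import numpy as np

rng = np.random.default_rng(0)
D = 40                                   # feature dimension
Wl = rng.standard_normal((D, D)) * 0.1   # stand-in weight for the left neighbor
Wr = rng.standard_normal((D, D)) * 0.1   # stand-in weight for the right neighbor

def pool_2gram(x):
    # claim 5: predict the vector at one pooled moment from the feature
    # vectors at two adjacent moments, halving the sequence (2T, D) -> (T, D)
    return x[0::2] @ Wl + x[1::2] @ Wr

def deconv_2gram(h):
    # claim 6: predict each output frame from the first intermediate vectors
    # at two adjacent moments, doubling the sequence (T, D) -> (2T, D);
    # claim 7: the same Wl / Wr pair is reused, i.e. the weights are shared
    hp = np.vstack([h, h[-1:]])          # pad by repeating the last vector
    pred = hp[:-1] @ Wl + hp[1:] @ Wr    # one prediction per adjacent pair
    return np.repeat(pred, 2, axis=0)

x = rng.standard_normal((64, D))         # toy feature-vector sequence
h = pool_2gram(x)                        # first intermediate vector, (32, 40)
y = deconv_2gram(h)                      # second intermediate vector, (64, 40)
print(h.shape, y.shape)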
8. A tone color conversion apparatus, comprising:
a voice acquisition module, configured to acquire a voice to be converted;
a feature parameter extraction module, configured to extract a plurality of feature parameters of the voice to be converted;
a feature parameter combination module, configured to combine the plurality of feature parameters to obtain a feature vector;
a tone color conversion module, configured to perform tone color conversion on the feature vector to obtain target feature parameters; and
a voicing processing module, configured to perform voicing processing using the target feature parameters to obtain a target voice;
wherein the tone color conversion module comprises:
a second tone color conversion sub-module, configured to perform tone color conversion on the feature vector using a preset tone color conversion model of a U-shaped structure to obtain the target feature parameters, wherein the preset tone color conversion model of the U-shaped structure has a UFANS structure; the UFANS structure is a deep neural network structure oriented to one-dimensional sequence modeling tasks, in which the input is recursively halved in size by round-by-round downsampling, and the result of each round is doubled in size by deconvolution and added, as a residual, to the input of that round;
the tone color conversion model of the preset U-shaped structure comprises a pooling layer and a deconvolution layer, the operation core of the pooling layer comprising a binary context prediction model and the operation core of the deconvolution layer comprising a binary context prediction model, the binary context prediction model being a 2-gram prediction model;
and wherein the second tone color conversion sub-module comprises:
a downsampling processing unit, configured to perform downsampling processing on the feature vector using the binary context prediction model in the pooling layer of the tone color conversion model of the preset U-shaped structure to obtain a first intermediate vector;
an upsampling processing unit, configured to perform upsampling processing on the first intermediate vector using the binary context prediction model in the deconvolution layer of the tone color conversion model of the preset U-shaped structure to obtain a second intermediate vector; and
a conversion unit, configured to convert the second intermediate vector to obtain the target feature parameters.
9. An electronic device, comprising:
one or more processors; and
one or more machine-readable media having instructions stored thereon which, when executed by the one or more processors, cause the electronic device to perform the method of any one of claims 1 to 7.
10. A computer-readable storage medium having instructions stored thereon which, when executed by one or more processors, cause the processors to perform the method of any one of claims 1 to 7.
CN202010889099.0A 2020-08-28 2020-08-28 Tone color conversion method and device Active CN112216293B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202010889099.0A CN112216293B (en) 2020-08-28 2020-08-28 Tone color conversion method and device


Publications (2)

Publication Number Publication Date
CN112216293A CN112216293A (en) 2021-01-12
CN112216293B (en) 2024-08-02

Family

ID=74058954

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202010889099.0A Active CN112216293B (en) 2020-08-28 2020-08-28 Tone color conversion method and device

Country Status (1)

Country Link
CN (1) CN112216293B (en)

Families Citing this family (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114093387B * 2021-11-19 2024-07-26 Beijing Tiaoyue Intelligent Technology Co Ltd Sound conversion method and system for modeling tone and computer equipment
CN114220456B * 2021-11-29 2025-03-14 Beijing Sinovoice Technology Co Ltd Method, device and electronic device for generating speech synthesis model
CN114283825A (en) * 2021-12-24 2022-04-05 北京达佳互联信息技术有限公司 Voice processing method and device, electronic equipment and storage medium

Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN108876792A (en) * 2018-04-13 2018-11-23 北京迈格威科技有限公司 Semantic segmentation methods, devices and systems and storage medium
CN110930981A (en) * 2018-09-20 2020-03-27 深圳市声希科技有限公司 Many-to-one voice conversion system
CN111261177A (en) * 2020-01-19 2020-06-09 平安科技(深圳)有限公司 Voice conversion method, electronic device and computer readable storage medium

Family Cites Families (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US10176819B2 (en) * 2016-07-11 2019-01-08 The Chinese University Of Hong Kong Phonetic posteriorgrams for many-to-one voice conversion
CN108461079A (en) * 2018-02-02 2018-08-28 福州大学 A kind of song synthetic method towards tone color conversion
US11024025B2 (en) * 2018-03-07 2021-06-01 University Of Virginia Patent Foundation Automatic quantification of cardiac MRI for hypertrophic cardiomyopathy
KR20200084443A (en) * 2018-12-26 2020-07-13 충남대학교산학협력단 System and method for voice conversion
CN110246488B (en) * 2019-06-14 2021-06-25 思必驰科技股份有限公司 Speech conversion method and device for semi-optimized CycleGAN model
CN110910413A (en) * 2019-11-28 2020-03-24 中国人民解放军战略支援部队航天工程大学 A U-Net-based ISAR Image Segmentation Method


Also Published As

Publication number Publication date
CN112216293A (en) 2021-01-12

Similar Documents

Publication Publication Date Title
CN112735373B (en) Speech synthesis method, device, equipment and storage medium
CN113506562B (en) End-to-end voice synthesis method and system based on fusion of acoustic features and text emotional features
EP4447040A1 (en) Speech synthesis model training method, speech synthesis method, and related apparatuses
US20220013106A1 (en) Multi-speaker neural text-to-speech synthesis
CN112489629B (en) Voice transcription model, method, medium and electronic equipment
CN112687259A (en) Speech synthesis method, device and readable storage medium
US20070168189A1 (en) Apparatus and method of processing speech
KR102272554B1 (en) Method and system of text to multiple speech
WO2021123792A1 (en) A Text-to-Speech Synthesis Method and System, a Method of Training a Text-to-Speech Synthesis System, and a Method of Calculating an Expressivity Score
CN112216293B (en) Tone color conversion method and device
CN112002302B (en) Speech synthesis method and device
CN111508470A (en) Training method and device of speech synthesis model
Kumar et al. Towards building text-to-speech systems for the next billion users
CN111326170B (en) Otophone-to-Normal Voice Conversion Method and Device by Joint Time-Frequency Domain Expansion Convolution
Yin et al. Modeling F0 trajectories in hierarchically structured deep neural networks
CN114464162A (en) Speech synthesis method, neural network model training method, and speech synthesis model
CN118737122A (en) Method, apparatus, device and readable medium for speech synthesis
Tobing et al. Voice conversion with cyclic recurrent neural network and fine-tuned WaveNet vocoder
CN118135990A (en) An end-to-end text-to-speech synthesis method and system combining autoregression
CN117636842B (en) Voice synthesis system and method based on prosody emotion migration
CN119132274A (en) Speech synthesis, model training methods, systems, equipment, media and program products
Zhao et al. Research on voice cloning with a few samples
Al-Radhi et al. Deep Recurrent Neural Networks in speech synthesis using a continuous vocoder
CN115985289A (en) End-to-end speech synthesis method and device
CN120599998A (en) A voice cloning method, device and related medium based on emotion enhancement

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant