
CN112185342A - Voice conversion and model training method, device and system and storage medium - Google Patents

Voice conversion and model training method, device and system and storage medium

Info

Publication number
CN112185342A
Authority
CN
China
Prior art keywords
speaker
source
speech
groups
target
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202011054910.XA
Other languages
Chinese (zh)
Inventor
武剑桃
李秀林
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Databaker Beijing Technology Co ltd
Original Assignee
Databaker Beijing Technology Co ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Databaker Beijing Technology Co ltd
Priority to CN202011054910.XA
Publication of CN112185342A
Current legal status: Pending


Classifications

    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L 13/00 Speech synthesis; Text to speech systems
    • G10L 13/02 Methods for producing synthetic speech; Speech synthesisers
    • G10L 13/033 Voice editing, e.g. manipulating the voice of the synthesiser
    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L 13/00 Speech synthesis; Text to speech systems
    • G10L 13/08 Text analysis or generation of parameters for speech synthesis out of text, e.g. grapheme to phoneme translation, prosody generation or stress or intonation determination
    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L 15/00 Speech recognition
    • G10L 15/06 Creation of reference templates; Training of speech recognition systems, e.g. adaptation to the characteristics of the speaker's voice
    • G10L 15/063 Training
    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L 15/00 Speech recognition
    • G10L 15/06 Creation of reference templates; Training of speech recognition systems, e.g. adaptation to the characteristics of the speaker's voice
    • G10L 15/065 Adaptation
    • G10L 15/07 Adaptation to the speaker
    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L 21/00 Speech or voice signal processing techniques to produce another audible or non-audible signal, e.g. visual or tactile, in order to modify its quality or its intelligibility
    • G10L 21/003 Changing voice quality, e.g. pitch or formants
    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L 25/00 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
    • G10L 25/03 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the type of extracted parameters
    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L 25/00 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
    • G10L 25/03 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the type of extracted parameters
    • G10L 25/24 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the type of extracted parameters the extracted parameters being the cepstrum
    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L 25/00 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
    • G10L 25/27 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the analysis technique
    • G10L 25/30 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the analysis technique using neural networks
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00 Computing arrangements based on biological models
    • G06N 3/02 Neural networks
    • G06N 3/04 Architecture, e.g. interconnection topology
    • G06N 3/045 Combinations of networks
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00 Computing arrangements based on biological models
    • G06N 3/02 Neural networks
    • G06N 3/08 Learning methods

Landscapes

  • Engineering & Computer Science (AREA)
  • Computational Linguistics (AREA)
  • Health & Medical Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Human Computer Interaction (AREA)
  • Physics & Mathematics (AREA)
  • Acoustics & Sound (AREA)
  • Multimedia (AREA)
  • Signal Processing (AREA)
  • Artificial Intelligence (AREA)
  • Quality & Reliability (AREA)
  • Evolutionary Computation (AREA)
  • Electrically Operated Instructional Devices (AREA)

Abstract

The invention provides a voice conversion method, device, system and storage medium and a model training method, device, system and storage medium. The voice conversion method comprises the following steps: acquiring N groups of source speech data of a source speaker under N different channels respectively, wherein N is an integer greater than 1; respectively extracting features from each group of source speech data in the N groups of source speech data to obtain N groups of source recognition acoustic features; performing feature combination on the N groups of source recognition acoustic features to obtain the acoustic features of the source speaker; mapping the acoustic features of the source speaker to the acoustic features of a target speaker through a predetermined mapping model; and performing speech synthesis based on the acoustic features of the target speaker to obtain the target speech of the target speaker. Because model training and voice conversion are carried out based on multi-channel speech data, robustness to a noisy environment is higher and inaccurate recognition during voice conversion is reduced.

Description

Voice conversion and model training method, device and system and storage medium
Technical Field
The present invention relates to the field of speech signal processing technologies, and in particular, to a speech conversion method, apparatus, and system, and a storage medium, and a model training method, apparatus, and system, and a storage medium.
Background
In the field of speech signal processing, speech conversion (i.e. voice timbre conversion) is currently an important research direction. Speech conversion aims to modify the timbre of an arbitrary speaker and convert it into the timbre of a fixed speaker while keeping the speech content unchanged. Speech conversion involves front-end signal processing, speech recognition and speech synthesis techniques. Existing speech conversion technology mainly uses single-channel data after front-end signal processing as the speech data for extracting recognition acoustic features and synthesis acoustic features, and trains the network models related to speech conversion on that basis, thereby realizing the whole speech conversion system.
The existing voice conversion technology based on single-channel data has the following disadvantage: when the environment is noisy, serious recognition errors can occur, including wrong tones, wrong characters and the like. For example, the source speech is "wo ai Beijing Tian'anmen" ("I love Beijing Tiananmen"), and in the converted target speech "ai" is pronounced with the wrong (first) tone, which is a tone error. A character error is also possible, for example converting the source speech into a target speech in which "Tian'anmen" is replaced by a wrong word. These errors directly result in a poor listening experience for the finally converted target speech.
Disclosure of Invention
In order to at least partially solve the problems in the prior art, a speech conversion method, a speech conversion device, a speech conversion system, a storage medium, a model training method, a model training device, a model training system, and a storage medium are provided.
According to an aspect of the present invention, there is provided a voice conversion method including: acquiring N groups of source speech data of a source speaker under N different channels respectively, wherein N is an integer greater than 1; respectively extracting the characteristics of each group of source speech data in the N groups of source speech data to obtain N groups of source recognition acoustic characteristics; performing feature combination on the N groups of source recognition acoustic features to obtain acoustic features of a source speaker; mapping the acoustic features of the source speaker to the acoustic features of the target speaker through a predetermined mapping model; and performing voice synthesis based on the acoustic characteristics of the target speaker to obtain the target voice of the target speaker.
Illustratively, obtaining N sets of source speech data for a source speaker in N different channels, respectively, includes: n groups of source speech data of a source speaker acquired by a microphone array are acquired, and the microphone array comprises N microphones with different arrangement orientations, wherein the N microphones correspond to N different channels one by one.
Illustratively, obtaining N sets of source speech data for a source speaker in N different channels, respectively, includes: acquiring M groups of initial source speech data of a source speaker acquired by M microphones, wherein M is an integer greater than or equal to 1; and performing channel conversion operation from M channels to N channels on the M groups of initial source speech data to obtain N groups of source speech data.
Illustratively, performing an M-channel to N-channel transformation operation on M sets of initial source speech data to obtain N sets of source speech data comprises: performing a first channel conversion operation from an M channel to a single channel on the M groups of initial source speech data to obtain a single group of source speech data; and performing single-channel to N-channel second channel conversion operation on the single group of source speech data to obtain N groups of source speech data.
Illustratively, performing a single-channel to N-channel second channel transform operation on a single set of source audio data to obtain N sets of source audio data comprises: simulating the single group of source voice data to obtain a single group of new source voice data; and filtering the single group of new source voice data through N spatial filters which are in one-to-one correspondence with the N different channels respectively to obtain N groups of source voice data.
Illustratively, simulating a single set of source speech data to obtain a single set of new source speech data comprises:
a single set of source speech data is simulated by the following formula:
y3 = s3*h3 + n3
wherein y3 is the single group of new source speech data, s3 is the single group of source speech data, h3 is a third convolution kernel, n3 is third noise, and * denotes convolution.
Illustratively, the predetermined mapping model includes a speech recognition model and a feature mapping model, and mapping the acoustic features of the source speaker to the acoustic features of the target speaker by the predetermined mapping model includes: inputting the acoustic characteristics of the source speaker into the speech recognition model to obtain a speech posterior probability of the source speaker output by the speech recognition model, the speech posterior probability comprising a set of values corresponding to a time range and a speech class range; and inputting the voice posterior probability of the source speaker into the feature mapping model to obtain the acoustic feature of the target speaker output by the feature mapping model.
Illustratively, the speech category range corresponds to a phoneme state range.
Illustratively, the set of values corresponds to a posterior probability for each speech class in the range of speech classes for each time in the range of times, and wherein the speech posterior probability comprises a matrix.
Illustratively, the speech recognition model includes one or more of the following network models: a long and short term memory network model, a convolution neural network model, a time delay neural network model and a deep neural network model; and/or, the feature mapping model comprises one or more of the following network models: tensor to tensor network model, convolutional neural network model, sequence to sequence model, attention model.
Illustratively, the method further comprises: acquiring first training voice data of a sample speaker and second training voice data of a target speaker; carrying out channel conversion operation from a single channel to N channels on the first training voice data to obtain N groups of sample voice data under N different channels respectively; respectively extracting the characteristics of each group of sample voice data in the N groups of sample voice data to obtain N groups of first recognition acoustic characteristics; performing feature combination on the N groups of first recognition acoustic features to obtain recognition acoustic features of the sample speaker; performing channel conversion operation from a single channel to N channels on the second training voice data to obtain N groups of target voice data under N different channels respectively; respectively carrying out feature extraction on each group of target voice data in the N groups of target voice data to obtain N groups of second synthesized acoustic features; performing feature combination on the N groups of second synthesized acoustic features to obtain synthesized acoustic features of the target speaker; and obtaining a predicted synthesized acoustic feature through mapping of a predetermined mapping model based on the recognized acoustic feature of the sample speaker, and training the predetermined mapping model by taking the synthesized acoustic feature of the target speaker as a true value of the predicted synthesized acoustic feature.
Illustratively, the feature extraction of each of the N sets of target speech data to obtain N sets of second synthesized acoustic features includes: respectively carrying out feature extraction on each group of target voice data in the N groups of target voice data to obtain N groups of second identification acoustic features and N groups of second synthesis acoustic features; obtaining the predicted synthesized acoustic features mapped by the predetermined mapping model based on the recognized acoustic features of the sample speakers includes: inputting the recognition acoustic characteristics of the sample speaker into the voice recognition model to obtain the voice posterior probability of the sample speaker output by the voice recognition model; training a voice recognition model based on the voice posterior probability of the sample speaker; performing feature combination on the N groups of second recognition acoustic features to obtain the recognition acoustic features of the target speaker; inputting the recognition acoustic characteristics of the target speaker into the trained voice recognition model to obtain the voice posterior probability of the target speaker; and inputting the voice posterior probability of the target speaker into the feature mapping model to obtain the predicted synthesized acoustic features output by the feature mapping model.
Illustratively, performing a single-channel to N-channel transformation operation on the first training speech data to obtain N sets of sample speech data under N different channels respectively includes: simulating the first training voice data to obtain first new voice data; filtering the first new voice data through N spatial filters which are in one-to-one correspondence with N different channels respectively to obtain N groups of sample voice data; performing a single-channel to N-channel conversion operation on the second training speech data to obtain N sets of target speech data under N different channels, respectively, includes: simulating the second training voice data to obtain second new voice data; and filtering the second new voice data through N spatial filters respectively to obtain N groups of target voice data.
Illustratively, the spatial filter is a cardioid spatial filter.
Illustratively, simulating the first training speech data to obtain the first new speech data comprises:
the first training speech data is simulated by the following formula:
y1 = s1*h1 + n1
wherein y1 is the first new speech data, s1 is the first training speech data, h1 is a first convolution kernel, and n1 is first noise;
simulating the second training speech data to obtain second new speech data comprises:
simulating the second training speech data by:
y2 = s2*h2 + n2
where y2 is the second new speech data, s2 is the second training speech data, h2 is the second convolution kernel, and n2 is the second noise.
Illustratively, the method 100 may further include: randomly selecting a first convolution kernel and/or a second convolution kernel from pre-stored convolution kernels; the first noise and/or the second noise are randomly selected from pre-stored noise, wherein the pre-stored noise comprises one or more of white noise, pink noise and brown noise.
Illustratively, the source speaker is different from the target speaker.
Illustratively, the acoustic features of the source speaker are mel-frequency cepstrum coefficient features, perceptual linear prediction features, filter bank features or constant-Q cepstrum coefficient features, and the acoustic features of the target speaker are mel-frequency cepstrum features, line spectrum pair features after mel frequency, line spectrum pair features based on mel generalized cepstrum analysis or linear prediction coding features.
According to another aspect of the present invention, there is provided a model training method, including: acquiring first training voice data of a sample speaker and second training voice data of a target speaker; performing channel conversion operation from a single channel to N channels on the first training voice data to obtain N groups of sample voice data under N different channels respectively, wherein N is an integer greater than 1; respectively extracting the characteristics of each group of sample voice data in the N groups of sample voice data to obtain N groups of first recognition acoustic characteristics; performing feature combination on the N groups of first recognition acoustic features to obtain recognition acoustic features of the sample speaker; performing channel conversion operation from a single channel to N channels on the second training voice data to obtain N groups of target voice data under N different channels respectively; respectively carrying out feature extraction on each group of target voice data in the N groups of target voice data to obtain N groups of second synthesized acoustic features; performing feature combination on the N groups of second synthesized acoustic features to obtain synthesized acoustic features of the target speaker; and obtaining a predicted synthesized acoustic feature through mapping by a predetermined mapping model based on the recognized acoustic feature of the sample speaker, and training the predetermined mapping model by using the synthesized acoustic feature of the target speaker as a true value of the predicted synthesized acoustic feature, wherein the predetermined mapping model is used for mapping the acoustic feature of the source speaker to the acoustic feature of the target speaker in the process of carrying out voice conversion on any source speaker and the target speaker, so that voice synthesis is carried out by a predetermined synthesizer based on the acoustic feature of the target speaker to obtain the target voice of the target speaker.
Illustratively, the predetermined mapping model includes a speech recognition model and a feature mapping model, and the performing feature extraction on each of the N sets of target speech data to obtain N sets of second synthesized acoustic features includes: respectively carrying out feature extraction on each group of target voice data in the N groups of target voice data to obtain N groups of second identification acoustic features and N groups of second synthesis acoustic features; obtaining the predicted synthesized acoustic features mapped by the predetermined mapping model based on the recognized acoustic features of the sample speakers includes: inputting the recognition acoustic characteristics of the sample speaker into a voice recognition model to obtain a voice posterior probability of the sample speaker output by the voice recognition model, wherein the voice posterior probability comprises a set of values corresponding to a time range and a voice category range; training a voice recognition model based on the voice posterior probability of the sample speaker; performing feature combination on the N groups of second recognition acoustic features to obtain the recognition acoustic features of the target speaker; inputting the recognition acoustic characteristics of the target speaker into the trained voice recognition model to obtain the voice posterior probability of the target speaker; and inputting the voice posterior probability of the target speaker into the feature mapping model to obtain the predicted synthesized acoustic features output by the feature mapping model.
Illustratively, the speech category range corresponds to a phoneme state range.
Illustratively, the set of values corresponds to a posterior probability for each speech class in the range of speech classes for each time in the range of times, and wherein the speech posterior probability comprises a matrix.
Illustratively, the speech recognition model includes one or more of the following network models: a long and short term memory network model, a convolution neural network model, a time delay neural network model and a deep neural network model; and/or, the feature mapping model comprises one or more of the following network models: tensor to tensor network model, convolutional neural network model, sequence to sequence model, attention model.
Illustratively, performing a single-channel to N-channel transformation operation on the first training speech data to obtain N sets of sample speech data under N different channels respectively includes: simulating the first training voice data to obtain first new voice data; filtering the first new voice data through N spatial filters which are in one-to-one correspondence with N different channels respectively to obtain N groups of sample voice data; performing a single-channel to N-channel conversion operation on the second training speech data to obtain N sets of target speech data under N different channels, respectively, includes: simulating the second training voice data to obtain second new voice data; and filtering the second new voice data through N spatial filters respectively to obtain N groups of target voice data.
Illustratively, the spatial filter is a cardioid spatial filter.
Illustratively, simulating the first training speech data to obtain the first new speech data comprises:
the first training speech data is simulated by the following formula:
y1 = s1*h1 + n1
wherein y1 is the first new speech data, s1 is the first training speech data, h1 is a first convolution kernel, and n1 is first noise;
simulating the second training speech data to obtain second new speech data comprises:
simulating the second training speech data by:
y2 = s2*h2 + n2
wherein y2 is the second new speech data, s2 is the second training speech data, h2 is a second convolution kernel, and n2 is second noise.
Illustratively, the method further comprises: randomly selecting a first convolution kernel and/or a second convolution kernel from pre-stored convolution kernels; the first noise and/or the second noise are randomly selected from pre-stored noise, wherein the pre-stored noise comprises one or more of white noise, pink noise and brown noise.
Illustratively, the recognized acoustic features of the sample speaker are mel-frequency cepstrum coefficient features, perceptual linear prediction features, filter bank features or constant-Q cepstrum coefficient features, and the synthesized acoustic features of the target speaker are mel-frequency cepstrum features, line spectrum pair features after mel frequency, line spectrum pair features based on mel generalized cepstrum analysis or linear prediction coding features.
According to another aspect of the present invention, there is provided a voice conversion apparatus including: the system comprises an acquisition module, a processing module and a processing module, wherein the acquisition module is used for acquiring N groups of source speech data of a source speaker under N different channels respectively, wherein N is an integer greater than 1; the extraction module is used for respectively extracting the characteristics of each group of source speech data in the N groups of source speech data to obtain N groups of source recognition acoustic characteristics; the combining module is used for carrying out feature combination on the N groups of source recognition acoustic features so as to obtain the acoustic features of the source speaker; a mapping module for mapping the acoustic features of the source speaker to the acoustic features of the target speaker by a predetermined mapping model; and the synthesis module is used for carrying out voice synthesis based on the acoustic characteristics of the target speaker so as to obtain the target voice of the target speaker.
According to another aspect of the present invention, there is also provided a speech conversion system comprising a processor and a memory, wherein the memory has stored therein computer program instructions for executing the above speech conversion method when executed by the processor.
According to another aspect of the present invention, there is also provided a storage medium having stored thereon program instructions for executing the above-described speech conversion method when executed.
According to another aspect of the present invention, there is also provided a model training apparatus, including: the acquisition module is used for acquiring first training voice data of a sample speaker and second training voice data of a target speaker; the first simulation module is used for carrying out channel conversion operation from a single channel to N channels on the first training voice data so as to obtain N groups of sample voice data under N different channels respectively, wherein N is an integer greater than 1; the first extraction module is used for respectively extracting the characteristics of each group of sample voice data in the N groups of sample voice data to obtain N groups of first recognition acoustic characteristics; the first combination module is used for carrying out feature combination on the N groups of first recognition acoustic features so as to obtain the recognition acoustic features of the sample speaker; the second simulation module is used for carrying out channel conversion operation from a single channel to N channels on the second training voice data so as to obtain N groups of target voice data under N different channels; the second extraction module is used for respectively extracting the characteristics of each group of target voice data in the N groups of target voice data to obtain N groups of second synthesized acoustic characteristics; the second combination module is used for carrying out feature combination on the N groups of second synthesized acoustic features so as to obtain the synthesized acoustic features of the target speaker; and a training module, which is used for obtaining the predicted synthesized acoustic characteristics through mapping of a predetermined mapping model based on the recognized acoustic characteristics of the sample speaker, and training the predetermined mapping model by taking the synthesized acoustic characteristics of the target speaker as the true values of the predicted synthesized acoustic characteristics, wherein the predetermined mapping model is used for mapping the acoustic characteristics of the source speaker to the acoustic characteristics of the target speaker in the process of carrying out voice conversion on any source speaker and the target speaker, so that the predetermined synthesizer carries out voice synthesis based on the acoustic characteristics of the target speaker to obtain the target voice of the target speaker.
According to another aspect of the present invention, there is also provided a model training system comprising a processor and a memory, wherein the memory has stored therein computer program instructions for executing the above model training method when the computer program instructions are executed by the processor.
According to another aspect of the present invention, there is also provided a storage medium having stored thereon program instructions for executing the above-described model training method when executed.
According to the voice conversion method, device, system and storage medium and the model training method, device, system and storage medium of the embodiments of the present invention, in the model training stage the speech data is simulated to generate multi-channel speech data, and the multi-channel speech data is used as the data-processing basis for training the predetermined mapping model required for voice conversion. As a result, the model is more robust to a noisy environment when applied in the actual conversion stage, which further reduces inaccurate recognition during voice conversion. Correspondingly, in the actual conversion stage, multi-channel speech data of the source speaker is obtained and voice conversion is performed based on the multi-channel speech data.
A series of concepts in a simplified form are introduced in the summary of the invention, which is described in further detail in the detailed description section. This summary is not intended to identify key features or essential features of the claimed subject matter, nor is it intended to be used as an aid in determining the scope of the claimed subject matter.
The advantages and features of the present invention are described in detail below with reference to the accompanying drawings.
Drawings
The following drawings of the invention are included to provide a further understanding of the invention. The drawings illustrate embodiments of the invention and, together with the description, serve to explain the principles of the invention. In the drawings:
FIG. 1 shows a schematic flow diagram of a model training method according to one embodiment of the invention;
FIG. 2 shows a flow diagram of a single-channel to N-channel transform operation according to one embodiment of the invention;
FIG. 3 illustrates a flow diagram of feature extraction and merging of speech data under multiple channels, according to one embodiment of the invention;
fig. 4 shows a schematic flow diagram of the application of the model training method to a PPG-based speech conversion scenario according to an embodiment of the invention;
FIG. 5 shows a schematic flow diagram of a method of voice conversion according to one embodiment of the present invention;
FIG. 6 shows a schematic block diagram of a speech conversion apparatus according to an embodiment of the present invention;
FIG. 7 shows a schematic block diagram of a speech conversion system according to one embodiment of the present invention;
FIG. 8 shows a schematic block diagram of a model training apparatus according to one embodiment of the present invention; and
FIG. 9 shows a schematic block diagram of a model training system according to one embodiment of the present invention.
Detailed Description
In the following description, numerous details are provided to provide a thorough understanding of the present invention. One skilled in the art, however, will understand that the following description merely illustrates a preferred embodiment of the invention and that the invention may be practiced without one or more of these details. In other instances, well known features have not been described in detail so as not to obscure the invention.
In order to at least partially solve the technical problem, embodiments of the present invention provide a method, an apparatus, and a system for speech conversion and a storage medium, and a method, an apparatus, and a system for model training and a storage medium.
In a complex acoustic environment, noise comes from all directions and often overlaps with the speech signal in both time and frequency, and, together with the effects of echo and reverberation, it is very difficult to capture relatively pure speech with a single microphone. Multi-channel data generated based on the microphone array principle, however, can fuse the spatio-temporal information of the speech signal, extracting the sound source while suppressing noise. Therefore, in the voice conversion task, multi-channel speech data can be generated from the original speech data through microphone array technology and used to train the network models required for voice conversion, which in theory can enhance the robustness of the models and improve the corresponding recognition rate and conversion effect.
According to the embodiment of the invention, in the model training stage, based on the microphone array principle, multi-channel voice data (which contains information of different directions and contains more abundant original information than voice data of a single channel) is generated based on original training voice data, and the multi-channel voice data is selected as a data processing basis to train the preset mapping model required by voice conversion, so that the model has higher robustness to a noisy environment when applied to the actual conversion stage, and the phenomenon of inaccurate identification in the voice conversion stage is further reduced. Correspondingly, in the actual conversion stage, the multi-channel voice data of the source speaker can be acquired through actual microphone array acquisition or channel conversion and the like, and voice conversion is carried out based on the multi-channel voice data, so that the robustness of the voice conversion to a noisy environment can be improved compared with the voice conversion based on single-channel voice data.
For ease of understanding, an implementation of the model training method according to an embodiment of the present invention will be described below with reference to fig. 1-4. First, FIG. 1 shows a schematic flow diagram of a model training method 100 according to one embodiment of the invention. As shown in FIG. 1, the model training method 100 includes steps S110-S180.
In step S110, first training speech data of a sample speaker and second training speech data of a target speaker are acquired.
The sample speaker and the target speaker may be arbitrary speakers, and the target speaker involved in training the model may or may not coincide with the target speaker in the actual speech conversion. Illustratively, the first training speech data of the sample speaker may come from the TIMIT corpus.
In step S120, a single-channel to N-channel conversion operation is performed on the first training speech data to obtain N groups of sample speech data under N different channels, where N is an integer greater than 1.
Model training often requires a large number of samples (here corresponding to the sample speaker and the target speaker). Collecting voice data under multiple channels for each sample is very difficult: the devices are not easy to arrange, the cost is high, the data volume is large, and collection, storage and transmission are difficult, so directly obtaining multi-channel voice data for each sample is costly. According to the embodiment of the invention, channel expansion can be performed on the voice data of a sample by means of multi-channel data simulation (namely, channel transformation from a single channel to N channels), so that voice data of the sample under a plurality of different channels can be obtained.
Illustratively, a multi-channel data simulation may include two parts: simulation data generation and spatial filter filtering (generating multiple channels).
For example, the first training speech data may be simulated by the following formula:
y1 = s1*h1 + n1
wherein y1 is the first new speech data, s1 is the first training speech data, h1 is the first convolution kernel, and n1 is the first noise.
Simulating the first training speech data may be understood as transforming the first training speech data to generate new speech data; the transformation may include, for example, adding noise.
The first convolution kernel may be set to any suitable convolution kernel as necessary. For example, the first convolution kernel may be fixed, preset. For another example, the first convolution kernel may be selected from a plurality of preset convolution kernels, and the selection may be a random selection, or the like. The first noise may be set as needed, and may be fixed, preset, or selected from a plurality of preset noises, and the selection may be random, for example.
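As an illustration only, the simulation step can be sketched in a few lines of Python, assuming the formula y1 = s1*h1 + n1 above (convolution of the clean speech with a room-impulse-response-like kernel plus additive noise); the toy kernel and noise below are placeholders, not the pre-stored kernels or noises of an actual implementation.

    import numpy as np

    def simulate(s1: np.ndarray, h1: np.ndarray, n1: np.ndarray) -> np.ndarray:
        """Convolve clean speech s1 with kernel h1 and add noise n1 (y1 = s1*h1 + n1)."""
        y1 = np.convolve(s1, h1, mode="full")[: len(s1)]  # truncate to original length
        return y1 + n1[: len(y1)]

    # Toy example with synthetic signals
    rng = np.random.default_rng(0)
    s1 = rng.standard_normal(16000)            # 1 second of "speech" at 16 kHz
    h1 = np.exp(-np.arange(256) / 64.0)        # decaying, reverberation-like kernel
    n1 = 0.01 * rng.standard_normal(16000)     # low-level white noise
    y1 = simulate(s1, h1, n1)                  # first new speech data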
Illustratively, the first new speech data obtained in the previous step may be filtered by spatial filters to generate multi-channel speech data. Fig. 2 shows a flow diagram of a single-channel to N-channel transform operation according to one embodiment of the invention. In Fig. 2, the original speech data may include the first training speech data and the second training speech data described herein, and the simulated speech data may include the first new speech data and the second new speech data; Fig. 2 does not distinguish the single-channel to N-channel transform applied to the first training speech data from that applied to the second training speech data, but shows a unified simulation flow for both. For example, in Fig. 2, y may be y1 or y2 as described herein, h may be h1 or h2, s may be s1 or s2, and n may be n1 or n2. Further, in Fig. 2, f1, f2, ..., fN represent the spatial filters, and y1, y2, ..., yN represent the simulated speech data of y under channels 1, 2, ..., N respectively.
Illustratively, the spatial filters f1, f2, ..., fN may be filters in one-to-one correspondence with the N different placement orientations, and each spatial filter may be any suitable form of filter. For example, a spatial filter may be a linear filter, a nonlinear filter, or the like. Linear filters may include rectangular averaging filters, circular averaging filters, Gaussian low-pass filters, Laplacian filters, Prewitt filters, Sobel filters, unsharp filters, and the like. Nonlinear filters may include median filters, maximum filters, minimum filters, and the like. In one example, the spatial filter may be a cardioid spatial filter.
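As a rough illustration, and assuming each of the N spatial filters can be approximated by a short FIR impulse response (real cardioid spatial filters depend on the array geometry and are not specified here), the single-channel to N-channel filtering step might look like the following sketch:

    import numpy as np
    from scipy.signal import lfilter

    def to_n_channels(y: np.ndarray, firs: list[np.ndarray]) -> np.ndarray:
        """Filter simulated speech y with N FIR filters f1..fN -> array of shape (N, len(y))."""
        return np.stack([lfilter(fir, [1.0], y) for fir in firs])

    rng = np.random.default_rng(1)
    y = rng.standard_normal(16000)                                       # stands in for simulated speech data
    firs = [rng.standard_normal(32) * np.hanning(32) for _ in range(4)]  # toy stand-ins for spatial filters
    multichannel = to_n_channels(y, firs)                                # N groups of speech data, one per channel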
It should be noted that the steps included in the above single-channel to N-channel transform operation are only examples and not limitations of the present invention, and the present invention is not limited to the above implementation. For example, the step of simulating the first training speech data may further comprise performing a volume-up or volume-down operation on the first training speech data s1 or the first new speech data y1.
In step S130, feature extraction is performed on each of the N groups of sample voice data, respectively, to obtain N groups of first recognition acoustic features.
The simulated speech data of the N channels (namely the N groups of sample speech data) are subjected to feature extraction, and then feature combination is carried out. FIG. 3 illustrates a flow diagram for feature extraction and merging of speech data under multiple channels, according to one embodiment of the invention. For the purpose of distinction, in the present invention, the acoustic features obtained by feature extraction and combination may be referred to as recognized acoustic features (similar to the acoustic features used in conventional speech recognition technology), and the acoustic features input into the synthesizer for speech synthesis may be referred to as synthesized acoustic features (similar to the acoustic features used in conventional speech synthesis technology).
The feature extraction described herein may be implemented using any existing or future feature extraction method that may occur, and may be considered part of speech recognition. The acoustic features extracted here (i.e., the first recognized acoustic features) may be mel-frequency cepstral coefficient features (MFCCs) or the like.
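For illustration, per-channel extraction of MFCC recognition acoustic features could be sketched as follows; librosa is used here only as one common choice of toolkit, and the sampling rate and number of coefficients are assumed values:

    import numpy as np
    import librosa

    def extract_mfcc(wave: np.ndarray, sr: int = 16000, n_mfcc: int = 13) -> np.ndarray:
        """Return an (n_mfcc, frames) MFCC matrix for one channel of speech."""
        return librosa.feature.mfcc(y=wave.astype(np.float32), sr=sr, n_mfcc=n_mfcc)

    # N groups of sample speech data -> N groups of first recognized acoustic features
    rng = np.random.default_rng(2)
    channels = [rng.standard_normal(16000) for _ in range(4)]   # toy 4-channel data
    features = [extract_mfcc(ch) for ch in channels]            # list of (13, frames) arrays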
In step S140, N sets of first recognized acoustic features are feature-combined to obtain recognized acoustic features of the sample speaker.
In addition, the feature combination described herein may be implemented by any existing or future feature combination method, for example, by means of feature concatenation or addition of corresponding elements.
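A minimal sketch of the two merging options mentioned above (feature concatenation, or addition of corresponding elements), assuming each per-channel feature matrix has shape (n_features, frames):

    import numpy as np

    def merge_by_concat(features: list[np.ndarray]) -> np.ndarray:
        """Concatenate N per-channel feature matrices into one (N * n_features, frames) matrix."""
        return np.concatenate(features, axis=0)

    def merge_by_sum(features: list[np.ndarray]) -> np.ndarray:
        """Element-wise addition of corresponding features across the N channels."""
        return np.sum(np.stack(features), axis=0)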
In step S150, a single-channel to N-channel conversion operation is performed on the second training speech data to obtain N sets of target speech data under N different channels, respectively.
In step S160, feature extraction is performed on each of the N sets of target speech data to obtain N sets of second synthesized acoustic features.
In step S170, N sets of second synthesized acoustic features are feature-combined to obtain synthesized acoustic features of the target speaker.
The operation modes of the channel transformation operation, the feature extraction and the feature combination of the second training speech data are consistent with the corresponding steps of the first training speech data, i.e. steps S150 to S170 can be understood with reference to steps S120 to S140, respectively, and are not described herein again.
In step S180, a predicted synthesized acoustic feature is obtained through mapping by a predetermined mapping model based on the recognized acoustic feature of the sample speaker, and the predetermined mapping model is trained using the synthesized acoustic feature of the target speaker as the true value of the predicted synthesized acoustic feature. The predetermined mapping model is used for mapping the acoustic features of the source speaker to the acoustic features of the target speaker in the process of performing voice conversion between any source speaker and the target speaker, so that a predetermined synthesizer performs voice synthesis based on the acoustic features of the target speaker to obtain the target voice of the target speaker.
In an actual speech conversion stage, the acoustic features of the source speaker may be first extracted, and the acoustic features of the source speaker (i.e., the recognized acoustic features of the source speaker) are mapped to the acoustic features of the target speaker (i.e., the synthesized acoustic features of the target speaker) by a predetermined mapping model, and then the acoustic features of the target speaker are input to a predetermined synthesizer for speech synthesis to obtain the target speech of the target speaker.
The predetermined mapping model may be any suitable network model that is capable of mapping the acoustic features of the input certain speech to the acoustic features of the target speech. By way of example and not limitation, the predetermined mapping model may include a speech recognition model and a feature mapping model, wherein the speech recognition model may include one or more of the following network models: a long short term memory network model (LSTM), a convolutional neural network model (CNN), a time delay neural network model (TDNN) and a deep neural network model (DNN); and/or, the feature mapping model may include one or more of the following network models: tensor-to-tensor network model (T2T), CNN, sequence-to-sequence model (Seq2Seq), attention model (attention). For example, the feature mapping model may be a two-way long short term memory network model (DBLSTM).
For example, the predicted synthesized acoustic feature, obtained by mapping the recognized acoustic feature of the sample speaker through the predetermined mapping model, may be used as the predicted value and the synthesized acoustic feature of the target speaker as the true value to construct a cost function, and the predetermined mapping model may be trained by minimizing the cost function until the training result satisfies the requirement. This training process is understood by those skilled in the art and will not be described in detail herein.
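As a sketch only, a bidirectional-LSTM feature mapping model and a mean-squared-error cost minimized by gradient descent could look like the following; PyTorch, the layer sizes, and the random tensors standing in for real PPGs and MCEPs are all assumptions rather than choices prescribed by the method itself:

    import torch
    import torch.nn as nn

    class FeatureMapper(nn.Module):
        """Bidirectional LSTM mapping a PPG sequence to synthesized acoustic features."""
        def __init__(self, ppg_dim: int = 128, mcep_dim: int = 40, hidden: int = 256):
            super().__init__()
            self.blstm = nn.LSTM(ppg_dim, hidden, num_layers=2,
                                 batch_first=True, bidirectional=True)
            self.proj = nn.Linear(2 * hidden, mcep_dim)

        def forward(self, ppg: torch.Tensor) -> torch.Tensor:
            out, _ = self.blstm(ppg)          # (batch, frames, 2 * hidden)
            return self.proj(out)             # (batch, frames, mcep_dim)

    model = FeatureMapper()
    optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
    loss_fn = nn.MSELoss()

    ppg = torch.randn(8, 200, 128)            # stand-in PPGs: 8 utterances, 200 frames
    target_mcep = torch.randn(8, 200, 40)     # stand-in true synthesized acoustic features

    for step in range(100):                   # toy training loop
        optimizer.zero_grad()
        loss = loss_fn(model(ppg), target_mcep)   # cost between prediction and true value
        loss.backward()
        optimizer.step()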
In this way, it is possible to generate multi-channel speech data based on speech data of a sample speaker and a target speaker according to the microphone array principle, and train a predetermined mapping model required for speech conversion based on the multi-channel speech data. As mentioned above, the model obtained by training in this way has higher robustness to a noisy environment when applied to an actual conversion stage, and further, the phenomenon of inaccurate recognition during voice conversion can be effectively reduced.
According to an embodiment of the present invention, the predetermined mapping model may include a speech recognition model and a feature mapping model, and performing feature extraction on each of the N sets of target speech data to obtain N sets of second synthesized acoustic features (step S160) may include: respectively carrying out feature extraction on each group of target voice data in the N groups of target voice data to obtain N groups of second identification acoustic features and N groups of second synthesis acoustic features; obtaining the predicted synthesized acoustic features mapped by the predetermined mapping model based on the recognized acoustic features of the sample speakers may include: inputting the recognition acoustic characteristics of the sample speaker into a voice recognition model to obtain a voice posterior probability of the sample speaker output by the voice recognition model, wherein the voice posterior probability comprises a set of values corresponding to a time range and a voice category range; training a voice recognition model based on the voice posterior probability of the sample speaker; performing feature combination on the N groups of second recognition acoustic features to obtain the recognition acoustic features of the target speaker; inputting the recognition acoustic characteristics of the target speaker into the trained voice recognition model to obtain the voice posterior probability of the target speaker; and inputting the voice posterior probability of the target speaker into the feature mapping model to obtain the predicted synthesized acoustic features output by the feature mapping model.
The model training method 100 described herein may be applied to a speech conversion scenario based on the speech posterior probability (phonetic posteriorgram, PPG), which utilizes non-parallel training data. Fig. 4 shows a schematic flow diagram of the application of the model training method 100 to a PPG-based speech conversion scenario according to an embodiment of the present invention. As shown in fig. 4, the whole procedure of PPG-based model training and speech conversion can be divided into three phases: a first training phase (labeled "training phase 1"), a second training phase (labeled "training phase 2"), and a conversion phase. The first training phase and the second training phase correspond to the execution of the model training method 100, and the conversion phase refers to the actual conversion performed when speech conversion is carried out after model training is completed.
In the first training stage, the speech of the sample speaker (i.e. the first training speech data) is subjected to the steps of channel conversion operation from single channel to N channel, feature extraction and feature combination to obtain combined acoustic features (i.e. the recognized acoustic features of the sample speaker). The implementation of these steps is described above and will not be described here. In fig. 4, the first recognized acoustic feature obtained after feature extraction and the recognized acoustic feature of the sample speaker obtained after feature merging may be MFCCs, but this is only an example and not a limitation of the present invention.
Similarly, in the second training stage, the speech of the target speaker (i.e. the second training speech data) is subjected to the steps of channel conversion operation from single channel to N channel, feature extraction and feature combination to obtain combined acoustic features. The implementation of these steps is described above and will not be described here. In the feature extraction step in the second training stage, in addition to extracting the second synthesized acoustic feature, a second recognition acoustic feature may be extracted. Accordingly, in the feature combination step, besides the synthesized acoustic features of the target speaker after combination, the recognized acoustic features of the target speaker can be obtained through combination. In the embodiment shown in fig. 4, the second synthesized acoustic feature and the synthesized acoustic feature of the target speaker may be mel-frequency cepstral features (MCEPs), and the second recognized acoustic feature and the recognized acoustic feature of the target speaker may be MFCCs, but this is only an example and not a limitation of the present invention.
Methods using non-parallel data training and readily available PPG perform better than parallel data training methods. PPG is a matrix of time-versus-class that represents the posterior probability of each speech class for each particular time frame of an utterance. Alternatively, the PPG may be generated by employing a speaker-independent automatic speech recognition (SI-ASR) model for mapping speaker differences. The mapping between the obtained PPG and the corresponding acoustic features of the target speaker can then be modeled using a DBLSTM model. In FIG. 4, the speech recognition model may be the SI-ASR model and the feature mapping model may be the DBLSTM model, which are examples only and are not limiting of the invention.
The speech recognition model (SI-ASR model) may be trained first in a first training stage, and after training, the trained speech recognition model may be used to process the recognized acoustic features of the target speaker in a second training stage to obtain the PPG of the target speaker. In a second training phase, a feature mapping model (DBLSTM model) may be trained with the PPG of the target speaker and the synthesized acoustic features of the target speaker during the training phase. Then, in the conversion stage, the trained speech recognition model may be used to obtain the PPG of the source speaker, and the PPG is input into the trained feature mapping model to obtain the synthesized acoustic features of the target speaker in the conversion stage, and then speech synthesis is performed by the synthesizer.
PPG is a matrix of time-versus-class that represents the posterior probability of each speech class for each particular time frame of an utterance. The phonetic category may refer to a word, phoneme, or phoneme state (senone). PPG obtained from SI-ASR is the same where the language content/pronunciation of the different speech utterances is the same. In some embodiments, PPG obtained from SI-ASR may represent audible clarity (articulation) of speech data in a speaker normalized space and correspond to speech content independent of the speaker. These PPG are therefore seen as a bridge between the source and target speakers.
The PPG-based model training and speech conversion method has the following advantages. First, parallel training data is not required. Second, no alignment procedure of the speech data of the sample speaker with the speech data of the target speaker is required, which avoids the effect of possible alignment errors. Third, the trained model can be applied to any other source speaker as long as the target speaker is fixed (as in many-to-one speech conversion).
According to an embodiment of the invention, the speech class range corresponds to a phoneme state range. According to an embodiment of the invention, the set of values corresponds to a posterior probability for each speech class in the range of speech classes for each time in the range of times, and wherein the speech posterior probability comprises a matrix.
The significance and expression form of PPG are described above and are not described herein.
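Numerically, a PPG is simply a frames-by-classes matrix whose rows are posterior distributions, as in the following sketch; the softmax over random logits merely stands in for the output of a trained SI-ASR acoustic model, and the frame and class counts are assumed values:

    import numpy as np

    def softmax(x: np.ndarray, axis: int = -1) -> np.ndarray:
        e = np.exp(x - x.max(axis=axis, keepdims=True))
        return e / e.sum(axis=axis, keepdims=True)

    frames, n_classes = 200, 1024                       # e.g. 200 frames, 1024 senone classes
    logits = np.random.default_rng(3).standard_normal((frames, n_classes))
    ppg = softmax(logits)                               # shape (frames, n_classes)
    assert np.allclose(ppg.sum(axis=1), 1.0)            # each row is a probability distribution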
According to an embodiment of the present invention, performing a single-channel to N-channel conversion operation on the first training speech data to obtain N groups of sample speech data under N different channels (step S120) may include: simulating the first training voice data to obtain first new voice data; filtering the first new voice data through N spatial filters which are in one-to-one correspondence with N different channels respectively to obtain N groups of sample voice data; performing a single-channel to N-channel conversion operation on the second training speech data to obtain N sets of target speech data under N different channels (step S150) may include: simulating the second training voice data to obtain second new voice data; and filtering the second new voice data through N spatial filters respectively to obtain N groups of target voice data.
According to an embodiment of the present invention, simulating the first training speech data to obtain the first new speech data includes:
the first training speech data is simulated by the following formula:
y1 = s1*h1 + n1
wherein y1 is the first new speech data, s1 is the first training speech data, h1 is a first convolution kernel, and n1 is first noise;
simulating the second training speech data to obtain second new speech data comprises:
simulating the second training speech data by:
y2 = s2*h2 + n2
wherein y2 is the second new speech data, s2 is the second training speech data, h2 is a second convolution kernel, and n2 is second noise.
The above description has described the embodiment of the channel conversion operation from the single channel to the N channel, and the description thereof is omitted here.
According to an embodiment of the present invention, the method 100 may further include: randomly selecting the first convolution kernel and/or the second convolution kernel from pre-stored convolution kernels; and randomly selecting the first noise and/or the second noise from pre-stored noise, wherein the pre-stored noise comprises one or more of white noise, pink noise, and brown noise. The types of pre-stored noise listed above are merely examples and not limitations of the present invention, which may also include any other suitable noise.
The pre-stored convolution kernels may be designed in advance and may, illustratively, be related to the acquisition environment in which a source speaker participating in actual speech conversion is located, e.g., to the size of the room in which the source speaker is located. Alternatively, one or more convolution kernels may be stored in a database in advance, and when needed, any one of them is selected as the first convolution kernel for channel conversion of the first training speech data or as the second convolution kernel for channel conversion of the second training speech data. The selection may be random or according to preset conditions.
The pre-stored noise may also be designed in advance. Alternatively, one or more kinds of noise may be stored in a database in advance, and any one of them may be selected as the first noise for channel conversion of the first training speech data or as the second noise for channel conversion of the second training speech data when needed.
It will be appreciated that the above steps of selecting a convolution kernel and selecting noise are performed before the channel conversion operation is performed on the corresponding training speech data.
According to an embodiment of the invention, the recognition acoustic feature of the sample speaker is an MFCC feature, a perceptual linear prediction (PLP) feature, a filter bank (FBank) feature, or a constant-Q cepstral coefficient (CQCC) feature, and the synthesized acoustic feature of the target speaker is an MCEP feature, a line spectrum pair (LSP) feature, a Mel-frequency line spectrum pair (Mel-LSP) feature, a line spectrum pair feature based on Mel-generalized cepstral analysis (MGC-LSP), or a linear predictive coding (LPC) feature. The forms of the recognition acoustic features and the synthesized acoustic features can be understood by those skilled in the art and are not described in detail here.
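For reference only, recognition acoustic features such as MFCCs can be extracted with a standard audio library. The Python sketch below uses librosa (an assumed tooling choice, not specified by this disclosure) and leaves synthesis-side features such as MCEP to a vocoder toolkit:

import numpy as np
import librosa

def extract_mfcc(waveform, sr=16000, n_mfcc=13):
    """Recognition acoustic features: (n_frames, n_mfcc) MFCC matrix."""
    mfcc = librosa.feature.mfcc(y=waveform, sr=sr, n_mfcc=n_mfcc)
    return mfcc.T  # librosa returns (n_mfcc, n_frames); transpose to frames-first

# Toy usage with 1 second of low-level noise standing in for speech.
wave = (np.random.randn(16000) * 0.01).astype(np.float32)
feats = extract_mfcc(wave)   # shape (n_frames, 13), n_frames depends on hop length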
According to another aspect of the present invention, a method of voice conversion is provided. FIG. 5 shows a schematic flow diagram of a method 500 of voice conversion according to one embodiment of the present invention. As shown in fig. 5, the speech conversion method 500 includes the following steps S510-S550.
In step S510, N sets of source speech data of a source speaker under N different channels are obtained, where N is an integer greater than 1.
In step S520, feature extraction is performed on each of the N sets of source speech data to obtain N sets of source recognition acoustic features.
In step S530, N sets of source recognition acoustic features are feature-combined to obtain acoustic features of the source speaker.
In step S540, the acoustic features of the source speaker are mapped to the acoustic features of the target speaker by a predetermined mapping model.
In step S550, speech synthesis is performed based on the acoustic features of the target speaker to obtain a target speech of the target speaker.
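A schematic Python sketch of how steps S510-S550 might be chained is given below. The callable objects (extract_features, asr_model, mapping_model, synthesizer) and the concatenation-based feature merging are illustrative assumptions rather than the disclosed implementation:

import numpy as np

def convert(source_channels, extract_features, asr_model, mapping_model, synthesizer):
    """source_channels: list of N 1-D waveforms (one per channel),
    assumed to produce feature matrices with the same number of frames."""
    # S520: per-channel source recognition acoustic features, each (T, D).
    per_channel = [extract_features(x) for x in source_channels]
    # S530: feature merging; concatenation along the feature axis is one possibility.
    source_feats = np.concatenate(per_channel, axis=-1)   # (T, N * D)
    # S540: source features -> PPG -> target speaker's acoustic features.
    ppg = asr_model(source_feats)                          # (T, C)
    target_feats = mapping_model(ppg)                      # (T, D_syn)
    # S550: waveform synthesis from the target speaker's acoustic features.
    return synthesizer(target_feats)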
In the actual speech conversion stage, the speech data of the source speaker can also be processed in a multi-channel manner. Compared with speech conversion based on single-channel speech data, multi-channel speech conversion can improve robustness to noise and improve both the speech recognition rate and the conversion effect.
The predetermined mapping model obtained by the above-described model training method may be applied in the subsequent speech conversion stage. In this case, the number of channels of the source speaker's speech data in the speech conversion stage can be kept consistent with the number of channels used during training. Exemplary schemes for keeping the number of channels consistent are described below.
In one embodiment, obtaining N groups of source speech data of a source speaker under N different channels (step S510) may include: obtaining N groups of source speech data of the source speaker collected by a microphone array, where the microphone array includes N microphones with different placement orientations in one-to-one correspondence with the N different channels.
In a more preferred embodiment, the acquisition environment in which the source speaker participating in actual speech conversion is located is known and fixed, e.g., already set up. That is, the actual acquisition environment may include a microphone array with fixed positions. In this case, when performing the single-channel to N-channel conversion in the above model training stage, modeling and simulation may be performed based on the arrangement of the microphone array used in actual speech conversion; that is, the number of simulated channels is the same as the number of microphones in the actual microphone array, and the speech collection effect of each channel conforms to the speech collection characteristics of the corresponding microphone in the microphone array.
In this scheme, feature extraction and feature merging can be performed directly on each group of source speech data without channel conversion. For feature extraction and feature merging, reference may be made to the above description, which is not repeated here. The predetermined mapping model trained under this scheme closely matches the actual acquisition environment, so the accuracy of speech conversion is relatively high.
In another embodiment, obtaining N groups of source speech data of a source speaker under N different channels respectively comprises: obtaining M groups of initial source speech data of the source speaker collected by M microphones, where M is an integer greater than or equal to 1; and performing an M-channel to N-channel conversion operation on the M groups of initial source speech data to obtain the N groups of source speech data.
The predetermined mapping model obtained by training with the above-described model training method 100 can be applied to any acquisition environment, even a microphone arrangement that does not match the channels simulated during training. For example, if four-channel data (corresponding to four microphones) is simulated during training but only two microphones collect the source speech during actual speech conversion, channel conversion can convert the two-channel speech data into four-channel speech data so that the channels are consistent with those used during training, after which the subsequent speech conversion operations are performed. This scheme is flexible to use, and the model has a wider range of application.
Referring back to fig. 4, the channel conversion operation of the conversion stage is optional and may be omitted when the N groups of source speech data can be acquired directly. Furthermore, it should be understood that in the first training stage, the second training stage, and the conversion stage, the corresponding channel conversion operation is not limited to the embodiment shown in fig. 4; as described above, the channel conversion operation may further include other steps such as adjusting the volume.
Illustratively, performing the M-channel to N-channel conversion operation on the M groups of initial source speech data to obtain the N groups of source speech data includes: performing a first channel conversion operation from M channels to a single channel on the M groups of initial source speech data to obtain a single group of source speech data; and performing a second channel conversion operation from the single channel to N channels on the single group of source speech data to obtain the N groups of source speech data.
The first channel conversion operation from M channels to a single channel may be implemented using any existing or future channel merging technique, which is not described in detail here. The second channel conversion operation from a single channel to N channels may be performed in the same manner as the single-channel to N-channel conversion operation described above for the first training speech data and the second training speech data, and may be understood with reference to that description, which is not repeated here.
Alternatively, one group may be selected from the M groups of initial source speech data as the desired single group of source speech data. That is, performing the M-channel to N-channel conversion operation on the M groups of initial source speech data to obtain the N groups of source speech data may include: selecting the single group of source speech data from the M groups of initial source speech data; and performing the single-channel to N-channel second channel conversion operation on the single group of source speech data to obtain the N groups of source speech data.
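A Python sketch of the two M-channel to N-channel options just described (down-mixing the M groups into one group, or selecting one group) is given below; the helper names, the evenly weighted down-mix, and the per-channel filter callables are assumptions made for illustration only:

import numpy as np

def m_to_single(initial_groups, mode="downmix", index=0):
    """initial_groups: list of M 1-D waveforms from M microphones."""
    if mode == "downmix":
        # First option: merge the M channels into a single group.
        length = min(len(x) for x in initial_groups)
        return np.mean([x[:length] for x in initial_groups], axis=0)
    # Alternative option: pick one of the M groups as the single group.
    return initial_groups[index]

def single_to_n(single_group, channel_filters, simulate):
    """channel_filters: list of N per-channel filter functions;
    simulate: callable implementing y = s * h + n (see the sketch above)."""
    simulated = simulate(single_group)
    return [f(simulated) for f in channel_filters]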
Illustratively, performing the single-channel to N-channel second channel conversion operation on the single group of source speech data to obtain the N groups of source speech data comprises: simulating the single group of source speech data to obtain a single group of new source speech data; and filtering the single group of new source speech data through the N spatial filters in one-to-one correspondence with the N different channels to obtain the N groups of source speech data.
Illustratively, simulating the single group of source speech data to obtain the single group of new source speech data comprises:

simulating the single group of source speech data according to the following formula:

y3 = s3 * h3 + n3

where y3 is the single group of new source speech data, s3 is the single group of source speech data, h3 is a third convolution kernel, n3 is a third noise, and * denotes convolution.
Illustratively, the method 500 may further include: randomly selecting the third convolution kernel from pre-stored convolution kernels; and randomly selecting the third noise from pre-stored noise, wherein the pre-stored noise comprises one or more of white noise, pink noise, and brown noise. The pre-stored convolution kernels may be the same as those used when selecting the first convolution kernel and/or the second convolution kernel, and the pre-stored noise may be the same as that used when selecting the first noise and/or the second noise.
Alternatively, the first convolution kernel, the second convolution kernel, and the third convolution kernel may be implemented by the same convolution kernel, and the first noise, the second noise, and the third noise may also be the same noise. Furthermore, the N spatial filters used when performing the single-channel to N-channel conversion on the first training speech data, the second training speech data, and the single group of source speech data are preferably consistent.
According to an embodiment of the present invention, the predetermined mapping model includes a speech recognition model and a feature mapping model, and mapping the acoustic features of the source speaker to the acoustic features of the target speaker through the predetermined mapping model (step S540) may include: inputting the acoustic characteristics of the source speaker into the speech recognition model to obtain a speech posterior probability of the source speaker output by the speech recognition model, the speech posterior probability comprising a set of values corresponding to a time range and a speech class range; and inputting the voice posterior probability of the source speaker into the feature mapping model to obtain the acoustic feature of the target speaker output by the feature mapping model.
Referring back to fig. 4, in the conversion stage, the acoustic features of the source speaker (i.e., the recognition acoustic features of the source speaker), which may be, for example, MFCCs, may be obtained through feature extraction and feature merging. The MFCCs are then input into the trained speech recognition model to obtain the PPG of the source speaker. Subsequently, the PPG of the source speaker is input into the trained feature mapping model to obtain the acoustic features of the target speaker (i.e., the synthesized acoustic features of the target speaker), which may be, for example, MCEP features. The acoustic features of the target speaker are then input into a synthesizer to obtain the target speech of the target speaker. The synthesizer (i.e., the predetermined synthesizer described herein) may be any suitable speech synthesis model and may be pre-trained. The advantages of PPG-based speech conversion are described above and are not repeated here.
According to an embodiment of the invention, the speech class range corresponds to a phoneme state range.
According to an embodiment of the invention, the set of values corresponds to a posterior probability of each speech class in the speech class range for each time in the time range, and the speech posterior probability comprises a matrix.
According to an embodiment of the invention, the speech recognition model comprises one or more of the following network models: a long short-term memory network model, a convolutional neural network model, a time-delay neural network model, and a deep neural network model; and/or the feature mapping model comprises one or more of the following network models: a tensor-to-tensor network model, a convolutional neural network model, a sequence-to-sequence model, and an attention model.
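Purely as an illustration of one of the listed model families, a bidirectional LSTM feature mapping model could be sketched in PyTorch as follows; the layer sizes and dimensions are arbitrary assumptions, not values from this disclosure:

import torch
import torch.nn as nn

class FeatureMappingModel(nn.Module):
    """Maps a PPG sequence (batch, T, C) to synthesized acoustic features (batch, T, D)."""
    def __init__(self, ppg_dim=128, hidden=256, out_dim=60, num_layers=2):
        super().__init__()
        self.blstm = nn.LSTM(ppg_dim, hidden, num_layers=num_layers,
                             batch_first=True, bidirectional=True)
        self.proj = nn.Linear(2 * hidden, out_dim)

    def forward(self, ppg):            # ppg: (batch, T, ppg_dim)
        h, _ = self.blstm(ppg)         # (batch, T, 2 * hidden)
        return self.proj(h)            # (batch, T, out_dim)

model = FeatureMappingModel()
dummy_ppg = torch.randn(1, 100, 128)
pred = model(dummy_ppg)                # torch.Size([1, 100, 60])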
According to an embodiment of the invention, the method 500 further comprises: acquiring first training voice data of a sample speaker and second training voice data of a target speaker; carrying out channel conversion operation from a single channel to N channels on the first training voice data to obtain N groups of sample voice data under N different channels respectively; respectively extracting the characteristics of each group of sample voice data in the N groups of sample voice data to obtain N groups of first recognition acoustic characteristics; performing feature combination on the N groups of first recognition acoustic features to obtain recognition acoustic features of the sample speaker; performing channel conversion operation from a single channel to N channels on the second training voice data to obtain N groups of target voice data under N different channels respectively; respectively carrying out feature extraction on each group of target voice data in the N groups of target voice data to obtain N groups of second synthesized acoustic features; performing feature combination on the N groups of second synthesized acoustic features to obtain synthesized acoustic features of the target speaker; and obtaining a predicted synthesized acoustic feature through mapping of a predetermined mapping model based on the recognized acoustic feature of the sample speaker, and training the predetermined mapping model by taking the synthesized acoustic feature of the target speaker as a true value of the predicted synthesized acoustic feature.
According to an embodiment of the present invention, performing feature extraction on each group of target speech data in the N groups of target speech data to obtain N groups of second synthesized acoustic features includes: performing feature extraction on each group of target speech data in the N groups of target speech data to obtain N groups of second recognition acoustic features and N groups of second synthesized acoustic features. Obtaining the predicted synthesized acoustic features mapped by the predetermined mapping model based on the recognition acoustic features of the sample speaker includes: inputting the recognition acoustic features of the sample speaker into the speech recognition model to obtain the speech posterior probability of the sample speaker output by the speech recognition model; training the speech recognition model based on the speech posterior probability of the sample speaker; performing feature merging on the N groups of second recognition acoustic features to obtain the recognition acoustic features of the target speaker; inputting the recognition acoustic features of the target speaker into the trained speech recognition model to obtain the speech posterior probability of the target speaker; and inputting the speech posterior probability of the target speaker into the feature mapping model to obtain the predicted synthesized acoustic features output by the feature mapping model.
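A hedged Python sketch of the second-stage training just described follows: the trained speech recognition model is kept frozen to produce the target speaker's PPG, and the feature mapping model is fit to the target speaker's synthesized acoustic features. The MSE loss, optimizer, and batching are assumptions for illustration, not details from the disclosure:

import torch
import torch.nn as nn

def train_mapping_model(mapping_model, asr_model, target_recog_feats,
                        target_synth_feats, epochs=10, lr=1e-3):
    """target_recog_feats: (batch, T, D_recog) recognition acoustic features
       target_synth_feats: (batch, T, D_syn) ground-truth synthesized features."""
    asr_model.eval()                       # first-stage model stays frozen
    optimizer = torch.optim.Adam(mapping_model.parameters(), lr=lr)
    loss_fn = nn.MSELoss()
    for _ in range(epochs):
        with torch.no_grad():
            ppg = asr_model(target_recog_feats)        # PPG of the target speaker
        pred = mapping_model(ppg)                      # predicted synthesized features
        loss = loss_fn(pred, target_synth_feats)       # target features as ground truth
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
    return mapping_model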
According to an embodiment of the present invention, performing a single-channel to N-channel conversion operation on the first training speech data to obtain N groups of sample speech data under N different channels includes: simulating the first training speech data to obtain first new speech data; and filtering the first new speech data through N spatial filters in one-to-one correspondence with the N different channels to obtain the N groups of sample speech data. Performing a single-channel to N-channel conversion operation on the second training speech data to obtain N groups of target speech data under the N different channels includes: simulating the second training speech data to obtain second new speech data; and filtering the second new speech data through the N spatial filters to obtain the N groups of target speech data.
According to an embodiment of the invention, the spatial filter is a cardioid spatial filter.
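As a deliberately simplified illustration (frequency-independent gains, a single far-field source at a known angle, evenly spaced microphone orientations — all assumptions beyond the original text), a cardioid directivity pattern could be applied per channel as follows:

import numpy as np

def cardioid_gain(source_angle_rad, mic_orientation_rad):
    """Classic cardioid directivity: 0.5 * (1 + cos(angle difference))."""
    return 0.5 * (1.0 + np.cos(source_angle_rad - mic_orientation_rad))

def apply_cardioid_filters(signal, source_angle_rad, n_channels=4):
    """Return N channel signals, each scaled by its cardioid gain; the N
    microphone orientations are assumed to be evenly spaced around a circle."""
    orientations = np.linspace(0.0, 2.0 * np.pi, n_channels, endpoint=False)
    return [cardioid_gain(source_angle_rad, o) * signal for o in orientations]

channels = apply_cardioid_filters(np.random.randn(16000), np.pi / 4)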
According to an embodiment of the present invention, simulating the first training speech data to obtain the first new speech data includes:
the first training speech data is simulated according to the following formula:

y1 = s1 * h1 + n1

where y1 is the first new speech data, s1 is the first training speech data, h1 is a first convolution kernel, n1 is a first noise, and * denotes convolution;
simulating the second training speech data to obtain second new speech data comprises:
simulating the second training speech data according to the following formula:

y2 = s2 * h2 + n2

where y2 is the second new speech data, s2 is the second training speech data, h2 is a second convolution kernel, n2 is a second noise, and * denotes convolution.
According to an embodiment of the invention, the method 500 further comprises: randomly selecting the first convolution kernel and/or the second convolution kernel from pre-stored convolution kernels; and randomly selecting the first noise and/or the second noise from pre-stored noise, wherein the pre-stored noise comprises one or more of white noise, pink noise, and brown noise.
According to an embodiment of the invention, the source speaker is different from the target speaker.
According to an embodiment of the invention, the acoustic features of the source speaker are Mel-frequency cepstral coefficient features, perceptual linear prediction features, filter bank features, or constant-Q cepstral coefficient features, and the acoustic features of the target speaker are Mel cepstral features, Mel-frequency line spectrum pair features, line spectrum pair features based on Mel-generalized cepstral analysis, or linear predictive coding features.
The flow of the model training method 100 has been described above. The speech conversion method 500 may include the steps of the model training method 100, and the implementation of those steps may be understood with reference to the above description, which is not repeated here.
In the conversion stage, additional parameters may also be extracted, such as the fundamental frequency information F0 and the aperiodic component AP of each group of source speech data. Further, F0 may be linearly transformed. These additional parameters may be supplied when speech synthesis is performed in the synthesizer. For example, the acoustic features of the target speaker obtained by the mapping may be input into the synthesizer together with the converted F0 and the AP to synthesize the target speech.
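The linear transformation of F0 mentioned here is commonly performed on log-F0 using source and target speaker statistics; the following Python sketch assumes that convention and uses made-up statistics rather than values from the disclosure:

import numpy as np

def convert_f0(f0_src, src_mean, src_std, tgt_mean, tgt_std):
    """Linear transform of log-F0; unvoiced frames (F0 == 0) are left at 0.
    src/tgt statistics are the mean and std of log-F0 for each speaker."""
    f0_out = np.zeros_like(f0_src)
    voiced = f0_src > 0
    log_f0 = np.log(f0_src[voiced])
    f0_out[voiced] = np.exp((log_f0 - src_mean) / src_std * tgt_std + tgt_mean)
    return f0_out

# Toy usage with assumed statistics.
f0 = np.array([0.0, 120.0, 125.0, 0.0, 130.0])
converted = convert_f0(f0, src_mean=np.log(120), src_std=0.2,
                       tgt_mean=np.log(220), tgt_std=0.25)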
According to another aspect of the present invention, a voice conversion apparatus is provided. Fig. 6 shows a schematic block diagram of a speech conversion device 600 according to an embodiment of the invention.
As shown in fig. 6, the voice conversion apparatus 600 according to the embodiment of the present invention includes an obtaining module 610, an extraction module 620, a merging module 630, a mapping module 640, and a synthesis module 650. These modules may respectively perform the steps/functions of the speech conversion method 500 described above in connection with fig. 5. Only the main functions of the components of the speech conversion apparatus 600 are described below; details that have been described above are omitted.
The obtaining module 610 is configured to obtain N sets of source speech data of a source speaker under N different channels, where N is an integer greater than 1.
The extraction module 620 is configured to perform feature extraction on each group of source speech data in the N groups of source speech data, respectively, to obtain N groups of source recognition acoustic features.
The merging module 630 is configured to perform feature merging on the N sets of source recognition acoustic features to obtain acoustic features of the source speaker.
The mapping module 640 is used to map the acoustic features of the source speaker to the acoustic features of the target speaker through a predetermined mapping model.
The synthesis module 650 is used for performing speech synthesis based on the acoustic features of the target speaker to obtain the target speech of the target speaker.
According to another aspect of the present invention, a speech conversion system is provided. FIG. 7 shows a schematic block diagram of a speech conversion system 700 according to one embodiment of the present invention. The speech conversion system 700 includes a processor 710 and a memory 720.
The memory 720 stores computer program instructions for implementing corresponding steps in the speech conversion method 500 according to an embodiment of the present invention.
The processor 710 is configured to execute the computer program instructions stored in the memory 720 to perform the corresponding steps of the speech conversion method 500 according to the embodiment of the present invention.
Illustratively, the speech conversion system 700 may further comprise a microphone array comprising N microphones having different placement orientations in one-to-one correspondence with the N different channels, the computer program instructions when executed by the processor 710 further operable to perform the steps of: n sets of source speech data of a source speaker acquired by a microphone array are acquired.
Illustratively, the speech conversion system 700 may further comprise M microphones, the computer program instructions when executed by the processor 710 further being operable to perform the steps of: acquiring M groups of initial source speech data of a source speaker acquired by M microphones, wherein M is an integer greater than or equal to 1; and performing channel conversion operation from M channels to N channels on the M groups of initial source speech data to obtain N groups of source speech data.
According to another aspect of the present invention, there is provided a storage medium on which program instructions are stored, which when executed by a computer or a processor, are used for executing the corresponding steps of the voice conversion method 500 of the embodiment of the present invention and for implementing the corresponding modules in the voice conversion apparatus 600 according to the embodiment of the present invention. The storage medium may include, for example, a memory card of a smart phone, a storage component of a tablet computer, a hard disk of a personal computer, a Read Only Memory (ROM), an Erasable Programmable Read Only Memory (EPROM), a portable compact disc read only memory (CD-ROM), a USB memory, or any combination of the above storage media.
According to another aspect of the present invention, a model training apparatus is provided. FIG. 8 shows a schematic block diagram of a model training apparatus 800 according to one embodiment of the present invention.
As shown in FIG. 8, the model training apparatus 800 according to an embodiment of the present invention includes an acquisition module 810, a first simulation module 820, a first extraction module 830, a first combining module 840, a second simulation module 850, a second extraction module 860, a second combining module 870, and a training module 880. These modules may respectively perform the steps/functions of the model training method 100 described above in connection with FIG. 1. Only the main functions of the components of the model training apparatus 800 are described below; details that have been described above are omitted.
The obtaining module 810 is configured to obtain first training speech data of a sample speaker and second training speech data of a target speaker.
The first simulation module 820 is configured to perform a channel conversion operation from a single channel to N channels on the first training speech data to obtain N groups of sample speech data under N different channels, where N is an integer greater than 1.
The first extraction module 830 is configured to perform feature extraction on each of the N groups of sample voice data, respectively, to obtain N groups of first recognition acoustic features.
The first combining module 840 is configured to perform feature combining on the N sets of first recognized acoustic features to obtain recognized acoustic features of the sample speaker.
The second simulation module 850 is configured to perform channel conversion from a single channel to N channels on the second training speech data to obtain N sets of target speech data under N different channels, respectively.
The second extraction module 860 is configured to perform feature extraction on each of the N groups of target speech data, respectively, to obtain N groups of second synthesized acoustic features.
the second combining module 870 is configured to feature combine the N sets of second synthesized acoustic features to obtain synthesized acoustic features of the target speaker.
The training module 880 is configured to obtain a predicted synthesized acoustic feature through mapping by a predetermined mapping model based on the recognized acoustic feature of the sample speaker, and train the predetermined mapping model with the synthesized acoustic feature of the target speaker as a true value of the predicted synthesized acoustic feature, where the predetermined mapping model is configured to map the acoustic feature of the source speaker to the acoustic feature of the target speaker in a process of performing speech conversion between any one source speaker and the target speaker, so that the predetermined synthesizer performs speech synthesis based on the acoustic feature of the target speaker to obtain the target speech of the target speaker.
According to another aspect of the present invention, a model training system is provided. FIG. 9 shows a schematic block diagram of a model training system 900 in accordance with one embodiment of the present invention. Model training system 900 includes a processor 910 and a memory 920.
The memory 920 stores computer program instructions for implementing corresponding steps in the model training method 100 according to an embodiment of the present invention.
The processor 910 is configured to execute the computer program instructions stored in the memory 920 to perform the corresponding steps of the model training method 100 according to the embodiment of the present invention.
According to another aspect of the present invention, there is provided a storage medium on which program instructions are stored, which when executed by a computer or a processor are configured to perform the corresponding steps of the model training method 100 according to an embodiment of the present invention, and to implement the corresponding modules in the model training apparatus 800 according to an embodiment of the present invention. The storage medium may include, for example, a memory card of a smart phone, a storage component of a tablet computer, a hard disk of a personal computer, a Read Only Memory (ROM), an Erasable Programmable Read Only Memory (EPROM), a portable compact disc read only memory (CD-ROM), a USB memory, or any combination of the above storage media.
Those of ordinary skill in the art will appreciate that the various illustrative elements and algorithm steps described in connection with the embodiments disclosed herein may be implemented as electronic hardware or combinations of computer software and electronic hardware. Whether such functionality is implemented as hardware or software depends upon the particular application and design constraints imposed on the implementation. Skilled artisans may implement the described functionality in varying ways for each particular application, but such implementation decisions should not be interpreted as causing a departure from the scope of the present invention.
In the several embodiments provided in the present application, it should be understood that the disclosed apparatus and method may be implemented in other ways. For example, the above-described device embodiments are merely illustrative, and for example, the division of the units is only one logical functional division, and other divisions may be realized in practice, for example, a plurality of units or components may be combined or integrated into another device, or some features may be omitted, or not executed.
Similarly, it should be appreciated that in the description of exemplary embodiments of the invention, various features of the invention are sometimes grouped together in a single embodiment, figure, or description thereof for the purpose of streamlining the invention and aiding in the understanding of one or more of the various inventive aspects. However, the method of the present invention should not be construed to reflect the intent: that the invention as claimed requires more features than are expressly recited in each claim. Rather, as the following claims reflect, inventive aspects lie in less than all features of a single disclosed embodiment. Thus, the claims following the detailed description are hereby expressly incorporated into this detailed description, with each claim standing on its own as a separate embodiment of this invention.
The various component embodiments of the invention may be implemented in hardware, or in software modules running on one or more processors, or in a combination thereof. Those skilled in the art will appreciate that a microprocessor or Digital Signal Processor (DSP) may be used in practice to implement some or all of the functionality of some of the modules in a model training or speech conversion system according to embodiments of the present invention. The present invention may also be embodied as apparatus programs (e.g., computer programs and computer program products) for performing a portion or all of the methods described herein. Such programs implementing the present invention may be stored on computer-readable media or may be in the form of one or more signals. Such a signal may be downloaded from an internet website or provided on a carrier signal or in any other form.
It should be noted that the above-mentioned embodiments illustrate rather than limit the invention, and that those skilled in the art will be able to design alternative embodiments without departing from the scope of the appended claims. In the claims, any reference signs placed between parentheses shall not be construed as limiting the claim. The word "comprising" does not exclude the presence of elements or steps not listed in a claim. The word "a" or "an" preceding an element does not exclude the presence of a plurality of such elements. The invention may be implemented by means of hardware comprising several distinct elements and by means of a suitably programmed computer. In the unit claims enumerating several means, several of these means may be embodied by one and the same item of hardware. The use of the words first, second, third, etc. does not indicate any ordering; these words may be interpreted as names.
The above description covers only specific embodiments of the present invention, and the protection scope of the present invention is not limited thereto. Any changes or substitutions that a person skilled in the art can easily conceive within the technical scope disclosed by the present invention shall be covered by the protection scope of the present invention. The protection scope of the present invention shall be subject to the protection scope of the claims.

Claims (10)

1. A method of speech conversion, comprising:
acquiring N groups of source speech data of a source speaker under N different channels respectively, wherein N is an integer greater than 1;
performing feature extraction on each group of source speech data in the N groups of source speech data respectively to obtain N groups of source recognition acoustic features;

performing feature merging on the N groups of source recognition acoustic features to obtain acoustic features of the source speaker;

mapping the acoustic features of the source speaker to acoustic features of a target speaker through a predetermined mapping model; and
performing speech synthesis based on the acoustic features of the target speaker to obtain a target speech of the target speaker.
2. The speech conversion method of claim 1, wherein said obtaining N groups of source speech data of a source speaker under N different channels respectively comprises:

acquiring the N groups of source speech data of the source speaker collected by a microphone array, the microphone array comprising N microphones with different placement orientations in one-to-one correspondence with the N different channels.
3. The speech conversion method of claim 1, wherein said obtaining N groups of source speech data of a source speaker under N different channels respectively comprises:

acquiring M groups of initial source speech data of the source speaker collected by M microphones, wherein M is an integer greater than or equal to 1; and

performing an M-channel to N-channel conversion operation on the M groups of initial source speech data to obtain the N groups of source speech data.
4. A model training method, comprising:
acquiring first training voice data of a sample speaker and second training voice data of a target speaker;
performing channel conversion operation from a single channel to N channels on the first training voice data to obtain N groups of sample voice data under N different channels respectively, wherein N is an integer greater than 1;
respectively carrying out feature extraction on each group of sample voice data in the N groups of sample voice data to obtain N groups of first recognition acoustic features;
feature combining the N sets of first recognized acoustic features to obtain recognized acoustic features of the sample speaker;
performing channel conversion operation from a single channel to N channels on the second training voice data to obtain N groups of target voice data under the N different channels respectively;
respectively carrying out feature extraction on each group of target voice data in the N groups of target voice data to obtain N groups of second synthesized acoustic features;
performing feature merging on the N sets of second synthesized acoustic features to obtain synthesized acoustic features of the target speaker; and
obtaining a predicted synthesized acoustic feature through mapping by a predetermined mapping model based on the recognized acoustic feature of the sample speaker, and training the predetermined mapping model by using the synthesized acoustic feature of the target speaker as a true value of the predicted synthesized acoustic feature, wherein the predetermined mapping model is used for mapping the acoustic feature of the source speaker to the acoustic feature of the target speaker in the process of carrying out voice conversion on any source speaker and the target speaker, so as to carry out voice synthesis by a predetermined synthesizer based on the acoustic feature of the target speaker to obtain the target voice of the target speaker.
5. A speech conversion apparatus comprising:
an acquisition module, configured to acquire N groups of source speech data of a source speaker under N different channels respectively, wherein N is an integer greater than 1;

an extraction module, configured to perform feature extraction on each group of source speech data in the N groups of source speech data respectively to obtain N groups of source recognition acoustic features;
a merging module, configured to perform feature merging on the N groups of source recognition acoustic features to obtain acoustic features of the source speaker;
a mapping module for mapping the acoustic features of the source speaker to acoustic features of a target speaker by a predetermined mapping model;
a synthesis module, configured to perform speech synthesis based on the acoustic features of the target speaker to obtain a target speech of the target speaker.
6. A speech conversion system comprising a processor and a memory, wherein the memory has stored therein computer program instructions for execution by the processor for performing the speech conversion method of any of claims 1 to 3.
7. A storage medium on which are stored program instructions for performing, when executed, a speech conversion method according to any one of claims 1 to 3.
8. A model training apparatus comprising:
an acquisition module, configured to acquire first training voice data of a sample speaker and second training voice data of a target speaker;

a first simulation module, configured to perform a channel conversion operation from a single channel to N channels on the first training voice data to obtain N groups of sample voice data under N different channels respectively, wherein N is an integer greater than 1;

a first extraction module, configured to perform feature extraction on each group of sample voice data in the N groups of sample voice data respectively to obtain N groups of first recognition acoustic features;
a first combining module, configured to perform feature combining on the N sets of first recognized acoustic features to obtain recognized acoustic features of the sample speaker;
the second simulation module is used for carrying out channel conversion operation from a single channel to N channels on the second training voice data so as to obtain N groups of target voice data under the N different channels respectively;
the second extraction module is used for respectively carrying out feature extraction on each group of target voice data in the N groups of target voice data so as to obtain N groups of second synthesized acoustic features;
a second combining module, configured to perform feature combining on the N sets of second synthesized acoustic features to obtain synthesized acoustic features of the target speaker; and
a training module, configured to obtain a predicted synthesized acoustic feature through mapping by a predetermined mapping model based on the recognized acoustic feature of the sample speaker, and to train the predetermined mapping model by using the synthesized acoustic feature of the target speaker as a true value of the predicted synthesized acoustic feature, wherein the predetermined mapping model is used for mapping the acoustic feature of the source speaker to the acoustic feature of the target speaker in the process of carrying out voice conversion on any source speaker and the target speaker, so that a predetermined synthesizer carries out voice synthesis based on the acoustic feature of the target speaker to obtain the target voice of the target speaker.
9. A model training system comprising a processor and a memory, wherein the memory has stored therein computer program instructions for execution by the processor for performing the model training method of claim 4.
10. A storage medium on which program instructions are stored, which program instructions are operable when executed to perform the model training method of claim 4.
CN202011054910.XA 2020-09-29 2020-09-29 Voice conversion and model training method, device and system and storage medium Pending CN112185342A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202011054910.XA CN112185342A (en) 2020-09-29 2020-09-29 Voice conversion and model training method, device and system and storage medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202011054910.XA CN112185342A (en) 2020-09-29 2020-09-29 Voice conversion and model training method, device and system and storage medium

Publications (1)

Publication Number Publication Date
CN112185342A true CN112185342A (en) 2021-01-05

Family

ID=73946041

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202011054910.XA Pending CN112185342A (en) 2020-09-29 2020-09-29 Voice conversion and model training method, device and system and storage medium

Country Status (1)

Country Link
CN (1) CN112185342A (en)

Cited By (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113345431A (en) * 2021-05-31 2021-09-03 平安科技(深圳)有限公司 Cross-language voice conversion method, device, equipment and medium
CN113470699A (en) * 2021-09-03 2021-10-01 北京奇艺世纪科技有限公司 Audio processing method and device, electronic equipment and readable storage medium
CN113689867A (en) * 2021-08-18 2021-11-23 北京百度网讯科技有限公司 Training method and device of voice conversion model, electronic equipment and medium
CN114582327A (en) * 2022-02-25 2022-06-03 北京小米移动软件有限公司 Speech recognition model training method, speech recognition method and electronic equipment

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN105845127A (en) * 2015-01-13 2016-08-10 阿里巴巴集团控股有限公司 Voice recognition method and system
US20170040016A1 (en) * 2015-04-17 2017-02-09 International Business Machines Corporation Data augmentation method based on stochastic feature mapping for automatic speech recognition
CN107610717A (en) * 2016-07-11 2018-01-19 香港中文大学 Many-to-One Speech Conversion Method Based on Speech Posterior Probability
CN111341303A (en) * 2018-12-19 2020-06-26 北京猎户星空科技有限公司 Acoustic model training method and device and voice recognition method and device

Patent Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN105845127A (en) * 2015-01-13 2016-08-10 阿里巴巴集团控股有限公司 Voice recognition method and system
US20170040016A1 (en) * 2015-04-17 2017-02-09 International Business Machines Corporation Data augmentation method based on stochastic feature mapping for automatic speech recognition
CN107610717A (en) * 2016-07-11 2018-01-19 香港中文大学 Many-to-One Speech Conversion Method Based on Speech Posterior Probability
CN111341303A (en) * 2018-12-19 2020-06-26 北京猎户星空科技有限公司 Acoustic model training method and device and voice recognition method and device

Cited By (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113345431A (en) * 2021-05-31 2021-09-03 平安科技(深圳)有限公司 Cross-language voice conversion method, device, equipment and medium
CN113345431B (en) * 2021-05-31 2024-06-07 平安科技(深圳)有限公司 Cross-language voice conversion method, device, equipment and medium
CN113689867A (en) * 2021-08-18 2021-11-23 北京百度网讯科技有限公司 Training method and device of voice conversion model, electronic equipment and medium
CN113470699A (en) * 2021-09-03 2021-10-01 北京奇艺世纪科技有限公司 Audio processing method and device, electronic equipment and readable storage medium
CN114582327A (en) * 2022-02-25 2022-06-03 北京小米移动软件有限公司 Speech recognition model training method, speech recognition method and electronic equipment
CN114582327B (en) * 2022-02-25 2024-09-03 北京小米移动软件有限公司 Speech recognition model training method, speech recognition method and electronic equipment
