
CN113870875A - Tone feature extraction method, device, computer equipment and storage medium - Google Patents

Tone feature extraction method, device, computer equipment and storage medium

Info

Publication number
CN113870875A
CN113870875A
Authority
CN
China
Prior art keywords
voice
speaker
training
data
vector
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202111130551.6A
Other languages
Chinese (zh)
Inventor
张旭龙
王健宗
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Ping An Technology Shenzhen Co Ltd
Original Assignee
Ping An Technology Shenzhen Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Ping An Technology Shenzhen Co Ltd filed Critical Ping An Technology Shenzhen Co Ltd
Priority to CN202111130551.6A
Publication of CN113870875A
Legal status: Pending

Classifications

    • G - PHYSICS
    • G10 - MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L - SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L21/00 - Speech or voice signal processing techniques to produce another audible or non-audible signal, e.g. visual or tactile, in order to modify its quality or its intelligibility
    • G10L21/003 - Changing voice quality, e.g. pitch or formants
    • G10L21/007 - Changing voice quality, e.g. pitch or formants characterised by the process used
    • G10L21/013 - Adapting to target pitch
    • G10L2021/0135 - Voice conversion or morphing
    • G10L25/00 - Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
    • G10L25/03 - Speech or voice analysis techniques characterised by the type of extracted parameters
    • G10L25/27 - Speech or voice analysis techniques characterised by the analysis technique
    • G10L25/30 - Speech or voice analysis techniques characterised by the analysis technique using neural networks

Landscapes

  • Engineering & Computer Science (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Computational Linguistics (AREA)
  • Signal Processing (AREA)
  • Health & Medical Sciences (AREA)
  • Human Computer Interaction (AREA)
  • Physics & Mathematics (AREA)
  • Acoustics & Sound (AREA)
  • Multimedia (AREA)
  • Quality & Reliability (AREA)
  • Evolutionary Computation (AREA)
  • Artificial Intelligence (AREA)
  • Telephonic Communication Services (AREA)

Abstract

The invention relates to the field of artificial intelligence and in particular discloses a tone feature extraction method, a tone feature extraction device, computer equipment and a storage medium. Voice data of at least two speakers are obtained and input into a preset bidirectional recurrent neural network so as to convert the voice data into continuous vectors; the continuous vectors are quantized into voice text content discrete vectors, the difference between each continuous vector and the corresponding voice text content discrete vector is calculated, and the loss value of a preset target optimization function is then calculated from the difference. When the loss value does not meet the preset requirement, the parameters of the bidirectional recurrent neural network are adjusted according to the loss value and the network with the adjusted parameters is trained with new voice data; when the loss value meets the preset requirement, the difference is determined to be the speaker tone characteristic information associated with the speaker tag information. The invention obtains tone characteristic information that better characterizes the speaker, thereby improving the voice conversion effect.

Description

Tone feature extraction method, device, computer equipment and storage medium
Technical Field
The invention relates to the field of artificial intelligence, in particular to a tone characteristic extraction method, a tone characteristic extraction device, computer equipment and a storage medium.
Background
In daily life, voice conversion technology is applied in fields such as driving navigation and the dubbing of film and television works. Voice conversion generally refers to converting one person's voice into another person's voice, for example converting the voice of a male announcer in driving navigation into the voice of a star the driver likes.
Voice conversion essentially changes the speaker, i.e. the tone color, without changing the speech content. In the prior art, the difference between the original continuous speech variable and the quantized discrete speech variable is usually calculated, and the calculation is repeated many times to obtain an expected average value that is used as the final speaker tone color feature.
However, the tone color feature obtained in this way does not characterize the speaker's tone color well, which results in a poor voice conversion effect.
Disclosure of Invention
Therefore, it is necessary to provide a method, an apparatus, a computer device and a storage medium for extracting a tone color feature to solve the problem that the tone color feature obtained by the existing voice conversion technology cannot well represent the tone color of a speaker, which results in a poor voice conversion effect.
A method for extracting timbre features comprises the following steps:
acquiring voice data of at least two speakers, wherein the voice data of at least one speaker at least comprises two voices, and the voice data is associated with speaker tag information;
inputting the voice data into a preset bidirectional cyclic neural network so as to convert the voice data into continuous vectors, quantizing the continuous vectors into discrete vectors of voice text contents, and calculating the difference value between the continuous vectors and the discrete vectors of the voice text contents;
calculating a loss value of a preset target optimization function according to the difference value;
when the loss value does not meet the preset requirement, adjusting the parameters of the bidirectional cyclic neural network according to the loss value, and training the bidirectional cyclic neural network with the adjusted parameters by using new voice data;
and when the loss value meets the preset requirement, determining the difference value as speaker tone characteristic information associated with the speaker tag information.
A tone color feature extraction device comprising:
the voice data acquisition module is used for acquiring voice data of at least two speakers, wherein the voice data of at least one speaker at least comprises two voices, and the voice data is associated with the tag information of the speaker;
the difference value calculation module is used for inputting the voice data into a preset bidirectional recurrent neural network so as to convert the voice data into a continuous vector, quantizing the continuous vector into a voice text content discrete vector and calculating the difference value between the continuous vector and the voice text content discrete vector;
the loss value calculation module is used for calculating the loss value of a preset target optimization function according to the difference value;
the training module is used for adjusting the parameters of the bidirectional cyclic neural network according to the loss value when the loss value does not meet the preset requirement, and training the bidirectional cyclic neural network with the adjusted parameters by using new voice data;
and the speaker tone characteristic information determining module is used for determining the difference as speaker tone characteristic information associated with the speaker tag information when the loss value meets the preset requirement.
A computer device comprising a memory, a processor and computer readable instructions stored in the memory and executable on the processor, the processor implementing the above-mentioned timbre feature extraction method when executing the computer readable instructions.
One or more readable storage media storing computer-readable instructions that, when executed by one or more processors, cause the one or more processors to perform the method of timbre feature extraction as described above.
The method, the device, the computer equipment and the storage medium for extracting the tone color characteristics acquire the voice data of at least two speakers, input the voice data into a preset bidirectional recurrent neural network to convert the voice data into continuous vectors, quantize the continuous vectors into discrete vectors of voice text contents, calculate the difference value between the continuous vectors and the discrete vectors of the voice text contents, and calculate the loss value of a preset target optimization function according to the difference value; when the loss value does not meet the preset requirement, adjusting the parameters of the bidirectional cyclic neural network according to the loss value, and training the bidirectional cyclic neural network with the adjusted parameters by using new voice data; and when the loss value meets the preset requirement, determining the difference value as speaker tone characteristic information associated with the speaker tag information. Compared with the traditional voice conversion technology in a tone characteristic extraction mode, the method adopts the bidirectional cyclic neural network to process the voice data, and can well decouple the voice text content and the speaker tone characteristic, thereby obtaining the tone which can better represent the speaker, and further improving the voice conversion effect.
Drawings
In order to more clearly illustrate the technical solutions of the embodiments of the present invention, the drawings needed to be used in the description of the embodiments of the present invention will be briefly introduced below, and it is obvious that the drawings in the following description are only some embodiments of the present invention, and it is obvious for those skilled in the art that other drawings can be obtained according to these drawings without inventive labor.
FIG. 1 is a flow chart of a method for extracting timbre features according to an embodiment of the present invention;
FIG. 2 is a flowchart illustrating a vector quantization process performed on speech training data by using VQ technique according to an embodiment of the present invention;
FIG. 3 is a diagram illustrating training of a speech conversion model according to an embodiment of the present invention;
FIG. 4 is a schematic structural diagram of a tone feature extraction apparatus according to an embodiment of the present invention;
FIG. 5 is a schematic diagram of a computer device in an embodiment of the invention.
Detailed Description
The technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the drawings in the embodiments of the present invention, and it is obvious that the described embodiments are some, not all, embodiments of the present invention. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.
Artificial intelligence is a branch of computer science and research in this field includes robotics, speech (including speech processing, speech recognition, speech synthesis, speaker recognition, speech conversion, etc.), image recognition, natural language processing, and expert systems, among others. The invention relates to voice processing, voice synthesis, speaker tone characteristic extraction and voice conversion, and the proposed tone characteristic extraction method can well decouple voice text content and speaker tone characteristics by adopting a bidirectional cyclic neural network to process voice data, thereby obtaining the tone which can better represent a speaker and further improving the voice conversion effect. The voice conversion effect of the invention is good, the cost is low, and the invention can be applied to the dubbing field, such as the dubbing of the self-made creative video of the self-media, the dubbing of the self-made animation of the cartoon fan, the dubbing of the movie and television works, etc.; the method is favorable for promoting the continuous innovation and development of the voice technology in the field of artificial intelligence, and has wide market prospect.
In an embodiment, as shown in fig. 1, a method for extracting timbre features is provided, and for convenience of explanation, only a part related to the embodiment is shown in the figure, including the following steps:
step S10, acquiring the voice data of at least two speakers; the voice data of at least one speaker at least comprises two voices, and the voice data is associated with the speaker tag information.
In the embodiment of the present invention, a piece of speech includes the speech text content and the speaker's tone color characteristic information (i.e., the characteristic waveform or vibration pattern of the speaker's voice). For example, if speaker A says "I love my hometown", a voice is obtained that includes the speech text content "I love my hometown" and the tone characteristic information of speaker A, i.e., the voice data of speaker A. It can be understood that if speaker B says "I love my hometown", a voice is obtained that includes the speech text content "I love my hometown" and the tone characteristic information of speaker B, i.e., the voice data of speaker B. Generally, different persons speak with different tone colors, so speech spoken by different persons can be distinguished by tone color.
The speaker tag information is information indicating the identity of the speaker, and may be a text identifier, a graphic identifier, a numeric identifier, a letter identifier, or the like. For example, the tag information of the speaker a, the speaker b, and the speaker c may respectively represent the identities of the three speakers by three characters "a", "b", and "c", or may respectively represent the identities of the three speakers by numbers "1", "2", and "3".
In an embodiment of the present invention, the association between the voice data and the speaker tag information may be performed by storing the voice data and the speaker tag information in a one-to-one correspondence manner, and specifically, the voice data and the speaker tag information may be stored in a correspondence relationship as shown in table 1 below.
TABLE 1 Correspondence between voice data and speaker tag information
Voice data     Speaker tag information
Voice data 1   Speaker A
Voice data 2   Speaker B
In another embodiment of the present invention, the voice data is associated with speaker tag information, and each piece of voice data may also carry corresponding speaker tag information. For example, the voice data of the speaker A carries a text tag identifying the identity information "A" of the speaker A.
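As a small illustration of this association (the Python dictionary layout and field names below are illustrative assumptions, not part of the embodiment), the correspondence of Table 1 and the carried tags could be represented as follows:

    # Illustrative sketch only: associate each piece of voice data with its speaker tag.
    voice_data = [
        {"utterance_id": "voice_data_1", "speaker_tag": "A", "waveform_path": "a_001.wav"},
        {"utterance_id": "voice_data_2", "speaker_tag": "B", "waveform_path": "b_001.wav"},
    ]

    # Group utterances by speaker tag, mirroring the one-to-one correspondence of Table 1.
    by_speaker = {}
    for item in voice_data:
        by_speaker.setdefault(item["speaker_tag"], []).append(item["utterance_id"])

    print(by_speaker)  # {'A': ['voice_data_1'], 'B': ['voice_data_2']}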
Step S20, inputting the voice data into a preset bidirectional recurrent neural network to convert the voice data into continuous vectors, quantizing the continuous vectors into discrete vectors of voice text content, and calculating the difference between the continuous vectors and the discrete vectors of the voice text content.
In an embodiment of the present invention, converting the speech data into a continuous vector may specifically mean extracting a Mel spectrum from the speech data; the Mel spectrum corresponds to a latent coding vector (i.e., the continuous vector), which may specifically be a one-dimensional array of length 256, for example (1.1, 3.3, 2.5, 1.2, …). Extraction techniques known to those skilled in the art may be used to obtain the Mel spectrum from the voice data: for example, the audio signal is pre-emphasized, framed and windowed, a short-time Fourier transform (STFT) is applied to each frame to obtain a short-time amplitude spectrum, and the short-time amplitude spectrum is then passed through a Mel filter bank to obtain the Mel spectrum.
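As a hedged Python sketch of the Mel spectrum extraction described above (librosa and every parameter value here are assumptions for illustration; the embodiment does not prescribe a specific toolkit or configuration):

    import librosa
    import numpy as np

    def extract_mel_spectrogram(wav_path, sr=16000, n_fft=1024, hop_length=256, n_mels=80):
        """Load audio, apply simple pre-emphasis, then compute a log-Mel spectrogram.
        Framing, windowing and the STFT are handled inside librosa."""
        y, sr = librosa.load(wav_path, sr=sr)
        y = np.append(y[0], y[1:] - 0.97 * y[:-1])  # pre-emphasis
        mel = librosa.feature.melspectrogram(
            y=y, sr=sr, n_fft=n_fft, hop_length=hop_length, n_mels=n_mels
        )
        return librosa.power_to_db(mel, ref=np.max)  # amplitude spectrum -> Mel filter bank -> log scale

    # mel = extract_mel_spectrogram("speaker_a_utterance.wav")  # hypothetical file name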
The speech text content discrete vector is obtained by searching a codebook and replacing the continuous vector with the nearest codebook vector. For example, the codebook entry corresponding to the continuous vector (1.1, 3.3, 2.5, 1.2, …) above is (1, 3, 2, 1, …). The difference between the continuous vector and the speech text content discrete vector can then be calculated as (0.1, 0.3, 0.5, 0.2, …).
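A minimal numpy sketch of the codebook lookup and difference computation just described (the codebook values and vector length are illustrative assumptions):

    import numpy as np

    def quantize_and_residual(continuous_vec, codebook):
        """Replace the continuous vector with its nearest codebook entry (the speech text
        content discrete vector) and return that entry together with the difference."""
        distances = np.linalg.norm(codebook - continuous_vec, axis=1)
        nearest = codebook[np.argmin(distances)]
        return nearest, continuous_vec - nearest

    # Toy values mirroring the text: continuous (1.1, 3.3, 2.5, 1.2) vs codebook entry (1, 3, 2, 1)
    codebook = np.array([[1.0, 3.0, 2.0, 1.0],
                         [2.0, 1.0, 0.0, 3.0]])
    continuous = np.array([1.1, 3.3, 2.5, 1.2])
    discrete, diff = quantize_and_residual(continuous, codebook)
    print(discrete, diff)  # approximately [1. 3. 2. 1.] and [0.1 0.3 0.5 0.2]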
In another embodiment of the present invention, before inputting the voice data into the preset bidirectional recurrent neural network, the voice data may be converted into a continuous vector, the continuous vector is quantized into a discrete vector of the content of the voice text, and then the continuous vector and the discrete vector of the content of the voice text are input into the bidirectional recurrent neural network, and a difference between the continuous vector and the discrete vector of the content of the voice text is calculated.
In an exemplary embodiment, the voice data includes a first voice, a second voice, and a third voice; the first voice, the second voice and the first speaker tag information are associated, and the third voice and the second speaker tag information are associated.
It can be understood that the speakers corresponding to the first voice and the second voice are both the first speaker, and the speaker corresponding to the third voice is the second speaker.
In the above step S20, the converting the speech data into a continuous vector, quantizing the continuous vector into a discrete vector of speech text content, and calculating a difference between the continuous vector and the discrete vector of speech text content includes:
and converting the first voice into a first continuous vector, quantizing the first continuous vector into a first voice text content discrete vector, and calculating a first difference value between the first continuous vector and the first voice text content discrete vector.
And converting the second voice into a second continuous vector, quantizing the second continuous vector into a second voice text content discrete vector, and calculating a second difference value between the second continuous vector and the second voice text content discrete vector.
And converting the third voice into a third continuous vector, quantizing the third continuous vector into a third voice text content discrete vector, and calculating a third difference value between the third continuous vector and the third voice text content discrete vector.
The method for calculating the first difference, the second difference, and the third difference may refer to the method for calculating the difference between the continuous vector and the discrete vector of the content of the speech text in the embodiment, and is not described herein again.
And step S30, calculating the loss value of the preset target optimization function according to the difference value.
In an embodiment, with reference to the foregoing exemplary embodiment, a loss value of a preset objective optimization function is calculated according to the first difference, the second difference, and the third difference, where the preset objective optimization function is:
L = -(y1 != y2)‖S_A(x1) - S_B(x1)‖ + (y1 == y2)‖S_A(x1) - S_A(x2)‖;
where L is the loss value; y1 denotes the first speaker; y2 denotes the second speaker; S_A(x1) denotes the first difference obtained after the first voice is processed by the preset bidirectional recurrent neural network; S_A(x2) denotes the second difference obtained after the second voice is processed by the preset bidirectional recurrent neural network; S_B(x1) denotes the third difference obtained after the third voice is processed by the preset bidirectional recurrent neural network.
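A small Python sketch of how this loss could be evaluated for one triple of differences, reading the indicator terms literally as 0/1 flags (numpy only; the variable names follow the notation above, everything else is an assumption):

    import numpy as np

    def timbre_loss(s_a_x1, s_a_x2, s_b_x1, y1, y2):
        """Literal reading of the objective:
        L = -(y1 != y2) * ||S_A(x1) - S_B(x1)|| + (y1 == y2) * ||S_A(x1) - S_A(x2)||"""
        cross_term = float(y1 != y2) * np.linalg.norm(s_a_x1 - s_b_x1)  # push different speakers apart
        same_term = float(y1 == y2) * np.linalg.norm(s_a_x1 - s_a_x2)   # pull the same speaker together
        return -cross_term + same_term

    # Assumed difference vectors for the first, second and third voices
    s_a_x1 = np.array([0.10, 0.30, 0.50, 0.20])
    s_a_x2 = np.array([0.12, 0.28, 0.48, 0.22])
    s_b_x1 = np.array([0.60, -0.10, 0.20, 0.90])
    print(timbre_loss(s_a_x1, s_a_x2, s_b_x1, y1="A", y2="B"))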
And step S40, when the loss value does not meet the preset requirement, adjusting the parameters of the bidirectional cyclic neural network according to the loss value, and training the bidirectional cyclic neural network with the adjusted parameters by using new voice data.
And step S50, when the loss value meets the preset requirement, determining the difference value as speaker tone characteristic information associated with the speaker tag information.
In the embodiment of the present invention, the preset requirement generally means that the loss value is less than or equal to a preset threshold, for example 0.1 or 0.3. When the loss value is greater than the preset threshold, the loss value does not meet the preset requirement; otherwise, the loss value meets the preset requirement.
In one embodiment, when the loss value does not meet the preset requirement, the parameters of the bidirectional recurrent neural network are adjusted according to the loss value, and new voice data are continuously input into the bidirectional recurrent neural network with the adjusted parameters for training. The new voice data may be 4 or 8 pieces of voice data randomly extracted from the training data set, and the number of pieces of randomly extracted voice data may be determined according to actual needs, which is not specifically limited herein. The new speech data may be speech data including the first speaker and/or the second speaker, or speech data of other speakers.
And when the loss value does not meet the preset requirement, repeating the training steps until the loss value meets the preset requirement.
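A hedged PyTorch sketch of this iterative training (the bidirectional GRU encoder, codebook size, optimizer settings, threshold and placeholder data are all assumptions; the embodiment does not specify a framework or architecture details):

    import torch
    import torch.nn as nn

    class BiRNNEncoder(nn.Module):
        """Toy bidirectional recurrent encoder mapping Mel frames to one continuous vector."""
        def __init__(self, n_mels=80, hidden=128, out_dim=256):
            super().__init__()
            self.rnn = nn.GRU(n_mels, hidden, batch_first=True, bidirectional=True)
            self.proj = nn.Linear(2 * hidden, out_dim)

        def forward(self, mel):                  # mel: (batch, frames, n_mels)
            h, _ = self.rnn(mel)
            return self.proj(h.mean(dim=1))      # one continuous vector per utterance

    def nearest_codebook_entry(v, codebook):
        d = torch.cdist(v, codebook)             # (batch, codebook_size)
        return codebook[d.argmin(dim=1)]

    encoder = BiRNNEncoder()
    codebook = torch.randn(64, 256)              # assumed codebook
    optimizer = torch.optim.Adam(encoder.parameters(), lr=1e-3)
    threshold = 0.1                               # assumed "preset requirement"

    for step in range(1000):                      # each iteration uses newly sampled voice data
        mel_x1, mel_x2, mel_x3 = (torch.randn(1, 120, 80) for _ in range(3))  # placeholder batches
        diffs = []
        for mel in (mel_x1, mel_x2, mel_x3):
            cont = encoder(mel)
            disc = nearest_codebook_entry(cont, codebook)
            diffs.append(cont - disc)             # difference between continuous and discrete vectors
        s_a_x1, s_a_x2, s_b_x1 = diffs
        # Same-speaker differences pulled together, cross-speaker differences pushed apart
        loss = -torch.norm(s_a_x1 - s_b_x1) + torch.norm(s_a_x1 - s_a_x2)
        if loss.item() <= threshold:              # loss value meets the preset requirement
            break
        optimizer.zero_grad()
        loss.backward()                            # adjust the network parameters via the loss value
        optimizer.step()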
In the embodiment of the invention, the parameters of the bidirectional cyclic neural network are adjusted through the loss values, so that the difference of the tone characteristic information of different sentences spoken by the same speaker is as small as possible, and the difference of the tone characteristic information of the same sentence spoken by different speakers is as large as possible, thereby obtaining the tone characteristic information capable of well representing the speaker and further improving the effect of voice conversion.
In one embodiment, the step S50 includes:
and calculating an average value of the first difference value and the second difference value, and determining the average value as the first speaker tone characteristic information associated with the first speaker tag information.
And determining the third difference as second speaker tone characteristic information associated with the second speaker tag information.
As an example, when the loss value of the preset objective optimization function calculated from the first difference, the second difference, and the third difference satisfies the preset requirement, an average value of the first difference and the second difference is calculated, and the average value is determined as the first speaker timbre characteristic information. And determining the third difference as the tone color characteristic information of the second speaker.
As another example, when the loss value of the preset objective optimization function calculated according to the first difference, the second difference and the third difference does not meet the preset requirement, the parameters of the bidirectional recurrent neural network are adjusted according to the currently calculated loss value, new voice data is continuously input into the bidirectional recurrent neural network with the adjusted parameters for training, a new loss value is obtained, whether the new loss value meets the preset requirement or not is judged, and if so, speaker tone color feature information corresponding to each speaker tag information is output.
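Continuing the examples above, a tiny numpy sketch of how the per-speaker tone characteristic information could be finalized once the loss meets the requirement (all values assumed):

    import numpy as np

    # Differences produced by the trained network for each utterance (assumed values)
    first_diff = np.array([0.10, 0.30, 0.50, 0.20])    # first voice, first speaker
    second_diff = np.array([0.12, 0.28, 0.48, 0.22])   # second voice, first speaker
    third_diff = np.array([0.60, -0.10, 0.20, 0.90])   # third voice, second speaker

    speaker_timbre = {
        "first_speaker": (first_diff + second_diff) / 2,  # average of the first and second differences
        "second_speaker": third_diff,                      # the third difference as-is
    }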
In an embodiment, after the step S50, the method further includes:
and obtaining source speech data to be converted and target speaker tag information.
And acquiring the corresponding relation between the speaker tag information and the speaker tone characteristic information, and acquiring the target speaker tone characteristic information corresponding to the target speaker tag information according to the corresponding relation.
And extracting source speech text content discrete vectors of the source speech data, and performing speech synthesis on the source speech text content discrete vectors and the target speaker tone characteristic information through a speech conversion model to obtain target speech data.
Source speech data generally refers to the original speech data that has not yet been converted. The target speaker tag information is the identity information of the speaker into whose voice the conversion is desired; for example, if the target speaker is speaker A, the target speaker tag information may be speaker A's name or code.
In the embodiment of the invention, the corresponding relationship between the speaker tag information and the speaker tone characteristic information specifically means that the speaker tag information corresponds to the speaker tone characteristic information one by one. For example, the label information of the speaker A corresponds to the tone characteristic information of the speaker A, and the label information of the speaker B corresponds to the tone characteristic information of the speaker B.
As an example, a correspondence table of speaker tag information and speaker timbre feature information may be constructed in advance, as shown in table 2 below.
Table 2 correspondence table between speaker tag information and speaker tone feature information
Speaker tag information            Speaker tone characteristic information
Speaker A tag information          Speaker A tone characteristic information
Speaker B tag information          Speaker B tone characteristic information
The target speaker tone characteristic information can be obtained by inquiring the corresponding relation table according to the target speaker tag information.
In the embodiment of the present invention, the discrete vectors of the source speech text content of the source speech data are extracted, and the step of converting the speech data into continuous vectors and quantizing the continuous vectors into discrete vectors of the speech text content may be referred to above, and will not be described herein again.
As an example, assume that the source speech data is a segment of speech from speaker A and it is desired to convert speaker A's timbre in that segment into speaker B's timbre to obtain target speech data. According to the above method, the timbre feature information of speaker B and the source speech text content discrete vector are first obtained, and the source speech text content discrete vector and speaker B's timbre feature information are then speech-synthesized through the speech conversion model to obtain the target speech data.
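A hedged end-to-end Python sketch of this conversion step (the helpers extract_content_vector and conversion_model, and the lookup-table layout, are hypothetical placeholders for whatever implementations are actually used):

    def convert_voice(source_wav, target_tag, timbre_table, extract_content_vector, conversion_model):
        """Keep the speech text content of the source speech, swap in the target speaker's timbre."""
        content_discrete = extract_content_vector(source_wav)     # source speech text content discrete vector
        target_timbre = timbre_table[target_tag]                  # look up Table 2 by the target speaker tag
        return conversion_model(content_discrete, target_timbre)  # speech synthesis -> target speech data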
In one embodiment, before performing speech synthesis according to the discrete vectors of the text content of the source speech and the timbre feature information of the target speaker to obtain target speech data, the method includes:
acquiring a plurality of voice training data, and performing vector quantization processing on the plurality of voice training data to obtain a plurality of discrete vectors of training voice text contents; the plurality of voice training data carry training speaker tag information.
And acquiring the tone characteristic information of the training speaker corresponding to the label information of the training speaker.
And training a preset countermeasure generation network by using the discrete vectors of the contents of the training voice texts and the tone characteristic information of the training speaker to obtain the voice conversion model.
In an embodiment, performing vector quantization processing on the plurality of speech training data to obtain a plurality of discrete vectors of training speech text content includes:
and carrying out normalization processing on the plurality of voice training data to obtain a plurality of training voice continuous vectors.
And searching discrete vectors of the training voice text content corresponding to the continuous vectors of the training voice according to a preset code book.
As an example, the VQ technique shown in fig. 2 may be used to perform vector quantization on the pieces of speech training data. Specifically, the rough process by which the speech training data (i.e., Audio X in fig. 2) is converted into a latent code (discrete vector) by the encoder is as follows: owing to the characteristics of the neural network, the speech training data has a corresponding latent coding vector V (a continuous vector), which is a one-dimensional array of length 256. The coding vector is fed into the input layer, normalization is performed through the IN layer to obtain a normalized vector (i.e., IN(V) in fig. 2), the codebook entry closest to the normalized vector is found by searching the codebook, and that codebook entry replaces the normalized vector, yielding the training speech text content discrete vector.
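A small numpy sketch of the normalization-then-lookup step shown in fig. 2 (a simple zero-mean, unit-variance normalization is assumed for the IN layer; the codebook is illustrative):

    import numpy as np

    def instance_normalize(v, eps=1e-5):
        """Normalize the latent coding vector V to zero mean and unit variance (IN(V) in fig. 2)."""
        return (v - v.mean()) / (v.std() + eps)

    def to_latent_code(v, codebook):
        """Replace the normalized vector with its nearest codebook entry (the latent code)."""
        v_norm = instance_normalize(v)
        idx = np.argmin(np.linalg.norm(codebook - v_norm, axis=1))
        return codebook[idx]

    # latent_code = to_latent_code(np.random.randn(256), np.random.randn(64, 256))  # assumed sizes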
In an embodiment, the number of voice training data includes first voice training data; the training speaker voice tone characteristic information comprises first training speaker voice tone characteristic information and second training speaker voice tone characteristic information.
The training a preset countermeasure generation network by using the discrete vectors of the contents of the training speech texts and the tone characteristic information of the training speaker to obtain the speech conversion model comprises the following steps:
and training a generator of a preset countermeasure generation network by using the first voice training data and the second training speaker tone characteristic information to obtain generated voice data.
And training a preset decoder by using the first voice training data and the tone characteristic information of the first training speaker to obtain reconstructed voice data.
And training a preset discriminator of the countermeasure generating network by using the generated voice data, the reconstructed voice data and the voice training data, and obtaining a voice conversion model after training.
As an example, the speech conversion model may be obtained by training with reference to the training method shown in FIG. 3. Specifically, in fig. 3, the target speaker D_Vector1 is the timbre feature information of the second training speaker, the source speaker D_Vector1 is the timbre feature information of the first training speaker, Audio X is the first speech training data, the latent code is the discrete vector of the first speech text content obtained after the first speech training data is processed by the VQ technique, Generator is the generator of the countermeasure generation network, Decoder is the decoder, X1 is the reconstructed speech data, X2 is the generated speech data, and Discriminator is the discriminator of the countermeasure generation network.
Specifically, during training, the first speech text content discrete vector is simultaneously input into the Decoder and the Generator; the Decoder is always also fed the tone characteristic information of the first training speaker to obtain the reconstructed speech data, while the Generator is always fed the tone characteristic information of the second training speaker to obtain the generated speech data. The discriminator of the preset countermeasure generation network is then trained with the generated speech data, the reconstructed speech data and the pieces of speech training data so as to obtain the following discrimination results: the generated speech data is identified as false with the second training speaker as the corresponding speaker, the reconstructed speech data is identified as false with the first training speaker as the corresponding speaker, and the speech training data is identified as true with the first training speaker as the corresponding speaker.
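A hedged PyTorch sketch of one adversarial training step matching this description (the module architectures, the discriminator's output layout and the loss choices are assumptions; only the data flow follows fig. 3):

    import torch
    import torch.nn as nn

    latent_dim, timbre_dim, audio_dim = 256, 256, 80   # assumed sizes

    decoder = nn.Sequential(nn.Linear(latent_dim + timbre_dim, 512), nn.ReLU(), nn.Linear(512, audio_dim))
    generator = nn.Sequential(nn.Linear(latent_dim + timbre_dim, 512), nn.ReLU(), nn.Linear(512, audio_dim))
    # Assumed discriminator head: [real/fake score, speaker-1 score, speaker-2 score]
    discriminator = nn.Sequential(nn.Linear(audio_dim, 256), nn.ReLU(), nn.Linear(256, 3))

    opt_g = torch.optim.Adam(list(decoder.parameters()) + list(generator.parameters()), lr=1e-4)
    opt_d = torch.optim.Adam(discriminator.parameters(), lr=1e-4)
    bce = nn.BCEWithLogitsLoss()

    latent_code = torch.randn(1, latent_dim)       # first speech text content discrete vector (placeholder)
    speaker1_timbre = torch.randn(1, timbre_dim)   # first training speaker timbre features (placeholder)
    speaker2_timbre = torch.randn(1, timbre_dim)   # second training speaker timbre features (placeholder)
    real_audio = torch.randn(1, audio_dim)         # first voice training data features (placeholder)

    # Decoder + speaker-1 timbre -> reconstructed speech X1; Generator + speaker-2 timbre -> generated speech X2
    x1 = decoder(torch.cat([latent_code, speaker1_timbre], dim=1))
    x2 = generator(torch.cat([latent_code, speaker2_timbre], dim=1))

    def d_targets(real, speaker):
        """Assumed label convention: real/fake flag plus a one-hot speaker assignment."""
        return torch.tensor([[float(real), float(speaker == 1), float(speaker == 2)]])

    # Discriminator: training data is true & speaker 1; X1 is false & speaker 1; X2 is false & speaker 2
    d_loss = (bce(discriminator(real_audio), d_targets(True, 1)) +
              bce(discriminator(x1.detach()), d_targets(False, 1)) +
              bce(discriminator(x2.detach()), d_targets(False, 2)))
    opt_d.zero_grad(); d_loss.backward(); opt_d.step()

    # Generator/decoder update: fool the discriminator on X2 and reconstruct the original speech with X1
    g_loss = (bce(discriminator(x2), d_targets(True, 2)) +
              nn.functional.l1_loss(x1, real_audio))
    opt_g.zero_grad(); g_loss.backward(); opt_g.step()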
In the embodiment of the invention, the training of the countermeasure generation network can be standardized according to the above training method, the training difficulty of the voice conversion model is reduced, and the conversion effect of the model can be improved.
It should be understood that, the sequence numbers of the steps in the foregoing embodiments do not imply an execution sequence, and the execution sequence of each process should be determined by its function and inherent logic, and should not constitute any limitation to the implementation process of the embodiments of the present invention.
In one embodiment, a tone color feature extraction device is provided, and the tone color feature extraction device corresponds to the tone color feature extraction method in the foregoing embodiment one to one. As shown in fig. 4, the timbre feature extraction device includes a speech data acquisition module 11, a difference calculation module 12, a loss value calculation module 13, a training module 14, and a speaker timbre feature information determination module 15. The functional modules are explained in detail as follows:
the voice data acquiring module 11 is configured to acquire voice data of at least two speakers, where the voice data of at least one speaker at least includes two voices, and the voice data is associated with speaker tag information.
And a difference value calculating module 12, configured to input the voice data into a preset bidirectional recurrent neural network, so as to convert the voice data into a continuous vector, quantize the continuous vector into a discrete vector of the content of the voice text, and calculate a difference value between the continuous vector and the discrete vector of the content of the voice text.
And a loss value calculating module 13, configured to calculate a loss value of a preset target optimization function according to the difference.
And the training module 14 is configured to adjust the parameter of the bidirectional recurrent neural network according to the loss value when the loss value does not meet a preset requirement, and train the bidirectional recurrent neural network with the adjusted parameter by using new voice data.
And the speaker tone characteristic information determining module 15 is configured to determine the difference as speaker tone characteristic information associated with the speaker tag information when the loss value meets a preset requirement.
In an embodiment, the voice data includes a first voice, a second voice, and a third voice; the first voice, the second voice and the first speaker tag information are associated, and the third voice and the second speaker tag information are associated.
The difference calculation module 12 includes: a first difference calculation unit, a second difference calculation unit, and a third difference calculation unit.
The first difference calculation unit is configured to convert the first speech into a first continuous vector, quantize the first continuous vector into a first speech text content discrete vector, and calculate a first difference between the first continuous vector and the first speech text content discrete vector.
And the second difference value calculating unit is used for converting the second voice into a second continuous vector, quantizing the second continuous vector into a second voice text content discrete vector, and calculating a second difference value between the second continuous vector and the second voice text content discrete vector.
And the third difference value calculating unit is used for converting the third voice into a third continuous vector, quantizing the third continuous vector into a third voice text content discrete vector, and calculating a third difference value between the third continuous vector and the third voice text content discrete vector.
The loss value calculation module 13 may be configured to:
calculating a loss value of a preset target optimization function according to the first difference, the second difference and the third difference, wherein the preset target optimization function is as follows:
L = -(y1 != y2)‖S_A(x1) - S_B(x1)‖ + (y1 == y2)‖S_A(x1) - S_A(x2)‖;
where L is the loss value; y1 denotes the first speaker; y2 denotes the second speaker; S_A(x1) denotes the first difference obtained after the first voice is processed by the preset bidirectional recurrent neural network; S_A(x2) denotes the second difference obtained after the second voice is processed by the preset bidirectional recurrent neural network; S_B(x1) denotes the third difference obtained after the third voice is processed by the preset bidirectional recurrent neural network.
In one embodiment, the training module 14 includes a first speaker timbre characteristic information determining unit and a second speaker timbre characteristic information determining unit.
The first speaker tone characteristic information determining unit is used for calculating the average value of the first difference value and the second difference value and determining the average value as first speaker tone characteristic information associated with the first speaker label information;
and the second speaker tone characteristic information determining unit is used for determining the third difference as second speaker tone characteristic information associated with the second speaker tag information.
In an embodiment, the above-mentioned timbre feature extracting apparatus further includes: the device comprises an acquisition module, a target speaker tone characteristic information acquisition module and a voice synthesis module.
And the acquisition module is used for acquiring the source speech data to be converted and the target speaker tag information.
And the target speaker tone characteristic information acquisition module is used for acquiring the corresponding relation between the speaker tag information and the speaker tone characteristic information and acquiring the target speaker tone characteristic information corresponding to the target speaker tag information according to the corresponding relation.
And the voice synthesis module is used for extracting the source voice text content discrete vector of the source voice data, and performing voice synthesis on the source voice text content discrete vector and the target speaker voice color characteristic information through a voice conversion model to obtain target voice data.
In an embodiment, the above-mentioned timbre feature extracting apparatus further includes: the system comprises a voice training data acquisition module, a speaker training tone characteristic information acquisition module and a voice conversion model training module.
The voice training data acquisition module is used for acquiring a plurality of voice training data and carrying out vector quantization processing on the voice training data to obtain a plurality of discrete vectors of training voice text contents; the plurality of voice training data carry training speaker tag information.
And the training speaker tone characteristic information acquisition module is used for acquiring training speaker tone characteristic information corresponding to the training speaker label information.
And the voice conversion model training module is used for training a preset confrontation generation network by using the discrete vectors of the contents of the training voice texts and the tone characteristic information of the training speaker so as to obtain the voice conversion model.
In an embodiment, the number of voice training data includes first voice training data; the training speaker voice tone characteristic information comprises first training speaker voice tone characteristic information and second training speaker voice tone characteristic information.
The voice conversion model training module comprises a generated voice data training unit, a reconstructed voice data training unit and a discriminator training unit.
And the generated voice data training unit is used for training a generator of a preset confrontation generation network by using the first voice training data and the second training speaker tone characteristic information to obtain generated voice data.
And the reconstructed voice data training unit is used for training a preset decoder by using the first voice training data and the tone characteristic information of the first training speaker to obtain reconstructed voice data.
And the discriminator training unit is used for training the discriminator of the preset countermeasure generating network by using the generated voice data, the reconstructed voice data and the voice training data, and obtaining a voice conversion model after training.
In an embodiment, the voice training data obtaining module may be configured to:
and carrying out normalization processing on the plurality of voice training data to obtain a plurality of training voice continuous vectors.
And searching discrete vectors of the training voice text content corresponding to the continuous vectors of the training voice according to a preset code book.
For the specific definition of the tone color feature extraction device, reference may be made to the above definition of the tone color feature extraction method, which is not described herein again. The respective modules in the above-mentioned timbre feature extraction device can be wholly or partially realized by software, hardware and a combination thereof. The modules can be embedded in a hardware form or independent from a processor in the computer device, and can also be stored in a memory in the computer device in a software form, so that the processor can call and execute operations corresponding to the modules.
In one embodiment, a computer device is provided, which may be a server, the internal structure of which may be as shown in fig. 5. The computer device includes a processor, a memory, a network interface, and a database connected by a system bus. Wherein the processor of the computer device is configured to provide computing and control capabilities. The memory of the computer device comprises a readable storage medium and an internal memory. The readable storage medium stores an operating system, computer readable instructions, and a database. The internal memory provides an environment for the operating system and execution of computer-readable instructions in the readable storage medium. The database of the computer device is used for storing data related to the tone color feature extraction method. The network interface of the computer device is used for communicating with an external terminal through a network connection. The computer readable instructions, when executed by a processor, implement a method of timbre feature extraction. The readable storage media provided by the present embodiment include nonvolatile readable storage media and volatile readable storage media.
In one embodiment, a computer device is provided, comprising a memory, a processor, and computer readable instructions stored on the memory and executable on the processor, the processor when executing the computer readable instructions implementing the steps of:
acquiring voice data of at least two speakers; the voice data of at least one speaker at least comprises two voices, and the voice data is associated with the tag information of the speaker;
inputting the voice data into a preset bidirectional cyclic neural network so as to convert the voice data into continuous vectors, quantizing the continuous vectors into discrete vectors of voice text contents, and calculating the difference value between the continuous vectors and the discrete vectors of the voice text contents;
calculating a loss value of a preset target optimization function according to the difference value;
when the loss value does not meet the preset requirement, adjusting the parameters of the bidirectional cyclic neural network according to the loss value, and training the bidirectional cyclic neural network with the adjusted parameters by using new voice data;
and when the loss value meets the preset requirement, determining the difference value as speaker tone characteristic information associated with the speaker tag information.
In one embodiment, one or more computer-readable storage media storing computer-readable instructions are provided, the readable storage media provided by the embodiments including non-volatile readable storage media and volatile readable storage media. The readable storage medium has stored thereon computer readable instructions which, when executed by one or more processors, perform the steps of:
acquiring voice data of at least two speakers; the voice data of at least one speaker at least comprises two voices, and the voice data is associated with the tag information of the speaker;
inputting the voice data into a preset bidirectional cyclic neural network so as to convert the voice data into continuous vectors, quantizing the continuous vectors into discrete vectors of voice text contents, and calculating the difference value between the continuous vectors and the discrete vectors of the voice text contents;
calculating a loss value of a preset target optimization function according to the difference value;
when the loss value does not meet the preset requirement, adjusting the parameters of the bidirectional cyclic neural network according to the loss value, and training the bidirectional cyclic neural network with the adjusted parameters by using new voice data;
and when the loss value meets the preset requirement, determining the difference value as speaker tone characteristic information associated with the speaker tag information.
It will be understood by those of ordinary skill in the art that all or part of the processes of the methods of the above embodiments may be implemented by hardware related to computer readable instructions, which may be stored in a non-volatile readable storage medium or a volatile readable storage medium, and when executed, the computer readable instructions may include processes of the above embodiments of the methods. Any reference to memory, storage, database, or other medium used in the embodiments provided herein may include non-volatile and/or volatile memory, among others. Non-volatile memory can include read-only memory (ROM), Programmable ROM (PROM), Electrically Programmable ROM (EPROM), Electrically Erasable Programmable ROM (EEPROM), or flash memory. Volatile memory can include Random Access Memory (RAM) or external cache memory. By way of illustration and not limitation, RAM is available in a variety of forms such as Static RAM (SRAM), Dynamic RAM (DRAM), Synchronous DRAM (SDRAM), Double Data Rate SDRAM (DDRSDRAM), Enhanced SDRAM (ESDRAM), Synchronous Link DRAM (SLDRAM), Rambus Direct RAM (RDRAM), direct bus dynamic RAM (DRDRAM), and memory bus dynamic RAM (RDRAM).
It will be apparent to those skilled in the art that, for convenience and brevity of description, only the above-mentioned division of the functional units and modules is illustrated, and in practical applications, the above-mentioned function distribution may be performed by different functional units and modules according to needs, that is, the internal structure of the apparatus is divided into different functional units or modules to perform all or part of the above-mentioned functions.
The above-mentioned embodiments are only used for illustrating the technical solutions of the present invention, and not for limiting the same; although the present invention has been described in detail with reference to the foregoing embodiments, it will be understood by those of ordinary skill in the art that: the technical solutions described in the foregoing embodiments may still be modified, or some technical features may be equivalently replaced; such modifications and substitutions do not substantially depart from the spirit and scope of the embodiments of the present invention, and are intended to be included within the scope of the present invention.

Claims (10)

1. A method for extracting timbre features, comprising:
acquiring voice data of at least two speakers; the voice data of at least one speaker at least comprises two voices, and the voice data is associated with the tag information of the speaker;
inputting the voice data into a preset bidirectional cyclic neural network so as to convert the voice data into continuous vectors, quantizing the continuous vectors into discrete vectors of voice text contents, and calculating the difference value between the continuous vectors and the discrete vectors of the voice text contents;
calculating a loss value of a preset target optimization function according to the difference value;
when the loss value does not meet the preset requirement, adjusting the parameters of the bidirectional cyclic neural network according to the loss value, and training the bidirectional cyclic neural network with the adjusted parameters by using new voice data;
and when the loss value meets the preset requirement, determining the difference value as speaker tone characteristic information associated with the speaker tag information.
2. The tone color feature extraction method according to claim 1, wherein the voice data includes a first voice, a second voice, and a third voice; the first voice, the second voice and the first speaker tag information are associated, and the third voice and the second speaker tag information are associated;
the converting the voice data into a continuous vector, quantizing the continuous vector into a discrete vector of voice text content, and calculating a difference value between the continuous vector and the discrete vector of voice text content includes:
converting the first voice into a first continuous vector, quantizing the first continuous vector into a first voice text content discrete vector, and calculating a first difference value between the first continuous vector and the first voice text content discrete vector;
converting the second voice into a second continuous vector, quantizing the second continuous vector into a second voice text content discrete vector, and calculating a second difference value between the second continuous vector and the second voice text content discrete vector;
converting the third voice into a third continuous vector, quantizing the third continuous vector into a third voice text content discrete vector, and calculating a third difference value between the third continuous vector and the third voice text content discrete vector;
the calculating a loss value of a preset objective optimization function according to the difference value includes:
calculating a loss value of a preset target optimization function according to the first difference, the second difference and the third difference, wherein the preset target optimization function is as follows:
L = -(y1 != y2)‖S_A(x1) - S_B(x1)‖ + (y1 == y2)‖S_A(x1) - S_A(x2)‖;
wherein L is the loss value;
y1 denotes the first speaker;
y2 denotes the second speaker;
S_A(x1) denotes the first difference obtained after the first voice is processed by the preset bidirectional recurrent neural network;
S_A(x2) denotes the second difference obtained after the second voice is processed by the preset bidirectional recurrent neural network;
S_B(x1) denotes the third difference obtained after the third voice is processed by the preset bidirectional recurrent neural network.
3. The method for extracting timbre features of claim 2 wherein determining the difference as speaker timbre feature information associated with the speaker tag information when the loss value meets a preset requirement comprises:
calculating an average value of the first difference value and the second difference value, and determining the average value as first speaker tone characteristic information associated with first speaker tag information;
and determining the third difference as second speaker tone characteristic information associated with the second speaker tag information.
4. The method for extracting timbre features of claim 1, wherein when the loss value satisfies a predetermined requirement, determining the difference value as speaker timbre feature information associated with the speaker tag information further comprises:
obtaining source speech data to be converted and target speaker tag information;
acquiring the corresponding relation between the speaker tag information and speaker tone characteristic information, and acquiring target speaker tone characteristic information corresponding to the target speaker tag information according to the corresponding relation;
and extracting source speech text content discrete vectors of the source speech data, and performing speech synthesis on the source speech text content discrete vectors and the target speaker tone characteristic information through a speech conversion model to obtain target speech data.
5. The method for extracting timbre features of claim 4, wherein before performing speech synthesis based on the discrete vectors of the source speech text content and the timbre feature information of the target speaker to obtain target speech data, the method comprises:
acquiring a plurality of voice training data, and performing vector quantization processing on the plurality of voice training data to obtain a plurality of discrete vectors of training voice text contents; the voice training data carries training speaker label information;
acquiring training speaker tone characteristic information corresponding to the training speaker label information;
and training a preset countermeasure generation network by using the discrete vectors of the contents of the training voice texts and the tone characteristic information of the training speaker to obtain the voice conversion model.
6. The method of timbre feature extraction as claimed in claim 5, wherein the plurality of pieces of speech training data comprises first speech training data, and the training speaker timbre feature information comprises first training speaker timbre feature information and second training speaker timbre feature information;
the training a preset generative adversarial network by using the plurality of training speech text-content discrete vectors and the training speaker timbre feature information to obtain the speech conversion model comprises:
training a generator of the preset generative adversarial network by using the first speech training data and the second training speaker timbre feature information to obtain generated speech data;
training a preset decoder by using the first speech training data and the first training speaker timbre feature information to obtain reconstructed speech data; and
training a discriminator of the preset generative adversarial network by using the generated speech data, the reconstructed speech data and the speech training data, the speech conversion model being obtained after the training is completed.
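A rough PyTorch-style training step matching the three sub-steps of this claim is sketched below. The generator, decoder and discriminator are assumed to be ordinary `nn.Module` instances; the specific losses (binary cross-entropy for the discriminator, L1 for reconstruction) and the decision to label reconstructed speech as "fake" for the discriminator are assumptions not stated in the patent.

```python
import torch
import torch.nn.functional as F

def adversarial_training_step(generator, decoder, discriminator,
                              content_tokens, first_timbre, second_timbre, real_wav):
    # generator: content of the first speech training data + second speaker's timbre
    generated = generator(content_tokens, second_timbre)
    # decoder: same content + the first speaker's own timbre -> reconstruction
    reconstructed = decoder(content_tokens, first_timbre)
    recon_loss = F.l1_loss(reconstructed, real_wav)

    # discriminator is trained on generated, reconstructed and real speech data
    real_logits = discriminator(real_wav)
    gen_logits = discriminator(generated.detach())
    rec_logits = discriminator(reconstructed.detach())
    d_loss = (F.binary_cross_entropy_with_logits(real_logits, torch.ones_like(real_logits))
              + F.binary_cross_entropy_with_logits(gen_logits, torch.zeros_like(gen_logits))
              + F.binary_cross_entropy_with_logits(rec_logits, torch.zeros_like(rec_logits)))

    # generator tries to make its output pass as real speech
    fooled_logits = discriminator(generated)
    g_loss = F.binary_cross_entropy_with_logits(fooled_logits, torch.ones_like(fooled_logits)) + recon_loss
    return g_loss, d_loss
```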
7. The method for extracting timbre features as claimed in claim 5, wherein the performing vector quantization on the plurality of pieces of speech training data to obtain a plurality of training speech text-content discrete vectors comprises:
normalizing the plurality of pieces of speech training data to obtain a plurality of training speech continuous vectors; and
searching a preset codebook for the training speech text-content discrete vectors corresponding to the training speech continuous vectors.
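The codebook search could look roughly like the sketch below; per-dimension normalization and nearest-neighbour lookup under the Euclidean distance are assumed choices, since the claim only requires normalization followed by a codebook search.

```python
import torch

def quantize_with_codebook(speech_frames, codebook):
    """speech_frames: (frames, dim) features; codebook: (num_codes, dim)."""
    # normalization step (zero mean, unit variance per dimension) - an assumed choice
    normalized = (speech_frames - speech_frames.mean(dim=0)) / (speech_frames.std(dim=0) + 1e-8)
    # nearest codebook entry per frame under L2 distance
    nearest = torch.cdist(normalized, codebook).argmin(dim=1)
    return codebook[nearest]   # training speech text-content discrete vectors
```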
8. A timbre feature extraction device, characterized by comprising:
a voice data acquisition module, configured to acquire voice data of at least two speakers, wherein the voice data of at least one speaker comprises at least two voices, and the voice data is associated with speaker tag information;
a difference calculation module, configured to input the voice data into a preset bidirectional recurrent neural network to convert the voice data into a continuous vector, quantize the continuous vector into a voice text-content discrete vector, and calculate a difference between the continuous vector and the voice text-content discrete vector;
a loss value calculation module, configured to calculate a loss value of a preset target optimization function according to the difference;
a training module, configured to adjust parameters of the bidirectional recurrent neural network according to the loss value when the loss value does not meet a preset requirement, and train the parameter-adjusted bidirectional recurrent neural network by using new voice data; and
a speaker timbre feature information determining module, configured to determine the difference as speaker timbre feature information associated with the speaker tag information when the loss value meets the preset requirement.
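A compact sketch of how the difference calculation module might be realized is shown below: a bidirectional GRU produces the continuous vectors, a learned codebook quantizes them, and the module returns the difference. The layer sizes, the choice of GRU, and the nearest-neighbour quantization are all illustrative assumptions rather than details taken from the claims.

```python
import torch
import torch.nn as nn

class DifferenceModule(nn.Module):
    """Assumed realization of the difference calculation module of claim 8."""
    def __init__(self, feat_dim=80, hidden=128, num_codes=512):
        super().__init__()
        self.rnn = nn.GRU(feat_dim, hidden, batch_first=True, bidirectional=True)
        self.codebook = nn.Parameter(torch.randn(num_codes, 2 * hidden))

    def forward(self, frames):                    # frames: (batch, time, feat_dim)
        continuous, _ = self.rnn(frames)          # continuous vectors from the bidirectional RNN
        flat = continuous.reshape(-1, continuous.size(-1))
        nearest = torch.cdist(flat, self.codebook).argmin(dim=1)
        discrete = self.codebook[nearest].view_as(continuous)   # quantized text-content vectors
        return continuous - discrete              # difference passed on to the loss value calculation module
```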
9. A computer device comprising a memory, a processor, and computer readable instructions stored in the memory and executable on the processor, wherein the processor, when executing the computer readable instructions, implements the timbre feature extraction method as claimed in any one of claims 1 to 7.
10. One or more readable storage media storing computer-readable instructions that, when executed by one or more processors, cause the one or more processors to perform the method of timbre feature extraction as claimed in any one of claims 1 to 7.
CN202111130551.6A 2021-09-26 2021-09-26 Tone feature extraction method, device, computer equipment and storage medium Pending CN113870875A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202111130551.6A CN113870875A (en) 2021-09-26 2021-09-26 Tone feature extraction method, device, computer equipment and storage medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202111130551.6A CN113870875A (en) 2021-09-26 2021-09-26 Tone feature extraction method, device, computer equipment and storage medium

Publications (1)

Publication Number Publication Date
CN113870875A true CN113870875A (en) 2021-12-31

Family

ID=78994702

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202111130551.6A Pending CN113870875A (en) 2021-09-26 2021-09-26 Tone feature extraction method, device, computer equipment and storage medium

Country Status (1)

Country Link
CN (1) CN113870875A (en)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN115273777A (en) * 2022-07-22 2022-11-01 魔珐(上海)信息科技有限公司 Updating method and application method of sound conversion model


Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination