
CN113539233B - Voice processing method and device and electronic equipment - Google Patents

Voice processing method and device and electronic equipment

Info

Publication number
CN113539233B
Authority
CN
China
Prior art keywords
voice
text information
language
training data
conversion model
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202010301719.4A
Other languages
Chinese (zh)
Other versions
CN113539233A (en)
Inventor
李栋梁
刘恺
周明
陈伟
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing Sogou Technology Development Co Ltd
Original Assignee
Beijing Sogou Technology Development Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beijing Sogou Technology Development Co Ltd filed Critical Beijing Sogou Technology Development Co Ltd
Priority to CN202010301719.4A
Priority to PCT/CN2021/070432 (WO2021208531A1)
Publication of CN113539233A
Application granted
Publication of CN113539233B


Classifications

    • G - PHYSICS
    • G10 - MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L - SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L 13/00 - Speech synthesis; Text to speech systems
    • G10L 13/08 - Text analysis or generation of parameters for speech synthesis out of text, e.g. grapheme to phoneme translation, prosody generation or stress or intonation determination
    • G10L 13/086 - Detection of language
    • G10L 15/00 - Speech recognition
    • G10L 15/26 - Speech to text systems
    • G10L 17/00 - Speaker identification or verification techniques
    • G10L 17/04 - Training, enrolment or model building

Landscapes

  • Engineering & Computer Science (AREA)
  • Health & Medical Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Human Computer Interaction (AREA)
  • Physics & Mathematics (AREA)
  • Acoustics & Sound (AREA)
  • Multimedia (AREA)
  • Computational Linguistics (AREA)
  • Machine Translation (AREA)

Abstract

The embodiment of the invention provides a voice processing method and device and electronic equipment. The method includes: acquiring text information to be converted, and determining a source language corresponding to the text information and a target user to be converted; and converting the text information into target voice data of the target user pronouncing in the source language according to the text information and a target conversion model corresponding to the target user. The target conversion model is obtained by adaptively training a trained universal conversion model on single-language voice data uttered by the target user, and the universal conversion model is trained on voice data covering N languages; the source language is one of the N languages, and N is an integer greater than 1. Therefore, multilingual text can be converted into target voice data of the target user in the corresponding language even when only single-language voice data of the target user is available, realizing multilingual voice conversion.

Description

Voice processing method and device and electronic equipment
Technical Field
The present invention relates to the field of data processing technologies, and in particular to a voice processing method and device, and electronic equipment.
Background
With the development of voice processing technology, voice conversion technology is also widely used. For example, input method applications adopt voice conversion technology to realize voice-changing input; instant messaging software applies voice conversion technology during video or voice calls; and so on.
Voice conversion technology refers to converting the voice of one person (the source user) into the voice of another person (the target user). In the prior art, voice data of a target user is generally collected and used to train a model; in subsequent application, after voice data of a source user is obtained, the trained model performs voice conversion on it to obtain voice data of the target user.
However, if the training data is single-language voice data of the target user, the prior art can only convert voice data uttered by the source user in that language into voice data uttered by the target user in that language; voice data uttered by the source user in other languages cannot be converted into voice data uttered by the target user in those languages. For example, if the training data is Chinese voice data of the target user, only voice data of a source user pronouncing in Chinese can be converted into voice data of the target user pronouncing in Chinese; voice data of the source user pronouncing in English cannot be converted into voice data of the target user pronouncing in English.
Disclosure of Invention
The embodiment of the invention provides a voice processing method to realize multilingual voice conversion when only single-language voice data of the target user is available.
Correspondingly, the embodiment of the invention also provides a voice processing device and electronic equipment to ensure the implementation and application of the above method.
In order to solve the above problems, an embodiment of the present invention discloses a voice processing method, which specifically includes: acquiring text information to be converted, and determining a source language corresponding to the text information and a target user to be converted; and converting the text information into target voice data of the target user pronouncing in the source language according to the text information and a target conversion model corresponding to the target user; the target conversion model is obtained by adaptively training a trained universal conversion model on single-language voice data of the target user, and the universal conversion model is trained on voice data covering N languages; the source language is one of the N languages, and N is an integer greater than 1.
Optionally, the obtaining of the text information to be converted includes: acquiring source voice data of a source user, where the source user and the target user are the same user or different users; and performing voice recognition on the source voice data to determine the corresponding text information to be converted.
Optionally, the performing voice recognition on the source voice data to determine the corresponding text information to be converted includes: inputting the source voice data into N voice recognizers respectively to obtain N corresponding voice recognition results, where each voice recognizer corresponds to one language; and splicing the N voice recognition results to obtain the text information to be converted.
Optionally, the performing voice recognition on the source voice data to determine the corresponding text information to be converted includes: inputting the source voice data into a voice recognizer to obtain a corresponding voice recognition result, where the voice recognizer corresponds to N languages; and determining the voice recognition result as the text information to be converted.
Optionally, the converting the text information into the target voice data of the target user pronouncing in the source language according to the text information and the target conversion model corresponding to the target user includes: converting the text information by using the target conversion model, and outputting acoustic features of the target user pronouncing the text information in the source language; and synthesizing the acoustic features by using a synthesizer to obtain the target voice data of the target user pronouncing in the source language.
Optionally, the converting the text information by using the target conversion model and outputting the acoustic features of the target user pronouncing the text information in the source language includes: inputting the text information, a language identifier corresponding to the source language, and a user identifier corresponding to the target user into the target conversion model; searching, by the target conversion model, for target model parameters matched with the language identifier and the user identifier; and converting, by the target conversion model, the text information using the target model parameters, and outputting the acoustic features of the target user pronouncing in the source language.
Optionally, the method further includes the following step of training the universal conversion model: collecting X pieces of first voice training data of M users, where each piece of first voice training data corresponds to one language and the X pieces jointly cover the N languages; extracting reference acoustic features of each piece of first voice training data, and labeling each piece of first voice training data and its reference acoustic features with the corresponding user identifier and language identifier; recognizing text information corresponding to each piece of first voice training data; and training the universal conversion model according to the text information, reference acoustic features, user identifiers, and language identifiers corresponding to the first voice training data.
Optionally, the method further includes the following step of adaptively training the trained universal conversion model according to single-language voice data of the target user to generate the target conversion model: acquiring Y pieces of second voice training data of the target user, where the Y pieces of second voice training data all correspond to the same language; extracting reference acoustic features of each piece of second voice training data, and labeling each piece of second voice training data and its reference acoustic features with the user identifier and language identifier of the target user; recognizing text information corresponding to each piece of second voice training data; and adaptively training the trained universal conversion model according to the text information, reference acoustic features, user identifiers, and language identifiers corresponding to the second voice training data to obtain the target conversion model.
The embodiment of the invention also discloses a voice processing device, which specifically includes: an acquisition module, configured to acquire text information to be converted; an information determining module, configured to determine a source language corresponding to the text information and a target user to be converted; and a voice conversion module, configured to convert the text information into target voice data of the target user pronouncing in the source language according to the text information and a target conversion model corresponding to the target user; the target conversion model is obtained by adaptively training a trained universal conversion model on single-language voice data of the target user, and the universal conversion model is trained on voice data covering N languages; the source language is one of the N languages, and N is an integer greater than 1.
Optionally, the acquiring module includes: the voice acquisition sub-module is used for acquiring source voice data of a source user, wherein the source user and a target user are the same user or different users; and the recognition sub-module is used for carrying out voice recognition on the source voice data and determining corresponding text information to be converted.
Optionally, the identifying sub-module includes: the first voice recognition unit is used for inputting the source voice data into N voice recognizers respectively to obtain N corresponding voice recognition results, wherein one voice recognizer corresponds to one language; and splicing the N voice recognition results to obtain text information to be converted.
Optionally, the identifying sub-module includes: the second voice recognition unit is used for inputting the source voice data into a voice recognizer to obtain a corresponding voice recognition result, wherein the voice recognizer corresponds to N languages; and determining the voice recognition result as text information to be converted.
Optionally, the voice conversion module includes: a feature generation sub-module, configured to convert the text information by using the target conversion model and output acoustic features of the target user pronouncing in the source language; and a voice synthesis sub-module, configured to synthesize the acoustic features by using a synthesizer to obtain the target voice data of the target user pronouncing in the source language.
Optionally, the feature generation sub-module is configured to input the text information, a language identifier corresponding to the source language, and a user identifier corresponding to the target user into the target conversion model; the target conversion model searches for target model parameters matched with the language identifier and the user identifier, converts the text information using the target model parameters, and outputs acoustic features of the target user pronouncing in the source language.
Optionally, the device further includes: a first training module, configured to train the universal conversion model; the first training module is specifically configured to collect X pieces of first voice training data of M users, where each piece of first voice training data corresponds to one language and the X pieces jointly cover the N languages; extract reference acoustic features of each piece of first voice training data, and label each piece of first voice training data and its reference acoustic features with the corresponding user identifier and language identifier; recognize text information corresponding to each piece of first voice training data; and train the universal conversion model according to the text information, reference acoustic features, user identifiers, and language identifiers corresponding to the first voice training data.
Optionally, the device further includes: a second training module, configured to adaptively train the trained universal conversion model according to single-language voice data of the target user to generate the target conversion model; the second training module is specifically configured to acquire Y pieces of second voice training data of the target user, where the Y pieces of second voice training data all correspond to the same language; extract reference acoustic features of each piece of second voice training data, and label each piece of second voice training data and its reference acoustic features with the user identifier and language identifier of the target user; recognize text information corresponding to each piece of second voice training data; and adaptively train the trained universal conversion model according to the text information, reference acoustic features, user identifiers, and language identifiers corresponding to the second voice training data to obtain the target conversion model.
The embodiment of the invention also discloses a readable storage medium; when instructions in the storage medium are executed by a processor of an electronic device, the electronic device is enabled to perform the voice processing method according to any one of the embodiments of the invention.
The embodiment of the invention also discloses an electronic device, including a memory and one or more programs, where the one or more programs are stored in the memory and configured to be executed by one or more processors, and the one or more programs include instructions for: acquiring text information to be converted, and determining a source language corresponding to the text information and a target user to be converted; and converting the text information into target voice data of the target user pronouncing in the source language according to the text information and a target conversion model corresponding to the target user; the target conversion model is obtained by adaptively training a trained universal conversion model on single-language voice data of the target user, and the universal conversion model is trained on voice data covering N languages; the source language is one of the N languages, and N is an integer greater than 1.
Optionally, the obtaining of the text information to be converted includes: acquiring source voice data of a source user, where the source user and the target user are the same user or different users; and performing voice recognition on the source voice data to determine the corresponding text information to be converted.
Optionally, the performing voice recognition on the source voice data to determine the corresponding text information to be converted includes: inputting the source voice data into N voice recognizers respectively to obtain N corresponding voice recognition results, where each voice recognizer corresponds to one language; and splicing the N voice recognition results to obtain the text information to be converted.
Optionally, the performing voice recognition on the source voice data to determine the corresponding text information to be converted includes: inputting the source voice data into a voice recognizer to obtain a corresponding voice recognition result, where the voice recognizer corresponds to N languages; and determining the voice recognition result as the text information to be converted.
Optionally, the converting the text information into the target voice data of the target user pronouncing in the source language according to the text information and the target conversion model corresponding to the target user includes: converting the text information by using the target conversion model, and outputting acoustic features of the target user pronouncing the text information in the source language; and synthesizing the acoustic features by using a synthesizer to obtain the target voice data of the target user pronouncing in the source language.
Optionally, the converting the text information by using the target conversion model and outputting the acoustic features of the target user pronouncing the text information in the source language includes: inputting the text information, a language identifier corresponding to the source language, and a user identifier corresponding to the target user into the target conversion model; searching, by the target conversion model, for target model parameters matched with the language identifier and the user identifier; and converting, by the target conversion model, the text information using the target model parameters, and outputting the acoustic features of the target user pronouncing in the source language.
Optionally, instructions for training the universal conversion model are also included: collecting X pieces of first voice training data of M users, where each piece of first voice training data corresponds to one language and the X pieces jointly cover the N languages; extracting reference acoustic features of each piece of first voice training data, and labeling each piece of first voice training data and its reference acoustic features with the corresponding user identifier and language identifier; recognizing text information corresponding to each piece of first voice training data; and training the universal conversion model according to the text information, reference acoustic features, user identifiers, and language identifiers corresponding to the first voice training data.
Optionally, the following instructions for adaptively training the trained universal conversion model according to single-language voice data of the target user to generate the target conversion model are also included: acquiring Y pieces of second voice training data of the target user, where the Y pieces of second voice training data all correspond to the same language; extracting reference acoustic features of each piece of second voice training data, and labeling each piece of second voice training data and its reference acoustic features with the user identifier and language identifier of the target user; recognizing text information corresponding to each piece of second voice training data; and adaptively training the trained universal conversion model according to the text information, reference acoustic features, user identifiers, and language identifiers corresponding to the second voice training data to obtain the target conversion model.
The embodiment of the invention has the following advantages:
in the embodiment of the invention, text information to be converted can be acquired, and the source language corresponding to the text information and the target user to be converted can be determined; the text information is then converted into target voice data of the target user pronouncing in the source language according to the text information and a target conversion model corresponding to the target user; the target conversion model is obtained by adaptively training a trained universal conversion model on single-language voice data uttered by the target user, and the universal conversion model is trained on voice data covering N languages; the source language is one of the N languages, and N is an integer greater than 1; therefore, multilingual text can be converted into target voice data of the target user in the corresponding language even when only single-language voice data of the target user is available, realizing multilingual voice conversion.
Drawings
FIG. 1 is a flow chart of the steps of an embodiment of a voice processing method of the present invention;
FIG. 2 is a flow chart of the steps of an embodiment of a model training method of the present invention;
FIG. 3 is a flow chart of the steps of an embodiment of a model adaptive training method of the present invention;
FIG. 4 is a flow chart of the steps of an alternative embodiment of a voice processing method of the present invention;
FIG. 5a is a process schematic of a voice processing method of the present invention;
FIG. 5b is a process schematic of another voice processing method of the present invention;
FIG. 6 is a block diagram of an embodiment of a voice processing device of the present invention;
FIG. 7 is a block diagram of an alternative embodiment of a voice processing device of the present invention;
FIG. 8 is a block diagram of an electronic device for voice processing according to an exemplary embodiment;
FIG. 9 is a schematic structural diagram of an electronic device for voice processing according to another exemplary embodiment of the present invention.
Detailed Description
In order that the above-recited objects, features, and advantages of the present invention will become more readily apparent, a more particular description of the invention is given below with reference to the appended drawings and the following detailed description.
Referring to fig. 1, a flowchart illustrating the steps of an embodiment of a voice processing method of the present invention is shown; the method may specifically include the following steps:
Step 102, obtaining text information to be converted.
In the embodiment of the present invention, text information to be converted into voice data may be obtained, and then the voice conversion may be performed on the text information with reference to steps 104 to 106.
Step 104, determining the source language corresponding to the text information and the target user to be converted.
In the embodiment of the invention, after the text information to be converted is acquired, the source language corresponding to the text information and the target user to be converted can be determined, which facilitates subsequently determining into which user's voice, and in which language, the text information should be converted.
Step 106, converting the text information into target voice data of the target user pronouncing in the source language according to the text information and a target conversion model corresponding to the target user; the target conversion model is obtained by adaptively training a trained universal conversion model on single-language voice data of the target user, and the universal conversion model is trained on voice data covering N languages; the source language is one of the N languages, and N is an integer greater than 1.
In the embodiment of the invention, the universal conversion model can be trained in advance on voice data covering N languages to obtain a trained universal conversion model. The trained universal conversion model is then adaptively trained on single-language voice data uttered by the target user to obtain the target conversion model corresponding to the target user. The training and adaptive training processes of the model are described later.
The target conversion model corresponding to the target user can then be used to convert the text information to obtain a corresponding conversion result, and the text information is converted into voice data of the target user pronouncing in the source language (hereinafter referred to as target voice data) according to the conversion result. Here N is an integer greater than 1, and the source language is one of the N languages; therefore, multilingual text information can be converted into target voice data of the target user in the corresponding language even when only single-language voice data of the target user exists.
The source language may be the same as or different from the language of the voice data used for adaptively training the trained universal conversion model, which is not limited in the embodiment of the present invention.
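To make this inference flow concrete, the following is a minimal illustrative sketch (the patent discloses no code); the target-model object, the predict and synthesize methods, and all other names below are hypothetical stand-ins introduced here, not an API disclosed by the patent.

```python
# Minimal sketch of the inference flow of steps 102-106; every name here is
# a hypothetical stand-in, since the patent does not prescribe a concrete API.

def convert_text_to_target_speech(text, source_language_id, target_user_id,
                                  target_model, synthesizer):
    """Convert text into voice data of the target user in the source language."""
    # Step 106a: the target conversion model maps (text, language id, user id)
    # to acoustic features of the target user pronouncing the text.
    acoustic_features = target_model.predict(
        text=text,
        language_id=source_language_id,
        user_id=target_user_id,
    )
    # Step 106b: a synthesizer (vocoder) turns the acoustic features into audio.
    return synthesizer.synthesize(acoustic_features)
```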
In summary, in the embodiment of the invention, text information to be converted can be acquired, and the source language corresponding to the text information and the target user to be converted can be determined; the text information is then converted into target voice data of the target user pronouncing in the source language according to the text information and a target conversion model corresponding to the target user; the target conversion model is obtained by adaptively training a trained universal conversion model on single-language voice data uttered by the target user, and the universal conversion model is trained on voice data covering N languages; the source language is one of the N languages, and N is an integer greater than 1; therefore, multilingual text information can be converted into target voice data of the target user in the corresponding language even when only single-language voice data of the target user is available, realizing multilingual voice conversion.
The following describes how to train the universal conversion model.
Referring to fig. 2, a flowchart illustrating the steps of an embodiment of a model training method of the present invention is shown; the method may specifically include the following steps:
Step 202, collecting X pieces of first voice training data of M users, where one piece of first voice training data corresponds to one language, and X pieces of first voice training data correspond to N languages.
In the embodiment of the invention, M, X, and N are all positive integers and can be set as required; for example, M is set to 20, X is set to 1000, and N is set to 5 (e.g., the 5 languages Chinese, English, Japanese, Korean, and Russian); the embodiment of the present invention is not limited in this regard.
After M, X, and N are determined, x(i) pieces of voice data may be collected for the i-th of the M users, where i ranges from 1 to M and the per-user counts sum to X. The X pieces of voice data should satisfy the requirement that each piece corresponds to exactly one language and the X pieces jointly cover the N languages. Each of the X pieces of voice data may then be taken as one piece of first voice training data, thereby obtaining X pieces of first voice training data.
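As a minimal sketch only, the X pieces of first voice training data and their labels could be organized as one record per utterance; the field names below are assumptions made for illustration, not terms fixed by the patent.

```python
# Hypothetical record layout for the first voice training data; field names
# are illustrative assumptions, not part of the patent.
from dataclasses import dataclass

@dataclass
class FirstVoiceTrainingItem:
    audio_path: str    # one utterance, uttered in exactly one language
    user_id: int       # which of the M users spoke it (0..M-1)
    language_id: int   # which of the N languages it is in (0..N-1)

def covers_all_languages(items, n_languages):
    # The X items must jointly cover all N languages.
    return {item.language_id for item in items} == set(range(n_languages))
```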
Step 204, extracting the reference acoustic features corresponding to each piece of first voice training data, and labeling the corresponding user identification and language identification for each piece of first voice training data and the corresponding reference acoustic features.
In the embodiment of the invention, the output of the universal conversion model is an acoustic feature, and the acoustic feature refers to a feature which can be used for synthesizing voice data. In order to train the universal conversion model, corresponding acoustic features can be extracted from each piece of first voice training data respectively to serve as reference acoustic features, so that the universal conversion model can be trained reversely by comparing the reference acoustic features with acoustic features output by the universal conversion model.
Because the acoustic features of the same text information differ across users and across languages, and in order to train the universal conversion model to learn the acoustic features of different users in different languages, a corresponding user identifier can be assigned to each user in advance, and a corresponding language identifier to each language; the user identifier uniquely identifies a user, and the language identifier uniquely identifies a language. Each piece of first voice training data and its reference acoustic features are then labeled with the user identifier of the corresponding user and the language identifier of the corresponding language, and the universal conversion model is trained with the X pieces of first voice training data labeled with user identifiers and language identifiers and their reference acoustic features; reference may be made to steps 206-208.
Step 206, identifying text information corresponding to each piece of first voice training data.
In the embodiment of the invention, voice recognition can be performed on each piece of first voice training data to determine the corresponding text information, and the universal conversion model can then be trained according to the text information of each piece of first voice training data.
In one example of the present invention, one way of recognizing the text information corresponding to the first voice training data may refer to the following sub-steps 22-24; one piece of first voice training data is described below as an example.
Sub-step 22, inputting the first voice training data into N voice recognizers respectively to obtain N corresponding voice recognition results, where each voice recognizer corresponds to one language;
Sub-step 24, splicing the N voice recognition results to obtain the corresponding text information.
In the embodiment of the present invention, the first voice training data may be input to N voice recognizers, each of which is a single-language voice recognizer. Each voice recognizer performs voice recognition on the first voice training data and outputs a corresponding voice recognition result; the voice recognition result can be text-encoding information or text itself. When the voice recognition result is text-encoding information, the text-encoding information output by the voice recognizers can be spliced in a preset order to obtain the text information corresponding to the first voice training data; the preset order may be set as required, which is not limited in the embodiment of the present invention. When the voice recognition result is text, the voice recognition results output by the voice recognizers can first be encoded (e.g., one-hot encoded or converted into word vectors) and the encoded results then spliced.
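A hedged sketch of sub-steps 22-24 follows; it assumes each single-language recognizer exposes a hypothetical `encode` method returning an array-like text encoding, and that the N encodings share the same leading (time) axis so they can be spliced along the feature axis.

```python
# Sketch of sub-steps 22-24: run N single-language recognizers and splice
# their text encodings in a preset, fixed language order. The `encode`
# method and the shared time axis are assumptions made for illustration.
import numpy as np

def recognize_and_splice(audio, recognizers):
    # One encoding per language, in a fixed preset order.
    encodings = [recognizer.encode(audio) for recognizer in recognizers]
    # Concatenate along the feature axis so every utterance yields an
    # encoding with the same layout regardless of its language.
    return np.concatenate(encodings, axis=-1)
```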
In one example of the present invention, still another way of recognizing the text information corresponding to the first voice training data may refer to the following sub-steps 42-44; one piece of first voice training data is described below as an example.
Sub-step 42, inputting the first voice training data into a voice recognizer to obtain a corresponding voice recognition result, where the voice recognizer corresponds to N languages.
Sub-step 44, determining the voice recognition result as the text information.
In the embodiment of the invention, a voice recognizer capable of recognizing N languages can also be used to perform voice recognition on the first voice training data; that is, the first voice training data is input to the voice recognizer capable of recognizing N languages to obtain a corresponding voice recognition result. The voice recognition result may be text-encoding information; such text-encoding information differs from that output by a single-language voice recognizer both in its dimensionality and in the meaning of each dimension.
In the embodiment of the invention, whether the text information is obtained by inputting the first voice training data to N voice recognizers and splicing their voice recognition results, or by inputting it to one N-language voice recognizer, the text information contains correlations between the languages. Training the universal conversion model with such text information therefore lets the trained universal conversion model learn the associations among the languages; after the trained universal conversion model is adaptively trained with single-language voice data of a target user to obtain the target conversion model, the target conversion model can realize multilingual voice conversion.
Step 208, training the universal conversion model according to the text information, the reference acoustic feature, the user identifier and the language identifier corresponding to the first voice training data.
Taking one piece of first voice training data as an example, how to train the universal conversion model is now described. In the embodiment of the invention, the text information, user identifier, and language identifier corresponding to the first voice training data can be input into the universal conversion model; the universal conversion model performs forward computation on the text information and outputs predicted acoustic features corresponding to the first voice training data. During the forward computation, model parameters can be associated with the user identifier and language identifier of the first voice training data. The predicted acoustic features are then compared with the reference acoustic features corresponding to the first voice training data, and the model parameters of the universal conversion model that correspond to the user identifier and language identifier of the first voice training data are adjusted. The universal conversion model can be trained continuously with the X pieces of first voice training data until an end condition is met, yielding the trained universal conversion model.
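As one plausible realization only (the patent mandates neither a framework nor a loss function), a PyTorch-style sketch of a single training step might look as follows; regressing the reference acoustic features with a mean-squared-error loss is an assumption.

```python
# Hedged sketch of one training step of step 208. The model is assumed to be
# a neural network conditioned on user and language identifiers and trained
# to regress the reference acoustic features; the MSE loss is an assumption.
import torch.nn.functional as F

def training_step(model, optimizer, batch):
    # batch: text encodings, user ids, language ids, reference acoustic features.
    predicted = model(batch["text"], batch["user_id"], batch["language_id"])
    # Compare predicted acoustic features with the reference acoustic
    # features extracted from the first voice training data.
    loss = F.mse_loss(predicted, batch["reference_acoustic_features"])
    optimizer.zero_grad()
    loss.backward()   # adjust the parameters associated with these identifiers
    optimizer.step()
    return loss.item()
```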
In one embodiment of the invention, after the trained universal conversion model is obtained, it can be adaptively trained with single-language voice data of a target user to obtain a target conversion model capable of predicting multilingual acoustic features of the target user, as follows:
referring to fig. 3, a flowchart of the steps of one embodiment of a model adaptive training method of the present invention is shown.
Step 302, acquiring Y pieces of second voice training data of the target user, wherein languages corresponding to the Y pieces of second voice training data are the same.
In the embodiment of the invention, Y is a positive integer that can be set as required; the embodiment of the present invention does not limit its value. After Y is determined, Y pieces of voice data can be selected from the voice data uttered by the target user in a single language as the second voice training data; the trained universal conversion model is then adaptively trained with the Y pieces of second voice training data, with reference to steps 304-308.
Step 304, extracting the reference acoustic features of each piece of second voice training data, and labeling the user identification and the language identification of the target user for each piece of second voice training data and the corresponding reference acoustic features.
Step 306, for each piece of second voice training data, performing voice recognition on the second voice training data to determine corresponding text information.
Step 308, adaptively training the trained universal conversion model according to the text information, reference acoustic features, user identifier, and language identifier corresponding to the second voice training data to obtain the target conversion model.
Steps 304-308 are similar to steps 204-208 described above, and are not repeated here.
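A hedged sketch of this adaptive training, reusing the `training_step` helper from the sketch above: the trained universal conversion model's parameters are reused and fine-tuned on the target user's single-language data (the optimizer choice and the small learning rate are assumptions).

```python
# Sketch of steps 302-308: fine-tune the trained universal conversion model
# on the Y single-language items of the target user. Optimizer and learning
# rate are illustrative assumptions.
import torch

def adapt_to_target_user(universal_model, target_user_batches, lr=1e-5):
    model = universal_model  # start from the trained universal parameters
    optimizer = torch.optim.Adam(model.parameters(), lr=lr)
    for batch in target_user_batches:   # all batches share one language id
        training_step(model, optimizer, batch)
    return model                        # this is the target conversion model
```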
The following describes how text information is converted into target voice data.
Referring to fig. 4, a flowchart illustrating the steps of an alternative embodiment of a voice processing method of the present invention is shown; the method may specifically include the following steps:
Step 402, source voice data of a source user is acquired, wherein the source user and a target user are the same user or different users.
Step 404, performing voice recognition on the source voice data to determine the corresponding text information to be converted.
In the embodiment of the present invention, one way to obtain the text information to be converted is to acquire source voice data of a source user and then determine the corresponding text information to be converted by performing voice recognition on the source voice data. The source user and the target user may be the same user or different users, which is not limited in the embodiment of the present invention.
In the embodiment of the invention, there are multiple ways to perform voice recognition on the source voice data and determine the corresponding text information to be converted. In one example, this may be done with reference to the following sub-steps:
Sub-step 62, inputting the source voice data into N voice recognizers respectively to obtain N corresponding voice recognition results, where each voice recognizer corresponds to one language;
Sub-step 64, splicing the N voice recognition results to obtain the text information to be converted.
Sub-steps 62-64 are similar to sub-steps 22-24 described above and are not repeated here.
In another example of the present invention, the method for performing speech recognition on the source speech data and determining the text information to be converted may include the following sub-steps:
Sub-step 82, inputting the source voice data into a voice recognizer to obtain a corresponding voice recognition result, where the voice recognizer corresponds to N languages.
Sub-step 84, determining the voice recognition result as the text information to be converted.
Sub-steps 82-84 are similar to sub-steps 42-44 described above and are not repeated here.
Of course, the user may also directly input the text information to be converted into voice data; in this case, the embodiment of the invention acquires the text information input by the user and determines it as the text information to be converted.
Step 406, determining the source language corresponding to the text information and the target user to be converted.
In one example of the present invention, when the source user inputs the source voice data (or inputs text information), the language corresponding to the source voice data (or text information) and the target user to be converted may be configured. Therefore, after the source voice data is acquired, its configuration information can be obtained, and the source language corresponding to the text information to be converted and the target user to be converted can be determined according to the configuration information.
In another example of the present invention, when the source user inputs the source voice data (or inputs text information), the corresponding language may not be configured. In this case, one way to determine the source language corresponding to the text information to be converted is to directly perform language identification on the text information to be converted.
In addition, if the source voice data is input by the user, language identification can be performed on the source voice data to determine its language, which is then determined as the source language corresponding to the text information to be converted. One way to determine the language of the source voice data is to input the source voice data into a language judgment module, which judges and determines the corresponding language. Another way is to let the voice recognizers determine the source language while performing voice recognition on the source voice data. When the source voice data is input into N voice recognizers for voice recognition, each recognizer can output, besides its voice recognition result, probability information indicating how likely the voice belongs to its language; the language corresponding to the recognizer that outputs the highest probability is then determined as the language of the source voice data. When the source voice data is input into one N-language voice recognizer for voice recognition, the recognizer can output both the corresponding voice recognition result and the language of the source voice data.
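As an illustrative sketch of the recognizer-probability route (the `(result, probability)` return shape of the hypothetical `recognize` method is an assumption):

```python
# Sketch: pick the source language as the language of the recognizer that
# outputs the highest probability. The recognizer API is a hypothetical one.
def determine_source_language(audio, recognizers):
    scored = []
    for language_id, recognizer in enumerate(recognizers):
        result, probability = recognizer.recognize(audio)
        scored.append((probability, language_id, result))
    # Highest output probability wins; its language is the source language.
    _, best_language_id, best_result = max(scored)
    return best_language_id, best_result
```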
In the embodiment of the invention, since the target conversion model outputs acoustic features, after the text information corresponding to the source voice data is converted into the acoustic features corresponding to the target user by the target conversion model, a synthesizer can synthesize the acoustic features into the target voice data. Step 106, converting the text information into the target voice data of the target user pronouncing in the source language according to the text information and the target conversion model corresponding to the target user, can refer to steps 408-410.
Step 408, converting the text information by using the target conversion model, and outputting acoustic features of the target user pronouncing in the source language.
In the embodiment of the invention, the obtained target conversion model can be used to perform the conversion and output the acoustic features of the target user pronouncing in the source language; reference may be made to the following sub-steps:
Sub-step S2, inputting the text information, the language identifier corresponding to the source language, and the user identifier corresponding to the target user into the target conversion model.
Sub-step S4, searching, by the target conversion model, for target model parameters matched with the language identifier and the user identifier.
Sub-step S6, converting, by the target conversion model, the text information using the target model parameters, and outputting the acoustic features of the target user pronouncing in the source language.
In the embodiment of the invention, the language identifier corresponding to the source language and the user identifier corresponding to the target user can be determined; the text information, the language identifier, and the user identifier are then input into the target conversion model. The target conversion model searches for the target model parameters matched with the language identifier corresponding to the source language and the user identifier corresponding to the target user, converts the text information using those target model parameters, and outputs the acoustic features of the target user pronouncing in the source language.
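One plausible way to realize this lookup, offered only as a hedged sketch, is to store per-user and per-language model parameters in embedding tables keyed by the identifiers; the patent does not fix the mechanism, and the sequence modeling is collapsed into single vectors for brevity.

```python
# Hedged sketch of sub-steps S2-S6: embedding tables keyed by user id and
# language id act as the "target model parameters matched with the
# identifiers"; the architecture below is an illustrative assumption.
import torch
import torch.nn as nn

class ConditionedConverter(nn.Module):
    def __init__(self, n_users, n_languages, text_dim, feature_dim, hidden=256):
        super().__init__()
        self.user_table = nn.Embedding(n_users, hidden)          # per-user parameters
        self.language_table = nn.Embedding(n_languages, hidden)  # per-language parameters
        self.encoder = nn.Linear(text_dim, hidden)
        self.decoder = nn.Linear(hidden, feature_dim)            # acoustic features

    def forward(self, text_encoding, user_id, language_id):
        hidden_state = self.encoder(text_encoding)
        # S4: look up the parameters matched with the user and language ids.
        hidden_state = (hidden_state
                        + self.user_table(user_id)
                        + self.language_table(language_id))
        # S6: convert the conditioned representation into acoustic features.
        return self.decoder(torch.tanh(hidden_state))
```

Conditioning through looked-up user and language vectors is what would allow the adaptive training stage to attach a new user identifier to the shared multilingual parameters without retraining from scratch.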
Step 410, synthesizing the acoustic features with a synthesizer to obtain the target voice data of the target user pronouncing in the source language.
As an example of the present invention, reference may be made to fig. 5a, which shows a process schematic of a voice processing method of the present invention. In fig. 5a, the source voice data is input to N voice recognizers respectively to obtain N corresponding voice recognition results; the N voice recognition results are then spliced, thereby recognizing the text information corresponding to the source voice data.
As another example of the present invention, reference may be made to fig. 5b, which shows a process schematic of another voice processing method of the present invention. In fig. 5b, the source voice data is input to one voice recognizer to recognize the text information corresponding to the source voice data.
In one application of the embodiment of the invention, source voice data input by Zhang San can be acquired and subjected to voice recognition to determine the corresponding text information K to be converted. The language corresponding to the text information K can then be determined to be language A, and the target user can be determined to be Li Si. The target conversion model then converts the text information K and outputs acoustic features of Li Si pronouncing the text information K in language A; a synthesizer then synthesizes these acoustic features to obtain target voice data of Li Si pronouncing the text information K in language A. In this way, voice data input by Zhang San (the source user) in language A is converted into target voice data of Li Si (the target user) in language A.
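Tying the earlier sketches together, a hedged end-to-end version of this example might read as follows; all helper names are the hypothetical ones introduced in the sketches above, not names from the patent.

```python
# End-to-end sketch of the worked example: Zhang San's language-A speech is
# recognized to text K, then re-synthesized in Li Si's voice in language A.
def example_zhang_san_to_li_si(zhang_san_audio, recognizers,
                               target_model, synthesizer, li_si_user_id):
    # Recognize text K and determine language A from Zhang San's input.
    language_a_id, text_k = determine_source_language(zhang_san_audio, recognizers)
    # Convert text K into Li Si's voice, pronounced in language A.
    return convert_text_to_target_speech(
        text=text_k,
        source_language_id=language_a_id,
        target_user_id=li_si_user_id,
        target_model=target_model,
        synthesizer=synthesizer,
    )
```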
In summary, in the embodiment of the present invention, source voice data of a source user can be acquired and subjected to voice recognition to determine the corresponding text information to be converted; the source language corresponding to the source voice data and the target user to be converted are determined; the target conversion model then converts the text information and outputs acoustic features of the target user pronouncing the text information in the source language; and a synthesizer synthesizes the acoustic features to obtain target voice data of the target user pronouncing in the source language. Therefore, multilingual source voice data can be converted into target voice data of the target user in the corresponding language even when only single-language voice data of the target user is available, realizing multilingual voice conversion.
Secondly, in the embodiment of the invention, when recognizing the text information corresponding to the source voice data, the source voice data can be input into N voice recognizers respectively to obtain N corresponding voice recognition results, where each voice recognizer corresponds to one language, and the N voice recognition results are spliced to obtain the corresponding text information; alternatively, the source voice data can be input into one voice recognizer corresponding to N languages to obtain a corresponding voice recognition result, which is determined as the text information. This can further improve the accuracy of the determined text information and thus the accuracy of multilingual conversion.
It should be noted that, for simplicity of description, the method embodiments are shown as a series of acts, but it should be understood by those skilled in the art that the embodiments are not limited by the order of acts, as some steps may occur in other orders or concurrently in accordance with the embodiments. Further, those skilled in the art will appreciate that the embodiments described in the specification are presently preferred embodiments, and that the acts are not necessarily required by the embodiments of the invention.
Referring to fig. 6, a block diagram of an embodiment of a voice processing device of the present invention is shown; the device may specifically include the following modules:
An obtaining module 602, configured to obtain text information to be converted;
the information determining module 604 is configured to determine a source language corresponding to the text information and a target user to be converted;
the voice conversion module 606 is configured to convert the text information into target voice data of the target user pronouncing in the source language according to the text information and a target conversion model corresponding to the target user; the target conversion model is obtained by adaptively training a trained universal conversion model on single-language voice data of the target user, and the universal conversion model is trained on voice data covering N languages; the source language is one of the N languages, and N is an integer greater than 1.
Referring to fig. 7, a block diagram of an alternative embodiment of a voice processing device of the present invention is shown.
In an alternative embodiment of the present invention, the obtaining module 602 includes:
A voice acquisition submodule 6022, configured to acquire source voice data of a source user, where the source user and a target user are the same user or different users;
and the recognition submodule 6024 is used for carrying out voice recognition on the source voice data and determining corresponding text information to be converted.
In an alternative embodiment of the present invention, the identifying submodule 6024 includes:
the first speech recognition unit 60242 is configured to input the source speech data to N speech recognizers respectively, to obtain N corresponding speech recognition results, where one speech recognizer corresponds to one language; and splicing the N voice recognition results to obtain text information to be converted.
In an alternative embodiment of the present invention, the identifying submodule 6024 includes:
The second speech recognition unit 60244 is configured to input the source speech data to a speech recognizer to obtain a corresponding speech recognition result, where the speech recognizer corresponds to N languages; and determining the voice recognition result as text information to be converted.
In an alternative embodiment of the present invention, the voice conversion module 606 includes:
a feature generation submodule 6062, configured to convert the text information by using the target conversion model and output acoustic features of the target user pronouncing in the source language;
and a voice synthesis submodule 6064, configured to synthesize the acoustic features by using a synthesizer to obtain the target voice data of the target user pronouncing in the source language.
In an optional embodiment of the present invention, the feature generation submodule 6062 is configured to input the text information, a language identifier corresponding to the source language, and a user identifier corresponding to the target user into the target conversion model; the target conversion model searches for target model parameters matched with the language identifier and the user identifier, converts the text information using the target model parameters, and outputs the acoustic features of the target user pronouncing in the source language.
In an alternative embodiment of the present invention, the method further includes:
A first training module 608, configured to train the generic conversion model; the first training module is specifically configured to collect X pieces of first voice training data of M users, where one piece of first voice training data corresponds to one language, and X pieces of first voice training data correspond to N languages; respectively extracting reference acoustic features of each piece of first voice training data, and respectively labeling corresponding user identifications and language identifications for each piece of first voice training data and corresponding reference acoustic features; identifying text information corresponding to each piece of first voice training data; and training the universal conversion model according to the text information, the reference acoustic characteristics, the user identification and the language identification corresponding to the first voice training data.
In an alternative embodiment of the present invention, the apparatus further includes:
a second training module 610, configured to adaptively train the trained universal conversion model on the single-language voice data of the target user to generate the target conversion model; the second training module is specifically configured to: acquire Y pieces of second voice training data of the target user, where the Y pieces of second voice training data all correspond to the same language; extract the reference acoustic features of each piece of second voice training data, and label each piece of second voice training data and its reference acoustic features with the user identifier and language identifier of the target user; recognize the text information corresponding to each piece of second voice training data; and adaptively train the trained universal conversion model according to the text information, reference acoustic features, user identifier, and language identifier corresponding to the second voice training data to obtain the target conversion model.
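Under the same assumptions, the adaptation step can be sketched as continued training from the universal model on the target user's single-language samples:

    import copy

    def adapt_to_target_user(universal_model, target_samples):
        """target_samples are Y TrainingSample items that share one language_id
        and the target user's user_id; continued training on them specializes
        a copy of the universal model into the target conversion model."""
        target_model = copy.deepcopy(universal_model)  # keep the universal model intact
        return train_universal_model(target_model, target_samples, epochs=5)

Because these samples exercise mainly the parameters tied to the target user's identifiers, the adaptation specializes the shared model to that speaker's voice while the remaining parameters stay essentially as trained.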
In summary, in the embodiments of the present invention, text information to be converted can be acquired, and the source language corresponding to the text information and the target user for the conversion can be determined; the text information is then converted into target voice data of the target user pronouncing in the source language, according to the text information and the target conversion model corresponding to the target user; the target conversion model is obtained by adaptively training the trained universal conversion model on single-language voice data of the target user, and the universal conversion model is trained on voice data covering N languages; the source language is one of the N languages, and N is an integer greater than 1. In this way, multilingual text can be converted into target voice data of the target user in the corresponding language even when only single-language voice data of the target user is available, thereby realizing multilingual voice conversion.
For the device embodiments, since they are substantially similar to the method embodiments, the description is relatively simple, and reference is made to the description of the method embodiments for relevant points.
Fig. 8 is a block diagram illustrating a configuration of an electronic device 800 for speech processing according to an example embodiment. For example, electronic device 800 may be a mobile phone, computer, digital broadcast terminal, messaging device, game console, tablet device, medical device, exercise device, personal digital assistant, or the like.
Referring to fig. 8, an electronic device 800 may include one or more of the following components: a processing component 802, a memory 804, a power component 806, a multimedia component 808, an audio component 810, an input/output (I/O) interface 812, a sensor component 814, and a communication component 816.
The processing component 802 generally controls the overall operation of the electronic device 800, such as operations associated with display, telephone calls, data communications, camera operations, and recording operations. The processing component 802 may include one or more processors 820 to execute instructions to perform all or part of the steps of the methods described above. Further, the processing component 802 may include one or more modules that facilitate interaction between the processing component 802 and other components. For example, the processing component 802 may include a multimedia module to facilitate interaction between the multimedia component 808 and the processing component 802.
The memory 804 is configured to store various types of data to support operations at the device 800. Examples of such data include instructions for any application or method operating on the electronic device 800, contact data, phonebook data, messages, pictures, videos, and so forth. The memory 804 may be implemented by any type or combination of volatile or nonvolatile memory devices such as Static Random Access Memory (SRAM), electrically erasable programmable read-only memory (EEPROM), erasable programmable read-only memory (EPROM), programmable read-only memory (PROM), read-only memory (ROM), magnetic memory, flash memory, magnetic or optical disk.
The power component 806 provides power to the various components of the electronic device 800. Power components 806 may include a power management system, one or more power sources, and other components associated with generating, managing, and distributing power for electronic device 800.
The multimedia component 808 includes a screen that provides an output interface between the electronic device 800 and the user. In some embodiments, the screen may include a Liquid Crystal Display (LCD) and a Touch Panel (TP). If the screen includes a touch panel, the screen may be implemented as a touch screen to receive input signals from the user. The touch panel includes one or more touch sensors to sense touches, swipes, and gestures on the touch panel. The touch sensors may sense not only the boundary of a touch or swipe action, but also the duration and pressure associated with it. In some embodiments, the multimedia component 808 includes a front camera and/or a rear camera. When the electronic device 800 is in an operational mode, such as a shooting mode or a video mode, the front camera and/or the rear camera may receive external multimedia data. Each front or rear camera may be a fixed optical lens system or have focus and optical zoom capability.
The audio component 810 is configured to output and/or input audio signals. For example, the audio component 810 includes a Microphone (MIC) configured to receive external audio signals when the electronic device 800 is in an operational mode, such as a call mode, a recording mode, and a voice recognition mode. The received audio signals may be further stored in the memory 804 or transmitted via the communication component 816. In some embodiments, audio component 810 further includes a speaker for outputting audio signals.
The I/O interface 812 provides an interface between the processing component 802 and peripheral interface modules, which may be a keyboard, click wheel, buttons, etc. These buttons may include, but are not limited to: homepage button, volume button, start button, and lock button.
The sensor assembly 814 includes one or more sensors for providing status assessments of various aspects of the electronic device 800. For example, the sensor assembly 814 may detect the on/off state of the device 800 and the relative positioning of components, such as the display and keypad of the electronic device 800. The sensor assembly 814 may also detect a change in position of the electronic device 800 or of a component of the electronic device 800, the presence or absence of user contact with the electronic device 800, the orientation or acceleration/deceleration of the electronic device 800, and a change in its temperature. The sensor assembly 814 may include a proximity sensor configured to detect the presence of nearby objects without any physical contact. The sensor assembly 814 may also include a light sensor, such as a CMOS or CCD image sensor, for use in imaging applications. In some embodiments, the sensor assembly 814 may also include an acceleration sensor, a gyroscope sensor, a magnetic sensor, a pressure sensor, or a temperature sensor.
The communication component 816 is configured to facilitate wired or wireless communication between the electronic device 800 and other devices. The electronic device 800 may access a wireless network based on a communication standard, such as WiFi, 2G, or 3G, or a combination thereof. In one exemplary embodiment, the communication component 816 receives a broadcast signal or broadcast-related information from an external broadcast management system via a broadcast channel. In one exemplary embodiment, the communication component 816 further includes a Near Field Communication (NFC) module to facilitate short-range communication. For example, the NFC module may be implemented based on Radio Frequency Identification (RFID) technology, Infrared Data Association (IrDA) technology, Ultra Wideband (UWB) technology, Bluetooth (BT) technology, and other technologies.
In an exemplary embodiment, the electronic device 800 may be implemented by one or more Application-Specific Integrated Circuits (ASICs), Digital Signal Processors (DSPs), Digital Signal Processing Devices (DSPDs), Programmable Logic Devices (PLDs), Field-Programmable Gate Arrays (FPGAs), controllers, microcontrollers, microprocessors, or other electronic elements, for executing the methods described above.
In an exemplary embodiment, a non-transitory computer-readable storage medium is also provided, such as the memory 804 including instructions executable by the processor 820 of the electronic device 800 to perform the above-described method. For example, the non-transitory computer-readable storage medium may be a ROM, a Random Access Memory (RAM), a CD-ROM, a magnetic tape, a floppy disk, an optical data storage device, or the like.
A non-transitory computer-readable storage medium, whose instructions, when executed by a processor of an electronic device, cause the electronic device to perform a speech processing method, the method comprising: acquiring text information to be converted, and determining the source language corresponding to the text information and the target user for the conversion; converting the text information into target voice data of the target user pronouncing in the source language according to the text information and a target conversion model corresponding to the target user; the target conversion model is obtained by adaptively training the trained universal conversion model on single-language voice data of the target user, and the universal conversion model is trained on voice data covering N languages; the source language is one of the N languages, and N is an integer greater than 1.
Optionally, the acquiring of the text information to be converted includes: acquiring source voice data of a source user, where the source user and the target user are the same user or different users; and performing voice recognition on the source voice data and determining the corresponding text information to be converted.
Optionally, performing voice recognition on the source voice data and determining the corresponding text information to be converted includes: inputting the source voice data into N speech recognizers respectively to obtain N corresponding speech recognition results, where each speech recognizer corresponds to one language; and splicing the N speech recognition results to obtain the text information to be converted.
Optionally, performing voice recognition on the source voice data and determining the corresponding text information to be converted includes: inputting the source voice data into one speech recognizer to obtain a corresponding speech recognition result, where the speech recognizer corresponds to the N languages; and determining the speech recognition result as the text information to be converted.
Optionally, converting the text information into target voice data of the target user pronouncing in the source language according to the text information and the target conversion model corresponding to the target user includes: converting the text information with the target conversion model and outputting the acoustic features of the target user pronouncing in the source language; and synthesizing the acoustic features with a synthesizer to obtain the target voice data of the target user pronouncing in the source language.
Optionally, converting the text information with the target conversion model and outputting the acoustic features of the target user pronouncing in the source language includes: inputting the text information, the language identifier corresponding to the source language, and the user identifier corresponding to the target user into the target conversion model; looking up, by the target conversion model, the target model parameters matching the language identifier and the user identifier; and converting, by the target conversion model, the text information with the target model parameters and outputting the acoustic features of the target user pronouncing in the source language.
Optionally, the method further includes the step of training the universal conversion model: collecting X pieces of first voice training data from M users, where each piece of first voice training data corresponds to one language and the X pieces together cover the N languages; extracting the reference acoustic features of each piece of first voice training data, and labeling each piece and its reference acoustic features with the corresponding user identifier and language identifier; recognizing the text information corresponding to each piece of first voice training data; and training the universal conversion model according to the text information, reference acoustic features, user identifier, and language identifier corresponding to the first voice training data.
Optionally, the method further includes the step of adaptively training the trained universal conversion model on the single-language voice data of the target user to generate the target conversion model: acquiring Y pieces of second voice training data of the target user, where the Y pieces all correspond to the same language; extracting the reference acoustic features of each piece of second voice training data, and labeling each piece and its reference acoustic features with the user identifier and language identifier of the target user; recognizing the text information corresponding to each piece of second voice training data; and adaptively training the trained universal conversion model according to the text information, reference acoustic features, user identifier, and language identifier corresponding to the second voice training data to obtain the target conversion model.
Fig. 9 is a schematic diagram of an electronic device 900 for speech processing according to another exemplary embodiment of the present invention. The electronic device 900 may be a server, which may vary widely in configuration or performance and may include one or more Central Processing Units (CPUs) 922 (e.g., one or more processors), memory 932, and one or more storage media 930 (e.g., one or more mass storage devices) storing applications 942 or data 944. The memory 932 and the storage medium 930 may be transitory or persistent storage. The program stored in the storage medium 930 may include one or more modules (not shown), each of which may include a series of instruction operations for the server. Still further, the central processing unit 922 may be arranged to communicate with the storage medium 930 and to execute, on the server, the series of instruction operations in the storage medium 930.
The server may also include one or more power supplies 926, one or more wired or wireless network interfaces 950, one or more input/output interfaces 958, one or more keyboards 956, and/or one or more operating systems 941, such as Windows Server™, Mac OS X™, Unix™, Linux™, FreeBSD™, and the like.
An electronic device comprising a memory and one or more programs, wherein the one or more programs are stored in the memory and configured to be executed by one or more processors, the one or more programs comprising instructions for: acquiring text information to be converted, and determining the source language corresponding to the text information and the target user for the conversion; converting the text information into target voice data of the target user pronouncing in the source language according to the text information and a target conversion model corresponding to the target user; the target conversion model is obtained by adaptively training the trained universal conversion model on single-language voice data of the target user, and the universal conversion model is trained on voice data covering N languages; the source language is one of the N languages, and N is an integer greater than 1.
Optionally, the acquiring of the text information to be converted includes: acquiring source voice data of a source user, where the source user and the target user are the same user or different users; and performing voice recognition on the source voice data and determining the corresponding text information to be converted.
Optionally, performing voice recognition on the source voice data and determining the corresponding text information to be converted includes: inputting the source voice data into N speech recognizers respectively to obtain N corresponding speech recognition results, where each speech recognizer corresponds to one language; and splicing the N speech recognition results to obtain the text information to be converted.
Optionally, performing voice recognition on the source voice data and determining the corresponding text information to be converted includes: inputting the source voice data into one speech recognizer to obtain a corresponding speech recognition result, where the speech recognizer corresponds to the N languages; and determining the speech recognition result as the text information to be converted.
Optionally, converting the text information into target voice data of the target user pronouncing in the source language according to the text information and the target conversion model corresponding to the target user includes: converting the text information with the target conversion model and outputting the acoustic features of the target user pronouncing in the source language; and synthesizing the acoustic features with a synthesizer to obtain the target voice data of the target user pronouncing in the source language.
Optionally, converting the text information with the target conversion model and outputting the acoustic features of the target user pronouncing in the source language includes: inputting the text information, the language identifier corresponding to the source language, and the user identifier corresponding to the target user into the target conversion model; looking up, by the target conversion model, the target model parameters matching the language identifier and the user identifier; and converting, by the target conversion model, the text information with the target model parameters and outputting the acoustic features of the target user pronouncing in the source language.
Optionally, instructions for training the universal conversion model are also included: collecting X pieces of first voice training data from M users, where each piece of first voice training data corresponds to one language and the X pieces together cover the N languages; extracting the reference acoustic features of each piece of first voice training data, and labeling each piece and its reference acoustic features with the corresponding user identifier and language identifier; recognizing the text information corresponding to each piece of first voice training data; and training the universal conversion model according to the text information, reference acoustic features, user identifier, and language identifier corresponding to the first voice training data.
Optionally, instructions are also included for adaptively training the trained universal conversion model on the target user's single-language voice data to generate the target conversion model: acquiring Y pieces of second voice training data of the target user, where the Y pieces all correspond to the same language; extracting the reference acoustic features of each piece of second voice training data, and labeling each piece and its reference acoustic features with the user identifier and language identifier of the target user; recognizing the text information corresponding to each piece of second voice training data; and adaptively training the trained universal conversion model according to the text information, reference acoustic features, user identifier, and language identifier corresponding to the second voice training data to obtain the target conversion model.
In this specification, each embodiment is described in a progressive manner; each embodiment focuses on its differences from the other embodiments, and for the identical or similar parts, the embodiments may be referred to one another.
Embodiments of the present invention are described with reference to flowchart illustrations and/or block diagrams of methods, terminal devices (systems), and computer program products according to embodiments of the invention. It will be understood that each flow and/or block of the flowchart illustrations and/or block diagrams, and combinations of flows and/or blocks in the flowchart illustrations and/or block diagrams, can be implemented by computer program instructions. These computer program instructions may be provided to a processor of a general purpose computer, special purpose computer, embedded processor, or other programmable data processing terminal device to produce a machine, such that the instructions, which execute via the processor of the computer or other programmable data processing terminal device, create means for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.
These computer program instructions may also be stored in a computer-readable memory that can direct a computer or other programmable data processing apparatus to function in a particular manner, such that the instructions stored in the computer-readable memory produce an article of manufacture including instruction means which implement the function specified in the flowchart flow or flows and/or block diagram block or blocks.
These computer program instructions may also be loaded onto a computer or other programmable data processing apparatus to cause a series of operational steps to be performed on the computer or other programmable apparatus to produce a computer implemented process such that the instructions which execute on the computer or other programmable apparatus provide steps for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.
While preferred embodiments of the present invention have been described, additional variations and modifications in those embodiments may occur to those skilled in the art once they learn of the basic inventive concepts. It is therefore intended that the following claims be interpreted as including the preferred embodiment and all such alterations and modifications as fall within the scope of the embodiments of the invention.
Finally, it is further noted that relational terms such as first and second are used herein solely to distinguish one entity or action from another entity or action, without necessarily requiring or implying any actual such relationship or order between those entities or actions. Moreover, the terms "comprises," "comprising," or any other variation thereof are intended to cover a non-exclusive inclusion, such that a process, method, article, or terminal device that comprises a list of elements includes not only those elements but may also include other elements not expressly listed or inherent to such process, method, article, or terminal device. Without further limitation, an element introduced by the phrase "comprising a …" does not exclude the presence of other identical elements in the process, method, article, or terminal device that comprises that element.
The foregoing has described in detail a speech processing method, a speech processing apparatus, and an electronic device according to the present invention. Specific examples have been used herein to illustrate the principles and embodiments of the present invention; the above examples are provided only to assist in understanding the method and its core idea. Meanwhile, since those skilled in the art may vary the specific embodiments and the scope of application in accordance with the ideas of the present invention, this description should not be construed as limiting the present invention.

Claims (9)

1. A method of speech processing, comprising:
acquiring text information to be converted, and determining a source language corresponding to the text information and a target user for the conversion;
converting the text information into target voice data of the target user pronouncing in the source language, according to the text information and a target conversion model corresponding to the target user;
wherein the target conversion model is obtained by adaptively training a trained universal conversion model on single-language voice data of the target user, and the universal conversion model is trained on voice data covering N languages; the source language is one of the N languages, and N is an integer greater than 1;
wherein the step of training the universal conversion model comprises:
collecting X pieces of first voice training data from M users, wherein each piece of first voice training data corresponds to one language, and the X pieces of first voice training data together cover the N languages, M and X being integers greater than 1;
extracting the reference acoustic features of each piece of first voice training data, and labeling each piece of first voice training data and its reference acoustic features with the corresponding user identifier and language identifier, the user identifier uniquely identifying a user and the language identifier uniquely identifying a language;
recognizing the text information corresponding to each piece of first voice training data;
and training the universal conversion model according to the text information, reference acoustic features, user identifier, and language identifier corresponding to the first voice training data;
wherein recognizing the text information corresponding to each piece of first voice training data comprises: inputting the first voice training data into N speech recognizers respectively to obtain N corresponding speech recognition results, wherein each speech recognizer corresponds to one language, and splicing the N speech recognition results to obtain the corresponding text information;
and wherein training the universal conversion model according to the text information, reference acoustic features, user identifier, and language identifier corresponding to the first voice training data comprises:
for each piece of the X pieces of first voice training data, inputting the text information, user identifier, and language identifier corresponding to that piece of first voice training data into the universal conversion model;
performing forward computation on the text information with the universal conversion model and outputting the predicted acoustic features corresponding to the first voice training data, wherein, during the forward computation, the model parameters are associated with the user identifier and language identifier of the first voice training data;
comparing the predicted acoustic features with the reference acoustic features corresponding to the first voice training data, and adjusting the model parameters of the universal conversion model that correspond to the user identifier and language identifier of the first voice training data, until an ending condition is met, to obtain the trained universal conversion model;
and, when the ending condition is not met, performing, based on the next piece of first voice training data among the X pieces, the step of inputting the corresponding text information, user identifier, and language identifier into the universal conversion model and the subsequent steps, until the ending condition is met and the trained universal conversion model is obtained.
2. The method of claim 1, wherein the obtaining text information to be converted comprises:
acquiring source voice data of a source user, wherein the source user and a target user are the same user or different users;
and performing voice recognition on the source voice data, and determining the corresponding text information to be converted.
3. The method of claim 2, wherein performing speech recognition on the source speech data to determine the corresponding text information to be converted comprises:
inputting the source voice data into N speech recognizers respectively to obtain N corresponding speech recognition results, wherein each speech recognizer corresponds to one language;
and splicing the N speech recognition results to obtain the text information to be converted.
4. The method of claim 2, wherein performing speech recognition on the source speech data to determine the corresponding text information to be converted comprises:
inputting the source voice data into one speech recognizer to obtain a corresponding speech recognition result, wherein the speech recognizer corresponds to the N languages;
and determining the speech recognition result as the text information to be converted.
5. The method of claim 1, wherein converting the text information into target voice data of the target user pronouncing in the source language according to the text information and the target conversion model corresponding to the target user comprises:
converting the text information with the target conversion model, and outputting the acoustic features of the target user pronouncing in the source language;
and synthesizing the acoustic features with a synthesizer to obtain the target voice data of the target user pronouncing in the source language.
6. The method of claim 5, wherein converting the text information with the target conversion model and outputting the acoustic features of the target user pronouncing the text information in the source language comprises:
inputting the text information, the language identifier corresponding to the source language, and the user identifier corresponding to the target user into the target conversion model;
looking up, by the target conversion model, the target model parameters matching the language identifier and the user identifier;
and converting, by the target conversion model, the text information with the target model parameters and outputting the acoustic features of the target user pronouncing in the source language.
7. A speech processing apparatus, comprising:
an acquisition module, configured to acquire text information to be converted;
an information determining module, configured to determine a source language corresponding to the text information and a target user for the conversion;
a voice conversion module, configured to convert the text information into target voice data of the target user pronouncing in the source language according to the text information and a target conversion model corresponding to the target user; wherein the target conversion model is obtained by adaptively training a trained universal conversion model on single-language voice data of the target user, and the universal conversion model is trained on voice data covering N languages; the source language is one of the N languages, and N is an integer greater than 1;
a first training module, configured to collect X pieces of first voice training data from M users, wherein each piece of first voice training data corresponds to one language and the X pieces of first voice training data together cover the N languages, M and X being integers greater than 1; extract the reference acoustic features of each piece of first voice training data, and label each piece of first voice training data and its reference acoustic features with the corresponding user identifier and language identifier, the user identifier uniquely identifying a user and the language identifier uniquely identifying a language; recognize the text information corresponding to each piece of first voice training data; and train the universal conversion model according to the text information, reference acoustic features, user identifier, and language identifier corresponding to the first voice training data;
wherein the first training module is configured to: input the first voice training data into N speech recognizers respectively to obtain N corresponding speech recognition results, each speech recognizer corresponding to one language; splice the N speech recognition results to obtain the corresponding text information; for each piece of the X pieces of first voice training data, input the text information, user identifier, and language identifier corresponding to that piece of first voice training data into the universal conversion model; perform forward computation on the text information with the universal conversion model and output the predicted acoustic features corresponding to the first voice training data, wherein, during the forward computation, the model parameters are associated with the user identifier and language identifier of the first voice training data; compare the predicted acoustic features with the reference acoustic features corresponding to the first voice training data, and adjust the model parameters of the universal conversion model that correspond to the user identifier and language identifier of the first voice training data, until an ending condition is met, to obtain the trained universal conversion model; and, when the ending condition is not met, perform, based on the next piece of first voice training data among the X pieces, the step of inputting the corresponding text information, user identifier, and language identifier into the universal conversion model and the subsequent steps, until the ending condition is met and the trained universal conversion model is obtained.
8. An electronic device comprising a memory, and one or more programs, wherein the one or more programs are stored in the memory and configured to be executed by one or more processors, the one or more programs comprising instructions for:
acquiring text information to be converted, and determining a source language corresponding to the text information and a target user for the conversion;
converting the text information into target voice data of the target user pronouncing in the source language, according to the text information and a target conversion model corresponding to the target user;
wherein the target conversion model is obtained by adaptively training a trained universal conversion model on single-language voice data of the target user, and the universal conversion model is trained on voice data covering N languages; the source language is one of the N languages, and N is an integer greater than 1;
wherein the step of training the universal conversion model comprises:
collecting X pieces of first voice training data from M users, wherein each piece of first voice training data corresponds to one language, and the X pieces of first voice training data together cover the N languages, M and X being integers greater than 1;
extracting the reference acoustic features of each piece of first voice training data, and labeling each piece of first voice training data and its reference acoustic features with the corresponding user identifier and language identifier, the user identifier uniquely identifying a user and the language identifier uniquely identifying a language;
recognizing the text information corresponding to each piece of first voice training data;
and training the universal conversion model according to the text information, reference acoustic features, user identifier, and language identifier corresponding to the first voice training data;
wherein recognizing the text information corresponding to each piece of first voice training data comprises: inputting the first voice training data into N speech recognizers respectively to obtain N corresponding speech recognition results, wherein each speech recognizer corresponds to one language, and splicing the N speech recognition results to obtain the corresponding text information;
and wherein training the universal conversion model according to the text information, reference acoustic features, user identifier, and language identifier corresponding to the first voice training data comprises:
for each piece of the X pieces of first voice training data, inputting the text information, user identifier, and language identifier corresponding to that piece of first voice training data into the universal conversion model;
performing forward computation on the text information with the universal conversion model and outputting the predicted acoustic features corresponding to the first voice training data, wherein, during the forward computation, the model parameters are associated with the user identifier and language identifier of the first voice training data;
comparing the predicted acoustic features with the reference acoustic features corresponding to the first voice training data, and adjusting the model parameters of the universal conversion model that correspond to the user identifier and language identifier of the first voice training data, until an ending condition is met, to obtain the trained universal conversion model;
and, when the ending condition is not met, performing, based on the next piece of first voice training data among the X pieces, the step of inputting the corresponding text information, user identifier, and language identifier into the universal conversion model and the subsequent steps, until the ending condition is met and the trained universal conversion model is obtained.
9. A readable storage medium, wherein instructions in the storage medium, when executed by a processor of an electronic device, enable the electronic device to perform the speech processing method according to any one of claims 1 to 6.
CN202010301719.4A 2020-04-16 2020-04-16 Voice processing method and device and electronic equipment Active CN113539233B (en)

Priority Applications (2)

Application Number Priority Date Filing Date Title
CN202010301719.4A CN113539233B (en) 2020-04-16 2020-04-16 Voice processing method and device and electronic equipment
PCT/CN2021/070432 WO2021208531A1 (en) 2020-04-16 2021-01-06 Speech processing method and apparatus, and electronic device

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202010301719.4A CN113539233B (en) 2020-04-16 2020-04-16 Voice processing method and device and electronic equipment

Publications (2)

Publication Number Publication Date
CN113539233A CN113539233A (en) 2021-10-22
CN113539233B (en) 2024-07-30

Family

ID=78084017

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202010301719.4A Active CN113539233B (en) 2020-04-16 2020-04-16 Voice processing method and device and electronic equipment

Country Status (2)

Country Link
CN (1) CN113539233B (en)
WO (1) WO2021208531A1 (en)

Families Citing this family (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114360558B (en) * 2021-12-27 2022-12-13 北京百度网讯科技有限公司 Voice conversion method, voice conversion model generation method and device
CN117035004B (en) * 2023-07-24 2024-07-23 北京泰策科技有限公司 Text, picture and video generation method and system based on multi-modal learning technology
CN116844523B (en) * 2023-08-31 2023-11-10 深圳市声扬科技有限公司 Voice data generation method and device, electronic equipment and readable storage medium

Family Cites Families (13)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
SE9600959L (en) * 1996-03-13 1997-09-14 Telia Ab Speech-to-speech translation method and apparatus
CN105261355A (en) * 2015-09-02 2016-01-20 百度在线网络技术(北京)有限公司 Voice synthesis method and apparatus
CN105185372B (en) * 2015-10-20 2017-03-22 百度在线网络技术(北京)有限公司 Training method for multiple personalized acoustic models, and voice synthesis method and voice synthesis device
CN105845125B (en) * 2016-05-18 2019-05-03 百度在线网络技术(北京)有限公司 Phoneme synthesizing method and speech synthetic device
CN107910004A (en) * 2017-11-10 2018-04-13 科大讯飞股份有限公司 Speech translation processing method and device
JP6876642B2 (en) * 2018-02-20 2021-05-26 日本電信電話株式会社 Speech conversion learning device, speech conversion device, method, and program
CN108447486B (en) * 2018-02-28 2021-12-03 科大讯飞股份有限公司 Voice translation method and device
CN108874788A (en) * 2018-06-22 2018-11-23 深圳市沃特沃德股份有限公司 Voice translation method and device
CN109300469A (en) * 2018-09-05 2019-02-01 满金坝(深圳)科技有限公司 Simultaneous interpretation method and device based on machine learning
CN108986793A (en) * 2018-09-28 2018-12-11 北京百度网讯科技有限公司 translation processing method, device and equipment
CN109686363A (en) * 2019-02-26 2019-04-26 深圳市合言信息科技有限公司 A kind of on-the-spot meeting artificial intelligence simultaneous interpretation equipment
CN110610720B (en) * 2019-09-19 2022-02-25 北京搜狗科技发展有限公司 Data processing method and device and data processing device
CN110970014B (en) * 2019-10-31 2023-12-15 阿里巴巴集团控股有限公司 Voice conversion, file generation, broadcasting and voice processing method, equipment and medium

Patent Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US9922641B1 (en) * 2012-10-01 2018-03-20 Google Llc Cross-lingual speaker adaptation for multi-lingual speech synthesis
CN109147758A (en) * 2018-09-12 2019-01-04 科大讯飞股份有限公司 A kind of speaker's sound converting method and device
CN110970018A (en) * 2018-09-28 2020-04-07 珠海格力电器股份有限公司 Speech recognition method and device

Also Published As

Publication number Publication date
WO2021208531A1 (en) 2021-10-21
CN113539233A (en) 2021-10-22

Similar Documents

Publication Publication Date Title
CN113362812B (en) Voice recognition method and device and electronic equipment
CN113362813B (en) Voice recognition method and device and electronic equipment
CN110210310B (en) Video processing method and device for video processing
CN111368541B (en) Named entity identification method and device
KR102334299B1 (en) Voice information processing method, apparatus, program and storage medium
CN111831806B (en) Semantic integrity determination method, device, electronic equipment and storage medium
CN113539233B (en) Voice processing method and device and electronic equipment
CN113345452B (en) Voice conversion method, training method, device and medium of voice conversion model
CN113409765B (en) Speech synthesis method and device for speech synthesis
CN113223542B (en) Audio conversion method and device, storage medium and electronic equipment
CN112735396A (en) Speech recognition error correction method, device and storage medium
CN108628819B (en) Processing method and device for processing
CN114154485B (en) A text error correction method and device
CN113589954B (en) Data processing method and device and electronic equipment
CN113936697B (en) Voice processing method and device for voice processing
CN112331194B (en) Input method and device and electronic equipment
CN110930977A (en) Data processing method and device and electronic equipment
CN113807540B (en) A data processing method and device
CN116705015A (en) Equipment wake-up method, device and computer readable storage medium
CN116484828A (en) Similar case determining method, device, apparatus, medium and program product
CN113589946B (en) Data processing method and device and electronic equipment
CN113589947B (en) Data processing method and device and electronic equipment
CN113345451B (en) Sound changing method and device and electronic equipment
CN114063792B (en) Data processing method, device and electronic equipment
CN112668340B (en) Information processing method and device

Legal Events

Code Title
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant