CN108492818A - Text-to-speech conversion method, apparatus, and computer device - Google Patents
- Publication number
- CN108492818A (application CN201810241804.9A)
- Authority
- CN
- China
- Prior art keywords
- text
- frame
- linear spectrum
- converted
- voice
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Granted
Classifications
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L13/00—Speech synthesis; Text to speech systems
- G10L13/08—Text analysis or generation of parameters for speech synthesis out of text, e.g. grapheme to phoneme translation, prosody generation or stress or intonation determination
- G10L13/10—Prosody rules derived from text; Stress or intonation
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L13/00—Speech synthesis; Text to speech systems
- G10L13/02—Methods for producing synthetic speech; Speech synthesisers
- G10L13/04—Details of speech synthesis systems, e.g. synthesiser structure or memory management
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L25/00—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
- G10L25/27—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the analysis technique
- G10L25/30—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the analysis technique using neural networks
Landscapes
- Engineering & Computer Science (AREA)
- Computational Linguistics (AREA)
- Health & Medical Sciences (AREA)
- Audiology, Speech & Language Pathology (AREA)
- Human Computer Interaction (AREA)
- Physics & Mathematics (AREA)
- Acoustics & Sound (AREA)
- Multimedia (AREA)
- Artificial Intelligence (AREA)
- Evolutionary Computation (AREA)
- Signal Processing (AREA)
- Telephonic Communication Services (AREA)
- Machine Translation (AREA)
Abstract
The present application proposes a text-to-speech conversion method, apparatus, and computer device. The conversion method includes: obtaining the frames corresponding to a text to be converted; obtaining the vector features and text prosodic features of the phone corresponding to the current frame, and obtaining the mapping features of the linear spectrum corresponding to the previous frame of the current frame; inputting the obtained vector features, text prosodic features, and mapping features into a pre-trained neural network model to obtain the linear spectrum corresponding to the current frame; and, after the linear spectra of the frames corresponding to the text have been obtained, obtaining the speech corresponding to the text from those linear spectra. The application can thus generate the speech corresponding to a text directly from the linear spectra of the frames corresponding to that text; because no approximation error is introduced, the resulting speech has better sound quality and better overall fluency.
Description
Technical field
The present application relates to the field of speech synthesis, and in particular to a text-to-speech conversion method, apparatus, and computer device.
Background technology
TTS is the abbreviation of Text To Speech and is a component of human-computer interaction; its purpose is to let a machine speak according to text. A phone is the smallest unit of human speech; in Chinese, each initial or final of a syllable is a phone. To let a machine produce the corresponding sound from text, an acoustic model must be built for each phone.
In the related art, this modeling is done with a vocoder. The approach first splits the speech signal into frames and then models the acoustics of each frame in three parts: (1) whether the frame is voiced; (2) the fundamental frequency of the frame; (3) the impulse response of the frame relative to the fundamental frequency. However, because this method applies approximations derived from human prior knowledge, some information is lost, so the synthesized sound carries a noticeable mechanical quality and cannot match natural speech.
Summary of the invention
The present application aims to solve at least one of the technical problems in the related art.
To this end, a first object of the application is to propose a text-to-speech conversion method that generates the speech corresponding to a text directly from the linear spectra of the frames corresponding to that text; because no approximation error is introduced, the resulting speech has better sound quality and better overall fluency.
A second object of the application is to propose a text-to-speech conversion apparatus.
A third object of the application is to propose a computer device.
A fourth object of the application is to propose a non-transitory computer-readable storage medium.
To achieve the above objects, an embodiment of the first aspect of the application proposes a text-to-speech conversion method, including: obtaining the frames corresponding to a text to be converted; obtaining the vector features and text prosodic features of the phone corresponding to the current frame, and obtaining the mapping features of the linear spectrum corresponding to the previous frame of the current frame; inputting the obtained vector features, text prosodic features, and mapping features into a pre-trained neural network model to obtain the linear spectrum corresponding to the current frame; and, after the linear spectra of the frames corresponding to the text have been obtained, obtaining the speech corresponding to the text from those linear spectra.
In the text-to-speech conversion method of this embodiment, after the frames corresponding to the text to be converted are obtained, the vector features and text prosodic features of the phone corresponding to the current frame are obtained, along with the mapping features of the linear spectrum corresponding to the previous frame of the current frame; the obtained vector features, text prosodic features, and mapping features are then input into a pre-trained neural network model to obtain the linear spectrum corresponding to the current frame. After the linear spectra of the frames corresponding to the text have been obtained, the speech corresponding to the text is generated from them. Speech can thus be obtained directly from the linear spectra of the frames corresponding to the text; because no approximation error is introduced, the resulting speech has better sound quality and better overall fluency.
To achieve the above objects, an embodiment of the second aspect of the application proposes a text-to-speech conversion apparatus, including: an acquisition module for obtaining the frames corresponding to a text to be converted, obtaining the vector features and text prosodic features of the phone corresponding to the current frame, and obtaining the mapping features of the linear spectrum corresponding to the previous frame of the current frame; an input module for inputting the vector features, text prosodic features, and mapping features obtained by the acquisition module into a pre-trained neural network model to obtain the linear spectrum corresponding to the current frame; and a conversion module for obtaining, after the linear spectra of the frames corresponding to the text have been obtained, the speech corresponding to the text from those linear spectra.
In the text-to-speech conversion apparatus of this embodiment, after the acquisition module obtains the frames corresponding to the text to be converted, it obtains the vector features and text prosodic features of the phone corresponding to the current frame, along with the mapping features of the linear spectrum corresponding to the previous frame of the current frame; the input module then inputs these vector features, text prosodic features, and mapping features into a pre-trained neural network model to obtain the linear spectrum corresponding to the current frame. After the linear spectra of the frames corresponding to the text have been obtained, the conversion module generates the speech corresponding to the text from them. Speech can thus be obtained directly from the linear spectra of the frames corresponding to the text; because no approximation error is introduced, the resulting speech has better sound quality and better overall fluency.
To achieve the above objects, an embodiment of the third aspect of the application proposes a computer device, including a memory, a processor, and a computer program stored on the memory and runnable on the processor; when the processor executes the computer program, the method described above is implemented.
To achieve the above objects, an embodiment of the fourth aspect of the application proposes a non-transitory computer-readable storage medium on which a computer program is stored; when executed by a processor, the computer program implements the method described above.
Additional aspects and advantages of the application will be set forth in part in the following description, and in part will become apparent from that description or be learned through practice of the application.
Description of the drawings
The above and/or additional aspects and advantages of the application will become apparent and readily understood from the following description of the embodiments taken in conjunction with the accompanying drawings, in which:
Fig. 1 is a flowchart of an embodiment of the text-to-speech conversion method of the application;
Fig. 2 is a schematic diagram of the neural network composed of two fully connected layers in the text-to-speech conversion method of the application;
Fig. 3 is a flowchart of another embodiment of the text-to-speech conversion method of the application;
Fig. 4 is a flowchart of a further embodiment of the text-to-speech conversion method of the application;
Fig. 5 is a schematic diagram of the structure of the neural network model in the text-to-speech conversion method of the application;
Fig. 6 is a schematic diagram of the structure of the convolution unit in the text-to-speech conversion method of the application;
Fig. 7 is a flowchart of a further embodiment of the text-to-speech conversion method of the application;
Fig. 8 is a schematic diagram of the structure of an embodiment of the text-to-speech conversion apparatus of the application;
Fig. 9 is a schematic diagram of the structure of an embodiment of the computer device of the application.
Detailed description of the embodiments
Embodiments of the application are described in detail below, and examples of the embodiments are shown in the accompanying drawings, in which the same or similar reference numerals throughout denote the same or similar elements or elements having the same or similar functions. The embodiments described below with reference to the drawings are exemplary; they are intended to explain the application and should not be construed as limiting it.
Existing text-to-speech conversion schemes do not convert to speech directly from the spectrum of the audio; instead, they first extract acoustic parameters such as the fundamental frequency and the spectral envelope from the audio, and then perform the conversion from these indirect acoustic features. Because of the complexity of the sound spectrum, simplifying the conversion process in this way inevitably introduces a large amount of approximation, so the synthesized speech always exhibits some degree of distortion. To avoid this problem, the application performs the conversion directly from the linear spectra corresponding to the text to be converted.
Fig. 1 is a flowchart of an embodiment of the text-to-speech conversion method of the application. As shown in Fig. 1, the method may include:
Step 101: obtain the frames corresponding to the text to be converted.
Specifically, obtaining the frames corresponding to the text to be converted may be: obtaining the frames corresponding to each phone of the text.
Taking the text to be converted "Baidu" as an example, the phones of "Baidu" are b, ai, d, and u, so obtaining the frames corresponding to the text means obtaining the frames corresponding to "b", "ai", "d", and "u" respectively.
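As a minimal illustration of this step (the syllable-to-phone table and the fixed frame count per phone below are illustrative assumptions, not part of the patent), the frame assignment for "Baidu" might look like:

```python
# Hypothetical sketch: map the pinyin syllables of "Baidu" to phones
# (initials/finals) and assign each phone a run of frames.
SYLLABLE_TO_PHONES = {
    "bai": ["b", "ai"],  # initial + final
    "du": ["d", "u"],
}

def text_to_phones(syllables):
    """Flatten a list of pinyin syllables into a phone sequence."""
    phones = []
    for s in syllables:
        phones.extend(SYLLABLE_TO_PHONES[s])
    return phones

def phones_to_frames(phones, frames_per_phone=5):
    """Give each phone a run of frames; every frame carries its phone."""
    return [(p, i) for p in phones for i in range(frames_per_phone)]

phones = text_to_phones(["bai", "du"])
frames = phones_to_frames(phones)
```

Every frame in a phone's run shares that phone, which is why the per-frame phone features repeat within a run, as described below in step 102.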
Step 102: obtain the vector features and text prosodic features of the phone corresponding to the current frame, and obtain the mapping features of the linear spectrum corresponding to the previous frame of the current frame.
Here, the vector features of the phone corresponding to the current frame may be the one-hot features of the basic sound unit, such as the initial or final, that corresponds to the current frame; the text prosodic features of the phone corresponding to the current frame may be auxiliary features of that phone, such as the part of speech of the word containing it and/or the word vector of that word.
Again taking "Baidu" as the text to be converted, suppose the phone "b" corresponds to 5 frames; the phone corresponding to each of these 5 frames is "b", so for each of them the vector features and text prosodic features of the phone corresponding to the current frame are exactly the vector features and text prosodic features of the phone "b".
Specifically, obtaining the mapping features of the linear spectrum corresponding to the previous frame of the current frame may be: inputting that linear spectrum into a neural network composed of two fully connected layers for mapping, thereby obtaining the mapping features of the linear spectrum corresponding to the previous frame of the current frame.
Fig. 2 is a schematic diagram of the neural network composed of two fully connected layers in the text-to-speech conversion method of the application. After the linear spectrum corresponding to the previous frame has been obtained, it is input into the neural network shown in Fig. 2 for mapping, which yields the mapping features of that linear spectrum.
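A minimal numpy sketch of such a two-layer fully connected mapping network follows; the layer sizes and the ReLU nonlinearity in the hidden layer are illustrative assumptions, since the text does not specify them:

```python
import numpy as np

# Sketch of the Fig. 2 mapping network: two fully connected layers that
# map the previous frame's linear spectrum to a mapping-feature vector.
# All dimensions and the ReLU activation are assumptions.
rng = np.random.default_rng(0)
SPEC_DIM, HIDDEN_DIM, MAP_DIM = 513, 256, 128  # illustrative sizes

W1 = rng.standard_normal((HIDDEN_DIM, SPEC_DIM)) * 0.01
b1 = np.zeros(HIDDEN_DIM)
W2 = rng.standard_normal((MAP_DIM, HIDDEN_DIM)) * 0.01
b2 = np.zeros(MAP_DIM)

def map_linear_spectrum(prev_spectrum):
    """Two fully connected layers: spectrum -> hidden -> mapping features."""
    h = np.maximum(W1 @ prev_spectrum + b1, 0.0)  # first FC layer + ReLU
    return W2 @ h + b2                            # second FC layer

prev_spectrum = np.abs(rng.standard_normal(SPEC_DIM))
mapping_features = map_linear_spectrum(prev_spectrum)
```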
Step 103: input the obtained vector features, text prosodic features, and mapping features into a pre-trained neural network model to obtain the linear spectrum corresponding to the current frame.
Specifically, after the linear spectrum corresponding to the current frame has been obtained, it can be input into the neural network shown in Fig. 2 to obtain its mapping features, which are then concatenated with the vector features and text prosodic features of the phone corresponding to the next frame to form the input features of the next frame.
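The frame-by-frame loop implied by steps 102 and 103 can be sketched as follows; here `predict_spectrum` stands in for the pre-trained model and `map_linear_spectrum` for the Fig. 2 mapping network, and all dimensions and both stand-in functions are illustrative assumptions:

```python
import numpy as np

# Sketch of the autoregressive synthesis loop: each frame's model input
# is the concatenation of its phone's vector features, its text prosodic
# features, and the mapping features of the previous frame's spectrum.
SPEC_DIM, MAP_DIM, PHONE_DIM, PROSODY_DIM = 513, 128, 64, 32
rng = np.random.default_rng(0)

def map_linear_spectrum(spec):
    # stand-in for the two-layer FC mapping network of Fig. 2
    return spec[:MAP_DIM] * 0.5

def predict_spectrum(features):
    # stand-in for the pre-trained neural network model
    return np.full(SPEC_DIM, features.mean())

def synthesize_spectra(frame_features):
    """frame_features: list of (vector_features, prosodic_features)."""
    prev_spec = np.zeros(SPEC_DIM)  # no previous frame before the first
    spectra = []
    for vec_feat, pros_feat in frame_features:
        mapped = map_linear_spectrum(prev_spec)
        model_input = np.concatenate([vec_feat, pros_feat, mapped])
        prev_spec = predict_spectrum(model_input)  # feeds the next frame
        spectra.append(prev_spec)
    return spectra

frames = [(rng.random(PHONE_DIM), rng.random(PROSODY_DIM)) for _ in range(3)]
spectra = synthesize_spectra(frames)
```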
Step 104: after the linear spectra of the frames corresponding to the text to be converted have been obtained, obtain the speech corresponding to the text from those linear spectra.
Specifically, after the linear spectra of all frames corresponding to the text have been obtained according to the procedures of steps 102 and 103, the speech corresponding to the text can be obtained from those linear spectra.
In the above text-to-speech conversion method, after the frames corresponding to the text to be converted are obtained, the vector features and text prosodic features of the phone corresponding to the current frame are obtained, along with the mapping features of the linear spectrum corresponding to the previous frame of the current frame; these are input into a pre-trained neural network model to obtain the linear spectrum corresponding to the current frame. After the linear spectra of all frames have been obtained, the speech corresponding to the text is generated from them. Speech is thus obtained directly from the linear spectra of the frames corresponding to the text; because no approximation error is introduced, the resulting speech has better sound quality and better overall fluency.
Fig. 3 is a flowchart of another embodiment of the text-to-speech conversion method of the application. As shown in Fig. 3, in the embodiment of Fig. 1, step 104 may include:
Step 301: obtain the phase spectrum of the speech corresponding to the text to be converted by iterating from the linear spectra of the frames corresponding to the text.
Step 302: generate the speech corresponding to the text from its phase spectrum by the inverse short-time Fourier transform algorithm.
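Steps 301 and 302 match the shape of the well-known Griffin-Lim procedure, which iteratively estimates a phase spectrum consistent with a given magnitude spectrogram and then reconstructs the waveform with an inverse STFT. The patent does not name the algorithm, so the following numpy sketch is one possible reading; the window, hop, and iteration count are illustrative assumptions:

```python
import numpy as np

def stft(x, n_fft=512, hop=128):
    """Short-time Fourier transform with a Hann window."""
    win = np.hanning(n_fft)
    frames = [x[i:i + n_fft] * win for i in range(0, len(x) - n_fft, hop)]
    return np.array([np.fft.rfft(f) for f in frames])

def istft(spec, n_fft=512, hop=128):
    """Overlap-add inverse STFT with window-power normalization."""
    win = np.hanning(n_fft)
    out = np.zeros(hop * len(spec) + n_fft)
    norm = np.zeros_like(out)
    for i, frame in enumerate(spec):
        out[i * hop:i * hop + n_fft] += np.fft.irfft(frame) * win
        norm[i * hop:i * hop + n_fft] += win ** 2
    return out / np.maximum(norm, 1e-8)

def griffin_lim(magnitude, n_iter=30, n_fft=512, hop=128):
    """Iteratively recover a phase spectrum consistent with `magnitude`
    (step 301), then reconstruct the waveform by inverse STFT (step 302)."""
    rng = np.random.default_rng(0)
    phase = np.exp(2j * np.pi * rng.random(magnitude.shape))
    for _ in range(n_iter):
        audio = istft(magnitude * phase, n_fft, hop)         # inverse STFT
        rebuilt = stft(audio, n_fft, hop)                    # re-analyse
        phase = rebuilt / np.maximum(np.abs(rebuilt), 1e-8)  # keep phase only
        phase = phase[:len(magnitude)]
    return istft(magnitude * phase, n_fft, hop)

# Toy magnitude spectrogram derived from a sine wave, then reconstructed.
t = np.arange(16000) / 16000.0
mag = np.abs(stft(np.sin(2 * np.pi * 440 * t)))
audio = griffin_lim(mag)
```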
Fig. 4 is a flowchart of a further embodiment of the text-to-speech conversion method of the application. As shown in Fig. 4, in the embodiment of Fig. 1, step 103 may include:
Step 401: after reducing the dimensionality of the obtained vector features, text prosodic features, and mapping features through one fully connected layer, input them into a network formed by at least two convolution units with residual connections; after upsampling through one fully connected layer, obtain the linear spectrum corresponding to the current frame.
That is, in this embodiment the structure of the neural network model may be as shown in Fig. 5, which is a schematic diagram of the structure of the neural network model in the text-to-speech conversion method of the application. The output feature of the model in Fig. 5 is the linear spectrum corresponding to the current frame, i.e., the amplitude spectrum obtained from the short-time Fourier transform of the audio corresponding to the current frame. The audio corresponding to the current frame is first split into frames of 25 ms each, with an overlap of 12.5 ms between adjacent frames; a Fourier transform is then applied to each frame, and the real part of the transform is taken as the linear spectrum corresponding to the current frame.
Here, the convolution unit includes two one-dimensional convolution kernels; the output of one kernel is normalized onto a predetermined interval by an activation function, and the output of the convolution unit is the element-wise product of this normalized output and the output of the other kernel, which does not pass through the activation function.
The activation function may be the sigmoid activation function and the predetermined interval may be the interval (0, 1); the structure of the convolution unit may be as shown in Fig. 6, which is a schematic diagram of the structure of the convolution unit in the text-to-speech conversion method of the application.
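This is the shape of a gated (GLU-style) convolution: one kernel produces the content, the other a sigmoid gate in (0, 1), and the two are multiplied element-wise. A minimal one-dimensional numpy sketch follows, with a residual connection as in the Fig. 5 stack; the kernel size and single-channel setup are illustrative assumptions:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def conv1d_same(x, kernel):
    """'same'-length 1-D convolution of a single-channel sequence."""
    pad = len(kernel) // 2
    xp = np.pad(x, pad)
    return np.array([xp[i:i + len(kernel)] @ kernel for i in range(len(x))])

def gated_conv_unit(x, k_content, k_gate):
    """Two 1-D kernels: the gate output is squashed onto (0, 1) by a
    sigmoid and multiplied element-wise with the content output; a
    residual connection adds the input back."""
    content = conv1d_same(x, k_content)
    gate = sigmoid(conv1d_same(x, k_gate))  # normalized onto (0, 1)
    return x + content * gate               # residual connection

rng = np.random.default_rng(0)
x = rng.standard_normal(32)
y = gated_conv_unit(x, rng.standard_normal(3), rng.standard_normal(3))
```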
This embodiment uses a convolutional neural network model in place of a recurrent neural network, which helps improve the efficiency of parallel computation in the network and greatly increases the speed of speech synthesis. In addition, the convolutional model can play a role similar to a filter, so it also has good conversion capability on training data with a lower signal-to-noise ratio.
Fig. 7 is a flowchart of a further embodiment of the text-to-speech conversion method of the application. As shown in Fig. 7, in the embodiment of Fig. 1, the following may also be included before step 103:
Step 701: obtain the frames corresponding to a training text.
Step 702: obtain the vector features and text prosodic features of the phone corresponding to the current frame among the frames corresponding to the training text, and obtain the mapping features of the linear spectrum corresponding to the previous frame of speech in the speech corresponding to the training text.
Step 703: input the obtained vector features, text prosodic features, and mapping features into the neural network model to be trained to obtain the linear spectrum corresponding to the current frame.
Step 704: after the linear spectra of the frames corresponding to the training text have been obtained, obtain the speech corresponding to the training text from those linear spectra.
Step 705: when the obtained speech corresponding to the training text meets a predetermined condition, obtain the trained neural network model.
Steps 701 to 705 describe the training process of the neural network model used in the embodiments of the application.
As can be seen from step 702, when the neural network model is being trained, the mapping features of the previous frame that are used are those of the linear spectrum of the previous frame of the real speech corresponding to the training text, whereas when the model is actually used, the mapping features of the previous frame are those of the linear spectrum output by the neural network model for the previous frame.
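This distinction is commonly known as teacher forcing: the ground-truth previous-frame spectrum is fed back during training, while the model's own previous output is fed back at inference time. A schematic sketch of the two modes (the stand-in model and the spectrum dimension are illustrative assumptions):

```python
import numpy as np

SPEC_DIM = 8  # illustrative spectrum dimension

def model(prev_spec):
    # stand-in for the neural network model: predicts the current frame's
    # spectrum from the previous frame's spectrum (other inputs omitted)
    return prev_spec * 0.9 + 0.1

def run(ground_truth, teacher_forcing):
    """Roll the model over a sequence of frames in either mode."""
    prev = np.zeros(SPEC_DIM)
    outputs = []
    for target in ground_truth:
        pred = model(prev)
        outputs.append(pred)
        # training (step 702): feed back the real previous frame;
        # inference: feed back the model's own previous output
        prev = target if teacher_forcing else pred
    return outputs

truth = [np.full(SPEC_DIM, v) for v in (1.0, 2.0, 3.0)]
train_outputs = run(truth, teacher_forcing=True)
infer_outputs = run(truth, teacher_forcing=False)
```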
In step 705, the predetermined condition met by the obtained speech corresponding to the training text may be: the error in sound quality and overall fluency between the speech obtained through the neural network model to be trained and the real speech of the training text falls within a predetermined range. Naturally, this range can be set in specific implementations according to system performance and/or implementation requirements; this embodiment places no limitation on its size.
The text-to-speech conversion method provided by the embodiments of the application performs the conversion directly from the linear spectra corresponding to the text to be converted; compared with schemes that convert from acoustic features obtained by approximate estimation, the error of the resulting speech is smaller, so its sound quality and overall fluency are better. Moreover, the application uses a convolutional neural network model in place of a recurrent neural network, which helps improve the efficiency of parallel computation and greatly increases the synthesis speed of the audio; in addition, the convolutional model can play a role similar to a filter and thus also has good conversion capability on training data with a lower signal-to-noise ratio.
Fig. 8 is a schematic diagram of the structure of an embodiment of the text-to-speech conversion apparatus of the application; the apparatus can implement the text-to-speech conversion method provided by the embodiments of the application. As shown in Fig. 8, the apparatus may include an acquisition module 81, an input module 82, and a conversion module 83.
The acquisition module 81 is configured to obtain the frames corresponding to the text to be converted, obtain the vector features and text prosodic features of the phone corresponding to the current frame, and obtain the mapping features of the linear spectrum corresponding to the previous frame of the current frame. Specifically, obtaining the frames corresponding to the text to be converted may be: obtaining the frames corresponding to each phone of the text.
Taking the text to be converted "Baidu" as an example, the phones of "Baidu" are b, ai, d, and u, so obtaining the frames corresponding to the text means obtaining the frames corresponding to "b", "ai", "d", and "u" respectively.
Here, the vector features of the phone corresponding to the current frame may be the one-hot features of the basic sound unit, such as the initial or final, that corresponds to the current frame; the text prosodic features of the phone corresponding to the current frame may be auxiliary features of that phone, such as the part of speech of the word containing it and/or the word vector of that word.
Again taking "Baidu" as the text to be converted, suppose the phone "b" corresponds to 5 frames; the phone corresponding to each of these 5 frames is "b", so for each of them the vector features and text prosodic features of the phone corresponding to the current frame are exactly the vector features and text prosodic features of the phone "b".
Specifically, the acquisition module 81 inputs the linear spectrum corresponding to the previous frame of the current frame into a neural network composed of two fully connected layers for mapping, thereby obtaining the mapping features of that linear spectrum.
After the linear spectrum corresponding to the previous frame has been obtained, the acquisition module 81 can input it into the neural network shown in Fig. 2 for mapping to obtain its mapping features.
The input module 82 is configured to input the vector features, text prosodic features, and mapping features obtained by the acquisition module 81 into the pre-trained neural network model to obtain the linear spectrum corresponding to the current frame.
In this embodiment, after the input module 82 obtains the linear spectrum corresponding to the current frame, the acquisition module 81 can input it into the neural network shown in Fig. 2 to obtain its mapping features, which are then concatenated with the vector features and text prosodic features of the phone corresponding to the next frame to form the input features of the next frame.
The conversion module 83 is configured to obtain, after the linear spectra of the frames corresponding to the text to be converted have been obtained, the speech corresponding to the text from those linear spectra.
Specifically, after the linear spectra of all frames corresponding to the text have been obtained, the conversion module 83 can obtain the speech corresponding to the text from them.
In this embodiment, the conversion module 83 is specifically configured to obtain the phase spectrum of the speech corresponding to the text by iterating from the linear spectra of the frames corresponding to the text, and to generate the speech corresponding to the text from that phase spectrum by the inverse short-time Fourier transform algorithm.
In this embodiment, the input module 82 is specifically configured to reduce the dimensionality of the obtained vector features, text prosodic features, and mapping features through one fully connected layer, input them into a network formed by at least two convolution units with residual connections, and, after upsampling through one fully connected layer, obtain the linear spectrum corresponding to the current frame.
That is, in this embodiment, the structure of the neural network model may be as shown in Fig. 5. The output feature of the neural network model shown in Fig. 5 is the linear spectrum corresponding to the current frame, i.e., the amplitude spectrum of the spectrum obtained by applying a short-time Fourier transform to the audio corresponding to the current frame. To obtain this feature, the audio corresponding to the current frame is first divided into frames, each frame being 25 ms long with an overlap of 12.5 ms between adjacent frames. A Fourier transform is then applied to each frame, and the magnitude (modulus) of the Fourier transform is taken as the linear spectrum corresponding to the current frame.
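Assuming a 16 kHz sampling rate (not stated here), the 25 ms frames with 12.5 ms overlap correspond to 400-sample frames with a 200-sample hop, and the framing plus frame-wise Fourier magnitude (the amplitude spectrum described above) can be sketched as:

```python
import numpy as np

def linear_spectrum(audio, sr=16000, frame_ms=25, hop_ms=12.5):
    # Split the audio into 25 ms frames with 12.5 ms overlap
    # (the sample rate sr is an assumed value for illustration).
    frame = int(sr * frame_ms / 1000)   # 400 samples at 16 kHz
    hop = int(sr * hop_ms / 1000)       # 200 samples at 16 kHz
    win = np.hanning(frame)
    mags = []
    for i in range(0, len(audio) - frame + 1, hop):
        spec = np.fft.rfft(audio[i:i + frame] * win)
        mags.append(np.abs(spec))       # magnitude = the linear-spectrum feature
    return np.array(mags)
```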
The convolution unit includes two one-dimensional convolution kernels, wherein the output of one convolution kernel is normalized onto a predetermined interval by an activation function; the output of the convolution unit is the element-wise product of the output of that kernel, normalized onto the predetermined interval, and the output of the other convolution kernel, which does not pass through the activation function.
The activation function may be the sigmoid activation function, the predetermined interval may be the interval (0, 1), and the structure of the convolution unit may be as shown in Fig. 6.
This embodiment uses a convolutional neural network model instead of a recurrent neural network, which helps improve the efficiency of parallel network computation and greatly increases the speed of speech synthesis. In addition, the convolutional neural network model used in this embodiment can act like a filter, so it also provides good speech conversion capability on training data with a low signal-to-noise ratio.
Further, the acquisition module 81 is further configured to obtain the frames corresponding to a training text, to obtain the vector features and text prosodic features of the phone corresponding to the current frame among the frames corresponding to the training text, and to obtain the mapping features of the linear spectrum corresponding to the frame of speech preceding the current frame in the speech corresponding to the training text;
the input module 82 is further configured to input the vector features, text prosodic features, and mapping features obtained by the acquisition module 81 into the neural network model to be trained, to obtain the linear spectrum corresponding to the current frame;
the conversion module 83 is further configured to, after the linear spectra of the frames corresponding to the training text are obtained, obtain the speech corresponding to the training text according to the linear spectra of the frames corresponding to the training text;
the acquisition module 81 is further configured to obtain the trained neural network model when the obtained speech corresponding to the training text satisfies a predetermined condition.
When the neural network model is being trained, the mapping features of the previous frame obtained by the acquisition module 81 are the mapping features of the linear spectrum corresponding to the frame of speech preceding the current frame in the speech corresponding to the training text; when the neural network model is actually used, the mapping features of the previous frame obtained by the acquisition module 81 are the mapping features of the linear spectrum of the previous frame output by the neural network model.
Here, the speech corresponding to the training text satisfying the predetermined condition may mean that the error, in sound quality and overall fluency, between the speech corresponding to the training text obtained through the neural network model to be trained and the real speech of the training text is within a predetermined range. Of course, in a specific implementation the predetermined range may be set according to system performance and/or implementation requirements; this embodiment does not limit the size of the predetermined range.
In the above text-to-speech conversion apparatus, after the frames corresponding to the text to be converted are obtained, the vector features and text prosodic features of the phone corresponding to the current frame among the frames are obtained, along with the mapping features of the linear spectrum corresponding to the previous frame of the current frame; the obtained vector features, text prosodic features, and mapping features are then input into a pre-trained neural network model to obtain the linear spectrum corresponding to the current frame. After the linear spectra of the frames corresponding to the text to be converted are obtained, the speech corresponding to the text to be converted is obtained according to those linear spectra. The speech corresponding to the text to be converted can thus be obtained directly from the linear spectra of its frames; because no approximation error is introduced, the obtained speech has better sound quality and better overall fluency.
Fig. 9 is a schematic structural diagram of an embodiment of the computer equipment of the present application. The computer equipment may include a memory, a processor, and a computer program stored on the memory and executable on the processor; when the processor executes the computer program, the text-to-speech conversion method provided by the embodiments of the present application can be implemented.
The computer equipment may be a server or a terminal device, and the terminal device may be an intelligent terminal such as a personal computer (Personal Computer; hereinafter: PC) or a laptop. This embodiment does not limit the specific form of the computer equipment.
Fig. 9 shows a block diagram of an exemplary computer device 12 suitable for implementing embodiments of the present application. The computer device 12 shown in Fig. 9 is only an example and should not impose any limitation on the functions or scope of use of the embodiments of the present application.
As shown in Fig. 9, the computer device 12 takes the form of a general-purpose computing device. The components of the computer device 12 may include, but are not limited to: one or more processors or processing units 16, a system memory 28, and a bus 18 connecting different system components (including the system memory 28 and the processing unit 16).
The bus 18 represents one or more of several classes of bus structures, including a memory bus or memory controller, a peripheral bus, an accelerated graphics port, a processor, or a local bus using any of a variety of bus architectures. By way of example, these architectures include, but are not limited to, the Industry Standard Architecture (hereinafter: ISA) bus, the Micro Channel Architecture (hereinafter: MCA) bus, the Enhanced ISA bus, the Video Electronics Standards Association (hereinafter: VESA) local bus, and the Peripheral Component Interconnect (hereinafter: PCI) bus.
The computer device 12 typically includes a variety of computer-system-readable media. These media may be any usable media that can be accessed by the computer device 12, including volatile and non-volatile media and removable and non-removable media.
The system memory 28 may include computer-system-readable media in the form of volatile memory, such as a random access memory (hereinafter: RAM) 30 and/or a cache memory 32. The computer device 12 may further include other removable/non-removable, volatile/non-volatile computer-system storage media. By way of example only, the storage system 34 may be used to read from and write to non-removable, non-volatile magnetic media (not shown in Fig. 9, commonly referred to as a "hard disk drive"). Although not shown in Fig. 9, a magnetic disk drive for reading from and writing to a removable non-volatile magnetic disk (such as a "floppy disk"), and an optical disc drive for reading from and writing to a removable non-volatile optical disc (such as a compact disc read-only memory (hereinafter: CD-ROM), a digital versatile disc read-only memory (hereinafter: DVD-ROM), or other optical media), may also be provided. In these cases, each drive may be connected to the bus 18 through one or more data media interfaces. The memory 28 may include at least one program product having a set of (for example, at least one) program modules, these program modules being configured to perform the functions of the embodiments of the present application.
A program/utility 40 having a set of (at least one) program modules 42 may be stored, for example, in the memory 28. Such program modules 42 include, but are not limited to, an operating system, one or more application programs, other program modules, and program data; each of these examples, or some combination thereof, may include an implementation of a network environment. The program modules 42 generally perform the functions and/or methods of the embodiments described herein.
The computer device 12 may also communicate with one or more external devices 14 (such as a keyboard, a pointing device, or a display 24), with one or more devices that enable a user to interact with the computer device 12, and/or with any device (such as a network card or a modem) that enables the computer device 12 to communicate with one or more other computing devices. Such communication may take place through an input/output (I/O) interface 22. Moreover, the computer device 12 may also communicate with one or more networks (such as a local area network (hereinafter: LAN), a wide area network (hereinafter: WAN), and/or a public network such as the Internet) through a network adapter 20. As shown in Fig. 9, the network adapter 20 communicates with the other modules of the computer device 12 through the bus 18. It should be understood that, although not shown in Fig. 9, other hardware and/or software modules may be used in conjunction with the computer device 12, including but not limited to: microcode, device drivers, redundant processing units, external disk drive arrays, RAID systems, tape drives, and data backup storage systems.
The processing unit 16 performs various functional applications and data processing by running the programs stored in the system memory 28, for example, implementing the text-to-speech conversion method provided by the embodiments of the present application.
The embodiments of the present application also provide a non-transitory computer-readable storage medium on which a computer program is stored; when the computer program is executed by a processor, the text-to-speech conversion method provided by the embodiments of the present application can be implemented.
The non-transitory computer-readable storage medium may adopt any combination of one or more computer-readable media. A computer-readable medium may be a computer-readable signal medium or a computer-readable storage medium. A computer-readable storage medium may be, for example, but is not limited to, an electronic, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or any combination of the above. More specific examples (a non-exhaustive list) of computer-readable storage media include: an electrical connection with one or more wires, a portable computer diskette, a hard disk, a random access memory (RAM), a read-only memory (hereinafter: ROM), an erasable programmable read-only memory (hereinafter: EPROM) or flash memory, an optical fiber, a portable compact disc read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the above. In this document, a computer-readable storage medium may be any tangible medium that contains or stores a program for use by or in connection with an instruction execution system, apparatus, or device.
A computer-readable signal medium may include a data signal propagated in baseband or as part of a carrier wave, carrying computer-readable program code. Such a propagated data signal may take many forms, including, but not limited to, an electromagnetic signal, an optical signal, or any suitable combination thereof. A computer-readable signal medium may also be any computer-readable medium other than a computer-readable storage medium that can send, propagate, or transmit a program for use by or in connection with an instruction execution system, apparatus, or device. The program code contained on a computer-readable medium may be transmitted using any suitable medium, including, but not limited to, wireless, wire, optical cable, RF, or any suitable combination thereof.
Computer program code for carrying out the operations of the present application may be written in one or more programming languages or a combination thereof. The programming languages include object-oriented programming languages such as Java, Smalltalk, and C++, and also include conventional procedural programming languages such as the "C" language or similar programming languages. The program code may execute entirely on the user's computer, partly on the user's computer, as a stand-alone software package, partly on the user's computer and partly on a remote computer, or entirely on a remote computer or server. Where a remote computer is involved, the remote computer may be connected to the user's computer through any kind of network, including a local area network (hereinafter: LAN) or a wide area network (hereinafter: WAN), or may be connected to an external computer (for example, through the Internet using an Internet service provider).
In the description of this specification, descriptions referring to the terms "one embodiment", "some embodiments", "an example", "a specific example", or "some examples" mean that specific features, structures, materials, or characteristics described in connection with the embodiment or example are included in at least one embodiment or example of the present application. In this specification, schematic expressions of these terms do not necessarily refer to the same embodiment or example. Moreover, the specific features, structures, materials, or characteristics described may be combined in any suitable manner in any one or more embodiments or examples. In addition, where no contradiction arises, those skilled in the art may combine and group the features of the different embodiments or examples described in this specification.
In addition, the terms "first" and "second" are used for descriptive purposes only and shall not be understood as indicating or implying relative importance or implicitly indicating the number of technical features indicated. Thus, a feature defined as "first" or "second" may expressly or implicitly include at least one such feature. In the description of the present application, "plurality" means at least two, for example two or three, unless specifically defined otherwise.
Any process or method description in a flowchart, or otherwise described herein, may be understood as representing a module, segment, or portion of code including one or more executable instructions for implementing custom logic functions or steps of the process. The scope of the preferred embodiments of the present application also includes other implementations, in which functions may be performed out of the order shown or discussed, including in a substantially simultaneous manner or in the reverse order depending on the functions involved; this should be understood by those skilled in the art to which the embodiments of the present application belong.
The logic and/or steps represented in the flowcharts or otherwise described herein may be considered, for example, an ordered list of executable instructions for implementing logic functions, and may be embodied in any computer-readable medium for use by, or in connection with, an instruction execution system, apparatus, or device (such as a computer-based system, a system including a processor, or another system that can fetch instructions from an instruction execution system, apparatus, or device and execute them). For the purposes of this specification, a "computer-readable medium" may be any means that can contain, store, communicate, propagate, or transmit a program for use by, or in connection with, an instruction execution system, apparatus, or device. More specific examples (a non-exhaustive list) of computer-readable media include: an electrical connection (an electronic device) with one or more wires, a portable computer diskette (a magnetic device), a random access memory (hereinafter: RAM), a read-only memory (hereinafter: ROM), an erasable programmable read-only memory (hereinafter: EPROM) or flash memory, a fiber-optic device, and a portable compact disc read-only memory (hereinafter: CD-ROM). In addition, the computer-readable medium may even be paper or another suitable medium on which the program can be printed, because the program can be obtained electronically, for example by optically scanning the paper or other medium and then editing, interpreting, or otherwise processing it in a suitable manner if necessary, and then stored in a computer memory.
It should be understood that each part of the present application may be implemented in hardware, software, firmware, or a combination thereof. In the above embodiments, multiple steps or methods may be implemented in software or firmware stored in a memory and executed by a suitable instruction execution system. For example, if implemented in hardware, as in another embodiment, any one of the following techniques well known in the art, or a combination thereof, may be used: a discrete logic circuit having logic gate circuits for implementing logic functions on data signals, an application-specific integrated circuit having suitable combinational logic gate circuits, a programmable gate array (hereinafter: PGA), a field programmable gate array (hereinafter: FPGA), and so on.
Those of ordinary skill in the art will understand that all or part of the steps carried by the method of the above embodiments may be completed by instructing the relevant hardware through a program; the program may be stored in a computer-readable storage medium, and the program, when executed, includes one of or a combination of the steps of the method embodiments.
In addition, the functional units in the embodiments of the present application may be integrated into one processing module, each unit may exist alone physically, or two or more units may be integrated into one module. The integrated module may be implemented in the form of hardware or in the form of a software function module. If the integrated module is implemented in the form of a software function module and sold or used as an independent product, it may also be stored in a computer-readable storage medium.
The storage medium mentioned above may be a read-only memory, a magnetic disk, an optical disc, or the like. Although the embodiments of the present application have been shown and described above, it can be understood that the above embodiments are exemplary and shall not be understood as limiting the present application; those of ordinary skill in the art may change, modify, replace, and vary the above embodiments within the scope of the present application.
Claims (14)
1. A text-to-speech conversion method, characterized by comprising:
obtaining frames corresponding to a text to be converted;
obtaining vector features and text prosodic features of the phone corresponding to a current frame among the frames, and obtaining mapping features of the linear spectrum corresponding to the previous frame of the current frame;
inputting the obtained vector features, text prosodic features, and mapping features into a pre-trained neural network model, to obtain the linear spectrum corresponding to the current frame; and
after the linear spectra of the frames corresponding to the text to be converted are obtained, obtaining the speech corresponding to the text to be converted according to the linear spectra of the frames corresponding to the text to be converted.
2. The method according to claim 1, characterized in that obtaining the mapping features of the linear spectrum corresponding to the previous frame of the current frame comprises:
inputting the linear spectrum corresponding to the previous frame of the current frame into a neural network composed of two fully connected layers for mapping, to obtain the mapping features of the linear spectrum corresponding to the previous frame of the current frame.
3. The method according to claim 1, characterized in that obtaining the speech corresponding to the text to be converted according to the linear spectra of the frames corresponding to the text to be converted comprises:
obtaining, by iteration, the phase spectrum of the speech corresponding to the text to be converted from the linear spectra of the frames corresponding to the text to be converted; and
generating the speech corresponding to the text to be converted from the phase spectrum of the speech corresponding to the text to be converted through an inverse short-time Fourier transform algorithm.
4. The method according to claim 1, characterized in that inputting the obtained vector features, text prosodic features, and mapping features into the pre-trained neural network model to obtain the linear spectrum corresponding to the current frame comprises:
passing the obtained vector features, text prosodic features, and mapping features through one fully connected layer for dimensionality reduction, inputting the result into a network composed of at least two convolution units with residual connections, and, after up-sampling through one fully connected layer, obtaining the linear spectrum corresponding to the current frame.
5. The method according to claim 4, characterized in that the convolution unit comprises two one-dimensional convolution kernels, wherein the output of one convolution kernel is normalized onto a predetermined interval by an activation function; and the output of the convolution unit is the element-wise product of the output of the one convolution kernel, normalized onto the predetermined interval, and the output of the other convolution kernel, which does not pass through the activation function.
6. The method according to any one of claims 1-5, characterized in that, before inputting the obtained vector features, text prosodic features, and mapping features into the pre-trained neural network model to obtain the linear spectrum corresponding to the current frame, the method further comprises:
obtaining frames corresponding to a training text;
obtaining vector features and text prosodic features of the phone corresponding to a current frame among the frames corresponding to the training text, and obtaining mapping features of the linear spectrum corresponding to the frame of speech preceding the current frame in the speech corresponding to the training text;
inputting the obtained vector features, text prosodic features, and mapping features into a neural network model to be trained, to obtain the linear spectrum corresponding to the current frame;
after the linear spectra of the frames corresponding to the training text are obtained, obtaining the speech corresponding to the training text according to the linear spectra of the frames corresponding to the training text; and
obtaining the trained neural network model when the obtained speech corresponding to the training text satisfies a predetermined condition.
7. A text-to-speech conversion apparatus, characterized by comprising:
an acquisition module, configured to obtain frames corresponding to a text to be converted, obtain vector features and text prosodic features of the phone corresponding to a current frame among the frames, and obtain mapping features of the linear spectrum corresponding to the previous frame of the current frame;
an input module, configured to input the vector features, text prosodic features, and mapping features obtained by the acquisition module into a pre-trained neural network model, to obtain the linear spectrum corresponding to the current frame; and
a conversion module, configured to, after the linear spectra of the frames corresponding to the text to be converted are obtained, obtain the speech corresponding to the text to be converted according to the linear spectra of the frames corresponding to the text to be converted.
8. The apparatus according to claim 7, characterized in that the acquisition module is specifically configured to input the linear spectrum corresponding to the previous frame of the current frame into a neural network composed of two fully connected layers for mapping, to obtain the mapping features of the linear spectrum corresponding to the previous frame of the current frame.
9. The apparatus according to claim 7, characterized in that the conversion module is specifically configured to obtain, by iteration, the phase spectrum of the speech corresponding to the text to be converted from the linear spectra of the frames corresponding to the text to be converted, and to generate the speech corresponding to the text to be converted from the phase spectrum of the speech corresponding to the text to be converted through an inverse short-time Fourier transform algorithm.
10. The apparatus according to claim 7, characterized in that the input module is specifically configured to pass the obtained vector features, text prosodic features, and mapping features through one fully connected layer for dimensionality reduction, input the result into a network composed of at least two convolution units with residual connections, and, after up-sampling through one fully connected layer, obtain the linear spectrum corresponding to the current frame.
11. The apparatus according to claim 10, characterized in that the convolution unit comprises two one-dimensional convolution kernels, wherein the output of one convolution kernel is normalized onto a predetermined interval by an activation function; and the output of the convolution unit is the element-wise product of the output of the one convolution kernel, normalized onto the predetermined interval, and the output of the other convolution kernel, which does not pass through the activation function.
12. The apparatus according to any one of claims 7-11, characterized in that:
the acquisition module is further configured to obtain frames corresponding to a training text, obtain vector features and text prosodic features of the phone corresponding to a current frame among the frames corresponding to the training text, and obtain mapping features of the linear spectrum corresponding to the frame of speech preceding the current frame in the speech corresponding to the training text;
the input module is further configured to input the vector features, text prosodic features, and mapping features obtained by the acquisition module into a neural network model to be trained, to obtain the linear spectrum corresponding to the current frame;
the conversion module is further configured to, after the linear spectra of the frames corresponding to the training text are obtained, obtain the speech corresponding to the training text according to the linear spectra of the frames corresponding to the training text; and
the acquisition module is further configured to obtain the trained neural network model when the obtained speech corresponding to the training text satisfies a predetermined condition.
13. A computer device, characterized by comprising a memory, a processor, and a computer program stored on the memory and executable on the processor, wherein, when the processor executes the computer program, the method according to any one of claims 1-6 is implemented.
14. A non-transitory computer-readable storage medium on which a computer program is stored, characterized in that, when the computer program is executed by a processor, the method according to any one of claims 1-6 is implemented.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201810241804.9A CN108492818B (en) | 2018-03-22 | 2018-03-22 | Text-to-speech conversion method and device and computer equipment |
Publications (2)
Publication Number | Publication Date |
---|---|
CN108492818A true CN108492818A (en) | 2018-09-04 |
CN108492818B CN108492818B (en) | 2020-10-30 |
Family
ID=63319329
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201810241804.9A Active CN108492818B (en) | 2018-03-22 | 2018-03-22 | Text-to-speech conversion method and device and computer equipment |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN108492818B (en) |
Cited By (9)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN109754778A (en) * | 2019-01-17 | 2019-05-14 | 平安科技(深圳)有限公司 | Phoneme synthesizing method, device and the computer equipment of text |
CN110164413A (en) * | 2019-05-13 | 2019-08-23 | 北京百度网讯科技有限公司 | Phoneme synthesizing method, device, computer equipment and storage medium |
CN110335587A (en) * | 2019-06-14 | 2019-10-15 | 平安科技(深圳)有限公司 | Phoneme synthesizing method, system, terminal device and readable storage medium storing program for executing |
CN110473516A (en) * | 2019-09-19 | 2019-11-19 | 百度在线网络技术(北京)有限公司 | Phoneme synthesizing method, device and electronic equipment |
CN111862931A (en) * | 2020-05-08 | 2020-10-30 | 北京嘀嘀无限科技发展有限公司 | Voice generation method and device |
CN112289297A (en) * | 2019-07-25 | 2021-01-29 | 阿里巴巴集团控股有限公司 | Speech synthesis method, device and system |
CN112464649A (en) * | 2020-11-25 | 2021-03-09 | 平安科技(深圳)有限公司 | Pinyin conversion method and device for polyphone, computer equipment and storage medium |
CN113763924A (en) * | 2021-11-08 | 2021-12-07 | 北京优幕科技有限责任公司 | Acoustic deep learning model training method, and voice generation method and device |
US11322135B2 (en) | 2019-09-12 | 2022-05-03 | International Business Machines Corporation | Generating acoustic sequences via neural networks using combined prosody info |
Citations (12)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
GB2326320A (en) * | 1997-06-13 | 1998-12-16 | Motorola Inc | Text to speech synthesis using neural network |
GB2326321A (en) * | 1997-06-13 | 1998-12-16 | Motorola Inc | Speech synthesis using neural networks |
CA2161540C (en) * | 1994-04-28 | 2000-06-13 | Orhan Karaali | A method and apparatus for converting text into audible signals using a neural network |
CN101996628A (en) * | 2009-08-21 | 2011-03-30 | 索尼株式会社 | Method and device for extracting prosodic features of speech signal |
US20130226569A1 (en) * | 2008-12-18 | 2013-08-29 | Lessac Technologies, Inc. | Methods employing phase state analysis for use in speech synthesis and recognition |
US8527276B1 (en) * | 2012-10-25 | 2013-09-03 | Google Inc. | Speech synthesis using deep neural networks |
CN104867489A (en) * | 2015-04-27 | 2015-08-26 | 苏州大学张家港工业技术研究院 | Method and system for simulating the reading and pronunciation of a real person |
CN105336321A (en) * | 2015-09-25 | 2016-02-17 | 百度在线网络技术(北京)有限公司 | Phonetic segmentation method and device for speech synthesis |
CN105679306A (en) * | 2016-02-19 | 2016-06-15 | 上海语知义信息技术有限公司 | Method and system for predicting fundamental frequency frames in speech synthesis |
CN107039034A (en) * | 2016-02-04 | 2017-08-11 | 科大讯飞股份有限公司 | Prosody prediction method and system |
CN107452369A (en) * | 2017-09-28 | 2017-12-08 | 百度在线网络技术(北京)有限公司 | Speech synthesis model generation method and device |
CN107680580A (en) * | 2017-09-28 | 2018-02-09 | 百度在线网络技术(北京)有限公司 | Text conversion model training method and device, and text conversion method and device |
Cited By (16)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN109754778B (en) * | 2019-01-17 | 2023-05-30 | 平安科技(深圳)有限公司 | Text speech synthesis method and device and computer equipment |
WO2020147404A1 (en) * | 2019-01-17 | 2020-07-23 | 平安科技(深圳)有限公司 | Text-to-speech synthesis method, device, computer apparatus, and non-volatile computer readable storage medium |
CN109754778A (en) * | 2019-01-17 | 2019-05-14 | 平安科技(深圳)有限公司 | Text speech synthesis method and device, and computer equipment |
US11620980B2 (en) | 2019-01-17 | 2023-04-04 | Ping An Technology (Shenzhen) Co., Ltd. | Text-based speech synthesis method, computer device, and non-transitory computer-readable storage medium |
CN110164413A (en) * | 2019-05-13 | 2019-08-23 | 北京百度网讯科技有限公司 | Speech synthesis method, device, computer equipment and storage medium |
CN110164413B (en) * | 2019-05-13 | 2021-06-04 | 北京百度网讯科技有限公司 | Speech synthesis method, apparatus, computer device and storage medium |
CN110335587A (en) * | 2019-06-14 | 2019-10-15 | 平安科技(深圳)有限公司 | Speech synthesis method and system, terminal device and readable storage medium |
CN110335587B (en) * | 2019-06-14 | 2023-11-10 | 平安科技(深圳)有限公司 | Speech synthesis method, system, terminal device and readable storage medium |
CN112289297A (en) * | 2019-07-25 | 2021-01-29 | 阿里巴巴集团控股有限公司 | Speech synthesis method, device and system |
CN112289297B (en) * | 2019-07-25 | 2024-08-02 | 阿里巴巴集团控股有限公司 | Speech synthesis method, device and system |
US11322135B2 (en) | 2019-09-12 | 2022-05-03 | International Business Machines Corporation | Generating acoustic sequences via neural networks using combined prosody info |
US11842728B2 (en) | 2019-09-12 | 2023-12-12 | International Business Machines Corporation | Training neural networks to predict acoustic sequences using observed prosody info |
CN110473516A (en) * | 2019-09-19 | 2019-11-19 | 百度在线网络技术(北京)有限公司 | Speech synthesis method and device, and electronic device |
CN111862931A (en) * | 2020-05-08 | 2020-10-30 | 北京嘀嘀无限科技发展有限公司 | Voice generation method and device |
CN112464649A (en) * | 2020-11-25 | 2021-03-09 | 平安科技(深圳)有限公司 | Pinyin conversion method and device for polyphones, computer equipment and storage medium |
CN113763924A (en) * | 2021-11-08 | 2021-12-07 | 北京优幕科技有限责任公司 | Acoustic deep learning model training method, and voice generation method and device |
Also Published As
Publication number | Publication date |
---|---|
CN108492818B (en) | 2020-10-30 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN108492818A (en) | Text-to-speech conversion method and device, and computer equipment | |
CN111276120B (en) | Speech synthesis method, apparatus and computer-readable storage medium | |
CN109754778A (en) | Text speech synthesis method and device, and computer equipment |
CN108597492B (en) | Speech synthesis method and device |
WO2020182153A1 (en) | Method for performing speech recognition based on self-adaptive language, and related apparatus | |
WO2020098269A1 (en) | Speech synthesis method and speech synthesis device | |
CN108573694A (en) | Artificial-intelligence-based corpus expansion and speech synthesis system construction method and device |
CN110223705A (en) | Voice conversion method, device, equipment and readable storage medium |
CN108447486A (en) | Voice translation method and device |
CN108806665A (en) | Speech synthesis method and device |
CN107749305A (en) | Speech processing method and device |
JP2022133408A (en) | Speech conversion method and system, electronic apparatus, readable storage medium, and computer program | |
CN108986793A (en) | Translation processing method, device and equipment |
JPH0713581A (en) | Method and system for provision of sound with space information | |
CN108831437A (en) | Song generation method, device, terminal and storage medium |
WO2021212954A1 (en) | Method and apparatus for synthesizing emotional speech of specific speaker with extremely few resources | |
CN109637551A (en) | Voice conversion method, device, equipment and storage medium |
US20240339106A1 (en) | Phonemes And Graphemes for Neural Text-to-Speech | |
CN113724683B (en) | Audio generation method, computer device and computer readable storage medium | |
CN113948062B (en) | Data conversion method and computer storage medium | |
CN112331177A (en) | Rhythm-based speech synthesis method, model training method and related equipment | |
CN109599090B (en) | Method, device and equipment for voice synthesis | |
CN110147554A (en) | Simultaneous interpreting method, device and computer equipment | |
CN102568476A (en) | Voice conversion method based on self-organizing feature map network cluster and radial basis network | |
US20240273311A1 (en) | Robust Direct Speech-to-Speech Translation |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant | ||