
CN108492818A - Method, apparatus and computer device for converting text to speech - Google Patents

Method, apparatus and computer device for converting text to speech

Info

Publication number
CN108492818A
CN108492818A (application CN201810241804.9A)
Authority
CN
China
Prior art keywords
text
frame
linear spectral
converted
voice
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN201810241804.9A
Other languages
Chinese (zh)
Other versions
CN108492818B (en)
Inventor
张黄斌
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Baidu Online Network Technology Beijing Co Ltd
Beijing Baidu Netcom Science and Technology Co Ltd
Original Assignee
Beijing Baidu Netcom Science and Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beijing Baidu Netcom Science and Technology Co Ltd filed Critical Beijing Baidu Netcom Science and Technology Co Ltd
Priority to CN201810241804.9A priority Critical patent/CN108492818B/en
Publication of CN108492818A publication Critical patent/CN108492818A/en
Application granted granted Critical
Publication of CN108492818B publication Critical patent/CN108492818B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Links

Classifications

    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L13/00 Speech synthesis; Text to speech systems
    • G10L13/08 Text analysis or generation of parameters for speech synthesis out of text, e.g. grapheme to phoneme translation, prosody generation or stress or intonation determination
    • G10L13/10 Prosody rules derived from text; Stress or intonation
    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L13/00 Speech synthesis; Text to speech systems
    • G10L13/02 Methods for producing synthetic speech; Speech synthesisers
    • G10L13/04 Details of speech synthesis systems, e.g. synthesiser structure or memory management
    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L25/00 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
    • G10L25/27 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the analysis technique
    • G10L25/30 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the analysis technique using neural networks

Landscapes

  • Engineering & Computer Science (AREA)
  • Computational Linguistics (AREA)
  • Health & Medical Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Human Computer Interaction (AREA)
  • Physics & Mathematics (AREA)
  • Acoustics & Sound (AREA)
  • Multimedia (AREA)
  • Artificial Intelligence (AREA)
  • Evolutionary Computation (AREA)
  • Signal Processing (AREA)
  • Telephonic Communication Services (AREA)
  • Machine Translation (AREA)

Abstract

The present application proposes a method, apparatus and computer device for converting text to speech. The conversion method includes: obtaining the frames corresponding to the text to be converted; obtaining the vector features and text prosodic features of the phone corresponding to the current frame, and obtaining the mapping features of the linear spectrum corresponding to the previous frame of the current frame; inputting the obtained vector features, text prosodic features and mapping features into a pre-trained neural network model to obtain the linear spectrum corresponding to the current frame; and, after the linear spectra of the frames corresponding to the text to be converted have been obtained, generating the speech corresponding to the text to be converted from those linear spectra. The present application can thus generate speech directly from the linear spectra of the frames corresponding to the text to be converted; because no approximation error is introduced, the resulting speech has better sound quality and better overall fluency.

Description

Method, apparatus and computer device for converting text to speech
Technical field
This application relates to the field of speech synthesis, and in particular to a method, apparatus and computer device for converting text to speech.
Background
TTS is the abbreviation of Text To Speech. It is a component of human-computer interaction whose goal is to enable a machine to speak from text. A phone is the smallest unit of human speech; in Chinese, a phone is an initial or a final of a syllable. To enable a machine to produce the corresponding sound from text, an acoustic model must be built for each phone.
In the related art, modeling is performed with a vocoder. This modeling method first splits the speech signal into frames, and then models the acoustics of each frame in three parts: (1) whether the frame needs to be voiced; (2) the fundamental frequency of the frame; and (3) the impulse response of the frame relative to the fundamental frequency. However, because this method applies approximations derived from hand-crafted prior knowledge, a certain amount of information is lost, so the synthesized speech sounds noticeably mechanical, and its quality cannot compare with natural speech.
Summary of the invention
The present application is intended to solve at least some of the technical problems in the related art.
To this end, a first object of the present application is to propose a method for converting text to speech that generates the speech directly from the linear spectra of the frames corresponding to the text to be converted; because no approximation error is introduced, the resulting speech has better sound quality and better overall fluency.
A second object of the present application is to propose an apparatus for converting text to speech.
A third object of the present application is to propose a computer device.
A fourth object of the present application is to propose a non-transitory computer-readable storage medium.
To achieve the above objects, an embodiment of the first aspect of the present application proposes a method for converting text to speech, including: obtaining the frames corresponding to the text to be converted; obtaining the vector features and text prosodic features of the phone corresponding to the current frame, and obtaining the mapping features of the linear spectrum corresponding to the previous frame of the current frame; inputting the obtained vector features, text prosodic features and mapping features into a pre-trained neural network model to obtain the linear spectrum corresponding to the current frame; and, after the linear spectra of the frames corresponding to the text to be converted have been obtained, generating the speech corresponding to the text to be converted from those linear spectra.
In the text-to-speech conversion method of the embodiments of the present application, after the frames corresponding to the text to be converted are obtained, the vector features and text prosodic features of the phone corresponding to the current frame are obtained, together with the mapping features of the linear spectrum corresponding to the previous frame of the current frame; the obtained vector features, text prosodic features and mapping features are then input into a pre-trained neural network model to obtain the linear spectrum corresponding to the current frame. After the linear spectra of all frames corresponding to the text to be converted have been obtained, the speech corresponding to the text to be converted is generated from those linear spectra. Speech can thus be generated directly from the linear spectra of the frames corresponding to the text to be converted; because no approximation error is introduced, the resulting speech has better sound quality and better overall fluency.
To achieve the above objects, an embodiment of the second aspect of the present application proposes an apparatus for converting text to speech, including: an acquisition module for obtaining the frames corresponding to the text to be converted, obtaining the vector features and text prosodic features of the phone corresponding to the current frame, and obtaining the mapping features of the linear spectrum corresponding to the previous frame of the current frame; an input module for inputting the vector features, text prosodic features and mapping features obtained by the acquisition module into a pre-trained neural network model to obtain the linear spectrum corresponding to the current frame; and a conversion module for generating, after the linear spectra of the frames corresponding to the text to be converted have been obtained, the speech corresponding to the text to be converted from those linear spectra.
In the text-to-speech conversion apparatus of the embodiments of the present application, after the acquisition module obtains the frames corresponding to the text to be converted, it obtains the vector features and text prosodic features of the phone corresponding to the current frame, together with the mapping features of the linear spectrum corresponding to the previous frame of the current frame; the input module then feeds the obtained vector features, text prosodic features and mapping features into a pre-trained neural network model to obtain the linear spectrum corresponding to the current frame. After the linear spectra of all frames corresponding to the text to be converted have been obtained, the conversion module generates the corresponding speech from those linear spectra. Speech can thus be generated directly from the linear spectra of the frames corresponding to the text to be converted; because no approximation error is introduced, the resulting speech has better sound quality and better overall fluency.
To achieve the above objects, an embodiment of the third aspect of the present application proposes a computer device including a memory, a processor, and a computer program stored on the memory and runnable on the processor; when the processor executes the computer program, the method described above is implemented.
To achieve the above objects, an embodiment of the fourth aspect of the present application proposes a non-transitory computer-readable storage medium on which a computer program is stored; when the computer program is executed by a processor, the method described above is implemented.
Additional aspects and advantages of the present application will be set forth in part in the following description; they will in part become apparent from that description, or may be learned by practice of the present application.
Description of the drawings
The above and/or additional aspects and advantages of the present application will become apparent and readily understood from the following description of the embodiments taken in conjunction with the accompanying drawings, in which:
Fig. 1 is a flowchart of an embodiment of the text-to-speech conversion method of the present application;
Fig. 2 is a schematic diagram of the neural network composed of two fully connected layers in the text-to-speech conversion method of the present application;
Fig. 3 is a flowchart of another embodiment of the text-to-speech conversion method of the present application;
Fig. 4 is a flowchart of a further embodiment of the text-to-speech conversion method of the present application;
Fig. 5 is a schematic diagram of the structure of the neural network model in the text-to-speech conversion method of the present application;
Fig. 6 is a schematic diagram of the structure of the convolution unit in the text-to-speech conversion method of the present application;
Fig. 7 is a flowchart of a further embodiment of the text-to-speech conversion method of the present application;
Fig. 8 is a schematic diagram of the structure of an embodiment of the text-to-speech conversion apparatus of the present application;
Fig. 9 is a schematic diagram of the structure of an embodiment of the computer device of the present application.
Detailed description of the embodiments
The embodiments of the present application are described in detail below; examples of the embodiments are shown in the accompanying drawings, in which identical or similar reference numerals throughout denote identical or similar elements, or elements having identical or similar functions. The embodiments described below with reference to the drawings are exemplary; they are intended to explain the present application and should not be construed as limiting it.
Existing text-to-speech conversion schemes do not perform the conversion directly from the spectrum of the audio; instead, acoustic parameters such as the fundamental frequency and the spectral envelope are first extracted from the audio, and the conversion is then carried out from these indirect acoustic features. Because of the complexity of the sound spectrum, simplifying the conversion process in this way inevitably introduces a large amount of approximation, so the synthesized speech always exhibits some degree of distortion. To avoid this problem, the present application performs the conversion directly from the linear spectrum corresponding to the text to be converted.
Fig. 1 is a flowchart of an embodiment of the text-to-speech conversion method of the present application. As shown in Fig. 1, the method may include:
Step 101: obtain the frames corresponding to the text to be converted.
Specifically, obtaining the frames corresponding to the text to be converted may be: obtaining the frames corresponding to each phone of the text to be converted.
Taking the text to be converted being "Baidu" as an example, the phones of "Baidu" are b, ai, d and u, so obtaining the frames corresponding to the text to be converted may be: obtaining the frames corresponding to "b", "ai", "d" and "u" respectively.
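The decomposition of "Baidu" into its phones can be sketched as below. This is a toy illustration, not part of the patent: the `INITIALS` list is the standard pinyin initial inventory, and a production system would use a full pronunciation lexicon instead of this simple prefix match.

```python
# Toy sketch of splitting a pinyin syllable into phones (initial + final).
INITIALS = sorted(
    ["b", "p", "m", "f", "d", "t", "n", "l", "g", "k", "h",
     "j", "q", "x", "zh", "ch", "sh", "r", "z", "c", "s", "y", "w"],
    key=len, reverse=True)  # try two-letter initials (zh, ch, sh) first

def split_syllable(syllable):
    """Split one pinyin syllable into its [initial, final] phones."""
    for ini in INITIALS:
        if syllable.startswith(ini) and len(syllable) > len(ini):
            return [ini, syllable[len(ini):]]
    return [syllable]  # zero-initial syllable, e.g. "ai"

def text_to_phones(pinyin_syllables):
    phones = []
    for syl in pinyin_syllables:
        phones.extend(split_syllable(syl))
    return phones

print(text_to_phones(["bai", "du"]))  # ['b', 'ai', 'd', 'u']
```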
Step 102: obtain the vector features and text prosodic features of the phone corresponding to the current frame, and obtain the mapping features of the linear spectrum corresponding to the previous frame of the current frame.
Here, the vector features of the phone corresponding to the current frame may be the one-hot features of that phone, i.e. of a basic sound unit such as an initial or a final; the text prosodic features of the phone corresponding to the current frame may be complementary features of that phone, such as the part of speech of the word containing the phone and/or the word vector of that word.
Again taking the text to be converted being "Baidu" as an example, suppose the phone "b" spans 5 frames; the phone corresponding to each of these 5 frames is "b", so for each of these 5 frames the vector features and text prosodic features of the phone corresponding to the current frame are simply the vector features and text prosodic features of the phone "b".
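The per-frame feature assembly just described can be sketched as follows: every frame of a phone shares that phone's one-hot vector and text prosodic features. The phone inventory, the prosodic vector and its size are illustrative assumptions, not values from the patent.

```python
import numpy as np

PHONES = ["b", "ai", "d", "u"]          # toy phone inventory
PHONE_INDEX = {p: i for i, p in enumerate(PHONES)}

def one_hot(phone):
    """One-hot vector features of a basic sound unit."""
    v = np.zeros(len(PHONES))
    v[PHONE_INDEX[phone]] = 1.0
    return v

def frame_features(phone, n_frames, prosody):
    """Repeat the phone's one-hot + prosodic features for each of its frames."""
    feat = np.concatenate([one_hot(phone), prosody])
    return np.tile(feat, (n_frames, 1))

prosody = np.array([1.0, 0.2, -0.5])      # e.g. a POS code + a tiny word vector
frames = frame_features("b", 5, prosody)  # the 5 frames of "b" in the example
print(frames.shape)  # (5, 7): 4 one-hot dims + 3 prosodic dims
```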
Specifically, obtaining the mapping features of the linear spectrum corresponding to the previous frame of the current frame may be: inputting the linear spectrum corresponding to the previous frame into a neural network composed of two fully connected layers for mapping, to obtain the mapping features of the linear spectrum corresponding to the previous frame of the current frame.
Fig. 2 is a schematic diagram of the neural network composed of two fully connected layers in the text-to-speech conversion method of the present application. After the linear spectrum corresponding to the previous frame has been obtained, inputting it into the neural network shown in Fig. 2 for mapping yields the mapping features of the linear spectrum corresponding to the previous frame.
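A minimal sketch of the two-fully-connected-layer mapping network of Fig. 2 follows. All layer sizes and the ReLU nonlinearity are assumptions; the patent specifies only that two fully connected layers map the previous frame's linear spectrum to its mapping features.

```python
import numpy as np

rng = np.random.default_rng(0)
SPEC_DIM, HIDDEN_DIM, MAP_DIM = 201, 64, 32   # illustrative sizes

W1 = rng.standard_normal((SPEC_DIM, HIDDEN_DIM)) * 0.01
b1 = np.zeros(HIDDEN_DIM)
W2 = rng.standard_normal((HIDDEN_DIM, MAP_DIM)) * 0.01
b2 = np.zeros(MAP_DIM)

def map_linear_spectrum(prev_spectrum):
    """Two fully connected layers mapping a linear spectrum to its features."""
    h = np.maximum(prev_spectrum @ W1 + b1, 0.0)  # first FC layer + ReLU
    return h @ W2 + b2                            # second FC layer

prev_spectrum = rng.random(SPEC_DIM)              # previous frame's spectrum
print(map_linear_spectrum(prev_spectrum).shape)   # (32,)
```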
Step 103: input the obtained vector features, text prosodic features and mapping features into a pre-trained neural network model to obtain the linear spectrum corresponding to the current frame.
Specifically, after the linear spectrum corresponding to the current frame has been obtained, it can be input into the neural network shown in Fig. 2 to obtain the mapping features of the linear spectrum corresponding to the current frame; these are then concatenated with the vector features and text prosodic features of the phone corresponding to the next frame to form the input features for the next frame.
Step 104: after the linear spectra of the frames corresponding to the text to be converted have been obtained, generate the speech corresponding to the text to be converted from those linear spectra.
Specifically, once the linear spectra of all frames corresponding to the text to be converted have been obtained by the procedure described in steps 102 and 103, the speech corresponding to the text to be converted can be generated from those linear spectra.
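The frame-by-frame loop of steps 102 to 104 can be sketched as below: each step feeds the mapping features of the previously generated linear spectrum back in, concatenated with the next frame's phone features. `model` and `mapper` stand in for the pre-trained networks and are random linear stubs here; all dimensions are assumptions.

```python
import numpy as np

rng = np.random.default_rng(1)
SPEC_DIM, FEAT_DIM, MAP_DIM = 201, 7, 32

Wmap = rng.standard_normal((SPEC_DIM, MAP_DIM)) * 0.01
Wout = rng.standard_normal((FEAT_DIM + MAP_DIM, SPEC_DIM)) * 0.01

def mapper(spectrum):            # stand-in for the Fig. 2 network
    return spectrum @ Wmap

def model(frame_input):          # stand-in for the pre-trained model
    return frame_input @ Wout

def synthesize_spectra(phone_feats):
    """phone_feats: (n_frames, FEAT_DIM) vector + prosodic features per frame."""
    prev_spectrum = np.zeros(SPEC_DIM)           # nothing before frame 0
    spectra = []
    for feat in phone_feats:
        x = np.concatenate([feat, mapper(prev_spectrum)])
        prev_spectrum = model(x)                 # current frame's spectrum
        spectra.append(prev_spectrum)
    return np.stack(spectra)

spectra = synthesize_spectra(rng.random((10, FEAT_DIM)))
print(spectra.shape)  # (10, 201)
```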
In the above text-to-speech conversion method, after the frames corresponding to the text to be converted are obtained, the vector features and text prosodic features of the phone corresponding to the current frame are obtained, together with the mapping features of the linear spectrum corresponding to the previous frame of the current frame; the obtained vector features, text prosodic features and mapping features are then input into a pre-trained neural network model to obtain the linear spectrum corresponding to the current frame. After the linear spectra of all frames corresponding to the text to be converted have been obtained, the speech corresponding to the text to be converted is generated from those linear spectra. Speech can thus be generated directly from the linear spectra of the frames corresponding to the text to be converted; because no approximation error is introduced, the resulting speech has better sound quality and better overall fluency.
Fig. 3 is a flowchart of another embodiment of the text-to-speech conversion method of the present application. As shown in Fig. 3, in the embodiment of Fig. 1, step 104 may include:
Step 301: obtain the phase spectrum of the speech corresponding to the text to be converted by iterating from the linear spectra of the frames corresponding to the text to be converted.
Step 302: from that phase spectrum, generate the speech corresponding to the text to be converted by the inverse short-time Fourier transform algorithm.
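The structure of steps 301 and 302 (iterative phase recovery followed by an inverse STFT) matches the classic Griffin-Lim procedure; the patent does not name an algorithm, so the following is a minimal NumPy sketch under assumed parameters (Hann window, 50% overlap, toy frame sizes).

```python
import numpy as np

def stft(x, n_fft=64, hop=32):
    win = np.hanning(n_fft)
    frames = [x[i:i + n_fft] * win for i in range(0, len(x) - n_fft + 1, hop)]
    return np.array([np.fft.rfft(f) for f in frames])

def istft(spec, n_fft=64, hop=32):
    win = np.hanning(n_fft)
    out = np.zeros(hop * (len(spec) - 1) + n_fft)
    norm = np.zeros_like(out)
    for i, frame in enumerate(spec):           # windowed overlap-add
        out[i * hop:i * hop + n_fft] += np.fft.irfft(frame) * win
        norm[i * hop:i * hop + n_fft] += win ** 2
    return out / np.maximum(norm, 1e-8)

def griffin_lim(magnitude, n_iter=30):
    """Recover phase from a magnitude spectrogram by iteration (step 301),
    then invert with the inverse STFT (step 302)."""
    phase = np.exp(2j * np.pi * np.random.default_rng(0).random(magnitude.shape))
    for _ in range(n_iter):
        x = istft(magnitude * phase)
        phase = np.exp(1j * np.angle(stft(x)))
    return istft(magnitude * phase)

# round-trip sanity check on a short sine wave
signal = np.sin(2 * np.pi * 5 * np.linspace(0, 1, 320))
audio = griffin_lim(np.abs(stft(signal)))
print(audio.shape)  # (320,)
```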
Fig. 4 is a flowchart of a further embodiment of the text-to-speech conversion method of the present application. As shown in Fig. 4, in the embodiment of Fig. 1, step 103 may include:
Step 401: after reducing the dimensionality of the obtained vector features, text prosodic features and mapping features through one fully connected layer, input the result into a network formed by at least two convolution units with residual connections, and obtain the linear spectrum corresponding to the current frame after upsampling through one fully connected layer.
That is, in the present embodiment, the structure of the neural network model may be as shown in Fig. 5, which is a schematic diagram of the structure of the neural network model in the text-to-speech conversion method of the present application. The output feature of the neural network model shown in Fig. 5 is the linear spectrum corresponding to the current frame, i.e. the amplitude spectrum obtained after the short-time Fourier transform of the audio corresponding to the current frame. The audio corresponding to the current frame is first split into frames of 25 ms each, with an overlap of 12.5 ms between adjacent frames; a Fourier transform is then applied to each frame, and the real part of the Fourier transform is taken as the linear spectrum corresponding to the current frame.
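The framing arithmetic for 25 ms frames with a 12.5 ms overlap can be made concrete as below. The 16 kHz sampling rate is an assumption; the patent does not state one.

```python
# Framing arithmetic for 25 ms frames with 12.5 ms overlap, at an
# assumed 16 kHz sampling rate.
SAMPLE_RATE = 16000
FRAME_MS, HOP_MS = 25, 12.5

frame_len = int(SAMPLE_RATE * FRAME_MS / 1000)   # 400 samples per frame
hop_len = int(SAMPLE_RATE * HOP_MS / 1000)       # 200 samples between frame starts

def num_frames(n_samples):
    """Number of full frames that fit into a signal of n_samples samples."""
    if n_samples < frame_len:
        return 0
    return 1 + (n_samples - frame_len) // hop_len

print(frame_len, hop_len, num_frames(SAMPLE_RATE))  # 400 200 79
```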
Here, the convolution unit includes two one-dimensional convolution kernels, the output of one of which is normalized onto a predetermined interval by an activation function; the output of the convolution unit is the element-wise product of the output of that kernel, normalized onto the predetermined interval, and the output of the other kernel, which does not pass through the activation function.
The activation function may be the sigmoid activation function and the predetermined interval may be the interval (0, 1); the structure of the convolution unit may be as shown in Fig. 6, which is a schematic diagram of the structure of the convolution unit in the text-to-speech conversion method of the present application.
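The gated convolution unit of Fig. 6 can be sketched as below: two one-dimensional kernels over the same input, one output squashed into (0, 1) by a sigmoid and gating the other element-wise. The kernel size and values are illustrative, and the residual connection added here reflects step 401's "convolution units with residual connections", an assumption about where the residual attaches.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

rng = np.random.default_rng(2)
kernel_a = rng.standard_normal(3) * 0.1   # ungated branch
kernel_b = rng.standard_normal(3) * 0.1   # gating branch

def conv1d_same(x, k):
    """1-D convolution with zero padding that preserves the input length."""
    pad = len(k) // 2
    return np.convolve(np.pad(x, pad), k, mode="valid")[:len(x)]

def gated_conv_unit(x):
    a = conv1d_same(x, kernel_a)            # no activation
    g = sigmoid(conv1d_same(x, kernel_b))   # normalized onto (0, 1)
    return x + a * g                        # residual + element-wise product

x = rng.random(16)
print(gated_conv_unit(x).shape)  # (16,)
```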
The present embodiment uses a convolutional neural network model instead of a recurrent neural network, which helps to improve the efficiency of parallel computation in the network and greatly increases the synthesis speed of the speech. In addition, the convolutional neural network model used in the present embodiment can act like a filter, so it also has good conversion capability on training data with a lower signal-to-noise ratio.
Fig. 7 is a flowchart of a further embodiment of the text-to-speech conversion method of the present application. As shown in Fig. 7, in the embodiment of Fig. 1, the following may also be performed before step 103:
Step 701: obtain the frames corresponding to a training text.
Step 702: obtain the vector features and text prosodic features of the phone corresponding to the current frame of the training text, and obtain the mapping features of the linear spectrum corresponding to the previous frame of the speech corresponding to the training text.
Step 703: input the obtained vector features, text prosodic features and mapping features into the neural network model to be trained, to obtain the linear spectrum corresponding to the current frame.
Step 704: after the linear spectra of the frames corresponding to the training text have been obtained, generate the speech corresponding to the training text from those linear spectra.
Step 705: when the generated speech corresponding to the training text satisfies a predetermined condition, obtain the trained neural network model.
Steps 701 to 705 above describe the training process of the neural network model used in the embodiments of the present application.
As can be seen from step 702, when the neural network model is being trained, the mapping features of the previous frame are those of the linear spectrum of the previous frame of the real speech corresponding to the training text, whereas when the neural network model is actually used, the mapping features of the previous frame are those of the linear spectrum of the previous frame output by the neural network model.
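This training/inference difference corresponds to what is commonly called teacher forcing, and can be sketched in a few lines. `model` and `mapper` are random stand-ins for the trained networks, and all dimensions are illustrative.

```python
import numpy as np

rng = np.random.default_rng(3)
SPEC_DIM, FEAT_DIM, MAP_DIM = 8, 4, 3
Wmap = rng.standard_normal((SPEC_DIM, MAP_DIM)) * 0.1
Wout = rng.standard_normal((FEAT_DIM + MAP_DIM, SPEC_DIM)) * 0.1

def mapper(s):          # stand-in for the Fig. 2 mapping network
    return s @ Wmap

def model(x):           # stand-in for the model being trained/used
    return x @ Wout

def run(feats, true_spectra=None):
    """Teacher-forced when true_spectra is given, free-running otherwise."""
    prev = np.zeros(SPEC_DIM)
    outputs = []
    for t, feat in enumerate(feats):
        outputs.append(model(np.concatenate([feat, mapper(prev)])))
        # training: ground-truth previous frame; inference: model output
        prev = true_spectra[t] if true_spectra is not None else outputs[-1]
    return np.stack(outputs)

feats = rng.random((6, FEAT_DIM))
truth = rng.random((6, SPEC_DIM))
print(run(feats, truth).shape, run(feats).shape)  # (6, 8) (6, 8)
```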
In step 705, the generated speech corresponding to the training text satisfying a predetermined condition may be: the errors in sound quality and overall fluency of the speech generated for the training text by the neural network model being trained, compared with the real speech of the training text, are within a predetermined range. Naturally, in a specific implementation the predetermined range may be set according to system performance and/or implementation requirements; the present embodiment does not limit its size.
The text-to-speech conversion method provided by the embodiments of the present application performs the conversion directly with the linear spectrum corresponding to the text to be converted; compared with schemes that perform the conversion with acoustic features obtained by approximate estimation, the error of the resulting speech is smaller, so its sound quality and overall fluency are better. Moreover, the present application uses a convolutional neural network model instead of a recurrent neural network, which helps to improve the efficiency of parallel computation in the network and greatly increases the synthesis speed of the audio. In addition, the convolutional neural network model used in the present application can act like a filter, so it also has good conversion capability on training data with a lower signal-to-noise ratio.
Fig. 8 is a schematic diagram of the structure of an embodiment of the text-to-speech conversion apparatus of the present application; the apparatus of this embodiment can implement the text-to-speech conversion method provided by the embodiments of the present application. As shown in Fig. 8, the apparatus may include an acquisition module 81, an input module 82 and a conversion module 83.
The acquisition module 81 is used to obtain the frames corresponding to the text to be converted, obtain the vector features and text prosodic features of the phone corresponding to the current frame, and obtain the mapping features of the linear spectrum corresponding to the previous frame of the current frame. Specifically, obtaining the frames corresponding to the text to be converted may be: obtaining the frames corresponding to each phone of the text to be converted.
Taking the text to be converted being "Baidu" as an example, the phones of "Baidu" are b, ai, d and u, so obtaining the frames corresponding to the text to be converted may be: obtaining the frames corresponding to "b", "ai", "d" and "u" respectively.
Here, the vector features of the phone corresponding to the current frame may be the one-hot features of that phone, i.e. of a basic sound unit such as an initial or a final; the text prosodic features of the phone corresponding to the current frame may be complementary features of that phone, such as the part of speech of the word containing the phone and/or the word vector of that word.
Again taking the text to be converted being "Baidu" as an example, suppose the phone "b" spans 5 frames; the phone corresponding to each of these 5 frames is "b", so for each of these 5 frames the vector features and text prosodic features of the phone corresponding to the current frame are simply the vector features and text prosodic features of the phone "b".
Specifically, the acquisition module 81 is used to input the linear spectrum corresponding to the previous frame of the current frame into a neural network composed of two fully connected layers for mapping, to obtain the mapping features of the linear spectrum corresponding to the previous frame of the current frame.
After the linear spectrum corresponding to the previous frame has been obtained, the acquisition module 81 can input it into the neural network shown in Fig. 2 for mapping, which yields the mapping features of the linear spectrum corresponding to the previous frame.
The input module 82 is used to input the vector features, text prosodic features and mapping features obtained by the acquisition module 81 into a pre-trained neural network model, to obtain the linear spectrum corresponding to the current frame.
In the present embodiment, after the input module 82 obtains the linear spectrum corresponding to the current frame, the acquisition module 81 can input that linear spectrum into the neural network shown in Fig. 2 to obtain its mapping features, which are then concatenated with the vector features and text prosodic features of the phone corresponding to the next frame to form the input features for the next frame.
The conversion module 83 is used to generate, after the linear spectra of the frames corresponding to the text to be converted have been obtained, the speech corresponding to the text to be converted from those linear spectra.
Specifically, after the linear spectra of all frames corresponding to the text to be converted have been obtained, the conversion module 83 can generate the speech corresponding to the text to be converted from those linear spectra.
In the present embodiment, conversion module 83, specifically for passing through from the linear spectral of the corresponding frame of above-mentioned text to be converted Iteration obtains the phase spectrum of the corresponding voice of above-mentioned text to be converted;According to the phase of the corresponding voice of above-mentioned text to be converted Spectrum generates the corresponding voice of above-mentioned text to be converted by reversed Short Time Fourier Transform algorithm.
In this embodiment, the input module 82 is specifically configured to pass the obtained vector features, text prosodic features, and mapping features through one fully connected layer for dimensionality reduction, input the result into a network formed by at least two convolution units with residual connections, and obtain the linear spectrum corresponding to the current frame after upsampling through another fully connected layer.
That is, in this embodiment, the structure of the neural network model may be as shown in Fig. 5. The output feature of the model shown in Fig. 5 is the linear spectrum corresponding to the current frame, i.e. the amplitude spectrum of the spectrum obtained from the short-time Fourier transform of the audio corresponding to the current frame. The audio is first divided into frames of 25 ms each, with an overlap of 12.5 ms between adjacent frames; a Fourier transform is then performed on each frame, and the magnitude of the transform is taken as the linear spectrum corresponding to that frame.
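The framing and amplitude-spectrum extraction can be sketched as follows. The 25 ms frame length and 12.5 ms hop follow the text; the 16 kHz sampling rate, Hann window, and 512-point FFT are assumptions:

```python
import numpy as np

def linear_spectrum(audio, fs=16000, frame_ms=25.0, hop_ms=12.5, n_fft=512):
    """Frame the audio into 25 ms windows with 12.5 ms overlap and take the
    magnitude of each frame's FFT -- the 'linear spectrum' output feature."""
    frame_len = int(fs * frame_ms / 1000)   # 400 samples at 16 kHz
    hop = int(fs * hop_ms / 1000)           # 200 samples at 16 kHz
    n_frames = 1 + max(0, len(audio) - frame_len) // hop
    window = np.hanning(frame_len)
    frames = np.stack([audio[i * hop:i * hop + frame_len] * window
                       for i in range(n_frames)])
    # rfft of each windowed frame; keep only the magnitude (amplitude spectrum).
    return np.abs(np.fft.rfft(frames, n=n_fft, axis=1))
```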
The convolution unit includes two one-dimensional convolution kernels, where the output of one kernel is normalized onto a predetermined interval by an activation function; the output of the convolution unit is the element-wise product of that normalized output and the output of the other kernel, which does not pass through the activation function.
The activation function may be a sigmoid function, the predetermined interval may be the interval (0, 1), and the structure of the convolution unit may be as shown in Fig. 6.
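A pure-NumPy sketch of such a gated convolution unit with a sigmoid gate (Fig. 6 is not reproduced here; the 'same' padding scheme and the random stand-in weights are assumptions):

```python
import numpy as np

def sigmoid(z):
    """Squash values onto the (0, 1) interval."""
    return 1.0 / (1.0 + np.exp(-z))

def gated_conv_unit(x, w_filter, w_gate):
    """Gated 1-D convolution unit: one kernel's output is normalized onto
    (0, 1) by a sigmoid and multiplied element-wise with the other kernel's
    raw output. x: (in_ch, T); weights: (out_ch, in_ch, k) with odd k."""
    out_ch, in_ch, k = w_filter.shape
    pad = k // 2
    xp = np.pad(x, ((0, 0), (pad, pad)))   # 'same' padding along time
    T = x.shape[1]

    def conv(w):
        y = np.empty((out_ch, T))
        for o in range(out_ch):
            for t in range(T):
                y[o, t] = np.sum(w[o] * xp[:, t:t + k])
        return y

    return conv(w_filter) * sigmoid(conv(w_gate))
```

With equal input and output channel counts, a residual connection around the unit is simply `x + gated_conv_unit(x, w_filter, w_gate)`, which is one way at least two such units could be chained.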
This embodiment uses a convolutional neural network model instead of a recurrent neural network, which helps improve the efficiency of parallel computation in the network and thus substantially increases the speed of speech synthesis; in addition, the convolutional model can play a role similar to that of a filter, and therefore retains good conversion capability even for training data with a low signal-to-noise ratio.
Further, the acquisition module 81 is also configured to obtain the frames corresponding to a training text; to obtain the vector features and text prosodic features of the phone corresponding to the current frame among those frames; and to obtain the mapping features of the linear spectrum corresponding to the previous frame of the current frame in the speech corresponding to the training text;
the input module 82 is also configured to input the vector features, text prosodic features, and mapping features obtained by the acquisition module 81 into a neural network model to be trained, obtaining the linear spectrum corresponding to the current frame;
the conversion module 83 is also configured to, after the linear spectra of the frames corresponding to the training text are obtained, obtain the speech corresponding to the training text according to those linear spectra;
the acquisition module 81 is also configured to obtain the trained neural network model when the speech corresponding to the training text meets a predetermined condition.
When the neural network model is being trained, the mapping features of the previous frame obtained by the acquisition module 81 are the mapping features of the linear spectrum of the previous frame of the current frame in the speech corresponding to the training text; when the neural network model is actually used, they are instead the mapping features of the linear spectrum of the previous frame output by the neural network model.
The condition that the obtained speech corresponding to the training text meets a predetermined condition may be that the error in sound quality and overall fluency between the speech obtained through the neural network model to be trained and the real speech of the training text is within a predetermined range. The predetermined range may be set according to system performance and/or implementation requirements; this embodiment does not limit its size.
In the above text-to-speech conversion apparatus, after the frames corresponding to the text to be converted are obtained, the vector features and text prosodic features of the phone corresponding to the current frame are obtained, together with the mapping features of the linear spectrum corresponding to the previous frame of the current frame; the obtained vector features, text prosodic features, and mapping features are then input into a pre-trained neural network model to obtain the linear spectrum corresponding to the current frame. After the linear spectra of all frames corresponding to the text to be converted are obtained, the speech corresponding to the text to be converted is obtained directly from those linear spectra. Because no approximation error is introduced, the resulting speech has better sound quality and better overall fluency.
Fig. 9 is a schematic structural diagram of an embodiment of the computer equipment of the present application. The computer equipment may include a memory, a processor, and a computer program stored on the memory and runnable on the processor; when the processor executes the computer program, the text-to-speech conversion method provided by the embodiments of the present application can be implemented.
The computer equipment may be a server or a terminal device, and the terminal device may be an intelligent terminal such as a personal computer (PC) or a laptop; this embodiment does not limit the specific form of the computer equipment.
Fig. 9 shows a block diagram of an exemplary computer device 12 suitable for implementing embodiments of the present application. The computer device 12 shown in Fig. 9 is only an example and should not impose any restriction on the functions and scope of use of the embodiments of the present application.
As shown in Fig. 9, the computer device 12 takes the form of a general-purpose computing device. The components of the computer device 12 may include, but are not limited to: one or more processors or processing units 16, a system memory 28, and a bus 18 connecting the different system components (including the system memory 28 and the processing unit 16).
The bus 18 represents one or more of several classes of bus structures, including a memory bus or memory controller, a peripheral bus, an accelerated graphics port, and a processor or local bus using any of a variety of bus architectures. By way of example, these architectures include, but are not limited to, the Industry Standard Architecture (ISA) bus, the Micro Channel Architecture (MCA) bus, the Enhanced ISA bus, the Video Electronics Standards Association (VESA) local bus, and the Peripheral Component Interconnect (PCI) bus.
The computer device 12 typically comprises a variety of computer-system-readable media. These media can be any usable media that can be accessed by the computer device 12, including volatile and non-volatile media and removable and non-removable media.
The system memory 28 may include computer-system-readable media in the form of volatile memory, such as a random access memory (RAM) 30 and/or a cache memory 32. The computer device 12 may further include other removable/non-removable, volatile/non-volatile computer-system storage media. By way of example only, a storage system 34 may be used to read and write a non-removable, non-volatile magnetic medium (not shown in Fig. 9, commonly referred to as a "hard disk drive"). Although not shown in Fig. 9, a magnetic disk drive for reading and writing a removable non-volatile magnetic disk (such as a "floppy disk") and an optical disc drive for reading and writing a removable non-volatile optical disc (such as a compact disc read-only memory (CD-ROM), a digital versatile disc read-only memory (DVD-ROM), or other optical media) may be provided. In these cases, each drive may be connected to the bus 18 through one or more data media interfaces. The memory 28 may include at least one program product having a set of (for example, at least one) program modules configured to perform the functions of the embodiments of the present application.
A program/utility 40 having a set of (at least one) program modules 42 may be stored in, for example, the memory 28. Such program modules 42 include, but are not limited to, an operating system, one or more application programs, other program modules, and program data; each of these examples, or some combination thereof, may include an implementation of a network environment. The program modules 42 generally perform the functions and/or methods of the embodiments described herein.
The computer device 12 may also communicate with one or more external devices 14 (such as a keyboard, a pointing device, a display 24, etc.), with one or more devices that enable a user to interact with the computer device 12, and/or with any device (such as a network card, a modem, etc.) that enables the computer device 12 to communicate with one or more other computing devices. Such communication can be carried out through an input/output (I/O) interface 22. Moreover, the computer device 12 may also communicate with one or more networks (such as a local area network (LAN), a wide area network (WAN), and/or a public network such as the Internet) through a network adapter 20. As shown in Fig. 9, the network adapter 20 communicates with the other modules of the computer device 12 through the bus 18. It should be understood that, although not shown in Fig. 9, other hardware and/or software modules may be used in conjunction with the computer device 12, including but not limited to: microcode, device drivers, redundant processing units, external disk drive arrays, RAID systems, tape drives, data backup storage systems, and the like.
The processing unit 16 executes various functional applications and data processing by running programs stored in the system memory 28, for example implementing the text-to-speech conversion method provided by the embodiments of the present application.
The embodiments of the present application also provide a non-transitory computer-readable storage medium on which a computer program is stored; when the computer program is executed by a processor, the text-to-speech conversion method provided by the embodiments of the present application can be implemented.
The non-transitory computer-readable storage medium may adopt any combination of one or more computer-readable media. A computer-readable medium may be a computer-readable signal medium or a computer-readable storage medium. A computer-readable storage medium may be, for example, but is not limited to, an electrical, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or any combination of the above. More specific examples (a non-exhaustive list) of computer-readable storage media include: an electrical connection with one or more wires, a portable computer diskette, a hard disk, a random access memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM) or flash memory, an optical fiber, a portable compact disc read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the above. In this document, a computer-readable storage medium can be any tangible medium that contains or stores a program for use by or in connection with an instruction execution system, apparatus, or device.
A computer-readable signal medium may include a data signal propagated in a baseband or as part of a carrier wave, carrying computer-readable program code. Such a propagated data signal may take a variety of forms, including, but not limited to, an electromagnetic signal, an optical signal, or any suitable combination of the above. A computer-readable signal medium may also be any computer-readable medium other than a computer-readable storage medium that can send, propagate, or transmit a program for use by or in connection with an instruction execution system, apparatus, or device.
The program code contained on a computer-readable medium may be transmitted with any suitable medium, including, but not limited to, wireless, wire, optical cable, RF, etc., or any suitable combination of the above.
Computer program code for carrying out the operations of the present application may be written in one or more programming languages or a combination thereof, including object-oriented programming languages such as Java, Smalltalk, and C++, as well as conventional procedural programming languages such as the "C" language or similar programming languages. The program code may execute entirely on the user's computer, partly on the user's computer, as a stand-alone software package, partly on the user's computer and partly on a remote computer, or entirely on a remote computer or server. In the case of a remote computer, the remote computer may be connected to the user's computer through any kind of network, including a local area network (LAN) or a wide area network (WAN), or may be connected to an external computer (for example, through the Internet using an Internet service provider).
In the description of this specification, reference to the terms "one embodiment", "some embodiments", "an example", "a specific example", "some examples", and the like means that a specific feature, structure, material, or characteristic described in connection with the embodiment or example is included in at least one embodiment or example of the present application. In this specification, schematic expressions of the above terms do not necessarily refer to the same embodiment or example. Moreover, the specific features, structures, materials, or characteristics described may be combined in any suitable manner in any one or more embodiments or examples. In addition, provided that they do not contradict each other, those skilled in the art may combine features of different embodiments or examples described in this specification.
In addition, the terms "first" and "second" are used for descriptive purposes only and are not to be understood as indicating or implying relative importance or implicitly indicating the quantity of the indicated technical feature. Thus, a feature qualified by "first" or "second" may explicitly or implicitly include at least one such feature. In the description of the present application, "plurality" means at least two, for example two or three, unless specifically defined otherwise.
Any process or method description in a flow chart or otherwise described herein may be understood as representing a module, segment, or portion of code that includes one or more executable instructions for implementing custom logic functions or steps of a process, and the scope of the preferred embodiments of the present application includes other implementations in which functions may be executed out of the order shown or discussed, including in a substantially simultaneous manner or in the reverse order, depending on the functions involved; this should be understood by those skilled in the art to which the embodiments of the present application belong.
The logic and/or steps represented in a flow chart or otherwise described herein may, for example, be considered an ordered list of executable instructions for implementing logical functions, and may be embodied in any computer-readable medium for use by or in connection with an instruction execution system, apparatus, or device (such as a computer-based system, a system including a processor, or another system that can fetch instructions from the instruction execution system, apparatus, or device and execute them). For the purposes of this specification, a "computer-readable medium" can be any apparatus that can contain, store, communicate, propagate, or transmit a program for use by or in connection with an instruction execution system, apparatus, or device. More specific examples (a non-exhaustive list) of computer-readable media include: an electrical connection (electronic apparatus) with one or more wires, a portable computer disk cartridge (magnetic apparatus), a random access memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM) or flash memory, an optical fiber apparatus, and a portable compact disc read-only memory (CD-ROM). In addition, the computer-readable medium may even be paper or another suitable medium on which the program can be printed, because the program can be obtained electronically, for example by optically scanning the paper or other medium and then editing, interpreting, or otherwise processing it in a suitable manner if necessary, and then stored in a computer memory.
It should be understood that each part of the present application may be implemented in hardware, software, firmware, or a combination thereof. In the above embodiments, multiple steps or methods may be implemented in software or firmware stored in a memory and executed by a suitable instruction execution system. For example, if implemented in hardware, as in another embodiment, they may be implemented by any one of the following techniques known in the art, or a combination thereof: a discrete logic circuit having logic gate circuits for implementing logic functions on data signals, an application-specific integrated circuit having suitable combinational logic gate circuits, a programmable gate array (PGA), a field-programmable gate array (FPGA), and the like.
Those skilled in the art will appreciate that all or part of the steps carried by the above method embodiments may be completed by instructing relevant hardware through a program; the program may be stored in a computer-readable storage medium, and when executed, the program includes one or a combination of the steps of the method embodiments.
In addition, each functional unit in each embodiment of the present application may be integrated into one processing module, each unit may exist alone physically, or two or more units may be integrated into one module. The above integrated module may be implemented in the form of hardware or in the form of a software functional module. If the integrated module is implemented in the form of a software functional module and sold or used as an independent product, it may also be stored in a computer-readable storage medium.
The storage medium mentioned above may be a read-only memory, a magnetic disk, an optical disc, or the like. Although the embodiments of the present application have been shown and described above, it can be understood that the above embodiments are exemplary and should not be construed as limiting the present application; those skilled in the art can change, modify, replace, and vary the above embodiments within the scope of the present application.

Claims (14)

1. A text-to-speech conversion method, characterized by comprising:
obtaining frames corresponding to a text to be converted;
obtaining vector features and text prosodic features of a phone corresponding to a current frame among the frames, and obtaining mapping features of a linear spectrum corresponding to a previous frame of the current frame;
inputting the obtained vector features, text prosodic features, and mapping features into a pre-trained neural network model to obtain a linear spectrum corresponding to the current frame; and
after linear spectra of the frames corresponding to the text to be converted are obtained, obtaining speech corresponding to the text to be converted according to the linear spectra of the frames corresponding to the text to be converted.
2. The method according to claim 1, characterized in that obtaining the mapping features of the linear spectrum corresponding to the previous frame of the current frame comprises:
inputting the linear spectrum corresponding to the previous frame of the current frame into a neural network composed of two fully connected layers for mapping, to obtain the mapping features of the linear spectrum corresponding to the previous frame of the current frame.
3. The method according to claim 1, characterized in that obtaining the speech corresponding to the text to be converted according to the linear spectra of the frames corresponding to the text to be converted comprises:
obtaining, by iteration, a phase spectrum of the speech corresponding to the text to be converted from the linear spectra of the frames corresponding to the text to be converted; and
generating the speech corresponding to the text to be converted through an inverse short-time Fourier transform algorithm according to the phase spectrum of the speech corresponding to the text to be converted.
4. The method according to claim 1, characterized in that inputting the obtained vector features, text prosodic features, and mapping features into the pre-trained neural network model to obtain the linear spectrum corresponding to the current frame comprises:
passing the obtained vector features, text prosodic features, and mapping features through one fully connected layer for dimensionality reduction, inputting the result into a network composed of at least two convolution units with residual connections, and obtaining the linear spectrum corresponding to the current frame after upsampling through another fully connected layer.
5. The method according to claim 4, characterized in that the convolution unit comprises two one-dimensional convolution kernels, wherein an output of one convolution kernel is normalized onto a predetermined interval by an activation function, and an output of the convolution unit is an element-wise product of the output of the one convolution kernel normalized onto the predetermined interval and an output of the other convolution kernel that does not pass through the activation function.
6. The method according to any one of claims 1 to 5, characterized in that, before inputting the obtained vector features, text prosodic features, and mapping features into the pre-trained neural network model to obtain the linear spectrum corresponding to the current frame, the method further comprises:
obtaining frames corresponding to a training text;
obtaining vector features and text prosodic features of a phone corresponding to a current frame among the frames corresponding to the training text, and obtaining mapping features of a linear spectrum corresponding to a previous frame of the current frame in speech corresponding to the training text;
inputting the obtained vector features, text prosodic features, and mapping features into a neural network model to be trained to obtain the linear spectrum corresponding to the current frame;
after linear spectra of the frames corresponding to the training text are obtained, obtaining speech corresponding to the training text according to the linear spectra of the frames corresponding to the training text; and
obtaining the trained neural network model when the obtained speech corresponding to the training text meets a predetermined condition.
7. A text-to-speech conversion apparatus, characterized by comprising:
an acquisition module, configured to obtain frames corresponding to a text to be converted, to obtain vector features and text prosodic features of a phone corresponding to a current frame among the frames, and to obtain mapping features of a linear spectrum corresponding to a previous frame of the current frame;
an input module, configured to input the vector features, text prosodic features, and mapping features obtained by the acquisition module into a pre-trained neural network model to obtain a linear spectrum corresponding to the current frame; and
a conversion module, configured to, after linear spectra of the frames corresponding to the text to be converted are obtained, obtain speech corresponding to the text to be converted according to the linear spectra of the frames corresponding to the text to be converted.
8. The apparatus according to claim 7, characterized in that
the acquisition module is specifically configured to input the linear spectrum corresponding to the previous frame of the current frame into a neural network composed of two fully connected layers for mapping, to obtain the mapping features of the linear spectrum corresponding to the previous frame of the current frame.
9. The apparatus according to claim 7, characterized in that
the conversion module is specifically configured to obtain, by iteration, a phase spectrum of the speech corresponding to the text to be converted from the linear spectra of the frames corresponding to the text to be converted, and to generate the speech corresponding to the text to be converted through an inverse short-time Fourier transform algorithm according to the phase spectrum.
10. The apparatus according to claim 7, characterized in that
the input module is specifically configured to pass the obtained vector features, text prosodic features, and mapping features through one fully connected layer for dimensionality reduction, input the result into a network composed of at least two convolution units with residual connections, and obtain the linear spectrum corresponding to the current frame after upsampling through another fully connected layer.
11. The apparatus according to claim 10, characterized in that the convolution unit comprises two one-dimensional convolution kernels, wherein an output of one convolution kernel is normalized onto a predetermined interval by an activation function, and an output of the convolution unit is an element-wise product of the output of the one convolution kernel normalized onto the predetermined interval and an output of the other convolution kernel that does not pass through the activation function.
12. The apparatus according to any one of claims 7 to 11, characterized in that
the acquisition module is further configured to obtain frames corresponding to a training text, to obtain vector features and text prosodic features of a phone corresponding to a current frame among the frames corresponding to the training text, and to obtain mapping features of a linear spectrum corresponding to a previous frame of the current frame in speech corresponding to the training text;
the input module is further configured to input the vector features, text prosodic features, and mapping features obtained by the acquisition module into a neural network model to be trained, to obtain the linear spectrum corresponding to the current frame;
the conversion module is further configured to, after linear spectra of the frames corresponding to the training text are obtained, obtain speech corresponding to the training text according to the linear spectra of the frames corresponding to the training text; and
the acquisition module is further configured to obtain the trained neural network model when the obtained speech corresponding to the training text meets a predetermined condition.
13. A computer device, characterized by comprising a memory, a processor, and a computer program stored on the memory and runnable on the processor, wherein the processor implements the method according to any one of claims 1 to 6 when executing the computer program.
14. A non-transitory computer-readable storage medium on which a computer program is stored, characterized in that the computer program, when executed by a processor, implements the method according to any one of claims 1 to 6.
CN201810241804.9A 2018-03-22 2018-03-22 Text-to-speech conversion method and device and computer equipment Active CN108492818B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201810241804.9A CN108492818B (en) 2018-03-22 2018-03-22 Text-to-speech conversion method and device and computer equipment


Publications (2)

Publication Number Publication Date
CN108492818A true CN108492818A (en) 2018-09-04
CN108492818B CN108492818B (en) 2020-10-30

Family

ID=63319329


Cited By (9)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109754778A (en) * 2019-01-17 2019-05-14 平安科技(深圳)有限公司 Text speech synthesis method and device, and computer equipment
CN110164413A (en) * 2019-05-13 2019-08-23 北京百度网讯科技有限公司 Speech synthesis method and device, computer equipment and storage medium
CN110335587A (en) * 2019-06-14 2019-10-15 平安科技(深圳)有限公司 Speech synthesis method, system, terminal device and readable storage medium
CN110473516A (en) * 2019-09-19 2019-11-19 百度在线网络技术(北京)有限公司 Speech synthesis method and device, and electronic equipment
CN111862931A (en) * 2020-05-08 2020-10-30 北京嘀嘀无限科技发展有限公司 Voice generation method and device
CN112289297A (en) * 2019-07-25 2021-01-29 阿里巴巴集团控股有限公司 Speech synthesis method, device and system
CN112464649A (en) * 2020-11-25 2021-03-09 平安科技(深圳)有限公司 Pinyin conversion method and device for polyphones, computer equipment and storage medium
CN113763924A (en) * 2021-11-08 2021-12-07 北京优幕科技有限责任公司 Acoustic deep learning model training method, and voice generation method and device
US11322135B2 (en) 2019-09-12 2022-05-03 International Business Machines Corporation Generating acoustic sequences via neural networks using combined prosody info

Citations (12)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
GB2326320A (en) * 1997-06-13 1998-12-16 Motorola Inc Text to speech synthesis using neural network
GB2326321A (en) * 1997-06-13 1998-12-16 Motorola Inc Speech synthesis using neural networks
CA2161540C (en) * 1994-04-28 2000-06-13 Orhan Karaali A method and apparatus for converting text into audible signals using a neural network
CN101996628A (en) * 2009-08-21 2011-03-30 索尼株式会社 Method and device for extracting prosodic features of speech signal
US20130226569A1 (en) * 2008-12-18 2013-08-29 Lessac Technologies, Inc. Methods employing phase state analysis for use in speech synthesis and recognition
US8527276B1 (en) * 2012-10-25 2013-09-03 Google Inc. Speech synthesis using deep neural networks
CN104867489A (en) * 2015-04-27 2015-08-26 苏州大学张家港工业技术研究院 Method and system for simulating reading and pronunciation of real person
CN105336321A (en) * 2015-09-25 2016-02-17 百度在线网络技术(北京)有限公司 Phonetic segmentation method and device for speech synthesis
CN105679306A (en) * 2016-02-19 2016-06-15 上海语知义信息技术有限公司 Method and system for predicting fundamental frequency frame in voice synthesis
CN107039034A (en) * 2016-02-04 2017-08-11 科大讯飞股份有限公司 Prosody prediction method and system
CN107452369A (en) * 2017-09-28 2017-12-08 百度在线网络技术(北京)有限公司 Speech synthesis model generation method and device
CN107680580A (en) * 2017-09-28 2018-02-09 百度在线网络技术(北京)有限公司 Text transformation model training method and device, text conversion method and device


Cited By (16)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109754778B (en) * 2019-01-17 2023-05-30 平安科技(深圳)有限公司 Text speech synthesis method and device and computer equipment
WO2020147404A1 (en) * 2019-01-17 2020-07-23 平安科技(深圳)有限公司 Text-to-speech synthesis method, device, computer apparatus, and non-volatile computer readable storage medium
CN109754778A (en) * 2019-01-17 2019-05-14 平安科技(深圳)有限公司 Text speech synthesis method and device, and computer equipment
US11620980B2 (en) 2019-01-17 2023-04-04 Ping An Technology (Shenzhen) Co., Ltd. Text-based speech synthesis method, computer device, and non-transitory computer-readable storage medium
CN110164413A (en) * 2019-05-13 2019-08-23 北京百度网讯科技有限公司 Speech synthesis method and device, computer equipment and storage medium
CN110164413B (en) * 2019-05-13 2021-06-04 北京百度网讯科技有限公司 Speech synthesis method, apparatus, computer device and storage medium
CN110335587A (en) * 2019-06-14 2019-10-15 平安科技(深圳)有限公司 Speech synthesis method, system, terminal device and readable storage medium
CN110335587B (en) * 2019-06-14 2023-11-10 平安科技(深圳)有限公司 Speech synthesis method, system, terminal device and readable storage medium
CN112289297A (en) * 2019-07-25 2021-01-29 阿里巴巴集团控股有限公司 Speech synthesis method, device and system
CN112289297B (en) * 2019-07-25 2024-08-02 阿里巴巴集团控股有限公司 Speech synthesis method, device and system
US11322135B2 (en) 2019-09-12 2022-05-03 International Business Machines Corporation Generating acoustic sequences via neural networks using combined prosody info
US11842728B2 (en) 2019-09-12 2023-12-12 International Business Machines Corporation Training neural networks to predict acoustic sequences using observed prosody info
CN110473516A (en) * 2019-09-19 2019-11-19 百度在线网络技术(北京)有限公司 Speech synthesis method and device, and electronic equipment
CN111862931A (en) * 2020-05-08 2020-10-30 北京嘀嘀无限科技发展有限公司 Voice generation method and device
CN112464649A (en) * 2020-11-25 2021-03-09 平安科技(深圳)有限公司 Pinyin conversion method and device for polyphones, computer equipment and storage medium
CN113763924A (en) * 2021-11-08 2021-12-07 北京优幕科技有限责任公司 Acoustic deep learning model training method, and voice generation method and device

Also Published As

Publication number Publication date
CN108492818B (en) 2020-10-30

Similar Documents

Publication Publication Date Title
CN108492818A Text-to-speech conversion method and device, and computer equipment
CN111276120B (en) Speech synthesis method, apparatus and computer-readable storage medium
CN109754778A Text speech synthesis method and device, and computer equipment
CN108597492B Speech synthesis method and device
WO2020182153A1 (en) Method for performing speech recognition based on self-adaptive language, and related apparatus
WO2020098269A1 (en) Speech synthesis method and speech synthesis device
CN108573694A Corpus expansion and speech synthesis system construction method and device based on artificial intelligence
CN110223705A Voice conversion method, device, equipment and readable storage medium
CN108447486A Voice translation method and device
CN108806665A Speech synthesis method and device
CN107749305A Speech processing method and device
JP2022133408A (en) Speech conversion method and system, electronic apparatus, readable storage medium, and computer program
CN108986793A Translation processing method, device and equipment
JPH0713581A (en) Method and system for provision of sound with space information
CN108831437A Song generation method, device, terminal and storage medium
WO2021212954A1 (en) Method and apparatus for synthesizing emotional speech of specific speaker with extremely few resources
CN109637551A Voice conversion method, device, equipment and storage medium
US20240339106A1 (en) Phonemes And Graphemes for Neural Text-to-Speech
CN113724683B (en) Audio generation method, computer device and computer readable storage medium
CN113948062B (en) Data conversion method and computer storage medium
CN112331177A Prosody-based speech synthesis method, model training method and related equipment
CN109599090B (en) Method, device and equipment for voice synthesis
CN110147554A (en) Simultaneous interpreting method, device and computer equipment
CN102568476A (en) Voice conversion method based on self-organizing feature map network cluster and radial basis network
US20240273311A1 (en) Robust Direct Speech-to-Speech Translation

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant