CN114550733B - Voice synthesis method capable of being used for chip end - Google Patents
Classifications
- G10L19/16 — Vocoder architecture (under G10L19/00, speech or audio signal analysis-synthesis techniques for redundancy reduction, e.g. in vocoders, and G10L19/04, using predictive techniques)
- G06F17/11 — Complex mathematical operations for solving equations, e.g. nonlinear equations, general mathematical optimization problems
- G06F17/14 — Fourier, Walsh or analogous domain transformations, e.g. Laplace, Hilbert, Karhunen-Loeve transforms
Abstract
A speech synthesis method for the chip end includes the following steps: step 1, calculating a pulse excitation seed signal and a noise excitation seed signal; step 2, solving the excitation signal from the fundamental frequency F0 of the given voice and the banded aperiodic ratio ap; step 3, calculating the corresponding audio data for each frame of the spectral envelope of the given audio, then overlapping according to the frame shift to obtain the final speech waveform. By calculating the required pulse excitation seed signal off-line in advance, the invention performs only multiply-add operations when calculating the periodic and aperiodic excitation, involves no Fourier or inverse Fourier transform, and improves the running speed of the vocoder at the chip end.
Description
Technical Field
The invention belongs to the technical field of speech processing, and particularly relates to a speech synthesis method applicable to the chip end.
Background
The off-line speech synthesis chip can be used in fields such as information kiosks, attendance machines, voice guides, vending machines, and intelligent toys. It receives the text to be synthesized through a communication interface and converts the text to speech (TTS). Traditional speech synthesis chips adopt concatenative methods: the prosody of the synthesized speech is weak and the synthesized text is constrained by the spliced segments, while high-performance speech synthesis chips are expensive, which greatly limits the application scenarios of off-line speech synthesis chips. A speech synthesis chip with better cost-performance and a more natural effect can push the industrial application of TTS speech synthesis technology deeper and wider. The most commonly used speech synthesis vocoder in the industry is WORLD (M. Morise, F. Yokomori, and K. Ozawa, "WORLD: a vocoder-based high-quality speech synthesis system for real-time applications," IEICE Transactions on Information and Systems, vol. E99-D, no. 7, pp. 1877-1884, 2016). Because its calculation is pure signal processing, it has a better synthesis effect than other traditional vocoders (STRAIGHT, Griffin-Lim, etc.), and has lower computational complexity and faster synthesis than neural-network-based vocoders (MelGAN, LPCNet, etc.). It is therefore better suited to low-performance chip-end devices.
The WORLD vocoder is based on a source-filter model, where the source is the sound source, i.e., the vocal cords, which emit a pulse train. The faster the vocal cords vibrate, the higher the pitch and the denser the pulses. The filter is everything the source signal passes through, including the vocal tract, laryngeal cavity, oral cavity, lips, teeth, and so on; under the combined action of these parts, different timbres and different vowels and consonants are produced. Together these parts form a filter system that can be regarded as a linear time-invariant system. The WORLD vocoder takes three acoustic features as input: the fundamental frequency F0, the spectral envelope, and the aperiodicity parameter. The open-source WORLD project (https://github.com/mmorise/World) provides a code implementation that obtains a time-domain signal from the three acoustic features. When computing the periodic response, it obtains the spectral envelope of the periodic signal from the spectral envelope and the aperiodic ratio, derives the minimum phase spectrum through cepstrum analysis, and obtains the periodic response by inverse Fourier transform. When computing the aperiodic response, it first takes the spectrum of white noise, obtains the spectral envelope of the aperiodic signal from the aperiodic ratio, likewise derives the minimum phase spectrum of the aperiodic excitation through cepstrum analysis, then multiplies the spectra in the frequency domain to emulate the convolution of the white noise signal with the linear time-invariant system of the envelope, and finally applies the inverse Fourier transform to the product to obtain the aperiodic response.
Fourier and inverse Fourier transforms are used repeatedly throughout this calculation and are time-consuming at the chip end. A complete speech synthesis system typically includes front-end text normalization, Chinese-to-pinyin conversion, phoneme-to-duration and acoustic-feature prediction, a vocoder, and so on. The vocoder usually dominates the run time, so optimizing the algorithm of this part greatly improves the feasibility of realizing the WORLD algorithm at the chip end.
Disclosure of Invention
In order to improve the operation speed of a vocoder and increase the feasibility of realizing off-line voice synthesis on a low-performance chip, the invention discloses a voice synthesis method applicable to a chip end.
The invention relates to a voice synthesis method for a chip end, which comprises the following steps:
step 1, calculating a pulse excitation seed signal and a noise excitation seed signal;
step 2, solving the excitation signal from the fundamental frequency F0 and the banded aperiodic ratio ap of the given voice:
step 2-1, solving the pulse number and positions from the fundamental frequency F0:
2-11. Upsample the F0 feature to the time-domain signal length N, where N is the number of F0 frames multiplied by the frame shift, the frame shift being the sliding step used when extracting the acoustic features from the time-domain signal; record the upsampling result as a_i, i = 0, 1, ..., N−1;
2-12. Multiply each upsampled value from step 2-11 by 2π and divide by the sampling rate fs, π being the circular constant; then accumulate over the sampling points in order:

b_j = Σ_{i=0}^{j} 2π·a_i / fs,  j = 0, 1, ..., N−1

where a_i is the upsampling result obtained in step 2-11, N is the time-domain signal length, and b_j is the accumulated data value at the j-th position;
2-13. For each accumulated data value b_j from step 2-12, take b_j and its neighbour b_{j+1} modulo 2π and form the absolute value of their difference:

c_k = |(b_k % 2π) − (b_{k+1} % 2π)|,  k = 1, 2, ..., N−1

where c_k is the absolute difference at position k, b_k the accumulated data value at the k-th position, % the remainder operation, N the time-domain signal length, and | | the absolute-value operation;
2-14. Check each c_k, k = 1, 2, ..., N−1, from step 2-13: if c_k > π, the k-th position is a pulse point. Record the positions of all pulse points as k_i, i = 0, 1, ..., N_p, where N_p is the total number of pulse points;
step 2-2. solving for non-periodic excitation
2-21, up-sampling the banded aperiodic ratio ap characteristic of the given voice to the time domain signal length N;
2-22, expanding the noise excitation seed signal to a time domain signal length N;
2-23. Multiply the results of step 2-21 and step 2-22 element-wise per dimension, then sum the dimensions into a single signal of time-domain length, namely the aperiodic excitation;
step 2-3, solving the periodic excitation,
the method specifically comprises the following steps: for each pulse position k obtained in the 2-1 stepi,i=0,1... NpThe following operations are performed:
judging whether the pulse position is an unvoiced segment or not according to the fundamental frequency F0 and the strip-shaped non-periodic ratio ap, if so, the periodic excitation is 0; otherwise, multiplying the pulse excitation seed signals obtained in the step 1 by (1-ap) in sequenceki), apkiTo k in the corresponding dimensioniA strip-like aperiodic ratio of N and NapAdding the values of the dimensions to one dimension to obtain periodic excitation at the pulse position;
superposing the periodic excitation of all pulse positions according to the positions of the pulses to obtain complete periodic excitation;
step 2-4, adding the non-periodic excitation obtained in the step 2-2 and the periodic excitation obtained in the step 2-3 to obtain an excitation signal;
and 3, calculating corresponding audio data for each frame of spectral envelope of the given audio, and then overlapping according to frame shift to obtain a final voice waveform.
Preferably, the step 1 specifically comprises:
step 1-1, choosing a frequency range fr and a maximum frequency U, and calculating the dimension N_ap of the banded aperiodic ratio from the sampling rate fs:

N_ap = ⌊min(fs/2, U) / fr⌋ + 1 ----(1)

where N_ap is the dimension of the banded aperiodic ratio, ⌊ ⌋ denotes rounding down, min the minimum value, fs the sampling rate, U the maximum frequency, and fr the frequency range;
and 1-2.
A cosine function is adopted to simulate the pulse reference frequency of each dimension of the banded aperiodic ratio, and a pulse excitation seed signal is solved, wherein the formula is as follows:
pi=f-1(0.5+0.5*cos(2π(wp-fr*i)/2*fr)),i=1,2…Nap ----(2)
wherein f is-1Representing the inverse Fourier transform, cos being the cosine operator, wp being the excitation seed vector, fr being the frequency range, piThe seed signal is excited for the pulse of the ith dimension of the banded non-periodic ratio.
step 1-3, obtaining or randomly generating N_ap groups of random white noise signals, and solving the noise excitation seed signals; the formula is as follows:

n_i = F⁻¹(F(w_i) · F(p_i)),  i = 1, 2, ..., N_ap ----(3)

where F and F⁻¹ denote the Fourier transform and inverse Fourier transform respectively, w_i is the i-th group of random white noise, and n_i is the i-th noise excitation seed signal.
Preferably, the step 3 specifically includes the following steps:
step 3-1, obtaining the minimum phase spectrum from the spectrum via the cepstrum-based construction:

c(q) = (1/N) Σ_w log sp(w) · e^{−iwq},  V(w) = exp( Σ_q l(q) · c(q) · e^{iwq} )

with lifter l(0) = 1, l(q) = 2 for 0 < q < N/2, and l(q) = 0 for q > N/2, where V(w) is the resulting minimum phase spectrum, w the spectral (frequency) domain, q the cepstrum domain, e^{iwq} and e^{−iwq} the complex exponential kernels, and sp the spectral envelope feature.
Step 3-2, performing window calculation on the excitation signal obtained in the step 2 according to frame shift extraction data, and extracting the excitation signal, wherein the window length is determined by the Fourier transform length in the step 1 according to the Fourier spectrum;
step 3-3, multiplying the minimum phase frequency spectrum obtained in the step 3-1 by the Fourier frequency spectrum of the excitation signal obtained in the step 3-2;
step 3-4, performing inverse Fourier transform on the product result of the step 3-3 to obtain an impulse response;
and 3-5, overlapping all impulse responses according to the frame shift position to obtain a voice waveform.
Preferably, the specific method for judging whether a pulse lies in an unvoiced segment is to set an unvoiced threshold: if the value of the fundamental frequency F0 at the pulse point is 0, or the banded aperiodic ratio ap in the largest (N_ap-th) dimension at the pulse point is greater than the unvoiced threshold, the pulse position lies in an unvoiced segment.
According to the invention, the required pulse excitation seed signals are calculated off-line in advance, so only simple multiply-add operations are performed when calculating the periodic and aperiodic excitation; no Fourier or inverse Fourier transforms are involved, which improves the running speed of the WORLD vocoder at the chip end. Finally, the impulse response is calculated for each frame of data, which facilitates streaming speech synthesis.
Drawings
FIG. 1 is a schematic flow chart of an embodiment of the present invention;
FIG. 2 is a spectral diagram of a synthesized audio obtained by computing acoustic features in an original audio according to an embodiment of the present invention;
FIG. 3 is a diagram of original audio frequency spectra in an embodiment of the present invention;
the abscissa in fig. 2 and 3 represents the time-domain signal time point and the ordinate represents the frequency domain.
Detailed Description
The acoustic features of a given piece of audio data comprise the fundamental frequency F0, the spectral envelope feature sp, and the banded aperiodic ratio ap; the banded aperiodic ratio represents the ratio of white noise energy to pulse-train energy in the audio signal. The source is a mixture of white noise and a pulse train, as demonstrated in the reference literature (e.g., M. Morise, "D4C, a band-aperiodicity estimator for high-quality speech synthesis," Speech Communication, vol. 84, pp. 57-65, Nov. 2016): for unvoiced sound the white noise ratio is high and the banded aperiodic ratio is high; for voiced sound the white noise ratio is low and the banded aperiodic ratio is low. A specific definition of the banded aperiodic ratio can be found in "A mixed excitation LPC vocoder model for low bit rate speech coding" (A.V. McCree and T.P. Barnwell, IEEE Trans. Speech Audio Process. 3(4), 242-250, 1995), which proposed this acoustic parameter in connection with mixed excitation. For convenient speech synthesis, the invention restores the time-domain audio signal from the acoustic features of a given piece of audio data and simplifies the restoration process.
The following provides a more detailed description of the present invention.
The present invention may restore a time-domain audio signal of a given piece of audio data by the following steps.
Step 1, calculating a pulse excitation seed signal and a noise excitation seed signal.
Step 1-1, choosing a frequency range fr and a maximum frequency U, and calculating the dimension of the banded aperiodic ratio from the sampling rate fs. The formula is as follows:

N_ap = ⌊min(fs/2, U) / fr⌋ + 1 ----(1)

where N_ap is the dimension of the banded aperiodic ratio, ⌊ ⌋ denotes rounding down, min the minimum value, fs the sampling rate, U the maximum frequency, and fr the frequency range. U and fr take empirical values, so N_ap is determined by the sampling rate.
For example, if the sampling rate of the audio is 16000 and the frame shift is set to 80, i.e., one Fourier-transform-length window of data is taken every 80 points, then each second of audio yields 16000/80 = 200 frames of the fundamental frequency feature, and the ap feature forms a [3, 200] matrix. The values of the frequency range fr and the maximum frequency U are initialized according to the required calculation precision.
The frequency range refers to the spacing between two adjacent band frequencies: the smaller the frequency range, the larger the dimension of the resulting ap feature and the finer the calculation. The sampling rate is the number of sample points contained in each second of audio; if an audio's sampling rate is 16000, each second contains 16000 samples. Since a frequency generally needs two sample points to be determined, the maximum frequency expressible by audio sampled at 16000 is 8000.
Step 1-2, simulating the pulse reference frequency of each dimension of the banded aperiodic ratio with a cosine function, and solving the pulse excitation seed signals; the formula is as follows:

p_i = F⁻¹(0.5 + 0.5·cos(2π(wp − fr·i) / (2·fr))),  i = 1, 2, ..., N_ap ----(2)

where F⁻¹ denotes the inverse Fourier transform, cos is the cosine operator, and wp is the excitation seed vector, whose empirical values can be taken based on the sampling rate and the Fourier transform length (the vector length is typically set to half the Fourier transform length); fr is the frequency range, and p_i is the pulse excitation seed signal of the i-th dimension of the banded aperiodic ratio.
Step 1-3, obtaining or randomly generating N_ap groups of random white noise signals, and solving the noise excitation seed signals. The formula is as follows:

n_i = F⁻¹(F(w_i) · F(p_i)),  i = 1, 2, ..., N_ap ----(3)

where F and F⁻¹ denote the Fourier transform and inverse Fourier transform respectively, w_i is the i-th group of random white noise, and n_i is the i-th noise excitation seed signal.
The pulse excitation seed signals p_i and noise excitation seed signals n_i above depend only on the sampling rate and the frequency range, so they can be calculated once and stored as constants.
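The offline seed computation of formulas (2) and (3) can be sketched in Python with NumPy. The excitation seed vector wp (taken here as a linear frequency axis of half the FFT length), the "+1" reading of formula (1), and the concrete lengths follow the worked example below and are assumptions, not the patent's exact code:

```python
import numpy as np

fs, fft_len = 16000, 1024
fr, U = 3000.0, 8000.0
# formula (1); the "+1" is an assumption that matches the worked example (N_ap = 3)
n_ap = int(min(fs / 2, U) // fr) + 1

# wp: excitation seed vector, assumed here to be a linear frequency axis
# of half the Fourier transform length
wp = np.linspace(0.0, fs / 2, fft_len // 2)

# formula (2): pulse excitation seed per band via inverse FFT of a raised cosine
pulse_seeds = []
for i in range(n_ap):
    spec = 0.5 + 0.5 * np.cos(2 * np.pi * (wp - fr * i) / (2 * fr))
    pulse_seeds.append(np.fft.irfft(spec, fft_len))   # length 1024
pulse_seeds = np.array(pulse_seeds)

# formula (3): noise seed = white noise filtered by the pulse seed (via FFT)
rng = np.random.default_rng(0)
noise_len = 8192
noise_seeds = []
for p in pulse_seeds:
    w = rng.standard_normal(noise_len)
    noise_seeds.append(np.fft.irfft(np.fft.rfft(w) * np.fft.rfft(p, noise_len), noise_len))
noise_seeds = np.array(noise_seeds)
```

Both seed banks depend only on the sampling rate and frequency range, so in a chip deployment they would be precomputed and stored as constant tables.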
Step 2, solving the excitation signal represented by the two features from the given fundamental frequency F0 and banded aperiodic ratio ap: the fundamental frequency determines the pulse positions, the banded aperiodic ratio determines the proportions of periodic and aperiodic excitation, and the excitation signal comprises both aperiodic and periodic excitation.
And 2-1, solving the pulse number and position by the fundamental frequency F0.
2-11. Upsample the F0 feature to the time-domain signal length N, where N is the number of F0 frames multiplied by the frame shift, the frame shift being the sliding step used when extracting the acoustic features from the time-domain signal; record the upsampling result as a_i, i = 0, 1, ..., N−1.
2-12. Multiply each upsampled value from step 2-11 by 2π and divide by the sampling rate fs, π being the circular constant; then accumulate over the sampling points in order:

b_j = Σ_{i=0}^{j} 2π·a_i / fs,  j = 0, 1, ..., N−1

where a_i is the upsampling result obtained in step 2-11, N is the time-domain signal length, and b_j is the accumulated data value at the j-th position.
2-13. For each accumulated data value b_j from step 2-12, take b_j and its neighbour b_{j+1} modulo 2π and form the absolute value of their difference:

c_k = |(b_k % 2π) − (b_{k+1} % 2π)|,  k = 1, 2, ..., N−1

where c_k is the absolute difference at position k, b_k the accumulated data value at the k-th position, % the remainder operation, and | | the absolute-value operation.
2-14. Check each c_k, k = 1, 2, ..., N−1, from step 2-13: if c_k > π, the k-th position is a pulse point. The total number of pulse points is N_p, and the position of each pulse point is recorded as k_i, i = 0, 1, ..., N_p.
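The pulse-position search of steps 2-11 to 2-14 amounts to detecting wrap-arounds of an accumulated phase. A minimal Python sketch, assuming nearest-neighbour upsampling of F0 (the patent does not fix the interpolation method):

```python
import numpy as np

def pulse_positions(f0, frame_shift, fs):
    """Steps 2-11 to 2-14: pulse points from a frame-level F0 track.

    f0: per-frame fundamental frequency in Hz; frame_shift: samples per frame.
    """
    # 2-11: upsample F0 to the time-domain length N (sample-and-hold assumed)
    a = np.repeat(np.asarray(f0, dtype=float), frame_shift)
    # 2-12: per-sample phase increment 2*pi*f0/fs, accumulated over samples
    b = np.cumsum(2 * np.pi * a / fs)
    # 2-13: absolute difference of adjacent phases, each wrapped to [0, 2*pi)
    c = np.abs(b[:-1] % (2 * np.pi) - b[1:] % (2 * np.pi))
    # 2-14: a wrap-around (difference > pi) marks a pulse point
    return np.where(c > np.pi)[0]

# a constant 100 Hz track at fs = 16000 places pulses roughly every 160 samples
k = pulse_positions([100.0] * 10, 80, 16000)
```

The detector works because the wrapped phase jumps by nearly 2π at each pitch-period boundary, so the adjacent-sample difference there exceeds π while everywhere else it equals the small per-sample increment.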
Step 2-2. solving for non-periodic excitation
2-21, up-sampling the banded aperiodic ratio ap characteristic to a time domain signal dimension N;
2-22, expanding the noise excitation seed signal to a time domain signal dimension N;
2-23. Multiply the results of step 2-21 and step 2-22 element-wise per dimension, then sum the dimensions into a single signal of time-domain length N, which is the aperiodic excitation.
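Steps 2-21 to 2-23 can be sketched likewise; sample-and-hold upsampling of ap and tiling of the noise seeds are assumed realizations of the "up-sampling" and "expanding" operations:

```python
import numpy as np

def aperiodic_excitation(ap, frame_shift, noise_seeds):
    """Steps 2-21 to 2-23: ap is an [N_ap, m] banded aperiodic ratio
    (m frames), noise_seeds an [N_ap, L] noise-seed bank."""
    n_ap, m = ap.shape
    N = m * frame_shift                       # time-domain signal length
    # 2-21: upsample each band of ap to length N (sample-and-hold assumed)
    ap_up = np.repeat(ap, frame_shift, axis=1)
    # 2-22: extend (tile) the noise seeds to length N, cropping the excess
    reps = -(-N // noise_seeds.shape[1])      # ceiling division
    noise = np.tile(noise_seeds, (1, reps))[:, :N]
    # 2-23: element-wise product per dimension, then sum the bands
    return (ap_up * noise).sum(axis=0)
```

Only multiplications and additions occur here, which is the point of the method: no transform is needed at synthesis time.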
Step 2-3, solving periodic excitation
For each pulse position k_i, i = 0, 1, ..., N_p, obtained in step 2-1, perform the following operations:
judge from the fundamental frequency F0 and the banded aperiodic ratio ap whether the pulse position lies in an unvoiced segment; if so, the periodic excitation is 0; otherwise, multiply each dimension of the pulse excitation seed signals obtained in step 1 by (1 − ap_{k_i}), where ap_{k_i} is the banded aperiodic ratio at position k_i in the corresponding dimension, then sum the N_ap dimensions into one dimension to obtain the periodic excitation at that pulse position.
The usual way to judge whether a pulse lies in an unvoiced segment is to set an unvoiced threshold: for example, if the F0 value at the pulse point is 0, or the banded aperiodic ratio in the largest dimension at the pulse point exceeds the unvoiced threshold 0.999 (indicating that only white noise is present), the pulse position lies in an unvoiced segment.
The periodic excitations of all pulse positions are superposed at their respective pulse positions to obtain the complete periodic excitation.
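The per-pulse weighting and superposition of step 2-3 can be sketched as follows; the unvoiced decision is passed in as a precomputed mask, and the array shapes are assumptions for illustration:

```python
import numpy as np

def periodic_excitation(pulse_pos, ap_up, pulse_seeds, N, unvoiced=None):
    """Step 2-3: ap_up is the [N_ap, N] upsampled banded aperiodic ratio,
    pulse_seeds the [N_ap, L] seed bank; unvoiced is an optional boolean
    mask over samples (the F0/threshold test described in the text)."""
    L = pulse_seeds.shape[1]
    out = np.zeros(N + L)            # padded so a pulse near the end fits
    for k in pulse_pos:
        if unvoiced is not None and unvoiced[k]:
            continue                 # unvoiced segment: periodic excitation is 0
        # weight each band's seed by (1 - ap) at this pulse, sum the bands,
        # and superpose the result at the pulse position
        out[k:k + L] += ((1.0 - ap_up[:, k])[:, None] * pulse_seeds).sum(axis=0)
    return out[:N]
```

As with the aperiodic branch, the loop body is pure multiply-add on precomputed seed tables.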
And 2-4, adding the non-periodic excitation obtained in the step 2-2 and the periodic excitation obtained in the step 2-3 to obtain an excitation signal.
The invention only carries out simple multiplication and addition operation when calculating the periodic excitation and the non-periodic excitation by offline calculating the needed pulse excitation seed signal in advance, does not relate to Fourier transform and inverse Fourier transform, and improves the operation speed of the WORLD vocoder at the chip end.
After the excitation signal is obtained, the impulse response is calculated for each frame of data, so that the subsequent streaming voice synthesis is facilitated.
Step 3, calculating corresponding audio data for each frame of spectral envelope, and then superposing the audio data according to frame shift to obtain a final voice waveform, namely a time domain audio signal; the step specifically comprises the following steps:
Step 3-1, obtaining the minimum phase spectrum from the spectrum via the cepstrum-based construction:

c(q) = (1/N) Σ_w log sp(w) · e^{−iwq},  V(w) = exp( Σ_q l(q) · c(q) · e^{iwq} )

with lifter l(0) = 1, l(q) = 2 for 0 < q < N/2, and l(q) = 0 for q > N/2, where V(w) is the resulting minimum phase spectrum, w the spectral (frequency) domain, q the cepstrum domain, e^{iwq} and e^{−iwq} the complex exponential kernels, and sp the spectral envelope feature.
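Step 3-1 is the standard cepstrum-based minimum phase construction. A sketch, assuming sp is one frame of a strictly positive envelope sampled on an rfft grid:

```python
import numpy as np

def minimum_phase_spectrum(sp):
    """Cepstrum-based minimum phase spectrum (a standard realization of step 3-1).

    sp: one frame of the spectral envelope, length fft_len//2 + 1, all positive.
    Returns the complex minimum phase spectrum V(w) on the same grid.
    """
    fft_len = 2 * (len(sp) - 1)
    # real cepstrum of the log envelope: c(q) = F^{-1}[log sp(w)]
    c = np.fft.irfft(np.log(sp), fft_len)
    # minimum-phase lifter: keep q = 0, double 0 < q < N/2, zero the rest
    c[1:fft_len // 2] *= 2.0
    c[fft_len // 2 + 1:] = 0.0
    return np.exp(np.fft.rfft(c))
```

A useful sanity property of this construction is that it preserves the magnitude exactly: |V(w)| equals sp(w), while the phase becomes minimum phase.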
Step 3-2, performing window calculation on the excitation signal obtained in the step 2 according to frame shift extraction data, and extracting the excitation signal, wherein the window length is determined by the Fourier transform length in the step 1 according to the Fourier spectrum;
step 3-3, multiplying the minimum phase frequency spectrum obtained in the step 3-1 by the Fourier frequency spectrum of the excitation signal obtained in the step 3-2;
step 3-4, performing inverse Fourier transform on the product result of the step 3-3 to obtain an impulse response;
and 3-5, overlapping all impulse responses according to the frame shift position to obtain a voice waveform.
The specific process is shown in FIG. 1.
One specific example is given below.
The acoustic features of a piece of audio data are known: the fundamental frequency F0, the spectral envelope SP, and the banded aperiodic ratio AP. The frame shift is set to 0.005 seconds (5 milliseconds, i.e., 80 samples), the sampling rate of the audio to be synthesized is 16000, and the Fourier transform length is 1024. The present invention restores its time-domain audio signal through the following steps.
1. A pulse excitation seed signal and a noise excitation seed signal are calculated.
The custom frequency range is 3000 and the maximum frequency is 8000, so formula (1) gives a banded aperiodic ratio dimension of 3. Formula (2) gives the pulse excitation seed signals p_0, p_1, p_2 of each dimension, each of length 1024. Formula (3) gives the noise excitation seed signals n_0, n_1, n_2 of each dimension, each of length 8192.
The pulse excitation seed signals and noise excitation seed signals calculated in this step depend only on the sampling rate, the frequency range, and the maximum frequency, so they can be calculated off-line and stored as variables, and directly loaded without recalculation when audio is subsequently synthesized.
2. From the given fundamental frequency F0 and banded aperiodic ratio AP, obtain the excitation signal, including the aperiodic and periodic excitation; assume the F0 and banded aperiodic ratio AP features have m frames.
2-1) finding the pulse number and position by F0.
1a) Upsample the F0 feature to m*80 dimensions (* denotes multiplication), where m*80 is the corresponding time-domain signal dimension N_response; record the upsampling result as a_i, i = 0, 1, ..., N_response − 1;
1b) Compute b_j, j = 0, 1, ..., N_response − 1, as described in step 2-12;
1c) Compute c_k, k = 1, 2, ..., N_response − 1, as described in step 2-13;
1d) Check each c_k in turn: if c_k > π, the k-th position is a pulse point. Record the number of pulse points N_p and the position k_i of each pulse point, i = 0, 1, ..., N_p.
2-2) non-periodic excitation
2a) Upsample the ap signal of size [3, m] to [3, m*80];
2b) extend the noise excitation seed signals to the time-domain signal dimension: if the time-domain dimension N_response is smaller than the seed length 8192, take the first N_response samples of each noise excitation seed signal; if N_response is larger than 8192, select the required data from the noise excitation seed signals; the extended noise excitation seed signals have size [3, m*80];
2c) multiply the results of steps 2a) and 2b) element-wise, then sum the resulting [3, m*80] matrix along its rows to obtain a [1, m*80] matrix, which is the aperiodic excitation.
2-3) periodic excitation
Initialize a periodic excitation signal of size [1, m*80]; then for each pulse position:
3a) judge whether the pulse point lies in an unvoiced segment from its fundamental frequency value and the banded aperiodic ratio in its largest dimension: if the fundamental frequency value at the pulse point is 0, or the banded aperiodic ratio in its largest (here the 3rd, i.e. N_ap-th) dimension is greater than 0.999, the pulse point lies in an unvoiced segment and its periodic excitation is 0;
3b) if it is not an unvoiced segment, multiply the pulse excitation seed signals dimension by dimension by (1 − ap_{k_i}), where ap_{k_i} is the banded aperiodic ratio at position k_i in the corresponding dimension, obtaining a [3, 1024] matrix; sum it along the rows into [1, 1024] to obtain the periodic excitation at that pulse position;
Calculate the periodic excitation at each pulse position and superpose the results according to the pulse positions to obtain the complete periodic excitation.
Add the aperiodic excitation of step 2) and the periodic excitation of step 3) to obtain the excitation signal.
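Step 3a)–3b) can be sketched as follows. This is a simplified illustration under stated assumptions: the unvoiced test here uses only the highest-band aperiodicity (the F0 < 0 check is omitted because F0 is not passed in), the seed is placed starting at the pulse position with illustrative boundary handling, and all names are hypothetical:

```python
import numpy as np

def periodic_excitation(pulse_positions, ap_at_pulse, pulse_seed, n):
    """pulse_positions: sample indices of the pulse points
    ap_at_pulse: [n_pulses, n_bands] band aperiodicity at each pulse point
    pulse_seed: [n_bands, seed_len] pulse excitation seed signals
    n: length of the output excitation signal
    """
    n_bands, seed_len = pulse_seed.shape
    out = np.zeros(n)
    for pos, ap in zip(pulse_positions, ap_at_pulse):
        # unvoiced check: near-total aperiodicity in the maximum (highest) band
        if ap[-1] > 0.999:
            continue  # periodic excitation stays 0 for unvoiced segments
        # weight each band's seed by (1 - ap) and sum the bands -> [seed_len]
        pulse = ((1.0 - ap)[:, None] * pulse_seed).sum(axis=0)
        end = min(pos + seed_len, n)
        out[pos:end] += pulse[:end - pos]   # superpose at the pulse position
    return out
```

Adding this output to the aperiodic excitation yields the complete excitation signal of step 2-4.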
3. Calculating corresponding audio data for each frame spectral envelope sp, and then superposing to obtain a final time domain audio signal;
3-1) deriving a minimum phase spectrum from the spectrum, involving a Fourier transform and an inverse Fourier transform.
3-2) Sequentially extract excitation-signal segments according to the frame shift with a data length of 512, multiply by a window function (a Hamming window in this embodiment), and then compute the Fourier spectrum;
3-3) Multiply the minimum-phase spectrum of step 3-1) by the excitation-signal spectrum of step 3-2) to obtain the spectral information of the time-domain signal;
3-4) Apply the inverse Fourier transform to the result of step 3-3) to obtain an impulse response.
Superpose the impulse response obtained for each frame according to the frame shift to obtain the speech waveform.
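Steps 3-1) through the final superposition form a standard spectral-filtering overlap-add loop: window a 512-sample excitation segment per frame, multiply its FFT by the frame's minimum-phase envelope spectrum, inverse-transform, and overlap-add at the frame shift. A generic sketch under stated assumptions (the per-frame minimum-phase spectra are taken as given, the FFT size equals the window length, and the names are illustrative):

```python
import numpy as np

def overlap_add_synthesis(excitation, min_phase_spectra, frame_shift=80, win_len=512):
    """excitation: 1-D excitation signal
    min_phase_spectra: [n_frames, n_fft] per-frame minimum-phase spectra
    """
    n_frames, n_fft = min_phase_spectra.shape
    window = np.hamming(win_len)                 # Hamming window per the embodiment
    out = np.zeros(n_frames * frame_shift + n_fft)
    for t in range(n_frames):
        start = t * frame_shift
        seg = excitation[start:start + win_len]
        if len(seg) < win_len:                   # zero-pad the trailing frames
            seg = np.pad(seg, (0, win_len - len(seg)))
        exc_spec = np.fft.fft(seg * window, n_fft)                 # 3-2) windowed spectrum
        frame = np.fft.ifft(exc_spec * min_phase_spectra[t]).real  # 3-3) and 3-4)
        out[start:start + n_fft] += frame        # overlap-add at the frame shift
    return out[:n_frames * frame_shift]
```

With all-ones (identity) spectra the loop reduces to windowed overlap-add of the excitation itself, which is a convenient sanity check.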
Time-consuming comparison on chip:
Using the speech recognition chip CI1103 developed by the applicant, the time taken to synthesize 1 second of audio is compared with that of the conventional algorithm, as shown in Table 1; the present invention significantly shortens the time consumption.
TABLE 1
Fig. 3 shows the spectrogram of the original audio, and Fig. 2 shows the spectrogram of the synthesized audio computed by the present invention from the acoustic features of the original audio: the fundamental frequency F0, the spectral envelope SP, and the band aperiodicity ratio AP. Figs. 2 and 3 use coordinate systems of the same size; as the comparison shows, the synthesized audio of Fig. 2 is highly similar to the original audio of Fig. 3.
The foregoing describes preferred embodiments of the present invention. Where the preferred embodiments are not obviously contradictory, they may be combined in any manner. The specific parameters in the embodiments and examples serve only to clearly illustrate the inventors' verification process and are not intended to limit the scope of the invention, which is defined by the claims; equivalent structural changes made using the description and drawings of the present invention are likewise included within the scope of the present invention.
Claims (3)
1. A speech synthesis method for chip end, comprising the steps of:
step 1, calculating a pulse excitation seed signal and a noise excitation seed signal;
step 2, calculating the excitation signal from the fundamental frequency F0 and the band aperiodicity ratio ap of the given voice;
Step 2-1, solving the number and the position of the pulses by the fundamental frequency F0;
2-11, up-sampling the fundamental frequency F0 feature to the time-domain signal length N, wherein the time-domain signal length is the length of the fundamental frequency F0 multiplied by the frame shift, the frame shift being the sliding step used when solving the acoustic features of the time-domain signal; the up-sampling result is recorded as a_i, i = 0, 1, ..., N−1, i being the different dimensions;
2-12, multiplying the up-sampled data of each dimension obtained in step 2-11 by 2π and dividing by the sampling rate fs, π being the circle ratio; then sequentially calculating an accumulated value at each sampling point, which can be expressed as the following equation:

b_j = (2π/fs) · Σ_{i=0}^{j} a_i,  j = 0, 1, ..., N−1

wherein a_i represents the up-sampling result obtained in step 2-11, N is the time-domain signal length, and b_j represents the accumulated data value of the j-th dimension;
2-13, for each accumulated data value b_j of step 2-12, taking b_j and the accumulated data value b_{j+1} of its adjacent dimension modulo 2π respectively, then taking the absolute value of the difference of the remainders, with the formula:

c_k = | b_k % 2π − b_{k−1} % 2π |,  k = 1, 2, ..., N−1

wherein c_k represents the absolute value of the difference in the k-th dimension, b_k is the accumulated data value of the k-th dimension, % represents the remainder operation, N is the time-domain signal length, and | | represents the absolute-value operation;
2-14, judging each absolute difference c_k of step 2-13, k = 1, 2, ..., N−1: if c_k > π, the k-th position is the position of a pulse point; counting and recording the positions of all pulse points as k_i, i = 0, 1, ..., N_p, wherein N_p represents the total number of pulse points;
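Steps 2-11 through 2-14 amount to accumulating the instantaneous phase of F0 sample by sample and marking a pulse wherever the accumulated phase wraps past a multiple of 2π. A minimal NumPy sketch of this idea; the nearest-neighbor up-sampling and the function name are illustrative assumptions, not the patented implementation:

```python
import numpy as np

def find_pulse_positions(f0, frame_shift, fs):
    """Locate glottal pulse points from a frame-level F0 track.

    f0: frame-level fundamental frequency values (Hz)
    frame_shift: samples per frame, so N = len(f0) * frame_shift
    fs: sampling rate (Hz)
    """
    # 2-11: up-sample F0 to the time-domain length N (nearest-neighbor here)
    a = np.repeat(np.asarray(f0, dtype=float), frame_shift)
    # 2-12: accumulate phase increments of 2*pi*f0/fs per sample
    b = np.cumsum(2.0 * np.pi * a / fs)
    # 2-13: absolute difference of adjacent accumulated phases taken modulo 2*pi
    r = np.mod(b, 2.0 * np.pi)
    c = np.abs(np.diff(r))
    # 2-14: a jump larger than pi means the phase wrapped -> pulse point
    return np.where(c > np.pi)[0]
```

For a constant F0 the detected pulses land one fundamental period apart, which is the intended behavior of the phase-wrap test.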
step 2-2. solving for non-periodic excitation
2-21, up-sampling the banded aperiodic ratio ap characteristic of the given voice to the time domain signal length N;
2-22, expanding the noise excitation seed signal to a time domain signal length N;
2-23, multiplying the results of steps 2-21 and 2-22 element-wise by dimension, then combining the dimensions into a feature whose length is the time-domain signal length, which is the aperiodic excitation;
step 2-3. solving periodic excitation
The method specifically comprises the following steps: for each pulse position k_i, i = 0, 1, ..., N_p, obtained in step 2-1, the following operations are performed:
judging whether the pulse position is an unvoiced segment according to the fundamental frequency F0 and the band aperiodicity ratio ap; if so, the periodic excitation is 0; otherwise, multiplying the pulse excitation seed signals obtained in step 1 in turn by (1 − ap_ki), wherein ap_ki is the band aperiodicity ratio at k_i in the corresponding dimension, and adding the values of the N_ap dimensions into one dimension to obtain the periodic excitation at that pulse position;
the periodic excitation of all pulse positions is superposed according to the positions of the pulses to obtain complete periodic excitation;
step 2-4, adding the non-periodic excitation obtained in the step 2-2 and the periodic excitation obtained in the step 2-3 to obtain an excitation signal;
step 3, calculating corresponding audio data for each frame of spectral envelope of the given audio, and then overlapping according to frame shift to obtain a final voice waveform;
the step 1 specifically comprises the following steps:
step 1-1, customizing a frequency range fr and a maximum frequency U, and calculating the dimension N_ap of the band aperiodicity ratio from the sampling rate fs; the formula is as follows:

N_ap = ⌊ min(U, fs/2 − fr) / fr ⌋  ----(1)

wherein N_ap represents the dimension of the band aperiodicity ratio, ⌊ ⌋ represents rounding down, min represents taking the minimum value, fs is the sampling rate, U is the maximum frequency, and fr is the frequency range;
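The listed variables (floor, min, fs, U, fr) suggest a WORLD-style band layout for the dimension count. Assuming that form, N_ap = floor(min(U, fs/2 − fr)/fr), which is an assumption of this sketch; the patent's own equation (1) is authoritative:

```python
import math

def band_aperiodicity_dims(fs, fr=3000.0, U=15000.0):
    """Number of band-aperiodicity dimensions N_ap for sampling rate fs.

    fr: frequency range (bandwidth) of each band; U: maximum frequency.
    Assumed form: N_ap = floor(min(U, fs/2 - fr) / fr).
    """
    return int(math.floor(min(U, fs / 2.0 - fr) / fr))
```

For example, under these assumed defaults a 48 kHz sampling rate yields 5 bands, while 16 kHz yields a single band, since the Nyquist limit caps the usable frequency range.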
step 1-2, using a cosine function to simulate the pulse reference frequency of each dimension of the band aperiodicity ratio, and solving for the pulse excitation seed signal, with the formula:

p_i = f⁻¹( 0.5 + 0.5·cos( 2π(wp − fr·i) / (2·fr) ) ),  i = 1, 2, ..., N_ap  ----(2)

wherein f⁻¹ represents the inverse Fourier transform, cos is the cosine operator, wp is the excitation seed vector, fr is the frequency range, and p_i is the pulse excitation seed signal of the i-th dimension of the band aperiodicity ratio;
step 1-3, obtaining or randomly generating N_ap groups of random white-noise signals, and solving for the noise excitation seed signal; the formula is as follows:

n_i = f⁻¹( f(w_i) · f(p_i) ),  i = 1, 2, ..., N_ap  ----(3)

wherein f represents the Fourier transform, w_i is the i-th group of white noise, and n_i is the noise excitation seed signal of the i-th dimension.
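Equations (2) and (3) build one pulse seed per band as the inverse FFT of a raised-cosine response centered on that band's frequency, and the noise seeds by filtering white noise through the same response. A hedged sketch: the interpretation of wp as the FFT-bin frequency axis, the zeroing of the response outside one band width, and the 8192-sample seed length are assumptions consistent with this embodiment, not confirmed details:

```python
import numpy as np

def make_seed_signals(n_bands, fs, fr=3000.0, seed_len=8192, rng=None):
    """Build pulse and noise excitation seed signals, one pair per band."""
    rng = rng or np.random.default_rng(0)
    # assumed: wp is the FFT-bin frequency axis in Hz
    wp = np.fft.fftfreq(seed_len, d=1.0 / fs)
    pulse_seeds, noise_seeds = [], []
    for i in range(1, n_bands + 1):
        # eq. (2): raised-cosine response centered on band i, then inverse FFT
        resp = 0.5 + 0.5 * np.cos(2.0 * np.pi * (np.abs(wp) - fr * i) / (2.0 * fr))
        resp[np.abs(np.abs(wp) - fr * i) > fr] = 0.0   # confine to one band width
        p_i = np.fft.ifft(resp).real
        # eq. (3): filter a white-noise group w_i through the same response
        w_i = rng.standard_normal(seed_len)
        n_i = np.fft.ifft(np.fft.fft(w_i) * np.fft.fft(p_i)).real
        pulse_seeds.append(p_i)
        noise_seeds.append(n_i)
    return np.array(pulse_seeds), np.array(noise_seeds)
```

Because the response is even in frequency, the inverse FFT of each band response is real, so the `.real` casts only discard numerical round-off.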
2. The speech synthesis method for the chip end according to claim 1, wherein step 3 specifically comprises the following steps:
step 3-1, obtaining a minimum phase frequency spectrum from the frequency spectrum, wherein a calculation formula is as follows:
wherein V(w) represents the resulting minimum-phase spectrum, w represents the minimum-phase spectral domain variable, q represents the spectral-envelope domain variable, e^{iwq} and e^{−iwq} represent the complex exponential functions involved, and sp is the spectral envelope feature;
step 3-2, extracting data from the excitation signal obtained in step 2 according to the frame shift, applying the window calculation to the extracted excitation signal, and computing its Fourier spectrum, wherein the window length is determined by the Fourier transform length;
step 3-3, multiplying the minimum phase frequency spectrum obtained in the step 3-1 by the Fourier frequency spectrum of the excitation signal obtained in the step 3-2;
step 3-4, performing inverse Fourier transform on the product result of the step 3-3 to obtain an impulse response;
and 3-5, overlapping all impulse responses according to the frame shift position to obtain a voice waveform.
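Step 3-1's minimum-phase spectrum is conventionally obtained from the log spectral envelope via the real cepstrum: fold the anticausal cepstral part onto the causal part, transform back, and exponentiate. Since the patent's exact equation is not reproduced here, the following is a standard-technique sketch rather than the patented formula:

```python
import numpy as np

def minimum_phase_spectrum(sp):
    """sp: real, positive spectral envelope samples of length n_fft."""
    n = len(sp)
    cep = np.fft.ifft(np.log(sp)).real            # real cepstrum of the log envelope
    # fold: keep c[0] and c[n/2], double the causal part, zero the anticausal part
    lifter = np.zeros(n)
    lifter[0] = 1.0
    lifter[1:n // 2] = 2.0
    lifter[n // 2] = 1.0
    return np.exp(np.fft.fft(cep * lifter))       # minimum-phase spectrum V(w)
```

A useful property for checking the fold: the magnitude of the returned spectrum equals the input envelope, while the phase is the minimum-phase response implied by it.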
3. The speech synthesis method for the chip end according to claim 1, wherein the specific method for judging whether a pulse point is in an unvoiced segment is to set an unvoiced threshold: if the value of the fundamental frequency F0 at the pulse point is 0, or the band aperiodicity ratio ap in the maximum dimension of the pulse point (the N_ap-th dimension) is greater than the unvoiced threshold, the pulse position is in an unvoiced segment.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202210426046.4A CN114550733B (en) | 2022-04-22 | 2022-04-22 | Voice synthesis method capable of being used for chip end |
Publications (2)
Publication Number | Publication Date |
---|---|
CN114550733A CN114550733A (en) | 2022-05-27 |
CN114550733B true CN114550733B (en) | 2022-07-01 |
Family
ID=81667506
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202210426046.4A Active CN114550733B (en) | 2022-04-22 | 2022-04-22 | Voice synthesis method capable of being used for chip end |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN114550733B (en) |
Citations (4)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
WO2009144368A1 (en) * | 2008-05-30 | 2009-12-03 | Nokia Corporation | Method, apparatus and computer program product for providing improved speech synthesis |
EP2144230A1 (en) * | 2008-07-11 | 2010-01-13 | Fraunhofer-Gesellschaft zur Förderung der angewandten Forschung e.V. | Low bitrate audio encoding/decoding scheme having cascaded switches |
CN102750955A (en) * | 2012-07-20 | 2012-10-24 | 中国科学院自动化研究所 | Vocoder based on residual signal spectrum reconfiguration |
WO2018159402A1 (en) * | 2017-02-28 | 2018-09-07 | 国立研究開発法人情報通信研究機構 | Speech synthesis system, speech synthesis program, and speech synthesis method |
Family Cites Families (1)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US9607610B2 (en) * | 2014-07-03 | 2017-03-28 | Google Inc. | Devices and methods for noise modulation in a universal vocoder synthesizer |
Non-Patent Citations (3)
Title |
---|
High-quality waveform generator from fundamental frequency, spectral envelope, and band aperiodicity; Masanori Morise et al.; Proceedings of APSIPA Annual Summit and Conference 2019; 2019-12-31; full text *
WORLD: A Vocoder-Based High-Quality Speech Synthesis System; Masanori Morise et al.; The Institute of Electronics, Information and Communication Engineers; 2016-07-31; Vol. E99-D, No. 7; full text *
Research on a Multi-Discriminator Singing Voice Synthesis Vocoder Based on Generative Adversarial Networks; Chen Feiyang; China Master's Theses Full-text Database; 2022-03-15; full text *
Also Published As
Publication number | Publication date |
---|---|
CN114550733A (en) | 2022-05-27 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
US9031834B2 (en) | Speech enhancement techniques on the power spectrum | |
CN102306492B (en) | Voice conversion method based on convolutive nonnegative matrix factorization | |
US9135923B1 (en) | Pitch synchronous speech coding based on timbre vectors | |
CN112489629B (en) | Voice transcription model, method, medium and electronic equipment | |
JP7617261B2 (en) | Audio generator, audio signal generation method, and audio generator training method | |
CN102201240B (en) | Harmonic noise excitation model vocoder based on inverse filtering | |
EP2109096A1 (en) | Speech synthesis with dynamic constraints | |
Wu et al. | Quasi-periodic WaveNet vocoder: A pitch dependent dilated convolution model for parametric speech generation | |
CN103886859B (en) | Phonetics transfer method based on one-to-many codebook mapping | |
Yoneyama et al. | High-fidelity and pitch-controllable neural vocoder based on unified source-filter networks | |
EP4586246A1 (en) | Decoder | |
CN114550733B (en) | Voice synthesis method capable of being used for chip end | |
EP2087485B1 (en) | Multicodebook source -dependent coding and decoding | |
Song et al. | Improved time-frequency trajectory excitation modeling for a statistical parametric speech synthesis system | |
Kwon et al. | Effective parameter estimation methods for an excitnet model in generative text-to-speech systems | |
CN104282300A (en) | Non-periodic component syllable model building and speech synthesizing method and device | |
KR102837410B1 (en) | Methods for generating audio signals and training audio generators and audio generators | |
KR102837411B1 (en) | Methods for generating audio signals and training audio generators and audio generators | |
RU2823015C1 (en) | Audio data generator and methods of generating audio signal and training audio data generator | |
Jiang et al. | ESTVocoder: An Excitation-Spectral-Transformed Neural Vocoder Conditioned on Mel Spectrogram | |
Orphanidou et al. | Voice morphing using the generative topographic mapping | |
Jiang et al. | Excitation-Spectral-Transformed Neural Vocoder Conditioned on Mel Spectrogram | |
Arakawa et al. | High quality voice manipulation method based on the vocal tract area function obtained from sub-band LSP of STRAIGHT spectrum | |
Gandhi et al. | Source separation using particle filters. | |
Ye | Efficient Approaches for Voice Change and Voice Conversion Systems |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant | ||