CN114550733B - Voice synthesis method capable of being used for chip end - Google Patents
Classifications
- G10L19/16 — Vocoder architecture (under G10L19/00, speech or audio signal analysis-synthesis techniques for redundancy reduction, e.g. in vocoders, and G10L19/04, using predictive techniques)
- G06F17/11 — Complex mathematical operations for solving equations, e.g. nonlinear equations, general mathematical optimization problems
- G06F17/14 — Fourier, Walsh or analogous domain transformations, e.g. Laplace, Hilbert, Karhunen-Loeve transforms
Abstract
A speech synthesis method for the chip end includes the following steps: step 1, calculating a pulse excitation seed signal and a noise excitation seed signal; step 2, solving the excitation signal from the fundamental frequency F0 of the given voice and the banded aperiodic ratio ap; step 3, calculating the corresponding audio data for each frame of the spectral envelope of the given audio, then overlapping according to the frame shift to obtain the final speech waveform. By calculating the required pulse excitation seed signal off-line in advance, the invention performs only multiply-add operations when calculating the periodic and aperiodic excitation, involves no Fourier or inverse Fourier transform, and improves the running speed of the vocoder at the chip end.
Description
Technical Field
The invention belongs to the technical field of speech processing, and particularly relates to a speech synthesis method applicable to the chip end.
Background
The off-line speech synthesis chip can be used in fields such as information kiosks, attendance machines, voice guides, vending machines, and intelligent toys. It receives the text to be synthesized through a communication interface and converts the text to speech (TTS). Traditional speech synthesis chips adopt concatenative methods: the prosody of the synthesized speech is weak and the synthesized text is constrained by the spliced segments, while high-performance speech synthesis chips are expensive, which greatly limits the application scenarios of off-line speech synthesis chips. A speech synthesis chip with better cost-performance and a more natural effect can push the industrial application of TTS speech synthesis technology deeper and wider. The most commonly used speech synthesis vocoder in the industry is WORLD (M. Morise, F. Yokomori, and K. Ozawa, "WORLD: a vocoder-based high-quality speech synthesis system for real-time applications," IEICE Transactions on Information and Systems, vol. E99-D, no. 7, pp. 1877-1884, 2016). Because its calculation is pure signal processing, it has a better synthesis effect than other traditional vocoders (STRAIGHT, Griffin-Lim, etc.), and has lower computational complexity and faster synthesis than neural-network-based vocoders (MelGAN, LPCNet, etc.). It is therefore better suited to low-performance chip-end devices.
The WORLD vocoder is based on a source-filter model, where the source is the sound source, i.e., the vocal cords, which emit a pulse train. The faster the vocal cords vibrate, the higher the pitch and the denser the pulses. The filter is everything the source signal passes through, including the vocal tract, laryngeal cavity, oral cavity, lips, teeth, and so on; under the combined action of these parts, different timbres and different vowels and consonants are produced. Together these parts form a filter system that can be regarded as a linear time-invariant system. The WORLD vocoder takes three acoustic features as input: the fundamental frequency F0, the spectral envelope, and the aperiodicity parameter. The open-source WORLD project (https://github.com/mmorise/World) provides a code implementation that obtains a time-domain signal from the three acoustic features. When computing the periodic response, it obtains the spectral envelope of the periodic signal from the spectral envelope and the aperiodic ratio, derives the minimum phase spectrum through cepstrum analysis, and obtains the periodic response by inverse Fourier transform. When computing the aperiodic response, it first takes the spectrum of white noise, obtains the spectral envelope of the aperiodic signal from the aperiodic ratio, likewise derives the minimum phase spectrum of the aperiodic excitation through cepstrum analysis, then multiplies the spectra in the frequency domain to emulate the convolution of the white noise signal with the linear time-invariant system of the envelope, and finally applies the inverse Fourier transform to the product to obtain the aperiodic response.
Fourier and inverse Fourier transforms are used repeatedly throughout this calculation and are time-consuming at the chip end. A complete speech synthesis system typically includes front-end text normalization, Chinese-to-pinyin conversion, phoneme-to-duration and acoustic-feature prediction, a vocoder, and so on. The vocoder usually dominates the run time, so optimizing the algorithm of this part greatly improves the feasibility of realizing the WORLD algorithm at the chip end.
Disclosure of Invention
In order to improve the operation speed of a vocoder and increase the feasibility of realizing off-line voice synthesis on a low-performance chip, the invention discloses a voice synthesis method applicable to a chip end.
The invention relates to a voice synthesis method for a chip end, which comprises the following steps:
step 1, calculating a pulse excitation seed signal and a noise excitation seed signal;
step 2, solving the excitation signal from the fundamental frequency F0 and the banded aperiodic ratio ap of the given voice:
step 2-1, solving the pulse number and positions from the fundamental frequency F0:
2-11. Upsample the F0 feature to the time-domain signal length N, where N is the number of F0 frames multiplied by the frame shift, the frame shift being the sliding step used when extracting the acoustic features from the time-domain signal; record the upsampling result as a_i, i = 0, 1, ..., N−1;
2-12. Multiply each upsampled value from step 2-11 by 2π and divide by the sampling rate fs, π being the circular constant; then accumulate over the sampling points in order:

b_j = Σ_{i=0}^{j} 2π·a_i / fs,  j = 0, 1, ..., N−1

where a_i is the upsampling result obtained in step 2-11, N is the time-domain signal length, and b_j is the accumulated data value at the j-th position;
2-13. For each accumulated data value b_j from step 2-12, take b_j and its neighbour b_{j+1} modulo 2π and form the absolute value of their difference:

c_k = |(b_k % 2π) − (b_{k+1} % 2π)|,  k = 1, 2, ..., N−1

where c_k is the absolute difference at position k, b_k the accumulated data value at the k-th position, % the remainder operation, N the time-domain signal length, and | | the absolute-value operation;
2-14. Check each c_k, k = 1, 2, ..., N−1, from step 2-13: if c_k > π, the k-th position is a pulse point. Record the positions of all pulse points as k_i, i = 0, 1, ..., N_p, where N_p is the total number of pulse points;
step 2-2. solving for non-periodic excitation
2-21, up-sampling the banded aperiodic ratio ap characteristic of the given voice to the time domain signal length N;
2-22, expanding the noise excitation seed signal to a time domain signal length N;
2-23. Multiply the results of step 2-21 and step 2-22 element-wise per dimension, then sum the dimensions into a single signal of time-domain length, namely the aperiodic excitation;
step 2-3, solving the periodic excitation,
the method specifically comprises the following steps: for each pulse position k obtained in the 2-1 stepi,i=0,1... NpThe following operations are performed:
judging whether the pulse position is an unvoiced segment or not according to the fundamental frequency F0 and the strip-shaped non-periodic ratio ap, if so, the periodic excitation is 0; otherwise, multiplying the pulse excitation seed signals obtained in the step 1 by (1-ap) in sequenceki), apkiTo k in the corresponding dimensioniA strip-like aperiodic ratio of N and NapAdding the values of the dimensions to one dimension to obtain periodic excitation at the pulse position;
superposing the periodic excitation of all pulse positions according to the positions of the pulses to obtain complete periodic excitation;
step 2-4, adding the non-periodic excitation obtained in the step 2-2 and the periodic excitation obtained in the step 2-3 to obtain an excitation signal;
and 3, calculating corresponding audio data for each frame of spectral envelope of the given audio, and then overlapping according to frame shift to obtain a final voice waveform.
Preferably, the step 1 specifically comprises:
step 1-1, choosing a frequency range fr and a maximum frequency U, and calculating the dimension N_ap of the banded aperiodic ratio from the sampling rate fs:

N_ap = ⌊min(fs/2, U) / fr⌋ + 1 ----(1)

where N_ap is the dimension of the banded aperiodic ratio, ⌊ ⌋ denotes rounding down, min the minimum value, fs the sampling rate, U the maximum frequency, and fr the frequency range;
and 1-2.
A cosine function is adopted to simulate the pulse reference frequency of each dimension of the banded aperiodic ratio, and a pulse excitation seed signal is solved, wherein the formula is as follows:
pi=f-1(0.5+0.5*cos(2π(wp-fr*i)/2*fr)),i=1,2…Nap ----(2)
wherein f is-1Representing the inverse Fourier transform, cos being the cosine operator, wp being the excitation seed vector, fr being the frequency range, piThe seed signal is excited for the pulse of the ith dimension of the banded non-periodic ratio.
step 1-3, obtaining or randomly generating N_ap groups of random white noise signals, and solving the noise excitation seed signals; the formula is as follows:

n_i = F⁻¹(F(w_i) · F(p_i)),  i = 1, 2, ..., N_ap ----(3)

where F and F⁻¹ denote the Fourier transform and inverse Fourier transform respectively, w_i is the i-th group of random white noise, and n_i is the i-th noise excitation seed signal.
Preferably, the step 3 specifically includes the following steps:
step 3-1, obtaining the minimum phase spectrum from the spectrum via the cepstrum-based construction:

c(q) = (1/N) Σ_w log sp(w) · e^{−iwq},  V(w) = exp( Σ_q l(q) · c(q) · e^{iwq} )

with lifter l(0) = 1, l(q) = 2 for 0 < q < N/2, and l(q) = 0 for q > N/2, where V(w) is the resulting minimum phase spectrum, w the spectral (frequency) domain, q the cepstrum domain, e^{iwq} and e^{−iwq} the complex exponential kernels, and sp the spectral envelope feature.
Step 3-2, performing window calculation on the excitation signal obtained in the step 2 according to frame shift extraction data, and extracting the excitation signal, wherein the window length is determined by the Fourier transform length in the step 1 according to the Fourier spectrum;
step 3-3, multiplying the minimum phase frequency spectrum obtained in the step 3-1 by the Fourier frequency spectrum of the excitation signal obtained in the step 3-2;
step 3-4, performing inverse Fourier transform on the product result of the step 3-3 to obtain an impulse response;
and 3-5, overlapping all impulse responses according to the frame shift position to obtain a voice waveform.
Preferably, the specific method for judging whether a pulse lies in an unvoiced segment is to set an unvoiced threshold: if the value of the fundamental frequency F0 at the pulse point is 0, or the banded aperiodic ratio ap in the largest (N_ap-th) dimension at the pulse point is greater than the unvoiced threshold, the pulse position lies in an unvoiced segment.
According to the invention, the required pulse excitation seed signals are calculated off-line in advance, so only simple multiply-add operations are performed when calculating the periodic and aperiodic excitation; no Fourier or inverse Fourier transforms are involved, which improves the running speed of the WORLD vocoder at the chip end. Finally, the impulse response is calculated for each frame of data, which facilitates streaming speech synthesis.
Drawings
FIG. 1 is a schematic flow chart of an embodiment of the present invention;
FIG. 2 is a spectral diagram of a synthesized audio obtained by computing acoustic features in an original audio according to an embodiment of the present invention;
FIG. 3 is a diagram of original audio frequency spectra in an embodiment of the present invention;
the abscissa in fig. 2 and 3 represents the time-domain signal time point and the ordinate represents the frequency domain.
Detailed Description
The acoustic features of a given piece of audio data comprise the fundamental frequency F0, the spectral envelope feature sp, and the banded aperiodic ratio ap; the banded aperiodic ratio represents the ratio of white noise energy to pulse-train energy in the audio signal. The source is a mixture of white noise and a pulse train, as demonstrated in the reference literature (e.g., M. Morise, "D4C, a band-aperiodicity estimator for high-quality speech synthesis," Speech Communication, vol. 84, pp. 57-65, Nov. 2016): for unvoiced sound the white noise ratio is high and the banded aperiodic ratio is high; for voiced sound the white noise ratio is low and the banded aperiodic ratio is low. A specific definition of the banded aperiodic ratio can be found in "A mixed excitation LPC vocoder model for low bit rate speech coding" (A.V. McCree and T.P. Barnwell, IEEE Trans. Speech Audio Process. 3(4), 242-250, 1995), which proposed this acoustic parameter in connection with mixed excitation. For convenient speech synthesis, the invention restores the time-domain audio signal from the acoustic features of a given piece of audio data and simplifies the restoration process.
The following provides a more detailed description of the present invention.
The present invention may restore a time-domain audio signal of a given piece of audio data by the following steps.
Step 1, calculating a pulse excitation seed signal and a noise excitation seed signal.
Step 1-1, choosing a frequency range fr and a maximum frequency U, and calculating the dimension of the banded aperiodic ratio from the sampling rate fs. The formula is as follows:

N_ap = ⌊min(fs/2, U) / fr⌋ + 1 ----(1)

where N_ap is the dimension of the banded aperiodic ratio, ⌊ ⌋ denotes rounding down, min the minimum value, fs the sampling rate, U the maximum frequency, and fr the frequency range. U and fr take empirical values, so N_ap is determined by the sampling rate.
For example, if the sampling rate of the audio is 16000 and the frame shift is set to 80, i.e., one Fourier-transform-length window of data is taken every 80 points, then each second of audio yields 16000/80 = 200 frames of the fundamental frequency feature, and the ap feature forms a [3, 200] matrix. The values of the frequency range fr and the maximum frequency U are initialized according to the required calculation precision.
The frequency range refers to the spacing between two adjacent band frequencies: the smaller the frequency range, the larger the dimension of the resulting ap feature and the finer the calculation. The sampling rate is the number of sample points contained in each second of audio; if an audio's sampling rate is 16000, each second contains 16000 samples. Since a frequency generally needs two sample points to be determined, the maximum frequency expressible by audio sampled at 16000 is 8000.
Step 1-2, simulating the pulse reference frequency of each dimension of the banded aperiodic ratio with a cosine function, and solving the pulse excitation seed signals; the formula is as follows:

p_i = F⁻¹(0.5 + 0.5·cos(2π(wp − fr·i) / (2·fr))),  i = 1, 2, ..., N_ap ----(2)

where F⁻¹ denotes the inverse Fourier transform, cos is the cosine operator, and wp is the excitation seed vector, whose empirical values can be taken based on the sampling rate and the Fourier transform length (the vector length is typically set to half the Fourier transform length); fr is the frequency range, and p_i is the pulse excitation seed signal of the i-th dimension of the banded aperiodic ratio.
Step 1-3, obtaining or randomly generating N_ap groups of random white noise signals, and solving the noise excitation seed signals. The formula is as follows:

n_i = F⁻¹(F(w_i) · F(p_i)),  i = 1, 2, ..., N_ap ----(3)

where F and F⁻¹ denote the Fourier transform and inverse Fourier transform respectively, w_i is the i-th group of random white noise, and n_i is the i-th noise excitation seed signal.
The pulse excitation seed signals p_i and noise excitation seed signals n_i above depend only on the sampling rate and the frequency range, so they can be calculated once and stored as constants.
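The offline seed computation of formulas (2) and (3) can be sketched in Python with NumPy. The excitation seed vector wp (taken here as a linear frequency axis of half the FFT length), the "+1" reading of formula (1), and the concrete lengths follow the worked example below and are assumptions, not the patent's exact code:

```python
import numpy as np

fs, fft_len = 16000, 1024
fr, U = 3000.0, 8000.0
# formula (1); the "+1" is an assumption that matches the worked example (N_ap = 3)
n_ap = int(min(fs / 2, U) // fr) + 1

# wp: excitation seed vector, assumed here to be a linear frequency axis
# of half the Fourier transform length
wp = np.linspace(0.0, fs / 2, fft_len // 2)

# formula (2): pulse excitation seed per band via inverse FFT of a raised cosine
pulse_seeds = []
for i in range(n_ap):
    spec = 0.5 + 0.5 * np.cos(2 * np.pi * (wp - fr * i) / (2 * fr))
    pulse_seeds.append(np.fft.irfft(spec, fft_len))   # length 1024
pulse_seeds = np.array(pulse_seeds)

# formula (3): noise seed = white noise filtered by the pulse seed (via FFT)
rng = np.random.default_rng(0)
noise_len = 8192
noise_seeds = []
for p in pulse_seeds:
    w = rng.standard_normal(noise_len)
    noise_seeds.append(np.fft.irfft(np.fft.rfft(w) * np.fft.rfft(p, noise_len), noise_len))
noise_seeds = np.array(noise_seeds)
```

Both seed banks depend only on the sampling rate and frequency range, so in a chip deployment they would be precomputed and stored as constant tables.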
Step 2, solving the excitation signal represented by the two features from the given fundamental frequency F0 and banded aperiodic ratio ap: the fundamental frequency determines the pulse positions, the banded aperiodic ratio determines the proportions of periodic and aperiodic excitation, and the excitation signal comprises both aperiodic and periodic excitation.
And 2-1, solving the pulse number and position by the fundamental frequency F0.
2-11. Upsample the F0 feature to the time-domain signal length N, where N is the number of F0 frames multiplied by the frame shift, the frame shift being the sliding step used when extracting the acoustic features from the time-domain signal; record the upsampling result as a_i, i = 0, 1, ..., N−1.
2-12. Multiply each upsampled value from step 2-11 by 2π and divide by the sampling rate fs, π being the circular constant; then accumulate over the sampling points in order:

b_j = Σ_{i=0}^{j} 2π·a_i / fs,  j = 0, 1, ..., N−1

where a_i is the upsampling result obtained in step 2-11, N is the time-domain signal length, and b_j is the accumulated data value at the j-th position.
2-13. For each accumulated data value b_j from step 2-12, take b_j and its neighbour b_{j+1} modulo 2π and form the absolute value of their difference:

c_k = |(b_k % 2π) − (b_{k+1} % 2π)|,  k = 1, 2, ..., N−1

where c_k is the absolute difference at position k, b_k the accumulated data value at the k-th position, % the remainder operation, and | | the absolute-value operation.
2-14. Check each c_k, k = 1, 2, ..., N−1, from step 2-13: if c_k > π, the k-th position is a pulse point. The total number of pulse points is N_p, and the position of each pulse point is recorded as k_i, i = 0, 1, ..., N_p.
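The pulse-position search of steps 2-11 to 2-14 amounts to detecting wrap-arounds of an accumulated phase. A minimal Python sketch, assuming nearest-neighbour upsampling of F0 (the patent does not fix the interpolation method):

```python
import numpy as np

def pulse_positions(f0, frame_shift, fs):
    """Steps 2-11 to 2-14: pulse points from a frame-level F0 track.

    f0: per-frame fundamental frequency in Hz; frame_shift: samples per frame.
    """
    # 2-11: upsample F0 to the time-domain length N (sample-and-hold assumed)
    a = np.repeat(np.asarray(f0, dtype=float), frame_shift)
    # 2-12: per-sample phase increment 2*pi*f0/fs, accumulated over samples
    b = np.cumsum(2 * np.pi * a / fs)
    # 2-13: absolute difference of adjacent phases, each wrapped to [0, 2*pi)
    c = np.abs(b[:-1] % (2 * np.pi) - b[1:] % (2 * np.pi))
    # 2-14: a wrap-around (difference > pi) marks a pulse point
    return np.where(c > np.pi)[0]

# a constant 100 Hz track at fs = 16000 places pulses roughly every 160 samples
k = pulse_positions([100.0] * 10, 80, 16000)
```

The detector works because the wrapped phase jumps by nearly 2π at each pitch-period boundary, so the adjacent-sample difference there exceeds π while everywhere else it equals the small per-sample increment.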
Step 2-2. solving for non-periodic excitation
2-21, up-sampling the banded aperiodic ratio ap characteristic to a time domain signal dimension N;
2-22, expanding the noise excitation seed signal to a time domain signal dimension N;
2-23. Multiply the results of step 2-21 and step 2-22 element-wise per dimension, then sum the dimensions into a single signal of time-domain length N, which is the aperiodic excitation.
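Steps 2-21 to 2-23 can be sketched likewise; sample-and-hold upsampling of ap and tiling of the noise seeds are assumed realizations of the "up-sampling" and "expanding" operations:

```python
import numpy as np

def aperiodic_excitation(ap, frame_shift, noise_seeds):
    """Steps 2-21 to 2-23: ap is an [N_ap, m] banded aperiodic ratio
    (m frames), noise_seeds an [N_ap, L] noise-seed bank."""
    n_ap, m = ap.shape
    N = m * frame_shift                       # time-domain signal length
    # 2-21: upsample each band of ap to length N (sample-and-hold assumed)
    ap_up = np.repeat(ap, frame_shift, axis=1)
    # 2-22: extend (tile) the noise seeds to length N, cropping the excess
    reps = -(-N // noise_seeds.shape[1])      # ceiling division
    noise = np.tile(noise_seeds, (1, reps))[:, :N]
    # 2-23: element-wise product per dimension, then sum the bands
    return (ap_up * noise).sum(axis=0)
```

Only multiplications and additions occur here, which is the point of the method: no transform is needed at synthesis time.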
Step 2-3, solving periodic excitation
For each pulse position k_i, i = 0, 1, ..., N_p, obtained in step 2-1, perform the following operations:
judge from the fundamental frequency F0 and the banded aperiodic ratio ap whether the pulse position lies in an unvoiced segment; if so, the periodic excitation is 0; otherwise, multiply each dimension of the pulse excitation seed signals obtained in step 1 by (1 − ap_{k_i}), where ap_{k_i} is the banded aperiodic ratio at position k_i in the corresponding dimension, then sum the N_ap dimensions into one dimension to obtain the periodic excitation at that pulse position.
The usual way to judge whether a pulse lies in an unvoiced segment is to set an unvoiced threshold: for example, if the F0 value at the pulse point is 0, or the banded aperiodic ratio in the largest dimension at the pulse point exceeds the unvoiced threshold 0.999 (indicating that only white noise is present), the pulse position lies in an unvoiced segment.
The periodic excitations of all pulse positions are superposed at their respective pulse positions to obtain the complete periodic excitation.
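The per-pulse weighting and superposition of step 2-3 can be sketched as follows; the unvoiced decision is passed in as a precomputed mask, and the array shapes are assumptions for illustration:

```python
import numpy as np

def periodic_excitation(pulse_pos, ap_up, pulse_seeds, N, unvoiced=None):
    """Step 2-3: ap_up is the [N_ap, N] upsampled banded aperiodic ratio,
    pulse_seeds the [N_ap, L] seed bank; unvoiced is an optional boolean
    mask over samples (the F0/threshold test described in the text)."""
    L = pulse_seeds.shape[1]
    out = np.zeros(N + L)            # padded so a pulse near the end fits
    for k in pulse_pos:
        if unvoiced is not None and unvoiced[k]:
            continue                 # unvoiced segment: periodic excitation is 0
        # weight each band's seed by (1 - ap) at this pulse, sum the bands,
        # and superpose the result at the pulse position
        out[k:k + L] += ((1.0 - ap_up[:, k])[:, None] * pulse_seeds).sum(axis=0)
    return out[:N]
```

As with the aperiodic branch, the loop body is pure multiply-add on precomputed seed tables.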
And 2-4, adding the non-periodic excitation obtained in the step 2-2 and the periodic excitation obtained in the step 2-3 to obtain an excitation signal.
The invention only carries out simple multiplication and addition operation when calculating the periodic excitation and the non-periodic excitation by offline calculating the needed pulse excitation seed signal in advance, does not relate to Fourier transform and inverse Fourier transform, and improves the operation speed of the WORLD vocoder at the chip end.
After the excitation signal is obtained, the impulse response is calculated for each frame of data, so that the subsequent streaming voice synthesis is facilitated.
Step 3, calculating corresponding audio data for each frame of spectral envelope, and then superposing the audio data according to frame shift to obtain a final voice waveform, namely a time domain audio signal; the step specifically comprises the following steps:
Step 3-1, obtaining the minimum phase spectrum from the spectrum via the cepstrum-based construction:

c(q) = (1/N) Σ_w log sp(w) · e^{−iwq},  V(w) = exp( Σ_q l(q) · c(q) · e^{iwq} )

with lifter l(0) = 1, l(q) = 2 for 0 < q < N/2, and l(q) = 0 for q > N/2, where V(w) is the resulting minimum phase spectrum, w the spectral (frequency) domain, q the cepstrum domain, e^{iwq} and e^{−iwq} the complex exponential kernels, and sp the spectral envelope feature.
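Step 3-1 is the standard cepstrum-based minimum phase construction. A sketch, assuming sp is one frame of a strictly positive envelope sampled on an rfft grid:

```python
import numpy as np

def minimum_phase_spectrum(sp):
    """Cepstrum-based minimum phase spectrum (a standard realization of step 3-1).

    sp: one frame of the spectral envelope, length fft_len//2 + 1, all positive.
    Returns the complex minimum phase spectrum V(w) on the same grid.
    """
    fft_len = 2 * (len(sp) - 1)
    # real cepstrum of the log envelope: c(q) = F^{-1}[log sp(w)]
    c = np.fft.irfft(np.log(sp), fft_len)
    # minimum-phase lifter: keep q = 0, double 0 < q < N/2, zero the rest
    c[1:fft_len // 2] *= 2.0
    c[fft_len // 2 + 1:] = 0.0
    return np.exp(np.fft.rfft(c))
```

A useful sanity property of this construction is that it preserves the magnitude exactly: |V(w)| equals sp(w), while the phase becomes minimum phase.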
Step 3-2, performing window calculation on the excitation signal obtained in the step 2 according to frame shift extraction data, and extracting the excitation signal, wherein the window length is determined by the Fourier transform length in the step 1 according to the Fourier spectrum;
step 3-3, multiplying the minimum phase frequency spectrum obtained in the step 3-1 by the Fourier frequency spectrum of the excitation signal obtained in the step 3-2;
step 3-4, performing inverse Fourier transform on the product result of the step 3-3 to obtain an impulse response;
and 3-5, overlapping all impulse responses according to the frame shift position to obtain a voice waveform.
The specific process is shown in FIG. 1.
One specific example is given below.
The acoustic features of a piece of audio data are known: the fundamental frequency F0, the spectral envelope SP, and the banded aperiodic ratio AP. The frame shift is set to 0.005 seconds (5 milliseconds, i.e., 80 samples), the sampling rate of the audio to be synthesized is 16000, and the Fourier transform length is 1024. The present invention restores its time-domain audio signal through the following steps.
1. A pulse excitation seed signal and a noise excitation seed signal are calculated.
The custom frequency range is 3000 and the maximum frequency is 8000, so formula (1) gives a banded aperiodic ratio dimension of 3. Formula (2) gives the pulse excitation seed signals p_0, p_1, p_2 of each dimension, each of length 1024. Formula (3) gives the noise excitation seed signals n_0, n_1, n_2 of each dimension, each of length 8192.
The pulse excitation seed signals and noise excitation seed signals calculated in this step depend only on the sampling rate, the frequency range, and the maximum frequency, so they can be calculated off-line and stored as variables, and directly loaded without recalculation when audio is subsequently synthesized.
2. From the given fundamental frequency F0 and banded aperiodic ratio AP, obtain the excitation signal, including the aperiodic and periodic excitation; assume the F0 and banded aperiodic ratio AP features have m frames.
2-1) finding the pulse number and position by F0.
1a) Upsample the F0 feature to m*80 dimensions (* denotes multiplication), where m*80 is the corresponding time-domain signal dimension N_response; record the upsampling result as a_i, i = 0, 1, ..., N_response − 1;
1b) Compute b_j, j = 0, 1, ..., N_response − 1, as described in step 2-12;
1c) Compute c_k, k = 1, 2, ..., N_response − 1, as described in step 2-13;
1d) Check each c_k in turn: if c_k > π, the k-th position is a pulse point. Record the number of pulse points N_p and the position k_i of each pulse point, i = 0, 1, ..., N_p.
2-2) non-periodic excitation
2a) Upsample the ap signal of size [3, m] to [3, m*80];
2b) extend the noise excitation seed signals to the time-domain signal dimension: if the time-domain dimension N_response is smaller than the seed length 8192, take the first N_response samples of each noise excitation seed signal; if N_response is larger than 8192, select the required data from the noise excitation seed signals; the extended noise excitation seed signals have size [3, m*80];
2c) multiply the results of steps 2a) and 2b) element-wise, then sum the resulting [3, m*80] matrix along its rows to obtain a [1, m*80] matrix, which is the aperiodic excitation.
2-3) periodic excitation
Initialize a periodic excitation signal of size [1, m*80]; then for each pulse position:
3a) judge whether the pulse point lies in an unvoiced segment from its fundamental frequency value and the banded aperiodic ratio in its largest dimension: if the fundamental frequency value at the pulse point is 0, or the banded aperiodic ratio in its largest (here the 3rd, i.e. N_ap-th) dimension is greater than 0.999, the pulse point lies in an unvoiced segment and its periodic excitation is 0;
3b) if it is not an unvoiced segment, multiply the pulse excitation seed signals dimension by dimension by (1 − ap_{k_i}), where ap_{k_i} is the banded aperiodic ratio at position k_i in the corresponding dimension, obtaining a [3, 1024] matrix; sum it along the rows into [1, 1024] to obtain the periodic excitation at that pulse position;
Calculate the periodic excitation at each pulse position and superpose the results according to the pulse positions to obtain the complete periodic excitation.
Add the aperiodic excitation of step 2) and the periodic excitation of step 3) to obtain the excitation signal.
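Step 3a)–3b) can be sketched as follows. This is a simplified illustration under stated assumptions: the unvoiced test here uses only the highest-band aperiodicity (the F0 < 0 check is omitted because F0 is not passed in), the seed is placed starting at the pulse position with illustrative boundary handling, and all names are hypothetical:

```python
import numpy as np

def periodic_excitation(pulse_positions, ap_at_pulse, pulse_seed, n):
    """pulse_positions: sample indices of the pulse points
    ap_at_pulse: [n_pulses, n_bands] band aperiodicity at each pulse point
    pulse_seed: [n_bands, seed_len] pulse excitation seed signals
    n: length of the output excitation signal
    """
    n_bands, seed_len = pulse_seed.shape
    out = np.zeros(n)
    for pos, ap in zip(pulse_positions, ap_at_pulse):
        # unvoiced check: near-total aperiodicity in the maximum (highest) band
        if ap[-1] > 0.999:
            continue  # periodic excitation stays 0 for unvoiced segments
        # weight each band's seed by (1 - ap) and sum the bands -> [seed_len]
        pulse = ((1.0 - ap)[:, None] * pulse_seed).sum(axis=0)
        end = min(pos + seed_len, n)
        out[pos:end] += pulse[:end - pos]   # superpose at the pulse position
    return out
```

Adding this output to the aperiodic excitation yields the complete excitation signal of step 2-4.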
3. Calculating corresponding audio data for each frame spectral envelope sp, and then superposing to obtain a final time domain audio signal;
3-1) deriving a minimum phase spectrum from the spectrum, involving a Fourier transform and an inverse Fourier transform.
3-2) Sequentially extract excitation-signal segments according to the frame shift with a data length of 512, multiply by a window function (a Hamming window in this embodiment), and then compute the Fourier spectrum;
3-3) Multiply the minimum-phase spectrum of step 3-1) by the excitation-signal spectrum of step 3-2) to obtain the spectral information of the time-domain signal;
3-4) Apply the inverse Fourier transform to the result of step 3-3) to obtain an impulse response.
Superpose the impulse response obtained for each frame according to the frame shift to obtain the speech waveform.
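Steps 3-1) through the final superposition form a standard spectral-filtering overlap-add loop: window a 512-sample excitation segment per frame, multiply its FFT by the frame's minimum-phase envelope spectrum, inverse-transform, and overlap-add at the frame shift. A generic sketch under stated assumptions (the per-frame minimum-phase spectra are taken as given, the FFT size equals the window length, and the names are illustrative):

```python
import numpy as np

def overlap_add_synthesis(excitation, min_phase_spectra, frame_shift=80, win_len=512):
    """excitation: 1-D excitation signal
    min_phase_spectra: [n_frames, n_fft] per-frame minimum-phase spectra
    """
    n_frames, n_fft = min_phase_spectra.shape
    window = np.hamming(win_len)                 # Hamming window per the embodiment
    out = np.zeros(n_frames * frame_shift + n_fft)
    for t in range(n_frames):
        start = t * frame_shift
        seg = excitation[start:start + win_len]
        if len(seg) < win_len:                   # zero-pad the trailing frames
            seg = np.pad(seg, (0, win_len - len(seg)))
        exc_spec = np.fft.fft(seg * window, n_fft)                 # 3-2) windowed spectrum
        frame = np.fft.ifft(exc_spec * min_phase_spectra[t]).real  # 3-3) and 3-4)
        out[start:start + n_fft] += frame        # overlap-add at the frame shift
    return out[:n_frames * frame_shift]
```

With all-ones (identity) spectra the loop reduces to windowed overlap-add of the excitation itself, which is a convenient sanity check.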
Time-consuming comparison on chip:
Using the speech recognition chip CI1103 developed by the applicant, the time taken to synthesize 1 second of audio is compared with that of the conventional algorithm, as shown in Table 1; the present invention significantly shortens the time consumption.
TABLE 1
Fig. 3 shows the spectrogram of the original audio, and Fig. 2 shows the spectrogram of the synthesized audio computed by the present invention from the acoustic features of the original audio: the fundamental frequency F0, the spectral envelope SP, and the band aperiodicity ratio AP. Figs. 2 and 3 use coordinate systems of the same size; as the comparison shows, the synthesized audio of Fig. 2 is highly similar to the original audio of Fig. 3.
The foregoing describes preferred embodiments of the present invention. Where the preferred embodiments are not obviously contradictory, they may be combined in any manner. The specific parameters in the embodiments and examples serve only to clearly illustrate the inventors' verification process and are not intended to limit the scope of the invention, which is defined by the claims; equivalent structural changes made using the description and drawings of the present invention are likewise included within the scope of the present invention.
Claims (3)
1. A speech synthesis method for chip end, comprising the steps of:
step 1, calculating a pulse excitation seed signal and a noise excitation seed signal;
step 2, calculating the excitation signal from the fundamental frequency F0 and the band aperiodicity ratio ap of the given voice;
Step 2-1, solving the number and the position of the pulses by the fundamental frequency F0;
2-11, up-sampling the fundamental frequency F0 feature to the time-domain signal length N, wherein the time-domain signal length is the length of the fundamental frequency F0 multiplied by the frame shift, the frame shift being the sliding step used when solving the acoustic features of the time-domain signal; the up-sampling result is recorded as a_i, i = 0, 1, ..., N−1, i being the different dimensions;
2-12, multiplying the up-sampled data of each dimension obtained in step 2-11 by 2π and dividing by the sampling rate fs, π being the circle ratio; then sequentially calculating an accumulated value at each sampling point, which can be expressed as the following equation:

b_j = (2π/fs) · Σ_{i=0}^{j} a_i,  j = 0, 1, ..., N−1

wherein a_i represents the up-sampling result obtained in step 2-11, N is the time-domain signal length, and b_j represents the accumulated data value of the j-th dimension;
2-13, for each accumulated data value b_j of step 2-12, taking b_j and the accumulated data value b_{j+1} of its adjacent dimension modulo 2π respectively, then taking the absolute value of the difference of the remainders, with the formula:

c_k = | b_k % 2π − b_{k−1} % 2π |,  k = 1, 2, ..., N−1

wherein c_k represents the absolute value of the difference in the k-th dimension, b_k is the accumulated data value of the k-th dimension, % represents the remainder operation, N is the time-domain signal length, and | | represents the absolute-value operation;
2-14, judging each absolute difference c_k of step 2-13, k = 1, 2, ..., N−1: if c_k > π, the k-th position is the position of a pulse point; counting and recording the positions of all pulse points as k_i, i = 0, 1, ..., N_p, wherein N_p represents the total number of pulse points;
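Steps 2-11 through 2-14 amount to accumulating the instantaneous phase of F0 sample by sample and marking a pulse wherever the accumulated phase wraps past a multiple of 2π. A minimal NumPy sketch of this idea; the nearest-neighbor up-sampling and the function name are illustrative assumptions, not the patented implementation:

```python
import numpy as np

def find_pulse_positions(f0, frame_shift, fs):
    """Locate glottal pulse points from a frame-level F0 track.

    f0: frame-level fundamental frequency values (Hz)
    frame_shift: samples per frame, so N = len(f0) * frame_shift
    fs: sampling rate (Hz)
    """
    # 2-11: up-sample F0 to the time-domain length N (nearest-neighbor here)
    a = np.repeat(np.asarray(f0, dtype=float), frame_shift)
    # 2-12: accumulate phase increments of 2*pi*f0/fs per sample
    b = np.cumsum(2.0 * np.pi * a / fs)
    # 2-13: absolute difference of adjacent accumulated phases taken modulo 2*pi
    r = np.mod(b, 2.0 * np.pi)
    c = np.abs(np.diff(r))
    # 2-14: a jump larger than pi means the phase wrapped -> pulse point
    return np.where(c > np.pi)[0]
```

For a constant F0 the detected pulses land one fundamental period apart, which is the intended behavior of the phase-wrap test.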
step 2-2. solving for non-periodic excitation
2-21, up-sampling the banded aperiodic ratio ap characteristic of the given voice to the time domain signal length N;
2-22, expanding the noise excitation seed signal to a time domain signal length N;
2-23, multiplying the results of steps 2-21 and 2-22 element-wise by dimension, then combining the dimensions into a feature whose length is the time-domain signal length, which is the aperiodic excitation;
step 2-3. solving periodic excitation
The method specifically comprises the following steps: for each pulse position k_i, i = 0, 1, ..., N_p, obtained in step 2-1, the following operations are performed:
judging whether the pulse position is an unvoiced segment according to the fundamental frequency F0 and the band aperiodicity ratio ap; if so, the periodic excitation is 0; otherwise, multiplying the pulse excitation seed signals obtained in step 1 in turn by (1 − ap_ki), wherein ap_ki is the band aperiodicity ratio at k_i in the corresponding dimension, and adding the values of the N_ap dimensions into one dimension to obtain the periodic excitation at that pulse position;
the periodic excitation of all pulse positions is superposed according to the positions of the pulses to obtain complete periodic excitation;
step 2-4, adding the non-periodic excitation obtained in the step 2-2 and the periodic excitation obtained in the step 2-3 to obtain an excitation signal;
step 3, calculating corresponding audio data for each frame of spectral envelope of the given audio, and then overlapping according to frame shift to obtain a final voice waveform;
the step 1 specifically comprises the following steps:
step 1-1, customizing a frequency range fr and a maximum frequency U, and calculating the dimension N_ap of the band aperiodicity ratio from the sampling rate fs; the formula is as follows:

N_ap = ⌊ min(U, fs/2 − fr) / fr ⌋  ----(1)

wherein N_ap represents the dimension of the band aperiodicity ratio, ⌊ ⌋ represents rounding down, min represents taking the minimum value, fs is the sampling rate, U is the maximum frequency, and fr is the frequency range;
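The listed variables (floor, min, fs, U, fr) suggest a WORLD-style band layout for the dimension count. Assuming that form, N_ap = floor(min(U, fs/2 − fr)/fr), which is an assumption of this sketch; the patent's own equation (1) is authoritative:

```python
import math

def band_aperiodicity_dims(fs, fr=3000.0, U=15000.0):
    """Number of band-aperiodicity dimensions N_ap for sampling rate fs.

    fr: frequency range (bandwidth) of each band; U: maximum frequency.
    Assumed form: N_ap = floor(min(U, fs/2 - fr) / fr).
    """
    return int(math.floor(min(U, fs / 2.0 - fr) / fr))
```

For example, under these assumed defaults a 48 kHz sampling rate yields 5 bands, while 16 kHz yields a single band, since the Nyquist limit caps the usable frequency range.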
step 1-2, using a cosine function to simulate the pulse reference frequency of each dimension of the band aperiodicity ratio, and solving for the pulse excitation seed signal, with the formula:

p_i = f⁻¹( 0.5 + 0.5·cos( 2π(wp − fr·i) / (2·fr) ) ),  i = 1, 2, ..., N_ap  ----(2)

wherein f⁻¹ represents the inverse Fourier transform, cos is the cosine operator, wp is the excitation seed vector, fr is the frequency range, and p_i is the pulse excitation seed signal of the i-th dimension of the band aperiodicity ratio;
step 1-3, obtaining or randomly generating N_ap groups of random white-noise signals, and solving for the noise excitation seed signal; the formula is as follows:

n_i = f⁻¹( f(w_i) · f(p_i) ),  i = 1, 2, ..., N_ap  ----(3)

wherein f represents the Fourier transform, w_i is the i-th group of white noise, and n_i is the noise excitation seed signal of the i-th dimension.
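Equations (2) and (3) build one pulse seed per band as the inverse FFT of a raised-cosine response centered on that band's frequency, and the noise seeds by filtering white noise through the same response. A hedged sketch: the interpretation of wp as the FFT-bin frequency axis, the zeroing of the response outside one band width, and the 8192-sample seed length are assumptions consistent with this embodiment, not confirmed details:

```python
import numpy as np

def make_seed_signals(n_bands, fs, fr=3000.0, seed_len=8192, rng=None):
    """Build pulse and noise excitation seed signals, one pair per band."""
    rng = rng or np.random.default_rng(0)
    # assumed: wp is the FFT-bin frequency axis in Hz
    wp = np.fft.fftfreq(seed_len, d=1.0 / fs)
    pulse_seeds, noise_seeds = [], []
    for i in range(1, n_bands + 1):
        # eq. (2): raised-cosine response centered on band i, then inverse FFT
        resp = 0.5 + 0.5 * np.cos(2.0 * np.pi * (np.abs(wp) - fr * i) / (2.0 * fr))
        resp[np.abs(np.abs(wp) - fr * i) > fr] = 0.0   # confine to one band width
        p_i = np.fft.ifft(resp).real
        # eq. (3): filter a white-noise group w_i through the same response
        w_i = rng.standard_normal(seed_len)
        n_i = np.fft.ifft(np.fft.fft(w_i) * np.fft.fft(p_i)).real
        pulse_seeds.append(p_i)
        noise_seeds.append(n_i)
    return np.array(pulse_seeds), np.array(noise_seeds)
```

Because the response is even in frequency, the inverse FFT of each band response is real, so the `.real` casts only discard numerical round-off.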
2. The speech synthesis method for the chip end according to claim 1, wherein step 3 specifically comprises the following steps:
step 3-1, obtaining a minimum phase frequency spectrum from the frequency spectrum, wherein a calculation formula is as follows:
wherein V(w) represents the resulting minimum-phase spectrum, w represents the minimum-phase spectral domain variable, q represents the spectral-envelope domain variable, e^{iwq} and e^{−iwq} represent the complex exponential functions involved, and sp is the spectral envelope feature;
step 3-2, extracting data from the excitation signal obtained in step 2 according to the frame shift, applying the window calculation to the extracted excitation signal, and computing its Fourier spectrum, wherein the window length is determined by the Fourier transform length;
step 3-3, multiplying the minimum phase frequency spectrum obtained in the step 3-1 by the Fourier frequency spectrum of the excitation signal obtained in the step 3-2;
step 3-4, performing inverse Fourier transform on the product result of the step 3-3 to obtain an impulse response;
and 3-5, overlapping all impulse responses according to the frame shift position to obtain a voice waveform.
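Step 3-1's minimum-phase spectrum is conventionally obtained from the log spectral envelope via the real cepstrum: fold the anticausal cepstral part onto the causal part, transform back, and exponentiate. Since the patent's exact equation is not reproduced here, the following is a standard-technique sketch rather than the patented formula:

```python
import numpy as np

def minimum_phase_spectrum(sp):
    """sp: real, positive spectral envelope samples of length n_fft."""
    n = len(sp)
    cep = np.fft.ifft(np.log(sp)).real            # real cepstrum of the log envelope
    # fold: keep c[0] and c[n/2], double the causal part, zero the anticausal part
    lifter = np.zeros(n)
    lifter[0] = 1.0
    lifter[1:n // 2] = 2.0
    lifter[n // 2] = 1.0
    return np.exp(np.fft.fft(cep * lifter))       # minimum-phase spectrum V(w)
```

A useful property for checking the fold: the magnitude of the returned spectrum equals the input envelope, while the phase is the minimum-phase response implied by it.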
3. The speech synthesis method for the chip end according to claim 1, wherein the specific method for judging whether a pulse point is in an unvoiced segment is to set an unvoiced threshold: if the value of the fundamental frequency F0 at the pulse point is 0, or the band aperiodicity ratio ap in the maximum dimension of the pulse point (the N_ap-th dimension) is greater than the unvoiced threshold, the pulse position is in an unvoiced segment.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202210426046.4A CN114550733B (en) | 2022-04-22 | 2022-04-22 | Voice synthesis method capable of being used for chip end |
Publications (2)
Publication Number | Publication Date |
---|---|
CN114550733A CN114550733A (en) | 2022-05-27 |
CN114550733B true CN114550733B (en) | 2022-07-01 |
Family
ID=81667506
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202210426046.4A Active CN114550733B (en) | 2022-04-22 | 2022-04-22 | Voice synthesis method capable of being used for chip end |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN114550733B (en) |
Citations (4)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
WO2009144368A1 (en) * | 2008-05-30 | 2009-12-03 | Nokia Corporation | Method, apparatus and computer program product for providing improved speech synthesis |
EP2144230A1 (en) * | 2008-07-11 | 2010-01-13 | Fraunhofer-Gesellschaft zur Förderung der angewandten Forschung e.V. | Low bitrate audio encoding/decoding scheme having cascaded switches |
CN102750955A (en) * | 2012-07-20 | 2012-10-24 | 中国科学院自动化研究所 | Vocoder based on residual signal spectrum reconfiguration |
WO2018159402A1 (en) * | 2017-02-28 | 2018-09-07 | 国立研究開発法人情報通信研究機構 | Speech synthesis system, speech synthesis program, and speech synthesis method |
Family Cites Families (1)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US9607610B2 (en) * | 2014-07-03 | 2017-03-28 | Google Inc. | Devices and methods for noise modulation in a universal vocoder synthesizer |
Non-Patent Citations (3)
Title |
---|
High-quality waveform generator from fundamental frequency, spectral envelope, and band aperiodicity; Masanori Morise et al.; Proceedings of APSIPA Annual Summit and Conference 2019; 2019-12-31; full text *
WORLD: A Vocoder-Based High-Quality Speech Synthesis System; Masanori Morise et al.; The Institute of Electronics, Information and Communication Engineers; 2016-07-31; Vol. E99-D, No. 7; full text *
Research on a Multi-Discriminator Singing Voice Synthesis Vocoder Based on Generative Adversarial Networks; Chen Feiyang; China Master's Theses Full-text Database; 2022-03-15; full text *
Also Published As
Publication number | Publication date |
---|---|
CN114550733A (en) | 2022-05-27 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
US9031834B2 (en) | Speech enhancement techniques on the power spectrum | |
CN102306492B (en) | Voice conversion method based on convolutive nonnegative matrix factorization | |
US9135923B1 (en) | Pitch synchronous speech coding based on timbre vectors | |
CN112489629B (en) | Voice transcription model, method, medium and electronic equipment | |
JP7617261B2 (en) | Audio generator, audio signal generation method, and audio generator training method | |
CN102201240B (en) | Harmonic noise excitation model vocoder based on inverse filtering | |
EP2109096A1 (en) | Speech synthesis with dynamic constraints | |
Wu et al. | Quasi-periodic WaveNet vocoder: A pitch dependent dilated convolution model for parametric speech generation | |
CN103886859B (en) | Phonetics transfer method based on one-to-many codebook mapping | |
Yoneyama et al. | High-fidelity and pitch-controllable neural vocoder based on unified source-filter networks | |
EP4586246A1 (en) | Decoder | |
CN114550733B (en) | Voice synthesis method capable of being used for chip end | |
EP2087485B1 (en) | Multicodebook source -dependent coding and decoding | |
Song et al. | Improved time-frequency trajectory excitation modeling for a statistical parametric speech synthesis system | |
Kwon et al. | Effective parameter estimation methods for an excitnet model in generative text-to-speech systems | |
CN104282300A (en) | Non-periodic component syllable model building and speech synthesizing method and device | |
KR102837410B1 (en) | Methods for generating audio signals and training audio generators and audio generators | |
KR102837411B1 (en) | Methods for generating audio signals and training audio generators and audio generators | |
RU2823015C1 (en) | Audio data generator and methods of generating audio signal and training audio data generator | |
Jiang et al. | ESTVocoder: An Excitation-Spectral-Transformed Neural Vocoder Conditioned on Mel Spectrogram | |
Orphanidou et al. | Voice morphing using the generative topographic mapping | |
Jiang et al. | Excitation-Spectral-Transformed Neural Vocoder Conditioned on Mel Spectrogram | |
Arakawa et al. | High quality voice manipulation method based on the vocal tract area function obtained from sub-band LSP of STRAIGHT spectrum | |
Gandhi et al. | Source separation using particle filters. | |
Ye | Efficient Approaches for Voice Change and Voice Conversion Systems |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant | ||