
AU2018100318A4 - A method of generating raw music audio based on dilated causal convolution network - Google Patents

A method of generating raw music audio based on dilated causal convolution network

Info

Publication number
AU2018100318A4
AU2018100318A4
Authority
AU
Australia
Prior art keywords
training
audio
convolution
dilated
model
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Ceased
Application number
AU2018100318A
Inventor
Shuhan Li
Shipeng Liu
Bingyan ZHANG
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Individual
Original Assignee
Individual
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Individual filed Critical Individual
Priority to AU2018100318A priority Critical patent/AU2018100318A4/en
Application granted granted Critical
Publication of AU2018100318A4 publication Critical patent/AU2018100318A4/en
Ceased legal-status Critical Current
Anticipated expiration legal-status Critical

Landscapes

  • Complex Calculations (AREA)

Abstract

Abstract This patent introduces an efficient, autoregressive method of generating a natural audio sequence, which utilizes the dilated causal convolution network, an advanced deep neural network. With this method, both the number of training layers and the training time are greatly reduced. The model can extract the elements of the input musical fragments and generate audio with a similar tone and coherent phonemes, conditioned entirely on the previous audio. At the same time, the generated results are extremely smooth and natural, making it difficult for human listeners to distinguish the original musical fragments from the generated ones. We are firmly convinced that the model can be applied to many related fields, including artistic and commercial areas, and can certainly help music learning, speech recognition, and so on.

Description

DESCRIPTION
TITLE A method of generating raw music audio based on dilated causal convolution network
FIELD OF THE INVENTION
This invention mainly applies to the field of audio processing, such as audio recognition and audio generation. Take audio recognition as an example: after the model has been trained on the host's accent, it remembers the sound accurately, so that the next time the host speaks to the system, it can swiftly recognize the host and execute a series of operations.
Speech synthesis and speech recognition are two key technologies for realizing man-machine speech communication and building spoken-language dialogue systems. Giving computers a speaking ability similar to that of humans is an important competitive market for today's information industry. Compared with speech recognition, speech synthesis technology is relatively mature and has begun to move towards industrialization; large-scale application is just around the corner.
BACKGROUND OF THE INVENTION
With the development of technology, applications of artificial intelligence have improved considerably. In addition, the demand for entertainment is enormous, which creates a considerable market in this field. Nowadays, artificial intelligence is widely applied to images, for example in style transfer; in the field of audio, however, it is not as advanced. Our patent uses artificial intelligence and deep learning, which helps to fill this gap to some extent.
The goal of the invention is to render naturally sounding speech signals given a text to be synthesized. In physics, sound is a vibration that typically propagates as an audible wave of pressure through a transmission medium such as a gas, liquid or solid. Sound waves are generated by a sound source, such as the vibrating diaphragm of a stereo speaker. The sound source creates vibrations in the surrounding medium, and as the source continues to vibrate the medium, the vibrations propagate away from the source at the speed of sound, forming the sound wave. This invention imitates this process on a computer to produce a sound wave of a similar type to the given one. When trained to model music, the system can generate novel and often highly realistic musical fragments.
In this invention, the music generation process is based on the raw sound wave. The model is able to model a distribution over thousands of random variables, so the approach can succeed in generating wideband raw audio waveforms. These waveforms are signals with very high temporal resolution, at least 16,000 samples per second (Figure 1, Figure 2). In addition, we need to shorten the time required for raw audio generation because of the long-range temporal dependencies, so we develop a new architecture based on dilated causal convolutions, which has exhibited quite high efficiency.
However, the rate of the synthesizing process is not ideal. Because of the way the computation is organized, each step of the calculation can only synthesize one sample. Although the efficiency has been improved considerably through stacked dilated convolutions, it still takes about 0.015 seconds to synthesize one sample, and we need 16,000 samples per second; since 0.015 s * 16,000 = 240 s, synthesizing one second of music takes about 4 minutes.
This invention has two main applications. First, it can produce the sound of a certain sort of instrument; for this application there is no restriction on the rhythm. Secondly, we can import a fragment of a piece of music about 200 milliseconds long, and the model synthesizes a similar rhythm according to the given one, so that the synthesized sound wave has similar characteristics to the given one. Both types of synthesized music are about 20 seconds long. After testing both applications, we found that the quality of the synthesized sound waves is almost the same in the two cases.
SUMMARY OF THE INVENTION
An efficient music generator model is introduced in this patent. The model generates music fragments that depend on the previous phonemes, while each audio sample remains independent of the samples at future time steps. The conditional probability of the generated phonemes {x1, x2, ..., xT} is as follows:
p(x) = Π_{t=1}^{T} p(x_t | x_1, ..., x_{t-1})
The conditional probability is modeled by convolution layers; the network does not need pooling layers, and the output has the same time dimensionality as the input. Classical networks come with a computational cost, because each state of the whole network must be computed sequentially. In order to avoid this problem, we use dilated causal convolutional layers (Figure 3) to capture a bounded receptive field and compute features for all phoneme positions at once. The result is a fully convolutional neural network in which the convolutional layers have various dilation factors, allowing the receptive field to grow exponentially with depth and cover thousands of time steps. For images, masks are adopted in the convolutions to avoid seeing future context; masks have previously also been used in non-convolutional models. For audio, it is much easier to implement this by shifting the output of a normal convolution by a few time steps. Moreover, the audio generation process is sequential for this network, as each sampled phoneme must be fed back into the network as input.
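A shifted convolution of this kind can be sketched as follows. This is our illustrative code, not the patent's implementation; it assumes PyTorch, and the class name CausalConv1d and all parameter values are placeholders. With kernel size 2 and dilation d, the output at time t depends only on the inputs at times t and t − d.

```python
import torch
import torch.nn as nn

class CausalConv1d(nn.Module):
    """1-D convolution whose output at time t sees only inputs at times <= t."""
    def __init__(self, in_channels, out_channels, kernel_size, dilation=1):
        super().__init__()
        # Left context needed so that no future samples leak into the output.
        self.pad = (kernel_size - 1) * dilation
        self.conv = nn.Conv1d(in_channels, out_channels, kernel_size,
                              padding=self.pad, dilation=dilation)

    def forward(self, x):  # x: (batch, channels, time)
        out = self.conv(x)
        # Trim the extra samples on the right so the output keeps the input length.
        return out[:, :, :-self.pad] if self.pad > 0 else out

# Example: with kernel size 2 and dilation 4, output[t] depends only on x[t] and x[t - 4].
y = CausalConv1d(1, 32, kernel_size=2, dilation=4)(torch.randn(1, 1, 16000))
```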
DESCRIPTION OF DRAWING
The following drawings are only for the purpose of description and explanation but not for limitation, wherein:
Figure 1 and Figure 2 show the models of the waveform.
Figure 3 presents the process of dilated convolution.
Figure 4 depicts the overall flow of the code.
Figure 5 illustrates the core process in the code.
Figure 6 and Figure 7 demonstrate the results of the invention in the form of wave models.
DESCRIPTION OF PREFERRED EMBODIMENTS
The causal filter is the basis of this model. Each audio sample generated at time step t depends only on the previous time steps, so by using the causal filter the model cannot disturb the ordering of the audio data. At training time, the conditional predictions for all phonemes can be computed in parallel, because all time steps of the ground truth are available in the data stack. After the latest phoneme is generated, it is immediately fed back into the data stack. Also, because it lacks recurrent connections, this model trains faster than RNNs.
However, there is still a dilemma: in order to take as many previous time steps as possible into account, the network would have to be remarkably deep and require a large number of layers.
To solve this problem, we utilize dilated convolution, which achieves the same goal with fewer layers, and combine it with causal convolution. A dilated convolution is used to enlarge the receptive field: given the same number of layers, a dilated convolution has a larger receptive field than a normal convolution. Its working mechanism is shown in Figure 3. The output of the dilated causal convolution has the same size as the input, and no pooling or strided layers are needed. In particular, a convolution with dilation 1 is the same as a standard convolution. Figure 3 shows layers with dilations 1, 2, 4 and 8; in our model the dilation goes up to 512, which means that a generated node has a receptive field of 1024 previous phonemes. In our model the dilation doubles for each layer, and the schedule is repeated as follows: 1, 2, 4, 8, ..., 512, 1, 2, 4, 8, ..., 512, 1, 2, 4, 8, ..., 512.
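As a small computational check (ours, not part of the patent), the sketch below builds this dilation schedule and computes the receptive field, assuming a kernel size of 2; for a single block of ten layers it reproduces the figure of 1024 samples mentioned above.

```python
# Dilation schedule described above: 1, 2, 4, ..., 512, repeated over 3 blocks (30 layers).
dilations = [2 ** i for i in range(10)] * 3

def receptive_field(dilations, kernel_size=2):
    """Number of samples visible to one output sample for a stack of dilated convolutions."""
    return 1 + sum((kernel_size - 1) * d for d in dilations)

print(receptive_field([2 ** i for i in range(10)]))  # 1024 for one block of 10 layers
print(receptive_field(dilations))                    # 3070 for the full 3-block stack
```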
PART A:
We divide the flow chart into two sections, the parent process and the core process. The aim of the invention is to generate a natural audio sequence, and this is realized by combining the two parts harmoniously.
For the parent process (Figure 4), we prepare several datasets, such as pieces of violin music, and divide them into a large number of batches. Users can also use their own dataset to train the network. Now the training can begin.
First, we define a certain number of epochs, which determines the total number of training passes. In each epoch the network produces a model, so after a large number of iterations the model becomes more accurate and converges well. When training starts, the epoch counter is initialized to zero, and the loop stops when it reaches the preset number of epochs.
In each iteration of the loop, we first retrieve the current n-dimensional input from the data loader together with its target. We then pass these parameters to the core process in order to generate an audio node; this process is elaborated in the description of the core section. After a series of operations in the core process, we obtain an audio node that depends on the previous phonemes. However, it is not yet natural enough: what we want to generate is a sequence of smooth sound. We therefore compare the output with the previously retrieved target and calculate the loss function. Then, in order to approach the goal, we run back-propagation and take a gradient-descent step. When all of this is done, we save the model and continue the loop.
When the current epoch reaches the preset value, we stop the loop and output the final result. A minimal sketch of this parent loop is given below.
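The sketch below is our own hypothetical rendering of this parent process, assuming PyTorch; model, data_loader and num_epochs are placeholders, and the 256-way cross-entropy loss reflects the quantized output described in Part B rather than a detail stated here.

```python
import torch
import torch.nn as nn
import torch.optim as optim

def train(model, data_loader, num_epochs, device="cpu"):
    """Parent process: loop over epochs, compare output with target, descend the gradient, save."""
    model.to(device)
    criterion = nn.CrossEntropyLoss()                 # target is one of 256 quantized values
    optimizer = optim.Adam(model.parameters(), lr=1e-3)

    for epoch in range(num_epochs):                   # loop stops at the preset number of epochs
        for inputs, target in data_loader:            # current input batch and its target
            inputs, target = inputs.to(device), target.to(device)
            output = model(inputs)                    # core process: predict the next audio node
            loss = criterion(output, target)          # distance between prediction and target
            optimizer.zero_grad()
            loss.backward()                           # back-propagation
            optimizer.step()                          # gradient-descent step
        # Save the model, then continue the loop.
        torch.save(model.state_dict(), f"model_epoch_{epoch}.pt")
```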
PART B:
First of all, we should mention three important methods this code employs: causal convolutions, dilated convolutions, and residual and skip connections.
The advantage of stacked dilated convolutions is that the network can gain a very large receptive field with just a few layers. The system is designed so that each data point is processed by 3 blocks of 10 layers with exponentially increasing dilation; that is, there are 3 * 10 iterations in the function. First, we input the samples as X.
In each iteration (Figure 5), we first apply the causal convolution, which is very important in the system. Using the causal convolution guarantees that the model does not violate the ordering in which we model the data: for each point x_t, the emitted prediction p(x_{t+1} | x_1, ..., x_t) cannot depend on any future time steps.
Secondly, we perform two operations on the result generated above, which we call step A and step B. In step A, a dilation function is applied to generate the corresponding result for later use. In step B, we apply the dilated convolution to the original result.
After that, we use the result of step B. In this part we work with two components, called filter and gate. We apply the filter convolution followed by a tanh function to the result of step B and assign the value to filter; similarly, we apply the gate convolution followed by a sigmoid function to the result of step B. We then multiply the two results.
The third step is a convolution operation using a 1*1 convolution kernel.
In the last step of each iteration, we perform two operations on the result generated above: one is the residual connection and the other is the parameterized skip connection. We use the result of the previous step together with the current result of step A to perform the residual operation. The residual block increases the efficiency of the training process and greatly helps to reduce the degradation problem in deep networks. The parameterized skip connection speeds up convergence and makes it possible to train deeper models. Each iteration also produces an output called skip. A sketch of one such block is given below.
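The following is a minimal sketch of one such iteration (our illustration, reusing the CausalConv1d sketch from the summary); taking the residual with respect to the block input, as well as the layer names, are our assumptions rather than the patent's exact wiring.

```python
import torch
import torch.nn as nn

class ResidualBlock(nn.Module):
    """One core-process iteration: gated dilated convolution with residual and skip outputs."""
    def __init__(self, channels, skip_channels, kernel_size=2, dilation=1):
        super().__init__()
        # CausalConv1d is the causal convolution sketched in the summary section.
        self.filter_conv = CausalConv1d(channels, channels, kernel_size, dilation)
        self.gate_conv = CausalConv1d(channels, channels, kernel_size, dilation)
        self.res_conv = nn.Conv1d(channels, channels, kernel_size=1)        # 1*1 convolution
        self.skip_conv = nn.Conv1d(channels, skip_channels, kernel_size=1)  # parameterized skip

    def forward(self, x):                          # x: (batch, channels, time)
        filt = torch.tanh(self.filter_conv(x))     # filter branch
        gate = torch.sigmoid(self.gate_conv(x))    # gate branch
        gated = filt * gate                        # multiply the two results
        skip = self.skip_conv(gated)               # accumulated across blocks for the final output
        residual = self.res_conv(gated) + x        # residual connection back to the block input
        return residual, skip
```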
In order to generate the final output, a few more steps are required. Two relu-end_conv steps and one softmax are applied to the skip outputs generated above so that we obtain the final output. We use the softmax distribution to model the conditional distribution p(x_t | x_1, ..., x_{t-1}) over the audio samples. The original samples are stored as a sequence of 16-bit integers, which makes the distribution harder to model tractably. We therefore apply a μ-law companding transformation to the audio and then quantize it to 256 possible values:
f(x_t) = sign(x_t) * ln(1 + μ|x_t|) / ln(1 + μ), where −1 < x_t < 1 and μ = 255.
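The sketch below (ours, not the patent's code) shows one conventional way to implement this companding and 256-level quantization with NumPy, assuming the standard choice μ = 255 and audio already scaled to the range [−1, 1].

```python
import numpy as np

def mu_law_encode(audio, quantization_channels=256):
    """Map float audio in [-1, 1] to integer codes in [0, 256)."""
    mu = quantization_channels - 1
    # f(x) = sign(x) * ln(1 + mu*|x|) / ln(1 + mu), still in [-1, 1]
    companded = np.sign(audio) * np.log1p(mu * np.abs(audio)) / np.log1p(mu)
    # Shift to [0, 1] and quantize to 256 discrete values.
    return ((companded + 1) / 2 * mu + 0.5).astype(np.int64)

def mu_law_decode(codes, quantization_channels=256):
    """Inverse transform, used when converting generated codes back into a waveform."""
    mu = quantization_channels - 1
    companded = 2 * (codes.astype(np.float64) / mu) - 1
    return np.sign(companded) * ((1 + mu) ** np.abs(companded) - 1) / mu

# A sample scaled to [-1, 1] round-trips to a nearby value after encoding and decoding.
x = np.array([0.5, -0.25, 0.0])
print(mu_law_decode(mu_law_encode(x)))
```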
Since we employ queues to store the given samples and the generated samples, we carried out a controlled experiment: we first add the given samples to the queue and then discard them, to see whether there is any difference.
Figure 6 and Figure 7 show the results of our invention. The horizontal axis shows the time length of the selected piece of the waveform, while the vertical axis shows the value of each generated point in the waveform.

Claims (2)

CLAIM
1. An efficient and autoregressive method of generating a natural audio sequence, which utilizes the dilated causal convolution network, an advanced deep neural network, wherein the conditional probability of the generated phonemes {x1, x2, ..., xT} is as follows:
p(x) = Π_{t=1}^{T} p(x_t | x_1, ..., x_{t-1}),
and the conditional probability is modeled by convolution layers, the network of which does not need pooling layers.
AU2018100318A 2018-03-14 2018-03-14 A method of generating raw music audio based on dilated causal convolution network Ceased AU2018100318A4 (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
AU2018100318A AU2018100318A4 (en) 2018-03-14 2018-03-14 A method of generating raw music audio based on dilated causal convolution network

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
AU2018100318A AU2018100318A4 (en) 2018-03-14 2018-03-14 A method of generating raw music audio based on dilated causal convolution network

Publications (1)

Publication Number Publication Date
AU2018100318A4 (en) 2018-04-26

Family

ID=61973046

Family Applications (1)

Application Number Title Priority Date Filing Date
AU2018100318A Ceased AU2018100318A4 (en) 2018-03-14 2018-03-14 A method of generating raw music audio based on dilated causal convolution network

Country Status (1)

Country Link
AU (1) AU2018100318A4 (en)

Cited By (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US11929085B2 (en) 2018-08-30 2024-03-12 Dolby International Ab Method and apparatus for controlling enhancement of low-bitrate coded audio
CN115066691A (en) * 2020-02-07 2022-09-16 渊慧科技有限公司 Cyclic unit for generating or processing a sequence of images
CN112215406A (en) * 2020-09-23 2021-01-12 国网甘肃省电力公司营销服务中心 Non-invasive type residential electricity load decomposition method based on time convolution neural network
CN112215406B (en) * 2020-09-23 2024-04-16 国网甘肃省电力公司电力科学研究院 Non-invasive resident electricity load decomposition method based on time convolution neural network
CN114559133A (en) * 2022-04-27 2022-05-31 苏芯物联技术(南京)有限公司 Universal welding arc starting continuity real-time detection method and system
CN114559133B (en) * 2022-04-27 2022-07-29 苏芯物联技术(南京)有限公司 Real-time detection method and system for arc striking continuity of universal welding
CN117122288A (en) * 2023-09-08 2023-11-28 太原理工大学 Epileptic electroencephalogram signal early warning method and device based on anchoring convolution network

Similar Documents

Publication Publication Date Title
AU2018100318A4 (en) A method of generating raw music audio based on dilated causal convolution network
Liu et al. Recent progress in the CUHK dysarthric speech recognition system
CN111771213B (en) Speech style migration
Mangal et al. LSTM based music generation system
US20200410976A1 (en) Speech style transfer
JP7617261B2 (en) Audio generator, audio signal generation method, and audio generator training method
JP7103390B2 (en) Acoustic signal generation method, acoustic signal generator and program
Masuda et al. Synthesizer Sound Matching with Differentiable DSP.
EP4292078A1 (en) Methods and systems for modifying speech generated by a text-to-speech synthesiser
CN111326170B (en) Otophone-to-Normal Voice Conversion Method and Device by Joint Time-Frequency Domain Expansion Convolution
CN112184859B (en) End-to-end virtual object animation generation method and device, storage medium and terminal
US11776528B2 (en) Method for changing speed and pitch of speech and speech synthesis system
KR102358692B1 (en) Method and tts system for changing the speed and the pitch of the speech
CN114141237A (en) Speech recognition method, apparatus, computer equipment and storage medium
CN112002302A (en) Speech synthesis method and device
JP7124373B2 (en) LEARNING DEVICE, SOUND GENERATOR, METHOD AND PROGRAM
CN114267366A (en) Speech noise reduction through discrete representation learning
CN112837670A (en) Voice synthesis method and device and electronic equipment
CN116665704A (en) A multi-task learning method for automatic notation of piano polyphonic music based on local attention
JP2022549352A (en) training a neural network to generate structured embeddings
Choi et al. Adversarial speaker-consistency learning using untranscribed speech data for zero-shot multi-speaker text-to-speech
CN118737122A (en) Method, apparatus, device and readable medium for speech synthesis
Caillon Hierarchical temporal learning for multi-instrument and orchestral audio synthesis
Plantinga et al. Phonetic feedback for speech enhancement with and without parallel speech data
CN117649839B (en) A personalized speech synthesis method based on low-rank adaptation

Legal Events

Date Code Title Description
FGI Letters patent sealed or granted (innovation patent)
MK22 Patent ceased section 143a(d), or expired - non payment of renewal fee or expiry