AU2018100318A4 - A method of generating raw music audio based on dilated causal convolution network - Google Patents
A method of generating raw music audio based on dilated causal convolution network
- Publication number
- AU2018100318A4 AU2018100318A4 AU2018100318A AU2018100318A AU2018100318A4 AU 2018100318 A4 AU2018100318 A4 AU 2018100318A4 AU 2018100318 A AU2018100318 A AU 2018100318A AU 2018100318 A AU2018100318 A AU 2018100318A AU 2018100318 A4 AU2018100318 A4 AU 2018100318A4
- Authority
- AU
- Australia
- Prior art keywords
- training
- audio
- convolution
- dilated
- model
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Ceased
Links
- 238000000034 method Methods 0.000 title claims abstract description 20
- 230000001364 causal effect Effects 0.000 title claims abstract description 12
- 238000013528 artificial neural network Methods 0.000 claims abstract 2
- 238000011176 pooling Methods 0.000 claims description 3
- 238000012549 training Methods 0.000 abstract description 10
- 239000012634 fragment Substances 0.000 abstract description 4
- 230000001427 coherent effect Effects 0.000 abstract 1
- 230000008569 process Effects 0.000 description 14
- 230000010339 dilation Effects 0.000 description 6
- 230000006870 function Effects 0.000 description 6
- 238000013473 artificial intelligence Methods 0.000 description 3
- 238000005516 engineering process Methods 0.000 description 3
- 238000013459 approach Methods 0.000 description 2
- 230000015572 biosynthetic process Effects 0.000 description 2
- 238000004364 calculation method Methods 0.000 description 2
- 230000033764 rhythmic process Effects 0.000 description 2
- 238000003786 synthesis reaction Methods 0.000 description 2
- 230000002194 synthesizing effect Effects 0.000 description 2
- 230000002123 temporal effect Effects 0.000 description 2
- 208000037170 Delayed Emergence from Anesthesia Diseases 0.000 description 1
- 230000005540 biological transmission Effects 0.000 description 1
- 238000004891 communication Methods 0.000 description 1
- 230000002860 competitive effect Effects 0.000 description 1
- 238000013527 convolutional neural network Methods 0.000 description 1
- 238000013135 deep learning Methods 0.000 description 1
- 238000011161 development Methods 0.000 description 1
- 238000002474 experimental method Methods 0.000 description 1
- 239000007789 gas Substances 0.000 description 1
- 230000008676 import Effects 0.000 description 1
- 239000007788 liquid Substances 0.000 description 1
- 230000007246 mechanism Effects 0.000 description 1
- 238000012545 processing Methods 0.000 description 1
- 230000000306 recurrent effect Effects 0.000 description 1
- 239000007787 solid Substances 0.000 description 1
- 238000012360 testing method Methods 0.000 description 1
- 238000012546 transfer Methods 0.000 description 1
- 230000009466 transformation Effects 0.000 description 1
Landscapes
- Complex Calculations (AREA)
Abstract
This patent introduces an efficient, autoregressive method of generating a natural audio sequence that utilizes a dilated causal convolution network, an advanced deep neural network. With this method, both the number of training layers and the training time are reduced substantially. The model extracts the elements of the musical fragments it is trained on and generates audio with a similar tone and coherent phonemes, conditioned entirely on the previous audio. The generated results are smooth and natural, making it difficult for human listeners to distinguish the original from the generated fragments. We believe the model can be applied to many related fields, including artistic and commercial areas, and can assist music learning, speech recognition and similar applications.
Description
DESCRIPTION
TITLE A method of generating raw music audio based on dilated causal convolution network
FIELD OF THE INVENTION
This invention mainly applies to the field of audio processing, such as audio recognition and audio generation. Taking audio recognition as an example: once the model has been trained on a host's voice, it remembers that voice accurately, so that the next time the host speaks to the system it can quickly recognize the host and execute a series of operations.
Speech synthesis and speech recognition are two key technologies for realizing man-machine speech communication and building spoken-language systems. Giving computers a speaking ability comparable to that of humans is currently an important and competitive market for the information industry. Compared with speech recognition, speech synthesis technology is relatively mature and has begun to move towards industrialization; large-scale application is just around the corner.
BACKGROUND OF THE INVENTION
With the development of technology, the applications of artificial intelligence have improved considerably. In addition, people's demand for entertainment is enormous, which creates a considerable market in this field. Artificial intelligence is now widely applied to images, for example in style transfer; in the field of audio, however, it is not yet as advanced. Our patent applies artificial intelligence and deep learning to help fill this gap.
The goal of the invention is to render natural-sounding speech signals given a text to be synthesized. In physics, sound is a vibration that typically propagates as an audible wave of pressure through a transmission medium such as a gas, liquid or solid. Sound waves are generated by a sound source, such as the vibrating diaphragm of a stereo speaker; the source creates vibrations in the surrounding medium, and as it continues to vibrate, those vibrations propagate away from it at the speed of sound, forming the sound wave. This invention imitates this process to produce, via computers, a sound wave of a similar type to a given one. When trained to model music, the system can generate novel and often highly realistic musical fragments.
In the invention, the music generation process operates on the raw sound wave and can model distributions over thousands of random variables, so the approach succeeds in generating wideband raw audio waveforms. These waves are signals with very high temporal resolution, at least 16,000 samples per second (Figure.1, Figure.2). In addition, because of the long-range temporal dependencies, we need to shorten the time required for raw audio generation, so we develop a new architecture based on dilated causal convolutions, which has exhibited high efficiency.
However, the speed of the synthesis process is not ideal. Because of the way the calculation is structured, each step can synthesize only one sample. Although stacked dilated convolutions improve the efficiency considerably, it still takes about 0.015 seconds to synthesize one sample, and we need 16,000 samples per second, so synthesizing one second of music takes about 4 minutes.
This invention has two main applications. First, it can produce the sound of a certain kind of instrument, with no restriction on the rhythm. Second, we can import a short excerpt of a piece of music, about 200 milliseconds long, and the system synthesizes a similar rhythm conditioned on the given excerpt; the synthesized sound wave has characteristics similar to the given one. In both cases the synthesized music is about 20 seconds long. After testing both applications, we found that the quality of the synthesized sound waves is almost the same in each.
SUMMARY OF THE INVENTION
An efficient music generator model is introduced in this patent. The model generates music fragments that depend on the previous phonemes, while each audio sample is independent of the samples at future time steps. The joint probability of the generated phonemes {x1, x2, ..., xT} is factorized as a product of conditional probabilities:
p(x1, x2, ..., xT) = ∏_{t=1}^{T} p(xt | x1, ..., xt-1)
The conditional probabilities are modeled by convolution layers, and the network does not need pooling layers, so the output has the same time dimensionality as the input. Classical networks come with a computational cost because each state of the whole network must be computed sequentially. To solve this problem, we use dilated causal convolutional layers (Figure.3) to capture a bounded receptive field and compute features for all phoneme positions at once. The model is a fully convolutional neural network whose convolutional layers have varying dilation factors, which allow its receptive field to grow exponentially with depth and cover thousands of time steps. For images, masks are applied in the convolutions to avoid seeing future context; masks have previously also been used in non-convolutional models. For audio, it is much easier to implement this by shifting the output of a normal convolution by a few time steps. Note that the audio generation process is still sequential for this network, as each sampled phoneme must be fed back into the network as input.
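As an illustration only (the patent does not specify an implementation), the shifted/padded convolution described above could be sketched in PyTorch as follows; the class name CausalConv1d and all parameter values are illustrative assumptions:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class CausalConv1d(nn.Module):
    """1-D convolution whose output at time t depends only on inputs at times <= t."""
    def __init__(self, in_channels, out_channels, kernel_size, dilation=1):
        super().__init__()
        # Left padding so the convolution never looks at future time steps.
        self.left_pad = (kernel_size - 1) * dilation
        self.conv = nn.Conv1d(in_channels, out_channels,
                              kernel_size, dilation=dilation)

    def forward(self, x):                      # x: (batch, channels, time)
        x = F.pad(x, (self.left_pad, 0))       # pad only the left of the time axis
        return self.conv(x)

# The output keeps the same time length as the input.
x = torch.randn(1, 1, 16000)                   # one second of audio at 16 kHz
y = CausalConv1d(1, 32, kernel_size=2, dilation=4)(x)
assert y.shape[-1] == x.shape[-1]
```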
DESCRIPTION OF DRAWING
The following drawings are only for the purpose of description and explanation and not for limitation, wherein:
Figure.1 and Figure.2 show the models of the waveform.
Figure.3 presents the process of dilated convolution.
Figure.4 depicts the process of the whole code.
Figure.5 illustrates the core process in the code.
Figure.6 and Figure.7 demonstrate the results of the invention in the form of wave models.
DESCRIPTION OF PREFERRED EMBODIMENTS
The causal filter is the basis of this model. Each audio sample generated at time step t depends only on the previous time steps, so by using the causal filter the model cannot violate the ordering of the audio data. At training time, the conditional predictions for all phonemes can be calculated in parallel because all time steps of the ground truth are available in the data stack. At generation time, after the latest phoneme is generated it is immediately fed back into the data stack. Also, because the model has no recurrent connections, it trains faster than RNNs.
However, a dilemma remains: in order to take into account as many previous phonemes as possible, the network would have to be remarkably deep and would need a large number of layers.
To solve this problem, we utilize dilated convolution, which achieves the same goal with fewer layers, and combine it with causal convolution. A dilated convolution is used to enlarge the receptive field: with the same number of layers, a dilated convolution has a larger receptive field than a normal convolution. Its working mechanism is shown in Figure.3. The output of the dilated causal convolution has the same size as the input, and there is no need for pooling or strided layers. In particular, a convolution with dilation 1 is the same as a standard convolution. Figure.3 shows layers with dilations 1, 2, 4 and 8; in our model the dilation goes up to 512, which means that the generated node has a receptive field covering 1024 previous phonemes. In our model the dilations grow exponentially by a factor of 2 for each layer and the pattern is repeated as follows: 1, 2, 4, 8, ..., 512, 1, 2, 4, 8, ..., 512, 1, 2, 4, 8, ..., 512.
PART A:
We divide the flow chart into two sections, the parent process and the core process. The aim of the invention is to generate a natural audio sequence, which is realized by combining the two parts smoothly.
For the parent process (Figure.4), we prepare several datasets, such as pieces of violin music, and divide them into a large number of batches. Users can also train the network on their own datasets. Now the training can begin.
First, we define a certain number of epochs, which determines the total amount of training. In each epoch the network produces a model, so after many iterations the model becomes much more accurate and converges well. When training starts, the epoch counter is initialized to zero, and the loop stops when it reaches the set number of epochs.
For each iteration of the loop, we first retrieve the current n-dimensional data from the data loader together with its target. We then pass these parameters to the core process in order to generate an audio node; this process is elaborated in the description of the core section. After a series of operations in the core process, we obtain an audio node that depends on the previous phonemes. However, it is not yet natural enough: what we want to generate is a sequence of smooth sound. Therefore, we compare the output with the previously retrieved target and calculate the loss function. After that, to approach the goal, we perform back-propagation and apply gradient descent. When this is done, we save the model and continue the loop, as in the sketch below.
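A minimal sketch of this parent training loop, assuming a PyTorch model and data loader; the cross-entropy loss (matching the 256-way softmax output described in Part B) and the Adam optimiser are assumptions, since the patent does not name a specific loss or optimiser:

```python
import torch
import torch.nn as nn

def train(model, loader, epochs, lr=1e-3, path="model.pt"):
    criterion = nn.CrossEntropyLoss()           # compare output with the target
    optimizer = torch.optim.Adam(model.parameters(), lr=lr)

    for epoch in range(epochs):                 # stop when the preset epoch count is reached
        for data, target in loader:             # current batch and its target phoneme
            output = model(data)                # core process: generate an audio node
            loss = criterion(output, target)    # loss between generated node and target
            optimizer.zero_grad()
            loss.backward()                     # back-propagation
            optimizer.step()                    # gradient descent step
        torch.save(model.state_dict(), path)    # save the model, then loop again
```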
When the current epoch reaches the value that was set beforehand, the loop stops and the accurate node is output.
PART B:
First of all, we should mention 3 important methods this code employs: causal convolutions, dilated convolutions, and residual and skip connections.
The advantage of stacked dilated convolutions is that the network can gain a very large receptive field with just a few layers. The system is designed so that each data point is processed by 3 blocks of 10 layers with exponentially increasing dilations; that is, there are 3 * 10 iterations in the function. First, we input the samples as X. The dilation schedule and the resulting receptive field are sketched below.
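A small sketch, under the stated configuration of 3 blocks of 10 layers and assuming a kernel size of 2, of how the dilation schedule and the resulting receptive field can be computed; the variable names are illustrative:

```python
blocks, layers_per_block, kernel_size = 3, 10, 2

# Dilations 1, 2, 4, ..., 512, repeated once per block.
dilations = [2 ** i for _ in range(blocks) for i in range(layers_per_block)]

# Each layer with dilation d and kernel size 2 adds d samples of past context.
per_block = 1 + sum((kernel_size - 1) * 2 ** i for i in range(layers_per_block))
receptive_field = 1 + sum((kernel_size - 1) * d for d in dilations)

print(dilations[:12])    # [1, 2, 4, 8, ..., 512, 1, 2]
print(per_block)         # 1024 samples covered by a single block, as noted above
print(receptive_field)   # 3070 samples for the full three-block stack
```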
In each iteration (Figure.5), we first apply a causal convolution, which is very important in this system. The causal convolution guarantees that the model does not violate the ordering in which we model the data: for each point xt, the emitted prediction p(xt+1 | x1, ..., xt) cannot depend on any future time steps.
Secondly, we perform two operations, named step A and step B, on the result generated above. In step A, a dilation function is applied to produce a corresponding result for later use. In step B, we apply the dilated convolution to the original result.
After that, we use the result of step B. In this part we deal with two quantities called filter and gate: we apply the filter convolution and a tanh function to the result of step B and assign the value to filter, and similarly we apply the gate convolution and a sigmoid function to the result of step B and assign the value to gate. We then multiply the two results elementwise.
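In equation form, this filter/gate step is the standard gated activation unit used in the WaveNet family of models, where * denotes the dilated convolution from step B, ⊙ denotes elementwise multiplication, σ is the sigmoid function, and Wf and Wg are the filter and gate convolution weights:

z = tanh(Wf * x) ⊙ σ(Wg * x)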
The third step is a convolution operation using a 1*1 convolution kernel.
In the last step of each iteration, we perform two operations on the result generated above: one is called the residual connection and the other is the parameterized skip connection. For the residual operation we combine the result of the previous step with the current result of step A. The residual block increases the efficiency of the training process and greatly helps to reduce degradation problems. The parameterized skip connection speeds up convergence and enables deeper models to be trained. Each iteration thus produces an output called skip. One such layer is sketched below.
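A hedged sketch of one layer of the core process, assuming a PyTorch implementation. The patent's step A (a separate dilation function whose result is reused by the residual connection) is approximated here by adding the layer input directly, and all class, channel and parameter names are illustrative:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ResidualLayer(nn.Module):
    """One dilated causal layer with gated activation, residual and skip outputs."""
    def __init__(self, channels, skip_channels, kernel_size=2, dilation=1):
        super().__init__()
        self.pad = (kernel_size - 1) * dilation
        self.filter_conv = nn.Conv1d(channels, channels, kernel_size, dilation=dilation)
        self.gate_conv = nn.Conv1d(channels, channels, kernel_size, dilation=dilation)
        self.res_conv = nn.Conv1d(channels, channels, 1)        # 1*1 convolution
        self.skip_conv = nn.Conv1d(channels, skip_channels, 1)  # parameterized skip

    def forward(self, x):                                # x: (batch, channels, time)
        padded = F.pad(x, (self.pad, 0))                 # causal (left) padding
        filt = torch.tanh(self.filter_conv(padded))      # filter branch
        gate = torch.sigmoid(self.gate_conv(padded))     # gate branch
        z = filt * gate                                  # gated activation
        skip = self.skip_conv(z)                         # skip output for the final stage
        residual = self.res_conv(z) + x                  # residual output fed to next layer
        return residual, skip
```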
To produce the final output, a few more steps are required: two ReLU-plus-convolution (relu-end_conv) steps and one softmax are applied to the skip outputs we generated, giving the final output. We use the softmax distribution to model the conditional distribution p(xt | x1, ..., xt-1) over the audio samples. The original samples are stored as a sequence of 16-bit integers, which makes the distribution harder to model tractably, so we apply a μ-law companding transformation to the audio and then quantize it to 256 possible values:
f(xt) = sign(xt) * ln(1 + μ|xt|) / ln(1 + μ), where μ = 255 and -1 < xt < 1.
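A minimal sketch of the μ-law companding and 256-level quantisation just described, assuming waveform samples already scaled to the range (-1, 1); NumPy is used here and the function names are illustrative:

```python
import numpy as np

def mu_law_encode(audio, mu=255):
    """Compress samples in [-1, 1] with mu-law, then quantise to 256 integer levels."""
    audio = np.clip(audio, -1.0, 1.0)
    companded = np.sign(audio) * np.log1p(mu * np.abs(audio)) / np.log1p(mu)
    return ((companded + 1) / 2 * mu + 0.5).astype(np.int32)    # values in 0..255

def mu_law_decode(quantised, mu=255):
    """Invert the quantisation and the companding."""
    companded = 2 * (quantised.astype(np.float32) / mu) - 1
    return np.sign(companded) * ((1 + mu) ** np.abs(companded) - 1) / mu
```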
Since we employ queues to store the given samples and the generated samples, we carried out a controlled experiment: we first add the given samples to the queue and then discard them, to see whether this makes any difference.
Figure.6 and Figure.7 show the results of our invention. The values along the horizontal direction show the time length of the selected piece of the wave, while the vertical direction shows the value of each generated point in the waveform.
Claims (2)
- CLAIM
- 1. An efficient and autoregressive method of generating a natural audio sequence, which utilizes a dilated causal convolution network, an advanced deep neural network, wherein the joint probability of the generated phonemes {x1, x2, ..., xT} is factorized as p(x1, x2, ..., xT) = ∏_{t=1}^{T} p(xt | x1, ..., xt-1), and the conditional probabilities are modeled by convolution layers in a network that does not need pooling layers.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
AU2018100318A AU2018100318A4 (en) | 2018-03-14 | 2018-03-14 | A method of generating raw music audio based on dilated causal convolution network |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
AU2018100318A AU2018100318A4 (en) | 2018-03-14 | 2018-03-14 | A method of generating raw music audio based on dilated causal convolution network |
Publications (1)
Publication Number | Publication Date |
---|---|
AU2018100318A4 true AU2018100318A4 (en) | 2018-04-26 |
Family
ID=61973046
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
AU2018100318A Ceased AU2018100318A4 (en) | 2018-03-14 | 2018-03-14 | A method of generating raw music audio based on dilated causal convolution network |
Country Status (1)
Country | Link |
---|---|
AU (1) | AU2018100318A4 (en) |
-
2018
- 2018-03-14 AU AU2018100318A patent/AU2018100318A4/en not_active Ceased
Cited By (7)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US11929085B2 (en) | 2018-08-30 | 2024-03-12 | Dolby International Ab | Method and apparatus for controlling enhancement of low-bitrate coded audio |
CN115066691A (en) * | 2020-02-07 | 2022-09-16 | 渊慧科技有限公司 | Cyclic unit for generating or processing a sequence of images |
CN112215406A (en) * | 2020-09-23 | 2021-01-12 | 国网甘肃省电力公司营销服务中心 | Non-invasive type residential electricity load decomposition method based on time convolution neural network |
CN112215406B (en) * | 2020-09-23 | 2024-04-16 | 国网甘肃省电力公司电力科学研究院 | Non-invasive resident electricity load decomposition method based on time convolution neural network |
CN114559133A (en) * | 2022-04-27 | 2022-05-31 | 苏芯物联技术(南京)有限公司 | Universal welding arc starting continuity real-time detection method and system |
CN114559133B (en) * | 2022-04-27 | 2022-07-29 | 苏芯物联技术(南京)有限公司 | Real-time detection method and system for arc striking continuity of universal welding |
CN117122288A (en) * | 2023-09-08 | 2023-11-28 | 太原理工大学 | Epileptic electroencephalogram signal early warning method and device based on anchoring convolution network |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
AU2018100318A4 (en) | A method of generating raw music audio based on dilated causal convolution network | |
Liu et al. | Recent progress in the CUHK dysarthric speech recognition system | |
CN111771213B (en) | Speech style migration | |
Mangal et al. | LSTM based music generation system | |
US20200410976A1 (en) | Speech style transfer | |
JP7617261B2 (en) | Audio generator, audio signal generation method, and audio generator training method | |
JP7103390B2 (en) | Acoustic signal generation method, acoustic signal generator and program | |
Masuda et al. | Synthesizer Sound Matching with Differentiable DSP. | |
EP4292078A1 (en) | Methods and systems for modifying speech generated by a text-to-speech synthesiser | |
CN111326170B (en) | Otophone-to-Normal Voice Conversion Method and Device by Joint Time-Frequency Domain Expansion Convolution | |
CN112184859B (en) | End-to-end virtual object animation generation method and device, storage medium and terminal | |
US11776528B2 (en) | Method for changing speed and pitch of speech and speech synthesis system | |
KR102358692B1 (en) | Method and tts system for changing the speed and the pitch of the speech | |
CN114141237A (en) | Speech recognition method, apparatus, computer equipment and storage medium | |
CN112002302A (en) | Speech synthesis method and device | |
JP7124373B2 (en) | LEARNING DEVICE, SOUND GENERATOR, METHOD AND PROGRAM | |
CN114267366A (en) | Speech noise reduction through discrete representation learning | |
CN112837670A (en) | Voice synthesis method and device and electronic equipment | |
CN116665704A (en) | A multi-task learning method for automatic notation of piano polyphonic music based on local attention | |
JP2022549352A (en) | training a neural network to generate structured embeddings | |
Choi et al. | Adversarial speaker-consistency learning using untranscribed speech data for zero-shot multi-speaker text-to-speech | |
CN118737122A (en) | Method, apparatus, device and readable medium for speech synthesis | |
Caillon | Hierarchical temporal learning for multi-instrument and orchestral audio synthesis | |
Plantinga et al. | Phonetic feedback for speech enhancement with and without parallel speech data | |
CN117649839B (en) | A personalized speech synthesis method based on low-rank adaptation |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
FGI | Letters patent sealed or granted (innovation patent) | ||
MK22 | Patent ceased section 143a(d), or expired - non payment of renewal fee or expiry |