AU2018100318A4 - A method of generating raw music audio based on dilated causal convolution network - Google Patents
A method of generating raw music audio based on dilated causal convolution network
- Publication number
- AU2018100318A4 AU2018100318A4 AU2018100318A AU2018100318A AU2018100318A4 AU 2018100318 A4 AU2018100318 A4 AU 2018100318A4 AU 2018100318 A AU2018100318 A AU 2018100318A AU 2018100318 A AU2018100318 A AU 2018100318A AU 2018100318 A4 AU2018100318 A4 AU 2018100318A4
- Authority
- AU
- Australia
- Prior art keywords
- training
- audio
- convolution
- dilated
- model
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Ceased
Links
- 238000000034 method Methods 0.000 title claims abstract description 20
- 230000001364 causal effect Effects 0.000 title claims abstract description 12
- 238000013528 artificial neural network Methods 0.000 claims abstract 2
- 238000011176 pooling Methods 0.000 claims description 3
- 238000012549 training Methods 0.000 abstract description 10
- 239000012634 fragment Substances 0.000 abstract description 4
- 230000001427 coherent effect Effects 0.000 abstract 1
- 230000008569 process Effects 0.000 description 14
- 230000010339 dilation Effects 0.000 description 6
- 230000006870 function Effects 0.000 description 6
- 238000013473 artificial intelligence Methods 0.000 description 3
- 238000005516 engineering process Methods 0.000 description 3
- 238000013459 approach Methods 0.000 description 2
- 230000015572 biosynthetic process Effects 0.000 description 2
- 238000004364 calculation method Methods 0.000 description 2
- 230000033764 rhythmic process Effects 0.000 description 2
- 238000003786 synthesis reaction Methods 0.000 description 2
- 230000002194 synthesizing effect Effects 0.000 description 2
- 230000002123 temporal effect Effects 0.000 description 2
- 208000037170 Delayed Emergence from Anesthesia Diseases 0.000 description 1
- 230000005540 biological transmission Effects 0.000 description 1
- 238000004891 communication Methods 0.000 description 1
- 230000002860 competitive effect Effects 0.000 description 1
- 238000013527 convolutional neural network Methods 0.000 description 1
- 238000013135 deep learning Methods 0.000 description 1
- 238000011161 development Methods 0.000 description 1
- 238000002474 experimental method Methods 0.000 description 1
- 239000007789 gas Substances 0.000 description 1
- 230000008676 import Effects 0.000 description 1
- 239000007788 liquid Substances 0.000 description 1
- 230000007246 mechanism Effects 0.000 description 1
- 238000012545 processing Methods 0.000 description 1
- 230000000306 recurrent effect Effects 0.000 description 1
- 239000007787 solid Substances 0.000 description 1
- 238000012360 testing method Methods 0.000 description 1
- 238000012546 transfer Methods 0.000 description 1
- 230000009466 transformation Effects 0.000 description 1
Landscapes
- Complex Calculations (AREA)
Abstract
This patent introduces an efficient, autoregressive method of generating a natural audio sequence that utilizes a dilated causal convolution network, an advanced deep neural network. With this method, both the number of training layers and the training time are reduced substantially. The model extracts the elements of the musical fragments it is trained on and generates audio with a similar tone and coherent phonemes, conditioned entirely on the previous audio. The generated results are smooth and natural, making it difficult for human listeners to distinguish the original from the generated fragments. We believe the model can be applied to many related fields, including artistic and commercial areas, and can assist music learning, speech recognition and similar applications.
Description
DESCRIPTION
TITLE A method of generating raw music audio based on dilated causal convolution network
FIELD OF THE INVENTION
This invention mainly applies to the field of audio processing, such as audio recognition and audio generation. Taking audio recognition as an example: once the model has been trained on a host's voice, it remembers that voice accurately, so that the next time the host speaks to the system it can quickly recognize the host and execute a series of operations.
Speech synthesis and speech recognition are two key technologies for realizing man-machine speech communication and building spoken-language systems. Giving computers a speaking ability comparable to that of humans is currently an important and competitive market for the information industry. Compared with speech recognition, speech synthesis technology is relatively mature and has begun to move towards industrialization; large-scale application is just around the corner.
BACKGROUND OF THE INVENTION
With the development of technology, the applications of artificial intelligence have improved considerably. In addition, people's demand for entertainment is enormous, which creates a considerable market in this field. Artificial intelligence is now widely applied to images, for example in style transfer; in the field of audio, however, it is not yet as advanced. Our patent applies artificial intelligence and deep learning to help fill this gap.
The goal of the invention is to render natural-sounding speech signals given a text to be synthesized. In physics, sound is a vibration that typically propagates as an audible wave of pressure through a transmission medium such as a gas, liquid or solid. Sound waves are generated by a sound source, such as the vibrating diaphragm of a stereo speaker; the source creates vibrations in the surrounding medium, and as it continues to vibrate, those vibrations propagate away from it at the speed of sound, forming the sound wave. This invention imitates this process to produce, via computers, a sound wave of a similar type to a given one. When trained to model music, the system can generate novel and often highly realistic musical fragments.
In the invention, the music generation process operates on the raw sound wave and can model distributions over thousands of random variables, so the approach succeeds in generating wideband raw audio waveforms. These waves are signals with very high temporal resolution, at least 16,000 samples per second (Figure.1, Figure.2). In addition, because of the long-range temporal dependencies, we need to shorten the time required for raw audio generation, so we develop a new architecture based on dilated causal convolutions, which has exhibited high efficiency.
However, the speed of the synthesis process is not ideal. Because of the way the calculation is structured, each step can synthesize only one sample. Although stacked dilated convolutions improve the efficiency considerably, it still takes about 0.015 seconds to synthesize one sample, and we need 16,000 samples per second, so synthesizing one second of music takes about 4 minutes.
This invention has two main applications. First, it can produce the sound of a certain kind of instrument, with no restriction on the rhythm. Second, we can import a short excerpt of a piece of music, about 200 milliseconds long, and the system synthesizes a similar rhythm conditioned on the given excerpt; the synthesized sound wave has characteristics similar to the given one. In both cases the synthesized music is about 20 seconds long. After testing both applications, we found that the quality of the synthesized sound waves is almost the same in each.
SUMMARY OF THE INVENTION
An efficient music generator model is introduced in this patent. The model generates music fragments that depend on the previous phonemes, while each audio sample is independent of the samples at future time steps. The joint probability of the generated phonemes {x1, x2, ..., xT} is factorized as a product of conditional probabilities:
p(x1, x2, ..., xT) = ∏_{t=1}^{T} p(xt | x1, ..., xt-1)
The conditional probabilities are modeled by convolution layers, and the network does not need pooling layers, so the output has the same time dimensionality as the input. Classical networks come with a computational cost because each state of the whole network must be computed sequentially. To solve this problem, we use dilated causal convolutional layers (Figure.3) to capture a bounded receptive field and compute features for all phoneme positions at once. The model is a fully convolutional neural network whose convolutional layers have varying dilation factors, which allow its receptive field to grow exponentially with depth and cover thousands of time steps. For images, masks are applied in the convolutions to avoid seeing future context; masks have previously also been used in non-convolutional models. For audio, it is much easier to implement this by shifting the output of a normal convolution by a few time steps. Note that the audio generation process is still sequential for this network, as each sampled phoneme must be fed back into the network as input.
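As an illustration only (the patent does not specify an implementation), the shifted/padded convolution described above could be sketched in PyTorch as follows; the class name CausalConv1d and all parameter values are illustrative assumptions:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class CausalConv1d(nn.Module):
    """1-D convolution whose output at time t depends only on inputs at times <= t."""
    def __init__(self, in_channels, out_channels, kernel_size, dilation=1):
        super().__init__()
        # Left padding so the convolution never looks at future time steps.
        self.left_pad = (kernel_size - 1) * dilation
        self.conv = nn.Conv1d(in_channels, out_channels,
                              kernel_size, dilation=dilation)

    def forward(self, x):                      # x: (batch, channels, time)
        x = F.pad(x, (self.left_pad, 0))       # pad only the left of the time axis
        return self.conv(x)

# The output keeps the same time length as the input.
x = torch.randn(1, 1, 16000)                   # one second of audio at 16 kHz
y = CausalConv1d(1, 32, kernel_size=2, dilation=4)(x)
assert y.shape[-1] == x.shape[-1]
```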
DESCRIPTION OF DRAWING
The following drawings are only for the purpose of description and explanation and not for limitation, wherein:
Figure.1 and Figure.2 show the models of the waveform.
Figure.3 presents the process of dilated convolution.
Figure.4 depicts the process of the whole code.
Figure.5 illustrates the core process in the code.
Figure.6 and Figure.7 demonstrate the results of the invention in the form of wave models.
DESCRIPTION OF PREFERRED EMBODIMENTS
The causal filter is the basis of this model. Each audio sample generated at time step t depends only on the previous time steps, so by using the causal filter the model cannot violate the ordering of the audio data. At training time, the conditional predictions for all phonemes can be calculated in parallel because all time steps of the ground truth are available in the data stack. At generation time, after the latest phoneme is generated it is immediately fed back into the data stack. Also, because the model has no recurrent connections, it trains faster than RNNs.
However, a dilemma remains: in order to take into account as many previous phonemes as possible, the network would have to be remarkably deep and would need a large number of layers.
To solve this problem, we utilize dilated convolution, which achieves the same goal with fewer layers, and combine it with causal convolution. A dilated convolution is used to enlarge the receptive field: with the same number of layers, a dilated convolution has a larger receptive field than a normal convolution. Its working mechanism is shown in Figure.3. The output of the dilated causal convolution has the same size as the input, and there is no need for pooling or strided layers. In particular, a convolution with dilation 1 is the same as a standard convolution. Figure.3 shows layers with dilations 1, 2, 4 and 8; in our model the dilation goes up to 512, which means that the generated node has a receptive field covering 1024 previous phonemes. In our model the dilations grow exponentially by a factor of 2 for each layer and the pattern is repeated as follows: 1, 2, 4, 8, ..., 512, 1, 2, 4, 8, ..., 512, 1, 2, 4, 8, ..., 512.
PART A:
We divide the flow chart into two sections, the parent process and the core process. The aim of the invention is to generate a natural audio sequence, which is realized by combining the two parts smoothly.
For the parent process (Figure.4), we prepare several datasets, such as pieces of violin music, and divide them into a large number of batches. Users can also train the network on their own datasets. Now the training can begin.
First, we define a certain number of epochs, which determines the total amount of training. In each epoch the network produces a model, so after many iterations the model becomes much more accurate and converges well. When training starts, the epoch counter is initialized to zero, and the loop stops when it reaches the set number of epochs.
For each iteration of the loop, we first retrieve the current n-dimensional data from the data loader together with its target. We then pass these parameters to the core process in order to generate an audio node; this process is elaborated in the description of the core section. After a series of operations in the core process, we obtain an audio node that depends on the previous phonemes. However, it is not yet natural enough: what we want to generate is a sequence of smooth sound. Therefore, we compare the output with the previously retrieved target and calculate the loss function. After that, to approach the goal, we perform back-propagation and apply gradient descent. When this is done, we save the model and continue the loop, as in the sketch below.
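A minimal sketch of this parent training loop, assuming a PyTorch model and data loader; the cross-entropy loss (matching the 256-way softmax output described in Part B) and the Adam optimiser are assumptions, since the patent does not name a specific loss or optimiser:

```python
import torch
import torch.nn as nn

def train(model, loader, epochs, lr=1e-3, path="model.pt"):
    criterion = nn.CrossEntropyLoss()           # compare output with the target
    optimizer = torch.optim.Adam(model.parameters(), lr=lr)

    for epoch in range(epochs):                 # stop when the preset epoch count is reached
        for data, target in loader:             # current batch and its target phoneme
            output = model(data)                # core process: generate an audio node
            loss = criterion(output, target)    # loss between generated node and target
            optimizer.zero_grad()
            loss.backward()                     # back-propagation
            optimizer.step()                    # gradient descent step
        torch.save(model.state_dict(), path)    # save the model, then loop again
```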
When the current epoch reaches the value that was set beforehand, the loop stops and the accurate node is output.
PART B:
First of all, we should mention 3 important methods this code employs: causal convolutions, dilated convolutions, and residual and skip connections.
The advantage of stacked dilated convolutions is that the network can gain a very large receptive field with just a few layers. The system is designed so that each data point is processed by 3 blocks of 10 layers with exponentially increasing dilations; that is, there are 3 * 10 iterations in the function. First, we input the samples as X. The dilation schedule and the resulting receptive field are sketched below.
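A small sketch, under the stated configuration of 3 blocks of 10 layers and assuming a kernel size of 2, of how the dilation schedule and the resulting receptive field can be computed; the variable names are illustrative:

```python
blocks, layers_per_block, kernel_size = 3, 10, 2

# Dilations 1, 2, 4, ..., 512, repeated once per block.
dilations = [2 ** i for _ in range(blocks) for i in range(layers_per_block)]

# Each layer with dilation d and kernel size 2 adds d samples of past context.
per_block = 1 + sum((kernel_size - 1) * 2 ** i for i in range(layers_per_block))
receptive_field = 1 + sum((kernel_size - 1) * d for d in dilations)

print(dilations[:12])    # [1, 2, 4, 8, ..., 512, 1, 2]
print(per_block)         # 1024 samples covered by a single block, as noted above
print(receptive_field)   # 3070 samples for the full three-block stack
```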
In each iteration (Figure.5), we first apply a causal convolution, which is very important in this system. The causal convolution guarantees that the model does not violate the ordering in which we model the data: for each point xt, the emitted prediction p(xt+1 | x1, ..., xt) cannot depend on any future time steps.
Secondly, we perform two operations, named step A and step B, on the result generated above. In step A, a dilation function is applied to produce a corresponding result for later use. In step B, we apply the dilated convolution to the original result.
After that, we use the result of step B. In this part we deal with two quantities called filter and gate: we apply the filter convolution and a tanh function to the result of step B and assign the value to filter, and similarly we apply the gate convolution and a sigmoid function to the result of step B and assign the value to gate. We then multiply the two results elementwise.
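In equation form, this filter/gate step is the standard gated activation unit used in the WaveNet family of models, where * denotes the dilated convolution from step B, ⊙ denotes elementwise multiplication, σ is the sigmoid function, and Wf and Wg are the filter and gate convolution weights:

z = tanh(Wf * x) ⊙ σ(Wg * x)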
The third step is a convolution operation using a 1*1 convolution kernel.
In the last step of each iteration, we perform two operations on the result generated above: one is called the residual connection and the other is the parameterized skip connection. For the residual operation we combine the result of the previous step with the current result of step A. The residual block increases the efficiency of the training process and greatly helps to reduce degradation problems. The parameterized skip connection speeds up convergence and enables deeper models to be trained. Each iteration thus produces an output called skip. One such layer is sketched below.
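A hedged sketch of one layer of the core process, assuming a PyTorch implementation. The patent's step A (a separate dilation function whose result is reused by the residual connection) is approximated here by adding the layer input directly, and all class, channel and parameter names are illustrative:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ResidualLayer(nn.Module):
    """One dilated causal layer with gated activation, residual and skip outputs."""
    def __init__(self, channels, skip_channels, kernel_size=2, dilation=1):
        super().__init__()
        self.pad = (kernel_size - 1) * dilation
        self.filter_conv = nn.Conv1d(channels, channels, kernel_size, dilation=dilation)
        self.gate_conv = nn.Conv1d(channels, channels, kernel_size, dilation=dilation)
        self.res_conv = nn.Conv1d(channels, channels, 1)        # 1*1 convolution
        self.skip_conv = nn.Conv1d(channels, skip_channels, 1)  # parameterized skip

    def forward(self, x):                                # x: (batch, channels, time)
        padded = F.pad(x, (self.pad, 0))                 # causal (left) padding
        filt = torch.tanh(self.filter_conv(padded))      # filter branch
        gate = torch.sigmoid(self.gate_conv(padded))     # gate branch
        z = filt * gate                                  # gated activation
        skip = self.skip_conv(z)                         # skip output for the final stage
        residual = self.res_conv(z) + x                  # residual output fed to next layer
        return residual, skip
```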
To produce the final output, a few more steps are required: two ReLU-plus-convolution (relu-end_conv) steps and one softmax are applied to the skip outputs we generated, giving the final output. We use the softmax distribution to model the conditional distribution p(xt | x1, ..., xt-1) over the audio samples. The original samples are stored as a sequence of 16-bit integers, which makes the distribution harder to model tractably, so we apply a μ-law companding transformation to the audio and then quantize it to 256 possible values:
f(xt) = sign(xt) * ln(1 + μ|xt|) / ln(1 + μ), where μ = 255 and -1 < xt < 1.
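A minimal sketch of the μ-law companding and 256-level quantisation just described, assuming waveform samples already scaled to the range (-1, 1); NumPy is used here and the function names are illustrative:

```python
import numpy as np

def mu_law_encode(audio, mu=255):
    """Compress samples in [-1, 1] with mu-law, then quantise to 256 integer levels."""
    audio = np.clip(audio, -1.0, 1.0)
    companded = np.sign(audio) * np.log1p(mu * np.abs(audio)) / np.log1p(mu)
    return ((companded + 1) / 2 * mu + 0.5).astype(np.int32)    # values in 0..255

def mu_law_decode(quantised, mu=255):
    """Invert the quantisation and the companding."""
    companded = 2 * (quantised.astype(np.float32) / mu) - 1
    return np.sign(companded) * ((1 + mu) ** np.abs(companded) - 1) / mu
```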
Since we employ queues to store the given samples and the generated samples, we carried out a controlled experiment: we first add the given samples to the queue and then discard them, to see whether this makes any difference.
Figure.6 and Figure.7 show the results of our invention. The values along the horizontal direction show the time length of the selected piece of the wave, while the vertical direction shows the value of each generated point in the waveform.
Claims (2)
- CLAIM
- 1. An efficient and autoregressive method of generating a natural audio sequence, which utilizes a dilated causal convolution network, an advanced deep neural network, wherein the joint probability of the generated phonemes {x1, x2, ..., xT} is factorized as p(x1, x2, ..., xT) = ∏_{t=1}^{T} p(xt | x1, ..., xt-1), and the conditional probabilities are modeled by convolution layers in a network that does not need pooling layers.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
AU2018100318A AU2018100318A4 (en) | 2018-03-14 | 2018-03-14 | A method of generating raw music audio based on dilated causal convolution network |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
AU2018100318A AU2018100318A4 (en) | 2018-03-14 | 2018-03-14 | A method of generating raw music audio based on dilated causal convolution network |
Publications (1)
Publication Number | Publication Date |
---|---|
AU2018100318A4 true AU2018100318A4 (en) | 2018-04-26 |
Family
ID=61973046
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
AU2018100318A Ceased AU2018100318A4 (en) | 2018-03-14 | 2018-03-14 | A method of generating raw music audio based on dilated causal convolution network |
Country Status (1)
Country | Link |
---|---|
AU (1) | AU2018100318A4 (en) |
-
2018
- 2018-03-14 AU AU2018100318A patent/AU2018100318A4/en not_active Ceased
Cited By (7)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US11929085B2 (en) | 2018-08-30 | 2024-03-12 | Dolby International Ab | Method and apparatus for controlling enhancement of low-bitrate coded audio |
CN115066691A (en) * | 2020-02-07 | 2022-09-16 | 渊慧科技有限公司 | Cyclic unit for generating or processing a sequence of images |
CN112215406A (en) * | 2020-09-23 | 2021-01-12 | 国网甘肃省电力公司营销服务中心 | Non-invasive type residential electricity load decomposition method based on time convolution neural network |
CN112215406B (en) * | 2020-09-23 | 2024-04-16 | 国网甘肃省电力公司电力科学研究院 | Non-invasive resident electricity load decomposition method based on time convolution neural network |
CN114559133A (en) * | 2022-04-27 | 2022-05-31 | 苏芯物联技术(南京)有限公司 | Universal welding arc starting continuity real-time detection method and system |
CN114559133B (en) * | 2022-04-27 | 2022-07-29 | 苏芯物联技术(南京)有限公司 | Real-time detection method and system for arc striking continuity of universal welding |
CN117122288A (en) * | 2023-09-08 | 2023-11-28 | 太原理工大学 | Epileptic electroencephalogram signal early warning method and device based on anchoring convolution network |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
AU2018100318A4 (en) | A method of generating raw music audio based on dilated causal convolution network | |
Liu et al. | Recent progress in the CUHK dysarthric speech recognition system | |
CN111771213B (en) | Speech style migration | |
Mangal et al. | LSTM based music generation system | |
US20200410976A1 (en) | Speech style transfer | |
JP7617261B2 (en) | Audio generator, audio signal generation method, and audio generator training method | |
JP7103390B2 (en) | Acoustic signal generation method, acoustic signal generator and program | |
Masuda et al. | Synthesizer Sound Matching with Differentiable DSP. | |
EP4292078A1 (en) | Methods and systems for modifying speech generated by a text-to-speech synthesiser | |
CN111326170B (en) | Otophone-to-Normal Voice Conversion Method and Device by Joint Time-Frequency Domain Expansion Convolution | |
CN112184859B (en) | End-to-end virtual object animation generation method and device, storage medium and terminal | |
US11776528B2 (en) | Method for changing speed and pitch of speech and speech synthesis system | |
KR102358692B1 (en) | Method and tts system for changing the speed and the pitch of the speech | |
CN114141237A (en) | Speech recognition method, apparatus, computer equipment and storage medium | |
CN112002302A (en) | Speech synthesis method and device | |
JP7124373B2 (en) | LEARNING DEVICE, SOUND GENERATOR, METHOD AND PROGRAM | |
CN114267366A (en) | Speech noise reduction through discrete representation learning | |
CN112837670A (en) | Voice synthesis method and device and electronic equipment | |
CN116665704A (en) | A multi-task learning method for automatic notation of piano polyphonic music based on local attention | |
JP2022549352A (en) | training a neural network to generate structured embeddings | |
Choi et al. | Adversarial speaker-consistency learning using untranscribed speech data for zero-shot multi-speaker text-to-speech | |
CN118737122A (en) | Method, apparatus, device and readable medium for speech synthesis | |
Caillon | Hierarchical temporal learning for multi-instrument and orchestral audio synthesis | |
Plantinga et al. | Phonetic feedback for speech enhancement with and without parallel speech data | |
CN117649839B (en) | A personalized speech synthesis method based on low-rank adaptation |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
FGI | Letters patent sealed or granted (innovation patent) | ||
MK22 | Patent ceased section 143a(d), or expired - non payment of renewal fee or expiry |