KR102556098B1

KR102556098B1 - Method and apparatus of audio signal encoding using weighted error function based on psychoacoustics, and audio signal decoding using weighted error function based on psychoacoustics

Info

Publication number: KR102556098B1
Application number: KR1020170173405A
Authority: KR
Inventors: 성종모; 김민제; 시바라만 아스윈; 젠 카이
Original assignee: 한국전자통신연구원; 더 트러스티즈 오브 인디애나 유니버시티
Priority date: 2017-11-24
Filing date: 2017-12-15
Publication date: 2023-07-18
Anticipated expiration: 2037-12-15
Also published as: KR20190060628A

Abstract

According to an embodiment, a perceptually weighted error function based on human auditory characteristics may be used to provide improved audio signal quality at the same model complexity, or the same level of audio signal quality at a lower model complexity.
According to an embodiment, a neural network applied to an audio signal encoding method using an audio signal encoding apparatus may include the steps of: generating a masking threshold for a first audio signal prior to training; calculating, based on the masking threshold, a weight matrix to be applied to each frequency component of the first audio signal; generating a weighted error function by modifying a preset error function using the weight matrix; and generating a second audio signal by applying parameters learned using the weighted error function to the first audio signal.

Description

Audio signal encoding method and apparatus using psychoacoustic-based weighted error function, and audio signal decoding method and apparatus

The following embodiments relate to a method and apparatus for encoding an audio signal using a psychoacoustics-based weighted error function, and to a method and apparatus for decoding an audio signal. More specifically, they relate to audio signal encoding and decoding methods and apparatuses that apply parameters learned using a weighted error function that takes human auditory characteristics into account.

Recently, speech and audio codecs for a variety of purposes and applications have been developed by standardization organizations such as ITU-T, MPEG, and 3GPP. Most audio codecs are based on psychoacoustic models that exploit various human auditory characteristics. Speech codecs are mainly based on speech production models, but they also exploit human perceptual characteristics to improve subjective quality.

As such, conventional speech and audio codecs use methods based on human auditory characteristics to effectively control the quantization noise generated in the encoding stage.

According to an embodiment, a perceptually weighted error function based on human auditory characteristics may be used to provide improved audio signal quality at the same model complexity.

According to an embodiment, a perceptually weighted error function based on human auditory characteristics may be used to provide the same level of audio signal quality at a lower model complexity.

According to one aspect, a neural network applied to an audio signal encoding method using an audio signal encoding apparatus may include the steps of: generating a masking threshold for a first audio signal prior to training; calculating, based on the masking threshold, a weight matrix to be applied to each frequency component of the first audio signal; generating a weighted error function by modifying a preset error function using the weight matrix; and generating a second audio signal by applying parameters learned using the weighted error function to the first audio signal.

The weight matrix may include weights to be applied to the frequency components of the first audio signal, and each weight may be set to be inversely proportional to the masking threshold for the first audio signal and proportional to the magnitude of the corresponding frequency component of the first audio signal.

The neural network may further include the step of performing a perceptual quality evaluation by comparing the generated second audio signal with the first audio signal.

The perceptual quality evaluation may include an objective evaluation such as PESQ (Perceptual Evaluation of Speech Quality), POLQA (Perceptual Objective Listening Quality Assessment), or PEAQ (Perceptual Evaluation of Audio Quality), or a subjective evaluation such as MOS (Mean Opinion Score) or MUSHRA (Multiple Stimuli with Hidden Reference and Anchor).

The neural network may determine, based on the perceptual quality evaluation, whether the topology included in the model can be adjusted.

When determining whether the topology can be adjusted, if the result of the perceptual quality evaluation does not satisfy a preset quality requirement, the parameters may be relearned using a model of increased complexity; if the result satisfies the preset quality requirement, the parameters may be relearned using a model whose complexity is reduced within the quality requirement.
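The topology adjustment described above can be sketched as a small search loop. This is only an illustration under stated assumptions: `candidates`, `train`, `evaluate`, `required`, and `start` are hypothetical names, the candidate topologies are assumed to be ordered from least to most complex, and `evaluate` stands in for any perceptual quality score in which higher is better.

```python
from typing import Callable, Optional, Sequence, Tuple

Topology = Tuple[int, int]  # hypothetical (hidden_layers, nodes_per_layer)


def adjust_topology(
    candidates: Sequence[Topology],
    train: Callable[[Topology], object],
    evaluate: Callable[[object], float],
    required: float,
    start: int,
) -> Optional[Tuple[Topology, object]]:
    """Relearn with a larger or smaller model depending on the quality result."""
    i = start
    model = train(candidates[i])
    if evaluate(model) < required:
        # Requirement not met: relearn with increasingly complex models.
        while evaluate(model) < required:
            if i + 1 >= len(candidates):
                return None  # no candidate satisfies the requirement
            i += 1
            model = train(candidates[i])
        return candidates[i], model
    # Requirement met: relearn with simpler models while it stays satisfied.
    while i > 0:
        smaller = train(candidates[i - 1])
        if evaluate(smaller) < required:
            break
        i, model = i - 1, smaller
    return candidates[i], model
```

In practice `train` would run the weighted-error training and `evaluate` would wrap an objective measure or a listening test result; both are left abstract here.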

According to one aspect, an audio signal encoding method performed by an audio signal encoding apparatus to which a neural network is applied may include: receiving an input audio signal; generating a dimension-reduced latent vector of the input audio signal based on parameters of one or more hidden layers included in the neural network, the parameters having been learned using the neural network; and encoding the generated latent vector to output a bitstream.

The generating of the dimension-reduced latent vector of the input audio signal based on the learned parameters of the hidden layers may generate the latent vector based on the learned parameters when adjustment of the model topology, which includes the number of hidden layers and the number of nodes, is impossible or unnecessary, or based on parameters relearned by applying an adjusted topology when the model topology can be adjusted.

The encoding of the latent vector may binarize the latent vector so that the bitstream can be transmitted over a channel.
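One way to picture the binarization step is uniform scalar quantization followed by bit packing. The sketch below is an assumption, not the scheme of the specification: it supposes a one-dimensional latent vector with values in [-1, 1] (for example after a tanh activation), at most 8 bits per element, and uses the hypothetical helper names `encode_latent` and `decode_latent`.

```python
import numpy as np


def encode_latent(z: np.ndarray, bits: int = 4) -> bytes:
    """Quantize each latent value to `bits` bits and pack the bits into bytes."""
    levels = 2 ** bits
    # Map [-1, 1] to integer indices 0 .. levels - 1.
    idx = np.clip(np.round((z + 1.0) / 2.0 * (levels - 1)), 0, levels - 1)
    idx = idx.astype(np.uint8)
    # Expand each index into its binary digits, most significant bit first.
    shifts = np.arange(bits - 1, -1, -1, dtype=np.uint8)
    bit_matrix = (idx[:, None] >> shifts) & 1
    return np.packbits(bit_matrix.ravel()).tobytes()


def decode_latent(stream: bytes, size: int, bits: int = 4) -> np.ndarray:
    """Recover the quantized latent vector of length `size` from the bitstream."""
    levels = 2 ** bits
    flat = np.unpackbits(np.frombuffer(stream, dtype=np.uint8))[: size * bits]
    shifts = np.arange(bits - 1, -1, -1)
    idx = (flat.reshape(size, bits).astype(np.int64) << shifts).sum(axis=1)
    return idx / (levels - 1) * 2.0 - 1.0
```

The decoder needs the vector length and bit depth as side information; a real codec would also apply entropy coding, which is omitted here.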

According to one aspect, a neural network applied to an audio signal decoding method using an audio signal decoding apparatus may include the steps of: generating a masking threshold for a first audio signal prior to training; calculating, based on the masking threshold, a weight matrix to be applied to each frequency component of the first audio signal; generating a weighted error function by modifying a preset error function using the weight matrix; and generating a second audio signal by applying parameters learned using the weighted error function to the first audio signal.

When determining whether the topology can be adjusted, if the result of the perceptual quality evaluation does not satisfy a preset quality requirement, the neural network may relearn the parameters using a model of increased complexity; if the result satisfies the preset quality requirement, it may relearn the parameters using a model whose complexity is reduced within the quality requirement.

According to one aspect, an audio signal decoding method performed by an audio signal decoding apparatus to which a neural network is applied may include: receiving a bitstream in which a latent vector is encoded, the latent vector having been generated by applying parameters learned through the neural network to an input audio signal; restoring the latent vector from the received bitstream; and decoding an output audio signal from the restored latent vector using one or more hidden layers of the neural network to which the learned parameters are applied.

The latent vector may be generated based on the learned parameters when adjustment of the topology, which includes the number of hidden layers and the number of nodes belonging to the hidden layers, is impossible or unnecessary, and based on parameters relearned by applying an adjusted topology when the topology can be adjusted.

In the audio signal decoding method, the encoding of the latent vector may be a binarization performed so that the bitstream can be transmitted over a channel.

According to one aspect, an audio signal encoding apparatus to which a neural network is applied may include a processor and a memory containing one or more instructions executable by the processor. When the one or more instructions are executed by the processor, the apparatus may receive an input audio signal, generate a dimension-reduced latent vector of the input audio signal based on parameters of one or more hidden layers included in the neural network, the parameters having been learned using the neural network, and encode the generated latent vector to output a bitstream.

According to one aspect, an audio signal decoding apparatus to which a neural network is applied may include a processor and a memory containing one or more instructions executable by the processor. When the one or more instructions are executed by the processor, the apparatus may receive a bitstream in which a latent vector is quantized, the latent vector having been generated by applying parameters learned through the neural network to an input audio signal, restore the latent vector from the received bitstream, and decode an output audio signal from the restored latent vector using one or more hidden layers of the neural network to which the learned parameters are applied.

FIG. 1 is a diagram illustrating the structure of an autoencoder including three hidden layers, according to an embodiment.
FIG. 2 illustrates a simultaneous masking effect, according to an embodiment.
FIG. 3 is a diagram illustrating audible and inaudible regions in consideration of the masking effect, according to an embodiment.
FIG. 4 is a diagram illustrating a learning process of a neural network using a weighted error function, according to an embodiment.
FIG. 5 is a diagram illustrating an audio signal encoding apparatus, a channel, and an audio signal decoding apparatus, according to an embodiment.
FIG. 6 is a diagram illustrating a learning process of a neural network applied to an audio signal encoding method using an audio signal encoding apparatus, according to an embodiment.
FIG. 7 is a diagram illustrating an audio signal encoding method performed by an audio signal encoding apparatus to which a neural network is applied, according to an embodiment.
FIG. 8 is a diagram illustrating a learning process of a neural network applied to an audio signal decoding method using an audio signal decoding apparatus, according to an embodiment.
FIG. 9 is a diagram illustrating an audio signal decoding method performed by an audio signal decoding apparatus to which a neural network is applied, according to an embodiment.
FIG. 10 is a diagram illustrating an audio signal encoding apparatus to which a neural network is applied, according to an embodiment.
FIG. 11 is a diagram illustrating an audio signal decoding apparatus to which a neural network is applied, according to an embodiment.

Specific structural or functional descriptions of the embodiments are disclosed for illustrative purposes only, and the embodiments may be modified and implemented in various forms. Accordingly, the embodiments are not limited to the specific forms disclosed, and the scope of this specification includes changes, equivalents, and substitutes that fall within the technical idea.

Terms such as "first" and "second" may be used to describe various components, but these terms should be interpreted only for the purpose of distinguishing one component from another. For example, a first component may be termed a second component, and similarly, a second component may be termed a first component.

When a component is referred to as being "connected" to another component, it may be directly connected or coupled to that other component, but it should be understood that intervening components may also be present.

Singular expressions include plural expressions unless the context clearly indicates otherwise. In this specification, terms such as "comprise" and "have" are intended to specify the presence of the described features, numbers, steps, operations, components, parts, or combinations thereof, and should be understood not to preclude the presence or addition of one or more other features, numbers, steps, operations, components, parts, or combinations thereof.

Unless defined otherwise, all terms used herein, including technical and scientific terms, have the same meaning as commonly understood by a person of ordinary skill in the relevant art. Terms such as those defined in commonly used dictionaries should be interpreted as having a meaning consistent with their meaning in the context of the related art, and are not to be interpreted in an idealized or overly formal sense unless expressly so defined in this specification.

Hereinafter, embodiments of the present invention are described in detail with reference to the accompanying drawings.

FIG. 1 is a diagram illustrating the structure of an autoencoder including three hidden layers, according to an embodiment.

According to an embodiment, audio signal encoding and decoding may be performed by applying a neural network. The neural network may include various deep learning and machine learning models. Specifically, the neural network may include an autoencoding method as a deep learning model.

A neural network may solve an optimization problem that minimizes an error function (also called a cost function). The optimization is performed by an iterative learning algorithm, through which parameters that minimize the error can be found.

For example, a neural network applied to input data may predict output data as in Equation 1, $\hat{Y} = \mathcal{F}(X; \mathbb{W})$. Here, $X \in \mathbb{R}^{D \times N}$ denotes $N$ input data of dimension $D$, $\hat{Y} \in \mathbb{R}^{R \times N}$ denotes $N$ output data of dimension $R$, and the parameter $\mathbb{W}$ represents the model of the neural network applied to the input data.

The difference between the predicted output data $\hat{Y}$ and the target data $Y$ represents the error, and the error function for the $n$-th data, $\mathcal{E}(Y_{:,n} \,\|\, \hat{Y}_{:,n})$, may be defined as in Equation 2. Here, the target data $Y \in \mathbb{R}^{R \times N}$ consists of $N$ target data of dimension $R$.

Here, $Y_{:,n}$ and $\hat{Y}_{:,n}$ denote the $n$-th column vectors of the target data and the output data, respectively. Accordingly, as in Equation 3, the total error over the $N$ data may be calculated as $\sum_{n=1}^{N} \mathcal{E}(Y_{:,n} \,\|\, \hat{Y}_{:,n})$.

Accordingly, the parameters may be adjusted to minimize the total error, and parameters that minimize the total error may be found by an iterative learning algorithm. Various neural network models that perform iterative learning algorithms exist. Among them, the autoencoding method can be effective for learning latent patterns from input data.
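As a concrete picture of finding parameters that minimize the total error by an iterative learning algorithm, the sketch below runs plain gradient descent on a single linear layer. The squared-error form of the per-column error, the linear model, the step count, and the learning rate are all assumptions chosen for brevity, and `total_error` and `fit_linear` are hypothetical names.

```python
import numpy as np


def total_error(Y: np.ndarray, Y_hat: np.ndarray) -> float:
    """Sum of per-column squared errors over all N columns (assumed error form)."""
    return float(np.sum((Y - Y_hat) ** 2))


def fit_linear(X, Y, steps=500, lr=0.05, seed=0):
    """Gradient descent on W for the model Y_hat = W @ X; columns are samples."""
    rng = np.random.default_rng(seed)
    W = 0.01 * rng.standard_normal((Y.shape[0], X.shape[0]))
    history = []
    for _ in range(steps):
        Y_hat = W @ X
        history.append(total_error(Y, Y_hat))
        # Gradient of the total squared error with respect to W, scaled by 1/N.
        grad = 2.0 * (Y_hat - Y) @ X.T / X.shape[1]
        W -= lr * grad
    return W, history
```

A deep network replaces the single matrix product with several layers and computes the gradient by backpropagation, but the loop has the same shape.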

Hereinafter, a neural network to which the autoencoding method is applied is described according to an embodiment. However, the embodiments are not limited to the autoencoding method, and models to which other neural networks are applied may also be included.

According to an embodiment, in an autoencoder the target data $Y$ and the input data $X$ may have the same size, which makes the autoencoder effective for dimensionality reduction. When the autoencoder includes $L$ hidden layers, a fully-connected deep autoencoder may be defined recursively as in Equation 4.

Here, $W^{(l)}$ and $b^{(l)}$ denote the weights and the bias of the $l$-th layer, respectively, and the parameter $\mathbb{W}$ denotes the collection of the weights and biases of all layers.
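The recursive, layer-by-layer definition of a fully-connected autoencoder can be illustrated with a short forward pass. This is a sketch under assumptions: the tanh activation, the random initialization, the layer sizes, and the names `init_params` and `forward` are not taken from the specification.

```python
import numpy as np


def init_params(layer_sizes, seed=0):
    """Create one (weights, bias) pair per layer transition."""
    rng = np.random.default_rng(seed)
    return [
        (rng.standard_normal((n_out, n_in)) / np.sqrt(n_in), np.zeros((n_out, 1)))
        for n_in, n_out in zip(layer_sizes[:-1], layer_sizes[1:])
    ]


def forward(X, params, code_index=None):
    """Apply X <- tanh(W @ X + b) once per layer; columns of X are samples.

    Returns the network output and, if `code_index` is given, the
    activation of that hidden layer (the dimension-reduced code).
    """
    code = None
    for layer, (W, b) in enumerate(params, start=1):
        X = np.tanh(W @ X + b)
        if layer == code_index:
            code = X
    return X, code
```

With layer sizes `[32, 64, 16, 64, 32]` the network has three hidden layers, and `code_index=2` selects the middle one as the code layer, so each 32-dimensional input column is reduced to 16 values.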

FIG. 1 is a diagram illustrating an autoencoder including three hidden layers, according to an embodiment. Here, the bottom layer 110 is the input layer into which data is input, the top layer 120 is the output layer from which data is output, and the middle layers are hidden layers 130. Nodes are included in the input, output, and hidden layers, and among them node 140 may represent a bias node.

According to an embodiment, the autoencoder may be a compression system that includes an encoder comprising the input layer and hidden layers 1 and 2, and a decoder comprising hidden layer 3 and the output layer. Here, hidden layer 2 may be a code layer that compresses the input data.

As a compression system, the encoder of the autoencoder may generate a dimension-reduced code at the code layer when the code layer has fewer nodes than the input layer. In addition, using the error function set as in Equation 5 for the target data $Y$ and the input data $X$, the decoder of the autoencoder may restore the input data from the code.

To obtain an effective code layer, the number of hidden layers and the number of nodes in each hidden layer may be increased. In this case, there is a trade-off between the complexity of the autoencoder and the quality of the output data. Accordingly, when the complexity of the autoencoder is high, problems with battery consumption and memory capacity may arise.

According to another embodiment, the autoencoder may be a denoising autoencoder that generates a noise-free clean signal from a noisy signal corrupted by noise. The noise may include additive noise, reverberation, and band-pass filtering.

A denoising autoencoder that generates a clean signal from a noisy signal may be expressed as in Equation 6. Here, $Y$ may be the magnitude spectrum of the clean signal and $X$ the magnitude spectrum of the noisy signal, where $X$ is expressed by a corruption function $\mathcal{G}$ as $X = \mathcal{G}(Y)$. In Equation 6, $\mathcal{F}$ approximates the inverse of the corruption function, that is, $\mathcal{F} \approx \mathcal{G}^{-1}$.

For example, when the clean signal is corrupted by additive noise, it may be more effective to train the denoising autoencoder to estimate an ideal mask $M$ for removing the additive noise, as in Equation 7, than to estimate the clean signal directly.

Here, the ideal mask may be used to remove the additive noise from the noisy signal using the Hadamard product, as in Equation 8.

Here, $\hat{M}$ may be the estimated ideal ratio mask, which may be expressed as in Equation 9. $Q$ may be the magnitude spectrum of the additive noise, and when the clean signal is corrupted by additive noise, the corruption function may be defined as in Equation 10.
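The mask-based denoising described above can be sketched in a few lines. The exact forms of the equations are not reproduced in this text, so the sketch assumes that the magnitudes add (X = Y + Q) and that the ratio mask is M = Y / (Y + Q); `ideal_ratio_mask` and `apply_mask` are hypothetical names.

```python
import numpy as np


def ideal_ratio_mask(Y: np.ndarray, Q: np.ndarray, eps: float = 1e-12) -> np.ndarray:
    """Element-wise ratio of clean magnitude to clean-plus-noise magnitude."""
    return Y / (Y + Q + eps)


def apply_mask(M: np.ndarray, X: np.ndarray) -> np.ndarray:
    """Hadamard (element-wise) product of the mask and the noisy spectrum."""
    return M * X
```

Under these assumptions the mask values lie between 0 and 1, and applying the ideal mask to X = Y + Q returns Y. A trained network would output an estimate of the mask from X alone, since Y and Q are unknown at inference time.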

The denoising autoencoder may learn the function $\mathcal{F}$ using a structure similar to Equation 4. However, because the corruption function $\mathcal{G}$ is very complex, a large number of hidden layers and nodes may be required. Accordingly, a trade-off between model complexity and performance may arise.

To address this trade-off, an autoencoding method based on an error function that uses human auditory characteristics is described in detail below. Applying an error function that uses auditory characteristics can lower the model complexity or improve the performance of the autoencoder.

FIG. 2 illustrates a simultaneous masking effect, according to an embodiment.

Graph 210 shows the audible sound pressure level, in decibels (dB), as a function of the frequency of an audio signal in a quiet environment. For example, the level is lowest in the 4 kHz frequency band, and the threshold of the audible sound pressure level rises toward lower and higher frequency bands. More specifically, most people cannot perceive a relatively loud tonal signal of about 30 dB at 30 Hz, but can perceive a relatively quiet tonal signal of about 10 dB at 1 kHz.
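The shape of such a threshold-in-quiet curve can be reproduced with a commonly used analytic approximation, often attributed to Terhardt. The formula is not taken from this specification and is used here only as an assumed stand-in for graph 210; `threshold_in_quiet_db` is a hypothetical name.

```python
import numpy as np


def threshold_in_quiet_db(freq_hz):
    """Approximate absolute threshold of hearing in dB SPL for frequencies in Hz."""
    f = np.asarray(freq_hz, dtype=float) / 1000.0  # frequency in kHz
    return 3.64 * f ** -0.8 - 6.5 * np.exp(-0.6 * (f - 3.3) ** 2) + 1e-3 * f ** 4
```

Under this approximation the threshold at 30 Hz is well above 30 dB, the threshold at 1 kHz is well below 10 dB, and the minimum falls between roughly 3 and 4 kHz, which matches the behaviour described for graph 210.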

Graph 230 represents graph 210 as modified by a tonal signal present at 1 kHz. That is, the tonal signal raises graph 210 at 1 kHz. Accordingly, a signal whose level is below graph 230 cannot be perceived by humans.

The masker 220 denotes the tonal signal at 1 kHz, and a maskee denotes a signal masked by the masker. For example, the maskee may include a signal whose level is below graph 230.

FIG. 3 is a diagram illustrating audible and inaudible regions in consideration of the masking effect, according to an embodiment.

FIG. 3 shows the graph of FIG. 2 superimposed on the spectrum of an input audio signal. Here, the audible region 340 is where the spectrum 360 of the input audio signal is greater than the masking threshold curve at the corresponding frequency. The inaudible region 350 is where the spectrum 360 of the input audio signal is smaller than the masking threshold curve at the corresponding frequency.

For example, at 30 Hz graph 310 is greater than graph 360, so this is an inaudible region 350; at 10 kHz graph 310 is greater than graph 360, so this is also an inaudible region 350; and at 4 kHz graph 360 is greater than graph 310, so this is an audible region 340.
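The audible and inaudible regions amount to an element-wise comparison between the signal spectrum and the masking threshold. In the sketch below the three levels per frequency are made-up illustrative values, not data from the figures, and `audible_mask` is a hypothetical name.

```python
import numpy as np


def audible_mask(spectrum_db, threshold_db):
    """True where the signal spectrum exceeds the masking threshold."""
    return np.asarray(spectrum_db, dtype=float) > np.asarray(threshold_db, dtype=float)


# Hypothetical levels at 30 Hz, 4 kHz, and 10 kHz.
freqs_hz = np.array([30.0, 4000.0, 10000.0])
threshold_db = np.array([60.0, -3.0, 15.0])
spectrum_db = np.array([30.0, 40.0, 5.0])
audible = audible_mask(spectrum_db, threshold_db)
```

With these values only the 4 kHz component is marked audible, mirroring the example in the text.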

According to an embodiment, the error function may be modified so that training of the autoencoder gives more consideration to the audible region than to the inaudible region, depending on the characteristics of the input audio signal. That is, when the magnitude of a specific frequency component of the input audio signal is smaller than the corresponding masking threshold, a listener may be relatively insensitive to errors. Conversely, when the magnitude of a specific frequency component of the input audio signal is greater than the corresponding masking threshold, a listener may be relatively more sensitive to errors.

Therefore, for training an autoencoder that takes human hearing characteristics into account, the modified error function may be expressed as Equation 11 below. Here, the modified error function may denote a weighted error function obtained by modifying a preset error function.

Here, H denotes a weighting matrix, which may include the weights applied to each frequency component of the input audio signal. For example, when a given coefficient of a given sample has a large masking threshold, the corresponding weight may be relatively small. As another example, when a given coefficient of a given sample has a small masking threshold, the corresponding weight may be relatively large.
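As a rough illustration of how a weighting matrix H can enter an error function, the sketch below scales each squared coefficient error by its weight. It assumes a squared-error base function, which this text does not confirm; the function name `weighted_squared_error` and the sample values are hypothetical, while Y, Z and H follow the symbols used in the description.

```python
def weighted_squared_error(target, predicted, weights):
    """Sum of per-coefficient squared errors, each scaled by its weight."""
    total = 0.0
    for y_row, z_row, h_row in zip(target, predicted, weights):
        for y, z, h in zip(y_row, z_row, h_row):
            total += h * (y - z) ** 2
    return total

# Rows are samples (frames), columns are frequency coefficients; values are made up.
Y = [[1.0, 0.5], [0.2, 0.8]]   # target spectrum
Z = [[0.9, 0.1], [0.2, 0.6]]   # predicted spectrum
H = [[2.0, 0.1], [1.0, 1.0]]   # small weight where the masking threshold is large
loss = weighted_squared_error(Y, Z, H)
print(loss)
```

The large error in the second coefficient of the first sample contributes little to the total because its weight is small, which is the behavior described for coefficients with a large masking threshold.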

FIG. 4 is a diagram illustrating a learning process of a neural network using a weighted error function, according to an embodiment.

According to an embodiment, the frequency spectrum of the first audio signal may be obtained through the time-frequency analysis 401 for analysis frames of a preset length. The frequency spectrum may be obtained by applying a filter bank or a modified discrete cosine transform (MDCT) to the first audio signal, or by other methods.

Here, the first audio signal may denote a training audio signal used for learning the parameters. The analysis frame of a preset length may be an analysis frame of 10 ms to 50 ms of the first audio signal, and analysis frames of other lengths may also be used.

Thus, through the time-frequency analysis 401, the frequency spectrum Y of the first audio signal, or the frequency spectrum of the first audio signal corrupted by noise, may be generated.
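The framing and spectrum computation can be sketched as follows. This is only an illustration: it uses a naive discrete Fourier transform as a stand-in for the filter bank or MDCT named in the description, and the 8 kHz sample rate, non-overlapping 10 ms frames, test tone, and function names are all assumptions.

```python
import cmath
import math

def split_into_frames(signal, sample_rate, frame_ms):
    """Cut the signal into consecutive, non-overlapping frames of frame_ms milliseconds."""
    frame_len = int(sample_rate * frame_ms / 1000)
    n_frames = len(signal) // frame_len
    return [signal[i * frame_len:(i + 1) * frame_len] for i in range(n_frames)]

def magnitude_spectrum(frame):
    """Naive DFT magnitudes for bins 0..N/2 (a stand-in for a filter bank or MDCT)."""
    n = len(frame)
    spectrum = []
    for k in range(n // 2 + 1):
        acc = sum(x * cmath.exp(-2j * math.pi * k * t / n) for t, x in enumerate(frame))
        spectrum.append(abs(acc))
    return spectrum

sample_rate = 8000          # assumed sample rate in Hz
tone_hz = 1000              # assumed test tone
signal = [math.sin(2 * math.pi * tone_hz * t / sample_rate) for t in range(800)]
frames = split_into_frames(signal, sample_rate, frame_ms=10)
spectra = [magnitude_spectrum(f) for f in frames]
peak_bin = max(range(len(spectra[0])), key=lambda k: spectra[0][k])
print(len(frames), len(frames[0]), peak_bin)
```

With 80-sample frames at 8 kHz each bin spans 100 Hz, so the 1 kHz tone peaks in bin 10.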

According to an embodiment, a masking threshold for the first audio signal may be calculated through the psychoacoustic analysis 402. That is, the masking threshold may be calculated by analyzing the auditory characteristics of the first audio signal.

The psychoacoustic analysis method may include MPEG PAM-I or MPEG PAM-II, and may include psychoacoustic analysis by other methods. For example, temporal masking may be used in addition to the simultaneous masking effect.

According to an embodiment, a weighting matrix (weight matrix) to be applied to each frequency component of the first audio signal may be calculated (403) using the masking threshold calculated through the psychoacoustic analysis 402.

Here, the weighting matrix may include the weights to be applied to the frequency components of the first audio signal, and each weight may be set to be inversely proportional to the masking threshold for the first audio signal and proportional to the magnitude of the corresponding frequency component of the first audio signal. That is, the weight may be determined by a signal-to-mask ratio (SMR), where the SMR may denote the ratio between the magnitude of each frequency component of the first audio signal and the magnitude of the masking threshold at that component.

For example, the weight of the weighting matrix for each frequency component of the first audio signal may be set to be inversely proportional to the corresponding masking threshold and proportional to the magnitude of that frequency component.

According to an embodiment, when the masking threshold is expressed in decibels, each weight of the weighting matrix may be expressed by the relationship in Equation 12 below. However, according to an embodiment, the relationship between the masking threshold and the weight may involve a linear-scale conversion, a log-scale conversion, or another scale conversion.
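One way to turn decibel-domain levels into SMR-based weights is sketched below. The mapping used here, a plain conversion of the SMR from decibels to a linear power ratio, is an assumption chosen for illustration and should not be read as Equation 12; the function name `smr_weights` and the sample levels are hypothetical.

```python
def smr_weights(signal_db, mask_db):
    """Map each bin's signal-to-mask ratio (dB) to a linear-scale weight."""
    weights = []
    for s_db, m_db in zip(signal_db, mask_db):
        smr_db = s_db - m_db                      # ratio in dB is a difference of levels
        weights.append(10.0 ** (smr_db / 10.0))   # assumed linear power-ratio mapping
    return weights

# Hypothetical levels: same masking threshold, decreasing signal level.
signal_db = [60.0, 40.0, 20.0]
mask_db = [40.0, 40.0, 40.0]
weights = smr_weights(signal_db, mask_db)
print(weights)
```

The weight grows with the signal level and shrinks as the masking threshold rises, which is the proportionality described above; a log-scale mapping could be substituted without changing that ordering.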

According to an embodiment, the error weighting function 404 may generate a weighted error function from a preset error function. Here, the weighted error function, expressed as in Equation 11, may be used with the weights applied to learn the parameters of the model.

Here, the model may include a neural network, for example an autoencoder. The model may also include a topology, and the topology of the model may include an input layer, one or more hidden layers, an output layer, and the nodes included in each layer.

According to an embodiment, in the parameter learning 405, the parameters of the model may be learned using the weighted error function. For example, learning may be performed on the topology of an initial model, and the audio spectrum Z predicted using the topology of the model for which learning is complete may be output.
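The effect of learning with a weighted error function can be seen in a deliberately tiny example. The sketch below replaces the autoencoder with a hypothetical one-parameter model, a single shared gain g applied to every coefficient, and minimizes a weighted squared error by gradient descent; the model, the squared-error base, the learning rate, and the data are all assumptions.

```python
def train_gain(x, y, h, lr=0.01, steps=200):
    """Fit one shared gain g so that g * x approximates y under per-coefficient weights h."""
    g = 0.0
    for _ in range(steps):
        # Gradient of sum_j h_j * (y_j - g * x_j)^2 with respect to g.
        grad = sum(-2.0 * hj * xj * (yj - g * xj) for xj, yj, hj in zip(x, y, h))
        g -= lr * grad
    return g

x = [1.0, 1.0]
y = [1.0, 0.0]   # the two coefficients want different gains
g_weighted = train_gain(x, y, h=[9.0, 1.0])   # first coefficient matters more
g_uniform = train_gain(x, y, h=[1.0, 1.0])    # both coefficients matter equally
print(g_weighted, g_uniform)
```

Because the model cannot fit both coefficients at once, the weights decide where its limited capacity goes: the weighted run settles near 0.9, favoring the heavily weighted coefficient, while the uniform run settles near 0.5.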

Here, the predicted audio spectrum Z may be generated by applying the learned parameters of the model to the frequency spectrum Y of the first audio signal or to the frequency spectrum of the first audio signal corrupted by noise.

According to an embodiment, the perceptual quality assessment 406 may perform a quality evaluation by comparing the predicted audio spectrum with the first audio signal or with the frequency spectrum Y of the first audio signal.

Here, the perceptual quality assessment may use an objective evaluation such as PESQ (Perceptual Evaluation of Speech Quality), POLQA (Perceptual Objective Listening Quality Assessment), or PEAQ (Perceptual Evaluation of Audio Quality), or a subjective evaluation such as MOS (Mean Opinion Score) or MUSHRA (Multiple Stimuli with Hidden Reference and Anchor). The perceptual quality assessment is not limited to these and may include other quality evaluations.

The perceptual quality assessment 406 may denote a quality evaluation of the currently learned model. Based on the quality evaluation, it may be determined whether model requirements, such as a preset quality and model complexity, are satisfied. Also, based on the quality evaluation, it may be determined whether the model topology, for example the model complexity, can be adjusted. Here, the complexity of the model may be positively correlated with the number of hidden layers and the number of nodes.
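One simple proxy for model complexity is the number of trainable parameters. The sketch below counts them for a fully connected network with bias terms, which is an assumption about the layer type; the function name `count_parameters` and the layer sizes are hypothetical.

```python
def count_parameters(layer_sizes):
    """Weights plus biases of a fully connected network with the given layer widths."""
    return sum(n_in * n_out + n_out
               for n_in, n_out in zip(layer_sizes, layer_sizes[1:]))

# Input layer, three hidden layers (the middle one is the narrowest), output layer.
small = count_parameters([512, 128, 32, 128, 512])
large = count_parameters([512, 256, 64, 256, 512])
print(small, large)
```

Widening the hidden layers roughly doubles the count here, consistent with complexity rising with the number of nodes.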

According to an embodiment, when it is determined (407) that the topology of the model cannot be adjusted or does not need to be adjusted, the parameters of the currently learned model may be stored and learning may be terminated. The determination of whether the topology of the model can be adjusted may differ depending on the field in which the model is applied and the requirements for the model.

According to an embodiment, when it is determined (407) that the topology of the model can be adjusted, the topology of the model may be updated (408), and the learning process described above may be repeated. The determination of whether the topology of the model can be adjusted may differ depending on the field in which the model is applied and the requirements for the model.

For example, when the quality requirement is not satisfied, the topology of the model may be adjusted to increase the complexity of the model, and the learning process described above may be repeated. Accordingly, when determining whether the topology can be adjusted, if the result of the perceptual quality assessment does not satisfy a preset quality requirement, the parameters may be relearned using a model with increased complexity. Alternatively, when the quality requirement is not satisfied but the complexity of the model cannot be increased any further, the parameters of the currently learned model may be stored and learning may be terminated.

As another example, when the quality requirement is satisfied, the topology of the model may be adjusted to reduce the complexity of the model while still meeting the quality requirement, and the learning process described above may be repeated. Accordingly, when determining whether the topology can be adjusted, if the result of the perceptual quality assessment satisfies the preset quality requirement, the parameters may be relearned using a model whose complexity is reduced within the quality requirement.
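The two adjustment cases can be sketched as a single loop. This is a hedged outline only: `evaluate_quality` stands for "train a model of this size and return its perceptual quality score", and it, the unit counts, the step size, and the toy quality curve are all hypothetical.

```python
def adjust_topology(evaluate_quality, hidden_units, quality_target,
                    min_units, max_units, step):
    """Grow the model while quality is too low, or shrink it while quality stays sufficient."""
    quality = evaluate_quality(hidden_units)
    if quality < quality_target:
        # Requirement not met: increase complexity until it is met or the limit is hit.
        while quality < quality_target and hidden_units + step <= max_units:
            hidden_units += step
            quality = evaluate_quality(hidden_units)
        return hidden_units, quality
    # Requirement met: reduce complexity as long as the requirement still holds.
    while hidden_units - step >= min_units:
        candidate = evaluate_quality(hidden_units - step)
        if candidate < quality_target:
            break
        hidden_units -= step
        quality = candidate
    return hidden_units, quality

def toy_quality(units):
    """Made-up stand-in for training plus perceptual evaluation."""
    return min(4.5, 1.0 + units / 64.0)

grown = adjust_topology(toy_quality, 64, 3.0, 32, 256, 32)
shrunk = adjust_topology(toy_quality, 512, 3.0, 32, 512, 32)
capped = adjust_topology(toy_quality, 64, 3.0, 32, 96, 32)
print(grown, shrunk, capped)
```

The third call shows the termination case in which the requirement is not met but the model cannot grow further, so the current model is kept.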

According to an embodiment, signal processing may be performed on an input audio signal based on the parameters of the learned model. The signal processing may include, but is not limited to, compression, noise removal, encoding, and decoding.

FIG. 5 is a diagram illustrating an audio signal encoding apparatus, a channel, and an audio signal decoding apparatus, according to an embodiment.

According to an embodiment, the audio signal encoding apparatus 510 to which a neural network is applied may receive an input audio signal. A model learned through the neural network may be applied to the audio signal encoding apparatus 510. Here, the neural network may include an autoencoder.

The audio signal encoding apparatus 510 may include an autoencoder encoding unit 511 and a quantization unit 512. Here, the autoencoder encoding unit 511 may include the layers from the input layer to the code layer, where the code layer may denote one of the hidden layers.

The audio signal encoding apparatus 510 may generate a dimension-reduced latent vector of the received input audio signal based on the parameters of the hidden layers learned using the neural network. The generated latent vector may be quantized or encoded by the quantization unit 512 and output as a bitstream. Here, the quantization unit 512 may include a binarization process for transmitting the bitstream through a transmission channel.
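The quantization and binarization of a latent vector might look like the following sketch. It assumes a uniform scalar quantizer over the range [-1, 1] and a fixed number of bits per element, neither of which is specified in the description; the function names and the latent values are hypothetical.

```python
def quantize_latent(latent, n_bits, lo=-1.0, hi=1.0):
    """Map each latent value to an integer index on a uniform grid over [lo, hi]."""
    levels = (1 << n_bits) - 1
    indices = []
    for v in latent:
        clipped = min(max(v, lo), hi)
        indices.append(round((clipped - lo) / (hi - lo) * levels))
    return indices

def to_bitstream(indices, n_bits):
    """Binarize the indices as fixed-width binary strings and concatenate them."""
    return "".join(format(i, "0{}b".format(n_bits)) for i in indices)

latent = [-1.0, -0.2, 0.5, 1.0]      # made-up latent vector
indices = quantize_latent(latent, n_bits=4)
bits = to_bitstream(indices, n_bits=4)
print(indices, bits)
```

Fewer bits per element lower the bitrate at the cost of a coarser latent vector on the decoding side.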

The output bitstream may be transmitted to the audio signal decoding apparatus 520 through the transmission channel 530.

The audio signal decoding apparatus 520 may include an autoencoder decoding unit 521 and an inverse quantization unit 522. Here, the autoencoder decoding unit 521 may include the layers from a given hidden layer to the output layer.

The audio signal decoding apparatus 520 may restore the latent vector by inverse quantizing or decoding, in the inverse quantization unit 522, the bitstream transmitted through the transmission channel 530. Using the restored latent vector, the autoencoder decoding unit 521 may decode, or compute, the output audio signal. Here, the inverse quantization unit 522 may include a process of binarizing the bitstream.
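Restoring a latent vector from a received bitstream can be sketched as the inverse of a uniform scalar quantizer. The quantizer type, the [-1, 1] range, the 4-bit width, the function names, and the example bit string are all assumptions made for illustration.

```python
def from_bitstream(bits, n_bits):
    """Split the bit string into fixed-width fields and parse each as an integer index."""
    return [int(bits[i:i + n_bits], 2) for i in range(0, len(bits), n_bits)]

def dequantize_latent(indices, n_bits, lo=-1.0, hi=1.0):
    """Map integer indices back to values on a uniform grid over [lo, hi]."""
    levels = (1 << n_bits) - 1
    return [lo + (i / levels) * (hi - lo) for i in indices]

bits = "0000011010111111"            # made-up received bitstream, 4 bits per element
indices = from_bitstream(bits, n_bits=4)
restored = dequantize_latent(indices, n_bits=4)
print(indices, restored)
```

The restored values are only approximations of the original latent vector, since quantization discards information; the decoding layers then operate on these approximations.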

The parameters of the autoencoder encoding unit 511 and the autoencoder decoding unit 521 may be the parameters learned by the process of FIG. 4. For details on the learned parameters, refer to FIG. 4.

According to an embodiment, by using an error function that is perceptually weighted according to human hearing characteristics, improved audio signal quality may be provided at the same model complexity. Alternatively, the same level of audio signal quality may be provided at a lower model complexity. The approach may therefore be used in audio codecs for compressing and restoring audio signals.

FIG. 6 is a diagram illustrating a learning process of a neural network applied to an audio signal encoding method using an audio signal encoding apparatus, according to an embodiment.

In step 601, the audio signal encoding apparatus may generate a masking threshold for a first audio signal before learning. Here, the first audio signal may denote a training audio signal for training the neural network.

The masking threshold may be calculated by analyzing the auditory characteristics of the first audio signal. The psychoacoustic analysis method may include MPEG PAM-I or MPEG PAM-II, and may include psychoacoustic analysis by other methods. For example, temporal masking may be used in addition to the simultaneous masking effect.

In step 602, the audio signal encoding apparatus may calculate, based on the generated masking threshold, a weighting matrix to be applied to each frequency component of the first audio signal. Here, the weighting matrix may include the weights to be applied to the frequency components of the first audio signal, and each weight may be set to be inversely proportional to the masking threshold for the first audio signal and proportional to the magnitude of the corresponding frequency component of the first audio signal. That is, the weight may be determined by a signal-to-mask ratio (SMR), where the SMR may denote the ratio between the magnitude of each frequency component of the first audio signal and the magnitude of the masking threshold at that component.

For example, the weight of the weighting matrix for each frequency component of the first audio signal may be set to be inversely proportional to the corresponding masking threshold and proportional to the magnitude of that frequency component.

In step 603, the audio signal encoding apparatus may generate a weighted error function by modifying a preset error function using the weighting matrix. Here, the weighted error function may be used with the weights applied to learn the parameters of the model.

Here, the model may include a neural network, and the neural network may include, for example, an autoencoder. The model may also include a topology, and the topology of the model may include an input layer, one or more hidden layers, an output layer, and the nodes included in each layer.

In step 604, the audio signal encoding apparatus may generate a second audio signal by applying, to the first audio signal, the parameters learned using the weighted error function. Here, the parameters of the model may be learned using the weighted error function. For example, learning may be performed on the topology of an initial model, and the audio signal predicted using the topology of the model for which learning has been completed by an iterative learning algorithm may be output. Here, the predicted audio signal may denote the second audio signal.

Here, the frequency spectrum of the second audio signal may be generated by applying the learned parameters of the model to the frequency spectrum of the first audio signal or to the frequency spectrum of the first audio signal corrupted by noise.

A perceptual quality assessment may be performed by comparing the second audio signal with the first audio signal. The perceptual quality assessment may use an objective evaluation such as PESQ (Perceptual Evaluation of Speech Quality), POLQA (Perceptual Objective Listening Quality Assessment), or PEAQ (Perceptual Evaluation of Audio Quality), or a subjective evaluation such as MOS (Mean Opinion Score) or MUSHRA (Multiple Stimuli with Hidden Reference and Anchor). The perceptual quality assessment is not limited to these and may include other quality evaluations.

The perceptual quality assessment may denote a quality evaluation of the currently learned model. Based on the quality evaluation, it may be determined whether model requirements, such as a preset quality and model complexity, are satisfied. Also, based on the quality evaluation, it may be determined whether the model topology, for example the model complexity, can be adjusted. Here, the complexity of the model may be positively correlated with the number of hidden layers and the number of nodes.

According to an embodiment, when it is determined that the topology of the model cannot be adjusted or does not need to be adjusted, the parameters of the currently learned model may be stored and learning may be terminated. The determination of whether the topology of the model can be adjusted may differ depending on the field in which the model is applied and the requirements for the model.

According to an embodiment, when it is determined that the topology of the model can be adjusted, the topology of the model may be updated, and the learning process described above may be repeated. The determination of whether the topology of the model can be adjusted may differ depending on the field in which the model is applied and the requirements for the model.

As another example, when the quality requirement is satisfied, the topology of the model may be adjusted to reduce the complexity of the model while still meeting the quality requirement, and the learning process described above may be repeated. Accordingly, when determining whether the topology can be adjusted, if the result of the perceptual quality assessment satisfies the preset quality requirement, the parameters may be relearned using a model whose complexity is reduced within the quality requirement.

According to an embodiment, signal processing may be performed on the first audio signal based on the parameters of the learned model. The signal processing may include, but is not limited to, compression, noise removal, encoding, and decoding.

FIG. 7 is a diagram illustrating an audio signal encoding method performed by an audio signal encoding apparatus to which a neural network is applied, according to an embodiment.

In step 701, the audio signal encoding apparatus may receive an input audio signal. A model learned through the neural network may be applied to the audio signal encoding apparatus. Here, the neural network may include an autoencoder.

In step 702, the audio signal encoding apparatus may generate a dimension-reduced latent vector of the input audio signal based on the parameters of the hidden layers learned using the neural network. The process of learning the parameters of the hidden layers has been described in detail above.

According to an embodiment, when adjustment of the topology of the model, including the number of hidden layers and the number of nodes, is impossible or unnecessary, the latent vector may be generated based on the learned parameters. The determination of whether the topology of the model can be adjusted may differ depending on the field in which the model is applied and the requirements for the model.

According to an embodiment, when the topology of the model, including the number of hidden layers and the number of nodes, can be adjusted, the latent vector may be generated based on parameters relearned by applying the adjusted topology. The determination of whether the topology of the model can be adjusted may differ depending on the field in which the model is applied and the requirements for the model.

For example, when the quality requirement is not satisfied, the topology of the model may be adjusted to increase the complexity of the model, and the process of learning the parameters may be repeated. Accordingly, when determining whether the topology can be adjusted, if the result of the perceptual quality assessment does not satisfy a preset quality requirement, the parameters may be relearned using a model with increased complexity. Alternatively, when the quality requirement is not satisfied but the complexity of the model cannot be increased any further, the parameters of the currently learned model may be stored and learning may be terminated.

As another example, when the quality requirement is satisfied, the topology of the model may be adjusted to reduce the complexity of the model while still meeting the quality requirement, and the process of learning the parameters may be repeated. Accordingly, when determining whether the topology can be adjusted, if the result of the perceptual quality assessment satisfies the preset quality requirement, the parameters may be relearned using a model whose complexity is reduced within the quality requirement.

In step 703, the audio signal encoding apparatus may encode the generated latent vector and output a bitstream. Here, the latent vector may be quantized or encoded and output as a bitstream for transmission through a transmission channel.

FIG. 8 is a diagram illustrating a learning process of a neural network applied to an audio signal decoding method using an audio signal decoding apparatus, according to an embodiment.

In step 801, the audio signal decoding apparatus may generate a masking threshold for a first audio signal before learning. Here, the first audio signal may denote a training audio signal for training the neural network.

In step 802, the audio signal decoding apparatus may calculate, based on the generated masking threshold, a weighting matrix to be applied to each frequency component of the first audio signal. Here, the weighting matrix may include the weights to be applied to the frequency components of the first audio signal, and each weight may be set to be inversely proportional to the masking threshold for the first audio signal and proportional to the magnitude of the corresponding frequency component of the first audio signal. That is, the weight may be determined by a signal-to-mask ratio (SMR), where the SMR may denote the ratio between the magnitude of each frequency component of the first audio signal and the magnitude of the masking threshold at that component.

In step 803, the audio signal decoding apparatus may generate a weighted error function by modifying a preset error function using the weighting matrix. Here, the weighted error function may be used with the weights applied to learn the parameters of the model.

In step 804, the audio signal decoding apparatus may generate a second audio signal by applying, to the first audio signal, the parameters learned using the weighted error function. Here, the parameters of the model may be learned using the weighted error function. For example, learning may be performed on the topology of an initial model, and the audio signal predicted using the topology of the model for which learning has been completed by an iterative learning algorithm may be output. Here, the predicted audio signal may denote the second audio signal.

According to an embodiment, signal processing such as compression, noise removal, encoding, and decoding may be performed on an input audio signal based on the parameters of the learned model. The signal processing is not limited to these.

도 9는 일 실시예에 따른, 뉴럴 네트워크가 적용된 오디오 신호 복호화 장치가 수행하는 오디오 신호 복호화 방법을 나타낸 도면이다.9 is a diagram illustrating an audio signal decoding method performed by an audio signal decoding apparatus to which a neural network is applied, according to an exemplary embodiment.

In step 901, the audio signal decoding apparatus may receive a bitstream in which a latent vector is encoded, the latent vector having been generated by applying parameters learned through a neural network to an input audio signal. A model learned through the neural network may be applied to the audio signal decoding apparatus. The neural network may include an autoencoder.

In step 902, the audio signal decoding apparatus may restore the latent vector from the received bitstream. The bitstream may have been generated by quantizing or encoding the latent vector for transmission over a transmission channel.

In step 903, the audio signal decoding apparatus may decode an output audio signal from the restored latent vector using a hidden layer to which the learned parameters are applied. The process of learning the parameters of the hidden layer has been described in detail above.
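Steps 902 and 903 can be sketched as follows. The sketch assumes an 8-bit uniform quantizer over the range [-1, 1], a single decoder layer with a tanh activation, and hand-picked weights; the names `dequantize` and `decode` and all numeric values are hypothetical and only illustrate the data flow from bitstream to output signal.

```python
import math


def dequantize(codes, lo=-1.0, hi=1.0):
    """Map 8-bit integer codes back to latent values in [lo, hi]."""
    return [lo + (hi - lo) * c / 255 for c in codes]


def decode(bitstream, weights, biases):
    """Restore the latent vector, then apply one decoder layer to it."""
    latent = dequantize(list(bitstream))          # step 902
    return [                                      # step 903
        math.tanh(sum(w * z for w, z in zip(row, latent)) + b)
        for row, b in zip(weights, biases)
    ]


# Two latent values expand to three output samples.
bitstream = bytes([0, 255])
layer_weights = [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]]
layer_biases = [0.0, 0.0, 0.0]
output = decode(bitstream, layer_weights, layer_biases)
print(output)
```

In a trained autoencoder the weights and biases would come from the learning process described above rather than being set by hand.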

FIG. 10 is a diagram illustrating an audio signal encoding apparatus to which a neural network is applied, according to an embodiment.

The audio signal encoding apparatus 1000 may include a processor 1010 and a memory 1020. The memory 1020 may include one or more instructions executable by the processor 1010.

The processor 1010 may receive an input audio signal. A model learned through a neural network may be applied to the processor 1010 of the audio signal encoding apparatus 1000. The neural network may include an autoencoder.

The processor 1010 of the audio signal encoding apparatus 1000 may generate a dimensionally reduced latent vector of the input audio signal based on the parameters of a hidden layer learned using the neural network. The process of learning the parameters of the hidden layer has been described in detail above.

The processor 1010 of the audio signal encoding apparatus 1000 may encode the generated latent vector and output a bitstream. The latent vector may be quantized or encoded and output as a bitstream for transmission over a transmission channel.
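The encoder path described above, from input signal to latent vector to bitstream, can be sketched as follows. The sketch assumes a single hidden layer with a tanh activation and an 8-bit uniform quantizer over [-1, 1]; the function name `encode`, the layer weights, and the sample signal are hypothetical and do not come from the embodiment.

```python
import math


def encode(signal, weights, biases):
    """Reduce the signal to a latent vector and quantize it to bytes."""
    # Hidden layer: each row of weights produces one latent value in (-1, 1).
    latent = [
        math.tanh(sum(w * x for w, x in zip(row, signal)) + b)
        for row, b in zip(weights, biases)
    ]
    # Uniform 8-bit quantization of [-1, 1] onto integer codes 0..255.
    codes = [round((z + 1.0) / 2.0 * 255) for z in latent]
    return latent, bytes(codes)


# Four input samples are reduced to a two-dimensional latent vector.
signal = [0.2, -0.1, 0.4, 0.0]
layer_weights = [[1.0, 1.0, 1.0, 1.0], [0.0, 0.0, 0.0, 0.0]]
layer_biases = [0.0, 0.0]
latent, bitstream = encode(signal, layer_weights, layer_biases)
print(latent, list(bitstream))
```

The number of rows in the weight matrix fixes the latent dimension and therefore the size of the bitstream per frame, which is why the model topology directly affects the bit rate.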

FIG. 11 is a diagram illustrating an audio signal decoding apparatus to which a neural network is applied, according to an embodiment.

The audio signal decoding apparatus 1100 may include a processor 1110 and a memory 1120. The memory 1120 may include one or more instructions executable by the processor 1110.

The processor 1110 may receive a bitstream in which a latent vector is encoded, the latent vector having been generated by applying parameters learned through a neural network to an input audio signal. A model learned through the neural network may be applied to the processor 1110 of the audio signal decoding apparatus 1100. The neural network may include an autoencoder.

The processor 1110 of the audio signal decoding apparatus 1100 may restore the latent vector from the received bitstream. The bitstream may have been generated by quantizing or encoding the latent vector for transmission over a transmission channel.

The processor 1110 of the audio signal decoding apparatus 1100 may decode an output audio signal from the restored latent vector using a hidden layer to which the learned parameters are applied. The process of learning the parameters of the hidden layer has been described in detail above.

According to an embodiment, when the topology of the model, which includes the number of hidden layers and the number of nodes, can be adjusted, the latent vector may be generated based on parameters relearned by applying the adjusted topology. Whether the topology of the model can be adjusted may depend on the field in which the model is applied and on the requirements placed on the model.
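The topology adjustment can be sketched as a simple search over the hidden layer width, following the rule recited in the claims: complexity is increased when the perceptual quality requirement is not met and reduced while the requirement is still met. The callback `train_and_score`, the step size, the width limits, and the toy scoring function are all hypothetical; in practice the callback would retrain the model and run a perceptual quality evaluation.

```python
def adjust_topology(train_and_score, hidden_units, quality_target,
                    step=8, min_units=8, max_units=256):
    """Return (hidden_units, score) for the smallest width found that meets
    quality_target. If max_units is reached first, the last result is returned.

    train_and_score(units) is a hypothetical callback that retrains the model
    with the given width and returns a perceptual quality score.
    """
    score = train_and_score(hidden_units)
    if score < quality_target:
        # Requirement not met: relearn with a model of increased complexity.
        while score < quality_target and hidden_units + step <= max_units:
            hidden_units += step
            score = train_and_score(hidden_units)
        return hidden_units, score
    # Requirement met: relearn with reduced complexity while it stays met.
    while hidden_units - step >= min_units:
        candidate = train_and_score(hidden_units - step)
        if candidate < quality_target:
            break
        hidden_units -= step
        score = candidate
    return hidden_units, score


def toy_score(units):
    """Stand-in for training plus evaluation: quality grows with width."""
    return min(4.5, 1.0 + units / 16.0)


grown = adjust_topology(toy_score, hidden_units=32, quality_target=3.5)
shrunk = adjust_topology(toy_score, hidden_units=64, quality_target=3.5)
print(grown, shrunk)
```

Both starting points end at the same width in this toy setting, which reflects the stated goal of reaching the required quality with the lowest model complexity.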

The embodiments described above may be implemented as hardware components, software components, and/or a combination of hardware and software components. For example, the devices, methods, and components described in the embodiments may be implemented using one or more general-purpose or special-purpose computers, such as a processor, a controller, an arithmetic logic unit (ALU), a digital signal processor, a microcomputer, a field programmable gate array (FPGA), a programmable logic unit (PLU), a microprocessor, or any other device capable of executing and responding to instructions. A processing device may run an operating system (OS) and one or more software applications that run on the operating system. The processing device may also access, store, manipulate, process, and generate data in response to execution of the software. For ease of understanding, the description sometimes refers to a single processing device, but a person of ordinary skill in the art will recognize that a processing device may include a plurality of processing elements and/or a plurality of types of processing elements. For example, a processing device may include a plurality of processors, or one processor and one controller. Other processing configurations, such as parallel processors, are also possible.

Software may include a computer program, code, instructions, or a combination of one or more of these, and may configure a processing device to operate as desired or may instruct the processing device independently or collectively. Software and/or data may be embodied, permanently or temporarily, in any type of machine, component, physical device, virtual equipment, computer storage medium or device, or transmitted signal wave, so as to be interpreted by the processing device or to provide instructions or data to the processing device. Software may be distributed over networked computer systems and stored or executed in a distributed manner. Software and data may be stored on one or more computer-readable recording media.

The method according to an embodiment may be implemented in the form of program instructions executable by various computer means and recorded on a computer-readable medium. The computer-readable medium may include program instructions, data files, data structures, and the like, alone or in combination. The program instructions recorded on the medium may be specially designed and configured for the embodiment, or may be known and available to those skilled in computer software. Examples of computer-readable recording media include magnetic media such as hard disks, floppy disks, and magnetic tape; optical media such as CD-ROMs and DVDs; magneto-optical media such as floptical disks; and hardware devices specially configured to store and execute program instructions, such as ROM, RAM, and flash memory. Examples of program instructions include machine code, such as that produced by a compiler, as well as high-level language code that a computer can execute using an interpreter or the like. The hardware devices described above may be configured to operate as one or more software modules in order to perform the operations of the embodiment, and vice versa.

Although the embodiments have been described with reference to a limited set of drawings, a person of ordinary skill in the art may apply various technical modifications and variations based on the above. For example, appropriate results may be achieved even if the described techniques are performed in an order different from the described method, and/or components of the described system, structure, device, circuit, and the like are coupled or combined in a form different from the described method, or are replaced or substituted by other components or equivalents.

110: input layer
120: output layer
130: hidden layer

Claims

A neural network learning method applied to an audio signal encoding method using an audio signal encoding apparatus, the learning method comprising:
generating a masking threshold for a first audio signal prior to learning;
calculating, based on the masking threshold, a weight matrix to be applied to each frequency component of the first audio signal;
generating a weighted error function by modifying a preset error function using the weight matrix; and
generating a second audio signal by applying parameters learned using the weighted error function to the first audio signal,
wherein the learning method performs a perceptual quality evaluation by comparing the generated second audio signal with the first audio signal, and
wherein the neural network determines, based on the perceptual quality evaluation, whether a topology included in a model can be adjusted.

The learning method of claim 1,
wherein the weight matrix includes weights to be applied to the frequency components of the first audio signal, and
each weight is set to be inversely proportional to the masking threshold for the first audio signal and proportional to the magnitude of the corresponding frequency component of the first audio signal.

(Deleted)

The learning method of claim 1,
wherein the perceptual quality evaluation includes an objective assessment using Perceptual Evaluation of Speech Quality (PESQ), Perceptual Objective Listening Quality Assessment (POLQA), or Perceptual Evaluation of Audio Quality (PEAQ), or
a subjective assessment using Mean Opinion Score (MOS) and Multiple Stimuli with Hidden Reference and Anchor (MUSHRA).

(Deleted)

The learning method of claim 1,
wherein determining whether the topology can be adjusted comprises:
when the result of the perceptual quality evaluation does not satisfy a preset quality requirement, relearning the parameters using a model with increased complexity; and
when the result of the perceptual quality evaluation satisfies the preset quality requirement, relearning the parameters using a model whose complexity is reduced within the quality requirement.

An audio signal encoding method performed by an audio signal encoding apparatus to which a neural network is applied, the method comprising:
receiving an input audio signal;
generating a dimensionally reduced latent vector of the input audio signal based on parameters of a hidden layer learned using the neural network, the neural network including one or more hidden layers; and
encoding the generated latent vector to output a bitstream,
wherein generating the dimensionally reduced latent vector of the input audio signal comprises:
generating the latent vector based on the learned parameters when adjusting the topology of a model, which includes the number of hidden layers and the number of nodes, is impossible or unnecessary; and
generating the latent vector based on parameters relearned by applying an adjusted topology when the topology of the model can be adjusted.

(Deleted)

The audio signal encoding method of claim 7,
wherein the encoding of the latent vector performs binarization into the bitstream for transmission through a channel.

A neural network learning method applied to an audio signal decoding method using an audio signal decoding apparatus, the learning method comprising:
generating a masking threshold for a first audio signal prior to learning;
calculating, based on the masking threshold, a weight matrix to be applied to each frequency component of the first audio signal;
generating a weighted error function by modifying a preset error function using the weight matrix; and
generating a second audio signal by applying parameters learned using the weighted error function to the first audio signal,
wherein the learning method performs a perceptual quality evaluation by comparing the generated second audio signal with the first audio signal, and
wherein the neural network determines, based on the perceptual quality evaluation, whether a topology included in a model can be adjusted.

The learning method of claim 10,
wherein the weight matrix includes weights to be applied to the frequency components of the first audio signal, and
each weight is set to be inversely proportional to the masking threshold for the first audio signal and proportional to the magnitude of the corresponding frequency component of the first audio signal.

(Deleted)

The learning method of claim 10,
wherein the perceptual quality evaluation includes an objective assessment using Perceptual Evaluation of Speech Quality (PESQ), Perceptual Objective Listening Quality Assessment (POLQA), or Perceptual Evaluation of Audio Quality (PEAQ), or
a subjective assessment using Mean Opinion Score (MOS) and Multiple Stimuli with Hidden Reference and Anchor (MUSHRA).

(Deleted)

The learning method of claim 10,
wherein determining whether the topology can be adjusted comprises:
when the result of the perceptual quality evaluation does not satisfy a preset quality requirement, relearning the parameters using a model with increased complexity; and
when the result of the perceptual quality evaluation satisfies the preset quality requirement, relearning the parameters using a model whose complexity is reduced within the quality requirement.

An audio signal decoding method performed by an audio signal decoding apparatus to which a neural network is applied, the method comprising:
receiving a bitstream in which a latent vector is encoded, the latent vector having been generated by applying parameters learned through the neural network to an input audio signal;
restoring the latent vector from the received bitstream; and
decoding an output audio signal from the restored latent vector using a hidden layer to which the learned parameters are applied, the neural network including one or more hidden layers,
wherein the latent vector is generated based on the learned parameters when adjusting the topology of a model, which includes the number of hidden layers and the number of nodes belonging to the hidden layer, is impossible or unnecessary, and
is generated based on parameters relearned by applying an adjusted topology when the topology can be adjusted.

(Deleted)

The audio signal decoding method of claim 16,
wherein the encoding of the latent vector performs binarization into the bitstream for transmission through a channel.

An audio signal encoding apparatus to which a neural network is applied,
the audio signal encoding apparatus comprising a processor and a memory including one or more instructions executable by the processor,
wherein, when the one or more instructions are executed by the processor, the processor:
receives an input audio signal;
generates a dimensionally reduced latent vector of the input audio signal based on parameters of a hidden layer learned using the neural network, the neural network including one or more hidden layers; and
encodes the generated latent vector to output a bitstream,
wherein the latent vector
is generated based on the learned parameters when adjusting the topology of a model, which includes the number of hidden layers and the number of nodes, is impossible or unnecessary, and
is generated based on parameters relearned by applying an adjusted topology when the topology can be adjusted.

An audio signal decoding apparatus to which a neural network is applied,
the audio signal decoding apparatus comprising a processor and a memory including one or more instructions executable by the processor,
wherein, when the one or more instructions are executed by the processor, the processor:
receives a bitstream in which a latent vector is quantized, the latent vector having been generated by applying parameters learned through the neural network to an input audio signal;
restores the latent vector from the received bitstream; and
decodes an output audio signal from the restored latent vector using a hidden layer to which the learned parameters are applied, the neural network including one or more hidden layers,
wherein the latent vector
is generated based on the learned parameters when adjusting the topology of a model, which includes the number of hidden layers and the number of nodes, is impossible or unnecessary, and
is generated based on parameters relearned by applying an adjusted topology when the topology can be adjusted.