KR101243897B1

KR101243897B1 - Blind Source separation method in reverberant environments based on estimation of time delay and attenuation of the signals

Info

Publication number: KR101243897B1
Application number: KR1020110061697A
Authority: KR
Inventors: 박형민; 이태준; 김민욱
Original assignee: 서강대학교산학협력단
Priority date: 2011-06-24
Filing date: 2011-06-24
Publication date: 2013-03-20
Anticipated expiration: 2031-06-24
Also published as: KR20130006857A

Abstract

The blind source separation method for reverberant environments based on estimation of the time delay and attenuation of signals according to the present invention comprises: receiving mixed signals from two or more microphones; transforming the mixed signals into the time-frequency domain by a short-time Fourier transform (STFT); for the STFT-domain mixed signals, initializing per-frequency attenuation and time delay values, training the initialized per-frequency values until they converge, generating per-frequency binary masks from the learned values, separating the signals frequency by frequency using the masks, and computing correlation coefficients of the per-frequency separated signals to align their order across frequencies; and restoring the order-aligned signals to time-domain source signals by an inverse short-time Fourier transform (ISTFT).

Description

Blind source separation method in reverberant environments based on estimation of time delay and attenuation of the signals

The present invention relates to blind source separation, and more particularly to a blind source separation method for reverberant environments based on estimation of the time delay and attenuation of signals, in which a separate attenuation and time delay value is estimated for each frequency to account for reverberation.

Humans can attend to and recognize a specific sound source in an environment containing many sources, so machines are likewise required to separate a specific source from a mixed signal for effective signal processing. Separating the individual source signals from a mixture of several sources is called blind source separation (BSS). Here, "blind" means that no information is available about either the original source signals or the mixing environment. The process of finally separating the source signals from the mixed signal is called demixing or unmixing.

Conventional blind source separation methods can be divided, according to the number of mixed signals used, into methods that use a single mixed signal and methods that use multiple mixed signals.

A method using a single mixed signal separates the sources by statistical inference from the mixed signal captured by one microphone. Because it cannot exploit spatial information, its performance is very poor.

A method using multiple mixed signals separates the sources from the mixed signals captured by several microphones, using spatial information in addition to statistical inference, and performs far better than a method using a single mixed signal.

Methods using multiple mixed signals are further classified by the relationship between the number of mixed signals used and the number of sources to be separated. Independent component analysis (ICA) and independent vector analysis (IVA) separate the sources from multiple mixed signals by statistical signal processing based mainly on the independence of the sources; their performance drops sharply when the number of sources exceeds the number of mixed signals. In contrast, beamforming, ESPRIT (Estimation of Signal Parameters via Rotational Invariance Technique), and DUET (Degenerate Unmixing Estimation Technique) rely mainly on the spatial information of the sources and therefore have the advantage of being able to separate the sources regardless of the number of mixed signals.

As described above, various blind source separation methods exist. Among them, DUET is a representative separation method operating in the time-frequency domain. Similarly to human binaural processing, it uses the relative time difference (interaural time difference, ITD) and level difference (interaural intensity difference, IID) between the mixed signals to separate the sources regardless of how many there are.

However, DUET separates the sources using, for each source, one attenuation value and one time delay value estimated identically for all frequencies. In a real reverberant environment the attenuation and time delay differ from frequency to frequency, so the quality of the separated sources is low.

The DUET method is described in more detail below.

Referring to FIG. 1, which illustrates the signals arriving at the two human ears, and FIG. 2, which illustrates the attenuation and time delay caused by a path difference, a human can locate a sound source from the acoustic signals at the two ears alone, because the ears are on opposite sides of the head. For low-frequency signals, diffraction, in which sound bends around an obstacle and propagates to the far side, causes differences in the level and arrival time of the signal at each ear. For high-frequency signals, reflection at the head rather than diffraction causes similar differences. The brain uses the relative level and time differences between the two ears to determine the direction of a source, and this allows a listener to concentrate on the signal from a desired direction among signals arriving from many directions.

The phenomenon in which a human uses two ears to perceive a desired source in an environment where many sources are mixed is called the cocktail party effect. Because this binaural processing works from only the two ear signals in a multi-source environment, it is very efficient and well suited to real acoustic environments.

The mixing model of the DUET method is described with reference to FIG. 3. Different source signals arrive at each of the two microphones along direct paths, and the mixed signals have relative attenuation and time delay values due to the difference in distance between each source and the two microphones. Under these conditions, when N source signals are present, the DUET mixing model can be expressed by Equations 1 and 2.

$$x_1(t) = \sum_{j=1}^{N} s_j(t) \qquad (1)$$

In Equation 1, $x_1(t)$ is the signal input to the first microphone, and $s_j(t)$ ($j = 1, \dots, N$) are the components of that signal due to each of the N sources.

$$x_2(t) = \sum_{j=1}^{N} a_j\, s_j(t - \delta_j) \qquad (2)$$

In Equation 2, $x_2(t)$ is the signal input to the second microphone, and $a_j$ and $\delta_j$ are, respectively, the relative attenuation and the time delay of the j-th source component in the second microphone signal with respect to the first microphone.
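The two-microphone delay-and-attenuation model of Equations 1 and 2 can be sketched as follows. This is a minimal illustration, not the patent's implementation: the function name `mix_anechoic` is hypothetical, and the relative delays are assumed to be non-negative whole numbers of samples so that no interpolation is needed.

```python
def mix_anechoic(sources, attenuations, delays):
    """Return (x1, x2) for a two-microphone delay-and-attenuation mixture.

    sources:      list of equal-length sample lists, one per source
    attenuations: relative attenuation of each source at microphone 2
    delays:       relative delay of each source at microphone 2,
                  in whole samples (assumed >= 0 in this sketch)
    """
    n = len(sources[0])
    # Microphone 1 is the reference: a plain sum of the sources.
    x1 = [sum(s[t] for s in sources) for t in range(n)]
    # Microphone 2 receives each source scaled and delayed.
    x2 = [0.0] * n
    for s, a, d in zip(sources, attenuations, delays):
        for t in range(d, n):
            x2[t] += a * s[t - d]
    return x1, x2
```

With fractional delays, the shift would have to be applied by interpolation or in the frequency domain instead.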

Applying the short-time Fourier transform (STFT) to this time-domain mixing model gives the time-frequency-domain mixing model of Equation 3.

$$\begin{bmatrix} X_1(\tau,\omega) \\ X_2(\tau,\omega) \end{bmatrix} = \begin{bmatrix} 1 & \cdots & 1 \\ a_1 e^{-i\omega\delta_1} & \cdots & a_N e^{-i\omega\delta_N} \end{bmatrix} \begin{bmatrix} S_1(\tau,\omega) \\ \vdots \\ S_N(\tau,\omega) \end{bmatrix} \qquad (3)$$

In Equation 3, $X_1(\tau,\omega)$ and $X_2(\tau,\omega)$ are the mixed signals in the time ($\tau$)-frequency ($\omega$) domain, $\delta_1, \dots, \delta_N$ are the time delays, $a_1, \dots, a_N$ are the attenuations, and $S_1(\tau,\omega), \dots, S_N(\tau,\omega)$ are the source signals in the time ($\tau$)-frequency ($\omega$) domain.
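The transformation to the time-frequency domain can be illustrated with a small STFT sketch. The frame length, hop size, and Hann window are assumptions chosen for illustration (the text does not specify them), and a naive DFT is used for readability where a practical system would use an FFT.

```python
import cmath
import math


def stft(x, frame_len, hop):
    """Return frames[tau][k]: the one-sided DFT of each windowed frame."""
    # Periodic Hann window (an assumed choice, not taken from the source).
    window = [0.5 - 0.5 * math.cos(2 * math.pi * n / frame_len)
              for n in range(frame_len)]
    frames = []
    for start in range(0, len(x) - frame_len + 1, hop):
        seg = [x[start + n] * window[n] for n in range(frame_len)]
        # Naive O(N^2) DFT over the non-negative frequency bins.
        spec = [sum(seg[n] * cmath.exp(-2j * math.pi * k * n / frame_len)
                    for n in range(frame_len))
                for k in range(frame_len // 2 + 1)]
        frames.append(spec)
    return frames
```

Bin `k` corresponds to the angular frequency `2 * pi * k / frame_len` radians per sample, which is the value of the frequency variable used when a delay is expressed in samples.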

<WDO (W-Disjoint Orthogonality)>

DUET applies the W-disjoint orthogonality (WDO) assumption, which holds that at each time-frequency point only one of the sources making up the mixed signal is dominant, as expressed in Equation 4.

$$S_j(\tau,\omega)\, S_k(\tau,\omega) = 0, \quad \forall\, \tau, \omega,\ j \neq k \qquad (4)$$

In Equation 4, $S_j(\tau,\omega)$ and $S_k(\tau,\omega)$ are different source components at time ($\tau$)-frequency ($\omega$).

Under the WDO assumption, every time-frequency component of the mixed signal arriving at a microphone is associated with only one source signal. The assumption does not hold exactly for real acoustic signals, but it is known to hold well for speech signals.

Applying the WDO assumption to the time-frequency mixing model of Equation 3 gives Equation 5, in which each time-frequency component consists only of the single dominant source signal.

$$\begin{bmatrix} X_1(\tau,\omega) \\ X_2(\tau,\omega) \end{bmatrix} = \begin{bmatrix} 1 \\ a_j e^{-i\omega\delta_j} \end{bmatrix} S_j(\tau,\omega) \qquad (5)$$

In Equation 5, j indicates the single dominant source signal.

Using the mixed signals under the WDO assumption, the relative attenuation and time delay of the signals input to the microphones are estimated by Equation 6.

$$\hat{a}(\tau,\omega) = \left| \frac{X_2(\tau,\omega)}{X_1(\tau,\omega)} \right|, \qquad \hat{\delta}(\tau,\omega) = -\frac{1}{\omega}\, \angle \frac{X_2(\tau,\omega)}{X_1(\tau,\omega)} \qquad (6)$$

In Equation 6, $\hat{a}(\tau,\omega)$ is the estimated attenuation and $\hat{\delta}(\tau,\omega)$ is the estimated time delay.
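The per-point estimate can be sketched as below, assuming the single-source relation in which the second mixture equals the first multiplied by `a * exp(-1j * omega * delta)`. The function name is hypothetical, and the delay is recovered unambiguously only while `|omega * delta| < pi`, since the phase wraps beyond that.

```python
import cmath


def estimate_attenuation_delay(X1, X2, omega):
    """Estimate relative attenuation and delay at one time-frequency point.

    X1, X2: complex STFT values of the two mixtures at the same point
    omega:  angular frequency of the bin (must be non-zero)
    """
    ratio = X2 / X1
    a_hat = abs(ratio)                          # magnitude ratio
    delta_hat = -cmath.phase(ratio) / omega     # phase difference over omega
    return a_hat, delta_hat
```

The DC bin (`omega == 0`) carries no delay information and has to be skipped by the caller.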

Equation 6 yields attenuation and time delay values for every time-frequency component, and accumulating these values produces an attenuation and time delay histogram over all time-frequency components, as shown in FIG. 4. The peaks of the histogram give the attenuation and time delay of each source signal making up the mixed signals, and these values are used as the initial values of the learning process for the attenuation and time delay.
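A rough sketch of the histogram step follows. It is a simplification under stated assumptions: the bin widths `a_step` and `d_step` are hypothetical parameters, each estimate counts equally (no energy weighting), and the most populated bins are taken as the peaks, which does not guard against two adjacent bins belonging to the same peak.

```python
from collections import Counter


def histogram_peaks(estimates, a_step, d_step, num_sources):
    """Bin (attenuation, delay) estimates and return the fullest bin centers."""
    counts = Counter()
    for a, d in estimates:
        # Quantize each estimate to the index of its nearest bin center.
        counts[(round(a / a_step), round(d / d_step))] += 1
    top = counts.most_common(num_sources)
    return [(ia * a_step, idd * d_step) for (ia, idd), _ in top]
```

The returned pairs would then serve as starting points for the learning stage described next.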

<Learning the attenuation and time delay values>

If the j-th source signal is dominant at a particular time-frequency point of the mixed signals, as in Equation 5, the relationship between the two mixed signals is given by Equation 7. Normalizing this by the attenuation gives Equation 8.

$$X_2(\tau,\omega) = a_j e^{-i\omega\delta_j} X_1(\tau,\omega) \qquad (7)$$

$$\rho(\hat{a}_j, \hat{\delta}_j, \tau, \omega) = \frac{\left| \hat{a}_j e^{-i\omega\hat{\delta}_j} X_1(\tau,\omega) - X_2(\tau,\omega) \right|^2}{1 + \hat{a}_j^2} \qquad (8)$$

In Equation 8, $\rho(\hat{a}_j, \hat{\delta}_j, \tau, \omega)$ is the attenuation-normalized value, $\hat{a}_j$ is the estimated attenuation for the j-th source signal, and $\hat{\delta}_j$ is the estimated time delay for the j-th source signal.

Because it is not known in advance which source signal is dominant at a given time-frequency point of the mixed signals, the cost function of Equation 9 is constructed from Equation 8.

$$J = \sum_{\omega} \min_{j}\, \rho(\hat{a}_j, \hat{\delta}_j, \tau, \omega) \qquad (9)$$

In Equation 9, $J$ is the cost function. It identifies which source signal is dominant at each time-frequency point and measures the accuracy of the attenuation and time delay estimated for that source signal.

Since Equation 9 is not differentiable, it is approximated by the differentiable, continuous cost function of Equation 10.

$$J \approx \sum_{\omega} -\frac{1}{\lambda} \ln \sum_{j=1}^{N} e^{-\lambda\, \rho(\hat{a}_j, \hat{\delta}_j, \tau, \omega)} \qquad (10)$$

In Equation 10, $\lambda$ is a parameter that determines the smoothness of the continuous approximation. To estimate the attenuation and time delay of every source frame by frame, a stochastic gradient descent algorithm that seeks the minimum of the cost function J of Equation 10 is applied. This requires the partial derivatives of J with respect to the attenuation and the time delay, which are given in Equations 11 and 12.

Based on the partial derivatives obtained from Equations 11 and 12, the attenuation and time delay values are updated as in Equations 13 and 14, and the update is repeated until the cost function J reaches its minimum and the attenuation and time delay values converge.

In Equations 13 and 14, the learning rates for the attenuation and for the time delay are each given as constants.
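The learning loop can be sketched for a single time-frequency point as follows. This is an illustration under assumptions, not the patent's Equations 11 to 14: the per-source cost is taken to be `|a * exp(-1j*omega*d) * X1 - X2|**2 / (1 + a*a)`, the non-differentiable minimum over sources is replaced by the soft minimum `-(1/lam) * log(sum(exp(-lam * rho_j)))`, and the gradients are derived analytically from those two expressions. The names `lam`, `lr_a`, and `lr_d` are hypothetical.

```python
import cmath
import math


def rho(a, d, X1, X2, omega):
    """Attenuation-normalized mismatch of one (a, d) candidate."""
    e = a * cmath.exp(-1j * omega * d) * X1 - X2
    return abs(e) ** 2 / (1.0 + a * a)


def smooth_cost(params, X1, X2, omega, lam):
    """Soft minimum of rho over all candidates (shifted for stability)."""
    rhos = [rho(a, d, X1, X2, omega) for a, d in params]
    m = min(rhos)
    return m - math.log(sum(math.exp(-lam * (r - m)) for r in rhos)) / lam


def sgd_step(params, X1, X2, omega, lam, lr_a, lr_d):
    """One gradient step on the soft-minimum cost for every candidate."""
    rhos = [rho(a, d, X1, X2, omega) for a, d in params]
    m = min(rhos)
    weights = [math.exp(-lam * (r - m)) for r in rhos]
    total = sum(weights)
    updated = []
    for (a, d), w in zip(params, weights):
        zx1 = cmath.exp(-1j * omega * d) * X1
        e = a * zx1 - X2
        g = e.conjugate() * zx1
        denom = 1.0 + a * a
        drho_da = 2.0 * g.real / denom - 2.0 * a * abs(e) ** 2 / denom ** 2
        drho_dd = 2.0 * a * omega * g.imag / denom
        share = w / total  # soft assignment of this point to the candidate
        updated.append((a - lr_a * share * drho_da,
                        d - lr_d * share * drho_dd))
    return updated
```

Because the soft assignment `share` is close to 1 for the best-matching candidate and close to 0 for the others, each point mostly updates the source that currently explains it best.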

<Binary mask generation>

The learned attenuation and time delay values are substituted into Equation 8, and the j-th source signal is judged to be dominant at a time-frequency point if its resulting value is the smallest among all sources. If the j-th source signal is dominant at that point, then under the WDO assumption the mask for separating the j-th source signal is set to 1 at that time-frequency point and to 0 at all other points. The resulting binary mask follows Equation 15.

$$M_j(\tau,\omega) = \begin{cases} 1, & \rho(\hat{a}_j, \hat{\delta}_j, \tau, \omega) \le \rho(\hat{a}_k, \hat{\delta}_k, \tau, \omega) \quad \forall\, k \neq j \\ 0, & \text{otherwise} \end{cases} \qquad (15)$$

In Equation 15, $M_j(\tau,\omega)$ is the binary mask, and the condition means that the attenuation-normalized value $\rho(\hat{a}_j, \hat{\delta}_j, \tau, \omega)$ for the j-th source signal is the smallest, that is, no larger than the values $\rho(\hat{a}_k, \hat{\delta}_k, \tau, \omega)$ for the other source signals.
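Mask generation can be sketched as an arg-min over sources at each time-frequency point. The sketch assumes the attenuation-normalized mismatch `|a * exp(-1j*omega*d) * X1 - X2|**2 / (1 + a*a)` as the decision value; the function names and the nested-list layout `[frame][bin]` are hypothetical, and ties are resolved in favor of the lowest source index.

```python
import cmath


def mismatch(a, d, X1, X2, omega):
    """Attenuation-normalized mismatch of one source hypothesis."""
    e = a * cmath.exp(-1j * omega * d) * X1 - X2
    return abs(e) ** 2 / (1.0 + a * a)


def binary_masks(X1, X2, omegas, params):
    """Return masks[j][frame][bin], 1 where source j fits best, else 0.

    X1, X2: STFT mixtures indexed [frame][bin]
    omegas: angular frequency of each bin
    params: one (attenuation, delay) pair per source
    """
    masks = [[[0] * len(omegas) for _ in X1] for _ in params]
    for t, (row1, row2) in enumerate(zip(X1, X2)):
        for k, omega in enumerate(omegas):
            scores = [mismatch(a, d, row1[k], row2[k], omega)
                      for a, d in params]
            best = scores.index(min(scores))
            masks[best][t][k] = 1
    return masks
```

Since exactly one mask is 1 at every point, multiplying the first mixture by each mask partitions it among the sources.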

<Source separation>

The binary mask generated for each source signal is applied to the mixed signal to separate the source signals, as in Equation 16.

$$\tilde{S}_j(\tau,\omega) = M_j(\tau,\omega)\, X_1(\tau,\omega) \qquad (16)$$

In Equation 16, $\tilde{S}_j(\tau,\omega)$ is the j-th separated source signal, $M_j(\tau,\omega)$ is the binary mask for the j-th source signal, and $X_1(\tau,\omega)$ is the mixed signal from the first microphone.

The signals separated by the binary masks are restored to time-domain source signals by the inverse short-time Fourier transform (ISTFT).

DUET assumes that each source signal reaches the microphones along the direct path between source and microphone, and that the attenuation and time delay arise from the difference between these direct paths. The mixing filter between a source and a microphone assumed by DUET therefore has a single attenuation and time delay value corresponding to the direct path, as shown in FIG. 5.

In a real reverberant environment, however, the source signal is reflected off various objects after it is emitted, as shown in FIG. 6, and reaches the microphone over many paths besides the direct one, each with a different attenuation and time delay. The filter to be applied is therefore not the simple shape of FIG. 5 but one with many attenuation and time delay values, as shown in FIG. 7.

Consequently, DUET, which estimates only the relative attenuation and time delay of a simple direct path, is not suited to real reverberant environments. In experiments, the performance of DUET varies widely with the environment and drops sharply as the reverberation becomes stronger. DUET thus separates sources adequately in an ideal anechoic environment but has the limitation that it cannot be applied to real reverberant environments.

An object of the present invention is to provide a blind source separation method for reverberant environments based on estimation of the time delay and attenuation of signals, which separates the source signals by estimating a different attenuation and time delay value for each frequency to account for reverberation.

Another object of the present invention is to provide such a method that performs initial value estimation based on cluster separation and permutation correction based on the correlation coefficients of spectral envelopes, in order to resolve the data scarcity and permutation (order-swapping) problems that arise from estimating a different attenuation and time delay value for each frequency.

To achieve these objects, the blind source separation method of the present invention comprises: receiving mixed signals from two or more microphones; transforming the mixed signals into the time-frequency domain by a short-time Fourier transform (STFT); for the STFT-domain mixed signals, initializing per-frequency attenuation and time delay values, training the initialized per-frequency values until they converge, generating per-frequency binary masks from the learned values, separating the signals frequency by frequency using the masks, and computing correlation coefficients of the per-frequency separated signals to align their order across frequencies; and restoring the order-aligned signals to time-domain source signals by an inverse short-time Fourier transform (ISTFT).

By estimating a different attenuation and time delay value for each frequency to account for reverberation, the present invention improves the separation performance for blind source signals.

In addition, by performing initial value estimation based on cluster separation and permutation correction based on the correlation coefficients of spectral envelopes, the present invention resolves the data scarcity and permutation problems that arise from estimating a different attenuation and time delay value for each frequency.

FIG. 1 illustrates the signals arriving at the two human ears.
FIG. 2 illustrates the attenuation and time delay of a signal caused by a path difference.
FIG. 3 illustrates the mixed-signal model of the DUET method.
FIG. 4 shows an attenuation and time delay histogram.
FIG. 5 illustrates the filter model of the DUET method.
FIG. 6 illustrates the mixed-signal model in a reverberant environment.
FIG. 7 illustrates the filter model in a real reverberant environment.
FIG. 8 is a block diagram of the blind source separation apparatus according to the present invention.
FIG. 9 is a flowchart of the blind source separation method according to the present invention.
FIG. 10 illustrates the permutation (order-swapping) problem.
FIG. 11 illustrates the correlation coefficient calculation process.
FIG. 12 illustrates the process of setting the frequency order in which order alignment is performed according to the magnitude of the correlation coefficients.
FIG. 13 illustrates the order alignment process based on comparison of correlation coefficients.
FIG. 14 illustrates the comparison between the reference value and the entire order-adjusted signal.

<Configuration of the blind source separation apparatus>

The configuration of the blind source separation apparatus for reverberant environments based on estimation of the time delay and attenuation of signals, according to a preferred embodiment of the present invention, is described with reference to FIG. 8.

The blind source separation apparatus comprises first and second microphones (100, 102), a short-time Fourier transformer (STFT) (104), a blind source separation unit (106), and an inverse short-time Fourier transformer (ISTFT) (108).

The first and second microphones (100, 102) each output a mixed signal corresponding to the incoming audio. Because the source signals reach the first and second microphones (100, 102) over various paths in a real reverberant environment, the microphone output signals can be expressed in terms of the source signals and the reverberation filters $h_{ij}$ between the sources and the microphones, as in Equations 17 and 18, where $*$ denotes convolution.

$$x_1(t) = \sum_{j=1}^{N} h_{1j}(t) * s_j(t) \qquad (17)$$

In Equation 17, $x_1(t)$ is the output signal of the first microphone (100), $h_{1j}(t)$ are the reverberation filters between the first microphone (100) and the sources, and $s_j(t)$ are the source signals.

$$x_2(t) = \sum_{j=1}^{N} h_{2j}(t) * s_j(t) \qquad (18)$$

In Equation 18, $x_2(t)$ is the output signal of the second microphone (102), $h_{2j}(t)$ are the reverberation filters between the second microphone (102) and the sources, and $s_j(t)$ are the source signals.
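The reverberant mixing model, in which each microphone signal is the sum over sources of the source convolved with a source-to-microphone filter, can be sketched as follows. The function names and the layout `filters[i][j]` (impulse response from source j to microphone i) are hypothetical, and the short impulse responses in the usage are made-up values for illustration only.

```python
def convolve(h, s):
    """Full discrete convolution of impulse response h with signal s."""
    y = [0.0] * (len(h) + len(s) - 1)
    for i, hv in enumerate(h):
        for k, sv in enumerate(s):
            y[i + k] += hv * sv
    return y


def mix_reverberant(sources, filters):
    """Return one mixed signal per microphone.

    sources:       list of source sample lists
    filters[i][j]: impulse response from source j to microphone i
    """
    outputs = []
    for mic_filters in filters:
        length = max(len(h) + len(s) - 1
                     for h, s in zip(mic_filters, sources))
        x = [0.0] * length
        for h, s in zip(mic_filters, sources):
            for t, v in enumerate(convolve(h, s)):
                x[t] += v
        outputs.append(x)
    return outputs
```

A single-tap filter reproduces the direct-path case, while additional taps model reflected paths with their own delays and gains.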

The STFT (104) receives the output signals of the first and second microphones (100, 102), applies the short-time Fourier transform, and outputs mixed signals in the time-frequency domain.

The blind source separation unit (106) receives the output signals of the STFT (104), separates them into source signals, and outputs the result.

The ISTFT (108) receives the separated source signals, applies the inverse short-time Fourier transform, and outputs the restored time-domain source signals.

<Blind source separation procedure>

The processing performed by the blind source separation unit (106) of the blind source separation apparatus according to a preferred embodiment of the present invention is described with reference to FIG. 9.

The blind source separation unit (106) initializes the per-frequency attenuation and time delay values for the input mixed signals (step 200).

The blind source separation unit (106) then trains the per-frequency attenuation and time delay values (step 202) and generates a per-frequency binary mask from the learned values (step 204).

The blind source separation unit (106) then separates the signals frequency by frequency using the per-frequency binary masks (step 206), computes correlation coefficients of the per-frequency separated signals to align their order (step 208), and adjusts the aligned order so that it is optimized before output (step 210).
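The idea behind the order-alignment step can be illustrated with a deliberately simplified sketch. It assumes exactly two separated outputs, uses the Pearson correlation of their magnitude envelopes over frames, and walks the frequency bins in ascending order with the previously aligned bin as reference. The procedure of this document orders the frequencies by correlation magnitude and then refines the result, which this sketch does not attempt. All names are hypothetical.

```python
import math


def pearson(u, v):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(u)
    mu, mv = sum(u) / n, sum(v) / n
    cov = sum((a - mu) * (b - mv) for a, b in zip(u, v))
    su = math.sqrt(sum((a - mu) ** 2 for a in u))
    sv = math.sqrt(sum((b - mv) ** 2 for b in v))
    return cov / (su * sv) if su > 0 and sv > 0 else 0.0


def align_two_sources(env_a, env_b):
    """Decide, per frequency bin, whether the two outputs must be swapped.

    env_a[k], env_b[k]: magnitude envelopes over frames at bin k
    Returns a list of booleans, True where bin k should be swapped.
    """
    swapped = [False]
    ref_a, ref_b = env_a[0], env_b[0]
    for k in range(1, len(env_a)):
        keep = pearson(ref_a, env_a[k]) + pearson(ref_b, env_b[k])
        swap = pearson(ref_a, env_b[k]) + pearson(ref_b, env_a[k])
        do_swap = swap > keep
        swapped.append(do_swap)
        # The aligned envelopes of this bin become the next reference.
        if do_swap:
            ref_a, ref_b = env_b[k], env_a[k]
        else:
            ref_a, ref_b = env_a[k], env_b[k]
    return swapped
```

The comparison relies on the observation that the envelopes of one source tend to rise and fall together across frequencies, whereas envelopes of different sources are only weakly correlated.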

<Detailed description of the blind source separation procedure>

The blind source separation procedure is described in more detail below.

<Mixing model>

In a real reverberant environment, the mixed signals arriving at the first and second microphones (100, 102) over various paths can be expressed using the reverberation filters $h_{ij}$ between the sources and the microphones, as in Equations 17 and 18.

Taking the STFT of these mixed signals to convert them to the time-frequency domain, and expanding the result in terms of the relative attenuation and time delay between the two mixed signals, gives Equation 19.

$$\begin{bmatrix} X_1(\tau,\omega) \\ X_2(\tau,\omega) \end{bmatrix} = \begin{bmatrix} 1 & \cdots & 1 \\ a_1(\omega) e^{-i\omega\delta_1(\omega)} & \cdots & a_N(\omega) e^{-i\omega\delta_N(\omega)} \end{bmatrix} \begin{bmatrix} H_{11}(\omega) S_1(\tau,\omega) \\ \vdots \\ H_{1N}(\omega) S_N(\tau,\omega) \end{bmatrix}, \qquad a_j(\omega) e^{-i\omega\delta_j(\omega)} = \frac{H_{2j}(\omega)}{H_{1j}(\omega)} \qquad (19)$$

In Equation 19, $X_i(\tau,\omega)$ is the mixed signal in the time-frequency domain, $H_{ij}(\omega)$ is the reverberation filter in the time-frequency domain, and $S_j(\tau,\omega)$ is the source signal in the time-frequency domain.

The values $a_1(\omega), \dots, a_N(\omega)$ are the attenuations for the N source signals, and $\delta_1(\omega), \dots, \delta_N(\omega)$ are the time delays for the N source signals.

Applying the WDO assumption to Equation 19, that is, using the fact that only one source signal is dominant at each time-frequency component of the mixed signal, removes all signal components other than the dominant source, so the time-frequency mixing model can be expressed as Equation 20.

따라서 동일 음원 신호에 대해서도 주파수마다 서로 다른 감쇄 및 시간 지연 값을 갖게 되므로 전체 주파수에 대해 하나의 감쇄 및 시간 지연 값을 찾는 기존의 DUET 방법은 실제 반향 환경에 적용할 수 없다. 이러한 문제를 해결하고자 본 발명은 음원 신호의 모든 주파수에 대한 감쇄 및 시간 지연 값을 추정한다. Therefore, the same sound source signal has different attenuation and time delay values for each frequency, and thus the existing DUET method for finding one attenuation and time delay value for the entire frequency cannot be applied to an actual echo environment. To solve this problem, the present invention estimates the attenuation and time delay values for all frequencies of the sound source signal.

Equation 21 indicates that the attenuation and time delay values are estimated separately for every frequency (ω).
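As a rough illustration of per-frequency estimation, the following Python sketch computes a relative attenuation and a relative time delay for every time-frequency point from the ratio of the two STFTs, following the standard DUET formulation. Equations 6 and 21 are not reproduced in this text, so the formulas, the function name `relative_attenuation_delay`, and the array layout are assumptions rather than the patent's exact expressions.

```python
import numpy as np

def relative_attenuation_delay(X1, X2, omega, eps=1e-12):
    """Per-point attenuation and delay between two STFTs (assumed DUET form).

    X1, X2 : complex arrays of shape (n_freq, n_frames)
    omega  : nonzero angular frequencies, shape (n_freq,), in rad/sample
    """
    ratio = X2 / (X1 + eps)                      # eps guards against division by zero
    atten = np.abs(ratio)                        # relative attenuation
    delay = -np.angle(ratio) / omega[:, None]    # relative delay in samples
    return atten, delay
```

Each row of the returned arrays holds the estimates of one frequency, which is the data a per-frequency clustering step would operate on. The delay is only unambiguous while |ω·δ| stays below π.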

<Initialization of Attenuation and Time Delay Values>

As described above, the present invention must estimate different attenuation and time delay values at every frequency for each sound source signal, so the initial attenuation and time delay values must also be given separately for every frequency. However, because the initial value for each frequency is obtained from the data of that frequency alone, far fewer attenuation and time delay estimates are available than in the conventional DUET method, which builds a histogram from the estimates of all time-frequency components. The present invention therefore adopts the vector quantization method proposed by Linde, Buzo, and Gray (LBG). The LBG method combines binary splitting with k-means clustering: centers are obtained by binary splitting and are then used as the initial values for k-means clustering.

Because it combines the two, the LBG method has both the fast computation of binary splitting and the accuracy of k-means clustering. The k-means clustering method selects its initial centers arbitrarily and is sensitive to that choice. The LBG method compensates for this weakness by obtaining the centers through binary splitting and using them as the initial values of the k-means method.

For the binary splitting step of the LBG method, the attenuation and time delay values obtained by applying Equation 6 at one frequency are defined as a single cluster, and its center is found according to Equation 22.

In Equation 22, the defined quantities are the center value and the number of attenuation and time delay values included in the cluster.

In Equation 23, each data point is a pair of attenuation and time delay values, composed of the estimated attenuation rate and the estimated time delay value.

Once the center of the attenuation and time delay values has been obtained for each frequency, two center values slightly shifted from that center are obtained according to Equation 24. The two shifted centers are needed in order to split the data into two clusters.

In Equation 24, ε is a small positive constant that determines the width of the shift. The equation takes the center value of the l-th cluster, which is the cluster to be split among a total of m clusters, and produces two new shifted center values so that the data are divided into a total of m+1 clusters.

The two center values obtained in this way are given as initial attenuation and time delay values, and the center values are then updated by the k-means clustering method. By applying Equation 24 to a cluster with large variance, initial attenuation and time delay values can be set for as many clusters as there are sound source signals.
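The initialization described above can be sketched for a single frequency as follows. Equations 22 to 24 are not reproduced in this text, so the sketch assumes the centroid is the arithmetic mean and that the split uses the classic LBG perturbation c(1 ± ε); the function names `assign`, `kmeans`, and `lbg_init` are hypothetical.

```python
import numpy as np

def assign(points, centers):
    """Index of the nearest center for every (attenuation, delay) pair."""
    dist = np.linalg.norm(points[:, None, :] - centers[None, :, :], axis=2)
    return dist.argmin(axis=1)

def kmeans(points, centers, n_iter=20):
    """Plain k-means refinement starting from the given centers."""
    centers = centers.copy()
    for _ in range(n_iter):
        labels = assign(points, centers)
        for k in range(len(centers)):
            if np.any(labels == k):
                centers[k] = points[labels == k].mean(axis=0)
    return centers, assign(points, centers)

def lbg_init(points, n_sources, eps=0.01):
    """LBG-style initial centers for one frequency; points has shape (T, 2)."""
    centers = points.mean(axis=0, keepdims=True)      # one cluster, centroid
    labels = np.zeros(len(points), dtype=int)
    while len(centers) < n_sources:
        # Split the cluster with the largest variance.
        spread = [points[labels == k].var(axis=0).sum() if np.any(labels == k) else -1.0
                  for k in range(len(centers))]
        l = int(np.argmax(spread))
        c = centers[l]
        shifted = np.stack([c * (1 + eps), c * (1 - eps)])   # assumed form of the shift
        centers = np.vstack([np.delete(centers, l, axis=0), shifted])
        centers, labels = kmeans(points, centers)
    return centers
```

Calling `lbg_init(points, n_sources=N)` on the estimates of one frequency returns N centers that could serve as starting values for the learning stage. Note that a multiplicative shift leaves a coordinate unchanged when the center value in that coordinate is exactly zero, so an additive shift may be preferable for delays near zero.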

<Learning of Attenuation and Time Delay Values>

The attenuation and time delay values are likewise learned separately for each frequency, and are estimated according to Equation 25.

In Equation 25, the defined quantities are the attenuation-normalized value for the single dominant sound source signal (j), the attenuation value, the mixed signal at the first microphone (100), the mixed signal at the second microphone (102), and the time delay value.

According to Equations 26 to 28, the attenuation and time delay values are computed individually for each frequency, rather than as values accumulated over all frequencies.

In Equation 26, the defined quantities are the cost function, the parameter that determines the degree of smoothness of the continuous function used in the approximation, and the attenuation-normalized values for the first to N-th sound source signals.

Equation 27 is the partial derivative of the cost function with respect to the attenuation value.

Equation 28 is the partial derivative of the cost function with respect to the time delay value.

Using the derivatives obtained for each frequency, the attenuation and time delay values of that frequency are updated as in Equations 29 and 30, and the learning process is repeated until the values converge. Because the energy differs from one frequency to another, the energy of the signal at each frequency is computed as in Equation 31 and a learning rate is assigned according to its magnitude.

In Equation 29, the defined quantities are the learning rate for the attenuation rate, which depends on the magnitude of the per-frequency energy, and the attenuation value.

In Equation 30, the defined quantities are the learning rate for the time delay, which depends on the magnitude of the per-frequency energy, and the time delay value.

The learning rates for the attenuation rate and the time delay are determined by Equation 31.

In Equation 31, β(ω) is the learning rate factor that depends on the magnitude of the energy at each frequency, and two further factors convert it into the learning rates for the attenuation rate and the time delay, respectively. The remaining quantities are the total energy of the mixed signals from the first and second microphones (100, 102), a constant used to set the learning rate factor, and the maximum of the per-frequency energy values.
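The learning stage can be sketched for a single frequency as one gradient step on a smoothed cost. Equations 25 to 31 are not reproduced in this text, so the sketch assumes the cost used in the standard online DUET formulation: a normalized residual ρ_j = |a_j e^{-iωδ_j} X1 − X2|² / (1 + a_j²) and a smooth minimum J = Σ_τ −(1/λ) ln Σ_j exp(−λ ρ_j). The names `rho`, `cost_and_grad`, `update_step`, `lr_a`, and `lr_d` are hypothetical, and the learning rates are plain parameters rather than the energy-dependent rates of Equation 31.

```python
import numpy as np

def rho(a, d, X1, X2, w):
    """Assumed normalized residual per source and frame, shape (N, T)."""
    A = a[:, None]
    Y = A * np.exp(-1j * w * d[:, None]) * X1[None, :] - X2[None, :]
    return np.abs(Y) ** 2 / (1.0 + A ** 2)

def cost_and_grad(a, d, X1, X2, w, lam=10.0):
    """Smoothed cost at one frequency and its gradients w.r.t. a and d."""
    A = a[:, None]
    z = -lam * rho(a, d, X1, X2, w)
    zmax = z.max(axis=0, keepdims=True)              # for a stable log-sum-exp
    expz = np.exp(z - zmax)
    J = -np.sum(zmax[0] + np.log(expz.sum(axis=0))) / lam
    wts = expz / expz.sum(axis=0, keepdims=True)     # soft assignment of frames
    c = np.exp(-1j * w * d[:, None]) * X1[None, :] * np.conj(X2)[None, :]
    p1 = np.abs(X1)[None, :] ** 2
    p2 = np.abs(X2)[None, :] ** 2
    num = A ** 2 * p1 + p2 - 2 * A * c.real          # equals |Y|^2
    drho_da = ((2 * A * p1 - 2 * c.real) * (1 + A ** 2) - 2 * A * num) / (1 + A ** 2) ** 2
    drho_dd = -2 * A * w * c.imag / (1 + A ** 2)
    return J, (wts * drho_da).sum(axis=1), (wts * drho_dd).sum(axis=1)

def update_step(a, d, X1, X2, w, lr_a, lr_d, lam=10.0):
    """One gradient-descent update of the per-frequency parameters."""
    J, grad_a, grad_d = cost_and_grad(a, d, X1, X2, w, lam)
    return a - lr_a * grad_a, d - lr_d * grad_d, J
```

In use, `update_step` would be called repeatedly for each frequency until the parameters stop changing, with `lr_a` and `lr_d` scaled according to that frequency's energy.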

<Permutation Alignment>

Generating a binary mask from the attenuation and time delay values that have finally converged through learning, and separating the sound sources from the mixed signals with that mask, is the same as in the conventional DUET method. However, because the present invention obtains the attenuation and time delay values independently at each frequency, a permutation problem can arise in which the order of the separated sound sources differs from one frequency to another, as shown in FIG. 10. When this happens, a single restored signal contains different sound sources at different frequencies, and the sources cannot be regarded as properly separated.
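The mask generation mentioned above can be sketched as follows. The sketch assumes the maximum-likelihood style DUET rule, in which each time-frequency point is assigned to the source whose parameters give the smallest normalized residual; the function names `binary_masks` and `separate` and the array shapes are hypothetical.

```python
import numpy as np

def binary_masks(a, d, X1, X2, omega):
    """Per-frequency binary masks, boolean array of shape (n_freq, N, T).

    a, d   : learned attenuation and delay, shape (n_freq, N)
    X1, X2 : STFTs of the two mixtures, shape (n_freq, T)
    omega  : angular frequencies, shape (n_freq,)
    """
    steer = a[:, :, None] * np.exp(-1j * omega[:, None, None] * d[:, :, None])
    resid = np.abs(steer * X1[:, None, :] - X2[:, None, :]) ** 2
    resid = resid / (1.0 + a[:, :, None] ** 2)
    winner = resid.argmin(axis=1)                     # best source per point
    return winner[:, None, :] == np.arange(a.shape[1])[None, :, None]

def separate(masks, X1):
    """Apply the masks to the first mixture, shape (n_freq, N, T)."""
    return masks * X1[:, None, :]
```

Because the parameters are learned independently per frequency, source index 0 at one frequency need not be the same physical source as index 0 at another, which is the permutation problem the text describes.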

The present invention therefore performs a permutation alignment suited to the DUET method.

In the permutation alignment according to the present invention, the order is first initialized by the method that Murata et al. proposed for frequency-domain independent component analysis (N. Murata, S. Ikeda, and A. Ziehe, "An approach to blind source separation based on temporal structure of speech signals," Neurocomputing, vol. 41, no. 1-4, pp. 1-24, Oct. 2001), and is then adjusted so as to maximize the overall correlation coefficient.

In the permutation alignment, the correlation coefficient between the signals separated by the binary mask is computed at each frequency according to Equation 32, and the order in which the frequencies are to be aligned is determined from the magnitude of the correlation coefficient, as in Equation 33.

In Equation 32, the defined quantities are the magnitude of the correlation coefficient and the signals separated by the binary mask.

Here, a small correlation coefficient means that the two sound sources are easy to tell apart, so it can serve as the most reliable criterion. Accordingly, as in Equation 34, the signal value at the frequency with the smallest correlation coefficient is taken as the reference value.

In Equation 34, the defined quantities are the reference value and the separated sound source signal value at the frequency with the smallest correlation coefficient.

Proceeding through the frequencies in ascending order of correlation coefficient, the order of the sound sources is permuted according to Equation 35, the correlation coefficient between the reference value and the sound sources is computed for each permutation, and the order is set to the permutation that maximizes it. This is illustrated in FIGS. 12 and 13.

In Equation 35, the defined quantities are the aligned permutation at the given frequency, the separated sound source signals arranged in the aligned order, and the reference value built from the sound source signals aligned up to the previous frequency.

Next, as in Equation 36, the reference value is updated by adding the aligned separated sound source signals to it.

In Equation 36, the defined quantities are the reference value built from the aligned sound source signals and the separated sound source signals arranged in the aligned order.
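The single-pass alignment described above can be sketched as follows. Equations 32 to 36 are not reproduced in this text, so the sketch assumes that the correlation is the Pearson coefficient between magnitude envelopes and that the per-frequency measure is the sum over source pairs; the names `corr` and `murata_align` are hypothetical.

```python
import numpy as np
from itertools import permutations

def corr(x, y):
    """Pearson correlation coefficient of two 1-D sequences."""
    x = x - x.mean()
    y = y - y.mean()
    return float(x @ y / (np.linalg.norm(x) * np.linalg.norm(y) + 1e-12))

def murata_align(env):
    """Greedy single-pass alignment; env has shape (n_freq, N, T).

    Returns perms of shape (n_freq, N): perms[f][k] is the separated channel
    at frequency f that is assigned to output source k.
    """
    F, N, _ = env.shape
    pairs = [(i, j) for i in range(N) for j in range(i + 1, N)]
    sim = np.array([sum(corr(env[f, i], env[f, j]) for i, j in pairs)
                    for f in range(F)])
    order = np.argsort(sim)                   # least-correlated frequency first
    perms = np.tile(np.arange(N), (F, 1))
    ref = env[order[0]].copy()                # reference from the first frequency
    for f in order[1:]:
        best = max(permutations(range(N)),
                   key=lambda p: sum(corr(ref[k], env[f, p[k]]) for k in range(N)))
        perms[f] = best
        ref += env[f, list(best)]             # update the reference as we go
    return perms
```

Trying every permutation costs N! evaluations per frequency, which is acceptable for the small source counts typical of two-microphone setups.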

However, in the method of Murata et al. the reference value is updated while the order is being aligned, and the whole process finishes in a single pass, so incorrect alignments and missed permutations can remain. Because such errors degrade the overall performance, the present invention compensates for them by applying the iterative optimization that maximizes the correlation coefficient of the entire signal, as described by Sawada et al. (H. Sawada, R. Mukai, S. Araki, and S. Makino, "Robust and precise method for solving the permutation problem of frequency-domain blind source separation," in Proc. Int. Symp. Independent Component Analysis Blind Signal Separation (ICA), Nara, Japan, Apr. 2003, pp. 505-510).

The correlation coefficient of the entire signal is obtained, as in Equation 37, by realigning the order using as the reference the sum of the envelope signals arranged according to the order information of the previous iteration, and then summing the correlation coefficients with the reference value over all frequencies and sound sources. The alignment based on the previous iteration's order information is repeated until this value reaches its maximum; the iteration ends when the overall correlation coefficient no longer increases relative to the previous iteration.

In Equation 37, the defined quantity is the correlation coefficient of the entire signal.
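The iterative optimization can be sketched as follows. Equation 37 is not reproduced in this text, so the sketch assumes the total is the sum, over frequencies and sources, of Pearson correlations between each aligned envelope and the reference formed from the previous iteration's ordering; the names `score` and `refine` and the iteration cap `max_iter` are hypothetical.

```python
import numpy as np
from itertools import permutations

def corr(x, y):
    """Pearson correlation coefficient of two 1-D sequences."""
    x = x - x.mean()
    y = y - y.mean()
    return float(x @ y / (np.linalg.norm(x) * np.linalg.norm(y) + 1e-12))

def score(ref, env_f, p):
    """Summed correlation between the reference and one permuted frequency."""
    return sum(corr(ref[k], env_f[p[k]]) for k in range(len(p)))

def refine(env, perms, max_iter=50):
    """Iterate until the total correlation stops increasing.

    env   : envelopes of the separated signals, shape (n_freq, N, T)
    perms : initial ordering, integer array of shape (n_freq, N)
    """
    F, N, _ = env.shape
    perms = np.array(perms)
    best_total = -np.inf
    for _ in range(max_iter):
        # Reference: sum of envelopes under the previous iteration's ordering.
        ref = sum(env[f, perms[f]] for f in range(F))
        new = np.array([max(permutations(range(N)),
                            key=lambda p: score(ref, env[f], p))
                        for f in range(F)])
        total = sum(score(ref, env[f], new[f]) for f in range(F))
        if total <= best_total:               # no further increase: stop
            break
        best_total, perms = total, new
    return perms, best_total
```

In the scheme described in the text, `perms` would be the output of the single-pass alignment rather than an arbitrary starting order, and no fine-tuning over neighboring or harmonic frequencies follows.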

In the method of Sawada et al., the optimization that maximizes the correlation coefficient of the entire signal is followed by a fine-tuning step that maximizes the correlation coefficient with respect to the centers of neighboring frequencies and harmonic frequencies. However, unlike blind source separation based on independent component analysis, the source separation method of the present invention uses, at each frequency, the result that best satisfies the WDO condition according to the direction of arrival of the sound sources. The correlation between neighboring and harmonic frequencies is therefore much weaker, which makes it harder for the fine-tuning step to converge to a single solution. This effect is more pronounced in strongly reverberant environments, and the resulting performance degradation of the method of Sawada et al. was confirmed in the experimental results.

Therefore, the result aligned by the method of Murata et al. is used as the initial value, and only the iterative optimization that maximizes the overall correlation coefficient, as used in the method of Sawada et al., is performed. The final separated signal obtained in this way is given by Equation 38.

In Equation 38, the defined quantities are the final separated signal and the separated sound source signals arranged in the order aligned by the overall-correlation maximization; M denotes the index of the last iteration of that maximization.

100: first microphone
102: second microphone
104: STFT
106: blind source separation unit
108: ISTFT

Claims

A blind source separation method performed by a blind source separation device, comprising:
receiving mixed signals from two or more microphones;
converting the mixed signals into time-frequency-domain mixed signals by a short-time Fourier transform (STFT);
for the STFT mixed signals,
initializing attenuation and time delay values for each frequency,
training the initialized per-frequency attenuation and time delay values until they converge, generating a per-frequency binary mask based on the learned per-frequency attenuation and time delay values,
separating signals at each frequency using the per-frequency binary mask, and computing correlation coefficients for the per-frequency separated signals to align the order of the per-frequency separated signals; and
restoring the order-aligned signals to time-domain sound source signals by an inverse short-time Fourier transform (ISTFT),
wherein, in the initialization of the attenuation and time delay values,
the attenuation and time delay values for each frequency are defined as one cluster,
the center value of the cluster is found according to Equation 39, and
the center value is determined as the initial value of the attenuation and time delay values.
Equation 39

In Equation 39, the defined quantities are the center value, the number of attenuation and time delay values included in the cluster, and a pair of attenuation and time delay values, which consists of the estimated attenuation rate and the estimated time delay value.

(Deleted)

The method of claim 1,
wherein the center value is shifted according to Equation 40, and the shifted center values are set as new initial attenuation and time delay values of the split clusters.
Equation 40

In Equation 40, ε is a small positive constant that determines the width of the shift, and the two resulting quantities are the new center values shifted so as to split the data into a total of m+1 clusters.

The method of claim 1,
wherein, in the training of the per-frequency attenuation and time delay values until they converge,
the attenuation and time delay values are learned according to Equations 41 to 47 until they converge so as to minimize the cost function.
Equation 41

In Equation 41, the defined quantities are the attenuation-normalized value, the estimated attenuation rate for the j-th sound source signal, the estimated time delay value for the j-th sound source signal, and the first and second microphone output signals in the time-frequency domain.
Equation 42

In Equation 42, the defined quantities are the cost function and the attenuation-normalized values for the first to N-th sound source signals.
Equation 43

Equation 43 is the partial derivative of the cost function with respect to the attenuation value.
Equation 44

Equation 44 is the partial derivative of the cost function with respect to the time delay value.

Equation 45
In Equation 45, the updated quantity is the attenuation value.
Equation 46

In Equation 46, the updated quantity is the time delay value.
Equation 47

In Equation 47, β(ω) is the learning rate factor that depends on the magnitude of the energy at each frequency, and two further factors convert it into the learning rates for the attenuation rate and the time delay, respectively. The remaining quantities are the total energy of the mixed signals from the microphones, a constant used to set the learning rate factor, and the maximum of the per-frequency energy values.

The method of claim 1,
wherein the computing of correlation coefficients for the per-frequency separated signals to align their order comprises:
calculating correlation coefficients of the separated signals according to Equation 48,
determining the order of the frequencies to be aligned according to Equation 49, based on the magnitudes of the calculated correlation coefficients, and
aligning the order of the separated signals according to Equations 50 and 51, using as the reference value the signal value at the frequency with the smallest correlation coefficient.
Equation 48

In Equation 48, the defined quantities are the magnitude of the correlation coefficient and the signals separated by the binary mask.
Equation 49

Equation 50

In Equation 50, the defined quantities are the reference value and the separated sound source signal value at the frequency with the smallest correlation coefficient.
Equation 51

In Equation 51, the defined quantities are the aligned permutation at the given frequency, the separated sound source signals arranged in the aligned order, and the reference value built from the sound source signals aligned up to the previous frequency.

The method of claim 5,
wherein, after the order of the separated signals has been aligned,
the order information is realigned using as the reference the sum of the envelope signals arranged according to the order information of the separated signals, and
the correlation coefficients with the reference value are summed over all frequencies and sound sources, the alignment being repeated until the summed value reaches its maximum.