KR100426844B1

KR100426844B1 - A method on the improvement of speaker enrolling speed for the MLP-based speaker verification system through improving learning speed and reducing learning data

Info

Publication number: KR100426844B1
Application number: KR10-2002-0022128A
Authority: KR
Inventors: 이태승; 황병원
Original assignee: 주식회사 화음소
Priority date: 2002-04-23
Filing date: 2002-04-23
Publication date: 2004-04-13
Anticipated expiration: 2022-04-23
Also published as: KR20030083451A

Abstract

The present invention relates to a method for speeding up speaker enrollment in an MLP-based speaker verification system by improving learning speed and reducing the learning data. It shortens the time needed to enroll a speaker by combining a method that accelerates EBP learning with a background-speaker reduction method that brings the speaker-cohort technique of existing speaker verification methods into enrollment. The MLP (multilayer perceptron) has several advantages over other pattern recognition methods and is therefore a promising choice for speaker learning and recognition in speaker verification systems. However, training an MLP takes considerable time because the EBP (error backpropagation) algorithm used for training is slow. Combined with the fact that a speaker verification system needs many background speakers to achieve a high recognition rate, this means that enrolling a speaker in the system takes a long time. Because a speaker verification system must offer verification service immediately after a speaker is enrolled, this problem has to be solved. To solve it, the present invention uses the method of improving EBP learning speed together with the cohort-based background-speaker reduction method to shorten the time required for speaker enrollment in an MLP-based speaker verification system.

Description

A method on the improvement of speaker enrolling speed for the MLP-based speaker verification system through improving learning speed and reducing learning data

The present invention relates to a method for improving the enrollment speed of a speaker verification system. More specifically, it relates to a method for improving the enrollment speed of an MLP-based speaker verification system through faster learning and reduced learning data, in which the time required for speaker enrollment is shortened by using a method that improves the learning speed of EBP together with a background-speaker reduction method that introduces the speaker-cohort technique of existing speaker verification methods.

The MLP (multilayer perceptron) has several advantages over other pattern recognition methods and is therefore a promising choice for speaker learning and recognition in speaker verification systems. However, training an MLP takes considerable time because the EBP (error backpropagation) algorithm used for training is slow. Combined with the fact that a speaker verification system needs many background speakers to achieve a high recognition rate, this means that enrolling a speaker in the system takes a long time. Because a speaker verification system must offer verification service immediately after a speaker is enrolled, this problem has to be solved.

Accordingly, the present invention is intended to solve the problems of the prior art described above. Its object is to provide a method for improving the enrollment speed of an MLP-based speaker verification system through faster learning and reduced learning data, in which the time required for speaker enrollment is shortened by using a method that improves the learning speed of EBP together with a background-speaker reduction method that introduces the speaker-cohort technique of existing speaker verification methods.

FIGS. 1A to 1D illustrate the stages of EBP learning according to the present invention:

FIG. 1A shows the data distribution of two models,

FIG. 1B shows center-position learning,

FIG. 1C shows variance learning, and

FIG. 1D shows contour learning.

FIGS. 2A to 2D illustrate the per-pattern repulsive force in EBP learning according to the present invention.

FIGS. 3A and 3B illustrate the MLP's learning of the enrolling speaker with low-density and high-density background speakers, respectively.

FIG. 4 illustrates background-speaker cohort selection by MLP-I and enrolling-speaker learning by MLP-II.

FIG. 5 is a processing flowchart of the continuant-based MLP speaker verification system.

FIG. 6 is a graph of experimental results for the online EBP algorithm according to an embodiment of the present invention.

FIG. 7 is a graph of experimental results with COIL applied according to an embodiment of the present invention.

FIG. 8 is a graph of experimental results using DCS according to an embodiment of the present invention.

FIG. 9 is a graph of experimental results with COIL and DCS applied together according to an embodiment of the present invention.

FIG. 10 is a graph of the trend in improvement rate across the experimental results according to an embodiment of the present invention.

To achieve the above object, the method according to the present invention for improving the enrollment speed of an MLP-based speaker verification system through faster learning and reduced learning data comprises the following steps.

A speech analysis and feature extraction step: the enrolling speaker's input speech, sampled at 16 kHz with 16 bits, is divided into 30 ms frames that overlap by 20 ms. For each frame, a 16th-order mel-spaced filter bank is extracted and used for isolated-word and continuant detection. To remove the effect of vocal loudness on the overall spectral envelope, the filter-bank coefficients up to 1 kHz are averaged and this value is subtracted from all coefficients; the coefficients are then adjusted again so that the mean of all coefficients is zero, for effective MLP learning. In addition, for each frame a 50th-order filter bank, evenly spaced on the mel scale over the 0 to 3 kHz band, is extracted and used for speaker verification, and its coefficients are normalized in the same way.

An isolated-word and continuant detection step: isolated words, and the continuants within each isolated word, are detected using an MLP trained to detect each continuant and silence in a speaker-independent manner.

A per-continuant enrolling-speaker learning step: when basic online EBP or COIL is applied, the MLP is trained with the enrolling speaker's data and the background speakers' data. When MLP-I and MLP-II are applied, for each continuant, every occurrence of that continuant in all isolated words is input to MLP-I, the output-neuron values are averaged, and n background speakers are selected in descending order of this average; MLP-II is then trained for each continuant with the data of the n selected background speakers, using 10 patterns per background speaker.

A per-continuant speaker-score evaluation step: when basic online EBP or COIL is applied, for each continuant, every occurrence of that continuant in all isolated words is input to the MLP and the output-neuron values are averaged. When MLP-I and MLP-II are applied, for each continuant, every occurrence of that continuant in all isolated words is input to MLP-I, the output-neuron values are averaged, and n background speakers are selected in descending order of this average; the MLP-II outputs are then averaged over all continuants for which the selected background speakers include at least one background speaker from the enrolling-speaker learning step, and the claimant is rejected if no continuant has at least one such background speaker.

An enrolled-word and speaker-score threshold comparison step: the average obtained in the speaker-score evaluation step is compared with a preset threshold to make the final accept/reject decision.
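The framing and coefficient normalization described in the feature extraction step can be sketched as follows. This is only an illustration under stated assumptions: the filter-bank computation itself is omitted, the coefficients are assumed to be log energies, `center_hz` is a hypothetical list of filter center frequencies, and the second adjustment is interpreted as a data-set-wide shift (a per-frame re-centering would cancel the first subtraction, so that reading is assumed not to be the intended one).

```python
from statistics import fmean

SAMPLE_RATE = 16_000                     # 16 kHz, as stated in the text
FRAME_LEN = SAMPLE_RATE * 30 // 1000     # 30 ms -> 480 samples
FRAME_SHIFT = SAMPLE_RATE * 10 // 1000   # 30 ms frame, 20 ms overlap -> 10 ms shift


def split_frames(signal):
    """Split a sequence of samples into 30 ms frames that overlap by 20 ms."""
    last_start = len(signal) - FRAME_LEN
    return [signal[s:s + FRAME_LEN] for s in range(0, last_start + 1, FRAME_SHIFT)]


def normalize_filterbank(frames, center_hz):
    """Two-stage normalization of filter-bank coefficients.

    frames    : list of per-frame coefficient lists (assumed log energies)
    center_hz : center frequency of each filter in Hz (hypothetical input)
    """
    low = [i for i, f in enumerate(center_hz) if f <= 1000.0]
    # Stage 1: per frame, subtract the mean of the coefficients up to 1 kHz.
    stage1 = []
    for coeffs in frames:
        low_mean = fmean(coeffs[i] for i in low)
        stage1.append([x - low_mean for x in coeffs])
    # Stage 2 (assumed to be data-set wide): shift so the overall mean is 0.
    grand_mean = fmean(x for coeffs in stage1 for x in coeffs)
    return [[x - grand_mean for x in coeffs] for coeffs in stage1]
```

With these constants, one second of speech yields 98 frames, each starting 160 samples after the previous one.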

Hereinafter, preferred embodiments of the present invention are described with reference to the accompanying drawings.

As information devices such as computers, PDAs, and mobile phones come into everyday use, means of protecting personal information are becoming increasingly important. Biometrics controls access to such devices using characteristics unique to each person; it makes security information convenient to manage and offers strong security in itself. The main biometric features include fingerprints, the iris, face shape, and voice. Of these, voice is being actively studied because it is the most readily available and the cost of processing it is relatively low compared with the other features.

Biometric recognition by voice is called speaker recognition. Speaker recognition divides into speaker identification, in which several speakers are enrolled in the system and the one matching the current voice is selected, and speaker verification, in which the identity of a particular previously enrolled speaker is selected and the system decides whether the current voice matches that speaker. Speaker verification can confirm identity against an unspecified large population, and a speaker identification system can be built by combining many speaker verification modules, so speaker verification systems are the more actively studied of the two.

Various recognition methods are used to learn a particular speaker's voice and to determine identity by comparing input speech with the learned voice. Among them, the MLP (multilayer perceptron) has the following advantages, which make it usable in a wide range of recognition problems.

First, because it is a nonparametric method, it does not require an assumed underlying probability distribution for the problem.

Second, it minimizes the chance of recognition error because it has a rejection-learning capability that separates the models being learned as far as possible.

Third, when learning targets of +1 and 0 (or -1) are used for each learning model, it has a feature-space transformation capability similar to LDA (linear discriminant analysis).

In general, training an MLP takes considerable time. The main cause is the EBP (error backpropagation) algorithm used for training. The EBP algorithm is based on the steepest-descent method: it propagates the error between the MLP's current output and the target output backward from the output layer to the hidden layer, adjusting the weights as it goes, until the final target is reached. EBP learning is slow because steepest descent uses only local information about the current weights.

The size of the learning data can also affect MLP learning speed. Speaker verification requires a claimant, whose identity score is to be measured, and background speakers for comparison; the claimant is in turn either the true speaker or an impostor. For a high verification rate, the claimant should be compared with background speakers whose characteristics are as similar as possible, so that the decision is strict. Because the claimant's characteristics cannot be known in advance, as many background speakers as possible are prepared so that an accurate decision can be made whoever requests verification. However, increasing the number of background speakers to achieve a high recognition rate clearly also lengthens the enrollment time of an MLP speaker verification system, whose learning time is already long.

A speaker verification system must be able to offer verification service as soon as a speaker is enrolled, and verification itself must be fast given how frequently information devices are used. Because of the two factors described above, speaker enrollment in an MLP-based system takes a long time. Verification time, on the other hand, is satisfactory by the standard of ordinary information devices, thanks to the fast operation of the MLP, which is based on a discriminant function. To guarantee real-time performance of an MLP-based speaker verification system, effort should therefore be concentrated on improving enrollment speed.

Attempts to improve MLP learning speed have taken two broad directions. The first is heuristic, drawing on experience and experimental results: the learning rate is increased when the output is far from the target and decreased when it is close. This direction divides further into methods that change a global learning rate affecting the whole weight vector at once and methods that change a local learning rate for each weight according to changes in its gradient. The second direction draws on optimization theory and uses second-derivative information about the weights. This class includes methods that use momentum to carry the previous learning trend into the current update, and methods that use Newton's optimization method or variants of it to compute the weight-vector update that converges to the target fastest.

Weights are updated in one of two ways. One is to present all of the learning data and then apply the average of the resulting changes; the other is to apply the change each time a single learning pattern is presented. The former is called the offline (or batch) mode and the latter the online (or stochastic) mode. In both modes, one cycle in which all of the learning data is presented is called an epoch, and at each epoch the difference between the MLP's target and output values is checked to decide whether learning should continue.
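The difference between the two update modes can be illustrated with a deliberately small example. The sketch below uses a single linear unit with one weight and a squared-error loss rather than an MLP; the function names and the data are hypothetical and serve only to show where the weight change is applied within an epoch.

```python
def gradient(w, x, target):
    """Gradient of 0.5 * (target - w * x) ** 2 with respect to w."""
    return -(target - w * x) * x


def offline_epoch(w, data, lr):
    """Batch mode: present every pattern, then apply the averaged change once."""
    avg_grad = sum(gradient(w, x, t) for x, t in data) / len(data)
    return w - lr * avg_grad


def online_epoch(w, data, lr):
    """Stochastic mode: apply the change after each individual pattern."""
    for x, t in data:
        w = w - lr * gradient(w, x, t)
    return w


# Two patterns generated by the true weight w = 2.
data = [(1.0, 2.0), (2.0, 4.0)]
w_offline = offline_epoch(0.0, data, lr=0.1)
w_online = online_epoch(0.0, data, lr=0.1)
```

In this toy case one online epoch moves the weight closer to 2 than one offline epoch does, because the second pattern's update already benefits from the first.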

When an MLP is used for pattern recognition, including speaker verification, online learning proceeds faster than offline learning. The following reasons can be given.

First, the patterns within a model are highly redundant with respect to one another, so every pattern contributes to EBP's steepest-descent calculation. As a result, the more patterns a model contains, the faster learning proceeds per epoch.

Second, as long as the steepest-descent directions of the patterns within a model lie within 90° of one another, learning proceeds in the direction that reduces the error.

Third, the chance of falling into a local minimum is greatly reduced. This is because, when the weight vector is updated for every pattern in the model, patterns relatively far from the model's center produce random oscillations that differ from the overall direction of progress.

Several pattern recognition studies have reported that online EBP learning is faster than the various offline methods introduced above.

Existing parametric speaker verification systems introduced a method in which a limited number of background speakers similar to the enrolling speaker are selected at enrollment, and the speaker information of this cohort is then used to compute the claimant's identity score when the claimant's identity is judged.

To obtain a reliable identity score, Higgins et al. proposed normalizing the claimant's similarity score by the score of the background speaker most similar to the claimant, as follows.

The quantities in this expression are the claimant's speech sequence, the identity the claimant asserts, and the background speaker closest to the claimant.

However, this score has two drawbacks: 1) every background speaker must be searched to find the one closest to the claimant, so the computation grows in proportion to the number of background speakers; and 2) the value of the denominator of Equation (1) varies widely with the claimant's position within the background population (that is, with the density of background speakers around the claimant), so a stable score is hard to obtain.

To solve this problem, the speaker-cohort method was proposed: a fixed number of nearby background speakers is selected when the speaker is enrolled in the system, and the sum of their similarity scores is used as the normalization term when the claimant's score is computed.

Here the cohort is the set of background speakers close to the enrolled speaker. The more background speakers the cohort contains, the more reliable the verification score.
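The two normalization schemes can be contrasted in a few lines of code. The sketch follows only the prose description: the score is assumed to be a ratio of likelihoods, and the function names, the dictionary layout, and the example values are hypothetical rather than taken from the patent.

```python
def higgins_score(likelihoods, claimed):
    """Claimed speaker's likelihood divided by that of the closest other speaker.

    Every background speaker has to be examined at verification time.
    """
    best_rival = max(v for name, v in likelihoods.items() if name != claimed)
    return likelihoods[claimed] / best_rival


def cohort_score(likelihoods, claimed, cohort):
    """Claimed speaker's likelihood divided by the summed likelihoods of a cohort.

    The cohort is chosen once, at enrollment, so only its members are evaluated.
    """
    return likelihoods[claimed] / sum(likelihoods[name] for name in cohort)


# Hypothetical likelihoods of one utterance under each speaker model.
likelihoods = {"alice": 0.8, "b1": 0.4, "b2": 0.1, "b3": 0.05}
score_1 = higgins_score(likelihoods, "alice")
score_2 = cohort_score(likelihoods, "alice", cohort=["b1", "b2"])
```

The first function scans all background speakers for every verification request, while the second touches only the cohort members fixed at enrollment, which is the shift of computation the text describes.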

In terms of processing speed, the speaker-cohort method can be regarded as moving the search for nearby background speakers into the enrollment stage in order to shorten the processing time that Equation (1) requires at verification. Of the enrollment and verification stages of a speaker verification system, enrollment has the less strict real-time requirement, so redistributing the computation in this way improves overall real-time performance.

In pattern recognition, an MLP trained with the EBP algorithm goes through three learning stages: 1) it learns the center position of each model, 2) it learns the variance of each model, and 3) it learns the contour of each model's distribution. As learning moves toward the later stages, the region of patterns that contributes to learning shrinks to the periphery and the effective learning rate differs from pattern to pattern; in addition, patterns that no longer contribute to learning continue to be included in the computation, which is inefficient. If the existing online EBP algorithm could accommodate these changes, learning would be faster still, which in turn would speed up enrollment in an MLP-based speaker verification system.

Besides increasing MLP learning speed, the enrollment speed of a speaker verification system can also be improved by reducing the amount of data needed to train the MLP. To this end, the existing speaker-cohort method is introduced into the enrollment stage, and a small number of background speakers similar to the enrolling speaker are selected, reducing the amount of data handled at enrollment.
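The selection rule stated earlier in the description (average the MLP-I output-neuron values, keep the n background speakers with the highest averages, then train MLP-II on at most 10 patterns per selected speaker) can be sketched as follows. The data layout, the function names, and the 1/0 target labels are assumptions made for illustration.

```python
def select_cohort(frame_outputs, n):
    """Pick the n background speakers whose MLP-I output neurons respond most strongly.

    frame_outputs : one dict per input frame, mapping a background-speaker id
                    to that speaker's output-neuron value (hypothetical layout)
    """
    speakers = list(frame_outputs[0])
    mean_out = {s: sum(f[s] for f in frame_outputs) / len(frame_outputs) for s in speakers}
    return sorted(mean_out, key=mean_out.get, reverse=True)[:n]


def build_training_set(enroll_patterns, background_patterns, cohort, per_speaker=10):
    """Label enrolling-speaker patterns 1 and cohort patterns 0 (targets assumed)."""
    data = [(p, 1) for p in enroll_patterns]
    for speaker in cohort:
        data += [(p, 0) for p in background_patterns[speaker][:per_speaker]]
    return data


# Hypothetical MLP-I outputs for two frames of one continuant.
frame_outputs = [
    {"b1": 0.9, "b2": 0.2, "b3": 0.5},
    {"b1": 0.7, "b2": 0.1, "b3": 0.6},
]
cohort = select_cohort(frame_outputs, n=2)
```

Because only the cohort's patterns enter the MLP-II training set, the amount of data grows with n rather than with the size of the whole background population.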

The present invention presents concrete ways of realizing these two ideas and applies them to a system in which the continuant is the unit of speaker recognition. The remainder of the description is organized as follows. First, the background of the two ideas is covered; next, the concrete realization of each idea is explained. The speaker verification system implemented in the present invention is then described, and finally the effect of the two methods is demonstrated through experiments on a speech database.

In the EBP algorithm, the weight displacement that approaches the target most quickly from the current weight values is calculated as follows.

The quantities in this expression are the error-measure function of the output-neuron layer for a given pattern, the connection weight between two neurons, the activation of a neuron, and the weighted sum of the inputs to a neuron. Applying the change computed by this expression to the previous weight vector yields a value closer to the target, which is done as follows.

The quantities in this expression are the time index of a particular state of the weight vector, the error with respect to the target of the learning pattern(s) at that time, and the learning rate, which determines what proportion of the change is applied.
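For orientation, the two expressions described in prose above have the following textbook form in the error backpropagation literature. This is offered only as a plausible reading: every symbol ($E_p$, $w_{ij}$, $o_j$, $\mathrm{net}_j$, $t$, $\eta$) and the index convention are assumptions chosen to match the verbal definitions, not notation confirmed by the patent.

```latex
% Assumed form of Equation (3): chain-rule expansion of the error gradient
\frac{\partial E_p}{\partial w_{ij}}
  = \frac{\partial E_p}{\partial o_j}\,
    \frac{\partial o_j}{\partial \mathrm{net}_j}\,
    \frac{\partial \mathrm{net}_j}{\partial w_{ij}}

% Assumed form of Equation (4): steepest-descent weight update
w_{ij}(t+1) = w_{ij}(t) - \eta\,\frac{\partial E(t)}{\partial w_{ij}(t)}
```

In this reading, $\eta$ scales how much of the computed change is applied at each step, which is the quantity the later discussion proposes to vary from pattern to pattern.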

In EBP learning for pattern recognition, each model is learned iteratively by Equation (4) and therefore passes through three stages: 1) center-position learning, 2) region-variance learning, and 3) region-contour learning.

FIGS. 1A to 1D show this process for a two-model classification problem. FIG. 1A shows the data distribution of the two models, FIG. 1B center-position learning, FIG. 1C variance learning, and FIG. 1D contour learning.

In FIGS. 1B to 1D, the black band represents the decision boundary between the two models, and a darker band means a sharper boundary. The boundary passes between the centers of the two models in FIG. 1B, separates the regions according to their variance in FIG. 1C, and follows the detailed contours in FIG. 1D.

The error function in Equation (3) is expressed as follows.

The quantities in this expression are the error value of each individual output neuron for the current pattern and the number of output neurons. The per-neuron error value is in turn expressed as follows.

The quantities in this expression are the learning target value and the current output value.
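The per-pattern error described here can be computed as below. The sketch assumes the common sum-of-squares form with a factor of 1/2; the text only states that the error combines the per-neuron differences between target and output over all output neurons, so the exact form and the function name are assumptions.

```python
def pattern_error(targets, outputs):
    """Assumed sum-of-squares error over the output neurons for one pattern."""
    errors = [d - y for d, y in zip(targets, outputs)]  # per-neuron error: target - output
    return 0.5 * sum(e * e for e in errors)


# A nearly learned pattern has a small error, an unlearned one a large error.
learned = pattern_error([1.0, 0.0], [0.9, 0.2])
unlearned = pattern_error([1.0, 0.0], [0.5, 0.5])
```

The size of this value is what the following paragraphs interpret as the force a pattern exerts on the decision boundary.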

At each stage of FIGS. 1A to 1B, the error value that each pattern produces under Equation (5) can be thought of as a repulsive force that moves the decision boundary, and hence as the degree to which each individual pattern contributes to learning at that stage. FIGS. 2A to 2D redraw FIGS. 1A to 1B in terms of this per-pattern repulsive force.

In FIGS. 2A to 2D, the dotted line represents the decision boundary and the arrows represent the repulsive force of each pattern. In the unlearned state of FIG. 2A, every pattern exerts a strong repulsive force, which allows the center position to be learned. From FIG. 2B onward, the patterns near the decision boundary, and those that lie beyond it in the other model's region, exert strong repulsive forces and change the shape and position of the boundary.

The reason the online mode learns faster than the offline mode in pattern recognition can be found in how the per-pattern repulsive forces are applied. The offline mode applies the vector sum of all the per-pattern repulsive-force vectors, whereas the online mode applies each vector individually. The online mode therefore has more opportunities to change the shape of the decision boundary, and the effect is greater when the model regions have complex shapes.

Although the online mode thus has an advantage in learning speed over the offline mode, there is still room to approach the optimal speed.

In online EBP, learning time can be shortened further by making more active use of how the patterns' repulsive forces are distributed at each learning stage.

FIGS. 2A to 2D show the model's decision boundary being deformed by the large repulsive forces of patterns that have not yet been learned. The influence of this repulsive force on the decision boundary is controlled by the learning rate in Equation (4). That is, the larger the learning rate for a poorly learned pattern, the more sharply the decision boundary near that pattern moves. As learning progresses and the decision boundary comes close to the learning target, however, the learning rate for the patterns in that neighborhood must gradually decrease for the boundary to be likely to converge on the target. Moreover, regions where several models overlap cannot be learned perfectly, so the learning rate for patterns in these regions should be small to prevent needless oscillation of the decision boundary.

To this end, the learning rate should be allowed to change with the situation rather than being fixed for the whole learning period, as it is in the existing online EBP algorithm. Variable learning rates of this kind have already been tried in offline EBP, but applying them in online EBP can further strengthen the online mode's advantage in exploiting data redundancy.

In the existing online EBP algorithm, a pattern continues to take part in the learning computation even after it has been learned. If a criterion for learning success is set and a pattern is excluded from the learning computation once it satisfies that criterion, learning time can be shortened with almost no effect on the overall progress of learning.

If the number of background speakers in equation (2) is assumed to be infinite, equation (2) can be regarded as a posterior probability, because the prior probabilities of the speakers are fixed in the Bayes-rule posterior probability that the claimant is the claimed speaker.

Gish proved that an MLP with sufficient learning capacity can learn the posterior probability of each model in pattern recognition that classifies multiple models. Accordingly, given the enrolling speaker and a sufficient number of background speakers as the two models to be learned, MLP training approximates the cohort speaker method expressed by equation (7).

However, once the number of background speakers in the cohort exceeds a certain level, additional speakers contribute little to the recognition rate while the amount of computation grows linearly. It is therefore advantageous to determine by experiment the smallest number of background speakers that gives a satisfactory recognition rate.

To realize the ideas raised above for speeding up online EBP learning, the present invention proposes two methods.

The first method varies the learning rate for each training pattern.

The learning rate needs to change from a large value to a small value as the model is learned. In equation (5), the error value of the output neuron corresponding to the model of the current pattern provides a numerical measure of how well the current pattern has been learned: it is large when the pattern has not been learned sufficiently and small when it has. It therefore yields large values early in training and small values as training nears completion.

Learning rates commonly chosen for online learning range from 1 down to 0.0001. Above the upper limit the internal variables tend to diverge, and below the lower limit learning becomes unnecessarily long. The appropriate range depends on the problem, however, so it must be determined by experiment. If the error value of equation (5) is used as the learning rate, there is no need to worry about the lower limit, and only an appropriate upper limit has to be found. The value is then restricted to the range from 0 to that upper limit (equation (8)).

In equation (8), the upper bound is the learning-rate limit found by experiment, the quantity being bounded is the error output value, and the result is the limited value.

However, equation (8) uses only error information confined to the current pattern. If the models to be learned inherently contain large unavoidable errors, that is, if the distribution regions of several models overlap heavily, the error values produced by patterns in the overlap stay large more or less regardless of the current state of learning and can hinder learning as a whole. To cope with this case, the value of equation (8) is limited by the mean error of the previous epoch (equation (9)).

The limiting term in equation (9) is the mean squared error energy, defined by equation (10) below.

In equation (10), the count is the number of patterns used during one epoch.

The second method omits the learning of patterns.

The main computations in online EBP are the error computation for a pattern, error backpropagation, and the weight update. Because the contribution of the current pattern at the current stage of learning is indicated by its error value, when that value is smaller than the final target error of equation (10), which is set before training, the pattern is judged to contribute little to learning and the backpropagation and internal-variable update steps can be omitted. If the learning of other patterns later raises the contribution of this pattern, the change is detected through the same error value and the pattern takes part in learning again.

The present invention refers to these two methods together as the COIL (Changing rate and Omitting patterns in Instant Learning) method, that is, per-pattern variable learning rate and learning omission.
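The two COIL ideas can be sketched on a small MLP as below. Equations (5) and (8) to (10) are not reproduced in this text, so several things here are assumptions made only for illustration: the per-pattern error is taken as e_p = 0.5 (t - y)^2, the variable rate is that error clipped to [0, eta_max] and optionally capped by the previous epoch's mean error, and the units are tanh. The names `coil_rate`, `TinyMLP`, and `coil_epoch` are hypothetical.

```python
import math
import random

def coil_rate(e_p, eta_max, prev_mean_error=None):
    """Per-pattern learning rate: pattern error clipped to [0, eta_max],
    optionally capped by the previous epoch's mean error (assumed form)."""
    rate = max(0.0, min(e_p, eta_max))
    if prev_mean_error is not None:
        rate = min(rate, prev_mean_error)
    return rate

class TinyMLP:
    """One hidden layer of tanh units feeding a single tanh output neuron."""
    def __init__(self, n_in, n_hidden=2, seed=0):
        rng = random.Random(seed)
        # Weights start in [-0.5, 0.5]; the last entry of each row is a bias.
        self.w1 = [[rng.uniform(-0.5, 0.5) for _ in range(n_in + 1)]
                   for _ in range(n_hidden)]
        self.w2 = [rng.uniform(-0.5, 0.5) for _ in range(n_hidden + 1)]

    def forward(self, x):
        xb = list(x) + [1.0]
        h = [math.tanh(sum(w * v for w, v in zip(row, xb))) for row in self.w1]
        y = math.tanh(sum(w * v for w, v in zip(self.w2, h + [1.0])))
        return xb, h, y

    def backward(self, xb, h, y, t, rate):
        d_out = (t - y) * (1.0 - y * y)
        # Hidden deltas use the output weights as they were before the update.
        d_hid = [d_out * self.w2[j] * (1.0 - h[j] * h[j]) for j in range(len(h))]
        for j, v in enumerate(h + [1.0]):
            self.w2[j] += rate * d_out * v
        for j, row in enumerate(self.w1):
            for i, v in enumerate(xb):
                row[i] += rate * d_hid[j] * v

def coil_epoch(net, patterns, eta_max, target_error, prev_mean_error=None):
    """One online epoch; returns (mean pattern error, number of omitted patterns)."""
    total, skipped = 0.0, 0
    for x, t in patterns:
        xb, h, y = net.forward(x)          # the error computation is always done
        e_p = 0.5 * (t - y) ** 2
        total += e_p
        if e_p < target_error:             # already learned: omit backprop and update
            skipped += 1
            continue
        net.backward(xb, h, y, t, coil_rate(e_p, eta_max, prev_mean_error))
    return total / len(patterns), skipped
```

Because the forward pass is still run for every pattern, a pattern that was omitted earlier rejoins training automatically once its error rises above the target again.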

As proposed above, to reduce the number of background speakers needed for MLP training, a limited number of background speakers close to the enrolling speaker is selected as in equation (2), choosing the smallest number that does not degrade the recognition rate of the speaker verification system.

This method requires one MLP that evaluates how close the enrolling speaker is to each background speaker and another MLP that learns the characteristics of the enrolling speaker using the background speakers selected by the first. The present invention calls the former MLP-I and the latter MLP-II.

MLP-I has as many output neurons as there are background speakers and is trained in advance on the background speakers' data to classify each of them. Because evaluating the proximity of the background speakers alone can take a long time when there are many hidden neurons, the number of hidden neurons is chosen to be just enough to classify each background speaker's data with an accuracy of 80% or higher.

MLP-II learns the enrolling speaker using the data of the background speakers selected through MLP-I. Unlike the cohort speaker method, however, the MLP uses only two models, the enrolling speaker and the background speakers, so the patterns of the background speakers selected by MLP-I are merged into a single model.

The number of background speakers selected by MLP-I is determined by experiment. It must be at least large enough for the selected background speakers to surround the distribution of the enrolling speaker, and to prevent a drop in the recognition rate the background speakers must surround the enrolling speaker densely.

Figures 3a and 3b illustrate how the enrolling speaker is learned depending on the density of the background speakers. The dotted lines in these figures are the boundaries separating the regions of the enrolling speaker and the background speakers. Figure 4 illustrates the operating principle of MLP-I and MLP-II.

The present invention calls this method the DCS (discriminative cohort speakers) method.
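The selection and merging performed by DCS can be sketched as follows. The sketch assumes the MLP-I outputs are already available as one vector per input pattern, with one value per background speaker; `select_cohort` and `merge_cohort` are hypothetical names, and the targets +0.9 and -0.9 are the ones given for the experiment.

```python
def select_cohort(mlp1_outputs, n):
    """Average the MLP-I output of each background speaker over all patterns
    and return the indices of the n speakers with the highest averages."""
    num_speakers = len(mlp1_outputs[0])
    means = [sum(frame[k] for frame in mlp1_outputs) / len(mlp1_outputs)
             for k in range(num_speakers)]
    ranked = sorted(range(num_speakers), key=lambda k: means[k], reverse=True)
    return ranked[:n]

def merge_cohort(enroll_patterns, background, cohort):
    """Build the two-model training set for MLP-II: the enrolling speaker's
    patterns against the selected background speakers merged into one model."""
    data = [(x, 0.9) for x in enroll_patterns]
    for k in cohort:
        data.extend((x, -0.9) for x in background[k])
    return data

# Two enrollment patterns scored by an MLP-I with three background speakers.
cohort = select_cohort([[0.1, 0.8, 0.3], [0.2, 0.6, 0.5]], n=2)
training_set = merge_cohort(["e"], [["b0"], ["b1a", "b1b"], ["b2"]], cohort)
```

Background speaker 0 has the lowest average MLP-I output here, so its patterns never reach MLP-II, which is where the reduction in training data comes from.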

The speaker verification system implemented in the present invention extracts isolated words from the input speech, recognizes the Korean continuants (/a/, /e/, /o/, /u/, /i/, /l/, and nasals) in each isolated word, and then, for each continuant, learns the speaker and computes a verification score using MLP-I and MLP-II. Figure 5 shows the processing performed in this system; each step is described below.

Step 1) Speech analysis and feature extraction

The input speech of the enrolling speaker, sampled at 16 kHz with 16-bit resolution, is divided into 30 ms frames that overlap by 20 ms.
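As a small illustration of the framing just described: at 16 kHz a 30 ms frame holds 480 samples, and a 20 ms overlap leaves a 10 ms (160-sample) step between frame starts. The function name `split_frames` and the choice to drop a trailing partial frame are assumptions of this sketch.

```python
def split_frames(samples, rate_hz=16000, frame_ms=30, overlap_ms=20):
    """Split a sample sequence into fixed-length overlapping frames.
    A trailing partial frame is dropped (an assumption of this sketch)."""
    frame_len = rate_hz * frame_ms // 1000                # 480 samples
    hop = rate_hz * (frame_ms - overlap_ms) // 1000       # 160 samples
    return [samples[i:i + frame_len]
            for i in range(0, len(samples) - frame_len + 1, hop)]

one_second = list(range(16000))
frames = split_frames(one_second)
```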

A 16th-order Mel-spaced filter bank is extracted from each frame and used for isolated-word and continuant detection. To remove the effect of vocal loudness on the overall spectral envelope, the filter-bank coefficients up to 1 kHz are averaged and this average is subtracted from every coefficient. The coefficients are then adjusted again so that the mean of all coefficients is zero, for effective MLP learning.

A 50th-order Mel filter bank with equally spaced filters over the 0 to 3 kHz band is extracted from each frame and used for speaker verification. This feature follows research showing that more speaker information is concentrated in the second formant. To remove the effect of vocal loudness on the overall spectral envelope, the filter-bank coefficients up to 1 kHz are averaged and this average is subtracted from every coefficient. The coefficients are then adjusted again so that the mean of all coefficients is zero, for effective MLP learning.

Step 2) Isolated-word and continuant detection

Isolated words and the continuants within them are detected using an MLP trained to detect each continuant and silence in a speaker-independent manner.

Step 3) Enrolling-speaker learning per continuant

Step 3-1) When basic online EBP or COIL is applied, the MLP is trained using the enrolling-speaker and background-speaker data.

Step 3-2) When MLP-I and MLP-II are applied, for each continuant every instance of that continuant in all isolated words is input to MLP-I, the output-neuron values are averaged, and n background speakers are selected in descending order of this average. MLP-II is then trained on the enrolling speaker using the data of the n background speakers selected for each continuant. For each continuant, 10 patterns per background speaker are used for MLP-II training.

Step 4) Speaker score evaluation per continuant

Step 4-1) When basic online EBP or COIL is applied, for each continuant every instance of that continuant in all isolated words is input to the MLP and the output-neuron values are averaged.

Step 4-2) When MLP-I and MLP-II are applied, for each continuant every instance of that continuant in all isolated words is input to MLP-I, the output-neuron values are averaged, and n background speakers are selected in descending order of this average. The MLP-II outputs are then averaged over all continuants for which the selected background speakers include at least one of the background speakers selected in step 3. If no continuant has at least one background speaker in common with step 3, the claimant is rejected.

Step 5) Comparison of enrolled word and speaker score with threshold

The average from step 4 is compared with a preset threshold to make the final accept/reject decision.
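Steps 4-2 and 5 can be sketched as follows. The sketch assumes the per-continuant cohorts and MLP-II outputs have already been computed, and it averages per-continuant means with equal weight, which the text does not specify; `verify` and its argument layout are hypothetical.

```python
def verify(test_results, enrolled_cohorts, threshold):
    """test_results: {continuant: (cohort selected at test time, MLP-II outputs)}
    enrolled_cohorts: {continuant: cohort selected at enrollment (step 3)}
    Returns (accepted, score); score is None when no continuant is usable."""
    usable = []
    for sound, (test_cohort, outputs) in test_results.items():
        enrolled = enrolled_cohorts.get(sound, ())
        # Use a continuant only if the two cohorts share at least one speaker.
        if set(test_cohort) & set(enrolled):
            usable.append(sum(outputs) / len(outputs))
    if not usable:
        return False, None                 # no common background speaker: reject
    score = sum(usable) / len(usable)
    return score >= threshold, score

enrolled = {"a": {2, 3}, "o": {7}}
accepted, score = verify({"a": ([1, 2], [0.8, 0.6]), "o": ([5], [-0.9])},
                         enrolled, threshold=0.5)
rejected, no_score = verify({"o": ([5], [0.9])}, enrolled, threshold=0.5)
```

In the first call only /a/ shares a background speaker with enrollment, so the score is the mean MLP-II output for /a/ alone; in the second call nothing overlaps and the claimant is rejected without a score.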

Because this speaker verification system uses continuants as the speaker recognition unit, the underlying probability distribution is univariate in form. A two-layer structure with a single hidden layer is therefore sufficient for the MLP, MLP-I, and MLP-II, all of which are involved in speaker learning. Of these, the MLP and MLP-II learn only two models, so one output neuron and two hidden neurons are enough to learn them.

Example

The data used in this experiment are recordings of four-digit strings spoken by 40 Korean male and female speakers. A four-digit string here means four consecutively spoken digits from the Arabic numerals 0 to 9, pronounced /goN/, /il/, /i/, /sam/, /sa/, /o/, /yug/, /cil/, /pal/, and /gu/. Each speaker uttered 35 different digit strings four times each, and the speech was recorded at 16 kHz with 16-bit resolution. Three of the four utterances are used as the speaker's enrollment speech and the remaining one as verification test speech.

As the background speakers needed for enrolling-speaker learning, 29 male and female speakers other than the 40 above are used.

The parameters of the MLP used for speaker enrollment in the experiment (MLP-II in the case of the DCS method) are as follows.

Input data are normalized to the range -1.0 to +1.0, and to speed up learning the target value of the output neuron is set to +0.9 for the enrolling speaker and -0.9 for the background speakers. All internal variables are initialized to random values between -0.5 and +0.5 before training. During training the speech patterns of the two models are presented to the MLP alternately; because the two models almost never have the same number of training patterns, the patterns of the smaller set are presented repeatedly until all patterns of the larger set have been presented, which completes one epoch. The maximum number of training epochs is limited to 1000 to allow for cases of falling into a local minimum, and the training goal is that the value of equation (10) be 0.01 or less and, to prevent premature stopping, that its rate of change (average squared error energy rate) also be 0.01 or less.
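The epoch construction and stopping rule described above can be sketched as below. The exact definition of the rate of change is not given in this text, so the relative change |E_prev - E| / E_prev is an assumption, as are the names `build_epoch` and `should_stop`.

```python
from itertools import cycle

def build_epoch(enroll, background):
    """Alternate patterns of the two models, recycling the smaller set
    until every pattern of the larger set has been presented once."""
    if len(enroll) >= len(background):
        pairs = zip(enroll, cycle(background))
    else:
        pairs = zip(cycle(enroll), background)
    epoch = []
    for e, b in pairs:
        epoch.append((e, 0.9))     # enrolling-speaker target
        epoch.append((b, -0.9))    # background-speaker target
    return epoch

def should_stop(epoch_index, err, prev_err,
                target=0.01, max_rate=0.01, max_epochs=1000):
    """Stop at the epoch limit, or when both the mean error and its
    relative change from the previous epoch are at or below their limits."""
    if epoch_index >= max_epochs:
        return True
    if prev_err is None:
        return False               # no previous epoch to compare with
    rate = abs(prev_err - err) / prev_err if prev_err > 0.0 else 0.0
    return err <= target and rate <= max_rate

epoch = build_epoch(["e1"], ["b1", "b2", "b3"])
```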

Each of the 40 experimental speakers is used in turn as the enrolling speaker and true speaker, and the remaining 39 are used as impostors. As a result, 35 true-speaker trials and 1,365 impostor trials (39 impostors × 35 utterances) are evaluated per speaker, and over the whole experiment 1,400 true-speaker trials and 54,600 impostor trials are evaluated.

The experiments were run on a 1.4 GHz AMD computer. In the results, error rate means the equal error rate (EER), the number of training patterns is the total number of patterns learned to enroll one speaker, and training time is the actual time taken to learn those patterns. Error rate, number of training patterns, and training time are averages over all verification trials.
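For readers unfamiliar with the equal error rate, a minimal way to estimate it from verification scores is sketched below: sweep a threshold over the observed scores and report the point where the false-acceptance and false-rejection rates are closest. This is a generic estimate, not necessarily how the reported figures were obtained, and `equal_error_rate` is a hypothetical name.

```python
def equal_error_rate(genuine, impostor):
    """Estimate the EER by sweeping the decision threshold over all scores.
    A trial is accepted when its score is at or above the threshold."""
    best_gap, eer = None, None
    for th in sorted(set(genuine) | set(impostor)):
        frr = sum(s < th for s in genuine) / len(genuine)      # true speakers rejected
        far = sum(s >= th for s in impostor) / len(impostor)   # impostors accepted
        gap = abs(far - frr)
        if best_gap is None or gap < best_gap:
            best_gap, eer = gap, (far + frr) / 2
    return eer

eer = equal_error_rate([0.9, 0.8, 0.7, 0.4], [0.1, 0.2, 0.3, 0.6])
```

In this toy data one true-speaker score (0.4) falls below one impostor score (0.6), so one trial in four is misclassified on each side at the balancing threshold.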

First, Figure 6 shows the results of an experiment that searches for the optimal learning rate by varying the learning rate with the basic online EBP algorithm alone.

In these results the fastest learning was achieved at a learning rate of 0.5, but a learning rate of 1.0 is the reasonable choice when the lowest error rate is taken into account.

Next, Figure 7 shows the results when COIL is applied instead of online EBP at this learning rate. Here the experiment with online EBP is denoted OnEBP, the experiment with only the variable learning rate CIL, and the experiment with both the variable learning rate and pattern omission COIL.

In these results the upper limit of the learning rate for CIL and COIL is 1.0; as with online EBP, this value was chosen as the best when training time and error rate are considered together. Training time improved by 95.1% for CIL and by 229.2% for COIL.

Figure 8 shows the results when DCS is applied with online EBP. Because these results must confirm that the number of patterns actually used in training decreases, the number of patterns used over the whole training period is recorded instead of the number of epochs.

In these results the smallest cohort size that does not raise the error rate is 26 speakers, and the cohort size at which the error rate rises by 0.19% is 14 speakers; training time improved over online EBP by 4.6% and 54.9%, respectively. In addition, when the cohort size equals the number of background speakers (that is, when there is no reduction in training data), training was about 5% slower than without the cohort method, because of the extra computation introduced by MLP-I.

Finally, Figure 9 shows the results of using both COIL and DCS.

In these results the smallest cohort size at which the error rate rises by 0.05% is 20 speakers, and at a 0.19% rise it is 17 speakers; training time improved over online EBP by 305.1% and 351.4%, respectively.

Figure 10 summarizes the improvement of each experiment at an error rate of about 1.6%.

As these results show, when the COIL and DCS methods are applied together they act synergistically and achieve a higher improvement in training time than when either is applied alone.

Preferred embodiments of the present invention have been illustrated and described above. The present invention is not limited to these embodiments, however; anyone with ordinary skill in the art to which the invention pertains can make various modifications without departing from the gist of the invention as claimed, and such modifications fall within the scope of the claims.

As described above, the MLP offers several advantages over other pattern recognition methods and is promising as the speaker learning and recognition method of a speaker verification system. MLP training, however, takes considerable time because the EBP algorithm used for it is slow. Combined with the need for many background speakers to achieve a high speaker recognition rate, this means that enrolling a speaker takes a long time, which fails to meet the real-time requirement that the system provide verification service immediately after enrollment.

To solve this problem, the present invention shortens the time required for speaker enrollment in an MLP-based speaker verification system by using the COIL method, which speeds up EBP learning, and the DCS method, which brings the cohort speaker method of conventional speaker verification to the MLP. Applying the two methods to an MLP-based speaker verification system that uses continuants as the recognition unit improved enrollment time in each case, and applying both together gave an enrollment speed roughly four times that of the conventional online EBP learning algorithm.
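As a rough consistency check on the "four times" figure, assume (the text does not state this definition) that the improvement percentage measures how much longer baseline online EBP training takes than the improved training:

```latex
\text{improvement} = \left(\frac{T_{\mathrm{OnEBP}}}{T_{\mathrm{new}}} - 1\right) \times 100\%
\quad\Longrightarrow\quad
\frac{T_{\mathrm{OnEBP}}}{T_{\mathrm{new}}} = 1 + \frac{305.1}{100} \approx 4.05
```

Under that reading, the 305.1% and 351.4% improvements reported for COIL combined with DCS correspond to speed-ups of about 4.05 and 4.51 times.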

Meanwhile, the experimental results show a smaller speed improvement for DCS than for COIL. This is presumably because the small number of background speakers used in the experimental system limited the reduction that DCS could achieve. Applying more background speakers in the future can be expected to improve the recognition rate and to bring out a greater effect from DCS.

Claims

A method for improving the enrollment speed of an MLP-based speaker verification system by improving learning speed and reducing training data, the method comprising:

a speech analysis and feature extraction step in which input speech sampled at 16 kHz with 16-bit resolution is divided into 30 ms frames overlapping by 20 ms; a 16th-order Mel-spaced filter bank is extracted from each frame and used for isolated-word and continuant detection, the coefficients up to 1 kHz being averaged and that average subtracted from all coefficients to remove the effect of vocal loudness on the overall spectral envelope, and the mean of all coefficients then being adjusted to zero for effective MLP learning; and a 50th-order Mel filter bank equally spaced over the 0 to 3 kHz band is extracted from each frame and used for speaker verification, the coefficients up to 1 kHz being averaged and that average subtracted from all coefficients, and the mean of all coefficients then being adjusted to zero for effective MLP learning;

an isolated-word and continuant detection step of detecting isolated words and the continuants within them using an MLP trained to detect each continuant and silence in a speaker-independent manner;

a per-continuant enrolling-speaker learning step in which, when basic online EBP or COIL is applied, the MLP is trained using enrolling-speaker and background-speaker data, and, when MLP-I and MLP-II are applied, for each continuant every instance of that continuant in all isolated words is input to MLP-I, the output-neuron values are averaged, n background speakers are selected in descending order of the average, and MLP-II is trained on the enrolling speaker using the data of the n background speakers selected for each continuant, the number of patterns used for MLP-II training for each continuant being 10 per background speaker;

a per-continuant speaker score evaluation step in which, when basic online EBP or COIL is applied, for each continuant every instance of that continuant in all isolated words is input to the MLP and the output-neuron values are averaged, and, when MLP-I and MLP-II are applied, for each continuant every instance of that continuant in all isolated words is input to MLP-I, the output-neuron values are averaged, n background speakers are selected in descending order of the average, the MLP-II outputs are averaged over all continuants for which the selected background speakers include at least one background speaker of the per-continuant enrolling-speaker learning step, and the claimant is rejected if no continuant includes at least one such background speaker; and

an enrolled-word and speaker-score threshold comparison step of comparing the average from the per-continuant speaker score evaluation step with a preset threshold to make the final accept/reject decision.

The method of claim 1, wherein the MLP, MLP-I, and MLP-II involved in speaker learning each have a two-layer structure including one hidden layer.

The method of claim 1, wherein MLP-II consists of one output neuron and two hidden neurons.