KR20040038419A

KR20040038419A - A method and apparatus for recognizing emotion from a speech

Info

Publication number: KR20040038419A
Application number: KR1020020067348A
Authority: KR
Inventors: 한우진; 최도연
Original assignee: 에스엘투(주)
Priority date: 2002-11-01
Filing date: 2002-11-01
Publication date: 2004-05-08

Abstract

The present invention relates to a speech-based emotion recognition system and emotion recognition method, and more particularly to a system and method that effectively estimate a person's emotional state by extracting and analyzing several parameters indicative of that state from an uttered speech signal.

The speech-based emotion recognition method according to the present invention comprises: an utterance input step of receiving a speech signal produced by a speaker's utterance; a non-verbal parameter extraction step of extracting at least one non-verbal parameter from the utterance; a linguistic parameter extraction step of extracting at least one linguistic parameter according to the meaning of the utterance by performing speech recognition on the speech signal; and an emotional state calculation step of finally calculating the speaker's emotional state by combining the linguistic parameter and the non-verbal parameter.

Description

Emotion Recognition System and Emotion Recognition Method Using Voice {A METHOD AND APPARATUS FOR RECOGNIZING EMOTION FROM A SPEECH}

The present invention relates to a speech-based emotion recognition system and emotion recognition method, and more particularly to a system and method that effectively estimate a person's emotional state by extracting and analyzing several parameters indicative of that state from an uttered speech signal.

Speech is the most natural means of human communication, a means of conveying information, and, as the means by which language is realized, the meaningful sound that humans produce.

Attempts to realize voice communication between humans and machines have advanced steadily over the years. More recently, the field of speech information technology (SIT), which deals with processing speech information effectively, has made remarkable progress and is increasingly being applied in everyday life.

Speech information technology can be broadly divided into categories such as speech recognition, speech synthesis, speaker identification and verification, and speech coding.

Of these, speech recognition is the technology of recognizing uttered speech and converting it into a character string; speech synthesis is the technology of converting a character string back into speech using data or parameters obtained from speech analysis; speaker identification and verification is the technology of estimating or authenticating the speaker from uttered speech; and speech coding is the technology of efficiently compressing and encoding a speech signal.

Meanwhile, in connection with speech recognition technology, one can conceive of emotion recognition technology, which estimates and recognizes a speaker's emotional state from the speech signal.

The ultimate goal of emotion recognition technology is to realize an interface between humans and machines that works the way people converse with one another, through the language, voice, gestures, and so on that people use in daily life. It can be broadly divided into auditory emotion recognition, which recognizes emotion from the speaker's voice, and visual emotion recognition, which recognizes emotion from the speaker's facial expression. The present invention relates to auditory emotion recognition.

In general, a person can estimate another person's emotional state to some degree just by hearing that person's voice. According to research by Fukuda, humans can estimate the emotional state to some degree from the non-verbal information in the utterance alone, even when no linguistic information is available (S. Fukuda and V. Kostov, "Extracting Emotion from Voice," Proc. of IEEE, pp. IV-299-304, 1999).

For example, the speech of an agitated, angry speaker not only has a higher mean pitch than in the normal state, but the pitch also changes more sharply within a sentence, even when the sentence consists of the same words. In addition, because the voice becomes louder, the energy is relatively higher than in the normal state in the same environment, and the mean energy also changes relatively sharply. Fukuda extracted pitch information using the cepstrum method, one of the representative methods of speech analysis, then estimated the emotional state of uttered speech using the mean pitch obtained from speech data collected in the normal state and the mean pitch obtained in the angry state, and conducted experiments on this approach.

G. Zhou also used the mean pitch of the utterance to recognize the emotional state of speech but, unlike Fukuda, estimated the pitch with the Teager energy operator (TEO), which performs better than the cepstrum, and conducted experiments on both artificially generated data and real data (G. Zhou, J. H. L. Hansen, and J. F. Kaiser, "Classification of Speech under Stress based on Features derived from the Nonlinear Teager Energy Operator," Proc. of ICASSP, pp. 549-552, 1998).

Pitch generally falls in the range of 60 Hz to 750 Hz. Denoting the mean pitch by AvgP, the AvgP of a given person in the angry state is relatively higher than that person's AvgP in the normal state, so the mean pitch can be used as a criterion for recognizing the speaker's emotional state.

Likewise, speech energy, the parameter representing the loudness of speech, plays a very important role in emotion extraction. For example, for the same person, an utterance in the angry state has relatively higher energy than one in the normal state, because the energy of a sound wave is proportional to the square of its amplitude. Since an angry speaker tends to raise his or her voice and shout, a speech energy measured to be markedly higher than usual indicates that the speaker is likely angry.

A reliable emotion recognition system can contribute to a variety of fields. For example, in operating a call center where agents must deal with customers directly, a system can be built that connects an agitated customer straight from the ARS automatic response stage to an agent or, if an agent is already connected, transfers the call to a more experienced agent. In addition, a database of the vocabulary that customers utter in particular emotional states can be built and used later as material for customer relationship management (CRM). Furthermore, by training on data classified by the speaker's emotional state, the speech recognition system can be optimized and its overall recognition rate improved. This resolves the situation in which an agitated speaker's voice is recognized poorly, which further aggravates dissatisfaction with the speech recognition service itself, and thereby raises service satisfaction.

Because systems that recognize emotion from a speaker's utterance have such broad and useful applications, many technologies for recognizing a speaker's emotion from speech have been developed worldwide since Fukuda's research, and patent applications for them have been filed and granted. Most of this prior art, as in Fukuda's research, recognizes the speaker's emotion mainly from parameters relating to the mean pitch or the magnitude of the energy.

However, recognizing a speaker's emotion from parameters relating to the mean pitch or the energy magnitude has several problems. For example, the mean pitch varies greatly from speaker to speaker, and speech energy is highly sensitive to external environmental factors such as the microphone volume, the condition of the telephone line, and the surroundings.

Therefore, to recognize the speaker's emotional state more accurately, new parameters that reflect the speaker's emotional state must be extracted in addition to the speech parameters used in the prior art, and the data measured with these parameters must be considered together.

Accordingly, the present inventors propose a number of new parameters in addition to those used in prior-art emotion recognition systems, and seek to provide an emotion recognition system that appropriately combines the data measured with these parameters.

For example, noting that the mean pitch differs greatly from person to person, the variation in pitch (average pitch difference) is also taken into account. To minimize interference from external environmental factors in measuring speech energy, the variation in speech energy (average energy difference) is also taken into account. And, noting that a speaker in an agitated state speaks much faster, the average speech rate is adopted as one of the factors for emotion recognition.

Moreover, prior-art emotion recognition systems focus so heavily on the fact that a speaker's emotional state can be estimated from the non-verbal information in an utterance that they overlook the linguistic information, that is, the content of the utterance. For example, a speaker who habitually speaks quickly and in an agitated tone (such as a speaker of the Gyeongsang-do dialect) may be wrongly recognized as angry even though the content of the utterance shows that he or she is not actually agitated. The reverse can also occur. The present applicant therefore seeks to provide a more reliable emotion recognition system by using linguistic as well as non-verbal information as a basis for emotion recognition.

An object of the present invention is to provide a speech-based emotion recognition system and emotion recognition method that extract and comprehensively analyze a plurality of parameters relating to the linguistic and non-verbal information in a speaker's utterance that indicates the speaker's emotional state, in order to estimate and recognize that state effectively.

FIG. 1 is a flowchart showing an example of the speech-based emotion recognition method according to the present invention.

FIG. 2 is a flowchart showing the flowchart of FIG. 1 in more detail.

FIG. 3 is a flowchart showing in more detail how, in an example of the speech-based emotion recognition method according to the present invention, the measured values of the parameters are combined to calculate the speaker's emotional state in the emotional state calculation step.

FIG. 4 is a block diagram showing an example of the speech-based emotion recognition system according to the present invention.

FIG. 5 is a block diagram showing an example of a call center agent connection system using the speech-based emotion recognition system according to the present invention.

To achieve the above object, the speech-based emotion recognition method according to the present invention comprises: an utterance input step of receiving a speech signal produced by a speaker's utterance; a non-verbal parameter extraction step of extracting at least one non-verbal parameter from the utterance; a linguistic parameter extraction step of extracting at least one linguistic parameter according to the meaning of the utterance by performing speech recognition on the speech signal; and an emotional state calculation step of finally calculating the speaker's emotional state by combining the linguistic parameter and the non-verbal parameter.

Here, the non-verbal parameters preferably include the mean pitch of the speech signal, obtained by dividing the speech signal into frames of a fixed duration and dividing the sum of the pitch values of the voiced frames by the number of voiced frames.

The non-verbal parameters preferably also include the variance of the pitch of the speech signal, obtained by dividing the sum of the squared differences between the mean pitch and the pitch of each voiced frame by the number of voiced frames.

The non-verbal parameters preferably also include the mean speech energy of the speech signal, obtained by dividing the speech signal into frames of a fixed duration and dividing the sum of the speech energy of the voiced frames by the number of voiced frames.

The non-verbal parameters preferably also include the variance of the speech energy of the speech signal, obtained by dividing the sum of the squared differences between the mean speech energy and the speech energy of each voiced frame by the number of voiced frames.
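The four non-verbal parameters above share one pattern: a mean and a variance taken over voiced frames only. The following is a minimal sketch of that pattern, assuming the per-frame values (pitch or energy) and the voiced/unvoiced flags are already available as plain lists. The function names are hypothetical and do not come from the patent.

```python
from typing import Sequence


def voiced_mean(values: Sequence[float], voiced: Sequence[int]) -> float:
    """Sum of the values in voiced frames divided by the number of voiced frames."""
    count = sum(1 for v in voiced if v)
    if count == 0:
        raise ValueError("no voiced frames")
    total = sum(x for x, v in zip(values, voiced) if v)
    return total / count


def voiced_variance(values: Sequence[float], voiced: Sequence[int]) -> float:
    """Sum of squared differences from the voiced mean, divided by the voiced count."""
    mean = voiced_mean(values, voiced)
    count = sum(1 for v in voiced if v)
    squared = sum((x - mean) ** 2 for x, v in zip(values, voiced) if v)
    return squared / count
```

The same two functions would serve for both pitch and speech energy, since the text defines the energy statistics in exactly the same way as the pitch statistics.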

The linguistic parameters preferably include the speech rate of the speech signal.

Here, the speech rate of the speech signal may be the relative speed of the mean actual phoneme length of the speech signal with respect to its mean standard phoneme length. The mean standard phoneme length of the speech signal may be the mean of the lengths of those phonetic-symbol phonemes, among the phonemes pre-stored in the unit-sound database of the speech recognition module, that make up the speech signal as recognized by the speech recognition module.

The mean actual phoneme length of the speech signal may be the sum, over the phonetic-symbol phonemes that make up the speech signal, of each phoneme's length multiplied by that phoneme's weight, divided by the number of kinds of phonetic-symbol phonemes stored in the speech recognition module.

Here, it is preferable that phonemes in the medial (中聲) position be given a higher weight than phonemes in the initial (初聲) and final (終聲) positions.
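The phoneme-length definitions above can be sketched as follows. This is only an illustration under stated assumptions: the `Phoneme` record, the position labels, the weight values in `POSITION_WEIGHT`, and the function names are all hypothetical. The text requires only that medial phonemes weigh more than initial and final ones. The relative deviation shown last is one plausible reading of "relative speed", chosen because the emotional state calculation step normalizes the phoneme length in the same way.

```python
from dataclasses import dataclass
from typing import Dict, Sequence


@dataclass
class Phoneme:
    symbol: str       # phonetic symbol reported by the recognizer
    position: str     # "initial", "medial", or "final"
    length_ms: float  # measured duration in the utterance


# Hypothetical weights; the text only says medial > initial and final.
POSITION_WEIGHT: Dict[str, float] = {"initial": 1.0, "medial": 1.5, "final": 1.0}


def mean_actual_phoneme_length(phonemes: Sequence[Phoneme], inventory_size: int) -> float:
    """Weighted sum of measured lengths divided by the number of phoneme kinds stored."""
    weighted = sum(p.length_ms * POSITION_WEIGHT[p.position] for p in phonemes)
    return weighted / inventory_size


def mean_standard_phoneme_length(phonemes: Sequence[Phoneme],
                                 standard_length_ms: Dict[str, float]) -> float:
    """Mean of the stored standard lengths of the phonemes that were recognized."""
    return sum(standard_length_ms[p.symbol] for p in phonemes) / len(phonemes)


def relative_deviation(actual: float, reference: float) -> float:
    """(actual - reference) / reference; negative when phonemes are shorter than standard."""
    return (actual - reference) / reference
```

Under this reading, a negative deviation means shorter phonemes and therefore faster speech.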

The linguistic parameters preferably also include an emotion calculation coefficient corresponding to each special word contained in the result of recognizing the speech signal by the speech recognition module.

In the emotional state calculation step, the speaker's emotional state is preferably calculated by comparing a final emotional state coefficient with a predetermined threshold, the final emotional state coefficient being the sum of the following six values: (1) the difference between the actual mean pitch and the standard mean pitch, divided by the standard mean pitch and multiplied by a first weight; (2) the difference between the actual mean speech energy and the standard mean speech energy, divided by the standard mean speech energy and multiplied by a second weight; (3) the difference between the standard deviation of the actual pitch and the standard deviation of the standard pitch, divided by the standard deviation of the standard pitch and multiplied by a third weight; (4) the difference between the standard deviation of the actual speech energy and the standard deviation of the standard speech energy, divided by the standard deviation of the standard speech energy and multiplied by a fourth weight; (5) the difference between the actual mean phoneme length and the reference mean phoneme length, divided by the reference mean phoneme length and multiplied by a fifth weight; and (6) the emotion calculation coefficient corresponding to the special words contained in the speech recognition result, multiplied by a sixth weight.
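The six-term sum described above can be sketched as below. This is an illustrative sketch only: the feature keys, function names, and dictionary layout are hypothetical, and the direction of the threshold comparison (a larger coefficient meaning a more agitated speaker) is an assumption, since the text says only that the coefficient is compared with a predetermined threshold. The signs and magnitudes of the weights are left to the caller.

```python
from typing import Dict

# Hypothetical keys for the five normalized features (terms 1 to 5).
FEATURES = ("pitch_mean", "energy_mean", "pitch_std", "energy_std", "phoneme_length_mean")


def relative_deviation(actual: float, reference: float) -> float:
    """(actual - reference) / reference, the normalization used by each of terms 1 to 5."""
    return (actual - reference) / reference


def final_emotion_coefficient(actual: Dict[str, float],
                              reference: Dict[str, float],
                              weights: Dict[str, float],
                              word_coefficient: float,
                              word_weight: float) -> float:
    """Weighted sum of the five relative deviations plus the weighted word coefficient."""
    score = sum(weights[name] * relative_deviation(actual[name], reference[name])
                for name in FEATURES)
    return score + word_weight * word_coefficient


def is_agitated(coefficient: float, threshold: float) -> bool:
    """Assumed decision rule: agitated when the coefficient exceeds the threshold."""
    return coefficient > threshold
```

Because faster speech shortens phonemes, the fifth term is negative for an agitated speaker, so a caller would presumably give `phoneme_length_mean` a negative weight. That choice is an assumption and is not stated in the text.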

The speech-based emotion recognition system according to the present invention comprises a speech signal input unit, a signal processing unit, a speech recognition unit, a control unit, and an output unit. The speech signal input unit converts the speaker's voice into an electrical signal and passes it to at least one of the signal processing unit and the speech recognition unit. The signal processing unit further comprises a mean pitch extraction unit that extracts the mean pitch of the speech signal, a pitch variance extraction unit that extracts the degree of variation about the mean pitch, a mean speech energy extraction unit that extracts the mean speech energy of the speech signal, and a speech energy variance extraction unit that extracts the degree of variation about the mean speech energy. The speech recognition unit further comprises a speech rate extraction unit that extracts the speech rate of the speech signal and a speech recognition result extraction unit that extracts specific words contained in the speech signal. The control unit calculates a final emotion recognition coefficient using at least one of the mean pitch, the pitch variance, the mean speech energy, the speech energy variance, the speech rate, and the speech recognition result produced by the signal processing unit and the speech recognition unit, and recognizes the speaker's emotional state by comparing the final emotion recognition coefficient with a predetermined threshold.

A call center agent connection system using the emotion recognition system according to the present invention comprises a private branch exchange connected to the customer's telephone line, a CTI server that controls the overall operation of the call center, an emotion recognition server containing the emotion recognition system, an automatic response server connected to the CTI server, and at least one agent telephone line connected to the CTI server. When the emotion recognition server recognizes the customer's emotional state from the speech signal of the customer's utterance, the CTI server, depending on the result, switches the call on the customer's telephone line from the automatic response server to an agent telephone line, or switches the call from the agent telephone line currently connected to another telephone line.
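The call-switching behavior of the CTI server can be pictured as a small decision function. The sketch below is hypothetical throughout: the `Route` names and the `next_route` function are not part of the patent, and treating the "other telephone line" as a more experienced agent follows the call center example given in the background rather than the claim language itself.

```python
from enum import Enum


class Route(Enum):
    AUTO_RESPONSE = "auto_response"  # customer is talking to the automatic response server
    AGENT = "agent"                  # customer is connected to an agent line
    SENIOR_AGENT = "senior_agent"    # assumed: a more experienced agent's line


def next_route(current: Route, agitated: bool) -> Route:
    """Return where the call should be connected after an emotion recognition result."""
    if not agitated:
        return current                # no switch unless agitation is recognized
    if current is Route.AUTO_RESPONSE:
        return Route.AGENT            # escalate from automatic response to an agent
    if current is Route.AGENT:
        return Route.SENIOR_AGENT     # escalate from one agent line to another
    return current                    # already on the most experienced line
```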

Hereinafter, the invention is described in more detail with reference to the accompanying drawings.

FIG. 1 is a flowchart showing an example of the speech-based emotion recognition method according to the present invention.

As described above, the speech-based emotion recognition method according to the present invention seeks to provide a more reliable emotion recognition system by using linguistic as well as non-verbal information as a basis for emotion recognition.

As shown in FIG. 1, in the speech-based emotion recognition method according to the present invention, the speaker's utterance is first received through a suitable speech input means (such as a microphone) in the utterance input step (100). Next, in the non-verbal parameter extraction step (110), physical parameters unrelated to the linguistic meaning of the speech signal, such as its pitch and speech energy, are extracted. Then, in the linguistic parameter extraction step (120), the speech recognition unit performs speech recognition and segments the speech signal into phonemes, yielding linguistically and phonologically meaningful parameters such as the phoneme lengths and the meaning of the recognized speech signal. In the emotional state calculation step (130), the parameters extracted in the non-verbal parameter extraction step (110) and the linguistic parameter extraction step (120) are combined, and an appropriate weight is assigned to each, to calculate a final emotional state coefficient that approximates the speaker's emotional state as closely as possible.

In this embodiment, as shown in the drawing, the linguistic parameter extraction step (120) follows the non-verbal parameter extraction step (110). However, the order of the two steps may be reversed, and in another embodiment the two steps may be performed simultaneously in parallel.

FIG. 2 is a flowchart showing the flowchart of FIG. 1 in more detail.

In this embodiment, the non-verbal parameter extraction step (21) comprises four extraction steps, each extracting one non-verbal parameter: a mean pitch extraction step (210), a pitch variance extraction step (220), a mean speech energy extraction step (230), and a speech energy variance extraction step (240). The linguistic parameter extraction step (22) comprises two extraction steps, each extracting one linguistic parameter: a mean phoneme length extraction step (250) and a speech recognition result extraction step (260).

First, the mean pitch extraction step (210) is described.

Many methods have been proposed for calculating the mean pitch of a speech signal. However, by the very nature of speech, pitch can be measured only in voiced segments and not in unvoiced segments. Therefore, to make the extracted pitch reliable, the speech signal must be separated into voiced and unvoiced segments. In the present invention, the pitch is estimated using the normalized autocorrelation function method, and a new method of separating the speech signal into voiced and unvoiced segments has been devised, greatly improving the reliability of mean pitch extraction.

In the normalized autocorrelation function method, the speech signal is first divided into short segments of about 20 ms. For each segment, the value of the normalized autocorrelation function is computed for every candidate pitch, and the pitch that maximizes the function is taken as the pitch of that segment. Each segment of about 20 ms is called a frame. If the total number of frames is N and the pitch of the n-th frame is P(n), the mean pitch of the input speech signal is determined as follows:

$$\mathrm{AvgP} = \frac{1}{N}\sum_{n=1}^{N} P(n)$$
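A minimal sketch of per-frame pitch estimation by normalized autocorrelation follows, assuming a frame is a plain list of samples. The function and parameter names are hypothetical. The 60 Hz to 750 Hz search range is the pitch range quoted earlier in the text. The near-tie rule that prefers the shortest lag is an added assumption, used here to avoid picking a multiple of the true period, and is not described in the patent.

```python
import math
from typing import Sequence


def estimate_pitch(frame: Sequence[float], sample_rate: float,
                   f_min: float = 60.0, f_max: float = 750.0) -> float:
    """Return the pitch in Hz whose lag maximizes the normalized autocorrelation."""
    lag_min = max(1, int(sample_rate / f_max))
    lag_max = min(len(frame) - 1, int(sample_rate / f_min))
    best_lag, best_score = 0, -1.0
    for lag in range(lag_min, lag_max + 1):
        head, tail = frame[:-lag], frame[lag:]
        numerator = sum(x * y for x, y in zip(head, tail))
        denominator = math.sqrt(sum(x * x for x in head) * sum(y * y for y in tail))
        score = numerator / denominator if denominator > 0 else 0.0
        # Assumption: a longer lag must beat the current best by a clear margin,
        # so that exact multiples of the period do not win on rounding noise.
        if score > best_score + 1e-6:
            best_lag, best_score = lag, score
    return sample_rate / best_lag if best_lag else 0.0


def mean_pitch(pitches: Sequence[float]) -> float:
    """Arithmetic mean of the per-frame pitch values P(n) over all N frames."""
    return sum(pitches) / len(pitches)
```

At a sample rate of 8 kHz, a 20 ms frame holds 160 samples, so lags up to 133 samples (about 60 Hz) still fit inside a single frame.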

In practice, however, a mean computed this way is distorted by pitch values wrongly estimated in unvoiced segments, such as some consonant segments, which degrades overall accuracy. The present invention therefore computes the mean pitch as follows.

$$\bar{P} = \frac{\sum_{n=1}^{N} v(n)\,P(n)}{\sum_{n=1}^{N} v(n)}$$

In the above equation, v(n) is 1 if the n-th frame is voiced and 0 if it is unvoiced. To determine v(n), the pitch values of the three most recent frames including the current one, namely P(n), P(n-1), and P(n-2), are examined. If their variation is smaller than an experimentally determined threshold (for example, 10% of the current pitch), the pitch is considered relatively stable and the frame is treated as voiced, that is, v(n)=1. Conversely, if the variation is larger than the threshold, the pitch is considered unstable and the frame is treated as unvoiced, that is, v(n)=0. This relationship is expressed by the formula below.

$$v(n) = \begin{cases} 1, & \text{if the variation of } P(n),\ P(n-1),\ P(n-2) \text{ is below the threshold} \\ 0, & \text{if that variation is above the threshold} \end{cases}$$
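The voiced/unvoiced decision and the resulting voiced-only mean can be sketched as follows. The text does not say how the "variation" of the three pitch values is measured; this sketch assumes it is the spread (maximum minus minimum) of P(n), P(n-1), and P(n-2), and it treats the first two frames, which lack a full three-frame history, as unvoiced. The names `voiced_flags` and `voiced_mean` are hypothetical.

```python
def voiced_flags(pitches, ratio=0.10):
    """Return v(n) for each frame: 1 if the pitch is stable, else 0.

    Assumption: "variation" is max - min over P(n), P(n-1), P(n-2),
    compared against ratio * P(n) (10% of the current pitch by default).
    """
    flags = []
    for n, current in enumerate(pitches):
        if n < 2 or current <= 0:
            flags.append(0)  # not enough history, or no pitch found
            continue
        window = pitches[n - 2:n + 1]
        spread = max(window) - min(window)
        flags.append(1 if spread < ratio * current else 0)
    return flags


def voiced_mean(values, flags):
    """Sum of v(n) * value(n) divided by the number of voiced frames."""
    count = sum(flags)
    if count == 0:
        return 0.0
    return sum(v * x for v, x in zip(flags, values)) / count


pitches = [200.0, 202.0, 198.0, 201.0, 90.0, 310.0, 150.0, 205.0, 204.0, 206.0]
flags = voiced_flags(pitches)
average = voiced_mean(pitches, flags)
```

In this example the erratic values 90, 310, and 150 and the frames immediately after them are flagged unvoiced, so they do not pull the mean away from the speaker's stable pitch of about 200 Hz.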

Next, the pitch variance extraction step 220 is described.

Pitch inevitably differs from person to person. When the emotional states of an unspecified number of speakers are to be recognized, it is therefore difficult to set a fixed pitch criterion and use pitch alone as the basis for judging emotional state.

Although pitch differs between speakers, the amount by which the input voice signal deviates from its own mean pitch is very useful as another parameter for judging emotional state, because various studies and experiments have shown that pitch generally fluctuates more sharply than usual when a speaker is agitated.

In the emotion recognition method of the present invention, once the mean pitch has been calculated in the mean pitch extraction step 210, this variation is readily obtained as the variance of the pitch, as follows.

$$\sigma_P^2 = \frac{\sum_{n=1}^{N} v(n)\,\bigl(P(n) - \bar{P}\bigr)^2}{\sum_{n=1}^{N} v(n)}$$

Here $\bar{P}$ is the mean pitch over the voiced frames, and $v(n)$ is 1 for a voiced frame and 0 for an unvoiced frame.
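As a small sketch of this step, the variance can be computed over the voiced frames only, as claim 3 describes (the sum of squared deviations from the mean, divided by the number of voiced frames). The function name `voiced_variance` and the sample values are illustrative assumptions.

```python
def voiced_variance(values, flags):
    """Variance of values over frames where flags[n] == 1.

    Uses the population form (divide by the voiced-frame count),
    matching the wording of claim 3.
    """
    count = sum(flags)
    if count == 0:
        return 0.0
    mean = sum(v * x for v, x in zip(flags, values)) / count
    return sum(v * (x - mean) ** 2 for v, x in zip(flags, values)) / count


pitches = [200.0, 202.0, 198.0, 201.0, 90.0, 310.0, 150.0, 205.0, 204.0, 206.0]
flags = [0, 0, 1, 1, 0, 0, 0, 0, 0, 1]  # example voiced/unvoiced decisions
pitch_variance = voiced_variance(pitches, flags)
```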

Next, the mean voice energy extraction step 230 and the voice energy variance extraction step 240 are described.

Unlike pitch, energy is in general a parameter that does not depend on whether a segment is voiced or unvoiced. Because the present invention extracts parameters from the parts of the signal that change markedly with emotional state, however, only the mean voice energy over voiced segments is calculated and used as a parameter for emotion recognition. An angry speaker typically speaks louder than usual; in that case the energy of unvoiced segments changes little, and only the energy of voiced segments changes substantially. Calculating the mean energy over voiced segments is therefore effective.

Accordingly, the mean voice energy extraction step 230 extracts the voice energy E(n) of the n-th frame and obtains the mean voice energy of the input voice signal as follows.

$$\bar{E} = \frac{\sum_{n=1}^{N} v(n)\,E(n)}{\sum_{n=1}^{N} v(n)}$$

Here $v(n)$ is 1 for a voiced frame and 0 for an unvoiced frame, and N is the total number of frames.

As with pitch, it is effective to use not only the mean voice energy but also its variation, that is, its variance, as a further parameter for recognizing emotional state. The voice energy variance extraction step 240 therefore extracts the variance of the voice energy as follows.

$$\sigma_E^2 = \frac{\sum_{n=1}^{N} v(n)\,\bigl(E(n) - \bar{E}\bigr)^2}{\sum_{n=1}^{N} v(n)}$$

Here $\bar{E}$ is the mean voice energy over the voiced frames, and $v(n)$ is 1 for a voiced frame and 0 for an unvoiced frame.
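The two energy steps can be sketched together. The text does not state how the frame energy E(n) is computed; this sketch assumes the mean of the squared samples, which is one common choice. The names `frame_energy` and `voiced_energy_stats` are hypothetical.

```python
def frame_energy(samples):
    """E(n) for one frame. Assumption: mean of squared sample values."""
    return sum(s * s for s in samples) / len(samples) if samples else 0.0


def voiced_energy_stats(frames, flags):
    """Return (mean, variance) of E(n) over frames where flags[n] == 1."""
    energies = [frame_energy(frame) for frame in frames]
    count = sum(flags)
    if count == 0:
        return 0.0, 0.0
    mean = sum(v * e for v, e in zip(flags, energies)) / count
    variance = sum(v * (e - mean) ** 2 for v, e in zip(flags, energies)) / count
    return mean, variance


frames = [
    [1.0, -1.0, 1.0, -1.0],    # voiced, energy 1.0
    [0.1, 0.1, -0.1, -0.1],    # unvoiced, ignored
    [2.0, -2.0, 2.0, -2.0],    # voiced, energy 4.0
]
energy_mean, energy_variance = voiced_energy_stats(frames, [1, 0, 1])
```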

Next, the mean phoneme length extraction step 250 and the speech recognition result extraction step 260, both included in the linguistic parameter extraction step 22, are described.

The mean phoneme length extraction step 250 extracts the mean phoneme length from the input voice signal. This value is measured in order to calculate the speaker's speech rate, because an agitated speaker generally tends to speak faster than usual.

There are several ways to estimate speech rate. The simplest is to store the duration of an earlier utterance of the same content and use it as the reference for judging the rate of later utterances. This is rarely practical, however, because the content to be spoken must be fixed in advance. To solve this problem, the present invention proposes a new method of measuring utterance length. The speech is recognized by a speech recognition module; based on the recognized words, the corresponding sequence of Korean phonetic-symbol phonemes is aligned to the actual speech; and the mean duration of each phoneme is obtained. The mean durations of all phonemes that occurred are then averaged, and the result serves as the parameter representing speech rate.

In one example of the emotion recognition system using voice according to the present invention, a total of 47 phonemes, shown in Table 1 below, may be used as the Korean phonetic-symbol phoneme set for calculating speech rate.

Position | Phonemes
Initial consonant (초성) | ㄱ ㄲ ㄴ ㄷ ㄸ ㄹ ㅁ ㅂ ㅃ ㅅ ㅆ ㅇ ㅈ ㅉ ㅊ ㅋ ㅌ ㅍ ㅎ
Medial vowel (중성) | ㅏ ㅐ ㅑ ㅒ ㅓ ㅔ ㅕ ㅖ ㅗ ㅘ ㅙ ㅚ ㅛ ㅜ ㅝ ㅞ ㅟ ㅠ ㅡ ㅢ ㅣ
Final consonant (종성) | ㄱ ㄴ ㄷ ㄹ ㅁ ㅂ ㅇ

[Table 1]

The present inventor found that, when speech rate changes, the part that actually changes is generally the medial vowel, while the lengths of the initial and final consonants either hardly change or change without any clear regularity. To reflect this, with T(n) denoting the mean duration of the n-th phoneme, the mean phoneme length D is calculated as follows.

$$D = \frac{1}{M}\sum_{n=1}^{M} \alpha(n)\,T(n)$$

Here $\alpha(n)$ is the weight of the n-th phoneme and M is the number of phonetic-symbol phonemes stored in the speech recognition module.

That is, the mean phoneme length is the sum, over the phonetic-symbol phonemes making up the voice signal, of each phoneme's length multiplied by its weight, divided by the number of distinct phonetic-symbol phonemes stored in the speech recognition module (for example, 47). The weight α(n) may, for example, be set to 0.2 if the n-th phoneme is an initial or final consonant and 1.0 if it is a medial vowel.

In this way the mean phoneme length gives greater weight to the mean duration of the medial vowels. Because it is derived from mean durations measured phoneme by phoneme, it has the advantage that the relative speed of an utterance can be measured accurately regardless of whether the spoken content has been stored in advance.
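The weighted average described above can be sketched as follows, assuming that an aligner has already produced a list of (phoneme, position, duration) segments. That input format, the romanized phoneme labels, and the name `mean_phoneme_length` are illustrative assumptions; the 0.2 and 1.0 weights and the inventory size of 47 are the example values given in the text.

```python
from collections import defaultdict

# Example weights from the text: 0.2 for initial/final consonants, 1.0 for medial vowels.
POSITION_WEIGHTS = {"initial": 0.2, "medial": 1.0, "final": 0.2}


def mean_phoneme_length(segments, inventory_size=47, weights=POSITION_WEIGHTS):
    """Compute D = sum(alpha(n) * T(n)) / inventory_size.

    segments: iterable of (symbol, position, seconds) from an aligner.
    T(n) is the mean duration of each distinct (symbol, position) phoneme;
    phonemes that never occur contribute nothing to the sum.
    """
    durations = defaultdict(list)
    for symbol, position, seconds in segments:
        durations[(symbol, position)].append(seconds)
    total = 0.0
    for (_symbol, position), values in durations.items():
        t_n = sum(values) / len(values)   # mean duration of this phoneme
        total += weights[position] * t_n
    return total / inventory_size


segments = [
    ("g", "initial", 0.05),
    ("a", "medial", 0.10),
    ("a", "medial", 0.14),
    ("n", "final", 0.06),
]
d_value = mean_phoneme_length(segments)
```

Because the divisor is the fixed inventory size rather than the number of phonemes observed, D depends on how many distinct phonemes an utterance contains; it is meaningful mainly as a relative measure against a reference computed the same way.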

In this embodiment, the speech recognition module used to measure the mean duration may be a multi-mixture system based on a continuous hidden Markov model (HMM), and the recognizer may be built on 16-mixture triphones. The alignment of the phonetic-symbol phoneme sequence to the actual speech signal may be performed by the Viterbi algorithm.

Next, the speech recognition result extraction step 260 is described.

Most prior-art emotion recognition methods were developed on the theoretical basis of Fukuda's research, which holds that "humans can estimate a speaker's emotional state to some extent from the non-verbal information of the utterance alone, even when given no linguistic information at all," and so they did not take the content of the utterance into account.

In some cases, however, the speaker's emotional state can also be estimated to some extent from the speech recognition result. For example, a word such as "야!" ("Hey!") can be considered more likely to be used in an agitated state than in the speaker's normal state.

Accordingly, suppose a list of specific words that suggest an agitated state is prepared, the speaker's utterance is recognized, the words matching the list are extracted, and each word is assigned a probability of being used in an agitated state. The result can then serve as a new parameter that, in addition to the non-verbal parameters, further improves the accuracy of emotion recognition.

Such a word list may vary from case to case, but a reasonably common list can be drawn up. Table 2 below shows only part of such a list.

Word | Probability of use in an agitated emotional state (adjustable)
야! (Hey!) | 1.0
너! (You!) | 1.0
바꿔! (Change it!) | 0.8
빨리 (Quickly) | 0.7
안 바꿀래? (Won't you change it?) | 0.7
도대체 (What on earth) | 0.6
엉망 (A mess) | 0.8
기타 (Others) | 0.0

[Table 2]
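A lookup against such a word list might look like the sketch below. The text does not say how a recognizer's output is matched against the list or how several matching words are combined; this sketch assumes simple substring matching on the recognized text and takes the highest probability among the matches. The name `special_word_probability` is hypothetical, and the entries are the examples from Table 2.

```python
# Example entries from Table 2; any word not listed counts as 0.0 ("Others").
SPECIAL_WORDS = {
    "야!": 1.0,
    "너!": 1.0,
    "바꿔!": 0.8,
    "빨리": 0.7,
    "안 바꿀래?": 0.7,
    "도대체": 0.6,
    "엉망": 0.8,
}


def special_word_probability(recognized_text, table=SPECIAL_WORDS):
    """Return the highest listed probability among words found in the text.

    Assumptions: substring matching, and max() as the combination rule.
    """
    matches = [prob for word, prob in table.items() if word in recognized_text]
    return max(matches, default=0.0)


score = special_word_probability("도대체 빨리 좀 해 주세요")
```

A real recognizer would usually emit text without punctuation, so a deployed list would probably need normalized keys; the punctuated forms are kept here only to mirror the table.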

FIG. 3 is a flowchart showing in more detail how, in one example of the emotion recognition method using voice according to the present invention, the emotional state calculation step combines the measured values of the parameters to calculate the speaker's emotional state.

To combine the six pieces of information effectively, namely the mean pitch, pitch variance, mean voice energy, voice energy variance, mean phoneme length, and the probability value for specific words, a large speech database for the normal state and the agitated state was collected, its statistical characteristics were analyzed, and the final emotional state extraction algorithm was defined from them.

Given the mean pitch, mean voice energy, pitch variance, voice energy variance, mean phoneme length, and specific-word probability of the input voice signal, together with the normal-state mean pitch, mean voice energy, pitch variance, voice energy variance, and mean phoneme length obtained from the large body of speech data, the final emotional state coefficient AER is calculated as follows.

In the above equation, a0 to a6 are the weights for the respective parameters and can be changed according to the characteristics of the actual voice service.
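One way to read this combination is the form recited in claim 12: each measured value's relative deviation from its normal-state reference is multiplied by a weight, the weighted special-word coefficient is added, and the terms are summed. The sketch below follows that reading under several assumptions. The description names seven weights (a0 to a6) for six terms, so a0 is treated here as a constant offset. Claim 12 speaks of standard deviations where the description speaks of variances, so the code simply takes whichever spread measure the caller supplies. The dictionary keys and the function name are hypothetical.

```python
PARAMETERS = ("pitch_mean", "energy_mean", "pitch_spread", "energy_spread", "phoneme_len")


def emotional_state_coefficient(measured, reference, weights, word_probability, offset=0.0):
    """Weighted sum of relative deviations plus the weighted word coefficient.

    measured, reference: dicts keyed by PARAMETERS (reference = normal state).
    weights: dict keyed by PARAMETERS plus "word".
    offset: assumed role of a0; the source text does not define it.
    """
    aer = offset
    for name in PARAMETERS:
        relative = (measured[name] - reference[name]) / reference[name]
        aer += weights[name] * relative
    aer += weights["word"] * word_probability
    return aer


measured = {"pitch_mean": 220.0, "energy_mean": 1.5, "pitch_spread": 150.0,
            "energy_spread": 0.3, "phoneme_len": 0.08}
reference = {"pitch_mean": 200.0, "energy_mean": 1.0, "pitch_spread": 100.0,
             "energy_spread": 0.2, "phoneme_len": 0.10}
# Faster speech shortens phonemes, so a negative weight is used for that term here.
weights = {"pitch_mean": 1.0, "energy_mean": 1.0, "pitch_spread": 1.0,
           "energy_spread": 1.0, "phoneme_len": -1.0, "word": 1.0}
aer = emotional_state_coefficient(measured, reference, weights, word_probability=0.7)
```

Using relative deviations makes the terms dimensionless, so pitch in hertz and durations in seconds can be combined without one unit dominating.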

Once the final emotional state coefficient AER has been calculated, the emotional state can be derived from its value. When only two states are needed, normal and angry, a single threshold is set: the speaker is judged to be in the normal state if AER is below the threshold and in the angry state if it is above. When three states are distinguished, as shown in FIG. 3, two thresholds are set: the speaker is judged to be in the normal state if AER is below the first threshold, somewhat excited if it is above the first threshold and below the second, and angry if it is above the second. Other embodiments may subdivide the emotional states further.
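The threshold comparison can be sketched as a small function that works for two, three, or more states. The threshold values 1.0 and 2.0 and the name `classify_emotion` are placeholders, and a value exactly equal to a threshold is assigned to the lower state here because the text leaves that case open.

```python
def classify_emotion(aer, thresholds=(1.0, 2.0),
                     labels=("normal", "somewhat excited", "angry")):
    """Map AER to a label by counting how many thresholds it exceeds.

    thresholds: ascending cut points (placeholder values, not from the source).
    labels: one more entry than thresholds.
    """
    if len(labels) != len(thresholds) + 1:
        raise ValueError("need exactly one more label than thresholds")
    level = sum(1 for cut in sorted(thresholds) if aer > cut)
    return labels[level]


three_state = classify_emotion(1.5)
two_state = classify_emotion(1.2, thresholds=(1.0,), labels=("normal", "angry"))
```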

FIG. 4 is a block diagram showing an example of the emotion recognition system using voice according to the present invention.

The emotion recognition method using voice according to the present invention may be carried out on the emotion recognition system 40 shown in FIG. 4.

The input unit 400 receives the speaker's utterance. The signal processing unit 420 extracts the mean pitch, pitch variance, mean voice energy, and voice energy variance from the input voice signal. The speech recognition unit 430 extracts the mean phoneme length and the specific words found in the speech recognition result. The control unit 410 controls the overall operation of the input unit 400, the signal processing unit 420, the speech recognition unit 430, the database 440, and the output unit 450, and also calculates the final emotional state coefficient AER. The database 440 stores a large amount of speech data on the mean pitch, mean voice energy, pitch variance, and voice energy variance for the normal and angry states, as well as data on the phonetic-symbol phonemes used by the speech recognition unit 430 for recognition. The output unit 450 outputs a signal in a suitable form for passing the emotional state information, determined from the final emotional state coefficient calculated by the control unit 410, to other systems that need it.

Rather than being built as a standalone system, the emotion recognition system using voice according to the present invention may more preferably be added to an existing call center system or the like, for example to greatly improve that system's performance.

The embodiment shown in FIG. 5 includes a private branch exchange (PBX) 510 connected to the customer's telephone line 500, a CTI server 520 that controls the overall operation of the call center, an emotion recognition server 540 containing the emotion recognition system, an automatic response server 530 connected to the CTI server, and at least one agent telephone line 550 connected to the CTI server.

When the emotion recognition server 540 recognizes the customer's emotional state by extracting and combining the parameters described above from the voice signal of the customer's utterance, the CTI server 520 can act on the result. For example, if the customer is angry, the CTI server 520 may switch the customer's call from the automatic response server 530 to the agent telephone line 550 so that the matter is handled one-on-one by an agent. Alternatively, the call may be switched from the agent line currently connected to another line staffed by a more experienced agent, improving both the efficiency of the consultation and customer satisfaction.
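The routing decision described above amounts to a small rule table. The sketch below is purely illustrative: the destination names, the state label "angry", and the function `next_destination` are hypothetical stand-ins for whatever interface a CTI server actually exposes.

```python
def next_destination(emotional_state, current):
    """Decide where the CTI server should route the call next.

    current: "auto_response", "agent", or "senior_agent" (hypothetical names).
    Only an angry caller triggers a transfer; otherwise the call stays put.
    """
    if emotional_state != "angry":
        return current
    if current == "auto_response":
        return "agent"          # hand over from the automatic response server
    if current == "agent":
        return "senior_agent"   # escalate to a more experienced agent
    return current


route = next_destination("angry", "auto_response")
```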

This embodiment describes applying the emotion recognition system using voice according to the present invention to a call center system on the public telephone network, but this is only an example. The invention can equally be applied to call center systems on the Internet and to other IT solutions that require customer service, in any agent connection service that automatically recognizes emotion from the customer's voice and automatically connects an agent suited to the situation.

With the emotion recognition system and emotion recognition method using voice according to the present invention described above, it is possible to implement a system and method that extract and comprehensively analyze a number of parameters, covering both the linguistic and the non-verbal information in the speaker's utterance that indicates the speaker's emotional state, so as to estimate and recognize that state effectively.

Claims

An emotion recognition method using voice, comprising:

a speech input step of receiving a voice signal produced by a speaker's utterance;

a non-verbal parameter extraction step of extracting at least one non-verbal parameter from the utterance;

a linguistic parameter extraction step of extracting at least one linguistic parameter according to the meaning of the utterance by performing speech recognition on the voice signal; and

an emotional state calculation step of calculating the speaker's emotional state by combining the linguistic parameter and the non-verbal parameter.

The method of claim 1,

wherein the non-verbal parameter includes a mean pitch of the voice signal, obtained by dividing the voice signal into frames at predetermined time intervals and dividing the sum of the pitches of the voiced frames among those frames by the number of voiced frames.

The method of claim 1,

wherein the non-verbal parameter includes a variance of the pitch of the voice signal, obtained by dividing the sum of the squared differences between the mean pitch and the pitch of each voiced frame by the number of voiced frames.

The method of claim 1,

wherein the non-verbal parameter includes a mean voice energy of the voice signal, obtained by dividing the voice signal into frames at predetermined time intervals and dividing the sum of the voice energy of the voiced frames by the number of voiced frames.

The method of claim 1,

wherein the non-verbal parameter includes a variance of the voice energy of the voice signal, obtained by dividing the sum of the squared differences between the mean voice energy and the voice energy of each voiced frame by the number of voiced frames.

The method of claim 1,

wherein the linguistic parameter includes a speech rate of the voice signal.

The method of claim 6,

wherein the speech rate of the voice signal is the relative rate of the mean actual phoneme length of the voice signal with respect to the mean standard phoneme length of the voice signal.

The method of claim 7,

wherein the mean standard phoneme length of the voice signal is the mean of the phoneme lengths of the Korean phonetic-symbol phonemes constituting the voice signal as recognized by the speech recognition module.

The method of claim 7,

wherein the mean actual phoneme length of the voice signal is the value obtained by multiplying the length of each phonetic-symbol phoneme constituting the voice signal by the weight of that phoneme, summing the products, and dividing the sum by the number of phonetic-symbol phonemes stored in the speech recognition module.

The method of claim 9,

wherein, among the phonetic-symbol phonemes, the weight of the medial-vowel phonemes is higher than the weights of the initial-consonant and final-consonant phonemes.

The method of claim 1,

wherein the linguistic parameter includes an emotion calculation coefficient corresponding to each special word contained in the speech recognition module's recognition result for the voice signal.

The method of claim 1,

wherein the emotional state calculation step calculates a final emotional state coefficient by summing:

the difference between the actual mean pitch and the standard mean pitch, divided by the standard mean pitch and multiplied by a first weight;

the difference between the actual mean voice energy and the standard mean voice energy, divided by the standard mean voice energy and multiplied by a second weight;

the difference between the standard deviation of the actual pitch and the standard deviation of the standard pitch, divided by the standard deviation of the standard pitch and multiplied by a third weight;

the difference between the standard deviation of the actual voice energy and the standard deviation of the standard voice energy, divided by the standard deviation of the standard voice energy and multiplied by a fourth weight;

the difference between the actual mean phoneme length and the reference mean phoneme length, divided by the reference mean phoneme length and multiplied by a fifth weight; and

the emotion calculation coefficient corresponding to the special word contained in the speech recognition result, multiplied by a sixth weight,

and calculates the speaker's emotional state by comparing the final emotional state coefficient with a predetermined threshold.

An emotion recognition system using voice, comprising a voice signal input unit, a signal processing unit, a speech recognition unit, a control unit, and an output unit,

wherein the voice signal input unit converts a speaker's voice into an electrical voice signal and transmits the voice signal to at least one of the signal processing unit and the speech recognition unit,

the signal processing unit further includes a mean pitch extraction unit that extracts the mean pitch of the voice signal, a pitch variance extraction unit that extracts the degree of variation from the mean pitch, a mean voice energy extraction unit that extracts the mean voice energy of the voice signal, and a voice energy variance extraction unit that extracts the degree of variation from the mean voice energy,

the speech recognition unit further includes a speech rate extraction unit that extracts the speech rate of the voice signal and a speech recognition result extraction unit that extracts specific words contained in the voice signal, and

the control unit calculates a final emotion recognition coefficient using at least one of the mean pitch, pitch variance, mean voice energy, voice energy variance, speech rate, and speech recognition result calculated by the signal processing unit and the speech recognition unit, and recognizes the speaker's emotional state by comparing the final emotion recognition coefficient with a predetermined threshold.

A call center agent connection system using the emotion recognition system according to claim 13, comprising:

a private branch exchange connected to a customer's telephone line, a CTI server that controls the overall operation of the call center, an emotion recognition server containing the emotion recognition system, an automatic response server connected to the CTI server, and at least one agent telephone line connected to the CTI server,

wherein, when the emotion recognition server recognizes the customer's emotional state from the voice signal of the customer's utterance, the CTI server, according to the result, switches the call connection of the customer's telephone line from the automatic response server to the agent telephone line, or switches the call connection from one agent telephone line to another agent telephone line.