KR102195246B1

KR102195246B1 - Method of emotion recognition using audio signal, computer readable medium and apparatus for performing the method

Info

Publication number: KR102195246B1
Application number: KR1020190029864A
Authority: KR
Inventors: 김명호; 서민지
Original assignee: 숭실대학교산학협력단
Priority date: 2019-03-15
Filing date: 2019-03-15
Publication date: 2020-12-24
Anticipated expiration: 2039-03-15
Also published as: KR20200109958A

Abstract

Disclosed are a method for classifying emotions using a voice signal, and a recording medium and an apparatus for performing the method. The method includes constructing a plurality of emotion classification models trained on voice signals for each of a plurality of emotions, and using the plurality of emotion classification models to obtain, for each emotion, emotion probabilities calculated from a user's voice signal and combining these probabilities to determine the emotion appearing in the user's voice signal.

Description

Method for classifying emotions using a voice signal, and recording medium and apparatus for performing the same {METHOD OF EMOTION RECOGNITION USING AUDIO SIGNAL, COMPUTER READABLE MEDIUM AND APPARATUS FOR PERFORMING THE METHOD}

The present invention relates to a method for classifying emotions using a voice signal, and to a recording medium and an apparatus for performing the method. More particularly, it relates to a method that determines the human emotion appearing in a voice signal, and to a recording medium and an apparatus for performing that method.

With advances in data processing and cloud technology, services for single-person households are becoming increasingly popular. Accordingly, communication between humans and machines is growing in importance so that machines can judge human emotions and provide appropriate services, and research on judging human emotions from signals such as faces, voices, and behavior is being actively pursued.

For example, in a voice signal, tremor, stress, speech rate, and the like may vary with a person's emotional state. Various services have therefore been proposed that judge human emotions from voice signals. In the United States, AT&T provides a call-center service that identifies and responds to customers' emotions in real time, and in Japan, SoftBank offers an emotion recognition robot that recognizes a user's emotions from voice signals and communicates naturally.

Korean Patent Publication No. 10-2014-0007883 discloses an emotion recognition system using a voice signal. That system classifies a user's emotion by extracting feature parameter vectors from keyframes of the voice signal, relying on a pre-trained distribution model of voice feature data such as a Gaussian Mixture Model (GMM) or a Hidden Markov Model (HMM). Because the feature parameter vectors are extracted from only a small number of selected keyframes, emotion discrimination performance may degrade when the user's gender, vocal characteristics, or spoken phrases differ.

In addition, emotion classification using voice signals is somewhat less accurate than emotion classification using facial expressions, so research is needed to close this gap.

One aspect of the present invention provides a method for classifying emotions using a voice signal that determines the emotion appearing in the voice signal by using a plurality of machine learning classifiers that take, as input, voice feature data extractable from the voice signal, and provides a recording medium and an apparatus for performing the method.

The technical problems of the present invention are not limited to those mentioned above, and other technical problems not mentioned will be clearly understood by those skilled in the art from the following description.

To solve the above problem, the method for classifying emotions using a voice signal according to the present invention is performed in an emotion classification apparatus that receives a user's voice signal and classifies the user's emotion, and includes: constructing a plurality of emotion classification models trained on voice signals for each of a plurality of emotions; obtaining, from each of the plurality of emotion classification models, an emotion probability for each of the plurality of emotions calculated from the user's voice signal; calculating an emotion discrimination confidence for each of the plurality of emotions by combining the per-emotion emotion probabilities obtained from each of the models; and determining, among the plurality of emotions, the emotion appearing in the user's voice signal by comparing the emotion discrimination confidence for each emotion with a reference value of the emotion discrimination confidence for that emotion.
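The final comparison step can be sketched as follows. This is a minimal illustration, assuming that an emotion qualifies when its confidence reaches its reference value and that, if several emotions qualify, the one with the largest margin is chosen. The text only states that the two sets of values are compared, so both rules, the function name `decide_emotion`, and the sample numbers are hypothetical.

```python
from typing import Dict, Optional


def decide_emotion(confidence: Dict[str, float],
                   reference: Dict[str, float]) -> Optional[str]:
    """Pick the emotion whose confidence exceeds its reference value by the most.

    Assumption: the largest-margin rule and the None result when no emotion
    qualifies are illustrative choices, not rules stated in the text.
    """
    best: Optional[str] = None
    best_margin = 0.0
    for emotion, value in confidence.items():
        margin = value - reference[emotion]
        if margin >= 0.0 and (best is None or margin > best_margin):
            best, best_margin = emotion, margin
    return best


# Hypothetical confidences and per-emotion reference values.
confidence = {"anger": 0.72, "happiness": 0.31, "sadness": 0.55}
reference = {"anger": 0.60, "happiness": 0.50, "sadness": 0.50}
result = decide_emotion(confidence, reference)
```

Returning `None` when no emotion reaches its reference value keeps the "no decision" case explicit, which a caller could map to an ordinary emotional state.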

The constructing of the plurality of emotion classification models may include: receiving a training voice signal generated to express any one of the plurality of emotions; extracting, from the training voice signal, voice feature data including at least one of pitch, ZCR (zero crossing rate), MFCC (mel-scaled power spectrogram), RMSE (root mean square energy), tempo, beat, and STFT (short-time Fourier transform); and constructing the plurality of emotion classification models trained on the at least one voice feature data for the emotion expressed by the training voice signal.

The constructing of the plurality of emotion classification models may also include: receiving a training voice signal generated to express any one of the plurality of emotions; labeling the training voice signal in binary form so that whether it expresses a given one of the plurality of emotions can be identified; and constructing the plurality of emotion classification models by inputting the labeled training voice signal into each of a plurality of binary classification learning models.

The constructing of the plurality of emotion classification models by inputting the labeled training voice signal into each of the plurality of binary classification learning models may consist of inputting the labeled training voice signal into each of a support vector machine model, an XGBoost model, and a random forest model.

The calculating of the emotion discrimination confidence for each of the plurality of emotions may include: calculating, for any one of the plurality of emotions, a weight for each of the plurality of emotion classification models; and calculating the emotion discrimination confidence for that emotion by multiplying the emotion probability obtained from each model by the weight calculated for that model and summing the products over all models.

The calculating of the weight for each emotion classification model may include: averaging, within any one of the emotion classification models, the probabilities of the plurality of emotions; and calculating, as that model's weight for a given emotion, the ratio of the probability of that emotion to the average of the probabilities of the plurality of emotions.
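The weighting and summation described in the two preceding paragraphs can be sketched as below. This is one reading of the prose, not the patent's own formula: for each model, the weight for an emotion is taken as that emotion's probability divided by the model's mean probability over all emotions, and the confidence is the sum over models of weight times probability. The names `model_weights` and `discrimination_confidence` and the sample probabilities are hypothetical, and the raw sum is not guaranteed to fall within any particular range because no normalization is described here.

```python
from typing import Dict

# model name -> emotion -> probability reported by that model
Probs = Dict[str, Dict[str, float]]


def model_weights(probs: Probs) -> Probs:
    """Weight of each model for each emotion: p(emotion) / mean over emotions."""
    weights: Probs = {}
    for model, by_emotion in probs.items():
        mean_p = sum(by_emotion.values()) / len(by_emotion)
        weights[model] = {
            emotion: (p / mean_p if mean_p > 0 else 0.0)
            for emotion, p in by_emotion.items()
        }
    return weights


def discrimination_confidence(probs: Probs) -> Dict[str, float]:
    """Sum over models of weight * probability, computed per emotion."""
    weights = model_weights(probs)
    emotions = next(iter(probs.values())).keys()
    return {
        emotion: sum(weights[m][emotion] * probs[m][emotion] for m in probs)
        for emotion in emotions
    }


# Hypothetical probabilities from three models for two emotions.
probs = {
    "svm": {"anger": 0.6, "sadness": 0.2},
    "xgboost": {"anger": 0.5, "sadness": 0.5},
    "random_forest": {"anger": 0.9, "sadness": 0.3},
}
scores = discrimination_confidence(probs)
```

With these numbers a model that strongly prefers one emotion contributes more to that emotion's score, while a model that rates all emotions equally contributes with weight 1.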

The invention may also take the form of a computer-readable recording medium on which a computer program for performing the method for classifying emotions using a voice signal is recorded.

An apparatus for classifying emotions using a voice signal according to the present invention includes: an emotion classification model construction unit that constructs a plurality of emotion classification models trained on voice signals for each of a plurality of emotions; an emotion probability calculation unit that obtains, from each of the models, an emotion probability for each of the plurality of emotions calculated from a user's voice signal; an emotion discrimination confidence calculation unit that calculates an emotion discrimination confidence for each emotion by combining the per-emotion emotion probabilities obtained from each model; and an emotion classification unit that determines, among the plurality of emotions, the emotion appearing in the user's voice signal by comparing the emotion discrimination confidence for each emotion with a reference value of the emotion discrimination confidence for that emotion.

The emotion probability calculation unit may extract, from the user's voice signal, voice feature data including at least one of pitch, ZCR (zero crossing rate), MFCC (mel-scaled power spectrogram), RMSE (root mean square energy), tempo, beat, and STFT (short-time Fourier transform), and may input the extracted voice feature data into each of a support vector machine model, an XGBoost model, and a random forest model that have been trained on the at least one voice feature data for the emotion expressed by a voice signal, thereby obtaining an emotion probability for each of the plurality of emotions from each of the three models.

The emotion discrimination confidence calculation unit may average, within any one of the emotion classification models, the probabilities of the plurality of emotions; calculate, as that model's weight for a given emotion, the ratio of the probability of that emotion to that average; and calculate the emotion discrimination confidence for the emotion by multiplying the emotion probability obtained from each model by the weight calculated for that model and summing the products.

According to the present invention, emotion classification using a voice signal reflects all of the voice feature data that can be extracted from the voice signal, and can therefore achieve flexible emotion discrimination performance that is independent of speaker and context.

In addition, the accuracy of emotion classification can be improved by using a plurality of machine learning classifiers that perform well in binary classification and learning.

FIG. 1 is a block diagram of an apparatus for classifying emotions using a voice signal according to an embodiment of the present invention.
FIGS. 2 to 7 are flowcharts of a method for classifying emotions using a voice signal according to an embodiment of the present invention.

The detailed description of the present invention below refers to the accompanying drawings, which show by way of example specific embodiments in which the invention may be practiced. These embodiments are described in sufficient detail to enable those skilled in the art to practice the invention. It should be understood that the various embodiments of the invention differ from one another but need not be mutually exclusive. For example, a particular shape, structure, or characteristic described herein in connection with one embodiment may be implemented in another embodiment without departing from the spirit and scope of the invention. It should also be understood that the position or arrangement of individual components within each disclosed embodiment may be changed without departing from the spirit and scope of the invention. The detailed description below is therefore not to be taken in a limiting sense, and the scope of the invention, if properly described, is limited only by the appended claims, along with the full range of equivalents to which those claims are entitled. In the drawings, like reference numerals refer to the same or similar functions throughout the several aspects.

Hereinafter, preferred embodiments of the present invention are described in more detail with reference to the drawings.

FIG. 1 is a block diagram of an apparatus for classifying emotions using a voice signal according to an embodiment of the present invention.

Referring to FIG. 1, the emotion classification apparatus 1 using a voice signal according to an embodiment of the present invention includes an input unit 10, an emotion classification model construction unit 30, an emotion probability calculation unit 50, an emotion discrimination confidence calculation unit 70, and an emotion classification unit 90.

The emotion classification apparatus 1 using a voice signal according to an embodiment of the present invention may be implemented with more components than those shown in FIG. 1, or with fewer.

The emotion classification apparatus 1 using a voice signal according to an embodiment of the present invention is a device capable of communication and of information input and output. It may be implemented as, for example, a smartphone, a tablet, or a PC, on which software (an application) for the emotion classification of the present invention may be installed and executed.

The input unit 10, the emotion classification model construction unit 30, the emotion probability calculation unit 50, the emotion discrimination confidence calculation unit 70, and the emotion classification unit 90 shown in FIG. 1 may be controlled by software executed on the emotion classification apparatus 1 using a voice signal according to an embodiment of the present invention.

The emotion classification apparatus 1 using a voice signal according to an embodiment of the present invention may determine a user's emotion from the user's utterance. It analyzes the voice signal generated by the utterance to determine the user's emotion appearing in that signal.

In classifying emotions from a voice signal, the emotion classification apparatus 1 according to an embodiment of the present invention reflects all of the feature data that can be extracted from the voice signal and can thereby achieve flexible emotion discrimination performance. The apparatus inputs the feature data extracted from the voice signal into machine learning classifiers and determines the emotion appearing in the voice signal from their output values; using a plurality of machine learning classifiers improves the accuracy of emotion discrimination.

Each component of the emotion classification apparatus 1 using a voice signal according to an embodiment of the present invention, shown in FIG. 1, is described in detail below.

The input unit 10 may receive a user's voice signal. For example, the input unit 10 may be implemented as a microphone that receives the voice signal generated by the user's utterance. Alternatively, the input unit 10 may receive a voice signal previously stored on a computer-readable storage medium.

In this embodiment, voice signals are broadly divided into training voice signals and test voice signals. A training voice signal is used to construct the emotion classification models, which are machine learning classifiers for emotion classification, and corresponds to a voice signal uttered by a user while acting out a specific emotion. A test voice signal is a voice signal whose emotion is to be classified. For convenience, the description below distinguishes between training voice signals and test voice signals.

In this embodiment, the emotion appearing in a voice signal is defined as the emotion of the user who produced the signal, and is regarded as consisting of one or more of anger, happiness, calmness, sadness, fear, disgust, and boredom. The voice signal characteristics associated with each of these emotions generally follow the patterns shown in Table 1 below.

The emotion classification model construction unit 30 may construct a plurality of emotion classification models trained on the training voice signals.

As described above, an emotion classification model is a machine learning classifier for emotion classification. This embodiment may adopt a support vector machine model, an XGBoost model, and a random forest model, which have been shown to perform well in binary classification.

The emotion classification model construction unit 30 may construct a total of three emotion classification models by inputting the training voice signals into a support vector machine model, an XGBoost model, and a random forest model, respectively.

Here, the training voice signals may be divided by emotion, as described above. For example, a training voice signal may be one in which any one of anger, happiness, calmness, sadness, fear, disgust, and boredom appears, or one uttered in an ordinary emotional state in which none of these emotions appears.

The emotion classification model construction unit 30 may input these per-emotion training voice signals into the support vector machine model, the XGBoost model, and the random forest model, respectively, to construct a plurality of emotion classification models trained on the voice signals for each emotion.

To this end, the emotion classification model construction unit 30 may extract from each training voice signal at least one item of voice feature data representing the characteristics of the signal, and may label it in binary form to identify whether the training voice signal expresses a given one of the plurality of emotions.

Specifically, the emotion classification model construction unit 30 may divide the training voice signal into unit frames of a fixed size, for example a plurality of unit frames of 20 to 30 ms each.
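The framing step can be illustrated with a short sketch. It assumes non-overlapping frames and drops a trailing partial frame; neither choice is stated in the text, and the function name `split_into_frames`, the 16 kHz sample rate, and the 25 ms default are hypothetical values chosen within the 20 to 30 ms range mentioned above.

```python
from typing import List, Sequence


def split_into_frames(samples: Sequence[float], sample_rate: int,
                      frame_ms: float = 25.0) -> List[List[float]]:
    """Cut a signal into consecutive fixed-size frames.

    Assumptions: frames do not overlap, and a trailing partial frame is dropped.
    """
    frame_len = int(sample_rate * frame_ms / 1000)
    if frame_len <= 0:
        raise ValueError("frame length must be at least one sample")
    return [
        list(samples[start:start + frame_len])
        for start in range(0, len(samples) - frame_len + 1, frame_len)
    ]


# 0.1 s of silence at a hypothetical 16 kHz sample rate.
signal = [0.0] * 1600
frames = split_into_frames(signal, sample_rate=16000, frame_ms=25.0)
```

At 16 kHz a 25 ms frame holds 400 samples, so the 1600-sample signal yields four frames.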

For each unit frame, the emotion classification model construction unit 30 may extract voice feature data including at least one of pitch, ZCR (zero crossing rate), MFCC (mel-scaled power spectrogram), RMSE (root mean square energy), tempo, beat, and STFT (short-time Fourier transform). Pitch, ZCR, MFCC, RMSE, tempo, beat, and STFT are representative features that can be extracted from a voice signal; their definitions are given in Table 2 below.
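Two of the listed features are simple enough to compute by hand, which may help make the per-frame extraction concrete. The sketch below uses common textbook definitions of zero crossing rate and root mean square energy; these are assumed to be compatible with the definitions in Table 2 but are not taken from it, and the function names are hypothetical.

```python
import math
from typing import Sequence


def zero_crossing_rate(frame: Sequence[float]) -> float:
    """Fraction of adjacent sample pairs whose signs differ (zero counts as positive)."""
    if len(frame) < 2:
        return 0.0
    crossings = sum(
        1 for a, b in zip(frame, frame[1:]) if (a >= 0) != (b >= 0)
    )
    return crossings / (len(frame) - 1)


def rms_energy(frame: Sequence[float]) -> float:
    """Square root of the mean squared amplitude of the frame."""
    if not frame:
        return 0.0
    return math.sqrt(sum(x * x for x in frame) / len(frame))


# A frame that alternates sign on every sample, and a constant frame.
alternating = [1.0, -1.0, 1.0, -1.0]
constant = [0.5, 0.5, 0.5, 0.5]
features = {
    "alternating": (zero_crossing_rate(alternating), rms_energy(alternating)),
    "constant": (zero_crossing_rate(constant), rms_energy(constant)),
}
```

The alternating frame crosses zero between every pair of samples, while the constant frame never does; the other listed features (pitch, tempo, beat, STFT, MFCC) would normally come from an audio analysis library rather than hand-written code.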

The emotion classification model construction unit 30 may aggregate the at least one voice feature data extracted from each of the unit frames into global features. For example, it may summarize the per-frame voice feature data as any one of a maximum, a minimum, a mean, or a standard deviation so that the result represents the training voice data as a whole.
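A minimal sketch of this aggregation follows. The text says the per-frame values are summarized as any one of the four statistics; the sketch computes all four and leaves the choice to the caller. The function name `global_features`, the use of the population standard deviation, and the sample values are assumptions for illustration.

```python
import statistics
from typing import Dict, Sequence


def global_features(frame_values: Sequence[float]) -> Dict[str, float]:
    """Summarize one per-frame feature over the whole utterance.

    Assumption: the population standard deviation is used; the text does not
    say which variant is intended.
    """
    return {
        "max": max(frame_values),
        "min": min(frame_values),
        "mean": statistics.fmean(frame_values),
        "std": statistics.pstdev(frame_values),
    }


# Hypothetical per-frame zero crossing rates for one utterance.
zcr_per_frame = [0.1, 0.3, 0.2, 0.4]
summary = global_features(zcr_per_frame)
```

Applying the same summary to every per-frame feature turns an utterance of any length into a fixed-length vector, which is what a classifier with a fixed input size needs.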

The emotion classification model construction unit 30 constructs the plurality of emotion classification models by inputting the at least one voice feature data, extracted from the training voice data and aggregated in this way, into each of the models. For training, it may label the voice feature data in binary form so that the models can identify whether the training voice signal expresses a given one of the plurality of emotions.

For example, the emotion classification model construction unit 30 may first construct a plurality of emotion classification models trained on the voice feature data for anger. If a training voice signal is one in which anger was acted out, the unit labels the voice feature data extracted from that signal as '1' (true); if the training voice signal was uttered in an ordinary emotional state or acts out a different emotion, the unit labels the extracted voice feature data as '0' (false).
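The labeling rule in this example amounts to a one-vs-rest encoding, sketched below. The function name `binary_labels`, the emotion strings, and the "neutral" tag for an ordinary emotional state are hypothetical.

```python
from typing import Dict, List, Sequence


def binary_labels(sample_emotions: Sequence[str], target: str) -> List[int]:
    """Label each sample 1 if it acts out the target emotion, otherwise 0."""
    return [1 if emotion == target else 0 for emotion in sample_emotions]


# Hypothetical emotion tags for four training utterances.
sample_emotions = ["anger", "neutral", "happiness", "anger"]
anger_labels = binary_labels(sample_emotions, "anger")

# One label vector per emotion, reusing the same training utterances each time.
labels_by_emotion: Dict[str, List[int]] = {
    emotion: binary_labels(sample_emotions, emotion)
    for emotion in ("anger", "happiness")
}
```

Repeating the labeling once per emotion gives each binary classifier its own view of the same training set.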

The emotion classification model construction unit 30 may input the voice feature data labeled in this way into each of the emotion classification models to construct a plurality of emotion classification models trained on the voice feature data for each emotion.
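The overall construction loop can be sketched as one binary model per combination of classifier type and emotion. To stay self-contained, the sketch uses a toy `NearestMeanClassifier` as a stand-in; it is not one of the models named in the text. In a real setup the factories might instead return, for example, scikit-learn's `SVC(probability=True)` and `RandomForestClassifier` and xgboost's `XGBClassifier`, which is an assumption about tooling since the text names only the model families. All names and data below are hypothetical.

```python
from typing import Callable, Dict, List, Sequence, Tuple


class NearestMeanClassifier:
    """Toy stand-in exposing fit/predict_proba. NOT one of the named models."""

    def fit(self, X: Sequence[Sequence[float]], y: Sequence[int]) -> "NearestMeanClassifier":
        self.means: Dict[int, List[float]] = {}
        for label in (0, 1):
            rows = [x for x, t in zip(X, y) if t == label]
            # Column-wise mean of the rows carrying this label (both labels must occur).
            self.means[label] = [sum(col) / len(rows) for col in zip(*rows)]
        return self

    def predict_proba(self, X: Sequence[Sequence[float]]) -> List[List[float]]:
        out = []
        for x in X:
            dist = {k: sum((a - b) ** 2 for a, b in zip(x, m)) ** 0.5
                    for k, m in self.means.items()}
            total = dist[0] + dist[1]
            # Closer to the label-1 mean gives a higher probability for label 1.
            p1 = 0.5 if total == 0 else dist[0] / total
            out.append([1.0 - p1, p1])
        return out


def build_models(X: Sequence[Sequence[float]], sample_emotions: Sequence[str],
                 emotions: Sequence[str],
                 factories: Dict[str, Callable[[], object]]) -> Dict[Tuple[str, str], object]:
    """Train one binary model per (classifier type, emotion) pair."""
    models = {}
    for name, make in factories.items():
        for emotion in emotions:
            y = [1 if e == emotion else 0 for e in sample_emotions]
            models[(name, emotion)] = make().fit(X, y)
    return models


# Hypothetical global feature vectors and their acted emotions.
X = [[0.9, 0.8], [0.8, 0.9], [0.1, 0.2], [0.2, 0.1]]
sample_emotions = ["anger", "anger", "sadness", "sadness"]
factories = {name: NearestMeanClassifier
             for name in ("svm", "xgboost", "random_forest")}
models = build_models(X, sample_emotions, ["anger", "sadness"], factories)
p_anger = models[("svm", "anger")].predict_proba([[0.85, 0.85]])[0][1]
```

Keying the trained models by classifier type and emotion makes it straightforward to query every model for every emotion at test time.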

The emotion probability calculation unit 50 may input a test voice signal into each of the emotion classification models to obtain an emotion probability for each emotion. In this embodiment, an emotion probability is the probability that a specific emotion appears in the voice signal. For a single test voice signal, the unit may obtain a per-emotion emotion probability from each of the emotion classification models.

To this end, the emotion probability calculation unit 50 may extract from the test voice signal at least one item of voice feature data representing the characteristics of the signal.

Specifically, like the emotion classification model construction unit 30, the emotion probability calculation unit 50 may divide the test voice signal into unit frames of a fixed size; extract, for each unit frame, voice feature data including at least one of pitch, ZCR (zero crossing rate), MFCC (mel-scaled power spectrogram), RMSE (root mean square energy), tempo, beat, and STFT (short-time Fourier transform); and aggregate the voice feature data extracted from the unit frames into global features that represent the test voice data as a whole.

The emotion probability calculation unit 50 may input the at least one item of voice feature data, extracted from the test voice data and leveled in this way, into each of the plurality of emotion classification models to obtain a per-emotion emotion probability for the test voice signal. When voice feature data extracted from the test voice signal is input, the emotion classification models based on the support vector machine model, the Xgboost model, and the random forest model may calculate the similarity between the input voice feature data and the learned voice feature data of the seven emotions (anger, happiness, calm, sadness, fear, disgust, and boredom), and output emotion probabilities indicating the probability that each of these emotions appears in the test voice signal.

The emotion discrimination confidence calculation unit 70 may calculate an emotion discrimination confidence level for each emotion by combining the per-emotion emotion probabilities obtained from each of the plurality of emotion classification models for the test voice data. In this embodiment, the emotion discrimination confidence level is the value used as the criterion for determining which emotions appear in the voice signal, and may take a value between 0 and 1.

The emotion discrimination confidence calculation unit 70 may calculate, as in Equation 1 below, the weight of each emotion classification model to be applied when combining the per-emotion emotion probabilities obtained from each of the plurality of emotion classification models for the test voice data.

For example, for the support vector machine model, the emotion discrimination confidence calculation unit 70 may calculate the average of the emotion probabilities that the support vector machine model outputs for the test voice signal, that is, the average of the probabilities of anger, happiness, calm, sadness, fear, disgust, and boredom. The emotion discrimination confidence calculation unit 70 may then calculate the ratio of a specific emotion probability to this average as the weight of the emotion classification model for that emotion. In other words, the weight of the support vector machine model for anger is the ratio of its anger probability to the average of all of its emotion probabilities.
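The weight calculation described in the example above can be illustrated with a minimal sketch. The function name `emotion_weights` and the dictionary layout are assumptions made for illustration; the sketch follows only the prose (weight = emotion probability divided by the mean of that model's emotion probabilities) and does not reproduce Equation 1 itself.

```python
def emotion_weights(probs):
    """Per-emotion weights for ONE classification model.

    probs maps an emotion name to the probability that model assigned to it.
    Each weight is the emotion's probability divided by the mean probability
    over all emotions, as described in the prose.
    """
    mean_p = sum(probs.values()) / len(probs)
    if mean_p == 0:
        raise ValueError("all emotion probabilities are zero")
    return {emotion: p / mean_p for emotion, p in probs.items()}


# Hypothetical output of a single model for one test utterance.
svm_probs = {
    "anger": 0.70, "happiness": 0.10, "calm": 0.05, "sadness": 0.05,
    "fear": 0.04, "disgust": 0.03, "boredom": 0.03,
}
svm_weights = emotion_weights(svm_probs)
```

Under this definition, an emotion whose probability is above the model's average receives a weight greater than 1, and one below the average receives a weight less than 1.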

As in Equation 2 below, the emotion discrimination confidence calculation unit 70 may calculate the emotion discrimination confidence level for any one of the plurality of emotions by multiplying the probability of that emotion obtained from each emotion classification model by the weight calculated for that model, and summing the products over all of the emotion classification models.

For example, to calculate the emotion discrimination confidence level for anger in the test voice signal, the emotion discrimination confidence calculation unit 70 may sum three products: the anger weight of the support vector machine model multiplied by the anger probability that the support vector machine model outputs for the test voice signal; the anger weight of the Xgboost model multiplied by the anger probability that the Xgboost model outputs for the test voice signal; and the anger weight of the random forest model multiplied by the anger probability that the random forest model outputs for the test voice signal.
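The combination step in the anger example above can be sketched as follows. This is an illustrative reading of the prose only: the function name `fuse_confidence`, the model keys, and the sample probabilities are hypothetical. The text states that the confidence level lies between 0 and 1, but the prose does not describe how that range is enforced, so the sketch leaves the weighted sum unnormalized.

```python
def fuse_confidence(model_probs):
    """Combine per-model emotion probabilities into per-emotion confidence.

    model_probs maps a model name to a dict of emotion -> probability.
    For each model, weight = p / mean(p over emotions); the confidence for
    an emotion is the sum over models of weight * p.
    """
    confidence = {}
    for probs in model_probs.values():
        mean_p = sum(probs.values()) / len(probs)
        if mean_p == 0:
            continue  # a model that outputs all zeros contributes nothing
        for emotion, p in probs.items():
            weight = p / mean_p
            confidence[emotion] = confidence.get(emotion, 0.0) + weight * p
    return confidence


# Hypothetical outputs of the three models for one test utterance
# (only two emotions shown to keep the example short).
example = {
    "svm":           {"anger": 0.30, "calm": 0.10},
    "xgboost":       {"anger": 0.25, "calm": 0.15},
    "random_forest": {"anger": 0.35, "calm": 0.05},
}
scores = fuse_confidence(example)
```

Because the weight grows with the probability itself, this scheme rewards emotions that a model singles out relative to the others, rather than treating all three models' raw probabilities equally.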

In this way, the emotion discrimination confidence calculation unit 70 may calculate the emotion discrimination confidence level of the test voice signal for each of anger, happiness, calm, sadness, fear, disgust, and boredom. According to this embodiment, the emotion discrimination confidence level used to classify the emotions appearing in a voice signal is calculated by combining the emotion probabilities produced by three emotion classification models (the support vector machine model, the Xgboost model, and the random forest model), which can improve the accuracy of emotion classification using a voice signal.

The emotion classification unit 90 may use the emotion discrimination confidence levels for the plurality of emotions to determine at least one emotion, among the plurality of emotions, that appears in the test voice signal.

The emotion classification unit 90 may compare the emotion discrimination confidence level for each emotion of the test voice signal with an emotion discrimination confidence reference value for that emotion. In this embodiment, the reference value may be set to 0.5 for all of the emotions; in other embodiments, it may be set differently for each emotion.

For example, when the emotion discrimination confidence level for anger in the test voice signal is 0.5 or more, the emotion classification unit 90 may determine that anger appears in the test voice signal.
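The comparison against the reference value can be sketched as below. The function name `detect_emotions` and its parameters are hypothetical; the default of 0.5 and the option of per-emotion reference values come from the description.

```python
def detect_emotions(confidence, thresholds=None, default_threshold=0.5):
    """Return every emotion whose confidence meets its reference value.

    confidence maps emotion -> emotion discrimination confidence level.
    thresholds optionally maps emotion -> per-emotion reference value;
    emotions not listed fall back to default_threshold.
    """
    thresholds = thresholds or {}
    return [
        emotion
        for emotion, level in confidence.items()
        if level >= thresholds.get(emotion, default_threshold)
    ]


levels = {"anger": 0.7, "sadness": 0.5, "calm": 0.2}
detected_uniform = detect_emotions(levels)
detected_custom = detect_emotions(levels, thresholds={"sadness": 0.6})
```

Since each emotion is tested independently, more than one emotion (or none) may be reported for a single utterance, which matches the description's "at least one emotion" wording for the multi-label case.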

In this way, the emotion classification apparatus 1 using a voice signal according to an embodiment of the present invention reflects all of the voice feature data that can be extracted from the voice signal, and can therefore exhibit flexible emotion discrimination performance that is independent of speaker and context. In addition, by using a plurality of machine learning classifiers that perform well in binary classification and learning, it can improve the accuracy of emotion classification.

Hereinafter, an emotion classification method using a voice signal according to an embodiment of the present invention is described.

The emotion classification method using a voice signal according to an embodiment of the present invention may be carried out in substantially the same configuration as the emotion classification apparatus 1 using a voice signal shown in FIG. 1. Accordingly, components identical to those of the emotion classification apparatus 1 of FIG. 1 are given the same reference numerals, and repeated descriptions are omitted.

FIGS. 2 to 7 are flowcharts of an emotion classification method using a voice signal according to an embodiment of the present invention.

The emotion classification method using a voice signal of the present invention can be broadly divided into a step of constructing emotion classification models and a step of classifying emotions using those models. The model construction step is described below with reference to FIGS. 2 to 4, and the emotion classification step with reference to FIGS. 5 to 7.

First, referring to FIG. 2, the input unit 10 may receive a voice signal (S100).

Here, the voice signal is one used to construct the emotion classification models, which are machine learning classifiers for emotion classification, and corresponds to a training voice signal uttered by a user while acting out a specific emotion.

The emotion classification model construction unit 30 may construct an emotion classification model that has learned the voice signal (S200).

So that the emotion classification model can learn the voice signal, the emotion classification model construction unit 30 may apply a predetermined preprocessing step to the voice signal received by the input unit 10. This is described with reference to FIG. 3.

Referring to FIG. 3, the emotion classification model construction unit 30 may extract, from the voice signal, voice feature data including at least one of pitch, ZCR (zero crossing rate), MFCC (mel-scaled power spectrogram), RMSE (root mean square energy), tempo, beat, and STFT (short-time Fourier transform) (S110).

The emotion classification model construction unit 30 may divide the voice signal into unit frames of a fixed size; extract, for each unit frame, voice feature data including at least one of pitch, ZCR, MFCC, RMSE, tempo, beat, and STFT; and level the voice feature data extracted from the respective unit frames into a global feature.

If the voice signal expresses a given emotion (S120), the emotion classification model construction unit 30 may binarize the voice feature data so that it is true for that emotion (S130); if the voice signal does not express the emotion (S120), it may binarize the voice feature data so that it is false for that emotion (S140).

The emotion classification model construction unit 30 may label, in binary form, the at least one item of voice feature data extracted from the voice data and leveled. For example, when the voice signal is one in which anger is acted out, the voice feature data extracted from it may be labeled '1' (true); when the voice signal was uttered in a neutral emotional state or acts out a different emotion, the voice feature data extracted from it may be labeled '0' (false).
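The binary labeling described above amounts to building one true/false label vector per target emotion. A minimal sketch follows, assuming each training utterance carries a single acted-emotion tag; the function names and the `"neutral"` tag are hypothetical.

```python
EMOTIONS = ["anger", "happiness", "calm", "sadness", "fear", "disgust", "boredom"]

def binarize_labels(acted_emotions, target_emotion):
    """Label each utterance 1 if it acts out target_emotion, else 0."""
    return [1 if emotion == target_emotion else 0 for emotion in acted_emotions]

def labels_per_emotion(acted_emotions, emotions=EMOTIONS):
    """One binary label vector per target emotion."""
    return {target: binarize_labels(acted_emotions, target) for target in emotions}


# Hypothetical tags for four training utterances.
tags = ["anger", "neutral", "sadness", "anger"]
all_labels = labels_per_emotion(tags)
anger_labels = all_labels["anger"]
```

Each label vector can then be paired with the same global feature vectors to train a separate binary classifier for that emotion.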

Referring to FIG. 4, the emotion classification model construction unit 30 may input the voice feature data, binarized with respect to the emotion expressed by the voice signal, into each of the support vector machine model, the Xgboost model, and the random forest model (S210), thereby constructing a plurality of emotion classification models for each emotion expressed by the voice signal (S220).

The emotion classification model construction unit 30 may construct a total of three emotion classification models by inputting the voice signal into each of the support vector machine model, the Xgboost model, and the random forest model.

The emotion classification model construction unit 30 may input the voice signals for each of the plurality of emotions into each of the support vector machine model, the Xgboost model, and the random forest model, thereby constructing a plurality of emotion classification models that have learned the voice signals for each of the plurality of emotions.

Referring to FIG. 5, the input unit 10 may receive a voice signal (S300).

Here, the voice signal is the test voice signal whose emotion is to be classified.

The emotion probability calculation unit 50 may input the voice signal into the plurality of emotion classification models (S400) and obtain from them emotion probabilities for each of the plurality of emotions for the voice signal (S500).

The emotion probability calculation unit 50 may input the voice signal into each of the plurality of emotion classification models to obtain emotion probabilities for each of the plurality of emotions. In this embodiment, an emotion probability means the probability that a specific emotion appears in the voice signal. For a single voice signal, the emotion probability calculation unit 50 may obtain a per-emotion emotion probability from each of the plurality of emotion classification models.

The emotion probability calculation unit 50 may divide the voice signal into unit frames of a fixed size; extract, for each unit frame, voice feature data including at least one of pitch, ZCR (zero crossing rate), MFCC (mel-scaled power spectrogram), RMSE (root mean square energy), tempo, beat, and STFT (short-time Fourier transform); and level the voice feature data extracted from the respective unit frames into a global feature so that it represents the voice data as a whole.

The emotion probability calculation unit 50 may input the at least one item of voice feature data, extracted from the voice data and leveled in this way, into each of the plurality of emotion classification models to obtain a per-emotion emotion probability for the test voice signal. When voice feature data extracted from the voice signal is input, the emotion classification models based on the support vector machine model, the Xgboost model, and the random forest model may calculate the similarity between the input voice feature data and the learned voice feature data of the seven emotions (anger, happiness, calm, sadness, fear, disgust, and boredom), and output emotion probabilities indicating the probability that each of these emotions appears in the voice signal.

The emotion discrimination confidence calculation unit 70 may combine the emotion probabilities for each of the plurality of emotions for the voice signal to calculate an emotion discrimination confidence level for each of the plurality of emotions (S600). This is described in detail below with reference to FIG. 6.

The emotion classification unit 90 may use the emotion discrimination confidence levels for the plurality of emotions to determine the emotion appearing in the voice signal (S700). This is described in detail below with reference to FIG. 7.

Referring to FIG. 6, the emotion discrimination confidence calculation unit 70 may calculate, for each of the plurality of emotion classification models, a weight for each of the plurality of emotions (S610).

According to Equation 1, the emotion discrimination confidence calculation unit 70 may calculate, for each of the plurality of emotion classification models, weights for the seven emotions of anger, happiness, calm, sadness, fear, disgust, and boredom.

The emotion discrimination confidence calculation unit 70 may calculate the emotion discrimination confidence level for each of the plurality of emotions using the per-emotion emotion probabilities and the per-emotion weights of each of the plurality of emotion classification models (S620).

As in Equation 2, the emotion discrimination confidence calculation unit 70 may calculate the emotion discrimination confidence level for any one of the plurality of emotions by multiplying the probability of that emotion obtained from each emotion classification model by the weight calculated for that model, and summing the products over all of the emotion classification models.

For example, to calculate the emotion discrimination confidence level for anger in the voice signal, the emotion discrimination confidence calculation unit 70 may sum three products: the anger weight of the support vector machine model multiplied by the anger probability that the support vector machine model outputs for the test voice signal; the anger weight of the Xgboost model multiplied by the anger probability that the Xgboost model outputs for the test voice signal; and the anger weight of the random forest model multiplied by the anger probability that the random forest model outputs for the test voice signal.
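Equations 1 and 2 are referred to by number, but their bodies do not appear in this text. Based solely on the prose above and on claim 1, steps S610 and S620 can be tentatively written as follows. The symbols are notation assumed here, not taken from the original: $p_{m,e}$ is the probability of emotion $e$ output by model $m$, $E$ is the set of seven emotions, $M$ is the set of three emotion classification models, $w_{m,e}$ is the weight, and $C_e$ is the emotion discrimination confidence level. Any additional normalization that the original equations may contain cannot be recovered from the prose and is not shown.

```latex
\[
  w_{m,e} \;=\; \frac{p_{m,e}}{\dfrac{1}{|E|}\displaystyle\sum_{e' \in E} p_{m,e'}}
  \qquad \text{(weight, S610)}
\]
\[
  C_{e} \;=\; \sum_{m \in M} w_{m,e}\, p_{m,e}
  \qquad \text{(emotion discrimination confidence level, S620)}
\]
```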

Referring to FIG. 7, for each of the plurality of emotions, if the emotion discrimination confidence level is 0.5 or more (S710), the emotion classification unit 90 may determine that emotion to be an emotion appearing in the voice signal (S720); if the emotion discrimination confidence level is less than 0.5 (S710), it may determine that emotion to be an emotion not appearing in the voice signal (S730).

The emotion classification unit 90 may compare the emotion discrimination confidence level for each emotion of the voice signal with an emotion discrimination confidence reference value for that emotion. In this embodiment, the reference value may be set to 0.5 for all of the emotions; in other embodiments, it may be set differently for each emotion.

For example, when the emotion discrimination confidence level for anger in the voice signal is 0.5 or more, the emotion classification unit 90 may determine that anger appears in the voice signal.

The emotion classification method using a voice signal of the present invention described above may be implemented as an application, or in the form of program instructions executable by various computer components, and recorded on a computer-readable recording medium. The computer-readable recording medium may contain program instructions, data files, data structures, and the like, alone or in combination.

The program instructions recorded on the computer-readable recording medium may be specially designed and configured for the present invention, or may be known to and usable by those skilled in the field of computer software.

Examples of computer-readable recording media include magnetic media such as hard disks, floppy disks, and magnetic tape; optical recording media such as CD-ROMs and DVDs; magneto-optical media such as floptical disks; and hardware devices specially configured to store and execute program instructions, such as ROM, RAM, and flash memory.

Examples of program instructions include not only machine code such as that produced by a compiler, but also high-level language code that can be executed by a computer using an interpreter or the like. The hardware device may be configured to operate as one or more software modules to perform the processing according to the present invention, and vice versa.

Although the invention has been described above with reference to embodiments, those skilled in the art will understand that the present invention may be modified and changed in various ways without departing from the spirit and scope of the invention set forth in the claims below.

1: Emotion classification apparatus using a voice signal
10: Input unit
30: Emotion classification model construction unit
50: Emotion probability calculation unit
70: Emotion discrimination confidence calculation unit
90: Emotion classification unit

Claims

An emotion classification method performed by an emotion classification apparatus that receives a user's voice signal and classifies the user's emotion using the voice signal, the method comprising:
constructing a plurality of emotion classification models that have learned voice signals for each of a plurality of emotions;
obtaining, from each of the plurality of emotion classification models, emotion probabilities for each of the plurality of emotions calculated from the user's voice signal;
calculating an emotion discrimination confidence level for each of the plurality of emotions by combining the emotion probabilities for each of the plurality of emotions obtained from each of the plurality of emotion classification models; and
determining, among the plurality of emotions, an emotion appearing in the user's voice signal by comparing the emotion discrimination confidence level for each of the plurality of emotions with an emotion discrimination confidence reference value for each of the plurality of emotions,
wherein the calculating of the emotion discrimination confidence level for each of the plurality of emotions comprises:
calculating, for any one of the plurality of emotions, a weight for each of the plurality of emotion classification models; and
calculating the emotion discrimination confidence level by summing the values obtained by multiplying the emotion probability obtained from each of the plurality of emotion classification models for the one emotion by the weight calculated for that emotion classification model,
wherein the calculating of the weight for each of the plurality of emotion classification models comprises:
averaging the probabilities of the plurality of emotions in any one of the plurality of emotion classification models; and
calculating the ratio of the emotion probability of any one of the plurality of emotions to the average of the probabilities of the plurality of emotions as the weight of that emotion classification model for the corresponding emotion,
wherein the determining of the emotion appearing in the user's voice signal comprises,
when the emotion discrimination confidence level for any one of the plurality of emotions is equal to or greater than the emotion discrimination confidence reference value for that emotion, determining that emotion to be an emotion appearing in the user's voice signal, and wherein the emotion discrimination confidence reference value is set differently for each of the plurality of emotions.

The method of claim 1,
wherein the constructing of the plurality of emotion classification models that have learned the voice signals for each of the plurality of emotions comprises:
receiving a training voice signal generated to express any one of the plurality of emotions;
extracting, from the training voice signal, voice feature data including at least one of pitch, ZCR (zero crossing rate), MFCC (mel-scaled power spectrogram), RMSE (root mean square energy), tempo, beat, and STFT (short-time Fourier transform); and
constructing the plurality of emotion classification models that have learned the at least one item of voice feature data for the emotion expressed by the training voice signal.

The method of claim 1,
wherein the constructing of the plurality of emotion classification models that have learned the voice signals for each of the plurality of emotions comprises:
receiving a training voice signal generated to express any one of the plurality of emotions;
labeling the training voice signal in binary form so as to identify whether the training voice signal expresses any one of the plurality of emotions; and
constructing the plurality of emotion classification models by inputting the labeled training voice signal into each of a plurality of binary classification learning models.

The method of claim 3,
wherein the constructing of the plurality of emotion classification models by inputting the labeled training voice signal into each of the plurality of binary classification learning models comprises
inputting the labeled training voice signal into each of a support vector machine model, an Xgboost model, and a random forest model to construct the plurality of emotion classification models.

(Deleted)

A computer-readable recording medium having a computer program recorded thereon for performing the method for classifying emotions using the voice signal according to claim 1.

An emotion classification apparatus using a voice signal, comprising:
an emotion classification model construction unit configured to construct a plurality of emotion classification models that have learned voice signals for each of a plurality of emotions;
an emotion probability calculation unit configured to obtain, for each of the plurality of emotion classification models, emotion probabilities for the plurality of emotions calculated from a voice signal of a user using the plurality of emotion classification models;
an emotion discrimination confidence level calculation unit configured to calculate an emotion discrimination confidence level for each of the plurality of emotions by combining the emotion probabilities for each of the plurality of emotions obtained for each of the plurality of emotion classification models; and
an emotion classification unit configured to determine, among the plurality of emotions, an emotion appearing in the voice signal of the user by comparing the emotion discrimination confidence level for each emotion with an emotion discrimination confidence reference value for that emotion,
wherein the emotion discrimination confidence level calculation unit calculates, for any one of the plurality of emotions, a weight for each of the plurality of emotion classification models, and calculates the emotion discrimination confidence level by summing, over the plurality of emotion classification models, the values obtained by multiplying the emotion probability obtained from each emotion classification model for that emotion by the weight calculated for that emotion classification model,
wherein, for one emotion classification model among the plurality of emotion classification models, the emotion probabilities of the plurality of emotions are averaged, and the ratio of the emotion probability of any one of the plurality of emotions to that average is calculated as the weight of the emotion classification model for that emotion, and
wherein the emotion classification unit determines an emotion to be the emotion appearing in the voice signal of the user when the emotion discrimination confidence level for that emotion is equal to or greater than the emotion discrimination confidence reference value for that emotion, the emotion discrimination confidence reference values being set differently for each of the plurality of emotions.
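
The weighting and thresholding recited above may be illustrated by the following sketch. It assumes each model's output is a dictionary mapping emotion names to probabilities, that the average probability of a model is non-zero, and that every emotion meeting its own reference value is returned; the claim does not state how ties between several qualifying emotions are resolved, so that choice, like all names and numeric values below, is an assumption of this example.

```python
from statistics import mean


def model_weights(probs):
    """Weight of one model for each emotion: p(emotion) / mean of that model's probabilities."""
    average = mean(probs.values())  # assumed non-zero in this sketch
    return {emotion: p / average for emotion, p in probs.items()}


def confidence_levels(per_model_probs):
    """Sum, over all models, of weight * probability for each emotion."""
    emotions = list(per_model_probs[0].keys())
    confidence = {emotion: 0.0 for emotion in emotions}
    for probs in per_model_probs:
        weights = model_weights(probs)
        for emotion in emotions:
            confidence[emotion] += weights[emotion] * probs[emotion]
    return confidence


def classify(per_model_probs, reference_values):
    """Return the emotions whose confidence meets their own per-emotion reference value."""
    confidence = confidence_levels(per_model_probs)
    return [e for e, c in confidence.items() if c >= reference_values[e]]


# Hypothetical probabilities from two models and hypothetical reference values.
outputs = [{"happy": 0.6, "sad": 0.2}, {"happy": 0.4, "sad": 0.4}]
detected = classify(outputs, {"happy": 1.0, "sad": 0.6})
```

In this example the first model's probabilities average to 0.4, so its weights are roughly 1.5 for "happy" and 0.5 for "sad". The second model's weights are both 1. The resulting confidence levels are about 1.3 and 0.5, and only "happy" meets its reference value.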

The apparatus of claim 8,
wherein the emotion probability calculation unit
extracts, from the user's voice signal, voice feature data including at least one of pitch, zero crossing rate (ZCR), mel-scaled power spectrogram (MFCC), root mean square energy (RMSE), tempo, beat, and short-time Fourier transform (STFT), inputs the at least one voice feature data into a support vector machine model, an XGBoost model, and a random forest model, each of which has learned the voice feature data for the emotion represented by a voice signal, and obtains the emotion probabilities for the plurality of emotions for each of the support vector machine model, the XGBoost model, and the random forest model.

delete