KR102128158B1

KR102128158B1 - Emotion recognition apparatus and method based on spatiotemporal attention

Info

Publication number: KR102128158B1
Application number: KR1020180053306A
Authority: KR
Inventors: 손광훈; 이지영
Original assignee: 연세대학교 산학협력단
Priority date: 2018-05-09
Filing date: 2018-05-09
Publication date: 2020-06-29
Anticipated expiration: 2038-05-09
Also published as: KR20190128933A

Abstract

An emotion recognition apparatus and method are disclosed. The apparatus and method acquire a 3D feature from an image sequence containing a plurality of frames while simultaneously extracting spatiotemporal features for each frame to obtain spatiotemporal weights, and then weight the 3D feature by the spatiotemporal weights, so that emotion can be determined accurately even without setting a separate region of interest.

Description

Apparatus and method for emotion recognition based on spatiotemporal attention {EMOTION RECOGNITION APPARATUS AND METHOD BASED ON SPATIOTEMPORAL ATTENTION}

The present invention relates to an emotion recognition apparatus and method, and more particularly to an emotion recognition apparatus and method based on spatiotemporal attention.

Emotion recognition is one of the important issues in interactive systems. An interactive system determines the user's requirements through dialogue with the user rather than through conventional command input. Applying emotion recognition technology allows those requirements to be determined more accurately.

Emotion recognition can also be applied in the medical field, for example to detect pain or psychological distress, as well as in various other fields.

Most existing research on emotion recognition has used categorical emotion recognition, which classifies emotions into discrete categories according to a predetermined number (for example, six) of basic emotions such as fear, anger, happiness, disgust, sadness, and surprise. However, because categorical emotion recognition classifies only into the designated emotions, some regions of emotion remain unclassified and the kinds of emotion that can be recognized are limited.

FIG. 1 shows an example of images representing human emotions.

In FIG. 1, (a) to (d) are facial expression images for the four types of surprise defined by Ekman: (a) questioning surprise, (b) astonished surprise, (c) dazed surprise, and (d) full surprise.

As shown in FIG. 1, surprise itself can take various forms, but existing categorical emotion recognition classifies all of them simply as surprise and cannot recognize subtle differences in emotion. A method of representing emotion in two dimensions along two continuous domains has therefore been proposed.

FIG. 2 shows an example of emotions represented on a continuous two-dimensional graph.

In FIG. 2, the two axes represent arousal and valence: the arousal axis indicates the level of activity or inactivity, and the valence axis indicates the positive or negative level. As shown in FIG. 2, describing emotion in a continuous two-dimensional space can express more complex and subtle emotions than the conventional categorical approach.

Recently, neural networks have been applied to emotion recognition techniques to improve accuracy. However, most existing techniques have been studied for recognizing emotion from a single image, and research on accurately recognizing human emotion from an image sequence over time is lacking. Because a person's emotions actually change gradually and continuously over time, recognizing emotion from a continuous image sequence can be more accurate than recognizing it from a single image.

In addition, as shown in FIG. 1, conventional emotion recognition designates in advance the regions of interest in a face image where emotion is expected to be expressed strongly, and analyzes those designated regions. However, because facial expression is estimated and emotion recognized by activating only some regions of interest, emotion cannot be recognized with optimal performance across diverse face images.

Korean Laid-Open Patent Publication No. 10-2013-0015958 (published February 14, 2013)

An object of the present invention is to provide an emotion recognition apparatus and method capable of recognizing emotion from a face image sequence based on temporal and spatial attention.

Another object of the present invention is to provide an emotion recognition apparatus and method capable of recognizing emotion without designating a region of interest in the face image sequence.

Yet another object of the present invention is to provide an emotion recognition apparatus and method that extract 2D and 3D features from a face image sequence and use the extracted 2D and 3D features to recognize emotion accurately.

To achieve the above objects, an emotion recognition apparatus according to one example of the present invention includes: a 3D feature extraction unit, trained in advance according to a predetermined 3D pattern recognition technique, that extracts a 3D feature by pattern-recognizing T temporally consecutive frames (where T is a natural number) of an image sequence as a single three-dimensional image; a spatiotemporal feature extraction unit, trained in advance according to a predetermined 2D pattern recognition technique, that extracts T spatial features through pattern recognition from each of the T frames and adds the spatiotemporal features among the obtained T spatial features to obtain spatiotemporal weights; an emotion value extraction unit that weights the 3D feature by the spatiotemporal weights to obtain an emotion feature and, trained in advance according to a predetermined 3D pattern recognition technique, extracts from the emotion feature an emotion value within a predesignated range; and an emotion determination unit that stores in advance the emotions corresponding to emotion values and determines the emotion corresponding to the emotion value obtained by the emotion value extraction unit.

The 3D feature extraction unit may include a pre-trained 3D CNN (3D Convolutional Neural Network) to extract the 3D feature.

The spatiotemporal feature extraction unit may include: a spatial encoder that includes a pre-trained 2D CNN (2D Convolutional Neural Network) and extracts the T spatial features for each of the T frames; a temporal decoder that includes a pre-trained ConvLSTM (Convolutional Long Short-Term Memory) and extracts the spatiotemporal features among the T spatial features; and a normalizer that normalizes the spatiotemporal features extracted by the temporal decoder in a predetermined manner to obtain the spatiotemporal weights.

In the spatial encoder, the 2D CNN may include a convolution layer, a ReLU (Rectified Linear Unit) layer, and a max-pooling layer, each including a plurality of filters, so that the spatial resolution of the spatial features is reduced below the spatial resolution of the frames.

In the temporal decoder, the ConvLSTM may include a plurality of ConvLSTM layers and perform sequential deconvolution to restore the reduced spatial resolution of the spatial features.

The normalizer may normalize the spatiotemporal features using a softmax function.

The emotion value extraction unit may include: a feature combining unit that obtains the emotion feature by taking the Hadamard product of the 3D feature and the spatiotemporal weights; and an emotion value acquisition unit that includes a pre-trained 3D CNN and extracts from the emotion feature an emotion value representing the emotion.

To achieve the above objects, an emotion recognition method according to another example of the present invention includes: extracting a 3D feature by pattern-recognizing T temporally consecutive frames (where T is a natural number) of an image sequence as a single three-dimensional image, using a predetermined 3D pattern recognition technique trained in advance; extracting T spatial features through pattern recognition from each of the T frames, using a predetermined 2D pattern recognition technique trained in advance, and adding the spatiotemporal features among the obtained T spatial features to obtain spatiotemporal weights; weighting the 3D feature by the spatiotemporal weights to obtain an emotion feature; extracting from the emotion feature an emotion value within a predesignated range, using a predetermined 3D pattern recognition technique trained in advance; and determining the emotion corresponding to the emotion value.

Accordingly, the emotion recognition apparatus and method of the present invention obtain 2D and 3D features from an image sequence and use them together to recognize emotion accurately. In addition to recognizing emotion based on temporal and spatial attention, they can accurately recognize emotion in terms of continuous valence without separately designating a region for emotion recognition.

FIG. 1 shows an example of images representing human emotions.
FIG. 2 shows an example of emotions represented on a continuous two-dimensional graph.
FIG. 3 shows a schematic configuration of an emotion recognition apparatus according to an embodiment of the present invention.
FIG. 4 shows an example of the detailed configuration of the spatiotemporal feature extraction unit of FIG. 3.
FIG. 5 is a diagram for explaining a training method for the emotion recognition apparatus of FIG. 3.
FIG. 6 shows an emotion recognition method according to an embodiment of the present invention.
FIG. 7 is a diagram visualizing the spatiotemporal weights of this embodiment.
FIG. 8 and FIG. 9 show the results of comparing the emotion values obtained by applying the emotion recognition method of this embodiment with the verification values, for two datasets, the RECOLA dataset and the AV+EC dataset, respectively.

To fully understand the present invention, its operational advantages, and the objects achieved by practicing it, reference should be made to the accompanying drawings, which illustrate preferred embodiments of the present invention, and to the content described in those drawings.

Hereinafter, the present invention is described in detail by explaining preferred embodiments with reference to the accompanying drawings. The present invention may, however, be implemented in many different forms and is not limited to the embodiments described. To describe the present invention clearly, parts unrelated to the description are omitted, and the same reference numerals in the drawings denote the same members.

Throughout the specification, when a part is said to "include" a component, this means that it may further include other components rather than excluding them, unless specifically stated otherwise. In addition, terms such as "... unit", "...-er", "module", and "block" used in the specification denote a unit that processes at least one function or operation, which may be implemented in hardware, software, or a combination of hardware and software.

FIG. 3 shows a schematic configuration of an emotion recognition apparatus according to an embodiment of the present invention, and FIG. 4 shows an example of the detailed configuration of the spatiotemporal feature extraction unit of FIG. 3.

Referring to FIG. 3, the emotion recognition apparatus according to this embodiment includes an image acquisition unit 100 that acquires an image sequence containing the subject whose emotion is to be recognized, an emotion extraction unit 200 that extracts an emotion value from the image sequence passed from the image acquisition unit 100, and an emotion determination unit 300 that determines the emotion of the subject in the images according to the extracted emotion value.

First, the image acquisition unit 100 acquires an image sequence containing the subject whose emotion is to be recognized. In particular, in this embodiment the image acquisition unit 100 acquires not a single image but an image sequence I_{1:T} = {I_1, I_2, ..., I_T} of T consecutive frames I_f (where T is a natural number and f is a natural-number frame index). Each frame of the image sequence I_{1:T} contains the subject's face so that emotion can be recognized.

The image acquisition unit 100 then passes the acquired image sequence to the emotion extraction unit 200. If the acquired image sequence contains more than T frames, the image acquisition unit 100 may separate out T frames that include the frame at the time point at which the subject's emotion is to be recognized, and pass them to the emotion extraction unit 200. For example, from an image sequence of 100 frames, the image acquisition unit 100 may separate the 11th to 20th frames (I_11 to I_20), assuming T is 10, and pass them to the emotion extraction unit 200.

When continuous changes in the subject's emotion are to be recognized from the image sequence, T frames may be separated sequentially from the image sequence and passed to the emotion extraction unit 200. For example, the 1st to 10th frames (I_1 to I_10) may be passed first, followed by the 2nd to 11th frames (I_2 to I_11).
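The sequential separation of T frames described above can be sketched as a sliding window. This is a minimal illustration, not the patented implementation: the function name `sliding_windows`, the `step` parameter, and the string stand-ins for frames are assumptions introduced here.

```python
from typing import Iterator, List, Sequence, TypeVar

Frame = TypeVar("Frame")


def sliding_windows(frames: Sequence[Frame], T: int, step: int = 1) -> Iterator[List[Frame]]:
    """Yield consecutive windows of T frames, advancing by `step` frames each time."""
    if T <= 0 or step <= 0:
        raise ValueError("T and step must be positive")
    # The last valid window starts at index len(frames) - T.
    for start in range(0, len(frames) - T + 1, step):
        yield list(frames[start:start + T])


# Stand-in labels I_1 ... I_100; real frames would be face images (assumption).
frames = [f"I_{k}" for k in range(1, 101)]
windows = list(sliding_windows(frames, T=10))
```

With T = 10 and a step of one frame, the first window holds I_1 to I_10 and the second holds I_2 to I_11, matching the example in the text.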

The emotion extraction unit 200 extracts an emotion value from the image sequence I_{1:T} passed from the image acquisition unit 100. In particular, in this embodiment the emotion extraction unit 200 extracts features of the image sequence I_{1:T} using pre-trained two-dimensional (2D) and three-dimensional (3D) pattern recognition techniques, and combines the extracted features to extract the emotion value.

As shown in FIG. 3, the emotion extraction unit 200 may include a 3D feature extraction unit 210, a spatiotemporal feature extraction unit 220, a feature combining unit 230, and an emotion value acquisition unit 240.

To determine the subject's emotion, the 3D feature extraction unit 210 extracts a 3D feature X'_{1:T} by pattern-recognizing all the frames {I_1, I_2, ..., I_T} of the two-dimensional image sequence I_{1:T} passed from the image acquisition unit 100 as a single three-dimensional object. That is, the image sequence I_{1:T}, which consists of multiple 2D frames accumulated over time, is treated as a 3D image and analyzed according to a predesignated pattern recognition technique to extract the 3D feature X'_{1:T}. Because the 3D feature extraction unit 210 in this embodiment extracts the 3D feature X'_{1:T} for emotion recognition from an image sequence I_{1:T} of T temporally consecutive frames {I_1, I_2, ..., I_T}, it can extract relatively accurate features compared with extracting 2D features from a single image. This allows the subject's emotion to be determined very accurately.

The 3D feature extraction unit 210 may be implemented, for example, as a pre-trained 3D convolutional neural network (hereinafter, 3D CNN).

The spatiotemporal feature extraction unit 220 extracts features based on spatiotemporal attention from each of the T two-dimensional frames {I_1, I_2, ..., I_T} passed from the image acquisition unit 100. In particular, by extracting spatiotemporal-attention-based features of the image sequence I_{1:T}, the spatiotemporal feature extraction unit 220 in this embodiment makes it possible to obtain weights reflecting the importance of each region within the T frames {I_1, I_2, ..., I_T}, even when no separate region of interest is designated for the image sequence I_{1:T}.

That is, the emotion recognition apparatus of this embodiment can provide optimal emotion recognition performance even without separately designating, as in FIG. 1, the regions of a person's face (eyes, mouth) where emotion is strongly expressed in each frame {I_1, I_2, ..., I_T}. The spatiotemporal feature extraction unit 220 extracts the importance of emotional expression in each region of each frame as a feature based on spatiotemporal attention, and the extracted feature is applied to the 3D feature X'_{1:T} as the spatiotemporal weights A_{1:T}.
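Applying the spatiotemporal weights to the 3D feature is described in the summary as a Hadamard (element-wise) product. The sketch below illustrates that operation on nested lists of shape [T][H][W]. The function name `hadamard_weight`, the single-channel layout, and the toy values are assumptions made for illustration only.

```python
def hadamard_weight(features, weights):
    """Element-wise product of two nested lists shaped [T][H][W].

    Assumes `features` and `weights` have identical shapes; only the
    outer (time) dimension is checked here.
    """
    if len(features) != len(weights):
        raise ValueError("features and weights must cover the same number of frames")
    return [
        [
            [x * a for x, a in zip(feature_row, weight_row)]
            for feature_row, weight_row in zip(feature_map, weight_map)
        ]
        for feature_map, weight_map in zip(features, weights)
    ]


# Toy example: T = 2 frames, each a 1 x 2 feature map (hypothetical values).
x_3d = [[[1.0, 2.0]], [[3.0, 4.0]]]
a_weights = [[[0.5, 0.0]], [[1.0, 0.25]]]
emotion_feature = hadamard_weight(x_3d, a_weights)
```

Positions with a weight near zero are suppressed, while positions with larger weights keep more of their feature response, which is how attention can stand in for a hand-designated region of interest.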

To this end, the spatiotemporal feature extraction unit 220 may be configured as shown in FIG. 4.

Referring to FIG. 4, the spatiotemporal feature extraction unit 220 includes a spatial encoder 221 for extracting spatial-attention-based features, a temporal decoder 223 for extracting spatiotemporal-attention-based features, and a normalizer 225 that normalizes the extracted features into weights within a designated range.

The spatial encoder 221 extracts and outputs spatial features X_{1:T} for each of the T two-dimensional frames {I_1, I_2, ..., I_T} of the image sequence I_{1:T}. The spatial encoder 221 is trained in advance by a designated 2D pattern recognition technique and extracts the two-dimensional spatial features X_{1:T} by recognizing the spatial pattern of each of the T frames {I_1, I_2, ..., I_T}.

The spatial encoder 221 may be implemented, for example, as a pre-trained 2D convolutional neural network (hereinafter, 2D CNN). A 2D CNN is a type of artificial neural network mainly used to extract features from two-dimensional images.

The spatial encoder 221 may be configured to receive the T frames {I_1, I_2, ..., I_T} sequentially from the image acquisition unit 100 and output the spatial features {X_1, X_2, ..., X_T} sequentially. To save time, it may instead be configured in parallel so that the T frames {I_1, I_2, ..., I_T} are received simultaneously and their features extracted together. When the spatial encoder 221 is configured in parallel, all the spatial encoders form a Siamese network in which the weights and bias values are shared identically.

Furthermore, even if the T frames {I_1, I_2, ..., I_T} are applied sequentially while the spatial encoder 221 is being trained, the weights and bias values of the spatial encoder 221 must not change until the spatial features {X_1, X_2, ..., X_T} for all T frames have been output. They may be updated only after all of those spatial features have been output. This is because the emotion recognition apparatus of this embodiment processes the image sequence I_{1:T} of T frames {I_1, I_2, ..., I_T} as the unit for emotion recognition.

In this embodiment, the spatial encoder 221 implemented as a 2D CNN may be configured, for example, to include a succession of 3 × 3 convolution layers, ReLU (Rectified Linear Unit) layers, and max-pooling layers with a 2 × 2 stride. Each of the 3 × 3 convolution layer, the ReLU layer, and the max-pooling layer may include a predetermined number of filters. For example, the 3 × 3 convolution layer may include 32 filters, the ReLU layer may include 64 filters, and the max-pooling layer may include 128 filters.
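The effect of the ReLU and max-pooling stages on spatial resolution can be illustrated on a single feature map. The sketch below assumes a 2 × 2 pooling window moved with a stride of 2 and omits the learned convolution filters entirely; `relu`, `max_pool_2x2`, and the sample values are hypothetical names and data introduced here.

```python
def relu(fmap):
    """Replace negative activations with zero."""
    return [[max(0, v) for v in row] for row in fmap]


def max_pool_2x2(fmap):
    """2 x 2 max pooling with stride 2; halves height and width.

    A trailing odd row or column is dropped (assumption for simplicity).
    """
    height, width = len(fmap), len(fmap[0])
    return [
        [
            max(fmap[r][c], fmap[r][c + 1], fmap[r + 1][c], fmap[r + 1][c + 1])
            for c in range(0, width - 1, 2)
        ]
        for r in range(0, height - 1, 2)
    ]


# A 4 x 4 map standing in for the output of a convolution layer.
conv_out = [
    [1, -2, 3, 0],
    [4, 5, -6, 7],
    [0, -1, 2, 2],
    [-3, -4, 1, 8],
]
pooled = max_pool_2x2(relu(conv_out))
```

Each pooling stage halves the height and width, so the spatial features have a lower spatial resolution than the input frames, which the temporal decoder later restores.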

The spatial encoder 221 includes the 3 × 3 convolution, ReLU, and max-pooling layers in order to reduce the number of parameters when extracting the spatial features X_{1:T} from the image sequence I_{1:T}, thereby preventing overfitting.

The temporal decoder 223 extracts spatiotemporal-attention-based features from the spatial features X_{1:T} obtained by the spatial encoder 221.

When the spatial encoder 221 is implemented as a 2D CNN, it can extract the spatial features, that is, the region-wise features, in each of the T frames {I_1, I_2, ..., I_T} of the image sequence I_{1:T}. However, because the spatial encoder 221 extracts features from the T frames individually, the temporal features among the T temporally consecutive frames {I_1, I_2, ..., I_T} are not reflected.

Accordingly, in this embodiment the temporal decoder 223 extracts spatiotemporal-attention-based features so that temporal features are added to the spatial features X_{1:T}. The temporal decoder 223 is trained in advance by a designated pattern recognition technique. While preserving as far as possible the spatial pattern features contained in the spatial features {X_1, X_2, ..., X_T}, it additionally extracts the temporal features between temporally adjacent spatial features.

The temporal decoder 223 may be implemented, for example, as a pre-trained ConvLSTM (Convolutional Long Short-Term Memory). ConvLSTM is also a type of artificial neural network. It further improves on LSTM (Long Short-Term Memory), which is itself an improvement of the recurrent neural network (RNN) that captures long-term features, so that spatial features are reflected as well.

The temporal decoder 223 uses ConvLSTM rather than LSTM, which can reflect temporal features, so that the spatial features X_{1:T} obtained by the spatial encoder 221 are preserved as far as possible.

Equation 1 expresses the function performed by the ConvLSTM in the temporal decoder 223. In Equation 1, i_t, f_t, o_t, c_t, and h_t denote the input gate, forget gate, output gate, activation cell, and cell output at time t, respectively. σ(·) and tanh(·) denote the sigmoid function and the hyperbolic tangent function, respectively, * is the convolution operator, and ⊙ is the Hadamard product operator. W_* is a filter matrix connecting different gates, and b_* is the bias vector corresponding to each gate.

As Equation 1 shows, the ConvLSTM has a convolutional structure in both the input-to-state and state-to-state transitions, so it can extract temporal features while also preserving spatial features.

The temporal decoder 223 also gradually increases the spatial resolution of the applied spatial features X_{1:T} through sequential deconvolution. That is, the temporal decoder 223 extracts features that reflect the temporal correlation between frames while preserving the spatial structure of the spatial features X_{1:T}.

To this end, the temporal decoder 223 may include a plurality of ConvLSTM layers, each of which may include a predetermined number of filters. FIG. 4 shows, as an example, two ConvLSTM layers with 64 and 32 filters, respectively.

The normalizer 225 normalizes the spatiotemporal features output from the temporal decoder 223 using the spatial softmax function of Equation 2.

In Equation 2, H_{t-1} denotes the hidden state, W_i is the weight mapped to the i-th element of the location softmax, and j denotes a location.

The spatiotemporal features output from the temporal decoder 223 are normalized by the normalizer 225 and output as the spatiotemporal weights A_{1:T}. For example, the normalizer 225 may normalize the spatiotemporal weights A_{1:T} so that they sum to 1.
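The normalization step can be illustrated with a minimal sketch, assuming the spatial softmax is applied over all locations of a single frame's response map. The function name `spatial_softmax` and the 2x3 input values are hypothetical and are not taken from the patent.

```python
import math

def spatial_softmax(scores):
    """Normalize a 2D grid of scores so that all entries sum to 1."""
    flat = [v for row in scores for v in row]
    peak = max(flat)  # subtract the maximum for numerical stability
    exps = [[math.exp(v - peak) for v in row] for row in scores]
    total = sum(sum(row) for row in exps)
    return [[e / total for e in row] for row in exps]

# Hypothetical 2x3 map of spatiotemporal feature responses for one frame.
weights = spatial_softmax([[0.1, 2.0, 0.3], [1.5, 0.0, -1.0]])
```

Locations with larger responses receive larger weights, and the weights of one map always sum to 1.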

The feature combining unit 230 combines the 3D features X'_{1:T} and the spatiotemporal weights A_{1:T} according to Equation 3 to obtain the emotion feature X".

In Equation 3, the 3D features X'_{1:T} are the features used to determine the subject's emotion, and the spatiotemporal weights A_{1:T} normalized by the normalizer 225 serve as weights that specify the importance of each corresponding region of the 3D features X'_{1:T}.
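The weighting can be sketched as follows, assuming that Equation 3 is an element-wise (Hadamard) product, which is consistent with claim 7's description of multiplying the 3D features by the spatiotemporal weights. The function name `weight_features` and the single-channel [frame][row][column] layout are simplifying assumptions for illustration.

```python
def weight_features(features, weights):
    """Element-wise (Hadamard) product of feature maps and attention weights.

    Both arguments are nested lists indexed as [frame][row][column].
    """
    return [
        [[x * a for x, a in zip(f_row, w_row)]
         for f_row, w_row in zip(f_frame, w_frame)]
        for f_frame, w_frame in zip(features, weights)
    ]

# Hypothetical single-frame, 2x2 example.
features = [[[1.0, 2.0], [3.0, 4.0]]]
weights = [[[0.1, 0.2], [0.3, 0.4]]]
emotion_feature = weight_features(features, weights)
```

Regions with small weights contribute little to the emotion feature, so no region of interest has to be cropped by hand.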

The emotion value acquisition unit 240 extracts three-dimensional features once more from the emotion feature X" to obtain the emotion value y.

The emotion value acquisition unit 240 may be implemented, for example, as a pre-trained 3D CNN, similarly to the 3D feature extraction unit 210. In this embodiment, the emotion value acquisition unit 240 may extract features so that the emotion value is obtained as a scalar value between -1 and 1 (y ∈ [-1, 1]), but is not limited thereto.

The emotion value acquisition unit 240 may also consist of a plurality of layers in order to obtain the emotion value efficiently. For example, the emotion value acquisition unit 240 may include a plurality of (for example, four) 3D CNN layers, a plurality of (for example, three) 3D max pooling layers, and a plurality of (for example, two) fully connected layers. The 3D CNN layers may include, for example, 32, 64, 128, and 256 filters, respectively.

Meanwhile, the fully connected layer has a single output channel, and the emotion value y may be obtained using a linear regression layer.

The emotion determination unit 300 determines the subject's emotion by applying the emotion value y obtained by the emotion value acquisition unit 240 to pre-stored emotion criteria for each emotion value. Since the emotion value y is assumed above to be a scalar between -1 and 1, the emotion criteria may assign each emotion a continuous range of values between -1 and 1. The emotion determination unit 300 can therefore readily determine the emotion corresponding to the applied emotion value y.
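The lookup described above can be sketched as a simple range table. The ranges and labels below are hypothetical placeholders. The patent states only that each emotion is assigned a continuous range of values between -1 and 1.

```python
# Hypothetical criteria: (lower bound, upper bound, label). Illustrative only.
EMOTION_CRITERIA = [
    (-1.0, -0.3, "negative"),
    (-0.3, 0.3, "neutral"),
    (0.3, 1.0, "positive"),
]

def determine_emotion(y):
    """Return the label whose range contains the emotion value y in [-1, 1]."""
    if not -1.0 <= y <= 1.0:
        raise ValueError("emotion value must lie in [-1, 1]")
    for low, high, label in EMOTION_CRITERIA:
        if low <= y < high:
            return label
    return EMOTION_CRITERIA[-1][2]  # y == 1.0 belongs to the last range
```

For instance, `determine_emotion(-0.8)` falls in the first range of this illustrative table.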

In this embodiment, as an example, the emotion value y corresponding to valence is extracted, valence being one of the two axes, arousal and valence, of the two-dimensional emotion graph shown in FIG. 2. This is only an example, however. Depending on the case, the emotion extraction unit 200 may be configured to extract an emotion value corresponding to arousal, or emotion values corresponding to both arousal and valence.

In this embodiment, the 3D feature extraction unit 210, the spatial encoder 221 and temporal decoder 223 of the spatiotemporal feature extraction unit 220, and the emotion value acquisition unit 240 are each artificial neural networks trained in advance according to a designated deep learning algorithm.

The T frames {I_1, I_2, …, I_T} of the image sequence I_{1:T} applied to the emotion extraction unit 200, and each of the features (X_{1:T}, X'_{1:T}, X", A_{1:T}), may be expressed as vector matrices.

As a result, the emotion recognition apparatus according to this embodiment extracts, from an image sequence I_{1:T} containing T consecutive frames {I_1, I_2, …, I_T}, the 3D features X'_{1:T} for determining emotion in three dimensions and, at the same time, extracts spatiotemporal attention-based features from each of the T frames to obtain region-wise weights A_{1:T} for each of the T frames. It then weights the 3D features X'_{1:T} with the region-wise weights A_{1:T} to obtain the emotion feature X" and extracts the emotion value y from the emotion feature X". This allows the subject's emotion to be extracted and determined very accurately without setting a separate region of interest in the image sequence I_{1:T}.

FIG. 5 is a diagram for explaining a training method of the emotion recognition apparatus of FIG. 3.

The training method of FIG. 5 is described with reference to FIGS. 3 and 4. The image acquisition unit 100 takes an image sequence containing a large number of frames whose emotion values have been determined in advance and sequentially passes T frames {I_1, I_2, …, I_T} at a time to the emotion extraction unit 200.

Training the emotion extraction unit 200 requires many image sequences or many frames, so the case in which the image sequence contains 3500 frames is shown here as an example.

The image acquisition unit 100 sequentially separates T frames at a time from the image sequence and passes them on. It may do so by first passing the first through tenth frames (I_1 to I_T, that is, T = 10 in this example), then passing the second through eleventh frames (I_2 to I_{T+1}), and so on.
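The frame delivery described above amounts to a sliding window with a stride of one frame. A minimal sketch follows. The function name `sliding_windows` and the use of integers as stand-ins for frames are assumptions for illustration.

```python
def sliding_windows(frames, t):
    """Yield consecutive windows of t frames, advancing one frame at a time."""
    for start in range(len(frames) - t + 1):
        yield frames[start:start + t]

frames = list(range(1, 13))  # stand-ins for frames I_1 ... I_12
windows = list(sliding_windows(frames, 10))
```

With 12 frames and T = 10, the windows cover frames 1 to 10, 2 to 11, and 3 to 12.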

The 3D feature extraction unit 210 of the emotion extraction unit 200 extracts the 3D features X'_{1:T} from the entire image sequence I_{1:T} containing the T frames {I_1, I_2, …, I_T}. Alongside this, the spatial encoder 221, temporal decoder 223, and normalizer 225 of the spatiotemporal feature extraction unit 220 extract the spatial features X_{1:T} from each of the T frames {I_1, I_2, …, I_T}, then extract spatiotemporal features from the extracted spatial features X_{1:T} and normalize them, thereby obtaining the spatiotemporal weights A_{1:T}.

Meanwhile, the feature combining unit 230 of the emotion extraction unit 200 weights the 3D features X'_{1:T} with the spatiotemporal weights A_{1:T} to obtain the emotion feature X", and the emotion value acquisition unit 240 extracts three-dimensional features once more from the obtained emotion feature X" to obtain an emotion value y that is a scalar within a predetermined range (here, for example, -1 to 1).

At this stage, the 3D feature extraction unit 210, spatial encoder 221, temporal decoder 223, and emotion value acquisition unit 240 are all artificial neural networks that have not yet been trained, so the extracted 3D features X'_{1:T}, spatial features X_{1:T}, spatiotemporal weights A_{1:T}, and emotion value y all contain considerable error.

The obtained emotion value y is therefore compared with the emotion value determined and stored in advance for the corresponding frame, and the error is analyzed. The graph on the right of FIG. 5 shows, for a 3500-frame training image sequence, the emotion value y obtained for each frame and the pre-stored emotion value. The emotion value here is the valence score. The blue line is the emotion value y obtained for each frame, and the red line is the pre-stored emotion value. The difference between the blue and red lines at a given frame on the x-axis is thus the error.

The emotion extraction unit 200 is trained by adjusting the weights, bias vectors, and other parameters of the 3D feature extraction unit 210, spatial encoder 221, temporal decoder 223, and emotion value acquisition unit 240 so that the analyzed error decreases. The error is propagated in the reverse of the order in which the image sequence I_{1:T} is processed, from the emotion value acquisition unit 240 to the temporal decoder 223, the spatial encoder 221, and the 3D feature extraction unit 210, so that the error is gradually reduced.
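The idea of adjusting weights and biases so that the error decreases can be illustrated on a deliberately tiny stand-in: a single linear regression layer fitted by gradient descent on the mean squared error. This is a simplified sketch, not the patent's training procedure. The function name, learning rate, epoch count, and sample values are all assumptions.

```python
def train_linear_layer(samples, lr=0.1, epochs=200):
    """Fit y = w * x + b by gradient descent on the mean squared error."""
    w, b = 0.0, 0.0
    history = []  # mean squared error before each update
    n = len(samples)
    for _ in range(epochs):
        errors = [(w * x + b) - target for x, target in samples]
        history.append(sum(e * e for e in errors) / n)
        # Gradients of the mean squared error with respect to w and b.
        grad_w = 2.0 * sum(e * x for e, (x, _) in zip(errors, samples)) / n
        grad_b = 2.0 * sum(errors) / n
        w -= lr * grad_w
        b -= lr * grad_b
    return w, b, history

# Hypothetical (input, stored emotion value) pairs lying on y = x - 0.5.
samples = [(0.0, -0.5), (0.5, 0.0), (1.0, 0.5)]
w, b, history = train_linear_layer(samples)
```

In the full apparatus the same principle applies to every trainable unit, with the error gradient propagated backward through the units in the order the text describes.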

The emotion extraction unit 200 then receives T frames again from the image acquisition unit 100, obtains the emotion value y, and determines the error, and thus learns iteratively. As a result, the emotion recognition apparatus can be trained by repeatedly obtaining the emotion value y for an image sequence containing many frames and reducing the error of the obtained emotion value y.

FIG. 6 shows an emotion recognition method according to an embodiment of the present invention.

The emotion recognition method of FIG. 6 is described with reference to FIGS. 3 and 4. First, the image acquisition unit 100 acquires an image sequence I_{1:T} containing T frames {I_1, I_2, …, I_T} and passes it to the emotion extraction unit 200 (S10).

The 3D feature extraction unit 210 of the emotion extraction unit 200 then extracts the 3D features X'_{1:T} from the entire image sequence I_{1:T} containing the T frames {I_1, I_2, …, I_T} (S20). The 3D feature extraction unit 210 may be implemented, for example, as a 3D CNN, and can thereby extract the 3D features X'_{1:T} from the T two-dimensional frames {I_1, I_2, …, I_T}.

At the same time, the spatiotemporal feature extraction unit 220 of the emotion extraction unit 200 first extracts the spatial features (X_{1:T} = {X_1, X_2, …, X_T}) from each of the T frames {I_1, I_2, …, I_T} (S30). The spatiotemporal feature extraction unit 220 may extract the spatial features {X_1, X_2, …, X_T} of each of the T frames using, for example, a 2D CNN. To prevent overfitting, the 2D CNN may include a convolutional layer, a ReLU layer, and a max pooling layer and thereby reduce the spatial resolution.

Spatiotemporal features are then extracted in order to add temporal features while preserving the extracted spatial features X_{1:T} as far as possible (S40). The spatiotemporal feature extraction unit 220 uses, for example, a ConvLSTM to extract the additional temporal features while preserving the spatial features X_{1:T}. The spatiotemporal feature extraction unit 220 may include a plurality of ConvLSTM layers and, through sequential deconvolution, enlarge the reduced spatial resolution again.

The emotion extraction unit 200 then normalizes the extracted spatiotemporal features in a predetermined manner to obtain the spatiotemporal weights A_{1:T} (S50).

The feature combining unit 230 of the emotion extraction unit 200 then weights the 3D features X'_{1:T} with the spatiotemporal weights A_{1:T} to obtain the emotion feature X" (S60). Because the 3D features X'_{1:T} are weighted with the spatiotemporal weights A_{1:T}, each spatiotemporal region of the 3D features X'_{1:T} extracted from the T frames {I_1, I_2, …, I_T} can be weighted differently.

This means that, even when no separate region of interest is specified, the emotion extraction unit 200 can determine the importance of each region in each of the T frames {I_1, I_2, …, I_T}.

The emotion value acquisition unit 240 of the emotion extraction unit 200 then extracts three-dimensional features once more from the emotion feature X" to obtain an emotion value y that is a scalar within a predetermined range (S70). The emotion value acquisition unit 240 may also be implemented, for example, as a 3D CNN, and the emotion value y obtained here is a value representing the emotion of the subject in the image sequence I_{1:T}. To obtain the emotion value, the emotion value acquisition unit 240 may include 3D CNN layers, 3D max pooling layers, and fully connected layers.

The emotion determination unit 300 determines the subject's emotion by comparing the obtained emotion value y against the pre-stored emotions for each emotion value (S80).

The performance of the emotion recognition apparatus and method according to this embodiment is described below in comparison with existing emotion recognition methods.

To evaluate the performance of the emotion recognition apparatus and method according to this embodiment quantitatively, three metrics were used: root mean square error (RMSE), Pearson correlation coefficient (CC), and concordance correlation coefficient (CCC).

Among these, the concordance correlation coefficient (CCC) measures the agreement between two variables according to Equation 4.
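The image of Equation 4 is not reproduced in this text. The standard definition of the concordance correlation coefficient, written with the symbols ρ, σ_x², σ_y², μ_x, and μ_y used in this description, is given below. The left-hand symbol ρ_c is an assumed name.

```latex
\rho_c = \frac{2 \rho \, \sigma_x \sigma_y}{\sigma_x^2 + \sigma_y^2 + \left(\mu_x - \mu_y\right)^2}
```

The term (μ_x − μ_y)² in the denominator penalizes a constant offset between the two variables, which the Pearson correlation coefficient alone does not detect.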

In Equation 4, ρ is the Pearson correlation coefficient, σ_x² and σ_y² are the variances of the predicted and measured values, and μ_x and μ_y are the means of the predicted and measured values.
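The three metrics can be computed as in the following sketch, which uses the standard definitions of RMSE, the Pearson correlation coefficient, and the concordance correlation coefficient with population (divide-by-n) variances. The function names and the sample values are illustrative assumptions, not data from the patent.

```python
import math

def mean(values):
    return sum(values) / len(values)

def rmse(pred, truth):
    """Root mean square error between predictions and ground truth."""
    return math.sqrt(mean([(p - t) ** 2 for p, t in zip(pred, truth)]))

def pearson_cc(pred, truth):
    """Pearson correlation coefficient (covariance over product of std devs)."""
    mp, mt = mean(pred), mean(truth)
    cov = mean([(p - mp) * (t - mt) for p, t in zip(pred, truth)])
    sp = math.sqrt(mean([(p - mp) ** 2 for p in pred]))
    st = math.sqrt(mean([(t - mt) ** 2 for t in truth]))
    return cov / (sp * st)

def ccc(pred, truth):
    """Concordance correlation coefficient: 2*rho*sx*sy / (sx^2 + sy^2 + (mx - my)^2)."""
    mp, mt = mean(pred), mean(truth)
    vp = mean([(p - mp) ** 2 for p in pred])
    vt = mean([(t - mt) ** 2 for t in truth])
    rho = pearson_cc(pred, truth)
    return 2.0 * rho * math.sqrt(vp) * math.sqrt(vt) / (vp + vt + (mp - mt) ** 2)

# Hypothetical predictions that track the truth exactly but are shifted by 0.5.
truth = [-0.5, 0.0, 0.5]
pred = [0.0, 0.5, 1.0]
scores = (rmse(pred, truth), pearson_cc(pred, truth), ccc(pred, truth))
```

For this shifted example the Pearson coefficient is 1 while the CCC is 4/7, which shows why CCC is the stricter measure of agreement.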

The data used for performance verification were the RECOLA data set adopted in the 2015 and 2016 Audio/Visual Emotion recognition Challenges (hereinafter AV + EC) and the data set of the 2017 AV + EC.

Table 1 shows the measurement results for the case using a 2D CNN, the case using a 3D CNN, and the emotion recognition method of this embodiment, which uses a 3D CNN together with spatiotemporal attention (STA).

As shown in the third row of Table 1, by using a 3D CNN together with spatiotemporal attention (STA), the emotion recognition method of this embodiment reduced the root mean square error (RMSE) and increased the Pearson correlation coefficient (CC) and the concordance correlation coefficient (CCC) by 0.062 and 0.053, respectively.

FIG. 7 is a diagram visualizing the spatiotemporal weights of this embodiment.

FIG. 7 visualizes, by color, the spatiotemporal weights A_{1:T} obtained when the spatiotemporal feature extraction unit 220 extracts spatiotemporal features from the RECOLA data set. In FIG. 7, red areas indicate regions with high weights and blue areas indicate regions with low weights.

As described above, in the embodiment of the present invention the spatiotemporal feature extraction unit 220 extracts spatial features from a plurality of frames {I_1, I_2, …, I_T} and additionally extracts temporal features while preserving the extracted spatial features, thereby extracting spatiotemporal attention-based features, that is, obtaining the spatiotemporal weights A_{1:T}. As FIG. 7 shows, the spatiotemporal feature extraction unit 220 can therefore, according to what it has learned, assign different weights to each region of each of the frames {I_1, I_2, …, I_T} for emotion recognition, even when no separate region of interest is specified.

FIG. 7 shows that the spatiotemporal feature extraction unit 220 has by itself identified the regions around the eyes and mouth as important for estimating emotion.

FIGS. 8 and 9 show the results of comparing the emotion values obtained by applying the emotion recognition method of this embodiment with the ground-truth values, for the two data sets, the RECOLA data set and the AV + EC data set, respectively.

In FIGS. 8 and 9, the red line is the ground truth and the blue line is the emotion value y. As FIGS. 8 and 9 show, the emotion value y obtained by the emotion recognition method according to the embodiment of the present invention varies similarly to the ground truth.

Tables 2 and 3 compare the emotion recognition results of this embodiment with those of other emotion recognition methods for the RECOLA data set and the AV + EC data set, respectively.

As shown in Tables 2 and 3, the emotion recognition method according to this embodiment yields the lowest root mean square error (RMSE) and the highest Pearson correlation coefficient (CC) and concordance correlation coefficient (CCC) among the compared emotion recognition methods. That is, its emotion recognition performance is superior on all three metrics.

The method according to the present invention may be implemented as a computer program stored in a medium for execution on a computer. Here, the computer-readable medium may be any available medium that can be accessed by a computer, and may include all computer storage media. Computer storage media include volatile and nonvolatile, removable and non-removable media implemented by any method or technology for storing information such as computer-readable instructions, data structures, program modules, or other data, and may include ROM (read-only memory), RAM (random access memory), CD (compact disc)-ROM, DVD (digital video disc)-ROM, magnetic tape, floppy disks, optical data storage devices, and the like.

Although the present invention has been described with reference to the embodiments shown in the drawings, these are merely exemplary, and those of ordinary skill in the art will understand that various modifications and other equivalent embodiments are possible.

Therefore, the true technical scope of protection of the present invention should be determined by the technical spirit of the appended claims.

100: image acquisition unit 200: emotion extraction unit
300: emotion determination unit 210: 3D feature extraction unit
220: spatiotemporal feature extraction unit 230: feature combining unit
240: emotion value acquisition unit 221: spatial encoder
223: temporal decoder 225: normalization unit

Claims

An emotion recognition apparatus comprising:
a 3D feature extraction unit that is trained in advance according to a predetermined 3D pattern recognition technique and extracts 3D features by performing pattern recognition on T consecutive frames of an image sequence (where T is a natural number) as a single 3D image;
a spatiotemporal feature extraction unit that is trained in advance according to a predetermined 2D pattern recognition technique, extracts T spatial features through pattern recognition on each of the T frames, and obtains spatiotemporal weights by adding spatiotemporal features between the T extracted spatial features;
an emotion value extraction unit that obtains an emotion feature by weighting the 3D features with the spatiotemporal weights and, being trained in advance according to a predetermined 3D pattern recognition technique, extracts from the emotion feature an emotion value within a predetermined range; and
an emotion determination unit that stores in advance the emotion corresponding to each emotion value and determines the emotion corresponding to the emotion value obtained by the emotion value extraction unit.

The emotion recognition apparatus of claim 1, wherein the 3D feature extraction unit
includes a pre-trained 3D convolutional neural network (3D CNN) and extracts the 3D features.

The emotion recognition apparatus of claim 1, wherein the spatiotemporal feature extraction unit comprises:
a spatial encoder that includes a pre-trained 2D convolutional neural network (2D CNN) and extracts the T spatial features for each of the T frames;
a temporal decoder that includes a pre-trained convolutional long short-term memory (ConvLSTM) and extracts the spatiotemporal features between the T spatial features; and
a normalization unit that normalizes the spatiotemporal features extracted by the temporal decoder in a predetermined manner to obtain the spatiotemporal weights.

The emotion recognition apparatus of claim 3, wherein, in the spatial encoder,
the 2D CNN includes a convolutional layer comprising a plurality of filters, a rectified linear unit (ReLU) layer, and a max-pooling layer, and reduces the spatial resolution of the spatial features below the spatial resolution of the frames.

The emotion recognition apparatus of claim 4, wherein, in the temporal decoder,
the ConvLSTM includes a plurality of ConvLSTM layers and performs sequential deconvolution to restore the reduced spatial resolution of the spatial features.

The emotion recognition apparatus of claim 5, wherein the normalization unit
normalizes the spatiotemporal features using a softmax function.

The emotion recognition apparatus of claim 6, wherein the emotion value extraction unit comprises:
a feature combining unit that multiplies the 3D features by the spatiotemporal weights to obtain the emotion feature; and
an emotion value acquisition unit that includes a pre-trained 3D CNN and extracts, from the emotion feature, an emotion value representing an emotion.

The emotion recognition apparatus of claim 7, wherein the emotion value acquisition unit
obtains the emotion value as a scalar value within a predetermined range.

The emotion recognition apparatus of claim 7, wherein the emotion value
represents the valence value in an emotion model that expresses emotion in two dimensions along the two axes of arousal and valence.

The emotion recognition apparatus of claim 1, further comprising
an image acquisition unit that separates T consecutive frames at a time from an image sequence including a plurality of frames and outputs them sequentially.

An emotion recognition method performed in a spatiotemporal attention-based emotion recognition apparatus, the method comprising:
extracting, using a model trained in advance according to a predetermined 3D pattern recognition technique, 3D features by performing pattern recognition on T consecutive frames of an image sequence (where T is a natural number) as a single 3D image;
extracting, using a model trained in advance according to a predetermined 2D pattern recognition technique, T spatial features through pattern recognition on each of the T frames, and obtaining spatiotemporal weights by adding spatiotemporal features between the T extracted spatial features;
obtaining an emotion feature by weighting the 3D features with the spatiotemporal weights;
extracting, using a model trained in advance according to a predetermined 3D pattern recognition technique, an emotion value within a predetermined range from the emotion feature; and
determining an emotion corresponding to the emotion value.

The emotion recognition method of claim 11, wherein the step of extracting the 3D features
extracts the 3D features using a pre-trained 3D convolutional neural network (3D CNN).

The emotion recognition method of claim 11, wherein the step of obtaining the spatiotemporal weights comprises:
extracting the T spatial features for each of the T frames using a pre-trained 2D convolutional neural network (2D CNN);
extracting the spatiotemporal features between the T spatial features using a pre-trained convolutional long short-term memory (ConvLSTM); and
normalizing the extracted spatiotemporal features in a predetermined manner to obtain the spatiotemporal weights.

The emotion recognition method of claim 11, wherein the step of extracting the emotion value
extracts, from the emotion feature, an emotion value representing an emotion using a pre-trained 3D CNN.