KR102290186B1

KR102290186B1 - Method of processing video for determining emotion of a person

Info

Publication number: KR102290186B1
Application number: KR1020200081613A
Authority: KR
Inventors: 유대훈; 이영복
Original assignee: 주식회사 제네시스랩
Priority date: 2018-01-02
Filing date: 2020-07-02
Publication date: 2021-08-17
Anticipated expiration: 2038-01-02
Also published as: KR20200085696A

Abstract

The multimodal emotion recognition method using artificial intelligence of the present disclosure is an emotion recognition method that processes an image to determine a person's emotional state. The method comprises: providing an image and a voice representing the person's appearance, the image including a first image unit, a second image unit immediately following the first image unit, and a third image unit immediately following the second image unit; processing the first image unit to determine the person's emotional state in the first image unit, wherein the person's face and at least one hand are shown in the first image unit and the at least one hand does not overlap any part of the person's face; and processing the second image unit to determine the person's emotional state in the second image unit, wherein the person's face and at least one hand are shown in the second image unit and the at least one hand overlaps the person's face. Processing the first image unit comprises: processing at least one frame of the first image unit to determine whether the at least one hand covers the person's face; locating a first facial element of the person in the at least one frame of the first image unit; with the first facial element located, obtaining first facial feature data of the first image unit based on the shape of the first facial element shown in the at least one frame of the first image unit; processing audio data of the first image unit to obtain voice feature data based on characteristics of the person's voice in the first image unit; and determining the person's emotional state for the first image unit based on a plurality of data including the first facial feature data and the voice feature data of the first image unit. Processing the second image unit comprises: processing at least one frame of the second image unit to determine whether the person's face is covered by the at least one hand, in which it is determined whether the at least one hand covers the person's face in the second image unit; locating the first facial element of the person in the at least one frame of the second image unit; with the first facial element located, obtaining first facial feature data of the second image unit based on the shape of the first facial element shown in the at least one frame of the second image unit; processing audio data of the second image unit to obtain voice feature data based on characteristics of the person's voice in the second image unit; and determining the person's emotional state for the second image unit based on a plurality of data including the first facial feature data of the second image unit, the voice feature data of the second image unit, and additional data indicating the position at which the at least one hand covers part of the person's face.

Description

Emotion recognition method that processes images to determine a person's emotional state

Embodiments of the present invention relate to an emotion recognition method that processes an image to determine a person's emotional state.

In the prior art, occlusion is recognized and treated as an error. Covering the mouth with a hand, however, is meaningful information from which the intensity of an emotional state can be determined. With a static image alone, occlusion can leave too little information for recognition.

In addition, when emotion is recognized from facial expressions, a subject who is speaking causes incorrect recognition results. Mouth shape is very important information for expression-based emotion recognition, but because the mouth shape changes constantly during speech, shapes resembling surprise, anger, or laughter can appear and lead to misrecognition.

As such, the prior art offers almost no alternative for solving this problem when emotion is recognized from facial expressions alone, and multimodal approaches try to minimize errors by combining facial expression and voice information to reduce such noise. This patent tracks the face or mouth shape to determine whether the person is currently speaking and, if so, minimizes the mouth shape information and increases the weight of the voice feature information so that an accurate emotion recognition result can be derived.

Embodiments of the present invention aim to provide a multimodal emotion recognition apparatus, method, and storage medium that perform more accurate emotion recognition by using temporal information together with hand movement and identification information, mouth shape information, voice information, and partial facial expression information.

An emotion recognition method that processes an image to determine a person's emotional state, according to one aspect of an embodiment of the present invention, comprises: providing an image and a voice representing the person's appearance, the image including a first image unit, a second image unit immediately following the first image unit, and a third image unit immediately following the second image unit; processing the first image unit to determine the person's emotional state in the first image unit, wherein the person's face and at least one hand are shown in the first image unit and the at least one hand does not overlap any part of the person's face; and processing the second image unit to determine the person's emotional state in the second image unit, wherein the person's face and at least one hand are shown in the second image unit and the at least one hand overlaps the person's face. Processing the first image unit comprises: processing at least one frame of the first image unit to determine whether the at least one hand covers the person's face; locating a first facial element of the person in the at least one frame of the first image unit; with the first facial element located, obtaining first facial feature data of the first image unit based on the shape of the first facial element shown in the at least one frame of the first image unit; processing audio data of the first image unit to obtain voice feature data based on characteristics of the person's voice in the first image unit; and determining the person's emotional state for the first image unit based on a plurality of data including the first facial feature data and the voice feature data of the first image unit. Processing the second image unit comprises: processing at least one frame of the second image unit to determine whether the person's face is covered by the at least one hand, in which it is determined whether the at least one hand covers the person's face in the second image unit; locating the first facial element of the person in the at least one frame of the second image unit; with the first facial element located, obtaining first facial feature data of the second image unit based on the shape of the first facial element shown in the at least one frame of the second image unit; processing audio data of the second image unit to obtain voice feature data based on characteristics of the person's voice in the second image unit; and determining the person's emotional state for the second image unit based on a plurality of data including the first facial feature data of the second image unit, the voice feature data of the second image unit, and additional data indicating the position at which the at least one hand covers part of the person's face.

In addition, in determining the person's emotional state for the second image unit based on the plurality of data, when part of the person's face is covered by the at least one hand in the second image unit, the voice feature data of the second image unit may be weighted more heavily than the first facial feature data of the second image unit.
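
The weighting described here can be sketched as a simple score blend. This is only an illustration under stated assumptions, not the patented implementation: the function name `fuse_scores`, the per-emotion score dictionaries, and both weight values are hypothetical.

```python
def fuse_scores(face_scores, voice_scores, face_occluded,
                occluded_voice_weight=0.7, default_voice_weight=0.5):
    """Blend per-emotion scores from the face and voice modalities.

    When a hand covers part of the face, the voice scores receive the
    larger weight. Both weight values are illustrative assumptions.
    """
    w_voice = occluded_voice_weight if face_occluded else default_voice_weight
    w_face = 1.0 - w_voice
    return {
        emotion: w_face * face_scores[emotion] + w_voice * voice_scores[emotion]
        for emotion in face_scores
    }
```

With a covered face, a fused result follows the voice modality more closely than the face modality; with an uncovered face, the two modalities contribute equally in this sketch.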

In addition, in determining the person's emotional state for the second image unit based on the plurality of data, when no part of the person's face is covered by the at least one hand in the first image unit but part of the person's face is covered by the at least one hand in the second image unit, the voice feature data of the second image unit may be weighted more heavily than the voice feature data of the first image unit.

In addition, processing the first image unit may further comprise: locating a second facial element of the person in the at least one frame of the first image unit; and, with the second facial element located, obtaining second facial feature data of the first image unit based on the shape of the second facial element shown in the at least one frame of the first image unit. In particular, the person's emotional state for the first image unit is determined based on a plurality of data including the first facial feature data of the first image unit, the second facial feature data of the first image unit, and the voice feature data of the first image unit. Processing the second image unit may further comprise: locating the second facial element of the person in the at least one frame of the second image unit and determining whether the second facial element is covered by the at least one hand; and obtaining second facial feature data of the second image unit based on the second facial feature data of the first image unit and a preset weight for occlusion of the second facial element. In particular, the person's emotional state for the second image unit may be determined based on the first facial feature data of the second image unit, the second facial feature data of the second image unit, the voice feature data of the second image unit, and additional data indicating the position of the at least one hand covering part of the person's face.

In addition, the method may further comprise processing the third image unit to determine the person's emotional state for the third image unit, wherein the person's face and the at least one hand are shown in the third image unit with no part of the person's face covered by the at least one hand. Processing the third image unit may comprise: processing at least one frame of the third image unit to determine whether the at least one hand covers the person's face; locating the first facial element of the person in the at least one frame of the third image unit; with the first facial element located, obtaining first facial feature data of the third image unit based on the shape of the first facial element shown in the at least one frame of the third image unit; processing audio data of the third image unit to obtain voice feature data of the third image unit based on characteristics of the person's voice in the third image unit; and determining the person's emotional state for the third image unit based on a plurality of data including the first facial feature data and the voice feature data of the third image unit.

According to another aspect of an embodiment of the present invention, a computer-readable storage medium stores preset instructions that, when executed by a computer, perform an emotion recognition method that processes an image to determine a person's emotional state. The method comprises: providing an image and a voice representing the person's appearance, the image including a first image unit, a second image unit immediately following the first image unit, and a third image unit immediately following the second image unit; processing the first image unit to determine the person's emotional state in the first image unit, wherein the person's face and at least one hand are shown in the first image unit and the at least one hand does not overlap any part of the person's face; and processing the second image unit to determine the person's emotional state in the second image unit, wherein the person's face and at least one hand are shown in the second image unit and the at least one hand overlaps the person's face. Processing the first image unit comprises: processing at least one frame of the first image unit to determine whether the at least one hand covers the person's face; locating a first facial element of the person in the at least one frame of the first image unit; with the first facial element located, obtaining first facial feature data of the first image unit based on the shape of the first facial element shown in the at least one frame of the first image unit; processing audio data of the first image unit to obtain voice feature data based on characteristics of the person's voice in the first image unit; and determining the person's emotional state for the first image unit based on a plurality of data including the first facial feature data and the voice feature data of the first image unit. Processing the second image unit comprises: processing at least one frame of the second image unit to determine whether the person's face is covered by the at least one hand, in which it is determined whether the at least one hand covers the person's face in the second image unit; locating the first facial element of the person in the at least one frame of the second image unit; with the first facial element located, obtaining first facial feature data of the second image unit based on the shape of the first facial element shown in the at least one frame of the second image unit; processing audio data of the second image unit to obtain voice feature data based on characteristics of the person's voice in the second image unit; and determining the person's emotional state for the second image unit based on a plurality of data including the first facial feature data of the second image unit, the voice feature data of the second image unit, and additional data indicating the position at which the at least one hand covers part of the person's face.

According to the embodiments of the present invention described above, the emotion recognition method that processes an image to determine a person's emotional state can accurately identify the emotional state both when the person is talking and when the facial expression is covered by an object such as a hand.

FIG. 1 is a diagram schematically illustrating the configuration of a multimodal emotion recognition apparatus according to an embodiment of the present invention.
FIG. 2 is a diagram schematically illustrating the configuration of the data preprocessor of the multimodal emotion recognition apparatus of FIG. 1.
FIG. 3 is a diagram schematically illustrating the configuration of the preliminary inference unit of the multimodal emotion recognition apparatus of FIG. 1.
FIG. 4 is a diagram schematically illustrating the configuration of the main inference unit of the multimodal emotion recognition apparatus of FIG. 1.
FIG. 5 is a flowchart illustrating a multimodal emotion recognition method performed by the multimodal emotion recognition apparatus of FIG. 1.
FIG. 6 is a flowchart illustrating in detail the data preprocessing step of the multimodal emotion recognition method of FIG. 5.
FIG. 7 is a flowchart illustrating in detail the preliminary inference step of the multimodal emotion recognition method of FIG. 5.
FIG. 8 is a flowchart illustrating in detail the main inference step of the multimodal emotion recognition method of FIG. 5.
FIG. 9 is an exemplary diagram illustrating the face recognition process, depending on whether the situation has changed, in the multimodal emotion recognition apparatus of FIG. 1.

Hereinafter, embodiments of the present invention are described in detail with reference to the accompanying drawings so that those of ordinary skill in the art to which the present invention pertains can readily practice them.

The present invention may be embodied in many different forms and is not limited to the embodiments described herein. To describe the present invention clearly, parts unrelated to the description are omitted from the drawings, and the same or similar components are given the same reference numerals throughout the specification. In addition, the size and thickness of each component shown in the drawings are chosen arbitrarily for convenience of description, so the present invention is not necessarily limited to what is illustrated.

In the present invention, "on" means positioned above or below a target member and does not necessarily mean positioned on the upper side with respect to the direction of gravity. In addition, throughout the specification, when a part is said to "include" a component, this means that it may further include other components rather than excluding them, unless specifically stated otherwise.

Hereinafter, embodiments of the present invention are described in detail with reference to the accompanying drawings. In the description, identical or corresponding components are given the same reference numerals, and redundant descriptions of them are omitted.

The present invention derives more accurate emotion recognition results by applying artificial intelligence that considers facial expression, speaking state, hands, and voice to the subject's video and voice data.

FIG. 1 is a diagram schematically illustrating the configuration of a multimodal emotion recognition apparatus according to an embodiment of the present invention.

Referring to FIG. 1, the multimodal emotion recognition apparatus 10 may include a data input unit 100, a data preprocessor 200, a preliminary inference unit 300, a main inference unit 400, and an output unit 500.

The data input unit 100 may receive the user's image data DV and voice data DS.

The data input unit 100 may include an image input unit 110 that receives the image data DV used to recognize the user's emotion and a voice input unit 120 that receives the user's voice data DS.

In addition, the data preprocessor 200 may include a voice preprocessor 220 that generates voice feature data DF₂ from the voice data DS and an image preprocessor 210 that generates one or more facial feature data DF₁ from the image data DV.

Here, the facial feature data DF₁ may include at least one of an image, position information, size information, face ratio information, and depth information, and the voice feature data DF₂ may include information that can represent characteristics of the voice, such as intonation, pitch information, vocal intensity, and speech rate.

The image preprocessor 210 performs image preprocessing to extract the user's facial feature data DF₁ from the image data DV.

The image preprocessing may convert the image data DV into a form suitable for use with a learning model, through operations such as full or partial face recognition, noise removal, and extraction of the user's facial features and images.

The voice preprocessor 220 performs voice preprocessing to extract the user's voice feature data DF₂ from the voice data DS.

The voice preprocessing may convert the voice data DS into a form suitable for use with a learning model, through operations such as external sound removal, noise removal, and extraction of the user's voice features.

The preliminary inference unit 300 may generate, based on the image data DV, situation determination data P regarding whether the user's situation changes over time.

Here, the situation determination data P may include dialogue determination data P₁ indicating whether the user is in a conversation state, or overlap determination data P₂ indicating whether a tracking target area B, which is part of the entire image area of the image data DV, overlaps a recognition target area A, which is a different area.

In detail, the preliminary inference unit 300 may generate position inference data DM₁ for inferring the position of the tracking target area B based on the image data DV, and may generate the overlap determination data P₂, indicating whether the tracking target area B and the recognition target area A overlap, based on the facial feature data DF₁ and the position inference data DM₁.
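
One way to picture the overlap determination is a rectangle intersection test. The sketch below assumes that the recognition target area A and the tracking target area B are both axis-aligned boxes given as `(left, top, right, bottom)`; the box representation and the function name `intersection` are assumptions for illustration, since the text does not specify how the areas are encoded.

```python
def intersection(area_a, area_b):
    """Return the overlapping box of two (left, top, right, bottom) areas.

    Returns None when the areas do not overlap with positive area.
    The returned box could serve as a rough indication of where the
    tracking target area covers the recognition target area.
    """
    left = max(area_a[0], area_b[0])
    top = max(area_a[1], area_b[1])
    right = min(area_a[2], area_b[2])
    bottom = min(area_a[3], area_b[3])
    if left < right and top < bottom:
        return (left, top, right, bottom)
    return None
```

A non-None result would correspond to a positive overlap determination in this simplified model.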

In addition, the preliminary inference unit 300 may generate, based on the facial feature data DF₁, the dialogue determination data P₁ for determining whether the user is in a conversation state.
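
As a rough illustration of a conversation-state decision, the sketch below treats a mouth whose opening fluctuates strongly across frames as speaking. This heuristic is an assumption chosen for clarity, not the method of the patent: the normalized mouth-opening input, the function name `is_speaking`, and the threshold value are all hypothetical.

```python
from statistics import pstdev


def is_speaking(mouth_openings, threshold=0.05):
    """Guess the conversation state from per-frame mouth openings.

    mouth_openings: normalized mouth-opening values, one per frame
    (a hypothetical input derived from mouth-area facial features).
    The threshold is an illustrative assumption.
    """
    if len(mouth_openings) < 2:
        return False
    # Large frame-to-frame variation suggests the mouth is moving to speak.
    return pstdev(mouth_openings) > threshold
```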

The main inference unit 400 may generate at least one sub feature map FM based on the voice feature data DF₂ or the facial feature data DF₁, and may infer the user's emotional state based on the sub feature map FM and the situation determination data P.

The emotional state may include information on the user's emotional state, such as happiness, anger, fear, disgust, sadness, and surprise.

The output unit 500 may output the emotional state inferred by the main inference unit 400.
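
The data flow between the units of FIG. 1 can be sketched as a chain of callables. This is a structural sketch only: the class name `EmotionPipeline`, its field names, and the call signatures are assumptions, and each stage stands in for a component whose internals the text describes separately.

```python
from dataclasses import dataclass
from typing import Any, Callable


@dataclass
class EmotionPipeline:
    """Hypothetical wiring of units 200, 300, 400, and 500."""
    preprocess: Callable[..., Any]             # data preprocessor 200
    preliminary_inference: Callable[..., Any]  # preliminary inference unit 300
    main_inference: Callable[..., Any]         # main inference unit 400
    output: Callable[..., Any]                 # output unit 500

    def run(self, image_data, voice_data):
        # DV, DS -> facial feature data DF1 and voice feature data DF2
        face_features, voice_features = self.preprocess(image_data, voice_data)
        # DV, DF1 -> situation determination data P
        situation = self.preliminary_inference(image_data, face_features)
        # DF1, DF2, P -> inferred emotional state
        state = self.main_inference(face_features, voice_features, situation)
        return self.output(state)
```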

Here, the output unit 500 may produce output in various forms using an activation function such as a sigmoid function, a step function, a softmax function, or a rectified linear unit (ReLU).
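
For reference, the four activation functions named here have the following standard definitions. The sketch uses plain Python on scalars and lists; how the output unit actually applies them is not specified in the text, and the step threshold of 0 is an assumption.

```python
import math


def sigmoid(x):
    """Map a real value to the interval (0, 1)."""
    return 1.0 / (1.0 + math.exp(-x))


def step(x, threshold=0.0):
    """Return 1.0 at or above the threshold, otherwise 0.0."""
    return 1.0 if x >= threshold else 0.0


def relu(x):
    """Rectified linear unit: pass positive values, clamp the rest to 0."""
    return max(0.0, x)


def softmax(values):
    """Convert a list of scores into probabilities that sum to 1."""
    peak = max(values)  # subtract the maximum for numerical stability
    exps = [math.exp(v - peak) for v in values]
    total = sum(exps)
    return [e / total for e in exps]
```

Softmax suits a single-label output over emotion classes, whereas sigmoid applied per class allows several emotions to score high at once.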

FIG. 2 is a diagram schematically illustrating the configuration of the data preprocessor of the multimodal emotion recognition apparatus of FIG. 1.

Referring to FIG. 2, the data preprocessor 200 may include the image preprocessor 210 and the voice preprocessor 220.

The image preprocessor 210 may include a face detector 211, an image preprocessing module 212, a landmark detection module 213, a position adjustment module 214, and a facial element extraction module 215.

The face detector 211 may detect, within the entire area of the image data DV, the recognition target area A, which is the area corresponding to the user's face.

The image preprocessing module 212 may correct the recognition target area A.

In detail, the image preprocessing module 212 may correct the brightness and blur of the image and remove noise from the image data DV.

랜드마크 검출모듈(213)은 인식 대상 영역(A)의 얼굴 요소 위치 정보(AL)를 추출할 수 있다.The landmark detection module 213 may extract the facial element location information AL of the recognition target area A.

상세하게는, 인식 대상 영역(A) 중 얼굴, 눈, 입, 코, 이마 등 얼굴 중요 요소의 위치 정보를 파악하여 얼굴 인식이 가능하게 수행할 수 있다.In detail, positional information of important face elements such as a face, eyes, mouth, nose, and forehead in the recognition target area A may be grasped to enable face recognition.

위치 조정모듈(214)은 인식 대상 영역(A)의 얼굴 요소 위치 정보(AL)에 기반하여 위치를 조정할 수 있다.The position adjustment module 214 may adjust the position based on the facial element position information AL of the recognition target area A.

상세하게는, 위치 조정모듈(214)은 랜드마크 검출모듈(213)로부터 추출된 얼굴 요소 위치 정보(AL)를 기준으로 수평 또는 수직에 맞춰 이미지를 정렬할 수 있다.In detail, the position adjustment module 214 may align the image horizontally or vertically based on the facial element position information AL extracted from the landmark detection module 213 .
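One common way to perform such an alignment is to rotate the landmarks so that the line between the eyes becomes horizontal. The sketch below shows this on landmark coordinates only. The landmark names `left_eye` and `right_eye` and the choice of the eye midpoint as rotation center are assumptions for illustration; the disclosure does not specify the alignment method.

```python
import math

def roll_angle(left_eye, right_eye):
    """Angle (radians) of the eye line relative to the horizontal axis."""
    dx = right_eye[0] - left_eye[0]
    dy = right_eye[1] - left_eye[1]
    return math.atan2(dy, dx)

def rotate_point(point, center, angle):
    """Rotate a point around a center by the given angle (radians)."""
    c, s = math.cos(angle), math.sin(angle)
    x, y = point[0] - center[0], point[1] - center[1]
    return (center[0] + c * x - s * y, center[1] + s * x + c * y)

def align_landmarks(landmarks):
    """Rotate all landmarks so that the two eyes share the same y coordinate."""
    left, right = landmarks["left_eye"], landmarks["right_eye"]
    angle = roll_angle(left, right)
    center = ((left[0] + right[0]) / 2.0, (left[1] + right[1]) / 2.0)
    # Rotating by the negative angle cancels the measured roll.
    return {name: rotate_point(p, center, -angle) for name, p in landmarks.items()}
```

In a full pipeline the same rotation would be applied to the image pixels; only the coordinate transform is shown here.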

The facial element extraction module 215 may set a sub recognition target area AA that is located within the recognition target area A and is smaller than the recognition target area A, and may generate facial feature data DF₁ for the sub recognition target area AA.

The sub recognition target area AA may be a single area or a plurality of areas in which at least one facial element, such as the face, eyes, mouth, nose, or forehead, has been identified.

For example, when the eyes, nose, and mouth are extracted from the recognition target area A together with their facial element location information AL, the facial element extraction module 215 may set an eye recognition area A₁, a nose recognition area A₂, and a mouth recognition area A₃ as sub recognition target areas AA, and may generate at least one item of facial feature data DF₁ for the sub recognition target areas AA thus set.

In addition, when no sub recognition target area AA is set, the facial element extraction module 215 may generate the facial feature data DF₁ based on the recognition target area A.
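The area-setting logic, including the fallback to the whole recognition target area A, can be sketched as follows. Boxes are `(x0, y0, x1, y1)` tuples, and the fixed `half_size` window around each landmark is a hypothetical choice for this example; the disclosure does not state how the size of each sub area is determined.

```python
def sub_recognition_areas(area_a, landmarks, half_size=20):
    """Build one box per landmark, clipped to area A.

    area_a:    (x0, y0, x1, y1) box of the detected face (area A).
    landmarks: mapping of element name to (x, y) location (information AL).
    Returns a mapping of element name to box. If no landmark lies inside
    area A, the whole area A is returned under the key "face".
    """
    ax0, ay0, ax1, ay1 = area_a
    areas = {}
    for name, (x, y) in landmarks.items():
        if not (ax0 <= x <= ax1 and ay0 <= y <= ay1):
            continue  # ignore landmarks that fall outside the face area
        areas[name] = (
            max(ax0, x - half_size),
            max(ay0, y - half_size),
            min(ax1, x + half_size),
            min(ay1, y + half_size),
        )
    if not areas:
        return {"face": area_a}  # fallback: use area A itself
    return areas
```

Clipping keeps every sub area inside area A, which matches the requirement that a sub recognition target area lie within the recognition target area.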

The voice preprocessor 220 may include a voice correction module 221 and a voice feature data extraction module 222.

The voice correction module 221 may correct the voice data DS.

In detail, the voice correction module 221 may generate corrected voice data by applying various correction methods to the voice data DS, such as removing noise and external sounds contained in it, adjusting the volume, and correcting the frequency.

The voice feature data extraction module 222 may extract features from the voice data DS that has passed through the voice correction module 221 to generate voice feature data DF₂.

In detail, the voice feature data extraction module 222 may generate the user's voice feature data DF₂ through one or more voice data, frequency, and spectrum analysis modules such as MFCC (Mel-frequency Cepstral Coefficients), eGeMAPS (extended Geneva Minimalistic Acoustic Parameter Set), and Logbank.

Here, the voice feature data extraction module 222 may use either the corrected voice data or the uncorrected voice data DS.
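The sketch below illustrates the optional correction step followed by feature extraction. It is only a stand-in: peak normalization represents the volume adjustment, and per-frame log energy replaces the MFCC, eGeMAPS, or Logbank features named above, which would normally come from a dedicated audio library. The function names, frame length, and hop size are assumptions.

```python
import math

def correct_volume(samples, target_peak=0.9):
    """Scale the signal so that its largest absolute sample equals target_peak."""
    peak = max((abs(s) for s in samples), default=0.0)
    if peak == 0.0:
        return list(samples)  # silent input: nothing to scale
    gain = target_peak / peak
    return [s * gain for s in samples]

def frame_log_energy(samples, frame_len=400, hop=160):
    """Log of the mean squared amplitude for each overlapping frame."""
    feats = []
    for start in range(0, len(samples) - frame_len + 1, hop):
        frame = samples[start:start + frame_len]
        energy = sum(s * s for s in frame) / frame_len
        feats.append(math.log(energy + 1e-10))  # small offset avoids log(0)
    return feats

def voice_features(samples, use_correction=True):
    """Extract features from corrected data, or from the raw data if skipped."""
    data = correct_volume(samples) if use_correction else samples
    return frame_log_energy(data)
```

With 16 kHz audio, the assumed defaults correspond to 25 ms frames taken every 10 ms.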

FIG. 3 is a diagram schematically illustrating the configuration of the preliminary inference unit in the multi-modal emotion recognition apparatus of FIG. 1.

Referring to FIG. 3, the preliminary inference unit 300 may include a hand detection inference module 310, a conversation state inference module 320, and a face overlap check module 330.

The conversation state inference module 320 may use a first learning model LM₁ to generate conversation determination data P₁ based on the facial feature data DF₁.

In detail, the conversation state inference module 320 may generate the conversation determination data P₁, which indicates whether the user is judged to be in a conversation, by applying the first learning model LM₁, which can determine whether the user is in a conversation state, to all or part of the user's facial feature data DF₁.

The facial feature data DF₁ includes mouth image data DV₂, which is the image data DV for the portion of the recognition target area A corresponding to the user's mouth, and the conversation determination data P₁ on whether the user is in a conversation state may be generated from the mouth image data DV₂ using the first learning model LM₁.

The first learning model LM₁ may be at least one of an artificial intelligence model, a machine learning method, or a deep learning method capable of inferring temporal or spatial features, such as an LSTM (Long Short-Term Memory), RNN (Recurrent Neural Network), DNN (Deep Neural Network), or CNN (Convolutional Neural Network).

The hand detection inference module 310 may detect hand image data DV₁ for a tracking target area B in the image data DV, and may use a second learning model LM₂ to generate location inference data DM₁ based on the hand image data DV₁.

Here, the second learning model LM₂ is at least one of an artificial intelligence model, a machine learning method, or a deep learning method capable of inferring temporal or spatial features, such as an LSTM, RNN, DNN, or CNN, and is used to generate the location inference data DM₁ for the hand.

In addition, the hand detection inference module 310 may generate a location inference feature map FM₁ for the location inference data DM₁, and the user's emotional state may be inferred based on the sub feature map FM, the situation determination data P, and the location inference feature map FM₁.

Here, the location inference feature map FM₁ may include meaningful information about hand movement, that is, feature information about the hand such as its gesture and its location.

The face overlap check module 330 may determine whether the recognition target area A and the tracking target area B overlap based on the facial feature data DF₁ and the location inference data DM₁, and may generate overlap determination data P₂ according to the result.

In detail, the overlap determination data P₂ may be generated by determining whether the recognition target area A and the tracking target area B overlap, and may comprise one or more parameters that determine the importance of, and whether to use, the facial feature data DF₁ corresponding to the recognition target area A and the voice feature data DF₂.
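A simple geometric version of the overlap check can be sketched with axis-aligned boxes. Each facial sub area receives a weight between 0 and 1 depending on how much of it the hand box covers. The weighting rule and the `max_covered` threshold are assumptions for illustration; the disclosure states only that the parameters express importance and whether the data is used.

```python
def overlap_area(a, b):
    """Area of the intersection of two (x0, y0, x1, y1) boxes."""
    w = min(a[2], b[2]) - max(a[0], b[0])
    h = min(a[3], b[3]) - max(a[1], b[1])
    return max(0, w) * max(0, h)

def overlap_determination(sub_areas, hand_box, max_covered=0.5):
    """Return a per-area weight: 0.0 means 'do not use', 1.0 means 'fully visible'.

    sub_areas: mapping of element name to box inside area A.
    hand_box:  box of the tracked hand (area B).
    """
    params = {}
    for name, box in sub_areas.items():
        area = (box[2] - box[0]) * (box[3] - box[1])
        covered = overlap_area(box, hand_box) / area if area > 0 else 1.0
        # Heavily covered areas are switched off; others are down-weighted.
        params[name] = 0.0 if covered > max_covered else 1.0 - covered
    return params
```

For instance, a hand box lying over the mouth area yields a weight of 0.0 for the mouth while leaving the eye area at 1.0.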

FIG. 4 is a diagram schematically illustrating the configuration of the main inference unit in the multi-modal emotion recognition apparatus of FIG. 1.

Referring to FIG. 4, the main inference unit 400 may include a plurality of sub feature map generators 410 (411, 412, 413, 414), a multi-modal feature map generator 420, and an emotion recognition inference unit 430.

The plurality of sub feature map generators 410 (411, 412, 413, 414) may use a third learning model LM₃ to generate a plurality of sub feature maps FM for the voice feature data DF₂ and the facial feature data DF₁.

In detail, the third learning model LM₃ may be at least one of an artificial intelligence model, a machine learning method, or a deep learning method capable of inferring one or more spatial features, such as a DNN (Deep Neural Network) or CNN (Convolutional Neural Network), and may be used to generate a plurality of sub feature maps FM that condense the features of the voice feature data DF₂ and the facial feature data DF₁.

The multi-modal feature map generator 420 may generate a multi-modal feature map M from the plurality of sub feature maps FM with reference to the situation determination data P.

The situation determination data P has a situation determination value PV preset according to the user's situation, and the multi-modal feature map generator 420 may generate the multi-modal feature map M by applying the situation determination value PV to at least one of the plurality of sub feature maps FM.

In detail, the situation determination value PV may be a parameter indicating the importance of each sub feature map FM and whether it is used.

An operation between the situation determination data P and a sub feature map FM yields a sub feature map FM to which the situation determination value PV has been applied, and the plurality of sub feature maps FM may then be integrated to generate the multi-modal feature map M.

For example, when the user's eyes are covered, the situation determination value for the eyes is output as 0, so that multiplying it by the sub feature map FM for the eyes yields 0, and the main inference unit 400 may generate the multi-modal feature map M from the sub feature maps other than the one for the eyes.
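The gating described in this example can be sketched as an element-wise multiplication followed by concatenation. Sub feature maps are represented as flat lists of floats, and the names `eye`, `mouth`, and `voice` are illustrative assumptions; real sub feature maps would be tensors produced by the third learning model.

```python
def apply_situation_values(sub_feature_maps, situation_values):
    """Multiply each sub feature map by its situation determination value.

    A value of 0 removes the map's contribution, and values between
    0 and 1 reduce its importance. Missing names default to 1.0.
    """
    weighted = {}
    for name, feature_map in sub_feature_maps.items():
        pv = situation_values.get(name, 1.0)
        weighted[name] = [pv * v for v in feature_map]
    return weighted

def multimodal_feature_map(sub_feature_maps, situation_values, order):
    """Concatenate the weighted sub feature maps in a fixed order."""
    weighted = apply_situation_values(sub_feature_maps, situation_values)
    merged = []
    for name in order:
        merged.extend(weighted[name])
    return merged
```

Keeping a zeroed map in the concatenation, rather than dropping it, preserves a fixed input size for the downstream model; this is a design choice of the sketch, not a statement of the disclosed method.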

In addition, the location inference feature map FM₁ may be generated by the hand detection inference module 310, and a multi-modal feature map M for inferring the user's emotional state may be generated based on the sub feature maps FM, the situation determination data P, and the location inference feature map FM₁.

The multi-modal feature map M may be generated by merging at least one of the sub feature maps FM and the location inference feature map FM₁ using Concat, Merge, a deep network, or the like.

The emotion recognition inference unit 430 may use a fourth learning model LM₄ to infer the emotional state based on the multi-modal feature map M.

Here, the fourth learning model LM₄ may be a temporal learning model such as a recurrent neural network, for example an LSTM (Long Short-Term Memory), RNN (Recurrent Neural Network), or GRU (Gated Recurrent Unit), and may be at least one of an artificial intelligence model, a machine learning method, or a deep learning method capable of inferring or analyzing temporal and spatial features.

FIG. 5 is a flowchart illustrating a multi-modal emotion recognition method performed by the multi-modal emotion recognition apparatus of FIG. 1.

Referring to FIG. 5, a data input step (S100) of receiving the user's image data DV and voice data DS is performed.

Next, a data preprocessing step (S200) may be performed, which includes a voice preprocessing step of generating voice feature data DF₂ from the voice data DS and an image preprocessing step of generating one or more items of facial feature data DF₁ from the image data DV.

Here, the data preprocessing step (S200) may generate the facial feature data DF₁ and the voice feature data DF₂ for use by a learning model.

The learning model may be an artificial intelligence, machine learning, or deep learning method.

Next, a preliminary inference step (S300) may be performed, in which situation determination data P on whether the user's situation changes over time is generated based on the image data DV.

Here, the temporal order may concern whether the user is in a conversation state, and may be data for identifying characteristics of the movement of a body part.

In addition, the situation determination data P may include a parameter that indicates the importance of, or whether to use, one or more items of facial feature data DF₁ or voice feature data DF₂, obtained by determining from the image data DV whether an overlap exists and whether the user is in a conversation state.

In addition, feature information on parts of the user's body other than the one or more items of facial feature data DF₁ generated in the data preprocessing step (S200) may be extracted and generated.

Next, a main inference step (S400) may be performed, in which at least one sub feature map FM is generated based on the voice feature data DF₂ or the facial feature data DF₁, and the user's emotional state is inferred based on the sub feature map FM and the situation determination data P.

Here, the sub feature map FM, which contains feature information extracted from the user, is combined by an operation with the situation determination data P, which contains parameters for the importance or use of that feature information, so that the sub feature map FM carries the importance or use information, and the user's emotional state may then be inferred.

Next, a result derivation step (S500) of outputting the emotional state inferred in the main inference step (S400) is performed.
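The order of steps S100 to S500 can be summarized as a small orchestration function that passes data between four stages. All stage functions here are hypothetical placeholders supplied by the caller; the demo stages return fixed values and do not perform any real recognition.

```python
def recognize_emotion(video, audio, preprocess, preliminary, main_inference, output):
    """Run the stages in the order S200 -> S300 -> S400 -> S500.

    video and audio are the inputs received in S100 (DV and DS).
    """
    df1, df2 = preprocess(video, audio)   # S200: facial and voice feature data
    p = preliminary(video, df1)           # S300: situation determination data P
    state = main_inference(df1, df2, p)   # S400: inferred emotional state
    return output(state)                  # S500: result derivation

# Placeholder stages used only to show the data flow.
def demo_preprocess(video, audio):
    return {"frames": len(video)}, {"samples": len(audio)}

def demo_preliminary(video, df1):
    return {"face": 1.0, "voice": 1.0}

def demo_main(df1, df2, p):
    return {"happiness": 0.7 * p["face"], "sadness": 0.3 * p["voice"]}

def demo_output(scores):
    return max(scores, key=scores.get)

result = recognize_emotion(
    [0, 1, 2], [0.0, 0.1],
    demo_preprocess, demo_preliminary, demo_main, demo_output,
)
```

Passing the stages as arguments keeps the sketch independent of any particular model or library.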

FIG. 6 is a flowchart illustrating in detail the data preprocessing step of the multi-modal emotion recognition method of FIG. 5.

Referring to FIG. 6, the data preprocessing step (S200) includes an image preprocessing step (S210) and a voice preprocessing step (S220).

In the image preprocessing step (S210), a face detection step is performed to detect, within the entire area of the image data DV, the recognition target area A, which is the area corresponding to the user's face.

Next, an image preprocessing step of correcting the recognition target area A is performed.

In detail, in the image preprocessing step, the brightness and blur of the image may be corrected and noise may be removed from the image data DV.

Next, a landmark detection step of extracting the facial element location information AL of the recognition target area A is performed.

In detail, the locations of key facial elements in the recognition target area A, such as the face, eyes, nose, mouth, and forehead, may be identified so that the face can be recognized.

Next, a position adjustment step of adjusting the position based on the facial element location information AL of the recognition target area A may be performed.

In detail, the image may be aligned horizontally or vertically with reference to the facial element location information AL extracted by the landmark detection module 213.

Next, a facial element extraction step may be performed, in which a sub recognition target area AA located within the recognition target area A and smaller than it is set based on the facial element location information AL, and facial feature data DF₁ for the sub recognition target area AA is generated.

Here, the sub recognition target area AA may be a single area or a plurality of areas in which at least one facial element, such as the whole face, eyes, mouth, nose, or forehead, has been identified.

In addition, in the facial element extraction step, when no sub recognition target area AA is set, the facial feature data DF₁ may be generated based on the recognition target area A.

The voice preprocessing step (S220) includes a voice correction step and a voice feature data extraction step.

First, the voice correction step of correcting the voice data DS is performed.

In detail, in the voice correction step, corrected voice data may be generated by applying various correction methods to the voice data DS, such as removing noise and external sounds contained in it, adjusting the volume, and correcting the frequency.

Then the voice feature data extraction step is performed, in which features are extracted from the voice data DS that has undergone the voice correction step to generate the voice feature data DF₂.

In detail, the user's voice feature data DF₂ may be generated through one or more voice data, frequency, and spectrum analysis modules such as MFCC (Mel-frequency Cepstral Coefficients), eGeMAPS (extended Geneva Minimalistic Acoustic Parameter Set), and Logbank.

Here, the voice feature data extraction step may generate the voice feature data DF₂ using the corrected voice data, or using the voice data DS without the voice correction step having been performed.

These steps are exemplary, and at least some of them may be performed simultaneously with the preceding or following steps, or in a different order.

FIG. 7 is a flowchart illustrating in detail the preliminary inference step of the multi-modal emotion recognition method of FIG. 5.

A conversation state inference step (S310) may be performed, in which the first learning model LM₁ is used to generate the conversation determination data P₁ based on the facial feature data DF₁.

In the conversation state inference step (S310), the first learning model LM₁ may be used to detect whether the user is in a conversation state from whether a conversation was taking place in the preceding situation and from the features and movements of facial elements detected in the facial feature data DF₁.

In detail, the conversation determination data P₁, which indicates whether the user is judged to be in a conversation, may be generated by applying the first learning model LM₁ to all or part of the user's facial feature data DF₁ to determine whether the user is talking.

Here, the facial feature data DF₁ may include mouth image data DV₂ for the portion of the recognition target area A corresponding to the user's mouth.

In addition, the conversation determination data P₁ on whether the user is in a conversation state may be generated from the mouth image data DV₂ using the first learning model LM₁.

Next, a hand detection inference step (S320) is performed, in which hand image data DV₁ for the tracking target area B is detected in the image data DV, and the second learning model LM₂ is used to generate location inference data DM₁ based on the hand image data DV₁.

Here, the second learning model LM₂ may enable temporal inference about the hand's location relative to the preceding situation. For example, it may determine whether the hand has only temporarily overlapped the face.

In addition, in the hand detection inference step (S320), a location inference feature map FM₁ for the location inference data DM₁ may be generated, and the user's emotional state may be inferred based on the sub feature map FM, the situation determination data P, and the location inference feature map FM₁.

In detail, the location inference feature map FM₁ may include meaningful information about hand movement, such as features from which the hand's gesture can be identified and information on the hand's location.

Next, a face overlap check step (S330) is performed, in which whether the recognition target area A and the tracking target area B overlap is determined based on the facial feature data DF₁ and the location inference data DM₁, and overlap determination data P₂ is generated according to the result.

In detail, the overlap determination data P₂ may be generated by determining whether the recognition target area A and the tracking target area B overlap, and may include one or more parameters that determine the importance of, and whether to use, the facial feature data DF₁ corresponding to the recognition target area A and the voice feature data DF₂.

FIG. 8 is a flowchart illustrating in detail the main inference step of the multi-modal emotion recognition method of FIG. 5.

Referring to FIG. 8, the main inference step (S400) includes a step of generating a plurality of sub feature maps (S410), a multi-modal feature map generation step (S420), and an emotion recognition inference step (S430).

First, the step of generating a plurality of sub feature maps (S410) is performed, in which the third learning model LM₃ is used to generate a plurality of sub feature maps FM for the voice feature data DF₂ and the facial feature data DF₁.

Next, a multi-modal feature map generation step (S420) is performed, in which the third learning model LM₃ refers to the situation determination data P to generate a multi-modal feature map M from the plurality of sub feature maps FM.

Here, the situation determination data P has a situation determination value PV preset according to the user's situation, and the multi-modal feature map generation step (S420) may generate the multi-modal feature map M by applying the situation determination value PV to at least one of the plurality of sub feature maps FM.

In addition, in the multi-modal feature map generation step (S420), the location inference feature map FM₁ may be generated by the hand detection inference module 310, and a multi-modal feature map M for inferring the user's emotional state may be generated based on the sub feature maps FM, the situation determination data P, and the location inference feature map FM₁.

Next, an emotion recognition inference step (S430) is performed, in which the fourth learning model LM₄ is used to infer the emotional state based on the multi-modal feature map M.

도 9는 도 1의 멀티모달 감성 인식 장치에서 상황 변화 여부에 따른 얼굴 인식 과정을 보여주는 예시적인 도면이다.FIG. 9 is an exemplary view illustrating a face recognition process according to whether a situation has changed in the multi-modal emotion recognition apparatus of FIG. 1 .

도 9를 참조하면, ((A)단계) 사용자가 손을 얼굴에 대고 있으며, 손이 입과 코를 가리고 있지는 않는 상황을 나타내고 있다.Referring to FIG. 9 , (step (A)) shows a situation in which the user puts his or her hand on the face and the hand does not cover the mouth and nose.

영상 입력부(110)를 통해 사용자의 영상 데이터(DV)가 입력되고, 음성 입력부(120)를 통해 사용자의 음성 데이터(DS)가 입력된다. The user's image data DV is input through the image input unit 110 , and the user's voice data DS is input through the audio input unit 120 .

Thereafter, the image preprocessor 210 generates preprocessed facial feature data (DF₁), and the voice preprocessor 220 generates preprocessed voice feature data (DF₂). Based on the facial element position information (AL) of the user's recognizable eyes, nose, and mouth, the image preprocessor 210 sets a recognition target area (A) that includes an eye recognition area (A₁), a nose recognition area (A₂), and a mouth recognition area (A₃), and transmits the recognition target area (A) to the preliminary inference unit 300.

Thereafter, the preliminary inference unit 300 generates hand image data (DV₁) for the tracking target area (B₁) detected from the image data (DV).

At this time, the preliminary inference unit 300 generates, from the hand image data (DV₁), position inference data (DM₁) that captures the movement of the hand, and generates overlap determination data (P₂) by determining whether the tracking target area (B₁), based on the position inference data (DM₁), overlaps the recognition target area (A).

Here, the overlap determination data (P₂) may include parameters indicating that the eye recognition area (A₁), the nose recognition area (A₂), and the mouth recognition area (A₃) are to be used.
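
For illustration only, the overlap determination described above can be sketched as a bounding-box test. The description does not specify how the areas are represented or compared, so the `Box` type, the coverage ratio, and the `threshold` value below are assumptions, and the returned dictionary merely plays the role of the per-area use parameters in the overlap determination data (P₂).

```python
from dataclasses import dataclass
from typing import Dict


@dataclass(frozen=True)
class Box:
    """Axis-aligned box in pixel coordinates (hypothetical representation)."""
    x1: float
    y1: float
    x2: float
    y2: float

    def area(self) -> float:
        # Degenerate or inverted boxes count as empty.
        return max(0.0, self.x2 - self.x1) * max(0.0, self.y2 - self.y1)


def overlap_ratio(region: Box, hand: Box) -> float:
    """Fraction of `region` covered by `hand` (0.0 = none, 1.0 = fully covered)."""
    if region.area() == 0.0:
        return 0.0
    inter = Box(max(region.x1, hand.x1), max(region.y1, hand.y1),
                min(region.x2, hand.x2), min(region.y2, hand.y2))
    return inter.area() / region.area()


def overlap_determination(regions: Dict[str, Box], hand: Box,
                          threshold: float = 0.3) -> Dict[str, bool]:
    """Per-area 'use this area' flags; the threshold is an assumed value."""
    return {name: overlap_ratio(box, hand) < threshold
            for name, box in regions.items()}


# Recognition areas A1..A3 and two hand positions (all coordinates invented).
regions = {"eye": Box(40, 40, 120, 60),
           "nose": Box(70, 60, 90, 90),
           "mouth": Box(60, 95, 100, 115)}
hand_a = Box(0, 70, 50, 140)    # hand on the cheek, as in step (A)
hand_c = Box(50, 85, 120, 150)  # hand over the mouth, as in step (C)

p2_a = overlap_determination(regions, hand_a)
p2_c = overlap_determination(regions, hand_c)
print(p2_a)
print(p2_c)
```

With these invented coordinates, every area stays in use for the cheek position, while only the mouth area is flagged as unusable once the hand covers it.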

In addition, the dialog state inference module 310 determines, through the mouth recognition area (A₃) based on the mouth image data (DV₂), whether the user is in a dialog state, and generates dialog determination data (P₁).

Thereafter, the sub feature map generator 410 generates a plurality of sub feature maps (FM) from the facial feature data (DF₁) corresponding to the eyes, nose, and mouth, using the third learning model (LM₃).

Thereafter, the multi-modal feature map generator 420 generates the multi-modal feature map (M) by integrating the plurality of sub feature maps (FM) with the position inference feature map (FM₁) corresponding to the hand.
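
As a rough sketch of this integration step, the sub feature maps and the hand position feature map can be joined into one vector. The description says only that the maps are integrated; plain concatenation, the zeroing of unused areas, and the names `build_multimodal_feature_map` and `use_flags` are assumptions made here for illustration.

```python
from typing import Dict, List, Sequence


def build_multimodal_feature_map(sub_maps: Dict[str, Sequence[float]],
                                 hand_map: Sequence[float],
                                 use_flags: Dict[str, bool]) -> List[float]:
    """Concatenate per-area sub feature maps (FM) with the hand map (FM1).

    Areas flagged as unused are zero-filled so that the fused vector keeps
    a fixed length from frame to frame (an assumed design choice).
    """
    fused: List[float] = []
    for name in sorted(sub_maps):  # fixed ordering of facial areas
        vec = sub_maps[name]
        if use_flags.get(name, True):
            fused.extend(vec)
        else:
            fused.extend([0.0] * len(vec))
    fused.extend(hand_map)
    return fused


# Invented feature values for two facial areas and the hand.
m = build_multimodal_feature_map(
    sub_maps={"eye": [0.2, 0.4], "mouth": [0.9, 0.1]},
    hand_map=[0.5],
    use_flags={"mouth": False},
)
print(m)
```

Keeping the vector length constant lets a downstream model such as the fourth learning model (LM₄) receive the same input shape whether or not an area is occluded.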

Thereafter, emotion recognition is inferred through the fourth learning model (LM₄), taking the user's previous behavior into account, and the inference may be output as the emotion recognition result.

Step (B) shows the motion that continues from step (A).

For example, step (B) may be assumed to be video captured continuously after step (A) at 30 FPS.

As in step (A), the user's image data (DV) is input through the image input unit 110, and the user's voice data (DS) is input through the voice input unit 120.

Thereafter, the voice preprocessor 220 generates preprocessed voice feature data (DF₂), and the image preprocessor 210 generates the facial feature data (DF₁) and the facial element position information (AL). Based on the facial element position information (AL), the image preprocessor 210 sets the recognition target area (A), which includes the eye recognition area (A₁), the nose recognition area (A₂), and the mouth recognition area (A₃), and transmits the recognition target area (A) to the preliminary inference unit 300.

At this time, the size of the recognition target area (A) may change according to the user's motion.

Compared with step (A), step (B) shows the recognition target area (A) changing in size as the user moves.

Thereafter, the preliminary inference unit 300 generates position inference data (DM₁) based on the hand image data (DV₁), and can thereby track the movement of the hand from step (A) to step (B).

The preliminary inference unit 300 generates overlap determination data (P₂) by determining whether the tracking target area (B₂), based on the position inference data (DM₁), overlaps the recognition target area (A).

In addition, the preliminary inference unit 300 determines whether the user is in a dialog state and generates dialog determination data (P₁).

At this time, the preliminary inference unit 300 may use the first learning model (LM₁) to determine the dialog state, taking into account whether the user who is the subject of emotion recognition has continued to speak in the preceding situation, including step (A).

For example, if the user was inferred not to be in a dialog state in step (A), then, based on that result, the preliminary inference unit 300 may use the first learning model (LM₁) to determine that the user is not in a dialog state even if, in step (B), the user's mouth shape in the mouth recognition area (A₃) momentarily resembles the mouth shape of a dialog state. That is, the preliminary inference unit 300 may infer the dialog state in the next scene, step (B), based on the dialog state determination result of step (A).
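
The behavior described above, where a momentary speech-like mouth shape does not flip the dialog state, can be illustrated with a simple hysteresis rule. This is a hand-written stand-in, not the first learning model (LM₁), whose structure is not given here; the per-frame `mouth_open_scores`, the `on_threshold`, and the `min_frames` count are all assumed for the sketch.

```python
from typing import Iterable, List


def smooth_dialog_state(mouth_open_scores: Iterable[float],
                        on_threshold: float = 0.6,
                        min_frames: int = 3,
                        initial_state: bool = False) -> List[bool]:
    """Return a per-frame dialog state that only changes after `min_frames`
    consecutive frames contradict the current state."""
    state = initial_state
    run = 0  # consecutive frames that disagree with the current state
    states: List[bool] = []
    for score in mouth_open_scores:
        observed = score >= on_threshold
        if observed != state:
            run += 1
            if run >= min_frames:
                state = observed
                run = 0
        else:
            run = 0
        states.append(state)
    return states


# Frame 3 briefly looks like speech but is ignored; frames 5-7 are sustained.
scores = [0.1, 0.2, 0.8, 0.1, 0.7, 0.9, 0.8, 0.9]
states = smooth_dialog_state(scores)
print(states)
```

The isolated high score on the third frame leaves the state unchanged, which mirrors the case where the step (A) result carries over into step (B); a learned temporal model would replace the fixed frame count with behavior learned from data.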

Thereafter, the main inference unit 400 generates a plurality of sub feature maps (FM) from the received facial feature data (DF₁) and voice feature data (DF₂) using the third learning model (LM₃), and generates the multi-modal feature map (M) by integrating the plurality of sub feature maps (FM) with the position inference feature map (FM₁) corresponding to the hand.

Thereafter, the main inference unit 400 infers emotion recognition through the fourth learning model (LM₄), taking into account the user's behavior in the preceding step (A), and may output the inference as the emotion recognition result.

Step (C) shows the user covering the mouth with a hand after step (B).

The image preprocessor 210 sets a recognition target area (A) that includes the eye recognition area (A₁), based on the facial element position information (AL) of the user's recognizable eyes, and transmits the recognition target area (A) to the preliminary inference unit 300.

Thereafter, the preliminary inference unit 300 generates hand image data (DV₁) for the tracking target area (B₃) detected from the image data (DV). At this time, position inference data (DM₁) that captures the movement of the hand is generated from the hand image data (DV₁), and overlap determination data (P₂) is generated by determining whether the tracking target area (B₃), based on the position inference data (DM₁), overlaps the recognition target area (A).

Here, the overlap determination data (P₂) may include a parameter indicating whether the facial feature data (DF₁) based on the eye recognition area (A₁) is used, or a weight applied to the facial feature data (DF₁).

In addition, the preliminary inference unit 300 may recognize that the nose recognition area (A₂) or the mouth recognition area (A₃), which belonged to the recognition target area (A) in steps (A) and (B), overlaps the tracking target area (B₃), which is the area corresponding to the position of the user's hand. In that case, the overlap determination data (P₂) may include a parameter indicating that the overlapped area is excluded from the emotion recognition inference or is given lower importance.
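
One way to picture the "excluded or given lower importance" parameter is as a per-area weight derived from how much of the area the hand covers. The mapping below (weight equals one minus the coverage ratio, with full exclusion above a cutoff) and the names `region_weights` and `exclude_above` are assumptions for illustration; the description does not state how the weight is computed.

```python
from typing import Dict


def region_weights(overlap_ratios: Dict[str, float],
                   exclude_above: float = 0.8) -> Dict[str, float]:
    """Map each area's coverage ratio (0..1) to a weight in [0, 1].

    Areas covered beyond `exclude_above` are dropped (weight 0.0);
    partially covered areas are down-weighted in proportion to coverage.
    """
    weights: Dict[str, float] = {}
    for name, ratio in overlap_ratios.items():
        clipped = min(max(ratio, 0.0), 1.0)
        weights[name] = 0.0 if clipped >= exclude_above else 1.0 - clipped
    return weights


# Invented coverage ratios for a frame like step (C): mouth fully covered.
w = region_weights({"eye": 0.0, "nose": 0.25, "mouth": 1.0})
print(w)
```

Under these assumed values the eye area keeps full weight, the nose area is reduced, and the mouth area is excluded.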

In addition, the preliminary inference unit 300 may take into account both the fact that the mouth image data (DV₂) corresponding to the mouth recognition area (A₃) cannot be recognized and the result of determining whether the user was previously in a dialog state, and may include in the dialog determination data (P₁) a value indicating whether the voice feature data (DF₂) is to be used.
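
The decision described above can be sketched as a small fallback rule: when the mouth is visible, the current mouth evidence decides; when it is hidden, the previous dialog state is carried forward. The function name, its arguments, and the dictionary standing in for the dialog determination data (P₁) are hypothetical, and a real system would obtain the previous state from a temporal model rather than a single boolean.

```python
from typing import Dict, Optional


def dialog_determination(mouth_visible: bool,
                         mouth_suggests_speech: Optional[bool],
                         previously_talking: bool) -> Dict[str, bool]:
    """Build a stand-in for P1, including whether to use voice features."""
    if mouth_visible and mouth_suggests_speech is not None:
        talking = mouth_suggests_speech  # current visual evidence decides
    else:
        talking = previously_talking     # mouth hidden: rely on history
    return {"talking": talking, "use_voice_features": talking}


# Step (C): the hand hides the mouth, and the user was talking beforehand.
p1_step_c = dialog_determination(mouth_visible=False,
                                 mouth_suggests_speech=None,
                                 previously_talking=True)
print(p1_step_c)
```

In this sketch the voice features stay in use while the mouth is covered, provided the preceding frames indicated that the user was speaking.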

Here, the result of determining the previous dialog state is inferred through a temporal learning model. The temporal learning model may be a recurrent neural network such as an LSTM (Long Short-Term Memory), an RNN (Recurrent Neural Network), or a GRU (Gated Recurrent Unit).

Thereafter, the sub feature map generator 410 generates a plurality of sub feature maps (FM) from the facial feature data (DF₁) of the region corresponding to the eyes, using the third learning model (LM₃).

Thereafter, the emotion recognition inference unit 430 infers emotion recognition through the fourth learning model (LM₄), taking the user's previous behavior into account, and may output the inference as the emotion recognition result.

Although the embodiments have been described above with reference to a limited set of embodiments and drawings, a person of ordinary skill in the art can make various modifications and variations from the above description. For example, an appropriate result may be achieved even if the described techniques are performed in an order different from the described method, and/or the described components of the system, structure, device, circuit, and the like are coupled or combined in a form different from the described method, or are replaced or substituted by other components or equivalents.

Therefore, other implementations, other embodiments, and equivalents of the claims also fall within the scope of the claims that follow.

The system or device described above may be implemented with hardware components, software components, and/or a combination of hardware and software components. For example, the systems, devices, and components described in the embodiments may be implemented using one or more general-purpose or special-purpose computers, such as a processor, a controller, an arithmetic logic unit (ALU), a digital signal processor, a microcomputer, a field programmable array (FPA), a programmable logic unit (PLU), a microprocessor, or any other device capable of executing and responding to instructions. The processing device may run an operating system (OS) and one or more software applications that run on the operating system. The processing device may also access, store, manipulate, process, and generate data in response to execution of the software. For ease of understanding, the processing device is sometimes described as a single device, but a person of ordinary skill in the art will appreciate that the processing device may include a plurality of processing elements and/or multiple types of processing elements. For example, the processing device may include a plurality of processors, or one processor and one controller. Other processing configurations, such as parallel processors, are also possible.

Software may include a computer program, code, instructions, or a combination of one or more of these, and may configure the processing device to operate as desired or may instruct the processing device independently or collectively. Software and/or data may be embodied permanently or temporarily in any type of machine, component, physical device, virtual equipment, computer storage medium or device, or transmitted signal wave, in order to be interpreted by the processing device or to provide instructions or data to the processing device. Software may be distributed over networked computer systems and stored or executed in a distributed manner. Software and data may be stored on one or more computer-readable recording media.

The method according to the embodiments may be implemented in the form of program instructions executable by various computer means and recorded on a computer-readable medium. The computer-readable medium may include program instructions, data files, data structures, and the like, alone or in combination. The program instructions recorded on the medium may be specially designed and configured for the embodiments, or may be known and available to those skilled in the art of computer software. Examples of computer-readable recording media include magnetic media such as hard disks, floppy disks, and magnetic tape; optical media such as CD-ROMs and DVDs; magneto-optical media such as floptical disks; and hardware devices specially configured to store and execute program instructions, such as ROM, RAM, and flash memory. Examples of program instructions include not only machine code such as that produced by a compiler, but also high-level language code that can be executed by a computer using an interpreter or the like. The hardware devices described above may be configured to operate as one or more software modules in order to perform the operations of the embodiments, and vice versa.

10: multi-modal emotion recognition device
100: data input unit
110: image input unit
120: voice input unit
200: data preprocessor
210: image preprocessor
211: face detection module
212: image preprocessing module
213: landmark detection module
214: position adjustment module
215: facial element extraction module
220: voice preprocessor
221: voice correction module
222: voice feature data extraction module
300: preliminary inference unit
310: dialog state inference module
320: hand detection inference module
330: face overlap check module
400: main inference unit
411: first sub feature map generator
412: second sub feature map generator
413: third sub feature map generator
414: n-th sub feature map generator
420: multi-modal feature map generator
430: emotion recognition inference unit
500: output unit
S100: data input step
S200: data preprocessing step
S210: image preprocessing step
S220: voice preprocessing step
S300: preliminary inference step
S310: dialog state inference step
S320: hand detection inference step
S330: face overlap check step
S400: main inference step
S410: sub feature map generation step
S420: multi-modal feature map generation step
S430: emotion recognition inference step
S500: result derivation step
A₁: eye recognition area
A₂: nose recognition area
A₃: mouth recognition area
B₁, B₂, B₃: tracking target areas

Claims

An emotion recognition method that uses an emotion recognition device to process an image in order to determine a person's emotional state, the method comprising:
a step of providing an image and a voice representing the appearance of a person, wherein the image includes a first image unit, a second image unit immediately following the first image unit, and a third image unit immediately following the second image unit;
a step of processing the first image unit to determine the emotional state of the person in the first image unit, wherein the first image unit shows the person's face and at least one hand, and the at least one hand does not overlap any part of the person's face; and
a step of processing the second image unit to determine the emotional state of the person in the second image unit, wherein the second image unit shows the person's face and at least one hand, and the at least one hand overlaps the person's face,
wherein the step of processing the first image unit comprises:
processing at least one frame of the first image unit to determine whether the at least one hand covers the person's face;
finding a first facial element of the person in the at least one frame of the first image unit;
with the first facial element located, obtaining first facial feature data of the first image unit based on the shape of the first facial element shown in the at least one frame of the first image unit;
processing audio data of the first image unit to obtain voice feature data based on characteristics of the person's voice in the first image unit; and
determining the emotional state of the person for the first image unit based on a plurality of data including the first facial feature data and the voice feature data of the first image unit,
and wherein the step of processing the second image unit comprises:
processing at least one frame of the second image unit to determine whether the person's face is covered by the at least one hand, wherein, in particular, it is determined whether the at least one hand covers the person's face in the second image unit;
finding the first facial element of the person in the at least one frame of the second image unit;
with the first facial element located, obtaining first facial feature data of the second image unit based on the shape of the first facial element shown in the at least one frame of the second image unit;
processing audio data of the second image unit to obtain voice feature data based on characteristics of the person's voice in the second image unit; and
determining the emotional state of the person for the second image unit based on a plurality of data including the first facial feature data of the second image unit, the voice feature data of the second image unit, and additional data indicating the position at which the at least one hand covers a part of the person's face.

The method of claim 1,
wherein the step of determining the emotional state of the person for the second image unit based on the plurality of data
places more weight on the voice feature data of the second image unit than on the first facial feature data of the second image unit when a part of the person's face is covered by the at least one hand in the second image unit.

The method of claim 1,
wherein the step of determining the emotional state of the person for the second image unit based on the plurality of data,
when the at least one hand does not cover any part of the person's face in the first image unit but a part of the person's face is covered by the at least one hand in the second image unit, places more weight on the voice feature data of the second image unit than on the voice feature data of the first image unit.

The method of claim 1,
wherein the step of processing the first image unit further comprises:
finding a second facial element of the person in the at least one frame of the first image unit; and
with the second facial element located, obtaining second facial feature data of the first image unit based on the shape of the second facial element shown in the at least one frame of the first image unit,
wherein, in particular, the emotional state of the person for the first image unit is determined based on a plurality of data including the first facial feature data of the first image unit, the second facial feature data of the first image unit, and the voice feature data of the first image unit,
and wherein the step of processing the second image unit further comprises:
finding the second facial element of the person in the at least one frame of the second image unit, wherein it is determined whether the second facial element is covered by the at least one hand; and
obtaining second facial feature data of the second image unit based on the second facial feature data of the first image unit and a preset weight for occlusion of the second facial element,
wherein, in particular, the emotional state of the person for the second image unit is determined based on the first facial feature data of the second image unit, the second facial feature data of the second image unit, the voice feature data of the second image unit, and the additional data indicating the position of the at least one hand.

The method of claim 1,
further comprising a step of processing the third image unit to determine the emotional state of the person for the third image unit, wherein the third image unit shows the person's face and the at least one hand, with no part of the person's face covered by the at least one hand,
wherein the step of processing the third image unit comprises:
processing at least one frame of the third image unit to determine whether the at least one hand covers the person's face;
finding the first facial element of the person in the at least one frame of the third image unit;
with the first facial element located, obtaining first facial feature data of the third image unit based on the shape of the first facial element shown in the at least one frame of the third image unit;
processing the audio data of the first image unit to obtain voice feature data of the third image unit based on characteristics of the person's voice in the third image unit; and
determining the emotional state of the person for the first image unit based on a plurality of data including the first facial feature data and the voice feature data of the third image unit.

A computer-readable storage medium storing instructions that, when executed by a computer, cause the computer to perform the method of claim 1.