KR101030603B1

KR101030603B1 - Determination of starting point of multi-modal command and acquisition of control authority using clap sound and gesture image

Info

Publication number: KR101030603B1
Application number: KR1020090118875A
Authority: KR
Inventors: 홍광석; 김정현; 김동주
Original assignee: 성균관대학교산학협력단
Priority date: 2009-12-03
Filing date: 2009-12-03
Publication date: 2011-04-20
Anticipated expiration: 2029-12-03

Abstract

The present invention proposes a method for efficiently detecting the starting point of a command using multiple modalities in a multi-modal application system. Its purpose and effect are to minimize misrecognition of system commands caused by accidental utterances or unintended gestures that a user makes consciously or unconsciously.

In inputting the multi-modality commands required to operate a multi-modal system, the present invention has the user first perform a hand clap, which produces a sound and a gesture at the same time, once or n times. The resulting clap sound and clap gesture image are recognized with a microphone, a camera, and the like, and, taking the moment of recognition as the reference, the system is notified that a system operation command or content control command given through an individual modality or multiple modalities is starting. The invention is configured so that the user who currently wishes to input a command can be identified among a plurality of users and granted control authority, and so that control efficiency and accuracy for various sound and gesture commands can be improved.

Clap sound, gesture image, multi-modal, control authority

Description

A Starting Point Decision and Control Right Acquisition Method of Multi-Mode Commands Using Clap and Gesture Image

The present invention relates to a method for determining the starting point of a multi-modal command and acquiring control authority. In inputting the multi-modality commands required to operate a multi-modal (multi-mode) system, a hand clap, which produces a sound and a gesture at the same time, is first performed once or n times. The resulting clap sound and clap gesture image are recognized with a microphone, a camera, and the like, and, taking the time of recognition as the reference, the system is notified that a system operation command or content control command given through an individual modality or multiple modalities is starting. The method can identify, among a plurality of users, the user who currently wishes to input a command and grant that user control authority, and can improve control efficiency and accuracy for various sound, voice, and gesture commands.

Korean Registered Patent Publication No. 10-0873470, prior art related to the present invention, partially discloses a configuration that controls the operation of a system using multi-modal commands intended for starting and ending recognition. However, because it does not use sound and gesture image signals simultaneously as modal commands, system commands may be misrecognized owing to accidental utterances or unintended gestures that the user makes consciously or unconsciously.

Korean Registered Patent Publication No. 10-0860407, further prior art related to the present invention, discloses a technical configuration that executes a command by inferring an action. However, because it executes commands on the basis of actions alone, the likelihood of misrecognizing system commands increases.

Korean Registered Patent Publication No. 10-0651729, still further prior art related to the present invention, is similar in that it can generate and output commands based on the user's situation. However, because it generates and outputs commands using only a single kind of input, system commands are misrecognized owing to accidental utterances or unintended gestures that the user makes consciously or unconsciously.

The problem to be solved by the present invention is to propose a method for efficiently detecting the starting point of a command using multiple modalities in a multi-modal application system. A hand clap, from which audio modality information (the clap sound, etc.) and gesture modality information can be acquired simultaneously, is first performed once or n times; this modality information is recognized; and the user is identified and granted control authority. The aim is thereby to minimize system errors caused by accidental utterances or unintended gestures that the user makes consciously or unconsciously.

Another problem to be solved by the present invention is to acquire the audio modality (the clap sound, etc.) and the gesture modality, to determine through a simple decision algorithm such as an AND operation that the individual recognition results stem from the same input, and to identify the user who wishes to input a command and grant that user system control authority. The aim is thereby to minimize misrecognition of system commands caused by accidental utterances or unintended gestures that the user makes consciously or unconsciously.
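The AND-based decision described above can be illustrated with a short Python sketch. The patent only names "a simple decision algorithm such as an AND operation"; the `Detection` record, the `command_start` function, and the `max_skew_s` time tolerance are assumptions introduced here for illustration.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Detection:
    """Result of one recognizer (hypothetical structure, not from the patent)."""
    recognized: bool  # True if the recognizer matched a stored clap model
    time_s: float     # time of the detected clap, in seconds


def command_start(sound: Detection, gesture: Detection,
                  max_skew_s: float = 0.2) -> Optional[float]:
    """Return the command start time if both modalities agree, else None.

    AND operation: both recognizers must report a clap. As an added
    assumption, the two detections must also lie within max_skew_s of
    each other so that they plausibly stem from the same clap.
    """
    both = sound.recognized and gesture.recognized
    if both and abs(sound.time_s - gesture.time_s) <= max_skew_s:
        return max(sound.time_s, gesture.time_s)
    return None
```

A clap heard without a matching gesture (or the reverse) yields `None`, which is how the AND rule suppresses accidental sounds or movements.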

Another problem to be solved by the present invention is to minimize misrecognition of system commands caused by accidental utterances or unintended gestures that the user makes consciously or unconsciously, by means of endpoint detection. The endpoint detection methods include not only time-domain endpoint detection using short-time energy and zero-crossing rate, but also endpoint detection using the frequency spectrum, MFB (Mel-frequency Filter Bank)-based endpoint detection, and endpoint interval detection methods such as subdividing the voice band into individual bands and applying spectral subtraction and band energy to each band, and in particular endpoint detection using entropy based on the FFT spectrum and the mel spectrum.

The means for solving the problem of the present invention is to implement a method for determining the starting point of a multi-modal command and acquiring control authority using a clap sound and a gesture image, comprising: a step in which the user claps once or n or more times and the sound produced by the clap is acquired with a microphone, a stereo microphone, or the like, and a step in which a gesture image of the clapping motion is captured with a camera, a stereo camera, or the like; a step of extracting the features of the clap sound signal from the signal acquired with the microphone or the like, and a step of extracting the features of the user image from the gesture image signal detected by the camera; a step of recognizing the sound and gesture image signals through a clap sound recognition unit, which compares the features of the detected clap sound signal with clap sound features stored in a server, a memory, or the like, and a clap gesture recognition unit, which compares the features of the detected clap gesture image signal with clap gesture image signal features stored in a server, a memory, or the like; a step of determining the command starting point by recognizing the sound and gesture image signals multi-modally via the clap sound recognition unit and the clap gesture recognition unit; and a step of identifying, in a multi-user or single-user environment, the user who performed the clap sound and clap gesture so that this user acquires system control authority.

Another means for solving the problem of the present invention is to implement a method for determining the starting point of a multi-modal command and acquiring control authority that uses a step of removing the background noise contained in the clap sound in order to improve endpoint detection performance for the clap sound. This step uses time-domain endpoint detection based on short-time energy and zero-crossing rate, endpoint detection using the frequency spectrum, MFB (Mel-frequency Filter Bank)-based endpoint detection, endpoint interval detection methods such as subdividing the sound band into individual bands and applying spectral subtraction and band energy to each band, and endpoint detection using entropy based on the FFT spectrum and the mel spectrum.

Another means for solving the problem of the present invention is to implement a more accurate method for determining the starting point of a multi-modal command and acquiring control authority. To recognize the hand gesture image signal during a clap gesture, the process is basically divided into a step of comparing and recognizing the hand shape in multiple dimensions and a step of comparing and analyzing the movement shapes of the hand and arm. The hand shape is extracted, and the multi-dimensional shape movement of the arm is tracked and analyzed by reconstructing a multi-dimensional human model, so that the components of the user's hand or arm are tracked.

Another means for solving the problem of the present invention is to implement a more efficient method for determining the starting point of a multi-modal command and acquiring control authority, comprising a step of recognizing a hand shape robustly against viewing angle and self-occlusion using the multi-dimensional hand shape (image) estimation, and recognizing the gesture image by comparing the movement of the hand image with the user's characteristic gesture image signal stored in a server or a memory.

The present invention proposes a method for efficiently detecting the starting point of a command using multiple modalities in a multi-modal application system. A hand clap, from which audio modality information (the clap sound, etc.) and gesture modality information can be acquired simultaneously, is first performed once or n times; this modality information is recognized; and the user is identified and granted authority to control the system. This has the effect of minimizing misrecognition of system commands caused by accidental utterances or unintended gestures that the user makes consciously or unconsciously.

Another effect of the present invention is to minimize misrecognition of system commands caused by accidental utterances or unintended gestures that the user makes consciously or unconsciously, by acquiring the audio modality (the clap sound, etc.) and the gesture modality and, through a simple decision algorithm such as an AND operation, identifying the user who wishes to input a command and granting that user system control authority.

Another effect of the present invention is to minimize misrecognition of system commands caused by accidental utterances or unintended gestures. To recognize clap sounds robustly against noise, endpoint detection performance for the clap sound is improved by using time-domain endpoint detection based on short-time energy and zero-crossing rate, endpoint detection using the frequency spectrum, MFB (Mel-frequency Filter Bank)-based endpoint detection, and endpoint detection using entropy based on the FFT spectrum and the mel spectrum.

The specific details for carrying out the present invention are now described. The method according to the present invention for determining the starting point of a multi-modal command and acquiring control authority using a clap sound and a clapping gesture image comprises: a step in which the user claps once or n or more times and the sound produced by the clap is detected with a microphone, a stereo microphone, or the like, and a step in which the gesture image signal of the motion made to produce the clap sound is acquired with a camera, a stereo camera, or the like; a step of performing endpoint detection on the acquired sound signal and extracting the features of the detected clap sound, and a step of extracting features from the captured gesture image signal; passing through a clap sound recognition unit, which compares the features of the detected clap sound signal with the user's clap sound features previously stored in a server, a memory, or the like, and a clap gesture recognition unit, which compares and analyzes the features of the captured clap gesture image signal against the features of the user's clap gesture image signal previously stored in a server or a memory; and a step in which the person who clapped acquires system control authority.

In addition, to make the system robust against noise by removing the background noise contained in the clap sound signal, and thus to improve endpoint detection performance for the clap sound, the present invention includes a step of removing background noise using time-domain endpoint detection based on short-time energy and zero-crossing rate, endpoint detection using the frequency spectrum, MFB (Mel-frequency Filter Bank)-based endpoint detection, endpoint interval detection methods such as subdividing the sound band into individual bands and applying spectral subtraction and band energy to each band, and endpoint detection using entropy based on the FFT spectrum and the mel spectrum.

Typically, a hand clap is accompanied at the same time by the clap sound and the gesture of clapping. Physique, weight, hand size, and hand shape differ from person to person, and accordingly the shapes traced by the hands, arms, and body during the clapping action also exhibit consistent characteristics.

The clap sound features and gesture image signal features according to the present invention are generally organized as recognition/reference models and stored in a server database, in the memory of the device for determining the starting point of a multi-modal command and acquiring control authority, or the like.

Hereinafter, the configuration and operation of embodiments of the present invention are described with reference to the accompanying drawings. The configuration and operation shown in and described with the drawings are presented as at least one embodiment, and the technical idea of the present invention and its core configuration and operation are not limited thereby.

The drawings that facilitate understanding of the present invention are as follows. FIG. 1 shows an example of the overall system structure according to the present invention. FIG. 2 shows an example of the overall system flowchart for the method of determining the starting point of a multi-modal command and acquiring control authority using a clap sound and gesture. FIG. 3 shows an example of the signal waveform, the frequency spectrum analysis result, and the endpoint detection result for a clap sound according to the present invention. FIG. 4 shows an example of acquiring the audio and gesture modality information that can be obtained simultaneously from a hand clap according to the present invention. A specific embodiment according to the present invention is described below.

<Example>

A specific embodiment according to the present invention is described with reference to the drawings. FIG. 1 is a block diagram of the overall system according to the present invention. Referring to FIG. 1, the method of recognizing the clap sound is as follows. The user claps, and the sound produced by the clap is detected with a microphone, a stereo microphone, or the like. The features of the user's sound signal are extracted from the signal thus acquired. The extracted features pass through the clap sound recognition unit, which compares them with the clap sound features stored in a server, a memory, or the like; if the features of the detected clap sound match those of the stored clap sound, the command starting point is determined. These steps constitute the method of determining the starting point of a multi-modal command. Together with the further step in which, once the starting point has been determined, the person who clapped acquires system control authority, they constitute the method of determining the starting point of a multi-modal command and acquiring control authority.

Referring to FIG. 1, the method of recognizing the gesture image signal according to the present invention is as follows. When the user claps, the various gesture images made during the clap are captured with a camera, a stereo camera, or the like. The features of the various gesture image signals made during the clap are extracted from the captured gesture image signal. The extracted features pass through the clap gesture recognition unit, which compares them with the clapping gesture image signal features stored in memory in order to recognize them. Via the clap gesture recognition unit, the command starting point is determined and the person who performed the clap gesture acquires authority to use the system.

Typically, a hand clap is accompanied at the same time by the clap sound and the gesture of clapping. Physique, weight, hand size, and hand shape differ from person to person, and accordingly the shapes traced by the hands, arms, and body during the clapping action also exhibit consistent characteristics.

The clap sound features and gesture image signal features according to the present invention are generally organized as recognition/reference models and stored in a server database, in the memory of the device for determining the starting point of a multi-modal command and acquiring control authority, or the like.
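As a rough illustration of how stored reference models could be used to identify the clapping user and grant control authority, the following Python sketch matches a feature vector against per-user templates. The patent does not specify the matching rule; the Euclidean distance, the `max_distance` threshold, and the function names are assumptions made for this example.

```python
import math
from typing import Dict, List, Optional


def euclidean(a: List[float], b: List[float]) -> float:
    """Euclidean distance between two equal-length feature vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def identify_user(features: List[float],
                  reference_models: Dict[str, List[float]],
                  max_distance: float) -> Optional[str]:
    """Return the user whose stored model is nearest, or None if none is close."""
    best_user, best_dist = None, float("inf")
    for user, ref in reference_models.items():
        dist = euclidean(features, ref)
        if dist < best_dist:
            best_user, best_dist = user, dist
    return best_user if best_dist <= max_distance else None


def grant_control(sound_features: List[float], gesture_features: List[float],
                  sound_models: Dict[str, List[float]],
                  gesture_models: Dict[str, List[float]],
                  max_distance: float = 1.0) -> Optional[str]:
    """Grant control only if both modalities point to the same known user."""
    by_sound = identify_user(sound_features, sound_models, max_distance)
    by_gesture = identify_user(gesture_features, gesture_models, max_distance)
    if by_sound is not None and by_sound == by_gesture:
        return by_sound
    return None
```

In practice the templates would be replaced by whatever recognition/reference models the system stores (for example HMM or GMM models); a single template per user is used here only to keep the sketch short.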

To remove the background noise contained in the clap sound detected with the microphone, stereo microphone, or the like, and thereby improve endpoint detection performance for the clap sound and increase the accuracy of the recognition decision, the system is configured to remove background noise using time-domain endpoint detection based on short-time energy and zero-crossing rate, endpoint detection using the frequency spectrum, MFB (Mel-frequency Filter Bank)-based endpoint detection, endpoint interval detection methods such as subdividing the sound band into individual bands and applying spectral subtraction, which analyzes the spectrum of each band, together with band energy, and endpoint detection using entropy based on the FFT spectrum and the mel spectrum.
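The first of these methods, time-domain endpoint detection with short-time energy and zero-crossing rate, can be sketched in Python as follows. This is a minimal illustration under stated assumptions: the noise floor is estimated from the first `noise_frames` frames, a frame counts as active when its energy exceeds `energy_ratio` times that floor and its zero-crossing rate is at least `zcr_min`, and only the first contiguous active run is returned. None of these parameter names or default values come from the patent.

```python
from typing import List, Optional, Tuple


def frame_features(x: List[float], size: int, hop: int) -> List[Tuple[float, float]]:
    """Per-frame (short-time energy, zero-crossing rate)."""
    feats = []
    for start in range(0, len(x) - size + 1, hop):
        frame = x[start:start + size]
        energy = sum(s * s for s in frame) / size
        crossings = sum(1 for a, b in zip(frame, frame[1:]) if (a >= 0) != (b >= 0))
        feats.append((energy, crossings / (size - 1)))
    return feats


def detect_endpoints(x: List[float], size: int = 256, hop: int = 128,
                     noise_frames: int = 5, energy_ratio: float = 10.0,
                     zcr_min: float = 0.1) -> Optional[Tuple[int, int]]:
    """Return (start_sample, end_sample) of the first active segment, or None."""
    feats = frame_features(x, size, hop)
    if len(feats) <= noise_frames:
        return None
    # Assumption: the recording begins with background noise only.
    noise = sum(e for e, _ in feats[:noise_frames]) / noise_frames
    threshold = max(noise * energy_ratio, 1e-12)
    active = [i for i, (e, z) in enumerate(feats) if e > threshold and z >= zcr_min]
    if not active:
        return None
    first = last = active[0]
    for i in active[1:]:
        if i != last + 1:
            break  # keep only the first contiguous run of active frames
        last = i
    return first * hop, last * hop + size
```

The zero-crossing condition reflects the expectation that a clap is a broadband burst; whether such a condition helps in a given environment would need to be checked on real recordings.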

To recognize the hand gesture image during the clapping gesture, the process is basically divided into a step of comparing and recognizing a multi-dimensional hand shape image signal in the image signal captured with a camera or the like, and a step of comparing and analyzing the movements of the hand and arm. The hand image signal is extracted from the image signal captured with the camera or the like, and the pattern movement of the arm's multi-dimensional shape is tracked and analyzed by reconstructing a multi-dimensional human model.

The method further comprises a step of recognizing a hand shape pattern robustly against viewing angle and self-occlusion using the multi-dimensional hand pose estimation, and recognizing the gesture by comparing and analyzing the movement of the hand shape pattern.

The clap sound recognition technology according to the present invention includes, as preprocessing steps for the input sound signal, removal of the background noise contained in the sound, detection of the effective clap sound interval, pre-emphasis, windowing, and the like. Detection of the effective clap sound interval uses time-axis features such as the short-time energy of the signal and the ZCR (Zero Crossing Rate). The frame-blocked audio signal undergoes pre-emphasis and windowing, after which feature vectors are extracted from the individual frames. In the present invention, the feature vector may use LPC (Linear Predictive Coefficient), LPCC (Linear Predictive Cepstral Coefficient), MFCC (Mel Frequency Cepstral Coefficient), PLP (Perceptual Linear Prediction), RASTA (Relative Spectral Transform)-PLP, and the like. The feature vectors obtained in the feature extraction step then go through similarity measurement and recognition of the clap sound pattern.
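The preprocessing chain named above (pre-emphasis, frame blocking, windowing) can be sketched in Python as follows. The pre-emphasis coefficient of 0.97 and the choice of a Hamming window are common defaults assumed for illustration; the patent does not state which coefficient or window it uses.

```python
import math
from typing import List


def pre_emphasis(x: List[float], alpha: float = 0.97) -> List[float]:
    """High-frequency emphasis: y[n] = x[n] - alpha * x[n-1]."""
    if not x:
        return []
    return [x[0]] + [x[n] - alpha * x[n - 1] for n in range(1, len(x))]


def frame_blocking(x: List[float], size: int, hop: int) -> List[List[float]]:
    """Split the signal into overlapping frames of `size` samples."""
    return [x[i:i + size] for i in range(0, len(x) - size + 1, hop)]


def hamming(frame: List[float]) -> List[float]:
    """Apply a Hamming window to one frame."""
    n = len(frame)
    if n == 1:
        return frame[:]
    return [s * (0.54 - 0.46 * math.cos(2 * math.pi * i / (n - 1)))
            for i, s in enumerate(frame)]


def preprocess(x: List[float], size: int, hop: int) -> List[List[float]]:
    """Pre-emphasize, block into frames, and window each frame."""
    return [hamming(f) for f in frame_blocking(pre_emphasis(x), size, hop)]
```

Each windowed frame returned by `preprocess` would then be passed to a feature extractor such as MFCC or LPC, which is outside the scope of this sketch.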

Techniques for clap sound recognition include rule-based recognition, HMM-based recognition, and methods using the GMM (Gaussian Mixture Model), which represents a distribution as a mixture of Gaussian densities. In the present invention, clap sound recognition encompasses the overall process using a microphone, a stereo microphone, and the like. To implement a clap sound recognition system that is robust in noisy environments, the background noise contained in the clap sound can be removed in the preprocessing stage to improve endpoint detection performance for the clap sound, which in turn minimizes degradation of the recognition system's performance. Accordingly, in addition to time-domain endpoint detection using short-time energy and zero-crossing rate, the present invention can apply endpoint detection using the frequency spectrum, MFB (Mel-frequency Filter Bank)-based endpoint detection, a method that subdivides the voice band into individual bands and uses spectral subtraction, which analyzes the spectrum of each band, together with band energy, an FFT spectrum method, and endpoint detection using mel-spectrum-based entropy. FIG. 3 shows an example of the signal waveform and frequency spectrum analysis result for a clap sound, together with the result of endpoint detection using mel-spectrum-based entropy.
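To make the entropy-based approach concrete, the following Python sketch computes the spectral entropy of a single frame from its FFT power spectrum. It uses plain DFT bins rather than mel bands, and a naive O(n²) DFT so that it needs only the standard library; both simplifications, and the function names, are assumptions of this example. How the per-frame entropy is thresholded to mark endpoints is not specified in the patent and is left out.

```python
import cmath
import math
from typing import List


def power_spectrum(frame: List[float]) -> List[float]:
    """Power of DFT bins 0..n/2 (naive DFT; fine for short frames)."""
    n = len(frame)
    spectrum = []
    for k in range(n // 2 + 1):
        acc = sum(frame[t] * cmath.exp(-2j * math.pi * k * t / n) for t in range(n))
        spectrum.append(abs(acc) ** 2)
    return spectrum


def spectral_entropy(frame: List[float]) -> float:
    """Shannon entropy (in nats) of the normalized power spectrum."""
    spectrum = power_spectrum(frame)
    total = sum(spectrum)
    if total == 0.0:
        return 0.0  # silent frame: define entropy as zero
    probs = [p / total for p in spectrum]
    return -sum(p * math.log(p) for p in probs if p > 0.0)
```

A frame whose energy is concentrated in one frequency bin has entropy near zero, while a frame with a flat spectrum, such as an ideal impulse, reaches the maximum of log(number of bins).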

To recognize the hand gesture made when the hands move to produce a clap sound, the gesture image recognition technology according to the present invention basically consists of a step of recognizing the multi-dimensional hand shape and a step of analyzing and recognizing the shape of the hand and arm movements. Extracting the hand gesture image signal produced during a clap and tracking and analyzing the multi-dimensional shape movement of the arm requires reconstruction of a multi-dimensional human model. That is, by tracking the body components obtained through multi-dimensional human model reconstruction, the movement images of the hand and arm can be tracked and the hand region can be extracted. The technology includes a process of recognizing a hand shape robustly against viewing angle and self-occlusion using estimation of the multi-dimensional hand movement image, and recognizing the gesture image signal by comparing and analyzing the movement of the hand image signal.

Methods that can be used to recognize such gesture image signals include: attaching optical markers to important parts of the body and recognizing motion by tracking the trajectories of these markers in the image signal input from the camera; extracting feature points of the body and recognizing motion by classifying the movement of those feature points as patterns over time; using a shape model of the human body; and representing the whole bodily appearance with various parameters and recognizing motion by interpreting how they change. In general, temporal continuity must be taken into account to stabilize the recognition result. An HMM (Hidden Markov Model) recognition method that expresses continuous postures as state changes can be used, together with the MEI (Motion Energy Image), a binary image representing the change in position of a motion over time, and the MHI (Motion History Image), which represents how the motion progresses over time.
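The MEI and MHI mentioned above can be illustrated with a small Python sketch on nested lists. It follows the commonly used formulation in which a moving pixel is set to a maximum value `tau` and other pixels decay by one per frame; the frame-difference motion mask and all function names are assumptions of this example rather than details given in the patent.

```python
from typing import List

Image = List[List[int]]


def frame_difference(prev: Image, curr: Image, threshold: int) -> Image:
    """Binary motion mask: 1 where the gray level changed by more than threshold."""
    return [[1 if abs(c - p) > threshold else 0 for p, c in zip(prow, crow)]
            for prow, crow in zip(prev, curr)]


def update_mhi(mhi: Image, motion_mask: Image, tau: int) -> Image:
    """Moving pixels are set to tau; all others decay by 1 toward 0."""
    return [[tau if m else max(0, h - 1) for h, m in zip(hrow, mrow)]
            for hrow, mrow in zip(mhi, motion_mask)]


def mei_from_mhi(mhi: Image) -> Image:
    """MEI: binary image of every pixel that moved within the last tau frames."""
    return [[1 if h > 0 else 0 for h in row] for row in mhi]
```

After a few updates, brighter MHI pixels mark more recent motion, so the image encodes the direction in which the hands moved, while the MEI only records where motion occurred.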

In addition, to remove noise contained in the video signal input through a camera or stereo camera, a single erosion morphology operation is performed. To prevent the body region from shrinking along with the noise, a dilation morphology operation can then be applied.
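The erosion-then-dilation step can be sketched as follows. This is an illustrative version that assumes a binary foreground mask stored as nested lists and a 3x3 square structuring element, neither of which is specified in the patent; the function names are hypothetical.

```python
def _morph(mask, combine):
    """Apply a 3x3 min (erosion) or max (dilation) filter to a binary mask.

    Pixels outside the image are treated as background (0).
    """
    height, width = len(mask), len(mask[0])
    out = [[0] * width for _ in range(height)]
    for y in range(height):
        for x in range(width):
            neighbours = []
            for dy in (-1, 0, 1):
                for dx in (-1, 0, 1):
                    yy, xx = y + dy, x + dx
                    inside = 0 <= yy < height and 0 <= xx < width
                    neighbours.append(mask[yy][xx] if inside else 0)
            out[y][x] = combine(neighbours)
    return out


def erode(mask):
    return _morph(mask, min)


def dilate(mask):
    return _morph(mask, max)


def remove_noise(mask):
    """One erosion to delete speckle, one dilation to restore the body region."""
    return dilate(erode(mask))


# A 3x3 "body" block plus one isolated noise pixel in the corner.
noisy = [
    [0, 0, 0, 0, 0, 0],
    [0, 1, 1, 1, 0, 0],
    [0, 1, 1, 1, 0, 0],
    [0, 1, 1, 1, 0, 0],
    [0, 0, 0, 0, 0, 0],
    [0, 0, 0, 0, 0, 1],
]
cleaned = remove_noise(noisy)
```

The erosion removes the isolated pixel and shrinks the block to its center; the dilation grows the block back to its original extent without bringing the noise pixel back.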

Methods that extract feature values from a gesture video signal input through a camera or stereo camera, in order to represent and recognize body pose and motion effectively, fall into two groups: methods that extract features from global motion information, and methods that extract features from partial motion information that is meaningful for specific body parts such as the head, hands, and feet. Such gesture or motion feature extraction can use the Motion Energy Image (MEI), the Motion History Image (MHI), Ross Cutler's method, the cardboard model, and the like.

Because image-based gesture recognition techniques in general, covering both 2D and 3D gesture video signals, can be applied to recognizing the gesture video signal, the term gesture image recognition technique in the present invention may include the overall process of conventional image recognition techniques that use single or multiple cameras, stereo cameras, and the like. Fig. 4 schematically shows the recognition time point for the clap gesture image. The gesture image recognition technique detects the moment at which the two palms meet by using the various video signal recognition methods described above, and interprets and recognizes that moment multi-dimensionally.

In the present invention, when the user claps once or n or more times, the sound and video signals generated by each clap can be used in a multimodal manner for system command recognition, which can increase recognition accuracy.
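One way to picture this multimodal use of each clap is the following sketch. It assumes that the audio recognizer and the gesture recognizer each report a list of clap timestamps in seconds, and that a clap counts only when both modalities agree within a small time window. The function name, `tolerance`, and `required_claps` are assumptions made for illustration, not values from the patent.

```python
def find_command_start(audio_clap_times, gesture_clap_times,
                       required_claps=1, tolerance=0.15):
    """Return the time at which the command start is confirmed, or None.

    A clap is confirmed when an audio detection and a gesture detection
    fall within `tolerance` seconds of each other. The command start is
    the time of the `required_claps`-th confirmed clap.
    """
    unmatched_gestures = sorted(gesture_clap_times)
    confirmed = []
    for t_audio in sorted(audio_clap_times):
        match = next(
            (t for t in unmatched_gestures if abs(t - t_audio) <= tolerance),
            None,
        )
        if match is not None:
            unmatched_gestures.remove(match)  # each gesture confirms one sound
            confirmed.append(max(t_audio, match))
    if len(confirmed) >= required_claps:
        return confirmed[required_claps - 1]
    return None


# Two real claps seen by both sensors, plus a stray sound at 4.0 s.
start = find_command_start([1.00, 1.50, 4.00], [1.05, 1.48], required_claps=2)
```

A sound with no matching gesture, such as the stray detection at 4.0 s, never contributes to the count, which is the mechanism by which requiring both modalities reduces false triggers.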

The present invention provides a method of detecting the starting point of a command efficiently by using multiple modalities in a multimodal application system. The user first claps once or n times, which yields audio modality and gesture modality information at the same time. The system recognizes this modality information, identifies the user, and grants control authority. This minimizes system command misrecognition caused by utterances or gestures that the user produces unintentionally, whether consciously or unconsciously, so the invention can be applied in a wide range of fields.
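The user identification step can be illustrated with a simple nearest-template comparison. The patent only states that extracted features are compared with features stored in a server or memory; the feature vectors, the Euclidean distance, the `max_distance` threshold, and the user names below are all hypothetical choices made for this sketch.

```python
import math


def identify_user(features, stored_templates, max_distance=1.0):
    """Return the enrolled user whose stored clap features are closest.

    `stored_templates` maps a user name to a feature vector of the same
    length as `features`. If no template is within `max_distance`, nobody
    is granted control and None is returned.
    """
    best_user, best_distance = None, float("inf")
    for user, template in stored_templates.items():
        distance = math.dist(features, template)
        if distance < best_distance:
            best_user, best_distance = user, distance
    return best_user if best_distance <= max_distance else None


# Hypothetical enrolled users with two-dimensional clap features.
templates = {"user_a": [1.0, 0.0], "user_b": [0.0, 1.0]}
controller = identify_user([0.9, 0.1], templates)
```

Returning None for an unknown clap means the command start is ignored rather than attributed to the wrong user.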

Fig. 1: A diagram showing an example of the overall system structure according to the present invention.

Fig. 2: An exemplary diagram of the overall system flow for the method of determining the starting point of a multimodal command and acquiring control authority using a clapping sound and gesture.

Fig. 3: An exemplary diagram of the signal waveform, frequency spectrum analysis result, and endpoint detection result for a clapping sound according to the present invention.

Fig. 4: An example of audio and gesture modality information acquired simultaneously from a clap according to the present invention.

Claims

A method of determining the starting point of a multi-modal command, the method comprising:

a user performing a clap one or more times, detecting the sound generated by the clap with a microphone or a stereo microphone, and capturing a clap gesture image with a camera or a stereo camera;

extracting features of the user's sound signal from the sound signal obtained from the clapping sound detected by the microphone or stereo microphone, and extracting features of the gesture video signal from the captured gesture image;

comparing the features of the extracted sound signal and of the extracted gesture video signal with the features of the clapping sound signal and gesture video signal stored in a server or memory, and recognizing the sound and gesture video signals through a clapping sound recognition unit and a clap gesture recognition unit; and

determining a command starting point through the clapping sound recognition unit and the clap gesture recognition unit, whereby the starting point of the multi-modal command is determined using the clapping sound and the gesture image.

A method of determining the starting point of a multi-modal command and acquiring control authority, the method comprising:

extracting features of the user's sound signal from the sound signal obtained from the clapping sound detected by a microphone or a stereo microphone, and extracting features of the gesture video signal from the captured gesture image;

comparing the features of the extracted sound signal and of the extracted gesture video signal with the features of the clapping sound signal and gesture video signal stored in a server or memory, and recognizing the sound and gesture video signals through a clapping sound recognition unit and a clap gesture recognition unit;

determining a starting point of a command through the clapping sound recognition unit and the clap gesture recognition unit; and

after the starting point of the command is determined, comparing and analyzing the features of the clapping sound and gesture video signals stored in the memory, whereby the user acquires authority to control the system.

The method according to claim 2,

wherein, to remove the background noise included in the clapping sound detected by the microphone or stereo microphone, the background noise is removed using a time-domain endpoint detection method based on short-term energy and zero-crossing rate, an endpoint detection method using the frequency spectrum, an MFB-based endpoint detection method, a spectral subtraction method that subdivides the sound band into individual bands and analyzes the spectrum of each band, a band energy method, or an endpoint detection method based on the FFT spectrum and Mel spectrum.

The method according to claim 2,

wherein the extracting of the features of the gesture video signal is performed using MEI, MHI, Ross Cutler's method, and a cardboard model.

The method according to claim 2,

wherein the step of recognizing the gesture video signal uses a method of attaching optical markers to important parts of the body and recognizing motion by tracking the trajectories of the markers in the gesture video signal input from the camera, a method of extracting feature points of the body and recognizing them by distinguishing the motion of the feature points as a pattern over time, a method using a shape model of the human body, or a method of representing the overall appearance of the human body with various parameters and interpreting their changes to recognize motion.

The method according to claim 2,

wherein the step of recognizing the clap gesture video signal comprises: recognizing a multi-dimensional hand shape in order to recognize the hand gesture image during the clap;

comparing and analyzing the shape of the hand and arm movement; and

extracting the hand video signal, and tracking and analyzing the multi-dimensional shape and movement of the arm by reconstructing a multi-dimensional human model.

The method according to claim 2,

wherein, in the step of extracting features from the clapping sound and clap gesture video signals, when the user claps once or n or more times, the sound and video signals generated by each clap are used for recognition in a multi-modal manner.
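As an illustration of the time-domain endpoint detection named in the claim on background noise removal, the following sketch marks a frame as clap-like when both its short-term energy and its zero-crossing rate exceed a threshold. The threshold values, the non-overlapping frames, and the function names are assumptions for this example and are not taken from the patent.

```python
def short_term_energy(frame):
    """Mean squared amplitude of one frame."""
    return sum(sample * sample for sample in frame) / len(frame)


def zero_crossing_rate(frame):
    """Fraction of adjacent sample pairs whose signs differ."""
    crossings = sum(
        1 for a, b in zip(frame, frame[1:]) if (a >= 0) != (b >= 0)
    )
    return crossings / (len(frame) - 1)


def detect_endpoints(samples, frame_len=256,
                     energy_threshold=0.05, zcr_threshold=0.5):
    """Return (start, end) sample indices of the active region, or None.

    The signal is cut into non-overlapping frames of `frame_len` samples
    (frame_len must be at least 2). Frames that are loud and rapidly
    oscillating are treated as clap sound; everything else is background.
    """
    active_starts = []
    for start in range(0, len(samples) - frame_len + 1, frame_len):
        frame = samples[start:start + frame_len]
        loud = short_term_energy(frame) > energy_threshold
        oscillating = zero_crossing_rate(frame) >= zcr_threshold
        if loud and oscillating:
            active_starts.append(start)
    if not active_starts:
        return None
    return active_starts[0], active_starts[-1] + frame_len


# Toy signal: silence, a short decaying burst, silence.
burst = [0.9, -0.8, 0.7, -0.6, 0.5, -0.4, 0.3, -0.2]
signal = [0.0] * 8 + burst + [0.0] * 8
endpoints = detect_endpoints(signal, frame_len=4)
```

Requiring both conditions is a design choice of this sketch: energy alone would also accept loud low-frequency rumble, while the zero-crossing rate alone would accept quiet hiss.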