JP2021163217A

JP2021163217A - Motion detection device, motion detection method, and program

Info

Publication number: JP2021163217A
Application number: JP2020064339A
Authority: JP
Inventors: 有二配川; Yuji Haikawa; 弘亘藤吉; Hironobu Fujiyoshi; 隆義山下; Takayoshi Yamashita
Original assignee: Honda Motor Co Ltd; Chubu University
Current assignee: Honda Motor Co Ltd; Chubu University
Priority date: 2020-03-31
Filing date: 2020-03-31
Publication date: 2021-10-11

Abstract

PROBLEM TO BE SOLVED: To provide a motion detection device, a motion detection method, and a program capable of improving detection accuracy over the prior art even when the motion is ambiguous. SOLUTION: A motion detection device includes an image acquisition unit that acquires an image of a target whose motion is to be detected; an acoustic signal acquisition unit that acquires an acoustic signal of the environment; and a detection unit that detects, as an amount of change, the movement of characteristic feature points in the acquired image, detects acoustic feature information of the acquired acoustic signal, and inputs the detected amount of change and the acoustic feature information to a trained neural network to detect a predetermined motion. [Selected drawing] FIG. 1

Description

The present invention relates to a motion detection device, a motion detection method, and a program.

For robots and humans to cooperate in real living environments, communication between humans and robots is important. Robots need to understand not only verbal expressions but also non-verbal expressions such as nods and gestures.
As a method for detecting human motion, devices that recognize motion from a moving image of the person have been proposed (see, for example, Patent Document 1 and Patent Document 2).

Patent Document 1: Japanese Unexamined Patent Publication No. 2007-94619
Patent Document 2: Japanese Unexamined Patent Publication No. H5-67213

However, in the prior art, it has been difficult to detect a motion when the motion is ambiguous.

The present invention has been made in view of the above problem, and its object is to provide a motion detection device, a motion detection method, and a program capable of improving detection accuracy over the prior art even when the motion is ambiguous.

(1) To achieve the above object, a motion detection device according to one aspect of the present invention includes: an image acquisition unit that acquires an image of a target whose motion is to be detected; an acoustic signal acquisition unit that acquires an acoustic signal of the environment; and a detection unit that detects, as an amount of change, the movement of characteristic feature points in the acquired image, detects acoustic feature information of the acquired acoustic signal, and inputs the detected amount of change and the acoustic feature information to a trained neural network to detect a predetermined motion.

(2) In the motion detection device according to one aspect of the present invention, the acoustic feature information may be the strength of the acoustic signal, based on the total power of the acoustic signal over a predetermined time, and whether or not the acoustic signal is at or above a predetermined magnitude; and the amount of change may be obtained by detecting the position of the feature point from the captured image and taking the difference between the position of the feature point detected at a first time and the position of the feature point detected at a second time.

(3) In the motion detection device according to one aspect of the present invention, the predetermined time may be a length chosen to match the frame rate of the image.

(4) In the motion detection device according to one aspect of the present invention, the detection unit may concatenate the amount of change and the acoustic feature information before they are input to the neural network.

(5) In the motion detection device according to one aspect of the present invention, the target whose motion is to be detected may be a person, and the motion may be a nod.

(6) To achieve the above object, in a motion detection method according to one aspect of the present invention, an image acquisition unit acquires an image of a target whose motion is to be detected; an acoustic signal acquisition unit acquires an acoustic signal of the environment; and a detection unit detects, as an amount of change, the movement of characteristic feature points in the acquired image, detects acoustic feature information of the acquired acoustic signal, and inputs the detected amount of change and the acoustic feature information to a trained neural network to detect a predetermined motion.

(7) To achieve the above object, a program according to one aspect of the present invention causes a computer to: acquire an image of a target whose motion is to be detected; acquire an acoustic signal of the environment; detect, as an amount of change, the movement of characteristic feature points in the acquired image; detect acoustic feature information of the acquired acoustic signal; and input the detected amount of change and the acoustic feature information to a trained neural network to detect a predetermined motion.

According to (1) to (6) above, detection accuracy can be improved over the prior art even when the motion is ambiguous.
According to (2) above, information on the movement of the detection target and information on the acoustic signal of the environment can both be acquired.
According to (3) above, the acoustic feature information can be aligned with the image-based amount of change.
According to (4) above, detection accuracy can be improved over the prior art even when the motion is ambiguous.
According to (5) above, nodding, which is an ambiguous behavior, can be detected.

FIG. 1 is a block diagram showing a configuration example of the motion detection system according to the embodiment.
FIG. 2 is a diagram showing an example that visualizes the speaker's speech intonation and the listener's nod.
FIG. 3 is a diagram for explaining the detection of facial key points using Dlib.
FIG. 4 is a diagram for explaining a configuration example and a processing example of the detection unit according to the embodiment.
FIG. 5 is a flowchart of an example of the processing procedure performed by the motion detection system according to the embodiment.
FIG. 6 is a diagram showing an example of the accuracy when nodding is determined based only on voice information (first comparative example).
FIG. 7 is a diagram showing an example of the accuracy when nodding is determined based only on the amount of change in the facial organ points (second comparative example).
FIG. 8 is a diagram showing an example of the accuracy when nodding is determined in the embodiment.
FIG. 9 is a diagram showing a first modification of the neural network according to the present embodiment.
FIG. 10 is a diagram showing a second modification of the neural network according to the present embodiment.

Hereinafter, embodiments of the present invention will be described with reference to the drawings. In the drawings used in the following description, the scale of each member is changed as appropriate so that each member is large enough to be recognized.

FIG. 1 is a block diagram showing a configuration example of the motion detection system 1 according to the present embodiment. As shown in FIG. 1, the motion detection system 1 includes a sound collecting unit 2 (acoustic signal acquisition unit), a photographing unit 3 (image acquisition unit), and a motion detection device 4.
The motion detection device 4 includes a voice processing unit 41 (acoustic signal acquisition unit), an image processing unit 42 (image acquisition unit), a detection unit 43, and an output unit 44.

The sound collecting unit 2 is a microphone. It collects an acoustic signal and outputs the collected acoustic signal to the motion detection device 4. The sound collecting unit 2 may be a microphone array including a plurality of microphones. The sound collecting unit 2 may also convert the collected acoustic signal from an analog signal to a digital signal and output the digitized acoustic signal to the motion detection device 4.

The photographing unit 3 is, for example, a CMOS (Complementary MOS) image sensor or a CCD (Charge Coupled Device) image sensor. The photographing unit 3 outputs the captured image to the motion detection device 4. The captured images may be still images taken at predetermined time intervals or a moving image.

The motion detection device 4 estimates, based on the acoustic signal (voice signal) and the image information, the probability that the listener is nodding in response to the speaker's utterance.

The voice processing unit 41 acquires the acoustic signal output by the sound collecting unit 2. The voice processing unit 41 converts the acoustic signal output by the sound collecting unit 2 from an analog signal to a digital signal (for example, 16-bit discrete data). For the acquired acoustic signal, the voice processing unit 41 detects the presence or absence of voice (whether or not the acoustic signal is at or above a predetermined magnitude) and the voice intonation (the strength of the acoustic signal). The detection method will be described later. The voice processing unit 41 outputs the detected utterance presence/absence information and the utterance intonation information to the detection unit 43 as utterance information (acoustic feature information). When the speaker is a system such as a robot, the utterance information is also time-series information on the system side.

The image processing unit 42 acquires the image captured by the photographing unit 3. When nodding is to be detected as the listener's ambiguous motion, the image processing unit 42 performs image processing on the acquired image, for example extracting the region of the listener's face and extracting feature quantities from the image corresponding to the extracted face. The image processing unit 42 discards information that is unnecessary for nod recognition. The image processing unit 42 detects the key points of the face included in the image. The image processing unit 42 calculates the amount of change in the facial organ point for each of the detected facial key points and outputs the calculated amounts of change to the detection unit 43. The method of detecting the facial key points and the method of calculating the amount of change in the facial organ points will be described later.

The detection unit 43 acquires the utterance information output by the voice processing unit 41 and the face key point information output by the image processing unit 42. The detection unit 43 integrates the acquired utterance information and face key point information and detects the listener's motion using a trained model. A configuration example and an operation example of the detection unit 43 will be described later. The detection unit 43 outputs motion information indicating the detected motion of the listener to the output unit 44.

The output unit 44 is, for example, an image display device or a printing device. The output unit 44 outputs the motion information output by the detection unit 43.

In the following examples, nodding is used as an example of an ambiguous motion of the listener, but the listener's ambiguous motion is not limited to this. It may be, for example, a blink, a hand-waving gesture, or turning the face sideways.

[Speaker's speech intonation and listener's nod]
Here, an example of the speaker's speech intonation and the listener's nod during an utterance will be described.
FIG. 2 is a diagram showing an example that visualizes the speaker's speech intonation and the listener's nod. In FIG. 2, the horizontal axis is the frame number and the vertical axis is the speech intonation (change in amplitude). In FIG. 2, the person talking is the speaker, and the person performing the nodding motion is the listener. The nodding section g11 was confirmed visually.

As shown in FIG. 2, nodding tends to occur while the speaker is talking. Also as shown in FIG. 2, when the speaker utters one sentence, the speech intonation tends to trace a hill-shaped curve. Because the listener's nod is triggered by the speaker's voice, the present embodiment uses voice information as an additional cue for the timing of the listener's nod.
The acoustic signal and the images of the listener's motion used in the present embodiment may be, for example, a recording of a video chat, or may be acquired in real time during a video chat.

[Processing of voice processing unit]
Next, an example of the processing performed by the voice processing unit 41 will be described.
The voice processing unit 41 detects the presence or absence of voice by, for example, comparing the binary value of the digitized voice signal with a voice presence threshold. For example, the voice processing unit 41 sets the entire utterance section to 1 and all other sections to 0. Alternatively, the voice processing unit 41 may detect the presence or absence of voice based on the voice power of each frame.
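The per-frame alternative can be sketched as follows. This is a minimal illustration under stated assumptions, not the implementation used in the embodiment: the frame length, the threshold value, and the function name `voice_presence` are all hypothetical.

```python
def voice_presence(samples, frame_len, threshold):
    """Return 1 for each frame whose mean power reaches the threshold, else 0.

    samples   : sequence of signed audio samples (e.g. 16-bit integers)
    frame_len : number of samples per frame (assumed value)
    threshold : mean-power threshold for "voice present" (assumed value)
    """
    flags = []
    for start in range(0, len(samples) - frame_len + 1, frame_len):
        frame = samples[start:start + frame_len]
        # Squaring removes the sign, so negative samples contribute power too.
        power = sum(x * x for x in frame) / frame_len
        flags.append(1 if power >= threshold else 0)
    return flags


if __name__ == "__main__":
    quiet = [1, -1, 2, -2]
    loud = [100, -120, 90, -110]
    print(voice_presence(quiet + loud + quiet, frame_len=4, threshold=1000.0))
```

In practice the threshold would have to be tuned to the microphone and environment; the text does not state a value.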

The voice processing unit 41 also detects voice intonation based on the amplitude of the acoustic signal. However, because the voice signal takes negative values, the voice processing unit 41 calculates the voice power from the waveform data as a preprocessing step. In addition, the sampling frequency differs between the video image sequence and the acoustic signal.

For this reason, in the present embodiment, the voice processing unit 41 uses, as the feature S(t) of the voice information, the sum of the voice power over the 10 samples before and after time t, as shown in the following equation (1). According to the present embodiment, this allows the acoustic feature information to be aligned with the image-based amount of change.
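Equation (1) itself is not reproduced in this text. From the surrounding description (the voice power s(t+i)² summed over the 10 samples before and after time t), it presumably has the following form. The summation limits, including whether the sample at i = 0 is counted, are an assumption drawn from that description.

```latex
S(t) = \sum_{i=-10}^{10} s(t+i)^{2} \tag{1}
```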

In equation (1), s(t) is the voice signal at time t, s(t+i)² is the voice power of the sample at offset i, and i is the sample index within the window. In the present embodiment, an example using the 10 samples before and after time t is described, but this number of samples is only an example and is not limiting.
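A windowed power sum of this kind can be sketched as below, assuming the feature is the sum of squared samples over a symmetric window around t. The boundary handling, the mapping from video frames to audio sample indices, and the function names are assumptions for illustration and are not specified in the text.

```python
def speech_feature(samples, t, half_window=10):
    """Sum of squared samples from t - half_window to t + half_window.

    Indices that fall outside the signal are skipped (assumed boundary rule).
    """
    lo = max(0, t - half_window)
    hi = min(len(samples), t + half_window + 1)
    return sum(x * x for x in samples[lo:hi])


def features_per_video_frame(samples, sample_rate, frame_rate, n_frames):
    """Evaluate the feature at the audio sample closest to each video frame."""
    feats = []
    for k in range(n_frames):
        # Audio sample index corresponding to the time of video frame k.
        t = round(k * sample_rate / frame_rate)
        feats.append(speech_feature(samples, t))
    return feats


if __name__ == "__main__":
    signal = [1] * 100
    print(features_per_video_frame(signal, sample_rate=100, frame_rate=10, n_frames=3))
```

Evaluating the feature once per video frame yields one audio value per image, which is what allows the two modalities to be combined frame by frame.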

In general, there are two types of nods: nods in response to another person's speech (labeled "you") and nods during one's own speech (labeled "me"). Because the present embodiment focuses on nods in response to another person's speech, only the "you" samples are used for training and evaluation.

[Processing of image processing unit]
Next, an example of the processing performed by the image processing unit 42 will be described.
In the present embodiment, the difference in the positions of the facial key points between the current frame and a previous frame is used as motion information.
Extracting features from the entire face image produces appearance features, which means that information unnecessary for nod recognition is also captured. Therefore, in the present embodiment, the differences in the movement of the facial key points are used to capture motion information while discarding unnecessary appearance information.

The image processing unit 42 performs face detection and facial key point detection using, for example, Dlib (see Reference 1).

Reference 1: Davis E. King, "Dlib-ml: A machine learning toolkit", Journal of Machine Learning Research, Vol. 10, pp. 1755-1758, 2009.

FIG. 3 is a diagram for explaining the detection of facial key points using Dlib. Reference numeral g21 indicates the detected face region, and reference numeral g22 indicates an example of the detected facial key points. As shown in FIG. 3, Dlib detects, for example, 68 facial key points.

The movement amount d_{x,i} in the x direction and the movement amount d_{y,i} in the y direction of facial key point i at time t are given by the following equation (2). The image processing unit 42 outputs the movement amounts d_{x,i} and d_{y,i} at time t to the detection unit 43 as the amount of change in the facial organ points.
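Equation (2) is not reproduced in this text. From the description of the feature as the difference between the key point position at time t and its position 5 frames earlier, it presumably has the following form. The sign convention (current position minus earlier position) is an assumption.

```latex
\begin{aligned}
d_{x,i}(t) &= x_i(t) - x_i(t-5) \\
d_{y,i}(t) &= y_i(t) - y_i(t-5)
\end{aligned} \tag{2}
```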

In equation (2), i is the index of the facial organ point, and d_{x,i}(t) and d_{y,i}(t) are the feature quantities at time t (first time). x_i(t) and y_i(t) are the facial organ point positions at time t, and x_i(t-5) and y_i(t-5) are the facial organ point positions 5 frames earlier (second time). Although the present embodiment describes an example in which the feature quantity is the difference between the position 5 frames earlier and the position at time t, the number of frames used for the comparison is not limited to 5 and may be another number.
Because the total number of facial key points is 68, combining d_{x,i}(t) and d_{y,i}(t) yields a 136-dimensional feature vector of motion information.
Frames in which Dlib could not detect a face may be interpolated using the average of the face detections before and after them.
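The displacement features and the interpolation of undetected frames can be sketched as follows. This is an illustrative sketch only: it assumes the key points have already been obtained as per-frame lists of (x, y) pairs, that the 136 values are ordered as all x displacements followed by all y displacements, and that gaps at the start or end of a sequence copy the nearest detection. None of these choices, nor the function names, come from the text.

```python
def interpolate_missing(track):
    """Fill None entries with the mean of the nearest detected neighbours.

    track: list of per-frame key point lists [(x, y), ...], or None where
    face detection failed. Leading or trailing gaps copy the nearest detection.
    """
    filled = list(track)
    for t, pts in enumerate(track):
        if pts is not None:
            continue
        prev = next((track[j] for j in range(t - 1, -1, -1) if track[j] is not None), None)
        nxt = next((track[j] for j in range(t + 1, len(track)) if track[j] is not None), None)
        if prev is None and nxt is None:
            raise ValueError("no face detected in any frame")
        if prev is None:
            prev = nxt
        if nxt is None:
            nxt = prev
        filled[t] = [((px + nx) / 2, (py + ny) / 2)
                     for (px, py), (nx, ny) in zip(prev, nxt)]
    return filled


def displacement_features(track, t, lag=5):
    """Return [d_x for every key point] + [d_y for every key point] at frame t."""
    if t < lag:
        raise ValueError("frame index must be at least the lag")
    cur, old = track[t], track[t - lag]
    dx = [c[0] - o[0] for c, o in zip(cur, old)]
    dy = [c[1] - o[1] for c, o in zip(cur, old)]
    return dx + dy


if __name__ == "__main__":
    # Synthetic track: 68 key points drifting by (+1, -1) per frame.
    track = [[(i + k, 2 * i - k) for i in range(68)] for k in range(7)]
    track[3] = None  # simulate a frame where detection failed
    filled = interpolate_missing(track)
    print(len(displacement_features(filled, 5)))
```

With 68 key points the returned vector has 136 entries, matching the dimensionality stated above.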

[Configuration example and processing example of detection unit]
Next, a configuration example and a processing example of the model included in the detection unit 43 will be described.
FIG. 4 is a diagram for explaining a configuration example and a processing example of the detection unit 43 according to the present embodiment. As shown in FIG. 4, the detection unit 43 includes a connecting unit 431 (concat), a fully connected layer 432 (FC; Fully Connected), an LSTM 433 (Long Short-Term Memory), an LSTM 434, and a fully connected layer 435. As shown in FIG. 4, the detection unit 43 is a network model with two RNN layers.

The 136-dimensional amount of change in the facial organ points and the one-dimensional utterance information are input to the detection unit 43.
The connecting unit 431 normalizes each input value of the 136-dimensional amount of change in the facial organ points and of the one-dimensional utterance information to the range 0 to 1 and concatenates them. The connecting unit 431 applies weights to these feature values during concatenation.
The fully connected layer 432 uses, for example, ReLU (Rectified Linear Unit, see Reference 2) as its activation function.
The LSTM 433 has, for example, 256 units, a dropout of 0.5, and ReLU as its activation function.
The LSTM 434 has, for example, 256 units, a dropout of 0.5, and ReLU as its activation function.
The fully connected layer 435 uses, for example, ReLU as its activation function.
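The normalization and concatenation performed by the connecting unit 431 can be sketched as below. The text states only that the inputs are normalized to 0 to 1, concatenated, and weighted; the min-max scaling with clipping, the value ranges, the per-modality weights, and the function names here are all assumptions made for illustration.

```python
def min_max_normalize(values, lo, hi):
    """Scale values linearly so lo maps to 0 and hi maps to 1, clipping outside."""
    span = hi - lo
    return [min(1.0, max(0.0, (v - lo) / span)) for v in values]


def build_input(displacements, speech, disp_range, speech_range,
                w_disp=1.0, w_speech=1.0):
    """Concatenate normalized displacements and one normalized speech value.

    disp_range, speech_range : (lo, hi) pairs used for scaling (assumed values)
    w_disp, w_speech         : hypothetical weights applied per modality
    """
    d = min_max_normalize(displacements, *disp_range)
    s = min_max_normalize([speech], *speech_range)
    return [w_disp * v for v in d] + [w_speech * v for v in s]


if __name__ == "__main__":
    vec = build_input([0.0] * 136, 50.0, (-10.0, 10.0), (0.0, 100.0))
    print(len(vec))
```

With 136 displacement values and one speech value, the concatenated vector has 137 entries per frame, which would then be passed to the fully connected layer 432.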

Reference 2: Vinod Nair, Geoffrey E. Hinton, "Rectified linear units improve restricted Boltzmann machines", Proceedings of the 27th International Conference on Machine Learning, pp. 807-814 (2010).

The detection unit 43 uses, for example, RMSprop for optimization and trains the model included in the detection unit 43 for 300 epochs with a batch size of 32. RMSprop is one of the gradient methods used in deep learning and is an improved version of AdaGrad. In the detection result, for example, "1" indicates that an utterance is being made and "0" indicates that no utterance is being made.

The activation function is not limited to ReLU and may be another activation function such as sigmoid or tanh.
Furthermore, the model is not limited to LSTM and may use any method capable of handling current and past data as a time series.

[Processing procedure]
Next, an example of the processing procedure will be described.
FIG. 5 is a flowchart of an example of the processing procedure performed by the motion detection system 1 according to the present embodiment. The following processing is performed after the model (neural network) included in the detection unit 43 has been trained using teacher data such as video conference recordings.

(Step S1) The sound collecting unit 2 collects the speaker's acoustic signal. The motion detection device 4 acquires the speaker's acoustic signal.

(Step S2) The photographing unit 3 captures an image including the listener's face. The motion detection device 4 acquires the image of the listener.

(Step S3) The voice processing unit 41 detects utterance information (presence or absence of utterance, speech intonation) from the acoustic signal.

(Step S4) The image processing unit 42 detects the face region and the facial key points from the image. The image processing unit 42 then calculates the movement amounts of the facial key points.

(Step S5) The detection unit 43 normalizes each input value of the 136-dimensional amount of change in the facial organ points and of the one-dimensional utterance information to the range 0 to 1 and concatenates them.

(Step S6) The detection unit 43 inputs the concatenated information to the model and detects whether or not a nod is occurring.

(Step S7) The output unit 44 outputs the result determined by the detection unit 43.

[Evaluation results]
Next, examples of evaluation results will be described.
FIG. 6 is a diagram showing an example of the accuracy when nodding is determined based only on voice information (first comparative example). The network structure of the detection unit used in the first comparative example includes a first LSTM, a second LSTM, and a fully connected layer, connected in that order. In the first comparative example, the one-dimensional utterance information is input to the first LSTM, and the nod determination result is output.
As shown in FIG. 6, nodding cannot be determined from the presence or absence of speech alone as input.
The speech intonation information achieves a recognition accuracy of about 42.0%. The average recognition rate is about 63.6%.
This result shows that information on the speaker's speech intonation is important for determining whether the listener is nodding.

FIG. 7 is a diagram showing an example of the accuracy when nodding is determined based only on the amount of change in the facial organ points (second comparative example). The network structure of the detection unit used in the second comparative example also includes a first LSTM, a second LSTM, and a fully connected layer, connected in that order. In the second comparative example, the 136-dimensional amount of change in the facial organ points is input to the first LSTM, and the nod determination result is output.
As described above, the movement amount of a facial key point is the difference between the positions of that key point in the current frame and in a previous frame.

The example shown in FIG. 7 evaluates the effect of the frame interval used to calculate the movement amounts of the facial key points. As shown in FIG. 7, a nod recognition accuracy of about 83.1% was achieved with an interval of 5 frames. With an interval of 1 frame, the nod recognition accuracy was higher than with 5 frames, but more false positives (a nod recognized when none occurred) arose than with the 5-frame interval. For this reason, the present embodiment uses a frame interval of 5 frames for calculating the movement amounts of the facial key points.

FIG. 8 is a diagram showing an example of the accuracy when nodding is determined in the present embodiment.
The model was trained simultaneously on both the utterance information and the movement amounts of the facial key points. As shown in FIG. 8, using the two kinds of information together achieved an overall accuracy of 84.4% in determining whether or not a nod was present. This result is more accurate than the results of the first and second comparative examples, each of which used only one of the two kinds of information.

[Modification]
The configuration of the model (neural network) of the detection unit 43 shown in FIG. 4 is only an example, and the model is not limited to it. For example, the number of LSTM units may be fewer or more than 256. The number of LSTMs is also not limited to two; it may be one, or three or more.
Further, the connecting portion 431 may be placed immediately before the fully connected layer 435. In this case, the model may include, for example, two branches of two-layer LSTMs, one for the amount of change in the facial organ points and one for the one-dimensional utterance information.

FIG. 9 is a diagram showing a first modification of the neural network according to the present embodiment. The example of FIG. 9 has a single fully connected layer; the connecting portion, an LSTM, an LSTM, and the fully connected layer are connected in this order. Even with this configuration, the nodding recognition rate (correct answer rate) can be improved over the conventional approach.

FIG. 10 is a diagram showing a second modification of the neural network according to the present embodiment. The example of FIG. 10 has a single LSTM layer; the connecting portion, a fully connected layer, the LSTM, and a fully connected layer are connected in this order. Even with this configuration, the nodding recognition rate (correct answer rate) can be improved over the conventional approach.

Further, the speaker may be an object that generates sound while operating, such as a robot. For example, the sound collecting unit 2 may collect environmental sound while the photographing unit 3 captures images of the person or object in that environment whose motion is to be detected. Alternatively, the sound collecting unit 2 may collect the speaker's utterances while the photographing unit 3 captures images of the person or object in that environment whose motion is to be detected.
When the motion detection target is an object, the motion detection device 4 may extract the moving region from the captured image of the object, treat characteristic positions in the image of the extracted region as key points, and use the amount of movement of those key points between frames in place of the amount of change in the facial organ points.

As described above, in the present embodiment, acoustic feature information is detected from the acoustic signal, and the amount of change is detected from the image. The acoustic feature information consists of the strength of the acoustic signal, based on the total power of the acoustic signal over a predetermined time, and whether or not the acoustic signal is at or above a predetermined magnitude. The amount of change is obtained by detecting the positions of feature points in the captured image and taking the difference between the position of a feature point detected at a first time and its position detected at a second time. Further, in the present embodiment, the speaker's voice power is input to the trained neural network, in addition to the video of the person's movement, in order to determine nodding.
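The acoustic feature extraction summarized above might be sketched as follows in Python; this is a minimal illustration, not the implementation of the embodiment. It assumes that the power is the sum of squared samples, that the predetermined time equals one video frame period, and that the threshold is compared against that summed power. The names `acoustic_features` and `fuse`, and all parameters, are hypothetical.

```python
from typing import List, Sequence, Tuple


def acoustic_features(samples: Sequence[float], sample_rate: int,
                      frame_rate: float,
                      threshold: float) -> List[Tuple[float, int]]:
    """Return one (summed power, above-threshold flag) pair per video frame."""
    # Assumption: the window length is one video frame period, so that the
    # acoustic features line up with the image-based features.
    window = int(round(sample_rate / frame_rate))
    if window < 1:
        raise ValueError("sample_rate must be at least frame_rate")
    features = []
    for start in range(0, len(samples) - window + 1, window):
        # Assumption: power is the sum of squared sample values.
        power = sum(s * s for s in samples[start:start + window])
        features.append((power, 1 if power >= threshold else 0))
    return features


def fuse(displacement: Sequence[float],
         acoustic: Tuple[float, int]) -> List[float]:
    """Concatenate the image-based change amount with the acoustic features
    to form one input vector for the network."""
    return list(displacement) + [acoustic[0], float(acoustic[1])]
```

Under these assumptions, a 136-dimensional change amount fused with the two acoustic values gives a 138-dimensional vector per frame; the actual input width used in the embodiment may differ.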

As a result, according to the present embodiment, information on the movement of the detection target and information on the acoustic signal of the environment can both be acquired. Consequently, a specific movement such as nodding can be detected.

A program for realizing all or part of the functions of the motion detection device 4 of the present invention may be recorded on a computer-readable recording medium, and all or part of the processing performed by the motion detection device 4 may be carried out by having a computer system read and execute the program recorded on that medium. The term "computer system" as used here includes an OS and hardware such as peripheral devices. The "computer system" also includes a WWW system provided with a homepage providing environment (or display environment). A "computer-readable recording medium" refers to a portable medium such as a flexible disk, a magneto-optical disk, a ROM, or a CD-ROM, or to a storage device such as a hard disk built into a computer system. A "computer-readable recording medium" further includes a medium that holds the program for a certain period of time, such as the volatile memory (RAM) inside a computer system that serves as a server or client when the program is transmitted via a network such as the Internet or a communication line such as a telephone line.

Further, the program may be transmitted from a computer system that stores it in a storage device or the like to another computer system via a transmission medium or by transmission waves in the transmission medium. Here, the "transmission medium" that transmits the program refers to a medium having the function of transmitting information, such as a network (communication network) like the Internet or a communication line such as a telephone line. The program may realize only part of the functions described above. It may also be a so-called difference file (difference program), which realizes those functions in combination with a program already recorded in the computer system.

Although modes for carrying out the present invention have been described above using embodiments, the present invention is in no way limited to these embodiments, and various modifications and substitutions can be made without departing from the gist of the present invention.

1 ... Motion detection system, 2 ... Sound collecting unit, 3 ... Photographing unit, 4 ... Motion detection device, 41 ... Voice processing unit, 42 ... Image processing unit, 43 ... Detection unit, 44 ... Output unit

Claims

A motion detection device comprising:
an image acquisition unit that acquires an image of a target whose motion is to be detected;
an acoustic signal acquisition unit that acquires an acoustic signal of an environment; and
a detection unit that detects, as an amount of change, the movement of characteristic feature points in the acquired image, detects acoustic feature information of the acquired acoustic signal, and inputs the detected amount of change and the acoustic feature information to a trained neural network to detect a predetermined motion.

The motion detection device according to claim 1, wherein
the acoustic feature information is the strength of the acoustic signal, based on the total power of the acoustic signal over a predetermined time, and whether or not the acoustic signal is at or above a predetermined magnitude, and
the amount of change is the difference between the position of a feature point detected at a first time and the position of that feature point detected at a second time, the positions being detected from the captured image.

The motion detection device according to claim 2, wherein
the predetermined time is a length matched to the frame rate of the image.

The motion detection device according to any one of claims 1 to 3, wherein
the detection unit concatenates the amount of change and the acoustic feature information
before they are input to the neural network.

The motion detection device according to any one of claims 1 to 4, wherein
the target whose motion is to be detected is a person, and
the motion is nodding.

A motion detection method comprising:
acquiring, by an image acquisition unit, an image of a target whose motion is to be detected;
acquiring, by an acoustic signal acquisition unit, an acoustic signal of an environment; and
detecting, by a detection unit, the movement of characteristic feature points in the acquired image as an amount of change, detecting acoustic feature information of the acquired acoustic signal, and inputting the detected amount of change and the acoustic feature information to a trained neural network to detect a predetermined motion.

A program that causes a computer to:
acquire an image of a target whose motion is to be detected;
acquire an acoustic signal of an environment;
detect, as an amount of change, the movement of characteristic feature points in the acquired image;
detect acoustic feature information of the acquired acoustic signal; and
input the detected amount of change and the acoustic feature information to a trained neural network to detect a predetermined motion.