
JP7361460B2 - Communication devices, communication programs, and communication methods

Info

Publication number: JP7361460B2
Application number: JP2018182423A
Authority: JP
Inventors: 尚也川畑
Original assignee: Oki Electric Industry Co Ltd
Current assignee: Oki Electric Industry Co Ltd
Priority date: 2018-09-27
Filing date: 2018-09-27
Publication date: 2023-10-16
Anticipated expiration: 2038-09-27
Also published as: JP2020053882A

Description

The present invention relates to a communication device, a communication program, and a communication method, and is applicable to, for example, a communication device used in a video conference system, a telephone conference system, or the like.

In recent years, there have been increasing opportunities to hold calls and communicate with remote locations, for example in video conferences and telework, using communication systems such as video conference systems and telephone conference systems.

In a remote call system, to talk with a party at a remote location, the user enters or selects contact information such as the other party's telephone number with an input device connected to the remote call system (for example, a mouse, keyboard, or remote control) and then connects. In recent years, with the spread of mobile terminals (for example, smartphones and tablet computers), some remote call systems also run on mobile terminals. In this case, the remote call system is usually connected to the other party by entering the contact information with the keyboard displayed on the screen of the mobile terminal, by touching the contact information displayed on the touch panel display of the mobile terminal, or by touching the video of the other party displayed on the screen of the mobile terminal.

Further, Patent Document 1 proposes a communication support robot system in which a remote call system is built into a robot to support communication between close relatives and an elderly person living alone.

In the communication support robot system described in Patent Document 1, touching the video of the close relative or elderly person displayed on a touch panel display connects the user to the other party, and the call starts.

Patent Document 1: Japanese Patent Application Publication No. 2015-184597

However, the communication support robot system described in Patent Document 1 starts and ends calls in the same way as conventional remote call systems, for example by entering the other party's contact information with an input device or by touching contact information displayed on the touch panel display. Connecting to a remote location in this conventional way differs from how an actual face-to-face conversation begins, so the sense of presence (for example, the feeling of talking face to face) is very low.

To solve this problem, one could, for example, use the voice recognition system installed in the communication support robot described in Patent Document 1. The user would utter a calling voice (hereinafter, "calling voice") consisting of the name of the party to connect to and a command that starts a conversation (for example, "person's name + hello" or "person's name + good evening"), the utterance would be input to the voice recognition system, and the communication support robot would determine the connection destination from the voice recognition result and start the connection.

However, the other party is determined and connected only after the calling voice has been input to the voice recognition system and the voice recognition result has been obtained, so the calling voice itself is not conveyed to the other party. From the other party's perspective the connection is made abruptly, which causes discomfort and unease and does not improve the sense of presence.

In addition, when the usage environment is noisy, for example because people nearby are talking or air conditioning noise is loud, the communication support robot described in Patent Document 1 cannot pick up the calling voice with good quality even when the user utters it, and connection to and disconnection from the other party do not work correctly.

Furthermore, the communication support robot system described in Patent Document 1 is intended for an elderly person living alone on the robot side, so the remote call system can be used by only one person and cannot be used by a plurality of users.

Therefore, a communication device, a communication program, and a communication method are desired that, even in a noisy usage environment, can pick up a calling voice uttered by any of a plurality of users, correctly recognize the picked-up calling voice, start the connection with the other party based on that calling voice, and then convey the calling voice to the other party so that a call with a sense of presence can begin. When the call is to be ended, they should correctly recognize a voice for disconnecting the call (hereinafter, "disconnection voice") uttered by any of the users and disconnect the call with the other party once the conversation has ended.

The present invention has been made in view of the above points. It uses a microphone array to perform signal processing that emphasizes speech uttered by a plurality of speakers, holds the processed signal in a buffer, and at the same time performs voice recognition on the processed signal. It then determines whether the voice recognition result is a calling voice. If it is, the device connects to the other party and then outputs the calling voice held in the buffer, so that the calling voice is conveyed to the other party and the call can begin. When a call is to be ended, the device again uses the microphone array to emphasize speech uttered by a plurality of users and performs voice recognition on the processed signal. It then determines whether the voice recognition result is a disconnection voice and, if it is, disconnects the call with the other party. The invention thus aims to provide a calling processing device that can reproduce a state closer to face-to-face conversation.

For example, picking up the calling voices of a plurality of speakers in a noisy environment is addressed by signal processing that uses a microphone array to emphasize speech. Reproducing the calling voice on the other party's side, and thereby improving the sense of presence, is addressed by outputting the calling voice held in the buffer.

The communication device according to the first aspect of the present invention comprises:
(1) a person detection unit that detects one or more persons from an input video signal and acquires information on the position of each detected person;
(2) a signal processing unit that forms the directivity of one or more microphones based on the position information of each person from the person detection unit and extracts the voice signal of each person;
(3) a holding unit that is a buffer for having the connection command voice, which is transmitted after connection with the other party, reproduced on the other party's side, and that holds for a certain period the voice signal of each person, including the connection command voice, produced by the signal processing unit;
(4) a voice recognition unit that performs voice recognition based on the input voice signal when one or more persons are detected by the person detection unit;
(5) a command storage unit that stores a plurality of commands including at least a connection command for starting a connection with a connection destination and a disconnection command for cutting the connection;
(6) a command determination unit that determines whether the voice recognition result of the voice recognition unit includes a connection command or a disconnection command stored in the command storage unit;
(7) an output switching unit that determines the output voice signal according to the command determination result of the command determination unit; and
(8) a connection determination unit that performs connection processing with the other party's connection destination based on the voice recognition result and the command determination result.
When the voice recognition result includes a connection command, the connection determination unit performs connection processing with the other party's connection destination, and the output switching unit, after the connection with the other party's connection destination is established, outputs the voice signal including the connection command held in the holding unit and then outputs the signal processed by the signal processing unit.

The communication program according to the second aspect of the present invention causes a computer to function as:
(1) a person detection unit that detects one or more persons from an input video signal and acquires information on the position of each detected person;
(2) a signal processing unit that forms the directivity of one or more microphones based on the position information of each person from the person detection unit and extracts the voice signal of each person;
(3) a holding unit that is a buffer for having the connection command voice, which is transmitted after connection with the other party, reproduced on the other party's side, and that holds for a certain period the voice signal of each person, including the connection command voice, produced by the signal processing unit;
(4) a voice recognition unit that performs voice recognition based on the input voice signal when one or more persons are detected by the person detection unit;
(5) a command storage unit that stores a plurality of commands including at least a connection command for starting a connection with a connection destination and a disconnection command for cutting the connection;
(6) a command determination unit that determines whether the voice recognition result of the voice recognition unit includes a connection command or a disconnection command stored in the command storage unit;
(7) an output switching unit that determines the output voice signal according to the command determination result of the command determination unit; and
(8) a connection determination unit that performs connection processing with the other party's connection destination based on the voice recognition result and the command determination result.
When the voice recognition result includes a connection command, the connection determination unit performs connection processing with the other party's connection destination, and the output switching unit, after the connection with the other party's connection destination is established, outputs the voice signal including the connection command held in the holding unit and then outputs the signal processed by the signal processing unit.

In the communication method according to the third aspect of the present invention:
(1) a person detection unit detects one or more persons from an input video signal and acquires information on the position of each detected person;
(2) a signal processing unit forms the directivity of one or more microphones based on the position information of each person from the person detection unit and extracts the voice signal of each person;
(3) a holding unit, which is a buffer for having the connection command voice transmitted after connection with the other party reproduced on the other party's side, holds for a certain period the voice signal of each person, including the connection command voice, produced by the signal processing unit;
(4) a voice recognition unit performs voice recognition based on the input voice signal when one or more persons are detected by the person detection unit;
(5) a command storage unit stores a plurality of commands including at least a connection command for starting a connection with a connection destination and a disconnection command for cutting the connection;
(6) a command determination unit determines whether the voice recognition result of the voice recognition unit includes a connection command or a disconnection command stored in the command storage unit;
(7) an output switching unit determines the output voice signal according to the command determination result of the command determination unit;
(8) a connection determination unit performs connection processing with the other party's connection destination based on the voice recognition result and the command determination result;
(9) when the voice recognition result includes a connection command,
(10) the connection determination unit performs connection processing with the other party's connection destination, and
(11) the output switching unit, after the connection with the other party's connection destination is established, outputs the voice signal including the connection command held in the holding unit and then outputs the signal processed by the signal processing unit.

According to the present invention, even in a noisy usage environment, speech uttered by a plurality of users is emphasized and it is determined whether the speech is a calling voice. If it is, the connection with the other party is started based on the calling voice and the calling voice is then conveyed to the other party, so that a call with a sense of presence can begin. When the call is to be ended, speech uttered by the users is emphasized and it is determined whether the speech is a disconnection voice. If it is, the call with the other party can be disconnected.

Further, according to the present invention, even when the user is away from the microphone, the connection is started with the same words used in an actual face-to-face conversation, namely the name of the other party and words that open a conversation, and is ended with words that close a conversation. This reproduces the way a conversation begins and ends, so both parties can feel a high sense of presence.

FIG. 1 is a block diagram showing the configuration of a communication device according to a first embodiment.
FIG. 2 is an explanatory diagram illustrating, for the first embodiment, an example of the arrangement of equipment related to a communication device installed in a room at one base and its positional relationship with users.
A further drawing is an explanatory diagram illustrating a configuration example of the command list unit according to the first embodiment.
A further drawing is a block diagram showing the configuration of a communication device according to a second embodiment.
A further drawing is an explanatory diagram (part 1) illustrating, for the second embodiment, an example of the arrangement of equipment related to a communication device installed in a room at one base and its positional relationship with users.
A further drawing is an explanatory diagram (part 2) illustrating, for the second embodiment, an example of the arrangement of equipment related to a communication device installed in a room at one base and its positional relationship with users.

(A) First Embodiment
Embodiments of the communication device, communication program, and communication method of the present invention are described in detail below with reference to the drawings.

The first embodiment illustrates a case where the communication device, communication program, and communication method of the present invention described above are applied to, for example, the microphone input section of a video conference system, a telephone conference system, or the like.

(A-1) Configuration of the First Embodiment
FIG. 1 is a block diagram showing the configuration of a communication device 100 according to the first embodiment.

The communication device 100 of the first embodiment may, for example, be built as a dedicated board, be realized by writing a remote communication program to a DSP (digital signal processor), or be realized by a CPU and software executed by the CPU (for example, a remote communication program). In any of these cases, its functions can be represented as shown in FIG. 1.

The communication device 100 exchanges video signals and sound signals, via a network 107, with a communication device installed at the other party's base in a remote location, and thereby enables communication with the other party. Here, it is assumed that a communication device 100 as illustrated in FIG. 1 is also installed at the other party's base.

In FIG. 1, the communication device 100 according to the first embodiment includes a microphone array 101, a microphone amplifier 102, an analog-to-digital (AD) conversion unit 103, a video camera 104, a calling processing unit 105, a NW communication unit 106, a digital-to-analog (DA) conversion unit 108, a speaker amplifier 109, speakers 110a and 110b, and a monitor 111.

The microphone array 101 has a plurality of microphones (hereinafter also called "mics") that receive human voices and other sounds.

The microphone amplifier 102 amplifies each of the plurality of input signals (analog sound signals) received by the microphones of the microphone array 101 and outputs the amplified signals to the AD conversion unit 103.

The AD conversion unit 103 converts the plurality of input signals amplified by the microphone amplifier 102 from analog signals to digital signals and outputs them to the communication device 100. Hereinafter, a signal converted by the AD conversion unit 103 is also called a "microphone input signal."

The video camera 104 is an image capture device installed at the local base (the base where the communication device 100 is installed). The video signal captured by the video camera 104 is output to the NW communication unit 106 via the calling processing unit 105, and the NW communication unit 106 transmits the video signal to the network 107.

The calling processing unit 105 receives the video signal from the video camera 104 and the microphone input signals picked up by the microphone array 101. The video signal input from the video camera 104 is output to the NW communication unit 106 via the calling processing unit 105 and transmitted to the network 107.

The calling processing unit 105 determines whether a person appears in the video signal input from the video camera 104. When it determines that a person appears in the video signal, the calling processing unit 105 performs signal processing on the plurality of input microphone input signals, outputs the processed signal to the NW communication unit 106, and at the same time stores the processed signal in the audio buffer unit 117.

The calling processing unit 105 also performs voice recognition on the processed signal and determines whether the voice recognition result matches one of the commands in the command list unit 119. When the voice recognition result matches one of the commands, the calling processing unit 105 outputs the connection determination result and, for a certain period, the sound signal stored in the audio buffer unit 117 to the NW communication unit 106. When output for that period is complete, the calling processing unit 105 resumes outputting the processed signal to the NW communication unit 106. When the voice recognition result does not match any command, the calling processing unit 105 outputs the connection determination result and the processed signal to the NW communication unit 106.

On the other hand, when it determines that no person appears in the video signal, the calling processing unit 105 stops signal processing. In this case, no audio is output to the NW communication unit 106.
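The branching described above can be sketched as follows. This is a minimal illustration only, not the patented implementation: the function name `call_processing_output` and the representation of audio as lists of frames are assumptions introduced here.

```python
def call_processing_output(person_present, command_matched, held_frames, live_frames):
    """Return the audio frames handed to the NW communication unit, in order.

    person_present  : result of the person check on the video signal
    command_matched : True when the recognition result matches a listed command
    held_frames     : frames kept in the audio buffer (they contain the spoken command)
    live_frames     : frames currently coming out of the signal processing
    All four names are hypothetical and used only for this sketch.
    """
    if not person_present:
        # No person in the video: signal processing stops and nothing is output.
        return []
    if command_matched:
        # Output the held audio first, then go back to the live processed signal.
        return list(held_frames) + list(live_frames)
    # No command: only the live processed signal is output.
    return list(live_frames)
```

For example, `call_processing_output(True, True, ["held-1", "held-2"], ["live-1"])` returns the two held frames followed by the live frame, which is the ordering that lets the other party hear the calling voice after the connection is made.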

The NW communication unit 106 transmits and receives video signals and sound signals to and from the communication device 100 installed at the other party's base via the network 107. The NW communication unit 106 performs connection processing with the network 107 based on the connection determination result from the calling processing unit 105. That is, upon receiving an instruction to connect to the network 107, the NW communication unit 106 starts a connection with the designated communication device 100 on the other party's side. After the connection to the network 107 is established, the communication device 100 exchanges audio with the other party's communication device 100 via the NW communication unit 106.

Upon receiving an instruction to disconnect from the network 107, the NW communication unit 106 disconnects from the other party's communication device 100.

The DA conversion unit 108 converts the sound signal from the network 107 (the sound signal received via the NW communication unit 106) from a digital signal to an analog signal and outputs it to the speaker amplifier 109.

The speaker amplifier 109 amplifies the analog signal converted by the DA conversion unit 108 and outputs it to the speakers 110a and 110b.

The speakers 110a and 110b convert electrical signals into air vibrations and output them as sound. The first embodiment illustrates the case where the speakers 110a and 110b are stereo speakers, but they are not limited to stereo speakers.

The monitor 111 is a video output device. The video output by the monitor 111 is, for example, video captured by the video camera 104 installed at the other party's base. This video (encoded data) is received by the NW communication unit 106 via the network 107, decoded, and then input to the monitor 111.

Next, the detailed configuration of the calling processing unit 105 according to the first embodiment is described.

The calling processing unit 105 includes a sound input terminal 115, a video input terminal 112, a video output terminal 113, a person position detection unit 114, a signal processing unit 116, an audio buffer unit 117, a voice recognition unit 118, a command list unit 119, a command determination unit 120, an output switching unit 121, a sound output terminal 122, a connection determination unit 123, and a connection determination result output terminal 124.

The video input terminal 112 is an interface unit that receives the video signal from the video camera 104.

The video output terminal 113 is an interface unit that outputs the video signal input from the video camera 104 to the NW communication unit 106.

The person position detection unit 114 determines whether a person appears in the video signal input from the video camera 104 and outputs the determination result to the signal processing unit 116 and the voice recognition unit 118. For example, the person position detection unit 114 determines by image processing on the input video signal whether a person appears in a video frame. When it detects a person, it outputs a determination result indicating detection (for example, "1"); otherwise, it outputs a determination result indicating that no person was detected (for example, "0").

When it determines that a person appears in the video signal, the person position detection unit 114 also outputs information on the direction in which the person is located to the signal processing unit 116. When it detects a plurality of persons, it outputs direction information for each detected person.
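The source does not say how the direction information is computed. As one possible illustration, the sketch below converts the horizontal pixel range of each detected person into an azimuth angle with a simple pinhole camera model. The function name `detection_result`, the bounding-box input, and the field-of-view parameter are all assumptions made for this example.

```python
import math

def detection_result(boxes, frame_width, horizontal_fov_deg):
    """Return (flag, directions) for one video frame.

    boxes              : list of (x_min, x_max) pixel ranges, one per detected person
                         (hypothetical output of some person detector)
    frame_width        : frame width in pixels
    horizontal_fov_deg : camera horizontal field of view in degrees (assumed known)
    flag is 1 when at least one person is detected and 0 otherwise.
    directions holds one azimuth in degrees per person, 0 being the camera axis.
    """
    flag = 1 if boxes else 0
    # Pinhole model: focal length in pixels derived from the field of view.
    focal_px = (frame_width / 2) / math.tan(math.radians(horizontal_fov_deg) / 2)
    centre = frame_width / 2
    directions = [
        math.degrees(math.atan(((x_min + x_max) / 2 - centre) / focal_px))
        for x_min, x_max in boxes
    ]
    return flag, directions
```

With a 640-pixel-wide frame and a 90-degree field of view, a person centred in the frame maps to roughly 0 degrees and a person at the left edge to roughly -45 degrees.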

The sound input terminal 115 is an interface unit that receives the microphone input signals from the AD conversion unit 103.

The signal processing unit 116 performs signal processing on the input microphone input signals and outputs the processed signal (hereinafter also called the "microphone array processed signal") to the output switching unit 121, the audio buffer unit 117, and the voice recognition unit 118.
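The source states only that the directivity of the microphones is formed from the position information of each person. Delay-and-sum beamforming is one common way to do that, and the sketch below uses it purely as a stand-in. The uniform linear array geometry, whole-sample delays, and all function names are assumptions, not details taken from the source.

```python
import math

SPEED_OF_SOUND = 343.0  # metres per second, assumed room-temperature value

def steering_delays(num_mics, spacing_m, azimuth_deg, sample_rate):
    """Relative arrival delay, in whole samples, of a plane wave at each mic.

    Mics are assumed to lie on a line with equal spacing; the azimuth is
    measured from broadside, positive toward the higher-numbered mics.
    """
    sin_az = math.sin(math.radians(azimuth_deg))
    raw = [-i * spacing_m * sin_az / SPEED_OF_SOUND * sample_rate
           for i in range(num_mics)]
    earliest = min(raw)
    # Shift so the earliest mic has delay 0, then round to whole samples.
    return [round(r - earliest) for r in raw]

def delay_and_sum(channels, delays):
    """Align each channel by its delay and average the aligned samples."""
    length = min(len(ch) - d for ch, d in zip(channels, delays))
    return [
        sum(ch[n + d] for ch, d in zip(channels, delays)) / len(channels)
        for n in range(length)
    ]
```

Steering toward the direction reported for a person aligns that person's speech across channels so it adds coherently, while sound from other directions is averaged down. A practical system would likely use fractional delays or a more selective beamformer.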

オーディオバッファ部１１７は、信号処理部１１６により信号処理信号を、一定時間保持するバッファである。オーディオバッファ部１１７は、一定時間経過後、保持している信号を出力切替部１２１に出力する。 The audio buffer section 117 is a buffer that holds the signal processed by the signal processing section 116 for a certain period of time. The audio buffer section 117 outputs the held signal to the output switching section 121 after a certain period of time has elapsed.

音声認識部１１８は、信号処理部１１６により信号処理されたマイクアレイ処理信号を音声認識して、その音声認識結果をコマンド判定部１２０に出力する。 The speech recognition section 118 performs speech recognition on the microphone array processed signal processed by the signal processing section 116 and outputs the speech recognition result to the command determination section 120 .

コマンドリスト部１１９は、コマンドの一覧が保持されているテキストファイルである。ここで、コマンドには、様々なコマンドを含むことができるが、この実施形態では、コマンドの一例として接続コマンドと切断コマンドとが含まれる。コマンドの一例である接続コマンドと切断コマンドの詳細な説明は後述する。 The command list section 119 is a text file that holds a list of commands. Here, the command can include various commands, and in this embodiment, a connection command and a disconnection command are included as examples of commands. A detailed explanation of the connection command and disconnection command, which are examples of commands, will be given later.

コマンド判定部１２０は、音声認識部１１８からの音声認識結果がコマンドリスト部１１９に保持されているコマンドに存在するか否か判定するものであり、その判定結果を、出力切替部１２１及び接続判定部１２３に出力する。コマンド判定部１２０による判定方法の詳細な説明は後述する。 The command determination unit 120 determines whether the speech recognition result from the speech recognition unit 118 matches a command held in the command list unit 119, and outputs the determination result to the output switching unit 121 and the connection determination unit 123. The determination method used by the command determination unit 120 is described in detail later.

出力切替部１２１は、信号処理部１１６とオーディオバッファ部１１７とに接続しており、コマンド判定部１２０による判定結果に応じて、信号処理部１１６からの出力信号と、オーディオバッファ部１１７からの出力信号とのいずれかを切り替えて、音出力端子１２２に出力する。 The output switching unit 121 is connected to the signal processing unit 116 and the audio buffer unit 117, and switches between the output signal from the signal processing unit 116 and the output from the audio buffer unit 117 according to the determination result by the command determination unit 120. The signal is switched and output to the sound output terminal 122.

音出力端子１２２は、出力切替部１２１により切り替えられた音信号を出力するインタフェース部である。音出力端子１２２から出力される音信号が、呼びかけ処理部１０５から出力される音信号となる。 The sound output terminal 122 is an interface unit that outputs the sound signal switched by the output switching unit 121. The sound signal output from the sound output terminal 122 becomes the sound signal output from the calling processing section 105.

接続判定部１２３は、コマンド判定部１２０により判定された判定結果に基づいて、ネットワーク１０７との接続判定を行なうものである。 The connection determination unit 123 determines the connection to the network 107 based on the determination result determined by the command determination unit 120.

例えば、音声認識結果が、人名と接続コマンドと続けて一致するとの判定結果である場合、接続判定部１２３は、接続コマンドに基づいて、相手側の接続先であるコミュニケーション装置１００を決定し、決定した接続先に関する情報と、当該接続先への接続指示とを含む接続判定結果をＮＷ通信部１０６に出力する。音声認識結果が切断コマンドと一致するとの判定結果である場合、接続判定部１２３は、接続している相手側のコミュニケーション装置１００との接続切断指示を含む接続判定結果をＮＷ通信部１０６に出力する。 For example, when the determination result indicates that the speech recognition result matches a person's name followed by a connection command, the connection determination unit 123 determines, based on the connection command, the communication device 100 that is the connection destination on the other party's side, and outputs to the NW communication unit 106 a connection determination result containing information on the determined destination and an instruction to connect to it. When the determination result indicates that the speech recognition result matches a disconnection command, the connection determination unit 123 outputs to the NW communication unit 106 a connection determination result containing an instruction to disconnect from the connected communication device 100 on the other party's side.

接続判定結果出力端子１２４は、接続判定部１２３からの接続判定結果を、ＮＷ通信部１０６に出力する。 The connection determination result output terminal 124 outputs the connection determination result from the connection determination section 123 to the NW communication section 106.

（Ａ－２）第１の実施形態の動作 (A-2) Operation of First Embodiment
次に、第１の実施形態に係るコミュニケーション装置１００における処理動作を、図面を参照しながら詳細に説明する。 Next, processing operations in the communication device 100 according to the first embodiment will be described in detail with reference to the drawings.

図２は、第１の実施形態において、一方の拠点の部屋内に設置されるコミュニケーション装置に係る機器の配置や使用者との位置関係の一例を説明する説明図である。なお、他方の拠点においても図２と同様に、コミュニケーション装置１００が設置されているものとする。 FIG. 2 is an explanatory diagram illustrating an example of the arrangement of equipment related to a communication device installed in a room at one base and the positional relationship with users in the first embodiment. Note that a communication device 100 is assumed to be installed at the other base in the same manner as in FIG. 2.

図２において、部屋１５１は例えば会議室であり、部屋１５１の高さは、モニター１１１を簡単に設置でき、かつ十分に余裕のある高さ（例えば、モニター１１１の高さ＋数ｍ、または２ｍ以上）があれば良く、部屋１５１の大きさ（面積）は、モニター１１１やマイクアレイ１０１、スピーカ１１０ａ及び１１０ｂなどが簡単に設置でき、かつ、十分に余裕がある広さ、または使用者１５２ａ及び１５２ｂが会話するのに十分広さ（例えば、横縦数ｍ）があれば良い。 In FIG. 2, the room 151 is, for example, a conference room. The room 151 only needs a ceiling height at which the monitor 111 can be installed easily with ample clearance (for example, the height of the monitor 111 plus several meters, or 2 m or more), and a floor area large enough that the monitor 111, the microphone array 101, the speakers 110a and 110b, and the like can be installed easily with ample room, or large enough for the users 152a and 152b to converse (for example, several meters in each direction).

まず、コミュニケーション装置１００の動作が開始すると、モニター１１１は、相手側の拠点のコミュニケーション装置１００のビデオカメラ１０４で撮影している映像を表示する。 First, when the communication device 100 starts operating, the monitor 111 displays the video captured by the video camera 104 of the communication device 100 at the other party's base.

つまり、コミュニケーション装置１００が動作開始し、自拠点のビデオカメラ１０４が起動すると、ビデオカメラ１０４で撮影された映像信号は、呼びかけ処理部１０５を介してＮＷ通信部１０６に与えられ、ＮＷ通信部１０６が、ネットワーク１０７を通じて、相手側の拠点のＮＷ通信部１０６に映像信号を送信する。これにより、自拠点の映像は相手側の拠点のモニター１１１に表示される。同様に、相手拠点の映像が自拠点のモニター１１１に表示される。 That is, when the communication device 100 starts operating and the video camera 104 at the own base is activated, the video signal captured by the video camera 104 is given to the NW communication unit 106 via the calling processing unit 105, and the NW communication unit 106 transmits the video signal to the NW communication unit 106 at the other party's base via the network 107. As a result, the video of the own base is displayed on the monitor 111 at the other party's base. Similarly, the video of the other party's base is displayed on the monitor 111 at the own base.

このとき、両拠点のコミュニケーション装置１００は音声信号を送受信しておらず、両拠点とも相手側のビデオカメラ１０４で撮影した映像だけがモニター１１１に表示されて、お互いの拠点の様子を確認できる。 At this time, the communication devices 100 at both bases are not transmitting or receiving audio signals, and only the video captured by the video camera 104 of the other base is displayed on the monitor 111 at both bases, so that the situation at each base can be confirmed.

また、ビデオカメラ１０４により撮影された映像信号は、呼びかけ処理部１０５の映像入力端子１１２に入力され、映像信号が人物位置検知部１１４に入力される。 Further, the video signal captured by the video camera 104 is input to the video input terminal 112 of the calling processing section 105, and the video signal is input to the person position detection section 114.

コミュニケーション装置１００が動作開始後から人がコミュニケーション装置１００に近づくまでは，各拠点のコミュニケーション装置１００は音声信号を送受信しておらず、両拠点とも相手側のビデオカメラ１０４で撮影した映像だけがモニター１１１に表示されて、お互いの拠点の様子を確認できる状態になっている。 From the time the communication device 100 starts operating until a person approaches it, the communication device 100 at each base neither transmits nor receives audio signals. At both bases, only the video captured by the other side's video camera 104 is displayed on the monitor 111, so that each base can see the situation at the other.

コミュニケーション装置１００が動作開始してしばらくすると、相手拠点にいる人と通話を試みようとする使用者１５２ａ及び１５２ｂは、相手側の拠点の映像を見て、通話相手を探したり、確認したりするためにモニター１１１に近づく。このとき、図２に例示するように、ビデオカメラ１０４はモニター１１１付近に設置されている（図２の例では、モニター１１１の上部にビデオカメラ１０４が設置されている）ため、ビデオカメラ１０４は、モニター１１１に近づく使用者１５２ａ及び１５２ｂを撮影し、使用者１５２ａ及び１５２ｂが映っている映像信号が呼びかけ処理部１０５の映像入力端子１１２に入力される。 Some time after the communication device 100 starts operating, the users 152a and 152b, who wish to talk to a person at the other party's base, approach the monitor 111 to look at the video of that base and to find or confirm the person they want to talk to. At this time, as illustrated in FIG. 2, the video camera 104 is installed near the monitor 111 (in the example of FIG. 2, on top of the monitor 111), so the video camera 104 captures the users 152a and 152b approaching the monitor 111, and a video signal showing the users 152a and 152b is input to the video input terminal 112 of the calling processing unit 105.

呼びかけ処理部１０５の映像入力端子１１２に、使用者１５２ａ及び１５２ｂが映っている映像信号が入力され始めると、ビデオカメラ１０４の映像信号が人物位置検知部１１４に入力される。 When a video signal showing the users 152a and 152b begins to be input to the video input terminal 112 of the calling processing unit 105, a video signal from the video camera 104 is input to the person position detection unit 114.

人物位置検知部１１４は、映像信号に映っている使用者１５２ａ及び１５２ｂの２人を検知し、人物位置検知部１１４は、人が映っていることを示す判定結果（例えば「１」など）を信号処理部１１６及び音声認識部１１８に出力すると共に、映像フレームにおける使用者１５２ａ及び１５２ｂの位置に関する情報（例えば、方向情報）を信号処理部１１６に出力する。 The person position detection unit 114 detects the two users 152a and 152b shown in the video signal, outputs a determination result indicating that a person is shown (for example, "1") to the signal processing unit 116 and the speech recognition unit 118, and outputs information on the positions of the users 152a and 152b in the video frame (for example, direction information) to the signal processing unit 116.

さらに、相手側の拠点にいる人と通話を試みようとする使用者１５２ａと１５２ｂのいずれかは、通話したい相手を呼びかけるために、呼びかけ音声を発声する。ここで、呼びかけ音声とは、相手側の拠点で通話したい相手を呼びかける音声であると共に、相手側の拠点との通話を開始するものとして機能する。呼びかけ音声は、実際に対面して会話をする際に用いられる言葉を含むことが望ましい（例えば、「人名＋こんにちは」、「人名＋こんばんは」など）。これにより、コミュニケーション装置１００を通じて相手側の拠点の人と通話をする際に違和感なく通話を開始させることができる。 Furthermore, either user 152a or 152b who attempts to talk to a person at the other party's base utters a calling voice in order to call out the person with whom the user wishes to talk. Here, the calling voice is a voice that calls out to the person with whom the user wants to talk at the other party's base, and also functions as a voice that starts a call with the other party's base. It is preferable that the greeting voice includes words used when actually having a face-to-face conversation (for example, "person's name + hello", "person's name + good evening", etc.). Thereby, when talking to a person at the other party's base through the communication device 100, it is possible to start the call without feeling uncomfortable.

使用者１５２ａ又は１５２ｂのいずれかが発話した呼びかけ音声は、マイクアレイ１０１の各マイクに受音される。このとき、部屋１５１における環境音も各マイクに受音されるため、各マイクに受音される音信号は、使用者１５２ａ及び１５２ｂが発話した音声信号に環境音が重畳した信号となる。 The calling voice uttered by either the user 152a or 152b is received by each microphone of the microphone array 101. At this time, since the environmental sounds in the room 151 are also received by each microphone, the sound signals received by each microphone are signals in which the environmental sounds are superimposed on the audio signals uttered by the users 152a and 152b.

マイクアレイ１０１の各マイクに入力したアナログの音信号は、マイクアンプ１０２で増幅され、ＡＤ変換部１０３でアナログ信号からデジタル信号に変換され、呼びかけ処理部１０５の音入力端子１１５にマイク入力信号ｘ（ｍ，ｎ）として入力される。なお、マイク入力信号ｘ（ｍ，ｎ）において、ｍはマイクアレイ１０１内の各マイクを識別するパラメータであり、ｎは入力信号の時系列を示すパラメータである。 The analog sound signal input to each microphone of the microphone array 101 is amplified by the microphone amplifier 102, converted from an analog signal to a digital signal by the AD conversion unit 103, and input to the sound input terminal 115 of the calling processing unit 105 as the microphone input signal x(m, n). In the microphone input signal x(m, n), m is a parameter that identifies each microphone in the microphone array 101, and n is a parameter that indicates the time index of the input signal.

呼びかけ処理部１０５の音入力端子１１５に信号が入力され始めると、まず、マイク入力信号ｘ（ｍ，ｎ）が信号処理部１１６に入力される。 When a signal starts to be input to the sound input terminal 115 of the calling processing section 105, first, a microphone input signal x (m, n) is input to the signal processing section 116.

人物位置検知部１１４でビデオカメラ１０４の映像信号に人が映っていると判定されたとき、信号処理部１１６は、人物位置検知部１１４から各人物の位置に関する情報（例えば、方向情報）を用いて、マイク入力信号ｘ（ｍ，ｎ）に対してマイクアレイ処理を行い、指向性処理や音源を分離する音源分離処理をする。 When the person position detection unit 114 determines that a person is reflected in the video signal of the video camera 104, the signal processing unit 116 uses information regarding the position of each person (for example, direction information) from the person position detection unit 114. Then, microphone array processing is performed on the microphone input signal x(m, n), and directionality processing and sound source separation processing for separating the sound sources are performed.

このように、映像信号に人が映っていると判定されたときに、信号処理部１１６が信号処理を行うことで、使用環境となる部屋１５１に使用者以外の人がいるような場合でも、使用者以外の人の音声をマイクが受音して、誤って相手側の拠点と接続することなく、モニター１１１の前にいる使用者の音声を正しく捉えることができる。 In this way, when it is determined that a person is reflected in the video signal, the signal processing unit 116 performs signal processing, so that even if there are people other than the user in the room 151, which is the usage environment, The voice of the user in front of the monitor 111 can be correctly captured without the microphone receiving the voice of a person other than the user and erroneously connecting to the other party's base.

また、映像信号に人が映っていると判定されたとき、信号処理部１１６は、人物位置検知部１１４からの人の方向情報に基づいて、映像信号に映っている各人の音声として扱い、マイクアレイ１０１に形成される指向性や音源分離処理を行なう。人物位置検知部１１４による人の検知方法は特に限定されるものではなく、種々の方法を広く適用することができ、例えば、ビデオカメラ１０４が撮影する映像信号（映像フレーム）のＸ－Ｙ座標系と、マイクアレイ１０１の各マイクの位置を決めるＸ－Ｙ座標系との対応させるために、映像信号（映像フレーム）のＸ－Ｙ座標系と、マイクアレイ１０１の各マイク位置のＸ－Ｙ座標系の原点との間で座標変換処理を行ない、人のいる方向情報を算出するようにしても良い。 Further, when it is determined that a person is shown in the video signal, the signal processing unit 116 treats the sound as the voice of each person shown in the video signal, based on the person direction information from the person position detection unit 114, and performs directivity formation for the microphone array 101 and sound source separation processing. The method by which the person position detection unit 114 detects a person is not particularly limited, and various methods can be widely applied. For example, to associate the X-Y coordinate system of the video signal (video frame) captured by the video camera 104 with the X-Y coordinate system that defines the position of each microphone in the microphone array 101, coordinate transformation may be performed between the X-Y coordinate system of the video frame and the origin of the X-Y coordinate system of the microphone positions, and the direction in which the person is located may be calculated from the result.
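The coordinate transformation from a position in the video frame to a direction for the microphone array is not specified in the source. The following is a minimal sketch of one possible mapping, assuming a pinhole camera model, a known horizontal field of view, and a camera mounted at the center of the microphone array and facing the same way. The function name `pixel_to_direction` and all of its parameters are hypothetical.

```python
import math


def pixel_to_direction(x_pixel, frame_width, hfov_deg):
    """Map a horizontal pixel position to an angle in radians.

    0 means straight ahead of the camera; positive angles are to the
    right of the frame center. Assumes a pinhole camera whose optical
    axis coincides with the front direction of the microphone array
    (an assumption, not stated in the source).
    """
    center_x = frame_width / 2.0
    # Focal length expressed in pixels, derived from the field of view.
    focal_px = center_x / math.tan(math.radians(hfov_deg) / 2.0)
    return math.atan((x_pixel - center_x) / focal_px)
```

With a 1920-pixel-wide frame and a 90-degree field of view, a person detected at the frame center maps to 0 rad and a person at the right edge maps to about π/4 rad.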

指向性処理の手法は、例えば、従来のマイクアレイ処理である遅延和アレイ処理でマイクアレイ１０１が直線型のマイクアレイの場合に、以下の（１）式に従い、処理する手法がある。

One directivity processing method, for the case where the microphone array 101 is a linear microphone array, is delay-and-sum array processing, a conventional microphone array technique, performed according to equation (1) below.

上記（１）式のｘ’＿ｋ（ｎ）はマイクアレイ処理信号、Ｄｍは各マイク信号に付加する遅延量、Ｋは指向性を形成する数、Ｍはマイクの本数、（２）式のＤ０は固定遅延量、（３）式のτ＿ｋはマイク間の遅延量、ｄはマイク間隔、θｋは指向性を形成する角度（人物位置検知部１１４からの人の方向情報）、ｃは音速である。 In equation (1) above, x'_k(n) is the microphone array processed signal, Dm is the delay added to each microphone signal, K is the number of directivities to be formed, and M is the number of microphones. In equation (2), D0 is a fixed delay. In equation (3), τ_k is the delay between adjacent microphones, d is the microphone spacing, θk is the angle at which the directivity is formed (the person direction information from the person position detection unit 114), and c is the speed of sound.
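The equation images for (1) to (3) are not reproduced in this text. As a sketch only, one standard delay-and-sum form that is consistent with the symbol definitions above is given below. The 1/M normalization, the microphone index starting at 1, and the sign of the per-microphone delay are assumptions and are not taken from the source. Only the form of τ_k can be cross-checked against the text, since it gives τ = 0 at θ = 0 and τ = d/c at θ = π/2. D_m is written in samples, so in practice τ_k in seconds would be multiplied by the sampling rate.

```latex
\begin{align}
x'_k(n) &= \frac{1}{M} \sum_{m=1}^{M} x(m,\, n - D_m), \qquad k = 1, \dots, K \\
D_m &= D_0 + (m - 1)\,\tau_k \\
\tau_k &= \frac{d \sin\theta_k}{c}
\end{align}
```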

例えば、１つの指向性をマイクアレイ１０１の正面方向に指向性を形成する場合は、Ｋ＝１、指向性を形成する角度θ１＝０になるので、上記（３）式より、τ＿１＝０となる。また例えば、２つの指向性をマイクアレイ１０１の９０度方向に指向性を形成する場合は、Ｋ＝２、指向性を形成する角度θ２＝π／２（πは円周率）になり、上記（３）式より、τ＿２＝ｄ／ｃとなる。 For example, when a single directivity is formed toward the front of the microphone array 101, K = 1 and the directivity angle is θ1 = 0, so equation (3) gives τ_1 = 0. Likewise, when a second directivity is formed in the 90-degree direction of the microphone array 101, K = 2 and the directivity angle is θ2 = π/2 (where π is the circle constant), so equation (3) gives τ_2 = d/c.
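To make the two worked values concrete, here is a minimal delay-and-sum sketch in pure Python. It assumes a linear array with uniform spacing, delays rounded to whole samples, and a speed of sound of 343 m/s. The function names, the rounding, and the choice of the fixed delay as the smallest offset that keeps every delay non-negative are illustrative assumptions, not the patent's implementation.

```python
import math

SPEED_OF_SOUND = 343.0  # m/s, assumed room-temperature value


def inter_mic_delay(spacing_m, theta_rad, c=SPEED_OF_SOUND):
    """Delay between adjacent microphones in seconds: d * sin(theta) / c."""
    return spacing_m * math.sin(theta_rad) / c


def delay_and_sum(x, theta_rad, spacing_m, fs, c=SPEED_OF_SOUND):
    """Form one directivity from x[m][n] (microphone m, sample n)."""
    num_mics = len(x)
    num_samples = len(x[0])
    tau = inter_mic_delay(spacing_m, theta_rad, c)
    # Per-microphone delay in whole samples (rounding is an assumption).
    raw = [round(m * tau * fs) for m in range(num_mics)]
    fixed = -min(raw)  # fixed offset so that no delay is negative
    delays = [fixed + r for r in raw]
    out = []
    for n in range(num_samples):
        acc = 0.0
        for m in range(num_mics):
            idx = n - delays[m]
            if 0 <= idx < num_samples:  # samples before the start count as 0
                acc += x[m][idx]
        out.append(acc / num_mics)
    return out
```

For the front direction the inter-microphone delay is zero, so the output is simply the average of the channels; at 90 degrees the delay equals the spacing divided by the speed of sound.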

なお、信号処理の算出手段は、種々の方法を広く適用することができ、例えば、遅延和アレイ処理以外の従来の別マイクアレイ処理や、マイクアレイを２組使用して、ある特定のエリアの収音できるマイクアレイ処理でも良い。 Note that various methods can be widely applied as the signal processing. For example, conventional microphone array processing other than delay-and-sum array processing may be used, as may microphone array processing that uses two microphone arrays to pick up sound from a specific area.

そして、信号処理部１１６は、人物位置検知部１１４でビデオカメラ１０４の映像信号に人が映っていると判定されたときは、算出したマイクアレイ処理信号ｘ’＿ｋ（ｎ）を、オーディオバッファ部１１７と、音声認識部１１８と、出力切替部１２１に出力し、人物位置検知部１１４でビデオカメラ１０４の映像信号に人が映っていないと判定されたときは、（４）式に示すように、無音信号をオーディオバッファ部１１７と、音声認識部１１８と、出力切替部１２１に出力する。 Then, when the person position detection unit 114 determines that a person is shown in the video signal of the video camera 104, the signal processing unit 116 outputs the calculated microphone array processed signal x'_k(n) to the audio buffer unit 117, the speech recognition unit 118, and the output switching unit 121. When the person position detection unit 114 determines that no person is shown in the video signal, the signal processing unit 116 outputs a silent signal to the same three units, as shown in equation (4).
ｘ’＿ｋ（ｎ）＝０ …（４）
x'_k(n) = 0 ... (4)
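Equation (4) amounts to gating the processed signal on the person detection result. A minimal sketch follows, assuming the processed signal is handled one frame (a list of samples) at a time. The function name `gate_by_presence` is hypothetical.

```python
def gate_by_presence(processed, person_detected):
    """Pass the processed frame through if a person is detected.

    Otherwise return silence of the same length, as in x'_k(n) = 0.
    """
    if person_detected:
        return list(processed)
    return [0.0] * len(processed)
```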

また、呼びかけ処理部１０５は、同時にマイクアレイ処理信号ｘ’＿ｋ（ｎ）を、以下の（５）式に従い、オーディオバッファ部１１７のオーディオバッファｂｕｆｆｅｒ＿ｋ（ｎ）の書込み位置ｗｒｉｔｅ＿ｉｎｄｅｘの位置に保持する。保持した後、呼びかけ処理部１０５は、以下の（６）式に示すように、書込み位置ｗｒｉｔｅ＿ｉｎｄｅｘの値に「１」をインクリメントして処理を進める。

Further, the calling processing unit 105 simultaneously holds the microphone array processed signal x′_k(n) at the write position write_index of the audio buffer buffer_k(n) of the audio buffer unit 117 according to the following equation (5). After holding, the call processing unit 105 increments the value of the write position write_index by "1" and proceeds with the process, as shown in equation (6) below.

上記（６）式のＢＵＦＦＥＲ＿ＳＩＺＥは、オーディオバッファ部１１７のバッファの長さである。 BUFFER_SIZE in the above equation (6) is the length of the buffer in the audio buffer unit 117.
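The images for equations (5) and (6) are not reproduced in this text. The write step they describe can be sketched as a ring buffer, as below. The wraparound modulo BUFFER_SIZE is an assumption suggested by the mention of BUFFER_SIZE in connection with equation (6). The class name `AudioRingBuffer` is hypothetical.

```python
class AudioRingBuffer:
    """Fixed-length buffer that overwrites its oldest sample when full."""

    def __init__(self, size):
        self.size = size            # corresponds to BUFFER_SIZE
        self.data = [0.0] * size    # corresponds to buffer_k
        self.write_index = 0

    def write(self, sample):
        # Store the sample at the write position, as described for (5).
        self.data[self.write_index] = sample
        # Advance the write position by 1; wrapping around is assumed.
        self.write_index = (self.write_index + 1) % self.size
```

Writing four samples into a buffer of length 3 overwrites the oldest sample and leaves the write position at index 1.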

さらに、呼びかけ処理部１０５は、同時にマイクアレイ処理信号ｘ’＿ｋ（ｎ）を音声認識部１１８で音声認識を行う。そして、マイクアレイ処理信号ｘ’＿ｋ（ｎ）の音声認識結果をマイクアレイ処理信号毎にコマンド判定部１２０に出力する。 Furthermore, the calling processing unit 105 simultaneously performs voice recognition on the microphone array processed signal x'_k(n) using the voice recognition unit 118. Then, the voice recognition result of the microphone array processed signal x'_k(n) is output to the command determination unit 120 for each microphone array processed signal.

コマンド判定部１２０は、音声認識結果とコマンドリスト部１１９に保持されているコマンド一覧（例えば、図３のコマンドリスト）とを比較し、コマンドリストにある「人名」とコマンドリストにある「接続コマンド」が続けて音声認識されたか否かの判定を行う。例えば、使用者が「○○さんこんにちは」などのように発話し、音声認識結果が、コマンドリストに設定されている「人名」と「接続コマンド」とが連続して音声認識された場合、判定結果として「１」を、後述する「切断コマンド」が音声認識された場合は、判定結果として「２」を、それ以外は「０」を出力する。そして、コマンド判定部１２０は、判定結果を出力切替部１２１に出力し、判定結果と音声認識結果を接続判定部１２３に出力する。 The command determination unit 120 compares the speech recognition result with the command list held in the command list unit 119 (for example, the command list in FIG. 3) and determines whether a "person's name" in the command list and a "connection command" in the command list were recognized in succession. For example, when the user says something like "Hello, Mr. XX" and the speech recognition result contains a "person's name" and a "connection command" from the command list in succession, the unit outputs "1" as the determination result. When a "disconnection command", described later, is recognized, it outputs "2". Otherwise it outputs "0". The command determination unit 120 then outputs the determination result to the output switching unit 121, and outputs the determination result and the speech recognition result to the connection determination unit 123.

コマンドリスト部１１９は、例えば、図３のようにコマンドの一覧がテキストファイルで保持されている。例えば、図３に例示するコマンドリストは、大別して、少なくとも相手側の拠点の通話相手となり得る人の名前等を示す「人名」、実際に対面する相手と会話を始める際に用いる言葉であって、且つ、相手側の拠点との接続開始を実行するコマンドとして機能する「接続コマンド」、実際に対面する相手と会話を終了する際に用いる言葉であって、且つ、相手側の拠点との接続終了を実行するコマンドとして機能する「切断コマンド」を有している。なお、図３のコマンド一覧は一例であって、コマンドリスト部１１９が保持するデータの内容及び形式は、種々様々な値（形式）を適用することができる。 The command list unit 119 holds a list of commands as a text file, for example as shown in FIG. 3. The command list illustrated in FIG. 3 broadly contains at least three kinds of entries: a "person's name", which indicates the name or the like of a person who can be called at the other party's base; a "connection command", which is a phrase used when starting a face-to-face conversation and which also functions as a command to start a connection with the other party's base; and a "disconnection command", which is a phrase used when ending a face-to-face conversation and which also functions as a command to end the connection with the other party's base. Note that the command list in FIG. 3 is only an example, and various contents and formats can be applied to the data held by the command list unit 119.
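The three-way determination (1 for a name followed by a connection command, 2 for a disconnection command, 0 otherwise) can be sketched as follows. The entries in `COMMAND_LIST` are hypothetical stand-ins for the contents of FIG. 3, which is not reproduced here, and matching on lowercase substrings is an assumption about how the recognized text is compared with the list.

```python
# Hypothetical command list; the real entries are those of FIG. 3.
COMMAND_LIST = {
    "names": ["Sato", "Suzuki"],
    "connect": ["hello", "good evening"],
    "disconnect": ["goodbye"],
}


def judge_command(recognized, commands=COMMAND_LIST):
    """Return 1 for name + connection command, 2 for disconnection, else 0."""
    text = recognized.lower().strip()
    # A connection requires a listed name immediately followed by a greeting.
    for name in commands["names"]:
        for greeting in commands["connect"]:
            if f"{name.lower()} {greeting}" in text:
                return 1
    for phrase in commands["disconnect"]:
        if phrase in text:
            return 2
    return 0
```

A greeting without a listed name returns 0, which matches the requirement that the name and the connection command be recognized in succession.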

接続判定部１２３は、音声認識部１１８による音声認識結果及びコマンド判定部１２０に基づくコマンド判定結果に基づいて接続判定を行い、接続判定結果をＮＷ通信部１０６に出力する。 The connection determination unit 123 performs connection determination based on the voice recognition result by the voice recognition unit 118 and the command determination result based on the command determination unit 120, and outputs the connection determination result to the NW communication unit 106.

例えば、コマンド判定部１２０の判定結果が「１」で音声認識部１１８の認識結果が「○○さんこんにちは」という音声認識結果が出力された場合、接続判定部１２３は、相手側の拠点のコミュニケーション装置１００が設置されている近くに「○○さん」がいるとき、相手側の拠点のコミュニケーション装置１００に接続する信号を接続判定結果出力端子１２４に出力する。拠点のコミュニケーション装置１００が設置されている近くに「○○さん」がいるかどうかの判定は、事前に端末の近くにいる人を登録した情報を使用する。 For example, when the determination result of the command determination unit 120 is "1" and the speech recognition unit 118 outputs the recognition result "Hello, Mr. XX", the connection determination unit 123 outputs a signal for connecting to the communication device 100 at the other party's base to the connection determination result output terminal 124, provided that "Mr. XX" is near the place where that communication device 100 is installed. Whether "Mr. XX" is near the communication device 100 at a base is determined using information, registered in advance, about the people who are near each terminal.
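The lookup against the pre-registered information can be sketched as below. The registry layout, the site identifiers, and the shape of the returned decision are all hypothetical; the source only states that people near each terminal are registered in advance.

```python
# Hypothetical registry of people registered in advance as being near
# the terminal at each base.
NEARBY_PEOPLE = {
    "site-b": {"sato", "suzuki"},
    "site-c": {"tanaka"},
}


def decide_connection(judgment, spoken_name, registry=NEARBY_PEOPLE):
    """Turn a command judgment (0, 1, 2) into a connection decision."""
    if judgment == 2:
        return {"action": "disconnect"}
    if judgment != 1:
        return None
    wanted = spoken_name.strip().lower()
    for site, people in registry.items():
        if wanted in people:
            return {"action": "connect", "destination": site}
    return None  # the named person is not registered near any terminal
```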

ＮＷ通信部１０６は、接続判定結果出力端子１２４を介して出力された接続判定結果に基づき、ネットワーク１０７との接続処理を行う。 The NW communication unit 106 performs connection processing with the network 107 based on the connection determination result outputted via the connection determination result output terminal 124.

コマンド判定部１２０により「人名」と「接続コマンド」が続けて音声認識された場合には、オーディオバッファ部１１７に保持されている該当のマイクアレイ処理信号のオーディオバッファ音を出力する。 When the command determination unit 120 recognizes the “person's name” and the “connection command” in succession, the audio buffered sound of the corresponding microphone array processed signal held in the audio buffer unit 117 is output.

オーディオバッファ部１１７に保持されている音を出力するために、読出し位置ｒｅａｄ＿ｉｎｄｅｘを、下記の（７）式に従い計算する。

In order to output the sound held in the audio buffer section 117, a read position read_index is calculated according to the following equation (7).

上記（７）式のＬＥＮは、オーディオバッファ部１１７に保持されている処理信号を再生する長さである。なお、ＬＥＮの決定方法は、種々の方法を広く適用することができ、例えば、オーディオバッファ部１１７のバッファサイズと同じ長さ（ＬＥＮ＝ＢＵＦＦＥＲ＿ＳＩＺＥ）とするなどの定数とする方法が存在する。また、オーディオバッファ部１１７に保持されているマイク入力信号に音声区間処理を行い、バッファに保持されている音の長さを求めて、その長さをＬＥＮとする方法でも良い。 LEN in the above equation (7) is the length for reproducing the processed signal held in the audio buffer unit 117. Note that various methods can be widely applied to determine LEN, and for example, there is a method of setting it to a constant such as setting it to the same length as the buffer size of the audio buffer unit 117 (LEN=BUFFER_SIZE). Alternatively, a method may be used in which the microphone input signal held in the audio buffer unit 117 is subjected to voice section processing, the length of the sound held in the buffer is determined, and the length is set as LEN.

そして、出力切替部１２１は、以下の（８）式に示すようにオーディオバッファ部１１７に保持されている音信号を出力信号ｙ（ｎ）として音出力端子１２２に一定時間（例えば、ＬＥＮの時間長分）出力し、以下の（９）式に示すように読出し位置ｒｅａｄ＿ｉｎｄｅｘを進める（インクリメン卜する）。

Then, as shown in equation (8) below, the output switching unit 121 outputs the sound signal held in the audio buffer unit 117 to the sound output terminal 122 as the output signal y(n) for a certain period of time (for example, the duration given by LEN), and advances (increments) the read position read_index as shown in equation (9) below.
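The images for equations (7) to (9) are not reproduced in this text. The read-out they describe can be sketched as follows, assuming that the read position starts LEN samples behind the write position and that both positions wrap around the buffer length. The function name `read_recent` is hypothetical.

```python
def read_recent(data, write_index, length):
    """Return the most recent `length` samples of a ring buffer, oldest first."""
    size = len(data)
    # Start LEN samples behind the write position (wraparound assumed).
    read_index = (write_index - length) % size
    out = []
    for _ in range(length):
        out.append(data[read_index])            # one output sample y(n)
        read_index = (read_index + 1) % size    # advance the read position
    return out
```

For a three-sample buffer holding [4, 2, 3] with the write position at index 1, reading all three samples yields them in the order they were written: 2, 3, 4.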

ＮＷ通信部１０６は、音出力端子１２２から介して出力された出力信号ｙ（ｎ）をネットワーク１０７で接続している相手側のコミュニケーション装置１００のＮＷ通信部１０６に送信する。 The NW communication unit 106 transmits the output signal y(n) outputted from the sound output terminal 122 to the NW communication unit 106 of the communication device 100 on the other side connected via the network 107.

出力切替部１２１は、オーディオバッファ部１１７に保持されている音信号を一定時間出力すると、以下の（１０）式に示すように、マイクアレイ処理信号ｘ’＿ｋ（ｎ）を出力信号ｙ（ｎ）として音出力端子１２２に出力する。 After outputting the sound signal held in the audio buffer unit 117 for the certain period of time, the output switching unit 121 outputs the microphone array processed signal x'_k(n) to the sound output terminal 122 as the output signal y(n), as shown in equation (10) below.
ｙ（ｎ）＝ｘ’＿ｋ（ｎ） …（１０）
y(n) = x'_k(n) ... (10)

一方、出力切替部１２１は、コマンド判定部１２０で音声認識部１１８の音声認識結果が「人名」と「接続コマンド」が続けて音声認識されない場合は、（４）式に示すように、ｘ’＿ｋ（ｎ）が無音信号になるので、（１０）式に示すようにｙ（ｎ）も無音信号になり、無音信号を音出力端子１２２に出力し続ける。 On the other hand, when the command determination unit 120 finds that the speech recognition result of the speech recognition unit 118 does not contain a "person's name" followed by a "connection command", x'_k(n) is a silent signal as shown in equation (4), so y(n) is also a silent signal by equation (10), and the output switching unit 121 continues to output the silent signal to the sound output terminal 122.
ｙ（ｎ）＝０ …（１１）
y(n) = 0 ... (11)

ＮＷ通信部１０６は、音出力端子１２２を介して出力された出力信号ｙ（ｎ）を引き続きネットワーク１０７に接続している相手側のコミュニケーション装置１００のＮＷ通信部１０６に送信する。 The NW communication unit 106 continues to transmit the output signal y(n) output through the sound output terminal 122 to the NW communication unit 106 of the communication device 100 on the other side connected to the network 107.

一方、ネットワーク１０７から送信されてきた相手側の音声信号は、ＮＷ通信部１０６を介してＤＡ変換部１０８に入力する。そして、ＤＡ変換部１０８によりデジタル信号からアナログ信号に変換後、音声信号がスピーカアンプ１０９で増幅され、音声がスピーカ１１０から出力される。 On the other hand, the other party's audio signal transmitted from the network 107 is input to the DA conversion unit 108 via the NW communication unit 106. After the digital signal is converted into an analog signal by the DA converter 108, the audio signal is amplified by the speaker amplifier 109, and the audio is output from the speaker 110.

呼びかけ音声再生後は、自拠点のコミュニケーション装置１００と相手側の拠点のコミュニケーション装置１００とが接続し、両拠点の間で、ビデオカメラ映像と音声のやりとりが行われる。 After the calling voice is played back, the communication device 100 at the own location and the communication device 100 at the other party's location are connected, and video camera images and audio are exchanged between the two locations.

しばらくして、通話を終了する場合は、使用者１５２ａと１５２ｂのいずれかが、切断音声を発話して会話を終了する。 If the call is to be ended after a while, either user 152a or 152b utters a disconnection voice to end the conversation.

使用者１５２ａ、１５２ｂのいずれかが発した音声は、環境音が重畳しマイクアレイ１０１の各マイクに入力される。 The sound emitted by either user 152a or 152b is input to each microphone of microphone array 101 with environmental sound superimposed thereon.

マイクアレイ１０１に入力されたアナログの音信号は、マイクアンプ１０２で増幅され、ＡＤ変換部１０３でアナログ信号からデジタル信号に変換され、呼びかけ処理部１０５の音入力端子１１５にマイク入力信号ｘ＿ｋ（ｍ，ｎ）として入力され、マイク入力信号ｘ＿ｋ（ｍ，ｎ）が信号処理部１１６に入力される。 The analog sound signal input to the microphone array 101 is amplified by the microphone amplifier 102, converted from an analog signal to a digital signal by the AD conversion unit 103, and input to the sound input terminal 115 of the calling processing unit 105 as the microphone input signal x_k(m, n), which is then input to the signal processing unit 116.

信号処理部１１６は、マイク入力信号ｘ＿ｋ（ｍ，ｎ）に対して（１）、（２）、（３）式に示すように、マイクアレイ処理を行い、指向性処理や音源を分離する音源分離処理を行い、算出したマイクアレイ処理信号ｘ’＿ｋ（ｎ）をオーディオバッファ部１１７と音声認識部１１８と出力切替部１２１に出力する。 The signal processing unit 116 performs microphone array processing on the microphone input signal x_k(m, n) as shown in equations (1), (2), and (3), and performs directional processing and sound source separation to separate sound sources. The separation process is performed, and the calculated microphone array processed signal x'_k(n) is output to the audio buffer section 117, the speech recognition section 118, and the output switching section 121.

出力切替部１２１は、（１０）式に示すように、マイクアレイ処理信号ｘ’＿ｋ（ｎ）を出力信号ｙ（ｎ）として音出力端子１２２に出力する。 The output switching unit 121 outputs the microphone array processed signal x'_k(n) to the sound output terminal 122 as an output signal y(n), as shown in equation (10).

また、呼びかけ処理部１０５は、同時にマイクアレイ処理信号ｘ’＿ｋ（ｎ）を、（５）式に従い、オーディオバッファ部１１７のオーディオバッファｂｕｆｆｅｒ＿ｋ（ｎ）の書込み位置ｗｒｉｔｅ＿ｉｎｄｅｘの位置に保持する。保持した後、呼びかけ処理部１０５は、（６）式に示すように、書込み位置ｗｒｉｔｅ＿ｉｎｄｅｘを進める（すなわち、書き込み位置をインクリメン卜する）。 Further, the calling processing unit 105 simultaneously holds the microphone array processed signal x'_k(n) at the write position write_index of the audio buffer buffer_k(n) of the audio buffer unit 117 according to equation (5). After holding, the call processing unit 105 advances the write position write_index (that is, increments the write position) as shown in equation (6).

さらに、呼びかけ処理部１０５は、同時にマイクアレイ処理信号ｘ’＿ｋ（ｎ）を音声認識部１１８で音声認識を行い、音声認識結果をコマンド判定部１２０に出力する。 Furthermore, the calling processing section 105 simultaneously performs speech recognition on the microphone array processed signal x'_k(n) using the speech recognition section 118 and outputs the speech recognition result to the command determination section 120 .

コマンド判定部１２０は、音声認識結果と、コマンドリスト部１１９に保持されているコマンド一覧（図３のコマンドリスト）とを比較し、音声認識の結果が「切断コマンド」の一覧に存在するか否かの判定を行う。そして、コマンド判定部１２０は、コマンドリストにある「切断コマンド」が音声認識された場合（例えば、「さようなら」など）、判定結果を出力切替部１２１、及び接続判定部１２３に出力する。例えば、使用者が「○○さんこんにちは」などのように発話し、音声認識結果が、コマンドリストに設定されている「人名」と「接続コマンド」とが連続して音声認識された場合、判定結果として「１」を、「切断コマンド」が音声認識された場合は、判定結果として「２」を、それ以外は「０」を出力する。 The command determination unit 120 compares the voice recognition result with the command list held in the command list unit 119 (command list in FIG. 3), and determines whether the voice recognition result exists in the list of “disconnection commands”. Make a judgment. Then, when the “disconnection command” in the command list is voice recognized (for example, “goodbye”), the command determination unit 120 outputs the determination result to the output switching unit 121 and the connection determination unit 123. For example, if the user utters something like "Hello Mr. XXX" and the voice recognition results show that the "person's name" and "connection command" set in the command list are consecutively recognized, the judgment will be made. If the "disconnection command" is voice recognized, "2" is output as the determination result; otherwise, "0" is output.

接続判定部１２３は、音声認識部１１８による音声認識結果及びコマンド判定部１２０に基づくコマンド判定結果に基づいて、切断判定を行い、ＮＷ通信部１０６に相手側のＮＷ通信部と切断する信号を接続判定結果出力端子１２４に出力する。 The connection determination unit 123 makes a disconnection determination based on the speech recognition result from the speech recognition unit 118 and the command determination result from the command determination unit 120, and outputs to the connection determination result output terminal 124 a signal instructing the NW communication unit 106 to disconnect from the other party's NW communication unit.

ＮＷ通信部１０６は、接続判定結果出力端子１２４を介して出力された接続判定結果に基づき、相手側のコミュニケーション装置１００のＮＷ通信部１０６との切断処理を行う。 The NW communication unit 106 performs a disconnection process from the NW communication unit 106 of the communication device 100 on the other side based on the connection determination result outputted via the connection determination result output terminal 124.

出力切替部１２１は、コマンド判定部１２０で音声認識部１１８の音声認識結果がコマンドリスト部１１９の切断コマンド一覧に存在しないと判定された場合には、マイクアレイ処理信号を音出力端子１２２に出力し続ける。 When the command determination unit 120 determines that the voice recognition result of the voice recognition unit 118 does not exist in the disconnection command list of the command list unit 119, the output switching unit 121 outputs the microphone array processed signal to the sound output terminal 122. Continue to do so.

On the other hand, when the command determination unit 120 determines that the voice recognition result of the voice recognition unit 118 is in the disconnection command list of the command list unit 119, the output switching unit 121 outputs a silent signal to the sound output terminal 122 as the output signal y(n), as shown in equation (11).

(A-3) Effects of the First Embodiment
As described above, according to the first embodiment, the communication device 100 performs signal processing that emphasizes each user's voice, using the audio signals picked up by the microphone array and the direction information of the person. The processed signal is held once in the audio buffer unit while, at the same time, the voice recognition unit performs voice recognition on it. The device then determines whether the voice recognition result is a calling voice. If it is, the device connects to the other party's communication device and then outputs the calling voice held in the buffer, so that the conversation starts only after the calling voice has reached the other party. After the conversation with the other party has started, the voice recognition unit continues to perform voice recognition on the processed signal, the device determines whether the result is a disconnection voice, and it disconnects if so. This reproduces a situation close to a face-to-face conversation and lets multiple speakers start a conversation with a high sense of presence.

Furthermore, because the communication device 100 of the first embodiment picks up the calling voice with a microphone array and applies signal processing that emphasizes the voice, it can recognize the calling voice correctly and carry out a call even in a noisy environment.

(B) Second Embodiment
Next, a second embodiment of the communication device, communication program, and communication method of the present invention will be described in detail with reference to the drawings.

第２の実施形態は、本発明のコミュニケーション装置の音出力方法が、第１の実施形態と異なっている場合を例示する。 The second embodiment exemplifies a case where the sound output method of the communication device of the present invention is different from the first embodiment.

(B-1) Configuration of the Second Embodiment
FIG. 4 is a block diagram showing the configuration of a communication device 200 according to the second embodiment.

In FIG. 4, the communication device 200 according to the second embodiment includes a microphone array 101, a microphone amplifier 102, an analog-to-digital (AD) conversion unit 103, two video cameras 104a and 104b, a calling processing unit 201, an NW communication unit 106, a digital-to-analog (DA) conversion unit 108, a speaker amplifier 109, two speakers 110a and 110b, and a monitor 111.

The calling processing unit 201 includes a sound input terminal 115, video input terminals 112a and 112b, video output terminals 113a and 113b, a person position detection unit 202, a signal processing unit 116, an audio buffer unit 117, a voice recognition unit 118, a command list unit 119, a command determination unit 120, an output switching unit 203, sound output terminals 122a and 122b, a connection determination unit 123, and a connection determination result output terminal 124.

The communication device 200 according to the second embodiment includes two video cameras 104a and 104b and two speakers 110a and 110b. In addition, the calling processing unit 201 now has two each of the video input terminals (112a and 112b), the video output terminals (113a and 113b), and the sound output terminals (122a and 122b). As a result, the operations of the person position detection unit 202 and the output switching unit 203 differ from those in the first embodiment.

それ以外の構成要素は、第１の実施形態に係る図１のコミュニケーション装置１００の構成要素と同一、又は対応するものである。なお、図４において、第１の実施形態に係るコミュニケーション装置１００の構成要素と同一、又は対応するものについては同一の符号を付している。また、第１の実施形態と同一、又は対応する構成要素の詳細な説明は重複するため、ここでは省略する。 The other components are the same as or correspond to the components of the communication device 100 of FIG. 1 according to the first embodiment. Note that in FIG. 4, components that are the same as or correspond to the components of the communication device 100 according to the first embodiment are given the same reference numerals. Furthermore, since detailed explanations of components that are the same as or correspond to those in the first embodiment are redundant, they will be omitted here.

The calling processing unit 201 is connected to the two video cameras 104a and 104b and determines whether a person appears in the video signal input from each of them. Only when a person is determined to appear in one or both of the video signals does the calling processing unit 201 apply signal processing to the multiple microphone input signals and output the processed signal to the sound output terminal. At the same time, the calling processing unit 201 stores the processed signal in the audio buffer unit 117. The calling processing unit 201 also performs voice recognition on the processed signal. When the voice recognition result matches one of the commands in the command list unit 119, it outputs the connection determination result and, for a fixed period, the sound signal stored in the audio buffer. Once that fixed period of output is complete, it again outputs the processed signal.

次に、呼びかけ処理部２０１の詳細な構成を説明する。 Next, the detailed configuration of the call processing section 201 will be explained.

映像入力端子１１２ａ、１１２ｂは、ビデオカメラ１０４ａ、１０４ｂからの映像信号を呼びかけ処理部２０１に入力するインタフェース部である。 The video input terminals 112a and 112b are interface units that input video signals from the video cameras 104a and 104b to the calling processing unit 201.

映像出力端子１１３ａ、１１３ｂは、ビデオカメラ１０４ａ、１０４ｂからの映像信号を呼びかけ処理部２０１から出力するインタフェース部である。 The video output terminals 113a and 113b are interface units that output video signals from the video cameras 104a and 104b from the calling processing unit 201.

人物位置検知部２０２は、映像入力端子１１２ａ、１１２ｂから入力したビデオカメラ１０４ａ、１０４ｂのそれぞれの映像信号に人が映っているか否かを判定するものである。 The person position detection unit 202 determines whether or not a person is shown in the video signals of the video cameras 104a and 104b inputted from the video input terminals 112a and 112b.

出力切替部２０３は、コマンド判定部１２０によるコマンド判定結果に基づいて出力する音信号を決定し、音信号を出力する。 The output switching unit 203 determines the sound signal to be output based on the command determination result by the command determination unit 120, and outputs the sound signal.

(B-2) Operation of the Second Embodiment
The basic operation of voice processing in the communication device 200 according to the second embodiment is the same as the voice processing described in the first embodiment.

以下では、第１の実施形態と異なる点である人物位置検知部２０２、及び出力切替部２０３における処理動作を中心に詳細に説明する。 Below, processing operations in the person position detection section 202 and the output switching section 203, which are different from the first embodiment, will be explained in detail.

The following description assumes, as shown in FIG. 5, that one user 152a is communicating with a person at the other party's base and that a second user 152b then joins the communication. The user 152a is assumed to be photographed by the video camera 104a, and the user 152b by the video camera 104b.

まず、コミュニケーション装置２００の動作が開始すると、モニター１１１は、相手側の拠点のコミュニケーション装置１００のビデオカメラ１０４ａ、１０４ｂで撮影している映像を表示する。 First, when the communication device 200 starts operating, the monitor 111 displays images captured by the video cameras 104a and 104b of the communication device 100 at the other party's base.

自拠点のビデオカメラ１０４ａ、１０４ｂで撮影している映像は、呼びかけ処理部２０１を介してＮＷ通信部１０６に与えられ、ＮＷ通信部１０６がＮＷを通して相手側の拠点に映像信号を送信する。映像信号は相手側の拠点のＮＷ通信部１０６で受信され、相手の拠点のモニター１１１には、自拠点のビデオカメラ１０４ａ、１０４ｂで撮影された映像が表示される。 The images captured by the video cameras 104a and 104b at the own base are given to the NW communication unit 106 via the calling processing unit 201, and the NW communication unit 106 transmits a video signal to the other party's base through the NW. The video signal is received by the NW communication unit 106 of the other party's base, and the video captured by the video cameras 104a and 104b of the own base is displayed on the monitor 111 of the other party's base.

At this time, the communication devices 200 at the two bases are not transmitting or receiving audio signals. At each base, only the video captured by the other party's video cameras 104a and 104b is displayed on the monitor 111, so each base can see the state of the other. Alternatively, the audio signals of the bases may also be transmitted and received, in which case each base's video is displayed on the other's monitor 111 and each base can also hear the other's sound.

また、ビデオカメラ１０４ａ、１０４ｂで撮影している映像信号は、呼びかけ処理部２０１の映像入力端子１１２ａ、１１２ｂに入力され、人物位置検知部２０２に入力される。 Further, video signals captured by the video cameras 104a and 104b are input to video input terminals 112a and 112b of the calling processing section 201, and then input to the person position detection section 202.

人物位置検知部２０２は、ビデオカメラ１０４ａ、１０４ｂで撮影された映像信号に人が映っているか否かを判定し、その判定結果を、信号処理部１１６及び音声認識部１１８に出力する。例えば、人物位置検知部２０２は、ビデオカメラ１０４ａに人が映っていると判定したときには判定結果を「１」、ビデオカメラ１０４ｂに人が映っていると判定したときには判定結果を「２」、それ以外は判定結果を「０」などとして出力する。 The person position detection section 202 determines whether or not a person is shown in the video signals captured by the video cameras 104a and 104b, and outputs the determination result to the signal processing section 116 and the voice recognition section 118. For example, when the person position detection unit 202 determines that a person is reflected in the video camera 104a, the determination result is set to "1," and when it determines that a person is reflected in the video camera 104b, the determination result is set to "2." Otherwise, the determination result is output as "0" or the like.

例えば、使用者１５２ａが相手側の拠点の全体映像に映っている人と通話を行う場合は、モニター１１１に表示されている相手側の拠点の映像を見るために、図５に示すように、使用者１５２ａは、モニター１１１に近づき、モニター１１１に映っている相手の拠点の映像を確認する。 For example, when the user 152a has a call with a person who is displayed in the overall image of the other party's base, in order to view the image of the other party's base displayed on the monitor 111, as shown in FIG. The user 152a approaches the monitor 111 and checks the image of the other party's base displayed on the monitor 111.

このとき、図５に例示するように、モニター１１１付近に設置されているビデオカメラ１０４ａが使用者１５２ａを撮影するので、ビデオカメラ１０４ａの映像信号を監視する人物位置検知部２０２は、ビデオカメラ１０４ａの映像信号に人が映っているという判定結果（例えば、判定結果「１」等）を、信号処理部１１６及び音声認識部１１８に出力する。また、人物位置検知部２０２は、ビデオカメラ１０４ａの映像フレームにおける使用者１５２ａの方向情報を信号処理部１１６に出力する。 At this time, as illustrated in FIG. 5, the video camera 104a installed near the monitor 111 photographs the user 152a, so the person position detection unit 202 that monitors the video signal of the video camera 104a A determination result that a person is shown in the video signal (for example, determination result "1", etc.) is output to the signal processing unit 116 and the voice recognition unit 118. Further, the person position detection unit 202 outputs direction information of the user 152a in the video frame of the video camera 104a to the signal processing unit 116.

使用者１５２ａは、通話したい相手を呼びかけるために、呼びかけ音声を発声する。使用者１５２ａが発した音声は、環境音が重畳しマイクアレイ１０１ａの各マイクに入力される。 The user 152a utters a calling voice in order to call the person with whom the user wants to talk. The sound emitted by the user 152a is superimposed with environmental sounds and input to each microphone of the microphone array 101a.

The analog sound signal input to the microphone array 101a is amplified by the microphone amplifier 102, converted from an analog signal to a digital signal by the AD conversion unit 103, and input to the sound input terminal 115 of the calling processing unit 201 as the microphone input signal x(m,n).

呼びかけ処理部２０１の音入力端子１１５に信号が入力され始めると、まず、マイク入力信号ｘ（ｍ，ｎ）が信号処理部１１６に入力される。 When a signal starts to be input to the sound input terminal 115 of the calling processing section 201, first, a microphone input signal x (m, n) is input to the signal processing section 116.

When the person position detection unit 202 determines that a person appears in the video signal of the video camera 104a, the signal processing unit 116 performs microphone array processing on the input signal. Using the direction information of the user 152a in the video of the video camera 104a, obtained from the person position detection unit 202, the signal processing unit 116 performs directivity processing so that the microphone array 101 picks up the voice of the user 152a arriving from the user's direction, and sound source separation processing that extracts the voice of the user 152a.

When the person position detection unit 202 determines that a person appears in the video signal of the video camera 104a, the signal processing unit 116 outputs the calculated microphone array processed signal x'_1(n) to the audio buffer unit 117, the voice recognition unit 118, and the output switching unit 203. When the person position detection unit 202 determines that no person appears in the video signal, the signal processing unit 116 outputs a silent signal to the audio buffer unit 117, the voice recognition unit 118, and the output switching unit 203, as shown in equation (12).
x'_1(n) = 0 ... (12)
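The gating in equation (12) amounts to replacing the processed frame with zeros whenever no person is detected. The sketch below assumes frame-based processing on plain Python lists; the function name `gate_by_presence` and the frame representation are illustrative and do not come from the patent.

```python
def gate_by_presence(processed_frame, person_detected):
    """Pass the microphone array processed frame through while a person is
    detected; otherwise return a silent frame of the same length (eq. 12)."""
    if person_detected:
        return list(processed_frame)
    # No person in the video signal: x'_1(n) = 0 for every sample.
    return [0.0] * len(processed_frame)
```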

呼びかけ処理部２０１は、同時にマイクアレイ処理信号ｘ’＿１（ｎ）は、（５）式に従い、オーディオバッファ部１１７のオーディオバッファｂｕｆｆｅｒ＿１（ｎ）の書込み位置ｗｒｉｔｅ＿ｉｎｄｅｘの位置に保持する。保持した後、呼びかけ処理部２０１は、（６）式のように、書込み位置ｗｒｉｔｅ＿ｉｎｄｅｘの値をインクリメントして進める。 At the same time, the calling processing unit 201 holds the microphone array processed signal x'_1(n) at the write position write_index of the audio buffer buffer_1(n) of the audio buffer unit 117 according to equation (5). After holding, the call processing unit 201 increments and advances the value of the write position write_index, as shown in equation (6).
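The write step can be pictured as a ring buffer. Equations (5) and (6) are not reproduced in this section, so the sketch below is an assumption about their general shape: the sample is stored at `write_index`, and the index then advances by one. The class name `AudioBuffer` and the wrap-around at the end of the buffer are both hypothetical.

```python
class AudioBuffer:
    """Fixed-size ring buffer holding the most recent processed samples."""

    def __init__(self, size):
        self.buffer = [0.0] * size
        self.write_index = 0

    def write(self, sample):
        # Store the sample at the current write position (cf. eq. 5).
        self.buffer[self.write_index] = sample
        # Advance the write position (cf. eq. 6); wrapping is an assumption.
        self.write_index = (self.write_index + 1) % len(self.buffer)
```

With a buffer of size 3, writing the samples 1, 2, 3, 4 leaves the contents as `[4, 2, 3]`, because the oldest sample is overwritten first.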

さらに、呼びかけ処理部２０１では、同時にマイクアレイ処理信号ｘ’＿１（ｎ）を音声認識部１１８に入力し、音声認識部１１８が音声認識を行い、マイクアレイ処理信号ｘ’＿１（ｎ）の音声認識結果をコマンド判定部１２０に出力する。 Furthermore, the calling processing unit 201 simultaneously inputs the microphone array processed signal x'_1(n) to the voice recognition unit 118, and the voice recognition unit 118 performs voice recognition to generate the voice of the microphone array processed signal x'_1(n). The recognition result is output to the command determination unit 120.

コマンド判定部１２０は、マイクアレイ処理信号ｘ’＿１（ｎ）の音声認識結果と、コマンドリスト部１１９に保持されているコマンド一覧（例えば、図３のコマンドリスト）を比較し、コマンドリストにある「人名」とコマンドリストにある「接続コマンド」が続けて音声認識されたか否かの判定を行う（例えば、「○○さんこんにちは」など）。 The command determination unit 120 compares the voice recognition result of the microphone array processed signal x'_1(n) with the command list held in the command list unit 119 (for example, the command list in FIG. 3), and It is determined whether or not the "person's name" and the "connection command" in the command list have been voice recognized in succession (for example, "Hello Mr. XXX").

The command determination unit 120 then outputs the determination result to the output switching unit 203, and outputs the determination result together with the voice recognition result to the connection determination unit 123. For example, the command determination unit 120 outputs "1" as the determination result when a "person's name" and a "connection command" are recognized in succession in the microphone array processed signal x'_1(n), "2" when they are recognized in succession in the microphone array processed signal x'_2(n) described later, and "0" otherwise.

The connection determination unit 123 makes a connection determination based on the voice recognition result from the voice recognition unit 118 and the command determination result from the command determination unit 120, and outputs the connection determination result to the NW communication unit 106. For example, suppose the determination result is "1" and the command determination unit 120 outputs the voice recognition result "Hello, Mr. ○○". If "Mr. ○○" is near the place where the communication device at the other party's base is installed, the connection determination unit 123 outputs to the connection determination result output terminal 124 a signal for connecting the video camera 104a and the microphone array 101a to the video camera and microphone array near "Mr. ○○" on the communication device 200 at the other party's base. Whether "Mr. ○○" is near the place where the communication device 200 at that base is installed is determined from information, registered in advance, about the people near each terminal.
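The lookup described above can be sketched as a table of pre-registered people. Everything in this example is hypothetical: the names, the channel labels "a" and "b", and the function `decide_connection` are illustrations of the idea, and the patent does not specify how the registration data is stored.

```python
# Hypothetical pre-registered data: the remote camera/microphone-array
# channel that each person is near.
NEARBY_PEOPLE = {"Tanaka": "a", "Suzuki": "b"}

# Determination result 1 refers to local channel "a" (camera 104a),
# and result 2 to local channel "b" (camera 104b).
LOCAL_CHANNEL = {1: "a", 2: "b"}


def decide_connection(command_result, recognized_name):
    """Return (local_channel, remote_channel) to connect, or None."""
    local = LOCAL_CHANNEL.get(command_result)
    if local is None:
        return None  # result 0: no calling voice was recognized
    remote = NEARBY_PEOPLE.get(recognized_name)
    if remote is None:
        return None  # the named person is not registered near the remote device
    return (local, remote)
```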

When the command determination unit 120 finds that a "person's name" and a "connection command" have not been recognized in succession in the voice recognition result of the voice recognition unit 118, the output switching unit 203 outputs a silent signal to the sound output terminal 122a as the output signal y_1(n), as shown in equation (13).
y_1(n) = 0 ... (13)

On the other hand, when the command determination unit 120 finds that a "person's name" and a "connection command" have been recognized in succession, the output switching unit 203 calculates the read position read_index_1 of the audio buffer unit 117 according to equation (14).

また、「人名」と「接続コマンド」が続けて音声認識されない場合は、出力切替部２０３は、信号処理部１１６からのマイク入力信号ｘ（ｍ，ｎ）を出力するようにしても良い。 Further, if the "person's name" and "connection command" are not voice recognized consecutively, the output switching unit 203 may output the microphone input signal x(m,n) from the signal processing unit 116.

The output switching unit 203 then outputs the sound signal held in the audio buffer unit 117 to the sound output terminal 122a as the output signal y_1(n) for a fixed period (for example, the time length LEN), as shown in equation (15), and increments the read position read_index_1 as shown in equation (16).

ＮＷ通信部１０６は、音出力端子１２２ａから出力された出力信号ｙ＿１（ｎ）をネットワーク１０７で接続している相手側のコミュニケーション装置２００のＮＷ通信部１０６に送信する。 The NW communication unit 106 transmits the output signal y_1(n) output from the sound output terminal 122a to the NW communication unit 106 of the communication device 200 on the other side connected via the network 107.

After outputting the sound signal held in the audio buffer unit 117 for the fixed period, the output switching unit 203 outputs the microphone array processed signal x'_1(n) to the sound output terminal 122a as the output signal y_1(n), as shown in equation (17).
y_1(n) = x'_1(n) ... (17)
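Taken together, equations (13) to (17) describe an output that moves through three phases: silence, replay of the buffered calling voice, and live output. The sketch below is one possible reading under stated assumptions. Equations (14) to (16) are not reproduced in this section, so placing the read position `replay_len` samples behind the write position and wrapping both indices are assumptions, and the class name `OutputSwitcher` is hypothetical.

```python
class OutputSwitcher:
    """Silence until a calling voice is recognized, then replay the buffered
    audio for replay_len samples, then pass the live signal through."""

    SILENT, REPLAY, LIVE = range(3)

    def __init__(self, buffer_size, replay_len):
        assert replay_len <= buffer_size
        self.buffer = [0.0] * buffer_size
        self.write_index = 0
        self.read_index = 0
        self.replay_len = replay_len
        self.remaining = 0
        self.state = self.SILENT

    def step(self, sample, command_result):
        # Always store the processed sample in the ring buffer.
        self.buffer[self.write_index] = sample
        self.write_index = (self.write_index + 1) % len(self.buffer)

        if self.state == self.SILENT and command_result == 1:
            # Assumed form of eq. (14): start replay_len samples back.
            self.read_index = (self.write_index - self.replay_len) % len(self.buffer)
            self.remaining = self.replay_len
            self.state = self.REPLAY

        if self.state == self.SILENT:
            return 0.0  # eq. (13): silent output
        if self.state == self.REPLAY:
            out = self.buffer[self.read_index]  # cf. eq. (15)
            self.read_index = (self.read_index + 1) % len(self.buffer)  # cf. eq. (16)
            self.remaining -= 1
            if self.remaining == 0:
                self.state = self.LIVE
            return out
        return sample  # eq. (17): live output
```

In this reading, samples that arrive while the replay is running are stored but never transmitted, because the output jumps straight from the end of the replay to the live signal.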

ＮＷ通信部１０６は、音出力端子１２２から出力された出力信号ｙ＿１（ｎ）を引き続きネットワーク１０７で接続している相手側のコミュニケーション装置２００のＮＷ通信部１０６に送信する。 The NW communication unit 106 continues to transmit the output signal y_1(n) output from the sound output terminal 122 to the NW communication unit 106 of the communication device 200 on the other side connected via the network 107.

Meanwhile, the other party's audio signal transmitted over the network 107 is input to the DA conversion unit 108 via the NW communication unit 106 and converted from a digital signal to an analog signal by the DA conversion unit 108. The audio signal is then amplified by the speaker amplifier 109 and output as sound from the speaker 110a. That is, the voice of the user 152a is output from the speaker 110a.

呼びかけ音声再生後は、自拠点のコミュニケーション装置２００と相手側の拠点のコミュニケーション装置２００とが接続し、両拠点の間で、ビデオカメラ映像と音声のやりとりが行われる。 After the calling voice is played back, the communication device 200 at the own location and the communication device 200 at the other party's location are connected, and video camera images and audio are exchanged between the two locations.

次に、図６に示すように、２人目の使用者１５２ｂがコミュニケーションに参加して、相手側の拠点にいる人と通話する場合を説明する。 Next, as shown in FIG. 6, a case where the second user 152b participates in communication and speaks with a person at the other party's base will be described.

In this case as well, the user 152b approaches the monitor 111, as shown in FIG. 6, to view the image of the other party's base displayed on it. The person position detection unit 202 then determines that a person appears in the image from the video camera 104b and outputs that determination result to the signal processing unit 116 and the voice recognition unit 118.

When the person the user 152b wants to talk to appears in the image, the user 152b utters a calling voice. The voice uttered by the user 152b, with environmental sound superimposed on it, is input to each microphone of the microphone array 101b.

マイクアレイ１０１ｂに入力されたアナログの音信号は、マイクアンプ１０２で増幅され、ＡＤ変換部１０３でアナログ信号からデジタル信号に変換され、音声信号が、呼びかけ処理部２０１の音入力端子１１５にマイク入力信号ｘ（ｍ，ｎ）として入力される。 The analog sound signal input to the microphone array 101b is amplified by the microphone amplifier 102, the analog signal is converted to a digital signal by the AD converter 103, and the audio signal is input to the sound input terminal 115 of the calling processor 201 by the microphone. It is input as a signal x(m,n).

人物位置検知部２０２でビデオカメラ１０４ｂの映像に人が映っていると判定されたとき、信号処理部１１６は入力信号に対してマイクアレイ処理を行い、指向性処理や音源を分離する音源分離処理をする。 When the person position detection unit 202 determines that a person is shown in the image of the video camera 104b, the signal processing unit 116 performs microphone array processing on the input signal, and performs directional processing and sound source separation processing to separate the sound source. do.

そして、信号処理部１１６は、算出したマイクアレイ処理信号ｘ’＿２（ｎ）、をオーディオバッファ部１１７と、音声認識部１１８と、出力切替部２０３に出力する。 Then, the signal processing section 116 outputs the calculated microphone array processed signal x'_2(n) to the audio buffer section 117, the speech recognition section 118, and the output switching section 203.

When the person position detection unit 202 determines that a person appears in the video signal of the video camera 104b, the signal processing unit 116 outputs the calculated microphone array processed signal x'_2(n) to the audio buffer unit 117, the voice recognition unit 118, and the output switching unit 203. When the person position detection unit 202 determines that no person appears in the video signal, the signal processing unit 116 outputs a silent signal to the audio buffer unit 117, the voice recognition unit 118, and the output switching unit 203, as shown in equation (18).
x'_2(n) = 0 ... (18)

呼びかけ処理部２０１は、同時にマイクアレイ処理信号ｘ’＿２（ｎ）は、（５）式に従い、オーディオバッファ部１１７のオーディオバッファｂｕｆｆｅｒ＿２（ｎ）の書込み位置ｗｒｉｔｅ＿ｉｎｄｅｘの位置に保持する。保持した後、呼びかけ処理部２０１は、（６）式のように、書込み位置ｗｒｉｔｅ＿ｉｎｄｅｘの値に「１」をインクリメントして進める。 At the same time, the calling processing unit 201 holds the microphone array processed signal x'_2(n) at the write position write_index of the audio buffer buffer_2(n) of the audio buffer unit 117 according to equation (5). After holding, the call processing unit 201 increments the value of the write position write_index by "1" and proceeds as shown in equation (6).

さらに、呼びかけ処理部２０１では、同時にマイクアレイ処理信号ｘ’＿２（ｎ）を音声認識部１１８に入力し、音声認識部１１８が音声認識を行い、マイクアレイ処理信号ｘ’＿２（ｎ）の音声認識結果をコマンド判定部１２０に出力する。 Furthermore, the calling processing unit 201 simultaneously inputs the microphone array processed signal x'_2(n) to the voice recognition unit 118, and the voice recognition unit 118 performs voice recognition to generate the voice of the microphone array processed signal x'_2(n). The recognition result is output to the command determination unit 120.

The command determination unit 120 compares the voice recognition result of the microphone array processed signal x'_2(n) with the command list held in the command list unit 119 (for example, the command list in FIG. 3) and determines whether a "person's name" and a "connection command" from the command list have been recognized in succession (for example, "Hello, Mr. ○○"). The command determination unit 120 then outputs the determination result to the output switching unit 203, and outputs the determination result together with the voice recognition result to the connection determination unit 123.

接続判定部１２３は、音声認識部１１８による音声認識結果及びコマンド判定部１２０に基づくコマンド判定結果に基づいて、接続判定を行い、接続判定結果をＮＷ通信部１０６に出力する。 The connection determination unit 123 performs connection determination based on the voice recognition result by the voice recognition unit 118 and the command determination result based on the command determination unit 120, and outputs the connection determination result to the NW communication unit 106.

For example, suppose the determination result is "2" and the command determination unit 120 outputs the voice recognition result "Hello, Mr. ××". If "Mr. ××" is near the place where the communication device 200 at the other party's base is installed, the connection determination unit 123 outputs to the connection determination result output terminal 124 a signal for connecting the video camera 104b and the microphone array 101b to a video camera and microphone array of the other party's communication device that are not already connected for "Mr. ○○". Whether "Mr. ××" is near the place where the communication device 200 at that base is installed is determined from information, registered in advance, about the people near each terminal.

ＮＷ通信部１０６は、接続判定結果出力端子１２４から出力された接続判定結果に基づき、ネットワーク１０７との接続処理を行う。 The NW communication unit 106 performs connection processing with the network 107 based on the connection determination result output from the connection determination result output terminal 124.

When the command determination unit 120 finds that a "person's name" and a "connection command" have not been recognized in succession in the voice recognition result for the microphone array processed signal x'_2(n), the output switching unit 203 continues to output a silent signal to the sound output terminal 122b as the output signal y_2(n), as shown in equation (19).
y_2(n) = 0 ... (19)

On the other hand, when the command determination unit 120 finds that a "person's name" and a "connection command" have been recognized in succession, the output switching unit 203 calculates the read position read_index_2 of the audio buffer unit 117 according to equation (20).

The output switching unit 203 then outputs the sound signal held in the audio buffer unit 117 to the sound output terminal 122b as the output signal y_2(n) for a fixed period (for example, the time length LEN), as shown in equation (21), and increments the read position read_index_2 by "1" as shown in equation (22).

ＮＷ通信部１０６は、音出力端子１２２から出力された出力信号ｙ＿２（ｎ）をネットワーク１０７で接続している相手のＮＷ通信部１０６に送信する。 The NW communication unit 106 transmits the output signal y_2(n) output from the sound output terminal 122 to the NW communication unit 106 of the other party connected via the network 107.

Once the output switching unit 203 has output the sound signal held in the audio buffer unit 117 for the fixed period, it outputs the microphone array processed signal x'_2(n) as the output signal y_2(n) to the sound output terminal 122b, as shown in equation (23).
y_2(n) = x'_2(n) ... (23)
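The three output states described above (silence, replay of the buffered calling voice, then the live signal) can be sketched as a single loop. This is a simplified illustration under stated assumptions: the function name `switch_output` and its parameters are hypothetical, the buffer is treated as a fixed snapshot read as a ring (modulo its length), and `start_index` stands in for the read position that the text computes with equation (20), which is not reproduced here.

```python
from typing import List, Optional, Sequence


def switch_output(
    live: Sequence[float],        # microphone array processed signal x'_2(n)
    buffered: Sequence[float],    # snapshot of the audio buffer contents
    trigger: Optional[int],       # sample index at which the connection completes, or None
    start_index: int,             # assumed initial read position (read_index_2)
    length: int,                  # number of buffered samples to replay (LEN)
) -> List[float]:
    """Return y_2(n) for every n in the live signal."""
    read_index = start_index
    out: List[float] = []
    for n, sample in enumerate(live):
        if trigger is None or n < trigger:
            # No "person name" + "connection command" yet: silence, eq. (19).
            out.append(0.0)
        elif n < trigger + length:
            # Replay the buffered calling voice and advance the read position.
            out.append(buffered[read_index % len(buffered)])
            read_index += 1
        else:
            # Replay finished: pass the live signal through, eq. (23).
            out.append(sample)
    return out
```

A real device would keep writing to the buffer while it is being read; the snapshot used here avoids that detail to keep the state transitions visible.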

The NW communication unit 106 continues to transmit the output signal y_2(n), output via the sound output terminal 122, to the NW communication unit 106 of the other party connected via the network 107.

Meanwhile, the other party's voice transmitted over the network 107 is input to the DA converter 108 via the NW communication unit 106. After the DA converter 108 converts it from a digital signal to an analog signal, the audio signal is amplified by the speaker amplifier 109 and output from the speaker 110b. In other words, the voice of the user 152b is output from the speaker 110b.

After the calling voice has been played back and the connection has been established, the remote call devices exchange video camera images and audio via the NW communication unit 106.

(B-3) Effects of the second embodiment
As described above, according to the second embodiment, the communication device uses a plurality of microphone arrays and performs signal processing that emphasizes the voices of a plurality of speakers separately. The processed signals are first held in the audio buffer unit; at the same time, voice recognition is performed on the processed signals, and for each microphone array signal it is determined whether the recognition result is a calling voice. If it is a calling voice, the device connects to the other party and then outputs the calling voice held in the audio buffer unit, so the conversation can start after the calling voice has reached the other party. When ending a call, voice recognition is performed on the processed signal and it is determined whether the recognition result is a disconnection voice; if it is, the connection with the other party's base is disconnected. This reproduces a state close to face-to-face conversation and allows a plurality of speakers to start a conversation with a high sense of presence.
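The determination step summarized above, deciding whether a recognition result is a calling voice or a disconnection voice, can be sketched as follows. This is only an illustration: the command words, the name list, and the function name `determine_command` are hypothetical, and the assumption that a disconnection command needs no preceding person name is not confirmed by the text.

```python
from typing import Optional, Sequence, Set, Tuple

# Hypothetical command list and registered names, for illustration only.
CONNECT_COMMANDS: Set[str] = {"hello"}
DISCONNECT_COMMANDS: Set[str] = {"goodbye"}
KNOWN_NAMES: Set[str] = {"XX", "OO"}


def determine_command(words: Sequence[str]) -> Tuple[str, Optional[str]]:
    """Classify a recognition result given as a sequence of words.

    Returns ("connect", name) when a known person name is immediately
    followed by a connection command, ("disconnect", None) when a
    disconnection command appears, and ("none", None) otherwise.
    """
    # Calling voice: person name directly followed by a connection command.
    for first, second in zip(words, words[1:]):
        if first in KNOWN_NAMES and second.lower() in CONNECT_COMMANDS:
            return ("connect", first)
    # Disconnection voice: assumed here to need no preceding name.
    for word in words:
        if word.lower() in DISCONNECT_COMMANDS:
            return ("disconnect", None)
    return ("none", None)
```

Running one such determination per microphone array signal would correspond to the per-array judgment described in the paragraph above.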

(C) Other embodiments
Various modified embodiments have already been described in each of the above embodiments, but the present invention can also be applied to the following modified embodiments.

(C-1) The communication device described in each of the above embodiments may be installed in, for example, a device that starts a telephone conference call in response to a command given by voice input.

(C-2) The processing of the call processing unit and the NW communication unit in the communication device described in each of the above embodiments may be performed by a processing device provided on the network (for example, a server).

(C-3) In the communication device described in each of the above embodiments, the microphone array 101 is arranged in front of the monitor 111, as illustrated in FIGS. 2, 5, and 6. However, the arrangement of the microphone array 101 is not limited to that shown in FIGS. 2, 5, and 6. For example, the microphone array 101 may be placed on the top or side of the monitor 111. Furthermore, if the communication device includes a projector and a screen, a screen on which the image projected by the projector is formed may be provided in place of the monitor 111. Various types of screen can be used, for example an ordinary screen on which the projected image is formed, or a screen that transmits sound. In the case of a sound-transmitting screen, the microphone array 101 may be placed behind the screen.

(C-4) The communication device described in each of the above embodiments illustrates the case where one microphone array 101 is provided, but two microphone arrays may be provided, for example microphone arrays 101a and 101b. In that case, for example, the microphone array 101a picks up the voice of the user 152a, and the microphone array 101b picks up the voice of the user 152b.

100 and 200... communication device, 101... microphone array, 102... microphone amplifier, 103... AD converter, 104, 104a and 104b... video camera, 105 and 201... call processing unit, 106... NW communication unit, 107... network, 108... DA converter, 109... speaker amplifier, 110a and 110b... speaker, 111... monitor, 112, 112a and 112b... video input terminal, 113, 113a and 113b... video output terminal, 114... person position detection unit, 115... sound input terminal, 116... signal processing unit, 117... audio buffer unit, 118... voice recognition unit, 119... command list unit, 120... command determination unit, 121 and 203... output switching unit, 122, 122a and 122b... sound output terminal, 123... connection determination unit, 124... connection determination result output terminal.

Claims

1. A communication device comprising:
a person detection unit that detects one or more people from an input video signal and acquires information regarding the position of each detected person;
a signal processing unit that forms the directivity of one or more microphones based on the information regarding the position of each person from the person detection unit and extracts the audio signal of each person;
a holding unit that serves as a buffer for having the other party reproduce a connection command voice, which is transmitted after connection with the other party, and that holds, for a certain period of time, the audio signal of each person, including the connection command voice, extracted by the signal processing unit;
a voice recognition unit that performs voice recognition based on an input audio signal when the one or more people are detected by the person detection unit;
a command storage unit that stores a plurality of commands including at least a connection command for starting a connection with a connection destination and a disconnection command for disconnecting the connection;
a command determination unit that determines whether the voice recognition result from the voice recognition unit includes the connection command or the disconnection command stored in the command storage unit;
an output switching unit that determines an output audio signal according to the command determination result from the command determination unit; and
a connection determination unit that performs connection processing with the other party's connection destination based on the voice recognition result and the command determination result,
wherein, when the voice recognition result includes the connection command,
the connection determination unit performs connection processing with the other party's connection destination, and
the output switching unit, after the connection with the other party's connection destination, outputs the audio signal containing the connection command held in the holding unit, and thereafter outputs the signal processed by the signal processing unit.

2. The communication device according to claim 1, wherein, when the voice recognition result includes the disconnection command,
the connection determination unit disconnects the connection with the connection destination after the output switching unit has output the audio signal containing the disconnection command held in the holding unit.

3. The communication device according to claim 1 or 2, wherein
video signals are input from each of a plurality of video cameras,
the person detection unit detects a person from the video signal from each of the plurality of video cameras,
the signal processing unit forms directivity for the audio signals picked up by the one or more microphones, based on the information regarding the position of each person detected by the person detection unit for each of the video signals, and extracts the audio signal of each person, and
the output switching unit separately outputs the audio signals of each person extracted by the signal processing unit.

4. A communication program that causes a computer to function as:
a person detection unit that detects one or more people from an input video signal and acquires information regarding the position of each detected person;
a signal processing unit that forms the directivity of one or more microphones based on the information regarding the position of each person from the person detection unit and extracts the audio signal of each person;
a holding unit that serves as a buffer for having the other party reproduce a connection command voice, which is transmitted after connection with the other party, and that holds, for a certain period of time, the audio signal of each person, including the connection command voice, extracted by the signal processing unit;
a voice recognition unit that performs voice recognition based on an input audio signal when the one or more people are detected by the person detection unit;
a command storage unit that stores a plurality of commands including at least a connection command for starting a connection with a connection destination and a disconnection command for disconnecting the connection;
a command determination unit that determines whether the voice recognition result from the voice recognition unit includes the connection command or the disconnection command stored in the command storage unit;
an output switching unit that determines an output audio signal according to the command determination result from the command determination unit; and
a connection determination unit that performs connection processing with the other party's connection destination based on the voice recognition result and the command determination result,
wherein, when the voice recognition result includes the connection command,
the connection determination unit performs connection processing with the other party's connection destination, and
the output switching unit, after the connection with the other party's connection destination, outputs the audio signal containing the connection command held in the holding unit, and thereafter outputs the signal processed by the signal processing unit.

5. A communication method in which:
a person detection unit detects one or more people from an input video signal and acquires information regarding the position of each detected person;
a signal processing unit forms the directivity of one or more microphones based on the information regarding the position of each person from the person detection unit and extracts the audio signal of each person;
a holding unit serves as a buffer for having the other party reproduce a connection command voice, which is transmitted after connection with the other party, and holds, for a certain period of time, the audio signal of each person, including the connection command voice, extracted by the signal processing unit;
a voice recognition unit performs voice recognition based on an input audio signal when the one or more people are detected by the person detection unit;
a command storage unit stores a plurality of commands including at least a connection command for starting a connection with a connection destination and a disconnection command for disconnecting the connection;
a command determination unit determines whether the voice recognition result from the voice recognition unit includes the connection command or the disconnection command stored in the command storage unit;
an output switching unit determines an output audio signal according to the command determination result from the command determination unit; and
a connection determination unit performs connection processing with the other party's connection destination based on the voice recognition result and the command determination result,
wherein, when the voice recognition result includes the connection command,
the connection determination unit performs connection processing with the other party's connection destination, and
the output switching unit, after the connection with the other party's connection destination, outputs the audio signal containing the connection command held in the holding unit, and thereafter outputs the signal processed by the signal processing unit.