JP2009021922A

JP2009021922A - Video conference apparatus

Info

Publication number: JP2009021922A
Application number: JP2007184237A
Authority: JP
Inventors: Toshiaki Ishibashi; 利晃石橋; Akio Yamane; 章生山根; Jun Asami; 純浅見; Kazuto Kawai; 和人川合; Satoshi Suzuki; 智鈴木
Original assignee: Yamaha Corp
Current assignee: Yamaha Corp
Priority date: 2007-07-13
Filing date: 2007-07-13
Publication date: 2009-01-29

Abstract

PROBLEM TO BE SOLVED: To provide a video conference apparatus that presents not only audio and video but also incidental information about a speaker at the same time. SOLUTION: The video control section 13 of the video conference apparatus 1 assigns video data 501-503 of cameras CA1-CA3 to partial video regions 511-513 of composite video data 500. The composite video data 500 is composed as a four-way split screen, and pieces of speaker individual information 541-546 are assigned to the remaining partial video region 514. Camera identification marks 521-523 are assigned to the partial video regions 511-513, respectively. When speaker azimuth camera information is received, the video control section 13 sets the speaker individual information and the camera identification mark corresponding to that information to a display mode different from that of the non-corresponding speaker individual information and camera identification marks. This processing is carried out at predetermined intervals, so when the speaker azimuth changes, the display mode of the speaker individual information and the camera identification mark changes as well. COPYRIGHT: (C)2009,JPO&INPIT

Description

The present invention relates to a video conference apparatus that conducts a conference by communicating video and audio between conference rooms located apart from each other.

Conventionally, various audio conference systems that communicate audio exist as systems for holding a conference between remote locations, and video conference systems that communicate video together with audio are becoming widespread. In such a video conference system, a video conference apparatus and a display are installed in each conference room, and video captured in the other conference rooms is displayed. When the local apparatus is connected to a plurality of video conference apparatuses, the videos captured by each of those apparatuses are shown on the display. For example, in the multiple-video composition method of Patent Document 1, the videos captured in a plurality of video conference rooms are converted into a video signal matrix, and a composite video is generated from that matrix. In this way, for example, videos captured by different video conference apparatuses are assigned to and displayed in the partial areas obtained by dividing the screen into four.
Patent Document 1: JP-A-6-30337

However, with the conventional method of Patent Document 1, a speaker can be identified from the audio and video, but additional information, such as the speaker's name, cannot be identified at a glance.

Therefore, an object of the present invention is to provide a video conference apparatus with which not only audio and video but also other incidental information about a speaker can be identified at the same time.

The video conference apparatus of the present invention includes, as a sound collection function, a plurality of microphones that each collect sound and generate collected audio data, and sound collection beam audio data generation means that generates, from the collected audio data of the plurality of microphones, a plurality of sound collection beam audio data whose sound collection directivities are centered on mutually different directions. As an imaging function, the apparatus includes imaging means that generates a plurality of video data covering mutually different imaging areas, and composite video data generation means that combines the plurality of video data from the imaging means to generate composite video data. As a communication function, the apparatus further includes communication control means that outputs the sound collection beam audio data and the composite video data.

Further, the video conference apparatus includes speaker orientation detection means that detects the speaker orientation based on the levels of the plurality of sound collection beam audio data.
The composite video data generation means of the video conference apparatus is characterized in that it generates the composite video data by executing the following processing.
The composite video data generation means assigns the plurality of video data to respective areas obtained by dividing the screen of the composite video data. It assigns speaker display data, in which information indicating the speakers imaged in each of the plurality of video data output from the imaging means (for example, a plurality of video data acquired by one or more cameras) is set in advance, to an area of the composite video data to which none of the plurality of video data is assigned. It places an imaging area specifying mark that corresponds individually to each video data (for example, a camera specifying mark for each camera) within the screen division area of that video data. The composite video data generation means then sets the speaker display data and the imaging area specifying mark corresponding to the speaker orientation detected by the speaker orientation detection means to a display mode different from that of the speaker display data and imaging area specifying marks for other orientations.

In this configuration, conference participants in mutually different imaging areas are imaged by one or more cameras. When a participant speaks, the direction of that participant, that is, the speaker orientation, is detected. For example, when there are a plurality of cameras, the camera whose imaging area contains this speaker orientation is selected, and the composite video data is generated so that the information about the selected camera (for example, center camera (C), right camera (R), left camera (L)) and the speaker-related information about the speakers imaged by that camera are displayed in a display mode different from that of the information and speaker-related information for the other cameras. This processing is repeated at a predetermined frame rate, and the composite video data is output sequentially.

In the video conference apparatus of the present invention, the speaker display data also includes a plurality of individual speaker display data. Based on the speaker orientation, the composite video data generation means sets the individual speaker display data corresponding to the speaker orientation within the speaker display data to a further different display mode.

In this configuration, when a plurality of conference participants are imaged in one imaging area and one of them becomes the speaker, composite video data is generated in which not only the imaging area but also the specific speaker within that imaging area can be identified.

In the video conference apparatus of the present invention, the sound collection beam audio data generation means also selects and outputs the sound collection beam audio data corresponding to the speaker orientation.

In this configuration, the speaker's voice is output at a high S/N ratio together with video in which the speaker can be identified.

The video conference apparatus of the present invention further includes a function for receiving audio data for sound emission and a sound emission function. The sound emission function includes a plurality of speakers and sound emission control means. The communication control means receives audio data for sound emission, which corresponds to sound collection beam audio data generated by the counterpart apparatus, and the speaker orientation associated with that audio data. The sound emission control means sets a sound emission directivity based on the received audio data for sound emission and the speaker orientation associated with it, and generates a speaker drive signal for each of the plurality of speakers in accordance with that directivity. The plurality of speakers emit sound based on the input speaker drive signals.

In this configuration, when sound collection beam audio data as described above is transmitted from the counterpart apparatus, the video conference apparatus on the receiving side can perform sound image localization according to the speaker orientation. As a result, the speaker can be identified easily, and the speaker's voice can be reproduced with a greater sense of presence.

According to the present invention, not only the speaker's voice and the video of the plurality of conference participants including the speaker, but also information identifying the participant who is speaking, is output together with the participant video. Participants at another video conference apparatus that receives this information, emits the sound, and displays the video can therefore easily identify the speaker while listening to the speaker's voice.

A video conference apparatus according to an embodiment of the present invention will be described with reference to the drawings.
FIG. 1 is an external perspective view of the video conference apparatus of this embodiment. In this apparatus, a cover made of punching mesh or the like is installed in front of the microphone array and the speaker array mounted on the front wall, but the cover is omitted from this figure.
As shown in FIG. 1, the video conference apparatus 1 consists of a sound emitting and collecting element housing of substantially elongated shape and a control circuit housing. The two side surfaces of the sound emitting and collecting element housing that run along its longitudinal direction serve as the front wall and the back wall. Microphones MC1 to MC16, speakers SP1 to SP14, and cameras CA1 to CA3 are installed on the front wall, and the control circuit housing is installed on the back wall side.

The microphones MC1 to MC16 have the same mechanism and the same sound collection characteristics, and are installed in a straight line along the longitudinal direction, at predetermined intervals, on the top-surface side of the front wall. They are arranged at a narrow pitch near the center of the arrangement direction (that is, near the center as viewed from the front) and at a wide pitch near both ends. The microphones MC1 to MC16 are installed so that their sound collection range extends outward from the front wall. Together they form a microphone array whose sound collection range is the front direction.

The speakers SP1 to SP14 have the same mechanism and the same sound emission characteristics, and are installed in a straight line along the longitudinal direction, at predetermined intervals, at the vertical center of the front wall. They are installed so that sound is emitted outward from the front wall. Together they form a speaker array whose sound emission range is the front direction.

The cameras CA1 to CA3 have the same mechanism and the same imaging characteristics.
The camera CA1 is installed on the bottom-surface side of the front wall, at the center in the longitudinal direction (that is, the center as viewed from the front). It is installed so that the center direction of its imaging range is perpendicular to the front wall.
The camera CA2 is installed on the bottom-surface side of the front wall, at one end in the longitudinal direction (in FIG. 1, the right end when the video conference apparatus 1 is viewed from the front). The center direction of its imaging range is set at a predetermined angle to the front wall; in FIG. 1, for example, it points from the right end of the front wall toward a predetermined position in front of the left end when the video conference apparatus 1 is viewed from the front.
The camera CA3 is installed on the bottom-surface side of the front wall, at the other end in the longitudinal direction (in FIG. 1, the left end when the video conference apparatus 1 is viewed from the front). The center direction of its imaging range is set at a predetermined angle to the front wall; in FIG. 1, for example, it points from the left end of the front wall toward a predetermined position in front of the right end when the video conference apparatus 1 is viewed from the front. The imaging ranges of the cameras CA2 and CA3 are set symmetrically about the center direction of the imaging range of the camera CA1, which serves as the reference axis. The imaging ranges of the cameras CA1 to CA3 adjoin one another so that, together, they cover the entire surroundings on the front side of the video conference apparatus 1 along the longitudinal direction.
The control circuit housing of the video conference apparatus 1 contains the functional units other than the microphone array formed by the microphones MC1 to MC16, the speaker array formed by the speakers SP1 to SP14, and the cameras CA1 to CA3.

FIG. 2 is a diagram showing the functional block configuration of the video conference apparatus 1 of this embodiment and its external connections.
As shown in FIG. 2, the video conference apparatus 1 includes, together with the microphones MC1 to MC16, the speakers SP1 to SP14, and the cameras CA1 to CA3 described above, a main control unit 10, a sound collection control unit 11, an echo canceller 12, a video control unit 13, a sound emission control unit 14, a communication control unit 15, and an operation unit 16.

The main control unit 10 performs overall control of the video conference apparatus 1 and performs control according to the operations input through the operation unit 16. The main control unit 10 acquires the selection information for the sound collection beam audio data selected by the sound collection control unit 11 and detects the speaker orientation from it. The main control unit 10 then selects the camera whose imaging range contains the detected speaker orientation and outputs the selection to the video control unit 13 as speaker orientation camera information.
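
The mapping from a detected speaker orientation to a camera can be sketched as follows. This is only an illustration under stated assumptions, not the implementation described in the specification: the angular ranges, the sign convention, and the names `CAMERA_RANGES_DEG` and `select_camera` are hypothetical.

```python
# Hypothetical angular coverage of each camera, in degrees measured from the
# front-wall normal. Negative angles point toward the CA3 end of the housing,
# positive angles toward the CA2 end. The numbers are illustrative only.
CAMERA_RANGES_DEG = {
    "CA2": (-90.0, -30.0),  # end-mounted camera looking across to the far side
    "CA1": (-30.0, 30.0),   # center camera looking straight ahead
    "CA3": (30.0, 90.0),    # end-mounted camera looking across to the far side
}


def select_camera(speaker_azimuth_deg: float) -> str:
    """Return the first camera whose imaging range contains the azimuth."""
    for camera, (low, high) in CAMERA_RANGES_DEG.items():
        if low <= speaker_azimuth_deg <= high:
            return camera
    raise ValueError(f"azimuth {speaker_azimuth_deg} is outside the covered range")
```

An azimuth that lies exactly on a shared boundary is resolved in favor of the camera listed first in the table.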

The main control unit 10 also supplies speaker orientation data indicating the detected speaker orientation to the communication control unit 15, and instructs the communication control unit 15 to transmit it in association with the output audio data formed from the sound collection beam audio data corresponding to that speaker orientation data. In addition, the main control unit 10 acquires the speaker orientation data associated with the output audio data of the counterpart video conference apparatus received by the communication control unit 15, and instructs the sound emission control unit 14 to perform sound source localization according to that speaker orientation data.

The microphones MC1 to MC16 collect sound on the front side of the local apparatus (the video conference apparatus 1), generate sound collection signals, and output them to the sound collection control unit 11.

The sound collection control unit 11 applies signal processing with mutually different delay and amplitude patterns to the sound collection signals of the microphones MC1 to MC16, thereby generating a plurality of sound collection beam audio data whose sound collection directivities are centered on mutually different orientations. More specifically, the sound collection control unit 11 amplifies the sound collection signal of each microphone MC1 to MC16 by a predetermined gain and applies A/D (analog-to-digital) conversion to generate individual collected audio data. The sound collection control unit 11 stores in advance a delay coefficient and an amplitude coefficient for each individual collected audio data, for each of the different sound collection directivities to be realized. For each configured sound collection directivity, it applies filter processing based on these delay and amplitude coefficients to each individual collected audio data, thereby generating sound collection beam audio data with mutually different sound collection directivities.
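
As a rough illustration of the delay and amplitude processing described above, the following sketch forms one beam by delay-and-sum. It assumes a far-field source, integer-sample delays, and a linear array; the function names, the speed-of-sound constant, and the way the delays are derived are assumptions, since the specification only states that the coefficients are stored in advance.

```python
import math

SPEED_OF_SOUND = 343.0  # m/s, assumed room-temperature value


def steering_delays(mic_positions_m, azimuth_deg, sample_rate_hz):
    """Integer sample delays that align a plane wave arriving from azimuth_deg.

    mic_positions_m: microphone coordinates along the array axis, in meters.
    azimuth_deg: arrival direction measured from the array normal.
    """
    sin_az = math.sin(math.radians(azimuth_deg))
    # Relative arrival time at each microphone for a far-field source.
    arrival = [-x * sin_az / SPEED_OF_SOUND for x in mic_positions_m]
    latest = max(arrival)
    # Delay the early channels so that every channel lines up with the latest.
    return [round((latest - t) * sample_rate_hz) for t in arrival]


def delay_and_sum(channels, delays, gains):
    """Apply per-channel delay and amplitude coefficients, then sum."""
    length = len(channels[0])
    beam = [0.0] * length
    for signal, delay, gain in zip(channels, delays, gains):
        for n in range(delay, length):
            beam[n] += gain * signal[n - delay]
    return beam
```

Calling `steering_delays` once per target orientation yields one coefficient set per beam, which corresponds to the stored per-directivity coefficients in the text.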

The sound collection control unit 11 compares the levels (audio levels) of the generated sound collection beam audio data, selects the sound collection beam audio data whose level exceeds a preset sound detection threshold, and outputs it to the echo canceller 12.
The sound collection control unit 11 supplies selection information identifying the selected sound collection beam audio data to the main control unit 10.
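
The level comparison can be sketched as below. The specification does not say how the level is measured or how ties between several beams above the threshold are resolved, so the RMS measure and the "loudest beam wins" rule are assumptions, as are the names `rms_level` and `select_beam`.

```python
import math


def rms_level(samples):
    """Root-mean-square level of one block of beam samples."""
    return math.sqrt(sum(s * s for s in samples) / len(samples))


def select_beam(beams, threshold):
    """Return the index of the loudest beam above threshold, or None if silent.

    beams: list of sample blocks, one per sound collection beam.
    threshold: assumed sound detection threshold applied to the RMS level.
    """
    levels = [rms_level(block) for block in beams]
    best = max(range(len(beams)), key=lambda i: levels[i])
    return best if levels[best] > threshold else None
```

The returned index plays the role of the selection information passed to the main control unit 10, which can map it to a speaker orientation.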

The echo canceller 12 includes an adaptive filter and a post-processor. The adaptive filter generates pseudo echo data based on the output audio data from the counterpart video conference apparatus supplied by the communication control unit 15, and passes it to the post-processor. The post-processor includes an adder, which subtracts the pseudo echo data from the sound collection beam audio data output by the sound collection control unit 11. This echo cancellation produces the output audio data, which is output to the communication control unit 15. At the same time, the post-processor feeds the echo cancellation result back to the adaptive filter.
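
The adaptive filter, subtraction, and feedback loop described above can be sketched as follows. The specification does not name the adaptation algorithm; this sketch uses the normalized LMS (NLMS) update as a stand-in, and the class name, tap count, and step size are assumptions.

```python
class EchoCanceller:
    """Adaptive FIR filter plus subtractor, adapted with NLMS (an assumption)."""

    def __init__(self, taps=8, step=0.5, eps=1e-8):
        self.weights = [0.0] * taps   # current estimate of the echo path
        self.history = [0.0] * taps   # recent far-end samples, newest first
        self.step = step
        self.eps = eps

    def process(self, far_end_sample, mic_sample):
        """Return the echo-cancelled sample for one sampling instant."""
        self.history = [far_end_sample] + self.history[:-1]
        # Pseudo echo: far-end signal passed through the estimated echo path.
        pseudo_echo = sum(w * x for w, x in zip(self.weights, self.history))
        # Subtract the pseudo echo from the collected beam signal.
        error = mic_sample - pseudo_echo
        # Feed the residual back to adapt the filter (NLMS update).
        power = sum(x * x for x in self.history) + self.eps
        gain = self.step * error / power
        self.weights = [w + gain * x for w, x in zip(self.weights, self.history)]
        return error
```

Here `far_end_sample` stands for the received audio that is about to be emitted, and `mic_sample` for the selected sound collection beam audio data.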

The cameras CA1 to CA3 generate video data by imaging their respective imaging ranges as described above, and output the video data to the video control unit 13.

The video control unit 13 generates composite video data from the video data of the cameras CA1 to CA3 and preset speaker identification image data. To do so, the video control unit 13 divides the full screen of the composite video data into four equal partial image areas about the center of the screen (see FIG. 5, described in detail later). It assigns the videos of the cameras CA1 to CA3 to three of the partial image areas and assigns the remaining partial image area to the speaker identification image data. The speaker identification image data consists of preset speaker individual information such as speaker names. Each piece of speaker individual information is stored in association with the camera that images the corresponding speaker.
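
The four-way split can be sketched as a simple layout table. Which quadrant holds which source is not stated at this point in the text, so the placement below, like the name `quadrant_layout`, is an assumption for illustration.

```python
def quadrant_layout(width, height):
    """Split a frame into four equal regions about its center.

    Returns a dict mapping a source name to (x, y, w, h) in pixels.
    The assignment of sources to quadrants is illustrative only.
    """
    half_w, half_h = width // 2, height // 2
    return {
        "CA1": (0, 0, half_w, half_h),
        "CA2": (half_w, 0, half_w, half_h),
        "CA3": (0, half_h, half_w, half_h),
        "speaker_info": (half_w, half_h, half_w, half_h),
    }
```

For a 640 by 480 frame, each region is 320 by 240 pixels, with the speaker identification image data occupying the one region not used by a camera.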

Within the partial image area assigned to each camera CA1 to CA3, the video control unit 13 forms a camera identification information display field that displays the camera identification mark for that camera.
Based on the speaker orientation camera information supplied by the main control unit 10, the video control unit 13 selects the camera identification mark corresponding to the speaker orientation and the speaker individual information corresponding to that camera. It then sets the selected camera identification mark and speaker individual information to be displayed in a mode different from that of the other camera identification marks and speaker individual information. For example, the camera identification mark and speaker individual information for the selected speaker orientation are displayed at a higher luminance than the others. This processing is performed continuously at a predetermined frame rate. As a result, when the speaker changes, so that the speaker orientation changes and the speaking participant appears in different video data, the display modes of the camera identification marks and the speaker individual information switch accordingly. The composite video data formed in this way is output sequentially to the communication control unit 15.
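
The per-frame display mode decision can be sketched as below. The mode labels, the function name `display_modes`, and the sample association table are hypothetical; the text only requires that the selected mark and information be rendered differently, for example at higher luminance.

```python
# Hypothetical association of speaker individual information with cameras.
SPEAKER_INFO_BY_CAMERA = {
    "CA1": ["Speaker A", "Speaker B"],
    "CA2": ["Speaker C", "Speaker D"],
    "CA3": ["Speaker E", "Speaker F"],
}


def display_modes(active_camera, speaker_info_by_camera):
    """Return display modes for camera marks and speaker entries for one frame.

    active_camera: camera named by the speaker orientation camera information,
    or None when nobody is speaking.
    """
    marks = {}
    names = {}
    for camera, speakers in speaker_info_by_camera.items():
        mode = "highlight" if camera == active_camera else "normal"
        marks[camera] = mode
        for speaker in speakers:
            names[speaker] = mode
    return marks, names
```

Calling this once per frame with the latest speaker orientation camera information reproduces the switching behavior described above.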

The communication control unit 15 transmits the output audio data from the echo canceller 12 in association with the speaker orientation data, and also transmits the composite video data from the video control unit 13.

As a result, the counterpart (receiving-side) video conference apparatus emits sound based on the output audio data and reproduces the composite video data, so that participants in the counterpart conference room can listen to the speaker while watching video of the entire transmitting-side conference room, and can easily identify the speaker by sight. Because beamformed audio is emitted, the speaker's voice is reproduced at a high S/N ratio, and echo cancellation raises the S/N ratio further. This makes for a video conference that is easy for participants to follow and convenient to use.

When the communication control unit 15 receives output audio data, speaker orientation data, and composite video data from the transmitting side via the network 900, it outputs the output audio data to the sound emission control unit 14 via the echo canceller 12. The communication control unit 15 outputs the speaker orientation data to the main control unit 10, and outputs the composite video data to a display 20 that is separate from the video conference apparatus 1. The display 20 is a liquid crystal display or the like, and reproduces and displays the composite video data input from the communication control unit 15.

The sound emission control unit 14 generates the individual drive signals supplied to the speakers SP1 to SP14, based on the output audio data generated at the counterpart and passed on by the communication control unit 15, and on the sound source localization information that the main control unit 10 derives from the speaker orientation data generated at the counterpart. More specifically, the sound emission control unit 14 distributes the output audio data to the speakers SP1 to SP14 and applies delay processing and amplitude processing based on the sound source localization information to each distributed audio data, thereby generating individual drive audio data. It then applies D/A (digital-to-analog) conversion to each individual drive audio data to generate the individual drive signals, amplifies them by a predetermined gain based on the volume and other settings made through the operation unit 16, and outputs them to the speakers SP1 to SP14.
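
One way to derive per-speaker delay and amplitude values for a virtual sound source is sketched below. The geometry (a point source placed behind a linear array), the 1/distance gain law, and the name `localization_coefficients` are assumptions; the specification states only that delay and amplitude processing are applied based on the sound source localization information.

```python
import math

SPEED_OF_SOUND = 343.0  # m/s, assumed room-temperature value


def localization_coefficients(speaker_positions_m, source_x_m, source_depth_m,
                              sample_rate_hz):
    """Per-speaker (delay_samples, gain) pairs for a virtual point source.

    The virtual source sits source_depth_m (> 0) behind the speaker array at
    lateral position source_x_m. Delays are relative to the nearest speaker.
    Gains fall off as 1/distance and equal 1.0 at the nearest speaker.
    """
    distances = [math.hypot(x - source_x_m, source_depth_m)
                 for x in speaker_positions_m]
    nearest = min(distances)
    return [
        (round((d - nearest) / SPEED_OF_SOUND * sample_rate_hz), nearest / d)
        for d in distances
    ]
```

Speakers closer to the virtual source fire earlier and louder, so the emitted wavefront approximates one radiating from the virtual speaker position.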

The speakers SP1 to SP14 emit sound based on the input individual drive signals. This realizes sound source localization based on the speaker orientation data, so that the sound is heard as if it were produced at a virtual speaker position.

With this sound emission directivity control, in addition to the video display effect described above, participants hear the speaker's voice from a direction that corresponds to the speaker orientation, which makes for a video conference with a greater sense of presence.

Next, a more specific usage example will be described with reference to the drawings.
FIG. 3 is a plan view showing an arrangement example of the video conference apparatus 1 and its imaging ranges. FIG. 4 is a plan view showing an arrangement example of the video conference apparatus 1 and the display 20.
As shown in FIG. 3, a conference table 400 is installed in the conference room, and conference participants 301 to 306 are seated around three sides of the conference table 400. The video conference apparatus 1 is installed on the remaining side of the conference table 400, with its front direction facing the conference table 400. The video conference apparatus 1 is installed, for example, on the top surface of the display 20, as shown in FIG. 4.

The conference participants 301 and 302 are seated on the side of the conference table 400 that faces the video conference apparatus 1. The conference participants 303 and 304 are seated along the edge of the conference table 400 on the left end side (camera CA3 side) of the video conference apparatus 1, and the conference participants 305 and 306 are seated along the edge of the conference table 400 on the right end side (camera CA2 side).

FIG. 5 is a diagram showing examples of the video data of the cameras CA1 to CA3 in the situation of FIG. 3: (A) is an example of the video data of the camera CA1, (B) is an example of the video data of the camera CA2, and (C) is an example of the video data of the camera CA3.

When conference participants 301 to 306 are seated as shown in FIG. 3 and the video conference apparatus 1 is activated, the camera CA1 images an area including conference participants 301 and 302, as shown in FIG. 5(A), and generates video data 501 containing a video 311 of conference participant 301 (hereinafter, participant video 311), a video 312 of conference participant 302 (hereinafter, participant video 312), and a video 410 of the table 400 (hereinafter, table video 410). The camera CA2 images an area including conference participants 303 and 304, as shown in FIG. 5(B), and generates video data 502 containing a video 313 of conference participant 303 (hereinafter, participant video 313), a video 314 of conference participant 304 (hereinafter, participant video 314), and the table video 410. The camera CA3 images an area including conference participants 305 and 306, as shown in FIG. 5(C), and generates video data 503 containing a video 315 of conference participant 305 (hereinafter, participant video 315), a video 316 of conference participant 306 (hereinafter, participant video 316), and the table video 410.

The video control unit 13 forms composite video data 500 based on the video data 501 to 503.
FIG. 6 is a diagram showing examples of the composite video data: (A) shows the state in which conference participant 301 has spoken, and (B) shows the state in which conference participant 303 has spoken.
The video control unit 13 divides the display area of the composite video data 500 into four partial video areas 511 to 514, using the center of the display area as a reference point, and assigns the video data 501 to 503 and speaker identification image data to these areas.
More specifically, taking the reference point as the origin of an X-Y coordinate system, the video control unit 13 assigns the video data 501 to the partial video area 511 in the second quadrant, the video data 502 to the partial video area 512 in the first quadrant, and the video data 503 to the partial video area 513 in the third quadrant. To the partial video area 514 in the fourth quadrant, the video control unit 13 assigns speaker identification image data made up of pre-stored speaker individual information 541 to 546.
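
As a non-limiting illustration, the quadrant assignment described above can be sketched in Python as follows. The sketch assumes image coordinates with the origin at the top-left corner and the Y axis pointing down, so the second quadrant corresponds to the upper-left area; the names `Rect`, `quadrant_layout`, and `ASSIGNMENT` are hypothetical and do not appear in the specification.

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass(frozen=True)
class Rect:
    """Axis-aligned rectangle in pixel coordinates (top-left origin, assumed)."""
    x: int
    y: int
    width: int
    height: int


def quadrant_layout(width: int, height: int) -> dict[int, Rect]:
    """Split a width x height frame into partial video areas 511 to 514.

    The center of the frame is the reference point. Keys are the reference
    numerals of the partial video areas used in the description.
    """
    cx, cy = width // 2, height // 2
    return {
        511: Rect(0, 0, cx, cy),                     # second quadrant (upper left)
        512: Rect(cx, 0, width - cx, cy),            # first quadrant (upper right)
        513: Rect(0, cy, cx, height - cy),           # third quadrant (lower left)
        514: Rect(cx, cy, width - cx, height - cy),  # fourth quadrant (lower right)
    }


# What each partial video area holds, following the description.
ASSIGNMENT = {
    511: "video data 501 (camera CA1)",
    512: "video data 502 (camera CA2)",
    513: "video data 503 (camera CA3)",
    514: "speaker identification image data (541 to 546)",
}

if __name__ == "__main__":
    for area, rect in quadrant_layout(640, 480).items():
        print(area, rect, ASSIGNMENT[area])
```

Splitting at the integer center keeps the four areas non-overlapping and covering the whole frame even when the frame dimensions are odd.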

The video control unit 13 further assigns camera identification marks 521 to 523 to the partial video areas 511 to 513, respectively, to which the video data 501 to 503 are assigned. Each camera identification mark is associated with one of the cameras CA1 to CA3: a "C" mark is set for the camera CA1, an "R" mark for the camera CA2, and an "L" mark for the camera CA3. The camera identification marks 521 to 523 are arranged around the reference point so that they touch one another.

In this basic display mode, the video control unit 13 sets, based on the speaker orientation camera information, a high-luminance display for the camera identification mark of the camera imaging the speaker and for the speaker individual information associated with that camera. For example, when conference participant 301 is detected as the speaker, as shown in FIG. 6(A), the camera identification mark 521 in the partial video area 511, which holds the video data 501 containing the participant video 311, is set to be displayed in a high-luminance yellow lit state, and the other camera identification marks 522 and 523 are set to be displayed in a low-luminance gray lit state. In addition, the speaker individual information 541 and 542 corresponding to the camera CA1 in the partial video area 514 is set to be displayed in the high-luminance yellow lit state, and the other speaker individual information 543 to 546 is set to be displayed in the low-luminance gray lit state. When the speaker is then detected to have switched to conference participant 303, as shown in FIG. 6(B), the camera identification mark 522 in the partial video area 512, which holds the video data 502 containing the participant video 313, is set to be displayed in the high-luminance yellow lit state, and the other camera identification marks 521 and 523 are set to be displayed in the low-luminance gray lit state. In addition, the speaker individual information 543 and 544 corresponding to the camera CA2 in the partial video area 514 is set to be displayed in the high-luminance yellow lit state, and the other speaker individual information 541, 542, 545, and 546 is set to be displayed in the low-luminance gray lit state.
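
The camera-based display setting described above can be illustrated by the following minimal Python sketch. It is only one possible reading of the description: the table names and the function `display_modes` are hypothetical, and the behavior when no speaker camera is given (everything dimmed) is an assumption of the sketch, not something the description specifies.

```python
HIGHLIGHT = "high-luminance yellow"
DIMMED = "low-luminance gray"

# Camera identification mark for each camera (CA1: "C", CA2: "R", CA3: "L").
CAMERA_MARK = {"CA1": 521, "CA2": 522, "CA3": 523}

# Speaker individual information associated with each camera.
CAMERA_SPEAKER_INFO = {
    "CA1": (541, 542),
    "CA2": (543, 544),
    "CA3": (545, 546),
}


def display_modes(speaker_camera):
    """Return the display mode for every mark and speaker information item.

    speaker_camera is the camera named by the speaker orientation camera
    information ("CA1", "CA2" or "CA3"). Passing None dims everything,
    which is an assumption made for this sketch.
    """
    modes = {}
    for camera, mark in CAMERA_MARK.items():
        mode = HIGHLIGHT if camera == speaker_camera else DIMMED
        modes[mark] = mode
        for info in CAMERA_SPEAKER_INFO[camera]:
            modes[info] = mode
    return modes


if __name__ == "__main__":
    # State of FIG. 6(A): the speaker is imaged by camera CA1.
    for element, mode in sorted(display_modes("CA1").items()):
        print(element, mode)
```

Because the function recomputes every element from the current speaker camera, calling it again at each predetermined interval is enough to follow a change of speaker, as in the transition from FIG. 6(A) to FIG. 6(B).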

By reproducing composite video data with these display settings, conference participants on the reproducing side can easily identify the video that includes the speaker and easily check the speaker's individual information. In the case of FIG. 6, two conference participants are captured in each video, so two pieces of speaker individual information are displayed in a mode different from the others. If only one conference participant is captured in each video, the speaker's individual information can be confirmed more reliably.

In the above description, based on the speaker orientation camera information, all speaker individual information associated with the camera imaging the area that includes the speaker orientation is set to a display mode different from that of the other speaker individual information. However, the video conference apparatus of this embodiment can detect the orientation per conference participant, not only per imaging area of the cameras CA1 to CA3. The main control unit 10 may therefore generate more detailed speaker orientation information in addition to the speaker orientation camera information and supply it to the video control unit 13.

In this case, the video control unit 13 stores the speaker orientation information and the speaker individual information in association with each other in advance. On receiving the speaker orientation camera information and the speaker orientation information from the main control unit 10, the video control unit 13 executes the processing for the camera identification marks and the processing for the speaker individual information based on the speaker orientation camera information described above, and also obtains, from among the plural pieces of speaker individual information thus selected, the speaker individual information associated with the true speaker orientation. The video control unit 13 then sets the speaker individual information associated with the true speaker orientation to yet another display mode.
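
A hedged Python sketch of this per-speaker refinement is given below. The specification states only that speaker orientation information and speaker individual information are stored in association with each other; the azimuth ranges in `SPEAKER_TABLE`, the half-open range convention, and all names in the sketch are hypothetical values chosen for illustration.

```python
ACTIVE_SPEAKER = "lit, emphasized (brighter or blinking)"
SAME_CAMERA = "lit"
OTHER = "unlit"

# (speaker individual information, camera, azimuth range [low, high) in degrees).
# The angle values are hypothetical; the specification gives no numbers.
SPEAKER_TABLE = [
    (541, "CA1", (-30.0, 0.0)),
    (542, "CA1", (0.0, 30.0)),
    (543, "CA2", (-90.0, -60.0)),
    (544, "CA2", (-60.0, -30.0)),
    (545, "CA3", (30.0, 60.0)),
    (546, "CA3", (60.0, 90.0)),
]


def speaker_info_modes(speaker_camera, speaker_azimuth):
    """Choose a display mode for each piece of speaker individual information.

    Items of other cameras are unlit, items of the speaker's camera are lit,
    and the item whose stored azimuth range contains the detected speaker
    azimuth is emphasized further.
    """
    modes = {}
    for info_id, camera, (low, high) in SPEAKER_TABLE:
        if camera != speaker_camera:
            modes[info_id] = OTHER
        elif low <= speaker_azimuth < high:
            modes[info_id] = ACTIVE_SPEAKER
        else:
            modes[info_id] = SAME_CAMERA
    return modes


if __name__ == "__main__":
    # A speaker detected at -10 degrees within the imaging area of camera CA1.
    for info_id, mode in speaker_info_modes("CA1", -10.0).items():
        print(info_id, mode)
```

The three-level result keeps the camera-level selection intact and layers the true-speaker emphasis on top of it, so the camera identification mark processing does not need to change.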

FIG. 7 is a diagram showing an example of the displayed video when the display mode is set per speaker. As in the situation of FIG. 3 and FIG. 6(A) described above, FIG. 7 shows the case in which conference participant 301 has spoken.
As shown in FIG. 7, the display modes of the partial video area 511 holding the video data 501, the partial video area 512 holding the video data 502, and the partial video area 513 holding the video data 503 are the same as in FIG. 6(A). In the partial video area 514, however, the speaker individual information 541 and 542 corresponding to the camera CA1 is set to be displayed in a lit state, and the other speaker individual information 543 to 546 is set to be displayed in an unlit state. Furthermore, the speaker individual information 541 corresponding to conference participant 301, who is the speaker, is set to be lit at a higher luminance than the speaker individual information 542 corresponding to conference participant 302, who is not the speaker, or is set to blink, for example.

With this display mode, in addition to the effects described above, the speaker can be identified reliably at a glance.

Although the above description shows only one display example for the partial video area 514, the display content can be configured through a Web screen such as the one shown in FIG. 8.

FIG. 8 is a diagram showing the setting input screen for the cameras and conference information.
To make settings on the Web screen, the user connects a device such as a personal computer to the network 900 and accesses the video conference apparatus 1. The user then performs a remote operation input to display the Web setting screen 600 shown in FIG. 8 on the display of the personal computer. The Web setting screen 600 provides a "default" selection field 601 and a "participant information" selection field 602 for choosing the setting method. The user chooses the setting method by selecting one of these fields with an operating device such as a mouse. The "participant information" selection field 602 has a content display field 621 showing its content, and the content display field 621 has a display content input field 6211 and a camera assignment setting field 6212.

Personal identification information such as the names of the conference participants is entered in the display content input field 6211, as shown in FIG. 8. Although FIG. 8 shows an example in which names are entered, the input is not limited to any particular form as long as it identifies an individual. The camera assignment setting field 6212 is a field for selecting the assignment of the camera CA1 (C), CA2 (R), or CA3 (L) for each entry of personal identification information in the display content input field 6211; when personal identification information has been entered, one of the cameras is selected. The display mode of the partial video area 514 is set from this selection and the speaker orientation camera information described above. That is, personal identification information for which the same camera as the one indicated by the speaker orientation camera information is selected is displayed in a mode different from that of the other personal identification information. When no personal identification information has been entered, "not used" is selected in the corresponding camera assignment setting field 6212. As a result, only the personal identification information that has been entered is displayed.
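
As a rough illustration of how such settings could drive the partial video area 514, the following Python sketch filters out unused entries and flags the entries assigned to the speaker's camera. The `Entry` type, the `"unused"` token, the function `panel_items`, and the participant names are all hypothetical; the specification does not define a data format for the Web settings.

```python
from typing import List, NamedTuple, Optional, Tuple


class Entry(NamedTuple):
    """One row of the settings: input field 6211 plus assignment field 6212."""
    name: str    # personal identification information (may be empty)
    camera: str  # "C" (CA1), "R" (CA2), "L" (CA3) or "unused"


def panel_items(entries: List[Entry],
                speaker_mark: Optional[str]) -> List[Tuple[str, bool]]:
    """Return (name, highlighted) pairs to draw in partial video area 514.

    Rows without a name or marked "unused" are not displayed. A row is
    highlighted when its camera matches the camera indicated by the
    speaker orientation camera information.
    """
    items = []
    for entry in entries:
        if not entry.name or entry.camera == "unused":
            continue
        items.append((entry.name, entry.camera == speaker_mark))
    return items


if __name__ == "__main__":
    settings = [
        Entry("Participant A", "C"),
        Entry("Participant B", "C"),
        Entry("Participant C", "R"),
        Entry("", "unused"),
    ]
    # The speaker is in the imaging area of camera CA2 ("R").
    for name, highlighted in panel_items(settings, "R"):
        print(name, "highlighted" if highlighted else "normal")
```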

After these settings have been made, selecting the setting button field 631 confirms the input and applies it to the video control unit 13. Selecting the reset button field 632 resets the input.

By performing such setting operation inputs, a display screen such as the one shown in FIG. 6 can be obtained.

When the "default" selection field 601 is selected, an image set in advance at the time of factory shipment is displayed in the partial video area 514. For example, the name of the video conference apparatus, the time, and the elapsed conference time are displayed. When "not used" is selected, the video area corresponding to "not used" within the partial video area 514 becomes vacant, so that area may be assigned as a display area for text information supplied by external input. That is, a text input field may be provided on the Web screen of the personal computer or the like described above, and control may be performed so that the content entered in the text input field is displayed. This allows a conference participant to enter simple text information, such as matters decided during the conference, and allows the participants in each conference room to share that information.

Although the above description shows an example in which all functions are provided in the main body of the video conference apparatus 1, the communication control unit 15 and the associated part of the functions of the main control unit 10 may be provided separately from the sound emission and collection function and the imaging function. The numbers of loudspeakers, microphones, and cameras in the above description are examples and may be set as appropriate according to the specifications of the apparatus. That is, the number of loudspeakers is set according to the desired sound emission characteristics, and the number of microphones is set according to the desired sound collection characteristics. The number of cameras may also be more or fewer than three depending on the desired imaging area, the presence or absence of an imaging area dividing function, and the like. Three cameras are preferable, however. The split screen that characterizes the present invention is a four-way split about the screen center, chosen with ease of viewing in mind, and one of the four divided areas is used for displaying text information, so at most three areas are available for displaying conference video. If there are four or more cameras, their videos cannot all be displayed at the same time in three areas, and the videos of the four or more cameras must be assigned to the three divided areas by automatic switching or the like, for example when the video that includes the speaker is not being displayed. This requires more complicated processing. Furthermore, when the number of cameras exceeds the number of divided areas, the entire conference room can no longer be imaged continuously. For these reasons, the number of cameras in the present invention is preferably three.

Although the above description shows an example in which the sound collection function, the imaging function, and the sound emission function are housed in a single casing, the sound collection function and the imaging function may be integrated without a sound emission function, or the sound emission function may be provided as a separate unit.

Although the above description shows an example in which high-luminance yellow lighting and low-luminance gray lighting are used as the display for identifying the speaker, these lighting modes are only examples. Any lighting mode, including turning the display off, may be set as appropriate as long as the presence or absence of a speaker can be clearly distinguished.

FIG. 1 is an external perspective view of the video conference apparatus of the present invention.
FIG. 2 is a diagram showing the functional block configuration of the video conference apparatus 1 of the present invention and its connections to external devices.
FIG. 3 is a plan view showing an arrangement example and the imaging range of the video conference apparatus 1.
FIG. 4 is a plan view showing an arrangement example of the video conference apparatus 1 and the display 20.
FIG. 5 is a diagram showing examples of the video data of the cameras CA1 to CA3 in the situation of FIG. 3.
FIG. 6 is a diagram showing examples of the composite video data.
FIG. 7 is a diagram showing an example of the displayed video when the display mode is set per speaker.
FIG. 8 is a diagram showing the setting input screen for the cameras and conference information.

Explanation of symbols

1: video conference apparatus; 10: main control unit; 11: sound collection control unit; 12: echo canceller; 13: video control unit; 14: sound emission control unit; 15: communication control unit; 16: operation unit; 20: display; MC1 to MC16: microphones; SP1 to SP14: loudspeakers; CA1 to CA3: cameras; 301 to 306: conference participants; 311 to 316: participant videos; 400: conference table; 410: table video; 500: composite video data; 501 to 503: video data; 511 to 514: partial video areas; 521 to 523: camera identification marks; 541 to 546: speaker individual information

Claims

A video conference apparatus comprising:
a plurality of microphones, each collecting sound and generating collected sound data;
sound collection beam audio data generating means for generating, based on the collected sound data of the plurality of microphones, a plurality of sound collection beam audio data each having sound collection directivity centered on a different direction;
imaging means for generating a plurality of video data, each covering a different imaging area;
composite video data generating means for generating composite video data by combining the plurality of video data from the imaging means; and
communication control means for outputting the sound collection beam audio data and the composite video data,
the apparatus further comprising speaker orientation detecting means for detecting a speaker orientation based on levels of the plurality of sound collection beam audio data,
wherein the composite video data generating means:
assigns each of the plurality of video data to a respective one of the areas obtained by dividing the screen of the composite video data;
assigns, to an area of the composite video data to which none of the plurality of video data is assigned, speaker display data in which information indicating the speakers imaged is allocated for each of the plurality of video data output from the imaging means;
places, in the divided screen area of each video data, an imaging area specifying mark individually corresponding to that video data; and
displays the speaker display data and the imaging area specifying mark corresponding to the speaker orientation detected by the speaker orientation detecting means in a display mode different from that of the speaker display data and the imaging area specifying marks not corresponding to the speaker orientation.

The video conference apparatus according to claim 1, wherein
the speaker display data comprises a plurality of individual speaker display data, and
the composite video data generating means displays, based on the speaker orientation, the individual speaker display data corresponding to the speaker orientation within the speaker display data in a further different display mode.

The video conference apparatus according to claim 1 or 2, wherein
the sound collection beam audio data generating means selects and outputs the sound collection beam audio data corresponding to the speaker orientation.

The video conference apparatus according to claim 1, wherein
the communication control means has a function of receiving sound emission audio data corresponding to sound collection beam audio data generated by a destination apparatus, together with a speaker orientation associated with the sound emission audio data,
the apparatus further comprising:
a plurality of loudspeakers that emit sound based on input loudspeaker drive signals; and
sound emission control means for setting a sound emission directivity based on the sound emission audio data received by the communication control means and the speaker orientation associated with the sound emission audio data, and for generating the loudspeaker drive signal for each of the plurality of loudspeakers according to the sound emission directivity.