JP2007124140A - Photographing device and communication conference system

Info

Publication number: JP2007124140A
Application number: JP2005311656A
Authority: JP
Inventors: Takuya Tamaru; 卓也田丸; Katsuichi Osakabe; 勝一刑部
Original assignee: Yamaha Corp
Current assignee: Yamaha Corp
Priority date: 2005-10-26
Filing date: 2005-10-26
Publication date: 2007-05-17
Anticipated expiration: 2025-10-26
Also published as: JP4892927B2

Abstract

PROBLEM TO BE SOLVED: To provide a photographing device capable of changing panning at high speed, while having a simple configuration, so as to clearly photograph the video of a speaker (especially, a face), and to provide a communication conference system. SOLUTION: Sound collection beams corresponding to a plurality of areas are formed with a microphone array including a plurality of microphones 2A-2M. A sound signal processor 4 selects the sound collection beam with the largest signal strength among the sound collection beams, and detects the area with the existence of a sound source. Images photographed with cameras 3A-3C are synthesized by an image processor 5 so as to be a panoramic image. A controller 6 performs setting in the image processor 5 to segment the image concerning the area with the existence of the sound source. The image processor 5 segments the image of the area and outputs it. That is, panning is electronically changed. COPYRIGHT: (C)2007,JPO&INPIT

Description

The present invention relates to a photographing device capable of clearly capturing video of a speaker or the like, and to a communication conference system.

In recent years, as communication environments have improved, video conference systems that transmit and receive video and audio have become widespread. A video conference system generally includes a camera, a microphone, a speaker, and a display. The transmitting side transmits the audio collected by the microphone and the video captured by the camera. The receiving side emits the received audio from the speaker and shows the received video on the display.

Such video conference systems are used not only for one-on-one conversations but often for conversations among several conference participants. In that case, the shooting range must be set toward the speaker (the camera must be pointed at the speaker) each time the speaker changes. The camera changes its shooting range mainly by changing its left-right direction (pan).

To make this pan change smooth, a video conference system has been proposed in which a dedicated microphone is placed for each conference participant and several speakers are shot together in the direction of the microphone with the highest input level (see, for example, Patent Document 1).

A method has also been proposed in which a panoramic image is generated by a plurality of cameras installed on a conference desk and the pan is changed electronically by cutting out the image of a predetermined area (see, for example, Patent Document 2).

Patent Document 1: JP-A-7-135646
Patent Document 2: JP 2001-94857 A

The video conference system of Patent Document 1 changes the pan by mechanically rotating the camera, so it requires a rotation mechanism, which complicates the hardware and raises the cost. Mechanical rotation also increases the likelihood of failure and creates a need for maintenance. In addition, because the pan is changed mechanically, there is a large time lag between detecting the speaker and setting the shooting range. Furthermore, the video conference system of Patent Document 1 needs a dedicated microphone for each conference participant in order to detect the speaker, which makes the configuration complicated.

The virtual camera control method of Patent Document 2 can change the pan at high speed without mechanically rotating a camera. However, Patent Document 2 also requires either a dedicated microphone for each conference participant or a plurality of microphones (a microphone array arranged in a circle) at the center of the conference desk in order to detect the speaker, which makes the configuration complicated. It also shows an example in which a plurality of cameras (a camera array) is installed at the center of the conference desk, but in such a configuration many devices, such as microphones and cameras, sit on the desk and get in the way of the conference participants. Installing the equipment also takes time and effort. Installing the microphones and cameras near the display is conceivable, but when the microphones are far from the conference participants, speech cannot be collected well enough to detect the sound source position. Moreover, in a communication conference the conference desk is generally located in front of the center of the display and the participants sit around it, so a camera installed near the display cannot capture the participants' faces from the front (the images become profile views).

An object of the present invention is to provide a photographing device and a communication conference system that can change the pan at high speed with a simple configuration and can clearly capture video of a speaker (in particular, the front of the face).

The photographing device of the present invention comprises: sound source position detection means for detecting a sound source position; a plurality of cameras installed facing different directions so that their fields of view are at least continuous and cross one another; and image cut-out means for cutting out, from the continuous image captured by the plurality of cameras, an image of a range that includes the sound source position detected by the sound source position detection means.

In the present invention, the sound source position detection means (for example, an infrared sensor) detects the position of a sound source (for example, a speaker). The plurality of cameras acquire images continuously, and image synthesis means combines these images. The fields of view of the cameras are continuous at their edges, and combining the images produces a panoramic image. The portion of this panoramic image that corresponds to the position of the speaker is cut out and output. Because the image of the region in which the speaker was detected is cut out and output (that is, the pan is changed electronically), the pan can be changed at high speed with a simple configuration that has no mechanical mechanism. In addition, the fields of view of the plurality of cameras cross one another. For example, the cameras arranged at both ends of the photographing device each shoot toward the inside. In a communication conference the conference desk is generally located in front of the center of the communication conference system (display) and the participants are positioned around it, so having the cameras at both ends of the photographing device shoot inward makes it easier to capture the participants' faces from the front.
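The electronic pan described above can be sketched as a column crop of the panorama. This is a minimal illustration, not the method of the specification: it assumes that each detection region maps to an equal-width band of panorama columns, and the function names `crop_window` and `crop` are hypothetical.

```python
def crop_window(panorama_width, num_regions, region_index, crop_width):
    """Return (left, right) pixel columns of a crop centred on one region.

    Assumption: regions are equal-width column bands of the panorama.
    """
    if not 0 <= region_index < num_regions:
        raise ValueError("region_index out of range")
    region_width = panorama_width / num_regions
    centre = (region_index + 0.5) * region_width
    left = int(round(centre - crop_width / 2))
    # Clamp so the window stays inside the panorama.
    left = max(0, min(left, panorama_width - crop_width))
    return left, left + crop_width


def crop(panorama, window):
    """Cut the column range out of a panorama stored as a list of rows."""
    left, right = window
    return [row[left:right] for row in panorama]
```

Changing `region_index` moves the output window immediately, which is why no rotating mechanism is needed.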

Further, in the present invention, the sound source position detection means comprises: a microphone array configured by arranging a plurality of microphones; sound collection signal processing means that forms a plurality of sound collection beams, each of which collects sound from a specific region at a high level, by delaying the audio signals collected by the plurality of microphones by respective predetermined times and combining them; and audio signal selection means that determines that a sound source exists in the direction of the sound collection beam with the highest level among the plurality of sound collection beams formed by the sound collection signal processing means.

In the present invention, the microphone array forms sound collection beams in a plurality of directions. The sound source position is detected on the assumption that the sound source exists in the direction of the beam with the highest level among the plurality of sound collection beams.

The communication conference system of the present invention comprises: the photographing device according to claim 2; transmitting/receiving means that outputs the audio signal of the sound collection beam selected by the sound collection signal selection means and the image signal cut out by the image cut-out means, and that inputs an audio signal and an image signal from outside; audio output means that emits sound based on the audio signal input by the transmitting/receiving means; and display means that displays an image based on the image signal input by the transmitting/receiving means.

In the present invention, an audio signal is input from another communication conference system and emitted as sound from the speakers, while sound is collected by the plurality of microphones and output to the other communication conference system. Video data captured by the plurality of cameras is also output to the other communication conference system.

Further, in the present invention, the audio output means comprises: a speaker array configured by arranging a plurality of speakers; and sound emission signal processing means that forms a sound beam, which emits sound at a high level toward a specific region, by outputting the audio signal input by the transmitting/receiving means to the plurality of speakers with respective predetermined delays. The sound emission signal processing means forms the sound beam so that a virtual sound source is formed at the sound source position on the sound collection side, and controls the sound beam so that the position of the sound source in the image displayed on the display means and the position of the virtual sound source are the same or lie in the same direction.

In the present invention, the speaker array forms a sound beam. The sound beam is formed so that a virtual sound source is created with the same positional relationship as that between the communication conference system and the sound source on the sound collection side. This localizes the sound image so that it matches the video, giving a more realistic conference environment.

According to the present invention, a panoramic image is acquired with a plurality of cameras, and the image of the region in which a sound source was detected is cut out and output (that is, the pan is changed electronically). The pan can therefore be changed at high speed with a simple configuration that has no mechanical mechanism. In addition, because the cameras arranged at both ends of the photographing device each shoot toward the inside, video of the sound source (speaker) can be captured clearly from the front.

A communication conference system according to an embodiment of the present invention will be described with reference to the drawings. FIG. 1 is a block diagram showing the configuration of the communication conference system. As shown in the figure, the communication conference system includes a plurality of speakers 1A to 1M, a plurality of microphones 2A to 2M, a plurality of (three in the figure) cameras 3A to 3C, an audio signal processing unit 4, an image processing unit 5, a controller 6, an input/output interface 7, and a display 8.

The plurality of speakers 1A to 1M and the plurality of microphones 2A to 2M are connected to the audio signal processing unit 4. The three cameras 3A, 3B, and 3C are connected to the image processing unit 5. The audio signal processing unit 4 and the image processing unit 5 are connected to the input/output interface 7 and also to the controller 6. The controller 6 is connected to the input/output interface 7. The input/output interface 7 is connected to other communication conference systems via a network or the like. The display 8 is connected to the input/output interface 7 and the image processing unit 5.

This communication conference system inputs an audio signal from another communication conference system connected via a network or the like and emits the sound from the plurality of speakers 1, while collecting sound with the plurality of microphones 2 and outputting the resulting audio signal to the other communication conference system. It also inputs video data from the other communication conference system and shows it on the display 8, while outputting video data captured by the plurality of cameras 3 to the other communication conference system. A so-called videophone (video conference) is thereby realized. The communication conference system also detects the position (region) of the speaker with the plurality of microphones and outputs the image of the detected speaker's region to the other communication conference system.

The plurality of speakers 1A to 1M are arranged in a straight line and constitute a speaker array (see FIG. 6). The plurality of microphones 2A to 2M are likewise arranged in a straight line and constitute a microphone array (see FIG. 6). The three cameras 3A, 3B, and 3C are installed at positions separated from one another by a predetermined distance (see FIG. 6) so that their fields of view are at least continuous. Details are described later.

The speakers 1 are generally cone speakers, but other types, such as horn speakers, may be used. The number of speakers in the speaker array and the spacing between them are set as appropriate for the environment in which the communication conference system is installed, the required frequency band, and so on.

The audio signal input to each speaker 1 is determined by the audio signal processing unit 4. An audio signal input from another conference system via the input/output interface 7 is passed to the audio signal processing unit 4. Based on information indicating the sound source position, which is input to the controller 6 from the other conference system via the input/output interface 7, the audio signal processing unit 4 applies a predetermined delay to this audio signal for each speaker 1 and inputs it to that speaker. Each speaker 1 emits the input audio signal as sound. In FIG. 1, the D/A converters that convert digital audio signals into analog audio signals, the amplifiers that amplify the signals, and similar components are omitted.

For example, if audio signals with the same delay are input to all the speakers 1 at the same time, every speaker 1 outputs sound (a sound wave) simultaneously. The sound waves output from the individual speakers 1 propagate radially, but their combined wavefront is parallel to the array and propagates only forward; that is, it forms a sound beam. Components propagating in other directions are cancelled when the outputs of the speakers 1 are combined (through mutual interference), and only the forward component is reinforced and remains as the sound beam.

If sound is output first from the speaker 1 at one end and then from each adjacent speaker 1 in turn every time a predetermined time elapses, the combined wavefront tilts according to that delay time, and the sound beam can be directed obliquely. Arranging the speakers 1 in a horizontal line in this way makes it possible to control the directivity toward any horizontal direction.
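The relationship between the per-speaker delay and the beam direction can be sketched as follows. This is an illustrative calculation for an idealized, equally spaced line array radiating a plane wave; the speed of sound, the spacing, and the function name `steering_delays` are assumptions, not values from the specification.

```python
import math

SPEED_OF_SOUND = 343.0  # m/s, assumed room-temperature value


def steering_delays(num_speakers, spacing_m, angle_deg):
    """Per-speaker delays (seconds) that tilt a plane wavefront by angle_deg.

    angle_deg = 0 is straight ahead. Positive angles steer towards the
    high-index end of the array, so speaker 0 fires first.
    """
    step = spacing_m * math.sin(math.radians(angle_deg)) / SPEED_OF_SOUND
    delays = [n * step for n in range(num_speakers)]
    # Shift so the earliest speaker has zero delay (delays cannot be negative).
    earliest = min(delays)
    return [d - earliest for d in delays]
```

With `angle_deg = 0` every delay is zero, which corresponds to the forward beam described earlier; a nonzero angle gives the constant speaker-to-speaker time step that tilts the wavefront.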

The above description covers the delay processing for outputting a plane wave, but by suitably controlling the delay of the signal output to each speaker 1, the beam can also be given a focal point in front of (or behind) the speaker array (see FIG. 8).

When sound is emitted at some position in the region in front of the microphones 2, each microphone 2 collects it. Each microphone 2 outputs an audio signal derived from the collected sound to the audio signal processing unit 4. The microphones 2 are generally dynamic microphones, but other types, such as condenser microphones, may be used. The number of microphones in the microphone array and the spacing between them are set as appropriate for the environment in which the communication conference system is installed, the required frequency band, and so on. In FIG. 1, the front-end amplifiers, the A/D converters that convert analog audio signals into digital audio signals, and similar components are omitted. The audio signals output from the microphones 2 are combined in the audio signal processing unit 4 and output to the input/output interface 7, which outputs the combined audio signal to another conference system or the like. Because sound reaches each microphone 2 after a propagation time that depends on the distance between that microphone and the sound source, the microphones 2 collect the sound at different times.

Suppose, for example, that a sound wave arrives at all the microphones 2 from the front at the same time. The audio signals output from the microphones 2 are then reinforced when they are combined. When a sound wave arrives from any other direction, the audio signals output from the microphones 2 differ in phase and are therefore weakened when they are combined. The sensitivity of the microphone array is thus narrowed into a beam, forming a main lobe of sensitivity (a sound collection beam) only toward the front.
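The reinforcement and cancellation described above can be checked numerically for a single frequency. The sketch below assumes an idealized, equally spaced line array and a far-field plane wave; the function name `array_gain` and the speed of sound are illustrative assumptions.

```python
import cmath
import math

SPEED_OF_SOUND = 343.0  # m/s, assumed


def array_gain(num_mics, spacing_m, freq_hz, arrival_deg):
    """Normalised magnitude of the plain sum of all microphone signals
    for a plane wave of freq_hz arriving at arrival_deg (0 = front)."""
    # Arrival-time difference between adjacent microphones.
    tau = spacing_m * math.sin(math.radians(arrival_deg)) / SPEED_OF_SOUND
    total = sum(cmath.exp(-2j * math.pi * freq_hz * n * tau)
                for n in range(num_mics))
    return abs(total) / num_mics
```

For a wave from the front the phases coincide and the gain is 1; for an oblique wave the phasors point in different directions and the sum shrinks.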

The audio signal processing unit 4 can direct the sound collection beam obliquely by applying a predetermined delay time to the audio signal output from each microphone 2. To tilt the sound collection beam, the delays are set so that the audio signals are output one microphone after another, starting from the microphone 2 at one end and moving to the adjacent microphone 2 each time a predetermined time elapses. For example, when the sound source is in front of one end of the microphone array, the sound wave arrives first at the end nearest the sound source and last at the opposite end. The audio signal processing unit 4 applies delay times to the audio signals of the microphones 2 so as to compensate for this difference in propagation time, and then combines them. The audio signals from this direction are thereby reinforced by the combination. Accordingly, by successively delaying the audio signals output from the microphones 2 in the row from one end toward the other, the sound collection beam is tilted according to the delay time.
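The compensate-then-combine step can be sketched as a delay-and-sum operation. This is a simplified illustration that uses whole-sample delays and averaging; the function name `delay_and_sum` and the example signals are assumptions, not part of the specification.

```python
def delay_and_sum(channels, delays):
    """Delay each channel by an integer number of samples, then average.

    channels: list of equal-length sample lists, one per microphone.
    delays:   list of non-negative integer delays, one per microphone.
    """
    length = len(channels[0])
    out = [0.0] * length
    for samples, delay in zip(channels, delays):
        for i in range(delay, length):
            out[i] += samples[i - delay]
    return [v / len(channels) for v in out]


# An impulse that reaches microphone 0 first and microphone 2 last.
channels = [
    [0, 0, 1, 0, 0, 0],
    [0, 0, 0, 1, 0, 0],
    [0, 0, 0, 0, 1, 0],
]
aligned = delay_and_sum(channels, [2, 1, 0])    # nearest mic delayed most
unaligned = delay_and_sum(channels, [0, 0, 0])  # no compensation
```

In `aligned` the three impulses coincide and add up to full level, whereas in `unaligned` they stay spread out at one third of that level.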

The above description covers the delay processing for collecting a plane wave, but by suitably controlling the delay of the signal output from each microphone 2, the sound collection beam can also be given a focal point in front of (or behind) the microphone array (see FIG. 8).

A plurality of sound collection beams can also be formed simultaneously. FIG. 2 is a block diagram showing the configuration of the main part of the audio signal processing unit 4 that is connected to the microphones 2. The microphones 2A to 2M are connected to digital filters 41A to 41M of the audio signal processing unit 4, respectively. The sound collected by the microphones 2A to 2M is input to the digital filters 41A to 41M as digital audio signals. FIG. 2 shows a detailed block diagram only for the digital filter 41A, but the other digital filters 41B to 41M have the same structure and operate in the same way.

The digital filter 41A includes a delay buffer 42A that has multiple output stages. The delay of each stage of the delay buffer 42A is set according to the arrangement of the microphones 2 in the microphone array and the regions in front of the microphone array (the regions in which a speaker is to be detected). In this example the delay buffer 42A has four output stages, and their output signals are input to FIR filters 431A to 434A.

The delay buffer 42A buffers, at its respective stages, audio signals obtained by applying different delay times to the audio signal output from the microphone 2A, and outputs each delayed audio signal to the FIR filters 431A to 434A. The delayed audio signals output to the FIR filters 431A to 434A correspond to the respective regions in front of the microphone array. FIG. 3 is a conceptual diagram of the sound source direction detection method. FIG. 3(A) shows how the positional relationship between the sound source and the microphones relates to the delay with which sound from the source is collected by each microphone, and FIGS. 3(B) and 3(C) show the concept of forming delay correction amounts based on the delays of the collected audio signals.

As shown in the figure, this communication conference system sets four partial regions 101 to 104 in front of the microphone array. Sound generated in the partial region 101 is collected first by the nearest microphone 2A. It is then collected by each microphone in order of distance from the partial region 101, and last by the farthest microphone (the microphone 2L in the figure). Sound generated in the partial region 104, on the other hand, is collected first by the nearest microphone 2L, then by each microphone in order of distance from the partial region 104, and last by the farthest microphone 2A. Sound generated in each region is thus collected with a delay time that depends on the distance to each microphone.

For the partial region 101, the audio signals collected by the microphones 2A to 2L are delayed as shown in FIG. 3(B). That is, the delay correction amounts are set so as to compensate for the delays shown in FIG. 3(A). For the partial region 104, the audio signals collected by the microphones 2A to 2L are delayed as shown in FIG. 3(C).

The delayed audio signal for forming the sound collection beam corresponding to the partial region 101 is generated in the delay buffer 42A and output to the FIR filter 431A. The delayed audio signal for forming the sound collection beam corresponding to the partial region 102 is output to the FIR filter 432A. Similarly, the delayed audio signal for the partial region 103 is output to the FIR filter 433A, and the delayed audio signal for the partial region 104 is output to the FIR filter 434A. The delay amounts of these delayed audio signals are set according to the distance between the microphone 2 and each region, as shown in FIG. 3. For example, the delayed audio signal corresponding to the partial region 101 has a large delay because the microphone 2A is close to the partial region 101, and the delayed audio signal corresponding to the partial region 104 has a small delay because the microphone 2A is farthest from the partial region 104.
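The rule that the nearest microphone receives the largest delay can be sketched from geometry. The example below is only an illustration: it represents each region by a single center point, and the coordinates, sample rate, speed of sound, and function name `region_delay_corrections` are assumptions not given in the specification.

```python
import math

SPEED_OF_SOUND = 343.0  # m/s, assumed


def region_delay_corrections(mic_positions, region_centre, sample_rate_hz):
    """Delay correction in samples (fractional) for each microphone so that
    sound from region_centre lines up across the array.

    mic_positions: list of (x, y) in metres; region_centre: (x, y) in metres.
    """
    distances = [math.dist(m, region_centre) for m in mic_positions]
    farthest = max(distances)
    # Nearest microphone hears the sound first, so it gets the largest delay.
    return [(farthest - d) / SPEED_OF_SOUND * sample_rate_hz
            for d in distances]
```

Calling this once per region yields one set of delays per sound collection beam. The results are generally not whole numbers of samples, so a delay buffer alone cannot realize them exactly.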

In FIG. 2, the FIR filters 431A to 434A all have the same configuration, and each filters and outputs the delayed audio signal input to it. The FIR filters 431A to 434A can set fine delay times that cannot be realized by the delay buffer 42A. That is, by setting the sampling period and the number of taps of the FIR filter to desired values, the fractional part of the delay time can be realized when, for example, the delay buffer 42A provides the part of the delay time that is an integer multiple of its sampling period.
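One common way to realize a sub-sample delay with an FIR filter is a windowed-sinc design. The specification does not state how the FIR filters 431A to 434A are designed, so the sketch below is only an assumed example; the tap count, the Hann window, and the function names are illustrative choices.

```python
import math


def fractional_delay_taps(fraction, num_taps=21):
    """Hann-windowed sinc taps that delay a signal by
    (num_taps - 1) / 2 + fraction samples (assumed design)."""
    centre = (num_taps - 1) / 2 + fraction
    taps = []
    for n in range(num_taps):
        x = n - centre
        sinc = 1.0 if x == 0 else math.sin(math.pi * x) / (math.pi * x)
        window = 0.5 - 0.5 * math.cos(2 * math.pi * (n + 0.5) / num_taps)
        taps.append(sinc * window)
    gain = sum(taps)
    return [t / gain for t in taps]  # normalise to unity gain at DC


def fir(samples, taps):
    """Direct-form FIR convolution, output truncated to the input length."""
    return [sum(taps[k] * samples[i - k]
                for k in range(len(taps)) if i - k >= 0)
            for i in range(len(samples))]
```

The filter adds a fixed whole-sample latency of `(num_taps - 1) / 2` on top of the requested fraction; in a complete design that latency would be subtracted from the whole-sample delay held in the delay buffer.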

The delayed audio signals output from the FIR filters 431A to 434A are amplified by the respective amplifiers 441A to 444A and input to adders 45A to 45D. The other digital filters 41B to 41M have the same configuration as the digital filter 41A, and each outputs delayed audio signals to the adders 45A to 45D according to the delay conditions set for it in advance.

The adder 45A combines the delayed audio signals input from the digital filters 41A to 41M and generates the sound collection beam corresponding to the partial region 101 in FIG. 3. Similarly, the adder 45B combines the delayed audio signals input from the digital filters 41A to 41M and generates the sound collection beam corresponding to the partial region 102 in FIG. 3, and the adder 45C does the same for the partial region 103. The adder 45D combines the delayed audio signals input from the digital filters 41A to 41M and generates the sound collection beam corresponding to the partial region 104 in FIG. 3.

The sound collection beams output from the adders 45A to 45D are input to a band-pass filter (BPF) 46. The BPF 46 filters each sound collection beam and outputs the components in a predetermined frequency band to the level determination unit 47. Because the frequency band in which a beam is formed depends on the width of the microphone array and the spacing of the microphones 2, the pass band of the BPF 46 is set to the frequency band of the sound that each sound collection beam is intended to pick up. For example, if the sound to be picked up is a talker's speech, the pass band may be set to the frequency band corresponding to the human voice.

The level determination unit 47 compares the levels of the sound collection beams and selects the beam with the highest level. A high level indicates that a sound source (a talker) is present in the area corresponding to that beam, so the unit can detect which of the four areas shown in FIG. 3 contains the sound source. The level determination unit 47 outputs information indicating the area in which the sound source is present to the controller 6. Alternatively, the level determination unit 47 may simply output information identifying the highest-level beam to the controller 6, leaving the controller 6 to determine the corresponding area.
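The comparison performed by the level determination unit 47 can be sketched as follows. The description does not define how "level" is measured, so the root-mean-square value used here is an assumption, and the names `beam_level` and `select_region` and the sample values are hypothetical.

```python
import math

def beam_level(beam):
    """Level of one band-limited beam, taken here as its RMS value (assumed measure)."""
    return math.sqrt(sum(x * x for x in beam) / len(beam))

def select_region(beams_by_region):
    """Return the area whose sound collection beam has the highest level."""
    return max(beams_by_region, key=lambda region: beam_level(beams_by_region[region]))

# One short block of samples per partial area of FIG. 3 (illustrative values).
beams = {
    101: [0.01, -0.02, 0.01, 0.00],
    102: [0.05, -0.04, 0.03, -0.05],
    103: [0.40, -0.35, 0.38, -0.42],
    104: [0.06, -0.05, 0.04, -0.06],
}
active_region = select_region(beams)
```

In this example the beam for the partial area 103 has the largest RMS value, so that area is reported as containing the sound source.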

The controller 6 sets the selector 48 to select and output the sound collection beam corresponding to the area in which the sound source is present. The selector 48 receives the sound collection beams output from the adders 45A to 45D and outputs only the beam designated by the controller 6. The output of the selector 48 is input to the input/output interface 7 and sent to another communication conference system or the like. This communication conference system can therefore transmit only the talker's voice, clearly, to the other conference system. In addition, the controller 6 outputs position information for the sound source to the input/output interface 7 so that the area in which the sound source is present can be reproduced at the other communication conference system (that is, so that a virtual sound source is formed at the destination). The position information is based on the information, determined by the level determination unit 47, indicating the area in which the sound source is present. It may be information indicating that area (such as the position coordinates of the sound source) or information indicating the delay time to be set for each speaker 1.

Each camera 3 comprises an image sensor such as a CCD or CMOS sensor, continuously shoots the area in front of the communication conference system, and acquires images of that area. The camera 3 need not be a high-definition image sensor; it only needs the resolution required for video conferencing (about 0.3 Mpixels per frame). The cameras 3 are arranged in a straight line, spaced a predetermined distance apart.

FIG. 4 is a conceptual diagram showing the camera shooting ranges. The three cameras 3A, 3B, and 3C are arranged in a straight line, with the camera 3B at the center and the cameras 3A and 3C at the ends. The end cameras 3A and 3C are turned inward (toward the front center), so that their fields of view cross. The cameras are arranged so that adjacent shooting ranges overlap at their edges. In FIG. 4, the right edge of the shooting range of the camera 3A overlaps the left edge of the shooting range of the camera 3B, and the left edge of the shooting range of the camera 3C overlaps the right edge of the shooting range of the camera 3B. The cameras 3A to 3C therefore yield substantially the same images as a single camera rotated (panned) at the position 30 where the central axes of the shooting ranges intersect. In other words, the effect is the same as virtually installing a camera at the position 30 and changing its pan.

The images acquired by the cameras 3 are output to the image processing unit 5. The image processing unit 5 combines the images to generate a panoramic image. Because the fields of view of the cameras 3 overlap at their edges, a panoramic image can be generated by joining the images at these overlapping edges. In a communication conference, a conference desk is generally located at the front center of the communication conference system, with the participants seated around it. Having the end cameras 3A and 3C shoot inward therefore makes it easier to capture the participants' faces from the front.

In FIG. 4, conference participants 200A and 200B are on the left side of the drawing near the front center of the communication conference system (the right side as seen from the system). A conference participant 200C is at the front center of the system, and conference participants 200D and 200E are on the right side of the drawing near the front center (the left side as seen from the system). The conference participants 200A to 200E are positioned around a conference desk 210 near the front center of the system. Consequently, the participants 200A and 200B on the left side of the drawing face roughly toward the camera 3C, and the participants 200D and 200E on the right side of the drawing face roughly toward the camera 3A.

This makes it easier to capture the participants' faces from the front than when a single camera installed near the communication conference system is panned to shoot each participant.

FIG. 5 is a block diagram showing the detailed configuration of the image processing unit 5. The images from the cameras 3A, 3B, and 3C are input to the composition processing unit 51 of the image processing unit 5. The composition processing unit 51 combines the images as described above to generate a panoramic image, which is output to the image buffer 52. The image buffer 52 buffers the panoramic image. The extraction unit 53 reads the buffered panoramic image from the image buffer 52, cuts out a partial area, and outputs it to the input/output interface 7. The area to be cut out is determined by the controller 6.

As described above, the controller 6 has acquired information indicating the area in which the sound source is present. The controller 6 therefore sets the extraction unit 53 to cut out the image of that area. The extraction unit 53 cuts out the image of the area in which the sound source is present and outputs it to the input/output interface 7. As a result, only the image of that area is transmitted to the other communication conference system. Sounds (noise) and images other than those of the talker who is the sound source are not output, so the talker's video and audio are output clearly.

The display 8 shows the image of the remote site received from the other communication conference system, but it can also show the talker's own image from the image processing unit 5 (extraction unit 53). This lets the image being shown at the remote site be checked on the display 8.

FIG. 6 shows an example of the external appearance of the communication conference system, and FIG. 7 shows the camera shooting range and the sound source detection areas. As shown in FIG. 6, the communication conference system includes a microphone array consisting of a plurality of (for example, 15) microphones 2 installed at the top of the display 8, a speaker array consisting of a plurality of (for example, 12) speakers 1, and a plurality of (for example, three) cameras 3. The display 8 shows images received from another communication conference system. The three cameras 3 are arranged on the same straight line as the speakers 1 and, in appearance, at the same regular spacing as the speakers 1 of the speaker array. That is, of the 15 positions that would otherwise hold speakers 1, the center position and the positions one in from the left and right ends hold cameras 3 instead. Placing the cameras in speaker positions makes them inconspicuous and gives the system a clean appearance. The positions of the cameras 3 are not limited to this example, but speakers 1 are installed at the left and right ends of the line so as to preserve the audio beam width of the speaker array.

As described above, this communication conference system uses the microphone array to set sound collection beams on the four sound source detection partial areas 101 to 104. In FIG. 7(A), a sound source 250 is present in the partial area 103. The level of the sound collection beam corresponding to the partial area 103 is therefore the highest, and the controller 6 determines that a sound source is present in the partial area 103.

When the controller 6 determines that a sound source is present, it instructs the image buffer 52 to output the panoramic image to the extraction unit 53. It also instructs the extraction unit 53 to cut out and output the part of the image corresponding to the partial area 103. Consequently, of the panoramic image shown by the broken line in FIG. 7(B), the image area corresponding to the partial area 103, shown by the solid line, is output from the extraction unit 53 to the input/output interface 7, and the remote site can obtain a clear image of the talker (sound source 250).

If speech then occurs in a different area (for example, the partial area 101), the level of the sound collection beam corresponding to the partial area 101 becomes the highest, and the controller 6 determines that a sound source is present in the partial area 101. The controller 6 therefore instructs the extraction unit 53 to cut out and output the part of the image corresponding to the partial area 101. Because the pan is not changed by mechanically moving a camera but by cutting out the desired area of the buffered panoramic image (that is, the pan is changed electronically), the output image can be switched quickly with a simpler structure than before.
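The electronic pan described above amounts to choosing a different crop window in the buffered panorama. The sketch below is an illustration under stated assumptions: the panorama is a list of pixel rows, and the table `REGION_COLUMNS` mapping each partial area to a column range is hypothetical, since the description does not give the actual correspondence between areas and image coordinates.

```python
# Hypothetical mapping from partial area to (left, right) panorama columns.
REGION_COLUMNS = {101: (0, 4), 102: (4, 8), 103: (8, 12), 104: (12, 16)}

def crop_region(panorama, region):
    """Cut out the columns of the buffered panorama that cover the given area."""
    left, right = REGION_COLUMNS[region]
    return [row[left:right] for row in panorama]

# A tiny 2-row, 16-column "panorama" whose pixel values equal the column index.
panorama = [list(range(16)) for _ in range(2)]

view = crop_region(panorama, 103)  # talker detected in the partial area 103
view = crop_region(panorama, 101)  # talker moves: only the crop window changes
```

Switching the output from one area to another requires no camera movement; only the column range read from the buffer changes.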

Using the communication conference system of this embodiment on both the transmitting side and the receiving side provides the following additional effect. FIG. 8 is a diagram explaining directivity characteristics. FIG. 8(A) shows the directivity characteristic (sound collection beam) of the microphone array on the transmitting side. The sound emitted by the sound source 250 reaches the microphones 2 in order, starting with the nearest one. A delay is applied to the signal of each microphone 2 so that the sound from the source is output in phase from every microphone 2, which gives the sound collection beam a focal point.
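The focusing delays on the microphone side can be derived from geometry. This is a minimal sketch, assuming a linear array along the x axis, a speed of sound of 343 m/s, and whole-sample delays; the function name `focusing_delays` and the example positions are hypothetical.

```python
import math

SPEED_OF_SOUND = 343.0  # m/s, assumed room-temperature value

def focusing_delays(mic_xs, source_x, source_y, sample_rate_hz):
    """Per-microphone delays (in samples) that bring a source's sound into phase.

    The microphone nearest the source hears the sound first, so it receives the
    largest delay; the farthest microphone receives zero delay.
    """
    distances = [math.hypot(x - source_x, source_y) for x in mic_xs]
    farthest = max(distances)
    return [round((farthest - d) / SPEED_OF_SOUND * sample_rate_hz) for d in distances]

# Three microphones 0.3 m apart; source 1 m in front of the rightmost microphone.
delays = focusing_delays([-0.3, 0.0, 0.3], source_x=0.3, source_y=1.0,
                         sample_rate_hz=48000)
```

Rounding to whole samples is a simplification; the residual fractional delay is the part that the FIR filters of FIG. 2 are described as handling.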

FIG. 8(B), on the other hand, shows the directivity characteristic of the speaker array on the receiving side. The received audio signal is output from each of the speakers 1. At this time, based on the transmitting-side sound source position information input to the controller 6 via the input/output interface 7, a virtual sound source is formed with the same positional relationship as that between the communication conference system and the sound source 250 shown in FIG. 8(A). The sound is output first from the speaker 1 nearest the virtual sound source and then, with successively longer delays, from the adjacent speakers 1. Delaying the outputs in sequence in this way gives the audio beam a focal point and localizes the sound image as if the sound were emitted from the talker's position. Compared with a conventional communication conference system, the sound image can therefore be localized to match the video, giving a more realistic conference environment.
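The sequential speaker delays for a virtual sound source can be sketched from the same kind of geometry. The code below is an illustration only: it assumes a linear speaker array along the x axis, a virtual source at a given offset and depth relative to the array, and a speed of sound of 343 m/s; the function name `speaker_delays` and the positions are hypothetical.

```python
import math

SPEED_OF_SOUND = 343.0  # m/s, assumed room-temperature value

def speaker_delays(speaker_xs, source_x, source_depth):
    """Per-speaker output delays (in seconds) for a virtual sound source.

    The speaker nearest the virtual source plays first (zero delay); speakers
    farther away play later in proportion to the extra path length.
    """
    distances = [math.hypot(x - source_x, source_depth) for x in speaker_xs]
    nearest = min(distances)
    return [(d - nearest) / SPEED_OF_SOUND for d in distances]

# Five speakers 0.1 m apart; virtual source 1 m from the array, in line with x = 0.1.
delays = speaker_delays([-0.2, -0.1, 0.0, 0.1, 0.2], source_x=0.1, source_depth=1.0)
```

Either the source coordinates or the resulting delay values could be sent as the position information, which matches the two alternatives named in the description.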

Next, the operation of this communication conference system is described with reference to a flowchart. FIG. 9 is a flowchart showing the operation of the communication conference system. First, the audio signals picked up by the microphones 2 are input to the audio signal processing unit 4 (s11). The delay buffers of the digital filters 41A to 41M then form delayed audio signals in multiple stages (s12). The delayed audio signals output from the delay buffers are combined in the adders corresponding to the respective sound source detection areas to form a plurality of sound collection beams (s13). The level determination unit 47 compares the levels of the sound collection beams corresponding to the sound source detection areas (s14).

The controller 6 determines that a talker is present in the sound source detection area corresponding to the sound collection beam with the highest level (s15). The controller 6 then sets the image processing unit 5 to cut out the image of that sound source detection area, and sets the selector 48 of the audio signal processing unit 4 to output the sound collection beam corresponding to that area (s16). The talker's audio signal is then output from the audio signal processing unit 4, and the talker's image from the image processing unit 5, to the input/output interface 7 (s17).

In this embodiment, an example was described in which the sound source is detected among four areas in front of the system, but detection may be divided among a larger number of areas. More sound source detection areas can be set by changing the number of stages of the delay buffer 42A in FIG. 2. Also, although this embodiment detects the talker's position with the microphone array, the position may instead be detected with another sensor such as an infrared sensor. Alternatively, the images captured by the cameras may be analyzed and the talker's position detected by image recognition.

Furthermore, the arrangement of the microphone array is not limited to the one described above; any arrangement may be used as long as a plurality of microphones are arranged in a predetermined pattern (for example, a microphone array arranged in a matrix). In addition, as shown in FIG. 10, arranging the microphones in a multi-dimensional circular pattern allows a sound source to be detected from any direction. Applying this to the configuration of the present invention makes it possible not only to change the pan electronically but also to change the tilt according to the talker's position.

FIG. 1: Block diagram showing the configuration of the communication conference system
FIG. 2: Block diagram showing the configuration of the main part of the audio signal processing unit
FIG. 3: Diagram showing the sound source detection areas
FIG. 4: Diagram showing the camera shooting ranges
FIG. 5: Block diagram showing the configuration of the image processing unit
FIG. 6: Diagram showing an example of the external appearance of the communication conference system
FIG. 7: Diagram showing the camera shooting range and the sound source detection areas
FIG. 8: Diagram explaining directivity characteristics
FIG. 9: Flowchart showing the operation of the communication conference system
FIG. 10: Configuration diagram of a microphone array with microphones arranged in a circle

Explanation of symbols

1 - speaker
2 - microphone
3 - camera
4 - audio signal processing unit
5 - image processing unit
6 - controller
7 - input/output interface
8 - display

Claims

1. An imaging device comprising:
sound source position detecting means for detecting a sound source position;
a plurality of cameras installed facing different directions so that their fields of view are at least continuous and intersect one another; and
image cutout means for cutting out, from the continuous image captured by the plurality of cameras, an image of a range including the sound source position detected by the sound source position detecting means.

2. The imaging device according to claim 1, wherein the sound source position detecting means comprises:
a microphone array configured by arranging a plurality of microphones;
sound collection signal processing means for forming a plurality of sound collection beams, each of which picks up sound from a specific area at a high level, by delaying the audio signals picked up by the plurality of microphones by respective predetermined times and combining them; and
sound collection signal selecting means for determining that a sound source is present in the direction of the sound collection beam having the highest level among the plurality of sound collection beams formed by the sound collection signal processing means.

3. A communication conference system comprising:
the imaging device according to claim 2;
transmission/reception means for outputting to the outside the audio signal of the sound collection beam selected by the sound collection signal selecting means and the image signal output by the image cutout means, and for inputting an audio signal and an image signal from the outside;
audio output means for emitting sound based on the audio signal input by the transmission/reception means; and
display means for displaying an image based on the image signal input by the transmission/reception means.

4. The communication conference system according to claim 3, wherein the audio output means comprises:
a speaker array configured by arranging a plurality of speakers; and
sound emission signal processing means for forming an audio beam that emits sound at a high level toward a specific area by outputting the audio signal input by the transmission/reception means to each of the plurality of speakers with a respective predetermined delay,
and wherein the sound emission signal processing means forms the audio beam so that a virtual sound source is formed at the sound source position on the sound collection side, and controls the audio beam so that the position of the sound source in the image displayed on the display means and the position of the virtual sound source are the same or lie in the same direction.