JPH08130723A

JPH08130723A - Video conference system speaker discrimination device

Info

Publication number: JPH08130723A
Application number: JP6265852A
Authority: JP
Inventors: Noboru Terasawa; 昇寺澤
Original assignee: NEC Corp
Current assignee: NEC Corp
Priority date: 1994-10-28
Filing date: 1994-10-28
Publication date: 1996-05-21
Anticipated expiration: 2013-11-11
Also published as: JP2822897B2

Abstract

PURPOSE: To display the dialogue on the television screen when two speakers are talking, so that a video conference can proceed with a sense of presence. CONSTITUTION: A dialogue state detection unit 62 compares voice level data 25₁, 25₂, ... 25_N obtained from voice level detection units 19₁, 19₂, ... 19_N corresponding to the respective video conference terminals 11₁, 11₂, ... 11_N, and extracts the two video conference terminals 11 showing the maximum voice level and the next-highest voice level. A speaker discrimination unit 63 discriminates the two speakers on that basis. When two speakers are discriminated, an image processing unit, for example, splits the screen in two to display both speakers, heightening the sense of presence of the dialogue.

Description

Detailed Description of the Invention

[0001]

[Field of Industrial Application] The present invention relates to a video conference system that realizes conferences by television across a plurality of sites, and in particular to a video conference system speaker discrimination device that is effective for discriminating the speaker when a video conference connects multiple sites.

[0002]

[Prior Art] Video conference systems, which hold a video conference by connecting video conference terminals at a plurality of sites over public or dedicated lines, have come to be adopted by many companies. When such a conference spans multiple sites, it is convenient to show only the person who is currently speaking on the television, or to show that person on a larger screen than the participants at other sites. Conventionally, therefore, the signals arriving over the voice lines from the speakers at these sites are analyzed, the speaker is identified, and only that speaker is distinguished from the other conference participants and highlighted as described above.

[0003] To discriminate the speaker among the conference participants, it has been conventional practice, as disclosed for example in Japanese Patent Laid-Open No. 4-150590, to measure the sound level at each video terminal, remove the background sound level from it to obtain the respective voice levels, compare the voice levels across the terminals, and discriminate the single speaker with the highest level. That one speaker was then emphasized on the television screen.

[0004] However, if the conference participant with the highest voice level among the terminals is simply selected as the speaker, and the emphasized speaker is switched every time that selection changes, inconveniences arise. For example, the screen display changes when one of the participants coughs.

[0005] Japanese Patent Laid-Open No. 2-4095 therefore detects sound and silence for each voice input. Even when sound is detected, the participant is not immediately judged to be the speaker; the judgment is made only once the sound has continued beyond a fixed time. Likewise, a participant once judged to be the speaker is not immediately judged a non-speaker when silence is detected, but only when the number of silent detections exceeds a fixed value.

[0006]

[Problems to Be Solved by the Invention] As described above, the conventional video conference system speaker discrimination device discriminates the speaker by comparing voice levels. Consequently, when two participants continue speaking in parallel and their voice levels differ, only one of them is discriminated as the speaker and only that person is emphasized on the television screen. In addition, if the voice levels change with each utterance, so that the difference between the two changes or the louder party alternates, the speaker cannot be discriminated accurately. Naturally, in such a two-speaker case the party judged to be a non-speaker is not emphasized, and depending on the form of emphasis that person may not appear on the television screen at all.

[0007] Furthermore, if a fixed time interval is set before a participant is newly discriminated as the speaker, as in Japanese Patent Laid-Open No. 2-4095, a new speaker cannot be discriminated accurately when someone else interrupts while the discriminated speaker is still talking, or when several participants besides the discriminated speaker are talking.

[0008] An object of the present invention is therefore to provide a video conference system speaker discrimination device that can accurately discriminate each speaker when a speaker interrupts or when several people are speaking.

[0009] Another object of the present invention is to provide a video conference system speaker discrimination device that, when two speakers are talking, displays their dialogue on the television screen so that a video conference with a sense of presence can be held.

[0010]

[Means for Solving the Problems] In the invention of claim 1, the video conference system speaker discrimination device comprises: (a) voice level detection means for detecting, terminal by terminal, the voice level of each video conference terminal participating in the video conference; (b) speech state detection means for detecting, in units of a predetermined cycle, whether a speaker is speaking at each video conference terminal; (c) speaker discrimination means for discriminating the speaker in each such cycle on the basis of the voice levels detected by the voice level detection means; (d) speaker change status determination means for determining the status of speaker changes using the detection result of the speech state detection means and the discrimination result of the speaker discrimination means; and (e) cycle varying means for setting the cycle short when the speaker change status determination means detects a change of speaker, and for shortening the cycle when the interval until a change is long.

[0011] That is, the invention of claim 1 determines the status of speaker changes, such as when one speaker talks alone for a long time or when another speaker starts talking by interrupting, and varies the cycle used for speaker discrimination accordingly. This prevents unwanted sounds from being mistaken for a change of speaker, while still allowing each speaker to be discriminated accurately when a speaker interrupts or when several people are speaking.

[0012] In the invention of claim 2, the video conference system speaker discrimination device comprises: (a) voice level detection means for detecting, terminal by terminal, the voice level of each video conference terminal participating in the video conference; (b) voice level comparison means for comparing the largest and the next-largest of the per-terminal voice levels detected by the voice level detection means; (c) dialogue determination means for determining that the two parties are in dialogue when the difference between the voice levels compared by the voice level comparison means is within a predetermined range; and (d) display emphasis means for emphasizing, in the screen display, the two speakers determined by the dialogue determination means to be in dialogue relative to the other video conference participants.

[0013] That is, the invention of claim 2 determines whether two speakers are talking by extracting the highest voice level and the next-highest level and comparing the difference between them. When there are two speakers, both are highlighted on the screen to heighten the sense of presence of the dialogue.

[0014] The invention of claim 3 detects the speech state of the speakers and varies the cycle that serves as the unit for switching the screen display, thereby incorporating the advantages of the invention of claim 1 into the invention of claim 2.

[0015] In the invention of claim 4, the video conference system speaker discrimination device comprises: (a) voice level detection means for detecting, terminal by terminal, the voice level of each video conference terminal participating in the video conference; (b) multiple-speaker extraction means for extracting, from the per-terminal voice levels detected by the voice level detection means, the largest one and one or more comparatively large ones close to it; and (c) display emphasis means for emphasizing, in the screen display, the plurality of speakers extracted by the multiple-speaker extraction means relative to the other video conference participants.

[0016] That is, the invention of claim 4 simultaneously extracts the speaker at the terminal with the highest voice level and the speakers at one or more terminals whose levels are close to it, and highlights their images. A situation in which three or more people talk heatedly, as in the crosstalk programs of CNN in the United States, can thus be realized as an on-screen display form in a video conference as well.
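The extraction described for claim 4 can be sketched in Python as follows. This is only an illustration under stated assumptions, not the patented implementation: the function name `extract_speakers`, the dictionary input, and the closeness threshold `margin` are all hypothetical, since the text says only that levels "close to" the maximum are extracted and gives no value or unit.

```python
def extract_speakers(levels, margin=6.0):
    """Return terminal ids whose voice level is the maximum or within
    `margin` of it, loudest first.

    `levels` maps a terminal id to its background-corrected voice level.
    `margin` is a hypothetical closeness threshold; the source does not
    give a value or a unit.
    """
    if not levels:
        return []
    top = max(levels.values())
    close = [t for t, v in levels.items() if top - v <= margin]
    # Sort loudest first; ties are broken by terminal id for determinism.
    return sorted(close, key=lambda t: (-levels[t], t))


print(extract_speakers({"tokyo": 62.0, "nagoya": 58.5, "osaka": 41.0}))
```

A caller would normally pass only terminals already judged to be speaking, because the sketch itself does not filter out silent terminals.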

[0017] The invention of claim 5 is characterized in that the speech state detection means deems speech to have taken place in a cycle in either of two cases: when the speaker's utterances are divided across the cycle and the sum of those intervals occupies most of the cycle, or when speech is in progress at the end of the cycle. This makes it possible to identify the speaker effectively even when the cycle is comparatively long.

[0018]

[Embodiments] The present invention is described in detail below with reference to embodiments.

[0019] First Embodiment

[0020] FIG. 1 shows an outline of a video conference system using the video conference system speaker discrimination device according to the first embodiment of the present invention. Video conference terminals 11₁, 11₂, ... 11_N, located at sites such as the Tokyo head office and the Nagoya or Osaka branch offices, are connected through lines 12₁, 12₂, ... 12_N, such as video and voice lines, to a communication interface 13 of the video conference system speaker discrimination device. From the communication interface 13, voice information 14₁, 14₂, ... 14_N for each of the video conference terminals 11₁, 11₂, ... 11_N is taken out and input to the corresponding voice level measurement units 15₁, 15₂, ... 15_N. Here the voice information 14₁, 14₂, ... 14_N is information on the sound picked up by a microphone (not shown) at each video conference terminal 11₁, 11₂, ... 11_N, and includes background sound as well as voice.

[0021] The voice level measurement units 15₁, 15₂, ... 15_N receive a measurement cycle control signal 17 sent from a voice level measurement cycle control unit 16, and lengthen or shorten accordingly the measurement cycle over which they analyze the voice information 14₁, 14₂, ... 14_N of the corresponding video conference terminal. The results are supplied as voice level measurement data 18₁, 18₂, ... 18_N to voice level detection units 19₁, 19₂, ... 19_N, background level detection units 21₁, 21₂, ... 21_N, and speech state detection units 22₁, 22₂, ... 22_N, each arranged in correspondence with the video conference terminals 11₁, 11₂, ... 11_N. The background level detection units 21₁, 21₂, ... 21_N detect the background level, that is, the level of sound other than voice, and supply the resulting background level measurement data 23₁, 23₂, ... 23_N to the voice level detection units 19₁, 19₂, ... 19_N and the speech state detection units 22₁, 22₂, ... 22_N.

[0022] The voice level detection units 19₁, 19₂, ... 19_N detect the difference between each pair of input voice level measurement data 18₁, 18₂, ... 18_N and background level measurement data 23₁, 23₂, ... 23_N, thereby detecting a voice level from which the influence of the background level has been removed. Voice level data 25₁, 25₂, ... 25_N representing these voice levels are sent to a speaker discrimination unit 26, which discriminates the speaker. The discrimination method is described later.

[0023] The result of speaker discrimination is output as speaker discrimination result data 27. The speaker discrimination result data 27 is sent to an image processing unit 28, which performs image processing to emphasize the one discriminated speaker. The video signal 29 after this image processing is sent from the communication interface 13 to the video conference terminals 11₁, 11₂, ... 11_N, and the video is displayed on each television screen.

[0024] Meanwhile, the speech state detection units 22₁, 22₂, ... 22_N use the input voice level measurement data 18₁, 18₂, ... 18_N and background level measurement data 23₁, 23₂, ... 23_N to detect, in every measurement cycle, whether each video conference terminal 11₁, 11₂, ... 11_N is in a speaking state. Specifically, they measure the rate of change of each of the voice level measurement data 18₁, 18₂, ... 18_N, detect the rise and fall of the voice level for each video conference terminal 11₁, 11₂, ... 11_N, and determine the start and end of speech, thereby detecting the presence or absence of speech in each cycle.
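As a rough illustration of the rise and fall detection in paragraph [0024], the following Python sketch marks speech start and end events from the sample-to-sample change in the measured level. The function name `detect_edges` and the thresholds `rise` and `fall` are assumptions made for this example; the source does not state how the rate of change is evaluated or what values are used.

```python
def detect_edges(samples, rise=5.0, fall=-5.0):
    """Find speech start/end events from the rate of change of the level.

    `samples` is a list of voice level measurements taken at a fixed
    interval. `rise` and `fall` are hypothetical thresholds on the
    sample-to-sample change; the source gives no values.
    Returns a list of (index, "start" | "end") tuples.
    """
    events = []
    speaking = False
    for i in range(1, len(samples)):
        delta = samples[i] - samples[i - 1]
        if not speaking and delta >= rise:
            events.append((i, "start"))
            speaking = True
        elif speaking and delta <= fall:
            events.append((i, "end"))
            speaking = False
    return events


print(detect_edges([30, 30, 45, 46, 44, 31, 30, 48, 47]))
```

A level that stays constant produces no events, which is the situation paragraph [0025] handles by comparing the measured level with the background level instead.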

[0025] When no change is seen in the voice level measurement data 18₁, 18₂, ... 18_N alone, the voice level measurement data 18₁, 18₂, ... 18_N are compared with the background level measurement data 23₁, 23₂, ... 23_N. A video conference terminal 11 for which the two differ in level within one cycle is determined to be in a speaking state, and one with no level difference is determined to be in a non-speaking state.

[0026] When several rises and falls of the speaking state are detected within a cycle at one video conference terminal 11, it is determined whether the integrated speaking time within the cycle exceeds a predetermined threshold. If it does, speech is determined to have occurred in that cycle. If it does not, the cycle may be judged as non-speaking; however, to allow for an utterance that straddles a cycle boundary, speech is determined to have occurred in the cycle if the terminal is in the speaking state at the moment the cycle switches.

[0027] The speech result data 31₁, 31₂, ... 31_N for each cycle detected in this way by the speech state detection units 22₁, 22₂, ... 22_N are input in common to the voice level measurement cycle control unit 16. The speaker discrimination unit 26, for its part, uses the speech result data 31₁, 31₂, ... 31_N obtained from the voice level measurement cycle control unit 16 to detect the one or more video conference terminals 11 that are speaking. It then compares the voice level data 25 among these speaking terminals and discriminates the one with the highest level as the speaker. As already described, the result is output as speaker discrimination result data 27.
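The two-stage selection in paragraph [0027], combined with the background subtraction of paragraph [0022], can be sketched in Python as below. The function name `discriminate_speaker`, the dictionary and set inputs, and the flooring of the difference at zero are assumptions made for this example and are not stated in the source.

```python
def discriminate_speaker(measured, background, speaking):
    """Pick the single speaker for one cycle.

    `measured` and `background` map terminal id -> level for the cycle;
    `speaking` is the set of terminal ids flagged as speaking.
    Returns the id of the loudest speaking terminal, or None.
    """
    # Voice level = measured level minus background level. Flooring at 0
    # is an assumption of this sketch, not something the source states.
    voice = {
        t: max(0.0, measured[t] - background[t])
        for t in sorted(speaking)  # sorted so that ties resolve predictably
    }
    if not voice:
        return None
    return max(voice, key=voice.get)


measured = {"a": 60.0, "b": 70.0, "c": 65.0}
background = {"a": 20.0, "b": 45.0, "c": 30.0}
print(discriminate_speaker(measured, background, {"a", "c"}))
```

Terminal "b" has the highest raw level here, but it is never selected: it is not flagged as speaking, and its background-corrected level is also the lowest of the three.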

[0028] The voice level measurement cycle control unit 16 acquires and stores speaker discrimination data from the speaker discrimination unit 26, and determines cycle by cycle whether the speaker is the same or has changed. (a) When it determines that the speaker has changed, it uses the measurement cycle control signal 17 to shorten the voice level measurement cycle by a predetermined time from its current value. Alternatively, a somewhat longer history may be kept, the frequency of speaker changes calculated, and the amount by which the cycle is shortened controlled using this frequency. Such increments and decrements may also be stored as a table in a ROM (read-only memory), and the length of one cycle set by reading out the increment or decrement appropriate to the situation.

[0029] (b) When it determines that the speaker is unchanged, it determines whether anyone other than the speaker is talking. This is done by removing the speech result data of the participant discriminated as the speaker, counting among the remaining speech result data those that indicate the start of speech or speech in progress, and judging from that count whether anyone other than the speaker is talking. If the result is that someone besides the speaker is talking, a measurement cycle control signal 17 that shortens the voice level measurement cycle is sent to the voice level measurement units 15₁, 15₂, ... 15_N. The larger the count, that is, the more people are talking, the shorter the measurement cycle becomes.

[0030] Conversely, if it is determined that no one besides the speaker is talking, a measurement cycle control signal 17 that lengthens the voice level measurement cycle is sent to the voice level measurement units 15₁, 15₂, ... 15_N. This state means that the same participant is continuing as the speaker. The time spent as speaker is therefore determined from the history of the speaker discrimination result data 27, and the longer that time, the longer the measurement cycle is made. The measurement cycle is set using as a guide the expected length of one segment of a single speaker's talk.
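The cycle control rules of paragraphs [0028] to [0030] can be summarized in a short Python sketch. Everything numeric here is an assumption: the function name `next_cycle`, the step size `step`, the bounds `lo` and `hi`, and the linear scaling with the number of other talkers and with the time held are hypothetical choices, since the source states only the direction of each adjustment.

```python
def next_cycle(current, changed, other_talkers, held_cycles,
               step=0.5, lo=1.0, hi=10.0):
    """Return the next measurement cycle length in seconds.

    changed        -- True if the speaker differs from the previous cycle
    other_talkers  -- number of terminals, besides the speaker, that are
                      starting to speak or speaking
    held_cycles    -- consecutive cycles the current speaker has held
    step, lo, hi   -- hypothetical step size and bounds (not in the source)
    """
    if changed:
        # (a) Speaker changed: shorten by a fixed step.
        proposed = current - step
    elif other_talkers > 0:
        # (b) Same speaker, but others are talking: shorten more the
        # more talkers there are.
        proposed = current - step * other_talkers
    else:
        # Same speaker and nobody else: lengthen more the longer the
        # speaker has held the floor.
        proposed = current + step * held_cycles
    # Clamp so the cycle never collapses to zero or grows without bound.
    return min(hi, max(lo, proposed))


print(next_cycle(4.0, changed=False, other_talkers=2, held_cycles=3))
```

The ROM table mentioned in paragraph [0028] would replace the arithmetic on `step` with a lookup keyed on the situation; the clamping is an addition of this sketch.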

[0031] In this way, at each video conference terminal 11₁, 11₂, ... 11_N the one participant identified as the speaker is displayed with emphasis on the screen, and when the speaker changes, the highlighted speaker on the screen changes accordingly. Various known techniques can be used to emphasize the speaker, such as displaying only the discriminated speaker on the whole screen; providing a small sub-screen for every participant and enlarging that of the discriminated speaker relative to the others; or changing the frame of the discriminated speaker's sub-screen to a color different from the other frames.

[0032] Second Embodiment

[0033] FIG. 2 shows an outline of a video conference system using the video conference system speaker discrimination device according to the second embodiment of the present invention. Parts identical to those in FIG. 1 bear the same reference numerals, and their description is omitted where appropriate. From the communication interface 13, voice information 14₁, 14₂, ... 14_N for each of the video conference terminals 11₁, 11₂, ... 11_N is taken out and input to the corresponding voice level measurement units 61₁, 61₂, ... 61_N. Here the voice information 14₁, 14₂, ... 14_N is information on the sound picked up by a microphone (not shown) at each video conference terminal 11₁, 11₂, ... 11_N, and includes background sound as well as voice.

[0034] The voice level measurement data 18₁, 18₂, ... 18_N measured by the voice level measurement units 61₁, 61₂, ... 61_N are supplied to the voice level detection units 19₁, 19₂, ... 19_N, the background level detection units 21₁, 21₂, ... 21_N, and the speech state detection units 22₁, 22₂, ... 22_N, each arranged in correspondence with the video conference terminals 11₁, 11₂, ... 11_N. The voice level detection units 19₁, 19₂, ... 19_N use the input voice level measurement data 18₁, 18₂, ... 18_N and background level measurement data 23₁, 23₂, ... 23_N to create voice level data 25₁, 25₂, ... 25_N, which represent the voice levels with the background level removed, and supply them to a dialogue state detection unit 62 and a speaker discrimination unit 63. The speaker discrimination unit 63 discriminates one or more speakers using dialogue state data 64 detected by the dialogue state detection unit 62 together with the voice level data 25₁, 25₂, ... 25_N, and sends the result as speaker discrimination data 65 to an image processing unit 66. The image processing unit 66 performs image processing to emphasize the one or more discriminated speakers. The video signal 67 after this image processing is sent from the communication interface 13 to the video conference terminals 11₁, 11₂, ... 11_N, and the video is displayed on each television screen.

[0035] In this video conference system speaker discrimination device, the dialogue state detection unit 62 receives the voice level data 25₁, 25₂, ... 25_N and calculates, for each video conference terminal 11₁, 11₂, ... 11_N, the integral of the voice level over each measurement cycle measured by the voice level measurement units 61₁, 61₂, ... 61_N. It then extracts two of these integrated voice levels: the largest and the next largest.

[0036] Depending on the device, however, when two or more voice levels are reasonably close to the maximum, they may all be extracted together. This is because, when the measurement cycle is set comparatively long, three or more people may talk in rapid succession within a single cycle. In such a case it is also effective, in the same spirit as the first embodiment, to adopt a configuration in which the dialogue state detection unit 62 sends a measurement cycle control signal 17 to the voice level measurement units 61₁, 61₂, ... 61_N to control the measurement cycle. For simplicity, this embodiment is described for a dialogue between two parties only.

[0037] The dialogue state detection unit 62 compares the integrated voice levels of the two parties. When the difference between them is small, within a predetermined range, it determines that the video conference currently under way is in a dialogue state. Otherwise, that is, when the difference is large, it determines that the video conference is in a non-dialogue state. The dialogue state detection unit 62 likewise determines a non-dialogue state when only one person is talking or when no one is talking.

[0038] Dialogue state data 64, which indicates whether a dialogue state exists, is input to the speaker discrimination unit 63 together with the voice level data 25₁, 25₂, ... 25_N. When the dialogue state data 64 indicates a non-dialogue state, the speaker discrimination unit 63 compares the voice level data 25₁, 25₂, ... 25_N and discriminates the video conference terminal 11 showing the highest voice level as the "speaker A pattern".

[0039] When the dialogue state data 64 indicates a dialogue state, the speaker discrimination unit 63 likewise identifies the video conference terminal 11 showing the maximum voice level as the "speaker A pattern", and in addition identifies the video conference terminal 11 showing the next highest voice level, as the other speaker in the dialogue, as the "speaker B pattern". These determination results are sent to the image processing unit 66.
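The assignment of the "speaker A pattern" and "speaker B pattern" described in paragraphs [0038] and [0039] might look like the sketch below. The function name `discriminate_speakers`, the terminal indices, and the `speech_threshold` parameter (used here to represent the case in which no speaker exists) are assumptions introduced for illustration.

```python
from typing import Optional, Sequence, Tuple


def discriminate_speakers(levels: Sequence[float], dialogue: bool,
                          speech_threshold: float
                          ) -> Tuple[Optional[int], Optional[int]]:
    """Return (speaker_a, speaker_b) as terminal indices, or None where absent.

    levels   -- voice level per terminal
    dialogue -- True when the dialogue state data indicates a dialogue state
    """
    # Rank terminal indices by voice level, loudest first.
    ranked = sorted(range(len(levels)), key=lambda i: levels[i], reverse=True)
    # Assumed rule: no terminal above the threshold means no speaker at all.
    if not ranked or levels[ranked[0]] <= speech_threshold:
        return None, None
    speaker_a = ranked[0]  # maximum voice level: "speaker A pattern"
    # The next highest level becomes "speaker B pattern" only in a dialogue state.
    speaker_b = ranked[1] if dialogue and len(ranked) > 1 else None
    return speaker_a, speaker_b
```

In this sketch the dialogue flag is supplied by the caller, mirroring the way the dialogue state data 64 is produced by a separate unit and passed in alongside the voice level data.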

[0040] The image processing unit 66 includes a CPU (central processing unit), not shown. Under a control program stored in a ROM (read-only memory), also not shown, it performs screen emphasis processing according to whether a dialogue state exists and whether a speaker is present.

[0041] FIG. 3 shows the control flow of the image processing unit. The image processing unit 66 receives the speaker discrimination data 65 and checks whether a speaker is present (step S101). A speaker is present when the "speaker A pattern" exists. When no speaker is present (N), it performs video processing so that the video of every participant in the video conference is displayed in its own small sub-screen (step S102), and has this video signal 67 sent via the communication interface 13 to the video conference terminals 11₁, 11₂, ..., 11_N.

[0042] When a speaker is present and the "speaker B pattern" also exists (step S103; Y), the image processing unit 66 creates a video signal 67 that splits the screen in two and displays both speaker A and speaker B (step S104), and sends it to the communication interface 13. When speaker B does not exist (step S103; N), it creates a video signal 67 that displays only speaker A on the entire screen (step S105) and sends it to the communication interface 13.
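The branching of steps S101 to S105 can be summarized in a short sketch. The function name `select_layout`, the mode labels "tiled", "split", and "full", and the dictionary return value are hypothetical; the original describes only the resulting screen arrangements, not any data structure.

```python
from typing import Dict, List, Optional, Union


def select_layout(speaker_a: Optional[int], speaker_b: Optional[int],
                  n_terminals: int) -> Dict[str, Union[str, List[int]]]:
    """Choose the screen arrangement for one measurement cycle."""
    # Step S101: a speaker exists only when the "speaker A pattern" exists.
    if speaker_a is None:
        # Step S102: show every participant in a small sub-screen.
        return {"mode": "tiled", "terminals": list(range(n_terminals))}
    # Step S103: does the "speaker B pattern" exist?
    if speaker_b is not None:
        # Step S104: split the screen in two for speaker A and speaker B.
        return {"mode": "split", "terminals": [speaker_a, speaker_b]}
    # Step S105: show speaker A alone on the entire screen.
    return {"mode": "full", "terminals": [speaker_a]}
```

Calling such a function once per measurement cycle would reproduce the per-cycle switching of the display, with the actual composition of the video signal 67 left to the image processing hardware.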

[0043] The screen display mode described above is checked and switched every measurement cycle. Various other forms of emphasis are also possible. For example, the images of all participants may each be displayed within a small frame, with only those engaged in dialogue, or only the speaker, shown in a frame of a different color from the others, or with their frames displayed relatively larger than those of the other participants.

[0044] Although the first and second embodiments have been described above, the present invention is not limited to them. For example, the voice level measurement cycle control unit used in the first embodiment to change the measurement cycle may be incorporated into the second embodiment to realize a video conference system speaker discrimination device premised on dialogue.

[0045]

[Effects of the Invention] As described above, in the invention according to claim 1, the speaker change situation is determined, for example when one speaker speaks for a long time or when another speaker starts speaking by interrupting, and the cycle used for discriminating the speaker is varied accordingly. This prevents an extraneous sound from being mistaken for a change of speaker, while allowing each speaker to be discriminated accurately when a speaker interrupts or when there are multiple speakers.

[0046] In the invention according to claim 2, whether two speakers are currently speaking is determined by extracting the maximum voice level and the next highest voice level and comparing the difference between them. When there are two speakers, they are highlighted on the screen, which heightens the sense of presence of the dialogue.

[0047] Furthermore, in the invention according to claim 3, the utterance state of the speaker is detected and the cycle serving as the unit for switching the screen display is varied, so speakers can be identified appropriately even when images are displayed in dialogue form.

[0048] In the invention according to claim 4, the speaker at the terminal with the maximum voice level and the speakers at one or more terminals whose levels are close to it are extracted simultaneously and their images are highlighted. A heated exchange among three or more participants can therefore be reproduced as a display form on the screen even in a video conference.
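A minimal sketch of extracting the loudest terminal together with any terminals close to it is shown below. The name `extract_speakers`, the `tolerance` parameter that stands in for "close to the maximum", and the `speech_threshold` parameter are assumptions; the original does not state how closeness is measured.

```python
from typing import List, Sequence


def extract_speakers(levels: Sequence[float], tolerance: float,
                     speech_threshold: float) -> List[int]:
    """Return the indices of the loudest terminal and all terminals close to it."""
    if not levels:
        return []
    peak = max(levels)
    # Assumed rule: nobody is speaking if even the peak is below the threshold.
    if peak <= speech_threshold:
        return []
    # Keep every terminal whose level is within `tolerance` of the peak.
    return [i for i, lv in enumerate(levels) if peak - lv <= tolerance]
```

With `tolerance=10` and `speech_threshold=30`, the levels `[90, 85, 80, 10]` yield three emphasized terminals, whereas `[90, 20, 10]` yields only the first.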

[0049] Furthermore, in the invention according to claim 5, when a speaker's utterance is divided over sections within a cycle, the utterance state detecting means deems an utterance to have been made in that cycle both when the sum of those sections occupies most of the cycle and when an utterance is in progress at the end of the cycle. This has the effect that the speaker can be identified effectively even when the set cycle is relatively long.
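The rule for deeming that an utterance occurred in a cycle can be illustrated as follows. This sketch treats the two conditions as alternatives (either one suffices), which is an interpretation rather than something the text states outright. The function name `utterance_in_cycle`, the `(start, end)` interval representation, and the `majority` fraction of 0.5 used for "most of the cycle" are likewise assumptions.

```python
from typing import Sequence, Tuple


def utterance_in_cycle(segments: Sequence[Tuple[float, float]],
                       cycle_start: float, cycle_end: float,
                       majority: float = 0.5) -> bool:
    """Decide whether a cycle counts as containing an utterance.

    segments -- (start, end) times of the speaker's utterance sections,
                assumed not to overlap one another
    majority -- hypothetical fraction of the cycle regarded as "most of" it
    """
    cycle_len = cycle_end - cycle_start
    if cycle_len <= 0:
        raise ValueError("cycle_end must be later than cycle_start")
    # Clip each section to the cycle and add up the time actually spoken.
    clipped = [(max(s, cycle_start), min(e, cycle_end)) for s, e in segments]
    spoken = sum(e - s for s, e in clipped if e > s)
    occupies_most = spoken / cycle_len > majority
    # An utterance is in progress at the end if a section covers cycle_end.
    speaking_at_end = any(s < cycle_end <= e for s, e in segments)
    return occupies_most or speaking_at_end
```

For a cycle from 0 to 10, sections `[(0, 2), (3, 5), (6, 8)]` cover 60 percent of the cycle and count as an utterance, as do sections `[(0, 1), (9, 10)]` because speech is in progress at the end, while `[(0, 1), (4, 5)]` does not.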

[Brief description of drawings]

FIG. 1 is a block diagram showing a video conference system speaker discrimination device according to a first embodiment of the present invention and the video conference terminals connected to it.

FIG. 2 is a block diagram showing a video conference system speaker discrimination device according to a second embodiment of the present invention and the video conference terminals connected to it.

FIG. 3 is a flowchart showing the control performed by the image processing unit in the second embodiment of the present invention.

[Explanation of symbols]

11: video conference terminal
15, 61: voice level measuring unit
16: voice level measurement cycle control unit
19: voice level detection unit
21: background level detection unit
22: utterance state detection unit
25: voice level data
26, 63: speaker discrimination unit
28, 66: image processing unit
62: dialogue state detection unit

Claims

[Claims]

1. A video conference system speaker discrimination device comprising: voice level detecting means for detecting, for each terminal, the voice level of each video conference terminal participating in a video conference; utterance state detecting means for detecting, in units of a predetermined cycle, whether a speaker is speaking at each video conference terminal; speaker determining means for determining a speaker for each cycle based on the voice level detected by the voice level detecting means; speaker change situation determination means for determining the speaker change situation using the detection result of the utterance state detecting means and the determination result of the speaker determining means; and period changing means for setting the cycle short when the speaker change situation determination means detects a change of speaker and for shortening this cycle when a section is long.

2. A video conference system speaker discrimination device comprising: voice level detecting means for detecting, for each terminal, the voice level of each video conference terminal participating in a video conference; voice level comparing means for comparing the maximum voice level and the next highest voice level among the voice levels of the terminals detected by the voice level detecting means; dialogue discriminating means for discriminating that the two parties are having a dialogue when the difference between the voice levels compared by the voice level comparing means is within a predetermined range; and display emphasizing means for emphasizing, on the screen display, the two speakers discriminated by the dialogue discriminating means as being in dialogue, relative to the other video conference participants.

3. The video conference system speaker discrimination device according to claim 2, wherein the utterance state of the speaker is detected and the cycle serving as the unit for switching the screen display is varied.

4. A video conference system speaker discrimination device comprising: voice level detecting means for detecting, for each terminal, the voice level of each video conference terminal participating in a video conference; speaker multiple extraction means for extracting the maximum voice level among the voice levels of the terminals detected by the voice level detecting means, together with one or more relatively high voice levels close to it; and means for emphasizing, on the screen display, the plurality of speakers extracted by the speaker multiple extraction means relative to the other video conference participants.

5. The video conference system speaker discrimination device according to claim 1, wherein, when a speaker's utterance is divided over sections within a cycle, the utterance state detecting means deems an utterance to have been made in that cycle both when the sum of the sections occupies most of the cycle and when an utterance is being made at the end of the cycle.