JP4100146B2 - Bi-directional communication system, video communication device

Info

Publication number: JP4100146B2
Application number: JP2002344164A
Authority: JP
Inventors: 良平岡田
Original assignee: Sony Corp
Current assignee: Sony Corp
Priority date: 2002-11-27
Filing date: 2002-11-27
Publication date: 2008-06-11
Anticipated expiration: 2022-11-27
Also published as: JP2004179997A

Description

【０００１】
【発明の属する技術分野】
本発明は，映像配信ユニット間において映像データを送受信可能なシステムにかかり，特に双方向コミュニケーションシステム，映像通信装置，映像データ配信方法に関する。
【０００２】
【従来の技術】
近年，コンピュータなどの情報処理装置の高機能・低価格化による広範な普及と，ディジタル回線を始めとするネットワークのブロードバンド化に伴い，例えばデータ，音声，または映像などをやり取りするマルチメディア通信環境が急速に整備され始めている。
【０００３】
マルチメディア通信環境は，代表的な例として，音声及び画像を双方向でやり取りすることによりコミュニケーションを図るテレビ電話／テレビ会議システム（双方向コミュニケーションシステム）などのサービスがある（例えば，特許文献１参照）。なお，本願発明に関連する技術文献情報には，次のものがある。
【０００４】
【特許文献１】
特開平７−６７１０７号公報
【０００５】
【発明が解決しようとする課題】
しかしながら，映像データを伝送する場合において，映像データを圧縮符号化する際，１フレーム全体を画一的に圧縮符号化する場合が多く，帯域に大幅な制限があるネットワークを介して，情報量の多い画像データを送信するには，画一的に全体の画質を下げなければならなかった。
【０００６】
また，例えば，フレーム内に人間の顔など，映像を把握するのに不可欠な要素となる注目される特徴を有する領域（特徴領域）に対する映像データを個別に検出しても，的確に検出されない場合が多く，したがって，上記特徴領域についても画質が下がる状態で圧縮符号化され，視認性の低い映像データがネットワークを介して，接続先の相手に表示されていた。
【０００７】
本発明は，上記のような従来の問題点に鑑みてなされたものであり，特徴を有する領域を的確に判断し，各領域に応じて圧縮符号化を制御することが可能な，新規かつ改良された双方向コミュニケーションシステムを提供することを目的としている。
【０００８】
【課題を解決するための手段】
上記課題を解決するため，本発明の第１の観点によれば，１又は２以上の映像配信ユニットがネットワークにより接続された双方向コミュニケーションシステムであって：前記映像配信ユニットは，映像データを生成する撮像装置と；前記映像データから顔領域を検出し，顔領域情報を生成する特徴検出部，前記顔領域情報に基づき符号化パラメータを生成する符号化制御部，前記符号化パラメータに基づき前記映像データを伝送データに圧縮符号化するエンコーダ部及び前記伝送データを前記映像データに伸長するデコーダ部を少なくとも備える映像通信装置と；前記映像データを表示する出力装置と；を備え，送り手側の前記一の映像配信ユニットは，前記映像データのうち，少なくとも前記顔領域と前記顔領域に属さない領域との各領域ごとに圧縮符号化された前記伝送データを，受け手側の前記他の映像配信ユニットに対して配信し，前記顔領域情報は，前記顔領域の面積情報，前記顔領域の位置情報又は前記顔領域の信頼度情報の少なくともいずれかを含み，前記符号化制御部は，現フレーム画像から選択された１の顔領域の信頼度情報が，前記現フレーム画像の他の顔領域の信頼度情報，又は，前フレーム画像の顔領域の信頼度情報に比べて低い場合，前記選択された１の顔領域の信頼度情報を，前記前フレーム画像の顔領域の信頼度情報と同程度，又は，前記現フレーム画像の前記他の顔領域の信頼度情報以上の値に補正することを特徴とする，双方向コミュニケーションシステムが提供される。
【０００９】
本発明によれば，相互に映像データの送受信可能な映像配信ユニット間において，撮影された映像データのうち，視点が注目される特徴を有する領域（特徴領域）が検出されると，上記特徴領域と，特徴領域以外の領域とに区別し，領域に応じて圧縮符号化する。かかる発明によれば，例えば量子化パラメータが映像データ全体につき一律ではなく，特徴領域に対しては量子化パラメータを小さくし，特徴領域以外の領域に対しては量子化パラメータを大きくして圧縮符号化することにより，領域に応じた差別化を図れる。したがって，映像データのストリーム配信時に，画質が低くてもよい特徴領域以外の領域に対してデータ容量の軽減化，および特徴領域に対して視認性の高い画質の維持された映像データを表示させることができる。
【００１０】
映像通信装置は，特徴領域情報に基づき，圧縮符号化するために必要なパラメータである符号化パラメータを生成する符号化制御部を，さらに備えるように構成することができる。かかる構成により，映像データを圧縮符号化する際に，例えば映像データのフレーム単位であるフレーム画像のうち，検出された顔領域に対しては量子化パラメータを小さくし画質を向上させ，または顔領域以外の領域に対しては量子化パラメータを大きくし画質を落としデータ量を軽減するように，エンコーダ部に指示するための符号化パラメータを生成することができる。なお，映像データのフレーム単位であるフレーム画像に限定されず，例えば，映像データのフィールド単位であるフィールド画像または複数フレームから構成されるシーン単位であるシーン画像などの場合であってもよい。
【００１１】
エンコーダ部は，符号化パラメータに基づき映像データを伝送データに圧縮符号化するように構成することができる。かかる発明により，例えば，フレーム画像のうちオブジェクトとして特徴領域を切り出し，顔領域に限り圧縮符号化するように符号化パラメータによって制御されることができる。なお，フレーム画像に限定されず，例えば，フィールド画像またはシーン画像などの場合であってもよい。
【００１２】
特徴領域情報は，少なくとも顔領域の面積情報，顔領域の位置情報，または顔領域の信頼度情報が含まれる顔領域情報であるように構成することができる。かかる構成により，フレーム画像に構成されるマクロブロックのうち顔領域に属すマクロブロックを，信頼度に基づき的確に特定することが可能となる。なお面積情報は，例えば画素単位に示され，位置情報は，ＸＹ座標などにより示される。なお，特徴領域は，顔領域に限定されず，その他特徴を有するいかなる領域であってもよい。
【００１３】
符号化制御部は，映像データから特徴領域情報が生成された場合，当該映像データよりも少なくとも１フレーム又は１フィールド前に圧縮符号化された映像データの特徴領域情報に基づき，当該映像データの特徴領域情報を補正するように構成することができる。かかる構成により，フレーム画像内に複数の特徴領域が検出された場合に，検出されたフレーム画像よりも，例えば１フレーム，１フィールド，または１シーンなど前に検出された特徴領域情報に含まれる例えば信頼度などの情報に基づき，上記フレーム画像に関する適正な特徴領域情報に補正することができる。なお，フレーム画像に限定されず，例えば，フィールド画像またはシーン画像などの場合であってもよい。
【００１４】
映像通信装置は，ネットワークの混雑状況を検知する検査部を，さらに備えるように構成することができる。かかる構成により，ネットワークの混雑状況を把握することで，混雑状況に見合った伝送データ容量に基づきネットワークを介して配信することが可能となる。したがって，ネットワークトラフィックに対して負荷を最小限に留め，通信効率の向上を図れる。
【００１５】
符号化制御部は，ネットワークの混雑状況に応じて，特徴領域にかかる符号化パラメータと，特徴領域に属さない領域にかかる符号化パラメータとを変更するように構成することができる。かかる構成により，ネットワークトラフィックが混雑してくると，送信可能なデータ容量が限られてくるため，映像データであるフレーム画像のうち特徴領域のオブジェクトを切出して，上記オブジェクトに対しては高画質の状態で圧縮符号化し，伝送する。特徴領域以外の領域に対しては，圧縮符号化せず削除又は無視される。したがって，映像データの視認の上で不可欠な要素である特徴領域だけを切り出して送信するため，少ないデータ容量で，視認性の高い映像データを配信することができる。なお，混雑状況は，１又は２以上の閾値を段階的に設定しておくことで，混雑状況の段階に応じて，柔軟に画質及びデータ容量を変動させ，配信できる。また，フレーム画像に限定されず，例えば，フィールド画像またはシーン画像などの場合であってもよい。
【００１６】
符号化制御部は，特徴領域にかかる映像データの符号化パラメータと，特徴領域に属さない領域にかかる映像データの符号化パラメータとを，少なくともフレーム，フィールド，またはシーン単位に変更するように構成してもよい。
【００１７】
符号化制御部は，特徴領域にかかる映像データを，別オブジェクトとして切り出すように構成してもよい。かかる構成により，フレーム画像の特徴領域に属すマクロブロックに限定して圧縮符号化することができる。さらに，特徴領域に属さないマクロブロックに対して圧縮符号化するか否かを制御することができる。したがって，例えばネットワークのトラフィックなどに応じて柔軟に映像データを圧縮符号化できる。なお，フレーム画像に限定されず，例えば，フィールド画像またはシーン画像などの場合であってもよい。
【００１８】
エンコーダ部は，少なくともＨ．２６３又はＭＰＥＧ−４の圧縮符号化方式により，映像データを圧縮符号化するように構成することができる。なお，Ｈ．２６３又はＭＰＥＧ−４に限定されず，ＩＴＵ−Ｔ勧告Ｈ．２６１などの場合でもよい。
【００１９】
映像通信装置は，特徴領域にかかる映像データを少なくともモザイク変換する特殊処理部を，さらに備えるように構成することができる。かかる構成により，フレーム画像に検出された特徴領域について，モザイク変換または他の画像に置換などの特殊な処理をすることで，特徴領域を正確に認識できないようにすることができる。なお，フレーム画像に限定されず，例えば，フィールド画像またはシーン画像などの場合であってもよい。さらに，特徴領域以外の領域について，モザイク変換または他の画像に置換などの特殊な処理をする場合でもよい。
【００２０】
映像データは，少なくとも画像データもしくは音声データのうちいずれか一方又は双方であるように構成することができる。
【００２１】
さらに，本発明の別の観点によれば，映像データを生成する撮像装置と，前記映像データを表示する出力装置とを備えた映像配信ユニットに備わる映像通信装置であって：前記撮像装置により生成された前記映像データから顔領域を検出し，顔領域情報を生成する特徴検出部と；前記顔領域情報に基づき符号化パラメータを生成する符号化制御部と；前記符号化パラメータに基づき前記映像データを伝送データに圧縮符号化するエンコーダ部と；前記伝送データを前記映像データに伸長するデコーダ部と；を備え，前記顔領域情報は，前記顔領域の面積情報，前記顔領域の位置情報又は前記顔領域の信頼度情報の少なくともいずれかを含み，前記符号化制御部は，現フレーム画像から選択された１の顔領域の信頼度情報が，前記現フレーム画像の他の顔領域の信頼度情報，又は，前フレーム画像の顔領域の信頼度情報に比べて低い場合，前記選択された１の顔領域の信頼度情報を，前記前フレーム画像の顔領域の信頼度情報と同程度，又は，前記現フレーム画像の前記他の顔領域の信頼度情報以上の値に補正することを特徴とする，映像通信装置が提供される。
【００２２】
本発明によれば，相互に映像データの送受信可能な映像配信ユニット間において，撮影された映像データのうち，視認する上で不可欠な要素である特徴を有する領域（特徴領域）が検出されると，ネットワークの混雑状況を勘案し，上記特徴領域と，特徴領域以外の領域とを区別し，各領域に応じて圧縮符号化する。かかる発明によれば，特徴領域に対しては量子化パラメータを小さくし画質を通常の圧縮符号化時よりも向上させ，特徴領域以外の領域に対しては量子化パラメータを大きくして圧縮符号化することにより，ネットワークに負荷のかからない程度データ容量を軽減しつつ，視認性の高い映像データを配信先の出力装置に表示することができる。なお，この映像通信装置は，上記双方向コミュニケーションシステムで採用される映像通信装置とほぼ同様の構成を有する。
【００２３】
特徴領域情報は，少なくとも顔領域の面積情報，顔領域の位置情報，または顔領域の信頼度情報が含まれる顔領域情報であるように構成することができる。かかる構成により，フレーム画像に構成されるマクロブロックのうち顔領域に属すマクロブロックを，信頼度に基づき的確に特定することが可能となる。なお面積情報は，例えば画素単位に示され，位置情報は，ＸＹ座標などにより示される。なお，特徴領域は，顔領域に限定されず，その他特徴を有するいかなる領域であってもよい。
【００２４】
符号化制御部は，映像データから特徴領域情報が生成された場合，当該映像データよりも少なくとも１フレーム前に圧縮符号化された映像データの特徴領域情報に基づき，当該映像データの特徴領域情報を補正するように構成してもよい。
【００２５】
映像通信装置は，ネットワークの混雑状況を検知する検査部を，さらに備えるように構成してもよく，符号化制御部は，ネットワークの混雑状況に応じて，特徴領域にかかる符号化パラメータと，特徴領域に属さない領域にかかる符号化パラメータとを変更するように構成してもよい。
【００２６】
符号化制御部は，特徴領域にかかる映像データの符号化パラメータと，特徴領域に属さない領域にかかる映像データの符号化パラメータとを，少なくとも映像データのフレーム，フィールド，またはシーン単位に変更するように構成してもよい。
【００２７】
符号化制御部は，特徴領域にかかる映像データを，別オブジェクトとして切り出すように構成してもよく，エンコーダ部は，少なくともＨ．２６３又はＭＰＥＧ−４の圧縮符号化方式により，映像データを圧縮符号化するように構成してもよい。
【００２８】
映像通信装置は，特徴領域にかかる映像データを少なくともモザイク変換する特殊処理部を，さらに備えるように構成してもよい。
【００２９】
さらに，本発明の別の観点によれば，ネットワークに接続され，少なくとも映像データを生成し，映像データを表示する１又は２以上の映像配信ユニットに備わる映像通信装置の映像データ配信方法が提供される。この映像通信装置の映像データ配信方法において，映像通信装置は，映像データから特徴領域情報を生成し；特徴領域情報に基づき符号化パラメータを生成し；符号化パラメータに基づき映像データを伝送データに圧縮符号化することを特徴としている。
【００３０】
特徴領域情報は，少なくとも顔領域の面積情報，顔領域の位置情報，または顔領域の信頼度情報が含まれる顔領域情報であるように構成してもよい。
【００３１】
映像通信装置は，映像データから特徴領域情報が生成された場合，当該映像データよりも少なくとも１フレーム前に圧縮符号化された映像データの特徴領域情報に基づき，当該映像データの特徴領域情報を補正するように構成してもよい。
【００３２】
映像通信装置は，ネットワークの混雑状況を検知する検査部を，さらに備えるように構成してもよく，映像通信装置は，ネットワークの混雑状況に応じて，特徴領域にかかる符号化パラメータと，特徴領域に属さない領域にかかる符号化パラメータとを変更するように構成してもよい。
【００３３】
映像通信装置は，特徴領域にかかる映像データの符号化パラメータと，特徴領域に属さない領域にかかる映像データの符号化パラメータとを，少なくとも映像データのフレーム，フィールド，またはシーン単位に変更するように構成してもよい。
【００３４】
映像通信装置は，特徴領域にかかる映像データを，別オブジェクトとして切り出すように構成してもよく，映像通信装置は，少なくともＨ．２６３又はＭＰＥＧ−４の圧縮符号化方式により，映像データを圧縮符号化するように構成してもよい。
【００３５】
映像通信装置は，さらに，特徴領域にかかる映像データを少なくともモザイク処理又は他の映像データに置換処理するように構成してもよい。
【００３６】
【発明の実施の形態】
以下，本発明の好適な実施の形態について，添付図面を参照しながら詳細に説明する。なお，以下の説明及び添付図面において，略同一の機能及び構成を有する構成要素については，同一符号を付することにより，重複説明を省略する。なお，本発明にかかる特徴検出部は，例えば，本実施の形態にかかる顔検出ブロック２０３などに該当する。
【００３７】
（１．システム構成）
まず，図１を参照しながら，本実施の形態にかかる双方向コミュニケーションシステムについて説明する。図１は，本実施の形態にかかる双方向コミュニケーションシステムの概略的な構成を示すブロック図である。
【００３８】
図１に示すように，双方向コミュニケーションシステムは，１又は２以上の映像配信ユニット１０１（ａ，ｂ，…，ｎ）がネットワーク１０５に接続されている。
【００３９】
上記映像配信ユニット１０１（ａ，ｂ，…，ｎ）により，使用者１０６（ａ，ｂ，…，ｎ）は，ネットワーク１０５を介して，お互いの画像又は音声をやりとりすることで例えばテレビ会議システムなどのサービスを受けることができる。
【００４０】
映像配信ユニット１０１（ａ，ｂ，…，ｎ）は，ビデオカメラなどの撮像装置１０２（ａ，ｂ，…，ｎ）と，上記撮像装置１０２の撮影により生成された，映像データを送受信する映像通信装置１０４（ａ，ｂ，…，ｎ）と，映像データを表示する出力装置１０３（ａ，ｂ，…，ｎ）とが備えられている。なお，本実施の形態にかかる映像データは，少なくとも音声データ又は画像データのうちいずれか一方又は双方からなる。
【００４１】
撮像装置１０２は，映像データを生成可能なビデオカメラであり，例えば，テレビ会議，監視・モニタリングなどに適用される低ビットレート通信用のビデオカメラであるが，かかる例に限定されず，本実施の形態にかかる撮像装置１０２は，放送用のニュース番組の取材や，スポーツなどの試合の模様などを撮影するカムコーダなどの場合であっても実施可能である。
【００４２】
出力装置１０３は，映像データを表示することが可能な例えば，ＴＶ装置又は液晶ディスプレイ装置などが例示され，さらにスピーカを備えることにより，音声および画像を出力することが可能な装置である。
【００４３】
映像通信装置１０４は，上記撮像装置１０２により生成された映像データに基づき，使用者１０６の顔である顔領域を検出し，上記顔領域から生成される顔領域情報に基づき，映像データを圧縮符号化し，上記圧縮符号化された伝送データを，ネットワーク１０５を介して送信する。また送信された伝送データを受信し，上記伝送データを伸長する。上記伸長された映像データは，出力装置１０３に送信される。さらに，ネットワーク１０５を介して伝送データを送信する際に，ネットワーク１０５のトラフィックの混雑状況に応じて伝送データを制御する。
【００４４】
なお，本実施の形態にかかる顔領域に基づく圧縮符号化は，少なくともＨ．２６３，またはＭＰＥＧ−４に基づき行われるが，後程詳述する。さらに，ネットワーク１０５のトラフィックの混雑状況の検知についても後程詳述する。
【００４５】
次に，本システムの典型的な動作例について説明する。
【００４６】
ある使用者１０６との間で，例えば，使用者１０６ａと使用者１０６ｂとの間で，テレビ会議をする場合，映像配信ユニット１０１ａに備わる撮像装置１０２ａにより，使用者１０６ａの映像データが生成され，ネットワーク１０５を介して映像配信ユニット１０１ｂに映像データが送信される。
【００４７】
したがって映像配信ユニット１０１ｂに備わる出力装置１０３ｂは，ネットワーク１０５を介して送信された映像データを表示する。また，撮像装置１０２ｂにより，使用者１０６ｂの映像データが生成されて，ネットワーク１０５を介して映像配信ユニット１０１ａに送信され，出力装置１０３ａに表示される。
【００４８】
映像配信ユニット１０１ａと映像配信ユニット１０１ｂとの間で，遠隔地であってもネットワーク１０５を介して映像データを送受信することで，お互いの使用者１０６ａと使用者１０６ｂとのコミュニケーションを図ることができる。
【００４９】
なお，本実施の形態にかかる映像配信ユニット１０１には，撮像装置１０２，出力装置１０３，および映像通信装置１０４とがそれぞれ備わっている場合を例にあげて説明したが，かかる例に限定されず，例えば，１の映像配信ユニット１０１には，映像通信装置１０４及び出力装置１０３を備え，他の映像配信ユニット１０１には，撮像装置１０２及び映像通信装置１０４を備える場合であっても実施可能である。この場合，例えば，駐車場などに駐車された乗用車又は自動二輪車などのナンバープレートを撮像装置１０２により監視する監視システムとしても適用可能である。
【００５０】
（２双方向コミュニケーションシステムの各コンポーネントの構成）
次に，本実施の形態にかかる双方向コミュニケーションシステムの各コンポーネントの構成について説明する。
【００５１】
（２．１ネットワーク１０５）
ネットワーク１０５は，映像配信ユニット１０１（ａ，ｂ，…，ｎ）に備わる映像通信装置１０４（ａ，ｂ，…，ｎ）を相互に双方向通信可能に接続するものであり，典型的にはインターネットなどの公衆回線網であるが，ＷＡＮ，ＬＡＮ，ＩＰ−ＶＰＮなどの閉鎖回線網も含む。また接続媒体は，ＦＤＤＩ（ＦｉｂｅｒＤｉｓｔｒｉｂｕｔｅｄＤａｔａＩｎｔｅｒｆａｃｅ）などによる光ファイバケーブル，Ｅｔｈｅｒｎｅｔ(登録商標）による同軸ケーブル又はツイストペアケーブル，もしくはＩＥＥＥ８０２．１１ｂなど，有線無線を問わず，衛星通信網なども含む。
【００５２】
（２．２映像配信ユニット１０１）
映像配信ユニット１０１（ａ，ｂ，…，ｎ）は，撮像装置１０２（ａ，ｂ，…，ｎ），上記撮像装置１０２の撮影により生成された映像データを送受信する映像通信装置１０４（ａ，ｂ，…，ｎ），もしくは映像データを表示する出力装置１０３（ａ，ｂ，…，ｎ）のうちいずれか一つ又は任意の組み合わせとが備えられている。
【００５３】
（２．２．１撮像装置１０２）
図１に示す撮像装置１０２は，少なくとも１又は２以上の撮像素子（撮像デバイス）が備わる撮像部（図示せず）と，音声が入力されるマイク部（図示せず）と，映像通信装置１０４に映像入力信号として映像データを出力する出力部（図示せず）とを備えている。
【００５４】
上記撮像素子は，受光面に２次元的に設けられた光電変換素子からなる複数の画素により，被写体から受光した光学像を光電変換して画像データとして出力することが可能である。例えば，撮像素子は，多種からなるＣＣＤなどの固体撮像デバイスが挙げられる。
【００５５】
出力部は，撮像部により生成された画像データおよびマイク部から生成された音声データに基づき，映像データを生成し，映像通信装置１０４に映像入力信号として出力する。
【００５６】
なお，本実施の形態にかかる撮像装置１０２に備わる出力部は，映像データを映像通信装置１０４にアナログデータとして出力するが，かかる例に限定されず，Ａ／Ｄ変換部（Ａ／Ｄコンバータ）を備えることにより，ディジタルデータとして出力する場合であっても実施可能である。
【００５７】
（２．２．２映像通信装置１０４）
次に，図２を参照しながら，本実施の形態にかかる映像通信装置１０４について説明する。図２は，本実施の形態にかかる映像通信装置の概略的な構成を示すブロック図である。
【００５８】
図２に示すように，映像通信装置１０４は，撮像装置１０２により送出された映像データをＡ／Ｄ変換する変換部２０１と，映像データを一時的に記憶保持するメモリ部２０２と，映像データに基づき顔領域を検出する顔検出ブロック２０３と，映像データのうち，上記顔領域について少なくともモザイク変換又は他の画像に置換する特殊処理部２０４と，少なくとも顔検出ブロック２０３の検出結果により生成される顔領域情報に基づき符号化パラメータを生成する符号化制御部２０５と，上記符号化パラメータに基づき映像データを圧縮符号化するエンコーダ部２０６と，圧縮符号化された伝送データを送受信する通信部２０７と，通信部２０７により受信された伝送データを伸長するデコーダ部２０８と，映像データをＤ／Ａ変換し，出力装置１０３に送出する変換部２０９とを備える。なお，上記顔検出ブロック２０３及び通信部２０７については，後程詳述する。以下，顔領域は，後程詳述するが，図７に示す顔領域７００または顔領域７０２である。
【００５９】
（２．２．３出力装置１０３）
出力装置１０３は，図２に示すように，変換部２０９によりＤ／Ａ変換された映像データを表示する。また，出力装置１０３は，上記説明の通り，例えば，ＴＶ装置又は液晶ディスプレイ装置などが例示され，音声又は画像を出力することが可能な装置である。
【００６０】
なお，本実施の形態にかかる出力装置１０３は，Ｄ／Ａ変換された映像データを表示する場合を例に挙げて説明したが，かかる例に限定されず，例えば，Ｄ／Ａ変換せずに，ディジタルデータのまま映像データを表示する場合でも実施可能である。
【００６１】
（２．２．４顔検出ブロック２０３）
次に，図２を参照しながら，メモリ部２０２に記憶された映像データに含まれる顔領域を検出する顔検出ブロック２０３及び顔領域検出処理について説明する。
【００６２】
顔検出ブロック２０３は，メモリ部２０２に記憶された映像データをフレーム単位に，映像データから人間の顔画像である顔領域を検出する。したがって，顔検出ブロック２０３には，複数の工程により上記顔領域を検出するために，各部がそれぞれ備わっている。
【００６３】
なお，本実施の形態にかかる顔検出ブロック２０３は，人間の顔領域を検出する場合を例に挙げて説明したが，映像データのうち特徴的な領域を有する場合であれば，かかる例に限定されず，例えば，乗用車のナンバープレート，時計，またはパソコンなどの画像領域を検出する場合であっても実施可能である。
【００６４】
顔検出ブロック２０３は，図２に示すように，リサイズ部２３０と，ウィンドウ切出部２３１と，テンプレートマッチング部２３２と，前処理部２３３と，ＳＶＭ（サポートベクタマシン；ＳｕｐｐｏｒｔＶｅｃｔｏｒＭａｃｈｉｎｅ）識別部２３４と，結果判定部２３５とが備わる。
【００６５】
リサイズ部２３０は，撮像装置１０２により生成された映像データを，メモリ部２０２からフレーム単位に読み出して，当該フレーム単位に読み出された映像データ（以下，フレーム画像）を縮小率が相異なる複数のスケール画像に変換する。
【００６６】
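For example, when the frame image of this embodiment consists of 704 × 480 pixels (width × height) in the NTSC (National Television System Committee) format, it is reduced successively by a factor of 0.8 and converted into scale images at five levels (1.0×, 0.8×, 0.64×, 0.51×, and 0.41×). In the following, the 1.0× scale image is called the first scale image, and each successive reduction yields the second through fifth scale images.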
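The five reduction ratios follow directly from repeated multiplication by 0.8. As a minimal sketch, the conversion performed by the resizing unit 230 might be tabulated as follows; the function name `scale_pyramid` and the rounding of image sizes to whole pixels are assumptions made for illustration and are not stated in the text.

```python
def scale_pyramid(width=704, height=480, ratio=0.8, levels=5):
    """Return (scale, width, height) for each scale image, largest first."""
    pyramid = []
    for level in range(levels):
        scale = ratio ** level
        # Rounding to whole pixels is an assumption; the text gives only the ratios.
        pyramid.append((round(scale, 2),
                        round(width * scale),
                        round(height * scale)))
    return pyramid


for scale, w, h in scale_pyramid():
    print(f"{scale:.2f}x -> {w} x {h}")
```

The first entry corresponds to the first scale image (1.0×) and the last to the fifth scale image (0.41×).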
【００６７】
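The window cutout unit 231 first scans the first scale image from the upper left of the image to the lower right, shifting right or down by a suitable number of pixels at a time (for example, 2 pixels), and successively cuts out rectangular regions of 20 × 20 pixels (hereinafter called window images). The starting point of the scan in this embodiment is not limited to the upper left of the image and may be, for example, the upper right.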
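The raster scan of 20 × 20 pixel windows with a 2-pixel shift can be sketched as a generator of upper-left coordinates. The name `window_origins` is hypothetical, and the sketch assumes that windows must lie entirely inside the image, which the text does not state explicitly.

```python
def window_origins(img_w, img_h, win=20, step=2):
    """Yield the upper-left (x, y) of every win x win window, row by row."""
    for y in range(0, img_h - win + 1, step):
        for x in range(0, img_w - win + 1, step):
            yield (x, y)


# Number of window images cut from the first scale image (704 x 480).
print(sum(1 for _ in window_origins(704, 480)))
```

Under these assumptions the count grows quickly with image size, which is one reason an inexpensive first-stage filter such as template matching is useful.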
【００６８】
上記第１のスケール画像から切出された複数のウィンドウ画像は，順次，ウィンドウ切出部２３１により，後続のテンプレートマッチング部２３２に送出される。
【００６９】
テンプレートマッチング部２３２は，ウィンドウ切出部２３１により送出されたウィンドウ画像について，例えば正規化相関法，誤差二乗法などの演算処理を実行してピーク値をもつ関数曲線に変換した後，当該関数曲線に対して認識性能が落ちない程度に十分に低い閾値を設定し，当該閾値を基準として当該ウィンドウ画像の領域が顔領域であるか否かを判断する。
【００７０】
上記テンプレートマッチング部２３２には，予め，例えば１００人程度の人間の顔画像の平均から生成される平均的な人間の顔領域をテンプレートデータとして登録されている。
【００７１】
ウィンドウ画像の領域が顔領域であるか否かの判断は，上記テンプレートマッチング部２３２に顔領域のテンプレートデータとして登録することにより，かかる顔領域か否かの判断基準となる閾値が設定され，当該ウィンドウ画像について，テンプレートデータとなる平均的な顔領域との簡単なマッチングをすることにより判断される。
【００７２】
テンプレートマッチング部２３２は，ウィンドウ切出部２３１により送出されたウィンドウ画像について，テンプレートデータによるマッチング処理を行い，テンプレートデータとマッチングし，顔領域であると判断された場合には，当該ウィンドウ画像をスコア画像（顔領域と判断されたウィンドウ画像。）として後続の前処理部２３３に送出する。
【００７３】
また，上記ウィンドウ画像について，顔領域でないと判断された場合には，当該ウィンドウ画像そのまま結果判定部２３５に送出する。なお，上記スコア画像には，顔領域と判断された度合いがどの程度確からしいのかを示す信頼度情報が含まれる。例えば，信頼度情報は，スコア値が“００”〜“９９”の範囲内の数値を表し，数値が高いほど，より顔領域であることが確からしいことを表す。なお，信頼度情報は，例えば結果判定部２３５に備わるキャッシュ（図示せず。）などに格納される場合でもよい。
【００７４】
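Compared with the processing in the subsequent preprocessing unit 233 and SVM identification unit 234, arithmetic operations such as the normalized correlation method and the squared error method described above require only about one tenth to one hundredth of the computation, and at the matching stage the template matching unit 232 can already detect window images that are face regions with a probability of 80% or more. In other words, window images that are clearly not face regions can be eliminated at this point.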
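As a hedged illustration of the normalized correlation method named above, the following sketch compares a window image with the template data and applies a deliberately low threshold. The zero-mean form of the correlation, the function names, and the threshold value 0.3 are all assumptions; the text specifies neither the exact formula nor a numeric threshold.

```python
import math


def normalized_correlation(window, template):
    """Zero-mean normalized cross-correlation of two equal-length pixel lists.

    The result lies in [-1, 1]; 1 means identical up to gain and offset.
    """
    n = len(window)
    mean_w = sum(window) / n
    mean_t = sum(template) / n
    dw = [p - mean_w for p in window]
    dt = [p - mean_t for p in template]
    denom = math.sqrt(sum(d * d for d in dw) * sum(d * d for d in dt))
    if denom == 0:  # flat window or flat template: no usable correlation
        return 0.0
    return sum(a * b for a, b in zip(dw, dt)) / denom


def is_face_candidate(window, template, threshold=0.3):
    # threshold=0.3 is a hypothetical "sufficiently low" value, not from the text.
    return normalized_correlation(window, template) >= threshold
```

A window that passes this test would be forwarded as a score image; one that fails would go straight to the result determination unit.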
【００７５】
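For each score image obtained from the template matching unit 232, the preprocessing unit 233 removes the four corner regions of the rectangular score image, which correspond to background unrelated to the human face region. Using a mask with those four corners cut away, it extracts 360 pixels from the 20 × 20 pixel (400 pixel) score image. Although this embodiment describes extracting the 360 pixels that remain after the four corners are cut away, the invention is not limited to this example and can also be practiced without cutting the corners.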
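The text gives only the pixel counts (400 before masking, 360 after), not the shape of the mask. One shape that yields exactly 360 pixels is a triangular cut of 10 pixels (4 + 3 + 2 + 1) at each corner; the sketch below assumes that shape purely for illustration, and `corner_mask` and `apply_mask` are hypothetical names.

```python
def corner_mask(size=20, cut=4):
    """Return a size x size boolean mask with a triangular cut at each corner."""
    mask = [[True] * size for _ in range(size)]
    for r in range(cut):
        for c in range(cut - r):
            # Mirror the same triangular cut into all four corners.
            mask[r][c] = False
            mask[r][size - 1 - c] = False
            mask[size - 1 - r][c] = False
            mask[size - 1 - r][size - 1 - c] = False
    return mask


def apply_mask(window, mask):
    """Flatten the pixels of `window` that the mask keeps."""
    return [p for row, mrow in zip(window, mask)
            for p, keep in zip(row, mrow) if keep]


mask = corner_mask()
print(sum(sum(row) for row in mask))  # number of pixels kept
```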
【００７６】
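Furthermore, to cancel the gradient in the subject's shading caused by illumination and similar conditions at the time of imaging, the preprocessing unit 233 corrects the gray values of the extracted 360-pixel score image using a calculation such as the root mean square (RMS).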
【００７７】
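Next, the preprocessing unit 233 applies histogram equalization to the 360-pixel score image to enhance its contrast, which yields a score image that is not affected by the gain of the image sensor of the imaging device 102 or by the strength of the illumination.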
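Histogram equalization remaps gray values through their cumulative distribution so that the output spans the full range regardless of the original gain or lighting. The following is a minimal sketch of the textbook procedure on a flat list of 8-bit gray values; the function name `equalize` is hypothetical, and the text does not specify which variant of equalization is used.

```python
def equalize(pixels, levels=256):
    """Histogram-equalize a non-empty flat list of integers in [0, levels)."""
    hist = [0] * levels
    for p in pixels:
        hist[p] += 1
    # Cumulative distribution of gray values.
    cdf, total = [], 0
    for count in hist:
        total += count
        cdf.append(total)
    cdf_min = next(c for c in cdf if c > 0)
    n = len(pixels)
    if n == cdf_min:  # only one gray level present: nothing to spread
        return list(pixels)
    return [round((cdf[p] - cdf_min) * (levels - 1) / (n - cdf_min))
            for p in pixels]
```

For instance, a dim patch whose values all lie between 100 and 102 is stretched so that its darkest pixels map to 0 and its brightest to 255.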
【００７８】
またさらに，前処理部２３３は，例えばスコア画像をベクトル変換し，得られたベクトル群をさらに１本のパターンベクトルに変換するため，ガボア・フィルタリング（ＧａｂｏｒＦｉｌｔｅｒｉｎｇ）処理を行う。なお，ガボア・フィルタリングにおけるフィルタの種類は必要に応じて変更可能である。
【００７９】
ＳＶＭ識別部２３４は，前処理部２３３からパターンベクトルとして得られたスコア画像に対して顔領域の検出を行う。そして検出された場合，顔領域検出データとして出力する。検出されない場合は，顔領域未検出データとして追加され，さらに学習する。
【００８０】
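The SVM identification unit 234 determines, for the pattern vector generated from the score image sent by the preprocessing unit 233, whether a face region exists in that score image. When a face region is detected, it stores face region information in, for example, a cache (not shown) provided in the result determination unit 235, building a list for each score image. The face region information consists of the upper-left position (coordinates) of the face region in the score image, the area of the face region (height × width in pixels), reliability information indicating how likely the region is to be a face, and the reduction ratio of the scale image from which the score image was cut out (one of the reduction ratios of the first through fifth scale images). The reference position (starting point) of the face region in this embodiment is not limited to the upper left of the image and may be, for example, the upper right.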
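The four items of face region information can be pictured as one record per detected score image. The sketch below is only an illustration: the class name `FaceRegionInfo`, its field names, and in particular the method that maps a region back to full-frame coordinates by dividing by the reduction ratio are assumptions, since the text lists the stored items but does not say how the ratio is later used.

```python
from dataclasses import dataclass


@dataclass
class FaceRegionInfo:
    x: int            # upper-left X within the scale image
    y: int            # upper-left Y within the scale image
    width: int        # region width in pixels
    height: int       # region height in pixels
    reliability: int  # 0..99, higher means more likely to be a face
    scale: float      # reduction ratio of the source scale image

    def in_frame_coordinates(self):
        """Map the region back to the full-size frame (assumed usage)."""
        return (round(self.x / self.scale), round(self.y / self.scale),
                round(self.width / self.scale), round(self.height / self.scale))


info = FaceRegionInfo(x=64, y=32, width=20, height=20, reliability=87, scale=0.64)
print(info.in_frame_coordinates())
```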
【００８１】
ＳＶＭ識別部２３４により，例えば，第１のスケール画像のうち最初のウィンドウ画像の顔領域の検出が終了すると，ウィンドウ切出部２３１により第１のスケール画像の中の次にスキャンされたウィンドウ画像がテンプレートマッチング部２３２に送出される。
【００８２】
次にテンプレートマッチング部２３２は，当該ウィンドウ画像がテンプレートデータにマッチングした場合のみスコア画像として，前処理部２３３に送出する。前処理部２３３は，上記スコア画像をパターンベクトルに変換してＳＶＭ識別部２３４に送出する。ＳＶＭ識別部２３４は，パターンベクトルに基づき顔領域を検出した場合，上記スケール画像に関する顔領域情報を生成し，上記結果判定部２３５に備わるキャッシュに格納する。
【００８３】
上記記載のように，第１のスケール画像について，ウィンドウ切出部２３１により順次スキャンされたウィンドウ画像について，以降後続のテンプレートマッチング部２３２，前処理部２３３，及びＳＶＭ識別部２３４による各処理が実行され，当該第１のスケール画像から顔領域が含まれるスコア画像を複数検出することが可能となる。
【００８４】
さらに，ウィンドウ切出部２３１による第１のスケール画像のスキャンが全て終了し，後続のテンプレートマッチング部２３２，前処理部２３３，及びＳＶＭ識別部２３４による各処理についても終了すると，第２のスケール画像について，上記説明の第１のスケール画像とほぼ同様に顔領域の検出するための各処理が実行される。第３〜第５のスケール画像についても，第１のスケール画像とほぼ同様にして顔領域の検出処理が実行される。
【００８５】
ＳＶＭ識別部２３４は，メモリ部２０２から読み出した映像データであるフレーム画像を５段階の相異なる縮小率から構成される第１〜第５のスケール画像について，顔領域が検出されたスコア画像をそれぞれ複数検出し，その結果，生成される顔領域情報を，上記結果判定部２３５に備わるキャッシュ（図示せず。）に格納する。なお，本実施の形態にかかるキャッシュは，結果判定部２３５に備わる場合を例に挙げて説明したが，かかる例に限定されず，例えば，顔検出ブロック２０３内に単独で備わる場合などであっても実施可能である。さらに，顔領域が検出されずスコア画像が全く得られない場合もあるが，少なくとも１個など，所定の個数だけスコア画像が得られれば，顔検出処理は続行される。
【００８６】
上記第１〜第５のスケール画像において顔領域が検出されたスコア画像は，ウィンドウ切出部２３１におけるスキャンが所定画素（例えば，２画素など。）ずつ移動しながら実行されているため，前後のスコア画像の間では，近傍領域において高い相関性があり，相互に重なり合う領域を有する場合が多い。
【００８７】
結果判定部２３５は，上記重複する領域を除去するため，２つのスコア画像の位置，スコア画像の画素数，および所定の数式に基づき，重複しているか否かを判定する。
【００８８】
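For example, let the upper-left corners of the two score images be (X_A, Y_A) and (X_B, Y_B) in X-Y coordinates, let their sizes in pixels (height × width) be H_A × L_A and H_B × L_B, and let dX = X_B − X_A and dY = Y_B − Y_A. The two score images are judged to overlap when expressions (1) and (2) below hold simultaneously.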
【００８９】
（Ｌ_Ａ−ｄＸ）×（Ｌ_Ｂ＋ｄＸ）＞０・・・・・（１）
【００９０】
（Ｈ_Ａ−ｄＹ）×（Ｈ_Ｂ＋ｄＹ）＞０・・・・・（２）
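Expressions (1) and (2) translate directly into code once dX and dY are taken as the horizontal and vertical offsets between the two upper-left corners (dX = X_B − X_A, dY = Y_B − Y_A). The sketch below assumes each score image is passed as an `(x, y, width, height)` tuple; that layout and the name `overlaps` are illustrative choices.

```python
def overlaps(a, b):
    """Return True when rectangles a and b overlap per expressions (1) and (2).

    Each rectangle is (x, y, width, height) with (x, y) the upper-left corner.
    """
    xa, ya, la, ha = a
    xb, yb, lb, hb = b
    dx = xb - xa
    dy = yb - ya
    return (la - dx) * (lb + dx) > 0 and (ha - dy) * (hb + dy) > 0


print(overlaps((0, 0, 20, 20), (2, 0, 20, 20)))   # windows shifted by 2 pixels
print(overlaps((0, 0, 20, 20), (20, 0, 20, 20)))  # windows that only touch
```

With positive widths and heights, each product is positive only when the offset lies strictly between −L_B and L_A (or −H_B and H_A), so rectangles that merely share an edge are not counted as overlapping.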
【００９１】
結果判定部２３５は，当該判定結果に基づいて，複数のスコア画像のうち重なり合う領域を除くことにより，重なり合わない最終的な顔領域を取得し，最終的に確定となる顔領域情報を生成し，上記キャッシュに格納されていた顔領域情報を更新する。なお，本実施形態にかかる格納されていた顔領域情報は，確定された顔領域情報に更新される場合を例に挙げて説明したが，かかる場合に限らず，別途新規に確定された顔領域情報を格納する場合であっても実施可能である。
【００９２】
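When overlapping regions exist, the result determination unit 235 refers to the reliability information for the score images stored in the cache (not shown), generates face region information for the score image with the higher reliability, that is, the one more likely to be a face region, and updates the face region information stored in the cache to that higher-reliability face region information.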
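Keeping the more reliable of two overlapping score images can be sketched with a greedy pass over the candidates in descending order of reliability. This greedy procedure (often called non-maximum suppression) is a substituted technique chosen for illustration; the text states only that the higher-reliability region is kept. The dictionary keys and the name `suppress_overlaps` are hypothetical.

```python
def suppress_overlaps(faces):
    """Keep the most reliable face of every group of overlapping faces.

    faces: list of dicts with keys x, y, w, h, score.
    """
    def overlaps(a, b):
        dx, dy = b["x"] - a["x"], b["y"] - a["y"]
        return ((a["w"] - dx) * (b["w"] + dx) > 0
                and (a["h"] - dy) * (b["h"] + dy) > 0)

    kept = []
    # Visit candidates from most to least reliable so the first one kept wins.
    for face in sorted(faces, key=lambda f: f["score"], reverse=True):
        if not any(overlaps(face, k) for k in kept):
            kept.append(face)
    return kept
```

Given two windows shifted by 2 pixels with scores 60 and 85 and a third, distant window with score 40, the sketch keeps the 85 and the 40.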
【００９３】
結果判定部２３５は，上記顔領域が検出されない場合，キャッシュに格納処理を行わず，さらに重なり合う顔領域が存在しない場合は，顔領域情報の更新は行わない。
【００９４】
以上から，顔検出ブロック２０３は，撮像装置１０２により撮影された映像データから，信頼性の高い顔領域に対して顔領域情報を生成することが可能となる。したがって，複数の顔領域が検出されても，より確実に，例えば使用者１０６の顔領域を検出することが可能となる。
【００９５】
上記生成された顔領域情報は，図２に示す符号化制御部２０５に送信されて，顔領域情報に基づき，映像データを圧縮符号化するための符号化パラメータが生成される。
【００９６】
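The overlap determination by the result determination unit 235 in this embodiment has been described using expressions (1) and (2) as an example, but it is not limited to this example and can also be practiced using other expressions.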
【００９７】
また，本実施の形態にかかるスケール画像をはじめとする画像の位置は，左上隅を基準に表される場合を例に挙げて説明したが，かかる例に限定されず，他の位置を基準とした場合であっても実施可能である。
【００９８】
また，本実施の形態にかかる顔領域の検出される映像データは，フレーム単位に読み込まれて，顔領域が検出処理される場合を例に挙げて説明したが，かかる例に限定されず，例えば，フィールド単位又は複数フレームからなるシーンごとに顔領域の検出処理を行う場合などであっても実施可能である。
【００９９】
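The template data registered in the template matching unit 232 in this embodiment has been described as a face region representing an average human face, but the invention is not limited to this example; for instance, an image region of a passenger car license plate, a clock, or the face of an animal such as a pet may be registered as template data.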
【０１００】
（２．２．５通信部２０７）
次に，本実施の形態にかかる通信部２０７について説明する。通信部２０７は，ネットワーク１０５と接続され，ネットワーク１０５を介して圧縮符号化された伝送データを送信，または伝送データを受信する。
【０１０１】
通信部２０７には，ネットワーク１０５のトラフィックの混雑状況を検知する検査部２１０を備える。検査部２１０は，ネットワーク１０５のトラフィックの混雑状況を検知するため，所定時間ごとに，例えば“ｐｉｎｇ”を利用したＩＣＭＰなどにより，接続先の映像通信装置１０４，または任意のホストに対し動作確認を要求（エコー検査）する。
【０１０２】
検査部２１０は，ｐｉｎｇコマンドにより，少なくとも接続相手先のアドレス情報を設定し，ＩＣＭＰパケットを送信する。接続相手先の例えばホストなどは，上記ＩＣＭＰパケットを受信すると，ｐｉｎｇコマンド発行元の検査部２１０に対し，正常に受信された旨の応答（Ｒｅｐｌｙ）パケットを送信する。なお，正常に接続相手先に受信されない場合（または，制限時間内にＩＣＭＰパケットが受信されなかった場合）は，エラーとなる。
【０１０３】
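The inspection unit 210 therefore measures the time from sending the ICMP packet to receiving the reply packet, and uses it to detect traffic congestion. For example, if under normal traffic on the network 105 the communication speed is 128 KByte/sec and the time until the reply packet is received (hereinafter, the response time) is 40 msec, then when an inspection by the inspection unit 210 at some point measures a response time of 80 msec, the inspection unit 210 judges that the traffic on the network 105 is congested.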
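The judgment itself reduces to comparing a measured response time against a baseline. The sketch below assumes staged thresholds expressed as multiples of the 40 msec baseline; the multiples (2.0 and 4.0), the use of "greater than or equal to", and the name `congestion_level` are illustrative assumptions. Only the 40 msec and 80 msec figures come from the text, and the response time is taken as an input rather than measured with a real ICMP echo.

```python
def congestion_level(rtt_ms, baseline_ms=40.0, ratios=(2.0, 4.0)):
    """Return 0 for normal traffic, else the number of thresholds reached."""
    # ratios are hypothetical multiples of the baseline response time;
    # 2.0 reproduces the 40 msec -> 80 msec example in the text.
    return sum(1 for r in ratios if rtt_ms >= baseline_ms * r)


for rtt in (40, 80, 200):
    print(rtt, "msec ->", congestion_level(rtt))
```

A staged result of this kind would let the encoding control unit react gradually, for example by first coarsening the background and only later dropping it.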
【０１０４】
検査部２１０は，ネットワーク１０５のトラフィックの混雑を検知すると，混雑情報を生成し，符号化制御部２０５に上記混雑情報を送信する。混雑情報はネットワーク１０５のトラフィックの混雑状況を示すデータであり，例えば，応答時間などの情報が含まれる。
【０１０５】
符号化制御部２０５は，上記混雑情報を受信すると，ネットワーク１０５のトラフィックの混雑状況に応じて，マクロブロック単位に映像データの圧縮符号化を制御させるため，符号化パラメータを設定する。例えば，所定時間内の複数フレームの映像データについては圧縮符号化せず，伝送データを送信しないように制御させる，または所定時間内の複数フレームの映像データについては，顔領域に属すマクロブロックだけを圧縮符号化し，伝送データを送信するよう制御させる符号化パラメータが例示される。なお，以下に記載されるマクロブロックは，図５に示すＭＢ５０３を示すこともある。マクロブロックについては，後程詳述する。
【０１０６】
なお，本実施の形態にかかる検査部２１０は，ＩＣＭＰ（ＩｎｔｅｒｎｅｔＣｏｎｔｒｏｌＭｅｓｓａｇｅＰｒｏｔｏｃｏｌ）によりトラフィックの混雑状況を検知する場合を例に挙げて説明したが，かかる例に限定されず，例えばＴＣＰ（ＴｒａｎｓｍｉｓｓｉｏｎＣｏｎｔｒｏｌＰｒｏｔｏｃｏｌ）セグメントなどデータの再送信処理の際に，接続先の相手側から確認応答が返ってくるまでの時間（ＲＴＴ：ＲｏｕｎｄＴｒｉｐＴｉｍｅ）を取得する，または接続先の例えばホストなどに，まとめてデータを送受信することが可能なウィンドウ・サイズ（受信可能なデータサイズ）の変動により混雑状況を検知する場合であっても実施可能である。
【０１０７】
（３．双方向コミュニケーションシステムの動作）
次に，図３を参照しながら，上記のように構成された双方向コミュニケーションシステムの動作の実施形態について説明する。図３は，本実施の形態にかかる双方向コミュニケーションシステムの動作の概略を示すフローチャートである。
【０１０８】
図３に示すように，本実施の形態にかかる双方向コミュニケーションシステムにおいて，例えばテレビ会議などにより複数の使用者１０６が打ち合わせをする場合，打ち合わせされる時間内は絶えず複数の映像配信ユニット１０１間で，相互に映像データをやりとりし，双方向コミュニケーションシステムの動作が継続される。
【０１０９】
したがって，打ち合わせ時間が終了（撮影処理が終了）するまで，映像配信ユニット１０１間で，映像データの配信処理（Ｓ３０１）が続行（配信ループ）される。
【０１１０】
（３．１映像配信ユニット１０１からの映像データ配信処理）
次に，図４を参照しながら，本実施の形態にかかる映像データ配信処理について説明する。図４は，本実施の形態にかかる映像データ配信処理の概略を示すフローチャートである。なお，以下の説明は，ＩＴＵ−Ｔ勧告Ｈ．２６３の場合である映像データ配信処理について説明するが，ＭＰＥＧ−４についても準拠する。
【０１１１】
映像データ配信処理（Ｓ３０１）は，撮像装置１０２の撮影処理により，映像データが生成されると，例えば，ＲＳ−２３２ＣまたはＲＳ−４２２などを介して，映像通信装置１０４の変換部２０１に送出される。
【０１１２】
変換部２０１は，上記映像データをＡ／Ｄ変換し，メモリ部２０２に送出する。映像データが，メモリ部２０２に送出されると，図４に示すように，顔検出ブロック２０３により顔検出処理（Ｓ４０１）が行われる。なお，本実施の形態にかかる顔検出処理は，上記説明したのとほぼ同様の構成であるため省略する。
【０１１３】
顔検出処理（Ｓ４０１）は，メモリ部２０２に送出される映像データのフレーム単位に行われるが，かかる例に限らず，フィールド単位の場合でもよい。また，フレーム単位の映像データであるフレーム画像（ピクチャ）内に顔領域が存在しない，検出されない（Ｓ４０２）場合は，再度顔検出処理（Ｓ４０１）が行われる。
【０１１４】
顔検出処理（Ｓ４０１）の結果，顔領域が検出された（Ｓ４０２）場合は，映像通信装置１０４に備わる結果判定部２３５のキャッシュに格納された顔領域情報が符号化制御部２０５に送信される（Ｓ４０３）。
【０１１５】
符号化制御部２０５は，上記顔領域情報を受信すると，符号化制御部２０５内に備わる記憶部（図示せず。）に格納された少なくとも１フレーム前のフレーム画像にかかる顔領域情報を取得する。なお，取得されるフレーム画像は，１フレーム前に限らず，例えば，複数フレーム前，または１フィールド前などであってもよい。
【０１１６】
上記１フレーム前のフレーム画像（前フレーム画像）にかかる顔領域情報が格納されている場合は，上記受信した現フレーム画像の顔領域情報と，前フレーム画像にかかる顔領域情報とを比較し，補正処理を行う（Ｓ４０５）。
【０１１７】
上記前のフレーム画像にかかる顔領域情報が記憶部に格納されて無い場合（Ｓ４０４），つまり前フレーム画像において顔領域が検出されない場合（Ｓ４０４）には，顔領域情報の補正処理（Ｓ４０５）は実行されない。
【０１１８】
上記補正処理（Ｓ４０５）は，前フレームおよび現フレーム画像にかかる顔領域情報の顔領域の面積情報，位置情報，または信頼度情報のうち少なくとも一つを比較することにより現フレーム画像にかかる顔領域情報を補正する。
【０１１９】
本実施の形態にかかる補正処理（Ｓ４０５）は，例えば，前フレーム画像において１の顔領域のみ存在し，現フレーム画像において２の顔領域が存在し，現フレーム画像においても前フレーム画像で検出された顔領域を選択する場合，現フレーム画像に含まれる前フレーム画像にかかる顔領域情報を，選択するため正確に顔領域情報を判別する必要がある。
【０１２０】
前フレーム画像および現フレーム画像間の時間差は極めて短く，人間の動作によりフレーム画像内を移動可能な範囲は極めて限られているため，符号化制御部２０５は，顔領域情報の面積情報と位置情報とに基づき，現フレーム画像にかかる顔領域のうち，前フレーム画像にかかる顔領域の近傍に存在する顔領域の顔領域情報を選択する。
【０１２１】
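When the reliability information in the selected face region information is lower than the other reliability information of the current frame image or the reliability information of the previous frame image, it is corrected to a value comparable to the reliability information of the previous frame image, or to a value equal to or greater than the other reliability information of the current frame image (S405). Then, for example, by selecting the face region information with the highest reliability information, the face region of the previous frame image can be selected accurately in the current frame image as well. The correction processing of this embodiment is not limited to this example.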
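The selection and correction (S405) can be sketched in two steps: choose the current-frame face nearest to the previous-frame face, then raise its reliability if it has fallen behind. Measuring nearness by the distance between region centers, raising the score to the maximum of the competing values, and the names used are all assumptions; the text requires only that position and area information be used and that the corrected value be comparable to or not lower than the others.

```python
def correct_reliability(prev_face, current_faces):
    """Pick the current face nearest to prev_face and lift its score if needed.

    Faces are dicts with keys x, y, w, h, score. Returns the selected face.
    """
    def center(f):
        return (f["x"] + f["w"] / 2, f["y"] + f["h"] / 2)

    px, py = center(prev_face)
    selected = min(
        current_faces,
        key=lambda f: (center(f)[0] - px) ** 2 + (center(f)[1] - py) ** 2,
    )
    others = [f["score"] for f in current_faces if f is not selected]
    target = max([prev_face["score"]] + others)
    if selected["score"] < target:
        # Mutate in place so later steps see the corrected value.
        selected["score"] = target
    return selected
```

After the correction, a later step that picks the highest-reliability face no longer ranks the tracked face below a newly appeared one; on an exact tie, a rule favoring the tracked face would still be needed.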
【０１２２】
符号化制御部２０５は，補正された現フレーム画像にかかる顔領域情報に基づき，信頼度情報の最も高い顔領域に対してオブジェクトの切出処理（Ｓ４０６）をする。なお，本実施の形態にかかるオブジェクトの切出処理は，信頼度情報の最も高い顔領域に限定されることなく，例えば，信頼度情報に依存しない全ての顔領域，または最も低い信頼度情報を除く他の顔領域全てについて，オブジェクトの切出処理（Ｓ４０６）をする場合であっても実施可能である。
【０１２３】
（３．１．１映像フォーマット）
ここで，オブジェクトの切出処理（Ｓ４０６）を説明する前に，図５を参照しながら，本実施の形態にかかる映像フォーマットについて説明する。図５は，本実施の形態にかかる映像フォーマットの概略的な構成を示す説明図である。
【０１２４】
撮像装置１０２により，ＮＴＳＣ方式又はＰＡＬ方式にて撮影された映像データは，フレーム画像単位に，例えばＩＴＵ−Ｔ勧告に定めるＨ．２６１，Ｈ．２６３，またはＩＳＯ／ＩＥＣ１４４９６に定めるＭＰＥＧ−４などの場合において，予め共通フォーマットとして定められたＣＩＦ画面，ＱＣＩＦ画面，またはＳＱＣＩＦ画面などのフレーム画像に変換され，さらに圧縮符号化され，伝送データとしてネットワーク１０５を介して送信される。
【０１２５】
図５に示すように，画面５０１は，上記ＣＩＦ画面，ＱＣＩＦ画面，またはＳＱＣＩＦ画面のいずれかに該当し，グループ・オブ・ブロックと呼ばれる複数のＧＯＢ（５０２Ａ，５０２Ｂ，５０２Ｃ，…）から構成されている。
【０１２６】
例えば，本実施の形態にかかるＧＯＢ５０２は，Ｈ．２６１の場合，ＣＩＦ画面では，１２個のＧＯＢ５０２から構成され，ＱＣＩＦ画面では３個のＧＯＢ５０２から構成される。
【０１２７】
また，ＧＯＢ５０２は，さらにマクロブロック（ＭＢ）と呼ばれる，複数のＭＢ（５０３Ａ，５０３Ｂ，５０３Ｃ，…）から構成され，各ＭＢ５０３は，１６×１６画素の輝度マクロブロックであるＭＢ５０３−１と，８×８画素のＣ_Ｂ色差マクロブロックであるＭＢ５０３−２と，８×８画素のＣ_Ｒ色差マクロブロック５０３−３とから構成されるが，ＧＯＢ５０２に構成されるＭＢ５０３の個数は，例えばＨ．２６１，Ｈ．２６３，またはＭＰＥＧ−４などに応じて変動し，Ｈ．２６１の場合，１のＧＯＢ５０２に，３３個のＭＢ５０３から構成されている。
【０１２８】
Each MB 503 is in turn composed of blocks of 8 × 8 pixels, the smallest unit (504A, 504B, 504C, 504D). One MB 503 therefore consists of four luminance blocks (504A, 504B, 504C, 504D) and two chrominance blocks (504E, 504F), one each for C_B and C_R.
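The counts given for H.261 (12 GOBs per CIF picture, 3 per QCIF picture, 33 macroblocks per GOB) can be checked with simple arithmetic. The sketch below assumes the standard luminance sizes of 352 × 288 for CIF and 176 × 144 for QCIF, which the text does not state; the names are illustrative.

```python
MB_SIZE = 16            # luminance macroblock is 16 x 16 pixels
MBS_PER_GOB_H261 = 33   # stated in the text for H.261

# Luminance picture sizes; assumed here, not given in the text.
FORMATS = {"CIF": (352, 288), "QCIF": (176, 144)}


def h261_layout(name):
    """Return macroblock, GOB, and 8 x 8 block counts for one picture."""
    w, h = FORMATS[name]
    macroblocks = (w // MB_SIZE) * (h // MB_SIZE)
    return {
        "macroblocks": macroblocks,
        "gobs": macroblocks // MBS_PER_GOB_H261,
        "luma_blocks": macroblocks * 4,    # four 8 x 8 luminance blocks per MB
        "chroma_blocks": macroblocks * 2,  # one C_B and one C_R block per MB
    }


print(h261_layout("CIF"))
print(h261_layout("QCIF"))
```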
【０１２９】
（３．１．２マクロブロックのデータ構造）
次に，図６を参照しながら，本実施の形態にかかるマクロブロックのデータ構造について説明する。図６は，本実施の形態にかかるマクロブロックのデータ構造の概略的な構成を示す説明図である。
【０１３０】
図６に示すように，マクロブロックのデータ構造は，マクロブロックヘッダと，ブロックデータとからなり，上記マクロブロックヘッダは，“ＣＯＤ”と，“ＭＣＢＰＣ”と，“ＭＯＤＢ”と，“ＣＢＰＢ”と，“ＣＢＰＹ”と，“ＤＱＵＡＮＴ”と，“ＭＶＤ”と，“ＭＶＤ_２”と，“ＭＶＤ_３”と，“ＭＶＤ_４”と，“ＭＶＤＢ”とから構成される。
【０１３１】
なお，本実施の形態にかかるマクロブロックのデータ構造は，Ｈ．２６３にかかるデータ構造である場合を例にあげて説明したが，かかる例に限定されず，例えば，Ｈ．２６１，またはＭＰＥＧ−４などの場合であっても，Ｈ．２６３に準拠する。
【０１３２】
上記“ＤＱＵＡＮＴ”は，２ビット又は可変長データであり，ＱＵＡＮＴの変化を定義する。ＱＵＡＮＴは，マクロブロックに対する量子化パラメータであり，１〜３１の範囲の値を取り得る。なおＱＵＡＮＴは，予め任意の値に設定されている。
【０１３３】
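Since "DQUANT" represents a differential value, when "DQUANT" is, for example, "00" in binary the difference is −1, when it is "01" the difference is −2, when it is "10" the difference is +1, and when it is "11" the difference is +2.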
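The 2-bit code table can be written as a lookup from the DQUANT bits to the differential applied to QUANT. In this sketch the result is clipped to the range 1 to 31 that the text gives for QUANT; treating out-of-range results by clipping, and the names `DQUANT_DELTA` and `apply_dquant`, are assumptions of the illustration.

```python
# 2-bit DQUANT code -> differential applied to QUANT, as listed in the text.
DQUANT_DELTA = {0b00: -1, 0b01: -2, 0b10: 1, 0b11: 2}


def apply_dquant(quant, dquant_bits):
    """Return the new QUANT after applying a DQUANT code, kept within 1..31."""
    return max(1, min(31, quant + DQUANT_DELTA[dquant_bits]))


print(apply_dquant(10, 0b01))  # finer quantization for a face macroblock
print(apply_dquant(10, 0b11))  # coarser quantization for the background
```

Because each code changes QUANT by at most 2, moving from a fine quantizer to a coarse one takes several macroblocks.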
【０１３４】
“ＤＱＵＡＮＴ”の差分値が変化することにより，ＱＵＡＮＴの値が変化するが，量子化パラメータであるＱＵＡＮＴが大きくなると，該当するマクロブロックの画質は落ちて，ぼんやりと精細を欠いた画像になる，ＱＵＡＮＴが小さくなると画質は向上して，圧縮符号化しても，ほぼ元の原画に近い状態の画像になる。つまりマクロブロックごとに，“ＤＱＵＡＮＴ”の変化を制御することにより，映像データの任意領域の画質を制御することが可能となる。上記“ＤＱＵＡＮＴ”の変化は，符号化制御部２０５により生成される符号化パラメータに基づいて，制御される。
【０１３５】
図６に示すように，Ｈ．２６３にかかる“ＣＯＤ”は，符号化マクロブロックインジケータであり，１ビットからなるデータである。“ＣＯＤ”が“０”である場合，圧縮符号化される対象のマクロブロックであることを示し，“１”である場合，圧縮符号化されず削除または無視されるマクロブロックであることを示す。
【０１３６】
したがって，Ｈ．２６３の場合において，符号化制御部２０５は，マクロブロックを圧縮符号化するか否かを制御するため，上記マクロブロックの“ＣＯＤ”に値を指示するための符号化パラメータを生成する。
【０１３７】
ここで，図４に示すように，顔領域情報の補正処理（Ｓ４０５）が終了し，符号化制御部２０５は，上記顔領域情報を受信すると，上記顔領域情報に含まれる顔領域の面積情報または顔領域の位置情報に基づき，オブジェクトとしてフレーム画像の顔領域の切出処理（Ｓ４０６）を実行する。
【０１３８】
さらに，図７（Ａ）及び図７（Ｂ）を参照しながら，本実施の形態にかかるオブジェクトについて説明する。図７（Ａ）は，本実施の形態にかかる初期形成時の顔領域ブロックの概略的な構造を示す説明図であり，図７（Ｂ）は，本実施の形態にかかる最終決定時の顔領域ブロックの概略的な構造を示す説明図である。
【０１３９】
図７（Ａ）および図７（Ｂ）に示す映像データのフレーム画像７０１は，３６個（６×６）のマクロブロックから構成されている。
【０１４０】
まず図７（Ａ）に示すように，符号化制御部２０５は，受信する顔領域情報に含まれる面積情報または位置情報に基づき，顔領域７００の領域を初期形成する。図７（Ａ）に示す顔領域７００は，人間の顔が全て含まれる４つのマクロブロックの範囲内に収まっている。つまり顔領域７００上から３ブロック，左から３ブロックを左上隅とする３×３マクロブロックの範囲内に収まっている。
【０１４１】
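However, because compression encoding is performed in units of macroblocks, the encoding control unit 205 corrects the face region 700, as shown in FIG. 7(B), to the face region 702, a macroblock-aligned region chosen so that the enlargement or reduction is minimal. The corrected face region 702 is then finalized as the face region.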
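One simple way to align a pixel rectangle with the macroblock grid is to take the smallest set of whole macroblocks that encloses it. The sketch below does only that: it always enlarges and never shrinks, which is a simplification of the "minimal enlargement or reduction" described in the text. The function name and the tuple layout are hypothetical.

```python
def snap_to_macroblocks(x, y, w, h, mb=16):
    """Expand a pixel rectangle to the smallest enclosing macroblock rectangle.

    Returns (first_col, first_row, n_cols, n_rows) in macroblock units,
    counted from 0 at the upper left. Requires w >= 1 and h >= 1.
    """
    col0, row0 = x // mb, y // mb
    col1 = (x + w - 1) // mb  # last column touched by the rectangle
    row1 = (y + h - 1) // mb  # last row touched by the rectangle
    return col0, row0, col1 - col0 + 1, row1 - row0 + 1


# A face of 30 x 40 pixels whose upper-left corner is at (40, 36).
print(snap_to_macroblocks(40, 36, 30, 40))
```

In this example the face falls within a 3 × 3 block of macroblocks starting at the third column and third row of the frame.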
【０１４２】
図７（Ｂ）に示す補正された顔領域７０２により，符号化制御部２０５は，顔領域７０２に属すマクロブロックと，顔領域７０２に属さないマクロブロックと，別の領域として，オブジェクト単位に切出す（Ｓ４０６）。したがって，顔領域７０２のオブジェクトに対して，量子化パラメータを小さくするなど，オブジェクトごとに圧縮符号化させるよう，符号化パラメータで指示することができる。
【０１４３】
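Furthermore, for example, the encoding control unit 205 uses the encoding parameters to instruct that "COD" be set to "0" for macroblocks belonging to the face region 702 and to "1" for macroblocks not belonging to the face region 702, so that only the face region 702 is compression-encoded and sent as transmission data over the network 105.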
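For the 6 × 6 macroblock frame image 701, the COD assignment can be pictured as a small grid of bits. The sketch assumes the face region is given in macroblock units as `(first_col, first_row, n_cols, n_rows)`; the function name `cod_map` and that layout are illustrative.

```python
def cod_map(cols, rows, face):
    """Return a rows x cols grid of COD bits: 0 = coded, 1 = skipped."""
    c0, r0, nc, nr = face  # face region in macroblock units
    return [
        [0 if (c0 <= c < c0 + nc and r0 <= r < r0 + nr) else 1
         for c in range(cols)]
        for r in range(rows)
    ]


for row in cod_map(6, 6, (2, 2, 3, 3)):
    print("".join(str(bit) for bit in row))
```

With a 3 × 3 face region, 9 of the 36 macroblocks are coded and the remaining 27 are skipped.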
【０１４４】
（３．１．３顔領域変換処理）
図２に示す特殊処理部２０４は，メモリ部２０２に格納される映像データのフレーム単位に，検出された顔領域に対して，例えばモザイク処理，または動物の画像など他の画像に置換するなどの顔領域変換処理（Ｓ４０７）を実行する。
【０１４５】
上記顔領域変換処理（Ｓ４０７）は，例えば，映像通信装置１０４に備わるモザイク処理設定ボタン及び置換処理設定ボタン（図示せず。）などにより，モザイク処理または置換処理が設定された場合，実行される。なお，本実施の形態にかかる顔領域変換処理（Ｓ４０７）は，撮影処理前に予め設定する場合，または撮影処理中に設定する場合のどちらであっても実施可能である。
【０１４６】
ここで，図８を参照しながら，本実施の形態にかかる顔領域変換処理について説明する。図８は，本実施の形態にかかる顔領域変換処理の概略を示すフローチャートである。
【０１４７】
図８に示すように，モザイク処理または置換処理からなる顔領域変換処理が設定されていると（Ｓ８０１），特殊処理部２０４は，メモリ部２０２に格納された映像データをフレーム単位に読み出し，さらに置換処理が設定されている場合には，置換するための適当な置換画像データを読み出す。
【０１４８】
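The special processing unit 204 then applies mosaic processing or replacement processing (S802) to the face region of the frame image in the video data, based on the face region information sent from the face detection block 203, and sends the frame image to the encoder unit 206.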
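Mosaic processing is commonly implemented by dividing the region into small tiles and replacing every pixel in a tile with the tile's average. The sketch below shows that approach for a single-channel image stored as a list of rows; the tile size, the use of an integer average, and the name `mosaic` are assumptions, since the text does not describe how the mosaic is computed.

```python
def mosaic(image, x, y, w, h, cell=8):
    """Return a copy of a 2-D gray image with one rectangle pixelated.

    The rectangle (x, y, w, h) must lie inside the image.
    """
    out = [row[:] for row in image]
    for ty in range(y, y + h, cell):
        for tx in range(x, x + w, cell):
            ys = range(ty, min(ty + cell, y + h))
            xs = range(tx, min(tx + cell, x + w))
            avg = sum(image[j][i] for j in ys for i in xs) // (len(ys) * len(xs))
            for j in ys:
                for i in xs:
                    out[j][i] = avg  # every pixel takes the tile average
    return out
```

Pixels outside the rectangle are left untouched, so the same routine could be applied to the background instead by passing a different rectangle.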
【０１４９】
モザイク処理または置換処理（Ｓ８０２）が終了することにより，図４に示す顔領域変換処理（Ｓ４０７）が終了する。なお，本実施の形態にかかる顔領域変換処理は，モザイク処理または置換処理から構成される場合を例にあげて説明したが，かかる例に限定されず，例えば，シャープネス処理，フレーム画像の明度を上げる明度処理などの場合であっても実施可能である。
【０１５０】
また本実施の形態にかかる顔領域変換処理は，顔領域に対してモザイク処理または置換処理が実行される場合を例にあげて説明したが，かかる例に限定されず，顔領域以外の領域に対してモザイク処理又は置換処理を実行する場合であっても実施可能である。
【０１５１】
次に，図４に示すように，特殊処理部２０４において顔領域変換処理（Ｓ４０７）が終了すると，符号化制御部２０５は，特殊処理部２０４から送出されるフレーム画像に対する符号化パラメータを生成する（Ｓ４０８）。
【０１５２】
符号化制御部２０５は，エンコーダ部２０６に，少なくとも顔領域７０２に属すマクロブロックに対する量子化パラメータの設定，顔領域７０２に属さないマクロブロックに対する量子化パラメータの設定，またはオブジェクト単位に圧縮符号化するか否かの設定などを指示するための符号化パラメータを生成する（Ｓ４０８）。
【０１５３】
さらに，上記説明したように検査部２１０により，ネットワーク１０５のトラフィックの混雑状況の検知処理（Ｓ４０９）を実行する。検知処理（Ｓ４０９）の結果，トラフィックの混雑状況が所定の閾値を超えて，検査部２１０により混雑していると判断されると（Ｓ４１０），混雑情報を生成し，符号化制御部２０５に送信する。
【０１５４】
符号化制御部２０５は，上記混雑情報を受信すると，例えば，顔領域７０２であるオブジェクトに限定して圧縮符号化させるようにエンコーダ部２０６に符号化パラメータを送信し，圧縮符号化を制御する。
【０１５５】
フレーム画像の顔領域７０２だけを圧縮符号化させるのは，上記説明の通り，顔領域７０２に属すマクロブロックの“ＣＯＤ”に“０”を設定し，顔領域７０２に属さないマクロブロックには，“ＣＯＤ”に“１”を設定することで，ネットワーク１０５には顔領域７０２にかかる伝送データが送信される。
【０１５６】
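Accordingly, in order to make the encoder unit 206 compression-encode only the object of the face region 702, the encoding control unit 205 changes (S411) the encoding parameters generated in the encoding parameter generation processing (S408) and sends the changed encoding parameters to the encoder unit 206.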
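Parameter generation (S408) and its change under congestion (S411) can be combined in one sketch that emits a (COD, QUANT) pair per macroblock. The quantizer values 4 and 24, the function name, and the flat raster-order list are hypothetical; the sketch also ignores the fact that QUANT can change by at most 2 between macroblocks through DQUANT.

```python
def encoding_parameters(cols, rows, face, congested,
                        qp_face=4, qp_background=24):
    """Return per-macroblock (cod, quant) pairs in raster order.

    face is (first_col, first_row, n_cols, n_rows) in macroblock units.
    """
    c0, r0, nc, nr = face
    params = []
    for r in range(rows):
        for c in range(cols):
            in_face = c0 <= c < c0 + nc and r0 <= r < r0 + nr
            if in_face:
                params.append((0, qp_face))        # coded at high quality
            elif congested:
                params.append((1, None))           # skipped, no quantizer
            else:
                params.append((0, qp_background))  # coded at low quality
    return params


normal = encoding_parameters(6, 6, (2, 2, 3, 3), congested=False)
busy = encoding_parameters(6, 6, (2, 2, 3, 3), congested=True)
print(normal.count((0, 24)), busy.count((1, None)))
```

Under normal traffic the background is coded coarsely; under congestion it is dropped and only the face macroblocks are sent.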
【０１５７】
上記符号化パラメータの変更処理（Ｓ４１１）により，エンコーダ部２０６の圧縮符号化するか否かを制御することが可能となり，ネットワーク１０５のトラフィックに負荷を最小限に留めることが可能となる。
【０１５８】
次に，エンコーダ部２０６は，符号化パラメータに基づき，特殊処理部２０４から送出される映像データであるフレーム画像を圧縮符号化（Ｓ４１２）し，通信部２０７に伝送データとして送出する。したがって，例えば，顔領域７０２に属すマクロブロックに対しては画質を落とさず圧縮符号化し，顔領域７０２に属さないマクロブロックに対しては画質を落として圧縮符号化させることが可能である。さらにまた，顔領域７０２に属すマクロブロックだけを圧縮符号化することも可能である。
【０１５９】
したがって，フレーム画像全体を圧縮符号化せずに，フレーム画像内の顔領域７０２に対するマクロブロックのみを切り出して圧縮符号化することが可能であり，ネットワーク１０５に送出するデータ容量を節約することが可能となり，さらに人間の顔画像の画質は落ちないため，視認性の高い映像データを表示することができる。
【０１６０】
ここで，ＭＰＥＧ−４の場合における本実施の形態にかかる圧縮符号化について説明すると，ＭＰＥＧ−４の圧縮符号化（Ｓ４１２）は，Ｈ．２６１及びＨ．２６３の圧縮符号化（Ｓ４１２）とは，エンコーダ部２０６に形状符号化部（図示せず。）およびテクスチャ符号化部（図示せず。）を備えることで実施される点で相違する。
【０１６１】
上記形状符号化部は，上記顔領域７０２であるオブジェクトの形状を符号化するために，まず符号化すべき領域を図７（Ａ）または（Ｂ）に示すフレーム画像７０１にバウンディングレクタングルを設定し，図７（Ｂ）に示すマクロブロックと同じ位置に１６×１６画素のブロック（２値形状ブロック：ＢＡＢ）を設定する。
【０１６２】
図９に示すように，形状符号化部は，符号化パラメータに基づき，２値形状ブロックを設定すると，顔領域７０２であるオブジェクトに属す２値形状ブロックは，“１”で表され，オブジェクトに属さない２値形状ブロックは，“０”で表される。図９は，本実施の形態にかかる２値形状ブロックの概略的な構成を示す説明図である。
【０１６３】
図９に示す２値形状ブロックのように，顔領域７０２であるオブジェクトの内部と外部とを区別するために，２値で表示されると，形状符号化部は，２値形状ブロックごとに当該フレーム画像７０１の形状符号化をする。
【０１６４】
また，形状符号化されるとともに，テクスチャ符号化部は，上記顔領域７０２であるオブジェクトに属すマクロブロックに対してパディング処理などを行い，テクスチャ（画素値）の圧縮符号化が行われる。形状符号化及びテクスチャ符号化されることにより，圧縮符号化処理（Ｓ４１２）が処理終了し，エンコーダ部２０６は，伝送データを通信部２０７に送出する。なお，本実施の形態にかかるテクスチャ符号化部は，オブジェクトに属さないマクロブロックに対して，圧縮符号化する場合であっても実施可能である。
【０１６５】
したがって，フレーム画像全体を圧縮符号化せずに，顔領域７０２に対するマクロブロックのみを切り出して圧縮符号化することが可能であり，ネットワーク１０５に送出するデータ容量の軽減化が図れ，人間の顔画像の画質は落ちないため，視認性の高い映像データを表示することができる。
【０１６６】
送出された伝送データは，通信部２０７により多重化され，ネットワーク１０５を介して，配信される（Ｓ４１３）。以上から構成される映像データ配信処理（Ｓ４０１〜Ｓ４１３）は，撮影処理が終了するまで継続される。
【０１６７】
なお，本実施の形態にかかる配信後の映像データの受信処理については，ネットワーク１０５を介して送信された伝送データが，通信部２０７により受信され，デコーダ部２０８により伸長されるとメモリ部２０２に順次，映像データが格納される。
【０１６８】
以後の処理については，図４に示す顔検出処理（Ｓ４０１）〜顔領域変換処理（Ｓ４０７）が行われ，映像データは，変換部２０９によりＤ／Ａ変換される。Ｄ／Ａ変換後，出力装置１０３は，映像データを表示する。なお本実施の形態にかかる映像データの受信処理の顔検出処理（Ｓ４０１）〜顔領域変換処理（Ｓ４０７）における処理は，映像データの配信処理の顔検出処理（Ｓ４０１）〜顔領域変換処理（Ｓ４０７）の処理とほぼ同様な構成であるため詳細な説明は省略する。
【０１６９】
以上，添付図面を参照しながら本発明の好適な実施形態について説明したが，本発明はかかる例に限定されない。当業者であれば，特許請求の範囲に記載された技術的思想の範疇内において各種の変更例または修正例を想定し得ることは明らかであり，それらについても当然に本発明の技術的範囲に属するものと了解される。
【０１７０】
上記実施形態においては，映像配信ユニットが複数台から構成される場合を例にあげて説明したが，本発明はかかる例に限定されない。例えば，映像配信ユニットが１台から構成される場合であっても実施することができる。この場合には，監視システムとして実施することが可能である。
【０１７１】
また，上記実施の形態においては，人間の顔領域である場合を例にあげて説明したが，本発明はかかる例に限定されない。例えば，乗用車のナンバープレートの画像などを特徴を有する領域として実施する場合であってもよい。
【０１７２】
また，上記実施の形態においては，映像データの配信処理および受信処理はフレーム単位に行われる場合を例に挙げて説明したが，本発明はかかる例に限定されない。例えば，映像データのフィールド単位，または，映像データの複数フレームから構成されるシーン単位で行われる場合でも実施可能である。
【０１７３】
また，上記実施の形態においては，映像配信ユニットは，テレビ会議に用いられる場合を例にあげて説明したが，本発明は，かかる例に限定されない。例えば，携帯電話，携帯端末，またはパソコン（ＰｅｒｓｏｎａｌＣｏｍｐｕｔｅｒ）などに用いる場合であっても実施可能である。
【０１７４】
【発明の効果】
以上説明したように，本発明によれば，複数の特徴領域が存在する場合でも過去の特徴領域の情報により的確に特徴領域を判断し，特徴領域のみ画質を落とさず切出して圧縮符号化することにより，ネットワークのトラフィックに依存せず視認性の高い画像を表示することができる。
【図面の簡単な説明】
【図１】図１は，本実施の形態にかかる双方向コミュニケーションシステムの概略的な構成を示すブロック図である。
【図２】図２は，本実施の形態にかかる映像通信装置の概略的な構成を示すブロック図である。
【図３】図３は，本実施の形態にかかる双方向コミュニケーションシステムの動作の概略を示すフローチャートである。
【図４】図４は，本実施の形態にかかる映像データ配信処理の概略を示すフローチャートである。
【図５】図５は，本実施の形態にかかる映像フォーマットの概略的な構成を示す説明図である。
【図６】図６は，本実施の形態にかかるマクロブロックのデータ構造の概略的な構成を示す説明図である。
【図７】図７（Ａ）は，本実施の形態にかかる初期形成時の顔領域ブロックの概略的な構造を示す説明図であり，図７（Ｂ）は，本実施の形態にかかる最終決定時の顔領域ブロックの概略的な構造を示す説明図である。
【図８】図８は，本実施の形態にかかる顔領域変換処理の概略を示すフローチャートである。
【図９】図９は，本実施の形態にかかる２値形状ブロックの概略的な構成を示す説明図である。
[Explanation of symbols]
101: Video distribution unit
102: Imaging device
103: Output device
104: Video communication device
105: Network
106: User
[0001]
[Technical Field of the Invention]
The present invention relates to a system capable of transmitting and receiving video data between video distribution units, and more particularly to a bidirectional communication system, a video communication device, and a video data distribution method.
[0002]
[Prior art]
In recent years, as increasingly capable and inexpensive information processing devices such as computers have become widespread, and as networks such as digital lines have moved to broadband, multimedia communication environments for exchanging data, audio, video, and the like have begun to be built out rapidly.
[0003]
A typical example of such a multimedia communication environment is a service such as a videophone or videoconferencing system (a bidirectional communication system), in which users communicate by exchanging audio and images in both directions (see, for example, Patent Document 1). Technical literature related to the present invention includes the following.
[0004]
[Patent Document 1]
JP 7-67107 A
[0005]
[Problems to be solved by the invention]
However, when video data is compression-encoded for transmission, an entire frame is in many cases compressed uniformly. To send image data carrying a large amount of information over a network with severe bandwidth limits, the image quality of the whole frame therefore had to be lowered uniformly.
[0006]
In addition, even when an attempt is made to individually detect the video data of a region having a notable feature (a feature region) that is essential for understanding the video, such as a human face in a frame, the region is in many cases not detected accurately. As a result, the feature region is also compression-encoded at reduced image quality, and video data with low visibility is displayed to the connected party over the network.
[0007]
The present invention has been made in view of the above conventional problems, and its object is to provide a new and improved bidirectional communication system that can accurately determine a region having a feature and control compression encoding according to each region.
[0008]
[Means for Solving the Problems]
In order to solve the above problems, according to a first aspect of the present invention, there is provided a bidirectional communication system in which one or more video distribution units are connected by a network, wherein: the video distribution unit comprises an imaging device that generates video data; a video communication device comprising at least a feature detection unit that detects a face region from the video data and generates face region information, an encoding control unit that generates encoding parameters based on the face region information, an encoder unit that compression-encodes the video data into transmission data based on the encoding parameters, and a decoder unit that decompresses the transmission data into the video data; and an output device that displays the video data; one video distribution unit on the sender side distributes, to another video distribution unit on the receiver side, the transmission data compression-encoded for each region of the video data, namely at least the face region and the region not belonging to the face region; the face region information includes at least one of area information of the face region, position information of the face region, and reliability information of the face region; and, when the reliability information of one face region selected from a current frame image is lower than the reliability information of another face region of the current frame image or the reliability information of a face region of a previous frame image, the encoding control unit corrects the reliability information of the selected face region to a value comparable to the reliability information of the face region of the previous frame image, or to a value equal to or greater than the reliability information of the other face region of the current frame image.
[0009]
According to the present invention, when a region having a feature that attracts the viewer's attention (feature region) is detected in captured video data between video distribution units capable of transmitting and receiving video data to and from each other, the feature region is distinguished from the region other than the feature region, and compression encoding is performed according to each region. According to this invention, for example, the quantization parameter is not uniform over the entire video data: it is made smaller for the feature region and larger for the region other than the feature region, so that the regions are treated differently during compression encoding. Therefore, when streaming video data, the data volume can be reduced for regions other than the feature region, where low image quality is acceptable, while the feature region is displayed with image quality that preserves high visibility.
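As a rough illustration of region-dependent quantization, the sketch below assigns a quantization parameter (QP) to each macroblock depending on whether it overlaps the feature region. The 16 × 16 macroblock size, the function name, and the QP values 8 and 24 are assumptions chosen for the example (both lie within the 1 to 31 quantizer range of H.263); the text does not specify them.

```python
MB = 16  # assumed macroblock size in pixels


def qp_map(width, height, face, qp_face=8, qp_other=24):
    """Return rows of per-macroblock QP values.

    face is (x, y, w, h) in pixels. Macroblocks overlapping the face
    rectangle get the smaller QP (finer quantization, higher quality).
    """
    fx, fy, fw, fh = face
    cols = (width + MB - 1) // MB
    rows = (height + MB - 1) // MB
    result = []
    for r in range(rows):
        row = []
        for c in range(cols):
            x0, y0 = c * MB, r * MB
            overlaps = (x0 < fx + fw and x0 + MB > fx and
                        y0 < fy + fh and y0 + MB > fy)
            row.append(qp_face if overlaps else qp_other)
        result.append(row)
    return result
```

For a 64 × 48 frame with a 16 × 16 face at (16, 16), only the macroblock in row 1, column 1 receives the smaller QP.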
[0010]
The video communication device can be configured to further include an encoding control unit that generates, based on the feature region information, an encoding parameter necessary for compression encoding. With this configuration, when video data is compressed and encoded, an encoding parameter can be generated that instructs the encoder unit, for example within a frame image (the frame unit of the video data), to reduce the quantization parameter for the detected face area and thereby raise its image quality, and to increase the quantization parameter for areas other than the face area and thereby lower their image quality. Note that the present invention is not limited to a frame image that is a frame unit of video data, and may use, for example, a field image that is a field unit of video data or a scene image that is a scene unit composed of a plurality of frames.
[0011]
The encoder unit can be configured to compress and encode the video data into transmission data based on the encoding parameter. According to this invention, for example, a feature region can be cut out as an object from a frame image, and can be controlled by the encoding parameter so that only the face region is compressed and encoded. Note that the present invention is not limited to a frame image, and may be a field image or a scene image, for example.
[0012]
The feature area information can be configured to be face area information including at least one of area information of the face area, position information of the face area, and reliability information of the face area. With this configuration, the macroblocks belonging to the face area can be accurately identified, based on the reliability, among the macroblocks that make up the frame image. The area information is expressed, for example, in pixel units, and the position information is expressed in XY coordinates. The feature region is not limited to the face region, and may be any region having other features.
[0013]
When the feature area information is generated from the video data, the encoding control unit can be configured to correct that feature area information based on the feature area information of video data that was compressed and encoded at least one frame or one field earlier. With this configuration, when a plurality of feature areas are detected in a frame image, the feature area information for that frame image can be corrected to appropriate values based on information such as the reliability included in feature area information detected earlier, for example one frame, one field, or one scene before. Note that the present invention is not limited to a frame image, and may use a field image or a scene image, for example.
[0014]
The video communication apparatus can be configured to further include an inspection unit that detects a network congestion state. With this configuration, by grasping the congestion status of the network, it is possible to deliver the data via the network based on the transmission data capacity that matches the congestion status. Therefore, the load on the network traffic can be minimized and the communication efficiency can be improved.
[0015]
The encoding control unit can be configured to change the encoding parameter for the feature region and the encoding parameter for the region that does not belong to the feature region according to the congestion state of the network. With this configuration, when network traffic is congested and the amount of data that can be transmitted is limited, the object in the feature region is extracted from the frame image of the video data and is compressed and encoded at high image quality, while regions other than the feature region are deleted or ignored without compression encoding. Because only the feature region, which is an indispensable element for visual recognition of the video data, is cut out and transmitted, video data with high visibility can be distributed with a small data volume. Note that the congestion state can be classified in stages by setting one or more threshold values, so that the image quality and data volume can be changed flexibly according to the level of congestion. Further, the present invention is not limited to a frame image, and may use a field image or a scene image, for example.
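The staged thresholds mentioned above might be sketched as follows. The normalized congestion value, the two threshold values, the dictionary keys, and the QP values are all assumptions for illustration; how congestion is measured is described later in the document and is not modeled here.

```python
def encoding_params(congestion, thresholds=(0.3, 0.7)):
    """Pick encoding parameters from a congestion level in [0, 1].

    Hypothetical mapping: below the first threshold both regions are
    encoded with a modest QP gap; between the thresholds the background
    is quantized more coarsely; above the second threshold only the
    feature region is encoded and the rest is dropped.
    """
    low, high = thresholds
    if congestion < low:
        return {"qp_face": 8, "qp_other": 16, "encode_other": True}
    if congestion < high:
        return {"qp_face": 8, "qp_other": 28, "encode_other": True}
    return {"qp_face": 8, "qp_other": None, "encode_other": False}
```

Adding more thresholds to the tuple and more branches would give finer steps between image quality and data volume.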
[0016]
The encoding control unit may be configured to change the encoding parameter of the video data for the feature area and the encoding parameter of the video data for the area not belonging to the feature area in units of at least a frame, a field, or a scene.
[0017]
The encoding control unit may be configured to cut out the video data relating to the feature area as a separate object. With this configuration, compression encoding can be performed only for macroblocks belonging to the feature region of the frame image. Furthermore, it is possible to control whether or not compression encoding is performed for a macroblock that does not belong to a feature region. Therefore, for example, video data can be compressed and encoded flexibly according to network traffic. Note that the present invention is not limited to a frame image, and may be a field image or a scene image, for example.
[0018]
The encoder unit can be configured to compress and encode the video data by at least the H.263 or MPEG-4 compression encoding method. The method is not limited to H.263 or MPEG-4, and ITU-T Recommendation H.261 or the like may be used.
[0019]
The video communication apparatus can be configured to further include a special processing unit that performs at least mosaic conversion of video data relating to the feature area. With this configuration, the feature region detected in the frame image can be prevented from being accurately recognized by performing a special process such as mosaic transformation or replacement with another image. Note that the present invention is not limited to a frame image, and may be a field image or a scene image, for example. Furthermore, special processing such as mosaic transformation or replacement with another image may be performed on regions other than the feature region.
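A mosaic conversion of this kind can be sketched by replacing each small block inside the feature region with its mean value. The image representation (a list of rows of gray values), the function name, and the block size are assumptions for illustration only.

```python
def mosaic(image, region, block=8):
    """Return a copy of image with region (x, y, w, h) mosaicked.

    Each block-by-block tile inside the region is replaced by the
    integer mean of its pixels; pixels outside the region are untouched.
    """
    x0, y0, w, h = region
    out = [row[:] for row in image]
    for by in range(y0, y0 + h, block):
        for bx in range(x0, x0 + w, block):
            ys = range(by, min(by + block, y0 + h))
            xs = range(bx, min(bx + block, x0 + w))
            total = sum(image[y][x] for y in ys for x in xs)
            mean = total // (len(ys) * len(xs))
            for y in ys:
                for x in xs:
                    out[y][x] = mean
    return out
```

Passing a region outside the face area instead would mosaic the background, which corresponds to the variation noted at the end of the paragraph above.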
[0020]
The video data can be configured to be at least one or both of image data and audio data.
[0021]
Furthermore, according to another aspect of the present invention, there is provided a video communication device provided in a video distribution unit comprising an imaging device that generates video data and an output device that displays the video data. The video communication device comprises: a feature detection unit that detects a face area from the video data generated by the imaging device and generates face area information; an encoding control unit that generates an encoding parameter based on the face area information; an encoder unit that compresses and encodes the video data into transmission data based on the encoding parameter; and a decoder unit that decompresses the transmission data into the video data. The face area information includes at least one of area information of the face area, position information of the face area, and reliability information of the face area. When the reliability information of one face area selected from the current frame image is lower than the reliability information of another face area of the current frame image, or lower than the reliability information of the face area of the previous frame image, the encoding control unit corrects the reliability information of the selected face area to a value comparable to the reliability information of the face area of the previous frame image, or to a value equal to or greater than the reliability information of the other face area of the current frame image.
[0022]
According to the present invention, when a region having a feature that is an indispensable element for visual recognition (feature region) is detected in captured video data between video distribution units capable of transmitting and receiving video data to and from each other, the feature region is distinguished from the region other than the feature region in consideration of the congestion state of the network, and compression encoding is performed according to each region. According to this invention, the quantization parameter is reduced for the feature region so that its image quality is higher than with ordinary compression encoding, and the quantization parameter is increased for the region other than the feature region. As a result, video data with high visibility can be displayed on the output device of the distribution destination while the data volume is kept small enough not to load the network. This video communication device has substantially the same configuration as the video communication device employed in the bidirectional communication system.
[0023]
The feature area information can be configured to be face area information including at least one of area information of the face area, position information of the face area, and reliability information of the face area. With this configuration, the macroblocks belonging to the face area can be accurately identified, based on the reliability, among the macroblocks that make up the frame image. The area information is expressed, for example, in pixel units, and the position information is expressed in XY coordinates. The feature region is not limited to the face region, and may be any region having other features.
[0024]
When the feature area information is generated from the video data, the encoding control unit may be configured to correct that feature area information based on the feature area information of video data that was compressed and encoded at least one frame earlier.
[0025]
The video communication device may be configured to further include an inspection unit that detects the congestion state of the network, and the encoding control unit may be configured to change the encoding parameter for the feature region and the encoding parameter for the region not belonging to the feature region according to the congestion state of the network.
[0026]
The encoding control unit may be configured to change the encoding parameter of the video data for the feature region and the encoding parameter of the video data for the region not belonging to the feature region in units of at least a frame, a field, or a scene of the video data.
[0027]
The encoding control unit may be configured to cut out the video data relating to the feature region as a separate object. The video data may be compressed and encoded by a compression encoding method of H.263 or MPEG-4.
[0028]
The video communication apparatus may be configured to further include a special processing unit that performs at least mosaic conversion of video data relating to the feature area.
[0029]
Furthermore, according to another aspect of the present invention, there is provided a video data distribution method for a video communication device provided in one or more video distribution units that are connected to a network and that at least generate video data and display the video data. In this video data distribution method, the video communication device generates feature area information from the video data, generates an encoding parameter based on the feature area information, and compresses and encodes the video data into transmission data based on the encoding parameter.
[0030]
The feature area information may be configured to be face area information including at least one of area information of the face area, position information of the face area, and reliability information of the face area.
[0031]
When the feature area information is generated from the video data, the video communication device may be configured to correct that feature area information based on the feature area information of video data that was compressed and encoded at least one frame earlier.
[0032]
The video communication device may be configured to further include an inspection unit that detects the congestion state of the network, and to change the encoding parameter for the feature region and the encoding parameter for the region not belonging to the feature region according to the congestion state of the network.
[0033]
The video communication device may be configured to change the encoding parameter of the video data for the feature area and the encoding parameter of the video data for the area not belonging to the feature area in units of at least a frame, a field, or a scene of the video data.
[0034]
The video communication apparatus may be configured to cut out video data relating to the feature area as a separate object. The video data may be compressed and encoded by a compression encoding method of H.263 or MPEG-4.
[0035]
The video communication device may be further configured to apply at least mosaic processing to the video data of the feature region, or to replace it with other video data.
[0036]
[Embodiments of the Invention]
Hereinafter, preferred embodiments of the invention will be described in detail with reference to the accompanying drawings. In the following description and the accompanying drawings, components having substantially the same functions and configurations are denoted by the same reference numerals, and redundant description is omitted. The feature detection unit according to the present invention corresponds to, for example, the face detection block 203 according to the present embodiment.
[0037]
(1. System configuration)
First, the bidirectional communication system according to the present embodiment will be described with reference to FIG. 1. FIG. 1 is a block diagram showing a schematic configuration of the bidirectional communication system according to the present embodiment.
[0038]
As shown in FIG. 1, in the bidirectional communication system, one or more video distribution units 101 (a, b, ..., n) are connected to a network 105.
[0039]
By means of the video distribution units 101 (a, b, ..., n), the users 106 (a, b, ..., n) can exchange images or sound with one another via the network 105 and receive services such as a video conference system.
[0040]
The video distribution unit 101 (a, b, ..., n) includes an imaging device 102 (a, b, ..., n) such as a video camera, a video communication device 104 (a, b, ..., n) that transmits and receives the video data generated by the imaging device 102, and an output device 103 (a, b, ..., n) that displays video data. Note that the video data according to the present embodiment includes at least one or both of audio data and image data.
[0041]
The imaging device 102 is a video camera capable of generating video data. For example, the imaging device 102 is a video camera for low bit rate communication applied to video conferencing, surveillance, and the like. The present embodiment can also be implemented when the imaging device 102 is a camcorder or the like that captures broadcast news programs or sporting events.
[0042]
The output device 103 is a device that can display video data, such as a TV device or a liquid crystal display device, and can output sound and images by further including a speaker.
[0043]
The video communication device 104 detects the face area, that is, the face of the user 106, from the video data generated by the imaging device 102, compresses and encodes the video data based on the face area information generated from the face area, and transmits the resulting transmission data via the network 105. It also receives transmission data sent over the network 105, decompresses it, and sends the decompressed video data to the output device 103. Furthermore, when transmitting transmission data via the network 105, it controls the transmission data according to the traffic congestion of the network 105.
[0044]
Note that the compression encoding based on the face area according to the present embodiment is performed based on at least H.263 or MPEG-4, and will be described in detail later. Detection of traffic congestion on the network 105 will also be described in detail later.
[0045]
Next, a typical operation example of this system will be described.
[0046]
When a video conference is held between users 106, for example between the user 106a and the user 106b, video data of the user 106a is generated by the imaging device 102a provided in the video distribution unit 101a, and the video data is transmitted to the video distribution unit 101b via the network 105.
[0047]
The output device 103b included in the video distribution unit 101b then displays the video data transmitted via the network 105. Likewise, video data of the user 106b is generated by the imaging device 102b, transmitted to the video distribution unit 101a via the network 105, and displayed on the output device 103a.
[0048]
Communication between the user 106a and the user 106b can thus be achieved, even between remote locations, by transmitting and receiving video data between the video distribution unit 101a and the video distribution unit 101b via the network 105.
[0049]
Although the video distribution unit 101 according to the present embodiment has been described by taking as an example the case where the imaging device 102, the output device 103, and the video communication device 104 are all provided, the present invention is not limited to this example. For example, one video distribution unit 101 may include the video communication device 104 and the output device 103, while the other video distribution unit 101 includes the imaging device 102 and the video communication device 104. In this case, the present invention can also be applied, for example, as a monitoring system that uses the imaging device 102 to monitor the license plates of passenger cars or motorcycles parked in a parking lot.
[0050]
(2 Configuration of each component of the interactive communication system)
Next, the configuration of each component of the bidirectional communication system according to the present embodiment will be described.
[0051]
(2.1 Network 105)
The network 105 connects the video communication devices 104 (a, b, ..., n) included in the video distribution units 101 (a, b, ..., n) so that they can communicate with each other. It is typically a public network such as the Internet, but may also be a closed network such as a WAN, LAN, or IP-VPN. The connection medium may be wired or wireless, and includes an optical fiber cable such as FDDI (Fiber Distributed Data Interface), a coaxial cable or twisted pair cable based on Ethernet (registered trademark), a wireless link such as IEEE 802.11b, or a satellite communication network.
[0052]
(2.2 Video distribution unit 101)
The video distribution unit 101 (a, b, ..., n) includes any one of, or any combination of, an imaging device 102 (a, b, ..., n), a video communication device 104 (a, b, ..., n) that transmits and receives video data generated by the imaging device 102, and an output device 103 (a, b, ..., n) that displays video data.
[0053]
(2.2.1 Imaging device 102)
The imaging device 102 illustrated in FIG. 1 includes an imaging unit (not illustrated) provided with at least one imaging element, a microphone unit (not illustrated) to which sound is input, and an output unit (not illustrated) that outputs video data as a video input signal to the video communication device 104.
[0054]
The imaging element photoelectrically converts an optical image received from a subject, using a plurality of pixels made up of photoelectric conversion elements arranged two-dimensionally on the light receiving surface, and outputs the result as image data. Examples of the imaging element include various types of solid-state imaging devices such as a CCD.
[0055]
The output unit generates video data based on the image data generated by the imaging unit and the audio data generated from the microphone unit, and outputs the video data to the video communication device 104 as a video input signal.
[0056]
Note that the output unit provided in the imaging device 102 according to the present embodiment outputs video data to the video communication device 104 as analog data. However, the present invention is not limited to this example, and can also be implemented when the output unit includes an A/D converter and outputs the video data as digital data.
[0057]
(2.2.2 Video communication device 104)
Next, the video communication device 104 according to the present embodiment will be described with reference to FIG. 2. FIG. 2 is a block diagram showing a schematic configuration of the video communication device according to the present embodiment.
[0058]
As shown in FIG. 2, the video communication device 104 includes: a conversion unit 201 that performs A/D conversion on video data sent from the imaging device 102; a memory unit 202 that temporarily stores the video data; a face detection block 203 that detects a face area based on the video data; a special processing unit 204 that applies at least mosaic conversion to the face area in the video data or replaces it with another image; an encoding control unit 205 that generates an encoding parameter based on at least the face area information generated from the detection result of the face detection block 203; an encoder unit 206 that compresses and encodes the video data based on the encoding parameter; a communication unit 207 that transmits and receives compression-encoded transmission data; a decoder unit 208 that decompresses transmission data received by the communication unit 207; and a conversion unit 209 that performs D/A conversion on the video data and sends it to the output device 103. The face detection block 203 and the communication unit 207 will be described in detail later. Hereinafter, the face area refers to the face area 700 or the face area 702 shown in FIG.
[0059]
(2.2.3 Output device 103)
The output device 103 displays the video data D / A converted by the conversion unit 209 as shown in FIG. Further, as described above, the output device 103 is, for example, a TV device or a liquid crystal display device, and is a device that can output sound or an image.
[0060]
The output device 103 according to the present embodiment has been described by taking the case of displaying D/A converted video data as an example. However, the present invention is not limited to this example, and can also be implemented when, for example, the video data is displayed as digital data without D/A conversion.
[0061]
(2.2.4 Face detection block 203)
Next, the face detection block 203 for detecting the face area included in the video data stored in the memory unit 202 and the face area detection process will be described with reference to FIG.
[0062]
The face detection block 203 processes the video data stored in the memory unit 202 in units of frames and detects a face area, that is, an image of a human face, from the video data. For this purpose, the face detection block 203 is provided with units that detect the face area in a plurality of steps.
[0063]
Note that the face detection block 203 according to the present embodiment has been described by taking the case of detecting a human face area as an example. However, the present invention is not limited to this example as long as the video data contains a region having features, and can also be implemented when detecting, for example, an image area such as the license plate of a passenger car, a clock, or a personal computer.
[0064]
As shown in FIG. 2, the face detection block 203 includes a resizing unit 230, a window cutout unit 231, a template matching unit 232, a preprocessing unit 233, an SVM (Support Vector Machine) identification unit 234, and a result determination unit 235.
[0065]
The resizing unit 230 reads the video data generated by the imaging device 102 from the memory unit 202 in units of frames, and converts the video data read in frame units (hereinafter referred to as a frame image) into a plurality of scale images with different reduction ratios.
[0066]
For example, when the frame image according to the present embodiment consists of NTSC (National Television System Committee) 704 × 480 pixels (horizontal × vertical), it is reduced successively by a factor of 0.8 to produce scale images in five stages (1.0, 0.8, 0.64, 0.51, and 0.41 times). In the following, the 1.0-times scale image is referred to as the first scale image, and the successively reduced images are referred to as the second to fifth scale images.
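The five reduction ratios follow from repeated multiplication by 0.8, which the short sketch below reproduces. The function name and the rounding of image dimensions to the nearest pixel are assumptions; the text gives only the ratios.

```python
def scale_pyramid(width=704, height=480, factor=0.8, levels=5):
    """Return (ratio, width, height) for each scale image.

    The ratio is factor**i rounded to two decimals, matching the
    values quoted in the text; pixel sizes are rounded to the
    nearest integer (an assumption).
    """
    sizes = []
    for i in range(levels):
        s = factor ** i
        sizes.append((round(s, 2), round(width * s), round(height * s)))
    return sizes
```

The ratios come out as 1.0, 0.8, 0.64, 0.51, and 0.41, since 0.8 cubed is 0.512 and 0.8 to the fourth power is 0.4096.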
[0067]
First, the window cutout unit 231 scans the first scale image in order from the upper left toward the lower right of the image, shifting by an appropriate number of pixels, for example two pixels, to the right or downward at each step, and sequentially cuts out rectangular areas of 20 × 20 pixels (hereinafter referred to as window images). Note that the starting point of the scan according to the present embodiment is not limited to the upper left corner of the image, and may be, for example, the upper right corner of the image.
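The raster scan described above can be sketched as a generator of window origins. The function name is hypothetical, and the sketch assumes that windows must lie entirely inside the image, which the text does not state explicitly.

```python
def window_origins(width, height, size=20, step=2):
    """Yield the (x, y) upper-left corner of each window image.

    Scans left to right, then top to bottom, moving `step` pixels at
    a time and keeping every size-by-size window fully inside the image.
    """
    for y in range(0, height - size + 1, step):
        for x in range(0, width - size + 1, step):
            yield x, y
```

Under these assumptions the 704 × 480 first scale image yields 343 × 231 = 79,233 window images, which shows why a cheap first-stage filter before the more expensive stages is worthwhile.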
[0068]
The plurality of window images cut out from the first scale image are sequentially sent out to the subsequent template matching unit 232 by the window cutout unit 231.
[0069]
The template matching unit 232 performs arithmetic processing such as a normalized correlation method or a squared error method on the window image sent from the window cutout unit 231, converting it into a function curve having a peak value. A threshold low enough that recognition performance does not deteriorate is then set for the function curve, and whether the region of the window image is a face area is determined based on that threshold.
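One common form of normalized correlation is the zero-mean variant sketched below, applied to a window image and a template flattened into equal-length sequences of gray values. Whether the embodiment uses this exact variant is not stated, and the function names and the threshold value 0.3 are assumptions.

```python
import math


def normalized_correlation(window, template):
    """Zero-mean normalized correlation of two equal-length sequences.

    Returns a value in [-1, 1]; 0.0 is returned when either input has
    no variation, because the correlation is undefined in that case.
    """
    n = len(window)
    mw = sum(window) / n
    mt = sum(template) / n
    num = sum((w - mw) * (t - mt) for w, t in zip(window, template))
    den = math.sqrt(sum((w - mw) ** 2 for w in window) *
                    sum((t - mt) ** 2 for t in template))
    return num / den if den else 0.0


def is_face_candidate(window, template, threshold=0.3):
    """Keep the window if its correlation reaches a deliberately low threshold."""
    return normalized_correlation(window, template) >= threshold
```

Keeping the threshold low means few true faces are rejected at this stage, at the cost of passing some non-face windows on to the later stages.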
[0070]
In the template matching unit 232, for example, an average human face area generated from an average of about 100 human face images is registered as template data in advance.
[0071]
To determine whether the region of a window image is a face area, the average face area is registered in the template matching unit 232 as template data, a threshold value is set as the criterion for the determination, and the window image is judged by simple matching against the average face area serving as the template data.
[0072]
The template matching unit 232 performs a matching process using the template data on the window image sent out by the window cutout unit 231. When the window image matches the template data and is determined to be a face area, it is sent to the subsequent preprocessing unit 233 as a score image (a window image determined to be a face area).
[0073]
If it is determined that the window image is not a face area, the window image is sent to the result determination unit 235 as it is. Note that the score image includes reliability information indicating how likely the region is to be a face area. For example, the reliability information is a score value in the range "00" to "99", and the higher the value, the more likely the region is to be a face area. The reliability information may be stored, for example, in a cache (not shown) provided in the result determination unit 235.
[0074]
Arithmetic processing such as the normalized correlation method and the squared error method described above requires only about 1/10 to 1/100 of the amount of computation needed by the subsequent preprocessing unit 233 and SVM identification unit 234. At the same time, the matching process by the template matching unit 232 can detect a window image that is a face area with a probability of 80% or more. That is, window images that are clearly not face areas can be removed at this point.
[0075]
For the score image obtained from the template matching unit 232, the preprocessing unit 233 removes the four corner areas of the rectangular score image, which correspond to background unrelated to the human face area. Using a mask with these four corners cut out, it extracts 360 pixels from the 20 × 20 pixel score image. Although the present embodiment has been described for the case of extracting 360 pixels with the four corners removed, the present invention is not limited to this example, and can be implemented, for example, without removing the four corners.
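Going from 400 to 360 pixels means the mask removes 40 pixels, or 10 per corner. The sketch below assumes a triangular cut of 4 + 3 + 2 + 1 pixels at each corner; the actual mask shape is not given in the text, and the function names are hypothetical.

```python
def corner_mask(size=20, cut=4):
    """Return a size-by-size mask; True marks a pixel that is kept.

    A triangular patch with legs of `cut` pixels is removed from each
    corner (an assumed shape that removes 10 pixels per corner).
    """
    mask = [[True] * size for _ in range(size)]
    last = size - 1
    for r in range(cut):
        for c in range(cut - r):
            mask[r][c] = False                # upper left
            mask[r][last - c] = False         # upper right
            mask[last - r][c] = False         # lower left
            mask[last - r][last - c] = False  # lower right
    return mask


def apply_mask(image, mask):
    """Flatten the kept pixels of image (a list of rows) into one list."""
    return [p for row, mrow in zip(image, mask)
            for p, keep in zip(row, mrow) if keep]
```

Applying the mask to a 20 × 20 score image yields a flat list of 360 gray values ready for the later correction steps.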
[0076]
Further, in order to remove the gradient in the subject caused by shading from illumination at the time of imaging, the preprocessing unit 233 corrects the gray values of the extracted 360-pixel score image using a calculation method based on, for example, the root mean square (RMS) error.
[0077]
Subsequently, the preprocessing unit 233 performs histogram smoothing on the 360-pixel score image to enhance its contrast, so that the score image can be detected regardless of the gain of the imaging device 102 or the intensity of illumination.
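Histogram smoothing of this kind is commonly implemented as histogram equalization, which the sketch below applies to a flat list of gray values. The 256-level gray range, the function name, and the specific mapping formula are assumptions; the text does not give the exact procedure.

```python
def equalize(pixels, levels=256):
    """Histogram-equalize a flat list of integer gray values.

    Each value is remapped through the cumulative histogram so that
    the output spans the full range 0 .. levels-1.
    """
    hist = [0] * levels
    for p in pixels:
        hist[p] += 1
    cdf, total = [], 0
    for h in hist:
        total += h
        cdf.append(total)
    cdf_min = next(c for c in cdf if c > 0)
    n = len(pixels)
    if n == cdf_min:
        # All pixels share one value, so there is no contrast to stretch.
        return list(pixels)
    scale = (levels - 1) / (n - cdf_min)
    return [round((cdf[p] - cdf_min) * scale) for p in pixels]
```

Because the mapping depends only on the rank order of the gray values, a dim image and a bright image of the same face produce similar outputs, which is the gain and illumination independence the paragraph above refers to.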
[0078]
Furthermore, the preprocessing unit 233 performs, for example, Gabor filtering to convert the score image into vectors, and further converts the resulting vector group into a single pattern vector. Note that the type of filter used in the Gabor filtering can be changed as necessary.
[0079]
The SVM identification unit 234 detects a face area from the score image obtained as a pattern vector from the preprocessing unit 233. If detected, it is output as face area detection data. If it is not detected, it is added as face area undetected data and further learning is performed.
[0080]
The SVM identification unit 234 determines, for the pattern vector generated from the score image sent by the preprocessing unit 233, whether a face area exists in the score image. When a face area is detected, face area information is stored for each score image in, for example, a cache (not shown) provided in the result determination unit 235. The face area information includes the upper left position (coordinate position) of the face area in the score image, the area of the face area (number of pixels, vertical × horizontal), reliability information indicating the likelihood of being a face area, and the reduction ratio of the scale image from which the score image originates (one of the reduction ratios corresponding to the first to fifth scale images). Note that the position (starting point) of the face area according to the present embodiment is not limited to the upper left of the image, and may be, for example, the upper right of the image.
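The items stored per score image can be pictured as a small record, sketched below. The class name, field names, and the helper that maps a detection back to original frame coordinates by dividing by the reduction ratio are assumptions for illustration; the text lists the stored items but not how they are later used.

```python
from dataclasses import dataclass


@dataclass
class FaceAreaInfo:
    """Hypothetical record of one detected face area."""
    x: int            # upper-left x in the scale image
    y: int            # upper-left y in the scale image
    width: int        # horizontal size in pixels
    height: int       # vertical size in pixels
    reliability: int  # likelihood score, e.g. 0 to 99
    scale: float      # reduction ratio of the originating scale image

    def in_frame_coordinates(self):
        """Map (x, y, width, height) back to the unreduced frame image."""
        return (round(self.x / self.scale), round(self.y / self.scale),
                round(self.width / self.scale), round(self.height / self.scale))
```

Storing the reduction ratio alongside the position is what allows detections from different scale images to be compared in a common coordinate system.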
[0081]
For example, when the SVM identification unit 234 finishes detecting the face area for the first window image in the first scale image, the window image scanned next in the first scale image by the window cutout unit 231 is sent to the template matching unit 232.
[0082]
Next, the template matching unit 232 sends the score image to the preprocessing unit 233 only when the window image matches the template data. The preprocessing unit 233 converts the score image into a pattern vector and sends the pattern vector to the SVM identification unit 234. When detecting the face area based on the pattern vector, the SVM identifying unit 234 generates face area information related to the scale image and stores the face area information in the cache provided in the result determining unit 235.
[0083]
As described above, the subsequent processing by the template matching unit 232, the preprocessing unit 233, and the SVM identification unit 234 is executed for each window image that the window cutout unit 231 sequentially scans from the first scale image, so that a plurality of score images containing face areas can be detected from the first scale image.
[0084]
Further, when all the scans of the first scale image by the window cutout unit 231 are completed and the subsequent processes by the template matching unit 232, the preprocessing unit 233, and the SVM identification unit 234 are also completed, the second scale image is obtained. Each process for detecting a face area is executed in substantially the same manner as the first scale image described above. For the third to fifth scale images, face area detection processing is executed in substantially the same manner as the first scale image.
[0085]
For the first to fifth scale images, which are derived at five different reduction ratios from a frame image of the video data read from the memory unit 202, the SVM identification unit 234 obtains the score images in which face areas have been detected, and stores the resulting pieces of face area information in a cache (not shown) provided in the result determination unit 235. Although the cache according to the present embodiment has been described as being provided in the result determination unit 235, it is not limited to this example; for example, the cache may be provided independently in the face detection block 203. Furthermore, there are cases where no face area is detected and no score image is obtained at all; the face detection process continues as long as a predetermined number of score images, for example at least one, is obtained.
[0086]
Because the window cutout unit 231 scans while shifting by a predetermined number of pixels (for example, two pixels), the score images in which face areas are detected in the first to fifth scale images are highly correlated with their neighbors and in many cases overlap one another.
[0087]
The result determination unit 235 determines whether or not there is an overlap based on the position of the two score images, the number of pixels of the score image, and a predetermined mathematical expression in order to remove the overlapping region.
[0088]
For example, let the upper-left corners of the two score images be $(X_A, Y_A)$ and $(X_B, Y_B)$, let the pixel counts of the score images (vertical × horizontal) be $H_A \times L_A$ and $H_B \times L_B$, and let $dX = X_B - X_A$ and $dY = Y_B - Y_A$. The two score images are determined to overlap when the following expressions (1) and (2) are satisfied simultaneously.
[0089]
$$(L_A - dX) \times (L_B + dX) > 0 \tag{1}$$
[0090]
$$(H_A - dY) \times (H_B + dY) > 0 \tag{2}$$
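Expressions (1) and (2) translate directly into code. The sketch below is illustrative only; the function name `overlaps` and its argument order are hypothetical, while the symbols follow the definitions in the text (upper-left corner, vertical size H, horizontal size L).

```python
def overlaps(x_a, y_a, h_a, l_a, x_b, y_b, h_b, l_b):
    """Return True when score images A and B overlap per expressions (1) and (2)."""
    d_x = x_b - x_a
    d_y = y_b - y_a
    horizontal = (l_a - d_x) * (l_b + d_x) > 0  # expression (1)
    vertical = (h_a - d_y) * (h_b + d_y) > 0    # expression (2)
    return horizontal and vertical
```

For positive sizes, expression (1) holds exactly when −L_B < dX < L_A, so images that merely touch along an edge are not counted as overlapping.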
[0091]
Based on the determination result, the result determination unit 235 removes the overlapping areas from the plurality of score images to obtain final, non-overlapping face areas, generates the finalized face area information, and updates the face area information stored in the cache. Although the present embodiment has been described taking as an example the case where the stored face area information is updated to the finalized face area information, the invention is not limited to this example and can also be implemented when the finalized face area information is stored as new information.
[0092]
When there is an overlapping area, the result determination unit 235 uses the reliability information corresponding to the score images stored in the cache (not shown) to generate face area information for the score image with the higher reliability, that is, the higher probability of being a face area, and updates the face area information stored in the cache to this highly reliable face area information.
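One way to picture this overlap removal is a greedy pass that keeps the more reliable of any two overlapping candidates. This is a sketch under assumptions: the `FaceArea` record, the function names, and the greedy ordering are hypothetical and are not taken from the embodiment; only the overlap test and the preference for higher reliability come from the text.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class FaceArea:
    x: int              # upper-left corner, horizontal
    y: int              # upper-left corner, vertical
    height: int         # vertical pixel count (H)
    width: int          # horizontal pixel count (L)
    reliability: float  # likelihood of being a face area


def overlaps(a, b):
    """Overlap test following expressions (1) and (2)."""
    d_x = b.x - a.x
    d_y = b.y - a.y
    return ((a.width - d_x) * (b.width + d_x) > 0
            and (a.height - d_y) * (b.height + d_y) > 0)


def remove_overlaps(candidates):
    """Keep only the most reliable area among mutually overlapping candidates."""
    kept = []
    # Visit candidates from most to least reliable (an assumed ordering).
    for cand in sorted(candidates, key=lambda c: c.reliability, reverse=True):
        if not any(overlaps(cand, k) for k in kept):
            kept.append(cand)
    return kept
```

With two nearly coincident detections and one distant detection, the less reliable of the coincident pair is dropped and the distant one is retained.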
[0093]
When no face area is detected, the result determination unit 235 does not store anything in the cache, and when there are no overlapping face areas, it does not update the face area information.
[0094]
As described above, the face detection block 203 can generate face area information for a highly reliable face area from video data captured by the imaging apparatus 102. Therefore, even if a plurality of face areas are detected, for example, the face area of the user 106 can be detected more reliably.
[0095]
The generated face area information is transmitted to the encoding control unit 205 shown in FIG. 2, and an encoding parameter for compressing and encoding video data is generated based on the face area information.
[0096]
Note that the overlap determination by the result determination unit 235 according to the present embodiment has been described using the case defined by expression (1) as an example; however, it is not limited to this example and can also be implemented using other mathematical expressions.
[0097]
Further, positions of images, including the scale images, have been described in the present embodiment using the upper-left corner as the reference; however, the invention is not limited to this example and can also be implemented with other positions as the reference.
[0098]
Further, although the present embodiment has been described taking as an example the case where the video data is read in units of frames and the face area is detected for each frame, the invention is not limited to this example and can also be implemented when face area detection is performed in units of fields or for each scene composed of a plurality of frames.
[0099]
Further, the template data registered in the template matching unit 232 according to the present embodiment has been described taking as an example the case where a face area representing an average human face is registered, but it is not limited to this example. For example, the invention can also be implemented when an image area of a license plate, a clock, or the face of an animal such as a pet is registered as template data.
[0100]
(2.2.5 Communication unit 207)
Next, the communication unit 207 according to the present embodiment will be described. The communication unit 207 is connected to the network 105, and transmits or receives transmission data that has been compression-encoded via the network 105.
[0101]
The communication unit 207 includes an inspection unit 210 that detects the traffic congestion status of the network 105. To detect the congestion status, the inspection unit 210 issues, for example at predetermined time intervals, an operation check request (echo request) by ICMP using “ping” to the connection destination video communication apparatus 104 or to an arbitrary host.
[0102]
The inspection unit 210 sets at least the address information of the connection partner with the ping command and transmits an ICMP packet. On receiving the ICMP packet, the connection partner, for example a host, returns a response (Reply) packet indicating normal reception to the inspection unit 210 that issued the ping command. If the connection partner does not receive the packet normally (or if the ICMP packet is not received within the time limit), an error results.
[0103]
The inspection unit 210 therefore measures the time from transmission of the ICMP packet until reception of the response packet, and detects the traffic congestion status from it. For example, suppose that when traffic on the network 105 is normal the communication speed is 128 KBytes/sec and the time until the response packet is received (hereinafter, the response time) is 40 msec. If an inspection by the inspection unit 210 at some point finds a response time of 80 msec, the inspection unit 210 determines that traffic on the network 105 is congested.
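The congestion decision can be sketched as a comparison of the measured response time against the normal response time. The text gives only the 40 msec and 80 msec example, so the ratio threshold of 1.5 below, the function name `is_congested`, and the treatment of a missing reply as congestion are all assumptions made for illustration.

```python
def is_congested(response_ms, normal_ms, ratio_threshold=1.5):
    """Decide whether the network looks congested from an echo response time.

    response_ms: measured response time in msec, or None if no reply arrived
                 within the time limit (treated as congested; an assumption).
    normal_ms:   response time observed under normal traffic.
    ratio_threshold: hypothetical factor; the text does not specify one.
    """
    if response_ms is None:
        return True
    return response_ms >= normal_ms * ratio_threshold
```

With the figures from the text, a normal response time of 40 msec and a measured 80 msec gives a ratio of 2, which this sketch classifies as congested.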
[0104]
When detecting the traffic congestion of the network 105, the inspection unit 210 generates congestion information and transmits the congestion information to the encoding control unit 205. The congestion information is data indicating the traffic congestion status of the network 105, and includes information such as response time, for example.
[0105]
When receiving the congestion information, the encoding control unit 205 sets an encoding parameter in order to control the compression encoding of video data in units of macroblocks according to the traffic congestion state of the network 105. Examples of such encoding parameters include control such that video data of a plurality of frames within a predetermined time is not compression-encoded and no transmission data is transmitted, and control such that, for video data of a plurality of frames within a predetermined time, only the macroblocks belonging to the face area are compression-encoded and transmitted as transmission data. The macroblock referred to below is the MB 503 shown in FIG. 5. The macroblock will be described in detail later.
[0106]
Although the inspection unit 210 according to the present embodiment has been described taking as an example the case where the traffic congestion status is detected by ICMP (Internet Control Message Protocol), the invention is not limited to this example. For example, it can also be implemented when the congestion status is detected by acquiring, in the retransmission processing of data (segments) in TCP (Transmission Control Protocol), the time (RTT: Round Trip Time) until an acknowledgment is returned from the connection partner, or by a change in the window size (the size of data that can be received) with which the connection destination, for example a host, can transmit and receive data.
[0107]
(3. Operation of interactive communication system)
Next, an embodiment of the operation of the bidirectional communication system configured as described above will be described with reference to FIG. 3. FIG. 3 is a flowchart showing an outline of the operation of the bidirectional communication system according to the present embodiment.
[0108]
As shown in FIG. 3, in the bidirectional communication system according to the present embodiment, when a plurality of users 106 hold a meeting by, for example, video conference, the video distribution units 101 constantly exchange video data with one another throughout the meeting, and the operation of the system continues.
[0109]
Therefore, the video data distribution process (S301) continues (distribution loop) between the video distribution units 101 until the meeting time ends (the shooting process ends).
[0110]
(3.1 Video Data Distribution Processing from Video Distribution Unit 101)
Next, the video data distribution processing according to the present embodiment will be described with reference to FIG. 4. FIG. 4 is a flowchart showing an outline of the video data distribution processing according to the present embodiment. The following describes the video data distribution process for the case of ITU-T Recommendation H.263, but the description applies equally to MPEG-4.
[0111]
In the video data distribution process (S301), when video data is generated by the imaging process of the imaging apparatus 102, the video data is sent to the conversion unit 201 of the video communication apparatus 104 via, for example, RS-232C or RS-422.
[0112]
The conversion unit 201 A/D-converts the video data and sends it to the memory unit 202. When the video data is sent to the memory unit 202, a face detection process (S401) is performed by the face detection block 203 as shown in FIG. 4. Note that the face detection process according to the present embodiment is substantially the same as that described above, so its description is omitted.
[0113]
The face detection process (S401) is performed in units of frames of the video data sent to the memory unit 202, but is not limited to this example, and may be performed in units of fields. If no face area exists in the frame image (picture), which is video data in units of frames, and therefore none is detected (S402), the face detection process (S401) is performed again.
[0114]
As a result of the face detection process (S401), when a face area is detected (S402), the face area information stored in the cache of the result determination unit 235 provided in the video communication device 104 is transmitted to the encoding control unit 205. (S403).
[0115]
When the encoding control unit 205 receives the face area information, it acquires the face area information for the frame image at least one frame earlier, stored in a storage unit (not shown) provided in the encoding control unit 205. The frame image concerned is not limited to one frame earlier, and may be, for example, a plurality of frames earlier or one field earlier.
[0116]
When the face area information for the frame image one frame earlier (the previous frame image) is stored, the face area information of the received current frame image is compared with the face area information for the previous frame image, and a correction process is performed (S405).
[0117]
When the face area information for the previous frame image is not stored in the storage unit, that is, when no face area was detected in the previous frame image (S404), the face area information correction process (S405) is not executed.
[0118]
The correction process (S405) corrects the face area information for the current frame image by comparing at least one of the area information, the position information, and the reliability information between the face area information for the previous frame image and that for the current frame image.
[0119]
In the correction process (S405) according to the present embodiment, suppose, for example, that only one face area exists in the previous frame image while two face areas exist in the current frame image. To select, in the current frame image, the face area that was detected in the previous frame image, the face area information must be judged accurately so that the face area information corresponding to the previous frame image can be picked out from the current frame image.
[0120]
Since the time difference between the previous frame image and the current frame image is extremely short and the range within which the human body can move within the frame image is extremely limited, the encoding control unit 205 determines the area information and position information of the face area information. Based on the above, the face area information of the face area existing in the vicinity of the face area related to the previous frame image is selected from the face areas related to the current frame image.
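The neighborhood-based selection can be sketched as picking the current-frame face area whose center is closest to the center of the previous-frame face area. The `FaceArea` tuple, the function names, and the use of center-to-center distance are hypothetical choices; the text only says that area and position information are used to find the face area in the vicinity.

```python
from collections import namedtuple

# Hypothetical record: upper-left corner, size in pixels, reliability.
FaceArea = namedtuple("FaceArea", "x y width height reliability")


def center(area):
    """Center point derived from the position and size of a face area."""
    return (area.x + area.width / 2, area.y + area.height / 2)


def select_nearest(previous, current_areas):
    """Pick the current-frame face area closest to the previous-frame one."""
    px, py = center(previous)

    def squared_distance(area):
        cx, cy = center(area)
        return (cx - px) ** 2 + (cy - py) ** 2

    return min(current_areas, key=squared_distance)
```

Because a person can move only a short distance between consecutive frames, the nearest candidate is a reasonable proxy for "the same face".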
[0121]
If the reliability information of the selected face area information is lower than the reliability information of another face area in the current frame image or the reliability information of the previous frame image, it is corrected to a value comparable to the reliability information of the previous frame image, or to a value equal to or higher than the reliability information of the other face area in the current frame image (S405). Therefore, if, for example, the face area information with the highest reliability information is subsequently selected, the face area of the previous frame image can be accurately selected in the current frame image as well. Note that the correction process according to the present embodiment is not limited to this example.
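The reliability correction can be sketched as raising the selected face area's reliability to the larger of the previous-frame reliability and the highest reliability among the other current-frame face areas. This is only one policy that satisfies the condition described; the function name and the use of the maximum are assumptions.

```python
def correct_reliability(selected, others_current, previous):
    """Return the corrected reliability of the selected face area.

    selected:       reliability of the face area chosen in the current frame
    others_current: reliabilities of the other face areas in the current frame
    previous:       reliability of the face area in the previous frame
    """
    reference = max([previous, *others_current])
    # Only raise the value; a reliability that is already highest is kept.
    return max(selected, reference)
```

After this correction the tracked face has the highest reliability in the current frame, so a later step that picks the most reliable face area selects the same face as in the previous frame.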
[0122]
Based on the corrected face area information for the current frame image, the encoding control unit 205 performs an object extraction process (S406) on the face area with the highest reliability information. Note that the object extraction process according to the present embodiment is not limited to the face area with the highest reliability information. For example, the invention can also be implemented when the object extraction process (S406) is performed on all face areas regardless of reliability information, or on all face areas except the one with the lowest reliability information.
[0123]
(3.1.1 Video format)
Here, before the object extraction process (S406) is explained, the video format according to the present embodiment will be explained with reference to FIG. 5. FIG. 5 is an explanatory diagram showing a schematic configuration of a video format according to the present embodiment.
[0124]
Video data captured by the imaging apparatus 102 in the NTSC system or the PAL system is converted, in units of frame images, into a frame image such as a CIF screen, QCIF screen, or SQCIF screen defined in advance as a common format, for example in the case of H.261 or H.263 defined in ITU-T recommendations or MPEG-4 defined in ISO/IEC 14496. It is then compression-encoded into transmission data and transmitted via the network 105.
[0125]
As shown in FIG. 5, the screen 501 corresponds to one of the CIF screen, QCIF screen, or SQCIF screen, and is composed of a plurality of GOBs (502A, 502B, 502C, ...), called groups of blocks.
[0126]
For example, in the case of H.261, the CIF screen is composed of 12 GOBs 502 and the QCIF screen is composed of 3 GOBs 502.
[0127]
The GOB 502 is further composed of a plurality of MBs (503A, 503B, 503C, ...), called macroblocks (MB). Each MB 503 is composed of MB 503-1, a luminance macroblock of 16 × 16 pixels; MB 503-2, a C_B color-difference macroblock of 8 × 8 pixels; and MB 503-3, a C_R color-difference macroblock of 8 × 8 pixels. The number of MBs 503 in a GOB 502 depends on the scheme, for example H.261, H.263, or MPEG-4. In the case of H.261, one GOB 502 is composed of 33 MBs 503.
[0128]
The MB 503 is further composed of blocks, the minimum unit, each consisting of 8 × 8 pixels. One MB 503 therefore comprises four luminance blocks (504A, 504B, 504C, 504D) and two color-difference blocks (504E, 504F), one for C_B and one for C_R.
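The GOB and macroblock counts quoted for H.261 can be checked with a little arithmetic. The sketch below assumes the standard CIF (352 × 288) and QCIF (176 × 144) luminance resolutions, which are well known but are not stated in the text; the function names are hypothetical.

```python
MB_SIZE = 16            # luminance macroblock is 16 x 16 pixels
MBS_PER_GOB_H261 = 33   # one H.261 GOB holds 33 macroblocks (from the text)


def macroblock_grid(width, height):
    """Number of macroblock columns and rows for a luminance frame size."""
    return width // MB_SIZE, height // MB_SIZE


def gob_count_h261(width, height):
    """Number of H.261 GOBs needed to cover the frame."""
    cols, rows = macroblock_grid(width, height)
    return cols * rows // MBS_PER_GOB_H261


# Assumed standard resolutions (not given in the text).
CIF = (352, 288)
QCIF = (176, 144)
```

CIF yields a 22 × 18 grid, that is 396 macroblocks or 12 GOBs, and QCIF yields an 11 × 9 grid, that is 99 macroblocks or 3 GOBs, matching the counts given for H.261.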
[0129]
(3.1.2 Macroblock data structure)
Next, the data structure of the macroblock according to the present embodiment will be described with reference to FIG. 6. FIG. 6 is an explanatory diagram showing a schematic configuration of the data structure of the macroblock according to the present embodiment.
[0130]
As shown in FIG. 6, the data structure of a macroblock consists of a macroblock header and block data. The macroblock header includes “COD”, “MCBPC”, “MODB”, “CBPB”, “CBPY”, “DQUANT”, “MVD”, “MVD₂”, “MVD₃”, “MVD₄”, and “MVDB”.
[0131]
The data structure of the macroblock according to the present embodiment has been described taking the data structure according to H.263 as an example. However, the present invention is not limited to this example, and the data structure may be, for example, that of H.261 or MPEG-4.
[0132]
The “DQUANT” is 2-bit or variable-length data, and defines a change in QUANT. QUANT is a quantization parameter for the macroblock, and can take a value in the range of 1 to 31. QUANT is set to an arbitrary value in advance.
[0133]
Since “DQUANT” represents a difference value, when “DQUANT” is “00” in binary notation the difference value is “−1”, when it is “01” the difference value is “−2”, when it is “10” the difference value is “1”, and when it is “11” the difference value is “2”.
[0134]
When the difference value of “DQUANT” changes, the value of QUANT changes, but when the quantization parameter QUANT increases, the image quality of the corresponding macroblock decreases, resulting in an image that is blurry and lacks detail. When QUANT is reduced, the image quality is improved, and even if compression coding is performed, the image is almost in the state of the original image. That is, by controlling the change of “DQUANT” for each macroblock, it is possible to control the image quality of an arbitrary area of the video data. The change in “DQUANT” is controlled based on the encoding parameter generated by the encoding control unit 205.
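The DQUANT mapping and its effect on QUANT can be written as a small lookup. The table values and the 1 to 31 range come from the text; the function name and the clamping of the result to that range are assumptions made for this sketch.

```python
# Two-bit DQUANT code -> differential value applied to QUANT (from the text).
DQUANT_DELTA = {"00": -1, "01": -2, "10": 1, "11": 2}


def apply_dquant(quant, dquant_bits):
    """Return the new QUANT after applying a DQUANT code.

    The result is clamped to the valid range 1..31 (an assumption here).
    """
    return max(1, min(31, quant + DQUANT_DELTA[dquant_bits]))
```

A face-area macroblock would be steered toward a smaller QUANT (codes “00” or “01”) for higher quality, and a background macroblock toward a larger QUANT (codes “10” or “11”).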
[0135]
As shown in FIG. 6, “COD” according to H.263 is a coded macroblock indication consisting of 1 bit. When “COD” is “0”, it indicates a macroblock to be compression-encoded, and when it is “1”, it indicates a macroblock to be deleted or ignored without being compression-encoded.
[0136]
Therefore, in the case of H.263, the encoding control unit 205 generates an encoding parameter that specifies the value of “COD” for each macroblock in order to control whether or not the macroblock is compression-encoded.
[0137]
Here, as shown in FIG. 4, when the face area information correction process (S405) ends and the encoding control unit 205 has the face area information, it cuts out the face area of the frame image as an object, based on the area information or the position information of the face area included in the face area information (S406).
[0138]
Further, the object according to the present embodiment will be described with reference to FIGS. 7A and 7B. FIG. 7A is an explanatory diagram showing a schematic structure of a face area block at the time of initial formation according to the present embodiment, and FIG. 7B is an explanatory diagram showing a schematic structure of the face area block at the time of final determination according to the present embodiment.
[0139]
The frame image 701 of the video data shown in FIGS. 7A and 7B is composed of 36 (6 × 6) macroblocks.
[0140]
First, as shown in FIG. 7A, the encoding control unit 205 initially forms the face area 700 based on the area information or the position information included in the received face area information. The face area 700 shown in FIG. 7A lies within a range of four macroblocks that contains the entire human face; that is, within the range of 3 × 3 macroblocks whose upper-left corner is the third block from the top and the third block from the left.
[0141]
However, since compression encoding is performed in units of macroblocks, the encoding control unit 205 corrects the face area 700, as shown in FIG. 7B, to the face area 702, a macroblock-aligned area for which the ratio of enlargement or reduction of the face area 700 is minimal. The face area is thereby finally determined.
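The alignment of a pixel-level face area to macroblock boundaries can be sketched by moving each edge to the nearest macroblock boundary. This is one possible reading of "minimal enlargement or reduction"; the function name, the rounding rule, and the example coordinates are hypothetical.

```python
def snap_to_macroblocks(x, y, width, height, mb=16):
    """Align a pixel rectangle to the macroblock grid.

    Each edge moves to the nearest macroblock boundary (ties round up).
    Returns (left, top, width, height) in macroblock units, and always keeps
    at least one macroblock in each direction.
    """
    def nearest(v):
        return (v + mb // 2) // mb

    left, top = nearest(x), nearest(y)
    right = max(left + 1, nearest(x + width))
    bottom = max(top + 1, nearest(y + height))
    return left, top, right - left, bottom - top
```

For a 96 × 96 pixel frame (6 × 6 macroblocks, as in FIGS. 7A and 7B), a face rectangle at pixel position (35, 30) with size 60 × 70 would snap to the macroblock rectangle starting at column 2, row 2 and spanning 4 × 4 macroblocks.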
[0142]
Using the corrected face area 702 shown in FIG. 7B, the encoding control unit 205 cuts out, as separate objects, the macroblocks belonging to the face area 702 and the macroblocks not belonging to it, that is, the remaining area (S406). The encoding parameter can therefore direct compression encoding object by object, for example by reducing the quantization parameter for the object of the face area 702.
[0143]
Further, for example, the encoding control unit 205 uses the encoding parameter to specify that “COD” be set to “0” for the macroblocks belonging to the face area 702 and to “1” for the macroblocks not belonging to the face area 702, so that only the face area 702 is compression-encoded and transmitted as transmission data via the network 105.
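The per-macroblock “COD” assignment can be illustrated as a grid of flags. The function name and the representation of the face area as a macroblock rectangle are hypothetical; the meaning of the values 0 (encode) and 1 (skip) follows the text.

```python
def cod_map(cols, rows, face):
    """Build a rows x cols grid of COD values.

    face is (left, top, width, height) in macroblock units. Macroblocks inside
    the face area get COD = 0 (compression-encoded); all others get COD = 1
    (not encoded).
    """
    left, top, width, height = face
    return [
        [0 if (left <= c < left + width and top <= r < top + height) else 1
         for c in range(cols)]
        for r in range(rows)
    ]
```

For the 6 × 6 macroblock frame of FIG. 7B, only the macroblocks inside the face rectangle carry COD = 0, so only they contribute to the transmission data.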
[0144]
(3.1.3 Face area conversion process)
The special processing unit 204 shown in FIG. 2 performs, for example, mosaic processing or replacement with another image such as an animal image for the detected face area in units of frames of video data stored in the memory unit 202. A face area conversion process (S407) is executed.
[0145]
The face area conversion process (S407) is executed when mosaic processing or replacement processing has been selected by, for example, a mosaic processing setting button or a replacement processing setting button (not shown) provided on the video communication apparatus 104. Note that the face area conversion process (S407) according to the present embodiment can be set either in advance before the shooting process or during the shooting process.
[0146]
Here, the face area conversion processing according to the present embodiment will be described with reference to FIG. 8. FIG. 8 is a flowchart showing an outline of face area conversion processing according to the present embodiment.
[0147]
As shown in FIG. 8, when the face area conversion process including the mosaic process or the replacement process is set (S801), the special processing unit 204 reads the video data stored in the memory unit 202 in units of frames, If replacement processing is set, appropriate replacement image data for replacement is read.
[0148]
Further, the special processing unit 204 performs mosaic processing or replacement processing (S802) on the face area of the frame image in the video data based on the face area information transmitted from the face detection block 203, and sends the frame image to the encoder unit 206.
[0149]
When the mosaic process or the replacement process (S802) ends, the face area conversion process (S407) shown in FIG. 4 ends. Note that the face area conversion process according to the present embodiment has been described taking as an example the case where it consists of mosaic processing or replacement processing. However, the present invention is not limited to this example, and can also be implemented with, for example, sharpness processing or brightness processing that increases the brightness of the frame image.
[0150]
Further, although the face area conversion process according to the present embodiment has been described taking as an example the case where mosaic processing or replacement processing is performed on the face area, the present invention is not limited to this example, and can also be implemented when mosaic processing or replacement processing is applied to areas other than the face area.
[0151]
Next, as shown in FIG. 4, when the face area conversion process (S407) is completed in the special processing unit 204, the encoding control unit 205 generates an encoding parameter for the frame image sent from the special processing unit 204. (S408).
[0152]
The encoding control unit 205 generates an encoding parameter that specifies at least one of the following: the quantization parameter for the macroblocks belonging to the face area 702, the quantization parameter for the macroblocks not belonging to the face area 702, or whether compression encoding is performed in units of objects (S408).
[0153]
Further, as described above, the inspection unit 210 performs a traffic congestion state detection process (S409) for the network 105. When, as a result of the detection process (S409), the traffic congestion state exceeds a predetermined threshold value and the inspection unit 210 determines that the traffic is congested (S410), it generates the congestion information and transmits it to the encoding control unit 205.
[0154]
When receiving the congestion information, for example, the encoding control unit 205 transmits an encoding parameter to the encoder unit 206 so as to perform compression encoding only on the object that is the face region 702, and controls compression encoding.
[0155]
As described above, compression encoding of only the face area 702 of the frame image is achieved by setting “COD” to “0” for the macroblocks belonging to the face area 702 and setting “COD” to “1” for the macroblocks not belonging to it, so that only transmission data related to the face area 702 is sent to the network 105.
[0156]
Accordingly, in order to cause the encoder unit 206 to compression-encode only the object of the face area 702, the encoding control unit 205 changes the encoding parameter generated in the encoding parameter generation process (S408) (S411), and transmits the changed encoding parameter to the encoder unit 206.
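The generation (S408) and change (S411) of encoding parameters can be pictured together in one sketch. The dictionary layout, the function name, and the QUANT values 4 and 20 are hypothetical; only the idea of a smaller QUANT inside the face area, a larger QUANT outside it, and skipping non-face macroblocks under congestion comes from the text.

```python
def encoding_parameters(cols, rows, face, congested,
                        face_quant=4, other_quant=20):
    """Return a rows x cols grid of per-macroblock encoding parameters.

    face is (left, top, width, height) in macroblock units. The QUANT values
    are placeholders chosen for illustration.
    """
    left, top, width, height = face
    params = []
    for r in range(rows):
        row = []
        for c in range(cols):
            in_face = left <= c < left + width and top <= r < top + height
            if in_face:
                # Face area: always encoded, with fine quantization.
                row.append({"cod": 0, "quant": face_quant})
            elif congested:
                # Congested network: non-face macroblocks are not encoded.
                row.append({"cod": 1, "quant": None})
            else:
                # Normal traffic: encoded, with coarse quantization.
                row.append({"cod": 0, "quant": other_quant})
        params.append(row)
    return params
```

Calling the function once with `congested=False` and once with `congested=True` shows the change applied in S411: the face macroblocks are unchanged, while every other macroblock switches from coarse encoding to being skipped.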
[0157]
With the encoding parameter changing process (S411), it is possible to control whether or not the encoder 206 performs compression encoding, and it is possible to minimize the load on the traffic of the network 105.
[0158]
Next, the encoder unit 206 compresses and encodes a frame image, which is video data transmitted from the special processing unit 204, based on the encoding parameter (S412), and transmits the frame image to the communication unit 207 as transmission data. Therefore, for example, a macroblock belonging to the face area 702 can be compressed and encoded without reducing the image quality, and a macroblock not belonging to the face area 702 can be compressed and encoded with a reduced image quality. Furthermore, it is possible to compression-encode only the macroblocks belonging to the face area 702.
[0159]
Therefore, it is possible to cut out only the macro block for the face area 702 in the frame image without compressing and encoding the entire frame image, and to compress and encode it, thereby saving the data capacity to be transmitted to the network 105. Furthermore, since the image quality of the human face image does not deteriorate, video data with high visibility can be displayed.
[0160]
Here, the compression encoding according to the present embodiment in the case of MPEG-4 will be described. MPEG-4 compression encoding (S412) differs from the compression encoding (S412) of H.261 and H.263 in that the encoder unit 206 is provided with a shape encoding unit (not shown) and a texture encoding unit (not shown).
[0161]
To encode the shape of the object that is the face area 702, the shape encoding unit first sets a bounding rectangle for the region to be encoded in the frame image 701 shown in FIG. 7(A) or 7(B), and sets blocks of 16 × 16 pixels (binary shape blocks: BAB) at the same positions as the macroblocks shown in the figure.
[0162]
As shown in FIG. 9, when the shape encoding unit sets the binary shape blocks based on the encoding parameter, a binary shape block belonging to the object that is the face area 702 is represented by “1”, and a binary shape block that does not belong to it is represented by “0”. FIG. 9 is an explanatory diagram showing a schematic configuration of a binary shape block according to the present embodiment.
[0163]
In order to distinguish between the inside and the outside of the object that is the face area 702, as in the binary shape blocks shown in FIG. 9, the shape encoding unit encodes the shape of the frame image 701 for each binary shape block.
[0164]
In addition to shape encoding, the texture encoding unit compression-encodes the texture (pixel values) of the macroblocks belonging to the object that is the face area 702, performing padding and similar processing. When shape encoding and texture encoding have been performed, the compression encoding process (S412) is complete, and the encoder unit 206 sends transmission data to the communication unit 207. Note that the texture encoding unit according to the present embodiment can also be implemented so as to compression-encode macroblocks that do not belong to an object.
[0165]
Therefore, it is possible to cut out only the macro block for the face area 702 without compressing and encoding the entire frame image, and to compress and encode it, so that the data capacity to be transmitted to the network 105 can be reduced. Furthermore, since the image quality of the human face image is not degraded, video data with high visibility can be displayed.
[0166]
The transmission data sent from the encoder unit 206 is multiplexed by the communication unit 207 and distributed via the network 105 (S413). The video data distribution process (S401 to S413) configured as described above is continued until the photographing process is completed.
[0167]
As for the reception processing of the video data after distribution according to the present embodiment, the transmission data transmitted via the network 105 is received by the communication unit 207 and decompressed by the decoder unit 208, and the resulting video data is sequentially stored in the memory unit 202.
[0168]
As the subsequent processes, the face detection process (S401) to the face area conversion process (S407) shown in FIG. 4 are performed, and the video data is D/A converted by the conversion unit 209. After the D/A conversion, the output device 103 displays the video data. The face detection process (S401) to the face area conversion process (S407) in the video data reception process according to the present embodiment have substantially the same configuration as the face detection process (S401) to the face area conversion process (S407) in the video data distribution process, so a detailed description is omitted.
[0169]
The preferred embodiment of the present invention has been described above with reference to the accompanying drawings, but the present invention is not limited to this example. It is obvious that those skilled in the art can conceive of various changes or modifications within the scope of the technical idea described in the claims, and it is understood that these naturally belong to the technical scope of the present invention.
[0170]
In the above embodiment, the case where a plurality of video distribution units are configured has been described as an example, but the present invention is not limited to such an example. For example, the present invention can be implemented even when the video distribution unit is composed of one unit. In this case, it can be implemented as a monitoring system.
[0171]
In the above embodiment, the case of a human face region has been described as an example, but the present invention is not limited to such an example. For example, the present invention can be implemented even when an image of a passenger car license plate or the like is treated as the feature region.
[0172]
In the above embodiment, the case where the distribution processing and reception processing of video data are performed in units of frames has been described as an example, but the present invention is not limited to such an example. For example, the present invention can be carried out even when it is performed in a field unit of video data or a scene unit composed of a plurality of frames of video data.
[0173]
In the above-described embodiment, the video distribution unit has been described as an example used for a video conference. However, the present invention is not limited to such an example. For example, the present invention can be implemented even when it is used for a mobile phone, a mobile terminal, a personal computer, or the like.
[0174]
[Effect of the invention]
As described above, according to the present invention, even when there are a plurality of feature areas, it is possible to accurately determine a feature area based on past feature area information, and to extract and compress only the feature area without reducing the image quality. Thus, an image with high visibility can be displayed without depending on network traffic.
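The use of past feature area information mentioned above corresponds to the reliability correction recited in the claims, which can be sketched as follows. This is one possible reading under stated assumptions, not the implementation of the embodiment: the function name `correct_reliability`, the 0-to-1 reliability scale, and the choice to raise the value to the highest of the compared reliabilities are all hypothetical.

```python
def correct_reliability(selected, other_current=(), previous=None):
    """Return the (possibly raised) reliability of the selected face area.

    selected      -- reliability of the face area selected in the current frame
    other_current -- reliabilities of the other face areas in the current frame
    previous      -- reliability of the face area in the previous frame, if any

    Assumption: when the selected reliability is lower than a compared value,
    it is raised to that value; otherwise it is left unchanged.
    """
    candidates = [selected]
    if previous is not None:
        candidates.append(previous)       # raise to the previous-frame level
    candidates.extend(other_current)      # or to at least the other areas' level
    return max(candidates)


# Illustrative values only.
print(correct_reliability(0.4, other_current=[0.7], previous=0.9))
print(correct_reliability(0.4, other_current=[0.7]))
print(correct_reliability(0.8, other_current=[0.5], previous=0.6))
```

Under this reading, a face area whose detection momentarily becomes uncertain keeps being treated as a feature area, because its reliability is not allowed to drop below that of the previous frame or of the other face areas.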
[Brief description of the drawings]
FIG. 1 is a block diagram showing a schematic configuration of a bidirectional communication system according to an embodiment.
FIG. 2 is a block diagram illustrating a schematic configuration of the video communication apparatus according to the present embodiment.
FIG. 3 is a flowchart showing an outline of the operation of the bidirectional communication system according to the present embodiment.
FIG. 4 is a flowchart showing an outline of video data distribution processing according to the present embodiment.
FIG. 5 is an explanatory diagram showing a schematic configuration of a video format according to the present embodiment.
FIG. 6 is an explanatory diagram showing a schematic configuration of a data structure of a macroblock according to the present embodiment.
FIG. 7A is an explanatory diagram showing a schematic structure of a face area block at the time of initial formation according to the present embodiment, and FIG. 7B is an explanatory diagram showing a schematic structure of the face area block at the time of final determination according to the present embodiment.
FIG. 8 is a flowchart showing an outline of face area conversion processing according to the present embodiment.
FIG. 9 is an explanatory diagram showing a schematic configuration of a binary shape block according to the present embodiment.
[Explanation of symbols]
101: Video distribution unit
102: Imaging device
103: Output device
104: Video communication apparatus
105: Network
106: User

Claims

A bidirectional communication system in which one or more video distribution units are connected by a network, wherein:
each video distribution unit comprises an imaging device that generates video data;
a video communication apparatus comprising at least a feature detection unit that detects a face area from the video data and generates face area information, an encoding control unit that generates an encoding parameter based on the face area information, an encoder unit that compresses and encodes the video data into transmission data based on the encoding parameter, and a decoder unit that expands the transmission data into the video data; and
an output device that displays the video data;
one video distribution unit on the sender side distributes, to another video distribution unit on the receiver side, the transmission data compressed and encoded for each of at least the face area and the area not belonging to the face area in the video data;
the face area information includes at least one of area information of the face area, position information of the face area, or reliability information of the face area; and
when the reliability information of one face area selected from the current frame image is lower than the reliability information of another face area of the current frame image or the reliability information of a face area of the previous frame image, the encoding control unit corrects the reliability information of the selected one face area to a value comparable to the reliability information of the face area of the previous frame image, or to a value equal to or greater than the reliability information of the other face area of the current frame image.

The bidirectional communication system according to claim 1, wherein, when the face area information is generated from the video data, the encoding control unit corrects the face area information based on the face area information of video data that was compressed and encoded at least one frame before the video data.

The bidirectional communication system according to claim 1, wherein the video communication apparatus further includes an inspection unit that detects a congestion state of the network.

The bidirectional communication system according to claim 1, wherein the encoding control unit changes the encoding parameter relating to the face area and the encoding parameter relating to the area not belonging to the face area according to the congestion state of the network.

The bidirectional communication system according to claim 1, wherein the encoding control unit cuts out video data concerning the face area as a separate object.

The bidirectional communication system according to claim 1, wherein the video communication apparatus further includes a special processing unit that performs at least mosaic conversion of video data relating to the face area.

A video communication apparatus provided in a video distribution unit that includes an imaging device that generates video data and an output device that displays the video data, the video communication apparatus comprising:
a feature detection unit that detects a face area from the video data generated by the imaging device and generates face area information;
an encoding control unit that generates an encoding parameter based on the face area information;
an encoder unit that compresses and encodes the video data into transmission data based on the encoding parameter; and
a decoder unit that expands the transmission data into the video data,
wherein:
the face area information includes at least one of area information of the face area, position information of the face area, or reliability information of the face area; and
when the reliability information of one face area selected from the current frame image is lower than the reliability information of another face area of the current frame image or the reliability information of a face area of the previous frame image, the encoding control unit corrects the reliability information of the selected one face area to a value comparable to the reliability information of the face area of the previous frame image, or to a value equal to or greater than the reliability information of the other face area of the current frame image.

The video communication apparatus according to claim 7, wherein, when the face area information is generated from the video data, the encoding control unit corrects the face area information based on the face area information of video data that was compressed and encoded at least one frame before the video data.

The video communication apparatus according to claim 7, further comprising an inspection unit that detects a congestion state of the network.

The video communication apparatus according to claim 7, wherein the encoding control unit changes the encoding parameter relating to the face area and the encoding parameter relating to the area not belonging to the face area according to the congestion state of the network.

The video communication apparatus according to claim 7, wherein the encoding control unit cuts out video data concerning the face area as a separate object.

The video communication apparatus according to claim 7, further comprising a special processing unit that performs at least mosaic conversion of video data relating to the face area.