JP4515005B2 - Electronic camera

Info

Publication number: JP4515005B2
Application number: JP2001298601A
Authority: JP
Inventors: 敬二光久
Original assignee: Olympus Corp
Current assignee: Olympus Corp
Priority date: 2001-09-27
Filing date: 2001-09-27
Publication date: 2010-07-28
Anticipated expiration: 2021-09-27
Also published as: JP2003110991A

Description

[0001]
[Technical field of the invention]
The present invention generally relates to an electronic camera having a sound recording function, and more particularly to a technique for associating a subject with sound.
[0002]
[Prior art]
In recent years, in addition to digital video cameras (DVC), electronic cameras known as digital still cameras (DSC) that have a recording function for recording sound have been developed. Such an electronic camera can reproduce a captured image (still image or moving image) together with the recorded sound, which enhances the reproduction effect of the captured image. Note that, in this document, "electronic camera" refers not only to a DSC but also to a DVC having a still image shooting function.
[0003]
As related prior art, a technique for presenting guidance information using at least one of screen display and audio output in an information processing apparatus having a dialog control technique has been proposed (see, for example, Japanese Patent Laid-Open No. 10-11248).
[0004]
Generally, an electronic camera can transfer captured images to an information processing apparatus such as a personal computer. The personal computer can display the transferred image on its display screen and can play back the recorded sound through its attached speaker.
[0005]
[Problems to be solved by the invention]
On a personal computer, while a captured image is displayed on the display screen, the recorded sound can also be played back at the moment the subject position is designated with a pointer (cursor). However, a conventional electronic camera cannot record the subject in a captured image and the recorded sound in association with each other. Audio position information can be added to an input still image on a personal computer or the like, but this usually requires advanced image processing.
[0006]
Accordingly, an object of the present invention is to provide an electronic camera that has a recording function and can generate position information associating a subject in a captured image with sound.
[0007]
[Means for Solving the Problems]
An aspect of the present invention relates to an electronic camera having a recording function, the camera being provided with a function for acquiring (calculating or estimating) position information (sound generation position information) that associates a subject (a person or an object) in a captured image with the recorded (detected) sound.
[0008]
An electronic camera according to the present invention includes: an imaging means that captures an image of a subject and converts it into image data; a sound detection means that detects sound and converts it into audio data; a position information acquisition means that acquires position information related to the position where the sound is generated; an output means that outputs the position information in association with the image data or the audio data; and a dynamic detection means that detects the dynamics of the subject using the image data. When the position information acquisition means determines, from the dynamic detection by the dynamic detection means, that a subject position has been detected, it temporarily stores the position information corresponding to that subject position as the sound generation position of the subject. When it determines that the subject position cannot be detected from the dynamic detection, it temporarily stores the subject position at approximately the center of the shooting screen as a sound generation candidate position.
[0009]
The position information acquisition means is configured, for example, to detect the dynamics of a subject (such as the movement of a person's mouth) and to acquire sound generation position information in which the detected position is estimated to be the position where the sound is generated (the sound generation position). The output means outputs the position information to a recording means that records it on a recording medium, or to a transmission means that transmits data by wireless communication or the like.
[0010]
With an electronic camera having such a configuration, position information (sound generation position information) that associates a subject with sound can be obtained together with the captured image and the recorded sound, so that processing for reproducing sound in association with the subject during image reproduction can be realized relatively easily. As a result, in addition to simple audio reproduction during image reproduction, an excellent reproduction effect can be obtained; for example, designating a subject who is in conversation can trigger reproduction of the sound of that conversation.
[0011]
[Embodiments of the invention]
Embodiments of the present invention will be described below with reference to the drawings.
[0012]
(Configuration of electronic camera)
FIG. 1 is a block diagram illustrating a main part of the electronic camera according to the embodiment. The electronic camera is assumed to be a digital still camera (DSC) having a still image and moving image shooting function, but is also applicable to a digital video camera (DVC) having a still image shooting function.
[0013]
The electronic camera is roughly divided into an imaging system, an image processing system, a control / operation system, a display / output system, and a recording system. The imaging system includes an optical lens unit 1, a lens drive circuit 2, an image sensor 3, an imaging system control circuit 4, and an A / D converter 5. The imaging system receives the image formed of the subject and converts it into digital image data (still image data and moving image data).
[0014]
The optical lens unit 1 includes a zoom and focus mechanism, and the lens drive circuit 2 performs zoom and focus control. The image sensor 3 usually has a charge coupled device (CCD) of several million pixels, and photoelectrically converts the input light (subject image) from the optical lens unit 1 into an imaging signal (including color components). The imaging system control circuit 4 executes AGC (automatic gain control) processing, CDS (correlated double sampling) processing, and the like for controlling the output timing and output level of the imaging signal from the image sensor 3. The A / D converter 5 converts the imaging signal into digital image data.
[0015]
The image processing system includes a digital signal processor (DSP) circuit 6, a motion detection circuit 7, and an SDRAM (synchronous DRAM) 8. The DSP circuit 6 executes various types of image processing (white balance correction, image compression / decompression, etc.) on image data stored in the SDRAM 8, as well as processing of audio data. The SDRAM 8 is a work memory (buffer memory) that temporarily stores, in addition to image data and audio data, information related to the sound generation position information of the embodiment (motion amount distribution, subject position information, etc.). As will be described later, the motion detection circuit 7 is a circuit that performs motion detection (moving object detection) related to the dynamics of the subject, and has a function of acquiring the motion amount distribution and the subject position information.
[0016]
The display / output system and the recording system include a DMA circuit 11, an image display unit 12 that displays a captured image on a display, and a card interface 13. The DMA circuit 11 executes data transfer between the SDRAM 8 and the card interface 13 in accordance with an instruction from the system controller 17. The image display unit 12 includes, for example, an LCD (liquid crystal display) monitor. The card interface 13 controls writing and reading of image data, audio data, sound generation position information, and the like with respect to the memory card 14 that is a recording medium. Note that the display / output system includes a device for performing data transmission in a wireless or wired communication system, a video output terminal, and the like (not shown).
[0017]
The electronic camera of the embodiment includes an audio interface 9 and a microphone (audio input device) 10 for realizing a recording function. The audio interface 9 generates sampling data of audio input from the microphone 10 and stores it as audio data in the SDRAM 8.
[0018]
The control / operation system includes a power supply unit 15, an operation unit 16, and a system controller 17. The power supply unit 15 controls power supply to each block of the electronic camera in accordance with an instruction from the system controller 17. The operation unit 16 includes a release button, switches necessary for various mode settings, and the like. The system controller 17 includes a microprocessor for controlling input / output and operation of each block of the electronic camera and its program. The system controller 17 executes sound generation position information generation processing related to the embodiment. The system controller 17 also executes control for displaying an operation screen such as mode setting on the image display unit 12 in accordance with an input operation of the operation unit 16.
[0019]
(Shooting operation)
The shooting operation of the embodiment, including the operation of generating position information (sound generation position information) for associating the subject with the sound, will now be described with reference to FIG. 1, the flowcharts of FIGS. 2 and 3, and FIGS. 4 and 5.
[0020]
First, when the power of the electronic camera is turned on, the system controller 17 starts shooting processing and sound processing (YES in step S1, S2). In the photographing process, a series of operations until the captured image data obtained by the photographing system is stored in the SDRAM 8 is executed (see step S6). In the sound processing, when sound generated from the vicinity of the subject is detected by the microphone 10, the sound interface 9 converts the sound into sound data and stores it in the SDRAM 8.
[0021]
In this embodiment, a mode called the subject position detection mode is provided. When this mode is set in the system controller 17 via the operation unit 16, position information (sound generation position information) is generated along with the shooting operation (YES in step S3). If the mode is not set, the system controller 17 executes the above-described shooting process in response to operation of the release button of the operation unit 16 (NO in step S3, S5, S6). The system controller 17 transfers the captured image from the SDRAM 8 to the DSP circuit 6 and causes the DSP circuit 6 to execute image compression processing and the like. The image data is then recorded on the memory card 14 via the card interface 13 (NO in step S7, S11). If the audio recording mode is set, the system controller 17 reads the audio data from the SDRAM 8 and records it on the memory card 14 via the DSP circuit 6 and the card interface 13 (YES in step S12, S13).
[0022]
On the other hand, when the subject position detection mode is set, the system controller 17 controls the motion detection circuit 7 to start the motion detection process (YES in step S3, S4). The system controller 17 notifies the motion detection circuit 7 of the timing at which the shot was taken in the shooting process, that is, the shooting frame (S6, YES in step S7, S8). The motion detection circuit 7 identifies the subject from the motion amount distribution of the frames and acquires the position information of the subject (step S9). The system controller 17 records the position information acquired by the motion detection circuit 7 on the memory card 14 in association with the image data and audio data (step S10).
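The description states only that the position information is recorded "in association with" the image data and audio data, without giving a storage format. As a minimal sketch of one way such an association could be held, the record below groups the three items and serializes them. The class name `ShotRecord`, its fields, the file names, and the JSON layout are all assumptions made for illustration, not part of the patent.

```python
import json
from dataclasses import asdict, dataclass
from typing import List, Optional, Tuple


@dataclass
class ShotRecord:
    """Hypothetical record tying one shot to its sound generation position."""
    image_file: str                         # recorded image data
    audio_file: Optional[str]               # recorded audio data, if any
    sound_positions: List[Tuple[int, int]]  # section indices (a, b)
    position_confirmed: bool                # False when only candidates exist


def to_sidecar_json(record: ShotRecord) -> str:
    """Serialize the record; the JSON layout is an assumption of this sketch."""
    return json.dumps(asdict(record), sort_keys=True)


record = ShotRecord("shot0001.jpg", "shot0001.wav", [(3, 2)], True)
sidecar = to_sidecar_json(record)
```

A playback application could read such a record to decide which region of the displayed image should trigger audio reproduction.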
[0023]
(Operation principle of motion detection circuit)
In this embodiment, when the subject position detection mode is set, the position of the subject at the time of shooting is assumed to be the sound generation position, that is, the position where the sound originates, and this position information is recorded on the memory card 14 in association with the image data and audio data recorded at the time of shooting.
[0024]
In this case, the motion detection circuit 7 detects the dynamics of the subject, specifically the movement of the entire human body, the mouth, or the like (a moving object), and calculates the detection result (the position of the moving object) as the position information of the subject. In short, it is assumed that sound is generated by a moving subject and that the position of that subject is the sound generation position.
[0025]
The operating principle of the motion detection circuit 7 will now be described with reference to FIGS. 3 to 5.
[0026]
The motion detection circuit 7 starts operating in response to an instruction from the system controller 17 and acquires frames from the SDRAM 8 (steps S20 and S21). Specifically, it acquires the luminance data of each image frame. Once it has acquired luminance data for at least two frames, the motion detection circuit 7 calculates the motion amount distribution (Mn) 42 of the subject using the luminance data of the latest frame (Fn) 40 and the immediately preceding frame (Fn-1) 41, as shown in FIG. 4 (step S23). This motion amount distribution (Mn) 42 is the set of inter-frame absolute differences that exceed a predetermined threshold (Mf).
[0027]
The motion amount distribution (Mn) can be expressed by the following equation (1) as a set of coordinate values M (n, x, y) in the captured image.
[0028]
M(n, x, y) = MAX(|F(n, x, y) − F(n−1, x, y)|, Mf) ... (1)
Here, x and y are the coordinates of an arbitrary position along the X axis and Y axis of the captured image. F(n, x, y) is the luminance at coordinates (x, y) in the n-th frame. Mf is the threshold used for binarization. MAX(a, b) is a function that returns "0" when "b > a" and "1" otherwise.
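Equation (1) can be sketched as follows. This is an illustrative reading, not the circuit's actual implementation: frames are assumed to be lists of rows of luminance values indexed as `frame[y][x]`, and the function name `motion_distribution` is hypothetical. Note that, as defined, MAX(a, b) returns 1 when the difference equals Mf.

```python
def motion_distribution(curr, prev, mf):
    """Binary motion map M(n, x, y) from two luminance frames.

    Each element is 0 when mf > |curr - prev| and 1 otherwise,
    following the definition of MAX(a, b) in equation (1).
    """
    return [
        [0 if mf > abs(c - p) else 1 for c, p in zip(row_c, row_p)]
        for row_c, row_p in zip(curr, prev)
    ]


prev_frame = [[10, 10, 10], [10, 10, 10]]
curr_frame = [[10, 40, 12], [10, 10, 90]]
motion = motion_distribution(curr_frame, prev_frame, mf=20)
```

With these sample frames, only the two pixels whose luminance changed by 20 or more are marked as moving.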
[0029]
Next, when the system controller 17 indicates the shooting frame (the last frame in the time series), the motion detection circuit 7 calculates the motion amount interval distribution (Sn) for m frames (YES in step S24, S25). This motion amount interval distribution (Sn) is obtained from the binarized values: as shown in FIG. 5A, the image is divided along the X axis and the Y axis, and the number of values exceeding the threshold (Mf) is counted for each section. The motion detection circuit 7 compares the motion amount interval distributions (Sn) from the designated shooting frame back to m frames earlier, and acquires the subject position (step S26).
[0030]
The motion amount interval distribution (Sn) will now be described with reference to FIG. 5. FIG. 5A shows, for each (a, b) section of the areas obtained by dividing the motion amount distribution (Mn) by [Xarea, Yarea], the sum of M(n, x, y) in that section (defined as S(n, a, b)), plotted as a distribution. FIG. 5B shows the sections in which the value of S(n, a, b) exceeds the threshold (Ks).
[0031]
The motion amount interval distribution (Sn) can be expressed by the following equation (2) as a set of S (n, a, b) in all areas.
[0032]
S(n, a, b) = Σ(x = a[0], x = a[Xe−1], Σ(y = b[0], y = b[Ye−1], M(n, x, y))) ... (2)
Here, a and b denote an arbitrary area [a, b] obtained by dividing the image by "Xarea × Yarea". "Σ(x = 0, x = n, X(x))" means the sum of X(x) from x = 0 to x = n. Xe is the number of divisions of the captured image area in the X-axis direction (Xe = number of X-axis pixels / Xarea). Ye is the number of divisions of the captured image area in the Y-axis direction (Ye = number of Y-axis pixels / Yarea). a[0] is the 0th X coordinate in area a, and b[0] is the 0th Y coordinate in area b.
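The per-section count of equation (2) can be sketched as below. The sketch assumes that Xarea and Yarea are the section width and height in pixels, that the binary map is indexed as `motion[y][x]`, and that any leftover pixels at the right and bottom edges are ignored; the function name `section_distribution` is hypothetical.

```python
def section_distribution(motion, x_area, y_area):
    """S(n, a, b): number of motion pixels in each section.

    Returns a grid indexed as sections[b][a], where a is the section
    index along X and b is the section index along Y.
    """
    height, width = len(motion), len(motion[0])
    cols, rows = width // x_area, height // y_area
    return [
        [
            sum(
                motion[y][x]
                for y in range(b * y_area, (b + 1) * y_area)
                for x in range(a * x_area, (a + 1) * x_area)
            )
            for a in range(cols)
        ]
        for b in range(rows)
    ]


sample_motion = [
    [1, 1, 0, 0],
    [1, 0, 0, 0],
    [0, 0, 0, 1],
    [0, 0, 1, 1],
]
sections = section_distribution(sample_motion, x_area=2, y_area=2)
```

Here the 4 × 4 map is reduced to a 2 × 2 grid of counts, with motion concentrated in the top-left and bottom-right sections.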
[0033]
When the subject's movement is locally intense, that is, when the number of sections whose S(n, a, b) exceeds the threshold (Ks) is less than N as shown in FIG. 5B, the motion detection circuit 7 estimates that the subject exists in the section where S(n, a, b) is largest. If a section adjacent to that section also exceeds the threshold (Ks), that section is also treated as the subject position. Conversely, when the overall movement is intense (when the number of sections whose count S(n, a, b) exceeds the threshold (Ks) is N or more), the camera is assumed to be moving and the subject position is not specified.
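The decision rule above can be sketched as follows. Two points are assumptions of this sketch rather than statements from the description: "adjacent" is taken to mean the eight surrounding sections, and an empty result is returned when no motion at all was counted. The function name `estimate_subject_sections` is hypothetical, and the grid is indexed as `sections[b][a]`.

```python
def estimate_subject_sections(sections, ks, n_limit):
    """Return the (a, b) sections judged to contain the subject.

    Returns None when n_limit or more sections exceed ks, which the
    description treats as camera movement.
    """
    cells = {
        (a, b): s
        for b, row in enumerate(sections)
        for a, s in enumerate(row)
    }
    over = {pos for pos, s in cells.items() if s > ks}
    if len(over) >= n_limit:
        return None  # whole scene is moving: assume the camera moved
    peak = max(cells, key=cells.get)
    if cells[peak] == 0:
        return []  # assumption of this sketch: nothing moved at all
    pa, pb = peak
    neighbours = [
        pos for pos in over
        if pos != peak and abs(pos[0] - pa) <= 1 and abs(pos[1] - pb) <= 1
    ]
    return [peak] + sorted(neighbours)


local = estimate_subject_sections([[0, 0, 0], [0, 9, 5], [0, 0, 0]], ks=4, n_limit=4)
panned = estimate_subject_sections([[9, 9], [9, 9]], ks=4, n_limit=3)
```

In the first call the peak section and its above-threshold neighbour are returned; in the second, every section exceeds Ks, so no subject position is reported.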
[0034]
When the motion detection circuit 7 determines that the subject position has finally been detected for the image of the shooting frame, it temporarily stores the position information in the SDRAM 8 (YES in step S27, S28). On the other hand, when the subject position cannot finally be specified, it temporarily stores a plurality of pieces of candidate position information in the SDRAM 8 (NO in step S27, S29). Specifically, when the positions of the subject over the m frames are continuous in time series (that is, when the subject stays in the same section or in adjacent sections over the m frames), it is determined that the subject position has finally been detected. When the positions of the subject over the m frames are not continuous in time series (that is, when the subject appears in sections separated by one or more sections over the m frames), it is determined that the subject position cannot finally be detected. In this case, a plurality of candidate positions of the subject are determined from time-series factors and positional factors, such as closeness to the center of the captured image, and are temporarily stored in the SDRAM 8.
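A minimal sketch of the continuity check over m frames is given below. The description does not say how candidates are ranked, so the ordering used here (distance from the center section first, more recent frames first on ties), the comparison of consecutive frames only, and the fallback to the center section when nothing was detected are assumptions of this sketch. The names `is_adjacent` and `resolve_subject_position` are hypothetical.

```python
def is_adjacent(p, q):
    """True when two sections are identical or touch (8-neighbourhood)."""
    return abs(p[0] - q[0]) <= 1 and abs(p[1] - q[1]) <= 1


def resolve_subject_position(history, center):
    """Decide between a detected position and candidate positions.

    history: per-frame subject section (a, b) or None, oldest first,
             with the shooting frame last.
    center:  section at the center of the image.
    """
    continuous = (
        bool(history)
        and all(p is not None for p in history)
        and all(is_adjacent(p, q) for p, q in zip(history, history[1:]))
    )
    if continuous:
        return {"detected": True, "positions": [history[-1]]}
    # Remember the most recent frame index in which each section appeared.
    last_seen = {p: i for i, p in enumerate(history) if p is not None}
    candidates = sorted(
        last_seen,
        key=lambda p: (abs(p[0] - center[0]) + abs(p[1] - center[1]),
                       -last_seen[p]),
    )
    return {"detected": False, "positions": candidates or [center]}


steady = resolve_subject_position([(2, 2), (2, 3), (3, 3)], center=(2, 2))
jumpy = resolve_subject_position([(0, 0), (4, 4), (2, 3)], center=(2, 2))
```

The first history moves only between neighbouring sections, so the position in the shooting frame is accepted; the second jumps across the image, so a ranked candidate list is produced instead.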
[0035]
In short, when the subject position detection mode is set, the motion detection circuit 7 detects the dynamics of the subject, specifically the movement of the entire human body, the mouth, or the like (a moving object), and acquires the detection result (the position of the moving object) as the position information of the subject. That is, the position of a subject that is moving at the time of shooting is assumed to be the sound generation position, and this position information is recorded on the memory card 14 in association with the image data and audio data recorded at the time of shooting. Position information (sound generation position information) that associates the subject with the sound is thus obtained together with the captured image and the recorded sound, so that processing for reproducing sound in association with the subject during image reproduction can be realized relatively easily. As a result, in addition to simple audio reproduction during image reproduction, an excellent reproduction effect can be obtained; for example, designating a subject who is in conversation can trigger reproduction of the sound of that conversation.
[0036]
(Modification 1)
Under the principle of the embodiment, when the subject is, for example, a motionless person photographed from behind, the movement of the mouth or the like cannot be detected, and so the sound generation position cannot be specified. Therefore, as a modification of the embodiment, the system controller 17, without using the motion detection circuit 7, assumes that the subject captured at, for example, the center of the shooting screen is the sound generation position. The system controller 17 then records the position information of the assumed sound generation position on the memory card 14 in association with the image data and audio data recorded at the time of shooting.
[0037]
Usually, in the audio recording mode, the subject that appears to be producing the sound is often framed approximately at the center, so the center position can reasonably be presumed to be the sound generation position. In addition, if the camera has a function that allows the user to input subject position information (a designated position) from the operation unit 16, that position information may be recorded as the sound generation position information.
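The choice among a user-designated position, a motion-detected position, and the center of the screen can be sketched as below. The priority order (user input first, then motion detection, then the center) is an assumption of this sketch; the description only says that each source may be used. The function name `choose_sound_position` and the pixel-coordinate convention are hypothetical.

```python
def choose_sound_position(width, height, detected=None, user_specified=None):
    """Pick the sound generation position and report where it came from.

    Positions are (x, y) pixel coordinates. The priority order is an
    assumption of this sketch, not something the description specifies.
    """
    if user_specified is not None:
        return user_specified, "user"
    if detected is not None:
        return detected, "motion"
    return (width // 2, height // 2), "center"


fallback = choose_sound_position(640, 480)
designated = choose_sound_position(640, 480, detected=(100, 50), user_specified=(300, 200))
```

Returning the source alongside the position lets a playback application distinguish an estimated position from one the user explicitly designated.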
[0038]
(Modification 2)
FIG. 6 is a diagram for explaining a method for realizing the pseudo stereo effect at the time of sound reproduction as a second modification of the embodiment.
[0039]
The motion detection circuit 7 of the embodiment detects the position of the subject 60 over m frames along the time axis T, as shown in FIG. 6A. Using the position information of the subject 60, virtual stereo microphone positions XL and XR can be set as shown in FIG. 6B, and volume-related coefficients (KL, KR) for converting monaural sound into stereo sound can be calculated. The position XL is the left virtual microphone position, and the position XR is the right virtual microphone position.
[0040]
Specifically, the system controller 17 calculates the coefficients (KL, KR) by the following procedure. First, as shown in FIG. 6B, the distance L to the subject 60 is calculated from the focal length obtained from the imaging system. Next, the following relational expressions (3) and (4) hold for the angles α and β between the center position 61 of the subject 60 and the virtual stereo microphone positions XL and XR, and the distance M (calculated from the distance to the subject 60 and the lens specifications).
[0041]
tan(α) = L / (XL + M) ... (3)
tan(β) = L / (XR − M) ... (4)
Using the relational expression (3), a distance (referred to as LL) from the left virtual microphone position XL to the subject 60 is calculated. Further, the distance (referred to as LR) from the right virtual microphone position XR to the subject 60 is calculated using the relational expression (4). Then, a table including virtual stereo microphone positions XL and XR and distances LL and LR based on the angles α and β is created. The system controller 17 finally calculates coefficients (KL, KR) related to the sound volume for converting monaural sound to stereo sound from the table.
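The geometry of expressions (3) and (4) can be sketched as follows. The distances LL and LR follow from the right triangles implied by the two expressions (LL = L / sin α, LR = L / sin β). The description does not state how the table is converted into the coefficients (KL, KR), so the inverse-distance weighting below is purely an assumption of this sketch, as are the function names `virtual_mic_geometry` and `stereo_gains`.

```python
import math


def virtual_mic_geometry(l, m, xl, xr):
    """Angles and distances from the virtual microphones to the subject.

    l:  distance L to the subject along the optical axis
    m:  lateral offset M of the subject centre
    xl: left virtual microphone offset XL
    xr: right virtual microphone offset XR
    """
    alpha = math.atan2(l, xl + m)   # tan(alpha) = L / (XL + M)
    beta = math.atan2(l, xr - m)    # tan(beta)  = L / (XR - M)
    ll = math.hypot(l, xl + m)      # distance LL from the left microphone
    lr = math.hypot(l, xr - m)      # distance LR from the right microphone
    return alpha, beta, ll, lr


def stereo_gains(ll, lr):
    """Volume coefficients (KL, KR), normalized so that KL + KR = 1.

    Assumption of this sketch: gain is inversely proportional to distance.
    """
    wl, wr = 1.0 / ll, 1.0 / lr
    return wl / (wl + wr), wr / (wl + wr)


alpha, beta, ll, lr = virtual_mic_geometry(l=4.0, m=1.0, xl=2.0, xr=2.0)
kl, kr = stereo_gains(ll, lr)
```

In this example the subject is offset toward the right microphone, so LR is shorter than LL and the right channel receives the larger coefficient.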
[0042]
As described above, according to this modification, when sound is reproduced in association with the position of the subject, the sound is treated as if it had been recorded by virtual stereo microphones, and a pseudo stereo reproduction effect, as if the sound were reproduced in stereo, can be obtained.
[0043]
[Effects of the invention]
As described above in detail, according to the present invention, in an electronic camera having a recording function, an electronic camera having a function capable of generating position information that associates a subject of a captured image with sound can be provided. Thereby, at the time of image reproduction, it is possible to relatively easily realize the process of performing audio reproduction in association with the subject.
[Brief description of the drawings]
FIG. 1 is a block diagram showing a main part of an electronic camera related to an embodiment of the present invention.
FIG. 2 is a flowchart for explaining a photographing operation according to the embodiment.
FIG. 3 is a flowchart for explaining the operation of the motion detection circuit according to the embodiment.
FIG. 4 is a view for explaining the principle of motion detection processing according to the embodiment.
FIG. 5 is a view for explaining the principle of motion detection processing according to the embodiment.
FIG. 6 is a view for explaining pseudo stereo effect processing according to a modification of the embodiment.
[Explanation of symbols]
1 ... Optical lens unit
2 ... Lens drive circuit
3 ... Image sensor
4 ... Imaging system control circuit
5 ... A / D converter
6 ... Digital signal processor (DSP) circuit
7 ... Motion detection circuit
8 ... SDRAM
9 ... Audio interface
10 ... Microphone
11 ... DMA circuit
12 ... Image display unit
13 ... Card interface
14 ... Memory card
15 ... Power supply unit
16 ... Operation unit
17 ... System controller

Claims

1. An electronic camera comprising:
imaging means for imaging a subject and converting it into image data;
sound detection means for detecting sound and converting it into audio data;
position information acquisition means for acquiring position information related to the position where the sound is generated;
output means for outputting the position information in association with the image data or the audio data; and
dynamic detection means for detecting the dynamics of the subject using the image data,
wherein the position information acquisition means
temporarily stores, when it is determined from the dynamic detection by the dynamic detection means that the subject position has been detected, the position information corresponding to the subject position as the sound generation position of the subject, and
temporarily stores, when it is determined from the dynamic detection by the dynamic detection means that the subject position cannot be detected, the subject position at the substantially central position on the shooting screen as a sound generation candidate position.

2. The electronic camera according to claim 1, wherein the dynamic detection means is a moving object detection means that detects a moving object based on the amount of movement of the subject.

3. The electronic camera according to claim 1, further comprising means for generating audio data corresponding to a virtual sound generation position based on the audio data and the position information.