JP7217471B2 - Imaging device

Info

Publication number: JP7217471B2
Application number: JP2019222866A
Authority: JP
Inventors: 宏樹春日井
Original assignee: Panasonic Intellectual Property Management Co Ltd
Current assignee: Panasonic Intellectual Property Management Co Ltd
Priority date: 2019-03-13
Filing date: 2019-12-10
Publication date: 2023-02-03
Anticipated expiration: 2039-12-10
Also published as: JP2020156076A

Description

本開示は、音声を取得しながら撮像を行う撮像装置に関する。 The present disclosure relates to an imaging device that captures an image while acquiring sound.

撮像装置による動画の撮影時などに、特定の被写体による音声を明瞭に収音するための技術が検討されている（例えば特許文献１）。 Techniques for clearly picking up the sound of a specific subject, for example when shooting a moving image with an imaging device, have been studied (see, for example, Patent Document 1).

特許文献１は、撮像部及びマイクロフォンアレイを備えた音声識別装置を開示している。この音声識別装置は、撮像部により生成された画像データから被写体画像の特徴情報を検出すると共に、マイクロフォンアレイにより生成された音声データから音声の特徴情報を検出している。この音声識別装置は、画像データから算出される被写体の距離等と音声データから算出される音源の距離等に基づいて、マイクロフォンアレイの指向特性を調整することにより、断続的に音声を発生する音源についても良好な音声を得ることを図っている。 Patent Document 1 discloses a sound identification device comprising an imaging unit and a microphone array. This sound identification device detects feature information of a subject image from image data generated by the imaging unit, and detects feature information of sound from audio data generated by the microphone array. By adjusting the directional characteristics of the microphone array based on, for example, the distance of the subject calculated from the image data and the distance of the sound source calculated from the audio data, the device aims to obtain good sound even from a sound source that produces sound intermittently.

特開２０１０－１５４２６０号公報JP 2010-154260 A

しかしながら、撮影中のユーザは、動く被写体に対して目で追って撮影装置の向き等を変えることとなり、音声の検出結果に基づきマイクロフォンの指向性を追従させることは、精度良く行い難い。従来技術では、撮像装置において特定の被写体による音声を明瞭に得難いという問題があった。 However, during shooting, the user changes the direction of the shooting device by following the moving subject with his or her eyes, and it is difficult to accurately follow the directivity of the microphone based on the sound detection result. In the prior art, there is a problem that it is difficult to clearly obtain the sound of a specific subject in an imaging device.

本開示は、ユーザの意図に沿って被写体による音声を明瞭に得ることを行い易くすることができる撮像装置を提供する。 The present disclosure provides an imaging device that can facilitate obtaining the voice of a subject clearly according to the user's intention.

本開示の一態様に係る撮像装置は、撮像部と、音声取得部と、検出部と、音声処理部と、操作部とを備える。撮像部は、被写体像を撮像して画像データを生成する。音声取得部は、撮像部による撮像中の音声を示す音声データを取得する。検出部は、撮像部によって生成された画像データに基づいて、被写体とその種別を検出する。音声処理部は、検出部によって検出された被写体の種別に基づいて、音声取得部によって取得された音声データを処理する。操作部は、ユーザによる自装置の操作に基づいて、第１の種別および第１の種別とは異なる第２の種別を含む複数の種別の中から、音声処理部による処理の対象とする対象種別を設定する。音声処理部は、画像データにおいて対象種別の被写体が検出されたときに、取得された音声データにおいて対象種別に応じた音声を強調又は抑制するように、当該音声データを処理する。 An imaging device according to an aspect of the present disclosure includes an imaging unit, an audio acquisition unit, a detection unit, an audio processing unit, and an operation unit. The imaging unit captures a subject image and generates image data. The audio acquisition unit acquires audio data representing the sound during image capture by the imaging unit. The detection unit detects a subject and its type based on the image data generated by the imaging unit. The audio processing unit processes the audio data acquired by the audio acquisition unit based on the type of the subject detected by the detection unit. The operation unit sets, based on the user's operation of the device, a target type to be processed by the audio processing unit from among a plurality of types including a first type and a second type different from the first type. When a subject of the target type is detected in the image data, the audio processing unit processes the acquired audio data so that the sound corresponding to the target type is emphasized or suppressed in that audio data.
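The interaction between the user-set target type, the detection result, and the audio processing described above can be illustrated with a minimal Python sketch. It is not the implementation of the disclosed device: the label strings `"person"` and `"cat"`, the function `process_audio`, and the caller-supplied `emphasize` callback are all hypothetical names introduced only for illustration.

```python
from typing import Callable, Iterable, List

# Hypothetical labels standing in for the "first type" and "second type".
SUPPORTED_TYPES = ("person", "cat")


def process_audio(
    audio: List[float],
    detected_types: Iterable[str],
    target_type: str,
    emphasize: Callable[[List[float], str], List[float]],
) -> List[float]:
    """Emphasize the target type's sound only when that type is detected in the image."""
    if target_type not in SUPPORTED_TYPES:
        raise ValueError(f"unsupported target type: {target_type}")
    if target_type in set(detected_types):
        # A subject of the target type is visible: apply type-specific processing.
        return emphasize(audio, target_type)
    # No subject of the target type: pass the audio through unchanged.
    return list(audio)


def double_volume(audio: List[float], target_type: str) -> List[float]:
    """Placeholder for the real emphasis step (assumption: simple gain)."""
    return [2.0 * sample for sample in audio]


with_person = process_audio([0.1, -0.2], ["person"], "person", double_volume)
without_person = process_audio([0.1, -0.2], ["cat"], "person", double_volume)
```

In this sketch the audio is left untouched when no subject of the target type is detected, which mirrors the condition stated in the text; suppression instead of emphasis would only change the callback.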

本開示の別の態様に係る撮像装置は、撮像部と、音声取得部と、検出部と、表示部と、操作部と、音声処理部と、制御部とを備える。撮像部は、被写体像を撮像して画像データを生成する。音声取得部は、撮像部による撮像中の音声を示す音声データを取得する。検出部は、撮像部によって生成された画像データに基づいて、被写体とその種別を検出する。表示部は、画像データが示す画像を表示する。操作部は、ユーザによる自装置の操作に基づいて、検出部によって検出された被写体の中から、画像におけるフォーカス対象の被写体を選択する。音声処理部は、操作部によって選択された被写体の種別に基づいて、音声取得部によって取得された音声データを処理する。制御部は、音声処理部による処理の対象とする対象種別としてフォーカス対象の被写体の種別を示す対象種別情報を表示部に表示させる。 An imaging device according to another aspect of the present disclosure includes an imaging unit, an audio acquisition unit, a detection unit, a display unit, an operation unit, an audio processing unit, and a control unit. The imaging unit captures a subject image and generates image data. The audio acquisition unit acquires audio data representing audio being captured by the imaging unit. The detection unit detects a subject and its type based on the image data generated by the imaging unit. The display unit displays an image indicated by the image data. The operation unit selects a subject to be focused in the image from the subjects detected by the detection unit based on the user's operation of the device. The audio processing unit processes the audio data acquired by the audio acquiring unit based on the subject type selected by the operation unit. The control unit causes the display unit to display target type information indicating the type of a subject to be focused as a target type to be processed by the audio processing unit.

本開示に係る撮像装置によると、ユーザの意図に沿って被写体による音声を明瞭に得ることを行い易くすることができる。 According to the imaging device according to the present disclosure, it is possible to easily obtain the voice of the subject clearly according to the user's intention.

本開示の実施の形態１に係るデジタルカメラの構成を示す図 FIG. 1 is a diagram showing the configuration of a digital camera according to Embodiment 1 of the present disclosure.
デジタルカメラにおける音声処理エンジンの構成を示すブロック図 FIG. 2 is a block diagram showing the configuration of the audio processing engine in the digital camera.
音声処理エンジンの音声抽出部における特定の種別のデータ例を説明した図 FIG. 3 is a diagram explaining an example of data of a specific type in the audio extraction unit of the audio processing engine.
音声抽出部における、図３とは別の種別のデータ例を説明した図 FIG. 4 is a diagram explaining an example of data of a type different from that of FIG. 3 in the audio extraction unit.
実施の形態１に係るデジタルカメラの人優先モードの概要を説明するための図 FIG. 5 is a diagram for explaining an outline of the person priority mode of the digital camera according to Embodiment 1.
実施の形態１に係るデジタルカメラの動作を例示するフローチャート FIG. 6 is a flowchart illustrating the operation of the digital camera according to Embodiment 1.
デジタルカメラの人優先モードにおける「人」の移動時の動作例を説明した図 FIG. 7 is a diagram explaining an operation example when a "person" moves in the person priority mode of the digital camera.
実施の形態２に係るデジタルカメラのフォーカス優先モードの概要を説明するための図 FIG. 8 is a diagram for explaining an outline of the focus priority mode of the digital camera according to Embodiment 2.
実施の形態２に係るデジタルカメラの動作を例示するフローチャート FIG. 9 is a flowchart illustrating the operation of the digital camera according to Embodiment 2.
図９に続くデジタルカメラの動作を例示するフローチャート FIG. 10 is a flowchart illustrating the operation of the digital camera continued from FIG. 9.
フォーカス対象の被写体が移動する場合のデジタルカメラの動作例を説明した図 FIG. 11 is a diagram explaining an operation example of the digital camera when the subject to be focused moves.
フォーカス対象を変更するユーザ操作に対するデジタルカメラの動作例を説明した図 FIG. 12 is a diagram explaining an operation example of the digital camera in response to a user operation for changing the focus target.
実施の形態３に係るデジタルカメラの構成を示す図 FIG. 13 is a diagram showing the configuration of a digital camera according to Embodiment 3.
実施の形態３に係るデジタルカメラの動作を例示するフローチャート FIG. 14 is a flowchart illustrating the operation of the digital camera according to Embodiment 3.

以下、適宜図面を参照しながら、実施の形態を詳細に説明する。但し、必要以上に詳細な説明は省略する場合がある。例えば、既によく知られた事項の詳細説明や実質的に同一の構成に対する重複説明を省略する場合がある。これは、以下の説明が不必要に冗長になるのを避け、当業者の理解を容易にするためである。なお、発明者（ら）は、当業者が本開示を十分に理解するために添付図面および以下の説明を提供するのであって、これらによって特許請求の範囲に記載の主題を限定することを意図するものではない。 Hereinafter, embodiments will be described in detail with reference to the drawings as appropriate. However, unnecessarily detailed description may be omitted. For example, detailed descriptions of well-known matters and redundant descriptions of substantially identical configurations may be omitted. This is to keep the following description from becoming unnecessarily verbose and to facilitate understanding by those skilled in the art. Note that the inventor(s) provide the accompanying drawings and the following description so that those skilled in the art can fully understand the present disclosure, and do not intend them to limit the subject matter described in the claims.

（実施の形態１） (Embodiment 1)
実施の形態１では、本開示に係る撮像装置の一例として、画像認識技術と音声抽出技術とを連動させて人や動物やといった特定の種別の被写体による音声を明瞭に得るデジタルカメラについて説明する。 Embodiment 1 describes, as an example of an imaging device according to the present disclosure, a digital camera that links image recognition technology with audio extraction technology to clearly obtain the sound of a specific type of subject, such as a person or an animal.

〔１－１．構成〕 [1-1. Configuration]
実施の形態１に係るデジタルカメラの構成について、図１を用いて説明する。 The configuration of the digital camera according to Embodiment 1 will be described with reference to FIG. 1.

図１は、本実施形態に係るデジタルカメラ１００の構成を示す図である。本実施形態のデジタルカメラ１００は、イメージセンサ１１５と、画像処理エンジン１２０と、表示モニタ１３０と、コントローラ１３５とを備える。さらに、デジタルカメラ１００は、バッファメモリ１２５と、カードスロット１４０と、フラッシュメモリ１４５と、操作部１５０と、通信モジュール１５５とを備える。また、デジタルカメラ１００は、マイク１６０と、マイク用のアナログ／デジタル（Ａ／Ｄ）コンバータ１６５と、音声処理エンジン１７０とを備える。また、デジタルカメラ１００は、例えば光学系１１０及びレンズ駆動部１１２を備える。 FIG. 1 is a diagram showing the configuration of a digital camera 100 according to this embodiment. The digital camera 100 of this embodiment includes an image sensor 115 , an image processing engine 120 , a display monitor 130 and a controller 135 . Digital camera 100 further includes buffer memory 125 , card slot 140 , flash memory 145 , operation unit 150 and communication module 155 . The digital camera 100 also includes a microphone 160 , an analog/digital (A/D) converter 165 for the microphone, and an audio processing engine 170 . The digital camera 100 also includes an optical system 110 and a lens driving section 112, for example.

光学系１１０は、フォーカスレンズ、ズームレンズ、光学式手ぶれ補正レンズ（ＯＩＳ）、絞り、シャッタ等を含む。フォーカスレンズは、イメージセンサ１１５上に形成される被写体像のフォーカス状態を変化させるためのレンズである。ズームレンズは、光学系で形成される被写体像の倍率を変化させるためのレンズである。フォーカスレンズ等は、それぞれ１枚又は複数枚のレンズで構成される。 The optical system 110 includes a focus lens, a zoom lens, an optical image stabilization lens (OIS), an aperture, a shutter, and the like. The focus lens is a lens for changing the focus state of the subject image formed on the image sensor 115 . A zoom lens is a lens for changing the magnification of a subject image formed by an optical system. Each focus lens or the like is composed of one or a plurality of lenses.

レンズ駆動部１１２は、光学系１１０におけるフォーカスレンズ等を駆動する。レンズ駆動部１１２はモータを含み、コントローラ１３５の制御に基づいてフォーカスレンズを光学系１１０の光軸に沿って移動させる。レンズ駆動部１１２においてフォーカスレンズを駆動する構成は、ＤＣモータ、ステッピングモータ、サーボモータ、または超音波モータなどで実現できる。 A lens driving unit 112 drives the focus lens and the like in the optical system 110 . The lens drive unit 112 includes a motor and moves the focus lens along the optical axis of the optical system 110 under the control of the controller 135 . A configuration for driving the focus lens in the lens driving section 112 can be realized by a DC motor, a stepping motor, a servo motor, an ultrasonic motor, or the like.

イメージセンサ１１５は、光学系１１０を介して形成された被写体像を撮像して、撮像データを生成する。撮像データは、イメージセンサ１１５による撮像画像を示す画像データを構成する。イメージセンサ１１５は、所定のフレームレート（例えば、３０フレーム／秒）で新しいフレームの画像データを生成する。イメージセンサ１１５における、撮像データの生成タイミングおよび電子シャッタ動作は、コントローラ１３５によって制御される。イメージセンサ１１５は、ＣＭＯＳイメージセンサ、ＣＣＤイメージセンサ、またはＮＭＯＳイメージセンサなど、種々のイメージセンサを用いることができる。 The image sensor 115 captures the subject image formed via the optical system 110 and generates captured data. The imaging data constitute image data representing an image captured by the image sensor 115 . Image sensor 115 generates new frames of image data at a predetermined frame rate (eg, 30 frames/second). A controller 135 controls the generation timing of imaging data and the electronic shutter operation in the image sensor 115 . The image sensor 115 can use various image sensors such as a CMOS image sensor, a CCD image sensor, or an NMOS image sensor.

イメージセンサ１１５は、動画像、静止画像の撮像動作、スルー画像の撮像動作等を実行する。スルー画像は主に動画像であり、ユーザが例えば静止画像の撮像のための構図を決めるために表示モニタ１３０に表示される。スルー画像、動画像及び静止画像は、それぞれ本実施形態における撮像画像の一例である。イメージセンサ１１５は、本実施形態における撮像部の一例である。 The image sensor 115 performs capture operations for moving images and still images, capture operations for through images, and the like. A through image is mainly a moving image, and is displayed on the display monitor 130 so that the user can decide, for example, the composition for capturing a still image. The through image, the moving image, and the still image are each an example of a captured image in this embodiment. The image sensor 115 is an example of an imaging unit in this embodiment.

画像処理エンジン１２０は、イメージセンサ１１５から出力された撮像データに対して各種の処理を施して画像データを生成したり、画像データに各種の処理を施して、表示モニタ１３０に表示するための画像を生成したりする。各種処理としては、ホワイトバランス補正、ガンマ補正、ＹＣ変換処理、電子ズーム処理、圧縮処理、伸張処理等が挙げられるが、これらに限定されない。画像処理エンジン１２０は、ハードワイヤードな電子回路で構成してもよいし、プログラムを用いたマイクロコンピュータ、プロセッサなどで構成してもよい。 The image processing engine 120 generates image data by performing various types of processing on the imaging data output from the image sensor 115 , and performs various types of processing on the image data to generate an image to be displayed on the display monitor 130 . to generate. Various types of processing include, but are not limited to, white balance correction, gamma correction, YC conversion processing, electronic zoom processing, compression processing, and expansion processing. The image processing engine 120 may be configured by a hardwired electronic circuit, or may be configured by a microcomputer, processor, or the like using a program.

本実施形態において、画像処理エンジン１２０は、撮像画像の画像認識によって人及び動物といった種々の種別の被写体の検出機能を実現する画像認識部１２２を含む。画像認識部１２２の詳細については後述する。 In this embodiment, the image processing engine 120 includes an image recognition unit 122 that realizes a function of detecting various types of subjects such as people and animals by image recognition of captured images. Details of the image recognition unit 122 will be described later.

表示モニタ１３０は、種々の情報を表示する表示部の一例である。例えば、表示モニタ１３０は、イメージセンサ１１５で撮像され、画像処理エンジン１２０で画像処理された画像データが示す画像（スルー画像）を表示する。また、表示モニタ１３０は、ユーザがデジタルカメラ１００に対して種々の設定を行うためのメニュー画面等を表示する。表示モニタ１３０は、例えば、液晶ディスプレイデバイスまたは有機ＥＬデバイスで構成できる。 The display monitor 130 is an example of a display that displays various information. For example, the display monitor 130 displays an image (through image) represented by image data captured by the image sensor 115 and image-processed by the image processing engine 120 . The display monitor 130 also displays menu screens and the like for the user to make various settings for the digital camera 100 . The display monitor 130 can be composed of, for example, a liquid crystal display device or an organic EL device.

操作部１５０は、デジタルカメラ１００の外装に設けられた操作釦や操作レバー等のハードキーの総称であり、使用者による操作を受け付ける。操作部１５０は、例えば、レリーズ釦、モードダイヤル、タッチパネルを含む。操作部１５０はユーザによる操作を受け付けると、ユーザ操作に対応した操作信号をコントローラ１３５に送信する。 The operation unit 150 is a general term for hard keys such as operation buttons and operation levers provided on the exterior of the digital camera 100, and receives operations by the user. Operation unit 150 includes, for example, a release button, a mode dial, and a touch panel. Upon receiving an operation by the user, the operation unit 150 transmits an operation signal corresponding to the user's operation to the controller 135 .

コントローラ１３５は、デジタルカメラ１００全体の動作を統括制御する。コントローラ１３５はＣＰＵ等を含み、ＣＰＵがプログラム（ソフトウェア）を実行することで所定の機能を実現する。コントローラ１３５は、ＣＰＵに代えて、所定の機能を実現するように設計された専用の電子回路で構成されるプロセッサを含んでもよい。すなわち、コントローラ１３５は、ＣＰＵ、ＭＰＵ、ＧＰＵ、ＤＳＵ、ＦＰＧＡ、ＡＳＩＣ等の種々のプロセッサで実現できる。コントローラ１３５は１つまたは複数のプロセッサで構成してもよい。また、コントローラ１３５は、画像処理エンジン１２０などと共に１つの半導体チップで構成してもよい。 The controller 135 centrally controls the operation of the entire digital camera 100 . The controller 135 includes a CPU and the like, and the CPU executes a program (software) to realize a predetermined function. Instead of the CPU, the controller 135 may include a processor composed of dedicated electronic circuits designed to implement predetermined functions. That is, the controller 135 can be implemented with various processors such as CPU, MPU, GPU, DSU, FPGA, ASIC, and the like. Controller 135 may be comprised of one or more processors. Also, the controller 135 may be composed of one semiconductor chip together with the image processing engine 120 and the like.

バッファメモリ１２５は、画像処理エンジン１２０やコントローラ１３５のワークメモリとして機能する記録媒体である。バッファメモリ１２５は、ＤＲＡＭ（ＤｙｎａｍｉｃＲａｎｄｏｍＡｃｃｅｓｓＭｅｍｏｒｙ）などにより実現される。フラッシュメモリ１４５は不揮発性の記録媒体である。また、図示していないが、コントローラ１３５は各種の内部メモリを有してもよく、例えばＲＯＭを内蔵してもよい。ＲＯＭには、コントローラ１３５が実行する様々なプログラムが記憶されている。また、コントローラ１３５は、ＣＰＵの作業領域として機能するＲＡＭを内蔵してもよい。 A buffer memory 125 is a recording medium that functions as a work memory for the image processing engine 120 and the controller 135 . The buffer memory 125 is implemented by a DRAM (Dynamic Random Access Memory) or the like. Flash memory 145 is a non-volatile recording medium. Also, although not shown, the controller 135 may have various internal memories, such as a ROM. Various programs executed by the controller 135 are stored in the ROM. Also, the controller 135 may incorporate a RAM that functions as a work area for the CPU.

カードスロット１４０は、着脱可能なメモリカード１４２が挿入される手段である。カードスロット１４０は、メモリカード１４２を電気的及び機械的に接続可能である。メモリカード１４２は、内部にフラッシュメモリ等の記録素子を備えた外部メモリである。メモリカード１４２は、画像処理エンジン１２０で生成される画像データなどのデータを格納できる。 The card slot 140 is a means into which a removable memory card 142 is inserted. The card slot 140 can electrically and mechanically connect a memory card 142 . The memory card 142 is an external memory internally provided with a recording element such as a flash memory. The memory card 142 can store data such as image data generated by the image processing engine 120 .

通信モジュール１５５は、通信規格ＩＥＥＥ８０２．１１またはＷｉ－Ｆｉ規格等に準拠した通信を行う通信モジュール（回路）である。デジタルカメラ１００は、通信モジュール１５５を介して、他の機器と通信することができる。デジタルカメラ１００は、通信モジュール１５５を介して、他の機器と直接通信を行ってもよいし、アクセスポイント経由で通信を行ってもよい。通信モジュール１５５は、インターネット等の通信ネットワークに接続可能であってもよい。 The communication module 155 is a communication module (circuit) that performs communication conforming to the communication standard IEEE802.11, the Wi-Fi standard, or the like. Digital camera 100 can communicate with other devices via communication module 155 . The digital camera 100 may directly communicate with another device via the communication module 155, or may communicate via an access point. Communication module 155 may be connectable to a communication network such as the Internet.

マイク１６０は、音を収音する収音部の一例である。マイク１６０は、収音した音声を電気信号であるアナログ信号に変換して出力する。マイク１６０は、１つ又は複数のマイクロフォン素子から構成されてもよい。 Microphone 160 is an example of a sound pickup unit that picks up sound. The microphone 160 converts the collected sound into an analog signal, which is an electric signal, and outputs the analog signal. Microphone 160 may be composed of one or more microphone elements.

マイク用のＡ／Ｄコンバータ１６５は、マイク１６０からのアナログ信号をデジタル信号の音声データに変換する。マイク用のＡ／Ｄコンバータ１６５は、本実施形態における音声取得部の一例である。なお、マイク１６０は、デジタルカメラ１００の外部にあるマイクロフォン素子を含んでもよい。この場合、デジタルカメラ１００は音声取得部として、外部のマイク１６０に対するインタフェース回路を備える。 The microphone A/D converter 165 converts the analog signal from the microphone 160 into digital audio data. The A/D converter 165 for the microphone is an example of the voice acquisition section in this embodiment. It should be noted that microphone 160 may include a microphone element external to digital camera 100 . In this case, the digital camera 100 is provided with an interface circuit for the external microphone 160 as an audio acquisition section.

音声処理エンジン１７０は、マイク用のＡ／Ｄコンバータ１６５等の音声取得部から出力された音声データを受信して、受信した音声データに対して種々の音声処理を施す。音声処理エンジン１７０は、本実施形態における音声処理部の一例である。音声処理エンジン１７０は、画像処理エンジン１２０と一体的に実装されてもよい。音声処理エンジン１７０の構成の詳細については後述する。 The audio processing engine 170 receives audio data output from an audio acquisition unit such as the A/D converter 165 for the microphone, and applies various audio processing to the received audio data. The audio processing engine 170 is an example of an audio processing unit in this embodiment. The audio processing engine 170 may be implemented integrally with the image processing engine 120 . The details of the configuration of the audio processing engine 170 will be described later.

〔１－１－１．画像認識部について〕 [1-1-1. Image Recognition Unit]
本実施形態における画像認識部１２２の詳細を、以下説明する。 Details of the image recognition unit 122 in this embodiment will be described below.

画像認識部１２２は、例えば畳み込みニューラルネットワーク等のニューラルネットワークによる学習済みモデルを採用する。画像認識部１２２は、イメージセンサ１１５からの撮像データを学習済みモデルに入力して、当該モデルによる画像認識処理を実行する。画像認識部１２２は、画像認識処理による被写体の種別の検出結果を示す検出情報を出力する。画像認識部１２２は、本実施形態における検出部の一例である。画像認識部１２２は、画像処理エンジン１２０とコントローラ１３５との協働によって構成されてもよい。 The image recognition unit 122 employs a trained model by a neural network such as a convolutional neural network. The image recognition unit 122 inputs the captured data from the image sensor 115 to the learned model and executes image recognition processing using the model. The image recognition unit 122 outputs detection information indicating the detection result of the subject type by image recognition processing. The image recognition unit 122 is an example of a detection unit in this embodiment. The image recognition unit 122 may be configured by cooperation between the image processing engine 120 and the controller 135 .

画像認識部１２２の画像認識処理は、学習済みモデルに入力されたデータが示す画像において、予め設定された複数のカテゴリの何れかに分類される被写体が映っている領域を示す位置情報と対応するカテゴリとを関連付けて、検出情報として出力する。複数のカテゴリは、例えば「人」及び「動物」といった種別を含む。また、各カテゴリは更に細分化されてもよく、例えば、人の体、顔および瞳といった人の各部、並びに動物の体、顔および瞳といった動物の各部を含んでもよい。位置情報は、例えば処理対象の画像上の水平座標及び垂直座標で規定され、例えば検出された被写体を矩形状に囲む領域を示す（図５など参照）。 In its image recognition processing, the image recognition unit 122 takes the image represented by the data input to the trained model, associates position information indicating a region in which a subject classified into one of a plurality of preset categories appears with the corresponding category, and outputs the result as detection information. The plurality of categories include types such as "person" and "animal". Each category may be further subdivided and may include, for example, parts of a person such as the body, face, and eyes, and parts of an animal such as the body, face, and eyes. The position information is defined, for example, by horizontal and vertical coordinates on the image being processed, and indicates, for example, a rectangular region surrounding the detected subject (see FIG. 5 and others).
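As an illustration of what such detection information might look like in software, the following Python sketch pairs a category with a rectangular region given in image coordinates. The class name `Detection`, its field names, the label strings, and the helper `detections_of_type` are assumptions made for this example; the text does not specify a concrete data layout.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class Detection:
    """One entry of detection information: a category plus a bounding rectangle."""
    category: str            # hypothetical label, e.g. "person" or "cat"
    x: int                   # horizontal coordinate of the rectangle's left edge (pixels)
    y: int                   # vertical coordinate of the rectangle's top edge (pixels)
    width: int
    height: int
    confidence: float = 1.0  # optional confidence or likelihood of the detection


def detections_of_type(detections: List[Detection], target_type: str) -> List[Detection]:
    """Return only the detections whose category matches the given type."""
    return [d for d in detections if d.category == target_type]


# Example detection result for a single frame (values are made up).
frame_detections = [
    Detection("person", 120, 80, 200, 360, 0.93),
    Detection("cat", 420, 300, 90, 70, 0.81),
]
people = detections_of_type(frame_detections, "person")
```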

画像認識部１２２は、各カテゴリについて、予め設定された最大の個数までの被写体を同時に検出してもよい。また、上記の動物のカテゴリ（或いは種別）は、さらに、動物の種類に応じて分類されてもよい。例えば、犬、猫および鳥などのカテゴリが別々に設定されてもよいし、犬と猫を１つにまとめたカテゴリが設定されてもよい。以下では、デジタルカメラ１００において予め設定された複数の種別が、第１の種別の一例として種別「人」と、第２の種別の一例として種別「猫」とを含む場合を説明する。 The image recognition unit 122 may simultaneously detect up to a preset maximum number of subjects for each category. In addition, the category (or type) of animals described above may be further classified according to the type of animal. For example, categories such as dogs, cats, and birds may be set separately, or a category combining dogs and cats may be set. A case will be described below in which a plurality of types preset in the digital camera 100 include the type "person" as an example of the first type and the type "cat" as an example of the second type.

以上のような画像認識部１２２の学習済みモデルは、例えば、各カテゴリの被写体が映った画像を正解とする正解ラベルを関連付けた画像データを教師データとして用いた教師あり学習によって得ることができる。学習済みモデルは、各カテゴリの検出結果に関する信頼度或いは尤度を生成してもよい。 The trained model of the image recognition unit 122 described above can be obtained, for example, by supervised learning that uses, as teacher data, image data in which images showing subjects of each category are associated with ground-truth labels marking them as correct. The trained model may also generate a confidence or likelihood for the detection result of each category.

画像認識部１２２の学習済みモデルはニューラルネットワークに限らず、種々の画像認識に関する機械学習モデルであってもよい。また、画像認識部１２２は機械学習に限らず、種々の画像認識アルゴリズムを採用してもよい。また、画像認識部１２２は、例えば人の顔および瞳などの一部のカテゴリに対する検出がルールベースの画像認識処理によって行われるように構成されてもよい。 The trained model of the image recognition unit 122 is not limited to a neural network, and may be machine learning models related to various image recognitions. Also, the image recognition unit 122 may adopt various image recognition algorithms without being limited to machine learning. Further, the image recognition unit 122 may be configured such that detection of some categories, such as human faces and eyes, is performed by rule-based image recognition processing.

〔１－１－２．音声処理エンジンについて〕 [1-1-2. Audio Processing Engine]
音声処理エンジン１７０の構成の詳細について、図２～図４を用いて説明する。図２は、デジタルカメラ１００における音声処理エンジン１７０の構成を示すブロック図である。 Details of the configuration of the audio processing engine 170 will be described with reference to FIGS. 2 to 4. FIG. 2 is a block diagram showing the configuration of the audio processing engine 170 in the digital camera 100.

音声処理エンジン１７０は、例えば機能的構成として、図２に示すように、雑音抑圧部１７２と、音声抽出部１７４と、強調処理部１７６とを備える。音声処理エンジン１７０は、マイク用のＡ／Ｄコンバータ１６５から音声データＡｉｎを入力して、各種機能による音声処理を行う。音声抽出部１７４及び強調処理部１７６は、例えばコントローラ１３５によって制御される。 The audio processing engine 170 includes, for example, a noise suppression unit 172, an audio extraction unit 174, and an enhancement processing unit 176 as shown in FIG. The audio processing engine 170 receives audio data Ain from the microphone A/D converter 165 and performs audio processing using various functions. The voice extraction unit 174 and the enhancement processing unit 176 are controlled by the controller 135, for example.

雑音抑圧部１７２は、音声処理エンジン１７０に入力された音声データＡｉｎにおいて雑音を抑制する処理を行う。雑音抑圧部１７２による処理は、例えば風の音や、レンズ等の駆動音、ユーザ等がデジタルカメラ１００に触れて生じる各種ハンドリング雑音といった所定の雑音を抑圧するために行われ、例えばルールベースのアルゴリズムで実装される。雑音抑圧部１７２は、処理した音声データＡ１０を、音声抽出部１７４及び強調処理部１７６に出力する。雑音抑圧部１７２の処理後の音声データＡ１０は、例えば音声抽出を行わずに動画を撮影する際に得られる動画音声を示す。 The noise suppression unit 172 performs processing to suppress noise in the audio data Ain input to the audio processing engine 170. This processing is performed to suppress predetermined noise such as wind noise, drive noise of the lens and the like, and various handling noises generated when the user or others touch the digital camera 100, and is implemented by, for example, a rule-based algorithm. The noise suppression unit 172 outputs the processed audio data A10 to the audio extraction unit 174 and the enhancement processing unit 176. The audio data A10 after processing by the noise suppression unit 172 represents, for example, the video audio that would be obtained when shooting a moving image without audio extraction.

音声抽出部１７４は、雑音抑圧部１７２からの動画音声の音声データＡ１０において、特定の種別（以下「対象種別」という場合がある）の音声を抽出する処理を行って、抽出音声を示す音声データＡ１１を出力する。音声抽出部１７４の処理は、例えばニューラルネットワーク等の機械学習による学習済みモデルによって実現される。以下では、畳み込みニューラルネットワーク（ＣＮＮ）を用いる例を説明する。 The audio extracting unit 174 extracts audio of a specific type (hereinafter sometimes referred to as “target type”) from the audio data A10 of the video audio from the noise suppressing unit 172, and extracts audio data representing the extracted audio. Output A11. The processing of the speech extraction unit 174 is realized by a machine learning model such as a neural network, for example. An example using a convolutional neural network (CNN) is described below.

音声抽出部１７４のＣＮＮは、例えば画像認識に用いられる場合と同様に、画像データを入力とする畳み込み層などを含む。本例において、音声抽出部１７４は、動画音声の音声データを画像データに変換する音声／画像変換部１７４ａと、変換された画像データ上で特定の種別に対応する部分を識別するようにＣＮＮによる処理を実行するＣＮＮ処理部１７５と、識別された部分の画像データを音声データに変換する画像／音声変換部１７４ｂとを備える。音声抽出部１７４は、例えば所定のフレーム周期で周期的に動作可能である。 The CNN of the audio extraction unit 174 includes, as in the case of image recognition, convolution layers and the like that take image data as input. In this example, the audio extraction unit 174 includes an audio/image conversion unit 174a that converts the audio data of the video audio into image data, a CNN processing unit 175 that executes CNN processing to identify the portion of the converted image data corresponding to a specific type, and an image/audio conversion unit 174b that converts the image data of the identified portion into audio data. The audio extraction unit 174 can operate periodically, for example, at a predetermined frame period.

音声処理エンジン１７０には予め、対象種別として設定可能な複数の種別が設定されている。音声処理エンジン１７０における複数の種別は、例えば画像認識部１２２に予め設定された複数の種別と対応している。図３は、音声抽出部１７４における特定の種別のデータ例を説明した図である。 A plurality of types that can be set as the target type are set in the audio processing engine 170 in advance. The plurality of types in the audio processing engine 170 correspond, for example, to the plurality of types preset in the image recognition unit 122. FIG. 3 is a diagram explaining an example of data of a specific type in the audio extraction unit 174.

図３（Ａ）は、種別「人」のデータ例として人の声の音声データＡ１２による音声の波形を例示する。図３（Ｂ）は、図３（Ａ）の変換後の画像データＢ１２を例示する。音声データＡ１２は、図３（Ａ）に例示するように、時間方向に沿って音声波形の振幅が規定される時系列データを構成する。音声／画像変換部１７４ａは、例えば短時間フーリエ変換（ＳＴＦＴ）等を演算して、音声データＡ１２の変換後の画像データＢ１２を生成する。 FIG. 3A exemplifies the waveform of the voice of voice data A12 of a human voice as an example of data of the type "person". FIG. 3B illustrates image data B12 after conversion of FIG. 3A. The audio data A12 constitutes time-series data in which the amplitude of the audio waveform is defined along the time direction, as exemplified in FIG. 3(A). The audio/image conversion unit 174a calculates, for example, a short-time Fourier transform (STFT) or the like to generate image data B12 after conversion of the audio data A12.

図３（Ｂ）に示すように、変換後の画像データＢ１２は、音声データＡ１２のスペクトログラム或いは声紋画像を示し、時間方向Ｘに加えて周波数方向Ｙを有する。画像データＢ１２の画素値は、（Ｘ，Ｙ）座標で規定される音の成分の強さ（振幅）を示す。画像データＢ１２の画像上の領域は、変換前の音声データＡ１２において対応する時間区間及び周波数帯の成分を表す。 As shown in FIG. 3(B), the converted image data B12 represents a spectrogram, or voiceprint image, of the audio data A12, and has a frequency direction Y in addition to a time direction X. A pixel value of the image data B12 indicates the strength (amplitude) of the sound component defined by the (X, Y) coordinates. A region on the image of the image data B12 represents the components of the corresponding time interval and frequency band in the audio data A12 before conversion.
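The conversion from a time-series waveform to a time/frequency image can be sketched as follows. This is a deliberately small, pure-Python short-time Fourier transform using a naive DFT, written for clarity and not for speed; the function name `stft_magnitude`, the Hann window, and the frame and hop sizes are assumptions for the example and are not taken from the text.

```python
import cmath
import math
from typing import List


def stft_magnitude(samples: List[float], frame_len: int, hop: int) -> List[List[float]]:
    """Return spectrogram[t][k]: magnitude of frequency bin k in time frame t."""
    # Hann window to reduce spectral leakage at the frame edges.
    window = [0.5 - 0.5 * math.cos(2 * math.pi * n / frame_len) for n in range(frame_len)]
    spectrogram = []
    for start in range(0, len(samples) - frame_len + 1, hop):
        frame = [samples[start + n] * window[n] for n in range(frame_len)]
        bins = []
        # Only non-negative frequencies are needed for a real-valued signal.
        for k in range(frame_len // 2 + 1):
            acc = sum(
                frame[n] * cmath.exp(-2j * math.pi * k * n / frame_len)
                for n in range(frame_len)
            )
            bins.append(abs(acc))
        spectrogram.append(bins)
    return spectrogram


# A pure tone that falls exactly on frequency bin 4 of a 32-sample frame.
frame_len, hop = 32, 16
tone = [math.sin(2 * math.pi * 4 * n / frame_len) for n in range(128)]
spec = stft_magnitude(tone, frame_len, hop)
```

Each row of `spec` corresponds to one position along the time direction X and each column to one position along the frequency direction Y, so the values play the role of the pixel values described above.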

図４は、図３とは別の種別のデータ例を説明した図である。図４（Ａ）は、種別「猫」のデータ例として猫の鳴き声の音声データＡ１３による音声の波形を例示する。図４（Ｂ）は、図４（Ａ）の変換後の画像データＢ１３を例示する。図３（Ｂ），図４（Ｂ）に示す画像データＢ１２，Ｂ１３間には、図３（Ａ），図４（Ａ）の音声データＡ１２，Ａ１３における種別の違いに応じて、異なる特徴量が含まれる。ＣＮＮ処理部１７５の機械学習によると、このような特徴量の識別方法が獲得される。 FIG. 4 is a diagram explaining an example of data of a type different from that of FIG. 3. FIG. 4(A) illustrates, as an example of data of the type "cat", the waveform of audio data A13 of a cat's cry. FIG. 4(B) illustrates image data B13 obtained by converting the data of FIG. 4(A). The image data B12 and B13 shown in FIGS. 3(B) and 4(B) contain different features, reflecting the difference in type between the audio data A12 and A13 of FIGS. 3(A) and 4(A). Through machine learning, the CNN processing unit 175 acquires a method for identifying such features.

例えば種々の種別による音声に応じた画像データＢ１２，Ｂ１３をラベル付けした教師となる画像データが、ＣＮＮ処理部１７５の機械学習のための教師データベース（ＤＢ）４０に格納される。ＣＮＮ処理部１７５の学習済みモデルは、教師ＤＢ４０を用いた教師あり学習において、画像データを入力すると特定の種別の識別情報を出力するように、ＣＮＮの重みパラメータ群を誤差逆伝播法で入力データと教師データの誤差を小さくするために調整することによって構成できる。なお、教師ＤＢ４０では、画像データの代わりに音声データが格納されてもよい。この場合、教師ＤＢ４０中の音声データに対しても音声／画像変換部１７４ａの変換が適用可能である。 For example, image data serving as teacher data, in which image data B12 and B13 corresponding to sounds of various types are labeled, is stored in a teacher database (DB) 40 for machine learning of the CNN processing unit 175. The trained model of the CNN processing unit 175 can be constructed through supervised learning using the teacher DB 40, by adjusting the group of CNN weight parameters with error backpropagation so as to reduce the error between the output for the input data and the teacher data, such that identification information of a specific type is output when image data is input. Note that the teacher DB 40 may store audio data instead of image data. In this case, the conversion by the audio/image conversion unit 174a can also be applied to the audio data in the teacher DB 40.

ＣＮＮ処理部１７５が出力する識別情報は、例えば、入力の画像データ上で特定の種別に対応すると識別された領域等を示す画像データを含み、又この識別の信頼度あるいは尤度を含んでもよい。ＣＮＮ処理部１７５には、例えば上記のＣＮＮに加えて又はこれに代えて、種別毎の音声に応じた画像データ等を生成する各種の生成モデルが含まれてもよい。ＣＮＮ処理部１７５では、種別ごとに別々に機械学習された学習済みモデルを用いることができる。例えば、各種別の学習済みモデル或いは対応する重みパラメータ群は、フラッシュメモリ１４５において学習データベース（ＤＢ）４５に格納され、特定の種別の音声抽出を実行するために用いる設定情報として適時、コントローラ１３５によってＣＮＮ処理部１７５に設定される。なお、ＣＮＮ処理部１７５には、複数の種別を同時に識別する学習済みモデルを用いてもよい。 The identification information output by the CNN processing unit 175 includes, for example, image data indicating an area identified as corresponding to a specific type on the input image data, and may also include the reliability or likelihood of this identification. . The CNN processing unit 175 may include, for example, in addition to or instead of the above-described CNN, various generation models for generating image data or the like corresponding to each type of voice. The CNN processing unit 175 can use trained models that have undergone machine learning separately for each type. For example, each type of trained model or corresponding set of weighting parameters may be stored in a learning database (DB) 45 in flash memory 145 and used by controller 135 as setting information for performing a particular type of speech extraction. It is set in the CNN processing unit 175 . The CNN processing unit 175 may use a trained model that simultaneously identifies a plurality of types.

図２に戻り、画像／音声変換部１７４ｂは、ＣＮＮ処理部１７５によって識別された画像データに対して、例えば音声／画像変換部１７４ａによるＳＴＦＴの逆変換を演算して、音声抽出部１７４における抽出結果を示す抽出音声の音声データＡ１１を生成する。 Returning to FIG. 2, the image/audio conversion unit 174b applies, for example, the inverse of the STFT performed by the audio/image conversion unit 174a to the image data identified by the CNN processing unit 175, and generates audio data A11 of the extracted audio, which represents the extraction result of the audio extraction unit 174.
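The round trip described above (audio to a time-frequency image, identification of regions on that image, and an inverse transform back to audio) can be illustrated with a minimal Python sketch. It is not the implementation of the audio extraction unit 174: the non-overlapping rectangular-window transform, the function names `stft`, `istft`, and `extract_audio`, and the hand-written binary `mask` that stands in for the region identified by the CNN processing unit 175 are all assumptions made for illustration.

```python
import cmath
import math

def stft(signal, frame_len):
    """Simplified STFT: one DFT per non-overlapping, rectangular-window frame.
    (A practical STFT would use overlapping, tapered windows.)"""
    frames = []
    for start in range(0, len(signal) - frame_len + 1, frame_len):
        frame = signal[start:start + frame_len]
        frames.append([
            sum(frame[n] * cmath.exp(-2j * math.pi * k * n / frame_len)
                for n in range(frame_len))
            for k in range(frame_len)
        ])
    return frames

def istft(frames, frame_len):
    """Inverse of stft() above: inverse DFT of each frame, concatenated."""
    signal = []
    for spectrum in frames:
        for n in range(frame_len):
            value = sum(spectrum[k] * cmath.exp(2j * math.pi * k * n / frame_len)
                        for k in range(frame_len)) / frame_len
            signal.append(value.real)
    return signal

def extract_audio(signal, mask, frame_len):
    """Keep only the time-frequency cells selected by `mask` (hypothetical
    stand-in for the CNN output), then transform back to audio."""
    spec = stft(signal, frame_len)
    masked = [[s * m for s, m in zip(frame, row)]
              for frame, row in zip(spec, mask)]
    return istft(masked, frame_len)
```

With a mask that is 1 everywhere, `extract_audio` returns the input unchanged; with a mask that keeps only the frequency bins of one tone, the other tones are removed from the reconstructed audio.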

強調処理部１７６は、音声抽出部１７４からの抽出音声の音声データＡ１１を入力する音声増幅部１７７と、雑音抑圧部１７２からの動画音声の音声データＡ１０を入力する音声減衰部１７８と、音声増幅部１７７と音声減衰部１７８の出力を統合する音声結合部１７９とを備える。強調処理部１７６は、音声抽出部１７４による抽出音声が動画音声から強調されるように、抽出音声及び動画音声の各音声データＡ１０，Ａ１１を処理して、音声処理エンジン１７０による処理結果の音声データＡｏｕｔを出力する。 The enhancement processing unit 176 includes an audio amplifier 177 that receives the audio data A11 of the extracted audio from the audio extraction unit 174, an audio attenuator 178 that receives the audio data A10 of the video audio from the noise suppression unit 172, and an audio combiner 179 that integrates the outputs of the audio amplifier 177 and the audio attenuator 178. The enhancement processing unit 176 processes the audio data A10 and A11 of the video audio and the extracted audio so that the audio extracted by the audio extraction unit 174 is emphasized relative to the video audio, and outputs the audio data Aout as the processing result of the audio processing engine 170.

音声増幅部１７７は、入力される音声データＡ１１に対して、例えばコントローラ１３５によって設定されるゲインＧ１を乗じる乗算処理を行って、抽出音声を増幅する。音声減衰部１７８は、入力される音声データＡ１０に対して、当該音声データＡ１０が示す動画音声の音量と、音声結合部１７９による結合後の音声の音量とを同じにする値のゲインＧ０（＜１）を乗じて、動画音声を抑圧する。音声結合部１７９は、増幅された抽出音声と抑圧された動画音声とを同期して合成し、処理結果の音声データＡｏｕｔを生成する。 The audio amplifier 177 amplifies the extracted audio by multiplying the input audio data A11 by a gain G1 set, for example, by the controller 135. The audio attenuator 178 suppresses the video audio by multiplying the input audio data A10 by a gain G0 (< 1) whose value makes the volume of the audio after combination by the audio combiner 179 equal to the volume of the video audio represented by the audio data A10. The audio combiner 179 synchronizes and combines the amplified extracted audio and the suppressed video audio to generate the audio data Aout as the processing result.

なお、音声減衰部１７８のゲインＧ０は、強調処理部１７６において算出されてもよいし、コントローラ１３５によって設定されてもよい。音声増幅部１７７のゲインＧ１は、例えば１以下であってもよい。この場合であっても、動画音声の中に抽出音声と同じ音声が含まれていることから、抽出対象となった音声は、処理結果の音声データＡｏｕｔにおいて動画音声中の分よりも増幅されることとなる。 Note that the gain G0 of the audio attenuator 178 may be calculated by the enhancement processing unit 176 or set by the controller 135. The gain G1 of the audio amplifier 177 may be, for example, 1 or less. Even in this case, because the video audio contains the same sound as the extracted audio, the sound targeted for extraction is amplified in the resulting audio data Aout relative to its level in the video audio.
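As a rough illustration of the enhancement processing, the following Python sketch combines the two signals as Aout = G1 * A11 + G0 * A10 and solves for a G0 that keeps the overall level unchanged. The text does not state how "volume" is measured or how G0 is computed, so the use of RMS level, the quadratic solution, the fallback values, and the function names `rms`, `volume_preserving_g0`, and `enhance` are all assumptions.

```python
import math

def rms(samples):
    """Root-mean-square level, used here as an assumed measure of 'volume'."""
    return math.sqrt(sum(s * s for s in samples) / len(samples))

def volume_preserving_g0(a10, a11, g1):
    """Solve RMS(g1*a11 + g0*a10) == RMS(a10) for the attenuation gain g0."""
    sxx = sum(x * x for x in a10)
    if sxx == 0.0:
        return 1.0  # silent video audio: nothing to attenuate
    sxy = sum(x * g1 * y for x, y in zip(a10, a11))
    syy = sum((g1 * y) ** 2 for y in a11)
    # g0^2*sxx + 2*g0*sxy + (syy - sxx) = 0, take the larger root
    disc = sxy * sxy - sxx * (syy - sxx)
    if disc < 0.0:
        return 0.0  # no real solution: amplified extract is too loud (assumed fallback)
    return max(0.0, (-sxy + math.sqrt(disc)) / sxx)

def enhance(a10, a11, g1):
    """Amplify the extracted audio, attenuate the video audio, and combine them."""
    g0 = volume_preserving_g0(a10, a11, g1)
    aout = [g1 * y + g0 * x for x, y in zip(a10, a11)]
    return aout, g0
```

For example, with `a10 = [1, 1, -1, -1]` (voice plus background) and `a11 = [1, 0, -1, 0]` (the voice alone), calling `enhance(a10, a11, 1.0)` yields a G0 of about 0.366. The voice samples then carry more weight in the output than the background samples even though G1 is only 1, which matches the remark that a G1 of 1 or less can still emphasize the extracted sound.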

以上のような音声処理エンジン１７０において、音声抽出部１７４の機能はＣＮＮに限らず、他のニューラルネットワークで実現されてもよいし、ニューラルネットワーク以外の種々の音声識別に関する機械学習モデルであってもよい。また、教師ＤＢ４０等を用いた音声抽出部１７４の機械学習は、デジタルカメラ１００への実装前に予め行われてもよい。この場合、デジタルカメラ１００のフラッシュメモリ１４５には、学習結果の学習ＤＢ４５が記録されれば、特に教師ＤＢ４０は記録されなくてもよい。 In the speech processing engine 170 as described above, the function of the speech extraction unit 174 is not limited to the CNN, and may be realized by other neural networks, or may be machine learning models related to various speech identifications other than neural networks. good. Further, the machine learning of the voice extraction unit 174 using the teacher DB 40 and the like may be performed in advance before being mounted on the digital camera 100 . In this case, if the learning result DB 45 is recorded in the flash memory 145 of the digital camera 100, the teacher DB 40 need not be recorded.

また、音声処理エンジン１７０においては、教師ＤＢ４０のような種々の種別と対応付けた音声データ等を含むデータベースを用いて、音声抽出部１７４の抽出結果の補正が行われてもよい。例えば当該データベースをフラッシュメモリ１４５に格納しておき、音声処理エンジン１７０が音声抽出部１７４の抽出結果とデータベース中のデータとを照合してもよい。また、音声抽出部１７４等の機能は機械学習に限らず、種々の音声識別アルゴリズムにより実現されてもよく、上記のようなデータベースにおける検索が利用されてもよい。 Further, in the speech processing engine 170, the extraction result of the speech extraction unit 174 may be corrected using a database such as the teacher DB 40 that includes speech data associated with various types. For example, the database may be stored in the flash memory 145, and the speech processing engine 170 may compare the extraction result of the speech extraction unit 174 with the data in the database. Also, the functions of the voice extraction unit 174 and the like may be realized by various voice recognition algorithms without being limited to machine learning, and searches in databases as described above may be used.

〔１－２．動作〕 [1-2. Operation]
以上のように構成されるデジタルカメラ１００の動作について説明する。以下では、デジタルカメラ１００による動画撮影時の動作を説明する。 The operation of the digital camera 100 configured as above will be described. The operation of the digital camera 100 when shooting a moving image will be described below.

デジタルカメラ１００は順次、光学系１１０を介して形成された被写体像をイメージセンサ１１５で撮像して撮像データを生成する。画像処理エンジン１２０は、イメージセンサ１１５により生成された撮像データに対して各種処理を施して画像データを生成し、バッファメモリ１２５に記録する。また、画像処理エンジン１２０の画像認識部１２２は、撮像データが示す画像に基づき、被写体の種別および領域を検出して、例えば検出情報Ｄ１をコントローラ１３５に出力する。 The digital camera 100 sequentially captures subject images formed via the optical system 110 with the image sensor 115 to generate captured data. The image processing engine 120 generates image data by performing various types of processing on the imaging data generated by the image sensor 115 , and records the image data in the buffer memory 125 . Also, the image recognition unit 122 of the image processing engine 120 detects the type and area of the subject based on the image represented by the image data, and outputs detection information D1 to the controller 135, for example.

以上の撮像動作と同時並行で、デジタルカメラ１００は、マイク１６０において収音を行う。マイク用のＡ／Ｄコンバータ１６５から収音結果の音声データを音声処理エンジン１７０にて処理する。音声処理エンジン１７０は、処理後の音声データＡｏｕｔをバッファメモリ１２５に記録する。 Simultaneously with the above imaging operation, the digital camera 100 picks up sound with the microphone 160 . The voice processing engine 170 processes voice data as a result of sound pickup from the microphone A/D converter 165 . The audio processing engine 170 records the processed audio data Aout in the buffer memory 125 .

コントローラ１３５は、バッファメモリ１２５を介して、画像処理エンジン１２０から受け付ける画像データと音声処理エンジン１７０から受け付ける音声データとの間で、同期を取って動画をメモリカード１４２に記録する。また、コントローラ１３５は逐次、表示モニタ１３０にスルー画像を表示させる。ユーザは、表示モニタ１３０のスルー画像により随時、撮影の構図等を確認することができる。動画撮影の動作は、操作部１５０におけるユーザの操作に応じて開始／終了される。 The controller 135 synchronizes the image data received from the image processing engine 120 and the audio data received from the audio processing engine 170 via the buffer memory 125 and records the moving image in the memory card 142 . Also, the controller 135 sequentially causes the display monitor 130 to display a through image. The user can check the shooting composition and the like at any time using the through image on the display monitor 130 . The motion picture shooting operation is started/finished according to the user's operation on the operation unit 150 .

以上のようなデジタルカメラ１００の動画撮影は、「人」又は「動物」といった特定の種別の被写体に注目して行われる場合がある。この場合、音声についても、上記種別の発声を明瞭に収集したいとのニーズが考えられる。 Moving image shooting by the digital camera 100 as described above may be performed by focusing on a specific type of subject such as a "person" or an "animal". In this case, it is conceivable that there is a need to clearly collect utterances of the above types also for speech.

本実施形態のデジタルカメラ１００は、画像処理エンジン１２０における画像認識部１２２の検出情報Ｄ１によって被写体の種別を検出し、画像認識で特定の種別の被写体が検出されたときに、音声処理エンジン１７０において当該種別に対する音声抽出の処理を実行する。このように、画像処理エンジン１２０の画像認識と音声処理エンジン１７０の音声抽出等とを連動させて、特定の種別の被写体による音声の抽出を精度良く実現する。 The digital camera 100 of this embodiment detects the type of subject by the detection information D1 of the image recognition unit 122 in the image processing engine 120, and when a specific type of subject is detected by image recognition, the sound processing engine 170 Perform voice extraction processing for the type. In this manner, the image recognition of the image processing engine 120 and the sound extraction of the sound processing engine 170 are interlocked, and the sound of a specific type of subject can be accurately extracted.

以下では、上記のような特定の種別が「人」に設定された動作モード（以下「人優先モード」という）におけるデジタルカメラ１００の動作例を説明する。 An example of operation of the digital camera 100 in an operation mode (hereinafter referred to as "person priority mode") in which the specific type is set to "person" will be described below.

〔１－２－１．人優先モードについて〕 [1-2-1. Person priority mode]
図５は、デジタルカメラ１００の人優先モードの概要を説明するための図である。人優先モードは、種別が「人」の被写体に注目して動画撮影等を行うための動作モードである。 FIG. 5 is a diagram for explaining an outline of the person priority mode of the digital camera 100. The person priority mode is an operation mode for shooting a moving image or the like while focusing on a subject whose type is "person".

図５（Ａ）は、人優先モードにおける表示モニタ１３０の表示の一例を示す。デジタルカメラ１００のコントローラ１３５は、表示モニタ１３０にスルー画像と共に、スルー画像中で枠表示などにより、被写体が検出された検出領域Ｒ１を表示する。また、図５の例において、表示モニタ１３０は、音声抽出アイコン５を表示している。音声抽出アイコン５は、音声抽出の対象とする種別を示す対象種別マーク５ａと、抽出された音声が増幅されるレベルを示す増幅レベルバー５ｂとを含む。対象種別マーク５ａ（対象種別情報の一例）と増幅レベルバー５ｂ（強調レベル情報の一例）とは、それぞれコントローラ１３５の制御によって表示される。人優先モードの音声抽出アイコン５では、対象種別マーク５ａとして「人」のマークが表示される。 FIG. 5A shows an example of the display on the display monitor 130 in the person priority mode. The controller 135 of the digital camera 100 causes the display monitor 130 to display the through image together with a detection area R1, shown by a frame or the like within the through image, in which a subject has been detected. In the example of FIG. 5, the display monitor 130 also displays an audio extraction icon 5. The audio extraction icon 5 includes a target type mark 5a indicating the type targeted for audio extraction and an amplification level bar 5b indicating the level to which the extracted audio is amplified. The target type mark 5a (an example of target type information) and the amplification level bar 5b (an example of emphasis level information) are each displayed under the control of the controller 135. In the audio extraction icon 5 of the person priority mode, a "person" mark is displayed as the target type mark 5a.

人優先モードのデジタルカメラ１００において、画像処理エンジン１２０の画像認識部１２２は、例えば種々の種別の被写体を検出する。図５（Ａ）の例では、被写体において、対象種別の人２１，２２と、対象種別とは別の種別の猫２０とが、それぞれ検出されている。この際、画像認識による人２１，２２の検出に応じて、音声処理エンジン１７０の音声抽出部１７４が動作し、対象種別「人」に対する音声抽出の処理を開始する。 In the digital camera 100 in human priority mode, the image recognition unit 122 of the image processing engine 120 detects, for example, various types of subjects. In the example of FIG. 5A, as subjects, persons 21 and 22 of the target type and a cat 20 of a type different from the target type are detected. At this time, the voice extracting unit 174 of the voice processing engine 170 operates according to the detection of the persons 21 and 22 by image recognition, and starts voice extraction processing for the target type "person".

図５（Ｂ）は、図５（Ａ）に対応した音声変化を例示するグラフである。図５（Ｂ）において、横軸は時間を示し、縦軸は増幅（又は抑圧の）レベルを示す。曲線Ｃ１は抽出音声を表し、曲線Ｃ０は動画音声を表している。人２１，２２の何れかが発声して、種別「人」の音声が抽出されると、強調処理部１７６は抽出音声の増幅を行う。一方、「猫」の鳴き声は、音声抽出の対象とはならない。このように、ユーザが意図した対象種別「人」の音声が他の音声よりも優先して明瞭に得られる。 FIG. 5B is a graph illustrating voice changes corresponding to FIG. 5A. In FIG. 5B, the horizontal axis indicates time and the vertical axis indicates amplification (or suppression) level. A curve C1 represents the extracted audio, and a curve C0 represents the video audio. When one of the persons 21 and 22 vocalizes and the speech of the type "person" is extracted, the enhancement processing unit 176 amplifies the extracted speech. On the other hand, the cry of "cat" is not subject to speech extraction. In this way, the voice of the target type "person" intended by the user is clearly obtained with priority over other voices.

また、音声処理エンジン１７０の強調処理部１７６は、図５（Ｂ）の曲線Ｃ１に示すように、抽出音声を徐々に緩やかに増大させる。これにより、ユーザにとって強調後の音声が聴き難くなるような急激な音声変化を回避することができる。また、音声処理エンジン１７０は、強調処理部１７６による処理の前後で全音量を一定に保つように、抽出音声の増幅と、動画音声の抑圧とを行う。これにより、ユーザにとって強調後の音声をより聴き易くすることができる。また、ユーザは、抽出音声の増幅のレベルを、図５（Ａ）において増幅レベルバー５ｂで確認することができる。さらに、対象種別マーク５ａにより、ユーザは現在の対象種別を確認でき、ユーザの意図に沿った音声強調を実現し易くすることができる。 Further, the enhancement processing unit 176 of the voice processing engine 170 gradually and gently increases the extracted voice as indicated by the curve C1 in FIG. 5(B). As a result, it is possible to avoid sudden voice changes that make it difficult for the user to hear the emphasized voice. Also, the audio processing engine 170 amplifies the extracted audio and suppresses the video audio so that the total volume is kept constant before and after the processing by the enhancement processing unit 176 . This makes it easier for the user to hear the emphasized voice. Also, the user can confirm the level of amplification of the extracted voice with the amplification level bar 5b in FIG. 5(A). Furthermore, the target type mark 5a allows the user to confirm the current target type, thereby facilitating realization of voice enhancement in accordance with the user's intention.

〔１－２－２．動作の詳細〕 [1-2-2. Details of operation]
以上のような人優先モードにおけるデジタルカメラ１００の動作の詳細を、図６～図７を用いて説明する。ユーザは、例えば種別「人」の被写体による音声を明瞭に得たい意図があるときに、デジタルカメラ１００の設定メニュー等においてタッチパネルや各種キーなどの操作部１５０にユーザ操作を入力して、デジタルカメラ１００を人優先モードに設定できる。 Details of the operation of the digital camera 100 in the person priority mode described above will be described with reference to FIGS. 6 and 7. For example, when the user wants to clearly capture the voice of a subject of type "person", the user can set the digital camera 100 to the person priority mode by operating the operation unit 150, such as a touch panel or various keys, in a setting menu or the like of the digital camera 100.

図６は、実施の形態１に係るデジタルカメラ１００の動作を例示するフローチャートである。図６に示すフローチャートは、例えばデジタルカメラ１００が人優先モードに設定された状態で動画の撮影中に実行される。この状態で、表示モニタ１３０は、コントローラ１３５の制御により、種別「人」を示す対象種別マーク５ａ等を表示している。本フローチャートによる各処理は、例えば、デジタルカメラ１００のコントローラ１３５によって実行される。なお、コントローラ１３５の代わりに、以下の各処理を実行させる機能が音声処理エンジン１７０に実装されてもよい。 FIG. 6 is a flow chart illustrating the operation of the digital camera 100 according to the first embodiment. The flowchart shown in FIG. 6 is executed, for example, while the digital camera 100 is set to the people-priority mode while shooting a moving image. In this state, the display monitor 130 is controlled by the controller 135 to display the target type mark 5a indicating the type "person" and the like. Each process according to this flowchart is executed by the controller 135 of the digital camera 100, for example. It should be noted that, instead of the controller 135, the audio processing engine 170 may be implemented with a function for executing each of the following processes.

まず、コントローラ１３５は、画像処理エンジン１２０から検出情報Ｄ１を取得して、画像認識部１２２において種別が「人」の被写体が検出されたか否かを判断する（Ｓ１）。コントローラ１３５は、種別「人」の被写体が検出されるまで、例えば所定の周期でステップＳ１の判断を繰り返す（Ｓ１でＮＯ）。当該周期は、例えば画像処理エンジン１２０における画像認識部１２２の動作周期である。 First, the controller 135 acquires the detection information D1 from the image processing engine 120, and determines whether or not a subject whose type is "human" has been detected in the image recognition section 122 (S1). The controller 135 repeats the determination of step S1 at predetermined intervals, for example, until a subject of type "person" is detected (NO in S1). The cycle is, for example, the operation cycle of the image recognition section 122 in the image processing engine 120 .

ステップＳ１において、音声処理エンジン１７０は、音声抽出部１７４（図２）の処理は実行せずに雑音抑圧部１７２の処理後の音声データＡ１０を生成して、強調処理部１７６にて特に抑圧せずに（Ｇ０＝１）、バッファメモリ１２５に出力する。 In step S1, the audio processing engine 170 generates the audio data A10 processed by the noise suppression unit 172 without executing the processing of the audio extraction unit 174 (FIG. 2), and outputs it to the buffer memory 125 without any particular suppression in the enhancement processing unit 176 (G0 = 1).

画像認識において種別「人」の被写体が検出されたとき（Ｓ１でＹＥＳ）、コントローラ１３５は、「人」を対象種別とする音声抽出を開始させるように、音声処理エンジン１７０を制御する（Ｓ２）。コントローラ１３５は、学習ＤＢ４５を参照して、対象種別「人」の音声抽出を行うための設定情報を、音声処理エンジン１７０の音声抽出部１７４に設定する。また、コントローラ１３５は、例えば強調処理部１７６における音声増幅部１７７のゲインＧ１を初期値に設定する。ゲインＧ１の初期値は、ユーザが急激な音量変化とは感じないと想定される値に設定される。 When an object of type "person" is detected in image recognition (YES in S1), the controller 135 controls the audio processing engine 170 so as to start extracting audio with "person" as the target type (S2). . The controller 135 refers to the learning DB 45 and sets the setting information for extracting the voice of the target type “person” in the voice extraction unit 174 of the voice processing engine 170 . Also, the controller 135 sets, for example, the gain G1 of the audio amplification section 177 in the enhancement processing section 176 to an initial value. The initial value of the gain G1 is set to a value assumed that the user does not perceive a sudden volume change.

コントローラ１３５は、音声処理エンジン１７０の音声抽出部１７４において対象種別の音声が抽出されたか否かを判断する（Ｓ３）。ステップＳ３の判断は、例えば、音声抽出部１７４のＣＮＮ処理部１７５から出力される識別情報の信頼度に基づいて行われる。コントローラ１３５は、対象とする種別「人」の音声が抽出されたと判断するまで、例えば所定の周期でステップＳ１の判断を繰り返す（Ｓ３でＮＯ）。当該周期は、例えば音声処理エンジン１７０における音声抽出部１７４の動作周期である。 The controller 135 determines whether or not the sound of the target type has been extracted by the sound extractor 174 of the sound processing engine 170 (S3). The determination in step S3 is made based on the reliability of the identification information output from the CNN processing unit 175 of the voice extraction unit 174, for example. The controller 135 repeats the determination in step S1 at predetermined intervals, for example, until it determines that the target type "human" voice has been extracted (NO in S3). The cycle is, for example, the operating cycle of the voice extractor 174 in the voice processing engine 170 .

ステップＳ２後の音声処理エンジン１７０においては、音声抽出部１７４が対象種別の音声を抽出すると逐次、強調処理部１７６の音声増幅部１７７が抽出音声を増幅する。この際、音声増幅部１７７では順次、設定されたゲインＧ１が用いられる。例えばステップＳ３において抽出された音声には初期値のゲインＧ１が適用される。また、強調処理部１７６の音声減衰部１７８は、音声増幅部１７７に設定されたゲインＧ１に応じて、音量を維持する値のゲインＧ０を用いる。 In the sound processing engine 170 after step S2, when the sound extraction unit 174 extracts the target type sound, the sound amplification unit 177 of the enhancement processing unit 176 amplifies the extracted sound. At this time, the audio amplification unit 177 sequentially uses the set gain G1. For example, the initial gain G1 is applied to the voice extracted in step S3. Also, the sound attenuation unit 178 of the enhancement processing unit 176 uses the gain G0 of a value that maintains the volume according to the gain G1 set in the sound amplification unit 177 .

コントローラ１３５は、対象種別「人」の音声が抽出されたと判断したとき（Ｓ３でＹＥＳ）、音声増幅部１７７のゲインＧ１を初期値から増大させる（Ｓ４）。これにより、次に抽出された音声には、増大されたゲインＧ１が適用される。ステップＳ４は、所定ピッチでゲインＧ１を増やしてもよいし、連続的に増やしてもよい。又、コントローラ１３５は、ステップＳ４において、ゲインＧ１の増大に応じて増幅レベルバー５ｂが示すレベルを上げるように表示モニタ１３０を制御する（図５（Ａ），（Ｂ）参照）。 When the controller 135 determines that the voice of the target type “human” has been extracted (YES in S3), it increases the gain G1 of the voice amplifying section 177 from the initial value (S4). This causes the increased gain G1 to be applied to the next extracted speech. In step S4, the gain G1 may be increased at a predetermined pitch, or may be increased continuously. Also, in step S4, the controller 135 controls the display monitor 130 so as to increase the level indicated by the amplification level bar 5b as the gain G1 increases (see FIGS. 5A and 5B).

次に、コントローラ１３５は、画像認識部１２２から検出情報Ｄ１を再度取得して、現時点で対象種別「人」の被写体が検出されているか否かを判断する（Ｓ５）。ステップＳ５の判断は、ステップＳ１と同様に行われる。 Next, the controller 135 acquires the detection information D1 again from the image recognition unit 122, and determines whether or not a subject of the target type "person" is detected at this time (S5). The determination in step S5 is made in the same manner as in step S1.

コントローラ１３５は、対象種別「人」の被写体が検出されていると判断すると（Ｓ５でＹＥＳ）、現時点で音声抽出部１７４において対象種別の音声が抽出されたか否かを、ステップＳ３と同様に判断する（Ｓ６）。 When the controller 135 determines that a subject of the target type “person” has been detected (YES in S5), the controller 135 determines whether or not the voice of the target type has been extracted by the voice extraction unit 174 at the present time in the same manner as in step S3. (S6).

対象種別の音声が抽出されている場合（Ｓ６でＹＥＳ）、コントローラ１３５は、音声増幅部１７７に設定されたゲインＧ１が最大値か否かを判断する（Ｓ７）。最大値は、例えばユーザにとって抽出音声が充分に強調されていると感じられる程度の値に設定される。設定済みのゲインＧ１が最大値に到っていない場合（Ｓ７でＮＯ）、コントローラ１３５は再度、音声増幅部１７７のゲインＧ１を増大させて（Ｓ４）、ステップＳ５以降の処理を再度行う。これにより、新たに抽出された音声に対してさらに増大されたゲインＧ１が適用される。 If the target type of sound has been extracted (YES in S6), the controller 135 determines whether or not the gain G1 set in the sound amplifying section 177 is the maximum value (S7). The maximum value is set, for example, to a value at which the user feels that the extracted speech is sufficiently emphasized. If the set gain G1 has not reached the maximum value (NO in S7), the controller 135 increases the gain G1 of the audio amplifier 177 again (S4), and repeats the processing from step S5. This applies a further increased gain G1 to the newly extracted speech.

一方、ゲインＧ１が最大値である場合（Ｓ７でＹＥＳ）、コントローラ１３５は、ステップＳ４の処理を行わずに、ステップＳ５以降の処理を再度行う。これにより、音声処理エンジン１７０において抽出音声を強調する増幅を、適切なゲインＧ１で維持することができる。 On the other hand, if the gain G1 is the maximum value (YES in S7), the controller 135 does not perform the process of step S4, but performs the processes after step S5 again. Thereby, the amplification for emphasizing the extracted speech in the speech processing engine 170 can be maintained at an appropriate gain G1.

また、コントローラ１３５は、現時点で種別が「人」の被写体が検出されていなかったり（Ｓ５でＮＯ）、対象種別の音声が抽出されていなかったりすると（Ｓ６でＮＯ）、音声増幅部１７７のゲインＧ１を減少させる（Ｓ８）。ステップＳ８の処理は、例えばステップＳ４と同じピッチで行われる。又、コントローラ１３５は、ステップＳ８において、ゲインＧ１の減少に応じて増幅レベルバー５ｂが示すレベルを下げるように表示モニタ１３０を制御する（図７（Ａ），（Ｂ）参照）。 If no subject of type "person" is currently detected (NO in S5), or if audio of the target type is not currently being extracted (NO in S6), the controller 135 decreases the gain G1 of the audio amplifier 177 (S8). The processing of step S8 is performed, for example, at the same pitch as step S4. Also, in step S8, the controller 135 controls the display monitor 130 so as to lower the level indicated by the amplification level bar 5b as the gain G1 decreases (see FIGS. 7A and 7B).

また、コントローラ１３５は、例えば減少させたゲインＧ１が最小値であるか否かを判断する（Ｓ９）。ゲインＧ１の最小値は、例えば初期値と同じ値であってもよい。コントローラ１３５は、ゲインＧ１が最小値に到っていない場合（Ｓ９でＮＯ）、ステップＳ５以降の処理を再度行う。これにより、その後のステップＳ６において、音声抽出部１７４が対象種別の音声を抽出すると、減少させたゲインＧ１を適用して音声増幅が為される。一方、ゲインＧ１が最小値に到った場合（Ｓ９でＹＥＳ）、コントローラ１３５は、音声抽出部１７４による音声抽出の処理を停止させて（Ｓ１０）、ステップＳ１に戻る。 Also, the controller 135 determines whether the reduced gain G1 is the minimum value, for example (S9). The minimum value of gain G1 may be, for example, the same value as the initial value. If the gain G1 has not reached the minimum value (NO in S9), the controller 135 repeats the processing after step S5. As a result, when the sound extraction unit 174 extracts the sound of the target type in subsequent step S6, the sound is amplified by applying the reduced gain G1. On the other hand, when the gain G1 reaches the minimum value (YES in S9), the controller 135 stops the voice extraction processing by the voice extraction unit 174 (S10), and returns to step S1.

以上の処理は、例えばデジタルカメラ１００の人優先モードで動画の撮影中に繰り返し、実行される。動画の記録としては、音声処理後の音声データＡｏｕｔが記録される。 The above processing is repeatedly executed during video shooting in the human-priority mode of the digital camera 100, for example. Audio data Aout after audio processing is recorded as moving image recording.
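The flow of steps S1 to S10 can be summarized as a small state machine. The Python sketch below is only one reading of the flowchart description: the class name `GainController`, the per-call `update` interface, the numeric defaults, and the choice to make the minimum gain equal to the initial gain (which the text permits as an example) are assumptions. On NO at S3 the sketch simply keeps waiting for audio of the target type.

```python
WAIT_SUBJECT, WAIT_VOICE, ACTIVE = "S1", "S3", "S5"

class GainController:
    """Tracks gain G1 over time; call update() once per processing period."""

    def __init__(self, g_init=1.0, g_max=2.0, step=0.25):
        self.g_init = g_init
        self.g_min = g_init      # assumption: minimum equals the initial value
        self.g_max = g_max
        self.step = step         # assumed fixed pitch for S4 and S8
        self.g1 = g_init
        self.state = WAIT_SUBJECT

    def _increase(self):
        self.g1 = min(self.g1 + self.step, self.g_max)

    def update(self, subject_detected, voice_extracted):
        if self.state == WAIT_SUBJECT:
            if subject_detected:                      # S1: YES
                self.g1 = self.g_init                 # S2: start extraction
                self.state = WAIT_VOICE
        elif self.state == WAIT_VOICE:
            if voice_extracted:                       # S3: YES
                self._increase()                      # S4
                self.state = ACTIVE
        else:
            if subject_detected and voice_extracted:  # S5: YES and S6: YES
                if self.g1 < self.g_max:              # S7: NO
                    self._increase()                  # S4
            else:                                     # S5: NO or S6: NO
                self.g1 = max(self.g1 - self.step, self.g_min)  # S8
                if self.g1 <= self.g_min:             # S9: YES
                    self.state = WAIT_SUBJECT         # S10: stop extraction
        return self.g1

def run(controller, observations):
    """Feed (subject_detected, voice_extracted) pairs and collect G1 values."""
    return [controller.update(d, v) for d, v in observations]
```

Because the gain changes by at most one step per call, the extracted audio fades in and out gradually instead of switching abruptly when the subject appears or disappears.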

以上の処理によると、「人」のような特定の種別の画像認識に連動して、音声の抽出と、抽出された音声の増幅とが実行される。 According to the above processing, extraction of voice and amplification of the extracted voice are executed in conjunction with image recognition of a specific type such as "person".

例えば図５（Ｂ）の例では、時刻ｔ１前には、種別「人」の被写体が検出されておらず（Ｓ１でＮＯ）、音声の抽出及び増幅／抑圧も行われていない。このように、人優先モードであっても画像認識で種別「人」の被写体が検出されていなければ、対象種別についての音声処理を行わないことで、不必要に動画音声を小さくすることを回避できる。 For example, in the example of FIG. 5B, before time t1, no subject of type "person" has been detected (NO in S1), and neither audio extraction nor amplification/suppression is performed. In this way, even in the person priority mode, if no subject of type "person" is detected by image recognition, audio processing for the target type is not performed, which avoids unnecessarily reducing the video audio.

また、画像認識部１２２において種別「人」の被写体が検出され（Ｓ１でＹＥＳ）、かつ音声抽出部１７４において対象種別の音声が抽出され始めると（Ｓ３でＹＥＳ）、コントローラ１３５は、音声増幅部１７７のゲインＧ１を次第に増大させる（Ｓ２～Ｓ７）。これにより、図５（Ｂ）の時刻ｔ１から抽出音声の強調が緩やかに進み、増幅開始のタイミング前後でもユーザにとって聴き易い音声を得ることができる。 When the image recognition unit 122 detects a subject of type "person" (YES in S1) and the audio extraction unit 174 starts extracting audio of the target type (YES in S3), the controller 135 gradually increases the gain G1 of the audio amplifier 177 (S2 to S7). As a result, the emphasis of the extracted audio progresses gently from time t1 in FIG. 5B, and audio that is easy for the user to listen to can be obtained even around the time amplification starts.

図７は、デジタルカメラ１００の人優先モードにおける「人」の移動時の動作例を説明した図である。人優先モードのデジタルカメラ１００においては、一人でも種別「人」の画像認識がされている限り、音声抽出が継続する。 FIG. 7 is a diagram for explaining an operation example when a "person" moves in the person priority mode of the digital camera 100. In the digital camera 100 in the person priority mode, audio extraction continues as long as at least one subject of type "person" is recognized in the image.

図７（Ａ）は、図５（Ａ）の後の表示例を示す。図７（Ｂ）は、図７（Ａ）に対応した音声変化を例示する。本例では、人２１，２２が一人も居なくなっており、画像認識において種別「人」が検出されなくなる（Ｓ５でＮＯ）。この際、音声抽出は即座に停止されるのではなく、例えばコントローラ１３５が音声増幅部１７７のゲインＧ１を次第に減少させる（Ｓ７，Ｓ８）。 FIG. 7A shows a display example after FIG. 5A. FIG. 7B illustrates voice changes corresponding to FIG. 7A. In this example, none of the persons 21 and 22 are present, and the type "person" is no longer detected in image recognition (NO in S5). At this time, the voice extraction is not stopped immediately, but the controller 135, for example, gradually decreases the gain G1 of the voice amplifier 177 (S7, S8).

図７（Ｃ）は、図７（Ａ）の後の表示例を示す。図７（Ｄ）は、図７（Ｃ）に対応した音声変化を例示する。図７（Ｃ）の例では、図７（Ａ）の後に再度、人２２が検出されており、音声増幅部１７７のゲインＧ１も再度、増大される（Ｓ４～Ｓ８）。以上のように、人２２等の被写体が移動する状況であっても、抽出音声の変化を急激にすることなく、より明瞭な音声を得ることができる。 FIG. 7C shows a display example after FIG. 7A. FIG. 7(D) illustrates voice changes corresponding to FIG. 7(C). In the example of FIG. 7C, the person 22 is detected again after FIG. 7A, and the gain G1 of the audio amplifier 177 is increased again (S4 to S8). As described above, even in a situation where the subject such as the person 22 is moving, it is possible to obtain clearer voice without abrupt changes in the extracted voice.

以上の説明では、対象種別の音声を強調する例を説明したが、これに代えて対象種別の音声が抑制されるようにしてもよい。例えばユーザは、人の音声を抑制したい場合に、上述した人優先モードの代わりの動作モードを選択する。この動作モードでは、例えば、音声処理エンジン１７０が、図６のフローチャートにおいて音声の増幅と抑圧とを入れ替えた処理を行うことにより、対象種別の音声を抑制できる。これにより、特定の種別の音声を抑制したいというようなユーザの意図に沿った音声の明瞭化を実現することができる。 In the above description, an example of emphasizing the sound of the target type has been described, but instead of this, the sound of the target type may be suppressed. For example, if the user wishes to suppress human speech, the user selects an operating mode instead of the human priority mode described above. In this operation mode, for example, the audio processing engine 170 can suppress the audio of the target type by performing processing in which the audio amplification and audio suppression are interchanged in the flowchart of FIG. As a result, it is possible to realize voice clarification that meets the user's intention, such as suppressing a specific type of voice.

以上の説明では、対象種別が種別「人」である場合の動作例を説明したが、他の種別についても同様の動作が可能である。例えば、デジタルカメラ１００は、画像認識部１２２及び音声処理エンジン１７０に設定可能な複数の種別の各々を対象種別として採用する動作モードを有してもよい。例えば、表示モニタ１３０において設定メニューに各動作モードの選択肢を表示した状態で操作部１５０からユーザ操作を入力して、ユーザ所望の対象種別に応じた動作モードが選択されてもよい。 In the above description, an operation example in which the target type is the type "person" has been described, but the same operation is possible for other types as well. For example, the digital camera 100 may have an operation mode that employs each of a plurality of types that can be set for the image recognition unit 122 and the audio processing engine 170 as target types. For example, the operation mode corresponding to the target type desired by the user may be selected by inputting the user's operation from the operation unit 150 while displaying the options of each operation mode on the setting menu on the display monitor 130 .

〔１－３．まとめ〕 [1-3. Summary]
以上のように、実施の形態１のデジタルカメラ１００は、撮像部の一例としてイメージセンサ１１５と、音声取得部の一例としてマイク用のＡ／Ｄコンバータ１６５と、検出部の一例として画像認識部１２２と、音声処理部の一例として音声処理エンジン１７０と、操作部１５０とを備える。イメージセンサ１１５は、被写体像を撮像して画像データを生成する。マイク用のＡ／Ｄコンバータ１６５は、イメージセンサ１１５による撮像中の音声を示す音声データＡｉｎを取得する。画像認識部１２２は、イメージセンサ１１５によって生成された画像データに基づいて、被写体とその種別を検出する。音声処理エンジン１７０は、画像認識部１２２によって検出された被写体の種別に基づいて、取得された音声データＡｉｎを処理する。操作部１５０は、ユーザによるデジタルカメラ１００の各種操作に基づいて、例えば人に関する第１の種別および第１の種別とは異なる第２の種別を含む複数の種別の中から、音声処理エンジン１７０による処理の対象とする対象種別を設定する。音声処理エンジン１７０は、画像データにおいて対象種別の被写体が検出されたときに（Ｓ１）、取得された音声データＡｉｎにおいて対象種別に応じた音声を強調又は抑制するように、音声抽出部１７４及び強調処理部１７６で当該音声データＡｉｎを処理する（Ｓ２～Ｓ４）。 As described above, the digital camera 100 of Embodiment 1 includes the image sensor 115 as an example of an imaging unit, the microphone A/D converter 165 as an example of an audio acquisition unit, the image recognition unit 122 as an example of a detection unit, the audio processing engine 170 as an example of an audio processing unit, and the operation unit 150. The image sensor 115 captures a subject image and generates image data. The microphone A/D converter 165 acquires audio data Ain representing the audio during image capture by the image sensor 115. The image recognition unit 122 detects a subject and its type based on the image data generated by the image sensor 115. The audio processing engine 170 processes the acquired audio data Ain based on the type of the subject detected by the image recognition unit 122. Based on the user's various operations on the digital camera 100, the operation unit 150 sets the target type to be processed by the audio processing engine 170 from among a plurality of types including, for example, a first type relating to people and a second type different from the first type. When a subject of the target type is detected in the image data (S1), the audio processing engine 170 processes the audio data Ain with the audio extraction unit 174 and the enhancement processing unit 176 so as to emphasize or suppress the sound corresponding to the target type in the acquired audio data Ain (S2 to S4).

以上のデジタルカメラ１００によると、イメージセンサ１１５による画像データの画像認識においてユーザ所望の対象種別に該当する特定の被写体が検出されたときに、特定の被写体の種別に応じた音声が強調又は抑制された音声データＡｏｕｔが得られる。これにより、ユーザの意図に沿って特定の被写体による音声を明瞭に得やすくすることができる。 According to the digital camera 100 described above, when a specific subject corresponding to the target type desired by the user is detected in the image recognition of the image data by the image sensor 115, the sound corresponding to the specific subject type is emphasized or suppressed. Voice data Aout is obtained. As a result, it is possible to easily obtain the voice of the specific subject clearly according to the user's intention.

本実施形態において、デジタルカメラ１００は、画像データが示す画像を表示する表示部の一例として表示モニタ１３０をさらに備える。表示モニタ１３０は、対象種別を示す対象種別情報の一例である対象種別マーク５ａを表示する。これにより、ユーザは、現在の対象種別を確認しながら動作の撮影等を行え、ユーザの意図に沿った被写体の音声取得を実現し易くできる。また、さらに表示モニタ１３０は、被写体の音声を強調又は抑制するレベルを示す強調レベル情報の一例である増幅レベルバー５ｂを表示させてもよい。 In this embodiment, the digital camera 100 further includes a display monitor 130 as an example of a display that displays an image represented by image data. The display monitor 130 displays the target type mark 5a, which is an example of target type information indicating the target type. As a result, the user can shoot the action while confirming the current target type, and can easily acquire the voice of the subject in accordance with the user's intention. Furthermore, the display monitor 130 may display an amplification level bar 5b, which is an example of emphasis level information indicating the level at which the subject's voice is emphasized or suppressed.

本実施形態において、デジタルカメラ１００は、ユーザの操作を入力する操作部１５０を備えている。音声処理エンジン１７０の処理対象となる対象種別は、操作部１５０におけるユーザの操作に基づき設定される。これにより、ユーザ所望の種別による音声を明瞭に得やすくすることができる。 In this embodiment, the digital camera 100 includes an operation unit 150 for inputting user's operations. The target type to be processed by the audio processing engine 170 is set based on the user's operation on the operation unit 150 . As a result, it is possible to easily obtain the voice of the type desired by the user.

本実施形態において、デジタルカメラ１００は、被写体の種別に応じた動作モードの一例として、種別「人」による人優先モードを有する。操作部１５０は、デジタルカメラ１００の動作モードを選択するユーザの操作に従って、対象種別を設定する。例えば、人優先モードが選択されると対象種別は「人」に設定される。なお、このような動作モードは人優先モードに限らず、例えば種別「人」の代わりに「猫」など各種の動物の種別を優先する動作モードが用いられてもよい。 In this embodiment, the digital camera 100 has a person priority mode according to the type "person" as an example of an operation mode according to the type of subject. The operation unit 150 sets the target type according to the user's operation for selecting the operation mode of the digital camera 100 . For example, when the person priority mode is selected, the target type is set to "person". Note that such an operation mode is not limited to the human-priority mode, and an operation mode that prioritizes various types of animals such as "cat" instead of the type "human" may be used.

本実施形態において、音声処理エンジン１７０は、対象種別に応じた音声を強調する増幅率であるゲインＧ１を、画像認識部１２２が当該対象種別の被写体を検出したとき（Ｓ２でＹＥＳ）から次第に増大させる（Ｓ３～Ｓ７）。これにより、急激な音声変化を回避して、強調された抽出音声をユーザにとって聴き易くすることができる。 In this embodiment, the audio processing engine 170 gradually increases the gain G1, which is an amplification factor for emphasizing the sound corresponding to the target type, from the time the image recognition unit 122 detects a subject of the target type (YES in S2) (S3 to S7). This avoids abrupt changes in the sound and makes the emphasized extracted sound easier for the user to listen to.

本実施形態において、音声処理エンジン１７０は、画像認識部１２２が対象種別の被写体を検出した後に対象種別の被写体が検出されなくなったとき（Ｓ５でＮＯ）、ゲインＧ１を次第に減少させる（Ｓ８，Ｓ９）。これにより、被写体が検出されているか否かによって抽出音声の強調を過度に変化させることを回避し、ユーザにとってより聴き易い音声を得ることができる。 In this embodiment, when a subject of the target type is no longer detected after the image recognition unit 122 has detected a subject of the target type (NO in S5), the audio processing engine 170 gradually decreases the gain G1 (S8, S9). This avoids changing the emphasis of the extracted sound excessively depending on whether or not the subject is detected, and yields sound that is easier for the user to listen to.
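
The gradual increase and decrease of the gain G1 described in the two paragraphs above can be sketched as a small stepwise ramp. This is only an illustrative sketch: the class name `GainRamp`, the step size, the gain limits, and the choice to hold the gain while the subject is detected but no sound of the target type is extracted are all assumptions, not values taken from the description.

```python
from dataclasses import dataclass


@dataclass
class GainRamp:
    """Hypothetical stepwise ramp for the gain G1 (all values are assumed)."""
    g_min: float = 0.0   # assumed minimum gain
    g_max: float = 1.0   # assumed maximum gain
    step: float = 0.25   # assumed change per control period
    gain: float = 0.0    # current value of G1

    def update(self, subject_detected: bool, voice_extracted: bool) -> float:
        """Advance G1 by one control period and return the new value."""
        if subject_detected and voice_extracted:
            # Subject present and target-type sound extracted: raise G1 gradually.
            self.gain = min(self.g_max, self.gain + self.step)
        elif not subject_detected:
            # Subject no longer detected: lower G1 gradually instead of cutting it.
            self.gain = max(self.g_min, self.gain - self.step)
        # Otherwise (detected, but nothing extracted) G1 is held: an assumption.
        return self.gain
```

Calling `update(True, True)` repeatedly moves the gain up in steps until it saturates at `g_max`, and `update(False, False)` moves it back down, so the emphasis never jumps between its extremes in one period.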

本実施形態において、音声処理エンジン１７０は、音声抽出部１７４及び強調処理部１７６において、対象種別に応じた音声を強調する処理前の音声データＡ１０と処理後の音声データＡｏｕｔとの間において音量を維持するように、音声減衰部１７８に入力された音声データＡ１０を処理する。これにより、音声処理の前後で音量を変えないようにして、ユーザがより聴き易い音声を得られる。 In this embodiment, the audio processing engine 170 processes the audio data A10 input to the audio attenuator 178 so that the volume is maintained between the audio data A10, before the processing in which the audio extraction unit 174 and the enhancement processing unit 176 emphasize the sound corresponding to the target type, and the audio data Aout after that processing. Because the volume does not change across the audio processing, the user obtains sound that is easier to listen to.
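
One way to picture the volume-maintaining attenuation is to solve for the attenuator gain that keeps the mean power of the output equal to that of the input. The sketch below assumes a mixing model Aout = G0·A10 + G1·E, where E is the extracted sound and G0 is the attenuator gain; this model, the mean-power criterion, and the function names are assumptions made for illustration, not the processing specified in the description. Setting the mean power of Aout equal to that of A10 gives a quadratic in G0 whose non-negative root is used.

```python
import math
from typing import List, Sequence


def mean_power(x: Sequence[float]) -> float:
    """Mean squared amplitude of a non-empty block of samples."""
    return sum(v * v for v in x) / len(x)


def attenuation_gain(a10: Sequence[float], extracted: Sequence[float], g1: float) -> float:
    """Solve P(G0*A10 + G1*E) = P(A10) for G0 >= 0 (assumed model)."""
    p_a = mean_power(a10)
    if p_a == 0.0:
        return 1.0  # silent input: nothing to attenuate
    p_e = mean_power(extracted)
    c = sum(a * e for a, e in zip(a10, extracted)) / len(a10)
    # Quadratic: p_a*G0^2 + 2*g1*c*G0 + (g1^2*p_e - p_a) = 0
    disc = p_a * p_a - g1 * g1 * (p_a * p_e - c * c)
    if disc < 0.0:
        return 0.0  # emphasized sound alone already exceeds the input power
    return max(0.0, (-g1 * c + math.sqrt(disc)) / p_a)


def mix(a10: Sequence[float], extracted: Sequence[float], g1: float) -> List[float]:
    """Attenuated input plus emphasized extracted sound."""
    g0 = attenuation_gain(a10, extracted, g1)
    return [g0 * a + g1 * e for a, e in zip(a10, extracted)]
```

With this choice, raising G1 automatically lowers G0, which matches the qualitative behavior described above: the emphasized sound becomes more prominent while the overall loudness stays roughly constant.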

本実施形態において、デジタルカメラ１００は、音を収音する収音部の一例としてマイク１６０をさらに備える。マイク用のＡ／Ｄコンバータ１６５は、マイク１６０の収音結果を示す音声データＡｉｎを取得する。なお、マイク１６０は、デジタルカメラ１００内蔵に限らず、外部構成であってもよい。外部のマイク１６０を用いる場合であっても、収音結果の音声データを取得して、音声処理エンジン１７０の音声処理を、画像認識部１２２による検出結果に応じて行うことにより、デジタルカメラ１００にて特定の種別の被写体による音声を明瞭に得ることができる。 In this embodiment, the digital camera 100 further includes a microphone 160 as an example of a sound pickup unit that picks up sound. A/D converter 165 for microphone acquires audio data Ain indicating the result of sound pickup by microphone 160 . Note that the microphone 160 is not limited to being built in the digital camera 100, and may be configured externally. Even when the external microphone 160 is used, the digital camera 100 acquires audio data as a sound collection result and performs audio processing of the audio processing engine 170 according to the detection result of the image recognition unit 122. It is possible to clearly obtain the voice of a specific type of subject.

本実施形態のデジタルカメラ１００は、イメージセンサ１１５（撮像部）と、マイク用のＡ／Ｄコンバータ１６５（音声取得部）と、画像認識部１２２（検出部）と、表示モニタ１３０（表示部）と、音声処理エンジン１７０（音声処理部）と、操作部１５０（操作部）と、コントローラ１３５（制御部）とを備える。本実施形態の操作部は、ユーザによる自装置の設定メニュー等の操作に基づいて、複数の種別の中から、音声処理部による処理の対象とする対象種別を設定する。制御部は、対象種別を示す対象種別情報の一例として対象種別マーク５ａを表示部に表示させる。これによっても、ユーザは、現在の対象種別を確認しながら動作の撮影等を行え、ユーザの意図に沿って被写体による音声を明瞭に得ることを行い易くできる。 The digital camera 100 of this embodiment includes an image sensor 115 (imaging unit), a microphone A/D converter 165 (sound acquisition unit), an image recognition unit 122 (detection unit), a display monitor 130 (display unit), an audio processing engine 170 (audio processing unit), an operation unit 150 (operation unit), and a controller 135 (control unit). The operation unit of the present embodiment sets a target type to be processed by the audio processing unit from among a plurality of types, based on the user's operation of the device's own setting menu or the like. The control unit causes the display unit to display the target type mark 5a as an example of target type information indicating the target type. This also allows the user to shoot the action while confirming the current target type, making it easier to obtain the voice of the subject clearly according to the user's intention.

（実施の形態２） (Embodiment 2)
以下、図８～図１２を用いて実施の形態２を説明する。実施の形態１では、デジタルカメラ１００の人優先モードの動作例を説明したが、実施の形態２では、フォーカス優先モードの動作例を説明する。フォーカス優先モードは、デジタルカメラ１００においてフォーカス対象として選択された被写体の種別を優先して、音声抽出を実行する動作モードである。 Embodiment 2 will be described below with reference to FIGS. 8 to 12. Embodiment 1 described an operation example of the digital camera 100 in the person priority mode, whereas Embodiment 2 describes an operation example in the focus priority mode. The focus priority mode is an operation mode in which audio extraction is performed with priority given to the type of the subject selected as the focus target in the digital camera 100.

以下、実施の形態１に係るデジタルカメラ１００と同様の構成および動作の説明は適宜、省略して、本実施形態に係るデジタルカメラ１００について説明する。 Hereinafter, the digital camera 100 according to the present embodiment will be described, omitting the description of the same configuration and operation as those of the digital camera 100 according to the first embodiment.

〔２－１．フォーカス優先モードについて〕 [2-1. About focus priority mode]
図８は、デジタルカメラ１００のフォーカス優先モードの概要を説明するための図である。本実施形態のデジタルカメラ１００では、例えば表示モニタ１３０のスルー画像に被写体が映っている状態で、タッチパネルやキーなどの操作部１５０におけるユーザ操作により、フォーカス対象の被写体を選択可能である。 FIG. 8 is a diagram for explaining the outline of the focus priority mode of the digital camera 100. In the digital camera 100 of the present embodiment, the subject to be focused can be selected by user operation on the operation unit 150 such as a touch panel or keys while the subject is displayed in the through image on the display monitor 130, for example.

図８（Ａ）は、フォーカス選択前の表示例を示す。図８（Ｂ）は、図８（Ａ）に対応した音声変化を例示する。図８（Ｃ）は、フォーカス選択後の表示例を示す。図８（Ｄ）は、図８（Ｂ）に対応した音声変化を例示する。 FIG. 8A shows a display example before focus selection. FIG. 8(B) illustrates voice changes corresponding to FIG. 8(A). FIG. 8C shows a display example after focus selection. FIG. 8(D) illustrates voice changes corresponding to FIG. 8(B).

図８（Ａ）の表示例では、実施形態１と同様の画像認識部１２２により、猫２０と二人の人２１，２２とによる三つの被写体が検出されている。例えば、ユーザは、表示モニタ１３０において検出領域Ｒ１に対応する各被写体の周りの表示枠を視認して、フォーカス対象の被写体を選択できる。フォーカス対象の選択前には、特に音声抽出は行われず、図８（Ｂ）の曲線Ｃ０に示すように動画音声が得られる。 In the display example of FIG. 8A, three subjects, a cat 20 and two people 21 and 22, are detected by the image recognition unit 122 similar to that of the first embodiment. For example, the user can visually recognize the display frame around each subject corresponding to the detection region R1 on the display monitor 130 to select the subject to be focused. Before the selection of the focus target, audio extraction is not performed in particular, and moving image audio is obtained as indicated by the curve C0 in FIG. 8(B).

図８（Ｃ）の表示例は、図８（Ａ）の状態から一方の人２１がフォーカス対象として選択された例を示す。表示モニタ１３０は、選択された人２１の周りに、他の被写体２０，２２の表示枠とは別の表示態様で、フォーカス対象の表示枠Ｆ１を表示させる。また、レンズ駆動部１１２は、表示枠Ｆ１内の被写体に合焦するように、光学系１１０のフォーカスレンズを駆動する。 The display example of FIG. 8(C) shows an example in which one person 21 is selected as a focus target from the state of FIG. 8(A). The display monitor 130 displays a focus target display frame F1 around the selected person 21 in a display mode different from the display frames of the other subjects 20 and 22 . Further, the lens driving section 112 drives the focus lens of the optical system 110 so as to focus on the subject within the display frame F1.

図８（Ｄ）に示すように、本実施形態の音声処理エンジン１７０は、以上のような動作に連動して、フォーカス対象の被写体の種別に応じた音声を強調するための音声処理を行う。なお、本実施形態の音声処理エンジン１７０は、例えば音声抽出部１７４及び強調処理部１７６において複数種別の抽出音声を並列して処理可能に構成される。 As shown in FIG. 8D, the audio processing engine 170 of the present embodiment performs audio processing for enhancing audio according to the type of the subject to be focused in conjunction with the above operation. Note that the speech processing engine 170 of the present embodiment is configured such that, for example, the speech extraction unit 174 and the enhancement processing unit 176 can process multiple types of extracted speech in parallel.

〔２－２．動作の詳細〕 [2-2. Operation details]
以上のようなフォーカス優先モードにおけるデジタルカメラ１００の動作の詳細を、図９～図１２を用いて説明する。図９，１０は、本実施形態に係るデジタルカメラ１００の動作を例示するフローチャートである。以下では、人優先モードの動作（図６）と同様の説明は適宜、省略する。 Details of the operation of the digital camera 100 in the focus priority mode as described above will be described with reference to FIGS. 9 to 12. FIGS. 9 and 10 are flowcharts illustrating the operation of the digital camera 100 according to this embodiment. In the following, explanations similar to those of the operation in the person priority mode (FIG. 6) will be omitted as appropriate.

フォーカス優先モードのデジタルカメラ１００において、コントローラ１３５は、図６のステップＳ１の代わりに、画像認識部１２２による検出情報Ｄ１に基づいて、画像認識で検出された被写体があるか否かを判断する（Ｓ１Ａ）。検出された被写体がある場合（Ｓ１ＡでＹＥＳ）、コントローラ１３５は、操作部１５０におけるユーザ操作によって、フォーカス対象の被写体が選択されたか否かを判断する（Ｓ１Ｂ）。 In the digital camera 100 in the focus priority mode, instead of step S1 in FIG. 6, the controller 135 determines whether or not there is a subject detected by image recognition, based on the detection information D1 from the image recognition unit 122 (S1A). If there is a detected subject (YES in S1A), the controller 135 determines whether or not a subject to be focused has been selected by a user operation on the operation unit 150 (S1B).

フォーカス対象の被写体が選択されると（Ｓ１ＢでＹＥＳ）、コントローラ１３５は、選択された被写体の種別を対象種別として、図６のステップＳ２と同様に音声処理エンジン１７０に音声抽出を開始させる（Ｓ２Ａ）。このとき、コントローラ１３５は、対象種別マーク５ａが、選択された被写体の種別を示すように音声抽出アイコン５を表示モニタ１３０に表示させる（図８（Ｃ）参照）。又、増幅レベルバー５ｂの表示は、その後のステップＳ４，Ｓ８において実施形態１と同様にコントローラ１３５によってゲインＧ１に対応するように制御される。 When the subject to be focused is selected (YES in S1B), the controller 135 sets the type of the selected subject as the target type and causes the audio processing engine 170 to start audio extraction in the same manner as in step S2 of FIG. 6 (S2A). At this time, the controller 135 causes the display monitor 130 to display the audio extraction icon 5 so that the target type mark 5a indicates the type of the selected subject (see FIG. 8(C)). Further, the display of the amplification level bar 5b is controlled by the controller 135 in the subsequent steps S4 and S8 so as to correspond to the gain G1, as in the first embodiment.

また、音声処理エンジン１７０による抽出音声の増幅（Ｓ３，Ｓ４）の後、コントローラ１３５は、図６のステップＳ５の代わりに、操作部１５０においてフォーカス対象の被写体を変更するユーザ操作が行われたか否かを判断する（Ｓ５Ａ）。 Further, after the audio processing engine 170 amplifies the extracted audio (S3, S4), the controller 135 determines, instead of step S5 in FIG. 6, whether or not a user operation to change the subject to be focused has been performed on the operation unit 150 (S5A).

フォーカス対象の変更がない場合（Ｓ５ＡでＮＯ）、コントローラ１３５は、画像認識部１２２から再度、検出情報Ｄ１を取得して、フォーカス対象に選択された被写体が、現時点で検出されているか否かを判断する（Ｓ５Ｂ）。現時点の画像認識においてフォーカス対象の被写体が検出されていれば（Ｓ５ＢでＹＥＳ）、コントローラ１３５は、ステップＳ６以降の処理を実施の形態１と同様に行う。フォーカス対象の被写体が移動する場合の動作例を、図１１に例示する。 If there is no change in the focus target (NO in S5A), the controller 135 acquires the detection information D1 again from the image recognition unit 122 and determines whether or not the subject selected as the focus target is currently detected (S5B). If the subject to be focused is detected in the current image recognition (YES in S5B), the controller 135 performs the processes from step S6 onward in the same manner as in the first embodiment. FIG. 11 illustrates an operation example when the subject to be focused moves.

図１１（Ａ）は、図８（Ｃ）の後の表示例を示す。図１１（Ｂ）は、図１１（Ａ）に対応した音声変化を例示する。図１１（Ａ）の例では、図８（Ｃ）でフォーカス対象として選択された人２１が移動して、表示モニタ１３０の画像に映らなくなっている。画像認識部１２２では、他の被写体２０，２２は検出されるものの、フォーカス対象として選択された人２１は検出されなくなる。このように、フォーカス対象の被写体が検出されなくなると（Ｓ５ＢでＮＯ）、例えば図１１（Ｂ）の曲線Ｃ１に示すように、コントローラ１３５は抽出音声のゲインＧ１を減らす（Ｓ８）。 FIG. 11(A) shows a display example after FIG. 8(C). FIG. 11(B) illustrates voice changes corresponding to FIG. 11(A). In the example of FIG. 11(A), the person 21 selected as the focus target in FIG. 8(C) has moved and no longer appears in the image on the display monitor 130. The image recognition unit 122 detects the other subjects 20 and 22, but no longer detects the person 21 selected as the focus target. In this way, when the subject to be focused is no longer detected (NO in S5B), the controller 135 reduces the gain G1 of the extracted voice, as shown for example by the curve C1 in FIG. 11(B) (S8).

また、フォーカス対象を変更するユーザ操作があった場合（Ｓ５ＡでＹＥＳ）の動作例を、図１２に例示する。図１２（Ａ）は、図１１（Ａ）の後の表示例を示す。図１２（Ｂ）は、図１２（Ａ）に対応した音声変化を例示する。図１２（Ｂ）のグラフは、種別「猫」の抽出音声を示す曲線Ｃ２をさらに含む。ステップＳ５Ａにおいて、コントローラ１３５は、例えばフォーカス対象の被写体の種別が変化した場合に「ＹＥＳ」に進む一方、変更前後でフォーカス対象の種別が変わらない場合は「ＮＯ」に進んでもよい。 FIG. 12 illustrates an operation example when there is a user operation to change the focus target (YES in S5A). FIG. 12A shows a display example after FIG. 11A. FIG. 12(B) illustrates voice changes corresponding to FIG. 12(A). The graph of FIG. 12(B) further includes a curve C2 representing extracted speech of type "cat". In step S5A, the controller 135 may proceed to "YES" if, for example, the type of the subject to be focused has changed, and to "NO" if the type of the focused subject has not changed before and after the change.

図１２（Ａ）の例では、猫２０が新たなフォーカス対象として選択されており、種別「猫」の対象種別マーク５ａが表示されている。フォーカス対象の種別の変更がある場合（Ｓ５ＡでＹＥＳ）、コントローラ１３５は、例えば図１０に示すように、変更後のフォーカス対象の画像認識があるか否かを判断する（Ｓ２０）。コントローラ１３５は、フォーカス対象の画像認識がある場合（Ｓ２０でＹＥＳ）、当該フォーカス対象の種別を対象種別として、音声処理エンジン１７０による音声抽出をステップＳ２Ａと同様に開始させる（Ｓ２１）。このとき、コントローラ１３５は、例えば図１１（Ａ）で表示した対象種別マーク５ａを、図１２（Ａ）に示すように、新たな対象種別を示すよう更新する。 In the example of FIG. 12A, the cat 20 is selected as a new focus target, and the target type mark 5a of type "cat" is displayed. If there is a change in the focus target type (YES in S5A), the controller 135 determines whether or not there is image recognition of the focus target after the change, as shown in FIG. 10, for example (S20). When there is image recognition of a focus target (YES in S20), the controller 135 sets the type of the focus target as the target type and causes the voice processing engine 170 to start voice extraction in the same manner as in step S2A (S21). At this time, the controller 135 updates, for example, the target type mark 5a displayed in FIG. 11(A) to indicate the new target type as shown in FIG. 12(A).

コントローラ１３５は、音声抽出部１７４において、変更後の対象種別の音声が抽出されたか否かを、ステップＳ３と同様に判断する（Ｓ２２）。例えば図１２（Ｂ）に示すように、猫２０の鳴き声が発したときに種別「猫」の抽出音声が得られ、次第に増大される。このとき、変更前のフォーカス対象についての音声抽出は、即座には停止されない。 The controller 135 determines whether or not the voice of the target type after the change has been extracted in the voice extraction unit 174, in the same manner as in step S3 (S22). For example, as shown in FIG. 12(B), when the cat 20 meows, an extracted voice of type "cat" is obtained and is gradually amplified. At this time, audio extraction for the focus target before the change is not stopped immediately.

以下、変更後の対象種別についての抽出音声のゲインを「Ｇ１ａ」と記し、変更前の対象種別についての抽出音声のゲインを「Ｇ１ｂ」と記す。変更後の対象種別の音声が抽出されると（Ｓ２２でＹＥＳ）、変更後の対象種別のゲインＧ１ａを増やし（Ｓ２３）、変更前の対象種別のゲインＧ１ｂを減らす（Ｓ２４）。また、動画音声のゲインＧ０は適宜、処理前後の音量が維持されるように、各ゲインＧ１ａ，Ｇ１ｂに応じて設定される。増幅レベルバー５ｂは、例えば変更後のゲインＧ１ｂに対応するように、コントローラ１３５によって制御される。 Hereinafter, the gain of the extracted voice for the target type after the change is referred to as "G1a", and the gain of the extracted voice for the target type before the change is referred to as "G1b". When the voice of the target type after change is extracted (YES in S22), the gain G1a of the target type after change is increased (S23), and the gain G1b of the target type before change is decreased (S24). Also, the gain G0 of the moving image sound is appropriately set according to the respective gains G1a and G1b so that the volume before and after the processing is maintained. The amplification level bar 5b is controlled by the controller 135 so as to correspond to the changed gain G1b, for example.

コントローラ１３５は、変更前の対象種別のゲインＧ１ｂが最小値に到るまで（Ｓ２５）、ステップＳ２２～Ｓ２５の処理を繰り返す（Ｓ２５でＮＯ）。コントローラ１３５は、当該ゲインＧ１ｂが最小値に到ると（Ｓ２５でＹＥＳ）、変更前の対象種別についての音声抽出を停止して（Ｓ２６）、例えば図９のステップＳ５Ａに戻る。 The controller 135 repeats the processing of steps S22 to S25 (NO in S25) until the gain G1b of the target type before change reaches the minimum value (S25). When the gain G1b reaches the minimum value (YES in S25), the controller 135 stops voice extraction for the target type before change (S26), and returns to step S5A in FIG. 9, for example.
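
The loop of steps S22 to S26 behaves like a cross-fade between the two extracted sounds. The following is a minimal sketch of that loop under stated assumptions: the function name `crossfade_gains`, the step size, and the gain limits are hypothetical, and it is assumed that G1b is lowered in every control period while G1a is raised only in periods where sound of the new target type is extracted. The description does not spell out what happens on NO in S22, so that behavior is a guess.

```python
from typing import Iterable, Iterator, Tuple


def crossfade_gains(
    g1a: float,
    g1b: float,
    new_voice_extracted: Iterable[bool],
    step: float = 0.25,   # assumed change per control period
    g_min: float = 0.0,   # assumed minimum gain
    g_max: float = 1.0,   # assumed maximum gain
) -> Iterator[Tuple[float, float]]:
    """Yield (G1a, G1b) per control period until G1b reaches its minimum."""
    for extracted in new_voice_extracted:   # one flag per control period
        if extracted:                        # S22: new-type sound extracted
            g1a = min(g_max, g1a + step)     # S23: raise gain of new type
        g1b = max(g_min, g1b - step)         # S24: lower gain of old type (assumed every period)
        yield g1a, g1b
        if g1b <= g_min:                     # S25: old gain at minimum
            break                            # S26: stop extraction for old type
```

For example, `list(crossfade_gains(0.0, 1.0, [False, True, True, True, True]))` steps the old gain from 1.0 down to 0.0 over four periods while the new gain rises only once the new sound is present, and then stops.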

また、コントローラ１３５は、フォーカス対象の画像認識がない場合（Ｓ２０でＮＯ）、ステップＳ８に進む。これにより、画像認識および音声抽出の対象外の領域にフォーカスを合わすユーザの操作があった場合にも対処することができる。 If there is no image recognition of the focus target (NO in S20), the controller 135 proceeds to step S8. This makes it possible to handle the case where the user performs an operation to focus on an area that is not a target of image recognition and voice extraction.

以上の処理によると、画像認識に加えてユーザによるフォーカス対象の選択に連動して、特定の被写体の音声を強調する音声処理を実現することができる。フォーカス対象を変更するユーザ操作があった場合（Ｓ５ＡでＹＥＳ）の更なる動作例を、図１２に例示する。図１２（Ｃ）は、図１２（Ａ）の後の表示例を示す。図１２（Ｄ）は、図１２（Ｃ）に対応した音声変化を例示する。 According to the above processing, in addition to image recognition, it is possible to realize voice processing that emphasizes the voice of a specific subject in conjunction with the user's selection of a focus target. FIG. 12 illustrates another operation example when there is a user operation to change the focus target (YES in S5A). FIG. 12(C) shows a display example after FIG. 12(A). FIG. 12(D) illustrates voice changes corresponding to FIG. 12(C).

図１２（Ｃ）の例では、フォーカス対象が、猫２０から人２２に切り替えられている。このように、音声抽出の対象種別であった猫２０が、画像認識において継続的に検出されていても（Ｓ５ＢでＹＥＳ）、ユーザの操作によってフォーカス対象が人２２に切り替えられると（Ｓ５ＡでＹＥＳ）、フォーカスに連動して「人」を対象種別とする音声抽出が開始される（Ｓ２１）。また、この際の音声変化も、図１２（Ｄ）に示すように緩やかに行われ、ユーザにとって聴きやすい音声を得ることができる。 In the example of FIG. 12(C), the focus target is switched from the cat 20 to the person 22. In this way, even if the cat 20, which was the target type for voice extraction, is continuously detected in image recognition (YES in S5B), when the focus target is switched to the person 22 by the user's operation (YES in S5A), voice extraction with "person" as the target type is started in conjunction with the focus (S21). The voice change at this time is also made gradually, as shown in FIG. 12(D), so that a voice that is easy for the user to listen to can be obtained.

上記のステップＳ１Ｂ，Ｓ５Ａにおいて、フォーカス対象の被写体を選択するユーザ操作としては、例えば表示モニタ１３０における被写体２０～２２毎の検出領域Ｒ１について、タッチパネルのタッチ操作、或いは各種キーによる選択操作が挙げられる。この他にも、デジタルカメラ１００が自動的にデフォルトのフォーカス対象を選択する機能を利用したユーザ操作であってもよい。 In steps S1B and S5A described above, the user operation for selecting the subject to be focused includes, for example, a touch operation on a touch panel or a selection operation using various keys for the detection area R1 of each of the subjects 20 to 22 on the display monitor 130. . In addition, the user operation using the function of automatically selecting the default focus target of the digital camera 100 may be used.

例えば、デジタルカメラ１００のコントローラ１３５は、画像認識部１２２の検出情報Ｄ１に基づいて、画像全体における中央に位置したり、比較的大きく映っていたりする被写体をデフォルトのフォーカス対象に自動で選択してもよい。このような自動選択の機能を利用して、ユーザは、デジタルカメラ１００を向ける方向を変えたり、ズーム値を変えたりする各種の操作を行うことにより、所望の被写体をデジタルカメラ１００にフォーカス対象として選択させることができる。こうした選択の結果は、例えばフォーカス対象の表示枠Ｆ１の表示態様によって確認できる。この場合のステップＳ１Ｂ，Ｓ５Ａでも、デジタルカメラ１００では上記と同様に、フォーカス対象として選択された被写体の種別が、対象種別として設定できる。以上のようなユーザ操作に利用されるデジタルカメラ１００の各部は、本実施形態における操作部の一例である。 For example, the controller 135 of the digital camera 100 may automatically select, as the default focus target, a subject that is located at the center of the entire image or that appears relatively large, based on the detection information D1 of the image recognition unit 122. Using such an automatic selection function, the user can make the digital camera 100 select a desired subject as the focus target by performing various operations such as changing the direction in which the digital camera 100 is pointed or changing the zoom value. The result of such selection can be confirmed, for example, by the display mode of the focus target display frame F1. Also in steps S1B and S5A in this case, the digital camera 100 can set the type of the subject selected as the focus target as the target type, in the same manner as described above. Each unit of the digital camera 100 used for user operations as described above is an example of the operation unit in this embodiment.
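
The automatic choice of a default focus target can be sketched as a scoring rule over the detected subjects. The example below is only one possible interpretation: the `Detection` record, the normalized coordinates, and the score (bounding-box area minus a weighted distance from the image center) are assumptions introduced for illustration, since the description only states that a central or relatively large subject may be selected.

```python
import math
from dataclasses import dataclass
from typing import Optional, Sequence


@dataclass(frozen=True)
class Detection:
    """Hypothetical detection record; coordinates are normalized to 0..1."""
    kind: str   # subject type, e.g. "person" or "cat"
    x: float    # left edge of the detection region
    y: float    # top edge of the detection region
    w: float    # width of the detection region
    h: float    # height of the detection region


def default_focus(detections: Sequence[Detection],
                  center_weight: float = 1.0) -> Optional[Detection]:
    """Pick the subject that is large and close to the image center."""
    if not detections:
        return None

    def score(d: Detection) -> float:
        cx, cy = d.x + d.w / 2, d.y + d.h / 2
        distance = math.hypot(cx - 0.5, cy - 0.5)   # distance from image center
        return d.w * d.h - center_weight * distance  # assumed scoring rule

    return max(detections, key=score)
```

The type of the returned subject (`kind`) would then serve as the target type, in the same way as for a subject selected by touch or key operation.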

〔２－３．まとめ〕 [2-3. Summary]
以上のように、実施の形態２のデジタルカメラ１００において、表示モニタ１３０は、画像認識部１２２による被写体の検出結果を示す情報をさらに表示する。本実施形態のデジタルカメラ１００における操作部は、表示モニタ１３０によって表示された情報に基づきデジタルカメラ１００におけるフォーカスの対象とする被写体を指定するユーザの操作に従って、対象種別を設定する。これにより、ユーザの操作に従い音声抽出の対象種別を動的に設定して、ユーザ所望の種別についての音声を明瞭に得ることができる。 As described above, in the digital camera 100 of the second embodiment, the display monitor 130 further displays information indicating the subject detection result by the image recognition section 122 . The operation unit in the digital camera 100 of the present embodiment sets the target type according to the user's operation of specifying the subject to be focused in the digital camera 100 based on the information displayed by the display monitor 130 . Accordingly, the target type of voice extraction can be dynamically set in accordance with the user's operation, and the voice of the type desired by the user can be clearly obtained.

又、本実施形態においてデジタルカメラ１００（撮像装置）は、イメージセンサ１１５（撮像部）と、マイク用のＡ／Ｄコンバータ１６５（音声取得部）と、画像認識部１２２（検出部）と、表示モニタ１３０（表示部）と、音声処理エンジン１７０（音声処理部）と、操作部１５０（操作部）と、コントローラ１３５（制御部）とを備える。本実施形態の操作部は、ユーザによる自装置の操作に基づいて、検出部によって検出された被写体の中から、画像におけるフォーカス対象の被写体を選択してもよい（Ｓ１Ｂ）。音声処理部は、操作部によって選択された被写体の種別に基づいて、音声取得部によって取得された音声データを処理する（Ｓ２Ａ～Ｓ１０）。制御部は、音声処理部による処理の対象とする対象種別としてフォーカス対象の被写体の種別を示す対象種別情報の一例として対象種別マーク５ａを表示部に表示させる（Ｓ２Ａ，図８（Ｃ）等）。これにより、ユーザは、現在の対象種別を確認しながら動作の撮影等を行え、ユーザの意図に沿って被写体による音声を明瞭に得ることを行い易くできる。 In this embodiment, the digital camera 100 (imaging device) includes an image sensor 115 (imaging unit), an A/D converter 165 for a microphone (sound acquisition unit), an image recognition unit 122 (detection unit), a display monitor 130 (display unit), an audio processing engine 170 (audio processing unit), an operation unit 150 (operation unit), and a controller 135 (control unit). The operation unit of the present embodiment may select a subject to be focused in the image from among the subjects detected by the detection unit, based on the user's operation of the device itself (S1B). The audio processing unit processes the audio data acquired by the audio acquisition unit based on the type of the subject selected by the operation unit (S2A to S10). The control unit causes the display unit to display the target type mark 5a as an example of target type information that indicates the type of the subject to be focused as the target type to be processed by the audio processing unit (S2A, FIG. 8(C), etc.). As a result, the user can shoot the action while confirming the current target type, and can easily obtain the voice of the subject clearly according to the user's intention.

本実施形態において、制御部は、表示部にさらに、音声処理部が選択された被写体の音声を強調又は抑制するレベルを示す強調レベル情報の一例として増幅レベルバー５ｂを表示させる（Ｓ４，Ｓ８，図８（Ｃ）等）。これにより、ユーザは、動画等の撮影中に得られる音声が強調または抑制される程度を確認でき、ユーザの意図に沿った音声取得を行い易くできる。 In this embodiment, the control unit further causes the display unit to display the amplification level bar 5b as an example of emphasis level information indicating the level of emphasis or suppression of the sound of the subject selected by the sound processing unit (S4, S8, FIG. 8(C), etc.). As a result, the user can confirm the degree to which the sound obtained during shooting of a moving image or the like is emphasized or suppressed, and can easily obtain the sound according to the user's intention.

本実施形態において、フォーカス対象の被写体が変更される際に変更前後の被写体の種別が異なった場合（Ｓ５ＢでＹＥＳ）、制御部は、変更後の種別を対象種別として示すように対象種別情報を更新して表示部に表示させてもよい（Ｓ２１，図１２（Ａ），（Ｃ）等）。これにより、ユーザは、撮影中に動的に変化する対象種別を確認でき、ユーザの意図に沿った被写体の音声取得を行い易くできる。 In this embodiment, when the type of the subject before and after the change is different when the subject to be focused is changed (YES in S5B), the control unit stores the target type information so as to indicate the type after the change as the target type. It may be updated and displayed on the display unit (S21, FIGS. 12A and 12C, etc.). As a result, the user can confirm the subject type that dynamically changes during shooting, and can easily acquire the voice of the subject in accordance with the user's intention.

（実施の形態３） (Embodiment 3)
以下、図１３～図１４を用いて実施の形態３を説明する。実施の形態１，２のデジタルカメラ１００は、画像認識に連動して特定の種別の音声抽出を行った。実施の形態３では、さらに、画像認識に連動して収音の指向性を制御するデジタルカメラについて説明する。 Embodiment 3 will be described below with reference to FIGS. 13 and 14. The digital camera 100 of Embodiments 1 and 2 extracts a specific type of voice in conjunction with image recognition. Embodiment 3 will further describe a digital camera that controls the directivity of sound pickup in conjunction with image recognition.

以下、実施の形態１，２に係るデジタルカメラ１００と同様の構成および動作の説明は適宜、省略して、本実施形態に係るデジタルカメラについて説明する。 Hereinafter, the digital camera according to the present embodiment will be described, omitting the description of the same configuration and operation as those of the digital camera 100 according to the first and second embodiments.

〔３－１．構成〕 [3-1. Configuration]
図１３は、実施の形態３に係るデジタルカメラ１００Ａの構成を示す図である。本実施形態のデジタルカメラ１００Ａは、実施の形態１，２のデジタルカメラ１００と同様の構成において、複数のマイク１６０Ａを備え、さらにビーム形成部１６２を備え、収音される音声の指向性を生成する。本実施形態のマイク１６０Ａは、例えば３個又はそれ以上のマイクロフォン素子を含み、素子間で互いに位置決めして配置される。 FIG. 13 is a diagram showing the configuration of a digital camera 100A according to the third embodiment. The digital camera 100A of the present embodiment has a configuration similar to that of the digital camera 100 of the first and second embodiments, but includes a plurality of microphones 160A and further includes a beam forming unit 162 to give directivity to the collected sound. The microphone 160A of this embodiment includes, for example, three or more microphone elements, which are arranged at fixed positions relative to one another.

ビーム形成部１６２は、例えばマイク１６０Ａの各素子の遅延期間を調整する回路であり、マイク１６０Ａで収音された音声を、所望の向き及び幅に形成する。ビーム形成部１６２によると、マイク１６０Ａが収音する物理的な範囲を設定できる。ビーム形成部１６２は、マイク１６０Ａ又はＡ／Ｄコンバータ１６５と一体的に構成されてもよいし、ビーム形成部１６２の機能が音声処理エンジン１７０に実装されてもよい。 The beam forming unit 162 is, for example, a circuit that adjusts the delay period of each element of the microphone 160A, and forms the sound picked up by the microphone 160A in a desired direction and width. The beam forming unit 162 can set the physical range in which the microphone 160A picks up sound. The beam forming section 162 may be configured integrally with the microphone 160A or the A/D converter 165, or the functions of the beam forming section 162 may be implemented in the audio processing engine 170.
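
Adjusting a delay per microphone element is the basis of delay-and-sum beamforming, a standard technique that is named here as an illustration and is not stated to be the method used by the beam forming unit 162. The sketch below computes steering delays under several assumptions: a linear array, a far-field source, a speed of sound of 343 m/s, and hypothetical names and parameters (`steering_delays`, `positions_m`, `angle_deg`).

```python
import math
from typing import List, Sequence

SPEED_OF_SOUND = 343.0  # m/s, assumed value for air at room temperature


def steering_delays(positions_m: Sequence[float], angle_deg: float) -> List[float]:
    """Per-element delays (seconds) that steer a linear array toward angle_deg.

    positions_m: element positions along the array axis, in metres.
    angle_deg:   steering angle measured from broadside (0 = straight ahead).
    """
    s = math.sin(math.radians(angle_deg))
    # A far-field wavefront from angle_deg reaches an element at position x
    # earlier by x*s/c, so that element is delayed by the same amount.
    raw = [x * s / SPEED_OF_SOUND for x in positions_m]
    base = min(raw)
    return [t - base for t in raw]  # shift so that every delay is non-negative
```

Summing the element signals after applying these delays reinforces sound arriving from the chosen direction and partially cancels sound from other directions, which is one way a pickup direction and width can be formed.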

〔３－２．動作〕 [3-2. Operation]
図１４は、実施の形態３に係るデジタルカメラ１００Ａの動作を例示するフローチャートである。本実施形態のデジタルカメラ１００Ａにおいて、コントローラ１３５は、実施の形態１，２と同様の処理に加えて、画像認識部１２２による検出情報Ｄ１に基づきマイク１６０Ａの収音範囲を可変するビーム形成部１６２を制御する（Ｓ３０，Ｓ３１）。図１４では、フォーカス優先モード（図９）において収音範囲が動的に設定される動作例を説明する。 FIG. 14 is a flowchart illustrating the operation of the digital camera 100A according to the third embodiment. In the digital camera 100A of the present embodiment, in addition to the same processing as in the first and second embodiments, the controller 135 controls the beam forming unit 162, which varies the sound pickup range of the microphone 160A, based on the detection information D1 from the image recognition unit 122 (S30, S31). FIG. 14 illustrates an operation example in which the sound pickup range is dynamically set in the focus priority mode (FIG. 9).

コントローラ１３５は、例えばフォーカス対象の被写体が選択されると（Ｓ１ＢでＹＥＳ）、そのときの画像認識部１２２の検出情報Ｄ１に基づいて、マイク１６０Ａが当該被写体の方向からの音を収音するようにビーム形成部１６２を制御する（Ｓ３０）。ビーム形成部１６２は、検出情報Ｄ１における特定の被写体の検出領域Ｒ１の位置およびサイズに応じて、マイク１６０Ａのビームを形成する。これにより、画像認識に応じた収音範囲においてマイク１６０Ａの収音が行われ、当該収音範囲の音声データに対して対象種別の音声抽出が適用される（Ｓ２Ａ）。 For example, when a subject to be focused is selected (YES in S1B), the controller 135 controls the beam forming unit 162, based on the detection information D1 of the image recognition unit 122 at that time, so that the microphone 160A picks up sound from the direction of that subject (S30). The beam forming unit 162 forms the beam of the microphone 160A according to the position and size of the detection region R1 of the specific subject in the detection information D1. As a result, sound is picked up by the microphone 160A in the sound pickup range according to the image recognition, and voice extraction for the target type is applied to the audio data in that sound pickup range (S2A).
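
How the position and size of a detection region could translate into a beam direction and width can be illustrated with a simple pinhole-camera model. Everything specific in this sketch is an assumption: the function name `region_to_beam`, the use of normalized horizontal coordinates, the known horizontal field of view `hfov_deg`, and the idea that the microphone array shares the optical axis of the lens.

```python
import math
from typing import Tuple


def region_to_beam(x: float, w: float, hfov_deg: float) -> Tuple[float, float]:
    """Map a detection region to (beam direction, beam width) in degrees.

    x, w:      left edge and width of the region, normalized to 0..1.
    hfov_deg:  horizontal field of view of the lens (assumed to be known).
    """
    half_tan = math.tan(math.radians(hfov_deg) / 2)

    def angle(u: float) -> float:
        # Pinhole model: image position u in 0..1 -> angle from the optical axis.
        return math.degrees(math.atan((2 * u - 1) * half_tan))

    direction = angle(x + w / 2)      # toward the center of the region
    width = angle(x + w) - angle(x)   # angle spanned by the region
    return direction, width
```

A region centered in the frame yields a direction of 0 degrees, and a region covering the whole frame yields a width equal to the field of view, so a subject that moves or changes apparent size changes the pickup range accordingly.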

また、コントローラ１３５は、フォーカス対象の画像認識が継続している場合（Ｓ５ＢでＹＥＳ）も逐次、ステップＳ３０と同様にビーム形成部１６２を制御してマイク１６０Ａの収音範囲を動的に設定する（Ｓ３１）。これにより、例えばフォーカス対象の被写体が移動したり、別の被写体に変更されたりすることに応じて、マイク１６０Ａの収音範囲が変更される。 Also, when the image recognition of the focus target is continued (YES in S5B), the controller 135 sequentially controls the beam forming unit 162 to dynamically set the sound pickup range of the microphone 160A in the same manner as in step S30. (S31). As a result, the sound pickup range of the microphone 160A is changed, for example, when the subject to be focused moves or is changed to another subject.

以上の処理によると、画像認識部１２２の検出結果に応じてマイク１６０Ａの収音範囲がフォーカス対象の被写体に向けられ、当該被写体からの音声をより明瞭に得ることができる。以上の説明では、ビーム形成部１６２によるマイク１６０Ａの収音範囲の制御が、フォーカス優先モードで行われる例を説明したが、特にこれに限らず、人優先モードなど他の動作モードで行われてもよい。 According to the above processing, the sound pickup range of the microphone 160A is directed toward the subject to be focused according to the detection result of the image recognition unit 122, and the voice from the subject can be obtained more clearly. The above description covered an example in which the beam forming unit 162 controls the sound pickup range of the microphone 160A in the focus priority mode; however, this control is not limited to that mode and may be performed in other operation modes such as the person priority mode.

〔３－３．まとめ〕 [3-3. Summary]
以上のように、実施の形態３のデジタルカメラ１００Ａは、ビーム形成部１６２をさらに備える。ビーム形成部１６２は、画像認識部１２２の検出結果に応じてマイク１６０Ａが収音する範囲を変更する。これにより、画像認識部１２２に検出された被写体からの音声をより明瞭に得ることができる。 As described above, digital camera 100A of the third embodiment further includes beam forming section 162 . The beam forming unit 162 changes the sound pickup range of the microphone 160A according to the detection result of the image recognition unit 122 . As a result, the voice from the subject detected by the image recognition unit 122 can be obtained more clearly.

（他の実施の形態） (Other embodiments)
以上のように、本出願において開示する技術の例示として、実施の形態１～３を説明した。しかしながら、本開示における技術は、これに限定されず、適宜、変更、置き換え、付加、省略などを行った実施の形態にも適用可能である。また、上記実施の形態１で説明した各構成要素を組み合わせて、新たな実施の形態とすることも可能である。 As described above, Embodiments 1 to 3 have been described as examples of the technology disclosed in the present application. However, the technology in the present disclosure is not limited to this, and can be applied to embodiments in which modifications, replacements, additions, omissions, etc. are made as appropriate. Also, it is possible to combine the constituent elements described in the first embodiment to form a new embodiment.

上記の実施の形態１，２では、デジタルカメラ１００の人優先モード及びフォーカス優先モードについて説明した。このような動作モードは、操作部１５０におけるユーザの操作によって設定可能であり、例えばデジタルカメラ１００は表示モニタ１３０にメニュー画面を表示し、上記の動作モードを選択可能に構成されてもよい。 In the above first and second embodiments, the human-priority mode and focus-priority mode of the digital camera 100 have been described. Such an operation mode can be set by a user's operation on the operation unit 150. For example, the digital camera 100 may display a menu screen on the display monitor 130 so that the operation mode can be selected.

上記の各実施形態においては、第１の種別の一例として種別「人」、及び第２の種別の一例として種別「猫」を例示したが、第１及び第２の種別は上記に限らず、様々な種別であってもよい。例えば、第２の種別は、「猫」に限らず「犬」或いは「鳥」など各種の動物であってもよいし、人以外の各種の動物を含む種別「動物」であってもよい。また、人又は動物に限らず、例えば列車或いは楽器といった特有の音を有する物体が、適宜種別に採用されてもよい。こうした物体からの音は、例えば背景音として強調／抑制の対象とされ得る。さらに、第１の種別は不特定の「人」に限らず、例えば特定の個人であってもよい。この場合、第２の種別は、第１の種別と異なる個人であってもよい。 In each of the above embodiments, the type "person" is used as an example of the first type, and the type "cat" is used as an example of the second type, but the first and second types are not limited to the above. It may be of various types. For example, the second type may be not only "cat" but also various animals such as "dog" or "bird", or may be the type "animal" including various animals other than humans. In addition, not only people or animals, but also objects having unique sounds such as trains or musical instruments may be appropriately adopted as types. Sounds from such objects may be targeted for enhancement/suppression, for example as background sounds. Furthermore, the first type is not limited to an unspecified "person", and may be, for example, a specific individual. In this case, the second type may be an individual different from the first type.

すなわち、本実施形態において、第１及び第２の種別は、それぞれ人、人以外の動物、および背景音を有する物体のうちの何れかに関する種々の種別に設定されてもよい。また、デジタルカメラ１００に設定される複数の種別は、第１及び第２の種別以外の種別をさらに含んでもよい。 That is, in the present embodiment, the first and second types may be set to various types related to any one of humans, animals other than humans, and objects having background sounds. Also, the plurality of types set in the digital camera 100 may further include types other than the first and second types.

以上のような様々な種別であっても、例えば機械学習において各々の種別に応じた画像と音声の学習用のデータセットを用意することにより、上記各実施形態と同様の動作が実現可能である。又、こうした様々な種別であっても、画像認識部１２２と音声処理エンジン１７０とに設定する種別を互い対応付けることにより、上記各実施形態と同様に、画像認識部１２２に連動して音声処理エンジン１７０で所望の種別の音声を強調／抑制できる。なお、画像認識部１２２と音声処理エンジン１７０とに設定される種別は必ずしも同一でなくてもよく、例えば画像認識部１２２に設定される種別が、音声処理エンジン１７０に設定される種別よりも細分化されていてもよい。又、画像認識部１２２に設定される種別の中に、特に音声処理の対象種別とせず、音声処理エンジン１７０に設定されない種別が含まれてもよい。 Even with various types as described above, the same operation as in each of the above embodiments can be realized, for example, by preparing image and sound training data sets for each type in machine learning. Moreover, even with such various types, by associating the types set in the image recognition unit 122 with the types set in the audio processing engine 170, the audio processing engine 170 can emphasize or suppress sound of a desired type in conjunction with the image recognition unit 122, as in the above embodiments. Note that the types set in the image recognition unit 122 and the audio processing engine 170 need not be identical. For example, the types set in the image recognition unit 122 may be more finely subdivided than the types set in the audio processing engine 170. Further, the types set in the image recognition unit 122 may include types that are not treated as target types for audio processing and are therefore not set in the audio processing engine 170.
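
The association between image-recognition types and audio-processing types can be pictured as a lookup table. The table below is purely hypothetical: the type names and the grouping are invented for illustration, with finer image types mapping onto a coarser audio type and `None` marking an image type that has no audio-processing counterpart.

```python
from typing import Dict, Optional

# Hypothetical mapping: image-recognition type -> audio-processing type.
IMAGE_TO_AUDIO_TYPE: Dict[str, Optional[str]] = {
    "person": "person",
    "cat": "animal",    # finer image types share one coarser audio type
    "dog": "animal",
    "bird": "animal",
    "train": "train",
    "flower": None,     # detected in images, but not a target of audio processing
}


def audio_target_type(image_type: str) -> Optional[str]:
    """Return the audio type for a detected image type, or None if there is none."""
    return IMAGE_TO_AUDIO_TYPE.get(image_type)
```

When the lookup returns `None`, no emphasis or suppression would be applied for that subject, which corresponds to a type that is set in the image recognition unit 122 but not in the audio processing engine 170.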

In each of the above embodiments, the target type mark 5a was given as an example of the target type information, and the amplification level bar 5b as an example of the emphasis level information. In the present embodiment, the target type information is not limited to the target type mark 5a and may be, for example, text information such as the name of the target type, or an image such as a thumbnail. Likewise, the emphasis level information is not limited to the amplification level bar 5b and may be, for example, text information such as a number indicating the level of emphasis or suppression, or a chart such as a pie chart. The target type information and the emphasis level information may also be displayed as separate, independent icons.

In each of the above embodiments, the digital camera 100 including the image recognition unit 122 was described. In the present embodiment, the image recognition unit 122 may be provided in an external server. In this case, the digital camera 100 may transmit the image data of a captured image to the external server via the communication module 155 and receive from the external server the detection information D1 resulting from processing by the image recognition unit 122. In such a digital camera 100, the communication module 155 functions as the detection unit. Functions of the audio processing engine 170, such as the audio extraction unit 174, may likewise be performed by an external server.
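The idea that the communication module can take the place of the detection unit can be sketched as two interchangeable detectors behind one interface. This is a minimal sketch under stated assumptions: the names `Detection`, `Detector`, `OnDeviceDetector`, `RemoteDetector`, and `target_detected` are hypothetical, and the `transport` callable merely stands in for whatever exchange the communication module 155 performs with the external server. The format of the detection information D1 is not specified here, so the reply is assumed to be a list of dictionaries with "type" and "box" keys.

```python
from dataclasses import dataclass
from typing import Callable, List, Protocol, Tuple


@dataclass
class Detection:
    subject_type: str                # e.g. "person" or "cat" (hypothetical labels)
    box: Tuple[int, int, int, int]   # (x, y, width, height) in pixels


class Detector(Protocol):
    """Anything that turns image data into detections can act as the detection unit."""

    def detect(self, image: bytes) -> List[Detection]: ...


class OnDeviceDetector:
    """Detection performed inside the camera by a local recognizer."""

    def __init__(self, recognize: Callable[[bytes], List[Detection]]):
        self._recognize = recognize

    def detect(self, image: bytes) -> List[Detection]:
        return self._recognize(image)


class RemoteDetector:
    """Detection delegated to an external server through a transport callable."""

    def __init__(self, transport: Callable[[bytes], List[dict]]):
        # 'transport' sends the image data and returns the server's decoded reply.
        self._transport = transport

    def detect(self, image: bytes) -> List[Detection]:
        reply = self._transport(image)
        return [Detection(r["type"], tuple(r["box"])) for r in reply]


def target_detected(detector: Detector, image: bytes, target_type: str) -> bool:
    """The audio side only needs to know whether the target type was detected."""
    return any(d.subject_type == target_type for d in detector.detect(image))
```

Because both classes expose the same `detect` method, the audio processing side does not need to know whether recognition ran on the device or on a server.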

In each of the above embodiments, the digital camera 100 including the optical system 110 and the lens driving unit 112 was given as an example. The imaging device of the present embodiment need not include the optical system 110 and the lens driving unit 112, and may be, for example, an interchangeable-lens camera.

In each of the above embodiments, a digital camera was described as an example of the imaging device, but the imaging device is not limited to this. The imaging device of the present disclosure may be any electronic device having an image capturing function (for example, a video camera, a smartphone, or a tablet terminal).

As described above, the embodiments have been described as examples of the technology of the present disclosure. The accompanying drawings and detailed description have been provided for that purpose.

Accordingly, the components described in the accompanying drawings and detailed description may include not only components essential for solving the problem but also components that are not essential for solving the problem and are included to illustrate the above technology. Therefore, the mere fact that such non-essential components appear in the accompanying drawings or detailed description should not immediately lead to the conclusion that they are essential.

In addition, since the above embodiments are intended to illustrate the technology of the present disclosure, various modifications, replacements, additions, omissions, and the like can be made within the scope of the claims or their equivalents.

The present disclosure is applicable to an imaging device that captures images while acquiring sound.

100, 100A Digital camera
115 Image sensor
120 Image processing engine
122 Image recognition unit
130 Display monitor
135 Controller
150 Operation unit
160, 160A Microphone
162 Beam forming unit
165 A/D converter for microphone
170 Audio processing engine
172 Noise suppression unit
174 Audio extraction unit
176 Emphasis processing unit

Claims

1. An imaging device comprising:
an imaging unit that captures an image of a subject and generates image data;
an audio acquisition unit that acquires audio data representing audio during image capture by the imaging unit;
a detection unit that detects a subject and its type based on the image data generated by the imaging unit;
an audio processing unit that processes the audio data acquired by the audio acquisition unit based on the type of subject detected by the detection unit;
an operation unit that sets, based on a user's operation of the device, a target type to be processed by the audio processing unit from among a plurality of types including a first type indicating a person and a second type indicating a subject other than a person; and
a display unit that displays an image indicated by the image data and information indicating a subject detection result by the detection unit,
wherein the operation unit, in accordance with a user operation that designates a subject to be focused on in the device based on the information displayed by the display unit during moving-image shooting, sets the type of the subject designated as the focus target by the user operation as the target type, and
the audio processing unit processes the audio data acquired when a subject of the target type is detected in the image data so as to emphasize or suppress the audio corresponding to the target type in the audio data.

2. An imaging device comprising:
an imaging unit that captures an image of a subject and generates image data;
an audio acquisition unit that acquires audio data representing audio during image capture by the imaging unit;
a detection unit that detects a subject and its type based on the image data generated by the imaging unit;
an audio processing unit that processes the audio data acquired by the audio acquisition unit based on the type of subject detected by the detection unit; and
an operation unit that sets, based on a user's operation of the device, a target type to be processed by the audio processing unit from among a plurality of types including a first type and a second type different from the first type,
wherein the imaging device has an operation mode according to the type of the subject,
the operation unit sets the target type in accordance with a user operation that selects an operation mode of the imaging device, and
the audio processing unit does not emphasize the audio of a type different from the target type when a subject of that type is detected by the detection unit, and processes the audio data acquired when a subject of the target type is detected in the image data so as to emphasize or suppress the audio corresponding to the target type in the acquired audio data.

3. The imaging device according to claim 1, wherein the display unit displays target type information indicating the target type.

4. The imaging device according to claim 1, wherein the first and second types are each set to a type relating to any one of humans, animals other than humans, and objects having background sounds.

5. The imaging device according to claim 1, wherein the audio processing unit gradually increases an amplification factor for emphasizing the audio corresponding to the target type from the time the detection unit detects a subject of the target type.

6. The imaging device according to claim 5, wherein the audio processing unit gradually decreases the amplification factor when the subject of the target type is no longer detected after the detection unit has detected the subject of the target type.

7. The imaging device according to claim 1, further comprising a sound pickup unit that picks up sound,
wherein the audio acquisition unit acquires audio data indicating a result of sound pickup by the sound pickup unit.

8. The imaging device according to claim 7, further comprising a beam forming unit that changes the range over which the sound pickup unit picks up sound according to a detection result of the detection unit.

9. An imaging device comprising:
an imaging unit that captures an image of a subject and generates image data;
an audio acquisition unit that acquires audio data representing audio during image capture by the imaging unit;
a detection unit that detects a subject and its type based on the image data generated by the imaging unit;
a display unit that displays an image indicated by the image data;
an operation unit that selects, based on a user's operation of the device, a subject to be focused on in the image from among the subjects detected by the detection unit;
an audio processing unit that processes the audio data acquired by the audio acquisition unit so as to emphasize or suppress audio in the audio data based on the type of the subject selected by the operation unit; and
a control unit that causes the display unit to display target type information indicating a target type to be processed by the audio processing unit,
wherein the control unit controls the display unit such that the target type information indicates, as the target type, the type of the subject selected as the focus target by the user operation on the operation unit.

10. The imaging device according to claim 9, wherein the control unit further causes the display unit to display emphasis level information indicating a level at which the audio processing unit emphasizes or suppresses the audio of the selected subject.

11. The imaging device according to claim 9 or 10, wherein, when the subject to be focused on is changed and the type of the subject differs before and after the change, the control unit updates the target type information so as to indicate the type after the change as the target type and causes the display unit to display the updated target type information.