JP2021056885A - Detector, detection method, and program

Info

Publication number: JP2021056885A
Application number: JP2019180711A
Authority: JP
Inventors: 敬正角田; Norimasa Kadota
Original assignee: Canon Inc
Current assignee: Canon Inc
Priority date: 2019-09-30
Filing date: 2019-09-30
Publication date: 2021-04-08

Abstract

To reduce the processing cost of a detector to improve detection accuracy. SOLUTION: One or more objects are detected from a captured image. In accordance with the positions of the one or more objects detected from the captured image at a first time, a detection target region for the one or more objects is set in the captured image at a second time following the first time, the region being referenced by the detection means. SELECTED DRAWING: Figure 6

Description

The present invention relates to a detection device, a detection method, and a program.

There are techniques for estimating the position of a subject using a fixed camera. Most of these techniques estimate the trajectory of a subject by detecting the subject in a plurality of temporally consecutive images and determining whether the detections are the same subject. For example, Patent Document 1 discloses a method of performing pan-tilt-zoom control and tracking based on prediction and updating of the motion of a tracking target object using a state space model.

Further, in recent years, many techniques have been proposed that perform multi-category object detection at high speed by using a convolutional neural network (hereinafter referred to as a CNN). For example, in the technique disclosed in Non-Patent Document 1, a 20-category object detection problem can be executed at 81 frames per second by inputting an input image of size 352 × 352 into a neural network.

On the other hand, images captured by a typical surveillance camera have a higher resolution, for example 1920 × 1080. If an image of such a size is resized to a smaller size and input to the CNN, the detection accuracy for the subject decreases. Non-Patent Document 2 discloses a method of selecting, from an image obtained by resizing the original image to a lower resolution, partial regions to be selectively zoomed in on for detecting a subject.

Patent Document 1: Japanese Patent No. 5018321
Non-Patent Document 1: Joseph Redmon, Ali Farhadi, "YOLO9000: Better, Faster, Stronger", CVPR 2017
Non-Patent Document 2: Mingfei Gao, Ruichi Yu, Ang Li, Vlad I. Morariu, Larry S. Davis, "Dynamic Zoom-in Network for Fast Object Detection in Large Images", arXiv:1711.05187v1

However, in the method described in Non-Patent Document 2, processing of the original image using a CNN-based detector, which has a high processing cost, is always performed at every time step in order to determine the partial regions, and this is a processing bottleneck.

An object of the present invention is to reduce the processing cost of subject detection processing.

In order to achieve the object of the present invention, for example, a detection device according to one embodiment has the following configuration. That is, the detection device comprises: detection means for detecting one or more subjects from a captured image; and setting means for setting, in accordance with the positions of the one or more subjects detected by the detection means from the captured image at a first time, a detection target area for the one or more subjects in the captured image at a second time following the first time, the detection target area being referenced by the detection means.

The processing cost of the subject detection processing can be reduced.

FIG. 1: A diagram showing an example of a captured image in the detection device according to Embodiment 1.
FIG. 2: A diagram showing an example of the functional configuration of the detection device according to Embodiments 1 to 3.
FIG. 3: A flowchart of a processing example in the detection method according to Embodiments 1 to 3.
FIG. 4: A diagram showing an example of creating candidate areas in the detection method according to Embodiment 1.
FIG. 5: A diagram showing a configuration example of candidate areas in the detection method according to Embodiment 1.
FIG. 6: A diagram showing an example of setting detection target areas in the detection method according to Embodiment 1.
FIG. 7: A diagram showing an example of a candidate area list in the detection method according to Embodiment 1.
FIG. 8: A diagram showing an example of the detection device according to Embodiment 3.
FIG. 9: A flowchart showing a setting example in the detection method according to Embodiment 3.
FIG. 10: A diagram showing an example of setting detection target areas in the detection method according to Embodiment 3.
FIG. 11: A diagram showing an example of a list of candidate areas in the detection method according to Embodiment 3.
FIG. 12: A diagram showing an example of a detection target area list in the detection method according to Embodiment 3.
FIG. 13: A diagram showing an example of the functional configuration of the detection device according to Embodiment 4.
FIG. 14: A flowchart of a processing example in the detection method according to Embodiment 4.
FIG. 15: A flowchart showing a setting example in the detection method according to Embodiment 4.
FIG. 16: A diagram showing an example of the configuration of the imaging system according to Embodiment 1.
FIG. 17: A diagram showing an example of the internal configuration of the detection device according to Embodiment 1.
FIG. 18: A diagram showing an example of the internal configuration of the client device according to Embodiment 1.

Hereinafter, embodiments will be described in detail with reference to the accompanying drawings. The following embodiments do not limit the invention according to the claims. Although a plurality of features are described in the embodiments, not all of these features are essential to the invention, and the features may be combined arbitrarily. Further, in the accompanying drawings, the same or similar configurations are given the same reference numbers, and duplicate descriptions are omitted.

[Embodiment 1]
FIG. 16 is a block diagram showing an example of the configuration of the imaging system 1600 according to the present embodiment. The imaging system 1600 shown in FIG. 16 includes a detection device 1605, a client device 1602 connected to it via a network 1601 so that the two can communicate with each other, an input device 1603, and a display device 1604. The detection device 1605 may be, for example, a surveillance camera or a network camera that captures moving images and performs image processing on them.

FIG. 17 is a block diagram showing an example of the internal configuration of the detection device 1605 according to the present embodiment. The optical unit 1701 is composed of a focus lens, a blur correction lens, an aperture, and a shutter, and collects light information from the subject. The image sensor unit 1702 is an element that converts the light information collected by the optical unit 1701 into a current value, and acquires color information in combination with a color filter or the like. The image sensor is one in which an arbitrary exposure time can be set for all pixels. The CPU 1703 is involved in all processing of each component: it sequentially reads and interprets instructions stored in the ROM (Read Only Memory) 1704 and the RAM (Random Access Memory) 1705, and executes processing according to the result. By reading various programs stored in the ROM 1704 or the like into the RAM 1705 and executing them, the CPU 1703 executes each process according to the present embodiment and controls the transmission and reception of various information to and from the client device 1602.

Further, the imaging system control unit 1706 controls the optical unit 1701 as instructed by the CPU 1703, for example by focusing, opening the shutter, and adjusting the aperture. The control unit 1707 performs control such as controlling the imaging range of the detection device 1605 in response to instructions from the client device 1602. The A/D conversion unit 1708 converts the amount of light from the subject detected by the optical unit 1701 into a digital signal value. The image processing unit 1709 performs image processing on the image data of this digital signal. The encoder unit 1710 converts the image data processed by the image processing unit 1709 into a file format such as Motion JPEG, H.264, or H.265. The still image or moving image data generated by the conversion processing in the encoder unit 1710 is provided to the client device 1602 via the network 1601 as a "distribution image". The network I/F 1711 is an interface used for communication with external devices such as the client device 1602 via the network 1601.

The network 1601 is a network that connects the detection device 1605 and the client device 1602. The network 1601 is composed of a plurality of routers, switches, cables, and the like that satisfy a communication standard such as Ethernet (registered trademark). In the present embodiment, the network 1601 may be any network that allows communication between the detection device 1605 and the client device 1602, regardless of its communication standard, scale, or configuration. For example, the network 1601 may be configured by the Internet, a wired LAN (Local Area Network), a wireless LAN (Wireless LAN), a WAN (Wide Area Network), or the like.

FIG. 18 is a block diagram showing an example of the internal configuration of the client device 1602 corresponding to the present embodiment. The client device 1602 includes a CPU 1801, a main storage device 1802, an auxiliary storage device 1803, an input I/F 1804, an output I/F 1805, and a network I/F 1806. These elements are connected via a system bus so that they can communicate with each other. The client device 1602 can operate as a setting device for making various settings of the detection device 1605.

The CPU 1801 controls the operation of the client device 1602. The main storage device 1802 is a storage device such as a RAM that functions as a temporary storage location for data of the CPU 1801. The auxiliary storage device 1803 is a storage device such as an HDD, a ROM, or an SSD that stores various programs, various setting data, and the like. The input I/F 1804 is an interface used for receiving input from the input device 1603 or the like. The output I/F 1805 is an interface used for outputting information to the display device 1604 or the like. The network I/F 1806 is an interface used for communication with external devices such as the detection device 1605 via the network 1601. The client device 1602 can acquire captured images or video from the detection device 1605 via the network I/F 1806 and store them. The client device 1602 may function as a server that stores and provides such images. The location in which the client device 1602 stores various programs, various setting data, and the like is not limited to the auxiliary storage device 1803. For example, the client device 1602 may store such data in an external storage unit (not shown), such as a server or a storage device, via the network I/F 1806.

By reading various programs stored in the auxiliary storage device 1803 into the main storage device 1802 and executing them, the CPU 1801 executes each process according to the present embodiment and controls the transmission and reception of various information to and from the detection device 1605. The CPU 1801 also receives input from the input device 1603 via the input I/F 1804 and controls the display of images and various information on the display device 1604 via the output I/F 1805. Further, the client device 1602 may use the auxiliary storage device 1803 and an external storage unit (not shown).

The input device 1603 is an input device composed of a mouse, a keyboard, a touch panel, buttons, and the like. The display device 1604 is a display device, such as a display monitor, that displays images output by the client device 1602. In the present embodiment, the client device 1602, the input device 1603, and the display device 1604 can each be independent devices. In this case, for example, the client device 1602 can be configured as a personal computer (PC), the input device 1603 can be a mouse or keyboard connected to the PC, and the display device 1604 can be a display connected to the PC. Other configurations are also possible: the client device 1602 and the display device 1604 may be integrated, or the input device 1603 and the display device 1604 may be integrated as in a touch panel. The client device 1602, the input device 1603, and the display device 1604 may also all be integrated, as in a smartphone or a tablet terminal. Further, the display device 1604 may function as the monitoring unit 1300 described later.

The detection device according to the present embodiment detects one or more subjects from the captured image at a first time and, in accordance with the positions of the detected subjects, sets in the captured image a detection target area for the subjects at a second time following the first time. For such processing, the detection device 1000 according to the embodiment shown in FIG. 2A has an imaging unit 1100 and a processing unit 1200. Here, the detection device 1000 may be the detection device 1605 shown in FIG. 16. In this case, the processing of the processing unit 1200 can be realized by the control unit 1707 of the detection device 1605. Further, the detection device according to an embodiment of the present invention may be composed of a plurality of devices connected via a network. For example, the functions of the detection device 1000 shown in FIG. 16 may be realized by the detection device 1605 and the client device 1602 shown in FIG. 16. For example, the detection device 1605 may be used as the imaging unit 1100, and the client device 1602 may be used as the processing unit 1200. In this case, the processing of the processing unit 1200 can be realized by the CPU 1801 of the client device 1602.

FIG. 2 is a block diagram showing an example of the functional configuration of the detection device according to each embodiment, and FIG. 2A shows an example of the detection device according to Embodiment 1. The imaging unit 1100 has a moving image acquisition unit 1001. The moving image acquisition unit 1001 acquires images captured by an imaging device. In this example, the moving image acquisition unit 1001 can acquire, for example, a captured image of a predetermined area including a subject. The resolution of the images captured by the moving image acquisition unit 1001 is not particularly limited, but in the present embodiment, for the sake of explanation, the moving image acquisition unit 1001 is assumed to acquire captured images at FHD resolution (1920 × 1080 pixels). The moving image acquisition unit 1001 can acquire captured images at predetermined time intervals. For example, the moving image acquisition unit 1001 may capture images at a rate of 30 frames per second, at intervals of several tens of milliseconds, or at wider intervals. The moving image acquisition unit 1001 can also output the acquired captured images to the processing unit 1200. The imaging unit 1100 is connected to the processing unit 1200. The means of connecting the imaging unit 1100 and the processing unit 1200 is not particularly limited. The imaging unit 1100 and the processing unit 1200 may be connected via a communication path such as a local area network, or may be connected by wire via a USB cable or the like. Alternatively, for example, the imaging unit 1100 may store the output captured images in a storage device (not shown), and the processing unit 1200 may acquire predetermined frames from that storage device.

In the example of FIG. 2A, the processing unit 1200 has an initial value setting unit 1002, a detection unit 1003, an association unit 1004, an area setting unit 1005, and a visualization unit 1006. When tracking a subject captured by the imaging unit 1100, each unit of the processing unit 1200 can perform its processing repeatedly. The initial value setting unit 1002 makes the initial setting of the detection target areas that are set in the captured image and used when the detection unit 1003 first detects subjects. The detection unit 1003 detects one or more subjects from the detection target areas in the captured image. The association unit 1004 associates the image of a subject detected in the previous iteration with the image of the subject detected this time, or, in the first iteration, assigns identification information to each subject. The area setting unit 1005 sets, in the captured image, the detection target areas that the detection unit 1003 uses to detect subjects in the next iteration. The visualization unit 1006 visualizes the trajectory of the subject. Details of these functions will be described later together with the flowchart of FIG. 3A.

The monitoring unit 1300 can display the results of processing by the processing unit 1200. For example, the monitoring unit 1300 may superimpose the trajectory of the subject visualized by the visualization unit 1006, as a trajectory or as points, on the captured image shown on a monitor. The monitoring unit 1300 may be connected to the processing unit 1200. The method of connecting the monitoring unit 1300 and the processing unit 1200 is not particularly limited. For example, the monitoring unit 1300 and the processing unit 1200 may be connected by wire or via wireless communication.

FIG. 1 is a diagram for explaining an example of captured image acquisition by the detection device 1000 according to the present embodiment. The arrangement example 104 of FIG. 1A is a bird's-eye view showing an example arrangement of a group of persons present in a space and a camera 101, which is the imaging unit 1100, installed in the space. In this example, the camera 101 is capturing images of persons 1, 2, 3, and 4. An example of an image captured by the camera 101 is shown as image example 110 in FIG. 1B. Persons 1, 2, 3, and 4 in FIG. 1A correspond to those in FIG. 1B. The detection target area 111 shown in FIG. 1B is an example of a detection target area set by the detection device 1000. Reference numerals 112 and 113 denote bounding boxes, corresponding to persons 1 and 2 respectively, that represent detection results from the detection device 1000. A bounding box may be a rectangle represented by four numerical values in total: a position and a width in each of the vertical direction (u-axis direction) and the horizontal direction (v-axis direction) of the image. In this example, the detection device 1000 has been trained to output as many bounding boxes, each surrounding a human head, as the number of heads detected in the image. However, the detection result output by the detection device 1000 is not particularly limited; for example, identification information such as an ID associated with the subject, or a name corresponding to the ID, may be displayed. In addition to such a bounding box, the detection device 1000 can output a score representing the reliability of the detection result.
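As a minimal sketch of the detection result described above, the following Python record holds the four bounding-box values, the reliability score, and optional identification information. The class name `Detection`, the field names, and the choice of the box's minimum corner as the "position" are assumptions made for illustration; the source does not specify whether the position refers to a corner or the center.

```python
from dataclasses import dataclass
from typing import Optional, Tuple


@dataclass(frozen=True)
class Detection:
    """One detection result: a rectangle plus a reliability score.

    Assumption: (u, v) is the minimum corner of the box. The source only
    says the box has a position and a width along each image axis.
    """
    u: float                          # position along the u axis
    v: float                          # position along the v axis
    width_u: float                    # extent along the u axis
    width_v: float                    # extent along the v axis
    score: float                      # reliability of the detection result
    subject_id: Optional[int] = None  # identification information, if assigned

    def center(self) -> Tuple[float, float]:
        """Return the center coordinates (u, v) of the bounding box."""
        return (self.u + self.width_u / 2.0, self.v + self.width_v / 2.0)


head = Detection(u=100, v=50, width_u=40, width_v=20, score=0.9)
print(head.center())
```

Keeping the center as a derived value rather than a stored field avoids inconsistency when boxes from several detection target areas are later compared.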

The score representing the reliability of the detection result may be, for example, a model representing how accurately the detection device 1000 has detected a subject contained in the detection range. In the example of Embodiment 1, the detection device 1000 may detect subjects by the same method as in Non-Patent Document 1. For example, the detection unit 1003, described later, can divide each detection target area into an S × S grid (where S is a predetermined number given in advance). The detection unit 1003 may then estimate, from each grid cell in which a subject exists, a predetermined number of bounding boxes and a reliability score for each bounding box. Next, from among the plurality of bounding boxes set in the detection target area, the detection device 1000 can take each bounding box whose score exceeds an arbitrary threshold as a bounding box surrounding a subject. In the example of Non-Patent Document 1, the bounding boxes and scores are estimated using a neural network. In this example, a neural network is used that has been trained to give, as the score, the product of the probability that a subject exists and the IoU (the ratio of the correct subject area to the area obtained by adding the correct subject area and the area falsely detected as the subject). In this way, the score may be a value indicating at least one of the correctness of the position of the estimated subject area, the correctness of the size of the estimated subject area, and the probability that a subject exists in the estimated subject area. The detection device 1000 can also detect a plurality of subjects. Further, although a person's head is detected in this example, the detection target of the detection device 1000 is not limited to this. The detection device 1000 may detect, for example, an animal such as a dog or a horse, or may detect a soccer ball.
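The score and threshold step above can be sketched as follows. This is an illustration under stated assumptions, not the detector of Non-Patent Document 1: `iou` uses the conventional definition (intersection area divided by union area), boxes are given as hypothetical `(u0, v0, u1, v1)` corner tuples, and `target_score` shows the quantity a network would be trained to predict, which needs a ground-truth box and is therefore only available at training time.

```python
def iou(a, b):
    """Intersection over union of two boxes given as (u0, v0, u1, v1)."""
    overlap_u = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    overlap_v = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = overlap_u * overlap_v
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def target_score(p_subject, predicted_box, true_box):
    """Training target: probability that a subject exists times IoU."""
    return p_subject * iou(predicted_box, true_box)


def keep_confident(boxes, scores, threshold):
    """Keep only the boxes whose score exceeds the threshold."""
    return [box for box, s in zip(boxes, scores) if s > threshold]


candidates = [(0, 0, 2, 2), (5, 5, 6, 6)]
print(keep_confident(candidates, [0.9, 0.2], threshold=0.5))
```

The threshold value 0.5 is arbitrary here; the source leaves the threshold unspecified.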

The flow of the detection method performed by the detection device 1000 according to the present embodiment will now be described with reference to FIG. 3A. FIG. 3A is a flowchart showing an example of the processing procedure for recognizing a subject in the present embodiment. In the present embodiment, the detection device 1000 detects subjects from each of the captured images captured at times 1 to t, and sets the detection target areas in which subject detection is performed on the captured image captured at the next time. In loop L4001, the detection device 1000 can repeat the operations of steps S4002 to S4005 below, in order, for each of the captured images captured from time 1 to time t, and then proceed to the captured image at the next time. In the following, "this time" refers to the current loop iteration, which processes the captured image at a given time; "previous" refers to the iteration that processes the captured image at the preceding time; and "next" refers to the iteration that processes the captured image at the following time.
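The loop structure above can be sketched in Python as below. The function `run_detection_loop` and its four callable parameters are hypothetical names introduced for illustration. Only S4001, S4002, and S4004 are tied to specific operations in this part of the text; mapping association to S4003 and visualization to S4005 is an assumption based on the order in which the functional units are listed.

```python
def run_detection_loop(frames, initial_regions, detect, associate,
                       set_regions, visualize):
    """Process frames for times 1..t, carrying detection regions forward.

    detect(frame, regions)        -> list of detections   (S4002)
    associate(tracks, detections) -> updated tracks       (assumed S4003)
    set_regions(frame, detections)-> regions for next time (S4004)
    visualize(frame, tracks)      -> None                 (assumed S4005)
    """
    regions = initial_regions  # initial detection target areas (S4001)
    tracks = {}
    history = []
    for t, frame in enumerate(frames, start=1):
        detections = detect(frame, regions)
        tracks = associate(tracks, detections)
        # The regions set here are the ones the next iteration detects in.
        regions = set_regions(frame, detections)
        visualize(frame, tracks)
        history.append((t, detections, regions))
    return history
```

The point of the sketch is the data dependency: the regions used at one time are produced from the detections at the preceding time, so the full image only needs to be covered by the initial regions.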

In step S4001, the initial value setting unit 1002 sets one or more detection target regions for the first detection of subjects in the captured image acquired by the moving image acquisition unit 1001. For example, a 640 × 360 region may be used as a detection target region for an FHD (1920 × 1080) captured image. In that case, the initial value setting unit 1002 can, for example, first place a detection target region in the upper-left corner. It may then slide that region in steps of 640 pixels horizontally and 360 pixels vertically, up to two times in each direction, to set a total of nine detection target regions. For example, so that a subject can be detected wherever it is in the image, the initial value setting unit 1002 may set the detection target regions so that together they cover the entire captured image without gaps. The setting method is not limited to this, however. For example, when the range of positions where a subject can exist is given in advance, the initial value setting unit 1002 may set the detection target regions so as to cover that range without gaps. The initial value setting unit 1002 may also set the detection target regions so that adjacent regions overlap, in consideration of the possibility that a subject lies on the boundary between regions.
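
The tiling in this example can be sketched as follows. This is a minimal illustration, assuming the window size divides the image size evenly; the function name `tile_regions` and the `(x, y, w, h)` tuple layout are hypothetical and not part of the embodiment.

```python
def tile_regions(img_w=1920, img_h=1080, win_w=640, win_h=360):
    """Return (x, y, w, h) windows that tile the image without gaps.

    Assumes img_w and img_h are multiples of win_w and win_h.
    """
    regions = []
    for y in range(0, img_h, win_h):      # slide vertically in 360-pixel steps
        for x in range(0, img_w, win_w):  # slide horizontally in 640-pixel steps
            regions.append((x, y, win_w, win_h))
    return regions


initial_regions = tile_regions()
```

With the default FHD values, the upper-left window is slid twice in each direction, which gives the nine regions described above. Overlapping regions could be obtained by using a step smaller than the window size.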

In step S4002, the detection unit 1003 detects subjects in the detection target regions of the captured image that were set in step S4001 or in step S4004 (described later) of the previous loop. When t = 1, that is, when detection is performed for the first time, the detection unit 1003 may detect subjects using the detection target regions set in step S4001. For each detected subject, the detection unit 1003 can output a bounding box indicating the subject and a score representing the reliability of the detection result. When the detection unit 1003 detects the same subject in a plurality of detection target regions, it may integrate those results. The integration method is not particularly limited. For example, the detection unit 1003 may calculate the center coordinates (u, v) of the bounding box of the same subject in each detection target region and integrate the results by averaging them. Alternatively, when a weight based on its size is set for each detection target region, the detection unit 1003 may integrate the results by taking a weighted average of the subject's (u, v) values using the weights of the detection target regions that contain the same subject.
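
The two integration options above reduce to a weighted mean of the bounding-box centers. The following is a small sketch under that reading; `merge_centers` and its input layout are illustrative assumptions.

```python
def merge_centers(detections):
    """Merge several detections of one subject into a single center.

    detections: list of (u, v, weight), one entry per detection target region.
    Equal weights give the plain average; size-based weights give the
    weighted average.
    """
    total = sum(w for _, _, w in detections)
    u = sum(u * w for u, _, w in detections) / total
    v = sum(v * w for _, v, w in detections) / total
    return (u, v)


plain = merge_centers([(100, 200, 1), (110, 220, 1)])
weighted = merge_centers([(0, 0, 1), (10, 20, 3)])
```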

In step S4003, the association unit 1004 associates the identification information assigned to the subjects detected in the previous loop with the subjects detected this time. That is, it associates the images of the same subject in the previous and current iterations. When a newly detected subject exists, the association unit 1004 assigns new identification information to it. When t = 1, the association unit 1004 assigns identification information to each detected subject. For example, the association unit 1004 can form a three-dimensional value (u, v, q) for each subject from the center coordinates (u, v) of its bounding box and its reliability score (q), and calculate the Euclidean distance between the previous and current values for all combinations. In that case, the association unit 1004 may associate the subject images by treating the task as an assignment problem in linear programming. That is, the association unit 1004 may associate the previous images with the current images from the above Euclidean distances using a known technique such as the Hungarian method. An ID is used as the identification information in this example, but any information that can identify each subject may be used.
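
The association can be illustrated with the sketch below. To stay self-contained it solves the assignment problem by brute-force search over permutations rather than by the Hungarian method, so it is only practical for a handful of subjects; a dedicated solver would replace `associate` in practice. All function names and data layouts here are assumptions for illustration.

```python
from itertools import permutations
from math import dist


def associate(prev, curr):
    """Match current detections to previous IDs.

    prev: {id: (u, v, q)}; curr: list of (u, v, q).
    Returns {curr_index: id} minimizing the summed Euclidean distance.
    Brute force stands in for the Hungarian method here.
    """
    ids = list(prev)
    best, best_cost = {}, float("inf")
    if len(ids) <= len(curr):
        # Every previous subject gets a distinct current detection.
        for perm in permutations(range(len(curr)), len(ids)):
            cost = sum(dist(prev[i], curr[c]) for i, c in zip(ids, perm))
            if cost < best_cost:
                best, best_cost = dict(zip(perm, ids)), cost
    else:
        # Every current detection gets a distinct previous subject.
        for perm in permutations(ids, len(curr)):
            cost = sum(dist(prev[i], curr[c]) for c, i in enumerate(perm))
            if cost < best_cost:
                best, best_cost = dict(enumerate(perm)), cost
    return best


def assign_ids(prev, curr, next_id):
    """Carry over matched IDs and give unmatched detections new IDs."""
    matched = associate(prev, curr)
    out = {}
    for c, value in enumerate(curr):
        if c in matched:
            out[matched[c]] = value
        else:
            out[next_id] = value
            next_id += 1
    return out, next_id


prev = {0: (100, 100, 0.9), 1: (500, 300, 0.8)}
curr = [(505, 302, 0.7), (102, 99, 0.9), (900, 50, 0.6)]
tracks, next_id = assign_ids(prev, curr, next_id=2)
```

Here the third detection has no counterpart in the previous loop and receives the new ID 2. Since (u, v) are in pixels and q is a score, a practical version would probably rescale the components before taking distances; the text does not specify this.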

In step S4004, the region setting unit 1005 sets the detection target regions to be used for detection in the next step S4002. In this example, the region setting unit 1005 first acquires the coordinates, in the captured image, of candidate regions that are candidates for detection target regions. The candidate regions are described later. The region setting unit 1005 can select, from the plurality of candidate regions, candidate regions that contain one or more subjects as detection target regions. The region setting unit 1005 may also, for example, select detection target regions satisfying a desired condition from among the candidate regions containing one or more subjects, using the scores representing the reliability of the subject detection results. The processing performed by the region setting unit 1005 is detailed after the description of step S4005.

In step S4005, the visualization unit 1006 can visualize and display the subjects detected in the processed captured image. The visualization method is not particularly limited. For example, the visualization unit 1006 may display each subject as a bounding box on the monitoring unit 1300. The visualization unit 1006 may also display identification information, such as the subject's ID or a name corresponding to the subject, together with the subject or its trajectory. Further, the visualization unit 1006 may display on the monitoring unit 1300, as the subject, a line showing the transition of the center point of the bounding box across the captured images at a plurality of times.

The processing performed by the region setting unit 1005 is described in detail below. FIG. 4 is a diagram for explaining the acquisition of the candidate regions mentioned above. The region setting unit 1005 can select at least one detection target region from among candidate regions of mutually different sizes. That is, the selected detection target regions may differ in size from one another. In the example of FIG. 4, the region setting unit 1005 acquires the coordinates of candidate regions of three sizes, size 1 to size 3, and creates partial images from the captured image based on the candidate regions of each size. A candidate region is a region that serves as a candidate when detection target regions are selected in step S4004. The positions and shapes of the candidate regions can be determined in advance, for example as shown in FIG. 4. The shape of a candidate region is not particularly limited and may be, for example, triangular or circular; in the following, for the sake of explanation, candidate regions are assumed to be rectangular. The region setting unit 1005 may, for example, acquire the coordinates of the four corners of each rectangular candidate region.

Reference numeral 400 denotes a diagram in which the region setting unit 1005 creates candidate regions of size 1 within the imaging range; Nu1 × Nv1 candidate regions are created. In 400, candidate region 401 is the first candidate region (C_{1,1,1}), and candidate region 402 is the (Nu1 × Nv1)-th candidate region (C_{1,Nu1,Nv1}). That is, in 400 for example, the region setting unit 1005 can first create (C_{1,1,1}) and then create Nu1 candidate regions in the horizontal direction starting from (C_{1,1,1}) (in order, up to (C_{1,Nu1,1})). Next, from each of those Nu1 horizontal candidate regions, the region setting unit 1005 can create Nv1 candidate regions in the vertical direction (for example, in order, up to (C_{1,1,Nv1})). Similarly, in 410 and 420, the region setting unit 1005 can create Nu2 × Nv2 candidate regions from (C_{2,1,1}) to (C_{2,Nu2,Nv2}) and Nu3 × Nv3 candidate regions from (C_{3,1,1}) to (C_{3,Nu3,Nv3}), respectively.
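
The indexing C_{k,a,b} can be sketched as a dictionary keyed by (size index, horizontal index, vertical index). The sketch assumes evenly spaced rectangular regions whose first and last members touch the image edges; the function name and parameters are hypothetical.

```python
def candidate_grid(k, img_w, img_h, win_w, win_h, nu, nv):
    """Return {(k, a, b): (x, y, w, h)} for a = 1..nu and b = 1..nv.

    Regions of one size are spaced evenly so that the first and last
    regions in each direction touch the image edges (an assumption).
    """
    step_x = (img_w - win_w) / (nu - 1) if nu > 1 else 0
    step_y = (img_h - win_h) / (nv - 1) if nv > 1 else 0
    return {
        (k, a, b): (round((a - 1) * step_x), round((b - 1) * step_y), win_w, win_h)
        for b in range(1, nv + 1)
        for a in range(1, nu + 1)
    }


size1 = candidate_grid(1, 1920, 1080, 640, 360, nu=3, nv=3)
dense = candidate_grid(1, 1920, 1080, 640, 360, nu=5, nv=3)
```

With `nu=3, nv=3` the regions touch without overlapping; with `nu=5` the horizontal step shrinks to 320 pixels, so neighboring regions overlap by half their width.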

The candidate regions may overlap one another, may touch one another, or may be separated from one another to the extent that the desired detection result is obtained. In this example, adjacent candidate regions of the same size are spaced at equal intervals in both the vertical and horizontal directions, but the arrangement is not limited to this. For example, the captured image may contain a range in which candidate regions are placed at narrower intervals (that is, for example, a range in which the candidate regions overlap widely). With such a configuration, detection is performed multiple times in the range where the candidate regions overlap, which can improve the robustness of detection. As one pattern of candidate regions, regions identical to the detection target regions set in S4001 may be created.

The region setting unit 1005 can check which of the candidate regions contain each subject detected in step S4002. For example, for each candidate region, the region setting unit 1005 may associate the identification information and the score of each subject that the candidate region covers. FIG. 7 shows an example of a table listing the candidate regions with these associations. For each candidate region, the table in FIG. 7 shows the IDs (identification information) of the subjects the region covers and the score of the candidate region. In the example of FIG. 7, when one candidate region covers a plurality of subjects, the highest of the covered subjects' scores is shown as the score of that candidate region. With this setting, candidate regions covering subjects with high scores, that is, subjects that are easy to detect, are selected preferentially in the selection of detection target regions described later.
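
A table of this kind might be built as in the sketch below. It assumes, for simplicity, that a region covers a subject when the center of the subject's bounding box lies inside the region; the names and the example values are illustrative and are not taken from FIG. 7.

```python
def coverage_table(candidates, subjects):
    """Build {region: (covered ids, region score)}.

    candidates: {name: (x, y, w, h)}; subjects: {id: (u, v, score)}.
    The region score is the highest score among the covered subjects,
    or None when the region covers no subject.
    """
    table = {}
    for name, (x, y, w, h) in candidates.items():
        ids = sorted(
            i for i, (u, v, _) in subjects.items()
            if x <= u < x + w and y <= v < y + h
        )
        score = max((subjects[i][2] for i in ids), default=None)
        table[name] = (ids, score)
    return table


candidates = {"C1": (0, 0, 640, 360), "C2": (320, 0, 640, 360), "C3": (1280, 720, 640, 360)}
subjects = {0: (400, 100, 0.9), 1: (700, 200, 0.6)}
table = coverage_table(candidates, subjects)
```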

FIG. 5 shows an example of a candidate region for explaining the state in which a candidate region covers a subject. In this example, the candidate region 500 has a coverage determination region 501 and a buffer width 502. For example, the candidate region 500 may be regarded as covering a subject when the subject lies within the coverage determination region 501. The buffer width may be the width of a buffer region provided outside the coverage determination region as a margin within the detection target region, so that the subject is unlikely to leave the detection target region when the candidate region 500 is selected as the next detection target region. The value of the buffer width 502 is not particularly limited. For example, considering that the candidate region 500 is used for the next detection of the subject, the buffer width 502 may be set equal to the distance the subject can move by the next time. That movement distance may be given in advance or calculated during detection; such an example is described in detail in the second embodiment. The buffer width may also take different values at the right and left edges in the horizontal direction and at the top and bottom edges in the vertical direction of the image. That is, for example, when the traveling direction of the subject is known, the buffer width may be set so that the buffer region is larger on the side toward which the subject is traveling.
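
The coverage test with per-edge buffer widths can be sketched as follows. The function `covers`, the `(left, top, right, bottom)` ordering, and the use of a single point to represent the subject are assumptions made for illustration.

```python
def covers(region, buffers, point):
    """Check whether a point lies in the coverage determination region.

    region: (x, y, w, h) of the candidate region.
    buffers: (left, top, right, bottom) buffer widths in pixels.
    point: (u, v), for example the center of a bounding box.
    The coverage determination region is the candidate region shrunk
    by the buffer width on each edge.
    """
    x, y, w, h = region
    left, top, right, bottom = buffers
    u, v = point
    return x + left <= u <= x + w - right and y + top <= v <= y + h - bottom


region = (0, 0, 640, 360)
uniform = (20, 20, 20, 20)
moving_right = (10, 10, 60, 10)  # larger buffer on the side the subject moves toward
```

A subject at (600, 100) is covered with the uniform buffer but not with `moving_right`, because the larger right-hand buffer leaves it too close to the edge it is heading for.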

FIG. 6 is a diagram for explaining the bounding boxes of detected subjects, the candidate regions, and the detection target regions obtained in step S4004. The example 600 includes bounding boxes 601, 602, 603, and 604, candidate regions 605, 606, and 608, and detection target regions 607 and 609. The detection target regions set here are used by the detection unit 1003 for detection in step S4002 of the next loop.

As described above, the region setting unit 1005 may select detection target regions satisfying a desired condition from among the candidate regions containing one or more subjects, using the scores representing the reliability of the subject detection results. An example of such a selection is described with reference to the table in FIG. 7. First, the region setting unit 1005 removes from the candidate regions those that cover no subject, such as (C_{1,1,1}) in FIG. 7. Next, among the remaining candidate regions, from each group of candidate regions that cover the same set of subjects and have the same candidate region score, the region setting unit 1005 can select one candidate region (that is, a detection target region). The condition for this selection is not particularly limited. For example, among regions that cover the same subjects and have the same score, the region setting unit 1005 may give priority to the region of smaller size. Further, among candidate regions that cover the same subjects and have the same score and size, the region setting unit 1005 may select the candidate region whose center is closest to the average position of the covered subjects. The candidate regions that remain after this processing differ from one another in their combination of covered subject set and score.
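
The pruning described here can be sketched as below, using only the "smaller region first" rule as the tie-breaker. The table layout and the region names are assumptions; the center-distance rule is omitted for brevity.

```python
def prune(table, sizes):
    """Drop empty regions and keep one region per (covered ids, score) pair.

    table: {name: (list of covered ids, score)}; sizes: {name: area}.
    Among regions with the same covered set and score, the smallest wins.
    """
    best = {}
    for name, (ids, score) in table.items():
        if not ids:
            continue  # region covers no subject
        key = (tuple(ids), score)
        if key not in best or sizes[name] < sizes[best[key]]:
            best[key] = name
    return sorted(best.values())


table = {"C1": ([], None), "C2": ([0], 0.9), "C3": ([0], 0.9), "C4": ([0, 1], 0.9)}
sizes = {"C1": 100, "C2": 400, "C3": 200, "C4": 400}
kept = prune(table, sizes)
```

C1 is removed because it covers nothing, and C3 is kept over C2 because both cover subject 0 with the same score and C3 is smaller.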

Further, with a view to tracking all subjects in the captured image, the region setting unit 1005 can select one or more detection target regions from the plurality of candidate regions so that all subjects present at the current time are covered. In that case, for example, the region setting unit 1005 may select one or more detection target regions, from the candidate regions remaining after the above processing (which differ in their combination of subject set and score), so that every detected subject is covered at least once. In addition, to improve detection accuracy, the region setting unit 1005 can select the detection target regions so that the total score of the selected candidate regions is large. This can be done by applying an optimization method for the set cover problem, that is, by solving the following constrained optimization.

In this equation, i is an index over subjects and j is an index over candidate regions. The index j is assigned to the candidate regions subject to selection, for example the candidate regions remaining after the above processing, which differ in their combination of subject set and score, and n is the number of candidate regions subject to selection. s_j is the score of candidate region j. x_j = 1 when candidate region j is selected, and x_j = 0 otherwise. a_ij = 1 when candidate region j covers subject i, and a_ij = 0 otherwise.
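
These definitions are consistent with a set cover formulation such as the following sketch. The coverage constraint and the binary variables follow from the definitions of a_ij and x_j; the cost c_j and the form of the objective are assumptions, chosen here so that regions with a higher score s_j are cheaper to select and so that selecting fewer regions is favored.

```latex
\begin{aligned}
\text{minimize}\quad   & \sum_{j=1}^{n} c_j x_j, \qquad c_j \text{ decreasing in } s_j \\
\text{subject to}\quad & \sum_{j=1}^{n} a_{ij} x_j \ge 1 \quad \text{for every subject } i, \\
                       & x_j \in \{0, 1\}, \quad j = 1, \dots, n .
\end{aligned}
```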

The optimization problem is not limited to the above equation. That is, the region setting unit 1005 may use a different equation as appropriate for the desired condition. For example, to improve the robustness of detection, the region setting unit 1005 may change Σa_ij ≥ 1 in equation (1) above to Σa_ij ≥ 2, thereby creating detection target regions that cover every subject at least twice.

From the viewpoint of reducing processing cost, the region setting unit 1005 can also select the detection target regions so that the total number of selected regions is small. That is, it can solve the above optimization problem in that way. The method of solving the optimization problem so as to reduce the total number of detection target regions is not particularly limited. For example, the region setting unit 1005 may select the detection target regions by applying a known optimization method, such as a greedy method or Lagrangian relaxation, to this problem. The region setting unit 1005 may also select the detection target regions so that their total number is at most a predetermined number.
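
One way to apply a greedy method to this selection is sketched below. At each step it picks the region that covers the most subjects still in need of coverage and breaks ties by the region score. This ranking rule, the function name, and the parameters `min_cover` and `max_regions` are illustrative assumptions; a greedy method gives an approximate rather than an optimal solution.

```python
def greedy_cover(cover, score, min_cover=1, max_regions=None):
    """Greedily select regions until each subject is covered min_cover times.

    cover: {region: set of subject ids}; score: {region: s_j}.
    Stops early when max_regions is reached or no region helps any more.
    """
    need = {i: min_cover for ids in cover.values() for i in ids}
    remaining = dict(cover)
    chosen = []
    while remaining and any(n > 0 for n in need.values()):
        if max_regions is not None and len(chosen) >= max_regions:
            break
        # Rank by number of still-needed subjects covered, then by score.
        best = max(
            remaining,
            key=lambda j: (sum(need[i] > 0 for i in remaining[j]), score[j]),
        )
        if not any(need[i] > 0 for i in remaining[best]):
            break  # no remaining region adds coverage
        chosen.append(best)
        for i in remaining.pop(best):
            need[i] -= 1
    return chosen


cover = {"A": {0, 1}, "B": {1, 2}, "C": {2}, "D": {0, 1, 2}}
score = {"A": 0.9, "B": 0.8, "C": 0.7, "D": 0.6}
once = greedy_cover(cover, score)
twice = greedy_cover(cover, score, min_cover=2)
```

With `min_cover=1`, region D alone covers all three subjects. With `min_cover=2`, which corresponds to requiring each subject to be covered at least twice, regions A and B are added.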

Processing then moves to the captured image at the next time, and in step S4002 the detection unit 1003 detects subjects in the selected detection target regions.

With such a configuration, a detection device is obtained that detects one or more objects in a captured image and, according to the positions of those objects, sets detection target regions that can be used for detecting subjects at a subsequent time. Subjects passing through the field of view of a single fixed camera can therefore be tracked with higher accuracy at a lower calculation cost, balancing calculation cost against detection accuracy.

[Embodiment 2]
The detection device according to the second embodiment can predict the position of each subject at the next time and set the detection target regions based on that prediction. In particular, it can set detection target regions whose buffer width corresponds to the predicted subject position and to the range of deviation that can arise from the prediction. The detection device according to the present embodiment can therefore perform detection at a low processing cost without giving the detection target region an unnecessarily large buffer width, even when, for example, a subject is stationary. For this processing, the detection device 2000 according to the present embodiment has a prediction unit 2001. Apart from having the prediction unit 2001, the detection device 2000 is the same as in the first embodiment, and duplicate description is omitted.

FIG. 2(b) is a block diagram showing an example of the functional configuration of the detection device 2000 according to the second embodiment. For each subject, the prediction unit 2001 predicts the subject's position at the time of the next detection. The region setting unit 1005 sets the detection target regions taking into account the subject positions predicted by the prediction unit 2001.

The flow of the detection method performed by the detection device 2000 according to the present embodiment is described below with reference to FIG. 3(b). FIG. 3(b) is a flowchart showing an example of the processing procedure for detection according to the present embodiment. The processing procedure of the detection device 2000 is the same as in the first embodiment except for steps S5001 and S5002.

In step S5001, the prediction unit 2001 predicts the position at which each subject will be detected at the next time. The prediction method is not particularly limited. In loop L5001, each subject has been associated in step S4003 with the image of the subject having the same identification information detected at the previous time. The prediction unit 2001 can therefore obtain the coordinates of a given subject at every time up to the present. For example, the prediction unit 2001 may take the difference between the subject's previous and current positions to calculate the distance and direction of its movement from the previous detection to the current one, and predict the subject's position at the next detection from them. The prediction unit 2001 can also calculate the movement distance and direction by using, in addition to the previous and current positions, the subject's positions at any earlier times as appropriate. Such processing yields smoother information than calculating the movement distance and direction from the previous and current positions alone. In this case as well, the prediction unit 2001 can predict the subject's next position from the calculated information.
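
Both prediction variants can be sketched with one function that averages the most recent displacements and adds the result to the latest position. The function name, the `window` parameter, and the assumption of equally spaced time steps are illustrative.

```python
def predict_next(track, window=2):
    """Predict the next (u, v) position from a subject's track.

    track: list of (u, v) positions, oldest first, one per time step.
    window=2 uses only the previous and current positions; a larger
    window averages the displacement over more past steps (smoother).
    """
    pts = track[-window:]
    if len(pts) < 2:
        return track[-1]  # no motion information yet
    steps = len(pts) - 1
    # The mean of consecutive displacements equals (last - first) / steps.
    du = (pts[-1][0] - pts[0][0]) / steps
    dv = (pts[-1][1] - pts[0][1]) / steps
    return (pts[-1][0] + du, pts[-1][1] + dv)


track = [(0, 0), (10, 0), (30, 0)]
two_point = predict_next(track, window=2)
smoothed = predict_next(track, window=3)
```

With only the last two positions the predicted step is 20 pixels; averaging over three positions gives a step of 15 pixels, which is less sensitive to a single noisy detection.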

In step S5002, the region setting unit 1005 sets the detection target regions to be used for detection in the next step S4002. In this example, the region setting unit 1005 performs the same processing as step S4004 of the first embodiment except for moving the detection target regions and setting the buffer width, so duplicate description is omitted. Based on the next subject position predicted in step S5001, the region setting unit 1005 can move the detection target region that covers that subject. The way the detection target region is moved is not particularly limited. For example, when a detection target region covers one subject, the region setting unit 1005 may move the region by the same displacement as the subject's movement to its predicted position. When a detection target region covers a plurality of subjects, the region setting unit 1005 may, for example, move the region according to the movement of the subject with the highest score among them, or according to the average of the predicted movements of those subjects. Further, the region setting unit 1005 may calculate the magnitude of the noise arising in the position prediction (the deviation from the true value) and set a buffer width equal to that noise value. In that case, the region setting unit 1005 may, for example, use the subject's tracking data to calculate the noise value as the average deviation between the position predicted for the subject in the same manner as in S5001 and the position actually detected. That is, the region setting unit 1005 may calculate the noise value as the average, over the loops up to the present, of the deviation between the subject's predicted and detected positions.
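
The region shift for a single covered subject and the noise-based buffer width can be sketched as follows. Measuring the deviation as a Euclidean distance in pixels is an assumption, as are the function names and data layouts.

```python
from math import dist


def shift_region(region, current, predicted):
    """Move a region by the subject's predicted displacement.

    region: (x, y, w, h); current and predicted: (u, v) of the subject.
    """
    x, y, w, h = region
    return (x + predicted[0] - current[0], y + predicted[1] - current[1], w, h)


def noise_buffer(predicted_history, detected_history):
    """Mean deviation between past predictions and the detections that followed.

    Both arguments are lists of (u, v), aligned by time step.
    Returns 0.0 when no history is available.
    """
    errors = [dist(p, d) for p, d in zip(predicted_history, detected_history)]
    return sum(errors) / len(errors) if errors else 0.0


moved = shift_region((100, 100, 640, 360), current=(300, 200), predicted=(320, 190))
buffer_width = noise_buffer([(0, 0), (10, 0)], [(3, 4), (10, 0)])
```

A stationary subject whose predictions match its detections yields a buffer width near zero, which matches the aim of avoiding an unnecessarily large buffer.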

With such a configuration, detection target regions suited to detection can be set based on the predicted subject positions. The buffer region can also be set based on the predicted subject positions. Subjects passing through the field of view of a single fixed camera can therefore be tracked at a lower calculation cost.

[Embodiment 3]
The detection device according to the third embodiment detects subjects in each of the captured images obtained by a plurality of cameras and tracks the subjects using the results. In doing so, the detection device can predict the current state of each subject from the predicted value of the subject's state in three-dimensional space (that is, its position, orientation, and velocity) estimated from the observations of the subject in the previous loop. Based on the predicted current state, it can further obtain predicted values of the subject's coordinates and score in each captured image, set the detection target regions based on those coordinates and scores, and detect the subject. It can also set detection target regions that maximize the detection scores predicted for the subjects at the next time. The following describes a case in which a plurality of fixed cameras image an indoor pitch for futsal, a small-scale form of soccer, but the embodiment is not limited to this application. In this embodiment, the subjects are the heads of people and a soccer ball (hereinafter, the ball).

FIG. 8 is a diagram for explaining the setup assumed for the detection device 3000 in the present embodiment. Camera arrangement example 800 is an overhead view of the camera arrangement and the pitch according to the present embodiment. It includes cameras 801 to 806, the origin 807 of the three-dimensional space, the X axis 808, Y axis 809, and Z axis 810 of the three-dimensional coordinate system whose origin is 807, and the pitch 811. In the present embodiment, each camera is fixed to a wall of the space at some height above the ground and may be installed so as to image subjects on the pitch. The intrinsic and extrinsic parameters of each camera of the detection device 3000 are given by camera calibration. In the following, the detection device 3000 is therefore assumed to be able to obtain the pixel coordinates of a subject from its three-dimensional coordinates. Camera calibration is a known technique, so a detailed description is omitted. In this example, the plane formed by the X axis 808 and the Z axis 810 is the ground, and the Y axis 809 is the height direction.
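
The mapping from three-dimensional coordinates to pixel coordinates can be illustrated with a standard pinhole camera model. The sketch assumes that the extrinsic parameters are a world-to-camera rotation `R` and translation `t`, that the intrinsic parameters are the focal lengths `fx`, `fy` and principal point `cx`, `cy`, and that lens distortion is ignored; none of these details are specified in the text.

```python
def project(point, R, t, fx, fy, cx, cy):
    """Project a 3D world point (X, Y, Z) to pixel coordinates (u, v).

    R: 3x3 rotation as a list of rows; t: 3-element translation.
    Returns None when the point is behind the camera.
    """
    # World coordinates -> camera coordinates: Xc = R @ X + t.
    xc, yc, zc = (
        sum(R[r][c] * point[c] for c in range(3)) + t[r] for r in range(3)
    )
    if zc <= 0:
        return None
    # Perspective division followed by the intrinsic mapping.
    return (fx * xc / zc + cx, fy * yc / zc + cy)


identity = [[1, 0, 0], [0, 1, 0], [0, 0, 1]]
pixel = project((1, 0.5, 10), identity, (0, 0, 0), fx=1000, fy=1000, cx=960, cy=540)
```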

Person arrangement example 820 is an example of the arrangement of the persons and the ball in this space at a certain time. The pitch 811 in person arrangement example 820 is the same pitch as the pitch 811 in camera arrangement example 800. Reference numeral 821 is the halfway line of the pitch, and 822 is the center mark. A0, A1, A2, A3, and A4 are players (persons) of team A, and B0, B1, B2, B3, and B4 are players (persons) of team B. S0 is the ball.

Image examples 830, 840, 850, 860, 870, and 880 are examples of the images obtained when the persons and ball of person arrangement example 820 are imaged by cameras 801, 802, 803, 804, 805, and 806, respectively. A0, A1, A2, A3, A4, B0, B1, B2, B3, B4, and S0 in each image example are the persons and the ball, and correspond to A0, A1, A2, A3, A4, B0, B1, B2, B3, B4, and S0 of person arrangement example 820, respectively.

FIG. 2C is a block diagram showing an example of the functional configuration of the detection device 3000 according to the third embodiment. The detection device 3000 has an imaging unit 3100 and a processing unit 3200. The imaging unit 3100 has a total of K moving image acquisition units: a first moving image acquisition unit 3001, a K-th moving image acquisition unit 3002, and the moving image acquisition units omitted from the drawing. In the example of FIG. 8, for instance, there are six cameras, so K is 6. Each of these moving image acquisition units has the same configuration as the moving image acquisition unit 1001 in the first embodiment. In the example of FIG. 2C, the processing unit 3200 has an initial value setting unit 3003, a prediction unit 3004, an area setting unit 3005, a detection unit 3006, an association unit 3007, a weight calculation unit 3008, an update unit 3009, and a visualization unit 3010.

The initial value setting unit 3003 sets the values of the position, posture, and velocity of each subject at the initial time of the detection process. The prediction unit 3004 predicts the position, posture, and velocity of each subject in three-dimensional space, and predicts the observed value of the subject for each camera. As described in detail later, an observed value consists of the position of the subject in pixel coordinates and its detection score. The area setting unit 3005 sets the detection target areas for the current time. The detection unit 3006 uses the detection target areas set by the area setting unit 3005 and the image acquired by each camera to detect subjects within the detection target areas of that image, and acquires the position and score of each subject. The association unit 3007 associates the image of a subject in the previous detection with the image of the same subject in the current detection. The weight calculation unit 3008 calculates a weight for each observed value from each camera. For each subject, the update unit 3009 updates the state of the subject, that is, its position, posture, and velocity, using the observed values from the previous loop and their weights. The visualization unit 3010 visualizes the trajectory of the positions of each subject at the times of detection. Details of the processing performed by these functional units of the processing unit 3200 are described later together with the flowchart of FIG. 3C. The processing unit 3200 may be connected to the monitoring unit 1300 in the same manner as the processing unit 1200 of the first embodiment.

The detection device according to the present embodiment can estimate the state of a subject from its observed values. To explain the framework for tracking a subject in three-dimensional space, the observed value of a subject detected by the detection device 3000 and the state variable of the subject estimated from that observed value are described first. The observed value of a subject consists of its position (u, v) in the pixel coordinates of the captured image and its score (q), and may be expressed as the three-dimensional vector (u, v, q). The position (u, v) is the center of the bounding box surrounding the subject, and the detection device 3000 can calculate it from the coordinate information of the bounding box and the coordinate information of the detection target area that contains the bounding box. The score (q) of the same subject may differ depending on the size of the detection target area that contains the subject.
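The calculation of (u, v) from the bounding box and the detection target area can be illustrated with a minimal sketch. It assumes that the detector reports the bounding box in coordinates relative to the top-left corner of the detection target area; the function name and argument layout are hypothetical and are not taken from this document.

```python
def observation_from_box(box, region_origin, score):
    """Return the observed value (u, v, q) for one detection.

    box           -- (left, top, right, bottom), assumed to be relative to
                     the top-left corner of the detection target area
    region_origin -- (left, top) of the detection target area in the
                     pixel coordinates of the full captured image
    score         -- detection score q reported by the detector
    """
    left, top, right, bottom = box
    # Center of the bounding box, shifted into full-image pixel coordinates.
    u = region_origin[0] + (left + right) / 2.0
    v = region_origin[1] + (top + bottom) / 2.0
    return (u, v, score)
```

If the detector already reports boxes in full-image coordinates, `region_origin` would simply be `(0, 0)`.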

The state variable of a subject is a variable representing the state of the subject in three-dimensional space, that is, its position, posture, and velocity. By estimating this state variable, the detection device can estimate the position of the subject in three-dimensional space and track the subject. The detection device 3000 according to the present embodiment can estimate the state variable of a subject from the observed values of that subject.

The prediction unit 3004 can acquire the predicted distributions of the current state variable and observed value from the state variable of the subject in the previous loop. In the present embodiment the subject is either a head or the ball, so a state variable is considered for each. The state variable of a head is a nine-dimensional variable consisting of the position (x, y, z), posture (φ, θ, ψ), and velocity (x', y', z') of the subject in three-dimensional space. The ball is spherical, so its shape as seen from a camera does not change with its posture. The state variable of the ball is therefore a six-dimensional variable consisting of the position (x, y, z) and velocity (x', y', z') of the subject in three-dimensional space. That is, the observed value y, the state variable x^{head} of a head, and the state variable x^{ball} of the ball can be described by the following equations.
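The equations themselves are not reproduced in this text. Based only on the dimensions and symbols defined in the surrounding paragraphs, they can plausibly be written as follows; the ordering of the components and the exact placement of the indices are assumptions.

```latex
\begin{aligned}
y_{t,k_{sj}} &= \left[\, u_{t,k_{sj}},\; v_{t,k_{sj}},\; q_{t,k_{sj}} \,\right]^{T} \\
x^{head}_{t,n} &= \left[\, x_{t,n},\; y_{t,n},\; z_{t,n},\; \phi_{t,n},\; \theta_{t,n},\; \psi_{t,n},\; x'_{t,n},\; y'_{t,n},\; z'_{t,n} \,\right]^{T} \\
x^{ball}_{t,n} &= \left[\, x_{t,n},\; y_{t,n},\; z_{t,n},\; x'_{t,n},\; y'_{t,n},\; z'_{t,n} \,\right]^{T}
\end{aligned}
```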

In the above equations, the subscript t denotes the time. The subscript k_{sj} indicates an observed value in the j-th detection target area among the detection target areas of size s in the image captured by camera k. T denotes the transpose. The subscript n denotes a person and, in the present embodiment, may be the value of identification information such as the ID of that person.

Further, the processing of step S6006, described later, associates n with k_{sj}. As a result, y_{t,k_{sj}} is associated with y_{t,k,s,n} = [u_{t,k,s,n}, v_{t,k,s,n}, q_{t,k,s,n}]^T. Here, y_{t,k,s,n} denotes the observed value of subject n at time t in a detection target area of size s in the image captured by camera k. In the present embodiment, a state space model with the observed values and state variables described above is used, and heads and the ball are detected and tracked with an extended Kalman filter that estimates the state from the observed values. The extended Kalman filter is known, so a detailed description is omitted.

FIG. 3C is a flowchart showing an example of a processing procedure for performing the detection according to the present embodiment. In loop L6001, the detection device 3000 repeats the operations of steps S6002 to S6009 in order for each time from 1 to t, advancing to the next time after each pass. In step S6001, the initial value setting unit 3003 acquires the initial state of each subject at the start time (t = 1). At the start time, the velocity and posture in the state variable of a subject can be set to 0. The position in the state variable may be set to a value close to the correct position of the subject in three-dimensional coordinates, so that the observed values of the plurality of subjects in a detection target area can be associated with the subjects.

A method of acquiring a value close to the correct position (x, y, z) of a subject in three-dimensional coordinates is described below. In step S6001, the initial value setting unit 3003 detects subjects in the captured image of each camera and acquires the position (u, v) of each subject in the pixel coordinates of that camera. Next, the initial value setting unit 3003 assumes a value y for the height of the subject according to the type of subject. An accurate value of the height y is not needed to associate observed values with subjects, so the initial value setting unit 3003 may assume a rough value. For example, the initial value setting unit 3003 may assume a height of 1.5 m for a head and 0.1 m for the ball. These values are used in the following description for the sake of explanation, but the height of a subject is not limited to them. Next, the initial value setting unit 3003 uses the perspective projection matrix to obtain, from (u, v), the position (x, 1.5, z) or (x, 0.1, z) of the subject in three-dimensional space.
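Recovering (x, z) from (u, v) once the height is fixed amounts to solving two linear equations. The following is a minimal sketch of that step, assuming the perspective projection matrix is a 3×4 nested list `P` and that noise is ignored; the function names are hypothetical and the document does not state how the device actually performs this computation.

```python
def project(P, point):
    """Project a 3D point (x, y, z) to pixel coordinates (u, v) with a 3x4 matrix P."""
    x, y, z = point
    rows = [P[i][0] * x + P[i][1] * y + P[i][2] * z + P[i][3] for i in range(3)]
    return (rows[0] / rows[2], rows[1] / rows[2])


def back_project_at_height(P, u, v, height):
    """Return (x, height, z) whose projection through P is (u, v).

    With y fixed to the assumed height, the projection equations are linear
    in x and z, so the 2x2 system is solved with Cramer's rule.
    """
    a11 = P[0][0] - u * P[2][0]
    a12 = P[0][2] - u * P[2][2]
    a21 = P[1][0] - v * P[2][0]
    a22 = P[1][2] - v * P[2][2]
    b1 = u * (P[2][1] * height + P[2][3]) - (P[0][1] * height + P[0][3])
    b2 = v * (P[2][1] * height + P[2][3]) - (P[1][1] * height + P[1][3])
    det = a11 * a22 - a12 * a21
    if abs(det) < 1e-12:
        raise ValueError("viewing ray is parallel to the assumed height plane")
    x = (b1 * a22 - a12 * b2) / det
    z = (a11 * b2 - a21 * b1) / det
    return (x, height, z)
```

For a head, `back_project_at_height(P, u, v, 1.5)` would give the rough initial position described above; for the ball, the height argument would be 0.1.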

Further, the initial value setting unit 3003 associates the images of the same subject with one another based on the three-dimensional positions of the subjects acquired from all cameras. For example, the initial value setting unit 3003 may cluster the acquired three-dimensional positions by a known method such as k-means, and treat the positions in each resulting cluster as belonging to the same subject. In that case, the initial value setting unit 3003 may acquire the initial position (x, y, z) of each subject by averaging the positions in each cluster.
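The grouping and averaging step can be sketched as follows. This sketch does not implement k-means; it substitutes a simpler greedy distance-threshold grouping so that the number of subjects does not have to be known in advance. The function name and the `radius` parameter are hypothetical.

```python
import math


def group_and_average(points, radius):
    """Group 3D positions and return the mean position of each group.

    Each point joins the first existing group whose current mean lies within
    `radius`; otherwise it starts a new group. This is a stand-in for the
    k-means clustering named in the text, not an implementation of it.
    """
    groups = []  # each group is a list of (x, y, z) tuples
    for p in points:
        for members in groups:
            mean = [sum(c) / len(members) for c in zip(*members)]
            if math.dist(p, mean) <= radius:
                members.append(p)
                break
        else:
            groups.append([p])
    # The initial position of each subject is the mean of its group.
    return [tuple(sum(c) / len(m) for c in zip(*m)) for m in groups]
```

The result depends on the order of the input points and on `radius`, so a value well below the typical spacing between players would be needed for this simplification to behave sensibly.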

Through this processing, the initial value setting unit 3003 can acquire the initial value of the state variable of the ball, x^{ball}_{0,n} = (x_{0,n}, 0.1, z_{0,n}, 0, 0, 0). The initial value setting unit 3003 can likewise acquire the initial value of the state variable of a head, x^{head}_{0,n} = (x_{0,n}, 1.5, z_{0,n}, 0, 0, 0, 0, 0, 0). These initial values can be used as the first moment (mean) x_{0|0,n} of the initial filter distribution (posterior distribution) of the state variable. In that case, the second moment (variance-covariance matrix) of the initial filter distribution may be a positive semidefinite matrix of suitable magnitude.

In step S6002, the prediction unit 3004 acquires the predicted distributions of the state variable and the observed value of each subject. For a subject that is a head, the prediction unit 3004 can acquire the predicted distribution by using, for example, the following system equation (3). In this equation, Δt is the time interval (in seconds) between the previous and current passes of loop L6001. s_t is white Gaussian noise called process noise (for example, noise arising in the prediction process), and Q_t is the variance-covariance matrix of s_t. In the present embodiment, the prediction unit is described as acquiring the predicted distribution of the state variable with equation (3), but the method is not limited to this. In this system equation, the change in the position (x, y, z) of the subject is treated as a trend component model of position and velocity, modeled as a second-order Markov process. The posture (φ, θ, ψ) is modeled as a first-order Markov process.
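System equation (3) is not reproduced in this text. The following is a minimal sketch of how the mean of a head state could be propagated under the description above, assuming that the second-order trend model reduces to constant-velocity motion over Δt and that the first-order model carries the posture over unchanged. The state ordering (position, posture, velocity) follows the order in which the components are listed earlier, the function name is hypothetical, and the zero-mean process noise s_t is omitted.

```python
def predict_head_state(state, dt):
    """Propagate the mean of a 9-dimensional head state by one time step.

    state -- (x, y, z, phi, theta, psi, vx, vy, vz)
    dt    -- time interval in seconds since the previous pass of the loop
    """
    x, y, z, phi, theta, psi, vx, vy, vz = state
    return (
        x + vx * dt,      # position advances by velocity times dt
        y + vy * dt,
        z + vz * dt,
        phi, theta, psi,  # posture: mean carried over unchanged
        vx, vy, vz,       # velocity: mean carried over unchanged
    )
```

For the ball, the same update would apply with the three posture components removed.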

When the subject is the ball, the following system equation (4) is used, which is equation (3) with the dimensions for the posture (φ, θ, ψ) removed. In the following, for simplicity, x^{head}_{t,n} and x^{ball}_{t,n} are both written as x_{t,n} unless the head and the ball need to be clearly distinguished.

The prediction unit 3004 can acquire the predicted distribution of the observed value of a subject by using observation equations (6) and (7), together with (8) or (8'), all described later. Observation equations (6) and (7) are derived from the following equation (5), which projects a point in three-dimensional space onto the pixel coordinates of a camera. As described above, the internal and external parameters of the cameras of the detection device 3000 have been acquired in advance, so the detection device 3000 can project a point in three-dimensional space onto pixel coordinates. This projection can be written as equation (5) below. Here, p_{xx,k} denotes an element of the perspective projection matrix of camera k, and γ is the scale parameter of the homogeneous coordinate system.

Based on equation (5), the prediction unit 3004 can derive the following observation equations (6) and (7), as noted above. With these observation equations, the prediction unit 3004 can calculate the position (u, v), which is part of the observed value of the subject, from the position (x, y, z) of the subject in three-dimensional space. w_t is white Gaussian noise called observation noise.
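Equations (5) to (7) are not reproduced in this text. From the definitions of p_{xx,k}, γ, and w_t given above, they presumably take the standard perspective projection form sketched below. The component labels w^{(u)}_t and w^{(v)}_t for the observation noise, and the omission of the time and subject indices on (x, y, z), are assumptions made for readability.

```latex
\gamma
\begin{bmatrix} u \\ v \\ 1 \end{bmatrix}
=
\begin{bmatrix}
p_{11,k} & p_{12,k} & p_{13,k} & p_{14,k} \\
p_{21,k} & p_{22,k} & p_{23,k} & p_{24,k} \\
p_{31,k} & p_{32,k} & p_{33,k} & p_{34,k}
\end{bmatrix}
\begin{bmatrix} x \\ y \\ z \\ 1 \end{bmatrix}

u = \frac{p_{11,k}\,x + p_{12,k}\,y + p_{13,k}\,z + p_{14,k}}
         {p_{31,k}\,x + p_{32,k}\,y + p_{33,k}\,z + p_{34,k}} + w^{(u)}_{t}
\qquad
v = \frac{p_{21,k}\,x + p_{22,k}\,y + p_{23,k}\,z + p_{24,k}}
         {p_{31,k}\,x + p_{32,k}\,y + p_{33,k}\,z + p_{34,k}} + w^{(v)}_{t}
```

The second line follows from the first by eliminating γ, which equals the third row of the product.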

Equations (6) and (7) thus model the process by which the position (x, y, z) of a subject in three-dimensional space is observed as the pixel coordinates (u, v). In the present embodiment, the positions of a subject acquired by the individual cameras can deviate from one another because the cameras are not synchronized and because some cameras drop frames. Deviations also arise in the processing performed by the detection device 3000 and from errors in camera calibration. w_t models the deviation in the position of the subject in three-dimensional space that the detection device 3000 is expected to observe as a result of these factors.

The following observation equations (8) and (8') are the observation equations for the score of a subject, and correspond to the case where the subject is a head and the case where it is the ball, respectively.

Here, C_k denotes the position of camera k in three-dimensional space, and ||x - C||_2 denotes the Euclidean distance between the subject and the camera. α_s^{(0)}, α_s^{(1)}, α_s^{(2)}, α_s^{(3)}, and α_s^{(4)} are model parameters. θ_x, θ_y, and θ_z can be expressed using the elements r_{ij} of the matrix in equation (9) below, where R is the rotation matrix of the external parameters of the camera and Ro is the rotation matrix obtained from the posture (φ, θ, ψ) of the head. In this case, for example, θ_x can be expressed as asin(r_{32}), θ_y as atan(-r_{31} / r_{33}), and θ_z as atan(r_{21} / r_{11}).

Equations (8) and (8') are multiple regression models of the process by which the detection device 3000 observes the score of a subject. A detector for a given kind of subject generally outputs a score that varies with the size and posture of the imaged subject, and detects the subject according to that score. The size of a subject in a captured image is also generally correlated with the distance between the camera and the subject. The detection device 3000 may therefore vary the score of a subject according to the distance between the camera and the subject. A detector whose training data is unbiased can robustly acquire image features such as texture when the subject appears large in the captured image, so the detection score is high. In particular, when the subject is a person facing the camera, parts that are important for identification, such as the eyes, nose, and mouth, are stably visible, so the detection score tends to be high. Conversely, when the person faces away from the camera, fewer of these parts are visible and the detection score tends to be low. For a ball, on the other hand, a change in posture does not change the apparent shape, so the detection score may vary only with the distance between the subject and the camera.

The first term of equation (8) is a constant term. The second term expresses the relationship between the distance from the camera to the subject and the score of the subject. The third, fourth, and fifth terms model, with cosine functions, the relationship between the posture of the head as seen from the camera and the score of the subject. The sixth term is a noise term. The way these factors change the score of a subject is expected to depend on the size of the detection target area, so the model parameters α_s^{(0)} to α_s^{(4)} of equations (8) and (8') may take different values depending on the size of the detection target area. Because a change in posture plays no part in detecting the ball, for the reason given above, the score observation equation for the ball may be modeled as equation (8').
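Equation (8) is not reproduced in this text. The following is a minimal sketch of the expected head score assembled from the term-by-term description above and the angle definitions given earlier, assuming each cosine term is a model parameter multiplied by the cosine of one angle. The function names are hypothetical, the 3×3 matrix `r` stands for the matrix of equation (9) and is taken as an input because its construction is not shown here, and the noise term is omitted.

```python
import math


def view_angles(r):
    """Return (theta_x, theta_y, theta_z) from a 3x3 matrix r, where
    r[i][j] holds the element r_{(i+1)(j+1)} of the matrix in equation (9).

    Uses asin/atan as stated in the text; r[2][2] and r[0][0] must be nonzero.
    """
    theta_x = math.asin(r[2][1])
    theta_y = math.atan(-r[2][0] / r[2][2])
    theta_z = math.atan(r[1][0] / r[0][0])
    return theta_x, theta_y, theta_z


def predicted_head_score(alpha, position, camera_position, r):
    """Expected score of a head: constant + distance term + three cosine terms.

    alpha -- (a0, a1, a2, a3, a4), the model parameters for one region size s
    """
    a0, a1, a2, a3, a4 = alpha
    distance = math.dist(position, camera_position)  # ||x - C||_2
    tx, ty, tz = view_angles(r)
    return (a0 + a1 * distance
            + a2 * math.cos(tx) + a3 * math.cos(ty) + a4 * math.cos(tz))
```

Under the same reading, the ball score of equation (8') would keep only the constant and distance terms.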

The prediction unit 3004 can estimate the model parameters α_s^{(0)}, α_s^{(1)}, α_s^{(2)}, α_s^{(3)}, and α_s^{(4)}. The estimation method is not particularly limited. For example, the prediction unit 3004 can assign a ground-truth orientation in three-dimensional space to each of a plurality of heads in captured images and acquire a score for each head. The prediction unit 3004 may then estimate the parameters by applying the least squares method to these head samples, each of which has orientation information and a score. The least squares method finds a plausible function expressing the relationship between x and y given a set of data pairs (x, y); it is a known technique, so a detailed description is omitted. As another example, the prediction unit 3004 may calculate the likelihood of the model for the observed values of a subject by using the likelihood function of equation (16), described later. In that case, the prediction unit 3004 calculates the log-likelihood from a large number of observed values using equation (16), and can estimate the model parameters that maximize the log-likelihood with a known parameter search method such as grid search or Bayesian optimization. Estimating the model parameters with the likelihood function in this way is efficient because it does not require a user to supply ground-truth values. The method of estimating the model parameters is not limited to these; for example, a recursive search using the EM algorithm, or a self-organizing state space model in which the model parameters are incorporated into the state space, may be used. These methods are known techniques, so a detailed description is omitted.
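The least squares option can be illustrated for the simpler ball case. The following is a minimal sketch that assumes the ball score model has the form q = α^{(0)} + α^{(1)}·d, with d the camera-to-ball distance, which matches the statement that the ball score depends only on distance but is not confirmed by a reproduced equation. The function name is hypothetical.

```python
def fit_ball_score_model(distances, scores):
    """Ordinary least squares fit of q = a0 + a1 * d.

    distances -- camera-to-ball distances for the samples
    scores    -- detection scores observed for the same samples
    Returns (a0, a1).
    """
    n = len(distances)
    mean_d = sum(distances) / n
    mean_q = sum(scores) / n
    sxx = sum((d - mean_d) ** 2 for d in distances)
    sxy = sum((d - mean_d) * (q - mean_q) for d, q in zip(distances, scores))
    if sxx == 0:
        raise ValueError("all distances are identical; slope is undefined")
    a1 = sxy / sxx            # slope: change in score per unit distance
    a0 = mean_q - a1 * mean_d  # intercept: constant term
    return a0, a1
```

One such fit would be made per detection target area size s, since the parameters are allowed to differ by size. The head model has five parameters and would need a general linear least squares solver instead of this closed form.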

In the following, equations (6), (7), (8), and (8') above are combined and expressed as equation (10) below. Here, the variance-covariance matrix of the observation noise w_t is denoted R_t. The likelihood function P(y_{t,k,j,s} | x_{t,n}) is obtained from equation (10).

y_{t,k,j,s} = h_{t,k,s}(x_{t,n}) + w_t    (10)

With the following equations (11) to (14), which use the system equation and observation equations above, the prediction unit 3004 can predict the current state (time t) and observed value of subject (head) n from its state one time step earlier (time t-1). Here, x_{t|t-1,n} and V_{t|t-1,n} denote the first and second moments, respectively, of the predicted distribution of the state variable. y_{t|t-1,k,s,n} and U_{t|t-1,k,s,n} denote the first and second moments, respectively, of the predicted distribution of the observed value. Q_t is the variance-covariance matrix of the process noise, and R_t is the variance-covariance matrix of the observation noise w_t. H_{t,k,s} is the Jacobian matrix of h_{t,k,s}(x_{t,n}).
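Equations (11) to (14) are not reproduced in this text. Given the symbols defined above, they presumably correspond to the standard prediction step of an extended Kalman filter, sketched below. F_t is a hypothetical name for the state transition matrix of the system equation, and the assignment of equation numbers to individual lines is an assumption.

```latex
\begin{aligned}
x_{t|t-1,n} &= F_t\, x_{t-1|t-1,n} \\
V_{t|t-1,n} &= F_t\, V_{t-1|t-1,n}\, F_t^{T} + Q_t \\
y_{t|t-1,k,s,n} &= h_{t,k,s}\!\left(x_{t|t-1,n}\right) \\
U_{t|t-1,k,s,n} &= H_{t,k,s}\, V_{t|t-1,n}\, H_{t,k,s}^{T} + R_t
\end{aligned}
```

The Jacobian H_{t,k,s} would be evaluated at the predicted mean x_{t|t-1,n}, as is usual for the extended Kalman filter.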

In the following, for simplicity, the predicted distributions of the state variable and the observed value, which are Gaussian distributions with the first and second moments given by equations (11) to (14) above, are written as P(x_{t,n} | Y_{t-1}) and P(y_{t,k,s,n} | Y_{t-1}), respectively. Here, Y_{t-1} is the set of observed values of the subject up to time t-1, and y_{t,k,s,n} is the observed value of subject n at time t in a detection target area of size s in the image captured by camera k. When t = 1, the observed values and state variables of the subject are taken to be the initial values.

In step S6003, the area setting unit 3005 sets the detection target areas used to detect subjects in step S6005, described later. FIG. 9 is a flowchart showing an example of a processing procedure for setting the detection target areas in step S6003.

In step S7001, the area setting unit 3005 acquires the pixel coordinates of the second candidate regions that were created in the previous loop and contain subjects. The second candidate regions used in step S7001 are the candidates from which detection target areas are selected in step S7004, described later, and are created in step S7003 so that a different second candidate region is assigned to each subject, one region per subject. A subject to which a second candidate region has been assigned is called the representative covering element of that region, and any other subject contained in the region is called a non-representative covering element. The area setting unit 3005 can move the second candidate region assigned to a subject based on the currently predicted position of that subject, which is its representative covering element. The current pixel coordinates of each subject have been predicted in step S6002 (that is, they are the first moment of the predicted distribution of the observed value (equation (6))). For example, the area setting unit 3005 may move the second candidate region by the same displacement as the representative covering element moved from its previous position to its current position, or may move the region so that its center coordinates coincide with the predicted position of the representative covering element. When no representative covering element is assigned to a second candidate region, the area setting unit 3005 need not move that region. When a representative covering element has left the field of view of all cameras, the area setting unit 3005 may delete the corresponding second candidate region. When t = 1, no second candidate region exists yet, so the process proceeds to step S7002.
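The two ways of moving a second candidate region described above can be sketched as follows, assuming a region is an axis-aligned rectangle `(left, top, right, bottom)` in pixel coordinates. The function names and the rectangle representation are hypothetical.

```python
def shift_region(region, previous_pos, predicted_pos):
    """Move the region by the same displacement as its representative
    covering element moved from previous_pos to predicted_pos."""
    left, top, right, bottom = region
    du = predicted_pos[0] - previous_pos[0]
    dv = predicted_pos[1] - previous_pos[1]
    return (left + du, top + dv, right + du, bottom + dv)


def recenter_region(region, predicted_pos):
    """Move the region so that its center coincides with predicted_pos,
    keeping its width and height unchanged."""
    left, top, right, bottom = region
    half_w = (right - left) / 2.0
    half_h = (bottom - top) / 2.0
    u, v = predicted_pos
    return (u - half_w, v - half_h, u + half_w, v + half_h)
```

The first option preserves where the subject sits inside the region, while the second always places the subject at the center; the document allows either.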

In B7001, the area setting unit 3005 checks whether all subjects are covered. For example, based on the second candidate regions created in the previous loop and the subject positions predicted in S6002, the area setting unit 3005 can determine whether every subject is covered by at least one of the second candidate regions. If any subject is not covered by a second candidate region, second candidate regions can be assigned. The area setting unit 3005 can also determine whether there is a subject to which no second candidate region is assigned. It may determine whether any subject has newly moved into the field of view of any camera since the previous detection. If there is a subject to which no second candidate region is assigned, second candidate regions can be assigned. Second candidate regions can also be assigned to the subjects when time t = 1. When second candidate regions are to be assigned, the process proceeds to step S7002; otherwise, it proceeds to step S7004.

In step S7002, the area setting unit 3005 selects, for each camera, one or more first candidate regions from the set of candidate regions (created in the same manner as in step S4004 of the first embodiment) so that all subjects present in each captured image are covered. For example, as in step S4004, the area setting unit 3005 may remove from the candidate regions those that cover no subject. Next, in step S7003, the area setting unit 3005 selects the second candidate regions from the set of first candidate regions selected for all cameras so that every subject is assigned at least one region that differs from those assigned to the other subjects. To do so, the area setting unit 3005 can select the second candidate regions by, for example, solving the integer programming problem of Equation (15). Here, i is the index of a subject and m is the total number of subjects; j is the index of a candidate region and s_j is the score of candidate region j. x_j is 1 if candidate region j is selected and 0 otherwise, and a_ij is 1 if candidate region j covers subject i and 0 otherwise. As the score of a region, the area setting unit 3005 can use the highest of the predicted detection scores of the subjects contained in that region, as in the example of FIG. 7 of the first embodiment. The area setting unit 3005 can perform the above assignment for Equation (15) by using a greedy method, the Hungarian method, or the like. In this way, second candidate regions can be assigned to the subjects, and the subject to which a given second candidate region is assigned is treated as the representative covering element of that region.
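As a rough illustration of the greedy option mentioned for step S7003, the sketch below assigns each subject a distinct region, preferring the highest-scoring region that covers it. It is not Equation (15) itself: the dictionary layout, the function name, and the "most constrained subject first" ordering are assumptions made for this example, and a greedy pass can fail in cases where the Hungarian method would still find a valid assignment.

```python
def assign_regions_greedy(covers, scores):
    """Assign one distinct region to every subject.

    covers[j] is the set of subject ids covered by region j (a_ij = 1),
    scores[j] is the region score s_j.
    Returns a dict {subject id: region id}.
    """
    subjects = sorted(set().union(*covers.values()))
    # Handle the most constrained subjects (fewest covering regions) first.
    subjects.sort(key=lambda i: sum(i in c for c in covers.values()))
    assignment, used = {}, set()
    for i in subjects:
        options = [j for j, c in covers.items() if i in c and j not in used]
        if not options:
            raise ValueError(f"no unused region covers subject {i}")
        best = max(options, key=lambda j: scores[j])
        assignment[i] = best   # subject i becomes the representative of region `best`
        used.add(best)
    return assignment


covers = {"R1": {"A1", "A2"}, "R2": {"A2"}, "R3": {"A1"}}
scores = {"R1": 0.9, "R2": 0.5, "R3": 0.4}
result = assign_regions_greedy(covers, scores)
```

In this toy input, A1 takes the high-scoring region R1, so A2 falls back to R2 even though R1 also covers it.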

In step S7004, the area setting unit 3005 selects, from the second candidate regions, the detection target regions to be used in step S6005, described later. For example, the area setting unit 3005 may obtain the detection target regions by solving Equation (1) of the first embodiment with these second candidate regions as the candidate regions.

FIG. 10 shows an example similar to that of FIG. 8 and illustrates the second candidate regions and the detection target regions using images from six viewpoints captured by the detection device 3000 assumed in the present embodiment. The image of each viewpoint is the same as that of the viewpoint with the same reference number in FIG. 8. Each image shows the same subjects (A1 to A4, B1 to B4, and C0) as the corresponding image in FIG. 8. Image examples 1400, 1410, 1420, 1430, 1440, 1450, and 1460 in FIG. 10 are examples of images captured by cameras (viewpoints) 801, 802, 803, 804, 805, and 806, respectively. In FIG. 10, region 1401 is second candidate region C1 and is ultimately selected as a detection target region. Regions 1411 and 1412 are second candidate regions C2 and C3, respectively; C2 is ultimately not selected as a detection target region, whereas C3 is. Region 1421 is second candidate region C4 and is ultimately selected as a detection target region. Regions 1431, 1432, 1433, and 1434 are second candidate regions C5, C6, C7, and C8, respectively; C5 and C6 are ultimately not selected as detection target regions, whereas C7 and C8 are. Regions 1441, 1442, and 1443 are second candidate regions C9, C10, and C11, respectively, and all three are ultimately selected as detection target regions. Image example 1450 contains no second candidate region.

FIG. 11 shows an example of a table listing the second candidate regions shown in FIG. 10. In FIG. 11, as described above, one second candidate region is assigned to each of the subjects (11 in this example). The number of second candidate regions is not particularly limited as long as each subject has at least one second candidate region assigned to it that differs from those of the other subjects. For example, each subject may have two different second candidate regions, for a total of 22 second candidate regions in this example.

FIG. 12 shows an example of a table listing the detection target regions selected from the second candidate regions of FIG. 11 based on Equation (1). In step S7004, the area setting unit 3005 can select the final detection target regions, such as those shown in FIG. 12, that maximize the total score by performing the constrained optimization of Equation (1) based on the scores of the second candidate regions. In this example, eight detection target regions are selected, which reduces the computational cost compared with setting one detection target region for each subject.

Through this processing, the second candidate regions and the detection target regions used for detection in the current loop can be set. Because the second candidate regions selected in step S7003 are also used in step S7004 of the next loop, the area setting unit 3005 may store them in a storage device (not shown). When no second candidate region is assigned in B7001, the second candidate regions moved in step S7001 may be stored in the storage device instead. The storage device may be inside or outside the detection device 3000. The detection device 3000 may save data to the storage device via a USB cable, via an SD card or the like, or via wireless communication.

In step S6004, the K moving image acquisition units of the imaging unit 3100 each acquire a captured image at a given time. The imaging by the cameras of these K moving image acquisition units may be controlled in any manner. For example, the shutters of the K cameras may be released at a period synchronized by an electrical signal such as a trigger pulse or a synchronization signal, or each camera may capture images at its own autonomous period according to the clock of its internal microcontroller. The number of the K cameras that capture images at the same time is not particularly limited. For example, half of the K cameras may capture images simultaneously, followed by the remaining half capturing images simultaneously. The means of connecting the imaging unit 3100 and the processing unit 3200 is also not particularly limited. They may be connected via a communication path such as a local area network, or by wire via a USB cable or the like. For example, the imaging unit 3100 may store the captured images it outputs in a storage device (not shown), and the processing unit 3200 may acquire predetermined frames from that storage device.

In the present embodiment, for the sake of explanation, the imaging unit and the processing unit are assumed to be connected via a communication path. With such a configuration, frames acquired and transmitted by the imaging unit 3100 and received by the processing unit 3200 may be dropped because of, for example, the performance or bandwidth limits of a relay such as a switching hub on the network path. For this reason, the processing device according to the present embodiment may buffer the frames acquired by the imaging unit 3100 at every time step. In that case, when a frame is dropped, the frame acquired at that time may be the same as the frame acquired at the previous time.
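The buffering behavior described here, reusing the previous frame when a frame is dropped, can be sketched as below. The `FrameBuffer` class and the convention that a dropped frame arrives as `None` are assumptions for illustration only.

```python
class FrameBuffer:
    """Keeps the most recent frame per camera and reuses it on a drop."""

    def __init__(self):
        self._last = {}  # camera id -> most recently received frame

    def get(self, camera_id, frame):
        """Return the frame to process for this time step.

        `frame` is None when the frame for this time step was dropped;
        in that case the previously buffered frame (if any) is returned.
        """
        if frame is not None:
            self._last[camera_id] = frame
        return self._last.get(camera_id)


buf = FrameBuffer()
first = buf.get(801, "frame_t1")   # normal reception
second = buf.get(801, None)        # drop at the next time step
```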

In step S6005, the detection unit 3006 detects subjects in the detection target regions set in step S6003 within the captured images acquired in step S6004. The present embodiment uses a detector having the same configuration as that used in the first embodiment. Because the subjects in this example are a person's head or a soccer ball, a detector trained specifically to detect heads and balls may be used.

In step S6006, the association unit 3007 associates, for each camera, the observed values of subjects obtained from the captured image at time t with the subjects in three-dimensional space. At time t, the association unit 3007 can obtain J observed values {y_{t,ks1}, y_{t,ks2}, ..., y_{t,ksJ}}, which may include false detections, in the detection target region of size s in the image captured by camera k. From the first and second moments of the predictive distribution of the observed values that the prediction unit 3004 obtains by Equations (13) and (14), the following Gaussian distribution (Equation (16)) is written for an arbitrary j-th observed value. By passing the observed value y_{t,ksj} to this function as its argument, the association unit 3007 can calculate the likelihood l_{ksj,n} that it is an observed value of subject n. For example, the association unit 3007 can associate observed values with subjects by applying Equation (16) to each of the observed values {y_{t,ks1}, y_{t,ks2}, ..., y_{t,ksJ}} and associating the observed value with a high likelihood with subject n. When time t is 1, that is, in the first loop, identification information is assigned to each detected subject.
l_{ksj,n} = N(y_{t,ksj}; y_{t|t-1,k,s,n}, U_{t|t-1,k,s,n})   Equation (16)

The association method in step S6006 is not particularly limited. For example, based on a greedy method, the association unit 3007 can assign to each subject the observed value with the maximum likelihood among the plurality of observed values. Alternatively, the association unit 3007 may associate subjects with observed values by linear programming so that the sum of the likelihoods of the observed values assigned to the subjects is maximized. In that case, for example, the Mahalanobis distance calculated from each observed value and the first and second moments of the predictive distribution can be used, and the association that minimizes the sum of the Mahalanobis distances can be computed by the Hungarian method, which yields the association that maximizes the sum of the likelihoods.
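The likelihood of Equation (16) and the two association strategies can be sketched as follows for 2-D pixel observations. Function names and the data layout are illustrative assumptions, and the exhaustive search over permutations is only a small-scale stand-in for the Hungarian method, not the patent's implementation.

```python
import math
from itertools import permutations


def mahalanobis_sq(y, mean, cov):
    """Squared Mahalanobis distance for a 2-D point and a 2x2 covariance."""
    (a, b), (c, d) = cov
    det = a * d - b * c
    dx, dy = y[0] - mean[0], y[1] - mean[1]
    # (dx, dy) inv(cov) (dx, dy)^T using the closed-form 2x2 inverse
    return (d * dx * dx - (b + c) * dx * dy + a * dy * dy) / det


def gaussian_likelihood(y, mean, cov):
    """Equation (16) in the 2-D case: N(y; mean, cov)."""
    (a, b), (c, d) = cov
    det = a * d - b * c
    return math.exp(-0.5 * mahalanobis_sq(y, mean, cov)) / (2 * math.pi * math.sqrt(det))


def associate_greedy(observations, predictions):
    """For each subject, pick the index of the maximum-likelihood observation.

    Note: two subjects may end up sharing one observation.
    """
    return {
        n: max(range(len(observations)),
               key=lambda j: gaussian_likelihood(observations[j], mean, cov))
        for n, (mean, cov) in predictions.items()
    }


def associate_min_distance(observations, predictions):
    """One-to-one association minimizing the summed Mahalanobis distance.

    Brute force over permutations; assumes at least as many observations
    as subjects (returns None otherwise).
    """
    names = list(predictions)
    best, best_cost = None, math.inf
    for perm in permutations(range(len(observations)), len(names)):
        cost = sum(math.sqrt(mahalanobis_sq(observations[j], *predictions[n]))
                   for n, j in zip(names, perm))
        if cost < best_cost:
            best, best_cost = dict(zip(names, perm)), cost
    return best
```

Each entry of `predictions` is a `(mean, covariance)` pair standing in for the first and second moments of a subject's predictive distribution; an observation left unmatched by the one-to-one search would be treated as a false detection.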

In step S6007, the weight calculation unit 3008 calculates the weight of each observed value at time t. For example, when a subject is concealed by another subject, the weight calculation unit 3008 can assign a lower weight to the concealed subject. In the present embodiment, the association unit 3007 may predict and quantify the probability that such concealment occurs, that is, the predicted concealment rate. The association unit 3007 can quantify the predicted concealment rate based on the similarity between the predictive distributions of the observed values of the subject and of another subject, and on the front-to-back order of the two subjects as seen from the camera.

The following describes a lightweight method, according to the present embodiment, of quantifying the predicted concealment rate using only the first moment of the predictive distribution of the observed values of the subjects. The calculation is not particularly limited; in this example, the similarity between the predictive distributions of the observed values is expressed by the cosine similarity. That is, the weight calculation unit 3008 can express the similarity between subject n and subject m as cos^β(y_{t|t-1,k,s,m}, y_{t|t-1,k,s,n}), where β is a predetermined exponent given in advance.

The weight calculation unit 3008 can also calculate the front-to-back order of subject n and subject m with respect to the camera by Equation (17) below, where C_k is the position of camera k in three-dimensional space. In other words, Equation (17) is a function that returns 1 when subject m is closer to camera k than subject n is, and 0 otherwise (the difference between the two distances is clipped to the range [0, 1]). Using this function, the weight calculation unit 3008 can calculate the predicted concealment rate p^occ_{t,k,s,n} from Equation (18).
min(max(||x_{t,n} − C_k||_2 − ||x_{t,m} − C_k||_2, 0), 1)   Equation (17)

Here, N_{t,k,s} is the number of subjects detected in the detection target region of size s in the image captured by camera k at time t. Equation (18) is based on the idea that, when another subject is in front of a subject as seen from camera k and the lines of sight from camera k to the two subjects are similar, the subject behind is concealed by the subject in front as seen from camera k. The product of Equation (17) and the cosine similarity is close to 1 when subject m is closer to camera k than subject n is and the two subjects are close to each other in pixel coordinates. Equation (18) performs this calculation for a given subject against all other subjects and normalizes the result. That is, p^occ_{t,k,s,n} = 1 indicates that subject n is completely concealed by other subjects, and p^occ_{t,k,s,n} = 0 indicates that subject n is not concealed at all.
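The ingredients of the predicted concealment rate can be sketched as follows. Equation (17) and the cos^β similarity follow the text; the normalization is an assumption, because Equation (18) is not reproduced in this text. The sketch simply averages the product over the other subjects, which matches the described range of 0 (not concealed) to 1 (fully concealed) but may differ from the actual equation. All function and parameter names are hypothetical.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two vectors of predicted pixel coordinates."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.hypot(*u) * math.hypot(*v))


def in_front(x_n, x_m, cam):
    """Equation (17): clipped difference of the distances from the camera.

    Positive when subject m is closer to camera position `cam` than subject n.
    """
    return min(max(math.dist(x_n, cam) - math.dist(x_m, cam), 0.0), 1.0)


def predicted_concealment(n, positions, pixels, cam, beta=1.0):
    """ASSUMED form of Equation (18): mean over the other subjects of
    cos^beta(predicted pixels) * Equation (17)."""
    others = [m for m in positions if m != n]
    if not others:
        return 0.0
    total = sum(
        cosine_similarity(pixels[m], pixels[n]) ** beta
        * in_front(positions[n], positions[m], cam)
        for m in others
    )
    return total / len(others)


cam = (0.0, 0.0, 0.0)
positions = {"n": (0.0, 0.0, 10.0), "m": (0.0, 0.0, 5.0)}   # 3-D positions
pixels = {"n": (320.0, 240.0), "m": (320.0, 240.0)}          # predicted pixels
rate_n = predicted_concealment("n", positions, pixels, cam)  # m stands in front of n
rate_m = predicted_concealment("m", positions, pixels, cam)  # nothing in front of m
```

Note that, as written, Equation (17) yields intermediate values when the two distances differ by less than 1 in the units of the 3-D coordinates.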

In Equations (17) and (18), the weight calculation unit 3008 quantifies the predicted concealment rate using only the first moment of the predictive distribution of the observed values, but the method is not limited to this. For example, the weight calculation unit 3008 may take the predictive distribution of the observed values up to its second moment into account, measure the distance between distributions by the KL divergence or the like as the similarity of the lines of sight from the camera to the subjects, and use that value to quantify the predicted concealment rate. Although the weight calculation unit 3008 in the present embodiment quantifies the predicted concealment rate between subjects, it is not limited to this condition. For example, the weight calculation unit 3008 may quantify the predicted concealment rate between a subject and an occluder other than a subject, such as a stationary occluder like a signboard. The KL divergence is a measure of how similar two probability distributions are and is defined by Equation (25).

In step S6008, the update unit 3009 uses the observed values at time t to update the predictive distribution of each subject's state variables and obtain the filter distribution of those state variables. In the state-space model according to the present embodiment, the number of observed values for a particular subject can change because, for example, the number of cameras able to observe the subject varies as the subject moves. In view of this, the update unit 3009 may integrate the plurality of observed values output by the cameras for a particular subject and update the predictive distribution of the subject's state variables using the integrated value. In the following, "filter distribution" refers to the filter distribution of the state variables of a subject.

In the present embodiment, the update unit 3009 can, for example, integrate the observed values of a subject while taking the predicted concealment rate p^occ_{t,k,s,n} into account. That is, the update unit 3009 can integrate an observed value for which concealment is predicted with the other observed values after weighting it according to its predicted concealment rate, thereby lowering its contribution to the update of the state variables. Also, for example, by using the score q_{t,k,s,n} of an observed value, the update unit 3009 can increase the contribution to the update of an observed value of a subject that is likely to be under conditions favorable for detection, such as its distance or orientation relative to the camera, when integrating it with the other observed values. The method by which the update unit 3009 integrates the observed values is not particularly limited. Two approaches to such integration are described below.

[Integration method 1]
The update unit 3009 may, for example, multiply the observation noise variance-covariance matrix R_t of the likelihood function P(y_{t,ksj} | x_{t,n}) of each camera by the reciprocals of (1 − p^occ_{t,k,s,n}) and q_{t,k,s,n}. Then, on the premise that each camera acquires its observed values independently, the update unit 3009 can model an integrated likelihood function that combines the observed values of the cameras as a joint distribution, for example as in Equation (19). Here, Y_{t,KSn,n} is the set of observed values of the subject observed in the plurality of detection target regions of the plurality of cameras at time t. The variance-covariance matrix of P(y_{t,k,s,n} | x_{t,n}, q_{t,k,s,n}, p^occ_{t,k,s,n}) can be taken to be (q_{t,k,s,n} · (1 − p^occ_{t,k,s,n}))^{-1} · R_t. In other words, the model makes the observation noise of a subject larger the lower its detection score and the higher its predicted concealment rate.

[Integration method 1-1]
The update unit 3009 can apply the ordinary extended Kalman filter update by, for example, calculating the distribution of the product of the likelihood functions using Equation (20). That is, the update unit 3009 can estimate the predictive distribution of the state variables from the distribution of the product of the likelihood functions. Here, S_kn is the total number of detection target regions for subject n in the image captured by camera k at a given time. With this method, the update unit 3009 can estimate the predictive distribution of the state variables once it is given the number of cameras used. That is, the update unit 3009 can evaluate Equation (20) by deriving the products of multiple Gaussian distributions in advance and implementing the product of Gaussian distributions as a function for each number of observed values from 1 up to a predetermined number.

[Integration method 1-2]
The update unit 3009 can also calculate the product of the likelihood functions by using, for example, the recursive Equation (21). With this method, the predictive distribution of the state variables can be estimated even when, for example, the total number of cameras used for detection is unknown.
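The idea behind integration method 1 can be sketched in one dimension: each observation's noise variance is scaled by (q · (1 − p^occ))^{-1}, and the Gaussian likelihoods are multiplied one at a time, so the number of cameras need not be known in advance. This is an illustrative simplification under stated assumptions, not Equations (19) to (21) themselves: it uses a scalar quantity that is observed directly and relies on the standard result that the precisions of multiplied Gaussians add. Function names are hypothetical.

```python
def scaled_variance(r, q, p_occ):
    """Observation variance (q * (1 - p_occ))^-1 * r from integration method 1.

    Returns infinity when the weight is zero (fully concealed or zero score).
    """
    w = q * (1.0 - p_occ)
    return float("inf") if w <= 0.0 else r / w


def fuse(mean_a, var_a, mean_b, var_b):
    """Renormalized product of two 1-D Gaussians: precisions add."""
    if var_b == float("inf"):
        return mean_a, var_a
    if var_a == float("inf"):
        return mean_b, var_b
    precision = 1.0 / var_a + 1.0 / var_b
    return (mean_a / var_a + mean_b / var_b) / precision, 1.0 / precision


def fuse_all(observations):
    """Recursively fuse any number of observations (y, r, q, p_occ)."""
    mean, var = 0.0, float("inf")   # flat starting point carries no information
    for y, r, q, p_occ in observations:
        mean, var = fuse(mean, var, y, scaled_variance(r, q, p_occ))
    return mean, var


two_cameras = fuse_all([(10.0, 4.0, 1.0, 0.0), (14.0, 4.0, 1.0, 0.0)])
one_concealed = fuse_all([(10.0, 4.0, 1.0, 0.0), (100.0, 4.0, 1.0, 1.0)])
```

In the second call the fully concealed observation is ignored. The fused variance shrinks as more cameras are multiplied in.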

[Integration method 2]
The update unit 3009 may, for example, model the integrated likelihood function as a mixture of the likelihood functions P(y_{t,ksj} | x_{t,n}) of the cameras, with the product of (1 − p^occ_{t,k,s,n}) and q_{t,k,s,n} as the mixing ratio, as in Equation (22). With this approach, the update unit 3009 can perform an integration in which likelihood is distributed not only at the intersection of the lines of sight of the cameras (the straight lines connecting each camera's optical center and the subject) but also along each line of sight.

For example, the update unit 3009 may integrate the observed values by calculating the sum of individually weighted Kalman feedback terms using Equation (23). Because this method does not model the integration as a product of Gaussian distributions, the variance of the distribution does not collapse even when, for example, many cameras are used for detection. In other words, the observed values of all cameras are integrated without any being excluded. As a result, the update unit 3009 can estimate state variables that change smoothly over time.

With either of these methods, integrating the observed values of the cameras and performing an update that reflects the integration makes it possible to obtain a filter distribution of the state variables in which the error between the plurality of observed values and the prediction is corrected. When the predicted concealment rate p^occ_{t,k,s,n} of every subject is 1 in every camera, or when all observed values are missing, the update unit 3009 may take the un-updated predictive distribution of the state variables as the filter distribution. That is, Equation (24) may be applied.
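A scalar sketch of the weighted-feedback idea of integration method 2, including the fallback to the prediction, is given below. Equations (23) and (24) are not reproduced in this text, so the specifics here are assumptions: a state observed directly (H = 1), weights q · (1 − p^occ) normalized to sum to 1, the usual scalar Kalman gain, and an update of the mean only. Names are hypothetical.

```python
def weighted_feedback_update(x_pred, p_pred, observations):
    """Return the filtered mean for one subject.

    x_pred, p_pred: predicted mean and variance of the scalar state.
    observations: list of (y, r, q, p_occ), one entry per camera.
    """
    weights = [q * (1.0 - p_occ) for _, _, q, p_occ in observations]
    total = sum(weights)
    if total <= 0.0:
        # Every observation is missing or fully concealed:
        # keep the un-updated prediction as the filter estimate.
        return x_pred
    x = x_pred
    for (y, r, _, _), w in zip(observations, weights):
        gain = p_pred / (p_pred + r)             # scalar Kalman gain for this camera
        x += (w / total) * gain * (y - x_pred)   # weighted Kalman feedback term
    return x


updated = weighted_feedback_update(0.0, 1.0, [(2.0, 1.0, 1.0, 0.0), (4.0, 1.0, 1.0, 0.0)])
all_concealed = weighted_feedback_update(0.0, 1.0, [(5.0, 1.0, 1.0, 1.0)])
no_data = weighted_feedback_update(0.0, 1.0, [])
```

Because the feedback terms are averaged rather than multiplied, adding cameras does not by itself drive the correction toward a single observation.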

In step S6009, the visualization unit 3010 visualizes, for each subject, its estimated position in three-dimensional space and the time series of such estimated positions. That is, the estimated positions of the subject are visualized over time. The visualization unit 3010 may, for example, visualize the time series of estimated positions by drawing them in a virtual three-dimensional space or by superimposing them as trajectories or points on the captured images acquired by the cameras. The visualization unit can also transmit the result of the visualization to the monitoring unit 1300. The process then moves on to the next time step, and the prediction unit 3004 uses the updated state variables to obtain the predictive distributions of the state variables and the observed values at the next time.

With this configuration, detection target regions can be set in the captured images of a plurality of imaging devices so that each subject is covered by a detection target region in at least one captured image. That is, the state of a subject in three-dimensional space can be predicted based on the predicted value of the subject's state in the previous loop and the observed values of the subject from images acquired by a plurality of cameras. Based on the predicted current state of the subject, predicted values of the subject's coordinates and score in the captured images can further be obtained, the detection target regions can be set based on those coordinates and scores, and the subject can be detected. Moreover, regions can be set so as to maximize the detection score predicted for the subject at the next time. Consequently, the positions and time-series trajectories of a plurality of subjects in three-dimensional space, heads and a ball in this example, can be estimated while both reducing processing cost and improving detection accuracy.

[Embodiment 4]
The detection device according to the fourth embodiment controls the posture of the imaging device according to the predicted position of the subject, and sets the detection target region based on the resulting posture control amount. FIG. 13 is a block diagram showing an example of the functional configuration of the detection device according to the fourth embodiment. The detection device 8000 according to the present embodiment uses cameras capable of pan, tilt, and zoom operations (hereinafter referred to as PTZ operations) to track subjects at a reduced processing cost. To that end, the detection device 8000 has an imaging unit 8100 and a processing unit 8200. The imaging unit 8100 and its K moving image acquisition units (for example, 8001 and 8002) are the same as the imaging unit 3100 and its K moving image acquisition units in the third embodiment, except that they are capable of PTZ operations, so duplicate description is omitted. A PTZ operation is an operation including one or more of a panning operation, which controls the horizontal orientation, a tilting operation, which controls the vertical orientation, and a zoom operation, which enlarges or reduces the subject. That is, a moving image acquisition unit capable of PTZ operations can move its imaging range horizontally, vertically, or in a direction combining the two. The processing unit 8200 has the same configuration as the processing unit 3200 in the third embodiment except that it has a control unit 8003, so duplicate description is omitted. The control unit 8003 controls the imaging range of the imaging unit 8100 by controlling each moving image acquisition unit. For example, the control unit 8003 can control the imaging range of the imaging unit 8100 by applying PTZ operations to each moving image acquisition unit.

FIG. 14 is a flowchart showing an example of a processing procedure for detection according to the present embodiment. The processing procedure of the detection device 8000 according to the present embodiment is the same as that of the third embodiment except for steps S9001, S9003, and S9004.

In step S9001, the initial value setting unit 3003 sets the values of the position, posture, and velocity of the subject at the initial time of the detection process, and initializes each control parameter of the camera to its initial value. The setting of the position, posture, and velocity of the subject is the same as in the third embodiment, so its description is omitted. In this example, the pan angle, tilt angle, and zoom amount of the camera at time t are denoted P_t, T_t, and Z_t, respectively. The movable ranges of the PTZ operations are expressed as P_min ≤ P_t ≤ P_max, T_min ≤ T_t ≤ T_max, and Z_min ≤ Z_t ≤ Z_max. The control amounts of the imaging range applied by the PTZ operations (hereinafter referred to as PTZ control amounts) are denoted ΔP_t, ΔT_t, and ΔZ_t, respectively. The ranges of the PTZ control amounts that can be applied in one time step at time t are expressed as Δ_min P_t ≤ ΔP_t ≤ Δ_max P_t, Δ_min T_t ≤ ΔT_t ≤ Δ_max T_t, and Δ_min Z_t ≤ ΔZ_t ≤ Δ_max Z_t. These PTZ control values may be the same for all cameras, or may differ depending on the position, type, and so on of each camera.
For example, at least one of the plurality of cameras may capture the entire range of the pitch regardless of the movement of the subject. With such a configuration, when a subject has temporarily been lost, for example because of a malfunction of the detection device, tracking of that subject can be resumed more easily based on the detection result from the image that captures the entire pitch.
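The range constraints above can be illustrated with a minimal sketch. It assumes that a requested control amount is first limited to the per-step range and that the resulting state is then limited to the movable range; the `AxisLimits` container, the function names, and the numeric values are hypothetical and do not appear in the embodiment.

```python
from dataclasses import dataclass


@dataclass
class AxisLimits:
    """Ranges for one PTZ axis (hypothetical container, not from the source)."""
    lo: float       # lower end of the movable range, e.g. P_min
    hi: float       # upper end of the movable range, e.g. P_max
    step_lo: float  # smallest control amount per time step, e.g. Δ_min P_t
    step_hi: float  # largest control amount per time step, e.g. Δ_max P_t


def clamp(value, lo, hi):
    """Limit value to the closed interval [lo, hi]."""
    return max(lo, min(hi, value))


def apply_control(state, delta, limits):
    """Return the axis state at the next time after a requested control amount.

    The request is limited to the per-step control range, and the resulting
    state is limited to the movable range of the axis.
    """
    delta = clamp(delta, limits.step_lo, limits.step_hi)
    return clamp(state + delta, limits.lo, limits.hi)


# Illustrative values only: a pan axis of +/-90 degrees, +/-5 degrees per step.
pan = AxisLimits(lo=-90.0, hi=90.0, step_lo=-5.0, step_hi=5.0)
next_pan = apply_control(state=88.0, delta=12.0, limits=pan)
```

The same function could be applied independently to the tilt and zoom axes, with per-camera limits where the cameras differ.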

Further, in step S9001, the initial value setting unit 3003 may set each of P_t, T_t, and Z_t to 0. However, the values of P_t, T_t, and Z_t set here are not particularly limited and may be set as appropriate according to the initial state of the camera.

In step S9003, the area setting unit 3005 sets the detection target region used for detecting the subject in step S6006, taking the control amounts of the camera's PTZ operations into account. The area setting unit 3005 also acquires the PTZ control amounts required to image such a detection target region. The detailed processing procedure of step S9003 will be described later with reference to the flowchart of FIG. 15.

In step S9004, the control unit 8003 controls the imaging range of each camera of the imaging unit 8100 based on the PTZ control amounts acquired in step S9003. In this example, the control unit 8003 controls the imaging range of each camera based on the PTZ control amounts ΔP_t, ΔT_t, and ΔZ_t estimated for that camera at time t.

The setting process performed by the area setting unit 3005 in step S9003 is described below with reference to FIG. 15. FIG. 15 is a flowchart showing an example of a processing procedure for the setting in step S9003. Steps S1502 and S1503 and the processing from step S1506 onward are the same as steps S7002 and S7003 and the processing from step S7004 onward in FIG. 9 of the third embodiment, respectively, so their description is omitted.

In step S1501, as in step S7001 of the third embodiment, the area setting unit 3005 moves the second candidate regions set in the previous loop based on the current positions of the representative covering elements. At this time, the area setting unit 3005 may move a second candidate region in consideration of not only the current position of the representative covering element but also, for example, the control range of the camera's PTZ control amounts. That is, for each camera, the area setting unit 3005 may calculate a range obtained by extending the range of the captured image upward, downward, leftward, and rightward by the vertical control amount of Δ_max P_t and the horizontal control amount of Δ_max T_t. The area setting unit 3005 may then move the second candidate regions within the range calculated in this way for the captured image of each camera.
For example, when such movement takes a second candidate region beyond the original imaging range, the control unit 8003 can, in the later step S9004, move the imaging range by PTZ control according to the new position of the second candidate region. That is, the area setting unit 3005 can acquire the corresponding PTZ control amounts. When a plurality of second candidate regions move beyond the original imaging range, the area setting unit 3005 may acquire PTZ control amounts that allow the camera to image all of the moved second candidate regions. Further, when the camera cannot image all of the moved second candidate regions even with PTZ control, the area setting unit 3005 may assign priorities according to the scores of the second candidate regions and acquire control amounts such that the second candidate regions with high priority are imaged. In such a case, the area setting unit 3005 may move the second candidate regions with low priority along with the movement of the edge of the field of view so that they do not fall outside the imaging range. The area setting unit 3005 also need not move a second candidate region that contains no representative covering element.
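One way to picture the movement and the priority rule of step S1501 is the sketch below. It works in pixel coordinates and assumes that the largest pan and tilt control amounts have already been converted into pixel offsets `max_shift`; that conversion, the greedy selection by score, and all function names are assumptions made for illustration, not the procedure of the embodiment. The repositioning of low-priority regions at the edge of the field of view is not modelled.

```python
def clamp(value, lo, hi):
    """Limit value to the closed interval [lo, hi]."""
    return max(lo, min(hi, value))


def move_region(region, center, image_size, max_shift):
    """Re-centre a candidate region on `center` within the extended range.

    region     -- (x, y, w, h) of a second candidate region, in pixels
    center     -- (cx, cy) of the representative covering element
    image_size -- (W, H) of the captured image
    max_shift  -- (mx, my): assumed pixel equivalents of the largest
                  pan/tilt control amounts for one time step
    """
    _, _, w, h = region
    W, H = image_size
    mx, my = max_shift
    x = clamp(center[0] - w / 2, -mx, W + mx - w)
    y = clamp(center[1] - h / 2, -my, H + my - h)
    return (x, y, w, h)


def axis_shift(lo, hi, size):
    """View shift along one axis that brings [lo, hi] inside [0, size]."""
    if lo < 0:
        return lo
    if hi > size:
        return hi - size
    return 0.0


def required_shift(regions, scores, image_size):
    """Pick regions in descending score order while their bounding box still
    fits in one view, then return the view shift (dx, dy) and kept indices."""
    W, H = image_size
    order = sorted(range(len(regions)), key=lambda i: scores[i], reverse=True)
    kept, box = [], None
    for i in order:
        x, y, w, h = regions[i]
        if box is None:
            cand = (x, y, x + w, y + h)
        else:
            cand = (min(box[0], x), min(box[1], y),
                    max(box[2], x + w), max(box[3], y + h))
        if cand[2] - cand[0] <= W and cand[3] - cand[1] <= H:
            kept.append(i)
            box = cand
    if box is None:
        return (0.0, 0.0), kept
    return (axis_shift(box[0], box[2], W), axis_shift(box[1], box[3], H)), kept


moved = move_region((80, 40, 20, 20), (110, 50), (100, 100), (20, 10))
shift, kept = required_shift([(-20, 0, 20, 20), (100, 0, 20, 20)],
                             [0.3, 0.8], (100, 100))
```

In the second call the two regions cannot both fit in a 100-pixel-wide view, so only the higher-scoring region is kept and the view is shifted toward it.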

The processing in B1501 is basically the same as the processing in B7001 of the third embodiment, so only the differences are described. When a second candidate region contains a subject that was not detected the previous time, the area setting unit 3005 can return the PTZ states P_t, T_t, and Z_t at that time to their initial values. In such a case, the value of Z_t may be initialized after P_t and T_t have been returned to their initial values.

In step S1504, the area setting unit 3005 estimates the zoom amount that maximizes the score of the second candidate region. For example, the area setting unit 3005 may use a polynomial regression model that estimates the detection score with the distance between the subject and the camera as the explanatory variable, and thereby estimate, within the range that the zoom operation can control, the zoom control amount ΔZt_max that maximizes the detection score. The area setting unit 3005 may train such a polynomial regression model in the same manner as, for example, the regression models of equations (8) and (8') in the first embodiment. The method of searching for the zoom control amount that maximizes the score is not particularly limited, and a known method such as a grid search may be used. Further, when the increase in score obtained with the zoom control amount calculated in this way is smaller than a predetermined threshold, the area setting unit 3005 may set the zoom control amount to 0, that is, perform no zoom operation. Such processing reduces the processing cost by omitting zoom operations that would have only a minor effect.

In step S1505, when zoom control by the amount estimated in step S1504 changes the detection score, the area setting unit 3005 updates the existing score to the changed score.
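Steps S1504 and S1505 can be sketched as a grid search over the zoom control amount. The sketch assumes that zooming by a magnification z makes the subject appear as if it were at distance d / z, so that the distance-based score polynomial can be reused; that relation, the coefficient values, and the function names are illustrative assumptions, not part of the embodiment.

```python
def poly_score(coeffs, distance):
    """Evaluate the score polynomial; coeffs run from the constant term up."""
    return sum(c * distance ** k for k, c in enumerate(coeffs))


def best_zoom_step(coeffs, distance, zoom, step_range, n_grid=21, min_gain=0.05):
    """Grid-search the zoom control amount that maximizes the predicted score.

    distance   -- subject-to-camera distance (explanatory variable)
    zoom       -- current magnification Z_t (assumed > 0)
    step_range -- (lowest, highest) zoom control amount for one time step
    min_gain   -- threshold below which the zoom operation is skipped
    Returns (zoom control amount, predicted score after the control).
    """
    lo, hi = step_range
    base = poly_score(coeffs, distance / zoom)
    best_dz, best = 0.0, base
    for i in range(n_grid):
        dz = lo + (hi - lo) * i / (n_grid - 1)
        if zoom + dz <= 0:
            continue  # not a valid magnification
        score = poly_score(coeffs, distance / (zoom + dz))
        if score > best:
            best_dz, best = dz, score
    if best - base < min_gain:
        return 0.0, base  # gain too small: keep the current zoom and score
    return best_dz, best


# Illustrative model: score = 0.2 d - 0.01 d^2, which peaks at d = 10.
coeffs = [0.0, 0.2, -0.01]
dz, new_score = best_zoom_step(coeffs, distance=20.0, zoom=1.0,
                               step_range=(-0.5, 1.5))
```

The returned score plays the role of the updated score in step S1505; when the gain falls below the threshold, the control amount is 0 and the existing score is kept.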

According to such a configuration, the postures of a plurality of cameras can be controlled according to the predicted positions of the subjects, and the detection target regions can be set based on the resulting posture control amounts. It is therefore possible to provide a detection device that tracks a plurality of subjects efficiently while suppressing the cost of detection.

(Other Examples)
The present invention can also be realized by a process in which a program that implements one or more functions of the above-described embodiments is supplied to a system or device via a network or a storage medium, and one or more processors in a computer of that system or device read and execute the program. It can also be realized by a circuit (for example, an ASIC) that implements one or more functions.

The invention is not limited to the above embodiments, and various changes and modifications can be made without departing from the spirit and scope of the invention. Therefore, the claims are attached to make the scope of the invention public.

1001: Moving image acquisition unit, 1002: Initial value setting unit, 1003: Detection unit, 1004: ID association unit, 1005: Area setting unit, 1006: Visualization unit, 1100: Imaging unit, 1200: Processing unit, 1300: Monitoring unit

Claims

A detection device comprising:
detection means for detecting one or more subjects from a captured image; and
setting means for setting, according to the positions of the one or more subjects detected by the detection means from the captured image at a first time, a detection target area for the one or more subjects in the captured image at a second time following the first time, the detection target area being referred to by the detection means.

The detection device according to claim 1, wherein the setting means selects the detection target area from a plurality of predetermined candidate areas.

The detection device according to claim 2, wherein the setting means selects a plurality of candidate regions having different sizes as the detection target region.

The detection device according to claim 2 or 3, wherein the detection means outputs a score indicating the reliability of detection of the subject, and
the setting means selects the detection target area from the plurality of candidate areas based on the score of the subject included in each candidate area.

The detection device according to any one of claims 1 to 4, wherein the setting means selects the detection target area so that at least one detection target area covers all of the one or more subjects.

The detection device according to any one of claims 1 to 5, wherein the setting means sets the detection target area so that the detection target area covers the position of the subject at the first time or the predicted position of the subject at the second time.

The detection device according to claim 6, wherein the setting means predicts the position of the subject at the second time based on the position of the subject at the first time and the position of the subject at a time before the first time.

The detection device according to claim 6 or 7, wherein the detection target area includes a covering determination area and a buffer area set outside the covering determination area, and
the setting means sets the detection target area so that the covering determination area of the detection target area covers the position or the predicted position of the subject.

The detection device according to any one of claims 1 to 8, further comprising estimation means for predicting the state of the subject by using the result of detection of the subject by the detection means,
wherein the setting means sets the detection target area according to the predicted state of the subject.

The detection device according to claim 9, wherein the estimation means predicts a score representing the reliability of detection of the subject according to the predicted state of the subject, and
the setting means further sets the detection target area of the subject according to the predicted score.

The detection device according to any one of claims 1 to 10, further comprising acquisition means for acquiring a plurality of captured images by acquiring a captured image from each of a plurality of imaging devices,
wherein the setting means sets the detection target area of the subject for each captured image so that the subject is included in the detection target area in at least one captured image.

The detection device according to claim 11, wherein the setting means:
sets, in at least one captured image at the first time, mutually different regions each corresponding to and including one of the subjects;
moves the region corresponding to each subject based on the predicted position of that subject at the second time; and
selects at least one of the regions after the movement as the detection target area.

The detection device according to claim 11 or 12, wherein the setting means predicts the position of the subject in the three-dimensional space at the second time from the positions of the subject detected in the plurality of captured images at the first time, and
predicts the position of the subject in each of the captured images at the second time from the predicted position of the subject in the three-dimensional space at the second time.

The detection device according to any one of claims 11 to 13, wherein the setting means:
sets, for each of the plurality of captured images, one or more regions covering the position of the subject at the first time or the predicted position of the subject at the second time; and
selects the detection target area from the one or more regions of the plurality of captured images so that the subject is included in the detection target area in at least one captured image.

The detection device according to any one of claims 1 to 14, further comprising control means for controlling the posture of the imaging device that captures the captured image according to the predicted positions of the one or more subjects at the second time,
wherein the setting means sets the detection target area based on the posture control amount of the imaging device.

The detection device according to any one of claims 1 to 15, further comprising a visualization means for visualizing and outputting the position of the subject for each time series.

The detection device according to claim 16, wherein the detection means outputs a score indicating the reliability of detection of the subject, and
the visualization means visualizes the position of the subject for each time series by associating the position of the subject at the first time with the position of the subject at the second time, based on the position and the score of the subject at the first time and the position and the score of the subject at the second time.

A detection method performed by a detection device, comprising:
a step of detecting one or more subjects from a captured image; and
a step of setting, according to the positions of the one or more subjects detected from the captured image at a first time in the detecting step, a detection target area for the one or more subjects in the captured image at a second time following the first time, the detection target area being referred to in the detecting step.

A program for causing a computer to function as each means of the detection device according to any one of claims 1 to 17.